As anyone who runs a Web site knows, up to two-thirds of the raw unfiltered traffic is from spiders and robots. Penton's Eric Shanfelt wrote an excellent post on how publishers may inadvertently (or intentionally) inflate their traffic by including these numbers.
We all must struggle with ways to filter out this activity, whether by IP address, user agent, domain, or even usage pattern (hitting the same page on the site 1,000 times in a minute, for example).
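For the curious, here is a rough Python sketch of two of those filters, user agent and usage pattern. It is not how NetTracker or any other product does it. The hit layout (IP, user agent, path, timestamp in seconds), the `BOT_MARKERS` list, and the function names are all my own assumptions for illustration.

```python
from collections import Counter

# Hypothetical substrings that often show up in robot user agents.
BOT_MARKERS = ("bot", "spider", "crawler")


def is_robot_agent(user_agent):
    """True if the user agent contains one of the assumed bot markers."""
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)


def burst_keys(hits, limit=1000):
    """Find (ip, path, minute) buckets with at least `limit` hits.

    Each hit is assumed to be a tuple: (ip, user_agent, path, timestamp).
    """
    counts = Counter((ip, path, int(ts // 60)) for ip, _ua, path, ts in hits)
    return {key for key, n in counts.items() if n >= limit}


def filter_hits(hits, limit=1000):
    """Drop hits from robot-looking agents and from per-minute bursts."""
    bursts = burst_keys(hits, limit)
    return [
        h for h in hits
        if not is_robot_agent(h[1])
        and (h[0], h[2], int(h[3] // 60)) not in bursts
    ]
```

Fixed one-minute buckets are the simplest thing that could work; a burst that straddles a minute boundary can slip through, so a sliding window would be stricter.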
The better software gives us tools to handle this. But even so, I find the spiders always seem one step ahead. We have now begun to see spiders filling out forms on our site. In January, we saw a huge spike in traffic, and before I succumbed to the mutual back-slapping around these hallowed halls about how successful we are, I did a little digging. It turns out 100% of the increase is from direct, non-referral sources; that is, referrals from Google, Yahoo, and MSN are unchanged. I doubt a huge chunk of people would suddenly start typing the URLs of our Web sites into their browsers.
Digging deeper, I saw ample evidence of robotic activity (hitting the same 4-year-old article hundreds of times, for example). What's weird is, when I looked at the IP address range, there was no obvious pattern: there were hundreds of different IP addresses. A whois lookup, however, revealed that they all trace back to the same entity, further confirming my suspicions.
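The kind of digging I did by hand could be roughed out in a few lines of Python. This is only a sketch: the hit layout (IP, path, referrer), the function name, and both thresholds are made up for illustration, and a genuinely popular page would trip it too, so treat the output as a list of things to look at, not a verdict.

```python
from collections import defaultdict


def suspicious_pages(hits, min_hits=100, min_ips=20):
    """Flag pages with many direct (no-referrer) hits from many IPs.

    Each hit is assumed to be a tuple: (ip, path, referrer), where an
    empty referrer means a direct, non-referral request.
    Returns {path: (direct_hit_count, distinct_ip_count)}.
    """
    counts = defaultdict(int)
    ips = defaultdict(set)
    for ip, path, referrer in hits:
        if referrer:  # skip hits that arrived via a referral
            continue
        counts[path] += 1
        ips[path].add(ip)
    return {
        path: (counts[path], len(ips[path]))
        for path in counts
        if counts[path] >= min_hits and len(ips[path]) >= min_ips
    }
```

The next step, checking whether the flagged IP addresses belong to one owner, still needs a whois lookup, which I have left out here.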
Our analytics vendor (we use Unica's NetTracker, which I like better than others I've tried) is struggling to advise me on how to deal with this. It seems to me the best solution is simply to exclude all traffic that doesn't accept cookies, since as far as I have observed, spiders don't accept cookies (today, anyway).
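To make the idea concrete, here is a minimal Python sketch of the cookie rule, along with a way to measure how much traffic it throws away. The `cookie_accepted` field and both function names are my own assumptions about what a log record might look like, not anything NetTracker actually exposes.

```python
def split_by_cookie(hits):
    """Split hits into (kept, dropped) by whether a cookie came back.

    Each hit is assumed to be a dict with a boolean "cookie_accepted"
    key; a missing key is treated as no cookie.
    """
    kept = [h for h in hits if h.get("cookie_accepted")]
    dropped = [h for h in hits if not h.get("cookie_accepted")]
    return kept, dropped


def dropped_share(hits):
    """Fraction of hits the cookie rule would exclude (0.0 if no hits)."""
    if not hits:
        return 0.0
    _kept, dropped = split_by_cookie(hits)
    return len(dropped) / len(hits)
```

Running `dropped_share` over a period known to be mostly human traffic would give a rough idea of how many real visitors the rule costs.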
Unica feels this would be throwing out the baby with the bathwater, but I'm not so sure. You pretty much need cookies enabled to do ANYTHING useful on most Web sites today.
Too severe? Would love to hear your thoughts.
Signed,
Tired of fighting spiders and robots
Thursday, February 08, 2007

1 comment:
Most spiders don't execute JavaScript, so counting only JavaScript-tagged page views should take most of the spider activity out. Google Analytics is free and uses JavaScript tags, so you might think about implementing it alongside NetTracker and benchmarking the two against each other.