This probably impacts a minority of web analytics practitioners, normally only those with large websites with millions of page views per month / week / day. I know that there are lots of you out there. :)
- Your reports and queries from your web analytics solution start slowing down, especially if you are segmenting the data or looking at vast amounts of history (and most definitely if your vendor has a backend that actually allows you to write custom queries against those massive amounts of data).
It will be a rare vendor that will admit that this challenge afflicts them, but it does.
To deal with the above two problems the standard operating procedure is to do data sampling.
I find that there is a bunch of confusion about sampling your data and the implications of making that decision (other than that if you sample the data you'll save money).
So here's the 411 on data sampling. There are three primary ways of sampling your data.
- Code Red: Sampling web pages on your site.
- Code Orange: Sampling data collected from each page.
- Code Green: Sampling data processed when you run the query / report.
Here are more details on each option.
Code Red: Sampling web pages on your site:
Implications: Perhaps the least palatable of the three options. If you ever want to know anything about pages you might find interesting it is possible that you have no data.
You also probably will not have a complete picture of your website. For example, you forgot to tag page x and your marketing department sent off a million direct marketing emails pointing to that page, or page y got indexed by Google and is attracting a bunch of traffic now and you have no idea.
Code Orange: Sampling data collected from each page:
So only every tenth time the page loads will it send data to your vendor. This means less data is collected for the vendor to store and process, and when reports run you get that sampled data.
Now in the report you have "lower" numbers than your real numbers, but there is usually some approximation applied (say, multiply every number by ten) to get the "correct" numbers for you.
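To make the multiply-by-ten idea concrete, here is a minimal sketch of it in Python. It is not any vendor's actual tag or reporting code: the function names, the per-page counter, and the traffic numbers are all made up for illustration, and real tags often sample randomly rather than on an exact count.

```python
from collections import Counter

SAMPLE_EVERY = 10  # assumed rate: send data on every tenth load of a page


def collect(page_loads, sample_every=SAMPLE_EVERY):
    """Simulate the tag: each page reports only every Nth time it loads."""
    loads_seen = Counter()
    sent = Counter()
    for page in page_loads:
        loads_seen[page] += 1
        if loads_seen[page] % sample_every == 0:
            sent[page] += 1
    return sent


def report(sent, sample_every=SAMPLE_EVERY):
    """Scale the sampled counts back up by the sampling factor."""
    return {page: hits * sample_every for page, hits in sent.items()}


# Made-up traffic: 1,000 loads of /home and 57 loads of /pricing.
page_loads = ["/home"] * 1000 + ["/pricing"] * 57
estimates = report(collect(page_loads))
print(estimates)  # {'/home': 1000, '/pricing': 50}
```

In this toy run the busy page comes out exactly right, while the quieter page is reported as 50 page views instead of its real 57.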
Implications: Better than not collecting data at all for some pages on your site. In this case at least you have some representative data for all your pages. Even with the multiplier you are getting an "approximately ok" view of your overall metrics.
For pages that don't get lots of page views (say beyond your top twenty or so pages) it also means that if you segment data it might not be of optimal quality.
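A rough way to see why the quieter pages suffer is to estimate how far off a scaled-up number is likely to be. The sketch below assumes each page view is kept independently at random with the given probability (an assumption for illustration, since vendors differ in how they sample) and uses the standard binomial approximation for the relative error of the scaled-up count. The function name and the traffic figures are hypothetical.

```python
import math


def relative_error(true_page_views, sample_rate):
    """Approximate relative standard error of a scaled-up count,
    assuming each page view is kept independently with probability sample_rate."""
    kept = true_page_views * sample_rate  # expected number of sampled page views
    return math.sqrt((1 - sample_rate) / kept)


# Made-up page sizes, all sampled at 1 in 10.
for page_views in (1_000_000, 10_000, 50):
    err = relative_error(page_views, sample_rate=0.1)
    print(f"{page_views:>9,} page views -> typical error about {err:.1%}")
```

Under these assumptions a page with a million page views is typically off by about 0.3%, one with ten thousand by about 3%, and one with fifty by more than 40%. Segmenting the data shrinks the counts further, which is why segments of small pages are the first to lose quality.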
If you have a choice between red and orange, always choose orange.
Code Green: Sampling data processed when you run the query / report:
All data from your website is collected and stored by your vendor.
But your web analytics application allows you to select the amount of data you want to statistically sample so that your queries run faster and your reports come back quicker even when you hit humongous amounts of data.
To extend the example above, in this scenario you would simply say: "use statistical sampling of 10".
Now when the query runs it will use statistical sampling to run the query really really fast and get you close to correct data.
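Here is a minimal sketch of what sampling at query time could look like, assuming "sampling of 10" means reading a random 10 percent of the stored rows. The `run_query` function, its `sample_percent` parameter, and the visit data are all hypothetical; this is not how any particular vendor implements it.

```python
import random


def run_query(rows, predicate, sample_percent=100, seed=0):
    """Count rows matching predicate, reading only sample_percent of the stored rows."""
    if sample_percent >= 100 or not rows:
        return sum(1 for row in rows if predicate(row))  # exact answer, slower
    sample_size = max(1, len(rows) * sample_percent // 100)
    sample = random.Random(seed).sample(rows, sample_size)
    matches = sum(1 for row in sample if predicate(row))
    return round(matches * len(rows) / sample_size)  # scale back up


# Made-up stored data: 100,000 visits, 20% of them from search.
visits = [{"source": "search" if i % 5 == 0 else "direct"} for i in range(100_000)]
from_search = lambda v: v["source"] == "search"

exact = run_query(visits, from_search)                      # reads every row
approx = run_query(visits, from_search, sample_percent=10)  # reads 10,000 rows
print(exact, approx)
```

The key difference from the other two methods is that all the data is still there: the same query can be run again with `sample_percent=100` whenever the exact answer is worth the wait.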
Implications: You collect all the data from your website. If you ever want to wait a loooong time for your query to come back with God's perfect answer, you can.
As in the case of ClickTracks above, it is best if you can choose the sampling level rather than your vendor, because you can fine tune that sampling in a very white box way to your own comfort level.
The nice thing about this methodology is that you can sample the data and get reasonably fast results if you are querying massive amounts of data (in terms of months of history or number of users or page views etc).
But if you are querying small segments of the data (say everyone who came from source x, or everyone who visited only these pages and purchased, or just the last week's data, or… you get the idea) then you have the option of saying "don't sample the data, just run the query against all the data" and you will get confident results.
Your Action Item:
Is your web analytics vendor sampling data? If so, what method are they using? Find out and ensure that you are aware of the implications. Not every company can make the best possible choice between red, orange and green, but at least now you are well informed.
What do you all think? Please share your feedback via comments.
PS: For non-American readers: 411 is the telephone number you would dial to get information such as a phone number, and I have come to learn from my time here that the term is used commonly as "give me the 411 on that". Wikipedia 4-1-1.