Web Analytics Data Sampling 411

Hmmmm. This probably impacts only a minority of web analytics practitioners, normally those with large websites that get millions of page views per month / week / day. I know that there are lots of you out there. :)

When you are generating that much data from your website with your javascript tag based solution, there are a couple of "delightful" problems:

  1. It starts costing you lots of money because most javascript tag based solutions are pay for play (seems fair, it costs them money to collect your data).
  2. Your reports and queries from your web analytics solution start slowing down, especially if you are segmenting the data or looking at vast amounts of history (and most definitely if your vendor has a backend that actually allows you to write custom queries against those massive amounts of data).

It will be a rare vendor that will admit that this challenge afflicts them, but it does.

To deal with the above two problems, the standard operating procedure is to do data sampling.

I find that there is a bunch of confusion about sampling your data and the implications of making that decision (other than that if you sample the data you'll save money).

So here's the 411 on data sampling. There are three primary ways of sampling your data.

  • Code Red: Sampling web pages on your site.
  • Code Orange: Sampling data collected from each page.
  • Code Green: Sampling data processed when you run the query / report.

Here are more details on each option…

[Image: Mountain Dew Code Red]

Code Red: Sampling web pages on your site:

Under this option, either by choice or on advice from your vendor, you add the javascript tag to only some pages on your website.

Typically you might add the javascript tags to a bunch of your busiest pages and forget the rest (CEO: "we should at least track our important pages even if we can't afford to track the site!").

Implications: Perhaps the least palatable of the three options. If you ever want to know anything about pages you might find interesting, it is possible that you have no data at all because those pages were never tagged.

You also probably do not have a complete picture of your website: you forgot to tag page x and your marketing department sent off a million direct marketing emails pointing to that page, or page y got indexed by Google and is attracting a bunch of traffic now, and you have no idea.

Code Orange: Sampling data collected from each page:

Rather than every single page view on your site being collected there are ways to say in the javascript tag code: "just collect every tenth page view" (or every fiftieth or hundredth).

[Image: Oranges]

So when the page loads, only every tenth time will it send data to your vendor. This means less data is collected for the vendor to store and process; when reports run you get that sampled data.

Now in the report you have "lower" numbers than your real numbers, but there is usually some approximation applied (say, multiply every number by ten) to get the "correct" numbers for you.
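
To make the mechanics concrete, here is a rough sketch in TypeScript of what that boils down to. This is purely illustrative, not any vendor's actual tag code; the function names and the 1-in-10 rate are just assumptions for the example:

    // Illustrative sketch only: per-page-view sampling plus the report-side multiplier.
    // Nothing here is a real vendor API; it just simulates the idea described above.

    const SAMPLING_FACTOR = 10; // "just collect every tenth page view"

    // Decide, independently on every page load, whether this page view is collected.
    // A real tag would fire an image request or navigator.sendBeacon() to the
    // vendor's collection server when selected; here we only report the decision.
    function maybeCollectPageView(_pageUrl: string): boolean {
      return Math.random() < 1 / SAMPLING_FACTOR;
    }

    // The reporting side multiplies the sampled count back up to an "approximately ok" total.
    function estimateTotalPageViews(collectedCount: number): number {
      return collectedCount * SAMPLING_FACTOR;
    }

    // Quick simulation: 100,000 real page views yield roughly 10,000 collected hits,
    // which the multiplier scales back to an estimate close to (but not exactly) 100,000.
    let collected = 0;
    for (let i = 0; i < 100_000; i++) {
      if (maybeCollectPageView(`/page-${i % 50}`)) collected += 1;
    }
    console.log(`collected: ${collected}, estimated total: ${estimateTotalPageViews(collected)}`);

Because each page view rolls the dice independently, any one visitor's recorded session will have gaps in it, something to keep in mind if you care about path or funnel analysis.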

Implications: Better than not collecting data at all for some pages on your site. In this case at least you have some representative data for all your pages. Even with the multiplier you are getting an "approximately ok" view of your overall metrics.

For pages that don't get lots of page views (say, beyond your top twenty or so pages), it also means that if you segment the data it might not be of optimal quality.

If you have a choice between red and orange, always choose orange.

Code Green: Sampling data processed when you run the query / report:

All data from your website is collected and stored by your vendor.

But your web analytics application allows you to select the amount of data you want to statistically sample so that your queries run faster and your reports come back quicker even when you hit humongous amounts of data.

To extend the example above, in this scenario you would simply say: "use statistical sampling of 10".

[Image: ClickTracks data sampling options]

Now when the query runs it will use statistical sampling to run the query really really fast and get you close to correct data.

Implications: You collect all the data from your website. If you ever want to wait a loooong time for your query to come back with God's perfect answer, you can.

As in the case of ClickTracks above, it is best if you can choose the sampling level rather than your vendor, because you can fine-tune that sampling in a very white-box way to your own comfort level.

The nice thing about this methodology is that you can sample the data and get reasonably fast results if you are querying massive amounts of data (in terms of months of history or number of users or page views etc).

But if you are querying small segments of the data (say everyone who came from source x, or everyone who visited only these pages and purchased, or just the last week's data, or… you get the idea) then you have the option of saying "don't sample the data, just run the query against all the data" and you will get confident results.
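
Here is an equally rough sketch of the code green idea, again in TypeScript and again purely illustrative (the Hit shape, simpleHash and the samplingFactor parameter are my assumptions, not ClickTracks' or anyone else's actual implementation): every hit is stored, and the sampling decision is made only when the query runs, with the option to not sample at all:

    // Illustrative sketch only: everything is stored, sampling happens at query time.
    // The Hit shape, simpleHash() and samplingFactor are made up for this example.

    interface Hit {
      visitorId: string;
      page: string;
      purchased: boolean;
    }

    // Cheap deterministic hash so the same visitor is always in (or out of) the sample.
    function simpleHash(s: string): number {
      let h = 0;
      for (const c of s) h = (h * 31 + c.charCodeAt(0)) >>> 0;
      return h;
    }

    // Count purchases across the stored hits.
    // samplingFactor = 1  -> "run the query against all the data" (slower, exact answer).
    // samplingFactor = 10 -> "use statistical sampling of 10" (faster, approximate answer).
    function countPurchases(allHits: Hit[], samplingFactor = 1): number {
      if (samplingFactor === 1) {
        return allHits.filter(h => h.purchased).length;
      }
      // Sample whole visitors rather than individual page views, so each sampled
      // visitor's sessions stay intact, then scale the answer back up.
      const sampled = allHits.filter(h => simpleHash(h.visitorId) % samplingFactor === 0);
      return sampled.filter(h => h.purchased).length * samplingFactor;
    }

So for a humongous query (say twelve months of history) you might run it with samplingFactor = 10 and get a fast, close-to-correct number, while for a small segment (last week's traffic from source x) you run it with 1 and wait for the exact answer.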

If your vendor permits, always choose code green over code red or orange, knowing that if you are using an ASP paid javascript tag solution then you'll still have to pay per page view to get this benefit. Or you go free, or go to web log files, or bring data collection in-house (which vendors like ClickTracks, Unica and WebTrends allow you to do).

Your Action Item:

Is your web analytics vendor sampling data? If so, what method are they using? Find out and ensure that you are aware of the implications. Not every company can make the best possible choice between red, orange and green, but at least now you are well informed.

What do you all think? Please share your feedback via comments.

[Like this post? For more posts like this please click here. If it might be of interest, please check out the book.]

PS: For non-American readers: 411 is the telephone number you would dial to get information, such as a phone number, and I have come to learn from my time here that the term is commonly used as in "give me the 411 on that". Wikipedia 4-1-1.

Comments

  1.

    Avinash,

    There's another fundamental difference between orange and green which you didn't mention. With the orange method, you get data from 1/10th of the *page views*. With the green method, at least in ClickTracks (I can't vouch for anyone else), you get data from 1/10th of the *sessions*.

    Why does this difference matter? The green method gives you more accurate results because sessions are what you usually want to analyse. For example, if you want to do some funnel analysis, you have a representative sample of the sessions and you can see how those people progressed through the funnel. If you've used the orange method, the individual sessions have gaps in them, so how are you going to see how people moved through the funnel?

  2.

    Following on from Stephen's comment, some of the 'video playback' solutions now appearing use a cross between your orange and green methods, to selectively record 1-in-N *sessions*. If the random selection in the embedded javascript decides to record you, then everything you do is recorded, providing a complete and accurate record of a session.

    Of course, that's vital for these systems (since recording random pages across sessions is not entirely useful), but the same methodology could be applied to more normal analytics data collection script too.

  3.

    An interesting and thought-provoking post as always, Avinash!

    I agree with Stephen that you need to sample sessions to be able to analyze click streams accurately. Also, I think Phil’s idea is really good – of course a tracking script can be used to sample sessions, not only page views.

    Notice, however, that there is another and even better method in between the orange and green solution. Avinash talks about page view sampling, Stephen and Phil talk about session sampling. But what you really need is visitor or user sampling (which is actually an option in our client-side tracking solution).

    Visitor sampling recognizes and then ignores returning visitors (i.e. cookies) who were not selected for the random sample in the first place. Similarly, it recognizes and then re-selects returning visitors who were selected previously. This is necessary to do if you want to (1) ensure a representative distribution of first-time visitors vs. returning visitors, and (2) keep a complete record of each (randomly selected) visitor’s session history. The latter is of course crucial if you want to study how behavior / click streams changes as the visitor becomes more and more experienced with the website.

    In this way you can have the cake and eat it! You can both save a lot of storage space (i.e. money), have fast query times and obtain very accurate results. You can even introduce dynamic sampling rates to solve the problem of very infrequent events/page views, but this is another loooong story :-)

  4.

    Just a quick note: although one can do tag data collection in-house with WebTrends (a big plus in many cases), one is still subject to the pay by volume of page view model. Can be quite expensive as well.

  5.

    Hi Avinash, I like the way you make this quite dry subject (statistical sampling methods) both "delightful" with images (especially mountain dew) to match and easy to understand with green (go), orange (pause), red (stop) symbolism. To make the complex simple is one of the hardest things to do :) By the way, it is 100 to call the operator in England.

    Today I was running an "intelligent" filter against data so that it would capture not just all the visits from a particular source but also all subsequent traffic that originally was generated from this source even if it came from a different source later. If you could have intelligent code green that would be cool because you could get data every 10th but with a standard deviation or to pick up traffic from different representative visit types to try to get a more representative sample (it could be that if you run it every 10th you have a disproportionally high chance of getting data on a particular visitor type).

    On another note, I was reading some really cool things that MIT social media lab are doing to measure the influence of visitors on myspace by using social network analysis (betweenness/clustering etc) to get a much more accurate picture of true influence/importance of visitors rather than just their actual total number of ties ie friend-links. If we could use social network analysis metrics in mainstream analytics solutions, that would add dimensions that would so give an edge to true business/marketing intelligence and eventual effectiveness. Does anyone have an opinion on this?

    Web Analytics Princess signing out :) (sorry I may have got a bit carried away). If you're not too busy (as if – sorry), my new blog post is all about social networking analysis and I would really really appreciate your critique.

  6. c0t0s0d0 says

    hi avinash,

    one consideration to both the "delightful" problems can be google analytics… doesn't cost a dime :-)

    c0t0s0d0.

  7.

    Dr. Turner: I should stop blogging at midnight. Sorry to overlook something so obvious! Excellent point.

    Christian: You take the cake! Literally. I was not aware that this was even an option. Perhaps the other vendors can find some inspiration in your suggestion and adopt that approach. But not sure if you want them to! :)

    Jacques: I am a bit confused, but if you purchase the software and bring it in-house then for an increased number of page views / sessions all you are paying for is bandwidth or storage (both typically much cheaper than paying a vendor per page view). That is my experience, hence the suggestion.

    Marianina: This sounds very uncool but I don't think any of us know what is going to become of this "social" phenomenon and even less about how to even start to think about thinking of ways to measure it. I suspect something cool will come of it in the next x amount of time, it is hard to guess or anticipate what it will be. For a number of reasons. Until then you and I can have some fun guessing and making stuff up (just look at my how to measure success of your blog posts!).

    c0t0s0d0:

    one consideration to both the “delightful” problems can be google analytics…

    That is a good option for many companies, but for many other companies a different tool could be a best-fit solution. In any case it is important to be aware of how data is sampled and to ask all vendors what they use.

    PS: Dylan Lewis loves riddles, I should ask him to see if he can solve what c0t0s0d0 stands for!! :)

    Thanks everyone (sorry for the delay in replying and the "reply all", it has been a very busy day and the long comment reply is because I am afraid that the cruise ship we are going on might not have internet access!!).

    -Avinash.

  8. anonymous says

    I think you may have forgotten one. I work with a vendor who can apply a percentage governor to the JS lib file and cookie a sample of the UVs hitting the site. If the % is set to 10 then 1/10 of the UVs are tracked. This is a bit different than what you explain, or how I understood your explanation, so I might have missed something.

    It seems closest to the ORANGE method, but I'm not sure since you imply it's page based vs. UV based. The JS code still sends data for every page, but for only a % of the UV traffic based on a cookie identifier. You can even create new metrics based on the sample to get up to almost normal levels by creating a multiplier on UVs or PVs.

  9.

    Very insightful session, and clear that progress has been and continues to be made in this area. That said, there remains a HUGE gap in the dialog. When we talk about how to draw meaningful conclusions from statistically relevant samples of the data, we can't forget about the underlying data quality to begin with. If, for example, the data itself has an inherent error rate which renders it statistically impaired, then no amount of sampling, however precise or sophisticated, will be able to restore statistical significance.

    A lot can and routinely does go wrong at the source which undermines all efforts flowing from it. For example, tags can be missing entirely. If they're not missing, they can be inoperable (e.g., due to syntax errors and the like). If they're operating properly, they can be generating inaccurate variables. If the variables are accurate, the beacon parameters can be wrong. These are not hypotheticals. The incidence of such errors is shocking (error rates average well over 25% for relatively sophisticated practitioners – CODE SUPER BRIGHT RED!). And by the way, this is "systemic" error and not "random" error, so the effects of such poor data quality cannot be made up through quantity.

    The implications are profound. Refining sampling methods is absolutely necessary for many reasons, but in truth, doing so without concurrently addressing the underlying data quality problems is a whole lot like painting over dirt.

  10. c0t0s0d0 says

    PS: Dylan Lewis loves riddles, I should ask him to see if he can solve what c0t0s0d0 stands for!! :)

    c0t0s0d0 = cylinder zero, track zero, slice zero, disk zero (from my solaris admin days :-)

  11.

    Well it's a little unfair to Dylan… :-)

    c0t0s0d0 = cylinder zero, track zero, slice zero, disk zero

    Urm. No actually. :-)

    For starters it should be: c0t0d0s0.
    c == controller
    t == target
    d == drive
    s == slice

    This is a "partition" identifier. Not a position on disk identifier. That would be pointless at this level.

    Target is the SCSI ID.
    When you use the format command, you drop the slice to work with disks and hence set up your slices (or partitions).

    eg:

    # format
    ...
    0. c0t0d0
    /pci@1f,4000/scsi@3/sd@0,0
    1. c0t1d0 sol_8
    /pci@1f,4000/scsi@3/sd@1,0
    2. c0t2d0
    /pci@1f,4000/scsi@3/sd@2,0
    ...

    Three very different 9G SCSI drives on the one backplane. You might even decode some of that from the /pci… address too. :-)

    Speaking as someone who still is a Solaris sysadmin and has been for about 15 years. :-)

    Cheers!

  12.

    c0t0s0d0 : How delightfully clever!!!

    I don't have your real email address but please consider this as my thanks for coming back and helping clear up the mystery. That was kind of you.

    -Avinash.

  13.

    It's useful info, Avinash. Thank you so much!

  14.

    Hi, Avinash.
    We have a number of child websites under our company site, with around 27 million visits per month, and up until now we were only using 1 GA account, 1 web property and many profiles (which filter access to a specific site by the server folder accessed) to track it all.

    Now we are doing a more granular analysis of traffic sources, but the problem is that even when I look into a single day of "New Visits" that come from organic search for a site with just about 100,000 visits per month, GA returns sampled data for my queries. And it's worse, because as I try to compare day-by-day data, the results for a single day are always different from one report to another.

    How can I analyse this kind of data if I can never get a "correct" (or at least consistent) number, or the ones I get are from such a small sample size that they are just about useless for any actionable insight?

    Is there a workaround for this? Or should I consider running a second GA account on the same site just to track this kind of granular data?

    Thank you very much for the help and all your posts!
    Decio

    •

      Decio: The best option for complex GA requests is to hire a GACP (Google Analytics Authorized Consultant) to go through the requirements and validate and recommend the right path.

      You'll find a list here: http://bit.ly/gaac

      In your case there are at least two or three different ways of doing this, depending on your evaluation of the tradeoffs. Please work with a GACP, they can help you.

      -Avinash.

  15. Rafael Galante says

    I have a lot of problems with data sampling every time I add a secondary dimension in GA.
    The worst part is that I never know whether I can trust the data.

    Is there a way that I can run a statistical test to understand when I can trust the numbers? Some kind of rule of thumb?

    Tks!

  16.

    Hi Avinash,

    It has always been a delight reading your posts. They are quite informative & I have learnt a lot from you over the years.

    I am facing a problem with data collection in Google Analytics. My website's hits are multiple times the GA limit of 10 million hits/month. I am evaluating whether to manually set a sampling rate or to continue as is. I have a question – is GA recording all the data right now, or is it collecting only 10 million hits and projecting the rest?

    Any direction in this regard will be extremely helpful.

    Best Regards,

    Vivek

    •

      Vivek: All the data is collected. Then sampled per Google's public guidelines:

      About data sampling | Google Analytics Support

      Here's food for thought….

      You said you are getting multiple times the 10 million hits limit, but let's reduce your site down to just 10 million hits. Somewhere in that massive size is a profitable business model – else what a waste of 10 million hits! If there is a business model, you are far far better off paying for GA Premium so that you can get all the data and a whole lot of sophisticated analytical tools that will allow you to make profitable business decisions faster. It costs $150k per year according to this google query:

      https://www.google.com/search?q=how+much+does+google+analytics+premium+cost

      Finding insights that help you make just 1.5 cents per hit will recover the cost. It is highly likely that with a decent Analysis Ninja you can make a hundred times that – if not more.

      The same benefits would accrue if you decide to pay for Adobe Analytics which is a wonderful tool.

      Bottom-line: With a site that big, saving $150k might be a little like cutting your legs off to run faster. :)

      Avinash.

      • Vivek Singla says

        Hi Avinash,

        Thanks a lot for the clarification. It helps a lot.
        I agree with you on purchasing a more powerful analytics tool. We are evaluating different options for the same.

        Best regards
        Vivek

Trackbacks

  1. […]
    Real-time reporting based on large traffic volumes is the main justification for using this technique, which, to be more precise, comes in different variants that can give it greater prominence (I recommend this post to dig deeper into this topic).

    The obstacle course makes you think that the ideal situation involves using a single tool. Avinash published a very interesting checklist on this subject on his blog a few weeks ago.
    […]

  2. […]
    Standard deviation tells us how much we deviate from the mean, in other words we can interpret it as a confidence level. For example if you have monthly sales, your daily sales will be different every day. Then you can use the standard deviation to calculate how much you deviate from the monthly average. There are two Standard Deviation formulas in Excel that you can use.
    =stdev -when you have sample data -> Avinash Kaushik explains in more details how sampling works https://www.kaushik.net/avinash/web-analytics-data-sampling-411/
    […]

