Data Quality Sucks, Let's Just Get Over It

There is perhaps no other activity that is as much a bane of our existence in Web Decision Making as reconciling web data! Nothing seems to tie to anything, each time you rerun the numbers they change, there are tons of people fanning the flames of cookie deletion and how that causes bad data, arguments around using logging and tagging as data collection mechanisms, and how many cookies each browser can accept anyway, and on and on and on. And like dutiful little hamsters we spin in our wheels round and round and round and round and never make any progress and still…… nothing ties to anything else.

Let me state as emphatically as I possibly can: Data quality on the internet absolutely sucks. And there is nothing you can do about it. At least for now.

The sooner we, especially we, internalize this, the sooner we can all just get over it and the sooner we can move on.

(Oh and it really does not matter what your favorite neighborhood vendor, the $0 one or the $1 million one, says. Pretty much all vendors use the exact same way to collect data. Yes, each vendor probably has some nice little innovation, but they can’t help the fact that the internet is a weird little animal, constantly evolving and changing. That is its inherent beauty and charm, and why the World Wide Web is such a delight. :~) )

Note what I am not saying: I am not saying you should not implement your web analytics tool completely & correctly. I am not saying you should not ensure your campaigns are tagged. I am not saying you should not follow the technical recommendations of your vendor. Do all that. But then move on, don't try to go for 100% when 95% will do just fine!

This post is for those amongst us who want 100% perfection, who have a difficult time accepting that good enough now is better than perfect four weeks from now.

Here are some of the reasons people have actually said to me when it comes to data quality (especially of clickstream data):

  • Cookies “blow”, in many ways
  • This is different and I don’t like it
  • Robots do weird stuff and mess things up
  • Tagging has limitations and no one accepts javascripts
  • I can do it better than vendor x/y, with five minutes of coding
  • Nothing ties to my webtrends and webtrends is right
  • People behave strangely and mess up numbers
  • Everyone is out to get me
  • I miss my mommy

Ok so maybe I was just making up the last one (but not the second from the last one).

Most of the reasons above are true and real, and there are many more that are not on the list (What about Google pre-fetching results? What about an entire sub experience that is now Ajax, which most vendors can barely measure anything about? What about IE7 doing wacky stuff to third party cookies? What about click fraud? What about the impact of CDNs?).

We should simply not expect that data quality will exist on the internet, at least for now. At least not like we are used to in traditional backend systems like ERP and CRM, where almost everything typically does tie again and again. On the World Wide Web there is too much complexity (people, process, rapid change) and data collection is too primitive to ensure accurate data capture and data analysis.

Now I’ll be the first one to admit that your decision makers are not going to let you get by with my lofty proclamation that data quality sucks. And make no mistake it will take time to convince them and win their trust that even though data quality is sub optimal we can still make great decisions.

Here is my recommended six step plan:

    Step 1: Resist the urge to dive deep into the data to find root cause. IMHO this is a time consuming and futile exercise. And by the time you do figure out some semblance of an explanation there are more new reasons why data won't tie (at least at a macro level). Especially if you are off by less than 10%.

    Step 2: Assume a level of comfort with the data. When presented with data understand clearly how it is collected and then encourage yourself, and your decision makers, to assume a level of comfort with the data. Say you look at something and you could say I only trust it 80%. The other guy might say 75% and perhaps the nice lady might say only 70%. That is ok.

    Human beings are complex and each a collection of its life experiences so we will each decide differently. That is ok, but by trial and error arrive at a comfort level with the data. It is now your friend, different from Step One where it was out to get you fired.

    Step 3: Start making decisions that you are comfortable with. This step takes the most courage, it is the biggest leap of faith. But trust me after this it is all fun. I can look at your table of data and with just 70% comfort level I can make business decisions.

    If important KPI ONE changed by X% or YYY,YYY numbers with 100% confidence I would have spent $90k on the next campaign or completely changed the site architecture or built a new checkout process etc. But with only 70% confidence in that KPI change I can still decide that I will only spend $60k or run a multivariate test before I change site architecture to get more confidence in data or I’ll still build a new checkout process because I only need 70% confidence in the data because checkout is so important.

    This example is a simple illustration that it is possible to make decisions with less than 100% confidence in the data. Encourage that behavior and it is ok if you trust the data more than your decision makers, they will come around with time. It is important that you model this behavior, if in your gut you find it hard to make this leap of faith, it will be monumentally harder for your decision makers to.

    Step 4: Over time drill deeper in micro specific areas & learn more. I love data detective work, oh the joys of trawling through one terabyte of data looking for answers!! Once you get into the groove of making decisions, rather than being paralyzed by data quality not being good, I recommend that you find small narrow niches of data segments to drill into. The goal will be to understand why data for that narrow niche might not be what you expect.

    So you could take all the traffic from a particular referring url or a particular search key phrase or all your email marketing traffic in a week or everyone who saw a particular page or….. for each one of us our poison will be different. By narrowing your focus you’ll reduce the number of distractions, increase the chances of isolating causality and start to understand better your complex website ecosystem.

    Step 5: Get more comfortable with data and its limitations over time. As you understand data better over time (data collection, storage, manipulation, processing and analyzing) you'll make the appropriate adjustments in your interpretation and quest for Web Insights. This in turn will increase your comfort level of data over time, from 70% to 73% to 78% to 81% and on and on. While you will perhaps never get 100% confidence you’ll start making significantly more confident decisions.

    Aim for small increments of improvement in comfort and confidence levels for yourself, and reduce that increment by 50% for your decision makers (this is much harder for them).

    Step 6: Consistency in calculations = Good. On the web: Absolute numbers rarely matter, segmented trends do. This is important to remember, especially if you find that you are unable to get beyond Step One. The quest to get an absolute number right is especially futile because of all the reasons discussed above. As long as you stay consistent and look at trends *and* important segments in those trends you will reduce the chances of making sub optimal decisions, even if there is a small delta in data quality. (If you have not had a chance I recommend my post on segmentation.)
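
To make Step 6 a little more concrete, here is a minimal sketch of why segmented trends survive imperfect data. Everything in it is made up for illustration: the segment names, the visit counts and the capture rates are assumptions, and the key assumption is that the tool misses a roughly constant share of visits within each segment. Under that assumption the week-over-week trend you measure matches the real one, even though every absolute number is wrong.

```python
# Hypothetical weekly visits per segment: the "true" values we can never observe.
true_visits = {
    "search": [1000, 1100, 1210],
    "email": [400, 380, 361],
}

# Assumption: the tool consistently captures only a fixed share of each segment.
capture_rate = {"search": 0.85, "email": 0.92}


def pct_change(series):
    """Week-over-week percent change, rounded to one decimal place."""
    return [round(100 * (b - a) / a, 1) for a, b in zip(series, series[1:])]


def measured(series, rate):
    """What the tool would report if it captured a constant share of visits."""
    return [value * rate for value in series]


for segment, series in true_visits.items():
    seen = measured(series, capture_rate[segment])
    print(segment, "true trend:", pct_change(series), "measured trend:", pct_change(seen))
```

If the capture rate itself drifts from week to week the trends stop matching, which is one more reason to stay consistent in how you collect and calculate.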

There are at least two cases where Data Quality is dicey, but deserves special love.

  1. Switching from one analytics tool to another: My recommendation is rather than reconcile run the two in parallel for four to eight weeks and simply benchmark the differences in key metrics between the two. Then create a multiplier and use that if you want to do historical trend comparison.

    So you are going to replace Omniture/WebTrends/HBX/CoreMetrics with ClickTracks/Google Analytics/WebTrends/Omniture, run the two in parallel, note that Visitors from your old platform are always 15% greater than the new one. Use that multiplier for old data trend comparisons.

  2. Cart & Checkout process analysis: You want a great degree of precision here because of the amount of money on the line. If you want to spend time reconciling, this is the place to do it. JavaScript tags are a sub optimal way to collect this data. If your platform allows, use something like what ATG allows for: Event Logging. Each time someone is in this process it precisely captures the data from the server (not the pages) along with the business context. This makes a powerful set of data to analyze for key insights.
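
For the first case (switching tools), the benchmark-and-multiplier idea can be sketched roughly like this. The weekly Visitors numbers are hypothetical, the function names are mine, and the sketch assumes the gap between the two tools is reasonably stable over the parallel run. It is one simple way to do it, not the only one.

```python
from statistics import mean

# Hypothetical weekly Visitors during a six-week parallel run of both tools.
old_tool = [11500, 12075, 11730, 12420, 11960, 12650]
new_tool = [10000, 10500, 10200, 10800, 10400, 11000]


def benchmark_multiplier(old, new):
    """Return the average new/old ratio and how much that ratio varies."""
    ratios = [n / o for o, n in zip(old, new)]
    return mean(ratios), max(ratios) - min(ratios)


def to_new_scale(old_value, multiplier):
    """Restate a historical number from the old tool on the new tool's scale."""
    return old_value * multiplier


multiplier, spread = benchmark_multiplier(old_tool, new_tool)
print(f"multiplier: {multiplier:.4f}, spread across weeks: {spread:.4f}")

# Restate an old-tool historical figure so it can sit on the same trend line.
print(round(to_new_scale(23000, multiplier)))
```

If the spread across weeks is large, a single multiplier is probably too crude and it may be worth benchmarking by segment instead.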

Agree? Disagree? Can javascript tag based or web log based data be of better quality? Would you recommend more/different steps to account for sub optimal clickstream data quality? Please share your feedback via comments.

Update: Here are two articles that you might find to be of value when dealing with data quality challenges…

Good luck!


  1. 1

    Hi Avinash -

    Not sure I have the ability to comment on the technical aspects of your post, but as you said the key is to recognise that data quality is never optimal…once you know this, what I say to clients, and as a decision maker myself, is that I'd rather know NOW 70 or 80% of the truth rather than 100% (knowing that truth is a "relative" concept of course), to take a decision… Too many "technically oriented people" forget this (I was probably too much like this in the past, my initial training is in computer science as well), and at the end, the key is to decide and learn over time…specifically from mistakes…At a time where marketers finally recognize that a lot of consumer decisions are "irrational", data is good but so is Intuition…data dive helps shape the right intuition…to make the right decision…


  2. 2

    I couldn't agree more. I can't tell you how many times our clients ask us to compare statistics from Analytics Package A to Analytics Package B. It's a time consuming process that usually ends in failure. We try to stress two things to people:

    1. Look at the trends in your data
    2. Make one change at a time so you can accurately measure the results.

    Measuring change to your website is a scientific, controlled experiment. If you start with a baseline statistic and implement controlled change then there will be little doubt as to the results of that change. It doesn't matter if your baseline is off by 2, 5 or 10 percent from the actual numbers. If you implement a single change then there will be little doubt of its effect when you're looking at the trends in your data.

    I think Caleb touched on this in his comment a couple of weeks ago:

    Start with benchmarks, make a report that can be easily shared throughout the organization. Host a meeting to talk about it, then pick one area to take action on that will show some real results and give the credit to the people who actually made the changes (the IT, developers, designers, etc…)

  3. 3

    Excellent post. I wish that more people outside the analytics community understood that you are never going to get two different tools to give you the same numbers, and that your time is much better spent doing analysis.

  4. 4
    Andy Choi says:

    Insightful post! Just wonder if you have tried any of the data quality tools from the enterprise software world?

  5. 5
    Maggie says:

    Hi Avinash,

    Thanks a lot for another wonderful article.

    When I apply web analytics to our GA data, I find that some of the analysis results don't quite make sense to me, e.g. 80+% of site users convert within a day and 60+% of site users convert with only one visit. Does it mean we are an excellent site? However, we are a site with only a 1+% conversion rate and a 50% bounce rate.

    Therefore, I am wondering if the GA data as a whole is not set up accurately, since we have a huge bunch of profiles, 50ish. Do you have any good suggestions for how I can test my guess and verify that the data in GA is good quality? (I know it is sampled, so it might not be exactly the same as our internal tracking.)

    Thanks a lot



  1. [...] Second, Avinash Kaushik has a fantastic blog post titled Data Quality Sucks, Let's Just Get Over It concerning the quality of analytics data and how to make informed decisions based on your comfort level with the data. He stresses that you will not get 100% accurate data but that doesn't matter. You can still make informed decisions with the data that your analytics solution provides. It is a really great post and I highly recommend giving it a read. [...]

  2. [...] We all know that web analytics data quality sucks and we're all OK with that. Thanks Avinash! But there comes a point when the data quality is just abysmal. This usually means that there is something wrong with the web analytics application configuration. [...]

  3. [...] Data Quality Sucks, Let’s Just Get Over It » Occam’s Razor by Avinash Kaushik "We should simply not expect that data quality will exist on the internet, atleast for now." Nice list of "my butt hurts!" type excuses… (tags: via:mrchippy data integration databases web2.0 ajax stats) [...]

  4. [...]
    I’m hearing it all over. There’s a new day on the horizon, a day when we in the web world recognize that none of this is really an exact science anyway, so why pretend?

    Enough with the weighted regressions and Taguchi Methods already. It’s time to take the anal out of analysis. Instead of Web Analytics, I propose Web Casualytics. Or Fuzzylytics maybe. Or Estimytics.

  5. Blogtipping Day: fixing the world one data quality problem at a time…

    If you slapped the words "data quality" on a hotdog stand someone would offer to buy it before the end of the day. Data quality tools are hot properties so I am BlogTipping three vendor blogs….

  6. [...] Sampling, sampling, sampling. My friend Avinash Kaushik does a good job covering issues around sampling in web analytics. I recommend reading his post! I'll add that statistical methods applied to web analytics data are completely valid; however, sampling at the site, page, or database level opens the possibility that you miss key data. Sites looking for the "long tail" of visitors and using data sampling or other data trimming methods may not find it. [...]

  7. [...] First off, it is largely a waste of time to try to completely reconcile data from two different web analytics tools. This post really isn’t about that. Mark Twain, Lee Segall, or perhaps someone else coined the saying, “A man with one watch knows what time it is; a man with two watches is never quite sure.” The same is true for web analytics. Thanks to different data capture methods, different data processing algorithms, different data storage schemas, and different definitions, no two tools running concurrently will ever report the same results. The good news, though, is that most tools will show very similar trends. WebTrends preaches, “in web analytics, it’s the trends that matter — that’s why it’s part of our name!” But, even in the broader web analytics community, this is widely accepted. Avinash Kaushik had a great post titled Data Quality Sucks, Let’s Just Get Over It way back in 2006, but it still applies. Read more there! [...]

  8. [...] A lot of skepticism around analytics centers around the fact that the numbers are never perfect. Cookie deletion, cookie rejection, turning off javascript can all skew numbers. True. But, let’s heed Avinash Kaushik’s advice and get over it. Why? Because it doesn’t matter. Why doesn’t it matter? Because it’s *off* consistently across your site. That’s why trends are so important. You can argue that the statement, “I got 1,500 visits last month” may or may not be perfectly true. You can’t argue, however, the truth to the statement, “our visits are up 10% over last month.” See the difference? The 1,500 number doesn’t mean anything anyway, even if it were a concrete, true, golden number. Putting it in context with the previous month (or other date range) makes all the difference. You can do the same thing with all your other metrics. [...]

  9. [...] Different web analytics packages give different results, click data from media agencies and ‘banner farmers’ never matches the data you record yourself, and measured conversions do not match the actual cash register receipts. In the fight against the technology, web analysts actually threw in the towel long ago; when asked, they quote guru Avinash, ‘Data quality sucks, let’s just get over it!’, try to salvage what can be salvaged (‘Make data quality suck less’) and keep a bag full of ammunition ready to deflect the all-important question of blame: [...]

  10. [...]
    Aside from technical issues that actually skew the results, conversion tracking consistently returns results which differ from other stats sources, such as say Google Analytics.

    Don’t worry, this isn’t a source for panic. It’s widely accepted that pretty much all stats packages are inaccurate to some extent or another. The data you are gathering is still very useful.

    There are a couple of ways to get the best value from your stats if they are coming from a few different sources. They both mean deciding on a baseline:

  11. [...]
    Avinash Kaushik, Analytics evangelist at Google and the man behind the books Web Analytics 2.0 and Web Analytics: An Hour A Day, stated as early as 2006 that data quality was often misleading and would generally continue to be so. His blog post “Data Quality Sucks, Let’s Just Get Over It” is still relevant reading. Although debates about the accuracy of data quality may be of interest in certain contexts, there is little point in getting unduly hung up on it.

  12. [...] I like what Web analytics guru Avinash Kaushik said on this subject: “Data quality on the internet absolutely sucks. And there is nothing you can do about it…Now I’ll be the first one to admit that your decision makers are not going to let you get by with my lofty proclamation that data quality sucks. And make no mistake it will take time to convince them and win their trust that even though data quality is sub optimal we can still make great decisions.” (from Data Quality Sucks, Let's Just Get Over It) [...]

  13. [...]
    Don’t worry, this isn’t a source for panic. It’s widely accepted that pretty much all stats packages are inaccurate to some extent or another. The data you are gathering is still very useful.

    There are a couple of ways to get the best value from your stats if they are coming from a few different sources. They both mean deciding on a baseline:

  14. [...]
    Date comparisons are great and allow you to see trends over time. As web analytics we care about trends rather than absolute values, right?

    However in Google Analytics, by default you are comparing your selected date range to a date range in the past, but there are cases where you’d want to compare the data to a range in the future. It’s basically a different perspective, similar to saying that 20 is 100% larger than 10 (basis 10), as opposed to 10 being 50% smaller than 20 (basis 20).

  15. [...]
    Something that should be said more often: web analytics is not an exact science! See Jim Sterne: It’s Not an Exact Science, or Avinash Kaushik: Data Quality Sucks, Let’s Just Get Over It. The former is from November 2012, the latter from June 2006! Nothing has changed in the last 6 years, and that is due to the nature of the matter. The important part: “But then move on, don’t try to go for 100% when 95% will do just fine!”
