Web Data Quality: A 6 Step Process To Evolve Your Mental Model

It seems absolutely dumb to argue that while the quality of data used to make decisions is important, it is actually not that important to have the highest data quality.

Generations of Analysts, Data "People", and Decision Makers have grown up with the principle of GIGO: garbage in, garbage out.

It made a lot of sense for a very long time. Especially because we used to collect so little data, even a small lapse in quality crapified the resulting decisions a lot.

GIGO also fueled our ever-expanding quest for data perfection and data quality. There are entire companies built around helping you "clean up" your data, especially in the offline, traditional business intelligence, ERP, CRM, and data warehouse worlds.

The web unfortunately threw a big spanner into the works.

There are a couple of important reasons.

First, it is important to realize that we collect a lot of data on the web (types of data, elements of data, what not).

Second, our beloved world wide web, remember, still a little baby, is imperfect at every turn. We use data collection methodologies that reflect our efforts to do the best we can, but they are inherently flawed. Just take JavaScript as an example. It is good at what it does. But not everyone has JavaScript turned on (typically around 2-3% of visitors have it off). Zing: imperfection.
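To make that concrete, here is a back-of-the-envelope sketch (Python, with entirely hypothetical numbers) of how a known JavaScript-off rate bounds your true traffic:

```python
# Back-of-the-envelope bound on true traffic when a JavaScript tag
# misses the visitors who have JavaScript turned off.
# All numbers here are hypothetical.

reported_visits = 100_000                        # what the tag-based tool reports
js_off_low, js_off_high = 0.02, 0.03             # assumed 2-3% of visitors have JS off

# If JS-off visitors are invisible to the tag, true traffic is roughly:
true_low = reported_visits / (1 - js_off_low)    # ~102,041
true_high = reported_visits / (1 - js_off_high)  # ~103,093

print(f"True visits: between {true_low:,.0f} and {true_high:,.0f}")
```

The punchline: a known 2-3% collection gap moves the estimate by a couple of percentage points, which will rarely change the decision you make.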

A lot of data. Imperfect data collection system.

Here is the most common result of this challenge: The "Director of Analytics" spends her meager resources in the futile quest for clean data.

Money is spent on consultants (especially the "scaredy cats" who deftly stir this issue to favor their personal businesses). Everyone tries to reconcile everything across systems and logs. Omniture gets kicked out and WebTrends gets put in, supposedly for its "far superior" data quality (!!).

Makes me sad.

In the debate over perfect data it is important to realize that the reality is a lot more nuanced.

No Possible Complete Data on Le Web.

I humbly believe that the world of data perfection ("clean auditable data") does not exist any more. It did for a long time because life was cleaner, mistakes were human-made, sources were fewer, and there wasn't enough data to begin with (sure, terabytes of it, but of what, 300 fields? 600?).

On the web we now have too many sources of data. Quantitative, qualitative, hearsay (sorry, surveys :), competitive intelligence, and so much more. [Web Analytics 2.0] But these sources are "fragile".

Sometimes because of technology (tags / cookies / panels / ISP logs). Sometimes because of privacy reasons. Sometimes because we can't sample enough (surveys, usability tests). Sometimes because it is all so new, we don't even know what the heck we are doing and the world is changing too fast around us.

Killing the Holy Cows.

The old-timers who did BI (me for sure, maybe you?) and moved to the web have had to come to the realization that the old rules of making decisions are out the door. Not just because the mental model of what now counts as "data" has changed, but also because what counts as "decisions" has changed, and the pace at which those decisions need to be made has changed. It took companies a long time to die in the past. That process happens at "web speed" now.

Given all that, if I don't change, I'll become a hurdle to progress. If I don't change, I can't help my company make the kind of progress it should.

You need to fundamentally rewire your brain, like I have had to rewire mine (it was painful): The data is not complete and clean, yet there is more data of more types, and it contains immense actionable insights.

If you would only get over yourself a little bit.

So how to do this if you really do want to be God's gift to web analysis?

The Six Step Soul Cleansing Process.

Based on my own personal evolution in this space, I recommend you go through the following six step cleansing process to ensure that you are doing this right and move beyond the deeply counterproductive data obsession.

1) Follow best practices to collect data, don't do stupid stuff.

2) Audit your data periodically to ensure you are collecting as complete a data set as possible (and as accurately as possible, #1).

3) Only collect as much data as you need: There is no upper limit to the amount of data you can collect and store on the web.

4) Ditch the old mental model of Accuracy, go for Precision (more here: Accuracy, Precision & Predictive Analytics). It might seem astonishing but your analysis will actually get more accurate if you go for precision.

5) Be comfortable, I mean really freaking comfortable, with incompleteness and learn to make decisions.

6) [In context of decision making] It used to be Think Smart, Move Fast. I think the next generation of true Analysis Ninjas will: Move Fast, Think Smart. Remember there is an opportunity cost associated with the quest for perfection.

Web Data Quality Cycle

An example of #1: if you are using third-party cookies in your web analytics tool, be it Omniture or Coremetrics or WebTrends, then you deserve the crappy data you are getting. For #2, use website scanning tools to ensure complete implementation; each vendor has their own, just ask. #3 is the reason most attempts to data warehouse web analytics data end up as massive, expensive failures, or why you get trapped constantly "mowing the grass".

You are not going to believe me, but per #4, if you go for precision your analysis will actually get more accurate over time (whoa!).
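If you want to see why, here is a toy simulation (Python, made-up numbers): a tool that consistently undercounts by 10% is biased but precise, and that bias cancels out of every trend calculation.

```python
import random

random.seed(42)
true_visits = [10_000 * 1.05 ** week for week in range(8)]   # steady 5% weekly growth

# Tool A: unbiased on average, but noisy (+/-10%): "accurate" yet imprecise.
tool_a = [v * random.gauss(1.0, 0.10) for v in true_visits]
# Tool B: consistently undercounts by 10%: biased, but precise.
tool_b = [v * 0.90 for v in true_visits]

# The consistent bias cancels in any ratio, so tool B nails the growth rate.
growth_b = [tool_b[w + 1] / tool_b[w] - 1 for w in range(7)]
print([round(g, 3) for g in growth_b])   # 0.05 every week

# Tool A's week-over-week "growth" swings around even though it is unbiased.
growth_a = [tool_a[w + 1] / tool_a[w] - 1 for w in range(7)]
print(f"Tool A growth swings from {min(growth_a):.1%} to {max(growth_a):.1%}")
```

The precise-but-biased tool gives you the 5% weekly growth exactly; the unbiased-but-noisy one buries the trend in noise.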

#5 is the hardest thing for Analysts (and for many Marketers) to accept, especially those that have been doing data analysis in other fields. They are simply not comfortable with 90% complete data. Or even 95%. They work really, really hard to get the other 5% because without it they are unable to accept that they could make business recommendations. Sometimes this is because of their mental model. Sometimes it is because the company is risk averse (not the Analyst's fault). Sometimes it is out of a genuine, if misplaced, desire to give the perfect answer.

Of course, the net result is that lots of data collection, processing, and perfection exercises happen while the business is starved for the insights needed to make even the most mundane decisions. I have had to lay off Analysts who simply could not accept incompleteness and had to have data that was clean and complete. Very hard for me to do.

#6 is a huge challenge because it requires an experience that most of us don't possess: having been there, having worked in companies that plug us into the tribal knowledge and context. Instead we work in massively multi-layered bureaucracies in large companies. In my heart of hearts I believe, sadly, that it will take a new generation of Analysts and a new generation of leaders in companies. Still we must try, even as I accept the criticism that the 10/90 rule is not followed and that we don't have enough Smart Analysts.

So: follow best practices to collect as complete a data set as possible, go for precision so you can look beyond the incompleteness, and you will end up moving fast while thinking smart.

Before You Jump All Over Me and Yell: Heretic!

Notice what I am not saying.

I am not saying make wrong decisions.

I am not saying accept bad data.

I am not saying don't do your damndest to make sure your data is as clean as it can be.

What I am saying is that your job does not depend on data with 100% integrity on the web. Your job depends on helping your company Move Fast and Think Smart.

I am also not saying it is easy.

Reality Check:

We live in the most data rich channel in the universe. We should be using data to find insights, even if the numbers are a little bit off the perfect number.

Just consider this.

How do you measure the effectiveness of your magazine ad? Now compare that to the data you have from DoubleClick. How about measuring the ability of your TV ad to reach the right audience? Compare that with measuring reach through Paid Search (or Affiliate Marketing or …..). Do you really think Nielsen's TV panel of between 15k – 30k US residents does a better job of representing the diversity of TV content consumption of 200 million TV-watching Americans?
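For perspective, here is the best-case sampling math on a panel that size (Python sketch; the 30,000-person panel and the 10% viewing share are assumed numbers for illustration):

```python
import math

n = 30_000        # assumed panel size
p = 0.10          # assumed true viewing share of a show

# Standard error and 95% margin of error for a simple random sample.
se = math.sqrt(p * (1 - p) / n)
moe_95 = 1.96 * se
print(f"95% margin of error: +/- {moe_95:.2%}")   # about +/- 0.34 points
```

And that figure is only the sampling error under ideal, random-sample assumptions; it says nothing about whether those 30,000 households actually represent 200 million viewers.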

faith based initiatives

There is simply no comparison. So why waste our lives trying to get perfect data from our web sites and online marketing campaigns? Why does unsound, incomplete, and faith based data from TV, Magazines, and Radio get a pass? Why be so harsh on your web channel? Just because you can collect data here, will you do nothing simply because it is imperfect?

Parting Words of Wisdom:

Stuart Gold is a VP at Omniture. Here's a quote from him:

"An educated mistake is better than no action at all."


The web allows you to make educated mistakes. Fast. With each mistake you become smarter. With each mistake your next step becomes more intelligent.

Make educated mistakes.


OK, now it's your turn.

What do you think of the web data quality issue? What are the flawed assumptions I have made in making my recommendation above? How do you ensure your data is as complete and as precise as it can be? Got tools or horror stories to share? What is the next data collection mechanism on the horizon that will be our salvation on the web?

I look forward to your comments and feedback. Thanks.



  1.

    Things are starting to pick up at work in the evangelism area. I put out an analysis last week to prove that we were cannibalizing our branded keywords through a PPC campaign. Folks were shocked at the level of detail available through Google Analytics, and dismayed that it's no one's "job" to look at the data. I'm making it mine :D

    Thanks for another great post. Keep it up!

  2. Scott Oliver says

    This post is right on, Avinash. I started my web analytics work with a TV network that started a web site to broaden its appeal. Being used to measuring performance with the Nielsen TV sample data you mention, the expectation of HIPPOs was that in the online world, everything was measurable and the data would be 100% complete and accurate.

    Those of us in the WA biz know better by now. According to my experience (and Google Analytics University :), the expectation should be that WA data is best used for trending analyses – which is an excellent way to make better business decisions.

  3.

    I think you're exactly right. The whole concept of 100% accurate data is a fool's errand. It's impossible to achieve and offers no real benefit.

    The worst is when managers start piling one tool on top of another because they want to "compare them for inaccuracy". I pull out my hair and ask, "but if we already know they're all inaccurate how can we compare them?!"

  4.

    Avinash, that's a very important recommendation ("Incomplete data is OK; spending too much time on getting complete data is not OK").

    Thanks for reminding about it.

  5.

    “It’s better to be approximately right than exactly wrong” – Edward Tufte

  6.

    This was a really interesting topic, however it is generally overlooked. I think companies are happy that they finally have a tool to measure traffic and don't bother with quality after all. There are enough problems already with reporting and deadlines.

    Also, another (probably very stupid) topic: there should always be some explanation (even of the obvious) accompanying charts and analysis. I think most of us struggle with writing text about charts and already-visualized data. Do you have any stories/thoughts about this?

  7.


    Great post. We are always struggling with the "but this isn't everything from the web" question in our projects.

    Isn't collecting the 90% of all conversation from the top 20 (or so) sites (that we know are real and not link farms) better than collecting 100% of all the conversation by including the next 20,000 sites, many of them of dubious origin and purpose?

    We are looking for enough information to base decisions on – not all information.

    MotiveQuest LLC

  8.

    I work at a smaller firm, so 3-5% sometimes turns out to be 5 to 10 visits on some of our client's sites.

    Help me connect the dots… I've read all of the recommended posts you've listed above, and I still have people *freaking out* when there is one small issue with quality.

    One instance: using the Visits With Conversions segment in GA, one referrer showed a bounce rate… it became a big issue because "There cannot be a bounce when someone converted. By definition that means the visit went more than one page!" and on and on. That is someone I work with, who ended up being really gracious after I searched high and low for an answer and came back empty-handed.

    However, clients have no such mercy, and I tell you what, one little "mistake" in the Analytics always, without question, throws their entire faith in Analytics down the toilet. We try to manage expectations (we have measurable data, but remember it's not 100% accurate). Still, the 3% error becomes 100% error in their minds. I would ask you (or any other readers) for help in knowing what to say. I have tried the advice you posted here, but I still cannot seem to allay some people's mistrust.

  9. Dan Epley says

    I've made the mistake of not auditing my data (#2). I was using a JavaScript analytics tool and, while making some changes, screwed up the code on my conversion page. So although I was getting data on all the other pages, no conversions were showing up. The lesson: clients aren't very happy if you can't show them conversions, ouch!

  10.

    As a general philosophy I would agree. It seems that what is most important is being "directionally" correct.

    However, I hope web analytics software makers understand the point on precision and consistency. They should seek precision in all of their features at all times. Features without precision are not features, only distractions.

    When challenging skeptical HIPPOs, the last thing I need is to have my data seem possibly inconsistent or unreliable.

  11.

    >> They are simply not comfortable with 90% complete data. Or even 95%. They work really really hard to get the other 5% because without that they are unable to accept that they could make business recommendations.

    — An example from PPC:
    Let's say you have 3 ads running in a campaign. Based on real, but imperfect, data, the CPA of one of the ads is 2X better than the others. If you make a decision to cut low performers sooner rather than later, you instantly lower CPA. The data may have been imperfect, but the resulting increase in sales is specific and measurable. Not only that, but once this round of improvements at the ad level is set, you can move on to additional testing of keywords, landing pages, etc., yielding yet more sales.

    Paralyzing yourself waiting for perfect data not only holds up the specific experiment you're running, but also stymies all the tests that follow it.
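    In rough, purely hypothetical numbers, the arithmetic from that example might look like:

```python
# Hypothetical spend and conversions for three ads in one campaign.
ads = {"ad_a": (500.0, 50), "ad_b": (500.0, 24), "ad_c": (500.0, 26)}

cpa = {name: spend / conv for name, (spend, conv) in ads.items()}
worst = max(cpa, key=cpa.get)                      # ad_b at ~$20.83 per conversion

before = sum(s for s, _ in ads.values()) / sum(c for _, c in ads.values())
kept = {k: v for k, v in ads.items() if k != worst}
after = sum(s for s, _ in kept.values()) / sum(c for _, c in kept.values())
print(f"Cut {worst}: blended CPA ${before:.2f} -> ${after:.2f}")
```

    Even if each "real" CPA is off by a few percent, neither the ad that gets cut nor the direction of the blended CPA changes.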

    Great topic. Thanks Avinash.

  12.

    hi Avinash,

    Great post. Getting "perfect" data quality is impossible. The web just changes too much to have that be a realistic goal.

    Most HiPPOs have no idea of how much information you can get from web analytics packages anyway. Compared to its offline counterpart, web data contains far more information and insights. Like you have said many times, trending is the key to understanding changes. Exact numbers are much less important.

  13.


    Great post. I buy into your 6 Steps. You don't need to convince me. Like Adrian mentioned, it's others that need the convincing.

    I think that part of the problem is that people have different perspectives on numbers being reported. People coming from a finance background, for example, may believe that if numbers are being reported, they are accurate and that you can take them to the bank. At some point, somebody compares the numbers you are reporting to numbers that an outside agency is reporting and there is a difference….which shakes their confidence.

    I think that as these numbers become a more important indicator of ROI and are used to justify further spending, they become more visible. They probably start appearing on more PowerPoint slides higher and higher up in the organization. So, even if your immediate boss and her boss 'get it' and are comfortable with imperfect data that provides actionable insights, others who are higher up and far removed from the world of web analytics and your blog probably don't 'get it.' So, I think there is some ceiling in each organizational hierarchy above which there is a perceived danger in reporting numbers, because the person immediately below that level doesn't have 200% confidence in them.

    So, I think part of the challenge here is educating up.


  14.

    "At some point, somebody compares the numbers you are reporting to numbers that an outside agency is reporting and there is a difference….which shakes their confidence."

    Totally agree.
    Happens almost every week.

  15.


    Thank you, you hit the nail on the head for me. I've internalized the lessons here and have moved forward finding actionable data. The issue for me is "educating up." How have you approached this?

    BTW-Between the comment I left this morning and now, I had a publisher email me AND cc the client with a report they compiled. The publisher's report shows 50 more clicks on a banner ad than what our Analytics show… I do not look forward to addressing this discrepancy.

    So, as another comment stated: this happens every week.

  16. rahul deshmukh says

    Good post Avinash. Some pushback comes from business owners' belief that a 3-5% data gap is also OK for Revenue, but that is a wrong assumption. The concept of 3-5% is applicable for traffic only (visits, visitors). As long as it is in context and you can tease out the variance, it is fine not to sweat over it. The caveat for very large sites is that 3-5% can be very meaningful.

  17.

    Great post on a needed topic. And the tenets you laid out in your 6 steps make total sense and are worth following.

    Being a student of cognitive science, I found your choice of the term 'mental model' interesting. In addition to the facts you mentioned, this term is more appropriate in this context than one would think :-).

    First, it is accepted that in the use of mental models folks gravitate towards a deterministic solution to avoid the multiplicity of choices and solutions. And as you mentioned, with incomplete data one has more decisions to make on which path to take and so one keeps after the 'complete data' paradigm to reduce these choices.

    Second, crazy as it might seem, it is okay to have multiple mental models to deal with different aspects of a domain. The same applies to the web data. Accuracy vs Precision needs and requirements would be different depending on what you are trying to address.

    And lastly, all mental models start out as incomplete, unstable, and ad hoc. The same is the case with the web (even if one has prior BI experience). With no experience in using web data, there are no schemata to choose from, and so one uses his/her past experience as an analogy to create a strawman mental model. But as one gains experience in the domain, these mental models start to take shape and new heuristics and best practices get added to get to an optimal execution.

    Again, enjoyed the read and great advice. Agree with the other comments that people need to grasp these concepts (moving fast, accuracy vs precision etc.) if they want to excel in this domain.

  18.

    I totally agree, and I'd like to share a little tip with all those (like me) dealing with clients who expect us to provide the ultimate truth.
    The main reason we work with analytics is to understand what's going on, and come up with actions that help improve our offering for our site visitors.
    If I see that 20% of traffic comes from xyz.com, and 15% comes from Keyword A, I will definitely investigate these two sources, will definitely call the guys at xyz.com, and try to improve my ranking on Keyword A.
    What would change if you convinced me that xyz.com is responsible for 17% and not 20% of my total traffic?
    Absolutely nothing.
    My actions will not change based on the tiny improvement in accuracy.
    This might be a response to clients who are demanding 100%: just show them that you will still do the same thing.

  19.

    Thanks Avinash for this post. I went through the difference between Accuracy and Precision, which I understood in theory but not in terms of its application.
    Even regular statisticians live in an uncertain world and are all the time bridging data, extrapolating and doing all kinds of things.

    Another area is the in-depth interview used for information (not data) collection about the behavior of a population. I don't know whether this is used by folks on the web, but it is very useful when you have to estimate the behavior of a population. You may need more than one or two estimators and use normative measures of the accuracy of your estimators. (This is one possible application of checking whether a person is an accurate or a precise estimator.) This is clearly not a part of Web Analytics, but strangely I have not found any reference to people performing such interviews even though they are of vital value. Getting a person to talk face to face is also very hard at times; this will remain a challenge for Web based surveyors.

    I wandered within the blog and really liked the statement 'think book not diary' and also the clear manner in which you have spelled out goals, though you have not updated some data.
    Incidentally, the term GIGO emerged in the 1960s from computer programming. Thanks.

  20.


    I, too, would be interested in an approach for 'educating up.' I think that Avinash's post titled 'Convert Data Skeptics: Document, Educate & Pick Your Poison' helps when the person you are trying to educate is near your level. They have time to look at the details. However, when you are a mere analyst in a Fortune 100 company and your data starts making its way to the CMO (80 levels above you), does he or she have time to learn about the 'nuances' of the web analytics world and our data when those numbers are just on one of 140 slides appearing in a presentation? What if one of those slides shows numbers from compete.com comparing how your site (1.5 million visits last month) is performing against a competitor? Another slide shows what your web analytics tool is capturing (880k visits last month). The CMO immediately questions the differences.

    Above, Avinash said this: 'In my heart of hearts I believe, sadly, that it will take a new generation of Analysts and a new generation of leaders in companies.' I definitely believe that in time it will get easier for us as more and more leaders who know what we know about web data from first-hand experience move up through the ranks. They will have realistic expectations of the numbers, and we won't have the problem of 'educating up' to people we've never met who are expecting perfectly accurate data from us. Until then….

  21.

    Another important post, Avinash. To me, the most important point is near the end, when you compare web data with TV and magazine data. I've always thought that web data is *more* accurate than most of the other data we have. It's just that

    (1) we're familiar with what data we can and can't expect in other fields (for example, we're comfortable using magazine circulation as a proxy for readership);

    (2) people somehow expect data to be "100% accurate" because computers are involved.

    I also think web analytics companies haven't helped themselves by using language like "unique visitors" but then having to issue disclaimers about why it isn't really unique visitors but an approximation. However, that shouldn't obscure the fact that our data is actually really good, and much better than what we can get in offline media. Now we just need to convince the HiPPOs of that…

  22.

    Hi Avinash,

    Great post. I think we should educate companies and HIPPOs about data quality. I totally agree with your arguments. We need to change our mental model and evolve it.
    However, having a BI background, I would like to give some examples of why it is so hard for some BI analysts (or web analysts) to change their mental model. Traditionally, BI mistakes were very costly:
    • Imagine making a mistake in a national mailing campaign. That could cost the company hundreds of thousands.
    • Imagine recommending the launch of a new CPG product. What confidence level should you have to make recommendations?
    • You needed to evaluate offline campaigns. You needed correct/accurate/clean data to defend the campaign's effectiveness.

    Offline companies generally needed to have very good data quality, defined data quality procedures, data quality champions, etc. Analysts were brainwashed, even in university, that data management is 80% of every analytical project. Everything was manual, so even one small mistake could have a serious impact. Many BI analysts were fired for data quality issues.

    Online is not the same, and we should have a new breed of analysts who are comfortable with the quality and plethora of web data.

    I am sure that we all have stories of data quality issues in the BI/database marketing industry.


  23.

    I totally agree with your post. Just love it. I also agree with Christos: the biggest problem is most likely the HiPPO, who should be aware of the fact that 100% complete data is a utopia.

    We just have to base our decisions on imperfect data (just like science does).


  24. Maria Saltz says

    Great perspective on a very hot issue.

    I think that 100% data accuracy is a myth, but it is important to get comfortable with the discrepancy and get the stakeholders comfortable with a certain degree of imperfection.

    Currently, up to my ears in new implementation "issues", I would love to get to the point that my boss stops asking the question "Are you sure the data is accurate?" and starts asking "How quickly can we act on these recommendations?"

    Definitely going through the site audit, and definitely will be trying to reconcile with the agency's data (sort of like getting a second opinion). Not expecting a 100% match, but hoping to be able to put a confidence level around the data and make bets based on that.

  25. dipoogle says

    Hi Avinash,

    I recently came across your blog and got a chance to listen to your recent Google webinar. Your posts got me interested in Web Analytics. Thanks for that. I have a few comments/questions.

    I think people are comfortable with inaccurate data as long as they know the inaccuracy range (eg. +/- 5%). Do we have something out there which tells you how inaccurate your data is? For example, let's take the case of measuring clicks and impressions, or your favorite, bounce rate. Has someone documented all the things that can cause data to be inaccurate and some probability of how often each happens?

    A more important question is – how do you measure precision? Do we have any tools out there to quantify and improve your precision?

    I have more questions but let's get through these first.


  26.

    To Adrian's and Rick's point, I don't think the issue is as intractable as presented. If the concern is, "As the data gets passed along up the corporate hierarchy…" then I think the concern is flawed — the farther up you go, the greater the imperative that the insight is being passed along.

    Data –> Information –> Insight –> Action (or any of the 75 flavors of this concept) apply. Diving into the data can be seductive. And it's hard work. And it requires understanding the data limitations. And it requires understanding the nuances of what "time on page" really is reporting. So, great, you understand all that, AND you've shifted your own mental model to be comfortable with incomplete data (including some unexplainable inconsistencies across multiple data sources). It's human nature to want to "show the hard work I did to get to the insight"…but that's really not necessary, and it can be risky — that's a focus on you rather than a focus on your audience. Show the insight. Tell the story of that insight — how you got there and its implications.

    This is different from "showing the data" that led to the insight. The difference is in the orientation — you may need to show some data to tell the story, but that's the only data you should show. If a request for more detail comes, then provide it, and include mental-model-shift guidance/tips at that point. In short, if you provide insights that are valid, and those insights are acted upon, and that works out well, then you're doing the right thing by your company (or your client), and you will increase your credibility and reputation.

    In the case of clients where you have to include data because it's a "facilitated good" that proves you're delivering something — push "the detailed data" back into the appendix and focus on the insights / the story up front. If you nail that, the appendix will get a cursory scan at best.

  27.

    I am fully aware of imperfect web analytics data, but it is really hard to explain this reality in a company that has used DWH reports for years.

    When I started my job, everybody was saying Omniture was such a bad tool and none of its figures made any sense. When I started to learn web analytics, Omniture, and our website, I realized it was our crappy implementation that caused this. Otherwise Omniture is a great tool.

    The implementation QA process is so important to getting the best data; I spent my days and nights testing it over and over again to make sure we capture everything properly.

    Even after this, when senior management sees a 1% difference between DWH and Omniture reports, they freak out. I think, as per your point, the next generation of analysts and management will be more confident using the web's imperfect data.

  28.

    Everyone: Some fantastic comments and delightful feedback here. I am very impressed.

    I was hoping to reply to them all here, but I have channeled my response to your challenges of dealing with HiPPOs into a stand-alone blog post. It will be the next one I write.

    Please look for it in a couple days. Meanwhile please accept my apologies for a lack of direct replies here.

    Thanks so much for the conversation.


  29. Captain Obvious says

    Consider a hypothetical conversation between an analyst and HIPPO.

    Ninja: "For this project, as a benchmark, we get 10,000 uniques a month to our web site and…"
    Hippo: "How do you get that number?"
    Ninja: "We've been tracking that metric for 12-months, and that is an average over the course of those 12-months. (Ready for the storm that's brewing:) Of course, that data is flawed. "
    Hippo: "What do you mean, 'flawed'?"
    Ninja: "It's not going to be 100% accurate."
    Hippo: "Why not? How accurate are we now?"
    Ninja: "We're about 95% accurate."
    Hippo: "What do we have to do to get 100% data accuracy?"
    Ninja: "Hire an outside firm, hire more analysts, get more analytic software in place."
    Hippo: "How much will that cost?"
    Ninja: "Approximately $10,000 a month."
    Hippo: "And that will get us 100% data accuracy?"
    Ninja: "No."
    Hippo: "Then why would we do that?"
    Ninja: "We wouldn't."
    Hippo: "What would we do?"
    Ninja: "Focus on the 95% of the data that we have and perform tests using what metrics we can to determine if we are moving in a net positive direction."
    Hippo: "And how do we do that?"
    Ninja: "I'm glad you asked…"

    Of course there's a slew of other questions that would be raised, and I'm no Avinash, so just have to wait and see his next post!

  30. 30

    I have appreciated the comments on this post, especially the interaction I've had with Rick and others about this topic.

    Please don't hit me over the head if I'm assuming too much (you know what happens when we assume), but I would like to push the perspective to a broader position because it seems most replies assume working at a larger corporation or e-commerce site. Framing the discussion in those terms leaves out some of the other #wa's who may be reading this blog, and, like me, only deal with websites that may average at or just below 2,000 visits a month.

    There are actionable insights within those small numbers; however, the dynamics of presenting those insights seem to be fairly different. :-)

  31. 31

    Captain Obvious: Nice story!

    It is very hard to say no to a HiPPO and I think you ended up with the right answer in the end (right answer from the HiPPO, you were always right :)).

    But I would have done one thing differently (snarky that I am):

    Hippo: "What do we have to do to get 100% data accuracy?"
    > Your answer > Ninja: "Hire an outside firm, hire more analysts, get more analytic software in place."
    > My answer > Ninja: "Who cares! I can tell you what we should do today with 95% confidence!"


    Next week I am hoping to share other strategies that will reduce the chances you'll get mired in this data quality discussion in the first place. Fingers crossed.


  32. 32

    Garbage in, garbage out. So true. If it's not so great, then it's probably not worth it.

  33. 33

    Avinash- Great post. Strive for completeness and accuracy but don’t become paralyzed when you don’t have ALL the data. Timely business decisions are usually best made on just enough data. As a product manager, I would never be able to introduce new products, if I had to have complete, 100% dead-on, numbers that reflect the size of a market. How many companies right now are waiting to even dip their toes in social media until they have fully quantified the return they can get from diving in? No kidding about making decisions to keep pace with the web. The train is leaving the station.

    Good focus on precision. Consistent data is critical to making sound iterative decisions. It rarely takes looking at a whole population, to be able to judge and optimize the effectiveness of something you are doing. A merchant can readily make a product placement decision based on the rate of change in which a product sells in a specific place on a web site, without absolute alignment with their back end numbers; a marketer usually has enough data to fine tune an offer, creative or placement without waiting until the campaign ends to analyze the data; customer support can often deduce the breadth of an issue by detecting a change in how often specific topics were searched for, without having to turn to their call center logs.

    Also I wanted to make sure there wasn’t any confusion about Webtrends and cookies. Since early in 2006 Webtrends has used first party cookies as the de facto method to identify visitors. Our approach also lets us track them across different first party domains. This is critical when trying to correlate say the conversion rate of a specific search phrase that led to http://www.avinash.com to your checkout process on avinash.checkout.com. Impossible to do this without being able to track a visitor across domains.

    On the subject of completeness, regular auditing of a site to ensure all of the tracking is in place is a good maintenance practice, but having a culture in marketing and development that builds measurement into their projects from the outset is more important. There is no going back in time to add tracking code to an email that has already dropped or a micro-site that has already launched. Initial responses have been lost.

    Eric Rickson

  34. 34

    Right on Avinash!

    Seeking to remedy that 2-3% inaccuracy is a classic case of diminishing returns. Data purity is often used as a distraction from more important issues in a business. GIGO is a much easier point to make to your CEO than explaining the long tail of search, for example.

    You keep us on our toes!
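    The diminishing-returns point above can be sketched numerically. This is a toy model with entirely hypothetical costs (the cost function and numbers are assumptions, not real figures): it just illustrates that if the cost of accuracy rises steeply near 100%, the last few percent are by far the most expensive to chase.

```python
# Hypothetical cost model for data accuracy: assume the cost of reaching
# a target accuracy grows steeply as the target approaches 100%.
def cost_of_accuracy(target_pct, base_cost=100.0):
    assert 0 < target_pct < 100
    return base_cost / (100.0 - target_pct)

# Marginal cost of one more percent of accuracy at two points on the curve.
marginal_95_to_96 = cost_of_accuracy(96) - cost_of_accuracy(95)
marginal_98_to_99 = cost_of_accuracy(99) - cost_of_accuracy(98)

# Under this model the 98% -> 99% step costs ten times the 95% -> 96% step,
# while the decisions you make on the data rarely change at all.
```

    Under these assumed numbers the same one-percent improvement costs 5 units at the 95% mark and 50 units at the 98% mark, which is the shape of the argument being made: past a point, accuracy spend buys distraction, not better decisions.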


  35. 35

    Great reminder – could you post this on a Marketing/Ad agency Blog?

    One of the things that we ask of our 'clients' is "how would getting the exact numbers change your decisions". In most cases (99.99%) it would not change the decisions. Fool on.

  36. 36

    This is a nice sentiment but nothing new in the BI area – another example of web analytics not learning from "old school" analytics.

    Although operational databases are usually very accurate, of the hundreds of customer databases I've data-mined over the past ten years, I can't think of many that were over 95% accurate, with customer data being omitted, duplicated, mis-entered, etc.

    The sentiment is right: go for precision in the data that helps you make your decision. But will this be a cultural change for most marketers? It shouldn't be. If it is news to you, either your tech people have been lying to you about your data quality for years or you haven't been collecting much data about the customer.

  37. 37

    Great post, Avinash! This is something I talk about a lot in my Webtrends classes – the fact that data on the web is inexact and will probably remain so for some time to come. It’s a big paradigm shift for companies, so I really try to focus on explaining why the data is inexact and what we can learn from it anyway. Then again, I come from a qualitative background in education, so I’m used to data that is imperfect, yet useful.

    I’d be interested to hear – what made the shift in thinking so tough for you?

  38. 38
    Pat Moore says

    Interesting how you don't include Google Analytics in the set of Web Analytics data sourced by third party cookies providing crappy data.

    At least Omniture lets you have the raw (crappy) data they collect from your site.

  39. 39

    Pat: I am sorry I don't understand your comment completely. Could you please share a bit more context?

    Google Analytics, like a handful of other analytics tools in the market, only uses first party cookies. They substantially improve the quality of the data you collect, though cookies of any kind are by no means perfect.

    With regards to "at least Omniture lets you have the raw (crappy) data they collect from your site", there is full API access to Google Analytics, you can pull the data and have at it if you please. And you can do so for free without exorbitant contracts or access fees. You can also do this with WebTrends for example (pull all your data without extra fees). The free Yahoo! Web Analytics tool also has a nice API.

    I hope I am kind of in the ball park in terms of the response you were looking for. If not please do let me know.

    Sandra: I think it was simply a matter of having worked in the traditional BI (business intelligence) and DW (data warehousing) world for such a long time. Everything tied (at least within 99%) and numbers got reported to the SEC (the government). From there to this wild wild west was a bit of a learning experience.

    Of course it did not take me too long to realize how awesome this wild wild west was, with so many different kinds of data and all the possible insights. :)


  40. 40
    Pat Moore says

    OK. I stand corrected. I thought GA was also deployed by a third party cookie. Is that new?

    I also didn't realize I could get my data back from GA. I don't see that option. How can I get the data in GA?

  41. 41
    Pat Moore says

    One more question. Just to understand the extent to which I am getting crappy data with my third party Web Analytics data: what percent of visitors don't accept third party cookies?

  42. 42

    Pat: GA has never been on third party cookies (mostly because 3rd party stinks!).

    Here's a website for exporting data in bulk from Google Analytics:


    Or you can use a more user friendly version developed by Juice Analytics:


    With regards to your question about cookies. . . .

    Your best option is to work with a consultant for your tool. Say for example if you work with Google Analytics you would get a GAAC (www.bit.ly/gaac). If you work with Omniture they have a professional consulting side as well.

    Here is an article on cookies and some benchmarks and numbers:

    A Primer On Web Analytics Visitor Tracking Cookies


  43. 43

    I can't say I know much about statistics for ads, but the point about being selective is, I think, important.

    Closely related is the possibility of putting large areas into black boxes, or leaving them to other people. So, looking at one large internet community, I would find maybe five very well-connected people of slightly different types to look at and talk to.

    With a lot of resources one could maybe do it differently, but this works quite OK once you get used to identifying the point you follow.

    I imagine something similar would be important when looking at your data sources for ad revenue.

  45. 45

    Hello! I come from China, from the site directory zzwzdq.com.
    Having read your article, I think we should educate businesses and the relevant data-quality HiPPOs.

  46. 46


    You are perfectly right. But even in BI, before or alongside web analytics, perfection exists only in dreams. That's why I use the term GIQO to describe what we do in data mining: Garbage In, Quality Out.

    A good data mining algorithm selects and uses the pieces of data that are meaningful. The rest is discarded as garbage. The result is a good, useful model.

    This illustrates the difference between reporting and analytics. Numbers in reports have to be correct, because the goal of reporting is to provide correct numbers. Conclusions of analyses have to be correct because the goal of analytics is to come to trustworthy conclusions, even if they are based on crappy data!


  48. 48

    Those of us in the WA biz know better by now.

    According to my experience (and Google Analytics University :), the expectation should be that WA data is best used for trending analyses – which is an excellent way to make better business decisions.


  1. […] Web Data Quality: A 6 Step Process To Evolve Your Mental Model (Occam's Razor by Avinash Kaushik) […]

  2. […]
    – If it is reporting: the numbers had better be correct. In a large enterprise I bet there will be two sources of the same numbers. The results will be compared and there will be trouble.
    – If it is descriptive data mining, like clusterings or descriptive classifications: the data had better be as correct as possible. Errors are acceptable within reasonable limitations, as long as the picture "fits".
    – If it is data mining for targeting purposes: the data has to be stable in time. Correct? I do not care. Does this sound crazy? Perhaps. But really: I do not care! If they put someone's shoe size in the "Birthday" variable, this poses no problem, for the data mining algorithm does not take the meaning of the variable names into account. "var1", "var2", "var3", etc. do equally well.
    The only thing that matters is: how good is the predictive quality of the targeting model? You can only obtain a good predictive model with variables that have predictive power (are related to the target) and that are stable, meaning the meaning of the variable does not change over time. I do not like it when IT people correct flaws in the data. It diminishes the model quality and I have to rebuild the models.
    So better to use your time to build targeting models than to try to get the data to be perfect. – If it is web analysis: this is yet another story, neatly explained by Avinash Kaushik in this post. […]

  3. […]
    I read a piece this week about data quality on the web and, thinking about what Avinash Kaushik (the author of the article) wrote, I took away some interesting lessons. I am not going to write a review here. I will wander through what I found most interesting and what will serve me in this crazy world of the web.

    If you are interested in the details, stop by the site: https://www.kaushik.net/avinash/2009/05/web-data-quality-6-step-process-evolve-mental-model.html.

  4. […] Web Data Quality: A 6 Step Process For Actionability | Occam's Razor by … (tags: web statistics analytics data internet roi testing usability) […]

  5. […]
    There are limitations that every web site deals with as well. When IT folks tell you the data is crap because of deleted cookies, users using different browsers/computers, blocking javascript, etc., etc. … smile and say, “get over it.” There is nothing we can do about these and the data will never be perfect, but that’s ok. That’s why trending is important.

    Take a look at this recent post by Avinash Kaushik about being comfortable with incomplete and imperfect data. Read. Heed.

  6. Link Love Friday with Cory B. says:

    Link: Web Data Quality: A 6 Step Process to Evolve Your Mental Model

    Love: data, but don’t allow imperfect data to cause you to freeze and not act. One of my favorite lines from this post says there is no limit to the amount of data you can collect and store on the Internet, and it’s headache-inducingly correct. I’ve mentioned in previous posts the importance of collecting data, analyzing data and then providing an interpretation of that data for insight into what action should be taken, and I of course still feel that way, but I’m not a Quant. There’s a point where granular becomes so microscopic that the difference between dataset A and dataset B will not cause your client to change his or her decision. Therefore, you need to accept imperfection and act. I know we’re big into models and science and equations, but so was Wall Street, and we saw what happened there. Certainly collect your data, but don’t allow it to bog you down into indecision, and don’t allow incomplete data to bolster that indecision. After all, it’s all incomplete (esoteric alert!).

  7. […] Link: Web Data Quality: A 6 Step Process to Evolve Your Mental Model […]

  8. […]
    You can lose hair trying to reconcile all of these data sources, or you can act based on trends you see across the tools. Google Analytics may be telling you one thing about referrals from Facebook, Facebook may be telling you another, and your CRM system may be telling you something else. However, if each system is proportionally telling you that conversions from traffic driven by Facebook decreased significantly, then act. If you wait for 100% data confidence, you’ll miss the boat.

    For more on freeing yourself from data accuracy dependency, see Avinash Kaushik’s blog post on the subject.
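    The "act on agreeing trends" idea above can be sketched in code. This is a minimal sketch with hypothetical numbers and helper names (the threshold and the conversion counts are invented for illustration): three systems that never reconcile in absolute terms can still agree on the direction and rough size of a change, which is the signal to act on.

```python
def pct_change(series):
    """Percent change from the second-to-last to the last period."""
    prev, curr = series[-2], series[-1]
    return (curr - prev) / prev

def sources_agree_on_decline(sources, threshold=-0.10):
    """True if every source shows a drop of at least |threshold|,
    even when their absolute counts disagree with each other."""
    return all(pct_change(s) <= threshold for s in sources)

# Hypothetical weekly Facebook-driven conversions from three systems
# whose absolute counts differ, yet whose trends agree.
google_analytics = [120, 118, 84]
facebook_reports = [150, 149, 104]
crm_system = [110, 112, 78]

act_now = sources_agree_on_decline([google_analytics, facebook_reports, crm_system])
```

    Here each source reports a roughly 30% drop in the last period, so `act_now` comes back true without any of the three counts ever matching the others.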

  9. […]
    I know this doesn’t directly address this wish, but I truly believe in what Avinash says about Accuracy vs. Precision. Essentially, it’s all about focusing much less on trying to collect and report on every single piece of data, and all about focusing much more on analyzing and taking action from the good, precise data that you already have.

  10. […]
    Matthew: Less sampled data. Especially when working with the APIs.
    Joe: It is important to have data that is complete and less accurate, rather than the other way around, as Avinash says.
    Marco Cilia: I always say this in my courses: even if what you see were 90% of the TRUE visitors to the site, would it not still be more reassuring to know that that 90% is consistent over time, rather than focusing on having 100 one day, 87 the next, and 95 the day after that? Decisions are better made on consistent data!

  11. […]
    The pitfall of this pillar is that you get stuck in wanting to collect ever more data and in making the data 100% correct. Avinash already shared a nice article with us about this two years ago, namely “Web Data Quality: A 6 Step Process To Evolve Your Mental Model”.

  12. […]
    Kaushik points out two obstacles in the way of data quality in online marketing: the high volume of data we collect, and the imperfections in the data collection system. He cautions against the “futile quest for clean data.”

  13. […]
    The Internet generates a never-ending avalanche of all kinds of data; many of the sources are related to digital marketing and social media. But keep in mind that Internet data (big or otherwise) impacts the decision-making needs of many areas of an enterprise, not just Marketing. Web analytics and digital marketing expert Avinash Kaushik has this to say about Internet data quality:
    Ditch the old mental model of Accuracy, go for Precision (more here: Accuracy, Precision & Predictive Analytics). It might seem astonishing but your analysis will actually get more accurate if you go for precision.
    I am not saying accept bad data…What I am saying is that your job does not depend on data with 100% integrity on the web. Your job depends on helping your company Move Fast and Think Smart.
    Kaushik further sees the direction of data quality morphing from the ‘old school’ BI need for data Accuracy to future insights driven by predictive analytics relying on data Precision

  14. […]
    One of the guys who talks about this so well is Avinash Kaushik. If you don’t know the name, he is a New York Times best-selling author, one of Google’s head guys, and is on the board of several universities. His blog, Occam’s Razor, is pure gold for the modern-day data analyst, and I recently ran into one of his older posts named A 6 Step Process to Evolve Your Mental Model. In it, he describes the idea that there is no set of perfect data and people need to follow a much simpler path to “Virtuous Data”.

  15. […]
    The old people who did BI and moved to the web have had to come to the realization that the old rules of making decisions are out the door. Not just because the mental model of what now counts as “data” has changed, but also because what counts as a “decision” has changed, and the pace at which those decisions need to be made has changed. Read more

  16. […]
    In the train’s example, the data was there (tickets purchased after 17:00 are for the next day). Some analyst discovered it and a designer used it to change something, saving London commuters some taps on the screen and increasing customer satisfaction by getting them home earlier. That analyst had the proper mindset! How could we apply this mindset to different websites?

  17. […]
    I follow Avinash Kaushik (and read his blog and books) because he has a laser focus on communication and straightforward insight. I started this post to talk about a great Evian commercial he just posted on Google+ and Facebook but it triggered a thought about one of his older blog posts which is still worth your time to read, Web Data Quality: A 6 Step Process To Evolve Your Mental Model. There are two lines you should think about:

  18. […]
    The risk we face in trying to move beyond simple tracking of last-click conversions can be best described as “garbage in, garbage out,” something Avinash himself spells out in detail on his blog. Picking the wrong actions to measure, or the wrong value per action, will drive you to make the wrong marketing decisions. Indeed, once you start moving beyond simple metrics, it becomes quite easy to manipulate numbers to show the results you want.

  19. […]
    Today's companies need to be agile to survive. The entrepreneurs who run them need to make decisions quickly, despite the unstable environment that surrounds them. Constantly wanting to know 100% of the situation before deciding only lets you procrastinate. You can certainly arrive at the same conclusions with 60% or even 30% of the information… Avinash himself writes on his blog: Be comfortable, I mean really freaking comfortable, with incompleteness and learn to make decisions. (Source)

  20. […]
    After completing this course, you will feel more confident using, configuring, and modifying Google Analytics, because you will know what lies behind the platform. Concepts like data sampling will no longer carry a vaguely negative tinge (it is a free tool, after all); you will understand the difficulties behind what seems so simple, and you will understand why Avinash Kaushik attaches little importance to web analytics data being naturally a bit inexact.

  21. […]
    It’s better to be quick and nimble. As Google Analytics evangelist Avinash Kaushik once wrote when reflecting about the need for data quality, it’s better to “move fast and think smart” than it is to try to strive for perfectly clean data. If you have a large website, think about running an audit of the most important part of your website first. Once you have fixed any issues uncovered, re-assess, and determine if it’s necessary to audit additional sections of your website.

Add your Perspective