There is perhaps no other activity that is as much a bane of our existence in web decision making as reconciling web data! Nothing seems to tie to anything, the numbers change each time you rerun them, there are tons of people fanning the flames of cookie deletion and how it causes bad data, arguments about using logging versus tagging as data collection mechanisms, how many cookies each browser will accept anyway, and on and on and on. And like dutiful little hamsters we spin in our wheels round and round and round and never make any progress, and still… nothing ties to anything else.
Let me state as emphatically as I possibly can: data quality on the internet absolutely sucks. And there is nothing you can do about it. At least for now.
The sooner we internalize this, the sooner we can all just get over it and the sooner we can move on.
(Oh, and it really does not matter what your favorite neighborhood vendor, the $0 one or the $1 million one, says. Pretty much all vendors use the same fundamental methods to collect data. Yes, each vendor probably has some nice little innovation, but they can't help the fact that the internet is a weird little animal, constantly evolving and changing. That is its inherent beauty and charm, and why the World Wide Web is such a delight. :~) )
Note what I am not saying: I am not saying you should not implement your web analytics tool completely & correctly. I am not saying you should not ensure your campaigns are tagged. I am not saying you should not follow the technical recommendations of your vendor. Do all that. But then move on, don't try to go for 100% when 95% will do just fine!
This post is for those amongst us who want 100% perfection, who have a difficult time accepting that good enough now is better than perfect four weeks from now.
Here are some of the reasons people have actually given me when it comes to data quality (especially of clickstream data):
- Cookies “blow”, in many ways
- This is different and I don’t like it
- Robots do weird stuff and mess things up
- Tagging has limitations and no one accepts JavaScript
- I can do it better than vendor x/y, with five minutes of coding
- Nothing ties to my WebTrends and WebTrends is right
- People behave strangely and mess up numbers
- Everyone is out to get me
- I miss my mommy
Ok, so maybe I was just making up the last one (but not the second-to-last one).
Most of the reasons above are true and real, and there are many more that are not on the list. (What about Google pre-fetching results? What about an entire sub-experience that is now Ajax, which most vendors can barely measure? What about IE7 doing wacky stuff to third-party cookies? What about click fraud? What about the impact of CDNs?)
We should simply not expect perfect data quality on the internet, at least for now. At least not like we are used to in traditional backend systems like ERP and CRM, where almost everything typically does tie, again and again. On the World Wide Web there is too much complexity (people, process, rapid change), and data collection is too primitive, to ensure accurate data capture and data analysis.
Now I'll be the first one to admit that your decision makers are not going to let you get by with my lofty proclamation that data quality sucks. And make no mistake, it will take time to convince them, and to win their trust, that even though data quality is sub-optimal we can still make great decisions.
Here is my recommended six step plan:
Step 1: Resist the urge to dive deep into the data to find the root cause. IMHO this is a time-consuming and futile exercise. By the time you figure out some semblance of an explanation, there are new reasons why the data won't tie (at least at a macro level). Especially if you are off by less than 10%.
Step 2: Assume a level of comfort with the data. When presented with data, understand clearly how it is collected and then encourage yourself, and your decision makers, to assume a level of comfort with it. Say you look at something and decide you only trust it 80%. The other guy might say 75%, and perhaps the nice lady might say only 70%. That is ok.
Human beings are complex, and each of us is a collection of our life experiences, so we will each decide differently. That is ok; by trial and error arrive at a comfort level with the data. It is now your friend, unlike in Step 1, where it was out to get you fired.
Step 3: Start making decisions that you are comfortable with. This step takes the most courage; it is the biggest leap of faith. But trust me, after this it is all fun. I can look at your table of data and, with just a 70% comfort level, make business decisions.
If important KPI ONE changed by X%, or by YYY,YYY, then with 100% confidence in the data I would have spent $90k on the next campaign, or completely changed the site architecture, or built a new checkout process. But with only 70% confidence in that KPI change, I can still decide to spend only $60k, or run a multivariate test before I change the site architecture in order to gain more confidence in the data, or still build the new checkout process because checkout is so important that 70% confidence in the data is enough.
This example is a simple illustration that it is possible to make decisions with less than 100% confidence in the data. Encourage that behavior, and it is ok if you trust the data more than your decision makers do; they will come around with time. It is important that you model this behavior: if in your gut you find it hard to make this leap of faith, it will be monumentally harder for your decision makers to do so.
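If it helps to see the spirit of this in code, here is a toy sketch in Python. The numbers come from the example above, and the linear scaling of budget by comfort level is purely illustrative, not a formula anyone prescribes:

```python
# Toy illustration of deciding with less than 100% confidence in the data.
# The $90k full-confidence budget and 70% comfort level come from the example
# above; the linear scaling is just one illustrative rule, not anyone's formula.
full_confidence_budget = 90_000   # what you would spend if you trusted the data 100%
comfort_level = 0.70              # how much you actually trust the KPI change

adjusted_budget = full_confidence_budget * comfort_level
print(f"Plan to spend roughly ${adjusted_budget:,.0f}")   # ~$63k, in the spirit of the $60k above
```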
Step 4: Over time, drill deeper into specific micro areas and learn more. I love data detective work; oh, the joys of trawling through a terabyte of data looking for answers!! Once you get into the groove of making decisions, rather than being paralyzed by imperfect data quality, I recommend that you find small, narrow niches of data segments to drill into. The goal will be to understand why the data for that narrow niche might not be what you expect.
So you could take all the traffic from a particular referring url, or a particular search key phrase, or all your email marketing traffic in a week, or everyone who saw a particular page, or… for each one of us the poison will be different. By narrowing your focus you'll reduce the number of distractions, increase the chances of isolating causality and start to better understand your complex website ecosystem.
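As an illustration of what such a drill-down might look like, here is a minimal Python sketch. The file name and the column names (referrer, medium, visit_date) are assumptions about a hypothetical clickstream export, not any vendor's actual schema:

```python
import pandas as pd

# Hypothetical clickstream export; the file name and the referrer / medium /
# visit_date columns are illustrative assumptions, not any vendor's schema.
visits = pd.read_csv("clickstream_export.csv", parse_dates=["visit_date"])

# Narrow the focus to one niche: a single referring site...
one_referrer = visits[visits["referrer"].str.contains("news.example.com", na=False)]

# ...or one week of email marketing traffic.
email_week = visits[
    (visits["medium"] == "email")
    & (visits["visit_date"].between("2007-03-05", "2007-03-11"))
]

# With a small, well-understood slice it is much easier to spot why the
# numbers are not what you expect (missing tags, robots, redirects, ...).
print(len(one_referrer), "visits from the chosen referrer")
print(len(email_week), "visits from one week of email campaigns")
```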
Step 5: Get more comfortable with the data and its limitations over time. As you understand the data better (its collection, storage, manipulation, processing and analysis), you'll make the appropriate adjustments in your interpretation and your quest for web insights. This in turn will increase your comfort level with the data over time, from 70% to 73% to 78% to 81% and on and on. While you will perhaps never get to 100% confidence, you'll start making significantly more confident decisions.
Aim for small increments of improvement in comfort and confidence levels for yourself, and reduce that increment by 50% for your decision makers (this is much harder for them).
Step 6: Consistency in calculations = good. On the web, absolute numbers rarely matter; segmented trends do. This is important to remember, especially if you find that you are unable to get beyond Step 1. The quest to get an absolute number exactly right is especially futile, for all the reasons discussed above. As long as you stay consistent and look at trends *and* important segments in those trends, you will reduce the chances of making sub-optimal decisions, even if there is a small delta in data quality. (If you have not had a chance, I recommend my post on segmentation.)
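Here is a small, hypothetical sketch of "segmented trends over absolute numbers" in Python. The weekly figures are invented; the point is only that the week-over-week direction per segment survives even if the absolute counts are off by some constant delta:

```python
import pandas as pd

# Invented weekly Visits by segment. The absolute counts may be off by some
# unknown delta, but if collection is consistent the trend per segment holds.
data = pd.DataFrame({
    "week":    ["W1", "W2", "W3", "W1", "W2", "W3"],
    "segment": ["search", "search", "search", "email", "email", "email"],
    "visits":  [12_000, 13_100, 14_300, 4_000, 3_600, 3_200],
})

trend = (
    data.pivot(index="week", columns="segment", values="visits")
        .pct_change()        # week-over-week change, per segment
        .mul(100)
        .round(1)
)
print(trend)   # search grows ~9% a week while email shrinks ~10% a week
```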
There are at least two cases where data quality is dicey but deserves special love.
- Switching from one analytics tool to another: My recommendation is, rather than reconciling, to run the two in parallel for four to eight weeks and simply benchmark the differences in key metrics between them. Then create a multiplier and use that if you want to do historical trend comparisons.
Say you are going to replace Omniture/WebTrends/HBX/CoreMetrics with ClickTracks/Google Analytics/WebTrends/Omniture. Run the two in parallel and note that, say, Visitors from your old platform are consistently 15% greater than from the new one. Use that multiplier for trend comparisons against old data; a small sketch of the computation follows below.
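A minimal sketch of that benchmarking, assuming you have exported weekly Visitor counts from both tools during the parallel run (all numbers below are made up):

```python
# Weekly Visitors reported by both tools during the parallel run (made-up data).
old_tool = [52_000, 49_500, 51_200, 50_800]   # platform being retired
new_tool = [45_100, 43_000, 44_600, 44_200]   # replacement platform

# Average ratio between the two tools across the parallel-run weeks.
multiplier = sum(o / n for o, n in zip(old_tool, new_tool)) / len(old_tool)
print(f"Old tool reports about {(multiplier - 1) * 100:.0f}% more Visitors")  # ~15%

# For historical trend comparisons, bring the old numbers onto the new scale
# (or multiply new numbers by the multiplier, whichever direction you prefer).
historical_old_visitors = 60_000
print(f"Comparable new-tool figure: ~{historical_old_visitors / multiplier:,.0f}")
```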
- Cart & Checkout process analysis: You want a great degree of precision here because of the amount of money on the line. If you want to spend time reconciling, this is the place to do it. Javascripts are a sub optimal way to collect this data. If you platform allows use something like what ATG allows for: Event Logging. Each time someone is in this process it precisely captures the data from the server (not the pages) along with the business context. This makes a powerful set of data to analyze for key insights.
Agree? Disagree? Can JavaScript tag-based or web log-based data be of better quality? Would you recommend more or different steps to account for sub-optimal clickstream data quality? Please share your feedback via comments.
Update: Here are two articles that you might find to be of value when dealing with data quality challenges…
- Web Data Quality: A 6 Step Process To Evolve Your Mental Model
- Slay The Analytics Data Quality Dragon & Win Your HiPPO's Love!
Good luck!
Hi Avinash –
Not sure I have the ability to comment on the technical aspects of your post, but as you said, the key is to recognise that data quality is never optimal… once you know this, what I say to clients, and as a decision maker myself, is that I'd rather know 70 or 80% of the truth NOW rather than 100% later (knowing that truth is a "relative" concept, of course) in order to take a decision… Too many "technically oriented people" forget this (I was probably too much like this in the past; my initial training is in computer science as well), and in the end, the key is to decide and learn over time… specifically from mistakes… At a time when marketers finally recognize that a lot of consumer decisions are "irrational", data is good but so is intuition… a data dive helps shape the right intuition… to make the right decision…
Cheers
Laurent
I couldn't agree more. I can't tell you how many times our clients ask us to compare statistics from Analytics Package A to Analytics Package B. It's a time consuming process that usually ends in failure. We try to stress two things to people:
1. Look at the trends in your data
2. Make one change at a time so you can accurately measure the results.
Measuring change to your website is a scientific, controlled experiment. If you start with a baseline statistic and implement controlled change then there will be little doubt as to the results of that change. It doesn't matter if your baseline is off by 2, 5 or 10 percent from the actual numbers. If you implement a single change then there will be little doubt of its effect when you're looking at the trends in your data.
I think Caleb touched on this in his comment a couple of weeks ago:
Excellent post. I wish that more people outside the analytics community understood that you are never going to get two different tools to give you the same numbers, and that your time is much better spent doing analysis.
Insightful post! Just wondering if you have tried any of the data quality tools from the enterprise software world?
Hi Avinash,
Thanks a lot for another wonderful article.
When I apply web analytics to our GA data, I find some of the analysis results don't quite make sense to me, e.g. 80+% of site users convert within a day, and 60+% of site users convert with only one visit. Does that mean we are an excellent site? However, we are a site with only a 1+% conversion rate and a 50% bounce rate.
Therefore, I am wondering if the GA data is not set up accurately, since we have a huge bunch of profiles, 50-ish. Do you have any good suggestions for what I can do to test my guess and verify that the data in GA is of good quality? (Though I know it is sampled, which might not exactly match our internal tracking.)
Thanks a lot
Maggie
Maggie: You might find this post to be of value as you try to figure out how to reconcile the difference between various systems:
~ The Ultimate Web Analytics Data Reconciliation Checklist
There is not enough context in your comment for me to add any specific value. But I'll add that if you are struggling with this even a bit, it might be a great idea to hire a GACP to look at it and sort things out for you quickly. You'll find a list here: http://www.bit.ly/gaac
-Avinash.
Thank you Kaushik for this nice read.
You're right that if we seek perfection in something that is not perfect, we may end up just flying by the seat of our pants. Yet I think that more systematic and intentional work will always be favored by many.
Mehdi: We should bring a systematic approach to implementation, and ensure that the data collection mechanisms are implemented accurately and completely.
But. Data will never be perfect on the web; we should train ourselves to recognize the point of diminishing returns and move into data analysis. Even with 90% confidence, we can make tons of extremely smart decisions for our businesses.
Avinash.