There is perhaps no other activity that is as much a bane of our existence in Web Decision Making as reconciling web data! Nothing seems to tie to anything, each time you rerun the numbers they change, there are tons of people fanning the flames about cookie deletion and how it causes bad data, there are arguments about logging versus tagging as data collection mechanisms and about how many cookies each browser can accept anyway, and on and on and on. And like dutiful little hamsters we spin in our wheels round and round and round and never make any progress and still... nothing ties to anything else.
Let me state as emphatically as I possibly can: Data quality on the internet absolutely sucks. And there is nothing you can do about it. At least for now.
The sooner we, especially we, internalize this, the sooner we can all just get over it and the sooner we can move on.
(Oh, and it really does not matter what your favorite neighborhood vendor, the $0 one or the $1 million one, says. Pretty much all vendors collect data in the exact same way. Yes, each vendor probably has some nice little innovation, but they can’t help the fact that the internet is a weird little animal, constantly evolving and changing. That is its inherent beauty and charm, and why the World Wide Web is such a delight. :~) )
Note what I am not saying: I am not saying you should not implement your web analytics tool completely & correctly. I am not saying you should not ensure your campaigns are tagged. I am not saying you should not follow the technical recommendations of your vendor. Do all that. But then move on, don't try to go for 100% when 95% will do just fine!
This post is for those amongst us who want 100% perfection, who have a difficult time accepting that good enough now is better than perfect four weeks from now.
Here are some of the reasons people have actually said to me when it comes to data quality (especially of clickstream data):
- Cookies “blow”, in many ways
- This is different and I don’t like it
- Robots do weird stuff and mess things up
- I can do it better than vendor x/y, with five minutes of coding
- Nothing ties to my WebTrends and WebTrends is right
- People behave strangely and mess up numbers
- Everyone is out to get me
- I miss my mommy
Ok so maybe I was just making up the last one (but not the second from the last one).
Most of the reasons above are true and real, and there are many more that are not on the list (What about Google pre-fetching results? What about an entire sub-experience that is now Ajax, which most vendors can barely measure anything about? What about IE7 doing wacky stuff to third-party cookies? What about click fraud? What about the impact of CDNs?).
We should simply not expect that data quality will exist on the internet, at least for now. At least not like we are used to in traditional backend systems like ERP and CRM, where almost everything typically does tie, again and again. On the World Wide Web there is too much complexity (people, process, rapid change) and data collection is too primitive to ensure accurate data capture and data analysis.
Now I’ll be the first one to admit that your decision makers are not going to let you get by with my lofty proclamation that data quality sucks. And make no mistake, it will take time to convince them and win their trust that even though data quality is suboptimal we can still make great decisions.
Here is my recommended six-step plan:
Step 1: Resist the urge to dive deep into the data to find the root cause. IMHO this is a time-consuming and futile exercise, especially if you are off by less than 10%. By the time you do figure out some semblance of an explanation, there will be new reasons why the data won't tie (at least at a macro level).
Step 2: Assume a level of comfort with the data. When presented with data, understand clearly how it is collected and then encourage yourself, and your decision makers, to assume a level of comfort with the data. Say you look at something and decide you only trust it 80%. The other guy might say 75% and perhaps the nice lady might say only 70%. That is ok.
Human beings are complex, and each of us is a collection of our life experiences, so we will each decide differently. That is ok, but by trial and error arrive at a comfort level with the data. It is now your friend, different from Step One where it was out to get you fired.
Step 3: Start making decisions that you are comfortable with. This step takes the most courage; it is the biggest leap of faith. But trust me, after this it is all fun. I can look at your table of data and with just a 70% comfort level I can make business decisions.
Say important KPI ONE changed by X% or by YYY,YYY in absolute numbers. With 100% confidence in the data I would have spent $90k on the next campaign, or completely changed the site architecture, or built a new checkout process, etc. But with only 70% confidence in that KPI change I can still decide: I will spend only $60k, or I will run a multivariate test before I change the site architecture to get more confidence in the data, or I’ll still build a new checkout process, because checkout is so important that 70% confidence in the data is all I need.
This example is a simple illustration that it is possible to make decisions with less than 100% confidence in the data. Encourage that behavior. It is ok if you trust the data more than your decision makers do; they will come around with time. It is important that you model this behavior: if in your gut you find it hard to make this leap of faith, it will be monumentally harder for your decision makers to make it.
Step 4: Over time drill deeper in micro specific areas & learn more. I love data detective work, oh the joys of trawling through one terabyte of data looking for answers!! Once you get into the groove of making decisions, rather than being paralyzed by data quality not being good, I recommend that you find small narrow niches of data segments to drill into. The goal will be to understand why data for that narrow niche might not be what you expect.
So you could take all the traffic from a particular referring URL, or a particular search key phrase, or all your email marketing traffic in a week, or everyone who saw a particular page, or... for each one of us our poison will be different. By narrowing your focus you’ll reduce the number of distractions, increase the chances of isolating causality and start to better understand your complex website ecosystem.
Step 5: Get more comfortable with data and its limitations over time. As you understand data better over time (data collection, storage, manipulation, processing and analyzing) you'll make the appropriate adjustments in your interpretation and quest for Web Insights. This in turn will increase your comfort level with the data over time, from 70% to 73% to 78% to 81% and on and on. While you will perhaps never get to 100% confidence, you’ll start making significantly more confident decisions.
Aim for small increments of improvement in comfort and confidence levels for yourself, and reduce that increment by 50% for your decision makers (this is much harder for them).
Step 6: Consistency in calculations = Good. On the web: Absolute numbers rarely matter, segmented trends do. This is important to remember, especially if you find that you are unable to get beyond Step One. The quest to get an absolute number right is especially futile because of all the reasons discussed above. As long as you stay consistent and look at trends *and* important segments in those trends, you will reduce the chances of making suboptimal decisions, even if there is a small delta in data quality. (If you have not had a chance I recommend my post on segmentation.)
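To make the "trends, not absolute numbers" point concrete, here is a minimal Python sketch. The segment names and weekly visit counts are made up for illustration, and `pct_change` and `segment_trends` are hypothetical helpers, not part of any analytics tool. It assumes the simplest possible case: a tool that consistently misses the same 10% of visits in every segment and every week.

```python
def pct_change(previous, current):
    """Percentage change from one period to the next (previous must be non-zero)."""
    return (current - previous) / previous * 100.0


def segment_trends(weekly_visits_by_segment):
    """Week-over-week percentage change for each segment."""
    return {
        segment: [pct_change(a, b) for a, b in zip(counts, counts[1:])]
        for segment, counts in weekly_visits_by_segment.items()
    }


# Hypothetical "true" weekly visits per segment (illustrative numbers only).
true_visits = {
    "email": [1000, 1100, 1210],
    "search": [5000, 4500, 4950],
}

# The same weeks as seen by a tool that consistently misses 10% of visits.
undercounted_visits = {
    "email": [900, 990, 1089],
    "search": [4500, 4050, 4455],
}

true_trends = segment_trends(true_visits)
measured_trends = segment_trends(undercounted_visits)

# Print both sets of trends side by side, rounded for readability.
for segment in true_trends:
    print(
        segment,
        [round(t, 1) for t in true_trends[segment]],
        [round(t, 1) for t in measured_trends[segment]],
    )
```

Every absolute number in the second data set is off by 10%, yet the week-over-week trends per segment match. That only holds while the error stays consistent, which is the whole argument for keeping your calculations consistent.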
There are at least two cases where Data Quality is dicey, but deserves special love.
- Switching from one analytics tool to another: My recommendation is that, rather than reconcile, you run the two in parallel for four to eight weeks and simply benchmark the differences in key metrics between the two. Then create a multiplier and use that if you want to do historical trend comparison.
So say you are going to replace Omniture/WebTrends/HBX/CoreMetrics with ClickTracks/Google Analytics/WebTrends/Omniture. Run the two in parallel and note that, for example, Visitors from your old platform are always 15% greater than from the new one. Use that multiplier for old data trend comparisons.
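Here is a rough Python sketch of that benchmarking idea. All the visitor counts are invented for illustration, and `benchmark_multiplier` and `rescale_history` are hypothetical helper names. It also assumes one particular choice: averaging the weekly old-to-new ratios and then dividing the old tool's history by that multiplier to put it on the new tool's scale. Other reasonable choices exist.

```python
from statistics import mean


def benchmark_multiplier(old_tool_counts, new_tool_counts):
    """Average old/new ratio over the weeks both tools ran in parallel."""
    if not old_tool_counts or len(old_tool_counts) != len(new_tool_counts):
        raise ValueError("need the same non-zero number of weeks from both tools")
    ratios = [old / new for old, new in zip(old_tool_counts, new_tool_counts)]
    return mean(ratios)


def rescale_history(old_history, multiplier):
    """Express historical counts from the old tool on the new tool's scale."""
    return [count / multiplier for count in old_history]


# Hypothetical weekly Visitors during a four-week parallel run.
old_tool_visitors = [11500, 23000, 17250, 9200]
new_tool_visitors = [10000, 20000, 15000, 8000]

multiplier = benchmark_multiplier(old_tool_visitors, new_tool_visitors)

# Hypothetical older weeks that only the old tool recorded.
old_history = [13800, 12650]
comparable_history = rescale_history(old_history, multiplier)

print(round(multiplier, 2))
print([round(v) for v in comparable_history])
```

With these made-up numbers the old tool reads 15% higher every week, so the older weeks are scaled down accordingly before being compared with the new tool's trend. In real data the weekly ratios will wobble, and if they wobble a lot a single multiplier is probably too crude.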
Update: Here are two articles that you might find to be of value when dealing with data quality challenges…
Web Data Quality: A 6 Step Process To Evolve Your Mental Model