Data Mining And Predictive Analytics On Web Data Works? Nyet!

Strong Russian word: Nyet [No]. By the end of this post I hope you'll agree. Worst case you'll have food for thought.

This in-depth post covers a complex topic that might not apply to everyone, but it covers an area where companies have struggled to try to show return on the investments made in skills, technology and time. The post promises clarity and guidance that hopefully will result in you saving tons of aggravation and yes even a nice chunk of change.

Data Mining and Predictive Analytics have promised the earth, the moon and the sun for some time now, in all channels we do business in. My personal point of view is that on the web they fall far short of even the most pessimistic promises. For now.

As someone who has grown up in the world of traditional decision support systems (massively large data warehouses, business intelligence systems and tools, ERP & CRM systems) I have had the opportunity to be on both the marketing / business side and the development and implementation side of things.

There is nothing cooler than imagining all the wonderful things that will come if you simply move beyond reporting, and even analysis, to doing true data mining and predictive analytics. It is hard but can be rewarding.

Lots of consultants (yes I realize the irony here) will sell you this very effectively.

On pure web data though sadly it does not work.

Much as you might desire it, much as you might will it to happen, your traditional data mining efforts and the resources and $$$ spent on doing predictive analytics will yield very few and rare actionable insights. Most of the time it will prove to be a suboptimal use of time and energy.

[I can see the smart analysts amongst you get off your chair and mutter obscenities under your breath.]

There are a few very powerful, and non-obvious, elements working against you when it comes to finding exploitable trends and patterns in your web data, the kind that you are used to in offline and erp/crm type environments. Before you decide to pour $$$ and systems and people into your web predictive analytics efforts please consider the rest of this post.

I recently had the great opportunity to present at the bay area ACM Data Mining Special Interests Group. Here is the last slide of my presentation:

[Slide: Data mining and predictive analytics challenge]

The slide, to my mind, captures the essence of the challenge when it comes to doing Predictive Analytics with web data. Let me explain.

#1 Type of Data:

It is important to realize that web data for the most part is completely anonymous, usually incomplete and really really unstructured. When you want to do traditional data mining (and not just analysis) and predictive analytics all of these things are poison.

You are looking for larger complex trends and patterns in the data for people, products, outcomes, behavior over large enough periods of time so that you can find something insightful that can also be exploitable.

That is really hard to do when the core things you are relying on to capture data are anonymous cookies and javascript tags that can be very, shall we say, sensitive. And that's just the tip of the iceberg.

All this makes it much much harder to tie the behavior of people to the outcomes they might be driving (on any kind of website, ecommerce or not). Yes, if you capture login IDs, have connected them to an actual human's details from your offline system, and do this for every single person who visits, this problem eases a bit (the anonymity part), but most of it is still there.

#2 Number of Variables:

People behave in crazy ways offline, they have multiple touch points and don't use perfect names and addresses etc. All this is much more insane in the online world.

We have discussed on this blog how it is not an online world or an offline world but rather it is a nonline world! This means people flow between channels and touch points, and there could be an outcome (lead, purchase, problem resolution) at a completely different channel than where most of the interaction was. You can imagine how this will completely screw up your SAS or SPSS or Clementine or other home grown solutions.

Here is another thing that lots of us underestimate. It is easier to Mine and then Predict when there is a certain amount of non-siloed existence. On the web Google is competing with a guy and his pony putting together a new search engine. Not only are there pretty much no barriers to entry but it is easy for your customers to flirt with your competitors and for your competitors to react to you in a massively efficient manner.

So are three visits to purchase typical? (What about two visits to a store in between?) Is $15 off to people from Florida the best strategy? (What happens to that when your competitors run aggressive PPC?) Is "Tony" and all visits attributed to Tony really Tony? (What about cookies and my wife and I and Damini all surfing Amazon on the same login?)

And here is what happens: by the time you control for the variables you can count and account for (while throwing away all that you can't) you are literally left with a glass of water (and you started with an ocean full of water) and your ability to predict anything scalable for massively actionable insights is deeply limited. It is just a glass of water after all. :)

#3 Multiple Primary Purposes:

On the web this issue complicates things. We are trying to predict the outcome of our website, a complex being that exists to do lots (even things that your website was not created for).

So if it is unlike your other channels, where a visit and outcome are fairly easily identifiable at the highest level, then how do you Mine and Predict?

I have often stressed the importance of measuring Primary Purpose because of the power that comes from real understanding of why people visit the website. Two things connected to Primary Purpose mess up your Mining and Prediction efforts:

1) You don't know all of the primary purposes (click here to find out how you can find out).
2) It is incredibly difficult to take your massive collection of clicks and visits and then assign them into each primary purpose bucket and then predict on top of that.

3) See below.

#4 Multiple Visit Behavior:

This really screws things up. You can predict frames of mind (primary purpose) when you send people pieces of mail. You can predict what people want/think when they walk into your supermarket / store. You can make up even more examples of things we all analyze and Mine and Predict.

It is a pain to go to a store and then go there six more times. On the web this is trivial. Hardly any website converts in one visit.

It is also a pain to go to the store for every problem you have or every question you have. On the web this is trivial. You can have the same person come to your website as a different persona many times to solve a different issue.

The question as you get ready to analyze your multi terabyte database is: How can you isolate this behavior in your clicks? With how much confidence?

On paper it sounds easy but in practice it is incredibly hard to accommodate for multiple visit behavior, even if you have nixed the problem of collecting data accurately for each person and for each of their visits.
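To make the "isolate this behavior in your clicks" question concrete, here is a minimal sketch of counting how many visits a cookie made up to and including its first purchase. Everything in it is an assumption for illustration: the `(cookie_id, timestamp, event)` row shape, the event names, and the 30 minute inactivity cutoff used to split hits into visits (a common convention, not a rule). Note that it counts per cookie, not per person, so every caveat above about shared logins and deleted cookies still applies.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumption: 30 minutes without a hit ends a visit.
VISIT_TIMEOUT = timedelta(minutes=30)

def visits_until_first_purchase(hits):
    """hits: iterable of (cookie_id, timestamp, event) tuples (hypothetical shape).

    Returns {cookie_id: visit number in which the first purchase occurred}.
    Cookies that never purchase are left out of the result.
    """
    by_cookie = defaultdict(list)
    for cookie_id, ts, event in hits:
        by_cookie[cookie_id].append((ts, event))

    result = {}
    for cookie_id, rows in by_cookie.items():
        rows.sort()  # order each cookie's hits by time
        visit_no = 0
        last_ts = None
        for ts, event in rows:
            # A long enough gap (or the very first hit) starts a new visit.
            if last_ts is None or ts - last_ts > VISIT_TIMEOUT:
                visit_no += 1
            last_ts = ts
            if event == "purchase":
                result[cookie_id] = visit_no
                break
    return result

# Made-up sample data: cookie c1 buys on its third visit, c2 never buys.
sample_hits = [
    ("c1", datetime(2007, 5, 1, 9, 0), "view"),
    ("c1", datetime(2007, 5, 1, 9, 10), "view"),
    ("c1", datetime(2007, 5, 3, 20, 0), "view"),
    ("c1", datetime(2007, 5, 6, 8, 0), "purchase"),
    ("c2", datetime(2007, 5, 2, 12, 0), "view"),
]
visits = visits_until_first_purchase(sample_hits)
```

The code itself is short. The hard part is everything it silently assumes: that one cookie is one person, that every visit was tagged, and that the purchase happened on the website at all.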

#5 Missing Primary Keys, Data Silos, Lack of Holistic Datasets:

One way to get better at prediction is to take your data out of the web analytics silos and merge it with other sets of customer data in your company (stores and supermarkets, phone channels, others). If you knew all the customer touch points and had merged the data then it gets much much easier to understand current behavior and predict future behavior and outcomes.

This nirvana scenario is crushed by a couple of rather rotten tomatoes.

We are all familiar with untagged campaigns and pages. We also know that the url parameters don't always work in helping us collect data. The issue that causes more problems is the fact that most companies don't quite put in the forethought required to create the right "primary keys" that will allow data from different channels to be hooked up together.

There are even problems with names, addresses and phone numbers being collected and stored differently, which causes a data reconciliation nightmare and, more specific to this post, major challenges in analyzing outcomes.

For data mining and predictive analytics to yield positive ROI your company will have to put a lot of forethought into the process of data collection and storage across channels and in the deep bowels of your web / erp / crm systems. If that action item is not marked completed then it is optimal to focus on that first before cutting a chq for tools / people to do Mining and Predictions.
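As a rough illustration of the "primary key" problem, the sketch below normalizes a phone number so that web records and store records can be hooked up on it. The field names (`phone`, `visitor`, `customer`), the choice of phone number as the key, and the 10 digit North American format are all assumptions made up for this example; your own systems will need their own key and their own cleaning rules.

```python
import re

def normalize_phone(raw):
    """Reduce a phone number to 10 digits, or None if it cannot be trusted.

    Assumption: North American numbers, with an optional leading country code 1.
    """
    digits = re.sub(r"\D", "", raw or "")
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits if len(digits) == 10 else None

def match_channels(web_rows, store_rows):
    """Pair web rows with store rows that share a normalized phone number.

    Rows are plain dicts with a 'phone' field (hypothetical schema).
    Only unambiguous one-to-one matches are kept; the rest are returned
    as unmatched so you can see how much data the join throws away.
    """
    store_by_key = {}
    for row in store_rows:
        key = normalize_phone(row.get("phone"))
        if key:
            store_by_key.setdefault(key, []).append(row)

    matched, unmatched = [], []
    for row in web_rows:
        key = normalize_phone(row.get("phone"))
        candidates = store_by_key.get(key, []) if key else []
        if len(candidates) == 1:
            matched.append((row, candidates[0]))
        else:
            unmatched.append(row)
    return matched, unmatched

# Made-up records: the same customer stored two different ways.
web = [
    {"visitor": "v1", "phone": "(650) 555-0100"},
    {"visitor": "v2", "phone": None},
]
store = [{"customer": "A17", "phone": "1-650-555-0100"}]
matched, unmatched = match_channels(web, store)
```

Watching the size of `unmatched` is the useful part: it is a quick read on how much forethought went into the keys before any money is spent on mining.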

#6 Massive Pace of Change on the Web:

Sure Google, Yahoo, CNN, Craigslist, Amazon, eBay, New York Times are always going to be there. It might even seem like things never change.

Unfortunately for you and me the game is not quite the same. The web is constantly changing. The way people experience it, the way people compete, the way people read and recommend and buy, the way everything happens.

Doing mining and predictive analytics on past behavior requires a certain amount of "stability" about your future (customers, business, outcomes etc etc). But if the "environment" changes too much, or even enough, then your predictions on past behavior will have only tiny chances of success.
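One simple way to check the "stability" assumption before trusting a model is to compare a basic metric between the period the model was built on and a recent period. The sketch below uses a standard two-proportion z-test on conversion rate; the function name, the choice of conversion rate as the metric, and the sample numbers are assumptions for illustration only, and a large shift tells you that something changed, not what changed.

```python
import math

def conversion_rate_shift(conv_then, visits_then, conv_now, visits_now):
    """Compare conversion rates from two periods with a two-proportion z-test.

    Returns (rate_then, rate_now, z). As a rule of thumb, |z| above about 1.96
    suggests the difference is unlikely to be sampling noise alone.
    """
    rate_then = conv_then / visits_then
    rate_now = conv_now / visits_now
    pooled = (conv_then + conv_now) / (visits_then + visits_now)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / visits_then + 1 / visits_now))
    if std_err == 0:
        raise ValueError("no variation in the data; the test is undefined")
    z = (rate_now - rate_then) / std_err
    return rate_then, rate_now, z

# Made-up numbers: 2.0% conversion when the model was built, 1.2% last month.
rate_then, rate_now, z = conversion_rate_shift(200, 10_000, 120, 10_000)
```

If a check this crude already flags a shift, predictions built on the older period deserve a lot of suspicion.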

For now this is perhaps one of the biggest challenges to Analysts and Statisticians who are working hard to get some of the traditional mining and predictive algorithms to work on our web data.

The Wikipedia article on Predictive Analytics ends with this statement:

"Predictive analytics adds great value to a businesses decision making capabilities by allowing it to formulate smart policies on the basis of predictions of future outcomes. A broad range of tools and techniques are available for this type of analysis and their selection is determined by the analytical maturity of the firm as well as the specific requirements of the problem being solved."

I'll leave that thought with you and stress that you consider:

1] maturity of your firm

2] requirements of the problem you are solving

3] the six items mentioned in this post, and whether

4] you have fixed all the "low hanging fruit"

Ok now it's your turn.

What do you all think? Do you agree this is hard? Perhaps you have already subdued this tough problem? Perhaps there is a flaw in my hypothesis?

Please share your tips, tricks, war stories, critique, brickbats via comments.

[Like this post? For more posts like this please click here, if it might be of interest please check out my book: Web Analytics: An Hour A Day.]


  1. 1

    I predict you will get at least one additional comment.


    Yet another great article Avinash! If I had to disagree it would only be in the level of detail one chooses(?) to predict to. Which is possibly a/the implied key point in using data mining et al???

    In that, in my traditional role I am frequently forced to make accurate predictions out to 5 years based on here'n'now web data. How many people. How many are simultaneous. Pages. *Hits*!
    Namely: Capacity Planning; and the lesser known twin: Disaster Recovery.

    In that case though, one can pad a little extra for unforeseen growth. As the cost of not having enough capacity could be far more expensive to the org.
    Other solutions do now exist, but I'm straying too far here. :-)

    I guess what I'm saying, is that even with cruddy data, sometimes we still have to predict. Will we or not.

    "It's too hard. It's not possible." isn't an acceptable answer! :-D

    – Steve

  2. 2

    I would have blasted the writer had this article been written by a Web Analytics critic, but when you write it I am bound to give it some serious thought.

    Research from Fair Isaac had earlier shown click fraud to be much higher (around 15%) than what Yahoo or Google claim. Most people didn't agree with it, and later Fair Isaac itself backed off, saying it was done with limited data.

    Now again after reading this post from you, I think there are a few things on the web that have to be built completely new in the field of data mining and analytics.

    BUT I feel data challenges are there in any field, be it online or offline. The problems exist everywhere, but on the web the solutions are yet to come, unlike in offline business. So my take is that over time these issues will get sorted out and analytics will win here too.

    All the points you have shown are real challenges to all of us involved in this field. But equally important is to know how interested our clients' business managers are in what we are trying to tell them. They might not be interested in cleaning the data to the level you expect.

    This can be true as the website owner does not need to waste any resource on people visiting the site by chance or with no buying intentions. They don't need to prioritize whom to attend to and whom not to, unlike retail stores.

    Companies running online ads will be interested to know their conversion, but knowledge of relative conversion (keyword targeted vs site targeted; one keyword vs another) is fine. And finding anything relative is not very hard. In an environment of huge visit volumes, the time frame for one campaign can be fairly short.

    Over time the behavior changes. It changes quite frequently on the web, thanks to ongoing technological advances and web 2.0 products. This only calls for more active monitoring. It may be necessary to monitor weekly or daily, but that won't be too difficult either. It is just that making business managers understand this need is quite challenging.

    It is actually in the interest of all business houses not to use web analytics results themselves. It will be best if given to third party who can handle this and implement this with web marketing initiatives. This can probably save web analytics product cost too. And any web marketing company can fairly use one tool to analyse many web sites of different companies. Forming better understanding of the market.

    I should say, this is indeed a great post from you. Simply disturbing and challenging. Thank you so much for publishing it here.


  3. 3

    Hmmm… this is sure to bring some interesting discussion! Change the topic from "data mining and predictive analytics" to "web analytics" and reread the post. One could easily come to similar conclusions regarding web analytics and throw in the towel (of course, we won't!).

    In a way, predictive analytics attempts to automate what we all try to do with web analytics: improve our future based on what we know from our past. It has the same challenges and the same risks. It depends on the strength of our magnifying glass.

    But I ultimately agree: before jumping on the predictive analytics bandwagon, there's a hell of a lot of other easier things to do :)

  4. 4

    Hi Avinash,
    your posts are always interesting and this is the one I have wanted to read ever since I started reading you. See, I'm a master's graduate in Business Intelligence and I have done a lot of predictive analysis on all kinds of data, including web data. Many practitioners forget that data is never clean and perfect. I'm not saying all data that comes from the cash register or an e-commerce site is relevant, but I think you see the problem as bigger than it is. In BI we always say two things about data analysis: 1) data cleaning takes 80% of the time, analysing takes 20%; 2) be aware of GIGO (garbage in, garbage out). First, the unstructured data could be structured with categorical variables or statistical methods (means or frequencies). Second, to escape the GIGO problem, you need people like me ;).

    Moreover, the scientific literature provides a lot of articles on predictive models like survival analysis, logistic and linear regression, RFM (Recency, Frequency, Monetary value) and association rules (market basket analysis) based on web data. For example, Amazon is using association rules to promote offers related to your previous purchases.


  5. 5

    I agree, nice post Avinash.

    I am inclined to agree completely with you about the difficulties inherent in predictive analysis on the web. But I don't think the problems you outline here are exclusive to website analysis, they are found whenever you try and predict behavior – they are the bane of all marketing based analysis.

    Take advertising research. What is the optimal level of ad spend in which mediums? Your basic marketing mix question. You can throw all the relevant variables you can find into an equation and still have little predictive power as there is a level of complexity you can't collect data for – the actual decision process inside someone's head, the qualitative nature of an ad, environmental factors at POS. In some cases it's like trying to figure out how the world works by looking through a straw.

    The mistake a lot of people make is to say that this is all about 'better data'. It's not. There are some things you just can't know fully. Making decisions from incomplete information is the art of it all.

    "..give me the grace to accept the things I cannot measure, the courage to measure the things I can, and the wisdom to know the difference."

  6. 6

    In a complex enterprise with many CTPs and many different Customer segments, building the Customer Holistic view, is quite a challenge.

    Can the Customer Holistic view incorporate web data ? – Currently not.

    Great post.

  7. 7

    In some ways, you describe a difference between art and science.

    From a science standpoint, one cannot hope to do true data mining on web analytics data, due to all the reasons you outline in your post. My "science" training knows all the imperfections, quickly becoming paralyzed by the magnitude of imperfections.

    From an art standpoint, an almost infinite amount of information can be used in an appropriate manner. My marketing background knows there are a zillion things that can be done. The artist filters what can't be used from what can, and makes actionable decisions.

    In many ways, the web analytics community is being forced to re-learn what other analytics individuals had to learn over past decades. Retail analysts had to make decisions with imperfect data. Catalog analysts had to make decisions with imperfect data. Now, a new generation of analysts are forced to make decisions with data more flawed than ever before. The most revered analysts of the future will emerge from this learning process.

  8. 8

    Most of my work to date has been with data sets that are considered very reliable and have nothing to do with the web, but even these data sets are terribly messy. Data analysis on the web is going to be difficult until/unless it loses its anonymity, as your first point suggests.

    From there, the other things could be tackled and at least partially resolved, but the lack of solid differentiation between visitors seems like the real crux.

    Very interesting post.

  9. 9

    I see the challenges as far more around expectations. Like the rest of web analytics, it's really useful at addressing localized issues, so accept it for that and as Avinash loves to say, "get over it" for the rest.

    My goal for predictive analytics is to end up with testable hypotheses. (example: People who visit finance also go to sports. So should we promote sports or something else?)

    Once you have those and you follow-through with testing, you're bound to improve performance.

    So instead of focusing on what I can't get, just accept that anything that improves my visibility can be helpful.

    Avinash – are you feeling binary this month? :-) Predictive analytics sucks – or not? Hire a college newbie or experienced person? Maybe you're just trying to provoke responses. Well, guess it worked on me – I don't like to take forced choices.

  10. 10

    Actionable Insights is the key phrase here! With the best web analytics tools on the market we can already over deliver on insight, and with added multi-variate testing technology we can take it to another level before we even need to think predictive. Most organisations have yet to master a basic insight driven optimisation process and have not got the resources to deal with the top ten issues effectively, so there are many more miles left in web analytics. If no Action results from the insight then what was the point? Predictive analytics will only come of age when we have mastered taking action.

    That said in our experience there is some value in appending basic web analytic data to customer records – it can certainly improve direct mail campaign results!

    Many thanks for stimulating the grey cells!


  11. 11

    To quote JFK we build predictive analytic models to improve our websites "not because they are easy, but because they are hard, because that goal will serve to organize and measure the best of our energies and skills, because that challenge is one that we are willing to accept, one we are unwilling to postpone, and one which we intend to win" :-)

    If it were easy, everyone would do it. It is hard. Yet the data you have about your customers and their behavior (on web and off) is a critical asset your competitors do not have. Mining it for insights and turning those insights into predictions that can help you target and treat your customers is a must-do for corporate survival. We cannot do it perfectly and there are many problems (which you clearly articulate) but I think we must continue to try and continue to improve.

    Check out Best Buy or MyCoke Rewards – they are using predictive analytics and web analytics in combination and getting results.


    The EDM blog
    My ebizQ blog
    Author of Smart (Enough) Systems

  12. 12

    Steve : Good point. I agree with you on the need to predict traffic increases etc but those I would not put in the category of "predictive analytics". I was specifically referring to the type of analytics that falls in the area of: "take this terabyte and hack at it until you can predict why Steve bought something from us, what influenced him, and when will he buy next and what kind of promotion will work for him". Doing that, but at scale.

    It holds a lot of promise for us "data whores". But on the web I don't think, for now, that it is possible to deliver on that on-paper promise.

    JS : My thought, slightly controversial I admit, is that for now we are not even at a stage where we need to worry about data being dirty (and it is dirty!). It is more that we have other fundamental problems that are working against us. Until some of those go away and we have some semblance of "stability" we are not going to get much for our efforts.

    I am confident that this will change. I am confident that we'll get to a point when it will be worth our while to make the multi-million dollar investments in software, hardware and people.

    Paul : You are Comment Prince for the day for your absolutely brilliant quote. I'll repeat it because I loved it so much: “..give me the grace to accept the things I cannot measure, the courage to measure the things I can, and the wisdom to know the difference.”

    Kevin : It is so kind of you to share your feedback, especially because you have had such amazing success doing this for so many companies. I highly recommend your blog, Mine That Data, to anyone who wants to learn from your stories from experience at Nordstrom, Eddie Bauer and Lands' End.

    Chuck : My hope is to provide food for thought before someone decides to plunk down $1 million or more in software, hardware and skills based on the promise of Predictive Analytics. I find many amongst us don't make as informed a decision as we possibly could. It is a complex decision to make, with many variables, and this post would be just one of the variables I encourage you to add to that decision making process.

    As regards the "binary"…. my hope, indeed one of the "brand values", is to provide context and cover an issue in a fair and balanced way (not the Fox News kind!) but in the end to explicitly take a position and share my point of view. I respect the intelligence of the readers to make up their own minds despite my own POV (and the position provides people something to start from!).

    Matthew : 100% agreed. Aggregated web data integrated into offline channels can provide great insights, even from simple analysis, and there is much to be gained from tying records of your purchasers (the 1% conversion rate) with offline records.

    James : Love the riff on the JFK quote. I did a presentation today to the Bay Area Computer Human Interaction group and I had a slide with the picture of JFK with the quote: "Ask not what analytics can do for you, ask what you can do for analytics…" :)

    I have to admit that I read both the posts you reference in your email and I am unsure of how they are examples of "predictive analytics" on top of web data (integrated with enterprise data or otherwise). For example, while the mycoke example is cool, it simply looks like capture of personal information of customers and then, based on the data they give you, the company will send them personalized offers. Undoubtedly a great program, but not a use of data mining and predictive analytics in the true sense of the word. Ditto for Best Buy and the MyRewards program.

    My post was attempting to cover any website on the planet and taking millions of clicks and visits data, mostly anonymous if not all of it, and then trying to do predictive analytics on top of that. I realize that some of the people on those sites will buy and in that way share their data and that is of some value. But traditional predictive analytics as used in offline worlds is not the most optimal investment. For now. Things will change in due course.

    Everyone : A couple of themes emerge from the barrage of emails I have exchanged today (totally surprising for such a complex topic and a 1,900 word post!):

      1) Someone said: "If you do this then start small, see what you find, if it is actionable then build on it some more and then check again for actionability. If you don't find actionability then scrap it and do other things."

      2) There is a ton of value in taking aggregated data from the web for Visitors that you do know and merging it with other company data. Visitors you know would be those that purchased at your website or logged in or created their customer record with you. This will be a small percent of your visitors but key learnings can be had.

      3) Sometimes "smart enough" is plenty good. (Inspired from James's book!).

    Thanks for your comments, I feel privileged to be able to have such an intelligent conversation with you all on such a complex topic.


  13. 13

    This post made me laugh out loud, as I realised that there must be lots of people not just me, asking you questions about this topic (I am actually feeling quite sick though today).

    There is a dichotomy between touchy-feely marketing (eg a top mobile phone provider going to a furniture designer to design a handset to get the result consumers want) and the more scientific side of marketing: web analytics / analyse, test, optimise. Datamining/predictive analytics appear to be the next evolutionary stage for many in web analytics. But lots of imperfect/flawed data results in completely imperfect predictions. Even when/if in the future all the data is near perfect, datamining may not even be the perfect solution due to the nature of the competitor playing field (google vs one man and his pony, as you noted).

    Also in the FT they were writing about how all the top financial institutions' predictive models for assessing risk have all gone completely out of the window because market conditions/their losses in hedge funds are 25 standard deviations away from the norm which is a one in a million chance (not exactly what it said but something like that). More my point being that market conditions are key to a predictive/statistical model and if we did rely on them without fully understanding how wrong they can be, there can be very serious consequences and complications – eg current financial institution situation.

    Thanks for a very thoughtful and timely post!

    Marianina, Web Analytics Princess

  14. 14

    Being an "Analyst" for the better part of my career, this was a very interesting article to read. One thing I will say for you — you surely know how to provoke thoughts and discussions :-).

    Overall, I would agree with you — for most part :-). I definitely agree that one should not pump $$$ or jump into Mining or Predictive Analytics without getting their act together on the foundational aspects of Webanalytics. There are many issues to be taken care of before one goes digging into their Data Warehouse for 'Insights'. Some of the key things being (and you mention those in your blog):

    a) do you have an identifier that places the visitor and can be used as a key to link back to other information in your data warehouse?
    b) is that identifier fundamental enough? You can pass an identifier (say some form of visitor id) through your tags, but then the data from other channels might be using another identifier (say account number) — and the relationship between them could be 1 to many or many to many, which would preclude you from exactly identifying who it was that visited your site and performed a certain transaction.
    c) is the data collection complete? Essentially, what are you tagging and at what level.
    d) do you have access to your data? This is more for those who are using vendors to host the data. Predictive analytics and modeling will benefit if you have constant and continuous access to your data.

    Where I slightly disagree is with the picture that predictive analytics is not yet there for the web arena. I think [and this is my purely biased and opinionated thought :-)] predictive analysis (PA) can still be useful; it all depends on the level at which, and the purpose for which, one is using it. For sure, PA in the online world cannot compare with the complexity or depth with which we can use it on our offline non-web data. However, to take your example where "…people flow between channels and touch points and there could be an outcome at a completely different channel than where most of the interaction was..", I think PA can still be used to see if interaction in one channel is resulting in a transaction in another (an example I can think of is a retailer who looks at online behavior to predict their store sales).

    I also agree that "Tony" and all visits attributed to Tony might not be really Tony. But it would be a great first step to have some prediction or patterns emerging for the 'household' or 'environment' where Tony is. This brings me to my other point — I think that more than in the offline world, in the online world it is very very critical for the analyst to make some key assumptions around which his analysis is going to revolve and what it should be (and should not be) used for.

    And lastly, I again agree with the complexity in isolating various behaviors using clicks. But again, we might learn something by going in the reverse direction — instead of trying to predict what a person is trying to do, why not 'mine' the same behavior across multiple visitors and see if there is something you can improve on your website to increase the conversions on that behavior.

    Again, these are my loose thoughts…Overall, I think this is a great thought provoking post and a good read for anyone in the WA community.

    And Paul, I too loved your quote on "….the wisdom to know the difference.." :-))

  15. 15

    One of the things I constantly struggle with as a usability specialist is trying to predict what people want when it's currently not being presented to them as an option. Analytics can really only tell you how people react to what you're already doing. You can test options, track trends, and project those trends into the future, but the one thing the data never tells you is how people would react to something you haven't done.

  16. 16

    Dr. Pete : Your comment does go to the heart of the issue.

    One of the fallacies of predictive analytics (and to some extent all analytics) is that if only we had enough data about our customers (or the world) then we could "predict" anything or create profitable scenarios. Unfortunately that fails exactly at the point you mention.

    My best bet at solving for that is to do "site visits" / "follow me homes" of current and potential customers. Effectively "watching them in their native environment without disturbing them too much".

    That observational power, done right, yields lots of insights and problems that the customers have. They will rarely give you solutions, just problems. That's ok because it is our job to think of innovative solutions and the customer's job to give us problems. That's how money is made. :)

    As a complement to analytics, qualitative methodologies like lab usability, testing and experimentation, follow me homes, etc. are a great way to build a deep understanding of your customers.

    I cover this a lot more in my book but I should write about this in the blog as well.

    Thanks so much for the comment.


  17. 17

    Avinash, great post!!! In my opinion the problem is not in data mining and predictive analysis but in:

    1- The information is stored in isolated databases and in heterogeneous formats, making it very difficult to join them.

    2- The phrase "You can track all the online information" (which everybody uses all the time) is true, but that doesn't mean it is a simple job; actually it is extremely difficult. Web analytics grew up with this belief, and it has left web analytics underdeveloped.

    3- Inferring on top of another inference.

    If you develop a scenario without even a minimum of certainty, are you going to be confident about the result? Well, at least I am not.

    I think our main goal must be focusing on moving from isolated and heterogeneous information to centralized and homogeneous one.

    So in answer to your "Nyet" I would say "Зависит" (depends) ;-)

  18. 18

    The best use I have ever seen of predictive analytics in the web sphere was determining the points at which a buy/no buy decision was potentially being made by the site viewer. Once this had been modelled, and real time data flowed through the model, an applet attached to a live Customer Services Rep would pop up and try to "drag the customer over the line…"

    None of this would have been possible without the model output from the predictive analytics.

  19. 19

    I read your debate with interest. As a lay marketing person helping developing countries to target international companies that have a high probability of expanding their operations beyond their home country (foreign direct investment), I have a tough challenge. Can I use data mining techniques and predictive analytics – using published 'hard' financial data and 'soft' web information – to assess the probability of a company being ready to consider cross-border growth? If so, are there any off-the-shelf software packages I can use? Any views/advice would be very helpful indeed.

  20. 20

    After reading this great article, I was wondering if a GREAT web analyst is one who clearly understands data mining.

  21. 21

    Winn : I am not sure that a Great web analyst needs to absolutely understand data mining. Certainly it is a helpful skill.

    I would offer this as a "check list". . . .

    Top Ten: Signs You Are A Great Analyst

    There are some interesting thoughts in comments on that blog as well.

    Hope this helps a little bit.


  22. 22

    I want to know how to apply statistical tools like data mining, statistical significance, correlation, etc. to SEO.

    I need numerical examples, because only then can I understand how to use the Google Analytics data and how to evaluate the problems.

    Is there any tool to compute these?


  23. 23

    Points well taken; still, there are instances where web mining works well in real life.

    Anonymity is an issue; here are some common examples on how web analysts work around the anonymity problem:

    a) On a commercial site, web mining can help optimize campaign design/messaging/placement to increase click through rates. Multivariate testing can help optimize many alternative campaign options for completely anonymous users.
    b) Tracking anonymous but unique visitors over time can help identify optimal behavior patterns, allowing personalized online messages that target anonymous (but uniquely identified) visitors.
    c) Authenticated online purchasers or subscribers are not anonymous. Such groups offer great cross/up sell opportunities via web data mining.

    Web mining makes the need for holistic data even more critical.
    For CRM, holistic data is needed. Detailed web data is an important component for understanding customer behavior across all touch points. This is easier said than done, but data integration is a tough problem even among offline sources.

  24. 24

    I can't contradict even a single detail in this wonderful and very informative post. Avinash is an expert in this industry and he continues to post useful articles for his readers. Now I completely understand so-called data mining.

  25. 25
    Colette says

    From your post, you are taking a classical approach to using DM techniques… but there are many new ways to use DM and predictive modeling on web data… ;-)

    What's more, I'm sure that we are starting a new stage in powerful DM applications, thanks to the amount of web data available.

    But the key is not DM or Predictive Analytics or AI; the question is… for what? To understand the past, to reduce dimensionality, to predict some KPIs or future behavior? …

    How about using consumer product ratings with purchase propensity or sales projections?

    How about real-time intelligence engines for sales?

    How about using it, in the classical way, for inference?

    I don't try to view the world under labels such as 'web analyst' or 'data miner' or 'crm analyst'…

    I prefer to understand the techniques, the business needs and the limitations, and choose the best way forward.

    Good post. Thanks

  26. 26

    As a statistician, I am inclined to see value in Controlled Studies. By this, I mean the kinds of studies done to see if new drugs are safe and effective, if a particular course curriculum can improve test scores (whatever that means, whatever that indicates), etc.

    Controlled studies are good because they use a well established and very carefully reviewed process for accounting for all the myriad reasons resulting in a desired response other than the one in which you are interested. The irrelevant factors are removed and only the real results of your program get measured.

    Suppose you want to know, using Avinash's example above, if $15 off to the people from Florida does any good. There are a million different reasons why people buy something – you only want to know if your 15 bucks to these people made any difference. (Of course, they really aren't people and we should all be honest about that: they are unique IP addresses, which is another thing altogether!) Statistically controlled studies get the real answers and they really work.

    Problem is, as far as I can tell, nobody is doing studies comparing a Treated Group with a Control Group. I've never even seen the requisite infrastructure attempted. At least around here, that's about to change.

    David Corliss
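
    As a rough illustration of the treated versus control comparison described in the comment above, here is a minimal sketch of a two-proportion z-test in Python. The function name, the visitor counts and the order counts are all made up for illustration; the comment does not specify a particular test, and a real study would also need proper random assignment to the two groups.

```python
import math


def two_proportion_z_test(conv_treated, n_treated, conv_control, n_control):
    """Compare conversion rates of a treated group and a control group.

    Returns (lift, z, p_value), where lift is the absolute difference in
    conversion rate and p_value is two-sided under a normal approximation.
    """
    p_treated = conv_treated / n_treated
    p_control = conv_control / n_control
    # Pooled rate under the null hypothesis that the offer changes nothing.
    pooled = (conv_treated + conv_control) / (n_treated + n_control)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_treated + 1 / n_control))
    if se == 0:
        raise ValueError("no variation in outcomes; the test is undefined")
    z = (p_treated - p_control) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_treated - p_control, z, p_value


if __name__ == "__main__":
    # Hypothetical numbers: 5,000 Florida visitors saw the $15 offer,
    # 5,000 comparable visitors did not.
    lift, z, p = two_proportion_z_test(260, 5000, 200, 5000)
    print(f"lift={lift:.4f} z={z:.2f} p={p:.4f}")
```

    A small p-value here only says the difference is unlikely to be noise; whether the extra orders pay for the $15 discount is a separate business calculation.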

  27. 27

    Your points are well taken; I would like to see an updated post considering the original was from 2007…!

    Particularly interesting is your #6 – "Massive Pace of Change on the Web". In my experience, this is where predictive algorithms combined with behavioral targeting succeed on the web. Specifically – assume you want to improve the relevancy of homepage promotions and make sure each promotion resonates with each particular user. KPI = CTR. You develop the creative offers to be included and deploy predictive behavioral targeting javascript to your homepage. As more and more data accumulates, the algorithms develop evolutionary models that begin to associate content with each visitor. As visitor characteristics change, so too does the content that is delivered.

    Granted, massive pace and change and the ability to predict outcomes for and are radically different, but there are automated methods available that are effective in improving relevancy and CTR.


    Website Optimization Manager – Website Testing, Personalization and Behavioral Targeting
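
    To make the idea of automatically shifting homepage promotions toward higher CTR a little more concrete, here is a minimal sketch using an epsilon-greedy bandit. This is a swapped-in, much simpler technique than the "evolutionary models" mentioned in the comment above: it ignores visitor characteristics entirely and only learns which promotion gets clicked most overall. The class name, promotion names and click-through rates are hypothetical.

```python
import random


class EpsilonGreedyPromoPicker:
    """Pick a homepage promotion, mostly the best one seen so far."""

    def __init__(self, promo_ids, epsilon=0.1, rng=None):
        self.epsilon = epsilon  # share of traffic used for exploration
        self.rng = rng or random.Random()
        self.shows = {p: 0 for p in promo_ids}
        self.clicks = {p: 0 for p in promo_ids}

    def ctr(self, promo_id):
        """Observed click-through rate so far (0.0 if never shown)."""
        shows = self.shows[promo_id]
        return self.clicks[promo_id] / shows if shows else 0.0

    def choose(self):
        # Show every promotion at least once before comparing rates.
        untried = [p for p, n in self.shows.items() if n == 0]
        if untried:
            return untried[0]
        # Explore: occasionally show a random promotion.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.shows))
        # Exploit: otherwise show the promotion with the best observed CTR.
        return max(self.shows, key=self.ctr)

    def record(self, promo_id, clicked):
        self.shows[promo_id] += 1
        if clicked:
            self.clicks[promo_id] += 1


def simulate(true_ctr, rounds=20000, seed=7):
    """Simulate visitors whose click probability per promotion is true_ctr."""
    rng = random.Random(seed)
    picker = EpsilonGreedyPromoPicker(list(true_ctr), epsilon=0.1, rng=rng)
    for _ in range(rounds):
        promo = picker.choose()
        picker.record(promo, rng.random() < true_ctr[promo])
    return picker


if __name__ == "__main__":
    # Made-up "true" click-through rates, unknown to the picker.
    picker = simulate(
        {"free_shipping": 0.02, "ten_percent_off": 0.05, "bundle_deal": 0.15}
    )
    for promo, shows in picker.shows.items():
        print(promo, shows, round(picker.ctr(promo), 3))
```

    Because a fixed share of traffic keeps exploring, the picker can notice when a previously weak promotion starts performing better, which is one simple way to cope with a fast pace of change.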

  28. 28
    banyan08 says

    Data Mining and Predictive Analytics on Web Data is a very useful article. The information about traditional decision support systems (massively large data warehouses, business intelligence systems and tools, ERP & CRM systems), and about being on both the marketing / business side and the development and implementation side of things, is very useful. Predictive analytics procedures are to be really appreciated. After reading the full article, I found it very informative about data mining.

  29. 29

    I have been waiting for this article, and it sort of disappoints me (obviously I am not alone in this).

    I agree with you that starting with predictive analytics on web data is not what one should do, due to all the reasons you described so eloquently. It must be something like searching for a very small needle in a huge haystack.

    But I do strongly believe in data mining/predictive analytics on the customer database and afterwards enhance those predictive models with click data (aggregated, categorized) from identified (logged in) customers.

    Like playing poker, but with marked cards. :-)

  30. 30

    Zyxo: The launch of Google Instant yesterday shows that some things we did not think possible two years ago are possible now. So there is hope, lots of it.

    In the context of Web Analytics since this post was published (Sep 2007) not a lot has changed when it comes to addressing the core problems highlighted in the blog post. In fact I would argue things have gotten even worse in context of the social web and data fragmentation (Page 242 if you have Web Analytics 2.0).

    But I remain hopeful that things will evolve. There is a lot of power in all this data we collect to improve the relevancy of the web experience we create for our customers.

    We need to make sure we are not being duped by over-aggressive consultants / vendors who sell the hype of predictive web analytics (in context of site analytics) only to deliver multi-year projects that enrich them but fail to deliver any value to the client (because of the reasons outlined in the post).

    More to come! :)


  31. 31

    Hi Avinash,

    I would like to know whether you still think it's possible or not.


    • 32

      Akhil: If you see my latest blog post on Universal Analytics Strategy and Tactics, you'll see that we are getting better and better at collecting data that can be tied to each person and across multiple devices.

      This sets a foundation in place to allow us to do better data mining and predictive analytics.

      The secret will be our ability to get unified access across the physical and digital (something I also talk about in the above post).

      We have also seen insights being made available proactively using data mining via solutions like Intelligence Alerts in Google Analytics.

      These are early signs of something more practical (rather than hot air, empty promises or big talk from big data :)) in the market.

      We are not quite there yet, but amazing developments in the right direction.



  1. […] The right answer, as always, is that the amount depends entirely on the return you're generating from your site. Avinash Kaushik refers to connecting your website and your business as "the nonline world." I like using the "phone bill" model: If your website isn't as valuable to your business as your phone, you're probably not getting the right return yet. But don't assume your phone is more valuable just because your customers call you. They may have gotten the phone number from your website. Or from Google, after it crawled your site. Your website and your phone complement one another, or should in a well-designed small business website. […]

  2. […] Avinash Kaushik has a great blog post about trying to do predictive analytics with web data: "Data Mining And Predictive Analytics on Web Data Works? Nyet!" Avinash is one of the truly brilliant minds in web analytics, so it's great to see him put his brainpower behind explaining this assertion. And, it's timely, in light of the new book by Ian Ayres, a Yale Law School professor and econometrician. I really need to order the book and read it, as I've got preconceived notions based on watching an interview with Ayres. […]

  3. […]
    But predictive analytics continues to be the hot topic. That's because predicting user behavior is every marketer's (and business owner's) dream. Imagine using your data to forecast future visitor behavior and market trends, rather than just analyze the past… Ah… Wouldn't that be grand. But even with our data collection becoming ever more sophisticated, there are so many variables and uncertainties, that it may not be possible at all. Given the history of analytics over the last 17+ years, I prefer to think that it's simply not possible yet.
