The Great Web Data Capture Debate: Web Logs or JavaScript Tags?

There are many different options at our disposal when it comes to collecting web clickstream data: web logs, web beacons, JavaScript tags, and packet sniffers. Each methodology comes with its own unique set of benefits and challenges.

[Read this entry in Wikipedia for the pros and cons of web logs and JavaScript tags; Dr. Stephen Turner has done a great job there. Aurélie's post and Juan's post have great insights into packet sniffing as a source of clickstream data.]

But if one takes a quick pulse of practitioner conversations around data capture, it becomes clear very quickly that the largest number of current implementations (by sheer volume) use either web logs (usually due to history) or JavaScript tags (usually because most vendors have recently abandoned all other methods in favor of this one).

At the secondary level, the pulse is around people debating which of these two methodologies is “better” and hence which one they should be using. There are lots of conversations that outline the benefits of one methodology or the other. There are even more technically nuanced, geeky conversations in which one camp bashes the other.

What is missing is someone risking their neck and going out on a limb to make one recommendation when it comes to choosing web logs or JavaScript tags (assuming that you have ruled out the others). Never one to miss an opportunity to take an unnecessary risk, I'll go ahead and make a recommendation:

    You should use JavaScript tags as your weapon of choice when it comes to collecting data from your website.

The only assumption is that you don't have a website so amazingly unique that there is no other website or web serving platform on the planet like yours.

Here are four important reasons for picking a side (picking sides has not hurt Fox News, and I am hoping it won't come back to bite me either; their slogan is: We Report. You Decide):

Separating Data Serving & Data Capture (gaining efficiency and speed):

    With web logs, data serving (web pages with data going out from your web servers upon user requests) is tied completely to data capture (as the web pages go out, the server logs information about that in web log files). Every time you want a new piece of data you are tied to your IT organization and their ability and structure to respond to you. In most companies this is not a rapid-response process.

    With JavaScript tags, data capture is separate from data serving. Web pages can go out from anywhere (from the company web server, from the visitor's local cache, or from an Akamai-type or ISP cache farm) and you will still collect data (page loads, the JavaScript tag executes, data goes to a server – ASP or in-house).

    The beauty of this is that the company IT department and website developers can do what they are supposed to do (serve pages) and the "Analytics department" can do what it is supposed to do (capture data). It also means that both parties gain flexibility in their own jobs; speaking selfishly, this means the analytics gals/guys can independently enhance the data collection code (which does not always have to be updated in the tags on the page) to collect more data faster.

    The reliance on IT will not go down to 0% – it will end up at around 25% – but it is not 100%, and that in and of itself opens up so many options when it comes to data capture and processing.
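To make the separation concrete, here is a minimal sketch of what a JavaScript page tag does. The collection endpoint `collect.example.com` is made up for illustration; real vendor tags are considerably more elaborate.

```javascript
// Minimal sketch of a JavaScript page tag. buildQuery serializes the
// collected facts into a beacon query string; sendBeacon fires a 1x1 image
// request to a hypothetical collection endpoint. The browser-only part is
// guarded so the serialization logic also runs outside a browser.
function buildQuery(data) {
  return Object.keys(data)
    .map(function (k) { return k + '=' + encodeURIComponent(data[k]); })
    .join('&');
}

function sendBeacon(endpoint, data) {
  // The request is issued by the visitor's browser, so it fires no matter
  // where the page itself came from (origin server, local cache, CDN edge).
  var img = new Image(1, 1);
  img.src = endpoint + '?' + buildQuery(data);
}

if (typeof document !== 'undefined') {
  sendBeacon('https://collect.example.com/beacon.gif', {
    page: document.location.pathname,
    title: document.title,
    referrer: document.referrer
  });
}
```

Because the beacon is issued by the visitor's browser when the page renders, data is captured even when the page itself came from a cache or CDN edge rather than your own web server.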

Type and Size of Data:

    Web logs were built for, and exist to collect, server activity, not business data. Over time we have enhanced them to collect more and more data and store it with some semblance of sanity to meet the needs of business decision makers. They still collect all the technical data as well as the business data (often from the multiple web servers that support a single website, each of which has its own log file that then needs to be "stitched back" together to give the complete view of each user).

    JavaScript tags were developed to collect clickstream data for business analysis. As such, they are much more focused about what they do and collect only the data they need (though admittedly not all the JavaScript tags running around are smart, and some do collect unnecessary data). What this means is that with JavaScript tags you have a much smaller amount of data to capture, store, and process each night (or minute, or hour, or day), and it can be a much saner existence (logically, operationally, and strategically).


Innovation:

    For better or for worse, most vendors are moving away from supporting versions of their products that use web logs as a source of data. Many only offer JavaScript tag (or packet sniffer) versions of their products. History will decide if this was a good thing, but the practical implication is that most of the innovation – in sophistication of data capture, new ways of reporting or analyzing data, meeting the needs of Web 2.0 experiences, and so on – is happening in the JavaScript data capture environments.

    This presents us with a stark choice: build and own our own company-specific, customized means of capturing this new data and keeping pace with other innovations, or rely on the expertise that is out there (regardless of which vendor you prefer) and keep pace with all the innovation.

    Often this is an easy choice for any company that considers its core competency to be its business and not developing web analytics solutions (though admittedly if you are Wal-Mart you can absolutely do it yourself – for example, they have invented their own database solution since nothing in the world can meet their size and scale).


Integration:

    Increasingly we are heading towards doing a lot more measurement and customer experience analysis beyond just clickstream. Two great examples of this are experimentation and testing (especially multivariate testing) and personalization / behavioral targeting. In both cases "add-on" solutions are tacked on to the website and testing / targeting happens. Often these solutions come with their own methods of collecting and analyzing data and measuring success.

    But as we head for an integrated, end-to-end view of customer behavior, for optimal analysis we have to find ways of integrating data from these add-ons into the standard clickstream data (else you are optimizing for each add-on in isolation, which is not a great thing).

    Integrating with these add-on solutions (which often also use JavaScript tags, cookies, and URL identifiers) is significantly easier if you use JavaScript tags. It is possible to read cookies in web logs and the like, but the pace and ease with which you can integrate are greater if you are using JavaScript tags.
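As a sketch of why this integration is easier with tags: both the analytics tag and the add-on run in the visitor's browser, so they can share an identifier through a first-party cookie. The cookie name and fallback scheme below are hypothetical.

```javascript
// Sketch: stitching a testing/targeting add-on to the main analytics tag
// by sharing a visitor id through a first-party cookie.
function readCookie(name, cookieString) {
  // Prepend '; ' so the first cookie is matched the same way as the rest.
  var match = ('; ' + cookieString).match('; ' + name + '=([^;]*)');
  return match ? decodeURIComponent(match[1]) : null;
}

function visitorId(cookieString) {
  // Reuse the id the main tag already set, so the add-on's data can be
  // joined to the clickstream later; fall back to a fresh random id.
  return readCookie('visitor_id', cookieString) ||
    'v-' + Math.random().toString(36).slice(2, 10);
}
```

In a browser the add-on would call `visitorId(document.cookie)` and send that id along with its own data, giving the analyst a join key between the two datasets.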

It is important to point out that you should consider your choice in the context of your own unique needs. Please read carefully for the detailed pros and cons of each data capture methodology (JavaScript tagging does have important cons that need to be considered carefully, and web logs have their benefits too, including obvious ones such as being the only place you'll find search robot data).

In the end, though, if you have to make a choice between web logs and JavaScript tags: 1) if there are "advanced non-standard" considerations you need to think through, they are above; 2) if you want someone else to make the choice for you, the recommendation is above.

If you love web logs, what's wrong with the recommendation above? If you swear by packet sniffing, is it superior to tags in the four ways outlined above? If you can't live without JavaScript tags, what else is missing above? If you are confused and this helped, please share that via comments!



  1.

    It Depends. Would be my trite response. :-)

    Steve's First Rule of Analytics: Know Thy Audience!


    For *ME*, JS is pretty much a waste of time and effort. Something like 50% of my current users these days either can't, won't or don't run JS.
    Clickstream is all well and good, but if I'm ignoring 50% of my user base am I better off? No.

    Aside: My understanding was that JS Page Tagging was to get around caching issues. Not specifically clickstream – that's more the result?

    The other thing I've noticed with GA in particular (may not apply to all JS page tagging) is that if you get it wrong up front, the data gathered is effectively useless. I am always revisiting my logs – going back to 2002 – to extract long-term trending info, reports for partner sites and so on.
    As new ideas come to the fore, I can reapply them and see how they worked with what we thought then and now. Hindsight.

    IMHO the big danger here is that once you set your mind in stone on using X, then you have exposed yourself to missing out on understanding part of your audience. Which is why I *analyse* my site via logs, but still run GA as a cross check, verify etc.

    BTW. I really disagree with the separation of "business" vs "server" data. The servers support the business. It's *all* business data. Details vs Overview Summaries. Both have their very important place. IT (Should!) supports the business. If they don't, then IMNSHO, they are failing the business.

    Data is data. It's not the collecting or type of data that matters. It's how you use it. Everything else is a distraction from what really counts.

    Final thought: Not to have a go at Debbie@Maxamine ;-) but any data collection method that then needs a funky verification tool is fundamentally broken, IMNSHO. Here's some pretty graphs, and we'll use them to make or break this business, but we have no rigour in their reliability. Ouch. Post-filtering of data is one thing; not getting the data in the first place is something else.

    my 2c, as ever
    – Steve

  2.

    Steve: I try really hard on this blog not to regress to the mean. It is easy to do, especially as you grow and diversity increases. In this case though the pull of the mean was too strong. :)

    Let me answer your comment in this way: if I was lucky enough to have you as my "web server – data collection – process – and – analysis wiz kid", I would trust you to capture clickstream data with a pen and paper.

    I would ditch all technology and trust you when you tell me that you will capture data with a piece of paper and a pen. I mean that in all seriousness.

    You are very very good at what you do and exceptionally talented. I am not, and I suspect that the "mean" is not either.

    Think of this post and recommendation as something for the rest of us! :)

    Thanks so much for the comment and the thoughts. They are always welcome.


  3.

    Great post, Avinash. I've always believed that the choice is much less important than people think. There are advantages of Javascript tags (e.g., avoids caching problems, easier to collect additional variables) and advantages of logfiles (e.g., contains search engine robot activity, absence of vendor lock-in), and we can argue (or worry) about them all day. But in the end, a good analyst can get excellent actionable data from either technology. When you start segmenting your data, which is really at the heart of web analytics, then you're comparing the relative numbers from two groups of visitors, and whether you're 10% above or 10% below some unknowable "true" figure seems much less important.

  4.

    Hi Avinash, great read as usual :D

    Regarding the comments you made on the pros and cons of JavaScript and the implied tagging implementation requirements and constraints, I'd like to point to a recent article by Bruce Tate, CTO of GoodWell, which he recently posted on the IBL community site.

    In a nutshell, he describes how it is increasingly difficult to ignore or rule out Javascript in this Web 2.0 age.

    Basically, if you're going to track user behavior using Javascript, you might as well leverage that implementation to track your overall Web traffic ;)

    Here is the link to the article



  5.

    but what about the combination of the two?
    How do you track your downloads without using logfiles?

  6.


    When the size of your log file is 1 GB per day, tell me how you will be able to calculate (in a finite time) clickstream going back to 2002?

    Also, my web site structure happens to change over time, making clickstream data analysis difficult.

    I still believe the world is made of shades of gray, not black or white.

  7.

    SEO Portal: It is important to stress that your unique needs might mean a different solution, that is perfectly ok.

    As regards to combo solutions, if there are resources and tools that make it easy for you to do it then that's absolutely great. Usually though with finite resources (people and $$$) and a need for speed it might make sense to simplify and make a single choice.

    You are right on downloads (Dr. Turner also rightly added robots and vendor independence as benefits of logs); they will only be in logs, though there is very limited data you can get about downloads from a log file. Many companies are starting to use Akamai Download Manager type apps, in which case there is deep and rich information about the downloads that one can get from there. IMHO the information you'll get there is actually business-actionable (repeated attempts, ties to a user, aborts, etc.), which we simply can't get from a log file.

    Not every company will / can use a Download Manager, in which case, if downloads reporting is important, log files are a good place to be.
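On the tagging side, downloads are usually handled by attaching a click handler to download links and firing a beacon before the browser requests the file. A minimal sketch follows; the collection endpoint and the list of file extensions are purely illustrative.

```javascript
// Build the beacon URL for a download event (endpoint is hypothetical).
function downloadBeaconUrl(href) {
  return 'https://collect.example.com/event.gif?type=download&file=' +
    encodeURIComponent(href);
}

// Fire the beacon via a 1x1 image request (browser only).
function trackDownload(href) {
  var img = new Image(1, 1);
  img.src = downloadBeaconUrl(href);
}

// Attach tracking to every link that points at a downloadable file type.
function tagDownloadLinks(doc) {
  var links = doc.querySelectorAll('a[href$=".pdf"], a[href$=".zip"]');
  for (var i = 0; i < links.length; i++) {
    links[i].addEventListener('click', function (e) {
      trackDownload(e.currentTarget.href);
    });
  }
}
```

A page would call `tagDownloadLinks(document)` once after load; the tradeoff versus log files is that this counts clicks on the link, not completed downloads.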

    Dr. Turner: Absolutely agree with the sentiment you have expressed in your comment. That's it. No more comment from me! :)



  8.

    I wholeheartedly agree with JavaScript tagging for core tracking. But there are limitations with JS tagging in the volume of data you can collect (e.g. the IE URL 2083-character limit) and in many cases the type of reporting you can create out of your web analytics solution.

    Another alternative besides standard Apache/IIS web logs is custom application logging. The big drawback here is that you need IT resources to build the custom application logging and load it into a data warehouse. Application logging allows our company to capture 20-30 different variables per business event. Our JS tagging can't handle the combination of that many variables and their sizes. Application logging has also allowed us to analyze search engine crawlers and other unwanted automated activity with wonderful variable depth.

    We do have unique identifiers in place that allow us to tie our JS tagging solution back to our application logging at the individual business event level. Our JS tagging solution is effectively a subset of our application logging in regards to custom variables, but the incremental benefit of JS tagging is that it provides our UV, referrer, and other standard web metric counts (all the goodies you get out of the box with a web analytics provider). One more thing: I can't tell you how many times we have cross-validated between both tracking systems due to 'bugs' in the data capture in one or both systems.

  9.


    who said it isn't 1Gb a day? :-)

    Can I humbly suggest you're approaching the problem in the wrong way, urm, twice. :-)
    What percentage of that 1Gb is relevant to clickstream? What percentage is robots and other gumpf?
    Do upfront filtering of both and you have probably reduced 1Gb to 50Mb. And if you can't clickstream that….
    Practically every dedicated log analysis tool I've ever seen and/or used does a *really* poor job of pre-filtering efficiently.
    There are stock tools that do it far better. egrep, gawk and perl, for example, are far more efficient at pre-filtering. Use multi-cpu machines such that all are working on part of the problem. The pre-filter stages will also help with IO latency. Especially if you leave the logs compressed. IO latency is a killer for big log analysis.

    e.g. I do all my analysis on an HT 3GHz dual-Xeon IBM x235 server. I can process a full year's worth of logs in about an hour or so. Multiple staged filtering, all 4 effective CPUs working their butts off, disks ticking along nicely.
    Worst case, do the filtering on multiple machines and stage it that way. Easy!
    Hardware is cheap – your time is $$$$$.

    Another trick – get an OS that can run in memory file systems. eg /dev/shm under linux or /tmp on Solaris. Stage your files into that. Thus you end up processing the files in memory. IO latency goes to zippo. Especially, if you build your filter chain/pipeline, such that the next log is there ready and waiting before anything needs it.
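The pre-filtering Steve describes with egrep, gawk and perl can be sketched in any language; the same idea in JavaScript (Node), with a couple of illustrative patterns rather than a complete robot list, looks like this:

```javascript
// Sketch of log pre-filtering: keep only lines that look like real page
// views, dropping static assets and known robots before the analysis tool
// ever sees the log. The patterns are examples, not a complete list.
var ASSET = /\.(gif|jpe?g|png|css|js|ico)(\?|\s)/;
var ROBOT = /(Googlebot|Slurp|msnbot|bingbot)/i;

function keepLine(line) {
  return !ASSET.test(line) && !ROBOT.test(line);
}

function prefilter(lines) {
  return lines.filter(keepLine);
}
```

Run over a day's log (one line per hit), this kind of filter is what shrinks the "1Gb" down to the small fraction that is actually clickstream.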

    Which brings me to the other issue: Re-analysis is not as immediate, usually, as day to day analysis. So it doesn't matter if it runs overnight. Or longer.
    I don't reanalyse 4+ years of data every day. Maybe 4-8 times a year. I did easily double that in two weeks at the start of 2006. It varies.

    If your structure/layout changes, is clickstream even worth redoing? It's no longer relevant. The value is not so much in the clickstream analysis anymore but elsewhere.
    So that part, IMHO, becomes moot anyway.

    Oh I agree the world is not black and white. It's in colour! ;-)

    But the key message I, probably badly, tried to put across:
    Don't ignore any source of data. Cross check, verify, recheck those assumptions. It's too easy to get lulled into a false sense of security in this game.

    Javascript Tagging, and here's where I do agree with Avinash, is probably more than good enough for most people who genuinely care about how their site is used. But when *I* check my personal site (vs work's), it's next to useless. Misses huge swathes of folk. And that percentage changes too.

    I'd have to ask him – it should be fairly obvious – but I'd betcha I'm invisible to any non-log analysis Avinash runs here. Why? I run the NoScript & Adblock plugins in Firefox. Got fed up with all the javascript rubbish that too many sites throw my way. And if a site totally breaks without javascript? Shrug, plenty of others that don't. Not my loss. Very few indeed are the web sites that have content compelling enough for me to care.

    Evil Grin: I know one site that switched on javascript tagging of the … conversion step you'd call it. To better track it and such. *BIG* site. 1Gb a day? Try per half hour if not more. ;-)
    Well they hadn't really looked at how their users actually used that conversion step and immediately alienated a huge chunk of their customers. Big Mistake. They reverted that change *very* quickly.
    This was about 4-8 months ago from memory.

    The application logging comment above makes some excellent points. At work I actually use about 6 additional logging subsystems as adjuncts to the base Apache logging. It varies. Some capture app-specific stuff, others search engine (internal) activity and so on.
    As they are all different, they assist with cross checks – bug hunting and so on. Impact is insignificant.
    Pure gold, though, for getting different views on the system as a whole. And that's the key here. It's not a webserver, or even a webserver farm. It's a system! That just so happens to serve webpages as its principal function.

    – Steve

  10.

    Great post. On combo-solutions: There are several WA tools available that allow for an integration of both page tags (javascript) AND log file analysis. This encompasses the best of both worlds as you can set up rules to determine what should be interpreted as page views, clicks, exits, impressions, file downloads, form entries (with specific info grabbed from the UI), etc… Robot filters can be used, historic log data can be integrated, cache busting can be used, web 2.0 tags can be implemented, and pages with inaccurate tags (for let's say page name) can still be evaluated.

    These solutions are usually not any more expensive in personnel time or $$$ than the drop-and-go page tag solutions out there.


  11.


    I agree, filtering is needed when working with log files. But we already remove bots, CSS, XML and images from the logs, and it's still 800 MB of data each day. I don't say that javascript is better than log data mining; I just say that sometimes log data mining is not worth the trouble. It all depends on the goal of the site.

    One thing log data mining will never be replaced by javascript tags for is analyzing load on the server. But even this can be replaced by products like Coradiant.

  12.

    Hi there,

    Your statement about Wal-Mart building their own data collection solution is incorrect, or at least alludes to them using an in-house solution, which is incorrect. They do have their own data warehouse where they also do combined primary and secondary analysis for brick and mortar, and tertiary analysis for websites, but they use a 3rd-party vendor to do the primary website data collection, processing, and analysis.

    Your Statement: "Often this is a easy choice to make of any company that considers its core competency to be to focus on its business and not developing web analytics solutions (though admittedly if you are Wal-Mart you can absolutely do that – for example they have invented their own database solution since nothing in the world can meet their size and scale)."

    Wal-Mart uses Omniture to track and collect data for all their properties, including their Windows Media Player music site. Do a search for "Omniture" on and you will see their tags.

  13.

    GMAC88: The statement was correct; perhaps I was tardy in providing full context. My reference to Wal-Mart's database solution was about their multi-petabyte back end that runs all of their logistics, ERP and business intelligence systems (nothing to do with the web or web analytics).

    It has been fairly well established that one of Wal-Mart's core strategic advantages is its IT system, and the core essence of that IT system sits on a custom-built database (not Oracle or Sybase or DB2, etc.). I was referring to that, but clearly from your comment I should have been more expansive.

    Thanks for the comment.

    PS: I did follow your instructions and did a search for Omniture on, but this is what I get:

    I am kidding of course, I know what you meant when you said "search for Omniture"! :)

  14.

    I don't see why you think tagging separates the data serving and data capture groups when the site has to be tagged, unless the tagging you are talking about is really basic. The huge advantage of packet sniffing (I prefer to call it passive data capture, because it is a lot more than just capturing packets) is that it really does separate analytical data collection from the web site groups, and the data can be used to feed into many different applications.

    Passive data capture sees the datalink layer, so it can give accurate timing on a page load even if the graphics are served by Akamai, because there is a final acknowledgment to the page load. It cleans, filters and sessionizes the data in real time to give one clean log to load. We have customers that exceed 100 million pageviews per day after we have filtered out all of the locally served graphics, stylesheets, robots, local and remote test tools, etc. Many of the graphics are served by Akamai, but users know that and don't need that in the analytics.

    The argument about local caching is mainly theoretical. The pages they really care about are never cached, because they are the secure buying transactions, and they have such a large statistical basis it doesn't matter much anyway. We are finding customers where the marketing guys love what they are getting from a tag solution, but it's only basic stuff and they want more. The IT guys don't want to do custom tagging. The solution is passive data capture that can feed any log file analysis package or emulate a tag server to feed packages expecting data from tags.

  15.

    I still did not get what is meant by 'search for omniture' on

  16.

    Avinash, here from your reconciliation checklist, where you mention that one source of problems is people running with "javascript turned off (2-3% typical)". I think it's actually a bit more subtle than that.

    With ad blockers it's fairly easy to configure which parts of a complex page load, and to turn off not just the display ads, but also to refuse to load the Google Analytics tracking javascript. So your "turned off" reader might just be running silent, with most of the javascript working but the trackers gone.

    Since all of that code runs in the user's browser, it's not beyond comprehension that someone suitably motivated could throw junk into the data stream as well, but I haven't seen code to do that yet.

  17.

    thank you…

  18.
    Abdullah says

    GMAC88: Walmart uses Omniture to track their website, not to track their internal data.

    What Avinash talked about is Walmart's data generated from day-to-day transactions, not data from Walmart's website.

  19.

    Great post, thanks for that. I guess the "debate" between logs & tags is basically over, and one does indeed have to have only very specific reasons to still comb through log files.

    Though, from what I can see, there is no paid-for log analysis solution that I would really need for such things. From web server log files, I want to know about things like 404 errors and bandwidth from downloads, and that's something I can get from AWStats with a few mods.

  20.

    Forgive my ignorance, but is it safe to say that site side analytics tools use log files exclusively and ad servers use javascript tagging or other techniques of tagging exclusively? To put it another way, is the page tagging technique used for campaign measurement of ad performance as it relates to site side conversions while the log file technique is used for site side engagement, or do site side analytics tools also utilize page tagging?

  21.

    Matt: Vast generalization, but pretty much every web analytics tool, testing tool, survey tool, ad serving tool, etc. now uses JavaScript to collect data. JavaScript as in page tagging.

    Log files are of value in some cases, as outlined in this post, and a few tools still support log file parsing.


  22.

    This topic seems to rumble on 3-4 years later, which shows it's still not really resolved.

    In my mind tagging is fundamentally flawed (JavaScript can be easily switched off; it is not available on all platforms, especially mobile; and it often fails to be called due to tag positioning, load times and internet imperfections).

    Log analysis is old hat and shows only half the story.

    I've been looking at some new tools on the market, such as PION (Atomic Labs), who use packet sniffing and export the data into 3rd-party web analytics tools. Is anyone aware of any others who do this kind of thing? This seems to be the only logical way forward (unless you're happy with Google-style high-level analytics, in which case JavaScript is fine; if you need EVERY transaction, you have to look elsewhere).


  23.

    Craig: I am afraid I have to disagree with your assertion that javascript tagging is "fundamentally flawed" or that log analysis is "only half the story".

    Both statements are "fundamentally flawed". :)

    PION is an improvement on how packet sniffing used to be done, but it still suffers from limitations that would have to be overcome with tagging / log filing / data combining with an existing tagging solution, though half of one's life and resources will go towards creating that "perfect" (let me emphasize in words the quotes) system.

    There is no system on Planet Earth on Jan 11 2010 that will collect EVERY (emphasis yours) transaction. Sad, but absolutely true.

    Every data collection mechanism today comes with a set of pros and a set of cons. The duty we have as stewards of spending our companies' money wisely is: which data collection system provides the best balance of investment and ROI (insights)? In an imperfect world.

    The answer will be different for each company, and there is no blanket right answer for everyone.

    I'd be failing if I did not also add that perfection is vastly overrated.


    PS: For your data collection investigation here's one more: from Coradiant, the Analytics In A Box solution. No javascript tagging required.

    PPS: If you want to learn more about the pros and cons of each data collection mechanism (in short blurbs), here is a video that might be of value (the first part):

  24.

    Thanks for the response Avinash, some very fair points indeed, I think I slightly overstated the death of the tag ;-)

    Ultimately it depends on what you are using the data for and the complexity of your site. Moreover, for the vast majority JS solutions are more than sufficient; after all, the majority of the affiliate/marketing business has been built on the back of this kind of tagging, and they've been paying cash off the back of them.

    The overstated dream, if you like, is to easily integrate transactional data with behavioural data, with maximum flexibility to manipulate the raw data and import it into other BI systems (so also as accurate as possible); in addition, maintenance and ease of implementation are part of the consideration (though that is clearly debatable also and depends on your setup, if your dev is outsourced, etc.). I know Unica, WebTrends, etc. are also focusing on these areas, and having open systems seems to be becoming the general trend now, which is great.

    Thanks for the link, will definitely have a look into the various hardware solutions in this space too.

    I guess with more standardisation, is it too much to wish for vendors to seamlessly combine W3C-standard logs, JS, and any other methods (packet sniffing etc.), and extract one version of the truth to wherever you want to use that data?

  25.

    Like Craig, it seems to me this debate isn’t resolved. What percentage of users have medium-high to high security settings that block third and first party cookies?

    Steve comments that “something like 50%…of my current users either can’t, won’t or don’t run JS.”

    Edward Vielmetti commented that the percentage is more 2-3%.

    Has there been any research that offers a more empirically substantiated percentage of users that don't use JS?

  26.

    W3C stats say that currently about 5% surf the Web with JS off.

    Still, I'm seeing a huge difference between visits in GA and visits in the log server analysis software I'm using, called Summary.

    I've already turned off known robots in the report, and I'm filtering all images, multimedia, stylesheets, XML feeds, and JS.

    What other factors do I need to consider to reconcile the difference in visits between the two tools?

  27.

    With regards to both your comments….

    Data trumps intuition. You can use web analytics tools to measure the difference.

    See the below screenshot of a report for my blog's javascript status as measured using Yahoo! Web Analytics:

    Javascript Status – Yahoo! Web Analytics

    You don't need to believe what anyone else says; you can measure your own websites and your own customers.

    Comparing log files and javascript driven data is like comparing apples and horses.

    At some point they were related (early in the formation of planet earth :)) but they have evolved so much just in the last few years that it would be nearly impossible to come close with both methods.

    But if you do want to reconcile your data, here is a guide you can use:

    The Ultimate Web Analytics Data Reconciliation Checklist

    Good luck!


  28.
    Yeshwant says

    Hello guys,

    I am new to analytics. Could anyone explain to me which tool supports both log- and tag-based solutions? I don't want to use two products, one for log analysis and one for tagging.

    • 29

      Yeshwant: Most modern web analytics tools have abandoned log files as their source of data for a lot of reasons (including the fact that they are a terrible source of data for digital business analysis). So there are no log file versions of Google Analytics, SiteCatalyst, Coremetrics, etc.

      You can probably still buy a version of Urchin that will process log files, it will be old and contain a tiny fraction of the value you'll get from GA (so you are still getting two irreconcilable tools). You can also likely buy a version of WebTrends that runs on log files, and of course they also have a javascript version (again, irreconcilable datasets).

      Unless you have a very specific niche need (say, extracting search engine crawler behavior), you are infinitely better off with a JavaScript-based solution. And just one solution.



  1. […] This will likely be the last news post until after Christmas. Here’s what’s caught my attention today.

    1. Avinash Kaushik discusses the merits of javascript analytics over web log files. […]

  2. […]
    Interesting Read: The Great Web Data Capture Debate: Web Logs or JavaScript Tags?
