Excellent Analytics Tip #1: Compute Statistical Significance

We all wish that our key internal partners, business decision makers, would use Web Analytics data a lot more to make effective decisions. How do we make recommendations / decisions with confidence? How can we drive action rather than pushing data? The challenge is how to separate Signal from Noise and make it easy to communicate that distinction.

This is where Excellent Analytics Tip #1, a recurring series, comes in. Leverage the power of Statistics.

Consider this scenario (A):

    You send out two offers to potential customers. Here is how the outcomes look:

  • Offer One Responses: 5,300. Orders: 46. Hence Conversion Rate: 0.87%
  • Offer Two Responses: 5,200. Orders: 55. Hence Conversion Rate: 1.06%

Is Offer Two better than Offer One? It does have a "better" conversion rate, by 0.19 percentage points. Can you decide which one of the two is better with just 46 and 55 orders? We got 9 more orders from 100 fewer responses.

Applying statistics tells us that the results, the two conversion rates, are just 0.995 standard deviations apart and not statistically significant. This would mean that it is quite likely that it is noise causing the difference in conversion rates.

Consider this scenario (B):

    You send out two offers to potential customers. Here is how the outcomes look:

  • Offer One Responses: 5,300. Orders: 46. Hence Conversion Rate: 0.87%
  • Offer Two Responses: 5,200. Orders: 63. Hence Conversion Rate: 1.21%

Applying statistics will now tell us that the two numbers are 1.74 standard deviations apart and the results are 95% statistically significant on a one-tailed test. 95% significance is a very strong signal. Based on this, with a sample of only about 5k responses and sixty-odd orders, we can confidently predict success.
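To make the "standard deviations apart" figures concrete, here is a minimal sketch of a two-proportion z-test applied to scenarios A and B. It assumes an unpooled standard error; the spreadsheet's exact formula is not shown in this post, so the function name and the choice of formula are illustrative rather than a reproduction of it.

```python
import math

def two_proportion_z(orders_a, n_a, orders_b, n_b):
    """Return (z, one_tailed_p, two_tailed_p) for rate B versus rate A."""
    p_a = orders_a / n_a
    p_b = orders_b / n_b
    # Standard error of the difference, using each sample's own variance
    # (unpooled). This is an assumption, not the spreadsheet's known method.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = (p_b - p_a) / se
    # Upper-tail probability of a standard normal: P(Z > z).
    one_tailed = 0.5 * math.erfc(z / math.sqrt(2))
    # Probability of a gap at least this large in either direction.
    two_tailed = math.erfc(abs(z) / math.sqrt(2))
    return z, one_tailed, two_tailed

# Offer One is the same in both scenarios; only Offer Two's orders change.
for label, orders_two in (("A", 55), ("B", 63)):
    z, p1, p2 = two_proportion_z(46, 5300, orders_two, 5200)
    print(f"Scenario {label}: z = {z:.3f}, "
          f"one-tailed p = {p1:.3f}, two-tailed p = {p2:.3f}")
```

Under these assumptions scenario B clears the one-tailed 5% cutoff (z above 1.645) but not the two-tailed one (|z| above 1.96), so it matters which test you pick before looking at the result.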

Powerful benefits to presenting Statistical Significance rather than simply Conversion Rate:

  1. You are taking yourself out of the equation. It is awesome to say "according to the Gods of Statistics, here are the results…"
  2. Focusing on quality of Signal means that we appear smarter than people give us Analysts credit for.
  3. You take the thinking and questions out of the equation. Either something is Statistically Significant, and we take action, or we say it is not Significant and let's try something else. No reporting, just actionable insights.

Is this really hard to do?

No! Simply use the spreadsheet below, which comes to us via the exceedingly kind Rags Srinivasan:

In the spreadsheet you get even more bang for your buck. On sheet number one you can apply the 1-tailed or 2-tailed test to your statistical significance calculations. Here are the steps: Choose the test from the drop-down in cell D7. Complete cells B13, C13, B14 and C14 (essentially how many participants or visitors there were and how many conversions you got). In cell C18 you'll see whether the results were statistically significant or not.

In sheet number two, for those of you who are a bit advanced, you can apply the chi-squared test. This test is better suited to very small conversion rates (not unusual on the web). It is a more skeptical test with a higher threshold for differences. The benefit is that small statistical anomalies don't look like real differences.

When in doubt go with sheet number two, the chi-squared test.
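For readers who want to see the arithmetic behind a chi-squared test, here is a minimal sketch on a 2x2 table (converted versus not converted, for each offer). It is the textbook Pearson statistic without continuity correction, which is an assumption on my part; sheet two of the spreadsheet may compute it differently.

```python
import math

def chi_squared_2x2(orders_a, n_a, orders_b, n_b):
    """Pearson chi-squared test (no continuity correction) on a 2x2 table."""
    observed = [
        [orders_a, n_a - orders_a],   # offer A: converted, not converted
        [orders_b, n_b - orders_b],   # offer B: converted, not converted
    ]
    total = n_a + n_b
    row_totals = [n_a, n_b]
    col_totals = [orders_a + orders_b, total - orders_a - orders_b]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected in this cell if both offers convert equally well.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, P(X > chi2) equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

chi2, p = chi_squared_2x2(46, 5300, 63, 5200)   # scenario B
print(f"chi-squared = {chi2:.3f}, p = {p:.3f}")
```

On a 2x2 table this statistic is the square of the pooled two-proportion z-score, so it behaves like a two-tailed test. That is one reason it is harder to pass than a 1-tailed test on the same numbers.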

Two small tips:

  1. As a best practice, aim for 95% or higher Confidence. That is not always required, but it is recommended.
  2. "Statistics are like a bikini. What they reveal is suggestive, but what they conceal is vital." –Aaron Levenstein

Agree? Disagree? Not really an Excellent Analytics Tip? Please share your feedback via comments.

Comments

  1. 1
    Mark McLaren says:

    Thank you for starting your excellent blog. I found you via Robbin Steif's LunaMetrics Blog.

    Regarding the use of standard deviation as a means of interpreting order results. In general, I completely agree about the importance of removing bias from test results as much as possible.

    What else do we need to know about the groups involved in the test?

    Are they essentially the same group or are they two completely different groups? (I'm assuming you would want to send offers to as many people as possible; hence, they are the same group – less 100 people in the second case.)

    How were members of the group(s) selected? Do you need a random sample in order to apply principles of standard deviation? 5,000+ is a good size group from which to draw conclusions, but I take it group members were not chosen randomly.

  2. 2
    Jeff Leong says:

    Dear Avinash,

    Thank you for sharing your blog. This has been long awaited and definitely worth the time to read – subscribe to.

    I'm wondering now, based on this article, how this model plays in with UX decisions when applying A/B tests. We often find minimal differences and often base our decision on the winner between the two.

    Is there a way or method of filtering noise when the difference is minimal? Assuming all multivariate elements are correctly in place?

    Congratulations on the great blog, the industry will soon be catching on!

    Jeff

  3. 3
    June Li says:

    Hi Avinash,
    I had the pleasure of hearing you speak at the 2005 eMetrics. I'm very happy that you've decided to blog. I too found your blog through Robbin Steif's.

    It's excellent that you are giving us real examples of how statistics can be used, and providing tool references. I look forward to additional case studies and discussions.

    Will you also be posting about monitoring and managing outside influences? Sometimes the Noise dampens the signal or deflects the signal.

    Thanks,
    Web: http://www.clickinsight.ca
    Blog: clickinsight.blogspot.com

  4. 4

    Mark McLaren: Thanks for your kind words about the post, I am glad you found it helpful.

    What else do we need to know about the groups involved in the test?

    Are they essentially the same group or are they two completely different groups? (I’m assuming you would want to send offers to as many people as possible; hence, they are the same group – less 100 people in the second case.)

    In the specific example I used, and the spreadsheet, you usually control for one thing. You can have as many groups as you want. For example, you can send one offer to people who live in CA and NY and FL and OR and OH and plug that into the spreadsheet against a control and know which works best.

    Alternatively you could try 5, 6, 10 whatever number of different offers to a bunch of folks and see which one converts best.

    The problem comes when you want to test different offers on different groups (or many different pieces of content in different locations on the same page). Now you are in the world of multivariate and need to apply advanced statistics (think Taguchi).

    Doing multivariate is awesomely powerful and yields great results, but beyond my humble spreadsheet.

    Do you need a random sample in order to apply principles of standard deviation? 5,000+ is a good size group from which to draw conclusions,

    The beauty of using statistics is that the standard deviations required, and amount of Statistical Significance (my suggestion of 95% or higher), will drive how big a sample you need. There is no fixed number (like 5k).

    Hope this is the kind of information you were looking for.
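The point that the required significance drives the sample size can be sketched with the standard approximation for comparing two proportions. The 80% power default and the one-tailed setting below are illustrative assumptions of mine, not values from the post or the spreadsheet.

```python
import math
from statistics import NormalDist

def sample_size_per_group(p_base, p_variant, confidence=0.95,
                          power=0.80, one_tailed=True):
    """Approximate responses needed per offer to detect p_variant vs p_base."""
    alpha = 1 - confidence
    normal = NormalDist()
    # Critical value for the chosen confidence level.
    z_alpha = normal.inv_cdf(1 - alpha if one_tailed else 1 - alpha / 2)
    # Extra margin so a real difference is detected `power` of the time
    # (power is an assumption here; the post does not mention it).
    z_beta = normal.inv_cdf(power)
    variance = p_base * (1 - p_base) + p_variant * (1 - p_variant)
    n = (z_alpha + z_beta) ** 2 * variance / (p_variant - p_base) ** 2
    return math.ceil(n)

# Conversion rates taken from scenarios A and B in the post.
print(sample_size_per_group(0.0087, 0.0106))  # small gap: larger sample
print(sample_size_per_group(0.0087, 0.0121))  # bigger gap: smaller sample
```

Under these assumptions the smaller gap from scenario A needs tens of thousands of responses per offer, which is why there is no single fixed number like 5k.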

  5. 5

    Jeff: Glad to see your post…

    I’m wondering now, based on this article, how this model plays in with UX decisions when applying A/B tests. We often find minimal differences and often base our decision on the winner between the two.

    If we are doing a/b testing (assuming the Success Goal is clearly articulated and measurable and that it is not "impact on brand") then it would be a sin not to use the spreadsheet in the post above to separate Signal from Noise. Simply looking at the Conversion Rate (or similar metric) difference is very dangerous because of exactly what you say: how much is enough to be confident?

    The great news is that most current a/b testing solutions (at least the ones that do "page testing") already include statistical computations to help us make better decisions.

    If you don't see at least 90% plus statistical confidence, take the results with a grain of salt.

  6. 6
    Jaimie Scott says:

    Hi Avinash,

    I too am very happy to see your blog. I found it through Clint Ivy's blog and I am enjoying reading your posts very much. I find them to be quite informative.

    You say above:
    "You can easily adapt the spreadsheet, as we have, to compute statistical difference between absolute numbers (say you want to know if the difference Page Views Per Visitor or Average Time on Site between segment One and Two is Significant)"

    It's not obvious to me how to do this. Can you elaborate?

    Thanks.

  7. 7
    Aurélie says:

    Hi Avinash,

    Good to find you blogging, sharing thoughts and experiences. It's quite some interesting stuff and I hope you enjoy the experience.

    I read your different posts on Saturday morning and your thoughts stayed with me for the entire week-end. Thank you.

    Yes, statistical significance. I totally join you in the idea and would only add that tests that do not render truly significant results should not be communicated. I remember in my first job having warned of the non-significance of a test only to find it had heavily influenced a commercial strategy. I vowed never again!

    Another pavlovian reaction was to consider that any number of responses under 200 should not be taken into account as it holds a high probability of not being representative. I usually follow this first rule and adapt the variables in order to remain loyal to the statistical representativeness of a sample. Quite pavlovian, I agree.

    And the last thing is that I'll bear statistical significance in mind but would like to suggest another possible subject: correlation between conversion rates.

    Siegert suggested this formulation for a client yesterday:
    “Is a visitor engaging into A but not engaging into B, converted easier into a lead, than someone engaged into C and B?”
    In other words, you've got kind of low level conversion events that influence or not higher goals.
    I'm having difficulty formulating this, sorry.
    Hope it made sense, keep up the good work, cheers from expensive Brussels ;-)
    Aurélie

  8. 8

    Aurélie: Thanks for the thoughtful comment. I am sorry to have spoilt your entire weekend with my posts; after all, there are so many more beautiful things in life. :)

    I completely agree with the care around communicating anything that is not of significance. There is always a danger that, in spite of your warning, they will jump into the lake.

    Another pavlovian reaction was to consider that any number of responses under 200 should not be taken into account as it holds a high probability of not being representative.

    (For our readers here is something on pavlovian reaction.)

    In the world of Multivariate we can detect a strong signal even with small samples. We use something like this page to calculate the sample size.

    “Is a visitor engaging into A but not engaging into B, converted easier into a lead, than someone engaged into C and B?”

    Correlations are important, very, and of course my simple little spreadsheet won't account for that. Especially for complex web interactions, it is important to understand how the lower-level conversion events might influence higher-level (ultimate) goals.

  9. 9
    Kerry Kim says:

    Hi Avinash, my thanks also to you for sharing. Any additional insights you might have about the key drivers of adoption you've experienced would be greatly appreciated.

    Regarding statistical significance, it appears that the reference in your post used a one tailed z test for testing whether there is a significant difference between two sample proportions. Wouldn't it have been more precise to use a two tailed test? If not, why?

  10. 10

    Kerry: The example used was quite a simple one to show that we can accomplish much applying statistics to our standard KPI's with very little stress.

    Wouldn’t it have been more precise to use a two tailed test? If not, why?

    You are right, one can get quite sophisticated and get ever better results. The emphasis of the article was how to detect statistical significance in a simple case. I hope to blog more about how we can apply advanced methodologies in testing (to build on my experimentation and testing post).

    Thanks for taking the time to post a comment.

  11. 11
    Vicky Brock says:

    Hi Avinash,

    I so much agree on the importance of taking into account statistical significance – an essential part of the "so what" factor!

    This is a neat chi square tool to test for statistical significance:

    http://www.georgetown.edu/faculty/ballc/webtools/web_chi.html

    I do love your blog, bye, Vicky

  12. 12
    Hakim Aly says:

    Although a 95% CL seems to be common (other than in a medical/pharmaceutical context), in a marketing context a lower CL may be quite appropriate. As you know, the choice of significance level (or its complement, Confidence Level) depends on the cost of being wrong.

    A 5% significance level (95% CL) means there is a 5% probability of being wrong when there is in fact no difference. This is Type I error, i.e., concluding that one RR% is higher than another (statistically significant) when in fact it is not. Acting on this wrong conclusion may result in incurring costs that do not yield revenue or profit to offset the costs.

    Type II error is when one does not reject the null hypothesis when in fact it is false. In this case, the cost associated with the decision to not roll out a marketing tactic is the foregone revenue/profit that would otherwise have been generated.

    In many situations, the cost of Type II error exceeds the cost of Type I error. Clearly, a trade-off is involved, but a lower CL of 90% or even 80% may not be out of line.

    Ultimately, each business needs to decide for itself what an appropriate CL is for purposes of assessing test results.

    Would be interested in your thoughts. Hakim

  13. 13
    Hakim Aly says:

    Regarding the question of 1-tail vs. 2-tail test, the former is appropriate when one wants to determine whether one RR% is statistically HIGHER (or LOWER) than another. The latter is appropriate if one wants to know if a RR% is DIFFERENT FROM another.

    I would suggest that in most marketing situations, we are more interested in the former (higher than) than the latter (different from).

    In a few cases, we may want to know whether a proposed course of action may harm the response rate, in which case a 2-tail test would be appropriate.

  14. 14
    Curtis says:

    Thanks for the great insight. Do you have more details on exactly how the statistical significance is calculated? I'm curious how you derived the std deviations in the scenarios above with just the sample size and order counts.

  15. 15

    this is to continue where mr hakim aly left. you may find this little piece at my blog, http://directindia.blogspot.com/2007/10/no-beta-yes-risk.html, interesting.

    i've also given links for templates in the piece.

    pac

  16. 16
    Philip says:

    The Analytical Group link for a free spreadsheet download appears to have changed. It's now http://www.analyticalgroup.com/sigtest.html (with html instead of htm).

    Thanks for all your great articles, Avinash!

    Note : Thanks very much for the correction Philip! -Avinash.

  17. 17
    Jbuser says:

    Avinash,

    As a stats guy, I am a little concerned with the assumptions behind the model. I downloaded it and the first thing I noticed was that it makes some pretty large assumptions with confidence levels (anything with a z-score (I am assuming) of 1.65 and 2.33 = 95%). I understand that this is probably there to make things "easy" but I think it can be misleading. Also of note: IF there is a z-score assumption (which I am unsure of), there are some other assumptions underneath the covers (which I couldn't get to), and z-scores are only for known population means and standard deviations. Do you know what Brian is using? Is it possible to get this information?

    Finally, one concern with the "plug and chug" nature of the spreadsheet is what you always must be wary of, and that is making statistical significance a badge of honor. All it tells you is given the values you have, is there a difference between the two. What you must do, more than anything else, is make sure your testing methods are solid BEFORE the test. Otherwise, you are going to be putting in values after values and getting significance or not and making some very important decisions when the whole test could be wrong. Practical vs. Statistical.

  18. 18
    michael choe says:

    all –

    i share similar concerns as jbuser…

    for example, 1.74 standard deviations is not 95% significance. 1.96 standard deviations is 95% confidence.

    also, for computing standard deviation (s) of 2 or more proportions (in this case, conversion rate), i think it's a good practice to assume the largest margin of error. in this case, margin of error = 2 * sqrt(0.5^2/n), where n is the size of your smallest sample. this is what pollsters such as zogby do when communicating poll results about hillary/obama, etc.
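The worst-case margin of error described in this comment is a one-line calculation. The sketch below simply evaluates that formula; the function name is mine, and the multiplier of 2 is the comment's rounding of the usual 1.96.

```python
import math

def worst_case_margin(n):
    """Roughly 95% margin of error for a proportion, assuming p = 0.5."""
    # p = 0.5 maximises p * (1 - p), which gives the widest possible margin.
    return 2 * math.sqrt(0.5 ** 2 / n)

# Smallest sample in the post's scenarios: 5,200 responses.
print(f"{worst_case_margin(5200):.4f}")
```

For n = 5,200 this comes to about 1.4 percentage points. Because conversion rates near 1% are far from the p = 0.5 worst case, this rule is much more conservative than a standard error computed from the observed rates.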

  19. 19
    Barbara says:

    Hi Avinash,

    Thanks a million for your great posts. I just got into web analytics and have found both this blog and your book extremely helpful. You mentioned utilizing the statcalc.xls spreadsheet to measure significance between pageviews. Can you kindly advise on how to do this? Your response is eagerly awaited. :)

  20. 20
    Andrew Blank says:

    The link for more advanced stats – http://www.mwrms.com/wwwRMS/DirectMarketing/MarketingCalc2.asp does not seem to have anything to do with stats anymore. Unfortunately I couldn't find a live replacement.

  21. 21
    Lea SP says:

    What an insightful post Avinash.

    For someone who is just approaching web analytics from a statistical perspective, I'm constantly asked whether a particular report figure is "statistically significant" in terms of sample size. I readily understand the statcalc worksheet purpose of seeing whether the difference between two metrics is stat. significant, but how do you know if the metrics have enough sample size to gauge effectively?

    For example, I have two conversion rates: 0.7% and 8.8%. The sheet shows that there is a 99% confidence that these are statistically different which is a good start. But my rates are based off of A(4,155 clicks & 6 conversions) and B(80 clicks and 7 conversions). Are the conversion samples too small to be effective indicators in this case? What is an acceptable threshold here, and how do you find it on a case-by-case basis?

    Thank you!

  22. 22

    Avinash, this post is giving me nightmares and heart palpitations.

    My problem is similar to Lea SP's, but on steroids. I have two sample populations:

    Pop A: 200,000 participants, 800,000 conversions
    Pop B: 1,000 participants, 8,000 conversions

    Unfortunately, conversions in this case don't generate direct cash, or else I'd be buying an island somewhere, and not worry about calculating significance!

    But obviously, there should be something to be said of Pop B having way lower sample size and not being representative or even warranting comparison to Pop A, right?

    Regards,
    Garry

  23. 23
    Joshua Daniel Egan says:

    Since I am new to SEO, I want to implement statistics in SEO. I thought that statistical techniques could not be applied in SEO or to the data provided by Google Analytics. Can you suggest any statistical technique, with an example, by email? It would be useful for me and I would be grateful to you.

  24. 24

    I am on my way to becoming an analysis ninja. My two roadblocks thus far are JavaScript and statistics (KPI's? check. Segmentation? double check.)

    I am wondering if you could recommend any good entry-level books that could help teach me to do statistical analysis? Anything that uses web analytics scenarios is a *huge* plus.

    Thanks!

  25. 25

    Adrian: I am afraid I don't know any book on Statistics that covers web analytics scenarios. Though if you understood statistics I don't think anything would stand in your way in terms of applying it to challenges you face in your web analytics job.

    Taking a Statistics 101 or 201 course at your local university might just do the trick.

    This might seem odd but one of the best books I have read on Statistics is this one:

    Cartoon Guide to Statistics by Larry Gonick, Woollcott Smith

    Simple and effective.

    Not necessarily just about statistics but this is an awesome book if you want to be a great analyst:

    Fooled by Randomness: The Hidden Role of Chance in Life and in the Markets
    by Nassim Nicholas Taleb

    Hope this helps a bit.

    Avinash.

  26. 26
    Dave Rekuc says:

    Thank you for the easy to understand description above. I've found it very useful, however, I work for an ecommerce site that has a price range of anywhere from $3 an item to $299 an item. So, I feel like in some situations only looking at conversion rate is looking at 1 piece of the puzzle.

    I've often used sales/session or tried to factor in AOV when looking at conversion, but I've had a lot of trouble coming up with a statistical method to ensure my tests' relevance. I can check to see if both conversion and AOV pass a null hypothesis test, but in the case that they both do, I'm back at square one.

    Can anyone recommend a statistical method for this scenario?

    Thanks Avinash for the article, love your blog (and books)!

  27. 27

    Dave,

    You are correct in stating that looking at conversion rate alone is looking at one part of the puzzle.

    When you have items that vary in price, like you said from $3 to $299, your test for statistical significance of difference between conversion rates assumes an implicit hypothesis that is treated as given.

    A1: The difference in conversion rates does not differ across price ranges.

    and your null hypothesis (same, just added for completeness)

    H0: Any difference between the conversion rates is due to randomness

    When your data tells you that H0 cannot be rejected, it is conditioned on the implicit assumption A1 being true.

    But what if A1 is false? Either you explicitly test this assumption first or as simpler option, segment your data and test each segment for statistical significance. Since you have a range of price points I recommend you test over 4-5 price ranges.

    This is same as the case when you are A/B testing simple conversion rates and treat the population as the same (no male/female difference, no Geo specific difference etc).

    Hope this helps.

    -rags

  28. 28
    Dave Rekuc says:

    Thank you Rags, very helpful. I'll use the segmentation method in my next test. Unfortunately, this means waiting for a larger sample size than non-segmented data. However, I suppose it's worth it. Thanks!

  29. 29
    LisaS says:

    Hi,

    What do you think about using Fisher's Exact Test over the Chi-squared test? The 2nd link below suggests that using Fisher's Exact Test is the thing to do.

    http://www.langsrud.com/fisher.htm
    http://www.graphpad.com/faq/viewfaq.cfm?faq=

    Thanks,
    Lisa

    • 30
      LisaS says:

      The 2nd link got cut off, there's a 1114 after the equal sign:
      http://graphpad.com/faq/viewfaq.cfm?faq=1114

    • 31

      Lisa

      First, I see that you are from Boulder! Go Boulder! My hometown forever!

      Yes, Fisher's Exact Test is a candidate here, but it is needed only when we are seeing conversion numbers as tiny as 5 or lower. With Fisher's Exact Test, when you are doing the calculations with a handheld calculator the permutations can get ugly for large numbers, but with Excel we don't have that issue. Yet for larger numbers the Chi-square test does well, eliminating the need for an exact test.

      On the other hand, if the conversion rates are so low, one should not even be wasting time over statistical anomalies (there may be statistical significance, but is there economic significance?).

      Chi-square test is more skeptical than t-test and Fisher's Exact test is even more skeptical than Chi-square test. For most split tests that test the hypothesis that the two groups are the same, Chi-square will do.

      All that said I should admit that I am biased in picking a candidate test that I can explain to others in simple terms. It is relatively easier for me to explain t-test and chi-square test to those who are not statistically inclined (see here: http://iterativepath.wordpress.com/2011/07/03/a-closer-look-at-ab-testing/ ) than it is to explain Fisher's test.

      Regards

      -Rags
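For the tiny-count case discussed above, here is a minimal sketch of a two-sided Fisher's exact test built directly from hypergeometric probabilities. It is a plain-Python illustration rather than the method behind either linked calculator, and the example counts are the ones Lea SP posted in an earlier comment (6 of 4,155 versus 7 of 80).

```python
from math import comb

def fisher_exact_two_sided(conv_a, n_a, conv_b, n_b):
    """Two-sided Fisher's exact p-value for two conversion counts."""
    total = n_a + n_b
    total_conv = conv_a + conv_b
    denom = comb(total, n_b)

    def prob(k):
        # Probability that exactly k of all conversions land in group B
        # when group membership is unrelated to converting.
        return comb(total_conv, k) * comb(total - total_conv, n_b - k) / denom

    lowest = max(0, n_b - (total - total_conv))
    highest = min(total_conv, n_b)
    p_observed = prob(conv_b)
    # Sum every outcome that is no more likely than the observed one.
    # The small tolerance guards against floating-point ties.
    p_value = sum(p for p in (prob(k) for k in range(lowest, highest + 1))
                  if p <= p_observed * (1 + 1e-9))
    return min(1.0, p_value)

print(fisher_exact_two_sided(6, 4155, 7, 80))
```

Summing all tables no more likely than the observed one is one common two-sided convention; other conventions exist and can give slightly different p-values.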

      • 32
        LisaS says:

        Thanks so much Rags for your reply. And yay for Boulder! I'm using your great spreadsheet tool. Thanks! So if I get a YES for statistical significance using the t-test, but not when using the chi-squared test. Would you still say YES, the test was significant? Is the t-test good enough in your opinion?

        Numbers for A are: 33548 conversions out of 212460 participants
        Numbers for B are: 33371 conversions out of 208143 participants

        • 33

          Lisa
          You need to decide whether or not you want to do Chi-square test or t-test.
          Chi-square test is more skeptical and it likely will more often find the difference to be not statistically significant.

          Since it operates on just the conversions, the procedure that uses the Chi-square test also eliminates other hidden hypotheses that are prevalent with a procedure that uses the t-test.
          One thing you should not do is use the Chi-square test, see it fail, then use the 2-tailed t-test, see it fail, and then use the 1-tailed t-test and see it pass.

          http://iterativepath.wordpress.com/2010/06/01/is-your-ab-testing-tool-finding-differences-that-arent/

          -Rags

          • 34
            LisaS says:

            Hi Rags,
            Me again! You say that the Chi-square test is more skeptical than the t-test. I'm confused because I'm finding that the 2-tail t-test is more skeptical than the Chi-square test. Any thoughts on that?

            I'm using your spreadsheet, thanks for making it available!!
            Control – 10164 participants, 584 conversions
            Treatment – 18928 participants 1121 conversions

            Is it working right if I see that the 2-tailed test does not pass statistical significance, but the Chi-square test does indicate statistical significance? Perhaps the 2-tailed t-test is more skeptical than the Chi-square test?

            thanks!
            Lisa

  30. 35
    Denisse Gomez says:

    Hi,

    I have results based on a sample size of X. The results fall into two buckets. How do I figure out if those individual buckets are statistically significant enough to draw a conclusion?

    Thanks,
    Denisse

  31. 36
    Bryan says:

    Just downloaded the spreadsheet, and the chi-sqr tab isn't taking into account the number of test participants in each variation. That can't be right.

    Should "Expected value" be a weighted average rather than a straight average of conversions, or something more complex?

    • 37

      Bryan
      You are correct that the chi-square tab does not use number of test participants. I made a simplifying assumption that your A/B test evenly distributes (50-50) between two groups. Hence you see Row 24 calculating Expected value as 50% of total conversions.

      To use test participants, fix the Row 24 formula as follows (find the Expected value for Control and for Treatment, adding another column):

      Expected value for Control B24= ($B15+C15) * $B14/($B14+$C14)
      Expected Value for Treatment C24= (B15+C15) * $C14/($B14+$C14)

      You'll notice the two will be same if the test participants are evenly split.
      Then you'll change Row 25 as
      =((B24-B15)^2)/B24 + ((C24-C15)^2)/C24

      -Rags
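The effect of this fix can be sketched outside the spreadsheet. The function below mirrors the Row 24 and Row 25 formulas above: with no participant counts it falls back to the original 50-50 assumption, and with them it weights the expected conversions by each group's share of participants. The function name is hypothetical, and the example figures are the ones LisaS posted earlier in this thread.

```python
import math

def chi_square_conversions(conv_control, conv_treatment,
                           n_control=None, n_treatment=None):
    """Chi-square on conversion counts, optionally weighted by participants."""
    total_conv = conv_control + conv_treatment
    if n_control is None or n_treatment is None:
        share_control = 0.5   # original simplification: an even 50-50 split
    else:
        share_control = n_control / (n_control + n_treatment)
    # Row 24: expected conversions per group.
    expected_control = total_conv * share_control
    expected_treatment = total_conv * (1 - share_control)
    # Row 25: sum of (expected - observed)^2 / expected.
    chi2 = ((expected_control - conv_control) ** 2 / expected_control
            + (expected_treatment - conv_treatment) ** 2 / expected_treatment)
    p_value = math.erfc(math.sqrt(chi2 / 2))   # 1 degree of freedom
    return chi2, p_value

# Control: 584 of 10,164. Treatment: 1,121 of 18,928.
print(chi_square_conversions(584, 1121))                 # assumes 50-50 split
print(chi_square_conversions(584, 1121, 10164, 18928))   # weighted
```

With an uneven split like this one, the 50-50 version mistakes the difference in group size for a difference in conversion, while the weighted version stays well under 3.84, the 5% critical value at one degree of freedom.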

      • 38
        Lisa S says:

        Thanks Rags and Bryan.

        That helps clarify the issue I was seeing when comparing the t-test to the chi-square test.

      • 39
        Bryan says:

        Rags, thanks for the update. Avinash, thanks for posting the new file. Lisa, glad to help.

        Can you guys tell me if this is appropriate to use in cases where conversions *exceed* visitors? I'm testing a "related articles" component that appears on article pages on a media site, and using page views as the conversion metric.

        Thanks!

        • 40
          Andrew Blank says:

          I've run into this as well. Did you ever get an answer? I don't see it in this section.

  32. 41
    S M says:

    Hi Avinash..

    I liked the blog very much. i am in the process of developing more statistical insights and it would be great if you could suggest ways that can help me.. an informative blog or statistical bible for web analytics.. your inputs would be highly desirable..

  33. 43
    Diego says:

    Thanks Rags for this post.

    Everything I can find regarding significance is based on conversion, but the final goal for an ecommerce site is to increase sales (increasing conversion is just a means, and sometimes it can be misleading, because you can increase conversion but if your AOV goes down your change could be impacting your sales negatively).

    So my question is, why do we use conversion instead of sales ($)? Could the significance be calculated based on sales? How could this be done?

    Thanks,
    Diego

    • 44
      Andrew Blank says:

      This makes complete sense. I've seen others just add up the dollar value based on the conversion type. The problem is that they aren't necessarily significant. I've read that it takes additional statistics to see if an alternative value wins. However I haven't seen a step by step approach to show how.

  34. 45
    Geo says:

    Hi Avinash,

    You've been my inspiration for quite a few years now, and your posts always bring wisdom to web analytics. Thank you for that!

    I'd like to note that the Excel sheet calculator that you link to is actually pretty much useless for practical purposes. Yes, it will do in a Baseline vs Test scenario, but how many times does a marketer face this situation in the real world? Not very often.

    Usually we have a Baseline vs Test 1 vs Test 2 vs Test 3 and so on. Depending on what you are testing (ad copy, landing pages, etc.) the number of variants you'll be testing against can quickly go way above the trivial 1 or 2 tests.

    Then you might say – but can't I simply do a pairwise comparison: each Test vs the Baseline and then take the best one? The short answer is no. The slightly longer answer is "Because of the Multiple Comparison Problem in Statistics", about which enough has been written all over the web.

    Thus you need a calculator that corrects for that in an intelligent way (too strict and you'll be correcting for things very rarely required/encountered in the real world, e.g. the Bonferroni correction).

    I've devised and published a calculator that handles many test variants easily (bulk input as well, copy paste from an excel table) and corrects for the multiple-comparison error intelligently (the Benjamini-Hochberg-Yekutieli method). I believe it to be the most suitable calculator on the market right now. It's available here – analytics-toolkit.com/statistical-calculators/ and you can sign up for free to use it (no newsletters and shit!). Feel free to give it a try and let me know what you think :-)
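To illustrate the two corrections named in this comment, here is a minimal sketch of the Bonferroni rule next to the Benjamini-Yekutieli step-up procedure. It is not the linked calculator's code, and the four p-values are made up purely for illustration.

```python
def bonferroni(p_values, alpha=0.05):
    """Flag a variant only if its p-value beats alpha divided by the test count."""
    return [p <= alpha / len(p_values) for p in p_values]

def benjamini_yekutieli(p_values, alpha=0.05):
    """Flag variants while controlling the false discovery rate."""
    m = len(p_values)
    # Harmonic-sum penalty that makes the procedure valid under dependence.
    harmonic = sum(1 / i for i in range(1, m + 1))
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Step-up rule: remember the largest rank whose p-value is under its line.
        if p_values[idx] <= rank * alpha / (m * harmonic):
            cutoff_rank = rank
    flagged = [False] * m
    for idx in order[:cutoff_rank]:
        flagged[idx] = True
    return flagged

# Hypothetical p-values for four test variants, each compared with the baseline.
p_values = [0.2, 0.015, 0.001, 0.004]
print(bonferroni(p_values))
print(benjamini_yekutieli(p_values))
```

With these made-up inputs the step-up procedure flags one more variant than Bonferroni, which is the sense in which Bonferroni can be too strict when many variants are tested at once.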

  35. 46
    davea0511 says:

    *Very* Helpful, Avinash.

    I was having a hard time visualizing confidence in A/B tests until I read this. The key is to mentally overlay the graph for A and B by knowing their stddev. Your page clued me into that.

    Thanks!

  36. 47
    Dan Grainger says:

    Dug out this article for a re-read given that I'm currently redesigning some Excel based testing tools. I've always been a big fan of statistical confidence, indeed it's one of the mandatory criteria in my book before making a call on an AB test.

    While it's simple enough to calculate stat confidence for conversion rates, there are (imho) other more important metrics which I add into the mix of decision making factors…such as revenue per visit, average order value and margin per order. So my question is this – given that these metrics aren't percentages (i.e. they can be greater than 1 and less than 0), what's the appropriate method to calculate statistical significance for them? I'm sure I have it somewhere, but wanted to draw on the experiences of you and your readers when encountering this!
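
One common option for metrics like revenue per visit is to compare means rather than proportions, using a Welch-style statistic that allows the two groups to have different variances. The sketch below assumes per-visit revenue values are available for each group and uses a large-sample normal approximation for the p-value. The function name and the sample data are hypothetical, and since revenue is usually heavily skewed, small samples would call for a t distribution or a resampling method instead.

```python
import math
from statistics import NormalDist, mean, variance


def welch_statistic(sample_a, sample_b):
    """Difference in means (B minus A) in standard-error units, plus a two-sided p-value.

    The p-value uses a normal approximation, which is only reasonable
    when both samples are large.
    """
    se = math.sqrt(variance(sample_a) / len(sample_a)
                   + variance(sample_b) / len(sample_b))
    stat = (mean(sample_b) - mean(sample_a)) / se
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(stat)))
    return stat, p_two_sided


# Made-up revenue per visit: most visits earn nothing, a few earn 40.
revenue_a = [0.0] * 95 + [40.0] * 5
revenue_b = [0.0] * 90 + [40.0] * 10

stat, p_value = welch_statistic(revenue_a, revenue_b)
print(f"statistic = {stat:.2f}, two-sided p = {p_value:.3f}")
```

The same function works for average order value or margin per order: pass one list of per-order values per variant.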

Trackbacks

  1. […] You can use statistics for this, and statistical significance in particular. You hear the term in commercials now and then, or see it in studies reported in the newspaper. You also run into it in web analytics, for example in the book “Web analytics an hour a day” by Avinash Kaushik. The piece on statistical significance is also on his blog. So what does statistically significant actually mean, and how do the tools he mentions work? […]

  2. […] RKG on finding statistical significance in two Adwords tests  Interesting commentary on the value (or lack-of-value) in copy testing in PPC Avinash Kaushik on separating signal from noise with statistical significance […]

  3. […] You can use statistics for this, and statistical significance in particular. You hear the term in commercials now and then, or see it in studies reported in the newspaper. You also run into it in web analytics, for example in the book “Web analytics an hour a day” by Avinash Kaushik. The piece on statistical significance is also on his blog. So what does statistically significant actually mean, and how do the tools he mentions work? […]

  4. […] has an excellent post regarding statistical significance. Reading this post will help you understand the correct way to choose winners and test ad text […]

  5. How to Optimize PPC Tip #1 : Optimize Ad Texts | datumSense says:

    […] has an excellent post on applying statistics to tell us when an ad’s outcome is statistically significant, so we can confidently predict success. This post will help you understand the correct way to pick the best […]

  6. […] 8. Performing statistical analysis to eliminate noise. […]

  7. […]
    What fascinates me most about the box-plot concept, or the use of statistics in general to monitor a website's performance, is that it lets us be creative when we need to be creative and do something else when all is well. When the control limits are not exceeded, there is no point wasting time taking action. Wow. I am an intuitive person and I do not yet grasp the full reach of this measurement tool, but I can see some really interesting possibilities.
    […]

  8. […]
    Without being a statistician, I think it is essential that a professional web analyst present results on a statistical basis. Without getting into details (for fear of getting lost and confused!) and based on a post by Avinash about significant differences, I have been using the tool called “The Teasley Statistical Calculator”:
    […]

  9. […]
    Here’s a drop-dead simple statistical significance calculator tool, an excel spreadsheet by Brian Teasley, which I think you will find useful. I know I do.

    For more on this, read
    Excellent Analytics Tip#1: Statistical Significance by Avinash Kaushik
    […]

  10. […]
    Avinash describes how to calculate statistical significance with analytics data
    Integrating a PayPal shopping cart with Google Analytics
    […]

  11. […]
    Hidden in this hypothesis testing are many implicit hypotheses that we take as truth. If any one of them proves not to be true, then our conclusion from the A/B testing will be wrong.
    Dave Rekuc, who runs an eCommerce site, posed a question in Avinash Kaushik’s blog post on testing for statistical significance in A/B testing. Dave’s question surfaces one such hidden hypothesis
    […]

  12. […]
    As for more important decisions, such as deciding which campaign to keep based on its conversion rate, we should always rely on statistically significant data. To check whether our data meets that criterion, he recommends reading his article “Excellent Analytics Tip#1: Statistical Significance”, in which he explains how to calculate that factor, and where we can download an Excel file that calculates it all for us. Thanks, Avinash! :)
    […]

  13. […]
    You have probably at some point used a spreadsheet like the one explained in a post by Avinash Kaushik to test for differences between two conversion rates. This spreadsheet uses the same method I described in Multivariate testing part II: Associations, where we assume that the conversion rates are independently normally distributed.
    […]

  14. […]
    Statistical Significance
    Avinash Kaushik has a great article on applying statistical significance in A/B testing. I’d advise you to read it!
    […]

  15. […]
    Data Analysis 101: Seven Simple Mistakes That Limit Your Salary (Occam’s Razor): This is written about web site marketing, but the principles apply for any kind of data presentation. Kaushik also has a handy Statistical Significance Calculator.
    […]

  16. […]
    For example, if you run an email test and send it to 5,000 people and your success metric is clicks on the email, the volume you need to achieve should be based on the number of clicks and the difference between the results as a percent of the size of the test. You can read more on statistical significance in this great post from Avinash Kaushik and use this great Excel from Rags Srinivasan.
    […]

  17. […]
    Excellent Analytics Tip #1: Statistical Significance – Occam’s Razor
    […]

  18. […]
    The good news is, many A/B testing tools have statistical significance built right in so you get an indication as to when your test is ready for interpretation. If you don't have that, however, there are also a number of free calculators and tools out there for understanding the statistical significance. HubSpot's is below, and you can also check out a more detailed excel spreadsheet over on the Occam's Razor blog.
    […]

  19. […]
    In a normal distribution, roughly 95% of the data lies within two standard deviations of the mean. If we test a data point for significance and it falls outside of this 95% interval then we can confidently say it’s statistically significant. If the test generates a data point within the 95% confidence interval then we consider it to be statistically insignificant. Check out this post by Avinash Kaushik if you want to read more about the calculation process of statistical significance.
    […]

  20. […]
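    This article covers two ways to test data: using an Excel significance calculator with the functions already written, and using SPSS to run significance tests on detailed customer data. 1. Excel significance calculator. Suppose we have the following data: then we can test it with the Excel significance calculator introduced by Avinash Kaushik; for details see http://www.kaushik.net/avinash/excellent-analytics-tip1-statistical-significance/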
    […]

  21. […]
    The good news for you is that many testing tools already have statistical significance built in, so you will be told right away when the test is ready for interpretation. If there is no built-in function, don't worry: there are plenty of free calculators and programs that compute statistical significance. Here is a link to HubSpot's, and you can also look at the Occam's Razor blog for a more detailed Excel template.
    […]

  22. […]
    6. Monitor Performance and Decide When to Stop
    Over the course of the test you will need to monitor the performance of your variations to make sure neither of them are underperforming, causing them to hurt your users’ experience and damage your brand equity. Use this Excel template to know when your results are statistically significant so you can stop the test and declare the champion.
    […]

  23. […] Avinash Kaushik: Excellent Analytics Tip#1: Statistical Significance […]

  24. […]
    Calculator Spreadsheet for Statistical Significance [Avinash Kaushik]
    […]

  25. […]
    I hope most of the people working in online marketing had one or two courses on basic statistics back at university. At that time, you didn’t know what to use this shit for, therefore you didn’t make an effort to remember all those distributions and Central Limit Theorems your fugly prof was telling you. And then the way of data-driven thinking came. Now you wish you still remembered some stuff (at least I do wish it). My point is, if you want to act data-driven you need to be able to understand and interpret the data correctly. And by correctly I mean based on science, maths, statistics. You remember n > 30 means the data is always statistically significant? Well, you’re wrong. Read this. So, back to my point: As long as you’re not a good statistician, you can’t really help yourself, your company or your clients with data-driven decisions.
    […]

  26. […] then we can test it with the Excel significance calculator introduced by Avinash Kaushik; for details see http://www.kaushik.net/avinash/excellent-analytics-tip1-statistical-significance/ […]

  27. […]
    Collect an equal amount of visitors for each variant. When you have a significant sample, you can compare the results of your tests. A true A/B test requires 95 percent confidence in the results. (Here’s a helpful post on figuring out what that means for your test.)
    […]

  28. […]
    There are volumes of academic texts filled with guidelines for determining the appropriate sample size for statistical relevance. While we don’t need to wade into those deep waters, there are two key decisions you do need to make. The first is what confidence level you want to have in your test. Analytics guru Avinash Kaushik recommends that you shoot for a 95% confidence level, and who are we to disagree?
    […]

  29. […]
    Even with such great technology available, there are a few things to watch out for. The first is statistical significance, which has been written about enough (here, here and a mini-site here if you’re interested). Another is the common mistake of assigning a goal that measures the short-term effect of a test rather than the long-term effect on your business. We made this mistake at Segment.io, and that’s the story I’ll be sharing in this article.
    […]

  30. […]
    A few quick considerations when implementing this strategy: Remember bid multipliers are multiplicative. Always make statistically significant decisions.
    […]

  31. […]
    Avinash Kaushik was one of the first to write about the importance of statistical significance in web analytics, back in 2006. LunaMetrics, a major source for web analytics know-how, has only a couple of mentions and just one short recent article devoted to the topic. Even the Google Analytics blog itself has only three mentions of the term in all its years of existence. Their leading evangelist Justin Cutroni has just 1 (one) mention of the term on his blog. On the bright side: KISSMetrics (great innovators in web analysis) announced just several days ago that they are implementing statistical significance in their reports.
    […]

  32. […]
    When I run a test I want to be confident that the results are what I’m looking for. I usually use a tool like Avinash Kaushik’s Significance Calculator or Kissmetrics Calculator to decide whether or not my test is at least 95% confident. That is the perfect scenario.
    […]
