We all wish that our key internal partners, business decision makers, would use Web Analytics data a lot more to make effective decisions. How do we make recommendations / decisions with confidence? How can we drive action rather than pushing data? The challenge is how to separate **Signal** from **Noise** and make it easy to communicate that distinction.

This is where Excellent Analytics Tip #1, a recurring series, comes in. Leverage the power of Statistics.

Consider this scenario (A):

- You send out two offers to potential customers. Here is how the outcomes look:
- Offer One Responses: 5,300. Orders: 46. Hence Conversion Rate: 0.87%
- Offer Two Responses: 5,200. Orders: 55. Hence Conversion Rate: 1.06%

Is Offer Two better than Offer One? It does have a "better" conversion rate, by 0.19 percentage points. Can you decide which one of the two is better with *just* 46 and 55 orders? We got 9 more orders from 100 fewer visitors.

Applying statistics tells us that the results, the two conversion rates, are just 0.995 standard deviations apart and not statistically significant. This would mean that it is quite likely that it is **noise** causing the difference in conversion rates.

Consider this scenario (B):

- You send out two offers to potential customers. Here is how the outcomes look:
- Offer One Responses: 5,300. Orders: 46. Hence Conversion Rate: 0.87%
- Offer Two Responses: 5,200. Orders: 63. Hence Conversion Rate: 1.21%

Applying statistics will now tell us that the two numbers are 1.74 standard deviations apart and the results are statistically significant at 95% confidence (on a one-tailed test). 95% significance is a very strong **signal**. Based on this, and a sample of only 5k visitors and sixty-odd orders, we can confidently predict success.
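For readers who want to see where numbers like 0.995 and 1.74 come from, here is a minimal Python sketch of a two-proportion z-score. It assumes the unpooled standard error and a one-tailed reading of the result; the spreadsheet's exact formula may differ, and the function names are mine, not from any tool mentioned in this post.

```python
import math

def z_score(visitors_a, orders_a, visitors_b, orders_b):
    """Standard deviations separating two conversion rates.

    Uses the unpooled standard error, which is an assumption of this sketch.
    """
    rate_a = orders_a / visitors_a
    rate_b = orders_b / visitors_b
    std_err = math.sqrt(rate_a * (1 - rate_a) / visitors_a
                        + rate_b * (1 - rate_b) / visitors_b)
    return (rate_b - rate_a) / std_err

def one_tailed_confidence(z):
    """Share of the standard normal distribution that lies below z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

scenario_a = z_score(5300, 46, 5200, 55)
scenario_b = z_score(5300, 46, 5200, 63)
print(f"Scenario A: z = {scenario_a:.3f}, confidence = {one_tailed_confidence(scenario_a):.1%}")
print(f"Scenario B: z = {scenario_b:.3f}, confidence = {one_tailed_confidence(scenario_b):.1%}")
```

Scenario A lands at about one standard deviation (roughly 84% confidence), while Scenario B lands a little above 1.7 and clears the 95% bar on a one-tailed reading. Small differences from the figures quoted in the post come down to rounding and the choice of standard-error formula.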

Powerful benefits to presenting Statistical Significance rather than simply Conversion Rate:

- You are taking yourself out of the equation. It is awesome to say "according to the Gods of Statistics, here are the results…"
- Focusing on quality of Signal means that we appear smarter than people give us Analysts credit for.
- You then take *thinking* and *questions* out of the equation. Either something is Statistically Significant, and we take action, or we say it is not Significant and let's try something else. No reporting, just actionable insights.

Is this really hard to do?

No! Simply use the spreadsheet below, which comes to us via the exceedingly kind Rags Srinivasan:

**+** Statistical Significance Calculator

[Note: March, 2013: This is the updated version of the calculator. For a bit more context on version 2, please see this comment by Rags.]

In the spreadsheet you get even more bang for your buck. On sheet number one you can apply the 1-tailed or 2-tailed test to your statistical significance calculations. Here are the steps: Choose the test from the drop-down in cell D7. Complete cells B13, C13, B14 and C14 (essentially how many participants or visitors there were and how many conversions you got). In cell C18 you'll see whether the results were statistically significant or not.

In sheet number two, for those of you who are a bit advanced, you can apply the chi-squared test. This test is better suited to very small conversion rates (not unusual on the web). It is a more skeptical test with a higher threshold for differences. The benefit is that small statistical anomalies don't look like real differences.

When in doubt go with sheet number two, the chi-squared test.
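To make the chi-squared idea concrete, here is a small Python sketch of the textbook Pearson chi-squared test on a 2x2 table of converters and non-converters, run on Scenario B. It is an illustration under my own assumptions (no continuity correction, one degree of freedom) and is not necessarily the calculation on sheet number two of the spreadsheet.

```python
import math

def chi_squared_2x2(visitors_a, orders_a, visitors_b, orders_b):
    """Pearson chi-squared statistic and p-value for a 2x2 table (1 degree of freedom)."""
    observed = [
        [orders_a, visitors_a - orders_a],
        [orders_b, visitors_b - orders_b],
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed[i][j] - expected) ** 2 / expected

    # With 1 degree of freedom the chi-squared tail equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

statistic, p_value = chi_squared_2x2(5300, 46, 5200, 63)
print(f"chi-squared = {statistic:.2f}, p = {p_value:.3f}")
print("significant at 95%" if p_value < 0.05 else "not significant at 95%")
```

On Scenario B the statistic comes out near 3.0, short of the 3.84 needed for 95% with one degree of freedom, so this test does not call the difference significant even though the one-tailed comparison did. That is the more skeptical behavior described above.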

Two small tips:

- As a best practice, aim for 95% or higher Confidence. That is not always required, but it is recommended.

*"Statistics are like a bikini. What they reveal is suggestive, but what they conceal is vital."* – Aaron Levenstein

Agree? Disagree? Not really an Excellent Analytics Tip? Please share your feedback via comments.

Thank you for starting your excellent blog. I found you via Robbin Steif's LunaMetrics Blog.

Regarding the use of standard deviation as a means of interpreting order results: in general, I completely agree about the importance of removing bias from test results as much as possible.

What else do we need to know about the groups involved in the test?

Are they essentially the *same* group or are they two completely different groups? (I'm assuming you would want to send offers to as many people as possible; hence, they are the same group – less 100 people in the second case.)

How were members of the group(s) selected? Do you need a random sample in order to apply principles of standard deviation? 5,000+ is a good size group from which to draw conclusions, but I take it group members were not chosen randomly.

Dear Avinash,

Thank you for sharing your blog. This has been long awaited and definitely worth the time to read – subscribe to.

I'm wondering now, based on this article, how this model plays in with UX decisions when applying A/B tests. We often find minimal differences and often base our decision on the winner between the two.

Is there a way or method of filtering noise when the difference is minimal? Assuming all multivariate elements are correctly in place?

Congratulations on the great blog, the industry will soon be catching on!

Jeff

Hi Avinash,

I had the pleasure of hearing you speak at the 2005 eMetrics. I'm very happy that you've decided to blog. I too found your blog through Robbin Steif's .

It's excellent that you are giving us real examples of how statistics can be used, and providing tool references. I look forward to additional case studies and discussions.

Will you also be posting about monitoring and managing outside influences? Sometimes the Noise dampens the signal or deflects the signal.

Thanks,

Web: http://www.clickinsight.ca

Blog: clickinsight.blogspot.com

Mark McLaren: Thanks for your kind words about the post, I am glad you found it helpful.

In the specific example I used, and the spreadsheet, you usually control for one thing. You can have as many groups as you want. For example, you can send one offer to people who live in CA and NY and FL and OR and OH, plug that into the spreadsheet against a control, and know which works best.

Alternatively you could try 5, 6, 10 whatever number of different offers to a bunch of folks and see which one converts best.

The problem comes when you want to test different offers to different groups (or many different pieces of content in different locations on the same page). Now you are in the world of multivariate and need to apply advanced statistics (think Taguchi).

Doing multivariate is awesomely powerful and yields great results, but beyond my humble spreadsheet.

The beauty of using statistics is that the standard deviations required, and amount of Statistical Significance (my suggestion of 95% or higher), will drive how big a sample you need. There is no fixed number (like 5k).

Hope this is the kind of information you were looking for.
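As a rough illustration of how the required confidence drives sample size, here is a Python sketch using the standard normal-approximation formula for comparing two proportions. The 80% power figure and the function name are my assumptions (power is not mentioned in the post), and the rates are the rounded ones from Scenario A.

```python
import math
from statistics import NormalDist

def visitors_needed_per_offer(rate_a, rate_b, confidence=0.95, power=0.80):
    """Approximate visitors per offer needed to tell rate_a from rate_b (one-tailed test)."""
    z_alpha = NormalDist().inv_cdf(confidence)  # one-tailed critical value
    z_beta = NormalDist().inv_cdf(power)        # power is an assumed input
    variance = rate_a * (1 - rate_a) + rate_b * (1 - rate_b)
    n = (z_alpha + z_beta) ** 2 * variance / (rate_a - rate_b) ** 2
    return math.ceil(n)

print(visitors_needed_per_offer(0.0087, 0.0106))
print(visitors_needed_per_offer(0.0087, 0.0106, confidence=0.90))
```

Under these assumptions, telling 0.87% apart from 1.06% takes tens of thousands of visitors per offer, which is why roughly 5k each was not enough in Scenario A. Lowering the confidence requirement shrinks the number.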

Jeff: Glad to see your post…

If we are doing A/B testing (assuming the Success Goal is clearly articulated and measurable, and that it is not "impact on brand") then it would be a sin not to use the spreadsheet in the post above to separate *Signal* from *Noise*. Simply looking at the difference in Conversion Rate (or a similar metric) is very dangerous because of exactly what you say: how much is enough to be confident?

The great news is that most current A/B testing solutions (at least the ones that do "page testing") already include statistical computations to help us make better decisions.

If you don't see at least 90% statistical confidence, take the results with a grain of salt.

Hi Avinash,

I too am very happy to see your blog. I found it through Clint Ivy's blog and I am enjoying reading your posts very much. I find them to be quite informative.

You say above:

"You can easily adapt the spreadsheet, as we have, to compute statistical difference between absolute numbers (say you want to know if the difference Page Views Per Visitor or Average Time on Site between segment One and Two is Significant)"

It's not obvious to me how to do this. Can you elaborate?

Thanks.

Hi Avinash,

Good to find you blogging, sharing thoughts and experiences. It's quite interesting stuff and I hope you enjoy the experience.

I read your different posts on Saturday morning and your thoughts stayed with me for the entire week-end. Thank you.

Yes, statistical significance. I totally join you in the idea and would only add that tests that do not render truly significant results should not be communicated. I remember in my first job having warned of the non-significance of a test only to find it had heavily influenced a commercial strategy. I vowed never again!

Another Pavlovian reaction was to consider that any number of responses under 200 should not be taken into account, as it has a high probability of not being representative. I usually follow this first rule and adapt the variables in order to remain loyal to the statistical representativeness of a sample. Quite Pavlovian, I agree.

And the last thing is that I'll bear statistical significance in mind but would like to suggest another possible subject: correlation between conversion rates.

Siegert suggested this formulation for a client yesterday:

“Is a visitor engaging into A but not engaging into B, converted easier into a lead, than someone engaged into C and B?”

In other words, you've got kind of low level conversion events that influence or not higher goals.

I'm having difficulty formulating this, sorry.

Hope it made sense, keep up the good work, cheers from expensive Brussels ;-)

Aurélie

Aurélie: Thanks for the thoughtful comment. I am sorry to have spoilt your entire weekend with my posts; after all, there are so many more beautiful things in life. :)

I completely agree with the care around communicating anything that is not of significance; there is always a danger that in spite of your warning they will jump into the lake.

(For our readers here is something on pavlovian reaction.)

In the world of Multivariate we can detect a strong signal even with small samples. We use something like This Page to calculate sample size.

Correlations are important, very, and of course my simple little spreadsheet won't account for that. Especially for complex web interactions it is important to understand how the lower level conversion events might influence higher level (ultimate) goals.

Hi Avinash, my thanks also to you for sharing. Any additional insights you might have about the key drivers of adoption you've experienced would be greatly appreciated.

Regarding statistical significance, it appears that the reference in your post used a one tailed z test for testing whether there is a significant difference between two sample proportions. Wouldn't it have been more precise to use a two tailed test? If not, why?

Kerry: The example used was quite a simple one to show that we can accomplish much applying statistics to our standard KPI's with very little stress.

You are right, one can get quite sophisticated and get ever better results. The emphasis of the article was how to detect statistical significance in a simple case. I hope to blog more about how we can apply advanced methodologies in testing (to build on my experimentation and testing post).

Thanks for taking the time to post a comment.

Hi Avinash,

I so much agree on the importance of taking into account statistical significance – an essential part of the "so what" factor!

This is a neat chi square tool to test for statistical significance:

http://www.georgetown.edu/faculty/ballc/webtools/web_chi.html

I do love your blog, bye, Vicky

Although a 95% CL seems to be common (other than in a medical/pharmaceutical context), in a marketing context a lower CL may be quite appropriate. As you know, the choice of significance level (or its complement, Confidence Level) depends on the cost of being wrong.

A 5% significance level (95% CL) means accepting a 5% probability of Type I error, i.e., concluding that one RR% is higher than another (statistically significant) when in fact it is not. Acting on this wrong conclusion may result in incurring costs that do not yield revenue or profit to offset the costs.

Type II error is when one does not reject the null hypothesis when in fact it is false. In this case, the cost associated with the decision to not roll out a marketing tactic is the foregone revenue/profit that would otherwise have been generated.

In many situations, the cost of Type II error exceeds the cost of Type I error. Clearly, a trade-off is involved, but a lower CL of 90% or even 80% may not be out of line.

Ultimately, each business needs to decide for itself what an appropriate CL is for purposes of assessing test results.

Would be interested in your thoughts. Hakim

Regarding the question of 1-tail vs. 2-tail test, the former is appropriate when one wants to determine whether one RR% is statistically HIGHER (or LOWER) than another. The latter is appropriate if one wants to know if a RR% is DIFFERENT FROM another.

I would suggest that in most marketing situations, we are more interested in the former (higher than) than the latter (different from).

In a few cases, we may want to know whether a proposed course of action may harm the response rate, in which case a 2-tail test would be appropriate.
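The one-tailed versus two-tailed distinction is easy to see in numbers. Here is a short Python sketch that converts a z-score into both p-values, using the 1.74 from Scenario B in the post. The helper name is hypothetical.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()

def p_values(z):
    """Return (one-tailed, two-tailed) p-values for a z-score."""
    one_tailed = 1 - STANDARD_NORMAL.cdf(abs(z))
    return one_tailed, 2 * one_tailed

one_tailed, two_tailed = p_values(1.74)
print(f"one-tailed p = {one_tailed:.3f}, two-tailed p = {two_tailed:.3f}")

# Critical z-scores for 95% confidence under each test.
print(f"one-tailed cutoff: {STANDARD_NORMAL.inv_cdf(0.95):.3f}")
print(f"two-tailed cutoff: {STANDARD_NORMAL.inv_cdf(0.975):.3f}")
```

A z-score of 1.74 clears 95% on the one-tailed test (cutoff about 1.645) but not on the two-tailed test (cutoff about 1.96), so the choice of test has to be made before looking at the result.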

Thanks for the great insight. Do you have more details on exactly how the statistical significance is calculated? I'm curious how you derived the std deviations in the scenarios above with just the sample size and order counts.

this is to continue where mr hakim aly left. you may find this little piece at my blog, http://directindia.blogspot.com/2007/10/no-beta-yes-risk.html, interesting.

i've also given links for templates in the piece.

pac

The Analytical Group link for a free spreadsheet download appears to have changed. It's now http://www.analyticalgroup.com/sigtest.html (with html instead of htm).

Thanks for all your great articles, Avinash!

Note: Thanks very much for the correction Philip! -Avinash.

Avinash,

As a stats guy, I am a little concerned with the assumptions behind the model. I downloaded it and the first thing I noticed was that it makes some pretty large assumptions with confidence levels (anything with a z-score (I am assuming) of 1.65 and 2.33 = 95%). I understand that this is probably there to make things "easy" but I think it can be misleading. Also of note: IF there is a z-score assumption (which I am unsure of), there are some other assumptions underneath the covers (which I couldn't get to), and z-scores are only for known population means and std. dev. Do you know what Brian is using? Is it possible to get this information?

Finally, one concern with the "plug and chug" nature of the spreadsheet is what you always must be wary of, and that is making statistical significance a badge of honor. All it tells you is given the values you have, is there a difference between the two. What you must do, more than anything else, is make sure your testing methods are solid BEFORE the test. Otherwise, you are going to be putting in values after values and getting significance or not and making some very important decisions when the whole test could be wrong. Practical vs. Statistical.

all -

i share similar concerns as jbuser…

for example, 1.74 standard deviations is not 95% significance. 1.96 standard deviations is 95% confidence.

also, for computing standard deviation (s) of 2 or more proportions (in this case, conversion rate), i think it's a good practice to assume the largest margin of error. in this case, margin of error = 2 * sqrt(0.5^2/n), where n is the size of your smallest sample. this is what pollsters such as zogby do when communicating poll results about hillary/obama, etc.
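For anyone who wants to try that conservative margin-of-error rule, here is a small Python sketch of the formula exactly as written in the comment above, applied to the smaller sample from the post (5,200). The function name is mine.

```python
import math

def conservative_margin_of_error(n):
    """Worst-case margin of error for a proportion: 2 * sqrt(0.5^2 / n)."""
    return 2 * math.sqrt(0.5 ** 2 / n)

margin = conservative_margin_of_error(5200)
observed_gap = 55 / 5200 - 46 / 5300  # Scenario A from the post
print(f"margin of error: {margin:.2%}, observed gap: {observed_gap:.2%}")
```

By this rule the margin is well over one percentage point, far wider than the 0.19-point gap in Scenario A. Note that the rule assumes a rate near 50%, so it is very conservative for conversion rates around 1%.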

Hi Avinash,

Thanks a million for your great posts. I just got into web analytics and have found both this blog and your book extremely helpful. You mentioned utilizing the statcalc.xls spreadsheet to measure significance between pageviews. Can you kindly advise on how to do this? Your response is eagerly awaited. :)

The link for more advanced stats – http://www.mwrms.com/wwwRMS/DirectMarketing/MarketingCalc2.asp does not seem to have anything to do with stats anymore. Unfortunately I couldn't find a live replacement.

What an insightful post Avinash.

For someone who is just approaching web analytics from a statistical perspective, I'm constantly asked whether a particular report figure is "statistically significant" in terms of sample size. I readily understand the statcalc worksheet purpose of seeing whether the difference between two metrics is stat. significant, but how do you know if the metrics have enough sample size to gauge effectively?

For example, I have two conversion rates: 0.7% and 8.8%. The sheet shows that there is a 99% confidence that these are statistically different which is a good start. But my rates are based off of A(4,155 clicks & 6 conversions) and B(80 clicks and 7 conversions). Are the conversion samples too small to be effective indicators in this case? What is an acceptable threshold here, and how do you find it on a case-by-case basis?

Thank you!

Avinash, this post is giving me nightmares and heart palpitations.

My problem is similar to Lea SP's, but on steroids. I have two sample populations:

Pop A: 200,000 participants, 800,000 conversions

Pop B: 1,000 participants, 8,000 conversions

Unfortunately, conversions in this case don't generate direct cash, or else I'd be buying an island somewhere, and not worry about calculating significance!

But obviously, there should be something to be said of Pop B having way lower sample size and not being representative or even warranting comparison to Pop A, right?

Regards,

Garry

Since I am new to SEO, I want to implement statistics in SEO. I thought that statistical techniques could not be applied in SEO or to the data given by Google Analytics. Can you suggest any statistical technique, with an example, to my mail id? It would be useful for me and I would be grateful to you.

I am on my way to becoming an analysis ninja. My two roadblocks thus far are JavaScript and statistics (KPI's? check. Segmentation? double check.)

I am wondering if you could recommend any good entry-level books that could help teach me to do statistical analysis? Anything that uses web analytics scenarios is a *huge* plus.

Thanks!

Adrian: I am afraid I don't know any book on Statistics that covers web analytics scenarios. Though if you understood statistics I don't think anything would stand in your way in terms of applying it to challenges you face in your web analytics job.

Taking a Statistics 101 or 201 course at your local university might just do the trick.

This might seem odd but one of the best books I have read on Statistics is this one:

Cartoon Guide to Statistics by Larry Gonick, Woollcott Smith

Simple and effective.

Not necessarily just about statistics but this is an awesome book if you want to be a great analyst:

Fooled by Randomness: The Hidden Role of Chance in Life and in the Markets

by Nassim Nicholas Taleb

Hope this helps a bit.

Avinash.

Thank you for the easy to understand description above. I've found it very useful, however, I work for an ecommerce site that has a price range of anywhere from $3 an item to $299 an item. So, I feel like in some situations only looking at conversion rate is looking at 1 piece of the puzzle.

I've often used sales/session or tried to factor in AOV when looking at conversion, but I've had a lot of trouble coming up with a statistical method to ensure my tests' relevance. I can check to see if both conversion and AOV pass a null hypothesis test, but in the case that they both do, I'm back at square one.

Can anyone recommend a statistical method for this scenario?

Thanks Avinash for the article, love your blog (and books)!

Dave,

You are correct in stating that looking at conversion rate alone is looking at one part of the puzzle.

When you have items that vary in price, like you said from $3 to $299, your test for statistical significance of difference between conversion rates assumes an implicit hypothesis that is treated as given.

A1: The difference in conversion rates does not differ across price ranges.

and your null hypothesis (same, just added for completeness)

H0: Any difference between the conversion rates is due to randomness

When your data tells you that H0 cannot be rejected, it is conditioned on the implicit assumption A1 being true.

But what if A1 is false? Either you explicitly test this assumption first or, as a simpler option, segment your data and test each segment for statistical significance. Since you have a range of price points I recommend you test over 4-5 price ranges.

This is the same as the case when you are A/B testing simple conversion rates and treat the population as the same (no male/female difference, no Geo specific difference etc).

Hope this helps.

-rags

Thank you Rags, very helpful. I'll use the segmentation method in my next test. Unfortunately, this means waiting for a larger sample size than non-segmented data. However, I suppose it's worth it. Thanks!

Hi,

What do you think about using Fisher's Exact Test over the Chi-squared test? The 2nd link below suggests that using Fisher's Exact Test is the thing to do.

http://www.langsrud.com/fisher.htm

http://www.graphpad.com/faq/viewfaq.cfm?faq=

Thanks,

Lisa

The 2nd link got cut off, there's a 1114 after the equal sign:

http://graphpad.com/faq/viewfaq.cfm?faq=1114

Lisa

First, I see that you are from Boulder! Go Boulder! My hometown forever!

Yes, Fisher's Exact Test is a candidate here but it is needed only when we are seeing conversion numbers as tiny as 5 or lower. With Fisher's Exact Test, when you are doing the calculations with a handheld calculator the permutations can get ugly for large numbers, but with Excel we don't have that issue. Yet, for larger numbers the Chi-square test does well, eliminating the need for an exact test.

On the other hand if the conversion rates are so low one should not even be wasting time over statistical anomalies (there may be statistical significance but is there economic significance)?

Chi-square test is more skeptical than t-test and Fisher's Exact test is even more skeptical than Chi-square test. For most split tests that test the hypothesis that the two groups are the same, Chi-square will do.

All that said I should admit that I am biased in picking a candidate test that I can explain to others in simple terms. It is relatively easier for me to explain t-test and chi-square test to those who are not statistically inclined (see here: http://iterativepath.wordpress.com/2011/07/03/a-closer-look-at-ab-testing/ ) than it is to explain Fisher's test.

Regards

-Rags
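For the curious, Fisher's Exact Test is short to write down once a computer handles the large binomial coefficients. Here is a minimal one-sided Python sketch, run on the tiny-count example from Lea SP's comment earlier in this thread (6 conversions from 4,155 clicks versus 7 from 80). The function name is hypothetical and this is not the spreadsheet's method.

```python
from math import comb

def fisher_one_sided(conv_a, n_a, conv_b, n_b):
    """One-sided Fisher's exact p-value.

    Probability that group B ends up with at least conv_b of all conversions
    if conversions were spread across visitors purely at random
    (hypergeometric null).
    """
    total_conv = conv_a + conv_b
    total_n = n_a + n_b
    all_ways = comb(total_n, n_b)
    most_possible = min(total_conv, n_b)
    tail = sum(
        comb(total_conv, k) * comb(total_n - total_conv, n_b - k)
        for k in range(conv_b, most_possible + 1)
    )
    return tail / all_ways

p = fisher_one_sided(conv_a=6, n_a=4155, conv_b=7, n_b=80)
print(f"one-sided p-value: {p:.2e}")
```

With counts this lopsided the one-sided p-value is vanishingly small, so the difference is very unlikely to be noise. Whether a result built on 13 total conversions is economically significant is a separate question.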

Thanks so much Rags for your reply. And yay for Boulder! I'm using your great spreadsheet tool. Thanks! So if I get a YES for statistical significance using the t-test, but not when using the chi-squared test, would you still say YES, the test was significant? Is the t-test good enough in your opinion?

Numbers for A are: 33548 conversions out of 212460 participants

Numbers for B are: 33371 conversions out of 208143 participants

Lisa

You need to decide whether or not you want to do Chi-square test or t-test.

Chi-square test is more skeptical and it likely will more often find the difference to be not statistically significant.

Since it operates on just the conversions, the procedure that uses the Chi-square test also eliminates other hidden hypotheses that are prevalent with a procedure that uses the t-test.

One thing you should not do is use the Chi-square test, see it fail, then use the 2-tailed t-test, see it fail, and use the 1-tailed t-test and see it pass.

http://iterativepath.wordpress.com/2010/06/01/is-your-ab-testing-tool-finding-differences-that-arent/

-Rags

Hi Rags,

Me again! You say that the Chi-square test is more skeptical than the t-test. I'm confused because I'm finding that the 2-tail t-test is more skeptical than the Chi-square test. Any thoughts on that?

I'm using your spreadsheet, thanks for making it available!!

Control – 10164 participants, 584 conversions

Treatment – 18928 participants, 1121 conversions

Is it working right if I see that the 2-tailed test does not pass statistical significance, but the Chi-square test does indicate statistical significance? Perhaps the 2-tailed t-test is more skeptical than the Chi-square test?

thanks!

Lisa

Hi,

I have results based on a sample size of X, and the results fall into two buckets. How do I figure out whether those individual buckets are statistically significant enough to draw a conclusion?

Thanks,

Denisse

Just downloaded the spreadsheet, and the chi-sqr tab isn't taking into account the number of test participants in each variation. That can't be right.

Should "Expected value" be a weighted average rather than a straight average of conversions, or something more complex?

Bryan

You are correct that the chi-square tab does not use number of test participants. I made a simplifying assumption that your A/B test evenly distributes (50-50) between two groups. Hence you see Row 24 calculating Expected value as 50% of total conversions.

To use test participants, fix the Row 24 formula as follows (find the Expected value for Control and Treatment separately, adding another column):

Expected value for Control B24= ($B15+C15) * $B14/($B14+$C14)

Expected Value for Treatment C24= (B15+C15) * $C14/($B14+$C14)

You'll notice the two will be same if the test participants are evenly split.

Then you'll change Row 25 as

=((B24-B15)^2)/B24 + ((C24-C15)^2)/C24

-Rags
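The same fix is easy to check outside Excel. Here is a Python sketch of the formulas above, comparing the original 50-50 expected value with the participant-weighted one, using the Control and Treatment numbers Lisa posted earlier (10,164 participants with 584 conversions; 18,928 with 1,121). Function and variable names are mine.

```python
import math

def chi_square_on_conversions(participants, conversions, weighted=True):
    """Chi-square on the conversion counts of two groups.

    weighted=True splits the expected conversions by each group's share of
    participants; weighted=False assumes an even 50-50 split.
    """
    total_conversions = sum(conversions)
    total_participants = sum(participants)
    statistic = 0.0
    for n, observed in zip(participants, conversions):
        share = n / total_participants if weighted else 0.5
        expected = total_conversions * share
        statistic += (expected - observed) ** 2 / expected
    p_value = math.erfc(math.sqrt(statistic / 2))  # 1 degree of freedom
    return statistic, p_value

participants = [10164, 18928]
conversions = [584, 1121]
for weighted in (False, True):
    statistic, p_value = chi_square_on_conversions(participants, conversions, weighted)
    label = "weighted" if weighted else "50-50"
    print(f"{label}: chi-square = {statistic:.2f}, p = {p_value:.3f}")
```

With the 50-50 assumption the uneven split alone produces a huge chi-square value. With weighted expected values the statistic drops well below the 3.84 cutoff for 95%, which lines up with the 2-tailed test finding no significant difference.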

Thanks Rags and Bryan.

That helps clarify the issue I was seeing when comparing the t-test to the chi-square test.

Rags, thanks for the update. Avinash, thanks for posting the new file. Lisa, glad to help.

Can you guys tell me if this is appropriate to use in cases where conversions *exceed* visitors? I'm testing a "related articles" component that appears on article pages on a media site, and using page views as the conversion metric.

Thanks!

I've run into this as well. Did you ever get an answer? I don't see it in this section.

Hi Avinash..

I liked the blog very much. i am in the process of developing more statistical insights and it would be great if you could suggest ways that can help me.. an informative blog or statistical bible for web analytics.. your inputs would be highly desirable..

S M: If you are serious about this then a course at a local university (or online) is a good idea. Even something that just goes into Statistics 101 and perhaps the 201 level would be great.

There are a couple of books that are a great (and a lot of fun) intro as well.

You can download, free, How To Lie With Statistics.

I also very much like this one: Cartoon Guide to Statistics by Larry Gonick, Woollcott Smith

Finally, for something that moves beyond statistics, but still in the rough general neighbourhood, check out Fooled by Randomness: The Hidden Role of Chance in Life and in the Markets by Nassim Nicholas Taleb

-Avinash.

Thanks Rags for this post.

Everything I can find regarding significance is based on conversion, but the final goal for an ecommerce site is to increase sales (increasing conversion is just a means, and sometimes it can be misleading: you can increase conversion, but if your AOV goes down your change could be impacting your sales negatively).

So my question is, why do we use conversion instead of sales ($)? Could the significance be calculated based on sales? How could this be done?

Thanks,

Diego

This makes complete sense. I've seen others just add up the dollar value based on the conversion type. The problem is that they aren't necessarily significant. I've read that it takes additional statistics to see if an alternative value wins. However I haven't seen a step by step approach to show how.

Hi Avinash,

You've been my inspiration for quite a few years now, and your posts always bring wisdom to web analytics. Thank you for that!

I'd like to note that the Excel sheet calculator that you link to is actually pretty much useless for practical purposes. Yes, it will do in a Baseline vs Test scenario, but how many times does a marketer face this situation in the real world? Not very often.

Usually we have a Baseline vs Test 1 vs Test 2 vs Test 3 and so on. Depending on what you are testing (ad copy, landing pages, etc.) the number of variants you'll be testing against can quickly go way above the trivial 1 or 2 tests.

Then you might say – but can't I simply do a pairwise comparison: each Test vs the Baseline and then take the best one? The short answer is no. The slightly longer answer is "Because of the Multiple Comparison Problem in Statistics", about which there is enough written all over the web.

Thus you need a calculator that corrects for that in an intelligent way (too strict and you'll be correcting for things very rarely required/encountered in the real world, e.g. the Bonferroni correction).

I've devised and published a calculator that handles many test variants easily (bulk input as well, copy-paste from an Excel table) and corrects for the multiple-comparison error intelligently (the Benjamini-Hochberg-Yekutieli method). I believe it to be the most suitable calculator on the market right now. It's available here – analytics-toolkit.com/statistical-calculators/ and you can sign up for free to use it (no newsletters and shit!). Feel free to give it a try and let me know what you think :-)
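To show what such a correction does, here is a hedged Python sketch of the Benjamini-Yekutieli step-up procedure next to a naive per-test cutoff. I am assuming this is the method the comment refers to; the code is not taken from the calculator mentioned above, and the p-values are made up purely for illustration.

```python
def benjamini_yekutieli(p_values, alpha=0.05):
    """Return a list of booleans: True where the variant survives the correction."""
    m = len(p_values)
    # Harmonic-sum penalty that makes the procedure valid under dependence.
    c_m = sum(1.0 / i for i in range(1, m + 1))
    order = sorted(range(m), key=lambda i: p_values[i])

    # Step-up rule: find the largest rank whose p-value is under its threshold.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / (m * c_m):
            cutoff_rank = rank

    survives = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            survives[idx] = True
    return survives

# Hypothetical p-values for Test 1..4, each compared against the Baseline.
illustrative_p_values = [0.001, 0.04, 0.03, 0.20]
naive = [p < 0.05 for p in illustrative_p_values]
corrected = benjamini_yekutieli(illustrative_p_values)
print("naive:    ", naive)
print("corrected:", corrected)
```

With these illustrative p-values, three variants pass a naive 0.05 cutoff but only one survives the correction.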

*Very* Helpful, Avinash.

I was having a hard time visualizing confidence in A/B tests until I read this. The key is to mentally overlay the graph for A and B by knowing their stddev. Your page clued me into that.

Thanks!

Dug out this article for a re-read given that I'm currently redesigning some Excel based testing tools. I've always been a big fan of statistical confidence, indeed it's one of the mandatory criteria in my book before making a call on an AB test.

While it's simple enough to calculate stat confidence for conversion rates, there are (imho) other more important metrics which I add into the mix of decision making factors…such as revenue per visit, average order value and margin per order. So my question is this – given that these metrics aren't percentages (i.e. they can be greater than 1 and less than 0), what's the appropriate method to calculate statistical significance for them? I'm sure I have it somewhere, but wanted to draw on the experiences of you and your readers when encountering this!