MEDVERSATION® is brought to you by Centocor Ortho Biotech Inc.
Introduction to Statistics

Type I and Type II Errors, Alpha-Levels and Beta-Levels, and Other Rules of Statistical Etiquette

The example we just worked through illustrates a couple of basic rules of the game of statistical inference. The starting point is almost always to assume that there is no difference between the groups; they are all samples drawn at random from the same statistical population. The next step is to determine the likelihood that the observed differences could be caused by chance variation alone. If this probability is sufficiently small (usually less than 1 in 20), then you “reject” the null hypothesis and conclude that there is some true difference between the groups. So, you are concluding that the independent variable (IV) had some effect on the dependent variable (DV) and that the samples therefore came from different populations, the “alternative hypothesis” (H1). That is the meaning behind all those p < 0.05s and p < 0.0001s that appear in the literature. They are simply statements of the probability that the observed difference could have arisen by chance.

Going back to the “reader IQ” experiment, we concluded that the population IQ of readers was greater than 100. Suppose, to carry on the demonstration, that it was actually 110 and not 107.5, the sample estimate. The distributions of means of samples of size 25 drawn from the two populations (readers and the general population) could then be pictured as in Fig.1130. The small area of the general population curve to the right of the sample mean is the probability that we could have observed a sample mean of 107.5 or more by chance under the null hypothesis. This is called the alpha level (α level), the probability of incorrectly rejecting the null hypothesis, and the resulting error is called, for no apparent reason, a Type I error. In this case, as we calculated, the probability of concluding that there was a difference when there wasn’t, and of committing a Type I error, is 0.006. By convention, statisticians generally use an alpha level of 0.05 as a minimum standard (which means that when there really is no difference, one study out of every 20 will still report a significant one).
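The 0.006 quoted above can be reproduced without a printed z table. What follows is a minimal sketch using Python's standard-library `statistics.NormalDist`; the variable names are illustrative, and the population standard deviation of 15 is the value used elsewhere in this example.

```python
from math import sqrt
from statistics import NormalDist

null_mean = 100.0     # mean IQ of the general population (H0)
pop_sd = 15.0         # population standard deviation of IQ
n = 25                # sample size
sample_mean = 107.5   # observed mean IQ of the readers

standard_error = pop_sd / sqrt(n)                 # 15 / 5 = 3.0
z = (sample_mean - null_mean) / standard_error    # 7.5 / 3 = 2.5

# One-tailed p value: area of the H0 curve to the right of the sample mean.
p_value = 1.0 - NormalDist().cdf(z)

print(f"SE = {standard_error:.1f}, z = {z:.2f}, p = {p_value:.4f}")
```

The printed p value rounds to 0.006, the Type I error probability for this sample.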

So far, so good. The study showed a difference of 7.5 points, and when we worked it out, we found (to our delight) that the likelihood that this could have occurred by chance was so small that we were prepared to conclude that the groups differed. But suppose it hadn’t come out that way. Suppose that the observed difference had turned out to be 4.5 and not 7.5, in which case the z score would be 1.5 and the probability would be 0.067. Alternatively, if we had started with a sample size of 9, not 25, the alpha level would be more than 0.05 because the standard error would be 15/√9 = 5.0, the z score would be 1.5, and again the p value would be 0.067. In both cases, this isn’t small enough to conclude a significant difference.
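Both "what if" scenarios can be checked the same way. This is a sketch only: the helper `one_tailed_p` is a hypothetical name introduced for illustration, and it assumes a one-tailed test with a known population standard deviation of 15.

```python
from math import sqrt
from statistics import NormalDist

def one_tailed_p(difference, pop_sd, n):
    """Return (z, p) for an observed mean `difference` points above the null mean."""
    standard_error = pop_sd / sqrt(n)
    z = difference / standard_error
    return z, 1.0 - NormalDist().cdf(z)

# Scenario 1: smaller observed difference, original sample size of 25.
z_small_diff, p_small_diff = one_tailed_p(difference=4.5, pop_sd=15.0, n=25)

# Scenario 2: original difference of 7.5, smaller sample of 9.
z_small_n, p_small_n = one_tailed_p(difference=7.5, pop_sd=15.0, n=9)

print(f"difference 4.5, n = 25: z = {z_small_diff:.2f}, p = {p_small_diff:.3f}")
print(f"difference 7.5, n = 9:  z = {z_small_n:.2f}, p = {p_small_n:.3f}")
```

The two scenarios give the same z because shrinking the difference and inflating the standard error change the ratio in the same way.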

In these circumstances, we might be concerned that we might not have had a big enough sample to detect a meaningful difference. How do we figure out whether we actually had a chance at detecting a meaningful difference? To work this one out, we have to first decide what is a meaningful difference, a task that has had monastic philosophers engaged in solemn contemplation since medieval times. For obvious reasons, then, we won’t get into that particular debate here. One quick rule of thumb in such quandaries is to say “10%.” So let’s suppose we wanted to detect a difference of 10 points (or 5, or 20, or 15—the choice is yours)…

Figure 1130 – Figure 3-5: Alpha and beta errors. [5343]

Figure 1131 – Figure 3-6: Relationship between power and difference between means, standard deviation, and sample size. [5343]

The next step is to figure out exactly the circumstances under which we would either declare “significant difference” or “no difference.” That is, if we look at the H0 distribution in Fig.1130, there is a critical value to the right of the mean, so that exactly 0.05 of the distribution is to the right of it and 0.95 is to the left of it. And if the observed mean falls to the right of it, then the alpha level is less than 0.05, and we party. If the mean is to the left of the critical value, we skulk off and drown our sorrows. (Note that whatever the conclusion, we get to have a drink.)

So where is this critical value? If we go back to the same table that gave us the z values (in some real stats book), we can work backwards from a p of 0.05 to discover that this equals a z value of 1.64. Since the SE is 3.0, the critical value lies 1.64 × 3 = 4.92 points above the null mean of 100, that is, at 104.92. So if the observed mean falls to the right of 104.92, it’s “significant”; to the left, it’s not.
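Working backwards from a tail area to a z value is what an inverse normal function does. The sketch below uses `NormalDist.inv_cdf` from the Python standard library and assumes a one-tailed test at an alpha of 0.05, with the null mean of 100 and SE of 3.0 from the example; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

alpha = 0.05
null_mean = 100.0
standard_error = 15.0 / sqrt(25)   # 3.0

# z value that leaves exactly `alpha` of the H0 curve to its right.
z_critical = NormalDist().inv_cdf(1.0 - alpha)

# Convert back to the IQ scale: the cutoff sits z_critical SEs above the null mean.
critical_mean = null_mean + z_critical * standard_error

print(f"z critical = {z_critical:.3f}, critical mean = {critical_mean:.2f}")
```

The unrounded z is closer to 1.645 than to the tabled 1.64, so the computed cutoff differs from the hand calculation by about a hundredth of an IQ point.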

If we now turn to the “meaningful difference” of 10 points (a true reader mean of 110), the question we want to ask as statisticians is, “What is the likelihood that we would have failed to detect a difference of 10 points?” or, more precisely, “What is the likelihood that we would have ended up with an observed mean of 104.92 or less from an alternative distribution (the H1 distribution) of reader IQs, centered on 110 with the same SE?” The situation is as depicted in Fig.1130.

Analogous to the calculation of the alpha level, we can now work out the proportion of the H1 (reader) curve to the left of 104.92. This is just a z of (104.92 − 110)/3 = −1.69, and the associated probability, from the same table we used all along, is 0.046. So the probability of concluding there was no difference in IQ, assuming the true IQ of readers is 110, is 0.046. We call this, in analogous fashion, the beta level (β-level), and the associated error, a Type II error.
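The beta level can be computed directly as the area of the H1 curve below the critical value. This is a minimal sketch that takes the cutoff of 104.92, the H1 mean of 110, and the SE of 3.0 from the example; the variable names are illustrative.

```python
from statistics import NormalDist

critical_mean = 104.92   # one-tailed cutoff on the IQ scale, from the example
alt_mean = 110.0         # assumed true mean IQ of readers (H1)
standard_error = 3.0

# Sampling distribution of the mean if H1 is true.
h1 = NormalDist(mu=alt_mean, sigma=standard_error)

z = (critical_mean - alt_mean) / standard_error   # negative: cutoff lies below the H1 mean
beta = h1.cdf(critical_mean)                      # area of H1 to the left of the cutoff

print(f"z = {z:.2f}, beta = {beta:.3f}")
```

The computed beta comes out near 0.045, a shade below the tabled 0.046, because the table lookup rounds z to two decimals.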

One final trick. The area of the H1 curve to the right of the critical value is the likelihood of declaring that there is a difference (rejecting H0) when there really is. This value, (1 - β), is called the power of the test. Among other things, power is directly (though not linearly) related to the sample size. This is so because as the sample size increases, the standard error decreases, so the two curves overlap less and the power goes up.
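To see how power climbs with sample size, the calculation can be wrapped in a function and run over several values of n. This is a sketch under stated assumptions: a one-tailed test at alpha 0.05, a true difference of 10 points, and a standard deviation of 15. The function name `power_one_tailed` and the list of sample sizes are chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist

def power_one_tailed(n, true_difference=10.0, pop_sd=15.0, alpha=0.05):
    """Probability of rejecting H0 when the true mean is `true_difference` above it."""
    standard_error = pop_sd / sqrt(n)
    z_critical = NormalDist().inv_cdf(1.0 - alpha)
    # Critical value expressed as a distance above the null mean.
    critical_difference = z_critical * standard_error
    # Sampling distribution of the observed difference if H1 is true.
    h1 = NormalDist(mu=true_difference, sigma=standard_error)
    # Power is the area of the H1 curve to the right of the critical value.
    return 1.0 - h1.cdf(critical_difference)

sample_sizes = [4, 9, 16, 25, 36]
powers = [power_one_tailed(n) for n in sample_sizes]

for n, power in zip(sample_sizes, powers):
    print(f"n = {n:2d}: power = {power:.3f}")
```

The increase is steep at small n and flattens as power approaches 1, which is the "directly though not linearly" relationship described above.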

Power is an important concept when you’ve done an experiment and have failed to show a difference. After all, as previously noted, the likelihood of detecting a difference that is there (the power) is directly related to sample size. If a sample is too small, you can’t show that anything is statistically significant, whereas if a sample is too big, everything is statistically significant. All this is shown in Fig.1131. As our first stats prof used to say, “With a big sample, you’re doomed to statistical significance.”

However, if the difference is not significant, all those yea-sayers (who really want to show that the newest, latest, and most expensive drug really is better) will automatically scream, “Ya didn’t have enough power, dummy!” Don’t panic! As the preceding discussion shows, power depends not only on sample size but also on the size of the difference you want to detect. And because the experiment did not reject the null hypothesis, you have no more idea of how big the “true” difference is than you did before you started. After all, if you knew that, there would be no reason to do the study. So, if you guess at a really big difference, you will find enough power.
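One way to make this concrete is to turn the question around and ask how big a difference a given sample could detect. The sketch below assumes a one-tailed test at alpha 0.05 and a target power of 0.80; that target is a common convention and is not taken from the text. The function name `detectable_difference` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def detectable_difference(n, pop_sd=15.0, alpha=0.05, power=0.80):
    """Smallest true difference that a one-tailed z test detects with the given power."""
    standard_error = pop_sd / sqrt(n)
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)   # cutoff under H0, in SE units
    z_power = NormalDist().inv_cdf(power)         # distance from cutoff to H1 mean, in SE units
    return (z_alpha + z_power) * standard_error

diff_n9 = detectable_difference(9)
diff_n25 = detectable_difference(25)

print(f"n = 9:  detectable difference = {diff_n9:.2f} IQ points")
print(f"n = 25: detectable difference = {diff_n25:.2f} IQ points")
```

Any sample has "enough power" for some difference; the smaller the sample, the larger that difference has to be.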

Conversely, on one occasion, when we had reported a significant difference at the < 0.001 level with a sample size of approximately 15 per group, one intellectually challenged reviewer took us to task for conducting studies with such small samples, saying we didn’t have enough power. Clearly, we did have enough power to detect a difference because we did detect it.

One final wrinkle to really test your mettle. Imagine that you have just completed an experiment (for the sake of argument, with two groups and measurements of IQ). You have just calculated the appropriate significance test, and it works out to a probability of 0.049999. Appropriately, you reject the null hypothesis and declare a significant difference. What is the likelihood that if you repeated the experiment exactly as before, you would come to the same conclusion and reject the null hypothesis? Most people say 95%. In fact, this question is posed to our stats class every year, with $2.00 going to anyone who gets the answer. So far, it has cost us $4.00. The answer, believe it or not, is exactly 0.50; that is, if the experiment rejected H0 at exactly the 0.05 level, the chance of replicating the experiment is 50%! To see why, look at Fig.1132.

Well, not exactly 0.50. A very astute reader challenged us on this point just before this edition came out. He pointed out that we are assuming, in our logic, that H1 is correct. But there is still a chance that H0 is correct. The relative probabilities of each are measured by the height of the two curves: H0 at z = 1.96 and H1 at z = 0, which turn out to be about 0.05 and 0.40. So the actual probability of rejecting H0 a second time is [(0.40 × 0.50) + (0.05 × 0.05)] / (0.40 + 0.05) = 0.45. [5343]
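The replication argument can be sketched numerically. The code below follows the reasoning in the text: a replicate rejects H0 half the time if H1 is true and centered on the observed result, and 5% of the time if H0 is true, with each case weighted by the height of its curve at the observed result and the weights normalized by their sum. The variable names are illustrative, and treating the curve heights as relative probabilities is the text's simplification.

```python
from statistics import NormalDist

std_normal = NormalDist()
z_observed = 1.96   # a result that just reaches significance, as in the text

# If H1 is true and centered on the observed result, a replicate lands
# beyond the same cutoff half the time.
p_reject_if_h1 = 1.0 - NormalDist(mu=z_observed, sigma=1.0).cdf(z_observed)

# If H0 is true, a replicate rejects with probability alpha (value used in the text).
p_reject_if_h0 = 0.05

# Relative plausibility of each hypothesis: curve heights at the observed result.
weight_h0 = std_normal.pdf(z_observed)   # H0 curve at z = 1.96
weight_h1 = std_normal.pdf(0.0)          # H1 curve at its own center

p_replicate = (
    weight_h1 * p_reject_if_h1 + weight_h0 * p_reject_if_h0
) / (weight_h1 + weight_h0)

# Same calculation with the rounded heights quoted in the text.
p_rounded = (0.40 * 0.50 + 0.05 * 0.05) / (0.40 + 0.05)

print(f"exact heights: {p_replicate:.3f}, rounded heights: {p_rounded:.2f}")
```

With unrounded curve heights the result is slightly below the 0.45 obtained from the rounded values of 0.40 and 0.05.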

Content on this page was last changed on March 19, 2009.

© 2002 BC Decker Inc.

References:

5343. Norman GR, Streiner DL. PDQ Statistics. 3rd ed. Hamilton, Ontario: BC Decker Inc.; 2003.


Last Complete Site Update On: July 22, 2010