
The trouble with significance testing

Significance testing is often insufficient to reach clear conclusions and prone to incorrect interpretations.

You have probably heard of the p value, the ubiquitous quantity that shows up in research articles discussing the statistical significance of observed differences. The drugs we test in medicine and the interventions we evaluate in the social sciences are often hawked or dismissed based on claims of statistical significance, or the lack thereof, as measured by the p value. It is also widely used in animal advocacy research.

But the p value is often not only insufficient for reaching clear conclusions; it is also prone to incorrect interpretation by scientists and laypersons alike. It is more clarifying and meaningful for researchers to report confidence intervals and, even better, to provide descriptive statistics of all data, including frequency distributions of the observed effects. Good statistical practice in research has recommended these replacements for the p value since at least the 1980s. For this reason, Humane League Labs will not use or report p values exclusively, but only together with confidence intervals and the comprehensive descriptive statistics necessary for robust inferences. Let's examine why.

What is the p value?

Assume I want to test the hypothesis that handing out "go vegetarian" leaflets to a population increases the number of vegetarians in the population. I divide the population into two groups: an experimental group which receives my leaflets and a control group which does not. Since I cannot follow up with everyone in the population, I will use a random sample from each group to measure the effect of the leaflets. If it were indeed true that my leafleting increased the number of vegetarians in the population, I still may not observe this effect in the random samples I chose, purely because of chance! Similarly, if my leafleting had no effect on the number of vegetarians in the population, it is still possible I will see an effect in the random samples I chose.

Let's say that, after leafleting, I find a larger percentage of vegetarians in the experimental sample than in the control sample, indicating the possibility of a non-zero effect of leafleting. The p value is the probability that my experiment would find the same or a larger effect under the assumption that there is no real effect. In our case, the p value is the probability of finding the same or a larger percentage difference between the experimental and the control samples even if it were true that leafleting makes no difference at all. Obviously, we want as small a value of p as possible for our experiments.
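
To make this definition concrete, here is a minimal Python sketch of a simulation-based, one-sided p value for the leafleting experiment. The group sizes and vegetarian counts are hypothetical, invented purely for illustration, and a real study might use a different test; the point is only to show the logic of asking how often chance alone would produce a difference at least this large.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical numbers, for illustration only.
n_treat, n_control = 500, 500        # sample sizes in each group
veg_treat, veg_control = 30, 18      # observed vegetarians in each sample
observed_diff = veg_treat / n_treat - veg_control / n_control

# Under the null hypothesis, leafleting makes no difference, so both
# groups share one underlying vegetarian rate; estimate it by pooling.
pooled_rate = (veg_treat + veg_control) / (n_treat + n_control)

# Simulate the experiment many times under that null hypothesis and
# count how often chance alone yields a difference at least as large.
n_sims = 100_000
sim_treat = rng.binomial(n_treat, pooled_rate, n_sims) / n_treat
sim_control = rng.binomial(n_control, pooled_rate, n_sims) / n_control
p_value = np.mean(sim_treat - sim_control >= observed_diff)

print(f"observed difference: {observed_diff:.3f}")
print(f"simulated one-sided p value: {p_value:.3f}")
```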

To claim statistical significance, it is conventional to aim for p < 0.05: a chance of less than 1 in 20 that an effect of the same or larger size would be observed even when there is no actual effect. The point of reporting a p value in a study is to give a quantitative measure of the strength of the evidence behind the study's findings.

It is easily misinterpreted

Unfortunately, the p value is vulnerable to misinterpretation and is hard to use for comparisons between studies. In fact, sometimes it is not even clear whether it measures what we intend it to measure.

A statistically significant result with a low value of p is often misinterpreted to mean a large effect. The p value does not actually measure the size of the effect. Instead, it measures how unlikely the observed data would be if there were in fact no real effect.

Another common misinterpretation is to assign a probability to the truth of the hypothesis based on the measured p value. If I obtain a p value of 0.02 in my leafleting study, it does not actually mean that there is a 98% chance that my hypothesis is true. Similarly, a statistically insignificant result does not by itself imply that the hypothesis is false.

The p value is also often mistakenly used to compare results from two related experiments. Assume an experiment comparing leaflet A with a control group yields an effect with p = 0.03, and that another, identical experiment comparing leaflet B with a control yields an effect with p = 0.04. Even though 0.03 < 0.04, this does not imply that leaflet A is more effective than leaflet B: the difference between the two leaflets' effects may not itself be statistically significant, even though each leaflet's effect against its control was.

Identical data can yield different p values!

With diligent care, the misinterpretations of the p value described above can be avoided. You would expect, however, that if the observed data does not change, the statistical significance of a result based on that data should not change either. Quite disturbingly, it can! The p value depends not just on the observed data but also on unobserved data!

Let me illustrate with a very simple scenario. Suppose I plan to test a "go vegetarian" leaflet on 10 people whom I already know to be meat-eaters. I will use a research protocol in which I consider my leafleting intervention effective if at least one of the 10 goes vegetarian. A month after I hand them the leaflets, I will follow up with them one by one in careful in-person interviews. Social desirability and other biases can distort the truth about someone's vegetarianism, but let's assume my interviews are good enough that they can, with 95% probability, accurately identify a person as either a vegetarian or not. Suppose I interview 3 people and find that the first two are continuing to eat meat while the third one is now a vegetarian.

What is the p value of my study result? If I claim that my intention was always to interview a sample of exactly 3 people, then the p value of my finding is 1 − 0.95^3 ≈ 0.14, an embarrassingly high number for a p value. My study result will surely be dismissed as statistically insignificant!

But, since my research protocol defined my leafleting as effective if at least one person goes vegetarian, I can claim that I always intended for my experiment to stop as soon as I found one vegetarian (which happened in this case when I interviewed my third subject). The p value in this case is 0.95 × 0.95 × 0.05 ≈ 0.045, which tucks my study result neatly into the statistically significant category!
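
The two p values above follow from simple arithmetic. As a quick check, here is a short Python sketch that reproduces both calculations; the only assumption is the one stated above, that an interview misclassifies a meat-eater as a vegetarian 5% of the time when the leaflet truly has no effect.

```python
# Probability that an interview correctly identifies a meat-eater,
# assuming (as above) the leaflet truly has no effect.
correct = 0.95

# Protocol 1: "I always intended to interview exactly 3 people."
# Chance of seeing at least one (spurious) vegetarian in 3 interviews.
p_fixed_sample = 1 - correct ** 3
print(f"fixed sample of 3:     p = {p_fixed_sample:.3f}")   # ~0.143

# Protocol 2: "I always intended to stop at the first vegetarian."
# As computed in the text: two meat-eaters, then one (spurious) vegetarian.
p_stop_at_first = correct * correct * (1 - correct)
print(f"stop at first success: p = {p_stop_at_first:.3f}")  # ~0.045
```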

So, you see, the p value is not an impartial and objective measure of what the data tells you. For identical data and an identical interpretive criterion (observing at least one vegetarian counted as success), one can slightly tweak the claimed methodology after the data collection and claim statistical significance! In large experiments with many dozens of variables, opportunities for such tweaking abound.

The p value holds within it a wicked capacity to mislead!

What is the alternative?

Despite all its troubles, significance testing can actually be useful in some contexts. For example, we sometimes want to determine whether the observed data is consistent with no effect, such as when we want to examine whether patterns of scores on a college admissions test are normal and not indicative of cheating. Non-significance, in this case, weakens the argument that there was cheating. The p value, however, remains an inadequate tool for estimating the probability that an effect actually exists given an observed effect.

Statisticians often recommend that confidence intervals replace p values in all contexts where that is possible. An effect size can be expressed as a confidence interval: a point estimate accompanied by our uncertainty in that estimate. For example, we may say that a leaflet turns 3 percent of meat-eaters into vegetarians and that the 95% confidence interval lies between 2.6 and 3.4 percent. This would mean that if we ran the experiment an infinite number of times, in 95% of those runs the true value would lie within the confidence interval computed for that run.
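
As a rough illustration of where such an interval can come from, the sketch below computes a standard 95% Wald interval for a single proportion. The sample size and count are hypothetical, chosen only so that the result lands near the 2.6 to 3.4 percent range quoted above; this is one common method, not necessarily how any particular study would compute its interval.

```python
import math

# Hypothetical data: 210 of 7,000 leafleted meat-eaters went vegetarian.
converted, n = 210, 7000
p_hat = converted / n                       # point estimate: 0.03

# Standard 95% Wald interval: estimate +/- 1.96 standard errors.
se = math.sqrt(p_hat * (1 - p_hat) / n)
lower, upper = p_hat - 1.96 * se, p_hat + 1.96 * se

print(f"estimate: {100 * p_hat:.1f}%")
print(f"95% CI:   {100 * lower:.1f}% to {100 * upper:.1f}%")  # roughly 2.6% to 3.4%
```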

The quality of evidence for a study's results is as easily and as completely conveyed by confidence intervals, but without the interpretive quandaries that come with significance testing. In fact, a confidence interval carries more information than the p value.

However, even confidence intervals are not perfect in the clarity with which they convey the quality of evidence; their subtleties can also trip up researchers. For example, in the above instance, we cannot actually claim that there is a 95% probability that the true percentage of meat-eaters who go vegetarian as a result of the leaflet lies between 2.6 and 3.4! Even though the true value lies within the computed confidence interval in 95% of runs, we cannot know how likely it is that this particular computed interval between 2.6 and 3.4 is one of those 95%. Confidence intervals offer a good estimate of the true value, but not a probabilistic statement of where the true value lies.
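
That coverage property, as opposed to a probability statement about any one interval, can be demonstrated by simulation. The sketch below assumes a hypothetical true conversion rate and sample size, repeats the experiment many times, and counts how often each run's computed interval happens to contain the true value; the proportion comes out close to 95%, though the simple Wald interval's coverage is only approximate.

```python
import numpy as np

rng = np.random.default_rng(0)

true_rate = 0.03        # the (hypothetical) true conversion rate
n = 7000                # sample size per simulated experiment
n_runs = 20_000

# Simulate many experiments, compute a 95% Wald interval for each run,
# and count how often that run's interval contains the true rate.
counts = rng.binomial(n, true_rate, n_runs)
p_hat = counts / n
se = np.sqrt(p_hat * (1 - p_hat) / n)
covered = (p_hat - 1.96 * se <= true_rate) & (true_rate <= p_hat + 1.96 * se)

print(f"coverage over {n_runs} runs: {covered.mean():.3f}")  # close to 0.95
```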

Even better than p values and confidence intervals is the use of descriptive statistics, including the frequency distribution of the observed data. In particular, the distributional data on effect sizes can be spectacularly revealing for the inferences we seek.
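
As a small illustration of what this could look like, the sketch below summarizes a hypothetical set of per-site effect sizes (the numbers are made up) with a few descriptive statistics and a simple text frequency distribution, rather than a single p value.

```python
import numpy as np

# Hypothetical per-site effect sizes: change in percentage points of
# vegetarians after leafleting, one value per study site (made-up data).
effects = np.array([0.4, -0.1, 0.9, 0.3, 2.5, 0.0, 0.6, -0.3, 1.1, 0.5])

print(f"mean:   {effects.mean():.2f} percentage points")
print(f"median: {np.median(effects):.2f}")
print(f"std:    {effects.std(ddof=1):.2f}")

# A simple frequency distribution (histogram counts) of the effect sizes.
counts, edges = np.histogram(effects, bins=[-0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5])
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"{lo:4.1f} to {hi:4.1f}: {'#' * int(c)}")
```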

Unfortunately, the p value remains in wide use today in the peer-reviewed literature for a variety of reasons. One explanation is that scientists are usually only minimally trained in statistics; the analysis part of a study is often treated as a procedural chore not deserving of their full attention. A second explanation may be a resistance to revealing more than the p value, since detailed statistics on effect sizes often expose the weaknesses of the evidence behind a study's findings. A third explanation may be the fear that if you don't do your study exactly the way everyone else does, your paper may be rejected by reviewers who, after all, are doing their studies the same way.

The p value is a symptom of our yearning for, and our unhealthy reliance on, one magical number to somehow encapsulate and say something meaningful about the complexity of patterns observed in the data. It is a crutch we use to oversimplify reality and distill the full spectrum of our observations into something more easily digestible, and it always comes at some expense of the truth.