Tuesday, October 3, 2023

Understanding statistical significance: simple words, confusing meanings

Modern healthcare research uses bio–statistics extensively. There, one term you will often come across is statistical significance.

Health science and bio–statistics

In most branches of science, things happen predictably. If you drop a ball, it will hit the ground. And it will hit at the exact same location, with a speed that you can accurately calculate. Similarly, if you mix calcium carbonate and sulphuric acid, you will never get potassium permanganate. There are no maybes.

But, in biological research, the systems are too complex and intertwined. You may give an anti–histamine to two people: one will get allergy relief, while the other will get blurred vision and dry mouth.

Of course, all sciences involve a few such situations. But, in the healthcare field, pretty much everything is probabilistic. So, we have to use statistics (the science of probabilities) in healthcare research.

The most used phrase in such research is Statistical significance. Most researchers use it casually. The conclusions are sugar–coated with it. And, my gut feeling is most readers do not even understand how to interpret such claims correctly.

The danger is one can draw completely wrong meanings from them. So, I decided to devote a full article to it.

This website aims to give you articles that can improve your health. This article won’t do that. But, it will certainly help you understand reported results better. And, you will not have to live with incorrect black–and–white messages, such as “increasing blood vitamin D levels do not reduce fractures.”

So here is a primer on statistical significance. We will take an example from a research paper discussed in my comprehensive article on this website: Everything you wanted to know about running and knee osteoarthritis.

Statistical significance

In that article, we discussed a paper that showed that the risk of osteoarthritis (OA) in marathon runners is less than that in non–runners. Now, what do the following sentences mean?

  1. There was a statistically significant lower OA risk in marathoners (8.8%) than in non–runners (17.9%) (p < 0.001).
  2. There was no statistically significant lower OA risk in marathoners (8.8%) than in non–runners (17.9%) (p > 0.05).

Both use exactly the same results but make different claims. Both say that they found the risk of OA in marathoners was 8.8%, while that in non–runners was 17.9%. Yet, they conclude completely opposite things.

And, hold your breath, both could be exactly correct. Confused? That is why you need to understand what statistical significance is. That small little thing written at the end in brackets, p > 0.05, or p < 0.001, is a number specifying statistical significance.

Clinical trials

In a clinical trial, we try to find out if any factor, such as running marathons, changes the OA risk. Let us say that we found some decrease: the marathoners had a lower incidence of OA than non–runners, as the sentences say (8.8% versus 17.9%).

However, the decrease could be because of two reasons: some actual reason, or pure luck. How do we distinguish between the two?

If there was an actual reason, we are right to conclude that running reduces the OA risk. However, if it was because of some random luck, it would be incorrect to say that running reduced the prevalence of OA.

Let us understand this a bit better. Where does this pure chance, or luck factor, come from? Can we remove it completely? And if not, how can we possibly conclude anything?

Where does the luck factor come from?

Let us look at the trial itself. We wanted to know if marathoners had an increased risk of OA. Now, there are millions of marathoners around the world. So, ideally, we would have to test every single marathoner. That would tell us for sure if marathoners have any higher risk of OA than non–runners.

However, testing millions of marathoners is too difficult. So, we cheated a bit. We randomly selected only 675 marathoners and tested them.

In those 675 runners, the results were crystal clear: the chance of OA in those runners was 8.8%, while that for non–runners (based on an earlier study) was 17.9%. Thus, these 675 marathoners had less risk of getting OA than their non–runner counterparts.

There was no luck involved. We were 100% sure about these 675 marathoners. However, can we say the same about the rest of the millions of marathoners?

Sample and population

In statistics, the 675 marathoners we tested are called a sample. And, all the millions of marathoners around the world are called the population. We know the results in our sample. But, we don’t know, and mostly cannot know, the results in our population.

So, how do we make a claim about our population based on our findings in our sample? After all, making a claim only for a sample means nothing. Who cares about 675 marathoners and that they had a low incidence of OA? People want to know about marathoners in general, and not about our specific 675 marathoners.

Essentially, we will have to extend our findings from our sample to our population. That is where the luck factor, or pure chance, comes into the picture.

Random selection

What if almost all of the millions of marathoners were osteoarthritic, with only 1,000 not having OA? And, by pure chance, or luck, or destiny, or whatever you call it, we ended up selecting 616 of them in our trial? That would have given us the results we saw in our sample: 616 / 675, or about 91.2%, of marathoners had no OA, and only about 8.8% of them had it.

So, our sample would have 8.8% of marathoners with OA, while our population would have more than 99.9% of marathoners with OA.

Note that we had selected our 675 marathoners randomly. But, as luck would have it, we might have ended up selecting the very few marathoners in the world, who had no OA.

Thus, lady luck would have made a mockery of our claim that marathoners have a lower incidence of OA, and we would have had no way of knowing it. As the famous author, Nassim Nicholas Taleb, has said, “we would be fooled by randomness”.
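To get a feel for this luck factor, here is a small Python sketch. It assumes a purely hypothetical population in which the true OA rate among marathoners is 17.9% (a number I picked only for illustration), and it draws 1,000 random samples of 675 runners each. The names `sample_oa_rate` and `TRUE_RATE` are made up for this example.

```python
import random

def sample_oa_rate(true_rate, sample_size, rng):
    """Fraction of one random sample that has OA, given the population's true rate."""
    with_oa = sum(1 for _ in range(sample_size) if rng.random() < true_rate)
    return with_oa / sample_size

rng = random.Random(2023)  # fixed seed so that the run is repeatable
TRUE_RATE = 0.179          # assumed population rate, for illustration only

# Repeat the "trial" 1,000 times, each time with a fresh random sample of 675.
rates = [sample_oa_rate(TRUE_RATE, 675, rng) for _ in range(1000)]

print(f"lowest sample rate:  {min(rates):.1%}")
print(f"highest sample rate: {max(rates):.1%}")
```

Every sample comes from the same population, yet no two samples need to show the same rate. That scatter is the luck factor.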

Living with randomness

We can rarely test 100% of our population. So, we have to live with testing only a sample.

Extending our conclusions from a sample to the whole population introduces the element of luck, or pure chance. So, we need to specify the luck factor. What was the chance that what we noted in our sample was just a quirk of randomness?

If we know that number, say 3%, then we can say that:
We are 100% sure that our 675 marathoners had less OA than non–runners, and
We are 97% sure that all marathoners in the world have less OA than non–runners.

Now we are getting somewhere.

Quantifying the luck factor

Let us say that we are given a coin. And, we are asked to find out if it is a normal coin, or a biased one.

A biased coin

When a normal coin is tossed, there is a 50% chance it will come out heads, and another 50% chance that it will come out tails.

Ignore those movie sequences wherein a tossed coin stands on its edge. The chance of that happening is very low, about 1 in 6000, for a small coin. For the engineers amongst you: the angular momentum of the flipped coin ensures that it does not stand on its edge. But, it can happen about 0.01667% of the time, if the coin’s edge is thick, such as a USA 5–cent coin.

A biased coin is one that has been manipulated by someone so that after a coin flip, the chances of heads and tails will be unequal.

There are only two ways of knowing the ‘truth’ (whether it is a biased coin):

1. Do some forensic analysis on the coin in a research lab to find out if it is a modified one, or
2. Do billions or trillions of (scientifically speaking, infinite) such tosses to see if the results converge to 50% heads : 50% tails.

Since neither of these options is particularly easy, we choose a different approach. We say that we will toss the coin 5 times and see what the results tell us. Of course, we could have also said that we will flip the coin 10 times. But, let us stick to 5 tosses, for understanding the principles.

Coin tosses

Let us say that we tossed the coin 5 times, and we got all 5 heads. Now, what can we say about the coin being biased? This is how the argument proceeds:

When you toss a normal coin 5 times, there are 2 x 2 x 2 x 2 x 2 = 32 different possible outcomes. Only one of those 32 possibilities is all 5 heads. So, if you toss a normal coin 5 times, there is a 1 in 32, or 1 / 32, or about 3%, chance that all the tosses will come up heads. Thus, there is a 3% chance that 5 heads out of 5 tosses would have come out of pure luck, in a normal coin.
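If you would rather count than multiply, this short Python sketch lists every possible result of 5 tosses of a normal coin and counts how many of them are all heads. The variable names are my own.

```python
from itertools import product

# Every possible sequence of 5 tosses, e.g. ('H', 'T', 'H', 'H', 'T').
outcomes = list(product("HT", repeat=5))

# Keep only the sequences that contain nothing but heads.
all_heads = [o for o in outcomes if set(o) == {"H"}]

chance = len(all_heads) / len(outcomes)
print(f"{len(outcomes)} outcomes, {len(all_heads)} of them all heads")
print(f"chance of 5 heads from a normal coin: {chance:.5f}")
```

The count gives 1 outcome in 32, which is the "about 3%" used in this article.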

In other words, there is a 97% chance that the 5 heads out of 5 tosses would have come out of some underlying reason. Maybe, the tosses were improper, or the coin was biased.

Let us assume that we ensured proper and fair tosses. So, either the coin was biased (97% chance), or the coin was normal (3% chance).

Now, we will not know the absolute truth — whether the coin is biased — since the two options discussed above are both time–consuming, and hence, avoidable. So, we say that we are willing to live with some uncertainty in our conclusions, as long as that uncertainty is ‘very small’. In this case, the uncertainty is 3%.


So, we say that:
Based on the results (5 heads in 5 tosses), we ‘conclude’ that there is a 97% chance that the coin is biased, and a 3% chance that it is normal (and it was pure luck that we got 5 heads in 5 tosses).

This 3% is called the probability value, or p–value, and is written as p = 3%, or p = 0.03.

In short, we say that:
(Based on the results), the coin is biased, p = 0.03.

Note that the conclusions are always based on the results. Hence, we don’t have to write that.

“The coin is biased, p=0.03” means that we conclude that “the coin is biased, but there is a 3% chance that we may be wrong.”

Now, it would be foolish to say something like “the coin is biased, p=0.50”. After all, p = 0.5, or a 50% chance that you may be wrong, means you are just plucking a guess out of thin air.

So, where do you draw the line? 3% chance of being wrong? 10% chance?

Drawing conclusions in the presence of luck

A rule of thumb used by researchers is that if any conclusion has 5% or less chance of being wrong, it is fairly conclusive.

So, a p–value of less than 5%, or p < 0.05, is considered convincing enough. Such a conclusion is said to be statistically significant.

But why p < 0.05?

There is nothing sacrosanct, or special, about 5%, or 0.05. For most decisions in life, 95% certainty, or 5% possibility of being wrong, is considered fine.

However, if there are serious decisions, such as deciding on whether a new type of cancer treatment is better than the earlier one, you cannot live with 5%, or 1 in 20, chance of being wrong. In such cases, you may demand more certainty. You may say that you will allow 1%, or 1 in 100, possibility of being wrong, or even 0.1%, or 1 in 1000.

This ‘less than 1 in 1000’ chance of being wrong about your conclusion, is written as p < 0.001. So, a result with p < 0.05 is fairly sure, or statistically significant. That with p < 0.01 is even more certain, and that with p < 0.001 is chest–thumping certain. Well, scientists don’t like the word ‘chest–thumping’. They will just say that it is statistically ‘highly’ significant.

By the way, in our biased coin tosses, can you guess what the sample and the population were? After all, you had only one coin. Well, the population was the infinite number of tosses that you could do with the coin. And, the sample was the randomly chosen 5 tosses out of those infinite ones.

You had results (5 heads) and conclusions (100% biased coin!; you are always 100% sure about your sample) for the 5 tosses, and you wanted to extend the conclusions to your population. After all, knowing whether the coin is biased means knowing how your infinite number of tosses would split between heads and tails. If the two were equal, or nearly equal, the coin is normal.

Back to our medical trial

We wanted to know if marathoners have more OA risk than non–runners. There are only 2 ways to be 100% sure about that:

  1. Test all the millions of marathoners around the world.
  2. Ask God for the answer.

Since neither option is particularly practical, we chose a shortcut of testing 675 marathoners. So, again, we cannot be 100% sure in our conclusions, because we will be extending our results from the 675 marathoners that we tested to the millions of marathoners whom we could not test.

And, we found our 675 marathoners had 8.8% chance of OA versus 17.9% for non–runners.

We applied our statistical science to calculate p–value for this result. Exactly how this p–value was calculated is outside the scope of this article.

In fact, almost 100% of the scientists themselves don’t understand this part. They involve someone with expertise in bio–statistics for this. I should know; I was involved in some medical studies as a co–author, simply because I did the statistical head–banging.
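For the curious, here is one simple way such a p–value could be computed. This is only a sketch and not the method of the original study: it uses an exact one–sided binomial test, treats the 17.9% non–runner rate as a fixed, known number, and assumes that 59 of the 675 marathoners had OA (675 minus the 616 used earlier). The function name `binomial_lower_tail` is my own.

```python
from math import comb

def binomial_lower_tail(k, n, p0):
    """P(X <= k) when X counts cases among n people who each have risk p0."""
    return sum(comb(n, i) * p0**i * (1 - p0)**(n - i) for i in range(k + 1))

n_marathoners = 675
n_with_oa = 59          # assumed count (675 - 616), roughly 8.8%
reference_rate = 0.179  # non-runner rate, treated as exactly known (an assumption)

# Chance of seeing 59 or fewer OA cases by pure luck,
# if marathoners really had the same 17.9% risk as non-runners.
p_value = binomial_lower_tail(n_with_oa, n_marathoners, reference_rate)
print(f"p = {p_value:.2e}")
print("p < 0.001:", p_value < 0.001)
```

Under these assumptions, the result lands in the p < 0.001 bucket, matching sentence 1. A real analysis would also account for the uncertainty in the 17.9% figure itself.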

So, here are two possible conclusions from the same results, based on p–values thrown up by statistical calculations. What do they mean to you?

  1. There was a statistically significant lower OA risk in marathoners (8.8%) than in non–runners (17.9%) (p < 0.001).
  2. There was no statistically significant lower OA risk in marathoners (8.8%) than in non–runners (17.9%) (p > 0.05).

What does p < 0.001 mean?

A p–value of < 0.001 (in sentence 1) tells us that there is less than 1 in 1000 chance that the conclusion we draw about millions of marathoners, from our lame–duck shortcut method of testing only 675 marathoners, is an artefact of pure luck. Thus, we are quite confident of our conclusion.

A layman’s interpretation: Marathoners have a lower OA risk than non–runners.
Correct claim: Marathoners have a statistically significant lower OA risk than non–runners, p < 0.001
Correct interpretation: Marathoners have a lower OA risk than non–runners. And, there is less than 1 in 1000 chance that this was observed just due to random luck. So, we are highly confident that there is a lower OA risk in marathoners.

What does p > 0.05 mean?

A p–value of > 0.05 (in sentence 2) tells us that there is more than 1 in 20 chance that the conclusion we draw about millions of marathoners, from testing only 675 marathoners, is due to pure chance. Thus, we are not at all confident whether there is a reduced OA risk in marathoners.

A layman’s interpretation: Marathoners do not have a lower OA risk than non–runners.
Correct claim: Marathoners have no statistically significant lower OA risk than non–runners, p > 0.05
Correct interpretation: Marathoners may have a lower OA risk than non–runners. However, there is more than 1 in 20 chance that this was observed just due to random luck. So, we are not at all confident that there is a lower OA risk in marathoners.

Let me rephrase the correct interpretation part above, for the sake of clarity. It means four things:

  1. We are 100% sure that there is a decrease in OA risk in marathoners, in the sample tested. In fact, the risk is less than half (8.8% versus 17.9%).
  2. But, we are not 100% sure that there was a decreased OA risk in the whole population of marathoners. In fact, we are not even 95% sure that there is a decreased OA risk in all marathoners in the world.
  3. Since there is more than 5% chance that we could be wrong, we cannot overlook that when making any scientific proclamations about the OA risk reduction in marathoners. So, we would rather refrain from saying anything about decreased OA risk in marathoners.
  4. So, we are not saying that there is no reduction in the OA risk in marathoners. We can only say that, based on the data, we have no way to conclude (confidently) whether there is a decrease in the OA risk in marathoners.

This last sentence is very important to understand. People confuse a phrase such as “no statistically significant lower risk…” with “no lower risk of…” or worse, “same risk of…”, or the worst, “higher risk of…”.

If you did not find the moon in the sky, it does not mean that the moon does not exist, nor can it mean that the moon will not be found later on.

How to find the moon?

If you could not find the moon, it is not the fault of the moon. Better to look in the mirror for the culprit. Right?

Similarly, if you find the risk of OA in marathoners was half that in non–runners in a sample, but you are unable to conclude anything about your population, the fault lies with you. Let us mildly say that it is the shortcoming of your study design.

If only you had a p–value smaller than 0.05, you would be able to turn around and claim your results to be statistically significant, and hence, reliable.

That biased coin

In that biased coin example, if you had tossed it only 4 times and got 4 heads, the chance of that happening would have been 1 in 16, or 1 / 16, or 6.25%. So, there was a 93.75% chance that the coin was biased, and a 6.25% chance that it was normal.

So, you would have said: (based on the data of 4 tosses), there is no statistically significant bias in the coin, p = 0.0625, which is p > 0.05.

For the same coin, you had said earlier: (based on the data of 5 tosses), there is a statistically significant bias in the coin, p = 0.031, which is p < 0.05.

In other words, just by adding one extra toss, you managed to make your conclusions statistically significant. The coin did not change; your confidence in your conclusion about the coin changed.
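The arithmetic is easy to check. This little Python sketch computes the chance of getting all heads from a normal coin (1 / 16 for 4 tosses, 1 / 32 for 5 tosses) and compares it with the usual 0.05 cut–off. The function name `p_all_heads` is made up for this example.

```python
def p_all_heads(n_tosses):
    """Chance that a normal (fair) coin shows heads on every one of n tosses."""
    return 0.5 ** n_tosses

for n in (4, 5):
    p = p_all_heads(n)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"{n} heads in {n} tosses: p = {p:.5f} ({verdict})")
```

One extra toss is all it takes to move from one side of the 0.05 line to the other.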

Medical research and bio–statistics

That is exactly what happens in medical science research. After all, your population does not change.

Bio–statistics help you quantify your confidence in your conclusions. And, a better study design, or tweaking your sample, helps you increase the confidence in your conclusions.

Suppose we had tested 2,000 marathoners instead of 675.

Nothing would have changed in the population of marathoners. Whatever reduced OA risk was there would still be the same. But, our confidence in our conclusion would increase, because we tested more, though not all, marathoners. This would be reflected in the reducing p–value.

The moon does not change. You just scan a larger slice of the sky, and increase the chance of finding it.

Thus, by making a trial larger (more marathoners), or longer (more years of running), or using better techniques (more foolproof tests for OA), you can make statistically more significant claims. In fact, a statistically insignificant finding can suddenly become statistically significant.

While we may have the same OA risk reduction — 8.8% versus 17.9% — we may find the p–value to be lower now — from > 0.05 to < 0.05. Our chance of going wrong reduced from more than 5% to less than 5%. Bingo! Suddenly, we have a statistically significant claim.
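Here is a rough Python sketch of that effect. It imagines three hypothetical trials of different sizes, each finding exactly 10% of runners with OA (a round number I chose to keep the counts whole), and compares each with the 17.9% non–runner rate using an exact one–sided binomial test. The trial sizes and the function name are my own assumptions, not data from any study.

```python
from math import comb

def binomial_lower_tail(k, n, p0):
    """P(X <= k) for n people with risk p0: chance of k or fewer cases by luck."""
    return sum(comb(n, i) * p0**i * (1 - p0)**(n - i) for i in range(k + 1))

REFERENCE_RATE = 0.179  # non-runner OA rate, treated as exactly known

# Hypothetical trials as (cases, runners tested); each one observes 10% OA.
trials = [(1, 10), (5, 50), (20, 200)]
p_values = {n: binomial_lower_tail(k, n, REFERENCE_RATE) for k, n in trials}

for n, p in p_values.items():
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"n = {n:>3}: p = {p:.4f} ({verdict})")
```

The observed rate is identical in all three trials. Only the number of runners tested changes, and with it, the p–value.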

Once again, keep in mind that there is nothing sacrosanct about this 0.05. If p is 0.051, it does not mean that the conclusion is absolutely untenable. And, if p is 0.0499, it does not mean that the conclusion is super–certain. You become more and more certain about your claim, that is all.

Non–claim, no–claim, and negative–claim

Ok, I made those words up. There is nothing like that in statistics. But, I thought it would help you understand things a bit better.

Non–claim: There is statistically insignificant OA risk reduction in marathoners.
No–claim: There is no, or almost zero, OA risk reduction in marathoners, and this finding is statistically significant.
Negative–claim: There is no OA risk reduction in marathoners.

As you can guess, statistics cannot give you a negative–claim, unless you test 100% of the population. Statistics can only make a statement along with a p–value, that is, by specifying its statistical significance. It cannot say anything with 100% conviction, unless you test all marathoners.

That is why, a statistically insignificant claim is actually a non–claim. Non–claim means you could not claim anything. No–claim means you claimed there was nothing. Non–claim means you could not find the moon. No–claim means there is no moon. Never confuse a non–claim with a no–claim.

It is better to have the latter, no–claim, because it tells us something. It tells us that there is very little difference between marathoners and non–runners.

The former, non–claim, does not tell us anything. It tells us that we cannot draw any conclusions, as we are not confident.

And, the worst of all is the mix–up between a non–claim and a negative–claim. There is no such thing as a negative–claim in statistics. So, don’t twist the sentence “there is no conclusion about a decrease in risk” into “there is no decrease in risk”.

Very, very subtle, yet life–and–death difference.

Statistical versus medical significance

Finally, statistical significance has nothing to do with medical significance. Statistical significance says how confident we are about what we claim. Medical significance says how medically relevant our claim is.

What if we found a statistically significant difference between OA risks for marathoners (17.8%) and non–runners (17.9%)? Of course, you found a difference, but does it have any relevance in life, since the difference is so small? Who would want to publish or read such a medical paper?

Personally, though, I find a statistically significant non–difference (17.8% versus 17.9%, p < 0.05) much better than a statistically insignificant difference (8.8% versus 17.9%, p > 0.05). The latter tells us to do a better, and larger, study, if we want to know the answer. The former simply says that there is not much difference. Close the file, shelve it, and forget it. No further need for anyone to break their head over it.

So, I would prefer the medical community to write more papers about statistically significant non–differences than statistically insignificant differences.
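To see how a tiny difference can still be statistically significant, consider this Python sketch. It uses a normal approximation for a one–sided test of a rate against a fixed reference, which is my simplification and not a method taken from any paper. The trial sizes (675 and one million) and the function name `one_sided_p_value` are hypothetical.

```python
from math import erfc, sqrt

def one_sided_p_value(observed_rate, reference_rate, n):
    """Normal-approximation p-value for 'observed rate is below the reference'."""
    standard_error = sqrt(reference_rate * (1 - reference_rate) / n)
    z = (observed_rate - reference_rate) / standard_error
    return 0.5 * erfc(-z / sqrt(2))  # standard normal CDF evaluated at z

# Same tiny difference (17.8% versus 17.9%), two very different trial sizes.
small_trial = one_sided_p_value(0.178, 0.179, 675)
huge_trial = one_sided_p_value(0.178, 0.179, 1_000_000)

print(f"675 runners:       p = {small_trial:.3f}")
print(f"1,000,000 runners: p = {huge_trial:.4f}")
```

With enough runners, even a 0.1 percentage point difference crosses the 0.05 line. The p–value tells you how sure you are, not how much the difference matters.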

In conclusion

Statistics is the science of probabilities. It helps us draw meaningful conclusions, in the face of uncertainty. However, it cannot bring certainty where there is none.

p–value quantifies the uncertainty in the conclusions.

Be clear about the meaning of the claims. Never confuse a non–claim with a no–claim, or a negative–claim.

Finally, never confuse statistical significance with medical significance. What is the point in being very sure about things that don’t matter?

So much for statistics. Note that a purist statistician will prefer to modify some points that I used in the explanation. For example, the p–value about marathoners having lower risk of OA will be different than the p–value about marathoners having different (lower as well as higher) risk of OA. The former uses one–sided tails, versus two–sided for the latter, whatever that means.

So, a purist would be right; I have taken some creative liberties to keep the explanation simple, slightly bending the exact scientific interpretations.

Now, if you thought you understood it all, let me confuse you by saying that all this was using something called Frequentist statistics. Everything will have to be rewritten, if you were to use Bayesian statistics. Go to this link for some great bedtime 🙂 reading: Frequentist and Bayesian approaches in statistics.

Image credit: David Schwarzenberg from Pixabay

