The statistics dictionary: Significantly misleading

Author: Mark Kelly

Mark Twain, with characteristic panache, said ‘…I am dead to adverbs, they cannot excite me’. Stephen King agrees, saying ‘The road to hell is paved with adverbs’. The idea, of course, is that if you are using an adverb you have chosen the wrong verb. It is stronger to say ‘He shouted’ than it is to say ‘He said loudly’.

What, then, are we to make of the ubiquitous ‘statistically significantly related’? Not very much, I highly suspect. Even if you don’t read academic manuscripts you will have heard it in news reports of medical research or adverts for anti-aging skin creams, typically to indicate that the probability of observing data at least as extreme as those observed, if the null hypothesis were true, is less than 5%. You may have heard Hans Rosling joke in his TED talk about an experiment to test whether undergraduates knew which of five pairs of countries had the higher infant mortality rate. The students got 1.8 pairs correct on average, leading him to conclude that ‘Swedish top-level students know statistically significantly less than chimpanzees’ (who, choosing randomly, could be expected to get 2.5 pairs correct).

‘Statistically significant’ is a tremendously ugly phrase, but unfortunately that is the least of its shortcomings. What is far worse is that it is misleading. Significant has a plain-language meaning of ‘important’ (a significant breakthrough in an investigation), ‘large’ (a significant amount of money) or ‘meaningful’ (a statement is significant). Statistical significance, on the other hand, may not correspond to any of those things, since it is a descriptor of the p-value alone, and for a given association the magnitude of the p-value is a simple function of the sample size. If statistical significance is all you want, just increase your sample size. In other words, our ability to detect differences is strongly associated with how hard we are looking.
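To make the sample-size point concrete, here is a minimal simulation sketch in Python (using numpy and scipy; the tiny ‘true difference’ of 0.02 standard deviations and the sample sizes are arbitrary illustrative choices, not drawn from any real study):

# Sketch: a fixed, trivially small true difference becomes 'statistically
# significant' once the sample is large enough, even though the effect
# itself never changes.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_difference = 0.02  # in standard deviation units: clinically negligible

for n in (100, 10_000, 1_000_000):
    control = rng.normal(0.0, 1.0, size=n)
    treated = rng.normal(true_difference, 1.0, size=n)
    result = stats.ttest_ind(treated, control)
    print(f"n = {n:>9,}  p = {result.pvalue:.4g}")

# Typically p stays well above 0.05 at the smaller sample sizes and falls
# far below it at n = 1,000,000 - the effect did not grow, the sample did.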

Imagine if an environmentalist said that oil contamination was detectable in a sample of water from a protected coral reef. The importance of that statement would change drastically depending on whether they were referring to a naked-eye assessment of the water sample or an examination under an electron microscope. The smaller the amount of oil, the harder we would have to look. The same is true for a clinical study that detects a statistically significant treatment effect. If the study is huge, statistical significance on its own tells us very little, since even tiny and clinically unimportant differences can be found to be statistically significant. The p-value must be interpreted with reference to the sample size and, ideally, to the effect size (which summarises how large the effect is). The International Journal of Epidemiology actively discourages the use of the term ‘statistically significant’ in submitted manuscripts, but this is far from the norm.

What we mean by a ‘statistically significant’ difference is that the difference is ‘unlikely to be zero’. This is a phrase that is unlikely to catch on. An alternative term, ‘statistically discernible’, has a number of advantages (even if it still includes an adverb). Firstly, discernibility has none of the aforementioned unhelpful meanings of significance. It merely implies that the effect is distinguishable from randomness. Moreover, when space is restricted in an abstract or manuscript it is often tempting to drop the ‘statistically’ from ‘statistically significant’, which results in a potentially misleading statement. ‘Discernible’ carries none of this baggage. No value is placed on the importance, magnitude or meaning of the result. It is just observable.

‘Statistically discernible’ is still 50% adverb, however. What alternatives do we have? Distinguishable from zero? Inconsistent with random variation? Unattributable to chance? All of these sound more definitive than they have any right to be. Perhaps a sufficiently statistically literate audience can simply be presented with the p-value and allowed to attach their own qualitative labels to different thresholds, depending on their personal attitude toward probability. H.G. Wells might approve of that approach, having said ‘Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write’. What seems certain, however, is that ‘significance’, with its multiple meanings, is a great name for a magazine but a poor choice for a scientific term.



Comments

Mark Kelly

Quote:

How about the term "systematic"?

i.e. a "significant systematic difference"

 

Hmmmm. I would worry that systematic has too many connotations of order or regularity. A systematic difference to me sort of sounds like there is clear daylight between the two things you are comparing (i.e. perhaps one is systematically shifted up or down).

I think discernible or even detectable aligns more with what is meant.


rjw

How about the term "systematic"?

i.e. a "significant systematic difference"


Mark Kelly

Quote:

More than 25 years ago, I attempted to make this very point regarding the subjectivity of adverbial descriptors while presenting a lecture on my research at a major medical school. I encouraged efforts to put numbers on findings. I was met with fierce opposition, to put it mildly. A large part of the problem is that the inherent ambiguity of language renders it inadequate to the task of unmistakable exposition. But qualitative information, which is not readily reduced to mathematical expressions, is as important as quantitative data.

We're also kind of stuck with statistics as the adverb of quantification. Statistics per se, however, is not the problem most of the time, assuming the test used is appropriate to the data. If an outcome is 99.5% likely to be nonrandom - that gives one a lot of confidence that it's real - confidence enough to get you convicted on DNA evidence in court, for instance. The real problem lies with data acquisition methodology. If the data are unreliable - which is usually a result of poor methodology - the statistical analysis is futile. Scientists are rarely capable of evaluating the soundness of methods in fields outside their own, and the lay public (this includes science 'journalists') has no hope of doing so. As a result, much junk science is foisted onto public awareness.

 

Thanks for the comment. I also work in a medical school and I have to say I think the clinicians I work with are extremely open to statistical opinions and reasoning. I think in lots of ways my generation of statisticians is benefitting from the integrity and example set by previous cohorts. These days, the integrity of clinical trial statisticians is well recognised.


Mark Kelly

Quote:

It would be significant to place this article into context where medical advances in drugs have been promoted and pushed based on the "statistical significance" correlating to positive outcomes... only to be pulled from shelves because they were then found to have unintended effects such as liver disease or coronary disease or something else.

Oh, but neither the CDC nor the FDA keeps lists of pulled drugs....I wonder why that is.

 

Thanks for the comment. There is another related point I didn’t address, and that is when results fail to reach statistical significance and people take this to mean that the treatments are equivalent. I have seen papers conclude equivalence based on p-values exceeding 0.05 even when the studies were woefully underpowered.
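To illustrate with a purely hypothetical sketch (20 patients per arm and a true effect of 0.4 standard deviations are assumed numbers, not taken from any of the papers in question): an underpowered study will frequently fail to reach p < 0.05 even when a real, worthwhile effect exists, so a non-significant result is not evidence of equivalence.

# Sketch: how often an underpowered two-arm trial returns p > 0.05 despite
# a genuine treatment effect (illustrative numbers only).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_per_arm, true_effect, n_sims = 20, 0.4, 5_000

nonsignificant = 0
for _ in range(n_sims):
    control = rng.normal(0.0, 1.0, n_per_arm)
    treated = rng.normal(true_effect, 1.0, n_per_arm)
    if stats.ttest_ind(treated, control).pvalue > 0.05:
        nonsignificant += 1

print(f"p > 0.05 in about {nonsignificant / n_sims:.0%} of simulated trials")
# Power here is only around 25%, so roughly three in four such trials are
# 'not significant' despite a genuine effect - hardly proof of equivalence.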


martin phd

what a bizarre article

the term refers to / describes "if/when numerical results are across some arbitrary threshold of likely / not likely to be found at random"

that explanation probably (oops statistically probably?) doesn't help

now class we will try an example,

flip a fair coin: heads or tails, 50% each (all coins are defaulted to fair, ie 50% heads/tails)

how many flips, heads only or tails only before you decide the coin is not fair?

one head in a row (or tail) is 50%
two in a row is 25%
three in a row is 12.5%
four in a row is 6.25%

(if you can figure out the next number in this series, of 50% 25% 12.5% 6.25% you are doing numerical progression)

let's do this  a different way

1 in a row is 1/2 chance
2 in a row is 1/4 chance
3 in a row is 1/8
4 in a row is 1/16
5 in a row is 1/32

take 5 in a row: if you flipped the coin five times and it is a fair coin,

then if you did a five-flip 32 times you would get, on average, one instance of five heads (or tails) in a row - one instance from a fair coin every 32 times -

this is a long shot 1/32

we usually set 5% as the industry standard for when we call it: five heads (or tails) in a row is such a long shot for a 'fair coin' that ...nah, ....it is not a fair coin

departure from random

plan c

if the coin went 3 heads in a row when do you bet on the fourth head

4 heads in a row when do you bet on the fifth head

our science by convention says that once we are past the 5% threshold (just after four heads in a row, or tails) THEN this is not a fair coin, ie not just random, therefore bet on heads
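(A minimal sketch of the same run-length arithmetic, for anyone who wants to check it:)

# The chance of k heads (or k tails) in a row from a fair coin is 0.5**k;
# the conventional 5% threshold is first crossed at k = 5.
for k in range(1, 6):
    p = 0.5 ** k
    verdict = "below 5%: by convention, not a fair coin" if p < 0.05 else "above 5%"
    print(f"{k} in a row: p = {p:.5f} ({p:.2%}) - {verdict}")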

---
now in our next example, the numbers are not coin toss but drug (good drugs) efficacy - if we use four people and the drug works do we take it? vs some other drug

and permit it to be sold,

any questions class? this will be on the exam (this is the simple version)

btw, if mark twain said to never use adverbs he is my man

i don't read books with adverbs

p values, as the author suggests (probability, as i started out), are probably more off-putting than adverbial certainty

that said, we wonks often argue about the basis for p, distributional assumptions, orthogonality and normality of variance, multicollinearity and incremental significance vs cumulative significance

and when is p < .05 or .001, and of course one-tailed v two-tailed - the earliest data on second-hand smoking bent the numbers using a one-tailed test when the proper (normative) test was two-tailed, a few wonks called it and the later numbers were reformed and corrected

are we having fun yet ?


Hominid

More than 25 years ago, I attempted to make this very point regarding the subjectivity of adverbial descriptors while presenting a lecture on my research at a major medical school. I encouraged efforts to put numbers on findings. I was met with fierce opposition, to put it mildly. A large part of the problem is that the inherent ambiguity of language renders it inadequate to the task of unmistakable exposition. But qualitative information, which is not readily reduced to mathematical expressions, is as important as quantitative data.

We're also kind of stuck with statistics as the adverb of quantification. Statistics per se, however, is not the problem most of the time, assuming the test used is appropriate to the data. If an outcome is 99.5% likely to be nonrandom - that gives one a lot of confidence that it's real - confidence enough to get you convicted on DNA evidence in court, for instance. The real problem lies with data acquisition methodology. If the data are unreliable - which is usually a result of poor methodology - the statistical analysis is futile. Scientists are rarely capable of evaluating the soundness of methods in fields outside their own, and the lay public (this includes science 'journalists') has no hope of doing so. As a result, much junk science is foisted onto public awareness.


Gaelan Clark

It would be significant to place this article into context where medical advances in drugs have been promoted and pushed based on the "statistical significance" correlating to positive outcomes... only to be pulled from shelves because they were then found to have unintended effects such as liver disease or coronary disease or something else.

Oh, but neither the CDC nor the FDA keeps lists of pulled drugs....I wonder why that is.

