There is a joke that has been winding its way through statistician cocktail parties in Spain: How do you tell an introvert statistician? He looks at his shoes when talking to you. How do you tell an extrovert statistician? He looks at your shoes when he is talking to you. Statisticians are trained to be precise in mathematical language, where we know exactly what has to be interpreted by the reader. We are also highly accurate with numbers. But, just as a person with, say, Asperger’s syndrome may be incapable of understanding the nuance of language, we don’t know how our peers interpret our words in natural language. As statisticians, we often feel that we have failed in communicating to society how important, interesting and gratifying variability can be.
When Karl Pearson introduced in 1893 an alternative to the more cumbersome ‘root mean square error’, he chose to call it not ‘standard variance root’ but ‘standard deviation’. And we have become overly habituated to the use of ‘deviation’ for describing variability. But do we really want ‘deviations’ to become ‘deviants’? Little did Pearson know that in less than ten years, the word ‘deviate’ would become a widely used noun to describe a sexual pervert. Similar negative connotations arise from the translations παρέκκλιση, отклонение, afwijking, Abweichung, desvio, deviazione or desviación. But the French term ‘écart-type’ is neutral, with a more general interpretation of distance or gap.
We know that variance cannot be negative. And variation should not in itself be negative. Because it is a factor in almost every model, because it is a fact of life, we, as statisticians, should avoid connotations, whether they be negative or positive. For example, not so long ago, when human rights were not recognized in the Iberian Peninsula, Spaniards used to justify their contrast to other countries by stating simply, ‘Spain is different’, giving a positive connotation to variability. Yet you can go further down the rabbit hole and find in Spain various cultures spread out over at least four languages, and those differences confuse some citizens, who wonder if they even belong to the same population.
Figure 1: ‘Am I with them?’
In line with French people who say ‘vive la différence’, statisticians appreciate that within any population, variability is the desired rule, not the annoying exception. Art, design, music, even humour are the result of repetition and variation. Indeed, the application of statistics to physics and computers resulted in information theory. And we are actually living in the 2011-2020 ‘United Nations Decade on Biodiversity’. In contrast to the scientific search for better models, an immature adolescent position could be to just look for repetition, searching for people in the same position and avoiding ‘deviants’ from one’s own behavior: ‘If they think like me, I cannot be wrong’.
Figure 2: In the stadium: ‘They are with me’
‘Deviation’ is not the only term with an awful connotation. Let’s have a look at the names we give to the random component of the model, the imprecision of estimators or the modeling itself. When we build stochastic models, we allow for variability as a natural part of the model. This randomness captures particular contributions from units that are not shared with other units: inputs that are not repeatable, in the sense that they belong to the specific unit. In fact, they are ‘the individualizing quality or characteristic of a person or group’. That’s the definition of ‘idiosyncrasies’ in Wikipedia. But what do we call those singular elements? Again, when we try to get closer to our users, we choose terms with negative connotations such as residuals, errors, or even perturbations. Of course, we all know residuals are not residues, errors are not mistakes and perturbations are not mental diseases. But do our peers understand that by ‘perturbations’ we mean ‘idiosyncrasies’?
‘Standard error’, introduced by Udny Yule, is another thoroughly negative term. Of course, there is a mistake when we confuse the estimator and the parameter; that is, when we take the sample value as the true value. But there is no error at all when the term is simply used to describe the expected fluctuation of the estimator. Why not call it ‘imprecision’? We teach that ‘the greater the standard error, the greater the imprecision,’ so why not say ‘standard imprecision’? This would have the added advantage of putting us in line with the Bayesians, who refer to the inverse as the ‘precision’.
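To make the link between ‘standard imprecision’ and ‘precision’ concrete, here is a minimal sketch in Python. It assumes the estimator is the mean of a simple random sample, and it takes precision in the usual Bayesian sense of an inverse variance, so it is computed as one over the squared standard error. The function names and the height data are made up for illustration.

```python
import math
import statistics


def standard_error_of_mean(sample):
    """Estimated standard error of the sample mean: s / sqrt(n)."""
    n = len(sample)
    return statistics.stdev(sample) / math.sqrt(n)


def precision_of_mean(sample):
    """Inverse of the squared standard error (inverse variance of the mean)."""
    return 1.0 / standard_error_of_mean(sample) ** 2


# Hypothetical heights in centimetres, invented for this illustration.
heights_cm = [160.0, 165.0, 170.0, 175.0, 180.0]

se = standard_error_of_mean(heights_cm)
precision = precision_of_mean(heights_cm)

print(f"standard error ('standard imprecision'): {se:.4f} cm")
print(f"precision (1 / SE^2): {precision:.4f} per cm^2")
```

The larger the first number, the smaller the second: both describe the same expected fluctuation of the estimator, one on the scale of the data and one as its inverse square.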
Regression is another word which has taken on negative meanings through psychology (‘a defensive reaction to some unaccepted impulses,’ Wikipedia) and computer science (‘the appearance of a bug which was absent in a previous revision,’ Ibid). In the era of modern computing, we now construct ‘models,’ which sounds rather positive. So, why not speak of ‘linear modeling’ and ‘logistic modeling’ instead of ‘linear regression’ and ‘logistic regression’?
Continuing with regression and variability, if we hope to collect a wide variety of predictors, we will obtain more information if those predictors are varied and independent. The corresponding results suggest that we should get independent samples from among heterogeneous subjects. Furthermore, some beautiful mathematical results show us that a sample provides more information if its units are independent and different than if they are related and similar. It can be easily stated: no more information is obtained if you ask someone’s height twice – but please don’t try to apply this example to sex habits. Statistics was founded to model diversity, to account for idiosyncrasies. That is what we are trained for, yet we continue to describe our raison d'être negatively.
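One standard way to see the height example is through the variance of a mean of correlated observations. The sketch below assumes n observations with a common variance and the same pairwise correlation rho (an equicorrelation assumption made here for simplicity, not something stated above), and it ignores measurement error, so that asking the same person twice corresponds to rho = 1. The function names are illustrative.

```python
def variance_of_mean(sigma2, n, rho):
    """Variance of the mean of n equicorrelated observations.

    sigma2: common variance of each observation
    rho:    common pairwise correlation, with -1/(n-1) <= rho <= 1
    """
    return sigma2 / n * (1 + (n - 1) * rho)


def effective_sample_size(n, rho):
    """Number of independent observations giving the same variance of the mean."""
    return n / (1 + (n - 1) * rho)


# Asking the same person's height twice: two perfectly correlated answers.
print(effective_sample_size(2, 1.0))

# Ten independent subjects keep their full weight.
print(effective_sample_size(10, 0.0))

# Ten related subjects (rho = 0.5) carry the information of fewer than two.
print(effective_sample_size(10, 0.5))
```

With rho = 1 the effective sample size stays at one however many times the question is repeated, which is the sense in which related and similar units add no information.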
Language is forever evolving, and if we are to keep up with the times, we must allow our language to evolve, just as Pearson did. After all, other fields continuously redefine their terms in order to comply with evolving connotations: ‘mutation’ has been changed to ‘polymorphism,’ ‘vegetative state’ to ‘unresponsive wakefulness syndrome,’ ‘opportunistic management’ to ‘tracking strategy,’ ‘mongoloid’ to ‘victim of trisomy 21,’ and a whole host of others.
‘Standard Deviation’ could more accurately be reserved for some specific quality control applications, where the norm should be constant. For example, if a bolt has to be manufactured to a certain width, then the variability of the bolt’s width is sensibly measured by the standard deviation. And of course, any lack of equality in human rights will be a deviation, even a violation. However, as a general term for the natural sciences such as biology, psychology or health care, where variability is the norm, why not say ‘Standard Diversity’? This term has no negative connotations and intuitively expresses the fact that variation is something to be expected.
There is another popular joke going around statistician cocktail parties, but this one in England: A researcher rubbed a magic lamp and, when a genie popped out, he wished for a highway from Southampton to New York. The genie rubbed his chin and asked if there weren’t an easier, more feasible wish. The researcher thought for a moment, then said: ‘Oh, yes! Please, I would love to understand my statistician.’ And the genie replied: ‘How many lanes would you like the highway to have?’
Statistics is the science of diversity. Statistics is highly precise in measuring information and uncertainty. Statistics is devoted to detecting ‘confused’ variables. Yet, we seem to ignore our colleagues’ confusion when we talk about ‘deviated’ human beings.
Figure 3: ‘Your son is 4 standard deviations away.’
So, away with deviation, residual, error and regression. We conjecture a wiser, funnier and happier world if we avoid negative terms for statistical concepts.