This book must be welcomed; it has a very strong and important message for statisticians and all users of statistics. The message is that statistical significance testing is very widely misused in many important fields of study, that this misuse leads to many false and misleading conclusions, and that this damages us all through the actions taken on the basis of such conclusions. The problem is not poorly executed significance testing; it is the widespread neglect of potential effect size, both in analysis methods and in deciding what ‘studies’ show.
The authors make a strong case, well illustrated with published examples from well-regarded professionals in many different fields. Their aim, against extremely large inertia in publishing and practice, is to change significantly the reliance on, and overemphasis of, ‘p’ values.
As a statistician who is relatively thoughtful and flexible about when and how to use ‘p’ values, I still found much to challenge my own thinking and previous practice. So I say: read the book, change your own practice and champion the issue with others. Ultimately, though, we must find a way to start to institutionalise the change. Perhaps Deming and other improvement and change thinkers have something to offer here.
The book includes significant discussion of the history of statistical development, rather more than is needed to make the authors’ case. There is far too little discussion of, and there are far too few recommendations on, good scientific methods and when and how to use statistical methods appropriately to support them. Partly because of this, there is also a risk that some readers could take the message to imply that statistical methods and issues are best avoided.
For me, the case is best made by the (many) examples from journals of how ‘p’ values and related measures are used, and of how the size of the effect and the ‘scientific’ context are rarely used for decision making, discussed, or even provided. The authors show how statistical significance is given overwhelming emphasis and how irrelevant hypotheses are often tested. Whilst we statisticians might be tempted to shake our heads and moan, this would ignore the fact that (as the authors show) most statisticians are themselves at the heart of the problem. The importance of the problem is shown most convincingly by the authors’ examples of completely incorrect conclusions and recommendations. Also convincing are the data and information from journals on the extent of the misuse.
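The gap between statistical significance and size of effect can be seen in a small numerical sketch. This is not taken from the book: the two studies, their numbers and the function name `two_sample_summary` are made up for illustration, and the calculation assumes a known common standard deviation and equal group sizes so that a simple normal approximation applies.

```python
import math


def two_sample_summary(mean_diff, sd, n_per_group):
    """Return (p_value, ci_low, ci_high) for a difference of two means.

    Assumes a known common standard deviation and equal group sizes,
    so a normal (z) approximation is used. Illustrative only.
    """
    se = sd * math.sqrt(2.0 / n_per_group)        # standard error of the difference
    z = mean_diff / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided p value
    half_width = 1.96 * se                        # approximate 95% interval
    return p_value, mean_diff - half_width, mean_diff + half_width


# Hypothetical study A: tiny effect (0.02 sd), very large sample.
p_a, lo_a, hi_a = two_sample_summary(0.02, 1.0, 100_000)

# Hypothetical study B: large effect (0.5 sd), small sample.
p_b, lo_b, hi_b = two_sample_summary(0.5, 1.0, 20)

for name, p, lo, hi in [("A", p_a, lo_a, hi_a), ("B", p_b, lo_b, hi_b)]:
    print(f"Study {name}: p = {p:.2g}, 95% interval = ({lo:.3f}, {hi:.3f})")
```

Under these assumptions, study A is ‘highly significant’ although its whole interval covers only very small effects, while study B is ‘not significant’ although its interval includes effects that could matter a great deal. A ‘p’ value alone would rank the two studies the wrong way round for most practical decisions.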
One specific criticism made by the authors is of stepwise regression. This widely used procedure uses only ‘p’ values for ‘decision making’. Once reminded of this in the context of all their criticism, it is not hard to think of more appropriate approaches, albeit ones not widely available on our computers. Even here, the authors neither discuss these other approaches nor offer advice on how to develop a more meaningful relationship between the variables. My own view is that it requires consideration of both the size of the effect coefficient and the likely range of the variable in the situations of relevance.
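A rough sketch of this last idea: multiply each coefficient by the plausible range of its variable to see how far the response could move, and compare that ordering with the ordering by ‘p’ value. The predictor names, coefficients, ‘p’ values and ranges below are all hypothetical, and `practical_swing` is a name introduced here, not an established procedure.

```python
# Hypothetical fitted model:
# name -> (coefficient, p_value, (low, high) plausible range in the relevant situations)
predictors = {
    "dose_mg": (0.004, 0.001, (0.0, 50.0)),
    "age_years": (0.30, 0.20, (20.0, 70.0)),
    "clinic_visits": (0.50, 0.03, (0.0, 4.0)),
}


def practical_swing(coef, low, high):
    """Change in the response as the variable moves across its plausible range."""
    return abs(coef) * (high - low)


# Ordering a 'p'-only procedure would favour: smallest p value first.
by_p = sorted(predictors, key=lambda name: predictors[name][1])

# Ordering by potential practical effect: largest swing first.
by_swing = sorted(
    predictors,
    key=lambda name: practical_swing(predictors[name][0], *predictors[name][2]),
    reverse=True,
)

print("By p value:        ", by_p)
print("By practical swing:", by_swing)
```

With these made-up numbers the two orderings are exactly reversed: the variable with the smallest ‘p’ value has the least potential effect on the response. In real use the uncertainty in each coefficient would also need to be weighed alongside the swing, and subject knowledge would be needed to set the ranges.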
The authors used a set of 19 questions to survey the quality of a large amount of published literature, and the set is worth noting. Useful as these questions are, however, they do not amount to true advice on the use of statistical and scientific methods.
The authors blame Fisher for the current ‘ills’ and suggest that Gosset (‘Student’) and Neyman and Pearson should have been given more attention. The authors cite Deming as one of a number of people who criticised the narrow Fisher approach. However, they do not discuss Deming’s vital distinction between analytic and enumerative studies. In this work, Deming makes clear that for true analytic studies (most studies, and certainly the more important ones) we have to predict the future. Measures of probability or uncertainty then involve a large element of judgement that cannot be avoided. On this view, ‘p’ values and power calculations are impossible, yet the authors give the use of power a strong push.
The book is not always easy to read. As an experienced applied statistician, improvement consultant and ‘Deming student’, with experience in many organisations of different types, I still found some paragraphs, explanations and references hard to understand or to relate to the theme. There was also too much repetition of the message and the history, though the authors might argue that current usage is so ingrained that they had to bang the drum hard. I personally found that the amount of history detracted from the message and reduced the space for more detailed discussion of what should be done.
The key issue for improvement in the book, I believe, is how better to achieve the authors’ aim of significantly changing practice. I do not believe that shouting about the errors of current practice, however loudly and well, will achieve this. The authors themselves cast doubt on the effect they have had so far, and show how little effect others have had.
In my experience, few professionals are taught, discuss or use good scientific methods. The key to achieving change is to change the process of change and improvement, whether in research, development or application. Here the ‘improvement industry’ has something to offer, and Deming in particular has much to offer. Simplistically, approaches such as PDSA or Six Sigma’s DMAIC (without the rigid significance testing) offer some insight. These could have been discussed. However, I suggest that there is a deep and widespread lack of knowledge of scientific method and thinking. How should we go through a process of improving something: through new insights, new learning, development of theories, data collection and analysis, and implementation or application, so that the improvement becomes a full reality? What are the different thinking and testing processes we should use? How do probability and data analysis best support this process? How does this differ in the different parts of the process?
Achieving change also requires the forces against it to be addressed.
A book review is not the place for a full exploration of my own thinking. However, the scientific methods explored need to cover induction and deduction, exploratory investigation vs experiments, designed and controlled situations vs real life in the future, before and after change comparison, monitoring of the current situation for ‘signals’, measures core to the change theme and secondary measures to check for ‘side effects’, and more. The distinction between enumerative and analytic studies, and how to run good analytic studies, must be covered. I believe the book’s findings are very important for the application and credibility of Statistics and Statisticians. We should support the authors’ campaign for a major change in practice and in the teaching of Statistics.