HYPOTHESIS TESTING AND CONFIDENCE INTERVAL MISINTERPRETATIONS IN TEXTBOOKS
Article Summary
The widespread persistence of NHST and confidence interval misinterpretations is due in part to gaps in statistical education. This article examines one such gap: statistical misunderstandings promoted in published textbooks.
Article status
This article is an early draft. The following tasks are left:
Complete literature review and writeups
Citations cleanup
Final proofreading
Final statistical review
Contact study authors for optional comment
Contact information
We’d love to hear from you! Please direct all inquiries to james@theresear.ch.
Table of contents
Coming soon
Review of the evidence
Haller and Krauss examined the NHST explanations in a popular textbook, Introduction to statistics for psychology and education. They note that within a span of just three pages the textbook's author provides eight different interpretations of a statistically significant test result, all of which are incorrect [https://www.metheval.uni-jena.de/lehre/0405-ws/evaluationuebung/haller.pdf]:
“the improbability of observed results being due to error”
“the probability that an observed difference is real”
“if the probability is low, the null hypothesis is improbable”
“the statistical confidence ... with odds of 95 out of 100 that the observed difference will hold up in investigations”
“the degree to which experimental results are taken ‘seriously’”
“the danger of accepting a statistical result as real when it is actually due only to error”
“the degree of faith that can be placed in the reality of the finding”
“the investigator can have 95 percent confidence that the sample mean actually differs from the population mean”
Education researchers Jeffrey Gliner, Nancy Leech, and George Morgan examined 12 textbooks popular in graduate-level education programs in the United States and ranked how well the authors clarified each of three common NHST misinterpretations [https://www.tandfonline.com/doi/pdf/10.1080/00220970209602058]. Textbooks were categorized into two groups: educational research books and statistics for education books. The three misinterpretations were:
There is disagreement about NHST. While not strictly a misinterpretation, the existence of disagreement and controversy about NHST is an issue many non-statisticians aren't aware of. Many texts teach NHST as if it were a mechanical process; in practice, many statisticians believe NHST either should not be used at all or requires a significant amount of education around correct usage [citation].
The p-value does not indicate the strength of a relationship. This is Fallacy 5 above. It is the estimated effect size that most closely corresponds to the notion of "strength of a relationship." The p-value simply measures how compatible the observed data are with the null hypothesis.
Statistical significance is not the same as practical significance.
The three authors independently rated each textbook on a scale of 0 to 4, indicating the degree of emphasis placed on each issue (rating disagreements between the authors were later resolved into a single common score). A rating of 0 indicated there was no discussion of the misinterpretation, while a rating of 4 indicated a full discussion including clear statements, examples, and help regarding best practices. The rating system is shown in the table below.
Rating | Degree of emphasis |
---|---|
0 | None |
1 | Indirect |
2 | Direct statement but brief; easily missed (no heading, box, examples, etc.) |
3 | More than a brief statement, but not very helpful in the form of examples of best practice |
4 | Clear statement with emphasis, examples, or both, and with help about best practice |
Table notes:
1. Reproduced from "Problems With Null Hypothesis Significance Testing (NHST): What Do the Textbooks Say?", Jeffrey Gliner, Nancy Leech, and George Morgan, The Journal of Experimental Education, 2002 [link]
The authors found that both sets of textbooks did a poor job of covering the first two misinterpretations: disagreement over NHST and the p-value not indicating the strength of a relationship.
No research textbook mentioned that NHST was controversial or that there was disagreement. Some statistics books covered the controversy, but the average score was only 1.17, indicating the emphasis on the topic was mostly indirect. Only two of the six statistics textbooks were rated 2 or above by the authors.
The misinterpretation regarding the p-value as a measure of the strength of a relationship didn’t fare much better, scoring just 1.5 for research textbooks and 1.67 for statistics textbooks. Four out of the six research textbooks and four out of the six statistics books were rated a 2 or above by the authors.
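This misinterpretation is easy to demonstrate numerically. The sketch below (a pure-Python illustration, not drawn from the reviewed textbooks) holds a small observed effect fixed at 0.1 standard deviations and shows that the p-value of a one-sample z-test shrinks toward zero as the sample size grows, even though the strength of the relationship never changes:

```python
import math

def two_sided_p(effect, n, sigma=1.0):
    """Two-sided p-value of a one-sample z-test of H0: mean = 0,
    given an observed sample mean `effect` and known sigma."""
    z = effect / (sigma / math.sqrt(n))
    # erfc(|z|/sqrt(2)) equals 2 * (1 - Phi(|z|)), the two-sided normal tail.
    return math.erfc(abs(z) / math.sqrt(2))

# The same small observed effect (0.1 standard deviations) at growing n:
for n in (25, 100, 400, 1600, 6400):
    print(f"n = {n:5d}  effect = 0.1  p = {two_sided_p(0.1, n):.6f}")
```

The effect size column is constant while the p-value falls from roughly 0.62 to essentially zero, which is why the p-value cannot serve as a measure of strength.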
The fact that statistical significance is not the same as practical significance was mostly well covered. Research textbooks scored 3.33 on average, with all six books earning a rating of 2 or better from the authors. Somewhat surprisingly, statistics textbooks did slightly less well: just four out of the six scored a rating of 2 or better, for an average score of 3.17. A summary of the scores for the two textbook categories is shown in the table below.
Issue | Average score of research texts | Average score of statistics texts |
---|---|---|
There is disagreement about NHST | 0 | 1.17 |
The p-value does not indicate the strength of a relationship | 1.5 | 1.67 |
Statistical significance is not the same as practical significance | 3.33 | 3.17 |
Table notes:
1. A total of 12 textbooks were reviewed, 6 used in graduate-level research classes in education and 6 in graduate-level statistics classes in education.
2. Textbooks were scored by three judges on a scale of 0 to 4. A score of 0 indicated no discussion of the issue and a score of 4 indicated a full discussion including clear statements, examples, and help regarding best practices.
3. Reproduced from "Problems With Null Hypothesis Significance Testing (NHST): What Do the Textbooks Say?", Jeffrey Gliner, Nancy Leech, and George Morgan, The Journal of Experimental Education, 2002 [link]
A small number of other researchers have examined textbooks for accuracy and collected examples of books they think contain errors. Gerd Gigerenzer, Stefan Krauss, and Oliver Vitouch found four books that they believe promote "the illusion that the level of significance would specify the probability of [the] hypothesis," without quoting specific examples.
In “On the Probability of Making Type I Errors,” Paul Pollard and John Richardson list a number of articles that misstate the probability of making Type 1 errors. These articles state that using the conventional Type 1 error rate of 0.05 causes a false positive 1 out of 20 times; however, they fail to mention that this probability is conditional on the null hypothesis being true. If the null hypothesis is false, the Type 1 error rate no longer applies. This is shown in detail in our article on p-value replication.
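The conditional nature of the Type 1 error rate can be seen in a small simulation. This is a pure-Python sketch (sample sizes and effect size are illustrative assumptions, not values from Pollard and Richardson): when the null hypothesis is true, about 5% of tests reject it, and every such rejection is a Type 1 error; when the null hypothesis is false, rejections are frequent, but none of them are Type 1 errors, so the 5% figure simply does not describe them.

```python
import random
import statistics

random.seed(42)

def abs_z(n, true_mean):
    """Draw n observations from Normal(true_mean, 1) and return |z| for H0: mean = 0."""
    sample = [random.gauss(true_mean, 1) for _ in range(n)]
    return abs(statistics.mean(sample) * n ** 0.5)

def rejection_rate(true_mean, trials=5000, n=30, z_crit=1.96):
    """Fraction of simulated studies that reject H0 at the 0.05 level."""
    return sum(abs_z(n, true_mean) > z_crit for _ in range(trials)) / trials

# H0 true: rejections occur about 5% of the time, and all are Type 1 errors.
print(f"Rejection rate when H0 is true:  {rejection_rate(0.0):.3f}")

# H0 false (true mean 0.5): rejections are common, but none are Type 1 errors.
print(f"Rejection rate when H0 is false: {rejection_rate(0.5):.3f}")
```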
In late January 2020, PhD candidate Felix Singleton Thorn (@FSingletonThorn) posted a Twitter thread which included numerous examples of confusing or incorrect definitions of p-values and the null hypothesis significance testing framework.
The examples from these three sources are compiled in the table below, along with the name of the textbook, the year of publication, and the statistical error, if stated.
Textbook name | Year published | Statistical error |
---|---|---|
Differential psychology1 | 1958 | Not stated |
Statistical analysis in psychology and education1 | 1959 | Not stated |
Statistical analysis in educational research1 | 1959 | Not stated |
Psychology: The science of mental life1 | 1973 | Not stated |
Fundamental statistics in psychology and education2 | 1978 | "The probability of making a Type 1 error is very simply and directly indicated by α" |
Basic statistics for business and economics2 | 1977 | "The probability of committing a Type 1 error, which is denoted by α, is called the significance level of the test" |
Experimental design and statistics2 | 1975 | "The probability of committing such an error is actually equivalent to the significance level we select..." and "If we reject the null hypothesis whenever the chance of it being true is less than 0.05, then obviously we shall be wrong 5 per cent of the time..." |
Understanding statistics in the behavioral sciences2 | 1981 | "Alpha determines the probability of making a Type I error" |
Experiment, design, and statistics in psychology2 | 1973 | "The significance level is simply the probability of making a Type 1 error" |
Experimental methodology2 | 1980 | "If the .05 significance level is set, you run the risk of being wrong and committing Type 1 error five times in 100" |
Introduction to design and analysis: A students' handbook2 | 1980 | "We will make a Type 1 error a small percentage of the time—the exact amount being specified by our significance level" |
Learning to use statistical tests in psychology: A students' guide2 | 1982 | "after correctly describing the significance level as a conditional prior probability, later refer to it as the 'percentage probability...that your results are due to chance'" |
Nonparametric statistics for the behavioral sciences2 | 1956 | "In fact, the probability that [the null hypothesis is true, but the p-value lies in the rejection region]...is given by α, for rejecting H0 when in fact it is true is the Type 1 error" |
A Dictionary of Dentistry3 | 2010 | "For example, a p-value of 0.01 (p = 0.01) means there is a 1 in 100 chance the result occurred by chance." |
Oxford Dictionary of Biochemistry and Molecular Biology3 | 2008 | "The closer the p‐value to 0, the more significant the match." |
A Dictionary of Business Research Methods3 | 2016 | "... part of the output of a statistical test, such as regression or ANOVA, that can take a value between 0.0 and 1.0. This is the probability that the observed difference or association among variables has arisen by chance." |
A Dictionary of Social Research Methods3 | 2016 | "The p-value is the probability that the observed data are consistent with the null hypothesis." |
Table notes:
1. Source: "The Null Ritual: What You Always Wanted to Know About Significance Testing but Were Afraid to Ask," Gerd Gigerenzer, Stefan Krauss, and Oliver Vitouch, The Sage handbook of quantitative methodology for the social sciences, 2004 [link]. The authors do not specify the actual statistical misinterpretations, simply stating that the authors of the listed textbooks promote "the illusion that the level of significance would specify the probability of [the] hypothesis."
2. Source: "On the Probability of Making Type I Errors," P. Pollard and J. T. E. Richardson, Psychological Bulletin, 1987 [link]
3. Source: These were collected in a Twitter thread by Felix Singleton Thorn (@FSingletonThorn), a PhD candidate studying statistics in the Interdisciplinary Meta Research Group at the University of Melbourne [link].
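Definitions like the dictionary entries above, which equate a p-value with "the chance the result occurred by chance," can be stress-tested with a simulation. The sketch below assumes a hypothetical research field in which 90% of tested hypotheses are truly null and 10% reflect a real effect of 0.5 standard deviations, with 50 observations per study; these base rates are illustrative assumptions, not empirical estimates. Under these assumptions, far more than 5% of "significant at the 0.05 level" results are flukes, so a significance threshold cannot be read as the probability that a given result arose by chance.

```python
import random

random.seed(0)

N_STUDIES, N_PER_STUDY, EFFECT, NULL_SHARE = 10_000, 50, 0.5, 0.9

false_pos = true_pos = 0
for _ in range(N_STUDIES):
    is_null = random.random() < NULL_SHARE
    true_mean = 0.0 if is_null else EFFECT
    # Draw the study's sample mean directly from its sampling distribution,
    # Normal(true_mean, 1/sqrt(n)), rather than simulating raw observations.
    sample_mean = random.gauss(true_mean, 1 / N_PER_STUDY ** 0.5)
    z = sample_mean * N_PER_STUDY ** 0.5
    if abs(z) > 1.96:  # "significant at the 0.05 level"
        if is_null:
            false_pos += 1
        else:
            true_pos += 1

fdr = false_pos / (false_pos + true_pos)
print(f"Share of significant results that are flukes: {fdr:.0%}")
```

With these base rates the share of significant results that come from true nulls lands near 30%, not 5%, which is exactly the gap the dictionary definitions paper over.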
Additional articles to be reviewed
BEHAVIORAL STATISTICS TEXTBOOKS: SOURCE OF MYTHS AND MISCONCEPTIONS?
https://www.jstor.org/stable/pdf/1164796.pdf
Six behavioral statistics textbooks listed by their publishers as "best-sellers" during 1982 were reviewed by the author. The intent was to detect the presence of and discuss the nature of some theoretical inferential inaccuracies, misinterpretations, and errors, that is, statistical "myths and misconceptions." Approximately 43 quotes were found that to some degree reflected misconceptions of statistical theory and that may mislead the behavioral researcher. These quotes were classified in general categories of half-truths, definitional errors, constant-cum-variable, cart-before-the-horse, and unitary inference.
References
Coming soon