Statistical Significance Misinterpretations among researchers
Article Summary
Many statisticians caution against using statistical significance as a method of making policy or business decisions. One reason why is that p-values are notoriously difficult to interpret, even for PhD level researchers. This article outlines some of the common misinterpretations of p-values.
Quick guide
Direct surveys of statistical knowledge
Click the dropdown of each subject to see detailed contents.
- Introduction
- Summary table of research
- NHST
- Review of evidence
- Meta-analysis
- Confidence intervals
- Review of evidence
- Meta-analysis
- Cliff effects and dichotomization of evidence
- Review of evidence
Article Status
This article is a draft and is incomplete.
Paid reviewers
Nothing goes out the door of The Research without rigorous quality assurance and review. The reviewers below were paid by The Research to ensure this article is accurate and fair. That work includes a review of the article content as well as the code that produced the results. This does not mean the reviewer would have written the article in the same way, that the reviewer is officially endorsing the article, or that the article is perfect (nothing is). It simply means the reviewer has done their best to take what The Research produced, improve it where needed, give editorial guidance, and generally considers the content to be correct. Thanks to all the reviewers for their time, energy, and guidance in helping to improve this article.
TBD
TBD
TBD
“There is a long line of work documenting how applied researchers misuse and misinterpret p-values in practice.”
Article summary
There are four categories of evidence of professional misinterpretations of Null Hypothesis Significance Testing (NHST). The current article focuses on the first category, although we have draft versions of two of the three other categories.
Surveys of statistical knowledge. Although not without methodological challenges, this work is the most direct method of assessing NHST understanding. The standard procedure is to administer a survey instrument to a convenience sample of students or researchers (or both) in a particular academic discipline. Subjects are given either a set of statements to judge as true or false or a set of questions to answer.
Examination of NHST usage in statistics and methodology textbooks. This work includes both systematic reviews and casual observations (for example in Twitter threads) and has documented incorrect or incomplete language when describing NHST. In these cases it is unclear if the researchers themselves do not fully understand NHST or if they were simply imprecise in their writing and editing or otherwise thought it best to omit or simplify NHST for pedagogical purposes. You can see a draft of our article on this category here.
Audits of NHST usage in published articles. Similar to reviews of textbooks, these audits include systematic reviews of peer-reviewed articles making use of NHST. The articles are assessed for correct NHST usage. Audits are typically focused on a particular academic discipline, with quantitative metrics often given representing the percentage of reviewed articles that exhibited correct and incorrect usage.
Published critiques of NHST. A large number of researchers have written articles underscoring the nuances of NHST and common misinterpretations (similar to this article) directed at their own subfield. In those cases it is implied that, in the experience of the authors, NHST misinterpretations are common enough in their subfield that a corrective is warranted. Using a semi-structured search we identified more than 60 such articles, each in a different subfield. You can see a draft of our article on this category here.
Most of the studies formally testing NHST knowledge via a hypothetical scenario and accompanying survey have been focused on misunderstandings by psychologists. In general these studies use small sample sizes and follow the method constructed by Michael Oakes in the late 1970s (see below for details). This body of research spans multiple countries and specialty areas of psychology.
Discuss the literature review process and how articles were discovered.
As Jacob Cohen put it in his famous 1990 article, the null hypothesis tests us [https://pdfs.semanticscholar.org/fa77/0a7fb7c45a59abbc4c2bc7d174fa51e5d946.pdf].
Direct surveys and tests of statistical knowledge
A number of studies have attempted to formally test NHST knowledge. This is done using a hypothetical scenario and an accompanying questionnaire asking participants to interpret the results of the scenario via multiple-choice selection. Historically, many of these studies seem to have focused on testing psychologists, although it is unclear why other disciplines have not had similar exposure to such surveys. More recent studies have expanded the focus beyond psychologists to other disciplines (e.g., [https://www.cambridge.org/core/services/aop-cambridge-core/content/view/D1520CFBFEB2C282E93484057D84B6C6/S183449091900028Xa.pdf/beyond_psychology_prevalence_of_p_value_and_confidence_interval_misinterpretation_across_different_fields.pdf#page=7&zoom=100,0,0] and [http://www.blakemcshane.com/Papers/mgmtsci_pvalue.pdf]).
Psychology
A total of 24 psychology studies were found that used survey instruments to directly test the knowledge of psychologists in one of four areas:
Null hypothesis significance testing (NHST) misinterpretations. These misinterpretations are primarily focused on misunderstanding p-value definitions or statistical properties; for example, interpreting the p-value as the probability of the null hypothesis or the probability of replication.
Confidence interval misinterpretations. For example, interpreting the confidence interval as a probability.
The dichotomization of evidence. A specific NHST misinterpretation in which results are interpreted differently depending on whether the p-value is statistically significant or statistically nonsignificant.
The cliff effect. A drop in the confidence of an experimental result based on the p-value. For example, having relatively high confidence in a result with a p-value of 0.04, but much lower confidence in a result with a p-value of 0.06.
Some studies are mixed, testing a combination of two or more areas above. Psychology has by far the most studies in these four areas of any academic field. A summary of each article is presented in the table below, including the authors and year published, the title of the article and a link to the paper, which of the four categories above the article belongs to, the subjects of the study and their associated sample size, and a brief summary of the article’s primary findings.
Below the table more details of each study are provided, broken out by the four categories above (articles that are mixed are presented multiple times, with each aspect of the study presented in the associated section). The cliff effect and dichotomization of evidence are combined into a single section. Of course, the methodological details and complete results of each study cannot be presented in full without duplicating the article outright. Readers are encouraged to go to the original articles to get the full details and in-depth discussions of each study.
A meta-analysis in the form of a simple weighted average of misinterpretations across studies is also presented.
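To make that concrete, the sketch below shows one way such a weighted average can be computed, assuming weighting by study sample size; the rates and sample sizes are placeholder values, not figures from the studies themselves.

```r
# Sketch of a sample-size-weighted average of misinterpretation rates.
# The rates and sample sizes below are placeholders, not actual study values.
misinterpretation_rate <- c(0.96, 0.90, 0.94)  # proportion with at least one misinterpretation
sample_size            <- c(70, 113, 418)      # number of subjects in each study

weighted.mean(misinterpretation_rate, w = sample_size)  # average weighted by study size
```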
In the course of analyzing each study below several errors were found. In cases where errors were present the authors were contacted for comment. We note these errors throughout the article as well as any responses received from the authors.
Authors & year | Article title | Category | Subjects | Primary findings |
---|---|---|---|---|
Rosenthal & Gaito (1963) | The Interpretation of Levels of Significance by Psychological Researchers [link] | Cliff effect | Psychology faculty (n=9) Psychology graduate students (n=10) | 1. A cliff effect was found at a p-value of 0.05. 2. There is less confidence in p-values generated from a sample size of n=10 than a sample size of n=100, suggesting participants care about both Type I and Type II errors. 3. Psychology faculty have lower confidence in p-values than graduate students |
Beauchamp & May (1964) | Replication report: Interpretation of levels of significance by psychological researchers [link] | Cliff effect | Psychology graduate students (n=11) PhD psychology faculty (n=9) | 1-page summary of a replication of Rosenthal & Gaito (1963). 1. No cliff effect was found at any p-value (however, see Rosenthal & Gaito, 1964). 2. Subjects expressed higher confidence with smaller p-values and larger sample sizes. |
Rosenthal & Gaito (1964) | Further evidence for the cliff effect in the interpretation of levels of significance [link] | Cliff effect | NA | 1-page comment on Beauchamp & May (1964), which was a replication of Rosenthal & Gaito (1963) 1. Despite Beauchamp & May's study claiming to find "no evidence" for a 0.05 cliff effect, a tendency to interpret results at this level as special can be seen in an extended report provided by Beauchamp & May. 2. Beauchamp & May themselves demonstrate a cliff effect when they find Rosenthal & Gaito's original 1963 results "nonsignificant" due to the p-value being 0.06. |
Minturn, Lansky, & Dember (1972) | The interpretation of levels of significance by psychologists: A replication and extension (Note that despite various attempts to obtain this paper a copy could not be found) | Cliff effect | Bachelor's students, master's students, and PhD graduates (n=51) | Results as described in Nelson, Rosenthal, & Rosnow (1986): 1. Cliff effects were found at p-values of 0.01, 0.05, and 0.10, with the most pronounced at the standard 0.05 level. 2. Subjects expressed higher confidence with smaller p-values and larger sample sizes. |
Oakes (1979) | The Statistical Evaluation of Psychological Evidence (unpublished doctoral thesis cited in Oakes 1986) [link] | NHST misinterpretations | Academic psychologists (n=54) | 1. On average subjects drastically misestimate the probability that a replication of a hypothetical experiment will yield a statistically significant result based on the p-value of an initial experiment (replication fallacy). |
Oakes (1979) | The Statistical Evaluation of Psychological Evidence (unpublished doctoral thesis cited in Oakes 1986) [link] | NHST misinterpretations | Academic psychologists (n=30) | 1. Subjects overestimate the effect size based on the p-value. |
Oakes (1986) | Statistical Inference [link] | NHST misinterpretations | Academic psychologists (n=70) | 1. 96% of subjects demonstrated at least one NHST misinterpretation. 2. 89% of subjects did not select the correct definition of statistical significance. 3. Only two respondents (3%) correctly answered both 1 and 2. |
Nelson, Rosenthal, & Rosnow (1986) | Interpretation of significance levels and effect sizes by psychological researchers [link] | Cliff effect | Academic psychologists (n=85) | 1. A cliff effect was found at a p-value of 0.05. 2. Subjects expressed higher confidence with smaller p-values. 3. Subjects expressed higher confidence with larger sample sizes, but this was moderated by years of experience. 4. Subjects expressed higher confidence with larger effect sizes, but this was moderated by years of experience. |
Zuckerman et al. (1993) | Contemporary Issues in the Analysis of Data: A Survey of 551 Psychologists [link] | Mixed | Psychology students (n=17) Academic psychologists (n=508) | 1. Overall accuracy of subjects was 59%. |
Falk & Greenbaum (1995) | Significance tests die hard: The amazing persistence of a probabilistic misconception [link] | NHST misinterpretations | Undergraduate psychology students (n=53) | 1. 92% of subjects demonstrated at least one NHST misinterpretation. |
Vallecillos (2000) | Understanding of the Logic of Hypothesis Testing Amongst University Students [link] | NHST misinterpretations | Psychology students (n=70) | 1. When shown a statement claiming NHST can prove the truth of a hypothesis, 17% of pedagogy students incorrectly marked the statement as true. Only 9% of psychology students that had correctly answered the statement also provided a correct written explanation of their reasoning. |
Poitevineau & Lecoutre (2001) | Interpretation of significance levels by psychological researchers: The .05 cliff effect may be overstated [link] | Cliff effect | Psychological researchers (n=18) | 1. The cliff effect varied by subject and three distinct categories were observed: (a) a decreasing exponential curve, (b) a negative linear curve, and (c) an all-or-none curve representing a very high degree of confidence when p is less than 0.05 and very low confidence otherwise. 2. Only subjects in the decreasing exponential curve group expressed higher confidence with larger sample sizes. |
Haller & Krauss (2002) | Misinterpretations of Significance: A Problem Students Share with Their Teachers? [link] | NHST misinterpretations | Psychology methodology instructors (n=30) Scientific psychologists (n=39) Undergraduate psychology students (n=44) | 1. 80% of psychology methodology instructors demonstrated at least one NHST misinterpretation. 2. 90% of scientific psychologists demonstrated at least one NHST misinterpretation. 3. 100% of students demonstrated at least one NHST misinterpretation. |
Lecoutre, Poitevineau, & Lecoutre (2003) | Even statisticians are not immune to misinterpretations of Null Hypothesis Tests [link] | NHST misinterpretations | Psychological researchers from various laboratories in France (n=20) | 1. Percentage of psychologists correctly responding to four situations combining p-values and effect size ranged from 15% to 100%. |
Monterde-i-Bort et al. (2010) | Uses and abuses of statistical significance tests and other statistical resources: a comparative study [link] | NHST misinterpretations | Psychology researchers in Spain (n=120) | Using eight statements from the original 29-statement survey instrument which had clear true/false answers, subjects deviated from the correct answer by an average of 1.32 points on a 5-point Likert scale. |
Hoekstra, Johnson, & Kiers (2014) | Confidence Intervals Make a Difference: Effects of Showing Confidence Intervals on Inferential Reasoning [link] | Mixed | Psychology PhD students (n=66) | 1. Cliff effects were observed for both NHST and confidence interval questions. 2. Subjects referenced significance less often and effect size more often when results were presented by means of CIs than by means of NHST. 3. On average subjects were more certain that a population effect exists and that the results are replicable when outcomes were presented by means of NHST rather than by means of CIs. |
Hoekstra et al. (2014) | Robust misinterpretation of confidence intervals [link] | CI misinterpretation | Psychology undergraduate students (n=442) Psychology master's students (n=34) Psychology researchers (n=120) | 1. 98% of undergraduate students demonstrated at least one CI misinterpretation. 2. 100% of master's students demonstrated at least one CI misinterpretation. 3. 97% of researchers demonstrated at least one CI misinterpretation. |
Badenes-Ribera et al. (2015) | Interpretation of the p value: A national survey study in academic psychologists from Spain [link] | NHST misinterpretations | Academic psychologists (n=418) | 1. 94% of subjects demonstrated at least one NHST misinterpretation related to the inverse probability fallacy. 2. 35% of subjects demonstrated a NHST misinterpretation related to the replication fallacy. 3. 40% of subjects demonstrated at least one NHST misinterpretation related to either the effect size fallacy or the practical/scientific importance fallacy. |
Badenes-Ribera et al. (2015) | Misinterpretations Of P Values In Psychology University Students (Catalan language) [link] | NHST misinterpretations | Psychology undergraduate students (n=63) | 1. 97% of subjects demonstrated at least one NHST misinterpretation related to the inverse probability fallacy. 2. 49% of subjects demonstrated at least one NHST misinterpretation related to either the effect size fallacy or the practical/scientific importance fallacy. 3. 73% of subjects demonstrated a NHST misinterpretation related to correct decision making. |
McShane & Gal (2015) | Supplement to “Blinding Us to the Obvious? The Effect of Statistical Training on the Evaluation of Evidence” [link] | Dichotomization of evidence | Psychological Science editorial board (n=54) | 1. A cliff effect was found between p-values of 0.01 and 0.27. |
Kühberger et al. (2015) | The significance fallacy in inferential statistics [link] | NHST misinterpretations | Psychology students enrolled in a statistics course (n=133) | When given only a cue that a study result was either significant or nonsignificant, students consistently estimated larger effect sizes in a significant scenario than in a nonsignificant scenario. |
Badenes-Ribera et al. (2016) | Misconceptions of the p-value among Chilean and Italian Academic Psychologists [link] | NHST misinterpretations | Academic psychologists (n=164) | 1. 62% of subjects demonstrated at least one NHST misinterpretation related to the inverse probability fallacy. 2. 12% of subjects demonstrated a NHST misinterpretation related to the replication fallacy. 3. 5% of subjects demonstrated a NHST misinterpretation related to the effect size fallacy. 4. 9% of subjects demonstrated a NHST misinterpretation related to the practical/scientific importance fallacy. |
Lyu et al. (2018) | P-Value, Confidence Intervals, and Statistical Inference: A New Dataset of Misinterpretation [link] | NHST misinterpretations | Psychology undergraduate students (n=106) Psychology master's students (n=162) Psychology PhD students (n=47) Psychologists with a PhD (n=31) | 1. 94% of undergraduate students demonstrated at least one NHST misinterpretation. 2. 96% of master's students demonstrated at least one NHST misinterpretation. 3. 100% of PhD students demonstrated at least one NHST misinterpretation. 4. 93% of subjects with a PhD demonstrated at least one NHST misinterpretation. |
Lyu et al. (2020) | Beyond psychology: prevalence of p value and confidence interval misinterpretation across different fields [link] | Mixed | Psychology undergraduate students (n=67) Psychology master's students (n=122) Psychology PhD students (n=47) Psychologists with a PhD (n=36) | 1. 94% of undergraduate students demonstrated at least one NHST misinterpretation and 93% demonstrated at least one CI misinterpretation. 2. 93% of master's students demonstrated at least one NHST misinterpretation and 93% demonstrated at least one CI misinterpretation. 3. 81% of PhD students demonstrated at least one NHST misinterpretation and 85% demonstrated at least one CI misinterpretation. 4. 92% of subjects with a PhD demonstrated at least one NHST misinterpretation and 92% demonstrated at least one CI misinterpretation. |
Surveys of NHST knowledge
Formal studies of NHST knowledge originated with Michael Oakes’ work in the late 1970s through the mid-1980s. These studies are outlined in his 1986 book Statistical Inference. Oakes’ primary NHST study consisted of a short survey instrument that he presented to 70 academic psychologists. Oakes notes subjects were "university lecturers, research fellows, or postgraduate students with at least two years of research experience." The survey instrument outlined the results of a simple experiment and asked the subjects which of six statements could be marked as “true” or “false” based on the experimental results. His survey is shown below:
Suppose you have a treatment that you suspect may alter performance on a certain task. You compare the means of your control and experimental groups (say 20 subjects in each sample). Further, suppose you use a simple independent means t-test and your result is (t = 2.7, d.f. = 18, p = 0.01). Please mark each of the statements below as “true” or “false”. “False” means that the statement does not follow logically from the above premises.
1. You have absolutely disproved the null hypothesis (that is, there is no difference between the population means).
2. You have found the probability of the null hypothesis being true.
3. You have absolutely proved your experimental hypothesis (that there is a difference between the population means).
4. You can deduce the probability of the experimental hypothesis being true.
5. You know, if you decide to reject the null hypothesis, the probability that you are making the wrong decision.
6. You have a reliable experimental finding in the sense that if, hypothetically, the experiment were repeated a great number of times, you would obtain a significant result on 99% of occasions.
The correct answer to all of these questions is "false.” Yet, only 3 of the 70 academic psychologists correctly marked all six statements as false. The average number of incorrect responses — statements for which the subject marked “true” — was 2.5.
Statements 1 and 3 are false because NHST cannot offer a “proof” of hypotheses about scientific or social phenomena. Statements 2 and 4 are examples of the Inverse Probability Fallacy; NHST measures the probability of observed data assuming the null hypothesis is true and therefore cannot offer evidence about the probability of either the null or alternative hypothesis. Statement 6 is incorrect because the p-value is not a measure of experimental replicability.
Across numerous studies, including Oakes (1986), Statement 5 was the most misinterpreted. This is likely due to the association respondents made between the statement and the Type I error rate. Formally, the Type I error rate is given by the pre-specified alpha value, usually set to a probability of 0.05 under the standard definition of statistical significance. It could then be said that if the sampling procedure and p-value calculation were repeated on a population in which the null hypothesis were true, 5% of the time the null would be mistakenly rejected. The Type I error rate is sometimes summarized as, “The probability that you wrongly reject a true null hypothesis.”
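This long-run interpretation can be illustrated with a small simulation; the sketch below is not taken from any of the studies discussed here and assumes normally distributed data, drawing both groups from the same population so that the null hypothesis is true and checking how often a test at alpha = 0.05 rejects.

```r
# Simulate the Type I error rate: when the null hypothesis is true,
# a test at alpha = 0.05 should reject about 5% of the time in the long run.
set.seed(1)
alpha <- 0.05
p_values <- replicate(10000, {
  control   <- rnorm(20)  # both groups come from the same population,
  treatment <- rnorm(20)  # so any rejection is a Type I error
  t.test(control, treatment)$p.value
})
mean(p_values < alpha)  # proportion of false rejections, approximately 0.05
```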
There is a rearranged version of Statement 5 that is close to this Type I error definition: “You know the probability that you are making the wrong decision if you decide to reject the null hypothesis.” Note though that this statement is missing a key assumption from the Type I error rate: that the null hypothesis is true. The actual wording of Statement 5 was more complex: “You know, if you decide to reject the null hypothesis, the probability that you are making the wrong decision.” The sentence structure makes the statement more difficult to understand, but again the statement does not include any indication about the truth of the null hypothesis. For this reason the statement cannot be judged as true.
As an additional piece of analysis we set out to understand if Statement 5 was syntactically sound and consulted with Maayan Abenina-Adar, a final-year PhD student at UCLA’s Department of Linguistics. Although the sentence may seem somewhat awkward as written, it is syntactically unambiguous and correctly constructed. However, a clearer version of the sentence would be the rearranged version previously mentioned: “If you decide to reject the null hypothesis you know the probability that you are making the wrong decision.” This version avoids some of the syntactic complexity of the original statement:
The conditional antecedent appearing in the middle of the sentence. The conditional antecedent is the “if you decide to reject the null hypothesis” phrase.
The use of the noun phrase, “probability that you are making the wrong decision,” as a so-called “concealed question.”
Whether the phrasing of Statement 5 contributed to its misinterpretation cannot be determined from the data at hand. One might argue that the more complex sentence structure caused respondents to spend extra time thinking about the nature of the statement, which might reduce misunderstanding. Plus, Statement 6 was also syntactically complex, but did not elicit the same rate of misinterpretation. (The Statement 6 wording was: “You have a reliable experimental finding in the sense that if, hypothetically, the experiment were repeated a great number of times, you would obtain a significant result on 99% of occasions.”). A controlled experiment comparing the two different versions of Statement 5 would be needed to tease apart the impact of the sentence’s structure and its statistical meaning on the rate of misunderstanding.
One other notable factor is that the pre-specified alpha value is not present in either the hypothetical scenario preceding Statements 1-6 or in Statement 5 itself. This might have been a clue that the statement couldn’t have referred to the Type I error rate since not enough information was given. On the other hand, the 0.05 alpha probability is so common that respondents may have assumed its value.
Using a different set of statements, the Psychometrics Group Instrument, both Mittag and Thompson (2000) and Gordon (2001) found that two sets of education researchers had particular trouble with statements about Type I and Type II error. However, this same instrument was also used by Monterde-i-Bort et al. (2010) to survey psychologists (details discussed later in this section), and while some researchers struggled with these questions, they were not answered incorrectly more frequently than other types of questions. More research would be needed to understand if Type I and Type II error is generally poorly understood by psychology students, researchers, and professionals, and the extent to which statement wording affects this confusion.
In a 2002 follow-up study to Oakes (1986), Heiko Haller and Stefan Krauss repeated the experiment in six German universities, presenting the survey to 44 psychology students, 39 non-methodology instructors in psychology, and 30 methodology instructors in psychology. Haller and Krauss added a sentence at the end of the result description noting that, “several or none of the statements may be correct.”
Subjects from Haller and Krauss (2002) were surveyed in 6 different German universities. Methodology instructors consisted of "university teachers who taught psychological methods including statistics and NHST to psychology students." Note that in Germany methodology instructors can consist of "scientific staff (including professors who work in the area of methodology and statistics)” as well as “advanced students who teach statistics to beginners (so called 'tutors')." Scientific psychologists consisted of "professors and other scientific staff who are not involved in the teaching of statistics."
The results of Haller and Krauss (2002) were similar to Oakes (1986) with 100% of the psychology students, 89% of the non-methodology instructors, and 80% of the methodology instructors incorrectly marking at least one statement as “true.” The average number of responses incorrectly marked as “true” was generally lower than in Oakes: 2.5 for the psychology students, 2.0 for non-methodology instructors, and 1.9 for methodology instructors.
Details of the six different misinterpretations for both Oakes’ original study and the Haller and Krauss replication are shown in the table below. The education level with the highest proportion of misinterpretation by question is highlighted in red, showing that U.S. psychologists generally fared worse than those in Germany. The most common misinterpretation across all four groups was that the p-value represents the Type I error rate. The least common misinterpretation among all groups within German universities was that the null hypothesis had been proved. This selection was likely marked as “false” due to the small p-value in the result statement. In contrast, relatively small percentages of each group believed that the small p-value indicated the null hypothesis had been disproved, indicating at least some understanding that the p-value is a probabilistic statement, not a proof. Details of this misunderstanding were examined in Statement 4, the “probability of the null is found,” with more than half of German methodology instructors and scientific psychologists answering correctly, but more than half of the other two groups answering incorrectly. The p-value is indeed a probability statement, but it is a statement about the data’s compatibility with the null hypothesis, not the null hypothesis itself.
Statement summaries | Methodology instructors (German universities) | Scientific psychologists (German universities) | Psychology students (German universities) | Academic psychologists (U.S.) |
---|---|---|---|---|
1. Null Hypothesis disproved | 10% | 15% | 34% | 1% |
2. Probability of null hypothesis | 17% | 26% | 32% | 36% |
3. Null hypothesis proved | 10% | 13% | 20% | 6% |
4. Probability of null is found | 33% | 33% | 59% | 66% |
5. Probability of Type I error | 73% | 67% | 68% | 86% |
6. Probability of replication | 37% | 49% | 41% | 60% |
Percentage with at least one misunderstanding | 80% | 89% | 100% | 96% |
Average number of misinterpretations | 1.9 | 2.0 | 2.5 | 2.5 |
Table notes:
1. Sample sizes: methodology instructors (n=30), scientific psychologists not teaching methods (n=39), psychology students (n=44), U.S. academic psychologists (n=70).
2. Reproduced from (a) "Misinterpretations of Significance: A Problem Students Share with Their Teachers?", Heiko Haller & Stefan Krauss, Methods of Psychological Research Online, 2002 [link] (b) Statistical Inference, Michael Oakes, Epidemiology Resources Inc. (1990) [link]
In a second question Oakes asked the 70 academic psychologists which of the six statements aligned with the usual interpretation of statistical significance from the NHST procedure. If none of the answers were correct, respondents were allowed to write in the correct interpretation. Again, only 3 of the 70 correctly identified that none of the answers were the correct interpretation. Across Oakes’ two studies only 2 of the 70 academic psychologists answered all 12 questions correctly (marking all six statements as incorrect in both studies).
One criticism of Oakes’ six-question instrument is that the hypothetical setup itself is incorrect. The supposed situation notes that there are two groups of 20 subjects each, but the resulting degrees of freedom are given as only 18, as can be seen in the question wording: “suppose you use a simple independent means t-test and your result is (t = 2.7, d.f. = 18, p = 0.01).” However, the correct degrees of freedom is actually 38 ((20 - 1) + (20 - 1) = 38).
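The correct figure is easy to verify; as a quick sketch using arbitrary simulated data (not part of Oakes’ instrument), an equal-variance independent means t-test on two groups of 20 reports 38 degrees of freedom:

```r
# Verify the degrees of freedom for an independent means t-test
# with 20 subjects per group: (20 - 1) + (20 - 1) = 38, not 18.
set.seed(1)
control   <- rnorm(20)
treatment <- rnorm(20)
t.test(control, treatment, var.equal = TRUE)$parameter  # df = 38
```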
One might wonder whether this miscalculation confused some of the respondents. This explanation seems unlikely since follow-up surveys found similar results. In 2018 Ziyang Lyu, Kaiping Peng, and Chuan-Peng Hu surveyed 346 psychology students and researchers in China using the same wording as in Haller and Krauss (translated into Chinese), but correcting the degrees of freedom from 18 to 38. Error rates similar to previous studies were obtained, with 97% of master’s and PhD students (203 out of 209), 100% of post-doc researchers and assistant professors (23 out of 23), and 75% of experienced professors (6 out of 8) selecting at least one incorrect answer.
Lyu et al. recruited subjects via an online survey that included notifications on WeChat, Weibo, and blogs. There was also a pen-and-paper survey that was conducted during the registration day of the 18th National Academic Congress of Psychology that took place in Tianjin, China. No monetary or other material payment was offered.
Details from Lyu et al. (2018) are shown below. The education level with the highest proportion of misinterpretation by question is highlighted in red, demonstrating that master’s students had the highest misinterpretation rate in four of the six questions. The overall pattern is similar to that seen in Oakes and Haller and Krauss. The “Probability Type 1 error” and “Probability of null is found” misinterpretations were the most prominent. The “Null disproved” and “Null proved” were the least selected misinterpretations, but were both quite common and substantially more prevalent than in the German and U.S. cases.
Statement summaries | Undergraduate | Master's | PhD | Postdoc and assistant professors | Experienced professors |
---|---|---|---|---|---|
1. Null Hypothesis disproved | 20% | 53% | 17% | 39% | 0% |
2. Probability of null hypothesis | 55% | 58% | 51% | 39% | 25% |
3. Null hypothesis proved | 28% | 46% | 6% | 35% | 25% |
4. Probability of null is found | 60% | 43% | 51% | 39% | 50% |
5. Probability of Type I error | 77% | 44% | 96% | 65% | 75% |
6. Probability of replication | 32% | 56% | 36% | 35% | 12% |
Table notes:
1. Percentages are not meant to add to 100%.
2. Sample sizes: Undergraduates (n=106), Master's students (n=162), PhD students (n=47), Postdoc or assistant prof (n=23), Experienced professor (n=8).
3. Data calculated from (a) "P-Value, Confidence Intervals, and Statistical Inference: A New Dataset of Misinterpretation", Ziyang Lyu, Kaiping Peng, and Chuan-Peng Hu, Frontiers in Psychology, 2018, paper: [link], data: [link]
Although not presented in their article, Lyu et al. also recorded the psychology subfield of subjects in their study. Because the authors made their data public, it was possible for us to reanalyze the data, slicing by sub-field. Our R code used to calculate these figures is provided in the additional materials section of this article. All subfields had relatively high rates of NHST misinterpretations. Social and legal psychology had the highest misinterpretation rate with 100% of the 51 respondents in that subfield having at least one misinterpretation. Neuroscience and neuroimaging respondents had the lowest rate with 78% having at least one misinterpretation, although there were only nine total subjects in the subfield.
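For readers who want a sense of that calculation, the sketch below shows the kind of per-subfield tabulation involved; the data frame and its column names are illustrative placeholders, not the variable names in the published dataset.

```r
# Sketch of tabulating the percentage of respondents with at least one
# NHST misinterpretation by subfield. Placeholder data, not the published dataset.
responses <- data.frame(
  subfield = c("Social & legal psychology", "Social & legal psychology",
               "Neuroscience/neuroimaging"),
  n_misinterpretations = c(2, 1, 0)  # statements incorrectly marked "true" (0-6)
)

# Percentage of respondents with at least one misinterpretation, by subfield
round(tapply(responses$n_misinterpretations >= 1, responses$subfield, mean) * 100)
```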
Psychological subfield | Percentage with at least one NHST misunderstanding | Sample size |
---|---|---|
Fundamental research & cognitive psychology | 95% | 74 |
Cognitive neuroscience | 98% | 121 |
Social & legal psychology | 100% | 51 |
Clinical & medical psychology | 84% | 19 |
Developmental & educational psychology | 97% | 30 |
Psychometric and psycho-statistics | 94% | 16 |
Neuroscience/neuroimaging | 78% | 9 |
Others | 94% | 17 |
Table notes:
Data calculated from (a) "P-Value, Confidence Intervals, and Statistical Inference: A New Dataset of Misinterpretation", Ziyang Lyu, Kaiping Peng, and Chuan-Peng Hu, Frontiers in Psychology, 2018, paper: [link], data: [link]
Modified instruments with slightly different hypothetical setups have also been used. In a separate study from China released in 2020, Xiao-Kang Lyu, Yuepei Xu, Xiao-Fan Zhao, Xi-Nian Zuo, and Chuan-Peng Hu tested 1,479 students and researchers, including 272 in the field of psychology. This was the largest sample size of the eight disciplines surveyed. They used a four-question instrument derived from Oakes (1986) in which respondents were randomized into either a version where the p-value was statistically significant or one where it was statistically nonsignificant.
Note that the Lyu in Lyu et al. (2018) described above and in Lyu et al. (2020) described here are different researchers. Ziyang Lyu coauthored the 2018 study, Xiao-Kang Lyu coauthored the 2020 study. However, Chuan-Peng Hu was a coauthor on both studies.
Recruitment for Lyu et al. (2020) was done by placing advertisements on the following WeChat Public Accounts: The Intellectuals, Guoke Scientists, Capital for Statistics, Research Circle, 52brain, and Quantitative Sociology. Respondents were located in two geographic areas, Mainland China (83%) and overseas (17%), although all respondents received their degrees in China.
Lyu et al. (2020) used the following survey instrument:
Suppose you have a treatment that you suspect may alter performance on a certain task. You compare the means of your control and experimental groups (50 subjects in each sample). Your hypotheses are as follows. H0: No significant difference exists between the population means corresponding to experimental and the control groups. H1: Significant difference exists between the experimental and the control groups. Further, suppose you use a simple independent means t test, and your result is t = 2.7, df = 98, p = .008 (in the significant version) or t = 1.26, df = 98, p = .21 (in the non-significant version).
The response statements read as follows, with the nonsignificant version wording appearing in parentheses, substituting for the word directly preceding it.
1. You have absolutely disproved (proved) the null hypothesis.
2. You have found the probability of the null (alternative) hypothesis being true.
3. You know, if you decide to (not to) reject the null hypothesis, the probability that you are making the wrong decision.
4. You have a reliable (unreliable) experimental finding in the sense that if, hypothetically, the experiment was repeated a great number of times, you would obtain a significant result on 99% (21%) of occasions.
Using the open source data made available by the authors we attempted to reproduce the findings in Lyu et al. (2020). However, we were not able to reproduce either the top-level figure for NHST or CI or any of the figures from the main table (Table 1). We contacted co-author Chuan-Peng Hu about the possible errors over email and shared our R code. He and the other paper authors then reexamined the data and confirmed our analysis was correct, later issuing a correction to the paper.
We also used the open source data to more thoroughly examine the study results. At 91%, psychologists had the sixth highest proportion of respondents with at least one NHST misinterpretation out of the eight professions surveyed by Lyu et al. (2020). The proportion of respondents incorrectly answering the significant and nonsignificant versions of the instrument is shown below. The significant version had a higher misinterpretation rate for three out of the four questions, including a 14 percentage point difference for Statement 3 and a 13 percentage point difference for Statement 4.
Overall, psychologists had the second highest average number of NHST misinterpretations, 1.94 out of a total of four possible. Given the results in the table below it is perhaps surprising that the nonsignificant version had a substantially higher rate of misinterpretations, 2.14, compared to 1.71 for the significant version. An independent means t-test comparing the average number of incorrect responses between the two test versions resulted in a p-value of 0.0017 (95% CI: 0.16 to 0.70). This suggests that the observed data in the study are relatively incompatible with the hypothesis that random sampling variation alone accounted for the difference in the average number of misinterpretations between the two test versions. Looking across the entire 1,479-subject sample of all eight disciplines paints a similar picture, with a p-value of 0.00011 (95% CI: 0.12 to 0.36). However, more research would be needed to fully understand why the nonsignificant version posed more interpretation challenges.
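A comparison of this kind can be run as a standard two-sample t-test on the per-respondent counts of incorrect answers; a minimal sketch using placeholder vectors (not the study data) is below.

```r
# Sketch of comparing the average number of incorrect responses (0-4 per
# respondent) between the two test versions. Placeholder data, not the study's.
significant_version    <- c(2, 1, 3, 2, 0, 1)
nonsignificant_version <- c(3, 2, 2, 4, 1, 2)

t.test(nonsignificant_version, significant_version)  # reports t, df, p-value, and 95% CI
```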
Statement summaries | Significant version (n=125) | Nonsignificant version (n=147) |
---|---|---|
1. You have absolutely disproved (proved) the null hypothesis | 50% | 54% |
2. You have found the probability of the null (alternative) hypothesis being true. | 59% | 40% |
3. You know, if you decide to (not to) reject the null hypothesis, the probability that you are making the wrong decision. | 77% | 63% |
4. You have a reliable (unreliable) experimental finding in the sense that if, hypothetically, the experiment was repeated a great number of times, you would obtain a significant result on 99% (21%) of occasions. | 42% | 29% |
Table notes:
1. Percentages do not add to 100% because multiple responses were acceptable.
2. Sample sizes: significant version (n=125), nonsignificant version (n=147).
3. Reference: "Beyond psychology: prevalence of p value and confidence interval misinterpretation across different fields", Xiao-Kang Lyu, Yuepei Xu, Xiao-Fan Zhao, Xi-Nian Zuo, and Chuan-Peng Hu, Frontiers in Psychology, 2020, paper: [link], data: [link]
A breakdown of NHST misunderstandings by education is shown below. All education levels had high rates of misunderstanding at least one NHST question. PhD students fared best, with “only” 81% demonstrating at least one NHST misinterpretation, but they also had the highest average number of incorrect responses with 2.4 (out of four possible), indicating that those respondents that did have a misinterpretation tended to have multiple misinterpretations. This was one of the highest rates of any combination of academic specialty and education, behind only statistics PhD students and post-PhD statisticians.
Education | Sample size | Percentage with at least one NHST misunderstanding | Average number of NHST misunderstandings (out of four) |
---|---|---|---|
Undergraduates | 67 | 94% | 1.9 |
Master's | 122 | 93% | 1.8 |
PhD | 47 | 81% | 2.4 |
Post-PhD | 36 | 92% | 2.0 |
In a set of three studies — the first focusing on Spanish academic psychologists, the second on Spanish university students, and the third on Italian and Chilean academic psychologists — researchers Laura Badenes-Ribera and Dolores Frías-Navarro created a 10-question instrument. (They partnered with Marcos Pascual-Soler on both Spanish studies, Héctor Monterde-i-Bort on the Spanish academic psychologist study, and Bryan Iotti, Amparo Bonilla-Campos and Claudio Longobardi on the Italian and Chilean study). The study of university students was published in Spanish, but we had the study professionally translated into English.
Those particularly interested in the statistical practices of Spanish and Italian psychologists are also encouraged to review the team’s other work on knowledge of common statistical terms (Italians: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6024681/#B10, Spanish: https://www.formacionasunivep.com/ejihpe/index.php/journal/article/download/200/162). These studies are not reviewed here as they do not deal with statistical misinterpretations.
Their primary survey instrument is shown below, with each statement categorized by the fallacy it is intended to test.
Let’s suppose that a research article indicates a value of p = 0.001 in the results section (alpha = 0.05). Mark which of the following statements are true (T) or false (F).
Inverse Probability Fallacy
1. The null hypothesis has been shown to be true
2. The null hypothesis has been shown to be false
3. The probability of the null hypothesis has been determined (p = 0.001)
4. The probability of the experimental hypothesis has been deduced (p = 0.001)
5. The probability that the null hypothesis is true, given the data obtained, is 0.01
Replication Fallacy
6. A later replication would have a probability of 0.999 (1-0.001) of being significant
Effect Size Fallacy
7. The value p = 0.001 directly confirms that the effect size was large
Clinical or Practical Significance Fallacy
8. Obtaining a statistically significant result indirectly implies that the effect detected is important
Correct interpretation and decision made
9. The probability of the result of the statistical test is known, assuming that the null hypothesis is true
10. Given that p = 0.001, the result obtained makes it possible to conclude that the differences are not due to chance
Statements 1 and 2 are false because NHST cannot definitively show the truth or falsity of any hypothesis. Statements 3, 4, and 5 are examples of the Inverse Probability Fallacy; NHST measures the probability of the observed data assuming the null hypothesis is true and therefore cannot offer evidence about the probability of either the null or alternative hypothesis. It is unclear why Statement 5 uses a p-value of 0.01 while other statements use a p-value of 0.001. Thus far the authors have not replied to multiple emails we have sent, so we cannot provide additional insight. Statement 6 is incorrect because the p-value is not a measure of experimental replicability, the so-called Replication Fallacy. Statement 7 is incorrect because the p-value is not a direct measure of effect size; a separate calculation is needed to determine effect size. Statement 8 is incorrect because whether a result is important depends not on the p-value but on the broader scientific context under which the hypothesis test is being performed. Statement 9 is correct as it is a restatement of the p-value definition. The authors also consider Statement 10 correct, but its interpretation depends on how one interprets the word “conclude.” One cannot definitively conclude that a small p-value implies differences that are not due to random chance, because large random errors that heavily influence p-values are sometimes observed. Some subjects might also interpret “conclude” to mean the decision is correct. However, as previously stated, NHST does not provide probabilities of hypotheses and therefore the correctness of the decision cannot be determined from the p-value. On the other hand, a p-value of 0.001 is typically small enough that a researcher considers random errors sufficiently unlikely that they can be ignored for whatever decision or conclusion is currently at hand.
For the study of Spanish academic psychologists a total of 418 subjects who worked at Spanish public universities were recruited. Participants were contacted via email based on a list collected from publicly available sources. The mean period of time working at a university was 14 years. Subjects were asked to provide their subfield.
Again, results show substantial misunderstanding, with 94% of Spanish academic psychologists choosing at least one incorrect response related to the five inverse probability questions, 35% incorrectly answering question six related to replication, and 40% incorrectly marking statement seven or eight as true. The percentage of respondents demonstrating at least one misunderstanding across all questions was not provided by the authors. Although questions nine and ten were listed in the questionnaire, the results were not presented.
The percentage of respondents incorrectly answering each question broken out by psychological subfield is shown below. The subfield with the highest proportion of misinterpretation by question is highlighted in red. Developmental and educational psychologists fared worst as there were four questions for which that population had the highest proportion of misunderstandings.
Statements | Personality, Evaluation and Psychological Treatments | Behavioral Sciences Methodology | Basic Psychology | Social Psychology | Psychobiology | Developmental and Educational Psychology |
---|---|---|---|---|---|---|
1. The null hypothesis has been shown to be true | 8% | 2% | 7% | 5% | 7% | 13% |
2. The null hypothesis has been shown to be false | 65% | 36% | 61% | 66% | 55% | 62% |
3. The probability of the null hypothesis has been determined (p=0.001) | 51% | 58% | 68% | 62% | 62% | 56% |
4. The probability of the experimental hypothesis has been deduced (p=0.001) | 41% | 13% | 23% | 37% | 38% | 44% |
5. The probability that the null hypothesis is true, given the data obtained, is 0.01 | 33% | 19% | 25% | 31% | 41% | 36% |
6. A later replication would have a probability of 0.999 (1-0.001) of being significant | 35% | 16% | 36% | 39% | 28% | 46% |
7. The value p=0.001 directly confirms that the effect size was large | 12% | 3% | 9% | 16% | 24% | 18% |
8. Obtaining a statistically significant result indirectly implies that the effect detected is important | 35% | 16% | 36% | 35% | 28% | 46% |
9. The probability of the result of the statistical test is known, assuming that the null hypothesis is true | Results not presented | | | | | |
10. Given that p = 0.001, the result obtained makes it possible to conclude that the differences are not due to chance | Results not presented | | | | | |
Table notes:
1. The academic sub-field with the highest proportion of misinterpretation by question is highlighted in red.
2. Percentages are not meant to add to 100%.
3. Sample sizes: Personality, Evaluation and Psychological Treatments (n=98), Behavioral Sciences Methodology (n=67), Basic Psychology (n=56), Social Psychology (n=74), Psychobiology (n=29), Developmental and Educational Psychology (n=94)
4. Reference: "Interpretation of the p value: A national survey study in academic psychologists from Spain", Laura Badenes-Ribera, Dolores Frías-Navarro, Héctor Monterde-i-Bort, and Marcos Pascual-Soler, Psicothema, 2015 [link]
The study testing the knowledge of Spanish students used the same 10-question instrument as the study of Spanish academic psychologists, except the question about p-value replicability was not present. A total of 63 students took part in the study, all recruited from the University of Valencia. On average students were 20 years of age and all had previously studied statistics. The results broken down by fallacy category were that 97% of subjects demonstrated at least one NHST misinterpretation related to the Inverse Probability Fallacy (questions 1-5), 49% of subjects demonstrated at least one NHST misinterpretation related to either the Effect Size Fallacy or the Clinical or Practical Importance Fallacy (questions 7 and 8), and 73% of subjects demonstrated a NHST misinterpretation related to correct decision making (questions 9 and 10). The percentage of incorrect responses by question is shown below.
Statements | Percentage incorrectly answering the question |
---|---|
1. The null hypothesis has been shown to be true | 25% |
2. The null hypothesis has been shown to be false | 56% |
3. The probability of the null hypothesis has been determined (p=0.001) | 65% |
4. The probability of the experimental hypothesis has been deduced (p=0.001) | 29% |
5. The probability that the null hypothesis is true, given the data obtained, is 0.01 | 51% |
6. A later replication would have a probability of 0.999 (1-0.001) of being significant | Not shown to respondents |
7. The value p=0.001 directly confirms that the effect size was large | 18% |
8. Obtaining a statistically significant result indirectly implies that the effect detected is important | 41% |
9. The probability of the result of the statistical test is known, assuming that the null hypothesis is true | 49% |
10. Given that p = 0.001, the result obtained makes it possible to conclude that the differences are not due to chance | 50% |
Table notes:
1. Percentages are not meant to add to 100%.
2. Sample size is 63 students.
3. Reference: "Misinterpretations Of P Values In Psychology University Students", Laura Badenes-Ribera, Dolores Frías-Navarro, and Marcos Pascual-Soler, Anuari de Psicologia de la Societat Valenciana de Psicologia, 2015 [link]
The study of Chilean and Italian academic psychologists included 164 participants overall (134 Italian, 30 Chilean). Participants were contacted via email based on a list collected from Chilean and Italian universities. For both countries subjects were broken out into methodology and non-methodology areas of expertise. In Italy the average number of years teaching or conducting research was 13, with a standard deviation of 10.5 years. The gender breakdown was 54% women and 46% men; 86% were from public universities and the remaining 14% were from private universities. In Chile the average number of years teaching or conducting research was 15.5, with a standard deviation of 8.5 years. Subjects were evenly split between women and men; 57% were from private universities, while 43% were from public universities.
Overall, 56% of methodology instructors and 74% of non-methodology instructors selected at least one incorrect response. Note that the sample sizes for both Chilean methodologists (n=5) and Italian methodologists (n=13) were substantially smaller than the corresponding sample sizes for non-methodologists, 25 and 121 respectively. Although questions nine and ten were listed in the questionnaire, the results were not presented.
The percentage of incorrect responses by question is shown below. The population with the highest proportion of misinterpretation by question is highlighted in red, showing that the two countries had roughly equal levels of misinterpretations. While Chilean methodologists did not have any statements for which their level of misinterpretation was highest, there were only five psychologists in the sample.
Statements | Chilean methodology | Chilean other areas | Italian methodology | Italian other areas |
---|---|---|---|---|
1. The null hypothesis has been shown to be true | 0% | 4% | 0% | 4% |
2. The null hypothesis has been shown to be false | 40% | 60% | 23% | 28% |
3. The probability of the null hypothesis has been determined (p=0.001) | 20% | 12% | 31% | 26% |
4. The probability of the experimental hypothesis has been deduced (p=0.001) | 0% | 16% | 8% | 12% |
5. The probability that the null hypothesis is true, given the data obtained, is 0.01 | 0% | 8% | 23% | 14% |
6. A later replication would have a probability of 0.999 (1-0.001) of being significant | 0% | 20% | 8% | 12% |
7. The value p=0.001 directly confirms that the effect size was large | 0% | 0% | 8% | 6% |
8. Obtaining a statistically significant result indirectly implies that the effect detected is important | 0% | 8% | 8% | 9% |
9. The probability of the result of the statistical test is known, assuming that the null hypothesis is true | Results not presented | | | |
10. Given that p = 0.001, the result obtained makes it possible to conclude that the differences are not due to chance | Results not presented | | | |
Table notes:
1. Percentages are rounded and may not add to 100%.
2. Sample sizes: Chilean methodologists (n=5), Chilean other areas (n=25), Italian methodologists (n=13), Italian other areas (n=121).
3. Reference: "Misconceptions of the p-value among Chilean and Italian Academic Psychologists", Laura Badenes-Ribera, Dolores Frias-Navarro, Bryan Iotti, Amparo Bonilla-Campos, and Claudio Longobardi, Frontiers in Psychology, 2016 [link]
In 2010 Hector Monterde-i-Bort, Dolores Frías-Navarro, and Juan Pascual-Llobell used Part II of the Psychometrics Group Instrument developed by Mittag (1999) to survey 120 psychology researchers in Spain. All subjects had a doctorate degree or proven research or teaching experience in psychology.
Previously, Mittag and Thompson (2000) surveyed 225 members of the American Educational Research Association (AERA) using the same instrument and Gordon (2001) surveyed 113 members of the American Vocational Education Research Association (AVERA). Both studies are discussed in the Education section of this article.
Part II of the Psychometrics Group Instrument contains 29 statements broken out into nine categories. Subjects responded using a 5-point Likert scale where for some statements 1 meant agree and 5 meant disagree, and for others the scale was reversed so that 1 denoted disagreement and 5 agreement. We refer to the first scale direction as positive (+) and the second as negative (-).
The 5-point Likert scale is often constructed using the labels 1 = Strongly agree, 2 = Agree, 3 = Neutral, 4 = Disagree, 5 = Strongly disagree (https://legacy.voteview.com/pdf/Likert_1932.pdf, http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.474.608&rep=rep1&type=pdf, https://link.springer.com/referenceworkentry/10.1007%2F978-0-387-78665-0_6363). However, the Psychometrics Group Instrument uses the labels 1 = Agree, 2 = Somewhat agree, 3 = Neutral, 4 = Somewhat disagree, 5 = Disagree (with the word “strongly” omitted).
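When items run in mixed directions like this, analyses typically recode the reversed statements so that all responses point the same way before comparing them with the normatively correct answers; a minimal sketch of that recoding (not the authors' code) is below.

```r
# Reverse-code negatively keyed items on a 1-5 Likert scale so that
# higher values mean the same thing for every statement.
responses <- c(1, 2, 5, 4, 3)   # placeholder responses to a reversed (-) statement
recoded   <- 6 - responses      # 1 becomes 5, 2 becomes 4, and so on
recoded
```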
The instrument included both opinion- and fact-based statements. These two categories are not recognized within the instrument itself, but are our categorization based on whether the question has a normatively correct answer; if so it is considered fact-based. An example of an opinion-based statement was:
It would be better if everyone used the phrase, “statistically significant,” rather than “significant”, to describe the results when the null hypothesis is rejected.
An example of a fact-based statement was:
It is possible to make both Type I and Type II error in a given study.
We selected eight fact-based statements related to NHST as a measure of NHST misunderstandings in this group. Mean responses on the Likert scale are shown below. As a measure of the degree of misunderstanding, the absolute difference between the mean response and the normatively correct answer was calculated. For example, Statement 2, which reads, “Type I errors may be a concern when the null hypothesis is not rejected,” is incorrect. This is because Type I error refers to falsely rejecting a true null hypothesis; if the null hypothesis is not rejected there is no possibility of making an error in the rejection. For this reason every respondent should have selected 5 (disagree). The mean response was in fact 3.91, so the deviation was 5.0 - 3.91 = 1.09 Likert scale points.
One case in which ambiguity might exist is Statement 1, which reads “It is possible to make both Type I and Type II error in a given study.” For the purposes of this analysis we consider this statement incorrect, as was intended by the authors. This is because Type II error occurs when the null hypothesis is not rejected, while Type I error occurs when the null hypothesis is rejected. Because the null is either rejected or not, with mutual exclusion between the two possibilities, only one type of error can occur in a given hypothesis test. However, the word “study” is used in the statement and therefore one could easily argue that a study can contain multiple hypothesis tests. There is no way to know if respondents interpreted “study” to mean the entire process of data collection and hypothesis testing of a single outcome of interest or if they interpreted it to mean multiple hypothesis tests used to examine various aspects of a single phenomenon.
Averaging the deviation from the correct answer across all eight responses resulted in a figure of 1.32. This is roughly equivalent to somewhat agreeing with a statement that is in fact true, and therefore normatively should elicit complete agreement. The 1.32 mean difference is substantially lower than that found in both the AERA and AVERA populations, which both had mean differences above 1.7.
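As a concrete illustration of how these deviations are computed, the sketch below applies the coding rules described above and in the table notes (positive scale direction: 1 = agree, 5 = disagree). The function name and example values are ours, taken from Statement 2 in the table below.

```python
def deviation_from_correct(mean_response, statement_is_correct, positive_scale=True):
    """Absolute deviation of the mean Likert response from the normatively correct answer.

    Positive scale direction: 1 = agree, 5 = disagree (reversed for the negative direction).
    For a correct statement the normative response is full agreement; for an
    incorrect statement it is full disagreement.
    """
    if positive_scale:
        correct = 1.0 if statement_is_correct else 5.0
    else:
        correct = 5.0 if statement_is_correct else 1.0
    return abs(mean_response - correct)

# Statement 2 (incorrect, positive scale direction) had a mean response of 3.91:
print(round(deviation_from_correct(3.91, statement_is_correct=False), 2))  # 1.09
```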
Statement 3 had the largest mean difference. This was also the case for the AERA and AVERA populations; however, in those populations all three statements relating to Type I and Type II errors had larger deviations from the correct answer than the other statements, which was not true for the psychologists in the Monterde-i-Bort et al. study.
Statements 4, 6, and 7 all related to the Clinical or Practical Significance Fallacy. Statement 7 had the lowest deviation of all eight statements, 0.99, which was again also true for the AERA and AVERA populations. The fact that Statement 6 had a larger mean difference than Statements 4 and 7 may be due to the difference in statement wording. Statement 6 read, “Finding that p < .05 is one indication that the results are important.” While some might argue that this statement is true (a p-value less than 0.05 is one indication, though not the only one, of an important result), we find the statement to be unambiguously incorrect, as the p-value has no bearing at all on whether a result is important.
Statement 5 relates to the Effect Size Fallacy and is also incorrect. It is true that the relative effect size is one determinant of the size of the p-value; this is the reason for the zone of nonsignificance. However, the p-value is not a direct measure of the effect size. For example, a small effect can still produce a small p-value if the sample size is sufficiently large.
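As a quick numerical illustration of this last point (a sketch with illustrative values, not data from the study), a standardized effect of only d = 0.1 produces a very small two-tailed p-value once each group contains a few thousand observations:

```python
from scipy import stats

# Approximate two-sample t statistic for a standardized mean difference d
# with n observations per group: t ~ d * sqrt(n / 2).
d, n = 0.1, 5000                                  # tiny effect, large sample (illustrative values)
t_stat = d * (n / 2) ** 0.5
p_value = 2 * stats.t.sf(t_stat, df=2 * n - 2)    # two-tailed p-value
print(round(t_stat, 2), p_value)                  # t = 5.0, p around 6e-07: tiny effect, very small p
```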
Statement 8 relates to the Replicability Fallacy. It had a deviation of 1.47 from the correct response. The statement is incorrect as the p-value is not a measure of experimental replicability.
Statement | Mean response (Likert scale) | Lower CI limit | Upper CI limit | Correct answer | Deviation from correct answer (Likert scale) | Scale direction
---|---|---|---|---|---|---
1. It is possible to make both Type I and Type II error in a given study. | 3.53 | 3.22 | 3.84 | Incorrect | 1.47 | + |
2. Type I errors may be a concern when the null hypothesis is not rejected. | 3.91 | 3.62 | 4.20 | Incorrect | 1.09 | + |
3. A Type II error is impossible if the results are statistically significant. | 2.26 | 1.97 | 2.55 | Correct | 2.74 | - |
4. If a dozen different researchers investigated the same phenomenon using the same null hypothesis, and none of the studies yielded statistically significant results, this means that the effects being investigated were not noteworthy or important. | 3.90 | 3.70 | 4.10 | Incorrect | 1.10 | + |
5. Smaller p values provide direct evidence that study effects were larger. | 3.54 | 3.29 | 3.79 | Incorrect | 1.46 | + |
6. Finding that p < .05 is one indication that the results are important. | 3.24 | 3.67 | 4.11 | Incorrect | 1.76 | + |
7. Studies with non-significant results can still be very important. | 1.99 | 1.81 | 2.17 | Correct | 0.99 | + |
8. Smaller and smaller values for the calculated p indicate that the results are more likely to be replicated in future research. | 3.53 | 3.28 | 3.78 | Incorrect | 1.47 | + |
Table notes:
1. Deviation from correct answer is calculated assuming that, for the positive scale direction, the normatively correct response to an incorrect statement is 5 (disagree), so the deviation is 5 minus the mean response; this is reversed for the negative scale direction. The same logic applies to correct statements, for which a response of 1 (agree) is normatively correct.
2. + scale direction indicates that 1 = agree and 5 = disagree, - scale direction indicates that 1 = disagree and 5 = agree.
3. The mapping between our statement numbering and that in the survey instrument is as follows (our statement = instrument statement): 1 = 22, 2 = 17, 3 = 9, 4 = 14, 5 = 11, 6 = 6, 7 = 18, 8 = 8.
4. Reference "Uses and abuses of statistical significance tests
and other statistical resources: a comparative study", Hector Monterde-i-Bort, Dolores Frías-Navarro, Juan Pascual-Llobell, European Journal of Psychology of Education, 2010 [link]
In 1995 Ruma Falk and Charles Greenbaum tested 53 psychology students’ NHST knowledge at the Hebrew University of Jerusalem using the five-question instrument below.
The subjects were told that a test of significance was conducted and that the result was significant at a predetermined level. They were then asked what such a result means. The following five options were offered as answers.
1. We proved that H0 is not true.
2. We proved that H1 is true.
3. We showed that H0 is improbable.
4. We showed that H1 is probable.
5. None of the answers 1-4 is correct.
The authors note that although multiple answers were allowed, all students chose only a single answer. The correct answer, Number 5, was chosen by just seven subjects (8%).
There have also been more targeted studies of specific misinterpretations. In a 1979 study Michael Oakes found that psychology researchers substantially overestimated how much larger an effect is implied when the significance level moves from 0.05 to 0.01. Oakes asked 30 academic psychologists to answer the prompt below:
Suppose 250 psychologists and 250 psychiatrists are given a test of psychopathic tendencies. The resultant scores are analyzed by an independent means t-test which reveals that the psychologists are significantly more psychopathic than the psychiatrists at exactly the 0.05 level of significance (two-tailed). If the 500 scores were to be rank ordered, how many of the top 250 (the more psychopathic half) would you guess to have been generated by psychologists?
Presentation order | Mean estimate at the 0.05 level | Mean estimate at the 0.01 level
---|---|---
0.05 level presented first | 163 | 181 |
0.05 level presented second | 163 | 184 |
Table notes:
1. Sample sizes: academic psychologists first presented 0.05 and then asked to revise at 0.01 (n=30); academic psychologists first presented 0.01 and then asked to revise at 0.05 (n=30).
2. The standard deviation for all four groups was around 20, ranging from 18.7 to 21.2.
3. Reproduced from Michael Oakes, Statistical Inference, Epidemiology Resources Inc. [link]; the book was published in 1990, but the study was conducted in 1979.
The participants were then asked to revise their answer assuming the 0.05 level of significance was changed to 0.01. For a separate set of an additional 30 academic psychologists the order of the prompts was reversed, with respondents asked first about the 0.01 level and then about the 0.05 level.
The results of the respondents’ answers are shown in the table above (answers have been rounded). The first row shows responses from the group asked first to consider the 0.05 significance level, while the second row shows responses from the group asked first to consider the 0.01 significance level.
The correct answer is that moving from a significance level of 0.05 to a level of 0.01 implies only about three additional psychologists appear in the top 250. However, the mean answers for both groups show that the psychologists estimated a difference of around 20 additional psychologists in the top 250. Oakes also calculated the median responses, which did not substantively change the results.
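A rough reconstruction of the correct answer is sketched below. It assumes the two groups' scores are normally distributed with equal unit standard deviations and treats the pooled median as sitting halfway between the group means; these simplifying assumptions are ours, not Oakes', but they reproduce the difference of roughly three.

```python
from scipy import stats

n = 250  # per-group sample size from the prompt

def psychologists_in_top_half(alpha):
    """Approximate expected number of psychologists among the top 250 scores,
    given a two-tailed t-test significant at exactly `alpha`."""
    t_crit = stats.t.ppf(1 - alpha / 2, df=2 * n - 2)
    d = t_crit * (2 / n) ** 0.5          # standardized mean difference implied by the t-test
    # The pooled median sits roughly halfway between the two group means, so the
    # share of psychologists above it is approximately Phi(d / 2).
    return n * stats.norm.cdf(d / 2)

at_05 = psychologists_in_top_half(0.05)
at_01 = psychologists_in_top_half(0.01)
print(round(at_05, 1), round(at_01, 1), round(at_01 - at_05, 1))  # roughly 134 and 136.5, a difference of about 3
```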
Oakes also tested academic psychologists’ understanding of p-values and replication by asking 54 of them to predict via intuition (or direct calculation if desired) the probability of replication under three different scenarios.
Suppose you are interested in training subjects on a task and predict an improvement over a previously determined control mean. Suppose the result is Z = 2.33 (p = 0.01, one-tailed), N=40. This experiment is theoretically important and you decide to repeat it with 20 new subjects. What do you think the probability is that these 20 subjects, taken by themselves, will yield a one-tailed significant result at the p < 0.05 level?
Suppose you are interested in training subjects on a task and predict an improvement over a previously determined control mean. Suppose the result is Z = 1.16 (p=0.12, one-tailed), N=20. This experiment is theoretically important and you decide to repeat it with 40 new subjects. What do you think the probability is that these 40 new subjects, taken by themselves, will yield a one-tailed significant result at the p < 0.05 level?
Suppose you are interested in training subjects on a task and predict an improvement over a previously determined control mean. Suppose the result is Z = 1.64 (p=0.05, one-tailed), N=20. This experiment is theoretically important and you decide to repeat it with 40 new subjects. What do you think the probability is that these 40 new subjects, taken by themselves, will yield a one-tailed significant result at the p < 0.01 level?
Scenario | Mean intuition of replicability | True replicability |
---|---|---|
1 | 80% | 50% |
2 | 29% | 50% |
3 | 75% | 50% |
Table notes:
1. Reproduced from Michael Oakes, Statistical Inference, Epidemiology Resources Inc. [link]; the book was published in 1990, but the study was conducted in 1979.
The results of Oakes’ test are presented in the table above. Oakes designed the scenarios in a clever manner so that each of the three scenarios produced the same answer: the true replicability is always 50%. In all three cases the difference between the average intuition about the replicability of the scenarios and the true replicability is substantial. As outlined above, Oakes argues that this difference is due to statistical power being underappreciated by psychologists, who instead rely on mistaken notions of replicability linked to the statistical significance of the p-value.
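The 50% figure can be reproduced with a short calculation, assuming (as Oakes' scenarios are designed) that the true standardized effect equals the observed one and that the replication z statistic is normally distributed around its expectation. The sketch below is our rendering of that logic, not Oakes' original derivation:

```python
from scipy import stats

scenarios = [
    # (observed z, original N, replication N, one-tailed alpha for the replication)
    (2.33, 40, 20, 0.05),
    (1.16, 20, 40, 0.05),
    (1.64, 20, 40, 0.01),
]
for z_obs, n_obs, n_rep, alpha in scenarios:
    effect = z_obs / n_obs ** 0.5            # observed standardized effect per subject
    z_rep = effect * n_rep ** 0.5            # expected z statistic in the replication
    z_crit = stats.norm.ppf(1 - alpha)       # one-tailed critical value for the replication
    prob = stats.norm.sf(z_crit - z_rep)     # probability the replication is significant
    print(round(z_rep, 2), round(z_crit, 2), round(prob, 2))   # each scenario gives roughly 0.5
```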
During the 1991-1992 academic year researcher Augustias Vallecillos asked 436 university students across seven different academic specializations to respond to a simple NHST statement. This survey included 70 University of Granada students in the field of psychology. The results were written up in Vallecillos’ 1994 Spanish-language paper, “Estudio teorico-experimental de errores y concepciones sobre el contraste estadistico de hipotesis en estudiantes universitarios.” The results appeared again in his 2000 English-language article, “Understanding of the Logic of Hypothesis Testing Amongst University Students.” What is presented here is from his 2000 work. Both works appear to be based on Vallecillos’ doctoral thesis written in Spanish. We have not yet been able to obtain a copy of this thesis. Note that using online search it appears Augustias is also sometimes spelled Angustias.
Vallecillos’ statement was a short sentence asking about the ability of the NHST procedure to prove either the null or alternative hypotheses:
A statistical hypotheses test, when properly performed, establishes the truth of one of the two null or alternative hypotheses.
University speciality | Sample size | Correct answer | Incorrect answer |
---|---|---|---|
Psychology | 70 | 17% | 74% |
Table notes:
1. The exact number of respondents coded under each category were as follows: true - 52, false - 12, blank - 6 (8.6%).
2. Reference: "Understanding of the Logic of Hypothesis Testing Amongst University Students", Augustias Vallecillos, Journal für Mathematik-Didaktik, 2000 [link]
Students were asked to answer either “true” or “false” and to explain their answer, although an explanation was not required. The correct answer to the statement is false because NHST only measures the compatibility between observed data and a null hypothesis. It cannot prove the null hypothesis true. In addition, the alternative hypothesis is not explicitly considered in the NHST model nor is the compatibility of the null hypothesis considered relative to the alternative.
The quantitative results are shown above. Correct and incorrect answers do not add up to 100% because some students left the response blank. Vallecillos includes the percentage of blank responses in his presentation, although we have omitted those figures from the table for clarity. It is not stated why blank responses were included in Vallecillos’ table rather than treated as non-responses and omitted from the response calculation; it may be that some subjects did not give a true/false response but did provide a written explanation.
Nearly three quarters of psychology students incorrectly answered the statement as true, the highest proportion of any of the seven university specialties surveyed. In addition, as described below, when providing written explanations justifying their responses few psychology students provided correct reasoning.
Vallecillos coded the written explanation of student answers into one of six categories:
Correct argument (C) - These responses are considered to be completely correct.
Example response: “The hypotheses test is based on inferring properties from the population based on some sample data. The result means that one of the two hypotheses is accepted, but does not mean that it is true.”
Partially correct argument (PC) - These responses are considered to be partially, but not completely correct because “answers analysed include other considerations regarding the way of taking the decision, which are not always correct.”
Example response: “What it does establish is the acceptance or rejection of one of the two hypotheses.”
Mistaken argument that NHST establishes the truth of a hypothesis (M1) - These responses include explanations about why the initial statement proposed by Vallecillos was true.
Example response: “Because before posing the problem, we have to establish which one is the null hypothesis H0 and the alternative H1 and one of the two has to be true.”
Mistaken argument that hypothesis testing establishes the probability of the hypotheses (M2) - This argument is a case of the Inverse Probability Fallacy. As the results of other studies summarized in this article show, the fallacy is quite common among both students and professionals.
Example response: “What is established is the probability, with a margin of error, that one of the hypotheses is true.”
Other mistaken arguments (M3) - This category includes all other mistaken arguments that do not fall into either M1 or M2.
Example response: “What it establishes is the possibility that the answer formed is the correct one.”
Arguments that are difficult to interpret (DI) - These arguments were either not interpretable or did not address the subject’s reasoning behind answering the statement.
Example response: “The statistical hypotheses test is conditioned by the size of the sample and the level of significance.”
Not all of the respondents to the statement gave a written explanation; 59 of the 70 psychology students (84%) did so. Summary results are shown in the table below. Percentages are out of the number who gave written explanations, not out of the number who responded to the original statement. Just 9% of psychology students gave a correct written explanation and only 20% gave a partially correct explanation. At 51%, psychology students had the highest proportion of explanations falling into the M1 category, but the second lowest proportion of explanations falling into the M2 category (business students had the lowest proportion).
Vallecillos notes that when considering the full sample of all 436 students across all seven majors, 9.7% of those who correctly answered the statement also provided a correct written explanation and 31.9% of the students who correctly answered the statement gave a partially correct written explanation. This means that across the full sample about 60% of students who correctly answered the statement did so for incorrect reasons or were not able to clearly articulate their reasoning.
University speciality | Number of subjects who provided written explanations | C | PC | M1 | M2 | M3 | DI |
---|---|---|---|---|---|---|---|
Psychology | 59 | 9% | 20% | 51% | 10% | 7% | 3% |
Table notes:
1. Key: C - "Correct", PC - "Partially correct", M1 - "Mistake 1", M2 - "Mistake 2", M3 - "Mistake 3", DI - "Difficult to interpret". See full explanations in description above the table.
2. Percentages have been rounded for clarity and may not add to 100%.
3. The exact number of respondents coded under each category were as follows: C - 5, PC - 12, M1 - 30, M2 - 6, M3 - 4, DI - 2.
4. Reference: "Understanding of the Logic of Hypothesis Testing Amongst University Students", Augustias Vallecillos, Journal für Mathematik-Didaktik, 2000 [link]
In 1993 Miron Zuckerman, Holley Hodgins, Adam Zuckerman, and Robert Rosenthal tested 508 academic psychologists and 17 psychology students by asking five statistical questions. The subjects came from a wide range of subfields and were recruited from among authors who had published one or more articles in the following journals between 1989 and 1990: Developmental Psychology, Journal of Abnormal Psychology, Journal of Consulting and Clinical Psychology, Journal of Counseling Psychology, Journal of Educational Psychology, Journal of Experimental Psychology: General, Journal of Experimental Psychology: Human Perception and Performance, Journal of Experimental Psychology: Learning, Memory and Cognition, and Journal of Personality and Social Psychology. The authors note that, “The respondents included 17 students, 175 assistant professors, 134 associate professors, 182 full professors, and 43 holders of nonacademic jobs. The earliest year of Ph.D. was 1943 and the median was between 1980 and 1981.”
Four of the questions outlined in Zuckerman, Hodgins, and Zuckerman (1993) are outside the scope of this article, but readers are encouraged to reference the full paper for details. One question in particular asked about effect size estimates within an NHST framework [https://journals.sagepub.com/doi/pdf/10.1111/j.1467-9280.1993.tb00556.x]:
Lisa showed that females are more sensitive to auditory nonverbal cues than are males, t = 2.31, df = 88, p < .05. Karen attempted to replicate the same effect with visual cues but obtained only a t of 1.05, df = 18, p < .15 (the mean difference did favor the females). Karen concluded that visual cues produce smaller sex differences than do auditory cues. Do you agree with Karen’s reasoning?
The correct answer was “No.” Three choices “Yes,” “No,” and “It depends” were available for selection. The reason for an answer of “No” is that by using the provided data one can reverse engineer the effect sizes of the two studies, which leads to an equal effect size for both. In addition, the correct method to compare the sensitivity of visual and auditory cues would be to compare them directly, not indirectly by looking at p-values. The direct calculation would produce an estimated mean difference between auditory and visual cues as well as a confidence interval for that mean. The authors also note that Karen’s study would have been stronger had she also replicated the auditory condition.
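One way to see the equal effect sizes (a sketch using the standard conversion from a t statistic to a correlation-type effect size, which may or may not be the exact conversion the authors had in mind) is shown below:

```python
def t_to_r(t, df):
    """Convert a t statistic to a correlation-type effect size, r = sqrt(t^2 / (t^2 + df))."""
    return (t ** 2 / (t ** 2 + df)) ** 0.5

print(round(t_to_r(2.31, 88), 2))  # Lisa's auditory study: r of about 0.24
print(round(t_to_r(1.05, 18), 2))  # Karen's visual study:  r of about 0.24, essentially the same effect size
```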
This question had the highest accuracy of the five, with 90% of respondents answering correctly. However, the authors note that respondents may have had a bias toward answering “no” and that questions where the correct answer was “yes” had much lower proportions of correct responses. The evidence for this bias is inconclusive, since questions where the answer was “yes” might also have been more difficult.
Lecoutre, Poitevineau, and Lecoutre (2003) surveyed 20 psychologists working at laboratories throughout France, specifically hoping to identify two common NHST misinterpretations (25 statisticians were also surveyed; these results are presented in the statistics section).
The authors constructed a hypothetical scenario in which the efficacy of a drug is being tested by using two groups, one given the drug and one given a placebo. Each group had 15 participants, for a total of 30. The drug was to be considered clinically interesting by experts in the field if the unstandardized difference between the treatment mean and the placebo mean was more than 3. Four different scenarios were constructed crossing statistical significance with effect size (large/small). These are shown in the table below.
Situation three and four are considered by the authors to offer conflicting information since in one case the result is nonsignificant, but the effect size is large and in the other the result is significant, but the effect size is small. These two situations are meant to test two common NHST misinterpretations: Interpreting a nonsignificant result as evidence of no effect — the Nullification Fallacy — and confusing statistical significance with scientific significance, the Clinical or Practical Significance Fallacy.
Only the t-statistic, p-value, and effect size were provided to subjects. The authors suggest that two metrics in particular are useful in determining the drug’s efficacy. The first is the 100(1 – α)% confidence interval. The standard 2σ rule can be used to approximate the 95% confidence interval by adding and subtracting two times the standard error, 2(D/t), from the effect size D. The second metric is the sampling error, which can be calculated by squaring the effect size divided by the t-statistic, (D/t)². The authors note that for larger variances the estimated effect size is not very precise and no conclusion should be made about the drug’s efficacy. Although these two metrics were not shown to respondents they are provided in the table below for completeness.
Situation | t-statistic | P-value | Effect size (D) | Estimated sampling error (D/t)² | Standard error (D/t) | 95% CI | Normative answer
---|---|---|---|---|---|---|---
1. Significant result, large effect size | 3.674 | 0.001 | 6.07 | 2.73 | 1.65 | 2.77 to 9.37 | Clinically interesting effect
2. Nonsignificant result, small effect size | 0.683 | 0.5 | 1.52 | 4.95 | 2.23 | -2.93 to 5.97 | No firm conclusion
3. Significant result, small effect size | 3.674 | 0.001 | 1.52 | 0.17 | 0.41 | 0.69 to 2.5 | No clinically interesting effect
4. Nonsignificant result, large effect size | 0.683 | 0.5 | 6.07 | 78.98 | 8.89 | -11.7 to 23.84 | No firm conclusion
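The derived columns in the table can be reproduced from the t-statistic and effect size alone. The sketch below is our code, using the formulas described above, and checks Situations 1 and 4:

```python
def derived_metrics(t, d):
    """Standard error (D/t), estimated sampling error (D/t)^2, and the
    approximate 95% CI from the 2-sigma rule, D plus or minus 2 * (D/t)."""
    se = d / t
    return se, se ** 2, (d - 2 * se, d + 2 * se)

# Situation 1: significant result, large effect size
print(derived_metrics(3.674, 6.07))   # roughly (1.65, 2.73, (2.77, 9.37))
# Situation 4: nonsignificant result, large effect size
print(derived_metrics(0.683, 6.07))   # roughly (8.89, 78.98, (-11.7, 23.84))
```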
Subjects were asked the following three questions:
1. For each of the four situations, what conclusion would you draw for the efficacy of the drug? Justify your answer.
2. Initially, the experiment was planned with 30 subjects in each group and the results presented here are in fact intermediate results. What would be your prediction of the final results for D then t, for the conclusion about the efficacy of the drug?
3. From an economical viewpoint, it would of course be interesting to stop the experiment with only the first 15 subjects in each group. For which of the four situations would you make the decision to stop the experiment, and conclude?
Only the results for Question 1 are discussed here as they align with commonly documented NHST misinterpretations. For a discussion of Questions 2 and 3 please see Lecoutre, Poitevineau, and Lecoutre (2003). The results for Question 1 are shown below. The three response categories below were coded by the authors based on interviews with subjects (we have made the category names slightly friendlier without changing their meaning). Green indicates the subject's response aligns with the authors' normatively correct response. Red indicates it does not align. The authors note that, “subjects were requested to respond in a spontaneous fashion, without making explicit calculations.”
All subjects gave responses that matched the normatively correct response to Situation 1. When looking at the confidence interval it can be seen that values below the clinically important effect of 3 are still reasonably compatible with the data, meaning that the true impact of the drug on the population may not be clinically meaningful. However, the results are promising enough that more research is likely warranted; that is, there is a “clinically interesting effect,” as the authors noted in their normatively correct response. Clinically interesting is not the same as effective, however, and it is unclear what coding methodology was used by the authors to produce the figures reproduced in the table below.
All but three subjects responded incorrectly to Situation 2, stating that the drug is ineffective. The confidence interval for Situation 2 included both 3, the clinically important value, and 0, which indicates no effect at all. Therefore, the normatively correct response was “Do not know,” since the true impact of the drug on the population could be either clinically important or completely ineffective (or even mildly harmful). The authors note that interview statements by subjects implying the drug is not effective demonstrate the Nullification Fallacy: a nonsignificant result should not be taken as evidence that the drug is ineffective.
Situation 3 was split between correct and incorrect responses, with just 40% responding correctly. Here the confidence interval does not include 0, but also does not include the clinically important effect of 3. Therefore, “No clinically interesting effect” was the normatively correct response selected by the authors. This response appears to have been mapped onto “The drug is ineffective.” The authors note that statements implying the drug is effective are exhibiting the Clinical or Practical Significance Fallacy since Situation 3 had a small p-value (0.001), but clinically unimportant effect size.
Situation 4 had a higher correct response rate of 65%. Here the confidence interval is extremely wide relative to the other situations, ranging from roughly -12 to 24. Again, “No firm conclusion” was normatively correct, and so respondents who were coded into the “Do not know” category are considered to have given the correct response.
Some may object to this study because the authors discouraged any direct calculation; the results are therefore based on statistical intuition. However, the authors note that all subjects “perceived the task as routine for their professional activities,” suggesting statistical intuition is a key part of performing their job successfully. Likewise, no subjects raised concerns about the statistical setup of the four situations, for instance whether requirements such as normality or equality of variances were fulfilled.
Others may object that if a value of 3 was clinically important, perhaps the correct approach would be to conduct a one-sided t-test against that value and obtain a p-value measuring whether the observed data were compatible with the hypothesis that the drug’s effect is at least this large. The authors likely chose their method because it allowed more explicit testing of the fallacies as they typically occur in practice. For instance, the Nullification Fallacy typically arises when the result is nonsignificant (the confidence interval covers zero).
Situation | The drug is effective | The drug is ineffective | Do not know
---|---|---|---
1. Significant result, large effect size | 100% | 0% | 0% |
2. Nonsignificant result, small effect size | 0% | 85% | 15% |
3. Significant result, small effect size | 45% | 40% | 15% |
4. Nonsignificant result, large effect size | 0% | 35% | 65% |
Table notes:
1. Green indicates the subject's response aligns with the authors' normative response. Red indicates it does not align.
In 2015 Anton Kühberger, Astrid Fritz, Eva Lermer, and Thomas Scherndl set out to test the susceptibility of psychology students to both the Effect Size Fallacy and the Nullification Fallacy, the interpretation of a nonsignificant result as evidence of no effect. The authors surveyed 133 students at the University of Salzburg in Austria who were enrolled in a basic statistics course.
Without being shown the results students were asked to estimate the sample and effect sizes of two real psychology studies: a study on the effect of temperature on perceived social distance — what the authors call the “thermometer study” — and a study on the influence of physical movement on cognitive processing, the “locomotion study”. The thermometer study was conducted by Hans IJzerman and Gün R. Semin and published in 2009 in Psychological Science under the title, “The thermometer of social relations. Mapping social proximity on temperature.” The locomotion study was conducted by Severine Koch, Rob W Holland, Maikel Hengstler, and Ad van Knippenberg; it was also published in 2009 in Psychological Science under the title, “Body Locomotion as Regulatory Process: Stepping Backward Enhances Cognitive Control.”
Overviews of the two studies are provided in the task descriptions below. Respondents were each shown both scenarios, but randomized into a version citing either a significant or nonsignificant result. The order in which they saw the scenarios was also randomized.
Both studies started with the following instructions.
Dear participant, thank you for taking part in our survey. You will see descriptions of two scientific research papers and we ask you to indicate your personal guess on several features of these studies (sample size, p-value, …). It is important that you give your personal and intuitive estimates.
You must not be shy in delivering your estimates, even if you are not sure at all. We are aware that this may be a difficult task for you – yet, please try.
The thermometer study description was as follows:
Task 1: The influence of warmth on social distance
In this study researchers investigated the influence of warmth on social distance. The hypothesis was that warmth leads to social closeness. There were two groups to investigate this hypothesis:
Participants of group 1 held a warm drink in their hand before filling in a questionnaire. Participants of group 2 held a cold drink in their hands before they filled in the same questionnaire. Participants were told to think about a known person and had to estimate their felt closeness to this person. They had to indicate closeness on a scale from 1–7, whereas 1 means ‘very close’ and 7 means ‘very distant’.
The closeness ratings of the participants of group 1 were then compared to the closeness ratings of group 2.
Researchers found a statistically significant [non-significant] effect in this study.
While the locomotion study gave this overview:
Task 2: The influence of body movement on information processing speed
Previous studies have shown that body movements can influence cognitive processes. For instance, it has been shown that movements like bending an arm for pulling an object nearer go along with diminished cognitive control. Likewise, participants showed more cognitive control during movements pushing away from the body. In this study, the influence of movement of the complete body (stepping forward vs. stepping backward) on speed of information processing was investigated.
The hypothesis was that stepping back leads to more cognitive control, i.e., more capacity. There were two conditions in this study: In the first condition participants were taking four steps forwards, and in the second condition participants were taking four steps backwards. Directly afterwards they worked on a test capturing attention in which their responses were measured in milliseconds. The mean reaction time of the stepping forward-condition was compared to the mean reaction time of the stepping backward-condition.
Researchers found a statistically significant [non-significant] effect in this study.
Due to data quality issues, results from 126 participants are provided for the thermometer study, whereas the locomotion study has data from 133 participants. Note that this data quality issue is why the 214 total students cited in the “Methods” section of Kühberger et al. is inconsistent with the actual results presented subsequently by the authors. Further note that in the “Results” section of Kühberger et al. the authors mistakenly cite a figure of 127 participants in the thermometer study; all other figures in the paper support the 126 figure presented here (for example, adding the sample sizes in the data tables results in 126 participants). One additional error in the Table 3 crosstabulation was also found and is outlined below in the discussion of the Nullification Fallacy.
Results of student estimates are shown below alongside the results from the actual psychology studies of which students were given descriptions. The data suggest that the students indeed fell prey to the Effect Size fallacy, consistently estimating larger effect sizes for the significant than for the nonsignificant versions, measured by the difference between Group 1 and Group 2 means (Diff. of means) and Cohen’s d.
The authors formally tested this overestimation in the significant and nonsignificant versions using the Mann–Whitney U-test. As a measure of effect size, the Mann–Whitney test uses what is known as a “rank-biserial correlation,” abbreviated ‘r’. To calculate r, every Cohen’s d estimate from the significant version is paired with every estimate from the nonsignificant version. The proportion of pairs favorable to the null hypothesis (pairs in which Cohen’s d was larger for the nonsignificant version than the significant version) is subtracted from the proportion of pairs unfavorable to the null hypothesis, that is, pairs in which Cohen’s d was larger for the significant version than the nonsignificant version. Here the null hypothesis is that the significant version does not produce a larger effect size estimate than the nonsignificant version. The same procedure was done with the Diff. of means estimate.
After conducting this analysis the authors found that for the thermometer study a test of equality between the significant and nonsignificant versions produced z = -5.27 (p < .001) with r = -0.47 for Diff. of means, and for Cohen’s d produced z = -3.88 (p < .001) with r = -0.34. For the locomotion study the equivalent numbers were z = -2.48 (p = .013) with r = -0.21 for Diff. of means and z = -4.16 (p < .001) with r = -0.36 for Cohen’s d. These results suggest that the observed estimates from students are relatively incompatible with the hypothesis that students would estimate equal effect sizes in the significant and nonsignificant versions. The corresponding calculations for the difference in sample size estimates were z = -1.75 (p = .08) with r = -0.15 for the thermometer study and z = -0.90 (p = .37) with r = -.08 for the locomotion study, suggesting that differences in sample size estimates between the significant and nonsignificant versions were perhaps due to simple random variation. These two results are in line with the authors’ hypothesis: while the Effect Size Fallacy would cause a noticeable correlation between significant results and larger effect size estimates, there is no reason to believe student estimates of sample size would vary between the significant and nonsignificant versions.
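For readers who want to reproduce this kind of test, the sketch below runs a Mann–Whitney U-test and computes a rank-biserial correlation directly from the cross-group pairs. The input arrays are made-up illustrative values, not the study data, and the sign of r depends on which group is listed first:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical Cohen's d estimates for the significant and nonsignificant versions
# (illustrative values only, not the study data).
d_significant = rng.normal(0.6, 0.3, size=65)
d_nonsignificant = rng.normal(0.3, 0.3, size=68)

# Mann-Whitney U-test comparing the two sets of estimates.
u_test = stats.mannwhitneyu(d_significant, d_nonsignificant, alternative="two-sided")

# Rank-biserial correlation: proportion of cross-group pairs in which the
# significant-version estimate is larger, minus the proportion in which it is smaller.
pairs = d_significant[:, None] - d_nonsignificant[None, :]
r = (pairs > 0).mean() - (pairs < 0).mean()
print(round(u_test.pvalue, 4), round(r, 2))
```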
A less formal analysis paints a similar picture. For the thermometer study the estimated difference of means was two times larger in the significant version (2.0 vs. 1.0), as was Cohen’s d (0.6 vs. 0.3). The ratio of sample sizes was smaller, 1.5. The pattern was even more pronounced in the locomotion study, with the estimated difference of means five times larger in the significant version (50 vs. 10), while Cohen’s d was 3.5 times larger (0.7 vs. 0.2). The sample size meanwhile was just 1.2 times larger (60 vs. 50).
Value | Thermometer: actual | Thermometer: median estimate, significant version | Thermometer: median estimate, nonsignificant version | Locomotion: actual | Locomotion: median estimate, significant version | Locomotion: median estimate, nonsignificant version
---|---|---|---|---|---|---
Sample size (n) | 33 | 76 | 50 | 38 | 60 | 50 |
Group 1 mean | 5.12 | 2.7 | 3.5 | 712 | 150 | 150 |
Group 2 mean | 4.13 | 4.05 | 4.0 | 676 | 120 | 118 |
Diff. of means | 0.99 | 2.0 | 1.0 | 36 | 50 | 10 |
Group 1 SD | 1.22 | 1.0 | 1.25 | 83 | 10 | 5 |
Group 2 SD | 1.41 | 8.0 | 10.0 | 95 | 8 | 5 |
Cohen's d | 0.78 | 0.6 | 0.3 | 0.79 | 0.7 | 0.2 |
Table notes:
1. Actual estimates are from the original studies, Median estimates are from participants.
2. Sample sizes: Thermometer significant (n=53), Thermometer nonsignificant (n=73), locomotion significant (n=65), locomotion nonsignificant (n=68).
3. Units for the thermometer study were points on a 7-point Likert scale; units for the locomotion study were milliseconds.
4. Reference: "The significance fallacy in inferential statistics", Anton Kühberger, Astrid Fritz, Eva Lermer, & Thomas Scherndl, BMC Research Notes, 2015 [link]
To examine the Nullification Fallacy Cohen’s d was again used. The authors argued that the primary sign of the fallacy would be an estimate of zero difference between means for the nonsignificant version, and a nonzero estimate for the significant version. This reasoning follows from the fallacy itself: that a nonsignificant result is evidence of no difference between group means.
To investigate the presence of the fallacy the authors categorized the students’ Cohen’s d estimates into four ranges:
Large (d > 0.8)
Medium (0.5 < d < 0.8)
Small (0.3 < d < 0.5)
Very small (d < 0.3)
While only 6% of students in the thermometer study and 3% of students in locomotion study estimated effect sizes of exactly zero in the nonsignificant versions of the two experiments, a much greater number estimated what the authors termed a “very small” difference.
The proportion of students choosing a “very small” Cohen’s d was indeed much larger in the nonsignificant version across both studies. A “very small” difference was estimated twice as often for the nonsignificant version of the thermometer study (59% versus 30%) and four times as much in the locomotion study (60% vs. 13%).
These trends are quickly distinguishable in the accompanying chart, which uses data from Table 3 of Kühberger et al. Note that the table in the paper miscalculated percentages in the significant version of the thermometer study; we corrected these percentages when producing the chart.
In sum, these data do not necessarily support the Nullification Fallacy outright. They could be viewed instead as further evidence of the Effect Size Fallacy, that significant results are in general associated with larger effects than nonsignificant results. However, this is somewhat dependent on subject-level data: if subjects assumed equal sample sizes in the treatment and control groups, then smaller p-values would indeed be associated with larger effect sizes.
One might argue these results are also indicative of the Clinical or Practical Significance Fallacy, which describes the phenomenon of equating statistically nonsignificant results with practically unimportant results. However, to clearly delineate the extent of each of these fallacies among this population more research would be needed.
NHST meta-analysis
The 14 journal articles reviewed in the “Surveys of NHST knowledge” section used different methodologies to assess common NHST misinterpretations and overall NHST understanding of psychology students and researchers. Several of these studies cannot be coherently combined. However, others can be. In particular, a subset of 10 studies were identified that used survey instruments eliciting true or false responses, thereby consistently measuring rates of correct responses. These 10 studies represent nine different journal articles; one article surveyed respondents in two different countries and is counted twice for this reason.
Two primary quantitative assessments of NHST knowledge are available across these studies:
Measure 1: The percentage of respondents demonstrating at least one NHST misunderstanding (available in all 10 studies)
Measure 2: The average number of incorrect responses to the survey instrument (available in 5 studies)
Measure 1 was combined across studies using a simple weighted average. Although the length and wording of the survey instruments varied, we ignore these factors and weight each survey equally regardless of the number of statements included. One could come up with various weighting schemes to account for survey length, but we decided to keep things simple. We have made the underlying meta-analysis data available so that others may recalculate the results using their own methodologies.
Measure 1 should be considered a lower bound. This is because for some survey instruments the percentage of incorrect responses across the entire survey was not available. For example, the set of studies by Badenes-Ribera used a 10-question instrument (although the results from all 10 questions were not always reported). However, incorrect response rates were broken down by fallacy. For that reason response rates from the 5-question Inverse Probability Fallacy category were used as a proxy for the total misinterpretation rate. If the percentage of respondents demonstrating at least one NHST misinterpretation were calculated across the entire instrument it would no doubt be higher. The longest survey instrument used in this meta-analysis was therefore six statements in length. Methodological challenges notwithstanding, it seems reasonable to conclude that regardless of education level a clear understanding of NHST should enable one to correctly answer six NHST statements based on a simple hypothetical setup. Possible methodological challenges are discussed momentarily.
For Measure 2 the average number of incorrect responses was divided by the number of questions in the survey instrument to obtain a proportion of incorrect responses. A simple weighted average was then applied.
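A minimal sketch of the combination step is below. It assumes the simple weighted average is weighted by study sample size (which is consistent with the totals reported later in this section); the inputs are three of the study-level figures that appear in the country breakdown below (U.S., Israel, Chile), used purely for illustration:

```python
def weighted_average(rates, sample_sizes):
    """Sample-size-weighted average of study-level rates."""
    total_n = sum(sample_sizes)
    return sum(rate * n for rate, n in zip(rates, sample_sizes)) / total_n

# Share of respondents with at least one misinterpretation, and sample size,
# for three of the studies (U.S., Israel, Chile).
rates = [0.96, 0.92, 0.67]
sample_sizes = [70, 53, 30]
print(round(weighted_average(rates, sample_sizes), 2))  # about 0.89
```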
There are limitations to our methodology. First, the survey instruments themselves can be criticized in various ways, including incorrect hypothetical setups, confusing statement language, and debates over the normatively correct response. The specific criticisms of each study were outlined in the “Surveys of NHST knowledge” section. Second, survey instruments were administered across multiple countries and it is unclear what effects translation may have had on subject responses. In total, studies were conducted in seven different countries. Third, the experience of students and researchers varied across studies. Four broad education categories were used to span the full breadth of respondent experience: undergraduates, master’s students, PhD students, and post-PhD researchers. Details on the specific experience level of each study included in the meta-analysis are provided at the end of this section. Fourth, all studies used convenience sampling, thereby limiting the external validity of any meta-analysis. However, other forms of evidence corroborate the general problem of NHST misinterpretation and misuse. This evidence includes incorrect textbook definitions, audits of NHST usage in published academic articles, and calls for reform from NHST practitioners that have been published in academic journals across nearly every academic discipline. These three types of evidence will be reviewed in full in an upcoming article.
Despite these challenges, Measures 1 and 2 appear to be fair measures of baseline NHST knowledge.
Turning to the results of the meta-analysis, across the 10 studies a total of 1,569 students and researchers were surveyed between 1986 and 2020. A weighted average across all 10 studies resulted in 90% of subjects demonstrating at least one NHST misinterpretation. Note that some studies surveyed multiple education levels. See the Excel in the Additional Resources section for the detailed calculations across studies.
A breakdown by education level is shown below. Post-PhD researchers have the lowest rates of NHST misinterpretations, but still have a misinterpretation rate of 87%. Master’s students had the highest rate at 95%.
Education | Number of studies | Sample size | Percentage with at least one NHST misunderstanding (weighted average) |
---|---|---|---|
Undergraduates | 5 | 403 | 92% |
Master's | 2 | 284 | 95% |
PhD | 2 | 94 | 90% |
Post-PhD | 7 | 788 | 87% |
Total | 10 | 1,569 | 90% |
Table notes:
1. Calculated from 10 studies on NHST misinterpretations. See Excel in the Additional Resources section for details.
2. Some studies include surveys of multiple education levels.
A breakdown by country is shown below. Most countries have only a single study, although China has two and Spain three. Chile and Italy have by far the lowest rates of misinterpretation; however, it is unclear whether there is an underlying causal mechanism or whether this is random sampling variation.
Country | Number of studies | Sample size | Percentage with at least one NHST misunderstanding (weighted average) |
---|---|---|---|
U.S. | 1 | 70 | 96% |
Israel | 1 | 53 | 92% |
Germany | 1 | 113 | 91% |
Spain | 3 | 551 | 92% |
China | 2 | 618 | 94% |
Chile | 1 | 30 | 67% |
Italy | 1 | 134 | 60% |
Total | 10 | 1,569 | 90% |
Table notes:
1. Calculated from 10 studies on NHST misinterpretations. See Excel in the Additional Resources section for details.
Misinterpretation rates do not appear to have gone down over time. This can be seen in the chart below. This chart depicts the 10 studies included in the Measure 1 meta-analysis with the publishing year on the x-axis. The y-axis shows the percentage of respondents endorsing at least one false statement, that is having at least one NHST misinterpretation. The size of the bubble represents the sample size; the sample size key can be seen in the bottom left. Between 2015 and 2020 four separate studies show misinterpretation rates equal to or near Oakes’ original work in 1986; all have misinterpretation rates above 90%.
The results of Measure 2 are shown below. Only half of the total 10 studies used for Measure 1 were available for Measure 2.
The average respondent missed nearly half the questions on their survey instrument. Post-PhD researchers fared best, missing on average two out of five questions. All other education levels were clustered around the 50% mark.
Education | Number of studies | Sample size | Average percentage of questions that respondents answered incorrectly |
---|---|---|---|
Undergraduates | 4 | 287 | 52% |
Master's | 2 | 284 | 48% |
PhD | 2 | 94 | 51% |
Post-PhD | 4 | 206 | 40% |
Total | 5 | 871 | 48% |
Table notes:
1. Calculated from 5 studies on NHST misinterpretations. See Excel in the Additional Resources section for details.
2. Some studies include surveys of multiple education levels.
Measure 2 is broken down by country below. Only China had more than a single study included in the Measure 2 analysis. Indeed, China drove much of the Measure 2 calculation due to its large sample size. Spain had the highest average percentage answered incorrectly, but note that this was a single-question instrument and the sample size was relatively small compared to the German and Chinese studies.
Country | Number of studies | Sample size | Average percentage of questions that respondents answered incorrectly |
---|---|---|---|
U.S. | 1 | 70 | 42% |
Germany | 1 | 113 | 36% |
Spain | 1 | 70 | 74% |
China | 2 | 618 | 48% |
Total | 5 | 871 | 48% |
Table notes:
1. Calculated from 5 studies on NHST misinterpretations. See Excel in the Additional Resources section for details.
Looking at the results of individual studies paints a picture of mass misinterpretation of NHST at all education levels. Using simplistic meta-analysis methods to combine data across studies confirms this, regardless of whether one uses the average number of misinterpretations (Measure 2) or the percentage of respondents with at least one incorrect response (Measure 1).
Details of the experience level of each population are shown below.
Authors | Year | Country | Instrument length | Population details |
---|---|---|---|---|
Oakes | 1986 | U.S. | 6 | Book title: Statistical Inference [link] The subjects were academic psychologists. Oakes notes they were, "university lecturers, research fellows, or postgraduate students with at least two years' research experience." |
Falk & Greenbaum | 1995 | Israel | 5 | Article title: "Significance tests die hard: The amazing persistence of a probabilistic misconception" [link] The authors note that the psychology students attended Hebrew University of Jerusalem. The authors note that, "these students have had two courses in statistics and probability plus a course of Experimental Psychology in which they read Bakan's (1996) paper." Bakan (1996) warns readers against the Inverse Probability Fallacy. |
Vallecillos | 2000 | Spain | 1 | Article title: "Understanding of the Logic of Hypothesis Testing Amongst University Students" [link] The author notes that psychology students were selected because they have obtained, "...prior humanistic grounding during their secondary teaching..." |
Haller & Krauss | 2002 | Germany | 6 | Article title: "Misinterpretations of Significance: A Problem Students Share with Their Teachers?" [link] Subjects were surveyed in 6 different German universities. There were three groups. First, methodology instructors, which consisted of "university teachers who taught psychological methods including statistics and NHST to psychology students." Note that in Germany methodology instructors can consist of "scientific staff (including professors who work in the area of methodology and statistics), and some are advanced students who teach statistics to beginners (so called 'tutors')." The second group were scientific psychologists, which consisted of "professors and other scientific staff who are not involved in the teaching of statistics." The third group were psychology students. |
Badenes-Ribera et al. | 2015 | Spain | 5 | Article title: "Misinterpretations of p-values in psychology university students" The subject consisted of psychology students from the Universitat de Valencia who have already studied statistics. The mean age of the participants was 20.05 years (SD = 2.74). Men accounted for 20% and women 80%. |
Badenes-Ribera et al. | 2015 | Spain | 5 | Article title: "Interpretation of the p value: A national survey study in academic psychologists from Spain" [link] Academic psychologists from Spanish public universities. The mean number of years teaching and/or conducting research was 14.16 (SD = 9.39). |
Badenes-Ribera et al. | 2016 | Italy | 5 | Article title: "Misinterpretations Of P Values In Psychology University Students (Spanish language)" [link] Subjects were academic psychologists. The average years of teaching or conducting research was 13.28 years (SD = 10.52), 54% were women and 46% were men, 86% were from public universities, the remaining 14% were from private universities. |
Badenes-Ribera et al. | 2016 | Chile | 5 | Article title: "Misinterpretations Of P Values In Psychology University Students (Spanish language)" [link] Subjects were academic psychologists. The average years of teaching or conducting research was 15.53 years (SD = 8.69). Subjects were evenly split between women and men, 57% were from private universities, while 43% were from public universities. |
Lyu et al. | 2018 | China | 6 | Article title: "P-Value, Confidence Intervals, and Statistical Inference: A New Dataset of Misinterpretation" [link] The online survey "recruited participants through social media (include WeChat, Weibo, blogs etc.), without any monetary or other material payment...The paper-pen survey data were collected during the registration day of the 18th National Academic Congress of Psychology, Tianjin, China..." This resulted in four populations: psychology undergraduate students, psychology master's students, psychology PhD students, psychologists with a PhD. |
Lyu et al. | 2020 | Mainland China (83%), Overseas (17%) | 4 | Article title: "Beyond psychology: prevalence of p value and confidence interval misinterpretation across different fields" [link] Recruitment was done by placing advertisements on the following WeChat Public Accounts: The Intellectuals, Guoke Scientists, Capital for Statistics, Research Circle, 52brain, and Quantitative Sociology. This resulted in four populations: psychology undergraduate students, psychology master's students, psychology PhD students, psychologists with a PhD. All respondents were awarded their degree in China. |
Questionnaires of Confidence Interval Knowledge
Because of the difficulty in properly interpreting NHST, confidence intervals have been proposed as an alternative [citation]. Confidence intervals also have the benefit of conveying the precision of an effect size estimate.
In 2014 Rink Hoekstra, Richard Morey, Jeffrey Rouder, and Eric-Jan Wagenmakers tested confidence interval misinterpretations among 120 Dutch researchers, 442 Dutch first-year university students, and 34 master’s students, all in the field of psychology. The authors used a six-question instrument adapted from Gigerenzer (2004), which was in turn based on Oakes.
The authors note that the undergraduate students "were first-year psychology students attending an introductory statistics class at the University of Amsterdam." None of the students had previously taken a course in inferential statistics. The master's students "were completing a degree in psychology at the University of Amsterdam and, as such, had received a substantial amount of education on statistical inference in the previous 3 years." The researchers came from three universities: Groningen, Amsterdam, and Tilburg.
The instrument presented six statements and asked participants to mark each as true or false (all six were false). The six statements are shown below.
Professor Bumbledorf conducts an experiment, analyzes the data, and reports: “The 95% confidence interval for the mean ranges from 0.1 to 0.4!”
Please mark each of the statements below as “true” or “false”. False means that the statement does not follow logically from Bumbledorf’s result. Also note that all, several, or none of the statements may be correct:
1. The probability that the true mean is greater than 0 is at least 95%.
2. The probability that the true mean equals 0 is smaller than 5%.
3. The “null hypothesis” that the true mean equals 0 is likely to be incorrect.
4. There is a 95% probability that the true mean lies between 0.1 and 0.4.
5. We can be 95% confident that the true mean lies between 0.1 and 0.4.
6. If we were to repeat the experiment over and over, then 95% of the time the true mean falls between 0.1 and 0.4.
Statements 1, 2, 3, and 4 are incorrect because confidence intervals are not probability statements about either parameters or hypotheses. Statements 5 and 6 are incorrect because the “confidence” in confidence intervals is a result of the properties of the procedure used to calculate them, not of any particular interval: if we repeat the population sampling, data collection, and subsequent confidence interval calculation, approximately 95% of the resulting intervals will contain the true population mean. Statement 6 is also incorrect because it implies the true mean might vary, falling between 0.1 and 0.4 in 95% of repetitions but within some other interval in the remaining 5%. However, the true population mean is a fixed number and does not vary.
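The procedure-level meaning of “confidence” can be illustrated with a short simulation: the true mean stays fixed while the intervals vary from sample to sample, and roughly 95% of the intervals end up containing it. The true mean, sample size, and standard deviation below are arbitrary choices made purely for illustration.

```r
# Long-run coverage of 95% confidence intervals: the intervals vary, not the true mean.
set.seed(1)
true_mean <- 0.25   # fixed population mean (arbitrary)
n         <- 30     # sample size per experiment (arbitrary)
reps      <- 10000

covered <- replicate(reps, {
  x  <- rnorm(n, mean = true_mean, sd = 1)
  ci <- t.test(x)$conf.int               # one realized 95% confidence interval
  ci[1] <= true_mean && true_mean <= ci[2]
})

mean(covered)   # ~0.95: about 95% of intervals contain the fixed true mean
```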
Overall 98% of first-year students, 100% of master’s students, and 97% of researchers had at least one misunderstanding. A breakdown of the percentage of each group incorrectly asserting each of the six statements is shown below.
The least common misconception was the first statement, that the probability of the true mean being larger than 0 is at least 95%. The most common misconception was that the null hypothesis is likely to be false; this is similar to the NHST statement that a small p-value disproves the null hypothesis.
Statement summaries | First year undergraduates | Master's | Researchers |
---|---|---|---|
1. The probability that the true mean is greater than 0 is at least 95%. | 51% | 32% | 38% |
2. The probability that the true mean equals 0 is smaller than 5%. | 55% | 44% | 47% |
3. The “null hypothesis” that the true mean equals 0 is likely to be incorrect. | 73% | 68% | 86% |
4. There is a 95% probability that the true mean lies between 0.1 and 0.4. | 58% | 50% | 59% |
5. We can be 95% confident that the true mean lies between 0.1 and 0.4. | 49% | 50% | 55% |
6. If we were to repeat the experiment over and over, then 95% of the time the true mean falls between 0.1 and 0.4. | 66% | 79% | 58% |
Table notes:
1. Percentages are not meant to add to 100%.
2. Sample sizes: first-year students (n=442), master's students (n=34), researchers (n=120).
3. Reference: "Robust misinterpretation of confidence intervals", Rink Hoekstra, Richard Morey, Jeffrey Rouder, and Eric-Jan Wagenmakers, Psychonomic Bulletin & Review, 2014 [link]
The distribution of the number of statements endorsed by each group is shown below. The mean number of items endorsed for first-year students was 3.51, for master’s students was 3.24, and for researchers was 3.45.
Lyu et al. (2018) translated the 6-question confidence interval instrument from Hoekstra et al. (2014) into Chinese and surveyed 347 Chinese students and researchers. Recall that Lyu et al. (2018) also surveyed the same population for NHST misinterpretations. Undergraduate and master’s students had roughly similar rates of misinterpretations for NHST and confidence intervals. PhD, postdocs, and experienced professors had generally fewer NHST misinterpretations than CI misinterpretations.
Education | Average number of misinterpretations, Lyu et al. (2018) | Average number of misinterpretations, Hoekstra et al. (2014) |
---|---|---|
Undergraduate | 3.66 | 3.51 |
Master's | 2.89 | 3.24 |
PhD student | 3.51 | Not surveyed |
Post-doc/assistant prof. | 3.13 | 3.45 (all researchers) |
Teaching/research for years | 3.50 | 3.45 (all researchers) |
Table notes:
1. Reference: "Robust misinterpretation of confidence intervals", Rink Hoekstra, Richard Morey, Jeffrey Rouder, and Eric-Jan Wagenmakers, Psychonomic Bulletin & Review, 2014 [link]
2. Data calculated from "P-Value, Confidence Intervals, and Statistical Inference: A New Dataset of Misinterpretation", Ziyang Lyu, Kaiping Peng, and Chuan-Peng Hu, Frontiers in Psychology, 2018, paper: [link], data: [link]
Compared to Hoekstra et al. (2014), Chinese undergraduate students had a higher proportion of misunderstanding for Statements 1 through 5 (all but Statement 6); master’s students had a higher proportion for Statements 1 and 2. The comparison of researchers is more complex and depends on whether the researchers in Hoekstra et al. (2014) are compared against Chinese assistant professors or experienced professors.
A comparison of the average number of misinterpretations between Lyu et al. (2018) and Hoekstra et al. (2014) is shown at right. The averages for undergraduates were roughly comparable, although subjects in Lyu et al. (2018) did have a slightly higher rate (3.66 versus 3.51). Master’s students showed a larger difference, with the Chinese subjects endorsing roughly a third of a statement fewer on average (2.89 versus 3.24). Hoekstra et al. (2014) did not break out researchers by experience, whereas Lyu et al. (2018) segmented researchers into less and more experienced. The average number of misinterpretations for the “Teaching/research for years” category from Lyu et al. (2018) was quite close to the general researcher figure from Hoekstra et al., 3.50 and 3.45, respectively. However, some of the researchers from Hoekstra et al. may have been less experienced, and for that population Lyu et al. (2018) reported a lower average number of misinterpretations, 3.13.
The breakdown of incorrect responses by question and education level is shown below. Data in the table below was calculated by us using open data from the authors. The education level with the highest proportion of misinterpretation for each question is highlighted in red. Undergraduates fared worst: they had the highest proportion of misinterpretations for four of the six questions. This aligns with the table above, which shows they also had the highest average number of misinterpretations.
Statement summaries | Undergraduate | Master's | PhD | Postdoc and assistant prof. | Experienced professors |
---|---|---|---|---|---|
1. The probability that the true mean is greater than 0 is at least 95%. | 54% | 49% | 43% | 48% | 50% |
2. The probability that the true mean equals 0 is smaller than 5%. | 63% | 51% | 62% | 39% | 50% |
3. The “null hypothesis” that the true mean equals 0 is likely to be incorrect. | 79% | 49% | 72% | 61% | 62% |
4. There is a 95% probability that the true mean lies between 0.1 and 0.4. | 65% | 43% | 72% | 52% | 75% |
5. We can be 95% confident that the true mean lies between 0.1 and 0.4. | 71% | 40% | 57% | 61% | 62% |
6. If we were to repeat the experiment over and over, then 95% of the time the true mean falls between 0.1 and 0.4. | 34% | 59% | 45% | 52% | 50% |
Table notes:
1. The education level with the highest proportion of misinterpretation by question is highlighted in red.
2. Percentages are not meant to add to 100%.
3. Sample sizes: Undergraduates (n=106), Master's (n=162), PhD (n=47), Postdoc or assistant prof (n=23), Experienced professor (n=8).
4. Data calculated from "P-Value, Confidence Intervals, and Statistical Inference: A New Dataset of Misinterpretation", Ziyang Lyu, Kaiping Peng, and Chuan-Peng Hu, Frontiers in Psychology, 2018, paper: [link], data: [link]
Lyu et al. (2018) also looked at misinterpretation by subfield. Data in the table below was calculated by us using open data from the authors. The percentage of each subfield incorrectly marking at least one statement as “true” is shown below. All subfields besides clinical and medical psychology had misinterpretation rates of at least 95%. Multiple subfields had all subjects mark at least one statement incorrectly. These misinterpretation rates are slightly higher than those found for the NHST statements in Lyu et al. (2018). The average number of misinterpretations was also high. Cognitive neuroscientists had the lowest average, but still endorsed 2.68 statements incorrectly on average. Psychometric and psycho-statistics researchers had the highest average, a strikingly high 4.31.
Psychological sub-field | Sample size | Percentage with at least one CI misunderstanding | Average number of misinterpretations |
---|---|---|---|
Fundamental research & cognitive psychology | 74 | 96% | 3.18 |
Cognitive neuroscience | 121 | 95% | 2.68 |
Social & legal psychology | 51 | 100% | 3.41 |
Clinical & medical psychology | 19 | 84% | 3.58 |
Developmental & educational psychology | 30 | 100% | 3.77 |
Psychometric and psycho-statistics | 16 | 100% | 4.31 |
Neuroscience/neuroimaging | 9 | 100% | 3.67 |
Others | 17 | 100% | 4.0 |
Table notes:
Data calculated from "P-Value, Confidence Intervals, and Statistical Inference: A New Dataset of Misinterpretation", Ziyang Lyu, Kaiping Peng, and Chuan-Peng Hu, Frontiers in Psychology, 2018, paper: [link], data: [link]
Lyu et al. (2020) used a modified version of their four-question NHST instrument adapted to test confidence interval knowledge. There were two versions, one presenting a statistically significant result and one presenting a nonsignificant result. The hypothetical experimental situation and four statements are shown below. The significant version of each statement read as follows, with the nonsignificant version wording appearing in parentheses and substituting for the wording directly preceding it.
The 95% (bilateral) confidence interval of the mean difference between the experimental group and the control group ranges from .1 to .4 (–.1 to .4).
1. A 95% probability exists that the true mean lies between .1 (–.1) and .4.
2. If we were to repeat the experiment over and over, then 95% of the time the true mean falls between .1 (–.1) and .4.
3. If the null hypothesis is that no difference exists between the mean of experimental group and control group, then the experiment has disproved (proved) the null hypothesis.
4. The null hypothesis is that no difference exists between the means of the experimental and the control groups. If you decide to (not to) reject the null hypothesis, then the probability that you are making the wrong decision is 5%.
For questions one and three there were more misinterpretations resulting from the nonsignificant version, whereas for questions two and four more misinterpretations resulted from the significant version.
At 92%, psychology researchers had the fifth highest proportion of respondents with at least one confidence interval misinterpretation out of the eight professions surveyed by Lyu et al. (2020). Psychology researchers ranked fourth out of eight in terms of the average number of confidence interval misinterpretations, 1.82 out of a possible four. This was lower than the 1.94 average misinterpretations on the NHST instrument. The nonsignificant version elicited a higher average number of misinterpretations than the significant version, 1.84 compared to 1.79.
The percentage of incorrect responses to each question broken down by the significant and nonsignificant versions is shown below. Statement 1 had one of the highest incorrect response rates for both the significant and nonsignificant versions among the eight fields surveyed, while Statement 4 had the highest incorrect response rate for the significant version across any field. Meanwhile, the significant version of Statement 3 had the lowest rate across any field. There was fairly wide variation between the significant and nonsignificant versions for Statement 3 and Statement 4, 15 percentage points and 17 percentage points, respectively.
Statement summaries | Significant version | Nonsignificant version |
---|---|---|
1. A 95% probability exists that the true mean lies between .1 (–.1) and .4. | 66% | 69% |
2. If we were to repeat the experiment over and over, then 95% of the time the true mean falls between 0.1 (-0.1) and 0.4. | 54% | 48% |
3. If the null hypothesis is that there is no difference between the mean of experimental group and control group, the experiment has disproved (proved) the null hypothesis. | 31% | 46% |
4. The null hypothesis is that there is no difference between the mean of experimental group and control group. If you decide to (not to) reject the null hypothesis, the probability that you are making the wrong decision is 5%. | 70% | 53% |
Table notes:
1. Percentages are not meant to add up to 100%.
2. Sample sizes: significant version (n=125), nonsignificant version (n=147).
3. Reference: "Beyond psychology: prevalence of p value
and confidence interval misinterpretation
across different fields", Xiao-Kang Lyu, Yuepei Xu2, Xiao-Fan Zhao1, Xi-Nian Zuo, and Chuan-Peng Hu, Frontiers in Psychology, 2020, paper: [link], data: [link]
A breakdown of confidence interval misunderstandings by education is shown below. All education levels had high misinterpretation rates of confidence intervals, ranging from 85% to 93%. PhD students fared best with the lowest rate of at least one confidence interval misinterpretation (85%), but the group had the highest average number of misinterpretations, 2.1.
Education | Sample size | Percentage with at least one CI misunderstanding | Average number of CI misunderstandings (out of four) |
---|---|---|---|
Undergraduates | 67 | 93% | 1.8 |
Master's | 122 | 93% | 1.8 |
PhD | 47 | 85% | 2.1 |
Post-PhD | 36 | 92% | 1.7 |
Table notes:
1. Reference: "Beyond psychology: prevalence of p value
and confidence interval misinterpretation
across different fields", Xiao-Kang Lyu, Yuepei Xu2, Xiao-Fan Zhao1, Xi-Nian Zuo, and Chuan-Peng Hu, Frontiers in Psychology, 2020, paper: [link], data: [link]
Confidence Interval meta-analysis
The three studies reviewed in the “Questionnaires of Confidence Interval Knowledge” section used similar methodologies to assess common confidence interval misinterpretations and overall understanding among psychology students and researchers.
Two primary quantitative assessments of confidence interval knowledge are available across these studies:
Measure 1: The percentage of respondents demonstrating at least one confidence interval misunderstanding
Measure 2: The average number of incorrect responses to the survey instrument
Both measures are available in all three studies. Measure 1 was combined across studies using a simple weighted average. For Measure 2 the average number of incorrect responses was divided by the number of questions in the survey instrument to obtain a proportion of incorrect responses; a simple weighted average was then applied.
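As a minimal sketch of the combination step, the calculation looks like the R snippet below. The per-study figures used here are placeholders for illustration only; the actual numbers are in the Excel workbook in the Additional Resources section.

```r
# Illustrative combination of the two meta-analytic measures across studies.
studies <- data.frame(
  n             = c(600, 350, 270),    # sample sizes (placeholders)
  at_least_one  = c(0.97, 0.96, 0.92), # Measure 1: proportion with >= 1 misinterpretation
  avg_incorrect = c(3.4, 3.3, 1.8),    # mean number of statements endorsed (placeholders)
  n_items       = c(6, 6, 4)           # length of each survey instrument
)

# Measure 1: sample-size-weighted average of the proportions
weighted.mean(studies$at_least_one, studies$n)

# Measure 2: convert to a proportion of items answered incorrectly, then weight
weighted.mean(studies$avg_incorrect / studies$n_items, studies$n)
```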
See the “NHST Meta-Analysis” section for a discussion of the methodological critiques of these two measures.
Across three studies focusing on confidence interval misinterpretations a total of 1,212 students and researchers were surveyed between 2014 and 2019. A weighted average across all three studies and all populations resulted in a misinterpretation rate of 96% of subjects demonstrating at least one confidence interval misinterpretation. See the Excel in the Additional Resources section for the detailed calculations across studies.
A breakdown of confidence interval misinterpretation rates by education level is shown below. All education levels have similar misinterpretation rates. Looking across studies, weighted-average confidence interval misinterpretation rates are equal to or higher than weighted-average NHST misinterpretation rates at every education level. However, as we have argued elsewhere, we still believe confidence intervals are preferable to NHST, in part because we believe confidence interval misinterpretation is less harmful than NHST misinterpretation in many contexts.
Consider the following two misinterpretations, one related to confidence intervals and the other to NHST:
CI: There is a 95% probability that the true mean lies between 0.1 and 0.4 (from Hoekstra et al., 2014)
NHST: You have found the probability of the null hypothesis being true [when you calculate a p-value] (from Oakes, 1986).
While both statements are incorrect, the confidence interval misinterpretation has the benefit of retaining the precision of the point estimate. Wrongly concluding that a confidence interval is a probability statement still keeps the focus on the range of parameter estimates reasonably compatible with the observed data, and in doing so forces an analyst to acknowledge values both higher and lower than the point estimate when considering a particular course of action.
Confidence intervals can fall prey to dichotomized thinking, however, which results in a similar kind of uselessness as NHST statistical significance. Indeed, in two studies using the six-question confidence interval instrument from Hoekstra et al. (2014), question three (“The ‘null hypothesis’ that the true mean equals 0 is likely to be incorrect”) resulted in the highest rates of incorrect endorsement. Nonetheless, with some reservations, our view is that educating analysts and researchers that confidence intervals should not be evaluated based solely on whether the interval covers zero will result in “good enough” usage of confidence intervals, even if some other misinterpretations persist.
Others might reasonably disagree with our assessment and we are sympathetic to those views. Likewise, whether confidence interval misinterpretations are somehow “less bad” than NHST misinterpretations depends on the context. For instance, in some medical settings both misinterpretations might be viewed as equally harmful for patient wellbeing. The primary focus of The Research is on usage in a business context, and personal experience suggests that in that setting confidence intervals are preferable. Others have argued that Bayesian techniques are preferable to both approaches; such techniques are outside the scope of the current article and we therefore do not comment on them here.
Education | Number of studies | Sample size | Percentage with at least one confidence interval misunderstanding |
---|---|---|---|
Undergraduates | 3 | 615 | 97% |
Master's | 3 | 318 | 95% |
PhD | 3 | 212 | 95% |
Post-PhD | 2 | 67 | 96% |
Total | 3 | 1,212 | 96% |
Table notes:
1. Calculated from three studies on confidence interval misinterpretations. See Excel in the Additional Resources section for details.
Only two countries are represented in this analysis: the Netherlands and China. Both had high rates of misinterpretation; the Netherlands had 98% and China 94%.
Country | Number of studies | Sample size | Percentage with at least one confidence interval misunderstanding |
---|---|---|---|
China | 2 | 618 | 94% |
The Netherlands | 1 | 594 | 98% |
Total | 3 | 1,212 | 96% |
Table notes:
1. Calculated from 3 studies on confidence interval misinterpretations. See Excel in the Additional Resources section for details.
Turning to Measure 2, the average percentage of questions answered incorrectly was high across all education levels. Undergraduates, PhD students, and post-PhD researchers all averaged incorrect response rates of more than half the statements on the survey instrument. For all education levels except master’s students these averages were higher than for NHST survey instruments.
Education | Number of studies | Sample size | Average percentage of questions that respondents answered incorrectly |
---|---|---|---|
Undergraduates | 3 | 615 | 57% |
Master's | 3 | 318 | 47% |
PhD | 2 | 94 | 56% |
Post-PhD | 3 | 185 | 54% |
Total | 3 | 1,212 | 54% |
Table notes:
1. Calculated from 3 studies on confidence interval misinterpretations. See Excel in the Additional Resources section for details.
2. Some studies include surveys of multiple education levels.
Breaking things down by country, Chinese subjects averaged 50% while Dutch subjects averaged a higher 58%.
Country | Number of studies | Sample size | Average percentage of questions that respondents answered incorrectly |
---|---|---|---|
China | 2 | 618 | 50% |
The Netherlands | 1 | 594 | 58% |
Total | 3 | 1,212 | 54% |
Table notes:
1. Calculated from 3 studies on confidence interval misinterpretations. See Excel in the Additional Resources section for details.
Details of the experience level of each population are shown below.
Authors | Year | Country | Instrument length | Population details |
---|---|---|---|---|
Hoekstra et al. | 2014 | The Netherlands | 6 | Article title: "Robust misinterpretation of confidence intervals" [link] This study included bachelor's and master's students as well as researchers. The authors note that the bachelor students "were first-year psychology students attending an introductory statistics class at the University of Amsterdam." None of the students had previously taken a course in inferential statistics. The master's students "were completing a degree in psychology at the University of Amsterdam and, as such, had received a substantial amount of education on statistical inference in the previous 3 years." The researchers came from three universities: Groningen, Amsterdam, and Tilburg. |
Lyu et al. | 2018 | China | 6 | Article title: "P-Value, Confidence Intervals, and Statistical Inference: A New Dataset of Misinterpretation" [link] The online survey "recruited participants through social media (include WeChat, Weibo, blogs etc.), without any monetary or other material payment...The paper-pen survey data were collected during the registration day of the 18th National Academic Congress of Psychology, Tianjin, China..." This resulted in four populations: psychology undergraduate students, psychology master's students, psychology PhD students, psychologists with a PhD. |
Lyu et al. | 2020 | Mainland China (83%), Overseas (17%) | 4 | Article title: "Beyond psychology: prevalence of p value and confidence interval misinterpretation across different fields" [link] Recruitment was done by placing advertisements on the following WeChat Public Accounts: The Intellectuals, Guoke Scientists, Capital for Statistics, Research Circle, 52brain, and Quantitative Sociology. This resulted in four populations: psychology undergraduate students, psychology master's students, psychology PhD students, psychologists with a PhD. All respondents were awarded their degree in China. |
Cliff Effect
The cliff effect refers to a researcher’s drop in the confidence of an experimental result based on the p-value. Typically, the cliff effect refers to the dichotomization of evidence at the 0.05 level where an experimental or analytical result produces high confidence for p-values below the 0.05 threshold and lower confidence for values above 0.05. This is often manifested as an interpretation that an experimental treatment effect is real, important, or robust if the experimental result reaches statistical significance and implausible or unimportant if it does not. In practice the cliff effect could refer to any p-value that results in a precipitous drop in researcher confidence, for instance at the 0.1 threshold.
The cliff effect is often considered an artifact of NHST misinterpretations. Why would two p-values of, say, 0.04 and 0.06 elicit a more severe drop in researcher confidence than two p-values of 0.20 and 0.22? From a decision-theoretic point of view, in the spirit of Neyman and Pearson, a pre-specified cutoff, the “alpha” value, is required to control long-run false positive rates. Interpreting the p-value itself as a measure of this Type I error rate is a common mistake. Yet even that mistaken interpretation does not warrant a cliff effect: whether one moves from a p-value of 0.04 to 0.06 or from 0.20 to 0.22, on average two additional Type I errors out of 100 would be observed. Returning to correct NHST usage, it is true that the alpha cutoff would cause us to either accept or reject a hypothesis based on the p-value as a means of controlling the Type I error rate. However, rejection regions in that sense should be considered distinct from researcher confidence. When the p-value is treated as an evidentiary measure of the null hypothesis, as it most often is, the cliff effect is unwarranted absent some additional scientific context.
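A quick simulation illustrates the point. Under a true null hypothesis p-values are (approximately) uniformly distributed, so treating any threshold as a rejection cutoff yields a long-run false positive rate equal to that threshold; moving the cutoff from 0.04 to 0.06 adds the same two errors per 100 experiments as moving it from 0.20 to 0.22. The two-sample setup and group size of 20 below are illustrative assumptions only.

```r
# Long-run false positive rates under a true null hypothesis.
set.seed(1)
reps   <- 20000
p_null <- replicate(reps, t.test(rnorm(20), rnorm(20))$p.value)

# Proportion of null experiments "rejected" at each cutoff
sapply(c(0.04, 0.06, 0.20, 0.22), function(cutoff) mean(p_null < cutoff))
# roughly 0.04, 0.06, 0.20, 0.22 -- each two-point move adds ~2 errors per 100
```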
Research on the cliff effect began in 1963 with a small-sample study of nine psychology faculty and ten psychology graduate students by Robert Rosenthal and John Gaito. Subjects were presented with 14 different p-values; the full list of values was not given in the paper. Subjects were then asked to rate their confidence in each p-value on a six-point scale ranging from 0 (complete absence of confidence) to 5 (extreme confidence). The confidence scale presented to subjects is shown below.
0 - Complete absence of confidence or belief
1 - Minimal confidence or belief
2 - Mild confidence or belief
3 - Moderate confidence or belief
4 - Great confidence or belief
5 - Extreme confidence or belief
The authors conclude that a cliff effect exists at the 0.05 threshold by noting that 84% of subjects had a larger decrease in confidence between p-values of 0.05 and 0.10 than between p-values of 0.10 and 0.15. This was the highest proportion of subjects expressing decreased confidence of any of the ranges the authors presented in the paper. The authors also tested confidence under two different sample sizes, 10 and 100. The drop in confidence between 0.05 and 0.10 and between 0.10 and 0.15 can be seen by looking at the two distances d1 and d2 on the chart at right. These distances are for students responding to the scenario with a sample size of 100; however, both populations show similar drops across both sample-size scenarios.
In general, graduate students expressed greater confidence than faculty members for a given p-value. All respondents tended to express more confidence at larger sample sizes for the same p-value. However, sample size does not affect the Type I error rate. The authors therefore concluded that respondents either intentionally or subconsciously consider Type II error.
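A small simulation makes the authors' point concrete: at a fixed alpha of 0.05 the Type I error rate is about 5% regardless of sample size, whereas the chance of detecting a real effect (one minus the Type II error rate) does grow with sample size. The effect size of 0.5 and the group sizes below are our own illustrative choices, not values from the original study.

```r
# Rejection rates at alpha = 0.05 for a two-sample t-test.
set.seed(1)
reps <- 10000
reject_rate <- function(n_per_group, effect) {
  mean(replicate(reps,
    t.test(rnorm(n_per_group, effect), rnorm(n_per_group))$p.value < 0.05))
}

reject_rate(10, 0)     # ~0.05: Type I error rate with n = 10 per group
reject_rate(100, 0)    # ~0.05: Type I error rate with n = 100 per group
reject_rate(10, 0.5)   # ~0.19: power with n = 10 per group
reject_rate(100, 0.5)  # ~0.94: power with n = 100 per group
```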
Kenneth Beauchamp and Richard May replicated Rosenthal and Gaito a year later and wrote up a short one-page summary. Beauchamp and May gave their questionnaire to nine psychology faculty and 11 psychology graduate students. The authors report that no cliff effect was observed at any p-value. However, a one-page response by Rosenthal and Gaito the same year made use of an extended report from Beauchamp and May (provided to Rosenthal and Gaito upon request). The Rosenthal and Gaito rejoinder disputed the conclusion of no cliff effect. Rosenthal and Gaito noted that in Beauchamp and May’s extended report confidence for a p-value of 0.05 was actually higher than for a p-value of 0.03, indicating that 0.05 is treated as if it has special properties, although this pattern in itself does not fit the usual definition of the cliff effect. Rosenthal and Gaito also used as evidence for a 0.05 cliff effect the fact that Beauchamp and May described Rosenthal and Gaito’s main effects from the original paper as nonsignificant despite the p-value being 0.06, thereby falling prey to the very cliff effect they argued was not present in their data.
In 1972 Eric Minturn, Leonard Lansky, and William Dember surveyed 51 bachelor's, master’s, and PhD graduates. Participants were asked about their level of confidence at 12 different p-values under two sample sizes, 20 and 200. A cliff effect was reported at p-values of 0.01, 0.05, and 0.10. Participants also had more confidence in the larger sample size scenario. The results of Minturn, Lansky, and Dember (1972) presented here are as reported in Nelson, Rosenthal, and Rosnow (1986). The original Minturn, Lansky, and Dember (1972) paper could not be obtained. It was presented at the meeting of the Eastern Psychological Association, Boston, under the title “The interpretation of levels of significance by psychologists.” Despite several emails to the Eastern Psychological Association to obtain the original paper, no response was received. We confirmed that Leonard Lansky and William Dember are now deceased and therefore could not be contacted. We attempted to reach Eric Minturn via a LinkedIn profile that appeared to belong to the same Eric Minturn who authored the paper, but no response was received. All three authors had an academic affiliation with the University of Cincinnati; however, the paper is not contained in the authors’ listed works kept by the university, nor is it contained in the listed works collected by any of the third-party journal libraries we searched.
Nanette Nelson, Robert Rosenthal, and Ralph Rosnow surveyed 85 academic psychologists in 1986. They asked about 20 p-values ranging from 0.001 to 0.90 using the same confidence scale as Rosenthal and Gaito (1963). The authors found a cliff effect at 0.05 and 0.10. They also found a general increase in confidence as the effect size increased, as the sample size increased, and when the experiment was a replication rather than the first run of an experiment.
In 2001 Jacques Poitevineau and Bruno Lecoutre wrote a paper titled “Interpretation of significance levels by psychological researchers: The .05 cliff effect may be overstated” in which they presented a questionnaire to 18 psychology researchers to measure the presence of a cliff effect. As the title of their paper suggests, Poitevineau and Lecoutre found the cliff effect to be less pronounced than previously reported. The authors’ main finding was that averaging across subjects may mask between-subject heterogeneity in p-value confidence. In particular, the authors note that when averaged across all 18 subjects a cliff effect was present. However, this was largely driven by four subjects who expressed what the authors call an “all-or-none” approach to p-values, with extremely high confidence for p-values less than 0.05 and almost zero confidence for p-values larger than 0.05. Four other subjects had a linear decrease in confidence across p-values. The majority (10 subjects) expressed an exponential decrease in confidence. The exponential group had larger decreases in confidence between small p-values than between large ones, but did not exhibit a cliff effect in the traditional sense.
In 2012 Rink Hoekstra, Addie Johnson, and Henk Kiers published the results of their study of 65 PhD students in psychology departments in the Netherlands. Each subject was presented with eight different scenarios, four related to NHST and four related to confidence intervals. The NHST scenarios included the mean difference along with degrees of freedom, sample size, a p-value, and a standard error. For each scenario four p-values were used, designed to be just below or just above the traditional 0.05 cutoff (0.04 and 0.06) or clearly below or above it (0.02 and 0.13).
Confidence in both the original result and in the result if the study were repeated was measured by responses to the two questions below.
What would you say the probability is that there is an effect in the expected direction in the population based on these results?
What would you say the probability is that you would find a significant effect if you were to do the same study again?
The authors found a cliff effect around the 0.05 level: the decrease in confidence for both questions was greatest between 0.04 and 0.06 among the four p-values presented. There was a 14-point drop in confidence (85 to 71) between p-values of 0.04 and 0.06, compared to a drop of just 6.5 (91.5 to 85) between p-values of 0.02 and 0.04 and a drop of 25 (71 to 46) between the much wider p-value interval of 0.06 to 0.13. These results can be seen visually in the chart at right; the line segment has the steepest slope between 0.04 and 0.06. The cliff effect for confidence intervals was found to be more pronounced than for NHST.
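As a quick check of the cliff, the drops and the corresponding slopes (drop per unit of p-value) can be computed directly from the average responses quoted above; the numbers below are simply those reported in the text.

```r
# Average reported confidence at the four p-values used by Hoekstra et al. (2012)
p_values   <- c(0.02, 0.04, 0.06, 0.13)
confidence <- c(91.5, 85, 71, 46)

drops  <- -diff(confidence)       # 6.5, 14, 25
slopes <- drops / diff(p_values)  # drop per unit of p-value: 325, 700, ~357
rbind(drops, slopes)              # steepest slope falls between 0.04 and 0.06
```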
Cliff effect meta-analysis
Due to the varying methodologies of studies investigating the cliff effect, a meta-analysis could not be conducted. However, nearly all results suggest that a cliff effect of some magnitude exists for psychology researchers. More work is likely needed in the spirit of Poitevineau and Lecoutre (2001) to determine whether the aggregated cliff effect reported in the other studies is the result of a relatively small number of “all-or-none” respondents and the extent to which subject confidence in p-values can be segmented into distinguishable patterns such as linear or exponential decreases.
Dichotomization of evidence
In their 2016 paper “Blinding Us to the Obvious? The Effect of Statistical Training on the Evaluation of Evidence,” Blakeley McShane and David Gal set out to test the prevalence of the dichotomization of evidence induced by the divide between statistically significant and nonsignificant results. They surveyed 54 members of the editorial board of Psychological Science.
Their survey instrument is shown below. Two versions of the summary were presented, one with a p-value of 0.01 and one with a p-value of 0.27. Respondents saw both versions, but the order in which the versions were presented was randomized.
Below is a summary of a study from an academic paper:
The study aimed to test how different interventions might affect terminal cancer patients’ survival. Participants were randomly assigned to either write daily about positive things they were blessed with or to write daily about misfortunes that others had to endure. Participants were then tracked until all had died. Participants who wrote about the positive things they were blessed with lived, on average, 8.2 months after diagnosis whereas participants who wrote about others’ misfortunes lived, on average, 7.5 months after diagnosis (p = 0.27). Which statement is the most accurate summary of the results?
A. The results showed that participants who wrote about their blessings tended to live longer post-diagnosis than participants who wrote about others’ misfortunes.
B. The results showed that participants who wrote about others’ misfortunes tended to live longer post-diagnosis than participants who wrote about their blessings.
C. The results showed that participants’ post-diagnosis lifespan did not differ depending on whether they wrote about their blessings or wrote about others’ misfortunes.
D. The results were inconclusive regarding whether participants’ post-diagnosis lifespan was greater when they wrote about their blessings or when they wrote about others’ misfortunes.
The correct answer in both versions was Option A. This follows simply from calculating the effect size, which is independent of the p-value. However, respondents were much more likely to select an incorrect response in the version of the question with the nonsignificant p-value. A graphical representation of their results is shown at right, with the dot representing the mean and the lines representing the 95% confidence interval of the mean.
One critique of this study might be whether the four options above can accurately reflect the extent of evidence dichotomization. Option B gets the results completely backward. Selecting this option would indicate that either the respondent misread the scenario or is statistically immature.
Selection of Option C indeed would be an indication of dichotomization of evidence since it confuses a nonsignificant p-value for evidence of no effect (the Nullification Fallacy).
Option D is more subtle. The question asks about the lifespan of the participants in the study. However, the purpose of statistical inference is to generalize from a sample to a population. Perhaps subjects mistook the hypothetical scenario as a question of statistical inference rather than the more modest task of identifying the effect in the sample. If so, then selection of Option D would not be an indication of evidence dichotomization.
This theory is unlikely for several reasons. First, four times as many subjects selected Option D in the p = 0.27 version as in the p = 0.01 version, 20 subjects and 5 subjects, respectively. There is no accounting for why respondents would confuse the scenario for statistical inference in one version but not the other (recall that all subjects saw both versions, but in a random order). Furthermore, Option C, clearly an indication of dichotomization of evidence at work, was selected by just two subjects in the p = 0.01 version but by 18 subjects in the p = 0.27 version. Overall, 47 of the 54 subjects (87%) chose Option A in the p = 0.01 case, but just 9 (17%) chose Option A in the p = 0.27 case, despite no change to the hypothetical setup or statement options (besides the p-value itself).
All populations surveyed by McShane and Gal (2016) exhibited similar levels of dichotomization of evidence. This included medical researchers, marketing researchers, and statisticians. For some populations a so-called “choice” question was also presented. There were two versions of the choice question. In one version respondents were asked to choose a drug preference based on a hypothetical scenario in which two drugs were being compared for efficacy in disease treatment. In the second version respondents were asked to make a recommendation about the hypothetical treatment to either a socially close or socially distant person (for example, a family member or a stranger). The dichotomization of evidence was moderated when respondents answered the choice question. McShane and Gal hypothesize that respondents are more likely to select the correct answer for the choice question because it short circuits the automatic response to interpret the results with statistical significance in mind. Instead, focus is redirected toward a simpler question, like “which drug is better?” The details of these results are discussed throughout this article in the relevant sections.
The hypothetical setup as well as the statement wording shown above were modified for some populations. To distinguish it from the choice question, the original setup shown above is referred to by the authors as the “judgement” question.
Both the choice question and a modified judgement question were presented to two populations of professional psychology researchers when McShane and Gal conceptually replicated the Psychological Science experiment. These two populations were 31 members of the Cognition editorial board and 33 members of the Social Psychological and Personality Science (SPPS) editorial board. We chose to include the results of the Cognition board under both the “Psychology” and “Medicine” sections of this article because the journal crosses both cognitive neuroscience and cognitive psychology.
The judgement question presented to these two populations was as follows.
Below is a summary of a study from an academic paper:
The study aimed to test how two different drugs impact whether a patient recovers from a certain disease. Subjects were randomly assigned to Drug A or Drug B. Fifty-two percent (52%) of patients who took Drug A recovered from the disease while forty-four percent (44%) of patients who took Drug B recovered from the disease (p = 0.26).
Assuming no prior studies have been conducted with these drugs, which of the following statements is most accurate?
A. A person drawn randomly from the same patient population as the patients in the study is more likely to recover from the disease if given Drug A than if given Drug B.
B. A person drawn randomly from the same patient population as the patients in the study is less likely to recover from the disease if given Drug A than if given Drug B.
C. A person drawn randomly from the same patient population as the patients in the study is equally likely to recover from the disease if given Drug A or if given Drug B.
D. It cannot be determined whether a person drawn randomly from the same patient population as the patients in the study is more/less/equally likely to recover from the disease if given Drug A or if given Drug B.
Here the statements were more explicit about the population/sample distinction, noting that of interest was “a person drawn randomly from the same patient population as the patients in the study.” Nonetheless the results were similar. For the Cognition board roughly three times as many subjects chose Option D in the p = 0.26 version as in the p = 0.01 version, 13 and 4, respectively. For the SPPS board the situation was similar, with 4 and 14 subjects choosing Option D in the p = 0.01 and p = 0.26 versions, respectively. This again supports the claim that confusion about the hypothetical setup or statement wording was not driving the observed dichotomization of evidence.
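As a side note, the direction of the in-sample effect is fixed by the recovery rates themselves (52% for Drug A versus 44% for Drug B) and does not depend on the p-value. The excerpt does not state the group sizes, but with an assumed 100 patients per arm a two-proportion test without continuity correction produces a p-value close to the 0.26 shown to respondents; the sample size here is purely our assumption for illustration.

```r
# Observed recovery rates favor Drug A regardless of the p-value.
recovered <- c(52, 44)
n_arm     <- c(100, 100)   # assumed group sizes; not stated in the excerpt

recovered / n_arm                                      # 0.52 vs 0.44
prop.test(recovered, n_arm, correct = FALSE)$p.value   # ~0.26
```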
After answering the judgment question participants were asked the choice question. This included the same hypothetical setup, but asked subjects to make a choice about their preference of drug:
If you were a patient from the same population as the patients in the study, what drug would you prefer to take to maximize your chance of recovery?
A. I prefer Drug A.
B. I prefer Drug B.
C. I am indifferent between Drug A and Drug B.
Overall, 27 of 31 members of the Cognition board (87%) chose Option A in the p=0.01 version of the judgement question compared to just 11 (35%) in the p=0.26 version. Just one subject failed to prefer Drug A in the p=0.01 version of the choice question. While fewer, 20 subjects or 65%, preferred Drug A in the p=0.26 version of the choice question, this was still almost twice as high as the number of subjects that viewed Drug A as superior in the p=0.26 version of the judgement question.
The results for the SPPS board were similar; 29 of 33 subjects (89%) chose Option A in the p=0.01 version of the judgement question compared to just 5 (15%) in the p=0.26 version. Only four subjects had no preference of drug in the p=0.01 version of the choice question, with the remaining 29 (89%) preferring Drug A. While 19 subjects or 58%, preferred Drug A in the p=0.26 version of the choice question, this was still almost four times as high as the number of subjects that viewed Drug A as superior in the p=0.26 version of the judgement question.
Note that in the data tables presented in Appendix A of McShane and Gal (2016) the authors mistakenly indicate that the p-value presented to the Cognition and Social Psychology and Personality subjects was 0.27. All other references in the supplementary material report the p-value in this scenario as 0.26.
The authors produced a follow-up study in their paper “Statistical Significance and the Dichotomization of Evidence” that was focused specifically on statisticians. However, the experimental setup was the same. That article was published in the Journal of the American Statistical Association and selected by the editors for discussion. Those interested in the full set of discussions and responses are encouraged to read the paper, freely available here.
Dichotomization of evidence meta-analysis
A small meta-analysis was possible for dichotomization of evidence using the three studies from McShane and Gal. This included 118 psychology researchers in total: 31 from the editorial board of Cognition, 33 from the editorial board of Social Psychological and Personality Science, and 54 from the editorial board of Psychological Science. A simple weighted average was used across two separate survey instruments to determine the proportion of subjects selecting Option A, the correct option, in both a small p-value scenario (p=0.01) and a large p-value scenario (p=0.26 or p=0.27). For a full review of these studies please see the “Cliff effect and dichotomization of evidence” section above.
Using this weighted average methodology resulted in an average of 87% of subjects responding correctly in the low p-value scenario (p = 0.01) and 21% responding correctly in the high p-value scenario (p = 0.26 or 0.27), for a difference of 66 percentage points. This is no surprise; the three individual studies all suggested dichotomization of evidence was apparent in each population, so combining the results leads to the same conclusion. As the results in the “Cliff effect and dichotomization of evidence” section show, this effect is robust to survey instrument wording and is moderated when choice questions are used in place of judgement questions.
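The pooled figures can be reproduced directly from the counts reported earlier for the three editorial boards (Psychological Science, Cognition, and SPPS).

```r
# Counts of subjects choosing Option A (the correct option) in each version.
n_board        <- c(54, 31, 33)   # Psychological Science, Cognition, SPPS
correct_low_p  <- c(47, 27, 29)   # p = 0.01 version
correct_high_p <- c(9, 11, 5)     # p = 0.27 / 0.26 version

sum(correct_low_p)  / sum(n_board)   # ~0.87
sum(correct_high_p) / sum(n_board)   # ~0.21
```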
ECONOMICS
There are relatively few papers directly testing the knowledge of economists. Just three papers were found, one covering the cliff effect, one examining the dichotomization of evidence, and another covering both NHST and confidence interval misinterpretations. Descriptions of these four areas are provided below:
Null hypothesis significance testing (NHST) misinterpretations. These misinterpretations are primarily focused on misunderstanding p-value definitions or statistical properties; for example, interpreting the p-value as the probability of the null hypothesis or the probability of replication.
Confidence interval misinterpretations. For example, interpreting the confidence interval as a probability.
The dichotomization of evidence. A specific NHST misinterpretation in which results are interpreted differently depending on whether the p-value is statistically significant or statistically nonsignificant.
The cliff effect. A drop in the confidence of an experimental result based on the p-value. For example, having relatively high confidence in a result with a p-value of 0.04, but much lower confidence in a result with a p-value of 0.06.
A summary of each article is presented in the table below including the authors and year published, the title of the article and a link to the paper, which of the four categories above the article belongs to, the subjects of the study and their associated sample size, and a brief summary of the article’s primary findings.
Below the table more details of each study are provided, broken out by the four categories above (articles that cover more than one category are presented multiple times, with each aspect of the study presented in the associated section). Of course, the methodological details and complete results of each study cannot be presented in full without duplicating the article outright. Readers are encouraged to go to the original articles for the full details and in-depth discussion of each study.
In the course of analyzing each study below several errors were found. In cases where errors were present the authors were contacted for comment. We note these errors throughout the article as well as any responses received from the authors.
Authors & year | Article title | Category | Subjects | Primary findings |
---|---|---|---|---|
McShane & Gal (2015) | Supplement to “Blinding Us to the Obvious? The Effect of Statistical Training on the Evaluation of Evidence” [link] | Dichotomization of evidence | Quarterly Journal of Economics authors (n=55) American Economic Review authors (n=94) |
1. A cliff effect was found between p-values of 0.01 and 0.27. |
Lyu et al. (2020) | Beyond psychology: prevalence of p value and confidence interval misinterpretation across different fields [link] | NHST and CI misinterpretations | Economics undergraduate students (n=48) Economics master's students (n=76) Economics PhD students (n=20) Economists with a PhD (n=20) |
1. 90% of undergraduate students demonstrated at least one NHST misinterpretation. 2. 95% of masters students demonstrated at least one NHST misinterpretation. 3. 90% of PhD students demonstrated at least one NHST misinterpretation. 4. 90% of subjects with a PhD demonstrated at least one NHST misinterpretation. |
Helske et al. (2020) | Are You Sure You’re Sure? - Effects of Visual Representation on the Cliff Effect in Statistical Inference [link] | Cliff effect | Economic researchers (n=4) | 1. No cliff effect was found for economists using simple descriptive statistics. 2. Looking across all 114 respondents, in a one-sample t-test the cliff effect is dampened by using gradient or violin visual presentations rather than standard confidence interval bounds. |
Questionnaires of NHST knowledge
In 2017 Lutz Ostkamp wrote a master’s thesis at Bielefeld University focused on NHST misinterpretations. The thesis was supervised by Professor Roland Langrock and Professor Fridtjof Nußbeck. In total, 155 economics students in a second-semester statistics class were surveyed. Ostkamp’s instrument combined a replication of Oakes (1986) and Haller and Krauss (2002) with questions from the Reasoning about P-values and Statistical Significance (RPASS) scale (Lane-Getaz, 2017). Results from the replication are presented first. As a reminder, the hypothetical scenario provided by Oakes and the six true/false statements are shown below. Ostkamp used the Haller and Krauss variation, which adds a sentence at the end of the result description noting that “several or none of the statements may be correct.”
Suppose you have a treatment that you suspect may alter performance on a certain task. You compare the means of your control and experimental groups (say 20 subjects in each sample). Further, suppose you use a simple independent means t-test and your result is (t = 2.7, d.f. = 18, p = 0.01). Please mark each of the statements below as “true” or “false”. “False” means that the statement does not follow logically from the above premises. Also note that several or none of the statements may be correct.
1. You have absolutely disproved the null hypothesis (that is, there is no difference between the population means).
2. You have found the probability of the null hypothesis being true.
3. You have absolutely proved your experimental hypothesis (that there is a difference between the population means).
4. You can deduce the probability of the experimental hypothesis being true.
5. You know, if you decide to reject the null hypothesis, the probability that you are making the wrong decision.
6. You have a reliable experimental finding in the sense that if, hypothetically, the experiment were repeated a great number of times, you would obtain a significant result on 99% of occasions.
All six statements are false, yet 96% of the 155 students marked at least one statement as true. The mean number of incorrect responses was 2.5. This is comparable to Haller and Krauss (2002), in which 100% of the 44 psychology students surveyed marked at least one statement as true, also with a mean of 2.5 incorrect responses.
As mentioned when discussing psychologists, the degrees of freedom in the hypothetical setup, 18, are incorrect; the correct value is 38 (20 − 1 + 20 − 1 = 38). Versions of this survey instrument correcting the degrees of freedom were used in Lyu et al. (2018) and Lyu et al. (2020), with similarly high rates of misinterpretation as in the original uncorrected version used by Oakes (1986), Haller and Krauss (2002), and Ostkamp (2017).
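A one-line check confirms the degrees of freedom: with two independent groups of 20 and equal variances assumed, the t-test has 20 + 20 − 2 = 38 degrees of freedom. The simulated data below are arbitrary and serve only to show the reported df.

```r
# Degrees of freedom for a two-sample t-test with 20 subjects per group.
set.seed(1)
t.test(rnorm(20), rnorm(20), var.equal = TRUE)$parameter  # df = 38
```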
[insert data analysis]
In 2020 Xiao-Kang Lyu, Yuepei Xu, Xiao-Fan Zhao, Xi-Nian Zuo, and Chuan-Peng Hu tested 1,479 students and researchers in China, including 164 in the field of economics. They used a four-question instrument where respondents were randomized into either a version in which the p-value was statistically significant or one in which it was statistically nonsignificant. Subjects were prompted to answer each question as either “true” or “false.” Respondents are considered to have a misinterpretation of an item if they incorrectly mark it as “true”; the correct answer to all statements was “false.”
The author’s instrument wording is shown below:
Suppose you have a treatment that you suspect may alter performance on a certain task. You compare the means of your control and experimental groups (50 subjects in each sample). Your hypotheses are as follows. H0: No significant difference exists between the population means corresponding to experimental and the control groups. H1: Significant difference exists between the experimental and the control groups. Further, suppose you use a simple independent means t test, and your result is t = 2.7, df = 98, p = .008 (in the significant version) or t = 1.26, df = 98, p = .21 (in the non-significant version).
The response statements read as follows, with the nonsignificant version wording appearing in parentheses and substituting for the wording directly preceding it.
1. You have absolutely disproved (proved) the null hypothesis.
2. You have found the probability of the null (alternative) hypothesis being true.
3. You know, if you decide to (not to) reject the null hypothesis, the probability that you are making the wrong decision.
4. You have a reliable (unreliable) experimental finding in the sense that if, hypothetically, the experiment was repeated a great number of times, you would obtain a significant result on 99% (21%) of occasions.
Using the open source data made available by the authors we attempted to reproduce the findings in Lyu et al. (2020). However, we were not able to reproduce either the top-level figure for NHST or CI or any of the figures from the main table (Table 1). We contacted co-author Chuan-Peng Hu about the possible errors over email and shared our R code. He and the other paper authors then reexamined the data and confirmed our analysis was correct, later issuing a correction to the paper.
In terms of the proportion with at least one NHST misinterpretation, economics researchers were near the middle of the pack of the academic fields surveyed by Lyu et al. (2020). They had the fifth highest proportion of respondents with at least one misinterpretation out of eight professions surveyed. Economists ranked seventh out of eight in terms of the average number of misinterpretations, 1.71 out of a possible four. The nonsignificant version elicited a substantially higher average number of misinterpretations than the significant version, 1.84 compared to 1.59. This pattern was true for all fields except general science.
The highest proportion of incorrect responses across both versions was for Statement 3, which was also the most misinterpreted statement across all fields. A separate version of this statement was also the most misinterpreted across three studies in psychology: Oakes (1986), a 2002 replication by Haller and Krauss, and another replication in 2018 by Lyu et al. (translated into Chinese).
Participants likely had an especially difficult time with this statement because it involves a very subtle reversal of conditional probabilities related to the Type I error rate. The Type I error rate is the probability of rejecting the null hypothesis given that the null hypothesis is actually true; this statement instead asks about the probability of the null hypothesis being true given that the null hypothesis has been rejected. In fact, knowing the Type I error rate requires nothing more than the pre-specified value called “alpha” (typically set to 5%), so none of the test results would need to be presented in a hypothetical scenario to determine this rate.
One might argue that the language is so subtle that some participants who have a firm grasp of Type I error may mistakenly believe this question is simply describing the Type I error definition. In the significant version of the instrument, Statement 3 has two clauses: (1) “if you decide to reject the null hypothesis” and (2) “the probability that you are making the wrong decision.” In one order these clauses read: “the probability that you are making the wrong decision if you decide to reject the null hypothesis.” With the implicit addition of the null being true, this statement is achingly close to one way of stating the Type I error rate, “the probability that you wrongly reject a true null hypothesis.” Read in the opposite order, these two clauses form the statement on the instrument. There is no temporal indication in the statement itself as to which order the clauses should be read in, such as “first…then…”. While it is true that in English we read left to right, it is also true that many English statements can have their clauses reversed without changing the meaning of the statement. Other questions in the instrument are likely more suggestive of participants having an NHST misinterpretation.
Across numerous studies Statement 3 was the most misinterpreted. This is likely due to the association respondents made between the statement and the Type I error rate. Formally, the Type I error rate is given by the pre-specified alpha value, usually set to a probability of 0.05 under the standard definition of statistical significance. It could then be said that if the sampling procedure and p-value calculation were repeated on a population in which the null hypothesis were true, 5% of the time the null would be mistakenly rejected. The Type I error rate can be summarized as, “The probability that you wrongly reject a true null hypothesis.”
There is a rearranged version of Statement 3 that is close to this Type I error definition: “You know the probability that you are making the wrong decision if you decide to reject the null hypothesis.” Note though that this statement is missing a key assumption from the Type I error rate: that the null hypothesis is true. The actual wording of Statement 3 was more complex: “You know, if you decide to reject the null hypothesis, the probability that you are making the wrong decision.” The sentence structure makes it more difficult to understand, but again the statement does not include any indication about the truth of the null hypothesis. For this reason the statement cannot be judged as true.
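The distinction between the two conditional probabilities can be made concrete with a simulation. The Type I error rate, P(reject | null true), is fixed at alpha by construction, whereas the quantity the statement actually describes, P(wrong decision | reject), also depends on how often the null is true and on the power of the test. The 50% prevalence of true nulls, the effect size of 0.3, and the group size of 50 below are all illustrative assumptions of ours, not values from any of the studies discussed.

```r
# P(reject | H0 true) versus P(H0 true | reject).
set.seed(1)
reps  <- 10000
n     <- 50      # assumed subjects per group
alpha <- 0.05

h0_true <- rbinom(reps, 1, 0.5) == 1   # assumed: 50% of studied nulls are true
effect  <- ifelse(h0_true, 0, 0.3)     # assumed effect size when H1 holds
p_vals  <- sapply(effect, function(d) t.test(rnorm(n, d), rnorm(n))$p.value)
reject  <- p_vals < alpha

mean(reject[h0_true])   # ~0.05: the Type I error rate, P(reject | H0 true)
mean(h0_true[reject])   # ~0.14 here: P(H0 true | reject), not fixed at 5%
```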
As an additional piece of analysis, we set out to understand whether Statement 3 was syntactically sound and consulted Maayan Abenina-Adar, a final-year PhD student in UCLA’s Department of Linguistics. Although the sentence may seem somewhat awkward as written, it is syntactically unambiguous and correctly constructed. However, a clearer version of the sentence would be the rearranged version just mentioned: “If you decide to reject the null hypothesis you know the probability that you are making the wrong decision.” This version avoids two sources of syntactic complexity in the original statement:
1. The conditional antecedent, the phrase “if you decide to reject the null hypothesis,” appearing in the middle of the sentence.
2. The use of the noun phrase “probability that you are making the wrong decision” as a so-called “concealed question.”
Whether the phrasing of Statement 3 contributed to its misinterpretation cannot be determined from the data at hand. One might argue that the more complex sentence structure caused respondents to spend extra time thinking about the nature of the statement, which might reduce misunderstanding. Moreover, Statement 4 was also syntactically complex but did not elicit the same rate of misinterpretation. (The Statement 4 wording was: “You have a reliable experimental finding in the sense that if, hypothetically, the experiment were repeated a great number of times, you would obtain a significant result on 99% of occasions.”) A controlled experiment comparing the two different versions of Statement 3 would be needed to tease apart the impact of the sentence’s structure and its statistical meaning on the rate of misunderstanding.
One other notable factor is that the pre-specified alpha value is not present in either the hypothetical scenario preceding Statements 1-4 or in Statement 3 itself. This might have been a clue that the statement could not have referred to the Type I error rate, since not enough information was given. On the other hand, the 0.05 alpha probability is so common that respondents may simply have assumed its value.
Using a different set of statements, the Psychometrics Group Instrument, both Mittag and Thompson (2000) and Gordon (2001) found that two sets of education researchers had particular trouble with statements about Type I error. Therefore, it may simply be that Type I error is poorly understood by students, researchers, and professionals independent of statement wording.
Statement summaries | Significant version | Nonsignificant version |
---|---|---|
1. You have absolutely disproved (proved) the null hypothesis. | 60% | 48% |
2. You have found the probability of the null (alternative) hypothesis being true. | 44% | 42% |
3. You know, if you decide to (not to) reject the null hypothesis, the probability that you are making the wrong decision. | 67% | 65% |
4. You have a reliable (unreliable) experimental finding in the sense that if, hypothetically, the experiment was repeated a great number of times, you would obtain a significant result on 99% (21%) of occasions. | 63% | 42% |
Table notes:
1. Percentages do not add to 100% because each respondent answered all four questions.
2. Sample sizes: significant version (n=93), nonsignificant version (n=71).
3. Reference: "Beyond psychology: prevalence of p value
and confidence interval misinterpretation
across different fields", Xiao-Kang Lyu, Yuepei Xu, Xiao-Fan Zhao, Xi-Nian Zuo, and Chuan-Peng Hu, Frontiers in Psychology, 2020, paper: [link], data: [link]
A breakdown of NHST misunderstandings by education is shown below. All education levels had high rates of misunderstanding at least one NHST question. PhD students fared best, with “only” 90% demonstrating at least one NHST misinterpretation and an average of 1.6 incorrect NHST responses.
Education | Sample size | Percentage with at least one NHST misunderstanding | Average number of NHST misunderstandings (out of four) |
---|---|---|---|
Undergraduates | 48 | 92% | 1.8 |
Masters | 76 | 92% | 1.9 |
PhD | 20 | 90% | 1.6 |
Post-PhD | 20 | 95% | 1.7 |
Questionnaires of Confidence Interval Knowledge
Because of the difficulty in properly interpreting NHST, confidence intervals have been proposed as an alternative [citation]. Confidence intervals also have the benefit of giving a measure of the precision of the effect size. For that reason, in addition to NHST instruments, some researchers have also tested confidence interval misinterpretations. Again, most of this research has occurred in the field of psychology. Only Lyu et al. (2020) have directly tested economists for common confidence interval misinterpretations. All researchers surveyed were from China.
Lyu et al. (2020) used a modified version of their four-question NHST instrument adapted to test confidence interval knowledge. There were two versions, one with a statistically significant result and one without. The English translation of the hypothetical experimental situation and the four statements is shown below. The significant version of each statement reads as shown, with the nonsignificant version’s wording appearing in parentheses, substituting for the value or phrase directly preceding it.
The 95% (bilateral) confidence interval of the mean difference between the experimental group and the control group ranges from .1 to .4 (–.1 to .4).
1. A 95% probability exists that the true mean lies between .1 (–.1) and .4.
2. If we were to repeat the experiment over and over, then 95% of the time the true mean falls between .1 (–.1) to .4.
3. If the null hypothesis is that no difference exists between the mean of experimental group and control group, then the experiment has disproved (proved) the null hypothesis.
4. The null hypothesis is that no difference exists between the means of the experimental and the control groups. If you decide to (not to) reject the null hypothesis, then the probability that you are making the wrong decision is 5%.
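Statements 1 and 2 describe the most common misreadings of a confidence interval. As a point of reference, the simulation below, our own sketch with arbitrary parameter choices, illustrates the property the 95% actually describes: across many repetitions of the sampling procedure, roughly 95% of the intervals constructed this way contain the fixed true mean difference, while any single computed interval either contains it or does not.

```python
import numpy as np

rng = np.random.default_rng(1)
true_diff, n, n_reps = 0.25, 50, 10_000   # arbitrary illustrative values
covered = 0

for _ in range(n_reps):
    experimental = rng.normal(true_diff, 1.0, n)
    control = rng.normal(0.0, 1.0, n)
    diff = experimental.mean() - control.mean()
    se = np.sqrt(experimental.var(ddof=1) / n + control.var(ddof=1) / n)
    covered += (diff - 1.96 * se) <= true_diff <= (diff + 1.96 * se)

print(f"Share of 95% intervals containing the true difference: {covered / n_reps:.3f}")
```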
At 92%, economists had the fourth highest proportion of respondents with at least one confidence interval misinterpretation out of the eight professions surveyed by Lyu et al. (2020). Economists ranked fifth out of eight in terms of the average number of confidence interval misinterpretations, 1.76 out of a possible four. This was slightly lower than the 1.82 average misinterpretations on the NHST instrument. As with the NHST instrument, the nonsignificant version produced a higher average number of misinterpretations than the significant version, 1.86 compared to 1.68. This pattern held for all fields except medicine and the social sciences.
There was a fairly wide gap between the significant and nonsignificant versions for Statement 4, 15 percentage points. Statement 4 suffers from the same subtle wording issue as the NHST version.
Statement summaries | Significant version | Nonsignificant version |
---|---|---|
1. A 95% probability exists that the true mean lies between .1 (–.1) and .4. | 60% | 61% |
2. If we were to repeat the experiment over and over, then 95% of the time the true mean falls between 0.1 (-0.1) and 0.4. | 54% | 56% |
3. If the null hypothesis is that there is no difference between the mean of experimental group and control group, the experiment has disproved (proved) the null hypothesis. | 53% | 47% |
4. The null hypothesis is that there is no difference between the mean of experimental group and control group. If you decide to (not to) reject the null hypothesis, the probability that you are making the wrong decision is 5%. | 66% | 51% |
Table notes:
1. Percentages do not add to 100% because each respondent answered all four questions.
2. Sample sizes: significant version (n=93), nonsignificant version (n=71).
3. Reference: "Beyond psychology: prevalence of p value
and confidence interval misinterpretation
across different fields", Xiao-Kang Lyu, Yuepei Xu2, Xiao-Fan Zhao1, Xi-Nian Zuo, and Chuan-Peng Hu, Frontiers in Psychology, 2020, paper: [link], data: [link]
A breakdown of confidence interval misunderstandings by education is shown below. All education levels had high rates of misunderstanding at least one confidence interval question. Masters students had the highest rate of at least one misunderstanding at 95%, but the lowest average number of misinterpretations overall at 1.6.
Education | Sample size | Percentage with at least one CI misunderstanding | Average number of CI misunderstandings (out of four) |
---|---|---|---|
Undergraduates | 48 | 90% | 1.9 |
Masters | 76 | 95% | 1.6 |
PhD | 20 | 90% | 1.7 |
Post-PhD | 20 | 90% | 1.9 |
Cliff Effects and Dichotomization of Evidence
The cliff effect refers to an abrupt drop in confidence in an experimental result based on its p-value. Typically the effect refers to the dichotomization of evidence at the 0.05 level, where a result produces high confidence for p-values below 0.05 and much lower confidence for values above 0.05.
In 2016 Blakeley McShane and David Gal released the results of a multi-year study that investigated the prominence of the cliff effect within various academic fields, including economics. To test economists McShane and Gal surveyed 55 authors of articles from the Quarterly Journal of Economics and 94 from the American Economic Review.
To fully study the impact of the cliff effect McShane and Gal created a hypothetical scenario and then asked two questions. The first was the so-called “judgement” question, meant simply to test economists’ statistical understanding of the scenario presented. The judgement question randomized economists into a hypothetical scenario with either a p-value of 0.01 or 0.26. The second was the “choice” question, in which economists were asked to make a recommendation based on the hypothetical scenario. The choice question randomized the recommendation to be toward either a close other or a distant other.
The judgement question is shown below.
Below is a summary of a study from an academic paper:
The study aimed to test how two different drugs impact whether a patient recovers from a certain disease. Subjects were randomly drawn from a fixed population and then randomly assigned to Drug A or Drug B. Fifty-two percent (52%) of subjects who took Drug A recovered from the disease while forty-four percent (44%) of subjects who took Drug B recovered from the disease.
A test of the null hypothesis that there is no difference between Drug A and Drug B in terms of probability of recovery from the disease yields a p-value of 0.26. Assuming no prior studies have been conducted with these drugs, which of the following statements is most accurate?
A. A person drawn randomly from the same patient population as the patients in the study is more likely to recover from the disease if given Drug A than if given Drug B.
B. A person drawn randomly from the same patient population as the patients in the study is less likely to recover from the disease if given Drug A than if given Drug B.
C. A person drawn randomly from the same patient population as the patients in the study is equally likely to recover from the disease if given Drug A or if given Drug B.
D. It cannot be determined whether a person drawn randomly from the same patient population as the patients in the study is more/less/equally likely to recover from the disease if given Drug A or if given Drug B.
Participants in the close other condition saw the following wording for the choice question:
If you were to advise a loved one who was a patient from the same population as those in the study, what drug would you advise him or her to take?
Participants in the distant other condition saw this wording instead:
If you were to advise physicians treating patients from the same population as those in the study, what drug would you advise these physicians prescribe for their patients?
All participants then saw the following three response options:
A. I would advise Drug A.
B. I would advise Drug B.
C. I would advise that there is no evidence of a difference between Drug A and Drug B.
The correct answer to both versions of the judgement question was Option A, since Drug A had a higher percentage of patients recover from the disease. However, respondents were much more likely to select an incorrect response in the version of the question with the nonsignificant p-value, likely because they believed that a nonsignificant p-value was evidence of no difference between Drug A and Drug B. McShane and Gal identify this as evidence of the cliff effect at work.
The evidence is supplemented by respondents’ selection for the choice statement. While strictly speaking there is no correct answer to the choice question as it is a recommendation, Drug A had a higher recovery rate from the disease and is therefore the natural choice. Like in the judgement question the nonsignificant p-value induced fewer respondents to answer correctly. Nonetheless, the proportion answering correctly in the nonsignificant version of the choice question was substantially higher than in the judgement question. As in McShane and Gal’s article we collapse the choice question across the “close other” and “distant other” categories as this is not the primary hypothesis being considered. In general, respondents were more likely to recommend Drug A in the “close other” scenario. For complete details see McShane and Gal (2016).
McShane and Gal hypothesize that respondents are more likely to select the correct answer for the choice question because it short-circuits the automatic impulse to interpret the results in terms of statistical significance. Instead, focus is redirected toward a simpler question: which drug is better?
One more study was uncovered that examined the cliff effect, Helske et al. (2020). While more than a hundred researchers participated, just four were economists. Three of the economists had received a PhD and one had received a master’s degree. While the sample size is extremely small, the results of the study are included here for completeness.
Like in McShane and Gal a hypothetical scenario was presented. The instrument wording was as follows:
A random sample of 200 adults from Sweden were prescribed a new medication for one week. Based on the information on the screen, how confident are you that the medication has a positive effect on body weight (increase in body weight)?
One of four visualizations was then presented: a text box describing the p-value and 95% confidence interval, a 95% confidence interval visual display, a gradient confidence interval visual display, or a violin plot visual display. For each scenario respondents were presented with one of eight p-values between 0.001 and 0.8. The specific p-values were 0.001, 0.01, 0.04, 0.05, 0.06, 0.1, 0.5, and 0.8. Respondents then used a slidebar to select their confidence on a scale of 0 to 100.
Visualization | P-value interval with largest drop in confidence | Drop in confidence (percentage points) |
---|---|---|
P-value | 0.01 to 0.04 | 30% |
CI | 0.04 to 0.06 | 19% |
Using the open source data made available by the authors we analyzed the extent of a cliff effect. A possible cliff effect was observed after plotting the drop in confidence segmented by visual presentation type. Only results from the p-value and confidence interval (CI) visual presentation types are presented since these are the most common methods of presenting analytical results. However, in their analysis Helske et al. looked across all 114 respondents and employed Bayesian multilevel models to investigate the influence of the four visual presentation types. The authors concluded that gradient and violin presentation types may moderate the cliff effect in comparison to standard p-value descriptions or confidence interval bounds.
Although the p-values presented to respondents were not evenly spaced, the drop in confidence between two consecutive p-values was used to determine the presence of a cliff effect. One additional difference was calculated, that between p-values of 0.04 and 0.06, the typical cliff effect boundary.
For the confidence interval visual presentation the largest drop in confidence was indeed associated with the 0.04 and 0.06 interval. The 0.04 to 0.06 p-value difference had just the fifth highest drop in confidence for the p-value visual presentation. However, due to the extremely low sample size caution should be used in interpretation of these results.
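For readers who want to reproduce this kind of check on the published data, the sketch below outlines the calculation we performed. The column names are placeholders rather than the actual variable names in the Helske et al. dataset.

```python
import pandas as pd

P_LEVELS = [0.001, 0.01, 0.04, 0.05, 0.06, 0.1, 0.5, 0.8]

def confidence_drops(responses: pd.DataFrame) -> pd.Series:
    """Drop in mean confidence between consecutive presented p-values
    (positive values mean confidence fell as the p-value increased)."""
    means = responses.groupby("p_value")["confidence"].mean().reindex(P_LEVELS)
    drops = -means.diff().dropna()
    # The one additional, non-consecutive comparison used in the text above:
    drops.loc["0.04 to 0.06"] = means.loc[0.04] - means.loc[0.06]
    return drops

# Hypothetical usage, filtering to one presentation type at a time:
# drops = confidence_drops(df[df["visualization"] == "p_value_text"])
# print(drops.sort_values(ascending=False))
```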
MEDICINE
There are relatively few papers directly testing the knowledge of medical practitioners and researchers. Five papers were found: one testing belief in the false claim that hypothesis testing can prove a hypothesis is true, one exploring general NHST misinterpretations, another exploring both general NHST and confidence interval misinterpretations, and two investigating the prevalence and moderators of the cliff effect.
Authors & year | Article title | Category | Subjects | Primary findings |
---|---|---|---|---|
Vallecillos (2000) | Understanding of the Logic of Hypothesis Testing Amongst University Students [link] | NHST misinterpretations | Medicine students (n=61) | 1. When shown a statement claiming NHST can prove the truth of a hypothesis, 26% of medicine students incorrectly marked the statement as true. Eight medicine students who had correctly answered the statement also provided a correct written explanation of their reasoning. |
McShane & Gal (2015) | Supplement to “Blinding Us to the Obvious? The Effect of Statistical Training on the Evaluation of Evidence” [link] | Cliff effect | Journal of Epidemiology authors (n=261) | 1. A cliff effect was found between p-values of 0.025 and 0.075. |
Lyu et al. (2018) | P-Value, Confidence Intervals, and Statistical Inference: A New Dataset of Misinterpretation [link] | NHST misinterpretations | Psychology students and researchers in one of four medically related subfields (n=137) | 1. 100% of undergraduate students demonstrated at least one NHST misinterpretation. 2. 97% of masters students demonstrated at least one NHST misinterpretation. 3. 100% of PhD students demonstrated at least one NHST misinterpretation. 4. 100% of subjects with a PhD demonstrated at least one NHST misinterpretation. |
Lyu et al. (2020) | Beyond psychology: prevalence of p value and confidence interval misinterpretation across different fields [link] | NHST misinterpretations | Medical undergraduate students (n=19), medical masters students (n=69), medical PhD students (n=24), medical subjects with a PhD (n=18) | 1. 79% of undergraduate students demonstrated at least one NHST misinterpretation. 2. 96% of masters students demonstrated at least one NHST misinterpretation. 3. 96% of PhD students demonstrated at least one NHST misinterpretation. 4. 94% of subjects with a PhD demonstrated at least one NHST misinterpretation. |
Helske et al. (2020) | Are You Sure You’re Sure? - Effects of Visual Representation on the Cliff Effect in Statistical Inference [link] | Cliff effect | Medical practitioners and researchers (n=16) | 1. A cliff effect was found between p-values of 0.04 and 0.06. |
Questionnaires of NHST knowledge
The majority of direct surveys of null hypothesis significance testing (NHST) knowledge occur in the field of psychology. Only three studies were found that tested the knowledge of medical researchers.
In 1987 Henrik Wulff, Björn Andersen, Preben Brandenhoff, and Flemming Guttler surveyed Danish doctors and medical students. Initially 250 Danish doctors were randomly sampled from a list of all Danish doctors and sent the survey instrument described below; in the end 148 responded. In addition, 97 medical students in an introductory biostatistics course were given the survey.
The survey was 11 questions long, with the first question asking about the subject’s self-evaluated statistical knowledge and the last question asking about the subject’s perception of the survey’s usefulness. The other nine questions were statistical in nature. Among the doctors no respondent had more than seven correct answers. The median number of correct answers was just 2.4. Even those doctors who selected Option A for Question 1, “I understand all the [statistical] expressions,” scored just 4.1 out of nine. The students scored better than the doctors; their median number of correct answers was 4.0. Two students answered all nine questions correctly. The distribution of correct answers by doctors and students is shown at right.
For doctors, there were four questions for which the correct answer was also the most selected response. Those were Questions 4, 5, 7, and 9. Students also had four such questions: Questions 2, 4, 7, and 8. For Questions 4 and 7 a plurality of both populations selected the correct answer. Question 4 asked about using the standard error to create a 95% confidence interval, while Question 7 asked about the correct interpretation of a p-value. Question 7, in fact, had the highest correct response rate for students, with 67% answering correctly. For doctors it was Question 9, which asked about interpreting “statistical significance,” with 41% answering correctly. This means that even for the most correctly answered question 6 out of 10 doctors still answered incorrectly.
Question 3 had the fewest doctors selecting the correct response, 8%; for students Question 5 had the fewest correct responses, 27%. Question 3 asked about the correct usage of the standard deviation; Question 5 asked about the correct usage of the standard error.
On a per-question average, doctors selected the correct response 28.7% of the time. Students fared much better, at 41.9%. A comparison with the 1988 replication involving dentists and dental students is discussed below in the section on Scheutz, Andersen, and Wulff (1988).
The survey instrument is shown below along with the corresponding percentage of doctors and students selecting each statement response. Correct answers are highlighted in green. A discussion of each statement and its correct answer is provided below the table.
Statements & responses | Percent of doctors selecting statement | Percent of students selecting statement |
---|---|---|
1. Which of the following statements reflects your attitude to the most common statistical expressions in medical literature, such as SD, SE, p-values, confidence limits and correlation coefficients? | ||
a. I understand all the expressions. | 20% | 10% |
b. I understand some of the expressions. | 35% | 51% |
c. I have a rough idea of the meaning of these expressions. | 22% | 32% |
d. I know vaguely what it is all about, but not more. | 17% | 7% |
e. I do not understand the expressions. | 6% | 0% |
2. In a medical paper 150 patients were characterized as ‘Age 26 years ± 5 years (mean ± standard deviation)’. Which of the following statements is the most correct? | ||
a. It is 95 per cent certain that the true mean lies within the interval 16-36 years. | 26% | 30% |
b. Most of the patients were aged 26 years; the remainder were aged between 21 and 31 years. | 38% | 13% |
c. Approximately 95 per cent of the patients were aged between 16 and 36 years. | 30% | 51% |
d. I do not understand the expression and do not want to guess. | 6% | 6% |
3. A standard deviation has something to do with the so-called normal distribution and must be interpreted with caution. Which statement is the most correct? | ||
a. My interpretation assumes a normal distribution. However, biological data are rarely distributed normally, for which reason expressions of this kind usually elude interpretation. | 8% | 29% |
b. My interpretation presupposes a normal distribution, but in practice this assumption is fulfilled in biological research. | 23% | 10% |
c. My interpretation presupposes a normal distribution, but this assumption is fulfilled when the number of patients is as large as 150. | 37% | 23% |
d. Such expressions are used only when research workers have assured themselves that the assumption is fulfilled. | 20% | 35% |
e. I know nothing about the normal distribution and do not want to guess. | 12% | 3% |
4. A pharmacokinetic investigation, including 216 volunteers, revealed that the plasma concentration one hour after oral administration of 10 mg of the drug was 188 ng/ml ± 10 ng/ml (mean ± standard error). Which of the following statements do you prefer? | ||
a. Ninety-five per cent of the volunteers had plasma concentrations between 168 and 208 ng/ml. | 27% | 28% |
b. The interval from 168 to 208 ng/ml is the normal range of the plasma concentration 1 hour after oral administration. | 20% | 7% |
c. We are 95 per cent confident that the true mean lies somewhere within the interval 168 to 208 ng/ml. | 39% | 55% |
d. I do not understand the expression and do not wish to guess. | 14% | 10% |
5. A standard error has something to do with the so-called normal distribution and must be interpreted with caution. Which statement is the most correct? | ||
a. My interpretation presupposes a normal distribution. However, biological data are rarely distributed normally, and this is why expressions of this kind cannot usually be interpreted sensibly. | 5% | 15% |
b. My interpretation presupposes a normal distribution, but in practice this assumption is fulfilled in biological research. | 20% | 14% |
c. My interpretation presupposes a normal distribution, but this assumption is fulfilled when the number of patients is so large. | 38% | 27% |
d. Such expressions are only used when research workers have assured themselves that the assumption is fulfilled. | 19% | 35% |
e. I know nothing about the normal distribution and do not want to guess. | 18% | 9% |
6. A controlled trial of a new treatment led to the conclusion that it is significantly better than placebo: p < 0.05. Which of the following statements do you prefer? | ||
a. It has been proved that the treatment is better than placebo. | 20% | 6% |
b. If the treatment is not effective, there is less than a 5 per cent chance of obtaining such results. | 13% | 39% |
c. The observed effect of the treatment is so large that there is less than a 5 per cent chance that the treatment is no better than placebo. | 51% | 54% |
d. I do not really know what a p-value is and do not want to guess. | 16% | 1% |
7. A research team wishes to examine whether or not the ingestion of licorice decreases the plasma concentration of magnesium. Twenty-three volunteers ingest a considerable amount of licorice, and no significant change in the serum magnesium is found (p > 0.05). Which of the following statements do you prefer? | ||
a. There is more than a 5 per cent chance of obtaining such results if licorice does not decrease the serum magnesium. | 39% | 67% |
b. There is only a small probability of obtaining such results if licorice does decrease the serum magnesium. | 29% | 26% |
c. The research workers ought to have studied more volunteers, then the difference would have become significant. | 13% | 5% |
d. I do not know what p-values are and do not want to guess. | 18% | 2% |
8. A new drug was tested independently in two randomized controlled trials. The trials appeared comparable and comprised the same number of patients. One trial led to the conclusion that the drug was effective (p < 0.05), whereas the other trial led to the conclusion that the drug was ineffective (p > 0.05). The actual p-values were 0.041 and 0.097. Which of the following interpretations do you prefer? | ||
a. The first trial gave a false-positive result. | 2% | 6% |
b. The second trial gave a false-negative result. | 3% | 3% |
c. Obviously, the trials were not comparable after all. | 41% | 34% |
d. One must not attach too much importance to small differences between p-values. | 34% | 43% |
e. I do not understand the problem and do not wish to guess. | 20% | 14% |
9. Patients with ischaemic heart disease and healthy subjects are compared in a population survey of 20 environmental factors. A statistically significant association is found between ischaemic heart disease and one of those factors. Which of the following interpretations do you prefer? | ||
a. The association is true as it is statistically significant. | 2% | 6% |
b. This is no doubt a false-positive result. | 3% | 3% |
c. The result is not conclusive but might inspire a new investigation of this particular problem. | 41% | 34% |
d. I do not understand the question and do not wish to guess. | 34% | 43% |
10. In a methodologically impeccable investigation of the correlation between the plasma concentration and the effect of a drug it is concluded that r = + 0.41, p < 0.001, N = 83. Which of the following answers do you prefer? | ||
a. There is a strong correlation between concentration and effect. | 22% | 17% |
b. There is only a weak correlation between concentration and effect. | 16% | 32% |
c. I am not able to interpret the expressions and do not wish to guess. | 62% | 51% |
11. What is your opinion of this survey? | ||
a. It is very important that this problem is raised. | 65% | 80% |
b. I do not think that the problem is very important, but it may be reasonable to take it up. | 27% | 9% |
c. The problem is unimportant and the survey is largely a waste of time. | 8% | 11% |
Table notes:
1. Correct answers highlighted in green.
2. Reference: "What do doctors know about statistics", Henrik Wulff, Björn Andersen, Preben Brandenhoff, and Flemming Guttler, Statistics in Medicine, 1987 [link]
Question 1 asked about self-reported understanding of a set of statistical concepts. The most frequent response was that the respondent understood “some” of the statistical terms.
Question 2 asked about the correct interpretation of patient age characterized by “Age 26 years ± 5 years (mean ± standard deviation).” The correct response was Option C, “Approximately 95 per cent of the patients were aged between 16 and 36 years,” which 30% of doctors and 51% of students selected. However, as outlined in the Question 3 summary below, to interpret standard deviations in this way the population data must be normally distributed. Somewhat curiously, age is not an attribute that typically follows a normal distribution, and it is unclear why age was chosen for this question. Among doctors, Option B was the most selected at 38%. It stated that, “Most of the patients were aged 26 years; the remainder were aged between 21 and 31 years.” This is incorrect: 26 is simply the mean age of the observed patients, and it is unclear from the given context how many were exactly that age. The age interval given in the next part of Option B, 21 to 31 years, would correspond to the 1-sigma rule, meaning about 68% of patients would fall within this range (again, assuming age is normally distributed).
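For reference, the 68% and 95% figures invoked above follow directly from the normal distribution; the quick check below (which of course assumes normality, the very assumption in question for age) reproduces them.

```python
from scipy.stats import norm

# Fraction of a normal distribution within ±1 and ±2 standard deviations
# of the mean (the 1-sigma and 2-sigma rules referenced above).
print(f"within ±1 SD: {norm.cdf(1) - norm.cdf(-1):.3f}")   # ~0.683
print(f"within ±2 SD: {norm.cdf(2) - norm.cdf(-2):.3f}")   # ~0.954

# Applied to the question's figures (mean 26, SD 5), and only under the
# normality assumption: roughly 68% of patients aged 21-31 and roughly 95%
# aged 16-36.
```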
Question 3 asks about the standard deviation, noting that it has something to do with the “normal distribution.” In fact, the standard deviation can be calculated regardless of the distribution of the data. However, when the data come from a normal distribution the interpretation is tractable, following well-known statistical properties (for example the proportion of the data falling within various standard deviation intervals). For this reason the authors’ preferred response is Option A, “My interpretation assumes a normal distribution. However, biological data are rarely distributed normally, for which reason expressions of this kind usually elude interpretation.” Just 8% of subjects responded in this way. It’s possible other answers could make sense with certain assumptions. For instance, Option B: “My interpretation presupposes a normal distribution, but in practice this assumption is fulfilled in biological research.” If the doctors who were surveyed are specialists, it may be that their area of expertise is one for which the normal distribution applies, for example research on height or weight. However, if the doctors who were surveyed are general practitioners, and therefore are called upon to broadly interpret research across aspects of the human body, then it may be true that the majority of studies cannot assume a normal distribution. While we are not medical specialists, Wulff et al. are; in their explanation of Question 3 they simply note that, “…biological phenomena are rarely distributed [normally]….” For this reason Option A does seem like the most correct of those provided. The most selected answer among doctors, Option C with 37%, was incorrect. Option C stated that “My interpretation presupposes a normal distribution, but this assumption is fulfilled when the number of patients is as large as 150.” This is a confusion of the application of the Central Limit Theorem, which holds for measures of the sampling distribution of the sample mean, for example the standard error, but not for measures of population dispersion such as the standard deviation.
Question 4 asked about the correct interpretation of plasma concentration characterized by a concentration of “188 ng/ml ± 10 ng/ml (mean ± standard error)” one hour after oral administration of a drug. The correct answer was Option C, “We are 95 per cent confident that the true mean lies somewhere within the interval 168 to 208 ng/ml,” which was selected by 39% of doctors and 55% of students. The standard error is the standard deviation of a sampling distribution, usually applied to the so-called “sampling distribution of the sample mean,” in which case it is called “the standard error of the mean.” The standard error of the mean is used to construct the familiar 95% confidence interval using the 2-sigma rule: roughly 95% of sample means fall within two standard errors of the true mean. The 2-sigma rule can be used because the sampling distribution of the sample mean is approximately normal due to the Central Limit Theorem. The standard deviation from Question 2 answers the question, “How spread out are the ages of the patients I surveyed?” The standard error in Question 4 answers the question, “How spread out are my estimates of the sample mean (used to estimate the population mean)?” This distinction is why Option A is incorrect: “Ninety-five per cent of the volunteers had plasma concentrations between 168 and 208 ng/ml” refers to the spread of individual observations, that is, to the standard deviation. Nonetheless, more than a quarter of respondents in both populations selected Option A.
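The arithmetic behind Option C is worth spelling out. The sketch below simply applies the 2-sigma rule to the numbers given in the question and, for contrast, backs out how large the spread of individual measurements (Option A's reading) would have to be.

```python
import math

# Numbers taken directly from Question 4.
mean, sem, n = 188, 10, 216

# Approximate 95% confidence interval for the population mean (2-sigma rule).
print(f"95% CI for the mean: {mean - 2 * sem} to {mean + 2 * sem} ng/ml")   # 168 to 208

# The standard error is the sample standard deviation divided by sqrt(n), so
# the implied spread of individual measurements is far larger than 10 ng/ml.
# An interval meant to cover ~95% of individual volunteers (Option A's
# reading) would therefore be much wider than 168-208 ng/ml.
print(f"implied sample SD: about {sem * math.sqrt(n):.0f} ng/ml")
```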
Question 5 mirrors Question 3, but asks about the standard error, noting that it has something to do with the “normal distribution.” The correct response was Option C, “My interpretation presupposes a normal distribution, but this assumption is fulfilled when the number of patients is so large.” Option C was selected by 38% of doctors, the modal option for this question, and 27% of students. Option A is incorrect because, as stated in the Question 4 description, the normality of the sampling distribution is guaranteed by the Central Limit Theorem. One caveat is that the sample size needs to be sufficiently large, the condition met in Option C; the usual rule of thumb is a sample size of 30. Option B, “My interpretation presupposes a normal distribution, but in practice this assumption is fulfilled in biological research,” is incorrect because biological research may include studies in which the sample size is too small to guarantee the Central Limit Theorem applies. Option D was selected by the plurality of students, 35%. Option D read, “Such expressions are only used when research workers have assured themselves that the assumption is fulfilled.” What “the assumption” means is not specified and it is therefore hard to judge this statement as incorrect. How are we to know if a subject interpreted “the assumption” to mean the sample size assumption? Still, one could argue that in comparison to the explicit statement about the sample size assumption in Option C, Option D is less correct.
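A small simulation, an illustrative sketch with an arbitrary and deliberately skewed variable, shows why the Central Limit Theorem rescues the standard error but not the standard deviation: means of moderately sized samples are close to normally distributed even when the raw measurements remain heavily skewed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# A strongly skewed "biological" variable, chosen arbitrarily for illustration.
population = rng.exponential(scale=1.0, size=200_000)

# Sample means for samples of size 150 (the sample size in Question 3).
sample_means = rng.choice(population, size=(10_000, 150)).mean(axis=1)

print("skewness of raw measurements:", round(stats.skew(population), 2))    # ~2
print("skewness of sample means:    ", round(stats.skew(sample_means), 2))  # close to 0
```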
Question 6 regarded p-values: “A controlled trial of a new treatment led to the conclusion that it is significantly better than placebo: p < 0.05. Which of the following statements do you prefer?” The correct answer was Option B, “If the treatment is not effective, there is less than a 5 per cent chance of obtaining such results.” However, this was selected by just 13% of doctors and 39% of students. Option B is a restatement of the p-value definition, which can also be written as, “The probability of obtaining the observed data or more extreme values assuming the null hypothesis is true.” Option A was selected by 20% of doctors and read, “It has been proved that the treatment is better than placebo.” Option A has been investigated [cite other articles with this interpretation]. Option C was selected by a majority of both doctors (51%) and students (54%). It read, “The observed effect of the treatment is so large that there is less than a 5 per cent chance that the treatment is no better than placebo.” This is a version of the Effect Size Fallacy: the p-value does not directly measure the size of the treatment effect.
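The Effect Size Fallacy is easy to demonstrate from summary statistics alone; the sketch below uses made-up numbers of our own. The identical observed difference produces very different p-values depending only on how many patients were enrolled.

```python
from scipy import stats

# The same observed effect (a mean difference of 0.3 standard deviations)
# evaluated at two different sample sizes.
for n in (20, 500):
    result = stats.ttest_ind_from_stats(mean1=0.3, std1=1.0, nobs1=n,
                                        mean2=0.0, std2=1.0, nobs2=n)
    print(f"n per group = {n:3d}: p = {result.pvalue:.4f}")
# The effect is identical in both rows; only the sample size, and therefore
# the p-value, changes.
```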
Question 7 was similar to Question 6, but asked about a p-value greater than 0.05 obtained in a study investigating whether licorice ingestion lowers the concentration of magnesium in plasma. Option A was correct, “There is more than a 5 per cent chance of obtaining such results if licorice does not decrease the serum magnesium.” This answer was the most selected by doctors, 39%, and students, 67%. Again, the correct answer follows from the definition of the p-value.
Question 8 regarded two randomized trials of a new drug in which one was statistically significant and the other statistically nonsignificant. The actual p-values were 0.041 and 0.097. The correct answer was Option D, “One must not attach too much importance to small differences between p-values.” This option was selected by 34% of doctors and 43% of students. As we have shown, p-values have substantial natural variation regardless of the truth or falsity of the null hypothesis. Option C — “Obviously, the trials were not comparable after all” — was selected by 41% of doctors. This is incorrect, however. The trials may or may not be comparable given their underlying methodologies, but the comparability of the two trials cannot be determined from their p-values.
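The point about natural variation can be illustrated with a simulation, our own sketch with arbitrary parameters: identically designed trials of a drug with the same modest true effect routinely land on opposite sides of the 0.05 line.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Many repetitions of the same trial design with the same true drug effect.
n, true_effect = 60, 0.3
p_values = []
for _ in range(10_000):
    drug = rng.normal(true_effect, 1.0, n)
    placebo = rng.normal(0.0, 1.0, n)
    p_values.append(stats.ttest_ind(drug, placebo).pvalue)

p_values = np.array(p_values)
print("share of trials with p < 0.05:        ", (p_values < 0.05).mean())
print("share of trials with 0.05 <= p < 0.10:", ((p_values >= 0.05) & (p_values < 0.10)).mean())
# Identical trials frequently fall on opposite sides of 0.05, so a 0.041
# versus 0.097 split says little about whether two trials were comparable.
```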
Question 9 attempted to determine the impact of multiple comparisons. It stated that “A statistically significant association is found between ischaemic heart disease and one of those [20 environmental] factors [that was tested].” Given 20 independent hypothesis tests, one for each environmental factor, on average one will reach statistical significance by chance alone even if the null hypothesis is true for all 20. Therefore, the correct response is Option C, “The result is not conclusive but might inspire a new investigation of this particular problem.” Option C was selected by 41% of doctors and 34% of students. Along with Question 10, Question 9 had the largest proportion of students responding that, “I do not understand the question and do not wish to guess.”
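The arithmetic behind this concern is simple; the calculation below assumes 20 independent tests at the 5% level with every null hypothesis true.

```python
# Probability of at least one "significant" result across 20 independent
# tests when every null hypothesis is true and alpha = 0.05.
alpha, k = 0.05, 20
print(f"P(at least one false positive) = {1 - (1 - alpha) ** k:.2f}")  # ~0.64
print(f"Expected number of false positives = {alpha * k:.1f}")         # 1.0
```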
Question 10 considered the interpretation of a correlation. It read: “In a methodologically impeccable investigation of the correlation between the plasma concentration and the effect of a drug it is concluded that r = + 0.41, p < 0.001, N = 83.” The majority of both populations, 62% of doctors and 51% of students, indicated that, “I am not able to interpret the expressions and do not wish to guess.” This was the highest such rate of any question. Option A indicated that the correlation was strong, “There is a strong correlation between concentration and effect.” Option A was selected by 22% of doctors and 17% of students. However, the standard r-squared metric gives a value of 0.17, typically considered small. In a bivariate correlation r-squared is simply the square of ‘r’, the correlation coefficient, which was given as 0.41 in this scenario. The interpretation of r-squared here is that about 17% of the variation in the drug’s effect can be explained by plasma concentration (and vice versa, since the correlation is symmetric). Some assumptions are needed for that interpretation to be true, which is likely why the question includes the statement, “In a methodologically impeccable investigation…” The correct answer is therefore Option B, “There is only a weak correlation between concentration and effect,” which was selected by 16% of doctors and 32% of students.
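The r-squared figure is just the square of the reported correlation; the snippet below computes it and, for intuition, simulates data with that degree of correlation.

```python
import numpy as np

r = 0.41                       # correlation coefficient reported in Question 10
print(f"r^2 = {r ** 2:.2f}")   # ~0.17: about 17% of the variance is shared

# A quick feel for how weak r = 0.41 is: simulate bivariate normal data with
# that correlation and confirm the sample correlation.
rng = np.random.default_rng(5)
x = rng.normal(size=5_000)
y = r * x + np.sqrt(1 - r ** 2) * rng.normal(size=5_000)
print(f"simulated sample correlation: {np.corrcoef(x, y)[0, 1]:.2f}")
```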
Question 11 asked respondents’ opinion of the survey. Most respondents believed the project of statistical evaluation and education had some merit. Just 8% of doctors and 11% of students felt the survey was a waste of time.
The authors conclude that the lack of basic statistical knowledge of the respondents indicates a serious problem for the medical profession as doctors are required to interpret new and existing medical research in order to best provide care to patients.
In 1988 the Wulff et al. (1987) survey was replicated with a group of dentists and dental students. Two of the authors were the same as in the 1987 study of Danish doctors, Henrik Wulff and Björn Andersen, and one was new, Flemming Scheutz, who at the time was associated with the Royal Dental College. Initially 250 Danish dentists were randomly sampled from a list of all Danish dentists and sent the survey instrument described below; in the end 125 responded, fewer than the 148 doctors who responded to the original survey. In addition, 27 dental students in an introductory statistics course were given the survey, a substantially smaller sample than the 97 medical students who participated in the study of doctors.
The survey used was identical to that in the 1987 study of doctors. Among the dentists no respondent had more than six correct answers. The median number of correct answers was just 2.2, even lower than the 2.4 median correct responses of doctors. Even those dentists who selected Option A for Question 1, “I understand all the [statistical] expressions,” scored just 4.5 out of nine. Students scored better than dentists; their median number of correct answers was 3.4, which was lower than the 4.0 of medical students from the 1987 study. No dental student answered more than six questions correctly (recall that out of the 97 medical students, two answered all nine questions correctly). The distribution of correct answers by dentists and students is shown at right.
For dentists, there was just one question for which the correct answer was also the most selected response, Question 7. Question 7 asked about the correct interpretation of a p-value. Still, only 32% of dentists answered correctly. More dentists answered Question 9 correctly, 39%, but the correct response was not the most selected; 46% of dentists incorrectly selected Option A (the correct answer was Option C).
Dental students, on the other hand, had four questions for which the correct answer was also the most selected response: Questions 2, 4, 5, and 9. Of these, Question 2 had the highest correct response rate, 59%. Question 2 asked about the correct interpretation of the standard deviation. Questions 4 and 5 asked about the correct interpretation of the standard error. Question 9 asked about the multiple comparison problem when using statistical significance.
Compared to doctors, dentists performed more poorly: for doctors there were four questions for which the correct answer was also the most selected response, for dentists only one. Dental students were roughly comparable to medical students; both groups had four questions for which the correct answer was also the most selected response. The highest correct response rate for medical students was 67% (Question 7); for dental students it was 59% (Question 2).
Question 10 had the fewest dentists selecting the correct response, 6%. However, Questions 3 and 6 were a close second and third, with just 7% and 8%, respectively, of dentists selecting the correct response. Question 3 asked about the correct interpretation of the standard deviation, Question 6 asked about the correct interpretation of the p-value, and Question 10 asked about the magnitude of a correlation effect. For dental students, the fewest correct responses came on Question 3, with 4%; the next lowest correct response rate was Question 8, with 22%.
On a per-question average dentists selected the correct response 22.4% of the time, while dental students fared better at 33.3%. When ranking correct response rates of the four populations included in both the 1987 and 1988 studies, medical students were the best (41.9%), followed by dental students (33.3%), doctors (28.7%), and dentists (22.4%).
The survey instrument is shown below along with the corresponding percentage of dentists and dental students selecting each statement response. Correct answers are highlighted in green. For a full explanation of the correct and incorrect answers for each question see the section above discussing the results for doctors from Wulff et al. (1987).
Statements & responses | Percent of dentists selecting statement | Percent of dental students selecting statement |
---|---|---|
1. Which of the following statements reflects your attitude to the most common statistical expressions in medical literature, such as SD, SE, p-values, confidence limits and correlation coefficients? | ||
a. I understand all the expressions. | 6% | 4% |
b. I understand some of the expressions. | 26% | 41% |
c. I have a rough idea of the meaning of these expressions. | 23% | 26% |
d. I know vaguely what it is all about, but not more. | 31% | 30% |
e. I do not understand the expressions. | 14% | 0% |
2. In a medical paper 150 patients were characterized as ‘Age 26 yr ± 5 yr (mean ± standard deviation)’. Which of the following statements is the most correct? | ||
a. It is 95% certain that the true mean lies within the interval 16-36 years. | 13% | 7% |
b. Most of the patients were aged 26 yr; the remainder were aged between 21 and 31 yr. | 41% | 19% |
c. Approximately 95% of the patients were aged between 16 and 36 yr. | 34% | 59% |
d. I do not understand the expression and do not want to guess. | 12% | 15% |
3. A standard deviation has something to do with the so-called normal distribution and must be interpreted with caution. Which statement is the most correct? | ||
a. My interpretation assumes a normal distribution. However, biological data are rarely distributed normally, for which reason expressions of this kind usually elude interpretation. | 7% | 4% |
b. My interpretation presupposes a normal distribution, but in practice this assumption is fulfilled in biological research. | 14% | 19% |
c. My interpretation presupposes a normal distribution, but this assumption is fulfilled when the number of patients is as large as 150. | 28% | 37% |
d. Such expressions are used only when research workers have assured themselves that the assumption is fulfilled. | 29% | 19% |
e. I know nothing about the normal distribution and do not want to guess. | 22% | 22% |
4. A pharmacokinetic investigation, including 216 volunteers, revealed that the plasma concentration 1 hour after oral administration of 10 mg of the drug was 188 ng/ml ± 10 ng/ml (mean ± standard error). Which of the following statements do you prefer? | ||
a. Ninety-five percent of the volunteers had plasma concentrations between 168 and 208 ng/ml. | 21% | 15% |
b. The interval from 168 to 208 ng/ml is the normal range of the plasma concentration 1 hour after oral administration. | 22% | 19% |
c. We are 95% confident that the true mean lies somewhere within the interval 168 to 208 ng/ml. | 24% | 41% |
d. I do not understand the expression and do not wish to guess. | 34% | 26% |
5. A standard error has something to do with the so-called normal distribution and must be interpreted with caution. Which statement is the most correct? | ||
a. My interpretation presupposes a normal distribution. However, biological data are rarely distributed normally, and this is why expressions of this kind cannot usually be interpreted sensibly. | 7% | 7% |
b. My interpretation presupposes a normal distribution, but in practice this assumption is fulfilled in biological research. | 10% | 11% |
c. My interpretation presupposes a normal distribution, but this assumption is fulfilled when the number of patients is so large. | 26% | 41% |
d. Such expressions are only used when research workers have assured themselves that the assumption is fulfilled. | 21% | 22% |
e. I know nothing about the normal distribution and do not want to guess. | 35% | 19% |
6. A controlled trial of a new treatment led to the conclusion that it is significantly better than placebo: p < 0.05. Which of the following statements do you prefer? | ||
a. It has been proved that the treatment is better than placebo. | 18% | 4% |
b. If the treatment is not effective, there is less than a 5% chance of obtaining such results. | 8% | 33% |
c. The observed effect of the treatment is so large that there is less than a 5% chance that the treatment is no better than placebo. | 50% | 56% |
d. I do not really know what a p-value is and do not want to guess. | 25% | 7% |
7. A research team wishes to examine whether or not the ingestion of licorice decreases the plasma concentration of magnesium. Twenty-three volunteers ingest a considerable amount of licorice, and no significant change in the serum magnesium is found (p > 0.05). Which of the following statements do you prefer? | ||
a. There is more than a 5% chance of obtaining such results if licorice does not decrease the serum magnesium. | 32% | 30% |
b. There is only a small probability of obtaining such results if licorice does decrease the serum magnesium. | 24% | 48% |
c. The research workers ought to have studied more volunteers, then the difference would have become significant. | 16% | 19% |
d. I do not know what p-values are and do not want to guess. | 28% | 4% |
8. A new drug was tested independently in two randomized controlled trials. The trials appeared comparable and comprised the same number of patients. One trial led to the conclusion that the drug was effective (p < 0.05), whereas the other trial led to the conclusion that the drug was ineffective (p > 0.05). The actual p-values were 0.041 and 0.097. Which of the following interpretations do you prefer? | ||
a. The first trial gave a false-positive result. | 6% | 4% |
b. The second trial gave a false-negative result. | 1% | 4% |
c. Obviously, the trials were not comparable after all. | 34% | 52% |
d. One must not attach too much importance to small differences between p-values. | 26% | 22% |
e. I do not understand the problem and do not wish to guess. | 34% | 19% |
9. Patients with ischaemic heart disease and healthy subjects are compared in a population survey of 20 environmental factors. A statistically significant association is found between ischaemic heart disease and one of these factors. Which of the following interpretations do you prefer? | ||
a. The association is true as it is statistically significant. | 46% | 11% |
b. This is no doubt a false-positive result. | 1% | 7% |
c. The result is not conclusive but might inspire a new investigation of this particular problem. | 39% | 44% |
d. I do not understand the question and do not wish to guess. | 14% | 37% |
10. In a methodologically impeccable investigation of the correlation between the plasma concentration and the effect of a drug it is concluded that r = + 0.41, p < 0.001, n = 83. Which of the following answers do you prefer? | ||
a. There is a strong correlation between concentration and effect. | 21% | 26% |
b. There is only a weak correlation between concentration and effect. | 6% | 26% |
c. I am not able to interpret the expressions and do not wish to guess. | 73% | 48% |
11. What is your opinion of this survey? | ||
a. It is very important that this problem is raised. | 35% | 45% |
b. I do not think that the problem is very important, but it may be reasonable to take it up. | 52% | 41% |
c. The problem is unimportant and the survey is largely a waste of time. | 13% | 14% |
Table notes:
1. Correct answers highlighted in green.
2. Reference: "What do doctors know about statistics", Flemming Guttler, Björn Andersen, and Henrik Wulff, Scandinavian Journal of Dental Research, 1988 [link]
During the 1991-1992 academic year researcher Augustias Vallecillos asked 436 university students across seven different academic specializations to respond to a simple NHST statement. This survey included 61 students in the field of medicine. It is unclear how many universities were included or where they were located. The results were written up in Vallecillos’ 1994 Spanish-language paper, “Estudio teorico-experimental de errores y concepciones sobre el contraste estadistico de hipotesis en estudiantes universitarios.” The results appeared again in his 2000 English-language article, “Understanding of the Logic of Hypothesis Testing Amongst University Students.” What is presented here is from the 2000 work.
Vallecillos’ statement was a short sentence asking about the ability of the NHST procedure to prove either the null or alternative hypotheses:
A statistical hypotheses test, when properly performed, establishes the truth of one of the two null or alternative hypotheses.
University speciality | Sample size | Correct answer | Incorrect answer |
---|---|---|---|
Medicine | 61 | 69% | 26% |
Table notes:
1. The exact number of respondents coded under each category were as follows: true - 16, false - 42, blank - 3 (4.9%).
2. Reference: "Understanding of the Logic of Hypothesis Testing Amongst University Students", Augustias Vallecillos, Journal für Mathematik-Didaktik, 2000 [link]
Students were asked to answer either “true” or “false” and to explain their answer (although an explanation was not required). The correct answer to the statement is false because NHST only measures the compatibility between the observed data and a null hypothesis; it cannot prove either hypothesis true. In addition, the alternative hypothesis is not explicitly considered in the NHST model, nor is the compatibility of the null hypothesis considered relative to the alternative.
The quantitative results for the medicine students are shown in the table above. Correct and incorrect answers do not add up to 100% because some students left the response blank. Vallecillos includes the percentage of blank responses in his presentation (we have omitted those figures from the table above for clarity). It is unclear why blank responses were included in Vallecillos’ table instead of treating blanks as non-responses and omitting them completely from the response calculation. It may be that some subjects did not give a true/false response but did give a written response; however, this is not explicitly stated.
Medicine students had the second highest proportion of correct responses, 69%, just behind mathematics students at 71%. As described in more detail below, medicine students also had the largest proportion of correct written explanations.
Vallecillos coded the written explanation of student answers into one of six categories:
Correct argument (C) - These responses are considered to be completely correct.
Example response: “The hypotheses test is based on inferring properties from the population based on some sample data. The result means that one of the two hypotheses is accepted, but does not mean that it is true.”
Partially correct argument (PC) - These responses are considered to be partially, but not completely correct because “answers analysed include other considerations regarding the way of taking the decision, which are not always correct.”
Example response: “What it does establish is the acceptance or rejection of one of the two hypotheses.”
Mistaken argument that NHST establishes the truth of a hypothesis (M1) - These responses include explanations about why the initial statement proposed by Vallecillos was true.
Example response: “Because before posing the problem, we have to establish which one is the null hypothesis H0 and the alternative H1 and one of the two has to be true.”
Mistaken argument that hypothesis testing establishes the probability of the hypotheses (M2) - This argument is another case of the Inverse Probability Fallacy. As the results of other studies summarized in this article show it is quite common among both students and professionals.
Example response: “What is established is the probability, with a margin of error, that one of the hypotheses is true.”
Other mistaken arguments (M3) - This category includes all other mistaken arguments that do not fall into either M1 or M2.
Example response: “What it establishes is the possibility that the answer formed is the correct one.”
Arguments that are difficult to interpret (DI) - These arguments were either not interpretable or did not address the subject’s reasoning behind answering the statement.
Example response: “The statistical hypotheses test is conditioned by the size of the sample and the level of significance.”
Not all of the respondents to the statement gave a written explanation: 54 of the 61 medicine students (88%) did. Summary results are shown in the table below. Percentages are out of the number who gave written explanations, not out of the number who responded to the original statement. Medicine students had the largest proportion of correct written explanations both in percentage terms (15%) and in absolute terms (8 students). However, medicine students had the second lowest proportion of partially correct explanations. Still, the combined proportion of correct and partially correct written explanations was about 33%, the third largest proportion after mathematics and business students.
Mistake M1 was the most common of the three mistake categories. Medicine students had 15% of written explanations categorized as DI, or “difficult to interpret,” the second largest proportion, just behind business students.
[compare to other types of misinterpretations]
Vallecillos notes that when considering the full sample of all 436 students, 9.7% of those who correctly answered the statement also provided a correct written explanation and 31.9% of the students who correctly answered the statement gave a partially correct written explanation. This means that across the full sample about 60% of students who correctly answered the statement did so for incorrect reasons or were not able to clearly articulate their reasoning.
University speciality | Number of subjects who provided written explanations | C | PC | M1 | M2 | M3 | DI |
---|---|---|---|---|---|---|---|
Medicine | 54 | 15% | 19% | 28% | 17% | 7% | 15% |
Table notes:
1. Percentages have been rounded for clarity and may not add to 100%.
2. Key: C - "Correct", PC - "Partially correct", M1 - "Mistake 1", M2 - "Mistake 2", M3 - "Mistake 3", DI - "Difficult to interpret". See full explanations in description above the table.
3. The exact number of respondents coded under each category were as follows: C - 8, PC - 10, M1 - 15, M2 - 9, M3 - 4, DI - 8.
4. Reference: "Understanding of the Logic of Hypothesis Testing Amongst University Students", Augustias Vallecillos, Journal für Mathematik-Didaktik, 2000 [link]
In addition, seven second-year medicine students were interviewed during the 1992-1993 school year. However, the summary of these interviews provided in Vallecillos (2000) is difficult to understand.
In 2018 Ziyang Lyu, Kaiping Peng, and Chuan-Peng Hu surveyed 346 psychology students and researchers in China using the same wording as in Haller and Krauss (2002), translated into Chinese. The authors also corrected the degrees of freedom from 18 to 38, which were incorrect in the original scenario.
Using the open source data provided by the authors, we segmented out the psychology researchers deemed to be in a medically related psychology subfield. The categorization of subfields as “medically related” was a simple judgement on our part. The four subfields were cognitive neuroscience, biological or neuropsychology, psychiatry or medical psychology, and neuroscience or neuroimaging. The sample sizes for all but cognitive neuroscience (n=121) are quite small: biological and neuropsychology had four respondents, psychiatry and medical psychology had three, and neuroscience and neuroimaging had nine.
The English version of the wording is shown below.
Suppose you have a treatment that you suspect may alter performance on a certain task. You compare the means of your control and experimental groups (say 20 subjects in each sample). Further, suppose you use a simple independent means t-test and your result is (t = 2.7, d.f. = 18, p = 0.01). Please mark each of the statements below as “true” or “false”. “False” means that the statement does not follow logically from the above premises. Several or none of the statements may be correct.
1. You have absolutely disproved the null hypothesis (that is, there is no difference between the population means).
2. You have found the probability of the null hypothesis being true.
3. You have absolutely proved your experimental hypothesis (that there is a difference between the population means).
4. You can deduce the probability of the experimental hypothesis being true.
5. You know, if you decide to reject the null hypothesis, the probability that you are making the wrong decision.
6. You have a reliable experimental finding in the sense that if, hypothetically, the experiment were repeated a great number of times, you would obtain a significant result on 99% of occasions.
The correct answer to all of these questions is "false” (for an explanation see Statistical Inference or https://library.mpib-berlin.mpg.de/ft/gg/GG_Mindless_2004.pdf).
[add interpretation]
Statement summaries | Undergraduate | Masters | PhD | Postdoc and assistant professors |
---|---|---|---|---|
1. Null Hypothesis disproved | 21% | 72% | 21% | 56% |
2. Probability of null hypothesis | 50% | 57% | 47% | 44% |
3. Experimental hypothesis proved | 36% | 60% | 5% | 44% |
4. Probability of experimental hypothesis | 64% | 32% | 47% | 22% |
5. Probability of Type I error | 79% | 25% | 95% | 44% |
6. Probability of replication | 29% | 64% | 42% | 56% |
Table notes:
1. Percentages do not add to 100% because each respondent answered all questions.
2. Sample sizes: Undergraduates (n=14), Masters (n=92), PhD (n=19), Postdoc or assistant prof (n=9)
3. Data in this table are for the subset of psychology researchers who self-identified as being in one of the following four subfields related to medicine: cognitive neuroscience, biological/neuropsychology, psychiatry/medical psychology, or neuroscience/neuroimaging.
4. Data calculated from (a) "P-Value, Confidence Intervals, and Statistical Inference: A New Dataset of Misinterpretation", Ziyang Lyu, Kaiping Peng, and Chuan-Peng Hu, Frontiers in Psychology, 2018, paper: [link], data: [link]
In 2020 Xiao-Kang Lyu, Yuepei Xu, Xiao-Fan Zhao, Xi-Nian Zuo, and Chuan-Peng Hu tested 1,479 students and researchers in China, including 130 in the field of medicine. They used a four-question instrument where respondents were randomized into either a version in which the p-value was statistically significant or one in which it was statistically nonsignificant. Subjects were prompted to answer each question as either “true” or “false.” Respondents are considered to have a misinterpretation of an item if they incorrectly mark it as “true”; the correct answer to all statements was “false.”
The authors’ instrument wording is shown below:
Suppose you have a treatment that you suspect may alter performance on a certain task. You compare the means of your control and experimental groups (50 subjects in each sample). Your hypotheses are as follows. H0: No significant difference exists between the population means corresponding to experimental and the control groups. H1: Significant difference exists between the experimental and the control groups. Further, suppose you use a simple independent means t test, and your result is t = 2.7, df = 98, p = .008 (in the significant version) or t = 1.26, df = 98, p = .21 (in the non-significant version).
The response statements read as follows with the nonsignificant version wording appearing in parenthesis, substituting for the word directly preceding it.
1. You have absolutely disproved (proved) the null hypothesis.
2. You have found the probability of the null (alternative) hypothesis being true.
3. You know, if you decide to (not to) reject the null hypothesis, the probability that you are making the wrong decision.
4. You have a reliable (unreliable) experimental finding in the sense that if, hypothetically, the experiment was repeated a great number of times, you would obtain a significant result on 99% (21%) of occasions.
Using the open source data made available by the authors we attempted to reproduce the findings in Lyu et al. (2020). However, we were not able to reproduce either the top-level figure for NHST or CI or any of the figures from the main table (Table 1). We contacted co-author Chuan-Peng Hu about the possible errors over email and shared our R code. He and the other paper authors then reexamined the data and confirmed our analysis was correct, later issuing a correction to the paper.
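For readers who want to run similar checks against the open data, the sketch below shows the general shape of the summary computation: the proportion of respondents with at least one misinterpretation and the average number per respondent. The file name and column names (field, q1 through q4, coded 1 when an item was marked “true”) are hypothetical stand-ins for the dataset’s actual coding, and our reproduction script differed in its details.

```r
library(dplyr)

# Hypothetical layout: one row per respondent; q1-q4 are 1 if the item was (incorrectly) marked "true"
responses <- read.csv("lyu2020_nhst.csv")

responses %>%
  mutate(n_misinterpretations = q1 + q2 + q3 + q4) %>%
  group_by(field) %>%
  summarise(
    n                = n(),
    pct_at_least_one = mean(n_misinterpretations >= 1),
    avg_per_person   = mean(n_misinterpretations)
  )
```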
In terms of the proportion with at least one NHST misinterpretation, medical researchers were near the top of the pack of the professions surveyed by Lyu et al. (2020). They had the second highest proportion of respondents with at least one misinterpretation out of eight professions surveyed. Medical researchers ranked third out of eight in terms of the average number of misinterpretations, 1.88 out of a total of four possible. The nonsignificant version had a slightly higher average number of misinterpretations than the significant version, 1.92 compared to 1.84. This pattern was true for all fields except general science.
The statement with the highest proportion of incorrect responses differed between the significant and nonsignificant versions. Statement four was the most misinterpreted in the significant version, while statement three was the most misinterpreted in the nonsignificant version. A separate version of statement three was also the most misinterpreted across three studies in psychology: Oakes (1986), a 2002 replication by Haller and Krauss, and the 2018 replication by Lyu et al. (translated into Chinese).
Participants likely have an especially difficult time with statement three because it is a very subtle reversal of conditional probabilities involving the Type I error rate. The Type I error rate is the probability of rejecting the null hypothesis given that the null hypothesis is actually true, whereas this statement asks about the probability of the null hypothesis being true given that it has been rejected. In fact, the Type I error rate is nothing more than the pre-specified value called “alpha” (typically set to 5%), so none of the test results in the hypothetical scenario are needed to determine it.
One might argue that the language is so subtle that some participants who have a firm grasp of Type I error may mistakenly believe this statement is simply describing the Type I error definition. In the significant version of the instrument, statement three has two clauses: (1) “if you decide to reject the null hypothesis” and (2) “the probability that you are making the wrong decision.” In one order these clauses read: “the probability that you are making the wrong decision if you decide to reject the null hypothesis.” With the implicit addition of the null being true, this is achingly close to one way of stating the Type I error rate: “the probability that you wrongly reject a true null hypothesis.” Read in the opposite order, the two clauses form the statement on the instrument. There is no temporal indication in the statement itself, such as “first…then…”, as to which order the clauses should be read in. While it is true that in English we read left to right, many English statements can have their clauses reversed without changing their meaning. Other questions in the instrument are likely more suggestive of a genuine NHST misinterpretation.
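To make the reversal concrete, the short simulation below (a sketch of our own, not taken from any of the surveyed papers) contrasts the two conditional probabilities. It assumes, purely for illustration, that half of all studied effects are truly null and that real effects have a standardized size of 0.5; under those assumptions the probability that the null is true given a rejection comes out far higher than the 5% Type I error rate.

```r
set.seed(1)

n_sims <- 20000          # number of hypothetical experiments
n      <- 20             # subjects per group, as in the scenario above
alpha  <- 0.05
null_is_true <- rbinom(n_sims, 1, 0.5) == 1   # assumption: half of studied effects are truly null

p_values <- sapply(null_is_true, function(is_null) {
  effect <- if (is_null) 0 else 0.5            # assumed effect size when the null is false
  t.test(rnorm(n, mean = 0), rnorm(n, mean = effect))$p.value
})

rejected <- p_values < alpha

# P(reject | null true): the Type I error rate, controlled at roughly 5%
mean(rejected[null_is_true])

# P(null true | reject): what statement three actually asks about; generally not 5%
mean(null_is_true[rejected])
```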
Statement summaries | Significant version | Nonsignificant version |
---|---|---|
1. You have absolutely disproved (proved) the null hypothesis | 49% | 48% |
2. You have found the probability of the null (alternative) hypothesis being true. | 52% | 54% |
3. You know, if you decide to (not to) reject the null hypothesis, the probability that you are making the wrong decision. | 51% | 64% |
4. You have a reliable (unreliable) experimental finding in the sense that if, hypothetically, the experiment was repeated a great number of times, you would obtain a significant result on 99% (21%) of occasions. | 64% | 43% |
Table notes:
1. Percentages do not add to 100% because each respondent answered all four questions.
2. Sample sizes: significant version (n=69), nonsignificant version (n=61).
3. Reference: "Beyond psychology: prevalence of p value
and confidence interval misinterpretation
across different fields", Xiao-Kang Lyu, Yuepei Xu2, Xiao-Fan Zhao1, Xi-Nian Zuo, and Chuan-Peng Hu, Frontiers in Psychology, 2020, paper: [link], data: [link]
A breakdown of NHST misunderstandings by education is shown below. All education levels had high rates of misunderstanding at least one NHST question. Undergraduate students fared best, with “only” 79% demonstrating at least one NHST misinterpretation, but they also had the highest average number of incorrect responses with 2.4 (out of four possible), indicating that those respondents that did have a misinterpretation tended to have multiple misinterpretations.
Education | Sample size | Percentage with at least one NHST misunderstanding | Average number of NHST misunderstandings (out of four) |
---|---|---|---|
Undergraduates | 19 | 79% | 2.4 |
Masters | 69 | 96% | 1.8 |
PhD | 24 | 96% | 1.8 |
Post-PhD | 18 | 94% | 1.8 |
Questionnaires of Confidence Interval Knowledge
Because of the difficulty in properly interpreting NHST, confidence intervals have been proposed as an alternative [citation]. Confidence intervals also have the benefit of giving a measure of the precision of the effect size. For that reason, in addition to NHST instruments, some researchers have also tested confidence interval misinterpretations. Again, most of this research has occurred in the field of psychology. Only Lyu et al. (2020) have directly tested medical researchers for common confidence interval misinterpretations. All researchers surveyed were from China.
Lyu et al. (2020) used a modified version of their four-question NHST instrument adapted to test confidence interval knowledge. There were two versions, one with a statistically significant result and one without. The English translation of the hypothetical experimental situation and four statements are shown below. The significant version of each statement read as follows with the nonsignificant version wording appearing in parenthesis, substituting for the word directly preceding it.
The 95% (bilateral) confidence interval of the mean difference between the experimental group and the control group ranges from .1 to .4 (–.1 to .4).
1. A 95% probability exists that the true mean lies between .1 (–.1) and .4.
2. If we were to repeat the experiment over and over, then 95% of the time the true mean falls between .1 (–.1) to .4.
3. If the null hypothesis is that no difference exists between the mean of experimental group and control group, then the experiment has disproved (proved) the null hypothesis.
4. The null hypothesis is that no difference exists between the means of the experimental and the control groups. If you decide to (not to) reject the null hypothesis, then the probability that you are making the wrong decision is 5%.
At 92%, medical researchers had the fourth highest proportion of respondents with at least one confidence interval misinterpretation out of the eight professions surveyed by Lyu et al. (2020). Medical researchers also ranked third out of eight in terms of the average number of confidence interval misinterpretations, 1.86 out of a total of four possible. This was comparable to the 1.88 average misinterpretations on the NHST instrument. The nonsignificant version had a lower average number of misinterpretations than the significant version, 1.74 compared to 1.97; medicine and the social sciences were the only fields in which the nonsignificant version produced fewer misinterpretations.
There was fairly wide variation between the significant and nonsignificant versions for statement four, 15 percentage points. Statement four suffers from the same subtle wording issue as the NHST version.
Statement summaries | Significant version | Nonsignificant version |
---|---|---|
1. A 95% probability exists that the true mean lies between .1 (–.1) and .4. | 60% | 61% |
2. If we were to repeat the experiment over and over, then 95% of the time the true mean falls between 0.1 (-0.1) and 0.4. | 54% | 56% |
3. If the null hypothesis is that there is no difference between the mean of experimental group and control group, the experiment has disproved (proved) the null hypothesis. | 53% | 47% |
4. The null hypothesis is that there is no difference between the mean of experimental group and control group. If you decide to (not to) reject the null hypothesis, the probability that you are making the wrong decision is 5%. | 66% | 51% |
Table notes:
1. Percentages do not add to 100% because each respondent answered all four questions.
2. Sample sizes: significant version (n=93), nonsignificant version (n=71).
3. Reference: "Beyond psychology: prevalence of p value
and confidence interval misinterpretation
across different fields", Xiao-Kang Lyu, Yuepei Xu2, Xiao-Fan Zhao1, Xi-Nian Zuo, and Chuan-Peng Hu, Frontiers in Psychology, 2020, paper: [link], data: [link]
A breakdown of confidence interval misunderstandings by education is shown below. All education levels had approximately the same average number of confidence interval misinterpretations, about 1.9 out of a possible four. There was some variation in the percentage of each education level with at least one confidence interval misinterpretation. Undergraduates had the lowest rate at 84%, although the sample size of undergraduates was relatively small at 19 participants. All Post-PhD participants had at least one misunderstanding, but again the sample size was relatively small at 18 participants. Masters and PhD students fared about equally.
Education | Sample size | Percentage with at least one CI misunderstanding | Average number of CI misunderstandings (out of four) |
---|---|---|---|
Undergraduates | 19 | 84% | 1.9 |
Masters | 69 | 93% | 1.9 |
PhD | 24 | 92% | 1.9 |
Post-PhD | 18 | 100% | 1.8 |
Cliff Effects and Dichotomization of Evidence
The cliff effect refers to a drop in the confidence of an experimental result based on the p-value. Typically, the effect refers to the dichotomization of evidence at the 0.05 level where an experimental or analytical result produces high confidence for p-values below 0.05 and lower confidence for values above 0.05.
In 2016 Blakeley McShane and David Gal released the results of a multi-year study that investigated the prominence of the cliff effect within various academic fields, including medicine. To test medical researchers McShane and Gal surveyed two populations: 261 authors of articles from the American Journal of Epidemiology and 75 from the New England Journal of Medicine. Both populations of authors were from articles published in 2013.
To fully study the impact of the cliff effect McShane and Gal created a hypothetical scenario. The details of the hypothetical scenario differed for the two populations. For the American Journal of Epidemiology two questions were asked after the hypothetical scenario was presented. The first was the so-called “judgement” question, meant simply to test medical researchers’ statistical understanding of the scenario presented. The judgement question randomized medical researchers into a hypothetical scenario with one of four p-values (0.025, 0.075, 0.125, or 0.175) and one of two treatment magnitudes: a small treatment effect of 52% vs. 44% or a large treatment effect of 57% vs. 39%, for the Drug A and Drug B recovery rates respectively.
The second was the “choice” question, in which medical researchers were asked to make a recommendation based on the hypothetical scenario. The choice question randomized the recommendation to be toward either a close other or a distant other.
The judgement question is shown below.
Below is a summary of a study from an academic paper:
The study aimed to test how two different drugs impact whether a patient recovers from a certain disease. Subjects were randomly drawn from a fixed population and then randomly assigned to Drug A or Drug B. Fifty-two percent (52%) of subjects who took Drug A recovered from the disease while forty-four percent (44%) of subjects who took Drug B recovered from the disease. A test of the null hypothesis that there is no difference between Drug A and Drug B in terms of probability of recovery from the disease yields a p-value of 0.175. Assuming no prior studies have been conducted with these drugs, which of the following statements is most accurate?
A. A person drawn randomly from the same population as the subjects in the study is more likely to recover from the disease if given Drug A than if given Drug B.
B. A person drawn randomly from the same population as the subjects in the study is less likely to recover from the disease if given Drug A than if given Drug B.
C. A person drawn randomly from the same population as the subjects in the study is equally likely to recover from the disease if given Drug A than if given Drug B.
D. It cannot be determined whether a person drawn randomly from the same population as the subjects in the study is more/less/equally likely to recover from the disease if given Drug A or if given Drug B.
Participants in the close other condition then saw the following choice question wording:
If you were to advise a loved one who was a patient from the same population as those in the study, what drug would you advise him or her to take?
Participants in the distant other condition saw this wording instead:
If you were to advise physicians treating patients from the same population as those in the study, what drug would you advise these physicians prescribe for their patients?
All participants then saw the following three response options:
A. I would advise Drug A.
B. I would advise Drug B.
C. I would advise that there is no evidence of a difference between Drug A and Drug B.
The correct answer to all versions of the judgement statements was Option A since Drug A had a higher percentage of patients recover from the disease. However, respondents were much more likely to select an incorrect response in the version of the question with the nonsignificant p-value, likely believing that a nonsignificant p-value was evidence of no effect between Drug A and Drug B. McShane and Gal identify this as evidence of the cliff effect at work.
The evidence is supplemented by respondents’ selections on the choice question. While strictly speaking there is no correct answer to the choice question, since it asks for a recommendation, Drug A had the higher recovery rate and is therefore the natural choice. As in the judgement question, the nonsignificant p-value led fewer respondents to select Drug A. Nonetheless, the proportion selecting Drug A in the nonsignificant version of the choice question was substantially higher than the proportion answering the judgement question correctly. As in McShane and Gal’s article, we collapse the choice question across the “close other” and “distant other” categories, as this is not the primary hypothesis being considered. In general, respondents were more likely to recommend Drug A in the “close other” scenario. For complete details see McShane and Gal (2016).
McShane and Gal hypothesize that respondents are more likely to select the correct answer on the choice question because it short-circuits the automatic impulse to interpret the results through the lens of statistical significance. Instead, focus is redirected toward a simpler question: which drug is better?
For the New England Journal of Medicine (NEJM) two questions were asked after the hypothetical scenario was presented. The first was the so-called “judgement” question, meant simply to test medical researchers’ statistical understanding of the scenario presented. The judgement question presented the same scenario twice, first with a p-value of 0.27 and next with a p-value of 0.01. Participants were randomized into one of three scenario wordings. Wording one is shown below.
Below is a summary of a study from an academic paper. The study aimed to test how different interventions might affect terminal cancer patients’ survival. Participants were randomly assigned to one of two groups. Group A was instructed to write daily about positive things they were blessed with while Group B was instructed to write daily about misfortunes that others had to endure. Participants were then tracked until all had died. Participants in Group A lived, on average, 8.2 months post-diagnosis whereas participants in Group B lived, on average, 7.5 months post-diagnosis (p = 0.27). Which statement is the most accurate summary of the results?
A. Speaking only of the subjects who took part in this particular study, the average number of post-diagnosis months lived by the participants who were in Group A was greater than that lived by the participants who were in Group B.
B. Speaking only of the subjects who took part in this particular study, the average number of post-diagnosis months lived by the participants who were in Group A was less than that lived by the participants who were in Group B.
C. Speaking only of the subjects who took part in this particular study, the average number of post-diagnosis months lived by the participants who were in Group A was no different than that lived by the participants who were in Group B.
D. Speaking only of the subjects who took part in this particular study, it cannot be determined whether the average number of post-diagnosis months lived by the participants who were in Group A was greater/no different/less than that lived by the participants who were in Group B.
Response wording two was identical to response wording one above except it omitted the phrase “Speaking only of the subjects who took part in this particular study” from each of the four response options.
Response wording three omitted “Speaking only of the subjects who took part in this particular study” and rephrased “the average number of post-diagnosis months lived by the participants who were in Group A was greater than that lived by the participants who were in Group B” with “The participants who were in Group A tended to live longer post-diagnosis than the participants who were in Group B.” The complete list of options was then:
A. The participants who were in Group A tended to live longer post-diagnosis than the participants who were in Group B.
B. The participants who were in Group A tended to live shorter post-diagnosis than the participants who were in Group B.
C. Post-diagnosis lifespan did not differ between the participants who were in Group A and the participants who were in Group B.
D. It cannot be determined whether the participants who were in Group A tended to live longer/no different/shorter post-diagnosis than the participants who were in Group B.
As with the American Journal of Epidemiology authors, a substantial cliff effect was observed for NEJM authors.
The authors produced a follow-up study in their paper “Statistical Significance and the Dichotomization of Evidence” that was focused specifically on statisticians. However, the experimental setup was the same. That article was published in the Journal of the American Statistical Association and selected by the editors for discussion.
In their discussions, both Donald Berry and the team of Eric Laber and Kerby Shedden criticized the “Speaking only” phrasing used in the questionnaire. However, in their rejoinder McShane and Gal note that respondents were randomized into two versions of the question phrasing, one of which did not include the “Speaking only” language; the response patterns were the same regardless of phrasing. The original paper as well as the discussions and rejoinders can be found in citation X [https://statmodeling.stat.columbia.edu/wp-content/uploads/2017/11/jasa_combined.pdf].
One more study that examined the cliff effect was uncovered, Helske et al. (2020). More than a hundred researchers participated, including 16 medical researchers. One of the medical researchers had an undergraduate degree, three had a master’s degree, and 12 had a PhD. While the sample size is somewhat small, the results are included here for completeness.
As in McShane and Gal, a hypothetical scenario was presented. The instrument wording was as follows:
A random sample of 200 adults from Sweden were prescribed a new medication for one week. Based on the information on the screen, how confident are you that the medication has a positive effect on body weight (increase in body weight)?
One of four visualizations was then presented: a text box describing the p-value and 95% confidence interval, a 95% confidence interval visual display, a gradient confidence interval visual display, or a violin plot visual display. For each scenario respondents were presented with one of eight p-values between 0.001 and 0.8. The specific p-values were 0.001, 0.01, 0.04, 0.05, 0.06, 0.1, 0.5, and 0.8. Respondents then used a slidebar to select their confidence on a scale of 0 to 100.
Visualization | Largest difference in confidence | Difference in confidence (percentage points) |
---|---|---|
P-value | 0.04 to 0.06 | 21% |
CI | 0.04 to 0.06 | 19% |
Using the open source data made available by the authors we analyzed the extent of a cliff effect. A cliff effect was observed after plotting the drop in confidence segmented by visual presentation type. Only results from the p-value and confidence interval (CI) visual presentation types are presented since these are the most common methods of presenting analytical results. However, in their analysis Helske et al. looked across all 114 respondents and employed Bayesian multilevel models to investigate the influence of the four visual presentation types. The authors concluded that gradient and violin presentation types may moderate the cliff effect in comparison to standard p-value descriptions or confidence interval bounds.
Although the p-values presented to respondents were not evenly spaced, the drop in confidence between two consecutive p-values was used to determine the presence of a cliff effect. One additional difference was calculated, that between p-values of 0.04 and 0.06, the typical cliff effect boundary.
The 0.04 and 0.06 interval was associated with the highest drop in confidence for both the p-value and confidence interval presentation methods.
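The sketch below illustrates how drops in confidence like those in the table above can be computed from response-level data. The file name and column names (viz, p_value, confidence) are hypothetical; the released Helske et al. dataset uses its own variable names.

```r
library(dplyr)

# Hypothetical long format: one row per response
cliff <- read.csv("helske2020_responses.csv") %>%
  filter(viz %in% c("p-value", "CI")) %>%
  group_by(viz, p_value) %>%
  summarise(mean_confidence = mean(confidence), .groups = "drop") %>%
  arrange(viz, p_value) %>%
  group_by(viz) %>%
  mutate(drop_from_previous = lag(mean_confidence) - mean_confidence)

# Largest drop between consecutive presented p-values, for each presentation type
cliff %>% slice_max(drop_from_previous, n = 1)

# The 0.04-to-0.06 comparison skips over 0.05, so it is computed separately
cliff %>%
  filter(p_value %in% c(0.04, 0.06)) %>%
  summarise(drop_04_to_06 = first(mean_confidence) - last(mean_confidence))
```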
MARKETING & MANAGEMENT
There are relatively few studies directly surveying the NHST, p-value, and confidence interval knowledge of business and marketing researchers and students. Just four papers were found: one testing belief in the false claim that hypothesis testing can prove a hypothesis is true, a second exploring both general NHST and confidence interval misinterpretations, and two investigating the prevalence and moderators of the cliff effect.
Authors & year | Article title | Category | Subjects | Primary findings |
---|---|---|---|---|
Vallecillos (2000) | Understanding of the Logic of Hypothesis Testing Amongst University Students [link] | NHST misinterpretations | Business students (n=75) | 1. When shown a statement claiming NHST can prove the truth of a hypothesis, 51% of business students incorrectly marked the statement as true. Only two business students that had correctly answered the statement also provided a correct written explanation of their reasoning. |
McShane & Gal (2015) | Supplement to “Blinding Us to the Obvious? The Effect of Statistical Training on the Evaluation of Evidence” [link] | Cliff effect | Marketing Science Institute Young Scholars (n=27) | 1. A cliff effect was found between p-values of 0.01 and 0.27. |
Lyu et al. (2020) | Beyond psychology: prevalence of p value and confidence interval misinterpretation across different fields [link] | NHST and CI misinterpretations | Management undergraduate students (n=33) Management masters students (n=37) Management PhD students (n=9) Management with a PhD (n=16) | 1. 91% and 97% of undergraduate students demonstrated at least one NHST and one CI misinterpretation, respectively. 2. 97% and 97% of masters students demonstrated at least one NHST and one CI misinterpretation, respectively. 3. 100% and 100% of PhD students demonstrated at least one NHST and one CI misinterpretation, respectively. 4. 94% and 94% of subjects with a PhD demonstrated at least one NHST and one CI misinterpretation, respectively. |
Helske et al. (2020) | Are You Sure You’re Sure? - Effects of Visual Representation on the Cliff Effect in Statistical Inference [link] | Cliff effect | Marketing and management researchers (n=4) | 1. A moderate cliff effect was found for marketing and management researchers using simple descriptive statistics. 2. Looking across all 114 respondents, in a one-sample t-test the cliff effect is damped by using gradient or violin visual presentations over standard confidence interval bounds. |
Questionnaires of NHST knowledge
During the 1991-1992 academic year researcher Augustias Vallecillos asked 436 university students across seven different academic specializations to respond to a simple NHST statement. This survey included 75 students in the field of business. It is unclear how many universities were included and what their location was. The results were written up in Vallecillos’ 1994 Spanish-language paper, “Estudio teorico-experimental de errores y concepciones sobre el contraste estadistico de hipotesis en estudiantes universitarios.” The results appeared again in his 2000 English-language article, “Understanding of the Logic of Hypothesis Testing Amongst University Students.” What is presented here is from his 2000 work.
Vallecillos’ statement was a short sentence asking about the ability of the NHST procedure to prove either the null or alternative hypotheses:
A statistical hypotheses test, when properly performed, establishes the truth of one of the two null or alternative hypotheses.
University speciality | Sample size | Correct answer | Incorrect answer |
---|---|---|---|
Business | 75 | 33% | 51% |
Table notes:
1. The exact number of respondents coded under each category were as follows: true - 25, false - 38, blank - 12 (16%).
2. Reference: "Understanding of the Logic of Hypothesis Testing Amongst University Students", Augustias Vallecillos, Journal für Mathematik-Didaktik, 2000 [link]
Students were asked to answer either “true” or “false” and to explain their answer (although an explanation was not required). The correct answer to the statement is false because NHST only measures the compatibility between the observed data and a null hypothesis. It cannot prove either hypothesis true. In addition, the alternative hypothesis is not explicitly considered in the NHST model, nor is the compatibility of the null hypothesis assessed relative to the alternative.
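As a simple illustration of this point (a simulation sketch of our own, not part of Vallecillos’ study): with a small true effect and modest sample sizes, most replications fail to reach p < 0.05, so a nonsignificant result clearly cannot establish that the null hypothesis is true.

```r
set.seed(1)

# Assumed scenario: a small true effect (standardized size 0.2) with 30 subjects per group
p_values <- replicate(10000, t.test(rnorm(30), rnorm(30, mean = 0.2))$p.value)

# Share of experiments that are nonsignificant even though the null is false
mean(p_values >= 0.05)   # close to 0.9 under these assumed numbers
```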
The quantitative results for the business students surveyed are shown above. Correct and incorrect answers do not add up to 100% because some students left the response blank. Vallecillos includes the percentage of blank responses in his presentation (we have omitted those figures from the table above for clarity). It is unclear why blank responses were included in Vallecillos’ table rather than treated as non-responses and omitted from the response calculation entirely. It may be that some subjects did not give a true/false response but did give a written response; however, this is not explicitly stated.
Vallecillos coded the written explanation of student answers into one of six categories:
Correct argument (C) - These responses are considered to be completely correct.
Example response: “The hypotheses test is based on inferring properties from the population based on some sample data. The result means that one of the two hypotheses is accepted, but does not mean that it is true.”
Partially correct argument (PC) - These responses are considered to be partially, but not completely correct because “answers analysed include other considerations regarding the way of taking the decision, which are not always correct.”
Example response: “What it does establish is the acceptance or rejection of one of the two hypotheses.”
Mistaken argument that NHST establishes the truth of a hypothesis (M1) - These responses include explanations about why the initial statement proposed by Vallecillos was true.
Example response: “Because before posing the problem, we have to establish which one is the null hypothesis H0 and the alternative H1 and one of the two has to be true.”
Mistaken argument that hypothesis testing establishes the probability of the hypotheses (M2) - This argument is another case of the Inverse Probability Fallacy. As the results of other studies summarized in this article show it is quite common among both students and professionals.
Example response: “What is established is the probability, with a margin of error, that one of the hypotheses is true.”
Other mistaken arguments (M3) - This category includes all other mistaken arguments that do not fall into either M1 or M2.
Example response: “What it establishes is the possibility that the answer formed is the correct one.”
Arguments that are difficult to interpret (DI) - These arguments were either not interpretable or did not address the subject’s reasoning behind answering the statement.
Example response: “The statistical hypotheses test is conditioned by the size of the sample and the level of significance.”
Not all of the respondents to the statement gave a written explanation: 52 of the 75 business students (69%) did so. Summary results are shown in the table below. Percentages are out of the number who gave written explanations, not out of the number who responded to the original statement. Just 4% of respondents gave a correct written explanation; however, 40% gave a partially correct explanation, the second largest proportion behind mathematics students. Mistake M1 was by far the most common of the three mistake categories. Business students had 15% of written explanations categorized as DI, or “difficult to interpret,” the largest proportion of any university specialization.
[compare to other types of misinterpretations]
Vallecillos notes that when considering the full sample of all 436 students, 9.7% of those who correctly answered the statement also provided a correct written explanation and 31.9% of the students who correctly answered the statement gave a partially correct written explanation. This means that across the full sample about 60% of students who correctly answered the statement did so for incorrect reasons or were not able to clearly articulate their reasoning.
University speciality | Number of subjects who provided written explanations | C | PC | M1 | M2 | M3 | DI |
---|---|---|---|---|---|---|---|
Business | 52 | 4% | 40% | 37% | 2% | 2% | 15% |
Table notes:
1. Key: C - "Correct", PC - "Partially correct", M1 - "Mistake 1", M2 - "Mistake 2", M3 - "Mistake 3", DI - "Difficult to interpret". See full explanations in description above the table.
2. Percentages have been rounded for clarity and may not add to 100%.
3. The exact number of respondents coded under each category were as follows: C - 2, PC - 21, M1 - 19, M2 - 1, M3 - 1, DI - 8.
4. Reference: "Understanding of the Logic of Hypothesis Testing Amongst University Students", Augustias Vallecillos, Journal für Mathematik-Didaktik, 2000 [link]
In 2020 Xiao-Kang Lyu, Yuepei Xu, Xiao-Fan Zhao, Xi-Nian Zuo, and Chuan-Peng Hu tested 1,479 students and researchers in China, including 95 in the field of management. They used a four-question instrument where respondents were randomized into either a version in which the p-value was statistically significant or one in which it was statistically nonsignificant. Subjects were prompted to answer each question as either “true” or “false.” Respondents are considered to have a misinterpretation of an item if they incorrectly mark it as “true”; the correct answer to all statements was “false.”
The authors’ instrument wording is shown below:
Suppose you have a treatment that you suspect may alter performance on a certain task. You compare the means of your control and experimental groups (50 subjects in each sample). Your hypotheses are as follows. H0: No significant difference exists between the population means corresponding to experimental and the control groups. H1: Significant difference exists between the experimental and the control groups. Further, suppose you use a simple independent means t test, and your result is t = 2.7, df = 98, p = .008 (in the significant version) or t = 1.26, df = 98, p = .21 (in the non-significant version).
The response statements read as follows with the nonsignificant version wording appearing in parenthesis, substituting for the word directly preceding it.
1. You have absolutely disproved (proved) the null hypothesis.
2. You have found the probability of the null (alternative) hypothesis being true.
3. You know, if you decide to (not to) reject the null hypothesis, the probability that you are making the wrong decision.
4. You have a reliable (unreliable) experimental finding in the sense that if, hypothetically, the experiment was repeated a great number of times, you would obtain a significant result on 99% (21%) of occasions.
Using the open source data made available by the authors we attempted to reproduce the findings in Lyu et al. (2020). However, we were not able to reproduce either the top-level figure for NHST or CI or any of the figures from the main table (Table 1). We contacted co-author Chuan-Peng Hu about the possible errors over email and shared our R code. He and the other paper authors then reexamined the data and confirmed our analysis was correct, later issuing a correction to the paper.
In terms of the proportion with at least one NHST misinterpretation, management researchers had the highest rate out of the academic fields surveyed by Lyu et al. (2020), with 95% of participants having at least one misinterpretation. However, management researchers fared well when it came to the average number of misinterpretations, coming in with “only” 1.71 out of four possible, behind only general science for the lowest figure. The nonsignificant version had a higher average number of misinterpretations than the significant version, 1.84 compared to 1.59. This pattern was true for all fields except general science.
The statement with the highest proportion of incorrect responses in both versions was statement three, which was also the most misinterpreted statement across all fields. A separate version of this statement was also the most misinterpreted across three studies in psychology: Oakes (1986), a 2002 replication by Haller and Krauss, and another replication in 2018 by Lyu et al. (translated into Chinese).
Participants likely have an especially difficult time with statement three because it is a very subtle reversal of conditional probabilities involving the Type I error rate. The Type I error rate is the probability of rejecting the null hypothesis given that the null hypothesis is actually true, whereas this statement asks about the probability of the null hypothesis being true given that it has been rejected. In fact, the Type I error rate is nothing more than the pre-specified value called “alpha” (typically set to 5%), so none of the test results in the hypothetical scenario are needed to determine it.
One might argue that the language is so subtle that some participants who have a firm grasp of Type I error may mistakenly believe this statement is simply describing the Type I error definition. In the significant version of the instrument, statement three has two clauses: (1) “if you decide to reject the null hypothesis” and (2) “the probability that you are making the wrong decision.” In one order these clauses read: “the probability that you are making the wrong decision if you decide to reject the null hypothesis.” With the implicit addition of the null being true, this is achingly close to one way of stating the Type I error rate: “the probability that you wrongly reject a true null hypothesis.” Read in the opposite order, the two clauses form the statement on the instrument. There is no temporal indication in the statement itself, such as “first…then…”, as to which order the clauses should be read in. While it is true that in English we read left to right, many English statements can have their clauses reversed without changing their meaning. Other questions in the instrument are likely more suggestive of a genuine NHST misinterpretation.
Statement summaries | Significant version | Nonsignificant version |
---|---|---|
1. You have absolutely disproved (proved) the null hypothesis | 63% | 55% |
2. You have found the probability of the null (alternative) hypothesis being true. | 55% | 48% |
3. You know, if you decide to (not to) reject the null hypothesis, the probability that you are making the wrong decision. | 71% | 71% |
4. You have a reliable (unreliable) experimental finding in the sense that if, hypothetically, the experiment was repeated a great number of times, you would obtain a significant result on 99% (21%) of occasions. | 53% | 43% |
Table notes:
1. Percentages do not add to 100% because each respondent answered all four questions.
2. Sample sizes: significant version (n=51), nonsignificant version (n=44).
3. Reference: "Beyond psychology: prevalence of p value
and confidence interval misinterpretation
across different fields", Xiao-Kang Lyu, Yuepei Xu2, Xiao-Fan Zhao1, Xi-Nian Zuo, and Chuan-Peng Hu, Frontiers in Psychology, 2020, paper: [link], data: [link]
A breakdown of NHST misunderstandings by education is shown below. All education levels had high rates of misunderstanding at least one NHST question. Undergraduate students fared best, with “only” 91% demonstrating at least one NHST misinterpretation and an average of 1.7 incorrect NHST responses. There was substantial differentiation between education levels in the average number of incorrect responses, with Post-PhD researchers marking 1.6 questions incorrect on average, while PhD students marked 2.1 questions incorrect. Note, however, that both sample sizes are relatively small, just 16 and 9, respectively.
Education | Sample size | Percentage with at least one NHST misunderstanding | Average number of NHST misunderstandings (out of four) |
---|---|---|---|
Undergraduates | 33 | 91% | 1.7 |
Masters | 37 | 97% | 1.7 |
PhD | 9 | 100% | 2.1 |
Post-PhD | 16 | 94% | 1.6 |
Questionnaires of Confidence Interval Knowledge
Because of the difficulty in properly interpreting NHST, confidence intervals have been proposed as an alternative [citation]. Confidence intervals also have the benefit of giving a measure of the precision of the effect size. For that reason, in addition to NHST instruments, some researchers have also tested confidence interval misinterpretations. Again, most of this research has occurred in the field of psychology. Only Lyu et al. (2020) have directly tested management researchers for common confidence interval misinterpretations. A total of 95 management researchers were surveyed, all from China.
Lyu et al. (2020) used a modified version of their four-question NHST instrument adapted to test confidence interval knowledge. There were two versions, one with a statistically significant result and one without. The English translation of the hypothetical experimental situation and four statements are shown below. The significant version of each statement read as follows with the nonsignificant version wording appearing in parenthesis, substituting for the word directly preceding it.
The 95% (bilateral) confidence interval of the mean difference between the experimental group and the control group ranges from .1 to .4 (–.1 to .4).
1. A 95% probability exists that the true mean lies between .1 (–.1) and .4.
2. If we were to repeat the experiment over and over, then 95% of the time the true mean falls between .1 (–.1) to .4.
3. If the null hypothesis is that no difference exists between the mean of experimental group and control group, then the experiment has disproved (proved) the null hypothesis.
4. The null hypothesis is that no difference exists between the means of the experimental and the control groups. If you decide to (not to) reject the null hypothesis, then the probability that you are making the wrong decision is 5%.
At 97%, management researchers had the highest proportion of respondents with at least one confidence interval misinterpretation out of the eight professions surveyed by Lyu et al. (2020). However, management researchers ranked seventh out of eight in terms of the average number of confidence interval misinterpretations, 1.68 out of a total of four possible. This was comparable to the 1.71 average misinterpretations on the NHST instrument. The nonsignificant version had a higher average number of misinterpretations than the significant version, 1.73 compared to 1.65; this was the majority pattern, with six of the eight fields showing a higher average for the nonsignificant version.
There was fairly wide variation between the significant and nonsignificant versions for statement three, 16 percentage points. Statement four, which suffers from the same subtle wording issue as the NHST version, showed a smaller difference of five percentage points.
Statement summaries | Significant version | Nonsignificant version |
---|---|---|
1. A 95% probability exists that the true mean lies between .1 (–.1) and .4. | 63% | 55% |
2. If we were to repeat the experiment over and over, then 95% of the time the true mean falls between 0.1 (-0.1) and 0.4. | 51% | 61% |
3. If the null hypothesis is that there is no difference between the mean of experimental group and control group, the experiment has disproved (proved) the null hypothesis. | 59% | 43% |
4. The null hypothesis is that there is no difference between the mean of experimental group and control group. If you decide to (not to) reject the null hypothesis, the probability that you are making the wrong decision is 5%. | 63% | 68% |
Table notes:
1. Percentages do not add to 100% because each respondent answered all four questions.
2. Sample sizes: significant version (n=51), nonsignificant version (n=44).
3. Reference: "Beyond psychology: prevalence of p value
and confidence interval misinterpretation
across different fields", Xiao-Kang Lyu, Yuepei Xu2, Xiao-Fan Zhao1, Xi-Nian Zuo, and Chuan-Peng Hu, Frontiers in Psychology, 2020, paper: [link], data: [link]
A breakdown of confidence interval misunderstandings by education is shown below. All education levels had high misinterpretation rates of confidence intervals, ranging from 94% to 100%. Overall, Post-PhD researchers fared best, with the lowest rate of confidence interval misinterpretations (94%) and the lowest average number of incorrect responses (1.4).
Education | Sample size | Percentage with at least one CI misunderstanding | Average number of CI misunderstandings (out of four) |
---|---|---|---|
Undergraduates | 33 | 97% | 1.9 |
Masters | 37 | 97% | 1.6 |
PhD | 9 | 100% | 1.8 |
Post-PhD | 16 | 94% | 1.4 |
Cliff Effects and Dichotomization of Evidence
The cliff effect refers to a drop in the confidence of an experimental result based on the p-value. Typically, the effect refers to the dichotomization of evidence at the 0.05 level where an experimental or analytical result produces high confidence for p-values below 0.05 and lower confidence for values above 0.05.
In 2016 Blakeley McShane and David Gal released the results of a multi-year study that investigated the prominence of the cliff effect within various academic fields, including 27 marketing researchers. These researchers were Marketing Science Institute Young Scholars, who the authors note are selected on the basis of their being “potential leaders of the ‘next generation’ of marketing academics.” All Young Scholars are three to six years post-Ph.D and are conducting research on critical marketing topics such as emerging technologies, consumer decision making, and quantitative marketing research [https://www.msi.org/research/msi-young-scholars/].
Their instrument is shown below.
Below is a summary of a study from an academic paper:
The study aimed to test how different interventions might affect terminal cancer patients’ survival. Participants were randomly assigned to either write daily about positive things they were blessed with or to write daily about misfortunes that others had to endure. Participants were then tracked until all had died. Participants who wrote about the positive things they were blessed with lived, on average, 8.2 months after diagnosis whereas participants who wrote about others’ misfortunes lived, on average, 7.5 months after diagnosis (p = 0.27). Which statement is the most accurate summary of the results?
A. The results showed that participants who wrote about their blessings tended to live longer post-diagnosis than participants who wrote about others’ misfortunes.
B. The results showed that participants who wrote about others’ misfortunes tended to live longer post-diagnosis than participants who wrote about their blessings.
C. The results showed that participants’ post-diagnosis lifespan did not differ depending on whether they wrote about their blessings or wrote about others’ misfortunes.
D. The results were inconclusive regarding whether participants’ post-diagnosis lifespan was greater when they wrote about their blessings or when they wrote about others’ misfortunes.
Respondents saw two versions of the instrument in a random order, one where the p-value was statistically significant (p = 0.01) and one where it was statistically nonsignificant (p=0.27).
The correct answer to both versions was Option A since participants writing about positive things lived longer on average. However, respondents were much more likely to select an incorrect response in the version of the question with the nonsignificant p-value, likely believing that a nonsignificant p-value was evidence of no effect between writing habits. McShane and Gal identify this as evidence of the cliff effect at work.
A substantial cliff effect was observed among all populations surveyed by McShane and Gal, including the Young Scholars. Unlike some other populations surveyed, Young Scholars were not asked the so-called “choice” question which modified the instrument wording to elicit a recommendation toward either a close or distant other. Instrument versions that included the choice question assessed the impact of two potentially life-saving drugs, a situation more amenable to asking for a recommendation.
Another study that examined the cliff effect was Helske et al. (2020). While more than a hundred researchers participated, just five were in the field of management or marketing. Three researchers had received a PhD and two had received a master’s degree. While the sample size is extremely small, the results of the study are included here for completeness.
As in McShane and Gal, a hypothetical scenario was presented. The instrument wording was as follows:
A random sample of 200 adults from Sweden were prescribed a new medication for one week. Based on the information on the screen, how confident are you that the medication has a positive effect on body weight (increase in body weight)?
One of four visualizations was then presented: a text box describing the p-value and 95% confidence interval, a 95% confidence interval visual display, a gradient confidence interval visual display, or a violin plot visual display. For each scenario respondents were presented with one of eight p-values between 0.001 and 0.8. The specific p-values were 0.001, 0.01, 0.04, 0.05, 0.06, 0.1, 0.5, and 0.8. Respondents then used a slidebar to select their confidence on a scale of 0 to 100.
Using the open source data made available by the authors we analyzed the extent of a cliff effect. A possible cliff effect was observed after plotting the drop in confidence segmented by visual presentation type. Only results from the p-value and confidence interval (CI) visual presentation types are presented since these are the most common methods of presenting analytical results. However, in their analysis Helske et al. looked across all 114 respondents and employed Bayesian multilevel models to investigate the influence of the four visual presentation types. The authors concluded that gradient and violin presentation types may moderate the cliff effect in comparison to standard p-value descriptions or confidence interval bounds.
Visualization | Largest difference in confidence | Difference in confidence (percentage points) |
---|---|---|
P-value | 0.05 to 0.06 | 30% |
CI | 0.04 to 0.06 | 31% |
Although the p-values presented to respondents were not evenly spaced, the drop in confidence between two consecutive p-values was used to determine the presence of a cliff effect. One additional difference was calculated, that between p-values of 0.04 and 0.06, the typical cliff effect boundary.
For the confidence interval visual presentation type the largest drop in confidence was indeed associated with the 0.04 and 0.06 interval. The 0.04 to 0.06 p-value difference had the second highest drop in confidence for the p-value visual presentation. The largest drop in confidence was associated with the 0.05 to 0.06 p-value interval, which might also be thought of as indicative of a cliff effect. However, due to the extremely low sample size caution should be used in interpretation of these results.
EDUCATION, PEDAGOGY, AND VOCATIONAL STUDIES
[insert]
Authors & year | Article title | Category | Subjects | Primary findings |
---|---|---|---|---|
Vallecillos (2000) | Understanding of the Logic of Hypothesis Testing Amongst University Students [link] | NHST misinterpretations | Pedagogy students (n=43) | 1. When shown a statement claiming NHST can prove the truth of a hypothesis, 33% of pedagogy students incorrectly marked the statement as true. No pedagogy students that had correctly answered the statement also provided a correct written explanation of their reasoning. |
Mittag and Thompson (2000) | A National Survey of AERA Members' Perceptions of Statistical Significance Tests and Other Statistical Issues [link] | NHST misinterpretations | Members of the American Educational Research Association (AERA) (n=225) | Using eight statements from the original 29-statement survey instrument which had clear true/false answers, members deviated from the correct answer by 1.725 points on a 5-point Likert scale. |
Gordon (2001) | American Vocational Education Research Association Members' Perceptions of Statistical Significance Tests and Other Statistical Controversies [link] | NHST misinterpretations | American Vocational Education Research Association (AVERA) (n=113) | Using eight statements from the original 29-statement survey instrument which had clear true/false answers, members deviated from the correct answer by 1.77 points on a 5-point Likert scale. |
Helske et al. (2020) | Are You Sure You’re Sure? - Effects of Visual Representation on the Cliff Effect in Statistical Inference [link] | Cliff effect | Education researchers (n=2) | TBD |
Questionnaires of NHST knowledge
During the 1991-1992 academic year researcher Augustias Vallecillos asked 436 university students across seven different academic specializations to respond to a simple NHST statement. This survey included 43 students in the field of pedagogy (the method and practice of teaching). It is unclear how many universities were included and what their location was. The results were written up in Vallecillos’ 1994 Spanish-language paper, “Estudio teorico-experimental de errores y concepciones sobre el contraste estadistico de hipotesis en estudiantes universitarios.” The results appeared again in his 2000 English-language article, “Understanding of the Logic of Hypothesis Testing Amongst University Students.” What is presented here is from his 2000 work.
Vallecillos’ statement was a short sentence asking about the ability of the NHST procedure to prove either the null or alternative hypotheses:
A statistical hypotheses test, when properly performed, establishes the truth of one of the two null or alternative hypotheses.
University speciality | Sample size | Correct answer | Incorrect answer |
---|---|---|---|
Pedagogy | 43 | 33% | 56% |
Table notes:
1. The exact number of respondents coded under each category were as follows: true - 24, false - 14, blank - 5 (11.6%).
2. Reference: "Understanding of the Logic of Hypothesis Testing Amongst University Students", Augustias Vallecillos, Journal für Mathematik-Didaktik, 2000 [link]
Students were asked to answer either “true” or “false” and to explain their answer (although an explanation was not required). The correct answer to the statement is false because NHST only measures the compatibility between observed data and a null hypothesis. It cannot prove either hypothesis true. In addition, the alternative hypothesis is not explicitly considered in the NHST model nor is the compatibility of the null hypothesis considered relative to the alternative.
The quantitative results for the pedagogy students surveyed are shown in the table above. Correct and incorrect answers do not add up to 100% because some students left the response blank. Vallecillos includes the percentage of blank responses in his presentation (we have omitted those figures from the table above for clarity). It is unclear why blank responses were included in Vallecillos’ table instead of treating blanks as non-responses and omitting them completely from the response calculation. It may be that some subjects did not give a true/false response but did give a written response; however, this is not explicitly stated.
Vallecillos coded the written explanation of student answers into one of six categories:
Correct argument (C) - These responses are considered to be completely correct.
Example response: “The hypotheses test is based on inferring properties from the population based on some sample data. The result means that one of the two hypotheses is accepted, but does not mean that it is true.”
Partially correct argument (PC) - These responses are considered to be partially, but not completely correct because “answers analysed include other considerations regarding the way of taking the decision, which are not always correct.”
Example response: “What it does establish is the acceptance or rejection of one of the two hypotheses.”
Mistaken argument that NHST establishes the truth of a hypothesis (M1) - These responses include explanations about why the initial statement proposed by Vallecillos was true.
Example response: “Because before posing the problem, we have to establish which one is the null hypothesis H0 and the alternative H1 and one of the two has to be true.”
Mistaken argument that hypothesis testing establishes the probability of the hypotheses (M2) - This argument is another case of the Inverse Probability Fallacy. As the results of other studies summarized in this article show it is quite common among both students and professionals.
Example response: “What is established is the probability, with a margin of error, that one of the hypotheses is true.”
Other mistaken arguments (M3) - This category includes all other mistaken arguments that do not fall into either M1 or M2.
Example response: “What it establishes is the possibility that the answer formed is the correct one.”
Arguments that are difficult to interpret (DI) - These arguments were either not interpretable or did not address the subject’s reasoning behind answering the statement.
Example response: “The statistical hypotheses test is conditioned by the size of the sample and the level of significance.”
Not all of the respondents to the statement gave a written explanation; 30 of the 43 pedagogy students (70%) did so. Summary results are shown in the table below. Percentages are out of the number who gave written explanations, not out of the number that provided a response to the original statement. No pedagogy students gave a correct written explanation and just 27% gave a partially correct response. Twice as many (40%) were guilty of M1 as of M2 (20%).
[compare to other types of misinterpretations]
Vallecillos notes that when considering the full sample of all 436 students, 9.7% of those who correctly answered the statement also provided a correct written explanation and 31.9% of the students who correctly answered the statement gave a partially correct written explanation. This means that across the full sample about 60% of students who correctly answered the statement did so for incorrect reasons or were not able to clearly articulate their reasoning.
University speciality | Number of subjects who provided written explanations | C | PC | M1 | M2 | M3 | DI |
---|---|---|---|---|---|---|---|
Pedagogy | 30 | 0% | 27% | 40% | 20% | 0% | 13% |
Table notes:
1. Key: C - "Correct", PC - "Partially correct", M1 - "Mistake 1", M2 - "Mistake 2", M3 - "Mistake 3", DI - "Difficult to interpret". See full explanations in description above the table.
2. Percentages have been rounded for clarity and may not add to 100%.
3. The exact number of respondents coded under each category were as follows: C - 0, PC - 8, M1 - 12, M2 - 6, M3 - 0, DI - 4.
4. Reference: "Understanding of the Logic of Hypothesis Testing Amongst University Students", Augustias Vallecillos, Journal für Mathematik-Didaktik, 2000 [link]
In 1997 Matt Wilkerson and Mary Olson asked 52 graduate students three questions about the impact of sample size on p-values and Type I and Type II errors. The academic fields of the graduate students were not provided in the paper, but results are compiled in this section on Education, Pedagogy, and Vocational Studies for two reasons. First, it is stated that 14 of the 52 students (27%) were pursuing a Doctor of Education (EdD) degree. No additional information was given, however, for the 20 students who were pursuing a PhD or the 16 pursuing master’s degrees. Second, one co-author, Mary Olson, received her PhD in Adult Education.
The authors do not provide the survey instrument in the write-up of their study. However, they do report the results, which are summarized in the table below. The authors highlight three findings.
Only one subject, “recognized that, given two different studies, both reporting a p value of .05, the study with the smaller n provides better evidence for treatment effect.” However, this characterization of the results is somewhat problematic. In the Discussion section the authors reference the same results, but in a different way: “A significant number of respondents failed to recognize that a small sample requires a greater treatment effect than a large sample to obtain an equal level of statistical significance.” The first version of the result summary conflates a small-n treatment effect with “better evidence.” While mathematically it is true that, relative to a large-n sample, the treatment effect must be bigger in a small-n sample to achieve the same level of statistical significance, this does not mean it has provided better evidence. For instance, smaller sample sizes are also less resilient to random error, which can artificially inflate the measured treatment effect.
Six subjects, “recognized that the probability of Type I error does not depend on sample size.” While this can be seen from the underlying mathematics, it can also be demonstrated via simulation (see the sketch following these results).
Twenty-five subjects, “demonstrated an understanding that the probability of Type II error decreases with a larger sample size.”
For all three results the authors note that the most common justification for the subjects’ response was that increasing sample size decreases the amount of statistical error. Increasing sample size can have benefits; however, as the authors note a more sophisticated understanding is required for the highest caliber of research. For instance, if Type I error is the primary error that needs to be minimized then sample size does not matter.
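The second and third findings can be checked with a small simulation. The sketch below is ours, with an assumed 0.5 SD effect for the power calculation: the simulated Type I error rate stays near alpha regardless of sample size, while the Type II error rate falls as the sample size grows.

```r
# Sketch: Type I error rate does not depend on n; Type II error shrinks as n grows.
# (Assumed effect size of 0.5 SD; not data from Wilkerson & Olson.)
set.seed(1)

type1_rate <- function(n, alpha = 0.05, reps = 10000) {
  # Both groups drawn from the same distribution, so any rejection is a Type I error
  mean(replicate(reps, t.test(rnorm(n), rnorm(n))$p.value < alpha))
}

power_rate <- function(n, effect = 0.5, alpha = 0.05, reps = 10000) {
  # True effect of 0.5 SD; power = probability of correctly rejecting the null
  mean(replicate(reps, t.test(rnorm(n), rnorm(n, mean = effect))$p.value < alpha))
}

sapply(c(10, 50, 200), type1_rate)      # all approximately 0.05
1 - sapply(c(10, 50, 200), power_rate)  # Type II error rate falls as n increases
```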
The difficulties education researchers have in correctly interpreting Type I and Type II error were reaffirmed in Mittag and Thompson (2000) and Gordon (2001), discussed next.
Author descriptions of results | Percentage answering incorrectly | Most common subject justification for response |
---|---|---|
1. "[G]iven two different studies, both reporting a p value of .05, the study with the smaller n provides better evidence for treatment effect." | 98% | Increasing sample size decreases the amount of statistical error. |
2. "[T]he probability of Type I error does not depend on sample size." | 89% | Increasing sample size decreases the amount of statistical error. |
3. "[D]emonstrated an understanding that the probability of Type II error decreases with a larger sample size." | 52% | Increasing sample size decreases the amount of statistical error. |
Table notes:
1. Percentages have been rounded. Actual percentages were: 98.1%, 88.5%, 52.0%, respectively.
2. Reference: "Misconceptions About Sample Size, Statistical Significance, and Treatment Effect", Matt Wilkerson & Mary R. Olson, The Journal of Psychology, 1997 [link]
In 2000 Kathleen Mittag and Bruce Thompson published the results of a survey of 225 members of the American Educational Research Association (AERA). According to the AERA website, the organization is “a national research society” that “strives to advance knowledge about education, to encourage scholarly inquiry related to education, and to promote the use of research to improve education and serve the public good.” Approximately 4% of AERA members were initially randomly selected, a total of 1,127 surveys were mailed, and usable surveys were received back at a rate of 21.7% (a total of 225).
The survey instrument contained 29 statements broken out into nine categories (formally, the 29 statements constitute Part II of the Psychometrics Group Instrument developed by Mittag in 1999). Subjects responded using a 5-point Likert scale where for some statements 1 meant agree and 5 meant disagree, and for others the scale was reversed so that 1 denoted disagreement and 5 agreement. We refer to the first scale direction as positive (+) and the second as negative (-).
The 5-point Likert scale is often constructed using the labels 1 = Strongly agree, 2 = Agree, 3 = Neutral, 4 = Disagree, 5 = Strongly disagree (https://legacy.voteview.com/pdf/Likert_1932.pdf, http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.474.608&rep=rep1&type=pdf, https://link.springer.com/referenceworkentry/10.1007%2F978-0-387-78665-0_6363). However, it appears that in the survey instrument used by Mittag and Thompson the labels were 1 = Agree, 2 = Somewhat agree, 3 = Neutral, 4 = Somewhat disagree, 5 = Disagree (with the word “strongly” omitted). While the instrument used in Mittag and Thompson was not shown, Gordon (2001) appears to have used the same instrument in his follow-up study and provides it in full. The instrument presents the scale in the way we just described. The results of Gordon (2001) are discussed after those of Mittag and Thompson (2000).
The instrument included both opinion- and fact-based statements. An example of an opinion-based statement was:
It would be better if everyone used the phrase, “statistically significant,” rather than “significant”, to describe the results when the null hypothesis is rejected.
An example of a fact-based statement was:
It is possible to make both Type I and Type II error in a given study.
Eight fact-based statements related to NHST were selected as a measure of this group’s NHST misunderstandings. Mean responses on the Likert scale are shown below. As a measure of the degree of misunderstanding, the absolute deviation of the mean from the normatively correct answer was calculated. For example, Statement 2, which reads “Type I errors may be a concern when the null hypothesis is not rejected,” is incorrect. This is because Type I error refers to falsely rejecting a true null hypothesis; if the null hypothesis is not rejected in the first place there is no possibility of making an error in the rejection. For this reason every respondent should have selected 5 (disagree). However, the mean response was in fact 3.0, so the deviation was 5.0 - 3.0 = 2.0 Likert scale points.
We judge each statement below as unambiguously true or false. Statement 1, which reads “It is possible to make both Type I and Type II error in a given study,” is incorrect. This is because Type II error occurs when the null hypothesis is not rejected, while Type I error occurs when the null hypothesis is rejected. Because the null is either rejected or not, with mutual exclusion between the two possibilities, only one type of error can occur in a given hypothesis test. The word “study” is used in the statement and one might argue that a study can contain multiple hypothesis tests. However, our reading is that “study” refers to the entire process of data collection and hypothesis testing for a single outcome of interest. Taking the word “study” more broadly results in an interpretation too loose to be useful, since a study could then contain any number of tests and the statement would be trivially true.
Methodology for reading accurate values off of the charts provided in Mittag & Thompson 2000.
Precise values for each statement response were not presented in the paper. Instead, a series of charts were presented that used the statement number as the mean and dashes as confidence intervals. The y-axis (Likert scale) was presented at a precision of 0.2. To ensure we collected mean responses as accurately as possible we used our standard method of copying the charts into Adobe Illustrator and tracing them. Results of this procedure are shown at right. Using horizontal line segments we were able to match the values on the y-axis with each statement’s mean. A similar procedure could have been conducted using visual inspection of the charts in the paper, but we prefer this method due to the increased accuracy. We feel the values we recorded are as accurate as possible given the data presented in the paper. Note that the numbers presented below are our ordering for easy statement reference and do not align with the original statement numbers presented by Mittag and Thompson. For those interested we have provided a mapping in Note 5 of the Table Notes.
Using the same methodology, the confidence intervals were determined to be almost uniformly plus or minus 0.3 on the Likert scale for all statements; they are therefore omitted from the table below.
Averaging the deviation from the correct answer across all eight responses resulted in a figure of 1.725. This is roughly equivalent to being neutral on (or ever so slightly agreeing with) a statement that is in fact true, and therefore normatively should elicit complete agreement.
Statement 3 had the largest deviation and in general statements relating to Type I and Type II errors had larger deviations from the correct answer than other statements.
Statements 4, 6, and 7 were all related to the “Clinical or practical significance fallacy.” Statement 7 had the lowest deviation of all eight statements, 0.4, while Statement 4 had a deviation of 1.0, the second lowest. Statement 6, however, had a substantially larger deviation, 2.2. This is likely due to the difference in statement wording. Statement 6 read, “Finding that a p < 0.05 is one indication that the results are important.” While some might argue that this statement is true — a p less than 0.05 result is one, but not the only, indication of an important result — we unambiguously find this statement to be incorrect as the p-value has no bearing at all on whether a result is important.
Statement 5 relates to the “Effect size fallacy” and is also incorrect. It is true that the relative effect size is one determinant of the size of the p-value; this is the reason for the zone of nonsignificance. However, the p-value is not a direct measure of the effect size. For example, a small effect can still produce a small p-value if the sample size is sufficiently large.
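A minimal sketch of this point, using simulated data with an assumed true effect of only 0.05 standard deviations: with a large enough sample even a trivially small effect produces a very small p-value.

```r
# Sketch (assumed data): a tiny true effect (0.05 SD) combined with a very large
# sample can still yield p < 0.05, so a small p-value is not direct evidence
# of a large effect.
set.seed(1)
n <- 100000
control   <- rnorm(n, mean = 0)
treatment <- rnorm(n, mean = 0.05)   # effect of just 0.05 standard deviations
t.test(treatment, control)$p.value   # typically far below 0.05
```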
Statement 8 relates to the “Replicability fallacy.” It had a deviation of 2.0 from the correct response. The statement is incorrect as the p-value is not a measure of experimental replicability.
***Mittag and Thompson (2000) had a different result for X, which they said was verified by the authors. However, we are confident in our method. We reached out to Thompson but did not receive a response.
Statement | Mean response (Likert scale) | Correct answer | Deviation from correct answer (Likert scale) | Scale direction |
---|---|---|---|---|
1. It is possible to make both Type I and Type II error in a given study. | 2.8 | Incorrect | 2.2 | + |
2. Type I errors may be a concern when the null hypothesis is not rejected. | 3.0 | Incorrect | 2.0 | + |
3. A Type II error is impossible if the results are statistically significant. | 2.2 | Correct | 2.8 | - |
4. If a dozen different researchers investigated the same phenomenon using the same null hypothesis, and none of the studies yielded statistically significant results, this means that the effects being investigated were not noteworthy or important. | 4.0 | Incorrect | 1.0 | + |
5. Smaller p values provide direct evidence that study effects were larger. | 3.8 | Incorrect | 1.2 | + |
6. Finding that p < .05 is one indication that the results are important. | 2.8 | Incorrect | 2.2 | + |
7. Studies with non-significant results can still be very important. | 1.4 | Correct | 0.4 | + |
8. Smaller and smaller values for the calculated p indicate that the results are more likely to be replicated in future research. | 3.0 | Incorrect | 2.0 | + |
Table notes:
1. Exact means are not provided in the paper, instead each is rounded to the nearest 0.2 and are reported here using visual inspection of the charts provided in the paper.
2. Confidence intervals for each statement are uniformly approximately ± 0.3 on the Likert scale (ex. if the mean is 2.0 the 95% CI is approximately 1.7 to 2.3)
3. Deviation from correct answer is calculated by assuming that -- when the scale is positive -- for incorrect answers 5 (disagree) is normatively correct and subtracts the mean from 5. This is reversed for the negative scale. The same logic applies to answers that are correct, in which case a response of 1 (agree) is considered normatively correct.
4. + scale direction indicates that 1 = agree and 5 = disagree, - scale direction indicates that 1 = disagree and 5 = agree.
5. The mapping between our statement numbering and that in the survey instrument is as follows (our statement = instrument statement): 1 = 22, 2 = 17, 3 = 9, 4 = 14, 5 = 11, 6 = 6, 7 = 18, 8 = 8.
6. Reference "A National Survey of AERA Members' Perceptions of Statistical Significance Tests and Other Statistical Issues", Kathleen Mittag & Bruce Thompson, Educational Researcher, 2000 [link]
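As a short check, the deviation calculation and the 1.725 average can be reproduced directly from the means in the table above. For the positive (+) scale the normative response is 5 (disagree) for incorrect statements and 1 (agree) for correct ones; the direction flips for the negative (-) scale.

```r
# Reproduce the deviations and average deviation from the table above
means    <- c(2.8, 3.0, 2.2, 4.0, 3.8, 2.8, 1.4, 3.0)
correct  <- c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE)
positive <- c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE)

# Normative answer: 5 for (incorrect, +) and (correct, -); 1 otherwise
normative <- ifelse(xor(correct, positive), 5, 1)

deviation <- abs(means - normative)
deviation        # 2.2 2.0 2.8 1.0 1.2 2.2 0.4 2.0
mean(deviation)  # 1.725
```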
In 2001 Howard Gordon followed the recommendation of Mittag and Thompson to survey other professional organizations. The professional organization of focus for his study was the American Vocational Education Research Association (AVERA). It is unclear if AVERA is still in existence, but it appears its purpose was to promote the study of vocational training — career and workforce education — to help improve the efficacy of vocational teacher training and assess outcomes of students educated through the vocational system.
A simple random sample of AVERA members was used and 113 usable surveys were returned. As in Mittag and Thompson (2000), Part II of the Psychometrics Group Instrument was used. We selected the same eight fact-based statements for analysis as in Mittag and Thompson (2000). However, Gordon provided more precise mean responses as well as standard deviations (SD). The figures are shown in the table below.
Despite the higher precision, the results of Gordon align closely with those of Mittag and Thompson (2000). The average deviation from the correct answer across all eight responses in Gordon (2001) was 1.77, close to the 1.725 found in Mittag and Thompson (2000). Indeed, even on a statement-by-statement basis the mean responses are quite similar, with absolute Likert scale differences of 0.57, 0.28, 0.07, 0.18, 0.53, 0.0, 0.05, and 0.05, respectively. Four of the eight statements, therefore, had mean responses within 0.1. Statement 1 and Statement 5 both had differences in mean responses above 0.5. For Statement 1 AVERA members had a mean response closer to the correct answer, while for Statement 5 AERA members were closer to the truth.
For both AERA and AVERA members Statement 7 had the smallest deviation from the correct answer and Statement 3 had the largest deviation. See the discussion of Mittag and Thompson above for an explanation of the statements, their fallacies, and the correct answers.
Statement | Mean response (Likert scale) | SD | Correct answer | Deviation from correct answer (Likert scale) | Scale direction |
---|---|---|---|---|---|
1. It is possible to make both Type I and Type II error in a given study. | 3.37 | 1.21 | Incorrect | 1.63 | + |
2. Type I errors may be a concern when the null hypothesis is not rejected. | 2.72 | 1.19 | Incorrect | 2.28 | + |
3. A Type II error is impossible if the results are statistically significant. | 2.27 | 1.10 | Correct | 2.73 | - |
4. If a dozen different researchers investigated the same phenomenon using the same null hypothesis, and none of the studies yielded statistically significant results, this means that the effects being investigated were not noteworthy or important. | 3.82 | 1.19 | Incorrect | 1.18 | + |
5. Smaller p values provide direct evidence that study effects were larger. | 3.27 | 1.17 | Incorrect | 1.73 | + |
6. Finding that p < .05 is one indication that the results are important. | 2.80 | 1.41 | Incorrect | 2.20 | + |
7. Studies with non-significant results can still be very important. | 1.45 | 1.19 | Correct | 0.45 | + |
8. Smaller and smaller values for the calculated p indicate that the results are more likely to be replicated in future research. | 3.05 | 1.21 | Incorrect | 1.95 | + |
Table notes:
1. Deviation from correct answer is calculated by assuming that -- when the scale is positive -- for incorrect answers 5 (disagree) is normatively correct and subtracts the mean from 5. This is reversed for the negative scale. The same logic applies to answers that are correct, in which case a response of 1 (agree) is considered normatively correct.
2. + scale direction indicates that 1 = agree and 5 = disagree, - scale direction indicates that 1 = disagree and 5 = agree.
3. The mapping between our statement numbering and that in the survey instrument is as follows (our statement = instrument statement): 1 = 22, 2 = 17, 3 = 9, 4 = 14, 5 = 11, 6 = 6, 7 = 18, 8 = 8.
4. Reference "American Vocational Education Research Association Members' Perceptions of Statistical Significance Tests and Other Statistical Controversies", Howard Gordon, Journal of Vocational Education Research, 2001 [link]
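The Gordon deviations, their 1.77 average, and the statement-by-statement comparison with Mittag and Thompson described above can be reproduced directly from the means reported in the two tables.

```r
# Reproduce the Gordon (2001) deviations and the comparison with Mittag & Thompson (2000)
gordon <- c(3.37, 2.72, 2.27, 3.82, 3.27, 2.80, 1.45, 3.05)  # AVERA means
aera   <- c(2.80, 3.00, 2.20, 4.00, 3.80, 2.80, 1.40, 3.00)  # AERA means
normative <- c(5, 5, 5, 5, 5, 5, 1, 5)  # from each statement's correct answer and scale direction

round(abs(gordon - normative), 2)  # 1.63 2.28 2.73 1.18 1.73 2.20 0.45 1.95
mean(abs(gordon - normative))      # 1.77 (vs 1.725 for AERA members)
round(abs(gordon - aera), 2)       # 0.57 0.28 0.07 0.18 0.53 0.00 0.05 0.05
```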
Cliff Effects and Dichotomization of Evidence
[insert]
SOCIAL SCIENCE
There are relatively few studies directly surveying the NHST, p-value, and confidence interval knowledge of social science and teaching researchers and students. Just three papers were found: one testing belief in the false claim that hypothesis testing can prove a hypothesis is true, a second exploring both general NHST and confidence interval misinterpretations, and a third investigating the prevalence and moderators of the cliff effect.
Authors & year | Article title | Category | Subjects | Primary findings |
---|---|---|---|---|
Vallecillos (2000) | Understanding of the Logic of Hypothesis Testing Amongst University Students [link] | NHST misinterpretations | Pedagogy students (n=43) | 1. When shown a statement claiming NHST can prove the truth of a hypothesis, 56% of pedagogy students incorrectly marked the statement as true. None of the pedagogy students who correctly answered the statement also provided a correct written explanation of their reasoning. |
Lyu et al. (2020) | Beyond psychology: prevalence of p value and confidence interval misinterpretation across different fields [link] | NHST and CI misinterpretations | Social science undergraduate students (n=53) Social science masters students (n=94) Social science PhD students (n=44) Social science with a PhD (n=26) |
1. 91% of undergraduate students demonstrated at least one NHST misinterpretation and 93% demonstrated at least one CI misinterpretation. 2. 88% of masters students demonstrated at least one NHST misinterpretation and 93% demonstrated at least one CI misinterpretation. 3. 86% of PhD students demonstrated at least one NHST misinterpretation and 78% demonstrated at least one CI misinterpretation. 4. 96% of subjects with a PhD demonstrated at least one NHST misinterpretation and 96% demonstrated at least one CI misinterpretation. |
Helske et al. (2020) | Are You Sure You’re Sure? - Effects of Visual Representation on the Cliff Effect in Statistical Inference [link] | Cliff effect | Social scientists (n=19) | 1. A cliff effect was found between p-values of 0.04 and 0.06. |
Questionnaires of NHST knowledge
In 2020 Xiao-Kang Lyu, Yuepei Xu, Xiao-Fan Zhao, Xi-Nian Zuo, and Chuan-Peng Hu tested 1,479 students and researchers in China including 217 in the social sciences. They used a four-question instrument where respondents were randomized into either a version where the p-value was statistically significant or statistically nonsignificant. Subjects were prompted to answer each question as either “true” or “false.” Respondents are considered to have a misinterpretation of an item if they incorrectly mark it as “true” — the correct answer to all statements was “false.”
The authors’ instrument wording is shown below:
Suppose you have a treatment that you suspect may alter performance on a certain task. You compare the means of your control and experimental groups (50 subjects in each sample). Your hypotheses are as follows. H0: No significant difference exists between the population means corresponding to experimental and the control groups. H1: Significant difference exists between the experimental and the control groups. Further, suppose you use a simple independent means t test, and your result is t = 2.7, df = 98, p = .008 (in the significant version) or t = 1.26, df = 98, p = .21 (in the non-significant version).
The response statements read as follows, with the nonsignificant version wording appearing in parentheses, substituting for the word directly preceding it.
1. You have absolutely disproved (proved) the null hypothesis.
2. You have found the probability of the null (alternative) hypothesis being true.
3. You know, if you decide to (not to) reject the null hypothesis, the probability that you are making the wrong decision.
4. You have a reliable (unreliable) experimental finding in the sense that if, hypothetically, the experiment was repeated a great number of times, you would obtain a significant result on 99% (21%) of occasions.
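As a quick check on the scenario (our calculation, not part of the instrument), the reported p-values follow directly from the t statistics and degrees of freedom given in the prompt.

```r
# Two-tailed p-values implied by the scenario's t statistics with df = 98
2 * pt(2.7,  df = 98, lower.tail = FALSE)   # ~0.008 (significant version)
2 * pt(1.26, df = 98, lower.tail = FALSE)   # ~0.21  (nonsignificant version)
```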
Using the open source data made available by the authors we attempted to reproduce the findings in Lyu et al. (2020). However, we were not able to reproduce either the top-level figure for NHST or CI or any of the figures from the main table (Table 1). We contacted co-author Chuan-Peng Hu about the possible errors over email and shared our R code. He and the other paper authors then reexamined the data and confirmed our analysis was correct, later issuing a correction to the paper.
In terms of the proportion with at least one NHST misinterpretation, social science researchers had the second lowest rate out of the academic fields surveyed by Lyu et al. (2020), with 89% of participants having at least one misinterpretation. However, social scientists were in the middle of the pack when it came to the average number of misinterpretations, ranking fifth out of eight with 1.82 out of four possible. The nonsignificant version elicited a higher average number of misinterpretations than the significant version, 1.93 compared to 1.71. This pattern was true for all fields except general science.
The statement with the highest proportion of incorrect responses across both versions was statement three, which was also the most misinterpreted statement across all fields. A separate version of this statement was also the most misinterpreted across three studies in psychology: Oakes (1986), a 2002 replication by Haller and Krauss, and a further replication in 2018 by Lyu et al. (translated into Chinese).
Participants likely have an especially difficult time with statement three because it is a very subtle reversal of conditional probabilities involving the Type I error rate. While the Type I error rate is the probability of rejecting the null hypothesis given that the null hypothesis is actually true, this question asks about the probability of the null hypothesis being true given that the null hypothesis has been rejected. In fact knowing the Type I error rate does not involve anything more than the pre-specified value called “alpha” — typically set to 5% — so none of the test results would need to be presented in a hypothetical scenario to determine this rate.
One might argue that the language is so subtle that some participants who have a firm grasp of Type I error may mistakenly believe this question is simply describing the Type I error definition. In the significant version of the instrument statement three has two clauses: (1) “if you decide to reject the null hypothesis” and (2) “the probability that you are making the wrong decision.” In one order these clauses read: “the probability that you are making the wrong decision if you decide to reject the null hypothesis.” With the implicit addition of the null being true, this statement is achingly close to one way of stating the Type I error rate, “The probability that you wrongly reject a true null hypothesis.” Read in the opposite order these two clauses form the statement on the instrument. There is no temporal indication in the statement itself as to which order the clauses should be read in, such as “first…then…”. While it is true that in English we read left to right, it is also true that many English statements can have their clauses reversed without changing the meaning of the statement. Other questions in the instrument are likely more suggestive of participants having an NHST misinterpretation.
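The distinction can be made concrete with a small simulation. The sketch below is ours, with an assumed 50/50 prior on the null and an assumed 0.3 SD effect when the null is false: alpha pins down the probability of rejecting a true null, but the probability that the null is true given a rejection can be quite different.

```r
# Sketch of the conditional-probability reversal (assumed prior and effect size):
# alpha fixes P(reject | H0 true), not P(H0 true | reject), which statement three asks about.
set.seed(1)
reps <- 10000
n <- 50
h0_true <- runif(reps) < 0.5   # assume half of the simulated studies have a truly null effect
p_vals <- sapply(h0_true, function(is_null) {
  effect <- if (is_null) 0 else 0.3
  t.test(rnorm(n), rnorm(n, mean = effect))$p.value
})
reject <- p_vals < 0.05

mean(reject[h0_true])   # ~0.05: the Type I error rate, P(reject | H0 true)
mean(h0_true[reject])   # generally not 0.05: P(H0 true | reject)
```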
Statement summaries | Significant version | Nonsignificant version |
---|---|---|
1. You have absolutely disproved (proved) the null hypothesis | 59% | 53% |
2. You have found the probability of the null (alternative) hypothesis being true. | 45% | 49% |
3. You know, if you decide to (not to) reject the null hypothesis, the probability that you are making the wrong decision. | 67% | 59% |
4. You have a reliable (unreliable) experimental finding in the sense that if, hypothetically, the experiment was repeated a great number of times, you would obtain a significant result on 99% (21%) of occasions. | 59% | 45% |
Table notes:
1. Percentages do not add to 100% because each respondent answered all four questions.
2. Sample sizes: significant version (n=111), nonsignificant version (n=106).
3. Reference: "Beyond psychology: prevalence of p value and confidence interval misinterpretation across different fields", Xiao-Kang Lyu, Yuepei Xu, Xiao-Fan Zhao, Xi-Nian Zuo, and Chuan-Peng Hu, Frontiers in Psychology, 2020, paper: [link], data: [link]
A breakdown of NHST misunderstandings by education is shown below. All education levels had high rates of misunderstanding at least one NHST question. PhD students fared best, with “only” 86% demonstrating at least one NHST misinterpretation, but they also had the highest average number of incorrect responses with 2.1. There was differentiation between education levels in the average number of incorrect responses, with undergraduates marking 1.6 questions incorrect on average, while PhD students marked 2.1 questions incorrect.
Education | Sample size | Percentage with at least one NHST misunderstanding | Average number of NHST misunderstandings (out of four) |
---|---|---|---|
Undergraduates | 53 | 91% | 1.6 |
Masters | 94 | 88% | 1.8 |
PhD | 44 | 86% | 2.1 |
Post-PhD | 26 | 96% | 1.9 |
Questionnaires of Confidence Interval Knowledge
Because of the difficulty in properly interpreting NHST, confidence intervals have been proposed as an alternative [citation]. Confidence intervals also have the benefit of giving a measure of the precision of the effect size. For that reason, in addition to NHST instruments, some researchers have also tested confidence interval misinterpretations. Again, most of this research has occurred in the field of psychology. Only Lyu et al. (2020) have directly tested social scientists for common confidence interval misinterpretations. A total of 217 social scientists were surveyed, all from China.
Lyu et al. (2020) used a modified version of their four-question NHST instrument adapted to test confidence interval knowledge. There were two versions, one with a statistically significant result and one without. The English translation of the hypothetical experimental situation and four statements is shown below. The significant version of each statement read as follows, with the nonsignificant version wording appearing in parentheses, substituting for the word directly preceding it.
The 95% (bilateral) confidence interval of the mean difference between the experimental group and the control group ranges from .1 to .4 (–.1 to .4).
1. A 95% probability exists that the true mean lies between .1 (–.1) and .4.
2. If we were to repeat the experiment over and over, then 95% of the time the true mean falls between .1 (–.1) to .4.
3. If the null hypothesis is that no difference exists between the mean of experimental group and control group, then the experiment has disproved (proved) the null hypothesis.
4. The null hypothesis is that no difference exists between the means of the experimental and the control groups. If you decide to (not to) reject the null hypothesis, then the probability that you are making the wrong decision is 5%.
At 93%, social scientists had the second highest proportion of respondents with at least one confidence interval misinterpretation out of eight professions surveyed by Lyu et al. (2020). The nonsignificant version had a slightly higher proportion of misinterpretations compared to the significant version, 93% compared to 92%. The social sciences along with medicine were the only two fields that followed this pattern; the other six fields all had a higher proportion of at least one misunderstanding in the significant version.
Although social scientists had the second highest proportion of respondents with at least one confidence interval misinterpretation, they had the lowest average number of confidence interval misinterpretations, 1.67 out of a possible four. This was lower than the 1.82 average misinterpretations on the NHST instrument. Here the significant version had a higher average number of misinterpretations than the nonsignificant version, 1.71 compared to 1.63. Again, only the social sciences and medicine followed this pattern; all six other fields had higher average numbers of misinterpretations for the nonsignificant version.
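The frequentist property that statements 1 and 2 miss can be illustrated with a short simulation (ours, with assumed sample sizes and an assumed true difference): roughly 95% of intervals constructed by this procedure cover the fixed true mean difference, but any single reported interval, such as .1 to .4, either contains it or does not.

```r
# Sketch: coverage is a property of the interval-constructing procedure,
# not a probability statement about any one reported interval.
# (Assumed true difference of 0.25 and 50 subjects per group.)
set.seed(1)
true_diff <- 0.25
covers <- replicate(10000, {
  ci <- t.test(rnorm(50, mean = true_diff), rnorm(50))$conf.int
  ci[1] <= true_diff && true_diff <= ci[2]
})
mean(covers)   # ~0.95 of repeated experiments produce an interval covering the true difference
```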
Statement summaries | Significant version | Nonsignificant version |
---|---|---|
1. A 95% probability exists that the true mean lies between .1 (–.1) and .4. | 67% | 63% |
2. If we were to repeat the experiment over and over, then 95% of the time the true mean falls between 0.1 (-0.1) and 0.4. | 59% | 60% |
3. If the null hypothesis is that there is no difference between the mean of experimental group and control group, the experiment has disproved (proved) the null hypothesis. | 48% | 50% |
4. The null hypothesis is that there is no difference between the mean of experimental group and control group. If you decide to (not to) reject the null hypothesis, the probability that you are making the wrong decision is 5%. | 56% | 63% |
Table notes:
1. Percentages do not add to 100% because each respondent answered all four questions.
2. Sample sizes: significant version (n=111), nonsignificant version (n=106).
3. Reference: "Beyond psychology: prevalence of p value and confidence interval misinterpretation across different fields", Xiao-Kang Lyu, Yuepei Xu, Xiao-Fan Zhao, Xi-Nian Zuo, and Chuan-Peng Hu, Frontiers in Psychology, 2020, paper: [link], data: [link]
A breakdown of confidence interval misunderstandings by education is shown below. All education levels had high misinterpretation rates of confidence intervals, ranging from 91% to 96%. PhD students had the lowest rate of at least one confidence interval misinterpretation (91%), while masters students had the lowest average number of incorrect responses (1.5).
Education | Sample size | Percentage with at least one CI misunderstanding | Average number of CI misunderstandings (out of four) |
---|---|---|---|
Undergraduates | 53 | 93% | 1.8 |
Masters | 94 | 93% | 1.5 |
PhD | 44 | 91% | 1.8 |
Post-PhD | 26 | 96% | 1.7 |
Cliff Effects and Dichotomization of Evidence
Another study that examined the cliff effect was Helske et al. (2020). More than a hundred researchers participated, including 19 in various social science fields: political science (n = 4), sociology (n = 10), linguistics (n = 1), philosophy (n = 1), education (n = 2), and general social science (n = 1). In terms of education, 11 researchers had received a PhD, six had received a master’s degree, and two had bachelor’s degrees.
As in McShane and Gal, a hypothetical scenario was presented. The instrument wording was as follows:
A random sample of 200 adults from Sweden were prescribed a new medication for one week. Based on the information on the screen, how confident are you that the medication has a positive effect on body weight (increase in body weight)?
One of four visualizations was then presented: a text box describing the p-value and 95% confidence interval, a 95% confidence interval visual display, a gradient confidence interval visual display, or a violin plot visual display. For each scenario respondents were presented with one of eight p-values between 0.001 and 0.8. The specific p-values were 0.001, 0.01, 0.04, 0.05, 0.06, 0.1, 0.5, and 0.8. Respondents then used a slidebar to select their confidence on a scale of 0 to 100.
Using the open source data made available by the authors we analyzed the extent of a cliff effect. A cliff effect was apparent both when plotting the drop in confidence segmented by visual presentation type and in the analysis described below.
In our analysis only results from the p-value and confidence interval (CI) visual presentation types are presented since these are the most common methods of presenting analytical results. However, in their analysis Helske et al. looked across all 114 respondents and employed Bayesian multilevel models to investigate the influence of the four visual presentation types. The authors concluded that gradient and violin presentation types may moderate the cliff effect in comparison to standard p-value descriptions or confidence interval bounds.
Visualization | Largest difference in confidence | Difference in confidence (percentage points) |
---|---|---|
P-value | 0.04 to 0.06 | 20 |
CI | 0.04 to 0.06 | 28 |
Although the p-values presented to respondents were not evenly spaced, the drop in confidence between two consecutive p-values was used to determine the presence of a cliff effect. One additional difference was calculated, that between p-values of 0.04 and 0.06, the typical cliff effect boundary.
For both the confidence interval visual presentation type and the p-value visual presentation type the largest drop in confidence was indeed associated with the 0.04 and 0.06 interval.
ENGINEERING, AGRONOMY, & GENERAL SCIENCE
There are relatively few studies directly surveying the NHST, p-value, and confidence interval knowledge of engineering, science, and agronomy researchers and students. Just four papers were found: one testing belief in the false claim that hypothesis testing can prove a hypothesis is true, a second investigating NHST interpretations of nonsignificance as “no effect,” a third exploring both general NHST and confidence interval misinterpretations, and a fourth investigating the prevalence and moderators of the cliff effect.
Authors & year | Article title | Category | Subjects | Primary findings |
---|---|---|---|---|
Vallecillos (2000) | Understanding of the Logic of Hypothesis Testing Amongst University Students [link] | NHST misinterpretations | Civil engineering students (n=93) Computer science students (n=63) |
1. When shown a statement claiming NHST can prove the truth of a hypothesis, 38% of civil engineering students incorrectly marked the statement as true. Only two civil engineering students who correctly answered the statement also provided a correct written explanation of their reasoning. 2. When shown a statement claiming NHST can prove the truth of a hypothesis, 44% of computer science students incorrectly marked the statement as true. Only one computer science student who correctly answered the statement also provided a correct written explanation of their reasoning. |
Fidler & Loftus (2009) | Why Figures with Error Bars Should Replace p Values: Some Conceptual Arguments and Empirical Demonstrations [link] | Confidence intervals | Final-year bachelor and masters students in environmental science classes at the University of Melbourne (n=79) Second-year ecology students at the University of Melbourne (n=55) |
For both populations confidence intervals reduced interpretations of statistical nonsignificance as "no effect" relative to p-value descriptions. |
Lyu et al. (2020) | Beyond psychology: prevalence of p value and confidence interval misinterpretation across different fields [link] | NHST misinterpretations | Engineering and agronomy undergraduate students (n=22) Engineering and agronomy masters students (n=65) Engineering and agronomy PhD students (n=36) Engineering and agronomy with a PhD (n=28) |
1. 82% of undergraduate students demonstrated at least one NHST misinterpretation and 82% demonstrated at least one CI misinterpretation. 2. 99% of masters students demonstrated at least one NHST misinterpretation and 92% demonstrated at least one CI misinterpretation. 3. 92% of PhD students demonstrated at least one NHST misinterpretation and 86% demonstrated at least one CI misinterpretation. 4. 89% of subjects with a PhD demonstrated at least one NHST misinterpretation and 96% demonstrated at least one CI misinterpretation. |
Helske et al. (2020) | Are You Sure You’re Sure? - Effects of Visual Representation on the Cliff Effect in Statistical Inference [link] | Cliff effect | Human-Computer Interaction (HCI) researchers (n=32) General engineering (n=9) |
1. No cliff effect was found for HCI researchers using simple descriptive statistics. 2. A moderate cliff effect was found for general engineers. 3. Looking at all 114 respondents across various fields, the authors found via Bayesian multilevel models that in a one-sample t-test the cliff effect is moderated by using gradient or violin visual presentations over standard confidence interval bounds. |
Questionnaires of NHST knowledge
During the 1991-1992 academic year researcher Augustias Vallecillos asked 436 university students across seven different academic specializations to respond to a simple NHST statement. This survey included 93 students in the field of civil engineering and 63 in computer science. It is unclear how many universities were included and what their location was. The results were written up in Vallecillos’ 1994 Spanish-language paper, “Estudio teorico-experimental de errores y concepciones sobre el contraste estadistico de hipotesis en estudiantes universitarios.” The results appeared again in his 2000 English-language article, “Understanding of the Logic of Hypothesis Testing Amongst University Students.” What is presented here is from his 2000 work.
Vallecillos’ statement was a short sentence asking about the ability of the NHST procedure to prove either the null or alternative hypotheses:
A statistical hypotheses test, when properly performed, establishes the truth of one of the two null or alternative hypotheses.
University speciality | Sample size | Correct answer | Incorrect answer |
---|---|---|---|
Civil engineering | 93 | 57% | 38% |
Computer science | 63 | 46% | 44% |
Table notes:
1. The exact number of respondents coded under each category were as follows. Civil engineering students: true - 35, false - 53, blank - 5 (5.4%). Computer science students: true - 28, false - 29, blank - 6 (9.5%).
2. Reference: "Understanding of the Logic of Hypothesis Testing Amongst University Students", Augustias Vallecillos, Journal für Mathematik-Didaktik, 2000 [link]
Students were asked to answer either “true” or “false” and to explain their answer (although an explanation was not required). The correct answer to the statement is false because NHST only measures the compatibility between observed data and a null hypothesis. It cannot prove either hypothesis true. In addition, the alternative hypothesis is not explicitly considered in the NHST model nor is the compatibility of the null hypothesis considered relative to the alternative.
The quantitative results for both specialties surveyed are shown in the table above. Correct and incorrect answers do not add up to 100% because some students left the response blank. Vallecillos includes the percentage of blank responses in his presentation (we have omitted those figures from the table above for clarity). It is unclear why blank responses were included in Vallecillos’ table instead of treating blanks as non-responses and omitting them completely from the response calculation. It may be that some subjects did not give a true/false response but did give a written response; however, this is not explicitly stated.
Vallecillos coded the written explanation of student answers into one of six categories:
Correct argument (C) - These responses are considered to be completely correct.
Example response: “The hypotheses test is based on inferring properties from the population based on some sample data. The result means that one of the two hypotheses is accepted, but does not mean that it is true.”
Partially correct argument (PC) - These responses are considered to be partially, but not completely correct because “answers analysed include other considerations regarding the way of taking the decision, which are not always correct.”
Example response: “What it does establish is the acceptance or rejection of one of the two hypotheses.”
Mistaken argument that NHST establishes the truth of a hypothesis (M1) - These responses include explanations about why the initial statement proposed by Vallecillos was true.
Example response: “Because before posing the problem, we have to establish which one is the null hypothesis H0 and the alternative H1 and one of the two has to be true.”
Mistaken argument that hypothesis testing establishes the probability of the hypotheses (M2) - This argument is another case of the Inverse Probability Fallacy. As the results of other studies summarized in this article show it is quite common among both students and professionals.
Example response: “What is established is the probability, with a margin of error, that one of the hypotheses is true.”
Other mistaken arguments (M3) - This category includes all other mistaken arguments that do not fall into either M1 or M2.
Example response: “What it establishes is the possibility that the answer formed is the correct one.”
Arguments that are difficult to interpret (DI) - These arguments were either not interpretable or did not address the subject’s reasoning behind answering the statement.
Example response: “The statistical hypotheses test is conditioned by the size of the sample and the level of significance.”
Not all of the respondents to the statement gave a written explanation; 79 of the 93 civil engineering students (85%) gave written explanations and just 29 of the 63 computer science students (46%) gave written explanations.
Summary results are shown in the table below. Percentages are out of the number who gave written explanations, not out of the number that provided a response to the original statement. For both populations M1 and M2 were the most common response categories, although for civil engineers more than a quarter were partially correct in their written response. Only two civil engineering students and one computer science student were correct in their written explanations.
[compare to other types of misinterpretations]
Vallecillos notes that when considering the full sample of all 436 students, 9.7% of those who correctly answered the statement also provided a correct written explanation and 31.9% of the students who correctly answered the statement gave a partially correct written explanation. This means that across the full sample about 60% of students who correctly answered the statement did so for incorrect reasons or were not able to clearly articulate their reasoning.
University speciality | Number of subjects who provided written explanations | C | PC | M1 | M2 | M3 | DI |
---|---|---|---|---|---|---|---|
Civil engineering | 79 | 3% | 28% | 30% | 29% | 5% | 5% |
Computer science | 29 | 3% | 17% | 21% | 35% | 10% | 14% |
Table notes:
1. Key: C - "Correct", PC - "Partially correct", M1 - "Mistake 1", M2 - "Mistake 2", M3 - "Mistake 3", DI - "Difficult to interpret". See full explanations in description above the table.
2. Percentages have been rounded for clarity and may not add to 100%.
3. The exact number of respondents coded under each category were as follows. Civil engineering students: C - 2, PC - 22, M1 - 24, M2 - 23, M3 - 4, DI - 4. Computer science students: C - 1, PC - 5, M1 - 6, M2 - 10, M3 - 3, DI - 4.
4. Reference: "Understanding of the Logic of Hypothesis Testing Amongst University Students", Augustias Vallecillos, Journal für Mathematik-Didaktik, 2000 [link]
In 2020 Xiao-Kang Lyu, Yuepei Xu, Xiao-Fan Zhao, Xi-Nian Zuo, and Chuan-Peng Hu tested 1,479 students and researchers in China including 247 in the sciences and 151 in engineering and agronomy. They used a four-question instrument where respondents were randomized into either a version where the p-value was statistically significant or statistically nonsignificant. Subjects were prompted to answer each question as either “true” or “false.” Respondents are considered to have a misinterpretation of an item if they incorrectly mark it as “true” — the correct answer to all statements was “false.”
The authors’ instrument wording is shown below:
Suppose you have a treatment that you suspect may alter performance on a certain task. You compare the means of your control and experimental groups (50 subjects in each sample). Your hypotheses are as follows. H0: No significant difference exists between the population means corresponding to experimental and the control groups. H1: Significant difference exists between the experimental and the control groups. Further, suppose you use a simple independent means t test, and your result is t = 2.7, df = 98, p = .008 (in the significant version) or t = 1.26, df = 98, p = .21 (in the non-significant version).
The response statements read as follows, with the nonsignificant version wording appearing in parentheses, substituting for the word directly preceding it.
1. You have absolutely disproved (proved) the null hypothesis.
2. You have found the probability of the null (alternative) hypothesis being true.
3. You know, if you decide to (not to) reject the null hypothesis, the probability that you are making the wrong decision.
4. You have a reliable (unreliable) experimental finding in the sense that if, hypothetically, the experiment was repeated a great number of times, you would obtain a significant result on 99% (21%) of occasions.
Using the open source data made available by the authors we attempted to reproduce the findings in Lyu et al. (2020). However, we were not able to reproduce either the top-level figure for NHST or CI or any of the figures from the main table (Table 1). We contacted co-author Chuan-Peng Hu about the possible errors over email and shared our R code. He and the other paper authors then reexamined the data and confirmed our analysis was correct, later issuing a correction to the paper.
In terms of the proportion with at least one NHST misinterpretation, scientists ranked fourth out of the eight academic fields surveyed by Lyu et al. (2020), with 92% of participants having at least one misinterpretation. However, scientists had the lowest average number of misinterpretations with 1.70 out of four possible. The nonsignificant version elicited a lower average number of misinterpretations than the significant version, 1.65 compared to 1.74. Science was the only field in which the nonsignificant version had a lower average number of misinterpretations.
The highest proportion of incorrect responses varied across versions. Statement four was the most misinterpreted in the significant version, while statement one was the most misinterpreted in the nonsignificant version.
Statement three merits discussion because it is a very subtle reversal of conditional probabilities involving the Type I error rate. While the Type I error rate is the probability of rejecting the null hypothesis given that the null hypothesis is actually true, this question asks about the probability of the null hypothesis being true given that the null hypothesis has been rejected. In fact knowing the Type I error rate does not involve anything more than the pre-specified value called “alpha” — typically set to 5% — so none of the test results would need to be presented in a hypothetical scenario to determine this rate.
One might argue that the language is so subtle that some participants who have a firm grasp of Type I error may mistakenly believe this question is simply describing the Type I error definition. In the significant version of the instrument statement three has two clauses: (1) “if you decide to reject the null hypothesis” and (2) “the probability that you are making the wrong decision.” In one order these clauses read: “the probability that you are making the wrong decision if you decide to reject the null hypothesis.” With the implicit addition of the null being true, this statement is achingly close to one way of stating the Type I error rate, “The probability that you wrongly reject a true null hypothesis.” Read in the opposite order these two clauses form the statement on the instrument. There is no temporal indication in the statement itself as to which order the clauses should be read in, such as “first…then…”. While it is true that in English we read left to right, it is also true that many English statements can have their clauses reversed without changing the meaning of the statement. Other questions in the instrument are likely more suggestive of participants having an NHST misinterpretation.
Statement summaries | Significant version | Nonsignificant version |
---|---|---|
1. You have absolutely disproved (proved) the null hypothesis | 53% | 63% |
2. You have found the probability of the null (alternative) hypothesis being true. | 58% | 57% |
3. You know, if you decide to (not to) reject the null hypothesis, the probability that you are making the wrong decision. | 53% | 54% |
4. You have a reliable (unreliable) experimental finding in the sense that if, hypothetically, the experiment was repeated a great number of times, you would obtain a significant result on 99% (21%) of occasions. | 62% | 60% |
Table notes:
1. Percentages do not add to 100% because each respondent answered all four questions.
2. Sample sizes: significant version (n=133), nonsignificant version (n=114).
3. Reference: "Beyond psychology: prevalence of p value
and confidence interval misinterpretation
across different fields", Xiao-Kang Lyu, Yuepei Xu2, Xiao-Fan Zhao1, Xi-Nian Zuo, and Chuan-Peng Hu, Frontiers in Psychology, 2020, paper: [link], data: [link]
A breakdown of NHST misunderstandings by education is shown below. All education levels had high rates of misunderstanding at least one NHST question. Post-PhD scientists fared best, with “only” 89% demonstrating at least one NHST misinterpretation, while undergraduates had the highest proportion at 97%. Post-PhD researchers nonetheless had the highest average number of incorrect responses at 1.9, while PhD students had the lowest at 1.5 and undergraduates and masters students each averaged 1.6.
Education | Sample size | Percentage with at least one NHST misunderstanding | Average number of NHST misunderstandings (out of four) |
---|---|---|---|
Undergraduates | 29 | 97% | 1.6 |
Masters | 71 | 92% | 1.6 |
PhD | 65 | 95% | 1.5 |
Post-PhD | 82 | 89% | 1.9 |
Turning attention to the 151 engineering and agronomy students and researchers surveyed by Lyu et al. (2020), they had the third lowest rate out of the eight academic fields surveyed with 93% of participants having at least one misinterpretation. Engineers and agronomists were in the middle of the pack when it came to the average number of misinterpretations, ranking fourth out of eight with 1.83 out of four possible. The nonsignificant version had a higher average number of misinterpretations than the significant version, 1.96 compared to 1.68. This pattern was true for all fields except science.
The highest proportion of incorrect responses varied between statements with 63% of subjects answering both statements two and three incorrectly on the significant version. Meanwhile, 57% answered statement one incorrectly on the nonsignificant version.
Statement summaries | Significant version | Nonsignificant version |
---|---|---|
1. You have absolutely disproved (proved) the null hypothesis | 53% | 57% |
2. You have found the probability of the null (alternative) hypothesis being true. | 63% | 43% |
3. You know, if you decide to (not to) reject the null hypothesis, the probability that you are making the wrong decision. | 63% | 56% |
4. You have a reliable (unreliable) experimental finding in the sense that if, hypothetically, the experiment was repeated a great number of times, you would obtain a significant result on 99% (21%) of occasions. | 54% | 48% |
Table notes:
1. Percentages do not add to 100% because each respondent answered all four questions.
2. Sample sizes: significant version (n=72), nonsignificant version (n=79).
3. Reference: "Beyond psychology: prevalence of p value
and confidence interval misinterpretation
across different fields", Xiao-Kang Lyu, Yuepei Xu2, Xiao-Fan Zhao1, Xi-Nian Zuo, and Chuan-Peng Hu, Frontiers in Psychology, 2020, paper: [link], data: [link]
A breakdown of NHST misunderstandings by education is shown below. All education levels had high rates of misunderstanding at least one NHST question. Undergraduate students fared best, with “only” 82% demonstrating at least one NHST misinterpretation, while masters students had the highest proportion at 99%. Undergraduates also had the highest average number of incorrect responses with 2.0, while post-PhD researchers marked 1.7 questions incorrect on average.
Education | Sample size | Percentage with at least one NHST misunderstanding | Average number of NHST misunderstandings (out of four) |
---|---|---|---|
Undergraduates | 22 | 82% | 2.0 |
Masters | 65 | 99% | 1.8 |
PhD | 36 | 92% | 1.9 |
Post-PhD | 28 | 89% | 1.7 |
Questionnaires of Confidence Interval Knowledge
Because of the difficulty in properly interpreting NHST, confidence intervals have been proposed as an alternative [citation]. Confidence intervals also have the benefit of giving a measure of the precision of the effect size. For that reason, in addition to NHST instruments, some researchers have also tested confidence interval misinterpretations. Again, most of this research has occurred in the field of psychology. Only Lyu et al. (2020) have directly tested scientists and engineering and agronomy researchers for common confidence interval misinterpretations. A total of 247 scientists and 151 engineering and agronomy researchers were surveyed, all from China.
Lyu et al. (2020) used a modified version of their four-question NHST instrument adapted to test confidence interval knowledge. There were two versions, one with a statistically significant result and one without. The English translation of the hypothetical experimental situation and four statements are shown below. The significant version of each statement read as follows, with the nonsignificant version wording appearing in parentheses, substituting for the word directly preceding it.
The 95% (bilateral) confidence interval of the mean difference between the experimental group and the control group ranges from .1 to .4 (–.1 to .4).
1. A 95% probability exists that the true mean lies between .1 (–.1) and .4.
2. If we were to repeat the experiment over and over, then 95% of the time the true mean falls between .1 (–.1) to .4.
3. If the null hypothesis is that no difference exists between the mean of experimental group and control group, then the experiment has disproved (proved) the null hypothesis.
4. The null hypothesis is that no difference exists between the means of the experimental and the control groups. If you decide to (not to) reject the null hypothesis, then the probability that you are making the wrong decision is 5%.
At 92%, scientists had the third highest proportion of respondents with at least one confidence interval misinterpretation out of eight professions surveyed by Lyu et al. (2020). The nonsignificant version had a slightly lower proportion of misinterpretations compared to the significant version, 90% compared to 93%. This pattern was true for all fields except social sciences and medicine.
Scientists also had the third highest average number of confidence interval misinterpretations, 1.72 out of a total of four possible. This was comparable to the 1.70 average misinterpretations in the NHST instrument. The significant version had a lower average number of misinterpretations than the nonsignificant version, 1.66 compared to 1.79. This pattern was true for all fields except social sciences and medicine; all six other fields had higher average numbers of misinterpretations for the nonsignificant version.
Statement summaries | Significant version | Nonsignificant version |
---|---|---|
1. A 95% probability exists that the true mean lies between .1 (–.1) and .4. | 56% | 62% |
2. If we were to repeat the experiment over and over, then 95% of the time the true mean falls between 0.1 (-0.1) and 0.4. | 59% | 53% |
3. If the null hypothesis is that there is no difference between the mean of experimental group and control group, the experiment has disproved (proved) the null hypothesis. | 57% | 54% |
4. The null hypothesis is that there is no difference between the mean of experimental group and control group. If you decide to (not to) reject the null hypothesis, the probability that you are making the wrong decision is 5%. | 62% | 52% |
Table notes:
1. Percentages do not add to 100% because each respondent answered all four questions.
2. Sample sizes: significant version (n=133), nonsignificant version (n=114).
3. Reference: "Beyond psychology: prevalence of p value
and confidence interval misinterpretation
across different fields", Xiao-Kang Lyu, Yuepei Xu2, Xiao-Fan Zhao1, Xi-Nian Zuo, and Chuan-Peng Hu, Frontiers in Psychology, 2020, paper: [link], data: [link]
A breakdown of confidence interval misunderstandings by education is shown below. All education levels had high misinterpretation rates of confidence intervals, ranging from 90% to 97%. Overall, post-PhD researchers fared best with the lowest rate of confidence interval misinterpretations (90%), but the group had the highest average number of misinterpretations with 1.9.
Education | Sample size | Percentage with at least one CI misunderstanding | Average number of CI misunderstandings (out of four) |
---|---|---|---|
Undergraduates | 29 | 97% | 1.5 |
Masters | 71 | 92% | 1.6 |
PhD | 65 | 91% | 1.8 |
Post-PhD | 82 | 90% | 1.9 |
Turning attention to the 151 engineering and agronomy students and researchers surveyed by Lyu et al. (2020), they had the second lowest rate out of the eight academic fields surveyed with 90% of participants having at least one misinterpretation. The nonsignificant version had a slightly lower proportion of misinterpretations compared to the significant version, 89.9% compared to 90.3%. This pattern was true for all fields except social sciences and medicine.
Engineers and agronomists had the second highest average number of confidence interval misinterpretations, 1.90 out of a total of four possible. This was slightly higher than the 1.83 average misinterpretations in the NHST instrument. The significant version had a lower average number of misinterpretations than the nonsignificant version, 1.86 compared to 1.94. This pattern was true for all fields except social sciences and medicine; all six other fields had higher average numbers of misinterpretations for the nonsignificant version.
Statement summaries | Significant version | Nonsignificant version |
---|---|---|
1. A 95% probability exists that the true mean lies between .1 (–.1) and .4. | 53% | 54% |
2. If we were to repeat the experiment over and over, then 95% of the time the true mean falls between 0.1 (-0.1) and 0.4. | 56% | 50% |
3. If the null hypothesis is that there is no difference between the mean of experimental group and control group, the experiment has disproved (proved) the null hypothesis. | 53% | 44% |
4. The null hypothesis is that there is no difference between the mean of experimental group and control group. If you decide to (not to) reject the null hypothesis, the probability that you are making the wrong decision is 5%. | 53% | 58% |
Table notes:
1. Percentages do not add to 100% because each respondent answered all four questions.
2. Sample sizes: significant version (n=72), nonsignificant version (n=79).
3. Reference: "Beyond psychology: prevalence of p value
and confidence interval misinterpretation
across different fields", Xiao-Kang Lyu, Yuepei Xu2, Xiao-Fan Zhao1, Xi-Nian Zuo, and Chuan-Peng Hu, Frontiers in Psychology, 2020, paper: [link], data: [link]
A breakdown of confidence interval misunderstandings by education is shown below. All education levels had high misinterpretation rates of confidence intervals, ranging from 82% to 96%. Overall, undergraduates fared best with the lowest rate of confidence interval misinterpretations (82%), though they were tied with PhD students for the highest average number of misinterpretations at 2.0.
Education | Sample size | Percentage with at least one CI misunderstanding | Average number of CI misunderstandings (out of four) |
---|---|---|---|
Undergraduates | 22 | 82% | 2.0 |
Masters | 65 | 92% | 1.9 |
PhD | 36 | 86% | 2.0 |
Post-PhD | 28 | 96% | 1.8 |
In two studies published in a 2009 paper, Fiona Fidler and Geoffrey Loftus investigated the impact of confidence interval presentation versus traditional p-value descriptions on the tendency to interpret a statistically nonsignificant result as evidence of “no effect.” Equating nonsignificance with no effect is a common misinterpretation: a nonsignificant result means only that the NHST procedure has failed to reject the null hypothesis, not that the null has been shown true. The estimated effect size is always the value most compatible with the observed data regardless of the p-value, and thus even within the NHST paradigm the estimated effect size remains the best estimate of the true population parameter of interest.
Study one surveyed 79 final-year bachelor and master’s ecology students from the University of Melbourne. According to the authors, “All subjects had at least one prior semester of statistics, and were more than half-way through a second quantitative course in risk assessment or environmental problem-solving.” The authors also note that the students had been warned several times throughout the year against misinterpreting statistical nonsignificance as evidence of no effect. For that reason the positive impact of confidence intervals reported below may be understated.
Four different ecological scenarios were developed. Each scenario was designed so that the result was statistically nonsignificant, but the effect size was practically meaningful in an ecological context. Each scenario had both an NHST version, in which a p-value was presented, and a confidence interval version, in which a visual representation of the confidence interval was presented instead. In the confidence interval version the p-value was omitted. Respondents were randomized so that they each saw two of the four scenarios, but would always be presented with one NHST version and one confidence interval version.
One scenario is shown below as an example. This scenario involves the practice of toe-clipping frogs. Regardless of version, each respondent first saw the following wording:
Toe-clipping is commonly used to mark frogs in population ecology studies because other methods of marking don’t work on their skin. It is a valuable technique but there is some controversy over whether it affects recapture rates and, therefore, frog survival.
This study examined the decline in recapture rate of 60 frogs that had toes clipped.
The confidence interval version accompanied the scenario by displaying a visual representation of the confidence interval showing that it covered zero. A horizontal line was displayed to show the cut-off for unacceptably low rates of recapture. A diamond was used to show the estimated effect size. The full confidence interval wording for the frog scenario accompanying the confidence interval visual presentation is shown below:
In the figure above, the Y axis shows proportion change in recapture rate (negative values show proportion decline, positive values show increase). The horizontal line crossing at 0 indicates no effect on recapture rate. The thicker, horizontal line crossing at -.05 indicates the minimum decline we understand to be ecologically unacceptable. If the true proportion decline exceeds .05, toe clipping is an unacceptable practice. The black diamond is the change in recapture rate for this sample; the error bar is 95% confidence interval.
The p-value version instead presented the following wording. No visual representation was provided. All italics are original.
The minimum ecologically unacceptable decline in recapture rate is known to be .05. If the true proportion decline exceeds .05, toe clipping is an unacceptable practice. The proportion decline in this sample was .08. This proportion (.08) is statistically not significantly different from zero (one sided t test = 1.1, p = .27; df = 59). The a priori statistical power of this test, to detect a decline of .05, was 40%.
Respondents then saw the following prompt and had to select one answer.
In response to this information, the researcher who conducted this study should conclude that:
– There is strong evidence in support of an important effect.
– There is moderate evidence in support of an important effect.
– The evidence is equivocal.
– There is moderate evidence of no effect.
– There is strong evidence of no effect.
The specific wording of each statement varied by scenario since what constituted an “effect” was scenario dependent. For instance, in the frog scenario the last option read: “There is strong evidence that toe clipping does not cause unacceptable decline.”
Respondents selecting either of the last two statements, moderate or strong evidence for no effect, were considered to have made a misinterpretation.
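For intuition, the interval shown to the confidence interval group in the frog scenario can be reconstructed approximately from the statistics reported in the p-value version. This is a sketch under two assumptions not stated in the study: that the standard error equals the observed decline divided by the reported t statistic, and that the plotted interval is a standard two-sided 95% t interval.

```r
# Approximate reconstruction of the frog-scenario 95% confidence interval
# from the statistics reported in the p-value version (assumptions above).
decline <- 0.08   # observed proportion decline in recapture rate
t_stat  <- 1.1    # reported t statistic
df      <- 59     # reported degrees of freedom

se <- decline / t_stat                        # roughly 0.073
ci <- decline + c(-1, 1) * qt(0.975, df) * se
round(ci, 3)
# Roughly -0.07 to 0.23: the interval covers zero (hence nonsignificance)
# but also covers declines well beyond the unacceptable threshold of 0.05.
```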
In study two, 55 second-year ecology students at the University of Melbourne were given a single ecological scenario. Every subject was shown both the NHST version and the confidence interval version, but which was presented first was randomized. The scenario is shown below.
There are concerns about the air quality in a freeway tunnel. This study monitored the concentration of carbon monoxide (CO) during peak-hour traffic over 2 weeks, taking a total of 35 samples. Normal background levels of carbon monoxide are between 10–200 parts per million (ppm). A 1 h exposure time to CO levels of 250 ppm can lead to 5% carboxylated hemoglobin in the blood. Any level above this is abnormal and unsafe. If the true level of CO concentration in the tunnel exceeds 250 ppm, the tunnel will be closed and a surface road built. However, the surface road proposal has problems of its own, including the fact that threatened species inhabit an area near the surface site. First consider Presentation A. Please answer the question following Presentation A and then move on to Presentation B.
The response prompt was the same as in study one as was the classification of misinterpretation as moderate or strong evidence of “no effect.”
A summary of results from Fidler & Loftus is shown below. For both populations, confidence intervals had a lower rate of misinterpreting nonsignificant results as evidence of “no effect” than standard p-value descriptions.
Study | Subjects | Sample size | Percentage with NHST misunderstanding | Percentage with CI misunderstanding |
---|---|---|---|---|
1 | Final-year undergraduate and master's ecology students | 79 | 39% | 16% |
2 | Second-year ecology students | 55 | 44% | 18% |
Cliff Effects and Dichotomization of Evidence
The primary study that examined the cliff effect within engineering, science, and agronomy was Helske et al. (2020). More than a hundred researchers participated, including nine in various science and engineering fields: animal science (n = 1), biology (n = 1), botany (n = 1), ecology (n = 1), zoology (n = 1), physics (n = 1), wind energy (n = 1), mathematical biology (n = 1), and water engineering (n = 1). In terms of education, three researchers had received a PhD, two had received a master’s degree, one had a bachelor’s degree, and two had unknown levels of education.
In addition, 10 computer scientists were surveyed in the sub-fields of computer vision (n = 1), virtual reality (n = 3), network analysis (n = 1), artificial intelligence (n = 1), web technology (n = 1), and general computer science (n = 3). This included eight researchers with PhDs and two with master’s degrees.
Finally, 28 human-computer interaction and visualization researchers were surveyed, including 20 with PhDs, two with master’s degrees, and one with a bachelor’s degree.
Helske et al. (2020) measured a cliff effect by way of hypothetical scenario. The instrument wording was as follows:
A random sample of 200 adults from Sweden were prescribed a new medication for one week. Based on the information on the screen, how confident are you that the medication has a positive effect on body weight (increase in body weight)?
One of four visualizations was then presented: a text box describing the p-value and 95% confidence interval, a 95% confidence interval visual display, a gradient confidence interval visual display, or a violin plot visual display. For each scenario respondents were presented with one of eight p-values between 0.001 and 0.8. The specific p-values were 0.001, 0.01, 0.04, 0.05, 0.06, 0.1, 0.5, and 0.8. Respondents then used a slidebar to select their confidence on a scale of 0 to 100.
Using the open source data made available by the authors we analyzed the extent of a cliff effect by observing the drop in confidence plotted against p-values as well as by using simple descriptive statistics. Although the p-values presented to respondents were not evenly spaced, the drop in confidence between two consecutive p-values was used to determine the presence of a cliff effect. One additional difference was calculated, that between p-values of 0.04 and 0.06, the typical cliff effect boundary.
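A minimal sketch of this descriptive check is shown below. The data frame layout and column names are hypothetical stand-ins for the Helske et al. (2020) open data; the logic is simply the drop in mean confidence between consecutive presented p-values, plus the additional 0.04 to 0.06 comparison.

```r
# Sketch of the descriptive cliff-effect check described above.
# `responses` is assumed to have one row per answer, with the presented
# p-value (`p_value`) and the 0-100 confidence rating (`confidence`);
# these column names are hypothetical, not those of the published data.
p_levels <- c(0.001, 0.01, 0.04, 0.05, 0.06, 0.1, 0.5, 0.8)

cliff_drops <- function(responses) {
  mean_conf <- tapply(responses$confidence,
                      factor(responses$p_value, levels = p_levels),
                      mean)
  list(
    # drop in mean confidence between consecutive presented p-values
    consecutive_drops = -diff(mean_conf),
    # the conventional cliff boundary comparison
    drop_04_to_06 = unname(mean_conf["0.04"] - mean_conf["0.06"])
  )
}
```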
In our analysis only results from the p-value and confidence interval (CI) visual presentation types are presented since these are the most common methods of presenting analytical results. However, in their analysis Helske et al. looked across all 114 respondents and employed Bayesian multilevel models to investigate the influence of the four visual presentation types. The authors concluded that gradient and violin presentation types may moderate the cliff effect in comparison to standard p-value descriptions or confidence interval bounds.
A cliff effect was observed for engineers and scientists after plotting confidence against p-value. A moderate cliff effect was also observed after using the descriptive statistics described above. The 0.04 to 0.06 p-value interval was associated with the largest drop in confidence for the confidence interval presentation type and the second largest drop in confidence for the standard p-value description presentation type.
A cliff effect was not observed for computer scientists after plotting confidence against p-value, nor was one observed using the descriptive statistics described above. The 0.04 to 0.06 p-value interval was associated with the fourth largest drop in confidence for the confidence interval presentation type and the third largest drop in confidence for the standard p-value description presentation type. The p-value interval associated with the largest drop in confidence for both presentation types was between 0.1 and 0.5.
A cliff effect was not observed for HCI and visualization researchers after plotting confidence against p-value, nor was one observed using the descriptive statistics described above. The 0.04 to 0.06 p-value interval was associated with the second largest drop in confidence for the confidence interval presentation type and the fourth largest drop in confidence for the standard p-value description presentation type. The p-value interval associated with the largest drop in confidence for both presentation types was between 0.1 and 0.5.
In some cases a p-value of 0.1 is thought of as the last chance at statistical significance, sometimes denoted as “significant at the 90% level” [citation]. In this respect the 0.1 to 0.5 p-value interval may be thought of as having its own cliff effect. However, we found no evidence that either computer science or HCI research relies on the 0.1 significance level any more than other fields, and therefore there is no a priori reason to expect the 0.1 to 0.5 interval to represent the primary cliff effect for either computer science or HCI.
Group | Visualization | P-value interval with largest drop in confidence | Size of drop (percentage points) |
---|---|---|---|
Engineers and scientists | P-value | 0.01 to 0.04 | 22% |
Engineers and scientists | CI | 0.04 to 0.06 | 25% |
Computer scientists | P-value | 0.1 to 0.5 | 42% |
Computer scientists | CI | 0.1 to 0.5 | 24% |
HCI and visualization researchers | P-value | 0.1 to 0.5 | 19% |
HCI and visualization researchers | CI | 0.1 to 0.5 | 30% |
MATH & STATISTICS
It is worth noting that the table below is from the supplementary material of a now famous statement called “Retire Statistical Significance” that appeared in Nature and was signed by 854 statisticians and researchers from 52 countries [16]. As the statement’s title implies it argued that statistical significance was outdated and better statistical methods should be employed to weigh competing hypotheses. There was then a follow-up article by Tom Hardwicke and John Ioannidis, which re-surveyed a portion of the 854 initial signatories. The article by Hardwicke and Ioannidis argued that some of the signatories themselves have conflicting views about statistical significance and the original “Retire Statistical Significance” article may have overstated its support [17]. That was then followed by a post in the online magazine Medium by influential statistician David Spiegelhalter who argued the Hardwicke and Ioannidis article itself was flawed [18]. Spiegelhalter asked Ioannidis for comment and Ioannidis obliged, his comment appearing at the end of Spiegelhalter’s Medium piece. As yet the parties have failed to come to agreement and as such readers are encouraged to examine the alternating barbs noted above for a full sense of the authors’ divergent points of view.
Valentin Amrhein et al., “Supplementary information to: Retire statistical significance”, Nature, 2019 [link]
Tom Hardwicke and John Ioannidis, “Petitions in scientific argumentation: Dissecting the request to retire statistical significance”, European Journal of Clinical Investigation, 2019 [link]
David Spiegelhalter, “Andromeda and ‘appalling science’: a response to Hardwicke and Ioannidis”, Medium, 2020 [link]
Authors & year | Article title | Category | Subjects | Primary findings |
---|---|---|---|---|
McShane & Gal (2015) | Supplement to “Blinding Us to the Obvious? The Effect of Statistical Training on the Evaluation of Evidence” [link] | Cliff effect | Psychological Science editorial board (n=54) | 1. A cliff effect was found between p-values of 0.01 and 0.27. |
Lyu et al. (2020) | Beyond psychology: prevalence of p value and confidence interval misinterpretation across different fields [link] | NHST misinterpretations | Psychology undergraduate students (n=67), Psychology masters students (n=122), Psychology PhD students (n=47), Psychologists with a PhD (n=36) | 1. 94% of undergraduate students demonstrated at least one NHST misinterpretation. 2. 93% of masters students demonstrated at least one NHST misinterpretation. 3. 81% of PhD students demonstrated at least one NHST misinterpretation. 4. 92% of subjects with a PhD demonstrated at least one NHST misinterpretation. |
Questionnaires of NHST knowledge
During the 1991-1992 academic year researcher Angustias Vallecillos asked 436 university students across seven different academic specializations to respond to a simple NHST statement. This survey included 31 students in the field of mathematics. It is unclear how many universities were included and what their location was. The results were written up in Vallecillos’ 1994 Spanish-language paper, “Estudio teorico-experimental de errores y concepciones sobre el contraste estadistico de hipotesis en estudiantes universitarios.” The results appeared again in his 2000 English-language article, “Understanding of the Logic of Hypothesis Testing Amongst University Students.” What is presented here is from his 2000 work.
Vallecillos’ statement was a short sentence asking about the ability of the NHST procedure to prove either the null or alternative hypotheses:
A statistical hypotheses test, when properly performed, establishes the truth of one of the two null or alternative hypotheses.
University speciality | Sample size | Correct answer | Incorrect answer |
---|---|---|---|
Mathematics | 31 | 71% | 19% |
Table notes:
1. The exact numbers of respondents coded under each category were as follows: true (incorrect) - 6, false (correct) - 22, blank - 3 (9.7%).
2. Reference: "Understanding of the Logic of Hypothesis Testing Amongst University Students", Augustias Vallecillos, Journal für Mathematik-Didaktik, 2000 [link]
Students were asked to answer either “true” or “false” and to explain their answer (although an explanation was not required). The correct answer to the statement is false because NHST only measures the compatibility between observed data and a null hypothesis. It cannot prove the null hypothesis true. In addition, the alternative hypothesis is not explicitly considered in the NHST model, nor is the compatibility of the null hypothesis considered relative to the alternative.
The quantitative results for mathematics students are shown above. Correct and incorrect answers do not add up to 100% because some students left the response blank. Vallecillos includes the percentage of blank responses in his presentation (we have omitted those figures from the table above for clarity). It is unclear why blank responses were included in Vallecillos’ table instead of treating blanks as non-responses and omitting them completely from the response calculation. It may be that some subjects did not give a true/false response but did give a written response; however, this is not explicitly stated.
Mathematics students had the lowest proportion of incorrect responses at just 19%.
Vallecillos coded the written explanation of student answers into one of six categories:
Correct argument (C) - These responses are considered to be completely correct.
Example response: “The hypotheses test is based on inferring properties from the population based on some sample data. The result means that one of the two hypotheses is accepted, but does not mean that it is true.”
Partially correct argument (PC) - These responses are considered to be partially, but not completely correct because “answers analysed include other considerations regarding the way of taking the decision, which are not always correct.”
Example response: “What it does establish is the acceptance or rejection of one of the two hypotheses.”
Mistaken argument that NHST establishes the truth of a hypothesis (M1) - These responses include explanations about why the initial statement proposed by Vallecillos was true.
Example response: “Because before posing the problem, we have to establish which one is the null hypothesis H0 and the alternative H1 and one of the two has to be true.”
Mistaken argument that hypothesis testing establishes the probability of the hypotheses (M2) - This argument is another case of the Inverse Probability Fallacy. As the results of other studies summarized in this article show it is quite common among both students and professionals.
Example response: “What is established is the probability, with a margin of error, that one of the hypotheses is true.”
Other mistaken arguments (M3) - This category includes all other mistaken arguments that do not fall into either M1 or M2.
Example response: “What it establishes is the possibility that the answer formed is the correct one.”
Arguments that are difficult to interpret (DI) - These arguments were either not interpretable or did not address the subject’s reasoning behind answering the statement.
Example response: “The statistical hypotheses test is conditioned by the size of the sample and the level of significance.”
Summary results are shown in the table below. Percentages are out of the number who gave written explanations, not out of the number who responded to the original statement; however, all 31 mathematics students provided a written explanation, the only university specialization for which this was true. Mathematics students had the largest proportion of partially correct responses at 42%, just above the 40% of business students. However, both psychology and medicine students had a larger proportion of correct responses in both percentage and absolute terms. In terms of the combined proportion of correct and partially correct responses, mathematics students again had the largest proportion at about 49%. While mathematics students had the lowest proportion of M1 mistakes, they were in the middle of the pack for M2 and M3 mistakes.
[compare to other types of misinterpretations]
Vallecillos notes that when considering the full sample of all 436 students, 9.7% of those who correctly answered the statement also provided a correct written explanation and 31.9% of the students who correctly answered the statement gave a partially correct written explanation. This means that across the full sample about 60% of students who correctly answered the statement (100% - 9.7% - 31.9% = 58.4%) did so for incorrect reasons or were not able to clearly articulate their reasoning.
University speciality | Number of subjects who provided written explanations | C | PC | M1 | M2 | M3 | DI |
---|---|---|---|---|---|---|---|
Mathematics | 31 | 7% | 42% | 10% | 19% | 10% | 13% |
Table notes:
1. Key: C - "Correct", PC - "Partially correct", M1 - "Mistake 1", M2 - "Mistake 2", M3 - "Mistake 3", DI - "Difficult to intepret". See full explanations in description above the table.
2. Percentages have been rounded for clarity and may not add to 100%.
3. The exact number of respondents coded under each category were as follows: C - 2, PC - 13, M1 - 3, M2 - 6, M3 - 3, DI - 4.
4. Reference: "Understanding of the Logic of Hypothesis Testing Amongst University Students", Augustias Vallecillos, Journal für Mathematik-Didaktik, 2000 [link]
Lecoutre, Poitevineau, and Lecoutre (2003) surveyed 25 professional statisticians working at pharmaceutical companies in France, specifically hoping to identify two common NHST misinterpretations (20 psychology researchers were also tested; these results are presented in the psychology section).
The authors constructed a hypothetical scenario in which the efficacy of a drug is being tested using two groups, one given the drug and one given a placebo. Each group has 15 participants, for a total of 30. The drug was to be considered clinically interesting by experts in the field if the unstandardized difference between the treatment mean and the placebo mean was more than 3. Four different scenarios were constructed crossing statistical significance (significant/nonsignificant) with effect size (large/small).
Situations three and four are considered by the authors to offer conflicting information since in one case the result is nonsignificant but the effect size is large, and in the other the result is significant but the effect size is small. These two situations are meant to test two common NHST misinterpretations: interpreting a nonsignificant result as proof of the null hypothesis (a version of the “Inverse probability fallacy”) and confusing statistical significance with scientific significance, sometimes called the “Clinical or practical significance fallacy.” (For fallacies see Misinterpretations of “P” values in Psychology University students).
The normative answers provided by the authors are based on a combination of the effect size and the sampling error variance, which they note can be estimated by squaring the effect size divided by the t-statistic: (D/t)^2. The larger this variance the larger the difference between the estimated effect size and the true population effect size. Thus, for larger variances the estimated effect size is simply not very precise and no conclusion should be made about the drug’s efficacy.
Situation | t-statistic | P-value | Effect size (D) | Estimated sampling error (D/t)^2 | Normative answer |
---|---|---|---|---|---|
1. Significant result, large effect size | 3.674 | 0.001 | 6.07 | 2.73 | Clinically interesting effect |
2. Nonsignificant result, small effect size | 0.683 | 0.5 | 1.52 | 4.95 | No firm conclusion |
3. Significant result, small effect size | 3.674 | 0.001 | 1.52 | 0.17 | No clinically interesting effect |
4. Nonsignificant result, large effect size | 0.683 | 0.5 | 6.07 | 78.98 | No firm conclusion |
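The estimated sampling error variances in the table above follow directly from the reported effect sizes and t-statistics, as the short check below shows.

```r
# Reproducing the estimated sampling error variance (D / t)^2 for the
# four situations in the table above.
D <- c(6.07, 1.52, 1.52, 6.07)      # effect sizes, situations 1-4
t <- c(3.674, 0.683, 3.674, 0.683)  # t-statistics, situations 1-4
round((D / t)^2, 2)
# 2.73  4.95  0.17 78.98, matching the table
```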
Subjects were asked the following three questions:
1. For each of the four situations, what conclusion would you draw for the efficacy of the drug? Justify your answer.
2. Initially, the experiment was planned with 30 subjects in each group and the results presented here are in fact intermediate results. What would be your prediction of the final results for D then t, for the conclusion about the efficacy of the drug?
3. From an economical viewpoint, it would of course be interesting to stop the experiment with only the first 15 subjects in each group. For which of the four situations would you make the decision to stop the experiment, and conclude?
Only the results for Question 1 are discussed here as they align with commonly documented NHST misinterpretations. For a discussion of Questions 2 and 3 please see Lecoutre, Poitevineau, and Lecoutre (2003). The results for Question 1 are shown below. The three categories below were coded by the authors based on answers to Question 1 (we have made the category names slightly friendlier without changing their meaning). Green indicates the subject's response aligns with the authors' normative response. Red indicates it does not align.
A single subject responded incorrectly to Situation 1. However, 84% of subjects (all but four) responded incorrectly to Situation 2, indicating that the drug is ineffective. The authors claim that a response of “inefficacy” demonstrates an interpretation of a nonsignificant finding as proof of no effect, a version of the “Inverse probability fallacy.” However, in Situation 2 the effect size is 1.52, below the 3.0 threshold that is considered clinically significant. While the effect size is not 0, it is clinically nonsignificant, another indication the drug may be ineffective.
Moreover, while the authors note that the sampling error estimation is the primary method by which clinical conclusions can be drawn, they discouraged any direct calculation while answering the questionnaire. Even using this methodology, it is unclear why the authors consider 2.73 (the sampling error from Situation 1) to be sufficiently low, but consider 4.95 (the sampling error from Situation 2) to be unreasonably large.
Situation | The drug is effective | The drug is ineffective | Do not know |
---|---|---|---|
1. Significant result, large effect size | 96% | 0% | 4% |
2. Nonsignificant result, small effect size | 0% | 84% | 16% |
3. Significant result, small effect size | 12% | 80% | 8% |
4. Nonsignificant result, large effect size | 12% | 36% | 52% |
Table notes:
1. Green indicates the subject's response aligns with the authors' normative response. Red indicates it does not align.
Statement summaries | Significant version | Nonsignificant version |
---|---|---|
You have absolutely disproved (proved) the null hypothesis | 44% | 43% |
You have found the probability of the null (alternative) hypothesis being true. | 32% | 34% |
You know, if you decide to (not to) reject the null hypothesis, the probability that you are making the wrong decision. | 70% | 55% |
You have a reliable (unreliable) experimental finding in the sense that if, hypothetically, the experiment was repeated a great number of times, you would obtain a significant result on 99% (21%) of occasions. | 48% | 32% |
Table notes:
1. Percentages do not add to 100% because each respondent answered all four questions.
2. Sample sizes: significant version (n=105), nonsignificant version (n=98).
3. Reference: "Beyond psychology: prevalence of p value
and confidence interval misinterpretation
across different fields", Xiao-Kang Lyu, Yuepei Xu2, Xiao-Fan Zhao1, Xi-Nian Zuo, and Chuan-Peng Hu, Frontiers in Psychology, 2019, paper: [link], data: [link]
Questionnaires of Confidence Interval Knowledge
CI
Statement summaries | Significant version | Nonsignificant version |
---|---|---|
You have absolutely disproved (proved) the null hypothesis | 33% | 33% |
You have found the probability of the null (alternative) hypothesis being true. | 48% | 53% |
You know, if you decide to (not to) reject the null hypothesis, the probability that you are making the wrong decision. | 40% | 37% |
You have a reliable (unreliable) experimental finding in the sense that if, hypothetically, the experiment was repeated a great number of times, you would obtain a significant result on 99% (21%) of occasions. | 59% | 45% |
Table notes:
1. Percentages do not add to 100% because each respondent answered all four questions.
2. Sample sizes: significant version (n=105), nonsignificant version (n=98).
3. Reference: "Beyond psychology: prevalence of p value
and confidence interval misinterpretation
across different fields", Xiao-Kang Lyu, Yuepei Xu2, Xiao-Fan Zhao1, Xi-Nian Zuo, and Chuan-Peng Hu, Frontiers in Psychology, 2019, paper: [link], data: [link]
Cliff Effects and Dichotomization of Evidence
JASA Study 1
Below is a summary of a study from an academic paper.
The study aimed to test how different interventions might affect terminal cancer patients’ survival. Subjects were randomly assigned to one of two groups. Group A was instructed to write daily about positive things they were blessed with while Group B was instructed to write daily about misfortunes that others had to endure. Subjects were then tracked until all had died. Subjects in Group A lived, on average, 8.2 months post-diagnosis whereas subjects in Group B lived, on average, 7.5 months post-diagnosis (p = 0.01). Which statement is the most accurate summary of the results?
A. Speaking only of the subjects who took part in this particular study, the average number of post-diagnosis months lived by the subjects who were in Group A was greater than that lived by the subjects who were in Group B.
B. Speaking only of the subjects who took part in this particular study, the average number of post-diagnosis months lived by the subjects who were in Group A was less than that lived by the subjects who were in Group B.
C. Speaking only of the subjects who took part in this particular study, the average number of post-diagnosis months lived by the subjects who were in Group A was no different than that lived by the subjects who were in Group B.
D. Speaking only of the subjects who took part in this particular study, it cannot be determined whether the average number of post-diagnosis months lived by the subjects who were in Group A was greater/no different/less than that lived by the subjects who were in Group B.
JASA Study 2 - judgement
Below is a summary of a study from an academic paper.
The study aimed to test how two different drugs impact whether a patient recovers from a certain disease. Subjects were randomly drawn from a fixed population and then randomly assigned to Drug A or Drug B. Fifty-two percent (52%) of subjects who took Drug A recovered from the disease while forty-four percent (44%) of subjects who took Drug B recovered from the disease. A test of the null hypothesis that there is no difference between Drug A and Drug B in terms of probability of recovery from the disease yields a p-value of 0.025. Assuming no prior studies have been conducted with these drugs, which of the following statements is most accurate?
A. A person drawn randomly from the same population as the subjects in the study is more likely to recover from the disease if given Drug A than if given Drug B.
B. A person drawn randomly from the same population as the subjects in the study is less likely to recover from the disease if given Drug A than if given Drug B.
C. A person drawn randomly from the same population as the subjects in the study is equally likely to recover from the disease if given Drug A or if given Drug B.
D. It cannot be determined whether a person drawn randomly from the same population as the subjects in the study is more/less/equally likely to recover from the disease if given Drug A or if given Drug B.
JASA Study 2 - choice
Assuming no prior studies have been conducted with these drugs, if you were a patient from the same population as the subjects in the study, what drug would you prefer to take to maximize your chance of recovery?
A. I prefer Drug A.
B. I prefer Drug B.
C. I am indifferent between Drug A and Drug B.
Undergraduates
Below is a summary of a study from an academic paper:
The study aimed to test how different interventions might affect terminal cancer patients’ survival. Participants were randomly assigned to either write daily about positive things they were blessed with or to write daily about misfortunes that others had to endure. Participants were then tracked until all had died. Participants who wrote about the positive things they were blessed with lived, on average, 8.2 months after diagnosis whereas participants who wrote about others’ misfortunes lived, on average, 7.5 months after diagnosis (p = 0.27). Which statement is the most accurate summary of the results?
A. The results showed that participants who wrote about their blessings tended to live longer post-diagnosis than participants who wrote about others’ misfortunes.
B. The results showed that participants who wrote about others’ misfortunes tended to live longer post-diagnosis than participants who wrote about their blessings.
C. The results showed that participants’ post-diagnosis lifespan did not differ depending on whether they wrote about their blessings or wrote about others’ misfortunes.
D. The results were inconclusive regarding whether participants’ post-diagnosis lifespan was greater when they wrote about their blessings or when they wrote about others’ misfortunes.
OTHER STUFF
https://people.clas.ufl.edu/dchamberlain31/files/Students-Understanding-of-Test-Statistics-Long-Paper-Final-Version.pdf
https://www.researchgate.net/publication/326501460_Statistics_anxiety_in_university_students_in_assessment_situations
References and theoretical basis
Ronald Wasserstein, Allen Schirm, & Nicole Lazar, “Moving to a World Beyond ‘p < 0.05’”, The American Statistician, 2019 [link]
Ro