Hypothesis testing and confidence interval misinterpretations among psychology researchers and students
Article Summary
This article reviews the evidence on hypothesis testing and parameter estimation misinterpretations of psychology researchers and students. The article focuses on research with a “direct inquiry” methodology, meaning psychologists and students were asked directly to assess a statistical situation and provide a response. This is in contrast to other methodologies, for example systematically reviewing statistical usage in academic psychology journals. A total of 32 studies are reviewed, falling into one of four categories: null hypothesis significance testing (NHST), confidence intervals (CI), the cliff effect, and dichotomization of evidence. Where possible, meta-analyses are used to combine results.
A detailed table of contents outlines the studies, with links to each, organized by the four categories above. The introductory sections further outline this article's purpose and contributions, and provide summary results as well as themes and criticism across the studies.
Article status
This article is complete. It is being maintained as new research comes out or original authors provide feedback.
Contact information
We’d love to hear from you! Please direct all inquiries to info@theresear.ch.
Table of contents
Click each link below to be taken to the associated section or subsection.
Null Hypothesis Significance Testing (NHST) Review
General misinterpretations examined using survey instruments
Oakes (1987) original six-statement survey instrument from the U.S. with replication by Haller & Krauss (2002) in Germany and Lyu et al. (2018) in China (with corrected degrees of freedom)
Falk and Greenbaum (1995) five-statement survey instrument in Israel
Lyu et al. (2020) four-statement survey instrument in China
Laura Badenes-Ribera and Dolores Frías-Navarro (and team) 10-statement survey instrument used in a series of three studies in Spain, Chile, and Italy in 2015 and 2016
Monterde-i-Bort et al. (2010) replication of Part II of the Psychometrics Group Instrument developed by Mittag (1999)
Specific misinterpretations examined using interviews, single questions, and statistical tasks
Oakes (1979) estimation task to U.S. psychology researchers eliciting estimates of the effect size when a significance threshold is changed from 0.05 to 0.01
Oakes (1979) estimation task to U.S. psychology researchers eliciting estimates of p-value replication under three different hypothetical scenarios
Vallecillos (1994/2000) one-question survey to psychology students in Spain about the ability of NHST to “prove” a hypothesis true
Zuckerman et al. (1993) multiple choice questionnaire to psychology researchers about comparing the results of two studies utilizing NHST
Lecoutre et al. (2003) opinion elicitation of French laboratory psychologists about drug efficacy under four combinations of effect size and p-value
Lai et al. (2012) three estimation tasks among psychology researchers to determine if their 80% p-value replication intervals were normatively correct
Kühberger et al. (2015) estimation task among psychology students in Austria to provide sample and effect sizes based only on the high-level results of two actual studies
Confidence interval review
General misinterpretations examined using survey instruments
Hoekstra et al. (2014) original six-statement survey instrument from the Netherlands with criticism by Miller and Ulrich (2015) and response by Morey et al. (2016)
Criticism and replication of Hoekstra et al. (2014) by García-Pérez and Alcalá-Quintana (2016) in Spain
Replication of Hoekstra et al. (2014) by Lyu et al. (2018) in China
Lyu et al. (2020) four-statement survey instrument in China
Specific misinterpretations examined using interviews, single questions, and statistical tasks
Cumming et al. (2004) interactive task for psychology researchers to estimate how many means from a replicated experiment will fall within the original experiment’s 95% confidence interval
Belia et al. (2005) interactive task of estimating the needed position of a Group 2 interval so that its overlap with a fixed Group 1 interval attains a p-value of 0.05 for the difference in group means
Relevant results from Coulson et al. (2010) opinion elicitation about whether two studies conflict when one has a confidence interval that covers zero and the other does not; main results presented in the “Dichotomization of evidence review” section
Kalinowski et al. (2018) interactive task of producing distributions within confidence intervals that represent the probability the mean will fall with that region
Cliff effect review
Rosenthal and Gaito (1963) original confidence elicitation study of assigning subjective confidence to a range of p-values with replication by Beauchamp and May (1964) and subsequent comment by Rosenthal and Gaito (1964)
Minturn et al. (1972) replication of Rosenthal and Gaito (1963)
Nelson, Rosenthal, & Rosnow (1986) confidence elicitation study of assigning subjective confidence to a range of p-values
Poitevineau and Lecoutre (2001) confidence elicitation study of assigning subjective confidence to a range of p-values with focus on between-subject results
Replication of Poitevineau and Lecoutre (2001) by Lai (2010).
Hoekstra et al. (2012) confidence elicitation study of assigning subjective confidence to a range of four p-values under both NHST and confidence interval scenarios
Dichotomization of evidence review
Coulson et al. (2010) opinion elicitation around whether two studies conflict when one has a confidence interval that covers zero and the other does not
McShane and Gal (2016) multiple choice questionnaire to indicate whether a hypothetical treatment was effective under either a significant or nonsignificant scenario
Ulrich and Miller (2018) opinion elicitation of providing a publication recommendation to a PhD student based on a result’s p-value
Paid reviewers
The reviewers below were paid by The Research to ensure this article is accurate and fair. This does not mean the reviewer would have written the article in the same way, that the reviewer is officially endorsing the article, or that the article is perfect; nothing is. It simply means the reviewer has done their best to take what The Research produced, improve it where needed, and give editorial guidance, and that they generally consider the content to be correct. Thanks to all the reviewers for their time, energy, and guidance.
Dan Hippe, M.S., Statistics, University of Washington (2011). Dan is currently a statistician in the Clinical Biostatistics Group at the Fred Hutchinson Cancer Research Center in Seattle, Washington. He is a named co-author on more than 180 journal articles, is on the editorial board of Ultrasound Quarterly, and is a statistical consultant and reviewer for the Journal of the American College of Radiology.
Other reviewers will be added as funding allows.
Article purpose & contribution
This article reviews the evidence on direct inquiry of null hypothesis significance testing (NHST) and confidence interval (CI) knowledge of psychology researchers and students. Direct inquiries include surveys, multiple choice questionnaires, statistical tasks, and other methods of directly interacting with participants to determine knowledge and misunderstanding. As far as we are aware, this is by far the most comprehensive review of this topic available. In particular, this article attempts to make the following contributions:
Comprehensive. This article reviews evidence from more than 30 different studies. As additional studies are released and existing studies are uncovered the material will be added.
Concentration on psychology. Many studies in the literature on NHST and CI knowledge are cross-discipline. These studies therefore offer a good comparison across disciplines of performance on a particular statistical task, survey instrument, or other inquiry method. This article aims for something different, however: “breaking apart” the literature, extracting only research on psychology, and then reassembling it, highlighting common themes and criticism. This includes re-analyzing open data sources and extracting only the subset applicable to psychology (which might have been subsumed in reporting of the original findings) as well as combining data into meta-analyses where possible. Forthcoming articles will focus on other disciplines.
Detailed summaries. The key methodological elements of each study are outlined along with the main findings. Our own criticisms are presented as well as any published responses from other researchers. Of course, no article can be summarized comprehensively without duplicating it outright. Those particularly interested are encouraged to use the provided links to review the original articles.
Author input. All authors have been contacted for their input on the summaries of their studies. When applicable their responses have been incorporated into the article.
Categorization. Articles are categorized for easy navigation into one of four research areas and seven methodological types. These are incorporated in the table of contents as well as in the article summary below. In addition, a main table is presented at the beginning of the literature review with one-sentence summaries of primary findings along with other important information about the article. Included are links to the original articles themselves.
Broader literature. This article focuses on a single type of evidence regarding NHST and confidence interval misinterpretation: direct inquiries of NHST and CI knowledge. However, we have identified three additional types of evidence. These four types of evidence are covered briefly, with relevant links where applicable. Statistical evaluations outside of NHST and CIs are also noted when relevant.
Meta-analysis. Simple meta-analyses are conducted when possible. The raw data used to calculate them is available in the Supplementary Material section. Charts and tables are also provided to clarify meta-analysis findings.
Clear identification of errors. A number of papers reviewed here were published with errors. These errors have been highlighted and corrected in our summaries. This will help interested readers better interpret the original articles if sought out.
Translation. Articles that were not originally published in English have been professionally translated and are available for free on our Google drive. Currently, only one article is included, but others are in the works.
Better charts and tables. When reproducing charts, improvements were made to increase readability and clarity. Some charts may appear small, but clicking the chart will expose a larger version in a lightbox. Charts also use a common design language to increase visual continuity within the article. The same is true of tables, which also utilize highlighting where necessary to draw focus to important information. All tables have extensive notes, including links to the original article.
Living document. Because this article is not published in any journal, we have the freedom to continually add new studies as they are discovered or released. It also provides us the opportunity to improve the article over time by clarifying or adding content from those studies already summarized as well as correct any errors if they are found.
Summary & results
There are four categories of evidence regarding knowledge and misinterpretations of NHST and CIs by professional researchers and students. The current article focuses on the first category below for the population of psychology researchers and students. Psychology has by far the most studies in this area of any academic discipline.
Direct inquiries of statistical knowledge. Although not without methodological challenges, this work is the most direct method of assessing statistical understanding. The standard procedure is to administer a survey instrument, statistical task, or other inquiry method to a convenience sample of students or researchers (or both) in a particular academic discipline.
Examination of NHST and CI usage in statistics and methodology textbooks. This line of research includes both systematic reviews and casual observations documenting incorrect or incomplete language when NHST or CI is described in published textbooks. In these cases it is unclear if the textbook authors themselves do not fully understand the techniques and procedures or if they were simply imprecise in their writing and editing or otherwise thought it best to omit or simplify the material for pedagogical purposes. For an early draft of our article click here. We will continue to expand this article in the coming months.
Audits of NHST and CI usage in published articles. Similar to reviews of textbooks, these audits include systematic reviews of academic journal articles making use of NHST and CIs. The articles are assessed for correct usage. Audits are typically focused on a particular academic discipline, most commonly reviewing articles in a small number of academic journals over a specified time period. Quantitative metrics are often provided that represent the percentage of reviewed articles that exhibited correct and incorrect usage. Click here for a growing list of relevant papers.
Journal articles citing NHST or CI misinterpretations. A large number of researchers have written articles underscoring the nuances of the procedures and common misinterpretations directed at their own academic discipline. In those cases it is implied that in the experience of the authors and journal editors the specified misinterpretations are common enough in their field that a corrective is warranted. Using a semi-structured search we identified more than 60 such articles, each in a different academic discipline. Click here to read the article.
Category 1, direct inquiries of statistical knowledge, can be further subdivided into four areas:
Null hypothesis significance testing (NHST) misinterpretations. These misinterpretations involve p-values, Type I and Type II errors, statistical power, sample size, standard errors, and other components of the NHST framework. Examples include interpreting the p-value as the probability that the null hypothesis is true or as the probability of replication.
Confidence interval misinterpretations. For example, interpreting the confidence interval as a probability.
The dichotomization of evidence. A specific NHST misinterpretation in which results are interpreted differently depending on whether the p-value is statistically significant or statistically nonsignificant. Dichotomization of evidence is closely related to the cliff effect (see Item 4 below).
The cliff effect. A specific misuse of NHST in which there is a dramatic drop in confidence in an experimental result based on the p-value. For example, having relatively high confidence in a result with a p-value of 0.04, but much lower confidence in a result with a p-value of 0.06.
Some studies are mixed, testing a combination of two or more subcategories above. A total of 32 psychology studies were found that fall into one of these four subcategories, each of which is covered in detail later in this article.
Nearly all of the studies presented in this article support the thesis that NHST and CI misinterpretations are widespread across both student and researcher populations. An NHST meta-analysis we conducted combined the results of 10 studies totaling 1,569 students and researchers surveyed between 1986 and 2020. The average proportion of respondents with at least one NHST misinterpretation was 9 out of 10. Segmenting to the population of 788 professional research psychologists still resulted in 87% exhibiting at least one NHST misinterpretation. A smaller set of five studies was combined to report on the average number of misinterpretations in a typical survey instrument. Here a full 50% of statements were misinterpreted on average. Professional research psychologists had a lower average proportion of misinterpretations than students, but it was still a startling 40%. The studies available for meta-analysis for confidence intervals totaled just four, encompassing 1,683 participants, but the results were even more extreme: 97% of respondents had at least one confidence interval misinterpretation. Some might view this as particularly problematic because confidence intervals are often suggested as a replacement or supplement to NHST (for example, see references 1, 2, 3, 10).
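The raw data behind these meta-analyses appears in the Supplementary Material section. As a rough illustration of the simplest way such proportions can be pooled (a sample-size-weighted average of the raw counts; the method used for any particular meta-analysis here may differ), the sketch below uses placeholder study counts rather than data from the reviewed studies.

```python
# Hypothetical inputs: (number of respondents, number with at least one misinterpretation).
# These counts are placeholders for illustration, not data from the studies reviewed here.
studies = [(70, 67), (166, 150), (120, 110)]

n_total = sum(n for n, _ in studies)  # total respondents across studies
k_total = sum(k for _, k in studies)  # total respondents with >= 1 misinterpretation

# Sample-size-weighted pooled proportion (equivalent to pooling the raw counts).
pooled = k_total / n_total
print(f"{pooled:.0%} of {n_total} respondents had at least one misinterpretation")
```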
A meta-analysis was not feasible for the cliff effect, but seven of eight studies showed evidence the effect was present. A meta-analysis across three studies was possible for dichotomization of evidence. This included 118 psychology researchers from the editorial boards of three different journals. In a hypothetical scenario outlining two different p-values (0.01 and 0.27), respondents were 27 percentage points more likely to indicate no evidence of a treatment effect when shown the 0.27 p-value than when shown the 0.01 p-value, despite no change in effect size.
The studies covered here are not without methodological critiques. Therefore, the results above may overestimate the true magnitude of the problem. Nonetheless, the general finding that NHST and confidence interval misinterpretations are widespread is hard to dispute and is consistent with the three other categories of evidence outlined above. For instance, one audit of NHST usage in psychology, “The prevalence of statistical reporting errors in psychology (1985–2013),” examined roughly 250,000 p-values reported in eight major psychology journals over a nearly thirty-year period and found that half of the articles using NHST contained at least one p-value that was inconsistent with its reported test statistic and degrees of freedom [4].
In a now 30-year-old study Paul Pollard and John Richardson identified a number of psychology textbook passages in which researchers gave incorrect explanations of Type I error [5]. The results were written up in their aptly named article, “On the Probability of Making Type I Errors.” More recently Heiko Haller and Stefan Krauss identified numerous incorrect NHST explanations in a major psychology textbook, Introduction to Statistics for Psychology and Education [6].
And psychology researchers have a long history of critiquing their own field for NHST misuse. Psychology professor Brian Haig wrote in the 2012 edition of the Oxford Handbook of Quantitative Methods that, “Despite the plethora of critiques of statistical significance testing, most psychologists understand them poorly, frequently use them inappropriately, and pay little attention to the controversy they have generated” [7]. Two psychology professors — Robert Calin-Jageman and Geoff Cumming — co-wrote a 2019 article in the American Statistician going so far as to suggest a ban on p-values due to their frequent misuse [8]. “We do not believe that the cognitive biases which p-values exacerbate can be trained away. Moreover, those with the highest levels of statistical training still regularly interpret p-values in invalid ways. Vulcans would probably use p-values perfectly; mere humans should seek safer alternatives,” the two authors wrote.
One might reasonably wonder what can be done about the problem, but that is a different question than we attempt to answer here. Instead, this article outlines in detail the results of the numerous direct inquiries of NHST and confidence interval knowledge of psychology researchers and their students.
Finally, several statistical assessments not covered in this article are worth mentioning. A particularly relevant assessment is the Reasoning about P-values and Statistical Significance (RPASS) scale [9] developed by Sharon Lane-Getaz. Although the assessment is relevant to NHST misinterpretations research, to our knowledge the only academic research utilizing the assessment occurred during the validation phases of its development. General statistical assessments that expand beyond NHST and confidence intervals have also been developed. For those who are interested we recommend the Comprehensive Assessment of Outcomes in Statistics (CAOS) [10, 11] and the Psychometrics Group Instrument [12], a portion of which is covered in this article as some questions relate to NHST. There are numerous studies investigating specific statistical misconceptions unrelated to NHST or CIs, for example Pearson’s correlation coefficient [22].
Study summaries and conceptual relations
Unlike a systematic literature review, this article had little structure in how its papers were identified. The project grew organically; at the outset the goal was simply to support other work we’ve completed by identifying a few studies demonstrating that even professional researchers and students trained in statistical techniques are prone to difficulties with some basic hypothesis testing procedures. Nor was it apparent at the outset just how many studies had been conducted that directly tested this statistical knowledge.
We started by reading, and then reading a bit more. As we read more journal articles, we began to recognize the names of authors and studies. These results were written up. When we encountered unfamiliar work in the area it was bookmarked for follow-up. We circled back to these bookmarked items, writing up their results as well. At times a simple Google search to find an article we had previously read led to more studies appearing in the search results. These too were bookmarked and later examined and summarized. Over time, reading and rereading these studies failed to uncover any work we had not already included.
That is not to say this article covers every direct inquiry in the area of psychology. It does not. We know of more work and plan to incorporate it when possible. That work does not yet appear here for one of a few reasons. In some cases the study seems promising, but is not in English. This means the work must be translated, which takes time and money. In other cases the study results have been presented by authors whose other work we have covered below. However, the sample population appears similar enough that the authors may be reusing the results in another forum (for example, a conference presentation). In these cases we have reached out to the authors to determine if the sample population is independent of other results they have presented. Lastly, it is possible there is work we haven’t yet come across or failed to take note of. As we find additional research it will be incorporated.
Study category | Count of studies | Total number of participants |
---|---|---|
NHST misinterpretations | 16 | 2,613 |
CI misinterpretations | 6 | 1,657 |
Cliff effect | 7 | 259 |
Dichotomization of evidence | 3 | 1,470 |
The table above provides a summary of studies by category. In total, the studies covered in this article elicited responses from 5,727 participants across 32 studies. Half of these studies focused on NHST misinterpretations while the remainder were split among the three other categories. NHST misinterpretation studies also had the largest sample size, 2,613. Studies exploring the cliff effect had by far the lowest sample size, just 259. Note that a few studies fell into multiple categories.
The 5,727 figure assumes no two studies shared participants. We believe this to be the case, but are attempting to verify where possible with study authors.
In addition to partitioning the studies into the four research categories, they were also each assigned a methodological type based on their design. Summary statistics by type are shown in the table below. The seven types were:
Surveys. Surveys were designed to directly test statistical misinterpretations by eliciting responses to statements with nominally correct answers. Typically only true/false options were available, although Monterde-i-Bort et al. (2010) used a Likert scale.
Confidence elicitation. Confidence elicitation was primarily associated with the cliff effect studies and asked respondents to provide their subjective confidence based on a provided p-value.
Opinion elicitation. Opinion elicitation comprises any survey with an open-ended design; for example, Coulson et al. (2010) asked for a respondent’s opinion on whether the outcomes of two studies were contradictory or in agreement.
Multiple choice questionnaires. Multiple choice questionnaires provided participants with a small selection of responses from which they had to choose; for example, McShane and Gal (2015) outlined a hypothetical scenario and then provided four options for the correct interpretation.
Estimation tasks. Estimation tasks required respondents to provide some kind of estimation; for example, Kühberger (2015) asked participants to estimate group means and effect sizes after being given the results of two experiments.
Interactive tasks. Interactive tasks required respondents to interact with statistical objects within a web application; for example, Belia et al. (2005) required participants to move a Group 2 confidence interval toward a fixed Group 1 interval to produce a specified p-value (a sketch of the underlying calculation appears after this list).
Reanalysis. Reanalysis included only one study: Rosenthal and Gaito (1964), which used data collected by Beauchamp and May (1964) in a prior study.
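The Belia et al. (2005) task hinges on the relationship between the gap separating two group means and the p-value for their difference. The sketch below is our own illustration of that relationship, not code from the study; the means and standard errors are arbitrary, and each group's standard error is treated as known so a simple z-test applies. It shows that with equal standard errors a gap of about 2.77 standard errors yields p of roughly 0.05, even though the two 95% confidence intervals still overlap slightly.

```python
import numpy as np
from scipy import stats

def p_value_for_gap(mean1, se1, mean2, se2):
    """Two-sided p-value for the difference between two independent group means,
    treating each group's standard error as known (z-test on the difference)."""
    se_diff = np.sqrt(se1**2 + se2**2)
    z = (mean2 - mean1) / se_diff
    return 2 * stats.norm.sf(abs(z))

# Fix Group 1 at mean 40 with SE 5; place Group 2 so the difference gives p = 0.05.
# With equal SEs the required gap is 1.96 * sqrt(2) * SE (about 2.77 SEs), so the
# two 95% CIs (each +/- 1.96 SE) still overlap a little at the p = 0.05 point.
gap = 1.96 * np.sqrt(2) * 5
print(p_value_for_gap(40, 5, 40 + gap, 5))  # approximately 0.05
```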
Study type | Count of studies | Total number of participants |
---|---|---|
Survey | 12 | 2,756 |
Confidence Elicitation | 6 | 259 |
Opinion Elicitation | 3 | 1,372 |
Multiple Choice Questionnaire | 2 | 643 |
Estimation Task | 4 | 379 |
Interactive Task | 3 | 318 |
Reanalysis | 1 | NA |
The survey type was by far the most common; more than a third of studies used a survey design. These surveys span both NHST and confidence interval misinterpretations, and they also had the largest sample size, with 2,756 total participants. Confidence elicitation and estimation task studies were relatively common but had low sample sizes. Opinion elicitation had only three studies but the second-highest sample size, 1,372, driven mostly by the very large sample from Ulrich and Miller (2018).
Studies also spanned a breadth of countries, with participants from at least 11 different countries. “At least” because numerous studies recruited participants from lists of past journal authors. In these cases the respondents could, and likely did, represent a variety of different countries. However, the country of origin of every respondent was not recorded, so it is impossible to know for certain how many countries are represented and the number of respondents associated with each. In fact, this journal-author population was the most common category, representing a quarter of the studies covered and 1,801 total participants. The U.S. was the top named country with six studies, but the sample size was low, just 193. This is due in large part to the low sample sizes used by Oakes, whose studies make up half of the six. Spain had the largest sample size of any named country, with more than 1,100 participants, thanks largely to the numerous survey-based studies by Laura Badenes-Ribera and Dolores Frías-Navarro (and team). Chile had the lowest sample size of any country, just 30 participants.
Note that the total count of studies overstates the number of journal articles reviewed because some studies used participants from multiple countries.
Country | Count of studies | Total number of participants |
---|---|---|
Journal authors from various countries | 8 | 1,801 |
U.S. | 6 | 193 |
Spain | 5 | 1,142 |
France | 2 | 38 |
Germany | 2 | 703 |
Australia | 2 | 169 |
Netherlands | 2 | 662 |
China | 2 | 618 |
Israel | 1 | 53 |
Austria | 1 | 133 |
Chile | 1 | 30 |
Italy | 1 | 134 |
Unknown | 1 | 51 |
Table notes:
1. The total count of studies overstates the number of journal articles reviewed because some studies used participants from multiple countries.
Conceptual maps and design outlines
In total 12 studies attempted to measure NHST and confidence interval misinterpretations via survey instruments. These are represented in the conceptual timeline below (solid circles), along with additional studies that are notable, but were not themselves surveys (open circles). The main lines of research are denoted by the solid lines, originating with Oakes. Dotted lines represent studies that were influenced by the previous studies.
The timeline does not represent articles explicitly citing previous work. While networks built off of citations have their uses (e.g., search algorithms), they overcomplicate the main relational aspects of the studies at hand. Instead, the timeline below is subjective and curated by us based on our reading of the primary influences of each study.
As an example of how to read the chart consider the blue lines denoting confidence interval surveys. Hoekstra et al. (2014) deployed a confidence interval survey based on Oakes (1987), via the discussion in Gigerenzer (2004). Both Lyu et al. (2018) and García-Pérez and Alcalá-Quintana (2016) undertook replications of Hoekstra et al. (2014) with Lyu et al. (2020) then undertaking a partial replication. Miller and Ulrich (2015) criticized Hoekstra et al. (2014) with Morey et al. (2016) replying.
The majority of surveys were nonrandomized, presenting a single version to all participants.
The conceptual timeline for non-survey study types is shown below. The study types appear on the left-hand axis and the colors denote study category. The interaction between study type and study category is apparent from this chart. For instance, all cliff effect studies employed a confidence elicitation design. As another example, all estimation tasks were aimed at NHST misinterpretations.
Statistical tasks differed greatly in both the design of the task itself and the randomization scheme. Overviews of the four estimation task studies are provided in the diagram below. Lai et al. (2012) appears in the three panels of the first row as it was composed of three separate tasks.
Overviews of the three interactive task studies are provided in the diagram below.
A summary of each study is presented in the table below including the authors and year published, the title of the article and a link to the paper, which of the four research categories the study belongs to, which of the seven methodological types the study belongs to, the country of study participants, the participant population and associated sample size, and a brief summary of the article’s primary findings. Below that, in the body of this article, details of each study are presented. Of course, the methodological details and complete results of each study cannot be presented in full without duplicating the article outright. Readers are encouraged to go to the original articles to get the full details and in-depth discussions of each study.
Authors, year, & title | Category | Type | Country | Psychology participants | Primary findings |
---|---|---|---|---|---|
Rosenthal & Gaito (1963) The Interpretation of Levels of Significance by Psychological Researchers [link] |
Cliff effect | Confidence Elicitation | U.S. | Faculty (n=9) Graduate students (n=10) |
1. A cliff effect was found at a p-value of 0.05. 2. There is less confidence in p-values generated from a sample size of n=10 than a sample size of n=100, suggesting participants care about both Type I and Type II errors. 3. Psychology faculty have lower confidence in p-values than graduate students |
Beauchamp & May (1964) Replication report: Interpretation of levels of significance by psychological researchers [link] |
Cliff effect | Confidence Elicitation | U.S. | Graduate students (n=11) Faculty (n=9) |
1-page summary of a replication of Rosenthal & Gaito (1963) 1. No cliff effect was found at any p-value (however, see Rosenthal & Gaito, 1964). 2. Subjects expressed higher confidence with smaller p-values and larger sample sizes. |
Rosenthal & Gaito (1964) Further evidence for the cliff effect in the interpretation of levels of significance [link] |
Cliff effect | Reanalysis | U.S. | NA | 1-page comment on Beauchamp & May (1964), which was a replication of Rosenthal & Gaito (1963) 1. Despite Beauchamp & May's study claiming to find "no evidence" for a 0.05 cliff effect, a tendency to interpret results at this level as special can be seen in an extended report provided by Beauchamp & May. 2. Beauchamp & May themselves demonstrate a cliff effect when they find Rosenthal & Gaito's original 1963 results "nonsignificant" due to the p-value being 0.06. |
Minturn, Lansky, & Dember (1972) The interpretation of levels of significance by psychologists: A replication and extension (Note that despite various attempts to obtain this paper a copy could not be found) |
Cliff effect | Confidence Elicitation | Unknown | Bachelor's students, master's students, and PhD graduates (n=51) | Results as described in Nelson, Rosenthal, & Rosnow (1986): 1. Cliff effects were found at p-values of 0.01, 0.05, and 0.10, with the most pronounced at the standard 0.05 level. 2. Subjects expressed higher confidence with smaller p-values and larger sample sizes. |
Oakes (1979) The Statistical Evaluation of Psychological Evidence (unpublished doctoral thesis cited in Oakes 1986) [link] |
NHST | Estimation Task | U.S. | Academic psychologists (n=54) | 1. On average subjects drastically misestimate the probability that a replication of a hypothetical experiment will yield a statistically significant result based on the p-value of an initial experiment (replication fallacy). |
Oakes (1979) The Statistical Evaluation of Psychological Evidence (unpublished doctoral thesis cited in Oakes 1986) [link] |
NHST | Estimation Task | U.S. | Academic psychologists (n=30) | 1. Subjects overestimate the effect size based on the p-value. |
Oakes (1986) Statistical Inference [link] |
NHST | Survey | U.S. | Academic psychologists (n=70) | 1. 96% of subjects demonstrated at least one NHST misinterpretation. 2. 89% of subjects did not select the correct definition of statistical significance. 3. Only two respondents (3%) correctly answered both 1 and 2. |
Nelson, Rosenthal, & Rosnow (1986) Interpretation of significance levels and effect sizes by psychological researchers [link] |
Cliff effect | Confidence Elicitation | J.A. | Academic psychologists (n=85) | 1. A cliff effect was found at a p-value of 0.05. 2. Subjects expressed higher confidence with smaller p-values. 3. Subjects expressed higher confidence with larger sample sizes, but this was moderated by years of experience. 4. Subjects expressed higher confidence with larger effect sizes, but this was moderated by years of experience. |
Zuckerman et al. (1993) Contemporary Issues in the Analysis of Data: A Survey of 551 Psychologists [link] |
NHST | Multiple Choice Questionnaire | J.A. | Students (n=17) Academic psychologists (n=508) |
1. Overall accuracy of subjects was 59%. |
Falk & Greenbaum (1995) Significance tests die hard: The amazing persistence of a probabilistic misconception [link] |
NHST | Survey | Israel | Undergraduates (n=53) | 1. 92% of subjects demonstrated at least one NHST misinterpretation. |
Vallecillos (2000) Understanding of the Logic of Hypothesis Testing Amongst University Students [link] |
NHST | Survey | Spain | Psychology students (n=70) | 1. When shown a statement claiming NHST can prove the truth of a hypothesis, 17% of pedagogy students incorrectly marked the statement as true. Only 9% of psychology students that had correctly answered the statement also provided a correct written explanation of their reasoning. |
Poitevineau & Lecoutre (2001) Interpretation of significance levels by psychological researchers: The .05 cliff effect may be overstated [link] |
Cliff effect | Confidence Elicitation | France | Researchers (n=18) | 1. The cliff effect varied by subject and three distinct categories were observed: (a) a decreasing exponential curve, (b) a negative linear curve, and (c) an all-or-none curve representing a very high degree of confidence when p is less than 0.05 and very low confidence otherwise. 2. Only subjects in the decreasing exponential curve group expressed higher confidence with larger sample sizes. |
Haller & Krauss (2002) Misinterpretations of Significance: A Problem Students Share with Their Teachers? [link] |
NHST | Survey | Germany | Methodology instructors (n=30) Scientific psychologists (n=39) Undergraduates (n=44) |
1. 80% of psychology methodology instructors demonstrated at least one NHST misinterpretation. 2. 90% of scientific psychologists demonstrated at least one NHST misinterpretation. 3. 100% of students demonstrated at least one NHST misinterpretation. |
Lecoutre, Poitevineau, & Lecoutre (2003) Even statisticians are not immune to misinterpretations of Null Hypothesis Tests [link] |
NHST | Opinion Elicitation | France | Researchers from various laboratories (n=20) | 1. Percentage of psychologists correctly responding to four situations combining p-values and effect size ranged from 15% to 100%. |
Cumming, Williams, & Fidler (2004) Replication and Researchers’ Understanding of Confidence Intervals and Standard Error Bars [link] |
CI | Interactive Task | J.A. | Authors of 20 high-impact psychology journals (n=89) | 1. 80% of respondents overestimated the probability of a sample mean from a replication falling within the confidence interval of the original sample mean. 2. 70% of respondents overestimated the probability of a sample mean from a replication falling within the standard error interval of the original sample mean. |
Belia, Fidler, Williams, & Cumming (2005) Researchers Misunderstand Confidence Intervals and Standard Error Bars [link] |
CI | Interactive Task | J.A. | Authors of articles in high-impact psychology journals (n=162) | Participants were given a task in which they had to move a Group 2 interval so that its overlap with a fixed Group 1 interval attained a p-value of 0.05. 1. On average they moved the Group 2 confidence interval to a position that attained a p-value of 0.017 (they moved Group 2 too far from Group 1). 2. On average they moved the Group 2 standard error interval to a position that attained a p-value of 0.158 (they moved Group 2 too close to Group 1). |
Coulson, Healey, Fidler, & Cumming (2010) Confidence intervals permit, but do not guarantee, better inference than statistical significance testing [link] |
Dichotomization of evidence | Opinion Elicitation | Australia J.A. |
Authors in major psychology journals (n=102) Academic psychologists (n=50) |
1. On average psychologists offered agreement of only 3.5 to 4.5 on a 7-point Likert scale that two studies were consistent, despite large overlap in their confidence intervals. 2. 27% of responses from Australian psychologists to a fictitious confidence interval scenario mentioned NHST, suggesting dichotomous thinking. |
Lai (2010) Dichotomous Thinking: A Problem Beyond NHST [link] |
Cliff effect | Confidence Elicitation | J.A. | Authors in major psychology and medical journals (n=258) | 1. 21% of respondents demonstrated a cliff effect. |
Monterde-i-Bort et al. (2010) Uses and abuses of statistical significance tests and other statistical resources: a comparative study [link] |
NHST | Survey | Spain | Researchers (n=120) | 1. Using the eight statements from the original 29-statement survey instrument that had clear true/false answers, respondents deviated from the correct answer by an average of 1.32 points on a 5-point Likert scale. |
Lai, Fidler, & Cumming (2012) Subjective p Intervals: Researchers Underestimate the Variability of p Values Over Replication [link] |
NHST | Estimation Task | J.A. | Academic psychologists (n=162) | 1. When first given a p-value from an initial experiment and then asked to estimate an interval in which approximately 80% of p-values would fall if the experiment were repeated, psychologists on average estimated a much narrower interval covering just 48% of future p-values. |
Hoekstra, Johnson, & Kiers (2012) Confidence Intervals Make a Difference: Effects of Showing Confidence Intervals on Inferential Reasoning [link] |
Cliff effect | Estimation Task | Netherlands | PhD students (n=66) | 1. Cliff effects were observed for both NHST and confidence interval statements. 2. Subjects referenced significance less often and effect size more often when results were presented by means of CIs than by means of NHST. 3. On average subjects were more certain that a population effect exists and that the results are replicable when outcomes were presented by means of NHST rather than by means of CIs. |
Hoekstra et al. (2014) Robust misinterpretation of confidence intervals [link] |
CI | Survey | Netherlands | Undergraduates (n=442) Master's students (n=34) Researchers (n=120) |
1. 98% of undergraduate students demonstrated at least one CI misinterpretation. 2. 100% of master's students demonstrated at least one CI misinterpretation. 3. 97% of researchers demonstrated at least one CI misinterpretation. |
Badenes-Ribera et al. (2015) Interpretation of the p value: A national survey study in academic psychologists from Spain [link] |
NHST | Survey | Spain | Academic psychologists (n=418) | 1. 94% of subjects demonstrated at least one NHST misinterpretation related to the inverse probability fallacy. 2. 35% of subjects demonstrated a NHST misinterpretation related to the replication fallacy. 3. 40% of subjects demonstrated at least one NHST misinterpretation related to either the effect size fallacy or the practical/scientific importance fallacy. |
Badenes-Ribera et al. (2015) Misinterpretations Of P Values In Psychology University Students (translation from Catalan) [link] |
NHST | Survey | Spain | Undergraduates (n=63) | 1. 97% of subjects demonstrated at least one NHST misinterpretation related to the inverse probability fallacy. 2. 49% of subjects demonstrated at least one NHST misinterpretation related to either the effect size fallacy or the practical/scientific importance fallacy. 3. 73% of subjects demonstrated a NHST misinterpretation related to correct decision making. |
McShane & Gal (2015) Supplement to “Blinding Us to the Obvious? The Effect of Statistical Training on the Evaluation of Evidence” [link] |
Dichotomization of evidence | Multiple Choice Questionnaire | J.A. | Psychological Science editorial board (n=54) Social Psychological and Personality Science editorial board (n=33) Cognition editorial board (n=31) |
1. A moderate cliff effect was found between p-values of 0.01 and 0.27. |
Kühberger et al. (2015) The significance fallacy in inferential statistics [link] |
NHST | Estimation Task | Austria | Students enrolled in a statistics course (n=133) | When given only a cue that a study result was either significant or nonsignificant, students consistently estimated larger effect sizes in the significant scenario than in the nonsignificant scenario. |
García-Pérez and Alcalá-Quintana (2016) The Interpretation of Scholars' Interpretations of Confidence Intervals: Criticism, Replication, and Extension of Hoekstra et al. (2014) [link] |
CI | Survey | Spain | First-year students (n=313) Master's students (n=158) |
1. 99% of first-year undergraduate students demonstrated at least one CI misinterpretation. 2. 97% of master's students demonstrated at least one CI misinterpretation. 3. When given the opportunity to omit responses to statements because they felt they could not answer, between 20% and 60% of master's students chose to omit (depending on the statement). |
Badenes-Ribera et al. (2016) Misconceptions of the p-value among Chilean and Italian Academic Psychologists [link] |
NHST | Survey | Chile Italy |
Chilean academic psychologists (n=30) Italian academic psychologists (n=134) |
1. 62% of subjects demonstrated at least one NHST misinterpretation related to the inverse probability fallacy. 2. 12% of subjects demonstrated a NHST misinterpretation related to the replication fallacy. 3. 5% of subjects demonstrated a NHST misinterpretation related to the effect size fallacy. 4. 9% of subjects demonstrated a NHST misinterpretation related to the practical/scientific importance fallacy. |
Kalinowski, Jerry, & Cumming (2018) A Cross-Sectional Analysis of Students’ Intuitions When Interpreting CIs [link] |
CI | Interactive Task | Australia | Students, various disciplines but 66% were psychology (n=101) | 1. 74% of students had at least one CI misconception in a set of three tasks. |
Ulrich & Miller (2018) Some Properties of p-Curves, With an Application to Gradual Publication Bias [link] |
Dichotomization of evidence | Opinion Elicitation | Germany J.A. |
German psychologists (n=590) Experimental psychologists (n=610) |
1. A prominent cliff effect was observed for experimental psychologists, but not for German psychologists. |
Lyu et al. (2018) P-Value, Confidence Intervals, and Statistical Inference: A New Dataset of Misinterpretation [link] |
NHST | Survey | China | Undergraduates (n=106) Master's students (n=162) PhD students (n=47) Post-PhD (n=31) |
1. 94% of undergraduate students demonstrated at least one NHST misinterpretation. 2. 96% of master's students demonstrated at least one NHST misinterpretation. 3. 100% of PhD students demonstrated at least one NHST misinterpretation. 4. 93% of subjects with a PhD demonstrated at least one NHST misinterpretation. |
Lyu et al. (2020) Beyond psychology: prevalence of p value and confidence interval misinterpretation across different fields [link] |
NHST CI |
Survey | China | Undergraduates (n=67) Master's students (n=122) PhD students (n=47) Post-PhD (n=36) |
1. 94% of undergraduate students demonstrated at least one NHST misinterpretation and 93% demonstrated at least one CI misinterpretation. 2. 93% of master's students demonstrated at least one NHST misinterpretation and 93% demonstrated at least one CI misinterpretation. 3. 81% of PhD students demonstrated at least one NHST misinterpretation and 85% demonstrated at least one CI misinterpretation. 4. 92% of subjects with a PhD demonstrated at least one NHST misinterpretation and 92% demonstrated at least one CI misinterpretation. |
Themes and criticism
Across the 32 studies a number of themes and related criticism were uncovered. These are outlined below.
Respondents struggled most with one NHST statement
One NHST statement proved particularly problematic for respondents: Statement 5 from Oakes (1986). Oakes’ survey instrument first provides a hypothetical scenario involving a simple independent-means t-test of two groups with a resulting p-value. Subjects are asked to provide true/false responses to six statements about what can be inferred from the results. Statement 5 read, “You know, if you decide to reject the null hypothesis, the probability that you are making the wrong decision.” The statement had the highest rate of misinterpretation in Oakes’ original study as well as in several replications. Although this article is focused on psychology, surveys of other disciplines produced similar outcomes.
The high rate of misinterpretation is likely due to the association respondents made between the statement and the Type I error rate. Formally, the Type I error rate is given by the pre-specified alpha value, usually set to a probability of 0.05 under the standard definition of statistical significance. It could then be said that if the sampling procedure and p-value calculation were repeated on a population in which the null hypothesis were true, 5% of the time the null would be mistakenly rejected. The Type I error rate is sometimes summarized as, “The probability that you wrongly reject a true null hypothesis.”
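As a concrete illustration of that frequentist definition (our own sketch, not material from any study reviewed here), the simulation below repeatedly draws two samples from the same population, so the null hypothesis is true by construction, and runs an independent-means t-test on each pair. Roughly 5% of the resulting p-values fall below an alpha of 0.05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_reps, alpha, n_per_group = 10_000, 0.05, 20

rejections = 0
for _ in range(n_reps):
    # Both groups come from the same normal population, so the null hypothesis is true.
    group_a = rng.normal(loc=0.0, scale=1.0, size=n_per_group)
    group_b = rng.normal(loc=0.0, scale=1.0, size=n_per_group)
    _, p = stats.ttest_ind(group_a, group_b)
    rejections += p < alpha

# The long-run rejection rate approximates the Type I error rate, i.e. alpha.
print(rejections / n_reps)  # roughly 0.05
```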
There is a rearranged version of Statement 5 that is close to this Type I error definition: “You know the probability that you are making the wrong decision if you decide to reject the null hypothesis.” Note though that this statement is missing a key assumption from the Type I error rate: that the null hypothesis is true. The actual wording of Statement 5 was more complex: “You know, if you decide to reject the null hypothesis, the probability that you are making the wrong decision.” The sentence structure makes the statement more difficult to understand, but again the statement does not include any indication about the truth of the null hypothesis. For this reason the statement cannot be judged as true.
As an additional piece of analysis we set out to understand if Statement 5 was syntactically sound and consulted with Maayan Abenina-Adar, a final-year PhD student at UCLA’s Department of Linguistics. Statement 5 may seem somewhat awkward as written. A clearer version of the sentence would be the rearranged version previously mentioned: “If you decide to reject the null hypothesis you know the probability that you are making the wrong decision.” This version avoids some of the syntactic complexity of the original statement:
The conditional antecedent appearing in the middle of the sentence. The conditional antecedent is the “if you decide to reject the null hypothesis” phrase.
The use of the noun phrase, “probability that you are making the wrong decision,” as a so-called “concealed question.” (The concealed question is, “What the probability that you are making the wrong decision is.”).
Whether the phrasing of Statement 5 contributed to its misinterpretation cannot be determined from the data at hand. One might argue that the more complex sentence structure caused respondents to spend extra time thinking about the nature of the statement, which might reduce misunderstanding. Plus, Statement 6 was also syntactically complex, but did not elicit the same rate of misinterpretation. (The Statement 6 wording was: “You have a reliable experimental finding in the sense that if, hypothetically, the experiment were repeated a great number of times, you would obtain a significant result on 99% of occasions.”). A controlled experiment comparing the two different versions of Statement 5 would be needed to tease apart the impact of the sentence’s structure and its statistical meaning on the rate of misunderstanding.
One other notable factor is that the pre-specified alpha value is not present in either the hypothetical scenario preceding Statements 1-6 or in Statement 5 itself. This might have been a clue that the statement couldn’t have referred to the Type I error rate since not enough information was given. On the other hand, the 0.05 alpha probability is so common that respondents may have assumed its value.
Using a different set of statements, the Psychometrics Group Instrument, both Mittag and Thompson (2000) and Gordon (2001) found that two sets of education researchers had particular trouble with statements about Type I and Type II errors. However, this same instrument was also used by Monterde-i-Bort et al. (2010) to survey psychologists and, while some researchers struggled with these questions, they were not answered incorrectly more frequently than other types of questions. More research would be needed to understand whether Type I and Type II errors are generally poorly understood by psychology students and researchers and the extent to which statement wording affects this confusion.
Survey instrument design can be improved
Survey instruments provide valuable insight into the statistical knowledge of students and researchers. In these surveys subjects are provided a set of statements which they must judge as true or false. The studies attempt to assess knowledge by constructing statements which are fundamental to proper usage of the associated statistical tools, often written so that they represent common misconceptions of statistical procedures or interpretations. Therefore, if subjects cannot correctly answer the statements they are judged to have a misinterpretation of the particular statistical ideas being tested.
Oakes (1986) was the first study to test psychology researchers. A number of surveys were derived from Oakes (1986) and thus have common properties. First, only “true” and “false” responses are available. There is no response option to explicitly indicate ignorance, such as “I don’t know.” Second, all responses are compulsory, a subject cannot leave a response unanswered. Third, “false” is the correct answer to all statements presented. These design decisions are not without criticism.
One particularly relevant critique came from García-Pérez and Alcalá-Quintana (2016). Their criticism was a response to the six-question confidence interval survey instrument from Hoekstra et al. (2014). García-Pérez and Alcalá-Quintana (2016) argued that rather than an all-false statement set, a better design would have been to have an equal number of correct and incorrect statements. This methodology would increase the fidelity of the results, for example by countering bias respondents may have had to systematically answer true or false.
But García-Pérez and Alcalá-Quintana (2016) raised additional concerns as well, unveiling the methodological challenges inherent in any survey instrument with compulsory responses and no “I don’t know” response option. A survey instrument with this design cannot distinguish misinterpretation from mere guessing since respondents cannot opt out of answering or directly indicate a lack of knowledge. Where a respondent might willingly admit ignorance, they are instead forced into a “misinterpretation.”
Whether one views the misinterpretation/guessing distinction as relevant likely depends on the research question. Is the focus on the scope of the problem or on possible solutions? Assuming one considers a set of statistical statements to be unambiguous, with normatively correct answers, and is primarily concerned with the scope of the problem — the existing statistical knowledge of members of a particular academic discipline and their students — then there may be little desire to distinguish ignorance from misconception. Either a respondent understands the fundamental set of statistical statements or they do not. Whether the subject is knowingly ignorant or instead unknowingly harbors misconceptions is irrelevant; the full set of concepts present in the statements are within the universe needed for proficient statistical analysis. In either case the subject is therefore deficient. Such clear cut views of proficiency are not mere smugness. One may be primarily concerned with the average statistical knowledge base of particular academic disciplines because there are implications for the quality of the resulting research.
On the other hand if one seeks a solution by way of pedagogical prescription, the distinction is no doubt relevant. If misinterpretations are the more prominent feature then education resources should focus on shoring up existing explanations. However, if ignorance is the more common observation then the scope of explanations may warrant widening, with entirely new explanations and ways of thinking about the procedures added.
However, there is also a dispute over whether one can really interpret statements present in the survey instruments as “unambiguous, with normatively correct answers.” This difficulty was highlighted above with interpretations of Statement 5 from Oakes (1986). Likewise, García-Pérez and Alcalá-Quintana (2016) argued that some confidence interval statements from Hoekstra et al. (2014) are indeed ambiguous, with either a true or a false response defensible depending on how one interprets the statement. While such issues are important in interpreting the results, it is worth noting that the majority of evidence still points to high levels of misinterpretation across both students and researchers. García-Pérez and Alcalá-Quintana (2016) themselves found confidence interval misinterpretations were still prevalent, even when researchers self-indicated an informed opinion about the statements presented. And Badenes-Ribera and colleagues used a 10-statement NHST survey instrument that included mostly false but some true statements, and still found substantial rates of misinterpretation. Likewise, Falk and Greenbaum (1995) and more recently Lyu et al. (2020) used different NHST statement wording with similarly high misinterpretation rates. In addition, task-based tests of NHST and confidence interval knowledge also suggest gaps in knowledge for large proportions of respondents. Finally, as the summary of this article indicated, other types of evidence such as reviews of psychology textbooks also underscore common misinterpretations.
Another possible methodological shortcoming is that typically a single version of the survey instrument was shown to respondents. Of the 12 NHST and confidence interval survey studies assessing misinterpretations, only Lyu et al. (2020) randomized participants into different versions, one with a statistically significant hypothetical scenario and one with a statistically nonsignificant scenario. The NHST statements from Lyu et al. (2020) resulted in substantial differences in performance between the two versions; this was not true for the confidence interval questions. Whether this result is specific to the Chinese students and researchers who took the survey is unclear, as is why one version might be more confusing than the other. More research should be conducted that randomizes participants into different survey versions so that the results of Lyu et al. (2020) can be confirmed and causal mechanisms identified.
Respondents did not demonstrate better knowledge of confidence intervals than NHST
Another theme was that in general psychology researchers did not demonstrate a better understanding of confidence intervals than NHST. Some might view this as particularly problematic because confidence intervals are often suggested as a replacement or supplement to NHST (for example, see references 1, 2, 3, 10). When it came to survey instruments respondents did worse, not better, when judging the truth or falsity of confidence interval statements. Likewise respondents also demonstrated misunderstandings in task-based studies. As discussed in the previous section there are some methodological caveats, but in general the finding is robust.
The implications of this finding are somewhat complex. When viewed within a cost-benefit framework some argue that confidence intervals are still better since their benefits over NHST are numerous. Such arguments have included [3, 24]:
Confidence intervals redirect focus from p-values to effect sizes, typically the real meat of any decision making criteria
Confidence intervals help quantify the precision of the effect size estimate, helping to take into account uncertainty
Confidence intervals can more fruitfully be combined into meta-analyses than effect sizes alone
Confidence intervals are better at predicting future experimental results than p-values
On the other hand, some have argued that confidence intervals are deceptively hard to use correctly and therefore the benefits may not outweigh the costs [25]. For instance, some researchers misunderstand the properties of confidence interval precision: narrow confidence intervals do not necessarily imply more precise estimates [25]. And as studies covered in this article show, it can be difficult for researchers to truly understand the information contained in confidence intervals and the resulting interpretation. Even textbooks sometimes get it wrong [26].
The debate about how confidence intervals should be incorporated into the analysis and presentation of scientific results is not likely to end soon. However, for those interested in the debate, the “Confidence interval review” section of this article summarizes the outcomes of the various direct inquiries into student and researcher knowledge.
Between-subject analysis may reveal important patterns
Between-subject analysis revealed important findings within the small set of studies focusing on the so-called cliff effect, the idea that there is a large but unwarranted drop in researcher confidence between p-values below the 0.05 threshold and those above it. Poitevineau et al. (2001) found that averaging across subjects masked between-subject heterogeneity in p-value confidence, making the effect appear more severe than when subjects were considered individually. In fact, the average cliff effect they found was substantially driven by a small number of respondents who expressed what the authors call an “all-or-none” approach to p-values, with extremely high confidence for p-values less than 0.05 and almost zero confidence for p-values larger than 0.05. Two other patterns of researcher confidence were also identified, with some researchers exhibiting a milder cliff effect and others exhibiting almost no cliff effect. Lai (2010) replicated the experiment with a much larger sample size and came to somewhat different conclusions than Poitevineau et al. (2001), noting the cliff effect was more prominent than claimed. But this finding, and the other contributions of his study, were only possible because of between-subject analysis.
Another example came from Belia et al. (2005). Interpreting results from the study benefited from the histogram of subject responses provided by the authors, which showed the average was somewhat affected by a small number of outliers. Finally, Lyu et al. (2018) and Lyu et al. (2020) made their raw data publicly available, and so it was possible for us to reanalyze their results, revealing important findings not presented in the original papers.
In general it is difficult to discern the importance of between-subject heterogeneity and average effects without access to the raw data. Instead, we must trust the analysis provided by authors in their published findings. However, there is undoubtedly valuable information hidden within the inaccessible raw data of some published studies. We encourage the continued trend of making raw data publicly available whenever possible for other researchers to reexamine.
NHST Review
Formal studies of NHST knowledge originated with Michael Oakes’ work in the late 1970s through the mid-1980s. These studies are outlined in his 1986 book Statistical Inference. Oakes’ primary NHST study consisted of a short survey instrument that he presented to 70 academic psychologists in the United States. Oakes notes subjects were “university lecturers, research fellows, or postgraduate students with at least two years of research experience.” The survey instrument outlined the results of a simple experiment and asked the subjects which of six statements could be marked as “true” or “false” based on the experimental results. His survey is shown below:
Suppose you have a treatment that you suspect may alter performance on a certain task. You compare the means of your control and experimental groups (say 20 subjects in each sample). Further, suppose you use a simple independent means t-test and your result is (t = 2.7, d.f. = 18, p = 0.01). Please mark each of the statements below as “true” or “false”. “False” means that the statement does not follow logically from the above premises.
1. You have absolutely disproved the null hypothesis (that is, there is no difference between the population means).
2. You have found the probability of the null hypothesis being true.
3. You have absolutely proved your experimental hypothesis (that there is a difference between the population means).
4. You can deduce the probability of the experimental hypothesis being true.
5. You know, if you decide to reject the null hypothesis, the probability that you are making the wrong decision.
6. You have a reliable experimental finding in the sense that if, hypothetically, the experiment were repeated a great number of times, you would obtain a significant result on 99% of occasions.
The correct answer to all of these statements is “false.” Yet, only 3 of the 70 academic psychologists correctly marked all six statements as false. The average number of incorrect responses — statements for which the subject marked “true” — was 2.5.
Statements 1 and 3 are false because NHST cannot offer a “proof” of hypotheses about scientific or social phenomena. Statements 2 and 4 are examples of the Inverse Probability Fallacy; NHST measures the probability of observed data assuming the null hypothesis is true and therefore cannot offer evidence about the probability of either the null or alternative hypothesis. Statement 6 is incorrect because the p-value is not a measure of experimental replicability. Belief in Statement 6 is indicative of the Replication Fallacy.
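To make the Inverse Probability Fallacy concrete, the short R simulation below is our own illustrative sketch, not part of Oakes’ study; the 50/50 mix of true and false nulls and the assumed effect size of d = 0.5 are arbitrary choices.

```r
# Illustrative sketch (not from Oakes): the p-value is P(data this extreme | H0),
# not P(H0 | data). Assume half of all hypothetical studies have a true null and
# the rest a true effect of d = 0.5, with 20 subjects per group as in the instrument.
set.seed(1)
n_sims  <- 20000
h0_true <- sample(c(TRUE, FALSE), n_sims, replace = TRUE)
p_vals  <- sapply(h0_true, function(null_true) {
  d <- if (null_true) 0 else 0.5
  t.test(rnorm(20, mean = 0), rnorm(20, mean = d))$p.value
})
# Among significant results, the share with a true null is roughly 13% under
# these assumptions; it is not given by the p-value or the 0.05 threshold.
mean(h0_true[p_vals < 0.05])
```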
The results of Oakes (1986) are presented alongside Haller and Krauss (2002) in a table below.
In a second question Oakes asked the 70 academic psychologists which of the six statements aligned with the usual interpretation of statistical significance from the NHST procedure. If none of the answers were correct, respondents were allowed to write in the correct interpretation. Again, only 3 of the 70 correctly identified that none of the answers were the correct interpretation. Across Oakes’ two questions, only 2 of the 70 academic psychologists answered all 12 statements correctly (marking all six statements as false in both questions).
Across numerous studies, including Oakes (1986), Statement 5 was the most misinterpreted. This is likely due to the association respondents made between the statement and the Type I error rate. This phenomenon was discussed in detail in the “Themes & Criticisms” section of this article.
In a 2002 follow-up study to Oakes (1986), Heiko Haller and Stefan Krauss repeated the experiment in six German universities, presenting the survey to 44 psychology students, 39 non-methodology instructors in psychology, and 30 methodology instructors in psychology. Haller and Krauss added a sentence at the end of the result description that preceded the statements, noting that, “several or none of the statements may be correct.”
Subjects from Haller and Krauss (2002) were surveyed in 6 different German universities. Methodology instructors consisted of "university teachers who taught psychological methods including statistics and NHST to psychology students." Note that in Germany methodology instructors can consist of "scientific staff (including professors who work in the area of methodology and statistics)” as well as “advanced students who teach statistics to beginners (so called 'tutors')." Scientific psychologists consisted of "professors and other scientific staff who are not involved in the teaching of statistics."
The results of Haller and Krauss (2002) were similar to Oakes (1986) with 100% of the psychology students, 89% of the non-methodology instructors, and 80% of the methodology instructors incorrectly marking at least one statement as “true.” The average number of responses incorrectly marked as “true” was generally lower than in Oakes: 2.5 for the psychology students, 2.0 for non-methodology instructors, and 1.9 for methodology instructors.
Details of the six different misinterpretations for both Oakes’ original study as well as the Haller and Krauss replication are shown in the chart below. The education level with the highest proportion of misinterpretation by statement is highlighted in red, showing that U.S. psychologists generally fared worse than those in Germany. The most common misinterpretation across all four groups was that the p-value represents the Type I error rate. The least common interpretation among all groups within German universities was that the null hypothesis had been proved. This selection was likely marked as “false” due to the small p-value in the result statement. In contrast, relatively small percentages of each group believed that the small p-value indicated the null hypothesis had been disproved, indicating at least some understanding that the p-value is a probabilistic statement, not a proof. Details of this misunderstanding were examined in Statement 4, the “probability of the null is found,” with more than half of German methodology instructors and scientific psychologists answering correctly, but more than half of the other two groups answering incorrectly. The p-value is indeed a probability statement, but it is a statement about data’s compatibility with the null hypothesis, not the null hypothesis itself.
Statement summaries | Methodology instructors (German universities) | Scientific psychologists (German universities) | Psychology students (German universities) | Academic psychologists (U.S.) |
---|---|---|---|---|
1. Null hypothesis disproved | 10% | 15% | 34% | 1% |
2. Probability of null hypothesis | 17% | 26% | 32% | 36% |
3. Null hypothesis proved | 10% | 13% | 20% | 6% |
4. Probability of null is found | 33% | 33% | 59% | 66% |
5. Probability of Type I error | 73% | 67% | 68% | 86% |
6. Probability of replication | 37% | 49% | 41% | 60% |
Percentage with at least one misunderstanding | 80% | 89% | 100% | 96% |
Average number of misinterpretations | 1.9 | 2.0 | 2.5 | 2.5 |
Table notes:
1. Sample sizes: methodology instructors (n=30), scientific psychologists (n=39), psychology students (n=44), U.S. academic psychologists (n=70).
2. Reproduced from (a) "Misinterpretations of Significance: A Problem Students Share with Their Teachers?", Heiko Haller & Stefan Krauss, Methods of Psychological Research Online, 2002 [link] (b) Statistical Inference, Michael Oakes, Epidemiology Resources Inc., 1990 [link]
One criticism of Oakes’ six-statement instrument is that the hypothetical setup itself is incorrect. The supposed situation notes that there are two groups of 20 subjects each, but that the resulting degrees of freedom is only 18, as can be seen in the instrument wording: “suppose you use a simple independent means t-test and your result is (t = 2.7, d.f. = 18, p = 0.01).” However, the correct degrees of freedom is actually 38 (20-1 + 20-1=38).
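As a quick check of this setup, the degrees of freedom and the implied two-tailed p-values can be computed directly; the snippet below is a simple illustration of that arithmetic.

```r
# Degrees of freedom for an independent means t-test with 20 subjects per group
n1 <- 20; n2 <- 20
(n1 - 1) + (n2 - 1)                         # 38, not the 18 printed in the instrument

# Two-tailed p-values for t = 2.7 under each stated df (rough check)
2 * pt(2.7, df = 18, lower.tail = FALSE)    # roughly 0.015
2 * pt(2.7, df = 38, lower.tail = FALSE)    # roughly 0.010
```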
One might wonder whether this miscalculation confused some of the respondents. This explanation seems unlikely since follow-up surveys found similar results. In 2018 Ziyang Lyu, Kaiping Peng, and Chuan-Peng Hu surveyed 346 psychology students and researchers in China using the same wording as in Haller and Krauss (translated into Chinese), but correcting the degrees of freedom from 18 to 38. Error rates were similar to those of previous studies, with 97% of Master’s and PhD students (203 out of 209), 100% of post-doc researchers and assistant professors (23 out of 23), and 75% of experienced professors (6 out of 8) selecting at least one incorrect answer.
Lyu et al. recruited subjects via an online survey that included notifications on WeChat, Weibo, and blogs. There was also a pen-and-paper survey that was conducted during the registration day of the 18th National Academic Congress of Psychology that took place in Tianjin, China. No monetary or other material payment was offered.
Details from Lyu et al. (2018) are shown below. The education level with the highest proportion of misinterpretation by statement is highlighted in red, demonstrating that master’s students had the highest misinterpretation rate for four of the six statements. The overall pattern is similar to that seen in Oakes and in Haller and Krauss: the “Probability of Type I error” and “Probability of null is found” misinterpretations were the most prominent. The “Null disproved” and “Null proved” misinterpretations were the least selected, but both were still quite common and substantially more prevalent than in the German and U.S. samples.
Statement summaries | Undergraduate | Master's | PhD | Postdoc and assistant professors | Experienced professors |
---|---|---|---|---|---|
1. Null Hypothesis disproved | 20% | 53% | 17% | 39% | 0% |
2. Probability of null hypothesis | 55% | 58% | 51% | 39% | 25% |
3. Null hypothesis proved | 28% | 46% | 6% | 35% | 25% |
4. Probability of null is found | 60% | 43% | 51% | 39% | 50% |
5. Probability of Type I error | 77% | 44% | 96% | 65% | 75% |
6. Probability of replication | 32% | 56% | 36% | 35% | 12% |
Table notes:
1. Percentages are not meant to add to 100%.
2. Sample sizes: Undergraduates (n=106), Master's students (n=162), PhD students (n=47), Postdoc or assistant prof (n=23), Experienced professor (n=8).
3. Data calculated from (a) "P-Value, Confidence Intervals, and Statistical Inference: A New Dataset of Misinterpretation", Ziyang Lyu, Kaiping Peng, and Chuan-Peng Hu, Frontiers in Psychology, 2018, paper: [link], data: [link]
Although not presented in their article, Lyu et al. also recorded the psychology subfield of subjects in their study. Because the authors made their data public, it was possible to reanalyze the data, slicing by sub-field. R code used to calculate these figures is provided in the “Additional Resources” section of this article. All subfields had relatively high rates of NHST misinterpretations. Social and legal psychology had the highest misinterpretation rate with 100% of the 51 respondents in that subfield having at least one misinterpretation. Neuroscience and neuroimaging respondents had the lowest rate with 78% having at least one misinterpretation, although there were only nine total subjects in the subfield.
Psychological subfield | Percentage with at least one NHST misunderstanding | Sample size |
---|---|---|
Fundamental research & cognitive psychology | 95% | 74 |
Cognitive neuroscience | 98% | 121 |
Social & legal psychology | 100% | 51 |
Clinical & medical psychology | 84% | 19 |
Developmental & educational psychology | 97% | 30 |
Psychometric and psycho-statistics | 94% | 16 |
Neuroscience/neuroimaging | 78% | 9 |
Others | 94% | 17 |
Table notes:
Data calculated from (a) "P-Value, Confidence Intervals, and Statistical Inference: A New Dataset of Misinterpretation", Ziyang Lyu, Kaiping Peng, and Chuan-Peng Hu, Frontiers in Psychology, 2018, paper: [link], data: [link]
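For readers who want to replicate this breakdown, a sketch of the calculation is shown below. The file path and column names (a `subfield` column and item-level indicators `nhst_1` through `nhst_6` equal to 1 when a statement was answered incorrectly) are placeholders; the actual names in the released dataset may differ.

```r
# Sketch of the subfield breakdown above from the public Lyu et al. (2018) data.
# File path and column names are placeholders for the actual released dataset.
library(dplyr)

dat <- read.csv("lyu_2018_data.csv")
item_cols <- paste0("nhst_", 1:6)   # 1 = statement answered incorrectly (assumed coding)

dat %>%
  mutate(any_misinterpretation = rowSums(across(all_of(item_cols))) > 0) %>%
  group_by(subfield) %>%
  summarise(
    pct_at_least_one = round(100 * mean(any_misinterpretation)),
    n = n()
  )
```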
Modified instruments with slightly different hypothetical setups have also been used. In 1995 Ruma Falk and Charles Greenbaum tested 53 psychology students’ NHST knowledge at the Hebrew University of Jerusalem using the five-statement instrument below.
The students attended Hebrew University of Jerusalem. The authors note that, "these students have had two courses in statistics and probability plus a course of Experimental Psychology in which they read Bakan's (1966) paper." Bakan (1966) warns readers against the Inverse Probability Fallacy and other NHST misconceptions [19].
The subjects were told that a test of significance was conducted and the result turned out significant at a predetermined level. They were asked what the meaning of such a result is. The following five options were offered as answers.
1. We proved that H0 is not true.
2. We proved that H1 is true.
3. We showed that H0 is improbable.
4. We showed that H1 is probable.
5. None of the answers 1-4 is correct.
The authors note that although multiple answers were allowed, all students chose only a single answer. The correct answer, Item 5, was chosen by just seven subjects (8%).
In a study from China released in 2020, Xiao-Kang Lyu, Yuepei Xu, Xiao-Fan Zhao, Xi-Nian Zuo, and Chuan-Peng Hu tested 1,479 students and researchers, including 272 in the field of psychology. This was the largest sample size of the eight disciplines surveyed. They used a four-statement instrument derived from Oakes (1987) in which respondents were randomized into either a version where the p-value was statistically significant or one where it was statistically nonsignificant.
Note that the Lyu in Lyu et al. (2018) described above and in Lyu et al. (2020) described here are different researchers. Ziyang Lyu coauthored the 2018 study, Xiao-Kang Lyu coauthored the 2020 study. However, Chuan-Peng Hu was a coauthor on both studies.
Recruitment for Lyu et al. (2020) was done by placing advertisements on the following WeChat Public Accounts: The Intellectuals, Guoke Scientists, Capital for Statistics, Research Circle, 52brain, and Quantitative Sociology. The location of respondents’ highest academic degree consisted of two geographic areas, Mainland China (83%) and overseas (17%).
Lyu et al. (2020) used the following survey instrument:
Suppose you have a treatment that you suspect may alter performance on a certain task. You compare the means of your control and experimental groups (50 subjects in each sample). Your hypotheses are as follows. H0: No significant difference exists between the population means corresponding to experimental and the control groups. H1: Significant difference exists between the experimental and the control groups. Further, suppose you use a simple independent means t test, and your result is t = 2.7, df = 98, p = .008 (in the significant version) or t = 1.26, df = 98, p = .21 (in the non-significant version).
The response statements read as follows with the nonsignificant version wording appearing in parenthesis, substituting for the word directly preceding it.
1. You have absolutely disproved (proved) the null hypothesis.
2. You have found the probability of the null (alternative) hypothesis being true.
3. You know, if you decide to (not to) reject the null hypothesis, the probability that you are making the wrong decision.
4. You have a reliable (unreliable) experimental finding in the sense that if, hypothetically, the experiment was repeated a great number of times, you would obtain a significant result on 99% (21%) of occasions.
Using the open source data made available by the authors we attempted to reproduce the findings in Lyu et al. (2020). However, we were not able to reproduce the top-level figures for NHST or CI, nor any of the figures from the main table (Table 1 in the original paper). We contacted co-author Chuan-Peng Hu about the possible errors over email and shared our R code. He and the other paper authors then reexamined the data and confirmed our analysis was correct, later issuing a correction to the paper.
We also used the open source data to more thoroughly examine the study results. At 91%, psychologists had the sixth highest proportion of respondents with at least one NHST misinterpretation out of the eight professions surveyed by Lyu et al. (2020). The proportion of respondents incorrectly answering each statement in the significant and nonsignificant versions of the instrument is shown below. The significant version had a higher misinterpretation rate for three out of the four statements, including a 14 percentage point difference for Statement 3 and a 13 percentage point difference for Statement 4.
Overall, psychologists had the second highest average number of NHST misinterpretations, 1.94 out of a possible four. Given the results in the table below it is perhaps surprising that the nonsignificant version had a substantially higher average number of misinterpretations, 2.14, compared to 1.71 for the significant version. We used an independent means t-test to compare the average number of incorrect responses between the two test versions, which resulted in a p-value of 0.0017 (95% CI: 0.16 to 0.70). This suggests that the observed data are relatively incompatible with the hypothesis that random sampling variation alone accounted for the difference in the average number of misinterpretations between the two test versions. Looking across the entire sample of 1,479 respondents from all eight disciplines paints a similar picture, with a p-value of 0.00011 (95% CI: 0.12 to 0.36). However, more research would be needed to fully understand why the nonsignificant version posed more interpretation challenges.
Statement summaries | Significant version (n=125) | Nonsignificant version (n=147) |
---|---|---|
1. You have absolutely disproved (proved) the null hypothesis | 50% | 54% |
2. You have found the probability of the null (alternative) hypothesis being true. | 59% | 40% |
3. You know, if you decide to (not to) reject the null hypothesis, the probability that you are making the wrong decision. | 77% | 63% |
4. You have a reliable (unreliable) experimental finding in the sense that if, hypothetically, the experiment was repeated a great number of times, you would obtain a significant result on 99% (21%) of occasions. | 42% | 29% |
Table notes:
1. Percentages do not add to 100% because multiple responses were acceptable.
2. Sample sizes: significant version (n=125), nonsignificant version (n=147).
3. Reference: "Beyond psychology: prevalence of p value
and confidence interval misinterpretation
across different fields", Xiao-Kang Lyu, Yuepei Xu2, Xiao-Fan Zhao1, Xi-Nian Zuo, and Chuan-Peng Hu, Journal of Pacific Rim Psychology, 2020, paper: [link], data: [link]
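The version comparison described above can be reproduced from the public data along the lines sketched below. The file name and the column names (`discipline`, `version`, and an `n_nhst_errors` count of incorrectly answered NHST statements) are placeholders, and the use of R's default Welch t-test is a choice on our part rather than necessarily the exact test specification used for the figures above.

```r
# Sketch of the test-version comparison for psychologists in Lyu et al. (2020).
# File and column names are placeholders; n_nhst_errors is assumed to count the
# number of incorrectly answered NHST statements (0-4) per respondent.
dat2020 <- read.csv("lyu_2020_data.csv")
psych   <- subset(dat2020, discipline == "Psychology")

# Compare average error counts between the significant and nonsignificant versions
t.test(n_nhst_errors ~ version, data = psych)
```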
A breakdown of NHST misunderstandings by education is shown below. All education levels had high rates of misunderstanding at least one NHST statement. PhD students fared best, with “only” 81% demonstrating at least one NHST misinterpretation, but they also had the highest average number of incorrect responses with 2.4 (out of four possible), indicating that those respondents that did have a misinterpretation tended to have multiple misinterpretations. This was one of the highest rates of any combination of academic specialty and education, behind only statistics PhD students and post-PhD statisticians.
Education | Sample size | Percentage with at least one NHST misunderstanding | Average number of NHST misunderstandings (out of four) |
---|---|---|---|
Undergraduates | 67 | 94% | 1.9 |
Master's | 122 | 93% | 1.8 |
PhD | 47 | 81% | 2.4 |
Post-PhD | 36 | 92% | 2.0 |
In a set of three studies — the first focusing on Spanish academic psychologists, the second on Spanish university students, and the third on Italian and Chilean academic psychologists — researchers Laura Badenes-Ribera and Dolores Frías-Navarro created a 10-statement instrument. (They partnered with Marcos Pascual-Soler on both Spanish studies, Héctor Monterde-i-Bort on the Spanish academic psychologist study, and Bryan Iotti, Amparo Bonilla-Campos and Claudio Longobardi on the Italian and Chilean study). The study of university students was published in Catalan, but we had the study professionally translated into English [13].
Those particularly interested in the statistical practices of Spanish and Italian psychologists are also encouraged to review the team’s other work on knowledge of common statistical terms (for Italian researchers see reference 14, for Spanish researchers see reference 15). These studies are not reviewed here as they do not deal with statistical misinterpretations.
The authors’ primary survey instrument is shown below, categorized by the fallacy each statement is attempting to test.
Let’s suppose that a research article indicates a value of p = 0.001 in the results section (alpha = 0.05). Mark which of the following statements are true (T) or false (F).
Inverse Probability Fallacy
1. The null hypothesis has been shown to be true
2. The null hypothesis has been shown to be false
3. The probability of the null hypothesis has been determined (p = 0.001)
4. The probability of the experimental hypothesis has been deduced (p = 0.001)
5. The probability that the null hypothesis is true, given the data obtained, is 0.01
Replication Fallacy
6. A later replication would have a probability of 0.999 (1-0.001) of being significant
Effect Size Fallacy
7. The value p = 0.001 directly confirms that the effect size was large
Clinical or Practical Significance Fallacy
8. Obtaining a statistically significant result indirectly implies that the effect detected is important
Correct interpretation and decision made
9. The probability of the result of the statistical test is known, assuming that the null hypothesis is true
10. Given that p = 0.001, the result obtained makes it possible to conclude that the differences are not due to chance
Statements 1 and 2 are false because NHST cannot definitively show the truth or falsity of any hypothesis. Statements 3, 4, and 5 are examples of the Inverse Probability Fallacy; NHST measures the probability of the observed data assuming the null hypothesis is true and therefore cannot offer evidence about the probability of either the null or alternative hypothesis. It is unclear why Statement 5 uses a p-value of 0.01 while the other statements use a p-value of 0.001; thus far the authors have not replied to multiple emails we have sent, so we cannot provide additional insight. Statement 6 is incorrect because the p-value is not a measure of experimental replicability, the so-called Replication Fallacy. Statement 7 is incorrect because the p-value is not a direct measure of effect size; a separate calculation is needed to determine effect size. The p-value can act as an indirect measure of effect size insofar as statistically significant p-values create a filter effect that we term “The zone of nonsignificance.” Statement 8 is incorrect because whether a result is important depends not on the p-value, but on the broader scientific context under which the hypothesis test is being performed. Statement 9 is correct as it is a restatement of the p-value definition. The authors also consider Statement 10 correct, but its interpretation depends on how one interprets the word “conclude.” One cannot definitively conclude that a small p-value implies differences that are not due to random chance, as large random errors are sometimes observed and these heavily influence p-values. Some subjects might also interpret “conclude” to mean the decision is correct. However, as previously stated, NHST does not provide probabilities of hypotheses and therefore the correctness of the decision cannot be determined from the p-value. On the other hand, a p-value of 0.001 is typically small enough that a researcher considers random errors sufficiently unlikely that they can be ignored for whatever decision or conclusion is currently at hand.
For the study of Spanish academic psychologists a total of 418 subjects who worked at Spanish public universities were recruited. Participants were contacted via email based on a list collected from publicly available sources. The mean period of time working at a university was 14 years. Subjects were asked to provide their subfield.
Again, the results show substantial misunderstanding, with 94% of Spanish academic psychologists choosing at least one incorrect response among the five inverse probability statements, 35% incorrectly answering Statement 6 related to replication, and 40% incorrectly marking Statement 7 or 8 as true. The percentage of respondents demonstrating at least one misunderstanding across all statements was not provided by the authors. Although Statements 9 and 10 were listed in the instrument, the results were not presented.
The percentage of respondents incorrectly answering each statement broken out by psychological subfield is shown below. The subfield with the highest proportion of misunderstandings by statement is highlighted in red. Developmental and educational psychologists fared worst as there were four statements for which that population had the highest proportion of misunderstandings.
Statements | Personality, Evaluation and Psychological Treatments | Behavioral Sciences Methodology | Basic Psychology | Social Psychology | Psychobiology | Developmental and Educational Psychology |
---|---|---|---|---|---|---|
1. The null hypothesis has been shown to be true | 8% | 2% | 7% | 5% | 7% | 13% |
2. The null hypothesis has been shown to be false | 65% | 36% | 61% | 66% | 55% | 62% |
3. The probability of the null hypothesis has been determined (p=0.001) | 51% | 58% | 68% | 62% | 62% | 56% |
4. The probability of the experimental hypothesis has been deduced (p=0.001) | 41% | 13% | 23% | 37% | 38% | 44% |
5. The probability that the null hypothesis is true, given the data obtained, is 0.01 | 33% | 19% | 25% | 31% | 41% | 36% |
6. A later replication would have a probability of 0.999 (1-0.001) of being significant | 35% | 16% | 36% | 39% | 28% | 46% |
7. The value p=0.001 directly confirms that the effect size was large | 12% | 3% | 9% | 16% | 24% | 18% |
8. Obtaining a statistically significant result indirectly implies that the effect detected is important | 35% | 16% | 36% | 35% | 28% | 46% |
9. The probability of the result of the statistical test is known, assuming that the null hypothesis is true | Not shown to respondents | |||||
10. Given that p = 0.001, the result obtained makes it possible to conclude that the differences are not due to chance | Not shown to respondents |
Table notes:
1. The academic sub-field with the highest proportion of misinterpretation by statement is highlighted in red.
2. Percentages are not meant to add to 100%.
3. Sample sizes: Personality, Evaluation and Psychological Treatments (n=98), Behavioral Sciences Methodology (n=67), Basic Psychology (n=56), Social Psychology (n=74), Psychobiology (n=29), Developmental and Educational Psychology (n=94)
4. Reference: "Interpretation of the p value: A national survey study in academic psychologists from Spain", Laura Badenes-Ribera, Dolores Frías-Navarro, Héctor Monterde-i-Bort, and Marcos Pascual-Soler, Psicothema, 2015 [link]
The study testing the knowledge of Spanish students used the same 10-statement instrument as the study of Spanish academic psychologists, except the statement about p-value replicability was not present. A total of 63 students took part in the study, all recruited from the University of Valencia. On average students were 20 years of age and all had previously studied statistics. Broken down by fallacy category, 97% of subjects demonstrated at least one NHST misinterpretation related to the Inverse Probability Fallacy (Statements 1-5), 49% demonstrated at least one misinterpretation related to either the Effect Size Fallacy or the Clinical or Practical Significance Fallacy (Statements 7 and 8), and 73% incorrectly answered at least one of the statements related to correct interpretation and decision making (Statements 9 and 10). The percentage of incorrect responses by statement is shown below.
Statements | Percentage incorrectly answering the question |
---|---|
1. The null hypothesis has been shown to be true | 25% |
2. The null hypothesis has been shown to be false | 56% |
3. The probability of the null hypothesis has been determined (p=0.001) | 65% |
4. The probability of the experimental hypothesis has been deduced (p=0.001) | 29% |
5. The probability that the null hypothesis is true, given the data obtained, is 0.01 | 51% |
6. A later replication would have a probability of 0.999 (1-0.001) of being significant | Not shown to respondents |
7. The value p=0.001 directly confirms that the effect size was large | 18% |
8. Obtaining a statistically significant result indirectly implies that the effect detected is important | 41% |
9. The probability of the result of the statistical test is known, assuming that the null hypothesis is true | 49% |
10. Given that p = 0.001, the result obtained makes it possible to conclude that the differences are not due to chance | 50% |
Table notes:
1. Percentages are not meant to add to 100%.
2. Sample size is 63 students.
3. Reference: "Misinterpretations Of P Values In Psychology University Students", Laura Badenes-Ribera, Dolores Frías-Navarro, and Marcos Pascual-Soler, Anuari de Psicologia de la Societat Valenciana de Psicologia, 2015 [link]
The study of Chilean and Italian academic psychologists included 164 participants overall (134 Italian, 30 Chilean). Participants were contacted via email based on a list collected from Chilean and Italian universities. For both countries subjects were broken out into methodology and non-methodology areas of expertise. In Italy the average time spent teaching or conducting research was 13 years, with a standard deviation of 10.5 years. The gender breakdown was 54% women and 46% men; 86% were from public universities and the remaining 14% were from private universities. In Chile the average time spent teaching or conducting research was 15.5 years, with a standard deviation of 8.5 years. Subjects were evenly split between women and men; 57% were from private universities, while 43% were from public universities.
Overall, 56% of methodology instructors and 74% of non-methodology instructors selected at least one incorrect response. Note that the sample sizes for both Chilean methodologists (n=5) and Italian methodologists (n=13) were substantially smaller than the corresponding sample sizes for non-methodologists (25 and 121, respectively). Although Statements 9 and 10 were listed in the instrument, the results were not presented.
The percentage of incorrect responses by statement is shown below. The population with the highest proportion of misinterpretation by statement is highlighted in red, showing that the two countries had roughly equal levels of misinterpretation. Although Chilean methodologists did not have the highest level of misinterpretation on any statement, there were only five of them in the sample.
Statements | Chilean methodology instructors | Chilean other areas | Italian methodology instructors | Italian other areas |
---|---|---|---|---|
1. The null hypothesis has been shown to be true | 0% | 4% | 0% | 4% |
2. The null hypothesis has been shown to be false | 40% | 60% | 23% | 28% |
3. The probability of the null hypothesis has been determined (p=0.001) | 20% | 12% | 31% | 26% |
4. The probability of the experimental hypothesis has been deduced (p=0.001) | 0% | 16% | 8% | 12% |
5. The probability that the null hypothesis is true, given the data obtained, is 0.01 | 0% | 8% | 23% | 14% |
6. A later replication would have a probability of 0.999 (1-0.001) of being significant | 0% | 20% | 8% | 12% |
7. The value p=0.001 directly confirms that the effect size was large | 0% | 0% | 8% | 6% |
8. Obtaining a statistically significant result indirectly implies that the effect detected is important | 0% | 8% | 8% | 9% |
9. The probability of the result of the statistical test is known, assuming that the null hypothesis is true | Not shown to respondents | |||
10. Given that p = 0.001, the result obtained makes it possible to conclude that the differences are not due to chance | Not shown to respondents |
Table notes:
1. Percentages are rounded and may not add to 100%.
2. Sample sizes: Chilean methodologists (n=5), Chilean other areas (n=25), Italian methodologists (n=13), Italian other areas (n=121).
3. Reference: "Misconceptions of the p-value among Chilean and Italian Academic Psychologists", Laura Badenes-Ribera, Dolores Frias-Navarro, Bryan Iotti, Amparo Bonilla-Campos, and Claudio Longobardi, Frontiers in Psychology, 2016 [link]
In 2010 Hector Monterde-i-Bort, Dolores Frías-Navarro, and Juan Pascual-Llobell used Part II of the Psychometrics Group Instrument developed by Mittag (1999) to survey 120 psychology researchers in Spain. All subjects had a doctorate degree or proven research or teaching experience in psychology.
Previously, Mittag and Thompson (2000) surveyed 225 members of the American Educational Research Association (AERA) using the same instrument and Gordon (2001) surveyed 113 members of the American Vocational Education Research Association (AVERA). Both studies are discussed in our article on NHST and confidence interval misunderstandings of education researchers (coming soon).
Part II of the Psychometrics Group Instrument contains 29 statements broken out into nine categories. Subjects responded using a 5-point Likert scale where for some statements 1 meant agree and 5 meant disagree, while for others the scale was reversed so that 1 denoted disagreement and 5 agreement. We refer to the first scale direction as positive (+) and the second as negative (-).
The 5-point Likert scale is often constructed using the labels 1 = Strongly agree, 2 = Agree, 3 = Neutral, 4 = Disagree, 5 = Strongly disagree (for example see Likert’s original 1932 paper [16], or more recent examples [17, 18]). However, the Psychometrics Group Instrument uses the labels 1 = Agree, 2 = Somewhat agree, 3 = Neutral, 4 = Somewhat disagree, 5 = Disagree (with the word “strongly” omitted).
The instrument included both opinion- and fact-based statements. These two categories are not recognized within the instrument itself, but are our categorization based on whether the question has a normatively correct answer; if so it is considered fact-based. An example of an opinion-based statement was:
It would be better if everyone used the phrase, “statistically significant,” rather than “significant”, to describe the results when the null hypothesis is rejected.
An example of a fact-based statement was:
It is possible to make both Type I and Type II error in a given study.
We selected eight fact-based statements related to NHST as a measure of this group’s NHST misunderstandings. Mean responses on the Likert scale are shown below. As a measure of the degree of misunderstanding, the absolute difference between the mean response and the normatively correct answer was calculated. For example, Statement 2, which reads “Type I errors may be a concern when the null hypothesis is not rejected,” is incorrect. This is because Type I error refers to falsely rejecting a true null hypothesis; if the null hypothesis is not rejected there is no possibility of making an error in the rejection. For this reason every respondent should have selected 5 (disagree). However, the mean response was in fact 3.53, and therefore the deviation was 5.0 - 3.53 = 1.47 Likert scale points.
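A minimal sketch of this deviation calculation, using the scoring rules described in the table notes further below, is:

```r
# Minimal sketch of the deviation score. For a positively scored statement
# (1 = agree, 5 = disagree), the normatively correct response is 5 when the
# statement is incorrect and 1 when it is correct; negatively scored
# statements are mirrored.
likert_deviation <- function(mean_response, statement_correct, scale_positive = TRUE) {
  target <- if (statement_correct) 1 else 5
  if (!scale_positive) target <- 6 - target
  abs(target - mean_response)
}

likert_deviation(3.53, statement_correct = FALSE)                        # Statement 2: 1.47
likert_deviation(2.26, statement_correct = TRUE, scale_positive = FALSE) # Statement 3: 2.74
```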
One case in which ambiguity might exist is Statement 1, which reads “It is possible to make both Type I and Type II error in a given study.” For the purposes of this analysis we consider this statement incorrect, as was intended by the authors. This is because Type II error occurs when the null hypothesis is not rejected, while Type I error occurs when the null hypothesis is rejected. Because the null is either rejected or not, with mutual exclusion between the two possibilities, only one type of error can occur in a given hypothesis test. However, the word “study” is used in the statement and therefore one could easily argue that a study can contain multiple hypothesis tests. There is no way to know if respondents interpreted “study” to mean the entire process of data collection and hypothesis testing of a single outcome of interest or if they interpreted it to mean multiple hypothesis tests used to examine various aspects of a single phenomenon.
Averaging the deviation from the correct answer across all eight responses resulted in a figure of 1.32. This is roughly equivalent to somewhat agreeing with a statement that is in fact true, and therefore normatively should elicit complete agreement. The 1.32 mean difference is substantially lower than that found in both the AERA and AVERA populations, which both had mean differences above 1.7.
Statement 3 had the largest mean difference. This was also the case for the AERA and AVERA populations; however, for those populations all three statements relating to Type I and Type II errors had larger deviations from the correct answer than the other statements, which was not true for the psychologists in the Monterde-i-Bort et al. (2010) study.
Statements 4, 6 and 7 were all related to the Clinical or Practical Significance Fallacy. Statement 7 had the lowest deviation of all eight statements, 0.99, which was again also true for the AERA and AVERA populations. The fact that Statement 6 had a larger mean difference than Statements 4 and 7 may be due to the difference in statement wording. Statement 6 read, “Finding that p < .05 is one indication that the results are important.” While some might argue that this statement is true — a p-value less than 0.05 is one, but not the only, indication of an important result — we unambiguously find this statement to be incorrect, as the p-value has no bearing at all on whether a result is important.
Statement 5 relates to the Effect Size Fallacy and is also incorrect. It is true that the relative effect size is one determinant of the size of the p-value; this is the reason for what we call the “zone of nonsignificance.” However, the p-value is not a direct measure of the effect size. For example, a small effect can still produce a small p-value if the sample size is sufficiently large, as the sketch below illustrates.
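The calculation below is our own illustration (the effect of 0.05 SD and the group size of 5,000 are arbitrary choices), using a back-of-the-envelope expected t statistic.

```r
# Illustrative only: a 0.05 SD difference with 5,000 subjects per group gives an
# expected t of about 0.05 * sqrt(5000 / 2) = 2.5, i.e. a two-tailed p near 0.01,
# even though the effect itself is tiny.
n <- 5000
d <- 0.05
t_expected <- d * sqrt(n / 2)
2 * pt(t_expected, df = 2 * n - 2, lower.tail = FALSE)   # roughly 0.012
```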
Statement 8 relates to the Replication Fallacy. It had a deviation of 1.47 from the correct response. The statement is incorrect as the p-value is not a measure of experimental replicability.
Statement | Mean response (Likert scale) | Lower CI limit | Upper CI limit | Statement correct or incorrect | Deviation from correct answer (Likert scale) | Scale direction |
---|---|---|---|---|---|---|
1. It is possible to make both Type I and Type II error in a given study. | 3.53 | 3.22 | 3.84 | Incorrect | 1.47 | + |
2. Type I errors may be a concern when the null hypothesis is not rejected. | 3.91 | 3.62 | 4.20 | Incorrect | 1.09 | + |
3. A Type II error is impossible if the results are statistically significant. | 2.26 | 1.97 | 2.55 | Correct | 2.74 | - |
4. If a dozen different researchers investigated the same phenomenon using the same null hypothesis, and none of the studies yielded statistically significant results, this means that the effects being investigated were not noteworthy or important. | 3.90 | 3.70 | 4.10 | Incorrect | 1.10 | + |
5. Smaller p values provide direct evidence that study effects were larger. | 3.54 | 3.29 | 3.79 | Incorrect | 1.46 | + |
6. Finding that p < .05 is one indication that the results are important. | 3.24 | 3.67 | 4.11 | Incorrect | 1.76 | + |
7. Studies with non-significant results can still be very important. | 1.99 | 1.81 | 2.17 | Correct | 0.99 | + |
8. Smaller and smaller values for the calculated p indicate that the results are more likely to be replicated in future research. | 3.53 | 3.28 | 3.78 | Incorrect | 1.47 | + |
Table notes:
1. The deviation from the correct answer is calculated by assuming that, when the scale is positive, 5 (disagree) is the normatively correct response for incorrect statements; the mean response is then subtracted from 5. This is reversed for the negative scale. The same logic applies to statements that are correct, in which case a response of 1 (agree) is considered normatively correct.
2. + scale direction indicates that 1 = agree and 5 = disagree, - scale direction indicates that 1 = disagree and 5 = agree.
3. The mapping between our statement numbering and that in the survey instrument is as follows (our statement = instrument statement): 1 = 22, 2 = 17, 3 = 9, 4 = 14, 5 = 11, 6 = 6, 7 = 18, 8 = 8.
4. Reference "Uses and abuses of statistical significance tests
and other statistical resources: a comparative study", Hector Monterde-i-Bort, Dolores Frías-Navarro, Juan Pascual-Llobell, European Journal of Psychology of Education, 2010 [link]
There have also been more targeted studies of specific misinterpretations. In a 1979 study Michael Oakes found that psychology researchers overestimated how much the implied effect size changes when the significance threshold is changed from 0.05 to 0.01. Oakes asked 30 academic psychologists to answer the prompt below:
Suppose 250 psychologists and 250 psychiatrists are given a test of psychopathic tendencies. The resultant scores are analyzed by an independent means t-test which reveals that the psychologists are significantly more psychopathic than the psychiatrists at exactly the 0.05 level of significance (two-tailed). If the 500 scores were to be rank ordered, how many of the top 250 (the more psychopathic half) would you guess to have been generated by psychologists?
Presentation order | Estimate at the 0.05 level | Estimate at the 0.01 level |
---|---|---|
0.05 level presented first | 163 | 181 |
0.05 level presented second | 163 | 184 |
Table notes:
1. Sample sizes: academic psychologists first presented 0.05 and then asked to revise at 0.01 (n=30); academic psychologists first presented 0.01 and then asked to revise at 0.05 (n=30).
2. The standard deviation for all four groups was around 20, ranging from 18.7 to 21.2.
3. Reproduced from Michael Oakes, Statistical Inference, Epidemiology Resources Inc. [link]; the book was published in 1990, but the study was conducted in 1979.
The participants were then asked to revise their answer assuming the 0.05 level of significance was changed to 0.01. For a separate group of 30 academic psychologists the order of the prompts was reversed, with respondents asked first about the 0.01 level and then about the 0.05 level.
The results of the respondents’ answers are shown in the table at right (answers have been rounded). The first row shows responses from the group asked first to consider the 0.05 significance level while the second row shows responses from the group asked first to consider the 0.01 significance level.
The correct answer is that moving from a significance level of 0.05 to a level of 0.01 implies only about three additional psychologists appear in the top 250. However, the average answers for both groups show that the psychologists estimated around 20 additional psychologists would appear in the top 250. Oakes also calculated the median responses, which did not substantively change the results.
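The normative answer can be reconstructed along the following lines; the sketch assumes normally distributed scores with equal variances in the two groups and approximates the cut-off for the top half by the midpoint between the two group means, which may differ in detail from Oakes’ own calculation.

```r
# For an exact two-tailed p with 250 per group, back out the implied standardized
# difference d, then take the expected number of psychologists above the pooled
# midpoint as 250 * pnorm(d / 2).
n <- 250
expected_top_250 <- function(p) {
  t_val <- qt(1 - p / 2, df = 2 * n - 2)   # t statistic for an exact two-tailed p
  d     <- t_val * sqrt(2 / n)             # implied standardized mean difference
  n * pnorm(d / 2)
}

expected_top_250(0.05)   # about 134
expected_top_250(0.01)   # about 137, i.e. only about 3 more
```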
Oakes also tested academic psychologists’ understanding of p-values and replication by asking 54 of them to predict via intuition (or direct calculation if desired) the probability of replication under three different scenarios.
Suppose you are interested in training subjects on a task and predict an improvement over a previously determined control mean. Suppose the result is Z = 2.33 (p = 0.01, one-tailed), N=40. This experiment is theoretically important and you decide to repeat it with 20 new subjects. What do you think the probability is that these 20 subjects, taken by themselves, will yield a one-tailed significant result at the p < 0.05 level?
Suppose you are interested in training subjects on a task and predict an improvement over a previously determined control mean. Suppose the result is Z = 1.16 (p=0.12, one-tailed), N=20. This experiment is theoretically important and you decide to repeat it with 40 new subjects. What do you think the probability is that these 40 new subjects, taken by themselves, will yield a one-tailed significant result at the p < 0.05 level?
Suppose you are interested in training subjects on a task and predict an improvement over a previously determined control mean. Suppose the result is Z = 1.64 (p=0.05, one-tailed), N=20. This experiment is theoretically important and you decide to repeat it with 40 new subjects. What do you think the probability is that these 40 new subjects, taken by themselves, will yield a one-tailed significant result at the p < 0.01 level?
Scenario | Mean intuition of replicability | True replicability |
---|---|---|
1 | 80% | 50% |
2 | 29% | 50% |
3 | 75% | 50% |
Table notes:
1. Reproduced from Michael Oakes, Statistical Inference, Epidemiology Resources Inc. [link]; the book was published in 1990, but the study was conducted in 1979.
The results of Oakes’ test are presented in the table at right. Oakes designed the scenarios in a clever manner so that all three produce the same answer: the true replicability is always 50%. In all three cases the difference between the average intuition about the replicability of the scenario and the true replicability is substantial. As outlined above, Oakes argues that this difference is due to statistical power being underappreciated by psychologists, who instead rely on mistaken notions of replicability linked to the statistical significance of the p-value.
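The 50% figure can be reconstructed with a short calculation; the sketch below assumes the observed effect is treated as the true effect and uses a one-sided z test, which may differ in detail from Oakes’ own derivation.

```r
# Expected z in the replication scales with sqrt(n_new / n_orig); the replication
# probability is the chance a unit-variance normal centred there clears the
# critical value for the new alpha.
replication_prob <- function(z_orig, n_orig, n_new, alpha) {
  z_crit     <- qnorm(1 - alpha)
  z_expected <- z_orig * sqrt(n_new / n_orig)
  pnorm(z_expected - z_crit)
}

replication_prob(2.33, n_orig = 40, n_new = 20, alpha = 0.05)   # about 0.50
replication_prob(1.16, n_orig = 20, n_new = 40, alpha = 0.05)   # about 0.50
replication_prob(1.64, n_orig = 20, n_new = 40, alpha = 0.01)   # about 0.50
```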
During the 1991-1992 academic year researcher Augustias Vallecillos asked 436 university students across seven different academic specializations to respond to a simple NHST statement. This survey included 70 University of Granada students in the field of psychology. The results were written up in Vallecillos’ 1994 Spanish-language paper, “Estudio teorico-experimental de errores y concepciones sobre el contraste estadistico de hipotesis en estudiantes universitarios.” The results appeared again in his 2000 English-language article, “Understanding of the Logic of Hypothesis Testing Amongst University Students.” What is presented here is from his 2000 work. Both works appear to be based on Vallecillos’ doctoral thesis written in Spanish. We have not yet been able to obtain a copy of this thesis. Note that using online search it appears Augustias is also sometimes spelled Angustias.
Vallecillos’ statement was a short sentence asking about the ability of the NHST procedure to prove either the null or alternative hypotheses:
A statistical hypotheses test, when properly performed, establishes the truth of one of the two null or alternative hypotheses.
University speciality | Sample size | Correct answer | Incorrect answer |
---|---|---|---|
Psychology | 70 | 17% | 74% |
Table notes:
1. The exact number of respondents coded under each category were as follows: true - 52, false - 12, blank - 6 (8.6%).
2. Reference: "Understanding of the Logic of Hypothesis Testing Amongst University Students", Augustias Vallecillos, Journal für Mathematik-Didaktik, 2000 [link]
Students were asked to answer either “true” or “false” and to explain their answer, although an explanation was not required. The correct answer to the statement is false because NHST only measures the compatibility between observed data and a null hypothesis. It cannot prove the null hypothesis true. In addition, the alternative hypothesis is not explicitly considered in the NHST model nor is the compatibility of the null hypothesis considered relative to the alternative.
The quantitative results are shown at right. Correct and incorrect answers do not add up to 100% because some students left the response blank. Vallecillos includes the percentage of blank responses in his presentation, although we have omitted those figures from the table for clarity. It was not stated why blank responses were included in Vallecillos’ table instead of being treated as non-responses and omitted completely from the response calculation. It may be that some subjects did not give a true/false response but did give a written response; however, this is not explicitly stated.
Nearly three quarters of psychology students incorrectly answered the statement as true, the highest proportion of any of the seven university specialties surveyed. In addition, as described below, when providing written explanations justifying their responses few psychology students demonstrated correct reasoning.
Vallecillos coded the written explanation of student answers into one of six categories:
Correct argument (C) - These responses are considered to be completely correct.
Example response: “The hypotheses test is based on inferring properties from the population based on some sample data. The result means that one of the two hypotheses is accepted, but does not mean that it is true.”
Partially correct argument (PC) - These responses are considered to be partially, but not completely correct because “answers analysed include other considerations regarding the way of taking the decision, which are not always correct.”
Example response: “What it does establish is the acceptance or rejection of one of the two hypotheses.”
Mistaken argument that NHST establishes the truth of a hypothesis (M1) - These responses include explanations about why the initial statement proposed by Vallecillos was true.
Example response: “Because before posing the problem, we have to establish which one is the null hypothesis H0 and the alternative H1 and one of the two has to be true.”
Mistaken argument that hypothesis testing establishes the probability of the hypotheses (M2) - This argument is a case of the Inverse Probability Fallacy. As the results of other studies summarized in this article show, the fallacy is quite common among both students and professionals.
Example response: “What is established is the probability, with a margin of error, that one of the hypotheses is true.”
Other mistaken arguments (M3) - This category includes all other mistaken arguments that do not fall into either M1 or M2.
Example response: “What it establishes is the possibility that the answer formed is the correct one.”
Arguments that are difficult to interpret (DI) - These arguments were either not interpretable or did not address the subject’s reasoning behind answering the statement.
Example response: “The statistical hypotheses test is conditioned by the size of the sample and the level of significance.”
In total 59 of the 70 psychology students (84%) provided written explanations. Summary results are shown in the table below. Percentages are out of the number who gave written explanations, not out of the number that provided a response to the original statement. Just 9% of psychology students gave a correct written explanation and only 20% gave a partially correct explanation. At 51%, psychology students had the highest proportion of explanations falling into the M1 category, but the second lowest proportion of explanations falling into the M2 category (business students had the lowest proportion).
Vallecillos notes that when considering the full sample of all 436 students across all seven majors, 9.7% of those who correctly answered the statement also provided a correct written explanation and 31.9% of the students who correctly answered the statement gave a partially correct written explanation. This means that across the full sample about 60% of students who correctly answered the statement did so for incorrect reasons or were not able to clearly articulate their reasoning.
University speciality | Number of subjects who provided written explanations | C | PC | M1 | M2 | M3 | DI |
---|---|---|---|---|---|---|---|
Psychology | 59 | 9% | 20% | 51% | 10% | 7% | 3% |
Table notes:
1. Key: C - "Correct", PC - "Partially correct", M1 - "Mistake 1", M2 - "Mistake 2", M3 - "Mistake 3", DI - "Difficult to interpret". See full explanations in description above the table.
2. Percentages have been rounded for clarity and may not add to 100%.
3. The exact number of respondents coded under each category were as follows: C - 5, PC - 12, M1 - 30, M2 - 6, M3 - 4, DI - 2.
4. Reference: "Understanding of the Logic of Hypothesis Testing Amongst University Students", Augustias Vallecillos, Journal für Mathematik-Didaktik, 2000 [link]
In 1993 Miron Zuckerman, Holley Hodgins, Adam Zuckerman, and Robert Rosenthal surveyed 551 respondents, primarily academic psychologists along with a small number of psychology students, asking five statistical questions. The subjects were recruited from a wide range of subfields, drawn from authors who had published one or more articles in the following journals between 1989 and 1990: Developmental Psychology, Journal of Abnormal Psychology, Journal of Consulting and Clinical Psychology, Journal of Counseling Psychology, Journal of Educational Psychology, Journal of Experimental Psychology: General, Journal of Experimental Psychology: Human Perception and Performance, Journal of Experimental Psychology: Learning, Memory and Cognition, and Journal of Personality and Social Psychology. The authors note that, “The respondents included 17 students, 175 assistant professors, 134 associate professors, 182 full professors, and 43 holders of nonacademic jobs. The earliest year of Ph.D. was 1943 and the median was between 1980 and 1981.”
Four of the questions outlined in Zuckerman et al. (1993) are outside the scope of this article, but readers are encouraged to reference the full paper for details. One question in particular asked about effect size estimates within an NHST framework:
Lisa showed that females are more sensitive to auditory nonverbal cues than are males, t = 2.31, df = 88, p < .05. Karen attempted to replicate the same effect with visual cues but obtained only a t of 1.05, df = 18, p < .15 (the mean difference did favor the females). Karen concluded that visual cues produce smaller sex differences than do auditory cues. Do you agree with Karen’s reasoning?
The correct answer was “No.” Three choices “Yes,” “No,” and “It depends” were available for selection. The reason for an answer of “No” is that by using the provided data one can reverse engineer the effect sizes of the two studies, which leads to an equal effect for both. In addition, the correct method to compare the sensitivity of visual and auditory cues would be to compare them directly, not indirectly by looking at p-values. The direct calculation would produce an estimated mean difference between auditory and visual cues as well as a confidence interval for that mean. Additionally, the authors note that Karen’s study would have been stronger had she also replicated the auditory condition.
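As an illustration of this reverse engineering, the sketch below converts each reported t and df into an approximate Cohen's d using the standard conversion for two equal-sized independent groups, d = 2t/√df. This is one common conversion; Zuckerman et al.'s exact calculation is not reproduced here.

```python
# Illustrative only: one standard conversion from t and df to Cohen's d for
# an independent-groups design with equal group sizes, d = 2t / sqrt(df).
# This is not necessarily the exact calculation Zuckerman et al. performed.
import math

def cohens_d_from_t(t, df):
    """Approximate Cohen's d from a t-statistic with equal-sized groups."""
    return 2 * t / math.sqrt(df)

lisa_d = cohens_d_from_t(2.31, 88)   # auditory cues: d is roughly 0.49
karen_d = cohens_d_from_t(1.05, 18)  # visual cues: d is also roughly 0.49

print(f"Lisa:  d = {lisa_d:.2f}")
print(f"Karen: d = {karen_d:.2f}")
```

Both studies yield an approximate Cohen's d of 0.49, which is why the normatively correct answer was "No."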
This question had the highest accuracy of the five, with 90% of respondents answering correctly. However, the authors note that respondents may have had a bias toward answering “no,” since questions where the correct answer was “yes” had much lower proportions of correct responses. The evidence for such a bias is inconclusive, though, because the questions where the answer was “yes” might simply have been more difficult.
Lecoutre, Poitevineau, and Lecoutre (2003) surveyed 20 psychologists working at laboratories throughout France, specifically hoping to identify two common NHST misinterpretations (25 statisticians were also surveyed; these results are presented in our article on NHST and confidence interval misunderstandings of statistics researchers, which is coming soon).
The authors constructed a hypothetical scenario in which the efficacy of a drug is being tested by using two groups, one given the drug and one given a placebo. Each group had 15 participants, for a total of 30. The drug was to be considered clinically interesting by experts in the field if the unstandardized difference between the treatment mean and the placebo mean was more than 3. Four different scenarios were constructed crossing statistical significance with effect size (large/small). These are shown in the table below.
Situations three and four are considered by the authors to offer conflicting information: in one the result is nonsignificant but the effect size is large, while in the other the result is significant but the effect size is small. These two situations are meant to test two common NHST misinterpretations: interpreting a nonsignificant result as evidence of no effect — the Nullification Fallacy — and confusing statistical significance with scientific significance, the Clinical or Practical Significance Fallacy.
Only the t-statistic, p-value, and effect size were provided to subjects. The authors suggest that two metrics in particular are useful in determining the drug’s efficacy. The first is the 100(1 – α)% confidence interval. Because the t-statistic equals the effect size divided by the standard error, the standard error is D/t, and the standard 2σ rule gives an approximate 95% confidence interval of D ± 2(D/t). The second metric is the estimated sampling error, the square of the standard error, (D/t)^2. The authors note that for larger variances the estimated effect size is not very precise and no conclusion should be made about the drug’s efficacy. Although these two metrics were not shown to respondents, they are provided in the table below for completeness.
Situation | t-statistic | P-value | Effect size (D) | Estimated sampling error (D/t)^2 | Standard error (D/t) | 95% CI | Normative answer |
---|---|---|---|---|---|---|---|
1. Significant result, large effect size | 3.674 | 0.001 | 6.07 | 2.73 | 1.65 | 2.77 to 9.37 | Clinically interesting effect |
2. Nonsignificant result, small effect size | 0.683 | 0.5 | 1.52 | 4.95 | 2.23 | -2.93 to 5.97 | No firm conclusion |
3. Significant result, small effect size | 3.674 | 0.001 | 1.52 | 0.17 | 0.41 | 0.69 to 2.5 | No clinically interesting effect |
4. Nonsignificant result, large effect size | 0.683 | 0.5 | 6.07 | 78.98 | 8.89 | -11.7 to 23.84 | No firm conclusion |
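As a check, the short sketch below recomputes the two metrics described above for each situation from its effect size D and t-statistic.

```python
# Recompute the two metrics described above for each situation from its
# effect size D and t-statistic: standard error D/t, estimated sampling
# error (D/t)^2, and the approximate 95% CI from the 2-sigma rule, D +/- 2(D/t).
situations = {
    "1. Significant, large effect":    {"t": 3.674, "D": 6.07},
    "2. Nonsignificant, small effect": {"t": 0.683, "D": 1.52},
    "3. Significant, small effect":    {"t": 3.674, "D": 1.52},
    "4. Nonsignificant, large effect": {"t": 0.683, "D": 6.07},
}

for name, s in situations.items():
    se = s["D"] / s["t"]          # standard error
    sampling_error = se ** 2      # estimated sampling error
    lower, upper = s["D"] - 2 * se, s["D"] + 2 * se
    print(f"{name}: SE = {se:.2f}, (D/t)^2 = {sampling_error:.2f}, "
          f"95% CI = {lower:.2f} to {upper:.2f}")
```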
Subjects were asked the following three questions:
1. For each of the four situations, what conclusion would you draw for the efficacy of the drug? Justify your answer.
2. Initially, the experiment was planned with 30 subjects in each group and the results presented here are in fact intermediate results. What would be your prediction of the final results for D then t, for the conclusion about the efficacy of the drug?
3. From an economical viewpoint, it would of course be interesting to stop the experiment with only the first 15 subjects in each group. For which of the four situations would you make the decision to stop the experiment, and conclude?
Only the results for Question 1 are discussed here as they align with commonly documented NHST misinterpretations. For a discussion of Questions 2 and 3 please see Lecoutre, Poitevineau, and Lecoutre (2003). The results for Question 1 are shown in the table below. The three response categories below were coded by the authors based on interviews with subjects (we have made the category names slightly friendlier without changing their meaning). Green indicates the subject's response aligns with the authors' normatively correct response. Red indicates it does not align. The authors note that, “subjects were requested to respond in a spontaneous fashion, without making explicit calculations.”
All subjects gave responses that matched the normatively correct response to Situation 1. Looking at the confidence interval, values below the clinically important effect of 3 are still reasonably compatible with the data, meaning the true impact of the drug on the population may not be clinically meaningful. However, the results are promising enough that more research is likely warranted; that is, there is a “clinically interesting effect,” as the authors noted in their normatively correct response. Clinically interesting is not the same as effective, however, and it is unclear exactly what coding methodology the authors used to map subjects’ responses onto the figures shown in the table below.
All but three subjects responded incorrectly to Situation 2, indicating that the drug is ineffective. The confidence interval for Situation 2 included both 3, the clinically important value, as well as 0, indicating no effect at all. Therefore, the normatively correct response was “Do not know” since the true impact of the drug on the population could be either clinically important or completely ineffective (or even mildly harmful). The authors note that interview statements by subjects implying the drug is not effective demonstrate the Nullification Fallacy, a nonsignificant result should not be taken as evidence that the drug is ineffective.
Situation 3 was split between correct and incorrect responses, with just 40% responding correctly. Here the confidence interval does not include 0, but also does not include the clinically important effect of 3. Therefore, “No clinically interesting effect” was the normatively correct response selected by the authors. Again, “No clinically interesting effect” is not the same as ineffective and it is unclear why the authors seem to conflate the two. The authors note that statements implying the drug is effective are exhibiting the Clinical or Practical Significance Fallacy since Situation 3 had a small p-value (0.001), but clinically unimportant effect size.
Situation 4 had a higher correct response rate of 65%. Here the confidence interval is extremely large relative to the other situations, ranging from about -12 to 24. Again, “No firm conclusion” was normatively correct, so respondents coded into the “Do not know” category are considered to have given the correct response.
Some may object to this study because the authors discouraged any direct calculation; the results, therefore, are based on statistical intuition. However, the authors note that all subjects “perceived the task as routine for their professional activities,” suggesting statistical intuition is a key part of performing their jobs successfully. Likewise, no subjects raised concerns about the statistical setup of the four situations, for instance whether requirements such as normality or equality of variances were fulfilled.
Others may object that if a value of 3 was clinically important perhaps the correct approach would be to conduct a one-sided t-test and obtain a p-value representing whether the observed data were compatible with a hypothesis that the drug was equal to or greater than this level of effect. The authors likely chose their method because it allowed more explicit testing of the fallacies as they typically occur in practice. For instance, the Nullification Fallacy typically arises when the result is nonsignificant (the confidence interval covers zero).
Situation | The drug is effective | The drug is ineffective | Do not know |
---|---|---|---|
1. Significant result, large effect size | 100% | 0% | 0% |
2. Nonsignificant result, small effect size | 0% | 85% | 15% |
3. Significant result, small effect size | 45% | 40% | 15% |
4. Nonsignificant result, large effect size | 0% | 35% | 65% |
Table notes:
1. Green indicates the subject's response aligns with the authors' normative response. Red indicates it does not align.
In 2012 Jerry Lai, Fiona Fidler, and Geoff Cumming tested the Replication Fallacy by conducting three separate studies of psychology researchers. Subjects were authors of psychology articles in high-impact journals written between 2005 and 2007. Medical researchers and statisticians were also surveyed and those results will be covered in upcoming articles.
Subjects were provided with the p-value from an initial hypothetical experiment and given the task of intuiting a range of probable p-values that would be obtained if the experiment were repeated. Each of the three studies varied slightly in its design and task prompt. All subjects carried out the tasks via an email response to an initial email message.
The Replication Fallacy could be considered to have a strict and a loose form. In the strict form a researcher or student misinterprets the p-value itself as a replication probability. This misconception might be tested using, for instance, the survey instrument from the series of studies by Badenes-Ribera et al. As a reminder, the instrument asked subjects to respond true or false to the statement, “A later replication [of an initial experiment which obtained p = 0.001] would have a probability of 0.999 (1-0.001) of being significant.” The loose form of the fallacy is not tied so directly to the p-value itself. Even researchers and students who do not strictly believe the p-value is a replication probability might believe a small p-value is somehow indicative of strong replication properties, expecting repeated experiments of an initially statistically significant result to produce statistically significant results most of the time. It is this loose version that Lai et al. (2012) attempted to investigate.
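For intuition on the strict form, the quick calculation below shows that even under the favorable (and oversimplified) assumption that the observed effect equals the true effect, an initial two-tailed p = 0.001 implies a replication significance probability of roughly 0.91, not 0.999. This is an illustrative sketch, not a calculation from any of the reviewed studies.

```python
# Under the simple assumption that the observed effect equals the true effect,
# an initial two-tailed p = 0.001 (|z| ~ 3.29) implies roughly a 0.91 chance
# that an exact replication reaches p < .05, not the 0.999 suggested by the
# fallacy. Illustrative sketch only.
from scipy import stats

z_initial = stats.norm.isf(0.001 / 2)  # z implied by two-tailed p = .001 (~3.29)
z_crit = stats.norm.isf(0.05 / 2)      # two-tailed significance cutoff (~1.96)

# Replication z is modeled as Normal(z_initial, 1); add the (tiny) chance of
# significance in the opposite direction for completeness.
p_rep_significant = stats.norm.sf(z_crit - z_initial) + stats.norm.cdf(-z_crit - z_initial)
print(f"P(replication significant at .05) = {p_rep_significant:.3f}")  # ~0.91
```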
Each of the three tasks started with the same initial summary description below. The sample size, z-score, and p-value were sometimes modified.
Suppose you conduct a study to compare a Treatment and a Control group, each with size N = 40. A large sample test, based on the normal distribution, of the difference between the two means for your independent groups gives z = 2.33, p = .02 (two-tailed).
The specific task wording then followed.
All three studies attempted to obtain the subjects’ 80% p-value interval, that is, the range in which 80% of the p-values from repeated experiments would fall. These p-value intervals were then compared to the normatively correct intervals derived from the formulas presented in Cumming (2008). Simulation can also be used to help determine p-value intervals.
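As one example of the simulation approach, the sketch below assumes the observed effect equals the true effect, so replication z-values are approximately Normal(2.33, 1); the resulting 80% p-value interval is roughly 0.0003 to 0.3, matching the conservative interval cited later in this section. This is a simplified sketch, not the exact procedure from Cumming (2008).

```python
# Simulation sketch: assume the observed effect equals the true effect, so
# replication z-values are approximately Normal(2.33, 1). The middle 80% of
# the resulting two-tailed p-values is roughly 0.0003 to 0.3.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
z_initial = 2.33                                   # from the initial study (p = .02)
z_reps = rng.normal(loc=z_initial, scale=1.0, size=100_000)
p_reps = 2 * stats.norm.sf(np.abs(z_reps))         # two-tailed replication p-values

low, high = np.percentile(p_reps, [10, 90])
print(f"80% replication p-value interval: {low:.4f} to {high:.3f}")
```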
The tasks in Studies 2 and 3 were designed to accommodate findings from research on interval judgement. For example, some research suggests that prompting subjects for interval endpoints and then asking them what percentage of cases their interval covers may improve their estimates relative to prompts that ask for a pre-specified X% interval. For a full review of how the authors considered that literature please see Lai et al. (2012).
Study 1 elicited 71 usable responses. Subjects were shown the summary description above and then asked to carry out the following task. Email sends were randomized so that roughly half contained the z = 2.33 and p = 0.02 version of the summary description, and the other half contained a z = 1.44 and p = 0.15 version. Thus each subject responded to a single task with a specified p-value. A total of 37 responses were returned with the p = 0.02 version and 34 responses were returned with the p = 0.15 version.
The specific task wording was as follows:
Suppose you carry out the experiment a further 10 times, with everything identical but each time with new samples of the same size (each N = 40). Consider what p values you might obtain. Please enter 10 p values that, in your opinion, could plausibly be obtained in this series of replications. (Please, no calculations, and no debate about what model might be appropriate! We are interested in your guesstimate, your intuitions!)
The authors then extrapolated the 10 p-values into a distribution and took the 80% interval, noting that, “To analyze the results we assumed that underlying the 10 p values given by a respondent was an implicit subjective distribution of replication p.” The details of the extrapolation are included in the Appendix of Lai et al. (2012). Results are discussed further below.
In Study 2, instead of being asked for 10 separate p-values, subjects were asked directly for the upper and lower limits of their 80% p-value interval. Subjects were shown the summary description above and then asked to carry out the task, described below. P-value variations were crossed with sample size variations, for a total of four versions of the task. These variations applied to both the summary description and the task-specific language. All subjects were asked to respond to all four versions. The four combinations were: (1) a z = 2.33 and p = 0.02 version with a sample size of 40; (2) the same p-value as in Version 1, but with a sample size of 160; (3) a z = 1.44 and p = 0.15 version with a sample size of 40; (4) the same p-value as in Version 3, but with a sample size of 160. The number of respondents varied between 37 and 39 because not every respondent completed all four tasks as instructed.
The specific task wording was as follows:
Suppose you repeat the experiment, with everything identical but with new samples of the same size (each N = 40). Consider what p value you might obtain. Please estimate your 80% prediction interval for two-tailed p. In other words, choose a range so you guess there’s an 80% chance the next p value will fall inside this range, and a 20% chance it will be outside the range (i.e., a 10% chance it falls below the range, and 10% it falls above). (Please, no calculations, and no debate about what model might be appropriate! We are interested in your guesstimate, your intuitions!)
LOWER limit of my 80% prediction interval for p= [ ] Type a value less than .02. (You guess a 10% chance p will be less than this value, and 90% it will be greater.)
UPPER limit of my 80% prediction interval for p=[ ] Type a value more than .02. (You guess a 10% chance p will be greater than this value, and 90% it will be less.)
Results are discussed further below.
Study 3 elicited 62 usable responses. Subjects were shown the summary description above and then asked to carry out the following task which involved identifying p-value bounds as well as defining a percentage p-value interval those bounds cover. The methodology was similar to Study 1, except instead of randomizing email sends all subjects saw both the z = 2.33 and p = 0.02 and z = 1.44 and p = 0.15 versions. The sample size for both versions was 40. The task wording is shown below.
Suppose you repeat the experiment, with everything identical but with new samples of the same size (each N = 40). Consider what p value you might obtain. Please type your estimates for the statements below. (Please, no calculations, and no debate about what model might be appropriate! We are interested in your guesstimate, your intuitions!)
The replication study might reasonably find a p value as low as: p= [ ] Type a value less than .02
The replication study might reasonably find a p value as high as: p= [ ] Type a value more than .02
The chance the p value from the replication study will fall in the interval between my low and high p value estimates above is [ ] %.
A summary of the task for each study is shown in the table below.
Study | Number of respondents | Number of task versions presented to each respondent | P-value variations | Sample size variations | Task summary |
---|---|---|---|---|---|
Study 1 | 71 | 1 | p = 0.02 or p = 0.15 | N = 40 | Provide 10 p-values that you might obtain if the initial experiment were repeated 10 times. |
Study 2 | 37-39 | 4 | p = 0.02 and p = 0.15 | N = 40 and N = 160 | Provide a lower and upper p-value bound for an 80% p-value interval if the initial experiment were repeated. |
Study 3 | 62 | 2 | p = 0.02 and p = 0.15 | N = 40 | Provide the largest and smallest p-value you would expect if the initial experiment were repeated, then provide the chance that the p-value will fall within the bounds you gave. |
Table notes:
1. For Study 2 the number of respondents varied between 37 and 39 because not every respondent completed all four tasks as instructed.
2. Reference: "Subjective p Intervals: Researchers Underestimate the Variability of p Values Over Replication", Jerry Lai, Fiona Fidler, and Geoff Cumming, Methodology, 2012 [link]
A summary of the results across the three studies is shown in the chart at right. Here figures closer to zero are considered better as they represent estimates with less misestimation. The chart was reproduced using our standard tracing technique and therefore may contain minor inaccuracies, although we do not consider these to be materially relevant for the current discussion.
For reference, the authors note that if an initial experiment resulted in a p-value of 0.02, under conservative assumptions an 80% replication interval would be p = 0.0003 to p = 0.3.
For Study 1, both p-value task versions resulted in intervals roughly 40 percentage points too narrow. That is, instead of providing a set of p-values corresponding to an 80% p-value interval, respondents provided p-values corresponding to only about a 40% interval.
Subjects in Study 2 did slightly better, averaging an underestimate of around 27 percentage points for the N = 40 task version and between 35 and 40 percentage points for the N = 160 version.
Subjects in Study 3 did about the same as subjects in the N = 40 task version in Study 2.
Overall, it seems the authors’ attention to the interval judgement literature in Studies 2 and 3 may have reduced underestimation. However, respondents still substantially underestimated p-value variation over replications, providing approximately a 52% p-value interval on average rather than an 80% interval.
Data for the N = 40 condition were combined via a basic meta-analysis into an overall average. The precise combined sample size is not known because the authors report only that the number of Study 2 respondents ranged between 37 and 39, depending on the task version. We made a conservative estimate using a figure of 160, although the combined sample size can be no greater than 162. The meta-analysis resulted in an average underestimate of 32 percentage points, equivalent to a 48% p-value interval.
Not all figures were broken down by academic discipline. Comparing the three disciplines of psychology, medicine, and statistics, psychology researchers showed greater underestimation than statisticians but less than medical researchers. Considering all three disciplines together, 98% of respondents underestimated interval width in Study 1, 94% in Study 2, and 82% in Study 3.
Respondents were also invited to provide any comments on the task they received. The authors note that many subjects responded positively, with a substantial number noting that they found the task novel as it was uncommon in their day-to-day research.
In 2015 Anton Kühberger, Astrid Fritz, Eva Lermer, and Thomas Scherndl set out to test the susceptibility of psychology students to both the Effect Size Fallacy and the Nullification Fallacy, the interpretation of a nonsignificant result as evidence of no effect. The authors surveyed 133 students at the University of Salzburg in Austria who were enrolled in a basic statistics course.
Without being shown the results students were asked to estimate the sample and effect sizes of two real psychology studies: a study on the effect of temperature on perceived social distance — what the authors call the “thermometer study” — and a study on the influence of physical movement on cognitive processing, the “locomotion study”. The thermometer study was conducted by Hans IJzerman and Gün R. Semin and published in 2009 in Psychological Science under the title, “The thermometer of social relations. Mapping social proximity on temperature.” The locomotion study was conducted by Severine Koch, Rob W Holland, Maikel Hengstler, and Ad van Knippenberg; it was also published in 2009 in Psychological Science under the title, “Body Locomotion as Regulatory Process: Stepping Backward Enhances Cognitive Control.”
Overviews of the two studies are provided in the task descriptions below. Respondents were each shown both scenarios, but randomized into a version citing either a significant or nonsignificant result. The order in which they saw the scenarios was also randomized.
Both studies started with the following instructions.
Dear participant, thank you for taking part in our survey. You will see descriptions of two scientific research papers and we ask you to indicate your personal guess on several features of these studies (sample size, p-value, …). It is important that you give your personal and intuitive estimates.
You must not be shy in delivering your estimates, even if you are not sure at all. We are aware that this may be a difficult task for you – yet, please try.
The thermometer study description was as follows:
Task 1: The influence of warmth on social distance
In this study researchers investigated the influence of warmth on social distance. The hypothesis was that warmth leads to social closeness. There were two groups to investigate this hypothesis:
Participants of group 1 held a warm drink in their hand before filling in a questionnaire. Participants of group 2 held a cold drink in their hands before they filled in the same questionnaire. Participants were told to think about a known person and had to estimate their felt closeness to this person. They had to indicate closeness on a scale from 1–7, whereas 1 means ‘very close’ and 7 means ‘very distant’.
The closeness ratings of the participants of group 1 were then compared to the closeness ratings of group 2.
Researchers found a statistically significant [non-significant] effect in this study.
While the locomotion study gave this overview:
Task 2: The influence of body movement on information processing speed
Previous studies have shown that body movements can influence cognitive processes. For instance, it has been shown that movements like bending an arm for pulling an object nearer go along with diminished cognitive control. Likewise, participants showed more cognitive control during movements pushing away from the body. In this study, the influence of movement of the complete body (stepping forward vs. stepping backward) on speed of information processing was investigated.
The hypothesis was that stepping back leads to more cognitive control, i.e., more capacity. There were two conditions in this study: In the first condition participants were taking four steps forwards, and in the second condition participants were taking four steps backwards. Directly afterwards they worked on a test capturing attention in which their responses were measured in milliseconds. The mean reaction time of the stepping forward-condition was compared to the mean reaction time of the stepping backward-condition.
Researchers found a statistically significant [non-significant] effect in this study.
Due to data quality concerns, results from 126 participants are reported for the thermometer study, whereas the locomotion study has data from 133 participants. This data quality issue is why the total of 214 students cited in the “Methods” section of Kühberger et al. is inconsistent with the results presented subsequently by the authors. Further note that in the “Results” section of Kühberger et al. the authors mistakenly cite a figure of 127 participants for the thermometer study; all other figures in the paper support the 126 figure presented here (for example, adding the sample sizes in the data tables gives 126 participants). One additional error in the Table 3 crosstabulation was also found and is outlined below in the discussion of the Nullification Fallacy.
Results of student estimates are shown below alongside the results from the actual psychology studies whose descriptions students were given. The data suggest that the students indeed fell prey to the Effect Size Fallacy, consistently estimating larger effect sizes for the significant version than for the nonsignificant version, as measured by both the difference between Group 1 and Group 2 means (Diff. of means) and Cohen’s d.
The authors formally tested this difference between the significant and nonsignificant versions using the Mann–Whitney U-test. As a measure of effect size the Mann–Whitney test uses what is known as a “rank-biserial correlation,” abbreviated r. To calculate r, the Cohen’s d estimates from the significant and nonsignificant versions were compared pairwise. The proportion of pairs favorable to the null hypothesis — pairs in which Cohen’s d was larger for the nonsignificant version than for the significant version — was subtracted from the proportion of pairs unfavorable to the null hypothesis, that is, pairs in which Cohen’s d was larger for the significant version. Here the null hypothesis is that the significant version does not produce larger effect size estimates than the nonsignificant version. The same procedure was applied to the Diff. of means estimates.
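The sketch below illustrates this pairwise calculation of the rank-biserial correlation. The arrays are made-up placeholder values, not data from Kühberger et al.; the sign of r simply depends on which version is treated as the first group.

```python
# Pairwise sketch of the rank-biserial correlation described above. The
# arrays below are made-up placeholder values, not data from Kühberger et al.
import numpy as np

def rank_biserial(sig_estimates, nonsig_estimates):
    sig = np.asarray(sig_estimates, dtype=float)
    nonsig = np.asarray(nonsig_estimates, dtype=float)
    diffs = sig[:, None] - nonsig[None, :]       # all significant/nonsignificant pairs
    prop_sig_larger = np.mean(diffs > 0)         # pairs unfavorable to the null
    prop_nonsig_larger = np.mean(diffs < 0)      # pairs favorable to the null
    return prop_sig_larger - prop_nonsig_larger  # between -1 and 1; sign depends on group order

# Hypothetical Cohen's d estimates from the significant and nonsignificant versions
sig_d = [0.6, 0.8, 0.5, 0.7, 0.9]
nonsig_d = [0.3, 0.2, 0.4, 0.3, 0.5]
print(f"rank-biserial r = {rank_biserial(sig_d, nonsig_d):.2f}")
```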
After conducting this analysis the authors found that, for the thermometer study, a test of equality between the significant and nonsignificant versions produced z = -5.27 (p < .001) with r = -0.47 for Diff. of means and z = -3.88 (p < .001) with r = -0.34 for Cohen’s d. For the locomotion study the equivalent numbers were z = -2.48 (p = .013) with r = -0.21 for Diff. of means and z = -4.16 (p < .001) with r = -0.36 for Cohen’s d. These results suggest that the observed estimates are relatively incompatible with the hypothesis that students would estimate equal effect sizes in the significant and nonsignificant versions. The corresponding calculations for the difference in sample size estimates were z = -1.75 (p = .08) with r = -0.15 for the thermometer study and z = -0.90 (p = .37) with r = -0.08 for the locomotion study, suggesting that differences in sample size estimates between the significant and nonsignificant versions were perhaps due to simple random variation. These two results are in line with the authors’ hypothesis: while the Effect Size Fallacy would cause a noticeable association between significant results and larger effect size estimates, there is no reason to believe student estimates of sample size would vary between the significant and nonsignificant versions.
A less formal analysis paints a similar picture. For the thermometer study the estimated difference of means was two times larger in the significant version (2.0 vs. 1.0), as was Cohen’s d (0.6 vs. 0.3). The ratio of sample size estimates was smaller, 1.5. The pattern was even more pronounced for the locomotion study, with the estimated difference of means five times larger in the significant version (50 vs. 10) and Cohen’s d 3.5 times larger (0.7 vs. 0.2), while the sample size estimate was just 1.2 times larger (60 vs. 50).
Value | Thermometer: actual | Thermometer: median estimate (significant version) | Thermometer: median estimate (nonsignificant version) | Locomotion: actual | Locomotion: median estimate (significant version) | Locomotion: median estimate (nonsignificant version) |
---|---|---|---|---|---|---|
Sample size (n) | 33 | 76 | 50 | 38 | 60 | 50 |
Group 1 mean | 5.12 | 2.7 | 3.5 | 712 | 150 | 150 |
Group 2 mean | 4.13 | 4.05 | 4.0 | 676 | 120 | 118 |
Diff. of means | 0.99 | 2.0 | 1.0 | 36 | 50 | 10 |
Group 1 SD | 1.22 | 1.0 | 1.25 | 83 | 10 | 5 |
Group 2 SD | 1.41 | 8.0 | 10.0 | 95 | 8 | 5 |
Cohen's d | 0.78 | 0.6 | 0.3 | 0.79 | 0.7 | 0.2 |
Table notes:
1. Actual values are from the original studies; median estimates are from survey participants.
2. Sample sizes: Thermometer significant (n=53), Thermometer nonsignificant (n=73), locomotion significant (n=65), locomotion nonsignificant (n=68).
3. Units for the thermometer study were points on a 7-point Likert scale; units for the locomotion study were milliseconds.
4. Reference: "The significance fallacy in inferential statistics", Anton Kühberger, Astrid Fritz, Eva Lermer, & Thomas Scherndl, BMC Research Notes, 2015 [link]
To examine the Nullification Fallacy Cohen’s d was again used. The authors argued that the primary sign of the fallacy would be an estimate of zero difference between means for the nonsignificant version, and a nonzero estimate for the significant version. This reasoning follows from the fallacy itself: that a nonsignificant result is evidence of no difference between group means.
To investigate the presence of the fallacy the authors categorized the students’ Cohen’s d estimates into four ranges:
Large (d > 0.8)
Medium (0.5 < d < 0.8)
Small (0.3 < d < 0.5)
Very small (d < 0.3)
While only 6% of students in the thermometer study and 3% of students in locomotion study estimated effect sizes of exactly zero in the nonsignificant versions of the two experiments, a much greater number estimated what the authors termed a “very small” difference.
The proportion of students choosing a “very small” Cohen’s d was indeed much larger in the nonsignificant version of both studies. A “very small” difference was estimated about twice as often for the nonsignificant version of the thermometer study (59% versus 30%) and more than four times as often for the locomotion study (60% versus 13%).
These trends are readily visible in the chart at right, which uses data from Table 3 of Kühberger et al. Note that the table in the paper miscalculated percentages for the significant version of the thermometer study; we corrected these percentages when producing the chart.
In sum, these data do not necessarily support the Nullification Fallacy outright. They could be viewed instead as further evidence of the Effect Size Fallacy, that significant results are in general associated with larger effects than nonsignificant results. However, this interpretation is somewhat dependent on subject-level data: if subjects assumed equal sample sizes in the treatment and control groups, then smaller p-values would indeed be associated with larger effect sizes.
One might argue these results are also indicative of the Clinical or Practical Significance Fallacy, which describes the phenomenon of equating statistically nonsignificant results with practically unimportant results. However, to clearly delineate the extent of each of these fallacies among this population more research would be needed.
NHST meta-analysis
The 16 journal articles reviewed in the “Surveys of NHST knowledge” section used different methodologies to assess common NHST misinterpretations and overall NHST understanding of psychology researchers and students. Several of these studies cannot be coherently combined. However, others can be. In particular, a subset of 10 studies were identified that used survey instruments eliciting true or false responses, thereby consistently measuring rates of correct responses. These 10 studies represent nine different journal articles; one article surveyed respondents in two different countries and is counted twice for this reason.
Two primary quantitative assessments of NHST knowledge are available across these studies:
Measure 1: The percentage of respondents demonstrating at least one NHST misunderstanding (available in all 10 studies)
Measure 2: The average number of incorrect responses to the survey instrument (available in 5 studies)
Measure 1 was combined across studies using a simple weighted average based on sample size. Although the length and wording of the survey instruments varied, we ignore these factors: each study is weighted only by its sample size, regardless of the number of statements in its instrument. One could devise various weighting schemes to account for survey length, but we decided to keep things simple. We have made the underlying meta-analysis data available so that others may recalculate the results using their own methodologies.
Measure 1 should be considered a lower bound because for some survey instruments the percentage of incorrect responses across the entire survey was not available. For example, the set of studies by Badenes-Ribera et al. used a 10-statement instrument (although results from all 10 statements were not always reported). However, incorrect response rates were broken down by fallacy, so response rates from the 5-statement Inverse Probability Fallacy category were used as a proxy for the total misinterpretation rate. If the percentage of respondents demonstrating at least one NHST misinterpretation were calculated across the entire instrument it would no doubt be higher. The longest survey instrument used in this meta-analysis was therefore six statements in length. Methodological challenges notwithstanding, it seems reasonable to conclude that, regardless of education level, a clear understanding of NHST should enable one to correctly answer six NHST statements based on a simple hypothetical setup. Possible methodological challenges are discussed momentarily.
For Measure 2 the average number of incorrect responses was divided by the number of questions in the survey instrument to obtain a proportion of incorrect responses. A simple weighted average based on sample size was then applied.
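The sketch below shows the simple sample-size weighting used for both measures. The study entries are hypothetical placeholder values, not the actual inputs, which are available in the Excel file referenced in the “Additional Resources” section.

```python
# Sample-size-weighted average used for Measures 1 and 2. The study entries
# below are hypothetical placeholders, not the actual meta-analysis inputs,
# which are available in the Excel file in the "Additional Resources" section.
studies = [
    # (sample size, proportion of respondents with at least one misinterpretation)
    (50, 0.90),
    (120, 0.95),
    (80, 0.85),
]

total_n = sum(n for n, _ in studies)
weighted_rate = sum(n * rate for n, rate in studies) / total_n
print(f"Weighted average: {weighted_rate:.1%} across {total_n} respondents")
```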
There are limitations to our methodology. First, the survey instruments themselves can be criticized in various ways, including incorrect hypothetical setups, confusing statement language, and debates over the normatively correct response. The specific criticisms of each study were outlined in the “Surveys of NHST knowledge” section. Second, survey instruments were administered across multiple countries and it is unclear what effects translation or cultural differences may have had on subject responses. In total, studies were conducted in seven different countries. Third, the experience of students and researchers varied across studies. Four broad education categories were used to span the full breadth of respondent experience: undergraduates, master’s students, PhD students, and Post-PhD researchers. Details on the specific experience level of each study included in the meta-analysis are provided at the end of this section. Fourth, all studies used convenience sampling, thereby limiting the external validity of any meta-analysis. However, other forms of evidence corroborate the general problem of NHST misinterpretation and misuse. As outlined in the “Article summary and results” section, this evidence includes incorrect textbook definitions, audits of NHST usage in published academic articles, and calls for reform from NHST practitioners published in academic journals across nearly every academic discipline. These three types of evidence will be reviewed in full in upcoming articles.
Despite these challenges, Measures 1 and 2 appear to be fair measures of baseline NHST knowledge.
Turning to the results of the meta-analysis, across the 10 studies a total of 1,569 students and researchers were surveyed between 1986 and 2020. A weighted average across all 10 studies resulted in 90% of subjects demonstrating at least one NHST misinterpretation. Note that some studies surveyed multiple education levels. See the Excel in the “Additional Resources” section for the detailed calculations across studies.
A breakdown by education level is shown below. Post-PhD researchers had the lowest rate of NHST misinterpretations, but that rate was still 87%. Master’s students had the highest rate at 95%.
Education | Number of studies | Sample size | Percentage with at least one NHST misunderstanding (weighted average) |
---|---|---|---|
Undergraduates | 5 | 403 | 92% |
Master's | 2 | 284 | 95% |
PhD | 2 | 94 | 90% |
Post-PhD | 7 | 788 | 87% |
Total | 10 | 1,569 | 90% |
Table notes:
1. Calculated from 10 studies on NHST misinterpretations. See Excel in the Additional Resources section for details.
2. Some studies include surveys of multiple education levels.
A breakdown by country is shown below. Most countries contributed only a single study, although China contributed two and Spain three. Chile and Italy have by far the lowest rates of misinterpretation; however, it is unclear whether there is an underlying causal mechanism or whether this reflects random sampling variation.
Country | Number of studies | Sample size | Percentage with at least one NHST misunderstanding (weighted average) |
---|---|---|---|
U.S. | 1 | 70 | 96% |
Israel | 1 | 53 | 92% |
Germany | 1 | 113 | 91% |
Spain | 3 | 551 | 92% |
China | 2 | 618 | 94% |
Chile | 1 | 30 | 67% |
Italy | 1 | 134 | 60% |
Total | 10 | 1,569 | 90% |
Table notes:
1. Calculated from 10 studies on NHST misinterpretations. See Excel in the Additional Resources section for details.
Misinterpretation rates do not appear to have gone down over time. This can be seen in the chart below, which depicts the 10 studies included in the Measure 1 meta-analysis with publishing year on the x-axis. The y-axis shows the percentage of respondents endorsing at least one false statement, that is, having at least one NHST misinterpretation. The size of each bubble represents the sample size; the sample size key is in the bottom left. Between 2015 and 2020 four separate studies show misinterpretation rates equal to or near that of Oakes’ original 1986 work; all are above 90%.
The results of Measure 2 are shown below. Only half of the total 10 studies used for Measure 1 were available for Measure 2.
The average respondent missed nearly half the statements on their survey instrument. Post-PhD researchers fared best, missing on average two out of five statements. All other education levels were clustered around the 50% mark.
Education | Number of studies | Sample size | Average percentage of statements that respondents answered incorrectly |
---|---|---|---|
Undergraduates | 4 | 287 | 52% |
Master's | 2 | 284 | 48% |
PhD | 2 | 94 | 51% |
Post-PhD | 4 | 206 | 40% |
Total | 5 | 871 | 48% |
Table notes:
1. Calculated from 5 studies on NHST misinterpretations. See Excel in the Additional Resources section for details.
2. Some studies include surveys of multiple education levels.
Measure 2 is broken down by country below. Only China had more than a single study included in the Measure 2 analysis. Indeed, China drove much of the Measure 2 calculation due to its large sample size. Spain had the highest average percentage answered incorrectly, but note that this was a single-statement instrument and the sample size was relatively small compared to the German and Chinese studies.
Country | Number of studies | Sample size | Average percentage of statements that respondents answered incorrectly |
---|---|---|---|
U.S. | 1 | 70 | 42% |
Germany | 1 | 113 | 36% |
Spain | 1 | 70 | 74% |
China | 2 | 618 | 48% |
Total | 5 | 871 | 48% |
Table notes:
1. Calculated from 5 studies on NHST misinterpretations. See Excel in the Additional Resources section for details.
Intuitively, looking at the results of individual studies paints a picture of mass misinterpretation of NHST at all education levels. Using simple meta-analysis methods to combine data across studies confirms this, regardless of whether one uses the average number of incorrect responses (Measure 2) or the percentage of respondents with at least one incorrect response (Measure 1).
Details on the experience level of each study population are shown below.
Authors | Year | Country | Instrument length | Population details |
---|---|---|---|---|
Oakes | 1986 | U.S. | 6 | Book title: Statistical Inference [link] The subjects were academic psychologists. Oakes notes they were, "university lecturers, research fellows, or postgraduate students with at least two years' research experience." |
Falk & Greenbaum | 1995 | Israel | 5 | Article title: "Significance tests die hard: The amazing persistence of a probabilistic misconception" [link] The authors note that the psychology students attended Hebrew University of Jerusalem. The authors note that, "these students have had two courses in statistics and probability plus a course of Experimental Psychology in which they read Bakan's (1966) paper." Bakan (1966) warns readers against the Inverse Probability Fallacy. |
Vallecillos | 2000 | Spain | 1 | Article title: "Understanding of the Logic of Hypothesis Testing Amongst University Students" [link] The author notes that psychology students were selected because they have obtained, "...prior humanistic grounding during their secondary teaching..." |
Haller & Krauss | 2002 | Germany | 6 | Article title: "Misinterpretations of Significance: A Problem Students Share with Their Teachers?" [link] Subjects were surveyed in 6 different German universities. There were three groups. First, methodology instructors, which consisted of "university teachers who taught psychological methods including statistics and NHST to psychology students." Note that in Germany methodology instructors can consist of "scientific staff (including professors who work in the area of methodology and statistics), and some are advanced students who teach statistics to beginners (so called 'tutors')." The second group were scientific psychologists, which consisted of "professors and other scientific staff who are not involved in the teaching of statistics." The third group were psychology students. |
Badenes-Ribera et al. | 2015 | Spain | 5 | Article title: "Misinterpretations of p-values in psychology university students" The subject consisted of psychology students from the Universitat de Valencia who have already studied statistics. The mean age of the participants was 20.05 years (SD = 2.74). Men accounted for 20% and women 80%. |
Badenes-Ribera et al. | 2015 | Spain | 5 | Article title: "Interpretation of the p value: A national survey study in academic psychologists from Spain" [link] Academic psychologists from Spanish public universities. The mean number of years teaching and/or conducting research was 14.16 (SD = 9.39). |
Badenes-Ribera et al. | 2016 | Italy | 5 | Article title: "Misinterpretations Of P Values In Psychology University Students (Spanish language)" [link] Subjects were academic psychologists. The average years of teaching or conducting research was 13.28 years (SD = 10.52), 54% were women and 46% were men, 86% were from public universities, the remaining 14% were from private universities. |
Badenes-Ribera et al. | 2016 | Chile | 5 | Article title: "Misinterpretations Of P Values In Psychology University Students (Spanish language)" [link] Subjects were academic psychologists. The average years of teaching or conducting research was 15.53 years (SD = 8.69). Subjects were evenly split between women and men, 57% were from private universities, while 43% were from public universities. |
Lyu et al. | 2018 | China | 6 | Article title: "P-Value, Confidence Intervals, and Statistical Inference: A New Dataset of Misinterpretation" [link] The online survey "recruited participants through social media (include WeChat, Weibo, blogs etc.), without any monetary or other material payment...The paper-pen survey data were collected during the registration day of the 18th National Academic Congress of Psychology, Tianjin, China..." This resulted in four populations: psychology undergraduate students, psychology master's students, psychology PhD students, psychologists with a PhD. |
Lyu et al. | 2020 | Mainland China (83%), Overseas (17%) | 4 | Article title: "Beyond psychology: prevalence of p value and confidence interval misinterpretation across different fields" [link] Recruitment was done by placing advertisements on the following WeChat Public Accounts: The Intellectuals, Guoke Scientists, Capital for Statistics, Research Circle, 52brain, and Quantitative Sociology. This resulted in four populations: psychology undergraduate students, psychology master's students, psychology PhD students, psychologists with a PhD. All respondents were awarded their degree in China. |
Confidence Interval Review
Because of the difficulty in properly interpreting NHST, confidence intervals are sometimes suggested as a replacement or supplement (for example, see references 1, 2, 3, 10). Numerous arguments have been put forward about the advantages of confidence intervals. These arguments have included [3, 24]:
Confidence intervals redirect focus from p-values to effect sizes, typically the real substance of any decision-making criterion
Confidence intervals help quantify the precision of the effect size estimate, helping to take into account uncertainty
Confidence intervals can more fruitfully be combined into meta-analyses than effect sizes alone
Confidence intervals are better at predicting future experimental results than p-values
However, these advantages hold only if students and researchers can interpret confidence intervals correctly. This section reviews the evidence on confidence interval misinterpretation.
In 2014 Rink Hoekstra, Richard Morey, Jeffrey Rouder, and Eric-Jan Wagenmakers tested confidence interval misinterpretations among 120 Dutch researchers, 442 Dutch first-year university students, and 34 master’s students all in the field of psychology. The authors used a six-statement instrument adapted from Oakes (1987), inspired by the discussion in Gigerenzer (2004).
The authors note that the undergraduate students "were first-year psychology students attending an introductory statistics class at the University of Amsterdam." None of the students had previously taken a course in inferential statistics. The master's students "were completing a degree in psychology at the University of Amsterdam and, as such, had received a substantial amount of education on statistical inference in the previous 3 years." The researchers came from three universities: Groningen, Amsterdam, and Tilburg.
The instrument presented six statements and asked participants to mark each as true or false (all six were false). The six statements are shown below.
Professor Bumbledorf conducts an experiment, analyzes the data, and reports: “The 95% confidence interval for the mean ranges from 0.1 to 0.4!”
Please mark each of the statements below as “true” or “false”. False means that the statement does not follow logically from Bumbledorf’s result. Also note that all, several, or none of the statements may be correct:
1. The probability that the true mean is greater than 0 is at least 95%.
2. The probability that the true mean equals 0 is smaller than 5%.
3. The “null hypothesis” that the true mean equals 0 is likely to be incorrect.
4. There is a 95% probability that the true mean lies between 0.1 and 0.4.
5. We can be 95% confident that the true mean lies between 0.1 and 0.4.
6. If we were to repeat the experiment over and over, then 95% of the time the true mean falls between 0.1 and 0.4.
Statements 1, 2, 3, and 4 are incorrect because confidence intervals are not probability statements about either parameters or hypotheses. Statements 5 and 6 are incorrect because the “confidence” in confidence intervals is a property of the procedure used to calculate them, not of any particular interval: if we repeat the population sampling, data collection, and confidence interval calculation over and over, approximately 95% of the resulting intervals will contain the true population mean. Statement 6 is also incorrect for another reason: it implies the true mean might vary, falling between 0.1 and 0.4 in 95% of repetitions but within some other interval in the remaining 5%. However, the true population mean is a fixed number, not a random variable.
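The short simulation below illustrates the coverage property described above under simple assumptions (normally distributed data and a standard t-based interval): about 95% of intervals produced by the procedure contain the fixed true mean, while any single realized interval either contains it or does not.

```python
# Coverage simulation under simple assumptions (normal data, t-based interval):
# ~95% of intervals produced by the procedure contain the fixed true mean,
# though any single realized interval either contains it or does not.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_mean, sigma, n, reps = 0.25, 0.5, 30, 10_000

covered = 0
for _ in range(reps):
    sample = rng.normal(true_mean, sigma, size=n)
    se = sample.std(ddof=1) / np.sqrt(n)
    margin = stats.t.ppf(0.975, df=n - 1) * se
    covered += (sample.mean() - margin <= true_mean <= sample.mean() + margin)

print(f"Coverage over {reps:,} repetitions: {covered / reps:.3f}")  # ~0.95
```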
There are several critiques of Hoekstra et al., which are discussed momentarily.
Overall 98% of first-year students, 100% of master’s students, and 97% of researchers had at least one misunderstanding. A breakdown of the percentage of each group incorrectly asserting each of the six statements is shown below. The education level with the highest proportion of misinterpretation by statement is highlighted in red.
The least common misconception was the first statement, that the probability of the true mean being larger than 0 is at least 95%. The most common misconception was that the null hypothesis is likely to be false; this is similar to the NHST statement that a small p-value disproves the null hypothesis.
Statement summaries | First year undergraduates | Master's | Researchers |
---|---|---|---|
1. The probability that the true mean is greater than 0 is at least 95%. | 51% | 32% | 38% |
2. The probability that the true mean equals 0 is smaller than 5%. | 55% | 44% | 47% |
3. The “null hypothesis” that the true mean equals 0 is likely to be incorrect. | 73% | 68% | 86% |
4. There is a 95% probability that the true mean lies between 0.1 and 0.4. | 58% | 50% | 59% |
5. We can be 95% confident that the true mean lies between 0.1 and 0.4. | 49% | 50% | 55% |
6. If we were to repeat the experiment over and over, then 95% of the time the true mean falls between 0.1 and 0.4. | 66% | 79% | 58% |
Table notes:
1. Percentages are not meant to add to 100%.
2. Sample sizes: first-year students (n=442), master's students (n=34), researchers (n=120).
3. Reference: "Robust misinterpretation of confidence intervals", Rink Hoekstra, Richard Morey, Jeffrey Rouder, and Eric-Jan Wagenmakers, Psychonomic Bulletin & Review, 2014 [link]
The distribution of the number of statements endorsed by each group is shown below. The mean number of items endorsed for first-year students was 3.51, for master’s students was 3.24, and for researchers was 3.45.
In a critique of Hoekstra et al. (2014) Jeff Miller and Rolf Ulrich questioned whether respondents might be using a different definition of the word “probability” that was equally correct but judged wrong by Hoekstra and team, consequently leading to overstated misinterpretation rates, especially for Statements 4 and 5 [20]. Hoekstra and team responded in Morey et al. (2016), retorting that “probability” as used in Miller and Ulrich (2015) was not suitable for correct interpretation of confidence intervals and that at any rate the data from the original study did not support Miller and Ulrich’s thesis [21].
Setting aside the rebuttal in Morey et al. (2016), Lyu et al. (2018) replicated Hoekstra et al. (2014) in China with similar results, discussed below. Additionally, Lyu et al. (2020) administered a different confidence interval survey instrument, also to a group of Chinese psychology researchers, and again obtained similar results.
However, Miguel García-Pérez and Rocío Alcalá-Quintana provided a different set of critiques in their 2016 discussion and replication of Hoekstra et al. (2014). Their criticisms are numerous.
First, the authors argue that there are methodological flaws in a survey instrument in which it is compulsory to respond to every statement, all of which are incorrect. This does not allow for a respondent to indicate that they do not know the answer, instead forcing respondents into “misinterpretations.” A better design would have had equal numbers of correct and incorrect statements to distinguish true misinterpretations from random guessing. Likewise an “I do not know” option would avoid forcing respondents into providing responses to statements they admit they do not understand or interpret as unclear.
Second, statements which dichotomize responses into true/false mask researchers’ “natural verbalizations” of confidence interval definitions and meaning.
Third, García-Pérez and Alcalá-Quintana — in the spirit of Miller and Ulrich (2015) — outline various ways in which the six statements are unclear, could be interpreted as true, or are correct or incorrect for reasons outside of confidence interval interpretations.
To understand their criticism first consider the following analogy. Suppose there is a large bowl with 100 foam balls, 95 red and five white. Imagine you close your eyes, select a ball at random, and hold it in your hand, continuing to keep your eyes closed. Then the following statement is true, “The claim ‘the color of the ball in my hand is red’ is true with probability 0.95.” Notably, the statement is still true even though a particular ball has already been selected. Now imagine you open your eyes and peer down at your hand to inspect the ball’s actual color. You discover that the ball you selected is either red or white. The previous statement is no longer true. You cannot claim that, “There is a 95% probability that the color of the ball is red.” The ball is either red or it is not.
This analogy helps to separate the random generating process (selecting a ball with closed eyes) from the realization of a particular instance of that process (opening your eyes to discover the ball is red).
Now apply the same logic to the survey instrument in Hoekstra et al. (2014). García-Pérez and Alcalá-Quintana claim the following statement is true: “The claim ‘The true mean lies between 0.1 and 0.4’ is true with probability 0.95.” (Let’s call this Statement A.) We have in essence “selected from a bowl” one possible confidence interval which, like any other confidence interval generated from the data sampling procedure, has an a priori 0.95 probability of containing the true population mean. Conversely, the following statement is false: “There is a 95% probability that the true mean lies between 0.1 and 0.4.” (Let’s call this Statement B.) This is a realization of a particular confidence interval; it either contains the true mean or it does not.
The authors claim at once that the statistical difference between Statements A and B is unequivocal, but that Statement B could be true if reasonably interpreted as having the same meaning as Statement A. However, they reject the notion that Statement A could be interpreted as false, instead arguing it is unambiguously true. Readers can judge for themselves whether they feel the two statements have distinct truth values.
García-Pérez and Alcalá-Quintana argue that out of the six statements in Hoekstra et al. (2014), Statements 1, 2, 4, and 5 could all be reasonably interpreted as either true or false, but Hoekstra and team chose only the interpretation that would render them false when grading responses. Meanwhile, they argue that Statement 3 — “The null hypothesis that the true mean equals 0 is likely to be incorrect” — is problematic for reasons outside of any confidence interval interpretation. First, the word “likely” is ill-defined. Although it has mathematical meaning in some contexts (i.e. “likelihood”), its meaning here is not specified. This leaves the word, and therefore the statement, open to interpretation. What’s more, whatever treatment effect is present, it is almost certainly not exactly zero, even if the effect size is practically small. That is, the phrase “the true mean equals 0” must be false. But if that phrase is false then the statement that the true mean equals 0 “is likely to be incorrect” is true, not false as Hoekstra and team claimed. Statement 6 is also incorrect for reasons outside of confidence interval interpretations. As a reminder, Statement 6 was, “If we were to repeat the experiment over and over, then 95% of the time the true mean falls between 0.1 and 0.4.” The statement implies that the true mean varies, 95% of the time falling between 0.1 and 0.4 and 5% of the time falling within some other interval. However, the true population mean is a fixed number, not a random variable. This implies that marking the statement as true could be indicative of deeper misunderstandings about statistics unrelated to confidence intervals.
García-Pérez and Alcalá-Quintana set out to replicate Hoekstra et al. (2014) while remedying what they viewed as the study’s deficiencies. They made two primary modifications. Respondents were initially required to mark all statements as true or false, as in the original study. However, after subjects had marked the statements they were given a second pass at their responses to indicate items they would have left blank if allowed. The reasons for preferring to leave an item unanswered were not recorded, but could have included not knowing the answer, not understanding the statement, or an inability to resolve internal conflict within the statement (for example, if respondents identified that different interpretations of a single statement could lead to conflicting responses, as García-Pérez and Alcalá-Quintana had argued). The second modification was the addition of two statements the authors argued were nominally correct:
7. The claim “The true mean lies between 0.1 and 0.4” is true with probability 0.95.
8. The data are compatible with the notion that the true mean lies between 0.1 and 0.4.
The first statement was already discussed in depth. The second statement is another interpretation of confidence intervals as “compatibility intervals.” (See Amrhein, Trafimow, and Greenland (2019) for a discussion of why “compatibility intervals” should replace “confidence intervals” as the language of choice [27]). The authors acknowledge that one could quarrel over the Statement 8 language, but that nothing in it is incorrect, and therefore it should be considered true.
Respondents were 313 first-year university students enrolled in a psychology major at the Universidad Complutense de Madrid (UCM). They had no previous statistical experience that would prepare them for being able to correctly interpret the survey statements. The authors therefore treat these students as a kind of control group, their response patterns indicative of uninformed guessing or common sense intuition (to the extent it can help assess the statements). A second group of 158 master’s students, also psychology students at UCM, were sampled as well. They all had ample opportunity to become familiar with confidence intervals, including during required statistics classes. Unlike in Hoekstra et al. (2014) no researchers were included in the replication.
When considering just the original six statements, first-year students endorsed at rates similar to those in Hoekstra et al. (2014). This can be seen in the chart below, which shows 99% of first-year students in García-Pérez and Alcalá-Quintana (2016) and 98% in Hoekstra et al. (2014) answered at least one statement as true. The average number of statements marked “true” was higher in García-Pérez and Alcalá-Quintana (2016), 3.87 compared to 3.51. The authors considered the propensity of first-year students to endorse a mystery, writing that, “Why first-year students are more prone to endorsing than to not endorsing items in either study is unclear but, obviously, these results do not reflect misinterpretation of CIs but plain (and understandable) incognizance.” This is because first-year students have no relevant knowledge and so should be just as likely to select “false” as “true.”
Master’s students also had a similar proportion endorsing, 97% in García-Pérez and Alcalá-Quintana (2016) compared to 100% in Hoekstra et al. (2014). The average number of responses marked as “true” was substantially higher, 3.78 compared to 3.24 in Hoekstra et al. (2014).
Both student populations tended to endorse one or both of the two nominally correct statements. For master’s students the authors compared beta distributions of the propensity to endorse the two sets of statements — the original six and the two new — and found a higher probability of endorsing the two nominally correct statements.
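That beta-distribution comparison can be sketched as follows. This is a minimal illustration only: the endorsement tallies below are placeholders rather than the paper’s actual counts, and the Beta(1, 1) prior is our assumption, not necessarily the authors’ exact model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder endorsement tallies for illustration only (not the paper's data):
# number of "true" responses out of the total statement-responses in each set.
endorse_orig, total_orig = 590, 948  # original six statements
endorse_new, total_new = 250, 316    # two nominally correct statements

# A Beta(1, 1) prior updated with the endorsement counts gives a posterior
# distribution for each propensity to endorse.
p_orig = rng.beta(1 + endorse_orig, 1 + total_orig - endorse_orig, size=100_000)
p_new = rng.beta(1 + endorse_new, 1 + total_new - endorse_new, size=100_000)

# Posterior probability that the propensity to endorse the two nominally
# correct statements exceeds the propensity to endorse the original six.
print((p_new > p_orig).mean())
```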
When considering all eight statements, no student in either population answered Statements 1-6 as “false” while simultaneously marking both Statements 7 and 8 as “true”. It is not possible from the data provided to calculate the average number of incorrect responses for all eight statements.
Education | Proportion with at least one misunderstanding: García-Pérez & Alcalá-Quintana (2016) | Proportion with at least one misunderstanding: Hoekstra et al. (2014) | Average number of misinterpretations: García-Pérez & Alcalá-Quintana (2016) | Average number of misinterpretations: Hoekstra et al. (2014)
---|---|---|---|---
First-year undergraduates | 99% | 98% | 3.87 | 3.51
Master's | 97% | 100% | 3.78 | 3.24
Table notes:
1. Reference: "Robust misinterpretation of confidence intervals", Rink Hoekstra, Richard Morey, Jeffrey Rouder, and Eric-Jan Wagenmakers, Psychonomic Bulletin & Review, 2014 [link]
2. Reference: "The Interpretation of Scholars' Interpretations of Confidence Intervals: Criticism, Replication, and Extension of Hoekstra et al. (2014)", Miguel A. García-Pérez and Rocío Alcalá-Quintana, Frontiers in Psychology, 2016 [link]
The number of students endorsing the associated number of statements can be seen below, for both the original statements in Hoekstra et al. (2014) and the new statements.
The authors then examined data from the second pass in which master’s students indicated responses they would have chosen not to answer if given the choice, a declaration that they could not provide an informed response. First-year students were not allowed this opportunity as they had no previous statistics classes and were considered uninformed. Recall that the reasons for the “no response” preference were not requested by the authors. Although 158 master’s students were initially present for the first pass of the survey, in which statements were simply marked as true or false, several students left the classroom before being instructed to conduct the second pass. Therefore, informed responses are only available for 144 students.
Results from the second pass are shown at right, produced from García-Pérez and Alcalá-Quintana (2016) using our standard tracing technique. The height of the bars represents the percentage of respondents endorsing the associated statement. The percentage above the second pass bars represents the percentage of students indicating they provided an informed response to that statement.
Only 15 students (10%) declared they provided informed responses to all eight statements. This was lower than the proportion of students who declared they could not provide informed responses to any of the eight statements, 18 students (13%).
The proportion of students providing informed responses varied dramatically by statement. For Statement 3 just 40% of students indicated they provided an informed response. Meanwhile for Statement 5, twice as many — 80% — indicated they provided an informed response.
The proportion indicating they provided an informed response did not correlate with whether students actually answered the statement correctly. While 80% of students felt they provided an informed response to Statement 5, more than 90% endorsed the statement, despite it being nominally incorrect. Uniformly, students who indicated they provided informed responses were more likely to endorse the statements, regardless of their nominal truth or falsity.
In summary, the statement language of Hoekstra et al. (2014) and its replications — Lyu et al. (2018) and partially Lyu et al. (2020), both discussed below — can be construed as unclear by sufficiently sophisticated confidence interval users. While many students indicated they would rather have provided no response to certain statements, the reason behind this preference was not recorded, so it is unclear how much of the preference for omission was due to recognizing the dual readings of the statements. Notably, the misinterpretation rates of self-declared informed responses were actually higher than those of master’s students as a whole, so this explanation seems unlikely.
Curiously, the authors themselves seem at odds on the matter. While they argue that students and researchers in Hoekstra et al. (2014) may have interpreted the confidence interval statements in a manner that is nominally correct rather than nominally incorrect, in an introductory section of their article they state the opposite. Discussing an analogy for the difference between the confidence interval random generating process and a specific instance of that process — the difference at the heart of the correct/incorrect interpretation — the authors admit the distinction is unfamiliar to many professional psychologists, wondering aloud, “Why psychologists find trouble dealing with the analogous statement…is anybody's guess…” But if professional psychologists struggle with the distinction between correct and incorrect interpretations of confidence interval statements, what leads them to believe master’s students would be familiar with it? Yet this supposition was one stated motivation for their replication.
Nonetheless, García-Pérez and Alcalá-Quintana (2016) raises legitimate concerns about a number of studies reviewed in this article. More details about the implications were covered in the “Themes and criticism” section near the beginning of this article.
Lyu et al. (2018) translated the six-statement confidence interval instrument from Hoekstra et al. (2014) into Chinese and surveyed 347 Chinese students and researchers. Recall that Lyu et al. (2018) also surveyed the same population for NHST misinterpretations. Undergraduate and master’s students had roughly similar rates of misinterpretations for NHST and confidence intervals. PhD, postdocs, and experienced professors had generally fewer NHST misinterpretations than CI misinterpretations.
Compared to Hoekstra et al. (2014) Chinese undergraduate students had a higher proportion of misunderstanding for Statements 1, 3, 4, and 5; master’s had a higher proportion for Statements 1 and 2. The comparison of researchers is more complex and depends on whether you compare researchers in Hoekstra et al. (2014) against Chinese assistant professors or experienced professors.
Education | Average number of misinterpretations, Lyu et al. (2018) | Average number of misinterpretations, Hoekstra et al. (2014)
---|---|---
Undergraduate | 3.66 | 3.51
Master's | 2.89 | 3.24
PhD student | 3.51 | Not surveyed
Post-doc/assistant prof. | 3.13 | 3.45
Teaching/research for years | 3.50 |
Table notes:
1. Reference: "Robust misinterpretation of confidence intervals", Rink Hoekstra, Richard Morey, Jeffrey Rouder, and Eric-Jan Wagenmakers, Psychonomic Bulletin & Review, 2014 [link]
2. Data calculated from "P-Value, Confidence Intervals, and Statistical Inference: A New Dataset of Misinterpretation", Ziyang Lyu, Kaiping Peng, and Chuan-Peng Hu, Frontiers in Psychology, 2018, paper: [link], data: [link]
A comparison of the average number of misinterpretations between Lyu et al. (2018) and Hoekstra et al. (2014) is shown at right. The average for undergraduates was roughly comparable, although subjects in Lyu et al. (2018) did have a higher rate. Master’s students showed a larger difference, but in the opposite direction: Chinese subjects averaged 0.35 fewer misinterpretations (2.89 compared to 3.24). Hoekstra et al. (2014) did not break out researchers by experience, whereas Lyu et al. (2018) segmented researchers into less and more experienced. The average number of misinterpretations for the category of “Teaching/research for years” from Lyu et al. (2018) was quite close to the general researcher figure from Hoekstra et al., 3.50 and 3.45, respectively. However, some of the researchers from Hoekstra et al. may have been less experienced, and for that population Lyu et al. (2018) reported a lower average, 3.13.
The breakdown of incorrect responses by statement and education level is shown below. Data in the table below was calculated by us using open data from the authors. The education level with the highest proportion of misinterpretation for each statement is highlighted in red. Undergraduates fared worst: they had the highest proportion of misinterpretations for four of the six statements. This aligns with the table above showing they also had the highest average number of misinterpretations.
Statement summaries | Undergraduate | Master's | PhD | Postdoc and assistant prof. | Experienced professors |
---|---|---|---|---|---|
1. The probability that the true mean is greater than 0 is at least 95%. | 54% | 49% | 43% | 48% | 50% |
2. The probability that the true mean equals 0 is smaller than 5%. | 63% | 51% | 62% | 39% | 50% |
3. The “null hypothesis” that the true mean equals 0 is likely to be incorrect. | 79% | 49% | 72% | 61% | 62% |
4. There is a 95% probability that the true mean lies between 0.1 and 0.4. | 65% | 43% | 72% | 52% | 75% |
5. We can be 95% confident that the true mean lies between 0.1 and 0.4. | 71% | 40% | 57% | 61% | 62% |
6. If we were to repeat the experiment over and over, then 95% of the time the true mean falls between 0.1 and 0.4. | 34% | 59% | 45% | 52% | 50% |
Table notes:
1. The education level with the highest proportion of misinterpretation by statement is highlighted in red.
2. Percentages are not meant to add to 100%.
3. Sample sizes: Undergraduates (n=106), Master's (n=162), PhD (n=47), Postdoc or assistant prof (n=23), Experienced professor (n=8).
4. Data calculated from "P-Value, Confidence Intervals, and Statistical Inference: A New Dataset of Misinterpretation", Ziyang Lyu, Kaiping Peng, and Chuan-Peng Hu, Frontiers in Psychology, 2018, paper: [link], data: [link]
Lyu et al. (2018) also looked at misinterpretation by subfield. Data in the table below was calculated by us using open data from the authors. The percentage of each subfield incorrectly marking at least one statement as “true” is shown below. All subfields besides clinical and medical psychology had misinterpretation rates of at least 95%. Multiple sub-fields had all subjects mark at least one statement incorrectly. However, note the relatively small sample sizes for some sub-fields.
These misinterpretation rates are slightly higher than those found in Lyu et al. (2018) for NHST statements. The average number of misinterpretations was also high. Cognitive neuroscientists had the lowest average, but still 2.68 misinterpretations. Psychometric and psycho-statistics researchers had the highest average, a strikingly high 4.31.
Psychological sub-field | Sample size | Percentage with at least one CI misunderstanding | Average number of misinterpretations |
---|---|---|---|
Fundamental research & cognitive psychology | 74 | 96% | 3.18 |
Cognitive neuroscience | 121 | 95% | 2.68 |
Social & legal psychology | 51 | 100% | 3.41 |
Clinical & medical psychology | 19 | 84% | 3.58 |
Developmental & educational psychology | 30 | 100% | 3.77 |
Psychometric and psycho-statistics | 16 | 100% | 4.31 |
Neuroscience/neuroimaging | 9 | 100% | 3.67 |
Others | 17 | 100% | 4.0 |
Table notes:
Data calculated from "P-Value, Confidence Intervals, and Statistical Inference: A New Dataset of Misinterpretation", Ziyang Lyu, Kaiping Peng, and Chuan-Peng Hu, Frontiers in Psychology, 2018, paper: [link], data: [link]
Lyu et al. (2020) used a modified version of their four-statement NHST instrument adapted to test confidence interval knowledge. There were two versions, one with a statistically significant result and one without. The hypothetical summary sentence and four statements are shown below. The significant version of each statement reads as written, with the nonsignificant version’s wording appearing in parentheses, substituting for the text directly preceding it.
The 95% (bilateral) confidence interval of the mean difference between the experimental group and the control group ranges from .1 to .4 (–.1 to .4).
1. A 95% probability exists that the true mean lies between .1 (–.1) and .4.
2. If we were to repeat the experiment over and over, then 95% of the time the true mean falls between .1 (–.1) to .4.
3. If the null hypothesis is that no difference exists between the mean of experimental group and control group, then the experiment has disproved (proved) the null hypothesis.
4. The null hypothesis is that no difference exists between the means of the experimental and the control groups. If you decide to (not to) reject the null hypothesis, then the probability that you are making the wrong decision is 5%.
For Statements 1 and 3 there were more misinterpretations resulting from the nonsignificant version, whereas for Statements 2 and 4 more misinterpretations resulted from the significant version.
At 92%, psychology researchers had the fifth highest proportion of respondents with at least one confidence interval misinterpretation out of eight professions surveyed by Lyu et al. (2020). The percentage of incorrect responses to each statement broken down by the significant and nonsignificant versions is shown below. Statement 1 had one of the highest incorrect response rates for both the significant and nonsignificant versions among the eight fields surveyed, while Statement 4 had the highest incorrect response rate for the significant version across any field. Meanwhile, the significant version of Statement 3 had the lowest rate across any field. There was fairly wide variation between the significant and nonsignificant versions for Statement 3 and Statement 4, 15 percentage points and 17 percentage points, respectively.
Statement summaries | Significant version | Nonsignificant version |
---|---|---|
1. A 95% probability exists that the true mean lies between .1 (–.1) and .4. | 66% | 69% |
2. If we were to repeat the experiment over and over, then 95% of the time the true mean falls between 0.1 (-0.1) and 0.4. | 54% | 48% |
3. If the null hypothesis is that there is no difference between the mean of experimental group and control group, the experiment has disproved (proved) the null hypothesis. | 31% | 46% |
4. The null hypothesis is that there is no difference between the mean of experimental group and control group. If you decide to (not to) reject the null hypothesis, the probability that you are making the wrong decision is 5%. | 70% | 53% |
Table notes:
1. Percentages are not meant to add up to 100%.
2. Sample sizes: significant version (n=125), nonsignificant version (n=147).
3. Reference: "Beyond psychology: prevalence of p value and confidence interval misinterpretation across different fields", Xiao-Kang Lyu, Yuepei Xu, Xiao-Fan Zhao, Xi-Nian Zuo, and Chuan-Peng Hu, Frontiers in Psychology, 2020, paper: [link], data: [link]
Psychology researchers ranked fourth out of eight in terms of the average number of confidence interval misinterpretations, 1.82 out of a total of four possible. This was lower than the 1.94 average misinterpretations in the NHST instrument. The nonsignificant version had a slightly higher proportion of misinterpretations compared to the significant version, 1.84 compared to 1.79.
As in the NHST case, we used an independent means t-test to compare the average number of incorrect responses between the two test versions, which resulted in a p-value of 0.71 (95% CI: -0.22 to 0.33). This indicates that while particular statements appear to have been more challenging, there is little evidence that on the whole one test version was more or less difficult than the other.
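For readers who want to reproduce this kind of check against the authors’ open data, a minimal sketch is below. The arrays are placeholders standing in for each respondent’s count of incorrect responses (the real counts are in the published dataset), and the pooled-variance (equal-variance) form of the test is our assumption; the article does not specify which form was used.

```python
import numpy as np
from scipy import stats

# Placeholder arrays: each entry is one respondent's number of incorrect
# responses (0-4). Replace with the per-respondent counts from the open data.
sig_version = np.array([2, 1, 3, 2, 0, 4, 1, 2, 3, 2])
nonsig_version = np.array([2, 3, 1, 2, 2, 0, 3, 2, 1, 4])

# Two-sided independent-means t-test on the average number of incorrect responses.
t_stat, p_value = stats.ttest_ind(sig_version, nonsig_version, equal_var=True)

# 95% confidence interval for the difference in means (pooled-variance form).
n1, n2 = len(sig_version), len(nonsig_version)
diff = sig_version.mean() - nonsig_version.mean()
pooled_var = ((n1 - 1) * sig_version.var(ddof=1) +
              (n2 - 1) * nonsig_version.var(ddof=1)) / (n1 + n2 - 2)
se = np.sqrt(pooled_var * (1 / n1 + 1 / n2))
t_crit = stats.t.ppf(0.975, df=n1 + n2 - 2)
print(p_value, (diff - t_crit * se, diff + t_crit * se))
```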
A breakdown of confidence interval misunderstandings by education is shown below. All education levels had high misinterpretation rates of confidence intervals, ranging from 85% to 93%. PhD students fared best with the lowest rate of confidence interval misinterpretations (85%), but the group had the highest average number of misinterpretations with 2.1.
Education | Sample size | Percentage with at least one CI misunderstanding | Average number of CI misunderstandings (out of four) |
---|---|---|---|
Undergraduates | 67 | 93% | 1.8 |
Master's | 122 | 93% | 1.8 |
PhD | 47 | 85% | 2.1 |
Post-PhD | 36 | 92% | 1.7 |
Table notes:
1. Reference: "Beyond psychology: prevalence of p value and confidence interval misinterpretation across different fields", Xiao-Kang Lyu, Yuepei Xu, Xiao-Fan Zhao, Xi-Nian Zuo, and Chuan-Peng Hu, Frontiers in Psychology, 2020, paper: [link], data: [link]
In 2004 Geoff Cumming, Jennifer Williams, and Fiona Fidler surveyed 263 researchers including 89 in the field of psychology. Participants were recruited from authors of articles published between 1998 and 2002 in a selection of 20 high-impact psychology journals. Behavioral neuroscientists and medical researchers also participated; their results will be covered in other articles.
The authors sought to turn the usual framing of confidence intervals on its head. Instead of testing knowledge of the well-known property that a 95% confidence interval will contain the true population mean on average 95 times out of 100, the authors asked a different question: given that you have calculated a 95% confidence interval, what is the probability that estimated means from experimental replications will fall within that original interval? The answer is 83.4%. This figure reflects the two types of variation at play: variation of the original estimated mean around the true population mean, and variation of the estimated means across replications of the experiment.
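The 83.4% figure follows from a short calculation. Under a large-sample normal approximation (a simplifying assumption here, with the standard error treated as known), the original mean and a replication mean are independent draws from the same sampling distribution, so their difference has standard deviation sqrt(2) times the standard error. A minimal sketch:

```python
import numpy as np
from scipy.stats import norm

# The original mean M1 and a replication mean M2 are independent draws from
# N(mu, se^2), so M2 - M1 ~ N(0, 2 * se^2). M2 lands inside the original
# interval M1 +/- c * se exactly when |M2 - M1| < c * se.
def capture_probability(c):
    return 2 * norm.cdf(c / np.sqrt(2)) - 1

print(capture_probability(1.96))  # ~0.834: a 95% CI captures ~83.4% of replication means
print(capture_probability(1.0))   # ~0.52: +/- 1 SE bars capture ~52% of replication means
```

The second line anticipates the standard error version of the task described below.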
Participants were sent a link via email. Clicking on the link took respondents to an applet that contained a confidence interval labeled “Group 1” and a scale ranging from 0 to 1500. Respondents were randomized into one of two versions: a confidence interval version, in which the mean was centered at 750 and the width of each bar was 300, and a standard error version that was identical except the width of each bar was 150. The intervals were meant as estimates of the mean reaction time in milliseconds (ms).
Participants were instructed to click the blank space next to the confidence interval to place horizontal lines representing the estimated means of experimental replications. Once a participant clicked, a horizontal bar would appear. Respondents were allowed to delete lines if they wished. After 10 lines were placed the task concluded. The specific instructions shown to participants were as follows (with original bold text):
The large dot represents the mean of the sample of n = 36 participants. The bars show a 95% confidence interval around the mean. Of course, if the experiment were replicated, the mean obtained would almost certainly not be exactly the same as the mean shown. Please plot ten lines that you think could plausibly be a set of means for ten further independent samples of the same size taken from the same population.
On average respondents did not properly account for the variation in replications. For the confidence interval task 80% of respondents placed nine or 10 horizontal bars within the original interval. The normatively correct answer would have been to place eight bars within the interval, since on average 83% of future replication means are expected to fall within the original interval. For the standard error task 70% of respondents placed between six and 10 horizontal bars within the original interval. The normatively correct answer would have been to place five bars within the interval, since on average 52% of future replication means are expected to fall within the original standard error interval.
The authors suggest this is evidence of what they call the confidence-level misconception, which they describe as, “the belief that about C% of replication means will fall within an original C% CI.”
In 2005 Sarah Belia, Fiona Fidler, Jennifer Williams, and Geoff Cumming conducted a unique experiment to probe psychology researchers’ understanding of overlapping confidence intervals. A total of 162 psychology researchers participated, recruited from authors of articles published between 1998 and 2002 in a selection of 21 high-impact psychology journals. Behavioral neuroscientists and medical researchers also participated; their results will be covered in other articles.
Participants were recruited via an email which included a link to a web applet that walked respondents through one of three tasks. In the first task respondents were shown a chart containing a 95% independent mean confidence interval representing the mean reaction time of a group in milliseconds (ms). This Group 1 mean was fixed at 300ms. A second 95% independent mean confidence interval was shown for Group 2. However, by clicking on the chart respondents could move the Group 2 confidence interval. Respondents’ task was to position the Group 2 confidence interval such that the difference of means between the two groups would produce a p-value of 0.05. The fidelity of the movement of the Group 2 interval was 3ms with the chart scale stretching from 0ms to 1,000ms. To emphasize the groups were independent the sample size of Group 1 was set to 36, while for Group 2 it was set to 34.
The instructions accompanying the confidence interval task are shown below (with original bolded words). For a full image of the applet please see Belia et al. (2005).
Please imagine that you see the figure below published in a journal article.
Figure 1. Mean reaction time (ms) and 95% Confidence Intervals Group 1 (n=36) and Group 2 (n=34).
Please click a little above or below the mean on the right: You should see the mean move up or down to your click (the first time it may take a few seconds to respond). Please keep clicking to move this mean until you judge that the two means are just significantly different (by conventional t-test, two-tailed, p < .05). (I’m not asking for calculations, just your approximate eyeballing).
Because the authors had observed an anchoring effect in preliminary testing, the starting point of the Group 2 confidence interval was randomized such that for approximately half of participants it was initially placed at 800ms and for the other half at 300ms. Due to this anchoring the authors adjusted the final Group 2 positions in their reporting and analysis. Averaging the respondent placements of Group 2 under each initial position revealed a difference of 53ms. The authors halved this difference and subtracted it from the Group 2 placement of each respondent randomized into the 800ms initial position; the same distance was added for respondents randomized into the 300ms initial position.
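The adjustment itself is simple arithmetic. The sketch below uses made-up placements purely for illustration; in the study the observed gap between the two anchoring groups was 53ms, so 26.5ms was subtracted from or added to each respondent’s placement depending on their starting anchor.

```python
# Hypothetical final placements (ms), for illustration only.
from_800_anchor = [480, 520, 465, 510]  # respondents who started with Group 2 at 800 ms
from_300_anchor = [440, 455, 430, 460]  # respondents who started with Group 2 at 300 ms

mean_800 = sum(from_800_anchor) / len(from_800_anchor)
mean_300 = sum(from_300_anchor) / len(from_300_anchor)
half_gap = (mean_800 - mean_300) / 2  # half the anchoring gap (26.5 ms in the study)

# Shift each group's placements toward the common center before analysis.
adjusted = ([x - half_gap for x in from_800_anchor] +
            [x + half_gap for x in from_300_anchor])
print(adjusted)
```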
The second task was similar to the one described above, but involved standard error bars. The third task also involved standard error bars, but instead of the labels “Group 1” and “Group 2” the groups were formulated as a repeated measure with labels of “Pre Test” and “Post test.” Respondents were randomized into one of the three tasks.
Before reviewing the results, notice that moving the Group 2 mean closer to Group 1 will increase the p-value under the standard null hypothesis of no difference between group means. This is because the p-value is a measure of data compatibility with that hypothesis. If Group 2 were repositioned so that its mean perfectly aligned with the Group 1 mean of 300ms, the data would be highly compatible with the null hypothesis of no difference, which would be reflected in a large p-value. If Group 2 were moved very far from Group 1, the p-value would be small, because observing such a large difference is improbable under the null hypothesis of equal means.
The results of the confidence interval task and standard error task are shown in the chart below. The chart was reproduced using our standard tracing technique and therefore may contain minor inaccuracies, although we do not consider these to be materially relevant for the current discussion. The presentation of the two tasks is identical. A summary of results is provided after the chart. The repeated measures task is discussed afterward.
The exact Group 2 position of each respondent is not shown, instead results are aggregated into a histogram. The histogram represents the number of respondents that placed Group 2 within the corresponding bin. For example, in the standard error task there were nine respondents that placed Group 2 somewhere between 400ms and 450ms, thus the histogram has height nine for this bin. Note that although the authors report a sample size of 71 for the standard error task, the frequency histogram only sums to 70 participants.
Under the histogram the fixed placement of Group 1 is shown, as is the proper placement of Group 2 to produce a between-means p-value of 0.05. The grey interval below Group 1 and Group 2 represents the Group 2 placement that results in the Group 1 and Group 2 intervals just touching. Although this placement is not correct, in both tasks — and especially the standard error task — respondents tended to prefer it. This phenomenon can be observed by looking just above the histogram, where the average Group 2 placement of respondents is shown along with the actual p-value produced by this position. The grey band running the length of the chart corresponds to Group 2 placements which produce p-values between 0.025 and 0.10, the range the authors consider a reasonable approximation for respondents.
For the confidence interval task the correct position is to place the Group 2 confidence interval so that its mean is at 454ms. To properly position Group 2, respondents needed to recognize that some overlap of the confidence intervals is necessary. If two confidence intervals do not overlap then there is a statistically significant difference between their means. However, the converse is not true: two overlapping confidence intervals do not necessarily fail to produce a statistically significant difference. The authors note that a rule of thumb is, “CIs that overlap by one quarter of the average length of the two intervals yield p values very close to, or a little less than, .05.” This rule of thumb works well for sample sizes over 10 and for confidence intervals that aren’t too different in width (the wider of the two intervals cannot be more than a factor of two wider than the narrower).
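The rule of thumb can be checked numerically. The sketch below uses a normal approximation (recovering each standard error from the 95% half-width, rather than the exact t-based calculation behind the applet), which is close enough at these sample sizes to show why quarter-overlap lands near p = .05 while just-touching intervals do not.

```python
import numpy as np
from scipy.stats import norm

def p_value_from_cis(mean1, half1, mean2, half2, conf=0.95):
    """Approximate two-sided p-value for the difference of two independent means,
    given each mean and its confidence-interval half-width (normal approximation)."""
    z_crit = norm.ppf(0.5 + conf / 2)          # 1.96 for a 95% interval
    se1, se2 = half1 / z_crit, half2 / z_crit  # recover the standard errors
    z = abs(mean1 - mean2) / np.sqrt(se1**2 + se2**2)
    return 2 * (1 - norm.cdf(z))

# Two equal-width 95% CIs (half-width 100) overlapping by one quarter of their
# average length: the means sit 1.5 half-widths apart.
print(p_value_from_cis(300, 100, 450, 100))  # ~0.04, close to .05 as the rule of thumb says

# Intervals that just touch (means two half-widths apart) are well past p = .05.
print(p_value_from_cis(300, 100, 500, 100))  # ~0.006
```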
For the confidence interval task respondents were on average too strict, placing the Group 2 interval too far from Group 1. The average p-value produced was 0.017 rather than 0.05. It is perhaps worth noting, however, that the average is somewhat skewed, as a small number of respondents moved Group 2 quite far from Group 1. In fact, the 450ms to 500ms bin — which contained the correct Group 2 placement (454ms) — was the modal response, with one-quarter of respondents within this range. However, respondent-level data would be needed to understand where in this bin respondents placed Group 2. Overall respondents did indeed misplace Group 2. Using the more generous band of p-values between 0.025 and 0.10 did not alter this overall finding. The authors note that in total just 15% of respondents positioned Group 2 within the 0.025 to 0.10 p-value band judged as a reasonable placement by the authors.
Whereas in the confidence interval task respondents tended to place Group 2 too far from Group 1, in the standard error task they tended to place it too close. The correct Group 2 placement was at 614ms. However, the average placement of respondents was much closer to Group 1, producing a p-value of 0.158 rather than 0.05. Again, using the more generous band of p-values between 0.025 and 0.10 did not alter this overall finding. Respondents failed to recognize that a gap is necessary between two standard error bars for there to be a statistically significant difference. As for confidence intervals, the authors noted a rule of thumb, “SE bars that have a gap equal to the average of the two SEs yield p values close to .05.” The provisos here are the same as for the confidence interval rule of thumb. On average respondents tended to place the Group 2 standard error interval so that it was just touching the Group 1 interval. This can be seen in the chart below by noticing that the small yellow box above the histogram aligns closely with the grey interval under the chart. Overall accuracy for the standard error task was the same as for the confidence interval task: just 15% of respondents positioned Group 2 within the 0.025 to 0.10 p-value band judged as a reasonable placement by the authors.
The authors also investigated the unadjusted Group 2 placement and found that for the standard error task about 25% of respondents positioned Group 2 so that the error bars were just touching Group 1. The corresponding figure for confidence interval bars was about 23%.
As for the third task involving repeated measures, curiously there was not enough information to successfully complete it. This is because the provided error bars represented between-subject variation, but failed to account for within-subject variation. Therefore, the authors expected respondents to comment that more information about the design was necessary to successfully complete the task (e.g., “Is this a paired design?”). Note that during the second phase of the task respondents were shown a second screen with open-ended response options, so they did have the opportunity to provide comments. Just three of the 51 psychologists provided such feedback in the open-ended comments. However, it is unclear what should be made of this result, as perhaps the experimental setup itself confused respondents into attempting to complete an impossible task.
The authors conclude that their findings are indicative of four misconceptions. First, respondents have an overall poor understanding of how two independent samples interact to produce p-values. On average respondents placed Group 2 too far away in the confidence interval task, but too close in the standard error task. This conclusion must be discussed in its full context, however. More than half of respondents (63%) positioned Group 2 so that there was some overlap between the intervals. It is unclear to what extent this subgroup understood confidence interval interactions. Perhaps they intuited — or explicitly understood — that confidence intervals must overlap to produce the requested p-value of 0.05, but did not know the proper rules of thumb to place Group 2 as precisely as necessary. Still, this generous interpretation implies that nearly 40% of respondents in the confidence interval task did not understand key features of interval interaction. A similar case could be made for the standard error task, with respondents understanding there must be some distance between the two intervals, but not fully internalizing the correct rules of thumb.
Their second proposed misconception is that respondents didn’t adequately distinguish the properties of confidence intervals and standard errors. This may be true on average, but again the same arguments above apply. Having participants undertake both tasks would have given some sense of the within-subject discrimination abilities. A subject may understand simultaneously that Group 1 and Group 2 confidence intervals need some overlap and that standard error intervals need some gap, without fully recognizing the correct rules of thumb. This would demonstrate an appreciation for the distinction between the two intervals, although it could still lead to nominally incorrect answers (i.e. misplacing Group 2).
Their third proposed misconception is that respondents may use an incorrect rule that two error bars should just touch to produce the desired statistically significant result. This does seem to be true for some portion of respondents, as was discussed above. Whatever rules respondents used, the majority certainly do not appear to be familiar with those outlined by the authors (again, discussed above).
The fourth proposed misconception is that respondents do not properly appreciate the different types of variation within repeated measures. As discussed above, it is unclear whether this is a fair interpretation of results from the third task.
In 2010 Melissa Coulson, Michelle Healey, Fiona Fidler, and Geoff Cumming conducted a pair of studies with psychologists. In the first study 330 subjects from three separate academic disciplines were surveyed, two medical fields plus 102 psychologists who had authored recent articles in leading psychology journals.
The subjects were first shown a short background summary of two fictitious studies that evaluated the impact of a new treatment for insomnia:
Only two studies have evaluated the therapeutic effectiveness of a new treatment for insomnia. Both Simms (2003) and Collins (2003) used two independent, equal-sized groups and reported the difference between the means for the new treatment and current treatment.
Then subjects were shown, at random, one of two formats, either a confidence interval format or an NHST format. For each format there was a text version and a figure version. This resulted in subjects being shown one of four result summaries: either an NHST figure, which included a column chart of the average treatment effect and associated p-values; an NHST text version, which simply described the results; a confidence interval figure which provided the 95% confidence intervals of the two studies side-by-side; or a confidence interval text version, which simply described the results. Only one version was shown to each subject. The confidence interval text version is shown below. All four versions can be found in the original paper.
Simms (2003), with total N = 44, found the new treatment had a mean advantage over the current treatment of 3.61 (95% Confidence Interval: 0.61 to 6.61). The study by Collins (2003), with total N = 36, found the new treatment had a mean advantage of 2.23 (95% Confidence Interval: -1.41 to 5.87).
Subjects were first prompted to provide a freeform response to the question, “What do you feel is the main conclusion suggested by these studies?” Next three statements were presented regarding the extent to which the findings from the two studies agreed or disagreed. Subjects used a 1 to 7 Likert response scale to indicate their level of agreement where 1 equated to “strongly disagree” and 7 to “strongly agree.” The three statements were as follows:
Statement 1: The results of the two studies are broadly consistent.
Statement 2: There is reasonable evidence the new treatment is more effective.
Statement 3: There is conflicting evidence about the effectiveness of the new treatment.
The primary purpose of the study was to evaluate whether respondents viewed the two studies as contradictory, since one confidence interval covers zero while the other does not. These results are a form of dichotomization of evidence and are therefore discussed in that section of this article.
However, one finding is relevant to the misinterpretation of confidence intervals. Unfortunately, this finding was not segmented out by academic discipline, so it is impossible to know the rate for psychology researchers alone. The authors note that 64 of the 145 subjects (44%) answering one of the two confidence interval result summaries made mention of “p-values, significance, a null hypothesis, or whether or not a CI includes zero.” This can be considered a type of confidence interval misinterpretation as NHST is distinct from confidence intervals. NHST produces a p-value, an evidentiary measure of data compatibility with a null hypothesis, while confidence intervals produce a treatment effect estimation range.
In a separate study 50 academic psychologists from psychology departments in Australian universities were presented with only the confidence interval figure result summary and asked the same four questions. In addition, for Statements 1 to 3 the subjects were also prompted to give freeform text responses. The freeform responses from Statements 1 and 3 were analyzed by the authors. Together there were 96 total responses between the two questions; each of the 50 subjects had the chance to respond to both statements, which would have given a total of 100 responses, but a few psychologists abstained from providing written responses. In 26 of 96 cases (27%) there was mention of NHST elements, despite the survey instrument presenting only a confidence interval scenario.
In 2018 Pav Kalinowski, Jerry Lai, and Geoff Cumming investigated what they call confidence interval “subjective likelihood distributions” (SLDs), the mental model one uses to assess the likelihood of a mean within a confidence interval. As an example, one possible, but incorrect, distribution is the belief that the mean is equally likely to fall along any point within a confidence interval.
A total of 101 students participated in a set of three tasks. Although the academic disciplines of the students varied, the research is appearing in this article as two thirds (66%) of the students self-identified as psychology students. The remaining disciplines were social science (13%), neuroscience (6%), medicine (5%), and not identified (10%). The authors note that, “Most students (63%) were enrolled in a post graduate program, and the remaining students where completing their honors (fourth year undergraduate).”
Shape | Percentage of students "drawing" this shape: 95% CI | Percentage of students "drawing" this shape: 50% CI
---|---|---
Correct | 15% | 17%
Bell | 12% | 18%
Triangle | 4% | 7%
Half circle | 10% | 5%
Mesa | 16% | 12%
Square | 19% | 13%
Other | 25% | 36%
Table notes:
1. Percentages for the 50% confidence interval do not add to 100% as the Triangle shape was practically indistinguishable from the correct shape. For this reason students whose response formed a triangle distribution are counted twice, once in the Correct category and once in Triangle.
2. Reference: "A Cross-Sectional Analysis of Students’ Intuitions When Interpreting CIs", Pav Kalinowski, Jerry Lai, and Geoff Cumming, Frontiers in Psychology, 2018 [link]
In Task 1 students saw a 95% confidence interval and a set of nine markers. Five of the markers were within the confidence interval, while four were outside of it. Students had to rank the likelihood that each point would land on the true mean. A 19-point scale was used for the ranking. Example values were, (1) “More likely [to] land on [the mean],” (3) “About equally likely [to] land on [the mean],” (5) “Very slightly less likely to be [the mean],” and (19) “Almost zero likelihood.” This procedure allowed the authors to construct each student’s SLD, which was then judged to be correct if 97% or more of its variance was explained by the normatively correct distribution. If the student’s SLD was incorrect it was categorized into one of six other distributions. The procedure was repeated with a 50% confidence interval to examine performance on intervals of different widths.
The results are shown at right. Note that for the 50% interval the correct shape could not be distinguished from students whose SLD was a triangle shape. Therefore, these respondents are counted twice, once in “Correct” and once in “Triangle.”
Please see the original article for a screenshot of Task 1, example student SLDs, and the shape classification rules of the authors.
In Task 2 students had to choose one of six shapes that best represented their SLD. Please see the original article for a screenshot of the shapes provided. In total, 61% of students selected the correct shape, a normal distribution.
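To make the normatively correct shape concrete, the sketch below assumes the large-sample, cat’s-eye view in which the relative plausibility of candidate values for the mean falls off as a normal curve centered on the sample mean, with standard deviation equal to the 95% half-width divided by 1.96. This is our simplified rendering of the idea, not the authors’ exact scoring procedure.

```python
import numpy as np
from scipy.stats import norm

# A 95% CI of half-width 1 around a sample mean of 0 (illustrative numbers).
mean, half_width = 0.0, 1.0
sd = half_width / norm.ppf(0.975)  # recover the standard error: ~0.51

# Relative plausibility of candidate values, some inside and some outside the CI,
# scaled so the sample mean itself has plausibility 1.
points = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
rel_plausibility = norm.pdf(points, loc=mean, scale=sd) / norm.pdf(mean, loc=mean, scale=sd)
print(rel_plausibility.round(4))  # ~[1.0, 0.62, 0.15, 0.013, 0.0005] -- far from uniform
```

The steep taper is why treating all points inside (or outside) the interval as equally likely counts as a misconception.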
In Task 3, students were shown a confidence interval. The task had two questions. In the first question a 95% confidence interval was presented and using a slider students were asked to select two points on the interval that corresponded to an 80% and 50% interval, respectively. In the second question they were shown a 50% interval, and had to select two points that would correspond to an 80% and 95% interval, respectively.
Most students — 75% — got the direction of the relationship correct, for instance understanding that 50% intervals are narrower than 95% intervals. However, 25% misunderstood the relationship between intervals of different percentages, for example believing that 95% confidence intervals are narrower than 50% intervals. When starting with a 95% confidence interval students overestimated the needed width: on average students attempting to mark an 80% interval instead marked an 86% interval, and when attempting to mark a 50% interval they instead marked a 63% interval. When starting with a 50% confidence interval students underestimated the needed width: on average students attempting to mark an 80% interval instead marked a 79% interval, and when attempting to mark a 95% interval they instead marked a 92% interval.
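The normatively correct conversions follow directly from normal quantiles (assuming, as a simplification, a symmetric normal-based interval): half-widths scale with the corresponding z value, as sketched below.

```python
from scipy.stats import norm

# Central normal-interval half-widths scale with the corresponding z quantile.
z95 = norm.ppf(0.975)  # ~1.96
z80 = norm.ppf(0.90)   # ~1.28
z50 = norm.ppf(0.75)   # ~0.67

print(z80 / z95)  # ~0.65: an 80% CI is about two-thirds the width of a 95% CI
print(z50 / z95)  # ~0.34: a 50% CI is about one-third the width of a 95% CI
print(z95 / z50)  # ~2.9: a 95% CI is nearly triple, not double, the width of a 50% CI
```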
Across the three tasks 74% of students gave at least one answer that was normatively incorrect.
After Task 3 an open-ended response option was presented to participants.
A combination of the three tasks plus the open-ended response option surfaced four confidence interval misconceptions that have rarely been documented in the literature:
All points inside a confidence interval are equally likely to land on the true population mean
All points outside a confidence interval are equally unlikely to land on the true population mean
50% confidence intervals and 95% confidence intervals have the same distribution (in terms of the likelihood of each point in the interval to land on the true population mean)
A 95% confidence interval is roughly double the width of a 50% confidence interval
In Task 4, 24 students agreed to interviews. The full results of Task 4 are difficult to excerpt and readers are encouraged to review the original article. Two primary findings are worth noting. First, after coding student responses into 17 different confidence interval misconceptions, the authors note that, “Overall every participant held at least one CI misconception, with a mean of 4.6 misconceptions per participant.” Second, the authors found that cat’s eye diagrams may be effective at remedying some misconceptions and helping to reinforce correct concepts.
Confidence Interval meta-analysis
Four studies reviewed in the “Confidence Interval Review” section used similar methodologies to assess common confidence interval misinterpretations and overall understanding of psychology students and researchers.
Two primary quantitative assessments of confidence interval knowledge are available across these studies:
Measure 1: The percentage of respondents demonstrating at least one confidence interval misunderstanding
Measure 2: The average number of incorrect responses to the survey instrument
Both measures are available in all four studies. Measure 1 was combined across studies using a simple weighted average based on sample size. For Measure 2 the average number of incorrect responses was divided by the number of statements in the survey instrument to obtain a proportion of incorrect responses. A simple weighted average based on sample size was then applied.
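A minimal sketch of the two pooling rules, using made-up study summaries rather than the actual figures from the reviewed papers (those are in the accompanying Excel file):

```python
# Each tuple: (sample size, share with >=1 misinterpretation,
#              mean number incorrect, number of statements in the instrument).
# The values are placeholders for illustration only.
studies = [
    (442, 0.98, 3.5, 6),
    (347, 0.96, 3.3, 6),
    (272, 0.92, 1.8, 4),
]

n_total = sum(n for n, _, _, _ in studies)

# Measure 1: sample-size-weighted average of the "at least one misinterpretation" shares.
measure_1 = sum(n * share for n, share, _, _ in studies) / n_total

# Measure 2: convert each study's mean incorrect count to a proportion of its
# instrument length, then take the same sample-size-weighted average.
measure_2 = sum(n * (mean_wrong / k) for n, _, mean_wrong, k in studies) / n_total

print(round(measure_1, 3), round(measure_2, 3))
```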
See the “NHST meta-analysis” section for a discussion of the methodological critiques of these two measures.
Across four studies focusing on confidence interval misinterpretations a total of 1,683 students and researchers were surveyed between 2014 and 2019. A weighted average across all four studies and all populations resulted in a misinterpretation rate of 97% of subjects demonstrating at least one confidence interval misinterpretation. See the Excel file in the “Additional resources” section for the detailed calculations across studies.
A breakdown of confidence interval misinterpretation rates by education level is shown below. All education levels have similar misinterpretation rates. When looking across studies weighted average confidence interval misinterpretation rates are equal to or higher than weighted average NHST misinterpretation rates across all education levels.
Education | Number of studies | Sample size | Percentage with at least one confidence interval misunderstanding |
---|---|---|---|
Undergraduates | 4 | 928 | 98% |
Master's | 4 | 476 | 96% |
PhD | 3 | 212 | 95% |
Post-PhD | 2 | 67 | 96% |
Total | 4 | 1,683 | 97% |
Table notes:
1. Calculated from four studies on confidence interval misinterpretations. See Excel in the Additional Resources section for details.
Only three countries are represented in this analysis: the Netherlands, China, and Spain. All had high rates of misinterpretations.
Country | Number of studies | Sample size | Percentage with at least one confidence interval misunderstanding |
---|---|---|---|
China | 2 | 618 | 94% |
The Netherlands | 1 | 594 | 98%
Spain | 1 | 471 | 99% |
Total | 4 | 1,683 | 97%
Table notes:
1. Calculated from four studies on confidence interval misinterpretations. See Excel in the Additional Resources section for details.
Turning to Measure 2, the average percentage of statements answered incorrectly was high across all education levels, with every level averaging incorrect responses on more than half the statements in the survey instrument. For all education levels these averages were higher than for the NHST survey instruments.
Education | Number of studies | Sample size | Average percentage of statements that respondents answered incorrectly |
---|---|---|---|
Undergraduates | 4 | 928 | 60% |
Master's | 4 | 476 | 52% |
PhD | 2 | 94 | 56% |
Post-PhD | 3 | 185 | 54% |
Total | 4 | 1,683 | 57% |
Table notes:
1. Calculated from four studies on confidence interval misinterpretations. See Excel in the Additional Resources section for details.
2. Some studies include surveys of multiple education levels.
Breaking things down by country, Chinese subjects averaged 50%, Dutch subjects a slightly higher 58%, and Spanish subjects the highest rate at 64%.
Country | Number of studies | Sample size | Average percentage of statements that respondents answered incorrectly |
---|---|---|---|
China | 2 | 618 | 50% |
The Netherlands | 1 | 594 | 58%
Spain | 1 | 471 | 64% |
Total | 4 | 1,683 | 57% |
Table notes:
1. Calculated from four studies on confidence interval misinterpretations. See Excel in the Additional Resources section for details.
There is disagreement about the role of confidence intervals as a replacement for, or augmentation of, NHST, as is often suggested (for example, see references 1, 2, 3, and 10). Consider the following two misinterpretations, one related to confidence intervals and the other to NHST:
CI: There is a 95% probability that the true mean lies between 0.1 and 0.4 (from Hoekstra et. al., 2014)
NHST: You have found the probability of the null hypothesis being true [when you calculate a p-value] (from Oakes, 1986).
While both statements are incorrect, confidence intervals have the benefit of including the precision of the point estimate. Therefore, wrongly concluding a confidence interval is a probability statement still keeps focus on the range of parameter estimates reasonably compatible with the observed data, and by doing so forces an analyst to acknowledge values both higher and lower than the point estimate when considering a particular course of action. Keeping focus on effect sizes and precision is important because these factors are the real meat of any decision making criteria. Plus, confidence intervals from different studies are amenable to synthesis via meta-analysis [28]. And confidence intervals are also better at predicting future results than p-values [29].
Confidence intervals can fall prey to dichotomized thinking, however, which results in a similar kind of uselessness as NHST statistical significance. Indeed, in two studies using the six-statement confidence interval instrument from Hoekstra et al. (2014), Statement 3 — “The ‘null hypothesis’ that the true mean equals 0 is likely to be incorrect” — resulted in the highest rates of incorrect endorsement. Most researchers misunderstand the properties of confidence interval precision as well; for instance, narrow confidence intervals do not necessarily imply more precise estimates. And as the article summaries and this meta-analysis demonstrate, it can be difficult for researchers to truly understand the information contained in confidence intervals and the resulting interpretation. For these reasons some have argued that confidence intervals are deceptively hard to use correctly and therefore the costs still outweigh the benefits [30].
The debate about how confidence intervals should be incorporated into the analysis and presentation of scientific results is not likely to end soon.
Details of the experience level of each population are shown below.
Authors | Year | Country | Instrument length | Population details |
---|---|---|---|---|
Hoekstra et al. | 2014 | The Netherlands | 6 | Article title: "Robust misinterpretation of confidence intervals" [link] This study included bachelor's and master's students as well as researchers. The authors note that the bachelor students "were first-year psychology students attending an introductory statistics class at the University of Amsterdam." None of the students had previously taken a course in inferential statistics. The master's students "were completing a degree in psychology at the University of Amsterdam and, as such, had received a substantial amount of education on statistical inference in the previous 3 years." The researchers came from three universities: Groningen, Amsterdam, and Tilburg. |
Lyu et al. | 2018 | China | 6 | Article title: "P-Value, Confidence Intervals, and Statistical Inference: A New Dataset of Misinterpretation" [link] The online survey "recruited participants through social media (include WeChat, Weibo, blogs etc.), without any monetary or other material payment...The paper-pen survey data were collected during the registration day of the 18th National Academic Congress of Psychology, Tianjin, China..." This resulted in four populations: psychology undergraduate students, psychology master's students, psychology PhD students, psychologists with a PhD. |
Lyu et al. | 2020 | Mainland China (83%), Overseas (17%) | 4 | Article title: "Beyond psychology: prevalence of p value and confidence interval misinterpretation across different fields" [link] Recruitment was done by placing advertisements on the following WeChat Public Accounts: The Intellectuals, Guoke Scientists, Capital for Statistics, Research Circle, 52brain, and Quantitative Sociology. This resulted in four populations: psychology undergraduate students, psychology master's students, psychology PhD students, psychologists with a PhD. All respondents were awarded their degree in China. |
Cliff Effect
The cliff effect refers to a researcher’s drop in the confidence of an experimental result based on the p-value. Typically, the cliff effect refers to the dichotomization of evidence at the 0.05 level where an experimental or analytical result produces high confidence for p-values below the 0.05 threshold and lower confidence for values above 0.05. This is often manifested as an interpretation that an experimental treatment effect is real, important, or robust if the experimental result reaches statistical significance and implausible or unimportant if it does not. In practice the cliff effect could refer to any p-value that results in a precipitous drop in researcher confidence, for instance at the 0.1 threshold.
The cliff effect is often considered an artifact of NHST misinterpretations. Why would two p-values of, say, 0.04 and 0.06 elicit a drop in researcher confidence that is more severe than two p-values of 0.2 and 0.22? From a decision theoretic point of view — in the spirit of Neyman and Pearson — a pre-specified cutoff, the “alpha” value, is required to control long-run false positive rates. Interpreting the p-value itself as a measure of this Type I error rate is a common mistake. Yet even that mistaken interpretation does not warrant a cliff effect: whether one moves from a p-value of 0.04 to 0.06 or from 0.20 to 0.22, on average two additional Type I errors out of 100 would be observed. Returning to correct NHST usage, it is true that the alpha cutoff would cause us to either accept or reject a hypothesis based on the p-value as a means of controlling the Type I error rate. However, rejection regions in that sense should be considered distinct from researcher confidence. When the p-value is treated as an evidentiary measure of the null hypothesis, as it most often is, the cliff effect is unwarranted absent some additional scientific context.
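To make the arithmetic concrete, the minimal simulation below (our illustration, not taken from any study reviewed here) draws many experiments in which the null hypothesis is true and counts how often the p-value falls under each cutoff. The gap between the 0.04 and 0.06 cutoffs and the gap between the 0.20 and 0.22 cutoffs are both roughly two false positives per 100 experiments.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments, n_per_group = 100_000, 15

# Two-sample t-tests where the null hypothesis is true (no group difference).
a = rng.normal(0, 1, size=(n_experiments, n_per_group))
b = rng.normal(0, 1, size=(n_experiments, n_per_group))
p = stats.ttest_ind(a, b, axis=1).pvalue

for cutoff in (0.04, 0.06, 0.20, 0.22):
    print(f"alpha = {cutoff:.2f}: long-run false positive rate = {np.mean(p < cutoff):.3f}")
# Each pair of cutoffs (0.04 vs 0.06, 0.20 vs 0.22) differs by about 0.02,
# i.e. two additional Type I errors per 100 experiments in both cases.
```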
Research on the cliff effect began in 1963 with a small-sample study of nine psychology faculty and ten psychology graduate students by Robert Rosenthal and John Gaito. Subjects were presented with 14 different p-values, although the full list of values was not given in the paper. Subjects were then asked to rate their confidence on a scale of zero to five, where zero indicated no confidence and five indicated extreme confidence. The confidence scale presented to subjects is shown below.
0 - Complete absence of confidence or belief
1 - Minimal confidence or belief
2 - Mild confidence or belief
3 - Moderate confidence or belief
4 - Great confidence or belief
5 - Extreme confidence or belief
The authors concluded a cliff effect exists at the 0.05 threshold by noting that 84% of subjects had a larger decrease in confidence between p-values of 0.05 and 0.10 than between p-values of 0.10 and 0.15. This was the highest proportion of subjects expressing decreased confidence across any of the ranges the authors presented in the paper. The authors also tested whether confidence varied with sample size by presenting two scenarios, with sample sizes of 10 and 100.
The drop in confidence between 0.05 and 0.1 and between 0.1 and 0.15 can be seen by looking at the two distances d1 and d2 on the chart at right. These distances are for students responding to the scenario with a sample size of 100; however, both populations across both sample size scenarios exhibit similar drops.
Note that the chart at right was reproduced using our standard tracing technique and therefore may contain minor inaccuracies, although we do not consider these to be materially relevant for the current discussion.
In general, graduate students expressed greater confidence than faculty members for a given p-value, and all respondents tended to express more confidence at the larger sample size for the same p-value. Because sample size does not affect the Type I error rate, the authors concluded that respondents either intentionally or subconsciously considered Type II error.
Kenneth Beauchamp and Richard May replicated Rosenthal and Gaito a year later and wrote up a short one-page summary. Beauchamp and May gave their survey instrument to nine psychology faculty and 11 psychology graduate students. The authors reported that no cliff effect was observed at any p-level. However, a one-page response by Rosenthal and Gaito the same year made use of an extended report from Beauchamp and May (provided to Rosenthal and Gaito upon request). The Rosenthal and Gaito rejoinder disputed the conclusion of no cliff effect, noting that in Beauchamp and May’s extended report confidence for a p-value of 0.05 was actually higher than for a p-value of 0.03, indicating 0.05 was treated as if it had special properties. However, this pattern in itself does not fit the usual definition of the cliff effect. Rosenthal and Gaito also pointed out that Beauchamp and May described the main effects from Rosenthal and Gaito’s original paper as nonsignificant despite the p-value being 0.06, thereby falling prey to the very cliff effect they argued was not present in their data.
In 1972 Eric Minturn, Leonard Lansky, and William Dember surveyed 51 bachelor’s, master’s, and PhD graduates. Participants were asked about their level of confidence at 12 different p-values with two sample sizes, 20 and 200. A cliff effect was reported at p-values of 0.01, 0.05, and 0.10. Participants also had more confidence in the larger sample size scenario. The results of Minturn, Lansky, and Dember (1972) presented here are as reported in Nelson, Rosenthal, and Rosnow (1986). The original Minturn, Lansky, and Dember (1972) paper could not be obtained. It was presented at the meeting of the Eastern Psychological Association, Boston, under the title, “The interpretation of levels of significance by psychologists.” Despite several emails to the Eastern Psychological Association to obtain the original paper, no response was received. We confirmed that Leonard Lansky and William Dember are now deceased and therefore could not be contacted. We attempted to reach Eric Minturn via a LinkedIn profile that appeared to belong to the same Eric Minturn who authored the paper, however no response was received. All three authors had an academic affiliation with the University of Cincinnati, but the paper is not contained in the authors’ listed works kept by the university, nor is it contained in the listed works collected by any of the third-party journal libraries we searched.
Nanette Nelson, Robert Rosenthal, and Ralph Rosnow surveyed 85 academic psychologists in 1986. They asked about 20 p-values ranging from 0.001 to 0.90 using the same confidence scale as Rosenthal and Gaito (1963). The authors found a cliff effect at 0.05 and 0.10. They also found a general increase in confidence as the effect size increased, as the sample size increased, and when the experiment was a replication rather than the first run of an experiment.
Curve | Percentage of respondents |
---|---|
All-or-none | 22% |
Negative exponential | 56% |
1-p linear | 22% |
Table notes:
1. Sample sizes: all-or-none (n=4), negative exponential (n=10), linear (n=4)
2. Reference: "Interpretation of significance levels by psychological researchers: The .05 cliff effect may be overstated", Poitevineau & Lecoutre, Psychonomic Bulletin & Review, 2001 [link]
In 2001 Jacques Poitevineau and Bruno Lecoutre wrote a paper titled “Interpretation of significance levels by psychological researchers: The .05 cliff effect may be overstated,” in which they presented a questionnaire to 18 psychology researchers to measure the presence of a cliff effect. As the title of their paper suggests, Poitevineau and Lecoutre found the cliff effect to be less pronounced than previously identified. The authors’ main finding was that averaging across subjects may mask between-subject heterogeneity in p-value confidence. In particular the authors note that when averaged across all 18 subjects a cliff effect was present. However, this was largely driven by four subjects who expressed what the authors call an “all-or-none” approach to p-values, with extremely high confidence for p-values less than 0.05 and almost zero confidence for p-values larger than 0.05. Four other subjects had a linear decrease in confidence across p-values. The majority — 10 subjects — expressed an exponential decrease in confidence. The exponential group had larger decreases in confidence between small p-values than between large ones, but did not exhibit a cliff effect in the traditional sense.
In 2010 Jerry Lai replicated the between-subject analysis in his confidence elicitation study of 258 psychology and medical researchers. Participants were authors of journal articles that appeared in one of the two fields. Data in the paper are not split by discipline, so all figures below are for the combined population of researchers. If psychology-only data are made available the figures will be updated.
Participants first saw the following hypothetical scenario; in the confidence interval version the final sentence was replaced with the bracketed text.
Suppose you conduct an experiment comparing a treatment and a control group, with n = 15 in each group. The null hypothesis states there is no difference between the two groups. Suppose a two-sample t test was conducted and a two-tailed p value calculated. [Suppose the difference between the two group means is calculated, and a 95% confidence interval placed around it].
Respondents were asked about each of the following p-values: p = .005, .02, .04, .06, .08, .20, .40, .80. Then a scenario with a sample size of 50 was shown. All combinations assumed equal variances, with a pooled standard deviation of four. Respondents were asked to rate the strength of evidence of each p-value and sample size combination on a scale of 0 to 100.
The NHST version displayed a typical result with a t-score, p-value, and effect size. The confidence interval version showed a visual display for each p-value and sample size combination. A total of 172 researchers saw the NHST version and 86 saw the confidence interval version. This sample size is an order of magnitude larger than in Poitevineau and Lecoutre (2001). Participants were not randomized into one of the two versions; instead, different sets of researchers were contacted for the NHST and confidence interval studies.
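To give a sense of the effect sizes implied by the first scenario, the short sketch below (our illustration; the calculation is not part of Lai’s paper) backs out the mean difference that corresponds to each p-value for n = 15 per group with the stated pooled standard deviation of four and equal variances.

```python
import numpy as np
from scipy import stats

n = 15                  # per-group sample size in the first scenario
pooled_sd = 4.0         # pooled standard deviation stated in the scenario
df = 2 * n - 2          # degrees of freedom for the two-sample t test
se = pooled_sd * np.sqrt(2 / n)   # standard error of the mean difference

for p in (0.005, 0.02, 0.04, 0.06, 0.08, 0.20, 0.40, 0.80):
    t = stats.t.ppf(1 - p / 2, df)   # t-score with a two-tailed p-value of p
    print(f"p = {p:.3f} -> t = {t:.2f}, implied mean difference = {t * se:.2f}")
```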
Curve | NHST | CI |
---|---|---|
All-or-none | 4% | 17% |
Moderate cliff | 17% | 15% |
Negative exponential | 35% | 31% |
1-p linear | 23% | 0% |
Table notes:
1. Percentages do not add to 100% because not all responses could be categorized into one of the four main types.
2. NHST sample sizes: all-or-nothing (n=7), moderate cliff (n=29), negative exponential (n=60), 1-p linear (n=39), unclassified (n=38). Confidence interval sample sizes: all or nothing (n=28), moderate cliff (n=13), negative exponential (n=27), 1-p linear (n=0), unclassified (n=32).
3. Reference: "Dichotomous Thinking: A Problem Beyond NHST", Jerry Lai, Proceedings of the Eighth International Conference on Teaching Statistics, 2010 [link]
Lai assessed the extent of a cliff effect by calculating what he called the “Cliff Ratio (CR).” The numerator of the CR was the decrease in the rated strength of evidence between p-values of 0.04 and 0.06. The denominator of the CR was calculated by averaging the decrease in rated strength of evidence between p-values of 0.02 and 0.04 and between 0.06 and 0.08.
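As a concrete illustration of the calculation, the sketch below computes a Cliff Ratio for a hypothetical respondent; the ratings and the function name are ours, invented only to show the arithmetic.

```python
def cliff_ratio(ratings):
    """Cliff Ratio as described by Lai (2010): the drop in rated strength of
    evidence between p = .04 and p = .06, divided by the average of the drops
    between p = .02 and p = .04 and between p = .06 and p = .08.
    `ratings` maps p-values to 0-100 strength-of-evidence ratings."""
    numerator = ratings[0.04] - ratings[0.06]
    denominator = ((ratings[0.02] - ratings[0.04]) +
                   (ratings[0.06] - ratings[0.08])) / 2
    return numerator / denominator

# Hypothetical respondent with a sharp drop around the 0.05 threshold.
example = {0.02: 90, 0.04: 85, 0.06: 40, 0.08: 35}
print(cliff_ratio(example))  # (85 - 40) / 5 = 9.0, a pronounced cliff
```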
The CR as well as the overall shape of each participant’s responses were used to manually categorize patterns, with a focus on the categories from Poitevineau and Lecoutre (2001). One additional pattern, a moderate cliff effect, was also identified.
The results are shown in the table at right. Not all respondents could be categorized into one of the four main groups selected by Lai; 79% of responses from the NHST version and 63% of responses from the confidence interval version were placed into one of these categories.
In total 21% of participants demonstrated an all-or-none or moderate cliff effect for the NHST version. However, in the paper Lai mistakenly cites a 22% figure in his discussion.
In the initial research by Poitevineau and Lecoutre (2001) a 22% figure for a cliff effect was found, but this was for the single all-or-none category, into which only 4% of respondents were placed by Lai (2010). Poitevineau and Lecoutre (2001) did not create an explicit “moderate cliff effect” category.
The corresponding figure for the confidence interval version was that 32% of participants demonstrated an all-or-none or moderate cliff effect. This implies that the confidence interval presentation did not decrease dichotomization of evidence relative to NHST (and in fact made it worse) among the sample collected by Lai. However, as Lai discusses, confidence intervals are sometimes recommended specifically because of the belief that they protect against dichotomization of evidence.
Lai notes that “sample size was found to have little impact on researchers’ interpretation…” and no breakdown of results by sample size is provided in his paper. This differs from other studies of the cliff effect, although it is unclear why.
Several participants exhibited fallacies when providing responses in the open-ended section of the response. For instance, one participant believed that p-values represent, “the likelihood that the observed difference occurred by chance,” an example of the Odds-Against-Chance Fallacy. Another claimed that, “I have estimated…based on probability that the null hypothesis is false,” an example of the Inverse Probability Fallacy.
In 2012 Rink Hoekstra, Addie Johnson, and Henk Kiers published the results of their study of 65 PhD students within psychology departments in the Netherlands. Each subject was presented with eight different scenarios, four related to NHST and four related to confidence intervals. The NHST scenario included the mean difference along with degrees of freedom, sample size, a p-value, and standard error. For each scenario four p-values were shown, designed to be either just below and just above the traditional 0.05 cutoff (.04 and .06) or clearly below and above it (.02 and .13).
Confidence in both the original result and the result if the study were repeated was measured by responses to the two questions below.
What would you say the probability is that there is an effect in the expected direction in the population based on these results?
What would you say the probability is that you would find a significant effect if you were to do the same study again?
The authors found a cliff effect around the 0.05 level: the decrease in confidence for both questions was greatest between 0.04 and 0.06 among the four p-values presented. There was a 14 point drop in confidence (85 to 71) between p-values of 0.04 and 0.06, compared to a drop of just 6.5 (91.5 to 85) between p-values of 0.02 and 0.04 and a drop of 25 (71 to 46) between the much wider p-value interval of 0.06 to 0.13. These results can be seen visually in the chart at right; the line segment has the steepest slope between 0.04 and 0.06. Note that this chart and the associated values above were reproduced using our standard tracing technique and therefore may contain minor inaccuracies, although we do not consider these to be materially relevant for the current discussion.
The cliff effect for confidence intervals was found to be more pronounced than for NHST. This can be seen in the chart by noting the slope of the yellow line is steeper than the red line between p-values of 0.04 and 0.06. There was a 17 point drop in confidence (73 to 56) between p-values of 0.04 and 0.06 for the confidence interval scenarios, three points greater than the corresponding drop for NHST.
Cliff effect literature summary
Due to the varying methodologies of studies investigating the cliff effect, a meta-analysis could not be conducted. However, nearly all results suggest that a cliff effect of some magnitude exists for psychology researchers. More work is likely needed in the spirit of Poitevineau and Lecoutre (2001) and Lai (2010) to determine whether the aggregated cliff effect reported in the other studies is the result of a relatively small number of “all-or-none” and “moderate cliff” respondents, and the extent to which subject confidence in p-values can be segmented into distinguishable patterns such as linear or negative exponential decreases.
One finding was that confidence intervals did not appear to moderate the magnitude of the cliff effect. In fact, both Lai (2010) and Hoekstra et al. (2012) found a more pronounced cliff effect for the confidence interval presentation.
A second pattern was that the negative exponential decrease is a common mental model with which researchers evaluate strength of evidence. For instance, the average confidence of respondents in Rosenthal and Gaito (1963), shown in the accompanying chart, follows a negative exponential decrease. The trend was also present in Poitevineau and Lecoutre (2001). Although the sample size in both studies was quite low, the pattern was also present in the larger sample of Lai (2010), where it was the modal pattern observed for both the NHST and confidence interval presentation types.
It is worth asking whether there is a normatively correct pattern of belief for confidence changes based on the p-value, but here opinions may differ. The negative exponential decrease is, however, consistent with some methods of evidentiary evaluation. For example, plotting the so-called surprisal, or s-value [23], of the p-value results in a negative exponential curve. Another way to view the problem is to think about fictitious replications of a single experiment. As noted in Lai et al. (2012), if an initial experiment resulted in a p-value of 0.02, under conservative assumptions an 80% replication interval would be p = 0.0003 to p = 0.3. Using this rubric it is unclear how to discount confidence as the p-value decreases, since any p-value one observes would be expected to be different, perhaps substantially different, if the experiment were repeated.
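As an illustration of the surprisal scale mentioned above (ours, not taken from any of the studies reviewed), the sketch below computes s = -log2(p) for a range of p-values; the resulting evidence measure declines smoothly, with no special discontinuity at 0.05.

```python
import numpy as np

p_values = np.array([0.005, 0.01, 0.02, 0.04, 0.05, 0.06, 0.10, 0.20, 0.50])
s_values = -np.log2(p_values)   # surprisal in bits of information against the null

for p, s in zip(p_values, s_values):
    print(f"p = {p:.3f} -> s = {s:.1f} bits")
# The drop between p = .04 and p = .06 is only about 0.6 bits,
# far from the cliff many respondents report at the 0.05 threshold.
```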
Dichotomization of evidence Review
The dichotomization of evidence is a specific NHST misinterpretation in which results are interpreted differently depending on whether the p-value is statistically significant or statistically nonsignificant. It is often a result of the cliff effect and is closely related to the Nullification Fallacy, in which a nonsignificant result is interpreted as evidence of no effect.
In 2010 Melissa Coulson, Michelle Healey, Fiona Fidler, and Geoff Cumming conducted a pair of studies with psychologists. In the first study 330 subjects from three academic disciplines were surveyed: researchers from two medical fields plus 102 psychologists who had authored recent articles in leading psychology journals.
The subjects were first shown a short background summary of two fictitious studies that evaluated the impact of a new treatment for insomnia:
Only two studies have evaluated the therapeutic effectiveness of a new treatment for insomnia. Both Simms (2003) and Collins (2003) used two independent, equal-sized groups and reported the difference between the means for the new treatment and current treatment.
Then subjects were shown, at random, one of two formats, either a confidence interval format or an NHST format. For each format there was a text version and a figure version. This resulted in subjects being shown one of four result summaries: either an NHST figure, which included a column chart of the average treatment effect and associated p-values; an NHST text version, which simply described the results; a confidence interval figure which provided the 95% confidence intervals of the two studies side-by-side; or a confidence interval text version, which simply described the results. Only one version was shown to each subject. The confidence interval text version is shown below. All four versions can be found in the original paper.
Simms (2003), with total N = 44, found the new treatment had a mean advantage over the current treatment of 3.61 (95% Confidence Interval: 0.61 to 6.61). The study by Collins (2003), with total N = 36, found the new treatment had a mean advantage of 2.23 (95% Confidence Interval: -1.41 to 5.87).
Subjects were first prompted to provide a freeform response to the question, “What do you feel is the main conclusion suggested by these studies?” Next three statements were presented regarding the extent to which the findings from the two studies agreed or disagreed. Subjects used a 1 to 7 Likert response scale to indicate their level of agreement where 1 equated to “strongly disagree” and 7 to “strongly agree.” The three statements were as follows:
Statement 1: The results of the two studies are broadly consistent.
Statement 2: There is reasonable evidence the new treatment is more effective.
Statement 3: There is conflicting evidence about the effectiveness of the new treatment.
The primary purpose of the study was to evaluate whether respondents viewed the two studies as contradictory. The authors view the results as supportive: the confidence intervals have a large amount of overlap despite one interval covering zero while the other does not; or in NHST terms, both have a positive effect size in the same direction despite one p-value being statistically significant and the other statistically nonsignificant. However, dichotomization of evidence may lead one to believe that the results are contradictory. This type of dichotomization of evidence was Fallacy 8 in our common NHST misinterpretations. The authors call the philosophy undergirding the supportive interpretation of the two studies meta-analytical thinking: the notion that no single study should be viewed as definitive and that studies should instead be considered in totality as providing some level of evidence toward a hypothesis. This is a good rule of thumb; note, however, that meta-analyses are not always better than a single study. This was Fallacy 11 in our common NHST misinterpretations.
To analyze the results of the study the authors averaged the Likert scores for Statements 1 and 3 for each subject (after first reversing the scale of Statement 3, since the sentiments of the two statements were in complete opposition). The authors called this the “Consistency score.” The results for Statement 2 were captured in what was termed the “Effective score.”
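A minimal sketch of this scoring is shown below, assuming the standard reverse-coding of a 1-to-7 Likert item (8 minus the response); the function and variable names are ours, not the authors’.

```python
def consistency_score(statement_1, statement_3):
    """Average of Statement 1 and reverse-coded Statement 3 (1-7 Likert scale)."""
    return (statement_1 + (8 - statement_3)) / 2

# A respondent who somewhat agrees the studies are consistent (6) and
# disagrees that the evidence is conflicting (2).
print(consistency_score(6, 2))  # 6.0
```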
The results of the first study are shown at the right, including the average across subjects (the dot) and the 95% confidence interval for the results (the bars). These were reproduced using our standard tracing technique and therefore may contain minor inaccuracies, although we do not consider these to be materially relevant for the current discussion. There were only small differences between the text and figure versions of the two formats, and thus the results were averaged, yielding a single Effective score and Consistency score for both NHST and confidence intervals.
Because the authors considered the normatively correct response to be agreement with both (reverse-coded) statements, both average Likert responses should be near 6, somewhat agree, or 7, strongly agree. However, neither score even reached 5, the response denoting mild agreement.
For both formats more subjects believed that the two studies were consistent than believed that the totality of evidence indicated the treatment was effective, with the Consistency score between 0.75 and 1.0 higher on the Likert scale than the Effective Score. For the Consistency score NHST produced average Likert responses higher than those of confidence intervals. For the Effective score this was reversed. However, note that the magnitudes of the differences are modest.
Some analysis by the authors was not segmented by academic field. We have requested the raw data from the authors and will update this write-up if they are made available.
However, the authors note that looking across all three academic disciplines, “Only 29/325 (8.9%) of respondents gave 6 or above on both ratings, and only 81/325 (24.9%) gave any degree of agreement – scores of 5 or more – on both.”
Coding of freeform responses to the question, “What do you feel is the main conclusion suggested by these studies?” painted a similar picture when looking across all three academic disciplines. Coding led to 81 out of 126 (64%) subjects indicating that the confidence interval format showed the two fictitious studies were consistent or similar in results. For the NHST format the corresponding proportion was 59 out of 122 (48%). Again, these freeform responses can be viewed as a possible indication of dichotomous thinking.
In addition, the authors note that 64 of the 145 subjects (44%) answering one of the two confidence interval result summaries made mention of “p-values, significance, a null hypothesis, or whether or not a CI includes zero.” This can be considered a type of confidence interval misinterpretation as NHST is distinct from confidence intervals. NHST produces a p-value, an evidentiary measure of data compatibility with a null hypothesis, while confidence intervals produce a treatment effect estimation range.
In the second study 50 academic psychologists from psychology departments in Australian universities were presented with only the confidence interval figure result summary and asked the same four questions. This time, for Statements 1 and 3 the subjects were also prompted to give freeform text responses. Results of this study are also shown in the figure. Note that for this study we did not use tracing; instead, the results were plotted directly.
The results were similar to the first study, although the Consistency score and Effective score were closer together, centered between a Likert response of 4 and 4.5.
The freeform responses from Statements 1 and 3 were analyzed by the authors. Together there were 96 total responses between the two questions (each of the 50 subjects had the chance to respond to both statements, which would have given a total of 100 responses, however a few psychologists abstained from providing written responses). In 26 of 96 cases (27%) there was mention of NHST elements, despite the survey instrument presenting only a confidence interval scenario. Mention of NHST was negatively correlated with agreement levels of 5, 6, or 7 on the Likert scale for both the Consistency score and Effective score, again suggesting NHST references were indicative of dichotomous thinking.
In their 2016 paper “Blinding Us to the Obvious? The Effect of Statistical Training on the Evaluation of Evidence” Blakeley McShane and David Gal set out to test the prevalence of the dichotomization of evidence induced by the statistically significant/nonsignificant threshold. They conducted a series of surveys among populations of different academic disciplines, including members of the editorial board of three different psychology journals.
In email correspondence Professor McShane provided detailed responses to the initial version of this section. We are grateful for his responses and have incorporated his feedback in what follows.
The first population surveyed in “Blinding Us to the Obvious?” was 54 editorial board members of Psychological Science. The survey instrument given to these participants is shown below. The question asked in this survey was termed the “Descriptive” question as it involves interpretation of descriptive statistics rather than statistical inference. Two versions of the summary were presented, one with a p-value of 0.01 and one with a p-value of 0.27. Respondents saw both versions, but the order in which the versions were presented was randomized.
Below is a summary of a study from an academic paper:
The study aimed to test how different interventions might affect terminal cancer patients’ survival. Participants were randomly assigned to either write daily about positive things they were blessed with or to write daily about misfortunes that others had to endure. Participants were then tracked until all had died. Participants who wrote about the positive things they were blessed with lived, on average, 8.2 months after diagnosis whereas participants who wrote about others’ misfortunes lived, on average, 7.5 months after diagnosis (p = 0.27). Which statement is the most accurate summary of the results?
A. The results showed that participants who wrote about their blessings tended to live longer post-diagnosis than participants who wrote about others’ misfortunes.
B. The results showed that participants who wrote about others’ misfortunes tended to live longer post-diagnosis than participants who wrote about their blessings.
C. The results showed that participants’ post-diagnosis lifespan did not differ depending on whether they wrote about their blessings or wrote about others’ misfortunes.
D. The results were inconclusive regarding whether participants’ post-diagnosis lifespan was greater when they wrote about their blessings or when they wrote about others’ misfortunes.
As discussed in more detail below, the authors argue Option A is the nominally correct response for both p-value scenarios and so frame their results around the proportion of respondents selecting this option. A graphical representation of the results is shown at right, with the dot representing the mean and the lines representing the 95% confidence interval of the mean. The drop in the proportion of respondents selecting Option A between the 0.01 and 0.27 p-value versions was driven by respondents moving to Option C and Option D (33% and 37% of respondents, respectively).
For some academic disciplines so-called “Judgement” and “Choice” questions were asked instead of the Descriptive question.
In the Judgement question respondents were asked to select from several options their opinion about what would happen, hypothetically, if a new patient were provided the treatment described in the study summary.
The Choice question had two versions. In one version respondents were asked to choose a drug treatment for themselves based on a hypothetical scenario in which two drugs were being compared for efficacy in disease treatment. In the second version respondents were asked to make a recommendation about the hypothetical treatment to either a socially close or socially distant person (for example, a family member or a stranger).
Note that while the Judgement and Choice questions were usually paired — participants were asked to respond to both question types — the Descriptive and Choice questions were never asked together.
Both the Judgement and Choice questions were presented to 31 members of the Cognition editorial board and 33 members of the Social Psychological and Personality Science editorial board (SPPS). We chose to include the results of the Cognition board here as the journal crosses both cognitive neuroscience and cognitive psychology. It will also be included in an upcoming review of NHST and confidence interval misinterpretations of medical researchers, similar to this article.
The judgement question presented to these two populations was as follows.
Below is a summary of a study from an academic paper:
The study aimed to test how two different drugs impact whether a patient recovers from a certain disease. Subjects were randomly assigned to Drug A or Drug B. Fifty-two percent (52%) of patients who took Drug A recovered from the disease while forty-four percent (44%) of patients who took Drug B recovered from the disease (p = 0.26).
Assuming no prior studies have been conducted with these drugs, which of the following statements is most accurate?
A. A person drawn randomly from the same patient population as the patients in the study is more likely to recover from the disease if given Drug A than if given Drug B.
B. A person drawn randomly from the same patient population as the patients in the study is less likely to recover from the disease if given Drug A than if given Drug B.
C. A person drawn randomly from the same patient population as the patients in the study is equally likely to recover from the disease if given Drug A or if given Drug B.
D. It cannot be determined whether a person drawn randomly from the same patient population as the patients in the study is more/less/equally likely to recover from the disease if given Drug A or if given Drug B.
Here the statements were more explicit about the population/sample distinction, noting that of interest was, “A person drawn randomly from the same patient population as the patients in the study.”
After answering the Judgment question participants were asked the Choice question. This included the same hypothetical setup, but asked subjects to make a choice about their preference of drug:
If you were a patient from the same population as the patients in the study, what drug would you prefer to take to maximize your chance of recovery?
A. I prefer Drug A.
B. I prefer Drug B.
C. I am indifferent between Drug A and Drug B.
A summary of the survey setup for the three editorial boards is shown below.
Editorial Board | Descriptive | Judgement | Choice | Instrument setup | P-values |
---|---|---|---|---|---|
Psychological Science | Yes | No | No | How does writing about positive things in your life or others' negative experiences impact life expectancy? | 0.01, 0.27 |
Cognition | No | Yes | Yes | Is Drug A or Drug B better at aiding in recovery from a certain disease? | 0.01, 0.26 |
SPPS | No | Yes | Yes | Is Drug A or Drug B better at aiding in recovery from a certain disease? | 0.01, 0.26 |
Table notes:
1. Reference: Supplement to “Blinding Us to the Obvious? The Effect of Statistical Training on the Evaluation of Evidence”, Blakeley B. McShane & David Gal, Management Science, 2016 [link]
Overall, 27 of 31 members of the Cognition board (87%) chose Option A in the p = 0.01 version of the Judgement question compared to just 11 (35%) in the p = 0.26 version, but just one subject failed to prefer Drug A in the p = 0.01 version of the Choice question. While fewer subjects, 20 or 65%, preferred Drug A in the p = 0.26 version of the Choice question, this was still almost twice the number who viewed Drug A as superior in the p = 0.26 version of the Judgement question.
The results for the SPPS board were similar; 29 of 33 subjects (89%) chose Option A in the p = 0.01 version of the Judgement question compared to just 5 (15%) in the p = 0.26 version. Only four subjects had no preference of drug in the p = 0.01 version of the Choice question, with the remaining 29 (89%) preferring Drug A. While 19 subjects, or 58%, preferred Drug A in the p = 0.26 version of the Choice question, this was still almost four times the number who viewed Drug A as superior in the p = 0.26 version of the Judgement question.
Note that in the data tables presented in Appendix A of McShane and Gal (2016) the authors mistakenly indicate that the p-value presented to the Cognition and SPPS subjects was 0.27. All other references in the supplementary material report the p-value in this scenario as 0.26.
The response proportions are provided in table form below. Note that no respondent selected Option B.
Value | Psychological Science p = 0.01 | Psychological Science p = 0.27 | Cognition p = 0.01 | Cognition p = 0.26 | SPPS p = 0.01 | SPPS p = 0.26 |
---|---|---|---|---|---|---|
Option A | 87% | 17% | 87% | 35% | 88% | 15% |
Option C | 4% | 37% | 0% | 10% | 0% | 33% |
Option D | 9% | 46% | 13% | 55% | 12% | 52% |
Table notes:
1. Option B has been omitted as no respondent selected it.
2. Reference: Supplement to “Blinding Us to the Obvious? The Effect of Statistical Training on the Evaluation of Evidence”, Blakeley B. McShane & David Gal, Management Science, 2016 [link]
The authors argue the correct answer regardless of p-value is Option A for all three question types: Descriptive, Judgement, and Choice. In the Descriptive question this follows simply from the language used in the survey instrument, which asked about what actually happened in the referenced study, not about statistical inference for some population. As noted in the study summary, those who wrote about positive things lived 8.2 months on average compared to the 7.5 months lived on average by the group writing about others’ misfortunes. Since 8.2 is larger than 7.5, Option A is the only sensible response. (Recall that Option A in the Descriptive question stated that participants writing about positive things tended to live longer.)
Despite the question being purely descriptive, respondents consistently switched from Option A in the 0.01 p-value scenario to Options C or D in the 0.27 p-value scenario. Data for this switching behavior is shown in the table below. In analysis by discussants of a follow-up paper by McShane and Gal there was some criticism of the instrument wording, with concerns that it amounted to a “trick question” since participants might interpret the instrument as prompting statistical inference. This criticism is discussed in more detail below.
The authors also consider Option A to be nominally correct in the Judgement question for both the 0.01 and 0.26 p-value scenarios. (Recall that Option A in the Judgement question stated that a patient taking Drug A was more likely to recover.) The authors’ argument follows from the fact that the effect size is always the best estimate of the treatment effect regardless of the p-value. For the Judgement question the respondent option selections between the 0.01 and 0.26 p-value scenarios can be broken down into three key patterns.
Respondents who switched from Option A to Option C
In the Judgement question the clearest indication of dichotomization of evidence would be a respondent switching from Option A to Option C in the 0.01 and 0.26 p-value scenarios, respectively. (Recall that Option C in the Judgement question suggested Drug A and Drug B were equally effective.) This switching behavior is troublesome because selection of Option C in the 0.26 p-value scenario confuses a nonsignificant p-value with evidence of no effect (the Nullification Fallacy).
Data for the switching behavior from Option A to Option C is presented in the table below. The results show that for Cognition 10% of respondents exhibited this behavior and for SPPS nearly a third of respondents did so.
Respondents who switched from Option A to Option D
Another type of switching behavior, which was even more common among the editorial boards, was a move from Option A to Option D in the 0.01 and 0.26 p-value scenarios, respectively. (Recall that Option D in the Judgement question was “It cannot be determined [which drug is more effective]”.) Again, the authors argue this switching pattern is a kind of dichotomization of evidence since the effect size indicates Drug A is the better choice to maximize the chance of recovery even in the 0.26 p-value scenario. However, this switching behavior gets at a fundamental question: for a given effect size, should the p-value ever influence the interpretation of treatment efficacy? If so, when and how? There is much that can and has been said about these questions and we do not attempt to answer them here.
Data for the switching behavior from Option A to Option D is presented in the table below. The results show that for both SPPS and Cognition 42% of respondents exhibited this behavior.
Respondents who selected Option D in both p-value scenarios
The third response pattern of interest is the number of respondents who selected Option D in both p-value scenarios. Option D is always epistemically defensible since at no p-value is a result definitive. Likewise, some researchers may prefer multiple studies on a topic before passing judgement (recall the Judgement question wording noted that, “Assuming no prior studies have been conducted with these drugs…”). A consistent selection of Option D reflects some level of epistemic skepticism with regard to the data. About this response pattern the authors note the following:
“An argument might be made that there is a sense in which option D is the correct option for the judgment question because, as discussed above, at no p is the null definitively overturned. More specifically, under a classical frequentist interpretation of the question, which drug is “more likely” to result in recovery depends on the parameters governing the probability of recovery for each drug. As these parameters are unknown and unknowable, option D could be construed as the correct answer under this interpretation. We note that no such difficulty arises under a Bayesian interpretation of the question and for which option A is definitively the correct response.”
Consistent selection of Option D was less common than the other two response patterns discussed, with just 9% of the SPPS board exhibiting this response pattern and 13% of Cognition board members doing so.
A brief word is in order about Option B as it has not yet been discussed in depth. In both the Descriptive and Judgement questions Option B gets the treatment results completely backward. (Recall that in the Descriptive question Option B specified that writing about others’ misfortunes increased lifespan and in the Judgement question specified that Drug B was more effective than Drug A). Selecting Option B would indicate that either the respondent misread the scenario or is statistically immature (or that they are providing a “trolling” answer they know to be wrong). However, note that no respondent across the three editorial boards selected Option B for either p-value scenario.
Editorial Board | Switching from Option A to C | Switching from Option A to D | Selecting Option D in both scenarios |
---|---|---|---|
Psychological Science | 33% | 37% | 9% |
Cognition | 10% | 42% | 13% |
SPPS | 30% | 42% | 9% |
Table notes:
1. Total respondents: Psychological Science - 54, Cognition - 31, SPPS - 33. Respondent count switching from A to C: Psychological Science - 18, Cognition - 3, SPPS - 10. Respondent count switching from A to D: Psychological Science - 20, Cognition - 13, SPPS - 14. Respondent count selecting D in both scenarios: Psychological Science - 5, Cognition - 4, SPPS - 3.
2. Option B has been omitted as no respondent selected it.
3. Reference: Supplement to “Blinding Us to the Obvious? The Effect of Statistical Training on the Evaluation of Evidence”, Blakeley B. McShane & David Gal, Management Science, 2016 [link]
When comparing results from the Judgement and Choice questions McShane and Gal highlighted the drastic difference in the proportion of respondents choosing Option A. That is, in the 0.26 p-value scenario drastically more respondents chose to take Drug A in the Choice question than judged it superior in the Judgement question. The authors hypothesize that respondents are more likely to select the correct answer for the Choice question because it short-circuits the automatic response to interpret the results through the lens of statistical significance. Instead, focus is redirected toward a simpler question, like, “which drug is better?” This is a possibility, but cannot be concluded with certainty from the available data. It could be consistent for a respondent to choose Option D in the Judgement question, but Option A in the Choice question. In other words a respondent might have the point of view that Drug A might be better than Drug B given a p-value of 0.26, but feel the evidence is largely inconclusive. However, when forced to take one of the two drugs they might still select Drug A. Their reasoning might be that while Drug A is not conclusively better than Drug B, the evidence suggests it is probably not worse, and if forced to take one of the two drugs they would choose Drug A as a kind of “long shot” despite possible misgivings about its superiority. For the population of psychologists covered here it is not possible in the data provided to pair selections from the Judgement and Choice questions to determine how those with a particular response pattern to the Judgement question responded to the Choice question.
In correspondence with Blake McShane he made two counterarguments about the “long shot” possibility. First, recall that the statement wording for Option D in the Judgement question was not that the results are “inconclusive,” but specifically that the data suggest “it cannot be determined” which drug is better. Therefore, it may not be fair to interpret a respondent selecting Option D in the Judgement question as believing that Drug A might be better than Drug B given a p-value of 0.26. Second, the Judgement and Choice questions ask the same question but about different populations: the Judgement question is about a hypothetical “other” (which drug is more likely to help a patient recover) while the Choice question asks about oneself (i.e. which drug would you take to maximize your chance of recovery). “Under this frame, it does seem inconsistent,” McShane wrote, to choose Drug A in the Choice question but not the Judgement question. Additional research would be needed to definitively measure the extent of “long shot” thinking.
There are also questions as to how respondents from the board of Psychological Science chose to interpret the Descriptive question. The hypothetical scenario was written to inquire only about the participants of the study itself so that respondents would avoid invoking statistical inference. Such language makes the correct answer unambiguously Option A, as it is a simple question comparing the average post-diagnosis lifespans of two groups. However, it is unclear whether respondents actually interpreted the question in this way, or mistakenly interpreted it as a question about inference.
A similar question arose in the follow-up paper by McShane and Gal that surveyed JASA authors. McShane and Gal addressed the criticism thusly:
We also wish to reiterate that the claim that the mere presence of a p-value in the question naturally led our subjects to focus on statistical inference rather than description is not really a criticism but rather is essentially our point: our subjects are so trained to focus on statistical significance that the mere presence of a p-value leads them to automatically view everything through the lens of the null hypothesis significance testing (NHST) paradigm—even in cases where it is unwarranted.
Readers can judge for themselves whether interpreting a non-inferential question as inferential when provided a survey instrument is a sign of trouble. It is worth noting that the experimental replications by McShane and Gal used several different versions (specifically, wordings for the response options) of the survey instrument. The results did not differ substantially: every survey version across every academic discipline showed a substantial drop in the proportion choosing Option A between the smaller and the larger p-value.
In total four different wordings of the response options were used, each with different language about whether the results were meant to be interpreted inferentially or non-inferentially. The language used in the Descriptive questions is shown below.
“The results showed that participants…” For example, “The results showed that participants who wrote about their blessings tended to live longer post-diagnosis than participants who wrote about others’ misfortunes.”
“Speaking only of the subjects who took part in this particular study…” For example, “Speaking only of the subjects who took part in this particular study, the average number of post-diagnosis months lived by the participants who were in Group A was greater than that lived by the participants who were in Group B.”
“The participants who were in Group A…” For example, “The participants who were in Group A tended to live longer post-diagnosis than the participants who were in Group B.”
The language used in Judgement questions is shown below.
“A person drawn randomly from the same patient population as the patients in the study…” For example, “A person drawn randomly from the same patient population as the patients in the study is more likely to recover from the disease if given Drug A than if given Drug B.”
Notably, the results were invariant even when using clear language such as, “Speaking only of the subjects who took part in this particular study.” This does not definitively answer the question of how the members of the Psychological Science board interpreted the survey. However, it does suggest that the wording did not matter much for the populations surveyed.
In a 2018 study examining publication bias Rolf Ulrich and Jeff Miller sought to understand the cliff effect in the context of an adviser recommendation to a PhD student about whether an experimental finding should be published. Only the p-value was provided.
In total 1,200 usable responses were received: 590 from German psychologists recruited from the email list of the German Psychological Society, and 610 from English-speaking experimental psychologists recruited from attendees of the Psychonomic Society annual meetings that took place between 2013 and 2015. The majority of respondents, 90%, had published at least one academic paper. Nearly 4 in 10 had published six or more papers. In terms of education, 80% had a PhD while 16% had a master's degree. The remaining 4% had a bachelor's degree. Additional demographic information is available in Appendix C of Ulrich and Miller (2018).
A PhD student conducted an experimental study. This study involved an experimental group (n =40) and a control group (n= 40). The difference between the two groups was consistent with the experimental hypothesis, t(78) =2.316, p =.0232, two-tailed test. The student is excited about this result. However, he wonders whether it should be submitted for publication immediately or whether the experiment should first be replicated in order to assure its stability. Imagine that this PhD student asks you for your advice. Would you recommend immediate submission or replication before submission?
Your answer (please choose one option):
1. Submit the results obtained.
2. Replicate the study before submission.
Subjects were sent a link to the survey via email and randomized into a scenario with one of four p-values: 0.0532, 0.0432, 0.0232, 0.0132.
Results are shown in the table at right. In general German psychologists were more likely to recommend publication than experimental psychologists. No obvious cliff effect was observed for this population, with an almost linear increase in the recommendation to publish as the p-value decreased from 0.0532 to 0.0132.
A much more noticeable cliff effect was observed for experimental psychologists. The percentage recommending publication increased by 25 percentage points (from 10% to 35%) as the p-value moved from 0.0532 to 0.0432. Increases in the recommendation to publish were quite small between subsequent p-values.
The authors perform more formal statistical analyses in their paper, noting the results are consistent with a “gradual submission bias” which they describe as “gradual increases in confidence with decreasing p values.”
But here the authors may be making a mountain out of a molehill. While a cliff effect is usually an unambiguous signal that NHST misinterpretations underlie a subject’s statistical decision making, a pattern of gradually increasing confidence based on the p-value is not necessarily problematic. The p-value is an evidentiary measure, and smaller p-values do indicate that the observed data are less compatible with the null hypothesis.
In practice other information is needed to make an informed publication recommendation. For instance, the effect size, confidence intervals, outcomes of robustness tests, the practical importance of the finding, and more.
This sentiment was echoed in some feedback by respondents. Although there was no option to provide comments in the survey, some researchers chose to respond directly to the initial email. Several of the comments were provided by the authors as exemplary. One such comment is reproduced below, pointing to the fact that not enough information was provided, but that nonetheless the respondent provided an answer.
I also am incapable of making the judgment from a t-test alone. What is the hypothesis? What are the DVs and the descriptive statistics? I chose “submit for publication,” but I suspect most people who choose this option share similar views—not submitting single-experiment papers, not relying solely on p-values, etc.
If this response pattern were common, then this study by itself may not provide overwhelming evidence that p-value myopia is a primary driver of publication bias.
Dichotomization of evidence meta-analysis
A small meta-analysis was possible for dichotomization of evidence using the three studies from McShane and Gal. This included 118 psychology researchers in total: 31 from the editorial board of Cognition, 33 from the editorial board of Social Psychological and Personality Science, and 54 from the editorial board of Psychological Science. A simple weighted average based on sample size was used to determine the difference in the proportion of respondents selecting Option C in the p = 0.01 and p = 0.27 (or 0.26) cases. This is the clearest signal of dichotomization of evidence in the available data. On average Option C was selected 27 percentage points more often in the larger p-value scenario than in the smaller p-value scenario. Note that this meta-analysis pertains only to the Descriptive and Judgement questions, not the Choice question. For a full review of these studies please see the “Dichotomization of Evidence Review” section above.
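The weighted average can be recomputed from the Option C proportions in the response table above; the short sketch below is our recomputation, not code from the original studies.

```python
# Difference in the proportion selecting Option C between the larger and
# smaller p-value scenarios, per editorial board (values from the table above).
boards = {
    "Psychological Science": (54, 0.37 - 0.04),
    "Cognition":             (31, 0.10 - 0.00),
    "SPPS":                  (33, 0.33 - 0.00),
}

total_n = sum(n for n, _ in boards.values())
weighted_diff = sum(n * diff for n, diff in boards.values()) / total_n
print(f"{weighted_diff:.2f}")  # about 0.27, i.e. 27 percentage points
```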
References
Rex Kline, “Beyond significance testing: Reforming data analysis methods in behavioral research”, American Psychological Association, 2004 [link]. On page 80 it is noted that, “Confidence intervals are not a magical alternative to NHST. However, interval estimation in individual studies and replication together offer a much more scientifically credible way to address sampling error than the use of statistical tests in individual studies.”
American Psychological Association, Publication Manual of the American Psychological Association, Seventh Edition, American Psychological Association, 2020 [link]. On page 88 it is noted that, “It can be extremely effective to include confidence intervals (for estimates of parameters; functions of parameters, such as differences in means; and effect sizes) when reporting results.”
Robert J Calin-Jageman, “After p Values: The New Statistics for Undergraduate Neuroscience Education”, Journal of Undergraduate Neuroscience Education, 2017 [link]
Michèle B. Nuijten, Chris Hartgerink, Marcel van Assen, Sacha Epskamp, and Jelte Wicherts, “The prevalence of statistical reporting errors in psychology (1985–2013)”, Behavior Research Methods, 2016 [link]
Paul Pollard and John Richardson, “On the Probability of Making Type I Errors”, Quantitative Methods In Psychology, 1987 [link]
Heiko Haller and Stefan Krauss, “Misinterpretations of Significance: A Problem Students Share with Their Teachers?”, Methods of Psychological Research Online, 2002 [link]
Brian Haig, “The philosophy of quantitative methods”, Oxford handbook of Quantitative Methods, 2012 [link]
Robert Calin-Jageman and Geoff Cumming, “The New Statistics for Better Science: Ask How Much, How Uncertain, and What Else Is Known”, The American Statistician, 2019 [link]
Sharon Lane-Getaz, “Development of a reliable measure of students’ inferential reasoning ability”, Statistics Education Research Journal, 2013 [link]
Robert Delmas, Joan Garfield, Ann Ooms, and Beth Chance, “Assessing students’ conceptual understanding after a first course in statistics”, Statistics Education Research Journal, 2007 [link]
The CAOS appears to be available on the Assessment Resource Tools for Improving Statistical Thinking (ARTIST) website [link]
Hector Monterde-i-Bort, Dolores Frías-Navarro, and Juan Pascual-Llobell, “Uses and abuses of statistical significance tests and other statistical resources: a comparative study”, European Journal of Psychology of Education, 2010 [link]
Laura Badenes-Ribera, Dolores Frias-Navarro, Nathalie O. Iotti, Amparo Bonilla-Campos, and Claudio Longobardi, “Perceived Statistical Knowledge Level and Self-Reported Statistical Practice Among Academic Psychologists”, Frontiers in Psychology, 2018 [link]
Laura Badenes-Ribera, Dolores Frías-Navarro, and Amparo Bonilla-Campos, “Un estudio exploratorio sobre el nivel de conocimiento sobre el tamaño del efecto y meta-análisis en psicólogos profesionales españoles”, European Journal of Investigation in Health, Psychology and Education, 2017 [link]
Rensis Likert, “A technique for the measurement of attitudes”, Archives of Psychology, 1932 [link]
Susan Jamieson, “Likert scales: how to (ab)use them”, Medical Education, Blackwell Publishing Ltd., 2004 [link]
“5-Point Likert Scale”, in Preedy Watson Handbook of Disease Burdens and Quality of Life Measures, Springer Publishing, 2010 [link]
David Bakan, “The test of significance in psychological research”, Psychological Bulletin, 1966 [link]
Jeff Miller and Rolf Ulrich, “Interpreting confidence intervals: A comment on Hoekstra, Morey, Rouder, and Wagenmakers”, Psychonomic Bulletin & Review, 2014 [link]
Richard Morey, Rink Hoekstra, Jeffrey Rouder, and Eric-Jan Wagenmakers, “Continued misinterpretation of confidence intervals: response to Miller and Ulrich”, Psychonomic Bulletin & Review, 2015 [link]
Ana Elisa Castro Sotos, Stijn Vanhoof, Wim Van den Noortgate, and Patrick Onghena, “The Transitivity Misconception of Pearson’s Correlation Coefficient”, Statistics Education Research Journal, 2009 [link]
Sander Greenland, “Valid P-Values Behave Exactly as They Should: Some Misleading Criticisms of P-Values and Their Resolution With S-Values”, The American Statistician, 2019 [link]
Geoff Cumming, “Replication and p Intervals: p Values Predict the Future Only Vaguely, but Confidence Intervals Do Much Better”, Perspectives on Psychological Science, 2008 [link]
Richard Morey, Rink Hoekstra, Jeffrey Rouder, Michael Lee, and Eric-Jan Wagenmakers, “The fallacy of placing confidence in confidence intervals”, Psychonomic Bulletin & Review, 2016 [link]
Rink Hoekstra, Richard Morey, and Eric-Jan Wagenmakers, “The Interpretation Of Confidence And Credible Intervals”, Proceedings of the Tenth International Conference on Teaching Statistics, 2018 [link]
Valentin Amrhein, David Trafimow, & Sander Greenland, “Inferential Statistics as Descriptive Statistics: There Is No Replication Crisis if We Don’t Expect Replication”, The American Statistician, 2019 [link]
Robert Calin-Jageman, “After p Values: The New Statistics for Undergraduate Neuroscience Education”, Journal of Undergraduate Neuroscience Education, 2017 [link]
Geoff Cumming, “Replication and p Intervals: p Values Predict the Future Only Vaguely, but Confidence Intervals Do Much Better”, Perspectives on Psychological Science, 2008 [link]
Richard Morey, Rink Hoekstra, Jeffrey Rouder, Michael Lee, & Eric-Jan Wagenmakers, “The fallacy of placing confidence in confidence intervals”, Psychonomic Bulletin & Review, 2016 [link]
Author comments
All authors of the studies covered in this article were contacted for optional comment. Below is a summary of their responses. Updates made as a result of author comments were incorporated into the body of the article. We would like to thank all of the authors who provided feedback; they were universally supportive of the project.
Author | Studies included in this article | Summary of comments |
---|---|---|
Adam Zuckerman | Contemporary Issues in the Analysis of Data: A Survey of 551 Psychologists | Could not find email. |
Addie Johnson | Confidence Intervals Make a Difference: Effects of Showing Confidence Intervals on Inferential Reasoning | Could not find email.
Amparo Bonilla-Campos | Misconceptions of the p-value among Chilean and Italian Academic Psychologists | |
Anton Kühberger | The significance fallacy in inferential statistics | |
Astrid Fritz | The significance fallacy in inferential statistics | Could not find email. |
Angustias Vallecillos | Understanding of the Logic of Hypothesis Testing Amongst University Students | Could not find email.
Blakeley McShane | Supplement to “Blinding Us to the Obvious? The Effect of Statistical Training on the Evaluation of Evidence” | Numerous comments. |
Bruno Lecoutre | Even statisticians are not immune to misinterpretations of Null Hypothesis Significance Tests; Interpretation of significance levels by psychological researchers: The .05 cliff effect may be overstated (2001) | Email address found online did not work.
Bryan Iotti | Misconceptions of the p-value among Chilean and Italian Academic Psychologists | |
Charles Greenbaum | Significance tests die hard: The amazing persistence of a probabilistic misconception | |
Chuan-Peng Hu | P-value, confidence intervals, and statistical inference: A new dataset of misinterpretation; Beyond psychology: prevalence of p value and confidence interval misinterpretation across different fields | Minor corrections about the demographics of participants in the 2020 study and a correction regarding the journal of publication.
Claudio Longobardi | Misconceptions of the p-value among Chilean and Italian Academic Psychologists | |
David Gal | Supplement to “Blinding Us to the Obvious? The Effect of Statistical Training on the Evaluation of Evidence” | David Gal was included in Blakeley McShane's response. |
Dolores Frías-Navarro | Misconceptions of the p-value among Chilean and Italian Academic Psychologists; Misinterpretations Of P-values In Psychology University Students; Interpretation of the p value: A national survey study in academic psychologists from Spain; Uses and abuses of statistical significance tests and other statistical resources: a comparative study | |
Eric Minturn | The interpretation of levels of significance by psychologists: A replication and extension | No response. |
Eric-Jan Wagenmakers | Robust misinterpretation of confidence intervals | |
Eva Lermer | The significance fallacy in inferential statistics | |
Fiona Fidler | Subjective p Intervals: Researchers Underestimate the Variability of p Values Over Replication; Confidence intervals permit, but do not guarantee, better inference than statistical significance testing; Researchers Misunderstand Confidence Intervals and Standard Error Bars; Replication and Researchers’ Understanding of Confidence Intervals and Standard Error Bars | No response.
Geoff Cumming | Researchers misunderstand confidence intervals and standard error bars (2005); Confidence intervals permit, but do not guarantee, better inference than statistical significance testing (2010); Subjective p intervals: Researchers underestimate the variability of p values over replication (2012); A cross-sectional analysis of students’ intuitions when interpreting CIs (2018) | Supportive of project, but doesn't currently have time to review due to other obligations.
Héctor Monterde-i-Bort | Interpretation of the p value: A national survey study in academic psychologists from Spain; Uses and abuses of statistical significance tests and other statistical resources: a comparative study | |
Heiko Haller | Misinterpretations of Significance: A Problem Students Share with Their Teachers? | No response. |
Henk Kiers | Confidence Intervals Make a Difference: Effects of Showing Confidence Intervals on Inferential Reasoning | Out of office. |
Holley S. Hodgins | Contemporary Issues in the Analysis of Data: A Survey of 551 Psychologists | |
Jacques Poitevineau | Even statisticians are not immune to misinterpretations of Null Hypothesis Significance Tests; Interpretation of significance levels by psychological researchers: The .05 cliff effect may be overstated | Could not find email.
Jeff Miller | Some Properties of p-Curves, With an Application to Gradual Publication Bias | Could not find email. |
Jeffrey N. Rouder | Robust misinterpretation of confidence intervals | |
Jennifer Williams | Researchers Misunderstand Confidence Intervals and Standard Error Bars; Replication and Researchers’ Understanding of Confidence Intervals and Standard Error Bars | Could not find email.
Jerry Lai | A Cross-Sectional Analysis of Students’ Intuitions When Interpreting CIs; Subjective p Intervals: Researchers Underestimate the Variability of p Values Over Replication; Dichotomous Thinking: A Problem Beyond NHST | Cc'd on response from Geoff Cumming; no response received.
John Gaito | The Interpretation of Levels of Significance by Psychological Researchers; Further evidence for the cliff effect in the interpretation of levels of significance | Could not find email.
Juan Pascual-Llobell | Uses and abuses of statistical significance tests and other statistical resources: a comparative study | Could not find email. |
Kaiping Peng | P-Value, Confidence Intervals, and Statistical Inference: A New Dataset of Misinterpretation | Chuan-Peng Hu was contacted as he was listed as the corresponding author.
Kenneth Beauchamp | Replication report: Interpretation of levels of significance by psychological researchers | Deceased. |
Laura Badenes-Ribera | Misconceptions of the p-value among Chilean and Italian Academic Psychologists; Misinterpretations Of P-values In Psychology University Students; Interpretation of the p value: A national survey study in academic psychologists from Spain | |
Leonard Lansky | The interpretation of levels of significance by psychologists: A replication and extension | Deceased. |
Marcos Pascual-Soler | Misinterpretations Of P-values In Psychology University Students; Interpretation of the p value: A national survey study in academic psychologists from Spain | Could not find email.
Marie-Paule Lecoutre | Even statisticians are not immune to misinterpretations of Null Hypothesis Significance Tests; Interpretation of significance levels by psychological researchers: The .05 cliff effect may be overstated | Could not find email.
Melissa Coulson | Confidence intervals permit, but do not guarantee, better inference than statistical significance testing | Could not find email. |
Michael Oakes | Statistical Inference; The Statistical Evaluation of Psychological Evidence (unpublished doctoral thesis) | |
Michelle Healey | Confidence intervals permit, but do not guarantee, better inference than statistical significance testing | Could not find email. |
Miguel A García-Pérez | The Interpretation of Scholars' Interpretations of Confidence Intervals: Criticism, Replication, and Extension of Hoekstra et al. | Could not find email. |
Miron Zuckerman | Contemporary Issues in the Analysis of Data: A Survey of 551 Psychologists | |
Nanette Nelson | Interpretation of significance levels and effect sizes by psychological researchers | Could not find email. |
Pav Kalinowski | A Cross-Sectional Analysis of Students’ Intuitions When Interpreting CIs | Cc'd on response from Geoff Cumming; no response received.
Ralph Rosnow | Interpretation of significance levels and effect sizes by psychological researchers | |
Richard D. Morey | Robust misinterpretation of confidence intervals | Contacted about project by Rink Hoekstra, no direct response received. |
Richard May | Replication report: Interpretation of levels of significance by psychological researchers | Could not find email. |
Rink Hoekstra | Robust misinterpretation of confidence intervals; Confidence Intervals Make a Difference: Effects of Showing Confidence Intervals on Inferential Reasoning | Supportive of project, may review in future if time permits.
Robert Rosenthal | The Interpretation of Levels of Significance by Psychological Researchers; Further evidence for the cliff effect in the interpretation of levels of significance; Contemporary Issues in the Analysis of Data: A Survey of 551 Psychologists; Interpretation of significance levels and effect sizes by psychological researchers | |
Rocio Alcala-Quintana | The Interpretation of Scholars' Interpretations of Confidence Intervals: Criticism, Replication, and Extension of Hoekstra et al. | |
Rolf Ulrich | Some Properties of p-Curves, With an Application to Gradual Publication Bias | Email not listed on faculty page; contacted the department administrator but received no response.
Ruma Falk | Significance tests die hard: The amazing persistence of a probabilistic misconception | |
Sarah Belia | Researchers Misunderstand Confidence Intervals and Standard Error Bars | Could not find email. |
Stefan Krauss | Misinterpretations of Significance: A Problem Students Share with Their Teachers? | |
Thomas Scherndl | The significance fallacy in inferential statistics | |
William Dember | The interpretation of levels of significance by psychologists: A replication and extension | Deceased. |
Xi-Nian Zuo | P-Value, Confidence Intervals, and Statistical Inference: A New Dataset of Misinterpretation | Chuan-Peng Hu was contacted as he was listed as the corresponding author.
Xiao-Fan Zhao | P-Value, Confidence Intervals, and Statistical Inference: A New Dataset of Misinterpretation | Chuan-Peng Hu was contacted as he was listed as the corresponding author.
Xiao-Kang Lyu | P-Value, Confidence Intervals, and Statistical Inference: A New Dataset of Misinterpretation | Chuan-Peng Hu was contacted as he was listed as the corresponding author.
Yuepei Xu | P-Value, Confidence Intervals, and Statistical Inference: A New Dataset of Misinterpretation | Chuan-Peng Hu was contacted as he was listed as the corresponding author.
Ziyang Lyu | P-Value, Confidence Intervals, and Statistical Inference: A New Dataset of Misinterpretation | Chuan-Peng Hu was contacted as he was listed as the corresponding author.