HYPOTHESIS TESTING AND CONFIDENCE INTERVAL MISINTERPRETATIONS AMONG MEDICAL RESEARCHERS AND STUDENTS

Article Summary

This article reviews the evidence from direct inquiries into hypothesis testing and parameter estimation misinterpretations among medical researchers and students. A total of 32 articles are reviewed, each falling into one of four categories: null hypothesis significance testing (NHST), confidence intervals (CI), the cliff effect, and dichotomization of evidence. Where possible, meta-analyses are used to combine results.

A detailed table of contents outlines the studies, with links to each organized by the four categories above. The introductory sections further outline this article's purpose and contributions and provide summary results, common themes, and criticisms across the studies.

Article status

This article is being written.

Contact information

We’d love to hear from you! Please direct all inquiries to james@theresear.ch.

Table of contents


This article reviews the evidence from direct inquiries into null hypothesis significance testing (NHST) and confidence interval (CI) knowledge among medical researchers and students. Direct inquiries include surveys, multiple-choice questionnaires, statistical tasks, and other methods of directly interacting with participants to determine knowledge and misunderstanding. As far as we are aware, this is by far the most comprehensive review of this topic available. In particular, this article attempts to make the following contributions:

  • Comprehensive. This article reviews evidence from more than 30 different studies. As additional studies are released or uncovered, the material will be added.

  • Concentration on medicine. Many studies in the literature on NHST and CI knowledge are cross-discipline. These studies therefore offer a good comparison across disciplines of performance on a particular statistical task, survey instrument, or other inquiry method. This article aims for something different, however: “breaking apart” the literature, extracting only research on medicine, and then reassembling it, highlighting common themes and criticisms. This includes re-analyzing open data sources and extracting only the subset applicable to medicine (which might have been subsumed in reporting of the original findings) as well as combining data into meta-analyses where possible. Forthcoming articles will focus on other disciplines.

  • Detailed summaries. The key methodological elements of each study are outlined along with the main findings. Our own criticisms are presented as well as any published responses from other researchers. Of course, no article can be summarized comprehensively without duplicating it outright. Those particularly interested are encouraged to use the provided links to review the original articles.

  • Author input. All authors have been contacted for their input on the summaries of their studies. When applicable their responses have been incorporated into the article.

  • Categorization. Articles are categorized for easy navigation into one of four research areas and seven methodological types. These are incorporated in the table of contents as well as in the article summary below. In addition, a main table is presented at the beginning of the literature review with one-sentence summaries of primary findings along with other important information about the article. Included are links to the original articles themselves.

  • Broader literature. This article focuses on a single type of evidence regarding NHST and confidence interval misinterpretation: direct inquiries of NHST and CI knowledge. However, we have identified three additional types of evidence. All four types are covered briefly, with relevant links where applicable. Statistical evaluations outside of NHST and CIs are also noted when relevant.

  • Meta-analysis. Simple meta-analyses are conducted when possible. The raw data used to calculate them is available in the Supplementary Material section. Charts and tables are also provided to clarify meta-analysis findings.

  • Clear identification of errors. A number of papers reviewed here were published with errors. These errors have been highlighted and corrected in our summaries. This will help interested readers better interpret the original articles if sought out.

  • Translation. Articles that were not originally published in English have been professionally translated and are available for free on our Google Drive. Currently, only one article is included, but others are in the works.

  • Better charts and tables. When reproducing charts, improvements were made to increase readability and clarity. Some charts may appear small, but clicking the chart will expose a larger version in a lightbox. Charts also use a common design language to increase visual continuity within the article. The same is true of tables, which also use highlighting where necessary to draw attention to important information. All tables have extensive notes, including links to the original article.

  • Living document. Because this article is not published in any journal, we have the freedom to continually add new studies as they are discovered or released. It also provides us the opportunity to improve the article over time by clarifying or adding content from those studies already summarized, as well as to correct any errors that are found.

SUMMARY & RESULTS

There are four categories of evidence regarding knowledge and misinterpretations of NHST and CIs by professional researchers and students. The current article focuses on the first category below for the population of medical researchers and students. Psychology has by far the most studies in this area of any academic discipline, but a substantial number examine medical researchers and students.

  1. Direct inquiries of statistical knowledge. Although not without methodological challenges, this work is the most direct method of assessing statistical understanding. The standard procedure is a convenience sample of either students or researchers (or both) in a particular academic discipline, to which a survey instrument, statistical task, or other inquiry method is administered.

  2. Examination of NHST and CI usage in statistics and methodology textbooks. This line of research includes both systematic reviews and casual observations documenting incorrect or incomplete language when NHST or CI is described in published textbooks. In these cases it is unclear if the textbook authors themselves do not fully understand the techniques and procedures or if they were simply imprecise in their writing and editing or otherwise thought it best to omit or simplify the material for pedagogical purposes. For an early draft of our article click here. We will continue to expand this article in the coming months.

  3. Audits of NHST and CI usage in published articles. Similar to reviews of textbooks, these audits include systematic reviews of academic journal articles making use of NHST and CIs. The articles are assessed for correct usage. Audits are typically focused on a particular academic discipline, most commonly reviewing articles in a small number of academic journals over a specified time period. Quantitative metrics are often provided that represent the percentage of reviewed articles that exhibited correct and incorrect usage. Click here for a growing list of relevant papers.

  4. Journal articles citing NHST or CI misinterpretations. A large number of researchers have written articles underscoring the nuances of the procedures and common misinterpretations directed at their own academic discipline. In those cases it is implied that in the experience of the authors and journal editors the specified misinterpretations are common enough in their field that a corrective is warranted. Using a semi-structured search we identified more than 60 such articles, each in a different academic discipline. Click here to read the article.

Category 1, direct inquiries of statistical knowledge, can be further subdivided into four areas:

  1. Null hypothesis significance testing (NHST) misinterpretations. These misinterpretations include p-values, Type I and Type II errors, statistical power, sample size, standard errors, or other components of the NHST framework. Examples of such misinterpretations are interpreting the p-value as the probability of the null hypothesis or the probability of replication.

  2. Confidence interval misinterpretations. For example, interpreting the confidence interval as a probability.

  3. The dichotomization of evidence. A specific NHST misinterpretation in which results are interpreted differently depending on whether the p-value is statistically significant or statistically nonsignificant. Dichotomization of evidence is closely related to the cliff effect (see Item 4 below).

  4. The cliff effect. A specific misuse of NHST in which there is a dramatic drop in confidence in an experimental result based on a small change in the p-value. For example, having relatively high confidence in a result with a p-value of 0.04, but much lower confidence in a result with a p-value of 0.06.

Some studies are mixed, testing a combination of two or more of the subcategories above. A total of 32 studies were found that fall into one of these four subcategories, each of which is covered in detail later in this article.

Authors, year, & title Category Type Country Participants Primary findings
Wulff et al. (1987)

What do doctors know about statistics? [link]
TBD TBD TBD TBD TBD
Scheutz et al. (1988)

What do dentists know about statistics? [link]
TBD TBD TBD TBD TBD
Vallecillos (2000)

Understanding of the Logic of Hypothesis Testing Amongst University Students [link]
TBD TBD TBD Medicine students (n=61) 1. When shown a statement claiming NHST can prove the truth of a hypothesis, 26% of medicine students incorrectly marked the statement as true. Eight medicine students who had correctly answered the statement also provided a correct written explanation of their reasoning.
Cumming, Williams, & Fidler (2004)

Replication and Researchers’ Understanding of Confidence Intervals and Standard Error Bars [link]
CI Interactive Task J.A. Authors of 20 high-impact psychology journals (n=89) 1. 80% of respondents overestimated the probability of a sample mean from a replication falling within the confidence interval of the original sample mean.
2. 70% of respondents overestimated the probability of a sample mean from a replication falling within the standard error interval of the original sample mean.
Belia, Fidler, Williams, & Cumming (2005)

Researchers Misunderstand Confidence Intervals and Standard Error Bars [link]
TBD TBD TBD TBD 1. TBD
Coulson, Healey, Fidler, & Cumming (2010)

Confidence intervals permit, but do not guarantee, better inference than statistical significance testing [link]
TBD TBD TBD TBD 1. TBD
Lai (2010)

Dichotomous Thinking: A Problem Beyond NHST [link]
Cliff effect Confidence Elicitation J.A. Authors in major psychology and medical journals (n=258) 1. 21% of respondents demonstrated a cliff effect.
Lai, Fidler, & Cumming (2012)

Subjective p Intervals: Researchers Underestimate the Variability of p Values Over Replication [link]
TBD TBD TBD TBD 1. TBD
McShane & Gal (2015)

Supplement to “Blinding Us to the Obvious? The Effect of Statistical Training on the Evaluation of Evidence” [link]
TBD TBD TBD TBD 1. A cliff effect was found between p-values of 0.025 and 0.075.
Castro Sotos et al. (2017)

How Confident are Students in their Misconceptions about Hypothesis Tests? [link]
TBD TBD TBD TBD 1. TBD
Kalinowski, Jerry, & Cumming (2018)

A Cross-Sectional Analysis of Students’ Intuitions When Interpreting CIs [link]
CI Interactive Task Australia Students, various disciplines but 66% were psychology (n=101) 1. 74% of students had at least one CI misconception in a set of three tasks.
Lyu et al. (2018)

P-Value, Confidence Intervals, and Statistical Inference: A New Dataset of Misinterpretation [link]
TBD TBD TBD Psychology students and researchers in one of four medically related subfields (n=137) 1. 100% of undergraduate students demonstrated at least one NHST misinterpretation.
2. 97% of masters students demonstrated at least one NHST misinterpretation.
3. 100% of PhD students demonstrated at least one NHST misinterpretation.
4. 100% of subjects with a PhD demonstrated at least one NHST misinterpretation.
Lyu et al. (2020)

Beyond psychology: prevalence of p value and confidence interval misinterpretation across different fields [link]
TBD TBD TBD Medical undergraduate students (n=19)
Medical masters students (n=69)
Medical PhD students (n=24)
Medical with a PhD (n=18)
1. 79% of undergraduate students demonstrated at least one NHST misinterpretation.
2. 96% of masters students demonstrated at least one NHST misinterpretation.
3. 96% of PhD students demonstrated at least one NHST misinterpretation.
4. 94% of subjects with a PhD demonstrated at least one NHST misinterpretation.
Helske et al. (2020)

Are You Sure You’re Sure? - Effects of Visual Representation on the Cliff Effect in Statistical Inference [link]
TBD TBD TBD Medical practitioners and researchers (n=16) 1. A cliff effect was found between p-values of 0.04 and 0.06.
Studies directly surveying the NHST, p-value, and confidence interval knowledge of medical researchers and students

NHST Review

In 1987 Henrik Wulff, Björn Andersen, Preben Brandenhoff, and Flemming Guttler surveyed Danish doctors and medical students. Initially 250 Danish doctors were randomly sampled from a list of all Danish doctors and sent the survey instrument described below; in the end 148 responded. In addition, 97 medical students in an introductory biostatistics course were given the survey.

The survey was 11 questions long, with the first question asking about the subject’s self-evaluated statistical knowledge and the last question asking about the subject’s perception of the survey’s usefulness. The other nine questions were statistical in nature. For the population of doctors no respondent had more than seven correct answers. The median number of correct answers was just 2.4. Even those doctors who selected Option A for Question 1, “I understand all the [statistical] expressions” scored just 4.1 out of nine.

Perhaps surprisingly medical students scored better than the doctors; the median number of correct answers for students was 4.0. Two students answered all nine questions correctly. The distribution of correct answers by doctors and students is shown at right.

For doctors, there were four questions for which the correct answer was also the most selected response. Those were Questions 4, 5, 7, and 9. Students also had four such questions: Questions 2, 4, 7, and 8. For Questions 4 and 7 a plurality of both populations selected the correct answer. Question 4 asked about using the standard error to create a 95% confidence interval, while Question 7 asked about the correct interpretation of a p-value. Question 7, in fact, had the highest correct response rate for students, with 67% answering correctly. For doctors it was Question 9, which asked about interpreting “statistical significance,” with 41% answering correctly. This means that even for the most correctly answered question 6 out of 10 doctors still answered incorrectly.

Question 3 had the fewest doctors selecting the correct response, 8%; for students Question 5 had the fewest correct responses, 27%. Question 3 asked about the correct usage of the standard deviation; Question 5 asked about the correct usage of the standard error.

On a per-question basis doctors selected the correct response 28.7% of the time on average. Students fared much better, at 41.9%. Comparison with a 1988 study replication involving dentists and dental students is discussed below in the section on Scheutz, Andersen, and Wulff (1988).

The survey instrument is shown below along with the corresponding percentage of doctors and students selecting each statement response. Correct answers are highlighted in green. A discussion of each statement and its correct answer is provided below the table.

Statements & responses Percent of doctors selecting statement Percent of students selecting statement
1. Which of the following statements reflects your attitude to the most common statistical expressions in medical literature, such as SD, SE, p-values, confidence limits and correlation coefficients?
a. I understand all the expressions. 20% 10%
b. I understand some of the expressions. 35% 51%
c. I have a rough idea of the meaning of these expressions. 22% 32%
d. I know vaguely what it is all about, but not more. 17% 7%
e. I do not understand the expressions. 6% 0%
2. In a medical paper 150 patients were characterized as ‘Age 26 years ± 5 years (mean ± standard deviation)’. Which of the following statements is the most correct?
a. It is 95 per cent certain that the true mean lies within the interval 16-36 years. 26% 30%
b. Most of the patients were aged 26 years; the remainder were aged between 21 and 31 years. 38% 13%
c. Approximately 95 per cent of the patients were aged between 16 and 36 years. 30% 51%
d. I do not understand the expression and do not want to guess. 6% 6%
3. A standard deviation has something to do with the so-called normal distribution and must be interpreted with caution. Which statement is the most correct?
a. My interpretation assumes a normal distribution. However, biological data are rarely distributed normally, for which reason expressions of this kind usually elude interpretation. 8% 29%
b. My interpretation presupposes a normal distribution, but in practice this assumption is fulfilled in biological research. 23% 10%
c. My interpretation presupposes a normal distribution, but this assumption is fulfilled when the number of patients is as large as 150. 37% 23%
d. Such expressions are used only when research workers have assured themselves that the assumption is fulfilled. 20% 35%
e. I know nothing about the normal distribution and do not want to guess. 12% 3%
4. A pharmacokinetic investigation, including 216 volunteers, revealed that the plasma concentration one hour after oral administration of 10 mg of the drug was 188 ng/ml ± 10 ng/ml (mean ± standard error). Which of the following statements do you prefer?
a. Ninety-five per cent of the volunteers had plasma concentrations between 168 and 208 ng/ml. 27% 28%
b. The interval from 168 to 208 ng/ml is the normal range of the plasma concentration 1 hour after oral administration. 20% 7%
c. We are 95 per cent confident that the true mean lies somewhere within the interval 168 to 208 ng/ml. 39% 55%
d. I do not understand the expression and do not wish to guess. 14% 10%
5. A standard error has something to do with the so-called normal distribution and must be interpreted with caution. Which statement is the most correct?
a. My interpretation presupposes a normal distribution. However, biological data are rarely distributed normally, and this is why expressions of this kind cannot usually be interpreted sensibly. 5% 15%
b. My interpretation presupposes a normal distribution, but in practice this assumption is fulfilled in biological research. 20% 14%
c. My interpretation presupposes a normal distribution, but this assumption is fulfilled when the number of patients is so large. 38% 27%
d. Such expressions are only used when research workers have assured themselves that the assumption is fulfilled. 19% 35%
e. I know nothing about the normal distribution and do not want to guess. 18% 9%
6. A controlled trial of a new treatment led to the conclusion that it is significantly better than placebo: p < 0.05. Which of the following statements do you prefer?
a. It has been proved that the treatment is better than placebo. 20% 6%
b. If the treatment is not effective, there is less than a 5 per cent chance of obtaining such results. 13% 39%
c. The observed effect of the treatment is so large that there is less than a 5 per cent chance that the treatment is no better than placebo. 51% 54%
d. I do not really know what a p-value is and do not want to guess. 16% 1%
7. A research team wishes to examine whether or not the ingestion of licorice decreases the plasma concentration of magnesium. Twenty-three volunteers ingest a considerable amount of licorice, and no significant change in the serum magnesium is found (p > 0.05). Which of the following statements do you prefer?
a. There is more than a 5 per cent chance of obtaining such results if licorice does not decrease the serum magnesium. 39% 67%
b. There is only a small probability of obtaining such results if licorice does decrease the serum magnesium. 29% 26%
c. The research workers ought to have studied more volunteers, then the difference would have become significant. 13% 5%
d. I do not know what p-values are and do not want to guess. 18% 2%
8. A new drug was tested independently in two randomized controlled trials. The trials appeared comparable and comprised the same number of patients. One trial led to the conclusion that the drug was effective (p < 0.05), whereas the other trial led to the conclusion that the drug was ineffective (p > 0.05). The actual p-values were 0.041 and 0.097. Which of the following interpretations do you prefer?
a. The first trial gave a false-positive result. 2% 6%
b. The second trial gave a false-negative result. 3% 3%
c. Obviously, the trials were not comparable after all. 41% 34%
d. One must not attach too much importance to small differences between p-values. 34% 43%
e. I do not understand the problem and do not wish to guess. 20% 14%
9. Patients with ischaemic heart disease and healthy subjects are compared in a population survey of 20 environmental factors. A statistically significant association is found between ischaemic heart disease and one of those factors. Which of the following interpretations do you prefer?
a. The association is true as it is statistically significant. 2% 6%
b. This is no doubt a false-positive result. 3% 3%
c. The result is not conclusive but might inspire a new investigation of this particular problem. 41% 34%
d. I do not understand the question and do not wish to guess. 34% 43%
10. In a methodologically impeccable investigation of the correlation between the plasma concentration and the effect of a drug it is concluded that r = + 0.41, p < 0.001, N = 83. Which of the following answers do you prefer?
a. There is a strong correlation between concentration and effect. 22% 17%
b. There is only a weak correlation between concentration and effect. 16% 32%
c. I am not able to interpret the expressions and do not wish to guess. 62% 51%
11. What is your opinion of this survey?
a. It is very important that this problem is raised. 65% 80%
b. I do not think that the problem is very important, but it may be reasonable to take it up. 27% 9%
c. The problem is unimportant and the survey is largely a waste of time. 8% 11%
Medical doctors' and students' responses to basic statistical questions
Wulff et al. (1987)

Table notes:
1. Correct answers highlighted in green.
2. Reference: "What do doctors know about statistics", Henrik Wulff, Björn Andersen, Preben Brandenhoff, and Flemming Guttler, Statistics in Medicine, 1987 [link]

Question 1 asked about self-reported understanding of a set of statistical concepts. The most frequent response was that the respondent understood “some” of the statistical terms.

Question 2 asked about the correct interpretation of patient age characterized by “Age 26 years ± 5 years (mean ± standard deviation).” The correct response was Option C, “Approximately 95 per cent of the patients were aged between 16 and 36 years,” which 30% of doctors and 51% of students selected. However, as outlined in the Question 3 summary below, to interpret standard deviations in this way the population data must be normally distributed. Somewhat curiously, age is not an attribute that typically follows a normal distribution, and it is unclear why age was chosen for this question. Option B was the most selected for doctors at 38%. It stated that, “Most of the patients were aged 26 years; the remainder were aged between 21 and 31 years.” This is incorrect: 26 is simply the mean age of the observed patients, and it is unclear from the given context how many were exactly that age. The age interval given in Option B, 21 to 31 years, would correspond to the 1-sigma rule, meaning about 68% of patients are within this range (again, assuming age is normally distributed).
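To make the 1- and 2-sigma arithmetic above concrete, here is a minimal sketch (ours, not part of the survey) that computes the coverage implied by the Question 2 scenario, assuming ages really were normally distributed with mean 26 and standard deviation 5:

```python
# Minimal sketch (not from the original study): coverage implied by the
# 1-sigma and 2-sigma rules for a normal distribution with mean 26 and SD 5.
from scipy.stats import norm

mean, sd = 26, 5
one_sigma = norm.cdf(31, mean, sd) - norm.cdf(21, mean, sd)  # ages 21-31
two_sigma = norm.cdf(36, mean, sd) - norm.cdf(16, mean, sd)  # ages 16-36
print(f"Within ±1 SD (21-31 years): {one_sigma:.2f}")  # ~0.68
print(f"Within ±2 SD (16-36 years): {two_sigma:.2f}")  # ~0.95
```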

Question 3 asks about the standard deviation, noting that it has something to do with the “normal distribution.” In fact, the standard deviation can be calculated regardless of the distribution of the data. However, when the data come from a normal distribution, the interpretation is tractable, following well-known statistical properties (for example the proportion of the data falling between various standard deviation intervals). For this reason the authors’ preferred response is Option A, “My interpretation assumes a normal distribution. However, biological data are rarely distributed normally, for which reason expressions of this kind usually elude interpretation.” Just 8% of doctors responded in this way. However, 29% of students selected the correct answer. It’s possible other answers could make sense with certain assumptions. For instance, Option B: “My interpretation presupposes a normal distribution, but in practice this assumption is fulfilled in biological research.” If the doctors who were surveyed are specialists, it may be that their area of expertise is one for which the normal distribution applies, for example research on height or weight. However, if the doctors who were surveyed are general practitioners, and therefore are called upon to broadly interpret research across aspects of the human body, then it may be true that the majority of studies cannot assume a normal distribution. While we are not medical specialists, Wulff et al. are; in their explanation of Question 3 they simply note that, “…biological phenomena are rarely distributed [normally]….” For this reason Option A does seem like the most correct of those provided. The most selected answer for doctors, Option C with 37%, was incorrect. Option C stated that “My interpretation presupposes a normal distribution, but this assumption is fulfilled when the number of patients is as large as 150.” This is a confusion of the application of the Central Limit Theorem, which holds for measures of the sampling distribution of the sample mean, for example the standard error, but not for measures of population dispersion such as the standard deviation.

Question 4 asked about the correct interpretation of plasma concentration characterized by a concentration of “188 ng/ml ± 10 ng/ml (mean ± standard error)” one hour after oral administration of a drug. The correct answer was Option C, “We are 95 per cent confident that the true mean lies somewhere within the interval 168 to 208 ng/ml,” which was selected by 39% of doctors and 55% of students. The standard error is the standard deviation of the sampling distribution, usually applied to the so-called “sampling distribution of the sample mean,” in which case the standard error is called “the standard error of the mean.” The standard error of the mean is used to construct the familiar 95% confidence interval using the 2-sigma rule: 95% of sample means fall within two standard errors of the mean. The 2-sigma rule can be used because the sampling distribution of the sample mean follows a normal distribution due to the Central Limit Theorem. The standard deviation from Question 2 above asks the question, “How spread out are the ages of the patients I surveyed?” The standard error for Question 4 asks the question, “How spread out are my estimates of the sample mean (used to estimate the population mean)?” This distinction is why Option A is incorrect, “Ninety-five per cent of the volunteers had plasma concentrations between 168 and 208 ng/ml.” Option A refers to the standard deviation. However, more than a quarter of respondents in both populations selected Option A.
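A short sketch of the arithmetic (our illustration, not from the study), using the Question 4 numbers and assuming the reported mean and standard error are exact:

```python
import math

# Reported summary: mean ± standard error = 188 ± 10 ng/ml, n = 216 volunteers.
mean, se, n = 188, 10, 216

# The 2-sigma rule applied to the standard error gives the approximate 95% CI
# for the population mean (not a range covering 95% of individual volunteers).
ci_low, ci_high = mean - 2 * se, mean + 2 * se
print(f"Approximate 95% CI for the mean: ({ci_low}, {ci_high}) ng/ml")  # (168, 208)

# Because se = sd / sqrt(n), the implied spread of individual measurements is
# much larger than the interval above, which is why Option A is wrong.
implied_sd = se * math.sqrt(n)
print(f"Implied sample SD: {implied_sd:.0f} ng/ml")  # ~147
```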

Question 5 mirrors Question 3, but asks about the standard error, noting that it has something to do with the “normal distribution.” The correct response was Option C, “My interpretation presupposes a normal distribution, but this assumption is fulfilled when the number of patients is so large.” Option C was selected by 38% of doctors, the modal option for this question, and 27% of students. Option A is incorrect because, as stated in the Question 4 summary, the normality condition holds for sampling distributions due to the Central Limit Theorem. One caveat is that the sample size needs to be sufficiently large, the condition met in Option C. The usual rule of thumb is a sample size of 30. Option B, “My interpretation presupposes a normal distribution, but in practice this assumption is fulfilled in biological research,” is incorrect because biological research may include studies in which the sample size is too small to guarantee the Central Limit Theorem applies. Option D was selected by the plurality of students, 35%. Option D read, “Such expressions are only used when research workers have assured themselves that the assumption is fulfilled.” What “the assumption” means is not specified and it is therefore hard to judge this statement as incorrect: how are we to know if a subject interpreted “the assumption” to mean the sample size assumption? Still, one could argue that in comparison to the explicit statement about the sample size assumption in Option C, Option D is less correct.
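The sample-size caveat can be illustrated with a quick simulation (ours; the skewed exponential population is assumed purely for illustration): the sampling distribution of the mean becomes approximately normal as n grows, even though the population itself is far from normal.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
population = rng.exponential(scale=1.0, size=1_000_000)  # heavily right-skewed

for n in (3, 30, 300):
    sample_means = rng.choice(population, size=(10_000, n)).mean(axis=1)
    print(f"n = {n:3d}: skewness of sample means = {skew(sample_means):.2f}")
# Skewness shrinks toward 0 (i.e., toward normality) as n increases,
# which is the Central Limit Theorem at work.
```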

Question 6 regarded p-values, “A controlled trial of a new treatment led to the conclusion that it is significantly better than placebo: p < 0.05. Which of the following statements do you prefer?” The correct answer was Option B, “If the treatment is not effective, there is less than a 5 per cent chance of obtaining such results.” However, this was selected by just 13% of doctors and 39% of students. Option B is a restatement of the p-value definition, which can also be written as, “The probability of obtaining the observed data or more extreme values assuming the null hypothesis is true.” Option A was selected by 20% of doctors and read, “It has been proved that the treatment is better than placebo.” In a question similar to Option A, Vallecillos (2000) asked 61 medicine students to judge the following statement as true or false: “A statistical hypotheses test, when properly performed, establishes the truth of one of the two null or alternative hypotheses.” The majority of students, 69%, correctly answered that the statement was false. Option C was selected by the majority of students, 54%. It read, “The observed effect of the treatment is so large that there is less than a 5 per cent chance that the treatment is no better than placebo.” This is a version of the Effect Size Fallacy: the p-value does not directly measure the size of the treatment effect.
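Read literally, Option B is just the p-value calculation. The sketch below (ours, with an arbitrary illustrative test statistic) shows what is and is not being computed:

```python
from scipy.stats import norm

# Hypothetical two-sided z test; z_observed is an assumption for illustration.
z_observed = 1.96
p_value = 2 * norm.sf(abs(z_observed))  # P(data at least this extreme | H0 true)
print(f"p = {p_value:.3f}")             # 0.050

# Note what is NOT computed: the probability that the treatment is (or is not)
# better than placebo, nor the size of the treatment effect.
```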

Question 7 was similar to Question 6, but asked about a p-value greater than 0.05 obtained in a study investigating the change in plasma magnesium concentration due to licorice ingestion. Option A was correct, “There is more than a 5 per cent chance of obtaining such results if licorice does not decrease the serum magnesium.” This answer was the most selected by doctors, 39%, and students, 67%. Again, the correct answer follows from the definition of the p-value.

Question 8 regarded two randomized trials of a new drug in which one was statistically significant and the other statistically nonsignificant. The actual p-values were 0.041 and 0.097. The correct answer was Option D, “One must not attach too much importance to small differences between p-values.” This option was selected by 34% of doctors and 43% of students. The Research’s article on p-value replicability demonstrates that p-values have substantial natural variation regardless of the truth or falsity of the null hypothesis. Option C — “Obviously, the trials were not comparable after all” — was selected by 41% of doctors. This is incorrect, however. The trials may or may not be comparable given their underlying methodologies, but the comparability of the two trials cannot be determined from their p-values.
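A small simulation (ours; the true effect size and per-group n are assumptions chosen for illustration) shows how easily two identically designed trials of a genuinely effective drug can land on opposite sides of the 0.05 line:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n, true_effect = 40, 0.4               # per-group n and true standardized effect (assumed)
se = np.sqrt(2 / n)                    # standard error of the difference in means

# Simulate many pairs of identical trials and their two-sided p-values.
z = rng.normal(true_effect, se, size=(100_000, 2)) / se
p = 2 * norm.sf(np.abs(z))
discordant = np.mean((p.min(axis=1) < 0.05) & (p.max(axis=1) >= 0.05))
print(f"P(one trial significant, the other not): {discordant:.2f}")  # roughly 0.5 here
```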

Question 9 attempted to determine the impact of multiple comparisons. It stated that “A statistically significant association is found between ischaemic heart disease and one of those [20 environmental] factors [that was tested].” Given 20 independent hypothesis tests — one for each environmental factor — on average one will obtain statistical significance by chance alone even if the null hypothesis is true for all 20. Therefore, the correct response is Option C, “The result is not conclusive but might inspire a new investigation of this particular problem.” Option C was selected by 41% of doctors and 34% of students. Along with Question 10, Question 9 had the largest proportion of students respond that, “I do not understand the question and do not wish to guess,” at 43%.
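The multiple-comparisons arithmetic is worth spelling out (a simple calculation of ours, assuming 20 independent tests at the 0.05 level with all null hypotheses true):

```python
# With 20 independent tests at alpha = 0.05 and every null hypothesis true:
alpha, k = 0.05, 20
p_at_least_one = 1 - (1 - alpha) ** k   # chance of at least one "significant" result
expected_hits = alpha * k               # expected number of false positives
print(f"P(at least one significant result) = {p_at_least_one:.2f}")  # ~0.64
print(f"Expected false positives = {expected_hits:.0f}")             # 1
```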

Question 10 concerned the interpretation of a correlation coefficient. It read: “In a methodologically impeccable investigation of the correlation between the plasma concentration and the effect of a drug it is concluded that r = + 0.41, p < 0.001, N = 83.” The majority of both populations, 62% of doctors and 51% of students, indicated that, “I am not able to interpret the expressions and do not wish to guess.” This is the highest proportion of respondents selecting that answer for any question. Option A indicated that the correlation was strong, “There is a strong correlation between concentration and effect.” Option A was selected by 22% of doctors and 17% of students. However, using the standard r-squared metric results in a value of 0.17, typically considered small. In a bivariate correlation r-squared is simply the square of ‘r’, the correlation coefficient, which was given as 0.41 in this scenario. The interpretation of r-squared is that 17% of the variation in plasma concentration can be explained by the effect of the drug. Some assumptions are needed for that interpretation to be true, which is likely why the question includes the statement, “In a methodologically impeccable investigation…” The correct answer is therefore Option B, “There is only a weak correlation between concentration and effect,” which was selected by 16% of doctors and 32% of students.
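The r-squared arithmetic behind this judgment is a one-liner (our illustration):

```python
r = 0.41
r_squared = r ** 2
print(f"r^2 = {r_squared:.2f}")  # 0.17, i.e., ~17% of variance explained
```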

Question 11 asked respondents’ opinion of the survey. Most respondents believed the project of statistical evaluation and education had some merit. Just 8% of doctors and 11% of students felt the survey was a waste of time.

The authors conclude that the lack of basic statistical knowledge of the respondents indicates a serious problem for the medical profession as doctors are required to interpret new and existing medical research in order to best provide care to patients.


In 1988 Wulff et al. (1987) was replicated using a group of dentists and dental students. Two of the authors were the same as in the 1987 study of Danish doctors — Henrik Wulff and Björn Andersen — and one was new, Flemming Scheutz, who at the time was associated with the Royal Dental College. Initially 250 Danish dentists were randomly sampled from a list of all Danish dentists and sent the survey instrument described below; in the end 125 responded, fewer than the 148 doctors who responded in the original survey. In addition, 27 dental students in an introductory statistics course were given the survey, a substantially smaller sample than the 97 medical students who participated in the study of doctors.

The survey used was identical to that in the 1987 study of doctors. For the population of dentists no respondent had more than six correct answers. The median number of correct answers was just 2.2, even lower than the 2.4 median correct responses of doctors. Even those dentists who selected Option A for Question 1, “I understand all the [statistical] expressions,” scored just 4.5 out of nine. Students scored better than dentists; the median number of correct answers was 3.4, which was lower than the 4.0 of medical students from the 1987 study. No student answered more than six questions correctly (recall that of the 97 medical students, two answered all nine questions correctly). The distribution of correct answers by dentists and dental students is shown at right.

For dentists, there was just one question for which the correct answer was also the most selected response, Question 7. Question 7 asked about the correct interpretation of a p-value. Still, only 32% of dentists answered correctly. In fact, more dentists answered Question 9 correctly, 39%, but this correct response was not the most selected; 46% of dentists incorrectly selected Option A (the correct answer was Option C).

Dental students, on the other hand, had four questions for which the correct answer was also the most selected response. Those were Questions 2, 4, 5, and 9. Of these, Question 2 had the highest correct response rate, 59%. Question 2 asked about the correct interpretation of the standard deviation. Questions 4 and 5 asked about the correct interpretation of the standard error. Question 9 asked about the multiple comparison problem when using statistical significance.

Compared to doctors, dentists performed more poorly: for doctors there were four questions for which the correct answer was also the most selected response. Dental students were roughly comparable to medical students; both groups had four questions for which the correct answer was also the most selected response. The highest correct response rate for medical students was 67% (Question 7); for dental students it was 59% (Question 2).

Question 10 had the fewest dentists selecting the correct response, 6%. Questions 3 and 6 were a close second and third, with just 7% and 8% of dentists, respectively, selecting the correct response. Question 3 asked about the correct interpretation of the standard deviation, Question 6 asked about the correct interpretation of the p-value, and Question 10 asked about the magnitude of a correlation effect. For students, it was Question 3 with 4%. The next lowest correct response rate for students was Question 8, with 22%.

On a per-question basis dentists selected the correct response 22.4% of the time on average, while dental students fared better at 33.3%. When ranking correct response rates of the four populations included in both the 1987 and 1988 studies, medical students were the best (41.9%), followed by dental students (33.3%), doctors (28.7%), and dentists (22.4%).

The survey instrument is shown below along with the corresponding percentage of dentists and dental students selecting each statement response. Correct answers are highlighted in green. For a full explanation of the correct and incorrect answers for each question see the section above discussing the results for doctors from Wulff et al. (1987).

Statements & responses Percent of dentists selecting statement Percent of students selecting statement
1. Which of the following statements reflects your attitude to the most common statistical expressions in medical literature, such as SD, SE, p-values, confidence limits and correlation coefficients?
a. I understand all the expressions. 6% 4%
b. I understand some of the expressions. 26% 41%
c. I have a rough idea of the meaning of these expressions. 23% 26%
d. I know vaguely what it is all about, but not more. 31% 30%
e. I do not understand the expressions. 14% 0%
2. In a medical paper 150 patients were characterized as ‘Age 26 yr ± 5 yr (mean ± standard deviation)’. Which of the following statements is the most correct?
a. It is 95% certain that the true mean lies within the interval 16-36 years. 13% 7%
b. Most of the patients were aged 26 yr; the remainder were aged between 21 and 31 yr. 41% 19%
c. Approximately 95% of the patients were aged between 16 and 36 yr. 34% 59%
d. I do not understand the expression and do not want to guess. 12% 15%
3. A standard deviation has something to do with the so-called normal distribution and must be interpreted with caution. Which statement is the most correct?
a. My interpretation assumes a normal distribution. However, biological data are rarely distributed normally, for which reason expressions of this kind usually elude interpretation. 7% 4%
b. My interpretation presupposes a normal distribution, but in practice this assumption is fulfilled in biological research. 14% 19%
c. My interpretation presupposes a normal distribution, but this assumption is fulfilled when the number of patients is as large as 150. 28% 37%
d. Such expressions are used only when research workers have assured themselves that the assumption is fulfilled. 29% 19%
e. I know nothing about the normal distribution and do not want to guess. 22% 22%
4. A pharmacokinetic investigation, including 216 volunteers, revealed that the plasma concentration 1 hour after oral administration of 10 mg of the drug was 188 ng/ml ± 10 ng/ml (mean ± standard error). Which of the following statements do you prefer?
a. Ninety-five percent of the volunteers had plasma concentrations between 168 and 208 ng/ml. 21% 15%
b. The interval from 168 to 208 ng/ml is the normal range of the plasma concentration 1 hour after oral administration. 22% 19%
c. We are 95% confident that the true mean lies somewhere within the interval 168 to 208 ng/ml. 24% 41%
d. I do not understand the expression and do not wish to guess. 34% 26%
5. A standard error has something to do with the so-called normal distribution and must be interpreted with caution. Which statement is the most correct?
a. My interpretation presupposes a normal distribution. However, biological data are rarely distributed normally, and this is why expressions of this kind cannot usually be interpreted sensibly. 7% 7%
b. My interpretation presupposes a normal distribution, but in practice this assumption is fulfilled in biological research. 10% 11%
c. My interpretation presupposes a normal distribution, but this assumption is fulfilled when the number of patients is so large. 26% 41%
d. Such expressions are only used when research workers have assured themselves that the assumption is fulfilled. 21% 22%
e. I know nothing about the normal distribution and do not want to guess. 35% 19%
6. A controlled trial of a new treatment led to the conclusion that it is significantly better than placebo: p < 0.05. Which of the following statements do you prefer?
a. It has been proved that the treatment is better than placebo. 18% 4%
b. If the treatment is not effective, there is less than a 5% chance of obtaining such results. 8% 33%
c. The observed effect of the treatment is so large that there is less than a 5% chance that the treatment is no better than placebo. 50% 56%
d. I do not really know what a p-value is and do not want to guess. 25% 7%
7. A research team wishes to examine whether or not the ingestion of licorice decreases the plasma concentration of magnesium. Twenty-three volunteers ingest a considerable amount of licorice, and no significant change in the serum magnesium is found (p > 0.05). Which of the following statements do you prefer?
a. There is more than a 5% chance of obtaining such results if licorice does not decrease the serum magnesium. 32% 30%
b. There is only a small probability of obtaining such results if licorice does decrease the serum magnesium. 24% 48%
c. The research workers ought to have studied more volunteers, then the difference would have become significant. 16% 19%
d. I do not know what p-values are and do not want to guess. 28% 4%
8. A new drug was tested independently in two randomized controlled trials. The trials appeared comparable and comprised the same number of patients. One trial led to the conclusion that the drug was effective (p < 0.05), whereas the other trial led to the conclusion that the drug was ineffective (p > 0.05). The actual p-values were 0.041 and 0.097. Which of the following interpretations do you prefer?
a. The first trial gave a false-positive result. 6% 4%
b. The second trial gave a false-negative result. 1% 4%
c. Obviously, the trials were not comparable after all. 34% 52%
d. One must not attach too much importance to small differences between p-values. 26% 22%
e. I do not understand the problem and do not wish to guess. 34% 19%
9. Patients with ischaemic heart disease and healthy subjects are compared in a population survey of 20 environmental factors. A statistically significant association is found between ischaemic heart disease and one of these factors. Which of the following interpretations do you prefer?
a. The association is true as it is statistically significant. 46% 11%
b. This is no doubt a false-positive result. 1% 7%
c. The result is not conclusive but might inspire a new investigation of this particular problem. 39% 44%
d. I do not understand the question and do not wish to guess. 14% 37%
10. In a methodologically impeccable investigation of the correlation between the plasma concentration and the effect of a drug it is concluded that r = + 0.41, p < 0.001, n = 83. Which of the following answers do you prefer?
a. There is a strong correlation between concentration and effect. 21% 26%
b. There is only a weak correlation between concentration and effect. 6% 26%
c. I am not able to interpret the expressions and do not wish to guess. 73% 48%
11. What is your opinion of this survey?
a. It is very important that this problem is raised. 35% 45%
b. I do not think that the problem is very important, but it may be reasonable to take it up. 52% 41%
c. The problem is unimportant and the survey is largely a waste of time. 13% 14%
Dentists' and dental students' responses to basic statistical questions
Scheutz, Andersen, & Wulff (1988)

Table notes:
1. Correct answers highlighted in green.
2. Reference: "What do doctors know about statistics", Flemming Guttler, Björn Andersen, and Henrik Wulff, Scandinavian Journal of Dental Research, 1988 [link]


During the 1991-1992 academic year researcher Angustias Vallecillos asked 436 university students across seven different academic specializations to respond to a simple NHST statement. This survey included 61 students in the field of medicine. It is unclear how many universities were included and where they were located. The results were written up in Vallecillos’ 1994 Spanish-language paper, “Estudio teorico-experimental de errores y concepciones sobre el contraste estadistico de hipotesis en estudiantes universitarios.” The results appeared again in her 2000 English-language article, “Understanding of the Logic of Hypothesis Testing Amongst University Students.” What is presented here is from her 2000 work.

Vallecillos’ statement was a short sentence asking about the ability of the NHST procedure to prove either the null or alternative hypotheses:

A statistical hypotheses test, when properly performed, establishes the truth of one of the two null or alternative hypotheses.

University speciality Sample size Correct answer Incorrect answer
Medicine 61 69% 26%
Percentage of medicine students responding to a statement claiming NHST can determine the truth of a hypothesis
Vallecillos (2000)

Table notes:
1. The exact number of respondents coded under each category were as follows: true - 16, false - 42, blank - 3 (4.9%).
2. Reference: "Understanding of the Logic of Hypothesis Testing Amongst University Students", Augustias Vallecillos, Journal für Mathematik-Didaktik, 2000 [link]

Students were asked to answer either “true” or “false” and to explain their answer (although an explanation was not required). The correct answer to the statement is false because NHST only measures the compatibility between the observed data and the null hypothesis; it cannot prove either hypothesis true. In addition, the alternative hypothesis is not explicitly considered in the NHST model, nor is the compatibility of the null hypothesis considered relative to the alternative.

The quantitative results for the medicine students are shown in the table above. Correct and incorrect answers do not add up to 100% because some students left the response blank. Vallecillos includes the percentage of blank responses in her presentation (we have omitted those figures from the table above for clarity). It is unclear why blank responses were included in Vallecillos’ table instead of treating blanks as non-responses and omitting them completely from the response calculation. It may be that some subjects did not give a true/false response but did give a written response; however, this is not explicitly stated.

Medicine students had the second highest proportion of correct responses, 69%, just behind mathematics students at 71%. As described in more detail below, medicine students also had the largest proportion of correct written explanations.

Vallecillos coded the written explanation of student answers into one of six categories:

  1. Correct argument (C) - These responses are considered to be completely correct.

    • Example response: “The hypotheses test is based on inferring properties from the population based on some sample data. The result means that one of the two hypotheses is accepted, but does not mean that it is true.”

  2. Partially correct argument (PC) - These responses are considered to be partially, but not completely correct because “answers analysed include other considerations regarding the way of taking the decision, which are not always correct.”

    • Example response: “What it does establish is the acceptance or rejection of one of the two hypotheses.”

  3. Mistaken argument that NHST establishes the truth of a hypothesis (M1) - These responses include explanations about why the initial statement proposed by Vallecillos was true.

    • Example response: “Because before posing the problem, we have to establish which one is the null hypothesis H0 and the alternative H1 and one of the two has to be true.”

  4. Mistaken argument that hypothesis testing establishes the probability of the hypotheses (M2) - This argument is another case of the Inverse Probability Fallacy. As the results of other studies summarized in this article show it is quite common among both students and professionals.

    • Example response: “What is established is the probability, with a margin of error, that one of the hypotheses is true.”

  5. Other mistaken arguments (M3) - This category includes all other mistaken arguments that do not fall into either M1 or M2.

    • Example response: “What it establishes is the possibility that the answer formed is the correct one.”

  6. Arguments that are difficult to interpret (DI) - These arguments were either not interpretable or did not address the subject’s reasoning behind answering the statement.

    • Example response: “The statistical hypotheses test is conditioned by the size of the sample and the level of significance.”

Not all of the respondents to the statement gave a written explanation: 54 of the 61 medicine students (88%) did. Summary results are shown in the table below. Percentages are out of the number who gave written explanations, not out of the number who responded to the original statement. Medicine students had the largest proportion of correct written explanations, both in percentage terms (15%) and in absolute terms (8 students). However, medicine students had the second lowest proportion of partially correct explanations. Still, the combined proportion of correct and partially correct written explanations was about 33%, the third largest proportion after mathematics and business students.

Mistake M1 was the most common of the three mistake categories. Medicine students had 15% of written explanations categorized as DI, or “difficult to interpret,” the second largest proportion, just behind business students.

[compare to other types of misinterpretations]

Vallecillos notes that when considering the full sample of all 436 students, 9.7% of those who correctly answered the statement also provided a correct written explanation and 31.9% of the students who correctly answered the statement gave a partially correct written explanation. This means that across the full sample about 60% of students who correctly answered the statement did so for incorrect reasons or were not able to clearly articulate their reasoning.

University speciality Number of subjects who provided written explanations C PC M1 M2 M3 DI
Medicine 54 15% 19% 28% 17% 7% 15%
Percentage of medicine students whose written explanation falls into one of six categories
Vallecillos (2000)

Table notes:
1. Percentages have been rounded for clarity and may not add to 100%.
2. Key: C - "Correct", PC - "Partially correct", M1 - "Mistake 1", M2 - "Mistake 2", M3 - "Mistake 3", DI - "Difficult to interpret". See full explanations in the description above the table.
3. The exact number of respondents coded under each category were as follows: C - 8, PC - 10, M1 - 15, M2 - 9, M3 - 4, DI - 8.
4. Reference: "Understanding of the Logic of Hypothesis Testing Amongst University Students", Augustias Vallecillos, Journal für Mathematik-Didaktik, 2000 [link]

In addition, seven second-year medicine students were interviewed in the 1992-1993 school year. However, the summary of these interviews provided in Vallecillos (2000) is difficult to understand.


In 2012 Jerry Lai, Fiona Fidler, and Geoff Cumming tested the Replication Fallacy by conducting three separate studies of psychology researchers. Subjects were authors of psychology articles in high-impact journals written between 2005 and 2007. Medical researchers and statisticians were also surveyed and those results will be covered in upcoming articles.

Subjects were provided with the p-value from an initial hypothetical experiment and given the task of intuiting a range of probable p-values that would be obtained if the experiment were repeated. Each of the three studies varied slightly in its design and task prompt. All subjects carried out the tasks via an email response to an initial email message.

The Replication Fallacy could be considered to have a strict and loose form. In the strict form a researcher or student misinterprets the p-value itself as a replication probability. This misconception might be tested using, for instance, the survey instrument from the series of studies by Badenes-Ribera et al. As a reminder the instrument asked subjects to respond true or false to the statement, “A later replication [of an initial experiment which obtained p = 0.001] would have a probability of 0.999 (1-0.001) of being significant.” The loose form of the fallacy is not tied so directly to the p-value itself. Even if one does not strictly believe the p-value is a replication probability, researchers and students might believe a small p-value is somehow indicative of strong replication properties, with repeated experiments of an initially statistically significant result likely to produce more statistically significant results most of the time. It is this loose version which Lai et al. (2012) attempted to investigate.
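To see why the strict form is a fallacy, consider a rough calculation (ours, not from Lai et al.): even under the optimistic assumption that the true effect exactly equals the observed one, an initial two-tailed p of 0.001 implies a replication-significance probability far below 0.999.

```python
from scipy.stats import norm

# Assumption for illustration: the replication z-statistic is distributed
# N(z_obs, 1), i.e., the true effect equals the originally observed effect.
z_obs = norm.isf(0.001 / 2)   # z corresponding to two-tailed p = 0.001 (~3.29)
crit = norm.isf(0.05 / 2)     # two-tailed 0.05 critical value (~1.96)

p_rep_significant = norm.sf(crit - z_obs) + norm.cdf(-crit - z_obs)
print(f"P(replication p < .05) ≈ {p_rep_significant:.2f}")  # ~0.91, not 0.999
```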

Each of the three tasks started with the same initial summary description below; the sample size, z-score, and p-value were sometimes modified.

Suppose you conduct a study to compare a Treatment and a Control group, each with size N = 40. A large sample test, based on the normal distribution, of the difference between the two means for your independent groups gives z = 2.33, p = .02 (two-tailed).

The specific task wording then followed.

All three studies attempted to obtain the subjects’ 80% p-value interval, that is, the range in which 80% of the p-values from the repeated experiments would fall. These p-value intervals were then compared to the normatively correct intervals derived from the formulas presented in Cumming (2008). Simulation can also be used to help determine p-value intervals.
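As a rough illustration of the simulation approach (ours; it assumes the replication z-statistic is distributed N(z_obs, 1), i.e., that the true effect equals the observed one, and is not Cumming's (2008) exact formula), the 80% p interval for an initial z = 2.33, p = .02 result is strikingly wide:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
z_obs = 2.33                                   # initial result: p = .02, two-tailed

# Simulate replication z-statistics under the same design and effect (assumed).
z_rep = rng.normal(z_obs, 1, size=100_000)
p_rep = 2 * norm.sf(np.abs(z_rep))             # two-tailed replication p-values

low, high = np.percentile(p_rep, [10, 90])
print(f"Estimated 80% p interval: ({low:.4f}, {high:.2f})")  # roughly (.0003, .3)
```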

The tasks in Study 2 and 3 were designed to accommodate findings from research on interval judgement. For example, research that suggests that prompting subjects for interval endpoints and then asking them what percentage of cases their interval covers may improve their estimates over prompts that ask for a pre-specified X% interval. For a full review of how the authors considered that literature please see Lai et al. (2012).

Study 1 elicited 71 usable responses. Subjects were shown the summary description above and then asked to carry out the following task. Email sends were randomized so that roughly half contained the z = 2.33 and p = 0.02 version of the summary description, and the other half contained a z = 1.44 and p = 0.15 version. Thus each subject responded to a single task with a specified p-value. A total of 37 responses were returned with the p = 0.02 version and 34 responses were returned with the p = 0.15 version.

The specific task wording was as follows:

Suppose you carry out the experiment a further 10 times, with everything identical but each time with new samples of the same size (each N = 40). Consider what p values you might obtain. Please enter 10 p values that, in your opinion, could plausibly be obtained in this series of replications. (Please, no calculations, and no debate about what model might be appropriate! We are interested in your guesstimate, your intuitions!)

The authors then extrapolated the 10 p-values into a distribution and took its 80% interval, noting that, “To analyze the results we assumed that underlying the 10 p values given by a respondent was an implicit subjective distribution of replication p.” The details of the extrapolation are included in the Appendix of Lai et al. (2012). Results are discussed further below.

In Study 2, instead of being asked for 10 separate p-values, subjects were asked directly for the upper and lower limits of their 80% p-interval. Subjects were shown the summary description above and then asked to carry out the task, described below. P-value variations were crossed with sample size variations, for a total of four versions of the task. These variations applied to both the summary description and the task-specific language. All subjects were asked to respond to all four versions. The four combinations were: (1) a z = 2.33 and p = 0.02 version with a sample size of 40; (2) the same p-value as in Version 1, but with a sample size of 160; (3) a z = 1.44 and p = 0.15 version with a sample size of 40; (4) the same p-value as in Version 3, but with a sample size of 160. The number of respondents varied between 37 and 39 because not every respondent completed all four tasks as instructed.

The specific task wording was as follows:

Suppose you repeat the experiment, with everything identical but with new samples of the same size (each N = 40). Consider what p value you might obtain. Please estimate your 80% prediction interval for two-tailed p. In other words, choose a range so you guess there’s an 80% chance the next p value will fall inside this range, and a 20% chance it will be outside the range (i.e., a 10% chance it falls below the range, and 10% it falls above). (Please, no calculations, and no debate about what model might be appropriate! We are interested in your guesstimate, your intuitions!)

LOWER limit of my 80% prediction interval for p= [ ] Type a value less than .02. (You guess a 10% chance p will be less than this value, and 90% it will be greater.)

UPPER limit of my 80% prediction interval for p=[ ] Type a value more than .02. (You guess a 10% chance p will be greater than this value, and 90% it will be less.)

Results are discussed further below.

Study 3 elicited 62 usable responses. Subjects were shown the summary description above and then asked to carry out the following task, which involved identifying p-value bounds as well as stating what percentage p-value interval those bounds cover. The methodology was similar to Study 1, except that instead of randomizing email sends all subjects saw both the z = 2.33, p = 0.02 version and the z = 1.44, p = 0.15 version. The sample size for both versions was 40. The task wording is shown below.

Suppose you repeat the experiment, with everything identical but with new samples of the same size (each N = 40). Consider what p value you might obtain. Please type your estimates for the statements below. (Please, no calculations, and no debate about what model might be appropriate! We are interested in your guesstimate, your intuitions!)

The replication study might reasonably find a p value as low as: p= [ ] Type a value less than .02

The replication study might reasonably find a p value as high as: p= [ ] Type a value more than .02

The chance the p value from the replication study will fall in the interval between my low and high p value estimates above is [ ] %.

A summary of the task for each study is shown in the table below.

Study Number of respondents Number of task versions presented to each respondent P-value variations Sample size variations Task summary
Study 1 71 1 p = 0.02 or p = 0.15 N = 40 Provide 10 p-values that you might obtain if the initial experiment were repeated 10 times.
Study 2 37-39 4 p = 0.02 and p = 0.15 N = 40 and N = 160 Provide a lower and upper p-value bound for an 80% p-value interval if the initial experiment were repeated.
Study 3 62 2 p = 0.02 and p = 0.15 N = 40 Provide the largest and smallest p-value you would expect if the initial experiment were repeated. Provide the chance the p-value will fall within the bound you gave.
Summary of task versions by study
Lai et al. (2012)

Table notes:
1. For Study 2 the number of respondents varied between 37 and 39 because not every respondent completed all four tasks as instructed.
2. Reference: "Subjective p Intervals: Researchers Underestimate the Variability of p Values Over Replication", Jerry Lai, Fiona Fidler, and Geoff Cumming, Methodology, 2012 [link]

A summary of the results across the three studies is shown in the chart at right. Here figures closer to zero are considered better as they represent estimates with less misestimation. The chart was reproduced using our standard tracing technique and therefore may contain minor inaccuracies, although we do not consider these to be materially relevant for the current discussion.

For reference, the authors note that if an initial experiment resulted in a p-value of 0.02, under conservative assumptions an 80% replication interval would be p = 0.0003 to p = 0.3.

For Study 1 both p-value task versions resulted in intervals roughly 40 percentage points too narrow. That is, instead of providing a set of p-values consistent with an 80% p-value interval, respondents provided p-values consistent with just a 40% interval.

Subjects in Study 2 did slightly better, averaging an underestimate of around 27 percentage points for the N = 40 task version and between 35 and 40 percentage points for the N = 160 version.

Subjects in Study 3 did about the same as subjects in the N = 40 task version in Study 2.

Overall, it seems the authors’ accommodations of the interval judgement literature in Studies 2 and 3 may have reduced underestimation. However, respondents still substantially underestimated p-value variation in replications, providing approximately a 52% p-value interval on average rather than an 80% interval.

Data for the N = 40 condition were combined via basic meta-analysis into an overall average. The precise combined sample size is not known because for Study 2 the authors report only that the number of respondents ranged between 37 and 39 depending on the task version. We conservatively used a combined figure of 160, although the true figure can be no greater than 162. The meta-analysis resulted in an average underestimate of 32 percentage points, equal to a 48% p-value interval.
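
For readers unfamiliar with the term, the basic meta-analysis here amounts to a sample-size-weighted average. The sketch below illustrates the calculation; the per-study underestimates and sample sizes are approximate values read from the discussion above, not the exact inputs behind our 32-point figure.

```python
import numpy as np

def weighted_mean(estimates, sample_sizes):
    """Sample-size-weighted average of per-study estimates."""
    estimates = np.asarray(estimates, dtype=float)
    weights = np.asarray(sample_sizes, dtype=float)
    return float(np.sum(weights * estimates) / np.sum(weights))

# Approximate per-study underestimates (percentage points) for the N = 40
# condition and rough sample sizes, both read from the discussion above.
underestimates = [40, 27, 27]   # Study 1, Study 2 (N = 40), Study 3
ns = [71, 37, 62]
print(round(weighted_mean(underestimates, ns), 1))   # roughly 32 percentage points
```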

Not all figures were broken down by academic discipline. Looking across all three disciplines of psychology, medicine, and statistics showed psychology researchers had higher magnitudes of underestimation than statisticians, but lower than medical researchers. Again, considering all three disciplines, 98% of respondents underestimated interval width in Study 1, 94% underestimated in Study 2, and 82% underestimated in Study 3.

Respondents were also invited to provide any comments on the task they received. The authors note that many subjects responded positively, with a substantial number noting that they found the task novel as it was uncommon in their day-to-day research.


Castro Sotos et al. (2017)


In 2018 Ziyang Lyu, Kaiping Peng, and Chuan-Peng Hu surveyed 346 psychology students and researchers in China using the same wording as in Haller and Krauss (2002), translated into Chinese. The authors also corrected the degrees of freedom from 18 to 38, which were incorrect in the original scenario.

Using the open source data provided by the authors we segmented out the psychology researchers in subfields we deemed medically related. The categorization of subfields as “medically related” was based on simple judgement on our part. The four subfields were cognitive neuroscience, biological or neuropsychology, psychiatry or medical psychology, and neuroscience or neuroimaging. The sample sizes for all but cognitive neuroscience (n=121) are quite small; biological and neuropsychology had four respondents, psychiatry and medical psychology had three, and neuroscience and neuroimaging had nine.

The English version of the wording is shown below.

Suppose you have a treatment that you suspect may alter performance on a certain task. You compare the means of your control and experimental groups (say 20 subjects in each sample). Further, suppose you use a simple independent means t-test and your result is (t = 2.7, d.f. = 18, p = 0.01). Please mark each of the statements below as “true” or “false”. “False” means that the statement does not follow logically from the above premises. Several or none of the statements may be correct.

1. You have absolutely disproved the null hypothesis (that is, there is no difference between the population means).

2. You have found the probability of the null hypothesis being true.

3. You have absolutely proved your experimental hypothesis (that there is a difference between the population means).

4. You can deduce the probability of the experimental hypothesis being true.

5. You know, if you decide to reject the null hypothesis, the probability that you are making the wrong decision.

6. You have a reliable experimental finding in the sense that if, hypothetically, the experiment were repeated a great number of times, you would obtain a significant result on 99% of occasions.

The correct answer to all of these questions is “false” (for an explanation see Statistical Inference or https://library.mpib-berlin.mpg.de/ft/gg/GG_Mindless_2004.pdf).

[add interpretation]

Statement summaries Undergraduate Masters PhD Postdoc and assistant professors
1. Null Hypothesis disproved 21% 72% 21% 56%
2. Probability of null hypothesis 50% 57% 47% 44%
3. Null hypothesis proved 36% 60% 5% 44%
4. Probability of null is found 64% 32% 47% 22%
5. Probability of Type I error 79% 25% 95% 44%
6. Probability of replication 29% 64% 42% 56%
Percentage of Chinese students and researchers that answered each of six NHST questions incorrectly (from Lyu et al., 2018)

Table notes:
1. Percentages do not add to 100% because each respondent answered all questions.
2. Sample sizes: Undergraduates (n=14), Masters (n=92), PhD (n=19), Postdoc or assistant prof (n=9)
3. Data in this table is for the subset of psychology researchers that self-identified as being in one of the four following subfields related to medicine: cognitive neuroscience, biological/neuropsychology, psychiatry/medical psychology, or neuroscience/neuroimaging.
4. Data calculated from "P-Value, Confidence Intervals, and Statistical Inference: A New Dataset of Misinterpretation", Ziyang Lyu, Kaiping Peng, and Chuan-Peng Hu, Frontiers in Psychology, 2018, paper: [link], data: [link]

In 2020 Xiao-Kang Lyu, Yuepei Xu, Xiao-Fan Zhao, Xi-Nian Zuo, and Chuan-Peng Hu tested 1,479 students and researchers in China, including 130 in the field of medicine. They used a four-question instrument in which respondents were randomized into either a version with a statistically significant p-value or a version with a statistically nonsignificant p-value. Subjects were prompted to answer each question as either “true” or “false.” Respondents are considered to have a misinterpretation of an item if they incorrectly mark it as “true”; the correct answer to all statements was “false.”

The authors’ instrument wording is shown below:

Suppose you have a treatment that you suspect may alter performance on a certain task. You compare the means of your control and experimental groups (50 subjects in each sample). Your hypotheses are as follows. H0: No significant difference exists between the population means corresponding to experimental and the control groups. H1: Significant difference exists between the experimental and the control groups. Further, suppose you use a simple independent means t test, and your result is t = 2.7, df = 98, p = .008 (in the significant version) or t = 1.26, df = 98, p = .21 (in the non-significant version).

The response statements read as follows, with the nonsignificant version wording appearing in parentheses, substituting for the word directly preceding it.

1. You have absolutely disproved (proved) the null hypothesis.

2. You have found the probability of the null (alternative) hypothesis being true.

3. You know, if you decide to (not to) reject the null hypothesis, the probability that you are making the wrong decision.

4. You have a reliable (unreliable) experimental finding in the sense that if, hypothetically, the experiment was repeated a great number of times, you would obtain a significant result on 99% (21%) of occasions.

Using the open source data made available by the authors we attempted to reproduce the findings in Lyu et al. (2020). However, we were not able to reproduce either the top-level figure for NHST or CI or any of the figures from the main table (Table 1). We contacted co-author Chuan-Peng Hu about the possible errors over email and shared our R code. He and the other paper authors then reexamined the data and confirmed our analysis was correct, later issuing a correction to the paper.

In terms of the proportion with at least one NHST misinterpretation, medical researchers were near the top of the professions surveyed by Lyu et al. (2020), with the second highest proportion of respondents with at least one misinterpretation out of the eight professions surveyed. Medical researchers ranked third out of eight in terms of the average number of misinterpretations, 1.88 out of four possible. The nonsignificant version produced a slightly higher average number of misinterpretations than the significant version, 1.92 compared to 1.84. This pattern held for all fields except general science.

The statement with the highest proportion of incorrect responses varied between the significant and nonsignificant versions. Statement four was the most misinterpreted in the significant version, while statement three was the most misinterpreted in the nonsignificant version. A separate version of statement three was also the most misinterpreted across three studies in psychology: Oakes (1986), a 2002 replication by Haller and Krauss, and the 2018 replication by Lyu et al. (translated into Chinese).

Participants likely have an especially difficult time with statement three because it is a very subtle reversal of the conditional probabilities involved in the Type I error rate. The Type I error rate is the probability of rejecting the null hypothesis given that the null hypothesis is actually true, whereas this statement asks about the probability of the null hypothesis being true given that it has been rejected. In fact, knowing the Type I error rate requires nothing more than the pre-specified value called “alpha” (typically set to 5%), so none of the test results presented in a hypothetical scenario are needed to determine this rate.
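
A small simulation can make the distinction concrete. The mix of true and false nulls and the effect size below are illustrative choices of our own; the point is only that P(reject | null true) is pinned at alpha by construction, while P(null true | reject) depends on power and on how often the null is true.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Illustrative assumptions: half of all experiments test a true null
# (effect = 0), half test a real effect (d = 0.5), with n = 40 per group.
n, alpha, n_sims = 40, 0.05, 100_000
null_true = rng.random(n_sims) < 0.5
effect = np.where(null_true, 0.0, 0.5)

g1 = rng.normal(0.0, 1.0, (n_sims, n))
g2 = rng.normal(effect[:, None], 1.0, (n_sims, n))
_, p = stats.ttest_ind(g2, g1, axis=1)
reject = p < alpha

# Type I error rate, P(reject | null true): fixed at alpha.
print(reject[null_true].mean())    # ~0.05
# What statement three actually asks about, P(null true | reject):
print(null_true[reject].mean())    # ~0.08 here; depends on power and the prior mix
```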

One might argue that the language is so subtle that some participants who have a firm grasp of Type I error may mistakenly believe this question is simply describing the Type I error definition. In the significant version of the instrument statement three has two clauses: (1) “if you decide to reject the null hypothesis” and (2) “the probability that you are making the wrong decision.” In one order these clauses read: “the probability that you are making the wrong decision if you decide to reject the null hypothesis.” With the implicit addition of the null being true, this statement is achingly close to one way of stating the Type I error rate, “The probability that you wrongly reject a true null hypothesis.” Read in the opposite order the two clauses form the statement on the instrument. There is no temporal indication in the statement itself as to which order the clauses should be read in, such as “first…then…”. While it is true that in English we read left to right, it is also true that many English statements can have their clauses reversed without changing their meaning. Other questions in the instrument are likely more suggestive of participants holding an NHST misinterpretation.

Statement summaries Significant version Nonsignificant version
1. You have absolutely disproved (proved) the null hypothesis 49% 48%
2. You have found the probability of the null (alternative) hypothesis being true. 52% 54%
3. You know, if you decide to (not to) reject the null hypothesis, the probability that you are making the wrong decision. 51% 64%
4. You have a reliable (unreliable) experimental finding in the sense that if, hypothetically, the experiment was repeated a great number of times, you would obtain a significant result on 99% (21%) of occasions. 64% 43%
Percentage of Chinese medical students and researchers that answered each of four NHST questions incorrectly (from Lyu et al., 2020)

Table notes:
1. Percentages do not add to 100% because each respondent answered all four questions.
2. Sample sizes: significant version (n=69), nonsignificant version (n=61).
2. Reference: "Beyond psychology: prevalence of p value and confidence interval misinterpretation across different fields", Xiao-Kang Lyu, Yuepei Xu, Xiao-Fan Zhao, Xi-Nian Zuo, and Chuan-Peng Hu, Frontiers in Psychology, 2020, paper: [link], data: [link]

A breakdown of NHST misunderstandings by education is shown below. All education levels had high rates of misunderstanding at least one NHST question. Undergraduate students fared best, with “only” 79% demonstrating at least one NHST misinterpretation, but they also had the highest average number of incorrect responses with 2.4 (out of four possible), indicating that those respondents that did have a misinterpretation tended to have multiple misinterpretations.

Education Sample size Percentage with at least one NHST misunderstanding Average number of NHST misunderstandings (out of four)
Undergraduates 19 79% 2.4
Masters 69 96% 1.8
PhD 24 96% 1.8
Post-PhD 18 94% 1.8
NHST misunderstandings by education among Chinese medical students and researchers (from Lyu et al., 2020)

Confidence interval review

In 2005 Sarah Belia, Fiona Fidler, Jennifer Williams, and Geoff Cumming conducted a unique experiment to probe psychology researchers’ understanding of overlapping confidence intervals. A total of 162 psychology researchers participated, recruited from authors of articles published between 1998 and 2002 in a selection of 21 high-impact psychology journals. Behavioral neuroscientists and medical researchers also participated; their results will be covered in other articles.

Participants were recruited via an email which included a link to a web applet that walked respondents through one of three tasks. In the first task respondents were shown a chart containing a 95% independent mean confidence interval representing the mean reaction time of a group in milliseconds (ms). This Group 1 mean was fixed at 300ms. A second 95% independent mean confidence interval was shown for Group 2; by clicking on the chart respondents could move the Group 2 interval. Respondents’ task was to position the Group 2 confidence interval such that the difference in means between the two groups would produce a p-value of 0.05. The Group 2 interval moved in increments of 3ms, with the chart scale stretching from 0ms to 1,000ms. To emphasize that the groups were independent, the sample size of Group 1 was set to 36, while for Group 2 it was set to 34.

The instructions accompanying the confidence interval task are shown below (with original bolded words). For a full image of the applet please see Belia et al. (2005).

Please imagine that you see the figure below published in a journal article.

Figure 1. Mean reaction time (ms) and 95% Confidence Intervals Group 1 (n=36) and Group 2 (n=34).

Please click a little above or below the mean on the right: You should see the mean move up or down to your click (the first time it may take a few seconds to respond). Please keep clicking to move this mean until you judge that the two means are just significantly different (by conventional t-test, two-tailed, p < .05). (I’m not asking for calculations, just your approximate eyeballing).

Because the authors had observed an anchoring effect in preliminary testing, the starting point of the Group 2 confidence interval was randomized: for approximately half of participants it was initially placed at 800ms and for the other half at 300ms. Due to this anchoring the authors adjusted each respondent’s final Group 2 position in their reporting and analysis. Averaging the respondents’ Group 2 placements under each initial position revealed a difference of 53ms. The authors halved this difference and subtracted it from the Group 2 placement of each respondent randomized into the 800ms initial position; the same distance was added for respondents randomized into the 300ms initial position.
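
A minimal sketch of that adjustment, based on our reading of the procedure rather than the authors’ code, is shown below. The group averages are hypothetical; only the 53ms difference comes from the paper.

```python
# Hypothetical average final placements of Group 2 under each starting position;
# their difference of 53ms is the figure reported by Belia et al. (2005).
mean_start_800 = 480.0
mean_start_300 = 427.0
half_gap = (mean_start_800 - mean_start_300) / 2    # 26.5ms

def adjust(placement_ms, started_at_800):
    """Remove the estimated anchoring effect from one respondent's placement."""
    return placement_ms - half_gap if started_at_800 else placement_ms + half_gap

print(adjust(500.0, started_at_800=True))    # 473.5
print(adjust(430.0, started_at_800=False))   # 456.5
```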

The second task was similar to the one described above, but involved standard error bars. The third task also involved standard error bars, but instead of the labels “Group 1” and “Group 2” the groups were formulated as a repeated measure with labels of “Pre Test” and “Post test.” Respondents were randomized into one of the three tasks.

Before reviewing the results, notice that moving the Group 2 mean closer to Group 1 will increase the p-value under the standard null hypothesis of no difference between group means, because the p-value is a measure of how compatible the data are with that hypothesis. If Group 2 were repositioned so that its mean perfectly aligned with the Group 1 mean of 300ms, the data would be maximally compatible with the null hypothesis of no difference, which would be reflected in a large p-value. If Group 2 were moved very far away from Group 1, the p-value would be small, because observing such a large difference becomes very unlikely under the null hypothesis of equal means.

The results of the confidence interval task and standard error task are shown in the chart below. The chart was reproduced using our standard tracing technique and therefore may contain minor inaccuracies, although we do not consider these to be materially relevant for the current discussion. The presentation for the two tasks is identical. A summary of the results is provided after the chart. The repeated measures task is discussed afterward.

The exact Group 2 position of each respondent is not shown; instead results are aggregated into a histogram. The histogram represents the number of respondents that placed Group 2 within the corresponding bin. For example, in the standard error task there were nine respondents that placed Group 2 somewhere between 400ms and 450ms, so the histogram has height nine for this bin. Note that although the authors report a sample size of 71 for the standard error task, the frequency histogram only sums to 70 participants.

Under the histogram the fixed placement of Group 1 is shown, as is the proper placement of Group 2 to produce a between-means p-value of 0.05. The grey interval below Group 1 and Group 2 represents the Group 2 placement that results in the two intervals just touching. Although this placement is not correct, in both tasks, and especially the standard error task, respondents tended to prefer it. This phenomenon can be observed by looking just above the histogram, where the average Group 2 placement of respondents is shown along with the actual p-value produced by this position. The grey band running the length of the chart corresponds to Group 2 placements which produce p-values between 0.025 and 0.10, the range the authors consider a reasonable estimate for respondents.


For the confidence interval task the correct position is to place the Group 2 confidence interval so that its mean is at 454ms. To properly position Group 2 respondents needed to recognize that some overlap of the confidence intervals is necessary. If two 95% confidence intervals do not overlap then the difference between their means is necessarily statistically significant at the 0.05 level. However, the converse is not true: two overlapping confidence intervals do not necessarily fail to produce a statistically significant difference. The authors note that a rule of thumb is, “CIs that overlap by one quarter of the average length of the two intervals yield p values very close to, or a little less than, .05.” This rule of thumb works well for sample sizes over 10 and for confidence intervals that are not too different in width (the wider of the two intervals cannot be more than a factor of two wider than the narrower).

For the confidence interval task respondents were on average too strict, placing the Group 2 interval too far from Group 1. The average p-value produced was 0.017 rather than 0.05. It is perhaps worth noting, however, that the average is somewhat skewed upward as a small number of respondents moved Group 2 quite far from Group 1. In fact, the 450ms to 500ms bin — which contained the correct Group 2 placement (454ms) — was the modal response, with one-quarter of respondents within this range. However, respondent-level data would be needed to understand where in this bin respondents placed Group 2. Overall respondents did indeed misplace Group 2. Using a more generous band of p-values between 0.25 and 0.10 did not alter this overall finding. The authors note that in total just 15% of the respondents positioned Group 2 within the 0.025 to 0.10 p-value band judged as a reasonable placement by the authors.

Whereas in the confidence interval task respondents tended to place Group 2 too far from Group 1, in the standard error task they tended to place it too close. The correct Group 2 placement was at 614ms. However, the average placement of respondents was much smaller, producing a p-value of 0.158 rather than 0.05. Again, using a more generous band of p-values between 0.25 and 0.10 did not alter this overall finding. Respondents failed to recognize that a gap is necessary between two standard error bars for there to be a statistically significant difference. As with confidence intervals, the authors noted a rule of thumb: “SE bars that have a gap equal to the average of the two SEs yield p values close to .05.” The provisos here are the same as for the confidence interval rule of thumb. On average respondents tended to place the Group 2 standard error interval so that it was just touching the Group 1 interval. This can be seen in the chart below by noticing that the small yellow box above the histogram aligns closely with the grey interval under the chart. The overall accuracy for the standard error task was the same as for the confidence interval task: just 15% of respondents positioned Group 2 within the 0.025 to 0.10 p-value band judged as a reasonable placement by the authors.
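
Both rules of thumb can be checked with a short calculation. The sketch below uses a large-sample normal approximation and hypothetical, equal-width intervals; the authors’ provisos about sample size and comparable interval widths still apply.

```python
import numpy as np
from scipy import stats

def p_two_independent_means(mean1, se1, mean2, se2):
    """Two-tailed p value for the difference of two independent means
    (large-sample normal approximation)."""
    z = abs(mean1 - mean2) / np.hypot(se1, se2)
    return 2 * stats.norm.sf(z)

se = 50.0               # hypothetical standard error of each group mean
ci_half = 1.96 * se     # half-width of each 95% confidence interval

# CI rule: intervals overlapping by one quarter of the average interval
# length, i.e. means separated by 1.5 half-widths.
print(p_two_independent_means(300, se, 300 + 1.5 * ci_half, se))   # ~0.04

# SE rule: a gap of one average SE between the SE bars,
# i.e. means separated by 3 standard errors.
print(p_two_independent_means(300, se, 300 + 3 * se, se))          # ~0.03
```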

The authors also investigated the unadjusted Group 2 placement and found that for the standard error task about 25% of respondents positioned Group 2 so that the error bars were just touching Group 1. The corresponding figure for confidence interval bars was about 23%.

As for the third task involving repeated measures, curiously there was not enough information to successfully complete it. This is because the provided error bars represented between-subject variation but failed to account for within-subject variation. The authors therefore expected respondents to comment that more information about the design was necessary to successfully complete the task (e.g. “Is this a paired design?”). Note that during the second phase of the task respondents were shown a second screen with open-ended response options, so they did have the ability to provide comments. Just three of the 51 psychologists provided such feedback in the open-ended comments. However, it is unclear what should be made of this result, as perhaps the experimental setup itself confused respondents into attempting to complete an impossible task.

The authors conclude that their findings are indicative of four misconceptions. First, respondents have an overall poor understanding of how two independent samples interact to produce p-values. On average respondents placed Group 2 too far away in the confidence interval task, but too close in the standard error task. This conclusion must be discussed in its full context, however. More than half of respondents (63%) positioned Group 2 so that there was some overlap between the intervals. It is unclear to what extent confidence interval interactions were understood by this subgroup. Perhaps they intuited, or explicitly understood, that confidence intervals must overlap to produce the requested p-value of 0.05, but did not know the proper rules of thumb to place Group 2 as precisely as necessary. Still, this generous interpretation implies that nearly 40% of respondents in the confidence interval task did not understand key features of interval interaction. A similar case could be made for the standard error task, with respondents understanding there must be some distance between the two intervals, but not fully internalizing the correct rules of thumb.

Their second proposed misconception is that respondents didn’t adequately distinguish the properties of confidence intervals and standard errors. This may be true on average, but again the same arguments above apply. Having participants undertake both tasks would have given some sense of the within-subject discrimination abilities. A subject may understand simultaneously that Group 1 and Group 2 confidence intervals need some overlap and that standard error intervals need some gap, without fully recognizing the correct rules of thumb. This would demonstrate an appreciation for the distinction between the two intervals, although it could still lead to nominally incorrect answers (i.e. misplacing Group 2).

Their third proposed misconception is that respondents may use an incorrect rule that two error bars should just touch to produce the desired statistically significant result. This does seem to be true for some portion of respondents, as discussed above. Whatever rule respondents used, the majority certainly do not appear to be familiar with those outlined by the authors (again, discussed above).

The fourth proposed misconception is that respondents do not properly appreciate the different types of variation within repeated measures. As discussed above, it is unclear whether this is a fair interpretation of the results from the third task.


In 2018 Pav Kalinowski, Jerry Lai, and Geoff Cumming investigated what they call confidence interval “subjective likelihood distributions” (SLDs): the mental model one uses to assess how likely different points within a confidence interval are to be the true mean. As an example, one possible, but incorrect, distribution is the belief that the mean is equally likely to fall at any point within the confidence interval.

A total of 101 students participated in a set of three tasks. Although the academic disciplines of the students varied, the research is appearing in this article as two thirds (66%) of the students self-identified as psychology students. The remaining disciplines were social science (13%), neuroscience (6%), medicine (5%), and not identified (10%). The authors note that, “Most students (63%) were enrolled in a post graduate program, and the remaining students where completing their honors (fourth year undergraduate).”

Percentage of students "drawing" this shape
Shape 95% CI 50% CI
Correct 15% 17%
Bell 12% 18%
Triangle 4% 7%
Half circle 10% 5%
Mesa 16% 12%
Square 19% 13%
Other 25% 36%
Percentage of students whose responses on Task 1 corresponded to one of seven shapes
Kalinowski et al. (2018)

Table notes:
1. Percentages for the 50% confidence interval do not add to 100% as the Triangle shape was practically indistinguishable from the correct shape. For this reason students whose response formed a triangle distribution are counted twice, once in the Correct category and once in Triangle.
2. Reference: "A Cross-Sectional Analysis of Students’ Intuitions When Interpreting CIs", Pav Kalinowski, Jerry Lai, and Geoff Cumming, Frontiers in Psychology, 2018 [link]

In Task 1 students saw a 95% confidence interval and a set of nine markers. Five of the markers were within the confidence interval, while four were outside of it. Students had to rank the likelihood that each point was associated with the mean. A 19-point scale was used for the ranking. Example values were (1) “More likely [to] land on [the mean],” (3) “About equally likely [to] land on [the mean],” (5) “Very slightly less likely to be [the mean],” and (19) “Almost zero likelihood.” This procedure allowed the authors to construct each student’s SLD, which was judged to be correct if 97% or more of its variance was explained by the normatively correct distribution. If the student’s SLD was incorrect it was categorized into one of six other distributions. The procedure was repeated with a 50% confidence interval to examine performance on intervals of different widths.

The results are shown at right. Note that for the 50% interval the correct shape could not be distinguished from students whose SLD was a triangle shape. Therefore, these respondents are counted twice, once in “Correct” and once in “Triangle.”

Please see the original article for a screenshot of Task 1, example student SLDs, and the shape classification rules of the authors.

In Task 2 students had to choose one of six shapes that best represented their SLD. Please see the original article for a screenshot of the shapes provided. In total, 61% of students selected the correct shape, a normal distribution.

In Task 3, students were shown a confidence interval. The task had two questions. In the first question a 95% confidence interval was presented and using a slider students were asked to select two points on the interval that corresponded to an 80% and 50% interval, respectively. In the second question they were shown a 50% interval, and had to select two points that would correspond to an 80% and 95% interval, respectively.

Most students (75%) selected the correct direction of the intervals, for instance understanding that 50% intervals are narrower than 95% intervals. However, 25% misunderstood the relationship between intervals of different levels, for example believing that 95% confidence intervals are narrower than 50% intervals. When starting with a 95% confidence interval, students overestimated the needed width: on average students attempting to mark an 80% interval instead marked an 86% interval, and when attempting to mark a 50% interval they instead marked a 63% interval. When starting with a 50% confidence interval, students underestimated the needed width: on average students attempting to mark an 80% interval instead marked a 79% interval, and when attempting to mark a 95% interval they instead marked a 92% interval.
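
For reference, the sketch below shows how interval widths scale with the confidence level under a normal approximation. Among other things it shows that a 95% interval is nearly three times, not twice, the width of a 50% interval, which bears on the misconceptions listed further below.

```python
from scipy import stats

def width_ratio(level_from, level_to):
    """Ratio of the width of a level_to CI to that of a level_from CI for the
    same normal-approximation estimate; widths scale with the z critical value."""
    z = lambda level: stats.norm.ppf(0.5 + level / 2)
    return z(level_to) / z(level_from)

print(width_ratio(0.95, 0.80))   # ~0.65: an 80% CI is about two-thirds the width of a 95% CI
print(width_ratio(0.95, 0.50))   # ~0.34: a 50% CI is about one-third the width of a 95% CI
print(width_ratio(0.50, 0.95))   # ~2.9: a 95% CI is nearly 3x, not 2x, the width of a 50% CI
```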

Across the three tasks 74% of students gave at least one answer that was normatively incorrect.

After Task 3 an open-ended response option was presented to participants.

A combination of the three tasks plus the open-ended response option resulted in four confidence interval misconceptions that have been rarely documented in the literature:

  1. All points inside a confidence interval are equally likely to land on the true population mean

  2. All points outside a confidence interval are equally unlikely to land on the true population mean

  3. 50% confidence intervals and 95% confidence intervals have the same distribution (in terms of the likelihood of each point in the interval to land on the true population mean)

  4. A 95% confidence interval is roughly double the width of a 50% confidence interval

In Task 4, 24 students agreed to interviews. The full results of Task 4 are difficult to excerpt and readers are encouraged to review the original article. Two primary findings are worth noting. First, after coding student responses into 17 different confidence interval misconceptions, the authors note that, “Overall every participant held at least one CI misconception, with a mean of 4.6 misconceptions per participant.” Second, the authors found that cat’s eye diagrams may be effective at remedying some misconceptions and helping to reinforce correct concepts.


Because of the difficulty in properly interpreting NHST, confidence intervals have been proposed as an alternative [citation]. Confidence intervals also have the benefit of giving a measure of the precision of the effect size. For that reason, in addition to NHST instruments, some researchers have also tested confidence interval misinterpretations. Again, most of this research has occurred in the field of psychology. Only Lyu et al. (2020) have directly tested medical researchers for common confidence interval misinterpretations. All researchers surveyed were from China.

Lyu et al. (2020) used a modified version of their four-question NHST instrument adapted to test confidence interval knowledge. There were two versions, one with a statistically significant result and one without. The English translation of the hypothetical experimental situation and four statements are shown below. The significant version of each statement read as follows with the nonsignificant version wording appearing in parenthesis, substituting for the word directly preceding it.

The 95% (bilateral) confidence interval of the mean difference between the experimental group and the control group ranges from .1 to .4 (–.1 to .4).

1. A 95% probability exists that the true mean lies between .1 (–.1) and .4.

2. If we were to repeat the experiment over and over, then 95% of the time the true mean falls between .1 (–.1) to .4.

3. If the null hypothesis is that no difference exists between the mean of experimental group and control group, then the experiment has disproved (proved) the null hypothesis.

4. The null hypothesis is that no difference exists between the means of the experimental and the control groups. If you decide to (not to) reject the null hypothesis, then the probability that you are making the wrong decision is 5%.

At 92%, medical researchers had the fourth highest proportion of respondents with at least one confidence interval misinterpretation out of the eight professions surveyed by Lyu et al. (2020). Medical researchers also ranked third out of eight in terms of the average number of confidence interval misinterpretations, 1.86 out of four possible. This was comparable to the 1.88 average misinterpretations on the NHST instrument. For medical researchers the nonsignificant version produced a lower average number of misinterpretations than the significant version, 1.74 compared to 1.97; medicine and the social sciences were the only fields showing this pattern.

There was fairly wide variation between the significant and nonsignificant versions for statement four, 15 percentage points. Statement four suffers from the same subtle wording issue as the corresponding NHST statement.

Statement summaries Significant version Nonsignificant version
1. A 95% probability exists that the true mean lies between .1 (–.1) and .4. 60% 61%
2. If we were to repeat the experiment over and over, then 95% of the time the true mean falls between 0.1 (-0.1) and 0.4. 54% 56%
3. If the null hypothesis is that there is no difference between the mean of experimental group and control group, the experiment has disproved (proved) the null hypothesis. 53% 47%
4. The null hypothesis is that there is no difference between the mean of experimental group and control group. If you decide to (not to) reject the null hypothesis, the probability that you are making the wrong decision is 5%. 66% 51%
Percentage of Chinese students and researchers that answered each of four confidence interval questions incorrectly (from Lyu et al., 2020)

Table notes:
1. Percentages do not add to 100% because each respondent answered all four questions.
2. Sample sizes: significant version (n=93), nonsignificant version (n=71).
3. Reference: "Beyond psychology: prevalence of p value and confidence interval misinterpretation across different fields", Xiao-Kang Lyu, Yuepei Xu, Xiao-Fan Zhao, Xi-Nian Zuo, and Chuan-Peng Hu, Frontiers in Psychology, 2020, paper: [link], data: [link]

A breakdown of confidence interval misunderstandings by education is shown below. All education levels had approximately the same average number of confidence interval misinterpretations, at about 1.9 out of a possible four. There was some variation in the percentage of each education level with at least one confidence interval misinterpretation. Undergraduates had the lowest rate at 84%, although the sample size of undergraduates was relatively small at 19 participants. All Post-PhD participants had at least one misunderstanding, but again the sample size was relatively small at 18 participants. Masters and PhD students fared about equally.

Education Sample size Percentage with at least one CI misunderstanding Average number of CI misunderstandings (out of four)
Undergraduates 19 84% 1.9
Masters 69 93% 1.9
PhD 24 92% 1.9
Post-PhD 18 100% 1.8
Confidence interval (CI) misunderstandings by education among Chinese medical students and researchers (from Lyu et al., 2020)

Cliff Effect

The cliff effect refers to a sharp drop in confidence in an experimental result as the p-value crosses a threshold. Typically the effect refers to dichotomization of evidence at the 0.05 level, where an experimental or analytical result produces high confidence for p-values below 0.05 and much lower confidence for values above 0.05.

In 2010 Jerry Lai conducted a confidence elicitation study of 258 psychology and medical researchers, replicating the between-subjects analysis of Poitevineau and Lecoutre (2001). Participants were authors of journal articles that appeared in one of the two fields. Data in the paper are not split by discipline, so all figures below are for the combined population of researchers. If psychology-only data is made available the figures will be updated.

Participants first saw the following hypothetical scenario, where the confidence interval version is shown in brackets, substituting for the previous sentence.

Suppose you conduct an experiment comparing a treatment and a control group, with n = 15 in each group. The null hypothesis states there is no difference between the two groups. Suppose a two-sample t test was conducted and a two-tailed p value calculated. [Suppose the difference between the two group means is calculated, and a 95% confidence interval placed around it].

Respondents were asked about each of the following p-values: p = .005, .02, .04, .06, .08, .20, .40, .80. Then a scenario with a sample size of 50 was shown. All combinations assumed equal variances, with a pooled standard deviation of four. Respondents were asked to rate the strength of evidence for each p-value and sample size combination on a scale of 0 to 100.
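
To get a feel for the stimuli, the sketch below backs out the mean difference implied by each p-value and sample size combination. This is our own reconstruction under the stated assumptions (two-tailed, two-sample t test, equal group sizes, pooled standard deviation of four), not the authors’ materials.

```python
import numpy as np
from scipy import stats

pooled_sd = 4.0
p_values = (0.005, 0.02, 0.04, 0.06, 0.08, 0.20, 0.40, 0.80)

for n in (15, 50):
    for p in p_values:
        t_crit = stats.t.ppf(1 - p / 2, df=2 * n - 2)    # |t| giving this two-tailed p
        diff = t_crit * pooled_sd * np.sqrt(2 / n)       # implied mean difference
        print(f"n = {n:2d}, p = {p:.3f}: mean difference ~ {diff:.2f}")
```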

The NHST version displayed a typical result with a t-score, p-value, and effect size. The confidence interval version showed a visual display for each p-value and sample size combination. A total of 172 researchers saw the NHST version and 86 saw the confidence interval version. This sample size is an order of magnitude larger than in Poitevineau and Lecoutre (2001). Participants were not randomized into one of the two versions; instead, different sets of researchers were contacted for the NHST study and the confidence interval study.

Curve NHST CI
All-or-none 4% 17%
Moderate cliff 17% 15%
Negative exponential 35% 31%
1-p linear 23% 0%
Categorization of respondent confidence curves

Table notes:
1. Percentages do not add to 100% because not all responses could be categorized into one of the four main types.
2. NHST sample sizes: all-or-nothing (n=7), moderate cliff (n=29), negative exponential (n=60), 1-p linear (n=39), unclassified (n=38). Confidence interval sample sizes: all or nothing (n=28), moderate cliff (n=13), negative exponential (n=27), 1-p linear (n=0), unclassified (n=32).
3. Reference: "Dichotomous Thinking: A Problem Beyond NHST", Jerry Lai, Proceedings of the Eighth International Conference on Teaching Statistics, 2010 [link]

Lai assessed the extent of a cliff effect by calculating what he called the “Cliff Ratio (CR).” The numerator of the CR was the decrease in the rated strength of evidence between p-values of 0.04 and 0.06. The denominator of the CR was calculated by averaging the decrease in rated strength of evidence between p-values of 0.02 and 0.04 and between 0.06 and 0.08.
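
A minimal sketch of the calculation, based on our reading of that definition, is shown below; the ratings are hypothetical.

```python
def cliff_ratio(conf):
    """Lai's Cliff Ratio from one respondent's strength-of-evidence ratings,
    keyed by p value (our reading of the definition above)."""
    drop_across_05 = conf[0.04] - conf[0.06]
    baseline = ((conf[0.02] - conf[0.04]) + (conf[0.06] - conf[0.08])) / 2
    return drop_across_05 / baseline

# Hypothetical respondent showing a sharp drop in rated evidence at p = .05.
ratings = {0.02: 85, 0.04: 80, 0.06: 40, 0.08: 35}
print(cliff_ratio(ratings))   # 8.0, far above 1, indicating a cliff
```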

The CR as well as the overall shape of each participant’s responses were used to manually categorize patterns, with a focus on the categories from Poitevineau and Lecoutre (2001). One additional pattern, a moderate cliff effect, was also identified.

The results are shown in the table at right. Not all respondents could be categorized into one of the four main groups selected by Lai; 79% of responses from the NHST version and 63% of responses from the confidence interval version were placed into one of these categories.

In total 21% of participants demonstrated an all-or-none or moderate cliff effect for the NHST version. However, in the paper Lai mistakenly cites a 22% figure in his discussion.

In the initial research by Poitevineau and Lecoutre (2001) a 22% figure for a cliff effect was found, but this was for the single all-or-none category, into which only 4% of respondents were categorized by Lai (2010). Poitevineau and Lecoutre (2001) did not create an explicit “moderate cliff effect” category.

The corresponding figure for the confidence interval version was 32% of participants demonstrating an all-or-none or moderate cliff effect. This implies that the confidence interval presentation did not decrease dichotomization of evidence relative to NHST (and in fact made it worse) among the sample collected by Lai. However, as Lai discusses, confidence intervals are sometimes suggested specifically because of the belief that one of their benefits is protection against dichotomization of evidence.

Lai notes that “sample size was found to have little impact on researchers’ interpretation…” and no breakdown of results by sample size is provided in his paper. This differs from other studies of the cliff effect; it is unclear why.

Several participants exhibited fallacies when providing responses in the open-ended section of the response. For instance, one participant believed that p-values represent, “the likelihood that the observed difference occurred by chance,” an example of the Odds-Against-Chance Fallacy. Another claimed that, “I have estimated…based on probability that the null hypothesis is false,” an example of the Inverse Probability Fallacy.


In 2016 Blakeley McShane and David Gal released the results of a multi-year study that investigated the prominence of the cliff effect within various academic fields, including medicine. To test medical researchers McShane and Gal surveyed two populations: 261 authors of articles from the American Journal of Epidemiology and 75 from the New England Journal of Medicine. Both populations of authors were from articles published in 2013.

To fully study the impact of the cliff effect McShane and Gal created a hypothetical scenario. The details of the hypothetical scenario differed for the two populations. For the American Journal of Epidemiology two questions were asked after the hypothetical scenario was presented. The first was the so-called “judgement” question, meant simply to test medical researchers’ statistical understanding of the scenario presented. The judgement question randomized medical researchers into a hypothetical scenario with one of four p-values (0.025, 0.075, 0.125, or 0.175) and one of two treatment magnitudes: a small treatment effect of 52% vs. 44% and a large treatment effect of 57% vs. 39%, for the Drug A and Drug B recovery rates respectively.

The second was the “choice” question, in which medical researchers were asked to make a recommendation based on the hypothetical scenario. The choice question randomized the recommendation to be toward either a close other or a distant other.

The judgement question is shown below.

Below is a summary of a study from an academic paper:

The study aimed to test how two different drugs impact whether a patient recovers from a certain disease. Subjects were randomly drawn from a fixed population and then randomly assigned to Drug A or Drug B. Fifty-two percent (52%) of subjects who took Drug A recovered from the disease while forty-four percent (44%) of subjects who took Drug B recovered from the disease. A test of the null hypothesis that there is no difference between Drug A and Drug B in terms of probability of recovery from the disease yields a p-value of 0.175. Assuming no prior studies have been conducted with these drugs, which of the following statements is most accurate?

A. A person drawn randomly from the same population as the subjects in the study is more likely to recover from the disease if given Drug A than if given Drug B.

B. A person drawn randomly from the same population as the subjects in the study is less likely to recover from the disease if given Drug A than if given Drug B.

C. A person drawn randomly from the same population as the subjects in the study is equally likely to recover from the disease if given Drug A than if given Drug B.

D. It cannot be determined whether a person drawn randomly from the same population as the subjects in the study is more/less/equally likely to recover from the disease if given Drug A or if given Drug B.

The choice question for a close loved one then saw the following wording:

If you were to advise a loved one who was a patient from the same population as those in the study, what drug would you advise him or her to take?

Participants in the distant other condition saw this wording instead:

If you were to advise physicians treating patients from the same population as those in the study, what drug would you advise these physicians prescribe for their patients?

All participants then saw the following three response options:

A. I would advise Drug A.

B. I would advise Drug B.

C. I would advise that there is no evidence of a difference between Drug A and Drug B.

The correct answer to all versions of the judgement statements was Option A since Drug A had a higher percentage of patients recover from the disease. However, respondents were much more likely to select an incorrect response in the version of the question with the nonsignificant p-value, likely believing that a nonsignificant p-value was evidence of no effect between Drug A and Drug B. McShane and Gal identify this as evidence of the cliff effect at work.

The evidence is supplemented by respondents’ selection for the choice statement. While strictly speaking there is no correct answer to the choice question as it is a recommendation, Drug A had a higher recovery rate from the disease and is therefore the natural choice. Like in the judgement question the nonsignificant p-value induced fewer respondents to answer correctly. Nonetheless, the proportion answering correctly in the nonsignificant version of the choice question was substantially higher than in the judgement question. As in McShane and Gal’s article we collapse the choice question across the “close other” and “distant other” categories as this is not the primary hypothesis being considered. In general, respondents were more likely to recommend Drug A in the “close other” scenario. For complete details see McShane and Gal (2016).

McShane and Gal hypothesize that respondents are more likely to select the correct answer for the choice question because it short-circuits the automatic tendency to interpret the results through the lens of statistical significance. Instead, focus is redirected toward a simpler question: which drug is better?

For the New England Journal of Medicine (NEJM) two questions were asked after the hypothetical scenario was presented. The first was the so-called “judgement” question, meant simply to test medical researchers’ statistical understanding of the scenario presented. The judgement question presented the same scenario twice, first with a p-value of 0.27 and next with a p-value of 0.01. Participants were randomized into one of three scenario wordings. Wording one is shown below.

Below is a summary of a study from an academic paper. The study aimed to test how different interventions might affect terminal cancer patients’ survival. Participants were randomly assigned to one of two groups. Group A was instructed to write daily about positive things they were blessed with while Group B was instructed to write daily about misfortunes that others had to endure. Participants were then tracked until all had died. Participants in Group A lived, on average, 8.2 months post-diagnosis whereas participants in Group B lived, on average, 7.5 months post-diagnosis (p = 0.27). Which statement is the most accurate summary of the results?

A. Speaking only of the subjects who took part in this particular study, the average number of post-diagnosis months lived by the participants who were in Group A was greater than that lived by the participants who were in Group B.

B. Speaking only of the subjects who took part in this particular study, the average number of post-diagnosis months lived by the participants who were in Group A was less than that lived by the participants who were in Group B.

C. Speaking only of the subjects who took part in this particular study, the average number of post-diagnosis months lived by the participants who were in Group A was no different than that lived by the participants who were in Group B.

D. Speaking only of the subjects who took part in this particular study, it cannot be determined whether the average number of post-diagnosis months lived by the participants who were in Group A was greater/no different/less than that lived by the participants who were in Group B.

Response wording two was identical to response wording one above except it omitted the phrase “Speaking only of the subjects who took part in this particular study” from each of the four response options.

Response wording three omitted “Speaking only of the subjects who took part in this particular study” and rephrased “the average number of post-diagnosis months lived by the participants who were in Group A was greater than that lived by the participants who were in Group B” with “The participants who were in Group A tended to live longer post-diagnosis than the participants who were in Group B.” The complete list of options was then:

A. The participants who were in Group A tended to live longer post-diagnosis than the participants who were in Group B.

B. The participants who were in Group A tended to live shorter post-diagnosis than the participants who were in Group B.

C. Post-diagnosis lifespan did not differ between the participants who were in Group A and the participants who were in Group B.

D. It cannot be determined whether the participants who were in Group A tended to live longer/no different/shorter post-diagnosis than the participants who were in Group B.

As with the American Journal of Epidemiology authors, a substantial cliff effect was observed for NEJM authors.

The authors produced a follow-up study in their paper “Statistical Significance and the Dichotomization of Evidence” that was focused specifically on statisticians. However, the experimental setup was the same. That article was published in the Journal of the American Statistical Association and selected by the editors for discussion.

In their discussions both Donald Berry and the team of Eric Laber and Kerby Shedden criticized the “Speaking only” phrasing used in the questionnaire. However, in their rejoinder McShane and Gal note that respondents were randomized into versions of the question phrasing, some of which did not include the “Speaking only” language, and the response patterns were the same regardless of phrasing. The original paper as well as the discussions and rejoinder can be found in citation X [https://statmodeling.stat.columbia.edu/wp-content/uploads/2017/11/jasa_combined.pdf].


One more study was uncovered that examined the cliff effect, Helske et al. (2020). More than a hundred researchers participated, including 16 medical researchers. Of the medical researchers, one held an undergraduate degree, three held master’s degrees, and 12 held PhDs. While the sample size is somewhat small, the results of the study are included here for completeness.

As in McShane and Gal, a hypothetical scenario was presented. The instrument wording was as follows:

A random sample of 200 adults from Sweden were prescribed a new medication for one week. Based on the information on the screen, how confident are you that the medication has a positive effect on body weight (increase in body weight)?

One of four visualizations was then presented: a text box describing the p-value and 95% confidence interval, a 95% confidence interval visual display, a gradient confidence interval visual display, or a violin plot visual display. For each scenario respondents were presented with one of eight p-values between 0.001 and 0.8; the specific p-values were 0.001, 0.01, 0.04, 0.05, 0.06, 0.1, 0.5, and 0.8. Respondents then used a slider to indicate their confidence on a scale of 0 to 100.

Visualization    Interval with largest drop in confidence    Drop in confidence (percentage points)
P-value          0.04 to 0.06                                21
CI               0.04 to 0.06                                19

Cliff effect results for two different visual presentation types (from Helske et al., 2020)

Using the open data made available by the authors, we analyzed the extent of a cliff effect. A cliff effect was observed after plotting the drop in confidence segmented by visual presentation type. Only results from the p-value and confidence interval (CI) presentation types are presented here, since these are the most common methods of presenting analytical results. In their own analysis, however, Helske et al. looked across all 114 respondents and employed Bayesian multilevel models to investigate the influence of the four visual presentation types. The authors concluded that the gradient and violin presentation types may moderate the cliff effect in comparison to standard p-value descriptions or confidence interval bounds.

Although the p-values presented to respondents were not evenly spaced, the drop in confidence between each pair of consecutive p-values was used to determine the presence of a cliff effect. One additional difference was calculated: the drop between p-values of 0.04 and 0.06, the typical cliff effect boundary (these two values are not consecutive in the list, since 0.05 lies between them).

The 0.04 to 0.06 interval was associated with the largest drop in confidence for both the p-value and confidence interval presentation methods.
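For readers who want to reproduce this kind of calculation from the published open data, below is a minimal sketch of the consecutive-drop computation. The column names ("viz", "p_value", "confidence") and the confidence values are illustrative assumptions for this sketch, not the actual Helske et al. (2020) data or variable names.

```python
import pandas as pd

# Illustrative long-format responses; values are placeholders, not the
# Helske et al. (2020) data.
responses = pd.DataFrame({
    "viz":        ["p-value"] * 8 + ["CI"] * 8,
    "p_value":    [0.001, 0.01, 0.04, 0.05, 0.06, 0.1, 0.5, 0.8] * 2,
    "confidence": [92, 85, 70, 60, 49, 45, 20, 10,
                   90, 84, 72, 63, 53, 47, 22, 12],
})

# Mean confidence at each presented p-value, per presentation type.
mean_conf = (responses.groupby(["viz", "p_value"])["confidence"]
                      .mean()
                      .unstack("p_value"))

# Drop in mean confidence between consecutive p-values (columns are sorted).
drops = -mean_conf.diff(axis=1)

# The additional comparison: the drop across the 0.04 to 0.06 boundary.
cliff_drop = mean_conf[0.04] - mean_conf[0.06]

print(drops.round(1))
print(cliff_drop.round(1))
```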

Dichotomization of evidence

The dichotomization of evidence is a specific NHST misinterpretation in which results are interpreted differently depending on whether the p-value is statistically significant or statistically nonsignificant. It is often a result of the cliff effect and is closely related to the Nullification Fallacy, in which a nonsignificant result is interpreted as evidence of no effect.

In 2010 Melissa Coulson, Michelle Healey, Fiona Fidler, and Geoff Cumming conducted a pair of studies with psychologists. In the first study 330 subjects from three separate academic disciplines were surveyed: researchers from two medical fields plus 102 psychologists who had authored recent articles in leading psychology journals.

The subjects were first shown a short background summary of two fictitious studies that evaluated the impact of a new treatment for insomnia:

Only two studies have evaluated the therapeutic effectiveness of a new treatment for insomnia. Both Simms (2003) and Collins (2003) used two independent, equal-sized groups and reported the difference between the means for the new treatment and current treatment.

Subjects were then shown, at random, one of two formats, either a confidence interval format or an NHST format. For each format there was a text version and a figure version, so each subject saw one of four result summaries: an NHST figure, which included a column chart of the average treatment effect and associated p-values; an NHST text version, which simply described the results; a confidence interval figure, which displayed the 95% confidence intervals of the two studies side-by-side; or a confidence interval text version, which simply described the results. Only one version was shown to each subject. The confidence interval text version is shown below. All four versions can be found in the original paper.

Simms (2003), with total N = 44, found the new treatment had a mean advantage over the current treatment of 3.61 (95% Confidence Interval: 0.61 to 6.61). The study by Collins (2003), with total N = 36, found the new treatment had a mean advantage of 2.23 (95% Confidence Interval: -1.41 to 5.87).

Subjects were first prompted to provide a freeform response to the question, “What do you feel is the main conclusion suggested by these studies?” Next, three statements were presented regarding the extent to which the findings from the two studies agreed or disagreed. Subjects used a 1 to 7 Likert response scale to indicate their level of agreement, where 1 equated to “strongly disagree” and 7 to “strongly agree.” The three statements were as follows:

  • Statement 1: The results of the two studies are broadly consistent.

  • Statement 2: There is reasonable evidence the new treatment is more effective.

  • Statement 3: There is conflicting evidence about the effectiveness of the new treatment.

The primary purpose of the study was to evaluate whether respondents viewed the two studies as contradictory. The authors view the two studies as mutually supportive: the confidence intervals overlap substantially despite one interval covering zero while the other does not; or, in NHST terms, both studies found an effect in the same (positive) direction despite one p-value being statistically significant and the other statistically nonsignificant. However, dichotomization of evidence may lead one to believe that the results are contradictory. This type of dichotomization of evidence was Fallacy 8 in our common NHST misinterpretations. The authors refer to the philosophy undergirding the view that the two studies are supportive as meta-analytical thinking: the notion that no single study should be viewed as definitive and that studies should instead be considered in totality as providing some level of evidence toward a hypothesis. This is a good rule of thumb; note, however, that meta-analyses are not always better than a single study. This was Fallacy 11 in our common NHST misinterpretations.
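As a concrete illustration of the overlap argument, the sketch below checks each reported interval against zero and computes the region the two intervals share. The interval endpoints are taken from the result summary above; the helper functions are ours and are not part of the original study.

```python
# Intervals from the confidence interval text version above.
simms   = (0.61, 6.61)    # Simms (2003): mean advantage 3.61, total N = 44
collins = (-1.41, 5.87)   # Collins (2003): mean advantage 2.23, total N = 36

def excludes_zero(ci):
    """True if the interval lies entirely above or below zero."""
    lo, hi = ci
    return lo > 0 or hi < 0

def shared_width(ci1, ci2):
    """Width of the region common to both intervals (0 if disjoint)."""
    return max(0.0, min(ci1[1], ci2[1]) - max(ci1[0], ci2[0]))

print(excludes_zero(simms))                    # True  -> "statistically significant"
print(excludes_zero(collins))                  # False -> "statistically nonsignificant"
print(round(shared_width(simms, collins), 2))  # 5.26: most of each interval overlaps
```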

To analyze the results of the study the authors averaged the Likert scores for Statements 1 and 3 for each subject (this was done after first reversing the scale of Statement 3, since the sentiments of the two statements were in complete opposition). The authors called this the “Consistency score.” The results for Statement 2 were captured in what was termed the “Effective score.”
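A minimal sketch of the Consistency score calculation is shown below, assuming the standard reverse-coding rule for a 1-to-7 Likert scale (subtracting the response from 8); the exact coding used by Coulson et al. is not restated here, so treat this as illustrative.

```python
def consistency_score(statement1: int, statement3: int) -> float:
    """Average of Statement 1 and reverse-coded Statement 3 (1-7 Likert scale)."""
    reversed3 = 8 - statement3      # 1 <-> 7, 2 <-> 6, 3 <-> 5, 4 unchanged
    return (statement1 + reversed3) / 2

# Example: agreeing the studies are consistent (7) and disagreeing that the
# evidence is conflicting (1) yields the maximum Consistency score.
print(consistency_score(7, 1))  # 7.0
```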

The results of the first study are shown at the right, including the average across subjects (the dot) and the 95% confidence interval for the results (the bars). These were reproduced using our standard tracing technique and therefore may contain minor inaccuracies, although we do not consider these to be materially relevant for the current discussion. There were only small differences between the text and figure versions of the two formats, and thus the results were averaged, yielding a single Effective score and a single Consistency score for each of the NHST and confidence interval formats.

Because the authors considered the two studies to be mutually supportive, both average Likert responses should be near 6 (somewhat agree) or 7 (strongly agree). However, neither score even crossed a Likert response of 5, denoting mild agreement.

For both formats more subjects believed that the two studies were consistent than believed that the totality of evidence indicated the treatment was effective, with the Consistency score 0.75 to 1.0 Likert points higher than the Effective score. For the Consistency score, NHST produced average Likert responses higher than those of confidence intervals; for the Effective score this was reversed. However, the magnitudes of the differences are modest.

Some of the authors’ analysis was not segmented by academic field. We have requested the raw data from the authors and will update this write-up if the data are made available.

However, the authors note that, looking across all three academic disciplines, “Only 29/325 (8.9%) of respondents gave 6 or above on both ratings, and only 81/325 (24.9%) gave any degree of agreement – scores of 5 or more – on both.”

Coding of freeform responses to the question, “What do you feel is the main conclusion suggested by these studies?” painted a similar picture when looking across all three academic disciplines. Coding led to 81 out of 126 (64%) subjects indicating that the confidence interval format showed the two fictitious studies were consistent or similar in results. For the NHST format the corresponding proportion was 59 out of 122 (48%). Again, these freeform responses can be viewed as a possible indication of dichotomous thinking.

In addition, the authors note that 64 of the 145 subjects (44%) answering one of the two confidence interval result summaries made mention of “p-values, significance, a null hypothesis, or whether or not a CI includes zero.” This can be considered a type of confidence interval misinterpretation, as NHST is distinct from confidence intervals: NHST produces a p-value, an evidentiary measure of the data’s compatibility with a null hypothesis, while a confidence interval produces a range of estimates for the treatment effect.

In the second study 50 academic psychologists from psychology departments in Australian universities were presented with only the confidence interval figure result summary and asked the same four questions. This time, for Statements 1 and 3 the subjects were also prompted to give freeform text responses. Results of this study are also shown in the figure. Note that for this study we did not use tracing; instead, the results were simply plotted.

The results were similar to those of the first study, although the Consistency score and Effective score were closer together, centered between Likert responses of 4 and 4.5.

The freeform responses for Statements 1 and 3 were analyzed by the authors. Together there were 96 total responses across the two statements (each of the 50 subjects had the chance to respond to both statements, which would have given a total of 100 responses; however, a few psychologists abstained from providing written responses). In 26 of the 96 cases (27%) there was mention of NHST elements, despite the survey instrument presenting only a confidence interval scenario. Mention of NHST was negatively correlated with agreement levels of 5, 6, or 7 on the Likert scale for both the Consistency score and the Effective score, again suggesting NHST references were indicative of dichotomous thinking.


References