Title of article

Intro…

Additional stuff

Statistical Significance Testing: A Historical Overview of Misuse and Misinterpretation with Implications for the Editorial Policies of Educational Journals

https://www.researchgate.net/publication/252393220_Statistical_Significance_Testing_A_Historical_Overview_of_Misuse_and_Misinterpretation_with_Implications_for_the_Editorial_Policies_of_Educational_Journals

Promoting Good Statistical Practices: Some Suggestions

https://journals.sagepub.com/doi/abs/10.1177/00131640121971185?journalCode=epma

Concise, Simple, and Not Wrong: In Search of a Short-Hand Interpretation of Statistical Significance

https://www.frontiersin.org/articles/10.3389/fpsyg.2018.02185/full

TEXTBOOKS

Review gelman article about textbooks

https://link.springer.com/article/10.3758/BF03213921

https://psycnet.apa.org/buy/1999-00560-017

https://www.ncbi.nlm.nih.gov/pubmed/21692788

BEHAVIORAL STATISTICS TEXTBOOKS: SOURCE OF MYTHS AND MISCONCEPTIONS?

https://www.jstor.org/stable/pdf/1164796.pdf

Statistical Significance Testing: A Historical Overview of Misuse and Misinterpretation with Implications for the Editorial Policies of Educational Journals

http://www.personal.psu.edu/users/d/m/dmr/sigtest/master.pdf#page=28

https://link.springer.com/article/10.3758/s13423-015-0955-8#Fig1

See following section that says “Consider first de Groot (1989), which MU quote in its second edition: “[w]e can then make the statement that the unknown value of μ lies in the interval [...] with confidence 0.95.” This text is now in its fourth edition (de Groot and Schervish, 2012); the current version of the text does not contain the passage MU emphasize. The text later makes clear why it was removed, emphasizing that “the observed interval...is not so easy to interpret...[S]ome people would like to interpret the interval...as meaning that we are 95 % confident that μ is between [the observed confidence limits]. Later...we shall show why such an interpretation is not safe in general” (p. 487). Following an example (coincidentally, one explored by Morey et al. 2015 and dismissed by MU as an “artificial case”), the reason is given:…”

Haller and Krauss also examined NHST explanations in a popular textbook, Introduction to statistics for psychology and education. They note that within a span of just three pages the textbook’s author provides eight different interpretations of a statistically significant test result, all of which are incorrect [https://www.metheval.uni-jena.de/lehre/0405-ws/evaluationuebung/haller.pdf]:

  • “the improbability of observed results being due to error”

  • “the probability that an observed difference is real”

  • “if the probability is low, the null hypothesis is improbable”

  • “the statistical confidence ... with odds of 95 out of 100 that the observed difference will hold up in investigations”

  • “the degree to which experimental results are taken ‘seriously’”

  • “the danger of accepting a statistical result as real when it is actually due only to error”

  • “the degree of faith that can be placed in the reality of the finding”

  • “the investigator can have 95 percent confidence that the sample mean actually differs from the population mean”

Education researchers Jeffrey Gliner, Nancy Leech, and George Morgan examined 12 textbooks popular in graduate-level education programs in the United States and rated how well each book clarified three common NHST misinterpretations [https://www.tandfonline.com/doi/pdf/10.1080/00220970209602058]. Textbooks were categorized into two groups: educational research books and statistics for education books. The three issues were:

  1. There is disagreement about NHST. While not strictly a misinterpretation, the fact that there is disagreement and controversy about NHST is an issue many non-statisticians aren’t aware of. In many texts NHST is taught as if it were a mechanical process; in practice, however, statisticians believe NHST should either not be used or should be accompanied by substantial education about its correct usage [citation].

  2. The p-value does not indicate the strength of a relationship. This is Fallacy 5 above. It is the estimated effect size that most closely corresponds to the notion of “strength of a relationship.” The p-value simply measures how compatible the observed data are with the null hypothesis (see the simulation sketch following this list).

  3. Statistical significance is not the same as practical significance.
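
To make point 2 concrete, below is a minimal simulation sketch in Python (assuming NumPy and SciPy are available; the true effect of 0.02 standard deviations and the sample sizes are arbitrary values chosen for illustration). With a trivially small but nonzero true effect, the p-value of a one-sample t-test collapses toward zero as the sample grows, while the estimated effect size stays tiny; the p-value therefore says little about the strength of the relationship.

    # Minimal sketch (assumed parameters for illustration): a tiny but nonzero
    # true effect of 0.02 standard deviations. As n grows the p-value collapses
    # toward zero while the estimated effect size (Cohen's d) stays tiny.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    true_effect = 0.02  # assumed effect, in standard-deviation units

    for n in (100, 10_000, 1_000_000):
        sample = rng.normal(loc=true_effect, scale=1.0, size=n)
        result = stats.ttest_1samp(sample, popmean=0.0)   # H0: population mean = 0
        cohens_d = sample.mean() / sample.std(ddof=1)     # estimated effect size
        print(f"n={n:>9,}  p-value={result.pvalue:.3g}  estimated d={cohens_d:.3f}")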

Rating Degree of emphasis
0 None
1 Indirect
2 Direct statement but brief; easily missed (no heading, box, examples, etc.)
3 More than a brief statement, but not very helpful in terms of examples or best practice
4 Clear statement with emphasis, examples, or both, and with help regarding best practice
Textbook scoring system for common NHST misinterpretations

Table notes:
1. Reproduced from "Problems With Null Hypothesis Significance Testing (NHST): What Do the Textbooks Say?", Jeffrey Gliner, Nancy Leech, and George Morgan, The Journal of Experimental Education, 2002 [link]


The three authors independently rated each textbook on a scale of 0 to 4, indicating the level of emphasis placed on each issue (disagreements between the raters were later resolved to produce a single consensus score). The rating system is shown in the table above. A rating of 0 indicated there was no discussion of the misinterpretation, while a rating of 4 indicated a full discussion including clear statements, examples, and help regarding best practices.

The authors found that both sets of textbooks did a poor job of covering the first two misinterpretations: disagreement over NHST and the p-value not indicating the strength of a relationship.

No research textbook mentioned that NHST is controversial or that there is disagreement about its use. Some statistics books covered the controversy, but the average score was only 1.17, indicating that the emphasis on the topic was mostly indirect. Only two of the six statistics textbooks were rated 2 or above by the authors.

The misinterpretation regarding the p-value as a measure of the strength of a relationship didn’t fare much better, scoring just 1.5 for research textbooks and 1.67 for statistics textbooks. Four out of the six research textbooks and four out of the six statistics books were rated a 2 or above by the authors.

The fact that statistical significance is not the same as practical significance was mostly well covered. Research textbooks scored 3.33 on average, with all six books receiving a rating of 2 or better from the authors. Somewhat surprisingly, statistics textbooks did slightly worse: just four out of the six scored a rating of 2 or better, for an average score of 3.17. A summary of the scores for the two textbook categories is shown in the table below.

Issue Average score of research texts Average score of statistics texts
There is disagreement about NHST 0 1.17
The p-value does not indicate the strength of a relationship 1.5 1.67
Statistical significance is not the same as practical significance 3.33 3.17
Average scores for graduate education textbooks’ coverage of three common NHST misinterpretations

Table notes:
1. A total of 12 textbooks were reviewed: 6 used in graduate-level research classes in education and 6 used in graduate-level statistics classes in education.
2. Textbooks were scored by three judges on a scale of 0 to 4. A score of 0 indicated no discussion of the issue and a score of 4 indicated a full discussion including clear statements, examples, and help regarding best practices.
3. Reproduced from "Problems With Null Hypothesis Significance Testing (NHST): What Do the Textbooks Say?", Jeffrey Gliner, Nancy Leech, and George Morgan, The Journal of Experimental Education, 2002 [link]

A small number of other researchers have examined textbooks for accuracy and collected examples of books they believe contain errors. Gerd Gigerenzer, Stefan Krauss, and Oliver Vitouch found four books that they believe promote “the illusion that the level of significance would specify the probability of [the] hypothesis,” though without quoting specific examples.

In “On the Probability of Making Type I Errors,” Paul Pollard and John Richardson list a number of articles that misstate the probability of making Type I errors. These articles state that using the conventional Type I error rate of 0.05 causes a false positive 1 out of 20 times; however, they fail to mention that this probability is conditional on the null hypothesis being true. If the null hypothesis is not true, the Type I error rate no longer applies. This is shown in detail in our article on p-value replication.
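
To illustrate why the conditioning matters, here is a hedged simulation sketch in Python (the 80% share of true nulls, the 0.5 SD effect size, and the sample size of 30 are arbitrary assumptions, not values from Pollard and Richardson). Among experiments where the null hypothesis is true, roughly 5% of tests come out significant, as α promises; but among all significant results, the fraction that are false positives depends on the base rate of true nulls and on power, and is generally not 5%.

    # Sketch of why alpha is conditional on H0 being true (assumed parameters for
    # illustration: 80% of tested hypotheses have a true null, real effects are
    # 0.5 SD, each experiment uses n = 30 observations).
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n_experiments, n, alpha = 20_000, 30, 0.05
    null_is_true = rng.random(n_experiments) < 0.80       # assumed base rate of true nulls
    true_means = np.where(null_is_true, 0.0, 0.5)         # 0 under H0, 0.5 SD otherwise

    p_values = np.empty(n_experiments)
    for i, mu in enumerate(true_means):
        sample = rng.normal(loc=mu, scale=1.0, size=n)
        p_values[i] = stats.ttest_1samp(sample, popmean=0.0).pvalue

    significant = p_values < alpha
    # False-positive rate *given* the null is true: close to alpha (about 5%).
    print("P(significant | H0 true) =", round(significant[null_is_true].mean(), 3))
    # Share of significant results that are false positives: generally not 5%.
    print("P(H0 true | significant) =", round(null_is_true[significant].mean(), 3))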

In late January 2020, PhD candidate Felix Singleton Thorn (@FSingletonThorn) posted a Twitter thread that included numerous examples of confusing or incorrect definitions of p-values and of the null hypothesis significance testing framework.

Examples from these three sources are compiled in the table below, along with the name of the textbook, the year of publication, and the statistical error, if stated.

Textbook name Year published Statistical error
Differential psychology1 1958 Not stated
Statistical analysis in psychology and education1 1959 Not stated
Statistical analysis in educational research1 1959 Not stated
Psychology: The science of mental life1 1973 Not stated
Fundamental statistics in psychology and education2 1978 "The probability of making a Type 1 error is very simply and directly indicated by α"
Basic statistics for business and economics2 1977 "The probability of committing a Type 1 error, which is denoted by α, is called the significance level of the test"
Experimental design and statistics2 1975 "The probability of committing such an error is actually equivalent to the significance level we select..." and "If we reject the null hypothesis whenever the chance of it being true is less than 0.05, then obviously we shall be wrong 5 per cent of the time..."
Understanding statistics in the behavioral sciences2 1981 "Alpha determines the probability of making a Type I error"
Experiment, design, and statistics in psychology2 1973 "The significance level is simply the probability of making a Type 1 error"
Experimental methodology2 1980 "If the .05 significance level is set, you run the risk of being wrong and committing Type 1 error five times in 100"
Introduction to design and analysis: A students' handbook2 1980 "We will make a Type 1 error a small percentage of the time—the exact amount being specified by our significance level"
Learning to use statistical tests in psychology: A students' guide2 1982 "after correctly describing the significance level as a conditional prior probability, later refer to it as the 'percentage probability...that your results are due to chance'"
Nonparametric statistics for the behavioral sciences2 1956 "In fact, the probability that [the null hypothesis is true, but the p-value lies in the rejection region]...is given by α, for rejecting H0 when in fact it is true is the Type 1 error"
A Dictionary of Dentistry3 2010 "For example, a p-value of 0.01 (p = 0.01) means there is a 1 in 100 chance the result occurred by chance."
Oxford Dictionary of Biochemistry and Molecular Biology3 2008 "The closer the p‐value to 0, the more significant the match."
A Dictionary of Business Research Methods3 2016 "... part of the output of a statistical test, such as regression or ANOVA, that can take a value between 0.0 and 1.0. This is the probability that the observed difference or association among variables has arisen by chance."
A Dictionary of Social Research Methods3 2016 "The p-value is the probability that the observed data are consistent with the null hypothesis."
Statistical errors in textbooks

Table notes:
1. Source: "The Null Ritual What You Always Wanted to Know About Significance Testing but Were Afraid to Ask," Gerd Gigerenzer, Stefan Krauss, and Oliver Vitouch, The Sage handbook of quantitative methodology for the social sciences, 2004 [link] The authors do not specify the actual statistical misinterpretations, simply citing that the authors of the listed textbooks promote, "the illusion that the level of significance would specify the probability of [the] hypothesis."
2. Source: "On the Probability of Making Type I Errors," P. Pollard and J. T. E. Richardson, Psychological Bulletin, 1987 [link]
3. Source: These were collected in a Twitter thread by Felix Singleton Thorn (@FSingletonThorn), a PhD candidate studying statistics in the Interdisciplinary Meta Research Group at the University of Melbourne [link].

JOURNAL ARTICLES EXAMINING MISINTERPRETATIONS IN THE LITERATURE

See Interpretation of the p value: A national survey study in academic psychologists from Spain (http://www.psicothema.com/pdf/4266.pdf) for a set of studies analyzing misinterpretations of NHST in journal articles.

Statistical significance testing as it relates to practice: Use within Professional Psychology: Research and Practice.

https://psycnet.apa.org/buy/1999-00560-017

A peculiar prevalence of p values just below .05

https://www.tandfonline.com/doi/abs/10.1080/17470218.2012.711335

The null hypothesis significance test in health sciences research (1995-2006): statistical analysis and interpretation

https://link.springer.com/article/10.1186/1471-2288-10-44

Quantifying Support for the Null Hypothesis in Psychology: An Empirical Investigation

https://journals.sagepub.com/doi/10.1177/2515245918773742

The prevalence of statistical reporting errors in psychology (1985–2013)

https://link.springer.com/article/10.3758/s13428-015-0664-2

False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant

https://journals.sagepub.com/doi/full/10.1177/0956797611417632

What’s in a p? Reassessing Best Practices for Conducting and Reporting Hypothesis-Testing Research

https://link.springer.com/chapter/10.1007/978-3-030-22113-3_4

Reflections Concerning Recent Ban on NHST and Confidence Intervals

https://digitalcommons.wayne.edu/cgi/viewcontent.cgi?referer=https://scholar.google.com/&httpsredir=1&article=2077&context=jmasm

THE INTERPRETATION OF EFFECT SIZE IN PUBLISHED ARTICLES

https://iase-web.org/icots/9/proceedings/pdfs/ICOTS9_6B3_HOEKSTRA.pdf

Reporting of statistical inference in the Journal of Applied Psychology: Little evidence of reform

Reformers have long argued that misuse of Null Hypothesis Significance Testing (NHST) is widespread and damaging. The authors analyzed 150 articles from the Journal of Applied Psychology (JAP) covering 1940 to 1999. They examined statistical reporting practices related to misconceptions about NHST, American Psychological Association (APA) guidelines, and reform recommendations. The analysis reveals (a) inconsistency in reporting alpha and p values, (b) the use of ambiguous language in describing NHST, (c) frequent acceptance of null hypotheses without consideration of power, (d) that power estimates are rarely reported, and (e) that confidence intervals were virtually never used. APA guidelines have been followed only selectively. Research methodology reported in JAP has increased greatly in sophistication over 60 yrs, but inference practices have shown remarkable stability. There is little sign that decades of cogent critiques by reformers had by 1999 led to changes in statistical reporting practices in JAP.

Probability as certainty: Dichotomous thinking and the misuse of p values

Rink Hoekstra, Sue Finch, Henk A. L. Kiers, and Addie Johnson (2006)

Significance testing is widely used and often criticized. The Task Force on Statistical Inference of the American Psychological Association (TFSI, APA; Wilkinson & TFSI, 1999) addressed the use of significance testing and made recommendations that were incorporated in the fifth edition of the APA Publication Manual (APA, 2001). They emphasized the interpretation of significance testing and the importance of reporting confidence intervals and effect sizes. We examined whether 286 Psychonomic Bulletin & Review articles submitted before and after the publication of the TFSI recommendations by APA complied with these recommendations. Interpretation errors when using significance testing were still made frequently, and the new prescriptions were not yet followed on a large scale. Changing the practice of reporting statistics seems doomed to be a slow process.

JOURNAL ARTICLES CITING NHST MISINTERPRETATIONS

A large number of journal articles have been published critiquing NHST usage in various academic disciplines. In each case the implication is that, in the authors’ experience, NHST misinterpretations are common enough in their discipline that a corrective is warranted. More than 60 such articles across several dozen disciplines are outlined in the table below. These disciplines are as varied as hematology and wildlife management, machine learning and accounting. The table is not a complete list of all such articles, nor is it meant to be. Instead, its purpose is to give some sense of the breadth of disciplines in which authors have identified NHST misinterpretations as a key problem. Likewise, because almost all of the articles were published in peer-reviewed journals, the journal editors must also have felt at some level that NHST misinterpretations were a concern worth addressing.

To generate the table a semi-structured search was used to identify articles and book chapters where the authors implicitly or explicitly identify NHST misinterpretations. The search largely consisted of the two phrases “p-value misinterpretations” and “null hypothesis misinterpretations” with occasional variations such as adding the name of a specific academic discipline. The search included numerous digital journal libraries such as Google Scholar, ResearchGate, JSTOR, SpringerLink, and Access to Research. NHST misinterpretations included authors identifying misuse or misunderstandings about p-values, Type I errors, statistical power, sample size, standard errors, or other components of the NHST framework. The articles span several decades, although an attempt was made to find recent articles when possible, with a significant portion published in the previous 10 years and more than half published in the previous five.

An explicit example of NHST misinterpretations comes from the 2011 article “P Values: Use and Misuse in Medical Literature” in the American Journal of Hypertension:

P values are widely used in the medical literature but many authors, reviewers, and readers are unfamiliar with a valid definition of a P value, let alone how to interpret one correctly...The article points out how to better interpret P values by avoiding common errors.

This article is considered an explicit example of the author identifying NHST misinterpretations because the abstract explicitly notes that many authors, reviewers, and readers do not know how to interpret a p-value, a key ingredient in the NHST framework.

Contrast the above example with an implicit set of misinterpretations identified in “Significance Testing in Accounting Research: A Critical Evaluation Based on Evidence,” a 2018 article appearing in the journal Abacus, a journal of accounting, finance, and business:

From a survey of the papers published in leading accounting journals in 2014, we find that accounting researchers conduct significance testing almost exclusively at a conventional level of significance, without considering key factors such as the sample size or power of a test. We present evidence that a vast majority of the accounting studies favour large or massive sample sizes and conduct significance tests with the power extremely close to or equal to one. As a result, statistical inference is severely biased towards Type I error, frequently rejecting the true null hypotheses. Under the ‘p‐value less than 0.05’ criterion for statistical significance, more than 90% of the surveyed papers report statistical significance. However, under alternative criteria, only 40% of the results are statistically significant. We propose that substantial changes be made to the current practice of significance testing for more credible empirical research in accounting.

This article is considered an implicit example of the author identifying NHST misinterpretations because nowhere is it mentioned that there are misinterpretations within the field of accounting. Instead, the authors imply misinterpretations are present by noting that key elements of NHST are used in a manner the authors feel is incorrect. The implication is that if accounting researchers knew how to interpret all of the components of NHST correctly a different pattern of usage would have been observed within the accounting journals surveyed.

The passage from the American Journal of Hypertension demonstrates that some of the language used to identify the scope of misinterpretations is general rather than specific. For example, the authors call out the “medical literature” in general, not hypertension as a specific subfield. Nonetheless, the assumption is that the authors consider NHST misinterpretations a problem within the medical subfield of hypertension, even if they do not believe the misinterpretations are limited to that subfield. If the authors did not feel NHST misinterpretations were a concern in hypertension research, presumably they would have published their comments in a more relevant journal.

The table does not include submission or editorial guidance from journals, which sometimes includes notes on the use of NHST and p-values [https://link.springer.com/chapter/10.1007/978-3-030-22113-3_4]. Nor does the table include tutorials or primers on NHST unless misinterpretations are explicitly called out. Many articles attempt to address NHST with various solutions, for example by suggesting a transition to Bayesian statistics or information criteria approaches. Inclusion in the table below does not imply an endorsement of such proposals. Indeed, many such proposals have their own associated critiques, misuse, and misinterpretations.

The table includes the article title, authors, journal in which the article appeared, year of publication, and the relevant passage from the abstract in which misinterpretations were outlined (in a limited number of cases the article did not have an abstract and text was pulled from the body of the article). We have not yet read the full text of every article below, so readers are encouraged to seek out relevant articles on their own and examine them in detail as needed.

Article title Authors Year Journal Description
The Significance of Statistical Significance Tests in Marketing Research [link] Sawyer and Peter 1983 Journal of Marketing Research "Classical statistical significance testing is the primary method by which marketing researchers empirically test hypotheses and draw inferences about theories. The authors discuss the interpretation and value of classical statistical significance tests and suggest that classical inferential statistics may be misinterpreted and overvalued by marketing researchers in judging research results."
Why We Don't Really Know What Statistical Significance Means: Implications for Educators [link] Hubbard and Armstrong 2006 Journal of Marketing Education "In marketing journals and market research textbooks, two concepts of statistical significance—p values and α levels—are commonly mixed together. This is unfortunate because they each have completely different interpretations. The upshot is that many investigators are confused over the meaning of statistical significance. We explain how this confusion has arisen and make several suggestions to teachers and researchers about how to overcome it."
Significance Testing in Accounting Research: A Critical Evaluation Based on Evidence [link] Kim, Ahmed, and Ji 2018 Abacus: A Journal of Accounting, Finance, and Business Studies "From a survey of the papers published in leading accounting journals in 2014, we find that accounting researchers conduct significance testing almost exclusively at a conventional level of significance, without considering key factors such as the sample size or power of a test. We present evidence that a vast majority of the accounting studies favour large or massive sample sizes and conduct significance tests with the power extremely close to or equal to one. As a result, statistical inference is severely biased towards Type I error, frequently rejecting the true null hypotheses. Under the ‘p‐value less than 0.05’ criterion for statistical significance, more than 90% of the surveyed papers report statistical significance. However, under alternative criteria, only 40% of the results are statistically significant. We propose that substantial changes be made to the current practice of significance testing for more credible empirical research in accounting."
Tackling False Positives in Finance: A Statistical Toolbox with Applications [link] Kim 2018 31st Australasian Finance and Banking Conference 2018 "Serious concerns have been raised that false positive findings are widespread in empirical research in business research including finance. This is largely because researchers almost exclusively adopt the 'p-value less than 0.05' criterion for statistical significance; and they are often not fully aware of large-sample biases which can potentially mislead their research outcomes. This paper proposes that a statistical toolbox (rather than a single hammer) be used in empirical research, which offers researchers a range of statistical instruments, including alternatives to the p-value criterion and cautionary analyses for large-sample bias. It is found that the positive results obtained under the p-value criterion cannot stand, when the toolbox is applied to three notable studies in finance."
A Re-Interpretation of 'Significant' Empirical Financial Research [link] Kellner and Roesch 2018 White paper [finance] "Currently, the use and interpretation of statistics and p-values is under scrutiny in various scientific fields for several reasons: p-hacking, data dredging, misinterpretation, multiple testing, or selective reporting, among others. To the best of our knowledge, this discussion has hardly reached the empirical finance community. Thus, the aim of this paper is to show how the typical p-value based analysis of empirical findings in finance can be fruitfully enriched by the supplemental use of further statistical tools."
Comments and recommendations regarding the hypothesis testing controversy [link] Bonett and Wright 2007 Journal of Organizational Behavior "Hypothesis tests are routinely misinterpreted in scientific research. Specifically, the failure to reject a null hypothesis is often interpreted as support for the null hypothesis while the rejection of a null hypothesis is often interpreted as evidence of an important finding. Many of the most frequently used hypothesis tests are 'non-informative because the null hypothesis is known to be false prior to hypothesis testing. We discuss the limitations of non-informative hypothesis tests and explain why confidence intervals should be used in their place. Several examples illustrate the use and interpretation of confidence intervals."
Time to dispense with the p-value in OR? [link] Hofmann and Meyer-Nieberg 2017 Central European Journal of Operations Research "Null hypothesis significance testing is the standard procedure of statistical decision making, and p-values are the most widespread decision criteria of inferential statistics both in science, in general, and also in operations research...Yet, the use of significance testing in the analysis of research data has been criticized from numerous statisticians—continuously for almost 100 years...Is it time to dispense with the p-value in OR? The answer depends on many factors, including the research objective, the research domain, and, especially, the amount of information provided in addition to the p-value. Despite this dependence from context three conclusions can be made that should concern the operational analyst: First, p-values can perfectly cast doubt on a null hypothesis or its underlying assumptions, but they are only a first step of analysis, which, stand alone, lacks expressive power. Second, the statistical layman almost inescapably misinterprets the evidentiary value of p-values. Third and foremost, p-values are an inadequate choice for a succinct executive summary of statistical evidence for or against a research question..."
Misinformation in MIS Research: The Problem of Statistical Power [link] Baroudi and Orlikowski 2018 White paper [Management Information Systems] "This study reviews 57 MIS articles employing statistical inference testing published in leading MIS journals over the last five years. The statistical power of the articles was evaluated and found on average to fall substantially below the accepted norms. The consequence of low power is that it can lead to misinterpretation of data and results. Collectively these misinterpretations result in a body of MIS research that is built on potentially erroneous conclusions."
Institutionalized dualism: statistical significance testing as myth and ceremony [link] Orlitzky 2011 Journal of Management Control "Several well-known statisticians regard significance testing as a deeply problematic procedure in statistical inference. Yet, in-depth discussion of null hypothesis significance testing (NHST) has largely been absent from the literature on organizations or, more specifically, management control systems. This article attempts to redress this oversight by drawing on neoinstitutional theory to frame, analyze, and explore the NHST problem. Regulative, normative, and cultural-cognitive forces partly explain the longevity of NHST in organization studies. The unintended negative consequences of NHST include a reinforcement of the academic-practitioner divide, an obstacle to the growth of knowledge, discouragement of study replications, and mechanization of researcher decision making. An appreciation of these institutional explanations for NHST as well as the harm caused by NHST may ultimately help researchers develop superior methodological alternatives to a controversial statistical technique."
Caveats for using statistical significance tests in research assessments [link] Schneider 2013 Journal of Informetrics "This article raises concerns about the advantages of using statistical significance tests in research assessments...Statistical significance tests are highly controversial and numerous criticisms have been leveled against their use. Based on examples from articles by proponents of the use of statistical significance tests in research assessments, we address some of the numerous problems with such tests...we generally believe that statistical significance tests are over- and misused in the empirical sciences including scientometrics and we encourage a reform on these matters."
Fair Statistical Communication in HCI [link] Dragicevic 2016 Book chapter in Modern Statistical Methods for HCI, part of the Human-Computer Interaction book series "[A]reas such as human-computer interaction (HCI) have adopted tools — i.e., p-values and dichotomous testing procedures — that have proven to be poor at supporting these tasks [of advancing scientific knowledge]. The abusive use of these procedures has been severely criticized in a range of disciplines for several decades...This chapter explains in a non-technical manner why it would be beneficial for HCI to switch to an estimation approach..."
Statistics [link] de Winter and Dodou 2017 Book chapter in Human Subject Research for Engineers "After the measurements have been completed, the data have to be statistically analysed. This chapter explains how to analyse data and how to conduct statistical tests...We draw attention to pitfalls that may occur in statistical analyses, such as misinterpretations of null hypothesis significance testing and false positives. Attention is also drawn to questionable research practices and their remedies. Replicability of research is also discussed, and recommendations for maximizing replicability are provided."
Statistical Significance and the Dichotomization of Evidence [link] McShane and Gal 2017 Journal of the American Statistical Association "In this article, we review research on how applied researchers who are not primarily statisticians misuse and misinterpret p-values in practice and how this can lead to errors in the interpretation of evidence. We also present new data showing, perhaps surprisingly, that researchers who are primarily statisticians are also prone to misuse and misinterpret p-values thus resulting in similar errors. In particular, we show that statisticians tend to interpret evidence dichotomously based on whether or not a p-value crosses the conventional 0.05 threshold for statistical significance. We discuss implications and offer recommendations."
The Waves and the Sigmas [link] Giulio D’Agostini 2016 White paper [physics] "This paper shows how p-values do not only create, as well known, wrong expectations in the case of flukes, but they might also dramatically diminish the ‘significance’ of most likely genuine signals. As real life examples, the 2015 first detections of gravitational waves are discussed. The March 2016 statement of the American Statistical Association, warning scientists about interpretation and misuse of p-values, is also reminded and commented. (The paper is complemented with some remarks on past, recent and future claims of discoveries based on sigmas from Particles Physics.)"
Should significance testing be abandoned in machine learning? [link] Berrar and Dubitzky 2019 International Journal of Data Science and Analytics "Significance testing has become a mainstay in machine learning, with the p value being firmly embedded in the current research practice. Significance tests are widely believed to lend scientific rigor to the interpretation of empirical findings; however, their problems have received only scant attention in the machine learning literature so far...Our main result suggests that significance tests should not be used in such comparative studies. We caution that the reliance on significance tests might lead to a situation that is similar to the reproducibility crisis in other fields of science. We offer for debate four avenues that might alleviate the looming crisis."
Understanding Statistical Hypothesis Testing: The Logic of Statistical Inference [link] Glaser 2019 Machine Learning and Knowledge Extraction "Statistical hypothesis testing is among the most misunderstood quantitative analysis methods from data science...In this paper, we discuss the underlying logic behind statistical hypothesis testing, the formal meaning of its components and their connections."
Confidence curves: an alternative to null hypothesis significance testing for the comparison of classifiers [link] Berrar 2016 Machine Learning "Null hypothesis significance testing is routinely used for comparing the performance of machine learning algorithms. Here, we provide a detailed account of the major underrated problems that this common practice entails...[T]he view that significance tests are essential to a sound and objective interpretation of classification results, our study suggests that no such tests are needed. Instead, greater emphasis should be placed on the magnitude of the performance difference and the investigator’s informed judgment."
Not All Claims are Created Equal: Choosing the Right Approach to Assess Your Hypotheses [link] Azer, Khashabi, Sabharwal, Roth 2019 White paper [Natural language processing] "Empirical research in Natural Language Processing (NLP) has adopted a narrow set of principles for assessing hypotheses, relying mainly on p-value computation, which suffers from several known issues. While alternative proposals have been well-debated and adopted in other fields, they remain rarely discussed or used within the NLP community...[C]ommon fallacies, misconceptions, and misinterpretation surrounding hypothesis assessment methods often stem from a discrepancy between what one would like to claim versus what the method used actually assesses. Our survey reveals that these issues are omnipresent in the NLP research community. As a step forward, we provide best practices and guidelines tailored to NLP research..."
A systematic review of statistical power in software engineering experiments [link] Dyba, Kampenes, and Sjoberg 2006 Information and Software Technology "This paper reports a quantitative assessment of the statistical power of empirical software engineering research based on the 103 papers on controlled experiments (of a total of 5,453 papers) published in nine major software engineering journals and three conference proceedings in the decade 1993–2002. The results show that the statistical power of software engineering experiments falls substantially below accepted norms as well as the levels found in the related discipline of information systems research. Given this study's findings, additional attention must be directed to the adequacy of sample sizes and research designs to ensure acceptable levels of statistical power. Furthermore, the current reporting of significance tests should be enhanced by also reporting effect sizes and confidence intervals."
“Bad Smells" in Software Analytics Papers [link] Information and Software Technology 2019 Information and Software Technology The authors list 12 "bad smells." Among them are "over-reliance on, or abuse of, null hypothesis significance testing," "p-hacking," and "under-powered studies."
The limits of p-values for biological data mining [link] Malley, Dasgupta, and Moore 2013 BioData Mining "Use of p-values is widespread in the sciences, especially so in biomedical research, and also underlies several analytic approaches in data mining...However, simple criticisms and essential distinctions are immediate..."
Impact of Criticism of Null-Hypothesis Significance Testing on Statistical Reporting Practices in Conservation Biology [link] Fidler et al. 2006 Conservation Biology "Over the last decade, criticisms of null-hypothesis significance testing have grown dramatically...Have these calls for change had an impact on the statistical reporting practices in conservation biology? In 2000 and 2001, 92% of sampled articles in Conservation Biology and Biological Conservation reported results of null-hypothesis tests. In 2005 this figure dropped to 78%...Of those articles reporting null-hypothesis testing — which still easily constitute the majority — very few report statistical power (8%) and many misinterpret statistical nonsignificance as evidence for no effect (63%). Overall, results of our survey show some improvements in statistical practice, but further efforts are clearly required to move the discipline toward improved practices."
Misuse of null hypothesis significance testing: would estimation of positive and negative predictive values improve certainty of chemical risk assessment? [link] Bundschuh et al. 2013 Environmental Science and Pollution Research "Although generally misunderstood, the p value is the probability of the test results or more extreme results given H0 is true: it is not the probability of H0 being true given the results...[Our] approach also reinforces the value of considering α, β, and a biologically meaningful effect size a priori."
Statistics: reasoning on uncertainty, and the insignificance of testing null [link] Läärä 2009 Annales Zoologici Fennici "The practice of statistical analysis and inference in ecology is critically reviewed. The dominant doctrine of null hypothesis significance testing (NHST) continues to be applied ritualistically and mindlessly. This dogma is based on superficial understanding of elementary notions of frequentist statistics in the 1930s, and is widely disseminated by influential textbooks targeted at biologists. It is characterized by silly null hypotheses and mechanical dichotomous division of results being “significant” (P < 0.05) or not. Simple examples are given to demonstrate how distant the prevalent NHST malpractice is from the current mainstream practice of professional statisticians. Masses of trivial and meaningless “results” are being reported, which are not providing adequate quantitative information of scientific interest. The NHST dogma also retards progress in the understanding of ecological systems and the effects of management programmes, which may at worst contribute to damaging decisions in conservation biology."
The Insignificance of Statistical Significance Testing [link] Johnson 1999 The Journal of Wildlife Management "Despite their wide use in scientific journals such as The Journal of Wildlife Management, statistical hypothesis tests add very little value to the products of research...This paper describes how statistical hypothesis tests are often viewed, and then contrasts that interpretation with the correct one."
Viewpoint: Improving Range Science through the Appropriate Use of Statistics [link] Gould and Steiner 2002 Journal of Range Management "We examined a stratified random sample of articles published over 3 decades of the Journal of Range Management to study the applications and changes in statistical methodology employed by range scientists...Articles that reported an effect size via a sample mean frequently did not report an associated standard error...Improper identification of the experimental or sampling unit and/or the interdependence of observations occurred in all decades. We recommend increased inferential use of confidence intervals and suggest that the practical significance (as opposed to statistical significance) of results be considered more often..."
How significant (p < 0.05) is geomorphic research? [link] Hutton 2014 Earth Surface Processes and Landforms "The pervasive application of the Null Hypothesis Significance Test in geomorphic research runs counter to widespread, long running, and often severe criticism of the method in the broader scientific literature...Though not without their own assumptions, wider application of such [Bayesian] methods can help facilitate a transition towards a broader approach to statistical, and in turn, scientific inference in geomorphic systems."
Inference without significance: measuring support for hypotheses rather than rejecting them [link] Gerrodette 2011 Marine Ecology "Despite more than half a century of criticism, significance testing continues to be used commonly by ecologists. Significance tests are widely misused and misunderstood, and even when properly used, they are not very informative for most ecological data. Problems of misuse and misinterpretation include: (i) invalid logic; (ii) rote use; (iii) equating statistical significance with biological importance; (iv) regarding the P‐value as the probability that the null hypothesis is true; (v) regarding the P‐value as a measure of effect size; and (vi) regarding the P‐value as a measure of evidence. Significance tests are poorly suited for inference because they pose the wrong question. In addition, most null hypotheses in ecology are point hypotheses already known to be false, so whether they are rejected or not provides little additional understanding. Ecological data rarely fit the controlled experimental setting for which significance tests were developed."
I can see clearly now: Reinterpreting statistical significance [link] Dushoff, Kain, and Bolker 2019 Methods in Ecology and Evolution "Null hypothesis significance testing (NHST) remains popular despite decades of concern about misuse and misinterpretation. We believe that much misinterpretation of NHST is due to language: significance testing has little to do with other meanings of the word ‘significance’. We therefore suggest that researchers describe the conclusions of null‐hypothesis tests in terms of statistical ‘clarity’ rather than ‘significance’."
Invasive Plant Researchers Should Calculate Effect Sizes, Not P-Values [link] Rinella and James 2010 Invasive Plant Science and Management "Null hypothesis significance testing (NHST) forms the backbone of statistical inference in invasive plant science. Over 95% of research articles in Invasive Plant Science and Management report NHST results such as P-values or statistics closely related to P-values such as least significant differences. Unfortunately, NHST results are less informative than their ubiquity implies. P-values are hard to interpret and are regularly misinterpreted. Also, P-values do not provide estimates of the magnitudes and uncertainties of studied effects, and these effect size estimates are what invasive plant scientists care about most. In this paper, we reanalyze four datasets (two of our own and two of our colleagues; studies put forth as examples in this paper are used with permission of their authors) to illustrate limitations of NHST. The re-analyses are used to build a case for confidence intervals as preferable alternatives to P-values. Confidence intervals indicate effect sizes, and compared to P-values, confidence intervals provide more complete, intuitively appealing information on what data do/do not indicate."
Wise use of statistical tools in ecological field studies [link] Halvorsen 2007 Folia Geobotanica "The currently dominating hypothetico-deductive research paradigm for ecology has statistical hypothesis testing as a basic element. Classic statistical hypothesis testing does, however, present the ecologist with two fundamental dilemmas when field data are to be analyzed...I argue that a research strategy entirely based on rigorous statistical testing of hypotheses is insufficient for field ecological data and that inductive and deductive approaches are complementary in the process of building ecological knowledge. I recommend that great care is taken when statistical tests are applied to ecological field data. Use of less formal modelling approaches is recommended for cases when formal testing is not strictly needed. Sets of recommendations, “Guidelines for wise use of statistical tools”, are proposed both for testing and for modelling. Important elements of wise-use guidelines are parallel use of methods that preferably belong to different methodologies, selection of methods with few and less rigorous assumptions, conservative interpretation of results, and abandonment of definitive decisions based a predefined significance level."
Does the P Value Have a Future in Plant Pathology? (Letter to the editor) [link] Madden, Shah, and Esker 2015 Phytopathology "The P value (significance level) is possibly the mostly widely used, and also misused, quantity in data analysis. P has been heavily criticized on philosophical and theoretical grounds, especially from a Bayesian perspective....Plant pathologists may view this latest round of P value criticism with anything from indifference to alarm. If anything, we do hope the new round of attention will increase awareness within our own discipline of the very likely misuse of the calculated P value in plant pathological science. "
Explicit consideration of critical effect sizes and costs of errors can improve decision-making in plant science [link] Mudge 2013 New Phytologist "Plant scientists...are responsible for ensuring that data are analyzed and presented in a way that facilitates good decision-making. Good statistical practices can be an important tool for ensuring objective and transparent and data analysis. Despite frequent criticism over the last few decades, null hypothesis significance testing (NHST) remains widely used in plant science...Use of a consistent significance threshold regardless of sample size has contributed to frequent misinterpretations of P-values as being ‘highly significant’ or ‘marginally significant’, and/or as measures of how likely the alternate hypothesis is to be true..."
The continuing misuse of null hypothesis significance testing in biological anthropology [link] Smith 2018 American Journal of Physical Anthropology "There is over 60 years of discussion in the statistical literature concerning the misuse and limitations of null hypothesis significance tests (NHST). Based on the prevalence of NHST in biological anthropology research, it appears that the discipline generally is unaware of these concerns..."
Common scientific and statistical errors in obesity research [link] George et al. 2016 Obesity "This review identifies 10 common errors and problems in the statistical analysis, design, interpretation, and reporting of obesity research and discuss how they can be avoided."
Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations [link] Greenland et al. 2016 European Journal of Epidemiology "Misinterpretation and abuse of statistical tests, confidence intervals, and statistical power have been decried for decades, yet remain rampant. A key problem is that there are no interpretations of these concepts that are at once simple, intuitive, correct, and foolproof. Instead, correct use and interpretation of these statistics requires an attention to detail which seems to tax the patience of working scientists. This high cognitive demand has led to an epidemic of shortcut definitions and interpretations that are simply wrong, sometimes disastrously so—and yet these misinterpretations dominate much of the scientific literature...We then provide an explanatory list of 25 misinterpretations of P values, confidence intervals, and power. We conclude with guidelines for improving statistical interpretation and reporting."
Toward evidence-based medical statistics. 1: The P value fallacy [link] Goodman 1999 Annals of Internal Medicine "There is little appreciation in the medical community that the [hypothesis test] methodology is an amalgam of incompatible elements, whose utility for scientific inference has been the subject of intense debate among statisticians for almost 70 years. This article introduces some of the key elements of that debate and traces the appeal and adverse impact of this methodology to the P value fallacy."
P Values: Use and Misuse in Medical Literature [link] Cohen 2011 American Journal of Hypertension "P values are widely used in the medical literature but many authors, reviewers, and readers are unfamiliar with a valid definition of a P value, let alone how to interpret one correctly...The article points out how to better interpret P values by avoiding common errors."
Understanding the Role of P Values and Hypothesis Tests in Clinical Research [link] Mark, Lee, and Harrell 2016 JAMA Cardiology "P values and hypothesis testing methods are frequently misused in clinical research. Much of this misuse appears to be owing to the widespread, mistaken belief that they provide simple, reliable, and objective triage tools for separating the true and important from the untrue or unimportant."
A Dirty Dozen: Twelve P-Value Misconceptions [link] Goodman 2008 Seminars in Hematology "The P value is a measure of statistical evidence that appears in virtually all medical research papers. Its interpretation is made extraordinarily difficult because it is not part of any formal system of statistical inference. As a result, the P value’s inferential meaning is widely and often wildly misconstrued, a fact that has been pointed out in innumerable papers and books appearing since at least the 1940s. This commentary reviews a dozen of these common misinterpretations and explains why each is wrong."
The Value of a p-Valueless Paper [link] Connor 2004 American Journal of Gastroenterology "As is common in current biomedical research, about 85% of original contributions in The American Journal of Gastroenterology in 2004 have reported p-values. However, none are reported in this issue's article by Abraham et al. who, instead, rely exclusively on effect size estimates and associated confidence intervals to summarize their findings. Authors using confidence intervals communicate much more information in a clear and efficient manner than those using p-values. This strategy also prevents readers from drawing erroneous conclusions caused by common misunderstandings about p-values. I outline how standard, two-sided confidence intervals can be used to measure whether two treatments differ or test whether they are clinically equivalent."
To P or Not to P : Backing Bayesian Statistics [link] Buchinsky and Chadha 2017 Otolaryngology Head and Neck Surgery "In biomedical research, it is imperative to differentiate chance variation from truth before we generalize what we see in a sample of subjects to the wider population. For decades, we have relied on null hypothesis significance testing, where we calculate P values for our data to decide whether to reject a null hypothesis. This methodology is subject to substantial misinterpretation and errant conclusions..."
Computation of measures of effect size for neuroscience data sets [link] Hentschke and Stuttgen 2011 European Journal of Neuroscience "Here we review the most common criticisms of significance testing and provide several examples from neuroscience where use of MES [measures of effect size] conveys insights not amenable through the use of P‐values alone. We introduce an open‐access matlab toolbox providing a wide range of MES to complement the frequently used types of hypothesis tests, such as t‐tests and analysis of variance."
Can nursing epistemology embrace p-values? [link] Ou, Hall, and Thorne 2017 Nursing Philosophy "The use of correlational probability values (p-values) as a means of evaluating evidence in nursing and health care has largely been accepted uncritically. There are reasons to be concerned about an uncritical adherence to the use of significance testing, which has been located in the natural science paradigm...Nursing has been minimally involved in the rich debate about the controversies of treating significance testing as evidentiary in the health and social sciences...We argue that nursing needs to critically reflect on the limitations associated with this tool of the evidence-based movement, given the complexities and contextual factors that are inherent to nursing epistemology. Such reflection will inform our thinking about what constitutes substantive knowledge for the nursing discipline."
P Value and the Theory of Hypothesis Testing An Explanation for New Researchers [link] Jean Biau, Jolles, Porcher 2009 Clinical Orthopaedics and Related Research "[T]he p value and the theory of hypothesis testing are different theories that often are misunderstood and confused, leading researchers to improper conclusions. Perhaps the most common misconception is to consider the p value as the probability that the null hypothesis is true rather than the probability of obtaining the difference observed, or one that is more extreme, considering the null is true. Another concern is the risk that an important proportion of statistically significant results are falsely significant. Researchers should have a minimum understanding of these two theories so that they are better able to plan, conduct, interpret, and report scientific experiments."
Misconceptions, Misuses, and Misinterpretations of P Values and Significance Testing [link] Gagnier & Morgenstern 2017 The Journal of Bone and Joint Surgery "The purpose of this article was to discuss these principles. We make several recommendations for moving forward: (1) Authors should avoid statements such as 'statistically significant' or 'statistically nonsignificant.' (2) Investigators should report the magnitude of effect of all outcomes together with the appropriate measure of precision or variation..."
Statistics in ophthalmology revisited: the (effect) size matters [link] Grzybowski and Mianowany 2018 Acta Ophthalmologica "[T]he null hypothesis significance testing (NHST) has been the kernel of statistical data analysis for a long period. Medicine is making use of it in a very wide range. The main problem in NHST arises out of a dichotomic perception of reality. Scientific reasoning was biased in the aftermath of the uncritical, almost dogmatic, trust in a level of statistical significance, namely a p‐value, defined as ‘the dictating paradigm of p‐value’...The p‐value refers directly to a formulated hypothesis only, that is the probability of observing a given outcome under the condition posited by a specific null hypothesis and given a specific model of the distribution of outcomes under the null hypothesis...To enhance the plausibility and true scientific value of research works, investigators should also take into consideration the effect size along with its confidence limits..."
The role of P-values in analysing trial results [link] Freeman 1993 Statistics in Medicine "Reasons for grave concern over the present situation [widespread use of p-values] range from the unsatisfactory nature of p-values themselves, their very common misunderstanding by statisticians as well as by clinicians and their serious distorting influence on our perception of the very nature of clinical trials. Some of the ways [more sensible reporting can be introduced]...are discussed."
P in the right place: Revisiting the evidential value of P-values [link] Lytsy 2018 Journal of Evidence-Based Medicine "P‐values are often calculated when testing hypotheses in quantitative settings, and low P‐values are typically used as evidential measures to support research findings in published medical research. This article reviews old and new arguments questioning the evidential value of P‐values. Critiques of the P‐value include that it is confounded, fickle, and overestimates the evidence against the null. P‐values may turn out falsely low in studies due to random or systematic errors. Even correctly low P‐values do not logically provide support to any hypothesis. Recent studies show low replication rates of significant findings, questioning the dependability of published low P‐values. P‐values are poor indicators in support of scientific propositions. P‐values must be inferred by a thorough understanding of the study's question, design, and conduct. Null hypothesis significance testing will likely remain an important method in quantitative analysis but may be complemented with other statistical techniques that more straightforwardly address the size and precision of an effect or the plausibility that a hypothesis is true."
Poor statistical reporting, inadequate data presentation and spin persist despite editorial advice [link] Diong et al. 2018 PLoS One
However, two journals were studied in the article: The Journal of Physiology and the British Journal of Pharmacology
"The Journal of Physiology and British Journal of Pharmacology jointly published an editorial series in 2011 to improve standards in statistical reporting and data analysis...We conducted a cross-sectional analysis of reporting practices in a random sample of research papers published in these journals before (n = 202) and after (n = 199) publication of the editorial advice...There was no evidence that reporting practices improved following publication of the editorial advice...Of papers that reported p-values between 0.05 and 0.1, 56-63% interpreted these as trends or statistically significant...Overall, poor statistical reporting, inadequate data presentation and spin were present before and after the editorial advice was published..."
Common Misconceptions about Data Analysis and Statistics [link] Motulsky 2014 British Journal of Pharmacology "Ideally, any experienced investigator with the right tools should be able to reproduce a finding published in a peer-reviewed biomedical science journal. In fact, however, the reproducibility of a large percentage of published findings has been questioned. Undoubtedly, there are many reasons for this, but one reason may be that investigators fool themselves due to a poor understanding of statistical concepts. In particular, investigators often make these mistakes: 1) P-hacking, which is when you reanalyze a data set in many different ways, or perhaps reanalyze with additional replicates, until you get the result you want; 2) overemphasis on P values rather than on the actual size of the observed effect; 3) overuse of statistical hypothesis testing, and being seduced by the word "significant"; and 4) over-reliance on standard errors, which are often misunderstood."
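Motulsky's first point is straightforward to demonstrate. The following is a minimal simulation sketch, not taken from his paper: both groups are drawn from the same distribution, so the null hypothesis is true, yet repeatedly adding replicates and re-testing until p < 0.05 pushes the false-positive rate well above the nominal 5%.

```python
# Minimal sketch of the "optional stopping" form of p-hacking: add replicates
# and re-test until p < 0.05. The null is true in every simulated experiment,
# so any "significant" result here is a false positive.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_experiments = 2_000
false_positives = 0

for _ in range(n_experiments):
    a = list(rng.normal(size=10))
    b = list(rng.normal(size=10))      # same distribution: null is true
    for _ in range(10):                # keep adding replicates and re-testing
        _, p = stats.ttest_ind(a, b)
        if p < 0.05:
            false_positives += 1
            break
        a.append(rng.normal())
        b.append(rng.normal())

print(f"nominal alpha: 0.05, realized false-positive rate: "
      f"{false_positives / n_experiments:.3f}")
```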
The Legend of the P Value [link] Zeev 2005 Anesthesia & Analgesia "Although there is a growing body of literature criticizing the use of mere statistical significance as a measure of clinical impact, much of this literature remains out of the purview of the discipline of anesthesiology. Currently, the magical boundary of P < 0.05 is a major factor in determining whether a manuscript will be accepted for publication or a research grant will be funded. Similarly, the Federal Drug Administration does not currently consider the magnitude of an advantage that a new drug shows over placebo. As long as the difference is statistically significant, a drug can be advertised in the United States as 'effective' whether clinical trials proved it to be 10% or 200% more effective than placebo. We submit that if a treatment is to be useful to our patients, it is not enough for treatment effects to be statistically significant; they also need to be large enough to be clinically meaningful."
The Controversy of Significance Testing: Misconceptions and Alternatives [link] Glaser 1999 American Journal of Critical Care "The significance testing approach has had defenders and opponents for decades...The primary concerns have been (1) the misuse of significance testing, (2) the misinterpretation of P values, and (3) the lack of accompanying statistics, such as effect sizes and confidence intervals...This article presents the current thinking...on significance testing."
Consequences of relying on statistical significance: Some illustrations [link] Van Calster et al. 2018 European Journal of Clinical Investigation "Despite regular criticisms of null hypothesis significance testing (NHST), a focus on testing persists, sometimes in the belief to get published and sometimes encouraged by journal reviewers. This paper aims to demonstrate known key limitations of NHST using simple nontechnical illustrations...Researchers and journals should abandon statistical significance as a pivotal element in most scientific publications. Confidence intervals around effect sizes are more informative, but should not merely be reported to comply with journal requirements."
To “P” or not to “P”-That is not the question [link] MacDermid 2018 Journal of Hand Therapy "Reoccurring issues that arise in papers submitted to The Journal of Hand Therapy are around the use of P values. These issues affect the way results are reported and understood. Misconceptions about the P value often lead authors to focus on P values, which are relatively uninformative, instead of focusing on the size and importance of the effects they observed–which are very important. Authors are potentially misleading knowledge users like clinicians or policymakers, when they erroneously focus on P values rather than the nature and size of their findings, and how they relate to their hypothesis or the relevance of the effect size in practice."
Misinterpretations of the ‘p value’: a brief primer for academic sports medicine [link] Stovitz, Verhagen, and Shrier 2016 British Journal of Sports Medicine "When comparing treatment groups, the p value is a statistical measure that summarises the chance (‘p’ for probability) that one would obtain the observed result (or more extreme), if and only if, the treatment is ineffective (ie, under the assumption of the ‘null’ hypothesis). The p value does not tell us the probability that the null hypothesis is true. This editorial discusses how some common misinterpretations of the p value may impact sports medicine research."
"Evidence"-based medicine in eating disorders research: The problem of "confetti p values" [link] Kraemer 2017 International Journal of Eating Disorders "Eating disorders hold a unique place among mental health disorders, in that salient symptoms can be objectively observed and measured rather than determined only from patient interviews or subjective evaluations. Because of this measurement advantage alone, evidence-based medicine would be expected there to make the most rapid strides. However, conclusions in Eating Disorders research, as in all medical research literature, often continue to be misleading or ambiguous. One major and long-known source of such problems is the misuse and misinterpretation of "statistical significance", with "p values" strewn throughout research papers like so much confetti, a problem that has become systemic, that is, enforced, rather than corrected, by the peer-review system. This discussion attempts to clarify the issues, and to suggest how readers might deal with this issue in processing the research literature."
Poor statistical reporting, inadequate data presentation and spin persist despite editorial advice [link] Quatto, Ripamonti, and Marasini 2019 Journal of Biopharmaceutical Statistics "The p-value is a classical proposal of statistical inference, dating back to the seminal contributions by Fisher, Neyman and E. Pearson. However, p-values have been frequently misunderstood and misused in practice, and medical research is not an exception."
Statistical hypothesis testing and common misinterpretations: Should we abandon p-value in forensic science applications? [link] Taroni, Biedermann, and Bozza 2016 Forensic Science International "Many people regard the concept of hypothesis testing as fundamental to inferential statistics...More recently, controversial discussion was initiated by an editorial decision of a scientific journal to refuse any paper submitted for publication containing null hypothesis testing procedures. Since the large majority of papers published in forensic journals propose the evaluation of statistical evidence based on the so called p-values, it is of interest to expose the discussion of this journal's decision within the forensic science community."
Problems In Common Interpretations Of Statistics In Scientific Articles, Expert Reports, And Testimony [link] Greenland and Poole 2016 Jurimetrics "Despite articles and books on proper interpretation of statistics, it is still common in expert reports as well as scientific and statistical literature to see basic misinterpretations and neglect of background assumptions that underlie all statistical inferences. This problem can be attributed to the complexities of correct definitions of concepts such as P-values, statistical significance, and confidence intervals. These complexities lead to oversimplifications and subsequent misinterpretations by authors and readers. Thus, the present article focuses on what these concepts are not, which allows a more nonmathematical approach. The goal is to provide reference points for courts and other lay readers to identify misinterpretations and misleading claims."
Some recurrent problems in interpreting statistical evidence in equal employment cases [link] Gastwirth 2017 Law Probability and Risk "Although the U.S. Supreme Court accepted statistical evidence in cases concerning discrimination against minorities in jury pools and equal employment in 1977, several misinterpretations of the results of statistical analyses still occur in legal decisions. Several of these problems will be described and statistical approaches that are more reliable are presented. For example, a number of opinions give an erroneous description of the p-value of a statistical test or fail to consider the power of the test. Others do not distinguish between an analysis of a simple aggregation of data stratified into homogeneous subgroups, and one that controls for subgroup membership. Courts have used measures of 'practical significance' that lack a sound statistical foundation. This has led to a split in the Circuits concerning the appropriateness of 'practical' versus 'statistical' significance for the evaluation of statistical evidence."
The Insignificance of Null Hypothesis Significance Testing [link] Gill 1999 Political Research Quarterly "The current method of hypothesis testing in the social sciences is under intense criticism, yet most political scientists are unaware of the important issues being raised...In this article I review the history of the null hypothesis significance testing paradigm in the social sciences and discuss major problems, some of which are logical inconsistencies while others are more interpretive in nature..."
An Analysis of the Use of Statistical Testing in Communication Research [link] Katzer and Sodt 1973 Journal of Communication "A study was conducted to determine the adequacy of statistical testing in communication research. Every article published in the 1971–72 issues of the Journal of Communication was studied. For those studies employing statistical testing, we computed the power of those tests and the observed effect size. We found the average a priori power to be 0.55, a figure which is probably much lower than communication researchers would desire. While the average observed effect size was high, we found little evidence of its being used in the interpretation of findings. This study also discovered a large amount of inconsistency in the reporting of statistical findings...we suggest some guidelines for presenting statistical data."
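For readers unfamiliar with the power analyses Katzer and Sodt carried out, here is a minimal modern sketch using statsmodels; the effect size and sample sizes are illustrative assumptions, not figures from the 1973 study.

```python
# Minimal sketch of a power calculation for a two-sample t-test: given an
# assumed effect size, sample size, and alpha, what is the probability of
# detecting a real effect? (Illustrative numbers, not from the cited study.)
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Power to detect a medium effect (d = 0.5) with 30 participants per group
# at alpha = 0.05 (roughly 0.47 -- in the range the authors describe).
power = analysis.solve_power(effect_size=0.5, nobs1=30, alpha=0.05)
print(f"power = {power:.2f}")

# Sample size per group needed to reach 80% power for the same effect
# (about 64 per group under these assumptions).
n_needed = analysis.solve_power(effect_size=0.5, power=0.8, alpha=0.05)
print(f"n per group for 80% power = {n_needed:.0f}")
```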
A Critical Assessment of Null Hypothesis Significance Testing in Quantitative Communication Research [link] Levine et al. 2008 Human Communication Research "Null hypothesis significance testing (NHST) is the most widely accepted and frequently used approach to statistical inference in quantitative communication research. NHST, however, is highly controversial, and several serious problems with the approach have been identified. This paper reviews NHST and the controversy surrounding it. Commonly recognized problems include a sensitivity to sample size, the null is usually literally false, unacceptable Type II error rates, and misunderstanding and abuse. Problems associated with the conditional nature of NHST and the failure to distinguish statistical hypotheses from substantive hypotheses are emphasized. Recommended solutions and alternatives are addressed in a companion article."
Theoretical and empirical distributions of the p value [link] Butler and Jones 2017 Metron "The use of p values in null hypothesis statistical tests (NHST) is controversial in the history of applied statistics, owing to a number of problems. They are: arbitrary levels of Type I error, failure to trade off Type I and Type II error, misunderstanding of p values, failure to report effect sizes, and overlooking better means of reporting estimates of policy impacts, such as effect sizes, interpreted confidence intervals, and conditional frequentist tests. This paper analyzes the theory of p values and summarizes the problems with NHST. Using a large data set of public school districts in the United States, we demonstrate empirically the unreliability of p values and hypothesis tests as predicted by the theory. We offer specific suggestions for reporting policy research."
Is there life after P<0.05? Statistical significance and quantitative sociology [link] Engman 2013 Quality and Quantity "The overwhelming majority of quantitative work in sociology reports levels of statistical significance. Often, significance is reported with little or no discussion of what it actually entails philosophically, and this can be problematic when analyses are interpreted...The first section of this paper deals with this common misunderstanding...The third section is devoted to a discussion of the consequences of misinterpreting statistical significance for sociology. It is argued that reporting statistical significance provides sociology with very little value, and that the consequences of misinterpreting significance values outweighs the benefits of their use."
The need for nuance in the null hypothesis significance testing debate [link] Häggström 2017 Educational and Psychological Measurement "Null hypothesis significance testing (NHST) provides an important statistical toolbox, but there are a number of ways in which it is often abused and misinterpreted, with bad consequences for the reliability and progress of science. Parts of contemporary NHST debate, especially in the psychological sciences, is reviewed, and a suggestion is made that a new distinction between strongly, weakly and very weakly anti-NHST positions is likely to bring added clarity to the debate."
Effect Size Use in Studies of Learning Disabilities [link] Ives 2003 Journal of Learning Disabilities "The misinterpretation and overuse of significance testing in the social sciences has been widely criticized. This criticism is reviewed, along with several recommendations found in the literature, including the use of effect size measures to enhance the interpretation of significance testing. A review of typical effect size measures and their application is followed by an analysis of the extent to which effect size measures have been applied in three prominent journals on learning disabilities over a 10-year period. Specific recommendations are offered for using effect size measures to improve the quality of reporting on quantitative research in the field of learning disabilities."
Scientific rigour in psycho-oncology trials: Why and how to avoid common statistical errors [link] Bell, Olivier, and King 2013 Psycho-Oncology "It is well documented that statistical and methodological flaws are common in much of the health research literature, including psycho-oncology. These can have far-reaching effects, including the publishing of misleading results; the wasting of time, effort, and financial resources; exposure of patients to the potential harms of research and decreased confidence in science and researchers by the public. Several of the most common statistical errors and methodological pitfalls that occur in the field of psycho-oncology are discussed...These include proper approaches to power...and correct interpretation of p-values..."
Do We Understand Classic Statistics? [link] Blasco 2017 Book chapter in Bayesian Data Analysis for Animal Scientists "In this chapter, we review the classical statistical concepts and procedures, test of hypothesis, standard errors and confidence intervals...and we examine the most common misunderstandings about them..."
Articles in which authors cite researchers in their field misinterpreting p-values or NHST