Journal Articles Citing Null Hypothesis Misinterpretations

Article Summary

Even professional researchers are not immune to misinterpretations of Null Hypothesis Significance Testing (NHST). Presented below are more than 60 published articles in which researchers have identified that peers in their field misuse or misinterpret NHST.

Contact information

We’d love to hear from you! Please direct all inquiries to james@theresear.ch.

Article Status

This article is complete and is being actively maintained.

Paid reviewers

Because this article contains no original statistical content, it went through only a simple standard editing process with a single reviewer.

Dan Hippe, M.S., Statistics, University of Washington (2011). Dan is currently a statistician in the Clinical Biostatistics Group at the Fred Hutchinson Cancer Research Center in Seattle, Washington. He is a named co-author on more than 180 journal articles, is on the editorial board of Ultrasound Quarterly, and is a statistical consultant and reviewer for the Journal of the American College of Radiology.

Article summary

There are four categories of evidence regarding misinterpretations of Null Hypothesis Significance Testing (NHST) by professional researchers. The current article focuses on the fourth category.

Direct inquiries of statistical knowledge. Although not without methodological challenges, this work is the most direct method of assessing statistical understanding. The standard procedure is to administer a survey instrument, statistical task, or other inquiry method to a convenience sample of students, researchers, or both in a particular academic discipline. Click here to read a summary of psychology studies using direct inquiry.
Examination of NHST and CI usage in statistics and methodology textbooks. This line of research includes both systematic reviews and casual observations documenting incorrect or incomplete language when NHST or CI is described in published textbooks. In these cases it is unclear if the textbook authors themselves do not fully understand the techniques and procedures or if they were simply imprecise in their writing and editing or otherwise thought it best to omit or simplify the material for pedagogical purposes. For an early draft of our article click here. We will continue to expand this article in the coming months.
Audits of NHST and CI usage in published articles. Similar to reviews of textbooks, these audits include systematic reviews of academic journal articles making use of NHST and CIs. The articles are assessed for correct usage. Audits are typically focused on a particular academic discipline, most commonly reviewing articles in a small number of academic journals over a specified time period. Quantitative metrics are often provided that represent the percentage of reviewed articles that exhibited correct and incorrect usage. Click here for a growing list of relevant papers.
Journal articles citing NHST or CI misinterpretations. A large number of researchers have written articles underscoring the nuances of the procedures and common misinterpretations directed at their own academic discipline. In those cases it is implied that in the experience of the authors and journal editors the specified misinterpretations are common enough in their field that a corrective is warranted. Using a semi-structured search we identified more than 60 such articles spanning several dozen academic disciplines.

JOURNAL ARTICLES CITING NHST MISINTERPRETATIONS

A large number of journal articles have been published critiquing NHST usage in various academic disciplines. In these cases it is implied that, in the experience of the authors, NHST misinterpretations are common enough in their discipline that a corrective is warranted. More than 60 such articles across several dozen disciplines are outlined in the table below. These disciplines are as varied as hematology and wildlife management, machine learning and accounting. The table is not a complete list of all such articles, nor is it meant to be. Instead, its purpose is to give some sense of the breadth of disciplines for which authors have identified NHST misinterpretations as a key problem. Likewise, by virtue of the fact that almost all of the articles were published in peer-reviewed journals, the journal editors must have also felt at some level that NHST misinterpretations were a concern worth addressing.

To generate the table a semi-structured search was used to identify articles and book chapters where the authors implicitly or explicitly identify NHST misinterpretations. The search largely consisted of the two phrases “p-value misinterpretations” and “null hypothesis misinterpretations” with occasional variations such as adding the name of a specific academic discipline. The search included numerous digital journal libraries such as Google Scholar, ResearchGate, JSTOR, SpringerLink, and Access to Research. NHST misinterpretations included authors identifying misuse or misunderstanding of p-values, Type I and Type II errors, statistical power, sample size, standard errors, or other components of the NHST framework. The articles span several decades, although an attempt was made to find recent articles when possible, with a significant portion published in the previous 10 years and more than half published in the previous five.

Two broad categories of language were identified when selecting studies: explicit and implicit. An explicit example of NHST misinterpretations comes from the 2011 article “P Values: Use and Misuse in Medical Literature” in the American Journal of Hypertension:

P values are widely used in the medical literature but many authors, reviewers, and readers are unfamiliar with a valid definition of a P value, let alone how to interpret one correctly...The article points out how to better interpret P values by avoiding common errors.

This article is considered an explicit example of the author identifying NHST misinterpretations because the abstract explicitly notes that many authors, reviewers, and readers do not know how to interpret a p-value, a key ingredient in the NHST framework.
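As a concrete point of reference, a p-value is a tail probability computed under the assumption that the null hypothesis is true. The following is a minimal sketch, not taken from the cited article, assuming a test statistic that is standard normal under the null hypothesis; the function name `two_sided_p_value` and the observed value are hypothetical.

```python
from statistics import NormalDist

def two_sided_p_value(z: float) -> float:
    """Return P(|Z| >= |z|) for a standard normal Z, i.e. assuming the null is true."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

z_observed = 1.96  # hypothetical observed test statistic
print(f"p = {two_sided_p_value(z_observed):.3f}")
```

The number returned is the probability of data at least this extreme given the null hypothesis. It is not the probability that the null hypothesis is true given the data, which is the confusion several of the articles in the table below describe.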

Contrast the above example with an implicit set of misinterpretations identified in “Significance Testing in Accounting Research: A Critical Evaluation Based on Evidence,” a 2018 article appearing in the journal Abacus, a journal of accounting, finance, and business:

From a survey of the papers published in leading accounting journals in 2014, we find that accounting researchers conduct significance testing almost exclusively at a conventional level of significance, without considering key factors such as the sample size or power of a test. We present evidence that a vast majority of the accounting studies favour large or massive sample sizes and conduct significance tests with the power extremely close to or equal to one. As a result, statistical inference is severely biased towards Type I error, frequently rejecting the true null hypotheses. Under the ‘p‐value less than 0.05’ criterion for statistical significance, more than 90% of the surveyed papers report statistical significance. However, under alternative criteria, only 40% of the results are statistically significant. We propose that substantial changes be made to the current practice of significance testing for more credible empirical research in accounting.

This article is considered an implicit example of the author identifying NHST misinterpretations because nowhere is it mentioned that there are misinterpretations within the field of accounting. Instead, the authors imply misinterpretations are present by noting that key elements of NHST are used in a manner the authors feel is incorrect. The implication is that if accounting researchers knew how to interpret all of the components of NHST correctly a different pattern of usage would have been observed within the accounting journals surveyed.
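The link between sample size and power mentioned in the abstract can be made concrete with a small calculation. The sketch below assumes a two-sided one-sample z-test with known unit variance; the function name `z_test_power`, the standardized effect size of 0.02, and the sample sizes are hypothetical choices for illustration, not values from the accounting survey.

```python
from math import sqrt
from statistics import NormalDist

def z_test_power(effect_size: float, n: int, alpha: float = 0.05) -> float:
    """Power of a two-sided one-sample z-test (known unit variance)."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)  # about 1.96 for alpha = 0.05
    shift = effect_size * sqrt(n)  # mean of the test statistic under the alternative
    # Probability that the statistic lands in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)

# Hypothetical tiny effect evaluated at increasingly large sample sizes.
for n in (100, 10_000, 1_000_000):
    print(f"n = {n:>9,}: power = {z_test_power(0.02, n):.3f}")
```

Under these assumptions, power for a fixed small effect rises toward one as the sample grows, so with massive samples even practically negligible effects are almost always declared statistically significant at a fixed threshold.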

Articles using both explicit and implicit language are included in this review; however, an effort was made to select articles using explicit language when possible.

The passage from the American Journal of Hypertension demonstrates that some of the language used to identify the scope of misinterpretations is general rather than specific. For example, the authors call out the “medical literature” in general, not hypertension as a specific subfield. Nonetheless, the assumption is that NHST misinterpretations are indeed a problem in the medical subfield of hypertension, but the authors may not feel the misinterpretations are limited only to the subfield of hypertension. If the authors didn’t feel NHST misinterpretations were a concern in the field of hypertension presumably they would have chosen to publish their comments in a more relevant journal.

The table does not include submission or editorial guidance from journals, which sometimes include notes on use of NHST and p-values [1]. Nor does the table include tutorials or primers on NHST unless misinterpretations are explicitly called out. Many articles attempt to address NHST with various solutions, for example by suggesting a transition to Bayesian statistics or information criteria approaches. Inclusion in the table below does not imply an endorsement of such proposals. Indeed, many such proposals themselves have associated critiques, misuse, and misinterpretations.

The table includes the article title, authors, journal in which the article appeared, year of publication, and the relevant passage from the abstract in which misinterpretations were outlined (in a limited number of cases the article did not have an abstract and text was pulled from the body of the article). We have not yet had the time to read the full text of each article below and therefore readers are encouraged to seek out relevant articles on their own and examine them in detail as needed.

**Articles in which authors cite researchers in their field misinterpreting aspects of NHST**
Article title	Authors	Year	Journal	Description
The Significance of Statistical Significance Tests in Marketing Research [link]	Sawyer and Peter	1983	Journal of Marketing Research	"Classical statistical significance testing is the primary method by which marketing researchers empirically test hypotheses and draw inferences about theories. The authors discuss the interpretation and value of classical statistical significance tests and suggest that classical inferential statistics may be misinterpreted and overvalued by marketing researchers in judging research results."
Why We Don't Really Know What Statistical Significance Means: Implications for Educators [link]	Hubbard and Armstrong	2006	Journal of Marketing Education	"In marketing journals and market research textbooks, two concepts of statistical significance—p values and α levels—are commonly mixed together. This is unfortunate because they each have completely different interpretations. The upshot is that many investigators are confused over the meaning of statistical significance. We explain how this confusion has arisen and make several suggestions to teachers and researchers about how to overcome it."
Significance Testing in Accounting Research: A Critical Evaluation Based on Evidence [link]	Kim, Ahmed, and Ji	2018	Abacus: A Journal of Accounting, Finance, and Business Studies	"From a survey of the papers published in leading accounting journals in 2014, we find that accounting researchers conduct significance testing almost exclusively at a conventional level of significance, without considering key factors such as the sample size or power of a test. We present evidence that a vast majority of the accounting studies favour large or massive sample sizes and conduct significance tests with the power extremely close to or equal to one. As a result, statistical inference is severely biased towards Type I error, frequently rejecting the true null hypotheses. Under the ‘p‐value less than 0.05’ criterion for statistical significance, more than 90% of the surveyed papers report statistical significance. However, under alternative criteria, only 40% of the results are statistically significant. We propose that substantial changes be made to the current practice of significance testing for more credible empirical research in accounting."
Tackling False Positives in Finance: A Statistical Toolbox with Applications [link]	Kim	2018	31st Australasian Finance and Banking Conference 2018	"Serious concerns have been raised that false positive findings are widespread in empirical research in business research including finance. This is largely because researchers almost exclusively adopt the 'p-value less than 0.05' criterion for statistical significance; and they are often not fully aware of large-sample biases which can potentially mislead their research outcomes. This paper proposes that a statistical toolbox (rather than a single hammer) be used in empirical research, which offers researchers a range of statistical instruments, including alternatives to the p-value criterion and cautionary analyses for large-sample bias. It is found that the positive results obtained under the p-value criterion cannot stand, when the toolbox is applied to three notable studies in finance."
A Re-Interpretation of 'Significant' Empirical Financial Research [link]	Kellner and Roesch	2018	White paper [finance]	"Currently, the use and interpretation of statistics and p-values is under scrutiny in various scientific fields for several reasons: p-hacking, data dredging, misinterpretation, multiple testing, or selective reporting, among others. To the best of our knowledge, this discussion has hardly reached the empirical finance community. Thus, the aim of this paper is to show how the typical p-value based analysis of empirical findings in finance can be fruitfully enriched by the supplemental use of further statistical tools."
Comments and recommendations regarding the hypothesis testing controversy [link]	Bonett and Wright	2007	Journal of Organizational Behavior	"Hypothesis tests are routinely misinterpreted in scientific research. Specifically, the failure to reject a null hypothesis is often interpreted as support for the null hypothesis while the rejection of a null hypothesis is often interpreted as evidence of an important finding. Many of the most frequently used hypothesis tests are 'non-informative' because the null hypothesis is known to be false prior to hypothesis testing. We discuss the limitations of non-informative hypothesis tests and explain why confidence intervals should be used in their place. Several examples illustrate the use and interpretation of confidence intervals."
Time to dispense with the p-value in OR? [link]	Hofmann and Meyer-Nieberg	2017	Central European Journal of Operations Research	"Null hypothesis significance testing is the standard procedure of statistical decision making, and p-values are the most widespread decision criteria of inferential statistics both in science, in general, and also in operations research...Yet, the use of significance testing in the analysis of research data has been criticized from numerous statisticians—continuously for almost 100 years...Is it time to dispense with the p-value in OR? The answer depends on many factors, including the research objective, the research domain, and, especially, the amount of information provided in addition to the p-value. Despite this dependence from context three conclusions can be made that should concern the operational analyst: First, p-values can perfectly cast doubt on a null hypothesis or its underlying assumptions, but they are only a first step of analysis, which, stand alone, lacks expressive power. Second, the statistical layman almost inescapably misinterprets the evidentiary value of p-values. Third and foremost, p-values are an inadequate choice for a succinct executive summary of statistical evidence for or against a research question..."
Misinformation in MIS Research: The Problem of Statistical Power [link]	Baroudi and Orlikowski	2018	White paper [Management Information Systems]	"This study reviews 57 MIS articles employing statistical inference testing published in leading MIS journals over the last five years. The statistical power of the articles was evaluated and found on average to fall substantially below the accepted norms. The consequence of low power is that it can lead to misinterpretation of data and results. Collectively these misinterpretations result in a body of MIS research that is built on potentially erroneous conclusions."
Institutionalized dualism: statistical significance testing as myth and ceremony [link]	Orlitzky	2011	Journal of Management Control	"Several well-known statisticians regard significance testing as a deeply problematic procedure in statistical inference. Yet, in-depth discussion of null hypothesis significance testing (NHST) has largely been absent from the literature on organizations or, more specifically, management control systems. This article attempts to redress this oversight by drawing on neoinstitutional theory to frame, analyze, and explore the NHST problem. Regulative, normative, and cultural-cognitive forces partly explain the longevity of NHST in organization studies. The unintended negative consequences of NHST include a reinforcement of the academic-practitioner divide, an obstacle to the growth of knowledge, discouragement of study replications, and mechanization of researcher decision making. An appreciation of these institutional explanations for NHST as well as the harm caused by NHST may ultimately help researchers develop superior methodological alternatives to a controversial statistical technique."
Caveats for using statistical significance tests in research assessments [link]	Schneider	2013	Journal of Informetrics	"This article raises concerns about the advantages of using statistical significance tests in research assessments...Statistical significance tests are highly controversial and numerous criticisms have been leveled against their use. Based on examples from articles by proponents of the use statistical significance tests in research assessments, we address some of the numerous problems with such tests...we generally believe that statistical significance tests are over- and misused in the empirical sciences including scientometrics and we encourage a reform on these matters."
Fair Statistical Communication in HCI [link]	Dragicevic	2016	Book chapter in Modern Statistical Methods for HCI part of the Human-Computer Interaction book series	"[A]reas such as human-computer interaction (HCI) have adopted tools — i.e., p-values and dichotomous testing procedures — that have proven to be poor at supporting these tasks [of advancing scientific knowledge]. The abusive use of these procedures has been severely criticized in a range of disciplines for several decades...This chapter explains in a non-technical manner why it would be beneficial for HCI to switch to an estimation approach..."
Statistics [link]	de Winter and Dodou	2017	Book chapter in Human Subject Research for Engineers	"After the measurements have been completed, the data have to be statistically analysed. This chapter explains how to analyse data and how to conduct statistical tests...We draw attention to pitfalls that may occur in statistical analyses, such as misinterpretations of null hypothesis significance testing and false positives. Attention is also drawn to questionable research practices and their remedies. Replicability of research is also discussed, and recommendations for maximizing replicability are provided."
Statistical Significance and the Dichotomization of Evidence [link]	McShane and Gal	2017	Journal of the American Statistical Association	"In this article, we review research on how applied researchers who are not primarily statisticians misuse and misinterpret p-values in practice and how this can lead to errors in the interpretation of evidence. We also present new data showing, perhaps surprisingly, that researchers who are primarily statisticians are also prone to misuse and misinterpret p-values thus resulting in similar errors. In particular, we show that statisticians tend to interpret evidence dichotomously based on whether or not a p-value crosses the conventional 0.05 threshold for statistical significance. We discuss implications and offer recommendations."
The Waves and the Sigmas [link]	D’Agostini	2016	White paper [physics]	"This paper shows how p-values do not only create, as well known, wrong expectations in the case of flukes, but they might also dramatically diminish the ‘significance’ of most likely genuine signals. As real life examples, the 2015 first detections of gravitational waves are discussed. The March 2016 statement of the American Statistical Association, warning scientists about interpretation and misuse of p-values, is also reminded and commented. (The paper is complemented with some remarks on past, recent and future claims of discoveries based on sigmas from Particles Physics.)"
Should significance testing be abandoned in machine learning? [link]	Berrar and Dubitzky	2019	International Journal of Data Science and Analytics	"Significance testing has become a mainstay in machine learning, with the p value being firmly embedded in the current research practice. Significance tests are widely believed to lend scientific rigor to the interpretation of empirical findings; however, their problems have received only scant attention in the machine learning literature so far...Our main result suggests that significance tests should not be used in such comparative studies. We caution that the reliance on significance tests might lead to a situation that is similar to the reproducibility crisis in other fields of science. We offer for debate four avenues that might alleviate the looming crisis."
Understanding Statistical Hypothesis Testing: The Logic of Statistical Inference [link]	Emmert-Streib and Dehmer	2019	Machine Learning and Knowledge Extraction	"Statistical hypothesis testing is among the most misunderstood quantitative analysis methods from data science...In this paper, we discuss the underlying logic behind statistical hypothesis testing, the formal meaning of its components and their connections."
Confidence curves: an alternative to null hypothesis significance testing for the comparison of classifiers [link]	Berrar	2016	Machine Learning	"Null hypothesis significance testing is routinely used for comparing the performance of machine learning algorithms. Here, we provide a detailed account of the major underrated problems that this common practice entails...[In contrast to] the view that significance tests are essential to a sound and objective interpretation of classification results, our study suggests that no such tests are needed. Instead, greater emphasis should be placed on the magnitude of the performance difference and the investigator’s informed judgment."
Not All Claims are Created Equal: Choosing the Right Approach to Assess Your Hypotheses [link]	Azer, Khashabi, Sabharwal, Roth	2019	White paper [Natural language processing]	"Empirical research in Natural Language Processing (NLP) has adopted a narrow set of principles for assessing hypotheses, relying mainly on p-value computation, which suffers from several known issues. While alternative proposals have been well-debated and adopted in other fields, they remain rarely discussed or used within the NLP community...[C]ommon fallacies, misconceptions, and misinterpretation surrounding hypothesis assessment methods often stem from a discrepancy between what one would like to claim versus what the method used actually assesses. Our survey reveals that these issues are omnipresent in the NLP research community. As a step forward, we provide best practices and guidelines tailored to NLP research..."
A systematic review of statistical power in software engineering experiments [link]	Dyba, Kampenes, and Sjoberg	2006	Information and Software Technology	"This paper reports a quantitative assessment of the statistical power of empirical software engineering research based on the 103 papers on controlled experiments (of a total of 5,453 papers) published in nine major software engineering journals and three conference proceedings in the decade 1993–2002. The results show that the statistical power of software engineering experiments falls substantially below accepted norms as well as the levels found in the related discipline of information systems research. Given this study's findings, additional attention must be directed to the adequacy of sample sizes and research designs to ensure acceptable levels of statistical power. Furthermore, the current reporting of significance tests should be enhanced by also reporting effect sizes and confidence intervals."
“Bad Smells” in Software Analytics Papers [link]	Menzies and Shepperd	2019	Information and Software Technology	The authors list 12 "bad smells." Among them are "over-reliance on, or abuse of, null hypothesis significance testing," "p-hacking," and "under-powered studies."
The limits of p-values for biological data mining [link]	Malley, Dasgupta, and Moore	2013	BioData Mining	"Use of p-values is widespread in the sciences, especially so in biomedical research, and also underlies several analytic approaches in data mining...However, simple criticisms and essential distinctions are immediate..."
Impact of Criticism of Null-Hypothesis Significance Testing on Statistical Reporting Practices in Conservation Biology [link]	Fidler et al.	2006	Conservation Biology	"Over the last decade, criticisms of null-hypothesis significance testing have grown dramatically...Have these calls for change had an impact on the statistical reporting practices in conservation biology? In 2000 and 2001, 92% of sampled articles in Conservation Biology and Biological Conservation reported results of null-hypothesis tests. In 2005 this figure dropped to 78%...Of those articles reporting null-hypothesis testing — which still easily constitute the majority — very few report statistical power (8%) and many misinterpret statistical nonsignificance as evidence for no effect (63%). Overall, results of our survey show some improvements in statistical practice, but further efforts are clearly required to move the discipline toward improved practices."
Misuse of null hypothesis significance testing: would estimation of positive and negative predictive values improve certainty of chemical risk assessment? [link]	Bundschuh et al.	2013	Environmental Science and Pollution Research	"Although generally misunderstood, the p value is the probability of the test results or more extreme results given H₀ is true: it is not the probability of H₀ being true given the results...[Our] approach also reinforces the value of considering α, β, and a biologically meaningful effect size a priori."
Statistics: reasoning on uncertainty, and the insignificance of testing null [link]	Läärä	2009	Annales Zoologici Fennici	"The practice of statistical analysis and inference in ecology is critically reviewed. The dominant doctrine of null hypothesis significance testing (NHST) continues to be applied ritualistically and mindlessly. This dogma is based on superficial understanding of elementary notions of frequentist statistics in the 1930s, and is widely disseminated by influential textbooks targeted at biologists. It is characterized by silly null hypotheses and mechanical dichotomous division of results being “significant” (P < 0.05) or not. Simple examples are given to demonstrate how distant the prevalent NHST malpractice is from the current mainstream practice of professional statisticians. Masses of trivial and meaningless “results” are being reported, which are not providing adequate quantitative information of scientific interest. The NHST dogma also retards progress in the understanding of ecological systems and the effects of management programmes, which may at worst contribute to damaging decisions in conservation biology."
The Insignificance of Statistical Significance Testing [link]	Johnson	1999	The Journal of Wildlife Management	"Despite their wide use in scientific journals such as The Journal of Wildlife Management, statistical hypothesis tests add very little value to the products of research...This paper describes how statistical hypothesis tests are often viewed, and then contrasts that interpretation with the correct one."
Viewpoint: Improving Range Science through the Appropriate Use of Statistics [link]	Gould and Steiner	2002	Journal of Range Management	"We examined a stratified random sample of articles published over 3 decades of the Journal of Range Management to study the applications and changes in statistical methodology employed by range scientists...Articles that reported an effect size via a sample mean frequently did not report an associated standard error...Improper identification of the experimental or sampling unit and/or the interdependence of observations occurred in all decades. We recommend increased inferential use of confidence intervals and suggest that the practical significance (as opposed to statistical significance) of results be considered more often..."
How significant (p < 0.05) is geomorphic research? [link]	Hutton	2014	Earth Surface Processes and Landforms	"The pervasive application of the Null Hypothesis Significance Test in geomorphic research runs counter to widespread, long running, and often severe criticism of the method in the broader scientific literature...Though not without their own assumptions, wider application of such [Bayesian] methods can help facilitate a transition towards a broader approach to statistical, and in turn, scientific inference in geomorphic systems."
Inference without significance: measuring support for hypotheses rather than rejecting them [link]	Gerrodette	2011	Marine Ecology	"Despite more than half a century of criticism, significance testing continues to be used commonly by ecologists. Significance tests are widely misused and misunderstood, and even when properly used, they are not very informative for most ecological data. Problems of misuse and misinterpretation include: (i) invalid logic; (ii) rote use; (iii) equating statistical significance with biological importance; (iv) regarding the P‐value as the probability that the null hypothesis is true; (v) regarding the P‐value as a measure of effect size; and (vi) regarding the P‐value as a measure of evidence. Significance tests are poorly suited for inference because they pose the wrong question. In addition, most null hypotheses in ecology are point hypotheses already known to be false, so whether they are rejected or not provides little additional understanding. Ecological data rarely fit the controlled experimental setting for which significance tests were developed."
I can see clearly now: Reinterpreting statistical significance [link]	Dushoff, Kain, and Bolker	2019	Methods in Ecology and Evolution	"Null hypothesis significance testing (NHST) remains popular despite decades of concern about misuse and misinterpretation. We believe that much misinterpretation of NHST is due to language: significance testing has little to do with other meanings of the word ‘significance’. We therefore suggest that researchers describe the conclusions of null‐hypothesis tests in terms of statistical ‘clarity’ rather than ‘significance’."
Invasive Plant Researchers Should Calculate Effect Sizes, Not P-Values [link]	Rinella and James	2010	Invasive Plant Science and Management	"Null hypothesis significance testing (NHST) forms the backbone of statistical inference in invasive plant science. Over 95% of research articles in Invasive Plant Science and Management report NHST results such as P-values or statistics closely related to P-values such as least significant differences. Unfortunately, NHST results are less informative than their ubiquity implies. P-values are hard to interpret and are regularly misinterpreted. Also, P-values do not provide estimates of the magnitudes and uncertainties of studied effects, and these effect size estimates are what invasive plant scientists care about most. In this paper, we reanalyze four datasets (two of our own and two of our colleagues; studies put forth as examples in this paper are used with permission of their authors) to illustrate limitations of NHST. The re-analyses are used to build a case for confidence intervals as preferable alternatives to P-values. Confidence intervals indicate effect sizes, and compared to P-values, confidence intervals provide more complete, intuitively appealing information on what data do/do not indicate."
Wise use of statistical tools in ecological field studies [link]	Halvorsen	2007	Folia Geobotanica	"The currently dominating hypothetico-deductive research paradigm for ecology has statistical hypothesis testing as a basic element. Classic statistical hypothesis testing does, however, present the ecologist with two fundamental dilemmas when field data are to be analyzed...I argue that a research strategy entirely based on rigorous statistical testing of hypotheses is insufficient for field ecological data and that inductive and deductive approaches are complementary in the process of building ecological knowledge. I recommend that great care is taken when statistical tests are applied to ecological field data. Use of less formal modelling approaches is recommended for cases when formal testing is not strictly needed. Sets of recommendations, “Guidelines for wise use of statistical tools”, are proposed both for testing and for modelling. Important elements of wise-use guidelines are parallel use of methods that preferably belong to different methodologies, selection of methods with few and less rigorous assumptions, conservative interpretation of results, and abandonment of definitive decisions based on a predefined significance level."
Does the P Value Have a Future in Plant Pathology? (Letter to the editor) [link]	Madden, Shah, and Esker	2015	Phytopathology	"The P value (significance level) is possibly the mostly widely used, and also misused, quantity in data analysis. P has been heavily criticized on philosophical and theoretical grounds, especially from a Bayesian perspective....Plant pathologists may view this latest round of P value criticism with anything from indifference to alarm. If anything, we do hope the new round of attention will increase awareness within our own discipline of the very likely misuse of the calculated P value in plant pathological science. "
Explicit consideration of critical effect sizes and costs of errors can improve decision-making in plant science [link]	Mudge	2013	New Phytologist	"Plant scientists...are responsible for ensuring that data are analyzed and presented in a way that facilitates good decision-making. Good statistical practices can be an important tool for ensuring objective and transparent data analysis. Despite frequent criticism over the last few decades, null hypothesis significance testing (NHST) remains widely used in plant science...Use of a consistent significance threshold regardless of sample size has contributed to frequent misinterpretations of P-values as being ‘highly significant’ or ‘marginally significant’, and/or as measures of how likely the alternate hypothesis is to be true..."
The continuing misuse of null hypothesis significance testing in biological anthropology [link]	Smith	2018	American Journal of Physical Anthropology	"There is over 60 years of discussion in the statistical literature concerning the misuse and limitations of null hypothesis significance tests (NHST). Based on the prevalence of NHST in biological anthropology research, it appears that the discipline generally is unaware of these concerns..."
Common scientific and statistical errors in obesity research [link]	George et al.	2016	Obesity	"This review identifies 10 common errors and problems in the statistical analysis, design, interpretation, and reporting of obesity research and discuss how they can be avoided."
Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations [link]	Greenland et al.	2016	European Journal of Epidemiology	"Misinterpretation and abuse of statistical tests, confidence intervals, and statistical power have been decried for decades, yet remain rampant. A key problem is that there are no interpretations of these concepts that are at once simple, intuitive, correct, and foolproof. Instead, correct use and interpretation of these statistics requires an attention to detail which seems to tax the patience of working scientists. This high cognitive demand has led to an epidemic of shortcut definitions and interpretations that are simply wrong, sometimes disastrously so—and yet these misinterpretations dominate much of the scientific literature...We then provide an explanatory list of 25 misinterpretations of P values, confidence intervals, and power. We conclude with guidelines for improving statistical interpretation and reporting."
Toward evidence-based medical statistics. 1: The P value fallacy [link]	Goodman	1999	Annals of Internal Medicine	"There is little appreciation in the medical community that the [hypothesis test] methodology is an amalgam of incompatible elements, whose utility for scientific inference has been the subject of intense debate among statisticians for almost 70 years. This article introduces some of the key elements of that debate and traces the appeal and adverse impact of this methodology to the P value fallacy."
P Values: Use and Misuse in Medical Literature [link]	Cohen	2011	American Journal of Hypertension	"P values are widely used in the medical literature but many authors, reviewers, and readers are unfamiliar with a valid definition of a P value, let alone how to interpret one correctly...The article points out how to better interpret P values by avoiding common errors."
Understanding the Role of P Values and Hypothesis Tests in Clinical Research [link]	Mark, Lee, and Harrell	2016	JAMA Cardiology	"P values and hypothesis testing methods are frequently misused in clinical research. Much of this misuse appears to be owing to the widespread, mistaken belief that they provide simple, reliable, and objective triage tools for separating the true and important from the untrue or unimportant."
A Dirty Dozen: Twelve P-Value Misconceptions [link]	Goodman	2008	Seminars in Hematology	"The P value is a measure of statistical evidence that appears in virtually all medical research papers. Its interpretation is made extraordinarily difficult because it is not part of any formal system of statistical inference. As a result, the P value’s inferential meaning is widely and often wildly misconstrued, a fact that has been pointed out in innumerable papers and books appearing since at least the 1940s. This commentary reviews a dozen of these common misinterpretations and explains why each is wrong."
The Value of a p-Valueless Paper [link]	Connor	2004	American Journal of Gastroenterology	"As is common in current biomedical research, about 85% of original contributions in The American Journal of Gastroenterology in 2004 have reported p-values. However, none are reported in this issue's article by Abraham et al. who, instead, rely exclusively on effect size estimates and associated confidence intervals to summarize their findings. Authors using confidence intervals communicate much more information in a clear and efficient manner than those using p-values. This strategy also prevents readers from drawing erroneous conclusions caused by common misunderstandings about p-values. I outline how standard, two-sided confidence intervals can be used to measure whether two treatments differ or test whether they are clinically equivalent."
To P or Not to P: Backing Bayesian Statistics [link]	Buchinsky and Chadha	2017	Otolaryngology Head and Neck Surgery	"In biomedical research, it is imperative to differentiate chance variation from truth before we generalize what we see in a sample of subjects to the wider population. For decades, we have relied on null hypothesis significance testing, where we calculate P values for our data to decide whether to reject a null hypothesis. This methodology is subject to substantial misinterpretation and errant conclusions..."
Computation of measures of effect size for neuroscience data sets [link]	Hentschke and Stuttgen	2011	European Journal of Neuroscience	"Here we review the most common criticisms of significance testing and provide several examples from neuroscience where use of MES [measures of effect size] conveys insights not amenable through the use of P‐values alone. We introduce an open‐access matlab toolbox providing a wide range of MES to complement the frequently used types of hypothesis tests, such as t‐tests and analysis of variance."
Can nursing epistemology embrace p-values? [link]	Ou, Hall, and Thorne	2017	Nursing Philosophy	"The use of correlational probability values (p-values) as a means of evaluating evidence in nursing and health care has largely been accepted uncritically. There are reasons to be concerned about an uncritical adherence to the use of significance testing, which has been located in the natural science paradigm...Nursing has been minimally involved in the rich debate about the controversies of treating significance testing as evidentiary in the health and social sciences...We argue that nursing needs to critically reflect on the limitations associated with this tool of the evidence-based movement, given the complexities and contextual factors that are inherent to nursing epistemology. Such reflection will inform our thinking about what constitutes substantive knowledge for the nursing discipline."
P Value and the Theory of Hypothesis Testing: An Explanation for New Researchers [link]	Biau, Jolles, and Porcher	2009	Clinical Orthopaedics and Related Research	"[T]he p value and the theory of hypothesis testing are different theories that often are misunderstood and confused, leading researchers to improper conclusions. Perhaps the most common misconception is to consider the p value as the probability that the null hypothesis is true rather than the probability of obtaining the difference observed, or one that is more extreme, considering the null is true. Another concern is the risk that an important proportion of statistically significant results are falsely significant. Researchers should have a minimum understanding of these two theories so that they are better able to plan, conduct, interpret, and report scientific experiments."
Misconceptions, Misuses, and Misinterpretations of P Values and Significance Testing [link]	Gagnier and Morgenstern	2017	The Journal of Bone and Joint Surgery	"The purpose of this article was to discuss these principles. We make several recommendations for moving forward: (1) Authors should avoid statements such as 'statistically significant' or 'statistically nonsignificant.' (2) Investigators should report the magnitude of effect of all outcomes together with the appropriate measure of precision or variation..."
Statistics in ophthalmology revisited: the (effect) size matters [link]	Grzybowski and Mianowany	2018	Acta Ophthalmologica	"[T]he null hypothesis significance testing (NHST) has been the kernel of statistical data analysis for a long period. Medicine is making use of it in a very wide range. The main problem in NHST arises out of a dichotomic perception of reality. Scientific reasoning was biased in the aftermath of the uncritical, almost dogmatic, trust in a level of statistical significance, namely a p‐value, defined as ‘the dictating paradigm of p‐value’...The p‐value refers directly to a formulated hypothesis only, that is the probability of observing a given outcome under the condition posited by a specific null hypothesis and given a specific model of the distribution of outcomes under the null hypothesis...To enhance the plausibility and true scientific value of research works, investigators should also take into consideration the effect size along with its confidence limits..."
The role of P-values in analysing trial results [link]	Freeman	1993	Statistics in Medicine	"Reasons for grave concern over the present situation [widespread use of p-values] range from the unsatisfactory nature of p-values themselves, their very common misunderstanding by statisticians as well as by clinicians and their serious distorting influence on our perception of the very nature of clinical trials. Some of the ways [more sensible reporting can be introduced]...are discussed."
P in the right place: Revisiting the evidential value of P-values [link]	Lytsy	2018	Journal of Evidence-Based Medicine	"P‐values are often calculated when testing hypotheses in quantitative settings, and low P‐values are typically used as evidential measures to support research findings in published medical research. This article reviews old and new arguments questioning the evidential value of P‐values. Critiques of the P‐value include that it is confounded, fickle, and overestimates the evidence against the null. P‐values may turn out falsely low in studies due to random or systematic errors. Even correctly low P‐values do not logically provide support to any hypothesis. Recent studies show low replication rates of significant findings, questioning the dependability of published low P‐values. P‐values are poor indicators in support of scientific propositions. P‐values must be inferred by a thorough understanding of the study's question, design, and conduct. Null hypothesis significance testing will likely remain an important method in quantitative analysis but may be complemented with other statistical techniques that more straightforwardly address the size and precision of an effect or the plausibility that a hypothesis is true."
Poor statistical reporting, inadequate data presentation and spin persist despite editorial advice [link]	Diong et al.	2018	PLoS One (the article studied two journals: The Journal of Physiology and British Journal of Pharmacology)	"The Journal of Physiology and British Journal of Pharmacology jointly published an editorial series in 2011 to improve standards in statistical reporting and data analysis...We conducted a cross-sectional analysis of reporting practices in a random sample of research papers published in these journals before (n = 202) and after (n = 199) publication of the editorial advice...There was no evidence that reporting practices improved following publication of the editorial advice...Of papers that reported p-values between 0.05 and 0.1, 56-63% interpreted these as trends or statistically significant...Overall, poor statistical reporting, inadequate data presentation and spin were present before and after the editorial advice was published..."
Common Misconceptions about Data Analysis and Statistics [link]	Motulsky	2014	British Journal of Pharmacology	"Ideally, any experienced investigator with the right tools should be able to reproduce a finding published in a peer-reviewed biomedical science journal. In fact, however, the reproducibility of a large percentage of published findings has been questioned. Undoubtedly, there are many reasons for this, but one reason may be that investigators fool themselves due to a poor understanding of statistical concepts. In particular, investigators often make these mistakes: 1) P-hacking, which is when you reanalyze a data set in many different ways, or perhaps reanalyze with additional replicates, until you get the result you want; 2) overemphasis on P values rather than on the actual size of the observed effect; 3) overuse of statistical hypothesis testing, and being seduced by the word "significant"; and 4) over-reliance on standard errors, which are often misunderstood."
The Legend of the P Value [link]	Kain	2005	Anesthesia & Analgesia	"Although there is a growing body of literature criticizing the use of mere statistical significance as a measure of clinical impact, much of this literature remains out of the purview of the discipline of anesthesiology. Currently, the magical boundary of P < 0.05 is a major factor in determining whether a manuscript will be accepted for publication or a research grant will be funded. Similarly, the Federal Drug Administration does not currently consider the magnitude of an advantage that a new drug shows over placebo. As long as the difference is statistically significant, a drug can be advertised in the United States as 'effective' whether clinical trials proved it to be 10% or 200% more effective than placebo. We submit that if a treatment is to be useful to our patients, it is not enough for treatment effects to be statistically significant; they also need to be large enough to be clinically meaningful."
The Controversy of Significance Testing: Misconceptions and Alternatives [link]	Glaser	1999	American Journal of Critical Care	"The significance testing approach has had defenders and opponents for decades...The primary concerns have been (1) the misuse of significance testing, (2) the misinterpretation of P values, and (3) the lack of accompanying statistics, such as effect sizes and confidence intervals...This article presents the current thinking...on significance testing."
Consequences of relying on statistical significance: Some illustrations [link]	Van Calster et al.	2018	European Journal of Clinical Investigation	"Despite regular criticisms of null hypothesis significance testing (NHST), a focus on testing persists, sometimes in the belief to get published and sometimes encouraged by journal reviewers. This paper aims to demonstrate known key limitations of NHST using simple nontechnical illustrations...Researchers and journals should abandon statistical significance as a pivotal element in most scientific publications. Confidence intervals around effect sizes are more informative, but should not merely be reported to comply with journal requirements."
To “P” or not to “P”-That is not the question [link]	MacDermid	2018	Journal of Hand Therapy	"Reoccurring issues that arise in papers submitted to The Journal of Hand Therapy are around the use of P values. These issues affect the way results are reported and understood. Misconceptions about the P value often lead authors to focus on P values, which are relatively uninformative, instead of focusing on the size and importance of the effects they observed–which are very important. Authors are potentially misleading knowledge users like clinicians or policymakers, when they erroneously focus on P values rather than the nature and size of their findings, and how they relate to their hypothesis or the relevance of the effect size in practice."
Misinterpretations of the ‘p value’: a brief primer for academic sports medicine [link]	Stovitz, Verhagen, and Shrier	2016	British Journal of Sports Medicine	"When comparing treatment groups, the p value is a statistical measure that summarises the chance (‘p’ for probability) that one would obtain the observed result (or more extreme), if and only if, the treatment is ineffective (ie, under the assumption of the ‘null’ hypothesis). The p value does not tell us the probability that the null hypothesis is true. This editorial discusses how some common misinterpretations of the p value may impact sports medicine research."
"Evidence"-based medicine in eating disorders research: The problem of "confetti p values" [link]	Kraemer	2017	International Journal of Eating Disorders	"Eating disorders hold a unique place among mental health disorders, in that salient symptoms can be objectively observed and measured rather than determined only from patient interviews or subjective evaluations. Because of this measurement advantage alone, evidence-based medicine would be expected there to make the most rapid strides. However, conclusions in Eating Disorders research, as in all medical research literature, often continue to be misleading or ambiguous. One major and long-known source of such problems is the misuse and misinterpretation of "statistical significance", with "p values" strewn throughout research papers like so much confetti, a problem that has become systemic, that is, enforced, rather than corrected, by the peer-review system. This discussion attempts to clarify the issues, and to suggest how readers might deal with this issue in processing the research literature."
Best uses of p-values and complementary measures in medical research: Recent developments in the frequentist and Bayesian frameworks [link]	Quatto, Ripamonti, and Marasini	2019	Journal of Biopharmaceutical Statistics	"The p-value is a classical proposal of statistical inference, dating back to the seminal contributions by Fisher, Neyman and E. Pearson. However, p-values have been frequently misunderstood and misused in practice, and medical research is not an exception."
Statistical hypothesis testing and common misinterpretations: Should we abandon p-value in forensic science applications? [link]	Taroni, Biedermann, and Bozza	2016	Forensic Science International	"Many people regard the concept of hypothesis testing as fundamental to inferential statistics...More recently, controversial discussion was initiated by an editorial decision of a scientific journal to refuse any paper submitted for publication containing null hypothesis testing procedures. Since the large majority of papers published in forensic journals propose the evaluation of statistical evidence based on the so called p-values, it is of interest to expose the discussion of this journal's decision within the forensic science community."
Problems In Common Interpretations Of Statistics In Scientific Articles, Expert Reports, And Testimony [link]	Greenland and Poole	2016	Jurimetrics	"Despite articles and books on proper interpretation of statistics, it is still common in expert reports as well as scientific and statistical literature to see basic misinterpretations and neglect of background assumptions that underlie all statistical inferences. This problem can be attributed to the complexities of correct definitions of concepts such as P-values, statistical significance, and confidence intervals. These complexities lead to oversimplifications and subsequent misinterpretations by authors and readers. Thus, the present article focuses on what these concepts are not, which allows a more nonmathematical approach. The goal is to provide reference points for courts and other lay readers to identify misinterpretations and misleading claims."
Some recurrent problems in interpreting statistical evidence in equal employment cases [link]	Gastwirth	2017	Law Probability and Risk	"Although the U.S. Supreme Court accepted statistical evidence in cases concerning discrimination against minorities in jury pools and equal employment in 1977, several misinterpretations of the results of statistical analyses still occur in legal decisions. Several of these problems will be described and statistical approaches that are more reliable are presented. For example, a number of opinions give an erroneous description of the p-value of a statistical test or fail to consider the power of the test. Others do not distinguish between an analysis of a simple aggregation of data stratified into homogeneous subgroups, and one that controls for subgroup membership. Courts have used measures of 'practical significance' that lack a sound statistical foundation. This has led to a split in the Circuits concerning the appropriateness of 'practical' versus 'statistical' significance for the evaluation of statistical evidence."
The Insignificance of Null Hypothesis Significance Testing [link]	Gill	1999	Political Research Quarterly	"The current method of hypothesis testing in the social sciences is under intense criticism, yet most political scientists are unaware of the important issues being raised...In this article I review the history of the null hypothesis significance testing paradigm in the social sciences and discuss major problems, some of which are logical inconsistencies while others are more interpretive in nature..."
An Analysis of the Use of Statistical Testing in Communication Research [link]	Katzer and Sodt	1973	Journal of Communication	"A study was conducted to determine the adequacy of statistical testing in communication research. Every article published in the 1971–72 issues of the Journal of Communication was studied. For those studies employing statistical testing, we computed the power of those tests and the observed effect size. We found the average a priori power to be 0.55, a figure which is probably much lower than communication researchers would desire. While the average observed effect size was high, we found little evidence of its being used in the interpretation of findings. This study also discovered a large amount of inconsistency in the reporting of statistical findings...we suggest some guidelines for presenting statistical data."
A Critical Assessment of Null Hypothesis Significance Testing in Quantitative Communication Research [link]	Levine et al.	2008	Human Communication Research	"Null hypothesis significance testing (NHST) is the most widely accepted and frequently used approach to statistical inference in quantitative communication research. NHST, however, is highly controversial, and several serious problems with the approach have been identified. This paper reviews NHST and the controversy surrounding it. Commonly recognized problems include a sensitivity to sample size, the null is usually literally false, unacceptable Type II error rates, and misunderstanding and abuse. Problems associated with the conditional nature of NHST and the failure to distinguish statistical hypotheses from substantive hypotheses are emphasized. Recommended solutions and alternatives are addressed in a companion article."
Theoretical and empirical distributions of the p value [link]	Butler and Jones	2017	Metron	"The use of p values in null hypothesis statistical tests (NHST) is controversial in the history of applied statistics, owing to a number of problems. They are: arbitrary levels of Type I error, failure to trade off Type I and Type II error, misunderstanding of p values, failure to report effect sizes, and overlooking better means of reporting estimates of policy impacts, such as effect sizes, interpreted confidence intervals, and conditional frequentist tests. This paper analyzes the theory of p values and summarizes the problems with NHST. Using a large data set of public school districts in the United States, we demonstrate empirically the unreliability of p values and hypothesis tests as predicted by the theory. We offer specific suggestions for reporting policy research."
Is there life after P<0.05? Statistical significance and quantitative sociology [link]	Engman	2013	Quality and Quantity	"The overwhelming majority of quantitative work in sociology reports levels of statistical significance. Often, significance is reported with little or no discussion of what it actually entails philosophically, and this can be problematic when analyses are interpreted...The first section of this paper deals with this common misunderstanding...The third section is devoted to a discussion of the consequences of misinterpreting statistical significance for sociology. It is argued that reporting statistical significance provides sociology with very little value, and that the consequences of misinterpreting significance values outweighs the benefits of their use."
The need for nuance in the null hypothesis significance testing debate [link]	Häggström	2017	Educational and Psychological Measurement	"Null hypothesis significance testing (NHST) provides an important statistical toolbox, but there are a number of ways in which it is often abused and misinterpreted, with bad consequences for the reliability and progress of science. Parts of contemporary NHST debate, especially in the psychological sciences, is reviewed, and a suggestion is made that a new distinction between strongly, weakly and very weakly anti-NHST positions is likely to bring added clarity to the debate."
Effect Size Use in Studies of Learning Disabilities [link]	Ives	2003	Journal of Learning Disabilities	"The misinterpretation and overuse of significance testing in the social sciences has been widely criticized. This criticism is reviewed, along with several recommendations found in the literature, including the use of effect size measures to enhance the interpretation of significance testing. A review of typical effect size measures and their application is followed by an analysis of the extent to which effect size measures have been applied in three prominent journals on learning disabilities over a 10-year period. Specific recommendations are offered for using effect size measures to improve the quality of reporting on quantitative research in the field of learning disabilities."
Scientific rigour in psycho-oncology trials: Why and how to avoid common statistical errors [link]	Bell, Olivier, and King	2013	Psycho-Oncology	"It is well documented that statistical and methodological flaws are common in much of the health research literature, including psycho-oncology. These can have far-reaching effects, including the publishing of misleading results; the wasting of time, effort, and financial resources; exposure of patients to the potential harms of research and decreased confidence in science and researchers by the public. Several of the most common statistical errors and methodological pitfalls that occur in the field of psycho-oncology are discussed...These include proper approaches to power...and correct interpretation of p-values..."
Do We Understand Classic Statistics? [link]	Blasco	2017	Book chapter in Bayesian Data Analysis for Animal Scientists	"In this chapter, we review the classical statistical concepts and procedures, test of hypothesis, standard errors and confidence intervals...and we examine the most common misunderstandings about them..."

References

Klaus E. Meyer, Arjen van Witteloostuijn, Sjoerd Beugelsdijk, “What’s in a p? Reassessing Best Practices for Conducting and Reporting Hypothesis-Testing Research”, Research Methods in International Business, 2019 [link]