NHST misuse in peer-reviewed journals

Article summaries

This is a growing list of audits of null hypothesis significance testing (NHST) misuse within published, peer-reviewed articles.

Contact information

We’d love to hear from you! Please direct all inquiries to james@theresear.ch


Quantifying Support for the Null Hypothesis in Psychology: An Empirical Investigation

Balazs Aczel, Bence Palfi, Aba Szollosi, Marton Kovacs, Barnabas Szaszi, Peter Szecsi, Mark Zrubka, Quentin F. Gronau, Don van den Bergh, and Eric-Jan Wagenmakers
Advances in Methods and Practices in Psychological Science, 2018

In the traditional statistical framework, nonsignificant results leave researchers in a state of suspended disbelief. In this study, we examined, empirically, the treatment and evidential impact of nonsignificant results. Our specific goals were twofold: to explore how psychologists interpret and communicate nonsignificant results and to assess how much these results constitute evidence in favor of the null hypothesis. First, we examined all nonsignificant findings mentioned in the abstracts of the 2015 volumes of Psychonomic Bulletin & Review, Journal of Experimental Psychology: General, and Psychological Science (N = 137). In 72% of these cases, nonsignificant results were misinterpreted, in that the authors inferred that the effect was absent. Second, a Bayes factor reanalysis revealed that fewer than 5% of the nonsignificant findings provided strong evidence (i.e., BF01 > 10) in favor of the null hypothesis over the alternative hypothesis. We recommend that researchers expand their statistical tool kit in order to correctly interpret nonsignificant results and to be able to evaluate the evidence for and against the null hypothesis.
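
The "BF01 > 10" criterion above is a default Bayes factor threshold: how much more probable the data are under the null than under a diffuse alternative. As a rough illustration of the kind of reanalysis involved (a sketch following Rouder et al.'s 2009 default JZS test for a one-sample t test, not necessarily the authors' exact analysis; the function name is ours):

```python
import numpy as np
from scipy import integrate

def jzs_bf01(t, n, r=np.sqrt(2) / 2):
    """Default JZS Bayes factor BF01 (evidence for the null over the
    alternative) for a one-sample t test, after Rouder et al. (2009).
    """
    nu = n - 1
    # Marginal likelihood under H0: effect size fixed at zero.
    like_h0 = (1 + t**2 / nu) ** (-(nu + 1) / 2)

    # Under H1, a Cauchy(0, r) prior on the effect size is expressed as a
    # normal mixed over g ~ inverse-gamma(1/2, r^2/2); integrate g out.
    def integrand(g):
        return ((1 + n * g) ** -0.5
                * (1 + t**2 / ((1 + n * g) * nu)) ** (-(nu + 1) / 2)
                * r / np.sqrt(2 * np.pi) * g**-1.5 * np.exp(-r**2 / (2 * g)))

    like_h1, _ = integrate.quad(integrand, 0, np.inf)
    return like_h0 / like_h1

# A typical nonsignificant result (t = 1.0, n = 30) yields only weak
# support for the null, far short of the BF01 > 10 "strong evidence" bar.
print(round(jzs_bf01(t=1.0, n=30), 2))
```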

The prevalence of statistical reporting errors in psychology (1985–2013)

Michèle B. Nuijten, Chris H. J. Hartgerink, Marcel A. L. M. van Assen, Sacha Epskamp, and Jelte M. Wicherts
Behavior Research Methods, 2016

This study documents reporting errors in a sample of over 250,000 p-values reported in eight major psychology journals from 1985 until 2013, using the new R package “statcheck.” statcheck retrieved null-hypothesis significance testing (NHST) results from over half of the articles from this period. In line with earlier research, we found that half of all published psychology papers that use NHST contained at least one p-value that was inconsistent with its test statistic and degrees of freedom. One in eight papers contained a grossly inconsistent p-value that may have affected the statistical conclusion. In contrast to earlier findings, we found that the average prevalence of inconsistent p-values has been stable over the years or has declined. The prevalence of gross inconsistencies was higher in p-values reported as significant than in p-values reported as nonsignificant. This could indicate a systematic bias in favor of significant results. Possible solutions for the high prevalence of reporting inconsistencies could be to encourage sharing data, to let co-authors check results in a so-called “co-pilot model,” and to use statcheck to flag possible inconsistencies in one’s own manuscript or during the review process.
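
statcheck itself is an R package that extracts APA-formatted results from article text; its core check, though, is simple to state: recompute the p-value from the reported test statistic and degrees of freedom and compare it with the reported p. A minimal Python sketch of that logic for t tests (our own approximation, not the package's implementation):

```python
from scipy import stats

def check_t_result(t, df, reported_p, decimals=2, alpha=0.05):
    """Recompute the two-tailed p for a reported "t(df) = ..., p = ..."
    and flag inconsistencies, mimicking statcheck's core logic.
    """
    computed_p = 2 * stats.t.sf(abs(t), df)
    # Allow for the reported p having been rounded to `decimals` places.
    inconsistent = round(computed_p, decimals) != round(reported_p, decimals)
    # A "gross" inconsistency flips the significance verdict at alpha.
    gross = inconsistent and (computed_p < alpha) != (reported_p < alpha)
    return round(computed_p, 4), inconsistent, gross

# "t(28) = 2.20, p = .04" is consistent; "t(28) = 1.50, p = .04" is
# grossly inconsistent (the recomputed p is about .14).
print(check_t_result(2.20, 28, 0.04))
print(check_t_result(1.50, 28, 0.04))
```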

The interpretation of effect size in published articles

Rink Hoekstra
International Conference on Teaching Statistics, 2014

Significance testing has been criticized, among other things, for encouraging researchers to focus on whether or not an effect exists, rather than on the size of an effect. Confidence intervals (CIs), on the other hand, are expected to encourage researchers to focus more on effect size, since CIs combine inference and effect size. Although the importance of focusing on effect size seems undisputed, little is known about how often effect sizes are actually interpreted in published articles. The present paper presents a study of this issue. Interpretations of effect size, if they are presented in the first place, are categorized as unstandardized (content-related) or standardized (not content-related). Moreover, the interpretations of effect size for articles that include a CI are contrasted with articles in which significance testing is the only inferential measure used. Implications for current research practice are discussed.
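
Hoekstra's distinction between unstandardized (content-related) and standardized interpretations is easy to make concrete. A minimal sketch for two independent groups, assuming equal variances (our own illustration, not from the paper):

```python
import numpy as np
from scipy import stats

def effect_sizes(x, y, conf=0.95):
    """Unstandardized mean difference with a CI, plus Cohen's d."""
    nx, ny = len(x), len(y)
    diff = np.mean(x) - np.mean(y)  # effect in the outcome's own units
    sp = np.sqrt(((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1))
                 / (nx + ny - 2))   # pooled standard deviation
    se = sp * np.sqrt(1 / nx + 1 / ny)
    tcrit = stats.t.ppf((1 + conf) / 2, nx + ny - 2)
    ci = (diff - tcrit * se, diff + tcrit * se)
    d = diff / sp                   # standardized, unit-free effect
    return diff, ci, d

rng = np.random.default_rng(1)
x = rng.normal(10.0, 2.0, size=40)  # e.g. treatment scores
y = rng.normal(9.0, 2.0, size=40)   # e.g. control scores
diff, ci, d = effect_sizes(x, y)
print(f"difference = {diff:.2f} points, 95% CI ({ci[0]:.2f}, {ci[1]:.2f}), d = {d:.2f}")
```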

A peculiar prevalence of p values just below .05

E. J. Masicampo and Daniel R. Lalande
The Quarterly Journal of Experimental Psychology, 2012

In null hypothesis significance testing (NHST), p values are judged relative to an arbitrary threshold for significance (.05). The present work examined whether that standard influences the distribution of p values reported in the psychology literature. We examined a large subset of papers from three highly regarded journals. Distributions of p were found to be similar across the different journals. Moreover, p values were much more common immediately below .05 than would be expected based on the number of p values occurring in other ranges. This prevalence of p values just below the arbitrary criterion for significance was observed in all three journals. We discuss potential sources of this pattern, including publication bias and researcher degrees of freedom.
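
A simple way to probe for such a pile-up is a caliper-style comparison of narrow bins on either side of .05 (in the spirit of this literature, not necessarily the authors' exact method): under a smooth distribution of p-values, the two counts should be roughly equal. The counts below are made up for illustration:

```python
from scipy.stats import binomtest

# Hypothetical bin counts for reported p values (illustrative numbers,
# not the paper's data).
just_below = 90  # p in [.045, .050)
just_above = 50  # p in [.050, .055)

# Under a smooth p distribution, a p value landing in this narrow window
# should fall on either side of .05 with roughly equal probability.
result = binomtest(just_below, just_below + just_above, p=0.5)
print(f"two-sided p for an excess just below .05: {result.pvalue:.4f}")
```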

False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant

Joseph P. Simmons, Leif D. Nelson, and Uri Simonsohn
Psychological Science, 2011

In this article, we accomplish two things. First, we show that despite empirical psychologists’ nominal endorsement of a low rate of false-positive findings (≤ .05), flexibility in data collection, analysis, and reporting dramatically increases actual false-positive rates. In many cases, a researcher is more likely to falsely find evidence that an effect exists than to correctly find evidence that it does not. We present computer simulations and a pair of actual experiments that demonstrate how unacceptably easy it is to accumulate (and report) statistically significant evidence for a false hypothesis. Second, we suggest a simple, low-cost, and straightforwardly effective disclosure-based solution to this problem. The solution involves six concrete requirements for authors and four guidelines for reviewers, all of which impose a minimal burden on the publication process.
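
The flavor of their simulations is easy to reproduce. In the sketch below (our own, with illustrative parameters), a "researcher" measures two correlated outcomes under a true null and declares success if either test is significant; the nominal 5% false-positive rate is noticeably inflated:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n_per_group, r = 10_000, 20, 0.5
cov = [[1.0, r], [r, 1.0]]  # two correlated dependent variables
false_positives = 0

for _ in range(n_sims):
    # Both groups come from the SAME population: any "effect" is false.
    a = rng.multivariate_normal([0.0, 0.0], cov, size=n_per_group)
    b = rng.multivariate_normal([0.0, 0.0], cov, size=n_per_group)
    pvals = [stats.ttest_ind(a[:, j], b[:, j]).pvalue for j in range(2)]
    # Flexible reporting: claim success if EITHER measure is significant.
    if min(pvals) < 0.05:
        false_positives += 1

# Prints a rate noticeably above the nominal .05.
print(f"false-positive rate: {false_positives / n_sims:.3f}")
```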

The null hypothesis significance test in health sciences research (1995–2006): statistical analysis and interpretation

Luis Carlos Silva-Ayçaguer, Patricio Suárez-Gil, and Ana Fernández-Somoano
BMC Medical Research Methodology, 2010

The null hypothesis significance test (NHST) is the most frequently used statistical method, although its inferential validity has been widely criticized since its introduction. In 1988, the International Committee of Medical Journal Editors (ICMJE) warned against sole reliance on NHST to substantiate study conclusions and suggested supplementary use of confidence intervals (CIs). Our objective was to evaluate the extent and quality of NHST and CI use in both English- and Spanish-language biomedical publications between 1995 and 2006, in light of the ICMJE recommendations, with particular focus on the accuracy of the interpretation of statistical significance and the validity of conclusions.

Probability as certainty: Dichotomous thinking and the misuse of p values

Rink Hoekstra, Sue Finch, Henk A. L. Kiers, and Addie Johnson
Psychonomic Bulletin & Review, 2006

Significance testing is widely used and often criticized. The Task Force on Statistical Inference of the American Psychological Association (TFSI, APA; Wilkinson & TFSI, 1999) addressed the use of significance testing and made recommendations that were incorporated in the fifth edition of the APA Publication Manual (APA, 2001). They emphasized the interpretation of significance testing and the importance of reporting confidence intervals and effect sizes. We examined whether 286 Psychonomic Bulletin & Review articles submitted before and after the publication of the TFSI recommendations by APA complied with these recommendations. Interpretation errors when using significance testing were still made frequently, and the new prescriptions were not yet followed on a large scale. Changing the practice of reporting statistics seems doomed to be a slow process.

Reporting of statistical inference in the Journal of Applied Psychology: Little evidence of reform

Sue Finch, Geoff Cumming, and Neil Thomason
Educational and Psychological Measurement, 2001

Reformers have long argued that misuse of Null Hypothesis Significance Testing (NHST) is widespread and damaging. The authors analyzed 150 articles from the Journal of Applied Psychology (JAP) covering 1940 to 1999. They examined statistical reporting practices related to misconceptions about NHST, American Psychological Association (APA) guidelines, and reform recommendations. The analysis reveals (a) inconsistency in reporting alpha and p values, (b) the use of ambiguous language in describing NHST, (c) frequent acceptance of null hypotheses without consideration of power, (d) that power estimates are rarely reported, and (e) that confidence intervals were virtually never used. APA guidelines have been followed only selectively. Research methodology reported in JAP has increased greatly in sophistication over the 60 years, but inference practices have shown remarkable stability. There is little sign that decades of cogent critiques by reformers had, by 1999, led to changes in statistical reporting practices in JAP.
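
Finding (c) is worth making concrete: accepting a null hypothesis is only informative if the test had a realistic chance of detecting an effect. A minimal power sketch for a two-sided, two-sample t test (our own illustration, not from the paper):

```python
import numpy as np
from scipy import stats

def two_sample_power(d, n_per_group, alpha=0.05):
    """Power of a two-sided, two-sample t test for a standardized
    effect size d, with n_per_group observations in each group.
    """
    df = 2 * n_per_group - 2
    ncp = d * np.sqrt(n_per_group / 2)   # noncentrality parameter
    tcrit = stats.t.ppf(1 - alpha / 2, df)
    # Probability the noncentral t statistic lands in the rejection region.
    return stats.nct.sf(tcrit, df, ncp) + stats.nct.cdf(-tcrit, df, ncp)

# With 20 per group and a medium effect (d = 0.5), power is only about one
# in three; a nonsignificant result here is weak grounds for "accepting"
# the null.
print(round(two_sample_power(0.5, 20), 2))
```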

Statistical significance testing as it relates to practice: Use within Professional Psychology: Research and Practice

Tammi Vacha-Haase and Carin Ness
Professional Psychology: Research and Practice, 1999

Reviewed the use of statistical significance testing in the 265 quantitative research articles published in Professional Psychology: Research and Practice from 1990 to 1997. 204 (77%) of these articles used statistical significance testing. Fewer than 20% of the authors correctly used the term statistical significance; many instead described their results simply as "significant." 81.9% of authors did follow APA style by including the degrees of freedom, alpha levels, and values of their test statistics when reporting results. However, the majority of authors made no mention of effect size, although the current APA publication manual (APA, 1994) clearly "encourages" authors to include effect size. The implications of these results for both authors and readers are discussed, with clear suggestions for change proffered.