About: Replication crisis is a research topic. Over the lifetime, 276 publications have been published within this topic receiving 14105 citations. The topic is also known as: replicability crisis & reproducibility crisis.
TL;DR: A large-scale assessment suggests that experimental reproducibility in psychology leaves a lot to be desired, and correlational tests suggest that replication success was better predicted by the strength of original evidence than by characteristics of the original and replication teams.
Abstract: Reproducibility is a defining feature of science, but the extent to which it characterizes current research is unknown. We conducted replications of 100 experimental and correlational studies published in three psychology journals using high-powered designs and original materials when available. Replication effects were half the magnitude of original effects, representing a substantial decline. Ninety-seven percent of original studies had statistically significant results. Thirty-six percent of replications had statistically significant results; 47% of original effect sizes were in the 95% confidence interval of the replication effect size; 39% of effects were subjectively rated to have replicated the original result; and if no bias in original results is assumed, combining original and replication results left 68% with statistically significant effects. Correlational tests suggest that replication success was better predicted by the strength of original evidence than by characteristics of the original and replication teams.
TL;DR: This work argues for the adoption of measures to optimize key elements of the scientific process: methods, reporting and dissemination, reproducibility, evaluation and incentives, in the hope that this will facilitate action toward improving the transparency, reproducible and efficiency of scientific research.
Abstract: Improving the reliability and efficiency of scientific research will increase the credibility of the published scientific literature and accelerate discovery. Here we argue for the adoption of measures to optimize key elements of the scientific process: methods, reporting and dissemination, reproducibility, evaluation and incentives. There is some evidence from both simulations and empirical studies supporting the likely effectiveness of these measures, but their broad adoption by researchers, institutions, funders and journals will require iterative evaluation and improvement. We discuss the goals of these measures, and how they can be implemented, in the hope that this will facilitate action toward improving the transparency, reproducibility and efficiency of scientific research.
TL;DR: Although the very public problems experienced by psychology over this 2-year period are embarrassing to those of us working in the field, some have found comfort in the fact that, over the same period, similar concerns have been arising across the scientific landscape.
Abstract: Is there currently a crisis of confidence in psychological science reflecting an unprecedented level of doubt among practitioners about the reliability of research findings in the field? It would certainly appear that there is. These doubts emerged and grew as a series of unhappy events unfolded in 2011: the Diederik Stapel fraud case (see Stroebe, Postmes, & Spears, 2012, this issue), the publication in a major social psychology journal of an article purporting to show evidence of extrasensory perception (Bem, 2011) followed by widespread public mockery (see Galak, LeBoeuf, Nelson, & Simmons, in press; Wagenmakers, Wetzels, Borsboom, & van der Maas, 2011), reports by Wicherts and colleagues that psychologists are often unwilling or unable to share their published data for reanalysis (Wicherts, Bakker, & Molenaar, 2011; see also Wicherts, Borsboom, Kats, & Molenaar, 2006), and the publication of an important article in Psychological Science showing how easily researchers can, in the absence of any real effects, nonetheless obtain statistically significant differences through various questionable research practices (QRPs) such as exploring multiple dependent variables or covariates and only reporting these when they yield significant results (Simmons, Nelson, & Simonsohn, 2011). For those psychologists who expected that the embarrassments of 2011 would soon recede into memory, 2012 offered instead a quick plunge from bad to worse, with new indications of outright fraud in the field of social cognition (Simonsohn, 2012), an article in Psychological Science showing that many psychologists admit to engaging in at least some of the QRPs examined by Simmons and colleagues (John, Loewenstein, & Prelec, 2012), troubling new meta-analytic evidence suggesting that the QRPs described by Simmons and colleagues may even be leaving telltale signs visible in the distribution of p values in the psychological literature (Masicampo & Lalande, in press; Simonsohn, 2012), and an acrimonious dust-up in science magazines and blogs centered around the problems some investigators were having in replicating well-known results from the field of social cognition (Bower, 2012; Yong, 2012). Although the very public problems experienced by psychology over this 2-year period are embarrassing to those of us working in the field, some have found comfort in the fact that, over the same period, similar concerns have been arising across the scientific landscape (triggered by revelations that will be described shortly). Some of the suspected causes of unreplicability, such as publication bias (the tendency to publish only positive findings) have been discussed for years; in fact, the phrase file-drawer problem was first coined by a distinguished psychologist several decades ago (Rosenthal, 1979). However, many have speculated that these problems have been exacerbated in recent years as academia reaps the harvest of a hypercompetitive academic climate and an incentive scheme that provides rich rewards for overselling one’s work and few rewards at all for caution and circumspection (see Giner-Sorolla, 2012, this issue). Equally disturbing, investigators seem to be replicating each others’ work even less often than they did in the past, again presumably reflecting an incentive scheme gone askew (a point discussed in several articles in this issue, e.g., Makel, Plucker, & Hegarty, 2012). The frequency with which errors appear in the psychological literature is not presently known, but a number of facts suggest it might be disturbingly high. Ioannidis (2005) has shown through simple mathematical modeling that any scientific field that ignores replication can easily come to the miserable state wherein (as the title of his most famous article puts it) “most published research findings are false” (see also Ioannidis, 2012, this issue, and Pashler & Harris, 2012, this issue). Meanwhile, reports emerging from cancer research have made such grim scenarios seem more plausible: In 2012, several large pharmaceutical companies revealed that their efforts to replicate exciting preclinical findings from published academic studies in cancer biology were only rarely verifying the original results (Begley & Ellis, 2012; see also Osherovich, 2011; Prinz, Schlange, & Asadullah, 2011).
TL;DR: In this article, the authors develop strategies for improving scientific practices and knowledge accumulation that account for ordinary human motivations and biases, and demonstrate that the persistence of false findings can be mitigated with strategies that make the fundamental but abstract accuracy motive competitive with the more tangible and concrete incentive.
Abstract: An academic scientist’s professional success depends on publishing. Publishing norms emphasize novel, positive results. As such, disciplinary incentives encourage design, analysis, and reporting decisions that elicit positive results and ignore negative results. Prior reports demonstrate how these incentives inflate the rate of false effects in published science. When incentives favor novelty over replication, false results persist in the literature unchallenged, reducing efficiency in knowledge accumulation. Previous suggestions to address this problem are unlikely to be effective. For example, a journal of negative results publishes otherwise unpublishable reports. This enshrines the low status of the journal and its content. The persistence of false findings can be meliorated with strategies that make the fundamental but abstract accuracy motive—getting it right—competitive with the more tangible and concrete incentive—getting it published. This article develops strategies for improving scientific practices and knowledge accumulation that account for ordinary human motivations and biases.
TL;DR: This article suggests that so-called failures to replicate may not be failures at all, but rather are the result of low statistical power in single replication studies, and of failure to appreciate the need for multiple replications in order to have enough power to identify true effects.
Abstract: Psychology has recently been viewed as facing a replication crisis because efforts to replicate past study findings frequently do not show the same result. Often, the first study showed a statistically significant result but the replication does not. Questions then arise about whether the first study results were false positives, and whether the replication study correctly indicates that there is truly no effect after all. This article suggests these so-called failures to replicate may not be failures at all, but rather are the result of low statistical power in single replication studies, and the result of failure to appreciate the need for multiple replications in order to have enough power to identify true effects. We provide examples of these power problems and suggest some solutions using Bayesian statistics and meta-analysis. Although the need for multiple replication studies may frustrate those who would prefer quick answers to psychology's alleged crisis, the large sample sizes typically needed to provide firm evidence will almost always require concerted efforts from multiple investigators. As a result, it remains to be seen how many of the recently claimed failures to replicate will be supported or instead may turn out to be artifacts of inadequate sample sizes and single study replications.