Are psychological findings reliable? Do published studies mostly represent true effects, or are spurious results that look significant only through random sampling error overrepresented because of publication bias, the tendency to publish only significant effects? An increasing number of projects are being undertaken to answer these questions by meticulously replicating previously published research to see whether the same results are obtained. One of these projects is the Registered Replication Report, recently unveiled at the journal Perspectives on Psychological Science. A Registered Replication Report, or RRR, allows psychologists to declare their intention to replicate a published finding and to specify their analysis plan in advance, so we can be confident that what they report is a confirmatory analysis (designed to test an established hypothesis) rather than an exploratory one (designed to test many possible results to generate hypotheses).

Recently the first RRR was published at Perspectives. In this report, 31 labs collaborated to replicate an experiment by Schooler and Engstler-Schooler (1990) on verbal overshadowing – the finding that describing something verbally (in this case, a suspect in a criminal case) impairs subsequent visual identification of that same thing. The scale of this multi-lab effort is unusual, and likely reflects initial enthusiasm for the idea of the RRR, but it does provide a mountain of evidence on verbal overshadowing. Due to an error in the initial protocol, only the second of the two studies in the replication report is a direct replication of one of the studies in the original paper. That still leaves data from 22 labs for this experiment, which found a reliable drop in identification accuracy of 16 percentage points when the suspect had been verbally described. This is substantial, if somewhat smaller than the original study, which found an effect of 25 percentage points. Given that the original published effect size falls outside the confidence interval generated by the replication effort, I thought these data might make a good case study in the inflation of effect sizes that can result from publication bias.

When experiments don’t run enough participants to have a good chance of finding their effect (which is common in psychology), studies that find significant effects will tend to be those where random variation around the true effect errs on the side of making it larger, and thus easier to detect. Therefore, if publication bias leads to only significant studies getting published, the average effect size of published studies will be substantially larger than the true effect. Since a little less than half (9 of 22) of the studies in this replication effort are significant individually, it’s possible to directly compare the true effect size (at least as measured by the full set of 22 studies) with the effect size that would result from publication bias.
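The selection effect described above is easy to demonstrate by simulation. The sketch below (in Python, not the paper's own analysis code) repeatedly runs an underpowered two-group study of an assumed true 16-point accuracy drop, keeps only the studies that reach p < .05, and compares the average effect among "published" (significant) studies to the average across all studies. All the accuracy values and sample sizes here are illustrative assumptions, not the RRR's numbers.

```python
# Simulation sketch: publication bias inflates effect size estimates
# when studies are underpowered. All parameters are illustrative.
import random
import statistics
from statistics import NormalDist

random.seed(1)

TRUE_P_CONTROL = 0.54   # assumed control-group identification accuracy
TRUE_P_VERBAL = 0.38    # assumed accuracy after verbal description (16-point drop)
N_PER_GROUP = 50        # small sample -> underpowered study
N_STUDIES = 5000

def run_study():
    """Simulate one two-group study; return (effect_estimate, significant?)."""
    control = sum(random.random() < TRUE_P_CONTROL for _ in range(N_PER_GROUP))
    verbal = sum(random.random() < TRUE_P_VERBAL for _ in range(N_PER_GROUP))
    p1, p2 = control / N_PER_GROUP, verbal / N_PER_GROUP
    effect = p1 - p2
    # Two-proportion z-test with pooled variance
    pooled = (control + verbal) / (2 * N_PER_GROUP)
    se = (2 * pooled * (1 - pooled) / N_PER_GROUP) ** 0.5
    if se == 0:
        return effect, False
    p_value = 2 * (1 - NormalDist().cdf(abs(effect / se)))
    return effect, p_value < 0.05

effects, sig_effects = [], []
for _ in range(N_STUDIES):
    eff, sig = run_study()
    effects.append(eff)
    if sig:
        sig_effects.append(eff)

print(f"mean effect, all studies:      {statistics.mean(effects):.3f}")
print(f"mean effect, significant only: {statistics.mean(sig_effects):.3f}")
```

Across all simulated studies the average effect recovers the true value, but the significant-only average lands well above it, because at this sample size only studies where sampling error happened to exaggerate the effect clear the significance bar.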

The above is a graph of the meta-analysis of the verbal overshadowing effect in Study 2. The meat of the graph is generated from the same code that generated Figure 3 in the paper (found here). In addition, I’ve added a final horizontal line, on which I’ve placed a marker in blue for the meta-analytic effect of only those studies that found a significant effect individually. At the bottom right, you can see how the confidence interval for this new meta-analytic effect compares to the overall effect.

As expected, publication bias yields a larger effect size estimate. How much larger? 23 percentage points compared to 16, a roughly 44% increase in the size of the estimate. This difference speaks to the so-called ‘decline effect’, where the first published finding of any given effect tends to be the largest, and follow-up studies average smaller effects. We can see that the original study by Schooler and Engstler-Schooler (represented at the top of the figure) yielded an effect size that falls outside the confidence interval of the larger meta-analysis. However, that initial effect size falls almost perfectly in the middle of the confidence interval generated by the significant studies alone. Since counter-intuitive findings like verbal overshadowing tend to be published only when they are significant, we can assume initial publications of such findings are drawn from a sample of only significant studies, which would completely account for the ‘decline effect’ in this particular case.
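The comparison above can be sketched with a simple fixed-effect (inverse-variance weighted) meta-analysis, pooled once over all labs and once over only the individually significant ones. The per-lab effects and standard errors below are made up for illustration; they are not the RRR data.

```python
# Fixed-effect meta-analysis sketch: pooling all studies vs. only the
# individually significant ones. Per-lab numbers are hypothetical.
def pooled_effect(studies):
    """Inverse-variance weighted mean of (effect, standard_error) pairs."""
    weights = [1 / se ** 2 for _, se in studies]
    return sum(w * eff for (eff, _), w in zip(studies, weights)) / sum(weights)

# (effect, standard error, individually significant?) -- hypothetical labs
labs = [
    (0.25, 0.09, True), (0.21, 0.08, True), (0.19, 0.08, True),
    (0.12, 0.09, False), (0.10, 0.10, False), (0.05, 0.09, False),
    (0.14, 0.08, False), (0.23, 0.09, True), (0.08, 0.10, False),
]

all_studies = [(e, se) for e, se, _ in labs]
sig_studies = [(e, se) for e, se, sig in labs if sig]

print(f"pooled, all labs:         {pooled_effect(all_studies):.3f}")
print(f"pooled, significant only: {pooled_effect(sig_studies):.3f}")
```

Even with honest reporting in every lab, the significant-only pool sits well above the full pool, which is exactly the gap between the blue marker and the overall meta-analytic effect in the figure.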

Of course, this doesn’t indicate that there was anything suspect about the Schooler and Engstler-Schooler (1990) study. Looking at the graph above, several of the replication efforts found effects that almost perfectly mirrored the initial study. If we imagine these 22 replication efforts as 22 original studies, each of the 9 labs that found a significant effect could publish it with completely honest reporting, while the other 13 conclude they failed to find significant results and go back to the drawing board. Thus, each published paper would be an over-estimate of the effect it reports on, while being perfectly valid, statistically, in its own right.

Given all of this, it seems reasonable to weight effect size estimates downward to some degree when reading a paper that attempts to verify a counter-intuitive hypothesis for the first time. The size of the publication bias effect depends on the power of published studies to detect a real effect, so you can adjust your estimates less when the published study has a substantial number of participants relative to the size of the effect it is looking to test. If you want to get a better idea about power estimates, programs like G*Power can be useful. One bright note on this front is that, with the sudden ease of collecting large numbers of participants through sites like Amazon’s Mechanical Turk, many psychology publications going forward can be more than adequately powered, reducing the effect of publication bias substantially.
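For a back-of-the-envelope version of what G*Power computes, the normal approximation for a two-proportion comparison fits in a few lines. The accuracy values here are illustrative assumptions (a 16-point drop from 54% to 38%), not the paper's figures.

```python
# Rough power sketch for a two-sided two-proportion z-test, using the
# normal approximation. Illustrative accuracy values, not the paper's.
from math import sqrt
from statistics import NormalDist

def two_prop_power(p1, p2, n_per_group, alpha=0.05):
    """Approximate power to detect the difference between p1 and p2."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = sqrt(p1 * (1 - p1) / n_per_group + p2 * (1 - p2) / n_per_group)
    z_effect = abs(p1 - p2) / se
    return 1 - NormalDist().cdf(z_crit - z_effect)

# Power to detect a 16-point accuracy drop at various sample sizes:
for n in (25, 50, 100, 200):
    print(f"n = {n:3d} per group: power ~ {two_prop_power(0.54, 0.38, n):.2f}")
```

Running this shows power climbing steeply with sample size, which is the sense in which large MTurk samples blunt publication bias: well-powered studies find the true effect whether or not sampling error happens to inflate it.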

Replication efforts like this one go a long way toward helping us understand not only how reliable specific effects are but, as more replications stack up, how reliable the published body of studies in psychology is as a whole. The relative ease with which I was able to further analyse this published data also speaks to the value of making data and analysis code publicly available on platforms like the Open Science Framework. In that spirit, the code used to generate my extended plot can be found here.