Reproducing Experiments

I have a long commute to work every day, and on those drives I often listen to podcasts. One by the NPR Planet Money team struck a nerve recently. It was called “The Experiment Experiment” and was about reproducing results in the field of psychology. As an aside, physicist David Kestenbaum is one of the Planet Money reporters who relayed this story.

In the episode, Brian Nosek, a psychologist at the University of Virginia, is able to (amazingly!) persuade 270 others working in his field to try to reproduce 100 experiments that were published in the three top-tier psychology journals. The result?

64 of the 100 experiments were not reproduced.

The most interesting part of this podcast is the analysis of why this occurred. Their hypothesis is that two main factors are to blame:

  1. The “file-drawer effect”
  2. Psychologists tricking themselves due to misaligned incentives

File-Drawer Effect

Let’s use, as they did in the episode, the instructive analogy of coin-flipping to understand these two ideas. It goes something like this. If 100 researchers each do the same experiment of flipping 10 coins, most of them will obtain somewhere between 4 and 6 heads. These researchers are likely to put their results in their drawers and move on to their next experiment. However, there will be 1 or 2 researchers who get the astounding result of 9 out of 10 or 1 out of 10 heads. These researchers will think to themselves, “Whoa! What is the probability of me obtaining that result?!” They will calculate it, see that the probability of that exact result is about 1%, and publish it, thinking that it is statistically significant.
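The arithmetic behind this story can be checked directly. What follows is a minimal sketch, assuming fair coins and fully independent researchers; the function name binomial_pmf and the variable names are my own and do not come from the episode.

```python
from math import comb

def binomial_pmf(k, n, p=0.5):
    """Probability of exactly k heads in n flips of a coin with heads-probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

N_FLIPS = 10
N_RESEARCHERS = 100  # assumption: independent researchers, fair coins

# Chance that one researcher sees exactly 9 heads (the "about 1%" figure).
p_nine = binomial_pmf(9, N_FLIPS)

# Chance that one researcher sees either striking result: 9 heads or 1 head.
p_striking = binomial_pmf(9, N_FLIPS) + binomial_pmf(1, N_FLIPS)

# Expected number of researchers, out of 100, with a striking result.
expected_striking = N_RESEARCHERS * p_striking

# Chance that at least one of the 100 researchers has a striking result.
p_at_least_one = 1 - (1 - p_striking) ** N_RESEARCHERS

print(f"P(exactly 9 heads)            = {p_nine:.4f}")
print(f"P(9 heads or 1 head)          = {p_striking:.4f}")
print(f"Expected striking researchers = {expected_striking:.2f}")
print(f"P(at least one of 100)        = {p_at_least_one:.3f}")
```

Under these assumptions, any single striking result is rare, yet it is quite likely that someone in a group of 100 obtains one. That gap is the heart of the file-drawer effect.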

Just as an illustration, here is the distribution one would naively expect from 100 researchers doing the coin-flipping experiment:
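As a rough stand-in for such a plot, the expected counts can be computed exactly. This is a minimal sketch, assuming fair coins and independent researchers; the function name expected_counts is hypothetical.

```python
from math import comb

N_FLIPS = 10
N_RESEARCHERS = 100  # assumption: fair coins, independent researchers

def expected_counts(n_flips, n_researchers):
    """Expected number of researchers seeing k heads, for k = 0..n_flips."""
    total_outcomes = 2 ** n_flips
    return [n_researchers * comb(n_flips, k) / total_outcomes
            for k in range(n_flips + 1)]

counts = expected_counts(N_FLIPS, N_RESEARCHERS)

# Crude text histogram: one '#' per expected researcher (rounded).
for heads, count in enumerate(counts):
    print(f"{heads:2d} heads | {'#' * round(count):<25} {count:5.1f}")
```

The bars are symmetric about 5 heads, and the tails at 1 and 9 heads each hold about one researcher.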


Tricking Themselves

The other component that they claim contributes to these striking reproducibility numbers is that researchers have an incentive to obtain positive results. Positive results get researchers publications, which result in promotions for tenure-track faculty and jobs for graduate students and postdocs. Therefore, due to the incentive structure, researchers have a natural bias towards positive results. This is not to imply that these researchers are committing scientific misconduct; they are just unaware of their biases.

Let us take the coin-flipping example again and start from the above graphic to see how this might work. Approximately 12 of 100 researchers flipping coins will obtain 7 heads out of 10 coin flips. This would not be a remarkably significant result, but then suppose all 12 of them think, “Let me just check to see if this result is true,” and they flip the coin another 4 times. Now, 3 or 4 of those 12 researchers will obtain 3 or 4 heads in those 4 flips, reinforcing their previous result! They will then think, “Well, this result must be true! I better publish this!”
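The two-step estimate above can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming fair coins and that the follow-up flips are independent of the first ten; the helper name prob_at_least is my own.

```python
from math import comb

def prob_at_least(k_min, n):
    """Probability of at least k_min heads in n flips of a fair coin."""
    return sum(comb(n, k) for k in range(k_min, n + 1)) / 2 ** n

N_RESEARCHERS = 100  # assumption: independent researchers, fair coins

# Step 1: chance of exactly 7 heads in the first 10 flips.
p_seven_of_ten = comb(10, 7) / 2 ** 10

# Step 2: chance that 4 follow-up flips "confirm" it with 3 or 4 heads.
p_confirm = prob_at_least(3, 4)

expected_seven = N_RESEARCHERS * p_seven_of_ten
expected_confirmed = expected_seven * p_confirm

print(f"P(7 of 10 heads)            = {p_seven_of_ten:.4f}")
print(f"P(3 or 4 heads in 4 flips)  = {p_confirm:.4f}")
print(f"Expected with 7 of 10       = {expected_seven:.1f}")
print(f"Expected 'confirmed'        = {expected_confirmed:.1f}")
```

Nearly a third of the researchers who saw 7 heads would see their hunch reinforced by a follow-up that carries very little statistical weight.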

One can see how these two effects could combine to lead to the staggering number of results that are not reproducible. Because the incentive structure in our field is similar, one fears that such things may be going on in physics departments as well. I would like to hope not, but if psychologists are susceptible to psychological pressure, who isn’t?

9 responses to “Reproducing Experiments”

  1. One of the biggest problems in psychology, it seems to me, is that “realness of the effect” is almost always equated with “p-value”. And p-value is a highly questionable statistic.


  2. Note though that in your example the erroneous ways lie in the lack of a proper statistical analysis. This is an issue in psychological and medical studies where one works with groups, and thus by definition, works with statistical distributions. Moreover, the research subjects are not inactive; they are (self-) conscious and therefore may be biased as well (apart from a possible bias of the researchers).

    In physics I think this is somewhat less of an issue, though the number of times an experiment has to reproduce should (indeed) be determined using statistical approaches. One would expect though that the samples produced should be identical in nature within random (as opposed to biased) statistical variations which can simply be taken into account in the analysis of how often an experiment needs to be repeated.


  3. Neel Mansukhani

    I listened to that same Planet Money podcast yesterday and have been reflecting on another point they made: the studies which were not reproduced do not necessarily show that the initial study was false, but can reveal nuances of the initial findings which were not initially known or reported. Studies which are not reproduced, when examined more closely, may not have been conducted under identical circumstances, which may affect the final outcome. An experiment which does not reproduce previous findings has value in that it can contribute to the bigger picture of the underlying process being studied when the methods are examined for differences compared to the initial study. This is done at a micro level daily when perfecting methods and strengthening results in the lab. Brian Nosek is simply doing this at a macro level in repeating published experiments.
