Category Archives: Paper Writing

Data Representation and Trust

Though popular media often portrays science as purely objective, it has many subjective sides as well. One of these is the trust we place in our peers that they are telling the truth.

For instance, in most experimental papers, one can only present an illustrative portion of the data because of the sheer volume usually acquired. What is presented is supposed to be a representative sample, but as readers, we are never sure this is actually the case. We trust that our experimental colleagues have presented the data in a way that is honest, illustrative of all the data taken, and reproducible under similar conditions. It is increasingly common to publish the remaining data in the supplemental section, but the amount of data taken can easily overwhelm this section as well.

When writing a paper, an experimentalist also has to make certain choices about how to represent the data. Increasingly, the amount of data at the experimentalist’s disposal means that they often choose to show the data using some sort of color scheme in a contour or color density plot. Just take a flip through Nature Physics, for example, to see how popular this style of data representation has become. Almost every cover of Nature Physics features this kind of plot.

However, there are some dangers that come with color schemes if the colors are not chosen appropriately. There is a great post at medvis.org discussing the ills of using, for example, the rainbow color scheme, and how misleading it can be in certain circumstances. Make sure to also take a look at the articles cited therein to get a flavor of what these schemes can do. In particular, there is a paper called “Rainbow Color Map (Still) Considered Harmful,” which has several noteworthy comparisons of different color schemes, including ones that are and are not perceptually linear. Take a look at the plots below and compare the different color schemes chosen to represent the same data set (taken from the “Rainbow Color Map (Still) Considered Harmful” paper):

[Figure: the same data set rendered with several different color schemes, including the rainbow map]

The rainbow scheme appears to show more drastic gradients than the other color schemes. My point, though, is that by choosing certain color schemes, an experimentalist can artificially enhance an effect or obscure one they do not want the reader to notice.
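To see this effect for yourself, here is a minimal sketch (assuming numpy and matplotlib are available) that renders the same smooth, featureless test data with the rainbow-style “jet” map and with the perceptually uniform “viridis” map; the former tends to introduce apparent bands and sharp gradients that are not in the data:

```python
# Compare a rainbow color map ("jet") with a perceptually uniform one
# ("viridis") on the same smooth test data.
import numpy as np
import matplotlib.pyplot as plt

# Smooth, featureless data: a broad 2D Gaussian bump.
x = np.linspace(-3, 3, 200)
X, Y = np.meshgrid(x, x)
Z = np.exp(-(X**2 + Y**2) / 4)

fig, axes = plt.subplots(1, 2, figsize=(9, 4))
for ax, cmap in zip(axes, ["jet", "viridis"]):
    im = ax.pcolormesh(X, Y, Z, cmap=cmap, shading="auto")
    ax.set_title(cmap)
    fig.colorbar(im, ax=ax)
plt.show()
```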

In fact, the experimentalist makes many choices when publishing a paper: the size of an image, the bounds of the axes, the scale of the axes (e.g. linear vs. log), the outliers omitted, and so on, all of which can have profound effects on the message of the paper. This is why an underlying issue of trust lurks within the community. We trust that experimentalists choose to exhibit data in an attempt to be as honest as they can be. Of course, there are always subconscious biases lurking when these choices are made. But my hope is that experimentalists are mindful and introspective when representing data, doubting themselves to a healthy extent before publishing results.

To be a part of the scientific community means that, among other things, you are accepted for your honesty and that your work is (hopefully) trustworthy. A breach of this implicit contract is seen as a grave offence and is why cases of misconduct are taken so seriously.

Goodhart’s Law and Citation Metrics

According to Wikipedia, Goodhart’s law colloquially states that:

“When a measure becomes a target, it ceases to be a good measure.”

It was originally formulated as an economics principle, but has been found to be applicable in a much wider variety of circumstances. Let’s take a look at a few examples to understand what this principle means.

Police departments are often graded using crime statistics. In the US in particular, a combined index of eight categories constitutes a “crime index.” In 2014, Chicago magazine reported that the huge crime reduction seen in Chicago was largely due to the reclassification of certain crimes. Here is the summary plot they showed:

[Figure: summary plot of Chicago crime statistics, reprinted from Chicago magazine]

In effect, some felonies were labeled misdemeanors, etc. The manipulation of the “crime index” corrupted the way the police did their jobs.

Another famous example of Goodhart’s law is Google’s search algorithm, known as PageRank. Crudely, PageRank works in the following way as described by Wikipedia:

“PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.”

Knowing how PageRank works has obviously led to its manipulation. People seeking greater visibility in Google searches have used several schemes to raise their ranking. One of the most popular schemes is to post links to one’s own website in the comments section of high-ranked websites in order to inflate one’s own ranking. You can read a little more about this and other schemes here (pdf!).
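For concreteness, here is a minimal power-iteration sketch of the idea described in the quote above. The four-page “web” is a hypothetical toy example, not Google’s actual implementation, but it shows why stuffing extra inbound links (as in the comment-spam scheme) inflates a page’s score:

```python
# Toy power-iteration sketch of the idea behind PageRank.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],   # D exists only to funnel an extra link toward C,
}                 # much like a spam link in a comments section.

pages = list(links)
damping = 0.85
rank = {p: 1.0 / len(pages) for p in pages}

for _ in range(50):  # iterate until the scores settle
    new_rank = {p: (1 - damping) / len(pages) for p in pages}
    for page, outgoing in links.items():
        share = damping * rank[page] / len(outgoing)
        for target in outgoing:
            new_rank[target] += share
    rank = new_rank

# C collects the most inbound links and ends up ranked highest.
print(sorted(rank.items(), key=lambda kv: -kv[1]))
```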

With the increased use of citation metrics in the academic community, it should come as no surprise that these metrics can also become corrupted. Increasingly, there are many authors per paper, since every co-author can take equal credit for a paper when the h-index is used as the scale. Many scientists also spend time emailing their colleagues to urge them to cite one of their papers (I only know of this happening anecdotally).
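For readers unfamiliar with the metric, here is a minimal sketch of how an h-index is computed from a list of per-paper citation counts; the counts are made up for illustration. Because the index is computed per author from whole papers, every co-author of a highly cited paper receives the same credit:

```python
# Minimal h-index computation from per-paper citation counts.
def h_index(citations):
    """Largest h such that at least h papers have h or more citations each."""
    counts = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(counts, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

# Hypothetical citation counts; every co-author of these papers would be
# credited with the same h-index.
print(h_index([25, 8, 5, 3, 3, 1]))  # -> 3
```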

Since the academic example hits home for most of the readers of this blog, let me try to formulate a list of the beneficial and detrimental consequences of bean-counting:

Advantages:

  1. One learns how to write a technical paper early in one’s career.
  2. It can motivate some people to be more efficient with their time.
  3. It provides some sort of metric by which to measure scientific competence (though it can be argued that any currently existing index is wholly inadequate, and will always be inadequate in light of Goodhart’s law!).
  4. Please feel free to share any ideas in the comments section, because I honestly cannot think of any more!

Disadvantages:

  1. It makes researchers focus on short-term problems instead of long-term, moon-shot kinds of problems.
  2. The community loses good scientists because they are deemed as not being productive enough. A handful of the best students I came across in graduate school left physics because they didn’t want to “play the game”.
  3. It rewards those who may be more career-oriented and focus on short-term science, leading to an overpopulation of these kinds of people in the scientific community.
  4. It may lead scientists to cut corners and even go as far as to falsify data. I have addressed some of these concerns before in the context of psychology departments.
  5. It provides an incentive to flood the literature with papers that are of low quality. It is no secret that the number of publications has ballooned in the last couple decades. Though it is hard to quantify quality, I cannot imagine that scientists have just been able to publish more without sacrificing quality in some way.
  6. It takes the focus of scientists’ jobs away from science, and makes scientists concerned with an almost meaningless number.
  7. It leads authors to overstate the importance of their results in an effort to publish in higher profile journals.
  8. It does not value potential. Researchers who would have excelled in their later years, but not their earlier ones, are undervalued. Late bloomers therefore go under-appreciated.

Just by examining my own behavior in reference to the above lists, I can say that my actions have been altered by the existence of citation and publication metrics. Especially towards the end of graduate school, I started pursuing shorter-term problems so that they would result in publications. Obviously, I am not the only one that suffers from this syndrome. The best one can do in this scenario is to work on longer-term problems on the side, while producing a steady stream of papers on shorter-term projects.

In light of the two-slit experiment, it seems ironic that physicists are altering their behavior due to the fact that they are being measured.

Reproducing Experiments

I have a long commute to work every day, and on those drives, I often listen to podcasts. One by the NPR Planet Money team struck a nerve recently. It was called “The Experiment Experiment” and was about reproducing results in the field of psychology. As an aside, physicist David Kestenbaum is one of the Planet Money reporters who relayed this story.

In the episode, Brian Nosek, a psychologist at the University of Virginia, (amazingly!) persuaded 270 others working in his field to try to reproduce 100 experiments that had been published in three top-tier psychology journals. The result?

64 of the 100 experiments were not reproduced.

The most interesting part of this podcast is the analysis as to why this occurred. Their hypothesis is that two main reasons are to blame:

  1. The “file-drawer effect”
  2. Psychologists tricking themselves due to misaligned incentives

File-Drawer Effect

Let’s use, as they did in the episode, the instructive analogy of coin-flipping to understand these two ideas. It goes something like this. If 100 researchers do the same experiment of flipping 10 coins, most of them will obtain somewhere between 4 and 6 heads. These researchers are likely to put their results in their drawers and move on to their next experiment. However, there will be 1 or 2 researchers who get the astounding result of 9 out of 10 or 1 out of 10 heads. These researchers will think to themselves, “Whoa! What is the probability of me obtaining that result?!” They will calculate it, see that the probability is about 1%, and publish the result thinking that it is statistically significant.

Just as an illustration, here is the distribution one would naively expect for 100 researchers doing the coin-flipping experiment:

[Figure: expected distribution of heads for 100 researchers each flipping 10 coins]
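A short calculation (standard library only) reproduces this distribution and shows why the extreme result looks so tempting to publish:

```python
# Expected outcomes when 100 researchers each flip 10 fair coins.
from math import comb

n_flips, n_researchers = 10, 100

for heads in range(n_flips + 1):
    expected = n_researchers * comb(n_flips, heads) / 2**n_flips
    print(f"{heads:2d} heads: expect about {expected:4.1f} researchers")

# Probability that a single researcher sees 9 or 10 heads: about 1%.
# (By symmetry, the same holds for 0 or 1 heads.)
p_extreme = (comb(10, 9) + comb(10, 10)) / 2**10
print(f"P(9 or 10 heads) = {p_extreme:.3f}")
```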

Tricking Themselves

The other component that they claim contributes to these striking reproducibility numbers is that researchers have an incentive to obtain positive results. This is because positive results get researchers publications, which result in promotions for tenure-track faculty and jobs for graduate students and postdocs. Therefore, due to the incentive structure, researchers have a natural bias towards positive results. This is not to imply that these researchers are committing scientific misconduct; they are simply unaware of their biases.

Let us take the coin-flipping example again and start from the figure above to see how this might work. Approximately 12 of the 100 researchers will obtain 7 heads out of 10 coin flips. This would not be a remarkably significant result, but suppose all 12 of them think, “Let me just check to see if this result is true,” and flip another 4 coins. Now, 3 or 4 of those 12 researchers will obtain 3 or 4 heads in those extra flips, reinforcing their previous result! They will then think, “Well, this result must be true! I better publish this!”
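The numbers in this example follow directly from the binomial distribution; a quick check:

```python
# Checking the numbers above with the binomial distribution.
from math import comb

researchers_with_7 = 100 * comb(10, 7) / 2**10      # about 11.7 of 100
p_confirm = (comb(4, 3) + comb(4, 4)) / 2**4        # 5/16, about 0.31
print(researchers_with_7)                           # ~12 researchers see 7 heads
print(researchers_with_7 * p_confirm)               # ~3.7 of them "confirm" it
```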

One can see how these two effects could combine to lead to the staggering number of results that are not reproducible. Because the incentive structure in our field is similar, one fears that such things may be going on in physics departments as well. I would like to hope not, but if psychologists are susceptible to psychological pressure, who isn’t?