Goodhart’s Law and Citation Metrics

According to Wikipedia, Goodhart’s law colloquially states that:

“When a measure becomes a target, it ceases to be a good measure.”

It was originally formulated as an economics principle, but has been found to be applicable in a much wider variety of circumstances. Let’s take a look at a few examples to understand what this principle means.

Police departments are often graded using crime statistics. In the US in particular, a combined index of eight categories constitutes a “crime index”. In 2014, it was reported in Chicago magazine that the huge crime reduction seen in Chicago was merely due to reclassification of certain crimes. Here is the summary plot they showed:

[Figure: ChicagoCrime. Image reprinted from Chicago magazine]

In effect, some felonies were labeled misdemeanors, etc. The manipulation of the “crime index” corrupted the way the police did their jobs.

Another famous example of Goodhart’s law is Google’s search algorithm, known as PageRank. Crudely, PageRank works in the following way as described by Wikipedia:

“PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.”

Knowing how PageRank works has obviously led to its manipulation. People seeking greater visibility and a higher ranking in Google searches have used several schemes to raise their rating. One of the most popular schemes is to post links to one’s own website in the comments section of high-ranked websites in order to inflate one’s own ranking. You can read a little more about this and other schemes here (pdf!).
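To make the link-counting idea concrete, here is a minimal sketch of the textbook power-iteration form of PageRank, run on a made-up four-page web. The page names, the damping factor of 0.85, and the function `pagerank` are all assumptions for illustration; this is not Google's actual ranking system, only a toy model of why a single link from a well-linked page is worth chasing.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Textbook power-iteration PageRank on a dict {page: [pages it links to]}."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share regardless of links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page (no outgoing links): spread its rank over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical toy web: "hub" is well linked, "mysite" has no incoming links.
honest = {
    "hub": ["a", "b"],
    "a": ["hub"],
    "b": ["hub"],
    "mysite": ["hub"],
}
# The same web after a link to "mysite" is planted in the hub's comment section.
spammed = dict(honest, hub=["a", "b", "mysite"])

before = pagerank(honest)["mysite"]
after = pagerank(spammed)["mysite"]
print(f"mysite rank before: {before:.4f}, after: {after:.4f}")
```

In this toy model the planted link raises the rank of "mysite" well above the baseline it receives with no incoming links, which is exactly the incentive the comment-spam scheme exploits.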

With the increased use of citation metrics among the academic community, it should come as no surprise that they too can become corrupted. Increasingly, there are many authors per paper, as every author in a group can take equal credit for a paper when the h-index is used as a scale. Many scientists also spend time emailing their colleagues to urge them to cite one of their papers (I only know of this happening anecdotally).

Since the academic example hits home for most of the readers of this blog, let me try to formulate a list of the beneficial and detrimental consequences of bean-counting:

Advantages:

  1. One learns how to write a technical paper early in one’s career.
  2. It can motivate some people to be more efficient with their time.
  3. It provides some sort of metric by which to measure scientific competence (though it can be argued that any currently existing index is wholly inadequate, and will always be inadequate in light of Goodhart’s law!).
  4. Please feel free to share any ideas in the comments section, because I honestly cannot think of any more!

Disadvantages:

  1. It makes researchers focus on short-term problems instead of long-term moon-shot kinds of problems.
  2. The community loses good scientists because they are deemed as not being productive enough. A handful of the best students I came across in graduate school left physics because they didn’t want to “play the game”.
  3. It rewards those who may be more career-oriented and focus on short-term science, leading to an overpopulation of these kinds of people in the scientific community.
  4. It may lead scientists to cut corners and even go as far as to falsify data. I have addressed some of these concerns before in the context of psychology departments.
  5. It provides an incentive to flood the literature with papers that are of low quality. It is no secret that the number of publications has ballooned in the last couple of decades. Though it is hard to quantify quality, I cannot imagine that scientists have just been able to publish more without sacrificing quality in some way.
  6. It takes the focus of scientists’ jobs away from science, and makes scientists concerned with an almost meaningless number.
  7. It leads authors to overstate the importance of their results in an effort to publish in higher-profile journals.
  8. It does not value potential. Researchers who would have excelled in their later years, but not their earlier ones, are under-valued. Late-bloomers therefore go under-appreciated.

Just by examining my own behavior in reference to the above lists, I can say that my actions have been altered by the existence of citation and publication metrics. Especially towards the end of graduate school, I started pursuing shorter-term problems so that they would result in publications. Obviously, I am not the only one who suffers from this syndrome. The best one can do in this scenario is to work on longer-term problems on the side, while producing a steady stream of papers on shorter-term projects.

In light of the two-slit experiment, it seems ironic that physicists are altering their behavior due to the fact that they are being measured.

Reproducing Experiments

I have a long commute to work every day, and on those drives, I often listen to podcasts. One by the NPR Planet Money team struck a nerve recently. It was called “The Experiment Experiment” and was about reproducing results in the field of psychology. As an aside, physicist David Kestenbaum is one of the reporters on Planet Money who relayed this story.

In the episode, Brian Nosek, a psychologist at the University of Virginia, is able to (amazingly!) persuade 270 others working in his field to try to reproduce 100 experiments that were published in the three top-tier psychology journals. The result?

64 of the 100 experiments were not reproduced.

The most interesting part of this podcast is the analysis as to why this occurred. Their hypothesis is that two main reasons are to blame:

  1. The “file-drawer effect”
  2. Psychologists tricking themselves due to misaligned incentives

File-Drawer Effect

Let’s use, as they did in the episode, the instructive analogy of coin-flipping to understand these two ideas. It goes something like this. If there are 100 researchers, each doing the same experiment of flipping 10 coins, most of them will obtain somewhere between 4 and 6 heads. These researchers are likely to put their results in their drawers and move on to their next experiment. However, there will be 1 or 2 researchers who get the astounding result of 9 out of 10 or 1 out of 10 heads. These researchers will think to themselves, “Whoa! What is the probability of me obtaining that result?!” They will calculate it, see that the probability is about 1% and publish the result thinking that it is statistically significant.

Just as an illustration, here is the distribution one would naively expect of 100 researchers doing the coin-flipping experiment:

[Figure: 100ResearchersCoinFlipping]
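The numbers quoted in the coin-flipping story follow from the binomial distribution for a fair coin. Here is a small sketch that computes them; the helper name `binomial_pmf` and the variable names are my own, and the only inputs are the 100 researchers and 10 flips from the example.

```python
from math import comb


def binomial_pmf(k, n=10, p=0.5):
    """Probability of exactly k heads in n flips of a coin with P(heads) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


N_RESEARCHERS = 100  # number of researchers in the example

# Expected number of researchers who see each possible head count.
expected = {k: N_RESEARCHERS * binomial_pmf(k) for k in range(11)}
for k, count in expected.items():
    print(f"{k:2d} heads: {count:5.1f} researchers")

p_middle = sum(binomial_pmf(k) for k in (4, 5, 6))  # 4, 5 or 6 heads
p_nine = binomial_pmf(9)                            # exactly 9 heads
p_extreme = binomial_pmf(1) + binomial_pmf(9)       # 1 or 9 heads

print(f"P(4 to 6 heads)    = {p_middle:.3f}")
print(f"P(exactly 9 heads) = {p_nine:.4f}")
print(f"P(1 or 9 heads)    = {p_extreme:.4f}")
```

Roughly two thirds of the researchers land in the unremarkable 4 to 6 range, a single researcher sees exactly 9 heads with probability of about 1%, and across 100 researchers about two are expected to see either 1 or 9 heads.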

Tricking Themselves

The other component that they claim contributes to these striking reproducibility numbers is that researchers have an incentive to obtain positive results. This is because positive results get researchers publications, which result in promotions for tenure-track faculty and jobs for graduate students and postdocs. Therefore, due to the incentive structure, researchers have a natural bias towards positive results. This is not to imply that these researchers are committing scientific misconduct; they are just unaware of their biases.

Let us take the coin-flipping example again and start from the above graphic to see how this might work. Approximately 12 of 100 researchers flipping coins will obtain 7 heads out of 10 coin flips. This would not be a remarkably significant result, but then suppose all 12 of them think, “Let me just check to see if this result is true,” and they flip another 4 coins. Now, 3-4 of those 12 researchers will obtain 3 or 4 heads when flipping the coin 4 times, reinforcing their previous result! They will then think, “Well, this result must be true! I better publish this!”
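The arithmetic behind this second scenario can be checked the same way. This is a sketch under the same fair-coin assumption; `binomial_pmf` and the variable names are hypothetical helpers, not anything from the podcast.

```python
from math import comb


def binomial_pmf(k, n, p=0.5):
    """Probability of exactly k heads in n flips of a coin with P(heads) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n_researchers = 100

# Step 1: how many researchers see exactly 7 heads in 10 flips?
p_seven = binomial_pmf(7, 10)
expected_seven = n_researchers * p_seven

# Step 2: each of them flips 4 more coins and counts the result as
# "confirmed" if 3 or 4 of those come up heads.
p_confirm = binomial_pmf(3, 4) + binomial_pmf(4, 4)
expected_confirmed = expected_seven * p_confirm

print(f"P(7 of 10 heads)      = {p_seven:.4f}")
print(f"Expected with 7 heads = {expected_seven:.1f}")
print(f"P(3 or 4 of 4 heads)  = {p_confirm:.4f}")
print(f"Expected 'confirmed'  = {expected_confirmed:.1f}")
```

About 12 researchers are expected to see 7 heads, each has a 5 in 16 chance of a "confirming" follow-up, and so 3 to 4 of them walk away convinced, even though the coin is fair.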

One can see how these two effects could combine to lead to the staggering number of results that are not reproducible. Because the incentive structure in our field is similar, one fears that such things may be going on in physics departments as well. I would like to hope not, but if psychologists are susceptible to psychological pressure, who isn’t?

Lost in Translation

Like any good joke, there is a kernel (or a little more) of truth in the following one.

A dictionary of useful research phrases: what physics researchers say and what they mean by it

It has long been known: I didn’t look up the original reference.
A definite trend is evident: These data are practically meaningless.
Of great theoretical and practical importance: Interesting to me.
While it has not been possible to provide definite answers to these questions: An unsuccessful experiment, but I still hope to get it published.
Three of the samples were chosen for detailed study: The results of the others didn’t make any sense.
Typical results are shown: The best results are shown.
These results will be shown in a subsequent report: I might get around to this sometime if I’m pushed.
The most reliable results are those obtained by Jones: He was my graduate assistant.
It is believed that: I think.
It is generally believed that: A couple of other people think so too.
It is clear that much additional work will be required before a complete understanding of the phenomenon occurs: I don’t understand it.
Correct within an order of magnitude: Wrong.
It is hoped that this study will stimulate further investigation in this field: This is a lousy paper, but so are all the others on this miserable topic.
A careful analysis of obtainable data: Three pages of notes were obliterated when I knocked over a glass of beer.

Source: http://www.luigigobbi.com/jokesaboutphysicists.htm

The Value of a Null Result

In our field, it is unpopular to publish a result where one finds an absence of a particular phenomenon. However, these results can be extremely valuable, as one can see what other authors have tried.

In the study of charge density wave (CDW) systems, which has been undergoing a renaissance of late, there is one particular null result I find quite fascinating. This result (sorry, paywall) was published by F. DiSalvo and R. Fleming in Solid State Communications, demonstrating the inability of a charge density wave, even at high electric fields, to depin and slide in two prototypical quasi-2D transition metal dichalcogenides, 1T-Tantalum Disulphide and 2H-Tantalum Diselenide.

In fact, I am unaware of any report of a sliding CDW in quasi-2D transition metal dichalcogenides. This has pretty vast implications for these materials, as it is difficult to probe the electronic subsystem alone due to the inability to divorce it from the ionic subsystem.

Any comments pointing me in the direction of observations of sliding CDWs in transition metal dichalcogenides are encouraged.