Data Representation and Trust

Though popular media often portrays science as purely objective, it has many subjective sides as well. One of these is the trust we place in our peers to tell the truth.

For instance, most experimental papers can present only an illustrative portion of the data taken, because of the sheer volume usually acquired. What is presented is supposed to be a representative sample. However, as readers, we are never sure this is actually the case. We trust that our experimental colleagues have presented the data in a way that is honest, illustrative of all the data taken, and reproducible under similar conditions. It is increasingly common to publish the remaining data in the supplemental section, but the sheer amount of data taken can easily overwhelm this section as well.

When writing a paper, an experimentalist also has to make certain choices about how to represent the data. Increasingly, the amount of data at the experimentalist’s disposal means that they often choose to show the data using some sort of color scheme in a contour or color density plot. Just take a flip through Nature Physics, for example, to see how popular this style of data representation has become. Almost every cover of Nature Physics is supplied by this kind of data.

However, there are some dangers that come with color schemes if the colors are not chosen appropriately. There is a great post about the ills of using, e.g., the rainbow color scheme, and how misleading it can be in certain circumstances. Make sure to also take a look at the articles cited therein to get a flavor of what these schemes can do. In particular, there is a paper called “Rainbow Map (Still) Considered Harmful”, which has several noteworthy comparisons of different color schemes, including ones that are and are not perceptually linear. Take a look at the plots below and compare the different color schemes chosen to represent the same data set (taken from the “Rainbow Map (Still) Considered Harmful” paper):


The rainbow scheme appears to show more drastic gradients than the other color schemes. My point, though, is that by choosing certain color schemes, an experimentalist can artificially enhance an effect or obscure one they do not want the reader to notice.
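
To make the idea of perceptual linearity a little more concrete, here is a minimal Python sketch that estimates how perceived lightness changes along a color scheme. It rests on a few assumptions: the `rainbow` function is a hypothetical stand-in (a plain hue sweep from blue to red), not the exact map from the paper or any plotting package, and lightness is approximated with the standard CIE L* formula applied to sRGB colors. All function names are made up for illustration.

```python
import colorsys


def srgb_to_linear(c):
    """Undo the sRGB gamma encoding for one channel in [0, 1]."""
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def lightness(rgb):
    """Approximate CIE L* (0 to 100) of an sRGB color given as (r, g, b)."""
    r, g, b = (srgb_to_linear(c) for c in rgb)
    y = 0.2126 * r + 0.7152 * g + 0.0722 * b  # relative luminance
    if y > 216 / 24389:
        return 116 * y ** (1 / 3) - 16
    return (24389 / 27) * y


def rainbow(t):
    """Hypothetical rainbow map: hue sweeps from blue (t=0) to red (t=1)."""
    return colorsys.hsv_to_rgb((2 / 3) * (1 - t), 1.0, 1.0)


def gray(t):
    """Plain grayscale map, used here as a simple reference."""
    return (t, t, t)


def lightness_profile(cmap, n=11):
    """Sample the lightness of a color map at n evenly spaced data values."""
    return [lightness(cmap(i / (n - 1))) for i in range(n)]


def is_monotonic(values):
    """True if the values never decrease from one sample to the next."""
    return all(b >= a for a, b in zip(values, values[1:]))


if __name__ == "__main__":
    for name, cmap in [("rainbow", rainbow), ("gray", gray)]:
        profile = lightness_profile(cmap)
        print(name, [round(v) for v in profile],
              "monotonic:", is_monotonic(profile))
```

In this sketch the grayscale map gets steadily lighter as the data value increases, while the rainbow stand-in brightens toward cyan and yellow and then darkens again toward red. Those bumps in lightness are one plausible reason a rainbow plot can appear to show sharp gradients that are not in the data.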

In fact, the experimentalist makes many choices when publishing a paper: the size of an image, the bounds of the axes, the scale of the axes (e.g. linear vs. log), the outliers omitted, and so on, all of which can have profound effects on the message of the paper. This is why there is an underlying issue of trust that lurks within the community. We trust that experimentalists choose to exhibit data in an attempt to be as honest as they can be. Of course, there are always subconscious biases lurking when these choices are made. But my hope is that experimentalists are mindful and introspective when representing data, doubting themselves to a healthy extent before publishing results.
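
As a small illustration of how much the choice of bounds alone can matter, the Python sketch below maps two values onto a color or axis scale under two different sets of bounds. The numbers and the `normalize` helper are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def normalize(value, vmin, vmax):
    """Map value to [0, 1] for the chosen bounds, clipping anything outside."""
    frac = (value - vmin) / (vmax - vmin)
    return min(1.0, max(0.0, frac))


# Hypothetical measurements that differ by 5%.
a, b = 10.0, 10.5

# Fraction of the full scale separating the two values under each choice.
wide = normalize(b, 0.0, 100.0) - normalize(a, 0.0, 100.0)
tight = normalize(b, 10.0, 10.5) - normalize(a, 10.0, 10.5)

print(f"wide bounds:  difference spans {wide:.1%} of the scale")
print(f"tight bounds: difference spans {tight:.1%} of the scale")
```

With the wide bounds the two values land on nearly the same color, while with the tight bounds they sit at opposite ends of the scale. Neither choice is dishonest in itself, but the reader needs to see the bounds to judge the size of the effect.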

To be a part of the scientific community means that, among other things, you are accepted for your honesty and that your work is (hopefully) trustworthy. A breach of this implicit contract is seen as a grave offence, which is why cases of misconduct are taken so seriously.


2 responses to “Data Representation and Trust”

  1. Why Most Published Research Findings Are False
    John P. A. Ioannidis


  2. There’s also the fact that rainbow-colored plots are unreadable to the colorblind, and don’t parse remotely well when printed in B&W.

    Thank goodness Matlab finally moved away from Jet as default, and matplotlib went to Viridis.


