Seven Deadly Data Sins

02/3/2014 Andy Lattal, Ph.D. Behavior Analysis

We often are reminded that we live in the “information age.” Ours is the period of human history when our decisions are guided by data. Data typically are collected by observation; scientific data through a particular set of observational rules called the “scientific method.” Like anything we humans have invented, data (note: there is always confusion about whether the word is singular or plural; I use the plural form when referring to the collective and the singular, datum, when referring to a specific instance) are used in different ways. They are used mostly to good ends, but they also can be twisted and distorted, whether intentionally or not, to ends that are not so hot. There are in particular at least seven ways in which data are used that are inappropriate and may lead us to conclusions that are confusing and even false.

What follows are seven such sins. There undoubtedly are more, but eight or nine would make for a less catchy title. The ones I list give us plenty to work with and think about. I also note that my ideas about data sins come from my work as a scientist; however, the observations are general ones about data. So, I would say that these observations apply as much to someone like me studying pigeons in a Skinner box as to a manager collecting data on employees’ safe behavior practices or a teacher collecting data on how well her pupils are learning or otherwise behaving in the classroom.

Except for the first, which is to me the worst data sin possible, the others are in no particular order of “badness.”

1. Confabulating (making up) data. This is the big enchilada, the cardinal sin. Doesn’t matter why it is done. It is unforgivable. If data are too good to be true, they probably are fabricated. Several years ago, I went to a talk where a guy was showing data demonstrating astonishingly beautiful effects of drugs in combination with a behavioral intervention in reducing problem behavior of institutionalized children. We (a group of university professors and graduate students in psychology) all came out of the talk simply stunned - the data were so clean and just plain beautiful (scientists find beauty in unusual things, sometimes). A few months later the ugly truth came out when one of the presenter’s collaborators produced evidence that at one of the institutions where the data were collected, there simply were not the number of patients with the reported diagnosis that the presenter claimed to have used in the study. Once the snowball of revelation got rolling, there quickly developed an avalanche of evidence that the presenter had made up a good bit of what he was reporting in a host of high-quality scientific journals. Good scientists are those who value research integrity above everything else they do as scientists, but there are sadly too many similar cases to the one I just described. Does Rosie Ruiz ring a bell?

2. Tweaking data or “cookin’ the books,” as they say. This is for all intents and purposes the same thing as fabricating data; it is a question of both what is done and how it is done. With tweaking, the data are not simply made up; they are manipulated or changed in ways that range from questionable to illegitimate. So, for example, much of the data may be legitimate, but the data interpreter has fudged some of the numbers to make things look a particular way. This can take many forms. Here are two examples. (1) Anyone familiar with advanced statistics knows that by increasing a sample size, one can push a nonsignificant analysis over the top and make the result “statistically significant” when in fact nothing has changed except that we have added a few more subjects. (2) Pushing one little data point around on a graph to a place where it shouldn’t be is as big a sin as if one had made up the whole graph without ever having stepped into the lab. Classic book cookin’ in business helped to bring down great companies and ruin individuals and their families: Arthur Andersen, Enron, the Madoffs. Data collectors don’t cheat, and if they do they should be drummed out of the corps, epaulets torn, buttons shorn, and worse.
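
The first example under sin 2 comes down to simple arithmetic. Here is a minimal sketch, assuming a two-group comparison with a known standard deviation (a z-test) and a made-up standardized difference of 0.1 that stays exactly the same while only the group size grows. The function name and all of the numbers are illustrative assumptions, not taken from any real study.

```python
import math

def two_sided_p(effect_size, n_per_group):
    """Two-sided p-value of a two-sample z-test.

    effect_size is the difference between group means in standard
    deviation units; both groups are assumed to have n_per_group subjects.
    """
    z = effect_size * math.sqrt(n_per_group / 2)
    return math.erfc(abs(z) / math.sqrt(2))

EFFECT = 0.1  # hypothetical, deliberately tiny difference between groups

# Same effect every time; only the number of subjects changes.
p_values = {n: two_sided_p(EFFECT, n) for n in (50, 200, 800)}

for n, p in p_values.items():
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"n per group = {n:4d}  p = {p:.3f}  ({verdict} at .05)")
```

The difference between the groups is no bigger at 800 per group than at 50. Only the p-value has moved.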

3. Over-reliance on statistical significance. If the likelihood that the outcome of a test of the effect of some intervention on a change in behavior is due to chance factors is less than 5 or less than 1 time in a hundred (depending on the degree of confidence sought), then the effect is said to be statistically significant. Many “facts” in psychology are based on such evidence, and it actually works pretty well. But, as my colleague Ben Williams has pointed out, in psychology and medicine, sometimes a treatment that has not reached the arbitrary level of statistical significance can have profound positive effects for some individuals within the group. Drugs that can help thousands of people sometimes are dismissed as ineffective because drug trials based on averaging together individuals who are affected by the drug to differing degrees fail to show an effect significantly different from chance. Statistical significance is just a guide, an arbitrary cutoff point that can be useful for examining the effect of something on the “average (in a statistical sense) person,” but it often can be important to look beyond these statistical tests that lump widely differing individuals together into a group to describe an average effect of some treatment or intervention. Failing to see that individual performers get lost when we rely too heavily on the statistical norm can lead to painful and individually incorrect decisions about the performance of a management or assembly team taken as a whole. Our unchallenged reliance on statistics to address individual potential serves to limit and box in all of us.
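
To make the averaging problem in sin 3 concrete, here is a small sketch with invented numbers. It assumes 20 hypothetical people, four of whom improve a great deal while the rest hover around zero, and it uses an arbitrary cutoff of 10 points to count someone as "helped." None of these values come from a real trial.

```python
import math
import statistics

# Hypothetical change scores (positive = improvement) for 20 people.
responders = [12, 11, 13, 10]
non_responders = [-3, 2, -2, 3, -4, 1, -1, 4, -3, 2, -2, 3, -4, 1, -1, 0]
changes = responders + non_responders

# Group-level view: one-sample t statistic against "no change on average".
mean_change = statistics.mean(changes)
std_error = statistics.stdev(changes) / math.sqrt(len(changes))
t_stat = mean_change / std_error

T_CRITICAL = 2.093  # two-tailed .05 cutoff for 19 degrees of freedom

# Individual-level view: who improved by at least 10 points (assumed cutoff)?
helped = [c for c in changes if c >= 10]

print(f"mean change = {mean_change:.2f}, t = {t_stat:.2f}")
print("group result significant at .05:", abs(t_stat) > T_CRITICAL)
print(f"individuals clearly helped: {len(helped)} of {len(changes)}")
```

Judged by the group test alone, this made-up treatment "does nothing," yet one person in five improved by 10 points or more.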

4. Over-interpreting correlations. Almost everyone who has had an elementary statistics course, and many who haven’t, know that a correlation between two events doesn’t imply that one causes the other. Thus, a person’s height and weight tend to vary together, but no one would say that height causes weight. One of my pet peeves in the popular media’s presentation of scientific findings, particularly those related to health, is that too often they take correlational data (for example, a positive relation between some lifestyle event, like lack of exercise, and some kind of cancer) and conclude or imply that changing one’s lifestyle would lessen the cancer risk. The latter may be true, but there may be a third factor (e.g., a stressful, demanding job) that precludes regular exercise and changes the body such that it is more susceptible to cancer. In the case of business practice, a misattribution of causality in interpreting a correlation between bad sales figures and sales practices might lead to the firing of the sales team and their replacement by a second sales team. When sales figures still do not improve, further analysis might reveal the cause to be problems with the product itself that make it undesirable to the public.
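
The third-factor problem in sin 4 can be sketched with a toy simulation. Everything here is an assumption made for illustration: the variable names, the sample size, and above all the rule that exercise has no effect at all on risk in this made-up world. It is a demonstration of confounding, not a claim about exercise and health.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Job stress is the hidden third factor: it lowers exercise and raises risk.
# Exercise never appears in the formula for risk.
stress = [rng.gauss(0, 1) for _ in range(5000)]
exercise = [-s + rng.gauss(0, 1) for s in stress]
risk = [s + rng.gauss(0, 1) for s in stress]

r_exercise_risk = pearson(exercise, risk)

# Partial correlation: exercise vs. risk with stress held constant.
r_es = pearson(exercise, stress)
r_rs = pearson(risk, stress)
partial = (r_exercise_risk - r_es * r_rs) / math.sqrt(
    (1 - r_es ** 2) * (1 - r_rs ** 2)
)

print(f"exercise vs. risk, raw:            r = {r_exercise_risk:.2f}")
print(f"exercise vs. risk, stress removed: r = {partial:.2f}")
```

The raw correlation is clearly negative even though exercise does nothing in this simulation; once stress is accounted for, the relation all but disappears.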

5. Taking data as gospel truth. Scientists are trained to be skeptical. We look for the hidden flaw in an experiment and as a result are sometimes seen by the public as being hyper-critical or even cynical. I, for one, am guilty as charged. To not look at every claim based on data with microscopic precision to ensure that the claims made are supported by the observations would go against all of any scientist’s training, as well as being a dereliction of duty to a public that relies on our interpretations. Anyone dealing with so-called facts has an obligation to the people affected by their decisions to view data with a keen eye to make sure that decisions are based on the best available facts. It is easy, too, to focus on only one aspect of complicated data that change over time. In the U.S. automobile industry in the last quarter of the 20th century, leaders seemed to focus more on data from past successes than on the global changes occurring around them.

6. Overgeneralizing conclusions. Just because a particular treatment or an intervention shows an effect doesn’t mean that it is appropriate for use in every instance. Scientists are always qualifying their results (and it seems that many in the popular media overgeneralize their conclusions about the scientists’ data to make such and such a point). Because an effect is shown, it doesn’t mean that whatever has shown the effect should be used as a treatment or means of problem reduction or elimination in every case. For example, removing social reinforcement as a means of controlling attention-maintained problem behavior may work well for a well-socialized child, but could be quite detrimental for a child who already is socially withdrawn or socially ostracized.

7. Distorting data. I am a strong advocate of sometimes “going with one’s gut” in doing stuff. For me, it isn’t intuition that guides this practice, but a particular behavioral history, about which I will write more in a future commentary. The data sin comes in trying to use data that clearly point to one conclusion to support another, contrary one. If a person wants to do something that the data don’t support, they should own up to their trust in their intuition over the seeming facts. Trying to hide a judgment behind data that don’t support it is unjustified. Instead of claiming false authority by which to act, acknowledge that your judgment is based simply on your personal experience. That could be and often is enough. Be alert, however, that what you know through experience is shaped by a narrow base (our individual lives). This base requires us all to be healthy skeptics (see Number 5 above) about the generalized wisdom you and I think we have. To appreciate the danger of going with your gut over the data, you need only remember the immortal words of the not so immortal General John Sedgwick, standing exposed to Confederate sniper fire at the Battle of Spotsylvania Court House on May 9, 1864: “They couldn’t hit an elephant at this distance.” Famous last words, literally.
