Science is awesome, but it ain’t perfect. If you’ve been paying attention to the so-called “crises of reproducibility” in the behavioral, biomedical, and social sciences, you know that false positives and overblown effect sizes appear to be rampant in the published literature.

This is a problem for building solid theories of how the world works. In The Descent of Man, Charles Darwin observed that false facts are much more insidious than false theories. New theories can dominate previous theories if their explanations better fit the facts, and scientists, being human, love proving each other wrong. But if our facts are wrong, theory building is stymied and misdirected, our efforts wasted. If scientific results are wrong, we should all be concerned.

How does science produce false facts? Here’s a non-exhaustive list:

  • Studies are underpowered, leading to false positives and ambiguous results [1].
  • Negative results aren’t published, lowering information content in published results [2,3].
  • Misunderstanding of statistical techniques is pervasive (e.g., misinterpretation of p-values [4], incautious multiple hypothesis testing [5,6]; see the short illustration after this list), leading to false positives and ambiguous results.
  • Surprising, easily understood results are easiest to publish, putting less emphasis on reliable, time-consuming research that is perceived as “boring.”
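To make the multiple-testing item above concrete, here is a minimal sketch in Python. It assumes independent tests of hypotheses that are all actually false (true nulls) at the conventional threshold; the numbers of tests are purely illustrative.

```python
# Minimal sketch: how quickly uncorrected multiple testing produces at least
# one spurious "significant" result. Every hypothesis tested here is assumed
# to be false (a true null); alpha is the conventional 0.05 threshold.
alpha = 0.05
for k in (1, 5, 20, 100):
    # Probability of at least one false positive among k independent null tests
    p_any_false_positive = 1 - (1 - alpha) ** k
    print(f"{k:3d} tests -> P(at least one false positive) = {p_any_false_positive:.2f}")
# Output: roughly 0.05, 0.23, 0.64, and 0.99, respectively.
```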

These problems are well understood, and, in general, have been understood for decades. For example, warnings about misuse of p-values and low statistical power date to at least the 1960s [7,8]. We know these practices hinder scientific knowledge and lead to ambiguous, overestimated, and flat-out false results. Why, then, do they persist? At least three explanations present themselves.

  1. Incompetence: Scientists just don’t understand how to use good methods. Some of this may be going on, but it can’t be the full story. Scientists are, in general, pretty smart people. Moreover, a field tends to be guided by certain normative standards, at least in theory.
  2. Malicious fraud: Scientists are deliberately obtaining positive results, with a disregard for the truth, for personal gain. There is undoubtedly some of this going on as well (see, for example, the Schön fraud in physics, the Stapel fraud in social psychology, and a fascinating case of peer-review fraud in clinical pharmacology). However, I choose to believe that most scientists are motivated to really learn about the world.
  3. Cultural evolution: Incentives for publication and novelty select for normative practices that work against truth and precision. This is the argument I am presenting here, which Richard McElreath and I fleshed out in a recently submitted paper [1].

The Natural Selection of Bad Science

The argument is an evolutionary one [9], and works essentially like this: Research methods can spread either directly, through the production of graduate students who go on to start their own labs, or indirectly, through adoption by researchers in other labs looking to copy those who are prestigious and/or successful. Methods that are associated with greater success in academic careers will, all else equal, tend to spread.

Selection needs some way to operationalize success – or “fitness,” the ability to produce “progeny” with similar traits. This is where the devilishness of many incentives currently operating in scientific institutions (such as universities and funding agencies) comes into play. Publications, particularly high-impact publications, are the currency used in decisions about hiring, promotion, and funding, along with related metrics such as the h-index [1]. This sort of quantitative evaluation is troublesome, particularly when large, positive effects are overwhelmingly favored for acceptance in many journals. Any methods that boost false positives and overestimate effect sizes will therefore become associated with success, and spread. McElreath and I have dubbed this process the natural selection of bad science.

The argument can extend not only to norms of questionable research practices, but also to norms of misunderstanding (such as with p-values), if such misunderstandings lead to success. Misunderstandings that do not lead to success will rarely be selected for.

An important point is that the natural selection of bad science requires no conscious strategizing, cheating, or loafing on the part of individual researchers. There will always be researchers committed to rigorous methods and scientific integrity. However, as long as institutional incentives reward positive, novel results at the expense of rigor, the rate of bad science, on average, will increase.

A Case Study

Statistical power is the probability that a study will detect an effect that is really there. In the early 1960s, Jacob Cohen noticed that psychological studies were dreadfully underpowered, and warned that power needed to increase dramatically in order for the field to produce clear, reproducible results [8]. In the late 1980s, two meta-analyses indicated that, despite Cohen’s warnings, power had not increased [10,11]. We recently updated this meta-analysis [1], and showed that in over half a century, there has been no discernible increase in statistical power in the social and behavioral sciences. It remains quite low: the average power to detect a small effect is only 0.24.
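For a sense of what a number like 0.24 means in practice, here is a quick power calculation (not from our paper) for a two-sided, two-sample t-test of a “small” effect, Cohen’s d = 0.2. The per-group sample sizes are illustrative assumptions.

```python
# Minimal sketch: power of a two-sided, two-sample t-test for a small effect
# (Cohen's d = 0.2), computed from the noncentral t distribution.
# The sample sizes below are illustrative, not taken from the meta-analysis.
import numpy as np
from scipy.stats import t, nct

def two_sample_power(d, n_per_group, alpha=0.05):
    """Power of a two-sided, two-sample t-test with equal group sizes."""
    df = 2 * n_per_group - 2
    ncp = d * np.sqrt(n_per_group / 2)      # noncentrality parameter under the alternative
    t_crit = t.ppf(1 - alpha / 2, df)       # two-sided critical value
    # Probability the test statistic lands beyond either critical value
    # when the true effect is d standard deviations.
    return nct.sf(t_crit, df, ncp) + nct.cdf(-t_crit, df, ncp)

for n in (20, 50, 100, 200):
    print(f"n = {n:3d} per group: power ≈ {two_sample_power(0.2, n):.2f}")
# Roughly 0.09, 0.17, 0.29, and 0.51: even 200 subjects per group gives
# only about a coin flip's chance of detecting a small but real effect.
```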

This result is consistent with our argument: that incentives for novel, positive results work against individual desires to improve research methods. This is not to say that all studies are underpowered, but it does suggest that the most influential methods, in the sense of those most readily adopted by new scientists, may be those that are.

A Computational Model

Although the case study is suggestive, it was important to us to demonstrate the logic of the argument more forcefully. So we built a computational model in which a population of labs studied hypotheses, only some of which were true, and attempted to publish their results. We assumed the following:

  • Each lab has a characteristic methodological power – its ability to correctly identify true hypotheses. Note: this is distinct from statistical power, in that it is a Gestalt property of the entire research process, not only of a particular analysis. If we make the overly simplistic but convenient assumption that all hypotheses are either true or false and all results are either positive or negative, then power is defined as the probability of obtaining a positive result given that one’s hypothesis is true.
  • Increasing power also increases false positives, unless effort is exerted. This represents the idea that one can increase the likelihood of finding a positive result in a cost-free way by using “shortcuts” that allow weaker evidence to count as positive, but increasing the likelihood of finding a true result by doing more rigorous research—such as by collecting more data, preregistering analyses, and rooting hypotheses in formal theory—is costly.
  • Increasing effort lengthens the time between results.
  • Novel positive results are easier to publish than negative results.
  • Labs that publish more are more likely to have their methods “reproduced” in new labs.

We then allowed the population to evolve. Over time, effort decreased to its minimum value, and the rate of false discoveries skyrocketed.
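To make the dynamic tangible, here is a stripped-down toy version of this kind of simulation in Python. It is emphatically not the model from our paper: every parameter and functional form below (how effort trades off against false positives and output rate, the publication probabilities, the selection rule) is an illustrative assumption chosen only to expose the selection pressure.

```python
# Toy sketch of labs under publication selection. All numbers and functional
# forms are illustrative assumptions, not the published model.
import random

N_LABS, GENERATIONS, BASE_RATE = 100, 500, 0.1   # labs, selection steps, share of true hypotheses
PUB_PROB_NEG = 0.1                               # negative results are rarely published
POWER = 0.8                                      # chance of a positive result for a true hypothesis

def false_pos_rate(effort):
    # Assumed form: low effort lets weak evidence count as a positive result.
    return 0.05 + 0.45 * (1.0 - effort)

def studies_per_generation(effort):
    # Assumed form: higher effort means fewer results per unit time.
    return max(1, round(10 * (1.0 - 0.7 * effort)))

labs = [{"effort": random.random(), "pubs": 0, "false_pubs": 0} for _ in range(N_LABS)]

for _ in range(GENERATIONS):
    for lab in labs:
        lab["pubs"] = lab["false_pubs"] = 0
        for _ in range(studies_per_generation(lab["effort"])):
            true_hyp = random.random() < BASE_RATE
            positive = random.random() < (POWER if true_hyp else false_pos_rate(lab["effort"]))
            if positive or random.random() < PUB_PROB_NEG:
                lab["pubs"] += 1
                lab["false_pubs"] += int(positive and not true_hyp)
    # Selection: the most-published lab seeds a new lab (with mutated effort),
    # replacing the least-published lab.
    labs.sort(key=lambda l: l["pubs"])
    child_effort = min(1.0, max(0.0, labs[-1]["effort"] + random.gauss(0, 0.05)))
    labs[0] = {"effort": child_effort, "pubs": 0, "false_pubs": 0}

mean_effort = sum(l["effort"] for l in labs) / N_LABS
fdr = sum(l["false_pubs"] for l in labs) / max(1, sum(l["pubs"] for l in labs))
print(f"mean effort after selection: {mean_effort:.2f}")
print(f"false positives as a share of all publications: {fdr:.2f}")
```

Even in this crude sketch, the labs that publish the most tend to be the corner-cutters, so mean effort typically drifts toward its minimum and the share of false positives in the simulated “literature” climbs.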

Replication Isn’t Enough

But wait, what about replication? In general, replication alone is not sufficient to prevent rampant false discovery. For one thing, many hypotheses are wrong, and so many replications may be necessary to ascertain their veracity [3] (here’s an interactive game we made to illustrate this point). But let’s put that aside for now. Replication surely helps to identify faulty results. Might incentives to replicate, and punishment for producing non-reproducible results, curb the natural selection of bad science?

Our model indicates that they won’t.

We gave labs the opportunity to replicate previously published studies, and let all such efforts be publishable (and be worth half as much “fitness” as the publication of a novel result). For the lab that published the original study, a successful replication boosted that result’s value, while a failed replication was severely punished. In other words, we created a situation that was very favorable to replication. We found that even when the rate of replication was extremely high – as high as 50% of all studies conducted – the decline of effort and the rise of false discoveries were slowed, but not stopped. The reason is that even though labs with low effort were more likely to suffer a failed replication, and hence less likely to “reproduce” their methods, not all studies by low-effort labs were false, and among those that were, not all were caught by a failed replication. Even when the average fitness of high-effort labs was higher than that of low-effort labs, the fittest labs were always those exerting low effort.
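A rough back-of-the-envelope calculation shows why. The numbers below are illustrative assumptions in the spirit of the model, not the parameters from our paper.

```python
# Back-of-the-envelope sketch: why even heavy replication lets a
# corner-cutting lab thrive. All numbers are illustrative assumptions.
base_rate = 0.10   # assumed fraction of tested hypotheses that are true
power     = 0.80   # assumed chance of a positive result when the hypothesis is true
alpha_low = 0.30   # assumed false-positive rate of a low-effort lab
rep_rate  = 0.50   # fraction of published results that ever get a replication attempt
rep_power = 0.90   # assumed chance a replication correctly fails a false finding

# Among the low-effort lab's published positive results, how many are false?
p_false_pos = (1 - base_rate) * alpha_low
p_true_pos  = base_rate * power
share_false = p_false_pos / (p_false_pos + p_true_pos)

# How many of those false positives are ever exposed by a failed replication?
p_exposed = rep_rate * rep_power

print(f"share of the lab's positives that are false:   {share_false:.2f}")  # ~0.77
print(f"chance a given false positive is ever exposed: {p_exposed:.2f}")    # ~0.45
# Most of the lab's output is publishable, and more than half of its errors
# are never caught, so its publication count (its evolutionary "fitness") stays high.
```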

Moving Forward

Science is hard. It’s messy and time-consuming and doesn’t always (or even often) yield major revelations about the secrets of the universe. That’s OK. We do it because it’s absolutely amazing to discover new truths about the world, and also because the knowledge we gain is occasionally quite useful. Being a professional scientist is a nice job if you can get it, and the competition is stiff. Unfortunately, that means that not everyone who wants to be a scientist can get a job doing so, and not every scientist can get funding to carry out the project of their dreams. Some will succeed, and others will fail.

Mechanisms to assess research quality are essential. Problems arise when those mechanisms rest on simple quantitative metrics, because such metrics are usually subject to exploitation. This is true whether we’re talking about the number of publications, journal impact factors, or other “altmetrics.” As Goodhart’s law warns: when a measure becomes a target, it ceases to be a good measure.

This idea is often understood in the sense that savvy operators will respond to incentives directly, by changing their behaviors to increase their performance on the relevant measures. This surely happens. But a cultural evolutionary perspective reveals that quantitative incentives are problematic even if individuals are motivated to disregard those incentives. If the system rewards those who maximize these metrics, whether they do so intentionally or not, the practices of those individuals will spread.

This means that it’s not enough to simply look at bad practices and say “Well, I don’t do that, so I’m fine.” We need to look at the institutional incentives – the factors that influence hiring, promotion, and funding decisions – and make sure they are rewarding the kinds of practices we want to spread.

Exactly what those practices are is open to debate. But they will involve rewarding quality research over flashy results. I think recent trends toward open science and reproducibility are good signs that there is widespread motivation to solve this problem, and progress is being made. I also suspect it will take time to fully effect the kind of changes we need.  Such changes need to come from early career scientists, who are in a position to set new standards for the generation and testing of hypotheses.

In his 1974 commencement address at Caltech, Richard Feynman characterized the problem quite clearly, and it is as persistent today as it was then:

It is very dangerous… to teach students only how to get certain results, rather than how to do an experiment with scientific integrity. … I have just one wish for you—the good luck to be somewhere where you are free to maintain the kind of integrity I have described, and where you do not feel forced by a need to maintain your position in the organization, or financial support, or so on, to lose your integrity. May you have that freedom. 

May we all someday have that freedom. 


Paul E. Smaldino is Assistant Professor of Cognitive and Information Sciences at the University of California, Merced. Website: http://www.smaldino.com/wp

References:

[1] Smaldino PE, McElreath R (2016) The natural selection of bad science. arXiv:1605.09511

[2] Franco A, Malhotra N, Simonovits G (2014) Publication bias in the social sciences: Unlocking the file drawer. Science 345: 1502–1505.

[3] McElreath R, Smaldino PE (2015) Replication, communication, and the population dynamics of scientific discovery. PLOS ONE 10(8): e0136088.

[4] Wasserstein RL, Lazar NA (2016) The ASA’s statement on p-values: Context, process, and purpose. The American Statistician 70(2): 129–133.

[5] Simmons JP, Nelson LD, Simonsohn U (2011) False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science 22: 1359–1366.

[6] Gelman A, Loken E (2014) The statistical crisis in science. American Scientist 102(6): 460–465.

[7] Meehl PE (1967)  Theory-testing in psychology and physics: A methodological paradox. Philosophy of Science 34: 103–115.

[8] Cohen J (1962) The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology 65(3): 145–153.

[9] That is, it is based on ideas from well-supported theories of cultural evolution. For an introduction, see books by Robert Boyd & Peter Richerson, Alex Mesoudi, and Joe Henrich.

[10] Sedlmeier P, Gigerenzer G (1989) Do studies of statistical power have an effect on the power of studies? Psychological Bulletin 105(2): 309–316.

[11] Rossi JS (1990) Statistical power of psychological research: What have we gained in 20 years? Journal of Consulting and Clinical Psychology 58(5): 646–656.