The End of Magic and the Beginning of Wisdom

Or why a Nobel Laureate hates randomized control trials and likes financial diaries.

As a university economics student, I was obsessed with two topics: development and econometrics. The former, because the field enabled me to explore questions about the patterns of wealth and poverty that I saw growing up in Chile and Mexico and struggled to find answers to. The latter, because the fancy statistical models taught in econometrics courses seemed powerful: combined with good data (a big if!), it felt as if those models could reveal the secrets of development and the knowledge to make the world better.

In the years after I graduated from university, development economics underwent a methodological “revolution” that started with a loss of confidence in the ability of econometric models trained on large-scale observational data to tease out truth. The leaders of this revolution were inspired by the “long and wretched” history of medicine (Philip Tetlock succinctly describes this history in a chapter called “Illusions of Knowledge” in his book Superforecasting), in which useless and often harmful interventions (like bloodletting) persisted for centuries while physicians debated their theories and practices “like blind men arguing over the colors of the rainbow” – as the historian Ira Rutkow observed. Medical decision-making was based mostly on intuition, informed at best by flawed observations and untroubled by doubt. Physicians were quick to make up their minds about interventions and too slow to change them. Not until the mid-20th century did the ideas of randomized experiments, careful measurement and statistical power take hold in medicine.

While it may be harsh to compare the state of development economics in the early 21st century with the state of medicine in the 1st, it is on the grounds of providing a path out of ignorance that RCTs have been widely adopted over the past 10 years as a favored way to establish evidence about “what works” in development. But can RCTs do for economics what they did for medicine? Are they any better than other tools at helping us learn how societies achieve prosperity?

Angus Deaton – the world’s most recent Nobel Laureate in the economic sciences – certainly does not think so. If poverty reduction is to development what curing disease is to medicine, Deaton is unequivocal: “no long term solutions are coming from RCTs. We are certainly not going to abolish world poverty this way”. Timothy Ogden’s recent interview with the Nobel Laureate (part of Ogden’s upcoming book, Experimental Conversations) reveals both Deaton’s qualified doubt about the ability of RCTs to deliver useful insight and his sharp dislike of the way they are treated preferentially over other forms of evidence. Where others spend a lot of effort selling RCTs on the basis of their advantages, Deaton shows us the cracks in this seemingly perfect research tool. His critiques have been covered extensively elsewhere, but for the benefit of those not familiar with the arguments, let me recap two of his key points:

  1. The ability to apply lessons from an RCT conducted in one context to another is extremely limited. There are two reasons for this. The first is a consequence of the way RCTs are done in practice. Researchers start with a “policy” population of interest (e.g. a country, the poor, school-age children, teachers, villages) but select an experimental population that is not necessarily representative of it (perhaps for reasons of convenience or politics). There is no reason to expect that results derived from the experimental population will apply to the policy population. The second reason is more subtle, but perhaps more profound. Causal relationships measured by RCTs in the social sciences can be contextual rather than universal. In a fascinating debate with Abhijit Banerjee, Deaton illustrates this with a simple metaphor: the fact that a TV set caused one house fire (perhaps due to faulty wiring) does not mean that TV sets cause all house fires. Such causal factors are contingent on context; that is, their role as causal agents requires the presence of other enabling factors. Deaton laments that too little thought goes into figuring out whether and how results from one experiment can apply elsewhere; without a better understanding of mechanisms, we cannot develop a roadmap for “transporting” findings from one setting to another. As things stand, there is no direct line from RCT findings to policy.

  2. Results from RCTs lack reliability. Many RCTs are done with relatively small samples, which is problematic because smaller samples increase the noise in estimates of average causal effects. For example, say you are measuring the impact of microcredit and you’ve randomly picked some entrepreneurs from a city to receive a loan and others to act as a control group. Also assume that in that city there is a small group of entrepreneurs who will benefit immensely from the loan, another small group who will be harmed by the debt it creates, and a large majority for whom the loan will have no impact at all. The chance that the estimate of the effect of microcredit you get from this experiment will be spurious – driven by the influence of a few members of the two small groups – increases as the sample size falls (the simulation sketch after this list illustrates the point). The trouble is that many RCTs are done with small sample sizes, and the fact that there is randomization does not make the results more reliable.
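To make the second point concrete, here is a minimal simulation sketch in Python. All the numbers are my own illustrative assumptions – the 5/5/90 population mix, the effect sizes and the sample sizes are invented for this example, not drawn from any real microcredit study. The true average effect in this hypothetical population is exactly zero, yet small trials routinely produce large estimates in either direction:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_ate_estimate(n):
    """One hypothetical microcredit RCT with n entrepreneurs:
    half receive the loan, half serve as controls."""
    # Assumed population mix: 5% benefit greatly (+10 units of profit),
    # 5% are harmed (-10), 90% see no effect. True average effect = 0.
    effects = rng.choice([10.0, -10.0, 0.0], size=n, p=[0.05, 0.05, 0.90])
    treated = rng.permutation(n) < n // 2      # random assignment to the loan
    baseline = rng.normal(0.0, 1.0, size=n)    # baseline profit variation
    outcomes = baseline + effects * treated
    # Difference in mean outcomes: the usual RCT estimate of the average effect
    return outcomes[treated].mean() - outcomes[~treated].mean()

# Compare the spread of estimates across repeated small vs. large trials.
for n in (50, 5000):
    estimates = [simulate_ate_estimate(n) for _ in range(2000)]
    print(f"n = {n:4d}: std. dev. of effect estimates = {np.std(estimates):.2f}")
```

Running this shows the spread of estimates across small trials to be roughly ten times that across large ones: a single high-impact entrepreneur landing in the treatment arm can swing a small trial’s result, randomization notwithstanding.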

The troubling aspect of both of these limitations is the possibility that even if multiple RCTs generate evidence on the same question, the results across those studies may have no coherent interpretation, and we may get stuck calling random noise a puzzle that needs to be solved. Deaton likens this scenario to trying to interpret an avant-garde movie that has no plot.

Deaton’s observations make me wonder whether the development community’s RCT fever is linked to deeper mental models of how we think and decide. Psychologists divide our minds into System 2 – the deliberate, effortful thinking we deploy when solving math problems, for instance – and System 1 – the automatic perceptual and cognitive operations we deploy when, for example, we run instinctively from the proverbial “lion in the grass”. Perhaps the general lack of criticism that RCTs receive (Deaton complains about this “halo” effect in the interview) has to do with the internalization of a rule of thumb that “RCT = gold standard evidence” and “anything else = flawed evidence”. Deaton reminds us to engage System 2 thinking to interrogate the System 1 impulse that concludes that RCTs always generate higher-quality evidence:

“… people use this gold standard argument, and say we’re only going to look at estimates that are randomized control trials, or at least prioritize them. We often find a randomized control trial with only a handful of observations in each arm and with enormous standard errors. But that’s preferred to a potentially biased study that uses 100 million observations. That just makes no sense. Each study has to be considered on its own. RCTs are fine, but they are just one of the techniques in the armory that one would use to try to discover things. Gold standard thinking is magical thinking.”

So if not from RCTs alone, where can we begin our search for knowledge about development? To my delight – having spent the last year supporting FSD Kenya’s work on financial diaries – Deaton goes on to say:

“Things like the financial diaries and extended case studies are enormously important. Most of the great ideas in the social sciences over the last 100 years came out of case studies like that. Because people are open to hearing things that they didn’t necessarily plan, for one thing… I very much like the financial diaries work and I’ve learned a lot from them, and to me they are more useful than a series of randomized trials on the topic because they have lots of broadly useful information. I can make my own allowance for how selected they are and I’m not blinkered by the craziness that if it’s not a randomized control trial I shouldn’t pay any attention to it. Which I’ve heard more times than I can count.”

To be honest, I’m not sure which great ideas in social science he is referring to, but I certainly would like to know. I agree with Deaton that financial diaries studies are enormously useful, but I’m not sure their usefulness can be compared to that of RCTs, because financial diaries are designed to answer a completely different class of questions. While RCTs are designed to test specific hypotheses (for example, that A is more effective than B at producing outcome X), I view financial diaries and other case study research as hypothesis generators for topics we know relatively little about. The financial diaries methodology is like a microscope: it records how households manage money in incredible detail, it generates insights about the context and complexity that give rise to those patterns, and these observations are helping us generate new – or challenge existing – theories of how households use financial instruments and allocate resources in settings of scarcity. As Richard Feynman suggests when discussing the nature of scientific discovery, the ultimate value of financial diaries may lie in their ability to cast doubt on existing ideas and generate new experiments, ideas or services that help the poor in their own pursuit of prosperity:

“The rate of the development of science is not the rate at which you make observations alone, but, much more important, the rate at which you create new things to test.”

Having first fallen in love with the power of complex econometric models, then been seduced by the simplicity of RCTs and finally been awed by the richness of financial diaries, I agree with Deaton that it is important to resist the temptation of thinking that any single approach to learning about development will give us all the answers, especially if research proceeds without considering the role of mechanisms and context in producing outcomes. But I also don’t think RCTs are useless, and as someone who conducts and funds research, I’m glad randomization and control groups are part of the toolkit. One last thought the interview left me with is that doubt and skepticism are where the hard, careful thinking that propels knowledge forward begins:

“That’s the beginning of wisdom. It’s very hard to do science. If it was easy or there was a magic machine out there we’d all be a lot wiser. It’s just very very hard.” – Angus Deaton

Paul Gubbins

I help governments, non-profits and companies get the most out of their data in order to advance public policies, programs and innovations that support people's well-being.