In Friday’s Financial Times, economist and author Tim Harford offers a much-needed, unusually considered response to big data hype, using the reported failure of Google Flu Trends to predict a recent flu season as a prismatic example that highlights key reasons big data merits more caution and less exuberance.
Harford’s critique is generally insightful and spot-on – his concerns about the perils of multiple comparisons are especially important (my 2008 take in the Washington Post here), and will become even more relevant in healthcare as more data become available to more people (obviously a good trend). Without statistically appropriate analysis, however, there’s a real concern that a spate of false-positive associations will emerge and distract us, much like in the early days of genomics.
The one area of Harford’s analysis that gave me pause involves the age-old debate between theory and empiricism (nicely summarized in Jim Manzi’s Uncontrolled, by the way). A strong version of the canonical big data thesis is that when you have enough information, you can make unbiased predictions that don’t require an underlying understanding of the process or context – the data are sufficient to speak for themselves. This is the so-called “end of theory.”
Not so fast, Harford responds. The failure of Google Flu Trends, in his view, emphasizes the perils of unmoored empiricism.
“A theory-free analysis of mere correlations is inevitably fragile,” Harford writes. “If you have no idea what is behind a correlation, you have no idea what might cause that correlation to break down.”
He’s right, of course – but I suspect that outside of a few areas such as physics, our understanding of causation is far more fragile than we appreciate (a point emphasized at length in Manzi’s book). We overestimate our understanding of causation, and our ability to generalize.
I’d argue this is especially true in medicine, where despite our aspirations to approach health and disease from first principles, our actual understanding is far more limited, and based far more on rationalized empiricism than is often appreciated – there’s much more scientism than science. The primacy of empiricism in medicine also emerges from Morton Meyers’ Happy Accidents, and is a central theme of Nassim Taleb’s Antifragile. (I also touch upon implications for drug discovery here.)
Fundamentally, my concern is that more often than we appreciate, and especially in healthcare, our faith in theory is misplaced – we turn to various theories as crutches, as explanatory models, or, in the case of med students and harried residents, as memory devices.
Ideally, theories can be evaluated scientifically and replaced by better ones – and this happens, over time. But I suspect that many of our existing theories are at least as fragile as the visibly imperfect, data-driven associations Harford cites.
The difference is that we recognize (or should recognize) empirical predictions for what they are, limitations and all. Yet I suspect we are more likely to let our guard down when predictions are theory-driven and we instinctively believe we really understand what is going on. In doing so, we are likely to discount data that don’t fit, and unconsciously constrain our thinking according to theory’s dictates.
For most phenomena in medicine and health, we really don’t have a clear understanding of cause and effect. “Que sais-je,” Montaigne inscribed above the door of his study more than four hundred years ago. “What do I know?” Empirical, big-data-driven analyses, at least, have the humility to acknowledge this.