To distill a clear message from growing piles of unruly genomics data, researchers often turn to meta-analysis — a tried-and-true statistical procedure for combining data from multiple studies. But the studies that a meta-analysis might mine for answers can diverge endlessly. Some enroll only men, others only children. Some are done in one country, others across a region like Europe. Some focus on milder forms of a disease, others on more advanced cases. Even if statistical methods can compensate for these kinds of variations, studies rarely use the same protocols and instruments to collect the data, or the same software to analyze it. Researchers performing meta-analyses go to untold lengths trying to clean up the hodgepodge of data to control for these confounding factors.
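For readers curious what "combining data from multiple studies" means concretely, the following is a minimal sketch of the inverse-variance weighting at the heart of a classical fixed-effect meta-analysis. The effect sizes and standard errors are invented purely for illustration.

```python
# Minimal fixed-effect meta-analysis: pool per-study effect sizes by
# inverse-variance weighting. All numbers are invented for illustration.
import numpy as np

effects = np.array([0.42, 0.31, 0.55, 0.18])  # hypothetical per-study effect sizes
ses = np.array([0.10, 0.15, 0.20, 0.12])      # hypothetical per-study standard errors

weights = 1.0 / ses**2                         # more precise studies count for more
pooled = np.sum(weights * effects) / np.sum(weights)
pooled_se = np.sqrt(1.0 / np.sum(weights))

print(f"pooled effect: {pooled:.3f} (SE {pooled_se:.3f})")
```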

Purvesh Khatri, a computational immunologist at Stanford University, thinks they’re going about it all wrong. His approach to genomic discovery calls for scouring public repositories for data collected at different hospitals on different populations with different methods — the messier the data, the better. “We start with dirty data,” he says. “If a signal sticks around despite the heterogeneity of the samples, you can bet you’ve actually found something.”
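A toy simulation (a sketch of the logic, not Khatri's actual pipeline) shows why heterogeneity can help rather than hurt: each synthetic cohort below gets its own batch effect and noise level, and only genes whose case-versus-control difference points the same way in every cohort are kept. Genes with a genuine effect mostly survive, while most null genes do not; a few slip through by chance, which is why real pipelines also pool effect sizes and control the false-discovery rate.

```python
# Toy illustration (not Khatri's actual method): a genuine signal should
# keep its direction across cohorts that differ in baseline and noise.
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_cohorts, n_per_group = 1000, 5, 50
true_effect = np.zeros(n_genes)
true_effect[:20] = 1.0                        # 20 genes with a real case/control shift

consistent = np.ones(n_genes, dtype=bool)
for _ in range(n_cohorts):
    baseline = rng.normal(0, 2, n_genes)      # cohort-specific batch effect per gene
    noise = rng.uniform(0.5, 1.5)             # cohort-specific noise level
    cases = baseline + true_effect + rng.normal(0, noise, (n_per_group, n_genes))
    controls = baseline + rng.normal(0, noise, (n_per_group, n_genes))
    diff = cases.mean(axis=0) - controls.mean(axis=0)
    consistent &= diff > 0                    # same direction in this cohort?

print(f"true-signal genes surviving all cohorts: {consistent[:20].sum()} / 20")
print(f"null genes surviving by chance: {consistent[20:].sum()} / {n_genes - 20}")
```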
