Beth Mole at ArsTechnica: “With big data comes big noise. Google learned this lesson the hard way with its now kaput Google Flu Trends. The online tracker, which used Internet search data to predict real-life flu outbreaks, emerged amid fanfare in 2008. Then it met a quiet death this August after repeatedly coughing up bad estimates.
But big Internet data isn’t out of the disease tracking scene yet.
With hubris firmly in check, a team of Harvard researchers have come up with a way to tame the unruly data, combine it with other data sets, and continually calibrate it to track flu outbreaks with less error. Their new model, published Monday in the Proceedings of the National Academy of Sciences, out-performs Google Flu Trends and other models with at least double the accuracy. If the model holds up in coming flu seasons, it could reinstate some optimism in using big data to monitor disease and herald a wave of more accurate second-generation models.
Big data has a lot of potential, Samuel Kou, a statistics professor at Harvard University and coauthor on the new study, told Ars. It’s just a question of using the right analytics, he said.
Kou and his colleagues built on Google’s flu tracking model for their new version, called ARGO (AutoRegression with GOogle search data). Google Flu Trends basically relied on trends in Internet search terms, such as headache and chills, to estimate the number of flu cases. Those search terms were correlated with flu outbreak data collected by the Centers for Disease Control and Prevention. The CDC’s data relies on clinical reports from around the country. But compiling and analyzing that data can be slow, leading to a lag time of one to three weeks. The Google data, on the other hand, offered near real-time tracking for health experts to manage and prepare for outbreaks.
At first Google’s tracker appeared to be pretty good, matching CDC data’s late-breaking data somewhat closely. But, two notable stumbles led to its ultimate downfall: an underestimate of the 2009 H1N1 swine flu outbreak and an alarming overestimate (almost double real numbers) of the 2012-2013 flu season’s cases…..For ARGO, he and colleagues took the trend data and then designed a model that could self-correct for changes in how people search. The model has a two-year sliding window in which it re-calibrates current search term trends with the CDC’s historical flu data (the gold standard for flu data). They also made sure to exclude winter search terms, such as March Madness and the Oscars, so they didn’t get accidentally correlated with seasonal flu trends. Last, they incorporated data on the historical seasonality of flu.
The result was a model that significantly out-competed the Google Flu Trends estimates for the period between March 29, 2009 to July 11, 2015. ARGO also beat out other models, including one based on current and historical CDC data….(More)”
See also Proceedings of the National Academy of Sciences, 2015. DOI: 10.1073/pnas.1515373112