Google’s flu fail shows the problem with big data


Adam Kucharski in The Conversation: “When people talk about ‘big data’, there is an oft-quoted example: a proposed public health tool called Google Flu Trends. It has become something of a pin-up for the big data movement, but it might not be as effective as many claim.
The idea behind big data is that large amounts of information can help us do things which smaller volumes cannot. Google first outlined the Flu Trends approach in a 2008 paper in the journal Nature. Rather than relying on the disease surveillance used by the US Centers for Disease Control and Prevention (CDC) – such as visits to doctors and lab tests – the authors suggested it would be possible to predict epidemics through Google searches. When suffering from flu, many Americans will search for information related to their condition….
Between 2003 and 2008, flu epidemics in the US had been strongly seasonal, appearing each winter. However, in 2009, the first cases (as reported by the CDC) started around Easter. Flu Trends had already made its predictions when the CDC data was published, but it turned out that the Google model didn’t match reality. It had substantially underestimated the size of the initial outbreak.
The problem was that Flu Trends could only measure what people search for; it didn’t analyse why they were searching for those words. By removing human input, and letting the raw data do the work, the model had to make its predictions using only search queries from the previous handful of years. Although those 45 terms matched the regular seasonal outbreaks from 2003–8, they didn’t reflect the pandemic that appeared in 2009.
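The Nature paper's approach can be illustrated in miniature: fit a linear model relating the log-odds of a search-query fraction to the log-odds of the CDC's influenza-like-illness (ILI) percentage, then estimate ILI from queries alone. The sketch below uses synthetic data and a single query series (the real model aggregated its 45 terms), so it is illustrative only:

```python
import numpy as np

def logit(p):
    return np.log(p / (1 - p))

def inv_logit(x):
    return 1 / (1 + np.exp(-x))

# Synthetic weekly data: a seasonal ILI fraction and a noisily
# correlated search-query fraction (both between 0 and 1)
rng = np.random.default_rng(0)
weeks = 150
ili = 0.01 + 0.04 * (np.sin(np.linspace(0, 6 * np.pi, weeks)) + 1) / 2
query = inv_logit(0.9 * logit(ili) + rng.normal(0, 0.05, weeks))

# Fit logit(ILI) = b0 + b1 * logit(query) by ordinary least squares
X = np.column_stack([np.ones(weeks), logit(query)])
b, *_ = np.linalg.lstsq(X, logit(ili), rcond=None)

# "Nowcast" ILI from search behaviour alone
pred = inv_logit(X @ b)
```

A model fitted this way can only reproduce correlations present in its training window, which is the failure mode described above: search terms tuned to seasonal winters said nothing about an off-season pandemic.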
Six months after the pandemic started, Google – who now had the benefit of hindsight – updated their model so that it matched the 2009 CDC data. Despite these changes, the updated version of Flu Trends ran into difficulties again last winter, when it overestimated the size of the influenza epidemic in New York State. The incidents in 2009 and 2012 raised the question of how good Flu Trends is at predicting future epidemics, as opposed to merely finding patterns in past data.
In a new analysis, published in the journal PLOS Computational Biology, US researchers report that there are “substantial errors in Google Flu Trends estimates of influenza timing and intensity”. This is based on a comparison of Google Flu Trends predictions with the actual epidemic data at the national, regional and local level between 2003 and 2013.
Even when search behaviour was correlated with influenza cases, the model sometimes misestimated important public health metrics such as peak outbreak size and cumulative cases. The predictions were particularly wide of the mark in 2009 and 2012:

Original and updated Google Flu Trends (GFT) model compared with CDC influenza-like illness (ILI) data. PLOS Computational Biology 9:10

Although they criticised certain aspects of the Flu Trends model, the researchers think that monitoring internet search queries might yet prove valuable, especially if it were linked with other surveillance and prediction methods.
Other researchers have also suggested that other sources of digital data – from Twitter feeds to mobile phone GPS – have the potential to be useful tools for studying epidemics. As well as helping to analyse outbreaks, such methods could allow researchers to track human movement and the spread of public health information (or misinformation).
Although much attention has been given to web-based tools, there is another type of big data that is already having a huge impact on disease research. Genome sequencing is enabling researchers to piece together how diseases transmit and where they might come from. Sequence data can even reveal the existence of a new disease variant: earlier this week, researchers announced a new type of dengue fever virus….”

Global Collective Intelligence in Technological Societies


Paper by Juan Carlos Piedra Calderón and Javier Rainer in the International Journal of Artificial Intelligence and Interactive Multimedia: “The strong influence of Information and Communication Technologies (ICT), especially in the construction of Technological Societies, has generated major social changes, visible in the way people relate to one another in different environments. These changes make it possible to expand the frontiers of knowledge through sharing and cooperation, which in turn creates a new form of Collaborative Knowledge. The potential of this Collaborative Knowledge is realised through ICT in combination with Artificial Intelligence processes, from which a Collective Knowledge is obtained. When this kind of knowledge is shared, it gives rise to Global Collective Intelligence”.

Information Now: Open Access and the Public Good


Podcast from SMARTech (Georgia Tech): “Every year, the international academic and research community dedicates a week in October to discuss, debate, and learn more about Open Access. Open Access in the academic sense refers to the free, immediate, and online access to the results of scholarly research, primarily academic, peer-reviewed journal articles. In the United States, the movement in support of Open Access has, in the last decade, been growing dramatically. Because of this growing interest in Open Access, a group of academic librarians from the Georgia Tech library, Wendy Hagenmaier (Digital Collections Archivist), Fred Rascoe (Scholarly Communication Librarian), and Lizzy Rolando (Research Data Librarian), got together to talk to folks in the thick of it, to try and unravel some of the different concerns and benefits of Open Access. But we didn’t just want to talk about Open Access for journal articles – we wanted to examine more broadly what it means to be “open”, what is open information, and what relationship open information has to the public good. In this podcast, we talk with different people who have seen and experienced open information and open access in practice. In the first act, Dan Cohen from the DPLA speaks about efforts to expand public access to archival and library collections. In the second, we’ll hear an argument from Christine George about why things sometimes need to be closed, if we want them to be open in the future. Third, Kari Watkins speaks about a specific example of a government agency deciding, against legitimate concerns, to make transit data open, and why it worked for them. Fourth, Peter Suber from Harvard University will give us the background on the Open Access movement, some myths that have been dispelled, and why it is important for academic researchers to take the leap to make their research openly accessible.
And finally, we’ll hear from Michael Chang, a researcher who did take that leap and helped start an Open Access journal, and why he sees openness in research as his obligation.”

See also Personal Guide to Open Access

Are We Puppets in a Wired World?


Sue Halpern in The New York Review of Books: “Also not obvious was how the Web would evolve, though its open architecture virtually assured that it would. The original Web, the Web of static homepages, documents laden with “hot links,” and electronic storefronts, segued into Web 2.0, which, by providing the means for people without technical knowledge to easily share information, recast the Internet as a global social forum with sites like Facebook, Twitter, FourSquare, and Instagram.
Once that happened, people began to make aspects of their private lives public, letting others know, for example, when they were shopping at H&M and dining at Olive Garden, letting others know what they thought of the selection at that particular branch of H&M and the waitstaff at that Olive Garden, then modeling their new jeans for all to see and sharing pictures of their antipasti and lobster ravioli—to say nothing of sharing pictures of their girlfriends, babies, and drunken classmates, or chronicling life as a high-paid escort, or worrying about skin lesions or seeking a cure for insomnia or rating professors, and on and on.
The social Web celebrated, rewarded, routinized, and normalized this kind of living out loud, all the while anesthetizing many of its participants. Although they likely knew that these disclosures were funding the new information economy, they didn’t especially care…
The assumption that decisions made by machines that have assessed reams of real-world information are more accurate than those made by people, with their foibles and prejudices, may be correct generally and wrong in the particular; and for those unfortunate souls who might never commit another crime even if the algorithm says they will, there is little recourse. In any case, computers are not “neutral”; algorithms reflect the biases of their creators, which is to say that prediction cedes an awful lot of power to the algorithm creators, who are human after all. Some of the time, too, proprietary algorithms, like the ones used by Google and Twitter and Facebook, are intentionally biased to produce results that benefit the company, not the user, and some of the time algorithms can be gamed. (There is an entire industry devoted to “optimizing” Google searches, for example.)
But the real bias inherent in algorithms is that they are, by nature, reductive. They are intended to sift through complicated, seemingly discrete information and make some sort of sense of it, which is the definition of reductive.”

The "crowd computing" revolution


Michael Copeland in the Atlantic: “Software might be eating the world, but Rob Miller, a professor of computer science at MIT, foresees a “crowd computing” revolution that makes workers and machines colleagues rather than competitors….
Miller studies human-computer interaction, specifically a field called crowd computing. A play on the more common term “cloud computing,” crowd computing is software that employs a group of people to do small tasks and solve a problem better than an algorithm or a single expert. Examples of crowd computing include Wikipedia, Amazon’s Mechanical Turk (where tasks that computers can’t do are outsourced to an online community of workers), and Facebook’s photo tagging feature.
But just as humans are better than computers at some things, Miller concedes that algorithms have surpassed human capability in several fields. Take a look at libraries, which now have advanced digital databases, eliminating the need for most human reference librarians. There’s also flight search, where algorithms are much better than people at finding the cheapest fare.
That said, more complicated tasks even in those fields can get tricky for a computer.
“For complex flight search, people are still better,” Miller says. A site called Flightfox lets travelers input a complex trip while a group of experts help find the cheapest or most convenient combination of flights. “There are travel agents and frequent flyers in that crowd, people with expertise at working angles of the airfare system that are not covered by the flight searches and may never be covered because they involve so many complex intersecting rules that are very hard to code.”
Social and cultural understanding is another area in which humans will always exceed computers, Miller says. People are constantly inventing new slang, watching the latest viral videos and movies, or partaking in some other cultural phenomena together. That’s something that an algorithm won’t ever be able to catch up to. “There’s always going to be a frontier of human understanding that leads the machines,” he says.
A post-employee economy where every task is automated by a computer is something Miller does not see happening, nor does he want it to happen. Instead, he considers the relationship between human and machine symbiotic. Both machines and humans benefit in crowd computing: “the machine wants to acquire data so it can train and get better. The crowd is improved in many ways, like through pay or education,” Miller says. And finally, the end users “get the benefit of a more accurate and fast answer.”
Miller’s User Interface Design Group at MIT has made several programs illustrating how this symbiosis between user, crowd and machine works. Most recently, the MIT group created Cobi, a tool that taps into an academic community to plan a large-scale conference. The software allows members to identify papers they want presented and what authors are experts in specific fields. A scheduling tool combines the community’s input with an algorithm that finds the best times to meet.
Programs more practical for everyday users include Adrenaline, a camera driven by a crowd, and Soylent, a word processing tool that allows people to do interactive document shortening and proofreading. The Adrenaline camera took a video and then had a crowd on call to very quickly identify the best still in that video, whether it was the best group portrait, mid-air jump, or angle of somebody’s face. Soylent used workers on Mechanical Turk to proofread and shorten text in Microsoft Word. In the process, Miller and his students found that the crowd found errors that neither a single expert proofreader nor the program—with spell and grammar check turned on—could find.
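Part of why a crowd can beat both an expert and a checker is aggregation: a finding survives only if several independent workers agree, which filters out any one worker's mistakes. A minimal sketch of that majority-vote step, with hypothetical names and data (Soylent's actual pipeline, called Find-Fix-Verify, staged the work more elaborately):

```python
from collections import Counter

def aggregate_votes(worker_answers, min_agreement=0.5):
    """Keep only findings reported by more than min_agreement of workers.

    worker_answers: one set per worker, each containing the items
    (e.g. indices of sentences flagged as erroneous) that worker reported.
    """
    counts = Counter(item for answers in worker_answers for item in answers)
    n = len(worker_answers)
    return {item for item, c in counts.items() if c / n > min_agreement}

# Three workers flag possible errors in a document by sentence index
votes = [{2, 5, 9}, {2, 5}, {5, 7}]
aggregate_votes(votes)  # returns {2, 5}: only findings two or more workers share
```

The agreement threshold is the lever Miller's caveat points at: a crowd without the relevant expertise will simply agree on the wrong things.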
“It shows this is the essential thing that human beings bring that algorithms do not,” Miller said.
That said, you can’t just use any crowd for any task. “It does depend on having appropriate expertise in the crowd. If [the text] had been about computational biology, they might not have caught [the error]. The crowd does have to have skills.” Going forward, Miller thinks that software will increasingly use the power of the crowd. “In the next 10 or 20 years it will be more likely we already have a crowd,” he says. “There will already be these communities and they will have needs, some of which will be satisfied by software and some which will require human help and human attention. I think a lot of these algorithms and system techniques that are being developed by all these startups, who are experimenting with it in their own spaces, are going to be things that we’ll just naturally pick up and use as tools.”

A Data Revolution for Poverty Eradication


Report from devint.org: “The High Level Panel on the Post–2015 Development Agenda called for a data revolution for sustainable development, with a new international initiative to improve the quality of statistics and information available to citizens. It recommended actively taking advantage of new technology, crowdsourcing, and improved connectivity to empower people with information on the progress towards the targets. Development Initiatives believes there are a number of steps that should be put in place in order to deliver the ambition set out by the Panel.
The data revolution should be seen as a basis on which greater openness and a wider transparency revolution can be built. The openness movement – one of the most exciting and promising developments of the last decade – is starting to transform the citizen-state compact. Rich and developing country governments are adapting the way they do business, recognising that greater transparency and participation leads to more effective, efficient, and equitable management of scarce public resources. Increased openness of data has potential to democratise access to information, empowering individuals with the knowledge they need to tackle the problems that they face. To realise this bold ambition, the revolution will need to reach beyond the niche data and statistical communities, sell the importance of the revolution to a wide range of actors (governments, donors, CSOs and the media) and leverage the potential of open data to deliver more usable information.”

7 Tactics for 21st-Century Cities


Abhi Nemani, co-director of Code for America: “Be it the burden placed on them by shrinking federal support, or the opportunity presented by modern technology, 21st-century cities are finding new ways to do things. For four years, Code for America has worked with dozens of cities, each finding creative ways to solve neighborhood problems, build local capacity and steward a national network. These aren’t one-offs. Cities are championing fundamental, institutional reforms to commit to an ongoing innovation agenda.
Here are a few of the ways:

  1. …Create an office of new urban mechanics or appoint a chief innovation officer…
  2. …Appoint a chief data officer or create an office of performance management/enhancement…
  3. …Adopt the Gov.UK Design Principles, and require plain, human language on every interface….
  4. …Share open source technology with a sister city or change procurement rules to make it easier to redeploy civic tech….
  5. …Work with the local civic tech community and engage citizens for their feedback on city policy through events, tech and existing forums…
  6. …Create an open data policy and adopt open data specifications…
  7. …Attract tech talent into city leadership, and create training opportunities citywide to level up the tech literacy for city staff…”

From open data to open democracy


Article by : “Such debates further underscore the complexities of open data and where it might lead. While open data may be viewed by some inside and outside government as a technically-focused and largely incremental project based upon information formatting and accessibility (with the degree of openness subject to a myriad of security and confidentiality provisions), such an approach greatly limits its potential. Indeed, the growing ubiquity of mobile and smart devices, the advent of open source operating systems and social media platforms, and the growing commitment by governments themselves to expansive public engagement objectives, all suggest a widening scope.
Yet, what will incentivize the typical citizen to access open data and to partake in collective efforts to create public value? It is here where our digital culture may well fall short, emphasizing individualized service and convenience at the expense of civic responsibility and community-mindedness. For one American academic, this “citizenship deficit” erodes democratic legitimacy and renders our politics more polarized and less discursive. For other observers in Europe, notions of the digital divide are giving rise to new “data divides.”
The politics and practicalities of data privacy often bring further confusion. While privacy advocates call for greater protection and a culture of data activism among Internet users themselves, the networked ethos of online communities and commercialization fuels speed and sharing, often with little understanding of the ramifications of doing so. Differences between consumerism and citizenship are subtle yet profoundly important, while increasingly blurred and overlooked.
A key conundrum provincially and federally, within the Westminster confines of parliamentary democracy, is that open data is being hatched mainly from within the executive branch, whereas the legislative branch watches and withers. In devising genuine democratic openness, politicians and their parties must do more than post expenses online: they must become partners and advocates for renewal. A lesson of open source technology, however, is that systemic change demands an informed and engaged civil society, disgruntled with the status quo but also determined to act anew.
Most often, such actions are highly localized, even in a virtual world, giving rise to the purpose and meaning of smarter and more intelligent communities. And in Canada it bears noting that we see communities both large and small embracing open data and other forms of online experimentation such as participatory budgeting. It is often within small but connected communities where a virtuous cycle of online and in-person identities and actions can deepen and impact decision-making most directly.
How, then, do we reconcile traditional notions of top-down political federalism and national leadership with this bottom-up approach to community engagement and democratic renewal? Shifting from open data to open democracy is likely to be an uneven, diverse, and at times messy affair. Better this way than attempting to ordain top-down change in a centralized and standardized manner.”

Our Privacy Problem is a Democracy Problem in Disguise


Evgeny Morozov in MIT Technology Review: “Intellectually, at least, it’s clear what needs to be done: we must confront the question not only in the economic and legal dimensions but also in a political one, linking the future of privacy with the future of democracy in a way that refuses to reduce privacy either to markets or to laws. What does this philosophical insight mean in practice?

First, we must politicize the debate about privacy and information sharing. Articulating the existence—and the profound political consequences—of the invisible barbed wire would be a good start. We must scrutinize data-intensive problem solving and expose its occasionally antidemocratic character. At times we should accept more risk, imperfection, improvisation, and inefficiency in the name of keeping the democratic spirit alive.
Second, we must learn how to sabotage the system—perhaps by refusing to self-track at all. If refusing to record our calorie intake or our whereabouts is the only way to get policy makers to address the structural causes of problems like obesity or climate change—and not just tinker with their symptoms through nudging—information boycotts might be justifiable. Refusing to make money off your own data might be as political an act as refusing to drive a car or eat meat. Privacy can then reëmerge as a political instrument for keeping the spirit of democracy alive: we want private spaces because we still believe in our ability to reflect on what ails the world and find a way to fix it, and we’d rather not surrender this capacity to algorithms and feedback loops.
Third, we need more provocative digital services. It’s not enough for a website to prompt us to decide who should see our data. Instead it should reawaken our own imaginations. Designed right, sites would not nudge citizens to either guard or share their private information but would reveal the hidden political dimensions to various acts of information sharing. We don’t want an electronic butler—we want an electronic provocateur. Instead of yet another app that could tell us how much money we can save by monitoring our exercise routine, we need an app that can tell us how many people are likely to lose health insurance if the insurance industry has as much data as the NSA, most of it contributed by consumers like us. Eventually we might discern such dimensions on our own, without any technological prompts.
Finally, we have to abandon fixed preconceptions about how our digital services work and interconnect. Otherwise, we’ll fall victim to the same logic that has constrained the imagination of so many well-meaning privacy advocates who think that defending the “right to privacy”—not fighting to preserve democracy—is what should drive public policy. While many Internet activists would surely argue otherwise, what happens to the Internet is of only secondary importance. Just as with privacy, it’s the fate of democracy itself that should be our primary goal.

Open Data and Open Government: Rethinking Telecommunications Policy and Regulation


New paper by Ewan Sutherland: “While attention has been given to the uses of big data by network operators and to the provision of open data by governments, there has been no systematic attempt to re-examine the regulatory systems for telecommunications. The power of public authorities to access the big data held by operators could transform regulation by simplifying proof of bias or discrimination, making operators more susceptible to behavioural remedies, while it could also be used to deliver much finer granularity of decision making. By opening up data held by government and its agencies to enterprises, think tanks and research groups it should be possible to transform market regulation.”