Big Data


Special Report on Big Data by Volta – A newsletter on Science, Technology and Society in Europe: “Locating crime spots, or the next outbreak of a contagious disease, Big Data promises benefits for society as well as business. But more means messier. Do policy-makers know how to use this scale of data-driven decision-making in an effective way for their citizens and ensure their privacy? 90% of the world’s data have been created in the last two years. Every minute, more than 100 million new emails are created, 72 hours of new video are uploaded to YouTube and Google processes more than 2 million searches. Nowadays, almost everyone walks around with a small computer in their pocket, uses the internet on a daily basis and shares photos and information with their friends, family and networks. The digital exhaust we leave behind every day contributes to an enormous amount of data produced, and at the same time leaves electronic traces that contain a great deal of personal information….
Until recently, traditional technology and analysis techniques have not been able to handle this quantity and type of data. But recent technological developments have enabled us to collect, store and process data in new ways. There seem to be no limitations, either to the volume of data or to the technology for storing and analyzing it. Big Data can map a driver’s sitting position to identify a car thief, it can use Google searches to predict outbreaks of the H1N1 flu virus, it can data-mine Twitter to predict the price of rice or use mobile phone top-ups to describe unemployment in Asia.
The word ‘data’ means ‘given’ in Latin. It commonly refers to a description of something that can be recorded and analyzed. While there is no clear definition of the concept of ‘Big Data’, it usually refers to the processing of huge amounts and new types of data that have not been possible with traditional tools.

‘The new development is not necessarily that there are so much more data. It’s rather that data is available to us in a new way.’

The notion of Big Data is kind of misleading, argues Robindra Prabhu, a project manager at the Norwegian Board of Technology. “The new development is not necessarily that there are so much more data. It’s rather that data is available to us in a new way. The digitalization of society gives us access to both ‘traditional’, structured data – like the content of a database or register – and unstructured data, for example the content in a text, pictures and videos. Information designed to be read by humans is now also readable by machines. And this development makes a whole new world of data gathering and analysis available. Big Data is exciting not just because of the amount and variety of data out there, but because we can process data about so much more than before.”

Google’s flu fail shows the problem with big data


Adam Kucharski in The Conversation: “When people talk about ‘big data’, there is an oft-quoted example: a proposed public health tool called Google Flu Trends. It has become something of a pin-up for the big data movement, but it might not be as effective as many claim.
The idea behind big data is that large amounts of information can help us do things which smaller volumes cannot. Google first outlined the Flu Trends approach in a 2008 paper in the journal Nature. Rather than relying on the disease surveillance used by the US Centers for Disease Control and Prevention (CDC) – such as visits to doctors and lab tests – the authors suggested it would be possible to predict epidemics through Google searches. When suffering from flu, many Americans will search for information related to their condition….
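The approach can be caricatured in a few lines. The sketch below is ours, not Google’s code: it assumes a simple linear fit between the log-odds of the flu-related query share and the log-odds of the CDC’s influenza-like-illness (ILI) visit rate, and every number in it is invented for illustration.

```python
# Illustrative sketch of a Flu Trends-style "nowcast", not Google's model:
# fit a linear relationship between the log-odds of flu-related query share
# and the log-odds of the CDC's ILI visit rate. All figures are invented.
import numpy as np

def logit(p):
    return np.log(p / (1.0 - p))

def inv_logit(x):
    return 1.0 / (1.0 + np.exp(-x))

# Historical weeks: fraction of searches matching the chosen flu terms,
# and the ILI visit rate reported for the same weeks (made-up numbers).
query_share = np.array([0.002, 0.004, 0.009, 0.015, 0.011, 0.005])
ili_rate    = np.array([0.010, 0.018, 0.035, 0.055, 0.042, 0.020])

# Fit logit(ILI) = b0 + b1 * logit(query share) on past weeks.
b1, b0 = np.polyfit(logit(query_share), logit(ili_rate), 1)

# Estimate this week's ILI rate from search data alone, ahead of CDC reports.
this_week_share = 0.012
print(round(inv_logit(b0 + b1 * logit(this_week_share)), 3))
```

A fit like this can only reproduce relationships present in the weeks it was trained on, which is exactly the limitation the events described below exposed.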
Between 2003 and 2008, flu epidemics in the US had been strongly seasonal, appearing each winter. However, in 2009, the first cases (as reported by the CDC) started around Easter. Flu Trends had already made its predictions when the CDC data was published, but it turned out that the Google model didn’t match reality. It had substantially underestimated the size of the initial outbreak.
The problem was that Flu Trends could only measure what people search for; it didn’t analyse why they were searching for those words. By removing human input, and letting the raw data do the work, the model had to make its predictions using only search queries from the previous handful of years. Although those 45 terms matched the regular seasonal outbreaks from 2003–8, they didn’t reflect the pandemic that appeared in 2009.
Six months after the pandemic started, Google – who now had the benefit of hindsight – updated their model so that it matched the 2009 CDC data. Despite these changes, the updated version of Flu Trends ran into difficulties again last winter, when it overestimated the size of the influenza epidemic in New York State. The incidents in 2009 and 2012 raised the question of how good Flu Trends is at predicting future epidemics, as opposed to merely finding patterns in past data.
In a new analysis, published in the journal PLOS Computational Biology, US researchers report that there are “substantial errors in Google Flu Trends estimates of influenza timing and intensity”. This is based on a comparison of Google Flu Trends predictions and the actual epidemic data at the national, regional and local level between 2003 and 2013.
Even when search behaviour was correlated with influenza cases, the model sometimes misestimated important public health metrics such as peak outbreak size and cumulative cases. The predictions were particularly wide of the mark in 2009 and 2012:

Original and updated Google Flu Trends (GFT) model compared with CDC influenza-like illness (ILI) data. PLOS Computational Biology 9:10

Although they criticised certain aspects of the Flu Trends model, the researchers think that monitoring internet search queries might yet prove valuable, especially if it were linked with other surveillance and prediction methods.
Other researchers have also suggested that other sources of digital data – from Twitter feeds to mobile phone GPS – have the potential to be useful tools for studying epidemics. As well as helping to analyse outbreaks, such methods could allow researchers to analyse human movement and the spread of public health information (or misinformation).
Although much attention has been given to web-based tools, there is another type of big data that is already having a huge impact on disease research. Genome sequencing is enabling researchers to piece together how diseases transmit and where they might come from. Sequence data can even reveal the existence of a new disease variant: earlier this week, researchers announced a new type of dengue fever virus….”

Are We Puppets in a Wired World?


Sue Halpern in The New York Review of Books: “Also not obvious was how the Web would evolve, though its open architecture virtually assured that it would. The original Web, the Web of static homepages, documents laden with “hot links,” and electronic storefronts, segued into Web 2.0, which, by providing the means for people without technical knowledge to easily share information, recast the Internet as a global social forum with sites like Facebook, Twitter, FourSquare, and Instagram.
Once that happened, people began to make aspects of their private lives public, letting others know, for example, when they were shopping at H&M and dining at Olive Garden, letting others know what they thought of the selection at that particular branch of H&M and the waitstaff at that Olive Garden, then modeling their new jeans for all to see and sharing pictures of their antipasti and lobster ravioli—to say nothing of sharing pictures of their girlfriends, babies, and drunken classmates, or chronicling life as a high-paid escort, or worrying about skin lesions or seeking a cure for insomnia or rating professors, and on and on.
The social Web celebrated, rewarded, routinized, and normalized this kind of living out loud, all the while anesthetizing many of its participants. Although they likely knew that these disclosures were funding the new information economy, they didn’t especially care…
The assumption that decisions made by machines that have assessed reams of real-world information are more accurate than those made by people, with their foibles and prejudices, may be correct generally and wrong in the particular; and for those unfortunate souls who might never commit another crime even if the algorithm says they will, there is little recourse. In any case, computers are not “neutral”; algorithms reflect the biases of their creators, which is to say that prediction cedes an awful lot of power to the algorithm creators, who are human after all. Some of the time, too, proprietary algorithms, like the ones used by Google and Twitter and Facebook, are intentionally biased to produce results that benefit the company, not the user, and some of the time algorithms can be gamed. (There is an entire industry devoted to “optimizing” Google searches, for example.)
But the real bias inherent in algorithms is that they are, by nature, reductive. They are intended to sift through complicated, seemingly discrete information and make some sort of sense of it, which is the definition of reductive.”

The "crowd computing" revolution


Michael Copeland in the Atlantic: “Software might be eating the world, but Rob Miller, a professor of computer science at MIT, foresees a “crowd computing” revolution that makes workers and machines colleagues rather than competitors….
Miller studies human-computer interaction, specifically a field called crowd computing. A play on the more common term “cloud computing,” crowd computing is software that employs a group of people to do small tasks and solve a problem better than an algorithm or a single expert. Examples of crowd computing include Wikipedia, Amazon’s Mechanical Turk (where tasks that computers can’t do are outsourced to an online community of workers) and Facebook’s photo-tagging feature.
But just as humans are better than computers at some things, Miller concedes that algorithms have surpassed human capability in several fields. Take a look at libraries, which now have advanced digital databases, eliminating the need for most human reference librarians. There’s also flight search, where algorithms are much better than people at finding the cheapest fare.
That said, more complicated tasks even in those fields can get tricky for a computer.
“For complex flight search, people are still better,” Miller says. A site called Flightfox lets travelers input a complex trip while a group of experts help find the cheapest or most convenient combination of flights. “There are travel agents and frequent flyers in that crowd, people with expertise at working angles of the airfare system that are not covered by the flight searches and may never be covered because they involve so many complex intersecting rules that are very hard to code.”
Social and cultural understanding is another area in which humans will always exceed computers, Miller says. People are constantly inventing new slang, watching the latest viral videos and movies, or partaking in some other cultural phenomena together. That’s something that an algorithm won’t ever be able to catch up to. “There’s always going to be a frontier of human understanding that leads the machines,” he says.
A post-employee economy where every task is automated by a computer is something Miller does not see happening, nor does he want it to happen. Instead, he considers the relationship between human and machine symbiotic. Both machines and humans benefit in crowd computing: “the machine wants to acquire data so it can train and get better. The crowd is improved in many ways, like through pay or education,” Miller says. And finally, the end users “get the benefit of a more accurate and fast answer.”
Miller’s User Interface Design Group at MIT has made several programs illustrating how this symbiosis between user, crowd and machine works. Most recently, the MIT group created Cobi, a tool that taps into an academic community to plan a large-scale conference. The software allows members to identify papers they want presented and what authors are experts in specific fields. A scheduling tool combines the community’s input with an algorithm that finds the best times to meet.
Programs more practical for everyday users include Adrenaline, a camera driven by a crowd, and Soylent, a word processing tool that allows people to do interactive document shortening and proofreading. The Adrenaline camera took a video and then had a crowd on call to very quickly identify the best still in that video, whether it was the best group portrait, mid-air jump, or angle of somebody’s face. Soylent likewise used Mechanical Turk workers to proofread and shorten text in Microsoft Word. In the process, Miller and his students found that the crowd caught errors that neither a single expert proofreader nor the program—with spell and grammar check turned on—could find.
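As a toy illustration of the aggregation step in such crowd pipelines (this is our own sketch, not the MIT group’s code), Adrenaline’s “best still” choice can be reduced to a plurality vote over workers’ picks; the function name and agreement threshold below are illustrative assumptions.

```python
# Toy sketch of crowd aggregation in the spirit of Adrenaline's "best still"
# step: each worker nominates a frame index, and a plurality vote decides.
# The threshold and fallback behaviour are assumptions, not the real system.
from collections import Counter

def best_frame(worker_picks, min_agreement=0.3):
    """worker_picks: frame indices chosen by individual crowd workers."""
    if not worker_picks:
        return None
    frame, votes = Counter(worker_picks).most_common(1)[0]
    # Only trust the crowd when enough workers converge on the same frame.
    if votes / len(worker_picks) < min_agreement:
        return None  # e.g. ask more workers, or fall back to an algorithm
    return frame

print(best_frame([42, 42, 43, 42, 17]))  # -> 42
```

The rule is deliberately crude; the point is that the judgement comes from the crowd, while the software merely collects and tallies the answers.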
“It shows this is the essential thing that human beings bring that algorithms do not,” Miller said.
That said, you can’t just use any crowd for any task. “It does depend on having appropriate expertise in the crowd. If [the text] had been about computational biology, they might not have caught [the error]. The crowd does have to have skills.” Going forward, Miller thinks that software will increasingly use the power of the crowd. “In the next 10 or 20 years it will be more likely we already have a crowd,” he says. “There will already be these communities and they will have needs, some of which will be satisfied by software and some which will require human help and human attention. I think a lot of these algorithms and system techniques that are being developed by all these startups, who are experimenting with it in their own spaces, are going to be things that we’ll just naturally pick up and use as tools.”

From open data to open democracy


Article by : “Such debates further underscore the complexities of open data and where it might lead. While open data may be viewed by some inside and outside government as a technically-focused and largely incremental project based upon information formatting and accessibility (with the degree of openness subject to a myriad of security and confidentiality provisions), such an approach greatly limits its potential. Indeed, the growing ubiquity of mobile and smart devices, the advent of open source operating systems and social media platforms, and the growing commitment by governments themselves to expansive public engagement objectives, all suggest a widening scope.
Yet, what will incentivize the typical citizen to access open data and to partake in collective efforts to create public value? It is here where our digital culture may well fall short, emphasizing individualized service and convenience at the expense of civic responsibility and community-mindedness. For one American academic, this “citizenship deficit” erodes democratic legitimacy and renders our politics more polarized and less discursive. For other observers in Europe, notions of the digital divide are giving rise to new “data divides.”
The politics and practicalities of data privacy often bring further confusion. While privacy advocates call for greater protection and a culture of data activism among Internet users themselves, the networked ethos of online communities and commercialization fuels speed and sharing, often with little understanding of the ramifications of doing so. Differences between consumerism and citizenship are subtle yet profoundly important, while increasingly blurred and overlooked.
A key conundrum provincially and federally, within the Westminster confines of parliamentary democracy, is that open data is being hatched mainly from within the executive branch, whereas the legislative branch watches and withers. In devising genuine democratic openness, politicians and their parties must do more than post expenses online: they must become partners and advocates for renewal. A lesson of open source technology, however, is that systemic change demands an informed and engaged civil society, disgruntled with the status quo but also determined to act anew.
Most often, such actions are highly localized, even in a virtual world, giving rise to the purpose and meaning of smarter and more intelligent communities. And in Canada it bears noting that we see communities both large and small embracing open data and other forms of online experimentation such as participatory budgeting. It is often within small but connected communities where a virtuous cycle of online and in-person identities and actions can deepen and impact decision-making most directly.
How, then, do we reconcile traditional notions of top-down political federalism and national leadership with this bottom-up approach to community engagement and democratic renewal? Shifting from open data to open democracy is likely to be an uneven, diverse, and at times messy affair. Better this way than attempting to ordain top-down change in a centralized and standardized manner.”

The New Eye of Government: Citizen Sentiment Analysis in Social Media


New paper by R. Arunachalam and S. Sarkar: “Governments across the world are facing challenges unlike any before. The recent Arab Spring phenomenon is an example of how Governments can be affected if they ignore citizen sentiment. There is a growing trend of Governments trying to move closer to a citizen-centric model, in which priorities and services are driven by citizen needs rather than by Government capability. Such trends are forcing Governments to rethink and reshape their policies on citizen interaction. New disruptive technologies such as cloud and mobile are opening new opportunities for Governments to enable innovation in these interactions.
The advent of social media is a recent addition to these disruptive socio-technical enablers. Governments are fast realizing that it can be a great vehicle for getting closer to citizens, providing deep insight into what citizens want. Thus, in the current gloomy climate of the world economy, Governments can reorganize and reprioritize the allocation of limited funds, thereby creating maximum impact on citizens’ lives. Building such insight is a non-trivial task because of the huge volume of information that social media can generate. However, Sentiment Analysis, or Opinion Mining, can be a useful vehicle on this journey.
In this work, we present a model and case study for analyzing citizen sentiment from social media to help Governments take decisions.”
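The basic technique is easy to caricature. The lexicon-based scorer below is our own minimal sketch, not the authors’ model; the word lists, scoring rule and example posts are all invented for illustration.

```python
# Minimal lexicon-based sentiment sketch (not the paper's model): count
# positive and negative words in each citizen post and tally the results.
from collections import Counter

POSITIVE = {"good", "great", "helpful", "fast", "improved", "thanks"}
NEGATIVE = {"bad", "slow", "corrupt", "broken", "unfair", "delayed"}

def sentiment(post):
    words = post.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

posts = [
    "new permit portal is fast and helpful thanks",
    "another delayed response and the process feels broken",
]
print(Counter(sentiment(p) for p in posts))  # Counter({'positive': 1, 'negative': 1})
```

Production systems layer on language detection, topic filtering and sarcasm handling, but the underlying signal they extract from each post is of this kind.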

Bright Spots of open government to be recognised at global summit


Press Release of the UK Cabinet Office: “The 7 shortlisted initiatives vying for the Bright Spots award show how governments in Open Government Partnership countries are working with citizens to sharpen governance, harness new technologies to increase public participation and improve government responsiveness.
At the Open Government Partnership summit in London on 31 October 2013 and 1 November 2013, participants will be able to vote for one of the shortlisted projects. The winning project – the Bright Spot – will be announced in the summit’s final plenary session….
The shortlisted entries for the Bright Spots prize – which will be awarded at the London summit – are:

  • Chile – ChileAtiende

The aim of ChileAtiende has been to simplify government to citizens by providing a one-stop shop for accessing public services. Today, ChileAtiende has more than 190 offices across the whole country, a national call centre and a digital platform, through which citizens can access multiple services and benefits without having to navigate multiple government offices.

  • Estonia – People’s Assembly

The People’s Assembly is a deliberative democracy tool, designed to encourage input from citizens on the government’s legislative agenda. This web-based platform allows ordinary citizens to propose policy solutions to problems including fighting corruption. Within 3 weeks, 1,800 registered users posted nearly 6,000 ideas and comments. Parliament has since set a timetable for the most popular proposals to be introduced in the formal proceedings.

  • Georgia – improvements to the Freedom of Information Act

Civil society organisations in Georgia have successfully used the government’s participation in OGP to advocate improvements to the country’s Freedom of Information legislation. Government agencies are now obliged to proactively publish information in a way that is accessible to anyone, and to establish an electronic request system for information.

  • Indonesia – complaints portal

LAPOR! (meaning “to report” in Indonesian) is a social media channel where Indonesian citizens can submit complaints and enquiries about development programmes and public services. Comments are transferred directly to relevant ministries or government agencies, which can respond via the website. LAPOR! now has more than 225,350 registered users and receives an average of 1,435 inputs per day.

  • Montenegro – Be Responsible app

“Be Responsible” is a mobile app that allows citizens to report local problems – from illegal waste dumps, misuse of official vehicles and irregular parking, to failure to comply with tax regulations and issues over access to healthcare and education.

  • Philippines – citizen audits

The Citizen Participatory Audit (CPA) project is exploring ways in which citizens can be directly engaged in the audit process for government projects and contribute to ensuring greater efficiency and effectiveness in the use of public resources. 4 pilot audits are in progress, covering public works, welfare, environment and education projects.

  • Romania – transparency in public sector recruitment

The PublicJob.ro website was set up to counter corruption and lack of transparency in civil service recruitment. PublicJob.ro takes recruitment data from public organisations and e-mails it to more than 20,000 subscribers in a weekly newsletter. As a result, it has become more difficult to manipulate the recruitment process.”

Building a Smarter City


PSFK: “As cities around the world grow in size, one of the major challenges will be how to make city services and infrastructure more adaptive and responsive in order to keep existing systems running efficiently, while expanding to accommodate greater need. In our Future Of Cities report, PSFK Labs investigated the key trends and pressing issues that will play a role in shaping the evolution of urban environments over the next decade.

A major theme identified in the report is Sensible Cities, which is about bringing intelligence to the city and its citizens through the free flow of information and data, helping to improve both immediate and long-term decision making. This theme consists of six key trends: Citizen Sensor Networks, Hyperlocal Reporting, Just-In-Time Alerts, Proximity Services, Data Transparency, and Intelligent Transport.

The Citizen Sensor Networks trend described in the Future Of Cities report highlights how sensor-laden personal electronics are enabling everyday people to passively collect environmental data and other information about their communities. When fed back into centralized, public databases for analysis, this accessible pool of knowledge enables any interested party to make more informed choices about their surroundings. These feedback systems require little infrastructure, and transform people into sensor nodes with little effort on their part. An example of this type of network in action is Street Bump, a crowdsourcing project that helps residents improve their neighborhood streets by collecting real-time data on road conditions while they drive. Using the mobile application’s motion-detecting accelerometer, Street Bump is able to sense when a bump is hit, while the phone’s GPS records and transmits the location.
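At its core the detection step is a threshold on the accelerometer signal paired with a GPS fix. The sketch below is a simplification of that idea, not Street Bump’s actual code; the threshold value and sample readings are assumptions.

```python
# Simplified, hypothetical Street Bump-style detector: flag a pothole
# candidate when vertical acceleration spikes, and keep the GPS fix.
BUMP_THRESHOLD_G = 1.5  # assumed spike threshold, in g

def detect_bumps(samples):
    """samples: iterable of (vertical_accel_g, lat, lon) phone readings."""
    for accel_g, lat, lon in samples:
        if abs(accel_g) > BUMP_THRESHOLD_G:
            yield {"lat": lat, "lon": lon, "accel_g": accel_g}

readings = [(0.2, 42.3601, -71.0589), (1.9, 42.3605, -71.0592), (0.3, 42.3610, -71.0600)]
for report in detect_bumps(readings):
    print(report)  # in the real app, reports go to a central city database
```

Everything beyond this threshold check, such as deduplicating reports or filtering out deliberate speed bumps, would happen server-side once the reports reach the shared database.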


The next trend of Hyperlocal Reporting describes how crowdsourced information platforms are changing the top-down nature of how news is gathered and disseminated by placing reporting tools in the hands of citizens, allowing any individual to instantly broadcast what is important to them. Often using mobile phone technology, these information monitoring systems not only provide real-time, location-specific data, but also boost civic engagement by establishing direct channels of communication between an individual and their community. A good example of this is Retio, a mobile application that allows Mexican citizens to report on organized crime and corruption using social media. Each issue is plotted on a map, allowing users and authorities to get an overall idea of what has been reported or narrow results down to specific incidents.


Data Transparency is a trend that examines how city administrators, institutions, and companies are publicly sharing data generated within their systems to add new levels of openness and accountability. Availability of this information not only strengthens civic engagement, but also establishes a collaborative agenda at all levels of government that empowers citizens through greater access and agency. For example, OpenSpending is a mobile and web-based application that allows citizens in participating cities to examine where their taxes are being spent through interactive visualizations. Citizens can review their personal share of public works, examine local impacts of public spending, rate and vote on proposed plans for spending and monitor the progress of projects that are or are not underway…”
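The arithmetic behind the “personal share” view is simple; the sketch below uses invented budget figures rather than real OpenSpending data, purely to show the kind of computation such a tool performs.

```python
# Toy sketch of a "your share of public spending" view, with invented
# budget lines and an assumed city population (not OpenSpending data).
budget = {
    "road maintenance": 12_000_000,
    "public libraries": 3_500_000,
    "parks and recreation": 2_100_000,
}
population = 650_000  # assumed number of residents

for programme, total in sorted(budget.items(), key=lambda kv: -kv[1]):
    print(f"{programme:22s} total {total:>12,}   per resident {total / population:7.2f}")
```

The interactive visualizations citizens see are essentially a front end over aggregates of this kind, refreshed as new spending data are published.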

Connecting Grassroots and Government for Disaster Response


New Report by John Crowley for the Wilson Center: “Leaders in disaster response are finding it necessary to adapt to a new reality. Although community actions have always been the core of the recovery process, collective action from the grassroots has changed response operations in ways that few would have predicted. Using new tools that interconnect over expanding mobile networks, citizens can exchange information via maps and social media, then mobilize thousands of people to collect, analyze, and act on that information. Sometimes, community-sourced intelligence may be fresher and more accurate than the information given to the responders who provide aid…
Also see the companion report from our September 2012 workshop, written by Ryan Burns and Lea Shanley, as well as a series of videos from the workshop and podcasts with workshop participants.”

Special issue of FirstMonday: "Making data — Big data and beyond"


Introduction by Rasmus Helles and Klaus Bruhn Jensen: “Data are widely understood as minimal units of information about the world, waiting to be found and collected by scholars and other analysts. With the recent prominence of ‘big data’ (Mayer–Schönberger and Cukier, 2013), the assumption that data are simply available and plentiful has become more pronounced in research as well as public debate. Challenging and reflecting on this assumption, the present special issue considers how data are made. The contributors take big data and other characteristic features of the digital media environment as an opportunity to revisit classic issues concerning data — big and small, fast and slow, experimental and naturalistic, quantitative and qualitative, found and made.
Data are made in a process involving multiple social agents — communicators, service providers, communication researchers, commercial stakeholders, government authorities, international regulators, and more. Data are made for a variety of scholarly and applied purposes, oriented by knowledge interests (Habermas, 1971). And data are processed and employed in a whole range of everyday and institutional contexts with political, economic, and cultural implications. Unfortunately, the process of generating the materials that come to function as data often remains opaque and certainly under–documented in the published research.
The following eight articles seek to open up some of the black boxes from which data can be seen to emerge. While diverse in their theoretical and topical focus, the articles generally approach the making of data as a process that is extended in time and across spatial and institutional settings. In the common culinary metaphor, data are repeatedly processed, rather than raw. Another shared point of attention is meta–data — the type of data that bear witness to when, where, and how other data such as Web searches, e–mail messages, and phone conversations are exchanged, and which have taken on new, strategic importance in digital media. Last but not least, several of the articles underline the extent to which the making of data as well as meta–data is conditioned — facilitated and constrained — by technological and institutional structures that are inherent in the very domain of analysis. Researchers increasingly depend on the practices and procedures of commercial entities such as Google and Facebook for their research materials, as illustrated by the pivotal role of application programming interfaces (API). Research on the Internet and other digital media also requires specialized tools of data management and analysis, calling, once again, for interdisciplinary competences and dialogues about ‘what the data show.’”
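To make the meta-data point concrete: even a single e-mail carries machine-readable data about when and between whom a message was exchanged, quite apart from its content. The small sketch below uses only Python’s standard library, and the message itself is invented.

```python
# Extracting meta-data ("data about data") from an invented e-mail message:
# the headers, not the body, record when and between whom the exchange
# took place. Uses only Python's standard email library.
from email import message_from_string

raw = """\
From: alice@example.org
To: bob@example.org
Date: Mon, 21 Oct 2013 09:14:02 +0200
Subject: Workshop notes

See attached.
"""

msg = message_from_string(raw)
print({field: msg[field] for field in ("From", "To", "Date", "Subject")})
```

It is meta-data of this kind, gathered at scale through platform APIs, that the contributors treat as both a research resource and a black box to be opened.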
See Table of Contents