Big Data


Special Report on Big Data by Volta – A newsletter on Science, Technology and Society in Europe: “Locating crime spots, or the next outbreak of a contagious disease, Big Data promises benefits for society as well as business. But more means messier. Do policy-makers know how to use this scale of data-driven decision-making in an effective way for their citizens and ensure their privacy? 90% of the world’s data have been created in the last two years. Every minute, more than 100 million new emails are created, 72 hours of new video are uploaded to YouTube and Google processes more than 2 million searches. Nowadays, almost everyone walks around with a small computer in their pocket, uses the internet on a daily basis and shares photos and information with their friends, family and networks. The digital exhaust we leave behind every day contributes to an enormous amount of data produced, and at the same time leaves electronic traces that contain a great deal of personal information….
Until recently, traditional technology and analysis techniques have not been able to handle this quantity and type of data. But recent technological developments have enabled us to collect, store and process data in new ways. There seem to be no limits, either to the volume of data or to the technology for storing and analyzing them. Big Data can map a driver’s sitting position to identify a car thief, it can use Google searches to predict outbreaks of the H1N1 flu virus, it can data-mine Twitter to predict the price of rice or use mobile phone top-ups to describe unemployment in Asia.
The word ‘data’ means ‘given’ in Latin. It commonly refers to a description of something that can be recorded and analyzed. While there is no clear definition of the concept of ‘Big Data’, it usually refers to the processing of huge amounts and new types of data, which has not been possible with traditional tools.

The notion of Big Data is somewhat misleading, argues Robindra Prabhu, a project manager at the Norwegian Board of Technology. “The new development is not necessarily that there are so much more data. It’s rather that data is available to us in a new way. The digitalization of society gives us access to both ‘traditional’, structured data – like the content of a database or register – and unstructured data, for example the content in a text, pictures and videos. Information designed to be read by humans is now also readable by machines. And this development makes a whole new world of data gathering and analysis available. Big Data is exciting not just because of the amount and variety of data out there, but because we can process data about so much more than before.”

Open data: Unlocking innovation and performance with liquid information


New report by McKinsey Global Institute: “Open data (machine-readable information, particularly government data, that’s made available to others) has generated a great deal of excitement around the world for its potential to empower citizens, change how government works, and improve the delivery of public services. It may also generate significant economic value, according to a new McKinsey report. Our research suggests that seven sectors alone could generate more than $3 trillion a year in additional value as a result of open data, which is already giving rise to hundreds of entrepreneurial businesses and helping established companies to segment markets, define new products and services, and improve the efficiency and effectiveness of operations.

Although the open-data phenomenon is in its early days, we see a clear potential to unlock significant economic value by applying advanced analytics to both open and proprietary knowledge. Open data can become an instrument for breaking down information gaps across industries, allowing companies to share benchmarks and spread best practices that raise productivity. Blended with proprietary data sets, it can propel innovation and help organizations replace traditional and intuitive decision-making approaches with data-driven ones. Open-data analytics can also help uncover consumer preferences, allowing companies to improve new products and to uncover anomalies and needless variations. That can lead to leaner, more reliable processes.
However, investments in technology and expertise are required to use the data effectively. And there is much work to be done by governments, companies, and consumers to craft policies that protect privacy and intellectual property, as well as establish standards to speed the flow of data that is not only open but also “liquid.” After all, consumers have serious privacy concerns, and companies are reluctant to share proprietary information—even when anonymity is assured—for fear of losing competitive advantage…
See also Executive Summary and Full Report”

Smart Citizens


FutureEverything: “This publication aims to shift the debate on the future of cities towards the central place of citizens, and of decentralised, open urban infrastructures. It provides a global perspective on how cities can create the policies, structures and tools to engender a more innovative and participatory society. The publication contains a series of 23 short essays representing some of the key voices developing an emerging discourse around Smart Citizens. Contributors include:

  • Dan Hill, Smart Citizens pioneer and CEO of communications research centre and transdisciplinary studio Fabrica on why Smart Citizens Make Smart Cities.
  • Anthony Townsend, urban planner, forecaster and author of Smart Cities: Big Data, Civic Hackers, and the Quest for a New Utopia on the tensions between place-making and city-making and on the role of mobile technologies in changing the way that people interact with their surroundings.
  • Paul Maltby, Director of the Government Innovation Group and of the Open Data and Transparency in the UK Cabinet Office on how government can support a smarter society.
  • Aditya Dev Sood, Founder and CEO of the Center for Knowledge Societies, presents polarised hypothetical futures for India in 2025 that argue for the use of technology to bridge gaps in social inequality.
  • Adam Greenfield, New York City-based writer and urbanist, on Recuperating the Smart City.

Editors: Drew Hemment, Anthony Townsend
Download Here.

Open Data Index provides first major assessment of state of open government data


Press Release from the Open Knowledge Foundation: “In the week of a major international summit on government transparency in London, the Open Knowledge Foundation has published its 2013 Open Data Index, showing that governments are still not providing enough information in an accessible form to their citizens and businesses.
The UK and US top the 2013 Index, which is a result of community-based surveys in 70 countries. They are followed by Denmark, Norway and the Netherlands. Of the countries assessed, Cyprus, St Kitts & Nevis, the British Virgin Islands, Kenya and Burkina Faso ranked lowest. There are many countries where the governments are less open but that were not assessed because of a lack of openness or of a sufficiently engaged civil society. This includes 30 countries that are members of the Open Government Partnership.
The Index ranks countries based on the availability and accessibility of information in ten key areas, including government spending, election results, transport timetables, and pollution levels, and reveals that whilst some good progress is being made, much remains to be done.
Rufus Pollock, Founder and CEO of the Open Knowledge Foundation said:

Opening up government data drives democracy, accountability and innovation. It enables citizens to know and exercise their rights, and it brings benefits across society: from transport, to education and health. There has been a welcome increase in support for open data from governments in the last few years, but this Index reveals that too much valuable information is still unavailable.

The UK and US are leaders on open government data but even they have room for improvement: the US for example does not provide a single consolidated and open register of corporations, while the UK Electoral Commission lets down the UK’s good overall performance by not allowing open reuse of UK election data.
There is a very disappointing degree of openness of company registers across the board: only 5 out of the 20 leading countries have even basic information available via a truly open licence, and only 10 allow any form of bulk download. This information is critical for a range of reasons – including tackling tax evasion and other forms of financial crime and corruption.
Less than half of the key datasets in the top 20 countries are available to re-use as open data, showing that even the leading countries do not fully understand the importance of citizens and businesses being able to legally and technically use, reuse and redistribute data. This enables them to build and share commercial and non-commercial services.
To see the full results: https://index.okfn.org. For graphs of the data: https://index.okfn.org/visualisations.”

Making government simpler is complicated


Mike Konczal in The Washington Post: “Here’s something a politician would never say: “I’m in favor of complex regulations.” But what would the opposite mean? What would it mean to have “simple” regulations?

There are two definitions of “simple” that have come to dominate liberal conversations about government. One is the idea that we should make use of “nudges” in regulation. The other is the idea that we should avoid “kludges.” As it turns out, however, these two definitions conflict with each other, and the battle between them will dominate conversations about the state in the years ahead.

The case for “nudges”

The first definition of a “simple” regulation is one emphasized in Cass Sunstein’s recent book titled Simpler: The Future of Government. A simple policy is one that simply “nudges” people into one choice or another using a variety of default rules, disclosure requirements, and other market structures. Think, for instance, of rules that require fast-food restaurants to post calories on their menus, or a mortgage that has certain terms clearly marked in disclosures.

These sorts of regulations are deemed “choice preserving.” Consumers are still allowed to buy unhealthy fast-food meals or sign up for mortgages they can’t reasonably afford. The regulations are just there to inform people about their choices. These rules are designed to keep the market “free,” where all possibilities are ultimately possible, although there are rules to encourage certain outcomes.
In his book, however, Sunstein adds that there’s another very different way to understand the term “simple.” What most people mean when they think of simple regulations is a rule that is “simple to follow.” Usually a rule is simple to follow because it outright excludes certain possibilities and thus ensures others. Which means, by definition, it limits certain choices.

The case against “kludges”
This second definition of simple plays a key role in political scientist Steve Teles’ excellent recent essay, “Kludgeocracy in America.” For Teles, a “kludge” is a “clumsy but temporarily effective” fix for a policy problem. (The term comes from computer science.) These kludges tend to pile up over time, making government cumbersome and inefficient overall.
Teles focuses on several ways that kludges are introduced into policy, with a particularly sharp focus on overlapping jurisdictions and the related mess of federal and state overlap in programs. But, without specifically invoking it, he also suggests that a reliance on “nudge” regulations can lead to more kludges.
After all, a non-kludge policy proposal is one that will be simple to follow and will clearly cause a certain outcome, with an obvious causality chain. This is in contrast to a web of “nudges” and incentives designed to try and guide certain outcomes.

Why “nudges” aren’t always simpler
The distinction between the two is clear if we take a specific example core to both definitions: retirement security.
For Teles, “one of the often overlooked benefits of the Social Security program… is that recipients automatically have taxes taken out of their paychecks, and, then without much effort on their part, checks begin to appear upon retirement. It’s simple and direct. By contrast, 401(k) retirement accounts… require enormous investments of time, effort, and stress to manage responsibly.”

Yet 401(k)s are the ultimate fantasy laboratory for nudge enthusiasts. A whole cottage industry has grown up around figuring out ways to default people into certain contributions, on designing the architecture of choices of investments, and trying to effortlessly and painlessly guide people into certain savings.
Each approach emphasizes different things. If you want to focus your energy on making people better consumers and market participants, expanding our government’s resources and energy into 401(k)s is a good choice. If you want to focus on providing retirement security directly, expanding Social Security is a better choice.
The first is “simple” in that it doesn’t exclude any possibility but encourages market choices. The second is “simple” in that it is easy to follow, and the result is simple as well: a certain amount of security in old age is provided directly. This second approach understands the government as playing a role in stopping certain outcomes, and providing for the opposite of those outcomes, directly….

Why it’s hard to create “simple” regulations
Like all supposed binaries, this is really a continuum. Taxes, for instance, sit somewhere in the middle of the two definitions of “simple.” They tend to preserve the market as it is but raise (or lower) the price of certain goods, influencing choices.
And reforms and regulations are often most effective when there’s a combination of these two types of “simple” rules.
Consider an important new paper, “Regulating Consumer Financial Products: Evidence from Credit Cards,” by Sumit Agarwal, Souphala Chomsisengphet, Neale Mahoney and Johannes Stroebel. The authors analyze the CARD Act of 2009, which regulated credit cards. They found that the nudge-type disclosure rules “increased the number of account holders making the 36-month payment value by 0.5 percentage points.” However, more direct regulations on fees had an even bigger effect, saving U.S. consumers $20.8 billion per year with no notable reduction in credit access…
The balance between these two approaches of making regulations simple will be front and center as liberals debate the future of government, whether they’re trying to pull back on the “submerged state” or consider the implications for privacy. The debate over the best way for government to be simple is still far from over.”

Google’s flu fail shows the problem with big data


Adam Kucharski in The Conversation: “When people talk about ‘big data’, there is an oft-quoted example: a proposed public health tool called Google Flu Trends. It has become something of a pin-up for the big data movement, but it might not be as effective as many claim.
The idea behind big data is that large amounts of information can help us do things which smaller volumes cannot. Google first outlined the Flu Trends approach in a 2008 paper in the journal Nature. Rather than relying on disease surveillance used by the US Centers for Disease Control and Prevention (CDC) – such as visits to doctors and lab tests – the authors suggested it would be possible to predict epidemics through Google searches. When suffering from flu, many Americans will search for information related to their condition….
Between 2003 and 2008, flu epidemics in the US had been strongly seasonal, appearing each winter. However, in 2009, the first cases (as reported by the CDC) appeared around Easter. Flu Trends had already made its predictions when the CDC data was published, but it turned out that the Google model didn’t match reality. It had substantially underestimated the size of the initial outbreak.
The problem was that Flu Trends could only measure what people search for; it didn’t analyse why they were searching for those words. By removing human input, and letting the raw data do the work, the model had to make its predictions using only search queries from the previous handful of years. Although those 45 terms matched the regular seasonal outbreaks from 2003–8, they didn’t reflect the pandemic that appeared in 2009.
Six months after the pandemic started, Google – who now had the benefit of hindsight – updated their model so that it matched the 2009 CDC data. Despite these changes, the updated version of Flu Trends ran into difficulties again last winter, when it overestimated the size of the influenza epidemic in New York State. The incidents in 2009 and 2012 raised the question of how good Flu Trends is at predicting future epidemics, as opposed to merely finding patterns in past data.
In a new analysis, published in the journal PLOS Computational Biology, US researchers report that there are “substantial errors in Google Flu Trends estimates of influenza timing and intensity”. This is based on a comparison of Google Flu Trends predictions and the actual epidemic data at the national, regional and local level between 2003 and 2013.
Even when search behaviour was correlated with influenza cases, the model sometimes misestimated important public health metrics such as peak outbreak size and cumulative cases. The predictions were particularly wide of the mark in 2009 and 2012:

Original and updated Google Flu Trends (GFT) model compared with CDC influenza-like illness (ILI) data. PLOS Computational Biology 9:10

Although they criticised certain aspects of the Flu Trends model, the researchers think that monitoring internet search queries might yet prove valuable, especially if it were linked with other surveillance and prediction methods.
Other researchers have also suggested that other sources of digital data – from Twitter feeds to mobile phone GPS – have the potential to be useful tools for studying epidemics. As well as helping to analyse outbreaks, such methods could allow researchers to analyse human movement and the spread of public health information (or misinformation).
Although much attention has been given to web-based tools, there is another type of big data that is already having a huge impact on disease research. Genome sequencing is enabling researchers to piece together how diseases transmit and where they might come from. Sequence data can even reveal the existence of a new disease variant: earlier this week, researchers announced a new type of dengue fever virus….”
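
The distinction the excerpt draws between finding patterns in past data and predicting future epidemics can be illustrated with a small sketch. It is not the Google Flu Trends model: the data are invented, the "model" is just a per-week average of past seasons, and every name in it (`seasonal_year`, `fit_weekly_means`, `mean_abs_error`) is hypothetical. It only shows why a model fitted to regular winter seasons can look accurate on the years it was built from and still miss an outbreak that arrives at an unusual time.

```python
import math

WEEKS = 52

def seasonal_year(peak_week, height):
    """Synthetic weekly illness counts with one bell-shaped peak (invented data)."""
    return [height * math.exp(-((w - peak_week) ** 2) / 18.0) for w in range(WEEKS)]

def fit_weekly_means(years):
    """'Train' by averaging each calendar week across the historical years."""
    return [sum(year[w] for year in years) / len(years) for w in range(WEEKS)]

def mean_abs_error(predicted, actual):
    """Average absolute gap between predicted and observed weekly counts."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

# Five invented "regular" seasons, each peaking early in the year (weeks 2 to 4).
history = [seasonal_year(peak_week=2 + (i % 3), height=100 + 5 * i) for i in range(5)]
model = fit_weekly_means(history)

# An invented atypical year whose outbreak peaks in spring (week 16).
atypical = seasonal_year(peak_week=16, height=100)

# Error on the years the model was built from versus error on the unseen year.
in_sample = sum(mean_abs_error(model, year) for year in history) / len(history)
out_of_sample = mean_abs_error(model, atypical)

print(f"in-sample error:     {in_sample:.1f}")
print(f"out-of-sample error: {out_of_sample:.1f}")
```

The in-sample figure only measures how well the average reproduces the seasons it was computed from; the out-of-sample figure is the one that speaks to prediction.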

Global Collective Intelligence in Technological Societies


Paper by Juan Carlos Piedra Calderón and Javier Rainer in the International Journal of Artificial Intelligence and Interactive Multimedia: “The big influence of Information and Communication Technologies (ICT), especially in the construction of Technological Societies, has generated big social changes. These are visible in the way people relate to one another in different environments. These changes make it possible to expand the frontiers of knowledge through sharing and cooperation, which has meant the creation of a new form of Collaborative Knowledge. The potential of this Collaborative Knowledge is realized through ICT in combination with Artificial Intelligence processes, from which a Collective Knowledge is obtained. When this kind of knowledge is shared, it gives rise to Global Collective Intelligence”.

Information Now: Open Access and the Public Good


Podcast from SMARTech (Georgia Tech): “Every year, the international academic and research community dedicates a week in October to discuss, debate, and learn more about Open Access. Open Access in the academic sense refers to the free, immediate, and online access to the results of scholarly research, primarily academic, peer-reviewed journal articles. In the United States, the movement in support of Open Access has, in the last decade, been growing dramatically. Because of this growing interest in Open Access, a group of academic librarians from the Georgia Tech library, Wendy Hagenmaier (Digital Collections Archivist), Fred Rascoe (Scholarly Communication Librarian), and Lizzy Rolando (Research Data Librarian), got together to talk to folks in the thick of it, to try and unravel some of the different concerns and benefits of Open Access. But we didn’t just want to talk about Open Access for journal articles – we wanted to examine more broadly what it means to be “open”, what is open information, and what relationship open information has to the public good. In this podcast, we talk with different people who have seen and experienced open information and open access in practice. In the first act, Dan Cohen from the DPLA speaks about efforts to expand public access to archival and library collections. In the second, we’ll hear an argument from Christine George about why things sometimes need to be closed, if we want them to be open in the future. Third, Kari Watkins speaks about a specific example of a government agency that decided, despite legitimate concerns, to make transit data open, and why it worked for them. Fourth, Peter Suber from Harvard University will give us the background on the Open Access movement, some myths that have been dispelled, and why it is important for academic researchers to take the leap to make their research openly accessible. And finally, we’ll hear from Michael Chang, a researcher who did take that leap and helped start an Open Access journal, and why he sees openness in research as his obligation.”

See also Personal Guide to Open Access

Are We Puppets in a Wired World?


Sue Halpern in The New York Review of Books: “Also not obvious was how the Web would evolve, though its open architecture virtually assured that it would. The original Web, the Web of static homepages, documents laden with “hot links,” and electronic storefronts, segued into Web 2.0, which, by providing the means for people without technical knowledge to easily share information, recast the Internet as a global social forum with sites like Facebook, Twitter, FourSquare, and Instagram.
Once that happened, people began to make aspects of their private lives public, letting others know, for example, when they were shopping at H+M and dining at Olive Garden, letting others know what they thought of the selection at that particular branch of H+M and the waitstaff at that Olive Garden, then modeling their new jeans for all to see and sharing pictures of their antipasti and lobster ravioli—to say nothing of sharing pictures of their girlfriends, babies, and drunken classmates, or chronicling life as a high-paid escort, or worrying about skin lesions or seeking a cure for insomnia or rating professors, and on and on.
The social Web celebrated, rewarded, routinized, and normalized this kind of living out loud, all the while anesthetizing many of its participants. Although they likely knew that these disclosures were funding the new information economy, they didn’t especially care…
The assumption that decisions made by machines that have assessed reams of real-world information are more accurate than those made by people, with their foibles and prejudices, may be correct generally and wrong in the particular; and for those unfortunate souls who might never commit another crime even if the algorithm says they will, there is little recourse. In any case, computers are not “neutral”; algorithms reflect the biases of their creators, which is to say that prediction cedes an awful lot of power to the algorithm creators, who are human after all. Some of the time, too, proprietary algorithms, like the ones used by Google and Twitter and Facebook, are intentionally biased to produce results that benefit the company, not the user, and some of the time algorithms can be gamed. (There is an entire industry devoted to “optimizing” Google searches, for example.)
But the real bias inherent in algorithms is that they are, by nature, reductive. They are intended to sift through complicated, seemingly discrete information and make some sort of sense of it, which is the definition of reductive.”

The "crowd computing" revolution


Michael Copeland in The Atlantic: “Software might be eating the world, but Rob Miller, a professor of computer science at MIT, foresees a “crowd computing” revolution that makes workers and machines colleagues rather than competitors….
Miller studies human-computer interaction, specifically a field called crowd computing. A play on the more common term “cloud computing,” crowd computing is software that employs a group of people to do small tasks and solve a problem better than an algorithm or a single expert. Examples of crowd computing include Wikipedia, Amazon’s Mechanical Turk (where workers outsource projects that computers can’t do to an online community) and Facebook’s photo tagging feature.
But just as humans are better than computers at some things, Miller concedes that algorithms have surpassed human capability in several fields. Take a look at libraries, which now have advanced digital databases, eliminating the need for most human reference librarians. There’s also flight search, where algorithms are much better than people at finding the cheapest fare.
That said, more complicated tasks even in those fields can get tricky for a computer.
“For complex flight search, people are still better,” Miller says. A site called Flightfox lets travelers input a complex trip while a group of experts help find the cheapest or most convenient combination of flights. “There are travel agents and frequent flyers in that crowd, people with expertise at working angles of the airfare system that are not covered by the flight searches and may never be covered because they involve so many complex intersecting rules that are very hard to code.”
Social and cultural understanding is another area in which humans will always exceed computers, Miller says. People are constantly inventing new slang, watching the latest viral videos and movies, or partaking in some other cultural phenomena together. That’s something that an algorithm won’t ever be able to catch up to. “There’s always going to be a frontier of human understanding that leads the machines,” he says.
A post-employee economy where every task is automated by a computer is something Miller does not see happening, nor does he want it to happen. Instead, he considers the relationship between human and machine symbiotic. Both machines and humans benefit in crowd computing: “the machine wants to acquire data so it can train and get better. The crowd is improved in many ways, like through pay or education,” Miller says. And finally, the end users “get the benefit of a more accurate and fast answer.”
Miller’s User Interface Design Group at MIT has made several programs illustrating how this symbiosis between user, crowd and machine works. Most recently, the MIT group created Cobi, a tool that taps into an academic community to plan a large-scale conference. The software allows members to identify papers they want presented and what authors are experts in specific fields. A scheduling tool combines the community’s input with an algorithm that finds the best times to meet.
Programs more practical for everyday users include Adrenaline, a camera driven by a crowd, and Soylent, a word processing tool that allows people to do interactive document shortening and proofreading. The Adrenaline camera took a video and then had a crowd on call to very quickly identify the best still in that video, whether it was the best group portrait, mid-air jump, or angle of somebody’s face. Soylent also used users on Mechanical Turk to proofread and shorten text in Microsoft Word. In the process, Miller and his students found that the crowd found errors that neither a single expert proofreader nor the program—with spell and grammar check turned on—could find.
“It shows this is the essential thing that human beings bring that algorithms do not,” Miller said.
That said, you can’t just use any crowd for any task. “It does depend on having appropriate expertise in the crowd. If [the text] had been about computational biology, they might not have caught [the error]. The crowd does have to have skills.” Going forward, Miller thinks that software will increasingly use the power of the crowd. “In the next 10 or 20 years it will be more likely we already have a crowd,” he says. “There will already be these communities and they will have needs, some of which will be satisfied by software and some which will require human help and human attention. I think a lot of these algorithms and system techniques that are being developed by all these startups, who are experimenting with it in their own spaces, are going to be things that we’ll just naturally pick up and use as tools.”
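
The idea described above, of a group of people each doing a small task so that their combined answer beats a single contributor, can be sketched with a simple majority vote. This is a minimal illustration under stated assumptions, not how Mechanical Turk, Soylent or any of Miller's tools actually combine answers; the function name `aggregate_by_majority` and the sample labels are hypothetical.

```python
from collections import Counter

def aggregate_by_majority(answers):
    """Return the most common answer and the share of workers who gave it.

    If several answers are tied, the one that was submitted first wins,
    because Counter.most_common keeps first-seen order among equal counts.
    """
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(answers)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(answers)

# Hypothetical micro-task: five workers label the same photo.
worker_labels = ["cat", "cat", "dog", "cat", "cat"]
label, agreement = aggregate_by_majority(worker_labels)

print(f"crowd answer: {label} (agreement {agreement:.0%})")
```

The agreement share gives a rough signal of when the crowd is unsure, which is one place where, as Miller notes, having appropriate expertise in the crowd matters.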