Tim Harford in the Financial Times: “Cheerleaders for big data have made four exciting claims, each one reflected in the success of Google Flu Trends: that data analysis produces uncannily accurate results; that every single data point can be captured, making old statistical sampling techniques obsolete; that it is passé to fret about what causes what, because statistical correlation tells us what we need to know; and that scientific or statistical models aren’t needed because, to quote “The End of Theory”, a provocative essay published in Wired in 2008, “with enough data, the numbers speak for themselves”. Unfortunately, these four articles of faith are at best optimistic oversimplifications. At worst, according to David Spiegelhalter, Winton Professor of the Public Understanding of Risk at Cambridge university, they can be “complete bollocks. Absolute nonsense.”…
But big data do not solve the problem that has obsessed statisticians and scientists for centuries: the problem of insight, of inferring what is going on, and figuring out how we might intervene to change a system for the better.
“We have a new resource here,” says Professor David Hand of Imperial College London. “But nobody wants ‘data’. What they want are the answers.”
To use big data to produce such answers will require large strides in statistical methods.
“It’s the wild west right now,” says Patrick Wolfe of UCL. “People who are clever and driven will twist and turn and use every tool to get sense out of these data sets, and that’s cool. But we’re flying a little bit blind at the moment.”
Statisticians are scrambling to develop new methods to seize the opportunity of big data. Such new methods are essential but they will work by building on the old statistical lessons, not by ignoring them.
Recall big data’s four articles of faith. Uncanny accuracy is easy to overrate if we simply ignore false positives, as with Target’s pregnancy predictor. The claim that causation has been “knocked off its pedestal” is fine if we are making predictions in a stable environment but not if the world is changing (as with Flu Trends) or if we ourselves hope to change it. The promise that “N = All”, and therefore that sampling bias does not matter, is simply not true in most cases that count. As for the idea that “with enough data, the numbers speak for themselves” – that seems hopelessly naive in data sets where spurious patterns vastly outnumber genuine discoveries.
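The last point, that spurious patterns swamp genuine discoveries in large data sets, is easy to demonstrate. The following sketch (not from the FT piece; series count, length, and significance threshold are illustrative choices) correlates fifty streams of pure noise and counts how many pairs clear the conventional 5% significance bar by chance alone:

```python
import numpy as np

# Pure noise: 50 unrelated "metrics", each observed over 100 periods.
rng = np.random.default_rng(0)
data = rng.standard_normal((50, 100))

# Pairwise correlations among all 1,225 pairs of series.
corr = np.corrcoef(data)
upper = corr[np.triu_indices(50, k=1)]

# For n = 100, |r| > ~0.197 is "significant" at the two-sided 5% level.
spurious = int(np.sum(np.abs(upper) > 0.197))
print(f"{spurious} of {len(upper)} pairs look correlated by chance alone")
```

With no real signal anywhere in the data, roughly 5% of the 1,225 pairs (about 60) will still appear "significant" – exactly the trap the numbers-speak-for-themselves view falls into.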
“Big data” has arrived, but big insights have not. The challenge now is to solve new problems and gain new answers – without making the same old statistical mistakes on a grander scale than ever.”
Potholes and Big Data: Crowdsourcing Our Way to Better Government

Phil Simon in Wired: “Big Data is transforming many industries and functions within organizations, often on relatively limited budgets.
Consider Thomas M. Menino, up until recently Boston’s longest-serving mayor. At some point in the past few years, Menino realized that it was no longer 1950. Perhaps he was hobnobbing with some techies from MIT at dinner one night. Whatever his motivation, he decided that there just had to be a better, more cost-effective way to maintain and fix the city’s roads. Maybe smartphones could help the city take a more proactive approach to road maintenance.
To that end, in July 2012, the Mayor’s Office of New Urban Mechanics launched a new project called Street Bump, an app that allows drivers to automatically report road hazards to the city as soon as they hear that unfortunate “thud,” with their smartphones doing all the work.
The app’s developers say their work has already sparked interest from other cities in the U.S., Europe, Africa and elsewhere that are imagining other ways to harness the technology.
Before they even start their trip, drivers using Street Bump fire up the app, then set their smartphones either on the dashboard or in a cup holder. The app takes care of the rest, using the phone’s accelerometer — a motion detector — to sense when a bump is hit. GPS records the location, and the phone transmits it to an AWS remote server.
But that’s not the end of the story. It turned out that the first version of the app reported far too many false positives (i.e., phantom potholes). This finding no doubt gave ammunition to the many naysayers who believe that technology will never be able to do what people can and that things are just fine as they are, thank you. Street Bump 1.0 “collected lots of data but couldn’t differentiate between potholes and other bumps.” After all, your smartphone or cell phone isn’t inert; it moves in the car naturally because the car is moving. And what about the scores of people whose phones “move” because they check their messages at a stoplight?
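The false-positive problem described above suggests the shape of the fix, even if Street Bump’s actual algorithm is not public. A minimal sketch (thresholds and function names are hypothetical, not the app’s real logic): flag sharp vertical-acceleration spikes, but only while the vehicle is moving fast enough that the phone is unlikely to be in someone’s hand.

```python
# Hypothetical sketch of pothole-style event filtering (NOT Street Bump's
# actual algorithm): flag sharp vertical-acceleration spikes, but gate them
# on vehicle speed to screen out phones being handled at a stoplight.

SPIKE_G = 1.5      # assumed jolt threshold, in g
MIN_SPEED = 4.0    # assumed minimum speed in m/s (~14 km/h)

def is_bump(vertical_accel_g: float, speed_mps: float) -> bool:
    """Return True if a reading looks like a road impact, not phone handling."""
    return abs(vertical_accel_g) > SPIKE_G and speed_mps >= MIN_SPEED

readings = [
    (2.1, 12.0),  # hard jolt at driving speed -> candidate pothole
    (2.3, 0.0),   # jolt while stopped (phone picked up) -> ignored
    (0.4, 15.0),  # smooth road -> ignored
]
events = [r for r in readings if is_bump(*r)]
print(len(events))  # 1 candidate event
```

Even this toy version shows why version 1.0 struggled: without the speed gate, the second reading would also count as a pothole.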
To their credit, Menino and his motley crew weren’t entirely discouraged by this initial setback. In their gut, they knew that they were on to something. The idea and potential of the Street Bump app were worth pursuing and refining, even if the first version was a bit lacking. Plus, they have plenty of examples from which to learn. It’s not like the iPad, iPod, and iPhone haven’t evolved considerably over time.
Enter InnoCentive, a Massachusetts-based firm specializing in open innovation and crowdsourcing. The City of Boston contracted InnoCentive to improve Street Bump and reduce the amount of tail chasing. The company accepted the challenge and essentially turned it into a contest, a process sometimes called gamification. InnoCentive offered a network of 400,000 experts a share of $25,000 in prize money donated by Liberty Mutual.
Almost immediately, the ideas to improve Street Bump poured in from unexpected places. This crowd had wisdom. Ultimately, the best suggestions came from:
- A group of hackers in Somerville, Massachusetts, that promotes community education and research
- The head of the mathematics department at Grand Valley State University in Allendale, Michigan
- An anonymous software engineer
…Crowdsourcing roadside maintenance isn’t just cool. Increasingly, projects like Street Bump are resulting in substantial savings — and better government.”
Crowdsourced transit app shows what time the bus will really come
Springwise: “The problem with most transport apps is that they rely on fixed data from transport company schedules and don’t truly reflect exactly what’s going on with the city’s trains and buses at any given moment. Operating like a Waze for public transport, Israel’s Ototo app crowdsources real-time information from passengers to give users the best suggestions for their commute.
The app relies on a community of ‘Riders’, who allow anonymous location data to be sent from their smartphone whenever they’re using public transport. By collating this data together, Ototo offers more realistic information about bus and train routes. While a bus may be due in five minutes, a Rider currently on that bus might be located more than five minutes away, indicating that the bus isn’t on time. Ototo can then suggest a quicker route for users. According to Fast Company, the service currently has a 12,000-strong global Riders community that powers its travel recommendations. On top of this, the app is designed in an easy-to-use infographic format that quickly and efficiently tells users where they need to be going and how long it will take. The app is free to download from the App Store, and the video below offers a demonstration:
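The inference Springwise describes – a rider on the bus being farther away than the scheduled arrival time allows – can be sketched in a few lines. This is an illustration of the idea, not Ototo’s actual logic; the speed assumption and function names are invented here.

```python
# Illustrative sketch (NOT Ototo's actual logic): infer lateness by comparing
# the scheduled arrival with the time the bus would need to cover a rider's
# reported remaining distance at an assumed average speed.

def estimated_arrival_min(rider_distance_km: float, avg_speed_kmh: float) -> float:
    """Minutes for the bus to cover the remaining distance."""
    return rider_distance_km / avg_speed_kmh * 60

def is_late(scheduled_min: float, rider_distance_km: float,
            avg_speed_kmh: float = 20.0) -> bool:
    return estimated_arrival_min(rider_distance_km, avg_speed_kmh) > scheduled_min

# Bus due in 5 minutes, but a rider on board is still 3 km away:
# at 20 km/h that is 9 minutes of travel, so the bus must be running late.
print(is_late(scheduled_min=5, rider_distance_km=3.0))  # True
```

Aggregating many such rider signals per route is what lets the app beat a static timetable.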
Ototo faces competition from similar services such as New York City’s Moovit, which also details how crowded buses are.”
The data gold rush
Neelie Kroes (European Commission): “Nearly 200 years ago, the industrial revolution saw new networks take over. Not just a new form of transport, the railways connected industries, connected people, energised the economy, transformed society.
Now we stand facing a new industrial revolution: a digital one.
With cloud computing its new engine, big data its new fuel. Transporting the amazing innovations of the internet, and the internet of things. Running on broadband rails: fast, reliable, pervasive.
My dream is that Europe takes its full part. With European industry able to supply, European citizens and businesses able to benefit, European governments able and willing to support. But we must get all those components right.
What does it mean to say we’re in the big data era?
First, it means more data than ever at our disposal. Take all the information of humanity from the dawn of civilisation until 2003 – nowadays that is produced in just two days. We are also acting to have more and more of it become available as open data, for science, for experimentation, for new products and services.
Second, we have ever more ways – not just to collect that data – but to manage it, manipulate it, use it. That is the magic to find value amid the mass of data. The right infrastructure, the right networks, the right computing capacity and, last but not least, the right analysis methods and algorithms help us break through the mountains of rock to find the gold within.
Third, this is not just some niche product for tech-lovers. The impact and difference to people’s lives are huge: in so many fields.
Transforming healthcare, using data to develop new drugs, and save lives. Greener cities with fewer traffic jams, and smarter use of public money.
A business boost: like retailers who communicate smarter with customers, for more personalisation, more productivity, a better bottom line.
No wonder big data is growing 40% a year. No wonder data jobs grow fast. No wonder skills and profiles that didn’t exist a few years ago are now hot property: and we need them all, from data cleaner to data manager to data scientist.
This can make a difference to people’s lives. Wherever you sit in the data ecosystem – never forget that. Never forget that real impact and real potential.
Politicians are starting to get this. The EU’s Presidents and Prime Ministers have recognised the boost to productivity, innovation and better services from big data and cloud computing.
But those technologies need the right environment. We can’t go on struggling with poor quality broadband. With each country trying on its own. With infrastructure and research that are individual and ineffective, separate and subscale. With different laws and practices shackling and shattering the single market. We can’t go on like that.
Nor can we continue in an atmosphere of insecurity and mistrust.
Recent revelations show what is possible online. They show implications for privacy, security, and rights.
You can react in two ways. One is to throw up your hands and surrender. To give up and put big data in the box marked “too difficult”. To turn away from this opportunity, and turn your back on problems that need to be solved, from cancer to climate change. Or – even worse – to simply accept that Europe won’t figure on this map but will be reduced to importing the results and products of others.
Alternatively: you can decide that we are going to master big data – and master all its dependencies, requirements and implications, including cloud and other infrastructures, Internet of things technologies as well as privacy and security. And do it on our own terms.
And by the way – privacy and security safeguards do not just have to be about protecting and limiting. Data generates value, and unlocks the door to new opportunities: you don’t need to “protect” people from their own assets. What you need is to empower people, give them control, give them a fair share of that value. Give them rights over their data – and responsibilities too, and the digital tools to exercise them. And ensure that the networks and systems they use are affordable, flexible, resilient, trustworthy, secure.
One thing is clear: the answer to greater security is not just to build walls. Many millennia ago, the Greek people realised that. They realised that you can build walls as high and as strong as you like – it won’t make a difference, not without the right awareness, the right risk management, the right security, at every link in the chain. If only the Trojans had realised that too! The same is true in the digital age: keep our data locked up in Europe, engage in an impossible dream of isolation, and we lose an opportunity; without gaining any security.
But master all these areas, and we would truly have mastered big data. Then we would have showed technology can take account of democratic values; and that a dynamic democracy can cope with technology. Then we would have a boost to benefit every European.
So let’s turn this asset into gold. With the infrastructure to capture and process. Cloud capability that is efficient, affordable, on-demand. Let’s tackle the obstacles, from standards and certification, trust and security, to ownership and copyright. With the right skills, so our workforce can seize this opportunity. With new partnerships, getting all the right players together. And investing in research and innovation. Over the next two years, we are putting 90 million euros on the table for big data and 125 million for the cloud.
I want to respond to this economic imperative. And I want to respond to the call of the European Council – looking at all the aspects relevant to tomorrow’s digital economy.
You can help us build this future. All of you. Helping to bring about the digital data-driven economy of the future. Expanding and deepening the ecosystem around data. New players, new intermediaries, new solutions, new jobs, new growth….”
The Parable of Google Flu: Traps in Big Data Analysis
David Lazer: “…big data last winter had its “Dewey beats Truman” moment, when the poster child of big data (at least for behavioral data), Google Flu Trends (GFT), went way off the rails in “nowcasting” the flu–overshooting the peak last winter by 130% (and indeed, it has been systematically overshooting by wide margins for 3 years). Tomorrow we (Ryan Kennedy, Alessandro Vespignani, and Gary King) have a paper out in Science dissecting why GFT went off the rails, how that could have been prevented, and the broader lessons to be learned regarding big data.
[We have posted the draft we submitted before acceptance, The Parable of Google Flu (WP-Final).pdf. We have also posted an SSRN paper evaluating GFT for 2013-14, since it was reworked in the fall.] Key lessons that I’d highlight:
1) Big data are typically not scientifically calibrated. This goes back to my post last month regarding measurement. This does not make them useless from a scientific point of view, but you do need to build into the analysis that the “measures” of behavior are being affected by unseen things. In this case, the likely culprit was the Google search algorithm, which was modified in various ways that we believe likely to have increased flu related searches.
2) Big data + analytic code used in scientific venues with scientific claims need to be more transparent. This is a tricky issue, because there are both legitimate proprietary interests involved and privacy concerns, but much more can be done in this regard than has been done in the 3 GFT papers. [One of my aspirations over the next year is to work together with big data companies, researchers, and privacy advocates to figure out how this can be done.]
3) It’s about the questions, not the size of the data. In this particular case, one could have done a better job stating the likely flu prevalence today by ignoring GFT altogether and just projecting 3-week-old CDC data to today (better still would have been to combine the two). That is, a synthesis would have been more effective than a pure “big data” approach. I think this is likely the general pattern.
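The synthesis Lazer describes can be sketched very simply. The projection rule and the blend weight below are assumptions for illustration, not the fitted model from the Science paper: project the lagged CDC surveillance series forward on its recent trend, then take a weighted combination with the big-data signal rather than trusting either source alone.

```python
# Illustrative sketch of the "synthesis" idea (the trend rule and weights are
# assumptions, not the paper's fitted model): project lagged CDC surveillance
# data forward, then blend that projection with the big-data signal.

def project_cdc(cdc_history: list[float]) -> float:
    """Naive trend projection: extend the last observed change forward."""
    latest, previous = cdc_history[-1], cdc_history[-2]
    return latest + (latest - previous)

def blended_nowcast(cdc_history: list[float], gft_estimate: float,
                    w_cdc: float = 0.7) -> float:
    """Weighted combination of the CDC projection and the GFT signal."""
    return w_cdc * project_cdc(cdc_history) + (1 - w_cdc) * gft_estimate

cdc = [1.8, 2.1, 2.6]   # % ILI visits, reported with a multi-week lag
print(round(blended_nowcast(cdc, gft_estimate=6.0), 2))  # 3.97
```

Note how the blend dampens the inflated big-data estimate (6.0, echoing GFT’s 130% overshoot) toward what the slower but calibrated surveillance series implies.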
4) More generally, I’d note that there is much more that the academy needs to do. First, the academy needs to build the foundation for collaborations around big data (e.g., secure infrastructures, legal understandings around data sharing, etc). Second, there needs to be MUCH more work done to build bridges between the computer scientists who work on big data and social scientists who think about deriving insights about human behavior from data more generally. We have moved perhaps 5% of the way that we need to in this regard.”
How government can engage with citizens online – expert views
The Guardian: In our livechat on 28 February the experts discussed how to connect up government and citizens online. Digital public services are not just for ‘techno wizzy people’, so government should make them easier for everyone.
Michael Sanders, head of research for the behavioural insights team – @mike_t_sanders
It’s important that government is a part of people’s lives: when people interact with government it shouldn’t be a weird and alienating experience, but one that feels part of their everyday lives.
Online services are still too often difficult to use: most people who use the HMRC website will do so infrequently, and will forget its many nuances between visits. This is getting better but there’s a long way to go.
Digital by default keeps things simple: one of our main findings from our research on improving public services is that we should do all we can to “make it easy”.
There is always a risk of exclusion: we should avoid “digital by default” becoming “digital only”.
Ben Matthews, head of communications at Futuregov – @benrmatthews
We prefer digital by design to digital by default: sometimes people can use technology badly, under the guise of ‘digital by default’. We should take a more thoughtful approach to technology, using it as a means to an end – to help us be open, accountable and human.
Leadership is important: you can get enthusiasm from the frontline or younger workers who are comfortable with digital tools, but until they’re empowered by the top of the organisation to use them actively and effectively, we’ll see little progress.
Jargon scares people off: ‘big data’ or ‘open data’, for example….”
Predicting Individual Behavior with Social Networks
Article by Sharad Goel and Daniel Goldstein (Microsoft Research): “With the availability of social network data, it has become possible to relate the behavior of individuals to that of their acquaintances on a large scale. Although the similarity of connected individuals is well established, it is unclear whether behavioral predictions based on social data are more accurate than those arising from current marketing practices. We employ a communications network of over 100 million people to forecast highly diverse behaviors, from patronizing an off-line department store to responding to advertising to joining a recreational league. Across all domains, we find that social data are informative in identifying individuals who are most likely to undertake various actions, and moreover, such data improve on both demographic and behavioral models. There are, however, limits to the utility of social data. In particular, when rich transactional data were available, social data did little to improve prediction.”
Overcoming 'Tragedies of the Commons' with a Self-Regulating, Participatory Market Society
Paper by Dirk Helbing: “Our society is fundamentally changing. These days, almost nothing works without a computer chip. Processing power doubles every 18 months and will exceed the capabilities of human brains in about ten years from now. Some time ago, IBM’s Deep Blue computer already beat the best chess player. Meanwhile, computers perform about 70 percent of all financial transactions, and IBM’s Watson advises customers better than human telephone hotlines. Will computers and robots soon replace skilled labor? In many European countries, unemployment is reaching historic highs. The forthcoming economic and social impact of future information and communication technologies (ICT) will be huge – probably more significant than that caused by the steam engine, or by nano- or biotechnology.
The storage capacity for data is growing even faster than computational capacity. Within just a year we will soon generate more data than in the entire history of humankind. The “Internet of Things” will network trillions of sensors. Unimaginable amounts of data will be collected. Big Data is already being praised as the “oil of the 21st century”. What opportunities and risks does this create for our society, economy, and environment?”
Open Government – Opportunities and Challenges for Public Governance
Big Data, Big New Businesses
Nigel Shadbolt and Michael Chui: “Many people have long believed that if government and the private sector agreed to share their data more freely, and allow it to be processed using the right analytics, previously unimaginable solutions to countless social, economic, and commercial problems would emerge. They may have no idea how right they are.
Even the most vocal proponents of open data appear to have underestimated how many profitable ideas and businesses stand to be created. More than 40 governments worldwide have committed to opening up their electronic data – including weather records, crime statistics, transport information, and much more – to businesses, consumers, and the general public. The McKinsey Global Institute estimates that the annual value of open data in education, transportation, consumer products, electricity, oil and gas, health care, and consumer finance could reach $3 trillion.
These benefits come in the form of new and better goods and services, as well as efficiency savings for businesses, consumers, and citizens. The range is vast. For example, drawing on data from various government agencies, the Climate Corporation (recently bought for $1 billion) has taken 30 years of weather data, 60 years of data on crop yields, and 14 terabytes of information on soil types to create customized insurance products.
Similarly, real-time traffic and transit information can be accessed on smartphone apps to inform users when the next bus is coming or how to avoid traffic congestion. And, by analyzing online comments about their products, manufacturers can identify which features consumers are most willing to pay for, and develop their business and investment strategies accordingly.
Opportunities are everywhere. A raft of open-data start-ups are now being incubated at the London-based Open Data Institute (ODI), which focuses on improving our understanding of corporate ownership, health-care delivery, energy, finance, transport, and many other areas of public interest.
Consumers are the main beneficiaries, especially in the household-goods market. It is estimated that consumers making better-informed buying decisions across sectors could capture $1.1 trillion in value annually. Third-party data aggregators are already allowing customers to compare prices across online and brick-and-mortar shops. Many also permit customers to compare quality ratings, safety data (drawn, for example, from official injury reports), information about the provenance of food, and producers’ environmental and labor practices.
Consider the book industry. Bookstores once regarded their inventory as a trade secret. Customers, competitors, and even suppliers seldom knew what stock bookstores held. Nowadays, by contrast, bookstores not only report what stock they carry but also when customers’ orders will arrive. If they did not, they would be excluded from the product-aggregation sites that have come to determine so many buying decisions.
The health-care sector is a prime target for achieving new efficiencies. By sharing the treatment data of a large patient population, for example, care providers can better identify practices that could save $180 billion annually.
The Open Data Institute-backed start-up Mastodon C uses open data on doctors’ prescriptions to differentiate among expensive patent medicines and cheaper “off-patent” varieties; when applied to just one class of drug, that could save around $400 million in one year for the British National Health Service. Meanwhile, open data on acquired infections in British hospitals has led to the publication of hospital-performance tables, a major factor in the 85% drop in reported infections.
There are also opportunities to prevent lifestyle-related diseases and improve treatment by enabling patients to compare their own data with aggregated data on similar patients. This has been shown to motivate patients to improve their diet, exercise more often, and take their medicines regularly. Similarly, letting people compare their energy use with that of their peers could prompt them to save hundreds of billions of dollars in electricity costs each year, to say nothing of reducing carbon emissions.
Such benchmarking is even more valuable for businesses seeking to improve their operational efficiency. The oil and gas industry, for example, could save $450 billion annually by sharing anonymized and aggregated data on the management of upstream and downstream facilities.
Finally, the move toward open data serves a variety of socially desirable ends, ranging from the reuse of publicly funded research to support work on poverty, inclusion, or discrimination, to the disclosure by corporations such as Nike of their supply-chain data and environmental impact.
There are, of course, challenges arising from the proliferation and systematic use of open data. Companies fear for their intellectual property; ordinary citizens worry about how their private information might be used and abused. Last year, Telefónica, the world’s fifth-largest mobile-network provider, tried to allay such fears by launching a digital confidence program to reassure customers that innovations in transparency would be implemented responsibly and without compromising users’ personal information.
The sensitive handling of these issues will be essential if we are to reap the potential $3 trillion in value that usage of open data could deliver each year. Consumers, policymakers, and companies must work together, not just to agree on common standards of analysis, but also to set the ground rules for the protection of privacy and property.”