Believe the hype: Big data can have a big social impact


Annika Small at the Guardian: “Given all the hype around so-called big data at the moment, it would be easy to dismiss it as nothing more than the latest technology buzzword. This would be a mistake, given that the application and interpretation of huge – often publicly available – data sets is already supporting new models of creativity, innovation and engagement.
To date, stories of big data’s progress and successes have tended to come from government and the private sector, but we’ve heard little about its relevance to social organisations. Yet big data can fuel big social change.
It’s already playing a vital role in the charitable sector. Some social organisations are using existing open government data to better target their services, to improve advocacy and fundraising, and to support knowledge sharing and collaboration between different charities and agencies. Crowdsourcing of open data also offers a new way for not-for-profits to gather intelligence, and there is a wide range of freely available online tools to help them analyse the information.
However, realising the potential of big and open data presents a number of technical and organisational challenges for social organisations. Many don’t have the required skills, awareness and investment to turn big data to their advantage. They also tend to lack the access to examples that might help demystify the technicalities and focus on achievable results.
Overcoming these challenges can be surprisingly simple: Keyfund, for example, gained insight into what made for a successful application to their scheme by using a free online tool to create word clouds out of all the text in their application forms. Many social organisations could use this same technique to better understand the large volume of unstructured text that they accumulate – in doing so, they would be “doing big data” (albeit in a small way). At the other end of the scale, Global Giving has developed its own sophisticated set of analytical tools to better understand the 57,000+ “stories” gathered from its network.
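As a concrete illustration of the Keyfund-style exercise, here is a minimal offline sketch in Python: a plain word-frequency count over a file of application-form text, which is the raw material a word-cloud tool visualises. The file name and stopword list are hypothetical placeholders, not a reference to any particular product.

```python
# Count the most frequent terms across a pile of free-text application answers.
# "application_forms.txt" is a hypothetical file of concatenated form responses.
import re
from collections import Counter

STOPWORDS = {"the", "and", "to", "of", "a", "in", "we", "our", "for", "is", "that", "it"}

with open("application_forms.txt", encoding="utf-8") as f:
    words = re.findall(r"[a-z']+", f.read().lower())

counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)

# The top terms give a quick, rough picture of what applicants write about.
for term, n in counts.most_common(20):
    print(f"{term:<15} {n}")
```

Even a crude count like this is often enough to surface the themes that distinguish stronger applications.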
Innovation often happens when different disciplines collide and it’s becoming apparent that most value – certainly most social value – is likely to be created at the intersection of government, private and social sector data. That could be the combination of data from different sectors, or better “data collaboration” within sectors.
The Housing Association Charitable Trust (HACT) has produced two original tools that demonstrate this. Its Community Insight tool combines data from different sectors, allowing housing providers easily to match information about their stock to a large store of well-maintained open government figures. Meanwhile, its Housing Big Data programme is building a huge dataset by combining stats from 16 different housing providers across the UK. While Community Insight allows each organisation to gain better individual understanding of their communities (measuring well-being and deprivation levels, tracking changes over time, identifying hotspots of acute need), Housing Big Data is making progress towards a much richer network of understanding, providing a foundation for the sector to collaboratively identify challenges and quantify the impact of their interventions.
Alongside this specific initiative from HACT, it’s also exciting to see programmes such as 360giving, which forge connections between a range of private and social enterprises and lay the foundations for UK social investors to be a significant source of information over the next decade. Certainly, The Big Lottery Fund’s publication of open data late last year is a milestone which also highlights how far we have to travel as a sector before we are truly “data-rich”.
At Nominet Trust, we have produced the Social Tech Guide to demonstrate the scale and diversity of social value being generated internationally – much of which is achieved through harnessing the power of big data. From Knewton creating personally tailored learning programmes, to Cellslider using the power of the crowd to advance cancer research, there is no shortage of inspiration. The UN’s Global Pulse programme is another great example, with its focus on how we can combine private and public sources to pin down the size and shape of a social challenge, and calibrate our collective response.
These examples of data-driven social change demonstrate the huge opportunities for social enterprises to harness technology to generate insights, to drive more effective action and to fuel social change. If we are to realise this potential, we need to continue to stretch ourselves as social enterprises and social investors.”

Can Big Data Stop Wars Before They Happen?


Foreign Policy: “It has been almost two decades exactly since conflict prevention shot to the top of the peace-building agenda, as large-scale killings shifted from interstate wars to intrastate and intergroup conflicts. What could we have done to anticipate and prevent the 100 days of genocidal killing in Rwanda that began in April 1994 or the massacre of thousands of Bosnian Muslims at Srebrenica just over a year later? The international community recognized that conflict prevention could no longer be limited to diplomatic and military initiatives, but that it also requires earlier intervention to address the causes of violence between nonstate actors, including tribal, religious, economic, and resource-based tensions.
For years, even as it was pursued as doggedly as personnel and funding allowed, early intervention remained elusive, a kind of Holy Grail for peace-builders. This might finally be changing. The rise of data on social dynamics and what people think and feel — obtained through social media, SMS questionnaires, increasingly comprehensive satellite information, news-scraping apps, and more — has given the peace-building field hope of harnessing a new vision of the world. But to cash in on that hope, we first need to figure out how to understand all the numbers and charts and figures now available to us. Only then can we expect to predict and prevent events like the recent massacres in South Sudan or the ongoing violence in the Central African Republic.
A growing number of initiatives have tried to make it across the bridge between data and understanding. They’ve ranged from small nonprofit shops of a few people to massive government-funded institutions, and they’ve been moving forward in fits and starts. Few of these initiatives have been successful in documenting incidents of violence actually averted or stopped. Sometimes that’s simply because violence or absence of it isn’t verifiable. The growing literature on big data and conflict prevention today is replete with caveats about “overpromising and underdelivering” and the persistent gap between early warning and early action. In the case of the Conflict Early Warning and Response Mechanism (CEWARN) system in central Africa — one of the earlier and most prominent attempts at early intervention — it is widely accepted that the project largely failed to use the data it retrieved for effective conflict management. It relied heavily on technology to produce large databases, while lacking the personnel to effectively analyze them or take meaningful early action.
To be sure, disappointments are to be expected when breaking new ground. But they don’t have to continue forever. This pioneering work demands not just data and technology expertise. Also critical is cross-discipline collaboration between the data experts and the conflict experts, who know intimately the social, political, and geographic terrain of different locations. What was once a clash of cultures over the value and meaning of metrics when it comes to complex human dynamics needs to morph into collaboration. This is still pretty rare, but if the past decade’s innovations are any prologue, we are hopefully headed in the right direction.
* * *
Over the last three years, the U.S. Defense Department, the United Nations, and the CIA have all launched programs to parse the masses of public data now available, scraping and analyzing details from social media, blogs, market data, and myriad other sources to achieve variations of the same goal: anticipating when and where conflict might arise. The Defense Department’s Information Volume and Velocity program is designed to use “pattern recognition to detect trends in a sea of unstructured data” that would point to growing instability. The U.N.’s Global Pulse initiative’s stated goal is to track “human well-being and emerging vulnerabilities in real-time, in order to better protect populations from shocks.” The Open Source Indicators program at the CIA’s Intelligence Advanced Research Projects Activity aims to anticipate “political crises, disease outbreaks, economic instability, resource shortages, and natural disasters.” Each looks to the growing stream of public data to detect significant population-level changes.
Large institutions with deep pockets have always been at the forefront of efforts in the international security field to design systems for improving data-driven decision-making. They’ve followed the lead of large private-sector organizations where data and analytics rose to the top of the corporate agenda. (In that sector, the data revolution is promising “to transform the way many companies do business, delivering performance improvements not seen since the redesign of core processes in the 1990s,” as David Court, a director at consulting firm McKinsey, has put it.)
What really defines the recent data revolution in peace-building, however, is that it is transcending size and resource limitations. It is finding its way to small organizations operating at local levels and using knowledge and subject experts to parse information from the ground. It is transforming the way peace-builders do business, delivering data-led programs and evidence-based decision-making not seen since the field’s inception in the latter half of the 20th century.
One of the most famous recent examples is the 2013 Kenyan presidential election.
In March 2013, the world was watching and waiting to see whether the vote would produce more of the violence that had left at least 1,300 people dead and 600,000 homeless during and after the 2007 elections. In the intervening years, a web of NGOs worked to set up early-warning and early-response mechanisms to defuse tribal rivalries, party passions, and rumor-mongering. Many of the projects were technology-based initiatives trying to leverage data sources in new ways — including a collaborative effort spearheaded and facilitated by a Kenyan nonprofit called Ushahidi (“witness” in Swahili) that designs open-source data collection and mapping software. The Umati (meaning “crowd”) project used an Ushahidi program to monitor media reports, tweets, and blog posts to detect rising tensions, frustration, calls to violence, and hate speech — and then sorted and categorized it all on one central platform. The information fed into election-monitoring maps built by the Ushahidi team, while mobile-phone provider Safaricom donated 50 million text messages to a local peace-building organization, Sisi ni Amani (“We are Peace”), so that it could act on the information by sending texts — which had been used to incite and fuel violence during the 2007 elections — aimed at preventing violence and quelling rumors.
The first challenges came around 10 a.m. on the opening day of voting. “Rowdy youth overpowered police at a polling station in Dandora Phase 4,” one of the informal settlements in Nairobi that had been a site of violence in 2007, wrote Neelam Verjee, programs manager at Sisi ni Amani. The young men were blocking others from voting, and “the situation was tense.”
Sisi ni Amani sent a text blast to its subscribers: “When we maintain peace, we will have joy & be happy to spend time with friends & family but violence spoils all these good things. Tudumishe amani [“Maintain the peace”] Phase 4.” Meanwhile, security officers, who had been called separately, arrived at the scene and took control of the polling station. Voting resumed with little violence. According to interviews collected by Sisi ni Amani after the vote, the message “was sent at the right time” and “helped to calm down the situation.”
In many ways, Kenya’s experience is the story of peace-building today: Data is changing the way professionals in the field think about anticipating events, planning interventions, and assessing what worked and what didn’t. But it also underscores the possibility that we might be edging closer to a time when peace-builders at every level and in all sectors — international, state, and local, governmental and not — will have mechanisms both to know about brewing violence and to save lives by acting on that knowledge.
Three important trends underlie the optimism. The first is the sheer amount of data that we’re generating. In 2012, humans plugged into digital devices managed to generate more data in a single year than over the course of world history — and that rate more than doubles every year. As of 2012, 2.4 billion people — 34 percent of the world’s population — had a direct Internet connection. The growth is most stunning in regions like the Middle East and Africa where conflict abounds; access has grown 2,634 percent and 3,607 percent, respectively, in the last decade.
The growth of mobile-phone subscriptions, which allow their owners to be part of new data sources without a direct Internet connection, is also staggering. In 2013, there were almost as many cell-phone subscriptions in the world as there were people. In Africa, there were 63 subscriptions per 100 people, and there were 105 per 100 people in the Arab states.
The second trend has to do with our expanded capacity to collect and crunch data. Not only do we have more computing power enabling us to produce enormous new data sets — such as the Global Database of Events, Language, and Tone (GDELT) project, which tracks almost 300 million conflict-relevant events reported in the media between 1979 and today — but we are also developing more-sophisticated methodological approaches to using these data as raw material for conflict prediction. New machine-learning methodologies, which use algorithms to make predictions (like a spam filter, but much, much more advanced), can provide “substantial improvements in accuracy and performance” in anticipating violent outbreaks, according to Chris Perry, a data scientist at the International Peace Institute.
This brings us to the third trend: the nature of the data itself. When it comes to conflict prevention and peace-building, progress is not simply a question of “more” data, but also different data. For the first time, digital media — user-generated content and online social networks in particular — tell us not just what is going on, but also what people think about the things that are going on. Excitement in the peace-building field centers on the possibility that we can tap into data sets to understand, and preempt, the human sentiment that underlies violent conflict.
Realizing the full potential of these three trends means figuring out how to distinguish between the information, which abounds, and the insights, which are actionable. It is a distinction that is especially hard to make because it requires cross-discipline expertise that combines the wherewithal of data scientists with that of social scientists and the knowledge of technologists with the insights of conflict experts.
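The machine-learning approach Perry describes, GDELT-style event counts feeding a classifier that anticipates violence, can be sketched minimally as follows. The CSV, feature names and label are hypothetical stand-ins, not any project’s actual pipeline.

```python
# Predict whether violence escalates next month from this month's event counts.
# "monthly_event_counts.csv" and its columns are hypothetical placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("monthly_event_counts.csv")
features = ["protest_events", "verbal_threats", "material_conflict", "prior_fatalities"]
X, y = df[features], df["violence_next_month"]  # y: 1 if violence escalated

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Out-of-sample performance is the honest test of any early-warning model.
print(classification_report(y_test, model.predict(X_test)))
```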

To the Cloud: Big Data in a Turbulent World


Book by Vincent Mosco: “In the wake of revelations about National Security Agency activities—many of which occur “in the cloud”—this book offers both enlightenment and a critical view. Cloud computing and big data are arguably the most significant forces in information technology today. In clear prose, To the Cloud explores where the cloud originated, what it means, and how important it is for business, government, and citizens. It describes the intense competition among cloud companies like Amazon and Google, the spread of the cloud to government agencies like the controversial NSA, and the astounding growth of entire cloud cities in China. From advertising to trade shows, the cloud and big data are furiously marketed to the world, even as dark clouds loom over environmental, privacy, and employment issues that arise from the cloud. Is the cloud the long-promised information utility that will solve many of the world’s economic and social problems? Or is it just marketing hype? To the Cloud provides the first thorough analysis of the potential and the problems of a technology that may very well disrupt the world.”

The false promise of the digital humanities


Adam Kirsch in the New Republic: “The humanities are in crisis again, or still. But there is one big exception: digital humanities, which is a growth industry. In 2009, the nascent field was the talk of the Modern Language Association (MLA) convention: “among all the contending subfields,” a reporter wrote about that year’s gathering, “the digital humanities seem like the first ‘next big thing’ in a long time.” Even earlier, the National Endowment for the Humanities created its Office of Digital Humanities to help fund projects. And digital humanities continues to go from strength to strength, thanks in part to the Mellon Foundation, which has seeded programs at a number of universities with large grants, most recently $1 million to the University of Rochester to create a graduate fellowship.

Despite all this enthusiasm, the question of what the digital humanities is has yet to be given a satisfactory answer. Indeed, no one asks it more often than the digital humanists themselves. The recent proliferation of books on the subject, from sourcebooks and anthologies to critical manifestos, is a sign of a field suffering an identity crisis, trying to determine what, if anything, unites the disparate activities carried on under its banner. “Nowadays,” writes Stephen Ramsay in Defining Digital Humanities, “the term can mean anything from media studies to electronic art, from data mining to edutech, from scholarly editing to anarchic blogging, while inviting code junkies, digital artists, standards wonks, transhumanists, game theorists, free culture advocates, archivists, librarians, and edupunks under its capacious canvas.”

Within this range of approaches, we can distinguish a minimalist and a maximalist understanding of digital humanities. On the one hand, it can be simply the application of computer technology to traditional scholarly functions, such as the editing of texts. An exemplary project of this kind is the Rossetti Archive created by Jerome McGann, an online repository of texts and images related to the career of Dante Gabriel Rossetti: this is essentially an open-ended, universally accessible scholarly edition. To others, however, digital humanities represents a paradigm shift in the way we think about culture itself, spurring a change not just in the medium of humanistic work but also in its very substance. At their most starry-eyed, some digital humanists, such as the authors of the jargon-laden manifesto and handbook Digital_Humanities, want to suggest that the addition of the high-powered adjective to the long-suffering noun signals nothing less than an epoch in human history: “We live in one of those rare moments of opportunity for the humanities, not unlike other great eras of cultural-historical transformation such as the shift from the scroll to the codex, the invention of movable type, the encounter with the New World, and the Industrial Revolution.”

The language here is the language of scholarship, but the spirit is the spirit of salesmanship: the very same kind of hyperbolic, hard-sell approach we are so accustomed to hearing about the Internet, or about Apple’s latest utterly revolutionary product. Fundamental to this kind of persuasion is the undertone of menace, the threat of historical illegitimacy and obsolescence. Here is the future, we are made to understand: we can either get on board or stand athwart it and get run over. The same kind of revolutionary rhetoric appears again and again in the new books on the digital humanities, from writers with very different degrees of scholarly commitment and intellectual sophistication.

In Uncharted, Erez Aiden and Jean-Baptiste Michel, the creators of the Google Ngram Viewer, an online tool that allows you to map the frequency of words in all the printed matter digitized by Google, talk up the “big data revolution”: “Its consequences will transform how we look at ourselves…. Big data is going to change the humanities, transform the social sciences, and renegotiate the relationship between the world of commerce and the ivory tower.” These breathless prophecies are just hype. But at the other end of the spectrum, even McGann, one of the pioneers of what used to be called “humanities computing,” uses the high language of inevitability: “Here is surely a truth now universally acknowledged: that the whole of our cultural inheritance has to be recurated and reedited in digital forms and institutional structures.”

If ever there were a chance to see the ideological construction of reality at work, digital humanities is it. Right before our eyes, options are foreclosed and demands enforced; a future is constructed as though it were being discovered. By now we are used to this process, since over the last twenty years the proliferation of new technologies has totally discredited the idea of opting out of “the future.”…

Findings of the Big Data and Privacy Working Group Review


John Podesta at the White House Blog: “Over the past several days, severe storms have battered Arkansas, Oklahoma, Mississippi and other states. Dozens of people have been killed and entire neighborhoods turned to rubble and debris as tornadoes have touched down across the region. Natural disasters like these present a host of challenges for first responders. How many people are affected, injured, or dead? Where can they find food, shelter, and medical attention? What critical infrastructure might have been damaged?
Drawing on open government data sources, including Census demographics and NOAA weather data, along with their own demographic databases, Esri, a geospatial technology company, has created a real-time map showing where the twisters have been spotted and how the storm systems are moving. They have also used these data to show how many people live in the affected area, and summarize potential impacts from the storms. It’s a powerful tool for emergency services and communities. And it’s driven by big data technology.
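Esri’s product is far richer than this, but the underlying idea, joining a hazard footprint to census geography to estimate the affected population, can be sketched in a few lines. The file and column names below are hypothetical placeholders.

```python
# Overlay a storm-damage footprint on census tracts to estimate affected population.
# Requires geopandas >= 0.10; both input files are hypothetical.
import geopandas as gpd

tracts = gpd.read_file("census_tracts.shp")          # columns: GEOID, POP, geometry
damage = gpd.read_file("storm_damage_area.geojson")  # polygons of the affected area

# Put both layers in the same coordinate reference system before joining.
damage = damage.to_crs(tracts.crs)

# Keep the tracts that intersect the damage footprint.
affected = gpd.sjoin(tracts, damage, predicate="intersects")

print(f"Tracts touched by the storm: {affected['GEOID'].nunique()}")
print(f"Rough population affected:  {affected.drop_duplicates('GEOID')['POP'].sum():,}")
```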
In January, President Obama asked me to lead a wide-ranging review of “big data” and privacy—to explore how these technologies are changing our economy, our government, and our society, and to consider their implications for our personal privacy. Together with Secretary of Commerce Penny Pritzker, Secretary of Energy Ernest Moniz, the President’s Science Advisor John Holdren, the President’s Economic Advisor Jeff Zients, and other senior officials, our review sought to understand what is genuinely new and different about big data and to consider how best to encourage the potential of these technologies while minimizing risks to privacy and core American values.
Over the course of 90 days, we met with academic researchers and privacy advocates, with regulators and the technology industry, with advertisers and civil rights groups. The President’s Council of Advisors for Science and Technology conducted a parallel study of the technological trends underpinning big data. The White House Office of Science and Technology Policy jointly organized three university conferences at MIT, NYU, and U.C. Berkeley. We issued a formal Request for Information seeking public comment, and hosted a survey to generate even more public input.
Today, we presented our findings to the President. We knew better than to try to answer every question about big data in three months. But we are able to draw important conclusions and make concrete recommendations for Administration attention and policy development in a few key areas.
There are a few technological trends that bear drawing out. The declining cost of collection, storage, and processing of data, combined with new sources of data like sensors, cameras, and geospatial technologies, mean that we live in a world of near-ubiquitous data collection. All this data is being crunched at a speed that is increasingly approaching real-time, meaning that big data algorithms could soon have immediate effects on decisions being made about our lives.
The big data revolution presents incredible opportunities in virtually every sector of the economy and every corner of society.
Big data is saving lives. Infections are dangerous—even deadly—for many babies born prematurely. By collecting and analyzing millions of data points from a NICU, one study was able to identify factors, like slight increases in body temperature and heart rate, that serve as early warning signs an infection may be taking root—subtle changes that even the most experienced doctors wouldn’t have noticed on their own.
Big data is making the economy work better. Jet engines and delivery trucks now come outfitted with sensors that continuously monitor hundreds of data points and send automatic alerts when maintenance is needed. Utility companies are starting to use big data to predict periods of peak electric demand, adjusting the grid to be more efficient and potentially averting brown-outs.
Big data is making government work better and saving taxpayer dollars. The Centers for Medicare and Medicaid Services have begun using predictive analytics—a big data technique—to flag likely instances of reimbursement fraud before claims are paid. The Fraud Prevention System helps identify the highest-risk health care providers for waste, fraud, and abuse in real time and has already stopped, prevented, or identified $115 million in fraudulent payments.
But big data raises serious questions, too, about how we protect our privacy and other values in a world where data collection is increasingly ubiquitous and where analysis is conducted at speeds approaching real time. In particular, our review raised the question of whether the “notice and consent” framework, in which a user grants permission for a service to collect and use information about them, still allows us to meaningfully control our privacy as data about us is increasingly used and reused in ways that could not have been anticipated when it was collected.
Big data raises other concerns, as well. One significant finding of our review was the potential for big data analytics to lead to discriminatory outcomes and to circumvent longstanding civil rights protections in housing, employment, credit, and the consumer marketplace.
No matter how quickly technology advances, it remains within our power to ensure that we both encourage innovation and protect our values through law, policy, and the practices we encourage in the public and private sector. To that end, we make six actionable policy recommendations in our report to the President:
Advance the Consumer Privacy Bill of Rights. Consumers deserve clear, understandable, reasonable standards for how their personal information is used in the big data era. We recommend the Department of Commerce take appropriate consultative steps to seek stakeholder and public comment on what changes, if any, are needed to the Consumer Privacy Bill of Rights, first proposed by the President in 2012, and to prepare draft legislative text for consideration by stakeholders and submission by the President to Congress.
Pass National Data Breach Legislation. Big data technologies make it possible to store significantly more data, and further derive intimate insights into a person’s character, habits, preferences, and activities. That makes the potential impacts of data breaches at businesses or other organizations even more serious. A patchwork of state laws currently governs requirements for reporting data breaches. Congress should pass legislation that provides for a single national data breach standard, along the lines of the Administration’s 2011 Cybersecurity legislative proposal.
Extend Privacy Protections to non-U.S. Persons. Privacy is a worldwide value that should be reflected in how the federal government handles personally identifiable information about non-U.S. citizens. The Office of Management and Budget should work with departments and agencies to apply the Privacy Act of 1974 to non-U.S. persons where practicable, or to establish alternative privacy policies that apply appropriate and meaningful protections to personal information regardless of a person’s nationality.
Ensure Data Collected on Students in School is used for Educational Purposes. Big data and other technological innovations, including new online course platforms that provide students real-time feedback, promise to transform education by personalizing learning. At the same time, the federal government must ensure that educational data gathered in school and linked to individual students is used for educational purposes, and protect students against their data being shared or used inappropriately.
Expand Technical Expertise to Stop Discrimination. The detailed personal profiles held about many consumers, combined with automated, algorithm-driven decision-making, could lead—intentionally or inadvertently—to discriminatory outcomes, or what some are already calling “digital redlining.” The federal government’s lead civil rights and consumer protection agencies should expand their technical expertise to be able to identify practices and outcomes facilitated by big data analytics that have a discriminatory impact on protected classes, and develop a plan for investigating and resolving violations of law.
Amend the Electronic Communications Privacy Act. The laws that govern protections afforded to our communications were written before email, the internet, and cloud computing came into wide use. Congress should amend ECPA to ensure the standard of protection for online, digital content is consistent with that afforded in the physical world—including by removing archaic distinctions between email left unread or over a certain age.
We also identify several broader areas ripe for further study, debate, and public engagement that, collectively, we hope will spark a national conversation about how to harness big data for the public good. We conclude that we must find a way to preserve our privacy values in both the domestic and international marketplace. We urgently need to build capacity in the federal government to identify and prevent new modes of discrimination that could be enabled by big data. We must ensure that law enforcement agencies using big data technologies do so responsibly, and that our fundamental privacy rights remain protected. Finally, we recognize that data is a valuable public resource, and call for continuing the Administration’s efforts to open more government data sources and make investments in research and technology.
While big data presents new challenges, it also presents immense opportunities to improve lives, and the United States is perhaps better suited to lead this conversation than any other nation on earth. Our innovative spirit, technological know-how, and deep commitment to values of privacy, fairness, non-discrimination, and self-determination will help us harness the benefits of the big data revolution and encourage the free flow of information while working with our international partners to protect personal privacy. This review is but one piece of that effort, and we hope it spurs a conversation about big data across the country and around the world.
Read the Big Data Report.
See the fact sheet from today’s announcement.

Saving Big Data from Big Mouths


Cesar A. Hidalgo in Scientific American: “It has become fashionable to bad-mouth big data. In recent weeks the New York Times, Financial Times, Wired and other outlets have all run pieces bashing this new technological movement. To be fair, many of the critiques have a point: There has been a lot of hype about big data and it is important not to inflate our expectations about what it can do.
But little of this hype has come from the actual people working with large data sets. Instead, it has come from people who see “big data” as a buzzword and a marketing opportunity—consultants, event organizers and opportunistic academics looking for their 15 minutes of fame.
Most of the recent criticism, however, has been weak and misguided. Naysayers have been attacking straw men, focusing on worst practices, post hoc failures and secondary sources. The common theme has been to a great extent obvious: “Correlation does not imply causation,” and “data has biases.”
Critics of big data have been making three important mistakes:
First, they have misunderstood big data, framing it narrowly as a failed revolution in social science hypothesis testing. In doing so they ignore areas where big data has made substantial progress, such as data-rich Web sites, information visualization and machine learning. If there is one group of big-data practitioners that the critics should worship, they are the big-data engineers building the social media sites where their platitudes spread. Engineering a site rich in data, like Facebook, YouTube, Vimeo or Twitter, is extremely challenging. These sites are possible because of advances made quietly over the past five years, including improvements in database technologies and Web development frameworks.
Big data has also contributed to machine learning and computer vision. Thanks to big data, Facebook algorithms can now match faces almost as accurately as humans do.
And detractors have overlooked big data’s role in the proliferation of computational design, data journalism and new forms of artistic expression. Computational artists, journalists and designers—the kinds of people who congregate at meetings like Eyeo—are using huge sets of data to give us online experiences that are unlike anything we experienced on paper. If we step away from hypothesis testing, we find that big data has made big contributions.
The second mistake critics often make is to confuse the limitations of prototypes with fatal flaws. This is something I have experienced often. For example, in Place Pulse—a project I created with my team at the M.I.T. Media Lab—we used Google Street View images and crowdsourced visual surveys to map people’s perception of a city’s safety and wealth. The original method was rife with limitations that we dutifully acknowledged in our paper. Google Street View images are taken at arbitrary times of the day and show cities from the perspective of a car. City boundaries were also arbitrary. To overcome these limitations, however, we needed a first data set. Producing that first limited version of Place Pulse was a necessary part of the process of making a working prototype.
A year has passed since we published Place Pulse’s first data set. Now, thanks to our focus on “making,” we have computer vision and machine-learning algorithms that we can use to correct for some of these easy-to-spot distortions. Making is allowing us to correct for time of the day and dynamically define urban boundaries. Also, we are collecting new data to extend the method to new geographical boundaries.
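For readers curious how crowdsourced visual surveys become a map, here is a minimal sketch of turning pairwise “which looks safer?” votes into per-image scores. It is an illustrative win-rate calculation under assumed data, not Place Pulse’s published method.

```python
# Aggregate hypothetical pairwise votes into a crude "perceived safety" score.
from collections import defaultdict

votes = [  # hypothetical (winner_id, loser_id) pairs from "which looks safer?"
    ("img_a", "img_b"),
    ("img_a", "img_c"),
    ("img_c", "img_b"),
    ("img_b", "img_a"),
]

wins = defaultdict(int)
appearances = defaultdict(int)
for winner, loser in votes:
    wins[winner] += 1
    appearances[winner] += 1
    appearances[loser] += 1

# Win rate is a rough score; careful treatments also correct for how often
# and against which rivals each image was shown.
for image in sorted(appearances, key=lambda i: wins[i] / appearances[i], reverse=True):
    print(f"{image}: {wins[image] / appearances[image]:.2f}")
```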
Those who fail to understand that the process of making is iterative are in danger of being too quick to condemn promising technologies. In 1920 the New York Times published a prediction that a rocket would never be able to leave the atmosphere. Similarly erroneous predictions were made about the car or, more recently, about the iPhone’s market share. In 1969 the Times had to publish a retraction of their 1920 claim. What similar retractions will need to be published in the year 2069?
Finally, the doubters have relied too heavily on secondary sources. For instance, they made a piñata out of the 2008 Wired piece by Chris Anderson framing big data as “the end of theory.” Others have criticized projects for claims that their creators never made. A couple of weeks ago, for example, Gary Marcus and Ernest Davis published a piece on big data in the Times. There they wrote about another of my group’s projects, Pantheon, which is an effort to collect, visualize and analyze data on historical cultural production. Marcus and Davis wrote that Pantheon “suggests a misleading degree of scientific precision.” As an author of the project, I have been unable to find where I made such a claim. Pantheon’s method section clearly states: “Pantheon will always be—by construction—an incomplete resource.” That same section contains a long list of limitations and caveats as well as the statement that “we interpret this data set narrowly, as the view of global cultural production that emerges from the multilingual expression of historical figures in Wikipedia as of May 2013.”
Bickering is easy, but it is not of much help. So I invite the critics of big data to lead by example. Stop writing op-eds and start developing tools that improve on the state of the art. They are much appreciated. What we need are projects that are worth imitating and that we can build on, not obvious advice such as “correlation does not imply causation.” After all, true progress is not something that is written, but made.”

Looking for the Needle in a Stack of Needles: Tracking Shadow Economic Activities in the Age of Big Data


Manju Bansal in MIT Technology Review: “The undocumented guys hanging out in the home-improvement-store parking lot looking for day labor, the neighborhood kids running a lemonade stand, and Al Qaeda terrorists plotting to do harm all have one thing in common: They operate in the underground economy, a shadowy zone where businesses, both legitimate and less so, transact in the currency of opportunity, away from traditional institutions and their watchful eyes.
One might think that this alternative economy is limited to markets that are low on the Transparency International rankings (sub-Saharan Africa and South Asia, for instance). However, a recent University of Wisconsin report estimates the value of the underground economy in the United States at about $2 trillion, or roughly 15% of total U.S. GDP. And a 2013 study coauthored by Friedrich Schneider, a noted authority on global shadow economies, estimated the European Union’s underground economy at more than 18% of GDP, or a whopping 2.1 trillion euros. More than two-thirds of the underground activity came from the most developed countries, including Germany, France, Italy, Spain, and the United Kingdom.
Underground economic activity is a multifaceted phenomenon, with implications across the board for national security, tax collections, public-sector services, and more. It includes the activity of any business that relies primarily on old-fashioned cash for most transactions — ranging from legitimate businesses (including lemonade stands) to drug cartels and organized crime.
Though it’s often soiled, heavy to lug around, and easy to lose to theft, cash is still king simply because it is so easy to hide from the authorities. With the help of the right bank or financial institution, “dirty” money can easily be laundered and come out looking fresh and clean, or at least legitimate. Case in point is the global bank HSBC, which agreed to pay U.S. regulators $1.9 billion in fines to settle charges of money laundering on behalf of Mexican drug cartels. According to a U.S. Senate subcommittee report, that process involved transferring $7 billion in cash from the bank’s branches in Mexico to those in the United States. Just for reference, each $100 bill weighs one gram, so to transfer $7 billion, HSBC had to physically transport 70 metric tons of cash across the U.S.-Mexican border.
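For anyone who wants to see the unit conversion, the figures as stated in the article work out as follows.

```python
# Quick check of the arithmetic: $7 billion in $100 bills at one gram per bill.
total_cash = 7_000_000_000        # dollars moved, per the Senate report
bill_value = 100                  # denomination
bill_weight_g = 1.0               # grams per $100 bill, as stated

bills = total_cash / bill_value               # 70 million notes
weight_tonnes = bills * bill_weight_g / 1e6   # grams -> metric tons
print(f"{bills:,.0f} bills, about {weight_tonnes:,.0f} metric tons")  # ~70 t
```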
The Financial Action Task Force, an intergovernmental body established in 1989, has estimated the total amount of money laundered worldwide to be around 2% to 5% of global GDP. Many of these transactions seem, at first glance, to be perfectly legitimate. Therein lies the conundrum for a banker or a government official: How do you identify, track, control, and, one hopes, prosecute money launderers, when they are hiding in plain sight and their business is couched in networked layers of perfectly defensible legitimacy?
Enter big-data tools, such as those provided by SynerScope, a Holland-based startup that is a member of the SAP Startup Focus program. This company’s solutions help unravel the complex networks hidden behind the layers of transactions and interactions.
Networks, good or bad, are near omnipresent in almost any form of organized human activity and particularly in banking and insurance. SynerScope takes data from both structured and unstructured data fields and transforms these into interactive computer visuals that display graphic patterns that humans can use to quickly make sense of information. Spotting deviations in complex networked processes can easily be put to use in fraud detection for insurance, banking, e-commerce, and forensic accounting.
SynerScope’s approach to big-data business intelligence is centered on data-intensive computation and visualization that extend the human “sense-making” capacity in much the same way that a telescope or microscope extends human vision.
To understand how SynerScope helps authorities track and halt money laundering, it’s important to understand how the networked laundering process works. It typically involves three stages.
1. In the initial, or placement, stage, launderers introduce their illegal profits into the financial system. This might be done by breaking up large amounts of cash into less-conspicuous smaller sums that are then deposited directly into a bank account, or by purchasing a series of monetary instruments (checks, money orders) that are then collected and deposited into accounts at other locations.
2. After the funds have entered the financial system, the launderer commences the second stage, called layering, which uses a series of conversions or transfers to distance the funds from their sources. The funds might be channeled through the purchase and sales of investment instruments, or the launderer might simply wire the funds through a series of accounts at various banks worldwide. 
Such use of widely scattered accounts for laundering is especially prevalent in those jurisdictions that do not cooperate in anti-money-laundering investigations. Sometimes the launderer disguises the transfers as payments for goods or services.
3. Having successfully processed the criminal profits through the first two phases, the launderer then proceeds to the third stage, integration, in which the funds re-enter the legitimate economy. The launderer might invest the funds in real estate, luxury assets, or business ventures.
Current detection tools compare individual transactions against preset profiles and rules. Sophisticated criminals quickly learn how to make their illicit transactions look normal for such systems. As a result, rules and profiles need constant and costly updating.
But SynerScope’s flexible visual analysis uses a network angle to detect money laundering. It shows the structure of the entire network with data coming in from millions of transactions, a structure that launderers cannot control. With just a few mouse clicks, SynerScope’s relation and sequence views reveal structural interrelationships and interdependencies. When those patterns are mapped on a time scale, it becomes virtually impossible to hide abnormal flows.”
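To make the “network angle” concrete, here is a minimal sketch of transaction-graph analysis in Python. It is illustrative only: SynerScope ships a visual-analytics product, not a script like this, and the transactions and thresholds below are invented.

```python
# Build a directed graph of transfers and flag accounts with a wide fan-in of
# just-under-threshold deposits that are then forwarded onward -- a crude
# signature of the placement and layering stages described above.
import networkx as nx

transactions = [  # hypothetical (sender, receiver, amount) records
    ("acct_1", "hub", 9500), ("acct_2", "hub", 9400), ("acct_3", "hub", 9700),
    ("acct_4", "hub", 9200), ("hub", "offshore_co", 37000),
    ("client_a", "retailer", 120), ("client_b", "retailer", 85),
]

G = nx.DiGraph()
for sender, receiver, amount in transactions:
    if G.has_edge(sender, receiver):
        G[sender][receiver]["amount"] += amount
    else:
        G.add_edge(sender, receiver, amount=amount)

for node in G.nodes:
    incoming = [G[u][node]["amount"] for u in G.predecessors(node)]
    outgoing = sum(G[node][v]["amount"] for v in G.successors(node))
    if len(incoming) >= 3 and all(8000 <= a < 10000 for a in incoming) and outgoing > 0:
        print(f"review {node}: {len(incoming)} small deposits in, {outgoing} forwarded out")
```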

Using data to treat the sickest and most expensive patients


Dan Gorenstein for Marketplace (radio):  “Driving to a big data conference a few weeks back, Dr. Jeffrey Brenner brought his compact SUV to a full stop – in the middle of a short highway entrance ramp in downtown Philadelphia…

Here’s what you need to know about Dr. Jeffrey Brenner: He really likes to figure out how things work. And he’s willing to go to extremes to do it – so far that he’s risking his health policy celebrity status.
Perhaps it’s not the smartest move from a guy who just last fall was named a MacArthur Genius, but this month, Brenner began to test his theory for treating some of the sickest and most expensive patients.
“We can actually take the sickest and most complicated patients, go to their bedside, go to their home, go with them to their appointments and help them for about 90 days and dramatically improve outcomes and reduce cost,” he says.
That’s the theory anyway. Like many ideas when it comes to treating the sickest patients, there’s little data to back up that it works.
Brenner’s willing to risk his reputation precisely because he’s not positive his approach for treating folks who cycle in and out of the healthcare system — “super-utilizers” — actually works.
“It’s really easy for me at this point having gotten a MacArthur award to simply declare what we do works and to drive this work forward without rigorously testing it,” Brenner said. “We are not going to do that,” he said. “We don’t think that’s the right thing to do. So we are going to do a randomized controlled trial on our work and prove whether it works and how well it works.”
Helping lower costs and improve care for the super-utilizers is one of the most pressing policy questions in healthcare today. And given its importance, there is a striking lack of data in the field.
People like to call randomized controlled trials (RCTs) the gold standard of scientific testing because two groups are randomly assigned – one gets the treatment, while the other doesn’t – and researchers closely monitor differences.
But a 2012 British Medical Journal article found that, over the previous 25 years, a total of six RCTs had focused on care delivery for super-utilizers.
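The logic of such a trial can be sketched in a few lines of Python. The numbers below are entirely synthetic stand-ins for an outcome such as hospital admissions; they have nothing to do with Brenner’s actual study.

```python
# Randomly assign patients to intervention or usual care, then compare means.
import random
import statistics

random.seed(0)
patients = list(range(200))
random.shuffle(patients)
treatment, control = patients[:100], patients[100:]

def admissions(group, mean_rate):
    """Hypothetical admissions per patient, drawn from made-up rates."""
    return [max(0, round(random.gauss(mean_rate, 1.5))) for _ in group]

treated_outcomes = admissions(treatment, mean_rate=2.1)
control_outcomes = admissions(control, mean_rate=2.8)

diff = statistics.mean(treated_outcomes) - statistics.mean(control_outcomes)
print(f"difference in mean admissions (treatment - control): {diff:.2f}")
# Because assignment was random, a clear and persistent difference can be
# attributed to the intervention rather than to who chose to enroll.
```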

…Every major health insurance company – Medicare and Medicaid, too – has spent billions on programs for super-utilizers. The absence of rigorous evidence raises the question: Is all this effort built on health policy quicksand?
Not being 100 percent sure can be dangerous, says Duke behavioral scientist Peter Ubel, particularly in healthcare.
Ubel said that back in the 1980s and ’90s, doctors prescribed certain drugs for irregular heartbeats. The medication, he said, made those weird rhythms go away, leaving beautiful-looking EKGs.
“But no one had tested whether people receiving these drugs actually lived longer, and many people thought, ‘Why would you do that? We can look at their cardiogram and see that they’re getting better,’” Ubel said. “Finally when somebody put that evidence to the test of a randomized trial, it turned out that these drugs killed people.”
WellPoint’s Nussbaum said he hoped Brenner’s project would inspire others to follow his lead and insert data into the discussion.
“I believe more people should be bold in challenging the status quo of our delivery system,” Nussbaum said. “The Jeff Brenners of the world should be embraced. We should be advocating for them to take on these studies.”
So why aren’t more healthcare luminaries putting their brilliance to the test? There are a couple of reasons.
Harvard economist Kate Baicker said until now there have been few personal incentives pushing people.
“If you’re focused on branding and spreading your brand, you have no incentive to say, ‘How good is my brand after all?’” she said.
And Venrock healthcare venture capitalist Bob Kocher said no one would fault Brenner if he put his brand before science, an age-old practice in this business.
“Healthcare has benefitted from the fact that you don’t understand it. It’s a bit of an art, and it hasn’t been a science,” he said. “You made money in healthcare by putting a banner outside your building saying you are a top something without having to justify whether you really are top at whatever you do.”
Duke’s Ubel said it’s too easy – and frankly, wrong – to say the main reason doctors avoid these rigorous studies is because they’re afraid to lose money and status. He said doctors aren’t immune from the very human trap of being sure their own ideas are right.
He says psychologists call it confirmation bias.
“Everything you see is filtered through your hopes, your expectations and your pre-existing beliefs,” Ubel said. “And that’s why I might look at a grilled cheese sandwich and see a grilled cheese sandwich and you might see an image of Jesus,” he said.
Even with all these hurdles, MIT economist Amy Finkelstein – who is running the RCT with Brenner – sees change coming.
“Providers have a lot more incentive now than they used to,” she said. “They have much more skin in the game.”
Finkelstein said hospital readmission penalties and new ways to pay doctors are bringing market incentives that have long been missing.
Brenner said he accepts that the truth of what he’s doing in Camden may be messier than the myth.

Digital Humanitarians


New book by Patrick Meier on how big data is changing humanitarian response: “The overflow of information generated during disasters can be as paralyzing to humanitarian response as the lack of information. This flash flood of information, when amplified by social media and satellite imagery, is increasingly referred to as Big Data—or Big Crisis Data. Making sense of Big Crisis Data during disasters is proving an impossible challenge for traditional humanitarian organizations, which explains why they’re increasingly turning to Digital Humanitarians.
Who exactly are these Digital Humanitarians? They’re you, me, all of us. Digital Humanitarians are volunteers and professionals from the world over and from all walks of life. What do they share in common? The desire to make a difference, and they do that by rapidly mobilizing online in collaboration with international humanitarian organizations. They make sense of vast volumes of social media and satellite imagery in virtually real-time to support relief efforts worldwide. How? They craft and leverage ingenious crowdsourcing solutions with trail-blazing insights from artificial intelligence.
In sum, this book charts the sudden and spectacular rise of Digital Humanitarians by sharing their remarkable, real-life stories, highlighting how their humanity coupled with innovative solutions to Big Data is changing humanitarian response forever. Digital Humanitarians will make you think differently about what it means to be humanitarian and will invite you to join the journey online.
Click here to be notified when the book becomes available. For speaking requests, please email Speaking@iRevolution.net.”

Passage Of The DATA Act Is A Major Advance In Government Transparency


OpEd by Hudson Hollister in Forbes: “Even as the debate over official secrecy grows on Capitol Hill, basic information about our government’s spending remains hidden in plain sight.
Information that is technically public — federal finance, awards, and expenditures — is effectively locked within a disconnected disclosure system that relies on outdated paper-based technology. Budgets, grants, contracts, and disbursements are reported manually and separately, using forms and spreadsheets. Researchers seeking insights into federal spending must invest time and resources crafting data sets out of these documents. Without common data standards across all government spending, analyses of cross-agency spending trends require endless conversions of apples to oranges.
For a nation whose tech industry leads the world, there is no reason to allow this antiquated system to persist.
That’s why we’re excited to welcome Thursday’s unanimous Senate approval of the Digital Accountability and Transparency Act — known as the DATA Act.
The DATA Act will mandate government-wide standards for federal spending data. It will also require agencies to publish this information online, fully searchable and open to everyone.
Watchdogs and transparency advocates from across the political spectrum have endorsed the DATA Act because all Americans will benefit from clear, accessible information about how their tax dollars are being spent.
It is darkly appropriate that the only organized opposition to this bill took place behind closed doors. In January, Senate sponsors Mark Warner (D-VA) and Rob Portman (R-OH) rejected amendments offered privately by the White House Office of Management and Budget. These nonpublic proposals would have gutted the DATA Act’s key data standards requirement. But Warner and Portman went public with their opposition, and Republicans and Democrats agreed to keep a strong standards mandate.
We now await swift action by the House of Representatives to pass this bill and put it on the President’s desk.
The tech industry is already delivering the technology and expertise that will use federal spending data, once it is open and standardized, to solve problems.
If the DATA Act is fully enforced, citizens will be able to track government spending on a particular contractor or from a particular program, payment by payment. Agencies will be able to deploy sophisticated Big Data analytics to illuminate, and eliminate, waste and fraud. And states and universities will be able to automate their complex federal grant reporting tasks, freeing up more tax dollars for their intended use. Our industry can perform these tasks — as soon as we get the data.
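As an illustration of what that looks like in practice, here is a minimal sketch of the kind of query standardized spending data would make routine. The CSV layout is a hypothetical placeholder, not the DATA Act’s actual schema.

```python
# Total payments by recipient from a (hypothetical) standardized payments file.
import csv
from collections import defaultdict

totals = defaultdict(float)
with open("federal_payments.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):  # columns: agency, program, recipient, amount, date
        totals[row["recipient"]] += float(row["amount"])

# The ten largest recipients across all agencies, payment by payment.
for recipient, total in sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(f"{recipient:<40} ${total:,.2f}")
```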
Chairman Earl Devaney’s Recovery Accountability and Transparency Board proved this is possible. Starting in 2009, the Recovery Board applied data standards to track stimulus spending. Our members’ software used that data to help inspectors general prevent and recover over $100 million in spending on suspicious grantees and contractors. The DATA Act applies that approach across the whole of government spending.
Congress is now poised to pass this landmark legislative mandate to transform spending from disconnected documents into open data. Next, the executive branch must implement that mandate.
So our Coalition’s work continues. We will press the Treasury Department and the White House to adopt robust, durable, and nonproprietary data standards for federal spending.
And we won’t stop with spending transparency. The American people deserve access to open data across all areas of government activity — financial regulatory reporting, legislative actions, judicial filings, and much more….”