Humanitarians in the sky

Patrick Meier in the Guardian: “Unmanned aerial vehicles (UAVs) capture images faster, cheaper, and at a far higher resolution than satellite imagery. And as John DeRiggi speculates in “Drones for Development?” these attributes will likely lead to a host of applications in development work. In the humanitarian field that future is already upon us — so we need to take a rights-based approach to advance the discussion, improve coordination of UAV flights, and to promote regulation that will ensure safety while supporting innovation.
It was the unprecedentedly widespread use of civilian UAVs following typhoon Haiyan in the Philippines that opened my eyes to UAV use in post-disaster settings. I was in Manila to support the United Nations’ digital humanitarian efforts and came across new UAV projects every other day.
One team was flying rotary-wing UAVs to search for survivors among vast fields of debris that were otherwise inaccessible. Another flew fixed-wing UAVs around Tacloban to assess damage and produce high-quality digital maps. Months later, UAVs are still being used to support recovery and preparedness efforts. One group is working with local mayors to identify which communities are being overlooked in the reconstruction.
Humanitarian UAVs are hardly new. As far back as 2007, the World Food Program teamed up with the University of Torino to build humanitarian UAVs. But today UAVs are much cheaper, safer, and easier to fly. This means more people own personal UAVs. The distinguishing feature between these small UAVs and traditional remote control airplanes or helicopters is that they are intelligent. Most can be programmed to fly and land autonomously at designated locations. Newer UAVs also have on-board, flight-stabilization features that automatically adapt to changing winds, automated collision avoidance systems, and standard fail-safe mechanisms.
While I was surprised by the surge in UAV projects in the Philippines, I was troubled that none of these teams were aware of each other and that most were apparently not sharing their imagery with local communities. What happens when even more UAV teams show up following future disasters? Will they be accompanied by droves of drone journalists and “disaster tourists” equipped with personal UAVs? Will we see thousands of aerial disaster pictures and videos uploaded to social media rather than in the hands of local communities? What are the privacy implications? And what about empowering local communities to deploy their own UAVs?
There were many questions but few answers. So I launched the humanitarian UAV network (UAViators) to bridge the worlds of humanitarian professionals and UAV experts to address these questions. Our first priority was to draft a code of conduct for the use of UAVs in humanitarian settings to hold ourselves accountable while educating new UAV pilots before serious mistakes are made…”

Lessons in Mass Collaboration

Elizabeth Walker, Ryan Siegel, Todd Khozein, Nick Skytland, Ali Llewellyn, Thea Aldrich, and Michael Brennan in the Stanford Social Innovation Review: “Significant advances in technology in the last two decades have opened possibilities to engage the masses in ways impossible to imagine centuries ago. Beyond coordination, today’s technological capability permits organizations to leverage and focus public interest, talent, and energy through mass collaborative engagement to better understand and solve today’s challenges. And given the rising public awareness of a variety of social, economic, and environmental problems, organizations have seized the opportunity to leverage and lead mass collaborations in the form of hackathons.
Hackathons emerged in the mid-2000s as a popular approach to leverage the expertise of large numbers of individuals to address social issues, often through the creation of online technological solutions. Having led hundreds of mass collaboration initiatives for organizations around the world in diverse cultural contexts, we at SecondMuse offer the following lessons as a starting point for others interested in engaging the masses, as well as challenges others may face.

What Mass Collaboration Looks Like

An early example of a mass collaborative endeavor was Random Hacks of Kindness (RHoK), which formed in 2009. RHoK was initially developed in collaboration with Google, Microsoft, Yahoo!, NASA, the World Bank, and later, HP as a volunteer mobilization effort; it aimed to build technology that would enable communities to respond better to crises such as natural disasters. In 2012, nearly 1,000 participants attended 30 events around the world to address 176 well-defined problems.
In 2013, NASA and SecondMuse led the International Space Apps Challenge, which engaged six US federal agencies, 400 partner institutions, and 9,000 global citizens through a variety of local and global team configurations; it aimed to address 58 different challenges to improve life on Earth and in space. In Athens, Greece, for example, in direct response to the challenge of creating a space-deployable greenhouse, a team developed a modular spinach greenhouse designed to survive the harsh Martian climate. Two months later, 11,000 citizens across 95 events participated in the National Day of Civic Hacking in 83 different US cities, ultimately contributing about 150,000 person-hours and addressing 31 federal and several state and local challenges over a single weekend. One result was Keep Austin Fed from Austin, Texas, which leveraged local data to coordinate food donations for those in need.
Strong interest on the part of institutions and an enthusiastic international community have paved the way for follow-up events in 2014.

Benefits of Mass Collaboration

The benefits of this approach to problem-solving are many, including:

  • Incentivizing the use of government data. As institutions push to make data available to the public, mass collaboration can increase the usefulness of that data by creating products from it, as well as inform and streamline future data collection processes.
  • Increasing transparency. Engaging citizens in the process of addressing public concerns educates them about the work that institutions do and advances efforts to meet public expectations of transparency.
  • Increasing outcome ownership. When people engage in a collaborative process of problem solving, they naturally have a greater stake in the outcome. Put simply, the more people who participate in the process, the greater the sense of community ownership. Also, when spearheading new policies or initiatives, the support of a knowledgeable community can be important to long-term success.
  • Increasing awareness. Engaging the populace in addressing challenges of public concern increases awareness of issues and helps develop an active citizenry. As a result, improved public perception and license to operate bolster governmental and non-governmental efforts to address challenges.
  • Saving money. By providing data and structures to the public, and allowing them to build and iterate on plans and prototypes, mass collaboration gives agencies a chance to harness the power of open innovation with minimal time and funds.
  • Harnessing cognitive surplus. The advent of online tools allowing for distributed collaboration enables citizens to use their free time incrementally toward collective endeavors that benefit local communities and the nation.

Challenges of Mass Collaboration

Although the benefits can be significant, agencies planning to lead mass collaborations should be aware of several challenges:

  • Investing time and effort. A mass collaboration is most effective when it is not a one-time event. The up-front investment in building a collaboration of supporting partner organizations, creating a robust framework for action, developing the necessary tools and defining the challenges, and investing in implementation and scaling of the most promising results all require substantial time to secure long-term commitment and strong relationships.
  • Forging an institution-community relationship. Throughout the course of most engagements, the power dynamic between the organization providing the frameworks and challenges and the groupings of individuals responding to the call to action can shift dramatically as the community incorporates the endeavor into their collective identity. Everyone involved should embrace this as they lay the foundation for self-sustaining mass collaboration communities. Once participants develop a firmly entrenched collective identity and sense of ownership, the convening organization can fully tap into its collective genius, as they can work together based on trust and shared vision. Without community ownership, organizers need to allot more time, energy, and resources to keep their initiative moving forward, and to battle against volunteer fatigue, diminished productivity, and substandard output.
  • Focusing follow-up. Turning a massive infusion of creative ideas, concepts, and prototypes into concrete solutions requires a process of focused follow-up. Identifying and nurturing the most promising seeds to fruition requires time, discrete skills, insight, and—depending on the solutions you scale—support from a variety of external organizations.
  • Understanding ROI. Any resource-intensive endeavor where only a few of numerous resulting products ever see the light of day demands deep consideration of what constitutes a reasonable return on investment. For mass collaborations, this means having an initial understanding of the potential tangible and intangible outcomes, and making a frank assessment of whether those outcomes meet the needs of the collaborators.

Technological developments in the last century have enabled relationships between individuals and institutions to blossom into a rich and complex tapestry…”

HHS releases new data and tools to increase transparency on hospital utilization and other trends

Press release: “With more than 2,000 entrepreneurs, investors, data scientists, researchers, policy experts, government employees and more in attendance, the Department of Health and Human Services (HHS) is releasing new data and launching new initiatives at the annual Health Datapalooza conference in Washington, D.C.
Today, the Centers for Medicare & Medicaid Services (CMS) is releasing its first annual update to the Medicare hospital charge data, or information comparing the average amount a hospital bills for services that may be provided in connection with a similar inpatient stay or outpatient visit. CMS is also releasing a suite of other data products and tools aimed to increase transparency about Medicare payments. The data trove on CMS’s website now includes inpatient and outpatient hospital charge data for 2012, and new interactive dashboards for the CMS Chronic Conditions Data Warehouse and geographic variation data. Also today, the Food and Drug Administration (FDA) will launch a new open data initiative. And before the end of the conference, the Office of the National Coordinator for Health Information Technology (ONC) will announce the winners of two data challenges.
“The release of these data sets furthers the administration’s efforts to increase transparency and support data-driven decision making which is essential for health care transformation,” said HHS Secretary Kathleen Sebelius.
“These public data resources provide a better understanding of Medicare utilization, the burden of chronic conditions among beneficiaries and the implications for our health care system and how this varies by where beneficiaries are located,” said Bryan Sivak, HHS chief technology officer. “This information can be used to improve care coordination and health outcomes for Medicare beneficiaries nationwide, and we are looking forward to seeing what the community will do with these releases. Additionally, the openFDA initiative being launched today will for the first time enable a new generation of consumer facing and research applications to embed relevant and timely data in machine-readable, API-based formats.”
2012 Inpatient and Outpatient Hospital Charge Data
The data posted today on the CMS website provide the first annual update of the hospital inpatient and outpatient data released by the agency last spring. The data include information comparing the average charges for services that may be provided in connection with the 100 most common Medicare inpatient stays at over 3,000 hospitals in all 50 states and Washington, D.C. Hospitals determine what they will charge for items and services provided to patients and these “charges” are the amount the hospital generally bills for those items or services.
With two years of data now available, researchers can begin to look at trends in hospital charges. For example, average charges for medical back problems increased nine percent from $23,000 to $25,000, but the total number of discharges decreased by nearly 7,000 from 2011 to 2012.
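The year-over-year comparison described above is a simple percent-change calculation. A minimal sketch in Python follows, using only the rounded figures quoted in the release; the function name `percent_change` is an illustrative assumption, not part of any CMS tool.

```python
def percent_change(old: float, new: float) -> float:
    """Return the relative change from old to new, in percent."""
    if old == 0:
        raise ValueError("old value must be non-zero")
    return (new - old) / old * 100.0


# Rounded average charges for medical back problems, as quoted in the release.
charge_2011 = 23_000
charge_2012 = 25_000

change = percent_change(charge_2011, charge_2012)
print(f"{change:.1f}%")  # prints 8.7%, which the release rounds to nine percent
```

Because the dollar amounts in the release are themselves rounded, the result only approximates the figure CMS would compute from the underlying data.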
In April, ONC launched a challenge – the Code-a-Palooza challenge – calling on developers to create tools that will help patients use the Medicare data to make health care choices. Fifty-six innovators submitted proposals and 10 finalists are presenting their applications during Datapalooza. The winning products will be announced before the end of the conference.
Chronic Conditions Warehouse and Dashboard
CMS recently released new and updated information on chronic conditions among Medicare fee-for-service beneficiaries, including:

  • Geographic data summarized to national, state, county, and hospital referral region levels for the years 2008-2012;
  • Data for examining disparities among specific Medicare populations, such as beneficiaries with disabilities, dual-eligible beneficiaries, and race/ethnic groups;
  • Data on prevalence, utilization of select Medicare services, and Medicare spending;
  • Interactive dashboards that provide customizable information about Medicare beneficiaries with chronic conditions at state, county, and hospital referral region levels for 2012; and
  • Chartbooks and maps.

These public data resources support the HHS Initiative on Multiple Chronic Conditions by providing researchers and policymakers a better understanding of the burden of chronic conditions among beneficiaries and the implications for our health care system.
Geographic Variation Dashboard
The Geographic Variation Dashboards present Medicare fee-for-service per-capita spending at the state and county levels in interactive formats. CMS calculated the spending figures in these dashboards using standardized dollars that remove the effects of the geographic adjustments that Medicare makes for many of its payment rates. The dashboards include total standardized per capita spending, as well as standardized per capita spending by type of service. Users can select the indicator and year they want to display. Users can also compare data for a given state or county to the national average. All of the information presented in the dashboards is also available for download from the Geographic Variation Public Use File.
Research Cohort Estimate Tool
CMS also released a new tool that will help researchers and other stakeholders estimate the number of Medicare beneficiaries with certain demographic profiles or health conditions. This tool can assist a variety of stakeholders interested in specific figures on Medicare enrollment. Researchers can also use this tool to estimate the size of their proposed research cohort and the cost of requesting CMS data to support their study.
Digital Privacy Notice Challenge
ONC, with the HHS Office for Civil Rights, will be awarding the winner of the Digital Privacy Notice Challenge during the conference. The winning products will help consumers get notices of privacy practices from their health care providers or health plans directly in their personal health records or from their providers’ patient portals.
The FDA’s new initiative, openFDA, is designed to facilitate easier access to large, important public health datasets collected by the agency. OpenFDA will make FDA’s publicly available data accessible in a structured, computer-readable format that will make it possible for technology specialists, such as mobile application creators, web developers, data visualization artists and researchers, to quickly search, query, or pull massive amounts of information on an as-needed basis. The initiative is the result of extensive research to identify FDA’s publicly available datasets that are often in demand, but traditionally difficult to use. Based on this research, openFDA is beginning with a pilot program involving millions of reports of drug adverse events and medication errors submitted to the FDA from 2004 to 2013. The pilot will later be expanded to include the FDA’s databases on product recalls and product labeling.
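To make the idea of searching adverse event reports "on an as-needed basis" concrete, here is a minimal Python sketch that only builds a query URL. The endpoint `https://api.fda.gov/drug/event.json`, the `search` and `limit` parameters, and the field `patient.drug.medicinalproduct` are assumptions drawn from the public openFDA API rather than from the release itself, and the helper name `build_event_query` is hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint for the drug adverse event data; not stated in the release.
OPENFDA_DRUG_EVENT_URL = "https://api.fda.gov/drug/event.json"


def build_event_query(drug_name: str, limit: int = 5) -> str:
    """Build a search URL for adverse event reports that mention a drug."""
    params = {
        # Assumed field name for the reported product; check the API docs.
        "search": f'patient.drug.medicinalproduct:"{drug_name}"',
        "limit": limit,
    }
    return f"{OPENFDA_DRUG_EVENT_URL}?{urlencode(params)}"


url = build_event_query("aspirin", limit=1)
print(url)
```

The sketch stops short of sending the request, so it runs without network access; fetching the URL with any HTTP client would be expected to return a JSON document.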

Data Mining Reddit Posts Reveals How to Ask For a Favor–And Get it

Emerging Technology From the arXiv: “There’s a secret to asking strangers for something and getting it. Now data scientists say they’ve discovered it by studying successful requests on the web.

One of the more extraordinary phenomena on the internet is the rise of altruism and of websites designed to enable it. The Random Acts of Pizza section of the Reddit website is a good example.

People leave messages asking for pizza which others fulfil if they find the story compelling. As the site says: “because… who doesn’t like helping out a stranger? The purpose is to have fun, eat pizza and help each other out. Together, we aim to restore faith in humanity, one slice at a time.”

A request might go something like this: “It’s been a long time since my mother and I have had proper food. I’ve been struggling to find any kind of work so I can supplement my mom’s social security… A real pizza would certainly lift our spirits”. Anybody can then fulfil the order which is then marked on the site with a badge saying “got pizza’d”, often with notes of thanks.

That raises an interesting question. What kinds of requests are most successful in getting a response? Today, we get an answer thanks to the work of Tim Althoff at Stanford University and a couple of pals who lift the veil on the previously murky question of how to ask for a favour—and receive it.

They analysed how various features might be responsible for the success of a post, such as its politeness, its sentiment (whether positive or negative, for example) and its length. The team also looked at the similarity of the requester to the benefactor, and at the status of the requester.

Finally, they examined whether the post contained evidence of need in the form of a narrative that described why the requester needed free pizza.

Althoff and co used a standard machine learning algorithm to comb through all the possible correlations in 70 per cent of the data, which they used for training. Having found various correlations, they tested to see whether this had predictive power in the remaining 30 per cent of the data. In other words, can their algorithm predict whether a previously unseen request will be successful or not?

It turns out that their algorithm makes a successful prediction about 70 per cent of the time. That’s far from perfect but much better than random guessing which is right only half the time.
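The evaluation procedure described above (fit on 70 per cent of the data, then measure predictions on the held-out 30 per cent) can be sketched in a few lines of Python. This is not Althoff and co's model or data: the toy rows are invented purely for illustration, and the "model" is a deliberately simple majority rule per feature value.

```python
import random


def split_70_30(rows, seed=0):
    """Shuffle rows and return (train, test) with a 70/30 split."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * 0.7)
    return shuffled[:cut], shuffled[cut:]


def fit_majority_by_feature(train):
    """For each feature value, remember the most common label in training."""
    counts = {}
    for feature, label in train:
        yes, total = counts.get(feature, (0, 0))
        counts[feature] = (yes + int(label), total + 1)
    return {f: yes * 2 >= total for f, (yes, total) in counts.items()}


def accuracy(model, rows, default=False):
    """Fraction of rows whose label the model predicts correctly."""
    hits = sum(model.get(f, default) == label for f, label in rows)
    return hits / len(rows)


# Toy data, invented for illustration only: (mentions_need, succeeded).
rng = random.Random(42)
rows = []
for _ in range(1000):
    mentions_need = rng.random() < 0.5
    p_success = 0.7 if mentions_need else 0.3
    rows.append((mentions_need, rng.random() < p_success))

train, test = split_70_30(rows)
model = fit_majority_by_feature(train)
held_out_accuracy = accuracy(model, test)
print(f"held-out accuracy: {held_out_accuracy:.2f}")
```

The point of the held-out set is that the model never sees those requests during fitting, so its accuracy there estimates how well it would predict genuinely new posts.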

So what kinds of factors are important? Narrative is a key part of many of the posts, so Althoff and co spent some time categorising the types of stories people use.

They divided the narratives into five types, those that mention: money; a job; being a student; family; and a final group that includes mentions of friends, being drunk, celebrating and so on, which Althoff and co call ‘craving’.

Of these, narratives about jobs, family and money increase the probability of success. Student narratives have no effect while craving narratives significantly reduce the chances of success. In other words, narratives that communicate a need are more successful than those that do not.

“We find that clearly communicating need through the narrative is essential,” say Althoff and co. And evidence of reciprocation helps too.

(Given these narrative requirements, it is not surprising that longer requests tend to be more successful than short ones.)

So for example, the following request was successful because it clearly demonstrates both need and evidence of reciprocation.

“My gf and I have hit some hard times with her losing her job and then unemployment as well for being physically unable to perform her job due to various hand injuries as a server in a restaurant. She is currently petitioning to have unemployment reinstated due to medical reasons for being unable to perform her job, but until then things are really tight and ANYTHING would help us out right now.

I’ve been both a giver and receiver in RAOP before and would certainly return the favor again when I am able to reciprocate. It took everything we have to pay rent today and some food would go a long ways towards making our next couple of days go by much better with some food.”

By contrast, the ‘craving’ narrative below demonstrates neither and was not successful.

“My friend is coming in town for the weekend and my friends and i are so excited because we haven’t seen him since junior high. we are going to a high school football game then to the dollar theater after and it would be so nice if someone fed us before we embarked :)”

Althoff and co also say that the status of the requester is an important factor. “We find that Reddit users with higher status overall (higher karma) or higher status within the subcommunity (previous posts) are significantly more likely to receive help,” they say.

But surprisingly, being polite does not help (except by offering thanks).

That’s interesting work. Until now, psychologists have never understood the factors that make requests successful, largely because it has always been difficult to separate the influence of the request from what is being requested.

The key here is that everybody making requests in this study wants the same thing—pizza. In one swoop, this makes the data significantly easier to tease apart.

An important line of future work will be in using this work to understand altruistic behaviour in other communities too…

Ref: How to Ask for a Favor: A Case Study on the Success of Altruistic Requests”

Three projects meet the European Job Challenge and receive the Social Innovation Prize

EU Press Release: “Social innovation can be a tool to create new or better jobs, while giving an answer to pressing challenges faced by Europe. Today, Michel Barnier, European Commissioner, has awarded three European Social Innovation prizes to ground-breaking ideas to create new types of work and address social needs. The winning projects aim to help disadvantaged women by employing them to create affordable and limited fashion collections, create jobs in the sector of urban farming, and convert abandoned social housing into learning spaces and entrepreneurship labs.

After the success of the first edition in 2013, the European Commission launched a second round of the Social Innovation Competition in memory of Diogo Vasconcelos. Its main goal was to invite Europeans to propose new solutions to answer The Job Challenge. The Commission received 1,254 ideas out of which three were awarded with a prize of €30,000 each.

Commissioner Michel Barnier said: “We believe that the winning projects can take advantage of unmet social needs and create sustainable jobs. I want these projects to be scaled up and replicated and inspire more social innovations in Europe. We need to tap into this potential to bring innovative solutions to the needs of our citizens and create new types of work.”

More information on the Competition page

More jobs for Europe – three outstanding ideas

The following new and exceptional ideas are the winners of the second edition of the European Social Innovation Competition:

  • ‘From waste to wow! QUID project’ (Italy): fashion business demands perfection, and slightly damaged textile cannot be used for top brands. The project intends to recycle this first quality waste into limited collections and thereby provide jobs to disadvantaged women. This is about creating highly marketable products and social value through recycling.

  • ‘Urban Farm Lease’ (Belgium): urban agriculture could provide 6,000 direct jobs in Brussels, and an additional 1,500 jobs considering indirect employment (distribution, waste management, training or events). The project aims at providing training, connection and consultancy so that unemployed people take advantage of the large surfaces available for agriculture in the city (e.g. 908 hectares of land or 394 hectares of suitable flat roofs).

  • ‘Voidstarter’ (Ireland): all major cities in Europe have “voids”, units of social housing which are empty because city councils have insufficient budgets to make them into viable homes. At the same time these cities also experience pressure with social housing provision and homelessness. Voidstarter will provide unemployed people with learning opportunities alongside skilled tradespersons in the refurbishing of the voids.”

The rise of open data driven businesses in emerging markets

Alla Morrison at the World Bank blog:

Key findings —

  • Many new data companies have emerged around the world in the last few years. Of these companies, the majority use some form of government data.
  • There are a large number of data companies in sectors with high social impact and tremendous development opportunities.
  • An actionable pipeline of data-driven companies exists in Latin America and in Asia. The most desired type of financing is equity, followed by quasi-equity in the amounts ranging from $100,000 to $5 million, with averages of between $2 and $3 million depending on the region. The total estimated need for financing may exceed $400 million.

“The economic value of open data is no longer a hypothesis
How can one make money with open data which is akin to air – free and open to everyone? Should the World Bank Group be in the catalyzer role for a sector that is just emerging? And if so, what set of interventions would be the most effective? Can promoting open data-driven businesses contribute to the World Bank Group’s twin goals of fighting poverty and boosting shared prosperity?
These questions have been top of mind since the World Bank Open Finances team convened a group of open data entrepreneurs from across Latin America to share their business models, success stories and challenges at the Open Data Business Models workshop in Uruguay in June 2013. We were in Uruguay to find out whether open data could lead to the creation of sustainable new businesses and jobs. To do so, we tested a couple of hypotheses: open data has economic value, beyond the benefits of increased transparency and accountability; and open data companies with sustainable business models already exist in emerging economies.
Encouraged by our findings in Uruguay we set out to further explore the economic development potential of open data, with a focus on:

  • Contribution of open data to countries’ GDP;
  • Innovative solutions to tackle social problems in key sectors like agriculture, health, education, transportation, climate change, financial services, especially those benefiting low income populations;
  • Economic benefits of governments’ buy-in into the commercial value of open data and resulting release of new datasets, which in turn would lead to increased transparency in public resource management (reductions in misallocations, a more level playing field in procurement) and better service delivery; and
  • Creation of data-related private sector jobs, especially suited for the tech savvy young generation.

We proposed a joint IFC/World Bank approach (From open data to development impact – the crucial role of private sector) that envisages providing financing to data-driven companies through a dedicated investment fund, as well as loans and grants to governments to create a favorable enabling environment. The concept was received enthusiastically for the most part by a wide group of peers at the Bank, the IFC, as well as NGOs, foundations, DFIs and private sector investors.
Thanks also in part to a McKinsey report last fall stating that open data could help unlock more than $3 trillion in value every year, the potential value of open data is now better understood. The acquisition of Climate Corporation (whose business model holds enormous potential for agriculture and food security, if governments open up the right data) for close to a billion dollars last November and the findings of the Open Data 500 project led by the GovLab at NYU further substantiated the hypothesis. These days no one asks whether open data has economic value; the focus has shifted to finding ways for companies, both startups and large corporations, and governments to unlock it. The first question though is – is it still too early to plan a significant intervention to spur open data driven economic growth in emerging markets?”

Continued Progress and Plans for Open Government Data

Steve VanRoekel and Todd Park at the White House: “One year ago today, President Obama signed an executive order that made open and machine-readable data the new default for government information. This historic step is helping to make government-held data more accessible to the public and to entrepreneurs while appropriately safeguarding sensitive information and rigorously protecting privacy.
Freely available data from the U.S. government is an important national resource, serving as fuel for entrepreneurship, innovation, scientific discovery, and economic growth. Making information about government operations more readily available and useful is also core to the promise of a more efficient and transparent government. This initiative is a key component of the President’s Management Agenda and our efforts to ensure the government is acting as an engine to expand economic growth and opportunity for all Americans. The Administration is committed to driving further progress in this area, including by designating Open Data as one of our key Cross-Agency Priority Goals.
Over the past few years, the Administration has launched a number of Open Data Initiatives aimed at scaling up open data efforts across the Health, Energy, Climate, Education, Finance, Public Safety, and Global Development sectors. The White House has also launched Project Open Data, designed to share best practices, examples, and software code to assist federal agencies with opening data. These efforts have helped unlock troves of valuable data—that taxpayers have already paid for—and are making these resources more open and accessible to innovators and the public.
Other countries are also opening up their data. In June 2013, President Obama and other G7 leaders endorsed the Open Data Charter, in which the United States committed to publish a roadmap for our nation’s approach to releasing and improving government data for the public.
Building upon the Administration’s Open Data progress, and in fulfillment of the Open Data Charter, today we are excited to release the U.S. Open Data Action Plan. The plan includes a number of exciting enhancements and new data releases planned in 2014 and 2015, including:

  • Small Business Data: The Small Business Administration’s (SBA) database of small business suppliers will be enhanced so that software developers can create tools to help manufacturers more easily find qualified U.S. suppliers, ultimately reducing the transaction costs to source products and manufacture domestically.
  • Smithsonian American Art Museum Collection: The Smithsonian American Art Museum’s entire digitized collection will be opened to software developers to make educational apps and tools. Today, even museum curators do not have easily accessible information about their art collections. This information will soon be available to everyone.
  • FDA Adverse Drug Event Data: Each year, healthcare professionals and consumers submit millions of individual reports on drug safety to the Food and Drug Administration (FDA). These anonymous reports are a critical tool to support drug safety surveillance. Today, this data is only available through limited quarterly reports. But the Administration will soon be making these reports available in their entirety so that software developers can build tools to help pull potentially dangerous drugs off shelves faster than ever before.

We look forward to implementing the U.S. Open Data Action Plan, and to continuing to work with our partner countries in the G7 to take the open data movement global”.

Findings of the Big Data and Privacy Working Group Review

John Podesta at the White House Blog: “Over the past several days, severe storms have battered Arkansas, Oklahoma, Mississippi and other states. Dozens of people have been killed and entire neighborhoods turned to rubble and debris as tornadoes have touched down across the region. Natural disasters like these present a host of challenges for first responders. How many people are affected, injured, or dead? Where can they find food, shelter, and medical attention? What critical infrastructure might have been damaged?
Drawing on open government data sources, including Census demographics and NOAA weather data, along with their own demographic databases, Esri, a geospatial technology company, has created a real-time map showing where the twisters have been spotted and how the storm systems are moving. They have also used these data to show how many people live in the affected area, and summarize potential impacts from the storms. It’s a powerful tool for emergency services and communities. And it’s driven by big data technology.
In January, President Obama asked me to lead a wide-ranging review of “big data” and privacy—to explore how these technologies are changing our economy, our government, and our society, and to consider their implications for our personal privacy. Together with Secretary of Commerce Penny Pritzker, Secretary of Energy Ernest Moniz, the President’s Science Advisor John Holdren, the President’s Economic Advisor Jeff Zients, and other senior officials, our review sought to understand what is genuinely new and different about big data and to consider how best to encourage the potential of these technologies while minimizing risks to privacy and core American values.
Over the course of 90 days, we met with academic researchers and privacy advocates, with regulators and the technology industry, with advertisers and civil rights groups. The President’s Council of Advisors for Science and Technology conducted a parallel study of the technological trends underpinning big data. The White House Office of Science and Technology Policy jointly organized three university conferences at MIT, NYU, and U.C. Berkeley. We issued a formal Request for Information seeking public comment, and hosted a survey to generate even more public input.
Today, we presented our findings to the President. We knew better than to try to answer every question about big data in three months. But we are able to draw important conclusions and make concrete recommendations for Administration attention and policy development in a few key areas.
There are a few technological trends that bear drawing out. The declining cost of collection, storage, and processing of data, combined with new sources of data like sensors, cameras, and geospatial technologies, means that we live in a world of near-ubiquitous data collection. All this data is being crunched at a speed that is increasingly approaching real-time, meaning that big data algorithms could soon have immediate effects on decisions being made about our lives.
The big data revolution presents incredible opportunities in virtually every sector of the economy and every corner of society.
Big data is saving lives. Infections are dangerous—even deadly—for many babies born prematurely. By collecting and analyzing millions of data points from a NICU, one study was able to identify factors, like slight increases in body temperature and heart rate, that serve as early warning signs an infection may be taking root—subtle changes that even the most experienced doctors wouldn’t have noticed on their own.
Big data is making the economy work better. Jet engines and delivery trucks now come outfitted with sensors that continuously monitor hundreds of data points and send automatic alerts when maintenance is needed. Utility companies are starting to use big data to predict periods of peak electric demand, adjusting the grid to be more efficient and potentially averting brown-outs.
Big data is making government work better and saving taxpayer dollars. The Centers for Medicare and Medicaid Services have begun using predictive analytics—a big data technique—to flag likely instances of reimbursement fraud before claims are paid. The Fraud Prevention System helps identify the highest-risk health care providers for waste, fraud, and abuse in real time and has already stopped, prevented, or identified $115 million in fraudulent payments.
But big data raises serious questions, too, about how we protect our privacy and other values in a world where data collection is increasingly ubiquitous and where analysis is conducted at speeds approaching real time. In particular, our review raised the question of whether the “notice and consent” framework, in which a user grants permission for a service to collect and use information about them, still allows us to meaningfully control our privacy as data about us is increasingly used and reused in ways that could not have been anticipated when it was collected.
Big data raises other concerns, as well. One significant finding of our review was the potential for big data analytics to lead to discriminatory outcomes and to circumvent longstanding civil rights protections in housing, employment, credit, and the consumer marketplace.
No matter how quickly technology advances, it remains within our power to ensure that we both encourage innovation and protect our values through law, policy, and the practices we encourage in the public and private sector. To that end, we make six actionable policy recommendations in our report to the President:
Advance the Consumer Privacy Bill of Rights. Consumers deserve clear, understandable, reasonable standards for how their personal information is used in the big data era. We recommend the Department of Commerce take appropriate consultative steps to seek stakeholder and public comment on what changes, if any, are needed to the Consumer Privacy Bill of Rights, first proposed by the President in 2012, and to prepare draft legislative text for consideration by stakeholders and submission by the President to Congress.
Pass National Data Breach Legislation. Big data technologies make it possible to store significantly more data, and further derive intimate insights into a person’s character, habits, preferences, and activities. That makes the potential impacts of data breaches at businesses or other organizations even more serious. A patchwork of state laws currently governs requirements for reporting data breaches. Congress should pass legislation that provides for a single national data breach standard, along the lines of the Administration’s 2011 Cybersecurity legislative proposal.
Extend Privacy Protections to non-U.S. Persons. Privacy is a worldwide value that should be reflected in how the federal government handles personally identifiable information about non-U.S. citizens. The Office of Management and Budget should work with departments and agencies to apply the Privacy Act of 1974 to non-U.S. persons where practicable, or to establish alternative privacy policies that apply appropriate and meaningful protections to personal information regardless of a person’s nationality.
Ensure Data Collected on Students in School is used for Educational Purposes. Big data and other technological innovations, including new online course platforms that provide students real time feedback, promise to transform education by personalizing learning. At the same time, the federal government must ensure educational data linked to individual students gathered in school is used for educational purposes, and protect students against their data being shared or used inappropriately.
Expand Technical Expertise to Stop Discrimination. The detailed personal profiles held about many consumers, combined with automated, algorithm-driven decision-making, could lead—intentionally or inadvertently—to discriminatory outcomes, or what some are already calling “digital redlining.” The federal government’s lead civil rights and consumer protection agencies should expand their technical expertise to be able to identify practices and outcomes facilitated by big data analytics that have a discriminatory impact on protected classes, and develop a plan for investigating and resolving violations of law.
Amend the Electronic Communications Privacy Act. The laws that govern protections afforded to our communications were written before email, the internet, and cloud computing came into wide use. Congress should amend ECPA to ensure the standard of protection for online, digital content is consistent with that afforded in the physical world—including by removing archaic distinctions between email left unread or over a certain age.
We also identify several broader areas ripe for further study, debate, and public engagement that, collectively, we hope will spark a national conversation about how to harness big data for the public good. We conclude that we must find a way to preserve our privacy values in both the domestic and international marketplace. We urgently need to build capacity in the federal government to identify and prevent new modes of discrimination that could be enabled by big data. We must ensure that law enforcement agencies using big data technologies do so responsibly, and that our fundamental privacy rights remain protected. Finally, we recognize that data is a valuable public resource, and call for continuing the Administration’s efforts to open more government data sources and make investments in research and technology.
While big data presents new challenges, it also presents immense opportunities to improve lives. The United States is perhaps better suited to lead this conversation than any other nation on earth. Our innovative spirit, technological know-how, and deep commitment to values of privacy, fairness, non-discrimination, and self-determination will help us harness the benefits of the big data revolution and encourage the free flow of information while working with our international partners to protect personal privacy. This review is but one piece of that effort, and we hope it spurs a conversation about big data across the country and around the world.
Read the Big Data Report.
See the fact sheet from today’s announcement.

The Data Mining Techniques That Reveal Our Planet's Cultural Links and Boundaries

Emerging Technology From the arXiv: “The habits and behaviors that define a culture are complex and fascinating. But measuring them is a difficult task. What’s more, understanding the way cultures change from one part of the world to another is a task laden with challenges.
The gold standard in this area of science is known as the World Values Survey, a global network of social scientists studying values and their impact on social and political life. Between 1981 and 2008, this survey conducted over 250,000 interviews in 87 societies. That’s a significant amount of data and the work has continued since then. This work is hugely valuable but it is also challenging, time-consuming and expensive.
Today, Thiago Silva at the Universidade Federal de Minas Gerais in Brazil and a few buddies reveal another way to collect data that could revolutionize the study of global culture. These guys study cultural differences around the world using data generated by check-ins on the location-based social network, Foursquare.
That allows these researchers to gather huge amounts of data, cheaply and easily in a short period of time. “Our one-week dataset has a population of users of the same order of magnitude of the number of interviews performed in [the World Values Survey] in almost three decades,” they say.
Food and drink are fundamental aspects of society and so the behaviors and habits associated with them are important indicators. The basic question that Silva and co attempt to answer is: what are your eating and drinking habits? And how do these differ from a typical individual in another part of the world such as Japan, Malaysia, or Brazil?
Foursquare is ideally set up to explore this question. Users “check in” by indicating when they have reached a particular location that might be related to eating and drinking but also to other activities such as entertainment, sport and so on.
Silva and co are only interested in the food and drink preferences of individuals and, in particular, in the way these preferences change according to time of day and geographical location.
So their basic approach is to compare a large number of individual preferences from different parts of the world and see how closely they match or how they differ.
Because Foursquare does not share its data, Silva and co downloaded almost five million tweets containing Foursquare check-ins, URLs pointing to the Foursquare website containing information about each venue. They discarded check-ins that were unrelated to food or drink.
That left them with some 280,000 check-ins related to drink from 160,000 individuals; over 400,000 check-ins related to fast food from 230,000 people; and some 400,000 check-ins relating to ordinary restaurant food or what Silva and co call slow food.
They then divide each of these classes into subcategories. For example, the drink class has 21 subcategories such as brewery, karaoke bar, pub, and so on. The slow food class has 53 subcategories such as Chinese restaurant, Steakhouse, Greek restaurant, and so on.
Each check-in gives the time and geographical location, which allows the team to compare behaviors from all over the world. They compare, for example, eating and drinking times in different countries both during the week and at the weekend. They compare the choices of restaurants, fast food habits and drinking habits by continent and country. They even compare eating and drinking habits in New York, London, and Tokyo.
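
One way to picture such a comparison is sketched below. This is only an illustration and not the method used by Silva and co, which the article does not describe: it assumes each region is summarized by its share of check-ins per drink subcategory and that regions are compared with cosine similarity. The region names and counts are invented; only the subcategory labels come from the text above.

```python
import math

def shares(counts):
    """Convert raw check-in counts per subcategory into proportions."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}

def cosine_similarity(p, q):
    """Cosine similarity between two sparse vectors stored as dicts."""
    keys = set(p) | set(q)
    dot = sum(p.get(k, 0.0) * q.get(k, 0.0) for k in keys)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q)

# Invented counts for three hypothetical regions (not data from the study).
region_a = {"pub": 50, "brewery": 10, "karaoke bar": 5}
region_b = {"pub": 45, "brewery": 12, "karaoke bar": 8}
region_c = {"pub": 5, "brewery": 5, "karaoke bar": 60}

sim_ab = cosine_similarity(shares(region_a), shares(region_b))
sim_ac = cosine_similarity(shares(region_a), shares(region_c))

# Regions A and B have similar drinking profiles; A and C do not.
print(f"A vs B: {sim_ab:.2f}, A vs C: {sim_ac:.2f}")
```

Under these assumptions, a higher score means two regions spread their check-ins across venue types in a more similar way. Any other distance between distributions could be substituted.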
The results are a fascinating insight into humanity’s differing habits. Many places have similar behaviors (Malaysia and Singapore, or Argentina and Chile, for example), which is just as expected given the similarities between these places.
But other resemblances are more unexpected. A comparison of drinking habits shows greater similarity between Brazil and France, separated by the Atlantic Ocean, than it does between France and England, separated only by the English Channel…
They point out only two major differences. The first is that no Islamic cluster appears in the Foursquare data. Countries such as Turkey are similar to Russia, while Indonesia seems related to Malaysia and Singapore.
The second is that the U.S. and Mexico make up their own individual cluster in the Foursquare data whereas the World Values Survey has them in the “English-speaking” and “Latin American” clusters respectively.
That’s exciting data mining work that has the potential to revolutionize the way sociologists and anthropologists study human culture around the world. Expect to hear more about it.
Ref: You Are What You Eat (and Drink): Identifying Cultural Boundaries By Analyzing Food & Drink Habits In Foursquare”.

Eight (No, Nine!) Problems With Big Data

Gary Marcus and Ernest Davis in the New York Times: “BIG data is suddenly everywhere. Everyone seems to be collecting it, analyzing it, making money from it and celebrating (or fearing) its powers. Whether we’re talking about analyzing zillions of Google search queries to predict flu outbreaks, or zillions of phone records to detect signs of terrorist activity, or zillions of airline stats to find the best time to buy plane tickets, big data is on the case. By combining the power of modern computing with the plentiful data of the digital era, it promises to solve virtually any problem — crime, public health, the evolution of grammar, the perils of dating — just by crunching the numbers.

Or so its champions allege. “In the next two decades,” the journalist Patrick Tucker writes in the latest big data manifesto, “The Naked Future,” “we will be able to predict huge areas of the future with far greater accuracy than ever before in human history, including events long thought to be beyond the realm of human inference.” Statistical correlations have never sounded so good.

Is big data really all it’s cracked up to be? There is no doubt that big data is a valuable tool that has already had a critical impact in certain areas. For instance, almost every successful artificial intelligence computer program in the last 20 years, from Google’s search engine to the I.B.M. “Jeopardy!” champion Watson, has involved the substantial crunching of large bodies of data. But precisely because of its newfound popularity and growing use, we need to be levelheaded about what big data can — and can’t — do.

The first thing to note is that although big data is very good at detecting correlations, especially subtle correlations that an analysis of smaller data sets might miss, it never tells us which correlations are meaningful. A big data analysis might reveal, for instance, that from 2006 to 2011 the United States murder rate was well correlated with the market share of Internet Explorer: Both went down sharply. But it’s hard to imagine there is any causal relationship between the two. Likewise, from 1998 to 2007 the number of new cases of autism diagnosed was extremely well correlated with sales of organic food (both went up sharply), but identifying the correlation won’t by itself tell us whether diet has anything to do with autism.
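
The point can be illustrated with a minimal sketch. The two series below are made-up numbers, not real murder-rate or browser statistics; they are only assumed to decline over the same six years. The Pearson coefficient is computed by hand so the example needs nothing beyond the standard library.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Illustrative, invented values: two unrelated quantities that both
# happen to fall over six consecutive years.
series_a = [5.8, 5.7, 5.4, 5.0, 4.8, 4.7]
series_b = [80.0, 76.0, 70.0, 62.0, 55.0, 47.0]

r = pearson(series_a, series_b)
print(f"r = {r:.3f}")  # a value above 0.9 for these inputs
```

A coefficient this high says only that the two series moved together. It carries no information about whether one has anything to do with the other.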

Second, big data can work well as an adjunct to scientific inquiry but rarely succeeds as a wholesale replacement. Molecular biologists, for example, would very much like to be able to infer the three-dimensional structure of proteins from their underlying DNA sequence, and scientists working on the problem use big data as one tool among many. But no scientist thinks you can solve this problem by crunching data alone, no matter how powerful the statistical analysis; you will always need to start with an analysis that relies on an understanding of physics and biochemistry.

Third, many tools that are based on big data can be easily gamed. For example, big data programs for grading student essays often rely on measures like sentence length and word sophistication, which are found to correlate well with the scores given by human graders. But once students figure out how such a program works, they start writing long sentences and using obscure words, rather than learning how to actually formulate and write clear, coherent text. Even Google’s celebrated search engine, rightly seen as a big data success story, is not immune to “Google bombing” and “spamdexing,” wily techniques for artificially elevating website search placement.

Fourth, even when the results of a big data analysis aren’t intentionally gamed, they often turn out to be less robust than they initially seem. Consider Google Flu Trends, once the poster child for big data. In 2009, Google reported — to considerable fanfare — that by analyzing flu-related search queries, it had been able to detect the spread of the flu as accurately and more quickly than the Centers for Disease Control and Prevention. A few years later, though, Google Flu Trends began to falter; for the last two years it has made more bad predictions than good ones.

As a recent article in the journal Science explained, one major contributing cause of the failures of Google Flu Trends may have been that the Google search engine itself constantly changes, such that patterns in data collected at one time do not necessarily apply to data collected at another time. As the statistician Kaiser Fung has noted, collections of big data that rely on web hits often merge data that was collected in different ways and with different purposes — sometimes to ill effect. It can be risky to draw conclusions from data sets of this kind.

A fifth concern might be called the echo-chamber effect, which also stems from the fact that much of big data comes from the web. Whenever the source of information for a big data analysis is itself a product of big data, opportunities for vicious cycles abound. Consider translation programs like Google Translate, which draw on many pairs of parallel texts from different languages — for example, the same Wikipedia entry in two different languages — to discern the patterns of translation between those languages. This is a perfectly reasonable strategy, except for the fact that with some of the less common languages, many of the Wikipedia articles themselves may have been written using Google Translate. In those cases, any initial errors in Google Translate infect Wikipedia, which is fed back into Google Translate, reinforcing the error.

A sixth worry is the risk of too many correlations. If you look 100 times for correlations between two variables, you risk finding, purely by chance, about five bogus correlations that appear statistically significant — even though there is no actual meaningful connection between the variables. Absent careful supervision, the magnitudes of big data can greatly amplify such errors.
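
The arithmetic behind "about five" can be sketched as follows, assuming the conventional 0.05 significance threshold and independent tests (neither is stated explicitly in the text). Under those assumptions the p-value of a test with no real effect is uniformly distributed between 0 and 1, which the simulation below relies on.

```python
import random

ALPHA = 0.05    # assumed significance threshold
N_TESTS = 100   # number of correlations examined, as in the text

# Expected number of false positives when no real effect exists.
expected_false_positives = N_TESTS * ALPHA

# Probability that at least one of the independent tests looks significant.
p_at_least_one = 1 - (1 - ALPHA) ** N_TESTS

def simulate_false_positives(n_tests, alpha, rng):
    """Count null tests whose uniformly distributed p-value falls below alpha."""
    return sum(1 for _ in range(n_tests) if rng.random() < alpha)

rng = random.Random(0)  # fixed seed so the run is repeatable
trials = [simulate_false_positives(N_TESTS, ALPHA, rng) for _ in range(10_000)]
average = sum(trials) / len(trials)

print(f"expected: {expected_false_positives:.1f}")
print(f"simulated average: {average:.2f}")
print(f"chance of at least one: {p_at_least_one:.3f}")
```

The expected count scales linearly with the number of comparisons, which is why very large data sets with many candidate variables need explicit correction for multiple testing.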

Seventh, big data is prone to giving scientific-sounding solutions to hopelessly imprecise questions. In the past few months, for instance, there have been two separate attempts to rank people in terms of their “historical importance” or “cultural contributions,” based on data drawn from Wikipedia. One is the book “Who’s Bigger? Where Historical Figures Really Rank,” by the computer scientist Steven Skiena and the engineer Charles Ward. The other is an M.I.T. Media Lab project called Pantheon.

Both efforts get many things right — Jesus, Lincoln and Shakespeare were surely important people — but both also make some egregious errors. “Who’s Bigger?” claims that Francis Scott Key was the 19th most important poet in history; Pantheon has claimed that Nostradamus was the 20th most important writer in history, well ahead of Jane Austen (78th) and George Eliot (380th). Worse, both projects suggest a misleading degree of scientific precision with evaluations that are inherently vague, or even meaningless. Big data can reduce anything to a single number, but you shouldn’t be fooled by the appearance of exactitude.

FINALLY, big data is at its best when analyzing things that are extremely common, but often falls short when analyzing things that are less common. For instance, programs that use big data to deal with text, such as search engines and translation programs, often rely heavily on something called trigrams: sequences of three words in a row (like “in a row”). Reliable statistical information can be compiled about common trigrams, precisely because they appear frequently. But no existing body of data will ever be large enough to include all the trigrams that people might use, because of the continuing inventiveness of language.

To select an example more or less at random, a book review that the actor Rob Lowe recently wrote for this newspaper contained nine trigrams such as “dumbed-down escapist fare” that had never before appeared anywhere in all the petabytes of text indexed by Google. To witness the limitations that big data can have with novelty, Google-translate “dumbed-down escapist fare” into German and then back into English: out comes the incoherent “scaled-flight fare.” That is a long way from what Mr. Lowe intended — and from big data’s aspirations for translation.

Wait, we almost forgot one last problem: the hype….