OpenUp Corporate Data while Protecting Privacy


Article by Stefaan G. Verhulst and David Sangokoya (The GovLab) for the OpenUp? Blog: “Consider a few numbers: By the end of 2014, the number of mobile phone subscriptions worldwide is expected to reach 7 billion, nearly equal to the world’s population. More than 1.82 billion people communicate on some form of social network, and almost 14 billion sensor-laden everyday objects (trucks, health monitors, GPS devices, refrigerators, etc.) are now connected and communicating over the Internet, creating a steady stream of real-time, machine-generated data.
Much of the data generated by these devices is today controlled by corporations. These companies are in effect “owners” of terabytes of data and metadata. Companies use this data to aggregate, analyze, and track individual preferences, provide more targeted consumer experiences, and add value to the corporate bottom line.
At the same time, even as we witness a rapid “datafication” of the global economy, access to data is emerging as an increasingly critical issue, essential to addressing many of our most important social, economic, and political challenges. While the rise of the Open Data movement has opened up over a million datasets around the world, much of this openness is limited to government (and, to a lesser extent, scientific) data. Access to corporate data remains extremely limited. This is a lost opportunity. If corporate data—in the form of Web clicks, tweets, online purchases, sensor data, call data records, etc.—were made available in a de-identified and aggregated manner, researchers, public interest organizations, and third parties would gain greater insights on patterns and trends that could help inform better policies and lead to greater public good (including combatting Ebola).
Corporate data sharing holds tremendous promise. But its potential—and limitations—are also poorly understood. In what follows, we share early findings of our efforts to map this emerging open data frontier, along with a set of reflections on how to safeguard privacy and other citizen and consumer rights while sharing. Understanding the practice of shared corporate data—and assessing the associated risks—is an essential step in increasing access to socially valuable data held by businesses today. This is a challenge certainly worth exploring during the forthcoming OpenUp conference!
Understanding and classifying current corporate data sharing practices
Corporate data sharing remains very much a fledgling field. There has been little rigorous analysis of different ways or impacts of sharing. Nonetheless, our initial mapping of the landscape suggests there have been six main categories of activity—i.e., ways of sharing—to date:…
Assessing risks of corporate data sharing
Although the shared corporate data offers several benefits for researchers, public interest organizations, and other companies, there do exist risks, especially regarding personally identifiable information (PII). When aggregated, PII can serve to help understand trends and broad demographic patterns. But if PII is inadequately scrubbed and aggregated data is linked to specific individuals, this can lead to identity theft, discrimination, profiling, and other violations of individual freedom. It can also lead to significant legal ramifications for corporate data providers….”

How Wikipedia Data Is Revolutionizing Flu Forecasting


They say their model has the potential to transform flu forecasting from a black art to a modern science as well-founded as weather forecasting.
Flu takes between 3,000 and 49,000 lives each year in the U.S., so an accurate forecast can have a significant impact on the way society prepares for the epidemic. The current method of monitoring flu outbreaks is somewhat antiquated. It relies on a voluntary system in which public health officials report the percentage of patients they see each week with influenza-like illnesses. This is defined as the percentage of people with a temperature higher than 100 degrees Fahrenheit, a cough and no explanation other than flu.
These numbers give a sense of the incidence of flu at any instant but the accuracy is clearly limited. They do not, for example, account for people with flu who do not seek treatment or people with flu-like symptoms who seek treatment but do not have flu.
There is another significant problem. The network that reports this data is relatively slow. It takes about two weeks for the numbers to filter through the system so the data is always weeks old.
That’s why the CDC is interested in finding new ways to monitor the spread of flu in real time. Google, in particular, has used the number of searches for flu and flu-like symptoms to forecast flu in various parts of the world. That approach has had considerable success but also some puzzling failures. One problem, however, is that Google does not make its data freely available and this lack of transparency is a potential source of trouble for this kind of research.
So Hickmann and co have turned to Wikipedia. Their idea is that the variation in numbers of people accessing articles about flu is an indicator of the spread of the disease. And since Wikipedia makes this data freely available to any interested party, it is an entirely transparent source that is likely to be available for the foreseeable future….
Ref: arxiv.org/abs/1410.7716 : Forecasting the 2013–2014 Influenza Season using Wikipedia”
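The idea described above, that article traffic tracks flu incidence and is available sooner than the official figures, can be illustrated with a very small sketch. This is not the model from the cited paper; it is a plain least-squares line fitted to made-up numbers, and the names `fit_line`, `nowcast`, `weekly_page_views`, and `weekly_ili_percent` are hypothetical.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y ~ slope * x + intercept."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Made-up illustrative numbers, NOT real data.
# Weeks for which both the page views and the (delayed) official figure exist.
weekly_page_views = [1000, 1500, 2200, 3000, 2600]  # views of flu-related articles
weekly_ili_percent = [1.0, 1.5, 2.2, 3.0, 2.6]      # % of patients with influenza-like illness

slope, intercept = fit_line(weekly_page_views, weekly_ili_percent)


def nowcast(page_views):
    """Estimate the current ILI percentage from this week's page views."""
    return slope * page_views + intercept


# The official figure for the current week is not in yet, but page views are.
print(round(nowcast(2000), 2))
```

A single linear fit like this would inherit the same weakness the article attributes to search-based estimates: anything that drives traffic to flu articles for reasons other than illness would inflate the estimate.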

The New Thing in Google Flu Trends Is Traditional Data


in the New York Times: “Google is giving its Flu Trends service an overhaul — “a brand new engine,” as it announced in a blog post on Friday.

The new thing is actually traditional data from the Centers for Disease Control and Prevention that is being integrated into the Google flu-tracking model. The goal is greater accuracy after the Google service had been criticized for consistently over-estimating flu outbreaks in recent years.

The main critique came in an analysis done by four quantitative social scientists, published earlier this year in an article in Science magazine, “The Parable of Google Flu: Traps in Big Data Analysis.” The researchers found that the most accurate flu predictor was a data mash-up that combined Google Flu Trends, which monitored flu-related search terms, with the official C.D.C. reports from doctors on influenza-like illness.

The Google Flu Trends team is heeding that advice. In the blog post, Christian Stefansen, a Google senior software engineer, wrote, “We’re launching a new Flu Trends model in the United States that — like many of the best performing methods in the literature — takes official CDC flu data into account as the flu season progresses.”

Google’s flu-tracking service has had its ups and downs. Its triumph came in 2009, when it gave an advance signal of the severity of the H1N1 outbreak, two weeks or so ahead of official statistics. In a 2009 article in Nature explaining how Google Flu Trends worked, the company’s researchers did, as the Friday post notes, say that the Google service was not intended to replace official flu surveillance methods and that it was susceptible to “false alerts” — anything that might prompt a surge in flu-related search queries.

Yet those caveats came a couple of pages into the Nature article. And Google Flu Trends became a symbol of the superiority of the new, big data approach — computer algorithms mining data trails for collective intelligence in real time. To enthusiasts, it seemed so superior to the antiquated method of collecting health data that involved doctors talking to patients, inspecting them and filing reports.

But Google’s flu service greatly overestimated the number of cases in the United States in the 2012-13 flu season — a well-known miss — and, according to the research published this year, has persistently overstated flu cases over the years. In the Science article, the social scientists called it “big data hubris.”

Crowd-Sourcing Corruption: What Petrified Forests, Street Music, Bath Towels and the Taxman Can Tell Us About the Prospects for Its Future


Paper by Dieter Zinnbauer: “This article seeks to map out the prospects of crowd-sourcing technologies in the area of corruption-reporting. A flurry of initiative and concomitant media hype in this area has led to exuberant hopes that the end of impunity is not such a distant possibility any more – at least not for the most blatant, ubiquitous and visible forms of administrative corruption, such as bribes and extortion payments that on average almost a quarter of citizens reported to face year in, year out in their daily lives in so many countries around the world (Transparency International 2013).
Only with hindsight will we be able to tell if these hopes were justified. However, a closer look at an interdisciplinary body of literature on corruption and social mobilisation can help shed some interesting light on these questions and offer a fresh perspective on the potential of social media based crowd-sourcing for better governance and less corruption. So far the potential of crowd-sourcing is mainly approached from a technology-centred perspective. Where challenges are identified, pondered, and worked upon they are primarily technical and managerial in nature, ranging from issues of privacy protection and fighting off hacker attacks to challenges of data management, information validation or fundraising.
In contrast, short shrift is being paid to insights from a substantive, multi-disciplinary and growing body of literature on how corruption works, how it can be fought and more generally how observed logics of collective action and social mobilisation interact with technological affordances and condition the success of these efforts.
This imbalanced debate is not really surprising as it seems to follow the trajectory of the hype-and-bust cycle that we have seen in the public debate for a variety of other technology applications. From electronic health cards to smart government, to intelligent transport systems, all these and many other highly ambitious initiatives start with technology-centric visions of transformational impact. However, over time – with some hard lessons learnt and large sums spent – they all arrive at a more pragmatic and nuanced view on how social and economic forces shape the implementation of such technologies and require a more shrewd design approach, in order to make it more likely that potential actually translates into impact….”

When Experts Are a Waste of Money


Vivek Wadhwa at the Wall Street Journal: “Corporations have always relied on industry analysts, management consultants and in-house gurus for advice on strategy and competitiveness. Since these experts understand the products, markets and industry trends, they also get paid the big bucks.
But what experts do is analyze historical trends, extrapolate forward on a linear basis and protect the status quo — their field of expertise. And technologies are not progressing linearly anymore; they are advancing exponentially. Technology is advancing so rapidly that listening to people who just have domain knowledge and vested interests will put a company on the fastest path to failure. Experts are no longer the right people to turn to; they are a waste of money.
Just as the processing power of our computers doubles every 18 months, with prices falling and devices becoming smaller, fields such as medicine, robotics, artificial intelligence and synthetic biology are seeing accelerated change. Competition now comes from the places you least expect it to. The health-care industry, for example, is about to be disrupted by advances in sensors and artificial intelligence; lodging and transportation, by mobile apps; communications, by Wi-Fi and the Internet; and manufacturing, by robotics and 3-D printing.
To see the competition coming and develop strategies for survival, companies now need armies of people, not experts. The best knowledge comes from employees, customers and outside observers who aren’t constrained by their expertise or personal agendas. It is they who can best identify the new opportunities. The collective insight of large numbers of individuals is superior because of the diversity of ideas and breadth of knowledge that they bring. Companies need to learn from people with different skills and backgrounds — not from those confined to a department.
When used properly, crowdsourcing can be the most effective, least expensive way of solving problems.
Crowdsourcing can be as simple as asking employees to submit ideas via email or via online discussion boards, or it can assemble cross-disciplinary groups to exchange ideas and brainstorm. Internet platforms such as Zoho Connect, IdeaScale and GroupTie can facilitate group ideation by providing the ability to pose questions to a large number of people and having them discuss responses with each other.
Many of the ideas proposed by the crowd as well as the discussions will seem outlandish — especially if anonymity is allowed on discussion forums. And companies will surely hear things they won’t like. But this is exactly the input and out-of-the-box thinking that they need in order to survive and thrive in this era of exponential technologies….
Another way of harnessing the power of the crowd is to hold incentive competitions. These can solve problems, foster innovation and even create industries — just as the first XPRIZE did. Sponsored by the Ansari family, it offered a prize of $10 million to any team that could build a spacecraft capable of carrying three people to 100 kilometers above the earth’s surface, twice within two weeks. It was won by Burt Rutan in 2004, who launched a spacecraft called SpaceShipOne. Twenty-six teams, from seven countries, spent more than $100 million in competing. Since then, more than $1.5 billion has been invested in private space flight by companies such as Virgin Galactic, Armadillo Aerospace and Blue Origin, according to the XPRIZE Foundation….
Competitions needn’t be so grand. InnoCentive and HeroX, a spinoff from the XPRIZE Foundation, for example, allow prizes as small as a few thousand dollars for solving problems. A company or an individual can specify a problem and offer prizes for whoever comes up with the best idea to solve it. InnoCentive has already run thousands of public and inter-company competitions. The solutions they have crowdsourced have ranged from the development of biomarkers for Amyotrophic lateral sclerosis disease to dual-purpose solar lights for African villages….”

Ebola and big data: Call for help


The Economist: “WITH at least 4,500 people dead, public-health authorities in west Africa and worldwide are struggling to contain Ebola. Borders have been closed, air passengers screened, schools suspended. But a promising tool for epidemiologists lies unused: mobile-phone data.
When people make mobile-phone calls, the network generates a call data record (CDR) containing such information as the phone numbers of the caller and receiver, the time of the call and the tower that handled it—which gives a rough indication of the device’s location. This information provides researchers with an insight into mobility patterns. Indeed phone companies use these data to decide where to build base stations and thus improve their networks, and city planners use them to identify places to extend public transport.
But perhaps the most exciting use of CDRs is in the field of epidemiology. Until recently the standard way to model the spread of a disease relied on extrapolating trends from census data and surveys. CDRs, by contrast, are empirical, immediate and updated in real time. You do not have to guess where people will flee to or move. Researchers have used them to map malaria outbreaks in Kenya and Namibia and to monitor the public response to government health warnings during Mexico’s swine-flu epidemic in 2009. Models of population movements during a cholera outbreak in Haiti following the earthquake in 2010 used CDRs and provided the best estimates of where aid was most needed.
Doing the same with Ebola would be hard: in west Africa most people do not own a phone. But CDRs are nevertheless better than simulations based on stale, unreliable statistics. If researchers could track population flows from an area where an outbreak had occurred, they could see where it would be likeliest to break out next—and therefore where they should deploy their limited resources. Yet despite months of talks, and the efforts of the mobile-network operators’ trade association and several smaller UN agencies, telecoms firms have not let researchers use the data (see article).
One excuse is privacy, which is certainly a legitimate worry, particularly in countries fresh from civil war, or where tribal tensions exist. But the phone data can be anonymised and aggregated in a way that alleviates these concerns. A bigger problem is institutional inertia. Big data is a new field. The people who grasp the benefits of examining mobile-phone usage tend to be young, and lack the clout to free them for research use.”
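The article describes a call data record as holding the caller, the receiver, the time and the handling tower, and argues that such data can be anonymised and aggregated before researchers see it. A minimal sketch of that idea follows, under stated assumptions: the tuple layout, the `SALT` value, the `MIN_COUNT` threshold and the function names are all hypothetical, and salted hashing is only pseudonymisation, so the suppression of small counts is what does most of the protective work here.

```python
import hashlib
from collections import defaultdict

SALT = b"hypothetical-secret-salt"  # assumption: known only to the operator
MIN_COUNT = 2                       # assumption: hide flows with fewer movers than this


def pseudonymise(phone_number):
    """Replace a phone number with a salted hash so raw numbers never leave the operator."""
    return hashlib.sha256(SALT + phone_number.encode()).hexdigest()[:12]


def tower_flows(records, min_count=MIN_COUNT):
    """Count distinct subscribers moving between towers.

    records: iterable of (caller, receiver, timestamp, tower_id) tuples,
    with timestamps that sort chronologically (e.g. ISO 8601 strings).
    """
    # Group each caller's tower sightings under a pseudonym.
    sightings = defaultdict(list)
    for caller, _receiver, timestamp, tower in records:
        sightings[pseudonymise(caller)].append((timestamp, tower))

    # For each subscriber, record every change of tower between consecutive calls.
    movers = defaultdict(set)  # (origin, destination) -> set of pseudonyms
    for person, seen in sightings.items():
        seen.sort()
        for (_, origin), (_, destination) in zip(seen, seen[1:]):
            if origin != destination:
                movers[(origin, destination)].add(person)

    # Release only aggregate counts, and only when enough people share the flow.
    return {flow: len(people) for flow, people in movers.items()
            if len(people) >= min_count}


# Made-up records for illustration only.
records = [
    ("111", "999", "2014-10-01T08:00", "A"),
    ("111", "999", "2014-10-01T12:00", "B"),
    ("222", "999", "2014-10-01T09:00", "A"),
    ("222", "999", "2014-10-01T15:00", "B"),
    ("333", "999", "2014-10-01T09:30", "A"),
    ("333", "999", "2014-10-01T11:00", "C"),
]
flows = tower_flows(records)
print(flows)  # {('A', 'B'): 2}
```

The A-to-C movement is dropped because only one subscriber made it; a real release would need a much higher threshold and a review of re-identification risk.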

Ebola’s Information Paradox


Steven Johnson at The New York Times: “…The story of the Broad Street outbreak is perhaps the most famous case study in public health and epidemiology, in large part because it led to the revolutionary insight that cholera was a waterborne disease, not airborne as most believed at the time. But there is another element of the Broad Street outbreak that warrants attention today, as popular anxiety about Ebola surges across the airwaves and subways and living rooms of the United States: not the spread of the disease itself, but the spread of information about the disease.

It was a full seven days after Baby Lewis became ill, and four days after the Soho residents began dying in mass numbers, before the outbreak warranted the slightest mention in the London papers, a few short lines indicating that seven people had died in the neighborhood. (The report understated the growing death toll by an order of magnitude.) It took two entire weeks before the press began treating the outbreak as a major news event for the city.

Within Soho, the information channels were equally unreliable. Rumors spread throughout the neighborhood that the entire city had succumbed at the same casualty rate, and that London was facing a catastrophe on the scale of the Great Fire of 1666. But this proved to be nothing more than rumor. Because the Soho crisis had originated with a single-point source — the poisoned well — its range was limited compared with its intensity. If you lived near the Broad Street well, you were in grave danger. If you didn’t, you were likely to be unaffected.

Compare this pattern of information flow to the way news spreads now. On Thursday, Craig Spencer, a New York doctor, was given a diagnosis of Ebola after presenting a high fever, and the entire world learned of the test result within hours of the patient himself learning it. News spread with similar velocity several weeks ago with the Dallas Ebola victim, Thomas Duncan. In a sense, it took news of the cholera outbreak a week to travel the 20 blocks from Soho to Fleet Street in 1854; today, the news travels at nearly the speed of light, as data traverses fiber-optic cables. Thanks to that technology, the news channels have been on permanent Ebola watch for weeks now, despite the fact that, as the joke went on Twitter, more Americans have been married to Kim Kardashian than have died in the United States from Ebola.

As societies and technologies evolve, the velocities vary with which disease and information can spread. The tremendous population density of London in the 19th century enabled the cholera bacterium to spread through a neighborhood with terrifying speed, while the information about that terror moved more slowly. This was good news for the mental well-being of England’s wider population, which was spared the anxiety of following the death count as if it were a stock ticker. But it was terrible from a public health standpoint; the epidemic had largely faded before the official institutions of public health even realized the magnitude of the outbreak….

Information travels faster than viruses do now. This is why we are afraid. But this is also why we are safe.”

From Information to Smart Society


New book edited by Lapo Mola, Ferdinando Pennarola, and Stefano Za: “This book presents a collection of research papers focusing on issues emerging from the interaction of information technologies and organizational systems. In particular, the individual contributions examine digital platforms and artifacts currently adopted in both the business world and society at large (people, communities, firms, governments, etc.). The topics covered include: virtual organizations, virtual communities, smart societies, smart cities, ecological sustainability, e-healthcare, e-government, and interactive policy-making (IPM)…”

The government wants to study ‘social pollution’ on Twitter


in the Washington Post: “If you take to Twitter to express your views on a hot-button issue, does the government have an interest in deciding whether you are spreading “misinformation”? If you tweet your support for a candidate in the November elections, should taxpayer money be used to monitor your speech and evaluate your “partisanship”?

My guess is that most Americans would answer those questions with a resounding no. But the federal government seems to disagree. The National Science Foundation, a federal agency whose mission is to “promote the progress of science; to advance the national health, prosperity and welfare; and to secure the national defense,” is funding a project to collect and analyze your Twitter data.
The project is being developed by researchers at Indiana University, and its purported aim is to detect what they deem “social pollution” and to study what they call “social epidemics,” including how memes — ideas that spread throughout pop culture — propagate. What types of social pollution are they targeting? “Political smears,” so-called “astroturfing” and other forms of “misinformation.”
Named “Truthy,” after a term coined by TV host Stephen Colbert, the project claims to use a “sophisticated combination of text and data mining, social network analysis, and complex network models” to distinguish between memes that arise in an “organic manner” and those that are manipulated into being.

But there’s much more to the story. Focusing in particular on political speech, Truthy keeps track of which Twitter accounts are using hashtags such as #teaparty and #dems. It estimates users’ “partisanship.” It invites feedback on whether specific Twitter users, such as the Drudge Report, are “truthy” or “spamming.” And it evaluates whether accounts are expressing “positive” or “negative” sentiments toward other users or memes…”

Chicago uses big data to save itself from urban ills


Aviva Rutkin in the New Scientist: “THIS year in Chicago, some kids will get lead poisoning from the paint or pipes in their homes. Some restaurants will cook food in unsanitary conditions and, here and there, a street corner will be suddenly overrun with rats. These kinds of dangers are hard to avoid in a city of more than 2.5 million people. The problem is, no one knows for certain where or when they will pop up.

The Chicago city government is hoping to change that by knitting powerful predictive models into its everyday city inspections. Its latest project, currently in pilot tests, analyses factors such as home inspection records and census data, and uses the results to guess which buildings are likely to cause lead poisoning in children – a problem that affects around 500,000 children in the US each year. The idea is to identify trouble spots before kids are exposed to dangerous lead levels.

“We are able to prevent problems instead of just respond to them,” says Jay Bhatt, chief innovation officer at the Chicago Department of Public Health. “These models are just the beginning of the use of predictive analytics in public health and we are excited to be at the forefront of these efforts.”

Chicago’s projects are based on the thinking that cities already have what they need to raise their municipal IQ: piles and piles of data. In 2012, city officials built WindyGrid, a platform that collected data like historical facts about buildings and up-to-date streams such as bus locations, tweets and 911 calls. The project was designed as a proof of concept and was never released publicly but it led to another, called Plenario, that allowed the public to access the data via an online portal.

The experience of building those tools has led to more practical applications. For example, one tool matches calls to the city’s municipal hotline complaining about rats with conditions that draw rats to a particular area, such as excessive moisture from a leaking pipe, or with an increase in complaints about garbage. This allows officials to proactively deploy sanitation crews to potential hotspots. It seems to be working: last year, resident requests for rodent control dropped by 15 per cent.

Some predictions are trickier to get right. Charlie Catlett, director of the Urban Center for Computation and Data in Chicago, is investigating an old axiom among city cops: that violent crime tends to spike when there’s a sudden jump in temperature. But he’s finding it difficult to test its validity in the absence of a plausible theory for why it might be the case. “For a lot of things about cities, we don’t have that underlying theory that tells us why cities work the way they do,” says Catlett.

Still, predictive modelling is maturing, as other cities succeed in using it to tackle urban ills…. Such efforts can be a boon for cities, making them more productive, efficient and safe, says Rob Kitchin of Maynooth University in Ireland, who helped launch a real-time data site for Dublin last month called the Dublin Dashboard. But he cautions that there’s a limit to how far these systems can aid us. Knowing that a particular street corner is likely to be overrun with rats tomorrow doesn’t address what caused the infestation in the first place. “You might be able to create a sticking plaster or be able to manage it more efficiently, but you’re not going to be able to solve the deep structural problems….”