Better Data for Better Policy: Accessing New Data Sources for Statistics Through Data Collaboratives


Medium Blog by Stefaan Verhulst: “We live in an increasingly quantified world, one where data is driving key business decisions. Data is claimed to be the new competitive advantage. Yet, paradoxically, even as our reliance on data increases and the call for agile, data-driven policy making becomes more pronounced, many Statistical Offices are confronted with shrinking budgets and an increased demand to adjust their practices to a data age. If Statistical Offices fail to find new ways to deliver “evidence of tomorrow” by leveraging new data sources, public policy may end up being formed without access to the full range of available and relevant intelligence that most business leaders now enjoy. At worst, a thinning evidence base and the lack of a rigorous data foundation could lead to errors and more “fake news,” with possibly harmful public policy implications.

While my talk focused on the key ways data can inform and ultimately transform the full policy cycle (see the full presentation here), a key premise I examined was the need to access, use and find insight in the vast reams of data and data expertise that exist in private hands, through the creation of new kinds of public-private partnerships, or “data collaboratives,” to enable more agile and data-driven policy making.

Applied to statistics, such approaches have already shown promise in a number of settings and countries. Eurostat, for instance, has, together with Statistics Belgium, experimented with using call detail records provided by Proximus to document population density. Statistics Netherlands (CBS) recently launched a Center for Big Data Statistics (CBDS) in partnership with companies like Dell-EMC and Microsoft. Other National Statistics Offices (NSOs) are considering using scanner data for monitoring consumer prices (Austria), leveraging smart meter data (Canada), or using telecom data to complement transportation statistics (Belgium). We are now undeniably living in an era of data. Much of this data is held by private corporations. The key task is thus to find a way of utilizing this data for the greater public good.
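
To make the Proximus example concrete, the sketch below shows one common way call detail records are turned into a population-density proxy: count the distinct subscribers seen at each cell tower during nighttime hours and divide by the tower's coverage area. The column names, the nighttime window and the coverage figures are illustrative assumptions, not details of the actual Eurostat pilot.

```python
# Illustrative sketch: estimating a population-density proxy from call detail
# records (CDRs). Column names and parameters are hypothetical, not taken from
# the Eurostat/Statistics Belgium pilot.
import pandas as pd

# Toy CDR extract: one row per call/SMS event.
cdrs = pd.DataFrame({
    "subscriber_id": ["a", "a", "b", "c", "c", "d"],
    "cell_id":       ["T1", "T1", "T1", "T2", "T2", "T2"],
    "timestamp": pd.to_datetime([
        "2017-10-01 01:30", "2017-10-01 23:10", "2017-10-01 02:45",
        "2017-10-01 14:00", "2017-10-01 03:20", "2017-10-01 22:40",
    ]),
})

# Assume nighttime activity (22:00-06:00) approximates where people sleep,
# i.e. their residential location.
night = cdrs[(cdrs.timestamp.dt.hour >= 22) | (cdrs.timestamp.dt.hour < 6)]

# Count distinct subscribers seen at each tower during the night window.
residents = night.groupby("cell_id")["subscriber_id"].nunique()

# Hypothetical coverage area (km^2) per tower; density = subscribers / area.
coverage_km2 = pd.Series({"T1": 2.0, "T2": 5.0})
density_proxy = (residents / coverage_km2).rename("subscribers_per_km2")
print(density_proxy)
```

In practice a statistical office would still need to correct for the operator's market share and validate such a proxy against census benchmarks.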

Value Proposition — and Challenges

There are several reasons to believe that public policy making and official statistics could indeed benefit from access to privately collected and held data. Among the value propositions:

  • Using private data can increase the scope and breadth of available evidence, and thus the insights it offers policymakers;
  • Using private data can increase the quality and credibility of existing data sets (for instance, by complementing or validating them);
  • Private data can increase the timeliness and thus relevance of often-outdated information held by statistical agencies (social media streams, for example, can provide real-time insights into public behavior); and
  • Private data can lower costs and increase other efficiencies (for example, through more sophisticated analytical methods) for statistical organizations….(More)”.

Federal Crowdsourcing and Citizen Science Catalog


About: “The catalog contains information about federal citizen science and crowdsourcing projects. In citizen science, the public participates voluntarily in the scientific process, addressing real-world problems in ways that may include formulating research questions, conducting scientific experiments, collecting and analyzing data, interpreting results, making new discoveries, developing technologies and applications, and solving complex problems. In crowdsourcing, organizations submit an open call for voluntary assistance from a group of individuals for online, distributed problem solving.

Projects in the catalog must meet the following criteria:

  • The project addresses societal needs or accelerates science, technology, and innovation consistent with a Federal agency’s mission.
  • Project outcomes include active management of data and data quality.
  • Participants serve as contributors, collaborators or co-creators in the project.
  • The project solicits engagement from individuals outside of a discipline’s or program’s traditional participants in the scientific enterprise.
  • Beyond practical limitations, the project does not seek to limit the number of participants or partners involved.
  • The project is opt-in; participants have full control over the extent that they participate.
  • The US Government enables or enhances the project via funding or providing an in-kind contribution. The US Government’s in-kind contribution to the project may be active or passive, formal or informal….(More)”.

Linux Foundation Debuts Community Data License Agreement


Press Release: “The Linux Foundation, the nonprofit advancing professional open source management for mass collaboration, today announced the Community Data License Agreement (CDLA) family of open data agreements. In an era of expansive and often underused data, the CDLA licenses are an effort to define a licensing framework to support collaborative communities built around curating and sharing “open” data.

Inspired by the collaborative software development models of open source software, the CDLA licenses are designed to enable individuals and organizations of all types to share data as easily as they currently share open source software code. Soundly drafted licensing models can help people form communities to assemble, curate and maintain vast amounts of data, measured in petabytes and exabytes, to bring new value to communities of all types, to build new business opportunities and to power new applications that promise to enhance safety and services.

The growth of big data analytics, machine learning and artificial intelligence (AI) technologies has allowed people to extract unprecedented levels of insight from data. Now the challenge is to assemble the critical mass of data for those tools to analyze. The CDLA licenses are designed to help governments, academic institutions, businesses and other organizations open up and share data, with the goal of creating communities that curate and share data openly.

For instance, if automakers, suppliers and civil infrastructure services can share data, they may be able to improve safety, decrease energy consumption and improve predictive maintenance. Self-driving cars are heavily dependent on AI systems for navigation, and need massive volumes of data to function properly. Once on the road, they can generate nearly a gigabyte of data every second. For the average car, that means two petabytes of sensor, audio, video and other data each year.
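
As a rough, back-of-the-envelope check of those figures (an illustration added here, not part of the press release), two petabytes per year at about one gigabyte per second implies roughly an hour and a half of driving per day:

```python
# Back-of-the-envelope check of the press release's figures (illustrative only).
GB_PER_SECOND = 1.0        # "nearly a gigabyte of data every second"
PB_PER_YEAR = 2.0          # "two petabytes ... each year"

seconds_of_driving = PB_PER_YEAR * 1e6 / GB_PER_SECOND   # 1 PB = 1e6 GB
hours_per_day = seconds_of_driving / 3600 / 365
print(f"Implied driving time: {hours_per_day:.1f} hours per day")  # ~1.5
```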

Similarly, climate modeling can integrate measurements captured by government agencies with simulation data from other organizations and then use machine learning systems to look for patterns in the information. It’s estimated that a single model can yield a petabyte of data, a volume that challenges standard computer algorithms, but is useful for machine learning systems. This knowledge may help improve agriculture or aid in studying extreme weather patterns.

And if government agencies share aggregated data on building permits, school enrollment figures, and sewer and water usage, their citizens benefit from the ability of commercial entities to anticipate future needs and respond with infrastructure and facilities that are ready when those needs arise.

“An open data license is essential for the frictionless sharing of the data that powers both critical technologies and societal benefits,” said Jim Zemlin, Executive Director of The Linux Foundation. “The success of open source software provides a powerful example of what can be accomplished when people come together around a resource and advance it for the common good. The CDLA licenses are a key step in that direction and will encourage the continued growth of applications and infrastructure.”…(More)”.

The role of policy entrepreneurs in open government data policy innovation diffusion: An analysis of Australian Federal and State Governments


Paper by Akemi Takeoka Chatfield and Christopher G. Reddick: “Open government data (OGD) policy differs substantially from existing Freedom of Information policies. Consequently, OGD can be viewed as a policy innovation. Drawing on both innovation diffusion theory and its application to public policy innovation research, we examine Australia’s OGD policy diffusion patterns at both the federal and state government levels, based on policy adoption timing and CKAN portal “Organization” and “Category” statistics. We found that state governments that adopted OGD policies earlier had active policy entrepreneurs (or lead departments/agencies) responsible for diffusing the policy innovation across different government departments. We also found that their efficacy ranking was relatively high in terms of OGD portal openness, when openness is measured by the number of datasets proactively and systematically published through their OGD portals. These findings have important implications for the role played by OGD policy entrepreneurs in openly sharing government-owned datasets with the public….(More)”.
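
Portal statistics of the kind the authors rely on are typically gathered through CKAN's action API. The sketch below shows how such counts might be pulled, assuming a standard CKAN endpoint; the portal URL and organization name are hypothetical, and endpoint behavior can vary across portals and CKAN versions.

```python
# Rough sketch of collecting CKAN portal statistics of the kind used in the paper.
# The portal URL and organization name are hypothetical examples.
import requests

PORTAL = "https://data.gov.au/data"   # assumed CKAN base URL, for illustration

def dataset_count(extra_params=None):
    """Return the number of datasets matching a package_search query."""
    params = {"rows": 0}               # we only need the count, not the records
    params.update(extra_params or {})
    resp = requests.get(f"{PORTAL}/api/3/action/package_search", params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()["result"]["count"]

total = dataset_count()
# Datasets published by one organization, as a crude measure of its openness.
by_org = dataset_count({"fq": 'organization:"bureau-of-meteorology"'})
print(f"Total datasets: {total}; from this organization: {by_org}")
```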

Our laws don’t do enough to protect our health data


Article at The Conversation: “A particularly sensitive type of big data is medical big data. Medical big data can consist of electronic health records, insurance claims, information entered by patients into websites such as PatientsLikeMe, and more. Health information can even be gleaned from web searches, Facebook and your recent purchases.

Such data can be used for beneficial purposes by medical researchers, public health authorities, and healthcare administrators. For example, they can use it to study medical treatments, combat epidemics and reduce costs. But others who can obtain medical big data may have more selfish agendas.

I am a professor of law and bioethics who has researched big data extensively. Last year, I published a book entitled Electronic Health Records and Medical Big Data: Law and Policy.

I have become increasingly concerned about how medical big data might be used and who could use it. Our laws currently don’t do enough to prevent harm associated with big data.

What your data says about you

Personal health information could be of interest to many, including employers, financial institutions, marketers and educational institutions. Such entities may wish to exploit it for decision-making purposes.

For example, employers presumably prefer healthy employees who are productive, take few sick days and have low medical costs. However, there are laws that prohibit employers from discriminating against workers because of their health conditions. These laws are the Americans with Disabilities Act (ADA) and the Genetic Information Nondiscrimination Act. So, employers are not permitted to reject qualified applicants simply because they have diabetes, depression or a genetic abnormality.

However, the same is not true for most predictive information regarding possible future ailments. Nothing prevents employers from rejecting or firing healthy workers out of the concern that they will later develop an impairment or disability, unless that concern is based on genetic information.

What non-genetic data can provide evidence regarding future health problems? Smoking status, eating preferences, exercise habits, weight and exposure to toxins are all informative. Scientists believe that biomarkers in your blood and other health details can predict cognitive decline, depression and diabetes.

Even bicycle purchases, credit scores and voting in midterm elections can be indicators of your health status.

Gathering data

How might employers obtain predictive data? An easy source is social media, where many individuals publicly post very private information. Through social media, your employer might learn that you smoke, hate to exercise or have high cholesterol.

Another potential source is wellness programs. These programs seek to improve workers’ health through incentives to exercise, stop smoking, manage diabetes, obtain health screenings and so on. While many wellness programs are run by third party vendors that promise confidentiality, that is not always the case.

In addition, employers may be able to purchase information from data brokers that collect, compile and sell personal information. Data brokers mine sources such as social media, personal websites, U.S. Census records, state hospital records, retailers’ purchasing records, real property records, insurance claims and more. Two well-known data brokers are Spokeo and Acxiom.

Some of the data employers can obtain identify individuals by name. But even information that does not provide obvious identifying details can be valuable. Wellness program vendors, for example, might provide employers with summary data about their workforce but strip away particulars such as names and birthdates. Nevertheless, de-identified information can sometimes be re-identified by experts. Data miners can match information to data that is publicly available….(More)”.
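
A minimal sketch of the kind of linkage attack described here: joining a “de-identified” wellness extract to a public record on shared quasi-identifiers such as ZIP code, birth date and sex. Every name and value below is invented for illustration.

```python
# Illustrative linkage attack on "de-identified" data; all records are invented.
import pandas as pd

# De-identified wellness-program extract: names stripped, quasi-identifiers kept.
wellness = pd.DataFrame({
    "zip":        ["02138", "02139"],
    "birth_date": ["1975-07-31", "1980-01-15"],
    "sex":        ["F", "M"],
    "diagnosis":  ["diabetes", "hypertension"],
})

# Public record (e.g. a voter roll) with names and the same quasi-identifiers.
voter_roll = pd.DataFrame({
    "name":       ["Jane Doe", "John Roe"],
    "zip":        ["02138", "02139"],
    "birth_date": ["1975-07-31", "1980-01-15"],
    "sex":        ["F", "M"],
})

# If a quasi-identifier combination is unique, the join re-attaches identities.
reidentified = wellness.merge(voter_roll, on=["zip", "birth_date", "sex"])
print(reidentified[["name", "diagnosis"]])
```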

Laboratories for news? Experimenting with journalism hackathons


Jan Lauren Boyles in Journalism: “Journalism hackathons are computationally based events in which participants create news product prototypes. In the ideal case, the gatherings are rooted in the local community, enabling a wide set of institutional stakeholders (legacy journalists, hacker journalists, civic hackers, and the general public) to gather in conversation around key civic issues. This study explores how and to what extent journalism hackathons operate as a community-based laboratory for translating open data from practitioners to the public. Drawing on in-depth interviews with event organizers across nine countries, the findings illustrate that journalism hackathons are most successful when collaboration integrates civic organizations and community leaders….(More)”.

How “Big Data” Went Bust


The problem with “big data” is not that data is bad. It’s not even that big data is bad: Applied carefully, massive data sets can reveal important trends that would otherwise go undetected. It’s the fetishization of data, and its uncritical use, that tends to lead to disaster, as Julia Rose West recently wrote for Slate. And that’s what “big data,” as a catchphrase, came to represent.

By its nature, big data is hard to interpret. When you’re collecting billions of data points—clicks or cursor positions on a website; turns of a turnstile in a large public space; hourly wind speed observations from around the world; tweets—the provenance of any given data point is obscured. This in turn means that seemingly high-level trends might turn out to be artifacts of problems in the data or methodology at the most granular level possible. But perhaps the bigger problem is that the data you have are usually only a proxy for what you really want to know. Big data doesn’t solve that problem—it magnifies it….

Aside from swearing off data and reverting to anecdote and intuition, there are at least two viable ways to deal with the problems that arise from the imperfect relationship between a data set and the real-world outcome you’re trying to measure or predict.

One is, in short: moar data. This has long been Facebook’s approach. When it became apparent that users’ “likes” were a flawed proxy for what they actually wanted to see more of in their feeds, the company responded by adding more and more proxies to its model. It began measuring other things, like the amount of time users spent looking at a post in their feed, the amount of time they spent reading a story they had clicked on, and whether they hit “like” before or after they had read the piece. When Facebook’s engineers had gone as far as they could in weighting and optimizing those metrics, they found that users were still unsatisfied in important ways. So the company added yet more metrics to the sauce: It started running huge user-survey panels, added new reaction emojis by which users could convey more nuanced sentiments, and started using A.I. to detect clickbait-y language in posts by pages and publishers. The company knows none of these proxies are perfect. But by constantly adding more of them to the mix, it can theoretically edge ever closer to an algorithm that delivers to users the posts that they most want to see.
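
The pattern described above is general: when a single proxy (“likes”) is too noisy, combine many weaker proxies into one ranking score. The toy function below illustrates only that idea; the signal names and weights are invented and do not describe Facebook's actual algorithm.

```python
# Toy "many weak proxies" ranking score; the signals and weights are invented
# and do not describe Facebook's actual feed algorithm.
from dataclasses import dataclass

@dataclass
class PostSignals:
    liked: bool            # explicit reaction
    dwell_seconds: float   # time spent looking at the post in the feed
    read_seconds: float    # time spent on the clicked-through story
    survey_score: float    # panel feedback, 0-1
    clickbait_prob: float  # model-estimated clickbait likelihood, 0-1

def relevance_score(s: PostSignals) -> float:
    """Combine several imperfect proxies into a single ranking score."""
    return (
        1.0 * s.liked
        + 0.02 * min(s.dwell_seconds, 30)   # cap so dwell time cannot dominate
        + 0.01 * min(s.read_seconds, 120)
        + 2.0 * s.survey_score
        - 1.5 * s.clickbait_prob            # penalize likely clickbait
    )

print(relevance_score(PostSignals(True, 12.0, 90.0, 0.8, 0.1)))  # roughly 3.59
```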

One downside of the moar data approach is that it’s hard and expensive. Another is that the more variables are added to your model, the more complex, opaque, and unintelligible its methodology becomes. This is part of the problem Frank Pasquale articulated in The Black Box Society. Even the most sophisticated algorithm, drawing on the best data sets, can go awry—and when it does, diagnosing the problem can be nigh-impossible. There are also the perils of “overfitting” and false confidence: The more sophisticated your model becomes, the more perfectly it seems to match up with all your past observations, and the more faith you place in it, the greater the danger that it will eventually fail you in a dramatic way. (Think mortgage crisis, election prediction models, and Zynga.)
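
The overfitting risk flagged here is easy to demonstrate: give a model enough free parameters and it will match past observations almost perfectly while generalizing worse. A minimal sketch with synthetic data (not any production model):

```python
# Minimal overfitting demonstration on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.3, x_train.size)  # noisy observations
x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test)                                       # the true signal

for degree in (3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)      # fit a polynomial of this degree
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")

# The degree-9 fit passes almost exactly through the 10 training points
# (train MSE near zero) yet typically does worse than the simpler degree-3
# fit on the held-out points, because it has memorized the noise.
```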

Another possible response to the problems that arise from biases in big data sets is what some have taken to calling “small data.” Small data refers to data sets that are simple enough to be analyzed and interpreted directly by humans, without recourse to supercomputers or Hadoop jobs. Like “slow food,” the term arose as a conscious reaction to the prevalence of its opposite….(More)”

 

Open Space: The Global Effort for Open Access to Environmental Satellite Data


Book by Mariel Borowitz: “Key to understanding and addressing climate change is continuous and precise monitoring of environmental conditions. Satellites play an important role in collecting climate data, offering comprehensive global coverage that can’t be matched by in situ observation. And yet, as Mariel Borowitz shows in this book, much satellite data is not freely available but restricted; this remains true despite the data-sharing advocacy of international organizations and a global open data movement. Borowitz examines policies governing the sharing of environmental satellite data, offering a model of data-sharing policy development and applying it in case studies from the United States, Europe, and Japan—countries responsible for nearly half of the unclassified government Earth observation satellites.

Borowitz develops a model that centers on the government agency as the primary actor while taking into account the roles of such outside actors as other government officials and non-governmental actors, as well as the economic, security, and normative attributes of the data itself. The case studies include the U.S. National Aeronautics and Space Administration (NASA), the National Oceanic and Atmospheric Administration (NOAA), and the United States Geological Survey (USGS); the European Space Agency (ESA) and the European Organisation for the Exploitation of Meteorological Satellites (EUMETSAT); and the Japan Aerospace Exploration Agency (JAXA) and the Japan Meteorological Agency (JMA). Finally, she considers the policy implications of her findings for the future and provides recommendations on how to increase global sharing of satellite data….(More)”.

The Unexamined Algorithm Is Not Worth Using


Ruben Mancha & Haslina Ali at Stanford Social Innovation Review: “In 1983, at the height of the Cold War, just one man stood between an algorithm and the outbreak of nuclear war. Stanislav Petrov, a lieutenant colonel of the Soviet Air Defence Forces, was on duty in a secret command center when early-warning alarms went off indicating the launch of intercontinental ballistic missiles from an American base. The systems reported that the alarm was of the highest possible reliability. Petrov’s role was to advise his superiors on the veracity of the alarm, which, in turn, would affect their decision to launch a retaliatory nuclear attack. Instead of trusting the algorithm, Petrov went with his gut and reported that the alarm was a malfunction. He turned out to be right.

This historical nugget represents an extreme example of the effect that algorithms have on our lives. The detection algorithm, it turns out, mistook the sun’s reflection for a missile launch. It is a sobering thought that a poorly designed or malfunctioning algorithm could have changed the course of history and resulted in millions of deaths….

We offer five recommendations to guide the ethical development and evaluation of algorithms used in your organization:

  1. Consider ethical outcomes first, speed and efficiency second. Organizations seeking speed and efficiency through algorithmic automation should remember that customer value comes through higher strategic speed, not higher operational speed. When implementing algorithms, organizations should never forget that their ultimate goal is creating customer value, and fast yet potentially unethical algorithms undermine that objective.
  2. Make ethical guiding principles salient to your organization. Your organization should reflect on the ethical principles guiding it and convey them clearly to employees, business partners, and customers. A corporate social responsibility framework is a good starting point for any organization ready to articulate its ethical principles.
  3. Employ programmers well versed in ethics. The computer engineers responsible for designing and programming algorithms should understand the ethical implications of the products of their work. While some ethical decisions may seem intuitive (such as do not use an algorithm to steal data from a user’s computer), most are not. The study of ethics and the practice of ethical inquiry should be part of every coding project.
  4. Interrogate your algorithms against your organization’s ethical standards. Through careful evaluation of your algorithms’ behavior and outcomes (a simplified example of such a check appears after this list), your organization can identify those circumstances, real or simulated, in which they do not meet those standards.
  5. Engage your stakeholders. Transparently share with your customers, employees, and business partners details about the processes and outcomes of your algorithms. Stakeholders can help you identify and address ethical gaps….(More).
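
As a simplified illustration of recommendation 4, the sketch below compares a decision algorithm's approval rates across two simulated groups and flags a large gap for human review. The decision rule, simulated data and disparity threshold are all invented for the example, not drawn from the article.

```python
# Simplified audit of a decision algorithm's outcomes across two groups.
# The decision rule, simulated data and threshold are invented for illustration.
import random

random.seed(1)

def loan_decision(income: float, postcode: str) -> bool:
    """Hypothetical rule under audit (postcode acts as a hidden proxy variable)."""
    return income > 40_000 and not postcode.startswith("9")

# Simulated applicants from two groups with different postcode distributions.
applicants = (
    [{"group": "A", "income": random.gauss(55_000, 10_000), "postcode": "10001"} for _ in range(500)]
    + [{"group": "B", "income": random.gauss(55_000, 10_000), "postcode": "90001"} for _ in range(500)]
)

rates = {}
for group in ("A", "B"):
    members = [a for a in applicants if a["group"] == group]
    approved = sum(loan_decision(a["income"], a["postcode"]) for a in members)
    rates[group] = approved / len(members)

print(rates)
if abs(rates["A"] - rates["B"]) > 0.2:   # illustrative disparity threshold
    print("Disparity exceeds threshold: flag this rule for ethical review.")
```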

Data for Development


New Report by the OECD: “The 2017 volume of the Development Co-operation Report focuses on Data for Development. “Big Data” and “the Internet of Things” are more than buzzwords: the data revolution is transforming the way that economies and societies are functioning across the planet. The Sustainable Development Goals along with the data revolution are opportunities that should not be missed: more and better data can help boost inclusive growth, fight inequalities and combat climate change. These data are also essential to measure and monitor progress against the Sustainable Development Goals.

The value of data in enabling development is uncontested. Yet, there continue to be worrying gaps in basic data about people and the planet and weak capacity in developing countries to produce the data that policy makers need to deliver reforms and policies that achieve real, visible and long-lasting development results. At the same time, investing in building statistical capacity – which represented about 0.30% of ODA in 2015 – is not a priority for most providers of development assistance.

There is a need for stronger political leadership, greater investment and more collective action to bridge the data divide for development. With the unfolding data revolution, developing countries and donors have a unique chance to act now to boost data production and use for the benefit of citizens. This report sets out priority actions and good practices that will help policy makers and providers of development assistance to bridge the global data divide, notably by strengthening statistical systems in developing countries to produce better data for better policies and better lives….(More)”