Linux Foundation Debuts Community Data License Agreement


Press Release: “The Linux Foundation, the nonprofit advancing professional open source management for mass collaboration, today announced the Community Data License Agreement (CDLA) family of open data agreements. In an era of expansive and often underused data, the CDLA licenses are an effort to define a licensing framework to support collaborative communities built around curating and sharing “open” data.

Inspired by the collaborative software development models of open source software, the CDLA licenses are designed to enable individuals and organizations of all types to share data as easily as they currently share open source software code. Soundly drafted licensing models can help people form communities to assemble, curate and maintain vast amounts of data, measured in petabytes and exabytes, to bring new value to communities of all types, to build new business opportunities and to power new applications that promise to enhance safety and services.

The growth of big data analytics, machine learning and artificial intelligence (AI) technologies has allowed people to extract unprecedented levels of insight from data. Now the challenge is to assemble the critical mass of data for those tools to analyze. The CDLA licenses are designed to help governments, academic institutions, businesses and other organizations open up and share data, with the goal of creating communities that curate and share data openly.

For instance, if automakers, suppliers and civil infrastructure services can share data, they may be able to improve safety, decrease energy consumption and improve predictive maintenance. Self-driving cars are heavily dependent on AI systems for navigation, and need massive volumes of data to function properly. Once on the road, they can generate nearly a gigabyte of data every second. For the average car, that means two petabytes of sensor, audio, video and other data each year.

Similarly, climate modeling can integrate measurements captured by government agencies with simulation data from other organizations and then use machine learning systems to look for patterns in the information. It’s estimated that a single model can yield a petabyte of data, a volume that challenges standard computer algorithms, but is useful for machine learning systems. This knowledge may help improve agriculture or aid in studying extreme weather patterns.

And if government agencies share aggregated data on building permits, school enrollment figures, and sewer and water usage, their citizens benefit from the ability of commercial entities to anticipate their future needs and respond with infrastructure and facilities that arrive in anticipation of citizens’ demands.

“An open data license is essential for the frictionless sharing of the data that powers both critical technologies and societal benefits,” said Jim Zemlin, Executive Director of The Linux Foundation. “The success of open source software provides a powerful example of what can be accomplished when people come together around a resource and advance it for the common good. The CDLA licenses are a key step in that direction and will encourage the continued growth of applications and infrastructure.”…(More)”.

The role of policy entrepreneurs in open government data policy innovation diffusion: An analysis of Australian Federal and State Governments


Paper by Akemi Takeoka Chatfield and Christopher G. Reddick: “Open government data (OGD) policy differs substantially from the existing Freedom of Information policies. Consequently, OGD can be viewed as a policy innovation. Drawing on both innovation diffusion theory and its application to public policy innovation research, we examine Australia’s OGD policy diffusion patterns at both the federal and state government levels based on the policy adoption timing and CKAN portal “Organization” and “Category” statistics. We found that state governments that had adopted OGD policies earlier had active policy entrepreneurs (or lead departments/agencies) responsible for the policy innovation diffusion across the different government departments. We also found that their efficacy ranking was relatively high in terms of OGD portal openness when openness is measured by the greater number of datasets proactively and systematically published through their OGD portals. These findings have important implications for the role played by OGD policy entrepreneurs in openly sharing the government-owned datasets with the public….(More)”.

Enabling Blockchain Innovation in the U.S. Federal Government


Primer by the American Council for Technology – Industry Advisory Council: “… intended to be a foundational tool in the understanding of blockchain and its use cases within the United States federal government. To that end, it should help allay the concerns that some may have about this new technology by providing an introduction to blockchain and its related technologies, and how blockchain can be safely and securely applied to the right government use cases. Blockchain has the potential to help government to reduce fraud, errors and the cost of paper-intensive processes, while enabling collaboration across multiple divisions and agencies to provide more efficient and effective services to citizens. Moreover, the adoption of blockchain may also allow governmental agencies to provide new value-added services to businesses and others which can generate new sources of revenue for these agencies….(More)”.

Our laws don’t do enough to protect our health data


Article at The Conversation: “A particularly sensitive type of big data is medical big data. Medical big data can consist of electronic health records, insurance claims, information entered by patients into websites such as PatientsLikeMe and more. Health information can even be gleaned from web searches, Facebook and your recent purchases.

Such data can be used for beneficial purposes by medical researchers, public health authorities, and healthcare administrators. For example, they can use it to study medical treatments, combat epidemics and reduce costs. But others who can obtain medical big data may have more selfish agendas.

I am a professor of law and bioethics who has researched big data extensively. Last year, I published a book entitled Electronic Health Records and Medical Big Data: Law and Policy.

I have become increasingly concerned about how medical big data might be used and who could use it. Our laws currently don’t do enough to prevent harm associated with big data.

What your data says about you

Personal health information could be of interest to many, including employers, financial institutions, marketers and educational institutions. Such entities may wish to exploit it for decision-making purposes.

For example, employers presumably prefer healthy employees who are productive, take few sick days and have low medical costs. However, there are laws that prohibit employers from discriminating against workers because of their health conditions. These laws are the Americans with Disabilities Act (ADA) and the Genetic Information Nondiscrimination Act. So, employers are not permitted to reject qualified applicants simply because they have diabetes, depression or a genetic abnormality.

However, the same is not true for most predictive information regarding possible future ailments. Nothing prevents employers from rejecting or firing healthy workers out of the concern that they will later develop an impairment or disability, unless that concern is based on genetic information.

What non-genetic data can provide evidence regarding future health problems? Smoking status, eating preferences, exercise habits, weight and exposure to toxins are all informative. Scientists believe that biomarkers in your blood and other health details can predict cognitive decline, depression and diabetes.

Even bicycle purchases, credit scores and voting in midterm elections can be indicators of your health status.

Gathering data

How might employers obtain predictive data? An easy source is social media, where many individuals publicly post very private information. Through social media, your employer might learn that you smoke, hate to exercise or have high cholesterol.

Another potential source is wellness programs. These programs seek to improve workers’ health through incentives to exercise, stop smoking, manage diabetes, obtain health screenings and so on. While many wellness programs are run by third party vendors that promise confidentiality, that is not always the case.

In addition, employers may be able to purchase information from data brokers that collect, compile and sell personal information. Data brokers mine sources such as social media, personal websites, U.S. Census records, state hospital records, retailers’ purchasing records, real property records, insurance claims and more. Two well-known data brokers are Spokeo and Acxiom.

Some of the data employers can obtain identify individuals by name. But even information that does not provide obvious identifying details can be valuable. Wellness program vendors, for example, might provide employers with summary data about their workforce but strip away particulars such as names and birthdates. Nevertheless, de-identified information can sometimes be re-identified by experts. Data miners can match information to data that is publicly available….(More)”.

Reboot for the AI revolution


Yuval Noah Harari in Nature: “The ongoing artificial-intelligence revolution will change almost every line of work, creating enormous social and economic opportunities — and challenges. Some believe that intelligent computers will push humans out of the job market and create a new ‘useless class’; others maintain that automation will generate a wide range of new human jobs and greater prosperity for all. Almost everybody agrees that we should take action to prevent the worst-case scenarios….

Governments might decide to deliberately slow down the pace of automation, to lessen the resulting shocks and allow time for readjustments. But it will probably be both impossible and undesirable to prevent automation and job loss completely. That would mean giving up the immense positive potential of AI and robotics. If self-driving vehicles drive more safely and cheaply than humans, it would be counterproductive to ban them just to protect the jobs of taxi and lorry drivers.

A more sensible strategy is to create new jobs. In particular, as routine jobs are automated, opportunities for new non-routine jobs will mushroom. For example, general physicians who focus on diagnosing known diseases and administering familiar treatments will probably be replaced by AI doctors. Precisely because of that, there will be more money to pay human experts to do groundbreaking medical research, develop new medications and pioneer innovative surgical techniques.

This calls for economic entrepreneurship and legal dexterity. Above all, it necessitates a revolution in education…Creating new jobs might prove easier than retraining people to fill them. A huge useless class might appear, owing to both an absolute lack of jobs and a lack of relevant education and mental flexibility….

With insights gleaned from early warning signs and test cases, scholars should strive to develop new socio-economic models. The old ones no longer hold. For example, twentieth-century socialism assumed that the working class was crucial to the economy, and socialist thinkers tried to teach the proletariat how to translate its immense economic power into political clout. In the twenty-first century, if the masses lose their economic value they might have to struggle against irrelevance rather than exploitation….The challenges posed in the twenty-first century by the merger of infotech and biotech are arguably bigger than those thrown up by steam engines, railways, electricity and fossil fuels. Given the immense destructive power of our modern civilization, we cannot afford more failed models, world wars and bloody revolutions. We have to do better this time….(More)”

Laboratories for news? Experimenting with journalism hackathons


Jan Lauren Boyles in Journalism: “Journalism hackathons are computationally based events in which participants create news product prototypes. In the ideal case, the gatherings are rooted in local community, enabling a wide set of institutional stakeholders (legacy journalists, hacker journalists, civic hackers, and the general public) to gather in conversation around key civic issues. This study explores how and to what extent journalism hackathons operate as a community-based laboratory for translating open data from practitioners to the public. Surfaced from in-depth interviews with event organizers encompassing nine countries, the findings illustrate that journalism hackathons are most successful when collaboration integrates civic organizations and community leaders….(More)”.

How “Big Data” Went Bust


“The problem with “big data” is not that data is bad. It’s not even that big data is bad: Applied carefully, massive data sets can reveal important trends that would otherwise go undetected. It’s the fetishization of data, and its uncritical use, that tends to lead to disaster, as Julia Rose West recently wrote for Slate. And that’s what “big data,” as a catchphrase, came to represent.

By its nature, big data is hard to interpret. When you’re collecting billions of data points—clicks or cursor positions on a website; turns of a turnstile in a large public space; hourly wind speed observations from around the world; tweets—the provenance of any given data point is obscured. This in turn means that seemingly high-level trends might turn out to be artifacts of problems in the data or methodology at the most granular level possible. But perhaps the bigger problem is that the data you have are usually only a proxy for what you really want to know. Big data doesn’t solve that problem—it magnifies it….

Aside from swearing off data and reverting to anecdote and intuition, there are at least two viable ways to deal with the problems that arise from the imperfect relationship between a data set and the real-world outcome you’re trying to measure or predict.

One is, in short: moar data. This has long been Facebook’s approach. When it became apparent that users’ “likes” were a flawed proxy for what they actually wanted to see more of in their feeds, the company responded by adding more and more proxies to its model. It began measuring other things, like the amount of time they spent looking at a post in their feed, the amount of time they spent reading a story they had clicked on, and whether they hit “like” before or after they had read the piece. When Facebook’s engineers had gone as far as they could in weighting and optimizing those metrics, they found that users were still unsatisfied in important ways. So the company added yet more metrics to the sauce: It started running huge user-survey panels, added new reaction emojis by which users could convey more nuanced sentiments, and started using A.I. to detect clickbait-y language in posts by pages and publishers. The company knows none of these proxies are perfect. But by constantly adding more of them to the mix, it can theoretically edge ever closer to an algorithm that delivers to users the posts that they most want to see.

One downside of the moar data approach is that it’s hard and expensive. Another is that the more variables are added to your model, the more complex, opaque, and unintelligible its methodology becomes. This is part of the problem Pasquale articulated in The Black Box Society. Even the most sophisticated algorithm, drawing on the best data sets, can go awry—and when it does, diagnosing the problem can be nigh-impossible. There are also the perils of “overfitting” and false confidence: The more sophisticated your model becomes, the more perfectly it seems to match up with all your past observations, and the more faith you place in it, the greater the danger that it will eventually fail you in a dramatic way. (Think mortgage crisis, election prediction models, and Zynga.)

Another possible response to the problems that arise from biases in big data sets is what some have taken to calling “small data.” Small data refers to data sets that are simple enough to be analyzed and interpreted directly by humans, without recourse to supercomputers or Hadoop jobs. Like “slow food,” the term arose as a conscious reaction to the prevalence of its opposite….(More)”

Open Space: The Global Effort for Open Access to Environmental Satellite Data


Book by Mariel Borowitz: “Key to understanding and addressing climate change is continuous and precise monitoring of environmental conditions. Satellites play an important role in collecting climate data, offering comprehensive global coverage that can’t be matched by in situ observation. And yet, as Mariel Borowitz shows in this book, much satellite data is not freely available but restricted; this remains true despite the data-sharing advocacy of international organizations and a global open data movement. Borowitz examines policies governing the sharing of environmental satellite data, offering a model of data-sharing policy development and applying it in case studies from the United States, Europe, and Japan—countries responsible for nearly half of the unclassified government Earth observation satellites.

Borowitz develops a model that centers on the government agency as the primary actor while taking into account the roles of such outside actors as other government officials and non-governmental actors, as well as the economic, security, and normative attributes of the data itself. The case studies include the U.S. National Aeronautics and Space Administration (NASA), the U.S. National Oceanic and Atmospheric Administration (NOAA), and the United States Geological Survey (USGS); the European Space Agency (ESA) and the European Organization for the Exploitation of Meteorological Satellites (EUMETSAT); and the Japan Aerospace Exploration Agency (JAXA) and the Japan Meteorological Agency (JMA). Finally, she considers the policy implications of her findings for the future and provides recommendations on how to increase global sharing of satellite data….(More)”.

Our Gutenberg Moment: It’s Time To Grapple With The Internet’s Effect On Democracy


Alberto Ibargüen at HuffPost: “When clashes wracked Charlottesville, many Americans saw neo-Nazi demonstrators as the obvious instigators. But others focused on counter-demonstrators, a view amplified by the president blaming “many sides.” The rift in perception underscored an uncomfortable but unavoidable truth about the flow of information today: Americans no longer have a shared foundation of facts upon which we can agree.

Politics has long been a messy, divisive business. I lived through the 1960s, a period of similar dissatisfaction, disillusionment, and disunity, brilliantly chronicled by Ken Burns’ new film “The Vietnam War” on PBS. But common, local knowledge —of history and current events — has always been the great equalizer in American society. Today, however, a decrease in shared knowledge has led to a collapse in trust. Over the past few years, we have watched our capacity to compromise wane as not only our politics, but also our most basic value systems, have become polarized.

The key difference between then and now is how news is delivered and consumed. At the beginning of our Republic, the reach of media was local and largely verifiable. That direct relationship between media outlets and their communities — local newspapers and, later, radio and TV stations — held until the second half of the 20th century. Network TV began to create a sense of national community but it fractioned with the sudden ability to offer targeted, membership-based models via cable.

But cable was nothing compared to Internet. Internet’s unique ability to personalize and to create virtual communities of interest accelerated the decline of newspapers and television business models and altered the flow of information in ways that we are still uncovering. “Media” now means digital and cable, cool mediums that require hot performance. Trust in all media, including traditional media, is at an all-time low, and we’re just now beginning to grapple with the threat to democracy posed by this erosion of trust.

Internet is potentially the greatest democratizing tool in history. It is also democracy’s greatest challenge. In offering access to information that can support any position and confirm any bias, social media has propelled the erosion of our common set of everyday facts….(More)”.

Open data, democracy and public service reform


Mark Thompson at Computer Weekly: “Discussion around reforming public services is as important as better information sharing rules if government is to make the most of public data…

Our public services face two paradoxes in relation to data sharing. First, on the demand side, “Zuckerberg’s law” – which claims that the amount of data we’re happy to share with companies increases exponentially year-on-year – flies in the face of our wariness as citizens to share with the state….

The upcoming General Data Protection Regulation (GDPR) – a beefed-up version of the existing Data Protection Act (DPA) – is likely to only exacerbate a fundamental problem, therefore: citizens don’t want the state to know much about them, and public servants don’t want to share. Each behaviour is paradoxical, and thus complex to address culturally.

Worse, we need to accelerate our public conversation considerably if we are to maintain pace with accelerating technological developments.

Existing complexity in the data space will shortly be exacerbated by new abilities to process unstructured data such as images and natural language – abilities which offer entirely new opportunities for commercial exploitation as well as surveillance…(More)”.