Stewardship in the “Age of Algorithms”


Clifford Lynch at First Monday: “This paper explores pragmatic approaches that might be employed to document the behavior of large, complex socio-technical systems (often today shorthanded as “algorithms”) that centrally involve some mixture of personalization, opaque rules, and machine learning components. Thinking rooted in traditional archival methodology — focusing on the preservation of physical and digital objects, and perhaps the accompanying preservation of their environments to permit subsequent interpretation or performance of the objects — has been a total failure for many reasons, and we must address this problem.

The approaches presented here are clearly imperfect, unproven, labor-intensive, and sensitive to the often hidden factors that the target systems use for decision-making (including personalization of results, where relevant); but they are a place to begin, and their limitations are at least outlined.

Numerous research questions must be explored before we can fully understand the strengths and limitations of what is proposed here. But it represents a way forward. This is essentially the first paper I am aware of that tries to make effective progress on the stewardship challenges facing our society in the so-called “Age of Algorithms;” the paper concludes with some discussion of the failure to address these challenges to date, and the implications for the roles of archivists as opposed to other players in the broader enterprise of stewardship — that is, the capture of a record of the present and the transmission of this record, and the records bequeathed by the past, into the future. It may well be that we see the emergence of a new group of creators of documentation, perhaps predominantly social scientists and humanists, taking the front lines in dealing with the “Age of Algorithms,” with their materials then destined for our memory organizations to be cared for into the future…(More)”.

Solving Public Problems with Data


Dinorah Cantú-Pedraza and Sam DeJohn at The GovLab: “….To serve the goal of more data-driven and evidence-based governing, The GovLab at NYU Tandon School of Engineering this week launched “Solving Public Problems with Data,” a new online course developed with support from the Laura and John Arnold Foundation.

This online lecture series helps those working for the public sector, or simply in the public interest, learn to use data to improve decision-making. Through real-world examples and case studies — captured in 10 video lectures from leading experts in the field — the new course outlines the fundamental principles of data science and explores ways practitioners can develop a data analytical mindset. Lectures in the series include:

  1. Introduction to evidence-based decision-making  (Quentin Palfrey, formerly of MIT)
  2. Data analytical thinking and methods, Part I (Julia Lane, NYU)
  3. Machine learning (Gideon Mann, Bloomberg LP)
  4. Discovering and collecting data (Carter Hewgley, Johns Hopkins University)
  5. Platforms and where to store data (Arnaud Sahuguet, Cornell Tech)
  6. Data analytical thinking and methods, Part II (Daniel Goroff, Alfred P. Sloan Foundation)
  7. Barriers to building a data practice (Beth Blauer, Johns Hopkins University and GovEx)
  8. Data collaboratives (Stefaan G. Verhulst, The GovLab)
  9. Strengthening a data analytic culture (Amen Ra Mashariki, ESRI)
  10. Data governance and sharing (Beth Simone Noveck, NYU Tandon/The GovLab)

The goal of the lecture series is to enable participants to define and leverage the value of data to achieve improved outcomes and equity, reduced cost, and increased efficiency in how public policies and services are created. No prior experience with computer science or statistics is necessary or assumed. In fact, the course is designed precisely to serve public professionals seeking an introduction to data science….(More)”.

Nearly All of Wikipedia Is Written By Just 1 Percent of Its Editors


Daniel Oberhaus at Motherboard: “…Sixteen years later, the free encyclopedia and fifth most popular website in the world is well on its way to this goal. Today, Wikipedia is home to 43 million articles in 285 languages, and all of these articles are written and edited by an autonomous group of international volunteers.

Although the non-profit Wikimedia Foundation diligently keeps track of how editors and users interact with the site, until recently it was unclear how content production on Wikipedia was distributed among editors. According to the results of a recent study that looked at the 250 million edits made on Wikipedia during its first ten years, only about 1 percent of Wikipedia’s editors have generated 77 percent of the site’s content.

“Wikipedia is both an organization and a social movement,” Sorin Matei, the director of the Purdue University Data Storytelling Network and lead author of the study, told me on the phone. “The assumption is that it’s a creation of the crowd, but this couldn’t be further from the truth. Wikipedia wouldn’t have been possible without a dedicated leadership.”

At the time of writing, there are roughly 132,000 registered editors who have been active on Wikipedia in the last month (there are also an unknown number of unregistered Wikipedians who contribute to the site). So statistically speaking, only about 1,300 people are creating over three-quarters of the 600 new articles posted to Wikipedia every day.

Of course, these “1 percenters” have changed over the last decade and a half. According to Matei, roughly 40 percent of the top 1 percent of editors bow out about every five weeks. In the early days, when there were only a few hundred thousand people collaborating on Wikipedia, Matei said the content production was significantly more equitable. But as the encyclopedia grew, and the number of collaborators grew with it, a cadre of die-hard editors emerged that have accounted for the bulk of Wikipedia’s growth ever since.

Matei and his colleague Brian Britt, an assistant professor of journalism at South Dakota State University, used a machine learning algorithm to crawl the quarter of a billion publicly available edit logs from Wikipedia’s first decade of existence. The results of this research, published in September as a book, suggest that for all of Wikipedia’s pretension to being a site produced by a network of freely collaborating peers, “some peers are more equal than others,” according to Matei.
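Matei and Britt’s headline number is, in essence, a concentration measure over the public edit logs. The sketch below is a simplified illustration of that kind of calculation, not the authors’ actual pipeline; the heavy-tailed synthetic edit log and the use of raw edit counts (rather than surviving content) are assumptions made purely for demonstration:

```python
import numpy as np
import pandas as pd

# Illustrative edit log: one row per edit, keyed by editor ID. In practice
# this would come from Wikipedia's publicly available edit history dumps.
rng = np.random.default_rng(0)
edit_log = pd.DataFrame({
    "editor": rng.zipf(a=2.0, size=1_000_000)  # heavy-tailed activity, as observed on Wikipedia
})

# Count edits per editor, most active first.
edits_per_editor = edit_log["editor"].value_counts().sort_values(ascending=False)

# Share of all edits contributed by the most active 1 percent of editors.
top_1pct_n = max(1, int(len(edits_per_editor) * 0.01))
top_share = edits_per_editor.iloc[:top_1pct_n].sum() / edits_per_editor.sum()
print(f"Top 1% of editors account for {top_share:.0%} of edits")
```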

Matei and Britt argue that rather than being a decentralized, spontaneously evolving organization, Wikipedia is better described as an “adhocracy”—a stable hierarchical power structure that nevertheless allows for a high degree of individual mobility within that hierarchy….(More)”.

More Machine Learning About Congress’ Priorities


ProPublica: “We keep training machine learning models on Congress. Find out what this one learned about lawmakers’ top issues…

Speaker of the House Paul Ryan is a tax wonk ― and most observers of Congress know that. But knowing what interests the other 434 members of Congress is harder.

To make it easier to know what issues each lawmaker really focuses on, we’re launching a new feature in our Represent database called Policy Priorities. We had two goals in creating it: to help researchers and journalists understand what drives particular members of Congress and to enable regular citizens to compare their representatives’ priorities to their own and their communities’.

We created Policy Priorities using some sophisticated computer algorithms (more on this in a second) to calculate interest based on what each congressperson talks ― and brags ― about in their press releases.

Voting and drafting legislation aren’t the only things members of Congress do with their time, but they’re often the main way we analyze congressional data, in part because they’re easily measured. But the job of a member of Congress goes well past voting. They go to committee meetings, discuss policy on the floor and in caucuses, raise funds and ― important for our purposes ― communicate with their constituents and journalists back home. They use press releases to talk about what they’ve accomplished and to demonstrate their commitment to their political ideals.

We’ve been gathering these press releases for a few years, and have a body of some 86,000 that we used for a kind of analysis called machine learning….(More)”.
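The excerpt does not spell out ProPublica’s exact method, but a common way to surface per-lawmaker topics from a press release corpus is unsupervised topic modeling. Below is a minimal sketch of that general approach, using TF-IDF features and non-negative matrix factorization; the topic count, the toy documents, and the scoring rule are all assumptions, not ProPublica’s implementation:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# Toy stand-ins for press releases; the real corpus is roughly 86,000 documents.
press_releases = [
    "Today I introduced a bill to cut taxes for working families.",
    "We secured new funding for rural hospitals and veterans' health care.",
    "My tax reform plan will simplify the code and lower rates.",
    "The committee held a hearing on hospital costs and Medicare coverage.",
]

# Convert documents to TF-IDF vectors, ignoring very common English words.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(press_releases)

# Factor the document-term matrix into a small number of latent topics.
nmf = NMF(n_components=2, random_state=0)
doc_topic = nmf.fit_transform(X)  # document-by-topic weights
terms = vectorizer.get_feature_names_out()

# Show the top words that define each topic.
for k, component in enumerate(nmf.components_):
    top_terms = [terms[i] for i in component.argsort()[-4:][::-1]]
    print(f"Topic {k}: {', '.join(top_terms)}")

# A lawmaker's "priority" could then be taken as the topic with the largest
# average weight across that lawmaker's press releases.
```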

Understanding Corporate Data Sharing Decisions: Practices, Challenges, and Opportunities for Sharing Corporate Data with Researchers


Leslie Harris at the Future of Privacy Forum: “Data has become the currency of the modern economy. A recent study projects the global volume of data to grow from about 0.8 zettabytes (ZB) in 2009 to more than 35 ZB in 2020, most of it generated within the last two years and held by the corporate sector.

As the cost of data collection and storage becomes cheaper and computing power increases, so does the value of data to the corporate bottom line. Powerful data science techniques, including machine learning and deep learning, make it possible to search, extract and analyze enormous sets of data from many sources in order to uncover novel insights and engage in predictive analysis. Breakthrough computational techniques allow complex analysis of encrypted data, making it possible for researchers to protect individual privacy, while extracting valuable insights.
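One family of such techniques is partially homomorphic encryption, which lets an analyst compute aggregates over ciphertexts without ever seeing individual values. As a minimal sketch (using the open source python-paillier library; the salary figures and the sum-and-mean workflow are invented for illustration, not drawn from the report):

```python
from phe import paillier

# The data holder generates a keypair and encrypts the sensitive values.
public_key, private_key = paillier.generate_paillier_keypair()
salaries = [52_000, 61_500, 48_250, 75_000]
encrypted = [public_key.encrypt(s) for s in salaries]

# A researcher holding only the public key and ciphertexts can still compute
# an encrypted sum and mean: Paillier encryption supports adding ciphertexts
# and multiplying a ciphertext by a plaintext scalar.
encrypted_sum = sum(encrypted[1:], encrypted[0])
encrypted_mean = encrypted_sum * (1 / len(salaries))

# Only the data holder can decrypt the aggregate result.
print(private_key.decrypt(encrypted_mean))  # about 59187.5; no individual salary revealed
```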

At the same time, these newfound data sources hold significant promise for advancing scholarship, supporting evidence-based policymaking and more robust government statistics, and shaping more impactful social interventions. But because most of this data is held by the private sector, it is rarely available for these purposes, posing what many have argued is a serious impediment to scientific progress.

A variety of reasons have been posited for the reluctance of the corporate sector to share data for academic research. Some have suggested that the private sector doesn’t realize the value of their data for broader social and scientific advancement. Others suggest that companies have no “chief mission” or public obligation to share. But most observers describe the challenge as complex and multifaceted. Companies face a variety of commercial, legal, ethical, and reputational risks that serve as disincentives to sharing data for academic research, with privacy – particularly the risk of reidentification – an intractable concern. For companies, striking the right balance between the commercial and societal value of their data, the privacy interests of their customers, and the interests of academics presents a formidable dilemma.

To be sure, there is evidence that some companies are beginning to share for academic research. For example, a number of pharmaceutical companies are now sharing clinical trial data with researchers, and a number of individual companies have taken steps to make data available as well. What is more, companies are also increasingly providing open or shared data for other important “public good” activities, including international development, humanitarian assistance and better public decision-making. Some are contributing to data collaboratives that pool data from different sources to address societal concerns. Yet, it is still not clear whether and to what extent this “new era of data openness” will accelerate data sharing for academic research.

Today, the Future of Privacy Forum released a new study, Understanding Corporate Data Sharing Decisions: Practices, Challenges, and Opportunities for Sharing Corporate Data with Researchers. In this report, we aim to contribute to the literature by seeking the “ground truth” from the corporate sector about the challenges they encounter when they consider making data available for academic research. We hope that the impressions and insights gained from this first look at the issue will help formulate further research questions, inform the dialogue between key stakeholders, and identify constructive next steps and areas for further action and investment….(More)”.

Bot.Me: A revolutionary partnership


PWC Consumer Intelligence Series: “The modern world has been shaped by the technological revolutions of the past, like the Industrial Revolution and the Information Revolution. The former redefined the way the world values both human and material resources; the latter redefined value in terms of resources while democratizing information. Today, as technology progresses even further, value is certain to shift again, with a focus on sentiments more intrinsic to the human experience: thinking, creativity, and problem-solving. AI, shorthand for artificial intelligence, defines technologies emerging today that can understand, learn, and then act based on that information. Forms of AI in use today include digital assistants, chatbots, and machine learning.

Today, AI works in three ways:

  • Assisted intelligence, widely available today, improves what people and organizations are already doing. A simple example, prevalent in cars today, is the GPS navigation program that offers directions to drivers and adjusts to road conditions.
  • Augmented intelligence, emerging today, enables people and organizations to do things they couldn’t otherwise do. For example, the combination of programs that organize cars in ride-sharing services enables businesses that could not otherwise exist.
  • Autonomous intelligence, being developed for the future, establishes machines that act on their own. An example of this will be self-driving vehicles, when they come into widespread use.

With a market projected to reach $70 billion by 2020, AI is poised to have a transformative effect on consumer, enterprise, and government markets around the world. While there are certainly obstacles to overcome, consumers believe that AI has the potential to assist in medical breakthroughs, democratize costly services, elevate poor customer service, and even free up an overburdened workforce. Some tech optimists believe AI could create a world where human abilities are amplified as machines help mankind process, analyze, and evaluate the abundance of data that creates today’s world, allowing humans to spend more time engaged in high-level thinking, creativity, and decision-making. Technological revolutions, like the Industrial Revolution and the Information Revolution, didn’t happen overnight. In fact, people in the midst of those revolutions often didn’t even realize they were happening, until history was recorded later.

That is where we find ourselves today, in the very beginning of what some are calling the “augmented age.” Just like humans in the past, it is up to mankind to find the best ways to leverage these machine revolutions to help the world evolve. As Isaac Asimov, the prolific science fiction writer with many works on AI, mused, “No sensible decision can be made any longer without taking into account not only the world as it is, but the world as it will be.” As a future with AI approaches, it’s important to understand how people think of it today, how it will amplify the world tomorrow, and what guiding principles will be required to navigate this monumental change….(More)”.

Linux Foundation Debuts Community Data License Agreement


Press Release: “The Linux Foundation, the nonprofit advancing professional open source management for mass collaboration, today announced the Community Data License Agreement (CDLA) family of open data agreements. In an era of expansive and often underused data, the CDLA licenses are an effort to define a licensing framework to support collaborative communities built around curating and sharing “open” data.

Inspired by the collaborative software development models of open source software, the CDLA licenses are designed to enable individuals and organizations of all types to share data as easily as they currently share open source software code. Soundly drafted licensing models can help people form communities to assemble, curate and maintain vast amounts of data, measured in petabytes and exabytes, to bring new value to communities of all types, to build new business opportunities and to power new applications that promise to enhance safety and services.

The growth of big data analytics, machine learning and artificial intelligence (AI) technologies has allowed people to extract unprecedented levels of insight from data. Now the challenge is to assemble the critical mass of data for those tools to analyze. The CDLA licenses are designed to help governments, academic institutions, businesses and other organizations open up and share data, with the goal of creating communities that curate and share data openly.

For instance, if automakers, suppliers and civil infrastructure services can share data, they may be able to improve safety, decrease energy consumption and improve predictive maintenance. Self-driving cars are heavily dependent on AI systems for navigation, and need massive volumes of data to function properly. Once on the road, they can generate nearly a gigabyte of data every second. For the average car, that means two petabytes of sensor, audio, video and other data each year.

Similarly, climate modeling can integrate measurements captured by government agencies with simulation data from other organizations and then use machine learning systems to look for patterns in the information. It’s estimated that a single model can yield a petabyte of data, a volume that challenges standard computer algorithms, but is useful for machine learning systems. This knowledge may help improve agriculture or aid in studying extreme weather patterns.

And if government agencies share aggregated data on building permits, school enrollment figures, and sewer and water usage, their citizens benefit from the ability of commercial entities to anticipate their future needs and respond with infrastructure and facilities that arrive ahead of citizens’ demands.

“An open data license is essential for the frictionless sharing of the data that powers both critical technologies and societal benefits,” said Jim Zemlin, Executive Director of The Linux Foundation. “The success of open source software provides a powerful example of what can be accomplished when people come together around a resource and advance it for the common good. The CDLA licenses are a key step in that direction and will encourage the continued growth of applications and infrastructure.”…(More)”.

How We Can Stop Earthquakes From Killing People Before They Even Hit


Justin Worland in Time Magazine: “…Out of that realization came a plan to reshape disaster management using big data. Just a few months later, Wani worked with two fellow Stanford students to create a platform to predict the toll of natural disasters. The concept is simple but also revolutionary. The One Concern software pulls geological and structural data from a variety of public and private sources and uses machine learning to predict the impact of an earthquake down to individual city blocks and buildings. Real-time information input during an earthquake improves how the system responds. And earthquakes represent just the start for the company, which plans to launch a similar program for floods and eventually other natural disasters….
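Time’s description leaves One Concern’s model unspecified, but the general idea of block-by-block damage estimation can be sketched with an ordinary supervised learner. The example below is purely illustrative: the features, the synthetic damage rule, and the choice of a gradient-boosted regressor are all assumptions, not One Concern’s system. Building-level predictions like these could then be aggregated by city block:

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.default_rng(42)
n = 5_000

# Synthetic building records: shaking intensity, soil amplification,
# construction year, and number of stories (a real system would use far
# richer geological and structural inputs).
X = np.column_stack([
    rng.uniform(4.0, 9.0, n),      # peak shaking intensity (MMI-like scale)
    rng.uniform(0.0, 1.0, n),      # soil amplification factor
    rng.integers(1900, 2020, n),   # year built
    rng.integers(1, 30, n),        # stories
])

# Invented ground-truth rule: older buildings on soft soil under strong
# shaking lose a larger fraction of their value.
damage = np.clip(
    0.08 * (X[:, 0] - 4) + 0.3 * X[:, 1] - 0.002 * (X[:, 2] - 1900)
    + rng.normal(0, 0.05, n),
    0, 1,
)

model = HistGradientBoostingRegressor().fit(X, damage)

# Predicted damage ratio for a hypothetical 1920s mid-rise on soft soil
# during strong shaking.
print(model.predict([[8.0, 0.9, 1925, 6]]))
```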

Previous software might identify a general area where responders could expect damage, but it would appear as a “big red blob” that wasn’t helpful when deciding exactly where to send resources, Dayton says. The technology also integrates information from many sources and makes it easy to parse in an emergency situation when every moment matters. The instant damage evaluations mean fast and actionable information, so first responders can prioritize search and rescue in areas most likely to be worst-hit, rather than responding to 911 calls in the order they are received.

One Concern is not the only company that sees an opportunity to use data to rethink disaster response. The mapping company Esri has built rapid-response software that shows expected damage from disasters like earthquakes, wildfires and hurricanes. And the U.S. government has invested in programs to use data to shape disaster response at agencies like the National Oceanic and Atmospheric Administration (NOAA)….(More)”.

Using big data to predict suicide risk among Canadian youth


SAS Insights: “Suicide is the second leading cause of death among youth in Canada, according to Statistics Canada, accounting for one-fifth of deaths of people under the age of 25 in 2011. The Canadian Mental Health Association states that among 15 – 24 year olds the number is an even more frightening 24 percent – the third highest in the industrialized world. Yet despite these disturbing statistics, the signals that an individual plans on self-injury or suicide are hard to isolate….

Team members …collected 2.3 million tweets and used text mining software to identify 1.1 million of them as likely to have been authored by 13 to 17 year olds in Canada by building a machine learning model to predict age, based on the open source PAN author profiling dataset. Their analysis made use of natural language processing, predictive modelling, text mining, and data visualization….

However, there were challenges. Ages are not revealed on Twitter, so the team had to figure out how to tease out the data for 13 – 17 year olds in Canada. “We had a text data set, and we created a model to identify if people were in that age group based on how they talked in their tweets,” Soehl said. “From there, we picked some specific buzzwords and created topics around them, and our software mined those tweets to collect the people.”
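A minimal sketch of that kind of age-group classifier appears below. It is only an illustration: the toy training texts, the character n-gram features, and the logistic regression model are assumptions, whereas the real work trained on the PAN author profiling dataset with far more data and tuning:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled examples standing in for the PAN author profiling corpus:
# 1 = likely written by a 13-17 year old, 0 = likely older.
train_texts = [
    "omg math homework is so boring lol",
    "can't wait for prom, hope my mom says yes!!",
    "quarterly earnings call moved to thursday",
    "my daughter starts high school next week",
]
train_labels = [1, 1, 0, 0]

# Character n-grams are a common choice in author profiling because they
# pick up spelling style and slang without hand-built word features.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
model.fit(train_texts, train_labels)

# Apply the trained classifier to new tweets to keep likely 13-17 year olds.
tweets = ["ugh my teacher gave us a pop quiz today", "board meeting ran long again"]
print(model.predict(tweets))
```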

Another issue was the restrictions Twitter places on pulling data, though Soehl believes that once this analysis becomes an established solution, Twitter may work with researchers to expedite the process. “Now that we’ve shown it’s possible, there are a lot of places we can go with it,” said Soehl. “Once you know your path and figure out what’s going to be valuable, things come together quickly.”

The team looked at the percentage of people in the group who were talking about depression or suicide, and what they were talking about. Horne said that when SAS’ work went in front of a Canadian audience working in health care, they said that it definitely filled a gap in their data — and that was the validation he’d been looking for. The team also won $10,000 for creating the best answer to this question (the team donated the award money to two mental health charities: Mind Your Mind and Rise Asset Development).

What’s next?

That doesn’t mean the work is done, said Jos Polfliet. “We’re just scraping the surface of what can be done with the information.” Another way to use the results is to look at patterns and trends….(More)”

Artificial Intelligence and Public Policy


Paper by Adam D. Thierer, Andrea Castillo and Raymond Russell: “There is growing interest in the market potential of artificial intelligence (AI) technologies and applications as well as in the potential risks that these technologies might pose. As a result, questions are being raised about the legal and regulatory governance of AI, machine learning, “autonomous” systems, and related robotic and data technologies. Citing concerns about labor market effects, social inequality, and even physical harm, some have called for precautionary regulations that could have the effect of limiting AI development and deployment. In this paper, we recommend a different policy framework for AI technologies. At this nascent stage of AI technology development, we think a better case can be made for prudence, patience, and a continuing embrace of “permissionless innovation” as it pertains to modern digital technologies. Unless a compelling case can be made that a new invention will bring serious harm to society, innovation should be allowed to continue unabated, and problems, if they develop at all, can be addressed later…(More)”.