7 things we’ve learned about computer algorithms


Aaron Smith at Pew Research Center: “Algorithms are all around us, using massive stores of data and complex analytics to make decisions with often significant impacts on humans – from choosing the content people see on social media to judging whether a person is a good credit risk or job candidate. Pew Research Center released several reports in 2018 that explored the role and meaning of algorithms in people’s lives today. Here are some of the key themes that emerged from that research.

  1. Algorithmically generated content platforms play a prominent role in Americans’ information diets. Sizable shares of U.S. adults now get news on sites like Facebook or YouTube that use algorithms to curate the content they show to their users. A study by the Center found that 81% of YouTube users say they at least occasionally watch the videos suggested by the platform’s recommendation algorithm, and that these recommendations encourage users to watch progressively longer content as they click through the videos suggested by the site.
  2. The inner workings of even the most common algorithms can be confusing to users. Facebook is among the most popular social media platforms, but roughly half of Facebook users – including six-in-ten users ages 50 and older – say they do not understand how the site’s algorithmically generated news feed selects which posts to show them. And around three-quarters of Facebook users are not aware that the site automatically estimates their interests and preferences based on their online behaviors in order to deliver them targeted advertisements and other content.
  3. The public is wary of computer algorithms being used to make decisions with real-world consequences. The public expresses widespread concern about companies and other institutions using computer algorithms in situations with potential impacts on people’s lives. More than half (56%) of U.S. adults think it is unacceptable to use automated criminal risk scores when evaluating people who are up for parole. And 68% think it is unacceptable for companies to collect large quantities of data about individuals for the purpose of offering them deals or other financial incentives. When asked to elaborate on their worries, many feel that these programs violate people’s privacy, are unfair, or simply will not work as well as decisions made by humans….(More)”.

Achieving Digital Permanence


Raymond Blum with Betsy Beyer at ACM Queue: “Digital permanence has become a prevalent issue in society. This article focuses on the forces behind it and some of the techniques to achieve a desired state in which “what you read is what was written.” While techniques that can be imposed as layers above basic data stores—blockchains, for example—are valid approaches to achieving a system’s information assurance guarantees, this article won’t discuss them.

First, let’s define digital permanence and the more basic concept of data integrity.

Data integrity is the maintenance of the accuracy and consistency of stored information. Accuracy means that the data is stored as the set of values that were intended. Consistency means that these stored values remain the same over time—they do not unintentionally waver or morph as time passes.
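
The article’s notion of integrity (accuracy plus consistency over time) is commonly checked in practice by recording a content digest when data is written and re-verifying it whenever the data is read. A minimal sketch, not from the article, using only Python’s standard library and a hypothetical file:

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

# At write time: record the digest alongside the data (e.g., in a manifest).
# At read time: recompute and compare; a mismatch means the stored values
# have wavered or morphed, though restoring them still requires a replica.
ledger = Path("ledger.csv")                      # hypothetical stored record
ledger.write_bytes(b"2019-01-07,receipt,42.00\n")
recorded = sha256_of(ledger)                     # saved when the data was written
assert sha256_of(ledger) == recorded             # checked when the data is read
```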

Digital permanence refers to the techniques used to anticipate and then meet the expected lifetime of data stored in digital media. Digital permanence not only considers data integrity, but also targets guarantees of relevance and accessibility: the ability to recall stored data and to recall it with predicted latency and at a rate acceptable to the applications that require that information.

To illustrate the aspects of relevance and accessibility, consider two counterexamples: journals that were safely stored redundantly on Zip drives or punch cards may as well not exist if the hardware required to read the media into a current computing system isn’t available. Nor is it very useful to have receipts and ledgers stored on a tape medium that will take eight days to read in when you need the information for an audit on Thursday.

The Multiple Facets of Digital Permanence

Human memory is the most subjective record imaginable. Common adages and clichés such as “He said, she said,” “IIRC (If I remember correctly),” and “You might recall” recognize the truth of memories—that they are based only on fragments of the one-time subjective perception of any objective state of affairs. What’s more, research indicates that people alter their memories over time. Over the years, as the need to provide a common ground for actions based on past transactions arises, so does the need for an objective record of fact—an independent “true” past. These records must be both immutable to a reasonable degree and durable. Media such as clay tablets, parchment, photographic prints, and microfiche became popular because they satisfied the “write once, read many” requirement of society’s record keepers.

Information storage in the digital age has evolved to fit the scale of access (frequent) and volume (high) by moving to storage media that record and deliver information in an almost intangible state. Such media have distinct advantages: electrical impulses and the polarity of magnetized ferric compounds can be moved around at great speed and density. These media, unfortunately, also score higher in another measure: fragility. Paper and clay can survive large amounts of neglect and punishment, but a stray electromagnetic discharge or microscopic rupture can render a digital library inaccessible or unrecognizable.

It stands to reason that storing permanent records in some immutable and indestructible medium would be ideal—something that, once altered to encode information, could never be altered again, either by an overwrite or destruction. Experience shows that such ideals are rarely realized; with enough force and will, the hardest stone can be broken and the most permanent markings defaced.

In considering and ensuring digital permanence, you want to guard against two different failures: the destruction of the storage medium, and a loss of the integrity or “truthfulness” of the records….(More)”.

Decoding Algorithms


Macalester College: “Ada Lovelace probably didn’t foresee the impact of the mathematical formula she published in 1843, now considered the first computer algorithm.

Nor could she have anticipated today’s widespread use of algorithms, in applications as different as the 2016 U.S. presidential campaign and Mac’s first-year seminar registration. “Over the last decade algorithms have become embedded in every aspect of our lives,” says Shilad Sen, professor in Macalester’s Math, Statistics, and Computer Science (MSCS) Department.

How do algorithms shape our society? Why is it important to be aware of them? And for readers who don’t know, what is an algorithm, anyway?…(More)”.

Leveraging and Sharing Data for Urban Flourishing


Testimony by Stefaan Verhulst before New York City Council Committee on Technology and the Commission on Public Information and Communication (COPIC): “We live in challenging times. From climate change to economic inequality, the difficulties confronting New York City, its citizens, and decision-makers are unprecedented in their variety, and also in their complexity and urgency. Our standard policy toolkit increasingly seems stale and ineffective. Existing governance institutions and mechanisms seem outdated and distrusted by large sections of the population.

To tackle today’s problems we need not only new solutions but also new methods for arriving at solutions. Data can play a central role in this task. Access to and the use of data in a trusted and responsible manner is central to meeting the challenges we face and enabling public innovation.

This hearing, called by the Technology Committee and the Commission on Public Information and Communication, is therefore timely and very important. It is my firm belief that rapid progress on developing an effective data-sharing framework is among the most important steps our New York City leaders can take to tackle the myriad of 21st-century challenges....

I am joined today by some of my distinguished NYU colleagues, Prof. Julia Lane and Prof. Julia Stoyanovich, who have worked extensively on the technical and privacy challenges associated with data sharing. To avoid duplicating our testimonies, I won’t focus on issues of privacy, trust, and how to establish a responsible data-sharing infrastructure, though these are central considerations for the type of data-driven approaches I will discuss. I am, of course, happy to elaborate on these topics during the question and answer session.

Instead, I want to focus on four core issues associated with data collaboration. I phrase these issues as answers to four questions. For each of these questions, I also provide a set of recommended actions that this Committee could consider undertaking or studying.

The four core questions are:

  • First, why should NYC care about data and data sharing?
  • Second, if you build a data-sharing framework, will they come?
  • Third, how can we best engage the private sector when it comes to sharing and using their data?
  • And fourth, is technology the main (or best) answer?…(More)”.

Assessing the Legitimacy of “Open” and “Closed” Data Partnerships for Sustainable Development


Paper by Andreas Rasche, Mette Morsing and Erik Wetter in Business and Society: “This article examines the legitimacy attached to different types of multi-stakeholder data partnerships occurring in the context of sustainable development. We develop a framework to assess the democratic legitimacy of two types of data partnerships: open data partnerships (where data and insights are mainly freely available) and closed data partnerships (where data and insights are mainly shared within a network of organizations). Our framework specifies criteria for assessing the legitimacy of relevant partnerships with regard to their input legitimacy as well as their output legitimacy. We demonstrate which particular characteristics of open and closed partnerships can be expected to influence an analysis of their input and output legitimacy….(More)”.

Evaluation and accurate diagnoses of pediatric diseases using artificial intelligence


Paper by Huimin Xia et al. in Nature Medicine: “Artificial intelligence (AI)-based methods have emerged as powerful tools to transform medical care. Although machine learning classifiers (MLCs) have already demonstrated strong performance in image-based diagnoses, analysis of diverse and massive electronic health record (EHR) data remains challenging. Here, we show that MLCs can query EHRs in a manner similar to the hypothetico-deductive reasoning used by physicians and unearth associations that previous statistical methods have not found. Our model applies an automated natural language processing system using deep learning techniques to extract clinically relevant information from EHRs. In total, 101.6 million data points from 1,362,559 pediatric patient visits presenting to a major referral center were analyzed to train and validate the framework.
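
The paper’s deep-learning pipeline is far richer than anything that fits here, but its general shape (free-text notes turned into features, then a classifier trained on labeled visits) can be conveyed with a toy stand-in. The sketch below is purely illustrative, not the authors’ system; it assumes scikit-learn, and the notes and labels are made up:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up visit notes and diagnosis labels, standing in for EHR free text.
notes = [
    "fever cough wheezing for three days",
    "barking cough hoarse voice stridor at night",
    "fever sore throat swollen tonsils",
    "wheezing shortness of breath known asthma",
]
diagnoses = ["bronchiolitis", "croup", "tonsillitis", "asthma"]

# Extract text features, then fit a classifier on the labeled visits.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(notes, diagnoses)

# Query the trained model with a new (also invented) presentation.
print(model.predict(["two days of fever and a barking cough"]))
```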

Our model demonstrates high diagnostic accuracy across multiple organ systems and is comparable to experienced pediatricians in diagnosing common childhood diseases. Our study provides a proof of concept for implementing an AI-based system as a means to aid physicians in tackling large amounts of data, augmenting diagnostic evaluations, and providing clinical decision support in cases of diagnostic uncertainty or complexity. Although this impact may be most evident in areas where healthcare providers are in relative shortage, the benefits of such an AI system are likely to be universal….(More)”.

Show me the Data! A Systematic Mapping on Open Government Data Visualization


Paper by André Eberhardt and Milene Selbach Silveira: “In recent years many government organizations have adopted Open Government Data policies to make their data publicly available. Although governments are having success in publishing their data, the availability of the datasets alone is not enough for people to make use of them, due to a lack of technical expertise such as programming skills and knowledge of data management. In this scenario, Visualization Techniques can be applied to Open Government Data in order to help solve this problem.

In this sense, we analyzed previously published papers related to Open Government Data visualization in order to provide an overview of how visualization techniques are being applied to Open Government Data and of the most common challenges in dealing with it. A systematic mapping study was conducted to survey the papers published in this area. The study found 775 papers and, after applying all inclusion and exclusion criteria, 32 papers were selected. Among other results, we found that datasets related to transportation are the main ones being used and that maps are the most used visualization technique. Finally, we report that data quality is the main challenge reported by studies that applied visualization techniques to Open Government Data…(More)”.
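
As an illustration of the survey’s most common pairing (transportation datasets rendered as maps), the sketch below plots invented bus-stop data with pandas and matplotlib; neither the data nor the code comes from the paper:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical open-data extract: a few bus stops with coordinates and
# average daily boardings (all values are invented).
stops = pd.DataFrame({
    "stop_name": ["Central Sq", "Elm & 3rd", "Riverside", "Airport"],
    "lat": [45.512, 45.520, 45.507, 45.495],
    "lon": [-122.658, -122.671, -122.640, -122.600],
    "boardings": [1200, 430, 310, 2200],
})

# A simple map-style view: position encodes location, marker size encodes use.
fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(stops["lon"], stops["lat"], s=stops["boardings"] / 10, alpha=0.5)
for _, row in stops.iterrows():
    ax.annotate(row["stop_name"], (row["lon"], row["lat"]))
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
ax.set_title("Bus stops sized by average daily boardings")
plt.show()
```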

Urban Computing


Book by Yu Zheng: “…Urban computing brings powerful computational techniques to bear on such urban challenges as pollution, energy consumption, and traffic congestion. Using today’s large-scale computing infrastructure and data gathered from sensing technologies, urban computing combines computer science with urban planning, transportation, environmental science, sociology, and other areas of urban studies, tackling specific problems with concrete methodologies in a data-centric computing framework. This authoritative treatment of urban computing offers an overview of the field, fundamental techniques, advanced models, and novel applications.

Each chapter acts as a tutorial that introduces readers to an important aspect of urban computing, with references to relevant research. The book outlines key concepts, sources of data, and typical applications; describes four paradigms of urban sensing in sensor-centric and human-centric categories; introduces data management for spatial and spatio-temporal data, from basic indexing and retrieval algorithms to cloud computing platforms; and covers beginning and advanced topics in mining knowledge from urban big data, beginning with fundamental data mining algorithms and progressing to advanced machine learning techniques. Urban Computing provides students, researchers, and application developers with an essential handbook to an evolving interdisciplinary field….(More)”
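
To give a flavor of the “basic indexing and retrieval algorithms” the book covers for spatial data, the sketch below implements a toy uniform-grid index in Python; it is an illustrative simplification, not code from the book, and the coordinates are arbitrary:

```python
from collections import defaultdict

class GridIndex:
    """A toy uniform-grid spatial index: points are bucketed by cell, and a
    range query only scans the cells that overlap the query box."""

    def __init__(self, cell_size: float):
        self.cell_size = cell_size
        self.cells = defaultdict(list)  # (cx, cy) -> list of (x, y, payload)

    def _cell(self, x: float, y: float):
        return (int(x // self.cell_size), int(y // self.cell_size))

    def insert(self, x: float, y: float, payload):
        self.cells[self._cell(x, y)].append((x, y, payload))

    def query_box(self, x_min, y_min, x_max, y_max):
        cx0, cy0 = self._cell(x_min, y_min)
        cx1, cy1 = self._cell(x_max, y_max)
        for cx in range(cx0, cx1 + 1):
            for cy in range(cy0, cy1 + 1):
                for x, y, payload in self.cells.get((cx, cy), []):
                    if x_min <= x <= x_max and y_min <= y <= y_max:
                        yield payload

# Usage: index two taxi pick-up points, then retrieve those inside a small box.
index = GridIndex(cell_size=0.01)
index.insert(116.397, 39.908, "trip-1")
index.insert(116.410, 39.915, "trip-2")
print(list(index.query_box(116.39, 39.90, 116.40, 39.91)))  # ['trip-1']
```

A real spatio-temporal store would add a time dimension and stronger structures (R-trees, quadtrees, distributed indexes), but the pattern of pruning the search space before filtering is the same.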

This is how AI bias really happens—and why it’s so hard to fix


Karen Hao at MIT Technology Review: “Over the past few months, we’ve documented how the vast majority of AI’s applications today are based on the category of algorithms known as deep learning, and how deep-learning algorithms find patterns in data. We’ve also covered how these technologies affect people’s lives: how they can perpetuate injustice in hiring, retail, and security and may already be doing so in the criminal legal system.

But it’s not enough just to know that this bias exists. If we want to be able to fix it, we need to understand the mechanics of how it arises in the first place.

How AI bias happens

We often shorthand our explanation of AI bias by blaming it on biased training data. The reality is more nuanced: bias can creep in long before the data is collected, as well as at many other stages of the deep-learning process. For the purposes of this discussion, we’ll focus on three key stages.

Framing the problem. The first thing computer scientists do when they create a deep-learning model is decide what they actually want it to achieve. A credit card company, for example, might want to predict a customer’s creditworthiness, but “creditworthiness” is a rather nebulous concept. In order to translate it into something that can be computed, the company must decide whether it wants to, say, maximize its profit margins or maximize the number of loans that get repaid. It could then define creditworthiness within the context of that goal. The problem is that “those decisions are made for various business reasons other than fairness or discrimination,” explains Solon Barocas, an assistant professor at Cornell University who specializes in fairness in machine learning. If the algorithm discovered that giving out subprime loans was an effective way to maximize profit, it would end up engaging in predatory behavior even if that wasn’t the company’s intention.
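
A toy sketch of that framing step: the same hypothetical loan records yield different labels depending on whether “creditworthiness” is operationalized as repayment or as profit. The column names and values below are invented:

```python
import pandas as pd

# Invented loan records; the columns exist only for illustration.
loans = pd.DataFrame({
    "repaid_in_full": [True, True, False, True],
    "profit_for_lender": [120.0, -40.0, 310.0, 15.0],  # interest and fees minus losses
})

# Two different ways to turn "creditworthiness" into a computable label.
label_if_goal_is_repayment = loans["repaid_in_full"].astype(int)
label_if_goal_is_profit = (loans["profit_for_lender"] > 0).astype(int)

# The same customer can look "good" under one framing and "bad" under the
# other; whichever label is chosen becomes the objective the model optimizes.
print((label_if_goal_is_repayment != label_if_goal_is_profit).sum())  # 2 of 4 differ
```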

Collecting the data. There are two main ways that bias shows up in training data: either the data you collect is unrepresentative of reality, or it reflects existing prejudices. The first case might occur, for example, if a deep-learning algorithm is fed more photos of light-skinned faces than dark-skinned faces. The resulting face recognition system would inevitably be worse at recognizing darker-skinned faces. The second case is precisely what happened when Amazon discovered that its internal recruiting tool was dismissing female candidates. Because it was trained on historical hiring decisions, which favored men over women, it learned to do the same.
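
The first failure mode (unrepresentative data) is often invisible in an aggregate metric and only shows up when performance is broken out by group, as in this small sketch with fabricated numbers:

```python
import pandas as pd

# Fabricated evaluation results for a face recognition model: 80 test images
# of lighter-skinned faces but only 20 of darker-skinned faces.
results = pd.DataFrame({
    "skin_tone": ["lighter"] * 80 + ["darker"] * 20,
    "correct":   [True] * 76 + [False] * 4 + [True] * 14 + [False] * 6,
})

print("overall accuracy:", results["correct"].mean())   # 0.90 looks respectable...
print(results.groupby("skin_tone")["correct"].mean())   # ...but it is 0.70 vs. 0.95
```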

Preparing the data. Finally, it is possible to introduce bias during the data preparation stage, which involves selecting which attributes you want the algorithm to consider. (This is not to be confused with the problem-framing stage. You can use the same attributes to train a model for very different goals or use very different attributes to train a model for the same goal.) In the case of modeling creditworthiness, an “attribute” could be the customer’s age, income, or number of paid-off loans. In the case of Amazon’s recruiting tool, an “attribute” could be the candidate’s gender, education level, or years of experience. This is what people often call the “art” of deep learning: choosing which attributes to consider or ignore can significantly influence your model’s prediction accuracy. But while its impact on accuracy is easy to measure, its impact on the model’s bias is not.
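
The attribute-selection step can be pictured as comparing models trained on different column subsets. In the hypothetical sketch below (invented applicant records, scikit-learn for the modeling), the effect on accuracy is easy to read off, while the effect on bias is not:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Invented applicant records; choosing which columns to feed the model is the
# data-preparation decision described above.
applicants = pd.DataFrame({
    "income":         [30, 80, 45, 90, 25, 70, 55, 60],
    "years_worked":   [2, 10, 4, 12, 1, 9, 5, 6],
    "paid_off_loans": [0, 3, 1, 4, 0, 2, 1, 2],
    "repaid":         [0, 1, 0, 1, 0, 1, 1, 1],
})

# Compare two attribute sets on cross-validated accuracy.
for attributes in (["income"], ["income", "years_worked", "paid_off_loans"]):
    score = cross_val_score(
        LogisticRegression(max_iter=1000),
        applicants[attributes], applicants["repaid"], cv=2,
    ).mean()
    print(attributes, round(score, 2))

# Accuracy differences between attribute sets are easy to measure; the effect
# of the same choice on the model's bias is not visible in these numbers.
```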

Why AI bias is hard to fix

Given that context, some of the challenges of mitigating bias may already be apparent to you. Here we highlight four main ones….(More)”

Fact-Based Policy: How Do State and Local Governments Accomplish It?


Report and Proposal by Justine Hastings: “Fact-based policy is essential to making government more effective and more efficient, and many states could benefit from more extensive use of data and evidence when making policy. Private companies have taken advantage of declining computing costs and vast data resources to solve problems in a fact-based way, but state and local governments have not made as much progress….

Drawing on her experience in Rhode Island, Hastings proposes that states build secure, comprehensive, integrated databases, and that they transform those databases into data lakes that are optimized for developing insights. Policymakers can then use the insights from this work to sharpen policy goals, create policy solutions, and measure progress against those goals. Policymakers, computer scientists, engineers, and economists will work together to build the data lake and analyze the data to generate policy insights….(More)”.
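
The “integrated database” step can be pictured as linking de-identified records across agencies on a shared key, so that questions spanning programs become simple queries. The pandas sketch below uses entirely fabricated tables and column names:

```python
import pandas as pd

# Entirely fabricated agency extracts, keyed on a shared de-identified person ID.
snap = pd.DataFrame({"person_id": ["a1", "b2", "c3"],
                     "months_on_snap": [6, 2, 12]})
wages = pd.DataFrame({"person_id": ["a1", "b2", "c3"],
                      "quarterly_wages": [4200, 9800, 1500]})

# Link records across programs on the shared key...
linked = snap.merge(wages, on="person_id", how="left")

# ...so that a cross-agency question (average reported wages for longer-term
# versus shorter-term benefit recipients) becomes a one-line query.
print(linked.groupby(linked["months_on_snap"] > 6)["quarterly_wages"].mean())
```

A production data lake surrounds this linkage step with the security and access controls Hastings emphasizes; the analytical pattern, though, is the same.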