Yahoo Releases the Largest-ever Machine Learning Dataset for Researchers


Suju Rajan at Yahoo Labs: “Data is the lifeblood of research in machine learning. However, access to truly large-scale datasets is a privilege that has been traditionally reserved for machine learning researchers and data scientists working at large companies – and out of reach for most academic researchers.

Research scientists at Yahoo Labs have long enjoyed working on large-scale machine learning problems inspired by consumer-facing products. This has enabled us to advance the thinking in areas such as search ranking, computational advertising, information retrieval, and core machine learning. A key aspect of interest to the external research community has been the application of new algorithms and methodologies to production traffic and to large-scale datasets gathered from real products.

Today, we are proud to announce the public release of the largest-ever machine learning dataset to the research community. The dataset stands at a massive ~110B events (13.5TB uncompressed) of anonymized user-news item interaction data, recorded from about 20M users between February 2015 and May 2015.

The Yahoo News Feed dataset is a collection based on a sample of anonymized user interactions on the news feeds of several Yahoo properties, including the Yahoo homepage, Yahoo News, Yahoo Sports, Yahoo Finance, Yahoo Movies, and Yahoo Real Estate.

Our goals are to promote independent research in the fields of large-scale machine learning and recommender systems, and to help level the playing field between industrial and academic research. The dataset is available as part of the Yahoo Labs Webscope data-sharing program, which is a reference library of scientifically-useful datasets comprising anonymized user data for non-commercial use.

In addition to the interaction data, we are providing categorized demographic information (age range, gender, and generalized geographic data) for a subset of the anonymized users. On the item side, we are releasing the title, summary, and key-phrases of the pertinent news article. The interaction data is timestamped with the relevant local time and also contains partial information about the device on which the user accessed the news feeds, which allows for interesting work in contextual recommendation and temporal data mining….(More)”
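
For readers who want a concrete picture of what interaction data of this general shape might look like, here is a minimal Python sketch of streaming records from a large compressed file. The field names and the gzipped JSON-lines layout are assumptions for illustration only, not the actual Webscope schema.

```python
# A minimal sketch of streaming user-news interaction records.
# Field names and file format are hypothetical, not the real Webscope schema.
import gzip
import json
from dataclasses import dataclass
from typing import Iterator, Optional

@dataclass
class Interaction:
    user_id: str            # anonymized user identifier
    item_id: str            # news item identifier
    timestamp: int          # local-time epoch seconds
    device: Optional[str]   # partial device information
    age_range: Optional[str]
    gender: Optional[str]
    geo: Optional[str]      # generalized geographic data

def stream_interactions(path: str) -> Iterator[Interaction]:
    """Lazily yield interaction records from a gzipped JSON-lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            rec = json.loads(line)
            yield Interaction(
                user_id=rec["user_id"],
                item_id=rec["item_id"],
                timestamp=rec["timestamp"],
                device=rec.get("device"),
                age_range=rec.get("age_range"),
                gender=rec.get("gender"),
                geo=rec.get("geo"),
            )
```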

Can crowdsourcing decipher the roots of armed conflict?


Stephanie Kanowitz at GCN: “Researchers at Pennsylvania State University and the University of Texas at Dallas are proving that there’s accuracy, not just safety, in numbers. The Correlates of War project, a long-standing effort that studies the history of warfare, is now experimenting with crowdsourcing as a way to more quickly and inexpensively create a global conflict database that could help explain when and why countries go to war.

The goal is to facilitate the collection, dissemination and use of reliable data in international relations, but a byproduct has emerged: the development of technology that uses machine learning and natural language processing to efficiently, cost-effectively and accurately create databases from news articles that detail militarized interstate disputes.

The project is in its fifth iteration, having released the fourth set of Militarized Interstate Dispute (MID) data in 2014. To create those earlier versions, researchers paid subject-matter experts such as political scientists to read and hand-code newswire articles about disputes, identifying features of possible militarized incidents. Now, however, they’re soliciting help from anyone and everyone — and finding the results are much the same as what the experts produced, except the results come in faster and with significantly less expense.

As news articles come across the wire, the researchers pull them and formulate questions about them that help evaluate the military events. Next, the articles and questions are loaded onto Amazon Mechanical Turk, a crowdsourcing marketplace. The project assigns articles to readers, who typically spend about 10 minutes reading an article and responding to the questions. The readers submit the answers to the project researchers, who review them. The project assigns the same article to multiple workers and uses computer algorithms to combine the data into one annotation.

A systematic comparison of the crowdsourced responses with those of trained subject-matter experts showed that the crowdsourced work was accurate for 68 percent of the news reports coded. More important, the aggregation of answers for each article showed that common answers from multiple readers strongly correlated with correct coding. This allowed researchers to easily flag the articles that required deeper expert involvement and process the majority of the news items in near-real time and at limited cost….(more)”
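
The aggregation step described above — combining multiple crowd answers into one annotation and flagging low-agreement articles for experts — can be sketched in a few lines of Python. The majority-vote rule and the agreement threshold below are illustrative assumptions, not the project's actual algorithm.

```python
# Combine crowd answers per question by majority vote; flag low agreement
# for expert review. Threshold and field names are illustrative only.
from collections import Counter
from typing import List, Tuple

def aggregate(answers: List[str], min_agreement: float = 0.7) -> Tuple[str, bool]:
    """Return (consensus_answer, needs_expert_review) for one question."""
    counts = Counter(answers)
    top_answer, top_count = counts.most_common(1)[0]
    agreement = top_count / len(answers)
    return top_answer, agreement < min_agreement

# Example: five readers answered "was force threatened?" for one article.
crowd_answers = ["yes", "yes", "no", "yes", "yes"]
consensus, flag = aggregate(crowd_answers)
print(consensus, "-> expert review needed:", flag)  # yes -> expert review needed: False
```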

Predictive Analytics


Revised book by Eric Siegel: “Prediction is powered by the world’s most potent, flourishing unnatural resource: data. Accumulated in large part as the by-product of routine tasks, data is the unsalted, flavorless residue deposited en masse as organizations churn away. Surprise! This heap of refuse is a gold mine. Big data embodies an extraordinary wealth of experience from which to learn.

Predictive analytics unleashes the power of data. With this technology, the computer literally learns from data how to predict the future behavior of individuals. Perfect prediction is not possible, but putting odds on the future drives millions of decisions more effectively, determining whom to call, mail, investigate, incarcerate, set up on a date, or medicate.

In this lucid, captivating introduction — now in its Revised and Updated edition — former Columbia University professor and Predictive Analytics World founder Eric Siegel reveals the power and perils of prediction:

    • What type of mortgage risk Chase Bank predicted before the recession.
    • Predicting which people will drop out of school, cancel a subscription, or get divorced before they even know it themselves.
    • Why early retirement predicts a shorter life expectancy and vegetarians miss fewer flights.
    • Five reasons why organizations predict death — including one health insurance company.
    • How U.S. Bank and Obama for America calculated — and Hillary for America 2016 plans to calculate — the way to most strongly persuade each individual.
    • Why the NSA wants all your data: machine learning supercomputers to fight terrorism.
    • How IBM’s Watson computer used predictive modeling to answer questions and beat the human champs on TV’s Jeopardy!
    • How companies ascertain untold, private truths — how Target figures out you’re pregnant and Hewlett-Packard deduces you’re about to quit your job.
    • How judges and parole boards rely on crime-predicting computers to decide how long convicts remain in prison.
    • 183 examples from Airbnb, the BBC, Citibank, ConEd, Facebook, Ford, Google, the IRS, LinkedIn, Match.com, MTV, Netflix, PayPal, Pfizer, Spotify, Uber, UPS, Wikipedia, and more….(More)”

 

Daedalus Issue on “The Internet”


Press release: “Thirty years ago, the Internet was a network that primarily delivered email among academic and government employees. Today, it is rapidly evolving into a control system for our physical environment through the Internet of Things, as mobile and wearable technology more tightly integrate the Internet into our everyday lives.

How will the future Internet be shaped by the design choices that we are making today? Could the Internet evolve into a fundamentally different platform than the one to which we have grown accustomed? As an alternative to big data, what would it mean to make ubiquitously collected data safely available to individuals as small data? How could we attain both security and privacy in the face of trends that seem to offer neither? And what role do public institutions, such as libraries, have in an environment that becomes more privatized by the day?

These are some of the questions addressed in the Winter 2016 issue of Daedalus on “The Internet.”  As guest editors David D. Clark (Senior Research Scientist at the MIT Computer Science and Artificial Intelligence Laboratory) and Yochai Benkler (Berkman Professor of Entrepreneurial Legal Studies at Harvard Law School and Faculty Co-Director of the Berkman Center for Internet and Society at Harvard University) have observed, the Internet “has become increasingly privately owned, commercial, productive, creative, and dangerous.”

Some of the themes explored in the issue include:

  • The conflicts that emerge among governments, corporate stakeholders, and Internet users through choices that are made in the design of the Internet
  • The challenges—including those of privacy and security—that materialize in the evolution from fixed terminals to ubiquitous computing
  • The role of public institutions in shaping the Internet’s privately owned open spaces
  • The ownership and security of data used for automatic control of connected devices, and
  • Consumer demand for “free” services—developed and supported through the sale of user data to advertisers….

Essays in the Winter 2016 issue of Daedalus include:

  • The Contingent Internet by David D. Clark (MIT)
  • Degrees of Freedom, Dimensions of Power by Yochai Benkler (Harvard Law School)
  • Edge Networks and Devices for the Internet of Things by Peter T. Kirstein (University College London)
  • Reassembling Our Digital Selves by Deborah Estrin (Cornell Tech and Weill Cornell Medical College) and Ari Juels (Cornell Tech)
  • Choices: Privacy and Surveillance in a Once and Future Internet by Susan Landau (Worcester Polytechnic Institute)
  • As Pirates Become CEOs: The Closing of the Open Internet by Zeynep Tufekci (University of North Carolina at Chapel Hill)
  • Design Choices for Libraries in the Digital-Plus Era by John Palfrey (Phillips Academy)…(More)

See also: Introduction

Big Data Analysis: New Algorithms for a New Society


Book edited by Nathalie Japkowicz and Jerzy Stefanowski: “This edited volume is devoted to Big Data Analysis from a Machine Learning standpoint as presented by some of the most eminent researchers in this area.

It demonstrates that Big Data Analysis opens up new research problems which were either never considered before, or were only considered within a limited range. In addition to providing methodological discussions on the principles of mining Big Data and the difference between traditional statistical data analysis and newer computing frameworks, this book presents recently developed algorithms affecting such areas as business, financial forecasting, human mobility, the Internet of Things, information networks, bioinformatics, medical systems and life science. It explores, through a number of specific examples, how the study of Big Data Analysis has evolved and how it has started and will most likely continue to affect society. While the benefits brought about by Big Data Analysis are underlined, the book also discusses some of the warnings that have been issued concerning the potential dangers of Big Data Analysis along with its pitfalls and challenges….(More)”

OpenAI won’t benefit humanity without data-sharing


 at the Guardian: “There is a common misconception about what drives the digital-intelligence revolution. People seem to have the idea that artificial intelligence researchers are directly programming an intelligence; telling it what to do and how to react. There is also the belief that when we interact with this intelligence we are processed by an “algorithm” – one that is subject to the whims of the designer and encodes his or her prejudices.

OpenAI, a new non-profit artificial intelligence company that was founded on Friday, wants to develop digital intelligence that will benefit humanity. By sharing its sentient algorithms with all, the venture, backed by a host of Silicon Valley billionaires, including Elon Musk and Peter Thiel, wants to avoid the existential risks associated with the technology.

OpenAI’s launch announcement was timed to coincide with this year’s Neural Information Processing Systems conference: the main academic outlet for scientific advances in machine learning, which I chaired. Machine learning is the technology that underpins the new generation of AI breakthroughs.

One of OpenAI’s main ideas is to collaborate openly, publishing code and papers. This is admirable and the wider community is already excited by what the company could achieve.

OpenAI is not the first company to target digital intelligence, and certainly not the first to publish code and papers. Both Facebook and Google have already shared code. They were also present at the same conference. All three companies hosted parties with open bars, aiming to entice the latest and brightest minds.

However, the way machine learning works means that making algorithms available isn’t necessarily as useful as one might think. A machine-learning algorithm is subtly different from popular perception.

Just as in baking we don’t have control over how the cake will emerge from the oven, in machine learning we don’t control every decision that the computer will make. In machine learning the quality of the ingredients, the quality of the data provided, has a massive impact on the intelligence that is produced.

For intelligent decision-making the recipe needs to be carefully applied to the data: this is the process we refer to as learning. The result is the combination of our data and the recipe. We need both to make predictions.

By sharing their algorithms, Facebook and Google are merely sharing the recipe. Someone has to provide the eggs and flour and provide the baking facilities (which in Google and Facebook’s case are vast data-computation facilities, often located near hydroelectric power stations for cheaper electricity).
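
A toy sketch makes the point concrete: the same learning algorithm (the shared “recipe”), fitted on two different datasets (the “ingredients”), yields two different models and different predictions. The data below is synthetic and purely illustrative.

```python
# The same algorithm trained on different data gives different predictions:
# sharing the recipe is not the same as sharing the ingredients.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(shift: float, n: int = 200):
    """Two-class data whose decision boundary depends on `shift`."""
    X = rng.normal(size=(n, 2))
    y = (X[:, 0] + shift * X[:, 1] > 0).astype(int)
    return X, y

X_a, y_a = make_data(shift=1.0)    # one organisation's data
X_b, y_b = make_data(shift=-1.0)   # another organisation's data

model_a = LogisticRegression().fit(X_a, y_a)
model_b = LogisticRegression().fit(X_b, y_b)

x_new = np.array([[0.0, 1.0]])
print(model_a.predict(x_new), model_b.predict(x_new))  # the two models disagree on the same input
```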

So even before it starts, an open question for OpenAI is: how will it ensure access to data on the necessary scale to make progress?…(More)”

China’s Biggest Polluters Face Wrath of Data-Wielding Citizens


Bloomberg News: “Besides facing hefty fines, criminal punishments and the possibility of closing, the worst emitters in China risk additional public anger as new smartphone applications and lower-cost monitoring devices widen access to data on pollution sources.

The Blue Map app, developed by the Institute of Public & Environmental Affairs with support from the SEE Foundation and the Alibaba Foundation, provides pollution data from more than 3,000 large coal-power, steel, cement and petrochemical production plants. In July, Origins Technology Ltd. began selling the Laser Egg, a palm-sized monitor that tracks indoor and outdoor air quality by measuring fine particulate matter.

“Letting people know the sources of regional pollution will help the push for control over emissions of every chimney,” said Ma Jun, the founder and director of the Beijing-based IPE.

The phone map and Laser Egg are the latest levers in prying control over information on air quality from the hands of the few to the many, and they’re beginning to weigh on how officials respond to the issue. Numerous smartphone applications, including those developed by SINA Corp. and Moji Fengyun (Beijing) Software Technology Development Co., now provide people in China with real-time access to air quality readings, essentially democratizing what was once an information pipeline available only to the government.

“China’s continuing struggle to control and reduce air pollution exemplifies the government’s fear that lifestyle issues will mutate into demands for political change,” said Mary Gallagher, an associate professor of political science at the University of Michigan.

Even the government is getting in on the act. The Ministry of Environmental Protection rolled out a smartphone application called “Nationwide Air Quality” with the help of Wuhan Juzheng Environmental Science & Technology Co. at the end of 2013.

“As citizens know more about air pollution, more pressure will be put on the government,” said Xu Qinxiang, a technology manager at Wuhan Juzheng. “This will urge the government to control pollutant sources and upgrade heavy industries.”

 Laser Egg

Sources of air quality data come from the China National Environment Monitoring Center, local environmental protection bureaus and non-Chinese sources such as the U.S. Embassy’s website in Beijing, Xu said.

Air quality is a controversial subject in China. Since 2012, the public has pushed the government to move more quickly than planned to begin releasing data measuring pollution levels — especially of PM2.5, the particulates most harmful to human health.

The reading was 267 micrograms per cubic meter at 10 a.m. Monday near Tiananmen Square, according to the Beijing Municipal Environmental Monitoring Center. The World Health Organization cautions against 24-hour exposure to concentrations higher than 25.
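
For scale, a quick back-of-the-envelope comparison of that reading against the WHO guideline cited above; the reading and guideline values come from the article, and the helper function is just a sketch.

```python
# Compare a PM2.5 reading (from the article) with the WHO 24-hour guideline.
WHO_24H_PM25_GUIDELINE = 25.0   # micrograms per cubic meter

def exceedance(reading_ug_m3: float, guideline: float = WHO_24H_PM25_GUIDELINE) -> float:
    """How many times over the guideline a reading is."""
    return reading_ug_m3 / guideline

print(f"{exceedance(267):.1f}x the WHO 24-hour guideline")  # about 10.7x
```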

The availability of data appears to be filling a need, especially with the arrival of colder temperatures and the associated smog that blanketed Beijing and northern China recently….

“With more disclosure of the data, everyone becomes more sensitive, hoping the government can do something,” Li Yajuan, a 27-year-old office secretary, said in an interview in Beijing’s Fuchengmen area. “It’s our own living environment after all.”

Efforts to make products linked to air data continue. IBM has been developing artificial intelligence to help fight Beijing’s toxic air pollution, and plans to work with other municipalities in China and India on similar projects to manage air quality….(More)”

Decoding the Future for National Security


George I. Seffers at Signal: “U.S. intelligence agencies are in the business of predicting the future, but no one has systematically evaluated the accuracy of those predictions—until now. The intelligence community’s cutting-edge research and development agency uses a handful of predictive analytics programs to measure and improve the ability to forecast major events, including political upheavals, disease outbreaks, insider threats and cyber attacks.

The Office for Anticipating Surprise at the Intelligence Advanced Research Projects Activity (IARPA) is a place where crystal balls come in the form of software, tournaments and throngs of people. The office sponsors eight programs designed to improve predictive analytics, which uses a variety of data to forecast events. The programs all focus on incidents outside of the United States, and the information is anonymized to protect privacy. The programs are in different stages, some having recently ended as others are preparing to award contracts.

But they all have one more thing in common: They use tournaments to advance the state of the predictive analytic arts. “We decided to run a series of forecasting tournaments in which people from around the world generate forecasts about, now, thousands of real-world events,” says Jason Matheny, IARPA’s new director. “All of our programs on predictive analytics do use this tournament style of funding and evaluating research.” The Open Source Indicators program used a crowdsourcing technique in which people across the globe offered their predictions on such events as political uprisings, disease outbreaks and elections.

The data analyzed included social media trends, Web search queries and even cancelled dinner reservations—an indication that people are sick. “The methods applied to this were all automated. They used machine learning to comb through billions of pieces of data to look for that signal, that leading indicator, that an event was about to happen,” Matheny explains. “And they made amazing progress. They were able to predict disease outbreaks weeks earlier than traditional reporting.”

The recently completed Aggregative Contingent Estimation (ACE) program also used a crowdsourcing competition in which people predicted events, including whether weapons would be tested, treaties would be signed or armed conflict would break out along certain borders. Volunteers were asked to provide information about their own background and what sources they used. IARPA also tested participants’ cognitive reasoning abilities. Volunteers provided their forecasts every day, and IARPA personnel kept score.

Interestingly, they discovered the “deep domain” experts were not the best at predicting events. Instead, people with a certain style of thinking came out the winners. “They read a lot, not just from one source, but from multiple sources that come from different viewpoints. They have different sources of data, and they revise their judgments when presented with new information. They don’t stick to their guns,” Matheny reveals. …
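
As a rough illustration of how such a tournament might “keep score,” here is a minimal sketch that scores probabilistic forecasts against outcomes with a Brier score and aggregates a crowd with a simple unweighted average. The article does not specify IARPA's exact metric or aggregation rule, so both are assumptions.

```python
# Score probabilistic forecasts with a Brier score (lower is better) and
# aggregate a crowd by unweighted averaging. Metric choice is an assumption.
from typing import List

def brier_score(forecast_prob: float, outcome: int) -> float:
    """Squared error between a probability forecast and a 0/1 outcome."""
    return (forecast_prob - outcome) ** 2

def crowd_forecast(individual_probs: List[float]) -> float:
    """Naive aggregation: the unweighted mean of individual forecasts."""
    return sum(individual_probs) / len(individual_probs)

# Three forecasters on "will the treaty be signed this quarter?"
probs = [0.8, 0.6, 0.9]
outcome = 1  # the treaty was signed
print(brier_score(crowd_forecast(probs), outcome))  # about 0.054
```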

The ACE research also contributed to a recently released book, Superforecasting: The Art and Science of Prediction, according to the IARPA director. The book was co-authored, along with Dan Gardner, by Philip Tetlock, the Annenberg University professor of psychology and management at the University of Pennsylvania who also served as a principal investigator for the ACE program. Like ACE, the Crowdsourcing Evidence, Argumentation, Thinking and Evaluation program uses the forecasting tournament format, but it also requires participants to explain and defend their reasoning. The initiative aims to improve analytic thinking by combining structured reasoning techniques with crowdsourcing.

Meanwhile, the Foresight and Understanding from Scientific Exposition (FUSE) program forecasts science and technology breakthroughs….(More)”

Artificial Intelligence Aims to Make Wikipedia Friendlier and Better


Tom Simonite in MIT Technology Review: “Software trained to know the difference between an honest mistake and intentional vandalism is being rolled out in an effort to make editing Wikipedia less psychologically bruising. It was developed by the Wikimedia Foundation, the nonprofit organization that supports Wikipedia.

One motivation for the project is a significant decline in the number of people considered active contributors to the flagship English-language Wikipedia: it has fallen by 40 percent over the past eight years, to about 30,000. Research indicates that the problem is rooted in Wikipedians’ complex bureaucracy and their often hard-line responses to newcomers’ mistakes, enabled by semi-automated tools that make deleting new changes easy (see “The Decline of Wikipedia”).

Aaron Halfaker, a senior research scientist at Wikimedia Foundation who helped diagnose that problem, is now leading the project trying to fight it, which relies on algorithms with a sense for human fallibility. His ORES system, for “Objective Revision Evaluation Service,” can be trained to score the quality of new changes to Wikipedia and judge whether an edit was made in good faith or not….

ORES can allow editing tools to direct people to review the most damaging changes. The software can also help editors treat rookie or innocent mistakes more appropriately, says Halfaker. “I suspect the aggressive behavior of Wikipedians doing quality control is because they’re making judgments really fast and they’re not encouraged to have a human interaction with the person,” he says. “This enables a tool to say, ‘If you’re going to revert this, maybe you should be careful and send the person who made the edit a message.’”
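
A minimal sketch of how an editing tool might act on ORES-style scores follows. The two probabilities (damaging, good faith) mirror the judgments described above; the thresholds and triage categories are illustrative assumptions, not Wikimedia's actual policy.

```python
# Triage edits using ORES-style scores. Thresholds and categories are
# illustrative assumptions, not Wikimedia policy.
from typing import NamedTuple

class EditScore(NamedTuple):
    damaging: float   # probability the edit damages the article
    goodfaith: float  # probability the edit was made in good faith

def triage(score: EditScore) -> str:
    if score.damaging > 0.8 and score.goodfaith < 0.3:
        return "queue for quick revert (likely vandalism)"
    if score.damaging > 0.5:
        return "revert, but send the editor a friendly explanation"
    if score.damaging > 0.2:
        return "flag for human review"
    return "accept"

print(triage(EditScore(damaging=0.65, goodfaith=0.9)))
# -> revert, but send the editor a friendly explanation
```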

…Earlier efforts to make Wikipedia more welcoming to newcomers have been stymied by the very community that’s supposed to benefit. Wikipedians rose up in 2013 when Wikimedia made a word-processor-style editing interface the default, forcing the foundation to make it opt-in instead. To this day, the default editor uses a complicated markup language called Wikitext…(More)”

Robots Will Make Leeds the First Self-Repairing City


Emiko Jozuka at Motherboard: “Researchers in Britain want to make the first “self-repairing” city by 2035. How will they do this? By creating autonomous repair robots that patrol the streets and drainage systems, making sure your car doesn’t dip into a pothole, and that you don’t experience any gas leaks.

“The idea is to create a city that behaves almost like a living organism,” said Raul Fuentes, a researcher at the School of Civil Engineering at Leeds University, who is working on the project. “The robots will act like white cells that are able to identify bacteria or viruses and attack them. It’s kind of like an immune system.”

The £4.2 million ($6.4 million) national infrastructure project is a collaboration with Leeds City Council and the UK Collaboration for Research in Infrastructures and Cities (UKCRIC). The aim is to create, by 2035, a fleet of robot repair workers that live in the city of Leeds, spot problems, and sort them out before they become bigger ones, said Fuentes. The project is set to launch officially in January 2016, he added.

For their five-year project—which has a vision that extends until 2050—the researchers will develop robot designs and technologies that focus on three main areas. The first is to create drones that can perch on high structures and repair things like street lamps; the second is to develop drones that can autonomously spot when a pothole is about to form and zone in and patch that up before it worsens; and the third is to develop robots that will live in utility pipes so they can inspect, repair, and report back to humans when they spot an issue.

“The robots will be living permanently in the city, and they’ll be able to identify issues before they become real problems,” explained Fuentes. The researchers are working on making the robots autonomous, and want them to be living in swarms or packs where they can communicate with one another on how best they could get the repair job done….(More)