A computational algorithm for fact-checking


Kurzweil News: “Computers can now do fact-checking for any body of knowledge, according to Indiana University network scientists, writing in an open-access paper published June 17 in PLoS ONE.

Using factual information from summary infoboxes from Wikipedia* as a source, they built a “knowledge graph” with 3 million concepts and 23 million links between them. A link between two concepts in the graph can be read as a simple factual statement, such as “Socrates is a person” or “Paris is the capital of France.”

In the first use of this method, IU scientists created a simple computational fact-checker that assigns “truth scores” to statements concerning history, geography and entertainment, as well as random statements drawn from the text of Wikipedia. In multiple experiments, the automated system consistently matched the assessment of human fact-checkers in terms of the humans’ certitude about the accuracy of these statements.

Dealing with misinformation and disinformation

In what the IU scientists describe as an “automatic game of trivia,” the team applied their algorithm to answer simple questions related to geography, history, and entertainment, including statements that matched states or nations with their capitals, presidents with their spouses, and Oscar-winning film directors with the movie for which they won the Best Picture award. The majority of tests returned highly accurate truth scores.

Lastly, the scientists used the algorithm to fact-check excerpts from the main text of Wikipedia, which were previously labeled by human fact-checkers as true or false, and found a positive correlation between the truth scores produced by the algorithm and the answers provided by the fact-checkers.

Significantly, the IU team found their computational method could even assess the truthfulness of statements about information not directly contained in the infoboxes. For example, the system correctly assessed the fact that Steve Tesich — the Serbian-American screenwriter of the classic Hoosier film “Breaking Away” — graduated from IU, even though this information is not specifically included in the infobox about him.

Using multiple sources to improve accuracy and richness of data

“The measurement of the truthfulness of statements appears to rely strongly on indirect connections, or ‘paths,’ between concepts,” said Giovanni Luca Ciampaglia, a postdoctoral fellow at the Center for Complex Networks and Systems Research in the IU Bloomington School of Informatics and Computing, who led the study….

“These results are encouraging and exciting. We live in an age of information overload, including abundant misinformation, unsubstantiated rumors and conspiracy theories whose volume threatens to overwhelm journalists and the public. Our experiments point to methods to abstract the vital and complex human task of fact-checking into a network analysis problem, which is easy to solve computationally.”
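
For readers curious what a path-based truth score might look like in practice, here is a minimal sketch over a toy knowledge graph. It only illustrates the general idea described above (rewarding short paths and penalizing detours through very general, high-degree concepts) and is not the exact metric from the PLoS ONE paper; the function name and example graph are invented.

```python
# A simplified path-based "truth score" over a toy knowledge graph.
# Not the authors' exact metric: it just rewards direct or specific connections
# and penalizes paths that pass through hub concepts linked to almost everything.
import math
import networkx as nx

def truth_score(graph: nx.Graph, subject: str, obj: str, cutoff: int = 4) -> float:
    """Score a claimed statement 'subject -- obj' between 0 and 1."""
    if graph.has_edge(subject, obj):           # directly stated facts score 1.0
        return 1.0
    best = 0.0
    for path in nx.all_simple_paths(graph, subject, obj, cutoff=cutoff):
        # Penalize intermediate nodes by their degree: very general concepts
        # connect many things and therefore carry little specific evidence.
        penalty = 1.0 + sum(math.log(graph.degree(v)) for v in path[1:-1])
        best = max(best, 1.0 / penalty)
    return best

# Toy graph built from infobox-style statements (hypothetical example data).
g = nx.Graph()
g.add_edges_from([
    ("Paris", "France"), ("France", "Europe"),
    ("Rome", "Italy"), ("Italy", "Europe"),
])
print(truth_score(g, "Paris", "France"))   # direct link -> 1.0
print(truth_score(g, "Paris", "Italy"))    # only an indirect, generic path -> lower score (about 0.42)
```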

Expanding the knowledge base

Although the experiments were conducted using Wikipedia, the IU team’s method does not assume any particular source of knowledge. The scientists aim to conduct additional experiments using knowledge graphs built from other sources of human knowledge, such as Freebase, the open-knowledge base built by Google, and note that multiple information sources could be used together to account for different belief systems….(More)”

‘Beating the news’ with EMBERS: Forecasting Civil Unrest using Open Source Indicators


Paper by Naren Ramakrishnan et al: “We describe the design, implementation, and evaluation of EMBERS, an automated, 24×7 continuous system for forecasting civil unrest across 10 countries of Latin America using open source indicators such as tweets, news sources, blogs, economic indicators, and other data sources. Unlike retrospective studies, EMBERS has been making forecasts into the future since Nov 2012 which have been (and continue to be) evaluated by an independent T&E team (MITRE). Of note, EMBERS has successfully forecast the uptick and downtick of incidents during the June 2013 protests in Brazil. We outline the system architecture of EMBERS, individual models that leverage specific data sources, and a fusion and suppression engine that supports trading off specific evaluation criteria. EMBERS also provides an audit trail interface that enables the investigation of why specific predictions were made along with the data utilized for forecasting. Through numerous evaluations, we demonstrate the superiority of EMBERS over baserate methods and its capability to forecast significant societal happenings….(More)”
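
As a rough illustration of the fusion-and-suppression idea described in the abstract (not the actual EMBERS engine), one can imagine each per-source model emitting probabilistic alerts that are merged and then thresholded, with the threshold acting as the knob that trades off evaluation criteria such as precision and recall. Model names and probabilities below are invented.

```python
# A minimal sketch of fusing alerts from several per-source models and
# suppressing low-confidence ones; purely illustrative, not the EMBERS system.
def fuse_alerts(alerts_by_model, threshold=0.6):
    """alerts_by_model: {model_name: [(date, city, probability), ...]}"""
    fused = {}
    for model, alerts in alerts_by_model.items():
        for date, city, prob in alerts:
            key = (date, city)
            # Keep the most confident forecast for each (date, city) pair and
            # remember which model produced it (a crude audit trail).
            if key not in fused or prob > fused[key][0]:
                fused[key] = (prob, model)
    # Suppression: drop forecasts below the confidence threshold.
    return {k: v for k, v in fused.items() if v[0] >= threshold}

alerts = {
    "twitter_model": [("2013-06-17", "Sao Paulo", 0.82), ("2013-06-18", "Lima", 0.40)],
    "news_model":    [("2013-06-17", "Sao Paulo", 0.65)],
}
print(fuse_alerts(alerts))   # {('2013-06-17', 'Sao Paulo'): (0.82, 'twitter_model')}
```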

Big Data’s Impact on Public Transportation


InnovationEnterprise: “Getting around any big city can be a real pain. Traffic jams seem to be a constant complaint, and simply getting to work can turn into a chore, even on the best of days. With more people than ever before flocking to the world’s major metropolitan areas, the issues of crowding and inefficient transportation only stand to get much worse. Luckily, the traditional methods of managing public transportation could be on the verge of changing thanks to advances in big data. While big data use cases have been a part of the business world for years now, city planners and transportation experts are quickly realizing how valuable it can be when making improvements to city transportation. That hour-long commute may no longer be something travelers will have to worry about in the future.

In much the same way that big data has transformed businesses around the world by offering greater insight into the behavior of their customers, it can also provide a deeper look at travellers. Like retail customers, commuters have certain patterns they like to keep to when on the road or riding the rails. Travellers also have their own motivations and desires, and getting to the heart of their actions is all part of what big data analytics is about. By analyzing these actions and the factors that go into them, transportation experts can gain a better understanding of why people choose certain routes or why they prefer one method of transportation over another. Based on these findings, planners can then figure out where to focus their efforts and respond to the needs of millions of commuters.

Gathering the accurate data needed to make knowledgeable decisions regarding city transportation can be a challenge in itself, especially considering how many people commute to work in a major city. New methods of data collection have made that effort easier and a lot less costly. One way that’s been implemented is through the gathering of call data records (CDR). From regular transactions made from mobile devices, information about location, time, and duration of an action (like a phone call) can give data scientists the necessary details on where people are traveling to, how long it takes them to get to their destination, and other useful statistics. The valuable part of this data is the sample size, which provides a much bigger picture of the transportation patterns of travellers.
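
A toy sketch of how origin-destination patterns might be distilled from CDRs is shown below. The record layout and the simple morning/evening heuristic are assumptions for illustration only, not a description of any particular city's pipeline.

```python
# Derive a crude origin-destination count from call detail records (CDRs).
# Assumed record format: (user_id, timestamp, cell_tower_id).
from collections import Counter
from datetime import datetime

cdrs = [
    ("u1", "2015-06-01 08:10", "tower_A"), ("u1", "2015-06-01 18:40", "tower_B"),
    ("u2", "2015-06-01 07:55", "tower_A"), ("u2", "2015-06-01 19:05", "tower_C"),
]

def od_matrix(records):
    """Count commuters per (morning tower, evening tower) pair."""
    towers_by_user = {}
    for user, ts, tower in records:
        hour = datetime.strptime(ts, "%Y-%m-%d %H:%M").hour
        slot = "am" if hour < 12 else "pm"
        towers_by_user.setdefault(user, {})[slot] = tower
    return Counter(
        (v["am"], v["pm"]) for v in towers_by_user.values() if {"am", "pm"} <= v.keys()
    )

print(od_matrix(cdrs))   # Counter({('tower_A', 'tower_B'): 1, ('tower_A', 'tower_C'): 1})
```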

That’s not the only way cities are using big data to improve public transportation though. Melbourne in Australia has long been considered one of the world’s best cities for public transit, and much of that is thanks to big data. With big data and ad hoc analysis, Melbourne’s acclaimed tram system can automatically reconfigure routes in response to sudden problems or challenges, such as a major city event or natural disaster. Data is also used in this system to fix problems before they turn serious. Sensors located in equipment like tram cars and tracks can detect when maintenance is needed on a specific part. Crews are quickly dispatched to repair what needs fixing, and the tram system continues to run smoothly. This is similar to the idea of the Internet of Things, wherein embedded sensors collect data that is then analyzed to identify problems and improve efficiency.
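
The maintenance side of such a system can be pictured as a simple rule running over streaming sensor readings. The thresholds and field names below are invented for illustration and are not Melbourne's actual logic.

```python
# A minimal illustration of flagging equipment for maintenance from sensor
# readings before a fault becomes serious (assumed thresholds and fields).
def maintenance_alerts(readings, vibration_limit=7.0, temp_limit=85.0):
    """readings: iterable of dicts like {'tram': 'T12', 'vibration': 6.1, 'temp_c': 70}"""
    for r in readings:
        if r["vibration"] > vibration_limit or r["temp_c"] > temp_limit:
            yield f"Dispatch crew to {r['tram']}: vibration={r['vibration']}, temp={r['temp_c']}C"

stream = [
    {"tram": "T12", "vibration": 6.1, "temp_c": 70},
    {"tram": "T07", "vibration": 8.4, "temp_c": 72},   # worn bearing, flagged early
]
print(list(maintenance_alerts(stream)))
```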

Sao Paulo, Brazil, is another city that sees the value of using big data for its public transportation. The city’s efforts concentrate on improving the management of its bus fleet. With big data collected in real time, the city can get a more accurate picture of just how many people are riding the buses, which routes are on time, how drivers respond to changing conditions, and many other factors. Based on this information, Sao Paulo can optimize its operations, providing added vehicles where demand is genuine whilst finding which routes are the most efficient. Without big data analytics, this process would have taken a very long time and would likely be hit-or-miss in terms of accuracy, but now, big data provides more certainty in a shorter amount of time….(More)”

The Climatologist’s Almanac


Clara Chaisson at onEarth: “Forget your weather app with its five- or even ten-day forecasts—a supercomputer at NASA has just provided us with high-resolution climate projections through the end of the century. The massive new 11-terabyte data set combines historical daily temperatures and precipitation measurements with climate simulations under two greenhouse gas emissions scenarios. The project spans from 1950 to 2100, but users can easily zero in on daily timescales for their own locales—which is precisely the point.
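
For a sense of what “zeroing in” on a locale might look like, here is a hedged sketch using xarray. The file name, variable name, and coordinates are placeholders (the archive's exact layout may differ), but gridded climate projections of this kind are commonly distributed as NetCDF and are convenient to slice this way.

```python
# Sketch: pull daily values for one location and one year out of a gridded
# projection file. "tasmax_projection_2050.nc" is a hypothetical file name.
import xarray as xr

ds = xr.open_dataset("tasmax_projection_2050.nc")
# Pick the grid cell nearest a chosen locale, e.g. Bloomington, IN (lon in 0-360 convention).
local = ds["tasmax"].sel(lat=39.17, lon=273.47, method="nearest")
daily_2050 = local.sel(time=slice("2050-01-01", "2050-12-31"))
print(daily_2050.mean().values)   # average projected daily maximum temperature for that year
```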

The projections can be found for free on Amazon Web Services, for all to see and plan by. The space agency hopes that developing nations and poorer communities that may not have any spare supercomputers lying around will use the info to predict and prepare for climate change. …(More)”

Field experimenting in economics: Lessons learned for public policy


Robert Metcalfe at OUP Blog: “Do neighbourhoods matter to outcomes? Which classroom interventions improve educational attainment? How should we raise money to provide important and valued public goods? Do energy prices affect energy demand? How can we motivate people to become healthier, greener, and more cooperative? These are some of the most challenging questions policy-makers face. Academics have been trying to understand and uncover these important relationships for decades.

Many of the empirical tools available to economists to answer these questions do not allow causal relationships to be detected. Field experiments represent a relatively new methodological approach capable of measuring the causal links between variables. By overlaying carefully designed experimental treatments on real people performing tasks common to their daily lives, economists are able to answer interesting and policy-relevant questions that were previously intractable. Manipulation of market environments allows these economists to uncover the hidden motivations behind economic behaviour more generally. A central tenet of field experiments in the policy world is that governments should understand the actual behavioural responses of their citizens to changes in policies or interventions.
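
The core logic of such an experiment can be boiled down to a few lines: randomize a treatment over real participants, then compare average outcomes across the two groups. The sketch below uses entirely invented data and a made-up energy-saving intervention purely to show the calculation.

```python
# A stylized field experiment: random assignment followed by a simple
# difference-in-means estimate of the treatment effect. Data are simulated.
import random
random.seed(0)

households = list(range(1000))
treated = set(random.sample(households, 500))          # random assignment

def energy_use(h):
    # Hypothetical outcome: treated households receive a letter that nudges
    # consumption down slightly, plus ordinary household-level noise.
    base = 30 + random.gauss(0, 5)
    return base - (2 if h in treated else 0)

outcomes = {h: energy_use(h) for h in households}
avg = lambda group: sum(outcomes[h] for h in group) / len(group)
effect = avg(treated) - avg(set(households) - treated)
print(f"Estimated treatment effect: {effect:.2f} kWh")  # roughly -2 by construction
```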

Field experiments represent a departure from laboratory experiments. Traditionally, laboratory experiments create experimental settings with tight control over the decision environment of undergraduate students. While these studies also allow researchers to make causal statements, policy-makers are often concerned that subjects in these experiments may behave differently in settings where they know they are being observed or when they are permitted to sort out of the market.

For example, you might expect a college student to contribute more to charity when she is scrutinized in a professor’s lab than when she can avoid the ask altogether. Field experiments allow researchers to make these causal statements in a setting that is more generalizable to the behaviour policy-makers are directly interested in.

To date, policy-makers have traditionally gathered relevant information and data by using focus groups, qualitative evidence, or observational data without a way to identify causal mechanisms. It is quite easy to elicit people’s intentions about how they behave with respect to a new policy or intervention, but there is increasing evidence that people’s intentions are a poor guide to predicting their behaviour.

However, we are starting to see a small change in how governments seek to answer pertinent questions. For instance, the UK tax office (Her Majesty’s Revenue and Customs) now uses field experiments across some of its services to improve the efficacy of scarce taxpayers’ money. In the US, there are movements toward gathering more evidence from field experiments.

In the corporate world, experimenting is not new. Many of the current large online companies—such as Amazon, Facebook, Google, and Microsoft—are constantly using field experiments matched with big data to improve their products and deliver better services to their customers. More and more companies will use field experiments over time to help them better set prices, tailor advertising, provide a better customer journey to increase welfare, and employ more productive workers…(More).

See also Field Experiments in the Developed World: An Introduction (Oxford Review of Economic Policy)

Signal: Understanding What Matters in a World of Noise


Book by Stephen Few: “In this age of so-called Big Data, organizations are scrambling to implement new software and hardware to increase the amount of data they collect and store. However, in doing so they are unwittingly making it harder to find the needles of useful information in the rapidly growing mounds of hay. If you don’t know how to differentiate signals from noise, adding more noise only makes things worse. When we rely on data for making decisions, how do we tell what qualifies as a signal and what is merely noise? In and of itself, data is neither. Assuming that data is accurate, it is merely a collection of facts. When a fact is true and useful, only then is it a signal. When it’s not, it’s noise. It’s that simple. In Signal, Stephen Few provides the straightforward, practical instruction in everyday signal detection that has been lacking until now. Using data visualization methods, he teaches how to apply statistics to gain a comprehensive understanding of one’s data and adapts the techniques of Statistical Process Control in new ways to detect not just changes in the metrics but also changes in the patterns that characterize data…(More)”
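
A generic example of the control-chart style of signal detection the book draws on (not the specific procedure Few teaches) is to flag any new value that falls outside the historical mean plus or minus three standard deviations, treating everything inside those limits as routine noise.

```python
# A simple control-chart check: values outside mean +/- 3 standard deviations
# of the historical baseline are treated as potential signals.
from statistics import mean, stdev

history = [102, 98, 101, 99, 103, 100, 97, 102, 99, 101]   # baseline metric values
center, sigma = mean(history), stdev(history)
upper, lower = center + 3 * sigma, center - 3 * sigma

for value in [100, 104, 96, 118]:                          # new observations
    status = "SIGNAL" if not (lower <= value <= upper) else "noise"
    print(f"{value}: {status}")                            # only 118 breaches the limits
```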

5 cool ways connected data is being used


From Wareable: “The real news behind the rise of wearable tech isn’t so much the gadgetry as the gigantic amount of personal data that it harnesses.

Concerns have already been raised over what companies may choose to do with such valuable information, with one US life insurance company already using Fitbits to track customers’ exercise and offer them discounts when they hit their activity goals.

Despite a mildly worrying potential dystopia in which our own data could be used against us, there are plenty of positive ways in which companies are using vast amounts of connected data to make the world a better place…

Parkinson’s disease research

Apple Health ResearchKit was recently unveiled as a platform for collecting collaborative data for medical studies, but Apple isn’t the first company to rely on crowdsourced data for medical research.

The Michael J. Fox Foundation for Parkinson’s Research recently unveiled a partnership with Intel to improve research and treatment for the neurodegenerative brain disease. Wearables are being used to unobtrusively gather real-time data from sufferers, which is then analysed by medical experts….

Saving the rhino

Connected data and wearable tech aren’t just limited to humans. In South Africa, the Madikwe Conservation Project is using wearable-based data to protect endangered rhinos from callous poachers.

A combination of ultra-strong Kevlar ankle collars powered by an Intel Galileo chip, along with an RFID chip implanted in each rhino’s horn, allows the animals to be monitored. Any break in proximity between the anklet and horn results in anti-poaching teams being deployed to catch the bad guys….

Making public transport smart

A company called Snips is collecting huge amounts of urban data in order to improve infrastructure. In partnership with French national rail operator SNCF, Snips produced an app called Tranquilien to utilise location data from commuters’ phones and smartwatches to track which parts of the rail network were busy at which times.

Combining big data with crowdsourcing, the information helps passengers to pick a train where they can find a seat during peak times, while the data can also be useful to local businesses when serving the needs of commuters who are passing through.
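
A much-simplified sketch of that crowdsourced-crowding idea: aggregate anonymous location reports by rail segment and hour, then steer passengers toward the quieter departure. The segment names and counts below are invented.

```python
# Estimate relative crowding per (segment, hour) from anonymous reports and
# recommend the least busy departure. Purely illustrative data and logic.
from collections import Counter

pings = [  # (line_segment, hour_of_day) reported by commuters' devices
    ("Paris-Versailles", 8), ("Paris-Versailles", 8), ("Paris-Versailles", 9),
    ("Paris-StCloud", 8),
]
crowding = Counter(pings)

def quieter_departure(segment, hours=(8, 9)):
    return min(hours, key=lambda h: crowding[(segment, h)])

print(quieter_departure("Paris-Versailles"))   # 9 (fewer reported riders at that hour)
```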

Improving the sports fan experience

We’ve already written about how wearable tech is changing the NFL, but the collection of personal data is also set to benefit the fans.

Levi’s Stadium – the new home of the San Francisco 49ers – opened in 2014 and is one of the most technically advanced sports venues in the world. As well as a strong Wi-Fi signal throughout the stadium, fans also benefit from a dedicated app. This not only offers instant replays and real-time game information, but it also helps them find a parking space, order food and drinks directly to their seat and even check the lines at the toilets. As fans use the app, all of the data is collated to enhance the fan experience in future….

Creating interactive art

Don’t be put off by the words ‘interactive installation’. On Broadway is a cool work of art that “represents life in the 21st Century city through a compilation of images and data collected along the 13 miles of Broadway that span Manhattan”….(More)”

Tracking Employment Shocks Using Mobile Phone Data


Paper by Jameson L. Toole et al.: “Can data from mobile phones be used to observe economic shocks and their consequences at multiple scales? Here we present novel methods to detect mass layoffs, identify individuals affected by them, and predict changes in aggregate unemployment rates using call detail records (CDRs) from mobile phones. Using the closure of a large manufacturing plant as a case study, we first describe a structural break model to correctly detect the date of a mass layoff and estimate its size. We then use a Bayesian classification model to identify affected individuals by observing changes in calling behavior following the plant’s closure. For these affected individuals, we observe significant declines in social behavior and mobility following job loss. Using the features identified at the micro level, we show that the same changes in these calling behaviors, aggregated at the regional level, can improve forecasts of macro unemployment rates. These methods and results highlight the promise of new data resources to measure microeconomic behavior and improve estimates of critical economic indicators….(More)”
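
As a toy illustration of the structural-break step described in the abstract (a simplified stand-in, not the authors' full model), one can scan a subscriber's weekly call counts for the split point that best separates “before” and “after” behavior.

```python
# Find the break point that minimizes total squared error around the means of
# the "before" and "after" segments of a call-volume series.
def find_break(series):
    def sse(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs)
    candidates = range(2, len(series) - 1)
    return min(candidates, key=lambda t: sse(series[:t]) + sse(series[t:]))

# Simulated weekly calls by one subscriber: a sharp drop after week 8 (a stylized layoff).
calls = [42, 45, 40, 44, 43, 41, 46, 44, 20, 18, 22, 19, 21, 17]
print(find_break(calls))   # 8: the index where pre- and post-layoff behavior diverge
```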

Navigating the Health Data Ecosystem


New book from O’Reilly Media on “The ‘Six C’s’: Understanding the Health Data Terrain in the Era of Precision Medicine”: “Data-driven technologies are now being adopted, developed, funded, and deployed throughout the health care market at an unprecedented scale. But, as this O’Reilly report reveals, health care innovation contains more hurdles and requires more finesse than many tech startups expect. By paying attention to the lessons from the report’s findings, innovation teams can better anticipate what they’ll face, and plan accordingly.

Simply put, teams looking to apply collective intelligence and “big data” platforms to health and health care problems often don’t appreciate the messy details of using and making sense of data in the heavily regulated hospital IT environment. Download this report today and learn how it helps prepare startups in six areas:

  1. Complexity: An enormous domain with noisy data not designed for machine consumption
  2. Computing: Lack of standard, interoperable schema for documenting human health in a digital format
  3. Context: Lack of critical contextual metadata for interpreting health data
  4. Culture: Startup difficulties in hospital ecosystems: why innovation can be a two-edged sword
  5. Contracts: Navigating the IRB, HIPAA, and EULA frameworks
  6. Commerce: The problem of how digital health startups get paid

This report represents the initial findings of a study funded by a grant from the Robert Wood Johnson Foundation. Subsequent reports will explore the results of three deep-dive projects the team pursued during the study. (More)”

Big Data. Big Obstacles.


Dalton Conley et al. in the Chronicle of Higher Education: “After decades of fretting over declining response rates to traditional surveys (the mainstay of 20th-century social research), an exciting new era would appear to be dawning thanks to the rise of big data. Social contagion can be studied by scraping Twitter feeds; peer effects are tested on Facebook; long-term trends in inequality and mobility can be assessed by linking tax records across years and generations; social-psychology experiments can be run on Amazon’s Mechanical Turk service; and cultural change can be mapped by studying the rise and fall of specific Google search terms. In many ways there has been no better time to be a scholar in sociology, political science, economics, or related fields.

However, what should be an opportunity for social science is now threatened by a three-headed monster of privatization, amateurization, and Balkanization. A coordinated public effort is needed to overcome all of these obstacles.

While the availability of social-media data may obviate the problem of declining response rates, it introduces all sorts of problems with the level of access that researchers enjoy. Although some data can be culled from the web—Twitter feeds and Google searches—other data sit behind proprietary firewalls. And as individual users tune up their privacy settings, the typical university or independent researcher is increasingly locked out. Unlike federally funded studies, there is no mandate for Yahoo or Alibaba to make its data publicly available. The result, we fear, is a two-tiered system of research. Scientists working for or with big Internet companies will feast on humongous data sets—and even conduct experiments—and scholars who do not work in Silicon Valley (or Alley) will be left with proverbial scraps….

To address this triple threat of privatization, amateurization, and Balkanization, public social science needs to be bolstered for the 21st century. In the current political and economic climate, social scientists are not waiting for huge government investment like we saw during the Cold War. Instead, researchers have started to knit together disparate data sources by scraping, harmonizing, and geocoding any and all information they can get their hands on.

Currently, many firms employ some well-trained social and behavioral scientists who are free to pursue their own research; likewise, some companies have programs by which scholars can apply to be in residence or work with their data extramurally. However, as Facebook states, its program is “by invitation only and requires an internal Facebook champion.” And while Google provides services like Ngram to the public, such limited efforts at data sharing are not enough for truly transparent and replicable science….(More)”