A computational algorithm for fact-checking


Kurzweil News: “Computers can now do fact-checking for any body of knowledge, according to Indiana University network scientists, writing in an open-access paper published June 17 in PLoS ONE.

Using factual information from summary infoboxes from Wikipedia* as a source, they built a “knowledge graph” with 3 million concepts and 23 million links between them. A link between two concepts in the graph can be read as a simple factual statement, such as “Socrates is a person” or “Paris is the capital of France.”

In the first use of this method, IU scientists created a simple computational fact-checker that assigns “truth scores” to statements concerning history, geography and entertainment, as well as random statements drawn from the text of Wikipedia. In multiple experiments, the automated system consistently matched the assessment of human fact-checkers in terms of the humans’ certitude about the accuracy of these statements.

Dealing with misinformation and disinformation

In what the IU scientists describe as an “automatic game of trivia,” the team applied their algorithm to answer simple questions related to geography, history, and entertainment, including statements that matched states or nations with their capitals, presidents with their spouses, and Oscar-winning film directors with the movie for which they won the Best Picture award. The majority of tests returned highly accurate truth scores.

Lastly, the scientists used the algorithm to fact-check excerpts from the main text of Wikipedia, which were previously labeled by human fact-checkers as true or false, and found a positive correlation between the truth scores produced by the algorithm and the answers provided by the fact-checkers.

Significantly, the IU team found their computational method could even assess the truthfulness of statements about information not directly contained in the infoboxes. For example, it correctly assessed the fact that Steve Tesich — the Serbian-American screenwriter of the classic Hoosier film “Breaking Away” — graduated from IU, even though that information is not specifically included in the infobox about him.

Using multiple sources to improve accuracy and richness of data

“The measurement of the truthfulness of statements appears to rely strongly on indirect connections, or ‘paths,’ between concepts,” said Giovanni Luca Ciampaglia, a postdoctoral fellow at the Center for Complex Networks and Systems Research in the IU Bloomington School of Informatics and Computing, who led the study….

“These results are encouraging and exciting. We live in an age of information overload, including abundant misinformation, unsubstantiated rumors and conspiracy theories whose volume threatens to overwhelm journalists and the public. Our experiments point to methods to abstract the vital and complex human task of fact-checking into a network analysis problem, which is easy to solve computationally.”
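The paper's own metric is more elaborate, but a minimal sketch of the core idea (score a claim by the cheapest indirect path between two concepts, discounting paths that run through very generic hub nodes) might look like the following. The toy graph, the hop-cost penalty, and the truth_score function are illustrative assumptions, not the authors' implementation.

```python
import math
import networkx as nx

# Toy knowledge graph: nodes are concepts, edges are infobox-style links.
G = nx.Graph()
G.add_edges_from([
    ("Paris", "France"), ("France", "Europe"),
    ("Rome", "Italy"), ("Italy", "Europe"),
    ("Socrates", "Person"),
])

def truth_score(graph, subject, obj):
    """Score the claim that `subject` and `obj` are directly related.

    A direct edge scores 1.0; otherwise the score decays with the cost of the
    cheapest indirect path, where hops through high-degree "generic" nodes are
    penalized so that specific connections count for more.
    """
    if graph.has_edge(subject, obj):
        return 1.0
    hop_cost = lambda u, v, d: math.log(max(graph.degree(v), 2))
    try:
        cost = nx.shortest_path_length(graph, subject, obj, weight=hop_cost)
        return 1.0 / (1.0 + cost)
    except nx.NetworkXNoPath:
        return 0.0

print(truth_score(G, "Paris", "France"))  # direct statement -> 1.0
print(truth_score(G, "Paris", "Italy"))   # indirect path only -> lower score
```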

Expanding the knowledge base

Although the experiments were conducted using Wikipedia, the IU team’s method does not assume any particular source of knowledge. The scientists aim to conduct additional experiments using knowledge graphs built from other sources of human knowledge, such as Freebase, the open-knowledge base built by Google, and note that multiple information sources could be used together to account for different belief systems….(More)”

A Research Roadmap for Human Computation


Emerging Technology From the arXiv: “The wisdom of the crowd has become so powerful and so accessible via the Internet that it has become a resource in its own right. Various services now tap into this rich supply of human cognition, such as Wikipedia, Duolingo, and Amazon’s Mechanical Turk.

So important is this resource that scientists have given it a name; they call it human computation. And a rapidly emerging and increasingly important question is how best to exploit it.

Today, we get an answer of sorts thanks to a group of computer scientists, crowdsourcing pioneers, and visionaries who have created a roadmap for research into human computation. The team, led by Pietro Michelucci at the Human Computation Institute, points out that human computation systems have been hugely successful at tackling complex problems, from identifying spiral galaxies to organizing disaster relief.

But their potential is even greater still, provided that human cognition can be efficiently harnessed on a global scale. Last year, they met to discuss these issues and have now published the results of their debate.

They begin by pointing out the extraordinary successes of human computation… and then describe the kinds of projects they want to create. They call one idea Project Houston, after the crowdsourced effort on the ground that helped bring back the Apollo 13 astronauts after an on-board explosion on the way to the moon.

Their idea is that similar help can be brought to bear from around the world when individuals on earth find themselves in trouble. By this they mean individuals who might be considering suicide or suffering from depression, for example.

The plan is to use state-of-the-art speech analysis and natural language understanding to detect stress and offer help. This would come in the form of composite personalities made up from individuals with varying levels of expertise in the crowd, supported by artificial intelligence techniques. “Project Houston could provide a consistently kind and patient personality even if the “crowd” changes completely over time,” they say.

Another idea is to build on the way that crowdsourcing helps people learn. One example of this is Duolingo, an app that offers free language lessons while simultaneously acting as a document translation service. “Why stop with language learning and translation?” they ask.

A similar approach could help people learn new skills as they work online, a process that should allow them to take on more complex roles. One example is in the field of radiology, where an important job is to recognize tumors on x-ray images. This is a task that machine vision algorithms do not yet perform reliably…..

Yet another idea would be to crowdsource information that helps the poorest families in America find social welfare programs. These programs are often difficult to navigate and represent a disproportionate hardship for the people who are most likely to benefit from them: those who are homeless, who have disabilities, who are on low income, and so on.

The idea is that the crowd should take on some of this burden, freeing up this group for other tasks, like finding work, managing health problems and so on.

These are worthy goals but they raise some significant questions. Chief among these is the nature of the ethical, legal, and social implications of human computation. How can this work be designed to allow meaningful and dignified human participation? How can the outcomes be designed so that the most vulnerable people can benefit from it? And what is the optimal division of labor between machines and humans to produce a specific result?

Ref: arxiv.org/abs/1505.07096: A U.S. Research Roadmap for Human Computation”

Open data could save the NHS hundreds of millions, says top UK scientist


The Guardian: “The UK government must open up and highlight the power of more basic data sets to improve patient care in the NHS and save hundreds of millions of pounds a year, Nigel Shadbolt, chairman of the Open Data Institute (ODI), has urged.

The UK government topped the first league table for open data (paywall) produced by the ODI last year, but Shadbolt warns that ministers’ open data responsibilities have not yet been satisfied.

Basic data on prescription administration is now published on a monthly basis but Shadbolt said medical practitioners must be educated about the power of this data to change prescribing habits across the country.

Other data sets that are promised to make the NHS more accessible, such as trusts’ opening times, consultant lists and details of services, are not currently available in a machine-readable form.

“These basic sets of information about the processes, the people and places in the health system are all fragmented and fractured and many of them are not available as registers that you can go to,” Shadbolt said.

“Whenever you talk about health data people think you must be talking about personal data and patient data and there are issues, obviously, of absolutely protecting privacy there. But there’s lots of data in the health service that is not about personal patient data at all that would be hugely useful to just have available as machine-readable data for apps to use.”

The UK government has led the way in recent years in encouraging transparency and accountability within the NHS by opening league tables. The publication of league tables on MRSA was followed by a 76-79% drop in infections.

Shadbolt said: “Those hospitals that were worst in their league table don’t like to be there and there was a very rapid diffusion of understanding of best practice across them that you can quantify. It’s many millions of pounds being saved.”

The artificial intelligence and open data expert said the next big area for open data improvement in the NHS is around prescriptions.

Shadbolt pointed to the publication of data about the prescription of statins, which has helped identify savings worth hundreds of millions of pounds: “There is little doubt that this pattern is likely to exist across the whole of the prescribing space.”…(More)”

How Data Mining could have prevented Tunisia’s Terror attack in Bardo Museum


Wassim Zoghlami at Medium: “…Data mining is the process of posing queries and extracting useful patterns or trends, often previously unknown, from large amounts of data using various techniques such as those from pattern recognition and machine learning. Lately there has been great interest in leveraging data mining for counter-terrorism applications.

Using data on more than 50,000 ISIS-connected Twitter accounts, I was able to establish an understanding of some of the factors that determine how often ISIS attacks occur, what types of terror strikes are used in which geopolitical situations, and many other criteria, through graphs of hashtag usage frequency and the frequency of particular groups of words used in the tweets.

A simple data mining project on some of the repetitive hashtags and sequences of words typically used by ISIS militants in their tweets yielded surprising results. The results show a rise in certain keywords in the tweets starting from March 15, three days before the Bardo Museum attack.

Some of the common keywords and hashtags that showed an unusual peak from March 15, three days before the attack (a short sketch of this kind of frequency analysis follows the list below):

#طواغيت تونس : Tyrants of Tunisia = a reference to the military

بشرى تونس : Good news for Tunisia.

قريبا تونس : Soon in Tunisia.

#إفريقية_للإعلام : The head of social media of Afriqiyah

#غزوة_تونس : The foray of Tunis…
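Below is a short sketch of this kind of frequency analysis: count daily occurrences of watch-listed hashtags in a collection of tweets, then flag unusual spikes. It assumes the tweets are already available as (date, text) records; the watch-list, the spike threshold, and the sample tweets are illustrative assumptions.

```python
from collections import Counter, defaultdict
from datetime import date

# Hypothetical input: (date, tweet_text) records already collected from the
# monitored accounts; the watch-list is drawn from the hashtags listed above.
WATCHLIST = {"#غزوة_تونس", "#إفريقية_للإعلام"}

def daily_counts(tweets):
    """Count watch-listed hashtag occurrences per day."""
    counts = defaultdict(Counter)
    for day, text in tweets:
        for token in text.split():
            if token in WATCHLIST:
                counts[day][token] += 1
    return counts

def flag_spikes(counts, factor=1.5):
    """Flag (day, hashtag, count) triples where a day's count exceeds
    `factor` times that hashtag's average daily count."""
    totals, ndays = Counter(), Counter()
    for day_counts in counts.values():
        for tag, n in day_counts.items():
            totals[tag] += n
            ndays[tag] += 1
    return [(day, tag, n)
            for day, day_counts in sorted(counts.items())
            for tag, n in day_counts.items()
            if n > factor * totals[tag] / ndays[tag]]

tweets = [
    (date(2015, 3, 10), "... #غزوة_تونس ..."),
    (date(2015, 3, 12), "... #غزوة_تونس ..."),
    (date(2015, 3, 15), "قريبا تونس #غزوة_تونس"),
    (date(2015, 3, 15), "بشرى تونس #غزوة_تونس"),
    (date(2015, 3, 15), "... #غزوة_تونس #إفريقية_للإعلام ..."),
]
print(flag_spikes(daily_counts(tweets)))  # flags the March 15 spike
```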

Big Data and Data Mining should be used for national security intelligence

The Tunisian national security services need to leverage big data to predict such attacks and to keep pace with the sheer volume of digital data. One challenge facing data mining techniques is that carrying out effective data mining and extracting useful information for counterterrorism and national security requires gathering all kinds of information about individuals. However, this information could be a threat to individuals’ privacy and civil liberties…(More)”

How Crowdsourcing And Machine Learning Will Change The Way We Design Cities


Shaunacy Ferro at FastCompany: “In 2011, researchers at the MIT Media Lab debuted Place Pulse, a website that served as a kind of “hot or not” for cities. Given two Google Street View images culled from a select few cities including New York City and Boston, the site asked users to click on the one that seemed safer, more affluent, or more unique. The result was an empirical way to measure urban aesthetics.

Now, that data is being used to predict what parts of cities feel the safest. StreetScore, a collaboration between the MIT Media Lab’s Macro Connections and Camera Culture groups, uses an algorithm to create a super high-resolution map of urban perceptions. The algorithmically generated data could one day be used to research the connection between urban perception and crime, as well as to inform urban design decisions.

The algorithm, created by Nikhil Naik, a Ph.D. student in the Camera Culture lab, breaks an image down into its composite features—such as building texture, colors, and shapes. Based on how Place Pulse volunteers rated similar features, the algorithm assigns the streetscape a perceived safety score between 1 and 10. These scores are visualized as geographic points on a map, designed by MIT rising sophomore Jade Philipoom. Each image available from Google Maps in the two cities is represented by a colored dot: red for the locations that the algorithm tags as unsafe, and dark green for those that appear safest. The site, now limited to New York and Boston, will be expanded to feature Chicago and Detroit later this month, and eventually, with data collected from a new version of Place Pulse, will feature dozens of cities around the world….(More)”
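A rough sketch of the general pipeline: extract image features, fit a regressor to crowd-derived perceived-safety ratings, then score unrated street images. The random arrays below stand in for real image descriptors and Place Pulse ratings; the actual system uses richer features and its own learning method.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical training data: one feature vector per Street View image
# (stand-ins for texture/color descriptors) and a crowd-derived
# perceived-safety rating between 1 and 10 from the pairwise comparisons.
rng = np.random.default_rng(0)
X_train = rng.random((500, 64))          # stand-in image descriptors
y_train = rng.uniform(1, 10, size=500)   # stand-in Place Pulse-style scores

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Score previously unrated street images; each score can then be attached to
# the image's latitude/longitude to draw the red-to-green map described above.
X_new = rng.random((3, 64))
print(model.predict(X_new))  # values on the same 1-10 perceived-safety scale
```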

Modern Methods for Sentiment Analysis


Review by Michael Czerny: “Sentiment analysis is a common application of Natural Language Processing (NLP) methodologies, particularly classification, whose goal is to extract the emotional content in text. In this way, sentiment analysis can be seen as a method to quantify qualitative data with some sentiment score. While sentiment is largely subjective, sentiment quantification has enjoyed many useful implementations, such as businesses gaining understanding about consumer reactions to a product, or detecting hateful speech in online comments.

The simplest form of sentiment analysis is to use a dictionary of good and bad words. Each word in a sentence has a score, typically +1 for positive sentiment and -1 for negative. Then, we simply add up the scores of all the words in the sentence to get a final sentiment total. Clearly, this has many limitations, the most important being that it neglects context and surrounding words. For example, in our simple model the phrase “not good” may be classified as 0 sentiment, given “not” has a score of -1 and “good” a score of +1. A human would likely classify “not good” as negative, despite the presence of “good”.
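As a concrete illustration, a dictionary-based scorer of this kind fits in a few lines; the toy lexicon is an assumption for the example, and real sentiment lexicons contain thousands of scored words.

```python
# A minimal dictionary-based scorer. The tiny lexicon is illustrative;
# real sentiment lexicons contain thousands of scored words.
LEXICON = {"good": 1, "great": 1, "happy": 1, "bad": -1, "terrible": -1, "not": -1}

def dictionary_sentiment(text):
    """Sum the per-word scores; words missing from the lexicon count as 0."""
    return sum(LEXICON.get(word, 0) for word in text.lower().split())

print(dictionary_sentiment("the movie was great"))  # +1
print(dictionary_sentiment("not good"))             # 0: the limitation noted above
```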

Another common method is to treat a text as a “bag of words”. We treat each text as a 1 by N vector, where N is the size of our vocabulary. Each column is a word, and the value is the number of times that word appears. For example, the phrase “bag of bag of words” might be encoded as [2, 2, 1]. This could then be fed into a machine learning algorithm for classification, such as logistic regression or SVM, to predict sentiment on unseen data. Note that this requires data with known sentiment to train on in a supervised fashion. While this is an improvement over the previous method, it still ignores context, and the size of the data increases with the size of the vocabulary.
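A minimal bag-of-words classifier along these lines, using scikit-learn; the four labelled sentences are stand-ins for a real training corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical labelled examples; a real model needs a large training corpus.
texts = ["I loved this film", "great acting and story",
         "terrible plot", "I hated every minute"]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

vectorizer = CountVectorizer()         # builds the 1-by-N word-count vectors
X = vectorizer.fit_transform(texts)

clf = LogisticRegression().fit(X, labels)

# Predict sentiment of unseen text using the same vocabulary.
X_new = vectorizer.transform(["great film", "terrible acting"])
print(clf.predict(X_new))
```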

Word2Vec and Doc2Vec

Recently, Google developed a method called Word2Vec that captures the context of words, while at the same time reducing the size of the data. Word2Vec is actually two different methods: Continuous Bag of Words (CBOW) and Skip-gram. In the CBOW method, the goal is to predict a word given the surrounding words. Skip-gram is the converse: we want to predict a window of words given a single word (see Figure 1). Both methods use artificial neural networks as their classification algorithm. Initially, each word in the vocabulary is a random N-dimensional vector. During training, the algorithm learns the optimal vector for each word using the CBOW or Skip-gram method….(More)
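A minimal example of training such word vectors with the gensim library; the three-sentence corpus is purely illustrative, and parameter names assume gensim 4.x.

```python
from gensim.models import Word2Vec

# Tiny illustrative corpus: each document is a list of tokens.
# A real model is trained on millions of sentences.
sentences = [
    ["the", "movie", "was", "great"],
    ["the", "film", "was", "excellent"],
    ["the", "plot", "was", "terrible"],
]

# sg=0 selects CBOW (predict a word from its context); sg=1 selects Skip-gram
# (predict the context from a word).
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

vec = model.wv["movie"]                # the learned 50-dimensional word vector
print(model.wv.most_similar("movie"))  # nearest words in the learned space
```

For sentiment analysis, the per-word vectors can then be averaged into a document vector (or learned directly with Doc2Vec) and passed to a classifier like the bag-of-words example above.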

Citizen Science for Citizen Access to Law


Paper by Michael Curtotti, Wayne Weibel, Eric McCreath, Nicolas Ceynowa, Sara Frug, and Tom R Bruce: “This paper sits at the intersection of citizen access to law, legal informatics and plain language. The paper reports the results of a joint project of the Cornell University Legal Information Institute and the Australian National University which collected thousands of crowdsourced assessments of the readability of law through the Cornell LII site. The aim of the project is to enhance accuracy in the prediction of the readability of legal sentences. The study requested readers on legislative pages of the LII site to rate passages from the United States Code and the Code of Federal Regulations and other texts for readability and other characteristics. The research provides insight into who uses legal rules and how they do so. The study enables conclusions to be drawn as to the current readability of law and spread of readability among legal rules. The research is intended to enable the creation of a dataset of legal rules labelled by human judges as to readability. Such a dataset, in combination with machine learning, will assist in identifying factors in legal language which impede readability and access for citizens. As far as we are aware, this research is the largest ever study of readability and usability of legal language and the first research which has applied crowdsourcing to such an investigation. The research is an example of the possibilities open for enhancing access to law through engagement of end users in the online legal publishing environment for enhancement of legal accessibility and through collaboration between legal publishers and researchers….(More)”
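The paper's models and feature set are not reproduced in the excerpt, but a minimal sketch of the general approach, regressing crowd readability ratings on simple surface features of legal sentences, might look like this; the example sentences, ratings, and features are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def sentence_features(sentence):
    """Simple surface features plausibly related to readability."""
    words = sentence.split()
    return [len(words),                                   # sentence length
            np.mean([len(w) for w in words]),             # mean word length
            sum(len(w) > 8 for w in words) / len(words)]  # share of long words

# Hypothetical crowd data: (legal sentence, mean readability rating).
rated = [
    ("The Secretary shall prescribe such regulations as may be necessary to carry out this subchapter.", 2.8),
    ("Notwithstanding any other provision of law, the amounts described in paragraph (2) shall be reduced accordingly.", 2.3),
    ("You must file the form by April 15.", 4.6),
]
X = np.array([sentence_features(s) for s, _ in rated])
y = np.array([r for _, r in rated])

model = LinearRegression().fit(X, y)
print(model.predict([sentence_features("The form must be filed on time.")]))
```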

The End of Asymmetric Information


Essay by Alex Tabarrok and Tyler Cowen: “Might the age of asymmetric information – for better or worse – be over? Market institutions are rapidly evolving to a situation where very often the buyer and the seller have roughly equal knowledge. Technological developments are giving everyone who wants it access to the very best information when it comes to product quality, worker performance, matches to friends and partners, and the nature of financial transactions, among many other areas.

These developments will have implications for how markets work, how much consumers benefit, and also economic policy and the law. As we will see, there may be some problematic sides to these new arrangements, specifically when it comes to privacy. Still, a large amount of economic regulation seems directed at a set of problems which, in large part, no longer exist…

Many “public choice” problems are really problems of asymmetric information. In William Niskanen’s (1974) model of bureaucracy, government workers usually benefit from larger bureaus, and they are able to expand their bureaus to inefficient size because they are the primary providers of information to politicians. Some bureaus, such as the NSA and the CIA, may still be able to use secrecy to benefit from information asymmetry. For instance they can claim to politicians that they need more resources to deter or prevent threats, and it is hard for the politicians to have well-informed responses on the other side of the argument. Timely, rich information about most other bureaucracies, however, is easily available to politicians and increasingly to the public as well. As information becomes more symmetric, Niskanen’s (1974) model becomes less applicable, and this may help check the growth of unneeded bureaucracy.

Cheap sensors are greatly extending how much information can be economically gathered and analyzed. It’s not uncommon for office workers to have every key stroke logged. When calling customer service, who has not been told “this call may be monitored for quality control purposes?” Service-call workers have their location tracked through cell phones. Even information that once was thought to be purely subjective can now be collected and analyzed, often with the aid of smart software or artificial intelligence. One firm, for example, uses badges equipped with microphones, accelerometers, and location sensors to measure tone of voice, posture, and body language, as well as who spoke to whom and for how long (Lohr 2014). The purpose is not only to monitor workers but to deduce when, where and why workers are the most productive. We are again seeing trade-offs which bring greater productivity, and limit asymmetric information, albeit at the expense of some privacy.

As information becomes more prevalent and symmetric, earlier solutions to asymmetric problems will become less necessary. When employers do not easily observe workers, for example, employers may pay workers unusually high wages, generating a rent. Workers will then work at high levels despite infrequent employer observation, to maintain their future rents (Shapiro and Stiglitz 1984). But those higher wages involved a cost, namely that fewer workers were hired, and the hires that were made often were directed to people who were already known to the firm. Better monitoring of workers will mean that employers will hire more people and furthermore they may be more willing to take chances on risky outsiders, rather than those applicants who come with impeccable pedigree. If the outsider does not work out and produce at an acceptable level, it is easy enough to figure this out and fire them later on….(More)”

Big Data for Social Good


Introduction to a Special Issue of the Journal “Big Data” by Charlie Catlett and Rayid Ghani: “…organizations focused on social good are realizing the potential as well but face several challenges as they seek to become more data-driven. The biggest challenge they face is a paucity of examples and case studies on how data can be used for social good. This special issue of Big Data is targeted at tackling that challenge and focuses on highlighting some exciting and impactful examples of work that uses data for social good. The special issue is just one example of the recent surge in such efforts by the data science community. …

This special issue solicited case studies and problem statements that would either highlight (1) the use of data to solve a social problem or (2) social challenges that need data-driven solutions. From roughly 20 submissions, we selected 5 articles that exemplify this type of work. These cover five broad application areas: international development, healthcare, democracy and government, human rights, and crime prevention.

“Understanding Democracy and Development Traps Using a Data-Driven Approach” (Ranganathan et al.) details a data-driven model linking democracy, cultural values, and socioeconomic indicators, identifying two types of “traps” that hinder the development of democracy. They use historical data to detect causal factors and make predictions about the time expected for a given country to overcome these traps.

“Targeting Villages for Rural Development Using Satellite Image Analysis” (Varshney et al.) discusses two case studies that use data and machine learning techniques for international economic development—solar-powered microgrids in rural India and targeting financial aid to villages in sub-Saharan Africa. In the process, the authors stress the importance of understanding the characteristics and provenance of the data and the criticality of incorporating local “on the ground” expertise.

In “Human Rights Event Detection from Heterogeneous Social Media Graphs,” Chen and Neil describe efficient and scalable techniques to use social media in order to detect emerging patterns in human rights events. They test their approach on recent events in Mexico and show that they can accurately detect relevant human rights–related tweets prior to international news sources, and in some cases, prior to local news reports, which could potentially lead to more timely, targeted, and effective advocacy by relevant human rights groups.

“Finding Patterns with a Rotten Core: Data Mining for Crime Series with Core Sets” (Wang et al.) describes a case study with the Cambridge Police Department, using a subspace clustering method to analyze the department’s full housebreak database, which contains detailed information from thousands of crimes from over a decade. They find that the method allows human crime analysts to handle vast amounts of data and provides new insights into true patterns of crime committed in Cambridge…..(More)

Data scientists rejoice! There’s an online marketplace selling algorithms from academics


SiliconRepublic: “Algorithmia, an online marketplace that connects computer science researchers’ algorithms with developers who may have uses for them, has exited its private beta.

Algorithms are essential to our online experience. Google uses them to determine which search results are the most relevant. Facebook uses them to decide what should appear in your news feed. Netflix uses them to make movie recommendations.

Founded in 2013, Algorithmia could be described as an app store for algorithms, with over 800 of them available in its library. These algorithms provide the means of completing various tasks in the fields of machine learning, audio and visual processing, and computer vision.

Algorithmia found a way to monetise algorithms by creating a platform where academics can share their creations and charge a royalty fee per use, while developers and data scientists can request specific algorithms in return for a monetary reward. One such suggestion is for ‘punctuation prediction’, which would insert correct punctuation and capitalisation in speech-to-text translation.

While it’s not the first algorithm marketplace online, Algorithmia will accept and sell any type of algorithm and host it on its servers. What this means is that developers need only add a simple piece of code to their software in order to send a query to Algorithmia’s servers, so the algorithm itself doesn’t have to be integrated in its entirety….
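The excerpt doesn't spell out the client API, but conceptually the integration is a single HTTP request to the hosted algorithm. The sketch below is hypothetical: the endpoint URL, algorithm path, and auth header are assumptions for illustration, not the service's documented interface.

```python
import requests

# Hypothetical endpoint and algorithm path, for illustration only; this is not
# the service's documented API. The algorithm runs on the marketplace's
# servers, so the caller just posts input and reads back the result.
API_KEY = "YOUR_API_KEY"
URL = ("https://api.example-algorithm-marketplace.com/"
       "v1/algo/someuser/punctuation-prediction")

response = requests.post(
    URL,
    json={"text": "hello how are you im fine thanks"},
    headers={"Authorization": f"Simple {API_KEY}"},
    timeout=30,
)
response.raise_for_status()
print(response.json())  # e.g. the punctuated, capitalized text
```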

Computer science researchers can spend years developing algorithms, only for them to be published in a scientific journal, never to be read by software engineers.

Algorithmia intends to create a community space where academics and engineers can meet to discuss and refine these algorithms for practical use. A voting and commenting system on the site will allow users to engage and even share insights on how contributions can be improved.

To that end, Algorithmia’s ultimate goal is to advance the development of algorithms as well as their discovery and use….(More)”