OpenAI won’t benefit humanity without data-sharing


At the Guardian: “There is a common misconception about what drives the digital-intelligence revolution. People seem to have the idea that artificial intelligence researchers are directly programming an intelligence; telling it what to do and how to react. There is also the belief that when we interact with this intelligence we are processed by an ‘algorithm’ – one that is subject to the whims of the designer and encodes his or her prejudices.

OpenAI, a new non-profit artificial intelligence company that was founded on Friday, wants to develop digital intelligence that will benefit humanity. By sharing its sentient algorithms with all, the venture, backed by a host of Silicon Valley billionaires, including Elon Musk and Peter Thiel, wants to avoid the existential risks associated with the technology.

OpenAI’s launch announcement was timed to coincide with this year’s Neural Information Processing Systems conference: the main academic outlet for scientific advances in machine learning, which I chaired. Machine learning is the technology that underpins the new generation of AI breakthroughs.

One of OpenAI’s main ideas is to collaborate openly, publishing code and papers. This is admirable and the wider community is already excited by what the company could achieve.

OpenAI is not the first company to target digital intelligence, and certainly not the first to publish code and papers. Both Facebook and Google have already shared code. They were also present at the same conference. All three companies hosted parties with open bars, aiming to entice the latest and brightest minds.

However, the way machine learning works means that making algorithms available isn’t necessarily as useful as one might think. A machine-learning algorithm is subtly different from the popular perception.

Just as in baking we don’t have control over how the cake will emerge from the oven, in machine learning we don’t control every decision that the computer will make. In machine learning the quality of the ingredients, the quality of the data provided, has a massive impact on the intelligence that is produced.

For intelligent decision-making the recipe needs to be carefully applied to the data: this is the process we refer to as learning. The result is the combination of our data and the recipe. We need both to make predictions.
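The recipe-and-ingredients point can be made concrete with a toy sketch (all numbers here are invented for illustration): the same learning "recipe", applied to different data, yields different models, and neither the recipe nor the data alone can make a prediction.

```python
# A toy illustration of "recipe + ingredients": the same learning
# algorithm (least-squares slope for a line through the origin)
# produces different models depending on the data it is given.

def fit_slope(points):
    """The 'recipe': least-squares slope for y = w * x."""
    num = sum(x * y for x, y in points)
    den = sum(x * x for x, y in points)
    return num / den

# Two different sets of 'ingredients'
data_a = [(1, 2), (2, 4), (3, 6)]      # roughly y = 2x
data_b = [(1, 5), (2, 10), (3, 15)]    # roughly y = 5x

model_a = fit_slope(data_a)  # the learned model is recipe *and* data
model_b = fit_slope(data_b)

print(model_a)  # 2.0
print(model_b)  # 5.0
```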

By sharing their algorithms, Facebook and Google are merely sharing the recipe. Someone has to provide the eggs and flour and provide the baking facilities (which in Google and Facebook’s case are vast data-computation facilities, often located near hydroelectric power stations for cheaper electricity).

So even before it starts, an open question for OpenAI is how it will ensure access to data on the necessary scale to make progress…(More)”

Decoding the Future for National Security


George I. Seffers at Signal: “U.S. intelligence agencies are in the business of predicting the future, but no one has systematically evaluated the accuracy of those predictions—until now. The intelligence community’s cutting-edge research and development agency uses a handful of predictive analytics programs to measure and improve the ability to forecast major events, including political upheavals, disease outbreaks, insider threats and cyber attacks.

The Office for Anticipating Surprise at the Intelligence Advanced Research Projects Activity (IARPA) is a place where crystal balls come in the form of software, tournaments and throngs of people. The office sponsors eight programs designed to improve predictive analytics, which uses a variety of data to forecast events. The programs all focus on incidents outside of the United States, and the information is anonymized to protect privacy. The programs are in different stages, some having recently ended as others are preparing to award contracts.

But they all have one more thing in common: They use tournaments to advance the state of the predictive analytic arts. “We decided to run a series of forecasting tournaments in which people from around the world generate forecasts about, now, thousands of real-world events,” says Jason Matheny, IARPA’s new director. “All of our programs on predictive analytics do use this tournament style of funding and evaluating research.” The Open Source Indicators program used a crowdsourcing technique in which people across the globe offered their predictions on such events as political uprisings, disease outbreaks and elections.

The data analyzed included social media trends, Web search queries and even cancelled dinner reservations—an indication that people are sick. “The methods applied to this were all automated. They used machine learning to comb through billions of pieces of data to look for that signal, that leading indicator, that an event was about to happen,” Matheny explains. “And they made amazing progress. They were able to predict disease outbreaks weeks earlier than traditional reporting.” The recently completed Aggregative Contingent Estimation (ACE) program also used a crowdsourcing competition in which people predicted events, including whether weapons would be tested, treaties would be signed or armed conflict would break out along certain borders. Volunteers were asked to provide information about their own background and what sources they used. IARPA also tested participants’ cognitive reasoning abilities. Volunteers provided their forecasts every day, and IARPA personnel kept score. Interestingly, they discovered the “deep domain” experts were not the best at predicting events. Instead, people with a certain style of thinking came out the winners. “They read a lot, not just from one source, but from multiple sources that come from different viewpoints. They have different sources of data, and they revise their judgments when presented with new information. They don’t stick to their guns,” Matheny reveals. …
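The article does not say exactly how IARPA “kept score”, but a standard scoring rule for probabilistic forecasting tournaments of this kind is the Brier score (mean squared error between forecast probability and outcome, lower is better). A minimal sketch, with invented forecasts:

```python
# Brier score: a common way to score probabilistic forecasts.
# Lower is better; a confident, correct forecaster beats a hedger.

def brier(forecasts):
    """forecasts: list of (predicted probability, actual outcome 0 or 1)."""
    return sum((p - o) ** 2 for p, o in forecasts) / len(forecasts)

confident_and_right = [(0.9, 1), (0.8, 1), (0.1, 0)]
always_hedging = [(0.5, 1), (0.5, 1), (0.5, 0)]

print(round(brier(confident_and_right), 3))  # 0.02
print(round(brier(always_hedging), 3))       # 0.25
```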

The ACE research also contributed to a recently released book, Superforecasting: The Art and Science of Prediction, according to the IARPA director. The book was co-authored, along with Dan Gardner, by Philip Tetlock, the Annenberg University professor of psychology and management at the University of Pennsylvania who also served as a principal investigator for the ACE program. Like ACE, the Crowdsourcing Evidence, Argumentation, Thinking and Evaluation program uses the forecasting tournament format, but it also requires participants to explain and defend their reasoning. The initiative aims to improve analytic thinking by combining structured reasoning techniques with crowdsourcing.

Meanwhile, the Foresight and Understanding from Scientific Exposition (FUSE) program forecasts science and technology breakthroughs….(More)”

How Africa can benefit from the data revolution


In The Guardian: “….The modern information infrastructure is about movement of data. From data we derive information and knowledge, and that knowledge can be propagated rapidly across the country and throughout the world. Facebook and Google have both made massive investments in machine learning, the mainstay technology for converting data into knowledge. But the potential for these technologies in Africa is much larger: instead of simply advertising products to people, we can imagine modern distributed health systems, distributed markets, knowledge systems for disease intervention. The modern infrastructure should be data driven and deployed across the mobile network. A single good idea can then be rapidly implemented and distributed via the mobile phone app ecosystems.

The information infrastructure does not require large scale thinking and investment to deliver. In fact, it requires just the reverse. It requires agility and innovation. Larger companies cannot react quickly enough to exploit technological advances. Small companies with a good idea can grow quickly – from IBM to Microsoft, Google and now Facebook. All these companies now agree on one thing: data is where the value lies. Modern internet companies are data-driven from the ground up. Could the same thing happen in Africa’s economies? Can entire countries reformulate their infrastructures to be data-driven from the ground up?

Maybe, or maybe not, but it isn’t necessary to have a grand plan to give it a go. It is already natural to use data and communication to solve real world problems. In Silicon Valley these are the challenges of getting a taxi or reserving a restaurant. In Africa they are often more fundamental. John Quinn has been in Kampala, Uganda at Makerere University for eight years now targeting these challenges. In June this year, John and other researchers from across the region came together for Africa’s first workshop on data science at Dedan Kimathi University of Technology. The objective was to spread knowledge of technologies, ideas and solutions. For the modern information infrastructure to be successful software solutions need to be locally generated. African apps to solve African problems. With this in mind the workshop began with a three day summer school on data science which was then followed by two days of talks on challenges in African data science.

The ideas and solutions presented were cutting edge. The Umati project uses social media to understand the use of ethnic hate speech in Kenya (Sidney Ochieng, iHub, Nairobi). The use of social media for monitoring the evolution and effects of Ebola in west Africa (Nuri Pashwani, IBM Research Africa). The Kudu system for market making in Ugandan farm produce distribution via SMS messages (Kenneth Bwire, Makerere University, Kampala). Telecommunications data for inferring the source and spread of a typhoid outbreak in Kampala (UN Pulse Lab, Kampala). The Punya system for prototyping and deployment of mobile phone apps to deal with emerging crises or market opportunities (Julius Adebayor, MIT) and large-scale systems for collating and sharing data resources, such as Open Data Kenya and the UN OCHA Humanitarian Data Exchange….(More)”

Algorithms and Bias


Q. and A. With Cynthia Dwork in the New York Times: “Algorithms have become one of the most powerful arbiters in our lives. They make decisions about the news we read, the jobs we get, the people we meet, the schools we attend and the ads we see.

Yet there is growing evidence that algorithms and other types of software can discriminate. The people who write them incorporate their biases, and algorithms often learn from human behavior, so they reflect the biases we hold. For instance, research has shown that ad-targeting algorithms have shown ads for high-paying jobs to men but not women, and ads for high-interest loans to people in low-income neighborhoods.

Cynthia Dwork, a computer scientist at Microsoft Research in Silicon Valley, is one of the leading thinkers on these issues. In an Upshot interview, which has been edited, she discussed how algorithms learn to discriminate, who’s responsible when they do, and the trade-offs between fairness and privacy.

Q: Some people have argued that algorithms eliminate discrimination because they make decisions based on data, free of human bias. Others say algorithms reflect and perpetuate human biases. What do you think?

A: Algorithms do not automatically eliminate bias. Suppose a university, with admission and rejection records dating back for decades and faced with growing numbers of applicants, decides to use a machine learning algorithm that, using the historical records, identifies candidates who are more likely to be admitted. Historical biases in the training data will be learned by the algorithm, and past discrimination will lead to future discrimination.
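Dwork's admissions example can be sketched in a few lines. The records below are invented, and the "learner" is deliberately naive (it just imitates historical admission rates), but it shows the mechanism: a model trained to match biased history reproduces the bias.

```python
# Hypothetical sketch: a naive learner that imitates historical
# admission records reproduces any bias those records contain.
# All records below are invented for illustration.

historical = [
    # (group, admitted?)
    ("A", True), ("A", True), ("A", True), ("A", False),
    ("B", False), ("B", False), ("B", False), ("B", True),
]

def train(records):
    """Learn the historical admission rate for each group."""
    counts = {}
    for group, admitted in records:
        n, k = counts.get(group, (0, 0))
        counts[group] = (n + 1, k + (1 if admitted else 0))
    return {g: k / n for g, (n, k) in counts.items()}

def predict(rates, group):
    """Admit whenever the group's historical rate exceeds 50%."""
    return rates[group] > 0.5

rates = train(historical)
print(rates)                 # {'A': 0.75, 'B': 0.25}
print(predict(rates, "A"))   # True  - past admission becomes future admission
print(predict(rates, "B"))   # False - past rejection becomes future rejection
```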

Q: Are there examples of that happening?

A: A famous example of a system that has wrestled with bias is the resident matching program that matches graduating medical students with residency programs at hospitals. The matching could be slanted to maximize the happiness of the residency programs, or to maximize the happiness of the medical students. Prior to 1997, the match was mostly about the happiness of the programs.

This changed in 1997 in response to “a crisis of confidence concerning whether the matching algorithm was unreasonably favorable to employers at the expense of applicants, and whether applicants could ‘game the system,’ ” according to a paper by Alvin Roth and Elliott Peranson published in The American Economic Review.

Q: You have studied both privacy and algorithm design, and co-wrote a paper, “Fairness Through Awareness,” that came to some surprising conclusions about discriminatory algorithms and people’s privacy. Could you summarize those?

A: “Fairness Through Awareness” makes the observation that sometimes, in order to be fair, it is important to make use of sensitive information while carrying out the classification task. This may be a little counterintuitive: The instinct might be to hide information that could be the basis of discrimination….

Q: The law protects certain groups from discrimination. Is it possible to teach an algorithm to do the same?

A: This is a relatively new problem area in computer science, and there are grounds for optimism — for example, resources from the Fairness, Accountability and Transparency in Machine Learning workshop, which considers the role that machines play in consequential decisions in areas like employment, health care and policing. This is an exciting and valuable area for research. …(More)”

A Visual Introduction to Machine Learning


R2D3 introduction: “In machine learning, computers apply statistical learning techniques to automatically identify patterns in data. These techniques can be used to make highly accurate predictions.

Keep scrolling. Using a data set about homes, we will create a machine learning model to distinguish homes in New York from homes in San Francisco…

 

  1. Machine learning identifies patterns by using statistical learning techniques and computers to unearth boundaries in data sets. You can use it to make predictions.
  2. One method for making predictions is called a decision tree, which uses a series of if-then statements to identify boundaries and define patterns in the data.
  3. Overfitting happens when some boundaries are based on distinctions that don’t make a difference. You can see if a model overfits by having test data flow through the model….(More)”
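The three points above can be sketched in a few lines. The data and the threshold are invented (a real learner would find the split from the data rather than have it hand-coded), but the shape of the idea is the same: an if-then boundary, checked against held-out test data.

```python
# A hand-written decision "tree" (a single if-then split on elevation)
# for an invented NY-vs-SF home data set, checked against test data.

train_data = [  # (elevation_m, city)
    (5, "NY"), (8, "NY"), (12, "NY"),
    (60, "SF"), (75, "SF"), (90, "SF"),
]
test_data = [(10, "NY"), (80, "SF"), (7, "NY")]

def tree(elevation):
    # if-then boundary chosen (by eye) from the training data
    if elevation > 30:
        return "SF"
    return "NY"

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

print(accuracy(tree, train_data))  # 1.0
# A large gap between training and test accuracy would signal overfitting.
print(accuracy(tree, test_data))   # 1.0
```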

Using Twitter as a data source: An overview of current social media research tools


Wasim Ahmed at the LSE Impact Blog: “I have a social media research blog where I find and write about tools that can be used to capture and analyse data from social media platforms. My PhD looks at Twitter data for health, such as the Ebola outbreak in West Africa. I am increasingly asked why I am looking at Twitter, and what tools and methods there are of capturing and analysing data from other platforms such as Facebook, or even less traditional platforms such as Amazon book reviews. Brainstorming a couple of responses to this question by talking to members of the New Social Media New Social Science network, there are at least six reasons:

  1. Twitter is a popular platform in terms of the media attention it receives and it therefore attracts more research due to its cultural status
  2. Twitter makes it easier to find and follow conversations (i.e., by both its search feature and by tweets appearing in Google search results)
  3. Twitter has hashtag norms which make it easier to gather, sort, and expand searches when collecting data
  4. Twitter data is easy to retrieve, as major incidents, news stories and events on Twitter tend to be centred around a hashtag
  5. The Twitter API is more open and accessible compared to other social media platforms, which makes Twitter more favourable to developers creating tools to access data. This consequently increases the availability of tools to researchers.
  6. Many researchers themselves are using Twitter and because of their favourable personal experiences, they feel more comfortable with researching a familiar platform.

It is probable that a combination of response 1 to 6 have led to more research on Twitter. However, this raises another distinct but closely related question: when research is focused so heavily on Twitter, what (if any) are the implications of this on our methods?

As for the methods that are currently used in analysing Twitter data i.e., sentiment analysis, time series analysis (examining peaks in tweets), network analysis etc., can these be applied to other platforms or are different tools, methods and techniques required? In addition to qualitative methods such as content analysis, I have used the following four methods in analysing Twitter data for the purposes of my PhD, below I consider whether these would work for other social media platforms:

  1. Sentiment analysis works well with Twitter data, as tweets are consistent in length (i.e., <= 140 characters), but would sentiment analysis work as well with, for example, Facebook data, where posts may be longer?
  2. Time series analysis is normally used when examining tweets over time to see when a peak of tweets may occur. Would examining time stamps in Facebook posts, or Instagram posts, for example, produce the same results? Or is this only a viable method because of the real-time nature of Twitter data?
  3. Network analysis is used to visualize the connections between people and to better understand the structure of the conversation. Would this work as well on other platforms whereby users may not be connected to each other i.e., public Facebook pages?
  4. Machine learning methods may work well with Twitter data due to the length of tweets (i.e., <= 140) but would these work for longer posts and for platforms that are not text based i.e., Instagram?
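Point 2 above can be made concrete with a minimal sketch: bucketing post timestamps by hour and looking for the peak works mechanically the same whether the timestamps come from tweets, Facebook posts or Instagram (the timestamps below are invented).

```python
# Time series sketch: count posts per hour and find the peak hour.
from collections import Counter
from datetime import datetime

timestamps = [
    "2015-03-15 09:12", "2015-03-15 09:47",
    "2015-03-15 10:03", "2015-03-15 10:10", "2015-03-15 10:55",
    "2015-03-15 11:30",
]

per_hour = Counter(
    datetime.strptime(ts, "%Y-%m-%d %H:%M").hour for ts in timestamps
)
peak_hour, peak_count = per_hour.most_common(1)[0]
print(peak_hour, peak_count)  # 10 3
```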

It may well be that at least some of these methods can be applied to other platforms, however they may not be the best methods, and may require the formulation of new methods, techniques, and tools.

So, what are some of the tools available to social scientists for social media data? In the table below I provide an overview of some of the tools I have been using (which require no programming knowledge and can be used by social scientists):…(More)”

The Data Revolution


Review of Rob Kitchin’s The Data Revolution: Big Data, Open Data, Data Infrastructures & their Consequences by David Moats in Theory, Culture and Society: “…As an industry, academia is not immune to cycles of hype and fashion. Terms like ‘postmodernism’, ‘globalisation’, and ‘new media’ have each had their turn filling the top line of funding proposals. Although they are each grounded in tangible shifts, these terms become stretched and fudged to the point of becoming almost meaningless. Yet, they elicit strong, polarised reactions. For at least the past few years, ‘big data’ seems to be the buzzword, which elicits funding, as well as the ire of many in the social sciences and humanities.

Rob Kitchin’s book The Data Revolution is one of the first systematic attempts to strip back the hype surrounding our current data deluge and take stock of what is really going on. This is crucial because this hype is underpinned by very real societal change, threats to personal privacy and shifts in store for research methods. The book acts as a helpful wayfinding device in an unfamiliar terrain, which is still being reshaped, and is admirably written in a language relevant to social scientists, comprehensible to policy makers and accessible even to the less tech savvy among us.

The Data Revolution seems to present itself as the definitive account of this phenomenon but in filling this role ends up adopting a somewhat diplomatic posture. Kitchin takes all the correct and reasonable stances on the matter and advocates all the right courses of action but he is not able to, in the context of this book, pursue these propositions fully. This review will attempt to tease out some of these latent potentials and how they might be pushed in future work, in particular the implications of the ‘performative’ character of both big data narratives and data infrastructures for social science research.

Kitchin’s book starts with the observation that ‘data’ is a misnomer – etymologically data should refer to phenomena in the world which can be abstracted, measured etc. as opposed to the representations and measurements themselves, which should by all rights be called ‘capta’. This is ironic because the worst offenders in what Kitchin calls “data boosterism” seem to conflate data with ‘reality’, unmooring data from its conditions of production and making the relationship between the two seem given or natural.

As Kitchin notes, following Bowker (2005), ‘raw data’ is an oxymoron: data are not so much mined as produced and are necessarily framed technically, ethically, temporally, spatially and philosophically. This is the central thesis of the book, that data and data infrastructures are not neutral and technical but also social and political phenomena. For those at the critical end of research with data, this is a starting assumption, but one which not enough practitioners heed. Most of the book is thus an attempt to flesh out these rapidly expanding data infrastructures and their politics….

Kitchin is at his best when revealing the gap between the narratives and the reality of data analysis such as the fallacy of empiricism – the assertion that, given the granularity and completeness of big data sets and the availability of machine learning algorithms which identify patterns within data (with or without the supervision of human coders), data can “speak for themselves”. Kitchin reminds us that no data set is complete and even these out-of-the-box algorithms are underpinned by theories and assumptions in their creation, and require context specific knowledge to unpack their findings. Kitchin also rightly raises concerns about the limits of big data, that access and interoperability of data is not given and that these gaps and silences are also patterned (Twitter is biased as a sample towards middle-class, white, tech-savvy people). Yet, this language of veracity and reliability seems to suggest that big data is being conceptualised in relation to traditional surveys, or that our population is still the nation state, when big data could helpfully force us to reimagine our analytic objects and truth conditions and more pressingly, our ethics (Rieder, 2013).

However, performativity may again complicate things. As Kitchin observes, supermarket loyalty cards do not just create data about shopping, they encourage particular sorts of shopping; when research subjects change their behaviour to cater to the metrics and surveillance apparatuses built into platforms like Facebook (Bucher, 2012), then these are no longer just data points representing the social, but partially constitutive of new forms of sociality (this is also true of other types of data as discussed by Savage (2010), but in perhaps less obvious ways). This might have implications for how we interpret data, the distribution between quantitative and qualitative approaches (Latour et al., 2012) or even more radical experiments (Wilkie et al., 2014). Kitchin is relatively cautious about proposing these sorts of possibilities, which is not the remit of the book, though it clearly leaves the door open…(More)”

How Data Mining could have prevented Tunisia’s Terror attack in Bardo Museum


Wassim Zoghlami at Medium: “…Data mining is the process of posing queries and extracting useful patterns or trends, often previously unknown, from large amounts of data using various techniques such as those from pattern recognition and machine learning. Lately there has been great interest in leveraging data mining for counter-terrorism applications.

Using data on more than 50,000 ISIS-connected Twitter accounts, I was able to establish an understanding of some of the factors that determine how often ISIS attacks occur, what different types of terror strikes are used in which geopolitical situations, and many other criteria, through graphs of the frequency of hashtag usage and of particular groups of words used in the tweets.

A simple data mining project on some of the repetitive hashtags and sequences of words typically used by ISIS militants in their tweets yielded surprising results. The results show a rise in some keywords in the tweets starting from March 15, three days before the Bardo museum attack.

Some of the common frequent keywords and hashtags that showed an unusual peak since March 15, three days before the attack:

#طواغيت تونس : Tyrants of Tunisia = a reference to the military

بشرى تونس : Good news for Tunisia.

قريبا تونس : Soon in Tunisia.

#إفريقية_للإعلام : The head of social media of Afriqiyah

#غزوة_تونس : The foray of Tunis…
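A hedged sketch of the kind of frequency analysis described: count a hashtag's daily occurrences and flag days that rise sharply above the baseline. The daily counts below are fabricated for illustration; the actual project worked over tweets from the 50,000+ accounts.

```python
# Spike detection on invented daily hashtag counts: compare each day
# against a baseline computed from the quiet days before the rise.

daily_counts = {  # hashtag occurrences per day
    "03-12": 2, "03-13": 3, "03-14": 2,
    "03-15": 14, "03-16": 18, "03-17": 21,  # peak begins March 15
}

baseline_days = ["03-12", "03-13", "03-14"]
baseline = sum(daily_counts[d] for d in baseline_days) / len(baseline_days)

# Flag any day whose count is more than triple the baseline
spikes = [d for d, c in daily_counts.items() if c > 3 * baseline]
print(spikes)  # ['03-15', '03-16', '03-17']
```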

Big Data and Data Mining should be used for national security intelligence

The Tunisian national security services have to leverage big data to predict such attacks, given the growing volume of digital data. One of the challenges facing data mining techniques is that, to carry out effective data mining and extract useful information for counterterrorism and national security, we need to gather all kinds of information about individuals. However, this information could be a threat to individuals’ privacy and civil liberties…(More)”

How Crowdsourcing And Machine Learning Will Change The Way We Design Cities


Shaunacy Ferro at FastCompany: “In 2011, researchers at the MIT Media Lab debuted Place Pulse, a website that served as a kind of “hot or not” for cities. Given two Google Street View images culled from a select few cities including New York City and Boston, the site asked users to click on the one that seemed safer, more affluent, or more unique. The result was an empirical way to measure urban aesthetics.

Now, that data is being used to predict what parts of cities feel the safest. StreetScore, a collaboration between the MIT Media Lab’s Macro Connections and Camera Culture groups, uses an algorithm to create a super high-resolution map of urban perceptions. The algorithmically generated data could one day be used to research the connection between urban perception and crime, as well as informing urban design decisions.

The algorithm, created by Nikhil Naik, a Ph.D. student in the Camera Culture lab, breaks an image down into its composite features—such as building texture, colors, and shapes. Based on how Place Pulse volunteers rated similar features, the algorithm assigns the streetscape a perceived safety score between 1 and 10. These scores are visualized as geographic points on a map, designed by MIT rising sophomore Jade Philipoom. Each image available from Google Maps in the two cities is represented by a colored dot: red for the locations that the algorithm tags as unsafe, and dark green for those that appear safest. The site, now limited to New York and Boston, will be expanded to feature Chicago and Detroit later this month, and eventually, with data collected from a new version of Place Pulse, will feature dozens of cities around the world….(More)”
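StreetScore's actual model is not described here beyond "features in, score out", so as an illustrative stand-in (not MIT's method), a nearest-neighbour scorer over crowd-rated feature vectors captures the idea: score a new streetscape by averaging the scores of the most similar rated images. Feature vectors and scores below are invented.

```python
# Hypothetical sketch: predict a streetscape's safety score (1-10)
# by averaging the crowd scores of the k most similar rated images.

rated = [  # (feature vector, crowd safety score 1-10)
    ([0.9, 0.1], 8.0),  # e.g. greenery, clean facades
    ([0.8, 0.2], 7.0),
    ([0.1, 0.9], 2.0),  # e.g. blank walls, clutter
    ([0.2, 0.8], 3.0),
]

def distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def predict_score(features, k=2):
    nearest = sorted(rated, key=lambda r: distance(r[0], features))[:k]
    return sum(score for _, score in nearest) / k

print(predict_score([0.85, 0.15]))  # 7.5
```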

Modern Methods for Sentiment Analysis


Review by Michael Czerny: “Sentiment analysis is a common application of Natural Language Processing (NLP) methodologies, particularly classification, whose goal is to extract the emotional content in text. In this way, sentiment analysis can be seen as a method to quantify qualitative data with some sentiment score. While sentiment is largely subjective, sentiment quantification has enjoyed many useful implementations, such as businesses gaining understanding about consumer reactions to a product, or detecting hateful speech in online comments.

The simplest form of sentiment analysis is to use a dictionary of good and bad words. Each word in a sentence has a score, typically +1 for positive sentiment and -1 for negative. Then, we simply add up the scores of all the words in the sentence to get a final sentiment total. Clearly, this has many limitations, the most important being that it neglects context and surrounding words. For example, in our simple model the phrase “not good” may be classified as 0 sentiment, given “not” has a score of -1 and “good” a score of +1. A human would likely classify “not good” as negative, despite the presence of “good”.
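The dictionary method described above fits in a few lines, including its "not good" failure case. The +1/-1 word scores are the usual toy lexicon.

```python
# Dictionary-based sentiment: sum per-word scores, ignoring context.
scores = {"good": 1, "great": 1, "bad": -1, "awful": -1, "not": -1}

def sentiment(text):
    return sum(scores.get(word, 0) for word in text.lower().split())

print(sentiment("a great movie"))  # 1
print(sentiment("not good"))       # 0 - context is lost, as noted above
```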

Another common method is to treat a text as a “bag of words”. We treat each text as a 1 by N vector, where N is the size of our vocabulary. Each column is a word, and the value is the number of times that word appears. For example, the phrase “bag of bag of words” might be encoded as [2, 2, 1]. This could then be fed into a machine learning algorithm for classification, such as logistic regression or SVM, to predict sentiment on unseen data. Note that this requires data with known sentiment to train on in a supervised fashion. While this is an improvement over the previous method, it still ignores context, and the size of the data increases with the size of the vocabulary.
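The bag-of-words encoding above, including the article's own "bag of bag of words" example, can be sketched directly; the resulting count vectors are what a classifier such as logistic regression or an SVM would consume.

```python
# Bag of words: map each text to a 1-by-N vector of word counts,
# where N is the size of a fixed vocabulary.
vocabulary = ["bag", "of", "words"]

def bag_of_words(text):
    words = text.lower().split()
    return [words.count(term) for term in vocabulary]

print(bag_of_words("bag of bag of words"))  # [2, 2, 1]
```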

Word2Vec and Doc2Vec

Recently, Google developed a method called Word2Vec that captures the context of words, while at the same time reducing the size of the data. Word2Vec is actually two different methods: Continuous Bag of Words (CBOW) and Skip-gram. In the CBOW method, the goal is to predict a word given the surrounding words. Skip-gram is the converse: we want to predict a window of words given a single word (see Figure 1). Both methods use artificial neural networks as their classification algorithm. Initially, each word in the vocabulary is a random N-dimensional vector. During training, the algorithm learns the optimal vector for each word using the CBOW or Skip-gram method….(More)
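How the two methods frame the prediction task can be sketched by building their training pairs from a sentence (window size 1 here for brevity); the neural-network training that turns these pairs into word vectors is omitted.

```python
# Building (context, target) pairs for CBOW and the converse
# (target, context) pairs for Skip-gram, with a window of 1.
sentence = "the cat sat on the mat".split()

def cbow_pairs(words, window=1):
    pairs = []
    for i, target in enumerate(words):
        context = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
        pairs.append((context, target))   # CBOW: predict word from context
    return pairs

def skipgram_pairs(words, window=1):
    # Skip-gram: predict each surrounding word from the target word
    return [(t, c) for context, t in cbow_pairs(words, window) for c in context]

print(cbow_pairs(sentence)[1])       # (['the', 'sat'], 'cat')
print(skipgram_pairs(sentence)[:2])  # [('the', 'cat'), ('cat', 'the')]
```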