Using Twitter as a data source: An overview of current social media research tools


Wasim Ahmed at the LSE Impact Blog: “I have a social media research blog where I find and write about tools that can be used to capture and analyse data from social media platforms. My PhD examines Twitter data on health topics, such as the Ebola outbreak in West Africa. I am increasingly asked why I am looking at Twitter, and what tools and methods exist for capturing and analysing data from other platforms such as Facebook, or even less traditional sources such as Amazon book reviews. After brainstorming responses to this question with members of the New Social Media, New Social Science network, I can suggest at least six reasons:

  1. Twitter is a popular platform in terms of the media attention it receives and it therefore attracts more research due to its cultural status
  2. Twitter makes it easier to find and follow conversations (i.e., both via its search feature and via tweets appearing in Google search results)
  3. Twitter’s hashtag norms make it easier to gather, sort, and expand searches when collecting data
  4. Twitter data is easy to retrieve, as major incidents, news stories and events on Twitter tend to be centred around a hashtag
  5. The Twitter API is more open and accessible compared to other social media platforms, which makes Twitter more favourable to developers creating tools to access data. This consequently increases the availability of tools to researchers.
  6. Many researchers use Twitter themselves and, because of their favourable personal experiences, feel more comfortable researching a familiar platform.

It is probable that a combination of reasons 1 to 6 has led to more research on Twitter. However, this raises another distinct but closely related question: when research is focused so heavily on Twitter, what (if any) are the implications for our methods?

As for the methods currently used in analysing Twitter data (sentiment analysis, time series analysis examining peaks in tweets, network analysis, and so on), can these be applied to other platforms, or are different tools, methods and techniques required? In addition to qualitative methods such as content analysis, I have used the following four methods in analysing Twitter data for my PhD; below I consider whether each would work for other social media platforms:

  1. Sentiment analysis works well with Twitter data, as tweets are consistent in length (i.e., <= 140 characters). Would sentiment analysis work as well with, for example, Facebook data, where posts may be longer?
  2. Time series analysis is normally used to examine tweets over time and to identify when a peak in tweets occurs (a minimal sketch of this kind of peak detection follows this list). Would examining time stamps in Facebook or Instagram posts, for example, produce the same results, or is this only a viable method because of the real-time nature of Twitter data?
  3. Network analysis is used to visualize the connections between people and to better understand the structure of the conversation. Would this work as well on other platforms where users may not be connected to each other, e.g., public Facebook pages?
  4. Machine learning methods may work well with Twitter data due to the length of tweets (i.e., <= 140 characters), but would they work for longer posts and for platforms that are not text based, e.g., Instagram?
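To make the time-series point concrete, here is a minimal sketch (not taken from the PhD work described above) of binning tweet timestamps into hourly counts and flagging peaks; the pandas resampling, the example timestamps, and the two-standard-deviation threshold are all illustrative assumptions.

```python
import pandas as pd

# Hypothetical input: one timestamp per collected tweet for a given hashtag.
timestamps = pd.to_datetime([
    "2015-06-01 09:12", "2015-06-01 09:47", "2015-06-01 10:03",
    "2015-06-01 10:05", "2015-06-01 10:20", "2015-06-02 14:30",
])

# Bin the tweets into hourly counts; the same resampling would work for
# Facebook or Instagram posts, provided each post carries a timestamp.
counts = pd.Series(1, index=timestamps).resample("1h").sum()

# Flag hours that sit well above the average as candidate peaks.
threshold = counts.mean() + 2 * counts.std()
print(counts[counts > threshold])
```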

It may well be that at least some of these methods can be applied to other platforms; however, they may not be the best fit, and other platforms may require the formulation of new methods, techniques, and tools.

So, what are some of the tools available to social scientists for social media data? In the table below I provide an overview of some of the tools I have been using (which require no programming knowledge and can be used by social scientists):…(More)”

The Data Revolution


Review of Rob Kitchin’s The Data Revolution: Big Data, Open Data, Data Infrastructures & their Consequences by David Moats in Theory, Culture and Society: “…As an industry, academia is not immune to cycles of hype and fashion. Terms like ‘postmodernism’, ‘globalisation’, and ‘new media’ have each had their turn filling the top line of funding proposals. Although they are each grounded in tangible shifts, these terms become stretched and fudged to the point of becoming almost meaningless. Yet, they elicit strong, polarised reactions. For at least the past few years, ‘big data’ seems to be the buzzword, which elicits funding, as well as the ire of many in the social sciences and humanities.

Rob Kitchin’s book The Data Revolution is one of the first systematic attempts to strip back the hype surrounding our current data deluge and take stock of what is really going on. This is crucial because this hype is underpinned by very real societal change, threats to personal privacy and shifts in store for research methods. The book acts as a helpful wayfinding device in an unfamiliar terrain, which is still being reshaped, and is admirably written in a language relevant to social scientists, comprehensible to policy makers and accessible even to the less tech savvy among us.

The Data Revolution seems to present itself as the definitive account of this phenomenon but in filling this role ends up adopting a somewhat diplomatic posture. Kitchin takes all the correct and reasonable stances on the matter and advocates all the right courses of action, but he is not able, in the context of this book, to pursue these propositions fully. This review will attempt to tease out some of these latent potentials and how they might be pushed in future work, in particular the implications of the ‘performative’ character of both big data narratives and data infrastructures for social science research.

Kitchin’s book starts with the observation that ‘data’ is a misnomer – etymologically, data should refer to phenomena in the world which can be abstracted, measured etc., as opposed to the representations and measurements themselves, which should by all rights be called ‘capta’. This is ironic because the worst offenders in what Kitchin calls “data boosterism” seem to conflate data with ‘reality’, unmooring data from its conditions of production and making the relationship between the two seem given or natural.

As Kitchin notes, following Bowker (2005), ‘raw data’ is an oxymoron: data are not so much mined as produced and are necessarily framed technically, ethically, temporally, spatially and philosophically. This is the central thesis of the book, that data and data infrastructures are not neutral and technical but also social and political phenomena. For those at the critical end of research with data, this is a starting assumption, but one which not enough practitioners heed. Most of the book is thus an attempt to flesh out these rapidly expanding data infrastructures and their politics….

Kitchin is at his best when revealing the gap between the narratives and the reality of data analysis such as the fallacy of empiricism – the assertion that, given the granularity and completeness of big data sets and the availability of machine learning algorithms which identify patterns within data (with or without the supervision of human coders), data can “speak for themselves”. Kitchin reminds us that no data set is complete and even these out-of-the-box algorithms are underpinned by theories and assumptions in their creation, and require context-specific knowledge to unpack their findings. Kitchin also rightly raises concerns about the limits of big data: access to and interoperability of data are not a given, and these gaps and silences are also patterned (Twitter is biased as a sample towards middle-class, white, tech-savvy people). Yet, this language of veracity and reliability seems to suggest that big data is being conceptualised in relation to traditional surveys, or that our population is still the nation state, when big data could helpfully force us to reimagine our analytic objects and truth conditions and, more pressingly, our ethics (Rieder, 2013).

However, performativity may again complicate things. As Kitchin observes, supermarket loyalty cards do not just create data about shopping, they encourage particular sorts of shopping; when research subjects change their behaviour to cater to the metrics and surveillance apparatuses built into platforms like Facebook (Bucher, 2012), then these are no longer just data points representing the social, but partially constitutive of new forms of sociality (this is also true of other types of data as discussed by Savage (2010), but in perhaps less obvious ways). This might have implications for how we interpret data, the distribution between quantitative and qualitative approaches (Latour et al., 2012) or even more radical experiments (Wilkie et al., 2014). Kitchin is relatively cautious about proposing these sorts of possibilities, which is not the remit of the book, though it clearly leaves the door open…(More)”

A Research Roadmap for Human Computation


Emerging Technology from the arXiv: “The wisdom of the crowd has become so powerful and so accessible via the Internet that it has become a resource in its own right. Various services now tap into this rich supply of human cognition, such as Wikipedia, Duolingo, and Amazon’s Mechanical Turk.

So important is this resource that scientists have given it a name; they call it human computation. And a rapidly emerging and increasingly important question is how best to exploit it.

Today, we get an answer of sorts thanks to a group of computer scientists, crowdsourcing pioneers, and visionaries who have created a roadmap for research into human computation. The team, led by Pietro Michelucci at the Human Computation Institute, points out that human computation systems have been hugely successful at tackling complex problems, from identifying spiral galaxies to organizing disaster relief.

But their potential is even greater still, provided that human cognition can be efficiently harnessed on a global scale. Last year, they met to discuss these issues and have now published the results of their debate.

They begin by pointing out the extraordinary successes of human computation… then describe the kinds of projects they want to create. They call one idea Project Houston after the crowdsourced effort on the ground that helped bring back the Apollo 13 astronauts after an on-board explosion on the way to the moon.

Their idea is that similar help can be brought to bear from around the world when individuals on earth find themselves in trouble. By this they mean individuals who might be considering suicide or suffering from depression, for example.

The plan is to use state-of-the-art speech analysis and natural language understanding to detect stress and offer help. This would come in the form of composite personalities made up from individuals with varying levels of expertise in the crowd, supported by artificial intelligence techniques. “Project Houston could provide a consistently kind and patient personality even if the “crowd” changes completely over time,” they say.

Another idea is to build on the way that crowdsourcing helps people learn. One example of this is Duolingo, an app that offers free language lessons while simultaneously acting as a document translation service. “Why stop with language learning and translation?” they ask.

A similar approach could help people learn new skills as they work online, a process that should allow them to take on more complex roles. One example is in the field of radiology, where an important job is to recognize tumors on x-ray images. This is a task that machine vision algorithms do not yet perform reliably…..

Yet another idea would be to crowdsource information that helps the poorest families in America find social welfare programs. These programs are often difficult to navigate and represent a disproportionate hardship for the people who are most likely to benefit from them: those who are homeless, who have disabilities, who are on low income, and so on.

The idea is that the crowd should take on some of this burden, freeing up this group for other tasks, like finding work, managing health problems, and so on.

These are worthy goals but they raise some significant questions. Chief among these is the nature of the ethical, legal, and social implications of human computation. How can this work be designed to allow meaningful and dignified human participation? How can the outcomes be designed so that the most vulnerable people can benefit from it? And what is the optimal division of labor between machines and humans to produce a specific result?

Ref: arxiv.org/abs/1505.07096: A U.S. Research Roadmap for Human Computation”

Open data could save the NHS hundreds of millions, says top UK scientist


The Guardian: “The UK government must open up and highlight the power of more basic data sets to improve patient care in the NHS and save hundreds of millions of pounds a year, Nigel Shadbolt, chairman of the Open Data Institute (ODI), has urged.

The UK government topped the first league table for open data (paywall) produced by the ODI last year, but Shadbolt warns that ministers’ open data responsibilities have not yet been satisfied.

Basic data on prescription administration is now published on a monthly basis but Shadbolt said medical practitioners must be educated about the power of this data to change prescribing habits across the country.

Other data sets that have been promised to make the NHS more accessible, such as trusts’ opening times, consultant lists and details of services, are not currently available in a machine-readable form.

“These basic sets of information about the processes, the people and places in the health system are all fragmented and fractured and many of them are not available as registers that you can go to,” Shadbolt said.

“Whenever you talk about health data people think you must be talking about personal data and patient data and there are issues, obviously, of absolutely protecting privacy there. But there’s lots of data in the health service that is not about personal patient data at all that would be hugely useful to just have available as machine-readable data for apps to use.”

The UK government has led the way in recent years in encouraging transparency and accountability within the NHS by opening up league tables. The publication of league tables on MRSA was followed by a 76-79% drop in infections.

Shadbolt said: “Those hospitals that were worst in their league table don’t like to be there and there was a very rapid diffusion of understanding of best practice across them that you can quantify. It’s many millions of pounds being saved.”

The artificial intelligence and open data expert said the next big area for open data improvement in the NHS is around prescriptions.

Shadbolt pointed to the publication of data about the prescription of statins, which has helped identify savings worth hundreds of millions of pounds: “There is little doubt that this pattern is likely to exist across the whole of the prescribing space.”…(More)”

How Data Mining could have prevented Tunisia’s Terror attack in Bardo Museum


Wassim Zoghlami at Medium: “…Data mining is the process of posing queries and extracting useful patterns or trends, often previously unknown, from large amounts of data using various techniques such as those from pattern recognition and machine learning. Lately there has been considerable interest in leveraging data mining for counter-terrorism applications.

Using data on more than 50,000 ISIS-connected Twitter accounts, I was able to build an understanding of some of the factors that determine how often ISIS attacks occur, what types of terror strikes are used in which geopolitical situations, and many other criteria, through graphs of hashtag frequencies and of the frequency of particular groups of words used in the tweets.

A simple data mining project on some of the hashtags and sequences of words typically repeated by ISIS militants in their tweets yielded surprising results. The results show a rise in certain keywords in the tweets beginning on March 15, three days before the Bardo museum attack.

Some of the common keywords and hashtags that showed an unusual peak from March 15, three days before the attack (a minimal sketch of this kind of frequency analysis follows the list):

#طواغيت تونس : Tyrants of Tunisia = a reference to the military

بشرى تونس : Good news for Tunisia.

قريبا تونس : Soon in Tunisia.

#إفريقية_للإعلام : The head of social media of Afriqiyah

#غزوة_تونس : The foray of Tunis…
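As an illustration only, since the post does not describe the actual pipeline, a frequency analysis of this kind can be sketched by counting how often each hashtag appears per day and flagging terms whose daily count jumps well above their recent average. The counting code, the placeholder hashtag, and the spike rule below are assumptions, not the method used in the original analysis.

```python
from collections import Counter, defaultdict

# Hypothetical input: (date, tweet_text) pairs from a collected corpus.
tweets = [
    ("2015-03-13", "... #example_tag ..."),
    ("2015-03-14", "... #example_tag ..."),
    ("2015-03-15", "... #example_tag ... #example_tag ..."),
    ("2015-03-15", "... #example_tag ..."),
]

# Count how often each hashtag appears on each day.
daily = defaultdict(Counter)
for date, text in tweets:
    for token in text.split():
        if token.startswith("#"):
            daily[date][token] += 1

# Flag hashtags whose daily count is more than twice their average over the
# preceding days (an arbitrary rule chosen purely for illustration).
dates = sorted(daily)
for i, date in enumerate(dates[1:], start=1):
    for tag, count in daily[date].items():
        baseline = sum(daily[d][tag] for d in dates[:i]) / i
        if count > 2 * max(baseline, 1):
            print(f"{date}: unusual peak for {tag} ({count} vs ~{baseline:.1f}/day)")
```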

Big Data and Data Mining should be used for national security intelligence

Tunisian national security services have to leverage big data to predict such attacks and to achieve their objectives as the volume of digital data grows. One of the challenges facing data mining techniques is that, to carry out effective data mining and extract useful information for counterterrorism and national security, we need to gather all kinds of information about individuals. However, this information could be a threat to individuals’ privacy and civil liberties…(More)”

How Crowdsourcing And Machine Learning Will Change The Way We Design Cities


Shaunacy Ferro at FastCompany: “In 2011, researchers at the MIT Media Lab debuted Place Pulse, a website that served as a kind of “hot or not” for cities. Given two Google Street View images culled from a select few cities including New York City and Boston, the site asked users to click on the one that seemed safer, more affluent, or more unique. The result was an empirical way to measure urban aesthetics.

Now, that data is being used to predict what parts of cities feel the safest. StreetScore, a collaboration between the MIT Media Lab’s Macro Connections and Camera Culture groups, uses an algorithm to create a super high-resolution map of urban perceptions. The algorithmically generated data could one day be used to research the connection between urban perception and crime, as well as to inform urban design decisions.

The algorithm, created by Nikhil Naik, a Ph.D. student in the Camera Culture lab, breaks an image down into its composite features—such as building texture, colors, and shapes. Based on how Place Pulse volunteers rated similar features, the algorithm assigns the streetscape a perceived safety score between 1 and 10. These scores are visualized as geographic points on a map, designed by MIT rising sophomore Jade Philipoom. Each image available from Google Maps in the two cities is represented by a colored dot: red for the locations that the algorithm tags as unsafe, and dark green for those that appear safest. The site, now limited to New York and Boston, will be expanded to feature Chicago and Detroit later this month, and eventually, with data collected from a new version of Place Pulse, will feature dozens of cities around the world….(More)”
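As a very rough illustration of the general idea, and emphatically not the actual StreetScore feature set or model (which the article does not give in code), one could reduce each street image to simple colour-histogram features and fit a regression against crowdsourced 1-to-10 safety ratings. Everything in the sketch below, including the placeholder images, the histogram features, and the choice of ridge regression, is an assumption made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

def color_features(image_rgb, bins=8):
    """Summarize an image by per-channel colour histograms (a crude stand-in
    for the texture/colour/shape features the article mentions)."""
    return np.concatenate([
        np.histogram(image_rgb[..., c], bins=bins, range=(0.0, 1.0), density=True)[0]
        for c in range(3)
    ])

# Hypothetical training data: street images rated 1-10 by Place Pulse volunteers.
rng = np.random.default_rng(0)
images = rng.random((40, 64, 64, 3))    # placeholder images
ratings = rng.uniform(1, 10, size=40)   # placeholder crowd safety scores

X = np.array([color_features(img) for img in images])
model = Ridge().fit(X, ratings)

# Score an unrated streetscape on the same 1-10 perceived-safety scale.
new_image = rng.random((64, 64, 3))
print(model.predict([color_features(new_image)])[0])
```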

Modern Methods for Sentiment Analysis


Review by Michael Czerny: “Sentiment analysis is a common application of Natural Language Processing (NLP) methodologies, particularly classification, whose goal is to extract the emotional content in text. In this way, sentiment analysis can be seen as a method to quantify qualitative data with some sentiment score. While sentiment is largely subjective, sentiment quantification has enjoyed many useful implementations, such as businesses gaining understanding about consumer reactions to a product, or detecting hateful speech in online comments.

The simplest form of sentiment analysis is to use a dictionary of good and bad words. Each word in a sentence has a score, typically +1 for positive sentiment and -1 for negative. Then, we simply add up the scores of all the words in the sentence to get a final sentiment total. Clearly, this has many limitations, the most important being that it neglects context and surrounding words. For example, in our simple model the phrase “not good” may be classified as 0 sentiment, given “not” has a score of -1 and “good” a score of +1. A human would likely classify “not good” as negative, despite the presence of “good”.
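A minimal sketch of this dictionary approach, using a tiny placeholder lexicon rather than a real one, makes the limitation easy to see:

```python
# Toy lexicon; a real system would use a much larger sentiment dictionary.
scores = {"good": 1, "great": 1, "bad": -1, "terrible": -1, "not": -1}

def dictionary_sentiment(text):
    """Sum per-word scores, ignoring context and word order entirely."""
    return sum(scores.get(word, 0) for word in text.lower().split())

print(dictionary_sentiment("the service was not good"))  # 0: the negation is lost
print(dictionary_sentiment("not terrible at all"))       # -2: misread as negative
```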

Another common method is to treat a text as a “bag of words”. We treat each text as a 1 by N vector, where N is the size of our vocabulary. Each column is a word, and the value is the number of times that word appears. For example, the phrase “bag of bag of words” might be encoded as [2, 2, 1]. This could then be fed into a machine learning algorithm for classification, such as logistic regression or SVM, to predict sentiment on unseen data. Note that this requires data with known sentiment to train on in a supervised fashion. While this is an improvement over the previous method, it still ignores context, and the size of the data increases with the size of the vocabulary.
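A hedged sketch of this bag-of-words pipeline using scikit-learn might look as follows; the toy training sentences and labels are invented for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny labelled corpus (1 = positive, 0 = negative), invented for this sketch.
texts = [
    "great product, works really well",
    "bad quality and not good value",
    "good value and a great buy",
    "terrible product, broke after a day",
]
labels = [1, 0, 1, 0]

# Encode each text as a vector of word counts over the training vocabulary.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# Train a supervised classifier on the known sentiment labels.
clf = LogisticRegression().fit(X, labels)

# Predict sentiment for unseen text using the same vocabulary.
print(clf.predict(vectorizer.transform(["a really good product"])))
```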

Word2Vec and Doc2Vec

Recently, Google developed a method called Word2Vec that captures the context of words, while at the same time reducing the size of the data. Word2Vec is actually two different methods: Continuous Bag of Words (CBOW) and Skip-gram. In the CBOW method, the goal is to predict a word given the surrounding words. Skip-gram is the converse: we want to predict a window of words given a single word (see Figure 1). Both methods use artificial neural networks as their classification algorithm. Initially, each word in the vocabulary is a random N-dimensional vector. During training, the algorithm learns the optimal vector for each word using the CBOW or Skip-gram method….(More)
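For readers who want to experiment, a minimal sketch using the gensim implementation might look like this; the toy corpus and parameter values are illustrative, and parameter names can differ between gensim versions:

```python
from gensim.models import Word2Vec

# Toy corpus: each document is a list of tokens; real models need far more text.
sentences = [
    ["the", "movie", "was", "great"],
    ["the", "film", "was", "fantastic"],
    ["the", "movie", "was", "terrible"],
]

# sg=0 selects CBOW (predict a word from its context); sg=1 selects Skip-gram.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

vector = model.wv["movie"]              # the learned 50-dimensional vector
print(model.wv.most_similar("movie"))   # nearest words in the embedding space
```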

Citizen Science for Citizen Access to Law


Paper by Michael Curtotti, Wayne Weibel, Eric McCreath, Nicolas Ceynowa, Sara Frug, and Tom R Bruce: “This paper sits at the intersection of citizen access to law, legal informatics and plain language. The paper reports the results of a joint project of the Cornell University Legal Information Institute and the Australian National University which collected thousands of crowdsourced assessments of the readability of law through the Cornell LII site. The aim of the project is to enhance accuracy in the prediction of the readability of legal sentences. The study requested readers on legislative pages of the LII site to rate passages from the United States Code and the Code of Federal Regulations and other texts for readability and other characteristics. The research provides insight into who uses legal rules and how they do so. The study enables conclusions to be drawn as to the current readability of law and spread of readability among legal rules. The research is intended to enable the creation of a dataset of legal rules labelled by human judges as to readability. Such a dataset, in combination with machine learning, will assist in identifying factors in legal language which impede readability and access for citizens. As far as we are aware, this research is the largest ever study of readability and usability of legal language and the first research which has applied crowdsourcing to such an investigation. The research is an example of the possibilities open for enhancing access to law through engagement of end users in the online legal publishing environment for enhancement of legal accessibility and through collaboration between legal publishers and researchers….(More)”
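While the paper builds its dataset from crowdsourced readability ratings gathered on the LII site, a hedged sketch of how such labelled sentences could feed a simple readability classifier might look as follows; the surface features, example sentences, and model choice below are assumptions for illustration, not the authors' method.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def surface_features(sentence):
    """Crude surface features often associated with readability:
    number of words and average word length."""
    words = sentence.split()
    return [len(words), float(np.mean([len(w) for w in words]))]

# Hypothetical crowd-labelled sentences (1 = rated readable, 0 = not).
sentences = [
    "The permit must be renewed every year.",
    "Notwithstanding the foregoing, the obligations herein shall survive termination.",
    "You may appeal the decision within thirty days.",
    "The aforementioned provisions shall be construed in accordance therewith.",
]
labels = [1, 0, 1, 0]

X = np.array([surface_features(s) for s in sentences])
clf = LogisticRegression().fit(X, labels)

# Predict whether an unseen legal sentence would be rated readable.
print(clf.predict([surface_features("The agency shall promulgate regulations thereunder.")]))
```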

The End of Asymmetric Information


Essay by Alex Tabarrok and Tyler Cowen: “Might the age of asymmetric information – for better or worse – be over?  Market institutions are rapidly evolving to a situation where very often the buyer and the seller have roughly equal knowledge. Technological developments are giving everyone who wants it access to the very best information when it comes to product quality, worker performance, matches to friends and partners, and the nature of financial transactions, among many other areas.

These developments will have implications for how markets work, how much consumers benefit, and also economic policy and the law. As we will see, there may be some problematic sides to these new arrangements, specifically when it comes to privacy. Still, a large amount of economic regulation seems directed at a set of problems which, in large part, no longer exist…

Many “public choice” problems are really problems of asymmetric information. In William Niskanen’s (1974) model of bureaucracy, government workers usually benefit from larger bureaus, and they are able to expand their bureaus to inefficient size because they are the primary providers of information to politicians. Some bureaus, such as the NSA and the CIA, may still be able to use secrecy to benefit from information asymmetry. For instance they can claim to politicians that they need more resources to deter or prevent threats, and it is hard for the politicians to have well-informed responses on the other side of the argument. Timely, rich information about most other bureaucracies, however, is easily available to politicians and increasingly to the public as well. As information becomes more symmetric, Niskanen’s (1974) model becomes less applicable, and this may help check the growth of unneeded bureaucracy.

Cheap sensors are greatly extending how much information can be economically gathered and analyzed. It’s not uncommon for office workers to have every keystroke logged. When calling customer service, who has not been told “this call may be monitored for quality control purposes”? Service-call workers have their location tracked through cell phones. Even information that once was thought to be purely subjective can now be collected and analyzed, often with the aid of smart software or artificial intelligence. One firm, for example, uses badges equipped with microphones, accelerometers, and location sensors to measure tone of voice, posture, and body language, as well as who spoke to whom and for how long (Lohr 2014). The purpose is not only to monitor workers but to deduce when, where and why workers are the most productive. We are again seeing trade-offs which bring greater productivity, and limit asymmetric information, albeit at the expense of some privacy.

As information becomes more prevalent and symmetric, earlier solutions to asymmetric problems will become less necessary. When employers do not easily observe workers, for example, employers may pay workers unusually high wages, generating a rent. Workers will then work at high levels despite infrequent employer observation, to maintain their future rents (Shapiro and Stiglitz 1984). But those higher wages involved a cost, namely that fewer workers were hired, and the hires that were made often were directed to people who were already known to the firm. Better monitoring of workers will mean that employers will hire more people and furthermore they may be more willing to take chances on risky outsiders, rather than those applicants who come with impeccable pedigree. If the outsider does not work out and produce at an acceptable level, it is easy enough to figure this out and fire them later on….(More)”

Big Data for Social Good


Introduction to a Special Issue of the Journal "Big Data" by Charlie Catlett and Rayid Ghani: "…organizations focused on social good are realizing the potential as well but face several challenges as they seek to become more data-driven. The biggest challenge they face is a paucity of examples and case studies on how data can be used for social good. This special issue of Big Data is targeted at tackling that challenge and focuses on highlighting some exciting and impactful examples of work that uses data for social good. The special issue is just one example of the recent surge in such efforts by the data science community. …

This special issue solicited case studies and problem statements that would either highlight (1) the use of data to solve a social problem or (2) social challenges that need data-driven solutions. From roughly 20 submissions, we selected 5 articles that exemplify this type of work. These cover five broad application areas: international development, healthcare, democracy and government, human rights, and crime prevention.

“Understanding Democracy and Development Traps Using a Data-Driven Approach” (Ranganathan et al.) details a data-driven model linking democracy, cultural values, and socioeconomic indicators, identifying two types of “traps” that hinder the development of democracy. They use historical data to detect causal factors and make predictions about the time expected for a given country to overcome these traps.

“Targeting Villages for Rural Development Using Satellite Image Analysis” (Varshney et al.) discusses two case studies that use data and machine learning techniques for international economic development—solar-powered microgrids in rural India and targeting financial aid to villages in sub-Saharan Africa. In the process, the authors stress the importance of understanding the characteristics and provenance of the data and the criticality of incorporating local “on the ground” expertise.

In “Human Rights Event Detection from Heterogeneous Social Media Graphs,” Chen and Neil describe efficient and scalable techniques to use social media in order to detect emerging patterns in human rights events. They test their approach on recent events in Mexico and show that they can accurately detect relevant human rights–related tweets prior to international news sources, and in some cases, prior to local news reports, which could potentially lead to more timely, targeted, and effective advocacy by relevant human rights groups.

“Finding Patterns with a Rotten Core: Data Mining for Crime Series with Core Sets” (Wang et al.) describes a case study with the Cambridge Police Department, using a subspace clustering method to analyze the department’s full housebreak database, which contains detailed information from thousands of crimes from over a decade. They find that the method allows human crime analysts to handle vast amounts of data and provides new insights into true patterns of crime committed in Cambridge…..(More)