Big Data for Social Good


Introduction to a Special Issue of the Journal “Big Data” by Charlie Catlett and Rayid Ghani: “…organizations focused on social good are realizing the potential as well but face several challenges as they seek to become more data-driven. The biggest challenge they face is a paucity of examples and case studies on how data can be used for social good. This special issue of Big Data is targeted at tackling that challenge and focuses on highlighting some exciting and impactful examples of work that uses data for social good. The special issue is just one example of the recent surge in such efforts by the data science community. …

This special issue solicited case studies and problem statements that would either highlight (1) the use of data to solve a social problem or (2) social challenges that need data-driven solutions. From roughly 20 submissions, we selected 5 articles that exemplify this type of work. These cover five broad application areas: international development, healthcare, democracy and government, human rights, and crime prevention.

“Understanding Democracy and Development Traps Using a Data-Driven Approach” (Ranganathan et al.) presents a data-driven model of the interplay between democracy, cultural values, and socioeconomic indicators, identifying two types of “traps” that hinder the development of democracy. The authors use historical data to detect causal factors and to predict how long a given country can be expected to take to overcome these traps.

“Targeting Villages for Rural Development Using Satellite Image Analysis” (Varshney et al.) discusses two case studies that use data and machine learning techniques for international economic development—solar-powered microgrids in rural India and targeting financial aid to villages in sub-Saharan Africa. In the process, the authors stress the importance of understanding the characteristics and provenance of the data and the criticality of incorporating local “on the ground” expertise.

In “Human Rights Event Detection from Heterogeneous Social Media Graphs,” Chen and Neil describe efficient and scalable techniques to use social media in order to detect emerging patterns in human rights events. They test their approach on recent events in Mexico and show that they can accurately detect relevant human rights–related tweets prior to international news sources, and in some cases, prior to local news reports, which could potentially lead to more timely, targeted, and effective advocacy by relevant human rights groups.

“Finding Patterns with a Rotten Core: Data Mining for Crime Series with Core Sets” (Wang et al.) describes a case study with the Cambridge Police Department, using a subspace clustering method to analyze the department’s full housebreak database, which contains detailed information from thousands of crimes spanning more than a decade. They find that the method allows human crime analysts to handle vast amounts of data and provides new insights into true patterns of crime committed in Cambridge…(More)

What Your Tweets Say About You


at the New Yorker: “How much can your tweets reveal about you? Judging by the last nine hundred and seventy-two words that I used on Twitter, I’m about average when it comes to feeling upbeat and being personable, and I’m less likely than most people to be depressed or angry. That, at least, is the snapshot provided by AnalyzeWords, one of the latest creations from James Pennebaker, a psychologist at the University of Texas who studies how language relates to well-being and personality. One of Pennebaker’s most famous projects is a computer program called Linguistic Inquiry and Word Count (L.I.W.C.), which looks at the words we use, and in what frequency and context, and uses this information to gauge our psychological states and various aspects of our personality….

Take a study, out last month, from a group of researchers based at the University of Pennsylvania. The psychologist Johannes Eichstaedt and his colleagues analyzed eight hundred and twenty-six million tweets across fourteen hundred American counties. (The counties contained close to ninety per cent of the U.S. population.) Then, using lists of words—some developed by Pennebaker, others by Eichstaedt’s team—that can be reliably associated with anger, anxiety, social engagement, and positive and negative emotions, they gave each county an emotional profile. Finally, they asked a simple question: Could those profiles help determine which counties were likely to have more deaths from heart disease?
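The lexicon-based scoring the study relies on can be sketched in a few lines: count how often words from each category’s word list appear in a body of tweets and normalize by the total word count. This is a minimal illustration with invented three-word lexicons, not Pennebaker’s or Eichstaedt’s actual dictionaries:

```python
from collections import Counter

# Invented toy lexicons for illustration only; the study's dictionaries
# (LIWC and Eichstaedt's lists) are far larger and empirically validated.
LEXICON = {
    "anger": {"hate", "annoyed", "furious"},
    "positive": {"great", "wonderful", "friends"},
}

def emotional_profile(tweets):
    """Share of all words that match each category's word list."""
    words = [w for tweet in tweets for w in tweet.lower().split()]
    counts = Counter(words)
    total = sum(counts.values())
    return {
        category: sum(counts[w] for w in wordlist) / total
        for category, wordlist in LEXICON.items()
    }

profile = emotional_profile([
    "hate this traffic so furious",
    "wonderful day with friends",
])
```

Scaling scores like these to the county level and pairing them with mortality data is, in spirit, what the study did, albeit with far more extensive word lists and statistical controls.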

The answer, it turned out, was yes….

The researchers have a theory: they suggest that “the language of Twitter may be a window into the aggregated and powerful effects of the community context.” They point to other epidemiological studies which have shown that general facts about a community, such as its “social cohesion and social capital,” have consequences for the health of individuals. Broadly speaking, people who live in poorer, more fragmented communities are less healthy than people living in richer, integrated ones. “When we do a sub-analysis, we find that the power that Twitter has is in large part accounted for by community and socioeconomic variables,” Eichstaedt told me when we spoke over Skype. In short, a young person’s negative, angry, and stressed-out tweets might reflect his or her stress-inducing environment—and that same environment may have negative health repercussions for other, older members of the same community….(More)”

Data for policy: when the haystack is made of needles. A call for contributions


Diana Vlad-Câlcic at the European Commission: “If policy-making is ‘whatever government chooses to do or not to do’ (Th. Dye), then how do governments actually decide? Evidence-based policy-making is not a new answer to this question, but it is constantly challenging both policy-makers and scientists to sharpen their thinking, their tools and their responsiveness. The European Commission has recognised this and has embedded an evidence-informed decision-making approach in its processes, namely through Impact Assessment, policy monitoring and evaluation.

With four parameters I can fit an elephant, and with five I can make him wiggle his trunk. (John von Neumann)

New data technologies raise the bar high for advanced modelling, dynamic visualisation, real-time data flows and a variety of data sources, from sensors, to cell phones or the Internet as such. An abundance of (big) data, a haystack made of needles, but do public administrations have the right tools and skills to exploit it? How much of it adds real value to established statistics and to scientific evidence? Are the high hopes and the high expectations partly just hype? And what lessons can we learn from experience?

To explore these questions, the European Commission is launching a study with the Oxford Internet Institute, Technopolis and CEPS on ‘Data for policy: big data and other innovative data-driven approaches for evidence-informed policymaking’. As a first step, the study will collect examples of initiatives in public institutions at national and international level, where innovative data technologies contribute to the policy process. It will eventually develop case studies for EU policies.

Contribute to the collective reflection by sharing good practices and examples you know of from other public administrations. You can also follow the study’s developments on Twitter: @data4policyEU.

Our New Three Rs: Rigor, Relevance, and Readability


Article by Stephen J. Del Rosso in Governance: “…Because of the dizzying complexity of the contemporary world, the quest for a direct relationship between academic scholarship and its policy utility is both quixotic and unnecessary. The 2013 U.S. Senate’s vote to prohibit funding for political science projects through the National Science Foundation, except for those certified “as promoting national security or the economic interests of the United States,” revealed a fundamental misreading of the nonlinear path between idea and policy. Rather than providing a clear blueprint for addressing emergent or long-standing challenges, a more feasible role for academic scholarship is what political scientist Roland Paris describes as helping to “order the world in which officials operate.” Scholarly works can “influence practitioners’ understandings of what is possible or desirable in a particular policy field or set of circumstances,” he believes, by “creating operational frameworks for … identifying options and implementing policies.”

It is sometimes claimed that think tanks should play the main role in conveying scholarly insights to policymakers. But, however they may have mastered the sound bite, the putative role of think tanks as effective transmission belts for policy-relevant ideas is limited by their lack of academic rigor and systematic peer review. There is also a tendency, particularly among some “Inside the Beltway” experts, to trim their sails to the prevailing political winds and engage in self-censorship to keep employment options open in current or future presidential administrations. Scholarship’s comparative advantage in the marketplace of ideas is also evident in terms of its anticipatory function—the ability to loosen the intellectual bolts for promising policies not quite ready for implementation. A classic example is Swedish Nobel laureate Gunnar Myrdal’s 1944 study of race relations, An American Dilemma, which was largely ignored and even disavowed by its sponsors for over a decade until it proved essential to the landmark Supreme Court decision in Brown v. Board of Education. Moreover, rather than providing a detailed game plan for addressing the problem of race in the country, Myrdal’s work was a quintessential example of the power of scholarship to frame critically important issues.

To bridge the scholarship–policy gap, academics must balance rigor and relevance with a third “R”—readability. There is no shortage of important scholarly work that goes unnoticed or unread because of its presentation. Scholars interested in having influence beyond the ivory tower need to combine their pursuit of disciplinary requirements with efforts to make their work more intelligible and accessible to a broader audience. For example, new forms of dissemination, such as blogs and other social media innovations, provide policy-relevant scholars with ample opportunities to supplement more traditional academic outlets. The recent pushback from the editors of the International Studies Association’s journals to the announced prohibition on their blogging is one indication that the cracks in the old system are already appearing.

At the risk of oversimplification, there are three basic tribes populating the political science field. One tribe comprises those who “get it” when it comes to the importance of policy relevance, a second eschews such engagement with the real world in favor of knowledge for knowledge’s sake, and a third is made up of anxious untenured assistant professors who seek to follow the path that will best provide them with secure employment. If war, as was famously said, is too important to be left to the generals, then the future of the political science field is too important to be left to the intellectual ostriches who bury their heads in self-referential esoterica. However, the first tribe needs to be supported, and the third tribe needs to be shown that there is professional value in engaging with the world, both to enlighten and, perhaps more importantly, to provoke—a sentiment the policy-relevant scholar and inveterate provocateur, Huntington, would surely have endorsed…(More)”

Data scientists rejoice! There’s an online marketplace selling algorithms from academics


SiliconRepublic: “Algorithmia, an online marketplace that connects computer science researchers’ algorithms with developers who may have uses for them, has exited its private beta.

Algorithms are essential to our online experience. Google uses them to determine which search results are the most relevant. Facebook uses them to decide what should appear in your news feed. Netflix uses them to make movie recommendations.

Founded in 2013, Algorithmia could be described as an app store for algorithms, with over 800 of them available in its library. These algorithms provide the means of completing various tasks in the fields of machine learning, audio and visual processing, and computer vision.

Algorithmia found a way to monetise algorithms by creating a platform where academics can share their creations and charge a royalty fee per use, while developers and data scientists can request specific algorithms in return for a monetary reward. One such suggestion is for ‘punctuation prediction’, which would insert correct punctuation and capitalisation in speech-to-text translation.

While it’s not the first algorithm marketplace online, Algorithmia will accept and sell any type of algorithm and host them on its servers. What this means is that developers need only add a simple piece of code to their software in order to send a query to Algorithmia’s servers, so the algorithm itself doesn’t have to be integrated in its entirety….
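The integration pattern described here (ship a small query to the marketplace’s servers rather than embed the algorithm) can be sketched as follows. The endpoint URL, payload shape, and auth scheme below are invented for illustration and are not Algorithmia’s actual API:

```python
import json
import urllib.request

# Hypothetical endpoint and request format, invented for illustration;
# the real marketplace API will differ.
def build_algorithm_request(user, algo, payload, api_key):
    """Build (but do not send) an HTTP request invoking a hosted algorithm."""
    return urllib.request.Request(
        url=f"https://api.example.com/v1/algo/{user}/{algo}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Simple {api_key}",
        },
        method="POST",
    )

req = build_algorithm_request(
    "demo_user",
    "punctuation_prediction",
    {"text": "hello world how are you"},
    "API_KEY",
)
```

The point of the pattern is that the client never sees the algorithm’s implementation; swapping in a different hosted algorithm changes only the URL, not the client code.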

Computer science researchers can spend years developing algorithms, only for them to be published in a scientific journal and never read by software engineers.

Algorithmia intends to create a community space where academics and engineers can meet to discuss and refine these algorithms for practical use. A voting and commenting system on the site will allow users to engage and even share insights on how contributions can be improved.

To that end, Algorithmia’s ultimate goal is to advance the development of algorithms as well as their discovery and use….(More)”

Who Retweets Whom: How Digital And Legacy Journalists Interact on Twitter


Paper by Michael L. Barthel, Ruth Moon, and William Mari published by the Tow Center: “When bloggers and citizen journalists became fixtures of the U.S. media environment, traditional print journalists responded with a critique, as this latest Tow Center brief says. According to mainstream reporters, the interlopers were “unprofessional, unethical, and overly dependent on the very mainstream media they criticized. In a 2013 poll of journalists, 51 percent agreed that citizen journalism is not real journalism”.

However, the digital media environment, a space for easy interaction, has provided opportunities for journalists of all stripes to vault the barriers between the legacy and digital sectors; if not collaborating, then at least communicating.

This brief by three PhD candidates at the University of Washington, Michael L. Barthel, Ruth Moon and William Mari, takes a snapshot of how fifteen political journalists from BuzzFeed, Politico and The New York Times (representing digital, hybrid and legacy outlets, respectively) interact. The researchers place those interactions in the context of reporters’ longstanding traditions of gossip, goading, collaboration and competition.

They found tribalism, most pronounced in the legacy outlet but present across each grouping. They found hierarchy and status-boosting. But those phenomena were not absolute; there were also instances of co-operation, sharing and mutual benefit. Nonetheless, by these indicators at least, there was a clear pecking order: digital and hybrid organizations’ journalists paid “more attention to traditional than digital publications”.

You can download your copy here (pdf).”

The Algorithmic Self


Frank Pasquale in The Hedgehog Review: “…For many technology enthusiasts, the answer to the obesity epidemic—and many other problems—lies in computational countermeasures to the wiles of the food scientists. App developers are pioneering behavioristic interventions to make calorie counting and exercise prompts automatic. For example, users of a new gadget, the Pavlok wristband, can program it to give them an electronic shock if they miss exercise targets. But can such stimuli break through the blooming, buzzing distractions of instant gratification on offer in so many rival games and apps? Moreover, is there another way of conceptualizing our relationship to our surroundings than as a suboptimal system of stimulus and response?
Some of our subtlest, most incisive cultural critics have offered alternatives. Rather than acquiesce to our manipulability, they urge us to become more conscious of its sources—be they intrusive advertisements or computers that we (think we) control. For example, Sherry Turkle, founder and director of the MIT Initiative on Technology and Self, sees excessive engagement with gadgets as a substitution of the “machinic” for the human—the “cheap date” of robotized interaction standing in for the more unpredictable but ultimately challenging and rewarding negotiation of friendship, love, and collegiality. In The Glass Cage, Nicholas Carr critiques the replacement of human skill with computer mediation that, while initially liberating, threatens to sap the reserves of ingenuity and creativity that enabled the computation in the first place.
Beyond the psychological, there is a political dimension, too. Legal theorist and Georgetown University law professor Julie Cohen warns of the dangers of “modulation,” which enables advertisers, media executives, political consultants, and intelligence operatives to deploy opaque algorithms to monitor and manipulate behavior. Cultural critic Rob Horning ups the ante on the concerns of Cohen and Turkle with a series of essays dissecting feedback loops among surveillance entities, the capture of important information, and self-readjusting computational interventions designed to channel behavior and thought into ever-narrower channels. Horning also criticizes Carr for failing to emphasize the almost irresistible economic logic behind algorithmic self-making—at first for competitive advantage, then, ultimately, for survival.
To negotiate contemporary algorithms of reputation and search—ranging from resumé optimization on LinkedIn to strategic Facebook status updates to OkCupid profile grooming—we are increasingly called on to adopt an algorithmic self, one well practiced in strategic self-promotion. This algorithmic selfhood may be critical to finding job opportunities (or even maintaining a reliable circle of friends and family) in an era of accelerating social change. But it can also become self-defeating. Consider, for instance, the self-promoter whose status updates on Facebook or LinkedIn gradually tip from informative to annoying. Or the search-engine-optimizing website whose tactics become a bit too aggressive, thereby causing it to run afoul of Google’s web spam team and consequently sink into obscurity. The algorithms remain stubbornly opaque amid rapidly changing social norms. A cyber-vertigo results, as we are pressed to promote our algorithmic selves but puzzled over the best way to do so….(More)
 

New portal to crowdsource captions, transcripts of old photos, national archives


Irene Tham at The Straits Times: “Wanted: history enthusiasts to caption old photographs and transcribe handwritten manuscripts that contain a piece of Singapore’s history.

They are invited to contribute to an upcoming portal that will carry some 3,000 unidentified photographs dating back to the late 1800s, and 3,000 pages of Straits Settlement records including letters written during Sir Stamford Raffles’ administration of Singapore.

These are collections from the Government and individuals waiting to be “tagged” on the new portal – The Citizen Archivist Project at www.nas.gov.sg/citizenarchivist….

Without tagging – such as by photo captioning and digital transcription – these records cannot be searched. There are over 140,000 photos and about one million pages of Straits Settlements Records in total that cannot be searched today.

These records date back to the 1800s, and include letters written during Sir Stamford Raffles’ administration in Singapore.

“The key challenge is that they were written in elaborate cursive penmanship which is not machine-readable,” said Dr Yaacob, adding that the knowledge and wisdom of the public can be tapped to make these documents more accessible.

Mr Arthur Fong (West Coast GRC) had asked how the Government could get young people interested in history, and Dr Yaacob said this initiative was something they would enjoy.

Portal users must first log in using their existing Facebook, Google or National Library Board accounts. Contributions will be saved in users’ profiles, automatically created upon signing in.

Transcript contributions work much as they do on Wikipedia: contributed text is published on the portal immediately.

However, the National Archives will take up to three days to review photo caption contributions. Approved captions will be uploaded on its website at www.nas.gov.sg/archivesonline….(More)”

Tweets Can Predict Health Insurance Exchange Enrollment


PennMedicine: “An increase in Twitter sentiment (the positivity or negativity of tweets) is associated with an increase in state-level enrollment in the Affordable Care Act’s (ACA) health insurance marketplaces — a phenomenon that points to use of the social media platform as a real-time gauge of public opinion and provides a way for marketplaces to quickly identify enrollment changes and emerging issues. Although Twitter has been previously used to measure public perception on a range of health topics, this study, led by researchers at the Perelman School of Medicine at the University of Pennsylvania and published online in the Journal of Medical Internet Research, is the first to look at its relationship with the new national health insurance marketplace enrollment.

The study examined 977,303 ACA and “Obamacare”-related tweets — along with those directed toward the Twitter handle for HealthCare.gov and the 17 state-based marketplace Twitter accounts — in March 2014, then tested a correlation of Twitter sentiment with marketplace enrollment by state. Tweet sentiment was determined using the National Research Council (NRC) sentiment lexicon, which contains more than 54,000 words with corresponding sentiment weights ranging from positive to negative. For example, the word “excellent” has a positive sentiment weight, and is more positive than the word “good,” but the word “awful” is negative. Using this lexicon, researchers found that a .10 increase in the sentiment of tweets was associated with a nine percent increase in health insurance marketplace enrollment at the state level. While a .10 increase may seem small, these numbers indicate a significant correlation between Twitter sentiment and enrollment, based on a continuum of sentiment scores examined across nearly a million tweets.
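The state-level analysis boils down to relating two vectors: a per-state mean sentiment score and a per-state enrollment figure. A minimal sketch using a plain Pearson correlation, with made-up numbers standing in for the study’s NRC-lexicon scores and marketplace enrollment data:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical state-level data, invented for illustration:
# mean tweet sentiment vs. share of eligible residents who enrolled.
sentiment = [0.12, 0.05, -0.03, 0.20, 0.08]
enrollment_rate = [0.31, 0.24, 0.18, 0.40, 0.27]

r = pearson(sentiment, enrollment_rate)
```

The published analysis was a state-level regression with controls; a raw correlation as above captures the core idea but none of the study’s adjustments.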

“The correlation between Twitter sentiment and the number of eligible individuals who enrolled in a marketplace plan highlights the potential for Twitter to be a real-time monitoring strategy for future enrollment periods,” said first author Charlene A. Wong, MD, a Robert Wood Johnson Foundation Clinical Scholar and Fellow in Penn’s Leonard Davis Institute of Health Economics. “This would be especially valuable for quickly identifying emerging issues and making adjustments, instead of having to wait weeks or months for that information to be released in enrollment reports, for example.”…(More)”

“Data on the Web” Best Practices


W3C First Public Working Draft: “…The best practices described below have been developed to encourage and enable the continued expansion of the Web as a medium for the exchange of data. The growth of open data by governments across the world [OKFN-INDEX], the increasing publication of research data encouraged by organizations like the Research Data Alliance [RDA], the harvesting and analysis of social media, crowd-sourcing of information, the provision of important cultural heritage collections such as at the Bibliothèque nationale de France [BNF] and the sustained growth in the Linked Open Data Cloud [LODC], provide some examples of this phenomenon.

In broad terms, data publishers aim to share data either openly or with controlled access. Data consumers (who may also be producers themselves) want to be able to find and use data, especially if it is accurate, regularly updated and guaranteed to be available at all times. This creates a fundamental need for a common understanding between data publishers and data consumers. Without this agreement, data publishers’ efforts may be incompatible with data consumers’ desires.

Publishing data on the Web creates new challenges, such as how to represent, describe and make data available in a way that makes it easy to find and to understand. In this context, it becomes crucial to provide guidance to publishers that will improve consistency in the way data is managed, thus promoting the re-use of data, fostering trust in the data among developers, whatever technology they choose to use, and increasing the potential for genuine innovation.

This document sets out a series of best practices that will help publishers and consumers face the new challenges and opportunities posed by data on the Web.

Best practices cover different aspects related to data publishing and consumption, such as data formats, data access, data identification and metadata. In order to delimit the scope and elicit the required features for Data on the Web Best Practices, the DWBP working group compiled a set of use cases [UCR] that represent scenarios of how data is commonly published on the Web and how it is used. The set of requirements derived from these use cases was used to guide the development of the best practices.

The Best Practices proposed in this document are intended to serve a more general purpose than the practices suggested in Best Practices for Publishing Linked Data [LD-BP], since they are domain-independent and, whilst they recommend the use of Linked Data, they also promote best practices for data on the Web in formats such as CSV and JSON. The Best Practices related to the use of vocabularies incorporate practices that stem from Best Practices for Publishing Linked Data where appropriate….(More)