Our path to better science in less time using open data science tools


Julia S. Stewart Lowndes et al. in Nature Ecology & Evolution: “Reproducibility has long been a tenet of science but has been challenging to achieve—we learned this the hard way when our old approaches proved inadequate to efficiently reproduce our own work. Here we describe how several free software tools have fundamentally upgraded our approach to collaborative research, making our entire workflow more transparent and streamlined. By describing specific tools and how we incrementally began using them for the Ocean Health Index project, we hope to encourage others in the scientific community to do the same—so we can all produce better science in less time.

Figure 1: Better science in less time, illustrated by the Ocean Health Index project.

Every year since 2012 we have repeated Ocean Health Index (OHI) methods to track change in global ocean health [36,37]. Increased reproducibility and collaboration have reduced the amount of time required to repeat methods (size of bubbles) with updated data annually, allowing us to focus on improving methods each year (text labels show the biggest innovations). The original assessment in 2012 focused solely on scientific methods (for example, obtaining and analysing data, developing models, calculating, and presenting results; dark shading). In 2013, by necessity we gave more focus to data science (for example, data organization and wrangling, coding, versioning, and documentation; light shading), using open data science tools. We established R as the main language for all data preparation and modelling (using RStudio), which drastically decreased the time involved to complete the assessment. In 2014, we adopted Git and GitHub for version control, project management, and collaboration. This further decreased the time required to repeat the assessment. We also created the OHI Toolbox, which includes our R package ohicore for core analytical operations used in all OHI assessments. In subsequent years we have continued (and plan to continue) this trajectory towards better science in less time by improving code with principles of tidy data [33]; standardizing file and data structure; and focusing more on communication, in part by creating websites with the same open data science tools and workflow. See text and Table 1 for more details….(More)”

ControCurator: Understanding Controversy Using Collective Intelligence


Paper by Benjamin Timmermans et al.: “There are many issues in the world that people do not agree on, such as Global Warming [Cook et al. 2013], Anti-Vaccination [Kata 2010] and Gun Control [Spitzer 2015]. Having opposing opinions on such topics can lead to heated discussions, making them appear controversial. Such opinions are often expressed through news articles and social media. There are increasing calls for methods to detect and monitor these online discussions on different topics. Existing methods focus on using sentiment analysis and Wikipedia for identifying controversy [Dori-Hacohen and Allan 2015]. The problem with this is that they rely on a well-structured, pre-existing debate, which may not always be available. Take, for instance, news reporting during large disasters, where the structure of a discussion is not yet clear and may change rapidly. Adding to this, there is currently no agreed-upon definition of what exactly constitutes controversy. It is only agreed that controversy arises when there is a large debate by people with opposing viewpoints, but we do not yet understand what its characteristic aspects are and how they can be measured. In this paper we use the collective intelligence of the crowd in order to gain a better understanding of controversy by evaluating the aspects that have an impact on it….(More)”
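The paper's crowdsourcing setup is its own, but the underlying intuition, that controversy is high when a large, engaged crowd splits into balanced opposing camps, can be illustrated with a toy score. The function below is only a hedged sketch, not ControCurator's actual model; the stance labels and the weighting are assumptions.

```python
from collections import Counter

def controversy_score(stances):
    """Score in [0, 1]: high when opposing stances are both numerous and balanced.

    `stances` is a list of crowd judgments, e.g. ["support", "oppose", "neutral", ...].
    """
    counts = Counter(stances)
    support, oppose = counts.get("support", 0), counts.get("oppose", 0)
    total = support + oppose
    if total == 0:
        return 0.0
    # Balance term: 1.0 when the two sides are equal, 0.0 when one-sided.
    balance = 1.0 - abs(support - oppose) / total
    # Participation term: more opposing voices overall means more controversy.
    participation = total / len(stances)
    return balance * participation

# Example: six crowd workers judge a news article on gun control.
judgments = ["support", "oppose", "oppose", "support", "neutral", "oppose"]
print(round(controversy_score(judgments), 2))  # balanced opposition -> 0.67
```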

See also http://crowdtruth.org/

 

Citizenship office wants ‘Emma’ to help you


From FedScoop: “U.S. Citizenship and Immigration Services unveiled a new virtual assistant live-chat service, known as “Emma,” to assist customers and website visitors in finding information and answering questions in a timely and efficient fashion.

The agency told FedScoop that it built the chatbot with the help of Verizon and artificial intelligence interface company Next IT. The goal, the agency says, is “to address the growing need for customers to obtain information quicker and through multiple access points”; to meet it, “USCIS broadened the traditional call center business model to include web-based self-help tools.”

USCIS, a component agency of the Department of Homeland Security, says it receives nearly 14 million calls relating to immigration every year. The virtual assistant and live-chat services are aimed at becoming the first line of help available to users of USCIS.gov who might have trouble finding answers by themselves.

The bot greets customers when they enter the website, answers basic questions via live chat and supplies additional information in both English and Spanish. As a result, the amount of time customers spend searching for information on the website is greatly reduced, according to USCIS. Because the virtual assistant is embedded within the website, it can rapidly provide relevant information that may have been difficult to access manually.

The nature of the bot lends itself to potential encounters with the personally identifiable information (PII) of the customers it interacts with. Because of this, USCIS recently conducted a privacy impact assessment (PIA).

Much of the assessment revolved around accuracy and the security of information that Emma could potentially encounter in a customer interaction. For the most part, the chatbot doesn’t require customers to submit personal information. Instead, it draws its responses from content already available on USCIS.gov, scoped to whatever information users choose to provide. Answers are, according to the PIA, verified by thorough and frequent examination of all content posted to the site.
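USCIS has not published Emma's internals. Purely as an illustration of the approach the PIA describes, answering from content already published on the website, the sketch below matches a visitor's question against a handful of hypothetical USCIS.gov snippets using TF-IDF similarity; the snippets, threshold, and hand-off message are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical snippets standing in for content already published on USCIS.gov.
faq = {
    "How do I check my case status?": "You can check your case status online using your receipt number.",
    "How do I renew my green card?": "File Form I-90 to renew or replace your permanent resident card.",
    "How long does naturalization take?": "Processing times for Form N-400 vary by field office.",
}

vectorizer = TfidfVectorizer()
question_matrix = vectorizer.fit_transform(faq.keys())

def answer(user_question, threshold=0.2):
    """Return the best-matching published answer, or offer a hand-off to a live agent."""
    scores = cosine_similarity(vectorizer.transform([user_question]), question_matrix)[0]
    best = scores.argmax()
    if scores[best] < threshold:
        return "I'm not sure. Would you like to chat with a live agent?"
    return list(faq.values())[best]

print(answer("Where can I see the status of my case?"))
```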

According to USCIS, Emma will delete all chat logs — and therefore all PII — immediately after the customer ends the chat session. Should the bot be unable to answer a question effectively and the customer choose to continue the session with a live agent, the bot will ask for the preferred language (English or Spanish), the general topic of conversation, short comments on why the customer wishes to speak with a live agent, and the case on file and receipt number.

This information would then be transferred to the live agent. All other sensitive information entered, such as Social Security numbers or receipt numbers, would then be automatically masked in the subsequent transfer to the live agent…(More)”
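The PIA does not describe the masking mechanism in detail. As a minimal, hypothetical sketch, the hand-off step could redact identifiers with regular expressions before the transcript reaches the agent; the SSN and receipt-number patterns below are assumptions for illustration, not USCIS's actual implementation.

```python
import re

# Patterns are assumptions: a U.S. SSN (###-##-####) and a USCIS-style
# receipt number (three letters followed by ten digits).
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
RECEIPT_PATTERN = re.compile(r"\b[A-Z]{3}\d{10}\b")

def mask_pii(message):
    """Replace sensitive identifiers before a transcript is passed to a live agent."""
    message = SSN_PATTERN.sub("[SSN REDACTED]", message)
    message = RECEIPT_PATTERN.sub("[RECEIPT NUMBER REDACTED]", message)
    return message

print(mask_pii("My SSN is 123-45-6789 and my receipt number is EAC1234567890."))
# -> My SSN is [SSN REDACTED] and my receipt number is [RECEIPT NUMBER REDACTED].
```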

How can we study disguised propaganda on social media? Some methodological reflections


Jannick Schou and Johan Farkas at DataDrivenJournalism: “‘Fake news’ has recently become a seemingly ubiquitous concept among journalists, researchers, and citizens alike. With the rise of platforms such as Facebook and Twitter, it has become possible to spread deliberate forms of misinformation in hitherto unforeseen ways. This has also spilled over into the political domain, where new forms of (disguised) propaganda and false information have recently begun to emerge. These new forms of propaganda have very real effects: they serve to obstruct political decision-making processes, instil false narratives within the general public, and add fuel to already heated sites of political conflict. They represent a genuine democratic problem.

Yet, so far, both critical researchers and journalists have faced a number of issues and challenges when attempting to understand these new forms of political propaganda. Simply put: when it comes to disguised propaganda and social media, we know very little about the actual mechanisms through which such content is produced, disseminated, and negotiated. One of the key explanations for this might be that fake profiles and disguised political agendas are incredibly difficult to study. They present a serious methodological challenge. This is not only due to their highly ephemeral nature, with Facebook pages able to vanish after only a few days or hours, but also because of the anonymity of their producers. Often, we simply do not know who is disseminating what and with what purpose. This makes it difficult for us to understand and research exactly what is going on.

This post takes its point of departure from a new article published in the international academic journal New Media & Society. Based on the research done for this article, we want to offer some methodological reflections as to how disguised propaganda might be investigated. How can we research fake and disguised political agendas? And what methodological tools do we have at our disposal?…

two main pieces of methodological advice spring to mind. First of all: collect as much data as you can, in as many ways as possible. Make screenshots, take detailed written observations, use data scraping, and (if possible) participate in citizen groups. One of the most valuable resources we had at our disposal was the set of heterogeneous data we collected from each page. Using this allowed us to carefully dissect and retrace the complex set of practices involved in each page long after it was gone. While we certainly tried to be as systematic in our data collection as possible, we also had to use every tool at our disposal. And we had to constantly be on our toes. As soon as a page emerged, we were there: ready to write down notes and collect data.
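The authors do not publish their scripts, but the "collect everything, immediately, in several forms" principle can be sketched as a small archiving routine: save a timestamped copy of a public page's HTML together with written observations. The URL and file layout below are hypothetical, and in practice a platform's API and terms of service govern what may be collected.

```python
import datetime
import json
import pathlib

import requests  # third-party: pip install requests

ARCHIVE_DIR = pathlib.Path("propaganda_archive")
ARCHIVE_DIR.mkdir(exist_ok=True)

def snapshot(url, notes=""):
    """Save a timestamped copy of a page plus written observations.

    Pages can vanish within hours, so every visit is archived on the spot.
    """
    stamp = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")
    html = requests.get(url, timeout=30).text
    (ARCHIVE_DIR / f"{stamp}.html").write_text(html, encoding="utf-8")
    record = {"url": url, "retrieved": stamp, "notes": notes}
    (ARCHIVE_DIR / f"{stamp}.json").write_text(json.dumps(record, indent=2))

# Hypothetical page, for illustration only.
snapshot("https://example.org/suspicious-page", notes="Page claims to speak for group X.")
```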

Second: be willing to participate and collaborate. Our research showcases the immense potential in researchers (and journalists) actively collaborating with citizen groups and grassroots movements. Using the collective insights and attention of this group allowed us to quickly find and track down pages. It gave us renewed methodological strength. Collaborating across otherwise closed boundaries between research and journalism opens up new avenues for deeper and more detailed insights….(More)”

Big Data: A New Empiricism and its Epistemic and Socio-Political Consequences


Chapter by Gernot Rieder and Judith Simon in Berechenbarkeit der Welt? Philosophie und Wissenschaft im Zeitalter von Big Data: “…paper investigates the rise of Big Data in contemporary society. It examines the most prominent epistemological claims made by Big Data proponents, calls attention to the potential socio-political consequences of blind data trust, and proposes a possible way forward. The paper’s main focus is on the interplay between an emerging new empiricism and an increasingly opaque algorithmic environment that challenges democratic demands for transparency and accountability. It concludes that a responsible culture of quantification requires epistemic vigilance as well as a greater awareness of the potential dangers and pitfalls of an ever more data-driven society….(More)”.

Data Collaboratives: exchanging data to create public value across Latin America and the Caribbean


Stefaan Verhulst, Andrew Young and Prianka Srinivasan at IADB’s Abierto al Publico: “Data is playing an ever-increasing role in bolstering businesses across Latin America – and the rest of the world. In Brazil, Mexico and Colombia alone, the revenue from Big Data is calculated at more than US$603.7 million, a market that is only set to increase as more companies across Latin America and the Caribbean embrace data-driven strategies to enhance their bottom line. Brazilian banking giant Itau plans to create six data centers across the country, and already uses data collected from consumers online to improve cross-selling techniques and streamline its investments. Data from web clicks, social media profiles, and telecommunication services is fueling a new generation of entrepreneurs keen to make big dollars from big data.

What if this same data could be used not just to improve business, but to improve the collective well-being of our communities, public spaces, and cities? Analysis of social media data can offer powerful insights to city officials into public trends and movements to better plan infrastructure and policies. Public health officials and humanitarian workers can use mobile phone data to, for instance, map human mobility and better target their interventions. By repurposing the data collected by companies for their business interests, governments, international organizations and NGOs can leverage big data insights for the greater public good.

The key question is thus: how can the useful data collected by corporations be unlocked in a responsible manner, so that its vast potential does not go to waste?

“Data Collaboratives” are emerging as a possible answer. Data collaboratives are a new type of public-private partnership aimed at creating public value by exchanging data across sectors.

Research conducted by the GovLab finds that Data Collaboratives offer several potential benefits across a number of sectors, including humanitarian and anti-poverty efforts, urban planning, natural resource stewardship, health, and disaster management. As a greater number of companies in Latin America look to data to spur business interests, our research suggests that some companies are also sharing and collaborating around data to confront some of society’s most pressing problems.

Consider the following Data Collaboratives that seek to enhance…(More)”

Twitter as a data source: An overview of tools for journalists


Wasim Ahmed at Data Driven Journalism: “Journalists may wish to use data from social media platforms in order to provide greater insight and context to a news story. For example, journalists may wish to examine the contagion of hashtags and whether they are capable of achieving political or social change. Moreover, newsrooms may also wish to tap into social media posts during unfolding crisis events, for example to find out who tweeted about a crisis event first and to empirically examine the impact of social media.

Furthermore, Twitter users and accounts such as WikiLeaks may operate outside the constraints of traditional journalism, and it therefore becomes important to have tools and mechanisms in place to examine these kinds of influential users. For example, it was found that those backing Marine Le Pen on Twitter may have been users with an affinity for Donald Trump.

A number of different methods exist for analysing social media data. Take text analytics, for example, which can include using sentiment analysis to sort social media posts in bulk into categories of feeling, such as positive, negative, or neutral; or machine learning, which can automatically assign social media posts to a number of different topics.
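As a rough illustration of the sentiment step described above, the toy sketch below buckets tweets into positive, negative, or neutral using the third-party TextBlob library; the polarity threshold and sample tweets are invented for illustration, not drawn from the article.

```python
from textblob import TextBlob  # third-party: pip install textblob

def sentiment_bucket(text, threshold=0.1):
    """Assign a tweet to a coarse sentiment category based on its polarity score."""
    polarity = TextBlob(text).sentiment.polarity  # ranges from -1.0 to 1.0
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"

tweets = [
    "Loving the response to the relief effort, great work everyone!",
    "This is a disaster, nobody is getting any help.",
    "Flood warnings issued for the coastal region.",
]
for tweet in tweets:
    print(sentiment_bucket(tweet), "-", tweet)
```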

There are other methods, such as social network analysis, which examines online communities and the relationships between them. A number of qualitative methodologies also exist, such as content analysis and thematic analysis, which can be used to manually label social media posts. From a journalistic perspective, network analysis may be the most useful starting point, via tools such as NodeXL, because it can quickly provide an overview of influential Twitter users alongside a topic overview.
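NodeXL itself is a spreadsheet template, but the same first-pass network analysis can be sketched in a few lines with the networkx library: build a directed mention graph and rank accounts by how many distinct users mention them. The mention pairs below are made up for illustration.

```python
import networkx as nx  # third-party: pip install networkx

# Made-up (user, mentioned_user) pairs, as extracted from "@" mentions in tweets.
mentions = [
    ("alice", "newsdesk"), ("bob", "newsdesk"), ("carol", "newsdesk"),
    ("bob", "alice"), ("dave", "alice"), ("carol", "dave"),
]

G = nx.DiGraph()
G.add_edges_from(mentions)

# Accounts mentioned by many distinct others sit at the centre of the conversation.
ranking = sorted(nx.in_degree_centrality(G).items(), key=lambda kv: kv[1], reverse=True)
for user, score in ranking[:3]:
    print(f"{user}: {score:.2f}")
```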

From an industry standpoint, there has been much focus on gaining insight into users’ personalities, through services such as IBM Watson’s Personality Insights service. This uses linguistic analytics to derive intrinsic personality insights, such as emotions like anxiety, self-consciousness, and depression. This information can then be used by marketers to target certain products; for example, anti-anxiety medication to users who are more anxious…(An overview of tools for 2017).”

UK government watchdog examining political use of data analytics


“Given the big data revolution, it is understandable that political campaigns are exploring the potential of advanced data analysis tools to help win votes,” Elizabeth Denham, the information commissioner, writes on the ICO’s blog. However, “the public have the right to expect” that this takes place in accordance with existing data protection laws, she adds.

Political parties are able to use Facebook to target voters with different messages, tailoring the advert to recipients based on their demographic. In the 2015 UK general election, the Conservative party spent £1.2 million on Facebook campaigns and the Labour party £16,000. It is expected that Labour will vastly increase that spend for the general election on 8 June….

Political parties and third-party companies are allowed to collect data from sites like Facebook and Twitter that lets them tailor these ads to broadly target different demographics. However, if those ads target identifiable individuals, that runs afoul of the law….(More)”

Eliminating the Human


“I suspect that we almost don’t notice this pattern because it’s hard to imagine what an alternative focus of tech development might be. Most of the news we get barraged with is about algorithms, AI, robots and self-driving cars, all of which fit this pattern, though there are indeed many technological innovations underway that have nothing to do with eliminating human interaction from our lives: CRISPR-Cas9 in genetics, new films that can efficiently and cheaply cool houses, and quantum computing, to name a few. But what we read about most, and what touches us daily, is the trajectory towards less human involvement. Note: I don’t consider chat rooms and product reviews as “human interaction”; they’re mediated and filtered by a screen.

I am not saying these developments are not efficient and convenient; this is not a judgement regarding the services and technology. I am simply noticing a pattern and wondering if that pattern means there are other possible roads we could be going down, and that the way we’re going is not in fact inevitable, but is (possibly unconsciously) chosen.

Here are some examples of tech that allows for less human interaction…

Lastly, “social” media: social “interaction” that isn’t really social.

While the appearance of social networks is one of connection—as Facebook and others frequently claim—the fact is that a lot of social media is a simulation of real social connection. As has been in evidence recently, social media actually increases divisions amongst us by amplifying echo effects and allowing us to live in cognitive bubbles. We are fed what we already like or what our similarly inclined friends like… or, more likely now, what someone has paid for us to see in an ad that mimics content. In this way, we actually become less connected, except to those in our group…..

Many transformative movements in the past succeeded based on leaders, agreed-upon principles and organization. Although social media is a great tool for rallying people and bypassing government channels, it does not guarantee eventual success.

Social media is not really social—ticking boxes and having followers and getting feeds is NOT being social—it’s a screen simulation of human interaction. Human interaction is much more nuanced and complicated than what happens online. Engineers like things that are quantifiable. Smells, gestures, expression, tone of voice, etc. etc.—in short, all the various ways we communicate are VERY hard to quantify, and those are often how we tell if someone likes us or not….

To repeat what I wrote above—humans are capricious, erratic, emotional, irrational and biased in what sometimes seem like counterproductive ways. I’d argue that though those might seem like liabilities, many of those attributes actually work in our favor. Many of our emotional responses have evolved over millennia, and they are based on the probability that our responses, often prodded by an emotion, will more likely than not offer the best way to deal with a situation….

Our random accidents and odd behaviors are fun—they make life enjoyable. I’m wondering what we’re left with when there are fewer and fewer human interactions. Remove humans from the equation and we are less complete as people or as a society. “We” do not exist as isolated individuals—we as individuals are inhabitants of networks, we are relationships. That is how we prosper and thrive….(More)”.

Open Data Barometer 2016


Open Data Barometer: “Produced by the World Wide Web Foundation as a collaborative work of the Open Data for Development (OD4D) network and with the support of the Omidyar Network, the Open Data Barometer (ODB) aims to uncover the true prevalence and impact of open data initiatives around the world. It analyses global trends, and provides comparative data on countries and regions using an in-depth methodology that combines contextual data, technical assessments and secondary indicators.

Covering 115 jurisdictions in the fourth edition, the Barometer ranks governments on:

  • Readiness for open data initiatives.
  • Implementation of open data programmes.
  • Impact that open data is having on business, politics and civil society.

After three successful editions, the fourth marks another step towards becoming a global policymaking tool with a participatory and inclusive process and a strong regional focus. This year’s Barometer includes an assessment of government performance in fulfilling the Open Data Charter principles.

The Barometer is a truly global and collaborative effort, with input from more than 100 researchers and government representatives. It takes over six months and more than 10,000 hours of research work to compile. During this process, we address more than 20,000 questions and respond to more than 5,000 comments and suggestions.

The ODB global report is a summary of some of the most striking findings. The full data and methodology are available, and are intended to support secondary research and inform better decisions for the progression of open data policies and practices across the world…(More)”.
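The Barometer publishes its full methodology and weights separately; purely as a generic illustration of how a composite index combines pillar scores like the three listed above, a sketch might look as follows. The equal weights and sample scores are invented, not the ODB's.

```python
def barometer_score(readiness, implementation, impact,
                    weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted composite of three 0-100 pillar scores (equal weights are an assumption)."""
    return sum(w * p for w, p in zip(weights, (readiness, implementation, impact)))

# Toy example: a country scoring 80 on readiness, 65 on implementation, 40 on impact.
print(round(barometer_score(80, 65, 40), 1))  # -> 61.7
```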