Data journalism and the ethics of publishing Twitter data


Matthew L. Williams at Data Driven Journalism: “Collecting and publishing data from social media sites such as Twitter are everyday practices for the data journalist. Recent findings from Cardiff University’s Social Data Science Lab question the practice of publishing Twitter content without seeking some form of informed consent from users beforehand. Researchers found that tweets collected around certain topics, such as those related to terrorism, political votes, changes in the law and health problems, create datasets that might contain sensitive content, such as extreme political opinion, grossly offensive comments, overly personal revelations and threats to life (both to oneself and to others). Handling these data in the process of analysis (such as classifying content as hateful and potentially illegal) and reporting has brought the ethics of using social media in social research and journalism into sharp focus.

Ethics is an issue that is becoming increasingly salient in research and journalism using social media data. The digital revolution has outpaced parallel developments in research governance and agreed good practice. Codes of ethical conduct that were written in the mid-twentieth century are being relied upon to guide the collection, analysis and representation of digital data in the twenty-first century. Social media is particularly ethically challenging because of the open availability of the data (particularly from Twitter). Many platforms’ terms of service specifically state that users’ public data will be made available to third parties, and by accepting these terms users legally consent to this. However, researchers and data journalists must interpret and engage with these commercially motivated terms of service through a more reflexive lens, which implies a context-sensitive approach, rather than focusing on the legally permissible uses of these data.

Social media researchers and data journalists have experimented with data from a range of sources, including Facebook, YouTube, Flickr, Tumblr and Twitter to name a few. Twitter is by far the most studied of all these networks. This is because Twitter differs from other networks, such as Facebook, that are organised around groups of ‘friends’, in that it is more ‘open’ and the data (in part) are freely available to researchers. This makes Twitter a more public digital space that promotes the free exchange of opinions and ideas. Twitter has become the primary space for online citizens to publicly express their reaction to events of national significance, and also the primary source of data for social science research into digital publics.

The Twitter streaming API provides three levels of data access: the free random 1% that provides ~5M tweets daily, and the random 10% and 100% (chargeable, or free to academic researchers upon request). Datasets on social interactions of this scale, speed and ease of access have been hitherto unrealisable in the social sciences and journalism, and have led to a flood of journal articles and news pieces, many of which include tweets with full text content and author identity without informed consent. This is presumably because of Twitter’s ‘open’ nature, which leads to the assumption that ‘these are public data’ and that using them does not require the rigour and scrutiny of ethical oversight. Even when these data are scrutinised, journalists have little reason to question the ‘public data’ argument, given the lack of a framework for evaluating the potential harms to users. The Social Data Science Lab takes a more ethically reflexive approach to the use of social media data in social research, and carefully considers users’ perceptions, online context and the role of algorithms in estimating potentially sensitive user characteristics.

A recent Lab survey into users’ perceptions of the use of their social media posts found the following:

  • 94% were aware that social media companies had Terms of Service
  • 65% had read the Terms of Service in whole or in part
  • 76% knew that when accepting Terms of Service they were giving permission for some of their information to be accessed by third parties
  • 80% agreed that if their social media information is used in a publication they would expect to be asked for consent
  • 90% agreed that if their tweets were used without their consent they should be anonymized…(More)”.
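The last two findings suggest a concrete minimum step before a tweet is quoted in a publication: stripping the handles and links that identify its author. A minimal sketch in Python (the function name and redaction rules are illustrative, not the Lab's procedure):

```python
import re

def anonymise_tweet(text: str) -> str:
    """Redact @mentions and links from a tweet before quotation.

    A toy illustration only: real anonymisation also has to deal with
    names, places, and text distinctive enough to be found by search.
    """
    text = re.sub(r"@\w+", "@[redacted]", text)             # user mentions
    text = re.sub(r"https?://\S+", "[link removed]", text)  # embedded links
    return text

print(anonymise_tweet("@someuser see https://example.com for details"))
# prints: @[redacted] see [link removed] for details
```

Redaction alone is weak, since a distinctive sentence can often be re-identified by searching for its verbatim text, which is one reason the survey's respondents also expect to be asked for consent.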

Spanning Today’s Chasms: Seven Steps to Building Trusted Data Intermediaries


James Shulman at the Mellon Foundation: “In 2001, when hundreds of individual colleges and universities were scrambling to scan their slide libraries, The Andrew W. Mellon Foundation created a new organization, Artstor, to assemble a massive library of digital images from disparate sources to support teaching and research in the arts and humanities.

Rather than encouraging—or paying for—each school to scan its own slide of the Mona Lisa, the Mellon Foundation created an intermediary organization that would balance the interests of those who created, photographed and cared for art works, such as artists and museums, and those who wanted to use such images for the admirable calling of teaching and studying history and culture.  This organization would reach across the gap that separated these two communities and would respect and balance the interests of both sides, while helping each accomplish their missions.  At the same time that Napster was using technology to facilitate the un-balanced transfer of digital content from creators to users, the Mellon Foundation set up a new institution aimed at respecting the interests of one side of the market and supporting the socially desirable work of the other.

As the internet has enabled the sharing of data across the world, new intermediaries have emerged as entire platforms. A networked world needs such bridges—think Etsy or eBay sitting between sellers and buyers, or Facebook sitting between advertisers and users. While intermediaries that match sellers and buyers of things provide a marketplace that bridges the two sides, aggregators of data work in admittedly more shadowy territory.

In the many realms that market forces won’t support, however, a great deal of public good can be done by aggregating and managing access to datasets that might otherwise continue to live in isolation. Whether due to institutional sociology that favors local solutions, the technical challenges associated with merging heterogeneous databases built with different data models, intellectual property limitations, or privacy concerns, datasets are built and maintained by independent groups that—if networked—could be used to further each other’s work.

Think of those studying coral reefs, or those studying labor practices in developing markets, or child welfare offices seeking to call upon court records in different states, or medical researchers working in different sub-disciplines but on essentially the same disease. What intermediary invests in joining these datasets? Many people assume that computers can simply “talk” to each other and share data intuitively, but without targeted investment in connecting them, they can’t. Unlike modern databases that are now often designed with the cloud in mind, decades’ worth of locally created databases churn away in isolation, at great opportunity cost to us all.

Art history research is an unusually vivid example. Most people can understand that if you want to study Caravaggio, you don’t want to hunt and peck across hundreds of museums, books, photo archives, libraries, churches, and private collections.  You want all that content in one place—exactly what Mellon sought to achieve by creating Artstor.

What did we learn in creating Artstor that might be distilled as lessons for others taking on an aggregation project to serve the public good?….(More)”.

Facebook’s next project: American inequality


Nancy Scola at Politico: “Facebook CEO Mark Zuckerberg is quietly cracking open his company’s vast trove of user data for a study on economic inequality in the U.S. — the latest sign of his efforts to reckon with divisions in American society that the social network is accused of making worse.

The study, which hasn’t previously been reported, is mining the social connections among Facebook’s American users to shed light on the growing income disparity in the U.S., where the top 1 percent of households is said to control 40 percent of the country’s wealth. Facebook is an incomparably rich source of information for that kind of research: By one estimate, about three of five American adults use the social network….

Facebook confirmed the broad contours of its partnership with economist Raj Chetty but declined to elaborate on the substance of the study. Chetty, in a brief interview following a January speech in Washington, said he and his collaborators — who include researchers from Stanford and New York University — have been working on the inequality study for at least six months.

“We’re using social networks, and measuring interactions there, to understand the role of social capital much better than we’ve been able to,” he said.

Researchers say they see Facebook’s enormous cache of data as a remarkable resource, offering an unprecedentedly detailed and sweeping look at American society. That store of information contains both details that a user might tell Facebook — their age, hometown, schooling, family relationships — and insights that the company has picked up along the way, such as the interest groups they’ve joined and geographic distribution of who they call a “friend.”

It’s all the more significant, researchers say, when you consider that Facebook’s user base — about 239 million monthly users in the U.S. and Canada at last count — cuts across just about every demographic group.

And all that information, say researchers, lets them take guesses about users’ wealth. Facebook itself recently patented a way of figuring out someone’s socioeconomic status using factors ranging from their stated hobbies to how many internet-connected devices they own.
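As a toy illustration of how proxy signals like these might be combined into a socioeconomic guess (the features, weights and thresholds below are invented for illustration; they are not the method described in Facebook's patent):

```python
def guess_income_bracket(num_devices: int, hobbies: list) -> str:
    """Toy proxy score: more connected devices and certain stated hobbies
    nudge the guess upward. Purely illustrative, not Facebook's method."""
    upscale_hobbies = {"sailing", "golf", "skiing"}  # invented signal list
    score = num_devices + 2 * sum(1 for h in hobbies if h in upscale_hobbies)
    if score >= 6:
        return "higher"
    if score >= 3:
        return "middle"
    return "lower"

print(guess_income_bracket(1, ["gaming"]))   # prints: lower
print(guess_income_bracket(4, ["sailing"]))  # prints: higher
```

Even a crude scorer like this makes the privacy stakes clear: the inputs are things users volunteer for entirely different purposes.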

A Facebook spokesman addressed the potential privacy implications of the study’s access to user data, saying, “We conduct research at Facebook responsibly, which includes making sure we protect people’s information.” The spokesman added that Facebook follows an “enhanced” review process for research projects, adopted in 2014 after a controversy over a study that manipulated some people’s news feeds to see if it made them happier or sadder.

According to a Stanford University source familiar with Chetty’s study, the Facebook account data used in the research has been stripped of any details that could be used to identify users. The source added that academics involved in the study have gone through security screenings that include background checks, and can access the Facebook data only in secure facilities….(More)”.

The Social Media Threat to Society and Security


George Soros at Project Syndicate: “It takes significant effort to assert and defend what John Stuart Mill called the freedom of mind. And there is a real chance that, once lost, those who grow up in the digital age – in which the power to command and shape people’s attention is increasingly concentrated in the hands of a few companies – will have difficulty regaining it.

The current moment in world history is a painful one. Open societies are in crisis, and various forms of dictatorships and mafia states, exemplified by Vladimir Putin’s Russia, are on the rise. In the United States, President Donald Trump would like to establish his own mafia-style state but cannot, because the Constitution, other institutions, and a vibrant civil society won’t allow it….

The rise and monopolistic behavior of the giant American Internet platform companies is contributing mightily to the US government’s impotence. These companies have often played an innovative and liberating role. But as Facebook and Google have grown ever more powerful, they have become obstacles to innovation, and have caused a variety of problems of which we are only now beginning to become aware…

Social media companies’ true customers are their advertisers. But a new business model is gradually emerging, based not only on advertising but also on selling products and services directly to users. They exploit the data they control, bundle the services they offer, and use discriminatory pricing to keep more of the benefits that they would otherwise have to share with consumers. This enhances their profitability even further, but the bundling of services and discriminatory pricing undermine the efficiency of the market economy.

Social media companies deceive their users by manipulating their attention, directing it toward their own commercial purposes, and deliberately engineering addiction to the services they provide. This can be very harmful, particularly for adolescents.

There is a similarity between Internet platforms and gambling companies. Casinos have developed techniques to hook customers to the point that they gamble away all of their money, even money they don’t have.

Something similar – and potentially irreversible – is happening to human attention in our digital age. This is not a matter of mere distraction or addiction; social media companies are actually inducing people to surrender their autonomy. And this power to shape people’s attention is increasingly concentrated in the hands of a few companies.

This would have far-reaching political consequences. People without the freedom of mind can be easily manipulated. This danger does not loom only in the future; it already played an important role in the 2016 US presidential election.

There is an even more alarming prospect on the horizon: an alliance between authoritarian states and large, data-rich IT monopolies, bringing together nascent systems of corporate surveillance with already-developed systems of state-sponsored surveillance. This may well result in a web of totalitarian control the likes of which not even George Orwell could have imagined….(More)”.

Smarter New York City: How City Agencies Innovate


Book edited by André Corrêa d’Almeida: “Innovation is often presented as being in the exclusive domain of the private sector. Yet despite widespread perceptions of public-sector inefficiency, government agencies have much to teach us about how technological and social advances occur. Improving governance at the municipal level is critical to the future of the twenty-first-century city, from environmental sustainability to education, economic development, public health, and beyond. In this age of acceleration and massive migration of people into cities around the world, this book explains how innovation from within city agencies and administrations makes urban systems smarter and shapes life in New York City.
Using a series of case studies, Smarter New York City describes the drivers and constraints behind urban innovation, including leadership and organization; networks and interagency collaboration; institutional context; technology and real-time data collection; responsiveness and decision making; and results and impact. Cases include residential organic-waste collection, an NYPD program that identifies the sound of gunshots in real time, and the Vision Zero attempt to end traffic casualties, among others. Challenging the usefulness of a tech-centric view of urban innovation, Smarter New York City brings together a multidisciplinary and integrated perspective to imagine new possibilities from within city agencies, with practical lessons for city officials, urban planners, policy makers, civil society, and potential private-sector partners….(More)”.

Small Data for Big Impact


Liz Luckett at the Stanford Social Innovation Review: “As an investor in data-driven companies, I’ve been thinking a lot about my grandfather—a baker, a small business owner, and, I now realize, a pioneering data scientist. Without much more than pencil, paper, and extraordinarily deep knowledge of his customers in Washington Heights, Manhattan, he bought, sold, and managed inventory while also managing risk. His community was poor, but his business prospered. This was not because of what we celebrate today as the power and predictive promise of big data, but rather because of what I call small data: nuanced market insights that come through regular and trusted interactions.

Big data takes into account volumes of information from largely electronic sources—such as credit cards, pay stubs, test scores—and segments people into groups. As a result, people participating in the formalized economy benefit from big data. But people who are paid in cash and have no recognized accolades, such as higher education, are left out. Small data captures those insights to address this market failure. My grandfather, for example, had critical customer information he carefully gathered over the years: who could pay now, who needed a few days more, and which tabs to close. If he had access to a big data algorithm, it likely would have told him all his clients were unlikely to repay him, based on the fact that they were low income (vs. high income) and low education level (vs. college degree). Today, I worry that in our enthusiasm for big data and aggregated predictions, we often lose the critical insights we can gain from small data, because we don’t collect it. In the process, we are missing vital opportunities to both make money and create economic empowerment.

We won’t solve this problem of big data by returning to my grandfather’s shop floor. What we need is more and better data—a small data movement to supply vital missing links in marketplaces and supply chains the world over. What are the proxies that allow large companies to discern whom among the low income are good customers in the absence of a shopkeeper? At The Social Entrepreneurs’ Fund (TSEF), we are profitably investing in a new breed of data company: enterprises that are intentionally and responsibly serving low-income communities, and generating new and unique insights about the behavior of individuals in the process. The value of the small data they collect is becoming increasingly useful to other partners, including corporations who are willing to pay for it. It is a kind of dual market opportunity that for the first time makes it economically advantageous for these companies to reach the poor. We are betting on small data to transform opportunities and quality of life for the underserved, tap into markets that were once seen as too risky or too costly to reach, and earn significant returns for investors….(More)”.

‘Epic Duck Challenge’ shows drones can outdo people at surveying wildlife


Jarrod Hodgson, Aleks Terauds and Lian Pin Koh in The Conversation: “Ecologists are increasingly using drones to gather data. Scientists have used remotely piloted aircraft to estimate the health of fragile polar mosses, to measure and predict the mass of leopard seals, and even to collect whale snot. Drones have also been labelled as game-changers for wildlife population monitoring.

But once the take-off dust settles, how do we know if drones produce accurate data? Perhaps even more importantly, how do the data compare to those gathered using a traditional ground-based approach?

To answer these questions we created the #EpicDuckChallenge, which involved deploying thousands of plastic replica ducks on an Adelaide beach, and then testing various methods of tallying them up.

As we report today in the journal Methods in Ecology and Evolution, drones do indeed generate accurate wildlife population data – even more accurate, in fact, than those collected the old-fashioned way.

Assessing the accuracy of wildlife count data is hard. We can’t be sure of the true number of animals present in a group of wild animals. So, to overcome this uncertainty, we created life-sized, replica seabird colonies, each with a known number of individuals.

From the optimum vantage point and in ideal weather conditions, experienced wildlife spotters independently counted the colonies from the ground using binoculars and telescopes. At the same time, a drone captured photographs of each colony from a range of heights. Citizen scientists then used these images to tally the number of animals they could see.

Counts of birds in drone-derived imagery were better than those made by wildlife observers on the ground. The drone approach was more precise and more accurate – it produced counts that were consistently closer to the true number of individuals….(More)”.
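The distinction the study draws between precision (how consistent the counts are) and accuracy (how close they are to the true number) can be made concrete with a short calculation on invented counts, not the study's data:

```python
from statistics import mean, stdev

TRUE_COUNT = 1000  # known size of one replica colony (invented figure)

ground_counts = [880, 920, 850, 960, 900]   # hypothetical observer tallies
drone_counts = [985, 1010, 995, 1005, 990]  # hypothetical image-based tallies

def summarise(counts, true_count):
    bias = mean(counts) - true_count  # accuracy: mean distance from truth
    spread = stdev(counts)            # precision: consistency across counters
    return bias, spread

for label, counts in (("ground", ground_counts), ("drone", drone_counts)):
    bias, spread = summarise(counts, TRUE_COUNT)
    print(f"{label}: bias={bias:+.1f}, spread={spread:.1f}")
```

With replica colonies the true count is known, so both quantities can be computed directly; that is exactly the uncertainty the #EpicDuckChallenge was designed to remove.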

An AI That Reads Privacy Policies So That You Don’t Have To


Andy Greenberg at Wired: “…Today, researchers at Switzerland’s Federal Institute of Technology at Lausanne (EPFL), the University of Wisconsin and the University of Michigan announced the release of Polisis—short for “privacy policy analysis”—a new website and browser extension that uses their machine-learning-trained app to automatically read and make sense of any online service’s privacy policy, so you don’t have to.

In about 30 seconds, Polisis can read a privacy policy it’s never seen before and extract a readable summary, displayed in a graphic flow chart, of what kind of data a service collects, where that data could be sent, and whether a user can opt out of that collection or sharing. Polisis’ creators have also built a chat interface they call Pribot that’s designed to answer questions about any privacy policy, intended as a sort of privacy-focused paralegal advisor. Together, the researchers hope those tools can unlock the secrets of how tech firms use your data that have long been hidden in plain sight….

Polisis isn’t actually the first attempt to use machine learning to pull human-readable information out of privacy policies. Both Carnegie Mellon University and Columbia have made their own attempts at similar projects in recent years, points out NYU Law Professor Florencia Marotta-Wurgler, who has focused her own research on user interactions with terms of service contracts online. (One of her own studies showed that only .07 percent of users actually click on a terms of service link before clicking “agree.”) The Usable Privacy Policy Project, a collaboration that includes both Columbia and CMU, released its own automated tool to annotate privacy policies just last month. But Marotta-Wurgler notes that Polisis’ visual and chat-bot interfaces haven’t been tried before, and says the latest project is also more detailed in how it defines different kinds of data. “The granularity is really nice,” Marotta-Wurgler says. “It’s a way of communicating this information that’s more interactive.”…(More)”.
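The segment-labelling task Polisis performs can be caricatured with a simple keyword matcher (the categories and keywords below are invented for illustration; Polisis itself relies on trained neural networks, not keyword rules):

```python
# Invented categories loosely echoing the kinds of labels Polisis displays.
CATEGORIES = {
    "data_collection": ["collect", "gather", "obtain"],
    "third_party_sharing": ["share", "third party", "partners"],
    "user_choice": ["opt out", "opt-out", "consent"],
}

def label_segment(segment: str) -> list:
    """Return every category whose keywords appear in the policy segment."""
    lowered = segment.lower()
    return [cat for cat, keywords in CATEGORIES.items()
            if any(kw in lowered for kw in keywords)]

print(label_segment(
    "We may share your data with third party partners unless you opt out."))
# prints: ['third_party_sharing', 'user_choice']
```

The gap between this sketch and the real system is the point Marotta-Wurgler makes about granularity: a learned model can label fine-grained data practices that no fixed keyword list captures.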

World’s biggest city database shines light on our increasingly urbanised planet


EU Joint Research Centre: “The JRC has launched a new tool with data on all 10,000 urban centres scattered across the globe. It is the largest and most comprehensive database on cities ever published.

With data derived from the JRC’s Global Human Settlement Layer (GHSL), researchers have discovered that the world has become even more urbanised than previously thought.

Populations in urban areas doubled in Africa and grew by 1.1 billion in Asia between 1990 and 2015.

Globally, more than 400 cities have a population between 1 and 5 million. More than 40 cities have 5 to 10 million people, and there are 32 ‘megacities’ with above 10 million inhabitants.

There are some promising signs for the environment: Cities became 25% greener between 2000 and 2015. And although air pollution in urban centres had been increasing since 1990, the trend reversed between 2000 and 2015.

With every high density area of at least 50,000 inhabitants covered, the city centres database shows growth in population and built-up areas over the past 40 years.  Environmental factors tracked include:

  • ‘Greenness’: the estimated amount of healthy vegetation in the city centre
  • Soil sealing: the covering of the soil surface with materials like concrete and stone, as a result of new buildings, roads and other public and private spaces
  • Air pollution: the level of polluting particles such as PM2.5 in the air
  • Vicinity to protected areas: the percentage of natural protected space within 30 km distance from the city centre’s border
  • Disaster risk-related exposure of population and buildings in low lying areas and on steep slopes.

The data is free to access and open to everyone. It applies big data analytics and a global, people-based definition of cities, providing support to monitor global urbanisation and the 2030 Sustainable Development Agenda.

The information gained from the GHSL is used to map out population density and settlement maps. Satellite, census and local geographic information are used to create the maps….(More)”.

A science that knows no country: Pandemic preparedness, global risk, sovereign science


Paper by J. Benjamin Hurlbut: “… examines political norms and relationships associated with governance of pandemic risk. Through a pair of linked controversies over scientific access to H5N1 flu virus and genomic data, it examines the duties, obligations, and allocations of authority articulated around the imperative for globally free-flowing information, and around the corollary imperative for a science that is set free to produce such information.

It argues that scientific regimes are laying claim to a kind of sovereignty, particularly in moments where scientific experts call into question the legitimacy of claims grounded in national sovereignty, by positioning the norms of scientific practice, including a commitment to unfettered access to scientific information and to the authority of science to declare what needs to be known, as essential to global governance. Scientific authority occupies a constitutional position insofar as it figures centrally in the repertoire of imaginaries that shape how a global community is imagined: what binds that community together and what shared political commitments, norms, and subjection to delegated authority are seen as necessary for it to be rightly governed….(More)”.