Website Seeks to Make Government Data Easier to Sift Through

Steve Lohr at the New York Times: “For years, the federal government, states and some cities have enthusiastically made vast troves of data open to the public. Acres of paper records on demographics, public health, traffic patterns, energy consumption, family incomes and many other topics have been digitized and posted on the web.

This abundance of data can be a gold mine for discovery and insights, but finding the nuggets can be arduous, requiring special skills.

A project coming out of the M.I.T. Media Lab on Monday seeks to ease that challenge and to make the value of government data available to a wider audience. The project, called Data USA, bills itself as “the most comprehensive visualization of U.S. public data.” It is free, and its software code is open source, meaning that developers can build custom applications by adding other data.

Cesar A. Hidalgo, an assistant professor of media arts and sciences at the M.I.T. Media Lab who led the development of Data USA, said the website was devised to “transform data into stories.” Those stories are typically presented as graphics, charts and written summaries….Type “New York” into the Data USA search box, and a drop-down menu presents choices — the city, the metropolitan area, the state and other options. Select the city, and the page displays an aerial shot of Manhattan with three basic statistics: population (8.49 million), median household income ($52,996) and median age (35.8).

Lower on the page are six icons for related subject categories, including economy, demographics and education. If you click on demographics, one of the so-called data stories appears, based largely on data from the American Community Survey of the United States Census Bureau.

Using colorful graphics and short sentences, it shows the median age of foreign-born residents of New York (44.7) and of residents born in the United States (28.6); the most common countries of origin for immigrants (the Dominican Republic, China and Mexico); and the percentage of residents who are American citizens (82.8 percent, compared with a national average of 93 percent).

Data USA features a selection of data results on its home page. They include the gender wage gap in Connecticut; the racial breakdown of poverty in Flint, Mich.; the wages of physicians and surgeons across the United States; and the institutions that award the most computer science degrees….(More)

Accountable machines: bureaucratic cybernetics?

Alison Powell at LSE Media Policy Project Blog: “Algorithms are everywhere, or so we are told, and the black boxes of algorithmic decision-making make oversight of processes that regulators and activists argue ought to be transparent more difficult than in the past. But when, and where, and which machines do we wish to make accountable, and for what purpose? In this post I discuss how algorithms discussed by scholars are most commonly those at work on media platforms whose main products are the social networks and attention of individuals. Algorithms, in this case, construct individual identities through patterns of behaviour, and provide the opportunity for finely targeted products and services. While there are serious concerns about, for instance, price discrimination, algorithmic systems for communicating and consuming are, in my view, less inherently problematic than processes that impact on our collective participation and belonging as citizenship. In this second sphere, algorithmic processes – especially machine learning – combine with processes of governance that focus on individual identity performance to profoundly transform how citizenship is understood and undertaken.

Communicating and consuming

In the communications sphere, algorithms are what make it possible to make money from the web, for example through advertising brokerage platforms that help companies bid for ads on major newspaper websites. IP address monitoring, which tracks clicks and web activity, creates detailed consumer profiles and transforms the everyday experience of communication into a constantly updated production of consumer information. This process of personal profiling is at the heart of many of the concerns about algorithmic accountability. The perpetual production of data by individuals, and the increasing capacity to analyse it even when it doesn’t appear to be related, has certainly revolutionised advertising by allowing more precise targeting, but what has it done for areas of public interest?

John Cheney-Lippold identifies how the categories of identity are now developed algorithmically, since a category like gender is not based on self-disclosure, but instead on patterns of behaviour that fit with expectations set by previous alignment to a norm. In assessing ‘algorithmic identities’, he notes that these produce identity profiles which are narrower and more behaviour-based than the identities that we perform. This is a result of the fact that many of the systems that inspired the design of algorithmic systems were based on using behaviour and other markers to optimise consumption. Algorithmic identity construction has spread from the world of marketing to the broader world of citizenship – as evidenced by the Citizen Ex experiment shown at the Web We Want Festival in 2015.

Individual consumer-citizens

What’s really at stake is that the expansion of algorithmic assessment of commercially derived big data has extended the frame of the individual consumer into all kinds of other areas of experience. In a supposed ‘age of austerity’ when governments believe it’s important to cut costs, this connects with the view of citizens as primarily consumers of services, and furthermore, with the idea that a citizen is an individual subject whose relation to a state can be disintermediated given enough technology. So, with sensors on your garbage bins you don’t need to even remember to take them out. With pothole reporting platforms like FixMyStreet, a city government can be responsive to an aggregate of individual reports. But what aspects of our citizenship are collective? When, in the algorithmic state, can we expect to be together?

Put another way, is there any algorithmic process to value the long term education, inclusion, and sustenance of a whole community for example through library services?…

Seeing algorithms – machine learning in particular – as supporting decision-making for broad collective benefit rather than as part of ever more specific individual targeting and segmentation might make them more accountable. But more importantly, this would help algorithms support society – not just individual consumers….(More)”

It’s not big data that discriminates – it’s the people that use it

 in the Conversation: “Data can’t be racist or sexist, but the way it is used can help reinforce discrimination. The internet means more data is collected about us than ever before and it is used to make automatic decisions that can hugely affect our lives, from our credit scores to our employment opportunities.

If that data reflects unfair social biases against sensitive attributes, such as our race or gender, the conclusions drawn from that data might also be based on those biases.

But this era of “big data” doesn’t need to entrench inequality in this way. If we build smarter algorithms to analyse our information and ensure we’re aware of how discrimination and injustice may be at work, we can actually use big data to counter our human prejudices.

This kind of problem can arise when computer models are used to make predictions in areas such as insurance, financial loans and policing. If members of a certain racial group have historically been more likely to default on their loans, or been more likely to be convicted of a crime, then the model can deem these people more risky. That doesn’t necessarily mean that these people actually engage in more criminal behaviour or are worse at managing their money. They may just be disproportionately targeted by police and sub-prime mortgage salesmen.

Excluding sensitive attributes

Data scientist Cathy O’Neil has written about her experience of developing models for homeless services in New York City. The models were used to predict how long homeless clients would be in the system and to match them with appropriate services. She argues that including race in the analysis would have been unethical.

If the data showed white clients were more likely to find a job than black ones, the argument goes, then staff might focus their limited resources on the white clients who would be more likely to have a positive outcome. While sociological research has unveiled the ways that racial disparities in homelessness and unemployment are the result of unjust discrimination, algorithms can’t tell the difference between just and unjust patterns. And so datasets should exclude characteristics that may be used to reinforce the bias, such as race.

But this simple response isn’t necessarily the answer. For one thing, machine learning algorithms can often infer sensitive attributes from a combination of other, non-sensitive facts. People of a particular race may be more likely to live in a certain area, for example. So excluding those attributes may not be enough to remove the bias….
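As a minimal illustration of this point (invented numbers, plain Python, not any particular production system), a model that is never shown the sensitive attribute can still produce disparate outcomes through a correlated proxy such as residential area:

```python
import random

random.seed(0)

# Hypothetical population: the sensitive attribute (group) is NEVER given to
# the model, but residential area correlates strongly with group membership.
population = []
for _ in range(10_000):
    group = random.random() < 0.5                          # sensitive attribute
    area = 1 if random.random() < (0.8 if group else 0.2) else 0
    population.append((group, area))

# A naive "risk model" that only sees the non-sensitive proxy
def approve(area: int) -> bool:
    return area == 0                                       # penalises area 1

def rate(g: bool) -> float:
    """Approval rate for one group, computed over the whole population."""
    members = [a for grp, a in population if grp == g]
    return sum(approve(a) for a in members) / len(members)

print(f"approval rate, group A: {rate(False):.2f}")        # roughly 0.80
print(f"approval rate, group B: {rate(True):.2f}")         # roughly 0.20
```

Because area stands in for group membership, stripping the sensitive attribute from the inputs leaves the disparity in the model’s decisions almost untouched.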

An enlightened service provider might, upon seeing the results of the analysis, investigate whether and how racism is a barrier to their black clients getting hired. Equipped with this knowledge they could begin to do something about it. For instance, they could ensure that local employers’ hiring practices are fair and provide additional help to those applicants more likely to face discrimination. The moral responsibility lies with those responsible for interpreting and acting on the model, not the model itself.

So the argument that sensitive attributes should be stripped from the datasets we use to train predictive models is too simple. Of course, collecting sensitive data should be carefully regulated because it can easily be misused. But misuse is not inevitable, and in some cases, collecting sensitive attributes could prove absolutely essential in uncovering, predicting, and correcting unjust discrimination. For example, in the case of homeless services discussed above, the city would need to collect data on ethnicity in order to discover potential biases in employment practices….(More)


Five ways tech is crowdsourcing women’s empowerment

Zara Rahman in The Guardian: “Around the world, women’s rights advocates are crowdsourcing their own data rather than relying on institutional datasets.

Citizen-generated data is especially important for women’s rights issues. In many countries the lack of women in positions of institutional power, combined with slow, bureaucratic systems and a lack of prioritisation of women’s rights issues means data isn’t gathered on relevant topics, let alone appropriately responded to by the state.

Even when data is gathered by institutions, societal pressures may mean it remains inadequate. In the case of gender-based violence, for instance, women often suffer in silence, worrying nobody will believe them or that they will be blamed. Providing a way for women to contribute data anonymously or, if they so choose, with their own details, can be key to documenting violence and understanding the scale of a problem, and thus deciding upon appropriate responses.

Crowdsourcing data on street harassment in Egypt

Using open source platform Ushahidi, HarassMap provides women with a way to document incidences of street harassment. The project, which began in 2010, is raising awareness of how common street harassment is, giving women’s rights advocates a concrete way to highlight the scale of the problem….

Documenting experiences of reporting sexual harassment and violence to the police in India

Last year, The Ladies Finger, a women’s zine based in India, partnered with Amnesty International to support its Ready to Report campaign, which aimed to make it easier for survivors of sexual violence to file a police complaint. Using social media and through word of mouth, it asked the community if they had experiences to share about reporting sexual assault and harassment to the police. Using these crowdsourced leads, The Ladies Finger’s reporters spoke to people willing to share their experiences and put together a series of detailed contextualised stories. They included a piece that evoked a national outcry and spurred the Uttar Pradesh government to make an arrest for stalking, after six months of inaction….

Reporting sexual violence in Syria

Women Under Siege is a global project by Women’s Media Centre that is investigating how rape and sexual violence is used in conflicts. Its Syria project crowdsources data on sexual violence in the war-torn country. Like HarassMap, it uses the Ushahidi platform to geolocate where acts of sexual violence take place. Where possible, initial reports are contextualised with deeper media reports around the case in question….
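A hedged sketch of the basic aggregation behind such crowdmaps: binning geolocated reports into a coarse grid to surface hotspots. The coordinates, grid size, and helper function below are illustrative assumptions, not the Ushahidi platform’s actual implementation:

```python
from collections import Counter

# Invented (lat, lon) coordinates of anonymised reports, for illustration only
reports = [
    (30.0444, 31.2357),
    (30.0459, 31.2243),
    (30.0626, 31.2497),
    (30.0450, 31.2360),
]

def grid_cell(lat: float, lon: float, size: float = 0.01) -> tuple[float, float]:
    """Snap a coordinate to the south-west corner of its grid cell."""
    return (round(lat // size * size, 4), round(lon // size * size, 4))

# Count reports per cell; the densest cells are the map's hotspots
hotspots = Counter(grid_cell(lat, lon) for lat, lon in reports)
print(hotspots.most_common(1))
```

Real deployments layer verification and contextual reporting on top of this kind of spatial aggregation, as the projects above describe.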

Finding respectful gynaecologists in India

After recognising that many women in her personal networks were having bad experiences with gynaecologists in India, Delhi-based Amba Azaad began – with the help of her friends – putting together Gynaecologists We Trust, a list of gynaecologists who had treated patients respectfully. As the site says, “Finding doctors who are on our side is hard enough, and when it comes to something as intimate as our internal plumbing, it’s even more difficult.”…

Ending tech-related violence against women

In 2011, Take Back the Tech, an initiative from the Association for Progressive Communications, started a map gathering incidences of tech-related violence against women. Campaign coordinator Sara Baker says crowdsourcing data on this topic is particularly useful as “victims/survivors are often forced to tell their stories repeatedly in an attempt to access justice with little to no action taken on the part of authorities or intermediaries”. Rather than telling that story multiple times and seeing it go nowhere, their initiative gives people “the opportunity to make their experience visible (even if anonymously) and makes them feel like someone is listening and taking action”….(More)

Yahoo Releases the Largest-ever Machine Learning Dataset for Researchers

Suju Rajan at Yahoo Labs: “Data is the lifeblood of research in machine learning. However, access to truly large-scale datasets is a privilege that has been traditionally reserved for machine learning researchers and data scientists working at large companies – and out of reach for most academic researchers.

Research scientists at Yahoo Labs have long enjoyed working on large-scale machine learning problems inspired by consumer-facing products. This has enabled us to advance the thinking in areas such as search ranking, computational advertising, information retrieval, and core machine learning. A key aspect of interest to the external research community has been the application of new algorithms and methodologies to production traffic and to large-scale datasets gathered from real products.

Today, we are proud to announce the public release of the largest-ever machine learning dataset to the research community. The dataset stands at a massive ~110B events (13.5TB uncompressed) of anonymized user-news item interaction data, collected by recording the user-news item interactions of about 20M users from February 2015 to May 2015.

The Yahoo News Feed dataset is a collection based on a sample of anonymized user interactions on the news feeds of several Yahoo properties, including the Yahoo homepage, Yahoo News, Yahoo Sports, Yahoo Finance, Yahoo Movies, and Yahoo Real Estate.

Our goals are to promote independent research in the fields of large-scale machine learning and recommender systems, and to help level the playing field between industrial and academic research. The dataset is available as part of the Yahoo Labs Webscope data-sharing program, which is a reference library of scientifically-useful datasets comprising anonymized user data for non-commercial use.

In addition to the interaction data, we are providing categorized demographic information (age range, gender, and generalized geographic data) for a subset of the anonymized users. On the item side, we are releasing the title, summary, and key-phrases of the pertinent news article. The interaction data is timestamped with the relevant local time and also contains partial information about the device on which the user accessed the news feeds, which allows for interesting work in contextual recommendation and temporal data mining….(More)”
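As a hypothetical sketch of the temporal analysis such timestamped interaction data enables (the actual Webscope file format is not reproduced here; records are assumed, for illustration only, to carry a user ID, a Unix timestamp, an item ID, and an event type):

```python
from collections import Counter
from datetime import datetime, timezone

# Invented records in an assumed (user_id, unix_timestamp, item_id, event)
# layout -- NOT the real Yahoo News Feed schema.
interactions = [
    ("u1", 1425200000, "n42", "click"),
    ("u1", 1425203600, "n17", "view"),
    ("u2", 1425207200, "n42", "view"),
]

# Count events per hour of day -- the kind of temporal aggregation that
# timestamped interaction logs make possible.
by_hour = Counter(
    datetime.fromtimestamp(ts, tz=timezone.utc).hour
    for _, ts, _, _ in interactions
)
print(by_hour.most_common())
```

The same grouping pattern extends naturally to per-user, per-item, or per-device aggregates for the contextual recommendation work the dataset is meant to support.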

Finding the Missing Millions Can Help Achieve the Sustainable Development Goals

 and Mariana Dahan in the Huffington Post: “The 2030 Agenda for Sustainable Development, approved in September, takes a holistic approach to development and presents no less than 17 global Sustainable Development Goals (SDGs). In committing to the goals and associated targets, the international community has agreed to a more ambitious development compact — that of ending poverty, protecting the planet while “leaving no one behind”.

Despite this ambition, we may not know who precisely is being left out of our development programs or how to more effectively target our intended beneficiaries.

A staggering 2.4 billion people today lack any form of recognized identity (ID), including some 625 million children, aged 0-14 years, whose births have never been registered with a civil authority. Only 19 out of 198 economies provide a unique ID at birth and use this consistently in civil identification and public services.

The Center for Global Development recently organized an event titled “Identity and the SDGs: How Finding the Missing Millions Can Help Achieve Development Goals”. While intending to speak about SDG target 16.9 on legal identity for all, including birth registration, by 2030, it became obvious that the importance of robust identification goes beyond its intrinsic value: it also enables the achievement of many other SDGs, such as financial inclusion, reduced corruption, gender equality, access to health services and appropriate social protection schemes.

Global initiatives, such as the World Bank Group’s Identification for Development (ID4D) agenda, a cross-institutional and multi-sectoral effort, aim to “make everyone count.” They will build new alliances and reshape existing development strategies in the areas of identification and civil registration and vital statistics (CRVS). On the latter, the World Bank, with a number of partners – including UNICEF, the World Health Organization (WHO) and the Economic and Social Commission for Asia and the Pacific, and several bilateral donors — is launching the Global Financing Facility for Every Woman Every Child, which includes financing aimed at strengthening and expanding ID platforms of CRVS systems….

Finally, the international community should establish the right monitoring mechanisms and indicators to measure whether we are on track to achieving the SDGs. This target for universal identity will be especially critical as a means of monitoring and achieving the SDGs as a whole. As the saying goes, what is not counted doesn’t count, and what is not measured cannot be managed; measuring progress towards global targets is thus a fundamental component of meeting the ambitious goals we have set….(More)”

Biases in collective platforms: Wikipedia, GitHub and crowdmapping

Stefana Broadbent at Nesta: “Many of the collaboratively developed knowledge platforms we discussed at our recent conference, At The Roots of Collective Intelligence, suffer from a well-known “contributors’ bias”.

More than 85% of Wikipedia’s entries have been written by men 

OpenStack, as with most other Open Source projects, has seen the emergence of a small group of developers who author the majority of the projects. In fact 80% of the commits have been authored by slightly less than 8% of the authors, while 90% of the commits correspond to about 17% of all the authors.

GitHub’s Be Social function allows users to “follow” other participants and receive notification of their activity. The most popular contributors tend therefore to attract other users to the projects they are working on. And Open Street Map has 1.2 million registered users, but less than 15% of them have produced the majority of the 13 million elements of information.
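Concentration figures like those quoted above can be computed directly from per-author commit counts. A small sketch with made-up numbers, chosen only to resemble the pattern described, not taken from any of these projects:

```python
def share_of_top(commits: list[int], top_fraction: float) -> float:
    """Fraction of all commits authored by the top `top_fraction` of authors."""
    ranked = sorted(commits, reverse=True)
    k = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:k]) / sum(ranked)

# 100 hypothetical authors: a few prolific contributors, a long tail of
# occasional ones
commit_counts = [500] * 5 + [100] * 15 + [5] * 80

print(f"top 5% of authors wrote {share_of_top(commit_counts, 0.05):.0%} of commits")
print(f"top 20% of authors wrote {share_of_top(commit_counts, 0.20):.0%} of commits")
```

Sorting and taking cumulative shares in this way is the first step towards standard inequality measures (Lorenz curves, Gini coefficients) often used to quantify the power-user effect.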

Research by Quattrone, Capra, De Meo (2015) showed that while the content mapped was not different between active and occasional mappers, the social composition of the power users led to a geographical bias, with less affluent areas remaining unmapped more frequently than urban centres.

These well-known biases in crowdsourcing information, also known as the ‘power users’ effect, were discussed by Professor Licia Capra from the Department of Engineering at UCL. Watch the video of her talk here.

In essence, despite the fact that crowdsourcing platforms are inclusive and open to anyone willing to dedicate the time and effort, there is a process of self-selection. Different factors can explain why certain gender and socio-economic groups are drawn to specific activities, but it is clear that there is a progressive reduction in the diversity of contributors over time.

The effect is more extreme where continuous contributions are needed. As data from the Humanitarian OpenStreetMap Team project showed, humanitarian crises attract many users who contribute intensely for a short time, but only very few participants contribute regularly over the long term. Only a small proportion of power users continue editing or adding code for sustained periods. This effect raises two important questions: does the editing work of the active few skew the information made available, and what can be done to avoid this type of concentration?….

The issue of how to attract more volunteers and editors is more complex and is a constant challenge for any crowdsourcing platform. We can look back at when Wikipedia started losing contributors, which coincided with a period of tighter restrictions on the editing process. This suggests that alongside designing the interface so that contributions are easy to create and share, it is also necessary to design practices and social norms that are immediately and continuously inclusive. – (More)”


Peer review in 2015: A global view

A white paper by Taylor & Francis: “Within the academic community, peer review is widely recognized as being at the heart of scholarly research. However, faith in peer review’s integrity is of ongoing and increasing concern to many. It is imperative that publishers (and academic editors) of peer-reviewed scholarly research learn from each other, working together to improve practices in areas such as ethical issues, training, and data transparency….Key findings:

  • Authors, editors and reviewers all agreed that the most important motivation to publish in peer reviewed journals is making a contribution to the field and sharing research with others.
  • Playing a part in the academic process and improving papers are the most important motivations for reviewers. Similarly, 90% of SAS study respondents said that playing a role in the academic community was a motivation to review.
  • Most researchers, across the humanities and social sciences (HSS) and science, technology and medicine (STM), rate the benefit of the peer review process towards improving their article as 8 or above out of 10. This was found to be the most important aspect of peer review in both the ideal and the real world, echoing the earlier large-scale peer review studies.
  • In an ideal world, there is agreement that peer review should detect plagiarism (with mean ratings of 7.1 for HSS and 7.5 for STM out of 10), but agreement that peer review is currently achieving this in the real world is only 5.7 HSS / 6.3 STM out of 10.
  • Researchers thought there was a low prevalence of gender bias but higher prevalence of regional and seniority bias – and suggest that double blind peer review is most capable of preventing reviewer discrimination where it is based on an author’s identity.
  • Most researchers wait between one and six months for an article they’ve written to undergo peer review, yet authors (not reviewers / editors) think up to two months is reasonable.
  • HSS authors say they are kept less well informed than STM authors about the progress of their article through peer review….(More)”

Statactivism: Forms of Action between Disclosure and Affirmation

Paper by Isabelle Bruno, Emmanuel Didier and Tommaso Vitale: “This article introduces the special issue on statactivism, a particular form of action within the repertoire used by contemporary social movements: the mobilization of statistics. Traditionally, statistics were used by the workers’ movement within class conflicts. But in the current configuration of state restructuring, new accumulation regimes, and changes in work organization in capitalist societies, the activist use of statistics is shifting. This first article seeks to show the use of statistics and quantification in contentious performances connected with state restructuring, the main transformations of the varieties of capitalism, and changes in work organization regimes. The double role of statistics in representing as well as criticizing reality is considered. After showing how important statistical tools are in producing a shared reading of reality, we will discuss the two main dimensions of statactivism – disclosure and affirmation. In other words, we will see the role of stat-activists in denouncing a certain state of reality, and then the efforts to use statistics in creating equivalency among disparate conditions and in cementing emerging social categories. Finally, we present the main contributions of the various research papers in this special issue regarding the use of statistics as a form of action within a larger repertoire of contentious action. Six empirical papers focus on statactivism against the penal machinery in the early 1970s (Grégory Salle), on the mobilisation around the price index in Guadeloupe in 2009 (Boris Samuel) and in Argentina in 2007 (Celia Lury and Ana Gross), on the mobilisations of experts to consolidate a link between working conditions and health issues (Marion Gilles), on the production of activity data for disability policy in France (Pierre-Yves Baudot), and on the use of statistics in social mobilizations for gender equality (Eugenia De Rosa).
Alain Desrosières wrote the last paper, addressing mobilizations that propose innovations in the way of measuring inflation, unemployment, poverty, GDP, and climate change. This special issue is dedicated to him, in order to honor his everlasting intellectual legacy….(More)”