Making data open for everyone


Kathryn L.S. Pettit and Jonathan Schwabish at UrbanWire: “Over the past few years, there have been some exciting developments in open source tools and programming languages, business intelligence tools, big data, open data, and data visualization. These trends, and others, are changing the way we interact with and consume information and data. And that change is driving more organizations and governments to consider better ways to provide their data to more people.

The World Bank, for example, has a concerted effort underway to open its data in better and more visual ways. Google’s Public Data Explorer brings together large datasets from around the world into a single interface. For-profit providers like OpenGov and Socrata are helping local, state, and federal governments open their data (both internally and externally) in newer platforms.

We are firm believers in open data. (There are, of course, limitations to open data because of privacy or security, but that’s a discussion for another time). But open data is not simply about putting more data on the Internet. It’s not only about posting files and telling people where to find them. To allow and encourage more people to use and interact with data, that data needs to be useful and readable not only by researchers, but also by the dad in northern Virginia or the student in rural Indiana who wants to know more about their public libraries.

Open data should be easy to access, analyze, and visualize

Many are working hard to provide more data in better ways, but we have a long way to go. Take, for example, the Congressional Budget Office (full disclosure, one of us used to work at CBO). Twice a year, CBO releases its Budget and Economic Outlook, which provides the 10-year budget projections for the federal government. Say you want to analyze 10-year budget projections for the Pell Grant program. You’d need to select “Get Data” and click on “Baseline Projections for Education” and then choose “Pell Grant Programs.” This brings you to a PDF report, where you can copy the data table you’re looking for into a format you can actually use (say, Excel). You would need to repeat the exercise to find projections for the 21 other programs for which the CBO provides data.

In another case, the Bureau of Labor Statistics has tried to provide users with query tools that avoid the use of PDFs, but still require extra steps to process. You can get the unemployment rate data through their Java Applet (which doesn’t work on all browsers, by the way), select the various series you want, and click “Get Data.” On the subsequent screen, you are given some basic formatting options, but the default display shows all of your data series as separate Excel files. You can then copy and paste or download each one and then piece them together.
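
For readers who want to skip the applet entirely, BLS also publishes a public JSON API that can return several series in one request. Below is a minimal sketch; the endpoint, the series ID LNS14000000 (the headline unemployment rate), and the response layout are assumptions drawn from the API’s public documentation, and v2 calls may require a free registration key.

```python
import requests

# Assumed public BLS endpoint; v2 may require a (free) registration key.
URL = "https://api.bls.gov/publicAPI/v2/timeseries/data/"

payload = {
    "seriesid": ["LNS14000000"],  # assumed ID for the headline unemployment rate
    "startyear": "2010",
    "endyear": "2015",
}

resp = requests.post(URL, json=payload, timeout=30)
resp.raise_for_status()

# Walk the assumed response layout and print one row per observation,
# already merged rather than split across per-series spreadsheets.
for series in resp.json().get("Results", {}).get("series", []):
    for point in series["data"]:
        print(series["seriesID"], point["year"], point["periodName"], point["value"])
```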

Taking a step closer to the ideal of open data, the Institute of Museum and Library Services (IMLS) followed President Obama’s May 2013 executive order to make their data open in a machine-readable format. That’s great, but it only goes so far. The IMLS platform, for example, allows you to explore information about your own public library. But the data are labeled with variable names such as BRANLIB and BKMOB that are not intuitive or clear. Users then have to find the data dictionary to understand what data fields mean, how they’re defined, and how to use them.
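
The gap IMLS leaves is often just a lookup table away from being closed. A hypothetical sketch, assuming a local CSV extract of the survey and the data-dictionary meanings of BRANLIB (branch libraries) and BKMOB (bookmobiles):

```python
import pandas as pd

# Assumed meanings, taken from the survey's data dictionary.
READABLE_NAMES = {
    "BRANLIB": "branch_libraries",
    "BKMOB": "bookmobiles",
}

# Hypothetical local extract of the IMLS public-library survey.
df = pd.read_csv("imls_public_libraries.csv")
df = df.rename(columns=READABLE_NAMES)

# With human-readable names, no data-dictionary detour is needed to read results.
print(df[["branch_libraries", "bookmobiles"]].describe())
```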

These efforts to provide more data represent real progress, but often fail to be useful to the average person. They move from publishing data that are not readable (buried in PDFs or systems that allow the user to see only one record at a time) to data that are machine-readable (libraries of raw data files or APIs, from which data can be extracted using computer code). We now need to move from a world in which data are simply machine-readable to one in which data are human-readable….(More)”

New Privacy Research Has Implications for Design and Policy


At PrivacyTech: “Try visualizing the Internet’s basic architecture. Could you draw it? What would be your mental model for it?

Let’s be more specific: Say you just purchased shoes off a website using your mobile phone at work. How would you visualize that digital process? Would a deeper knowledge of this architecture make more apparent the myriad potential privacy risks in this transaction? Or to put it another way, what would your knowledge, or lack thereof, for these architectural underpinnings reveal about your understanding of privacy and security risks?

Whether you’re a Luddite or a tech wiz, creating these mental models of the Internet is not the easiest endeavor. Just try doing so yourself.

It is an exercise, however, that several individuals underwent for new research that has instructive implications for privacy and security pros.

“So everything I do on the Internet or that other people do on the Internet is basically asking the Internet for information, and the Internet is sending us to various places where the information is and then bringing us back.” – CO1

You’d think those who have a better understanding of how the Internet works would probably have a better understanding of the privacy and security risks, right? Most likely. Paradoxically, though, a better technological understanding may have very little influence on an individual’s response to potential privacy risks.

This is what a dedicated team of researchers from Carnegie Mellon University worked to discover recently in their award-winning paper, “My Data Just Goes Everywhere”: User Mental Models of the Internet and Implications for Privacy and Security—a culmination of research from Ruogu Kang, Laura Dabbish, Nathaniel Fruchter and Sara Kiesler—all from CMU’s Human-Computer Interaction Institute and the Heinz College in Pittsburgh, PA.

“I try to browse through the terms and conditions but there’s so much there I really don’t retain it.” – T11

Presented at the CyLab Usable Privacy and Security Laboratory’s (CUPS) 11th Symposium on Usable Privacy and Security (SOUPS), their research demonstrated that even though savvy and non-savvy users of the Internet have much different perceptions of its architecture, such knowledge was not predictive of whether a user would take the necessary steps to protect their privacy online. Experience, rather, appears to play a more determinate role.

Kang, who led the team, said she was surprised by the results….(More)”

Push, Pull, and Spill: A Transdisciplinary Case Study in Municipal Open Government


New paper by Jan Whittington et al: “Cities hold considerable information, including details about the daily lives of residents and employees, maps of critical infrastructure, and records of the officials’ internal deliberations. Cities are beginning to realize that this data has economic and other value: If done wisely, the responsible release of city information can also release greater efficiency and innovation in the public and private sector. New services are cropping up that leverage open city data to great effect.

Meanwhile, activist groups and individual residents are placing increasing pressure on state and local government to be more transparent and accountable, even as others sound an alarm over the privacy issues that inevitably attend greater data promiscuity. This takes the form of political pressure to release more information, as well as increased requests for information under the many public records acts across the country.

The result of these forces is that cities are beginning to open their data as never before. It turns out there is surprisingly little research to date into the important and growing area of municipal open data. This article is among the first sustained, cross-disciplinary assessments of an open municipal government system. We are a team of researchers in law, computer science, information science, and urban studies. We have worked hand-in-hand with the City of Seattle, Washington for the better part of a year to understand its current procedures from each disciplinary perspective. Based on this empirical work, we generate a set of recommendations to help the city manage risk latent in opening its data….(More)”

Algorithms and Bias


Q. and A. With Cynthia Dwork in the New York Times: “Algorithms have become one of the most powerful arbiters in our lives. They make decisions about the news we read, the jobs we get, the people we meet, the schools we attend and the ads we see.

Yet there is growing evidence that algorithms and other types of software can discriminate. The people who write them incorporate their biases, and algorithms often learn from human behavior, so they reflect the biases we hold. For instance, research has shown that ad-targeting algorithms have shown ads for high-paying jobs to men but not women, and ads for high-interest loans to people in low-income neighborhoods.

Cynthia Dwork, a computer scientist at Microsoft Research in Silicon Valley, is one of the leading thinkers on these issues. In an Upshot interview, which has been edited, she discussed how algorithms learn to discriminate, who’s responsible when they do, and the trade-offs between fairness and privacy.

Q: Some people have argued that algorithms eliminate discrimination because they make decisions based on data, free of human bias. Others say algorithms reflect and perpetuate human biases. What do you think?

A: Algorithms do not automatically eliminate bias. Suppose a university, with admission and rejection records dating back for decades and faced with growing numbers of applicants, decides to use a machine learning algorithm that, using the historical records, identifies candidates who are more likely to be admitted. Historical biases in the training data will be learned by the algorithm, and past discrimination will lead to future discrimination.
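
A toy sketch of that mechanism, using synthetic data rather than any real admissions records: both groups below are equally qualified, but the historical labels penalize one group, and the trained model dutifully reproduces the penalty.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000

group = rng.integers(0, 2, n)      # 0 = historically favored, 1 = disfavored
scores = rng.normal(0.0, 1.0, n)   # qualification, identically distributed

# Historical decisions penalized group 1 regardless of qualification.
admitted = (scores - 0.8 * group + rng.normal(0.0, 0.5, n)) > 0

X = np.column_stack([scores, group])
model = LogisticRegression().fit(X, admitted)

# Same qualification (0.0), different group membership: the model has
# learned the historical penalty and predicts lower odds for group 1.
probe = np.array([[0.0, 0], [0.0, 1]])
print(model.predict_proba(probe)[:, 1])
```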

Q: Are there examples of that happening?

A: A famous example of a system that has wrestled with bias is the resident matching program that matches graduating medical students with residency programs at hospitals. The matching could be slanted to maximize the happiness of the residency programs, or to maximize the happiness of the medical students. Prior to 1997, the match was mostly about the happiness of the programs.

This changed in 1997 in response to “a crisis of confidence concerning whether the matching algorithm was unreasonably favorable to employers at the expense of applicants, and whether applicants could ‘game the system,’ ” according to a paper by Alvin Roth and Elliott Peranson published in The American Economic Review.
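
The mechanism underneath is deferred acceptance (Gale-Shapley), whose textbook property is exactly the slant described here: the proposing side receives its best achievable stable match. A compact sketch with invented preferences, assuming complete, equal-length preference lists:

```python
# Deferred acceptance: proposers ask in preference order; reviewers hold
# their best offer so far. Assumes complete, equal-length preference lists.
def gale_shapley(proposer_prefs, reviewer_prefs):
    free = list(proposer_prefs)
    next_choice = {p: 0 for p in proposer_prefs}
    held = {}  # reviewer -> proposer currently held
    rank = {r: {p: i for i, p in enumerate(prefs)}
            for r, prefs in reviewer_prefs.items()}
    while free:
        p = free.pop()
        r = proposer_prefs[p][next_choice[p]]
        next_choice[p] += 1
        if r not in held:
            held[r] = p
        elif rank[r][p] < rank[r][held[r]]:
            free.append(held[r])  # reviewer trades up; old proposer is free again
            held[r] = p
        else:
            free.append(p)        # rejected; p will try the next choice
    return {p: r for r, p in held.items()}

students = {"s1": ["h1", "h2"], "s2": ["h2", "h1"]}
hospitals = {"h1": ["s2", "s1"], "h2": ["s1", "s2"]}

# Both matchings are stable, but each favors whichever side proposes.
print(sorted(gale_shapley(students, hospitals).items()))                      # students get 1st choices
print(sorted((s, h) for h, s in gale_shapley(hospitals, students).items()))  # hospitals get theirs
```

The 1997 redesign by Roth and Peranson moved the match toward the applicant-proposing side, which is why the crisis of confidence quoted above subsided.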

Q: You have studied both privacy and algorithm design, and co-wrote a paper, “Fairness Through Awareness,” that came to some surprising conclusions about discriminatory algorithms and people’s privacy. Could you summarize those?

A: “Fairness Through Awareness” makes the observation that sometimes, in order to be fair, it is important to make use of sensitive information while carrying out the classification task. This may be a little counterintuitive: The instinct might be to hide information that could be the basis of discrimination….
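
The paper’s formal device is individual fairness: a task-specific similarity metric d sets a budget for how differently any two people may be treated, |f(x) − f(y)| ≤ d(x, y). A rough sketch of that condition follows, with made-up scoring functions and metric; none of this is the paper’s own construction.

```python
import math

def score(x):
    # Smooth score in (0, 1); the logistic curve's slope never exceeds 1/4.
    return 1 / (1 + math.exp(-x))

def hard_cutoff(x):
    # Threshold rule: tiny input differences can flip the outcome entirely.
    return 1.0 if x > 1.05 else 0.0

def metric(x, y):
    # Made-up task-specific similarity metric over a 1-D feature.
    return abs(x - y) / 4.0

def lipschitz_ok(f, d, x, y):
    # Individual fairness: outcomes may differ no more than the people do.
    return abs(f(x) - f(y)) <= d(x, y)

print(lipschitz_ok(score, metric, 1.0, 1.1))        # True: smooth scores pass
print(lipschitz_ok(hard_cutoff, metric, 1.0, 1.1))  # False: near-twins treated oppositely
```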

Q: The law protects certain groups from discrimination. Is it possible to teach an algorithm to do the same?

A: This is a relatively new problem area in computer science, and there are grounds for optimism — for example, resources from the Fairness, Accountability and Transparency in Machine Learning workshop, which considers the role that machines play in consequential decisions in areas like employment, health care and policing. This is an exciting and valuable area for research. …(More)”

Beyond the Common Rule: Ethical Structures for Data Research in Non-Academic Settings


Future of Privacy Forum: “In the wake of last year’s news about the Facebook “emotional contagion” study and subsequent public debate about the role of A/B Testing and ethical concerns around the use of Big Data, FPF Senior Fellow Omer Tene participated in a December symposium on corporate consumer research hosted by Silicon Flatirons. This past month, the Colorado Technology Law Journal published a series of papers that emerged out of the symposium, including “Beyond the Common Rule: Ethical Structures for Data Research in Non-Academic Settings.”

“Beyond the Common Rule,” by Jules Polonetsky, Omer Tene, and Joseph Jerome, continues the Future of Privacy Forum’s effort to build on the notion of consumer subject review boards first advocated by Ryan Calo at FPF’s 2013 Big Data symposium. It explores how researchers, increasingly in corporate settings, are analyzing data and testing theories using often sensitive personal information. Many of these new uses of PII are simply natural extensions of current practices, and are either within the expectations of individuals or the bounds of the FIPPs. Yet many of these projects could involve surprising applications or uses of data, exceeding user expectations, and offering notice and obtaining consent may not be feasible.

This article expands on ideas and suggestions put forward around the recent discussion draft of the White House Consumer Privacy Bill of Rights, which espouses “Privacy Review Boards” as a safety valve for noncontextual data uses. It explores how existing institutional review boards within the academy and for human testing research could offer lessons for guiding principles, providing accountability and enhancing consumer trust, and offers suggestions for how companies — and researchers — can pursue both knowledge and data innovation responsibly and ethically….(More)”

Harnessing Mistrust for Civic Action


Ethan Zuckerman: “…One predictable consequence of mistrust in institutions is a decrease in participation. Fewer than 37% of eligible US voters participated in the 2014 Congressional election. Participation in parliamentary and national elections across Europe is higher than the US’s dismal rates, but has steadily declined since 1979, with turnout for the 2014 European parliamentary elections dropping below 43%. It’s a mistake to blame low turnout on distracted or disinterested voters, when a better explanation exists: why vote if you don’t believe the US Congress or European Parliament is capable of making meaningful change in the world?

In his 2012 book, “Twilight of the Elites”, Christopher Hayes suggests that the political tension of our time is not between left and right, but between institutionalists and insurrectionists. Institutionalists believe we can fix the world’s problems by strengthening and revitalizing the institutions we have. Insurrectionists believe we need to abandon these broken institutions we have and replace them with new, less corrupted ones, or with nothing at all. The institutionalists show up to vote in elections, but they’re being crowded out by the insurrectionists, who take to the streets to protest, or more worryingly, disengage entirely from civic life.

Conventional wisdom suggests that insurrectionists will grow up, stop protesting and start voting. But we may have reached a tipping point where the cultural zeitgeist favors insurrection. My students at MIT don’t want to work for banks, for Google or for universities – they want to build startups that disrupt banks, Google and universities.

The future of democracy depends on finding effective ways for people who mistrust institutions to make change in their communities, their nations and the world as a whole. The real danger is not that our broken institutions are toppled by a wave of digital disruption, but that a generation disengages from politics and civics as a whole.

It’s time to stop criticizing youth for their failure to vote and time to start celebrating the ways insurrectionists are actually trying to change the world. Those who mistrust institutions aren’t just ignoring them. Some are building new systems designed to make existing institutions obsolete. Others are becoming the fiercest and most engaged critics of our institutions, while the most radical are building new systems that resist centralization and concentration of power.

Those outraged by government and corporate complicity in surveillance of the internet have the option of lobbying their governments to forbid these violations of privacy, or building and spreading tools that make it vastly harder for US and European governments to read our mail and track our online behavior. We need both better laws and better tools. But we must recognize that the programmers who build systems like Tor, PGP and TextSecure are engaged in civics as surely as anyone crafting a party’s political platform. The same goes for entrepreneurs building better electric cars, rather than fighting to legislate carbon taxes. As people lose faith in institutions, they seek change less through passing and enforcing laws, and more through building new technologies and businesses whose adoption has the same benefits as wisely crafted and enforced laws….(More)”

‘Smart Cities’ Will Know Everything About You


Mike Weston in the Wall Street Journal: “From Boston to Beijing, municipalities and governments across the world are pledging billions to create “smart cities”—urban areas covered with Internet-connected devices that control citywide systems, such as transit, and collect data. Although the details can vary, the basic goal is to create super-efficient infrastructure, aid urban planning and improve the well-being of the populace.

A byproduct of a tech utopia will be a prodigious amount of data collected on the inhabitants. For instance, at the company I head, we recently undertook an experiment in which some staff volunteered to wear devices around the clock for 10 days. We monitored more than 170 metrics reflecting their daily habits and preferences—including how they slept, where they traveled and how they felt (a fast heart rate and no movement can indicate excitement or stress).

If the Internet age has taught us anything, it’s that where there is information, there is money to be made. With so much personal information available and countless ways to use it, businesses and authorities will be faced with a number of ethical questions.

In a fully “smart” city, every movement an individual makes can be tracked. The data will reveal where she works, how she commutes, her shopping habits, places she visits and her proximity to other people. You could argue that this sort of tracking already exists via various apps and on social-media platforms, or is held by public-transport companies and e-commerce sites. The difference is that with a smart city this data will be centralized and easy to access. Given the value of this data, it’s conceivable that municipalities or private businesses that pay to create a smart city will seek to recoup their expenses by selling it….

Recent history—issues of privacy and security on social networks and chatting apps, and questions about how intellectual-property regulations apply online—has shown that the law has been slow to catch up with digital innovations. So businesses that can purchase smart-city data will be presented with many strategic and ethical concerns.

What degree of targeting is too specific and violates privacy? Should businesses limit the types of goods or services they offer to certain individuals? Is it ethical for data—on an employee’s eating habits, for instance—to be sold to employers or to insurance companies to help them assess claims? Do individuals own their own personal data once it enters the smart-city system?

With or without stringent controlling legislation, businesses in a smart city will need to craft their own policies and procedures regarding the use of data. A large-scale misuse of personal data could provoke a consumer backlash that could cripple a company’s reputation and lead to monster lawsuits. An additional problem is that businesses won’t know which individuals might welcome the convenience of targeted advertising and which will find it creepy—although data science could solve this equation eventually by predicting where each individual’s privacy line is.

A smart city doesn’t have to be as Orwellian as it sounds. If businesses act responsibly, there is no reason why what sounds intrusive in the abstract can’t revolutionize the way people live for the better by offering services that anticipate their needs; by designing ultraefficient infrastructure that makes commuting a (relative) dream; or with a revolutionary approach to how energy is generated and used by businesses and the populace at large….(More)”

The case for data ethics


Steven Tiell at Accenture: “Personal data is the coin of the digital realm, which for business leaders creates a critical dilemma. Companies are being asked to gather more types of data faster than ever to maintain a competitive edge in the digital marketplace; at the same time, however, they are being asked to provide pervasive and granular control mechanisms over the use of that data throughout the data supply chain.

The stakes couldn’t be higher. If organizations, or the platforms they use to deliver services, fail to secure personal data, they expose themselves to tremendous risk—from eroding brand value and the hard-won trust of established vendors and customers to ceding market share, from violating laws to costing top executives their jobs.

To distinguish their businesses in this marketplace, leaders should be asking themselves two questions. What are the appropriate standards and practices our company needs to have in place to govern the handling of data? And how can our company make strong data controls a value proposition for our employees, customers and partners?

Defining effective compliance activities to support legal and regulatory obligations can be a starting point. However, mere compliance with existing regulations—which are, for the most part, focused on privacy—is insufficient. Respect for privacy is a byproduct of high ethical standards, but it is only part of the picture. Companies need to embrace data ethics, an expansive set of practices and behaviors grounded in a moral framework for the betterment of a community (however defined).

RAISING THE BAR

Why ethics? When communities of people—in this case, the business community at large—encounter new influences, the way they respond to and engage with those influences becomes the community’s shared ethics. Individuals who behave in accordance with these community norms are said to be moral, and those who are exemplary are able to gain the trust of their community.

Over time, as ethical standards within a community shift, the bar for trustworthiness is raised on the assumption that participants in civil society must, at a minimum, adhere to the rule of law. And thus, to maintain moral authority and a high degree of trust, actors in a community must constantly evolve to adopt the highest ethical standards.

Actors in the big data community, where security and privacy are at the core of relationships with stakeholders, must adhere to a high ethical standard to gain this trust. This requires them to go beyond privacy law and existing data control measures. It will also reward those who practice strong ethical behaviors and a high degree of transparency at every stage of the data supply chain. The most successful actors will become the platform-based trust authorities, and others will depend on these platforms for disclosure, sharing and analytics of big data assets.

Data ethics becomes a value proposition only once controls and capabilities are in place to granularly manage data assets at scale throughout the data supply chain. It is also beneficial when a community shares the same behavioral norms and taxonomy to describe the data itself, the ethical decision points along the data supply chain, and how those decisions lead to beneficial or harmful impacts….(More)”
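
As a miniature of what “granular control throughout the data supply chain” could mean in practice, the sketch below tags each data asset with purpose and consent metadata and checks every requested use against it. The field names and policy logic are invented for illustration, not drawn from the article.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataAsset:
    name: str
    allowed_purposes: frozenset  # purposes the data subject agreed to
    consent_obtained: bool

def may_use(asset: DataAsset, purpose: str) -> bool:
    # An ethical decision point in code: no consent, or an out-of-purpose use, fails.
    return asset.consent_obtained and purpose in asset.allowed_purposes

heart_rate = DataAsset("wearable_heart_rate", frozenset({"wellness_report"}), True)
print(may_use(heart_rate, "wellness_report"))    # True: within the stated purpose
print(may_use(heart_rate, "insurance_pricing"))  # False: outside what was agreed
```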

Why Protecting Data Privacy Matters, and When


Anne Russell at Data Science Central: “It’s official. Public concerns over the privacy of data used in digital approaches have reached an apex. Worried about the safety of digital networks, consumers want to gain control over what they increasingly sense as a loss of power over how their data is used. It’s not hard to see why. Look at the extent of coverage on the U.S. Government data breach last month and the sheer growth in the number of attacks against government and others overall. Then there is the increasing coverage on the inherent security flaws built into the internet, through which most of our data flows. The costs of data breaches to individuals, industries, and government are adding up. And users are taking note…..

If you’re not sure whether the data fueling your approach will raise privacy and security flags, consider the following. When it comes to data privacy and security, not all data is going to be of equal concern. Much depends on the level of detail in data content, data type, data structure, volume, and velocity, and indeed how the data itself will be used and released.

First there is the data where security and privacy has always mattered and for which there is already an existing and well galvanized body of law in place. Foremost among these is classified or national security data where data usage is highly regulated and enforced. Other data for which there exists a considerable body of international and national law regulating usage includes:

  • Proprietary Data – specifically the data that makes up the intellectual capital of individual businesses and gives them their competitive economic advantage over others, including data protected under copyright, patent, or trade secret laws and the sensitive, protected data that companies collect on behalf of their customers;
  • Infrastructure Data – data from the physical facilities and systems – such as roads, electrical systems, communications services, etc. – that enable local, regional, national, and international economic activity; and
  • Controlled Technical Data – technical, biological, chemical, and military-related data and research that could be considered of national interest and be under foreign export restrictions….

The second group of data that raises privacy and security concerns is personal data. Commonly referred to as Personally Identifiable Information (PII), it is any data that distinguishes individuals from each other. It is also the data that an increasing number of digital approaches rely on, and the data whose use tends to raise the most public ire. …
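
A deliberately simplistic sketch of why PII is the hard case: pattern-based redaction catches the obvious identifiers, but quasi-identifiers (ZIP code plus birth date plus gender, say) still distinguish individuals and slip straight through. The patterns below are illustrative assumptions, not a compliance tool.

```python
import re

# Illustrative patterns for two obvious identifier types; real PII detection
# needs far more than regular expressions.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text):
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Reach Jane at jane.doe@example.org, SSN 123-45-6789."))
# The quasi-identifying remainder ("Jane", employer, location...) survives.
```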

A third category of data needing privacy consideration is the data related to good people working in difficult or dangerous places. Activists, journalists, politicians, whistle-blowers, business owners, and others working in contentious areas and conflict zones need secure means to communicate and share data without fear of retribution and personal harm. That there are parts of the world where individuals can be in mortal danger for speaking out is one of the reasons that TOR (The Onion Router) has received substantial funding from multiple government and philanthropic groups, even at the high risk of enabling anonymized criminal behavior. Indeed, in the absence of alternate secure networks on which to pass data, many would be in grave danger, including the organizers of the Arab Spring in 2010 as well as dissidents in Syria and elsewhere….(More)”

 

The Data Revolution


Review of Rob Kitchin’s The Data Revolution: Big Data, Open Data, Data Infrastructures & their Consequences by David Moats in Theory, Culture and Society: “…As an industry, academia is not immune to cycles of hype and fashion. Terms like ‘postmodernism’, ‘globalisation’, and ‘new media’ have each had their turn filling the top line of funding proposals. Although they are each grounded in tangible shifts, these terms become stretched and fudged to the point of becoming almost meaningless. Yet, they elicit strong, polarised reactions. For at least the past few years, ‘big data’ seems to be the buzzword, which elicits funding, as well as the ire of many in the social sciences and humanities.

Rob Kitchin’s book The Data Revolution is one of the first systematic attempts to strip back the hype surrounding our current data deluge and take stock of what is really going on. This is crucial because this hype is underpinned by very real societal change, threats to personal privacy and shifts in store for research methods. The book acts as a helpful wayfinding device in an unfamiliar terrain, which is still being reshaped, and is admirably written in a language relevant to social scientists, comprehensible to policy makers and accessible even to the less tech savvy among us.

The Data Revolution seems to present itself as the definitive account of this phenomenon but in filling this role ends up adopting a somewhat diplomatic posture. Kitchin takes all the correct and reasonable stances on the matter and advocates all the right courses of action but he is not able to, in the context of this book, pursue these propositions fully. This review will attempt to tease out some of these latent potentials and how they might be pushed in future work, in particular the implications of the ‘performative’ character of both big data narratives and data infrastructures for social science research.

Kitchin’s book starts with the observation that ‘data’ is a misnomer – etymologically data should refer to phenomena in the world which can be abstracted, measured etc. as opposed to the representations and measurements themselves, which should by all rights be called ‘capta’. This is ironic because the worst offenders in what Kitchin calls “data boosterism” seem to conflate data with ‘reality’, unmooring data from its conditions of production and making the relationship between the two seem given or natural.

As Kitchin notes, following Bowker (2005), ‘raw data’ is an oxymoron: data are not so much mined as produced and are necessarily framed technically, ethically, temporally, spatially and philosophically. This is the central thesis of the book, that data and data infrastructures are not neutral and technical but also social and political phenomena. For those at the critical end of research with data, this is a starting assumption, but one which not enough practitioners heed. Most of the book is thus an attempt to flesh out these rapidly expanding data infrastructures and their politics….

Kitchin is at his best when revealing the gap between the narratives and the reality of data analysis such as the fallacy of empiricism – the assertion that, given the granularity and completeness of big data sets and the availability of machine learning algorithms which identify patterns within data (with or without the supervision of human coders), data can “speak for themselves”. Kitchin reminds us that no data set is complete and even these out-of-the-box algorithms are underpinned by theories and assumptions in their creation, and require context specific knowledge to unpack their findings. Kitchin also rightly raises concerns about the limits of big data, that access and interoperability of data is not given and that these gaps and silences are also patterned (Twitter is biased as a sample towards middle class, white, tech savvy people). Yet, this language of veracity and reliability seems to suggest that big data is being conceptualised in relation to traditional surveys, or that our population is still the nation state, when big data could helpfully force us to reimagine our analytic objects and truth conditions and more pressingly, our ethics (Rieder, 2013).

However, performativity may again complicate things. As Kitchin observes, supermarket loyalty cards do not just create data about shopping, they encourage particular sorts of shopping; when research subjects change their behaviour to cater to the metrics and surveillance apparatuses built into platforms like Facebook (Bucher, 2012), then these are no longer just data points representing the social, but partially constitutive of new forms of sociality (this is also true of other types of data as discussed by Savage (2010), but in perhaps less obvious ways). This might have implications for how we interpret data, the distribution between quantitative and qualitative approaches (Latour et al., 2012) or even more radical experiments (Wilkie et al., 2014). Kitchin is relatively cautious about proposing these sorts of possibilities, which is not the remit of the book, though it clearly leaves the door open…(More)”