Google launches new search engine to help scientists find the datasets they need


James Vincent at The Verge: “The service, called Dataset Search, launches today, and it will be a companion of sorts to Google Scholar, the company’s popular search engine for academic studies and reports. Institutions that publish their data online, like universities and governments, will need to include metadata tags in their webpages that describe their data, including who created it, when it was published, how it was collected, and so on. This information will then be indexed by Google’s search engine and combined with information from the Knowledge Graph. (So if dataset X was published by CERN, a little information about the institute will also be included in the search.)

Speaking to The Verge, Natasha Noy, a research scientist at Google AI who helped create Dataset Search, says the aim is to unify the tens of thousands of different repositories for datasets online. “We want to make that data discoverable, but keep it where it is,” says Noy.

At the moment, dataset publication is extremely fragmented. Different scientific domains have their own preferred repositories, as do different governments and local authorities. “Scientists say, ‘I know where I need to go to find my datasets, but that’s not what I always want,’” says Noy. “Once they step out of their unique community, that’s when it gets hard.”

Noy gives the example of a climate scientist she spoke to recently who told her she’d been looking for a specific dataset on ocean temperatures for an upcoming study but couldn’t find it anywhere. She didn’t track it down until she ran into a colleague at a conference who recognized the dataset and told her where it was hosted. Only then could she continue with her work. “And this wasn’t even a particularly boutique depository,” says Noy. “The dataset was well written up in a fairly prominent place, but it was still difficult to find.”

An example search for weather records in Google Dataset Search. (Image: Google)

The initial release of Dataset Search will cover the environmental and social sciences, government data, and datasets from news organizations like ProPublica. However, if the service becomes popular, the amount of data it indexes should quickly snowball as institutions and scientists scramble to make their information accessible….(More)”.
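
The excerpt above does not name the vocabulary behind the "metadata tags". The sketch below assumes the schema.org `Dataset` type expressed as JSON-LD, and shows roughly how a publisher might describe who created a dataset, when it was published, and how it was collected. The helper name `dataset_jsonld` and all example values are hypothetical.

```python
import json

def dataset_jsonld(name, description, creator, date_published, technique):
    """Return a schema.org-style Dataset description as a JSON-LD string."""
    record = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "creator": {"@type": "Organization", "name": creator},  # who created it
        "datePublished": date_published,                        # when it was published
        "measurementTechnique": technique,                      # how it was collected
    }
    return json.dumps(record, indent=2)

# Hypothetical example values, for illustration only.
markup = dataset_jsonld(
    name="Example ocean temperature records",
    description="Illustrative placeholder description.",
    creator="Example Institute",
    date_published="2018-01-01",
    technique="Buoy sensor readings",
)
print(markup)
```

A publisher would typically embed the resulting JSON in the dataset's web page inside a `<script type="application/ld+json">` element so that a crawler can read it.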

Reflecting the Past, Shaping the Future: Making AI Work for International Development


USAID Report: “We are in the midst of an unprecedented surge of interest in machine learning (ML) and artificial intelligence (AI) technologies. These tools, which allow computers to make data-derived predictions and automate decisions, have become part of daily life for billions of people. Ubiquitous digital services such as interactive maps, tailored advertisements, and voice-activated personal assistants are likely only the beginning. Some AI advocates even claim that AI’s impact will be as profound as “electricity or fire” and that it will revolutionize nearly every field of human activity. This enthusiasm has reached international development as well. Emerging ML/AI applications promise to reshape healthcare, agriculture, and democracy in the developing world. ML and AI show tremendous potential for helping to achieve sustainable development objectives globally. They can improve efficiency by automating labor-intensive tasks, or offer new insights by finding patterns in large, complex datasets. A recent report suggests that AI advances could double economic growth rates and increase labor productivity 40% by 2035. At the same time, the very nature of these tools, their ability to codify and reproduce patterns they detect, introduces significant concerns alongside promise.

In developed countries, ML tools have sometimes been found to automate racial profiling, to foster surveillance, and to perpetuate racial stereotypes. Algorithms may be used, either intentionally or unintentionally, in ways that result in disparate or unfair outcomes between minority and majority populations. Complex models can make it difficult to establish accountability or seek redress when models make mistakes. These shortcomings are not restricted to developed countries. They can manifest in any setting, especially in places with histories of ethnic conflict or inequality. As the development community adopts tools enabled by ML and AI, we need a clear-eyed understanding of how to ensure their application is effective, inclusive, and fair. This requires knowing when ML and AI offer a suitable solution to the challenge at hand. It also requires appreciating that these technologies can do harm, and committing to addressing and mitigating these harms.

ML and AI applications may sometimes seem like science fiction, and the technical intricacies of ML and AI can be off-putting for those who haven’t been formally trained in the field. However, there is a critical role for development actors to play as we begin to lean on these tools more and more in our work. Even without technical training in ML, development professionals have the ability — and the responsibility — to meaningfully influence how these technologies impact people.

You don’t need to be an ML or AI expert to shape the development and use of these tools. All of us can learn to ask the hard questions that will keep solutions working for, and not against, the development challenges we care about. Development practitioners already have deep expertise in their respective sectors or regions. They bring necessary experience in engaging local stakeholders, working with complex social systems, and identifying structural inequities that undermine inclusive progress. Unless this expert perspective informs the construction and adoption of ML/AI technologies, ML and AI will fail to reach their transformative potential in development.

This document aims to inform and empower those who may have limited technical experience as they navigate an emerging ML/AI landscape in developing countries. Donors, implementers, and other development partners should expect to come away with a basic grasp of common ML techniques and the problems ML is uniquely well-suited to solve. We will also explore some of the ways in which ML/AI may fail or be ill-suited for deployment in developing-country contexts. Awareness of these risks, and acknowledgement of our role in perpetuating or minimizing them, will help us work together to protect against harmful outcomes and ensure that AI and ML are contributing to a fair, equitable, and empowering future…(More)”.

The UK’s Gender Pay Gap Open Data Law Has Flaws, But Is A Positive Step Forward


Article by Michael McLaughlin: “Last year, the United Kingdom enacted a new regulation requiring companies to report information about their gender pay gap—a measure of the difference in average pay between men and women. The new rules are a good example of how open data can drive social change. However, the regulations have produced some misleading statistics, highlighting the importance of carefully crafting reporting requirements to ensure that they produce useful data.

In the UK, nearly 11,000 companies have filed gender pay gap reports, which include both the difference between the mean and median hourly pay rates for men and women as well as the difference in bonuses. And the initial data reveals several interesting findings. Median pay for men is 11.8 percent higher than for women, on average, and nearly 87 percent of companies pay men more than women on average. In addition, over 1,000 firms had a median pay gap greater than 30 percent. The sectors with the highest pay gaps (construction, finance, and insurance) each pay men at least 20 percent more than women. A major reason for the gap is a lack of women in senior positions; UK women actually make more than men between the ages of 22 and 29. The total pay gap is also a result of more women holding part-time jobs.

However, as detractors note, the UK’s data can be misleading. For example, the data overstates the pay gap on bonuses because it does not adjust these figures for hours worked. More women work part-time than men, so it makes sense that women would receive less in bonus pay when they work less. The data also understates the pay gap because it excludes the high compensation of partners in organizations such as law firms, a group that includes few women. And it is important to note that—by definition—the pay gap data does not compare the wages of men and women working the same jobs, so the data says nothing about whether women receive equal pay for equal work.

Still, publication of the data has sparked an important national conversation. Google searches in the UK for the phrase “gender pay gap” experienced a 12-month high the week the regulations began enforcement, and major news sites like Financial Times have provided significant coverage of the issue by analyzing the reported data. While it is too soon to tell if the law will change employer behavior, such as businesses hiring more female executives, or employee behavior, such as women leaving companies or fields that pay less, countries with similar reporting requirements, such as Belgium, have seen the pay gap narrow following implementation of their rules.

Requiring companies to report this data to the government may be the only way to obtain gender pay gap data, because evidence suggests that the private sector will not produce this data on its own. Only 300 UK organizations joined a voluntary government program to report their gender pay gap in 2011, and as few as 11 actually published the data. Crowdsourced efforts, where women voluntarily report their pay, have also suffered from incomplete data. And even complete data does not illuminate variables such as why women may work in a field that pays less….(More)”.
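
To make the article's caveat concrete (the pay gap does not compare men and women working the same jobs), here is a small sketch with made-up numbers. It assumes the gap is expressed as a percentage of men's average hourly pay. The function name `pay_gap_percent` and the firm's figures are hypothetical and are not drawn from the UK data.

```python
from statistics import mean, median

def pay_gap_percent(men, women, average=mean):
    """Gap in average hourly pay, as a percentage of men's average pay."""
    men_avg = average(men)
    return 100 * (men_avg - average(women)) / men_avg

# Hypothetical firm: each role pays men and women the same hourly rate,
# but most of the senior staff are men.
senior_rate, junior_rate = 40.0, 20.0
men = [senior_rate] * 4 + [junior_rate] * 6
women = [senior_rate] * 1 + [junior_rate] * 9

mean_gap = pay_gap_percent(men, women, mean)
median_gap = pay_gap_percent(men, women, median)
print(f"mean gap: {mean_gap:.1f}%, median gap: {median_gap:.1f}%")
```

In this toy firm the mean gap is about 21.4 percent and the median gap is zero, even though pay within each role is identical. The whole gap comes from who holds the senior roles, which is the pattern the article describes.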

Following Fenno: Learning from Senate Candidates in the Age of Social Media and Party Polarization


David C.W. Parker at The Forum: “Nearly 40 years ago, Richard Fenno published Home Style, a seminal volume explaining how members of Congress think about and engage in the process of representation. To accomplish his task, he observed members of Congress as they crafted and communicated their representational styles to the folks back home in their districts. The book, and Fenno’s ensuing research agenda, served as a clarion call to move beyond sophisticated quantitative analyses of roll call voting and elite interviews in Washington, D.C. to comprehend congressional representation. Instead, Fenno argued, political scientists are better served by going home with members of Congress where “their perceptions of their constituencies are shaped, sharpened, or altered” (Fenno 1978, p. xiii). These perceptions of constituencies fundamentally shape what members of Congress do at home and in Washington. If members of Congress are single-minded seekers of reelection, as we often assume, then political scientists must begin with the constituent relationship essential to winning reelection. Go home, Fenno says, to understand Congress.

There are many ways constituency relationships can be understood and uncovered; the preferred method for Fenno is participant observation, which he variously terms as “soaking and poking” or “just hanging around.” Although it sounds easy enough to sit and watch, good participant observation requires many considerations (as Fenno details in a thorough appendix to Home Style). In this appendix, and in another series of essays, Fenno grapples forthrightly with the tough choices researchers must consider when watching and learning from politicians.

In this essay, I respond to Fenno’s thought-provoking methodological treatise in Home Style and the ensuing collection of musings he published as Watching Politicians: Essays on Participant Observation. I do so for three reasons: First, I wish to reinforce Fenno’s call to action. As the study of political science has matured, it has moved away from engaging with politicians in the field across the various sub-fields, favoring statistical analyses. “Everyone cites Fenno, but no one does Fenno,” I recently opined, echoing another scholar commenting on Fenno’s work (Fenno 2013, p. 2; Parker 2015, p. 246). Unfortunately, that sentiment is supported by data (Grimmer 2013, pp. 13–19; Curry 2017). Although quantitative and formal analyses have led to important insights into the study of political behavior and institutions, politics is as important to our discipline as science. And in politics, the motives and concerns of people are important to witness, not just because they add complexity and richness to our stories, but because they aid in theory generation. Fenno’s study was exploratory, but is full of key theoretical insights relevant to explaining how members of Congress understand their constituencies and the ensuing political choices they make.

Second, to “do” participant observation requires understanding the choices the methodology imposes. This necessitates that those who practice this method of discovery document and share their experiences (Lin 2000). The more the prospective participant observer can understand the size of the choice set she faces and the potential consequences at each decision point in advance, the better her odds of avoiding unanticipated consequences with both immediate and long-term research ramifications. I hope that adding my cumulative experiences to this ongoing methodological conversation will assist in minimizing both unexpected and undesirable consequences for those who follow into the field. Fenno is open about his own choices, and the difficult decisions he faced as a participant observer. Encouraging scholars to engage in participant observation is only half the battle. The other half is to encourage interested scholars to think about those same choices and methodological considerations, while acknowledging that context precludes a one-size-fits-all approach. Fenno’s choices may not be your choices – and that might be just fine depending upon your circumstances. Fenno would wholeheartedly agree.

Finally, Congress and American politics have changed considerably from when Fenno embarked on his research in Home Style. At the end of his introduction, Fenno writes that “this book is about the early to mid-1970s only. These years were characterized by the steady decline of strong national party attachments and strong local party organizations. … Had these conditions been different, House members might have behaved differently in their constituencies” (xv). Developments since Fenno put down his pen include political parties polarizing to an almost unprecedented degree, partisan attachments strengthening among voters, and technology emerging to change fundamentally how politicians engage with constituents. In light of this evolution of political culture in Washington and at home, it is worth considering the consequences for the participant-observation research approach. Many have asked me if it is still possible to do such work in the current political environment, and if so, what are the challenges facing political scientists going into the field? This essay provides some answers.

I proceed as follows: First, I briefly discuss my own foray into the world of participant observation, which occurred during the 2012 Senate race in Montana. Second, I consider two important methodological considerations raised by Fenno: access and participation as an observer. Third, I relate these two issues to a final consideration: the development of social media and the consequences of this for the participant observation enterprise. Finally, I show the perils of social science divorced from context, as demonstrated by the recent Stanford-Dartmouth mailer scandal. I conclude with not just a plea for us to pick up where Fenno has left off, but by suggesting that more thinking like a participant observer would benefit the discipline as a whole by reminding us of our ethical obligations as researchers to each other, and to the political community that we study…(More)”.

AI and Big Data: A Blueprint for a Human Rights, Social and Ethical Impact Assessment


Alessandro Mantelero in Computer Law & Security Review: “The use of algorithms in modern data processing techniques, as well as data-intensive technological trends, suggests the adoption of a broader view of the data protection impact assessment. This will force data controllers to go beyond the traditional focus on data quality and security, and consider the impact of data processing on fundamental rights and collective social and ethical values.

Building on studies of the collective dimension of data protection, this article sets out to embed this new perspective in an assessment model centred on human rights (Human Rights, Ethical and Social Impact Assessment-HRESIA). This self-assessment model intends to overcome the limitations of the existing assessment models, which are either too closely focused on data processing or have an extent and granularity that make them too complicated to evaluate the consequences of a given use of data. In terms of architecture, the HRESIA has two main elements: a self-assessment questionnaire and an ad hoc expert committee. As a blueprint, this contribution focuses mainly on the nature of the proposed model, its architecture and its challenges; a more detailed description of the model and the content of the questionnaire will be discussed in a future publication drawing on the ongoing research….(More)”.

Towards Digital Enlightenment: Essays on the Dark and Light Sides of the Digital Revolution


Book edited by Dirk Helbing: “This new collection of essays follows in the footsteps of the successful volume Thinking Ahead – Essays on Big Data, Digital Revolution, and Participatory Market Society, published at a time when our societies were on a path to technological totalitarianism, as exemplified by mass surveillance reported by Edward Snowden and others.

Meanwhile the threats have diversified and tech companies have gathered enough data to create detailed profiles about almost everyone living in the modern world – profiles that can predict our behavior better than our friends, families, or even partners. This is not only used to manipulate people’s opinions and voting behaviors, but more generally to influence consumer behavior at all levels. It is becoming increasingly clear that we are rapidly heading towards a cybernetic society, in which algorithms and social bots aim to control both the societal dynamics and individual behaviors….(More)”.

Long Term Info-structure


Long Now Foundation Seminar by Juan Benet: “We live in a spectacular time,” … “We’re a century into our computing phase transition. The latest stages have created astonishing powers for individuals, groups, and our species as a whole. We are also faced with accumulating dangers: the capabilities to end the whole humanity experiment are growing and are ever more accessible. In light of the promethean fire that is computing, we must prevent bad outcomes and lock in good ones to build robust foundations for our knowledge, and a safe future. There is much we can do in the short-term to secure the long-term.”

“I come from the front lines of computing platform design to share a number of new super-powers at our disposal, some old challenges that are now soluble, and some new open problems. In this next decade, we’ll need to leverage peer-to-peer networks, crypto-economics, blockchains, Open Source, Open Services, decentralization, incentive-structure engineering, and so much more to ensure short-term safety and the long-term flourishing of humanity.”

Juan Benet is the inventor of the InterPlanetary File System (IPFS)—a new protocol which uses content-addressing to make the web faster, safer, and more open—and the creator of Filecoin, a cryptocurrency-incentivized storage market….(More + Video)”
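
The blurb mentions content-addressing without explaining it. As a rough illustration only, the toy store below uses a SHA-256 digest of the stored bytes as their address. This is not the IPFS protocol or its API, and IPFS's actual identifier format and data chunking are not reproduced here. The class name `ContentStore` is hypothetical.

```python
import hashlib

class ContentStore:
    """Toy content-addressed store: the key is a hash of the bytes stored."""

    def __init__(self):
        self._blocks = {}

    def put(self, data: bytes) -> str:
        # The address is derived from the content, not from a location.
        address = hashlib.sha256(data).hexdigest()
        self._blocks[address] = data
        return address

    def get(self, address: str) -> bytes:
        data = self._blocks[address]
        # Any reader can re-hash the bytes to check they match the address.
        if hashlib.sha256(data).hexdigest() != address:
            raise ValueError("content does not match its address")
        return data

store = ContentStore()
addr = store.put(b"hello, long-term web")
same = store.put(b"hello, long-term web")  # identical bytes, identical address
```

Because the address is computed from the bytes themselves, identical content always maps to the same address, and a reader can detect tampering without trusting whoever served the data.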

Making a Smart City a Fairer City: Chicago’s Technologists Address Issues of Privacy, Ethics, and Equity, 2011-2018


Case study by Gabriel Kuris and Steven S. Strauss at Innovations for Successful Societies: “In 2011, voters in Chicago elected Rahm Emanuel, a 51-year-old former Chicago congressman, as their new mayor. Emanuel inherited a city on the upswing after years of decline but still marked by high rates of crime and poverty, racial segregation, and public distrust in government. The Emanuel administration hoped to harness the city’s trove of digital data to improve Chicagoans’ health, safety, and quality of life. During the next several years, Chief Data Officer Brett Goldstein and his successor Tom Schenk led innovative uses of city data, ranging from crisis management to the statistical targeting of restaurant inspections and pest extermination. As their teams took on more-sophisticated projects that predicted lead-poisoning risks and Escherichia coli outbreaks and created a citywide network of ambient sensors, the two faced new concerns about normative issues like privacy, ethics, and equity. By 2018, Chicago had won acclaim as a smarter city, but was it a fairer city? This case study discusses some of the approaches the city developed to address those challenges and manage the societal implications of cutting-edge technologies….(More)”.

Protecting the Confidentiality of America’s Statistics: Adopting Modern Disclosure Avoidance Methods at the Census Bureau


John Abowd at US Census: “…Throughout our history, we have been leaders in statistical data protection, which we call disclosure avoidance. Other statistical agencies use the terms “disclosure limitation” and “disclosure control.” These terms are all synonymous. Disclosure avoidance methods have evolved since the censuses of the early 1800s, when the only protection used was simply removing names. Executive orders and a series of laws modified the legal basis for these protections, which were finally codified in the 1954 Census Act (13 U.S.C. Sections 8(b) and 9). We have continually added better and stronger protections to keep the data we publish anonymous and underlying records confidential.

However, historical methods cannot completely defend against the threats posed by today’s technology. Growth in computing power, advances in mathematics, and easy access to large, public databases pose a significant threat to confidentiality. These forces have made it possible for sophisticated users to ferret out common data points between databases using only our published statistics. If left unchecked, those users might be able to stitch together these common threads to identify the people or businesses behind the statistics as was done in the case of the Netflix Challenge.

The Census Bureau has been addressing these issues from every feasible angle and changing rapidly with the times to ensure that we protect the data our census and survey respondents provide us. We are doing this by moving to a new, advanced, and far more powerful confidentiality protection system, which uses a rigorous mathematical process that protects respondents’ information and identity in all of our publications.

The new tool is based on the concept known in scientific and academic circles as “differential privacy.” It is also called “formal privacy” because it provides provable mathematical guarantees, similar to those found in modern cryptography, about the confidentiality protections that can be independently verified without compromising the underlying protections.

“Differential privacy” is based on the cryptographic principle that an attacker should not be able to learn any more about you from the statistics we publish using your data than from statistics that did not use your data. After tabulating the data, we apply carefully constructed algorithms to modify the statistics in a way that protects individuals while continuing to yield accurate results. We assume that everyone’s data are vulnerable and provide the same strong, state-of-the-art protection to every record in our database.

The Census Bureau did not invent the science behind differential privacy. However, we were the first organization anywhere to use it when we incorporated differential privacy into the OnTheMap application in 2008. It was used in this event to protect block-level residential population data. Recently, Google, Apple, Microsoft, and Uber have all followed the Census Bureau’s lead, adopting differentially private systems as the standard for protecting user data confidentiality inside their browsers (Chrome), products (iPhones), operating systems (Windows 10), and apps (Uber)….(More)”.
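
The excerpt says that carefully constructed algorithms modify the tabulated statistics, but it does not say which algorithms the Census Bureau uses. As a hedged illustration, the sketch below applies the textbook Laplace mechanism to a single count, assuming each person changes the count by at most 1 (sensitivity 1). The function names and the choice of `epsilon` are hypothetical and are not taken from any Census Bureau system.

```python
import random

def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def noisy_count(true_count, epsilon, rng, sensitivity=1.0):
    """Release a count with Laplace noise of scale sensitivity / epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
releases = [noisy_count(100, epsilon=0.5, rng=rng) for _ in range(10_000)]
average_release = sum(releases) / len(releases)
print(f"average of noisy releases: {average_release:.2f}")
```

Each individual release is perturbed, yet the noise has mean zero, so aggregate results stay close to the truth. A smaller `epsilon` means more noise and stronger protection. A larger one means less noise and weaker protection.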

Origin Privacy: Protecting Privacy in the Big-Data Era


Paper by Helen Nissenbaum, Sebastian Benthall, Anupam Datta, Michael Carl Tschantz, and Piotr Mardziel: “Machine learning over big data poses challenges for our conceptualization of privacy. Such techniques can discover surprising and counterintuitive associations that take innocent-looking data and turn it into important inferences about a person. For example, buying carbon monoxide monitors has been linked to paying credit card bills, while buying chrome-skull car accessories predicts not doing so. Also, Target may have used the buying of scent-free hand lotion and vitamins as a sign that the buyer is pregnant. If we take pregnancy status to be private and assume that we should prohibit the sharing of information that can reveal that fact, then we have created an unworkable notion of privacy, one in which sharing any scrap of data may violate privacy.

Prior technical specifications of privacy depend on the classification of certain types of information as private or sensitive; privacy policies in these frameworks limit access to data that allow inference of this sensitive information. As the above examples show, today’s data-rich world creates a new kind of problem: it is difficult if not impossible to guarantee that information does not allow inference of sensitive topics. This makes information flow rules based on information topic unstable.

We address the problem of providing a workable definition of private data that takes into account emerging threats to privacy from large-scale data collection systems. We build on Contextual Integrity and its claim that privacy is appropriate information flow, or flow according to socially or legally specified rules.

As in other adaptations of Contextual Integrity (CI) to computer science, the parameterization of social norms in CI is translated into a logical specification. In this work, we depart from CI by considering rules that restrict information flow based on its origin and provenance, instead of on its type, topic, or subject.

We call this concept of privacy as adherence to origin-based rules Origin Privacy. Origin Privacy rules can be found in some existing data protection laws. This motivates the computational implementation of origin-based rules for the simple purpose of compliance engineering. We also formally model origin privacy to determine what security properties it guarantees relative to the concerns that motivate it….(More)”.
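
The abstract does not give the paper's formal model, so the following is only a loose sketch of what an origin-based rule might look like in code. It is not the authors' specification. All names here (`Datum`, `derive`, `BLOCKED`, `may_flow`) and the rule set are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Datum:
    value: object
    origins: frozenset  # where the data came from (its provenance)

def derive(value, *sources):
    """A derived value inherits the origins of everything it was computed from."""
    return Datum(value, frozenset().union(*(s.origins for s in sources)))

# Hypothetical rule set: recipients that data from each origin must never reach.
BLOCKED = {"medical_record": {"advertiser"}}

def may_flow(datum, recipient):
    """Allow a flow unless any origin of the datum blocks this recipient."""
    return all(recipient not in BLOCKED.get(o, set()) for o in datum.origins)

purchase = Datum("scent-free lotion", frozenset({"store_purchase"}))
chart = Datum("prenatal visit", frozenset({"medical_record"}))
inference = derive("likely pregnant", purchase, chart)

print(may_flow(purchase, "advertiser"))   # purchase data carries no blocked origin
print(may_flow(inference, "advertiser"))  # inherits the medical_record origin
```

The check never asks what a value is about. It only looks at where the value, and anything it was derived from, originated. That is the contrast with topic-based rules that the abstract draws.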