There aren’t any rules on how social scientists use private data. Here’s why we need them.


 at SSRC: “The politics of social science access to data are shifting rapidly in the United States as in other developed countries. It used to be that states were the most important source of data on their citizens, economy, and society. States needed to collect and aggregate large amounts of information for their own purposes. They gathered this directly—e.g., through censuses of individuals and firms—and also constructed relevant indicators. Sometimes state agencies helped to fund social science projects in data gathering, such as the National Science Foundation’s funding of the American National Election Survey over decades. While scholars such as James Scott and John Brewer disagreed about the benefits of state data gathering, they recognized the state’s primary role.

In this world, the politics of access to data were often the politics of engaging with the state. Sometimes the state was reluctant to provide information, either for ethical reasons (e.g. the privacy of its citizens) or self-interest. However, democratic states did typically provide access to standard statistical series and the like, and where they did not, scholars could bring pressure to bear on them. This led to well-understood rules about the common availability of standard data for many research questions and built the foundations for standard academic practices. It was relatively easy for scholars to criticize each other’s work when they were drawing on common sources. This had costs—scholars tended to ask the kinds of questions that readily available data allowed them to ask—but also significant benefits. In particular, it made research more easily reproducible.

We are now moving to a very different world. On the one hand, open data initiatives in government are making more data available than in the past (albeit often without much in the way of background resources or documentation).The new universe of private data is reshaping social science research in some ways that are still poorly understood. On the other, for many research purposes, large firms such as Google or Facebook (or even Apple) have much better data than the government. The new universe of private data is reshaping social science research in some ways that are still poorly understood. Here are some of the issues that we need to think about:…(More)”

Data as a Means, Not an End: A Brief Case Study


Tracie Neuhaus & Jarasa Kanok  in the Stanford Social Innovation Review: “In 2014, City Year—the well-known national education nonprofit that leverages young adults in national service to help students and schools succeed—was outgrowing the methods it used for collecting, managing, and using performance data. As the organization established its strategy for long-term impact, leaders identified a business problem: The current system for data collection and use would need to evolve to address the more-complex challenges the organization was undertaking. Staff throughout the organization were citing pain points one might expect, including onerous manual data collection, and long lag times to get much-needed data and reports on student attendance, grades, and academic and social-emotional assessments. After digging deeper, leaders realized they couldn’t fix the organization’s challenges with technology or improved methods without first addressing more fundamental issues. They saw City Year lacked a common “language” for the data it collected and used. Staff varied widely in their levels of data literacy, as did the scope of data-sharing agreements with the 27 urban school districts where City Year was working at the time. What’s more, its evaluation group had gradually become a default clearinghouse for a wide variety of service requests from across the organization that the group was neither designed nor staffed to address. The situation was much more complex than it appeared.

With significant technology roadmap decisions looming, City Year engaged with us to help it develop its data strategy. Together we came to realize that these symptoms were reflective of a single issue, one that exists in many organizations: City Year’s focus on data wasn’t targeted to address the very different kinds of decisions that each staff member—from the front office to the front lines—needed to make. …

Many of us in the social sector have probably seen elements of this dynamic. Many organizations create impact reports designed to satisfy external demands from donors, but these reports have little relevance to the operational or strategic choices the organizations face every day, much less address harder-to-measure, system-level outcomes. As a result, over time and in the face of constrained resources, measurement is relegated to a compliance activity, disconnected from identifying and collecting the information that directly enables individuals within the organization to drive impact. Gathering data becomes an end in itself, rather than a means of enabling ground-level work and learning how to improve the organization’s impact.

Overcoming this all-too-common “measurement drift” requires that we challenge the underlying orthodoxies that drive it and reorient measurement activities around one simple premise: Data should support better decision-making. This enables organizations to not only shed a significant burden of unproductive activity, but also drive themselves to new heights of performance.

In the case of City Year, leaders realized that to really take advantage of existing technology platforms, they needed a broader mindset shift….(More)”

Research in the Crowdsourcing Age, a Case Study


Report by  (Pew): “How scholars, companies and workers are using Mechanical Turk, a ‘gig economy’ platform, for tasks computers can’t handle

How Mechanical Turk WorksDigital age platforms are providing researchers the ability to outsource portions of their work – not just to increasingly intelligent machines, but also to a relatively low-cost online labor force comprised of humans. These so-called “online outsourcing” services help employers connect with a global pool of free-agent workers who are willing to complete a variety of specialized or repetitive tasks.

Because it provides access to large numbers of workers at relatively low cost, online outsourcing holds a particular appeal for academics and nonprofit research organizations – many of whom have limited resources compared with corporate America. For instance, Pew Research Center has experimented with using these services to perform tasks such as classifying documents and collecting website URLs. And a Google search of scholarly academic literature shows that more than 800 studies – ranging from medical research to social science – were published using data from one such platform, Amazon’s Mechanical Turk, in 2015 alone.1

The rise of these platforms has also generated considerable commentary about the so-called “gig economy” and the possible impact it will have on traditional notions about the nature of work, the structure of compensation and the “social contract” between firms and workers. Pew Research Center recently explored some of the policy and employment implications of these new platforms in a national survey of Americans.

Proponents say this technology-driven innovation can offer employers – whether companies or academics – the ability to control costs by relying on a global workforce that is available 24 hours a day to perform relatively inexpensive tasks. They also argue that these arrangements offer workers the flexibility to work when and where they want to. On the other hand, some critics worry this type of arrangement does not give employees the same type of protections offered in more traditional work environments – while others have raised concerns about the quality and consistency of data collected in this manner.

A recent report from the World Bank found that the online outsourcing industry generated roughly $2 billion in 2013 and involved 48 million registered workers (though only 10% of them were considered “active”). By 2020, the report predicted, the industry will generate between $15 billion and $25 billion.

Amazon’s Mechanical Turk is one of the largest outsourcing platforms in the United States and has become particularly popular in the social science research community as a way to conduct inexpensive surveys and experiments. The platform has also become an emblem of the way that the internet enables new businesses and social structures to arise.

In light of its widespread use by the research community and overall prominence within the emerging world of online outsourcing, Pew Research Center conducted a detailed case study examining the Mechanical Turk platform in late 2015 and early 2016. The study utilizes three different research methodologies to examine various aspects of the Mechanical Turk ecosystem. These include human content analysis of the platform, a canvassing of Mechanical Turk workers and an analysis of third party data.

The first goal of this research was to understand who uses the Mechanical Turk platform for research or business purposes, why they use it and who completes the work assignments posted there. To evaluate these issues, Pew Research Center performed a content analysis of the tasks posted on the site during the week of Dec. 7-11, 2015.

A second goal was to examine the demographics and experiences of the workers who complete the tasks appearing on the site. This is relevant not just to fellow researchers that might be interested in using the platform, but as a snapshot of one set of “gig economy” workers. To address these questions, Pew Research Center administered a nonprobability online survey of Turkers from Feb. 9-25, 2016, by posting a task on Mechanical Turk that rewarded workers for answering questions about their demographics and work habits. The sample of 3,370 workers contains any number of interesting findings, but it has its limits. This canvassing emerges from an opt-in sample of those who were active on MTurk during this particular period, who saw our survey and who had the time and interest to respond. It does not represent all active Turkers in this period or, more broadly, all workers on MTurk.

Finally, this report uses data collected by the online tool mturk-tracker, which is run by Dr. Panagiotis G. Ipeirotis of the New York University Stern School of Business, to examine the amount of activity occurring on the site. The mturk-tracker data are publically available online, though the insights presented here have not been previously published elsewhere….(More)”

Postal big data: Global flows as proxy indicators for national wellbeing


Data Driven Journalism: “A new project has developed an innovative means to approximate socioeconomic indicators by analyzing the network of international postal flows.

The project used 14 million aggregated electronic postal records from 187 countries collected by the Universal Postal Union over a four-year period (2010-2014) to create an international network showing the way post flows around the world.

In addition, the project builds upon previous research efforts using global flow networks, derived from the five following open data sources:

For each network, a country’s degree of connectivity for incoming and outgoing flows was quantified using the Jaccard coefficient and Spearman’s rank correlation coefficient….

To understand these connections in the context of socioeconomic indicators, the researchers then compared these positions to the values of GDP, Life expectancy, Corruption Perception Index, Internet penetration rate, Happiness index, Gini index, Economic Complexity Index, Literacy, Poverty, CO2 emissions, Fixed phone line penetration, Mobile phone users, and the Human Development Index.

007.png

Image: Spearman rank correlations between global flow network degrees and socioeconomic indicators (CC BY 4.0).

From this analysis, the researchers revealed that:

  • The best-performing degree, in terms of consistently high performance across indicators is the global degree, suggesting that looking at how well connected a country is in the global multiplex can be more indicative of its socioeconomic profile as a whole than looking at single networks.
  • GDP per capita and life expectancy are most closely correlated with the global degree, closely followed by the postal, trade and IP weighed degrees – indicative of a relationship between national wealth and the flow of goods and information.
  • Similarly to GDP, the rate of poverty of a country is best represented by the global degree, followed by the postal degree. The negative correlation indicates that the more impoverished a country is, the less well connected it is to the rest of the world.
  • Low human development (high rank) is most highly negatively correlated with the global degree, followed by the postal, trade and IP degrees. This shows that high human development (low rank) is associated with high global connectivity and activity in terms of incoming and outgoing flows of information and goods. ….Read the fully study here.”

Priorities for the National Privacy Research Strategy


James Kurose and Keith Marzullo at the White House: “Vast improvements in computing and communications are creating new opportunities for improving life and health, eliminating barriers to education and employment, and enabling advances in many sectors of the economy. The promise of these new applications frequently comes from their ability to create, collect, process, and archive information on a massive scale.

However, the rapid increase in the quantity of personal information that is being collected and retained, combined with our increased ability to analyze and combine it with other information, is creating concerns about privacy. When information about people and their activities can be collected, analyzed, and repurposed in so many ways, it can create new opportunities for crime, discrimination, inadvertent disclosure, embarrassment, and harassment.

This Administration has been a strong champion of initiatives to improve the state of privacy, such as the “Consumer Privacy Bill of Rights” proposal and the creation of the Federal Privacy Council. Similarly, the White House report Big Data: Seizing Opportunities, Preserving Values highlights the need for large-scale privacy research, stating: “We should dramatically increase investment for research and development in privacy-enhancing technologies, encouraging cross-cutting research that involves not only computer science and mathematics, but also social science, communications and legal disciplines.”

Today, we are pleased to release the National Privacy Research Strategy. Research agencies across government participated in the development of the strategy, reviewing existing Federal research activities in privacy-enhancing technologies, soliciting inputs from the private sector, and identifying priorities for privacy research funded by the Federal Government. The National Privacy Research Strategy calls for research along a continuum of challenges, from how people understand privacy in different situations and how their privacy needs can be formally specified, to how these needs can be addressed, to how to mitigate and remediate the effects when privacy expectations are violated. This strategy proposes the following priorities for privacy research:

  • Foster a multidisciplinary approach to privacy research and solutions;
  • Understand and measure privacy desires and impacts;
  • Develop system design methods that incorporate privacy desires, requirements, and controls;
  • Increase transparency of data collection, sharing, use, and retention;
  • Assure that information flows and use are consistent with privacy rules;
  • Develop approaches for remediation and recovery; and
  • Reduce privacy risks of analytical algorithms.

With this strategy, our goal is to produce knowledge and technology that will enable individuals, commercial entities, and the Federal Government to benefit from technological advancements and data use while proactively identifying and mitigating privacy risks. Following the release of this strategy, we are also launching a Federal Privacy R&D Interagency Working Group, which will lead the coordination of the Federal Government’s privacy research efforts. Among the group’s first public activities will be to host a workshop to discuss the strategic plan and explore directions of follow-on research. It is our hope that this strategy will also inspire parallel efforts in the private sector….(More)”

Reforms to improve U.S. government accountability


Alexander B. Howard and Patrice McDermott in Science: “Five decades after the United States first enacted the Freedom of Information Act (FOIA), Congress has voted to make the first major reforms to the statute since 2007. President Lyndon Johnson signed the first FOIA on 4 July 1966, enshrining in law the public’s right to access to information from executive branch government agencies. Scientists and others around the world can use the FOIA to learn what the U.S. government has done in its policies and practices. Proposed reforms should be a net benefit to public understanding of the scientific process and knowledge, by increasing the access of scientists to archival materials and reducing the likelihood of science and scientists being suppressed by official secrecy or bureaucracy.

Although the FOIA has been important for accountability, reform is sorely needed. An analysis of the 15 federal government agencies that received the most FOIA requests found poor to abysmal compliance rates (1, 2). In 2016, the Associated Press found that the Obama Administration had set a new record for unfulfilled FOIA requests (3). Although that has to be considered in the context of a rise in request volume without commensurate increases in resources to address them, researchers have found that most agencies simply ignore routine requests for travel schedules (4). An audit of 165 federal government agencies found that only 40% complied with the E-FOIA Act of 1996; just 67 of them had online libraries that were regularly updated with a substantial number of documents released under FOIA (5).

In the face of growing concerns about compliance, FOIA reform was one of the few recent instances of bicameral bipartisanship in Congress, with both the House and Senate each passing bills this spring with broad support. Now that Congress moved to send the Senate bill on to the president to sign into law, implementation of specific provisions will bear close scrutiny, including the potential impact of disclosure upon scientists who work in or with government agencies (6). Proposed revisions to the FOIA statute would improve how government discloses information to the public, while leaving intact exemptions for privacy, proprietary information, deliberative documents, and national security.

Features of Reforms

One of the major reforms in the House and Senate bills was to codify the “presumption of openness” outlined by President Obama the day after he took office in January 2009 when he declared that FOIA should be administered with a clear presumption: In the face of doubt, “openness” would prevail. This presumption of openness was affirmed by U.S. Attorney General Holder in March 2009. Although these declarations have had limited effect in the agencies (as described above), codifying these reforms into law is crucial not only to ensure that this remains executive branch policy after this president leaves office but also to provide requesters with legal force beyond an executive order….(More)”

You can help stop human trafficking with the TraffickCam app


 in TechCrunch: “In a world where the phrase “oh god, not another app” often springs to mind, along with “Yeah, yeah, I’m sure you want to make a world a better place” TraffickCam is a blast of icy-fresh air.

TraffickCam is an app developed by the Exchange Initiative, an organization fighting back against sex trafficking.

The goal of the new app is to build a national database of photos of the insides of hotel rooms to help law enforcement match images posted by sex traffickers to locations, in an effort to map out the routes and methods used by traffickers. The app will also be useful to help locate victims — and the people who put them in their predicament.

Available for both iOS and Android, the app is unlikely to win any design awards, but that isn’t the point; the app makers are solving a tremendous problem and any tools available to help resolve some of this will be welcomed with open arms by the organizations fighting the good fight….

The app, then, is a crowd-sourced data gathering tool which can be used to match known locations to photos confiscated from or shared by the perpetrators. Features such as patterns in the carpeting, furniture, room accessories and window views can be analyzed, and according to the app’s creators, testing shows that the app is 85 percent accurate in identifying the correct hotel in the top 20 matches.

“Law enforcement is always looking for new and innovative ways to recover victims, locate suspects and investigate criminal activity,” said Sergeant Adam Kavanagh, St. Louis County Police Department and Supervisor of the St. Louis County Multi-Jurisdictional Human Trafficking Task Force.

 Today, the organization’s database contains 1.5 million photos from more than 145,000 hotels in every major metropolitan area of the U.S., a combination of photos taken by early users of the TraffickCam smartphone app and from publicly available sources of hotel room images….(More)”

This text-message hotline can predict your risk of depression or stress


Clinton Nguyen for TechInsider: “When counselors are helping someone in the midst of an emotional crisis, they must not only know how to talk – they also must be willing to text.

Crisis Text Line, a non-profit text-message-based counseling service, operates a hotline for people who find it safer or easier to text about their problems than make a phone call or send an instant message. Over 1,500 volunteers are on hand 24/7 to lend support about problems including bullying, isolation, suicidal thoughts, bereavement, self-harm, or even just stress.

But in addition to providing a new outlet for those who prefer to communicate by text, the service is gathering a wellspring of anonymized data.

“We look for patterns in historical conversations that end up being higher risk for self harm and suicide attempts,” Liz Eddy, a Crisis Text Line spokesperson, tells Tech Insider. “By grounding in historical data, we can predict the risk of new texters coming in.crisis-text-line-sms

According to Fortune, the organization is using machine learning to prioritize higher-risk individuals for quicker and more effective responses. But Crisis Text Line is also wielding the data it gathers in other ways – the company has published a page of trends that tells the public which hours or days people are more likely to be affected by certain issues, as well as which US states are most affected by specific crises or psychological states.

According to the data, residents of Alaska reach out to the Text Line for LGBTQ issues more than those in other states, and Maine is one of the most stressed out states. Physical abuse is most commonly reported in North Dakota and Wyoming, while depression is more prevalent in texters from Kentucky and West Virginia.

The research comes at an especially critical time. According to studies from the National Center for Health Statistics, US suicide rates have surged to a 30-year high. The study noted a rise in suicide rates for all demographics except black men over the age of 75. Alarmingly, the suicide rate among 10- to 14-year-old girls has tripled since 1999….(More)”

The Billions We’re Wasting in Our Jails


Stephen Goldsmith  and Jane Wiseman in Governing: “By using data analytics to make decisions about pretrial detention, local governments could find substantial savings while making their communities safer….

Few areas of local government spending present better opportunities for dramatic savings than those that surround pretrial detention. Cities and counties are wasting more than $3 billion a year, and often inducing crime and job loss, by holding the wrong people while they await trial. The problem: Only 10 percent of jurisdictions use risk data analytics when deciding which defendants should be detained.

As a result, dangerous people are out in our communities, while many who could be safely in the community are behind bars. Vast numbers of people accused of petty offenses spend their pretrial detention time jailed alongside hardened convicts, learning from them how to be better criminals….

In this era of big data, analytics not only can predict and prevent crime but also can discern who should be diverted from jail to treatment for underlying mental health or substance abuse issues. Avoided costs aggregating in the billions could be better spent on detaining high-risk individuals, more mental health and substance abuse treatment, more police officers and other public safety services.

Jurisdictions that do use data to make pretrial decisions have achieved not only lower costs but also greater fairness and lower crime rates. Washington, D.C., releases 85 percent of defendants awaiting trial. Compared to the national average, those released in D.C. are two and a half times more likely to remain arrest-free and one and a half times as likely to show up for court.

Louisville, Ky., implemented risk-based decision-making using a tool developed by the Laura and John Arnold Foundation and now releases 70 percent of defendants before trial. Those released have turned out to be twice as likely to return to court and to stay arrest-free as those in other jurisdictions. Mesa County, Colo., and Allegheny County, Pa., both have achieved significant savings from reduced jail populations due to data-driven release of low-risk defendants.

Data-driven approaches are beginning to produce benefits not only in the area of pretrial detention but throughout the criminal justice process. Dashboards now in use in a handful of jurisdictions allow not only administrators but also the public to see court waiting times by offender type and to identify and address processing bottlenecks….(More)”

Connect the corporate dots to see true transparency


Gillian Tett at the Financial Times: “…In all this, a crucial point is often forgotten: simply amassing data will not solve the problem of transparency. What is also needed is a way for analysts to track the connections that exist between companies scattered across different national jurisdictions.

There are more than 45,000 companies listed on global stock exchanges and, according to Chris Taggart of OpenCorporates, an independent data company, there are between 250m and 400m unlisted groups. Many of these are listed on national registries but, since registries are extremely fragmented, it is very difficult for shareholders or regulators to form a complete picture of company activity.

It also creates financial stability risks. One reason why it is currently hard to track the scale of Chinese corporate debt, say, is that it is being issued by an opaque web of legal entities. Similarly, regulators struggled to cope with the fallout from the Lehman Brothers collapse in 2008 because the bank was operating almost 3,000 different legal entities around the world.

Is there a solution to this? A good place to start would be for governments to put their corporate registries online. Another crucial step would be for governments and companies to agree on a common standard for labelling legal entities, so that these can be tracked across borders.

Happily, work on that has begun: in 2014, the Global Legal Entity Identifier Foundation was created. It supports the implementation and use of “legal entity identifiers”, a data standard that identifies participants in financial transactions. Groups such as the Data Coalition in Washington DC are lobbying for laws that would force companies to use LEIs….However, this inter-governmental project is moving so slowly that the private sector may be a better bet. In recent years, companies such as Dun & Bradstreet have begun to amass proprietary information about complex corporate webs, and computer nerds are also starting to use the power of big data to join up the corporate dots in a public format.

OpenCorporates is a good example. Over the past five years, a dozen staff there have been painstakingly scraping national corporate registries to create a database designed to show how companies are connected around the world. This is far from complete but data from 100m entities have already been logged. And in the wake of the Panama Papers, more governments are coming on board — data from the Cayman Islands are currently being added and France is likely to collaborate soon.

Sadly, these moves will not deliver real transparency straight away. If you type “MIO” into the search box on the OpenCorporates website, you will not see a map of all of McKinsey’s activities — at least not yet.

The good news, however, is that with every data scrape, or use of an LEI, the picture of global corporate activity is becoming slightly less opaque thanks to the work of a hidden army of geeks. They deserve acclaim and support — even (or especially) from management consultants….(More)”