Sharing Private Data for Public Good


Stefaan G. Verhulst at Project Syndicate: “After Hurricane Katrina struck New Orleans in 2005, the direct-mail marketing company Valassis shared its database with emergency agencies and volunteers to help improve aid delivery. In Santiago, Chile, analysts from Universidad del Desarrollo, ISI Foundation, UNICEF, and the GovLab collaborated with Telefónica, the city’s largest mobile operator, to study gender-based mobility patterns in order to design a more equitable transportation policy. And as part of the Yale University Open Data Access project, health-care companies Johnson & Johnson, Medtronic, and SI-BONE give researchers access to previously walled-off data from 333 clinical trials, opening the door to possible new innovations in medicine.

These are just three examples of “data collaboratives,” an emerging form of partnership in which participants exchange data for the public good. Such tie-ups typically involve public bodies using data from corporations and other private-sector entities to benefit society. But data collaboratives can help companies, too – pharmaceutical firms share data on biomarkers to accelerate their own drug-research efforts, for example. Data-sharing initiatives also have huge potential to improve artificial intelligence (AI). But they must be designed responsibly and take data-privacy concerns into account.

Understanding the societal and business case for data collaboratives, as well as the forms they can take, is critical to gaining a deeper appreciation of the potential and limitations of such ventures. The GovLab has identified over 150 data collaboratives spanning continents and sectors; they include companies such as Air France, Zillow, and Facebook. Our research suggests that such partnerships can create value in three main ways….(More)”.

Companies Collect a Lot of Data, But How Much Do They Actually Use?


Article by Priceonomics Data Studio: “For all the talk of how data is the new oil and the most valuable resource of any enterprise, there is a deep dark secret companies are reluctant to share — most of the data collected by businesses simply goes unused.

This unknown and unused data, known as dark data, comprises more than half the data collected by companies. Given that some estimates indicate that 7.5 septillion (7,700,000,000,000,000,000,000) gigabytes of data are generated every single day, not using most of it is a considerable issue.

In this article, we’ll look at this dark data: just how much of it companies create, why so little of it is analyzed, and what the costs and implications are of companies not using the majority of the data they collect.

Before diving into the analysis, it’s worth spending a moment clarifying what we mean by the term “dark data.” Gartner defines dark data as:

“The information assets organizations collect, process and store during regular business activities, but generally fail to use for other purposes (for example, analytics, business relationships and direct monetizing).”

To learn more about this phenomenon, Splunk commissioned a global survey of 1,300+ business leaders to better understand how much data they collect, and how much is dark. Respondents were from IT and business roles across various industries, and were located in Australia, China, France, Germany, Japan, the United States, and the United Kingdom. For the report, Splunk defines dark data as: “all the unknown and untapped data across an organization, generated by systems, devices and interactions.”

While the cost of storing data has decreased over time, the cost of saving septillions of gigabytes of wasted data is still significant. What’s more, during this time the strategic importance of data has increased as companies have found more and more uses for it. Given the cost of storage and the value of data, why does so much of it go unused?

The following chart shows the reasons why dark data isn’t currently being harnessed:

By a large margin, the number one reason given for not using dark data is that companies lack a tool to capture or analyze the data. Companies accumulate data from server logs, GPS networks, security tools, call records, web traffic and more. Companies track everything from digital transactions to the temperature of their server rooms to the contents of retail shelves. Most of this data lies in separate systems, is unstructured, and cannot be connected or analyzed.

Second, the data captured just isn’t good enough. You might have important customer information about a transaction, but it’s missing location or other important metadata because that information sits somewhere else or was never captured in a usable format.

Additionally, dark data exists because there is simply too much data out there and a lot of it is unstructured. The larger the dataset (or the less structured it is), the more sophisticated the tool required for analysis. These kinds of datasets also often require analysis by individuals with significant data science expertise, who are in short supply.

The implications of the prevalence of dark data are vast. As a result of the data deluge, companies often don’t know where all their sensitive data is stored and can’t be confident they are complying with consumer data protection measures like GDPR. …(More)”.

Exploring Digital Ecosystems: Organizational and Human Challenges


Proceedings edited by Alessandra Lazazzara, Francesca Ricciardi and Stefano Za: “The recent surge of interest in digital ecosystems is not only transforming the business landscape, but also poses several human and organizational challenges. Due to the pervasive effects of the transformation on firms and societies alike, both scholars and practitioners are interested in understanding the key mechanisms behind digital ecosystems, their emergence and evolution. In order to disentangle such factors, this book presents a collection of research papers focusing on the relationship between technologies (e.g. digital platforms, AI, infrastructure) and behaviours (e.g. digital learning, knowledge sharing, decision-making). Moreover, it provides critical insights into how digital ecosystems can shape value creation and benefit various stakeholders. The plurality of perspectives offered makes the book particularly relevant for users, companies, scientists and governments. The content is based on a selection of the best papers – original double-blind peer-reviewed contributions – presented at the annual conference of the Italian chapter of the AIS, which took place in Pavia, Italy in October 2018….(More)”.

What can the labor flow of 500 million people on LinkedIn tell us about the structure of the global economy?


Paper by Jaehyuk Park et al: “…One of the most popular concepts for policy makers and business economists to understand the structure of the global economy is the “cluster”, the geographical agglomeration of interconnected firms such as Silicon Valley, Wall Street, and Hollywood. By studying those well-known clusters, we come to understand the advantage firms gain from participating in a geo-industrial cluster and how it is related to the economic growth of a region. 

However, the existing definition of a geo-industrial cluster is not systematic enough to reveal the whole picture of the global economy. Often, after being defined as a group of firms in a certain area, geo-industrial clusters are treated as independent of one another. Just as we must consider the interaction between the accounting team and the marketing team to understand the organizational structure of a firm, the relationships among geo-industrial clusters are an essential part of the whole picture….

In this new study, my colleagues and I at Indiana University — with support from LinkedIn — have finally overcome these limitations by defining geo-industrial clusters through labor flow and constructing a global labor flow network from LinkedIn’s individual-level job history dataset. Our access to this data was made possible by our selection as one of 11 teams participating in the LinkedIn Economic Graph Challenge.

The transitioning of workers between jobs and firms — also known as labor flow — is considered central in driving firms towards geo-industrial clusters due to knowledge spillover and labor market pooling. In response, we mapped the cluster structure of the world economy based on labor mobility between firms during the last 25 years, constructing a “labor flow network.” 

To do this, we leverage LinkedIn’s data on professional demographics and employment histories from more than 500 million people between 1990 and 2015. The network, which captures approximately 130 million job transitions between more than 4 million firms, is the first-ever flow network of global labor.
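As a rough illustration of the construction described above, the sketch below builds a toy labor-flow network from job transitions and groups firms with community detection. The firm names, transition counts, and choice of algorithm are hypothetical and are not taken from the study.

```python
# A minimal, hypothetical sketch of building a labor-flow network from job
# transitions and detecting clusters of firms. Firm names and counts are
# invented; the study's actual pipeline and algorithm may differ.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Each tuple: (from_firm, to_firm, number_of_workers_who_made_that_move)
transitions = [
    ("ChipCo", "SearchCo", 120), ("SearchCo", "SocialCo", 95),
    ("ChipCo", "SocialCo", 60),  ("BankA", "BankB", 80),
    ("BankB", "HedgeFundC", 70), ("BankA", "HedgeFundC", 40),
    ("SocialCo", "BankA", 5),    # weak tie across clusters
]

G = nx.DiGraph()
for src, dst, weight in transitions:
    G.add_edge(src, dst, weight=weight)

# Community detection on the undirected projection groups firms that
# exchange many workers -- a rough proxy for geo-industrial clusters.
communities = greedy_modularity_communities(G.to_undirected(), weight="weight")
for i, firms in enumerate(communities):
    print(f"Cluster {i}: {sorted(firms)}")
```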

The resulting “map” allows us to:

  • identify geo-industrial clusters systematically and organically using network community detection;
  • verify the importance of region and industry in labor mobility;
  • compare the relative importance of these two constraints at different hierarchical levels;
  • reveal the practical advantage of the geo-industrial cluster as a unit for future economic analyses;
  • show more clearly which industry in which region leads the economic growth of that industry or region; and
  • identify emerging and declining skills based on how strongly they are represented in growing and declining geo-industrial clusters…(More)”.

“Anonymous” Data Won’t Protect Your Identity


Sophie Bushwick at Scientific American: “The world produces roughly 2.5 quintillion bytes of digital data per day, adding to a sea of information that includes intimate details about many individuals’ health and habits. To protect privacy, data brokers must anonymize such records before sharing them with researchers and marketers. But a new study finds it is relatively easy to reidentify a person from a supposedly anonymized data set—even when that set is incomplete.

Massive data repositories can reveal trends that teach medical researchers about disease, demonstrate issues such as the effects of income inequality, coach artificial intelligence into humanlike behavior and, of course, aim advertising more efficiently. To shield people who—wittingly or not—contribute personal information to these digital storehouses, most brokers send their data through a process of deidentification. This procedure involves removing obvious markers, including names and social security numbers, and sometimes taking other precautions, such as introducing random “noise” data to the collection or replacing specific details with general ones (for example, swapping a birth date of “March 7, 1990” for “January–April 1990”). The brokers then release or sell a portion of this information.
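As a minimal sketch of the de-identification steps just described (dropping direct identifiers, generalizing a birth date to a window such as “January–April 1990,” and adding random noise), the hypothetical example below illustrates the idea; the field names and specific transformations are assumptions, not any broker's actual procedure.

```python
# A minimal, hypothetical sketch of the de-identification steps described
# above: drop direct identifiers, generalize a birth date, add noise.
# Record fields are invented; real pipelines are far more elaborate.
import random
from datetime import date

def deidentify(record: dict) -> dict:
    anon = dict(record)
    # Remove obvious markers such as names and social security numbers.
    anon.pop("name", None)
    anon.pop("ssn", None)
    # Replace a specific birth date with a coarse four-month window.
    dob = anon.pop("birth_date")
    start_month = ((dob.month - 1) // 4) * 4 + 1
    anon["birth_window"] = f"{date(dob.year, start_month, 1):%B}–" \
                           f"{date(dob.year, start_month + 3, 1):%B} {dob.year}"
    # Add random "noise" to a numeric field.
    anon["annual_spend"] = round(anon["annual_spend"] + random.gauss(0, 50), 2)
    return anon

record = {"name": "Jane Doe", "ssn": "123-45-6789",
          "birth_date": date(1990, 3, 7), "zip": "10001", "annual_spend": 1234.56}
print(deidentify(record))  # note: zip code and other quasi-identifiers remain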

“Data anonymization is basically how, for the past 25 years, we’ve been using data for statistical purposes and research while preserving people’s privacy,” says Yves-Alexandre de Montjoye, an assistant professor of computational privacy at Imperial College London and co-author of the new study, published this week in Nature Communications.  Many commonly used anonymization techniques, however, originated in the 1990s, before the Internet’s rapid development made it possible to collect such an enormous amount of detail about things such as an individual’s health, finances, and shopping and browsing habits. This discrepancy has made it relatively easy to connect an anonymous line of data to a specific person: if a private detective is searching for someone in New York City and knows the subject is male, is 30 to 35 years old and has diabetes, the sleuth would not be able to deduce the man’s name—but could likely do so quite easily if he or she also knows the target’s birthday, number of children, zip code, employer and car model….(More)”
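To make the detective example concrete, here is a hypothetical illustration (all records are invented) of how a handful of quasi-identifiers can single out one row in a supposedly anonymized release and expose its sensitive field.

```python
# A hypothetical illustration of re-identification: the "anonymized" release
# has no names, but joining on a handful of quasi-identifiers leaves only
# one candidate. All records below are invented.
anonymized_release = [
    {"sex": "M", "age": 33, "zip": "10027", "birthday": "1986-04-12",
     "children": 2, "employer": "Acme Corp", "car": "sedan", "diagnosis": "diabetes"},
    {"sex": "M", "age": 31, "zip": "10027", "birthday": "1988-09-03",
     "children": 0, "employer": "Acme Corp", "car": "SUV", "diagnosis": "asthma"},
    {"sex": "F", "age": 34, "zip": "10013", "birthday": "1985-01-22",
     "children": 1, "employer": "Globex", "car": "sedan", "diagnosis": "none"},
]

# What the "detective" already knows about the target from public sources.
known = {"sex": "M", "zip": "10027", "birthday": "1986-04-12",
         "children": 2, "employer": "Acme Corp", "car": "sedan"}

matches = [row for row in anonymized_release
           if all(row[key] == value for key, value in known.items())]
if len(matches) == 1:
    print("Unique match found; sensitive field revealed:", matches[0]["diagnosis"])
```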

The value of data in Canada: Experimental estimates


Statistics Canada: “As data and information take on a far more prominent role in Canada and, indeed, all over the world, data, databases and data science have become a staple of modern life. When the electricity goes out, Canadians are as much in search of their data feed as they are food and heat. Consumers are using more and more data that is embodied in the products they buy, whether those products are music, reading material, cars and other appliances, or a wide range of other goods and services. Manufacturers, merchants and other businesses depend increasingly on the collection, processing and analysis of data to make their production processes more efficient and to drive their marketing strategies.

The increasing use of and investment in all things data is driving economic growth, changing the employment landscape and reshaping how and from where we buy and sell goods. Yet the rapid rise in the use and importance of data is not well measured in the existing statistical system. Given the ‘lack of data on data’, Statistics Canada has initiated new research to produce a first set of estimates of the value of data, databases and data science. The development of these estimates benefited from collaboration with the Bureau of Economic Analysis in the United States and the Organisation for Economic Co-operation and Development.

In 2018, Canadian investment in data, databases and data science was estimated to be as high as $40 billion. This was greater than the annual investment in industrial machinery, transportation equipment, and research and development and represented approximately 12% of total non-residential investment in 2018….

Statistics Canada recently released a conceptual framework outlining how one might measure the economic value of data, databases and data science. Thanks to this new framework, the growing role of data in Canada can be measured through time. This framework is described in a paper that was released in The Daily on June 24, 2019 entitled “Measuring investments in data, databases and data science: Conceptual framework.” That paper describes the concept of an ‘information chain’ in which data are derived from everyday observations, databases are constructed from data, and data science creates new knowledge by analyzing the contents of databases….(More)”.

What Restaurant Reviews Reveal About Cities


Linda Poon at CityLab: “Online review sites can tell you a lot about a city’s restaurant scene, and they can reveal a lot about the city itself, too.

Researchers at MIT recently found that information about restaurants gathered from popular review sites can be used to uncover a number of socioeconomic factors of a neighborhood, including its employment rates and demographic profiles of the people who live, work, and travel there.

A report published last week in the Proceedings of the National Academy of Sciences explains how the researchers used information found on Dianping—a Yelp-like site in China—to find information that might usually be gleaned from an official government census. The model could prove especially useful for gathering information about cities that don’t have that kind of reliable or up-to-date government data, especially in developing countries with limited resources to conduct regular surveys….

Zheng and her colleagues tested out their machine-learning model using restaurant data from nine Chinese cities of various sizes—from crowded ones like Beijing, with a population of more than 10 million, to smaller ones like Baoding, a city of fewer than 3 million people.

They pulled data from 630,000 restaurants listed on Dianping, including each business’s location, menu prices, opening day, and customer ratings. Then they ran it through a machine-learning model with official census data and with anonymous location and spending data gathered from cell phones and bank cards. By comparing the information, they were able to determine where the restaurant data reflected the other data they had about neighborhoods’ characteristics.
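The comparison step can be pictured with a small sketch: aggregate restaurant features per neighborhood and fit a model against a census- or phone-derived target. Everything below (feature set, model choice, synthetic data) is an assumption for illustration and does not reproduce the paper's actual pipeline.

```python
# A hypothetical sketch of the approach described above: aggregate
# restaurant features per neighborhood and fit a model against a
# census-derived target. Features, model, and numbers are illustrative only.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_neighborhoods = 200

# Per-neighborhood restaurant features: count, mean menu price,
# mean customer rating, share of coffee shops.
X = np.column_stack([
    rng.poisson(80, n_neighborhoods),          # number of restaurants
    rng.normal(55, 15, n_neighborhoods),       # mean menu price
    rng.uniform(3.0, 4.8, n_neighborhoods),    # mean rating
    rng.uniform(0.0, 0.4, n_neighborhoods),    # share of coffee shops
])
# Census / mobile-phone target, e.g. daytime population (synthetic here).
y = 1_000 * X[:, 0] + 200 * X[:, 1] + rng.normal(0, 5_000, n_neighborhoods)

model = GradientBoostingRegressor()
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print("Cross-validated R^2:", scores.mean().round(3))
```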

They found that the local restaurant scene can predict, with 95 percent accuracy, variations in a neighborhood’s daytime and nighttime populations, which are measured using mobile phone data. It can also predict, with 90 and 93 percent accuracy, respectively, the number of businesses and the volume of consumer consumption. The types of cuisine offered and the kinds of eateries available (coffee shops vs. traditional teahouses, for example) can also predict the proportion of immigrants or the age and income breakdown of residents. The predictions are more accurate for neighborhoods near urban centers as opposed to those near suburbs, and for smaller cities, where neighborhoods don’t vary as widely as those in bigger metropolises….(More)”.

The Power of Global Performance Indicators


Introduction to Special Issue of International Organization by Judith G. Kelley and Beth A. Simmons: “In recent decades, IGOs, NGOs, private firms and even states have begun to regularly package and distribute information on the relative performance of states. From the World Bank’s Ease of Doing Business Index to the Financial Action Task Force blacklist, global performance indicators (GPIs) are increasingly deployed to influence governance globally. We argue that GPIs derive influence from their ability to frame issues, extend the authority of the creator, and — most importantly — to invoke recurrent comparison that stimulates governments’ concerns for their own and their country’s reputation. Their public and ongoing ratings and rankings of states are particularly adept at capturing attention not only at elite policy levels but also among other domestic and transnational actors. GPIs thus raise new questions for research on politics and governance globally. What are the social and political effects of this form of information on discourse, policies and behavior? What types of actors can effectively wield GPIs and on what types of issues? In this symposium introduction, we define GPIs, describe their rise, and theorize and discuss these questions in light of the findings of the symposium contributions…(More)”.

How an AI Utopia Would Work


Sami Mahroum at Project Syndicate: “…It is more than 500 years since Sir Thomas More found inspiration for the “Kingdom of Utopia” while strolling the streets of Antwerp. So, when I traveled there from Dubai in May to speak about artificial intelligence (AI), I couldn’t help but draw parallels to Raphael Hythloday, the character in Utopia who regales sixteenth-century Englanders with tales of a better world.

As home to the world’s first Minister of AI, as well as museums, academies, and foundations dedicated to studying the future, Dubai is on its own Hythloday-esque voyage. Whereas Europe, in general, has grown increasingly anxious about technological threats to employment, the United Arab Emirates has enthusiastically embraced the labor-saving potential of AI and automation.

There are practical reasons for this. The ratio of indigenous-to-foreign labor in the Gulf states is highly imbalanced, ranging from a high of 67% in Saudi Arabia to a low of 11% in the UAE. And because the region’s desert environment cannot support further population growth, the prospect of replacing people with machines has become increasingly attractive.

But there is also a deeper cultural difference between the two regions. Unlike Western Europe, the birthplace of both the Industrial Revolution and the “Protestant work ethic,” Arab societies generally do not “live to work,” but rather “work to live,” placing a greater value on leisure time. Such attitudes are not particularly compatible with economic systems that require squeezing ever more productivity out of labor, but they are well suited for an age of AI and automation….

Fortunately, AI and data-driven innovation could offer a way forward. In what could be perceived as a kind of AI utopia, the paradox of a bigger state with a smaller budget could be reconciled, because the government would have the tools to expand public goods and services at a very small cost.

The biggest hurdle would be cultural: As early as 1948, the German philosopher Joseph Pieper warned against the “proletarianization” of people and called for leisure to be the basis for culture. Westerners would have to abandon their obsession with the work ethic, as well as their deep-seated resentment toward “free riders.” They would have to start differentiating between work that is necessary for a dignified existence, and work that is geared toward amassing wealth and achieving status. The former could potentially be all but eliminated.

With the right mindset, all societies could start to forge a new AI-driven social contract, wherein the state would capture a larger share of the return on assets, and distribute the surplus generated by AI and automation to residents. Publicly-owned machines would produce a wide range of goods and services, from generic drugs, food, clothes, and housing, to basic research, security, and transportation….(More)”.

Trusted data and the future of information sharing


 MIT Technology Review: “Data in some form underpins almost every action or process in today’s modern world. Consider that even farming, the world’s oldest industry, is on the verge of a digital revolution, with AI, drones, sensors, and blockchain technology promising to boost efficiencies. The market value of an apple will increasingly reflect not only traditional farming inputs but also some value of modern data, such as weather patterns, soil acidity levels and agri-supply-chain information. By 2022 more than 60% of global GDP will be digitized, according to IDC.

Governments seeking to foster growth in their digital economies need to be more active in encouraging safe data sharing between organizations. Tolerating the sharing of data and stepping in only where security breaches occur is no longer enough. Sharing data across different organizations enables the whole ecosystem to grow and can be a unique source of competitive advantage. But businesses need guidelines and support in how to do this effectively.   

This is how Singapore’s data-sharing worldview has evolved, according to Janil Puthucheary, senior minister of state for communications and information and transport, upon launching the city-state’s new Trusted Data Sharing Framework in June 2019.

The Framework, a product of consultations between Singapore’s Infocomm Media Development Authority (IMDA), its Personal Data Protection Commission (PDPC), and industry players, is intended to create a common data-sharing language for relevant stakeholders. Specifically, it addresses four common categories of concerns with data sharing: how to formulate an overall data-sharing strategy, legal and regulatory considerations, technical and organizational considerations, and the actual operationalizing of data sharing.

For instance, companies often have trouble assessing the value of their own data, a necessary first step before sharing should even be considered. The framework describes the three general approaches used: market-, cost-, and income-based. The legal and regulatory section details when businesses can, among other things, seek exemptions from Singapore’s Personal Data Protection Act.
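The three valuation approaches the framework mentions can be pictured with a toy calculation; all figures and formulas below are invented for illustration and are not prescribed by the framework.

```python
# Toy illustrations of the three general approaches to valuing a dataset
# mentioned in the framework. All figures are invented; the framework does
# not prescribe these exact formulas.

# Cost-based: what it cost to collect, clean, and store the data.
collection_cost, cleaning_cost, annual_storage_cost, years_held = 120_000, 30_000, 5_000, 3
cost_based_value = collection_cost + cleaning_cost + annual_storage_cost * years_held

# Market-based: what comparable datasets sell or license for.
comparable_license_fee, licensees = 18_000, 10
market_based_value = comparable_license_fee * licensees

# Income-based: discounted future income the data is expected to generate.
expected_annual_income, discount_rate, horizon_years = 60_000, 0.08, 5
income_based_value = sum(expected_annual_income / (1 + discount_rate) ** t
                         for t in range(1, horizon_years + 1))

print(f"Cost-based:   ${cost_based_value:,.0f}")
print(f"Market-based: ${market_based_value:,.0f}")
print(f"Income-based: ${income_based_value:,.0f}")
```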

The technical and organizational chapter includes details on governance, infrastructure security, and risk management. Finally, the section on operational aspects of data sharing includes guidelines for when it is appropriate to use shared data for a secondary purpose or not….(More)”.