Guidance Note: Statistical Disclosure Control


Centre for Humanitarian Data: “Survey and needs assessment data, or what is known as ‘microdata’, is essential for providing an adequate response to crisis-affected people. However, collecting this information does present risks. Even when great effort is taken to remove unique identifiers such as names and phone numbers from microdata so that no individual persons or communities are exposed, combining key variables such as location or ethnicity can still allow for re-identification of individual respondents. Statistical Disclosure Control (SDC) is one method for reducing this risk.
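The re-identification risk from combining quasi-identifiers can be illustrated with a minimal k-anonymity check, one of the measures underlying SDC. This is a hypothetical sketch (the function name, toy records, and chosen quasi-identifiers are illustrative; dedicated tools such as sdcMicro implement this far more rigorously):

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return k: the size of the smallest group of records sharing
    the same combination of quasi-identifier values."""
    combos = Counter(
        tuple(r[q] for q in quasi_identifiers) for r in records
    )
    return min(combos.values())

# Toy microdata: names already removed, but location + ethnicity remain.
survey = [
    {"location": "District A", "ethnicity": "X", "age_band": "20-29"},
    {"location": "District A", "ethnicity": "X", "age_band": "20-29"},
    {"location": "District B", "ethnicity": "Y", "age_band": "30-39"},
]

# The District B respondent is unique (k = 1), so re-identifiable.
print(k_anonymity(survey, ["location", "ethnicity"]))  # → 1
```

A dataset is considered k-anonymous only if every combination of quasi-identifier values appears at least k times; SDC techniques such as suppression or recoding are applied until k reaches an acceptable threshold.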

The Centre has developed a Guidance Note on Statistical Disclosure Control that outlines the steps involved in the SDC process, potential applications for its use, case studies and key actions for humanitarian data practitioners to take when managing sensitive microdata. Along with an overview of what SDC is and what tools are available, the Guidance Note outlines how the Centre is using this process to mitigate risk for datasets shared on HDX. …(More)”.

Hacking for Housing: How open data and civic hacking creates wins for housing advocates


Krista Chan at Sunlight: “…Housing advocates have an essential role to play in protecting residents from the consequences of real estate speculation. But they’re often at a significant disadvantage; the real estate lobby has access to a wealth of data and technological expertise. Civic hackers and open data could play an essential role in leveling the playing field.

Civic hackers have facilitated wins for housing advocates by scraping data or submitting FOIA requests where data is not open and creating apps to help advocates gain insights that they can turn into action. 

Hackers at New York City’s Housing Data Coalition created a host of civic apps that identify problematic landlords by exposing owners behind shell companies, or flagging buildings where tenants are at risk of displacement. In a similar vein, Washington DC’s Housing Insights tool aggregates a wide variety of data to help advocates make decisions about affordable housing.

Barriers and opportunities

Today, the degree to which housing data exists, is openly available, and is consistently reliable varies widely, even within cities themselves. Cities with robust communities of affordable-housing advocacy groups may not be connected to people who can help open up data and build usable tools. Even in cities with robust advocacy and civic tech communities, these groups may not know how to work together, given the significant institutional knowledge required to understand how best to support housing advocacy efforts.

In cities where civic hackers have tried to create useful open housing data repositories, similar data cleaning processes have been replicated, such as record linkage of building owners or identification of rent-controlled units. Civic hackers need to take on these data cleaning and “extract, transform, load” (ETL) processes in order to work with the data itself, even when it is openly available. The Housing Data Coalition has assembled NYC-DB, a tool that builds a PostgreSQL database containing a variety of housing-related data for New York City, and Washington DC’s Housing Insights similarly ingests housing data into a PostgreSQL database and API for front-end access.
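Record linkage of building owners, one of the repeated cleaning steps mentioned above, typically starts by normalizing owner names so that the same landlord filed under slightly different strings can be matched. The sketch below is a simplified illustration (the function and sample names are hypothetical; NYC-DB and Housing Insights use more elaborate pipelines):

```python
import re

def normalize_owner(name):
    """Collapse common variations in owner names before matching."""
    name = name.upper().strip()
    name = re.sub(r"[.,]", "", name)   # drop punctuation
    name = re.sub(r"\s+", " ", name)   # collapse runs of whitespace
    # Standardize common corporate suffixes.
    name = re.sub(r"\b(LIMITED LIABILITY COMPANY|L L C)\b", "LLC", name)
    return name

# Two filings for the same shell company, entered differently.
a = normalize_owner("123 Main St. Realty, L L C")
b = normalize_owner("123 MAIN ST REALTY LLC")
print(a == b)  # → True
```

Once names are normalized, exact or fuzzy joins across tax, permit, and violation records can surface the portfolios hiding behind shell companies.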

Since these tools are open source, civic hackers in a multitude of cities can use existing work to develop their own, locally relevant tools to support local housing advocates….(More)”.

The value of data in Canada: Experimental estimates


Statistics Canada: “As data and information take on a far more prominent role in Canada and, indeed, all over the world, data, databases and data science have become a staple of modern life. When the electricity goes out, Canadians are as much in search of their data feed as they are of food and heat. Consumers are using more and more data that is embodied in the products they buy, whether those products are music, reading material, cars and other appliances, or a wide range of other goods and services. Manufacturers, merchants and other businesses depend increasingly on the collection, processing and analysis of data to make their production processes more efficient and to drive their marketing strategies.

The increasing use of and investment in all things data is driving economic growth, changing the employment landscape and reshaping how and from where we buy and sell goods. Yet the rapid rise in the use and importance of data is not well measured in the existing statistical system. Given the ‘lack of data on data’, Statistics Canada has initiated new research to produce a first set of estimates of the value of data, databases and data science. The development of these estimates benefited from collaboration with the Bureau of Economic Analysis in the United States and the Organisation for Economic Co-operation and Development.

In 2018, Canadian investment in data, databases and data science was estimated to be as high as $40 billion. This was greater than the annual investment in industrial machinery, transportation equipment, and research and development and represented approximately 12% of total non-residential investment in 2018….

Statistics Canada recently released a conceptual framework outlining how one might measure the economic value of data, databases and data science. Thanks to this new framework, the growing role of data in Canada can be measured through time. This framework is described in a paper that was released in The Daily on June 24, 2019 entitled “Measuring investments in data, databases and data science: Conceptual framework.” That paper describes the concept of an ‘information chain’ in which data are derived from everyday observations, databases are constructed from data, and data science creates new knowledge by analyzing the contents of databases….(More)”.

How we can place a value on health care data


Report by E&Y: “Unlocking the power of health care data to fuel innovation in medical research and improve patient care is at the heart of today’s health care revolution. When curated or consolidated into a single longitudinal dataset, patient-level records will trace a complete story of a patient’s demographics, health, wellness, diagnosis, treatments, medical procedures and outcomes. Health care providers need to recognize patient data for what it is: a valuable intangible asset desired by multiple stakeholders, a treasure trove of information.

Among the universe of providers holding significant data assets, the United Kingdom’s National Health Service (NHS) is the single largest integrated health care provider in the world. Its patient records cover the entire UK population from birth to death.

We estimate that the 55 million patient records held by the NHS today may have an indicative market value of several billion pounds to a commercial organization. We estimate also that the value of the curated NHS dataset could be as much as £5bn per annum and deliver around £4.6bn of benefit to patients per annum, in potential operational savings for the NHS, enhanced patient outcomes and generation of wider economic benefits to the UK….(More)”.

The plan to mine the world’s research papers


Priyanka Pulla in Nature: “Carl Malamud is on a crusade to liberate information locked up behind paywalls — and his campaigns have scored many victories. He has spent decades publishing copyrighted legal documents, from building codes to court records, and then arguing that such texts represent public-domain law that ought to be available to any citizen online. Sometimes, he has won those arguments in court. Now, the 60-year-old American technologist is turning his sights on a new objective: freeing paywalled scientific literature. And he thinks he has a legal way to do it.

Over the past year, Malamud has — without asking publishers — teamed up with Indian researchers to build a gigantic store of text and images extracted from 73 million journal articles dating from 1847 up to the present day. The cache, which is still being created, will be kept on a 576-terabyte storage facility at Jawaharlal Nehru University (JNU) in New Delhi. “This is not every journal article ever written, but it’s a lot,” Malamud says. It’s comparable to the size of the core collection in the Web of Science database, for instance. Malamud and his JNU collaborator, bioinformatician Andrew Lynn, call their facility the JNU data depot.

No one will be allowed to read or download work from the repository, because that would breach publishers’ copyright. Instead, Malamud envisages, researchers could crawl over its text and data with computer software, scanning through the world’s scientific literature to pull out insights without actually reading the text.

The unprecedented project is generating much excitement because it could, for the first time, open up vast swathes of the paywalled literature for easy computerized analysis. Dozens of research groups already mine papers to build databases of genes and chemicals, map associations between proteins and diseases, and generate useful scientific hypotheses. But publishers control — and often limit — the speed and scope of such projects, which typically confine themselves to abstracts, not full text. Researchers in India, the United States and the United Kingdom are already making plans to use the JNU store instead. Malamud and Lynn have held workshops at Indian government laboratories and universities to explain the idea. “We bring in professors and explain what we are doing. They get all excited and they say, ‘Oh gosh, this is wonderful’,” says Malamud.
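The kind of mining described above, pulling structured insights out of text without reading it, can be sketched as a co-occurrence count over abstracts. This is a toy illustration (the gene and disease lists and sample abstracts are made up; real biomedical mining uses curated vocabularies and named-entity recognition):

```python
import itertools
from collections import Counter

# Hypothetical controlled vocabularies.
GENES = {"BRCA1", "TP53"}
DISEASES = {"breast cancer", "lymphoma"}

def cooccurrences(abstracts):
    """Count gene–disease pairs mentioned in the same abstract."""
    pairs = Counter()
    for text in abstracts:
        genes = {g for g in GENES if g in text}
        diseases = {d for d in DISEASES if d in text.lower()}
        for g, d in itertools.product(genes, diseases):
            pairs[(g, d)] += 1
    return pairs

abstracts = [
    "Mutations in BRCA1 are associated with breast cancer risk.",
    "TP53 loss was observed in lymphoma and breast cancer samples.",
]
print(cooccurrences(abstracts)[("BRCA1", "breast cancer")])  # → 1
```

Run at the scale of 73 million articles, frequently co-occurring pairs become candidate associations for databases of genes, chemicals, and diseases.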

But the depot’s legal status isn’t yet clear. Malamud, who contacted several intellectual-property (IP) lawyers before starting work on the depot, hopes to avoid a lawsuit. “Our position is that what we are doing is perfectly legal,” he says. For the moment, he is proceeding with caution: the JNU data depot is air-gapped, meaning that no one can access it from the Internet. Users have to physically visit the facility, and only researchers who want to mine for non-commercial purposes are currently allowed in. Malamud says his team does plan to allow remote access in the future. “The hope is to do this slowly and deliberately. We are not throwing this open right away,” he says….(More)”.

What Restaurant Reviews Reveal About Cities


Linda Poon at CityLab: “Online review sites can tell you a lot about a city’s restaurant scene, and they can reveal a lot about the city itself, too.

Researchers at MIT recently found that information about restaurants gathered from popular review sites can be used to uncover a number of socioeconomic factors of a neighborhood, including its employment rates and demographic profiles of the people who live, work, and travel there.

A report published last week in the Proceedings of the National Academy of Sciences explains how the researchers used information found on Dianping—a Yelp-like site in China—to find information that might usually be gleaned from an official government census. The model could prove especially useful for gathering information about cities that don’t have that kind of reliable or up-to-date government data, especially in developing countries with limited resources to conduct regular surveys….

Zheng and her colleagues tested out their machine-learning model using restaurant data from nine Chinese cities of various sizes—from crowded ones like Beijing, with a population of more than 10 million, to smaller ones like Baoding, a city of fewer than 3 million people.

They pulled data from 630,000 restaurants listed on Dianping, including each business’s location, menu prices, opening day, and customer ratings. Then they ran it through a machine-learning model with official census data and with anonymous location and spending data gathered from cell phones and bank cards. By comparing the information, they were able to determine how closely the restaurant data reflected the other data they had about neighborhoods’ characteristics.
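That comparison step amounts to fitting a model from restaurant features to a census variable and checking how much of the variation it explains. A toy single-feature version, with made-up numbers rather than the authors’ data or model, might look like this:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

def r_squared(xs, ys):
    """Share of variance in ys explained by the fitted line."""
    a, b = fit_line(xs, ys)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - sum(ys) / len(ys)) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical neighborhoods: restaurant count vs. daytime population (thousands).
restaurants = [12, 40, 75, 130, 210]
population = [10, 15, 40, 55, 98]

print(round(r_squared(restaurants, population), 2))
```

A high R² on held-out neighborhoods is what lets restaurant listings stand in for census measures where official data is missing or stale; the study’s actual model uses many features and cross-validation.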

They found that the local restaurant scene can predict, with 95 percent accuracy, variations in a neighborhood’s daytime and nighttime populations, which are measured using mobile phone data. It can also predict, with 90 and 93 percent accuracy, respectively, the number of businesses and the volume of consumer consumption. The types of cuisine offered and kinds of eateries available (coffee shops vs. traditional teahouses, for example) can also predict the proportion of immigrants or the age and income breakdown of residents. The predictions are more accurate for neighborhoods near urban centers as opposed to those near suburbs, and for smaller cities, where neighborhoods don’t vary as widely as those in bigger metropolises….(More)”.

Studying Crime and Place with the Crime Open Database


M. P. J. Ashby in Research Data Journal for the Humanities and Social Sciences: “The study of spatial and temporal crime patterns is important for both academic understanding of crime-generating processes and for policies aimed at reducing crime. However, studying crime and place is often made more difficult by restrictions on access to appropriate crime data. This means understanding of many spatio-temporal crime patterns is limited to data from a single geographic setting, and there are few attempts at replication. This article introduces the Crime Open Database (CODE), a database of 16 million offenses from 10 of the largest United States cities over 11 years and more than 60 offense types. Open crime data were obtained from each city, having been published in multiple incompatible formats. The data were processed to harmonize geographic co-ordinates, dates and times, offense categories and location types, as well as adding census and other geographic identifiers. The resulting database allows the wider study of spatio-temporal patterns of crime across multiple US cities, allowing greater understanding of variations in the relationships between crime and place across different settings, as well as facilitating replication of research….(More)”.
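The harmonization step described in the blurb, mapping each city’s incompatible offense codes and date formats onto a shared schema, can be sketched as a per-city lookup. The city names, codes, and formats below are hypothetical simplifications, not the actual CODE pipeline:

```python
from datetime import datetime

# Hypothetical per-city mappings from local offense codes to shared categories.
CATEGORY_MAP = {
    "chicago": {"THEFT": "larceny", "BURGLARY": "burglary"},
    "detroit": {"LARC": "larceny", "BURG": "burglary"},
}

# Each city also publishes timestamps in its own format.
DATE_FORMATS = {
    "chicago": "%m/%d/%Y %H:%M",
    "detroit": "%Y-%m-%dT%H:%M:%S",
}

def harmonize(record, city):
    """Convert one city-specific record to the shared schema."""
    return {
        "city": city,
        "category": CATEGORY_MAP[city][record["offense"]],
        "occurred": datetime.strptime(record["date"], DATE_FORMATS[city]),
    }

r1 = harmonize({"offense": "THEFT", "date": "07/01/2019 13:30"}, "chicago")
r2 = harmonize({"offense": "LARC", "date": "2019-07-01T13:30:00"}, "detroit")
print(r1["category"] == r2["category"])  # → True
```

Once every city’s records share one category scheme and one timestamp representation, cross-city comparison and replication become straightforward queries rather than bespoke cleaning projects.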

How can Indigenous Data Sovereignty (IDS) be promoted and mainstreamed within open data movements?


OD Mekong Blog: “Considering Indigenous rights in the open data and technology space is a relatively new concept. Called “Indigenous Data Sovereignty” (IDS), it is defined as “the right of Indigenous peoples to govern the collection, ownership, and application of data about Indigenous communities, peoples, lands, and resources”, regardless of where the data is held or by whom. By default, this broad and all-encompassing framework bucks fundamental concepts of open data, and asks traditional open data practitioners to critically consider how open data can be used as a tool of transparency that also upholds equal rights for all…

Four main areas of concern and relevant barriers identified by participants were:

Self-determination to identify their membership

  • National governments in many states, particularly across Asia and South America, still do not allow for self-determination under the law. Even when legislation offers some recognition, these provisions are rarely enforced, and mainstream discourse demonises Indigenous self-determination.
  • However, because Indigenous and ethnic minorities frequently face hardships and persecution on a daily basis, there were concerns about the applicability of data sovereignty at the local levels.

Intellectual Property Protocols

  • Big tech companies now routinely extract data in excessive amounts from people’s everyday lives. How can disenfranchised communities combat this?
  • Indigenous data is often misappropriated to the detriment of Indigenous peoples.
  • Intellectual property concepts, such as copyright, are not an ideal approach for protecting Indigenous knowledge and intellectual property rights because they are rooted in commercial ideals that are difficult to apply to Indigenous contexts, especially since many groups do not practice commercialization in the globalized context. Also, as a concept based on exclusivity (i.e., when licenses expire, knowledge passes into the public domain), it does not take into account the collectivist ideals of Indigenous peoples.

Data Governance

  • Ultimately, data protection is about protecting lives. Having the ability to use data to direct decisions on Indigenous development places greater control in the hands of Indigenous peoples.
  • National governments are barriers due to conflicts in sovereignty interests. Nation-state legal systems are often contradictory to customary laws, and thus don’t often reflect rights-based approaches.

Consent — Free Prior and Informed Consent (FPIC)

  • FPIC, referring to a set of principles that define the process and mechanisms that apply specifically to Indigenous peoples in relation to the exercise of their collective rights, is a well-known phrase. These principles are intended to ensure that Indigenous peoples are treated as sovereign peoples with their own decision-making power, customary governance systems, and collective decision-making processes, but it is questionable to what level one can ensure true FPIC in the Indigenous context.²
  • It remains a question as to how effectively due diligence can be applied to research protocols, so as to ensure that the rights associated with FPIC and the UNDRIP framework are upheld….(More)”.

Beyond Open Data Hackathons: Exploring Digital Innovation Success


Paper by Fotis Kitsios and Maria Kamariotou: “Previous researchers have examined the motivations of developers to participate in hackathon events and the challenges of open data hackathons, but few studies have focused on the preparation and evaluation of these contests. Thus, the purpose of this paper is to examine factors that lead to the effective implementation and success of open data hackathons and innovation contests.

Six case studies of open data hackathons and innovation contests held between 2014 and 2018 in Thessaloniki were studied in order to identify the factors leading to the success of hackathon contests, using criteria from the existing literature. The results show that the most significant factors were clear problem definition, mentors’ participation in the contest, the level of support mentors give participants in launching their applications to the market, jury members’ knowledge and experience, the entry requirements of the competition, and the participation of companies, data providers, and academics. Furthermore, organizers should take team members’ competences and skills, as well as support for post-launch activities for applications, into consideration. This paper can be of interest to organizers of hackathon events because it identifies the factors that should be taken into consideration for the successful implementation of these events….(More)”.

Trusted data and the future of information sharing


MIT Technology Review: “Data in some form underpins almost every action or process in today’s modern world. Consider that even farming, the world’s oldest industry, is on the verge of a digital revolution, with AI, drones, sensors, and blockchain technology promising to boost efficiencies. The market value of an apple will increasingly reflect not only traditional farming inputs but also some value of modern data, such as weather patterns, soil acidity levels, and agri-supply-chain information. By 2022 more than 60% of global GDP will be digitized, according to IDC.

Governments seeking to foster growth in their digital economies need to be more active in encouraging safe data sharing between organizations. Tolerating the sharing of data and stepping in only where security breaches occur is no longer enough. Sharing data across different organizations enables the whole ecosystem to grow and can be a unique source of competitive advantage. But businesses need guidelines and support in how to do this effectively.   

This is how Singapore’s data-sharing worldview has evolved, according to Janil Puthucheary, senior minister of state for communications and information and transport, upon launching the city-state’s new Trusted Data Sharing Framework in June 2019.

The Framework, a product of consultations between Singapore’s Infocomm Media Development Authority (IMDA), its Personal Data Protection Commission (PDPC), and industry players, is intended to create a common data-sharing language for relevant stakeholders. Specifically, it addresses four common categories of concerns with data sharing: how to formulate an overall data-sharing strategy, legal and regulatory considerations, technical and organizational considerations, and the actual operationalizing of data sharing.

For instance, companies often have trouble assessing the value of their own data, a necessary first step before sharing should even be considered. The framework describes the three general approaches used: market-, cost-, and income-based. The legal and regulatory section details when businesses can, among other things, seek exemptions from Singapore’s Personal Data Protection Act.
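As a stylized illustration of the income-based approach mentioned above (with hypothetical figures, not numbers from the framework), a dataset expected to generate recurring revenue can be valued as the discounted sum of those future flows:

```python
def income_based_value(annual_income, discount_rate, years):
    """Present value of a stream of equal annual income flows."""
    return sum(
        annual_income / (1 + discount_rate) ** t
        for t in range(1, years + 1)
    )

# A dataset expected to earn $100k/year for 5 years, discounted at 10%.
print(round(income_based_value(100_000, 0.10, 5), 2))  # → 379078.68
```

The cost-based approach would instead sum what was spent to collect and curate the data, and the market-based approach would look to prices paid for comparable datasets; the framework leaves the choice of method to the circumstances of the data holder.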

The technical and organizational chapter includes details on governance, infrastructure security, and risk management. Finally, the section on operational aspects of data sharing includes guidelines for when it is appropriate to use shared data for a secondary purpose or not….(More)”.