Big data needs big governance: best practices from Brain-CODE, the Ontario Brain Institute’s neuroinformatics platform


Shannon C. Lefaivre et al. in Frontiers in Genetics: “The Ontario Brain Institute (OBI) has begun to catalyze scientific discovery in the field of neuroscience through its large-scale informatics platform, known as Brain-CODE. The platform supports the capture, storage, federation, sharing and analysis of different data types across several brain disorders. Underlying the platform is a robust and scalable data governance structure which allows for the flexibility to advance scientific understanding, while protecting the privacy of research participants.

Recognizing the value of an open science approach to enabling discovery, the governance structure was designed not only to support collaborative research programs, but also to support open science by making all data open and accessible in the future. OBI’s rigorous approach to data sharing maintains the accessibility of research data for big discoveries without compromising privacy and security. Taking a Privacy by Design approach to both data sharing and development of the platform has allowed OBI to establish some best practices related to large-scale data sharing within Canada. The aim of this report is to highlight these best practices and develop a key open resource which may be referenced during the development of similar open science initiatives….(More)”.

Balancing information governance obligations when accessing social care data for collaborative research


Paper by Malkiat Thiarai, Sarunkorn Chotvijit and Stephen Jarvis: “There is significant national interest in tackling issues surrounding the needs of vulnerable children and adults. This paper aims to argue that much value can be gained from the application of new data-analytic approaches to assist with the care provided to vulnerable children. This paper highlights the ethical and information governance issues raised in the development of a research project that sought to access and analyse children’s social care data.


The paper documents the process involved in identifying, accessing and using data held in Birmingham City Council’s social care system for collaborative research with a partner organisation. This includes identifying the data, its structure and format; understanding the Data Protection Act 1998 and 2018 (DPA) exemptions that are relevant to ensure that legal obligations are met; data security and access management; the ethical and governance approval process.


The findings will include approaches to understanding the data, its structure and accessibility; the tasks involved in addressing ethical and legal obligations; and the requirements of the ethical and governance processes….(More)”.

Privacy and Smart Cities: A Canadian Survey


Report by Sara Bannerman and Angela Orasch: “This report presents the findings of a national survey of Canadians about smart-city privacy conducted in October and November 2018. Our research questions were: How concerned are Canadians about smart-city privacy? How do these concerns intersect with age, gender, ethnicity, and location? Moreover, what are the expectations of Canadians with regard to their ability to control, use, or opt out of data collection in smart-city contexts? What rights and privileges do Canadians feel are appropriate with regard to data self-determination, and what types of data are considered more sensitive than others?

What is a smart city?
A ‘smart city’ adopts digital and data-driven technologies in the planning, management and delivery of municipal services. Information and communications technologies (ICTs), data analytics, and the internet of things (IoT) are some of the main components of these technologies, joined by web design, online marketing campaigns and digital services. Such technologies can include smart utility and transportation infrastructure, smart cards, smart transit, camera and sensor networks, or data collection by businesses to provide customized advertisements or other services. Smart-city technologies “monitor, manage and regulate city flows and processes, often in real-time” (Kitchin 2014, 2).

In 2017, a framework agreement was established between Waterfront Toronto, the organization charged with revitalizing Toronto’s waterfront, and Sidewalk Labs, a sister company of Google under Alphabet, to develop a smart city on Toronto’s Eastern waterfront (Sidewalk Toronto 2018). This news was met with questions and concerns from experts in data privacy and the public at large regarding what was to be included in Sidewalk Labs’ smart-city vision. How would the overall governance structure function? How were the privacy rights of residents going to be protected, and what mechanisms, if any, would ensure that protection? The Toronto waterfront is just one of numerous examples of smart-city developments….(More)”.

Consumers kinda, sorta care about their data


Kim Hart at Axios: “A full 81% of consumers say that in the past year they’ve become more concerned with how companies are using their data, and 87% say they’ve come to believe companies that manage personal data should be more regulated, according to a survey out Monday by IBM’s Institute for Business Value.

Yes, but: They aren’t totally convinced they should care about how their data is being used, and many aren’t taking meaningful action after privacy breaches, according to the survey. Despite increasing data risks, 71% say it’s worth sacrificing privacy given the benefits of technology.

By the numbers:

  • 89% say technology companies need to be more transparent about their products.
  • 75% say that in the past year they’ve become less likely to trust companies with their personal data.
  • 88% say the emergence of technologies like AI increases the need for clear policies about the use of personal data.

The other side: Despite increasing awareness of privacy and security breaches, most consumers aren’t taking consequential action to protect their personal data.

  • Fewer than half (45%) report that they’ve updated privacy settings, and only 16% stopped doing business with an entity due to data misuse….(More)”.

Algorithmic fairness: A code-based primer for public-sector data scientists


Paper by Ken Steif and Sydney Goldstein: “As the number of government algorithms grows, so does the need to evaluate algorithmic fairness. This paper has three goals. First, we ground the notion of algorithmic fairness in the context of disparate impact, arguing that for an algorithm to be fair, its predictions must generalize across different protected groups. Next, two algorithmic use cases are presented with code examples for how to evaluate fairness. Finally, we promote the concept of an open source repository of government algorithmic “scorecards,” allowing stakeholders to compare across algorithms and use cases….(More)”.
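To make the idea concrete, the sketch below (not the authors' code) shows the kind of disparate-impact check the paper describes: computing selection and error rates per protected group and comparing them. The data, column names, and the four-fifths threshold are illustrative assumptions only.

```python
# Hypothetical sketch of a disparate-impact audit: compare selection and
# error rates across protected groups to see whether predictions generalize.
import pandas as pd

def fairness_scorecard(df, group_col, label_col, pred_col):
    """Per-group rates commonly reported in disparate-impact audits."""
    rows = []
    for group, g in df.groupby(group_col):
        negatives = (g[label_col] == 0).sum()
        positives = (g[label_col] == 1).sum()
        rows.append({
            group_col: group,
            "n": len(g),
            "selection_rate": (g[pred_col] == 1).mean(),
            "false_positive_rate": ((g[pred_col] == 1) & (g[label_col] == 0)).sum() / max(negatives, 1),
            "false_negative_rate": ((g[pred_col] == 0) & (g[label_col] == 1)).sum() / max(positives, 1),
        })
    return pd.DataFrame(rows)

# Toy data standing in for a model's predictions on a labeled audit set.
audit = pd.DataFrame({
    "group":     ["A", "A", "A", "A", "B", "B", "B", "B"],
    "label":     [1, 0, 1, 0, 1, 0, 0, 0],
    "predicted": [1, 0, 1, 1, 1, 0, 0, 1],
})

scorecard = fairness_scorecard(audit, "group", "label", "predicted")
ratio = scorecard["selection_rate"].min() / scorecard["selection_rate"].max()
print(scorecard.to_string(index=False))
# A ratio well below ~0.8 (the "four-fifths" rule of thumb) suggests disparate impact.
print(f"Disparate-impact ratio (min/max selection rate): {ratio:.2f}")
```

A repository of such scorecards, as the paper proposes, would let stakeholders compare these rates across algorithms and use cases.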

Are Requirements to Deposit Data in Research Repositories Compatible With the European Union’s General Data Protection Regulation?


Paper by Deborah Mascalzoni et al: “To reproduce study findings and facilitate new discoveries, many funding bodies, publishers, and professional communities are encouraging—and increasingly requiring—investigators to deposit their data, including individual-level health information, in research repositories. For example, in some cases the National Institutes of Health (NIH) and editors of some Springer Nature journals require investigators to deposit individual-level health data via a publicly accessible repository (12). However, this requirement may conflict with the core privacy principles of European Union (EU) General Data Protection Regulation 2016/679 (GDPR), which focuses on the rights of individuals as well as researchers’ obligations regarding transparency and accountability.

The GDPR establishes legally binding rules for processing personal data in the EU, as well as outside the EU in some cases. Researchers in the EU, and often their global collaborators, must comply with the regulation. Health and genetic data are considered special categories of personal data and are subject to relatively stringent rules for processing….(More)”.

The privacy threat posed by detailed census data


Gillian Tett at the Financial Times: “Wilbur Ross suffered the political equivalent of a small(ish) black eye last month: a federal judge blocked the US commerce secretary’s attempts to insert a question about citizenship into the 2020 census and accused him of committing “egregious” legal violations.

The Supreme Court has agreed to hear the administration’s appeal in April. But while this high-profile fight unfolds, there is a second, less noticed, census issue about data privacy emerging that could have big implications for businesses (and citizens). Last weekend John Abowd, the Census Bureau’s chief scientist, told an academic gathering that statisticians had uncovered shortcomings in the protection of personal data in past censuses. There is no public evidence that anyone has actually used these weaknesses to hack records, and Mr Abowd insisted that the bureau is using cutting-edge tools to fight back. But, if nothing else, this revelation shows the mounting problem around data privacy. Or, as Mr Abowd noted: “These developments are sobering to everyone.” These flaws are “not just a challenge for statistical agencies or internet giants,” he added, but affect any institution engaged in internet commerce and “bioinformatics”, as well as commercial lenders and non-profit survey groups. Bluntly, this includes most companies and banks.

The crucial problem revolves around what is known as “re-identification” risk. When companies and government institutions amass sensitive information about individuals, they typically protect privacy in two ways: they hide the full data set from outside eyes or they release it in an “anonymous” manner, stripped of identifying details. The Census Bureau does both: it is required by law to publish detailed data and protect confidentiality. Since 1990, it has tried to resolve these contradictory mandates by using “household-level swapping” — moving some households from one geographic location to another to generate enough uncertainty to prevent re-identification. This used to work. But today there are so many commercially available data sets and computers are so powerful that it is possible to re-identify “anonymous” data by combining data sets. …
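To illustrate the linkage risk described above, here is a small, entirely hypothetical sketch: an “anonymous” release with names stripped is joined to a commercially available data set on shared quasi-identifiers (ZIP code, birth year, sex), and unique matches re-attach names to sensitive records. All data and column names below are invented for illustration.

```python
# Hypothetical linkage attack: join an "anonymized" release with an
# auxiliary, name-bearing data set on shared quasi-identifiers.
import pandas as pd

# "Anonymized" release: direct identifiers removed, quasi-identifiers kept.
anonymous_release = pd.DataFrame({
    "zip": ["60614", "60614", "10027"],
    "birth_year": [1984, 1990, 1984],
    "sex": ["F", "M", "F"],
    "sensitive_attr": ["condition_x", "condition_y", "condition_z"],
})

# Auxiliary data set with names attached (e.g., a commercial marketing list).
auxiliary = pd.DataFrame({
    "name": ["J. Doe", "A. Smith"],
    "zip": ["60614", "10027"],
    "birth_year": [1984, 1984],
    "sex": ["F", "F"],
})

# Records that match uniquely on the quasi-identifiers are re-identified.
linked = auxiliary.merge(anonymous_release, on=["zip", "birth_year", "sex"])
print(linked)  # two of the three "anonymous" rows now carry a name
```

The more auxiliary data sets an attacker can combine, the more records become unique on their quasi-identifiers, which is why, as the article notes, swapping alone no longer generates enough uncertainty.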

Thankfully, statisticians think there is a solution. The Census Bureau now plans to use a technique known as “differential privacy” which would introduce “noise” into the public statistics, using complex algorithms. This technique is expected to create just enough statistical fog to protect personal confidentiality in published data — while also preserving information in an encrypted form that statisticians can later unscramble, as needed. Companies such as Google, Microsoft and Apple have already used variants of this technique for several years, seemingly successfully. However, nobody has employed this system on the scale that the Census Bureau needs — or in relation to such a high-stakes event. And the idea has sparked some controversy because some statisticians fear that even “differential privacy” tools can be hacked — and others fret that it makes data too “noisy” to be useful….(More)”.
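For intuition, here is a minimal sketch of the Laplace mechanism that underlies differential privacy, applied to a single count query. The Census Bureau's production system is far more elaborate, so the epsilon values and the toy count below are illustrative assumptions only.

```python
# Minimal sketch of the Laplace mechanism: publish a count plus calibrated
# noise so that any one person's presence or absence barely changes the output.
import numpy as np

rng = np.random.default_rng(0)

def dp_count(true_count, epsilon, sensitivity=1.0):
    """Noisy count; smaller epsilon means more noise and stronger privacy."""
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

true_block_population = 137  # hypothetical census-block count
for eps in (0.1, 1.0, 10.0):
    print(f"epsilon={eps:>4}: published count = {dp_count(true_block_population, eps):.1f}")
```

The tension the article describes is visible in the epsilon parameter: lowering it adds more statistical fog, protecting individuals but making the published figures noisier and less useful.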

Tomorrow’s Data Heroes


Article by Florian Gröne, Pierre Péladeau, and Rawia Abdel Samad: “Telecom companies are struggling to find a profitable identity in today’s digital sphere. What about helping customers control their information?…

By 2025, Alex had had enough. There no longer seemed to be any distinction between her analog and digital lives. Everywhere she went, every purchase she completed, and just about every move she made, from exercising at the gym to idly surfing the Web, triggered a vast flow of data. That in turn meant she was bombarded with personalized advertising messages, targeted more and more eerily to her. As she walked down the street, messages appeared on her phone about the stores she was passing. Ads popped up on her all-purpose tablet–computer–phone pushing drugs for minor health problems she didn’t know she had — until the symptoms appeared the next day. Worse, she had recently learned that she was being reassigned at work. An AI machine had mastered her current job by analyzing her use of the firm’s productivity software.

It was as if the algorithms of global companies knew more about her than she knew herself — and they probably did. How was it that her every action and conversation, even her thoughts, added to the store of data held about her? After all, it was her data: her preferences, dislikes, interests, friendships, consumer choices, activities, and whereabouts — her very identity — that was being collected, analyzed, profited from, and even used to manage her. All these companies seemed to be making money buying and selling this information. Why shouldn’t she gain some control over the data she generated, and maybe earn some cash by selling it to the companies that had long collected it free of charge?

So Alex signed up for the “personal data manager,” a new service that promised to give her control over her privacy and identity. It was offered by her U.S.-based connectivity company (in this article, we’ll call it DigiLife, but it could be one of many former telephone companies providing Internet services in 2025). During the previous few years, DigiLife had transformed itself into a connectivity hub: a platform that made it easier for customers to join, manage, and track interactions with media and software entities across the online world. Thanks to recently passed laws regarding digital identity and data management, including the “right to be forgotten,” the DigiLife data manager was more than window dressing. It laid out easy-to-follow choices that all Web-based service providers were required by law to honor….

Today, in 2019, personal data management applications like the one Alex used exist only in nascent form, and consumers have yet to demonstrate that they trust these services. Nor can they yet profit by selling their data. But the need is great, and so is the opportunity for companies that fulfill it. By 2025, the total value of the data economy as currently structured will rise to more than US$400 billion, and by monetizing the vast amounts of data they produce, consumers can potentially recapture as much as a quarter of that total.

Given the critical role of telecom operating companies within the digital economy — the central position of their data networks, their networking capabilities, their customer relationships, and their experience in government affairs — they are in a good position to seize this business opportunity. They might not do it alone; they are likely to form consortia with software companies or other digital partners. Nonetheless, for legacy connectivity companies, providing this type of service may be the most sustainable business option. It may also be the best option for the rest of us, as we try to maintain control in a digital world flooded with our personal data….(More)”.

Dirty Data, Bad Predictions: How Civil Rights Violations Impact Police Data, Predictive Policing Systems, and Justice


Paper by Rashida Richardson, Jason Schultz, and Kate Crawford: “Law enforcement agencies are increasingly using algorithmic predictive policing systems to forecast criminal activity and allocate police resources. Yet in numerous jurisdictions, these systems are built on data produced within the context of flawed, racially fraught and sometimes unlawful practices (‘dirty policing’). This can include systemic data manipulation, falsifying police reports, unlawful use of force, planted evidence, and unconstitutional searches. These policing practices shape the environment and the methodology by which data is created, which leads to inaccuracies, skews, and forms of systemic bias embedded in the data (‘dirty data’). Predictive policing systems informed by such data cannot escape the legacy of unlawful or biased policing practices that they are built on. Nor do claims by predictive policing vendors that these systems provide greater objectivity, transparency, or accountability hold up. While some systems offer the ability to see the algorithms used and even occasionally access to the data itself, there is no evidence to suggest that vendors independently or adequately assess the impact that unlawful and biased policing practices have on their systems, or otherwise assess how broader societal biases may affect their systems.

In our research, we examine the implications of using dirty data with predictive policing, and look at jurisdictions that (1) have utilized predictive policing systems and (2) have done so while under government commission investigations or federal court monitored settlements, consent decrees, or memoranda of agreement stemming from corrupt, racially biased, or otherwise illegal policing practices. In particular, we examine the link between unlawful and biased police practices and the data used to train or implement these systems across thirteen case studies. We highlight three of these: (1) Chicago, an example of where dirty data was ingested directly into the city’s predictive system; (2) New Orleans, an example where the extensive evidence of dirty policing practices suggests an extremely high risk that dirty data was or will be used in any predictive policing application; and (3) Maricopa County, where, despite extensive evidence of dirty policing practices, lack of transparency and public accountability surrounding predictive policing inhibits the public from assessing the risks of dirty data within such systems. The implications of these findings have widespread ramifications for predictive policing writ large. Deploying predictive policing systems in jurisdictions with extensive histories of unlawful police practices presents elevated risks that dirty data will lead to flawed, biased, and unlawful predictions which in turn risk perpetuating additional harm via feedback loops throughout the criminal justice system. Thus, for any jurisdiction where police have been found to engage in such practices, the use of predictive policing in any context must be treated with skepticism and mechanisms for the public to examine and reject such systems are imperative….(More)”.

Hundreds of Bounty Hunters Had Access to AT&T, T-Mobile, and Sprint Customer Location Data for Years


Joseph Cox at Motherboard: “In January, Motherboard revealed that AT&T, T-Mobile, and Sprint were selling their customers’ real-time location data, which trickled down through a complex network of companies until eventually ending up in the hands of at least one bounty hunter. Motherboard was also able to purchase the real-time location of a T-Mobile phone on the black market from a bounty hunter source for $300. In response, telecom companies said that this abuse was a fringe case.

In reality, it was far from an isolated incident.

Around 250 bounty hunters and related businesses had access to AT&T, T-Mobile, and Sprint customer location data, with one bail bond firm using the phone location service more than 18,000 times, and others using it thousands or tens of thousands of times, according to internal documents obtained by Motherboard from a company called CerCareOne, a now-defunct location data seller that operated until 2017. The documents list not only the companies that had access to the data, but specific phone numbers that were pinged by those companies.

In some cases, the data sold is more sensitive than that offered by the service used by Motherboard last month, which estimated a location based on the cell phone towers that a phone connected to. CerCareOne sold cell phone tower data, but also sold highly sensitive and accurate GPS data to bounty hunters; an unprecedented move that means users could locate someone so accurately as to see where they are inside a building. This company operated in near-total secrecy for over 5 years by making its customers agree to “keep the existence of CerCareOne.com confidential,” according to a terms of use document obtained by Motherboard.

Some of these bounty hunters then resold location data to those unauthorized to handle it, according to two independent sources familiar with CerCareOne’s operations.

The news shows how widely available Americans’ sensitive location data was to bounty hunters. This ease of access dramatically increased the risk of abuse….(More)”.