Characterizing the Biomedical Data-Sharing Landscape


Paper by Angela G. Villanueva et al.: “Advances in technologies and biomedical informatics have expanded capacity to generate and share biomedical data. With a lens on genomic data, we present a typology characterizing the data-sharing landscape in biomedical research to advance understanding of the key stakeholders and existing data-sharing practices. The typology highlights the diversity of data-sharing efforts and facilitators and reveals how novel data-sharing efforts are challenging existing norms regarding the role of individuals whom the data describe.

Technologies such as next-generation sequencing have dramatically expanded capacity to generate genomic data at a reasonable cost, while advances in biomedical informatics have created new tools for linking and analyzing diverse data types from multiple sources. Further, many research-funding agencies now mandate that grantees share data. The National Institutes of Health’s (NIH) Genomic Data Sharing (GDS) Policy, for example, requires NIH-funded research projects generating large-scale human genomic data to share those data via an NIH-designated data repository such as the Database of Genotypes and Phenotypes (dbGaP). Another example is the Parent Project Muscular Dystrophy, a non-profit organization that requires applicants to propose a data-sharing plan and takes an applicant’s history of data sharing into account.

The flow of data to and from different projects, institutions, and sectors is creating a medical information commons (MIC), a data-sharing ecosystem consisting of networked resources sharing diverse health-related data from multiple sources for research and clinical uses. This concept aligns with the 2018 NIH Strategic Plan for Data Science, which uses the term “data ecosystem” to describe “a distributed, adaptive, open system with properties of self-organization, scalability and sustainability” and proposes to “modernize the biomedical research data ecosystem” by funding projects such as the NIH Data Commons. Consistent with Elinor Ostrom’s discussion of nested institutional arrangements, an MIC is both singular and plural and may describe the ecosystem as a whole or individual components contributing to the ecosystem. Thus, resources like the NIH Data Commons with its associated institutional arrangements are MICs, and also form part of the larger MIC that encompasses all such resources and arrangements.

Although many research funders incentivize data sharing, in practice, progress in making biomedical data broadly available to maximize its utility is often hampered by a broad range of technical, legal, cultural, normative, and policy challenges that include achieving interoperability, changing the standards for academic promotion, and addressing data privacy and security concerns. Addressing these challenges requires multi-stakeholder involvement. To identify relevant stakeholders and advance understanding of the contributors to an MIC, we conducted a landscape analysis of existing data-sharing efforts and facilitators. Our work builds on typologies describing various aspects of data sharing that focused on biobanks, research consortia, or where data reside (e.g., degree of data centralization). While these works are informative, we aimed to capture the biomedical data-sharing ecosystem with a wider scope. Understanding the components of an MIC ecosystem and how they interact, and identifying emerging trends that test existing norms (such as norms regarding the role of individuals whom the data describe), is essential to fostering effective practices, policies and governance structures, guiding resource allocation, and promoting the overall sustainability of the MIC….(More)”

How Recommendation Algorithms Run the World


Article by Zeynep Tufekci: “What should you watch? What should you read? What’s news? What’s trending? Wherever you go online, companies have come up with very particular, imperfect ways of answering these questions. Everywhere you look, recommendation engines offer striking examples of how values and judgments become embedded in algorithms and how algorithms can be gamed by strategic actors.

Consider a common, seemingly straightforward method of making suggestions: a recommendation based on what people “like you” have read, watched, or shopped for. What exactly is a person like me? Which dimension of me? Is it someone of the same age, gender, race, or location? Do they share my interests? My eye color? My height? Or is their resemblance to me determined by a whole mess of “big data” (aka surveillance) crunched by a machine-learning algorithm?

Deep down, behind every “people like you” recommendation is a computational method for distilling stereotypes through data. Even when these methods work, they can help entrench the stereotypes they’re mobilizing. They might easily recommend books about coding to boys and books about fashion to girls, simply by tracking the next most likely click. Of course, that creates a feedback cycle: If you keep being shown coding books, you’re probably more likely to eventually check one out.
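
To make that feedback cycle concrete, here is a minimal sketch of a “people like you” recommender, assuming simple click data and cosine similarity as the resemblance measure; the article describes no specific implementation, and all names and data here are invented:

```python
# A toy "people like you" recommender: score items by the similarity of
# the users who clicked them. Names, data, and the cosine scoring are
# illustrative assumptions, not anything described in the article.
from collections import defaultdict
import math

# user -> set of items they clicked
clicks = {
    "alice": {"coding_book", "robot_kit"},
    "bob":   {"coding_book", "robot_kit", "math_puzzles"},
    "carol": {"fashion_mag", "craft_kit"},
}

def cosine(a, b):
    """Cosine similarity between two sets of clicked items."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def recommend(user, k=2):
    """Suggest items that users 'like you' clicked but you have not."""
    mine = clicks[user]
    scores = defaultdict(float)
    for other, theirs in clicks.items():
        if other == user:
            continue
        sim = cosine(mine, theirs)
        for item in theirs - mine:
            scores[item] += sim  # weight each item by how similar its clicker is
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend("alice"))  # ['math_puzzles']: the next most likely click
```

Note how the loop bakes in the feedback: alice is shown what users who already resemble her clicked, which makes her future clicks resemble theirs even more.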

Another common method for generating recommendations is to extrapolate from patterns in how people consume things. People who watched this then watched that; shoppers who purchased this item also added that one to their shopping cart. Amazon uses this method a lot, and I admit, it’s often quite useful. Buy an electric toothbrush? How nice that the correct replacement head appears in your recommendations. Congratulations on your new vacuum cleaner: Here are some bags that fit your machine.
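
This second method reduces to counting co-occurrences across purchase baskets. A toy version might look like the following; the baskets and item names are invented for illustration, and this is not Amazon’s actual algorithm:

```python
# A toy version of "shoppers who purchased this item also added that one":
# count item co-occurrences across baskets. Data is illustrative only.
from collections import Counter
from itertools import permutations

baskets = [
    {"electric_toothbrush", "replacement_heads"},
    {"electric_toothbrush", "replacement_heads", "floss"},
    {"vacuum_cleaner", "vacuum_bags"},
]

# co[x][y] = number of baskets containing both x and y
co = {}
for basket in baskets:
    for x, y in permutations(basket, 2):
        co.setdefault(x, Counter())[y] += 1

def also_bought(item, k=3):
    """Items most frequently purchased alongside `item`."""
    return [y for y, _ in co.get(item, Counter()).most_common(k)]

print(also_bought("electric_toothbrush"))  # ['replacement_heads', 'floss']
```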

But these recommendations can also be revealing in ways that are creepy. …

One final method for generating recommendations is to identify what’s “trending” and push that to a broader user base. But this, too, involves making a lot of judgments….(More)”.

Leveraging Big Data for Social Responsibility


Paper by Cynthia Ann Peterson: “Big data has the potential to revolutionize the way social risks are managed by providing enhanced insight to enable more informed actions to be taken. The objective of this paper is to share the approach taken by PETRONAS to leverage big data to enhance its social performance practice, specifically in social risk assessments and grievance mechanisms.

The paper will deliberate on the benefits, challenges and opportunities to improve the management of social risk through analytics, and how PETRONAS has taken those factors into consideration in the enhancement of its social risk assessment and grievance mechanism tools. Key considerations, such as disaggregating data, choosing appropriate leading and lagging indicators, and applying a human rights lens to data, will also be discussed.

Leveraging big data is still in its early stages in the social risk space, as in other areas of the oil and gas industry, according to research by Wood Mackenzie. Even so, there are several concerns: aggregating data may prevent risks to minority or vulnerable groups from surfacing; privacy breaches may violate human rights; and prescriptive analysis may lead to discrimination, for example by predicting a community’s propensity to pose certain social risks to projects or operations. Certainly, there are many challenges ahead which need to be considered, including how best to take a human rights approach to using big data.
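
A small, invented example makes the aggregation concern concrete: viewed in total, grievance counts look stable, while a disaggregated view shows a vulnerable group’s complaints spiking. The figures are hypothetical; the paper reports no such numbers:

```python
# Invented figures showing how aggregation can hide a risk signal: total
# grievances look stable while one vulnerable group's complaints spike.
grievances = [
    {"quarter": "Q1", "group": "majority",   "count": 40},
    {"quarter": "Q1", "group": "vulnerable", "count": 2},
    {"quarter": "Q2", "group": "majority",   "count": 38},
    {"quarter": "Q2", "group": "vulnerable", "count": 9},
]

for q in ("Q1", "Q2"):
    total = sum(g["count"] for g in grievances if g["quarter"] == q)
    vulnerable = sum(g["count"] for g in grievances
                     if g["quarter"] == q and g["group"] == "vulnerable")
    # Aggregated: 42 -> 47 (looks flat). Disaggregated: 2 -> 9 (a 4.5x jump).
    print(f"{q}: total={total}, vulnerable={vulnerable}")
```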

Nevertheless, harnessing the power of big data will help social risk practitioners turn a high volume of disparate pieces of raw data from grievance mechanisms and social risk assessments into information that can be used to avoid or mitigate risks now and in the future through predictive technology. Consumer and other industries are benefiting from this leverage now, and social performance practitioners in the oil and gas industry can emulate these proven models….(More)”.

The Importance of Data Access Regimes for Artificial Intelligence and Machine Learning


JRC Digital Economy Working Paper by Bertin Martens: “Digitization triggered a steep drop in the cost of information. The resulting data glut created a bottleneck because human cognitive capacity is unable to cope with large amounts of information. Artificial intelligence and machine learning (AI/ML) triggered a similar drop in the cost of machine-based decision-making and helps in overcoming this bottleneck. Substantial change in the relative price of resources puts pressure on ownership and access rights to these resources. This explains pressure on access rights to data. ML thrives on access to big and varied datasets. We discuss the implications of access regimes for the development of AI in its current form of ML. The economic characteristics of data (non-rivalry, economies of scale and scope) favour data aggregation in big datasets. Non-rivalry implies the need for exclusive rights in order to incentivise data production when it is costly. The balance between access and exclusion is at the centre of the debate on data regimes. We explore the economic implications of several modalities for access to data, ranging from exclusive monopolistic control to monopolistic competition and free access. Regulatory intervention may push the market beyond voluntary exchanges, either towards more openness or reduced access. This may generate private costs for firms and individuals. Society can choose to do so if the social benefits of this intervention outweigh the private costs.

We briefly discuss the main EU legal instruments that are relevant for data access and ownership, including the General Data Protection Regulation (GDPR) that defines the rights of data subjects with respect to their personal data and the Database Directive (DBD) that grants ownership rights to database producers. These two instruments leave a wide legal no-man’s land where data access is ruled by bilateral contracts and Technical Protection Measures that give exclusive control to de facto data holders, and by market forces that drive access, trade and pricing of data. The absence of exclusive rights might facilitate data sharing and access or it may result in a segmented data landscape where data aggregation for ML purposes is hard to achieve. It is unclear if incompletely specified ownership and access rights maximize the welfare of society and facilitate the development of AI/ML…(More)”

Cyberdiplomacy: Managing Security and Governance Online


Book by Shaun Riordan: “The world has been sleep-walking into cyber chaos. The spread of misinformation via social media and the theft of data and intellectual property, along with regular cyberattacks, threaten the fabric of modern societies. All the while, the Internet of Things increases the vulnerability of computer systems, including those controlling critical infrastructure. What can be done to tackle these problems? Does diplomacy offer ways of managing security and containing conflict online?

In this provocative book, Shaun Riordan shows how traditional diplomatic skills and mindsets can be combined with new technologies to bring order and enhance international cooperation. He explains what cyberdiplomacy means for diplomats, foreign services and corporations and explores how it can be applied to issues such as internet governance, cybersecurity, cybercrime and information warfare. Cyberspace, he argues, is too important to leave to technicians. Using the vital tools offered by cyberdiplomacy, we can reduce the escalation and proliferation of cyberconflicts by proactively promoting negotiation and collaboration online….(More)”.

Data Trusts: More Data than Trust? The Perspective of the Data Subject in the Face of a Growing Problem


Paper by Christine Rinik: “In the recent report, Growing the Artificial Intelligence Industry in the UK, Hall and Pesenti suggest the use of a ‘data trust’ to facilitate data sharing. Whilst government and corporations are focusing on their need to facilitate data sharing, the perspective of many individuals is that too much data is being shared. The issue is not only about data, but about power. The individual does not often have a voice when issues relating to data sharing are tackled. Regulators can cite the ‘public interest’ when data governance is discussed, but the individual’s interests may diverge from those of the public.

This paper considers the data subject’s position with respect to data collection, leading to considerations about surveillance and datafication. Proposals for data trusts will be considered, applying principles of English trust law, to see whether they might mitigate the imbalance of power between large data users and individual data subjects. Finally, the paper explores the possibility of a workable remedy in the form of a class action lawsuit, which could give data subjects some collective power in the event of a data breach. Despite regulatory efforts to protect personal data, there is a lack of public trust in the current data sharing system….(More)”.

Finding Wisdom in Politically Polarized Crowds


Eamon Duede at Nature Research: “We were seeing that the consumption of ideas seemed deeply related to political alignment, and because our group (Knowledge Lab) is concerned with understanding the social dynamics involved in the production of ideas, we began wondering whether and to what extent the political alignment of individuals contributes to a group’s ability to produce knowledge. A Wikipedia article is full of smuggled content and worked into a narrative by a diverse team of editors. Because those articles constitute knowledge, we were curious to know whether political polarization within those teams had an effect on the quality of that production. So, we decided to braid both strands of research together and look at the way in which individual political alignments and the polarization of the teams they form affect the quality of the work that is produced collaboratively on Wikipedia.

To answer this question, we turned not to the articles themselves, but to the immense history of articles on Wikipedia. Every edit to every article, no matter how insignificant, is documented and saved in Wikipedia’s astonishingly massive archives. And every edit to every article, no matter how insignificant, is evaluated for its relevance or validity by the vast community of editors, both robotic and human. Remarkable teamwork has gone into producing the encyclopedia. Some people edit randomly, simply cleaning typos, adding citations, or contributing graffiti and vandalism (I’ve experimented with this, and it gets painted over very quickly, no matter where you put it). Yet, many people are genuinely purposeful in their work, and contribute specifically to topics on which they have both interest and knowledge. They tend and grow a handful of articles or a few broad topics like gardeners. We walked through the histories of these gardens, looking back at who made contributions here and there, how much they contributed, and where. We thought that editors who make frequent contributions to pages associated with American liberalism would hold left-leaning opinions, and that frequent contributors to pages associated with conservatism would hold opinions on the right. This was a controversial hypothesis, and many in the Wikipedia community felt that perhaps the opposite would be true, with liberals correcting conservative pages and conservatives kindly returning the favor, like weeding or applying pesticide. But a survey we conducted of active Wikipedia editors found that building a function over the relative number of bits they contributed to liberal versus conservative pages predicted more than a third of the probability that they identified as such and voted accordingly.

Following this validation, we assigned a political alignment score to hundreds of thousands of editors by looking at where they make contributions, and then examined the polarization within teams of editors that produced hundreds of thousands of Wikipedia articles in the broad topic areas of politics, social issues, and science. We found that when most members of a team have the same political alignment, whether conservative, liberal, or “independent”, the quality of the Wikipedia pages they produce is not as strong as those of teams with polarized compositions of editors (Shi et al. 2019).
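
A rough sketch of the two quantities in play, under assumed formulas (an editor’s alignment as the balance of contributions to conservative versus liberal pages, a team’s polarization as the spread of its members’ alignments). Shi et al. (2019) define their own measures, so treat this as illustration only:

```python
# A sketch of the two quantities described above, under assumed formulas.
# This is an illustration, not the published study's code.
import statistics

def alignment(bits_liberal, bits_conservative):
    """Score in [-1, 1]: -1 edits only liberal pages, +1 only conservative."""
    total = bits_liberal + bits_conservative
    return 0.0 if total == 0 else (bits_conservative - bits_liberal) / total

def team_polarization(team):
    """Dispersion of alignment scores across a team of (lib, con) editors."""
    return statistics.pstdev(alignment(lib, con) for lib, con in team)

homogeneous = [(900, 100), (800, 150), (950, 50)]   # everyone leans liberal
polarized   = [(900, 100), (100, 900), (850, 120)]  # mixed alignments

print(team_polarization(homogeneous))  # small spread: ideologically uniform
print(team_polarization(polarized))    # large spread: politically polarized
```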

The United States Senate is increasingly polarized, but largely balanced in its polarization. If the Senate were trying to write a Wikipedia article, would it produce a high-quality article? If the senators were doing so on Wikipedia, following norms of civility and balance inscribed within Wikipedia’s policies and guidelines, committed to the production of knowledge rather than self-promotion, then the answer is probably “yes”. That is a surprising finding. We think that the reason for this is that the policies of Wikipedia work to suppress the kind of rhetoric and sophistry common in everyday discourse, not to mention toxic language and name calling. Wikipedia’s policies are intolerant of discussion that could distort balanced consideration of the edit and topic under consideration, and, given that these policies shut down discourse that could bias proposed edits, teams with polarized viewpoints have to spend significantly more time discussing and debating the content that is up for consideration for inclusion in an article. These diverse viewpoints seem to bring out points and arguments between team members that sharpen and refine the quality of the content they can collectively agree to. With assumptions and norms of respect and civility, political polarization can be powerful and generative….(More)”

Crowdsourcing in medical research: concepts and applications


Paper by Joseph D. Tucker, Suzanne Day, Weiming Tang, and Barry Bayus: “Crowdsourcing shifts medical research from a closed environment to an open collaboration between the public and researchers. We define crowdsourcing as an approach to problem solving which involves an organization having a large group attempt to solve a problem or part of a problem, then sharing solutions. Crowdsourcing allows large groups of individuals to participate in medical research through innovation challenges, hackathons, and related activities. The purpose of this literature review is to examine the definition, concepts, and applications of crowdsourcing in medicine.

This multi-disciplinary review defines crowdsourcing for medicine, identifies conceptual antecedents (collective intelligence and open source models), and explores implications of the approach. Several critiques of crowdsourcing are also examined. Although several crowdsourcing definitions exist, there are two essential elements: (1) having a large group of individuals, including those with skills and those without skills, propose potential solutions; (2) sharing solutions through implementation or open access materials. The public can be a central force in contributing to formative, pre-clinical, and clinical research. A growing evidence base suggests that crowdsourcing in medicine can result in high-quality outcomes, broad community engagement, and more open science….(More)”

Five myths about whistleblowers


Dana Gold in the Washington Post: “When a whistleblower revealed the Trump administration’s decision to overturn 25 security clearance denials, it was the latest in a long and storied history of insiders exposing significant abuses of public trust. Whistles were blown on U.S. involvement in Vietnam, the Watergate coverup, Enron’s financial fraud, the National Security Agency’s mass surveillance of domestic electronic communications and, during the Trump administration, the corruption of former Environmental Protection Agency chief Scott Pruitt, Cambridge Analytica’s theft of Facebook users’ data to develop targeted political ads, and harm to children posed by the “zero tolerance” immigration policy. Despite the essential role whistleblowers play in illuminating the truth and protecting the public interest, several myths persist about them, some pernicious.

MYTH NO. 1 Whistleblowers are employees who report problems externally….

MYTH NO. 2 Whistleblowers are either disloyal or heroes….

MYTH NO. 3 ‘Leaker’ is another term for ‘whistleblower.’…

MYTH NO. 4 Remaining anonymous is the best strategy for whistleblowing….

MYTH NO. 5 Julian Assange is a whistleblower….(More)”.

Illuminating Big Data will leave governments in the dark


Robin Wigglesworth in the Financial Times: “Imagine a world where interminable waits for backward-looking, frequently-revised economic data seem as archaically quaint as floppy disks, beepers and a civil internet. This fantasy realm may be closer than you think.

The Bureau of Economic Analysis will soon publish its preliminary estimate for US economic growth in the first three months of the year, finally catching up on its regular schedule after a government shutdown paralysed the agency. But other data are still delayed, and the final official result for US gross domestic product won’t be available until July. Along the way there are likely to be many tweaks.

Collecting timely and accurate data is a Herculean task, especially for an economy as vast and varied as the US’s. But last week’s annual spring meetings of the World Bank and International Monetary Fund offered some clues on a brighter, more digital future for economic data.

The IMF hosted a series of seminars and discussions exploring how the hot new world of Big Data could be harnessed to produce more timely economic figures — and improve economic forecasts. Jiaxiong Yao, an IMF official in its African department, explained how it could use satellites to measure the intensity of night-time lights, and derive a real-time gauge of economic health.

“If a country gets brighter over time, it is growing. If it is getting darker then it probably needs an IMF programme,” he noted. Further sessions explored how the IMF could use machine learning — a popular field of artificial intelligence — to improve its influential but often faulty economic forecasts; and real-time shipping data to map global trade flows.
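
The underlying calculation can be sketched in a few lines: average a country’s pixel radiance in two annual composites and read the change as a growth signal. The arrays below stand in for real satellite imagery, and the whole setup is an assumption for illustration, not the IMF’s actual pipeline:

```python
# A sketch of the night-lights gauge: compare a country's average pixel
# radiance between two composites. Synthetic arrays stand in for real
# satellite imagery; the IMF's actual method is not described here.
import numpy as np

def mean_brightness(radiance, country_mask):
    """Average night-time radiance over one country's pixels."""
    return radiance[country_mask].mean()

rng = np.random.default_rng(0)
country_mask = np.zeros((100, 100), dtype=bool)
country_mask[20:60, 30:80] = True              # pretend this region is the country

lights_2018 = rng.gamma(2.0, 1.0, (100, 100))  # stand-in for last year's composite
lights_2019 = lights_2018 * 1.04               # 4% brighter: "getting brighter over time"

change = (mean_brightness(lights_2019, country_mask)
          / mean_brightness(lights_2018, country_mask)) - 1
print(f"year-on-year luminosity change: {change:+.1%}")  # ~ +4.0%
```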

Sophisticated hedge funds have been mining some of these new “alternative” data sets for some time, but statistical agencies, central banks and multinational organisations such as the IMF and the World Bank are also starting to embrace the potential.

The amount of digital data around the world is already unimaginably vast. As more of our social and economic activity migrates online, the quantity and quality are going to increase exponentially. The potential is mind-boggling. Setting aside the obvious and thorny privacy issues, it is likely to lead to a revolution in the world of economic statistics. …

Yet the biggest issues are not the weaknesses of these new data sets — all statistics have inherent flaws — but their nature and location.

Firstly, it depends on the lax regulatory and personal attitudes towards personal data continuing, and there are signs of a (healthy) backlash brewing.

Secondly, almost all of this alternative data is being generated and stored in the private sector, not by government bodies such as the Bureau of Economic Analysis, Eurostat or the UK’s Office for National Statistics.

Public bodies are generally too poorly funded to buy or clean all this data themselves, meaning hedge funds will benefit from better economic data than the broader public. We might, in fact, need legislation mandating that statistical agencies receive free access to any aggregated private sector data sets that might be useful to their work.

That would ensure that our economic officials and policymakers don’t fly blind in an increasingly illuminated world….(More)”.