Medical data has a silo problem. These models could help fix it.

Scott Khan at the WEF: “Every day, more and more data about our health is generated. If analyzed, that data could hold the key to unlocking cures for rare diseases, help us manage our health risk factors and provide evidence for public policy decisions. However, due to the highly sensitive nature of health data, much of it is out of reach of researchers, halting discovery and innovation. The problem is amplified in the international context, where governments naturally want to protect their citizens’ privacy and therefore restrict the movement of health data across international borders. To address this challenge, governments will need to pursue a special approach to policymaking that acknowledges new technology capabilities.

Understanding data silos

Data becomes siloed for a range of well-considered reasons, including restrictions on terms of use (e.g., commercial, non-commercial, disease-specific), regulations imposed by governments (e.g., Safe Harbor, privacy) and an inability to obtain informed consent from historically marginalized populations.

Siloed data, however, also creates a range of problems for researchers looking to make that data useful to the general population. Silos, for example, block researchers from accessing the most up-to-date information or the most diverse, comprehensive datasets. They can slow the development of new treatments and therefore curtail key findings that could lead to much-needed treatments or cures.

Even when these challenges are overcome, incidents of data misuse, where health data is used to explore non-health-related topics or without an individual’s consent, continue to erode public trust in the very research institutions that depend on such data to advance medical knowledge.

Solving this problem through technology

Technology designed to better protect and decentralize data is being developed to address many of these challenges. Techniques such as homomorphic encryption (a cryptosystem that allows computation on encrypted data without first decrypting it) and differential privacy (a technique for learning about a group without revealing details about any individual in it) both provide a means to protect and centralize data while distributing control of its use to the parties that steward the respective data sets.
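
As an illustration only (this sketch is not part of the quoted article), the idea behind differential privacy can be shown with the Laplace mechanism applied to a simple count. The record layout, the `private_count` helper and the `epsilon` value below are all assumptions made for the example; a real deployment would rely on a vetted library rather than hand-rolled noise.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    # The difference of two independent exponential draws with mean `scale`
    # follows a Laplace distribution centered at 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count of the records that match `predicate`."""
    true_count = sum(1 for r in records if predicate(r))
    # Adding or removing one person changes a count by at most 1 (sensitivity 1),
    # so the noise scale is sensitivity / epsilon = 1 / epsilon.
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical patient records, invented for this example.
patients = [
    {"age": 34, "diabetic": False},
    {"age": 58, "diabetic": True},
    {"age": 47, "diabetic": False},
    {"age": 71, "diabetic": True},
    {"age": 29, "diabetic": False},
]

rng = random.Random(0)  # seeded only so the example is repeatable
noisy = private_count(patients, lambda p: p["diabetic"], epsilon=0.5, rng=rng)
print(f"Noisy count of diabetic patients: {noisy:.2f}")
```

A smaller `epsilon` adds more noise and gives stronger protection to individuals, at the cost of a less accurate group-level answer.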

Federated data relies on a special type of distributed database management system that offers an alternative to centralizing encoded data: the data sets do not have to move across jurisdictions or between institutions. Such an approach can help connect data sources while accounting for privacy. To build further trust in the system, a federated model can be implemented to return encoded data, which prevents unauthorized distribution of the data and of the learnings that result from the research activity.
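
As an illustration only (not part of the quoted article), a federated analysis can be sketched as each institution computing a summary locally and sharing just that summary with a coordinator. The site names, the `SiteSummary` type and the measurements are hypothetical, and the sketch simplifies the "encoded data" described above to plain aggregates.

```python
from dataclasses import dataclass


@dataclass
class SiteSummary:
    """The only information a site releases: a row count and a sum."""
    n: int
    total: float


def local_summary(values) -> SiteSummary:
    # Runs inside each institution; raw values never leave the site.
    return SiteSummary(n=len(values), total=float(sum(values)))


def federated_mean(summaries) -> float:
    # Runs at the coordinator, which sees summaries but no patient rows.
    n = sum(s.n for s in summaries)
    if n == 0:
        raise ValueError("no records reported by any site")
    return sum(s.total for s in summaries) / n


# Hypothetical lab measurements held at three separate institutions.
site_a = [5.1, 6.3, 5.8]
site_b = [7.0, 6.4]
site_c = [5.5, 6.1, 6.0, 5.9]

summaries = [local_summary(v) for v in (site_a, site_b, site_c)]
pooled_mean = federated_mean(summaries)
print(f"Mean across all sites: {pooled_mean:.3f}")
```

The coordinator obtains the same mean it would get from pooling all rows, yet it only ever handles one count and one sum per site.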

To be sure, within every discussion of the analysis of aggregated data lie the challenges of data fusion: between data sets, between different studies, between data silos and between institutions. Although several data standards could be used, most data exists within bespoke data models built for a single purpose rather than to facilitate data sharing and data fusion. Furthermore, even when data has been captured in a standardized data model (e.g., the Global Alliance for Genomics and Health offers some models for standardizing sensitive health data), many data sets are still narrowly defined. They often lack the shared identifiers needed to combine data from different sources into a coherent aggregate data source useful for research. Within a model of data centralization, data fusion can be addressed through curation of each data set, whereas within a federated model, data fusion is much more vexing….(More)”.