Trove of unique health data sets could help AI predict medical conditions earlier


Madhumita Murgia at the Financial Times: “…Ziad Obermeyer, a physician and machine learning scientist at the University of California, Berkeley, launched Nightingale Open Science last month — a treasure trove of unique medical data sets, each curated around an unsolved medical mystery that artificial intelligence could help to solve.

The data sets, released after the project received $2m of funding from former Google chief executive Eric Schmidt, could help to train computer algorithms to predict medical conditions earlier, triage better and save lives.

The data include 40 terabytes of medical imagery, such as X-rays, electrocardiogram waveforms and pathology specimens, from patients with a range of conditions, including high-risk breast cancer, sudden cardiac arrest, fractures and Covid-19. Each image is labelled with the patient’s medical outcomes, such as the stage of breast cancer and whether it resulted in death, or whether a Covid patient needed a ventilator.

Obermeyer has made the data sets free to use and mainly worked with hospitals in the US and Taiwan to build them over two years. He plans to expand this to Kenya and Lebanon in the coming months to reflect as much medical diversity as possible.

“Nothing exists like it,” said Obermeyer, who announced the new project in December alongside colleagues at NeurIPS, the global academic conference for artificial intelligence. “What sets this apart from anything available online is the data sets are labelled with the ‘ground truth’, which means with what really happened to a patient and not just a doctor’s opinion.”…

The Nightingale data sets were among dozens proposed this year at NeurIPS.

Other projects included a speech data set of Mandarin and eight subdialects recorded by 27,000 speakers in 34 cities in China; the largest audio data set of Covid respiratory sounds, such as breathing, coughing and voice recordings, from more than 36,000 participants to help screen for the disease; and a data set of satellite images covering the entire country of South Africa from 2006 to 2017, divided and labelled by neighbourhood, to study the social effects of spatial apartheid.

Elaine Nsoesie, a computational epidemiologist at the Boston University School of Public Health, said new types of data could also help with studying the spread of diseases in diverse locations, as people from different cultures react differently to illnesses.

She said her grandmother in Cameroon, for example, might think differently than Americans do about health. “If someone had an influenza-like illness in Cameroon, they may be looking for traditional, herbal treatments or home remedies, compared to drugs or different home remedies in the US.”

Computer scientists Serena Yeung and Joaquin Vanschoren, who proposed that research to build new data sets should be exchanged at NeurIPS, pointed out that the vast majority of the AI community still cannot find good data sets to evaluate their algorithms. This meant that AI researchers were still turning to data that were potentially “plagued with bias”, they said. “There are no good models without good data.”…(More)”.