Commentary
The AI-READI project aims to establish fair, equitable, and ethical access to big data, enhancing artificial intelligence’s ability to diagnose systemic diseases and drive progress in ophthalmology research.
(Image credit: AdobeStock/Deemerwha studio)
Artificial intelligence (AI) holds tremendous promise in medicine. Ophthalmologists are at the forefront of this revolution given the emergence of oculomics: the concept of leveraging retinal images to diagnose a variety of systemic diseases, made possible because the retina provides noninvasive visual access to microvascular structure and function.
Although oculomics is still in its infancy, many exciting applications are already being explored. To maximize the potential of this burgeoning science, we must pursue several key strategies. One is to promote easy-to-use, cost-effective multimodal imaging devices that can be used in large population studies and scaled across multiple clinical sites. Low-cost, multimodal devices (such as the Topcon Maestro2) will be important both for gathering data at scale and for deployment to enable population-level healthcare delivery.
Importantly, we must work to develop ethically sourced flagship data sets. As a data scientist and retinal specialist, I have been training deep learning models for retinal imaging for many years. One challenge we have encountered in this effort is the limitations of existing data sets1,2 in terms of what they can and cannot do, which models we can train on them, and the biases built into them.
A few years ago, the National Institutes of Health (NIH) Common Fund, a funding entity of the NIH that supports cutting-edge scientific research, issued a call for grant applications for its Bridge to Artificial Intelligence (Bridge2AI) program. Bridge2AI aims to advance biomedical research by promoting and preparing for widespread adoption of AI.
One of the program’s key initiatives is the Artificial Intelligence Ready and Equitable Atlas for Diabetes Insights (AI-READI) project, the goal of which is to create a flagship, ethically sourced data set to provide critical insights into type 2 diabetes mellitus (T2DM) for future AI and machine learning research.2
My colleague Cecilia S. Lee, MD, MS (who also happens to be my wife), and I applied, outlining what we thought would be the ideal data set for training deep learning models in this space. Our proposal became one of four data generation projects that the NIH decided to support.
For the past two years, we have been collecting data from a population-representative T2DM cohort balanced for disease severity, race/ethnicity, and sex. Crucially, the data set aims to include equal numbers of White, Black, Asian American, and Hispanic participants so that models trained on it will, we hope, be less biased. The data include a variety of multimodal images as well as wearable and environmental sensor data collected from remote settings.
Recently, we released an initial data set including 1,067 people. Our goal over the next two and a half years is to collect these data from a total of 4,000 participants and make them publicly available for research. To do so, we first had to navigate a number of ethical considerations. For example, how do we make the data available to as many people as possible while also putting safeguards in place so that sharing them does not harm the participants who were generous enough to donate their data to science?
To address this and other concerns, we developed a framework and a custom portal that adopt several brand-new approaches to sharing the data openly while protecting participants.
A critical first step was to clearly communicate the terms of the initiative to those choosing to participate. We created an informed consent agreement spelling out in great detail that some data would be shared publicly, while more sensitive data would be shared only under controlled access requiring institutional agreements. The consent also explained that the data would be used by both non-profit and commercial entities, and that no profits would flow back to participants.
Recognizing how rapidly technology evolves, we also acknowledged that, based on our understanding of current technology, we believe it is safe to share these data, but that this could change in the future.
For those who want to access the data, we developed a series of attestations that the user must type out character by character. These include agreements that the user will not attempt to reidentify anyone in the data set, use this data set to harm anyone, or copy the data set and make it available to other people without going through the data portal. We chose this approach instead of using long and complex legal agreements to help ensure that users actually read and understand the terms.
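The portal's attestation flow is not described in technical detail here, but the core idea is simple to illustrate. The following is a minimal, hypothetical sketch in Python, not the actual AI-READI portal code; the attestation wording and function names are placeholders. It shows the essential check: each statement must be retyped verbatim before an access request can move forward.

```python
# Hypothetical sketch of a typed-attestation step (not the AI-READI portal code).
# Each statement must be retyped exactly before the access request proceeds.

ATTESTATIONS = [
    "I will not attempt to reidentify any participant in this data set.",
    "I will not use this data set to harm anyone.",
    "I will not redistribute this data set outside the data portal.",
]

def collect_attestations() -> bool:
    """Return True only if every attestation is retyped verbatim."""
    for statement in ATTESTATIONS:
        print(f"\nPlease type the following statement exactly:\n  {statement}")
        typed = input("> ").strip()
        if typed != statement:
            print("The typed statement does not match; the request is halted.")
            return False
    return True

if __name__ == "__main__":
    if collect_attestations():
        print("Attestations recorded; the access request can proceed to review.")
```

The value of this design is less technical than behavioral: typing each sentence forces the requester to slow down and read the obligations they are accepting, which a click-through legal agreement does not.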
For participants and researchers who want to understand how the data are being used, we built a registry of the uses that requesters disclose they intend to make of the data set. This registry of proposed research plans is open for public view.
Licenses have existed for software for some time, including some specific to data. However, some of these, such as the Creative Commons licenses, are extremely permissive, so we did not feel comfortable applying them. Working with legal scholars, we created a brand-new license that puts legal protections in place on what people can and cannot do with this data set.3
For every person who requests data, an automated process embeds a unique digital signature, tied to that person's identity, in the files. If the data set ends up online, indicating that someone violated the terms, the watermark allows us to pinpoint who shared it. This measure is new and experimental, but we are hopeful that it will provide some accountability.
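The watermarking mechanism itself is not spelled out here, but the general idea of per-recipient fingerprinting can be sketched. The Python example below is a hypothetical illustration under assumed names (SERVER_SECRET, package_release, identify_leak), not the AI-READI implementation: it derives a keyed tag from the requester's identity and records it alongside the released files so that a leaked copy can be matched back to the request that produced it. A production system would embed the mark inside the files themselves, for example in image or DICOM metadata or as an imperceptible image watermark.

```python
# Hypothetical sketch of per-recipient fingerprinting (not the AI-READI implementation).
import hashlib
import hmac
import json
import shutil
from pathlib import Path

SERVER_SECRET = b"replace-with-a-real-secret"  # held only by the data portal (assumption)

def requester_fingerprint(requester_id: str) -> str:
    """Keyed, non-reversible tag derived from one requester's identity."""
    return hmac.new(SERVER_SECRET, requester_id.encode(), hashlib.sha256).hexdigest()

def package_release(source_dir: Path, out_dir: Path, requester_id: str) -> None:
    """Copy the data set and write a manifest carrying the requester's fingerprint."""
    tag = requester_fingerprint(requester_id)
    out_dir.mkdir(parents=True, exist_ok=True)
    manifest = {"fingerprint": tag, "files": {}}
    for src in source_dir.rglob("*"):
        if src.is_file():
            dst = out_dir / src.relative_to(source_dir)
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)
            # Record a content hash per file so tampering with the copy is detectable.
            manifest["files"][str(dst.relative_to(out_dir))] = hashlib.sha256(
                dst.read_bytes()
            ).hexdigest()
    (out_dir / "RELEASE_MANIFEST.json").write_text(json.dumps(manifest, indent=2))

def identify_leak(leaked_manifest: Path, known_requesters: list[str]) -> str | None:
    """Match a leaked release's fingerprint against the list of known requesters."""
    leaked_tag = json.loads(leaked_manifest.read_text())["fingerprint"]
    for requester_id in known_requesters:
        if hmac.compare_digest(requester_fingerprint(requester_id), leaked_tag):
            return requester_id
    return None
```

Because the tag is a keyed hash rather than the raw identity, the released files do not expose who downloaded them; only the portal, which holds the secret, can map a leaked fingerprint back to a specific requester.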
Our project and the strategies that we’ve put in place to ensure fair, equitable, and ethical access to the data may very well be the first of their kind, and we are hopeful that they will be part of the foundation that accelerates the field’s advancement.
Moving forward, the most important success factors for advancing the field of oculomics will be collaboration and coordination. The AI-READI project includes stakeholders across government, academia, research medicine, and industry. These types of initiatives, in which organizations work seamlessly together, are the best way to ensure that AI ultimately delivers what we believe it can: measurable, meaningful improvements in patient care.