Done right, artificial intelligence could achieve all this.
Yet, AI is no silver bullet. It can fall prey to the cognitive fallibilities and blind spots of the humans who design it. AI models can be as imperfect as the data and clinical practices that the machine-learning algorithms are trained on, propagating the very same biases AI was designed to eliminate in the first place. Beyond conceptual and design pitfalls, realizing the potential of AI also requires overcoming systemic hurdles that stand in the way of integrating AI-based technologies into clinical practice.
How does the field of medicine move forward to harness the promise of AI? How does it eliminate the perils posed by its suboptimal design or inappropriate use? How can AI be integrated seamlessly into frontline clinical care? These are some of the overarching questions that will be tackled at the inaugural Symposium on Artificial Intelligence for Learning Health Systems (SAIL) to be held Oct. 18-20 in Hamilton, Bermuda, an event envisioned to become an annual conference.
Conceived by Zak Kohane, chair of the department of biomedical informatics in the Blavatnik Institute at Harvard Medical School, the symposium will bring together the brightest minds from academia and industry in the fields of computer science, AI, clinical medicine, and health care. The charge is to establish both the philosophical and practical frameworks for the optimal translation of AI to the clinic.
“How can we bring the best aspects of AI to augment the best and most human components of the patient-clinician relationship and to safely accelerate 21st century medicine? These are the central questions that I hope we will be able to answer,” Kohane said.
One of the more pedestrian—and more immediate—goals of the symposium, however, will be to bridge the chasms between various players in AI and medicine by simply having them talk with each other.
“We have various communities involved in AI all expecting AI to somehow improve medicine, but they have no way of communicating and engaging with one another. So the underlying goal of this symposium is to create communication—not around theoretical issues of methodology but around the pragmatics of implementation,” Kohane said.
But the idea is not merely to have theoreticians and methodologists engage with practitioners. It is to create a common space for conversation—and eventual collaboration—among experts who have traditionally worked in parallel rather than at intersections, including clinical informaticians, machine learning specialists, clinicians, administrators, medical journal editors, and those charged with implementing AI in health care.
The event was originally scheduled to take place in 2020, but the COVID-19 pandemic derailed those plans. To keep the momentum going and to lay the groundwork for the main event, Kohane held a virtual warm-up session in the fall of 2020, during which experts mapped some of the most acute challenges and greatest opportunities in the field of medical AI.
The AI-optimized physician
The immense processing and analytic capacity of machine learning, a form of artificial intelligence, can powerfully augment human decision-making and complement uniquely human cognitive capacities such as detecting finer nuances and applying common sense. The combination of human and machine intelligence could optimize the practice of clinical medicine and streamline health care operations. Machine learning-based AI tools could be especially valuable because they rely on adaptive learning. This means that with each exposure to new data, the algorithm gets better at detecting telltale patterns. Such tools have the capacity to transcend the knowledge-absorption and information-retention limits of the human brain because they can be “trained” to consider millions of medical records and billions of data points. They could boost individual physicians’ decision-making by offering doctors accumulated knowledge from billions of medical decisions, patient cases, and outcomes to inform the diagnosis and treatment of an individual patient. AI-based tools could alert clinicians to a suboptimal medication choice, or they could triage patient cases with rare, confounding symptoms to rare-disease experts for remote consults.
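To make the adaptive-learning idea concrete, the sketch below shows, in rough terms, how a model can be refined each time a new batch of records arrives. It is a minimal illustration using scikit-learn's incremental partial_fit interface; the data, features, and labels are entirely hypothetical.

```python
# Minimal sketch of adaptive (incremental) learning, assuming scikit-learn is
# installed. Patient features and labels here are purely hypothetical.
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # e.g., 0 = condition absent, 1 = condition present

def update_with_new_records(model, X_batch, y_batch):
    """Refine the model each time a new batch of records becomes available."""
    model.partial_fit(X_batch, y_batch, classes=classes)
    return model

# Simulated stream of new patient data (random numbers stand in for features)
rng = np.random.default_rng(0)
for _ in range(5):
    X_batch = rng.normal(size=(100, 10))       # 100 records, 10 features each
    y_batch = (X_batch[:, 0] > 0).astype(int)  # toy labeling rule
    model = update_with_new_records(model, X_batch, y_batch)
```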
AI can help optimize both diagnostic and prognostic clinical decisions, help individualize treatment, and identify patients at high risk of progressing to serious disease or of developing a condition, allowing physicians to intervene preemptively. For example, such companion clinician tools can be used not only to detect the presence or absence of disease but also to predict disease risk. They could help predict who is likely to respond to certain treatments, such as chemotherapy or immunotherapy, and who may not benefit from them, said pre-symposium panelist Anant Madabhushi, professor of biomedical engineering at Case Western Reserve University and director of the Center for Computational Imaging and Personalized Diagnostics there.
AI can also help streamline the workflow of radiologists, for example, by serving as the primary triage of mammography scans.
“We shouldn’t be thinking about replacing people, we should be thinking about augmenting people, and the way to augment people is to allow them to focus in the place where people have most value,” said Greg Hager, professor of computer science at Johns Hopkins University.
“We are at the beginning of starting to design human-and-machine systems operating together. The right combination of human intelligence and machine intelligence can produce a result that’s better than either,” Hager added.
A team led by Hager developed an algorithm that identifies normal mammogram scans with greater precision than human radiologists. These prefiltered, prescreened images can then be sent to a human clinician for eventual review and sign-off. This AI assist provides more than prescreening. It gives radiologists the time and cognitive space to focus on the images that exhibit abnormalities and are, therefore, much more consequential, Hager said. “What we looked at was how can we build tools that improve workflow and relieve radiologists from doing drudgery work,” he said. The overarching idea, Hager added, was to use an AI tool to handle the rote initial reviews and leave the human physicians at the tip of the pyramid, where there’s high value in human input.
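A rough sketch of the triage pattern Hager describes appears below: a scoring model flags scans it considers clearly normal for a lighter prescreened queue and routes everything else to a radiologist for full review. The threshold, scoring function, and queue names are illustrative assumptions, not details of Hager's actual system.

```python
# Illustrative triage sketch (not Hager's actual system): route scans the model
# is highly confident are normal to a lighter review queue, and send the rest
# to a radiologist for full reading. Threshold and scoring function are assumptions.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class TriageQueues:
    radiologist_review: List[str] = field(default_factory=list)
    prescreened_normal: List[str] = field(default_factory=list)

def triage_scans(scan_ids: List[str],
                 abnormality_score: Callable[[str], float],
                 normal_threshold: float = 0.05) -> TriageQueues:
    """Send low-scoring (likely normal) scans to the prescreened queue."""
    queues = TriageQueues()
    for scan_id in scan_ids:
        if abnormality_score(scan_id) < normal_threshold:
            queues.prescreened_normal.append(scan_id)
        else:
            queues.radiologist_review.append(scan_id)
    return queues
```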
Optimizing the patient-doctor relationship
Medicine is, at its core, a deeply humanistic profession and caring for a patient is one of the most human and humane activities. Yet, health care can be dehumanizing.
“A lot of technology introduced in health care has been rightly criticized for getting in the way of the patient-doctor relationship,” said Nicholas Tatonetti, associate professor of biomedical informatics at Columbia University. “There is an opportunity for technology not to get in the way any longer but to start to disappear into the background and put that interaction at the center. How do we make that happen?” The fundamental concept behind AI tools should be that they supplement and optimize human clinicians, not displace them in making highly complex decisions.
“The doctor and the patient and the machine learning model can all work together as long as we design decision aids rather than decision-makers,” said pre-symposium panelist Cynthia Rudin, professor of computer science, electrical and computer engineering and statistical science at Duke University.
From disease treatment to wellness preservation
When we think about health care, we usually think about acute problems. At its best, AI-optimized health care should move us away from this imperfect default, Hager said. “What we really should be thinking about is how to keep people well and what does it mean to have a longer timeframe perspective on the trajectory of an individual and what’s going to keep them well,” Hager said. “If you are a doctor and you came to a patient and you had access to this more holistic longitudinal perspective, you could start to contextualize some of the things you see in front of you. Instead of seeing a patient who you’ve diagnosed with a disease, you could start to see the broader context of their life situation and see this disease as a step along a path. We should be thinking about how we correct this path as opposed to how we treat this one particular acute disease at this point in time. Instead of thinking of an episode, think of a series of episodes and how they string together into the story of the health of this individual.” It is the difference between seeing health care as the treatment of disease instead of as the preservation of wellness and health, Hager said. AI could contribute powerfully in this area by offering rich data and long-range perspective on individual patients.
The perils of black-box AI
One essential element of any medical AI tool is transparency. Developers of such models must ensure that these tools are understandable and interpretable. In other words, human clinicians should be able to understand the “reasoning” of the AI models they are using, and how and why an AI tool renders one verdict or outcome instead of another.
“If you don’t understand the reasoning process of the AI model, it is possible that it could be reasoning about things the wrong way,” Rudin said.
Opaque reasoning in machine learning is also known as black-box AI. A black-box model is a function that’s either too complex for a human to understand or it’s a model that is proprietary—“meaning it’s someone’s secret sauce, and you don’t get access to it,” Rudin said.
The importance of being multilingual
Building reliable AI models starts with having the right conversations between designers and end-users. To illustrate this, Madabhushi told the story of an AI model his team designed to help differentiate between malignant and benign lung nodules. Although a preexisting model performed well, Madabhushi said, one of the things this model “assumed” was that all the information needed to distinguish between malignant and benign nodules lies inside the nodule. Madabhushi and colleagues decided to design another tool, but importantly, they first spoke with their intended frontline users—the radiologists and cardiothoracic surgeons examining and treating lung lesions. In that conversation, the team explored the possibility that some of the telltale features that may portend malignancy may lie outside the nodule, rather than inside it. They looked at blood vessels that fed the nodule from outside and took into account blood vessel density and tortuosity around the tumor. Nodules that were enveloped in a dense, twisted network of blood vessels tended to be malignant, while benign nodules tended to have a smoother vasculature. The team then came up with a model that, based on that characteristic alone, could reliably discern cancerous from benign lung nodules.
“This is domain knowledge, where having conversations with clinicians allows you to come up with features that are interpretable and have a connection to the pathobiology of the disease, and you’ve involved the clinicians in the process of the development of the tool,” Madabhushi said.
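A schematic version of this kind of approach, in which interpretable, clinician-informed features such as peri-nodular vessel density and tortuosity feed a simple classifier, might look like the sketch below. The feature values are synthetic and the model choice is an assumption for illustration; this is not the published tool.

```python
# Schematic sketch of a classifier built on interpretable, clinician-informed
# features measured *outside* the nodule (vessel density and tortuosity).
# Feature values here are synthetic; this is not the published model.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 200
vessel_density = rng.uniform(0.0, 1.0, n)     # hypothetical peri-nodular vessel density
vessel_tortuosity = rng.uniform(0.0, 1.0, n)  # hypothetical mean vessel tortuosity

# Toy ground truth: denser, more tortuous vasculature -> more likely malignant
malignant = ((0.6 * vessel_density + 0.4 * vessel_tortuosity +
              rng.normal(0, 0.1, n)) > 0.5).astype(int)

X = np.column_stack([vessel_density, vessel_tortuosity])
clf = LogisticRegression().fit(X, malignant)

# Coefficients stay interpretable: clinicians can see which feature drives the call
print(dict(zip(["vessel_density", "vessel_tortuosity"], clf.coef_[0])))
```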
Toward equitable AI
AI is not infallible and will not solve all problems in clinical medicine, many of which involve broader healthcare systems challenges not solvable by computers.
One such example is bias. Bias in medicine is real and it seeps into AI. Data and practices that are not ethnically, racially, and otherwise diverse can fuel biased algorithms and AI tools. Such tools can, in turn, propagate existing biases and inequities in clinical care.
AI tools will perform only as well as the models and data they are trained on. One of the most critical barriers on that front is access to raw data that are ethnically, racially, and otherwise diverse.
One telling example involves current risk-prediction tools used to analyze tissue images of prostate cancer to forecast disease progression and severity in Black and white men. Black men are known to have more severe prostate cancer, yet existing prostate cancer risk models have been built largely on data derived from non-Black men, Madabhushi said. His team conducted a study and found characteristic differences in tumor images from Black men and white men. They used these features to build a population-specific disease-severity prediction model for prostate cancer.
“Using this model in Black men resulted in much higher accuracy in predicting risk of recurrence,” he said. “We have a responsibility now to see how we can use AI in a way that helps address the health needs of underrepresented populations. Here’s an opportunity where AI could help rectify some of those disparities.”
Combating bias, however, is not merely a function of ensuring that the model is fed representative and inclusive data. The more fundamental issue is who is building the algorithms. Collecting and feeding reliable data begins with the questions that are asked by those who build the algorithms in the first place, said pre-symposium speaker Tiffani Bright, biomedical informatics lead for the Center for AI, Research, and Evaluation at IBM Watson Health.
“If you’re building these algorithms and you don’t have a representative workforce, there’s bias, but also stereotypes. What’s a stereotype? I might not even be thinking about that as I am building these tools,” Bright said. “That’s one aspect. Then you get into data.”
The bias vulnerability extends both to the design and the subsequent validation of AI tools, said pre-symposium panelist Marzyeh Ghassemi, assistant professor of computer science and medicine at the University of Toronto. “Look at the sources of data you have access to in order to validate an algorithm,” Ghassemi said. “All of the data that we have from clinical trials have been shown time and time again to be very biased. Often, these trials do not have a diverse group of people.” “You don’t have to, when you get a medical device FDA approved, demonstrate that it works on Black skin. There is a reason that those SpO2 monitors were not working very well initially when COVID hit on some populations,” Ghassemi added, referring to blood oxygen-level monitors used to detect signs of advancing lung disease in patients with COVID-19. “Yes, validation and deployment are often different things, but we don’t always think about how baked in the bias is at every step—thinking about research questions, what research teams are going to look at them, what gets funded, what data you get to collect, and what kinds of outcomes you’re going to study—then we build the algorithm based on these really biased structures that already exist,” she said.
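One practical way to surface the validation gaps Ghassemi describes is to report a model's performance stratified by subgroup rather than as a single aggregate number. The sketch below shows a minimal version of that idea; the data, group labels, and choice of metric are hypothetical.

```python
# Minimal sketch of subgroup-stratified validation: report a metric per group
# instead of one aggregate score, so performance gaps become visible.
# Data, group labels, and metric choice are illustrative assumptions.
from collections import defaultdict
from sklearn.metrics import roc_auc_score

def stratified_auc(y_true, y_score, groups):
    """Compute AUROC separately for each subgroup in the validation set."""
    by_group = defaultdict(lambda: ([], []))
    for yt, ys, g in zip(y_true, y_score, groups):
        by_group[g][0].append(yt)
        by_group[g][1].append(ys)
    return {g: roc_auc_score(yt, ys) for g, (yt, ys) in by_group.items()}

# Example usage with toy values
y_true = [0, 1, 1, 0, 1, 0, 1, 0]
y_score = [0.2, 0.9, 0.7, 0.3, 0.4, 0.6, 0.8, 0.1]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
print(stratified_auc(y_true, y_score, groups))
```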
“If we do not address this at the very fundamental level of AI design and implementation, then whatever biases are present in the system, we’d be pouring concrete over them,” Kohane said. “If we do this naively, if we just use the existing data sets to drive the standard of care for the future, then we will essentially reproduce that.” Kohane warns that having a theoretical sensitivity about this issue is not enough. “How do you address this in practice? This is still very much an open question, and we need to have that conversation, which has not really happened yet anywhere substantially.”
The matter is also more complicated than the quality of data fed into the model. “Let’s assume the data are perfect quality but represent what we’re doing to our patients today,” Kohane said. “This is not a data issue, it’s a separate issue stemming from decades of systemic and structural bias that may be reflected in current practices of medicine. Then whatever is right with the practice of medicine gets propagated and so is whatever’s wrong with the practice of medicine.” “If we, as clinicians, are not being aggressive enough in our efforts to prevent preterm labor in pregnant African-American women, then that’s exactly what these machine learning algorithms are going to pick up,” he added.
More fundamentally, AI tools should be designed to represent and reflect the goals and needs of those they are intended to serve, including clinicians and patients.
“We’ve been thinking about how we create an antibias taskforce and the people who should be at the table,” said pre-symposium speaker Maia Hightower, chief medical information officer at the University of Utah Health. “Of course, the data scientists, but also our community members, because what’s important for a community may be different from what we perceive is important within the healthcare system.”

“One of the clinicians that I respect the most always tells me, ‘Don’t ask me, ask the nurses, they do all the hard work,’” Ghassemi said. “Often, in the machine-learning community when we do collaborations, we are overwhelmingly speaking to academic clinicians. Our focus is often really myopic, and it needs to be broadened.”
Hurdles to implementation
Incorporating the use of AI tools into frontline clinical care has been bumpy at best. The key challenges include improving interoperability and adaptability but, most ambitiously and most fundamentally, aligning the goals and incentives of various players in the field of health care and medicine.
The coronavirus pandemic underscored—with disillusioning clarity—the importance of interoperability.
“Despite the incredible digitization over the past decade or 15 years in all manner of healthcare operations, we found we still have a frustrating inability to connect the digital dots,” said pre-symposium speaker Peter Lee, corporate vice president of research and incubations at Microsoft.
Lee said that such interoperability problems were magnified by business practices and standards, technological challenges, and the regulatory environment—all conspiring to make this more complicated.
“The world is not a set of dropdown boxes,” said pre-symposium speaker Ken Ehlert, chief scientific officer at United Healthcare Group. “The world is not rigidly defined in a way that allows us to break these problems down into an absolutely standard method.”

One of the most important steps toward solving this issue is to have a clear understanding of the purpose of the data and of the AI tool that captures that data.

“We need to not only think about standards and interoperability but also what’s the purpose of the data that you’re going to collect, because if you anticipate that purpose, you can do a better job of collecting data that fulfills that purpose,” said pre-symposium speaker Eric Rubin, editor-in-chief at The New England Journal of Medicine and NEJM Group. Rubin pointed out that most electronic health record systems in use now were built with an eye toward facilitating the generation and submission of insurance claims.

To solve the interoperability issues, the approach has to be far more fundamental, because the lack of interoperability may be a symptom of a deeper problem—namely, poorly defined goals among various stakeholders and systems.

“The number one goal in healthcare is to give people the best health outcome and the highest quality of life for the maximum number of years possible,” Ehlert said. “If that’s the goal, how do we align people across the system so that data is collected to best make this happen?”

“One example is that we have to align on our definition of what better health means. Is it two extra weeks at the end of life or is it a better 20 years in the years leading up to it?” he added. “Questions like this need to be answered in order to help solve the interoperability issues.”

“Ultimately, what we are all jointly striving for is that we recognize there’s a risk and loss at every point in the health care journey,” Lee said. “Fundamentally, we are all seeing the possibility that we can construct profitable businesses by eliminating that loss and deliver value that way. This is one of the unique places where our mandate to make a profit aligns with the societal need to have more efficient, lower cost, and more accessible health care for all people.”
AI tools in the trenches
Yet, this is not to say that AI is not already delivering on its promise in some ways, the panelists said.
Lee identified several imminent areas of AI-driven augmentation: clinical decision support, intelligent assistance during the patient-doctor interaction to eliminate administrative burden, and tools that illuminate the fundamental biologic processes that fuel disease development, which can eventually lead to better diagnostic tools and more effective, precision-targeted therapies.
“One emerging tool is deploying ambient clinical intelligence that is able to listen to patient-doctor conversations and automatically set up clinical encounter notes, relieving doctors of note-taking,” Lee said. “We now have language models with hundreds of billions of parameters in the neural nets. This allows AI models to read thousands of medical research papers and abstracts posted every day, synthesize them into knowledge graphs, and provide them for better decision support,” Lee said. These programs, he said, are already being used to support the decisions of hospital tumor boards.
Suchi Saria, associate professor and director of AI and Health at Johns Hopkins University, underscored the sometimes-wasted potential of AI tools to be used for anticipatory intervention.
“We can shift from a reactionary paradigm to a more participatory paradigm, for example, in areas like sepsis, the 10th leading cause of death,” Saria said. “Interventions exist, but they are much more effective if you apply them early. AI is pretty much the only way to identify condition syndromes for sepsis in at-risk patients early and precisely, because of the heterogeneity of presentation across patients.” Another example is the use of AI in stroke, she said. The ability to use advanced AI techniques to analyze images and identify patients with large-vessel occlusions means the patient can be referred to a stroke center in a matter of minutes rather than hours, which is the traditional workflow. In readmissions, the ability to assess which patients are at high risk for ending up back in the hospital and why they are at risk for readmission could help physicians tackle this risk proactively and dramatically alter outcomes, Saria said.
But Saria added that some of the potential of AI is being allowed to languish, underscoring the urgency of implementing infrastructure to support learning from real-time data.
“Our reliance on randomized clinical trials alone for evidence generation is dramatically slowing down the rate at which we can learn from our data,” Saria said. “In COVID, the fact is we didn’t have answers to questions like ‘Is proning effective?’ or ‘Should we try to do invasive mechanical ventilation early or do alternative therapies?’ These questions are possible to answer in a pretty granular way with the kind of data collected today.” Randomized clinical trials to interrogate and generate data are an imperfect gold standard, Rubin agreed. “You can only ask one question, and it can take 10 years and hundreds of millions of dollars to answer that one question,” he said.
“I think real-world evidence is fundamentally different. It’s a work in progress to figure out how to make that rigorous,” Rubin added. “There is a marriage of the two. I think there’s an opportunity to use the electronic medical record together with recruiting criteria to figure out what happened without going through the traditional clinical trial track, which is really onerous, but that’s not the final answer. The final answer is how to bring rigor and how to understand the rigor within trials that are not traditionally designed.”
Both Lee and Saria also cautioned against the dangers of magical thinking and unrestrained faith in the abilities of AI.
“Sometimes, the medical community is overly dazzled by the emerging power of AI, which sometimes leads to an over-optimism from both directions that needs to be checked, while not giving up on actual true possibilities,” Lee said. Saria said educating frontline clinicians on the nuances of AI is critical in shielding practitioners from being overly dazzled. “This is where journals have a huge role to play in terms of educating readers that there are ways in which we can learn trusted unbiased results using the messy data that exist in electronic health records,” she said.
Evaluating performance of point-of-care AI
One of the most critical questions around AI is how to assess its performance and how to engineer algorithms that ensure the claims made by the developer are borne out in practice. And one of the most important long-range challenges of regulation and approval of AI tools will be providing contextualized comparative assessment of AI tools rather than merely providing isolated reports on each individual tool’s safety and effectiveness, Kohane said. “The FDA is doing great and creative things, and what it does is it says ‘this is safe to a given standard.’ What it doesn’t do is say ‘this is better than the other product,’” Kohane said.
Kohane gave a hypothetical example, comparing two liquid biopsy tools that use blood to detect cancer recurrence. Based on peer-reviewed publications, these two algorithms are 50 percent effective in detecting the presence of mutations. Yet they are both FDA-approved. The FDA, however, does not offer comparisons of these tools’ performance against other available options to detect cancer recurrence. “So, the FDA is not doing the Consumer Reports’ function,” Kohane said. “It’s a sad comment that we’re going to have less knowledge about which AI algorithm to adopt than we have in buying a refrigerator or a mattress.” One intermediate solution could be to include confidence ranges for all machine-learning models to inform clinicians how accurate a model is likely to be, Kohane and colleagues wrote in a 2019 perspective piece published in The New England Journal of Medicine. Even more important, the authors said, is that all models should be subject to periodic reevaluations and tests, not unlike the periodic board exams physicians must take to maintain certification in a given field of medicine.
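As a rough illustration of the confidence-range idea, a model's reported performance can be accompanied by a bootstrap interval and recomputed periodically on fresh labeled cases. The sketch below shows one common way to do this with hypothetical data; it is not the specific method prescribed in the perspective piece.

```python
# Rough sketch: attach a bootstrap confidence interval to a model's accuracy
# and recompute it periodically on fresh data. Data are hypothetical, and this
# is one common approach, not the method proposed in the NEJM piece.
import numpy as np

def bootstrap_accuracy_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    """Return point accuracy plus a (1 - alpha) bootstrap confidence interval."""
    rng = np.random.default_rng(seed)
    correct = (np.asarray(y_true) == np.asarray(y_pred)).astype(float)
    boot = [rng.choice(correct, size=correct.size, replace=True).mean()
            for _ in range(n_boot)]
    lo, hi = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
    return correct.mean(), (lo, hi)

# Example: re-run this on each new batch of labeled cases to "re-examine" the model
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]
acc, (lo, hi) = bootstrap_accuracy_ci(y_true, y_pred)
print(f"accuracy = {acc:.2f}, 95% CI approx. ({lo:.2f}, {hi:.2f})")
```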
Ultimately, assessing AI performance should go beyond a mere evaluation of its ability to perform better than a human operator or better than another tool. The ultimate yardstick should be whether a given AI model improves actual patient outcomes.
Thus, the two most important criteria for vetting AI performance, Kohane said, should be whether it leads to a better outcome for patients and whether its reasoning is transparent and explainable. “At this propitious moment where we see AI being implemented in the clinic, the most critical question is how to balance the wise ancient counsel primum non nocere—first, do no harm—while helping patients do better with the use of these new technologies where humans alone cannot.”
The 2021 conference will unfold along three thematic tracks:
Helpful versus Hateful
This section is dedicated to addressing issues of equity in AI design and use. Participants will discuss the importance of recognizing bias in AI algorithms and tools and identifying ways to minimize and eliminate such bias. Various dimensions of bias and disparities, including some instances when bias may be useful in AI design, will also be explored.
Helpful versus Hype-ful
Panelists will discuss the importance of designing tools that perform reliably, effectively, and safely and the importance of vetting and validating AI performance in the real world. Another dimension of this discussion will be bridging the gaps between AI models that are intellectually interesting and AI models that are meaningful and applicable for use on the frontlines of medicine.
The Future is Now: Tales from the trenches
Panelists will explore AI-powered tools already in clinical use and will engage scientists who have gone through the process of getting regulatory approval.