An artificial intelligence tool built to predict patient deterioration in an ICU begins producing risk scores that understate the actual clinical trajectory of the patients it evaluates. This does not happen overnight. It happens over the course of weeks, as the population of patients in the unit changes. The model does not recalibrate, because it was trained on data from a different period. No automated alert detects the change, and no performance threshold is crossed that would trigger an investigation. The issue goes unnoticed until a staff member observes that the model is consistently underestimating the acuity of patients who subsequently require rapid response interventions. By the time the problem is identified, the tool has been producing inaccurate scores for nearly two months.

The institutional response follows the standard patient safety process. A root cause analysis is conducted, the tool is taken offline, and leadership forms a committee to review the incident. The process is methodical, diligent, and reactive. It is also the wrong model for managing the risk that AI tools carry.

Incident Response Systems Have Limitations

Most healthcare incident response processes were designed for singular events: a wrongly prescribed drug, a faulty procedure, a patient falling out of bed. These failures are isolated, identifiable, and attributable to a specific break in the process of care. The response team has a clearly defined purpose: to determine what went wrong and to ensure that it does not happen again.

AI failures are more complex. AHRQ has noted the distinctions between AI failures and more conventional patient safety events. Research has shown that AI failures are more systemic, develop incrementally, and in most common scenarios remain invisible to the incident reporting system. A medication error, for example, may be visible and correctable at the time it occurs. In contrast, an unverified and inaccurate AI tool can cause unattributed harm to a large and unquantified patient population over a longer period. Moreover, the connection between the AI output and the resulting injury may go entirely unrecognized by the frontline clinician.

ECRI Institute has investigated the link between AI and patient safety incidents, and most of the incidents it examined went undetected by reporting systems. Instead, they surfaced through retrospective chart audits, external reviews, and clinician surveys covering the period when the AI tools were producing biased and inaccurate outputs. The surveys were conducted weeks to months after the AI systems were implemented. By the time these incidents had completed the incident response cycle, the extent of undetected harm that better oversight could have prevented was considerable.

The Drift Problem

Model drift is one of the most plausible ways that AI will fail in healthcare, and it is the failure mode that incident response is least suited to catch. MIT research on the performance drift of clinical AI tools across various healthcare settings found that drift is an inevitable byproduct of deploying machine learning models into adaptive clinical environments. Patient demographics, clinical guidelines, and data inputs all change over time and affect model performance in unpredictable ways. Because a model is a function of its training data, its outputs become less relevant and less accurate as the inputs it sees diverge from that data.
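One common way to quantify this kind of input shift is the population stability index (PSI), which compares how a single input feature is distributed in the training period and in a recent window. The sketch below is a minimal illustration, not a method drawn from the research cited here. The function name, the bin edges, and the sample values are all hypothetical, and the often quoted cutoffs of roughly 0.1 (watch) and 0.25 (investigate) are rules of thumb rather than clinical standards.

```python
import math
from typing import List, Sequence


def population_stability_index(
    baseline: Sequence[float],
    current: Sequence[float],
    edges: Sequence[float],
    eps: float = 1e-6,
) -> float:
    """PSI for one input feature, using fixed bin edges (hypothetical helper)."""
    if not baseline or not current:
        raise ValueError("both samples must be non-empty")

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * (len(edges) + 1)
        for v in values:
            # Bin index = number of edges the value has reached or passed.
            counts[sum(1 for e in edges if v >= e)] += 1
        # Floor each share at eps so an empty bin does not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    p = proportions(baseline)
    q = proportions(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


# Illustrative values only: patient ages at training time versus a recent month.
training_ages = [30, 40, 50, 60, 70, 80]
recent_ages = [60, 70, 75, 80, 85, 90]
psi = population_stability_index(training_ages, recent_ages, edges=[50, 70])
```

A PSI near zero means the recent population resembles the training population on that feature. A large value does not prove the model is wrong, but it is a reason to check performance before a clinician has to notice a problem.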

For AI enabled medical devices, the FDA's position is that drift will go undetected unless the performance of the machine learning model is measured continuously after initial validation and confirmed to stay within an empirically and clinically determined acceptable level of accuracy. Moreover, the agency has stated that most of the burden of this measurement falls on the healthcare organization that uses the device. Simply put, a healthcare provider that does not sufficiently measure, monitor, and assess the tool will create avoidable patient safety events.

The operational response to model drift is for healthcare organizations to establish a well defined performance baseline for every AI tool deployed in a clinical setting, and to develop systems that automatically detect performance deficits against that baseline. The Joint Commission has noted that relying on a clinician to identify the issue is, at best, an oversight concern and, at worst, a large and unaddressed systemic flaw in clinical management.
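A baseline check of this kind can be sketched in a few lines. The example below assumes the organization records each risk score alongside the eventual outcome (1 if the patient deteriorated, 0 if not) and scores each review window with the Brier score, where lower is better. The class name, the tolerance of 0.02, and the rule of two consecutive breaches are illustrative assumptions, not recommendations from any of the bodies named above.

```python
from dataclasses import dataclass, field
from typing import Sequence, Tuple

ScoredOutcome = Tuple[float, int]  # (predicted risk, observed outcome 0 or 1)


def brier_score(pairs: Sequence[ScoredOutcome]) -> float:
    """Mean squared gap between predicted risk and what actually happened."""
    if not pairs:
        raise ValueError("window has no scored outcomes")
    return sum((p - y) ** 2 for p, y in pairs) / len(pairs)


@dataclass
class BaselineMonitor:
    """Hypothetical monitor: alerts after sustained degradation versus baseline."""

    baseline_brier: float          # measured at validation time
    tolerance: float = 0.02        # assumed acceptable slack above baseline
    consecutive_needed: int = 2    # assumed number of bad windows before alerting
    _breaches: int = field(default=0, init=False)

    def check_window(self, pairs: Sequence[ScoredOutcome]) -> bool:
        """Score one window; return True when an alert should be raised."""
        if brier_score(pairs) > self.baseline_brier + self.tolerance:
            self._breaches += 1
        else:
            self._breaches = 0  # a healthy window resets the streak
        return self._breaches >= self.consecutive_needed
```

Requiring more than one bad window is a design choice that trades a short delay for fewer false alarms from a single noisy week. The right window length and tolerance depend on patient volume and on how costly a missed deterioration is.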

What Proactive Monitoring Looks Like

A proactive AI safety program differs from incident response in three structural ways.

First, it sets performance benchmarks for every implemented AI application, covering accuracy, sensitivity, specificity, and related metrics for the clinical use case. Performance that falls outside these benchmarks raises an automated alert and triggers response protocols in real time, rather than after a clinician has reported a problem.
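As a rough illustration of benchmark checking, the sketch below computes sensitivity and specificity from a batch of alerts and outcomes and lists any metric that falls below its floor. The benchmark values and function names are hypothetical. Real floors would come from the tool's validation study and its clinical use case.

```python
from typing import Dict, List, Sequence

# Hypothetical floors for a deterioration alert; not taken from any guideline.
BENCHMARKS: Dict[str, float] = {"sensitivity": 0.80, "specificity": 0.70}


def confusion_metrics(
    predicted: Sequence[bool], actual: Sequence[bool]
) -> Dict[str, float]:
    """Sensitivity and specificity for paired alert flags and true outcomes."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    return {
        # NaN marks a metric that cannot be computed from this batch.
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


def breached_benchmarks(
    metrics: Dict[str, float], benchmarks: Dict[str, float]
) -> List[str]:
    """Names of metrics below their floor; an uncomputable metric also counts."""
    # "not (value >= floor)" is True for NaN, so missing data is flagged too.
    return sorted(
        name for name, floor in benchmarks.items() if not metrics[name] >= floor
    )


# Illustrative batch: the tool flagged only one of four patients who deteriorated.
alerts = [True, False, False, False, True, False, False, False, False, False]
deteriorated = [True, True, True, True, False, False, False, False, False, False]
breaches = breached_benchmarks(confusion_metrics(alerts, deteriorated), BENCHMARKS)
```

In this batch specificity stays above its floor while sensitivity does not, which is the pattern in the opening scenario: the tool looks quiet and well behaved while it misses the patients who matter most.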

Second, a proactive program incorporates scheduled performance reviews that compare clinical outcomes with the AI tool's output on a recurring basis and in a more structured manner than incident based reviews. These reviews are preventative in nature: they are built into the operational calendar and occur whether or not a problem is known to exist. Health Affairs research on scheduled reviews of AI performance found that health systems using them saw greater benefit and fewer patient safety incidents associated with AI.
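One simple comparison a scheduled review could run is the observed-to-expected (O/E) ratio: the number of patients who actually deteriorated divided by the number the model predicted would. The sketch below is an assumption about how such a review might be structured, with a hypothetical risk cutoff of 0.5 separating low and high risk bands. A ratio well above 1 means the model is underestimating risk in that band.

```python
from typing import Dict, Sequence, Tuple

ScoredOutcome = Tuple[float, int]  # (predicted risk, observed outcome 0 or 1)


def observed_to_expected(pairs: Sequence[ScoredOutcome]) -> float:
    """Observed events divided by the sum of predicted risks."""
    expected = sum(p for p, _ in pairs)
    if expected == 0:
        raise ValueError("expected event count is zero")
    return sum(y for _, y in pairs) / expected


def review_by_band(
    pairs: Sequence[ScoredOutcome], cutoff: float = 0.5
) -> Dict[str, float]:
    """O/E ratio for low and high risk bands; empty bands are left out."""
    bands = {
        "low": [(p, y) for p, y in pairs if p < cutoff],
        "high": [(p, y) for p, y in pairs if p >= cutoff],
    }
    return {name: observed_to_expected(g) for name, g in bands.items() if g}


# Illustrative review period: ten patients scored at 0.1 risk, five at 0.8.
period = [(0.1, 0)] * 7 + [(0.1, 1)] * 3 + [(0.8, 1)] * 5
report = review_by_band(period)
```

Here roughly three times as many low scored patients deteriorated as the model expected, which no single incident report would reveal. Splitting by band matters because an overall ratio can look acceptable while one band is badly miscalibrated.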

Finally, a proactive program designates a specific role and function for AI oversight within the health system. The position is primarily responsible for managing the performance dashboards, liaising with clinical teams when problems arise, and ensuring that all the appropriate corrective actions are completed. AHRQ recommends that health systems establish AI safety functions that are distinct from the teams who implement the AI technology in order to ensure that the oversight is objective and consistent.

Moving Past the Fallacy

Incident response as a standalone model fails because the systems healthcare has built for tracking traditional patient safety events cannot account for the way AI produces harm. AI risk is deep and systemic, and for clinical teams, it becomes visible only after the damage is done.

Health system leaders must decide how much the organization will invest in preventative AI risk monitoring before an incident occurs. For any health system implementing clinical AI, choosing not to pair it with an AI risk monitoring system means relying on a clinician to catch the next AI failure before considerable damage is done. The evidence is clear that this will most likely not be the case.

Context and Sources

AHRQ has looked into the contrast between AI related adverse events and traditional patient safety events. ECRI Institute has analyzed AI related patient safety events through its surveillance network. MIT has looked into clinical AI tools and performance drift. The FDA has acknowledged drift issues in medical devices. The Joint Commission has suggested performance baselines and automated monitoring for clinical AI. Health Affairs has researched the impact of scheduled performance reviews for AI. This edition touches on the operational and institutional themes in editions AH, AK, and AE of this newsletter.

Christopher Hutchins
Founder & CEO, Hutchins Data Strategy Consultants
