A health system receives a complaint from a patient who was denied a recommended procedure after a coverage determination based on an AI risk classification tool. The patient appeals. The compliance team starts reconstructing the decision trail. They know the AI tool produced a risk score. They know the score was used in the coverage determination. What they do not know is why the model assigned that specific score to that specific patient, which data inputs led to that outcome, and whether the model was updated between scoring and review.
The audit trail exists in form but not in substance. The health system can show that a decision was made and that AI contributed to it, yet it cannot explain the decision well enough to satisfy regulatory scrutiny, a legal challenge, or the patient who is questioning the process. This gap is the audit trail illusion, and most health systems have it.
What the Audit Trail Lacks
In traditional clinical documentation, the decision trail is part of the record. A clinician captures the clinical judgment, the diagnostic results, the treatment options evaluated, and the reasons for the selected approach. If a question arises later, the documentation is enough to reconstruct the decision narrative.
AI changes that paradigm. Many operational and clinical AI applications output risk scores, recommendations, and classifications without revealing the inputs, the weighting, or the reasoning behind the output. A clinician may note that an AI tool was used and may document its output, yet the model's internal logic is usually absent from the clinical documentation and inaccessible to compliance, legal, or QA reviewers.
Research published in npj Digital Medicine has analyzed how health systems document AI-assisted clinical decision-making. The results show that fewer than one in five health systems have documentation policies that require recording the inputs, the AI model version, or the confidence score of a recommendation alongside the clinical output. In many instances, the clinical record contains only the recommendation, with no contextual information to support later review.
The Office of Inspector General (OIG) has studied algorithmic opacity in audit trails associated with Medicare Advantage coverage decisions. Payers and health systems were often unable to articulate the reasoning the model applied to a given coverage decision, which left reviewers unable to establish whether the decision was correct or whether the model was functioning as intended.
The Explainability Problem
The audit trail gap is compounded by the explainability limitations of many AI models used in healthcare. The predictions of modern machine learning models, deep learning and ensemble methods in particular, are produced by mechanisms that do not decompose into a human-understandable rationale. A model may flag a patient as high risk because of hundreds of input variables and many nonlinear interactions among them, and the resulting prediction cannot be reduced to a simple causal statement.
The FDA has recognized this obstacle in its framework for AI and machine learning-enabled software as a medical device. The framework explicitly states that the explainability of model outputs is one of the factors considered in the regulatory assessment of AI technology, and that manufacturers are expected to be able to explain a model's predictive outputs. The framework does not, however, prescribe particular explainability thresholds, leaving health systems to determine their own.
Research conducted at the MIT Lincoln Laboratory has examined post hoc approaches to explaining clinical AI models, particularly SHAP values and LIME-like methods. That research suggests that while these methods offer an approximate view of model behavior, they are not consistently reliable, particularly for high-dimensional models operating in a clinical context. Explainability tools are better than no explanation at all, but they are not a substitute for inherently explainable model architectures.
Constructing an Authentic Audit Trail
Ideally, an audit trail for AI-driven decisions is built along three dimensions. At the model level, health systems need a way to record and retain, for every output, the model version, the input data, and the prediction confidence score. These records need to be linked to the clinical record in systems accessible to reviewing teams.
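A model-level record of this kind could be sketched as follows. This is a minimal illustration in Python, not a reference implementation: the class name ModelAuditRecord, its field names, the record_prediction helper, and the sample values are all hypothetical, and a real system would write the record to durable storage tied to the clinical record rather than keep it in memory.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ModelAuditRecord:
    """One AI output plus the context needed to review it later (hypothetical schema)."""
    encounter_id: str    # link back to the clinical record
    model_name: str
    model_version: str   # the exact version that produced the output
    inputs: dict         # the input data as seen by the model (JSON-serializable)
    output: str          # e.g. a risk class or recommendation
    confidence: float    # the model's confidence score for this output
    scored_at: str       # ISO 8601 timestamp, UTC

    def fingerprint(self) -> str:
        """SHA-256 over a canonical JSON form, usable to detect later alteration."""
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def record_prediction(encounter_id, model_name, model_version,
                      inputs, output, confidence):
    """Build an audit record at scoring time, stamping the current UTC time."""
    return ModelAuditRecord(
        encounter_id=encounter_id,
        model_name=model_name,
        model_version=model_version,
        inputs=dict(inputs),  # copy so later changes by the caller do not leak in
        output=output,
        confidence=confidence,
        scored_at=datetime.now(timezone.utc).isoformat(),
    )


# Illustrative values only.
record = record_prediction(
    encounter_id="ENC-0001",
    model_name="risk-classifier",
    model_version="2.3.1",
    inputs={"age": 67, "prior_admissions": 2},
    output="high_risk",
    confidence=0.82,
)
```

Storing the fingerprint alongside the record gives reviewers a simple way to confirm that the inputs, version, and score they are looking at are the ones captured at scoring time.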
At the workflow level, clinicians need structured documentation fields to capture the role the AI output played in the decision. The documentation should state whether the recommendation was followed, modified, or overridden, along with the clinical reasoning behind that choice. Studies from Vanderbilt University Medical Center have investigated structured AI documentation templates in clinical workflow and found that this approach improves documentation of how AI is integrated into clinical workflow without lengthening the encounter.
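The structured field described here can be illustrated with a small Python sketch. The names Disposition and ClinicianResponse are hypothetical, and the rule that a rationale is required for every disposition is an assumption made for the example. This is not the template used in the Vanderbilt studies.

```python
from dataclasses import dataclass
from enum import Enum


class Disposition(Enum):
    """What the clinician did with the AI recommendation."""
    FOLLOWED = "followed"
    MODIFIED = "modified"
    OVERRIDDEN = "overridden"


@dataclass(frozen=True)
class ClinicianResponse:
    """Structured documentation of one AI-assisted decision (hypothetical schema)."""
    encounter_id: str
    ai_output: str
    disposition: Disposition
    rationale: str  # the clinical reasoning behind the disposition

    def __post_init__(self):
        # Reject free-text dispositions so that reviews can filter reliably.
        if not isinstance(self.disposition, Disposition):
            raise ValueError("disposition must be a Disposition value")
        # Assumption for this sketch: reasoning is mandatory in every case.
        if not self.rationale.strip():
            raise ValueError("a clinical rationale is required")


# Illustrative values only.
response = ClinicianResponse(
    encounter_id="ENC-0001",
    ai_output="high_risk",
    disposition=Disposition.OVERRIDDEN,
    rationale="Recent imaging contradicts the risk classification.",
)
```

Constraining the disposition to three values keeps the field quick to complete while still making overrides easy to find during review.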
At the institutional level, health systems need model retention policies that are consistent with their medical record retention policies. If a model artifact is requested for a clinical decision two years after the fact, the institution should be able to produce it. Most health systems have not reached that level.
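One way to test whether model retention keeps pace with record retention is a simple date comparison, sketched below in Python. The function names and the seven-year retention period in the usage example are hypothetical. Actual retention periods depend on jurisdiction and institutional policy.

```python
from datetime import date


def add_years(d: date, years: int) -> date:
    """Return the same calendar day `years` later, mapping Feb 29 to Feb 28."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        # Raised when d is Feb 29 and the target year is not a leap year.
        return d.replace(year=d.year + years, day=28)


def retention_gap_days(decision_date: date,
                       artifact_purge_date: date,
                       record_retention_years: int) -> int:
    """Days for which the clinical record must be kept but the model artifact is gone.

    Returns 0 when the artifact is kept at least as long as the record.
    """
    required_until = add_years(decision_date, record_retention_years)
    return max(0, (required_until - artifact_purge_date).days)


# Illustrative: a decision made on 2024-03-01, records kept for seven years,
# but the model artifact scheduled for deletion two years after the decision.
gap = retention_gap_days(date(2024, 3, 1), date(2026, 3, 1), 7)
```

A nonzero gap flags decisions whose record would outlive the model version that informed them, which is the situation described above.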
The Stakes
The illusion of an audit trail creates exposure on several fronts. On the regulatory front, a growing body of regulation governs the use of AI in decision-making, and an inability to reconstruct model logic is a considerable compliance problem. On the legal front, most health systems do not have the documentation they will need when AI-driven clinical decisions reach litigation. On the patient front, trust rests on a visible decision-making process. When an AI-assisted decision cannot be explained, that trust evaporates and the relationship deteriorates.
For health system leaders, the question is whether an AI-assisted decision can be justified at the level of detail, and with the level of documentation, that the institution would expect for any other clinical decision. If the answer is no, the audit trail is not a trail at all. It is a gap that widens with every AI-assisted decision and becomes more burdensome to close.
Context and Sources
The documentation gap for AI-assisted clinical decisions has been discussed in npj Digital Medicine. The OIG has analyzed the lack of an audit trail in algorithmic coverage determinations. The FDA's framework for AI/ML software outlines the issues concerning explainability. Post hoc explainability in clinical AI systems has been explored by the MIT Lincoln Laboratory. Vanderbilt University Medical Center has studied the use of structured templates for documenting the use of AI systems. This edition connects to the topics of clinical ownership and institutional design discussed in Editions Y, Z, and W of this newsletter.
Christopher Hutchins
Founder & CEO, Hutchins Data Strategy Consultants