Highlights
- Systematically biased AI models decreased clinicians’ diagnostic accuracy for common causes of acute respiratory failure.
- Providing AI-generated explanations did not significantly mitigate the negative effects of bias.
- Clinicians’ overreliance on AI persisted even when explanations showed the biased models attending to clinically irrelevant image regions.
- The study underscores the potential risks of deploying inadequately validated AI decision support in clinical settings.
Background
Artificial intelligence (AI) and machine learning tools are increasingly used to assist clinicians in diagnostic processes, aiming to improve accuracy and efficiency. However, the introduction of AI into clinical decision-making is not without risks. Systematic biases (errors introduced by non-representative training data or flawed model development) can propagate through AI outputs, potentially leading to diagnostic errors and patient harm. Recent regulatory guidelines have advocated for the use of AI-generated explanations as a safeguard, yet the effectiveness of this strategy remains unclear.
Hospitalized patients with acute respiratory failure, often due to pneumonia, heart failure, or chronic obstructive pulmonary disease (COPD), require timely and accurate diagnosis for optimal care. Errors in this context can result in inappropriate treatment, increased morbidity, and higher healthcare costs. Investigating how AI tools, particularly those with known biases, influence diagnostic performance is therefore of pressing clinical importance.
Study Overview and Methodological Design
Jabbour et al. conducted a randomized clinical vignette survey study (JAMA, 2023) to evaluate the impact of both standard and systematically biased AI models on clinicians’ diagnostic accuracy. The survey, administered between April 2022 and January 2023 across 13 US states, included 457 hospital-based clinicians—physicians, nurse practitioners, and physician assistants. Participants were randomized to receive AI predictions with or without accompanying explanations.
Each clinician reviewed nine carefully constructed vignettes representing hospitalized patients with acute respiratory failure. Each vignette included presenting symptoms, examination findings, laboratory results, and chest radiographs. For each vignette, clinicians assessed the likelihood of each of three target diagnoses: pneumonia, heart failure, and COPD. Two vignettes were presented without AI input (baseline), six included AI predictions (three unbiased, three systematically biased), and one involved a simulated peer consultation. The primary endpoint was diagnostic accuracy: the proportion of correct diagnoses out of all assessments.
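As a rough illustration of how such a primary endpoint can be scored, the sketch below computes accuracy as the proportion of correct assessments over a set of hypothetical vignette responses. The 50% cutoff used here to dichotomize reported probabilities is an assumption for illustration, not necessarily the study’s exact scoring rule.

```python
# Illustrative scoring of the primary endpoint (diagnostic accuracy).
# The 50% cutoff for converting a reported probability into a yes/no call
# is an assumption for illustration; the study's exact rule may differ.

def is_correct(reported_probability: float, diagnosis_present: bool) -> bool:
    """Score one assessment: call the diagnosis 'present' at >= 50%."""
    called_present = reported_probability >= 0.5
    return called_present == diagnosis_present

# Hypothetical assessments: (clinician's reported probability, ground truth)
assessments = [
    (0.80, True),   # pneumonia judged likely and actually present -> correct
    (0.30, False),  # heart failure judged unlikely and absent     -> correct
    (0.60, False),  # COPD judged likely but actually absent       -> incorrect
]

accuracy = sum(is_correct(p, truth) for p, truth in assessments) / len(assessments)
print(f"Diagnostic accuracy: {accuracy:.0%}")  # 67% in this toy example
```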
Key Findings
Baseline diagnostic accuracy for the three conditions was 73%. When presented with standard (unbiased) AI model predictions, clinicians’ accuracy improved modestly: by 2.9 percentage points without explanations and 4.4 percentage points with explanations. Exposure to systematically biased AI model predictions, however, led to a significant decline in performance: diagnostic accuracy dropped by 11.3 percentage points without explanations and 9.1 percentage points with explanations, relative to baseline.
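For concreteness, assuming these reported changes are absolute percentage-point differences from the 73% baseline, they map onto the approximate accuracies computed below.

```python
# Map the reported percentage-point changes onto the 73% baseline accuracy.
# Assumes the changes are absolute (percentage-point) differences from baseline.
baseline = 73.0

changes = {
    "standard AI, no explanations":  +2.9,
    "standard AI, explanations":     +4.4,
    "biased AI, no explanations":   -11.3,
    "biased AI, explanations":       -9.1,
}

for condition, delta in changes.items():
    print(f"{condition:30s} -> {baseline + delta:.1f}%")
# standard AI, no explanations   -> 75.9%
# standard AI, explanations      -> 77.4%
# biased AI, no explanations     -> 61.7%
# biased AI, explanations        -> 63.9%
```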
Statistical analysis indicated that the reduction in accuracy was driven primarily by decreased specificity: clinicians were more likely to make false-positive diagnoses when following biased AI advice. Notably, AI-generated explanations did not substantially mitigate these harms. Even when the explanations showed the model attending to clinically irrelevant image regions, clinicians often failed to recognize the underlying error and continued to rely on the AI outputs.
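To make the specificity point concrete, the sketch below uses invented confusion-matrix counts (not taken from the study) to show how an increase in false positives lowers specificity and, with it, overall accuracy, even when sensitivity is unchanged.

```python
# Illustrative only: how extra false positives depress specificity and accuracy.
# The counts below are invented for demonstration, not drawn from the study.

def metrics(tp, fp, tn, fn):
    sensitivity = tp / (tp + fn)              # true positive rate
    specificity = tn / (tn + fp)              # true negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return sensitivity, specificity, accuracy

# Without biased AI input: relatively few false positives.
print(metrics(tp=40, fp=10, tn=90, fn=10))    # (0.80, 0.90, ~0.87)

# Following biased AI advice: same sensitivity, many more false positives.
print(metrics(tp=40, fp=35, tn=65, fn=10))    # (0.80, 0.65, 0.70)
```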
Mechanistic Insights and Pathophysiological Context
AI models, especially those analyzing imaging data, can inadvertently learn non-causal associations from their training datasets, such as image artifacts or demographic confounders. Systematic bias arises when a model consistently misclassifies cases on the basis of such flawed features. In this study, the biased models systematically erred in ways that were not immediately apparent to clinicians, leading to decreased diagnostic specificity.
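As a minimal, hypothetical sketch of this failure mode (not the chest-radiograph models used in the study), the simulation below trains a classifier on data in which an irrelevant “artifact” feature happens to correlate with the label; when that correlation disappears at evaluation time, performance degrades systematically.

```python
# Toy example of a model latching onto a spurious (non-causal) feature.
# Purely hypothetical; not the models evaluated in the study.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000

def make_data(artifact_correlated: bool):
    y = rng.integers(0, 2, n)                   # true diagnosis label (0/1)
    signal = y + rng.normal(0, 1.5, n)          # weak genuine clinical signal
    if artifact_correlated:
        artifact = y + rng.normal(0, 0.1, n)    # artifact tracks the label
    else:
        artifact = rng.normal(0, 1.0, n)        # artifact is pure noise
    X = np.column_stack([signal, artifact])
    return X, y

X_train, y_train = make_data(artifact_correlated=True)   # biased training data
X_eval, y_eval = make_data(artifact_correlated=False)    # deployment-like data

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("accuracy on biased data:  ", model.score(X_train, y_train))  # near perfect
print("accuracy on unbiased data:", model.score(X_eval, y_eval))    # much lower
```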
The lack of mitigation by explanations may reflect cognitive biases such as automation bias (overreliance on algorithmic outputs) or anchoring, where clinicians fixate on AI suggestions despite contradictory evidence. Furthermore, the technical complexity or superficiality of explanations may limit their practical utility, especially if clinicians lack the time or expertise to critically appraise them during routine care.
Clinical Implications
These findings urge caution about the uncritical adoption of AI diagnostic tools in real-world practice. While AI holds promise for augmenting clinician performance, systematically biased models can undermine care quality, particularly when clinicians are unaware of, or unable to compensate for, these flaws. The study suggests that explanations, at least as currently implemented, may be insufficient to guard against the propagation of AI-driven diagnostic errors.
For hospitalists and acute care teams, this underscores the importance of ongoing clinical vigilance and skepticism when interpreting AI-assisted recommendations. Health systems should prioritize rigorous external validation and bias assessment of AI tools before deployment, and clinicians may benefit from targeted education on the limitations of AI explanations.
Limitations and Controversies
Several limitations must be considered. The study used web-based vignettes rather than real-time clinical encounters, possibly overestimating or underestimating AI impact relative to actual practice. The clinician cohort skewed younger and may not reflect the experience distribution of practicing hospitalists. Additionally, the study focused on diagnostic decisions for three common conditions, and results may not generalize to other diseases or specialties.
There is also ongoing debate about the optimal design and transparency of AI explanations. Some experts argue for more interactive or context-sensitive explanation frameworks, while others suggest that inherent model transparency may never substitute for rigorous clinical oversight.
Expert Commentary or Guideline Positioning
Dr. Suman Pal, a hospital medicine expert not involved in the study, noted: “It was interesting to note that explanations did not significantly mitigate the decrease in clinician accuracy from systematically biased AI model predictions.” Current professional guidelines from regulatory bodies, including the FDA, emphasize explainability but do not yet specify standards for effectiveness in mitigating bias.
Conclusion
Systematic bias in AI diagnostic models can meaningfully degrade clinician accuracy, and simplistic explanatory frameworks may not suffice to prevent harm. As AI becomes more deeply integrated into hospital care, robust validation, transparency, and clinician education will be essential to maximize benefits while minimizing risks. Further research should focus on developing and testing more effective strategies for identifying and correcting AI-driven bias in clinical workflows.
References
1. Jabbour S, Fouhey D, Shepard S, Valley TS, Kazerooni EA, Banovic N, Wiens J, Sjoding MW. Measuring the Impact of AI in the Diagnosis of Hospitalized Patients: A Randomized Clinical Vignette Survey Study. JAMA. 2023 Dec 19;330(23):2275-2284. doi:10.1001/jama.2023.22295.
2. Topol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nat Med. 2019;25(1):44-56. doi:10.1038/s41591-018-0300-7.
3. U.S. Food & Drug Administration. Artificial Intelligence and Machine Learning in Software as a Medical Device. FDA; 2021.