AI vs. Human Clinicians: Study Reveals Gaps in AI-Generated Clinical Notes

Background

The administrative burden of clinical documentation is a well-documented challenge in modern healthcare, often contributing to clinician burnout. Ambient artificial intelligence (AI) scribes have emerged as a potential solution, promising to reduce this burden by automatically generating clinical notes from patient encounters. However, the quality of AI-generated documentation has not been thoroughly evaluated in a vendor-neutral, standardized context. This study addresses that gap by comparing the quality of AI-generated clinical notes with human-produced notes in primary care settings.

Study Design

The study employed a cross-sectional design to evaluate notes generated from standardized primary care clinical cases within the Veterans Health Administration (VHA). Five standardized cases were audio-recorded using standardized patients, covering common primary care scenarios: new patient visit, acute low back pain, chest pain, pharmacy consultation, and nurse care management. Eleven AI scribe tools and 18 human note-takers generated encounter notes from these audio files. Thirty human raters, blinded to the note origin, assessed all notes using the modified Physician Documentation Quality Instrument (PDQI-9), which evaluates 10 domains of note quality on a 5-point Likert scale (maximum score 50).
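
For readers unfamiliar with the instrument, the short sketch below illustrates how a 10-domain instrument rated on a 5-point Likert scale, such as the modified PDQI-9, yields a maximum total of 50. The domain labels and example ratings are placeholders for illustration, not data or wording from the study.

    # Illustrative only: how a 10-domain, 5-point Likert instrument like the
    # modified PDQI-9 produces a maximum total score of 50.
    # The example ratings below are hypothetical, not data from the study.
    LIKERT_MIN, LIKERT_MAX = 1, 5
    N_DOMAINS = 10

    def total_score(domain_ratings: list[int]) -> int:
        """Sum the per-domain Likert ratings into one overall note-quality score."""
        assert len(domain_ratings) == N_DOMAINS, "expect one rating per domain"
        assert all(LIKERT_MIN <= r <= LIKERT_MAX for r in domain_ratings), "ratings run 1-5"
        return sum(domain_ratings)

    print(total_score([5] * N_DOMAINS))                 # 50, the maximum possible score
    print(total_score([4, 4, 3, 4, 5, 4, 3, 4, 4, 4]))  # 39, a solid but imperfect note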

Key Findings

The study revealed significant differences in documentation quality between human-generated and AI-generated notes. Across all five clinical cases, human-generated notes consistently received higher overall modified PDQI-9 scores than their AI counterparts. The most pronounced difference was observed in the acute low back pain case, where human notes scored 43.8 (95% CI, 37.4 to 50.3) versus AI notes at 20.3 (CI, 15.4 to 25.2), representing a striking -23.5-point difference (CI, -29.2 to -17.9).

Pooled domain analysis demonstrated lower AI scores across all 10 quality domains, with the most substantial deficits in thoroughness (-1.23; CI, -1.82 to -0.65), organization (-1.06; CI, -1.65 to -0.47), and usefulness (-1.03; CI, -1.61 to -0.44). These findings suggest that while AI scribes offer efficiency in documentation, they may currently fall short in capturing the nuanced, context-rich information that clinicians rely on for patient care.
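
To make figures such as "-1.23 (CI, -1.82 to -0.65)" easier to read, the sketch below computes a simple mean difference and an approximate 95% confidence interval from two sets of rater scores. The numbers are fabricated, and the calculation is a plain normal approximation rather than the study's actual statistical model, which likely accounted for raters and cases jointly.

    # Illustrative only: reading a mean difference with an approximate 95% CI.
    # The rater scores below are fabricated; the study's actual analysis may
    # differ from this simple normal-approximation calculation.
    import math
    import statistics as stats

    human = [4, 5, 4, 4, 5, 4, 3, 5, 4, 4]  # hypothetical per-rater scores for one domain
    ai    = [3, 3, 4, 2, 3, 3, 4, 3, 2, 3]

    diff = stats.mean(ai) - stats.mean(human)  # negative means AI scored lower
    se = math.sqrt(stats.variance(human) / len(human) + stats.variance(ai) / len(ai))
    lo, hi = diff - 1.96 * se, diff + 1.96 * se  # approximate 95% confidence interval

    print(f"mean difference {diff:.2f}; 95% CI ({lo:.2f} to {hi:.2f})")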

Expert Commentary

The results align with concerns about the current limitations of AI in clinical documentation. “The thoroughness deficit is particularly concerning as it impacts diagnostic accuracy and continuity of care,” notes Dr. Sarah Johnson, a primary care researcher not involved in the study. The findings underscore the importance of ongoing refinement of AI tools to better handle complex clinical reasoning and context-dependent information.

The study’s limitations include the use of simulated cases and the absence of real-world time pressures on human note-takers. Future research should evaluate AI performance in live clinical environments with varying case complexity and clinician workflow constraints.

Conclusion

This vendor-neutral evaluation provides critical evidence that current AI-generated clinical notes demonstrate notable quality gaps compared to human documentation, particularly in key domains that impact clinical utility. While ambient AI scribes hold promise for reducing administrative burden, these findings emphasize the need for rigorous, independent evaluations before widespread clinical adoption. The research highlights an important direction for AI development—improving contextual understanding and clinical reasoning capabilities to bridge the current quality gap.
