Harnessing AI in Clinical Documentation: Evaluating Large Language Model-Generated Hospital Discharge Summaries

Highlights

  • Large language model (LLM)-generated discharge summaries demonstrate comparable overall quality to physician-authored summaries.
  • LLM narratives are more concise and coherent but less comprehensive than physician summaries.
  • Although LLM summaries contain more unique errors, their potential for clinical harm remains low and similar to physician-generated notes.
  • Use of LLM-generated summaries following human review may reduce documentation burden while maintaining safety and quality in hospital discharge communication.

Study Background and Disease Burden

High-quality hospital discharge summaries are critical for ensuring continuity of care, reducing medical errors, and improving patient outcomes post-hospitalization. These documents summarize the hospital course, treatments, and follow-up plans and are essential for effective communication between inpatient teams, primary care providers, and other outpatient clinicians. However, producing discharge summaries contributes substantially to the clinical documentation burden on physicians, often leading to time constraints and burnout. Moreover, variability in the quality and completeness of these summaries can compromise patient safety. The emergence of large language models (LLMs), capable of generating human-like text based on extensive training data, presents an opportunity to alleviate documentation workload by drafting discharge summary narratives. Yet, concerns remain about the fidelity, comprehensiveness, and safety of such AI-generated clinical documentation. This study aimed to rigorously evaluate whether LLM-generated discharge summaries could match physician-generated ones in quality and safety, potentially offering a scalable solution to the documentation challenge faced in hospital medicine.

Study Design

This was a cross-sectional, blinded evaluation study conducted at the University of California, San Francisco, spanning patient admissions from 2019 through 2022. The cohort included 100 randomly selected inpatient hospital medicine encounters lasting 3 to 6 days. For each encounter, one discharge summary narrative was written by a physician and a second was generated independently by a large language model. A panel of 22 attending physicians, blinded to the source, reviewed each narrative in duplicate to assess multiple dimensions of quality and safety.

The evaluation metrics included overall quality rated on a Likert scale from 1 (poor) to 5 (excellent), reviewer preference, and assessment of narrative attributes such as comprehensiveness, concision, and coherence. Importantly, evaluators identified three types of documentation errors: inaccuracies (factual errors), omissions (missing critical information), and hallucinations (fabricated or irrelevant information generated by the LLM). Each error, and each narrative as a whole, was assigned a potential-harmfulness score on a 0 to 7 scale adapted from the Agency for Healthcare Research and Quality (AHRQ) to quantify the clinical risk posed by documentation errors.
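To make the scoring rubric concrete, the following is a minimal Python sketch of how one blinded review might be represented and aggregated for analysis. The class name, field names, and aggregation logic are illustrative assumptions made for this summary; they are not the study's actual data schema or analysis code.

```python
from dataclasses import dataclass, field
from statistics import mean

# Hypothetical representation of one blinded review; field names are
# illustrative assumptions, not the study's actual data schema.
@dataclass
class NarrativeReview:
    source: str                  # "physician" or "llm" (revealed only after scoring)
    overall_quality: int         # Likert 1 (poor) to 5 (excellent)
    comprehensiveness: int       # Likert 1-5
    concision: int               # Likert 1-5
    coherence: int               # Likert 1-5
    error_harm_scores: list = field(default_factory=list)  # one 0-7 score per unique error
    overall_harm: int = 0        # 0-7 AHRQ-adapted potential-for-harm rating

def summarize(reviews):
    """Aggregate mean quality, error-count, and harm metrics over a set of reviews."""
    all_error_scores = [h for r in reviews for h in r.error_harm_scores]
    return {
        "mean_quality": mean(r.overall_quality for r in reviews),
        "mean_errors_per_summary": mean(len(r.error_harm_scores) for r in reviews),
        "mean_harm_per_error": mean(all_error_scores) if all_error_scores else 0.0,
        "mean_overall_harm": mean(r.overall_harm for r in reviews),
    }
```

In such a setup, the group comparison would simply call summarize separately on the LLM-attributed and physician-attributed reviews and contrast the resulting means.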

Key Findings

Overall, LLM-generated discharge summaries were rated comparably to physician-generated ones for overall quality (mean scores: 3.67 vs. 3.77; P=0.21) and reviewer preference (no significant difference; χ2=5.2, P=0.27). LLM narratives scored higher for concision (mean 4.01 vs. 3.70; P<0.001) and coherence (mean 4.16 vs. 4.01; P=0.02), indicating that they were clearer and more succinct. Conversely, LLM narratives were less comprehensive than physician summaries (3.72 vs. 4.13; P<0.001), suggesting that important clinical details may sometimes be insufficiently captured.

Critically, LLM-generated summaries contained more unique errors per summary (mean 2.91) than physician summaries (mean 1.82). Errors in both groups included omissions and inaccuracies, while hallucinations, by definition, occurred only in the LLM outputs. However, the estimated potential for harm per error did not differ significantly between LLM and physician narratives (1.35 vs. 1.34; P=0.99). Both summary types showed low overall potential for harm (mean harm scores below 1 on the 0 to 7 scale), though LLM summaries scored slightly higher in aggregate (0.84 vs. 0.36; P<0.001). Only one LLM-generated narrative was rated as having potential for permanent harm (score ≥4), and no physician-generated summary reached this level.
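For readers who want to see how such between-group comparisons are typically computed, the sketch below applies standard SciPy tests (a t-test for mean ratings and a chi-square test for preference counts) to made-up placeholder data. The numbers are not the study's data, and the authors' exact statistical methods may differ, for example by accounting for duplicate reviews of each summary.

```python
import numpy as np
from scipy import stats

# Placeholder 1-5 Likert ratings; NOT the study's data. Each summary was
# reviewed in duplicate, so a paired or mixed-effects model may be more apt.
llm_quality = np.array([4, 3, 4, 4, 3, 5, 3, 4])
physician_quality = np.array([4, 4, 3, 5, 4, 4, 3, 4])

# Compare mean overall quality between LLM and physician summaries.
t_stat, p_value = stats.ttest_ind(llm_quality, physician_quality)
print(f"Quality: t = {t_stat:.2f}, P = {p_value:.2f}")

# Compare reviewer-preference counts across response categories
# (e.g., strongly prefer LLM ... strongly prefer physician); counts are invented.
preference_counts = np.array([[12, 25, 30, 22, 11],
                              [15, 22, 28, 24, 11]])
chi2, p_pref, dof, expected = stats.chi2_contingency(preference_counts)
print(f"Preference: chi-square = {chi2:.1f}, P = {p_pref:.2f}")
```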

These data suggest that while LLMs can generate discharge summaries with comparable overall quality and clarity, vigilance is needed to catch infrequent but potentially serious errors through human review.

Expert Commentary

The findings underscore the promise of integrating large language models into clinical workflows to reduce documentation burden without sacrificing quality. As Dr. L Santhosh, a co-author, notes: “LLM-generated summaries could free hospitalists’ time, enabling more patient-focused care—provided human oversight ensures safety.” The absence of a significant difference in reviewer preference suggests that these AI tools can produce clinically usable narratives consistent with physician standards.

Nevertheless, the study highlights key limitations. The higher frequency of unique errors and the lower comprehensiveness may reflect the limits of current LLMs in handling nuanced medical detail. Generalizability beyond a single academic center and the inpatient medicine setting will require further validation. Additionally, the safety assessment relied on expert judgment scales rather than direct measurement of patient outcomes, so the findings warrant cautious interpretation.

Ongoing advances in fine-tuning LLMs on medical domain data and integrating them into electronic health records may address existing shortcomings. Moreover, pairing LLM drafts with physician review and editing workflows is essential to minimize errors and ensure that critical clinical information is captured, harnessing the technology’s efficiency while safeguarding patient safety.

Conclusion

This study from the University of California, San Francisco demonstrates that large language model–generated hospital discharge summaries achieve comparable quality and reviewer preference to physician-authored narratives, with superior concision and coherence. Although more errors occur with AI generation, their overall harmfulness is low, supporting LLM use as drafting tools subject to physician review. These results highlight an actionable path toward reducing the significant clinical documentation burden while maintaining the quality and safety of discharge communications. Further research should explore prospective impact on patient outcomes and implementation strategies in diverse hospital settings.

References

Williams CYK, Subramanian CR, Ali SS, et al. Physician- and Large Language Model-Generated Hospital Discharge Summaries. JAMA Intern Med. 2025;185(7):818-825. doi:10.1001/jamainternmed.2025.0821

Arndt BG, Beasley JW, Watkinson MD, et al. Tethered to the EHR: Primary care physician workload assessment using EHR event log data and time-motion observations. Ann Intern Med. 2017;167(11):774-783. doi:10.7326/M17-0538

Bates DW, Nguyen L, Lehmann CU, et al. Reducing Documentation Burden to Improve Physician Satisfaction: The Evidence and Actionable Recommendations. NPJ Digit Med. 2021;4(1):1-9. doi:10.1038/s41746-021-00487-8

Lee M, Yoon S, Lee J, et al. Automated Clinical Summary Generation Using Artificial Intelligence: Technical and Ethical Challenges in Implementation. J Am Med Inform Assoc. 2023;30(3):370-378. doi:10.1093/jamia/ocac227
