Advancing Clinical Care with AI Agents: Systematic Review of Performance and Integration in Medicine

Study Background and Disease Burden

Artificial intelligence (AI) is rapidly transforming clinical medicine, particularly through the use of large language models (LLMs) that can understand and generate human-like text. Recently, AI agents—advanced systems built on LLMs capable of planning tasks, employing external tools, coordinating with other agents, and executing complex multi-step clinical workflows—have emerged as innovative tools targeting unmet medical needs. These agents promise to enhance clinical decision-making, reduce cognitive burden on clinicians, optimize diagnostic accuracy, expedite evidence synthesis, aid in treatment planning, and improve administrative efficiency. The increasing complexity and volume of medical knowledge and patient data necessitate intelligent systems that can handle multiple data streams and real-time updates, going beyond static models. However, despite growing interest, critical knowledge gaps remain regarding the performance gains offered by AI agents over standard LLMs, the comparative benefits of multi-agent versus single-agent frameworks, and the effective integration of auxiliary clinical tools to accomplish healthcare tasks efficiently.

Study Design

This systematic review analyzed peer-reviewed studies from the PubMed, Web of Science, and Scopus databases published between October 1, 2022, and August 5, 2025, that quantitatively evaluated AI agent implementations in clinical settings. Eligible studies applied AI agents to clinical and administrative healthcare tasks and reported explicit performance comparisons against baseline LLMs or other standards. Two independent reviewers (A.G., M.O.) systematically extracted data on the AI architectures employed, performance metrics such as accuracy or clinical outcome improvement, clinical applications, and evaluation datasets. Disagreements during data extraction were resolved through discussion, with a third reviewer (E.K.) consulted when consensus could not be reached. The included studies spanned clinical domains from diagnosis, prognosis, and treatment planning to clinical operations and medical education.

Key Findings

Twenty qualifying studies, published predominantly between 2024 and 2025, met the inclusion criteria. They analyzed diverse datasets, including clinical case series (16–302 cases), medical records and electrophysiological reports (totaling 419 reports), multiple-choice clinical questions (5,120 items), evidence synthesis queries (50–500 queries), real patient data from 117 individuals, extensive computational vignettes (>10,000 calculations), and genomic/biological datasets (biomarker panels, nanobodies, gene sets, and scientific articles).

All evaluated AI agent frameworks consistently outperformed their baseline LLM counterparts in accuracy and task efficacy measures. Clinical applications focused particularly on decision-support roles, with diagnosis and prognosis, especially rare disease identification, constituting 40% of studies. Other significant areas included evidence synthesis (25%), treatment planning (15%), clinical operations such as appointment scheduling (10%), genomics (10%), and medical education (5%).

Three primary AI agent architectural archetypes emerged: single-agent tool-calling frameworks (40%), multi-agent systems without integrated tool use (25%), and hybrid multi-agent systems augmented with tool calling (35%). The dominant LLMs powering these agents were GPT-4 family models (75%), with supplementary use of Llama-3, Claude-3 Opus, and Gemini-1.5 models.

Regarding multi-agent systems, two distinct approaches were identified. Pure multi-agent frameworks without tool augmentation showed moderate improvements over base LLMs (median gain +14.05%, IQR 8.95–45.15%). Hybrid multi-agent tool-calling systems yielded slightly higher gains (median +17.17%, IQR 4.12–39.3%) but with substantial variability. This high variance likely reflects task heterogeneity: some tasks were manageable by single agents or simpler tool-augmented LLMs, while others required more complex multi-agent coordination.
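The median and IQR figures above are standard summaries of per-study gains. A minimal sketch of how such summaries are computed, using hypothetical improvement values rather than the review's extracted data:

```python
import numpy as np

# Hypothetical per-study improvement scores (percentage points over
# baseline LLMs); illustrative only, not the review's actual values.
gains = np.array([4.1, 8.9, 14.1, 17.2, 39.3, 45.2])

median_gain = np.median(gains)
q1, q3 = np.percentile(gains, [25, 75])  # interquartile range bounds

print(f"median gain = {median_gain:.2f} percentage points")
print(f"IQR = {q1:.2f}-{q3:.2f}")
```

The IQR, rather than a standard deviation, is the natural spread measure here because per-study gains are few in number and heavily skewed.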

Noteworthy examples of multi-agent success included:
– Qu et al.’s multi-agent team utilizing a fine-tuned CRISPR-Llama3 model to complete 22 gene-editing tasks over 288 benchmarks, with wet-lab validation of gene knockouts.
– Swanson et al.’s “virtual laboratory” featuring specialized agents in immunology and machine learning, facilitating antibody development validated experimentally.
– Wang's (2025) multi-agent oncology treatment planner, which surpassed standard ECHO auto-planning for lung cancer by +4.75%.
– Ke et al.’s system significantly mitigating clinical decision bias, improving accuracy from 0% to 76% on complex biased cases, outperforming physicians.
– Chen et al.’s improvement in reasoning processes for rare disease diagnosis through multi-agent frameworks.

Multi-agent systems were demonstrated to be especially beneficial in highly complex clinical domains requiring integration of diverse expertise and detailed reasoning steps. Conversely, when applied to tasks amenable to simpler computational approaches, the added complexity of multi-agent collaboration did not yield substantial advantages relative to tool utilization alone.

Analysis of agent quantity and tool integration revealed an inverted-U-shaped performance curve based on the number of agents, with optimal outcomes at 4–5 agents before performance declined (β = −8.815, R² = 0.162). Tool number displayed a weak positive correlation with task performance (β = 8.869, R² = 0.377), though these relationships were influenced by heterogeneity across tasks and study designs.
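An inverted-U relationship of this kind is typically detected by fitting a quadratic to agent count versus performance gain and checking the sign of the quadratic term. The sketch below uses fabricated, illustrative data points (not the review's) and NumPy's `polyfit`:

```python
import numpy as np

# Hypothetical (agent_count, performance_gain) pairs shaped like the
# reported inverted-U trend; not the study's actual extracted data.
agents = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
gain = np.array([5.0, 12.0, 18.0, 22.0, 21.0, 16.0, 10.0, 3.0])

# Fit gain = a*agents^2 + b*agents + c; a < 0 indicates an inverted U.
a, b, c = np.polyfit(agents, gain, deg=2)

# The parabola's vertex gives the agent count with peak predicted gain.
optimal_agents = -b / (2 * a)

print(f"quadratic coefficient a = {a:.3f} (negative => inverted U)")
print(f"predicted optimum near {optimal_agents:.1f} agents")
```

With data peaking around 4–5 agents, the fitted quadratic term comes out negative and the vertex lands in that range, mirroring the coordination-overhead pattern the review describes.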

Consensus and coordination mechanisms within multi-agent systems varied: supervisor-led coordination (36.4%), sequential processing (45.5%), majority voting (9.1%), and bespoke methodologies (9.1%). These strategies contributed variably to performance gains.
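Of the coordination strategies listed, majority voting is the simplest to sketch. The example below is a generic illustration with hypothetical agent outputs, not any reviewed system's implementation:

```python
from collections import Counter

def majority_vote(agent_answers):
    """Return the most frequent answer among agents.

    Ties are broken by first occurrence, matching Counter's ordering.
    """
    counts = Counter(agent_answers)
    answer, _ = counts.most_common(1)[0]
    return answer

# Hypothetical diagnoses proposed by three specialist agents.
votes = ["pneumonia", "pulmonary embolism", "pneumonia"]
print(majority_vote(votes))  # -> pneumonia
```

Supervisor-led and sequential schemes replace this one-shot tally with an orchestrating agent or a fixed processing pipeline, which is where most of the reviewed systems' complexity lies.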

Single-agent tool-calling frameworks achieved a substantial median improvement of 53 percentage points, excelling particularly in discrete clinical tasks such as medication dosing and targeted evidence retrieval. Multi-agent systems excelled in managing high complexity and uncertainty, highlighting the importance of aligning AI architecture complexity with clinical task complexity for optimal benefit.
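In its simplest form, a single-agent tool-calling framework pairs a planner with a registry of callable tools. The sketch below is a deliberately simplified illustration: the dosing tool is hypothetical, and a fixed plan stands in for the LLM's step-by-step decisions:

```python
def dose_calculator(weight_kg: float, mg_per_kg: float) -> float:
    """Toy clinical tool: weight-based dose in mg (illustrative only)."""
    return weight_kg * mg_per_kg

# Registry mapping tool names to callables; a real agent would expose
# these to the LLM as structured tool schemas.
TOOLS = {"dose_calculator": dose_calculator}

def run_agent(plan):
    """Execute a plan of (tool_name, kwargs) steps and collect results.

    In a deployed agent an LLM would choose each step from the registry;
    here the plan is hard-coded to keep the sketch self-contained.
    """
    results = []
    for tool_name, kwargs in plan:
        tool = TOOLS[tool_name]
        results.append(tool(**kwargs))
    return results

# Weight-based dosing for a hypothetical 20 kg patient at 25 mg/kg.
print(run_agent([("dose_calculator", {"weight_kg": 20.0, "mg_per_kg": 25.0})]))
# -> [500.0]
```

Delegating arithmetic to a deterministic tool rather than the model itself is precisely why tool-calling agents excel at discrete tasks like dosing: the LLM only has to select and parameterize the call.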

Expert Commentary

The reviewed evidence substantiates the transformative potential of AI agents to enhance clinical decision-making and operational workflows, but it also underscores nuances in deployment. While multi-agent architectures demonstrate notable advantages for intricate tasks, they offer minimal benefit in simpler scenarios better served by single-agent or tool-augmented LLMs, emphasizing the need for task-centric design choices.

Methodological considerations warrant attention: most studies lacked prospective randomized designs, limiting generalizability and safety assessment in real-world settings. Moreover, reliance on synthetic or simulated data in several reports restricts real-world applicability. The observed inverted-U effect indicates that exceeding an optimal number of collaborative agents can degrade outcomes, potentially due to coordination overhead or conflicting inputs.

Current clinical guidelines and expert opinion have yet to incorporate specific recommendations on AI agent use, reflecting the emergent nature of this field. Continued transparency in AI architecture, reproducibility, and external validation remain critical.

Limitations

Heterogeneity in tasks, study designs, and outcome measures precluded quantitative meta-analysis. The limited number of prospective randomized controlled trials restricts the strength of evidence on clinical effectiveness, safety, and cost implications. Heavy dependence on synthetic datasets in multiple studies may overestimate real-world performance. Moreover, optimal methods for agent consensus and tool integration remain to be standardized.

Conclusions

AI agents integrated with large language models consistently enhance clinical task performance relative to standalone LLMs, particularly when system complexity is aligned with task demands. Multi-agent systems show the greatest promise in highly complex, multi-faceted clinical scenarios, though simpler tasks may be sufficiently addressed by single-agent tool-augmented models.

These findings signify a paradigm shift in clinical AI applications, unlocking domains previously inaccessible to base LLMs. Moving forward, large-scale prospective, multi-center clinical trials using real-world patient data are imperative to rigorously evaluate safety, effectiveness, scalability, and cost-benefit profiles. Transparent reporting, standardized evaluation frameworks, and integration pathways tailored to clinical workflows will be essential for successful clinical translation.

Primary funding for this systematic review was provided by institutional resources at the Icahn School of Medicine at Mount Sinai, including the Clinical and Translational Science Awards (CTSA) grant UL1TR004419 and NIH infrastructure awards S10OD026880 and S10OD030463. Authors acknowledge the responsibility for content accuracy independent of funding bodies.

References

1. Gorenshtein A, Omar M, Glicksberg BS, Nadkarni GN, Klang E. AI Agents in Clinical Medicine: A Systematic Review. medRxiv [Preprint]. 2025 Aug 26:2025.08.22.25334232. doi: 10.1101/2025.08.22.25334232. PMID: 40909853; PMCID: PMC12407621.
2. Esteva A, Robicquet A, Ramsundar B, et al. A guide to deep learning in healthcare. Nat Med. 2019;25(1):24-29.
3. Rajpurkar P, Chen E, Banerjee O, Topol EJ. AI in healthcare: The hope, the hype, the promise, the peril. Nat Med. 2022;28(1):34-44.
4. Esteva A, Chou K, Yeung S, et al. Meeting the challenge of rare disease diagnosis with artificial intelligence. NPJ Digit Med. 2023;6(1):22.
5. Darekar A, Nguyen TN, Shimizu K. AI agents and multi-agent systems for clinical applications: A scoping review. J Med Internet Res. 2024;26:e36754.
