Foundation Models Narrow the Knowledge Gap in Ophthalmology but Struggle with Images

Highlights

• In a cross-sectional evaluation of offline FRCOphth Part 2 preparation questions, seven foundation models (FMs) showed strong performance on textual multiple-choice items; the best-performing FM (Claude 3.5 Sonnet) achieved 77.7% accuracy, comparable with expert ophthalmologists.

• Multimodal performance (questions that included images or other non-text inputs) remained substantially lower: the top multimodal FM (GPT-4o) scored 57.5%, underperforming expert clinicians and trainees.

• Results suggest immediate utility for FMs in answering textual ophthalmology queries and education, but highlight the current limitations of multimodal reasoning and the need for domain-specific multimodal training, calibration, and prospective validation.

Background

Ophthalmology is a highly visual specialty; diagnostic decisions routinely rely on fundus photographs, optical coherence tomography (OCT), slit-lamp images, and tabulated clinical data. As foundation models (FMs) evolve to process both language and visual inputs, their potential to support education, triage, and clinical workflows in ophthalmology is attractive. Most prior evaluations of large language models (LLMs) in medicine focused on text-only tasks (clinical vignettes, board-style questions), reporting rapid improvements across iterations of model families. However, rigorous, head-to-head assessments of contemporary FMs that include multimodal inputs (images, charts, tables) remain limited, particularly within specialty exams that test both knowledge and image interpretation skills.

Study design

This cross-sectional study (Rocha et al., JAMA Ophthalmol, 2025) assessed seven foundation models: GPT-4o (OpenAI), Gemini 1.5 Pro (Google), Claude 3.5 Sonnet (Anthropic), Llama-3.2-11B (Meta), DeepSeek V3 (DeepSeek), Qwen2.5-Max (Alibaba Cloud), and Qwen2.5-VL-72B (Alibaba Cloud). The models were asked to answer offline multiple-choice questions drawn from a widely used preparation textbook for the Fellowship of the Royal College of Ophthalmologists (FRCOphth) Part 2 written examination. Questions included text-only items and multimodal items that incorporated images or other visual data.

Comparator human groups included junior physicians, ophthalmology trainees, and expert ophthalmologists. The primary outcome was accuracy, defined as the proportion of model-generated answers that matched the textbook’s labeled letter answer. Statistical comparisons between models and human groups were reported with differences, 95% confidence intervals, and P values where appropriate.
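
For readers who want to reproduce this kind of analysis, the following is a minimal sketch (not the authors' code) of the primary outcome and the style of comparison reported: accuracy against the answer key, and an approximate Wald 95% confidence interval for a difference between two proportions. The group sizes in the example are hypothetical.

```python
# Minimal sketch (not the study's code): accuracy against an answer key and a
# Wald 95% CI for the difference between two proportions. Group sizes are
# hypothetical and for illustration only.
import math

def accuracy(predicted, reference):
    """Proportion of answers that match the labeled letter answers."""
    return sum(p == r for p, r in zip(predicted, reference)) / len(reference)

def diff_proportions_ci(p1, n1, p2, n2, z=1.96):
    """Difference p1 - p2 with an approximate Wald 95% confidence interval."""
    diff = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff, diff - z * se, diff + z * se

# Accuracies from the text, with hypothetical group sizes of 300 questions each:
diff, lo, hi = diff_proportions_ci(0.777, 300, 0.699, 300)
print(f"difference = {diff:.1%} (95% CI, {lo:.1%} to {hi:.1%})")
```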

Key findings

Textual question performance

On text-only multiple-choice questions, Claude 3.5 Sonnet achieved the highest accuracy at 77.7%. The rank order and reported accuracies were as follows: Claude 3.5 Sonnet (77.7%), GPT-4o (69.9%), Qwen2.5-Max (69.3%), DeepSeek V3 (63.2%), Gemini 1.5 Pro (62.6%), Qwen2.5-VL-72B (58.3%), and Llama-3.2-11B (50.7%).

Compared with clinician groups, Claude 3.5 Sonnet outperformed ophthalmology trainees (difference 9.0%; 95% CI, 2.4%–15.6%; P = .01) and junior physicians (difference 35.2%; 95% CI, 28.3%–41.9%; P < .001). Its performance was comparable with expert ophthalmologists (difference 1.3%; 95% CI, −5.1% to 7.4%; P = .72).

GPT-4o (69.9%) notably outperformed earlier OpenAI models included for reference: GPT-4 (difference 8.5%; 95% CI, 1.1%–15.8%; P = .02) and GPT-3.5 (difference 21.8%; 95% CI, 14.3%–29.2%; P < .001), underscoring continued improvement across successive FM releases on text tasks.

Multimodal question performance

Multimodal items, which required interpretation of images or combined visual-textual reasoning, revealed a substantial drop in FM performance. GPT-4o led the evaluated models with 57.5% accuracy. Other multimodal results included Claude 3.5 Sonnet (47.5%), Qwen2.5-VL-72B (45.0%), Gemini 1.5 Pro (35.0%), and Llama-3.2-11B (25.0%).

In clinician comparisons, GPT-4o scored higher than the junior physician group (difference 15.0%; 95% CI, −6.7% to 36.7%; P = .18) and lower than expert ophthalmologists (accuracy range 70.0%–85.0%; P = .16) and ophthalmology trainees (accuracy range 62.5%–80.0%; P = .35), although none of these differences reached statistical significance. The point estimates nonetheless favored GPT-4o over less experienced clinicians, while the multimodal gap relative to experts remained clinically meaningful.

Interpretation of results

These findings indicate state-of-the-art FMs now rival experienced clinicians on text-only examination-style questions in ophthalmology, but the benefits do not fully translate to multimodal tasks that approximate real-world ophthalmic interpretation. Superior performance on text items suggests potential utility in education (exam preparation, question explanation), decision support for straightforward textual queries, and as a knowledge retrieval adjunct. Conversely, the multimodal weaknesses signal caution for clinical deployment where image interpretation is central (e.g., retinal disease triage, OCT interpretation) without substantial human oversight or specialized model retraining.

Expert commentary and critical appraisal

Strengths of the study include a head-to-head comparison of multiple contemporary foundation models, inclusion of multimodal items, and benchmarking against different clinician experience levels. The use of an exam-preparation textbook produces a standardized reference answer key, facilitating reproducibility.

Key limitations and potential confounders should temper interpretation. First, the dataset derives from a single exam-preparation source; question style, difficulty distribution, and possible overlap with corpora used during FM pretraining could influence model performance. Second, offline testing of models on textbook items does not replicate real-world image acquisition variability (lighting, resolution, artifacts) and the typical clinical context where patient history, prior imaging, and real-time interaction matter. Third, accuracy alone is a limited metric; calibration (confidence vs correctness), explanation quality, and propensity to hallucinate are essential for clinical trustworthiness but were not reported in detail in the summary data provided.
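
To make the calibration point concrete, here is a minimal sketch, assuming per-question confidence scores could be elicited from each model; the metric (expected calibration error) and the example inputs are our illustration and were not reported in the study.

```python
# Illustrative only: expected calibration error (ECE), the gap between stated
# confidence and observed accuracy averaged over confidence bins. Inputs below
# are hypothetical; no per-question confidences were reported in the study.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by the fraction of items in the bin
    return ece

# Hypothetical per-question confidences and correctness flags:
print(expected_calibration_error([0.95, 0.60, 0.80, 0.70], [1, 0, 1, 0]))
```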

From a methodological standpoint, the operating conditions (prompt engineering, image preprocessing, allowed model context, and whether chain-of-thought prompting was used) can materially affect FM outputs, and limited transparency about these operational details constrains reproducibility and generalizability.
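
As an illustration of what transparent reporting might cover, here is a minimal sketch of an evaluation configuration; every field name and value is hypothetical rather than drawn from the study.

```python
# Hypothetical sketch of the operating conditions a reproducible FM evaluation
# would report; none of these values are taken from the study.
from dataclasses import dataclass, asdict
import json

@dataclass
class EvalConfig:
    model_version: str        # exact model/version string queried
    temperature: float        # decoding temperature (0.0 for near-deterministic output)
    prompt_template: str      # full instruction sent with each question
    chain_of_thought: bool    # whether step-by-step reasoning was requested
    image_preprocessing: str  # resizing/compression applied to multimodal items
    attempts_per_item: int    # single attempt vs. majority vote over repeats

config = EvalConfig(
    model_version="example-model-2025-01",
    temperature=0.0,
    prompt_template="Answer with a single letter (A-E) only.",
    chain_of_thought=False,
    image_preprocessing="JPEG, long edge resized to 1024 px",
    attempts_per_item=1,
)
print(json.dumps(asdict(config), indent=2))
```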

Clinical and translational implications

For clinicians and educators, the pragmatic takeaways are:

• Education: FMs with strong text capabilities can be used as interactive study aids, to generate explanations of correct answers, and to support formative assessment. They risk propagating errors when applied to ambiguous or image-dependent questions without proper verification.

• Decision support: Text-based clinical decision support (e.g., summarizing guidelines, interpreting lab tables, drafting referral letters) appears feasible. For tasks where image interpretation is essential, current out-of-the-box FMs should be used cautiously and integrated with clinician oversight.

• Research and development: The performance gap on multimodal items supports targeted investment in ophthalmic vision-language datasets and fine-tuning of FMs for domain-specific imaging (fundus, OCT, slit lamp) and structured clinical data. Prospective clinical validation, evaluation of safety endpoints, and human-in-the-loop workflows are needed before clinical deployment.

Future directions

Priority areas to improve multimodal FM performance in ophthalmology include:

• Curated multimodal datasets that capture clinical diversity: realistic imaging artifacts, multi-device variability, and broad disease prevalence are needed for fine-tuning and external validation.

• Hybrid architectures: combining specialized vision models (trained on ophthalmic images) with large language models via retrieval-augmented and modular fusion techniques may preserve the strengths of each modality; a schematic sketch of this modular approach appears after this list.

• Explainability and calibration: systems must provide interpretable rationales tied to specific image features and report calibrated confidence scores to support clinician decision-making.

• Prospective clinical trials and real-world testing: evaluation pathways should measure diagnostic accuracy, patient outcomes, workflow efficiency, and unintended harms (false reassurance, over-referral, bias).
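
To make the hybrid-architecture bullet concrete, here is a minimal, purely illustrative sketch: every function is a hypothetical stand-in (not an API from the study or any named library), showing only how a specialized vision model's structured output could be handed to a general-purpose language model as text.

```python
# Hypothetical modular-fusion sketch: a domain-specific vision model reduces the
# image to structured findings, which are then passed to a language model as
# text. Both functions are stand-ins, not real APIs.

def classify_fundus(image_path: str) -> dict:
    """Stand-in for a specialized ophthalmic vision model."""
    # A real system would run a trained classifier or segmenter here.
    return {"findings": "hard exudates and scattered microaneurysms",
            "confidence": 0.83}

def build_llm_prompt(vision_output: dict, question: str) -> str:
    """Convert structured vision output into a textual prompt for an LLM."""
    return (
        f"Image findings (vision model, confidence {vision_output['confidence']:.2f}): "
        f"{vision_output['findings']}.\n"
        f"Question: {question}\n"
        "Answer with a single letter and a one-sentence rationale."
    )

prompt = build_llm_prompt(classify_fundus("example_fundus.jpg"),
                          "What is the most likely diagnosis?")
print(prompt)
```

One advantage of this decomposition is that the language model never sees raw pixels; it reasons over auditable findings, so each component can be validated and updated independently.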

Conclusion

Rocha et al. demonstrate that contemporary foundation models approach expert-level performance on text-only ophthalmology exam questions, offering immediate value for education and certain text-based clinical tasks. Yet, multimodal reasoning — the ability to integrate images and text as ophthalmologists do — remains a clear limitation. The clinical promise of FMs in ophthalmology will require targeted multimodal data curation, domain-specific fine-tuning, transparent evaluation of failure modes, and rigorous prospective validation with human oversight before broad clinical adoption.

Funding and clinicaltrials.gov

Funding: Not specified in the article summary reviewed here. Readers should refer to the original JAMA Ophthalmology publication for declared funding and disclosures.

References

1. Rocha H, Chong YJ, Thirunavukarasu AJ, et al. Performance of Foundation Models vs Physicians in Textual and Multimodal Ophthalmological Questions. JAMA Ophthalmol. 2025 Nov 13:e254255. doi: 10.1001/jamaophthalmol.2025.4255. Epub ahead of print. PMID: 41231508; PMCID: PMC12616532.

2. Topol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nat Med. 2019 Jan;25(1):44-56. doi: 10.1038/s41591-018-0300-7.

Readers interested in implementation should consult the full JAMA Ophthalmology article for methodological specifics, as well as current regulatory guidance on AI in medical devices and point-of-care clinical decision support.
