Highlights
– A comprehensive narrative review (2020–2025) screened 1,422 articles and synthesized 327 original deep learning (DL) studies in otolaryngology, grouped into detection/diagnosis (55%), segmentation (28%), prediction/prognostics (5%), and emerging applications (12%).
– Proof-of-concept DL models frequently achieved expert-comparable diagnostic accuracy (examples: nasopharyngeal carcinoma detection 92%, laryngeal malignancy 86%, otologic pathology >95%), but prognostic work and prospective, multi-institutional validation remain sparse.
– Key implementation enablers include harmonized multi-center datasets, standardized acquisition protocols, federated learning for privacy-preserving model training, interpretable models, and prospective clinical evaluation with human-in-the-loop workflows.
Background
Otolaryngology encompasses a broad set of diagnostic and therapeutic tasks—endoscopic visualization (nasopharynx, larynx), microscopic otologic assessment, radiologic interpretation, and physiologic signal analysis (audiometry, vestibular testing). Many tasks are image- or signal-driven and therefore well suited to deep learning (DL), a subset of artificial intelligence (AI) that uses multilayered neural networks to learn hierarchical features from raw data. The clinical burden—late-stage head and neck cancers, chronic otitis media, hearing loss—and variable access to specialist expertise motivate interest in algorithmic aids that can improve diagnostic accuracy, triage, intraoperative decision support, and personalized device optimization.
Study design (review scope and methods)
The narrative review by Novi et al. (JAMA Otolaryngol Head Neck Surg. 2025) screened English-language publications from 2020 through 2025 and included 327 original research studies of DL applied to otolaryngology. Studies were categorized into detection and diagnosis (179 studies), prediction and prognostics (16 studies), image segmentation (93 studies), and emerging applications (39 studies). The included literature comprises proof-of-concept retrospective model development and internal validation studies, with relatively few multi-institutional datasets or prospective clinical trials.
Key findings and results
Overall landscape
DL in otolaryngology has proliferated across multiple subdomains: endoscopic image interpretation (nasopharynx, larynx, sinonasal cavities), otologic imaging and tympanic membrane analysis, radiologic tumor detection/segmentation on CT/MRI, and physiologic signal optimization for hearing devices. The majority of studies focused on classification tasks (can a model detect disease?), followed by segmentation (delineating anatomy or lesions) and a smaller number addressing prognostic prediction (survival, recurrence) or real-time intraoperative assistance (instrument tracking, landmark identification).
Detection and diagnostic performance
Of the 179 detection-focused studies, many reported performance metrics (accuracy, sensitivity, specificity, area under the ROC curve) comparable with or, in some tasks, above single-expert performance. Representative reported figures from the review include:
- Nasopharyngeal carcinoma detection: reported pooled accuracy around 92% in select image-based models.
- Laryngeal malignancy classification: reported accuracy approximately 86% in validated datasets.
- Otologic pathology (e.g., tympanic membrane disease): reported accuracy often >95% in curated image sets.
These performance figures generally derive from retrospective, often single-center datasets and frequently used internal cross-validation or holdout sets; external validation against geographically distinct cohorts was less common.
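To make the reported metrics concrete, here is a minimal, illustrative sketch (not drawn from the review; the labels, scores, and threshold are hypothetical) of how the detection studies' headline numbers are computed from binary ground-truth labels and model scores, with AUC via the rank-based (Mann-Whitney U) formulation:

```python
# Illustrative sketch: common binary detection metrics. Example data and
# threshold are hypothetical, not taken from any included study.

def confusion_counts(labels, preds):
    """Count TP, FP, TN, FN for binary labels (1 = disease present)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    return tp, fp, tn, fn

def detection_metrics(labels, scores, threshold=0.5):
    preds = [1 if s >= threshold else 0 for s in scores]
    tp, fp, tn, fn = confusion_counts(labels, preds)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / len(labels)
    # AUC via the Mann-Whitney U formulation: the probability that a
    # randomly chosen positive case scores higher than a negative one.
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    auc = wins / (len(pos) * len(neg))
    return {"sensitivity": sensitivity, "specificity": specificity,
            "accuracy": accuracy, "auc": auc}

# Hypothetical scores for 8 endoscopic images (1 = malignant on pathology)
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
print(detection_metrics(labels, scores))
```

Note that a single accuracy figure hides the threshold dependence visible here: the same model yields different sensitivity/specificity trade-offs at different operating points, which is why AUC and full operating-point reporting matter for cross-study comparison.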
Image segmentation
Ninety-three studies addressed segmentation, delineating anatomical regions (airway, paranasal sinuses, tumor boundaries) for surgical planning, volumetric measurement, and radiation therapy target delineation. In controlled datasets, segmentation models showed consistent performance (Dice similarity coefficients frequently reported in clinically acceptable ranges), supporting downstream tasks such as automated measurement and image registration.
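For readers unfamiliar with the Dice similarity coefficient used to benchmark these segmentation models, a minimal sketch follows (illustrative only; the masks are hypothetical, and real pipelines operate on 2D/3D voxel arrays rather than flat lists):

```python
# Illustrative sketch: Dice similarity coefficient between a predicted
# segmentation mask and a reference (ground-truth) mask.
# Dice = 2*|A ∩ B| / (|A| + |B|); 1.0 = perfect overlap, 0.0 = none.

def dice_coefficient(pred, truth):
    intersection = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    size_sum = sum(pred) + sum(truth)
    if size_sum == 0:          # both masks empty: define as perfect agreement
        return 1.0
    return 2.0 * intersection / size_sum

# Hypothetical 10-voxel masks for a sinonasal lesion
truth = [0, 1, 1, 1, 1, 0, 0, 0, 0, 0]
pred  = [0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
print(dice_coefficient(pred, truth))   # → 0.75
```

Because Dice weights overlap relative to combined mask size, it is insensitive to the large true-negative background that dominates medical images, which is why it is preferred over plain voxel accuracy for lesion delineation.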
Prediction and prognostics
Prognostic applications were comparatively limited (16 studies) but promising. Examples included survival stratification in oropharyngeal cancer and recurrence prediction in chronic rhinosinusitis. These studies often combined imaging features with clinical metadata to enhance predictive power. However, the small number of studies and heterogeneous endpoints limit definitive conclusions about clinical utility.
Emerging intraoperative and device optimization applications
Emerging uses highlighted by the review included real-time surgical instrument tracking, intraoperative landmark identification (useful for minimally invasive skull base and endoscopic sinus surgery), and optimization algorithms for cochlear implant mapping and hearing-aid personalization. These applications emphasize low-latency inference and human–machine interfaces, but few have progressed beyond early feasibility demonstrations.
Methodological observations
Common methodological themes included heavy reliance on supervised learning with annotated datasets, variable reporting of dataset demographics and labeling protocols, frequent class imbalance, limited external testing, and inconsistent reporting of confidence intervals or calibration metrics. Explainability methods (saliency maps, attention visualization) were employed in some studies but rarely rigorously evaluated with clinician end-users.
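As an example of the calibration metrics the review finds inconsistently reported, here is a minimal sketch of expected calibration error (ECE): predictions are binned by confidence, and ECE is the sample-weighted mean gap between each bin's average confidence and its observed event rate. The probabilities, labels, and bin count below are hypothetical.

```python
# Illustrative sketch: expected calibration error (ECE) for a binary
# classifier. A well-calibrated model has ECE near 0: among cases given
# probability ~0.8, roughly 80% should actually have the disease.

def expected_calibration_error(probs, labels, n_bins=5):
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)   # bin by predicted probability
        bins[idx].append((p, y))
    ece, n = 0.0, len(probs)
    for b in bins:
        if not b:
            continue
        avg_conf = sum(p for p, _ in b) / len(b)
        frac_pos = sum(y for _, y in b) / len(b)  # observed event rate in bin
        ece += (len(b) / n) * abs(avg_conf - frac_pos)
    return ece

# Hypothetical model probabilities for "disease present" and true labels
probs  = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
labels = [1,   1,   0,   0,   0,   1]
print(expected_calibration_error(probs, labels))
```

A model can post high accuracy yet be badly calibrated, which matters clinically when a probability (rather than a binary call) drives triage or shared decision-making.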
Expert commentary: strengths, limitations, and caveats
Strengths: The collective body of work demonstrates that DL can extract clinically relevant patterns from endoscopic images, radiology, and physiologic signals in otolaryngology. Where high-quality annotated datasets exist, models frequently attain performance approaching specialist clinicians and provide reproducible segmentation that may streamline workflow.
Limitations and critical caveats:
- Dataset representativeness and bias: Many datasets are single-center, enriched for positive cases, or lack sociodemographic diversity. Risk of spectrum bias and reduced generalizability is high.
- External and prospective validation: Few studies report multicenter external validation or prospective impact analyses that measure patient-centered outcomes (diagnostic delays avoided, changes in treatment decisions, harm reduction).
- Interpretability and clinician trust: Post hoc visualization (e.g., heatmaps) is helpful but insufficient. Clinicians need transparent models that provide reasoning, uncertainty quantification, and clear failure modes.
- Regulatory and integration hurdles: Real-world deployment requires robust pipelines for image acquisition standardization, data governance, HIPAA-compliant architectures, and regulatory clearances that account for algorithm updates.
- Operational considerations: Latency, user interface design, and integration into existing electronic health records and operative workflows are frequently under-addressed.
Pathways to clinical adoption: practical recommendations
To move from proof-of-concept to routine clinical tools, the field should prioritize:
- High-quality, multi-institutional datasets: Shared, well-annotated datasets with standardized acquisition protocols and clear labels enable robust training and external validation.
- Federated and privacy-preserving learning: Federated learning can increase sample diversity while preserving patient privacy and institutional data control.
- Standardized reporting and prospective validation: Adoption of established AI reporting frameworks and prospective, ideally randomized, clinical impact studies that measure diagnostic accuracy, workflow efficiency, and patient outcomes.
- Interpretability and uncertainty quantification: Models should provide actionable explanations and calibrated probabilities; human-in-the-loop systems can allow clinicians to override or confirm algorithmic suggestions.
- Bias mitigation and equity testing: Routine subgroup analyses for race, age, sex, device type, and imaging equipment; mitigation strategies to prevent amplification of health disparities.
- Cross-disciplinary collaborations: Clinicians, data scientists, engineers, ethicists, and regulatory experts must co-design models and deployment strategies.
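The federated learning strategy recommended above can be sketched as federated averaging (FedAvg): each site trains on its own data and shares only model parameters, which a coordinator averages weighted by local sample count. The toy "model" below is a single weight vector fit by least squares; production frameworks (e.g., Flower, NVIDIA FLARE) add secure aggregation, encryption, and governance on top of this core loop. All data and parameters here are hypothetical.

```python
# Hedged sketch of federated averaging (FedAvg). Raw patient data never
# leaves a site; only model parameters are exchanged.

def local_update(weights, site_data, lr=0.1):
    """One gradient step on a site's private data
    (least-squares fit of y = w*x, deliberately simple)."""
    grad = [0.0] * len(weights)
    for x, y in site_data:
        pred = sum(w * xi for w, xi in zip(weights, x))
        for i, xi in enumerate(x):
            grad[i] += 2 * (pred - y) * xi / len(site_data)
    return [w - lr * g for w, g in zip(weights, grad)]

def federated_average(site_models, site_sizes):
    """Average parameters across sites, weighted by local sample counts."""
    total = sum(site_sizes)
    dim = len(site_models[0])
    return [sum(m[i] * n for m, n in zip(site_models, site_sizes)) / total
            for i in range(dim)]

# Two hypothetical institutions with locally held data (true slope ~ 2)
site_a = [([1.0], 2.0), ([2.0], 4.0)]
site_b = [([1.0], 2.2), ([3.0], 6.1)]
global_w = [0.0]
for _ in range(50):                            # communication rounds
    models = [local_update(global_w, site_a),
              local_update(global_w, site_b)]
    global_w = federated_average(models, [len(site_a), len(site_b)])
print(global_w)                                # converges near w ≈ 2
```

The design point is that diversity comes from the aggregation step, not from pooling records: each institution's case mix shapes the shared model without its images or charts ever crossing an institutional boundary.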
Conclusion
The narrative review synthesizes an accelerating literature showing that deep learning has genuine potential for image-based diagnosis, segmentation, prognostic modeling, and intraoperative support in otolaryngology. However, most published work remains at the proof-of-concept stage, with a pressing need for representative, multi-institutional datasets, transparent models, and rigorous prospective validation that demonstrate meaningful clinical impact. A measured, interdisciplinary approach—combining federated data strategies, interpretability frameworks, and human-in-the-loop deployment—will be essential to translate algorithmic promise into safe, equitable clinical tools.
Funding and ClinicalTrials.gov
For funding statements and trial registrations related to the studies included in the review, refer to the original article: Novi SL, Navarathna N, D’Cruz M, Brooks JR, Maron BA, Isaiah A. Deep Learning in Otolaryngology: A Narrative Review. JAMA Otolaryngol Head Neck Surg. 2025 Nov 13. doi: 10.1001/jamaoto.2025.3911. Epub ahead of print. PMID: 41231484.
References
1. Novi SL, Navarathna N, D’Cruz M, Brooks JR, Maron BA, Isaiah A. Deep Learning in Otolaryngology: A Narrative Review. JAMA Otolaryngol Head Neck Surg. 2025 Nov 13. doi: 10.1001/jamaoto.2025.3911. Epub ahead of print. PMID: 41231484.

