Comparative Diagnostic Performance of Global Thyroid Ultrasound Risk Stratification Systems: The Impact of Nodule Size and Lexicon Standardization

Highlights

Significant heterogeneity exists among the five major thyroid risk stratification systems (RSSs) regarding nodule classification and malignancy risk estimation.
The American Thyroid Association (ATA) and EU-TIRADS prioritize sensitivity, leading to higher rates of unnecessary biopsies across all nodule sizes.
The ACR TI-RADS maintains the lowest unnecessary biopsy rate (UBR) but at the cost of reduced sensitivity compared to other international systems.
Nodule size (>2 cm vs. ≤2 cm) is a critical determinant of diagnostic performance, with K-TIRADS showing marked sensitivity shifts between size cohorts.

Background

The management of thyroid nodules has been transformed by the development of ultrasound (US) risk stratification systems (RSSs), commonly known as Thyroid Imaging Reporting and Data Systems (TIRADS). These systems aim to standardize nodule assessment and reduce unnecessary interventions. However, the proliferation of societal guidelines, including those from the American College of Radiology (ACR), the American Thyroid Association (ATA), and European (EU), Korean (K), and Chinese (C) societies, has produced considerable heterogeneity in clinical practice.

One of the primary challenges in clinical practice is the lack of a unified lexicon, compounded by varying size thresholds for fine-needle aspiration biopsy (FNAB). As a result, the same nodule can receive different recommendations depending on which system is applied. Recent international expert consensus efforts have proposed a standardized US lexicon to bridge these gaps. Understanding how these systems perform relative to one another, particularly when stratified by nodule size, is essential for refining clinical decision-making and moving toward a globally unified RSS.

Key Content

Methodological Framework and Standardized Lexicon

In a retrospective analysis covering March 2017 to February 2024, researchers evaluated 3,774 thyroid nodules larger than 1 cm. Each nodule was described with a standardized US lexicon and then classified according to the specific criteria of the ATA, EU-TIRADS, K-TIRADS, ACR TI-RADS, and C-TIRADS. This approach allowed a direct comparison of the systems’ intrinsic logic without the confounding effect of disparate descriptive terminology.

Comparative Distribution and Malignancy Risk

The study revealed stark differences in how nodules are distributed across risk categories. Inter-system agreement was highly variable, with kappa (κ) values ranging from 0.05 to 0.85. This suggests that while some systems align closely on high-risk features, they treat intermediate- and low-risk nodules very differently. Moreover, the malignancy risk associated with ostensibly similar categories (e.g., ‘High Suspicion’ vs. ‘TIRADS 5’) differed significantly across the five systems (p < 0.001).
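To make the agreement statistic concrete, the following is a minimal sketch of unweighted Cohen's kappa for two systems applied to the same nodules. The source does not say whether κ was computed on risk categories or on biopsy recommendations, nor whether a weighted variant was used, so the binary biopsy-indicated labels and the function name `cohen_kappa` are illustrative assumptions, and the ten data points are invented.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of nodules given the same label by both systems.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each system's marginal label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # convention: both systems give every nodule the same single label
    return (observed - expected) / (1.0 - expected)


# Hypothetical data: 1 = biopsy indicated, 0 = not indicated, for ten nodules.
system_x = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
system_y = [1, 1, 1, 0, 1, 0, 0, 0, 1, 0]
print(round(cohen_kappa(system_x, system_y), 2))  # 0.4
```

In this toy example the two systems agree on 7 of 10 nodules, but half of that agreement is expected by chance, which is why κ is only 0.4.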

Diagnostic Performance by Nodule Size

The study divided the cohort into small nodules (≤2 cm) and large nodules (>2 cm) to assess the impact of size on biopsy criteria performance:

ATA and EU-TIRADS: These systems demonstrated the highest sensitivity for both small and large nodules. However, this high sensitivity was coupled with a significantly higher unnecessary biopsy rate (UBR). For clinicians, this represents a trade-off: a lower likelihood of missing a malignancy but a higher burden of low-yield invasive procedures.
ACR TI-RADS: In contrast, the ACR system showed the lowest sensitivity but also the lowest UBR across both size groups. The ACR’s point-based system and higher size thresholds for biopsy effectively filter out many benign nodules that other systems would target for FNAB.
K-TIRADS: The Korean system exhibited a unique size-dependent performance shift. For small nodules, its sensitivity and UBR were among the lowest. For large nodules, however, both its sensitivity and its UBR increased dramatically, aligning more closely with the ATA and EU-TIRADS.
C-TIRADS: The Chinese system showed low sensitivity similar to ACR TI-RADS but a higher UBR, suggesting that its criteria for risk assignment may be less effective at excluding benign nodules in this cohort.
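The trade-off described for these systems can be illustrated with a short sketch that computes sensitivity and UBR from per-nodule biopsy recommendations. The source does not state how UBR was defined; here it is assumed to be the share of biopsy-indicated nodules that are benign (other definitions divide by all benign nodules or by all nodules). The function name `biopsy_metrics` and the ten-nodule data set are hypothetical.

```python
def biopsy_metrics(biopsy_indicated, malignant):
    """Sensitivity and unnecessary biopsy rate (UBR) of a biopsy criterion.

    Assumed definitions (not confirmed by the source):
      sensitivity = malignant nodules with biopsy indicated / all malignant nodules
      UBR         = benign nodules with biopsy indicated / all biopsy-indicated nodules
    """
    if len(biopsy_indicated) != len(malignant):
        raise ValueError("inputs must have the same length")
    pairs = [(bool(b), bool(m)) for b, m in zip(biopsy_indicated, malignant)]
    tp = sum(b and m for b, m in pairs)        # malignant, biopsy indicated
    fp = sum(b and not m for b, m in pairs)    # benign, biopsy indicated
    fn = sum(m and not b for b, m in pairs)    # malignant, biopsy not indicated
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    ubr = fp / (tp + fp) if (tp + fp) else float("nan")
    return {"sensitivity": sensitivity, "ubr": ubr}


# Hypothetical cohort: four malignant nodules followed by six benign ones.
malignant = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
permissive = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]   # low thresholds, many biopsies
restrictive = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]  # high thresholds, fewer biopsies

print(biopsy_metrics(permissive, malignant))
print(biopsy_metrics(restrictive, malignant))
```

In this invented cohort the permissive criterion catches every malignancy at the cost of three benign biopsies out of seven, while the restrictive criterion misses one malignancy but biopsies only one benign nodule out of four.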

The Challenge of Large Nodules

The data suggest that the greatest disparities in performance occur in the management of large nodules (>2 cm). The differences in diagnostic performance stem primarily from variations in biopsy size thresholds and in the specific US criteria used to designate ‘no-biopsy-indicated’ status. Large nodules that lack highly suspicious features (e.g., microcalcifications or non-parallel orientation) are handled very differently across the five systems, leading to the observed variance in UBR.
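How size thresholds alone can split recommendations is sketched below for two of the five systems. The cutoffs are the FNAB size thresholds as commonly summarized from the 2017 ACR TI-RADS white paper and the 2015 ATA guidelines listed in the references; they should be verified against those documents, and the study may have applied them differently. Treating a nodule as both ACR "TR3" and ATA "low" suspicion is an assumed pairing for illustration only.

```python
# Minimum maximal diameter (cm) at which FNAB is recommended, per category.
# None means biopsy is not recommended at any size.
# Values are as commonly summarized from the cited guidelines; verify before use.
ACR_TIRADS_FNAB_CM = {"TR1": None, "TR2": None, "TR3": 2.5, "TR4": 1.5, "TR5": 1.0}
ATA_FNAB_CM = {
    "benign": None,
    "very low": 2.0,
    "low": 1.5,
    "intermediate": 1.0,
    "high": 1.0,
}


def biopsy_indicated(thresholds, category, size_cm):
    """Return True if a nodule of this category and size meets the FNAB cutoff."""
    cutoff = thresholds[category]
    return cutoff is not None and size_cm >= cutoff


# Hypothetical 2.2 cm nodule without highly suspicious features.
size_cm = 2.2
print("ACR TI-RADS:", biopsy_indicated(ACR_TIRADS_FNAB_CM, "TR3", size_cm))
print("ATA:", biopsy_indicated(ATA_FNAB_CM, "low", size_cm))
```

Under these assumptions the same 2.2 cm nodule falls below the ACR TI-RADS cutoff but above the ATA cutoff, which is the kind of divergence that drives the differences in UBR for large nodules.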

Expert Commentary

The findings of this large-scale comparison underscore a fundamental tension in thyroid oncology: the balance between diagnostic sensitivity and the prevention of overdiagnosis and overtreatment. The high UBR observed with the ATA and EU-TIRADS systems reflects a conservative approach intended to capture as many malignancies as possible. However, in an era where many thyroid cancers are indolent (such as papillary thyroid microcarcinomas), the ACR TI-RADS approach of prioritizing specificity may be more aligned with modern goals of reducing low-yield procedures and patient anxiety.

The size-dependent performance of K-TIRADS is particularly intriguing. It suggests that the US features traditionally associated with malignancy may manifest differently or carry different predictive weight as a nodule grows. This highlights the need for a size-adjusted or ‘dynamic’ risk stratification model. Furthermore, the wide range of inter-system agreement (κ = 0.05 to 0.85) indicates that we are still far from a ‘universal language’ in thyroid US, despite the use of a standardized lexicon in this study. The differences are not merely semantic; they are embedded in the weighting of US features and the size thresholds chosen for intervention.

Conclusion

The comparative analysis of the ATA, EU-, K-, ACR-, and C-TIRADS systems reveals that they are not interchangeable. The choice of RSS significantly affects both the number of biopsies performed and the sensitivity for detecting thyroid malignancy. Specifically, the differences in diagnostic performance stem mainly from variations in biopsy size thresholds for small nodules and from disparate US criteria for designating large nodules as no-biopsy-indicated.

Future efforts to establish a unified, international TIRADS must focus on optimizing risk stratification for large nodules and harmonizing biopsy thresholds. Until such a unified system is realized, clinicians should be aware of the specific strengths and limitations of the system used in their practice, balancing the high sensitivity of the ATA and EU systems against the lower unnecessary biopsy rate of the ACR TI-RADS.

References

Na DG, Noh BJ, Kim WJ, et al. Diagnostic Performance of Five Societies’ Ultrasound Risk Stratification Systems for Thyroid Malignancy According to Nodule Size: A Comparison Using a Standardized Ultrasound Lexicon. Thyroid. 2026; doi:10.1089/thy.2024.xxxx. PMID: 41789443.
Tessler FN, Middleton WD, Grant EG, et al. ACR Thyroid Imaging, Reporting and Data System (TI-RADS): White Paper of the ACR TI-RADS Committee. J Am Coll Radiol. 2017;14(5):587-595.
Haugen BR, Alexander EK, Bible KC, et al. 2015 American Thyroid Association Management Guidelines for Adult Patients with Thyroid Nodules and Differentiated Thyroid Cancer. Thyroid. 2016;26(1):1-133.