SOFA-2 Recalibrated: Global Development and Validation of an Updated Organ Dysfunction Score in >3.3 Million ICU Admissions

Highlights

– SOFA-2 updates the Sequential Organ Failure Assessment with contemporary organ-support variables and revised thresholds, validated across >3.3 million ICU admissions from 9 countries.
– SOFA-2 showed modestly better discrimination for ICU mortality than the original SOFA (AUROC 0.79 vs 0.77) and maintained predictive validity over ICU days 1–7.
– Gastrointestinal and immune dysfunction were not included due to insufficient data and low content validity, underscoring measurement challenges for some organ systems.

Background

Quantifying the degree of acute organ dysfunction is central to critical care practice, research, benchmarking, and quality measurement. The Sequential Organ Failure Assessment (SOFA) score, first published in 1996, became the standard instrument to describe and track organ dysfunction across six systems (respiratory, cardiovascular, hepatic, coagulation, renal, neurologic). SOFA also underpins diagnostic frameworks such as Sepsis-3 and is widely used as a risk-adjustment covariate in randomized trials and observational studies.

Clinical practice, organ-support technologies, and case mix have evolved substantially in the past three decades. These changes can affect how physiologic variables map to outcomes and therefore may limit the contemporary validity of the original SOFA thresholds. The SOFA-2 project aimed to update the instrument using an international, data-driven approach informed by expert consensus, then validate the revised score in internal and external cohorts representative of varied geographic and resource settings.

Study Design

Ranzani and colleagues conducted a multi-stage development and validation project for SOFA-2 reported in JAMA (Ranzani et al., 2025). The process combined a modified Delphi expert consensus to define candidate organ dysfunction constructs (stages 1–5) with a federated, data-driven analysis across large multicenter cohorts (stages 6–8).

Key design elements:

Data provenance: 1319 intensive care units from 9 countries (Australia, Austria, Brazil, France, Italy, Japan, Nepal, New Zealand, United States), covering 2014–2023.
Sample size: Four multicenter cohorts totaling 2,098,356 patients were used for score development and internal validation, and six external cohorts (1,241,114 patients) provided external validation, giving a combined dataset of 3,339,470 (approximately 3.34 million) encounters.
Primary outcome: ICU mortality; primary performance metric: area under the receiver operating characteristic curve (AUROC) for the score measured on ICU day 1.
Secondary assessments: sequential predictive validity across ICU days 1–7, and examination of component thresholds and associated mortality for each organ domain.
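
For readers less familiar with the primary metric, the AUROC can be read as the probability that a randomly chosen non-survivor has a higher day-1 score than a randomly chosen survivor. The following is a minimal sketch of that definition, not the authors' analysis code; the function name and the toy values are invented for illustration.

```python
from typing import Sequence


def auroc(scores: Sequence[float], died: Sequence[int]) -> float:
    """Probability that a randomly chosen non-survivor has a higher score
    than a randomly chosen survivor, counting ties as one half."""
    pos = [s for s, d in zip(scores, died) if d == 1]
    neg = [s for s, d in zip(scores, died) if d == 0]
    if not pos or not neg:
        raise ValueError("need at least one death and one survivor")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy day-1 scores and ICU outcomes (invented, not study data).
toy_scores = [2, 4, 4, 6, 7, 9, 10, 12]
toy_died = [0, 0, 0, 1, 0, 0, 1, 1]
print(round(auroc(toy_scores, toy_died), 3))
```

An AUROC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of survivors from non-survivors.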

Key Findings

Overall cohort and outcome summary

Across the 3.34 million encounters, ICU mortality was 8.1% (270,108 deaths), with cohort-specific mortality ranging from 4.5% to 20.5%. This wide range likely reflects heterogeneity in case mix, resource availability, and regional practice.
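
The headline figures are internally consistent, which a short arithmetic check makes explicit. The variable names below are only labels for the numbers quoted in this summary.

```python
development = 2_098_356  # development and internal validation cohorts
external = 1_241_114     # external validation cohorts
deaths = 270_108         # ICU deaths reported across all cohorts

total = development + external
mortality_pct = 100 * deaths / total

print(total)                    # 3339470
print(round(mortality_pct, 1))  # 8.1
```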

Main performance results

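The SOFA-2 instrument retained the six organ domains of the original score, here labeled brain, respiratory, cardiovascular, liver, kidney, and hemostasis (corresponding to the neurologic, respiratory, cardiovascular, hepatic, renal, and coagulation systems), but incorporated new variables and revised score thresholds to better reflect contemporary care and the observed distribution of dysfunction across each domain's 0-to-4-point scale.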
Discrimination for ICU mortality measured on day 1 improved modestly: SOFA-2 AUROC 0.79 (95% CI, 0.76–0.81) versus the original SOFA AUROC 0.77 (95% CI, 0.74–0.81).
Predictive validity was preserved when scores were measured sequentially across ICU days 1–7, supporting the instrument’s use for dynamic monitoring of organ dysfunction.
The authors did not incorporate gastrointestinal or immune dysfunction domains into SOFA-2 because the available datasets held insufficient data and there was no consensus on how these domains should be measured (a content validity concern).
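
The six-domain, 0-to-4 structure can be written down as a small data type. This is only a sketch: the class name is hypothetical, it assumes that the total is the simple sum of the domain sub-scores as in the original SOFA (giving a range of 0 to 24), and it deliberately contains no thresholds, since those are defined in the primary publication rather than in this summary.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class DomainScores:
    """One sub-score per organ domain, each an integer from 0 to 4."""
    brain: int
    respiratory: int
    cardiovascular: int
    liver: int
    kidney: int
    hemostasis: int

    def __post_init__(self) -> None:
        # Reject values outside the 0-4 range used for every domain.
        for f in fields(self):
            value = getattr(self, f.name)
            if not isinstance(value, int) or not 0 <= value <= 4:
                raise ValueError(f"{f.name} must be an integer 0-4, got {value!r}")

    def total(self) -> int:
        """Sum of the six domain sub-scores (assumed range 0-24)."""
        return sum(getattr(self, f.name) for f in fields(self))


example = DomainScores(brain=1, respiratory=3, cardiovascular=2,
                       liver=0, kidney=1, hemostasis=0)
print(example.total())  # 7
```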

Interpretation of effect size and discrimination

The numerical improvement in AUROC (approximately +0.02) represents a modest increase in discriminatory ability. The 95% confidence intervals for the two scores overlap, so although the point estimates favor SOFA-2, the incremental gain is small; overlapping marginal intervals are also not a formal test of the difference between two scores measured in the same patients. Nonetheless, modest AUROC improvements can be clinically meaningful when applied to very large populations or when they better align physiologic inputs with modern organ-support practices (for example, when thresholds reflect the availability of continuous renal replacement therapy, high-flow oxygen, or modern vasoactive agents).
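
When two scores are computed in the same patients, the uncertainty of the AUROC difference is better judged from a paired analysis than from comparing two separate intervals. One common option is a paired bootstrap, sketched below on synthetic data. This is an illustration of the general technique, not the method used in the primary publication, and all function names, parameters, and simulated values are assumptions.

```python
import random
from typing import List, Sequence, Tuple


def auroc(scores: Sequence[float], died: Sequence[int]) -> float:
    """Pairwise (Mann-Whitney) AUROC; ties count as one half."""
    pos = [s for s, d in zip(scores, died) if d == 1]
    neg = [s for s, d in zip(scores, died) if d == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def paired_bootstrap_delta(
    old: Sequence[float], new: Sequence[float], died: Sequence[int],
    n_boot: int = 200, seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (point estimate, lower, upper) for AUROC(new) - AUROC(old).

    Patients are resampled with replacement and both scores are evaluated
    on the same resample, so the percentile interval describes the paired
    difference rather than each score separately.
    """
    rng = random.Random(seed)
    n = len(died)
    deltas: List[float] = []
    while len(deltas) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        d = [died[i] for i in idx]
        if sum(d) in (0, n):  # skip resamples with only one outcome class
            continue
        deltas.append(auroc([new[i] for i in idx], d)
                      - auroc([old[i] for i in idx], d))
    deltas.sort()
    lower = deltas[int(0.025 * n_boot)]
    upper = deltas[int(0.975 * n_boot) - 1]
    return auroc(new, died) - auroc(old, died), lower, upper


# Synthetic cohort: both scores are noisy views of one latent severity.
rng = random.Random(42)
died = [1 if rng.random() < 0.2 else 0 for _ in range(150)]
latent = [rng.gauss(6 + 3 * d, 2) for d in died]
old_score = [x + rng.gauss(0, 2) for x in latent]
new_score = [x + rng.gauss(0, 1) for x in latent]

delta, lower, upper = paired_bootstrap_delta(old_score, new_score, died)
print(f"delta AUROC = {delta:.3f} (95% bootstrap interval {lower:.3f} to {upper:.3f})")
```

An interval for the difference that excludes zero would indicate a consistent gain even when the two marginal intervals overlap.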

Component-level and pragmatic observations

By recalibrating thresholds and adding variables that capture modern organ support, SOFA-2 aims to improve content validity (i.e., the score better represents what clinicians mean by ‘organ dysfunction’ today). Maintaining the six-domain structure supports continuity with historical datasets. The explicit exclusion of gastrointestinal and immune domains highlights persistent measurement gaps: such systems may lack robust, readily available bedside indices that correlate with short-term mortality across diverse settings.

Expert Commentary

Strengths

Unprecedented scale and geographic diversity strengthen the external validity of SOFA-2. The federated approach helped assemble massive datasets while protecting local data governance.
Combining Delphi-informed content specification with data-driven thresholding is a pragmatic methodology for updating long-standing clinical scores.
External validation across separate cohorts supports robustness against overfitting.
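
This summary does not describe the mechanics of the federated analysis. As one illustration of how discrimination can be assessed without moving patient-level records, each site could share only a table of deaths and survivors per score value, from which a coordinator can compute a pooled AUROC exactly. The type alias, function names, and site tables below are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Hypothetical site-level summary: score value -> (deaths, survivors).
SiteTable = Dict[int, Tuple[int, int]]


def pool_tables(tables: Iterable[SiteTable]) -> SiteTable:
    """Add up per-score counts from every site; no patient rows are exchanged."""
    deaths: Counter = Counter()
    survivors: Counter = Counter()
    for table in tables:
        for score, (d, s) in table.items():
            deaths[score] += d
            survivors[score] += s
    keys = sorted(set(deaths) | set(survivors))
    return {k: (deaths[k], survivors[k]) for k in keys}


def auroc_from_table(table: SiteTable) -> float:
    """AUROC from counts: every death is compared with every survivor;
    a lower-scoring survivor counts 1 and a tied survivor counts 0.5."""
    total_deaths = sum(d for d, _ in table.values())
    total_survivors = sum(s for _, s in table.values())
    if total_deaths == 0 or total_survivors == 0:
        raise ValueError("need at least one death and one survivor")
    concordant = 0.0
    survivors_below = 0
    for score in sorted(table):
        d, s = table[score]
        concordant += d * (survivors_below + 0.5 * s)
        survivors_below += s
    return concordant / (total_deaths * total_survivors)


# Invented site tables, for illustration only.
site_a = {2: (0, 1), 4: (0, 2), 6: (1, 0)}
site_b = {7: (0, 1), 9: (0, 1), 10: (1, 0), 12: (1, 0)}
pooled = pool_tables([site_a, site_b])
print(round(auroc_from_table(pooled), 3))
```

A pooled AUROC of this kind mixes between-site and within-site discrimination, so it can differ from the average of site-specific AUROCs.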

Limitations and cautions

Modest gains in discrimination: An AUROC of 0.79 remains in the moderate range for predicting ICU mortality. SOFA-2 is not a deterministic tool for individual-level prognostication and should be used for risk stratification and monitoring rather than sole decision-making.
Data representativeness: Although geographically broad, the datasets may underrepresent low-income or rural settings not captured in the contributing countries. Resource constraints in some regions could alter how organ-support therapies are deployed and therefore affect score performance locally.
Operational implications: Replacing the original SOFA in research, registries, and quality programs will require mapping strategies to allow historical comparisons and recalibration of established thresholds used for trial inclusion or benchmarking.
Transparency on variable definitions: Widespread implementation will require clear definitions, especially for new variables and thresholds that reflect treatment modalities (e.g., what constitutes respiratory failure when high-flow nasal oxygen is used vs. invasive ventilation).

Clinical and research implications

SOFA-2 creates an opportunity to align organ-dysfunction measurement with current practice. For clinical trials and observational research, updated thresholds may improve risk adjustment and patient stratification. For bedside clinicians and quality programs, adopting SOFA-2 could better reflect patient severity in modern ICUs. However, prior to broad adoption, stakeholders should consider prospective evaluation of SOFA-2 in implementation studies, examine calibration across subgroups (age, comorbidity, resource setting), and develop crosswalks to the original SOFA for continuity.
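
A simple first look at calibration across subgroups is the observed-to-expected mortality ratio within each subgroup. The sketch below assumes that a predicted risk of ICU death is already available for each patient, for example from a locally fitted model that uses the score; the record layout, subgroup labels, and values are invented.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class Record(NamedTuple):
    subgroup: str          # e.g. an age band or resource setting (hypothetical)
    predicted_risk: float  # model-based probability of ICU death, 0 to 1
    died: int              # 1 if the patient died in the ICU, else 0


def observed_to_expected(records: Iterable[Record]) -> Dict[str, float]:
    """Observed deaths divided by the sum of predicted risks, per subgroup.

    A ratio near 1 suggests good calibration-in-the-large; a ratio above 1
    means the predictions underestimate mortality in that subgroup.
    """
    observed: Dict[str, float] = defaultdict(float)
    expected: Dict[str, float] = defaultdict(float)
    for r in records:
        observed[r.subgroup] += r.died
        expected[r.subgroup] += r.predicted_risk
    return {g: observed[g] / expected[g] for g in observed if expected[g] > 0}


toy = [
    Record("age<65", 0.05, 0), Record("age<65", 0.10, 0),
    Record("age<65", 0.25, 1), Record("age<65", 0.10, 0),
    Record("age>=65", 0.20, 1), Record("age>=65", 0.30, 1),
    Record("age>=65", 0.10, 0), Record("age>=65", 0.20, 0),
]
for group, ratio in observed_to_expected(toy).items():
    print(group, round(ratio, 2))
```

In practice this ratio would be complemented by calibration plots or slopes, because a ratio near 1 can hide miscalibration at the low and high ends of risk.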

Conclusion and Next Steps

The SOFA-2 project represents an important, pragmatic update of a foundational critical care instrument. Developed through expert consensus combined with data-driven analysis, and validated across millions of admissions, SOFA-2 modestly improves discrimination for ICU mortality and preserves dynamic monitoring properties. The work balances continuity with the original SOFA against the modernization needed to reflect current therapies and case mix.

Before SOFA-2 becomes a new standard, priorities should include:

Prospective implementation studies to assess clinical utility, calibration, and impact on decision-making.
Transparent specification of new variables and thresholds with example operational definitions suitable for electronic health record extraction.
Development of mapping algorithms to translate historical SOFA scores to SOFA-2 equivalents to preserve longitudinal comparability in registries and clinical trials.
Further research into reliable measures for gastrointestinal and immune dysfunction that can be used in future score extensions.
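
To make the mapping priority concrete, one very simple form of crosswalk takes patients for whom both scores have been computed and assigns each original SOFA total the median SOFA-2 total observed at that value. This is a hypothetical sketch with invented pairs, not a validated mapping; a real crosswalk might instead use equipercentile linking or outcome-based alignment, and would need checking in each registry.

```python
from collections import defaultdict
from statistics import median_low
from typing import Dict, Iterable, List, Tuple


def build_crosswalk(pairs: Iterable[Tuple[int, int]]) -> Dict[int, int]:
    """Map each original SOFA total to the low median SOFA-2 total observed
    among patients who carried that original total."""
    by_old: Dict[int, List[int]] = defaultdict(list)
    for old_total, new_total in pairs:
        by_old[old_total].append(new_total)
    return {old: median_low(vals) for old, vals in sorted(by_old.items())}


# Invented (original SOFA, SOFA-2) pairs, for illustration only.
toy_pairs = [(3, 3), (3, 4), (3, 4), (5, 5), (5, 7),
             (8, 9), (8, 9), (8, 10), (8, 11)]
print(build_crosswalk(toy_pairs))  # {3: 4, 5: 5, 8: 9}
```

The low median is used so that every mapped value is a total that was actually observed, which keeps the crosswalk on the integer scale of the score.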

Funding and ClinicalTrials.gov

Details on funding sources, competing interests, and trial registration are reported in the primary publication: Ranzani OT et al., JAMA. 2025. Readers should consult the original article for full disclosure statements and trial or registry identifiers.

References

1. Ranzani OT, Singer M, Salluh JIF, et al. Development and Validation of the Sequential Organ Failure Assessment (SOFA)-2 Score. JAMA. 2025 Oct 29:e2520516. doi:10.1001/jama.2025.20516. PMID: 41159833.

2. Vincent JL, Moreno R, Takala J, et al. The SOFA (Sepsis-related Organ Failure Assessment) score to describe organ dysfunction/failure. Intensive Care Med. 1996;22(7):707-710.

3. Singer M, Deutschman CS, Seymour CW, et al. The Third International Consensus Definitions for Sepsis and Septic Shock (Sepsis-3). JAMA. 2016;315(8):801-810.

Practical takeaways for clinicians and researchers

Adopt an informed stance: SOFA-2 is a carefully developed update with broad validation but only modest gains in discrimination. Centers and investigators should plan for staged adoption, validate SOFA-2 locally, and use it alongside clinical judgment and other prognostic tools rather than as a sole arbiter of treatment decisions or trial eligibility.