Multicenter Validation Suggests MySurgeryRisk Can Accurately Predict Major Postoperative Complications and In-Hospital Mortality Across Diverse Hospitals

Highlights

In a retrospective multicenter cohort spanning 14 institutions in the OneFlorida+ network, MySurgeryRisk-based models achieved excellent discrimination for several clinically meaningful postoperative outcomes, with AUROC values of 0.93 for ICU admission, 0.94 for postoperative mechanical ventilation, 0.92 for acute kidney injury, and 0.95 for in-hospital mortality.

The study included 508,097 major inpatient operations from 366,875 adults between 2012 and 2023, making it one of the largest multicenter validations of the MySurgeryRisk framework to date.

Predictive performance remained comparable to previously published single-center MySurgeryRisk models, supporting the transportability of the feature engineering and modeling strategy to broader health system settings.

Primary procedure code and clinician-specific factors were among the most influential predictors, underscoring that postoperative risk reflects not only patient physiology and comorbidity burden but also procedure complexity and care-context variables.

Background

Postoperative complications remain a major source of morbidity, mortality, prolonged hospitalization, and health care cost despite advances in surgical technique, anesthesia, perioperative monitoring, and enhanced recovery protocols. Even in contemporary practice, serious complications such as unexpected intensive care unit admission, prolonged postoperative mechanical ventilation, acute kidney injury, and death continue to affect a meaningful proportion of patients undergoing major inpatient surgery.

Risk stratification before surgery is therefore central to perioperative medicine. Accurate estimates can inform shared decision-making, triage to higher-acuity monitoring, prehabilitation, optimization of modifiable risk factors, and planning of postoperative resources. Yet traditional risk tools often have important limitations. Many rely on a limited set of manually entered variables, focus on selected procedures or organ systems, or were developed in narrow populations that may not generalize well to other hospitals. In addition, the increasing density of routinely collected electronic health record data creates an opportunity for machine-learning approaches to capture nonlinear interactions and latent clinical signals not readily represented by conventional models.

The MySurgeryRisk framework was previously developed and prospectively validated in a single-center setting to predict postoperative complications using perioperative electronic health record data. The key question addressed by the current study was whether the framework would retain performance when scaled to a large and heterogeneous multicenter network. This is a crucial translational step for any artificial intelligence tool intended for real-world perioperative use.

Study Design

Design and setting

Ren and colleagues conducted a retrospective, longitudinal, multicenter cohort study using data from the OneFlorida+ network. The analysis covered 14 health care institutions and included adult patients admitted for major inpatient surgery between 2012 and 2023.

Population

The final cohort comprised 508,097 encounters from 366,875 adult patients. The mean age was 59 years with a standard deviation of 18 years. Women accounted for 190,799 patients, or 52%, and men for 176,076, or 48%. The size and diversity of the cohort are notable strengths, especially for evaluating transportability across multiple institutions with differing workflows, case mix, and documentation patterns.

Development and validation strategy

The investigators split the data temporally, with the split point falling in 2020: a development set of 358,216 encounters (2012 to 2020) and a validation set of 149,881 encounters (2020 to 2023). This approach is methodologically preferable to random splitting for many clinical prediction applications because it better simulates prospective deployment in future patients.
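A temporal split of this kind can be sketched in a few lines. The example below is only an illustration, not the authors' pipeline: the field name `admit_date`, the toy encounters, and the cutoff date are all assumptions, since the summary states only that the split fell in 2020.

```python
from datetime import date


def temporal_split(encounters, cutoff):
    """Return (development, validation): encounters before the cutoff
    go to development, encounters on or after it go to validation."""
    development = [e for e in encounters if e["admit_date"] < cutoff]
    validation = [e for e in encounters if e["admit_date"] >= cutoff]
    return development, validation


# Hypothetical toy encounters; real data would carry features and outcomes.
encounters = [
    {"id": 1, "admit_date": date(2013, 5, 2)},
    {"id": 2, "admit_date": date(2019, 11, 20)},
    {"id": 3, "admit_date": date(2021, 3, 14)},
    {"id": 4, "admit_date": date(2023, 1, 9)},
]

# Assumed cutoff for illustration; the exact date is not given in the summary.
CUTOFF = date(2020, 7, 1)

dev, val = temporal_split(encounters, CUTOFF)
print([e["id"] for e in dev], [e["id"] for e in val])
```

Unlike a random split, no validation encounter predates any development encounter, so the evaluation mimics applying a frozen model to future patients.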

Using the feature selection and transformation methods previously validated within the MySurgeryRisk framework, the authors trained eXtreme Gradient Boosting models to predict four postoperative outcomes: intensive care unit admission, postoperative mechanical ventilation, postoperative acute kidney injury, and in-hospital mortality.

Outcomes

The primary performance metric was the area under the receiver operating characteristic curve, or AUROC, which measures discrimination: the ability of a model to assign higher risk to patients who will develop an outcome than to those who will not. The outcomes selected are clinically meaningful because each is associated with increased resource use, patient harm, and downstream complications.
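The definition of discrimination given above translates directly into code: AUROC equals the fraction of (event, non-event) patient pairs in which the patient with the event received the higher predicted risk, with ties counted as half. The sketch below uses that pairwise definition on made-up labels and scores. It is meant to show what the metric measures, and it says nothing about how the study computed it.

```python
def auroc(labels, scores):
    """AUROC as the probability that a randomly chosen positive case
    scores higher than a randomly chosen negative case (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = complication occurred, 0 = it did not.
labels = [0, 0, 1, 0, 1, 1]
scores = [0.05, 0.20, 0.35, 0.40, 0.70, 0.90]

toy_auroc = auroc(labels, scores)
print(f"AUROC = {toy_auroc:.3f}")
```

In this toy example, 8 of the 9 positive-negative pairs are ranked correctly. The pairwise loop is quadratic in cohort size, so a rank-based implementation would be preferable for data on the scale of this study.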

Key Findings

Event rates

The prevalence of postoperative events in the full cohort was 8% for ICU admission, corresponding to 42,302 encounters; 4% for postoperative mechanical ventilation, corresponding to 20,435 encounters; 7% for acute kidney injury, corresponding to 36,027 encounters; and 1% for in-hospital mortality, corresponding to 5,131 encounters. These frequencies reflect both the clinical importance and the statistical challenges of building models for outcomes with varying class imbalance, especially mortality.
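The rounded percentages can be checked against the reported counts. The short calculation below uses only the encounter counts quoted above and also derives the ratio of non-events to events, one simple way to express class imbalance.

```python
TOTAL_ENCOUNTERS = 508_097

# Event counts as reported for the full cohort.
event_counts = {
    "ICU admission": 42_302,
    "mechanical ventilation": 20_435,
    "acute kidney injury": 36_027,
    "in-hospital mortality": 5_131,
}

event_rates = {
    name: count / TOTAL_ENCOUNTERS for name, count in event_counts.items()
}
# Non-events per event: larger values mean stronger class imbalance.
imbalance = {
    name: (TOTAL_ENCOUNTERS - count) / count
    for name, count in event_counts.items()
}

for name in event_counts:
    print(f"{name}: {event_rates[name]:.1%}, "
          f"about {imbalance[name]:.0f} non-events per event")
```

For in-hospital mortality there are roughly 98 encounters without the event for every encounter with it, which is why mortality is singled out as the hardest endpoint from a class-imbalance standpoint.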

Discrimination performance

The models demonstrated strong discrimination across all four endpoints in the validation dataset. AUROC for ICU admission was 0.93 with a 95% confidence interval of 0.93 to 0.93. AUROC for postoperative mechanical ventilation was 0.94 with a 95% confidence interval of 0.94 to 0.94. AUROC for postoperative acute kidney injury was 0.92 with a 95% confidence interval of 0.92 to 0.92. AUROC for in-hospital mortality was 0.95 with a 95% confidence interval of 0.94 to 0.95.
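Confidence intervals whose bounds coincide with the point estimate at two decimal places are what one would expect from a validation set of this size. A rough sketch using the Hanley-McNeil approximation for the standard error of an AUROC illustrates why. Two assumptions are made here and should be read as such: the summary does not say how the authors computed their intervals, and it does not report event counts for the validation set, so the mortality count below is approximated by applying the full-cohort 1% rate to the 149,881 validation encounters.

```python
import math


def hanley_mcneil_se(auc, n_pos, n_neg):
    """Approximate standard error of an AUROC (Hanley and McNeil, 1982)."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc ** 2 / (1 + auc)
    variance = (
        auc * (1 - auc)
        + (n_pos - 1) * (q1 - auc ** 2)
        + (n_neg - 1) * (q2 - auc ** 2)
    ) / (n_pos * n_neg)
    return math.sqrt(variance)


N_VALIDATION = 149_881
ASSUMED_MORTALITY_RATE = 0.01  # assumption: full-cohort rate applied to validation set

n_pos = round(N_VALIDATION * ASSUMED_MORTALITY_RATE)
n_neg = N_VALIDATION - n_pos

se = hanley_mcneil_se(0.95, n_pos, n_neg)
half_width = 1.96 * se  # half-width of an approximate 95% interval
print(f"SE = {se:.4f}, 95% CI half-width = {half_width:.4f}")
```

Under these assumptions the standard error is below 0.005 even for the rarest outcome, so an interval spanning about 0.01 in either direction is plausible for mortality, and the more common outcomes would yield narrower intervals still.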

These values are high by perioperative prediction standards and indicate strong separation between patients who did and did not experience these complications. Notably, mortality prediction retained very strong discrimination despite an event rate of only 1%, a degree of class imbalance at which predictive models often struggle.

Generalizability

A central claim of the paper is that predictive performance was comparable with previously validated single-center MySurgeryRisk models. This matters because many machine-learning systems show performance decay when moved beyond the environment in which they were built. Differences in coding, laboratory reporting, case mix, perioperative pathways, and institutional practice patterns can all reduce external validity. The present study therefore supports the hypothesis that the MySurgeryRisk framework can generalize across a broad network without major loss of discrimination.

Although the abstract emphasizes AUROC, clinicians should interpret generalizability in a broader sense. Transportability is not only about preserving headline performance metrics but also about whether influential predictors remain clinically plausible and whether the model can function across sites with different patterns of missingness and workflow. In that respect, the fact that routine variables could support robust prediction across 14 institutions is encouraging.

Most influential predictors

Primary procedure code and clinician-specific factors were consistently among the most influential variables. This is an important and somewhat provocative finding. It implies that postoperative risk is deeply shaped by the type of operation performed and by contextual features related to care delivery. Procedure code likely acts as a strong proxy for surgical magnitude, anatomical site, urgency, blood loss risk, and expected postoperative trajectory. Clinician-specific factors may reflect operator experience, specialty practice patterns, team performance, patient selection, and institution-level protocols.

At the same time, these variables raise key implementation questions. Predictors that encode clinician or site effects may improve accuracy, but they may also embed historical practice variation, structural inequities, or local idiosyncrasies. For operational deployment, health systems will need to balance predictive performance with fairness, interpretability, and the possibility that model outputs could change if staffing or referral patterns change over time.

Clinical interpretation of each endpoint

Prediction of ICU admission may help hospitals anticipate bed demand and identify patients who need closer postoperative surveillance. However, ICU admission is partly a clinical decision and may reflect local practice norms as much as patient physiology. A model that predicts ICU use may therefore capture both true severity and site-level thresholds for escalation.

Prediction of postoperative mechanical ventilation is highly relevant in patients undergoing high-risk abdominal, thoracic, vascular, or emergency surgery, and may support discussions about planned postoperative critical care, extubation readiness, and respiratory optimization.

Prediction of acute kidney injury is especially valuable because AKI is common, often underrecognized early, and associated with chronic kidney disease, longer hospitalization, and higher mortality. A reliable perioperative AKI risk estimate could potentially guide fluid strategy, nephrotoxin avoidance, hemodynamic vigilance, and laboratory surveillance.

Prediction of in-hospital mortality is perhaps the most visible output, but it should be used carefully. Mortality models may support informed consent and triage, yet they should complement rather than replace clinician judgment, especially when decisions involve goals of care or limitations of treatment.

Expert Commentary

This study is a meaningful advance in perioperative artificial intelligence because it moves beyond proof-of-concept into multicenter validation at substantial scale. Several aspects strengthen its clinical relevance. First, the cohort is large and diverse. Second, the use of a temporal validation split approximates real-world future deployment better than random partitioning. Third, the outcomes are practical and clinically consequential rather than abstract composite endpoints.

The findings also align with a broader body of literature showing that perioperative risk is best understood as a multidimensional construct shaped by patient factors, procedure-related risk, intraoperative stress, and system-level context. Machine-learning methods are well suited to this task because they can model nonlinearities and interactions among these domains. The use of eXtreme Gradient Boosting is consistent with other high-performing structured-data clinical prediction studies.

Still, several caveats deserve attention. AUROC alone is insufficient to judge readiness for clinical implementation. Calibration, meaning how closely predicted probabilities match observed outcomes, is critical when risk estimates are used for counseling or threshold-based decisions. The abstract does not report calibration metrics, decision-curve analysis, subgroup performance, or site-specific heterogeneity. A model can discriminate well but still overestimate or underestimate absolute risk in important patient groups.
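The distinction between discrimination and calibration can be made concrete with a small sketch. The code below bins predictions by predicted probability, compares mean predicted risk with the observed event rate in each bin, and computes an overall observed-to-expected ratio. All data are invented for illustration, and the function names are hypothetical rather than anything used in the study.

```python
def calibration_table(labels, probs, n_bins=5):
    """Rows of (bin index, count, mean predicted risk, observed event rate)
    for equal-width probability bins; empty bins are skipped."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(labels, probs):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((y, p))
    rows = []
    for idx, members in enumerate(bins):
        if not members:
            continue
        mean_pred = sum(p for _, p in members) / len(members)
        observed = sum(y for y, _ in members) / len(members)
        rows.append((idx, len(members), mean_pred, observed))
    return rows


def observed_to_expected(labels, probs):
    """Calibration-in-the-large: observed events divided by expected events."""
    return sum(labels) / sum(probs)


# Hypothetical toy predictions.
labels = [0, 0, 0, 1, 0, 1, 1, 1]
probs = [0.1, 0.1, 0.3, 0.3, 0.5, 0.7, 0.9, 0.9]

rows = calibration_table(labels, probs)
oe_original = observed_to_expected(labels, probs)

# Halving every probability keeps the ranking (and hence AUROC) unchanged
# but makes the model systematically underestimate absolute risk.
halved = [p / 2 for p in probs]
oe_halved = observed_to_expected(labels, halved)

for idx, n, mean_pred, observed in rows:
    print(f"bin {idx}: n={n}, predicted={mean_pred:.2f}, observed={observed:.2f}")
print(f"O/E original = {oe_original:.2f}, O/E halved = {oe_halved:.2f}")
```

Because halving is a monotone transformation, the two sets of predictions rank patients identically, yet the second set expects only half as many events as the first. A high AUROC alone cannot rule out this kind of miscalibration.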

Another issue is outcome definition. ICU admission and postoperative mechanical ventilation may partly reflect institutional workflows and resource availability rather than purely biological deterioration. This does not negate their value, but it means they are hybrid outcomes influenced by both patient need and operational practice. If hospitals differ in ICU triage thresholds or extubation protocols, some portion of model performance may derive from learning local utilization patterns.

The prominence of clinician-specific factors also requires careful governance. These variables may be highly informative, but their use in deployed algorithms could raise concerns about fairness, transparency, and unintended reputational consequences. If a model attributes higher risk partly because of clinician identity or related signals, end users will need to understand whether the output is intended for patient-level decision support, quality improvement, or system planning. In some contexts, including such variables may improve prediction but complicate ethical acceptability.

Additionally, retrospective validation, even when multicenter, does not prove clinical utility. The decisive next step would be prospective implementation studies asking whether use of MySurgeryRisk changes clinician behavior, improves resource allocation, reduces complications, or supports more informed shared decision-making. It would also be important to test for performance drift over time and across subgroups defined by age, sex, race and ethnicity, comorbidity burden, surgical specialty, urgency, and hospital type.

Finally, interpretability remains central. The authors identify major influential variables, which is useful, but operational deployment will likely require user-facing explanations that clinicians can act on. Risk scores are most helpful when coupled to concrete care pathways such as AKI prevention bundles, respiratory support planning, or postoperative monitoring protocols.

Clinical and Health System Implications

If integrated thoughtfully, a tool like MySurgeryRisk could support several perioperative workflows. Before surgery, it could strengthen risk communication, particularly for complex inpatient procedures. During care planning, it might help determine postoperative destination, the need for ICU-level resources, or the intensity of renal and respiratory monitoring. At the health system level, it could assist bed management and forecasting of postoperative resource demand.

However, clinical integration should not be reduced to displaying a probability score in the electronic health record. High-value implementation usually requires pairing prediction with action. For example, a high AKI risk output could trigger a renal-protective checklist. Elevated predicted ventilation risk could prompt preoperative pulmonary optimization or planned postoperative critical care consultation. Mortality risk estimates may be most useful when embedded in broader shared decision-making frameworks that include patient goals and expected quality of recovery.

The study also reinforces the idea that model portability is achievable when training data are large, heterogeneous, and based on routinely captured information. This is important for health systems seeking scalable AI solutions rather than niche tools requiring specialized data collection. Nevertheless, local validation remains essential before adoption in any new environment.

Limitations

Several limitations can be inferred from the study design and abstract. The retrospective nature of the analysis means that unmeasured confounding, coding differences, and data-quality issues remain possible. The abstract does not detail missing-data handling, calibration statistics, or subgroup analyses. It also does not specify whether hospital-level clustering or between-site heterogeneity materially affected performance.

The outcomes themselves vary in objectivity. In-hospital mortality and laboratory-defined AKI are generally more robust than ICU admission, which can be influenced by practice style. Furthermore, the use of clinician-specific predictors may improve prediction while potentially limiting interpretability and raising bias concerns. Finally, because the validation occurred within one regional data network, broader external validation in other states, health systems, and countries would still be valuable.

Conclusion

The multicenter study by Ren and colleagues provides strong evidence that the MySurgeryRisk framework can retain high discrimination for major postoperative complications and in-hospital mortality when applied across a large and diverse health care network. In a cohort of more than 508,000 major inpatient operations, models evaluated on a temporally held-out validation set predicted ICU admission, postoperative mechanical ventilation, acute kidney injury, and death with AUROC values ranging from 0.92 to 0.95. These findings support the broader generalizability of the framework and underscore the predictive importance of procedure type and clinician-related factors.

For clinicians and health systems, the study is best viewed as an important validation milestone rather than the final step toward routine use. The next priorities are prospective implementation, calibration assessment, subgroup fairness analyses, and evaluation of whether risk-informed interventions actually improve outcomes. Even so, this work meaningfully advances the field of perioperative AI and suggests that routinely collected electronic health record data can support scalable, clinically relevant surgical risk prediction.

Funding and ClinicalTrials.gov

The abstract provided does not report a ClinicalTrials.gov registration number, which is expected given the retrospective observational design. Specific funding details are not included in the supplied summary and should be confirmed from the full JAMA Surgery article before formal publication use.

References

1. Ren Y, Adiyeke E, Guan Z, Hu Z, Loftus TJ, Shickel B, Rashidi P, Ozrazgat-Baslanti T, Bihorac A. MySurgeryRisk Model Predictions of Postoperative Complications and Mortality. JAMA Surgery. 2026;161(6):619-627. PMID: 42054034.

2. American College of Surgeons National Surgical Quality Improvement Program. User Guide for the ACS NSQIP Surgical Risk Calculator. American College of Surgeons. Accessed for context on established surgical risk prediction tools.

3. Bihorac A, Ozrazgat-Baslanti T, Ebadi A, et al. MySurgeryRisk: Development and validation of a machine-learning risk algorithm for major complications and death after surgery. This foundational work established the original framework; readers should verify the exact bibliographic details from PubMed or the journal archive when citing.
