Highlights
An ensemble of fine-tuned tabular foundation models predicted pathological complete response after total neoadjuvant therapy and total mesorectal excision with moderate discrimination in locally advanced rectal cancer.
In the surgical development cohort, model performance was AUROC 0.71, AUPRC 0.44, and Brier score 0.17, indicating moderate discrimination and acceptable overall probabilistic accuracy.
When externally tested in a watch-and-wait cohort, the model predicted persistent clinical complete response with AUROC 0.69; calibration was initially suboptimal but improved substantially after recalibration.
The study addresses a major clinical gap: selecting patients for nonoperative management after total neoadjuvant therapy without relying solely on imperfect clinical assessment.
Background
Management of locally advanced rectal cancer has shifted rapidly in the era of total neoadjuvant therapy (TNT). Delivering systemic chemotherapy and chemoradiation before surgery can improve treatment completion, downstage tumors, and increase the chance of complete response. This has intensified interest in a watch-and-wait strategy for patients who appear to achieve a clinical complete response after TNT, potentially sparing them total mesorectal excision (TME) and its long-term consequences, including bowel dysfunction, sexual dysfunction, urinary dysfunction, and permanent stoma formation.
However, the central problem remains unchanged: clinical complete response is difficult to define with certainty. Endoscopy, digital rectal examination, MRI, and interval reassessment can identify patients with deep response, but residual viable tumor and subsequent local regrowth remain important concerns. Pathological complete response (pCR), confirmed only after resection, is the strongest benchmark for complete eradication of the primary tumor, yet it is unavailable in patients treated nonoperatively. A model that estimates the likelihood of pCR, or of durable complete response in a watch-and-wait pathway, could therefore be clinically valuable.
The study by Varghese and colleagues addresses this unmet need by applying an ensemble of tabular foundation models to routinely collected clinical data from patients with stage II or III microsatellite stable rectal adenocarcinoma treated with TNT. The authors aimed to build a prediction tool that could support patient selection for watch-and-wait, one of the most nuanced decisions in rectal cancer care.
Study Design
Population and setting
The development cohort included 308 adults with clinical stage II or III microsatellite stable primary rectal adenocarcinoma who underwent TNT followed by TME between 2018 and 2023. Median age was 56 years, and 40% were female. This cohort was used to train and evaluate the model for predicting pCR.
The external validation cohort included 83 patients managed with TNT followed by watch-and-wait rather than immediate surgery. Median age was 57 years, and 37% were female. In this cohort, the relevant endpoint was persistent clinical complete response (pcCR), defined as the absence of local regrowth, distant metastases, or persistent near-complete clinical response.
Modeling approach
The investigators used an ensemble of tabular foundation models fine-tuned on pre-TNT, post-TNT, and pre-TME variables. The abstract does not list each predictor, but the timing suggests the model integrated baseline disease features together with response-assessment data collected after TNT and before definitive management.
This is an important design choice. Many prior rectal cancer prediction tools have been based on conventional regression methods and often rely on smaller variable sets. Foundation-style tabular models may better capture nonlinear interactions among clinical, endoscopic, radiologic, and treatment-related features, although their performance still depends heavily on data quality, cohort size, and calibration.
Endpoints and performance metrics
The primary endpoint in the TNT plus TME cohort was pathological complete response. In the TNT plus watch-and-wait cohort, the external validation endpoint was persistent clinical complete response.
Performance was assessed using AUROC, AUPRC, and Brier score with 95% confidence intervals. This metric set is appropriate and clinically informative. AUROC describes discrimination across thresholds, AUPRC is especially useful when class imbalance exists, and the Brier score (the mean squared difference between predicted probability and observed outcome, where lower is better) summarizes overall probabilistic accuracy, reflecting both calibration and discrimination.
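As a rough illustration of how these three metrics are computed, the Python sketch below implements them from scratch on a small invented data set. The labels and probabilities are made up for demonstration and are not study data, the function names are illustrative, and the average-precision helper assumes no tied scores.

```python
def auroc(y_true, y_score):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(y_true, y_score):
    """Step-wise area under the precision-recall curve; assumes no tied scores."""
    order = sorted(range(len(y_true)), key=lambda i: -y_score[i])
    n_pos = sum(y_true)
    tp = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            tp += 1
            total += tp / rank  # precision evaluated at each positive case
    return total / n_pos


def brier(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Invented toy data: 1 = complete response, 0 = residual disease.
y_toy = [1, 0, 1, 0, 0, 1, 0, 0]
p_toy = [0.9, 0.8, 0.7, 0.4, 0.35, 0.3, 0.2, 0.1]

print(f"AUROC = {auroc(y_toy, p_toy):.3f}")              # 11/15
print(f"AUPRC = {average_precision(y_toy, p_toy):.3f}")  # 13/18
print(f"Brier = {brier(y_toy, p_toy):.3f}")
```

In practice a validated library implementation would normally be preferred; the hand-written versions are shown only to make the definitions concrete.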
Key Findings
Performance in the surgical cohort
In the 308-patient TNT plus TME cohort, the model predicted pCR with an AUROC of 0.71 (95% CI 0.65-0.77). This indicates moderate discrimination: clearly better than chance, but not strong enough to support autonomous decision-making. In clinical terms, if one patient with pCR and one without are chosen at random, the model assigns the higher predicted probability to the patient with pCR about 71% of the time, so considerable overlap remains between responders and nonresponders.
The AUPRC was 0.44 (95% CI 0.35-0.57). Precision-recall performance is often more clinically revealing than AUROC when the outcome is not highly prevalent. An AUPRC in this range suggests meaningful signal, though not a high-confidence rule-in tool.
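Overall probabilistic accuracy in the development setting appeared acceptable, with a Brier score of 0.17 (95% CI 0.15-0.20). Because the Brier score reflects both calibration and discrimination, it is best read as a summary of probabilistic accuracy rather than a pure calibration measure. Calibration itself matters because watch-and-wait decisions are based not merely on ranking patients but on estimated absolute probabilities. A well-calibrated model allows clinicians to interpret risk outputs more credibly at the bedside.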
External validation in the watch-and-wait cohort
In the 83-patient TNT plus watch-and-wait cohort, the model predicted persistent clinical complete response with an AUROC of 0.69 (95% CI 0.57-0.82). This is broadly consistent with the development result and suggests that the model retained moderate discriminatory ability when transported to a clinically distinct population.
The AUPRC in this cohort was 0.90 (95% CI 0.82-0.96). At first glance, this appears excellent. But precision-recall metrics are highly sensitive to event prevalence: a model with no discriminative ability achieves an AUPRC approximately equal to the proportion of patients who have the outcome. Because the watch-and-wait cohort likely consisted of highly selected patients with favorable response characteristics, the high AUPRC should not be interpreted as evidence of near-perfect predictive accuracy. Rather, it reflects both the model’s performance and the enriched case mix.
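The dependence of AUPRC on prevalence can be illustrated with a small simulation. The sketch below scores patients with pure random noise and computes average precision at two hypothetical prevalences (0.25 and 0.85); these values are assumptions chosen only to contrast a low-prevalence and a high-prevalence cohort and are not figures reported by the study.

```python
import random


def average_precision(y_true, y_score):
    """Step-wise area under the precision-recall curve; assumes no tied scores."""
    order = sorted(range(len(y_true)), key=lambda i: -y_score[i])
    n_pos = sum(y_true)
    tp = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            tp += 1
            total += tp / rank  # precision evaluated at each positive case
    return total / n_pos


def random_scorer_ap(prevalence, n=4000, seed=0):
    """Average precision of uninformative (random) scores at a given prevalence."""
    rng = random.Random(seed)
    n_pos = round(prevalence * n)
    y = [1] * n_pos + [0] * (n - n_pos)
    scores = [rng.random() for _ in y]  # scores carry no information about y
    return average_precision(y, scores)


# Hypothetical prevalences, not study values.
ap_low = random_scorer_ap(0.25)
ap_high = random_scorer_ap(0.85)
print(f"random scorer, prevalence 0.25: AP = {ap_low:.2f}")
print(f"random scorer, prevalence 0.85: AP = {ap_high:.2f}")
```

Both results should land close to the prevalence itself, which is why an AUPRC is most informative when read against the outcome rate of the cohort in which it was measured.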
The more cautionary finding was calibration. The Brier score in the external validation cohort was 0.30 (95% CI 0.26-0.33), indicating poorer probabilistic accuracy when the pCR-trained model was applied to the clinical complete response setting. For reference, assigning every patient a probability of 0.5 yields a Brier score of 0.25, so the raw outputs were poorly matched to the outcome frequencies in this cohort. After recalibration, the Brier score improved to 0.17. This is a clinically important result. It suggests the model’s raw outputs are not directly portable from a surgery-based pathological endpoint to a watch-and-wait clinical endpoint, but that recalibration can restore practical utility.
Why recalibration matters
Recalibration is not a technical footnote; it is central to real-world implementation. A prediction model trained on pCR among resected patients is being asked to estimate a related, but not identical, endpoint in a selected nonoperative cohort. Differences in case mix, prevalence, outcome definition, and surveillance intensity can shift baseline risk and distort probability estimates. The improvement in Brier score after recalibration underscores that transportability across management pathways is achievable, but not automatic.
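To make the idea concrete, the sketch below applies logistic recalibration, one common approach in which a new intercept and slope are fitted on the logit of the original predictions. This is an assumption for illustration: the abstract does not state which recalibration method the authors used. The data are synthetic, constructed so that the raw model systematically underestimates the event rate, and all names are hypothetical.

```python
import math
import random


def logit(p):
    return math.log(p / (1.0 - p))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def brier(y_true, y_prob):
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def fit_logistic_recalibration(y_true, y_prob, lr=0.5, n_iter=2000):
    """Fit p_new = sigmoid(a + b * logit(p_old)) by gradient descent on log loss."""
    x = [logit(p) for p in y_prob]
    a, b = 0.0, 1.0  # start from the identity mapping (no recalibration)
    n = len(x)
    for _ in range(n_iter):
        grad_a = grad_b = 0.0
        for xi, yi in zip(x, y_true):
            err = sigmoid(a + b * xi) - yi
            grad_a += err
            grad_b += err * xi
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


rng = random.Random(0)
# Synthetic cohort: true event odds are shifted upward relative to raw predictions.
raw = [rng.uniform(0.05, 0.60) for _ in range(400)]
y_syn = [1 if rng.random() < sigmoid(logit(p) + 2.0) else 0 for p in raw]

a, b = fit_logistic_recalibration(y_syn, raw)
recal = [sigmoid(a + b * logit(p)) for p in raw]
brier_raw = brier(y_syn, raw)
brier_recal = brier(y_syn, recal)
print(f"intercept = {a:.2f}, slope = {b:.2f}")
print(f"Brier raw = {brier_raw:.3f}, recalibrated = {brier_recal:.3f}")
```

For brevity the recalibration is fitted and evaluated on the same synthetic patients; a real analysis would estimate it on separate data or with cross-validation to avoid optimistic results. Note also that recalibration of this kind changes the probabilities but not the ranking of patients, so AUROC is unaffected when the fitted slope is positive.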
Clinical Interpretation
This study is timely because watch-and-wait has moved from a niche strategy to a legitimate organ-preservation pathway in experienced centers. Yet the current standard for selecting candidates remains a multimodal clinical assessment that is expert-dependent and imperfect. A model that combines baseline and post-TNT data may help quantify uncertainty, standardize discussions across teams, and identify patients who warrant either intensified surveillance or a lower threshold for surgery.
Still, the reported discrimination suggests this tool should be viewed as decision support rather than a decision maker. An AUROC around 0.70 is useful, but it does not justify replacing MRI interpretation, endoscopic assessment, digital rectal examination, or multidisciplinary review. For a patient considering omission of TME, the clinical stakes are too high for moderate-performance algorithms to stand alone.
Where the model may be most helpful is in borderline scenarios: patients with near-complete response, equivocal MRI findings, discordance between endoscopy and imaging, or reluctance about surgery. In such cases, a calibrated probability estimate could enrich shared decision-making, especially if presented alongside known surveillance burdens and salvage surgery options.
Strengths of the Study
Several features strengthen the study. First, the investigators used a clinically relevant cohort confined to stage II and III microsatellite stable rectal adenocarcinoma, reducing biological heterogeneity. Second, the model incorporated longitudinal information spanning pre-TNT, post-TNT, and pre-TME time points, which mirrors real care pathways more closely than baseline-only models. Third, the authors performed external validation in a watch-and-wait cohort, the very setting for which the tool is intended. Many predictive oncology models never reach this step.
The use of multiple performance metrics is another positive aspect. Discrimination alone can be misleading, particularly when treatment decisions hinge on absolute risk. Reporting both AUPRC and Brier score provides a more balanced view of utility.
Limitations and Cautions
The main limitation is sample size, especially for external validation. Although 308 surgical patients is respectable for a single-institution or limited-center machine learning study, it remains modest for training flexible foundation-style models. The watch-and-wait cohort of 83 patients is clinically valuable but small, and the confidence intervals are correspondingly wide.
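The effect of cohort size on interval width can be sketched with a percentile bootstrap. The example below is an assumption-laden illustration: the abstract does not say how the confidence intervals were derived, the scores are synthetic Gaussian draws tuned to give an AUROC near 0.7, and the outcome prevalence of 0.3 is hypothetical. Only the cohort sizes (83 and 308) mirror the study.

```python
import random


def auroc(y_true, y_score):
    """Rank-sum (Mann-Whitney) AUROC; assumes no tied scores across classes."""
    order = sorted(range(len(y_true)), key=lambda i: y_score[i])
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    rank_sum = sum(r for r, i in enumerate(order, start=1) if y_true[i] == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def bootstrap_ci(y_true, y_score, n_boot=500, seed=0):
    """Approximate 95% percentile bootstrap interval for AUROC."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample patients with replacement
        ys = [y_true[i] for i in idx]
        if 0 < sum(ys) < n:  # both classes are needed to compute AUROC
            stats.append(auroc(ys, [y_score[i] for i in idx]))
    stats.sort()
    return stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot)]


def synthetic_cohort(n, prevalence, seed):
    rng = random.Random(seed)
    n_pos = round(n * prevalence)
    y = [1] * n_pos + [0] * (n - n_pos)
    # Positives score higher on average; separation chosen for AUROC near 0.7.
    scores = [rng.gauss(0.8 if yi else 0.0, 1.0) for yi in y]
    return y, scores


y_small, s_small = synthetic_cohort(83, 0.3, seed=1)
y_large, s_large = synthetic_cohort(308, 0.3, seed=2)
lo_small, hi_small = bootstrap_ci(y_small, s_small)
lo_large, hi_large = bootstrap_ci(y_large, s_large)
width_small = hi_small - lo_small
width_large = hi_large - lo_large
print(f"n=83:  95% CI {lo_small:.2f}-{hi_small:.2f}")
print(f"n=308: 95% CI {lo_large:.2f}-{hi_large:.2f}")
```

With the same underlying discrimination, the smaller cohort is expected to produce a visibly wider interval, which is the pattern seen in the reported external validation estimates.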
A second concern is endpoint mismatch. pCR and persistent clinical complete response are related but not interchangeable. pCR is a pathological endpoint assessed after resection; pcCR is a longitudinal clinical outcome influenced by surveillance quality, timing, and management thresholds. Using a pCR-trained model to inform pcCR prediction is conceptually reasonable, but it requires recalibration and further validation.
Third, the abstract does not specify which variables contributed most strongly to prediction, nor does it report model interpretability methods. For clinicians, transparency is essential. If the model is to influence organ-preservation decisions, users will want to know whether it relies heavily on MRI restaging, endoscopic findings, treatment interval, nodal features, carcinoembryonic antigen, or other signals.
Fourth, generalizability is uncertain. Watch-and-wait outcomes are strongly center-dependent, influenced by expertise in response assessment and adherence to surveillance protocols. A model developed in one practice environment may not translate cleanly to centers with different imaging standards, endoscopic expertise, or thresholds for declaring complete response.
Finally, no funding statement or ClinicalTrials.gov registration number is provided in the abstract. Because this is a predictive modeling study rather than an interventional trial, trial registration may not apply, but transparency around data provenance and support remains important.
How This Fits With the Existing Literature
Interest in nonoperative management after rectal cancer response has been shaped by the pioneering work of Habr-Gama and by subsequent international registry data showing that selected patients with clinical complete response can avoid surgery with acceptable oncologic outcomes, provided surveillance is rigorous. More recent trials of TNT, including RAPIDO, PRODIGE 23, and OPRA, have shown high rates of tumor response and made organ preservation more feasible.
The OPRA trial was especially influential because it prospectively evaluated TNT sequencing and organ-preservation outcomes, reinforcing the feasibility of watch-and-wait in carefully selected responders. Yet even in that setting, response assessment remained a clinical art supported by imaging and endoscopy rather than a validated prediction engine. The present study pushes the field toward a more formal quantitative framework.
Importantly, the study does not solve the biological problem of residual microscopic disease. Instead, it offers a probabilistic layer that may complement current assessment strategies. In that sense, it aligns with broader oncology efforts to combine imaging, clinical variables, and machine learning rather than relying on any single modality.
Implications for Practice and Research
For current practice, the findings support cautious exploration of AI-assisted response prediction in specialized rectal cancer programs. The most realistic near-term use case is multidisciplinary support: a calibrated model output presented alongside MRI staging, endoscopic appearance, examination findings, and patient preferences. Such a tool may improve consistency in counseling but should not be used in isolation to recommend watch-and-wait.
For research, several next steps are clear. Larger multi-institutional validation is needed, ideally across diverse imaging platforms and surveillance protocols. Prospective impact studies should test whether the model actually improves decision quality, reduces unnecessary surgery, or identifies patients at higher risk of local regrowth. Future versions may benefit from explicit integration of MRI radiomics, endoscopic image analysis, circulating tumor DNA, and serial carcinoembryonic antigen measurements.
Equally important is model explainability. Clinicians are more likely to adopt prediction tools that reveal key drivers of risk and provide calibrated confidence estimates. Implementation studies should also define clinically actionable thresholds. For example, what predicted probability of pCR or pcCR is sufficient to support watch-and-wait, and how should that threshold change based on patient age, comorbidity, low tumor location, or willingness to undergo intense surveillance?
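One established way to reason about such thresholds is decision curve analysis, which summarizes a model's value at a chosen threshold probability as net benefit. The study abstract does not mention this method; it is offered here only as a possible framework. In the sketch below, a "positive" prediction stands for offering watch-and-wait, the data are invented, and the function names are illustrative.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit = TP/n - (FP/n) * threshold / (1 - threshold)."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Net benefit of classifying every patient as positive."""
    n = len(y_true)
    tp = sum(y_true)
    fp = n - tp
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


# Invented toy data: 1 = durable complete response, 0 = residual or regrowing tumor.
y_toy = [1, 0, 1, 0, 0, 1, 0, 0]
p_toy = [0.9, 0.8, 0.7, 0.4, 0.35, 0.3, 0.2, 0.1]

for t in (0.25, 0.5):
    nb_model = net_benefit(y_toy, p_toy, t)
    nb_all = net_benefit_treat_all(y_toy, t)
    print(f"threshold {t}: model {nb_model:.3f}, all positive {nb_all:.3f}")
```

The threshold encodes how many false positives a clinician and patient would accept per true positive, so the questions raised above about age, comorbidity, and tolerance for surveillance translate directly into the choice of threshold at which net benefit is read.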
Conclusion
Varghese and colleagues report a novel AI-based model that moderately predicts pathological complete response after TNT in locally advanced rectal cancer and shows promising, though not plug-and-play, utility for predicting persistent clinical complete response in watch-and-wait patients. The major message is not that AI can now determine who can safely avoid surgery. Rather, it is that quantitative response prediction is becoming mature enough to enter the multidisciplinary conversation, especially when accompanied by recalibration and careful clinical oversight.
For a field increasingly focused on organ preservation, that is a meaningful advance. The next challenge is not simply building more accurate models, but proving that these tools improve patient-centered decision-making without compromising oncologic safety.
Funding and ClinicalTrials.gov
No funding source or ClinicalTrials.gov registration number is reported in the abstract provided. Readers should consult the full Annals of Surgery article for disclosure details and any supplementary methodological information.
References
1. Varghese C, Ng JC, Sassun R, Thiels C, Salehinejad H, Perry WRG, Mathis KL, Larson DW. Predicting Treatment Response After Total Neoadjuvant Therapy for Locally Advanced Rectal Cancer. Ann Surg. 2026 Jun 4. PMID: 42240534.
2. Garcia-Aguilar J, Patil S, Kim JK, et al. Organ Preservation in Patients With Rectal Adenocarcinoma Treated With Total Neoadjuvant Therapy. J Clin Oncol. 2022;40(23):2546-2556.
3. Bahadoer RR, Dijkstra EA, van Etten B, et al. Short-course radiotherapy followed by chemotherapy before total mesorectal excision in locally advanced rectal cancer (RAPIDO): a multicentre, randomised, open-label, phase 3 trial. Lancet Oncol. 2021;22(1):29-42.
4. Conroy T, Bosset JF, Etienne PL, et al. Neoadjuvant chemotherapy with FOLFIRINOX and preoperative chemoradiotherapy for patients with locally advanced rectal cancer (PRODIGE 23): a multicentre, randomised, open-label, phase 3 trial. Lancet Oncol. 2021;22(5):702-715.
5. van der Valk MJM, Hilling DE, Bastiaannet E, et al. Long-term outcomes of clinical complete responders after neoadjuvant treatment for rectal cancer in the International Watch & Wait Database (IWWD). Lancet. 2018;391(10139):2537-2545.
6. Habr-Gama A, Perez RO, Nadalin W, et al. Operative versus nonoperative treatment for stage 0 distal rectal cancer following chemoradiation therapy: long-term results. Ann Surg. 2004;240(4):711-717.

