Highlights
– Bedside nurse-documented ICDSC using the standard cutoff (≥4) had only moderate agreement with trained researcher CAM-ICU assessments (Cohen’s kappa = 0.42) in 1,535 matched assessments from 279 ICU patients.
– A logistic model incorporating individual ICDSC items and clinical variables (mechanical ventilation status, admission SOFA) predicted researcher-identified delirium with good discrimination (AUC = 0.87) and a cross-validated F1 score of 0.72.
– Simpler models with limited ICDSC information still improved performance versus the standard ICDSC cutoff (cross-validated F1 = 0.60–0.70), suggesting avenues for pragmatic, EHR-based delirium phenotyping for research.
Background
Delirium is an acute, fluctuating disturbance in attention and cognition that affects up to half of critically ill patients and is independently associated with increased mortality, longer ICU and hospital length of stay, and long-term cognitive impairment. Current critical care guidelines recommend routine use of validated bedside screening tools—principally the Confusion Assessment Method for the ICU (CAM-ICU) and the Intensive Care Delirium Screening Checklist (ICDSC)—to detect delirium in mechanically ventilated and nonventilated patients alike.
Despite validation studies showing good psychometric properties for both tools, real-world bedside documentation by nurses often shows variable agreement with researcher-administered reference-standard assessments. This gap limits the utility of routine clinical delirium documentation for observational research, quality measurement, and pragmatic trials conducted using electronic health records (EHRs).
Study design
Setting and population
Toth et al. analyzed prospectively collected, matched daily delirium assessments from critically ill adults with acute respiratory failure or sepsis admitted to intensive care units in a large academic health system in southwestern Pennsylvania. Paired assessments by bedside nurses and trained researchers were used to compare delirium ascertainment.
Assessments and reference standard
Bedside nurses documented the ICDSC during routine care. Trained research staff performed CAM-ICU assessments, which served as the reference standard for the analyses. Assessments were matched temporally (same ICU day) and limited to noncomatose patients for the primary analyses.
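The matching step can be pictured with a minimal sketch. The record layout, field names, and toy rows below are hypothetical and are not the study's schema; the sketch also assumes at most one assessment per rater per ICU day.

```python
from collections import namedtuple

# Hypothetical record layout for illustration only (not the study's data model).
Assessment = namedtuple("Assessment", ["patient_id", "icu_day", "positive", "comatose"])

def match_by_icu_day(nurse_rows, researcher_rows):
    """Pair nurse ICDSC and researcher CAM-ICU results on (patient, ICU day).

    Comatose assessments are dropped, mirroring the primary-analysis restriction.
    """
    researcher_index = {
        (r.patient_id, r.icu_day): r for r in researcher_rows if not r.comatose
    }
    pairs = []
    for n in nurse_rows:
        r = researcher_index.get((n.patient_id, n.icu_day))
        if r is not None and not n.comatose:
            pairs.append((n.patient_id, n.icu_day, n.positive, r.positive))
    return pairs

# Toy data, invented for the example.
nurse = [
    Assessment("P1", 1, True, False),
    Assessment("P1", 2, False, False),
    Assessment("P2", 1, False, True),   # comatose: excluded
]
researcher = [
    Assessment("P1", 1, True, False),
    Assessment("P1", 2, True, False),
    Assessment("P2", 1, False, True),   # comatose: excluded
    Assessment("P3", 1, False, False),  # no nurse assessment that day: unmatched
]

pairs = match_by_icu_day(nurse, researcher)
print(pairs)
```

Each tuple holds the patient, the ICU day, the nurse result, and the researcher result, which is the shape needed for any later agreement calculation.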
Analytic approach
The authors first compared delirium classification by the established ICDSC cutoff (ICDSC ≥ 4 = delirium) with the CAM-ICU, quantifying agreement using Cohen’s kappa. They then developed logistic regression models to predict a positive CAM-ICU using varying degrees of ICDSC information and readily available clinical variables: components of the ICDSC (individual item scores), mechanical ventilation status, and admission Sequential Organ Failure Assessment (SOFA) score. Models were internally validated using ten-fold cross-validation and evaluated with discrimination metrics (AUC) and F1 scores to balance sensitivity and precision. Sensitivity analyses assessed performance of models using more limited ICDSC information to simulate scenarios where full item-level data might not be available in an EHR.
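For readers less familiar with Cohen's kappa, the following sketch computes it for two binary raters: observed agreement is corrected for the agreement expected by chance from each rater's marginal positive rate. The function name and the toy labels are illustrative assumptions, not study data.

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two binary (0/1) raters on the same assessments."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of assessments")
    n = len(rater_a)
    # Proportion of assessments on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal positive rate.
    p_a = sum(rater_a) / n
    p_b = sum(rater_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)

# Toy labels invented for the example: 1 = delirium positive, 0 = negative.
nurse_icdsc_positive = [1, 1, 0, 0, 0, 0, 1, 0, 0, 0]
researcher_cam_icu_positive = [1, 0, 0, 0, 1, 0, 1, 1, 0, 0]

kappa = cohens_kappa(nurse_icdsc_positive, researcher_cam_icu_positive)
print(round(kappa, 2))
```

In this toy example the raters agree on 7 of 10 assessments, yet kappa is well below 0.7 because a large share of that agreement would be expected by chance.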
Key findings
Study sample: 1,535 matched nurse-to-researcher assessments from 279 patients.
Agreement using established ICDSC cutoff
Using the standard ICDSC threshold (≥4) produced only moderate agreement with researcher-administered CAM-ICU assessments (Cohen’s kappa = 0.42). This finding echoes prior reports that simple dichotomization of bedside ICDSC scores can misclassify delirium compared with research-standard evaluations.
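As a reminder of what the cutoff rule does, here is a minimal sketch. The ICDSC comprises eight items, each scored 0 or 1 (Bergeron et al. 2001); the dictionary keys below are names chosen for this example, not an EHR schema, and the two patients are invented.

```python
# Eight ICDSC items, each scored 0 or 1; key names are illustrative.
ICDSC_ITEMS = (
    "altered_level_of_consciousness",
    "inattention",
    "disorientation",
    "hallucination_delusion_psychosis",
    "psychomotor_agitation_or_retardation",
    "inappropriate_speech_or_mood",
    "sleep_wake_cycle_disturbance",
    "symptom_fluctuation",
)

def icdsc_total(items):
    """Sum the eight item scores; raises KeyError if an item is missing."""
    return sum(int(bool(items[name])) for name in ICDSC_ITEMS)

def icdsc_cutoff_positive(items, cutoff=4):
    """Standard dichotomization: total score >= 4 is labeled delirium."""
    return icdsc_total(items) >= cutoff

# Two invented patients.
patient_a = dict.fromkeys(ICDSC_ITEMS, 0)
patient_a.update(inattention=1, disorientation=1, sleep_wake_cycle_disturbance=1)

patient_b = dict(patient_a, symptom_fluctuation=1)

print(icdsc_total(patient_a), icdsc_cutoff_positive(patient_a))
print(icdsc_total(patient_b), icdsc_cutoff_positive(patient_b))
```

The rule looks only at the total, so which items were positive plays no role in the label. That discarded item-level detail is the information the model-based approach tries to recover.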
Model-informed prediction
The most comprehensive logistic regression model—leveraging individual ICDSC item data plus mechanical ventilation status and admission SOFA—achieved good discrimination for predicting a positive CAM-ICU (AUC = 0.87). Performance was stable in internal validation (ten-fold cross-validation), with an F1 score of 0.72, indicating an encouraging balance of sensitivity and positive predictive value.
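Ten-fold cross-validation can be sketched as follows: the assessments are shuffled once, split into ten folds, and each fold serves as the held-out test set exactly once. This is a plain k-fold split written for illustration; the summary does not state whether the authors kept all assessments from one patient in the same fold, so no grouping is assumed here, and the function name and seed are arbitrary.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding assigns every k-th shuffled index to the same fold.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# 1,535 is the number of matched assessments reported in the study.
splits = list(k_fold_indices(1535, k=10))
for train, test in splits[:2]:
    print(len(train), len(test))
```

A model would be refit on each training set and scored on the corresponding test set, with the reported metric averaged or pooled across folds.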
Simpler models and sensitivity analyses
When item-level ICDSC information was restricted (simulating partial documentation in some EHR contexts), classification performance decreased but still exceeded that of the standard cutoff (cross-validated F1 scores 0.60–0.70). The conventional ICDSC cutoff had the lowest predictive performance among the evaluated approaches.
Interpretation of effect sizes
Although the study report focuses on discrimination and F1 rather than sensitivity/specificity pairs for every model, the improvement over the cutoff rule indicates that incorporating clinical context (mechanical ventilation, organ dysfunction) and item-level delirium features meaningfully reduces misclassification versus relying on the ICDSC cutoff alone.
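The two reported metrics answer different questions, which a small sketch makes concrete. AUC is the probability that a randomly chosen CAM-ICU-positive assessment receives a higher predicted score than a randomly chosen negative one; F1 is the harmonic mean of precision (positive predictive value) and recall (sensitivity) at a chosen threshold. The labels, scores, and 0.5 threshold below are invented for illustration.

```python
def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when there are no positives at all."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def auc(y_true, scores):
    """Pairwise (Mann-Whitney) AUC; ties between a positive and a negative count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy reference labels (1 = CAM-ICU positive) and model scores, invented for the example.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.35, 0.2, 0.6, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

toy_auc = auc(y_true, scores)
toy_f1 = f1_score(y_true, y_pred)
print(toy_auc, toy_f1)
```

AUC is threshold-free, whereas F1 depends on the cut point applied to the predicted probability, so a model can rank assessments well and still yield a modest F1 if its threshold is poorly chosen.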
Expert commentary and implications
Clinical relevance: Routine nurse documentation of delirium is essential for bedside care and quality monitoring, but as this study demonstrates, raw ICDSC scores—particularly dichotomized by a single cutoff—do not perfectly align with researcher CAM-ICU assessments. For investigators and health systems that rely on EHR-documented delirium to study outcomes, implement performance metrics, or trigger interventions, improved phenotyping methods are needed.
Model utility: The logistic model described by Toth et al. provides a pragmatic bridge: by using granular nursing-documented ICDSC information coupled with two clinical variables already present in the EHR (mechanical ventilation status and admission SOFA), researchers can obtain a delirium label that aligns substantially better with research-standard CAM-ICU assessments. This approach enables higher-fidelity use of routinely collected data for large-scale observational studies and pragmatic trials without requiring new data collection workflows.
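To make the idea of a model-informed label concrete, the sketch below applies a logistic function to a weighted sum of item-level ICDSC scores and the two clinical variables. Every coefficient, the subset of items, and the 0.5 threshold are hypothetical placeholders chosen for the example; they are not the fitted values reported by Toth et al., which should be taken from the full article.

```python
import math

# HYPOTHETICAL coefficients for illustration only; not taken from the paper.
HYPOTHETICAL_COEFS = {
    "intercept": -3.0,
    "inattention": 1.6,
    "disorientation": 1.1,
    "altered_level_of_consciousness": 0.9,
    "mechanically_ventilated": 0.7,
    "admission_sofa": 0.08,
}

def predicted_probability(features, coefs=HYPOTHETICAL_COEFS):
    """Logistic model: sigmoid of intercept plus weighted feature values."""
    linear = coefs["intercept"]
    for name, weight in coefs.items():
        if name != "intercept":
            linear += weight * float(features[name])  # KeyError if a feature is missing
    return 1.0 / (1.0 + math.exp(-linear))

def model_label(features, threshold=0.5):
    """Label an assessment delirium-positive when the probability meets the threshold."""
    return predicted_probability(features) >= threshold

# Invented patient: two positive ICDSC items, ventilated, admission SOFA of 8.
patient = {
    "inattention": 1,
    "disorientation": 1,
    "altered_level_of_consciousness": 0,
    "mechanically_ventilated": 1,
    "admission_sofa": 8,
}

print(round(predicted_probability(patient), 3), model_label(patient))
```

With these made-up weights, a ventilated patient with inattention and disorientation is labeled positive even though only two ICDSC items are scored, which illustrates how such a model can diverge from a total-score cutoff.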
Guideline concordance: Current practice guidelines from critical care societies recommend routine delirium monitoring using validated tools such as CAM-ICU or ICDSC. This study does not supplant those recommendations but suggests that analytic postprocessing of ICDSC documentation can enhance its value for research and secondary use.
Limitations and generalizability
– Single health system: Data derive from multiple ICUs within one academic health system in southwestern Pennsylvania; external validation in diverse hospitals, community ICUs, and across different nursing documentation practices is required before broad adoption.
– Different instruments and raters: The reference was researcher-administered CAM-ICU, while the index was nurse-documented ICDSC. Differences may reflect both instrument constructs and rater training/availability rather than purely measurement error.
– Potential selection and timing biases: Assessments were daily and matched by ICU day; delirium is a fluctuating syndrome and non-simultaneous assessments can differ. The authors attempted temporal matching but residual misclassification from true fluctuation may persist.
– Model complexity vs implementability: The best-performing model used item-level ICDSC data; some EHRs may only store total scores or incomplete items. However, sensitivity analyses show even reduced inputs improve performance over the cutoff rule.
– Need for prospective testing: Demonstrating that model-informed delirium labels improve research inferences or clinical outcomes when used in decision support requires prospective evaluation.
Clinical and research recommendations
For researchers using EHR-derived delirium outcomes:
– Consider analytic approaches that leverage item-level ICDSC and clinical covariates rather than relying solely on a cutoff-based ICDSC label.
– Report the delirium ascertainment algorithm and validation metrics so readers can assess potential misclassification bias.
– Prioritize external validation of prediction models across different institutions and documentation workflows.
For clinicians and health systems:
– Reinforce training and quality assurance for bedside delirium screening, emphasizing both tool fidelity and documentation completeness.
– Where feasible, encourage storage of item-level ICDSC elements in the EHR to enable downstream analytic refinement.
– Use improved EHR-based phenotyping cautiously in real-time clinical decision support until prospective impact is established.
Conclusion
Toth and colleagues provide an important, pragmatic contribution to delirium measurement in the ICU. Their findings show that a model-based approach integrating nurse-documented ICDSC item data with simple clinical variables markedly improves agreement with researcher-administered CAM-ICU assessments. This strategy can enhance the utility of routine clinical documentation for research, quality improvement, and potentially for population-level surveillance, provided models are externally validated and their implementation is accompanied by continued attention to bedside screening quality.
Funding and trial registration
Funding sources were not detailed in the provided summary. For complete disclosures, grant numbers, and trial registration (if applicable), consult the full article: Toth KM et al., Crit Care Med. 2025;53(12):e2516–e2525.
Selected references
– Toth KM, Aghababa Z, Kennedy JN, et al. Optimizing Agreement Between Bedside Nurse-Documented and Trained Researcher Delirium Assessments in the ICU. Crit Care Med. 2025;53(12):e2516–e2525.
– Devlin JW, Skrobik Y, Gélinas C, et al. Clinical Practice Guidelines for the Prevention and Management of Pain, Agitation/Sedation, Delirium, Immobility, and Sleep Disruption in Adult Patients in the ICU. Crit Care Med. 2018;46(9):e825–e873.
– Bergeron N, Dubois MJ, Dumont M, Dial S, Skrobik Y. The Intensive Care Delirium Screening Checklist: evaluation of a new screening tool. Intensive Care Med. 2001;27(5):859–864.
– Ely EW, Margolin R, Francis J, et al. Evaluation of delirium in critically ill patients: validation of the Confusion Assessment Method for the Intensive Care Unit (CAM-ICU). Crit Care Med. 2001;29(7):1370–1379.
– Pandharipande PP, Girard TD, Jackson JC, et al. Long-term cognitive impairment after critical illness. N Engl J Med. 2013;369(14):1306–1316.

