Why One Sleep Study May Not Tell the Whole Story: Night-to-Night Variability, AHI Definitions, and OSA Reclassification
Highlights
Night-to-night variability (NtNV) is real in obstructive sleep apnea (OSA), but not all polysomnographic metrics vary to the same degree.
In this study, positional measures, sleep latency, and some autonomic metrics varied more than oxygenation-based metrics, while hypoxic burden (HB) remained notably stable.
Using the apnea-hypopnea index (AHI) scored with a 4% desaturation rule led to more diagnostic disagreement than AHI scored with the 3% desaturation/arousal rule, especially at clinically important severity thresholds.
Calibrated threshold models suggested that AHI 4% cut points must be substantially lower than traditional AHI 3%/arousal cut points to approximate similar severity categories.
Study Background and Clinical Context
Obstructive sleep apnea is one of the most common sleep-related breathing disorders and is strongly associated with excessive daytime sleepiness, impaired quality of life, hypertension, cardiovascular risk, and reduced neurocognitive performance. In routine practice, diagnosis and severity grading depend heavily on polysomnography (PSG), with the apnea-hypopnea index (AHI) serving as the most widely used summary metric.
However, AHI is not a single biological entity. It is a score that depends on how hypopneas are defined, which desaturation threshold is used, whether arousals are included, body position, sleep stage distribution, and sensor/scoring differences. These factors can produce clinically meaningful night-to-night variability even when two studies are performed only days apart. That variability matters because a patient may be labeled as having mild, moderate, or severe OSA based on one night and classified differently on another.
This issue has practical consequences. Diagnosis influences treatment eligibility, insurance authorization, and the selection of therapies such as positive airway pressure, oral appliance therapy, weight loss interventions, positional therapy, or surgery. The study summarized here addresses an important gap: it evaluates variability across multiple PSG-derived metrics, not just AHI, and examines whether different AHI scoring rules alter diagnostic stability in patients with known or highly suspected moderate-to-severe OSA.
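To make the dependence on the hypopnea definition concrete, here is a minimal sketch of how one night's events could yield two different AHI values. The `Hypopnea` record, the `ahi` helper, and the event counts are hypothetical illustrations, not data or code from the study; the two rules are simplified to "3% desaturation or arousal" and "4% desaturation," and apneas are counted under both.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypopnea:
    desat_pct: float  # associated oxygen desaturation, in percentage points
    arousal: bool     # True if the event was accompanied by an arousal


def ahi(apneas, hypopneas, sleep_hours, rule):
    """Events per hour of sleep under a named (simplified) hypopnea rule."""
    if rule == "3%/arousal":
        counted = [h for h in hypopneas if h.desat_pct >= 3.0 or h.arousal]
    elif rule == "4%":
        counted = [h for h in hypopneas if h.desat_pct >= 4.0]
    else:
        raise ValueError(f"unknown rule: {rule}")
    return (apneas + len(counted)) / sleep_hours


# Hypothetical night: 30 apneas, 100 candidate hypopneas, 6 hours of sleep.
candidates = (
    [Hypopnea(3.2, False)] * 40    # counted only under the 3%/arousal rule
    + [Hypopnea(1.5, True)] * 30   # counted only because of the arousal
    + [Hypopnea(4.5, False)] * 20  # counted under both rules
    + [Hypopnea(2.0, False)] * 10  # counted under neither rule
)
ahi_3a = ahi(30, candidates, 6.0, "3%/arousal")  # (30 + 90) / 6 = 20.0
ahi_4 = ahi(30, candidates, 6.0, "4%")           # (30 + 20) / 6, about 8.3
```

In this invented example the same night scores 20 events/h under one rule and roughly 8 events/h under the other, which would fall on opposite sides of a 15 events/h cut point.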
Study Design
This was a retrospective analysis of a prospective study that included 147 participants with a prior diagnosis of OSA or a high pretest likelihood of moderate-to-severe disease. Each participant underwent two PSG recordings within 10 days, allowing the investigators to compare closely spaced nights rather than distant longitudinal studies.
The analysis assessed night-to-night variability across 20 PSG-derived metrics. These included respiratory event frequency measures, oxygenation measures, sleep architecture variables, positional measures, and autonomic indicators such as heart rate.
The investigators then created a normalized NtNV matrix and applied principal component analysis (PCA) followed by unsupervised k-means clustering. This data-driven approach was used to identify whether participants naturally separated into distinct variability-pattern groups, particularly with respect to respiratory event variability.
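The pipeline described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the investigators' code: the data are synthetic, NtNV is taken as the absolute difference between nights and z-scored per metric (the summary does not give the normalization), PCA is done with a plain SVD, and k-means uses a small hand-written Lloyd's loop with k = 2. The metric names and the `kmeans` helper are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: 60 participants x 4 metrics, two nights each.
metrics = ["ahi", "odi", "sleep_latency", "supine_fraction"]
n = 60
night1 = rng.uniform(10, 60, size=(n, len(metrics)))
# Half of the participants get larger shifts in the two respiratory metrics.
noise_scale = np.full((n, len(metrics)), 2.0)
noise_scale[: n // 2, :2] = 15.0
night2 = night1 + rng.normal(0.0, noise_scale)

# Assumed NtNV definition: absolute difference, z-scored per metric.
ntnv = np.abs(night2 - night1)
z = (ntnv - ntnv.mean(axis=0)) / ntnv.std(axis=0)

# PCA via SVD of the standardized matrix; keep the first two components.
u, s, vt = np.linalg.svd(z, full_matrices=False)
scores = u[:, :2] * s[:2]           # participant coordinates on PC1 and PC2
explained = s**2 / np.sum(s**2)     # fraction of variance per component
loadings = vt[:2]                   # contribution of each metric to PC1, PC2


def kmeans(x, k, n_iter=50):
    """Lloyd's algorithm with a deterministic start spread along column 0."""
    order = np.argsort(x[:, 0])
    centers = x[order[np.linspace(0, len(x) - 1, k).astype(int)]]
    for _ in range(n_iter):
        dist = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            x[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


labels, centers = kmeans(scores, k=2)
```

Inspecting `loadings` shows which metrics drive each component, which is the kind of evidence behind a statement that respiratory event frequency metrics contributed most to the group separation.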
Diagnostic stability was compared using two commonly applied AHI scoring definitions: AHI based on 3% desaturation and/or arousal, and AHI based on 4% desaturation. The study also examined hypoxic burden, a metric that reflects the cumulative depth and duration of oxygen desaturations associated with respiratory events. In addition, calibration models were developed to align AHI 4% thresholds with the severity cut points typically used for AHI 3%/arousal.
Key Findings
Not all PSG metrics are equally variable
The most variable measures were maximum heart rate, positional fractions, and sleep latency. These findings are clinically intuitive: body position can shift markedly between nights, sleep onset can be influenced by laboratory environment and first-night effects, and heart rate peaks may fluctuate with arousals, respiratory burden, and sleep fragmentation.
By contrast, average SpO2, average heart rate, minimum SpO2, and hypoxic burden were the most stable metrics. This is an important observation because it suggests that oxygenation-based measures, along with average heart rate, may be less sensitive to single-night sampling noise than event-count-based indices or behavioral and state-dependent measures.
Intermediate variability was seen in respiratory event frequency metrics, including AHI-related measures. These are the core markers used in clinical diagnosis, which means that the most commonly relied-upon diagnostic variables are not the most stable ones.
PCA and clustering identified respiratory variability phenotypes
In the PCA followed by k-means analysis, respiratory event frequency metrics contributed most strongly to the separation of participants into lower- and higher-respiratory-variability groups. This suggests that a subset of patients has intrinsically greater instability in event frequency across nights, while others remain relatively consistent.
Clinically, this is a useful reminder that OSA is not a static disorder measured perfectly by a single night of PSG. Some patients may have relatively reproducible respiratory burden, whereas others fluctuate enough that classification depends heavily on the exact night captured.
AHI 4% was less stable than AHI 3%/arousal
Diagnostic disagreement was consistently higher when OSA severity was defined using AHI 4% rather than AHI 3%/arousal. In short-interval comparisons, overall disagreement was 29.9% for AHI 4% versus 21.2% for AHI 3%/arousal. At the moderate-to-severe threshold, disagreement was 14.3% versus 5.4%, respectively.
In longitudinal comparisons, disagreement widened further: 45.9% for AHI 4% versus 31.1% for AHI 3%/arousal overall, and 20.9% versus 8.2% at the moderate-to-severe threshold.
These differences matter because many clinical and payer systems still rely on AHI 4% thresholds. The study indicates that, under this definition, a patient is more likely to cross diagnostic categories from one night to the next even when the underlying disease has not truly changed.
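One plausible way to compute a disagreement rate like those quoted above is to assign each night a severity category and count the participants whose category changes. The sketch below assumes that definition, which this summary does not spell out, and uses the conventional 5, 15, and 30 events/h cut points; the function names and AHI values are hypothetical.

```python
def severity_category(ahi, cuts=(5.0, 15.0, 30.0)):
    """Return the number of cut points the AHI meets or exceeds (0 = lowest)."""
    return sum(ahi >= c for c in cuts)


def disagreement(night1, night2, cuts=(5.0, 15.0, 30.0)):
    """Fraction of participants whose category differs between two nights."""
    pairs = list(zip(night1, night2))
    changed = sum(
        severity_category(a, cuts) != severity_category(b, cuts)
        for a, b in pairs
    )
    return changed / len(pairs)


# Hypothetical paired AHI values (events/h) for six participants.
night1 = [4.0, 12.0, 16.0, 28.0, 35.0, 50.0]
night2 = [6.0, 13.0, 14.0, 31.0, 33.0, 47.0]

overall = disagreement(night1, night2)                  # 3 of 6 change category
at_severe = disagreement(night1, night2, cuts=(30.0,))  # 1 of 6 crosses 30
```

Passing a single cut point isolates reclassification at one threshold, mirroring the distinction between overall and threshold-specific disagreement.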
Hypoxic burden was comparatively stable
HB showed low inter-night disagreement of 11.8%, making it one of the most reproducible metrics in the analysis. This supports the growing view that cumulative hypoxemia may be a more biologically robust measure of OSA severity than event counts alone, especially when event scoring rules differ.
Because hypoxic burden integrates both the depth and duration of desaturations, it may better reflect physiologic stress than a simple count of breathing events. Its stability across nights in this study strengthens the case for its use in future phenotyping and risk stratification research.
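The idea of integrating depth and duration can be sketched as an area calculation. The version below is deliberately simplified and rests on assumptions: each event comes with a known sample window and a fixed pre-event baseline, SpO2 is sampled once per second, and the result is expressed in %min per hour of sleep. Published hypoxic burden methods derive the search windows and baselines from the signal itself; the `hypoxic_burden` function and the trace here are hypothetical.

```python
def hypoxic_burden(spo2, windows, sleep_hours, sample_period_s=1.0):
    """Sum the area below baseline within event windows, per hour of sleep.

    spo2: SpO2 samples in percent.
    windows: (start_index, end_index, baseline_pct) per respiratory event.
    Returns %min per hour of sleep.
    """
    total_area_pct_min = 0.0
    for start, end, baseline in windows:
        for value in spo2[start:end]:
            drop = max(baseline - value, 0.0)  # ignore samples above baseline
            total_area_pct_min += drop * sample_period_s / 60.0
    return total_area_pct_min / sleep_hours


# Hypothetical trace: a 3-point drop for 30 s and a 6-point drop for 60 s.
spo2 = [96.0] * 10 + [93.0] * 30 + [96.0] * 20 + [90.0] * 60 + [96.0] * 10
windows = [(10, 40, 96.0), (60, 120, 96.0)]

hb = hypoxic_burden(spo2, windows, sleep_hours=6.0)
# Areas: 3 * 30 s = 1.5 %min and 6 * 60 s = 6.0 %min, so 7.5 %min over 6 h.
```

Both events would add exactly one to an event count, yet the second contributes four times the area of the first, which is the distinction between burden and frequency.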
Threshold calibration showed large discrepancies between AHI rules
Statistical calibration models found that AHI 4% thresholds of 6.1-6.9 events/h and 18.4-22.3 events/h corresponded to the AHI 3%/arousal severity cut points of 15 and 30 events/h, respectively. In practical terms, an AHI 4% cut point must be set much lower than 15 or 30 events/h to identify the same severity category as the commonly used AHI 3%/arousal thresholds.
This finding underscores a major problem in cross-study comparison and clinical reimbursement policy: two patients with the same physiologic burden may be assigned very different severity labels depending on the scoring definition. It also means that older or payer-driven AHI thresholds cannot be directly substituted for more inclusive clinical thresholds without recalibration.
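The summary does not say which calibration models were used, so the following is only one possible approach: fit a least-squares line predicting AHI 3%/arousal from AHI 4% on paired scorings of the same nights, then invert it at 15 and 30 events/h. The helper names and the paired values are made up for illustration and are not study data.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def equivalent_cut(intercept, slope, target):
    """AHI 4% value whose predicted AHI 3%/arousal equals the target."""
    return (target - intercept) / slope


# Hypothetical paired scorings of the same nights (events/h).
ahi_4 = [2.0, 5.0, 8.0, 12.0, 20.0, 30.0]
ahi_3a = [7.0, 13.0, 19.0, 24.0, 33.0, 45.0]

intercept, slope = fit_line(ahi_4, ahi_3a)
cut_moderate = equivalent_cut(intercept, slope, 15.0)
cut_severe = equivalent_cut(intercept, slope, 30.0)
```

Because the 3%/arousal rule counts more events on the same night, the fitted line sits above the identity line and the inverted cut points come out below 15 and 30 events/h. A real calibration would also need uncertainty estimates and validation in an independent cohort.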
Interpretation and Clinical Relevance
The central message of this study is that night-to-night variability is not uniform across PSG outputs. Positional measures, sleep latency, and maximum heart rate are the most volatile; respiratory event frequency metrics are intermediate; and oxygenation measures, average heart rate, and hypoxic burden are the most stable. Among diagnostic indices, AHI scored with a 4% desaturation rule appears especially vulnerable to reclassification instability.
This has several important implications.
First, a single PSG may be sufficient for many patients, but not all. If a patient’s AHI falls near a treatment or coverage threshold, the chance of reclassification on repeat testing is nontrivial. That is especially relevant when symptoms are strong but a single-night PSG appears borderline.
Second, clinicians should interpret AHI in the context of the scoring rule used. AHI 4% and AHI 3%/arousal are not interchangeable, even though they are often discussed as if they were simply alternate versions of the same measure. The present study demonstrates that the definition itself drives meaningful differences in stability and classification.
Third, hypoxic burden may deserve greater attention in both research and clinical phenotyping. Its lower variability suggests it may better capture disease burden across nights than event counts that are sensitive to sleep stage, position, and scoring conventions.
Fourth, payer policies that depend on AHI 4% thresholds may exclude patients who would meet clinically relevant definitions under AHI 3%/arousal scoring. The calibration results provide a quantitative argument for revisiting reimbursement criteria if the goal is to align insurance coverage with clinical severity rather than a narrower scoring definition.
Strengths and Limitations
A major strength of the study is its relatively short inter-study interval, which limits confounding by long-term disease progression or treatment changes. The sample also reflects a clinically relevant group: patients with known or strongly suspected moderate-to-severe OSA, where diagnostic classification has immediate treatment implications.
The use of 20 PSG-derived metrics, coupled with PCA and unsupervised clustering, provides a richer picture than studies limited to AHI alone. The inclusion of hypoxic burden is also a meaningful advance, given increasing interest in metrics that better reflect physiologic stress.
Several limitations should be considered. The study is retrospective in analysis, even though the source cohort was prospective. Participants were enriched for prior diagnosis or high pretest probability of moderate-to-severe OSA, so generalizability to low-risk patients, women with milder disease, pediatric populations, or patients with insomnia-predominant presentations is uncertain. The results also depend on the specific scoring conventions and equipment used, and they do not establish whether hypoxic burden should replace AHI in routine practice. Finally, while calibration models are informative, they should be validated in independent cohorts before being adopted for policy or guideline development.
Conclusion
This study shows that night-to-night variability in PSG is real, clinically relevant, and metric-dependent. Positional measures, sleep latency, and maximum heart rate vary more, oxygenation measures vary less, and respiratory event frequency sits in between. Among the diagnostic indices tested, AHI defined by 4% desaturation was less stable and more prone to reclassification than AHI defined by 3% desaturation/arousal. Hypoxic burden was comparatively stable and may provide a more reproducible estimate of physiologic disease burden.
For clinicians, the practical takeaway is straightforward: when OSA severity is near a decision threshold, the scoring definition matters as much as the number itself. For policymakers and payers, these findings support a re-examination of AHI-based coverage rules that rely on a narrow hypopnea definition. For researchers, the study reinforces the need to move beyond event counts alone and toward metrics that are both biologically meaningful and reproducible.
Funding and ClinicalTrials.gov
The abstract provided does not specify funding sources or a ClinicalTrials.gov identifier. Readers should consult the full Chest article for complete disclosures and registration details.
References
1. Alavi A, Costa E, Matsumoto MMS, Odenwald N, Kushida C, Bahmani A, Capasso R. Comprehensive Evaluation of Night-to-Night Variability in PSG Metrics and AHI-Based Diagnostic Reclassification. Chest. 2026-06-16. PMID: 42302984.
2. Berry RB, Quan SF, Abreu AR, et al. The AASM Manual for the Scoring of Sleep and Associated Events: Rules, Terminology and Technical Specifications. American Academy of Sleep Medicine.
3. American Academy of Sleep Medicine. International Classification of Sleep Disorders, 3rd ed. Darien, IL: AASM.
4. Azarbarzin A, Sands SA, Stone KL, et al. The hypoxic burden of sleep apnoea predicts cardiovascular disease-related mortality: the MrOS and SHHS cohorts. Eur Heart J. 2019;40(14):1149-1157.

