Rule-Based Chatbots Outperform LLMs in Depressive Symptom Management: A Systematic Review and Meta-Analysis

Highlights

  • Rule-based chatbots demonstrate a small but statistically significant effect in alleviating depressive symptoms (g=0.266), whereas LLM-based chatbots currently lack robust evidence for efficacy.
  • The critical intervention window for rule-based chatbots is identified as 4 to 8 weeks, showing the most pronounced effects in the medium term.
  • Neither rule-based nor LLM-based chatbots demonstrated statistically significant efficacy in reducing anxiety symptoms in the pooled analysis.
  • The current clinical utility of LLM-based chatbots is hindered by wide confidence intervals and a lack of large-scale, controlled clinical trials.

Introduction: The Scalability Challenge in Global Mental Health

The global mental health landscape faces a dual crisis: an unprecedented surge in the prevalence of depression and anxiety disorders, and a chronic shortage of qualified mental health professionals. Traditional face-to-face psychotherapy, while effective, remains inaccessible to a significant portion of the population due to high costs, geographic barriers, and the stigma associated with seeking care. In this context, digital mental health interventions, specifically chatbots, have emerged as a promising, scalable, and low-cost solution.

For over a decade, rule-based chatbots, which operate on pre-defined decision trees and structured clinical protocols (such as Cognitive Behavioral Therapy), have been the industry standard. However, the rapid advancement of Large Language Models (LLMs) such as GPT-4 has introduced a new paradigm of generative, fluid, and seemingly more empathetic interaction. Despite the technological excitement surrounding LLMs, their clinical efficacy in therapeutic settings had not been systematically compared with that of traditional rule-based systems. A recent systematic review and meta-analysis by Du et al. (2025) provides a much-needed critical evaluation of these two distinct technological pathways.

Study Design and Methodology

To address the gap in comparative evidence, researchers conducted a systematic search across seven major databases, identifying 15 high-quality studies published between 2020 and 2025. The primary objective was to evaluate the efficacy of rule-based versus LLM-based chatbots in alleviating symptoms of depression and anxiety.

Recognizing the inherent clinical and methodological heterogeneity in digital health research, the study employed a robust variance estimation (RVE) approach to account for non-independent effect sizes. Standardized mean differences (SMDs) were calculated using Hedges' g. The researchers used a random-effects model, with pooled effect sizes estimated by restricted maximum likelihood (REML). Subgroup analyses were performed to determine the impact of control group type (e.g., waitlist vs. active control), intervention duration, and participant age.
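For readers unfamiliar with the effect size metric, the following is a minimal sketch of how Hedges' g is commonly computed from two-arm summary statistics. It uses the standard textbook formulas with the approximate small-sample correction; the function name, argument order, and the example numbers are illustrative assumptions and are not taken from Du et al.

```python
import math

def hedges_g(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Return (g, variance of g) for two independent groups.

    g is positive when group A's mean exceeds group B's. For symptom
    scales where lower scores are better, pass the control arm as A
    and the intervention arm as B so that positive g means benefit.
    """
    df = n_a + n_b - 2
    # Pooled standard deviation across both arms
    sd_pooled = math.sqrt(((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / df)
    d = (mean_a - mean_b) / sd_pooled
    # Approximate small-sample correction factor
    j = 1 - 3 / (4 * df - 1)
    # Large-sample variance of d, then scaled by the correction factor
    var_d = (n_a + n_b) / (n_a * n_b) + d**2 / (2 * (n_a + n_b))
    return j * d, j**2 * var_d

# Hypothetical post-treatment depression scores (made-up numbers):
# control arm mean 12.0, chatbot arm mean 10.5, SD 5.0 and n = 40 in each arm.
g, var_g = hedges_g(12.0, 5.0, 40, 10.5, 5.0, 40)
print(f"g = {g:.3f}, SE = {math.sqrt(var_g):.3f}")
```

The sign convention is a choice: here the control arm is passed first so that a positive g corresponds to lower symptom scores in the chatbot arm.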

Figure 1. Literature screening flowchart.

Key Findings: The Efficacy Gap

Depression Outcomes

The meta-analysis revealed a clear distinction in clinical performance between the two chatbot types regarding depression. Rule-based interventions achieved a small but significant effect size (g=0.266; 95% CI 0.020 to 0.512; P=.04). This suggests that structured, evidence-based dialogue remains a reliable tool for symptom reduction.

In contrast, LLM-based interventions showed a higher point estimate but failed to reach statistical significance (g=0.407; 95% CI -0.734 to 1.550; P=.17). The remarkably wide confidence interval for LLMs reflects a high degree of variability in study outcomes and a lack of standardized implementation, making it impossible at this stage to recommend LLMs as a standalone clinical intervention for depression.
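To illustrate why a handful of inconsistent studies yields a wide interval even when the point estimate looks large, here is a minimal random-effects pooling sketch. It uses the DerSimonian-Laird estimator with a normal-approximation interval, which is simpler than the REML and RVE approach used in the review, and all input numbers are invented for illustration.

```python
import math

def pool_random_effects(effects, variances):
    """DerSimonian-Laird random-effects pooling.

    Returns (pooled estimate, standard error, tau squared).
    """
    k = len(effects)
    w = [1 / v for v in variances]
    # Fixed-effect estimate, needed for the heterogeneity statistic Q
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sum(w) - sum(wi**2 for wi in w) / sum(w)
    # Between-study variance, truncated at zero
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    # Re-weight each study by its total (within + between) variance
    w_star = [1 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    return pooled, math.sqrt(1 / sum(w_star)), tau2

def ci95(estimate, se):
    """Normal-approximation 95% interval (a simplification)."""
    return estimate - 1.96 * se, estimate + 1.96 * se

# Hypothetical inputs: four consistent studies vs. three divergent ones,
# each with the same within-study variance.
consistent = pool_random_effects([0.25, 0.30, 0.20, 0.28], [0.04] * 4)
divergent = pool_random_effects([-0.2, 0.4, 1.0], [0.04] * 3)

for label, (est, se, tau2) in [("consistent", consistent), ("divergent", divergent)]:
    lo, hi = ci95(est, se)
    print(f"{label}: g = {est:.2f}, tau^2 = {tau2:.2f}, 95% CI {lo:.2f} to {hi:.2f}")
```

In this toy example the divergent set has the larger pooled estimate, yet its interval crosses zero because the between-study variance inflates the standard error. Small-sample corrections of the kind applied with RVE would typically widen such an interval further.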

Figure 2. Depression forest plot.

Figure 3. Anxiety forest plot.

Anxiety Outcomes

The results for anxiety were less encouraging for both technologies. Rule-based chatbots did not yield a significant effect (g=0.147; 95% CI -0.073 to 0.367; P=.15). Similarly, although LLM-based chatbots had a larger point estimate (g=0.711), it was not statistically significant (P=.13), and the wide confidence interval (95% CI -0.334 to 1.760) underscores the need for more targeted research in the anxiety domain.

Subgroup Insights: The 4-to-8-Week Window

One of the most clinically relevant findings of the study was the identification of an optimal intervention duration. Subgroup analysis indicated that rule-based chatbots were most effective when the intervention lasted between 4 and 8 weeks. Interventions shorter than 4 weeks may not provide a sufficient therapeutic dose, while those extending beyond 8 weeks may suffer from declining user engagement or ‘digital fatigue.’

Furthermore, rule-based chatbots performed best in trials that used blank (waitlist) control groups, which supports their utility in environments where no other psychological resources are available.

Expert Commentary: Why Structure Trumps Fluidity (For Now)

The findings of Du et al. highlight a critical tension in digital psychiatry: the trade-off between the flexibility of LLMs and the safety/predictability of rule-based systems. Rule-based chatbots are essentially digital translations of clinical protocols. By following a decision tree, they ensure that the user receives validated therapeutic techniques, such as cognitive restructuring or behavioral activation, without the risk of ‘hallucination’ or off-script advice.
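The decision-tree idea can be sketched in a few lines. The flow below is entirely hypothetical: the node names, prompts, and branches are invented for illustration and do not come from any clinical protocol or from the chatbots included in the review. The point is only that every reply maps to a pre-authored next step.

```python
# Hypothetical scripted flow: every node and branch is authored in advance.
FLOW = {
    "start": {
        "prompt": "How has your mood been today? (low / okay)",
        "next": {"low": "activity", "okay": "wrap_up"},
    },
    "activity": {
        "prompt": "Would you like to plan one small activity? (yes / no)",
        "next": {"yes": "plan", "no": "wrap_up"},
    },
    "plan": {"prompt": "Pick something that takes ten minutes or less.", "next": {}},
    "wrap_up": {"prompt": "Thanks for checking in.", "next": {}},
}

def step(node_id, user_reply):
    """Return the next node id; unrecognized replies stay on the same node."""
    branches = FLOW[node_id]["next"]
    return branches.get(user_reply.strip().lower(), node_id)

def run(replies, start="start"):
    """Replay a list of user replies and return the node ids visited."""
    path = [start]
    for reply in replies:
        if not FLOW[path[-1]]["next"]:
            break  # terminal node: the script has ended
        path.append(step(path[-1], reply))
    return path

print(run(["low", "yes"]))
print(run(["something else", "okay"]))
```

Because the bot can only emit prompts that exist in `FLOW`, its output is fully auditable; the trade-off is that any reply outside the expected options simply repeats the current prompt.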

LLMs, while more ‘human-like’ in conversation, are not inherently therapeutic. Without rigorous fine-tuning on clinical datasets (e.g., RLHF with psychiatric experts), an LLM might provide supportive-sounding dialogue that lacks the structural components necessary to drive clinical improvement. The high heterogeneity in LLM studies suggests that we are currently in a ‘Wild West’ phase of development, where the technology is advancing faster than the clinical evidence required to support it.

From the standpoint of biological and psychological plausibility, the 4-to-8-week window aligns with the typical timeframe required for cognitive and behavioral shifts to manifest. The absence of a detectable age effect in the subgroup analysis suggests that these digital tools are relatively age-agnostic, though the interface design must still be tailored to the specific demographic.

Clinical Implications and Limitations

For clinicians and health policy experts, these results suggest that rule-based chatbots are currently the more ‘evidence-based’ choice for integration into stepped-care models of mental health. They serve as an effective first-line intervention for mild-to-moderate depression, particularly in resource-limited settings.

However, several limitations must be noted:

  • Small Sample Sizes for LLMs: The lack of significant findings for LLMs may be a function of low power rather than a lack of potential. As more randomized controlled trials (RCTs) are completed, the effect size may stabilize.
  • Heterogeneity: Differences in chatbot ‘personality,’ interaction frequency, and the specific therapeutic framework used across studies remain high.
  • Anxiety Complexity: Anxiety symptoms may require more nuanced, real-time physiological feedback or exposure-based interventions that current chatbots struggle to deliver.

Conclusion

The study by Du et al. provides a sobering but necessary reality check for the field of digital mental health. While the allure of Large Language Models is undeniable, rule-based chatbots remain the only category with statistically significant evidence supporting their use in alleviating depressive symptoms. A 4-to-8-week structured intervention appears to be the most effective clinical pathway. Future research must focus on expanding the sample sizes for LLM-based trials and exploring ‘hybrid’ models that combine the clinical safety of rule-based systems with the engaging conversational capabilities of generative AI.

Reference

Du Q, Ren Y, Meng ZL, He H, Meng S. The Efficacy of Rule-Based Versus Large Language Model-Based Chatbots in Alleviating Symptoms of Depression and Anxiety: Systematic Review and Meta-Analysis. J Med Internet Res. 2025 Dec 4;27:e78186. doi: 10.2196/78186. PMID: 41343858; PMCID: PMC12677872.
