AI Versus Family Physicians: Evaluating ChatGPT-4o’s Responses to Common Primary Care Queries

Study Background and Disease Burden

Primary care forms the cornerstone of comprehensive healthcare, addressing a wide range of medical concerns from acute illnesses to chronic disease management and preventive care. The growing demand on family physicians often limits time available for thorough patient education and individualized counseling. Meanwhile, artificial intelligence (AI) technologies, especially large language models like ChatGPT-4o, have emerged as potential adjunct tools in healthcare delivery. They promise rapid, consistent, and detailed responses to patient queries, possibly supplementing physician advice and improving outcomes. Evaluating AI’s capability against human clinicians is critical to define its role, especially given that primary care inquiries often entail nuanced, patient-centered communication that blends medical accuracy with empathy.

Study Design

In the referenced study, İnan et al. (2025) conducted a comparative, observational, cross-sectional analysis of 200 carefully curated clinical questions reflecting common family medicine scenarios. The questions were developed through a systematic literature review and expert validation to ensure representativeness and clinical relevance.

Three experienced family physicians independently answered this question set, as did ChatGPT-4o, the latest iteration of OpenAI’s generative language model at the time. To minimize bias, all responses were anonymized and presented in randomized order to three independent family medicine experts for rating. Evaluation metrics were structured across four dimensions using standardized Likert scales:

– Appropriateness (1-6): The suitability of the response to the clinical context.
– Accuracy (1-6): The correctness of the medical information provided.
– Comprehensiveness (1-3): The extent to which the response covered relevant aspects of the question.
– Empathy (1-5): The expression of understanding and patient-centeredness.

Additionally, word counts of responses were recorded to assess length and detail.
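To make the scoring framework concrete, the sketch below shows one way ratings of this kind could be aggregated into the mean ± SD figures reported later. This is purely illustrative and not the authors’ analysis code: the rater data, array shapes, and function names are assumptions for demonstration.

```python
# Illustrative sketch only: aggregates hypothetical Likert ratings from three
# blinded raters into a mean and standard deviation per responder.
# Dimension score ranges follow the scales described in the study.
import numpy as np

DIMENSIONS = {
    "appropriateness": (1, 6),
    "accuracy": (1, 6),
    "comprehensiveness": (1, 3),
    "empathy": (1, 5),
}

def summarize(ratings: np.ndarray) -> tuple[float, float]:
    """Return (mean, SD) across all questions and raters.

    `ratings` has shape (n_questions, n_raters), e.g. (200, 3).
    """
    return float(ratings.mean()), float(ratings.std(ddof=1))

# Hypothetical placeholder data: 200 questions, 3 raters, 1-6 scale.
rng = np.random.default_rng(0)
ai_appropriateness = rng.integers(4, 7, size=(200, 3))
md_appropriateness = rng.integers(3, 7, size=(200, 3))

for label, scores in [("ChatGPT-4o", ai_appropriateness),
                      ("Physicians", md_appropriateness)]:
    mean, sd = summarize(scores)
    print(f"{label} appropriateness: {mean:.1f} ± {sd:.1f}")
```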

Key Findings

The study demonstrated statistically significant superiority of ChatGPT-4o across all evaluation metrics (p < 0.01). Notably, ChatGPT-4o’s mean scores were:

– Appropriateness: 5.8 ± 0.5 versus physicians’ 4.3 ± 1.0
– Accuracy: 5.8 ± 0.5 versus physicians’ 4.5 ± 1.1
– Comprehensiveness: 2.4 ± 0.6 versus physicians’ 1.4 ± 0.7
– Empathy: 4.8 ± 0.4 versus physicians’ 4.0 ± 0.8

These differences underscore the AI’s capacity not only to provide medically accurate and relevant answers but also to do so with notable empathy, a quality often assumed to lie beyond algorithmic responses.

ChatGPT-4o’s responses were also considerably longer (mean 298.8 ± 82.3 words) than the physicians’ (mean 106.1 ± 95.0 words), suggesting more detailed elaboration by the AI, which may partly account for its higher comprehensiveness scores.

In topic-specific analyses, ChatGPT-4o consistently outperformed physicians except in two domains, General Consultation and Child Infections, where the differences approached but did not reach statistical significance (p = 0.07 and 0.08, respectively). These areas may reflect nuanced clinical judgment scenarios where human experience carries particular weight. For readers interested in how such group comparisons are typically run on ordinal ratings, a small illustrative example follows.
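The article reports p-values but does not state which statistical test the authors used. For ordinal Likert scores, a Mann-Whitney U test is one common choice; the sketch below runs such a comparison on hypothetical placeholder data and should be read as an assumption-laden illustration, not a reproduction of the study’s analysis.

```python
# Hedged illustration: compares two hypothetical Likert-score distributions
# with a Mann-Whitney U test, a common non-parametric choice for ordinal data.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)
ai_scores = rng.integers(5, 7, size=200)   # placeholder AI ratings (1-6 scale)
md_scores = rng.integers(3, 6, size=200)   # placeholder physician ratings

stat, p_value = mannwhitneyu(ai_scores, md_scores, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p_value:.4f}")
```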

Expert Commentary

These findings are compelling, suggesting that AI tools like ChatGPT-4o can augment primary care by enhancing patient education and supporting clinical decision-making with extensive, accurate, and empathetic information. The higher empathy scores challenge conventional wisdom that AI lacks emotional intelligence, implying that carefully trained models can generate responses that resonate with patients’ psychosocial needs.

However, the markedly longer AI responses raise considerations about efficiency and patient preference, emphasizing the need to tailor answer length for practical use. Also, the near-equivalence in General Consultation and Child Infections hints at contexts requiring intricate clinical judgment or culturally contextualized information where seasoned clinicians excel.

Limitations of the study include the controlled, simulated nature of the assessment—real-world clinical scenarios involve dynamic interactions, physical exams, and nuanced decision-making beyond text responses. Moreover, the expert raters’ subjective evaluations, though standardized, may introduce interpretive variability.

Future work should investigate AI’s integration into clinical workflows, both to avoid overburdening clinicians or patients with excessive information and to ensure cultural and linguistic appropriateness across diverse populations.

Conclusion

The comparative analysis by İnan et al. signals a paradigm shift where AI, exemplified by ChatGPT-4o, can effectively supplement family physicians by providing highly appropriate, accurate, comprehensive, and empathetic answers to patient queries in primary care. The potential applications span enhancing patient education, supporting clinical reasoning, and enriching medical training.

For clinical practice, AI could serve as an initial information source or decision support tool, freeing physicians to focus on complex clinical judgments and interpersonal rapport. However, cautious integration with attention to response refinement for brevity and cultural relevance remains crucial.

Continued research should explore real-world studies assessing patient outcomes, satisfaction, and safety, validating AI’s role beyond experimental frameworks. The collaboration between AI and human clinicians promises a future of more accessible, informed, and compassionate primary care.

References

İnan M, Suvak Ö, Aypak C. AI in primary care: Comparing ChatGPT and family physicians on patient queries. Int J Med Inform. 2025 Nov;203:106047. doi: 10.1016/j.ijmedinf.2025.106047. Epub 2025 Jul 12. PMID: 40664020.
