AI outperforms humans in medical tests but falls short in real-world diagnosis, studies find

AI models outperform humans in medical knowledge tests but struggle with real-world clinical reasoning, highlighting the continued need for human oversight in healthcare decisions.

PATTAYA, Thailand – Recent trends show more people turning to artificial intelligence (AI) for medical advice, using it to interpret symptoms and reports before consulting doctors. While AI can deliver well-structured assessments within seconds, questions remain over whether these systems can truly diagnose and treat patients effectively.

A study conducted by the University of Marburg in Germany compared 13 large language models (LLMs) with 123 medical students and physicians using standardized tests on acute kidney injury.

The results showed AI models significantly outperforming humans, achieving an average score of around 90%, with several models scoring perfectly. In contrast, human participants scored an average of just 48.7%. The AI systems also completed the tests considerably faster.

However, despite excelling in factual knowledge, the models still lack the clinical judgment required for real-world medical care.

A separate study published in JAMA Network Open found that large language models struggle with clinical reasoning, particularly in the early stages of cases when information is limited. In simulations across 29 clinical scenarios, the models failed to produce appropriate differential diagnoses in more than 80% of cases.

Researchers emphasized that differential diagnosis remains central to the “art and science” of medicine—an area where AI has yet to match human capability. As a result, AI is increasingly viewed as a tool to support, rather than replace, physicians.

Jens Kleesiek of Essen University Hospital noted that while digital systems are transforming healthcare processes such as documentation and coordination, they must still operate under the guidance of trained professionals.

Experts agree that human oversight remains essential to ensure AI is used safely and appropriately in healthcare settings.

The integration of AI into clinical practice also raises concerns about reliability. Experienced doctors may be able to identify subtle errors in AI-generated recommendations, but less experienced users may lack the judgment needed to detect them.

Another growing risk is the “outsourcing of reasoning,” where users rely too heavily on AI systems that produce confident and well-articulated responses. This could discourage independent thinking and weaken critical decision-making skills over time.

Researchers warn that gradual dependence on AI may ultimately erode essential medical skills that require continuous practice and human judgment.