A major Oxford study finds AI chatbots can give inaccurate, inconsistent, and potentially dangerous medical advice, highlighting limits for real-world use
Published in Nature Medicine, the Oxford-led research found that although AI chatbots perform well on standardised medical tests, they often provide real patients with unsafe or incorrect advice. The findings clearly show that AI is not reliable enough for trusted medical guidance or decision-making in real-world settings.
How the study tested AI chatbots in real-world patient scenarios
Large language models (LLMs), commonly known as AI chatbots, are a type of artificial intelligence based on deep learning, trained to understand and generate natural language.
The research team, led by the Oxford Internet Institute and the Nuffield Department of Primary Care Health Sciences at the University of Oxford, conducted the largest user study to date of how the public uses LLMs to aid medical decisions.
Study participants used AI chatbots to identify health conditions and decide on an appropriate course of action, such as seeing a GP or going to the hospital, based on information curated by doctors. Example medical scenarios included a young man developing a severe headache after a night out and a new mother feeling exhausted.
In the study, one group used a chatbot to assist their decision-making, whilst a control group relied on traditional sources of information.
The researchers evaluated how accurately participants identified likely medical issues and the most appropriate next step, such as visiting a GP or going to A&E. These outcomes were compared with the results of standard LLM benchmark testing, revealing a striking gap between performance on benchmarks and performance when interacting with people.
Why AI chatbots aren’t ready to replace doctors
The study found that AI chatbots were no more effective than traditional methods. Participants using LLMs did not make safer or more accurate decisions than those relying on online searches or personal judgment.
Communication between participants and AI chatbots was also inconsistent. Participants were often unsure what information they needed to provide to the LLMs to obtain accurate advice, and the responses they received varied, making it hard to know what to do next.
The researchers also found that existing evaluation methods for AI chatbots do not adequately reflect the complexity of human-user interaction.
“These findings highlight the difficulty of building AI systems that can genuinely support people in sensitive, high-stakes areas like health,” said Dr Rebecca Payne, GP, lead medical practitioner on the study (Nuffield Department of Primary Care Health Sciences, University of Oxford, and Bangor University).
“Despite all the hype, AI just isn’t ready to take on the role of the physician. Patients need to be aware that asking a large language model about their symptoms can be dangerous, giving wrong diagnoses and failing to recognise when urgent help is needed.”
Lead author Andrew Bean, a DPhil student at the Oxford Internet Institute, said: “Designing robust testing for large language models is key to understanding how we can make use of this new technology. In this study, we show that interacting with humans poses a challenge even for top LLMs. We hope this work will contribute to the development of safer and more useful AI systems.”
Senior author Associate Professor Adam Mahdi (Oxford Internet Institute) said: “The disconnect between benchmark scores and real-world performance should be a wake-up call for AI developers and regulators. Our recent work on construct validity in benchmarks shows that many evaluations fail to measure what they claim to measure, and this study demonstrates exactly why that matters. We cannot rely on standardised tests alone to determine if these systems are safe for public use. Just as we require clinical trials for new medications, AI systems need rigorous testing with diverse, real users to understand their true capabilities in high-stakes settings like healthcare.”