Paging Dr. AI: machines now beat doctors where diagnosis gets messy

New Harvard study finds OpenAI's o1 model outperformed doctors in ER triage and clinical management, but researchers warn the results do not prove AI is ready for routine clinical use; for now, they say, it should act as a second opinion

A patient arrives at the emergency room with a blood clot in his lungs. His condition improves, then suddenly worsens. The medical team suspects the treatment has failed.
At that point, artificial intelligence enters the picture. After scanning his medical file, it suggests a completely different theory: the patient has a history of lupus, an autoimmune disease that can cause heart inflammation, and that, according to the model, explains the deterioration. The theory turns out to be correct.
The gap between AI-assisted triage and doctors’ diagnoses was consistent and significant. Illustration
(Photo: Shutterstock)
This is not a fictional scene from a medical drama. It happened not long ago in the emergency room at Beth Israel Deaconess Medical Center in Boston. A new study, recently published in the prestigious journal Science, argues that such cases may soon become routine, and the gap it found in favor of the AI model is striking.

Six experiments, one result

The study, led by researchers from Harvard and Beth Israel Deaconess in collaboration with Stanford researchers, did not rely on a single test. The researchers conducted six separate experiments pitting OpenAI’s o1 model, a new-generation “chain-of-thought” reasoning model that works through a problem step by step before reaching a conclusion, against hundreds of doctors with varying levels of training and experience, including residents, specialists and family physicians. In every experiment, without exception, the model outperformed the humans.
The most significant experiment, the one most closely resembling real clinical practice, used 76 real cases from the Beth Israel ER. The model and two specialist physicians received exactly the same data: electronic medical records, vital signs and a few sentences written by the intake nurse. Two additional doctors, who did not know the source of each diagnosis, rated the results.
The gap was consistent and significant. At the triage stage, when information is minimal and pressure is at its peak, the model identified the correct or very close diagnosis in 67% of cases. The first doctor scored 55%, and the second scored 50%. As more information became available, everyone’s accuracy improved. Still, the gap remained: When the patient was admitted to the internal medicine unit, the model reached 81.6%, compared with 78.9% and 69.7% for the physicians.
“That’s the big conclusion for me,” said Dr. Adam Rodman of Harvard Medical School, who leads the task force integrating AI into the curriculum and directs the AI program at the Shapiro Center at Beth Israel Deaconess. “It works with the messy data of a real emergency room. It works for real-world diagnosis.”

It also sounds like a doctor

One important methodological detail shows how blurred the line between AI and a human physician has become. The doctors who rated the diagnoses were asked to guess whether each answer had been written by a human doctor or by AI. One could not decide in 83.6% of cases, the other in 94.4%. In other words, the AI was not only more accurate, it also sounded like a doctor.
Since the 1950s, complex case discussions published in the New England Journal of Medicine have served as benchmarks for evaluating computerized diagnostic systems. These are real patient cases filled with misleading and distracting details across dozens of medical fields.
“The performance of AI compared with human experts on these cases shocked many people,” said Prof. Arjun Manrai, an associate professor of biomedical informatics at Harvard Medical School and one of the study’s leaders.
Artificial intelligence is becoming a natural part of medicine. Illustration
(Photo: Shutterstock)
Across 143 cases from 2021 to 2024, the model included the correct diagnosis in 78.3% of cases. When the criteria were expanded to include diagnoses that were “very close,” accuracy rose to 97.9%. In a direct comparison with GPT-4 on 70 cases, o1 outperformed its predecessor, 88.6% to 72.9%.
Thomas Buckley, a doctoral student at Harvard Medical School who took part in the study, said the results suggest o1 is approaching optimal diagnosis on these challenging cases, which have been used to test computer diagnostic capabilities since 1959.

A striking gap in treatment management

The study’s most surprising finding did not concern diagnosis but what doctors call “management reasoning,” the clinical decisions that come after a diagnosis: which tests to order, whether to recommend antibiotics and how to discuss end-of-life care. In five complex scenarios developed by 25 experts, the AI model received a median score of 89%. Doctors using conventional resources, including up-to-date Google searches, scored just 34%.
"Management reasoning is likely a more complex task than diagnostic reasoning," said Dr. Peter Brodeur, a subspecialty fellow at Beth Israel who participated in the study. "It requires many considerations of not only the objective features of a case, but also subjective factors: what context and situations you’re in, and therefore, it probably doesn’t come as a surprise that a reasoning model performs significantly better.”
The researchers are careful to stress what the study did not examine, and what the model still cannot do. All the experiments were based on text input only. In practice, clinical medicine is full of nontext data: X-rays, ECGs, physiological measurements and even the way a patient looks, sounds and feels.
“Doctors have to listen to the patient, review a chest X-ray, analyze an ECG and an echocardiogram,” Manrai said. “They use many different types of data in everyday clinical decision-making.” He said his team is conducting “parallel studies looking at the performance of these models on images” and is seeing rapid improvement, but the data have not yet been published.
Dr. Wei Xing, a lecturer at the University of Sheffield who was not involved in the study, added an important caveat: the study does not analyze which patients the model diagnoses less accurately, such as older patients or those who do not speak English. He also warned that doctors may begin relying on AI answers instead of thinking independently, a tendency that “could grow more significant as AI becomes more routinely used in clinical settings,” he told The Guardian.
He concluded that the study “does not demonstrate that AI is safe for routine clinical use, nor that the public should turn to freely available AI tools as a substitute for medical advice.”

Do not take doctors out of the equation

Manrai and Rodman are emphatic that the findings do not support replacing doctors. “At the end of the day,” Manrai said, “people want human beings to guide them through life-and-death decisions, to guide them through complex treatment decisions, to talk with them about their quality of life.”
Rodman said he does not want to see “medical AI” companies try to reduce doctors’ clinical involvement. “Our findings do not support that,” he said. “What they support is an ambitious research agenda.”
Instead, the researchers point to two areas where AI can already help doctors. The first is ER triage, where patients arrive with unclear symptoms and messy medical records. “You can easily imagine how a system that passively ran over the electronic health record could potentially improve quality if it could try to identify diagnostic errors or missed opportunities for diagnosis before they happened,” Rodman said.
The other use is as a second opinion. “We know that doctors getting second opinions from their human colleagues generally improves care,” he said. “AI can serve as that kind of tool.”
An AI model could offer a second opinion. Pictured: the emergency room at Barzilai Medical Center in Ashkelon
(Photo: ChameleonsEye / Shutterstock)
Dr. David Reich, chief clinical officer of the Mount Sinai hospital system in New York, who was not involved in the study, told NPR that the key open question is not whether the technology is accurate, but how to integrate it into clinical workflows “in ways that improve care.” “This study is a perfect call to action,” he said.
Rodman and Manrai are calling for additional controlled studies to determine how the technology affects actual patient outcomes. “We are witnessing a truly profound change in technology that will reshape medicine,” Manrai said, urging that the technology be tested in controlled clinical trials before becoming part of routine care.
The American study comes at an interesting time from an Israeli perspective as well. Just days ago, Israel’s Health Ministry approved for the first time the use of an AI tool for psychiatric triage: LIV, a system developed by the Israeli startup Mentaily, which grew out of Sheba Medical Center’s ARC innovation arm.
The system, in which an avatar conducts a conversation with the patient resembling an initial psychiatric interview, achieved about 90% agreement with psychiatrists’ assessments in two clinical studies involving 385 patients and identified about 96% of high-risk conditions.
“The goal is not to replace the therapist, but to empower them,” said Dr. Assaf Caspi, deputy director of Sheba’s psychiatric division and one of the venture’s founders.
For now, the Israeli system is being used as an aid for physicians.