
Emergency medicine has long resisted the idea that pattern recognition software could match a trained clinician reading a live patient. The stakes are too high, the variables too human. Yet benchmarks keep moving. Large language models have steadily closed gaps on standardized medical exams, and researchers have begun pushing them into messier, higher-stakes territory.
This week, a team led by physicians and computer scientists at Harvard Medical School and Beth Israel Deaconess Medical Center published findings in Science that add a harder data point to that trend. Working with real emergency room cases and real physicians as a comparison group, they tested OpenAI's o1 and GPT-4o models on diagnostic accuracy. The results were specific enough to generate both serious interest and serious pushback.
The Experiment Design
Researchers drew on 76 patients who came into the Beth Israel emergency room. Two internal medicine attending physicians provided diagnoses at multiple points during each case. The AI models received the same text-based information available in the electronic medical records at the time of each diagnosis. Critically, the team stated that the data were not pre-processed in any way before being presented to the models. A separate pair of attending physicians then assessed all the diagnoses without knowing which came from humans and which came from AI.
Where the Numbers Land
At initial ER triage, OpenAI's o1 model produced the exact or a very close diagnosis in 67% of cases. One physician hit that mark 55% of the time; the other reached it 50% of the time. The study noted that differences between o1 and the physicians were especially pronounced at this first diagnostic touchpoint, where the least information is available and the urgency to decide correctly is highest. GPT-4o was also evaluated, with o1 rated as performing nominally better than, or on par with, both the physicians and GPT-4o across diagnostic touchpoints.
What the Lead Researchers Claimed
Arjun Manrai, who heads an AI lab at Harvard Medical School and is one of the study's lead authors, stated in the Harvard press release that the AI model was tested against virtually every benchmark and eclipsed both prior models and physician baselines. The study stopped short of recommending clinical deployment. Its stated conclusion was that the findings show an urgent need for prospective trials to evaluate these technologies in real-world patient care settings.
The Accountability Gap
Adam Rodman, a Beth Israel doctor and co-lead author, warned separately that there is currently no formal framework for accountability around AI diagnoses. He also noted that patients still want humans to guide them through life-or-death decisions and challenging treatment choices. The study itself acknowledged a key constraint: the researchers tested the models only on text-based information, and existing studies suggest current foundation models are more limited when reasoning over non-text inputs.
The Specialist Comparison Problem
Emergency physician Kristen Panthagani published a direct critique of how the study has been framed in coverage. Her core objection was methodological: the AI was compared to internal medicine attending physicians, not emergency physicians. She argued that if researchers are going to compare AI tools to physicians' clinical ability, the comparison should start with physicians who actually practice that specialty. Panthagani also reframed what ER triage actually demands, writing that an ER doctor's primary goal when seeing a patient for the first time is not to guess the ultimate diagnosis but to determine whether the patient has a condition that could kill them. That distinction matters when interpreting the 67% figure.
The study's publication in Science and Harvard's accompanying press release ensured wide circulation. The o1 accuracy figure of 67% against baselines of 55% and 50% is the kind of concrete delta that travels fast in both medical and technology circles. Whether it survives the scrutiny of prospective clinical trials is a different question entirely.
If follow-on trials do validate o1's diagnostic performance under real clinical conditions, the pressure on healthcare systems to define accountability frameworks will accelerate quickly. The gap between what a model can do on a chart and what it should be allowed to do with a patient may close faster than the regulatory and liability structures needed to manage it can be built. For now, the Harvard findings are a benchmark, not a deployment signal.