AI in the emergency department: promising, powerful but still unproven

Artificial intelligence can now outperform doctors at diagnosing patients in the emergency department, according to a new study in Science.
The AI was given written notes from real emergency department records from a hospital in Boston, US, and asked to weigh in at different points during each patient’s care. At the earliest stage – triage, when a patient first arrives – the AI identified the correct diagnosis, or something closely related, in 67% of cases.
The two doctors used for comparison managed 50% and 55%. That’s a meaningful gap, especially at the moment when information is scarcest and uncertainty is highest.
This study matters because the field is moving so fast. Earlier research showed that large language models – the technology behind systems like ChatGPT – could pass medical licensing exams. Interesting, but not all that illuminating. Passing an exam is not the same as being useful on a ward.
This new study goes further. It puts AI alongside doctors across several tasks, using genuine clinical text from a real emergency department. That makes it more directly relevant to medical practice than most of what’s come before. It suggests these systems are developing into something that could genuinely help doctors think through a wide range of possible diagnoses, especially in situations where missing a serious condition is the main concern.
There are good reasons, though, not to get carried away.
The AI was working entirely from written text. It never saw the patient, never noticed how breathless or frightened they looked, never examined them, spoke to their family, weighed up the chaos of a busy department, or took any responsibility for what happened next. It was not practising emergency medicine. It was offering a written opinion based on selected information.
There’s also a gap between producing a list of possible diagnoses and actually improving patient outcomes. A longer list might help a doctor think more broadly, but it could equally generate new problems: unnecessary tests, over-treatment, extra workload, or unwarranted confidence in an answer that sounds plausible but turns out to be wrong.
And some of the benchmark cases used in studies like this may have been publicly available when the AI was trained, meaning the model could, in effect, have seen the answers before sitting the test. That doesn’t undermine the emergency department findings, but it is another reason to treat headline numbers with some scepticism.
The hard question
So the real question isn’t whether AI can help doctors think through difficult cases. It’s how these systems should be tested and governed in real clinical settings like the NHS.
That question is already urgent. A Royal College of Physicians snapshot found that 16% of UK doctors were using AI tools in clinical practice every day, with another 15% doing so weekly. In other words, adoption is running ahead of governance: hospitals and health systems haven’t yet properly worked out how to assess these tools, train staff to use them safely, spot when they’re causing harm, or decide who is responsible when something goes wrong.
It’s tempting to say that the solution is to keep a human in the loop. But that phrase does very little work on its own. We need to know which human, in which loop, and with what authority. A doctor’s ability to override an AI suggestion is not, by itself, a safety system. Someone still has to decide which tools get used, who can change how they behave, how harms are spotted, and who is responsible when the tool quietly starts failing.
This study represents genuine progress. But it doesn’t, on its own, change how medicine should be practised. The right response is neither to prohibit these systems nor to let them quietly become part of the routine before anyone has thought it through. They should be trialled in real clinical settings, used as a form of second-opinion support rather than a substitute for clinical judgment, and measured against what actually matters to patients: care that is better, safer and faster.
Ewen Harrison receives funding from a number of grant-giving bodies including UKRI, NIHR, HDRUK, and Wellcome Leap. He is a Deputy Editor with NEJM AI.