Study: ChatGPT can out-diagnose doctors, but doctors shouldn't rely on the tool

ChatGPT outperformed clinicians at times, but was also completely wrong at others.
By Cassie McGrath – Reporter, Boston Business Journal


Researchers in a study from Beth Israel Deaconess Medical Center said that ChatGPT can be helpful for clinicians, but cannot replace humans in healthcare.

GPT-4, the model behind ChatGPT, can be a useful tool to help diagnose patients, but cannot be relied upon to make clinical decisions, according to a new study from Beth Israel Deaconess Medical Center.

The artificial intelligence language program at times outperformed humans in diagnosing patients, but it also gave answers built on incorrect reasoning, according to Dr. Adam Rodman, an internal medicine physician and investigator in the department of medicine at BIDMC who worked on the research.

To conduct the study, which was published in JAMA Internal Medicine last week, Rodman worked with other researchers to create a tool for assessing doctors' clinical reasoning, called the revised-IDEA score.

Investigators recruited 21 attending physicians and 18 residents, who each worked through 20 clinical cases to diagnose patients. The cases unfold in stages: initial triage, when a patient says what is bothering them; a review of systems, when clinicians gather additional information from the patient; the physical exam; and finally diagnostic testing and imaging.

ChatGPT was also given a prompt with identical instructions and worked through all 20 clinical cases. Researchers then scored all of the responses for clinical reasoning.

"It’s a surprising finding that these things are capable of showing the equivalent or better reasoning than people throughout the evolution of clinical cases," Rodman said in a statement.

Researchers found that ChatGPT earned the highest revised-IDEA scores, with a median of 10, compared with 9 for attending physicians and 8 for residents. But the bot also used incorrect reasoning in 12% of cases, significantly more often than the residents did.

For example, Rodman said the bot suggested that a patient might have gastroenteritis after traveling to New Mexico. That answer showed the bot knew travel can be linked to gastroenteritis, he said, but not that travel to New Mexico carries no such risk.

Rodman also noted that ChatGPT had to be fed information before it could make any clinical recommendations; it didn't triage the patient on its own. The way clinicians draw out information in conversation with patients, he said, is also crucial to eventually reaching a diagnosis.

Researchers took this to mean that AI would be most useful as a tool to support but not replace human reasoning.  

“Further studies are needed to determine how [large language models] can best be integrated into clinical practice, but even now, they could be useful as a checkpoint, helping us make sure we don't miss something,” lead author Dr. Stephanie Cabral, a third-year internal medicine resident at BIDMC, said in a statement. “My ultimate hope is that AI will improve the patient-physician interaction by reducing some of the inefficiencies we currently have and allow us to focus more on the conversation we’re having with our patients."



