Data & Analytics

For healthcare LLMs, ‘unhelmeted’ does not compute

There are “major methodological concerns” with using LLMs to analyze clinical notes: study.

If you’re hoping a well-crafted prompt and a large language model can save you from sifting through 54,000 hospital reports, don’t throw out your staple remover just yet.

Columbia University findings published on August 13 in JAMA Network Open revealed that a large language model (LLM)—a computational model that understands and generates text—couldn’t outperform a simpler text-recognition option known as “text-string” search.

More troubling, according to the report’s lead researcher: The LLM couldn’t replicate its findings.

“It’s no guarantee that if you repeat the experiments, you’re going to get the same results,” Kathryn Burford, postdoctoral researcher at Columbia University Mailman School of Public Health, told IT Brew. “And that’s a major, major problem for scientists.”
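One way to illustrate the reproducibility worry is to send a model the identical classification prompt several times and tally the answers. The sketch below is not the study’s code; the note text, prompt wording, model name, and run count are all assumptions for illustration:

```python
from collections import Counter

from openai import OpenAI  # assumes the openai Python package is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical clinical snippet and prompt; not from the study's dataset.
NOTE = "Pt fell from scooter, unhelmeted, GCS 15."
PROMPT = (
    "Answer with exactly one word (helmeted, unhelmeted, or unknown) "
    f"for this ED note: {NOTE}"
)

def label_once() -> str:
    resp = client.chat.completions.create(
        model="gpt-4",     # model choice is an assumption
        temperature=0,     # even at 0, outputs are not guaranteed identical
        messages=[{"role": "user", "content": PROMPT}],
    )
    return resp.choices[0].message.content.strip().lower()

# Repeat the identical query and tally the answers; a reproducible
# classifier would return the same label all ten times.
counts = Counter(label_once() for _ in range(10))
print(counts)
```

A deterministic method returns the same answer ten times out of ten; the Columbia team’s complaint is that the LLM offered no such guarantee.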

The study

  • Paperwork. The researchers used OpenAI’s GPT-4 model to peruse notes from 54,569 emergency-department visits. The goal: find patients in the dataset who wore helmets before experiencing their bike-, scooter-, or hoverboard-related injuries.
  • Not not helmet. LLMs had trouble identifying negations, according to the report, including terms like “without a helmet,” “w/o helmet,” and “unhelmeted.” “The LLM often hallucinated and was not consistent in replicating its hallucinations,” the team concluded.
  • String theory. The large language model did not surpass text-string search, a code-based scan of large datasets for human-curated phrases like “unhelmeted” (sketched after this list). According to the study, the LLM had “moderate to weak agreement with a simple string search method,” unless the prompt contained all human-labeled text strings.
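For readers unfamiliar with the simpler baseline: a text-string search is just deterministic pattern matching against a fixed, human-compiled phrase list. Here is a minimal sketch, assuming hypothetical phrases, field names, and note text (not the study’s actual list or data):

```python
import re

# Hypothetical, human-compiled phrases signaling helmet status in a note.
# The real study used its own labeled list; these are illustrative only.
HELMETED_PHRASES = ["wearing a helmet", "helmeted", "with helmet"]
UNHELMETED_PHRASES = ["without a helmet", "w/o helmet", "unhelmeted", "no helmet"]

def phrase_pattern(phrases):
    # One case-insensitive regex with word boundaries around each phrase.
    joined = "|".join(re.escape(p) for p in phrases)
    return re.compile(rf"\b(?:{joined})\b", re.IGNORECASE)

UNHELMETED_RE = phrase_pattern(UNHELMETED_PHRASES)
HELMETED_RE = phrase_pattern(HELMETED_PHRASES)

def classify_note(note: str) -> str:
    # Check negation phrases first so "without a helmet" wins
    # even when the note also mentions "helmet" elsewhere.
    if UNHELMETED_RE.search(note):
        return "unhelmeted"
    if HELMETED_RE.search(note):
        return "helmeted"
    return "undocumented"

notes = [
    "Pt fell from scooter, unhelmeted, GCS 15.",
    "Bicycle crash; pt was wearing a helmet.",
    "Hoverboard injury, left wrist pain.",
]
for note in notes:
    print(classify_note(note), "|", note)
```

The trade-off is the one the study highlights: this approach is cheap, auditable, and perfectly repeatable, but it only catches phrasings that someone thought to put on the list.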

A May 2024 study from market-intelligence firm IDC, which polled 888 IT and business leaders globally across 18 industries, found that 35% of respondents are investing significantly in generative AI (14% are not), and 19% have already introduced production-level generative AI services.

“If I’m a doctor, and I’m recording my notes and interactions, right now I may have to actually sit back and transcribe all that into actual data entry into the system. That could go away,” Jeremy Huval, chief innovation officer at certifications provider HITRUST, told IT Brew in March, while warning that the tech could not yet be trusted with higher-risk healthcare decisions.

In January 2024, the World Health Organization published recommendations for ethical use of LLMs in healthcare, like conducting impact assessments and ensuring models are “designed to perform well-defined tasks.”

Columbia performed its statistical analysis from November 2023 to April 2024.

Large language models change quickly; less than a month after the team concluded its research, OpenAI (which did not respond to IT Brew’s interview request by publication) introduced GPT-4o, which demonstrated emotive responses, answers to paper-based arithmetic, and on-screen code translation.

One feat untested during the product’s unveiling, and perhaps on hold for a later model: spotting “unhelmeted.”
