Hugging Face releases a benchmark for testing generative AI on health tasks

Generative AI models are increasingly being brought to healthcare settings — in some cases prematurely, perhaps. Early adopters believe that they’ll unlock increased efficiency while revealing insights that’d otherwise be missed. Critics, meanwhile, point out that these models have flaws and biases that could contribute to worse health outcomes.

But is there a quantitative way to know how helpful, or harmful, a model might be when tasked with things like summarizing patient records or answering health-related questions?

Hugging Face, the AI startup, proposes a solution in a newly released benchmark test called Open Medical-LLM. Created in partnership with researchers at the nonprofit Open Life Science AI and the University of Edinburgh’s Natural Language Processing Group, Open Medical-LLM aims to standardize how the performance of generative AI models is evaluated on a range of medical tasks.

New: Open Medical LLM Leaderboard! 🩺

In basic chatbots, errors are annoyances.
In medical LLMs, errors can have life-threatening consequences 🩸

It’s therefore vital to benchmark/follow advances in medical LLMs before thinking about deployment.

Blog: https://t.co/pddLtkmhsz

Clémentine Fourrier 🍊 (@clefourrier), April 18, 2024

Open Medical-LLM isn’t a from-scratch benchmark, per se, but rather a stitching-together of existing test sets (MedQA, PubMedQA, MedMCQA and so on) designed to probe models for general medical knowledge and for knowledge of related fields, such as anatomy, pharmacology, genetics and clinical practice. The benchmark contains multiple-choice and open-ended questions that require medical reasoning and understanding, drawing on material including U.S. and Indian medical licensing exams and college biology test question banks.
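To make the "stitching-together" idea concrete, here is a minimal Python sketch of how a composite benchmark might score a model: accuracy is computed separately for each source test set and then averaged. It is an illustration only, not Hugging Face's evaluation code. The `score_benchmark` function, the item fields, the placeholder questions and the unweighted average are all assumptions made for this example.

```python
from collections import defaultdict


def score_benchmark(items, predict):
    """Return (accuracy per source test set, unweighted mean of those accuracies).

    `items` is a list of dicts with hypothetical keys: source, question,
    options, answer. `predict` is any callable that picks one of the options.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        total[item["source"]] += 1
        if predict(item["question"], item["options"]) == item["answer"]:
            correct[item["source"]] += 1
    per_set = {name: correct[name] / total[name] for name in total}
    if not per_set:
        return {}, 0.0
    # Assumption: every test set counts equally, whatever its size.
    average = sum(per_set.values()) / len(per_set)
    return per_set, average


# Made-up placeholder items; real test sets hold thousands of questions.
ITEMS = [
    {"source": "MedQA", "question": "placeholder 1",
     "options": ["A", "B", "C", "D"], "answer": "A"},
    {"source": "MedQA", "question": "placeholder 2",
     "options": ["A", "B", "C", "D"], "answer": "B"},
    {"source": "PubMedQA", "question": "placeholder 3",
     "options": ["yes", "no", "maybe"], "answer": "yes"},
    {"source": "PubMedQA", "question": "placeholder 4",
     "options": ["yes", "no", "maybe"], "answer": "no"},
    {"source": "MedMCQA", "question": "placeholder 5",
     "options": ["A", "B", "C", "D"], "answer": "C"},
]


def always_first_option(question, options):
    """Trivial stand-in for a model: always picks the first option."""
    return options[0]


per_set, average = score_benchmark(ITEMS, always_first_option)
print(per_set, round(average, 3))
```

Reporting each test set separately, rather than only a single pooled number, keeps a large question bank from drowning out a small one. Open-ended questions would need a different scoring method, which this sketch does not attempt.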

“[Open Medical-LLM] enables researchers and practitioners to identify the strengths and weaknesses of different approaches, drive further advancements in the field and ultimately contribute to better patient care and outcome,” Hugging Face wrote in a blog post.


Hugging Face is positioning the benchmark as a “robust assessment” of healthcare-bound generative AI models. But some medical experts on social media cautioned against putting too much stock into Open Medical-LLM, lest it lead to ill-informed deployments.

On X, Liam McCoy, a resident physician in neurology at the University of Alberta, pointed out that the gap between the “contrived environment” of medical question-answering and actual clinical practice can be quite large.

It is great progress to see these comparisons head-to-head, but important for us to also remember how big the gap is between the contrived environment of medical question answering and actual clinical practice! Not to mention the idiosyncratic risks these metrics can’t capture.

Liam McCoy, MD MSc (@LiamGMcCoy), April 18, 2024

Hugging Face research scientist Clémentine Fourrier, who co-authored the blog post, agreed.

“These leaderboards should only be used as a first approximation of which [generative AI model] to explore for a given use case, but then a deeper phase of testing is always needed to examine the model’s limits and relevance in real conditions,” Fourrier replied on X. “Medical [models] should absolutely not be used on their own by patients, but instead should be trained to become support tools for MDs.”

It brings to mind Google’s experience when it tried to bring an AI screening tool for diabetic retinopathy to healthcare systems in Thailand.

Google created a deep learning system that scanned images of the eye, looking for evidence of retinopathy, a leading cause of vision loss. But despite high theoretical accuracy, the tool proved impractical in real-world testing, frustrating both patients and nurses with inconsistent results and a general lack of harmony with on-the-ground practices.

It’s telling that of the 139 AI-related medical devices the U.S. Food and Drug Administration has approved to date, none use generative AI. It’s exceptionally difficult to test how a generative AI tool’s performance in the lab will translate to hospitals and outpatient clinics, and, perhaps more importantly, how the outcomes might trend over time.

That’s not to suggest Open Medical-LLM isn’t useful or informative. The results leaderboard, if nothing else, serves as a reminder of just how poorly models answer basic health questions. But neither Open Medical-LLM nor any other benchmark, for that matter, is a substitute for carefully thought-out real-world testing.


Source: TechCrunch
