HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale

Abstract

The rapid development of multimodal large language models (MLLMs), such as GPT-4V, has led to significant advancements. However, these models still face challenges in medical multimodal capabilities due to limitations in the quantity and quality of medical vision-text data, stemming from data privacy concerns and high annotation costs. While pioneering approaches utilize PubMed's large-scale, de-identified medical image-text pairs to address these limitations, they still fall short due to inherent data noise. To tackle this, we refined medical image-text pairs from PubMed and employed MLLMs (GPT-4V) in an 'unblinded' capacity to denoise and reformat the data, resulting in the creation of the PubMedVision dataset with 1.3 million medical VQA samples. Our validation demonstrates that: (1) PubMedVision can significantly enhance the medical multimodal capabilities of current MLLMs, showing significant improvement in benchmarks including the MMMU Health & Medicine track; (2) manual checks by medical experts and empirical results validate the superior data quality of our dataset compared to other data construction methods. Using PubMedVision, we train a 34B medical MLLM, HuatuoGPT-Vision, which shows superior performance in medical multimodal scenarios among open-source MLLMs.
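
The core step the abstract describes is the 'unblinded' reformatting: the MLLM is shown both the PubMed image and its original caption, and is asked to denoise the text and rewrite it as a VQA sample. The snippet below is a minimal illustrative sketch of that step, not the authors' released pipeline; the model name ("gpt-4o" as a stand-in for a GPT-4V-class model), the prompt wording, and the OpenAI client usage are assumptions.

```python
# Illustrative sketch (not the paper's released code): turn one PubMed
# image-caption pair into a VQA sample by prompting a vision-capable MLLM
# "unblinded", i.e. the model sees the image together with its caption.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def caption_to_vqa(image_path: str, caption: str) -> str:
    """Denoise a PubMed caption and rewrite it as a question-answer pair,
    with the model shown the image alongside the caption."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    prompt = (
        "You are given a medical figure and its original caption.\n"
        f"Caption: {caption}\n"
        "Remove noise such as citation markers and figure references, then "
        "rewrite the content as a single question-answer pair grounded in "
        "what the image actually shows."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # stand-in; the paper uses a GPT-4V-class model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```

Applied to the refined PubMed pairs at scale, a call of this kind is what converts raw figure-caption data into the 1.3 million VQA-format samples that make up PubMedVision.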
