LobA: Localizing Before Answering

A Hallucination Evaluation Benchmark for Grounded Medical Multimodal LLMs

IJCAI 2025
1 Hanoi University of Science and Technology
2 Australian Institute for Machine Learning, The University of Adelaide
3 Griffith University
4 College of Medicine and Public Health, Flinders University
5 University of Wollongong

HEAL-MedVQA: A benchmark of 67K doctor-annotated VQA pairs for evaluating medical multimodal LLMs on localization errors and hallucinations. Our LobA framework improves visual reasoning by guiding models to localize before answering, achieving state-of-the-art accuracy and robustness in medical VQA.

Abstract

HEAL-MedVQA. We introduce HEAL-MedVQA (Hallucination Evaluation via Localization in Medical VQA), a benchmark designed to evaluate hallucination robustness and localization ability of medical LMMs using 67K doctor-annotated VQA pairs and anatomical segmentation masks.

Evaluation Protocols. HEAL-MedVQA includes two protocols—Textual Perturbation Test (TPT) and Visual Perturbation Test (VPT)—to diagnose shortcut reliance on language or irrelevant image areas.

LobA Framework. We propose Localize-before-Answer (LobA), a method that improves visual grounding by training models to localize pathological regions and self-prompt for more accurate answers.

Performance. LobA significantly outperforms state-of-the-art medical LMMs on HEAL-MedVQA, advancing both robustness and reliability in medical VQA.

Hallucination in Medical VQA

Contradicting Source Evidence: LMMs frequently generate responses that are false or contradict the visual information presented in images, often due to poor localization reasoning.

Reliance on Shortcuts Instead of Analysis: Models often bypass analyzing relevant image regions. Instead, they rely on:

  • Textual Shortcuts: Using learned word associations (like common disease/location pairs) from training data, rather than the specific image's visual content.
  • Visual Shortcuts: Focusing attention on irrelevant or non-queried image areas, instead of the region pertinent to the question.
Medical VQA Hallucination Example

Evaluation Protocol

Textual Perturbation Test (TPT): Tests reliance on language patterns by swapping anatomical terms or disease names in correctly answered questions, then measures whether the model's answer changes appropriately, verifying true visual understanding.

Visual Perturbation Test (VPT): Evaluates visual grounding by overlaying disease-free anatomical regions onto test images, then verifies whether the model's decisions are based on the correct image areas.


Dataset Curation

HEAL-MedVQA contains 67K QA pairs with doctor-annotated segmentation masks. Our pipeline:

Anatomy Segmentation: Generate precise masks for core anatomical structures

Disease Extraction: Obtain disease labels and bounding boxes

Anatomy-Disease Mapping: Link anatomies to diseases using IoU scores (sketched after these steps)

QA Generation: Create QA pairs from validated associations

Data Pipeline
QA Examples

LobA Framework

Region Localization: When a medical image question is received, our model first localizes the region of interest as a mask.

Patch Interpolation: The mask is interpolated onto the input patch grid, enhancing focus on the relevant region (see the sketch after these steps).

Answer Generation: The highlighted input is processed again to generate the final, region-focused answer.

LobA Framework Overview

Performance

Performance on the proposed HEAL-MedVQA benchmark: LobA significantly outperforms existing medical VQA models, achieving a 15% increase in accuracy across diverse medical image types and question categories.

Accuracy Comparison

Textual and Visual Perturbation Tests: Our framework demonstrates superior resilience against both textual perturbations (Left) and visual perturbations (Right), maintaining consistent performance even under challenging test conditions.

Textual Perturbation Analysis
Visual Perturbation Analysis

Ablation studies of different components of LobA: LobA achieves precise anatomical localization with an average IoU of 0.85, significantly reducing hallucination instances and improving answer reliability.

Localization Quality