LobA: Localizing Before Answering

A Hallucination Evaluation Benchmark for Grounded Medical Multimodal LLMs

IJCAI 2025
1 Hanoi University of Science and Technology
2 Australian Institute for Machine Learning, The University of Adelaide
3 Griffith University
4 College of Medicine and Public Health, Flinders University
5 University of Wollongong

HEAL-MedVQA: A benchmark of 67K doctor-annotated VQA pairs for evaluating medical multimodal LLMs on localization errors and hallucinations. Our LobA framework improves visual reasoning by guiding models to localize before answering, achieving state-of-the-art accuracy and robustness in medical VQA.

Abstract

HEAL-MedVQA. We introduce HEAL-MedVQA (Hallucination Evaluation via Localization in Medical VQA), a benchmark designed to evaluate hallucination robustness and localization ability of medical LMMs using 67K doctor-annotated VQA pairs and anatomical segmentation masks.

Evaluation Protocols. HEAL-MedVQA includes two protocols—Textual Perturbation Test (TPT) and Visual Perturbation Test (VPT)—to diagnose shortcut reliance on language or irrelevant image areas.

LobA Framework. We propose Localize-before-Answer (LobA), a method that improves visual grounding by training models to localize pathological regions and self-prompt for more accurate answers.

Performance. LobA significantly outperforms state-of-the-art medical LMMs on HEAL-MedVQA, advancing both robustness and reliability in medical VQA.

Hallucination in Medical VQA

Contradicting Source Evidence: LMMs frequently generate responses that are false or contradict the visual information presented in images, often due to poor localization reasoning.

Reliance on Shortcuts Instead of Analysis: Models often bypass analyzing relevant image regions. Instead, they rely on:

  • Textual Shortcuts: Using learned word associations (like common disease/location pairs) from training data, rather than the specific image's visual content.
  • Visual Shortcuts: Focusing attention on irrelevant or non-queried image areas, instead of the region pertinent to the question.
Medical VQA Hallucination Example

Evaluation Protocol

Textual Perturbation Test (TPT): Tests reliance on language patterns by swapping anatomical terms or disease names in correctly answered questions, then measures whether the model's answer changes appropriately, verifying true visual understanding.

Visual Perturbation Test (VPT): Evaluates visual grounding by overlaying disease-free anatomical regions onto test images, then verifies whether the model's decisions are based on the correct image areas.


Dataset Curation

HEAL-MedVQA contains 67K QA pairs with doctor-annotated segmentation masks. Our pipeline:

Anatomy Segmentation: Generate precise masks for core anatomical structures

Disease Extraction: Obtain disease labels and bounding boxes

Anatomy-Disease Mapping: Link anatomies to diseases using IoU scores (sketched after these steps)

QA Generation: Create QA pairs from validated associations

Data Pipeline
QA Examples

LobA Framework

Region Localization: When a medical image question is received, our model first localizes the region of interest as a mask.

Patch Interpolation: The mask is interpolated onto the input patch grid, enhancing focus on the relevant region (see the sketch after these steps).

Answer Generation: The highlighted input is processed again to generate the final, region-focused answer.

LobA Framework Overview

Performance

Performance on the proposed HEAL-MedVQA benchmark: LobA significantly outperforms existing medical VQA models, achieving a 15% increase in accuracy across diverse medical image types and question categories.

Accuracy Comparison

Textual and Visual Perturbation Tests: Our framework demonstrates superior resilience against both textual perturbations (Left) and visual perturbations (Right), maintaining consistent performance even under challenging test conditions.

Textual Perturbation Analysis
Visual Perturbation Analysis

Ablation studies of different components of LobA: LobA achieves precise anatomical localization with an average IoU of 0.85, significantly reducing hallucination instances and improving answer reliability.

Localization Quality