ReXSonoVQA: A Video QA Benchmark for Procedure-Centric Ultrasound Understanding

Xucheng Wang; Xiaoman Zhang; Ankit Pal; Pranav Rajpurkar

Proceedings of the 7th Conference on Health, Inference, and Learning, PMLR 333:427-447, 2026.

Abstract

Ultrasound acquisition requires skilled probe manipulation and real-time adjustments. Vision-language models (VLMs) could enable autonomous ultrasound systems, but existing benchmarks evaluate only static images, not dynamic procedural understanding. We introduce ReXSonoVQA, a video QA benchmark with 514 video clips and 514 questions (249 MCQ, 265 free-response) targeting three competencies: Action-Goal Reasoning, Artifact Resolution & Optimization, and Procedure Context & Planning. Zero-shot evaluation of Gemini 3 Pro, Qwen3.5-397B, LLaVA-Video-72B, and Seed 2.0 Pro shows VLMs can extract some procedural information, but troubleshooting questions remain challenging with minimal gains over text-only baselines, exposing limitations in causal reasoning. ReXSonoVQA enables developing perception systems for ultrasound training, guidance, and robotic automation.

Cite this Paper

BibTeX

@InProceedings{pmlr-v333-wang26b,
  title = 	 {ReXSonoVQA: A Video QA Benchmark for Procedure-Centric Ultrasound Understanding},
  author =       {Wang, Xucheng and Zhang, Xiaoman and Pal, Ankit and Rajpurkar, Pranav},
  booktitle = 	 {Proceedings of the 7th Conference on Health, Inference, and Learning},
  pages = 	 {427--447},
  year = 	 {2026},
  editor = 	 {Healey, Elizabeth and Fries, Jason and Pollard, Tom and Tang, Shengpu and Zink, Anna and Hartvigsen, Tom and Agrawal, Monica and Finlayson, Sam and Glicksberg, Benjamin and Beaulieu-Jones, Brett and Wang, Kai and Fontalvo, Daseyra and Sarker, Tasmie and Chen, Irene and Alsentzer, Emily},
  volume = 	 {333},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {29--30 Jun},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v333/main/assets/wang26b/wang26b.pdf},
  url = 	 {https://proceedings.mlr.press/v333/wang26b.html},
  abstract = 	 {Ultrasound acquisition requires skilled probe manipulation and real-time adjustments. Vision-language models (VLMs) could enable autonomous ultrasound systems, but existing benchmarks evaluate only static images, not dynamic procedural understanding. We introduce ReXSonoVQA, a video QA benchmark with 514 video clips and 514 questions (249 MCQ, 265 free-response) targeting three competencies: Action-Goal Reasoning, Artifact Resolution \& Optimization, and Procedure Context \& Planning. Zero-shot evaluation of Gemini 3 Pro, Qwen3.5-397B, LLaVA-Video-72B, and Seed 2.0 Pro shows VLMs can extract some procedural information, but troubleshooting questions remain challenging with minimal gains over text-only baselines, exposing limitations in causal reasoning. ReXSonoVQA enables developing perception systems for ultrasound training, guidance, and robotic automation.}
}

Endnote

%0 Conference Paper
%T ReXSonoVQA: A Video QA Benchmark for Procedure-Centric Ultrasound Understanding
%A Xucheng Wang
%A Xiaoman Zhang
%A Ankit Pal
%A Pranav Rajpurkar
%B Proceedings of the 7th Conference on Health, Inference, and Learning
%C Proceedings of Machine Learning Research
%D 2026
%E Elizabeth Healey
%E Jason Fries
%E Tom Pollard
%E Shengpu Tang
%E Anna Zink
%E Tom Hartvigsen
%E Monica Agrawal
%E Sam Finlayson
%E Benjamin Glicksberg
%E Brett Beaulieu-Jones
%E Kai Wang
%E Daseyra Fontalvo
%E Tasmie Sarker
%E Irene Chen
%E Emily Alsentzer
%F pmlr-v333-wang26b
%I PMLR
%P 427--447
%U https://proceedings.mlr.press/v333/wang26b.html
%V 333
%X Ultrasound acquisition requires skilled probe manipulation and real-time adjustments. Vision-language models (VLMs) could enable autonomous ultrasound systems, but existing benchmarks evaluate only static images, not dynamic procedural understanding. We introduce ReXSonoVQA, a video QA benchmark with 514 video clips and 514 questions (249 MCQ, 265 free-response) targeting three competencies: Action-Goal Reasoning, Artifact Resolution & Optimization, and Procedure Context & Planning. Zero-shot evaluation of Gemini 3 Pro, Qwen3.5-397B, LLaVA-Video-72B, and Seed 2.0 Pro shows VLMs can extract some procedural information, but troubleshooting questions remain challenging with minimal gains over text-only baselines, exposing limitations in causal reasoning. ReXSonoVQA enables developing perception systems for ultrasound training, guidance, and robotic automation.

APA

Wang, X., Zhang, X., Pal, A., & Rajpurkar, P. (2026). ReXSonoVQA: A Video QA Benchmark for Procedure-Centric Ultrasound Understanding. Proceedings of the 7th Conference on Health, Inference, and Learning, in Proceedings of Machine Learning Research 333:427-447. Available from https://proceedings.mlr.press/v333/wang26b.html.
