ReXSonoVQA: A Video QA Benchmark for Procedure-Centric Ultrasound Understanding

Xucheng Wang, Xiaoman Zhang, Ankit Pal, Pranav Rajpurkar
Proceedings of the 7th Conference on Health, Inference, and Learning, PMLR 333:427-447, 2026.

Abstract

Ultrasound acquisition requires skilled probe manipulation and real-time adjustments. Vision-language models (VLMs) could enable autonomous ultrasound systems, but existing benchmarks evaluate only static images, not dynamic procedural understanding. We introduce ReXSonoVQA, a video QA benchmark with 514 video clips and 514 questions (249 MCQ, 265 free-response) targeting three competencies: Action-Goal Reasoning, Artifact Resolution & Optimization, and Procedure Context & Planning. Zero-shot evaluation of Gemini 3 Pro, Qwen3.5-397B, LLaVA-Video-72B, and Seed 2.0 Pro shows VLMs can extract some procedural information, but troubleshooting questions remain challenging with minimal gains over text-only baselines, exposing limitations in causal reasoning. ReXSonoVQA enables developing perception systems for ultrasound training, guidance, and robotic automation.

Cite this Paper


BibTeX
@InProceedings{pmlr-v333-wang26b,
  title = {ReXSonoVQA: A Video QA Benchmark for Procedure-Centric Ultrasound Understanding},
  author = {Wang, Xucheng and Zhang, Xiaoman and Pal, Ankit and Rajpurkar, Pranav},
  booktitle = {Proceedings of the 7th Conference on Health, Inference, and Learning},
  pages = {427--447},
  year = {2026},
  editor = {Healey, Elizabeth and Fries, Jason and Pollard, Tom and Tang, Shengpu and Zink, Anna and Hartvigsen, Tom and Agrawal, Monica and Finlayson, Sam and Glicksberg, Benjamin and Beaulieu-Jones, Brett and Wang, Kai and Fontalvo, Daseyra and Sarker, Tasmie and Chen, Irene and Alsentzer, Emily},
  volume = {333},
  series = {Proceedings of Machine Learning Research},
  month = {29--30 Jun},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v333/main/assets/wang26b/wang26b.pdf},
  url = {https://proceedings.mlr.press/v333/wang26b.html},
  abstract = {Ultrasound acquisition requires skilled probe manipulation and real-time adjustments. Vision-language models (VLMs) could enable autonomous ultrasound systems, but existing benchmarks evaluate only static images, not dynamic procedural understanding. We introduce ReXSonoVQA, a video QA benchmark with 514 video clips and 514 questions (249 MCQ, 265 free-response) targeting three competencies: Action-Goal Reasoning, Artifact Resolution \& Optimization, and Procedure Context \& Planning. Zero-shot evaluation of Gemini 3 Pro, Qwen3.5-397B, LLaVA-Video-72B, and Seed 2.0 Pro shows VLMs can extract some procedural information, but troubleshooting questions remain challenging with minimal gains over text-only baselines, exposing limitations in causal reasoning. ReXSonoVQA enables developing perception systems for ultrasound training, guidance, and robotic automation.}
}
Endnote
%0 Conference Paper
%T ReXSonoVQA: A Video QA Benchmark for Procedure-Centric Ultrasound Understanding
%A Xucheng Wang
%A Xiaoman Zhang
%A Ankit Pal
%A Pranav Rajpurkar
%B Proceedings of the 7th Conference on Health, Inference, and Learning
%C Proceedings of Machine Learning Research
%D 2026
%E Elizabeth Healey
%E Jason Fries
%E Tom Pollard
%E Shengpu Tang
%E Anna Zink
%E Tom Hartvigsen
%E Monica Agrawal
%E Sam Finlayson
%E Benjamin Glicksberg
%E Brett Beaulieu-Jones
%E Kai Wang
%E Daseyra Fontalvo
%E Tasmie Sarker
%E Irene Chen
%E Emily Alsentzer
%F pmlr-v333-wang26b
%I PMLR
%P 427--447
%U https://proceedings.mlr.press/v333/wang26b.html
%V 333
%X Ultrasound acquisition requires skilled probe manipulation and real-time adjustments. Vision-language models (VLMs) could enable autonomous ultrasound systems, but existing benchmarks evaluate only static images, not dynamic procedural understanding. We introduce ReXSonoVQA, a video QA benchmark with 514 video clips and 514 questions (249 MCQ, 265 free-response) targeting three competencies: Action-Goal Reasoning, Artifact Resolution & Optimization, and Procedure Context & Planning. Zero-shot evaluation of Gemini 3 Pro, Qwen3.5-397B, LLaVA-Video-72B, and Seed 2.0 Pro shows VLMs can extract some procedural information, but troubleshooting questions remain challenging with minimal gains over text-only baselines, exposing limitations in causal reasoning. ReXSonoVQA enables developing perception systems for ultrasound training, guidance, and robotic automation.
APA
Wang, X., Zhang, X., Pal, A. & Rajpurkar, P. (2026). ReXSonoVQA: A Video QA Benchmark for Procedure-Centric Ultrasound Understanding. Proceedings of the 7th Conference on Health, Inference, and Learning, in Proceedings of Machine Learning Research 333:427-447. Available from https://proceedings.mlr.press/v333/wang26b.html.

Related Material