Primitive Vision: Improving Diagram Understanding in MLLMs
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:74732-74755, 2025.
Abstract
Mathematical diagrams have a distinctive structure. Standard vision encoders designed for natural images (e.g., CLIP) fail to process them effectively, limiting their utility in multimodal large language models (MLLMs). Current efforts to improve MLLMs have primarily focused on scaling mathematical visual instruction datasets and strengthening LLM backbones, yet fine-grained visual recognition errors remain unaddressed. Our systematic evaluation of the visual grounding capabilities of state-of-the-art MLLMs shows that fine-grained visual understanding remains a crucial bottleneck in visual mathematical reasoning (GPT-4o exhibits a 70% grounding error rate, and correcting these errors improves reasoning accuracy by 12%). We thus propose a novel approach featuring a geometrically-grounded vision encoder and a feature router that dynamically selects between hierarchical visual feature maps. Our model accurately recognizes visual primitives and generates precise visual prompts aligned with the language model’s reasoning needs. In experiments, PRIMITIVE-Qwen2.5-7B outperforms other 7B models by 12% on MathVerse and is on par with GPT-4V on MathVista. Our findings highlight the need for better fine-grained visual integration in MLLMs. Code is available at github.com/AI4Math-ShanZhang/SVE-Math.
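To make the routing idea above concrete, the following is a minimal, hypothetical sketch of a module that softly selects among hierarchical visual feature maps before they are passed to the language model. The class name `FeatureRouter`, the soft-gating scheme, the level token counts, and the dimensions are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureRouter(nn.Module):
    """Hypothetical sketch: softly weights hierarchical feature maps
    (e.g., from different encoder stages) and blends them into one
    token sequence for the LLM. Not the paper's implementation."""

    def __init__(self, num_levels: int, dim: int):
        super().__init__()
        # Learned gate over pyramid levels, conditioned on pooled features.
        self.gate = nn.Sequential(
            nn.Linear(num_levels * dim, num_levels),
            nn.Softmax(dim=-1),
        )

    def forward(self, feature_maps):
        # feature_maps: list of (batch, tokens_i, dim) tensors, one per level.
        pooled = torch.cat([f.mean(dim=1) for f in feature_maps], dim=-1)
        weights = self.gate(pooled)                       # (batch, num_levels)

        # Resample every level to a common token count, then blend.
        target_len = feature_maps[0].shape[1]
        resampled = [
            F.interpolate(f.transpose(1, 2), size=target_len, mode="linear")
             .transpose(1, 2)
            for f in feature_maps
        ]
        stacked = torch.stack(resampled, dim=1)           # (batch, levels, tokens, dim)
        return (weights[:, :, None, None] * stacked).sum(dim=1)


# Usage with three hypothetical pyramid levels of decreasing resolution.
levels = [torch.randn(2, n, 256) for n in (576, 144, 36)]
router = FeatureRouter(num_levels=3, dim=256)
fused = router(levels)  # (2, 576, 256), ready to project into the LLM
```

This sketch uses a single soft blend per image; a per-token or hard (top-1) routing scheme would be an equally plausible reading of "dynamically selects", and the paper's released code should be consulted for the actual design.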