Visualizations and Ablations
We release full ablation experiment results and training curves tracked in CometML.
A Comet Dashboard preview is shown here in the panel.
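As a rough illustration of how such training curves might be tracked, the snippet below sketches logging metrics to CometML; the project name, hyperparameters, and loop are placeholders, not the actual experiment configuration.

```python
from comet_ml import Experiment

# Hypothetical project name; the API key is read from the COMET_API_KEY environment variable.
experiment = Experiment(project_name="smart-vlm-ablations")

# Illustrative hyperparameters, not the settings used in the paper.
experiment.log_parameters({"learning_rate": 3e-4, "batch_size": 64})

for step in range(100):
    loss = 1.0 / (step + 1)  # stand-in for the real training loss
    experiment.log_metric("train_loss", loss, step=step)

experiment.end()
```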
We investigate vision-language models (VLMs) as reasoners. The ability to form abstractions underlies mathematical reasoning, problem-solving, and other Math AI tasks. Several formalisms have been proposed for the underlying abstractions and skills that humans and intelligent systems use when reasoning.
Furthermore, human reasoning is inherently multimodal, and as such, we focus our investigations on multimodal AI. In this article, we adopt the abstractions defined by the SMART task (Simple Multimodal Algorithmic Reasoning Task), introduced by Cherian et al. (2022), as meta-reasoning and problem-solving skills along eight axes: math, counting, path, measure, logic, spatial, algebra, and pattern. We investigate the ability of vision-language models to reason along these axes and seek avenues of improvement. Adding composite representations with vision-language cross-attention enables multimodal representations to be learned adaptively from fused frozen pretrained backbones, yielding better visual grounding. Moreover, careful hyperparameter and other training choices lead to strong improvements (up to a 48% gain in accuracy) on the SMART task, further underscoring the power of deep multimodal learning. The smartest VLM, which includes a novel QF multimodal layer, improves upon the best previous baselines in every one of the eight fundamental reasoning skills.
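As a rough sketch of the kind of fusion described above (not the paper's exact QF layer), the following PyTorch module applies vision-language cross-attention on top of frozen pretrained backbone features; all dimensions, names, and the answer head are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Illustrative vision-language cross-attention over frozen backbone features.

    A hedged sketch of the general idea, not the paper's QF layer: text tokens
    (queries) attend to image patch features (keys/values), and the fused
    representation feeds a small answer head.
    """

    def __init__(self, vision_dim=768, text_dim=768, hidden_dim=512, num_heads=8, num_answers=5):
        super().__init__()
        # Project frozen backbone outputs into a shared hidden space.
        self.vision_proj = nn.Linear(vision_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        # Text attends to vision: queries from language, keys/values from image patches.
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden_dim)
        self.answer_head = nn.Linear(hidden_dim, num_answers)

    def forward(self, vision_feats, text_feats):
        # vision_feats: (batch, num_patches, vision_dim) from a frozen image encoder
        # text_feats:   (batch, num_tokens, text_dim) from a frozen language encoder
        v = self.vision_proj(vision_feats)
        t = self.text_proj(text_feats)
        fused, _ = self.cross_attn(query=t, key=v, value=v)
        fused = self.norm(fused + t)      # residual connection on the text stream
        pooled = fused.mean(dim=1)        # simple pooling over text tokens
        return self.answer_head(pooled)   # logits over candidate answers

# Toy usage with random tensors standing in for frozen backbone outputs.
model = CrossAttentionFusion()
logits = model(torch.randn(2, 196, 768), torch.randn(2, 32, 768))
print(logits.shape)  # torch.Size([2, 5])
```

In this sketch only the fusion layer and answer head are trainable, which matches the general idea of adapting fused representations on top of frozen pretrained backbones.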
Related Works
Many excellent related works have been introduced recently.
Are Deep Neural Networks SMARTer than Second Graders? introduced the Simple Multimodal Algorithmic Reasoning Task (SMART) together with the SMART-101 dataset, and trained a first set of baseline vision-language reasoners to solve the task.
Llemma: An Open Language Model For Mathematics, presented at the Math AI Workshop at NeurIPS 2023, trains a large language model on the Proof-Pile-2 dataset as well as web data and mathematical code (Lean), producing a model that can reason mathematically.
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs, a CVPR 2024 paper, reveals that the visual capabilities of large multimodal models systematically lag behind the powerful reasoning abilities of large language models, and motivates our inclusion of several visual representations for better multimodal reasoning.
MathVerse: Does your multi-modal LLM truly see the diagrams in visual math problems? motivates our exploration of deep learning architectures for better visual grounding, showing that large multimodal models often fail to truly understand the visual diagrams used in mathematical reasoning.
There are probably many more by the time you are reading this.
@misc{roberts2024smartvisionlanguagereasoners,
author = {Denisa Roberts and Lucas Roberts},
title = {Smart Vision-Language Reasoners},
year = {2024},
eprint = {2407.04212},
archivePrefix = {arXiv},
primaryClass = {cs.AI},
url = {https://arxiv.org/abs/2407.04212},
}