Visualizations and Ablations
We release full ablation experiment results and training curves tracked in CometML.
A Comet Dashboard preview is shown here in the panel.
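As a rough illustration of how such training curves might be tracked, the snippet below sketches logging metrics to CometML; the project name, hyperparameters, and loop are placeholders, not the actual experiment configuration.

```python
from comet_ml import Experiment

# Hypothetical project name; the API key is read from the COMET_API_KEY environment variable.
experiment = Experiment(project_name="smart-vlm-ablations")

# Illustrative hyperparameters, not the settings used in the paper.
experiment.log_parameters({"learning_rate": 3e-4, "batch_size": 64})

for step in range(100):
    loss = 1.0 / (step + 1)  # stand-in for the real training loss
    experiment.log_metric("train_loss", loss, step=step)

experiment.end()
```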
We investigate vision-language models (VLMs) as reasoners. The ability to form abstractions underlies mathematical reasoning, problem-solving, and other Math AI tasks. Several formalisms have been proposed for the underlying abstractions and skills that humans and intelligent systems use when reasoning.
Furthermore, human reasoning is inherently multimodal, and as such, we focus our investigations on multimodal AI. In this article, we adopt the abstractions defined by the SMART task (Simple Multimodal Algorithmic Reasoning Task), introduced by Cherian et al. (2022), as meta-reasoning and problem-solving skills along eight axes: math, counting, path, measure, logic, spatial, algebra, and pattern. We investigate the ability of vision-language models to reason along these axes and seek avenues of improvement. Adding composite representations with vision-language cross-attention enables multimodal representations to be learned adaptively from fused frozen pretrained backbones, yielding better visual grounding. Moreover, careful hyperparameter and other training choices lead to strong improvements (up to a 48% gain in accuracy) on the SMART task, further underscoring the power of deep multimodal learning. The smartest VLM, which includes a novel QF multimodal layer, improves upon the best previous baselines in every one of the eight fundamental reasoning skills.
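As a rough sketch of the kind of fusion described above (not the paper's exact QF layer), the following PyTorch module applies vision-language cross-attention on top of frozen pretrained backbone features; all dimensions, names, and the answer head are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Illustrative vision-language cross-attention over frozen backbone features.

    A hedged sketch of the general idea, not the paper's QF layer: text tokens
    (queries) attend to image patch features (keys/values), and the fused
    representation feeds a small answer head.
    """

    def __init__(self, vision_dim=768, text_dim=768, hidden_dim=512, num_heads=8, num_answers=5):
        super().__init__()
        # Project frozen backbone outputs into a shared hidden space.
        self.vision_proj = nn.Linear(vision_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        # Text attends to vision: queries from language, keys/values from image patches.
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden_dim)
        self.answer_head = nn.Linear(hidden_dim, num_answers)

    def forward(self, vision_feats, text_feats):
        # vision_feats: (batch, num_patches, vision_dim) from a frozen image encoder
        # text_feats:   (batch, num_tokens, text_dim) from a frozen language encoder
        v = self.vision_proj(vision_feats)
        t = self.text_proj(text_feats)
        fused, _ = self.cross_attn(query=t, key=v, value=v)
        fused = self.norm(fused + t)      # residual connection on the text stream
        pooled = fused.mean(dim=1)        # simple pooling over text tokens
        return self.answer_head(pooled)   # logits over candidate answers

# Toy usage with random tensors standing in for frozen backbone outputs.
model = CrossAttentionFusion()
logits = model(torch.randn(2, 196, 768), torch.randn(2, 32, 768))
print(logits.shape)  # torch.Size([2, 5])
```

In this sketch only the fusion layer and answer head are trainable, which matches the general idea of adapting fused representations on top of frozen pretrained backbones.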
Related Works
Many excellent related works have been introduced recently.
Are Deep Neural Networks SMARTer than Second Graders? introduced the Simple Multimodal Algorithmic Reasoning Task (SMART) together with the SMART-101 dataset, and trained a first set of baseline vision-language reasoners to solve the task.
Llemma: An Open Language Model For Mathematics, presented at the Math AI Workshop at NeurIPS 2023, trains a large language model on the Proof-Pile-2 dataset as well as web data and mathematical code (Lean), producing a model that can reason mathematically.
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs, a CVPR 2024 paper, reveals that the visual capabilities of large multimodal models systematically lag behind the powerful reasoning abilities of large language models, and motivates our inclusion of several visual representations for better multimodal reasoning.
MathVerse: Does your multi-modal LLM truly see the diagrams in visual math problems? motivates our exploration of deep learning architectures for better visual grounding, showing that large multimodal models often fail to truly understand the visual diagrams used in mathematical reasoning.
There are probably many more by the time you are reading this.
@misc{roberts2024smartvisionlanguagereasoners,
author = {Denisa Roberts and Lucas Roberts},
title = {Smart Vision-Language Reasoners},
year = {2024},
eprint = {2407.04212},
archivePrefix = {arXiv},
primaryClass = {cs.AI},
url = {https://arxiv.org/abs/2407.04212},
}