Vision-Language models (VLMs) show impressive abilities to answer questions on visual inputs (e.g., counting objects in an image), yet demonstrate higher accuracies when performing an analogous task on text (e.g., counting words in a text). We investigate this accuracy gap by identifying and comparing the circuits — the task-specific computational sub-graphs — in different modalities. We show that while circuits are largely disjoint between modalities, they implement relatively similar functionalities: the differences lie primarily in processing modality-specific data positions (an image or a text sequence). Zooming in on the image data representations, we observe they become aligned with the higher-performing analogous textual representations only towards later layers, too late in processing to effectively influence subsequent positions. To overcome this, we patch the representations of visual data tokens from later layers back into earlier layers. In experiments with multiple tasks and models, this simple intervention closes a third of the performance gap between the modalities, on average. Our analysis sheds light on the multi-modal performance gap in VLMs and suggests a training-free approach for reducing it.
Vision-Language Models (VLMs) often show a performance gap when handling analogous tasks presented in visual versus textual formats. This work investigates the source of this gap by analyzing the internal circuits (task-specific computational sub-graphs) that VLMs utilize for these different modalities.
To study the difference between visual and textual processing, we create a dataset of five question-answering tasks.
Each task has pairs of aligned textual and visual prompt variants, ensuring direct comparability.
All prompts adhere to a fixed positional template, and each position in a textual prompt is aligned with the corresponding position in the analogous visual prompt.
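To make the alignment concrete, here is a minimal sketch of what one aligned prompt pair could look like. The task name, field names, and template wording are illustrative assumptions, not the dataset's actual schema; the key point is that only the data segment changes modality while all other positions stay matched.

```python
# Hypothetical example of one aligned prompt pair for a counting task.
# Both variants follow the same positional template:
#   [instruction] [data] [query] [generation]
# and only the data segment differs between modalities.
aligned_pair = {
    "task": "object_counting",                      # illustrative task name
    "textual": {
        "instruction": "Answer the question about the following list.",
        "data": "apple, dog, apple, car, apple",    # textual data positions
        "query": "How many apples are there?",
    },
    "visual": {
        "instruction": "Answer the question about the following image.",
        "data": "<image>",                          # image patch tokens fill the data positions
        "query": "How many apples are there?",
    },
    "answer": "3",
}
```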
To uncover the differences (and similarities) between textual and visual processing in VLMs, we find circuits that are faithful to the textual and visual variants of each task, and analyze their structural and functional overlap.
We quantify structural overlap as the IoU (intersection-over-union) of the circuits found for the two modalities of a given task, computed separately per position group (data, query, and generation).
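As a rough sketch, this per-group overlap reduces to a plain set IoU over circuit components. The component encoding below, a (layer, component name, position group) tuple, is an assumption for illustration only.

```python
# Minimal sketch: IoU of two circuits, computed separately per position group.
# A circuit is represented as a set of components, each a
# (layer, component_name, position_group) tuple -- an illustrative encoding.

def iou_per_position_group(circuit_a, circuit_b, groups=("data", "query", "generation")):
    """Return {group: IoU of the two circuits restricted to that position group}."""
    scores = {}
    for group in groups:
        a = {c for c in circuit_a if c[2] == group}
        b = {c for c in circuit_b if c[2] == group}
        union = a | b
        scores[group] = len(a & b) / len(union) if union else 0.0
    return scores

# Example with toy circuits:
text_circuit = {(3, "attn_head_5", "data"), (7, "mlp", "query"), (10, "attn_head_2", "generation")}
image_circuit = {(4, "attn_head_1", "data"), (7, "mlp", "query"), (10, "attn_head_2", "generation")}
print(iou_per_position_group(text_circuit, image_circuit))
# -> {'data': 0.0, 'query': 1.0, 'generation': 1.0}
```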
Structural difference is only half of the picture: different circuits can implement similar logic.
We test this by measuring cross-modal faithfulness: we swap sub-circuits between modalities and evaluate how faithful the resulting hybrid circuits remain.
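The sketch below illustrates the sub-circuit swap at the level of component sets, reusing the toy encoding from above. The `faithfulness` helper, which would run the model with everything outside a given circuit ablated, is an assumed placeholder and is only referenced in a comment.

```python
def swap_subcircuit(target_circuit, source_circuit, group):
    """Keep `target_circuit` except for `group`, whose components come from `source_circuit`."""
    kept = {c for c in target_circuit if c[2] != group}
    swapped = {c for c in source_circuit if c[2] == group}
    return kept | swapped

# Toy circuits using the same (layer, component, position_group) encoding as above.
image_circuit = {(4, "attn_head_1", "data"), (7, "mlp", "query")}
text_circuit = {(3, "attn_head_5", "data"), (8, "mlp", "query")}

hybrid = swap_subcircuit(image_circuit, text_circuit, group="data")
# hybrid == {(3, "attn_head_5", "data"), (7, "mlp", "query")}

# An assumed helper (not shown) would then run the model on visual prompts with all
# components outside `hybrid` ablated and report a faithfulness score, e.g.:
# score = faithfulness(model, visual_prompts, hybrid)
```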
Zooming in on the data positions, which we showed are processed by largely disjoint components, we see that visual data token representations gradually align with their textual analogs as they progress through the VLM's layers. We hypothesize that this alignment happens too late in the model to influence subsequent positions, hurting accuracy on visual prompts.
To address this, we employ back-patching: re-injecting the more textually-aligned representations of visual data tokens from later layers into earlier layers.
This makes these aligned representations available earlier in the visual processing pipeline.
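A minimal PyTorch sketch of back-patching is shown below. It assumes a Hugging Face-style VLM whose decoder layers are reachable at `model.language_model.model.layers` and whose hidden states can be returned with `output_hidden_states=True`; the attribute path, layer indices, and visual data token positions are illustrative assumptions to adapt per model.

```python
import torch

def back_patch(model, inputs, visual_positions, src_layer, tgt_layer):
    """Re-inject hidden states of visual data tokens from a later layer (src_layer)
    into an earlier layer (tgt_layer), then rerun the forward pass."""
    # Pass 1: record hidden states at the later source layer.
    # hidden_states[0] is the embedding output; hidden_states[k] follows decoder layer k-1.
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    src_hidden = out.hidden_states[src_layer]  # (batch, seq, hidden)

    # Pass 2: overwrite the earlier target layer's output at the visual data positions.
    def hook(module, module_inputs, module_output):
        hidden = module_output[0] if isinstance(module_output, tuple) else module_output
        hidden[:, visual_positions, :] = src_hidden[:, visual_positions, :]
        if isinstance(module_output, tuple):
            return (hidden,) + module_output[1:]
        return hidden

    # Attribute path is an assumption; adjust for the specific VLM implementation.
    layer_module = model.language_model.model.layers[tgt_layer]
    handle = layer_module.register_forward_hook(hook)
    try:
        with torch.no_grad():
            patched_out = model(**inputs)
    finally:
        handle.remove()
    return patched_out
```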
This simple intervention increases accuracy on visual prompts across multiple tasks and VLMs, closing a third of the performance gap between visual and textual prompts on average.
@article{nikankin2025sametask,
  title   = {Same Task, Different Circuits: Disentangling Modality-Specific Mechanisms in VLMs},
  author  = {Nikankin, Yaniv and Arad, Dana and Gandelsman, Yossi and Belinkov, Yonatan},
  journal = {Preprint},
  note    = {Under review},
  year    = {2025}
}