Vision-Language models (VLMs) show impressive abilities to answer questions on visual inputs (e.g., counting objects in an image), yet demonstrate higher accuracies when performing an analogous task on text (e.g., counting words in a text). We investigate this accuracy gap by identifying and comparing the circuits — the task-specific computational sub-graphs — in different modalities. We show that while circuits are largely disjoint between modalities, they implement relatively similar functionalities: the differences lie primarily in processing modality-specific data positions (an image or a text sequence). Zooming in on the image data representations, we observe they become aligned with the higher-performing analogous textual representations only towards later layers, too late in processing to effectively influence subsequent positions. To overcome this, we patch the representations of visual data tokens from later layers back into earlier layers. In experiments with multiple tasks and models, this simple intervention closes a third of the performance gap between the modalities, on average. Our analysis sheds light on the multi-modal performance gap in VLMs and suggests a training-free approach for reducing it.
Vision-Language Models (VLMs) often show a performance gap when handling analogous tasks presented in visual versus textual formats. This work investigates the source of this gap by analyzing the internal circuits (task-specific computational sub-graphs) that VLMs utilize for these different modalities.
To study the difference between visual and textual processing, we create a dataset of five question-answering tasks.
Each task has pairs of aligned textual and visual prompt variants, ensuring direct comparability.
All prompts adhere to a fixed positional template, and each position in a textual prompt is aligned with the corresponding position in the analogous visual prompt.
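To make the alignment concrete, here is a minimal sketch of what one aligned prompt pair could look like. The task name, field names, and template wording are illustrative assumptions, not the dataset's actual schema; the key point is that only the data segment changes modality while all other positions stay matched.

```python
# Hypothetical example of one aligned prompt pair for a counting task.
# Both variants follow the same positional template:
#   [instruction] [data] [query] [generation]
# and only the data segment differs between modalities.
aligned_pair = {
    "task": "object_counting",                      # illustrative task name
    "textual": {
        "instruction": "Answer the question about the following list.",
        "data": "apple, dog, apple, car, apple",    # textual data positions
        "query": "How many apples are there?",
    },
    "visual": {
        "instruction": "Answer the question about the following image.",
        "data": "<image>",                          # image patch tokens fill the data positions
        "query": "How many apples are there?",
    },
    "answer": "3",
}
```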
To uncover the differences (and similarities) between textual and visual processing in VLMs, we find circuits that are faithful to the textual and visual variants of each task, and analyze their structural and functional overlap.
We quantify structural overlap as the IoU (intersection-over-union) of the circuits found for the two modalities of a given task, computed separately per position group (data, query, and generation).
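As a rough sketch, this per-group overlap reduces to a plain set IoU over circuit components. The component encoding below, a (layer, component name, position group) tuple, is an assumption for illustration only.

```python
# Minimal sketch: IoU of two circuits, computed separately per position group.
# A circuit is represented as a set of components, each a
# (layer, component_name, position_group) tuple -- an illustrative encoding.

def iou_per_position_group(circuit_a, circuit_b, groups=("data", "query", "generation")):
    """Return {group: IoU of the two circuits restricted to that position group}."""
    scores = {}
    for group in groups:
        a = {c for c in circuit_a if c[2] == group}
        b = {c for c in circuit_b if c[2] == group}
        union = a | b
        scores[group] = len(a & b) / len(union) if union else 0.0
    return scores

# Example with toy circuits:
text_circuit = {(3, "attn_head_5", "data"), (7, "mlp", "query"), (10, "attn_head_2", "generation")}
image_circuit = {(4, "attn_head_1", "data"), (7, "mlp", "query"), (10, "attn_head_2", "generation")}
print(iou_per_position_group(text_circuit, image_circuit))
# -> {'data': 0.0, 'query': 1.0, 'generation': 1.0}
```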
Structural difference is only half of the picture: different circuits can implement similar logic.
We test this by measuring cross-modal faithfulness: we swap sub-circuits between modalities and evaluate how faithful the resulting hybrid circuits remain.
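The sketch below illustrates the sub-circuit swap at the level of component sets, reusing the toy encoding from above. The `faithfulness` helper, which would run the model with everything outside a given circuit ablated, is an assumed placeholder and is only referenced in a comment.

```python
def swap_subcircuit(target_circuit, source_circuit, group):
    """Keep `target_circuit` except for `group`, whose components come from `source_circuit`."""
    kept = {c for c in target_circuit if c[2] != group}
    swapped = {c for c in source_circuit if c[2] == group}
    return kept | swapped

# Toy circuits using the same (layer, component, position_group) encoding as above.
image_circuit = {(4, "attn_head_1", "data"), (7, "mlp", "query")}
text_circuit = {(3, "attn_head_5", "data"), (8, "mlp", "query")}

hybrid = swap_subcircuit(image_circuit, text_circuit, group="data")
# hybrid == {(3, "attn_head_5", "data"), (7, "mlp", "query")}

# An assumed helper (not shown) would then run the model on visual prompts with all
# components outside `hybrid` ablated and report a faithfulness score, e.g.:
# score = faithfulness(model, visual_prompts, hybrid)
```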
Zooming in on the data positions, which we showed are processed by largely disjoint components, we see that visual data token representations gradually align with their textual analogs as they progress through the VLM's layers. We hypothesize that this alignment happens too late in the model to influence subsequent positions, hurting accuracy on visual prompts.
To address this, we employ back-patching: re-injecting the more textually-aligned representations of visual data tokens from later layers into earlier layers.
This makes these aligned representations available earlier in the visual processing pipeline.
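A minimal PyTorch sketch of back-patching is shown below. It assumes a Hugging Face-style VLM whose decoder layers are reachable at `model.language_model.model.layers` and whose hidden states can be returned with `output_hidden_states=True`; the attribute path, layer indices, and visual data token positions are illustrative assumptions to adapt per model.

```python
import torch

def back_patch(model, inputs, visual_positions, src_layer, tgt_layer):
    """Re-inject hidden states of visual data tokens from a later layer (src_layer)
    into an earlier layer (tgt_layer), then rerun the forward pass."""
    # Pass 1: record hidden states at the later source layer.
    # hidden_states[0] is the embedding output; hidden_states[k] follows decoder layer k-1.
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    src_hidden = out.hidden_states[src_layer]  # (batch, seq, hidden)

    # Pass 2: overwrite the earlier target layer's output at the visual data positions.
    def hook(module, module_inputs, module_output):
        hidden = module_output[0] if isinstance(module_output, tuple) else module_output
        hidden[:, visual_positions, :] = src_hidden[:, visual_positions, :]
        if isinstance(module_output, tuple):
            return (hidden,) + module_output[1:]
        return hidden

    # Attribute path is an assumption; adjust for the specific VLM implementation.
    layer_module = model.language_model.model.layers[tgt_layer]
    handle = layer_module.register_forward_hook(hook)
    try:
        with torch.no_grad():
            patched_out = model(**inputs)
    finally:
        handle.remove()
    return patched_out
```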
This simple intervention increases accuracy on visual prompts across multiple tasks and VLMs, closing a third of the performance gap between visual and textual prompts on average.
@article{nikankin2025sametask,
  title   = {Same Task, Different Circuits: Disentangling Modality-Specific Mechanisms in VLMs},
  author  = {Nikankin, Yaniv and Arad, Dana and Gandelsman, Yossi and Belinkov, Yonatan},
  journal = {Preprint},
  note    = {Under review},
  year    = {2025}
}