Reverse-Engineering the Retrieval Process in GenIR Models

Anja Reusch, Yonatan Belinkov
Technion - IIT

Walkthrough of the Retrieval Process

Encoder.
  • Embeds the query.
  • Is not required to encode information about the documents directly, as it can be replaced by an encoder that does not contain document-specific information.
Priming Stage. (Layers 0–6)
  • "Prepares" the residual stream for subsequent stages.
  • MLP components move document-id tokens to lower (more probable) ranks and non-document-id tokens to higher (less probable) ranks.
  • Does not contain query specific information.
Bridging Stage. (Layers 7–17)
  • Cross-attention moves information from the encoder to the decoder.
  • Cross-attention heads output information in the form of word tokens that resemble a form of query expansion.
  • Output of cross-attention is used to activate neurons in the last stage.
Interaction Stage. (Layers 18–23)
  • Neurons in MLPs are activated based on the output of the previous stage, promoting document identifiers.
  • Cross-attention continues to output query information to the residual stream.
  • Last layer: only the MLPs are required; they demote all non-document-id tokens to high (less probable) ranks, such that only document-id tokens are predicted by the model.
  • In this stage, query and documents interact for the first time.
Setting.
  • DSI setup, atomic document identifiers, T5-large.
  • Datasets: Natural Questions, TriviaQA; corpus sizes: 10k–220k documents.
A simplified view of the retrieval process in the Generative IR models in our work, depicting the flow through the transformer encoder-decoder and the three stages we find. After the encoder processes the query, the decoder operates in three stages: (I) the Priming Stage, (II) the Bridging Stage, and (III) the Interaction Stage.

Generative Information Retrieval

Overview of a transformer encoder-decoder for GenIR: the input is a query, the output is a document identifier.
  • Transformer-Encoder-Decoder Model, e.g., T5
  • Training:
    • Input: first N tokens of document D, or a query for which D is relevant
    • Output: Document identifier of document D
  • Inference (Retrieval):
    • Input: Query
    • Output: ranked list of Document identifiers (tokens, ranked by probability)
  • Each document identifier is tokenized as a single token (= atomic document identifier); a minimal setup is sketched below.
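A minimal sketch (not the paper's training code) of this setup with Hugging Face T5. The corpus size, the checkpoint name, and the identifier format `doc_{i}` are illustrative assumptions:

```python
# Minimal DSI-style GenIR sketch: atomic doc ids as single new tokens (illustrative only).
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-large")
model = T5ForConditionalGeneration.from_pretrained("t5-large")

# Each document identifier becomes one new vocabulary entry (= atomic document identifier).
doc_identifiers = [f"doc_{i}" for i in range(10_000)]
tokenizer.add_tokens(doc_identifiers)
model.resize_token_embeddings(len(tokenizer))

# Training pair: first N tokens of a document (or a query relevant to it) -> its identifier token.
inputs = tokenizer("who wrote the harry potter books", return_tensors="pt")
labels = tokenizer("doc_42", add_special_tokens=False, return_tensors="pt").input_ids
loss = model(**inputs, labels=labels).loss  # standard seq2seq cross-entropy on the id token

# Retrieval: a single decoding step; rank all document-identifier tokens by probability.
with torch.no_grad():
    start = torch.tensor([[model.config.decoder_start_token_id]])
    logits = model(**inputs, decoder_input_ids=start).logits[0, -1]
doc_token_ids = torch.tensor(tokenizer.convert_tokens_to_ids(doc_identifiers))
ranked_docs = doc_token_ids[logits[doc_token_ids].argsort(descending=True)]
```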

The Role of the Encoder in GenIR

Intuition. Investigating whether information about the document corpus is contained in the encoder or only in the decoder after training a GenIR model.
Results.
  • Replacing the trained encoder with an encoder that was not trained on {Doc1, ..., Doc10}: the missing documents can still be retrieved well.
  • Replacing the trained encoder with the vanilla T5 encoder (a sketch of this swap follows below): the model can still perform retrieval!
  • Conclusion: Documents are not (exclusively) stored in the encoder.
  • Hypothesis: The encoder semantically encodes the query; the decoder is responsible for query-document matching.
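A sketch of how such an encoder swap can be done with the Hugging Face T5 classes; `path/to/genir-checkpoint` is a placeholder for a trained GenIR model, and only the encoder's transformer blocks are replaced because the input embedding is shared with the decoder:

```python
# Swap the trained GenIR encoder for the vanilla T5 encoder (checkpoint path is a placeholder).
from transformers import T5ForConditionalGeneration

genir = T5ForConditionalGeneration.from_pretrained("path/to/genir-checkpoint")
vanilla = T5ForConditionalGeneration.from_pretrained("t5-large")

# Replace only the encoder's transformer blocks and final norm; the shared token embedding
# stays, since it also feeds the decoder and contains the added document-identifier rows.
genir.encoder.block.load_state_dict(vanilla.encoder.block.state_dict())
genir.encoder.final_layer_norm.load_state_dict(vanilla.encoder.final_layer_norm.state_dict())

# The resulting model is then evaluated on retrieval exactly as before.
```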

Which Components are Crucial for Retrieval

How much does each component contribute to the residual stream? We plot the length and the angle of each component's output (MLP, self-attention, cross-attention) across the pass through the decoder.
Length (normalized to the contribution per layer) and angle (towards the residual stream in this layer) of the output of each component in each decoder layer for NQ10k.
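A minimal sketch of how such per-component contributions can be measured with forward hooks on the Hugging Face T5 decoder: each sub-layer returns the residual stream plus its own output, so the difference between output and input is the component's write to the stream. The checkpoint is a stand-in for the trained GenIR model, and the exact normalization used in the plot may differ:

```python
# Measure the norm of each decoder component's write to the residual stream and its angle to the stream.
import torch
import torch.nn.functional as F
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-large")
model = T5ForConditionalGeneration.from_pretrained("t5-large").eval()  # stand-in for the trained GenIR model

records = []  # (decoder layer, component, contribution norm, cosine to residual stream)

def make_hook(layer_idx, name):
    def hook(module, args, output):
        resid_in = args[0]                                    # residual stream entering the sub-layer
        resid_out = output[0] if isinstance(output, tuple) else output
        write = resid_out - resid_in                          # the component's write to the stream
        cos = F.cosine_similarity(write[0, -1], resid_in[0, -1], dim=0)
        records.append((layer_idx, name, write[0, -1].norm().item(), cos.item()))
    return hook

# Each T5 decoder block holds [self-attention, cross-attention, feed-forward] sub-layers.
for i, block in enumerate(model.decoder.block):
    for sublayer, name in zip(block.layer, ["self-attention", "cross-attention", "mlp"]):
        sublayer.register_forward_hook(make_hook(i, name))

query = tokenizer("who wrote the harry potter books", return_tensors="pt")
start = torch.tensor([[model.config.decoder_start_token_id]])
with torch.no_grad():
    model(**query, decoder_input_ids=start)
```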
Conclusion. We identify three stages during the pass through the decoder when looking at the contribution of each component:
  • Stage I: High contribution of MLPs, low to no contribution of cross-attention and self-attention.
  • Stage II: The contribution of cross-attention rises, while the contribution of the MLPs declines.
  • Stage III: The contribution of the MLPs is highest; the output of all components points in the opposite direction to the residual stream.
How much does it hurt to remove or replace a component? We perform zero-patching and mean-patching on each component in each stage. In each run, we replace the output of a component in one stage with a zero vector or with the mean vector of that component's output, aggregated over all queries for which the model previously ranked a relevant document at rank 1.
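One way to implement this patching, assuming the Hugging Face T5 decoder: the component's write to the residual stream is replaced by zero (the sub-layer then just passes the stream through) or by a precomputed mean vector. The stage boundaries follow the paper; the code itself is an illustrative sketch, not the paper's implementation:

```python
# Zero-/mean-patch the cross-attention output in the Bridging Stage (layers 7-17); illustrative only.
import torch
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-large")  # stand-in for the trained GenIR model
STAGE_II = range(7, 18)

def make_patch_hook(mean_vector=None):
    """mean_vector=None -> zero-patching; otherwise the component's write is replaced by the mean."""
    def hook(module, args, output):
        resid_in = args[0]                                   # residual stream entering the sub-layer
        patched = resid_in if mean_vector is None else resid_in + mean_vector
        return (patched,) + output[1:] if isinstance(output, tuple) else patched
    return hook

handles = [
    model.decoder.block[i].layer[1].register_forward_hook(make_patch_hook())  # layer[1] = cross-attention
    for i in STAGE_II
]

# ... run retrieval with the patched model, then restore it:
for h in handles:
    h.remove()
```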
Some results of our patching experiments. We remove or replace certain components entirely in the indicated stages. The results are displayed as the percentage of documents that the partial model placed correctly at rank 1 (given all query-doc pairs that the full model solved correctly). For the minimal model, we ran the evaluation on the test set.

Conclusion. The results confirm the intuition gained from the previous part. In Stages I and III, MLPs cannot be removed. In Stage II, cross-attention shows the highest impact when being removed or replaced. Interestingly, replacing the MLP output in Stages I (and II) by its mean value does not seem to hurt performance drastically, which implies that these MLPs do not perform query-specific computations. The minimal model retains most of the performance of the full model, which indicates that the retained components suffice to perform most of the retrieval process.

The Role of MLPs and Cross-Attention

Do Cross-Attention and MLPs communicate?
  • Investigation of the information flow within the decoder.
  • Central question: Which component's output causes cross-attention/MLPs to activate?
  • "Activate" for MLPs: the input to the activation function is > 0; for cross-attention: which input leads to the highest value in the attention pattern? (A sketch of the MLP attribution follows below.)
Components per stage that trigger cross-attention in Stages II and III (left) and activate MLPs in Stage III (right) for NQ10k. Stage III MLPs get mostly activated by cross-attention in Stages II and III, while cross-attention in Stage II gets mostly activated by Stage I MLPs.
Conclusion.
  • In Stage I: MLPs write query-agnostic information to the residual stream.
  • In Stage II and III: Cross-attention reads in this information and writes other information back to the residual stream.
  • In Stage III: MLPs read in information from cross-attention.
What does Cross-Attention write?
  • Application of LogitLens to the output of each cross-attention head and to the output of the entire component (a minimal LogitLens sketch follows the table below).
  • Most tokens that get promoted by the cross-attention output in Stage II are word tokens (non-document-identifiers).
  • In less than one percent of cases, document-identifier tokens are within the top 100 tokens of the cross-attention output in Stage II.
  • Below are some examples of tokens that get promoted by individual cross-attention heads:
Query | Cross-Attn. Head | Top 5 Words
who wrote the harry potter books | Layer 14 - Head 2 | about, written, about, tailored, privire
who wrote the harry potter books | Layer 16 - Head 1 | books, ouvrage, books, authors, book
who won the football championship in 2002 | Layer 16 - Head 1 | year, YEAR, Year, year, jahr
who won the football championship in 2002 | Layer 16 - Head 13 | football, Football, fotbal, soccer, NFL
will there be a sequel to baytown outlaws | Layer 12 - Head 8 | erneut, successor, similarly, repris, continuation
will there be a sequel to baytown outlaws | Layer 16 - Head 1 | town, towns, city, Town, village
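A minimal LogitLens sketch for the component-level analysis above: a residual-stream write (e.g., the cross-attention output at the decoded position, captured with a forward hook) is projected through the model's unembedding to read off the tokens it promotes. Per-head analysis additionally requires splitting the attention output projection by head, which is omitted here; the helper name and `vec` are illustrative, and the paper's exact LogitLens variant may additionally apply the decoder's final normalization:

```python
# Project a residual-stream write through the unembedding ("LogitLens") to see which tokens it promotes.
import torch

def logit_lens_top_tokens(model, tokenizer, vec, top_k=5):
    """vec: a (d_model,) write to the residual stream, e.g. a cross-attention output vector."""
    with torch.no_grad():
        logits = model.lm_head(vec)                 # unembedding projection onto the vocabulary
    top_ids = torch.topk(logits, top_k).indices.tolist()
    return tokenizer.convert_ids_to_tokens(top_ids)
```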