Induction Meets Biology

Research Question

Repeated segments are a common feature of protein sequences and often play important roles in structure and function. While these segments may initially arise as exact copies, they typically diverge over time as mutations accumulate, making many repeats only approximate rather than exact. As a result, repeat identification has long been a challenging problem in computational biology, motivating decades of algorithmic work. More recently, protein language models (PLMs) have been shown to detect such repeats and exploit them in masked-token prediction. But what internal mechanism enables this ability? In this work, we uncover the mechanisms by which PLMs detect such repeats.

Main Results

We formulate repeat identification as a masked language modeling task in which the model receives a protein sequence containing two repeated segments, either exact or approximate, and must predict a masked amino acid using information from the other repeat.
We find that PLMs use two categories of components to solve this task:
- Language model–related components (left in figure): induction-like attention heads that attend to aligned repeat tokens; relative-position heads that attend to fixed positional offsets; and neurons that activate within repetitions. These components have already been linked to repeat identification in language models.
- Biologically specialized components (right in figure): attention heads biased toward specific amino acids, and neurons that respond to individual amino acids or to groups of biochemically similar amino acids that readily substitute for one another (e.g., as scored by BLOSUM62).
We reveal a three-stage mechanism for how PLMs combine these components to solve the task (see figure below).

Visualization of the repeat identification mechanism: (I) Relative-position heads and biochemical neurons build contextual representations that align tokens across repeats. (II) Induction heads attend from the masked position to its aligned token and copy its information, enabling retrieval of the correct amino acid; repeat neurons play an inhibitory role. (III) MLP neurons refine the final masked token distribution, with amino-acid–biased attention heads also contributing to the prediction.

Key Contributions

Our work is the first to characterize repeat identification in PLMs, revealing how general induction mechanisms integrate protein-specific components encoding biological features.
We uncover a three-stage repeat-detection mechanism in PLMs, highlighting similarities and differences with language models, as well as differences across protein models.
We show that multiple PLM components are directly interpretable in the model basis, without requiring additional interpretability methods such as sparse autoencoders (SAEs).

Additional Results

We find substantial overlap between the mechanisms underlying approximate and exact repeat detection, with the approximate-repeat mechanism functionally generalizing the exact-repeat one.
We compare two prominent PLMs, ESM-3 and ESM-C, and find that they share a similar mechanism, with one crucial difference: ESM-3 engages neurons sensitive to protein secondary structure for repeat identification, while ESM-C does not. We relate this to differences in their training regimes and data.

Methodology

Tools & Techniques

Dataset construction. We curate three diverse datasets spanning synthetic and natural proteins with exact and approximate repeats, and formulate each setting as a masked-token prediction task.
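As a rough illustration of the synthetic setting, the sketch below builds a sequence with two repeated segments and masks one position in the second repeat. Everything here is an assumption made for illustration (the function names, the `<mask>` symbol, uniform residue sampling, and the prefix/spacer layout); it is not the datasets' actual construction procedure.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
MASK = "<mask>"  # hypothetical mask symbol; real tokenizers define their own

def random_segment(length, rng):
    """Sample a random residue segment (uniform sampling is an assumption)."""
    return [rng.choice(AMINO_ACIDS) for _ in range(length)]

def mutate(segment, n_subs, rng):
    """Copy the segment and substitute n_subs distinct positions with a different residue."""
    out = list(segment)
    for i in rng.sample(range(len(out)), n_subs):
        out[i] = rng.choice([a for a in AMINO_ACIDS if a != out[i]])
    return out

def make_example(repeat_len=10, spacer_len=5, n_subs=0, seed=0):
    """Layout: prefix + repeat + spacer + (possibly mutated) repeat, with one masked token."""
    rng = random.Random(seed)
    first = random_segment(repeat_len, rng)
    second = mutate(first, n_subs, rng)  # n_subs=0 gives an exact repeat
    prefix = random_segment(spacer_len, rng)
    spacer = random_segment(spacer_len, rng)
    tokens = prefix + first + spacer + second
    offset = rng.randrange(repeat_len)
    masked_pos = len(prefix) + repeat_len + spacer_len + offset  # inside second repeat
    aligned_pos = len(prefix) + offset                           # same offset, first repeat
    target = tokens[masked_pos]
    tokens[masked_pos] = MASK
    return tokens, masked_pos, aligned_pos, target

tokens, masked_pos, aligned_pos, target = make_example(n_subs=0, seed=1)
print("".join(t if t != MASK else "_" for t in tokens), masked_pos, aligned_pos, target)
```

With `n_subs=0` the token at `aligned_pos` equals the masked target, so the model can in principle recover it by copying; with `n_subs > 0` the two repeats differ at exactly `n_subs` positions.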
Circuit discovery. We apply attribution patching with integrated gradients (AP-IG) to identify circuits of attention heads and MLP neurons involved in repeat detection.
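For orientation, one common way to write an integrated-gradients attribution score for a component $c$ is sketched below. The notation is our own assumption: $x$ and $x'$ are the clean and corrupted input embeddings, $z_c(\cdot)$ is the activation of component $c$, $L$ is the task metric, and $m$ is the number of interpolation steps. The exact variant used in this work may differ.

```latex
\[
\mathrm{score}(c) \;=\;
\bigl(z_c(x') - z_c(x)\bigr)^{\top}\,
\frac{1}{m} \sum_{k=1}^{m}
\left. \frac{\partial L}{\partial z_c} \right|_{\text{input} \,=\, x' + \frac{k}{m}\,(x - x')}
\]
```

Under this reading, the single gradient of plain attribution patching is replaced by an average of gradients taken at $m$ points on the straight path from the corrupted to the clean input.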
Cross-task comparison. We compare circuits across repeat settings using IoU, recall, and cross-task faithfulness to assess structural overlap and functional generalization.
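The two structural overlap measures can be sketched as plain set operations, assuming each circuit is represented as a set of component identifiers. The tuple format and the toy circuits below are hypothetical. Cross-task faithfulness is not shown, since it requires running the model with one task's circuit on another task's data.

```python
def iou(circuit_a, circuit_b):
    """Intersection over union of two component sets."""
    a, b = set(circuit_a), set(circuit_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def recall(reference, candidate):
    """Fraction of the reference circuit's components that also appear in the candidate."""
    ref, cand = set(reference), set(candidate)
    return len(ref & cand) / len(ref) if ref else 1.0

# Hypothetical circuits: (component type, layer, index)
exact_circuit = {("head", 3, 1), ("head", 5, 2), ("neuron", 7, 100)}
approx_circuit = exact_circuit | {("neuron", 8, 12)}

print(iou(exact_circuit, approx_circuit))     # 3 shared out of 4 total
print(recall(exact_circuit, approx_circuit))  # every exact-repeat component is covered
```

Note that recall is asymmetric: a circuit that strictly contains another has recall 1.0 in one direction but less than 1.0 in the other, while IoU treats both circuits alike.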
Attention-head characterization. We analyze attention maps of circuit heads, compute per-head features across the dataset that capture induction behavior, amino-acid bias, positional preference, and other attention properties, and cluster the heads into major functional classes.
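A minimal sketch of what such per-head features might look like for a single sequence, assuming a head's attention is available as a square NumPy array with queries on rows and keys on columns. The three feature definitions are illustrative assumptions, not the exact features computed in this work.

```python
import numpy as np

def induction_score(attn, query_pos, aligned_pos):
    """Attention weight from the query (e.g., masked) position to its aligned repeat token."""
    return float(attn[query_pos, aligned_pos])

def offset_profile(attn, max_offset):
    """Mean attention weight at each relative offset d = key - query (requires max_offset < n)."""
    return {d: float(np.mean(np.diagonal(attn, offset=d)))
            for d in range(-max_offset, max_offset + 1)}

def amino_acid_bias(attn, tokens, residue):
    """Average share of each query's attention that lands on keys holding the given residue."""
    key_mask = np.array([t == residue for t in tokens])
    return float(attn[:, key_mask].sum(axis=1).mean())

# Toy head that always attends to the previous token (position 0 attends to itself).
tokens = list("ACAC")
attn = np.zeros((4, 4))
attn[0, 0] = 1.0
for q in range(1, 4):
    attn[q, q - 1] = 1.0

print(offset_profile(attn, max_offset=1))
print(amino_acid_bias(attn, tokens, "A"))
```

Features like these, averaged over a dataset, give each head a small vector that can then be passed to any standard clustering routine.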
Neuron concept analysis. We analyze the activation patterns of important MLP neurons and define a library of binary token-level concepts, including amino-acid identity, biochemical properties, substitution groups, and repeat-region membership. We then use AUROC to evaluate how well each neuron distinguishes in-concept from out-of-concept tokens and assign each neuron to its best-matching concept.
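The matching step can be sketched as follows, assuming a neuron's activations and each concept's binary labels are given per token. AUROC is computed here with the rank-based (Mann-Whitney) formula; the concept names and toy activations are hypothetical.

```python
import numpy as np

def auroc(activations, labels):
    """Probability that a random in-concept token has a higher activation than a random
    out-of-concept token (ties count as 0.5), computed from average ranks."""
    x = np.asarray(activations, dtype=float)
    y = np.asarray(labels, dtype=bool)
    n_pos, n_neg = int(y.sum()), int((~y).sum())
    if n_pos == 0 or n_neg == 0:
        return float("nan")  # undefined when one class is empty
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(1, len(x) + 1)
    for v in np.unique(x):  # replace ranks of tied values with their average
        tie = x == v
        ranks[tie] = ranks[tie].mean()
    u = ranks[y].sum() - n_pos * (n_pos + 1) / 2
    return float(u / (n_pos * n_neg))

def best_concept(activations, concepts):
    """Return the (name, AUROC) of the concept this neuron separates best."""
    scores = {name: auroc(activations, labels) for name, labels in concepts.items()}
    valid = {name: s for name, s in scores.items() if not np.isnan(s)}
    name = max(valid, key=valid.get)
    return name, valid[name]

# Toy sequence "ACDAKRA" and a neuron that fires mostly on alanine (A).
activations = [0.9, 0.1, 0.2, 0.8, 0.0, 0.1, 0.7]
concepts = {
    "is_A":               [1, 0, 0, 1, 0, 0, 1],
    "positively_charged": [0, 0, 0, 0, 1, 1, 0],  # K and R
}
print(best_concept(activations, concepts))
```

An AUROC near 1.0 means the neuron fires selectively on the concept, near 0.5 means it is uninformative, and near 0.0 means it is selectively silent on the concept.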
Component interaction analysis. We analyze interactions between groups of components using edge attribution patching with integrated gradients (EAP-IG), and use the logit lens to study how different component groups influence the final prediction.
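The core of the logit lens is a projection of an intermediate vector onto the vocabulary through the unembedding matrix. The sketch below assumes a `d_model x vocab_size` matrix `W_U` and omits the final layer normalization that real models typically apply before unembedding; the toy matrix and vocabulary are hypothetical.

```python
import numpy as np

def logit_lens(vector, W_U, vocab, k=3):
    """Project a d_model-sized vector onto vocabulary logits and return the top-k tokens."""
    logits = np.asarray(vector, dtype=float) @ np.asarray(W_U, dtype=float)
    top = np.argsort(-logits)[:k]
    return [(vocab[i], float(logits[i])) for i in top]

# Toy setup: 3-dimensional residual stream, 3-token vocabulary, identity unembedding.
vocab = ["A", "C", "D"]
W_U = np.eye(3)
component_output = [2.0, -1.0, 0.0]  # hypothetical summed output of a component group

print(logit_lens(component_output, W_U, vocab, k=2))
```

Applied to the output of a group of heads or neurons, positive logits indicate tokens the group promotes at that position and negative logits indicate tokens it suppresses.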

Related Work

Repeat identification in LLMs

A Mathematical Framework for Transformer Circuits
Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, Chris Olah
Transformer Circuits Thread, 2021
In-context Learning and Induction Heads
Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, Chris Olah
Transformer Circuits Thread, 2022

Repeat identification and control in PLMs

In-Context Learning can distort the relationship between sequence likelihoods and biological fitness
Pranav Kantroo, Günter P Wagner, Benjamin B Machta
arXiv preprint, 2025
Controlling Repetition in Protein Language Models
Jiahao Zhang, Zeqing Zhang, Di Wang, Lijie Hu
ICLR, 2026

Circuit discovery

Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms
Michael Hanna, Sandro Pezzelle, Yonatan Belinkov
COLM, 2024

How to cite

bibliography

	Gal Kesten-Pomeranz, Yaniv Nikankin, Anja Reusch, Tomer Tsaban, Ora Schueler-Furman, Yonatan Belinkov. “Induction Meets Biology: Mechanisms of Repeat Detection in Protein Language Models”. arXiv:2602.23179, 2026.

bibtex

	@misc{kestenpomeranz2026inductionmeetsbiologymechanisms,
	  title={Induction Meets Biology: Mechanisms of Repeat Detection in Protein Language Models},
	  author={Gal Kesten-Pomeranz and Yaniv Nikankin and Anja Reusch and Tomer Tsaban and Ora Schueler-Furman and Yonatan Belinkov},
	  year={2026},
	  eprint={2602.23179},
	  archivePrefix={arXiv},
	  primaryClass={cs.LG},
	  url={https://arxiv.org/abs/2602.23179},
	}

Appendix: Neuron Examples

Here we provide visualizations of several task-relevant neurons that capture different concepts, including both biochemical and repeat-related patterns.