Induction Meets Biology:
Mechanisms of Repeat Detection in Protein Language Models

Gal Kesten-Pomeranz¹, Yaniv Nikankin¹, Anja Reusch¹, Tomer Tsaban², Ora Schueler-Furman², Yonatan Belinkov¹,³
¹Technion – Israel Institute of Technology; ²The Hebrew University of Jerusalem; ³Kempner Institute, Harvard University

Research Question

Repeated segments are a common feature of protein sequences and often play important roles in structure and function. While these segments may initially arise as exact copies, they typically diverge over time as mutations accumulate, making many repeats only approximate rather than exact. As a result, repeat identification has long been a challenging problem in computational biology, motivating decades of algorithmic work. More recently, protein language models (PLMs) have been shown to detect such repeats and exploit them in masked-token prediction. But what internal mechanism enables this ability? In this work, we uncover the mechanisms by which PLMs detect such repeats.

Main Results

Visualization of the repeat identification mechanism
(I) Relative-position heads and biochemical neurons build contextual representations that align tokens across repeats. (II) Induction heads attend from the masked position to its aligned token and copy its information, enabling retrieval of the correct amino acid; repeat neurons play an inhibitory role. (III) MLP neurons refine the final masked token distribution, with amino-acid–biased attention heads also contributing to the prediction.
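
The retrieval step in (II) can be illustrated with a toy sketch. This is not the models' actual computation: it assumes the two repeat copies are gap-free, equal in length, and that their spans are already known, whereas in the PLMs the alignment is built from learned representations in stage (I). The function names and the example sequence are hypothetical.

```python
def aligned_position(mask_pos, repeat_a, repeat_b):
    """Map a position in one repeat copy to the same offset in the other copy.

    repeat_a and repeat_b are (start, end) half-open spans.
    Assumption: gap-free copies of equal length (real repeats are often approximate).
    """
    (a_start, a_end), (b_start, b_end) = repeat_a, repeat_b
    if a_end - a_start != b_end - b_start:
        raise ValueError("this toy version assumes repeat copies of equal length")
    if a_start <= mask_pos < a_end:
        return b_start + (mask_pos - a_start)
    if b_start <= mask_pos < b_end:
        return a_start + (mask_pos - b_start)
    raise ValueError("mask position is not inside either repeat copy")


def predict_masked_token(tokens, mask_pos, repeat_a, repeat_b):
    """Copy the token aligned with the mask, mimicking an induction-style lookup."""
    return tokens[aligned_position(mask_pos, repeat_a, repeat_b)]


# Hypothetical sequence: two copies of "GKTLPE" separated by a linker; "?" is the mask.
sequence = list("MAGKTLPESSGK?LPEQ")
first_copy, second_copy = (2, 8), (10, 16)
mask_pos = sequence.index("?")

predicted = predict_masked_token(sequence, mask_pos, first_copy, second_copy)
print(predicted)  # T
```

For an approximate repeat, the copied residue is only a strong prior, which is consistent with stage (III) refining the final distribution rather than copying verbatim.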

Key Contributions

Additional Results

Methodology

Tools & Techniques

Related Work

Repeat identification in LLMs

Repeat identification and control in PLMs

Circuit discovery

How to cite

bibliography

	Gal Kesten-Pomeranz, Yaniv Nikankin, Anja Reusch, Tomer Tsaban, Ora Schueler-Furman, Yonatan Belinkov. “Induction Meets Biology: Mechanisms of Repeat Detection in Protein Language Models.” arXiv:2602.23179, 2026.
	

bibtex

	@misc{kestenpomeranz2026inductionmeetsbiologymechanisms,
	  title={Induction Meets Biology: Mechanisms of Repeat Detection in Protein Language Models},
	  author={Gal Kesten-Pomeranz and Yaniv Nikankin and Anja Reusch and Tomer Tsaban and Ora Schueler-Furman and Yonatan Belinkov},
	  year={2026},
	  eprint={2602.23179},
	  archivePrefix={arXiv},
	  primaryClass={cs.LG},
	  url={https://arxiv.org/abs/2602.23179},
	}
	

Appendix: Neuron Examples

Here we provide visualizations of several task-relevant neurons that capture different concepts, including both biochemical and repeat-related patterns.

Aligned Repeat Token to Mask Position

Aligned repeat token neuron example 1

Aligned repeat token neuron example 2
Example sequence visualizations for a neuron in ESM-3 (layer 38, neuron 2850) that is selective for the repeat token aligned with the masked position. The neuron activates on the token aligned with the mask (denoted by "?"), which is the correct retrieval target. Repeat tokens are highlighted in green. UniProt accessions: A0A0P7XBW1 (top), B0DTM3 (bottom).

Repeat Tokens

Repeat tokens neuron
Example sequence visualization for a repeat-selective neuron in ESM-C (layer 24, neuron 2224). The neuron exhibits positive activations (red) at positions corresponding to approximate repeat tokens, highlighted in green. UniProt accession: A0A8X6HTE9.

Helix Breakers

Helix breaker neuron visualization
Example sequence visualization for a helix-breaker–selective neuron in ESM-3 (layer 2, neuron 3282). The neuron exhibits strong positive activations (red) on proline (P) and glycine (G), amino acids known to disrupt α-helical secondary structure. UniProt accession: B0DTM3.

Aromatic Ring

Aromatic ring neuron
Example sequence visualization for an aromatic-ring–selective neuron in ESM-3 (layer 35, neuron 3011). The neuron exhibits strong positive activations (red) on aromatic amino acids (F, Y, H, W). UniProt accession: A0A0P7XBW1.

Hydrogen Donor

Hydrogen donor neuron
Example sequence visualization for a neuron in ESM-C (layer 7, neuron 907) that is selective for the IMGT hydrogen-bond donor residue class. The neuron exhibits strong positive activations (red) on R, K, and W, consistent with their classification as side-chain hydrogen-bond donors in IMGT. UniProt accession: A0A2M8A3Y9.
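
One simple way to quantify the selectivity shown in the biochemical examples above is to compare a neuron's mean activation on residues inside a class with its mean activation on all other residues. The sketch below is an illustration only and is not necessarily the measure used in the paper. The residue classes follow the captions above; the function name, the sequence, and the activation values are hypothetical.

```python
from statistics import mean

# Residue classes as named in the captions above.
RESIDUE_CLASSES = {
    "helix_breaker": set("PG"),
    "aromatic": set("FYHW"),
    "imgt_h_donor": set("RKW"),
}


def class_selectivity(sequence, activations, residue_class):
    """Mean activation on class residues minus mean activation on all others."""
    if len(sequence) != len(activations):
        raise ValueError("need exactly one activation per residue")
    inside = [a for aa, a in zip(sequence, activations) if aa in residue_class]
    outside = [a for aa, a in zip(sequence, activations) if aa not in residue_class]
    if not inside or not outside:
        raise ValueError("sequence must contain residues both inside and outside the class")
    return mean(inside) - mean(outside)


# Hypothetical per-residue activations for a short made-up sequence.
toy_sequence = "APGLKP"
toy_activations = [0.5, 2.0, 1.5, 0.0, -0.5, 2.5]

gap = class_selectivity(toy_sequence, toy_activations, RESIDUE_CLASSES["helix_breaker"])
print(gap)
```

A large positive gap suggests the neuron fires preferentially on the class; a gap near zero suggests no preference. With real activations, the same comparison would be aggregated over many sequences rather than one.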

Bidirectional Amino-Acid Neurons

Bidirectional amino-acid neuron ESM-3
ESM-3 (layer 8, neuron 788), exhibiting positive activations (red) on amino acid F and negative activations (blue) on amino acid Y. Notably, F and Y are biochemically similar aromatic residues with a high BLOSUM62 substitution score. UniProt accession: B0DTM3.
Bidirectional amino-acid neuron ESM-C
ESM-C (layer 0, neuron 2105), exhibiting negative activations (blue) on amino acid S and positive activations (red) on amino acid T. Notably, S and T are often considered biochemically similar due to their hydroxyl side chains and have a high BLOSUM62 substitution score. UniProt accession: A0A2K3L8W6.

BLOSUM62

BLOSUM62 neuron ESM-3
ESM-3 (layer 19, neuron 3434), exhibiting positive activations (red) on amino acids D, E, and N. D, E, and N are mutually substitutable according to the BLOSUM62 matrix. UniProt accession: A0A1W9X075.
BLOSUM62 neuron ESM-C
ESM-C (layer 0, neuron 2615), exhibiting negative activations (blue) on amino acids D, E, and N. D, E, and N are mutually substitutable according to the BLOSUM62 matrix. UniProt accession: A0A2K3L8W6.
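
The substitution scores referenced in the bidirectional and BLOSUM62 examples can be checked with a small lookup. The sketch below hard-codes only the five BLOSUM62 entries relevant to these captions, transcribed by hand from the standard matrix; treat them as an assumption and verify them against an authoritative copy before relying on them. The helper name is hypothetical.

```python
# Selected BLOSUM62 entries (hand-transcribed; verify against a reference copy).
# Keys are unordered residue pairs because the matrix is symmetric.
BLOSUM62_SUBSET = {
    frozenset("FY"): 3,
    frozenset("ST"): 1,
    frozenset("DE"): 2,
    frozenset("DN"): 1,
    frozenset("EN"): 0,
}


def blosum62_score(a, b):
    """Look up the substitution score for two distinct residues in the subset."""
    key = frozenset((a, b))
    if key not in BLOSUM62_SUBSET:
        raise KeyError(f"pair {a}/{b} is not in this hand-copied subset")
    return BLOSUM62_SUBSET[key]


for pair in ("FY", "ST", "DE", "DN", "EN"):
    print(pair, blosum62_score(pair[0], pair[1]))
```

Under these entries, every pair among D, E, and N has a non-negative score, and the F/Y and S/T pairs that the bidirectional neurons separate both score positively.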

Special Tokens

Special tokens neuron ESM-3
ESM-3 (layer 0, neuron 125), exhibiting strong negative activations (blue) on the BOS token. Regular amino-acid tokens show positive activations (red), while the EOS token exhibits moderately negative activation. UniProt accession: A0A0P7XBW1.
Special tokens neuron ESM-C
ESM-C (layer 24, neuron 2127), selectively activating on the BOS token with strong negative activation (blue). The EOS token also exhibits moderately negative activation. UniProt accession: A0A2M8A3Y9.