REVS: Unlearning Sensitive Information in Language Models via Rank Editing in the Vocabulary Space

Technion – Israel Institute of Technology

Abstract

Language models (LMs) risk inadvertently memorizing and divulging sensitive or personally identifiable information (PII) seen in training data, raising privacy concerns. Current approaches to address this issue involve costly dataset scrubbing, or model filtering through unlearning and model editing, which can be bypassed through extraction attacks. We propose REVS, a novel non-gradient-based method for unlearning sensitive information from LMs. REVS identifies and modifies a small subset of neurons associated with the constituent tokens that form the sensitive information. To adequately evaluate our method on truly sensitive information, we curate two datasets: an email dataset naturally memorized by Llama-3-8B and GPT-J-6B, and a synthetic social security number dataset that we tune the models to memorize. Compared to other methods, REVS demonstrates superior performance in unlearning sensitive information and robustness to extraction attacks, while preserving the integrity of the underlying model.

Motivation

Language models risk inadvertently memorizing and divulging sensitive or personally identifiable information (PII) seen in their training data, causing privacy concerns. Current approaches to address this issue involve costly dataset scrubbing or model filtering through unlearning and model editing, which can be bypassed through extraction attacks.

Figure 1: Overview of the REVS unlearning process. (1) The original model memorizes a sensitive email address and (2) generates it exactly given a related prompt under greedy decoding. (3) After applying REVS, the target email token(s) are demoted to a specified lower rank R in the model's output, preventing the model from generating the unlearned email.

Method

REVS selects tokens from the target sensitive sequence, since accurately extracting the original sensitive data requires recovering the full token sequence. It then locates the layers most relevant to those tokens and selects neurons within them that are both highly activated by the given prompt and strongly associated with the target tokens. REVS iteratively edits these selected neurons to push the rank of the target tokens below a specified threshold, demoting the sensitive information while preserving the model's broader knowledge.

Figure 2: Editing one neuron with REVS. (1) The neuron is projected from hidden space to vocabulary logit space. (2) The logit is adjusted to demote the target token to a desired lower rank R. (3) The adjusted logit vector is projected back to hidden space, yielding the updated neuron value.

Locating Model Components for Editing

REVS selects the rarest tokens from the target sensitive sequence, identifies the layers where those tokens rank highly in the residual hidden state, and selects neurons within those layers that exhibit both high activations for the given prompt and high rank for the target tokens when projected to the vocabulary space. This targeted approach surgically removes the model's tendency to generate the sensitive information while preserving its broader knowledge.
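The two-part selection criterion above (fires on the prompt, and promotes the target token in vocabulary space) can be sketched as follows. Function names and the exact scoring rule are illustrative assumptions, not the paper's precise criterion:

```python
import numpy as np

def select_neurons(W_out, U, activations, target_token, k=5):
    """Illustrative neuron scoring for editing.

    W_out: (d_ff, d) rows are FFN neuron value vectors.
    U: (V, d) unembedding matrix; activations: (d_ff,) prompt activations.
    A neuron scores high if it is strongly activated by the prompt AND
    ranks the target token near the top of its vocabulary projection.
    """
    logits = W_out @ U.T                            # (d_ff, V) per-neuron vocab logits
    # rank of the target token within each neuron's projection (0 = top)
    ranks = (logits > logits[:, [target_token]]).sum(axis=1)
    rank_score = 1.0 / (1.0 + ranks)                # higher when target is near the top
    score = np.abs(activations) * rank_score        # combine both criteria
    return np.argsort(-score)[:k]                   # indices of the top-k neurons
```

Editing only the top-scoring neurons is what keeps the intervention surgical: neurons that are inactive on the prompt, or unrelated to the target token, are left untouched.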

Results

REVS achieves near-perfect efficacy in preventing the generation of unlearned targets and maintains high specificity, indicating minimal disruption to the model's desired behavior. Notably, our experiments reveal inherent differences in unlearning performance between synthetic and organically memorized information: the organically memorized Email dataset proved more challenging, suggesting that information memorized during pre-training is more deeply ingrained in the model's parameters.
Unlearning effectiveness and model integrity results. REVS is superior in almost all cases.
| Dataset | Method | Harmonic Mean ↑ | Efficacy@100 ↑ | General.@100 ↑ | Specificity ↑ | Perplexity ↓ |
|---|---|---|---|---|---|---|
| SSN | Unedited | 0±0.0 | 0±0.0 | 0±0.0 | 100±0 | 11.148±0 |
| SSN | FT-L | 4.78±3.66 | 55.78±12.37 | 1.75±1.44 | 61.67±15.17 | 11.27±0.054 |
| SSN | MEMIT (modified) | 78.07±2.2 | 98.5±2.33 | 61.15±3.25 | 84.17±3.07 | 11.156±0.011 |
| SSN | REVS (ours) | 81.45±3.56 | 99.95±0.07 | 80.17±3.22 | 70.33±7.84 | 11.165±0.01 |
| Emails | Unedited | 0±0.0 | 0±0.0 | - | 100±0 | 8.129±0 |
| Emails | FT-L | 3.33±1.77 | 55.57±4.39 | - | 1.73±0.96 | 13.63±0.34 |
| Emails | MEMIT (modified) | 70.05±1.16 | 88.23±1.64 | - | 58.1±1.63 | 8.13±0 |
| Emails | REVS (ours) | 80.65±2.41 | 97.22±1.04 | - | 68.98±3.6 | 8.148±0.002 |
Average results for extraction resistance. REVS is more robust to extraction attacks.
| Dataset | Method | Harmonic Mean ↑ | Logit Lens@100 ↑ | Delta@100 ↑ | Perturb@100 ↑ |
|---|---|---|---|---|---|
| SSN | Unedited | 0±0.0 | 0±0.0 | 95.12±0.82 | 26.5±5.26 |
| SSN | FT-L | 65.35±7.57 | 55.63±12.64 | 97.03±0.8 | 58.62±3.49 |
| SSN | MEMIT (modified) | 93.52±1.76 | 97.48±2.01 | 97.88±0.64 | 90.93±3.53 |
| SSN | REVS (ours) | 99.12±3.56 | 99.95±0.07 | 98.55±0.2 | 98.97±1.46 |
| Emails | Unedited | 0±0.0 | 0±0.0 | 83.8±0.67 | 44.2±4.11 |
| Emails | FT-L | 50.15±3.08 | 28.47±2.73 | 85.83±1.11 | 78.08±5.15 |
| Emails | MEMIT (modified) | 80.73±1.7 | 79.62±2.31 | 86.17±0.39 | 77.12±3.86 |
| Emails | REVS (ours) | 83.48±1.14 | 81.05±1.17 | 87.08±0.25 | 82.63±2.63 |
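Assuming the Harmonic Mean column is the unweighted harmonic mean of the row's component metrics (the paper may average over runs first, so the numbers need not match exactly), the aggregation rewards methods that are strong on every axis at once, since a harmonic mean is pulled toward the weakest score:

```python
from statistics import harmonic_mean

# Component scores from the REVS SSN row of the first table:
# Efficacy@100, General.@100, Specificity.
scores = [99.95, 80.17, 70.33]
print(round(harmonic_mean(scores), 2))  # ~81.75, close to the reported 81.45
```

This explains why FT-L's harmonic mean collapses on the Emails dataset: its specificity of 1.73 drags the aggregate down to 3.33 despite decent efficacy.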

BibTeX

@article{tomer2024revs,
  title={REVS: Rank Editing in the Vocabulary Space for Unlearning Sensitive Information in Large Language Models},
  author={Ashuach, Tomer and Tutek, Martin and Belinkov, Yonatan},
  journal={arXiv preprint arXiv:2406.09325},
  year={2024}
}