Language models (LMs) risk inadvertently memorizing and divulging sensitive or personally identifiable information (PII) seen in training data, causing privacy concerns. Current approaches to address this issue involve costly dataset scrubbing, or model filtering through unlearning and model editing, which can be bypassed through extraction attacks. We propose REVS, a novel non-gradient-based method for unlearning sensitive information from LMs. REVS identifies and modifies a small subset of neurons relevant for constituent tokens that form sensitive information. To adequately evaluate our method on truly sensitive information, we curate three datasets: email and URL datasets naturally memorized by the models, and a synthetic social security number dataset that we tune the models to memorize. Compared to other methods, REVS demonstrates superior performance in unlearning sensitive information and robustness to extraction attacks, while retaining underlying model integrity.
We demonstrate that models like Llama-3-8B and GPT-J-6B naturally memorize sensitive information from The Pile training dataset, including email addresses and URLs.
REVS operates in two main phases: localization and editing. First, it selects the rarest tokens from the target sensitive sequence, since accurately extracting the original data requires recovering the full token sequence. It then locates the layers most relevant to those tokens and, within them, selects neurons that are both activated by the given prompt and strongly associated with the target tokens. REVS iteratively edits the selected neurons to demote the target tokens until they fall below a specified rank threshold (i.e., out of the top candidates), suppressing the sensitive information while preserving the model's broader knowledge.
The key insight is that transformer MLP layers construct predictions by promoting specific tokens in the output vocabulary space. By identifying and modifying FF₂ columns (neurons) that contribute most to generating target tokens, REVS surgically removes the model's encoded tendency to generate sensitive data.
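The sketch below illustrates this edit step under our reading of the method, assuming PyTorch tensors for a layer's FF₂ (down-projection) weight and the unembedding matrix. The helper names, the fixed demotion step size, and the use of the unembedding pseudo-inverse to map the edited vocabulary projection back to hidden space are illustrative choices, not the reference implementation.

```python
import torch

def token_rank(logits: torch.Tensor, token_id: int) -> int:
    # Rank 0 = most promoted token in this logit vector.
    return int((logits > logits[token_id]).sum())

def demote_neuron(ff2_weight: torch.Tensor,  # (hidden_dim, n_neurons) FF2 / down-projection
                  unembed: torch.Tensor,     # (vocab_size, hidden_dim) unembedding matrix
                  neuron_idx: int,
                  target_token: int,
                  rank_threshold: int = 100,
                  step: float = 0.5,
                  max_iters: int = 50) -> None:
    """Lower the target token's logit in the neuron's vocabulary projection
    until the token falls out of the top `rank_threshold` candidates, then
    map the edited projection back to hidden space (assumed edit scheme)."""
    unembed_pinv = torch.linalg.pinv(unembed)       # (hidden_dim, vocab_size)
    neuron = ff2_weight[:, neuron_idx].clone()      # one FF2 column = one neuron
    for _ in range(max_iters):
        vocab_logits = unembed @ neuron             # project neuron to vocabulary space
        if token_rank(vocab_logits, target_token) >= rank_threshold:
            break                                   # target token demoted far enough
        vocab_logits[target_token] -= step          # push the target token down
        neuron = unembed_pinv @ vocab_logits        # map back to hidden space
    ff2_weight[:, neuron_idx] = neuron              # write the edited neuron back
```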
REVS uses a two-step localization process:
Layer Selection: We identify relevant layers by measuring how strongly each layer contributes to generating the target token; a layer is selected for editing when its vocabulary projection promotes the target token beyond a predetermined rank threshold.
Neuron Selection: Within selected layers, we identify neurons using two criteria: (1) Activation strength - how strongly a neuron is activated in response to the prompt, and (2) Token association - how strongly the neuron is associated with the target token when projected to vocabulary space. This hybrid approach targets neurons that are both contextually significant and semantically relevant, and it outperforms alternatives based solely on activations, gradients, or random selection. A minimal code sketch of both localization steps follows.
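In the sketch below, per-layer hidden states are projected to vocabulary space (logit-lens style) to pick layers, and FF₂ columns are scored by activation strength and target-token association. The tensor names and the simple product used to combine the two scores are assumptions for illustration, not the exact implementation.

```python
import torch

def token_rank(logits: torch.Tensor, token_id: int) -> int:
    # Rank 0 = most promoted token in this logit vector.
    return int((logits > logits[token_id]).sum())

def select_layers(hidden_states, unembed: torch.Tensor,
                  target_token: int, rank_threshold: int = 200) -> list[int]:
    """Step 1: keep layers whose hidden state already promotes the target
    token strongly (its rank in the vocabulary projection beats a threshold)."""
    selected = []
    for layer, hidden in enumerate(hidden_states):   # one (batch, seq, hidden_dim) tensor per layer
        logits = hidden[0, -1] @ unembed.T           # last position -> vocab logits
        if token_rank(logits, target_token) < rank_threshold:
            selected.append(layer)
    return selected

def select_neurons(activations: torch.Tensor,        # (n_neurons,) FF2 input acts for the prompt
                   ff2_weight: torch.Tensor,         # (hidden_dim, n_neurons)
                   unembed: torch.Tensor,            # (vocab_size, hidden_dim)
                   target_token: int, k: int = 30) -> torch.Tensor:
    """Step 2: score neurons by (1) activation strength on the prompt and
    (2) association with the target token in vocabulary space; keep the top k."""
    act_score = activations.abs()
    vocab_proj = unembed @ ff2_weight                # (vocab_size, n_neurons)
    assoc_score = vocab_proj[target_token]           # target-token logit per neuron
    score = (act_score / act_score.max()) * (assoc_score / assoc_score.abs().max())
    return score.topk(k).indices
```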
We select the rarest tokens (typically 2 tokens per sequence) from the target sensitive information, as unlearning every token is unnecessary for preventing accurate recovery of the original data.
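For completeness, a tiny sketch of this token choice, assuming rarity is estimated from token counts over a reference corpus (`token_counts` is a hypothetical id-to-frequency mapping):

```python
def rarest_tokens(target_token_ids: list[int],
                  token_counts: dict[int, int],
                  n: int = 2) -> list[int]:
    # Tokens unseen in the reference corpus count as rarest (frequency 0).
    return sorted(target_token_ids, key=lambda t: token_counts.get(t, 0))[:n]
```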
To rigorously evaluate unlearning on truly sensitive information, we curated three benchmark datasets: email addresses and URLs naturally memorized by the models, and a synthetic social security number (SSN) dataset that the models are tuned to memorize.
Unlike prior work that evaluated editing methods on non-sensitive data, our benchmarks contain actual private information, enabling rigorous assessment of unlearning efficacy and robustness against extraction attacks.
We evaluate unlearning along three critical dimensions: how effectively the sensitive information is unlearned, how well the model's broader knowledge and capabilities are retained, and how robust the edited model is to extraction attacks.
Our extraction attacks are stricter and more comprehensive than prior work, considering more candidate tokens across all model layers.
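The sketch below shows the flavor of such an attack, assuming a Hugging Face causal LM: the hidden state at the prediction position of every layer is projected to vocabulary space (a logit-lens-style probe, omitting the final layer norm for brevity), and the sensitive token counts as exposed if it appears among the top-k candidates at any layer. It illustrates the attack surface we evaluate against, not our exact attack suite.

```python
import torch

@torch.no_grad()
def target_exposed(model, tokenizer, prompt: str, target_token: int, k: int = 100) -> bool:
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model(**inputs, output_hidden_states=True)   # per-layer hidden states
    unembed = model.get_output_embeddings().weight         # (vocab_size, hidden_dim)
    for hidden in outputs.hidden_states:                    # embeddings + every layer
        logits = hidden[0, -1] @ unembed.T                  # last position -> vocab logits
        if target_token in logits.topk(k).indices:
            return True                                     # candidate set leaks the token
    return False
```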
REVS demonstrates superior performance across all evaluation metrics compared to six strong baselines: MEMIT, Constrained Fine-tuning (FT-L), NPO-KL, RMU, Head-Projection, and Max-Entropy. Key findings include:
Notably, our experiments revealed inherent differences between synthetic and organically memorized information, with organically memorized data proving more challenging to unlearn, suggesting that information memorized during pre-training may be more deeply ingrained in model parameters.
| Dataset | Method | Unlearning Score ↑ | Efficacy@100 ↑ | General.@100 ↑ | Specificity ↑ | MMLU ↑ | GSM8K ↑ |
|---|---|---|---|---|---|---|---|
| SSN | Unedited | 0.00±0.00 | 0.00±0.00 | 0.00±0.00 | 100±0.00 | 61.05 | 47.83 |
| | FT-L | 36.98±11.97 | 63.88±9.88 | 50.35±10.76 | 24.33±9.78 | 60.99 | 46.62 |
| | MEMIT | 24.72±7.21 | 30.70±9.67 | 23.90±7.61 | 22.67±6.50 | 61.02 | 46.17 |
| | Max-Entropy | 5.12±2.13 | 5.17±3.00 | 3.92±2.33 | 1.40±0.60 | 61.06 | 47.46 |
| | Head-Projection | 2.98±0.79 | 3.08±1.23 | 2.95±0.68 | 4.17±2.41 | 61.06 | 46.92 |
| | RMU | 16.42±9.10 | 13.47±8.42 | 16.67±10.41 | 38.67±14.92 | 60.83 | 48.21 |
| | NPO-KL | 11.95±4.87 | 38.78±18.34 | 36.13±16.59 | 6.33±3.68 | 61.01 | 47.23 |
| | REVS (ours) | 89.58±1.99 | 98.88±1.28 | 89.67±3.78 | 82.17±5.08 | 60.87 | 44.20 |
| Emails | Unedited | 0.00±0.00 | 0.00±0.00 | − | 100±0.00 | 62.17 | 47.99 |
| | FT-L | 50.30±3.04 | 52.98±4.23 | − | 49.25±8.50 | 62.15 | 50.94 |
| | MEMIT | 35.43±4.30 | 63.63±3.50 | − | 24.84±4.20 | 62.22 | 50.64 |
| | Max-Entropy | 31.08±3.30 | 69.75±6.30 | − | 20.22±3.10 | 62.11 | 50.64 |
| | Head-Projection | 30.80±3.90 | 64.33±4.90 | − | 20.43±3.40 | 62.10 | 50.19 |
| | RMU | 17.47±3.60 | 15.08±5.90 | − | 32.58±16.00 | 62.10 | 46.39 |
| | NPO-KL | 32.75±2.70 | 24.27±3.00 | − | 50.97±2.00 | 62.05 | 48.67 |
| | REVS (ours) | 62.37±2.30 | 59.65±3.95 | − | 65.70±3.79 | 61.77 | 47.46 |
| URLs | FT-L | 28.03±3.95 | 59.13±7.71 | − | 18.63±3.52 | 62.14 | 50.72 |
| | MEMIT | 17.52±4.10 | 34.37±10.80 | − | 11.98±3.00 | 62.14 | 49.96 |
| | Max-Entropy | 12.78±3.90 | 32.88±7.90 | − | 8.06±2.80 | 62.19 | 49.50 |
| | Head-Projection | 11.28±3.90 | 26.32±8.40 | − | 7.30±2.70 | 62.14 | 49.81 |
| | RMU | 13.48±6.00 | 12.22±12.20 | − | 41.83±15.30 | 62.02 | 49.81 |
| | NPO-KL | 17.80±6.60 | 10.97±4.40 | − | 50.87±14.10 | 62.13 | 49.88 |
| | REVS (ours) | 44.25±5.01 | 78.22±6.04 | − | 30.94±4.11 | 62.31 | 47.76 |
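As a reading aid for the table, the aggregate Unlearning Score behaves like a harmonic mean of the per-metric columns (e.g., for SSN/REVS, 3/(1/98.88 + 1/89.67 + 1/82.17) ≈ 89.7); treat this aggregation as an assumption of the sketch below rather than a definition.

```python
def unlearning_score(metrics: list[float]) -> float:
    # Harmonic mean of the available metrics (Efficacy, Generalization,
    # Specificity); missing components (e.g. Generalization for Emails/URLs)
    # are simply omitted. Assumed aggregation, not the paper's definition.
    if any(m == 0 for m in metrics):
        return 0.0
    return len(metrics) / sum(1.0 / m for m in metrics)
```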
As shown in the radar chart below and in the detailed results, REVS achieves the highest Resistance Score across all datasets and models, except for Emails on Llama-3-8B, where it ranks a close second. Notably, the results show a clear link between higher Effectiveness and improved Resistance Score. While strong Effectiveness often comes at the cost of Specificity, REVS maintains both, achieving the best or second-best Resistance Score and demonstrating balanced, robust unlearning.
The extraction resistance evaluation demonstrates that REVS not only successfully removes sensitive information from model outputs but also makes it significantly more difficult for adversarial attacks to recover the unlearned data. This robustness is crucial for real-world deployment where models may face sophisticated extraction attempts.