Language models (LMs) risk inadvertently memorizing and divulging sensitive or personally identifiable information (PII) seen in training data, causing privacy concerns. Current approaches to address this issue involve costly dataset scrubbing, or model filtering through unlearning and model editing, which can be bypassed through extraction attacks. We propose REVS, a novel non-gradient-based method for unlearning sensitive information from LMs. REVS identifies and modifies a small subset of neurons relevant for constituent tokens that form sensitive information. To adequately evaluate our method on truly sensitive information, we curate three datasets: email and URL datasets naturally memorized by the models, and a synthetic social security number dataset that we tune the models to memorize. Compared to other methods, REVS demonstrates superior performance in unlearning sensitive information and robustness to extraction attacks, while retaining underlying model integrity.
We demonstrate that models like Llama-3-8B and GPT-J-6B naturally memorize sensitive information from The Pile training dataset, including email addresses and URLs.
REVS operates in two main phases: localization and editing. First, it selects the rarest tokens from the target sensitive sequence, since accurately extracting the original data requires recovering the full token sequence. It then locates the layers most relevant to those tokens and, within them, selects neurons that are both activated by the given prompt and strongly associated with the target tokens. REVS iteratively edits the selected neurons to demote the target tokens until they fall below a specified rank threshold (i.e., out of the top candidates), suppressing the sensitive information while preserving the model's broader knowledge.
The key insight is that transformer MLP layers construct predictions by promoting specific tokens in the output vocabulary space. By identifying and modifying FF₂ columns (neurons) that contribute most to generating target tokens, REVS surgically removes the model's encoded tendency to generate sensitive data.
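The sketch below illustrates this edit step under our reading of the method, assuming PyTorch tensors for a layer's FF₂ (down-projection) weight and the unembedding matrix. The helper names, the fixed demotion step size, and the use of the unembedding pseudo-inverse to map the edited vocabulary projection back to hidden space are illustrative choices, not the reference implementation.

```python
import torch

def token_rank(logits: torch.Tensor, token_id: int) -> int:
    # Rank 0 = most promoted token in this logit vector.
    return int((logits > logits[token_id]).sum())

def demote_neuron(ff2_weight: torch.Tensor,  # (hidden_dim, n_neurons) FF2 / down-projection
                  unembed: torch.Tensor,     # (vocab_size, hidden_dim) unembedding matrix
                  neuron_idx: int,
                  target_token: int,
                  rank_threshold: int = 100,
                  step: float = 0.5,
                  max_iters: int = 50) -> None:
    """Lower the target token's logit in the neuron's vocabulary projection
    until the token falls out of the top `rank_threshold` candidates, then
    map the edited projection back to hidden space (assumed edit scheme)."""
    unembed_pinv = torch.linalg.pinv(unembed)       # (hidden_dim, vocab_size)
    neuron = ff2_weight[:, neuron_idx].clone()      # one FF2 column = one neuron
    for _ in range(max_iters):
        vocab_logits = unembed @ neuron             # project neuron to vocabulary space
        if token_rank(vocab_logits, target_token) >= rank_threshold:
            break                                   # target token demoted far enough
        vocab_logits[target_token] -= step          # push the target token down
        neuron = unembed_pinv @ vocab_logits        # map back to hidden space
    ff2_weight[:, neuron_idx] = neuron              # write the edited neuron back
```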
REVS uses a two-step localization process:
Layer Selection: We identify relevant layers by measuring how strongly each layer contributes to generating the target token; a layer is selected for editing when its vocabulary projection promotes the target token beyond a predetermined rank threshold.
Neuron Selection: Within selected layers, we identify neurons using two criteria: (1) Activation strength - how strongly a neuron is activated in response to the prompt, and (2) Token association - how strongly the neuron is associated with the target token when projected to vocabulary space. This hybrid approach targets neurons that are both contextually significant and semantically relevant, and it outperforms alternatives based solely on activations, gradients, or random selection. A minimal code sketch of both localization steps follows.
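In the sketch below, per-layer hidden states are projected to vocabulary space (logit-lens style) to pick layers, and FF₂ columns are scored by activation strength and target-token association. The tensor names and the simple product used to combine the two scores are assumptions for illustration, not the exact implementation.

```python
import torch

def token_rank(logits: torch.Tensor, token_id: int) -> int:
    # Rank 0 = most promoted token in this logit vector.
    return int((logits > logits[token_id]).sum())

def select_layers(hidden_states, unembed: torch.Tensor,
                  target_token: int, rank_threshold: int = 200) -> list[int]:
    """Step 1: keep layers whose hidden state already promotes the target
    token strongly (its rank in the vocabulary projection beats a threshold)."""
    selected = []
    for layer, hidden in enumerate(hidden_states):   # one (batch, seq, hidden_dim) tensor per layer
        logits = hidden[0, -1] @ unembed.T           # last position -> vocab logits
        if token_rank(logits, target_token) < rank_threshold:
            selected.append(layer)
    return selected

def select_neurons(activations: torch.Tensor,        # (n_neurons,) FF2 input acts for the prompt
                   ff2_weight: torch.Tensor,         # (hidden_dim, n_neurons)
                   unembed: torch.Tensor,            # (vocab_size, hidden_dim)
                   target_token: int, k: int = 30) -> torch.Tensor:
    """Step 2: score neurons by (1) activation strength on the prompt and
    (2) association with the target token in vocabulary space; keep the top k."""
    act_score = activations.abs()
    vocab_proj = unembed @ ff2_weight                # (vocab_size, n_neurons)
    assoc_score = vocab_proj[target_token]           # target-token logit per neuron
    score = (act_score / act_score.max()) * (assoc_score / assoc_score.abs().max())
    return score.topk(k).indices
```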
We select the rarest tokens (typically 2 tokens per sequence) from the target sensitive information, as unlearning every token is unnecessary for preventing accurate recovery of the original data.
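For completeness, a tiny sketch of this token choice, assuming rarity is estimated from token counts over a reference corpus (`token_counts` is a hypothetical id-to-frequency mapping):

```python
def rarest_tokens(target_token_ids: list[int],
                  token_counts: dict[int, int],
                  n: int = 2) -> list[int]:
    # Tokens unseen in the reference corpus count as rarest (frequency 0).
    return sorted(target_token_ids, key=lambda t: token_counts.get(t, 0))[:n]
```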
To rigorously evaluate unlearning on truly sensitive information, we curated three benchmark datasets: email addresses and URLs naturally memorized by the models, and a synthetic social security number (SSN) dataset that the models are tuned to memorize.
Unlike prior work that evaluated editing methods on non-sensitive data, our benchmarks contain actual private information, enabling rigorous assessment of unlearning efficacy and robustness against extraction attacks.
We evaluate unlearning along three critical dimensions: how effectively the sensitive information is unlearned, how well the model's broader knowledge and capabilities are retained, and how robust the edited model is to extraction attacks.
Our extraction attacks are stricter and more comprehensive than prior work, considering more candidate tokens across all model layers.
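The sketch below shows the flavor of such an attack, assuming a Hugging Face causal LM: the hidden state at the prediction position of every layer is projected to vocabulary space (a logit-lens-style probe, omitting the final layer norm for brevity), and the sensitive token counts as exposed if it appears among the top-k candidates at any layer. It illustrates the attack surface we evaluate against, not our exact attack suite.

```python
import torch

@torch.no_grad()
def target_exposed(model, tokenizer, prompt: str, target_token: int, k: int = 100) -> bool:
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model(**inputs, output_hidden_states=True)   # per-layer hidden states
    unembed = model.get_output_embeddings().weight         # (vocab_size, hidden_dim)
    for hidden in outputs.hidden_states:                    # embeddings + every layer
        logits = hidden[0, -1] @ unembed.T                  # last position -> vocab logits
        if target_token in logits.topk(k).indices:
            return True                                     # candidate set leaks the token
    return False
```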
REVS demonstrates superior performance across all evaluation metrics compared to six strong baselines: MEMIT, Constrained Fine-tuning (FT-L), NPO-KL, RMU, Head-Projection, and Max-Entropy. Key findings include:
Notably, our experiments revealed inherent differences between synthetic and organically memorized information, with organically memorized data proving more challenging to unlearn, suggesting that information memorized during pre-training may be more deeply ingrained in model parameters.
| Dataset | Method | Unlearning Score ↑ | Efficacy@100 ↑ | General.@100 ↑ | Specificity ↑ | MMLU ↑ | GSM8K ↑ |
|---|---|---|---|---|---|---|---|
| SSN | Unedited | 0.00±0.00 | 0.00±0.00 | 0.00±0.00 | 100±0.00 | 61.05 | 47.83 |
| | FT-L | 36.98±11.97 | 63.88±9.88 | 50.35±10.76 | 24.33±9.78 | 60.99 | 46.62 |
| | MEMIT | 24.72±7.21 | 30.70±9.67 | 23.90±7.61 | 22.67±6.50 | 61.02 | 46.17 |
| | Max-Entropy | 5.12±2.13 | 5.17±3.00 | 3.92±2.33 | 1.40±0.60 | 61.06 | 47.46 |
| | Head-Projection | 2.98±0.79 | 3.08±1.23 | 2.95±0.68 | 4.17±2.41 | 61.06 | 46.92 |
| | RMU | 16.42±9.10 | 13.47±8.42 | 16.67±10.41 | 38.67±14.92 | 60.83 | 48.21 |
| | NPO-KL | 11.95±4.87 | 38.78±18.34 | 36.13±16.59 | 6.33±3.68 | 61.01 | 47.23 |
| | REVS (ours) | 89.58±1.99 | 98.88±1.28 | 89.67±3.78 | 82.17±5.08 | 60.87 | 44.20 |
| Emails | Unedited | 0.00±0.00 | 0.00±0.00 | − | 100±0.00 | 62.17 | 47.99 |
| | FT-L | 50.30±3.04 | 52.98±4.23 | − | 49.25±8.50 | 62.15 | 50.94 |
| | MEMIT | 35.43±4.30 | 63.63±3.50 | − | 24.84±4.20 | 62.22 | 50.64 |
| | Max-Entropy | 31.08±3.30 | 69.75±6.30 | − | 20.22±3.10 | 62.11 | 50.64 |
| | Head-Projection | 30.80±3.90 | 64.33±4.90 | − | 20.43±3.40 | 62.10 | 50.19 |
| | RMU | 17.47±3.60 | 15.08±5.90 | − | 32.58±16.00 | 62.10 | 46.39 |
| | NPO-KL | 32.75±2.70 | 24.27±3.00 | − | 50.97±2.00 | 62.05 | 48.67 |
| | REVS (ours) | 62.37±2.30 | 59.65±3.95 | − | 65.70±3.79 | 61.77 | 47.46 |
| URLs | FT-L | 28.03±3.95 | 59.13±7.71 | − | 18.63±3.52 | 62.14 | 50.72 |
| | MEMIT | 17.52±4.10 | 34.37±10.80 | − | 11.98±3.00 | 62.14 | 49.96 |
| | Max-Entropy | 12.78±3.90 | 32.88±7.90 | − | 8.06±2.80 | 62.19 | 49.50 |
| | Head-Projection | 11.28±3.90 | 26.32±8.40 | − | 7.30±2.70 | 62.14 | 49.81 |
| | RMU | 13.48±6.00 | 12.22±12.20 | − | 41.83±15.30 | 62.02 | 49.81 |
| | NPO-KL | 17.80±6.60 | 10.97±4.40 | − | 50.87±14.10 | 62.13 | 49.88 |
| | REVS (ours) | 44.25±5.01 | 78.22±6.04 | − | 30.94±4.11 | 62.31 | 47.76 |
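As a reading aid for the table, the aggregate Unlearning Score behaves like a harmonic mean of the per-metric columns (e.g., for SSN/REVS, 3/(1/98.88 + 1/89.67 + 1/82.17) ≈ 89.7); treat this aggregation as an assumption of the sketch below rather than a definition.

```python
def unlearning_score(metrics: list[float]) -> float:
    # Harmonic mean of the available metrics (Efficacy, Generalization,
    # Specificity); missing components (e.g. Generalization for Emails/URLs)
    # are simply omitted. Assumed aggregation, not the paper's definition.
    if any(m == 0 for m in metrics):
        return 0.0
    return len(metrics) / sum(1.0 / m for m in metrics)
```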
As shown in the radar chart below and in the detailed results, REVS achieves the highest Resistance Score across all datasets and models, except for Emails on Llama-3-8B, where it ranks a close second. Notably, the results show a clear link between higher Effectiveness and improved Resistance Score. While strong Effectiveness often comes at the cost of Specificity, REVS maintains both, achieving the best or second-best Resistance Score and demonstrating balanced, robust unlearning.
The extraction resistance evaluation demonstrates that REVS not only successfully removes sensitive information from model outputs but also makes it significantly more difficult for adversarial attacks to recover the unlearned data. This robustness is crucial for real-world deployment where models may face sophisticated extraction attempts.