Large language models (LLMs) risk inadvertently memorizing and divulging sensitive or personally identifiable information (PII) seen in their training data, raising privacy concerns. Current approaches to address this issue involve costly dataset scrubbing or model filtering through unlearning and model editing, which can be bypassed by extraction attacks. We propose REVS, a novel model editing method for unlearning sensitive information from language models. REVS identifies and modifies a small subset of neurons relevant to each piece of sensitive information. By projecting these neurons onto the vocabulary space, we pinpoint the components driving its generation. We then compute a model edit based on the pseudo-inverse of the unembedding matrix and apply it to de-promote generation of the targeted sensitive data. Compared with other state-of-the-art model editing methods, REVS demonstrates superior performance in both eliminating sensitive information and robustness to extraction attacks, while retaining the integrity of the underlying model.
REVS selects the rarest tokens from the target sensitive information sequence, since accurately extracting the original sensitive data requires recovering the full token sequence. For each selected token, it identifies the layers where that token ranks highly in the residual hidden state, and within those layers selects neurons that both activate strongly on the given prompt and assign the target token a high rank when their value vectors are projected onto the vocabulary space. REVS then iteratively edits these neurons to demote the target token below a specified rank threshold, surgically removing the model's tendency to generate the sensitive information while preserving its broader knowledge.
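To make the selection step concrete, here is a minimal PyTorch sketch, assuming the FFN value vectors are the rows of a matrix `W_out` and the unembedding matrix is `U`. The function names, the `1/(1+rank)` scoring rule, and the top-`k` cutoff are illustrative assumptions, not the exact REVS implementation.

```python
import torch

def token_rank(vec: torch.Tensor, U: torch.Tensor, target_id: int) -> int:
    """Position (0 = top) of the target token when `vec` is projected
    onto the vocabulary space via the unembedding matrix `U` (vocab x d)."""
    logits = U @ vec  # (vocab_size,)
    return int((logits > logits[target_id]).sum().item())

def select_neurons(W_out: torch.Tensor, acts: torch.Tensor, U: torch.Tensor,
                   target_id: int, k: int = 10) -> torch.Tensor:
    """Pick k neurons that both activate strongly on the prompt and rank
    the target token highly in their vocabulary projection.
    W_out: (n_neurons, hidden_dim) FFN value vectors.
    acts:  (n_neurons,) activations on the prompt eliciting the secret."""
    ranks = torch.tensor([token_rank(W_out[i], U, target_id)
                          for i in range(W_out.shape[0])], dtype=torch.float)
    # Assumed combined score: high activation and low rank (token near top).
    scores = acts.abs() / (1.0 + ranks)
    return torch.topk(scores, k).indices
```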
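The editing step can be sketched in the same spirit: lower the target token's logit in the neuron's vocabulary projection, map the edited logits back to hidden space with the pseudo-inverse of the unembedding matrix, and repeat until the token falls out of the top ranks. The step size, rank threshold, and stopping rule below are placeholder assumptions.

```python
import torch

def demote_token(v: torch.Tensor, U: torch.Tensor, target_id: int,
                 rank_threshold: int = 100, step: float = 1.0,
                 max_iters: int = 100) -> torch.Tensor:
    """Iteratively edit one neuron value vector `v` until the target token
    drops below the top `rank_threshold` of its vocabulary projection."""
    U_pinv = torch.linalg.pinv(U)  # (hidden_dim, vocab_size), computed once
    for _ in range(max_iters):
        logits = U @ v
        rank = int((logits > logits[target_id]).sum().item())
        if rank >= rank_threshold:  # target already demoted deep enough
            break
        logits[target_id] -= step   # de-promote the target token's logit
        v = U_pinv @ logits         # least-squares map back to hidden space
    return v
```

In this sketch the edit is local to one neuron value vector; applying it to each neuron returned by `select_neurons` would constitute the overall model edit.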
Unlearning effectiveness and model integrity:

| Dataset | Method | Harmonic Mean ↑ | Efficacy@100 ↑ | Generalization@100 ↑ | Specificity ↑ | Perplexity ↓ |
|---|---|---|---|---|---|---|
| SSN | Unedited | 0±0.0 | 0±0.0 | 0±0.0 | 100±0 | 11.148±0 |
| | FT-L | 4.78±3.66 | 55.78±12.37 | 1.75±1.44 | 61.67±15.17 | 11.27±0.054 |
| | MEMIT (modified) | 78.07±2.2 | 98.5±2.33 | 61.15±3.25 | 84.17±3.07 | 11.156±0.011 |
| | REVS (ours) | 81.45±3.56 | 99.95±0.07 | 80.17±3.22 | 70.33±7.84 | 11.165±0.01 |
| Emails | Unedited | 0±0.0 | 0±0.0 | - | 100±0 | 8.129±0 |
| | FT-L | 3.33±1.77 | 55.57±4.39 | - | 1.73±0.96 | 13.63±0.34 |
| | MEMIT (modified) | 70.05±1.16 | 88.23±1.64 | - | 58.1±1.63 | 8.13±0 |
| | REVS (ours) | 80.65±2.41 | 97.22±1.04 | - | 68.98±3.6 | 8.148±0.002 |
Robustness to extraction attacks:

| Dataset | Method | Harmonic Mean ↑ | Logit Lens@100 ↑ | Delta@100 ↑ | Perturb@100 ↑ |
|---|---|---|---|---|---|
| SSN | Unedited | 0±0.0 | 0±0.0 | 95.12±0.82 | 26.5±5.26 |
| | FT-L | 65.35±7.57 | 55.63±12.64 | 97.03±0.8 | 58.62±3.49 |
| | MEMIT (modified) | 93.52±1.76 | 97.48±2.01 | 97.88±0.64 | 90.93±3.53 |
| | REVS (ours) | 99.12±3.56 | 99.95±0.07 | 98.55±0.2 | 98.97±1.46 |
| Emails | Unedited | 0±0.0 | 0±0.0 | 83.8±0.67 | 44.2±4.11 |
| | FT-L | 50.15±3.08 | 28.47±2.73 | 85.83±1.11 | 78.08±5.15 |
| | MEMIT (modified) | 80.73±1.7 | 79.62±2.31 | 86.17±0.39 | 77.12±3.86 |
| | REVS (ours) | 83.48±1.14 | 81.05±1.17 | 87.08±0.25 | 82.63±2.63 |