As large language models (LLMs) are increasingly deployed in real-world applications, the need to selectively remove unwanted knowledge while preserving model utility has become paramount. We introduce CRISP, a parameter-efficient method for persistent concept unlearning via sparse autoencoders (SAEs). CRISP automatically identifies salient SAE features activated by harmful or sensitive knowledge and suppresses their activations through fine-tuning, ensuring permanent removal rather than inference-time control. Experiments on two open-weight LLMs show that CRISP achieves state-of-the-art unlearning performance on safety-critical benchmarks while preserving fluency and benign knowledge. Feature-level analyses further demonstrate that CRISP disentangles and suppresses target features with high semantic precision, maintaining coherent text generation and minimal collateral forgetting.
Machine Unlearning. Prior methods remove specific knowledge from language models by modifying parameters or applying gradient-based updates that shift latent representations. Such global edits often harm related concepts and degrade general utility. CRISP instead performs feature-level unlearning, selectively suppressing relevant directions in the representation space for minimal disruption.
Sparse Autoencoders and Steering. Sparse autoencoders (SAEs) enable interpretable access to model features and have been used for inference-time steering of specific behaviors. However, steering does not alter model parameters, leaving underlying knowledge intact. CRISP leverages SAEs for automatic, context-aware suppression of harmful activations, achieving persistent and precise unlearning while preserving benign knowledge.
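To make the contrast concrete, here is a minimal sketch of inference-time SAE steering, assuming a standard ReLU SAE with separate encoder and decoder weights (all names below are illustrative, not CRISP's implementation): the selected features are clamped during the forward pass, so the model's weights remain unchanged and the knowledge is only masked.

```python
import torch

def sae_steer(hidden, W_enc, b_enc, W_dec, b_dec, feature_ids, clamp_value=0.0):
    """Inference-time steering sketch: clamp selected SAE features, then reconstruct.

    hidden:       residual-stream activations, shape (batch, seq, d_model)
    W_enc, W_dec: SAE encoder (d_model x d_sae) and decoder (d_sae x d_model) weights
    feature_ids:  indices of the SAE features to suppress
    """
    # Encode into the sparse feature basis (ReLU SAE assumed).
    feats = torch.relu(hidden @ W_enc + b_enc)      # (batch, seq, d_sae)
    # Clamp the targeted features; no model parameters are modified.
    feats[..., feature_ids] = clamp_value
    # Decode back to the residual stream and continue the forward pass.
    return feats @ W_dec + b_dec
```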
CRISP operates in two key phases: feature selection and model optimization.
1. Feature Selection: Using pre-trained SAEs, CRISP computes activation statistics over a target corpus (harmful knowledge) and a retain corpus (benign knowledge). It identifies salient features with high activation frequency and high relative activation ratios on the target set, filtering them by significance thresholds (see the scoring sketch after this list).
2. Model Optimization: CRISP fine-tunes the model with LoRA adapters to suppress the activations of the selected SAE features on the target corpus while preserving the original hidden representations on the retain set. The total loss combines three objectives: unlearning, retention, and coherence, balancing concept removal with fluency preservation (see the loss sketch after this list).
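A minimal sketch of the feature-selection step, assuming SAE feature activations have already been collected for both corpora (the thresholds and names below are illustrative placeholders, not values from the paper):

```python
import torch

def select_salient_features(target_acts, retain_acts,
                            act_threshold=0.0, freq_threshold=0.05, ratio_threshold=2.0):
    """Pick SAE features that fire often on the target (harmful) corpus
    but rarely on the retain (benign) corpus.

    target_acts, retain_acts: SAE feature activations, shape (num_tokens, d_sae)
    Returns the indices of the selected features.
    """
    # Activation frequency: fraction of tokens on which each feature fires.
    target_freq = (target_acts > act_threshold).float().mean(dim=0)   # (d_sae,)
    retain_freq = (retain_acts > act_threshold).float().mean(dim=0)

    # Relative activation ratio between target and retain corpora.
    eps = 1e-6
    ratio = target_freq / (retain_freq + eps)

    # Keep features that are both frequent on the target set and
    # substantially more active there than on the retain set.
    mask = (target_freq > freq_threshold) & (ratio > ratio_threshold)
    return torch.nonzero(mask, as_tuple=False).squeeze(-1)
```

Features passing both tests fire often on harmful text and rarely on benign text, which is what makes suppressing them precise.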
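Similarly, a hedged sketch of how the three objectives could be combined during LoRA fine-tuning (the individual terms and weights are illustrative stand-ins, not the paper's exact formulation):

```python
import torch
import torch.nn.functional as F

def crisp_loss(target_feats, retain_hidden, retain_hidden_orig,
               target_logits, target_labels, feature_ids,
               lambda_retain=1.0, lambda_coherence=1.0):
    """Illustrative combination of the unlearning, retention, and coherence terms.

    target_feats:         SAE feature activations on the target (harmful) batch
    retain_hidden:        current hidden states on the retain batch
    retain_hidden_orig:   frozen original model's hidden states on the retain batch
    target_logits/labels: LM outputs on the target batch, used for a fluency term
    """
    # Unlearning: push the selected features' activations toward zero.
    loss_unlearn = target_feats[..., feature_ids].pow(2).mean()

    # Retention: keep hidden representations close to the original model on benign data.
    loss_retain = F.mse_loss(retain_hidden, retain_hidden_orig)

    # Coherence: keep generation fluent on target-domain text
    # (here a plain language-modeling loss as a stand-in).
    loss_coherence = F.cross_entropy(
        target_logits.view(-1, target_logits.size(-1)),
        target_labels.view(-1),
    )

    return loss_unlearn + lambda_retain * loss_retain + lambda_coherence * loss_coherence
```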
We evaluate CRISP on the WMDP benchmark (Li et al., 2024), focusing on two domains: biosecurity (WMDP-Bio) and cybersecurity (WMDP-Cyber).
Experiments were conducted on Llama-3.1-8B and Gemma-2-2B models, using publicly available SAEs from LlamaScope and GemmaScope. Baselines include RMU and ELM. Evaluation covers unlearning accuracy, retention, MMLU performance, fluency, and concept coherence.
CRISP demonstrates superior performance in concept unlearning, achieving a balanced trade-off between unlearning efficacy and retention of general and domain-specific knowledge. It consistently outperforms prior methods while preserving fluency and coherence in generation.
CRISP’s feature analysis highlights its precision in targeting harmful concepts while preserving benign features. By focusing on salient SAE features, it disentangles harmful activations from benign ones, ensuring semantic stability across layers.