CRISP: Persistent Concept Unlearning via Sparse Autoencoders

Technion – Israel Institute of Technology, Boston University, University of Zagreb

Abstract

As large language models (LLMs) are increasingly deployed in real-world applications, the need to selectively remove unwanted knowledge while preserving model utility has become paramount. We introduce CRISP, a parameter-efficient method for persistent concept unlearning via sparse autoencoders (SAEs). CRISP automatically identifies salient SAE features activated by harmful or sensitive knowledge and suppresses their activations through fine-tuning, ensuring permanent removal rather than inference-time control. Experiments on two open-weight LLMs show that CRISP achieves state-of-the-art unlearning performance on safety-critical benchmarks while preserving fluency and benign knowledge. Feature-level analyses further demonstrate that CRISP disentangles and suppresses target features with high semantic precision, maintaining coherent text generation and minimal collateral forgetting.

[Figure] Overview of CRISP: (1) Identify features frequently and strongly activated by the target corpus—but not by the benign corpus—using sparse autoencoders (SAEs). (2) Fine-tune the model to suppress these features on the target corpus while preserving their activations on benign data.

Related Work

Machine Unlearning. Prior methods remove specific knowledge from language models by directly modifying parameters or by gradient-based fine-tuning that shifts latent representations. Such global edits often harm related concepts and degrade general utility. CRISP instead performs feature-level unlearning, selectively suppressing the relevant directions in representation space for minimal collateral disruption.

Sparse Autoencoders and Steering. Sparse autoencoders (SAEs) enable interpretable access to model features and have been used for inference-time steering of specific behaviors. However, steering does not alter model parameters, leaving underlying knowledge intact. CRISP leverages SAEs for automatic, context-aware suppression of harmful activations, achieving persistent and precise unlearning while preserving benign knowledge.
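
To make the contrast concrete, inference-time steering typically edits the residual stream through a forward hook, and the effect disappears as soon as the hook is removed. The sketch below is illustrative only: it assumes a pre-trained SAE object exposing decoder weights `W_dec` and an `encode` method, and a hypothetical feature index; it is not CRISP's mechanism.

```python
def make_steering_hook(sae, feat_idx, strength=0.0):
    """Build a forward hook that ablates one SAE feature at inference time.

    The model's weights are untouched: removing the hook restores the
    original behavior, which is why steering is not persistent unlearning.
    """
    direction = sae.W_dec[feat_idx]              # decoder row = feature direction

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        acts = sae.encode(hidden)                # (batch, seq, n_features)
        coeff = acts[..., feat_idx].unsqueeze(-1)
        # strength=0 removes the feature's decoder contribution entirely;
        # negative values over-suppress it.
        hidden = hidden + (strength - 1.0) * coeff * direction
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    return hook

# handle = model.model.layers[14].register_forward_hook(make_steering_hook(sae, 1234))
# ... generate with the feature clamped ...
# handle.remove()  # the model reverts: nothing was unlearned
```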

Method

CRISP operates in two key phases: feature selection and model optimization.

1. Feature Selection: Using pre-trained SAEs, CRISP computes activation statistics over a target corpus (harmful knowledge) and a retain corpus (benign knowledge). It identifies salient features that activate frequently and strongly on the target set but not on the retain set, keeping only those whose activation frequency and relative activation ratio exceed significance thresholds (a selection sketch follows this list).

2. Model Optimization: CRISP fine-tunes the model using LoRA adapters to suppress the activations of the selected SAE features on the target corpus while preserving the original hidden representations on the retain set. The total loss combines three objectives: unlearning, retention, and coherence, balancing concept removal with fluency preservation (see the loss sketch after the selection code below).
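
A minimal sketch of the selection phase (step 1), assuming a Hugging Face-style causal LM with `output_hidden_states`, an SAE exposing `encode` and `W_dec`, and pre-tokenized batches; the thresholds `min_freq` and `min_ratio` are illustrative placeholders, not the paper's values.

```python
import torch

@torch.no_grad()
def activation_stats(model, sae, batches, layer):
    """Per-feature firing frequency and mean magnitude over a corpus."""
    n_feats = sae.W_dec.shape[0]
    freq = torch.zeros(n_feats)
    mag = torch.zeros(n_feats)
    n_tokens = 0
    for batch in batches:                             # batch: (B, T) token ids
        hidden = model(batch, output_hidden_states=True).hidden_states[layer]
        acts = sae.encode(hidden).flatten(0, 1)       # (B*T, n_features)
        freq += (acts > 0).float().sum(0).cpu()
        mag += acts.sum(0).cpu()
        n_tokens += acts.shape[0]
    return freq / n_tokens, mag / n_tokens

def select_features(model, sae, target_batches, retain_batches, layer,
                    min_freq=0.01, min_ratio=5.0, eps=1e-8):
    """Keep features that fire often on the target corpus but rarely on retain."""
    t_freq, t_mag = activation_stats(model, sae, target_batches, layer)
    _, r_mag = activation_stats(model, sae, retain_batches, layer)
    salient = (t_freq > min_freq) & (t_mag / (r_mag + eps) > min_ratio)
    return salient.nonzero(as_tuple=True)[0]          # feature indices to suppress
```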
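And a sketch of the optimization phase (step 2). It assumes LoRA adapters are already attached to `model` (e.g., via `peft`) and that a frozen reference copy `ref_model` is available; the exact loss forms and the weights `alpha` and `beta` are assumptions for illustration, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def crisp_loss(model, ref_model, sae, feat_idx, layer,
               target_batch, retain_batch, alpha=1.0, beta=1.0):
    """One training step's loss: unlearn + retain + coherence (weights assumed)."""
    # Unlearning: drive the selected SAE features toward zero on target data.
    h_target = model(target_batch, output_hidden_states=True).hidden_states[layer]
    acts = sae.encode(h_target)
    loss_unlearn = acts[..., feat_idx].pow(2).mean()

    # Retention: keep hidden states close to the frozen model on benign data.
    out_r = model(retain_batch, output_hidden_states=True)
    with torch.no_grad():
        ref_h = ref_model(retain_batch,
                          output_hidden_states=True).hidden_states[layer]
    loss_retain = F.mse_loss(out_r.hidden_states[layer], ref_h)

    # Coherence: next-token cross-entropy so generation stays fluent.
    logits = out_r.logits[:, :-1]
    labels = retain_batch[:, 1:]
    loss_coherence = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                     labels.reshape(-1))

    return loss_unlearn + alpha * loss_retain + beta * loss_coherence
```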

Experimental Setup

We evaluate CRISP on the WMDP benchmark (Li et al., 2024), focusing on two domains:

  • Biosecurity: Removal of expert-level virology knowledge while retaining general biology.
  • Cybersecurity: Removal of harmful cybersecurity instructions while retaining general computer science.

Experiments were conducted on Llama-3.1-8B and Gemma-2-2B models, using publicly available SAEs from LlamaScope and GemmaScope. Baselines include RMU and ELM. Evaluation covers unlearning accuracy, retention, MMLU performance, fluency, and concept coherence.
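
For reference, the GemmaScope SAEs use a JumpReLU activation, so a self-contained encoder/decoder takes only a few lines of PyTorch. Parameter names below follow the released checkpoints (`W_enc`, `b_enc`, `W_dec`, `b_dec`, `threshold`), but this is a schematic module, not a loader for the published weights.

```python
import torch
import torch.nn as nn

class JumpReLUSAE(nn.Module):
    """Minimal GemmaScope-style sparse autoencoder with a JumpReLU activation."""
    def __init__(self, d_model, n_features):
        super().__init__()
        self.W_enc = nn.Parameter(torch.zeros(d_model, n_features))
        self.b_enc = nn.Parameter(torch.zeros(n_features))
        self.W_dec = nn.Parameter(torch.zeros(n_features, d_model))
        self.b_dec = nn.Parameter(torch.zeros(d_model))
        self.threshold = nn.Parameter(torch.zeros(n_features))  # per-feature gate

    def encode(self, x):
        pre = x @ self.W_enc + self.b_enc
        # JumpReLU: keep a pre-activation only if it clears its learned threshold
        return pre * (pre > self.threshold)

    def decode(self, acts):
        return acts @ self.W_dec + self.b_dec
```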

Results

Across both domains and both models, CRISP achieves the best trade-off between unlearning efficacy and retention of general and domain-specific knowledge, consistently outperforming RMU and ELM while preserving fluency and coherence in generation.

[Figure] Trade-off performance on the WMDP-Bio dataset using Llama-3.1-8B.
[Figure] Trade-off performance on the WMDP-Bio dataset using Gemma-2-2B.

Feature Analysis

Feature-level analysis confirms CRISP's precision in targeting harmful concepts while preserving benign features: the selected SAE features separate harmful activations from benign ones, harmful features are strongly suppressed after unlearning, and benign features remain semantically stable across layers.

[Figure] Gemma-2-2B, layer 14: disentangled and suppressed harmful features after CRISP unlearning.
[Figure] Llama-3.1-8B, layer 24: disentangled and suppressed harmful features after CRISP unlearning.