Masked by Consensus: Disentangling Privileged Knowledge in LLM Correctness

ACL 2026

Technion – Israel Institute of Technology
IBM Research
Kempner Institute, Harvard University

Abstract

Humans use introspection to evaluate their understanding through private internal states inaccessible to external observers. We investigate whether large language models possess similar privileged knowledge about answer correctness—information unavailable through external observation. We train correctness classifiers on question representations from both a model's own hidden states and external models, testing whether self-representations provide a performance advantage. On standard evaluation, we find no advantage: self-probes perform comparably to peer-model probes. We hypothesize this is due to high inter-model agreement on answer correctness. To isolate genuine privileged knowledge, we evaluate on disagreement subsets, where models produce conflicting predictions. Here, we discover domain-specific privileged knowledge: self-representations consistently outperform peer representations in factual knowledge tasks, but show no advantage in math reasoning. We further localize this domain asymmetry across model layers, finding that the factual advantage emerges progressively from early-to-mid layers onward, consistent with model-specific memory retrieval, while math reasoning shows no consistent advantage at any depth.

Method Overview

We train correctness probes on a model's own hidden states (self-probe) and on external models' representations (external-probe), then measure the Premium Gap: the performance advantage of self over external probes. We evaluate across three decoder model families (Gemma-2-9B, Llama-3.1-8B, Qwen-2.5-7B), an embedding model (Qwen3-Embedding-8B), and five datasets spanning factual knowledge and mathematical reasoning. We additionally evaluate Qwen-3-32B for scalability analysis.

Overview of the probing framework
Figure 1: Questions are input to a target model and to external models, yielding representations. Probes trained on these representations predict answer correctness. We evaluate probe performance using mean AUC averaged over layers and define the Premium Gap as the performance advantage of self over external probes.
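The probing setup described above can be sketched end to end. The code below is a minimal illustration, not the authors' implementation: it uses synthetic vectors in place of real hidden states, trains a linear correctness probe on "self" and "external" representations, and computes the Premium Gap as the difference in held-out AUC.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic stand-ins for question representations; in practice these would be
# hidden states extracted from the target model (self) and an external model.
n_train, n_test, d = 800, 200, 64
X_self = rng.normal(size=(n_train + n_test, d))
X_ext = rng.normal(size=(n_train + n_test, d))
# Correctness labels, made weakly predictable from the "self" features only,
# to mimic a privileged signal.
y = (X_self[:, 0] + 0.5 * rng.normal(size=n_train + n_test) > 0).astype(int)

def probe_auc(X, y, n_train):
    """Train a linear correctness probe and return held-out AUC."""
    clf = LogisticRegression(max_iter=1000).fit(X[:n_train], y[:n_train])
    scores = clf.predict_proba(X[n_train:])[:, 1]
    return roc_auc_score(y[n_train:], scores)

auc_self = probe_auc(X_self, y, n_train)
auc_ext = probe_auc(X_ext, y, n_train)
premium_gap = auc_self - auc_ext
print(f"self={auc_self:.3f} external={auc_ext:.3f} gap={premium_gap:.3f}")
```

In the paper's setting, probes are trained per layer and AUC is averaged over layers; here a single probe per representation suffices to show the mechanics.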

Full Test Sets Reveal No Premium Gap

On full test sets, self-probes offer no advantage over external probes. In factual tasks, self-probes show only a small advantage over embedding-model probes, and are comparable to cross-model probes in two of three models. In mathematical reasoning, both embedding-model and cross-model probes match self-probe performance, yielding no premium gap. At first glance, this finding suggests that LLMs have no privileged knowledge about their own correctness.

Premium Gap bar chart
Figure 2: Mean AUC for correctness prediction on TriviaQA (factual) and MATH (reasoning). Bars compare Random, Embedding, and Best Cross-Model baselines to the Self-Probe across three target models. Semi-transparent overlays indicate the performance gain (or lack thereof) of Self relative to each baseline.

The Agreement Confound

Models agree on correctness for approximately 80% of questions in factual tasks and 75% in mathematical reasoning. This high consensus creates a critical problem: when models agree on the majority of examples, the external model's correctness becomes highly correlated with the target's, so an external probe can use its own model's correctness as a proxy, masking any genuine privileged signal. To break this confound, we evaluate on disagreement subsets (questions where models produce conflicting correctness labels), eliminating the external proxy.

Agreement vs Disagreement Rates
Figure 4: Agreement (blue) vs. disagreement (orange) rates across datasets, averaged over all model pairs. High agreement explains why external probes appear to match self-probes on standard evaluations.
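Constructing the disagreement subset is a straightforward filtering step. The sketch below uses hypothetical per-question correctness labels (synthesized so the two models agree roughly 80% of the time) and selects the questions with conflicting labels.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-question correctness labels for two models (1 = correct).
target_correct = rng.integers(0, 2, size=1000)
# Make the external model agree with the target ~80% of the time.
flip = rng.random(1000) < 0.2
external_correct = np.where(flip, 1 - target_correct, target_correct)

agree = target_correct == external_correct
agreement_rate = agree.mean()

# Disagreement subset: questions with conflicting correctness labels, where
# the external model's own correctness no longer mirrors the target's.
disagreement_idx = np.flatnonzero(~agree)

print(f"agreement rate: {agreement_rate:.2f}")
print(f"disagreement subset size: {disagreement_idx.size}")
```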

Emergence of Domain-Specific Privileged Knowledge

On disagreement subsets, a statistically significant premium gap (~5%) re-emerges in factual tasks across all models, with both linear and MLP probes. Self-probes consistently outperform external probes on disagreement subsets, indicating that when the external proxy fails, external representations cannot fully account for the target model's correctness. The target model retains unique internal signals that remain inaccessible to external observers. In sharp contrast, mathematical reasoning shows no premium gap: even on the disagreement subset, external model probes closely match or outperform self-probes across all targets.

Heatmap of correctness prediction differences
Figure 3: Target Model Correctness Prediction. Heatmap of correctness prediction differences across target models, datasets, and test subsets. Each cell reports the AUC difference (∆AUC = Self − Best External), with the percentage of the gap closed shown in parentheses. Left: Factual tasks show significant positive gaps on disagreement subsets (asterisks denote statistical significance). Right: Math tasks show no gap in either setting.

Where Does Privileged Knowledge Emerge?

We localize the premium gap across network depth by computing the per-layer premium gap (self-probe AUC − best external-probe AUC) on the disagreement subset at each individually probed layer.
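The per-layer analysis amounts to subtracting two AUC curves over depth. The sketch below uses illustrative (made-up) per-layer AUC values, computes the per-layer premium gap, and locates the first layer at which it turns positive on a normalized depth axis.

```python
import numpy as np

# Illustrative per-layer AUCs on the disagreement subset (one entry per layer),
# as would be collected by training a separate probe at each layer.
self_auc = np.array([0.52, 0.53, 0.55, 0.60, 0.64, 0.67, 0.69, 0.70])
best_external_auc = np.array([0.53, 0.53, 0.54, 0.56, 0.58, 0.60, 0.62, 0.63])

# Per-layer premium gap: self-probe AUC minus best external-probe AUC.
premium_gap = self_auc - best_external_auc

# Normalized depth of each layer, for comparing models with different depths.
depth = np.arange(len(premium_gap)) / (len(premium_gap) - 1)

# First layer where the gap is positive.
first_positive = int(np.argmax(premium_gap > 0))
print(f"gap turns positive at layer {first_positive} "
      f"(normalized depth {depth[first_positive]:.2f})")
```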

Factual: Progressive Emergence

The premium gap is near zero or slightly negative in early layers—which primarily encode surface-level and syntactic features—and grows progressively toward deeper layers. The gap becomes reliably positive from approximately layer 10–15 onward (normalized depth ~0.25–0.40), consistent with the view that the privileged signal reflects idiosyncratic memory retrieval states that build up through the forward pass.

Per-layer premium gap for factual datasets
Figure 5a: Factual Knowledge. The premium gap is near zero in early layers and grows progressively toward deeper layers across all models, indicating that privileged knowledge emerges in early-to-mid representations.

Math: No Consistent Advantage

For MATH, the premium gap fluctuates near zero across all layers and models with no systematic trend. For GSM1K, the gap is predominantly negative: external probes outperform self-probes at most depths. Mathematical correctness signals appear to be publicly accessible at every depth of the network, suggesting that reasoning difficulty is governed by problem structure rather than model-specific knowledge.

Per-layer premium gap for mathematical reasoning
Figure 5b: Mathematical Reasoning. MATH fluctuates near zero at all depths and GSM1K is predominantly negative. No layer exhibits a consistent self-probe advantage across the full network depth.

Conclusion

Our key methodological contribution is identifying inter-model agreement as a critical confound: when models share correctness patterns, external probes exploit peer correctness as a proxy, masking genuine privileged signals. Evaluating on disagreement subsets reveals that privileged knowledge is domain-specific—it emerges consistently in factual tasks, while mathematical reasoning correctness remains externally observable.

These findings reconcile prior conflicting results: privileged knowledge does exist, but is domain-specific and was previously masked by inter-model agreement. Beyond correctness prediction, our disagreement-based methodology can be extended to study privileged knowledge in hybrid domains (coding, commonsense reasoning) and other forms of model introspection. Practically, our results suggest that model-specific activations carry signals that black-box tools miss, with potential applications in hallucination detection and monitoring.

BibTeX

@article{ashuach2026masked,
  title={Masked by Consensus: Disentangling Privileged Knowledge in LLM Correctness},
  author={Ashuach, Tomer and Ein-Dor, Liat and Gretz, Shai and Katz, Yoav and Belinkov, Yonatan},
  journal={arXiv preprint arXiv:2604.12373},
  year={2026}
}