Our key methodological contribution is identifying inter-model agreement as a critical confound: when
models share correctness patterns, external probes can exploit peer-model correctness as a proxy, masking
genuinely privileged signals. Evaluating on disagreement subsets reveals that privileged knowledge is
domain-specific—it emerges consistently in factual tasks, while mathematical reasoning
correctness remains externally observable.
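The core of the disagreement-based evaluation can be sketched in a few lines. The simulation below is purely illustrative (the agreement rate, base accuracy, and all variable names are hypothetical, not taken from our experiments): when two models agree on most items, a "peer proxy" baseline that simply predicts the peer model's outcome looks strong on the full set, but on the disagreement subset its accuracy collapses to zero by construction, so any above-chance probe performance there must reflect a genuinely privileged signal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical correctness labels for two models on the same items:
# model B copies model A's outcome 85% of the time (high agreement).
n = 2000
correct_a = rng.random(n) < 0.7
copy_mask = rng.random(n) < 0.85
correct_b = np.where(copy_mask, correct_a, rng.random(n) < 0.7)

# Peer-proxy baseline for model A: predict model B's outcome.
peer_pred = correct_b
full_acc = (peer_pred == correct_a).mean()

# Disagreement subset: items where the two models' outcomes differ.
# Here the peer proxy is wrong on every item by definition.
disagree = correct_a != correct_b
disagree_acc = (peer_pred[disagree] == correct_a[disagree]).mean()

print(f"peer-proxy accuracy, full set:         {full_acc:.2f}")
print(f"peer-proxy accuracy, disagreement set: {disagree_acc:.2f}")
```

In practice a learned probe (e.g. on activations) replaces `peer_pred`, and the comparison of interest is its accuracy on `disagree` against the chance level, where the agreement shortcut is unavailable.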
These findings reconcile prior conflicting results: privileged knowledge does exist, but is
domain-specific and was previously masked by inter-model agreement. Beyond correctness prediction, our
disagreement-based methodology can be extended to study privileged knowledge in hybrid domains (coding,
commonsense reasoning) and other forms of model introspection. Practically, our results suggest that
model-specific activations carry signals that black-box tools miss, with potential applications in
hallucination detection and monitoring.