..

Redaction and Paraphrasing when Studying Collusion

Still thinking through this. Would love pushback.

Julian Stastny proposed finding a “Pareto frontier of redaction and paraphrasing (R&P) techniques” between two metrics:

  1. How hard is it to tell which model produced an output
  2. How well can you classify correctness

The main idea is to identify R&P techniques that make model recognition hard while preserving the ability to verify correctness. I think that this approach has a blind spot.

Julian’s motivation explicitly discusses scheming models, collusion, and monitors lying about dangerous actions. But his proposed methodology only measures two metrics on honest outputs. There’s a disconnect between the adversarial motivation and the non-adversarial evaluation. My concern is: will R&P techniques that are Pareto optimal on model distinguishability and honest correctness also work well in the adversarial setting that actually motivates this work?

During evaluation, when measuring your Pareto frontier, models behave honestly (Julian and Ryan Greenblatt discuss this in more depth here) and correctness classification works fine. Everything looks good. But the whole point of untrusted monitoring, as Julian explains in his motivation, is dealing with scheming models that actively try to subvert safety measures.

In production, scheming models could produce subtly incorrect solutions such as backdoors, security vulnerabilities or worse: logic that is difficult to follow. The more aggressive the R&P, the harder it becomes to verify these subtle attacks. This degradation in verification capability might not show up in your correctness classification metric because that metric is evaluated on honest outputs, not adversarial backdoors. For example, redacting function names might preserve apparent correctness while hiding that a function is called in a suspicious context.

You can measure correctness classification on evaluation data, but that’s measuring “can you spot obvious bugs” and not “can you spot clever backdoors after aggressive R&P has destroyed context?”

The question is: is the Pareto frontier you map on honest outputs predictive of performance in adversarial settings? Or could we end up selecting R&P techniques that excel on distinguishability and honest correctness, but fail when models are actually trying to collude?

I think that finding a Pareto frontier on model distinguishability and correctness classification is a valuable first step, but it doesn’t tell us what we need to know about whether R&P prevents collusion in adversarial settings. What if we extended the methodology to include a third metric: performance on adversarially constructed backdoors?

For this third metric, we would have models produce backdoored code, apply the same R&P techniques, and measure how well monitors catch these adversarial backdoors. This tests whether R&P techniques that excel on the first two metrics maintain their value when models are actually trying to collude.