Conversation

@patriciapampanelli (Collaborator)

Summary

Adds 95% bootstrap confidence intervals (CIs) to attack success rates, accounting for sampling variance and for detector imperfection via the Rogan-Gladen correction.

Changes

  • New: bootstrap_ci.py, detector_metrics.py - CI calculation with Se/Sp correction
  • Modified: evaluators/base.py - CI integration into eval pipeline and output
  • Modified: report_digest.py - CI propagation through reports

Methodology

  1. Resampling: Draws 10,000 bootstrap samples from the binary pass/fail results (with replacement)
  2. Correction: Adjusts each sample's observed rate using the Rogan-Gladen formula to account for detector error
  3. Interval extraction: Takes the 2.5th and 97.5th percentiles as CI bounds

The correction formula:

P_true = (P_obs + Sp - 1) / (Se + Sp - 1)
  • P_obs = observed failure rate in the resampled data
  • Se = detector sensitivity (probability of detecting a true attack)
  • Sp = detector specificity (probability of correctly passing a benign response)

Requires ≥30 evaluated outputs per probe-detector pair; falls back to a perfect detector (Se = Sp = 1.0) when detector metrics are unavailable.
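For concreteness, a minimal sketch of the resample-correct-extract procedure described above (illustrative only: the function and variable names are not the actual bootstrap_ci.py API, and the clamp to [0, 1] is an assumption, not something stated in this PR):

import numpy as np

def bootstrap_asr_ci(results, se=1.0, sp=1.0, num_iterations=10000, confidence_level=0.95):
    """Sketch: bootstrap CI for attack success rate with Rogan-Gladen correction."""
    results = np.asarray(results)  # binary pass/fail outcomes
    rng = np.random.default_rng()
    n = len(results)
    corrected = np.empty(num_iterations)
    for i in range(num_iterations):
        # 1. Resample with replacement and take the observed rate
        p_obs = rng.choice(results, size=n, replace=True).mean()
        # 2. Rogan-Gladen correction: P_true = (P_obs + Sp - 1) / (Se + Sp - 1)
        p_true = (p_obs + sp - 1.0) / (se + sp - 1.0)
        corrected[i] = min(max(p_true, 0.0), 1.0)  # clamp to [0, 1] (assumption)
    # 3. Percentile interval: 2.5th and 97.5th percentiles for a 95% CI
    alpha = 1.0 - confidence_level
    return tuple(np.percentile(corrected, [100 * alpha / 2, 100 * (1 - alpha / 2)]))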

Statistical Limitations

  • Se/Sp are treated as fixed (no propagation of detector-metric uncertainty)
  • Uses detector-level metrics only (not probe-specific); detector performance (Se/Sp) can vary by probe

Out of Scope

  • Probe-specific Se/Sp lookup

Signed-off-by: Patricia Pampanelli <ppampanelli@nvidia.com>
@erickgalinkin (Collaborator) left a comment

Would be nice to find a better way to print this. I'm mostly confident that this methodology can work, though I had trouble writing a formal proof that this gives us a true 95% CI.

Comment on lines 42 to 43
During console output, attack success rates may include confidence intervals displayed as: ``(attack success rate: 45.23%) ± 2.15``.
The ± margin represents the 95% confidence interval half-width in percentage points.
Collaborator

Realistically, our + and - won't be evenly distributed. We almost universally have asymmetric CIs.

Collaborator Author

Absolutely, yes, they are already calculated asymmetrically. I'll correct how the CIs are displayed.

Collaborator Author

Done. Updated to bracketed format [lower%, upper%].
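For illustration, the bracketed rendering might look roughly like this (a sketch; it assumes ci_lower and ci_upper are already expressed in percentage points, as in the ± snippet further down):

# Render the asymmetric CI as explicit bounds instead of a symmetric ± margin
ci_text = (
    f" [{ci_lower:.2f}%, {ci_upper:.2f}%]"
    if ci_lower is not None and ci_upper is not None
    else ""
)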

p_obs = resampled_results.mean()

# Apply Se/Sp correction to get true ASR
# TODO: propagate detector metric uncertainty (requires Se/Sp CIs in detector_metrics_summary.json)
Collaborator

<3

Comment on lines 254 to 258
ci_text = (
    f" ± {(ci_upper - ci_lower) / 2:.2f}"
    if ci_lower is not None and ci_upper is not None
    else ""
)
Collaborator

Doesn't this assume a symmetric distribution? I understand there's some lossiness in printing it this way, but I'd think that if failrate is, for example, 100%, we'd want something more like ci_lower <= failrate. Hard to manage, but I'm not completely sure how to avoid saying something like "100% ± 10%".

Collaborator

would love to do this based on a model of the distribution of probe:detector scores acquired during calibration, thus ditching the frequently-untrue symmetry assumption

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@leondz I have a separate research branch where I try a totally different calculation. Working on checking how different my bounds (which are derived from a nonparametric test on an empirical CDF) are compared to these.

patriciapampanelli and others added 3 commits January 27, 2026 13:07
…ic ± format

Signed-off-by: Patricia Pampanelli <ppampanelli@nvidia.com>
Co-authored-by: Erick Galinkin <erick.galinkin@gmail.com>
Signed-off-by: Patricia Pampanelli <38949950+patriciapampanelli@users.noreply.github.com>
Signed-off-by: Patricia Pampanelli <ppampanelli@nvidia.com>
@leondz (Collaborator) left a comment

Shaping up well. A few minor requests around non-duplication and configuration, and some larger questions about where this code belongs and how to support CI calculation beyond Evaluator.

Status can be 0 (not sent to target), 1 (with target response but not evaluated), or 2 (with response and evaluation).
Eval-type entries are added after each probe/detector pair completes, and list the results used to compute the score.

Confidence Intervals (Optional)
Collaborator

What does the (Optional) refer to here?

Comment on lines 16 to 17
num_iterations: int = 10000,
confidence_level: float = 0.95,
Collaborator

these should be configurable; propose putting them in core config under reporting

Collaborator Author

Fixed. Now reads from _config.reporting.

Collaborator

Thanks!

The intent with _config is for objects to never read from it directly, but instead from a config parameter passed at instantiation. I think adherence to this pattern might block directly accessing _config in these methods, and then the question is where the data comes from. One solution might be to have the instantiated Evaluator - which is configured with access to those parameters - pass these values to this function; or even to pass this function its own config object. Could that make sense?

also paging @jmartin-tech for opinion
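For illustration, a rough sketch of the pattern described above - the Evaluator resolves the values from its own config at instantiation and passes them down, so the CI helper never touches the global _config. All names here are hypothetical except bootstrap_confidence_level, which is mentioned elsewhere in this thread:

def calculate_bootstrap_ci(results, se, sp, num_iterations, confidence_level):
    ...  # resample, apply Rogan-Gladen correction, take percentiles


class Evaluator:
    def __init__(self, config_root):
        # hypothetical keys, resolved once at instantiation from the reporting config
        self.num_iterations = config_root.reporting.bootstrap_iterations
        self.confidence_level = config_root.reporting.bootstrap_confidence_level

    def evaluate(self, results, se, sp):
        return calculate_bootstrap_ci(
            results, se, sp,
            num_iterations=self.num_iterations,
            confidence_level=self.confidence_level,
        )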

Comment on lines 89 to 90
num_iterations: int = 10000,
confidence_level: float = 0.95,
Collaborator

Should be configurable. Also, prefer defining defaults in just one place wherever possible

Collaborator Author

Fixed. Now reads from _config.reporting.

se,
sp,
)
except Exception as e:
Collaborator

avoid catching the Exception parent class; be specific

Collaborator Author

Fixed. Now catching specific ValueError.


# Add CI fields if calculation succeeded
if ci_lower is not None and ci_upper is not None:
    eval_record["confidence"] = "0.95"
Collaborator

draw this 0.95 value from one central place

Collaborator Author

Fixed. Reading from _config.reporting.bootstrap_confidence_level.


Collaborator

In this implementation, CIs are calculated only during active garak runs, before the eval result objects are logged.

Disadvantages:

  1. There's no route to calculating CIs post-hoc (e.g. for older runs)
  2. There's no route to recalculating CIs with different config
  3. Failures during the non-trivial CI calc procedure abort the run

Would prefer to factor this out and have it run at report digest compilation time. On the other hand, that fails the requirement to print CIs on the command line. That's tricky - can we get both? Calling report_digest already recalculates a great deal, so I wouldn't be averse to a "rebuild_cis" flag for when it's called as a CLI tool.

Collaborator

I'd still like a super-simple CI for the general case that ignores detector performance, clamped to 0.0-1.0. We can estimate a CI for cases where we don't have extensive detector perf information, and we can do it quickly.

Could be configured in core via e.g. reporting.confidence_interval_method with values:

  • None - no confidence interval calc/display
  • bootstrap - bootstrap only
  • simple - simple only
  • backoff - bootstrap where we can, simple in the gaps

backoff might be a bit much for this week, but some pattern like this is where I'd like this to go
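For reference, the simple case could be as small as a normal-approximation (Wald) interval on the raw pass/fail rate, ignoring detector Se/Sp and clamped to [0, 1]. A sketch only, not the proposed implementation; the function name is made up:

import math

def simple_asr_ci(results, confidence_level=0.95):
    """Wald interval on the observed rate, clamped to [0, 1]; ignores detector Se/Sp."""
    z = {0.90: 1.645, 0.95: 1.96, 0.99: 2.576}[confidence_level]
    n = len(results)
    p = sum(results) / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)

One caveat with this form: at rates of exactly 0% or 100% the Wald half-width collapses to zero, so something like a Wilson score interval may be a better fit for the simple path.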

@leondz added the reporting label (Reporting, analysis, and other per-run result functions) on Jan 28, 2026
patriciapampanelli and others added 7 commits January 28, 2026 12:45
Co-authored-by: Leon Derczynski <leonderczynski@gmail.com>
Signed-off-by: Patricia Pampanelli <38949950+patriciapampanelli@users.noreply.github.com>
Signed-off-by: Patricia Pampanelli <ppampanelli@nvidia.com>
Signed-off-by: Patricia Pampanelli <ppampanelli@nvidia.com>
Signed-off-by: Patricia Pampanelli <ppampanelli@nvidia.com>
… config with None/bootstrap options

Signed-off-by: Patricia Pampanelli <ppampanelli@nvidia.com>