Add Bootstrap Confidence Intervals for Attack Success Rates #1577
base: main
Conversation
Signed-off-by: Patricia Pampanelli <ppampanelli@nvidia.com>
erickgalinkin
left a comment
Would be nice to find a better way to print this. I'm mostly confident that this methodology can work, though I had trouble writing a formal proof that this gives us a true 95% CI.
docs/source/reporting.rst
Outdated
During console output, attack success rates may include confidence intervals displayed as: ``(attack success rate: 45.23%) ± 2.15``.
The ± margin represents the 95% confidence interval half-width in percentage points.
Realistically, our + and - won't be evenly distributed. We almost universally have asymmetric CIs.
Absolutely, yes, they are already calculated asymmetrically. I'll correct how the CI's are displayed.
Done. Updated to bracketed format [lower%, upper%].
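A tiny sketch of what that bracketed display could look like (the variable names and exact wording are illustrative, not the PR's code):

```python
# Illustrative only: render an asymmetric CI as [lower%, upper%] instead of a symmetric ± margin
def format_asr(asr: float, ci_lower=None, ci_upper=None) -> str:
    text = f"attack success rate: {asr * 100.0:.2f}%"
    if ci_lower is not None and ci_upper is not None:
        text += f" [{ci_lower * 100.0:.2f}%, {ci_upper * 100.0:.2f}%]"
    return text

print(format_asr(0.4523, 0.4301, 0.4760))
# -> attack success rate: 45.23% [43.01%, 47.60%]
```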
garak/analyze/bootstrap_ci.py
Outdated
p_obs = resampled_results.mean()

# Apply Se/Sp correction to get true ASR
# TODO: propagate detector metric uncertainty (requires Se/Sp CIs in detector_metrics_summary.json)
<3
ci_text = (
    f" ± {(ci_upper - ci_lower) / 2:.2f}"
    if ci_lower is not None and ci_upper is not None
    else ""
)
Doesn't this assume a symmetric distribution? I understand there's some lossiness in printing it this way, but if failrate is, for example, 100%, we'd want something more like ci_lower <= failrate. It's hard to manage, and I'm not completely sure how to avoid saying something like "100% ± 10%".
would love to do this based on model of distribution of probe:detector scores acquired during calibration, thus ditching the frequently-untrue even assumption
@leondz I have a separate research branch where I try a totally different calculation. Working on checking how different my bounds (which are derived from a nonparametric test on an empirical CDF) are compared to these.
…ic ± format Signed-off-by: Patricia Pampanelli <ppampanelli@nvidia.com>
Co-authored-by: Erick Galinkin <erick.galinkin@gmail.com> Signed-off-by: Patricia Pampanelli <38949950+patriciapampanelli@users.noreply.github.com>
Signed-off-by: Patricia Pampanelli <ppampanelli@nvidia.com>
leondz
left a comment
Shaping up well. Few minor requests around non-duplication and configuration. Larger questions about where this code belongs and how to support CI calculation beyond Evaluator.
Status can be 0 (not sent to target), 1 (with target response but not evaluated), or 2 (with response and evaluation).
Eval-type entries are added after each probe/detector pair completes, and list the results used to compute the score.

Confidence Intervals (Optional)
What does the (Optional) refer to here?
garak/analyze/bootstrap_ci.py
Outdated
num_iterations: int = 10000,
confidence_level: float = 0.95,
these should be configurable, propose in core config under reporting
Fixed. Now reads from _config.reporting.
Thanks!
The intent with _config is for objects never to read from it directly, but instead from a config parameter passed at instantiation. I think adherence to this pattern might block directly accessing _config in these methods, and then the question is where the data comes from. One solution might be to have the instantiated Evaluator - which is configured with access to those parameters - pass these values to this function; or even to pass this function its own config object. Could that make sense?
also paging @jmartin-tech for opinion
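A sketch of the pattern being suggested, under the assumption that the Evaluator is handed its reporting values once at instantiation and then forwards them as plain arguments; the class shape, attribute names, and the bootstrap_asr_ci signature are placeholders, not garak's actual API:

```python
# defaults live in exactly one place
DEFAULT_BOOTSTRAP_ITERATIONS = 10000
DEFAULT_CONFIDENCE_LEVEL = 0.95

def bootstrap_asr_ci(results, num_iterations=DEFAULT_BOOTSTRAP_ITERATIONS,
                     confidence_level=DEFAULT_CONFIDENCE_LEVEL):
    """Analysis-side function: takes explicit parameters, never reads _config."""
    ...

class Evaluator:
    def __init__(self, reporting_config: dict):
        # configured once, at instantiation, from whatever config object the caller passes in
        self.ci_iterations = reporting_config.get("bootstrap_iterations", DEFAULT_BOOTSTRAP_ITERATIONS)
        self.ci_confidence = reporting_config.get("bootstrap_confidence_level", DEFAULT_CONFIDENCE_LEVEL)

    def evaluate(self, results):
        # the Evaluator, not the analysis function, is the thing that knows about config
        return bootstrap_asr_ci(results, num_iterations=self.ci_iterations,
                                confidence_level=self.ci_confidence)
```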
garak/analyze/bootstrap_ci.py
Outdated
num_iterations: int = 10000,
confidence_level: float = 0.95,
Should be configurable. Also, prefer defining defaults in just one place wherever possible
Fixed. Now reads from _config.reporting.
garak/evaluators/base.py
Outdated
se,
sp,
)
except Exception as e:
avoid catching Exception parent class, be specific
Fixed. Now catching specific ValueError.
garak/evaluators/base.py
Outdated
# Add CI fields if calculation succeeded
if ci_lower is not None and ci_upper is not None:
    eval_record["confidence"] = "0.95"
draw this 0.95 value from one central place
Fixed. Reading from _config.reporting.bootstrap_confidence_level.
In this implementation, CIs are calculated only during active garak runs, before results eval objects are logged
Disadvantages:
- There's no route to calculating CIs post-hoc (e.g. for older runs)
- There's no route to recalculating CIs with different config
- Failures during the non-trivial CI calc procedure abort the run
Would prefer to factor this out and have it run at report digest compilation time. On the other hand that fails the requirement to print CIs on the command line. That's tricky. Can we get both? Calling report_digest already recalculates a great deal - I wouldn't be averse to having a "rebuild_cis" flag for that when called as a CLI tool.
I'd still like a super-simple CI for the general case that ignores detector performance, clamped to 0.0-1.0. We can estimate a CI for cases where we don't have extensive detector perf information, and we can do it quickly.
Could be configured in core via e.g. reporting.confidence_interval_method with values:
- None - no confidence interval calc/display
- bootstrap - bootstrap only
- simple - simple only
- backoff - bootstrap where we can, simple in the gaps
backoff might be a bit much for this week, but some pattern like this is where I'd like this to go
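A minimal sketch of what the "simple" option and the method switch could look like, using a Wilson score interval as the quick, detector-agnostic estimate whose bounds never leave [0.0, 1.0]; the function names, the bootstrap_fn hook, and the dispatch logic are all illustrative, not the agreed design:

```python
from statistics import NormalDist
import math

def simple_asr_ci(successes: int, trials: int, confidence_level: float = 0.95):
    """Wilson score interval for a proportion; bounds stay within [0.0, 1.0] by construction."""
    if trials == 0:
        return None, None
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence_level) / 2.0)
    p = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p + z * z / (2.0 * trials)) / denom
    half = (z / denom) * math.sqrt(p * (1.0 - p) / trials + z * z / (4.0 * trials * trials))
    return max(0.0, centre - half), min(1.0, centre + half)

def asr_ci(results, method, bootstrap_fn=None, confidence_level=0.95):
    """Dispatch on a hypothetical reporting.confidence_interval_method value.
    bootstrap_fn stands in for the PR's detector-aware bootstrap routine."""
    if method is None:
        return None, None
    if method == "simple" or (method == "backoff" and bootstrap_fn is None):
        return simple_asr_ci(sum(results), len(results), confidence_level)
    if method in ("bootstrap", "backoff"):
        try:
            return bootstrap_fn(results, confidence_level=confidence_level)
        except ValueError:
            if method == "backoff":
                # bootstrap where we can, simple in the gaps
                return simple_asr_ci(sum(results), len(results), confidence_level)
            raise
    raise ValueError(f"unknown confidence_interval_method: {method}")
```

A Wilson-style interval would also address the earlier "100% ± 10%" concern: at an observed rate of 100% the upper bound is exactly 100% and only the lower bound moves.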
Co-authored-by: Leon Derczynski <leonderczynski@gmail.com> Signed-off-by: Patricia Pampanelli <38949950+patriciapampanelli@users.noreply.github.com>
Signed-off-by: Patricia Pampanelli <ppampanelli@nvidia.com>
Signed-off-by: Patricia Pampanelli <ppampanelli@nvidia.com>
Signed-off-by: Patricia Pampanelli <ppampanelli@nvidia.com>
… config with None/bootstrap options Signed-off-by: Patricia Pampanelli <ppampanelli@nvidia.com>
Summary
Adds 95% bootstrap CIs to attack success rates, accounting for sampling variance and detector imperfection via Rogan-Gladen correction.
Changes
- bootstrap_ci.py, detector_metrics.py - CI calculation with Se/Sp correction
- evaluators/base.py - CI integration into eval pipeline and output
- report_digest.py - CI propagation through reports

Methodology
The correction formula (Rogan-Gladen): P_true = (P_obs + Sp - 1) / (Se + Sp - 1)

- P_obs = observed failure rate in the resampled data
- Se = detector sensitivity (probability of detecting a true attack)
- Sp = detector specificity (probability of correctly passing a benign response)

Requires ≥30 evaluated outputs per probe-detector pair; falls back to a perfect detector (Se = Sp = 1.0) when detector metrics are unavailable.
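As a rough, self-contained illustration of the methodology above (a percentile bootstrap over per-attempt outcomes, with the Rogan-Gladen correction applied to each resample and the result clamped to [0, 1]); the function name and signature are mine, not necessarily what garak/analyze/bootstrap_ci.py exposes:

```python
import numpy as np

def bootstrap_asr_ci(results, se=1.0, sp=1.0, num_iterations=10000,
                     confidence_level=0.95, rng=None):
    """Percentile bootstrap CI for the attack success rate.

    results: array-like of 0/1 per-attempt outcomes (1 = detector flagged a hit)
    se, sp:  detector sensitivity / specificity; defaults assume a perfect detector
    """
    results = np.asarray(results, dtype=float)
    if results.size < 30:
        raise ValueError("need at least 30 evaluated outputs for a stable bootstrap CI")
    if se + sp - 1.0 <= 0.0:
        raise ValueError("Rogan-Gladen correction undefined when Se + Sp <= 1")

    rng = np.random.default_rng() if rng is None else rng
    corrected = np.empty(num_iterations)
    for i in range(num_iterations):
        resample = rng.choice(results, size=results.size, replace=True)
        p_obs = resample.mean()
        # Rogan-Gladen correction: estimate the true ASR from the observed rate
        p_true = (p_obs + sp - 1.0) / (se + sp - 1.0)
        corrected[i] = min(1.0, max(0.0, p_true))

    alpha = 1.0 - confidence_level
    ci_lower, ci_upper = np.percentile(corrected, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(ci_lower), float(ci_upper)
```

Percentile bounds of this shape are naturally asymmetric, which is why the bracketed [lower%, upper%] display adopted earlier in the review fits better than a symmetric ± margin.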
Statistical Limitations
Out of Scope