chadmarkey/aedt-fairness-audit

AEDT Fairness Audit Toolkit

Fairness measurement library for automated employment decision tools.

In plain language (the upshot)

This is software for measuring fairness in the AI tools that screen job applications.

Some companies sell AI that reads documents (résumés, personal statements, recommendation letters) and ranks candidates for human reviewers. A 2025 U.S. patent (No. 12,265,502 B1) describes one such system in detail, including how it scores the personal statement and how it tries to remove bias.

This toolkit does two things:

  • Implements the patent's bias-removal step and personal-statement scoring step in open code that anyone can read and run.
  • Provides command-line audits that test whether those implementations treat demographic groups equally on synthetic test data.

The legal benchmark used throughout is the U.S. EEOC's four-fifths rule: a selection process is flagged for possible adverse impact when one group is selected at less than 80% of the rate of another (a selection-rate ratio below 0.80, or above 1.25 in the reverse direction).
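
As a minimal sketch of the four-fifths arithmetic described above (not the toolkit's own code; the function names and the applicant counts are hypothetical), the ratio and flag can be computed like this:

```python
def selection_rate(selected: int, total: int) -> float:
    """Fraction of a group's applicants who were selected."""
    if total <= 0:
        raise ValueError("total must be positive")
    return selected / total


def four_fifths_check(rate_a: float, rate_b: float) -> tuple[float, bool]:
    """Return (rate_a / rate_b, flagged).

    flagged is True when the ratio falls outside the [0.80, 1.25] band.
    """
    if rate_b <= 0:
        raise ValueError("reference selection rate must be positive")
    ratio = rate_a / rate_b
    return ratio, not (0.80 <= ratio <= 1.25)


# Hypothetical counts: 18 of 100 selected in group A, 30 of 100 in group B.
ratio, flagged = four_fifths_check(selection_rate(18, 100), selection_rate(30, 100))
print(f"ratio={ratio:.2f} flagged={flagged}")
```

A flag here only marks a ratio for further review; it says nothing about statistical significance, which the audits report separately.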

What happens when you run it

These are example numbers from a synthetic corpus of 384 personal statements (PS) covering 24 demographic combinations, scored with two LLM families (gpt-4o-mini and claude-haiku-4-5). Specific values vary across runs.

The substantive claim, in one sentence: under LLM-based personal-statement extraction, this synthetic audit repeatedly finds a borderline school-tier disparity on the "academic career" question; that signal survives several prompt and model-family sensitivity checks, but it does not survive multiple-comparisons correction and does not hold under SBERT extraction. It is not a finding about any deployed AEDT.

The bullets below unpack that:

  • No per-cell finding survives multiple-comparisons correction. Audit 2 reports 30 hypothesis tests across the two scorers (15 per scorer). Under Bonferroni correction at family-wise α = 0.05, the per-test threshold is 0.05 / 30 ≈ 0.0017; Benjamini–Hochberg at q = 0.05 applies that same threshold to the smallest p-value. The smallest observed p-value is 0.052, which is above even the uncorrected 0.05 level, so no cell clears under either procedure.
  • A borderline school_tier signal is shared across both LLM families and both LLM corpus generators. Under both gpt-4o-mini and claude-haiku-4-5 as scorers, on a corpus generated by either family, academic_career × school_tier produces a selection-rate ratio of about 0.65–0.75 at uncorrected p ≈ 0.06–0.13. Top-20 school applicants are selected at a higher rate than lower-resource school applicants on this question. This is the only LLM-extractor cell that holds direction across the 2 × 2 (generator × scorer) cross-family check. It does not survive multiple-comparisons correction.
  • The signal does not appear under SBERT. A 2026-05-06 fresh-corpus reproduction confirmed that the canonical SBERT result (DI = 1.242, p = 0.18 in the opposite direction) was corpus-draw noise rather than a stable opposite-direction signal: on a fresh corpus, SBERT's academic_career × school_tier comes in at DI = 1.013, p = 1.000. SBERT registers no replicating signal on either corpus draw; the LLM extractor's signal is the only one that reproduces.
  • The patent's specified Claim-1 mitigation does not move the per-question school_tier signal. Under a paired-permutation test (10,000 reps) of the marker-stripping intervention, academic_career × school_tier shifts from 0.650 to 0.673, paired-perm p = 0.65. The aggregate _total × school_tier shifts from 0.673 to 0.807, paired-perm p = 0.015 — the aggregate moves toward parity at conventional significance, the per-question driver does not.
  • The school_tier signal survives a content-neutral prompt rewrite. A token-frequency diagnostic showed that under the original prompt, top_20 PSs contained 1.4× to 3.6× more academic-register tokens than lower_tier PSs in non-academic seeds. An opt-in --prompt-variant content_neutral mode rewrites the corpus prompt to remove the "research lab / faculty mentor" content cue for top_20 and to forbid academic-register variation across school tiers. Under that rewrite, token-density does flatten across tiers (basically equal in non-academic seeds). But the audit signal persists: academic_career × school_tier DI = 0.673, p = 0.065, essentially identical to the original 0.650, p = 0.059. The LLM extractor is reading something other than literal academic-register token density. The most plausible remaining mechanism is school-name associations: top_20 PSs still name "Harvard," "Stanford," "Hopkins," "UCSF," and the LLM extractor's training data likely associates those names with academic content. A controlled school-name substitution test is left as a follow-on.
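
The multiple-comparisons arithmetic in the first bullet can be sketched as follows. This is an illustrative standalone implementation of the two standard procedures, not the toolkit's code; the p-value list is hypothetical apart from the smallest value (0.052) and the test count (30), which come from the text above.

```python
def bonferroni_threshold(alpha: float, m: int) -> float:
    """Per-test threshold that keeps the family-wise error rate at alpha."""
    return alpha / m


def benjamini_hochberg(p_values: list[float], q: float = 0.05) -> list[bool]:
    """Benjamini-Hochberg step-up procedure; True marks a rejected test."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value satisfies p_(k) <= (k / m) * q.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            max_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


# 30 tests; smallest p-value 0.052 as reported, the rest are placeholders.
p_values = [0.052] + [0.20] * 29
print(f"Bonferroni threshold: {bonferroni_threshold(0.05, 30):.4f}")
print(f"Any BH rejection: {any(benjamini_hochberg(p_values))}")
```

Because the largest Benjamini-Hochberg threshold is q itself (at rank m), a smallest p-value above 0.05 cannot be rejected at q = 0.05 regardless of the other values.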

A discrete-statistic / tie-breaking pitfall was discovered and fixed during a 2026-05-06 validation pass. Earlier versions of the audit used numpy's stable sort for top-K selection, which preserves original row order at ties. With near-discrete LLM scores, this introduced a corpus-row-order bias that produced spurious per-cell DI findings on the race axis under gpt-4o-mini scoring. Those findings were artifacts. The current top_k_selection uses seeded random tie-breaking to eliminate the bias. See CHANGELOG.md for the discovery and fix; the methodology section of RESULTS.md documents the issue for any future AEDT auditor working with near-discrete LLM outputs.
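
To illustrate the pitfall, here is a minimal pure-Python sketch contrasting stable-sort selection with seeded random tie-breaking. The function names are hypothetical and this is not the toolkit's top_k_selection; it only shows why row order leaks into the result when scores tie.

```python
import random
from collections import Counter


def top_k_stable(scores: list[float], k: int) -> list[int]:
    """Top-k row indices; ties keep original row order (the pitfall)."""
    # Python's sorted() is stable, so equal scores stay in input order.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return sorted(order[:k])


def top_k_random_ties(scores: list[float], k: int, seed: int = 0) -> list[int]:
    """Top-k row indices; ties are broken by a seeded random key."""
    rng = random.Random(seed)
    keyed = [(-score, rng.random(), i) for i, score in enumerate(scores)]
    keyed.sort()  # sorts by score first, then by the random key
    return sorted(i for _, _, i in keyed[:k])


# Ten applicants with identical (fully tied) scores, three slots.
tied = [1.0] * 10
print("stable:", top_k_stable(tied, 3))  # always the first three rows

# Across many seeds, random tie-breaking spreads selection over all rows.
counts = Counter(i for seed in range(200) for i in top_k_random_ties(tied, 3, seed))
print("rows ever selected with random ties:", sorted(counts))
```

If the corpus file happens to be ordered by a demographic axis, the stable variant systematically favors whichever group comes first, which is the kind of row-order bias described above.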

The screening-simulation tool, run on illustrative sentiment anchorings, produced selection-rate ratios outside the four-fifths range under a fitted logistic-regression screening model. Three additional scoring rules in the same simulation (linear, patent §530 power-of-2 aggregation, power-of-3 aggregation) apply the data-generating betas at inference rather than fitting a model; their ratios sit in the same band as the fitted model and are useful for showing that monotone aggregation transforms do not by themselves close a sentiment-driven gap. When narrative sentiment alone was held constant across groups in a sentiment-only counterfactual — all other features held constant, classifier held constant — the ratios returned to within sampling noise of parity in every cell.

The technical results, with confidence intervals and methodology, are in RESULTS.md. The audit code is in tools/. The plotting code is in plots/. Everything is reproducible from the command line.

Scope and claims

This repository is a measurement and stress-test framework, not a reverse-engineered production system or an authoritative finding about any specific deployed AEDT.

  • The audits run on synthetic data, not real applicant text. Results are illustrative and sensitivity-based, not population-level claims about real-world hiring or screening outcomes.
  • The toolkit does not include an AEDT pipeline implementation. Users supply their own scoring function via the pipeline_fn interface. Different reasonable implementations of the patent's architecture will produce different absolute selection-rate ratios.
  • Findings demonstrate that adverse-impact signals can emerge under plausible configurations of the patent's specified architecture on synthetic test data. They do not, and cannot, prove that any specific deployed AEDT (including any product made by the patent's assignee) implements the architecture in the same way the toolkit does, uses the same parameters, or produces the same outputs.
  • The U.S. EEOC's four-fifths rule (29 C.F.R. § 1607) is a regulatory screening heuristic for adverse impact, used by agencies and compliance teams to flag selection processes that warrant further review. It is not, on its own, a determination of unlawful discrimination; that determination requires additional evidence, regulatory process, and adjudication outside the scope of this toolkit.

For a fuller account of methodological boundaries and the prospective follow-on tests that would strengthen or further narrow the surviving claim, see Limitations and prospective follow-ons at the end of this file.


Implements:

  • AIF360-style fairness metrics: disparate impact, statistical parity, threshold sweep, calibration, counterfactual flips, bootstrap CIs
  • The bias mitigation operation specified by Claim 1 of U.S. Patent No. 12,265,502 B1: input-side detection-and-replacement of biasing identifiers
  • The four Personal Statement questions enumerated at column 10 of U.S. Patent No. 12,265,502 B1, with both an SBERT cosine-similarity extractor and an LLM question-answering extractor
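
For readers unfamiliar with the first bullet's metrics, the following is a rough standalone sketch of disparate impact, statistical parity difference, and a percentile bootstrap confidence interval. It uses only the standard library; the function names and the 0/1 selection data are hypothetical and do not reflect the audit/ module's actual API.

```python
import random


def selection_rate(selected: list[int]) -> float:
    """Mean of a 0/1 selection indicator list."""
    return sum(selected) / len(selected)


def disparate_impact(group_a: list[int], group_b: list[int]) -> float:
    """Ratio of group A's selection rate to group B's."""
    rate_b = selection_rate(group_b)
    if rate_b == 0:
        raise ValueError("reference group has no selections")
    return selection_rate(group_a) / rate_b


def statistical_parity_difference(group_a: list[int], group_b: list[int]) -> float:
    """Difference between group A's and group B's selection rates."""
    return selection_rate(group_a) - selection_rate(group_b)


def bootstrap_di_ci(group_a, group_b, reps=1000, level=0.95, seed=0):
    """Percentile bootstrap CI for disparate impact (resampling within group)."""
    rng = random.Random(seed)
    stats = []
    for _ in range(reps):
        resampled_a = rng.choices(group_a, k=len(group_a))
        resampled_b = rng.choices(group_b, k=len(group_b))
        if sum(resampled_b) == 0:
            continue  # ratio undefined for this resample; skip it
        stats.append(disparate_impact(resampled_a, resampled_b))
    stats.sort()
    lower = stats[int((1 - level) / 2 * len(stats))]
    upper = stats[int((1 + level) / 2 * len(stats)) - 1]
    return lower, upper


# Hypothetical outcomes: 18/100 selected in group A, 30/100 in group B.
group_a = [1] * 18 + [0] * 82
group_b = [1] * 30 + [0] * 70
di = disparate_impact(group_a, group_b)
spd = statistical_parity_difference(group_a, group_b)
ci_low, ci_high = bootstrap_di_ci(group_a, group_b)
print(f"DI={di:.3f} SPD={spd:.3f} 95% CI=({ci_low:.3f}, {ci_high:.3f})")
```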

The library does not include an AEDT pipeline implementation. Users supply their own scoring function via the pipeline_fn interface. See PIPELINE_BUILD_GUIDE.md for how to build a pipeline replicating the architecture disclosed in U.S. Patent No. 12,265,502 B1 from off-the-shelf libraries.

Contributions and critiques are welcome via Issues and pull requests. A running record of substantive methodology revisions made after the initial public release lives in CHANGELOG.md, including a 2026-05-05 cross-family robustness check that narrowed the audit's substantive claim and is described in full in RESULTS.md.

Why this is public

This repo is public because there was no other way to do it. The patent's current assignee was asked, via Data Subject Access Request under New Hampshire's Privacy Act (RSA 507-H), for the data and processing details that would let an outside auditor verify how the pipeline runs in practice. The request was declined at the 45-day statutory deadline. The grounds were jurisdictional: RSA 507-H sets volume thresholds for which companies the law reaches, and the assignee argued that too few of New Hampshire's 1.4 million residents use their products to clear those thresholds. So the privacy law that should have governed the request did not apply. I appealed under §507-H:4(IV) on May 4, 2026. That appeal is its own record now.

What was left was to rebuild the patent's specified pipeline from public components and audit it. Patent text (U.S. Patent No. 12,265,502 B1), off-the-shelf libraries, synthetic test data. That is the toolkit.

This is an audit, not an indictment. The toolkit measures whether the patent's specified architecture produces equal outcomes on synthetic data. It does not claim that any specific deployed AEDT — including any product made by the patent's current assignee — implements that architecture in the same way, uses the same parameters, or produces the same outcomes. Inference from toolkit results to deployed products requires independent evidence about those products.

The repo is fully public and the commit history is intact. Nothing has been squashed; nothing has been force-pushed; every methodology revision is documented in CHANGELOG.md, including the corrections that have narrowed initial findings:

  • A stable-sort tie-break bug discovered on 2026-05-06 dissolved a "vendor-dependent race-axis disparity" headline that turned out to be an artifact. The bug, the fix, and the dissolved findings are all documented and remain reproducible from prior commits.
  • A cross-family scoring check disclosed on 2026-05-05 in response to peer review showed per-cell race-axis findings did not transfer across LLM scoring families. Framing was narrowed.
  • A cross-generator validation check confirmed the surviving school_tier signal direction-replicates across two LLM generator families.
  • A close audit of the corpus generator prompt found the prompt itself encodes school-tier-correlated content; an opt-in content-neutral prompt variant ships as a sensitivity test.
  • An SBERT-vs-LLM cross-extractor check showed the two extractor architectures disagree on direction on the surviving cell. The surviving claim is bounded to LLM-extractor architectures specifically.

Each of these refinements happened because someone pushed back — public critique on Reddit, private code review and methodology advice from independent reviewers reaching out off-channel, or a closer look at the code itself. That is what this kind of audit is for. Real-time methodology audit, in public, with corrections visible — that is the lens this work has needed all along, and it has been rare in AEDT validation generally. The fact that initial findings were refined under that scrutiny is the point of releasing the toolkit, not a weakness.

Critiques and replication failures are welcome. File them as GitHub Issues. The substantive claims of this audit have narrowed several times and will likely narrow again. The work gets better when the methodology iterates in public.

Provenance and prior inquiry

The toolkit's architecture-class framing is the result of trying the cooperative path first.

A Data Subject Access Request was submitted to the patent's current assignee under New Hampshire's Privacy Act (RSA 507-H), seeking disclosure of how the requester's residency-application data had been processed through the assignee's platform. Across three rounds of correspondence, a representative of the patent assignee made two substantive representations worth quoting verbatim:

"The programs you applied to did not use Medicratic/Halsted. We have reviewed our records, and none of the residency programs to which you submitted applications in the 2025–2026 cycle had active or participating Medicratic/Halsted accounts. ... Your application was not uploaded into or processed by Halsted at any point, nor will it be in the future."

"While programs you applied to may have used [the assignee]'s Cortex product, the patent features you cited are not incorporated into Cortex. The specific methods described in U.S. Patent No. 12,265,502 B1—e.g. the multi-stage scoring pipeline, attribute indicator generation, sentiment scoring of MSPEs, and comparative letter of recommendation ranking—are not part of Cortex. Cortex currently incorporates only two AI features ... (a) A transcript normalization tool ... that extracts and normalizes grades from uploaded transcripts; and (b) An Academic Interest Badge ..., which analyzes an applicant's personal statement to assess interest in an academic medicine career, with the output being binary..."

These representations stand alongside two pieces of public-record documentation that bear on the same question.

The patent assignee's own published integration roadmap. From the assignee's published FAQ page on the Medicratic acquisition and integration roadmap (publicly accessible from the assignee's public website; accessed 2026-05-06; local archive and Wayback Machine snapshot preserved), under the heading "Product Vision and Roadmap":

Q: What are the integration plans and timelines?

A: "We are currently assessing technical pathways to integrate key Medicratic features into [the assignee's] Platform. Our goal is to offer some key features for the 2026 ERAS residency recruitment cycle, beginning September 2025. We are targeting a fully unified experience by the 2027 ERAS season (starting July 2026) for all ERAS-participating residency and fellowship programs."

A third-party peer-reviewed account. A March 2026 JAMA Viewpoint (Bachina et al., 2026, doi:10.1001/jama.2026.1993) reports that the patent's current assignee acquired Medicratic, the company behind the patent, in July 2025, that the Academic Interest Badge deployed in Cortex is Halsted-derived technology, and that additional Halsted software assessing applicant qualities and program fit exists with limited public methodological disclosure.

Read together, the three documents describe a coherent picture. The DSAR representation that the patent features are not incorporated into Cortex as of the 2025–2026 application cycle may be technically accurate. The public roadmap states the assignee's own goal of a fully unified experience by the 2027 ERAS season. The JAMA Viewpoint reports that Halsted-derived technology is already deployed within Cortex via the Academic Interest Badge. These statements are mutually consistent, and together they describe a product trajectory in which the architecture disclosed in U.S. Patent No. 12,265,502 B1 is being merged into the assignee's primary residency-screening product on a multi-cycle timeline.

The toolkit's architecture-class framing exists for this reason. The substantive question is not which specific named product processed which specific applicant during which specific application cycle. The substantive question is whether the architecture class disclosed in the patent — the architecture the assignee has publicly stated will be a fully unified component of its primary residency-screening product by the 2027 cycle — produces equal outcomes when applied to applicant data. The toolkit makes that question tractable through public methodology. It does not, and cannot, claim that any specific deployed AEDT implements the architecture in the same way the toolkit does, uses the same parameters, or produces the same outputs.

A separate jurisdictional response, addressed in the eventual appeal, declined to engage with the substantive DSAR on the grounds that the New Hampshire Privacy Act's volume thresholds do not reach the assignee. That track is documented in the appeal record under §507-H:4(IV) and is not the subject of this section.

Components

Module          Function
audit/          Fairness metrics, bootstrap CIs, per-axis screening
mitigator/      Bias Mitigator (Claim 1, input-side anonymization + semantic substitution)
ps_extraction/  Four PS-question extractor (SBERT and LLM variants)
synthetic/      Demographically stratified synthetic PS generator
examples/       Reference pipeline_fn (VADER baseline)
tools/          CLI runners for end-to-end audits

Patent-element to module mapping is documented in DEVIATIONS_FROM_PATENT.md.

Installation

cd aedt-fairness-audit
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# For SBERT extractor:
pip install sentence-transformers

# For mitigator (NER-based anonymization):
pip install spacy
python -m spacy download en_core_web_sm

# For LLM extractor:
pip install openai          # OpenAI or any OpenAI-compatible endpoint
pip install anthropic       # Anthropic

Quickstart — reproduce the reference outputs

A committed reference output set lives in examples/reference_outputs/. It contains one representative run of every CLI tool, in JSON plus rendered PNG/PDF figures. Use it as a known-good baseline to diff your own runs against.

The full synthetic-corpus audit set reproduces in the numbered steps below: five audit steps (with step 3b adding the permutation p-values), followed by figure rendering. With gpt-4o-mini for synthetic-PS generation and LLM-based PS extraction (the canonical configuration RESULTS.md reports), the cost is roughly $1–$3 in API credits depending on extractor cache state.

# 1. Generate the 384-PS stratified synthetic corpus
OPENAI_API_KEY=sk-... python -m tools.generate_ps_corpus \
    --provider openai --model gpt-4o-mini --instances-per-cell 4 \
    --out synthetic/data/ps_corpus.jsonl

# 2. Audit 1: Bias Mitigator effect on a VADER baseline pipeline
python -m tools.run_audit_1 \
    --corpus synthetic/data/ps_corpus.jsonl \
    --pipeline examples.example_pipeline:score_texts \
    --out-dir out/audit_1 --bootstrap-reps 1000

# 3. Audit 2 — PS four-question extraction (SBERT + LLM variants)
python -m tools.run_audit_2 \
    --corpus synthetic/data/ps_corpus.jsonl \
    --out-dir out/audit_2 --extractor sbert --bootstrap-reps 1000

python -m tools.run_audit_2 \
    --corpus synthetic/data/ps_corpus.jsonl \
    --out-dir out/audit_2 --extractor llm \
    --llm-provider openai --llm-model gpt-4o-mini --bootstrap-reps 1000

# 3b. Permutation tests for the inferential p-values
#     (the audit-2 runs above produce point estimates and bootstrap CIs;
#     this step adds the permutation p-values reported in RESULTS.md)
python -m tools.rebootstrap \
    --scores out/audit_2/audit_2_per_applicant_scores_sbert.csv \
    --score-cols poverty refugee major_illness academic_career _total \
    --top-frac 0.3 --bootstrap-reps 1000 --n-permutations 10000 \
    --out out/audit_2/audit_2_results_sbert_perm.json
python -m tools.rebootstrap \
    --scores out/audit_2/audit_2_per_applicant_scores_llm.csv \
    --score-cols poverty refugee major_illness academic_career _total \
    --top-frac 0.3 --bootstrap-reps 1000 --n-permutations 10000 \
    --out out/audit_2/audit_2_results_llm_perm.json

# 4. Content-equivalence validation
python -m tools.content_equivalence \
    --corpus synthetic/data/ps_corpus.jsonl \
    --out out/content_equivalence/results.json

# 5. Counterfactual decomposition (marker-stripping diagnostic).
#    --n-permutations adds the paired-permutation p-value per
#    (question × axis) cell that RESULTS.md reports.
python -m tools.counterfactual_decomposition \
    --corpus synthetic/data/ps_corpus.jsonl \
    --original-scores out/audit_2/audit_2_per_applicant_scores_llm.csv \
    --out-dir out/counterfactual \
    --llm-provider openai --llm-model gpt-4o-mini \
    --n-permutations 10000

# 6. Render figures
python -m plots.plot_audit_1 \
    --input out/audit_1/audit_1_results.json --out-dir out/audit_1
python -m plots.plot_audit_2 \
    --input out/audit_2/audit_2_results_sbert.json --out-dir out/audit_2 \
    --name audit_2_sbert_di_heatmap --title-suffix " (SBERT)"
python -m plots.plot_audit_2 \
    --input out/audit_2/audit_2_results_llm.json --out-dir out/audit_2 \
    --name audit_2_llm_di_heatmap --title-suffix " (LLM, gpt-4o-mini)"
python -m plots.plot_content_equivalence \
    --input out/content_equivalence/results.json --out-dir out/content_equivalence
python -m plots.plot_counterfactual_decomposition \
    --input out/counterfactual/counterfactual_decomposition.json \
    --out-dir out/counterfactual

Compare your out/ figures against examples/reference_outputs/ for sanity. Specific numerical values will not match exactly. Synthetic-PS generation is stochastic, model outputs vary across API calls, and bootstrap seeds drift. But the qualitative patterns (which axes raise an adverse-impact flag, which direction the mitigator moves the ratios, the ordering of the content-equivalence nesting levels) should reproduce. If they do not, that is itself an audit-worthy finding.

Cross-family and cross-generator robustness checks

The headline school_tier × academic_career signal in RESULTS.md rests on a 2 × 2 (generator × scorer) robustness check, not on a single run. To reproduce the four cells:

# Cross-family scoring (vary scorer, hold generator at gpt-4o-mini)
python -m tools.run_audit_2 \
    --corpus synthetic/data/ps_corpus.jsonl \
    --out-dir out/audit_2_crossfam --extractor llm \
    --llm-provider anthropic --llm-model claude-haiku-4-5 \
    --bootstrap-reps 1000

python -m tools.rebootstrap \
    --scores out/audit_2_crossfam/audit_2_per_applicant_scores_llm.csv \
    --score-cols poverty refugee major_illness academic_career _total \
    --top-frac 0.3 --bootstrap-reps 1000 --n-permutations 10000 \
    --out out/audit_2_crossfam/audit_2_results_llm_perm.json

# Cross-generator (regenerate with claude-haiku-4-5, score with both)
ANTHROPIC_API_KEY=... python -m tools.generate_ps_corpus \
    --provider anthropic --model claude-haiku-4-5 \
    --instances-per-cell 4 \
    --out synthetic/data/ps_corpus_haiku_gen.jsonl

# (Then run Audit 2 LLM with each scorer against the haiku-generated
#  corpus, and rebootstrap, exactly as in the standard reproduction
#  block above. Reference outputs at examples/reference_outputs/
#  audit_2_crossgen/.)

A content-neutral prompt variant (--prompt-variant content_neutral on tools.generate_ps_corpus) generates a corpus under a prompt that strips school-tier-correlated voice and content cues from the generator's instruction. Comparing audits run against the standard corpus vs. a content-neutral corpus tests whether the school_tier signal is a corpus-prompt design effect or robust to prompt design.

Screener-model simulation

The screener-model tools do not require the synthetic-PS corpus. They take a JSON of sentiment-instrument anchorings (one entry per sentiment instrument, naming the score it produces on a low-tone vs. a high-tone variant of the same content) and use those anchorings to generate stratified synthetic applicants for a top-K invite simulation. The example anchorings live in examples/screening_anchorings_template.json; replace with your own sentiment scores when running on your own document.

# 1. Multi-instrument × multi-scoring-method screen + sentiment-only
#    counterfactual. Reports baseline DI and counterfactual DI under
#    linear, logistic-regression, patent §530 power-of-2 (quadratic),
#    and power-of-3 (cubic) scoring rules across all anchorings.
python -m tools.run_screening_with_counterfactual \
    --anchorings examples/screening_anchorings_template.json \
    --n 6000 --invite-rate 0.12 --narrative-sd 0.10 \
    --bootstrap-reps 50 \
    --models logistic_regression linear_score quadratic_aggregation cubic_aggregation \
    --out out/screening_counterfactual/results_multimodel.json

# 2. Disclosure-rate sweep — DI as a function of the rate at which
#    disadvantaged-group applicants disclose protected-class language
#    a vendor's control would trigger on
python -m tools.run_disclosure_sweep \
    --anchoring "lexicon:0.18:0.78" \
    --rates 0 0.05 0.10 0.25 0.50 0.75 0.90 1.00 \
    --n 6000 --invite-rate 0.12 --bootstrap-reps 100 \
    --out out/disclosure_sweep/results.json

# 3. Dilution test — excerpt vs full-document gap per instrument
python -m tools.run_dilution_test \
    --config examples/dilution_test_template.json \
    --out-dir out/dilution_test

# 4. Paragraph audit — section-aware multi-instrument scoring of a
#    single user-supplied document
python -m tools.run_paragraph_audit \
    --document /path/to/document.txt \
    --instruments vader transformer \
    --out out/paragraph_audit/scores.json

# 5. Render figures
python -m plots.plot_screening_counterfactual \
    --input out/screening_counterfactual/results_multimodel.json \
    --out-dir out/screening_counterfactual --name screening_counterfactual_multimodel
python -m plots.plot_disclosure_sweep \
    --input out/disclosure_sweep/results.json --out-dir out/disclosure_sweep
python -m plots.plot_dilution_test \
    --input out/dilution_test/dilution_test_results.json --out-dir out/dilution_test
python -m plots.plot_paragraph_audit \
    --input out/paragraph_audit/scores.json --out-dir out/paragraph_audit

Reference output set lives at examples/reference_outputs/{screening_counterfactual, disclosure_sweep, dilution_test, paragraph_audit}/.

Tools

Each tool below has a one-paragraph plain-language description of what it does, followed by the command line for running it. All tools write JSON to disk; companion plot scripts in plots/ render figures from the JSON output.

Generate a synthetic corpus — tools/generate_ps_corpus.py

What it does: Generates a set of fake personal statements that span different demographic combinations (race, gender, school tier) while keeping the underlying narrative theme consistent within each group of applicants. Used as test input for the audits below. Actual applicant text is never needed.

OPENAI_API_KEY=sk-... python -m tools.generate_ps_corpus \
    --provider openai --model gpt-4o-mini \
    --out synthetic/data/ps_corpus.jsonl

Audit 1 — Bias Mitigator effect on a user-supplied pipeline_fn — tools/run_audit_1.py

What it does: Runs a user-supplied pipeline_fn twice. Once on raw applicant text. Once after applying the patent's input-side detect-and-replace bias mitigator. Compares whether the demographic gaps shrink. Requires a user-supplied scoring function via the pipeline_fn interface; see PIPELINE_BUILD_GUIDE.md.

python -m tools.run_audit_1 \
    --corpus synthetic/data/ps_corpus.jsonl \
    --pipeline my_pipeline:score_texts \
    --out-dir out/audit_1 --bootstrap-reps 1000
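
As a rough illustration of what a user-supplied scoring function might look like, here is a toy lexicon scorer. Everything in it is an assumption: the signature (a list of texts in, one float per text out) is inferred only from the module:function form of the --pipeline argument, and the word lists are invented. PIPELINE_BUILD_GUIDE.md is the authority on the actual pipeline_fn interface.

```python
# my_pipeline.py (hypothetical module, matching --pipeline my_pipeline:score_texts)

# Invented toy lexicons; a real pipeline would use a proper sentiment model.
POSITIVE = {"excellent", "outstanding", "strong", "dedicated"}
NEGATIVE = {"weak", "poor", "concern", "struggled"}


def score_texts(texts: list[str]) -> list[float]:
    """Return one score in [-1, 1] per input text (assumed interface)."""
    scores = []
    for text in texts:
        tokens = [token.strip(".,;:!?").lower() for token in text.split()]
        positive_hits = sum(token in POSITIVE for token in tokens)
        negative_hits = sum(token in NEGATIVE for token in tokens)
        total_hits = positive_hits + negative_hits
        # Texts with no lexicon hits get a neutral score of 0.0.
        scores.append(0.0 if total_hits == 0 else (positive_hits - negative_hits) / total_hits)
    return scores


if __name__ == "__main__":
    print(score_texts(["An outstanding, dedicated student.", "Struggled early; weak exam scores."]))
```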

Audit 2 — PS four-question extraction — tools/run_audit_2.py

What it does: Tests the patent's personal-statement scoring component. The patent specifies four yes/no questions the system asks of each applicant's personal statement (poverty, refugee status, major illness, academic career interest). This audit measures whether the answers come out systematically different for different demographic groups. Runs in either an SBERT (embedding-similarity) or LLM (question-answering) variant.

# SBERT extractor
python -m tools.run_audit_2 \
    --corpus synthetic/data/ps_corpus.jsonl \
    --out-dir out/audit_2 --extractor sbert --bootstrap-reps 1000

# LLM extractor
python -m tools.run_audit_2 \
    --corpus synthetic/data/ps_corpus.jsonl \
    --out-dir out/audit_2 --extractor llm \
    --llm-provider openai --llm-model gpt-4o-mini --bootstrap-reps 1000

Paragraph audit — tools/run_paragraph_audit.py

What it does: Scores a document one section at a time under multiple sentiment tools (VADER, RoBERTa, LLM judge), and flags any section that is the lowest-scoring section under every tool. This is the signal a "section-aware" AI screener would pick up if a single section of an otherwise-positive document carries a lower tone, for example, a paragraph describing a leave of absence in an otherwise laudatory recommendation letter.

python -m tools.run_paragraph_audit \
    --document /path/to/document.txt \
    --instruments vader transformer llm \
    --llm-provider openai --llm-model gpt-4o-mini \
    --out out/paragraph_audit/scores.json
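
The flagging rule described above (a section is flagged only when it is the lowest-scoring section under every instrument) can be sketched in a few lines. The function name, the input layout, and the scores are hypothetical; the tool's actual JSON schema may differ.

```python
from typing import Optional


def consensus_lowest_section(scores_by_instrument: dict[str, list[float]]) -> Optional[int]:
    """Return the section index that scores lowest under every instrument.

    Returns None when the instruments disagree about which section is lowest.
    Within one instrument, an exact tie resolves to the earliest section.
    """
    lowest_indices = set()
    for scores in scores_by_instrument.values():
        lowest_indices.add(min(range(len(scores)), key=scores.__getitem__))
    return lowest_indices.pop() if len(lowest_indices) == 1 else None


# Hypothetical per-section scores for a three-section document.
example_scores = {
    "vader": [0.80, 0.20, 0.70],
    "transformer": [0.90, 0.10, 0.60],
}
print(consensus_lowest_section(example_scores))
```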

Dilution test — tools/run_dilution_test.py

What it does: Compares two narrative variants (a low-tone version and a high-tone version of the same content) when scored as standalone excerpts vs when embedded in a longer surrounding document. Tests whether the score difference disappears when the variant is buried in context. The dilution percentage depends on which sentiment tool is used; lexicon and transformer tools tend to fully dilute, while LLM judges retain more of the gap.

python -m tools.run_dilution_test \
    --config examples/dilution_test_template.json \
    --out-dir out/dilution_test

The included examples/mspe_skeleton_template.txt is a generic medical-school document skeleton with placeholders for student name, school, and dates.
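
One plausible way to express the dilution percentage is the share of the excerpt-level score gap that disappears once the variants are embedded in the full document. The formula and function name below are assumptions made for illustration, not necessarily the definition the tool uses, and the full-document scores are invented.

```python
def dilution_fraction(excerpt_low: float, excerpt_high: float,
                      full_low: float, full_high: float) -> float:
    """Share of the excerpt-level gap lost at full-document level.

    1.0 means the gap vanished entirely; 0.0 means it was fully retained.
    """
    excerpt_gap = excerpt_high - excerpt_low
    if excerpt_gap == 0:
        raise ValueError("excerpt scores are identical; no gap to dilute")
    full_gap = full_high - full_low
    return 1.0 - full_gap / excerpt_gap


# Excerpt scores 0.18 vs 0.78; hypothetical full-document scores 0.90 vs 0.93.
print(f"{dilution_fraction(0.18, 0.78, 0.90, 0.93):.0%} of the gap is diluted")
```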

Screening simulation — tools/run_screening_simulation.py

What it does: Simulates a residency-style screening process with thousands of synthetic applicants. The user supplies sentiment scores (the score the audited document gets under different tools) as "anchorings"; the simulation generates applicants where the disadvantaged group's narratives carry the low score and the favored group's carry the high score, trains a screening model on the combined applicant pool, and reports how often each group gets selected.

python -m tools.run_screening_simulation \
    --anchorings examples/screening_anchorings_template.json \
    --out-dir out/screening_simulation \
    --n 6000 --invite-rate 0.12 --bootstrap-reps 100

Screening with counterfactual — tools/run_screening_with_counterfactual.py

What it does: Runs the simulation above, then runs an "intervention" version: what if the disadvantaged group's narratives suddenly carried the high-tone scores instead? The intervention re-scores the same applicants under the same trained model with only the narrative-tone feature changed. Comparing the two reports how much of the disparity is driven by narrative tone alone. Supports multiple screening models including the patent's specified power-of-2 aggregation (quadratic_aggregation) from §530.

python -m tools.run_screening_with_counterfactual \
    --anchorings examples/screening_anchorings_template.json \
    --n 6000 --invite-rate 0.12 --narrative-sd 0.10 \
    --bootstrap-reps 50 \
    --models logistic_regression linear_score quadratic_aggregation cubic_aggregation \
    --out out/screening_counterfactual/results.json

Disclosure-rate sweep — tools/run_disclosure_sweep.py

What it does: Tests how much an AI vendor's "protected-class control" (a feature that suppresses sentiment scoring when ADA/504 disclosure language is detected in the document) actually helps. The control only triggers when the applicant's text contains the disclosure language. The sweep varies what percentage of disadvantaged-group applicants disclose, and reports the resulting disparity at each rate.

python -m tools.run_disclosure_sweep \
    --anchoring "vader_excerpt:0.18:0.78" \
    --rates 0 0.05 0.10 0.25 0.50 0.75 0.90 1.00 \
    --n 6000 --invite-rate 0.12 --bootstrap-reps 100 \
    --out out/disclosure_sweep/results.json
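
The sweep can be pictured with the toy sketch below. It rests on a strong assumption that should be checked against the actual tool: here "suppressing" sentiment scoring is modeled as replacing the tone feature with a neutral value (the midpoint of the two anchors). The function name, the linear score, and the neutral value are hypothetical.

```python
import random

def disparity_at_rate(disclosure_rate, low=0.18, high=0.78, n=6000,
                      invite_rate=0.12, seed=0):
    """Impact ratio when a given share of disadvantaged applicants disclose."""
    rng = random.Random(seed)   # same seed, so the same applicants at every rate
    neutral = (low + high) / 2  # ASSUMPTION: suppressed tone becomes a neutral value
    scored = []
    for i in range(n):
        disadvantaged = i % 2 == 0
        merit = rng.gauss(0.0, 1.0)
        tone = rng.gauss(low if disadvantaged else high, 0.10)
        discloses = rng.random() < disclosure_rate
        if disadvantaged and discloses:
            tone = neutral  # the control fires only when disclosure language is present
        scored.append((merit + 2.0 * tone, disadvantaged))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    invited = scored[: round(n * invite_rate)]
    half = n / 2
    rate_dis = sum(1 for _, d in invited if d) / half
    rate_fav = sum(1 for _, d in invited if not d) / half
    return rate_dis / rate_fav

sweep = {r: disparity_at_rate(r) for r in (0.0, 0.25, 0.50, 0.75, 1.00)}
for rate, ratio in sweep.items():
    print(f"disclosure {rate:.2f} -> impact ratio {ratio:.3f}")
```

In this sketch the control can only help applicants who disclose, so the impact ratio at low disclosure rates stays close to the uncontrolled baseline.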

Content-equivalence validation — tools/content_equivalence.py

What it does: Validates the synthetic corpus. Measures whether two synthetic applications from different demographic groups but the same narrative theme are more similar to each other than two applications from different themes. If the within-theme distance is meaningfully smaller than the across-theme distance, the corpus held content constant across demographic groups; if not, the audits' premise is suspect.

python -m tools.content_equivalence \
    --corpus synthetic/data/ps_corpus.jsonl \
    --out out/content_equivalence/results.json
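
The comparison reduces to two averages over document pairs. The sketch below uses tiny hand-written vectors in place of real sentence embeddings, and the theme and group labels are made up for illustration; it shows the shape of the check, not the toolkit's code.

```python
import math
from itertools import combinations

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def equivalence_ratio(docs):
    """docs: list of (theme, group, embedding). Returns (within, across, ratio)."""
    within, across = [], []
    for (t1, g1, e1), (t2, g2, e2) in combinations(docs, 2):
        d = cosine_distance(e1, e2)
        if t1 == t2 and g1 != g2:
            within.append(d)   # same theme, different demographic group
        elif t1 != t2:
            across.append(d)   # different themes
    within_mean = sum(within) / len(within)
    across_mean = sum(across) / len(across)
    return within_mean, across_mean, within_mean / across_mean

# Toy embeddings (hypothetical): two themes, two groups each.
docs = [
    ("rural_clinic", "A", [1.0, 0.1, 0.0]),
    ("rural_clinic", "B", [0.9, 0.2, 0.0]),
    ("lab_research", "A", [0.1, 1.0, 0.1]),
    ("lab_research", "B", [0.0, 0.9, 0.2]),
]
within_mean, across_mean, ratio = equivalence_ratio(docs)
print(round(within_mean, 3), round(across_mean, 3), round(ratio, 3))
```

A ratio well below 1 indicates that same-theme documents sit closer together than different-theme documents, which is the property the validation looks for.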

Counterfactual decomposition — tools/counterfactual_decomposition.py

What it does: When the audits show demographic disparity, this diagnostic asks: is the disparity driven by demographic markers (names, schools, identity phrases) or by deeper content patterns? Removes the markers using the patent's bias-removal step, re-scores the marker-stripped applications with the same scoring tool, and compares the results.

python -m tools.counterfactual_decomposition \
    --corpus synthetic/data/ps_corpus.jsonl \
    --original-scores out/audit_2/audit_2_per_applicant_scores_llm.csv \
    --out-dir out/counterfactual \
    --llm-provider openai --llm-model gpt-4o-mini \
    --n-permutations 10000
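
One way to test whether marker-stripping moved a group gap is a paired permutation test. The sketch below assumes a particular pairing scheme (swap each applicant's original and stripped score with probability 0.5) and a mean-score gap as the statistic; the toolkit's actual statistic and pairing may differ, and the data are synthetic toy values.

```python
import random

def group_gap(scores, groups):
    """Mean score of group A minus mean score of group B."""
    a = [s for s, g in zip(scores, groups) if g == "A"]
    b = [s for s, g in zip(scores, groups) if g == "B"]
    return sum(a) / len(a) - sum(b) / len(b)

def paired_permutation_test(original, stripped, groups,
                            n_permutations=2000, seed=0):
    """Two-sided p-value for the change in group gap after marker stripping."""
    rng = random.Random(seed)
    observed = group_gap(original, groups) - group_gap(stripped, groups)
    extreme = 0
    for _ in range(n_permutations):
        first, second = [], []
        for o, s in zip(original, stripped):
            if rng.random() < 0.5:  # swap the pair's labels under the null
                o, s = s, o
            first.append(o)
            second.append(s)
        shift = group_gap(first, groups) - group_gap(second, groups)
        if abs(shift) >= abs(observed) - 1e-12:
            extreme += 1
    return observed, (extreme + 1) / (n_permutations + 1)

# Toy data: group A's original scores carry a 0.2 marker-driven boost.
groups = ["A" if i % 2 == 0 else "B" for i in range(40)]
stripped = [0.30 + 0.01 * (i % 10) for i in range(40)]
original = [s + 0.2 if g == "A" else s for s, g in zip(stripped, groups)]

observed, p_value = paired_permutation_test(original, stripped, groups)
print(round(observed, 3), round(p_value, 4))
```

A small p-value here means stripping changed the gap by more than label-swapping alone would explain; a large one means the gap was effectively unmoved by the markers.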

Re-bootstrap — tools/rebootstrap.py

What it does: Re-runs the statistical confidence-interval calculation from a prior audit's per-applicant scores at a higher bootstrap-replicate count, without re-running the expensive scoring step. Used to tighten error bars to publication-grade levels.

python -m tools.rebootstrap \
    --scores out/audit_2/audit_2_per_applicant_scores_llm.csv \
    --score-cols poverty refugee major_illness academic_career _total \
    --top-frac 0.3 --bootstrap-reps 1000 \
    --out out/audit_2/audit_2_results_llm_reps1000.json
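
The re-bootstrap step amounts to resampling stored per-applicant scores, which is why no re-scoring is needed. A minimal percentile-bootstrap sketch follows; function names, group labels, the tie-handling, and the random toy scores are assumptions for illustration.

```python
import random

def disparate_impact(scores, groups, top_frac, protected="B", reference="A"):
    """Top-fraction selection, then protected/reference selection-rate ratio."""
    k = round(len(scores) * top_frac)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = set(order[:k])  # ties fall to input order here, an arbitrary choice

    def rate(g):
        members = [i for i, grp in enumerate(groups) if grp == g]
        if not members:
            return None
        return sum(1 for i in members if i in chosen) / len(members)

    prot, ref = rate(protected), rate(reference)
    if prot is None or not ref:  # group missing from resample, or zero reference rate
        return None
    return prot / ref

def percentile_bootstrap_ci(scores, groups, top_frac=0.3, reps=1000,
                            alpha=0.05, seed=0):
    """Resample applicants with replacement; return (lower, upper) bounds."""
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    for _ in range(reps):
        idx = [rng.randrange(n) for _ in range(n)]
        value = disparate_impact([scores[i] for i in idx],
                                 [groups[i] for i in idx], top_frac)
        if value is not None:
            stats.append(value)
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lower, upper

data_rng = random.Random(42)
scores = [data_rng.random() for _ in range(200)]  # toy per-applicant scores
groups = ["A" if i % 2 == 0 else "B" for i in range(200)]
point = disparate_impact(scores, groups, top_frac=0.3)
lower, upper = percentile_bootstrap_ci(scores, groups, reps=500)
print(round(point, 3), (round(lower, 3), round(upper, 3)))
```

Raising `reps` only repeats the cheap resampling loop, so tighter intervals cost seconds rather than another scoring run.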

Library use

import numpy as np
import pandas as pd

from audit.metrics import group_outcome_summary, disparity_summary
from mitigator import BiasMitigator
from ps_extraction import PSExtractor

# Fairness metrics on any binary-group prediction.
# df, yhat, proba (and text, below) are supplied by the caller.
metrics = group_outcome_summary(df, yhat, proba)
disparity = disparity_summary(metrics)

# Bias mitigator
mitigator = BiasMitigator()
mitigated_text = mitigator(text)

# PS extractor
extractor = PSExtractor()
scores = extractor.score_text(text)
# {"poverty": 0.42, "refugee": 0.08, "major_illness": 0.61, "academic_career": 0.34, "_total": ...}

Inputs and outputs

The pipeline expects a pandas DataFrame with these columns:

  • applicant_id (str)
  • mspe, lor, ps (str) — narrative document fields; any may be empty
  • structured features (numeric) — combined with narrative score by user-supplied pipeline_fn
  • prot_* (str) — protected attribute columns; the stage audit runs fairness metrics across any column whose name begins with prot_
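
A quick layout check along these lines can catch a missing column before an audit run. This helper is hypothetical and not part of the library; it works on column names only, and `step1_score` is an invented structured feature used for the example. Treating an absent prot_* column as an error is also an assumption.

```python
REQUIRED_COLUMNS = ("applicant_id", "mspe", "lor", "ps")

def split_columns(columns):
    """Validate the layout and return (protected, other) column names."""
    columns = list(columns)
    missing = [c for c in REQUIRED_COLUMNS if c not in columns]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    protected = [c for c in columns if c.startswith("prot_")]
    if not protected:
        raise ValueError("no prot_* column found; nothing to audit")
    other = [c for c in columns
             if c not in REQUIRED_COLUMNS and c not in protected]
    return protected, other

# With pandas this would be called as split_columns(df.columns).
protected, structured = split_columns(
    ["applicant_id", "mspe", "lor", "ps", "step1_score", "prot_race", "prot_gender"]
)
print(protected, structured)
```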

When out_dir is supplied, the audit harnesses write:

  • audit_1_results.json / audit_2_results_{sbert,llm}.json — metrics
  • audit_*_per_applicant_scores.csv — applicant-level scores

Privacy defaults

The library produces no disk artifacts unless the user explicitly writes them. Audit metrics return Python dicts; the caller decides whether and where to persist results. The PSExtractor's store_anchor_text defaults to False; cluster exemplars are not serialized to manifests.

Smoke test

python -m tools.smoke_test            # full path
python -m tools.smoke_test --offline  # CI / sandbox-friendly

Runs a 16-document hand-curated corpus through the BiasMitigator, PSExtractor, and per-axis fairness metrics. ~1 minute end to end.

--offline skips network-dependent steps (SBERT model download, spaCy NER model download, VADER lexicon download). It runs the mitigator with regex-only redaction (no spaCy NER), substitutes a deterministic synthetic score for the SBERT extractor, and exercises the full axis_audit code path without requiring internet access. Suitable for CI environments and offline reproducibility checks.

Methodology references

AAMC alignment

The AAMC's published principles for the responsible use of AI in medical education explicitly recommend the kind of audit this library supports and cite AI Fairness 360 (AIF360) by name as a tool for examining potential bias:

"Audit AI systems regularly. Schedule and conduct an annual audit of the AI system and its output to identify AI-related biases and other problems in the selection process. Collaborate with a dedicated team of experts to analyze the findings and develop strategies for continuous improvement to be implemented for the next cycle. Consult recent and relevant journal articles and technical reports that have used AI in selection processes, explore tools used to examine the potential for bias like Admissible ML or AI Fairness 360, and consult legal counsel when appropriate."

Source: AAMC, Principles for the Responsible Use of Artificial Intelligence in and for Medical Education — Protect Against Algorithmic Bias, aamc.org/about-us/mission-areas/medical-education/principles-ai/protect-against-algorithmic-bias.

The AAMC also recommends not changing the process mid-cycle and tracking all changes when they occur. The toolkit's manifest output (SHA-256 hashes of stage artifacts, full config dump, run timestamps) is designed to make process-version tracking auditable.
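
To illustrate what such a manifest involves, here is a minimal sketch using only the standard library. The field names (`created_utc`, `config`, `artifacts`) and the example inputs are assumptions; the toolkit's real manifest schema may differ.

```python
import hashlib
import json
from datetime import datetime, timezone

def build_manifest(artifacts, config):
    """artifacts maps a name to its bytes; returns a JSON-serializable manifest."""
    return {
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "config": config,
        "artifacts": {
            name: hashlib.sha256(data).hexdigest()
            for name, data in sorted(artifacts.items())
        },
    }

# Toy inputs: in practice the bytes would be the contents of each stage artifact.
manifest = build_manifest(
    {"stage_output.json": b"example bytes", "toy.bin": b"abc"},
    config={"top_frac": 0.3, "bootstrap_reps": 1000},
)
print(json.dumps(manifest, indent=2))
```

Any change to an artifact changes its hash, so comparing two manifests shows whether the process changed between runs.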

AI Fairness 360 (AIF360)

The fairness-metric definitions implemented in audit/ follow the AI Fairness 360 (AIF360) conventions. AIF360 itself is not a runtime dependency. The metrics are reimplemented here in numpy/pandas to keep installation light. The mathematical definitions and naming follow AIF360's ClassificationMetric class.

Specifically, this library replicates:

  • disparate_impact — selection-rate ratio between protected groups (the EEOC four-fifths rule metric)
  • statistical_parity_difference — selection-rate difference
  • equal_opportunity_difference — true-positive-rate difference
  • false_positive_rate_difference — false-positive-rate difference
  • accuracy — overall classification accuracy

Source for the AIF360 definitions: github.com/Trusted-AI/AIF360/blob/master/aif360/metrics/classification_metric.py
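
For readers who want the definitions spelled out, the sketch below computes the five metrics in plain Python, following the AIF360 convention of reporting the unprivileged group relative to the privileged group. The function name and argument layout are illustrative and are not this library's API.

```python
def rate(numerator, denominator):
    return numerator / denominator if denominator else float("nan")

def fairness_metrics(y_true, y_pred, group, unprivileged, privileged):
    """Binary-group metrics, each reported as unprivileged relative to privileged."""
    def stats(g):
        rows = [(t, p) for t, p, grp in zip(y_true, y_pred, group) if grp == g]
        selected = sum(p for _, p in rows)
        tp = sum(1 for t, p in rows if t == 1 and p == 1)
        fp = sum(1 for t, p in rows if t == 0 and p == 1)
        positives = sum(1 for t, _ in rows if t == 1)
        negatives = len(rows) - positives
        return rate(selected, len(rows)), rate(tp, positives), rate(fp, negatives)

    sel_u, tpr_u, fpr_u = stats(unprivileged)
    sel_p, tpr_p, fpr_p = stats(privileged)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return {
        "disparate_impact": rate(sel_u, sel_p),            # ratio of selection rates
        "statistical_parity_difference": sel_u - sel_p,    # difference of selection rates
        "equal_opportunity_difference": tpr_u - tpr_p,     # difference of true-positive rates
        "false_positive_rate_difference": fpr_u - fpr_p,   # difference of false-positive rates
        "accuracy": rate(correct, len(y_true)),
    }

y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
group = ["u"] * 4 + ["p"] * 4
result = fairness_metrics(y_true, y_pred, group, unprivileged="u", privileged="p")
print(result)
```

In this toy example the unprivileged group is selected at 0.25 and the privileged group at 0.75, so the disparate-impact ratio is about 0.33, well under the four-fifths threshold.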

Citation: Bellamy, R. K. E., Dey, K., Hind, M., Hoffman, S. C., Houde, S., Kannan, K., et al. (2018). AI Fairness 360: An Extensible Toolkit for Detecting, Understanding, and Mitigating Unwanted Algorithmic Bias. arXiv:1810.01943. github.com/Trusted-AI/AIF360

If you prefer to use AIF360 directly rather than the reimplementations here, the metric definitions and names match those of AIF360's ClassificationMetric class. Substitute aif360.metrics.ClassificationMetric calls in your own code where this library uses its functions of the same names.

Other methodology references:

  • VADER sentiment: Hutto, C. J. & Gilbert, E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. ICWSM.
  • Sentence-BERT: Reimers, N. & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP.
  • HDBSCAN: Campello, R. J. G. B., Moulavi, D., & Sander, J. (2013). Density-based clustering based on hierarchical density estimates. PAKDD.

Limitations and prospective follow-ons

This toolkit is a measurement instrument, not an authoritative finding about any specific deployed AEDT. Several caveats apply to how its outputs should be read.

Synthetic corpus, not real applicant data. The audits run on LLM-generated personal statements stratified across demographic combinations. The corpus is designed so within-seed content is held approximately constant across demographic strata; this is validated by pairwise SBERT distance (within-seed-across-stratum mean cosine 0.215 vs. across-seed mean 0.337, ratio 0.637; n = 384). The separation is meaningful but not overwhelming. The generator may leak content along with demographic markers in ways the cosine check does not catch. Findings should be read as "this is what happens when the patent's architecture is run on a corpus designed to isolate demographic markers from content," not as a population-level claim about real applicant text.

Bootstrap CIs behave poorly on discrete top-K selection; permutation tests are reported as the inferential complement. Even at moderate group sizes (~192 per group at n = 384), binary top-K selection produces a discrete-valued disparate-impact statistic, and percentile bootstrap intervals can sit above the point estimate or fail to bracket it cleanly. The audit runners support a two-sided permutation test under the null of group-selection independence (--n-permutations flag on tools/rebootstrap.py); RESULTS.md reports those p-values alongside the bootstrap CIs at 10,000 permutations. After the tie-break fix on 2026-05-06 (see CHANGELOG), at n = 384 no race-axis or gender-axis cell reaches conventional significance under any single LLM scorer; one school_tier-axis cell (academic_career × school_tier) sits at borderline uncorrected significance (p ≈ 0.06–0.07) and is direction-consistent across both LLM scorer families and both LLM generator families in a 2×2 cross-family check. It does not survive multiple-comparisons correction. Users running on smaller corpora should expect the percentile bootstrap to be a stability indicator rather than a significance test.
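
A label-shuffling permutation test of the kind described above can be sketched as follows. As a simplification, this version uses the absolute selection-rate difference as the test statistic rather than the disparate-impact ratio, which avoids division by zero in small samples; the function names and the toy selection data are assumptions.

```python
import random

def rate_difference(selected, groups, a="A", b="B"):
    """Selection rate of group a minus selection rate of group b."""
    rate = {}
    for g in (a, b):
        flags = [s for s, grp in zip(selected, groups) if grp == g]
        rate[g] = sum(flags) / len(flags)
    return rate[a] - rate[b]

def permutation_p_value(selected, groups, n_permutations=10000, seed=0):
    """Two-sided p-value under the null that selection is independent of group."""
    rng = random.Random(seed)
    observed = abs(rate_difference(selected, groups))
    shuffled = list(groups)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any link between group and selection
        if abs(rate_difference(selected, shuffled)) >= observed - 1e-12:
            extreme += 1
    return (extreme + 1) / (n_permutations + 1)

# Toy data: 6 of 10 selected in group A, 2 of 10 in group B.
groups = ["A"] * 10 + ["B"] * 10
selected = [1] * 6 + [0] * 4 + [1] * 2 + [0] * 8
p_value = permutation_p_value(selected, groups)
print(round(p_value, 3))
```

The selection itself is held fixed and only the group labels move, so the discreteness of top-K selection is built into the null distribution instead of distorting an interval.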

No ground truth for the four PS questions. Audit 2 measures whether the patent's four-question extractor produces systematically different yes-rates across demographic groups. It does not measure whether those answers are correct; the synthetic corpus does not carry verified ground-truth labels for poverty / refugee / illness / academic-career status independent of what the generator was prompted to encode. Group-rate disparities are the measurand; calibration against truth is not.

The pipeline implementation is the user's, not the toolkit's. The library does not ship an AEDT pipeline. Audit 1 requires the user to supply their own pipeline_fn. Different reasonable implementations of the patent's architecture will produce different absolute DI values. The toolkit's role is to provide a consistent measurement framework, not to certify that any one implementation is the patent's "true" implementation.

Default extractor parameters are implementer's choices. The SBERT extractor's threshold (0.35), softmax temperature (8.0), aggregate power (2.0), and exemplar inventory are documented in DEVIATIONS_FROM_PATENT.md as implementation choices the patent does not specify. Changing these can shift per-question DI values. Robust findings should reproduce across reasonable parameter ranges; users running audits should report the parameters used.

Two extractor variants are not exhaustive. The SBERT and LLM extractors represent two reasonable instantiations of "users can apply NLP to read through personal statement of each applicant" (col. 10). Other architectures consistent with the patent (hybrid retrieval-augmented systems, fine-tuned classifiers, ensemble approaches) are not tested. Findings that hold across SBERT and LLM extractors are more robust than findings that appear in only one.

The bias mitigator's effect on the surviving school_tier signal is mixed. Audit 1 (VADER + Claim 1 mitigator) shows no systematic mitigator effect on a sentiment-only pipeline (paired-permutation p > 0.5 on all three demographic axes), which is the expected result when the pipeline is largely insensitive to the markers the mitigator strips. The counterfactual decomposition (LLM extractor scored on marker-stripped PSs) shows that the academic_career × school_tier signal is statistically unmoved by marker-stripping (paired-permutation p = 0.65), while the _total × school_tier aggregate signal does move toward parity (p = 0.015). The single-question signal is content-driven in a sense the patent's specified anonymization step cannot reach; the aggregate dilutes when school markers are removed but the underlying academic-narrative variation remains. Output-side recalibration (col. 24, lines 19–46) is in the patent spec but discretionary under Claim 1, with no algorithm specified, and is not implemented here. Users designing their own mitigations may find approaches that perform differently.

Findings are about an architecture class, not a specific product. This toolkit tests the architecture disclosed in U.S. Patent No. 12,265,502 B1 as implemented from public components. It does not, and cannot, claim that any specific deployed AEDT (including any product made by the patent's assignee) implements the architecture in the same way the toolkit does, uses the same parameters, or produces the same outputs. Inference from toolkit results to specific deployed products requires independent evidence about those products.

Demographic axes are limited to those the synthetic generator stratifies. The shipped corpus stratifies on race (4 categories), gender (binary), and school tier (3 levels). Disability status, age, sexual orientation, geographic origin, language background, and intersectional combinations beyond the three axes are not separately tested. Users investigating those axes should extend the generator and re-run.

Prospective follow-ons

The current audit's surviving claim is borderline, architecture-dependent, and synthetic-data-bound. To strengthen or dissolve it further, several follow-ons would be informative. None of them is implemented in the current toolkit; all are flagged here so a hostile reviewer's first questions are also the audit's own first questions.

  1. Controlled school-name substitution test. The mitigator's counterfactual decomposition strips multiple identifying tokens simultaneously (names, schools, ethnicity terms, locations). A tighter test would substitute only the school name (e.g., "Harvard" → "[TIER1_SCHOOL_A]") while preserving the surrounding narrative literally, then re-score with the LLM extractor. If the academic_career × school_tier signal collapses under that substitution, the school-name-association mechanism is confirmed.

  2. Non-LLM-generated or hand-authored matched corpus. Every robustness check in this audit varies LLM-related parameters (scorer family, generator family, prompt design) but holds "the corpus was generated by an LLM" constant. A corpus generated without LLM involvement — hand-authored matched pairs, or programmatically templated PSs with controlled lexical content — would be the cleanest test of whether the surviving signal is real LLM-extractor behavior on authentic text or a property of LLM-generated text specifically.

  3. Pre-registered primary cells. The current audit reports 30+ cells across two extractors and several sensitivity tests. The corrected substantive claim narrows post-hoc to one cell. A confirmatory follow-up should pre-register the academic_career × school_tier cell as the primary endpoint, set the multiple-comparisons threshold up front, and report only that cell as the confirmatory test. A pre-registration of such a confirmatory test, with a school-name substitution intervention as the primary mechanism test, is committed at CONFIRMATORY_STUDY_PREREG.md alongside the locked substitution table at confirmatory/school_substitution_table.json. The pre-registered analysis has not been executed at this commit.

  4. More explicit multiple-comparisons handling in headline framing. The current README/RESULTS prominently note that no cell clears Bonferroni or Benjamini–Hochberg correction, but the surviving substantive claim still leans on the cell's robustness pattern rather than its corrected significance. A confirmatory replication under (3) would address this directly.

  5. Independent replication. The audit toolkit is reproducible from CLI on a fresh checkout. Running the same commands on a different machine, with a different operator, and reporting whether the qualitative findings hold is the cheapest external credibility step. The toolkit is set up so the only API spend required is corpus generation + LLM scoring (~$3 total) and the commands are documented in the Quickstart above. The repository includes a 2026-05-06 fresh-corpus reproduction (in examples/reference_outputs/audit_2_repro/) as one such replication; the substantive claim direction-replicated and the SBERT "opposite direction" framing did not.
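
As an illustration of follow-on 1, the substitution step could look like the sketch below. It is not the pre-registered implementation: the table here is hypothetical (only the "Harvard" placeholder mirrors the example in item 1, and "Example State University" is invented), and the real table lives in the locked JSON file named in item 3.

```python
import re

# Hypothetical substitution table; not the locked pre-registration table.
SUBSTITUTIONS = {
    "Harvard": "[TIER1_SCHOOL_A]",
    "Example State University": "[TIER3_SCHOOL_A]",
}

def substitute_schools(text, table=SUBSTITUTIONS):
    """Replace whole-word school names only; every other character is untouched."""
    names = sorted(table, key=len, reverse=True)  # longest first for overlapping names
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(name) for name in names) + r")\b"
    )
    return pattern.sub(lambda match: table[match.group(0)], text)

original = "I trained at Harvard before residency."
print(substitute_schools(original))
# I trained at [TIER1_SCHOOL_A] before residency.
```

Because only the matched name changes, any shift in the extractor's score between the original and substituted text can be attributed to the school name rather than to the surrounding narrative.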

The 2026-05-06 CHANGELOG entries describe the corrections made in response to public and private review (the tie-break artifact, the cross-family + cross-generator + content-neutral sensitivity tests, the permutation rep-count reconciliation, the fresh-corpus reproduction). Those refinements have narrowed the audit's substantive claim. The follow-ons above are the next set of moves that would either further narrow it or strengthen it.

Citation

Markey, C. (2026). AEDT Fairness Audit Toolkit.
https://github.com/[username]/aedt-fairness-audit

A CITATION.cff file is provided for automatic citation generation.

License

MIT. See LICENSE.
