[Multi-Agent Privacy] Detection tools implementation #335

Open
JCHAVEROT wants to merge 21 commits into v2 from feat/detection-toolkit-clean

Conversation

@JCHAVEROT

Collaborator

Summary

Related issue: #292
Depending on: #285
Review completed in a fork: link
Target branch: swiss-ai:mmore/v2

This PR adds a Personally Identifiable Information (PII) detection toolkit whose engines will later be exposed as tools to the agentic privacy system.

What this adds

  • A PII detection toolkit under mmore.privacy.detection with four interchangeable engines: GLiNER, openai/privacy-filter, Presidio with custom clinical recognizers, and an LLM via DSPy
  • Each engine takes a shared DetectionConfig and registers itself in a tool registry so agents can call it
  • Model loading is lazy and shared across engines through a single pipeline-wide model registry, so each model loads once and can be reused across agents:
    • LRU eviction under a memory budget that defaults to a fraction of device memory (CUDA/MPS/CPU) and can be overridden with MMORE_PRIVACY_MODEL_BUDGET_MB
    • The cache can be disabled entirely with MMORE_PRIVACY_MODEL_CACHE=0
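
The design described in the list above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: only DetectionConfig and the two environment variable names come from the description. The confidence_threshold field is assumed from the demo, and ModelRegistry, TOOL_REGISTRY, register_tool, ToyEngine, and the 4096 MB fallback are hypothetical (the PR derives its default budget from device memory).

```python
import os
from collections import OrderedDict
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass(frozen=True)
class DetectionConfig:
    # Assumed field: the demo in this PR reports confidence_threshold = 0.4.
    confidence_threshold: float = 0.4


class ModelRegistry:
    """Hypothetical LRU cache of loaded models, bounded by a budget in MB."""

    def __init__(self, budget_mb: Optional[float] = None):
        env_budget = os.environ.get("MMORE_PRIVACY_MODEL_BUDGET_MB")
        if budget_mb is None and env_budget:
            budget_mb = float(env_budget)
        # Assumed fallback; the PR uses a fraction of device memory instead.
        self.budget_mb = 4096.0 if budget_mb is None else budget_mb
        self.enabled = os.environ.get("MMORE_PRIVACY_MODEL_CACHE", "1") != "0"
        self._models = OrderedDict()  # name -> (model, size_mb)

    def get(self, name: str, loader: Callable[[], Any], size_mb: float) -> Any:
        if not self.enabled:
            return loader()  # cache disabled: load on every call
        if name in self._models:
            self._models.move_to_end(name)  # mark as most recently used
            return self._models[name][0]
        model = loader()  # lazy: the first use triggers the load
        self._models[name] = (model, size_mb)
        # Evict least recently used entries until the budget is met,
        # always keeping the model that was just requested.
        while self.used_mb > self.budget_mb and len(self._models) > 1:
            self._models.popitem(last=False)
        return model

    @property
    def used_mb(self) -> float:
        return sum(size for _, size in self._models.values())

    def loaded(self) -> List[str]:
        return list(self._models)


# Hypothetical tool registry: tool name -> engine class.
TOOL_REGISTRY: Dict[str, type] = {}


def register_tool(name: str):
    """Class decorator that records an engine class under `name`."""
    def decorator(cls):
        TOOL_REGISTRY[name] = cls
        return cls
    return decorator


def _toy_model(text: str) -> List[dict]:
    # Stand-in "model": flags every digit with a fixed score.
    return [{"start": i, "end": i + 1, "label": "DIGIT", "score": 0.9}
            for i, ch in enumerate(text) if ch.isdigit()]


@register_tool("toy_engine")
class ToyEngine:
    """Stand-in engine; a real one would wrap GLiNER, Presidio, and so on."""

    def __init__(self, config: DetectionConfig, models: ModelRegistry):
        self.config = config
        self.models = models

    def detect(self, text: str) -> List[dict]:
        model = self.models.get("toy", lambda: _toy_model, size_mb=10)
        threshold = self.config.confidence_threshold
        return [s for s in model(text) if s["score"] >= threshold]
```

An agent would look an engine up by name, for example `TOOL_REGISTRY["toy_engine"](DetectionConfig(), ModelRegistry())`, and call `detect` on it. Passing the model size explicitly keeps the sketch independent of any particular framework's memory accounting.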

Dependencies / CI

  • The privacy extra gains new dependencies (gliner, presidio, spacy, dspy, and psutil for memory measurements)
  • A new, separate privacy-openai-filter extra (transformers>=5, peft) is introduced because these packages currently conflict with marker-pdf from the process extra (to be resolved once #191 is closed)

Tests

  • Unit tests for all four engines that use mocks so they run without downloading models
  • These tests are intentionally temporary and will be replaced by end-to-end tests once the full privacy multi-agent system is in place
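
A mock-based test of the kind described above might look like the following sketch. The engine class, its injected loader, and the `predict` method on the fake model are all hypothetical and only illustrate how a test can avoid a model download. The PR's own tests may be structured differently.

```python
from unittest.mock import MagicMock


class LazyEngine:
    """Hypothetical engine that defers model loading to an injected loader."""

    def __init__(self, load_model, threshold=0.4):
        self._load_model = load_model
        self._model = None
        self.threshold = threshold

    def detect(self, text):
        if self._model is None:  # lazy load on first use
            self._model = self._load_model()
        raw = self._model.predict(text)  # assumed method name on the model
        return [r for r in raw if r["score"] >= self.threshold]


def test_detect_filters_by_threshold_without_real_model():
    # The fake model returns canned predictions, so nothing is downloaded.
    fake_model = MagicMock()
    fake_model.predict.return_value = [
        {"start": 0, "end": 5, "label": "PERSON", "score": 0.9},
        {"start": 6, "end": 9, "label": "DATE", "score": 0.1},
    ]
    loader = MagicMock(return_value=fake_model)

    engine = LazyEngine(loader)
    spans = engine.detect("Linda 3/2")
    spans_again = engine.detect("Linda 3/2")

    assert [s["label"] for s in spans] == ["PERSON"]  # low score dropped
    assert spans_again == spans
    loader.assert_called_once()  # the model is loaded a single time
```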

Disclaimer: the large line counts in the diff come mostly from the new dependencies, i.e. from changes to the uv.lock file.


Demo

Input note (AI generated)

# Progress Note - Internal Medicine

Pt is a 58 yo M, goes by "Bobby," seen this AM on rounds (bed 4B, Tower 3).
Known to our service from the 3/2 admission - see prior note by Dr. Garcia.
Wife (Linda, reachable on her cell 617-555-0148, or at the house on Linwood
Ave) was at bedside overnight and is the HCP. Hx obtained partly from pt,
partly from the daughter who flew in from Austin.

Pt c/o "the same chest thing as before," denies fevers. Says he stopped the
metoprolol ~2 wks ago bc he "ran out and couldn't get through to the office."
Smokes ~1 ppd, quit date keeps changing.

Of note, records faxed over from Dr. R. Lee's office (St. Mary's, the one off
123 Main) list a different DOB than what we have - pt says 4/23/65 but the
face sheet says 04/23/1955, needs reconciling. MRN on the wristband (12345678)
matches the chart; the outside packet had 1245-6788 which is probably a
transcription error. Insurance still showing the old BCBS plan, member id
BCXY 99-88-77, though pt thinks he switched in Jan. Front desk left a vm at
555 867 5309 re: policy AB1234567.

Pt mentioned he emailed photos of the rash to "the dermatology guy" at
jsmith@hosp-derm.org last week, unclear which provider. Asked us to call his
brother (no name given, "he's a nurse over at the VA"). SSN partially visible
on a scanned form in the chart (xxx-xx-4321) - flagged to HIM.

A/P: 58M w/ CP, likely demand ischemia i/s/o med noncompliance. Will discuss
code status with pt and Linda. F/u cardiology (Dr. Maria Garcia, pager 12345)
after d/c. Tentative d/c 4/1, ride being arranged. Pt verbalized understanding,
quote: "just don't call me at work, my boss doesn't know."

GLiNER (nvidia/gliner-PII)

15 spans at confidence_threshold = 0.4

start end label score text
63 68 PERSON 0.859 Bobby
143 146 DATE 0.990 3/2
195 200 PERSON 0.990 Linda
224 236 PHONE 0.782 617-555-0148
381 387 LOCATION 0.942 Austin
640 650 LOCATION 0.987 St. Mary's
723 730 DATE 0.959 4/23/65
755 765 DATE 0.974 04/23/1955
808 816 MRN 0.990 12345678
964 977 INSURANCE_ID 1.000 BCXY 99-88-77
1147 1167 EMAIL 1.000 jsmith@hosp-derm.org
1334 1345 SSN 0.963 xxx-xx-4321
1467 1472 PERSON 0.992 Linda
1494 1506 PERSON 0.637 Maria Garcia
1546 1549 DATE 0.789 4/1

openai/privacy-filter

78 spans at confidence_threshold = 0.4

start end label score text
63 64 B-private_person 0.999 B
64 68 E-private_person 0.999 obby
176 179 B-private_person 1.000 Dr
179 180 I-private_person 1.000 .
180 187 E-private_person 1.000 Garcia
195 200 S-private_person 0.999 Linda
224 227 B-private_phone 1.000 617
227 228 I-private_phone 1.000 -
228 231 I-private_phone 1.000 555
231 232 I-private_phone 1.000 -
232 235 I-private_phone 1.000 014
235 236 E-private_phone 1.000 8
256 260 B-private_address 0.999 Lin
260 264 I-private_address 0.997 wood
264 265 I-private_address 0.999
265 268 E-private_address 0.995 Ave
618 621 B-private_person 1.000 Dr
621 622 I-private_person 1.000 .
622 624 I-private_person 1.000 R
624 625 I-private_person 1.000 .
625 629 E-private_person 1.000 Lee
723 724 B-private_date 0.928 4
724 725 I-private_date 0.852 /
725 727 I-private_date 0.859 23
727 728 I-private_date 0.817 /
728 730 E-private_date 0.787 65
755 757 B-private_date 0.784 04
757 758 I-private_date 0.775 /
758 760 I-private_date 0.728 23
760 761 I-private_date 0.662 /
761 764 I-private_date 0.633 195
764 765 E-private_date 0.620 5
808 811 B-account_number 0.998 123
811 814 I-account_number 0.995 456
814 816 E-account_number 0.995 78
964 966 B-account_number 1.000 BC
966 968 I-account_number 0.999 XY
968 969 I-account_number 1.000
969 971 I-account_number 0.999 99
971 972 I-account_number 0.999 -
972 974 I-account_number 0.999 88
974 975 I-account_number 0.997 -
975 977 E-account_number 0.999 77
1010 1014 S-private_person 0.494 Jan
1040 1043 B-private_phone 0.954 555
1043 1044 I-private_phone 0.894
1044 1047 I-private_phone 0.880 867
1047 1048 I-private_phone 0.929
1048 1051 I-private_phone 0.971 530
1051 1052 E-private_phone 0.947 9
1063 1066 B-account_number 0.992 AB
1066 1069 I-account_number 0.998 123
1069 1072 I-account_number 0.996 456
1072 1073 E-account_number 0.998 7
1147 1149 B-private_email 1.000 js
1149 1153 I-private_email 1.000 mith
1153 1154 I-private_email 1.000 @
1154 1155 I-private_email 1.000 h
1155 1158 I-private_email 1.000 osp
1158 1159 I-private_email 1.000 -
1159 1162 I-private_email 1.000 der
1162 1163 I-private_email 1.000 m
1163 1167 E-private_email 1.000 .org
1334 1337 B-account_number 0.977 xxx
1337 1338 I-account_number 0.991 -
1338 1340 I-account_number 0.950 xx
1340 1341 I-account_number 0.892 -
1341 1344 I-account_number 0.750 432
1344 1345 E-account_number 0.851 1
1466 1472 S-private_person 0.561 Linda
1490 1492 B-private_person 1.000 Dr
1492 1493 I-private_person 1.000 .
1493 1499 I-private_person 1.000 Maria
1499 1506 E-private_person 1.000 Garcia
1507 1513 B-account_number 0.993 pager
1513 1514 I-account_number 0.978
1514 1517 I-account_number 0.945 123
1517 1519 E-account_number 0.978 45
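
The openai/privacy-filter rows above are token-level predictions with B-/I-/E-/S- prefixes, which is why this engine reports 78 spans. A minimal sketch of how such tags could be merged into entity-level spans follows. The PR does not state whether or how it merges them, so `merge_bioes` and the choice of the minimum token score as the span score are assumptions.

```python
from typing import Dict, List


def merge_bioes(tokens: List[Dict]) -> List[Dict]:
    """Merge token-level BIOES predictions into entity-level spans.

    Each token is a dict with start, end, label (such as "B-private_phone"),
    and score. The merged score is the minimum token score, an assumed and
    deliberately conservative choice.
    """
    spans = []
    current = None
    for tok in tokens:
        prefix, _, entity = tok["label"].partition("-")
        if not entity:  # untagged token such as "O": close any open span
            current = None
            continue
        starts_new = (
            prefix in ("B", "S")
            or current is None
            or current["label"] != entity
        )
        if starts_new:
            # An unfinished previous span (no E tag seen) is dropped here.
            current = {"start": tok["start"], "end": tok["end"],
                       "label": entity, "score": tok["score"]}
        else:
            current["end"] = tok["end"]
            current["score"] = min(current["score"], tok["score"])
        if prefix in ("E", "S"):
            spans.append(current)
            current = None
    return spans


# Rows copied from the table above: one phone number and one date.
phone_and_date = [
    {"start": 224, "end": 227, "label": "B-private_phone", "score": 1.000},
    {"start": 227, "end": 228, "label": "I-private_phone", "score": 1.000},
    {"start": 228, "end": 231, "label": "I-private_phone", "score": 1.000},
    {"start": 231, "end": 232, "label": "I-private_phone", "score": 1.000},
    {"start": 232, "end": 235, "label": "I-private_phone", "score": 1.000},
    {"start": 235, "end": 236, "label": "E-private_phone", "score": 1.000},
    {"start": 723, "end": 724, "label": "B-private_date", "score": 0.928},
    {"start": 724, "end": 725, "label": "I-private_date", "score": 0.852},
    {"start": 725, "end": 727, "label": "I-private_date", "score": 0.859},
    {"start": 727, "end": 728, "label": "I-private_date", "score": 0.817},
    {"start": 728, "end": 730, "label": "E-private_date", "score": 0.787},
]
merged = merge_bioes(phone_and_date)
```

On these rows the merge yields two spans, 224-236 (private_phone) and 723-730 (private_date), which line up with the offsets the other engines report for the same phone number and date.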

Presidio + custom clinical recognizers

26 spans at confidence_threshold = 0.4

start end label score text
63 68 PERSON 0.850 Bobby
103 110 LOCATION 0.850 Tower 3
181 187 PERSON 0.850 Garcia
195 200 PERSON 0.850 Linda
224 236 PHONE_NUMBER 0.750 617-555-0148
381 387 LOCATION 0.850 Austin
480 487 DATE_TIME 0.850 wks ago
623 631 PERSON 0.850 R. Lee's
640 650 PERSON 0.850 St. Mary's
723 730 DATE_TIME 0.600 4/23/65
755 765 DATE_TIME 0.600 04/23/1955
755 765 HOSPITAL_DATE 0.600 04/23/1955
808 816 MRN 0.750 12345678
860 869 DATE_TIME 0.850 1245-6788
1011 1015 DATE_TIME 0.850 Jan.
1040 1052 PHONE_NUMBER 0.400 555 867 5309
1064 1073 INSURANCE_ID 1.000 AB1234567
1064 1073 US_DRIVER_LICENSE 0.650 AB1234567
1147 1167 EMAIL_ADDRESS 1.000 jsmith@hosp-derm.org
1154 1167 URL 0.500 hosp-derm.org
1168 1177 DATE_TIME 0.850 last week
1274 1276 LOCATION 0.850 VA
1467 1472 PERSON 0.850 Linda
1494 1506 PERSON 0.850 Maria Garcia
1514 1519 DATE_TIME 0.850 12345
1529 1544 PERSON 0.850 c. Tentative d/
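
Several Presidio rows above cover the same characters with different labels (for example AB1234567 as both INSURANCE_ID and US_DRIVER_LICENSE, and the email address with a nested URL). The PR does not say whether overlaps are resolved. The sketch below shows one common greedy approach, keeping the highest-scoring span in each overlapping group. `resolve_overlaps` and its tie-breaking rules are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int, str, float]  # (start, end, label, score)


def resolve_overlaps(spans: List[Span]) -> List[Span]:
    """Greedily keep the best span among overlapping candidates.

    Candidates are visited by descending score, then descending length,
    then ascending start. A span is kept only if it does not overlap any
    span already kept. Remaining ties fall back to input order.
    """
    ordered = sorted(spans, key=lambda s: (-s[3], -(s[1] - s[0]), s[0]))
    kept: List[Span] = []
    for span in ordered:
        if all(span[1] <= k[0] or span[0] >= k[1] for k in kept):
            kept.append(span)
    return sorted(kept, key=lambda s: s[0])


# Rows copied from the Presidio table above.
candidates = [
    (755, 765, "DATE_TIME", 0.600),
    (755, 765, "HOSPITAL_DATE", 0.600),
    (1064, 1073, "INSURANCE_ID", 1.000),
    (1064, 1073, "US_DRIVER_LICENSE", 0.650),
    (1147, 1167, "EMAIL_ADDRESS", 1.000),
    (1154, 1167, "URL", 0.500),
]
resolved = resolve_overlaps(candidates)
```

With these rows the lower-scored US_DRIVER_LICENSE and URL candidates are dropped. The DATE_TIME and HOSPITAL_DATE tie is broken by input order, which a real system would likely replace with a label priority.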

LLM Qwen/Qwen2.5-7B-Instruct via DSPy

21 spans at confidence_threshold = 0.4

start end label score text
63 68 PERSON 0.800 Bobby
143 146 DATE 0.800 3/2
177 187 PERSON 0.800 Dr. Garcia
195 200 PERSON 0.800 Linda
224 236 PHONE 0.950 617-555-0148
466 476 MEDICATION 0.800 metoprolol
478 483 DURATION 0.800 2 wks
619 629 PERSON 0.800 Dr. R. Lee
640 650 LOCATION 0.800 St. Mary's
664 672 LOCATION 0.800 123 Main
808 816 MRN 0.950 12345678
860 869 MRN 0.950 1245-6788
964 977 INSURANCE_ID 0.950 BCXY 99-88-77
1040 1052 PHONE 0.950 555 867 5309
1147 1167 EMAIL 0.950 jsmith@hosp-derm.org
1274 1276 LOCATION 0.800 VA
1334 1345 SSN 0.950 xxx-xx-4321
1490 1506 PERSON 0.800 Dr. Maria Garcia
1508 1519 PHONE 0.800 pager 12345
1546 1549 DATE 0.800 4/1
1631 1635 LOCATION 0.800 work

JCHAVEROT added 20 commits June 26, 2026 16:31
(cherry picked from commit d8503b1)
(cherry picked from commit ec9126f)
(cherry picked from commit 8be4dc8)
(cherry picked from commit 53e384b)
(cherry picked from commit fd49131)
JCHAVEROT self-assigned this Jun 26, 2026
JCHAVEROT added the enhancement and dependencies labels Jun 26, 2026
JCHAVEROT linked an issue Jun 26, 2026 that may be closed by this pull request

Development

Successfully merging this pull request may close these issues.

Detection engines to flag sensible data
