feat(spam): capture moderated comments as training samples (axis-2 L2) #4846
Merged
L2 of the spam-data-retention roadmap: emit de-identified labeled samples to SQS at the moderation boundary so the spam-model training signal survives later content deletion that L1's passive DB extraction can't recover: clearCommunityWatchOriginalContent nulls the snapshot, and account purge erases content.

- common/notifications/spamSample.ts: enqueueSpamSample, mirrors enqueueReportAlert (best-effort SQS, never throws, no-op when unconfigured). Ids are HMAC-SHA256(salt) at emit so no raw user/content ids enter the queue; only the text the model trains on is carried verbatim.
- wired: communityWatchRemoveComment (confirmed spam at removal), clearCommunityWatchOriginalContent (capture before the snapshot is nulled; reversed action -> hard-negative ham).
- env: MATTERS_AWS_SPAM_SAMPLE_QUEUE_URL, MATTERS_SPAM_SAMPLE_HASH_SALT.

A separate Lambda worker consumes the queue and appends de-identified rows to the S3 training bucket (see spam-detection-scaffold). Off until ops provisions the queue + salt.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
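The labeling rule described in this commit (removal means confirmed spam; a reversed action captured at clear time means hard-negative ham) might look roughly like the sketch below. The `ModerationEvent` and `LabeledSample` shapes, the `reversed` flag, and the `toLabeledSample` helper are hypothetical names chosen for illustration and are not taken from spamSample.ts. Treating a non-reversed clear as spam, and dropping the score for ham, are assumptions.

```typescript
// Hypothetical shapes for illustration; the real types in spamSample.ts may differ.
type SpamLabel = 'spam' | 'ham'

interface ModerationEvent {
  // Which mutation produced the event.
  source: 'communityWatchRemoveComment' | 'clearCommunityWatchOriginalContent'
  // Assumed flag: true when the community-watch action was reversed.
  reversed: boolean
  // The comment text captured before any snapshot is nulled.
  text: string
  // Assumed: a classifier score, if one was recorded for the comment.
  spamScore: number | null
}

interface LabeledSample {
  label: SpamLabel
  text: string
  score: number | null
}

function toLabeledSample(event: ModerationEvent): LabeledSample {
  // Only a reversed action seen at clear time counts as ham;
  // everything else is treated as spam (an assumption of this sketch).
  const isHam =
    event.source === 'clearCommunityWatchOriginalContent' && event.reversed

  return {
    label: isHam ? 'ham' : 'spam',
    text: event.text,
    // Assumption: ham samples carry no score.
    score: isHam ? null : event.spamScore,
  }
}
```

Keeping the mapping in one pure function would let both mutations share the rule and make it testable without touching SQS.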
added 2 commits
June 14, 2026 14:52
CI lint failed: #-subpath/external/node: imports are one alphabetized group with no blank lines. Reorder spamSample.ts imports accordingly.
…shot)
The clear mutation now snapshots the action before nulling (axis-2 L2), so its test context must provide findCommunityWatchActionByUUID. enqueueSpamSample no-ops without queue/salt env, so nothing is sent in tests.
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@             Coverage Diff              @@
##           develop    #4846      +/-   ##
===========================================
- Coverage    73.04%   72.59%   -0.46%
===========================================
  Files         1066     1067       +1
  Lines        21659    21178     -481
  Branches      4816     4623     -193
===========================================
- Hits         15821    15374     -447
+ Misses        5357     5326      -31
+ Partials       481      478       -3
added 5 commits
June 14, 2026 22:03
Mirror reportAlert.test.ts: payload shape, HMAC de-identification (ids hashed, never raw + deterministic), null score for ham, and no-op guards (queue unset / salt unset / blank text) + AWS-error swallowing. Brings spamSample.ts diff coverage to green.
CI test scripts only run build/{connectors,common/utils,routes,types}; the
common/notifications dir has no script, so the standalone spamSample.test.ts
never ran and spamSample.ts stayed at 38%. Remove that dead test and instead
exercise enqueueSpamSample's full body from communityWatchRemoveComment.test
(common/utils, which IS run): set the queue URL + hash salt, stub
aws.sqsSendMessage, and assert a de-identified sample (hashed ids) is enqueued
on removal.
…-sample-capture
# Conflicts:
#	.env.example
… project)
spamSample.ts was at 76.9% (lines 66, 84-85 uncovered). Add two removal cases: aws throws -> removal still succeeds (covers the swallow/catch); removed comment has blank content -> sample skipped (covers the blank-text guard). Brings the file to ~full coverage so codecov/project no longer dips.
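The two guards these test cases exercise (swallow sender errors, skip blank text), together with the no-op when queue or salt is unset, can be sketched as a best-effort wrapper. This is a minimal illustration, not the code in spamSample.ts: `enqueueBestEffort`, `EnqueueDeps`, and the injected `sendMessage` callback are hypothetical, and the real function presumably reads the env vars and calls `aws.sqsSendMessage` directly.

```typescript
// Hypothetical sender signature; stands in for the real SQS call.
type SendMessage = (args: {
  queueUrl: string
  messageBody: string
}) => Promise<unknown>

interface EnqueueDeps {
  queueUrl?: string // e.g. the value of MATTERS_AWS_SPAM_SAMPLE_QUEUE_URL
  hashSalt?: string // e.g. the value of MATTERS_SPAM_SAMPLE_HASH_SALT
  sendMessage: SendMessage
}

// Returns true only when a message was actually sent. Never throws.
async function enqueueBestEffort(
  deps: EnqueueDeps,
  sample: { text: string }
): Promise<boolean> {
  // No-op guard: the feature stays off until both values are provisioned.
  if (!deps.queueUrl || !deps.hashSalt) return false

  // Blank-text guard: nothing for the model to train on.
  if (sample.text.trim() === '') return false

  try {
    await deps.sendMessage({
      queueUrl: deps.queueUrl,
      messageBody: JSON.stringify(sample),
    })
    return true
  } catch (_err) {
    // Swallow: a failed enqueue must not fail the moderation action.
    return false
  }
}
```

Injecting the sender is a choice made for this sketch so the guards can be checked with a stub, much as the commit stubs `aws.sqsSendMessage` in communityWatchRemoveComment.test.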
Total repo coverage fluctuates run-to-run (sharded integration suites) and codecov compares against the nearest ancestor with a coverage upload (develop merge commits publish none), so PRs show spurious project drops even when their own diff is 100% covered (e.g. #4846 at -0.46%). Add a 1% project threshold to absorb that noise; patch stays strict so new code must still be tested.
This was referenced Jun 15, 2026
What
L2 of the spam-data-retention roadmap. Emits de-identified labeled samples to SQS at the moderation boundary, so the spam-model training signal survives content deletion that L1's passive DB extraction can't recover:
- clearCommunityWatchOriginalContent nulls the snapshot,
- account purge erases content.
A separate Lambda worker (in spam-detection-scaffold/workers/spam_sample_worker.py) consumes the queue and appends to the S3 training bucket.
How
- common/notifications/spamSample.ts: enqueueSpamSample, mirrors enqueueReportAlert: best-effort SQS, never throws, no-op when the queue/salt is unconfigured. Comment/author ids are HMAC-SHA256(salt) at emit, so no raw ids ever enter the queue; only the text the model trains on is carried verbatim.
- communityWatchRemoveComment → confirmed spam at removal time.
- clearCommunityWatchOriginalContent → capture before the snapshot is nulled; a reversed action ⇒ hard-negative ham.
- env: MATTERS_AWS_SPAM_SAMPLE_QUEUE_URL, MATTERS_SPAM_SAMPLE_HASH_SALT.
Privacy / governance
De-identified at source (HMAC ids, no contactable PII); only model-relevant text retained. Legal cleared the retention direction (2026-06-13); concrete retention period TBD. The training corpus lives in a separate S3 bucket decoupled from the operational DB so account deletion still fully removes user-facing content.
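As a rough illustration of de-identification at source, the sketch below replaces ids with HMAC-SHA256 digests and leaves the text untouched. It assumes the salt is used as the HMAC key, which is one reading of "HMAC-SHA256(salt)"; the `hashId` and `deidentify` helpers and the field names are hypothetical and not taken from spamSample.ts. Only `createHmac` from Node's built-in crypto module is a real API here.

```typescript
import { createHmac } from 'node:crypto'

// Keyed hash of an id. Assumption: the salt serves as the HMAC key.
function hashId(id: string, salt: string): string {
  return createHmac('sha256', salt).update(id).digest('hex')
}

// Hypothetical field names, for illustration only.
interface RawSample {
  commentId: string
  authorId: string
  text: string
}

interface DeidentifiedSample {
  commentHash: string
  authorHash: string
  text: string
}

function deidentify(raw: RawSample, salt: string): DeidentifiedSample {
  return {
    // Same id + same salt always gives the same digest, so rows for one
    // comment or author can still be grouped without exposing the raw id.
    commentHash: hashId(raw.commentId, salt),
    authorHash: hashId(raw.authorId, salt),
    // Only the text the model trains on is carried verbatim.
    text: raw.text,
  }
}
```

Because the digest is keyed, someone holding only the training bucket cannot recompute the mapping from ids to hashes without the salt, which is the property that keeps the corpus decoupled from the operational DB.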
Off by default
No-op until ops provisions the SQS queue + MATTERS_SPAM_SAMPLE_HASH_SALT and deploys the worker. Zero behavior change otherwise.
Test plan
🤖 Generated with Claude Code