מדד האכיפה - Enforcement Index
denbust is evolving from a single-purpose news scanner into a small multi-dataset platform for TFHT
data jobs. Phase A introduced the shared platform spine. Phase B turns the first real dataset,
news_items, into an end-to-end operational flow with:
- normalized metadata records
- Supabase operational persistence
- privacy/review/suppression gating
- weekly public release bundle generation
- publication hooks for Kaggle and Hugging Face
- latest-backup upload hooks for Google Drive and S3-compatible object storage
Today, the implemented dataset/jobs are:
news_items / discover(source-native + Brave + Exa + Google CSE candidate persistence, plus taxonomy-targeted and Facebook-targeted social discovery queries)news_items / backfill_discovernews_items / backfill_scrapenews_items / ingestnews_items / monthly_reportnews_items / releasenews_items / backup
Planned future datasets:
docs_metadataopen_docs_fulltextevents
- Scans Israeli news sources for enforcement activity: raids, arrests, closures, trafficking cases
- Uses RSS and browser-backed scrapers
- Classifies relevance with an LLM
- Deduplicates the same story across multiple sources
- Emits unified items via CLI or SMTP email for the ingest workflow
- Persists normalized
news_itemsoperational rows - Builds metadata-only weekly release bundles
- Builds monthly Markdown/JSON report bundles for public-facing TFHT reporting
- Publishes release bundles to Kaggle and Hugging Face when configured
- Uploads the latest release bundle to Google Drive and S3-compatible object storage when configured
- Persists dataset/job-scoped seen state and per-run JSON snapshots
- Scaffolds a persistent discovery/candidacy layer with dedicated Supabase tables and state-repo
paths under
news_items/discover/ - Runs Brave, Exa, and Google CSE as external discovery engines feeding the durable candidate layer
- Adds taxonomy-targeted discovery queries from the packaged TFHT taxonomy for broader search-engine recall
- Adds source-targeted taxonomy queries for each configured news domain so search-backed discovery and capped historical backfill can compensate when source-native pages produce zero recent matches
- Expands source-native recall terms for Walla archive filtering and ICE search so diagnostics and discovery exercise targeted Hebrew phrases beyond the coarse operator keyword sample
- Protects the Ynet source-targeted taxonomy search path with fixture-backed recall coverage for a known February 12, 2026 article, including candidate provenance and pre-classification handoff
- Keeps Ynet RSS as the primary משפט ופלילים source while adding a non-browser category-page backstop for source-native recall
- Emits Facebook-targeted
social_targetedsearch queries and retains those results as non-scrapeable reference candidates - Plans historical backfill windows and persists durable
backfill_batchesmetadata for slow-drain discovery/scrape work - Drains one historical backfill batch at a time with oldest-window-first scrape prioritization
- Refreshes backfill batch aggregate counts through the discovery persistence layer, including Supabase exact-count requests, without materializing full candidate rows in the pipeline
- Provides an opt-in
agents/news/local_search.yamlconfig for local source-native + Brave + Exa- Google CSE discovery experiments
- Provides
agents/news/local_search_brave_exa.yamlfor local Brave+Exa wet tests when Google CSE returns403 PERMISSION_DENIED/ no API access; Google CSE remains supported in code and in the full local search config - Retains obvious non-article search-result noise as provenance while marking social profiles,
app-store detail pages, dictionary, translation, and reference-utility candidates
unsupported_sourcebefore they can consume scrape-drain budget; post-like social URLs and non-app store paths remain eligible, and diagnostics expose durable filter reason counts - Writes discovery overlap/queue/conversion diagnostics artifacts and exposes
denbust diagnose-discovery - Reports queue-drain selection order, attempted and remaining eligible source mix, configured candidate cap, persisted scrape-attempt count, and inferred stop reason for bounded candidate scrape passes
- Breaks down partial-page diagnostics so operators can distinguish retained candidate-fallback operational rows, metadata-only partials, search-result-only generic fetch failures, source vs generic partial attempts, generic partials after source-adapter attempts, dominant partial domains/sources, and visible current-candidate classifier/taxonomy risk signals, including low-confidence fallback counts by source, domain, taxonomy label, and confidence field
- Persists run-level classifier parser warning counts in run debug summaries under
classifier_summary.warning_counts, including JSON parse failures and invalid taxonomy pairs, and exposes fallback-only scrape/backfill context underfallback_classifier_summary, without changing classifier prompts, taxonomy policy, or scrape selection behavior - Documents the first post-persistence warning-count interpretation pass: the bounded January 1-7
Brave+Exa/no-Google scrape/backfill run saw 4 parse failures and 1 invalid taxonomy pair across
100 fallback classifier inputs, and compact summaries now retain
fallback_classifier_summaryfor future fallback-only drains - Keeps classifier output robustness evidence-driven: the retained Phase C parse-failure artifacts do not expose raw malformed response shapes, so representative non-JSON outputs are covered only as current rejection-policy examples and parser recovery is deferred until sanitized shape evidence is persisted safely
- Persists sanitized classifier parse-failure shape diagnostics in run debug summaries under
classifier_summary.parse_failure_diagnosticsand, for fallback-only scrape/backfill drains,fallback_classifier_summary.parse_failure_diagnostics; compact summaries retain the same bounded category counts, sanitized JSON error kinds, and structural samples without storing raw classifier response text, article text, secrets, or generated data artifacts - Interprets the first post-capture bounded Phase C parse-failure evidence pass: five fallback
parse failures across 100 fallback classifier inputs were all
object_like_non_json, with one-line, no-code-fence samples andmissing_property_nameat line 1 column 2, so parser recovery stayed deferred until a follow-up evidence pass could prove recovery safety - Interprets the follow-up parse-failure structure evidence pass: the same five-failure pattern was consistently balanced double-wrapped valid-inner-JSON classifier output, so the next parser slice can add fixture-backed recovery for that exact shape while keeping prompts, taxonomy policy, queue behavior, scraper behavior, source support, and generated-data boundaries unchanged
- Recovers only the proven classifier double-wrapper shape where the normalized response starts
with exactly two opening object braces, ends with exactly two closing object braces, has balanced
braces outside strings, and trimming one outer wrapper exposes a valid JSON object; recovered
objects still use the normal taxonomy validation path and increment
double_wrapper_recovery_count, while pseudo-JSON, unbalanced wrappers, non-object JSON, code-fenced malformed JSON, and invalid taxonomy pairs remain rejected - Interprets the first post-recovery bounded Phase C evidence pass: four double-wrapper recoveries replaced the prior five-parse-failure pattern with zero parse failures across 100 fallback classifier inputs, invalid taxonomy warnings stayed at one, and no retained fallback row carried an invalid taxonomy pair. One retained low-confidence partial fallback row had only a legacy category and no TFHT taxonomy leaf, so parser-output hardening pauses while the next Phase C classifier item investigates fallback retention without usable taxonomy.
- Keeps source-suggestion scrape diagnostics evidence-driven by reporting generic partial recoveries separately from definite scrape failures, without otherwise changing source-suggestion ranking
- Reports self-heal-eligible candidate backlog and structured scrape-failure groups for future repair workflows, without running automatic AI repair or selector rewriting
- Writes source-suggestion diagnostics artifacts for repeated unseen non-social domains
- Flags candidate-only
sport1.maariv.co.ilpressure in source-suggestion diagnostics without changing scrape eligibility or Maariv source-family support - Lints the tracked classifier validation CSV with
denbust validation-lint - Shares validation taxonomy/category/index-relevance row-integrity rules between
denbust validation-lintand reviewed-row finalize/import - Runs tracked classifier/live-source scenarios with
denbust live-check - Runs scheduled GitHub Actions discovery separately from candidate-driven ingest
- Exposes manual GitHub Actions workflows for historical
backfill_discoverandbackfill_scrape - Reviews the latest daily ingest artifacts and can open GitHub issues for suspicious runs
pip install -e ".[dev]"
python -m playwright install chromium
denbust scan --config agents/news/local.yaml
DENBUST_BACKFILL_DATE_FROM=2026-01-23T00:00:00+00:00 \
DENBUST_BACKFILL_DATE_TO=2026-04-22T23:59:59+00:00 \
denbust run --dataset news_items --job backfill_discover --config agents/news/local.yaml
DENBUST_BACKFILL_BATCH_ID=batch-optional \
denbust run --dataset news_items --job backfill_scrape --config agents/news/local.yaml
denbust report monthly --month 2026-03 --config agents/news/local.yaml
denbust diagnose-discovery --config agents/news/local.yaml
denbust validation-lint --validation-set validation/news_items/classifier_validation.csv
denbust release --config agents/release/news_items.yaml
denbust backup --config agents/backup/news_items.yamlTo send reports by email, set output.format: email in your config and provide SMTP env vars
from .env.example.
Mako scraping uses a headless Chromium browser. After installing dependencies on a new machine, run
python -m playwright install chromium once before your first live scan.
Phase A introduces explicit dataset and job identity in config and run snapshots:
dataset_namejob_name
Current defaults remain:
dataset_name: news_itemsjob_name: ingest
denbust scan is preserved as a compatibility alias for news_items / ingest.
Future-facing commands now exist as real news_items jobs:
denbust run --dataset news_items --job discover --config agents/news/local.yaml
denbust run --dataset news_items --job backfill_discover --config agents/news/local.yaml
denbust run --dataset news_items --job backfill_scrape --config agents/news/local.yaml
denbust run --dataset news_items --job scrape_candidates --config agents/news/local.yaml
denbust run --dataset news_items --job ingest --config agents/news/local.yaml
denbust run --dataset news_items --job monthly_report --config agents/news/local.yaml
denbust release --dataset news_items --config agents/release/news_items.yaml
denbust backup --dataset news_items --config agents/backup/news_items.yamlLocal runs now use dataset/job-namespaced defaults under the repo-local state root:
- seen store:
data/news_items/ingest/seen.json - run snapshots:
data/news_items/ingest/runs/ - publication scaffold dir:
data/news_items/ingest/publication/
Example:
denbust scan --config agents/news/local.yamlYou can override the persistence layout without changing YAML by setting:
DENBUST_STATE_ROOTDENBUST_STORE_PATHDENBUST_RUNS_DIR
Precedence rules:
DENBUST_STORE_PATH/DENBUST_RUNS_DIRDENBUST_STATE_ROOT- explicit YAML store paths /
store.state_root - local default root
data/
Scheduled GitHub Actions runs use this repo as the code runner and a separate repo,
tfht_enforce_idx_state, as the canonical mutable state store.
The workflow family:
- checks out this repo
- checks out the state repo into
state_repo/ - sets dataset/job env such as
DATASET_NAME=news_itemsandJOB_NAME=discoveroringest - runs
denbust run --dataset news_items --job <job> --config agents/news/github.yaml - points persistence at the checked-out state repo via
DENBUST_STATE_ROOT=state_repo - commits and pushes the updated namespaced state files only if files changed
Operational workflow ownership after DL-PR-11:
news-items-discoverowns scheduled discovery writes undernews_items/discover/daily-state-runandweekly-state-runown candidate-driven ingest undernews_items/ingest/news-items-backfill-discoverandnews-items-backfill-scrapeare manual operator workflows for historical recoverynews-items-daily-reviewstill followsdaily-state-run
Required secrets for GitHub-run mode:
ANTHROPIC_API_KEYSTATE_REPO_PATDENBUST_SUPABASE_URLDENBUST_SUPABASE_SERVICE_ROLE_KEYDENBUST_EMAIL_SMTP_HOSTDENBUST_EMAIL_SMTP_PORTDENBUST_EMAIL_SMTP_USERNAMEDENBUST_EMAIL_SMTP_PASSWORDDENBUST_EMAIL_FROMDENBUST_EMAIL_TODENBUST_EMAIL_USE_TLSDENBUST_EMAIL_SUBJECT
Optional discovery-engine secrets:
DENBUST_BRAVE_SEARCH_API_KEYDENBUST_EXA_API_KEYDENBUST_GOOGLE_CSE_API_KEYDENBUST_GOOGLE_CSE_ID
Expected tfht_enforce_idx_state structure:
tfht_enforce_idx_state/
└── news_items/
├── ingest/
│ ├── seen.json
│ ├── runs/
│ ├── logs/
│ └── publication/
├── release/
│ ├── runs/
│ └── publication/
├── monthly_report/
│ ├── runs/
│ └── publication/
└── backup/
├── runs/
└── publication/
The new discovery/candidacy foundation adds a separate candidate-layer namespace alongside the existing ingest/release/backup state:
tfht_enforce_idx_state/
└── news_items/
└── discover/
├── runs/
├── candidates/
│ ├── backfill_queue.jsonl
│ ├── latest_candidates.jsonl
│ ├── retry_queue.jsonl
│ └── scrape_attempts.jsonl
├── backfill_batches/
│ ├── latest_backfill_batches.jsonl
│ └── <batch-id>.json
└── metrics/
├── engine_overlap_latest.json
├── discovery_diagnostics_latest.json
└── source_suggestions_latest.json
Bootstrap notes:
seen.jsonmay be absent initially; it is created once a run marks at least one URL as seenruns/andpublication/directories are created automatically by the workflows when neededlogs/is created automatically once ingest debug artifacts are written- a small
README.mdin the state repo is fine but optional - see docs/discovery_operations.md for the detailed discover / ingest / backfill runbook and migration checklist
- the one-time
C-8catch-up path is a manual 90-daybackfill_discoverplusbackfill_scraperun using the existing 7-day backfill window slicing
Phase A introduces shared platform primitives so future dataset jobs can reuse them:
src/denbust/models/- dataset/job identity
- run snapshot model
- policy enums for rights, privacy, review, and publication status
src/denbust/store/state_paths.py- centralized dataset/job state path resolution
src/denbust/datasets/- explicit dataset/job registry
src/denbust/ops/storage.py- operational-store abstraction with a null implementation
src/denbust/publish/- release/export abstractions
- backup abstractions
What is implemented now:
news_items / ingest- live source ingestion
- canonical URL normalization
- LLM relevance classification
- one-sentence summary generation
- privacy/review/publication/takedown status assignment
- operational persistence through the configured store
news_items / release- reads operational rows
- filters to publicly releasable metadata-only rows
- writes
news_items.parquet,news_items.csv,MANIFEST.json,SCHEMA.json,SCHEMA.md,README.md, andchecksums.txt - publishes to Kaggle and Hugging Face when configured
news_items / backup- finds the latest release bundle
- uploads it to Google Drive and S3-compatible object storage when configured
Still intentionally deferred:
- additional dataset implementations beyond
news_items - richer human review tooling / admin UI
- more advanced privacy policies beyond the current pragmatic gate
DL-PR-06 now builds on the earlier discovery milestones:
src/denbust/discovery/with durable candidate, provenance, scrape-attempt, and discovery-run models- config sections for
discovery,source_discovery,candidates, andbackfill - durable
backfill_batchesmetadata mirrored to the state repo and Supabase - explicit state-repo path helpers for candidate-layer snapshots and queue files
- source-native candidate normalization and merge/upsert persistence
- a real
news_items / discoverjob that persists source-native candidates and Brave-discovered plus Exa-discovered and Google CSE-discovered candidates into the same durable substrate - Brave query building for broad and source-targeted discovery searches
- taxonomy-targeted query building from packaged TFHT discovery terms
- capped source-targeted taxonomy query building for historical backfill via
backfill.max_source_targeted_taxonomy_queries_per_window - fixture-backed Ynet recall coverage for source-targeted taxonomy search candidates, durable candidate normalization, source-adapter materialization, and pre-classification ingest handling
- Exa query execution for the same broad and source-targeted discovery searches
- Google CSE query execution for the same broad and source-targeted discovery searches
- dedicated
DENBUST_BRAVE_SEARCH_API_KEY,DENBUST_EXA_API_KEY, andDENBUST_GOOGLE_CSE_API_KEY/DENBUST_GOOGLE_CSE_IDconfiguration paths for external discovery engines - candidate selection / queueing helpers for retryable scrape work
- scrape-attempt persistence and candidate status transitions underneath the ingest path
- a real
news_items / scrape_candidatesjob that drains queued candidates into article ingest - a real
news_items / backfill_discoverjob that requiresDENBUST_BACKFILL_DATE_FROM/DENBUST_BACKFILL_DATE_TO, creates a durable batch record, and runs historical search-engine discovery plus capability-based source-native discovery - a real
news_items / backfill_scrapejob that drains one historical batch at a time through the existing scrape-to-ingest path - persistence-layer aggregate count API used by backfill batch status refreshes, with Supabase exact-count requests and state-repo streaming counts preserving existing aggregate semantics
news_items / ingestnow uses the candidate scrape layer for source-native candidates while preserving the existing direct-fetch convenience flow as a compatibility fallback- Supabase migrations for:
discovery_runspersistent_candidatesbackfill_batchescandidate_provenancescrape_attempts
The current daily monitoring flow remains operational, and the durable candidate substrate now accepts source-native discovery plus Brave, Exa, and Google CSE search results.
Local operators can use agents/news/local_search_brave_exa.yaml for wet tests when the Google
Programmable Search API is unavailable locally. That config disables only the Google CSE engine and
keeps Brave and Exa enabled; browser-backed scraping still uses DENBUST_BROWSER_MODE=chrome_cdp
and DENBUST_CHROME_CDP_URL=http://127.0.0.1:9222 when the run needs Chrome CDP.
DL-PR-08 extends that substrate with fallback retention for imperfect scraping:
- source-adapter successes now persist
content_basis = full_article_page - generic fetch fallback can retain
content_basis = partial_pagewhen page metadata is recoverable - failed full scrapes can retain
content_basis = search_result_onlyusing discovery metadata - retained fallback candidates can materialize provisional internal-only
news_itemsrows for monitoring and review, while staying excluded from public release by default - discovery diagnostics now report candidate-basis counts so operators can distinguish full-page, partial-page, and search-result-only retention
- Ynet source-health diagnostics now report separate RSS and category-page checks, distinguishing RSS low coverage, category HTTP failure, category parse-zero, and category keyword-zero cases
- source-health diagnostics now include a report-level
source_zero_summaryfor the Phase C 4+ hard affected-source guardrail, keep keyword-zero outcomes visible through separate report-level counts and per-source warnings, and include Makofailure_modedetails for missing Chromium, navigation timeout, context teardown, redirect/anti-bot, selector drift, parse-zero, and stale/keyword-zero outcomes - the 2026-05-03 Phase C source-health triage pass, run with an isolated
DENBUST_STATE_ROOTand Chromium installed before Mako probing, showed Mako and Haaretz healthy while the then-current 4-source guardrail still fired for Ynet, Walla, Maariv, and ICE; the #72 follow-up narrows that guardrail to hard source failures and makes #71/#74 duplicate or stale Mako runtime hygiene unless Mako regresses again - the 2026-05-03T13:13:10Z source-specific Mako follow-up reran the Chromium-backed probe under
data/may_26_followup/20260503T131309Z/state; Mako again returnedok, so #71/#74 are closed as stale/duplicate runtime hygiene without changing scraper behavior - PR #108 was squash-merged as
dea6406and left no open GitHub issue backlog; the post-#108 artifact-only reset underdata/may_26_followup/20260503T134102Z/statecorrectly showed an empty isolated diagnostics baseline rather than a new code defect - PR #109 was squash-merged as
201c247; the 2026-05-03 candidate-drain evidence pass underdata/may_26_followup/20260503T153123Z/statepersisted 63 candidates, spent all 30 scrape attempts on ICE, left 33 candidates from Haaretz, ICE, Maariv, Mako, and Walla never scraped, and produced no scrape failures, retry backlog, or self-heal backlog. Artifact-only source-health was inconclusive because that diagnostic path skippedscrape_candidatesdebug summaries - PR #110 was squash-merged as
8c89d91;diagnose-discoverynow emitsqueue_draindiagnostics for bounded candidate scrape passes, including persisted attempted order, actual-attempt source mix, remaining eligible order/source mix, configured candidate cap, persisted scrape-attempt count, and inferred stop reason, without changing queue prioritization or fairness behavior SRC-PR-GLOBES-THEMARKERadds bounded generic-fetch source-family support forglobes.co.ilandthemarker.com: diagnostics and fallback provenance now group search-discovered article URLs asglobes/themarker, article metadata and JSON-LD are preferred over page titles for partial metadata extraction, and source-targeted discovery/backfill queries now cover both domains while source-suggestion diagnostics can still surface weak conversion or candidate-only backlog. This is intentionally not a browser scraper or source-native discovery adapterSRC-PR-ISRAELHAYOMadds bounded generic-fetch source-family recognition for main-domainisraelhayom.co.ilarticle URLs: search-discovered Israel Hayom article URLs can be grouped asisraelhayom, while source-targeted discovery/backfill fanout, Israel Hayom subdomains, browser scrapers, source-native adapters, and queue-prioritization changes remain out of scope until stronger extraction evidence existsSRC-PR-KANadds low-confidence generic-fetch diagnostic labeling for official Kan news article paths underkan.org.il/content/kan-news/; this is not source-targeted fanout, a source-native adapter, a browser scraper, or broad support for non-articlekan.org.ilpages and unrelated Kan-named domainsSRC-PR-NEWS1adds low-confidence generic-fetch diagnostic labeling for main-domain News1 archive article paths undernews1.co.il/Archive/; this is not source-targeted fanout, a source-native adapter, a browser scraper, or broad support for non-archive News1 pagesCLASSIFIER-PR-WARNING-EVIDENCE-INTERPRETATIONrecords the first post-warning-counter Phase C interpretation pass under generated local rootdata/may_26_followup/20260514T182934Z/: 100 fallback classifier inputs produced 4 parse failures, 1 invalid taxonomy pair, no invalid legacy pairs, and no relevant-without-usable-taxonomy warnings. It also keepsfallback_classifier_summaryin compact run summaries going forward and sets the next PR to inspect parse-failure shape evidence before any bounded parser robustness change. It does not recommend prompt, taxonomy-policy, queue, scraper, or source-family changesDL-PR-12adds explicit future self-heal hooks while preserving current behavior:self_heal_eligibleis visible in queue diagnostics, failed scrape attempts are grouped by attempt kind/status/error/source/domain, source-adapter and generic-fetch failures carry stable failure-stage diagnostics, and code can select eligible failed candidates for a laterself_heal_retryorchestration pass without implementing AI repair.
A four-stage, fully-local, non-LLM-API-based filter cascade is planned for insertion between the discovery/triage layer and the Claude Sonnet relevance classifier, to drop high-confidence true negatives before they consume paid LLM budget. The cascade composes:
- Stage A: scored lexicon + Beta-Binomial domain reputation + URL heuristics (microseconds per candidate);
- Stage B: trained text classifier (Naive Bayes on character n-grams, with a SetFit option) (~5 ms per candidate);
- Stage C: multilingual sentence-embedding similarity using
intfloat/multilingual-e5-largewith centroid-cosine and FAISS-HNSW kNN-max-cosine signals (~30 ms per candidate); - Stage D: local SLM (
dicta-il/dictalm2.0-instructvia MLX) scored by token logprobs ofכן/לאat the answer slot (~800 ms per candidate).
Each stage is calibrated to ≥ 99% per-stage recall on a held-out validation set; the cascade
ships in mode: off by default, is promoted to mode: shadow in LPF-PR-09, and only flips to
mode: enforce after a documented 7-day shadow window meets the configured recall floor.
See docs/local_prefilter_cascade_design.md for the
design and docs/local_prefilter_cascade_implementation_plan.md
for the per-PR rollout plan (LPF-PR-01 through LPF-PR-10 for the core cascade plus optional
LPF-PR-11 Claude distillation and LPF-PR-12 active-learning / cluster-filter follow-ups).
Preferred checked-in config layout:
agents/
news/
local.yaml
github.yaml
release/
news_items.yaml
backup/
news_items.yaml
Backward-compatible shims are still present:
agents/news.yamlagents/news-github.yaml
Current intent:
agents/news/...drives ingest jobsagents/release/...drives release jobsagents/backup/...drives backup jobs
The release and backup commands still accept any compatible config path you pass explicitly, but
their default paths now point at dedicated config files instead of reusing the ingest config.
The current GitHub Actions layer is still news-items-first, but it is now parameterized around shared dataset/job env variables:
DATASET_NAMEJOB_NAMEJOB_CONFIG_PATHSTATE_JOB_DIR
This keeps the current scheduled news ingest behavior unchanged while making the workflow files easier to extend for future dataset/job combinations.
The public news_items dataset is metadata-only. Each public row contains:
- deterministic row id
- source name and source domain
- original and canonical URL
- publication and retrieval timestamps
- title
- category and sub-category
- one-sentence factual summary
- geographic fields when available
- organizations and topic tags
- rights / privacy / review / publication / takedown status
- release version
The public dataset intentionally excludes:
- article full text
- cached HTML
- page screenshots or snapshots
- private ingestion diagnostics
Rows are excluded from public release when they are:
- suppressed by a takedown/suppression rule
- marked
internal_only - still pending privacy review
- otherwise non-public under the shared policy enums
Each news_items release currently writes:
news_items.parquetas the canonical exportnews_items.csvMANIFEST.jsonSCHEMA.jsonSCHEMA.mdREADME.mdchecksums.txt
Release versions use a UTC date string such as 2026-03-22.
| Mode | Reads from | Writes to | External integrations |
|---|---|---|---|
| Local ingest | live sources + local seen store | local namespaced state + local JSON operational store (from agents/news/local.yaml) |
Anthropic, optional SMTP |
| GitHub ingest | live sources + shared state repo seen store | shared state repo + Supabase | Anthropic, Supabase, optional SMTP |
| Weekly release | Supabase news_items rows |
release bundle under news_items/release/publication + release run snapshot |
optional Kaggle, optional Hugging Face |
| Weekly backup | latest built release bundle under news_items/release/publication |
backup run snapshot under news_items/backup/runs |
optional Google Drive, optional S3-compatible object storage |
In other words:
- local ingest uses local state plus the local JSON operational store by default
- GitHub ingest uses the shared state repo plus Supabase
- release reads releasable rows from Supabase, builds the bundle, and only publishes to public targets when they are configured
- backup does not rebuild the release; it uploads the latest already-built bundle when backup targets are configured
ANTHROPIC_API_KEYDENBUST_SUPABASE_URLfor GitHub/Supabase-backed ingestDENBUST_SUPABASE_SERVICE_ROLE_KEYfor GitHub/Supabase-backed ingest- SMTP variables when email output is enabled
DENBUST_SUPABASE_URLDENBUST_SUPABASE_SERVICE_ROLE_KEYDENBUST_KAGGLE_DATASETto enable Kaggle publishingKAGGLE_USERNAMEKAGGLE_KEYDENBUST_HUGGINGFACE_REPO_IDto enable Hugging Face publishingHF_TOKEN
DENBUST_DRIVE_SERVICE_ACCOUNT_JSONDENBUST_DRIVE_FOLDER_IDDENBUST_OBJECT_STORE_BUCKETDENBUST_OBJECT_STORE_PREFIX(optional; defaults tonews_items/latest)DENBUST_OBJECT_STORE_ENDPOINT_URLDENBUST_OBJECT_STORE_ACCESS_KEY_IDDENBUST_OBJECT_STORE_SECRET_ACCESS_KEY
The Phase B integrations are intentionally split into:
- required backends for a given job
- optional targets that are only activated when explicitly configured
DENBUST_SUPABASE_URLandDENBUST_SUPABASE_SERVICE_ROLE_KEYare required for the checked-in GitHub ingest config and for the checked-in release config- the service-role key is used because ingest, release, and suppression-aware export assembly all need privileged operational access
- if the selected config uses
operational.provider: supabaseand these variables are missing, the job fails
- Kaggle publishing is activated only when
DENBUST_KAGGLE_DATASETis set - if
DENBUST_KAGGLE_DATASETis not set, the release job still builds the bundle and skips Kaggle publication - if
DENBUST_KAGGLE_DATASETis set butKAGGLE_USERNAMEorKAGGLE_KEYis missing, the release job fails
- Hugging Face publication is activated only when
DENBUST_HUGGINGFACE_REPO_IDis set - if
DENBUST_HUGGINGFACE_REPO_IDis not set, the release job still builds the bundle and skips Hugging Face publication - if
DENBUST_HUGGINGFACE_REPO_IDis set butHF_TOKENis missing, the release job fails
- Google Drive backup is activated when the backup config enables the target or when
DENBUST_DRIVE_FOLDER_IDis present - the checked-in backup config keeps the target disabled for local safety; in GitHub Actions the folder-id secret can activate it implicitly
- if the target is inactive, backup skips Google Drive cleanly
- if the target is active but
DENBUST_DRIVE_SERVICE_ACCOUNT_JSONis missing, the backup job fails
- object-storage backup is activated when the backup config enables the target or when
DENBUST_OBJECT_STORE_BUCKETis present DENBUST_OBJECT_STORE_PREFIXis optional and defaults tonews_items/latest- if the target is inactive, backup skips object storage cleanly
- if the target is active but
DENBUST_OBJECT_STORE_ACCESS_KEY_IDorDENBUST_OBJECT_STORE_SECRET_ACCESS_KEYis missing, the backup job fails
- release is considered successful if the bundle is built; skipped public targets are surfaced as warnings in logs/run snapshots
- backup is considered successful if the command completes; zero configured targets is treated as a warning, not a failure
- if a configured publication or backup target is missing required credentials, that target currently fails the job rather than silently skipping
The repository also includes a workflow that reviews the latest daily-state-run artifacts and can
open GitHub issues when the latest ingest looks suspicious:
news-items-daily-review.yml
It runs automatically after daily-state-run completes successfully, and it also supports
workflow_dispatch for manual review. It reads the latest matching files from:
news_items/ingest/runs/news_items/ingest/logs/
The workflow uses Anthropic to turn those artifacts into candidate engineering issues, then creates only new issues by embedding a hidden fingerprint marker in each issue body.
| Workflow | Required secrets | Optional secrets |
|---|---|---|
daily-state-run.yml / weekly-state-run.yml |
STATE_REPO_PAT, ANTHROPIC_API_KEY, DENBUST_SUPABASE_URL, DENBUST_SUPABASE_SERVICE_ROLE_KEY |
SMTP/email secrets if email output is enabled |
news-items-daily-review.yml |
STATE_REPO_PAT, ANTHROPIC_API_KEY |
DENBUST_REVIEW_MODEL, DENBUST_REVIEW_ISSUE_LABELS |
news-items-release.yml |
STATE_REPO_PAT, DENBUST_SUPABASE_URL, DENBUST_SUPABASE_SERVICE_ROLE_KEY |
DENBUST_KAGGLE_DATASET, KAGGLE_USERNAME, KAGGLE_KEY, DENBUST_HUGGINGFACE_REPO_ID, HF_TOKEN |
news-items-backup.yml |
STATE_REPO_PAT |
DENBUST_DRIVE_FOLDER_ID, DENBUST_DRIVE_SERVICE_ACCOUNT_JSON, DENBUST_OBJECT_STORE_BUCKET, DENBUST_OBJECT_STORE_PREFIX, DENBUST_OBJECT_STORE_ENDPOINT_URL, DENBUST_OBJECT_STORE_ACCESS_KEY_ID, DENBUST_OBJECT_STORE_SECRET_ACCESS_KEY |
The release and backup workflows both support workflow_dispatch for manual runs and weekly schedules for automated runs.
Recommended GitHub Environment mapping:
news-items-ingestfordaily-state-run.ymlandweekly-state-run.ymlnews-items-ingestfornews-items-daily-review.ymlnews-items-releasefornews-items-release.ymlnews-items-backupfornews-items-backup.yml
The code reads generic env vars at runtime, so the same variable names can safely have different values per GitHub Environment.
Phase B adds SQL migrations under:
supabase/migrations/
Apply the news_items migration before running Supabase-backed jobs. The schema includes:
news_itemsingestion_runsrelease_runsbackup_runssuppression_rules
suppression_rules is the minimal takedown/suppression path. Add rows there by canonical URL or row
id to block future public releases.
The checked-in local ingest config uses the local JSON operational store:
denbust scan --config agents/news/local.yaml
denbust release --config agents/release/news_items.yaml
denbust backup --config agents/backup/news_items.yamlTo run release locally without Supabase, either:
- switch
operational.providertolocal_jsonin the release config, or - provide a custom config path with that override
The checked-in GitHub ingest config uses the Supabase operational store:
agents/news/github.yaml
Release and backup jobs rely on dedicated configs:
agents/release/news_items.yamlagents/backup/news_items.yaml
For backup specifically:
- the checked-in YAML keeps both targets disabled for local safety
DENBUST_DRIVE_FOLDER_IDauto-enables Google Drive backup at config-load timeDENBUST_OBJECT_STORE_BUCKETauto-enables object-storage backup at config-load time- because the backup config no longer hardcodes
store.publication_dir, it reads the latest release bundle from the current state root undernews_items/release/publication
- privacy/risk gating is intentionally lightweight and conservative, not a substitute for legal review
- publication and backup integrations require external credentials and cannot be fully exercised in CI
- only
news_itemsis implemented end to end in this phase
📍 פשיטה על בית בושת ברמת גן
תאריך: 2026-02-15
קטגוריה: בית בושת
תקציר: המשטרה פשטה על דירה ברמת גן...
מקורות:
• Ynet: https://ynet.co.il/...
• Mako: https://mako.co.il/...
- Agent Plan - Current operational priority pointer
- Repo Plan Summary - Human-friendly map of the main plan and active sub-plans
- Discovery Operations - Local and GitHub Actions runbook for discover/ingest/backfill
- Product Definition - Full project background (Hebrew)
- MVP Spec - Phase 1 technical scope
- Implementation Plan - Task breakdown
- Discovery Layer Rollout Plan -
DL-PR-*sequence
- When a PR is opened against a tracked plan item, the PR should update
.agent-plan.md,README.md, and the relevant human-facing plan document(s) in the same branch. - Those documentation updates should describe the state the repo is expected to be in after the PR is merged, not the pre-merge state from before the work landed.
.agent-plan.mdis written as mainline truth usingMainline StatusandTask Ledger; on a branch it acts as the merge contract, and onmainthe same text is read as present-tense fact..agent-plan.mdtask ledger entries use only[done],[next],[later], and[blocked], with exactly one[next]item at any time.- Plan-tracked PRs should not leave planning and docs surfaces one merge behind the code.
- Phase A (current): multi-dataset platform spine + working
news_items / ingest - Phase B:
news_itemsdataset evolution and release/backup implementation - Later: docs metadata, open-docs fulltext, events, and downstream analytics