syllabus indexes 981 PDFs into a single SQLite database: NICE 541 career-pathway PDFs (the DoD cyber workforce framework), material from five University of Illinois AI-secure courses (cs307, cs442, cs562, cs598, cs598-fall2020), and 260 USENIX and NDSS papers. The CLI runs BM25 search, lists papers by topic across a hand-curated 21-topic taxonomy, and maps 36 NICE 541 KSAs onto the AI-security papers that operationalize each KSA in an ML context.
The same corpus drives recon. signals.py, exfil.py, expand.py, scope.py, and probe.py chain together so that the literature itself yields a target list. The first end-to-end exposed-LLM finding from that chain (case-study-syllabus-vllm-sweep.md) reported four verified UNAUTH OpenAI-compatible serving endpoints, one of them already under attacker mass-scanning.
- BM25 (k1=1.5, b=0.75) full-text search over the indexed corpus
- 21 hand-curated topics. A paper inherits a topic when its keyword count crosses 2
- 36 NICE 541 KSA-to-paper maps, including four AI-extension KSAs (K_AI_ADV, K_AI_POI, K_AI_PRIV, K_AI_FED)
- First 15 pages indexed per PDF (abstract, intro, problem setup, enough body for topic tagging without index bloat)
- Idempotent ingest: cache hits skip re-extraction; --reindex forces a clean re-pass
- Recon-mode scripts read the corpus to extract IP literals, GitHub and HuggingFace pivots, ports, endpoints, and defaults
- Single SQLite DB at ~/syllabus/syllabus.db. No service to run. No network calls inside search, topics, or ksa
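
The topic-inheritance rule in the list above can be sketched as follows. This is a minimal illustration, not the project's code: `TOPIC_KEYWORDS` is a made-up two-topic taxonomy, `tag_topics` and `keyword_hits` are hypothetical names, and "crosses 2" is read here as "at least 2 keyword hits", which is an assumption.

```python
import re

# Hypothetical two-topic taxonomy; the real tool ships 21 hand-curated topics.
TOPIC_KEYWORDS = {
    "backdoor": ["backdoor", "trojan", "trigger"],
    "certified-robustness": ["certified", "randomized smoothing"],
}

THRESHOLD = 2  # assumption: "crosses 2" read as "at least 2 keyword hits"


def keyword_hits(text: str, keywords: list[str]) -> int:
    """Count case-insensitive occurrences of every keyword in the text."""
    lowered = text.lower()
    return sum(len(re.findall(re.escape(kw), lowered)) for kw in keywords)


def tag_topics(text: str) -> list[str]:
    """Return the topics whose total keyword count reaches the threshold."""
    return sorted(
        topic for topic, kws in TOPIC_KEYWORDS.items()
        if keyword_hits(text, kws) >= THRESHOLD
    )


sample = "We plant a backdoor trigger, then show the backdoor survives fine-tuning."
print(tag_topics(sample))
```

A count threshold rather than a single match keeps a paper that mentions a term once in passing from being tagged with that topic.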
Requires pdftotext (poppler-utils) and Python 3.10+.
git clone https://github.com/nuclide-research/syllabus.git ~/syllabus
ln -sf ~/syllabus/syllabus.py ~/.local/bin/syllabus

Corpus paths are hard-coded in CORPORA at the top of syllabus.py. Defaults:
~/Documents/dod-cyber-pathways (NICE 541 work-role PDFs)
~/Documents/cs307-aisecure
~/Documents/cs442-aisecure
~/Documents/cs562-aisecure
~/Documents/cs598-aisecure
~/Documents/cs598-fall2020-aisecure
Point CORPORA at whatever folders you have. Each value is a filesystem path; the key is the corpus label shown in search output.
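
A label-to-path mapping of this kind might look like the sketch below. The exact shape of `CORPORA` in syllabus.py is not shown here, so treat the use of `pathlib.Path` values, the `my-papers` entry, and the `list_pdfs` helper as assumptions for illustration; the `nice-541` and `cs562` labels are the ones that appear in the search output.

```python
from pathlib import Path

# Assumed shape: corpus label -> folder. The third entry is an illustrative
# custom corpus, not one of the defaults listed above.
CORPORA = {
    "nice-541": Path.home() / "Documents" / "dod-cyber-pathways",
    "cs562": Path.home() / "Documents" / "cs562-aisecure",
    "my-papers": Path.home() / "papers" / "usenix",
}


def list_pdfs(corpora: dict[str, Path]) -> dict[str, list[Path]]:
    """Map each corpus label to the PDFs under its folder (empty if missing)."""
    return {
        label: sorted(root.rglob("*.pdf")) if root.is_dir() else []
        for label, root in corpora.items()
    }


for label, pdfs in list_pdfs(CORPORA).items():
    print(f"{label:12s} {len(pdfs)} PDFs")
```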
syllabus ingest # extract + index every PDF (idempotent)
syllabus ingest --reindex # re-extract from scratch
syllabus search "certified robustness randomized smoothing" -n 10
syllabus topics # all topics + counts
syllabus topics backdoor -n 8 # papers tagged backdoor
syllabus ksa K0342 # NICE 541 KSA -> corpus papers
syllabus ksa # every KSA, all at once
syllabus brief certified-robustness -n 5
syllabus stats

| Path | Contents |
|---|---|
| ~/syllabus/syllabus.db | SQLite index |
| ~/syllabus/extracted/*.txt | per-PDF text cache, sha1-named |
Reindex is safe. Deletes happen by doc_id, so the prior cache is reused.
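
One way to picture the cache behaviour is the sketch below. It assumes the cache file is named after the SHA-1 of the PDF bytes and that a reindex bypasses the cached text; both are guesses at the design. `extract`, `cache_path`, and `run_pdftotext` are hypothetical names rather than functions from syllabus.py. The `-l 15` flag is pdftotext's "last page to convert" option, matching the 15-page limit.

```python
import hashlib
import subprocess
from pathlib import Path

CACHE_DIR = Path.home() / "syllabus" / "extracted"


def cache_path(pdf: Path, cache_dir: Path = CACHE_DIR) -> Path:
    """Name the cache file after the SHA-1 of the PDF bytes (assumed keying)."""
    digest = hashlib.sha1(pdf.read_bytes()).hexdigest()
    return cache_dir / f"{digest}.txt"


def run_pdftotext(pdf: Path, out: Path) -> None:
    """Extract the first 15 pages with poppler's pdftotext."""
    subprocess.run(["pdftotext", "-l", "15", str(pdf), str(out)], check=True)


def extract(pdf: Path, cache_dir: Path = CACHE_DIR, reindex: bool = False,
            extractor=run_pdftotext) -> tuple[str, bool]:
    """Return (text, cache_hit). A hit skips extraction unless reindex is set."""
    out = cache_path(pdf, cache_dir)
    hit = out.exists() and not reindex
    if not hit:
        cache_dir.mkdir(parents=True, exist_ok=True)
        extractor(pdf, out)
    return out.read_text(errors="replace"), hit
```

Keying on content rather than filename means a renamed or moved PDF still hits the cache, and the `extractor` parameter lets the function be exercised without poppler installed.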
The KSA bridge is the load-bearing part. Each NICE 541 KSA is mapped to a keyword set drawn from the official career-pathway PDF. syllabus ksa <id> returns BM25-ranked corpus matches for that KSA, scoring the NICE pathway docs and the AI-security paper corpus together. The result is a side-by-side reading list: the work-role doc that defines the KSA next to the AI-security papers that operationalize it.
$ syllabus ksa K0177
=== K0177 - cyber attack stages (recon/scanning/etc.) ===
27.83 [nice-541 ] 531 Cyber Defense Incident Responder
27.13 [nice-541 ] 541 Vulnerability Assessment Analyst Career Pathway
24.52 [nice-541 ] 511 Cyber Defense Analyst Career Pathway
20.79 [nice-541 ] 212 Cyber Defense Forensics Analyst Career Pathway
10.01 [cs562 ] Poison Frogs! Targeted Clean-Label Poisoning
The work-role doc names the kill-chain framing the workforce uses. The Poison Frogs paper is what that kill chain looks like inside an ML pipeline. Same KSA, two operating substrates.
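
The ranking behind a KSA lookup can be approximated with a small BM25 scorer. The sketch below uses the stated parameters (k1=1.5, b=0.75) but is otherwise an assumption: the keyword set for K0177 is invented, the three documents are toy strings, and the IDF is the common non-negative variant `ln(1 + (N - n + 0.5) / (n + 0.5))`, which may differ from what syllabus.py computes.

```python
import math
import re
from collections import Counter

K1, B = 1.5, 0.75  # parameters stated in the feature list

# Hypothetical keyword set for one KSA; the real sets come from the pathway PDFs.
KSA_KEYWORDS = {"K0177": ["attack", "stages", "reconnaissance", "scanning"]}


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query_terms: list[str], docs: dict[str, str]) -> list[tuple[float, str]]:
    """Score every document against the query terms, best first."""
    tokens = {name: tokenize(body) for name, body in docs.items()}
    n_docs = len(tokens)
    avgdl = sum(len(t) for t in tokens.values()) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for t in tokens.values() for term in set(t))
    ranked = []
    for name, toks in tokens.items():
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + K1 * (1 - B + B * len(toks) / avgdl)
            score += idf * tf[term] * (K1 + 1) / norm
        ranked.append((score, name))
    return sorted(ranked, reverse=True)


docs = {
    "incident-responder": "attack stages include reconnaissance and scanning before exploitation",
    "poison-frogs": "clean label poisoning attack on a neural network",
    "unrelated": "course logistics and grading policy",
}
for score, name in bm25_rank(KSA_KEYWORDS["K0177"], docs):
    print(f"{score:6.2f}  {name}")
```

Because both corpora are scored in one pass, a pathway document that matches every keyword and a paper that matches only one land in the same ranked list, which is the side-by-side effect described above.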
syllabus is more than a study index. The corpus describes the threat model the field is actively researching, and that intelligence drives recon in the wild. The scripts at the repo root chain together:
- signals.py -> rank AI/ML platforms by paper-mention count; extract corpus-described ports / endpoints / defaults
- exfil.py -> pull every cited IP literal + non-standard host:port
- exfil2.py -> pull every cited GitHub / HuggingFace / Replicate / etc. second-hop pivot surface
- expand.py -> pull a recent AI-security paper corpus from the arxiv API to keep the brain current
- scope.py -> turn extracted citations into an operator-authorizable scope sheet (checkbox per target)
- probe.py -> read scope.md, run passive recon (rDNS, whois, crt.sh, HTTP HEAD) on every [x] row; --active adds TCP banners
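
The extract-then-authorize handoff between these scripts can be sketched without any network access. Everything below is an assumption for illustration: the regex, the function names, and the `- [ ] target` task-list layout of the scope sheet are not taken from exfil.py, scope.py, or probe.py. The sample addresses come from the reserved documentation ranges.

```python
import ipaddress
import re

# IPv4 literal with an optional :port suffix.
HOSTPORT_RE = re.compile(r"\b((?:\d{1,3}\.){3}\d{1,3})(?::(\d{1,5}))?\b")


def extract_targets(text: str) -> list[str]:
    """Pull valid IPv4 literals, with an optional :port, from extracted text."""
    found = []
    for ip, port in HOSTPORT_RE.findall(text):
        try:
            ipaddress.IPv4Address(ip)  # rejects octets above 255
        except ValueError:
            continue
        if port and not 0 < int(port) < 65536:
            continue
        target = f"{ip}:{port}" if port else ip
        if target not in found:
            found.append(target)
    return found


def render_scope(targets: list[str]) -> str:
    """Emit one unchecked row per target; the operator ticks [x] to authorize."""
    return "\n".join(f"- [ ] {t}" for t in targets)


def authorized(scope_md: str) -> list[str]:
    """Return only the targets whose checkbox the operator ticked."""
    return re.findall(r"^- \[[xX]\] (\S+)", scope_md, flags=re.MULTILINE)


text = ("The testbed at 192.0.2.10:8000 served models; a mirror ran on "
        "198.51.100.7. Bogus: 999.1.1.1")
sheet = render_scope(extract_targets(text))
sheet = sheet.replace("- [ ] 192.0.2.10:8000", "- [x] 192.0.2.10:8000")
print(authorized(sheet))
```

Keeping authorization as an explicit edit to a text file means nothing is probed unless a person has ticked the row.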
shodan/sweep.py is the same pattern at scale. It reads a corpus-derived Shodan dork (vllm, sglang, "Triton Inference Server"), pulls all hits, and runs a single fingerprint GET per host (/v1/models for OpenAI-compatible engines, /v2 for Triton). Each verified UNAUTH hit is a real exposed inference endpoint.
Requires SHODAN_API_KEY to run.
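
How a single `/v1/models` response might be turned into a verdict is sketched below. It makes no request itself and only classifies a status code and body that the caller already holds. The function name and the `AUTH` and `UNKNOWN` labels are hypothetical; only `UNAUTH` appears in the project's own wording. The body shape (`{"data": [{"id": ...}]}`) follows the OpenAI list-models format; the Triton `/v2` path is not covered.

```python
import json


def classify_models_response(status: int, body: str) -> tuple[str, list[str]]:
    """Classify one GET /v1/models response as (verdict, model names)."""
    if status in (401, 403):
        return "AUTH", []
    if status != 200:
        return "UNKNOWN", []
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return "UNKNOWN", []
    data = payload.get("data") if isinstance(payload, dict) else None
    if not isinstance(data, list):
        return "UNKNOWN", []
    # Only metadata is read: the model ids listed by the server.
    names = [m["id"] for m in data if isinstance(m, dict) and "id" in m]
    return "UNAUTH", names


body = '{"object": "list", "data": [{"id": "example-model", "object": "model"}]}'
print(classify_models_response(200, body))
print(classify_models_response(401, ""))
```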
case-study-syllabus-vllm-sweep.md is the first end-to-end exposed-AI finding produced by literature-derived asset discovery. Four UNAUTH OpenAI-compatible LLM serving endpoints were verified, one of them already under active mass scanning with attacker-injected model entries.
syllabus does single fingerprint GETs per target. No inference requests. No model uploads. No federation joins. The operator-policy gate is welcome to block deeper probes. Enumerate metadata, do not exfiltrate. The names are the finding. Only run the recon scripts against hosts you own or have explicit written authorization to assess.
- wardrobe — NICE Cybersecurity Workforce Framework as a wardrobe of atoms
- tome — Technical OSINT Mining Engine, canonical platform corpus
- aimap — AI/ML infrastructure fingerprint scanner
- scanner — active-banner stage between passive discovery and deep enumeration
- BARE — semantic exploit-module ranking over scanner findings
MIT. Part of the NuClide toolchain. Contact: nuclide-research.com