syllabus

Local CLI study index over the AI-security PDF corpus.

syllabus indexes 981 PDFs into a single SQLite database: the NICE 541 career-pathway PDFs (the DoD cyber workforce framework), 5 University of Illinois AI-secure courses (cs307, cs442, cs562, cs598, cs598-fall2020), and 260 USENIX and NDSS papers. The CLI runs BM25 search, lists papers by topic across a 21-topic hand-curated taxonomy, and maps 36 NICE 541 KSAs onto the AI-security papers that operationalize each KSA in an ML context.

The same corpus drives recon. signals.py, exfil.py, expand.py, scope.py, and probe.py chain together so the literature directly produces a paper-derived target list. The first end-to-end exposed-LLM finding from that chain (case-study-syllabus-vllm-sweep.md) shipped four verified UNAUTH OpenAI-compatible serving endpoints, one already under attacker mass-scanning.

Features

  • BM25 (k1=1.5, b=0.75) full-text search over the indexed corpus
  • 21 hand-curated topics. A paper inherits a topic when its keyword count crosses 2
  • 36 NICE 541 KSA-to-paper maps, including four AI-extension KSAs (K_AI_ADV, K_AI_POI, K_AI_PRIV, K_AI_FED)
  • First 15 pages indexed per PDF (abstract, intro, problem setup, enough body for topic tagging without index bloat)
  • Idempotent ingest: cache hits skip re-extraction; --reindex forces a clean re-pass
  • Recon-mode scripts read the corpus to extract IP literals, GitHub and HuggingFace pivots, ports, endpoints, and defaults
  • Single SQLite DB at ~/syllabus/syllabus.db. No service to run. No network calls inside search, topics, or ksa
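
For readers unfamiliar with BM25, the ranking in the first bullet can be sketched as below. This is a minimal illustration using the k1 and b values quoted above, not the code in syllabus.py. The tokenizer, the IDF variant (the "+1 inside the log" form), and the function names are assumptions.

```python
import math
import re
from collections import Counter

K1, B = 1.5, 0.75  # parameters quoted in the feature list


def tokenize(text):
    # Assumption: lowercase alphanumeric tokens; the real tokenizer may differ.
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs):
    """Return one BM25 score per document in `docs` (a list of strings)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    if n_docs == 0:
        return []
    avgdl = sum(len(d) for d in tokenized) / n_docs or 1.0
    df = Counter()  # document frequency per term
    for d in tokenized:
        df.update(set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = 1 - B + B * len(d) / avgdl
            score += idf * tf[term] * (K1 + 1) / (tf[term] + K1 * length_norm)
        scores.append(score)
    return scores
```

Documents that share no term with the query score 0.0; longer documents are penalized through the b term.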

Installation

Requires pdftotext (poppler-utils) and Python 3.10+.

git clone https://github.com/nuclide-research/syllabus.git ~/syllabus
ln -sf ~/syllabus/syllabus.py ~/.local/bin/syllabus

Corpus layout

Corpus folders are hard-coded in the CORPORA mapping at the top of syllabus.py. Defaults:

~/Documents/dod-cyber-pathways         (NICE 541 work-role PDFs)
~/Documents/cs307-aisecure
~/Documents/cs442-aisecure
~/Documents/cs562-aisecure
~/Documents/cs598-aisecure
~/Documents/cs598-fall2020-aisecure

Point CORPORA at whatever folders you have. Each value is a filesystem path; the key is the corpus label shown in search output.
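
A sketch of what such a mapping and a walk over it might look like. The labels "nice-541" and "cs562" appear in the sample output further down; whether the real CORPORA is a dict with exactly these keys, and whether the walk recurses into subfolders, are assumptions. iter_pdfs is a hypothetical helper.

```python
from pathlib import Path

# Hypothetical shape: corpus label -> folder. Only two entries shown.
CORPORA = {
    "nice-541": Path.home() / "Documents" / "dod-cyber-pathways",
    "cs562": Path.home() / "Documents" / "cs562-aisecure",
}


def iter_pdfs(corpora):
    """Yield (label, pdf_path) for every PDF under each configured folder."""
    for label, root in corpora.items():
        root = Path(root).expanduser()
        if not root.is_dir():
            continue  # skip folders that do not exist on this machine
        # Assumption: recursive search, lowercase .pdf extension only.
        for pdf in sorted(root.rglob("*.pdf")):
            yield label, pdf
```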

Use

syllabus ingest                                          # extract + index every PDF (idempotent)
syllabus ingest --reindex                                # re-extract from scratch

syllabus search "certified robustness randomized smoothing" -n 10
syllabus topics                                          # all topics + counts
syllabus topics backdoor -n 8                            # papers tagged backdoor
syllabus ksa K0342                                       # NICE 541 KSA -> corpus papers
syllabus ksa                                             # every KSA, all at once
syllabus brief certified-robustness -n 5
syllabus stats
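
The topic listings above depend on the keyword-count rule from the feature list. A rough sketch of that rule, assuming "crosses 2" means at least two total keyword occurrences and that matching is a case-insensitive substring count. The two-topic taxonomy and its keywords are invented for illustration.

```python
import re

# Hypothetical slice of the taxonomy; the real one has 21 topics.
TOPIC_KEYWORDS = {
    "backdoor": ["backdoor", "trojan", "trigger"],
    "certified-robustness": ["certified", "randomized smoothing", "lipschitz"],
}
THRESHOLD = 2  # assumption: a topic is assigned at >= 2 keyword hits


def tag_topics(text, taxonomy=TOPIC_KEYWORDS, threshold=THRESHOLD):
    """Return the topics whose keywords occur at least `threshold` times."""
    lowered = text.lower()
    tags = []
    for topic, keywords in taxonomy.items():
        hits = sum(len(re.findall(re.escape(k), lowered)) for k in keywords)
        if hits >= threshold:
            tags.append(topic)
    return tags
```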

Storage

Path                        Contents
~/syllabus/syllabus.db      SQLite index
~/syllabus/extracted/*.txt  per-PDF text cache, sha1-named

Reindex is safe. Deletes happen by doc_id, so the prior cache is reused.
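
The cache-hit behavior can be sketched as follows. This assumes the sha1 name is derived from the PDF's bytes (the README does not say whether it hashes contents or the path) and that extraction calls pdftotext with -l 15 to stop after page 15. cached_text and pdftotext_extract are hypothetical names.

```python
import hashlib
import subprocess
from pathlib import Path


def pdftotext_extract(pdf, out, last_page=15):
    # pdftotext's -l flag sets the last page to convert.
    subprocess.run(
        ["pdftotext", "-l", str(last_page), str(pdf), str(out)], check=True
    )


def cached_text(pdf, cache_dir, extract=pdftotext_extract):
    """Return the cached text path for `pdf`, extracting only on a miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: the cache key is the sha1 of the file contents.
    digest = hashlib.sha1(Path(pdf).read_bytes()).hexdigest()
    out = cache_dir / f"{digest}.txt"
    if not out.exists():
        extract(pdf, out)
    return out
```

Calling cached_text twice on an unchanged file runs the extractor once; the second call is a pure cache hit.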

NICE 541 mapping

The KSA bridge is the load-bearing part. Each NICE 541 KSA is mapped to a keyword set drawn from the official career-pathway PDF. syllabus ksa <id> returns BM25-ranked corpus matches for that KSA, scoring the NICE pathway docs and the AI-security paper corpus together. The result is a side-by-side reading list: the work-role doc that defines the KSA next to the AI-security papers that operationalize it.

$ syllabus ksa K0177
=== K0177 - cyber attack stages (recon/scanning/etc.) ===
   27.83  [nice-541  ] 531 Cyber Defense Incident Responder
   27.13  [nice-541  ] 541 Vulnerability Assessment Analyst Career Pathway
   24.52  [nice-541  ] 511 Cyber Defense Analyst Career Pathway
   20.79  [nice-541  ] 212 Cyber Defense Forensics Analyst Career Pathway
   10.01  [cs562     ] Poison Frogs! Targeted Clean-Label Poisoning

The work-role doc names the kill-chain framing the workforce uses. The Poison Frogs paper is what that kill chain looks like inside an ML pipeline. Same KSA, two operating substrates.
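
One way to picture the bridge: a KSA id resolves to a keyword set, the keywords become a search query, and hits from every corpus are merged into one ranked list. The sketch below assumes that shape. The keyword list for K0177 is invented, and the ranker is left out (any scorer that yields (score, label, title) tuples would do); only the output layout is taken from the sample above.

```python
# Hypothetical entry: KSA id -> (description, keyword set).
KSA_KEYWORDS = {
    "K0177": (
        "cyber attack stages (recon/scanning/etc.)",
        ["attack", "stages", "reconnaissance", "scanning"],
    ),
}


def ksa_query(ksa_id):
    """Turn a KSA's keyword set into a plain search query string."""
    _, keywords = KSA_KEYWORDS[ksa_id]
    return " ".join(keywords)


def render_ksa(ksa_id, hits, limit=5):
    """Format (score, corpus_label, title) hits, highest score first."""
    description, _ = KSA_KEYWORDS[ksa_id]
    lines = [f"=== {ksa_id} - {description} ==="]
    ranked = sorted(hits, key=lambda h: h[0], reverse=True)[:limit]
    for score, label, title in ranked:
        lines.append(f"{score:8.2f}  [{label:<10}] {title}")
    return "\n".join(lines)
```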

Recon mode (corpus as the brain)

syllabus is more than a study index. The corpus describes the threat model the field is actively researching, and that intelligence drives recon in the wild. The scripts at the repo root chain together:

signals.py   -> rank AI/ML platforms by paper-mention count;
                extract corpus-described ports / endpoints / defaults

exfil.py     -> pull every cited IP literal + non-standard host:port
exfil2.py    -> pull every cited GitHub / HuggingFace / Replicate / etc.
                second-hop pivot surface

expand.py    -> pull a recent AI-security paper corpus from the arxiv API
                to keep the brain current

scope.py     -> turn extracted citations into an operator-authorizable
                scope sheet (checkbox per target)

probe.py     -> read scope.md, run passive recon (rDNS, whois, crt.sh,
                HTTP HEAD) on every [x] row; --active adds TCP banners
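
The extract-then-authorize steps can be sketched offline, with no network calls. The regex, the "- [ ] target" row format, and the function names below are assumptions about how exfil.py, scope.py, and probe.py might fit together, not their actual code. The example addresses come from the RFC 5737 documentation ranges.

```python
import ipaddress
import re

# IPv4 literal with an optional :port suffix.
HOSTPORT_RE = re.compile(r"\b((?:\d{1,3}\.){3}\d{1,3})(?::(\d{1,5}))?\b")


def extract_targets(text):
    """Collect valid IPv4 literals (with port if cited) from paper text."""
    targets = set()
    for match in HOSTPORT_RE.finditer(text):
        try:
            ip = ipaddress.ip_address(match.group(1))
        except ValueError:
            continue  # e.g. 999.1.1.1 is not a real address
        port = match.group(2)
        if port is not None and not 0 < int(port) < 65536:
            continue
        targets.add(f"{ip}:{port}" if port else str(ip))
    return sorted(targets)


def render_scope(targets):
    """One unchecked checkbox row per target (assumed scope.md layout)."""
    return "\n".join(f"- [ ] {t}" for t in targets)


def approved(scope_md):
    """Return only the targets the operator has ticked with [x]."""
    return re.findall(r"^- \[[xX]\] (\S+)", scope_md, flags=re.MULTILINE)
```

A pattern this simple also matches dotted version strings such as 1.2.3.4, which is one reason a manual checkbox gate before any probing is useful.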

shodan/sweep.py is the same pattern at scale. It reads a corpus-derived Shodan dork (vllm, sglang, "Triton Inference Server"), pulls all hits, and runs a single fingerprint GET per host (/v1/models for OpenAI-compatible engines, /v2 for Triton). Each host that answers without authentication is a real exposed inference endpoint.

Requires SHODAN_API_KEY to run.

Case studies

  • case-study-syllabus-vllm-sweep.md. First end-to-end exposed-AI finding produced by literature-derived asset discovery. Four UNAUTH OpenAI-compatible LLM serving endpoints verified, one already under active mass scanning with attacker-injected model entries.

Scope

syllabus does single fingerprint GETs per target. No inference requests. No model uploads. No federation joins. The operator-policy gate is welcome to block deeper probes. Enumerate metadata, do not exfiltrate. The names are the finding. Only run the recon scripts against hosts you own or have explicit written authorization to assess.

Our other projects

  • wardrobe — NICE Cybersecurity Workforce Framework as a wardrobe of atoms
  • tome — Technical OSINT Mining Engine, canonical platform corpus
  • aimap — AI/ML infrastructure fingerprint scanner
  • scanner — active-banner stage between passive discovery and deep enumeration
  • BARE — semantic exploit-module ranking over scanner findings

License

MIT. Part of the NuClide toolchain. Contact: nuclide-research.com
