phenixblue · phenixblue · Apr 5, 2026 · Apr 3, 2026 · Apr 5, 2026
diff --git a/.gitignore b/.gitignore
@@ -2,3 +2,5 @@ bin/
 .DS_Store
 dist/
 planning/
+collector-data.json
+collectors-test/
diff --git a/README.md b/README.md
@@ -32,6 +32,20 @@ make build
 ./bin/kvirtbp scan --show-runbook --output table
 ./bin/kvirtbp runbook
 ./bin/kvirtbp runbook --id RUNBOOK-SEC-RBAC-001
+
+# Collector workflow — gather node/cluster data then scan with it
+./bin/kvirtbp collect --bundle ./policy/baseline --output collector-data.json
+./bin/kvirtbp collect --collector-config ./my-collectors.json --output collector-data.json
+./bin/kvirtbp scan --engine rego --policy-bundle ./policy/baseline --collector-data collector-data.json
+
+# Remote bundle (HTTPS tarball)
+./bin/kvirtbp collect --bundle https://github.com/myorg/policies/archive/refs/tags/v1.2.0.tar.gz --output collector-data.json
+./bin/kvirtbp scan --engine rego --policy-bundle https://github.com/myorg/policies/archive/refs/tags/v1.2.0.tar.gz --collector-data collector-data.json
+
+# Remote monorepo (bundle lives under a subdirectory)
+./bin/kvirtbp scan --engine rego \
+  --policy-bundle https://github.com/myorg/policies/archive/refs/tags/v1.2.0.tar.gz \
+  --bundle-subdir policy/kubevirt --collector-data collector-data.json
 ```
 
 ## Homebrew
@@ -92,8 +106,10 @@ Scan command supports:
 - `--category` and `--severity` to filter findings
 - `--engine` to select evaluator backend (`go` and `rego`)
 - `--policy-file` to provide a custom Rego policy file with `data.kvirtbp.findings` output
-- `--policy-bundle` to provide a directory of `.rego` files with optional `metadata.json`
+- `--policy-bundle` to provide a local directory, or an HTTPS `.tar.gz` URL, containing `.rego` files with optional `metadata.json`
+- `--bundle-subdir` to point at a subdirectory within a remote archive (for monorepo layouts)
 - `--show-runbook` to append compact runbook hints for failing findings
+- `--collector-data` to inject pre-collected node/cluster data into `input.cluster.collectors` for Rego policies
 
 Namespace scoping precedence for namespace-based coverage controls:
 
@@ -132,6 +148,7 @@ Additional documentation:
 
 - [docs/check-catalog.md](docs/check-catalog.md)
 - [docs/policy-authoring.md](docs/policy-authoring.md)
+- [docs/collectors.md](docs/collectors.md)
 - [docs/operations.md](docs/operations.md)
 - [docs/workflows.md](docs/workflows.md)
 
@@ -140,6 +157,7 @@ Policy bundle metadata (optional `metadata.json`):
 - `schemaVersion`: currently `v1alpha1`
 - `policyVersion`: informational version for your bundle
 - `minBinaryVersion`: optional minimum CLI version (for example `1.2.0`)
+- `collectors`: optional array of `CollectorConfig` objects that `kvirtbp collect` will run automatically when `--bundle` is provided (see [docs/collectors.md](docs/collectors.md))
 
 Checked-in baseline Rego bundle:
 
@@ -196,4 +214,5 @@ Manual CI execution:
 ## Roadmap notes
 
 - v1 includes hybrid policy execution (Go + Rego/OPA)
+- The `collect` subcommand and collector framework are included in v1
 - Snapshot bundle export and visualization UI are post-v1 roadmap items
diff --git a/docs/collectors.md b/docs/collectors.md
@@ -0,0 +1,227 @@
+# Collectors Guide
+
+Collectors are short-lived Kubernetes Jobs deployed by `kvirtbp collect` to gather node-scoped or cluster-scoped data that Rego policies can reference at scan time via `input.cluster.collectors`.
+
+## How it works
+
+```
+kvirtbp collect                  kvirtbp scan --collector-data ...
+      │                                       │
+      ▼                                       ▼
+Deploy Jobs on cluster          Load collector-data.json
+      │                                       │
+      ▼                                       ▼
+Wait for completion             Inject into ClusterSnapshot.Collectors
+      │                                       │
+      ▼                                       ▼
+Read pod logs (JSON)            Available as input.cluster.collectors
+      │
+      ▼
+Write collector-data.json
+```
+
+`scan` is read-only. All cluster writes happen exclusively in `collect`.
+
+## Quick start
+
+```bash
+# Run collectors declared in a local policy bundle
+./bin/kvirtbp collect --bundle ./policy/baseline --output collector-data.json
+
+# Run collectors from a remote bundle (HTTPS .tar.gz)
+./bin/kvirtbp collect \
+    --bundle https://github.com/myorg/policies/archive/refs/tags/v1.2.0.tar.gz \
+    --output collector-data.json
+
+# Monorepo: bundle lives under a subdirectory of the archive
+./bin/kvirtbp collect \
+    --bundle https://github.com/myorg/policies/archive/refs/tags/v1.2.0.tar.gz \
+    --bundle-subdir policy/kubevirt \
+    --output collector-data.json
+
+# Or from a standalone collector config file
+./bin/kvirtbp collect --collector-config ./my-collectors.json --output collector-data.json
+
+# Then scan with the collected data
+./bin/kvirtbp scan --engine rego --policy-bundle ./policy/baseline \
+    --collector-data collector-data.json
+```
+
+## CollectorConfig schema
+
+Collectors are declared as JSON objects. The schema maps to `collector.CollectorConfig` in the Go package.
+
+| Field | Type | Required | Description |
+|---|---|---|---|
+| `name` | string | yes | Unique identifier; used as the key in `input.cluster.collectors` |
+| `image` | string | yes | Container image to run |
+| `commands` | []string | yes | Shell commands to execute inside the container |
+| `scope` | string | no | `"once"` (default) or `"per-node"` |
+| `outputPath` | string | no | In-pod path where commands write JSON output (default: `/kvirtbp/output.json`) |
+| `timeoutSeconds` | int | no | Per-collector deadline in seconds; `0` means use the global `--collector-timeout` cap |
+| `privileged` | bool | no | Run container with `SecurityContext.Privileged = true` |
+| `hostPID` | bool | no | Share the host PID namespace |
+| `hostNetwork` | bool | no | Attach to host network namespace |
+| `env` | object | no | Environment variables injected into the container |
+| `tolerations` | []object | no | Pod tolerations applied to the Job pod. Each object supports `key`, `operator` (`Exists`/`Equal`), `value`, and `effect` fields. Use `{"operator": "Exists"}` to tolerate all taints (e.g. to run `per-node` on control-plane nodes). |
+
+### Scope values
+
+| Value | Behaviour |
+|---|---|
+| `once` | A single Job is deployed. Output is stored under the sentinel key `_cluster`. |
+| `per-node` | One Job per node (via `nodeName` selector). Output is stored keyed by node name. |
+
+## Output format
+
+`kvirtbp collect` writes a JSON file with the structure:
+
+```json
+{
+  "<collector-name>": {
+    "<node-name-or-_cluster>": {
+      "<key>": "<value>"
+    }
+  }
+}
+```
+
+Example with a `sysctl` collector using `scope: once`:
+
+```json
+{
+  "sysctl": {
+    "_cluster": {
+      "net.ipv4.ip_forward": "0",
+      "net.bridge.bridge-nf-call-iptables": "1"
+    }
+  }
+}
+```
+
+Example with a `sysctl` collector using `scope: per-node`:
+
+```json
+{
+  "sysctl": {
+    "worker-1": { "net.ipv4.ip_forward": "0" },
+    "worker-2": { "net.ipv4.ip_forward": "1" }
+  }
+}
+```
+
+If a collector (or individual node) fails, an `_error` key is stored in place of the normal output and collection continues:
+
+```json
+{
+  "sysctl": {
+    "worker-1": { "_error": "job kvirtbp-sysctl-worker-1 failed: BackoffLimitExceeded" }
+  }
+}
+```
+
+## Writing collector output
+
+Commands must write valid JSON to `outputPath` (default `/kvirtbp/output.json`). The CLI appends `cat <outputPath>` as the last command in the Job spec — this is what appears as pod logs and is parsed as JSON.
+
+Intermediate commands may freely write to stdout/stderr without corrupting the output payload since they run before the final `cat`.
+
+Example pattern for a sysctl collector:
+
+```yaml
+# write JSON to the output path, then let the CLI's appended "cat" emit it
+commands:
+  - "sysctl -a --pattern '^(net\\.ipv4\\.ip_forward|net\\.bridge\\.bridge-nf-call-iptables)' | awk 'BEGIN{printf \"{\"} {printf \"%s\\\"%s\\\": \\\"%s\\\"\", (NR>1?\",\":\" \"), $1, $3} END{print \"}\"}' > /kvirtbp/output.json"
+```
+
+Or with a helper image that already produces JSON:
+
+```yaml
+commands:
+  - "my-tool dump-json > /kvirtbp/output.json"
+```
+
+## Declaring collectors in a bundle
+
+Add a `collectors` array to the bundle's `metadata.json`:
+
+```json
+{
+  "schemaVersion": "v1alpha1",
+  "policyVersion": "1.0.0",
+  "resources": ["v1/nodes"],
+  "collectors": [
+    {
+      "name": "sysctl",
+      "image": "alpine:3.21",
+      "commands": [
+        "apk add -q procps",
+        "sysctl -a --pattern '^net\\.ipv4\\.ip_forward' | awk 'BEGIN{printf \"{\"}{printf \"%s\\\"%s\\\":\\\"%s\\\"\", (NR>1?\",\":\"\"), $1,$3}END{print \"}\"}' > /kvirtbp/output.json"
+      ],
+      "scope": "per-node",
+      "privileged": true,
+      "hostNetwork": true
+    }
+  ]
+}
+```
+
+Running `kvirtbp collect --bundle ./policy/baseline` will execute these collectors automatically.
+
+## Merging collector configs
+
+When both `--bundle` and `--collector-config` are provided, the two sets are merged. `--collector-config` wins on name collision:
+
+```bash
+./bin/kvirtbp collect \
+    --bundle ./policy/baseline \
+    --collector-config ./overrides.json \
+    --output collector-data.json
+```
+
+`overrides.json` can be a partial list — only the names that need to differ from the bundle's defaults need to be included.
+
+## RBAC requirements
+
+The identity running `kvirtbp collect` needs:
+
+```yaml
+rules:
+  - apiGroups: ["batch"]
+    resources: ["jobs"]
+    verbs: ["create", "get", "list", "delete"]
+  - apiGroups: [""]
+    resources: ["pods", "pods/log"]
+    verbs: ["get", "list"]
+  - apiGroups: [""]
+    resources: ["namespaces"]
+    verbs: ["get", "create"]
+  - apiGroups: [""]
+    resources: ["nodes"]
+    verbs: ["list"]          # needed for per-node scope
+```
+
+If `--collector-namespace` already exists, the `namespaces` create verb is not required.
+
+## Debugging
+
+Use `--no-collector-cleanup` to prevent Job deletion after completion:
+
+```bash
+./bin/kvirtbp collect --bundle ./policy/baseline \
+    --no-collector-cleanup --output collector-data.json
+```
+
+Then inspect failed Jobs directly:
+
+```bash
+kubectl get jobs -n kvirtbp-collectors
+kubectl logs -n kvirtbp-collectors -l collector-name=sysctl
+```
+
+## Security considerations
+
+- `privileged`, `hostPID`, and `hostNetwork` must be explicitly opted into per collector; they are never defaulted.
+- The collector namespace (`kvirtbp-collectors` by default) should have a restrictive PSA policy. Consider `enforce: privileged` only if your collectors require host access, and scope the namespace RBAC tightly.
+- Collector `env` values are stored in plain text in the Job spec. Do not use them for secrets; use a Kubernetes Secret volume mount instead.
+- The output file (`collector-data.json`) may contain sensitive node data. Treat it with appropriate access controls.
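Review note on the output format in docs/collectors.md: a consumer of `collector-data.json` will likely want to separate usable entries from those carrying the documented `_error` key. The Go sketch below shows one way to do that. It is not taken from the kvirtbp source: `CollectorData`, `CollectorFailure`, `ParseCollectorData`, and `Failures` are hypothetical names, and the sketch assumes every per-node (or `_cluster`) payload is a JSON object, as in the documented examples.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// CollectorData mirrors the documented layout:
// collector name -> node name (or "_cluster") -> JSON object.
// Assumption: payloads are always objects, as in the docs' examples.
type CollectorData map[string]map[string]map[string]any

// CollectorFailure is a hypothetical helper type for this sketch.
type CollectorFailure struct {
	Collector string
	Target    string // node name, or "_cluster" for scope "once"
	Message   string
}

// ParseCollectorData decodes the bytes of a collector-data.json file.
func ParseCollectorData(raw []byte) (CollectorData, error) {
	var data CollectorData
	if err := json.Unmarshal(raw, &data); err != nil {
		return nil, fmt.Errorf("parse collector data: %w", err)
	}
	return data, nil
}

// Failures returns every entry that carries the documented "_error" key,
// sorted by collector and target so the result is deterministic.
func Failures(data CollectorData) []CollectorFailure {
	var out []CollectorFailure
	for name, targets := range data {
		for target, payload := range targets {
			if msg, ok := payload["_error"]; ok {
				out = append(out, CollectorFailure{
					Collector: name,
					Target:    target,
					Message:   fmt.Sprint(msg),
				})
			}
		}
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Collector != out[j].Collector {
			return out[i].Collector < out[j].Collector
		}
		return out[i].Target < out[j].Target
	})
	return out
}

func main() {
	raw := []byte(`{
	  "sysctl": {
	    "worker-1": {"_error": "job kvirtbp-sysctl-worker-1 failed: BackoffLimitExceeded"},
	    "worker-2": {"net.ipv4.ip_forward": "1"}
	  }
	}`)
	data, err := ParseCollectorData(raw)
	if err != nil {
		panic(err)
	}
	for _, f := range Failures(data) {
		fmt.Printf("%s/%s: %s\n", f.Collector, f.Target, f.Message)
	}
}
```

A check like this could run between `collect` and `scan` in CI, so that a failed collector is reported before policies evaluate incomplete data.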
diff --git a/docs/operations.md b/docs/operations.md
@@ -16,6 +16,38 @@ The scanner expects read/list access to:
 
 Insufficient access is reported as explicit permission findings (`perm-list-*`) and may degrade security/availability baseline outcomes.
 
+The `collect` subcommand creates Kubernetes Jobs and therefore requires additional permissions in the collector namespace:
+
+- `batch/jobs` — create, get, list, delete
+- `pods`, `pods/log` — get, list (to find Job pods and read their output); `nodes` — list (for `per-node` scope)
+- `namespaces` — get, create (create is needed only if `--collector-namespace` does not already exist)
+
+## Collector workflow
+
+The `collect` subcommand deploys short-lived Kubernetes Jobs to gather data that Rego policies can reference via `input.cluster.collectors`. This is a separate step from `scan`; the scan command itself makes no cluster writes.
+
+Typical two-step workflow:
+
+```bash
+# Step 1: collect — writes collector-data.json
+./bin/kvirtbp collect --bundle ./policy/baseline --output collector-data.json
+
+# Step 2: scan — injects collected data, makes no cluster writes
+./bin/kvirtbp scan --engine rego --policy-bundle ./policy/baseline \
+    --collector-data collector-data.json
+```
+
+Key `collect` flags:
+
+- `--bundle` — loads collector definitions from a bundle's `metadata.json` automatically
+- `--collector-config` — path to a standalone JSON file of `[]CollectorConfig`; merged with bundle collectors (file wins on name collision)
+- `--collector-namespace` — Kubernetes namespace for Jobs (default: `kvirtbp-collectors`; created if absent)
+- `--collector-timeout` — maximum time to wait for all collectors (default: `5m`)
+- `--no-collector-cleanup` — keep completed Jobs after collection (useful for debugging)
+- `--output` — path for the collector data JSON file (default: `collector-data.json`)
+
+Collector Jobs are deleted automatically after completion unless `--no-collector-cleanup` is set. A `TTLSecondsAfterFinished` of 300 seconds is also set on each Job as a safety net.
+
 ## Runtime behavior
 
 Key runtime flags:
@@ -29,6 +61,7 @@ Key runtime flags:
 - `--policy-bundle`
 - `--waiver-file`
 - `--show-runbook`
+- `--collector-data` (inject pre-collected node/cluster data into `input.cluster.collectors`)
 
 Output modes:
 
@@ -91,5 +124,9 @@ Waiver rules:
   - Apply PSA enforce labels and baseline NetworkPolicies.
 - Namespace guardrail failures:
   - Add ResourceQuota and LimitRange to targeted namespaces.
+- Collector failures:
+  - Check Job status in the collector namespace: `kubectl get jobs -n kvirtbp-collectors`
+  - Use `--no-collector-cleanup` to retain failed Jobs for log inspection.
+  - Ensure the identity running `collect` has `batch/jobs` create/get/list/delete and `pods/log` get in the collector namespace.
 
 For remediation-specific playbooks, see `docs/runbooks.md`.
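Review note on the merge rule in docs/operations.md ("file wins on name collision"): the Go sketch below illustrates one reading of that rule. It is an illustration, not the kvirtbp implementation. `MergeCollectors` is a hypothetical name, the `CollectorConfig` struct here is a trimmed stand-in for `collector.CollectorConfig`, and the sketch assumes a colliding entry is replaced whole rather than merged field by field.

```go
package main

import "fmt"

// CollectorConfig is a trimmed, hypothetical stand-in for
// collector.CollectorConfig with a few fields from the documented schema.
type CollectorConfig struct {
	Name  string `json:"name"`
	Image string `json:"image"`
	Scope string `json:"scope,omitempty"`
}

// MergeCollectors combines bundle collectors with overrides from a
// --collector-config file. On a name collision the override replaces the
// bundle entry in place (assumption: whole-entry replacement). Overrides
// with new names are appended, and bundle order is preserved.
func MergeCollectors(bundle, overrides []CollectorConfig) []CollectorConfig {
	merged := make([]CollectorConfig, 0, len(bundle)+len(overrides))
	index := make(map[string]int, len(bundle)+len(overrides))
	for _, group := range [][]CollectorConfig{bundle, overrides} {
		for _, c := range group {
			if i, seen := index[c.Name]; seen {
				merged[i] = c // later source wins
				continue
			}
			index[c.Name] = len(merged)
			merged = append(merged, c)
		}
	}
	return merged
}

func main() {
	bundle := []CollectorConfig{
		{Name: "sysctl", Image: "alpine:3.21", Scope: "per-node"},
		{Name: "kernel", Image: "alpine:3.21", Scope: "once"},
	}
	overrides := []CollectorConfig{
		{Name: "sysctl", Image: "registry.example.com/sysctl:1.0", Scope: "per-node"},
	}
	for _, c := range MergeCollectors(bundle, overrides) {
		fmt.Printf("%s -> %s (%s)\n", c.Name, c.Image, c.Scope)
	}
}
```

The image `registry.example.com/sysctl:1.0` and the `kernel` collector are placeholders. If the real merge is field-level, an override file could omit unchanged fields, so it may be worth stating in the docs which behavior applies.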