Merged
2 changes: 2 additions & 0 deletions .gitignore
@@ -2,3 +2,5 @@ bin/
.DS_Store
dist/
planning/
collector-data.json
collectors-test/
21 changes: 20 additions & 1 deletion README.md
@@ -32,6 +32,20 @@ make build
./bin/kvirtbp scan --show-runbook --output table
./bin/kvirtbp runbook
./bin/kvirtbp runbook --id RUNBOOK-SEC-RBAC-001

# Collector workflow — gather node/cluster data then scan with it
./bin/kvirtbp collect --bundle ./policy/baseline --output collector-data.json
./bin/kvirtbp collect --collector-config ./my-collectors.json --output collector-data.json
./bin/kvirtbp scan --engine rego --policy-bundle ./policy/baseline --collector-data collector-data.json

# Remote bundle (HTTPS tarball)
./bin/kvirtbp collect --bundle https://github.com/myorg/policies/archive/refs/tags/v1.2.0.tar.gz --output collector-data.json
./bin/kvirtbp scan --engine rego --policy-bundle https://github.com/myorg/policies/archive/refs/tags/v1.2.0.tar.gz --collector-data collector-data.json

# Remote monorepo (bundle lives under a subdirectory)
./bin/kvirtbp scan --engine rego \
  --policy-bundle https://github.com/myorg/policies/archive/refs/tags/v1.2.0.tar.gz \
  --bundle-subdir policy/kubevirt --collector-data collector-data.json
```

## Homebrew
@@ -92,8 +106,10 @@ Scan command supports:
- `--category` and `--severity` to filter findings
- `--engine` to select the evaluator backend (`go` or `rego`)
- `--policy-file` to provide a custom Rego policy file with `data.kvirtbp.findings` output
- `--policy-bundle` to provide a directory of `.rego` files with optional `metadata.json`
- `--policy-bundle` to provide a local directory or HTTPS `.tar.gz` URL of `.rego` files with optional `metadata.json`
- `--bundle-subdir` to point at a subdirectory within a remote archive (for monorepo layouts)
- `--show-runbook` to append compact runbook hints for failing findings
- `--collector-data` to inject pre-collected node/cluster data into `input.cluster.collectors` for Rego policies

Namespace scoping precedence for namespace-based coverage controls:

@@ -132,6 +148,7 @@ Additional documentation:

- [docs/check-catalog.md](docs/check-catalog.md)
- [docs/policy-authoring.md](docs/policy-authoring.md)
- [docs/collectors.md](docs/collectors.md)
- [docs/operations.md](docs/operations.md)
- [docs/workflows.md](docs/workflows.md)

@@ -140,6 +157,7 @@ Policy bundle metadata (optional `metadata.json`):
- `schemaVersion`: currently `v1alpha1`
- `policyVersion`: informational version for your bundle
- `minBinaryVersion`: optional minimum CLI version (for example `1.2.0`)
- `collectors`: optional array of `CollectorConfig` objects that `kvirtbp collect` will run automatically when `--bundle` is provided (see [docs/collectors.md](docs/collectors.md))

Checked-in baseline Rego bundle:

@@ -196,4 +214,5 @@ Manual CI execution:
## Roadmap notes

- v1 includes hybrid policy execution (Go + Rego/OPA)
- The `collect` subcommand and collector framework are included in v1
- Snapshot bundle export and visualization UI are post-v1 roadmap items
227 changes: 227 additions & 0 deletions docs/collectors.md
@@ -0,0 +1,227 @@
# Collectors Guide

Collectors are short-lived Kubernetes Jobs deployed by `kvirtbp collect` to gather node or cluster-scope data that Rego policies can reference at scan time via `input.cluster.collectors`.

## How it works

```
kvirtbp collect                kvirtbp scan --collector-data ...
      │                               │
      ▼                               ▼
Deploy Jobs on cluster         Load collector-data.json
      │                               │
      ▼                               ▼
Wait for completion            Inject into ClusterSnapshot.Collectors
      │                               │
      ▼                               ▼
Read pod logs (JSON)           Available as input.cluster.collectors
Write collector-data.json
```

`scan` is read-only. All cluster writes happen exclusively in `collect`.

## Quick start

```bash
# Run collectors declared in a local policy bundle
./bin/kvirtbp collect --bundle ./policy/baseline --output collector-data.json

# Run collectors from a remote bundle (HTTPS .tar.gz)
./bin/kvirtbp collect \
  --bundle https://github.com/myorg/policies/archive/refs/tags/v1.2.0.tar.gz \
  --output collector-data.json

# Monorepo: bundle lives under a subdirectory of the archive
./bin/kvirtbp collect \
  --bundle https://github.com/myorg/policies/archive/refs/tags/v1.2.0.tar.gz \
  --bundle-subdir policy/kubevirt \
  --output collector-data.json

# Or from a standalone collector config file
./bin/kvirtbp collect --collector-config ./my-collectors.json --output collector-data.json

# Then scan with the collected data
./bin/kvirtbp scan --engine rego --policy-bundle ./policy/baseline \
  --collector-data collector-data.json
```

## CollectorConfig schema

Collectors are declared as JSON objects. The schema maps to `collector.CollectorConfig` in the Go package.

| Field | Type | Required | Description |
|---|---|---|---|
| `name` | string | yes | Unique identifier; used as the key in `input.cluster.collectors` |
| `image` | string | yes | Container image to run |
| `commands` | []string | yes | Shell commands to execute inside the container |
| `scope` | string | no | `"once"` (default) or `"per-node"` |
| `outputPath` | string | no | In-pod path where commands write JSON output (default: `/kvirtbp/output.json`) |
| `timeoutSeconds` | int | no | Per-collector deadline in seconds; `0` means use the global `--collector-timeout` cap |
| `privileged` | bool | no | Run container with `SecurityContext.Privileged = true` |
| `hostPID` | bool | no | Share the host PID namespace with the container |
| `hostNetwork` | bool | no | Use the host network namespace |
| `env` | object | no | Environment variables injected into the container |
| `tolerations` | []object | no | Pod tolerations applied to the Job pod. Each object supports `key`, `operator` (`Exists`/`Equal`), `value`, and `effect` fields. Use `{"operator": "Exists"}` to tolerate all taints (e.g. to run `per-node` on control-plane nodes). |
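
As a rough illustration of the table above, the following Go sketch decodes one collector object, checks the required fields, and applies the documented defaults for `scope` and `outputPath`. The `CollectorConfig` struct and `parseCollector` function here are hypothetical stand-ins written for this guide; the real `collector.CollectorConfig` type may use different field names or validation rules, and `tolerations` is omitted for brevity.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// CollectorConfig is a hypothetical mirror of the documented schema.
// The real collector.CollectorConfig may differ; tolerations are omitted.
type CollectorConfig struct {
	Name           string            `json:"name"`
	Image          string            `json:"image"`
	Commands       []string          `json:"commands"`
	Scope          string            `json:"scope,omitempty"`
	OutputPath     string            `json:"outputPath,omitempty"`
	TimeoutSeconds int               `json:"timeoutSeconds,omitempty"`
	Privileged     bool              `json:"privileged,omitempty"`
	HostPID        bool              `json:"hostPID,omitempty"`
	HostNetwork    bool              `json:"hostNetwork,omitempty"`
	Env            map[string]string `json:"env,omitempty"`
}

// parseCollector decodes one collector object, rejects missing required
// fields or unknown scopes, and fills in the documented defaults.
func parseCollector(raw []byte) (CollectorConfig, error) {
	var c CollectorConfig
	if err := json.Unmarshal(raw, &c); err != nil {
		return c, err
	}
	if c.Name == "" || c.Image == "" || len(c.Commands) == 0 {
		return c, fmt.Errorf("name, image and commands are required")
	}
	if c.Scope == "" {
		c.Scope = "once" // documented default
	}
	if c.Scope != "once" && c.Scope != "per-node" {
		return c, fmt.Errorf("unknown scope %q", c.Scope)
	}
	if c.OutputPath == "" {
		c.OutputPath = "/kvirtbp/output.json" // documented default
	}
	return c, nil
}

func main() {
	raw := []byte(`{"name":"sysctl","image":"alpine:3.21","commands":["true"]}`)
	c, err := parseCollector(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(c.Name, c.Scope, c.OutputPath)
}
```

The privileged flags are plain `bool` fields, so they stay `false` unless a collector sets them explicitly, which matches the opt-in behaviour described under security considerations.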

### Scope values

| Value | Behaviour |
|---|---|
| `once` | A single Job is deployed. Output is stored under the sentinel key `_cluster`. |
| `per-node` | One Job per node (via `nodeName` selector). Output is stored keyed by node name. |
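
To make the keying rule concrete, here is a small Go sketch that maps a scope and a node list to the keys a collector's results would be stored under. `resultKeys` is a hypothetical helper invented for this example, not a function from the kvirtbp code base; only the `_cluster` sentinel and the two scope names come from the table above.

```go
package main

import "fmt"

// clusterKey is the sentinel key documented for scope "once".
const clusterKey = "_cluster"

// resultKeys returns the keys under which one collector's output would be
// stored: a single sentinel for "once", or one key per node for "per-node".
// An empty scope is treated as the default, "once".
func resultKeys(scope string, nodes []string) ([]string, error) {
	switch scope {
	case "", "once":
		return []string{clusterKey}, nil
	case "per-node":
		// Copy so callers cannot mutate the input slice through the result.
		return append([]string(nil), nodes...), nil
	default:
		return nil, fmt.Errorf("unknown scope %q", scope)
	}
}

func main() {
	keys, err := resultKeys("per-node", []string{"worker-1", "worker-2"})
	if err != nil {
		panic(err)
	}
	fmt.Println(keys)
}
```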

## Output format

`kvirtbp collect` writes a JSON file with the structure:

```json
{
  "<collector-name>": {
    "<node-name-or-_cluster>": {
      "<key>": "<value>"
    }
  }
}
```

Example with a `sysctl` collector using `scope: once`:

```json
{
  "sysctl": {
    "_cluster": {
      "net.ipv4.ip_forward": "0",
      "net.bridge.bridge-nf-call-iptables": "1"
    }
  }
}
```

Example with a `sysctl` collector using `scope: per-node`:

```json
{
  "sysctl": {
    "worker-1": { "net.ipv4.ip_forward": "0" },
    "worker-2": { "net.ipv4.ip_forward": "1" }
  }
}
```

If a collector (or individual node) fails, an `_error` key is stored in place of the normal output and collection continues:

```json
{
  "sysctl": {
    "worker-1": { "_error": "job kvirtbp-sysctl-worker-1 failed: BackoffLimitExceeded" }
  }
}
```
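
Because failures are recorded inline rather than aborting the run, a consumer may want to check for `_error` entries before trusting the file. The Go sketch below shows one way to do that. The `CollectorData` type and `failures` function are hypothetical names for this example, and the sketch assumes every per-target payload is a JSON object, as in the examples above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// CollectorData mirrors the documented layout:
// collector name -> node name (or "_cluster") -> payload object.
type CollectorData map[string]map[string]map[string]any

// failures returns one "collector/target: message" line for every entry
// that carries an "_error" key, sorted for stable output.
func failures(raw []byte) ([]string, error) {
	var data CollectorData
	if err := json.Unmarshal(raw, &data); err != nil {
		return nil, err
	}
	var out []string
	for name, targets := range data {
		for target, payload := range targets {
			if msg, ok := payload["_error"]; ok {
				out = append(out, fmt.Sprintf("%s/%s: %v", name, target, msg))
			}
		}
	}
	sort.Strings(out)
	return out, nil
}

func main() {
	raw := []byte(`{"sysctl":{"worker-1":{"_error":"job failed"},"worker-2":{"net.ipv4.ip_forward":"1"}}}`)
	fails, err := failures(raw)
	if err != nil {
		panic(err)
	}
	for _, f := range fails {
		fmt.Println(f)
	}
}
```

A wrapper script could run a check like this between `collect` and `scan` and stop early when any collector failed.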

## Writing collector output

Commands must write valid JSON to `outputPath` (default `/kvirtbp/output.json`). The CLI appends `cat <outputPath>` as the last command in the Job spec — this is what appears as pod logs and is parsed as JSON.

Intermediate commands may freely write to stdout/stderr without corrupting the output payload since they run before the final `cat`.

Example pattern for a sysctl collector:

```yaml
# Write JSON to the output path, then let the CLI's appended "cat" emit it.
# The ternary is parenthesized so awk does not read ">" as an output redirection.
commands:
  - "sysctl -a --pattern '^(net\\.ipv4\\.ip_forward|net\\.bridge\\.bridge-nf-call-iptables)' | awk 'BEGIN{printf \"{\"} {printf \"%s\\\"%s\\\": \\\"%s\\\"\", (NR>1?\",\":\" \"), $1, $3} END{print \"}\"}' > /kvirtbp/output.json"
```

Or with a helper image that already produces JSON:

```yaml
commands:
  - "my-tool dump-json > /kvirtbp/output.json"
```

## Declaring collectors in a bundle

Add a `collectors` array to the bundle's `metadata.json`:

```json
{
  "schemaVersion": "v1alpha1",
  "policyVersion": "1.0.0",
  "resources": ["v1/nodes"],
  "collectors": [
    {
      "name": "sysctl",
      "image": "alpine:3.21",
      "commands": [
        "apk add -q procps",
        "sysctl -a --pattern '^net\\.ipv4\\.ip_forward$' | awk 'BEGIN{printf \"{\"}{printf \"\\\"%s\\\":\\\"%s\\\"\", $1,$3}END{print \"}\"}' > /kvirtbp/output.json"
      ],
      "scope": "per-node",
      "privileged": true,
      "hostNetwork": true
    }
  ]
}
```

Running `kvirtbp collect --bundle ./policy/baseline` will execute these collectors automatically.

## Merging collector configs

When both `--bundle` and `--collector-config` are provided, the two sets are merged. `--collector-config` wins on name collision:

```bash
./bin/kvirtbp collect \
  --bundle ./policy/baseline \
  --collector-config ./overrides.json \
  --output collector-data.json
```

`overrides.json` can be a partial list — only the names that need to differ from the bundle's defaults need to be included.
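
The merge rule can be pictured with the short Go sketch below. It assumes that a same-named entry from `--collector-config` replaces the bundle entry as a whole (no field-by-field merging) and that override-only collectors are appended after the bundle's. Both points are assumptions of this example rather than documented behaviour, and `Collector` and `mergeCollectors` are hypothetical names.

```go
package main

import "fmt"

// Collector holds only the fields needed to illustrate merging.
type Collector struct {
	Name  string
	Image string
}

// mergeCollectors returns the bundle's collectors with same-named overrides
// swapped in, followed by collectors that exist only in overrides.
func mergeCollectors(bundle, overrides []Collector) []Collector {
	byName := make(map[string]Collector, len(overrides))
	for _, o := range overrides {
		byName[o.Name] = o // the last definition of a name wins
	}
	merged := make([]Collector, 0, len(bundle)+len(overrides))
	seen := make(map[string]bool, len(bundle)+len(overrides))
	for _, b := range bundle {
		if o, ok := byName[b.Name]; ok {
			b = o // override wins on name collision
		}
		merged = append(merged, b)
		seen[b.Name] = true
	}
	for _, o := range overrides {
		if !seen[o.Name] {
			merged = append(merged, byName[o.Name])
			seen[o.Name] = true
		}
	}
	return merged
}

func main() {
	bundle := []Collector{{Name: "sysctl", Image: "alpine:3.21"}}
	overrides := []Collector{{Name: "sysctl", Image: "registry.example/sysctl:dev"}}
	for _, c := range mergeCollectors(bundle, overrides) {
		fmt.Println(c.Name, c.Image)
	}
}
```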

## RBAC requirements

The identity running `kvirtbp collect` needs:

```yaml
rules:
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create", "get", "list", "delete"]
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list"]
  - apiGroups: [""]
    resources: ["namespaces"]
    verbs: ["get", "create"]
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["list"] # needed for per-node scope
```

If `--collector-namespace` already exists, the `namespaces` create verb is not required.

## Debugging

Use `--no-collector-cleanup` to prevent Job deletion after completion:

```bash
./bin/kvirtbp collect --bundle ./policy/baseline \
  --no-collector-cleanup --output collector-data.json
```

Then inspect failed Jobs directly:

```bash
kubectl get jobs -n kvirtbp-collectors
kubectl logs -n kvirtbp-collectors -l collector-name=sysctl
```

## Security considerations

- `privileged`, `hostPID`, and `hostNetwork` must be explicitly opted into per collector; they are never defaulted.
- The collector namespace (`kvirtbp-collectors` by default) should have a restrictive Pod Security Admission (PSA) policy. Set the `pod-security.kubernetes.io/enforce: privileged` label only if your collectors require host access, and scope the namespace RBAC tightly.
- Collector `env` values are stored in plain text in the Job spec. Do not use them for secrets; use a Kubernetes Secret volume mount instead.
- The output file (`collector-data.json`) may contain sensitive node data. Treat it with appropriate access controls.
37 changes: 37 additions & 0 deletions docs/operations.md
@@ -16,6 +16,38 @@ The scanner expects read/list access to:

Insufficient access is reported as explicit permission findings (`perm-list-*`) and may degrade security/availability baseline outcomes.

The `collect` subcommand creates Kubernetes Jobs and therefore requires additional permissions in the collector namespace:

- `batch/jobs` — create, get, list, delete
- `pods/log` — get (to read Job output)
- `namespaces` — get, create (only if `--collector-namespace` does not already exist)

## Collector workflow

The `collect` subcommand deploys short-lived Kubernetes Jobs to gather data that Rego policies can reference via `input.cluster.collectors`. This is a separate step from `scan`; the scan command itself makes no cluster writes.

Typical two-step workflow:

```bash
# Step 1: collect — writes collector-data.json
./bin/kvirtbp collect --bundle ./policy/baseline --output collector-data.json

# Step 2: scan — injects collected data, makes no cluster writes
./bin/kvirtbp scan --engine rego --policy-bundle ./policy/baseline \
  --collector-data collector-data.json
```

Key `collect` flags:

- `--bundle` — loads collector definitions from a bundle's `metadata.json` automatically
- `--collector-config` — path to a standalone JSON file of `[]CollectorConfig`; merged with bundle collectors (file wins on name collision)
- `--collector-namespace` — Kubernetes namespace for Jobs (default: `kvirtbp-collectors`; created if absent)
- `--collector-timeout` — maximum time to wait for all collectors (default: `5m`)
- `--no-collector-cleanup` — keep completed Jobs after collection (useful for debugging)
- `--output` — path for the collector data JSON file (default: `collector-data.json`)

Collector Jobs are deleted automatically after completion unless `--no-collector-cleanup` is set. A `TTLSecondsAfterFinished` of 300 seconds is also set on each Job as a safety net.

## Runtime behavior

Key runtime flags:
@@ -29,6 +61,7 @@ Key runtime flags:
- `--policy-bundle`
- `--waiver-file`
- `--show-runbook`
- `--collector-data` (inject pre-collected node/cluster data into `input.cluster.collectors`)

Output modes:

@@ -91,5 +124,9 @@ Waiver rules:
- Apply PSA enforce labels and baseline NetworkPolicies.
- Namespace guardrail failures:
- Add ResourceQuota and LimitRange to targeted namespaces.
- Collector failures:
- Check Job status in the collector namespace: `kubectl get jobs -n kvirtbp-collectors`
- Use `--no-collector-cleanup` to retain failed Jobs for log inspection.
- Ensure the identity running `kvirtbp collect` has `batch/jobs` create/get/list/delete, `pods` get/list, and `pods/log` get in the collector namespace.

For remediation-specific playbooks, see `docs/runbooks.md`.