Add parse-size and structural guardrails for untrusted flow files #416

@dgenio

Summary

Add configurable, conservative limits to the flow-file loaders (max file size, max step count, max nesting depth, max string length) so a malformed or hostile .flow.yaml/.flow.json cannot exhaust memory or CPU before validation, raising a typed FlowSerializationError when a limit is exceeded.

Why this matters

Flow files are the primary untrusted input surface: they arrive from repositories, contributor PRs validated by the GitHub Action, and generated drafts. yaml.safe_load prevents code execution, but YAML alias expansion, deeply nested mappings, and giant strings can still cause pathological resource use before Pydantic validation runs. Bounding the input defends the CI Action and any host that loads third-party flows. Pairs with the adversarial corpus (issue 15).

Current evidence

  • chainweaver/serialization.py uses yaml.safe_load and raises FlowSerializationError, but reads the whole file and parses it before any size/shape bound (verified by reading the loader).
  • .github/actions/chainweaver validates arbitrary contributor flow files in CI — a direct untrusted-input path.
  • The open issue #345, "Add an opt-in allowlist hook for schema-ref module resolution" (schema-ref allowlist), hardens a different injection vector (module resolution); this issue is the resource-exhaustion complement.

External context

Bounding input size/depth is standard defensive practice for parsers consuming untrusted documents; YAML billion-laughs-style alias expansion is a known class.

Proposed implementation

  1. Add limits as loader parameters with conservative defaults (e.g. max bytes, max steps, max depth, max string length), overridable via chainweaver validate/check flags and the library API.
  2. Enforce the size limit before the file is read in full (or through a stream-bounded read); enforce the step, depth, and string-length limits during or after parsing, raising FlowSerializationError with the offending limit named.
  3. For YAML, disable or bound alias expansion if safe_load permits unbounded expansion.
  4. Wire the GitHub Action to use the same defaults and surface a clear annotation.
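
As a rough illustration of steps 1 and 2, the sketch below shows one way a bounded read and a post-parse structural check could fit together. It is not the existing loader: `LoadLimits`, `read_bounded`, `check_structure`, the default values, and the assumption that steps live under a top-level `steps` key are all hypothetical, and `FlowSerializationError` is redefined here only as a stand-in for the class in chainweaver/exceptions.py.

```python
import io
from dataclasses import dataclass


class FlowSerializationError(Exception):
    """Stand-in for the real exception in chainweaver/exceptions.py."""


@dataclass(frozen=True)
class LoadLimits:
    # Hypothetical defaults for illustration; real values should be
    # measured against the largest in-repo example flow.
    max_bytes: int = 1_000_000
    max_steps: int = 1_000
    max_depth: int = 32
    max_string_length: int = 100_000


def _fail(name, value):
    raise FlowSerializationError(f"flow file exceeds limit {name}={value}")


def read_bounded(stream, limits):
    """Read at most max_bytes from a stream, without buffering any excess."""
    data = stream.read(limits.max_bytes + 1)
    if len(data) > limits.max_bytes:
        _fail("max_bytes", limits.max_bytes)
    return data


def check_structure(doc, limits):
    """Walk already-parsed data iteratively and enforce shape limits."""
    # Assumption: steps are a list under a top-level "steps" key.
    if isinstance(doc, dict):
        steps = doc.get("steps")
        if isinstance(steps, list) and len(steps) > limits.max_steps:
            _fail("max_steps", limits.max_steps)

    # Explicit stack instead of recursion, so deep input cannot
    # trigger RecursionError inside the checker itself.
    stack = [(doc, 1)]
    while stack:
        node, depth = stack.pop()
        if isinstance(node, str):
            if len(node) > limits.max_string_length:
                _fail("max_string_length", limits.max_string_length)
        elif isinstance(node, (dict, list)):
            if depth > limits.max_depth:
                _fail("max_depth", limits.max_depth)
            if isinstance(node, dict):
                for key, value in node.items():
                    stack.append((key, depth + 1))
                    stack.append((value, depth + 1))
            else:
                stack.extend((item, depth + 1) for item in node)


if __name__ == "__main__":
    limits = LoadLimits(max_bytes=64)
    try:
        read_bounded(io.BytesIO(b"x" * 65), limits)
    except FlowSerializationError as exc:
        print(exc)
```

This walker runs after parsing, so it does not protect the parser itself. PyYAML materializes an alias as a reference to the same object, so a walker like this revisits shared nodes once per reference; a real implementation would likely also need the alias bound from step 3 or a total node budget.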

AI-agent execution notes

Inspect chainweaver/serialization.py, chainweaver/cli.py (validate/check), .github/actions/chainweaver/annotate.py, and chainweaver/exceptions.py. Coordinate with the adversarial-corpus issue (15): the resource-shaped cases there are the regression net. Keep defaults conservative but not so tight that legitimate large flows break (measure against the biggest in-repo example). Keep the implementation pure, with no network access. Frame the change defensively, with no exploit detail in docs.

Acceptance criteria

  • Files exceeding any limit fail fast with a typed FlowSerializationError naming the limit, in under roughly 2 seconds for the corpus resource cases.
  • Limits are configurable via API and CLI with documented conservative defaults.
  • Legitimate existing example flows load unchanged.

Test plan

Negative tests for each limit (oversized file, 10k steps, deep nesting, huge string), timing assertion, regression that real examples still load, GitHub Action annotation test.
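
The negative cases above could be generated in code rather than committed as large fixture files. The following is a minimal sketch under assumptions: the builder names, the `steps`/`id`/`name`/`child` keys, and `assert_fails_fast` are hypothetical, and the document shapes may not match the real flow schema.

```python
import time


def oversized_payload(max_bytes):
    """Raw bytes one byte over the size limit."""
    return b" " * (max_bytes + 1)


def many_steps(count):
    """Document with `count` trivial steps (key names are assumed)."""
    return {"steps": [{"id": f"s{i}"} for i in range(count)]}


def deep_nesting(depth):
    """Mapping nested `depth` levels deep, built without recursion."""
    doc = {}
    for _ in range(depth - 1):
        doc = {"child": doc}
    return doc


def huge_string(length):
    """Document holding a single string of `length` characters."""
    return {"name": "x" * length}


def assert_fails_fast(load, payload, exc_type, budget_seconds=2.0):
    """Assert that load(payload) raises exc_type within the time budget."""
    start = time.perf_counter()
    try:
        load(payload)
    except exc_type as exc:
        elapsed = time.perf_counter() - start
        assert elapsed < budget_seconds, f"rejection took {elapsed:.2f}s"
        return exc
    raise AssertionError("loader accepted a payload that should be rejected")
```

In a real test, `load` would be the loader under test and `exc_type` would be FlowSerializationError, with a further assertion that the message names the offending limit.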

Documentation plan

docs/security.md, docs/cli.md (new flags), CHANGELOG (security note framed defensively).

Migration and compatibility notes

Not expected to require migration if defaults are above realistic flow sizes; document the defaults and how to raise them.

Risks and tradeoffs

Too-tight defaults reject legitimate large flows (mitigate by measuring and documenting overrides); limits add a few parameters to the loader surface.

Suggested labels

security, reliability, testing
