Infer robust Pydantic v2 models from messy, evolving JSON streams.
Pydanticforge helps you go from raw JSON (APIs, logs, LLM output) to typed Python models without hand-writing schemas. It infers types from real samples, merges them in an order-independent way, and can detect when new data no longer fits the inferred schema (drift).
- Why pydanticforge?
- Requirements
- Dependency breakdown
- Installation
- Quick start
- How it works (for new Python developers)
- Implementation details by command
- Commands in detail
- Use cases and examples
- Programmatic API
- Public API method reference
- Input and output formats
- Open source standards
- Development
- License
- APIs and scrapers often return JSON where some fields appear only sometimes, or types change (e.g. `"id"` as a number in one response and a string in another).
- LLM output and user-submitted JSON are inconsistent; manually keeping Pydantic models in sync is tedious and error-prone.
- Logs and event streams evolve over time; new fields get added and old code breaks if the schema is not updated.
Pydanticforge:
- Infers a single schema from many JSON samples (streaming or from files).
- Generates Pydantic v2 `BaseModel` classes you can use in your app.
- Monitors directories of JSON files and reports when data no longer matches the inferred schema (drift).
- Diffs two generated model files and classifies changes as breaking vs non-breaking.
You get type-safe models that stay in sync with real data instead of hand-maintained schemas that drift out of date.
- Python 3.10+
- Package manager: `pip` (latest recommended)
- Dependencies: see the full dependency matrix below
This table is the source-of-truth for package dependencies and their purpose.
| Dependency | Type | Why it is used | Where used |
|---|---|---|---|
| `pydantic` | Runtime | Generated models target Pydantic v2 APIs (`BaseModel`, `RootModel`, `ConfigDict`). | Generated model code (`modelgen.emit`) |
| `orjson` | Runtime (optional fast path) | Faster JSON decode for file and stream ingestion, with stdlib `json` fallback. | `src/pydanticforge/io/files.py`, `src/pydanticforge/io/stream.py` |
| `typer` | Runtime compatibility | Kept for CLI roadmap compatibility and potential UX migration. | Packaging metadata (`pyproject.toml`) |
| `rich` | Runtime compatibility | Kept for richer terminal output roadmap. | Packaging metadata (`pyproject.toml`) |
| `watchfiles` | Runtime compatibility | Kept for continuous watch-mode roadmap (monitor Phase 3). | Packaging metadata (`pyproject.toml`) |
Dev-only tools (`build`, `twine`, `ruff`, `mypy`, `pytest`, `pytest-cov`, `hypothesis`, `pip-audit`, `pre-commit`) are defined in `pyproject.toml` under `[project.optional-dependencies].dev`.
Dependency policy:
- New dependency additions must include README and `pyproject.toml` updates in the same PR.
- Unused dependencies should be removed or moved to optional extras.
- Security-impacting dependency changes must run `pip-audit` before release.
From PyPI:

```bash
pip install pydanticforge
```

From source (e.g. for development):

```bash
git clone https://github.com/adwantg/pydanticforge.git
cd pydanticforge
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -e ".[dev]"
```

The CLI is available as `pydanticforge`:

```bash
pydanticforge --help
```

1. Generate Pydantic models from sample JSON (stdin)
```bash
echo '{"id": 1}
{"id": 2, "name": "alice"}' | pydanticforge generate --output models.py
```

This produces `models.py` with a root model whose fields are inferred from both objects: `id` (required) and `name` (optional, because it appeared in only one sample).
2. Save inferred schema state for later

```bash
cat samples.ndjson | pydanticforge generate --output models.py --save-state .pydanticforge/state.json
```

3. Watch a directory for new JSON and check for drift

```bash
pydanticforge monitor ./logs --state .pydanticforge/state.json
```

4. Compare two model versions (e.g. before/after a change)

```bash
pydanticforge diff models_v1.py models_v2.py
```

Pydanticforge does schema inference: it looks at many JSON values and builds a single type that fits all of them.
- Each JSON value is turned into an internal type node (e.g. object with fields, array of items, string, int, float, etc.).
- When you feed multiple samples, it joins their types with rules that are order-independent (associative and commutative), so you get the same result no matter the order of samples.
Examples of join rules:
- Same type → same type (e.g. `int` + `int` → `int`).
- Different scalars → union (e.g. `int` and `str` → `int | str`).
- `int` and `float` → merged to `float` by default (or `int | float` with `--strict-numbers`).
- Two objects → one object with merged fields; a field present in only some samples becomes optional in the generated model.
- Two arrays → array of the join of their item types.
- `null` and another type → `type | None`.

So: required = field appeared in every sample; optional = field missing in at least one sample. Generated Pydantic models use `Field(...)`-style metadata and `Optional`/`| None` accordingly.
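The join rules above can be pictured with plain Python. This is an illustrative sketch only, not pydanticforge's internals: it models a type as a set of scalar type names, whereas the real implementation joins full type graphs.

```python
def infer_scalar(value: object) -> frozenset[str]:
    """Map one JSON scalar to a set of type names ({"None"} for null)."""
    return frozenset({"None" if value is None else type(value).__name__})


def join(a: frozenset[str], b: frozenset[str], strict_numbers: bool = False) -> frozenset[str]:
    """Order-independent join: union the type sets, then apply the int/float rule."""
    merged = a | b
    if not strict_numbers and {"int", "float"} <= merged:
        merged -= {"int"}  # int and float collapse to float by default
    return merged
```

Because set union is associative and commutative, `join` gives the same result regardless of sample order — the same property the real lattice join relies on.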
The state file (e.g. .pydanticforge/state.json) stores the inferred type graph (the internal representation), not the generated Python. You can:
- Generate models from that state later without re-reading all JSON.
- Monitor a directory: compare new JSON against this state and optionally autopatch (merge new types into state and regenerate models).
Drift means “this JSON does not match the schema we inferred.” The monitor compares each file’s inferred type against the expected type (from state). It reports:
- Type mismatches (e.g. expected `int`, got `str`).
- Missing required fields.
- New fields (informational).
With --autopatch, the state is updated by joining the observed type into the expected type, and you can write updated models to a file.
The diff command parses two Pydantic model files (AST), extracts class and field names, types, and required/optional, and compares them:
- Breaking: removed class/field, required field added, type narrowed (e.g. `str` → `int`), optional → required.
- Non-breaking: new optional field, new class, type widened (e.g. `int` → `int | str`), required → optional.
So you can use it in CI or reviews to see if a schema change might break consumers.
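One way to picture the widened/narrowed distinction is as a subset check over union members: if every old type is still accepted, the change is safe. This is a simplified sketch, not the library's AST-based implementation, and it only handles `|`-style unions.

```python
def parse_union(annotation: str) -> frozenset[str]:
    """Split an annotation like "int | str" into its member type names."""
    return frozenset(part.strip() for part in annotation.split("|"))


def classify_type_change(old: str, new: str) -> str:
    """Classify a field type change as unchanged, non-breaking, or breaking."""
    old_set, new_set = parse_union(old), parse_union(new)
    if old_set == new_set:
        return "unchanged"
    if old_set < new_set:
        return "non-breaking"  # widened: every previously valid value still validates
    return "breaking"          # narrowed or changed: some previously valid values may fail
```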
This section explains how each command is implemented internally and what it is best used for.
- Implementation: reads stdin via `iter_json_from_stream`, calls `TypeInferer.observe` per sample, and periodically emits model code with `generate_models`.
- Persistence: always writes state with `save_schema_state`; can also export JSON Schema.
- Best use case: long-running NDJSON stream ingestion where model shape evolves over time.
- Implementation: builds root type from one of four sources (stdin, files/directories, state, JSON Schema), then renders deterministic Pydantic code.
- Core guarantee: deterministic output ordering for classes and fields for stable diffs in CI.
- Best use case: one-shot model generation from payload samples or existing schema assets.
- Implementation: scans JSON files, infers observed types, compares against expected state with `detect_drift`, and classifies event severity.
- Operational semantics: `--fail-on` controls CI exit behavior (`none`, `breaking`, `any`) with explicit exit codes.
- Best use case: schema-drift gates in CI or periodic log-directory audits.
- Implementation: parses both model files using AST, normalizes model/field signatures, and computes semantic changes.
- Classification: breaking vs non-breaking is based on field/class changes and type widening/narrowing.
- Best use case: release and PR checks for backward-compatibility impact.
- Implementation: reads state and emits a deterministic state hash plus schema node/field counts; an optional drift snapshot is generated with `monitor_directory_once`.
- CI value: stable machine-readable snapshots for audit trails and schema-regression baselines.
- Implementation: round-trips internal state <-> JSON Schema through dedicated conversion utilities.
- Interop value: enables coexistence with JSON Schema-first tooling while preserving pydanticforge workflow.
Reads newline-delimited JSON (NDJSON) from stdin and incrementally updates the inferred schema. Optionally writes models and state periodically or at the end.
```bash
pydanticforge watch --input stdin [OPTIONS]
```

| Option | Description |
|---|---|
| `--input` | Input source; only `stdin` is supported. |
| `--output` | Write generated Pydantic models to this file. If omitted, prints to stdout. |
| `--state` | Path to state file (default: `.pydanticforge/state.json`). |
| `--root-name` | Name of the root model class (default: `PydanticforgeModel`). |
| `--every N` | Emit models every N samples (0 = only at end). |
| `--strict-numbers` | Keep `int` and `float` distinct (union) instead of merging to `float`. |
| `--export-json-schema` | Also write the inferred schema as JSON Schema. |
| `--json-schema-title` | JSON Schema title (default: `PydanticforgeSchema`). |
Example: stream API responses and update models every 100 records
```bash
tail -f api_responses.ndjson | pydanticforge watch --input stdin --output models.py --state .pydanticforge/state.json --every 100
```

Infers schema from stdin, from file(s)/directory(ies), from a saved state file, or from JSON Schema; then writes Pydantic v2 model code.
```bash
pydanticforge generate [OPTIONS]
```

| Option | Description |
|---|---|
| `--input` | JSON file or directory (repeat for multiple). Directory scans include `.json`, `.ndjson`, and `.jsonl` files (recursive). |
| `--from-state` | Use this state file instead of inferring from input. |
| `--from-json-schema` | Use this JSON Schema file as input. |
| `--output` | Output path for the generated `models.py`. If omitted, prints to stdout. |
| `--save-state` | After inferring, save state to this path. |
| `--export-json-schema` | Also export the inferred schema as JSON Schema. |
| `--json-schema-title` | JSON Schema title for export (default: `PydanticforgeSchema`). |
| `--root-name` | Root model class name (default: `PydanticforgeModel`). |
| `--strict-numbers` | Keep `int` and `float` as a union. |
Examples
From stdin (e.g. NDJSON):
```bash
cat events.ndjson | pydanticforge generate --output models.py --save-state .pydanticforge/state.json
```

From a single file or directory:

```bash
pydanticforge generate --input ./samples/data.json --output models.py
pydanticforge generate --input ./samples/events.ndjson --output models.py
pydanticforge generate --input ./samples/ --output models.py
```

From previously saved state (no JSON needed):

```bash
pydanticforge generate --from-state .pydanticforge/state.json --output models.py
```

From JSON Schema:

```bash
pydanticforge generate --from-json-schema schema.json --output models.py
```

Scans a directory for `.json`, `.ndjson`, and `.jsonl` files, infers a type from each file’s content, and compares it to the expected schema from the state file. Reports drift (type mismatches, missing required fields, new fields). Optionally updates state and regenerates models (autopatch).
```bash
pydanticforge monitor <directory> [OPTIONS]
```

| Option | Description |
|---|---|
| `directory` | Directory to scan (required). |
| `--state` | State file path (default: `.pydanticforge/state.json`). If it doesn’t exist, the first file(s) establish the baseline. |
| `--model-output` | If set, and `--autopatch` is used, write updated models here. |
| `--export-json-schema` | If set with `--autopatch`, write the merged schema as JSON Schema. |
| `--json-schema-title` | JSON Schema title for export (default: `PydanticforgeSchema`). |
| `--root-name` | Root model name for generated code. |
| `--recursive / --no-recursive` | Scan subdirectories (default: recursive). |
| `--autopatch` | Merge drifted types into state and save; if `--model-output` is set, regenerate models. |
| `--strict-numbers` | Use strict number handling when joining types. |
| `--format` | Output format: `text` or `json`. |
| `--fail-on` | Exit threshold: `none`, `breaking`, or `any`. |
Examples
Report drift only:
```bash
pydanticforge monitor ./logs/api_responses --state .pydanticforge/state.json
```

Auto-update state and models when drift is found:

```bash
pydanticforge monitor ./logs --state .pydanticforge/state.json --autopatch --model-output models.py
```

CI-style failure on drift severity:

```bash
pydanticforge monitor ./logs --state .pydanticforge/state.json --format json --fail-on breaking
```

Monitor exit codes:

- `0`: no threshold reached
- `20`: warnings found and `--fail-on any`
- `21`: breaking drift found and `--fail-on breaking|any`
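Conceptually, each drift event comes from comparing an observed sample against the expected shape. The sketch below shows that comparison for flat objects using only the stdlib — illustrative only; the real `detect_drift` walks the full nested type graph.

```python
def detect_drift_flat(expected: dict[str, str], required: set[str],
                      observed: dict[str, object]) -> list[str]:
    """Compare one flat JSON object against an expected {field: type name} map."""
    events = []
    for name, value in observed.items():
        if name not in expected:
            events.append(f"[info] new field: {name}")
        elif type(value).__name__ != expected[name]:
            events.append(f"[breaking] {name}: expected {expected[name]}, got {type(value).__name__}")
    for name in required - observed.keys():
        events.append(f"[breaking] missing required field: {name}")
    return sorted(events)
```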
Parses two Python files containing Pydantic `BaseModel` or `RootModel` classes and prints a semantic diff (added/removed classes and fields, required/optional and type changes), classified as breaking or non-breaking.
```bash
pydanticforge diff <old_model.py> <new_model.py> [OPTIONS]
```

| Option | Description |
|---|---|
| `--format` | Output format: `text` or `json`. |
| `--fail-on-breaking` | Exit with code 1 if any change is breaking (useful in CI). |
Example
```bash
pydanticforge diff models_v1.py models_v2.py
pydanticforge diff models_v1.py models_v2.py --format json
pydanticforge diff models_v1.py models_v2.py --fail-on-breaking
```

Example output:

```text
[breaking] User.email: Field removed
[non-breaking] User.avatar_url: New optional field
[breaking] User.id: Type narrowed: int -> str
```
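The AST approach behind this can be approximated in a few lines of stdlib Python. This is a simplified sketch that only extracts class names and annotated field names, not the library's full signature normalization:

```python
import ast


def extract_fields(source: str) -> dict[str, dict[str, str]]:
    """Map class name -> {field name: annotation source} for annotated class fields."""
    models: dict[str, dict[str, str]] = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef):
            fields = {}
            for stmt in node.body:
                if isinstance(stmt, ast.AnnAssign) and isinstance(stmt.target, ast.Name):
                    fields[stmt.target.id] = ast.unparse(stmt.annotation)
            models[node.name] = fields
    return models
```

Diffing then reduces to comparing two such mappings: keys present in one but not the other are added/removed fields, and differing annotation strings are type changes.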
Emits a deterministic summary of schema state (including a stable hash), and optionally a drift snapshot for a directory.
```bash
pydanticforge status [directory] [OPTIONS]
```

| Option | Description |
|---|---|
| `directory` | Optional directory to include a drift snapshot in the status payload. |
| `--state` | State file path (default: `.pydanticforge/state.json`). |
| `--recursive / --no-recursive` | When a directory is provided, scan recursively (default: recursive). |
| `--strict-numbers` | Use strict number handling for the optional drift snapshot. |
| `--format` | Output format: `text` or `json`. |
| `--output` | Write the report to a file instead of stdout. |
Example
```bash
pydanticforge status ./logs --state .pydanticforge/state.json --format json --output schema_status.json
```

Converts pydanticforge internal state files and JSON Schema documents in either direction.
```bash
pydanticforge schema [OPTIONS]
```

| Option | Description |
|---|---|
| `--from-state` | Input state file. |
| `--from-json-schema` | Input JSON Schema file. |
| `--to-state` | Output state file. |
| `--to-json-schema` | Output JSON Schema file. |
| `--json-schema-title` | Title to use when writing JSON Schema (default: `PydanticforgeSchema`). |
Examples
```bash
pydanticforge schema --from-state .pydanticforge/state.json --to-json-schema schema.json
pydanticforge schema --from-json-schema schema.json --to-state .pydanticforge/state.json
```

You have an API that returns JSON with inconsistent fields:
```
{"user_id": 1, "name": "Alice"}
{"user_id": "2", "name": "Bob", "email": "bob@example.com"}
```

Save samples to `api_samples.ndjson`, then:
```bash
cat api_samples.ndjson | pydanticforge generate --output api_models.py --save-state .pydanticforge/state.json
```

Use the generated module in your code:
```python
from api_models import PydanticforgeModel

data = {"user_id": 1, "name": "Alice"}
obj = PydanticforgeModel.model_validate(data)
```

New JSON log files are written to `./logs`. You want to detect when the “shape” of logs changes:
```bash
# One-time: build initial state from existing logs
pydanticforge generate --input ./logs --output models.py --save-state .pydanticforge/state.json

# Periodically or in CI: check for drift
pydanticforge monitor ./logs --state .pydanticforge/state.json

# Or auto-merge new shapes and refresh models
pydanticforge monitor ./logs --state .pydanticforge/state.json --autopatch --model-output models.py
```

Pipe NDJSON from any source (script, API, LLM output) into generate or watch:
```bash
python my_script.py | pydanticforge generate --output models.py
```

Generate `models_v1.py` from the main branch and `models_v2.py` from a feature branch (or from two commits). Then:
```bash
pydanticforge diff models_v1.py models_v2.py --fail-on-breaking
```

If the diff contains any breaking change, the command exits with code 1 so your CI can fail the build.
Pydanticforge infers nested objects and optional fields from multiple samples. For example:
```bash
echo '{"id": 1, "meta": {"tag": "alpha"}}
{"id": 2.0, "meta": {"tag": "beta", "score": 10}}' | pydanticforge generate --output models.py
```

The root model will have `id: float`, and a nested `meta` object with `tag` (required) and `score` (optional). Generated nested objects become separate Pydantic model classes when needed.
You can use pydanticforge inside your own Python code instead of the CLI.
Each public method below should always have at least one runnable example in README when behavior changes.
| Method | Purpose | Key internals | Typical use case |
|---|---|---|---|
| `TypeInferer.observe` / `observe_many` | Incrementally infer a type graph from JSON values. | Lattice join (`join_types`) over inferred nodes. | Build a schema from a stream or batched payloads. |
| `generate_models` | Emit deterministic Pydantic v2 model code from the inferred root node. | Stable naming/order via model generation modules. | Materialize `models.py` for application validation. |
| `save_schema_state` / `load_schema_state` | Persist and restore inference state. | Canonical serialized IR payload with version tag. | Reuse a schema without reprocessing all samples. |
| `schema_state_hash` | Compute a deterministic hash of the state payload. | Canonical JSON serialization + SHA-256. | CI snapshots and schema-baseline tracking. |
| `save_json_schema` / `load_json_schema` | Convert between internal state and JSON Schema files. | Explicit node-level conversion rules. | Interop with external schema tooling. |
| `monitor_directory_once` | Scan a directory and detect drift against expected state. | Sample inference + `detect_drift` event generation. | CI drift checks or scheduled production audits. |
| `diff_models` / `format_diff` | Semantic diff between Pydantic model files. | AST parsing + field/type requiredness comparison. | Breaking-change review before release. |
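The “canonical JSON serialization + SHA-256” idea behind `schema_state_hash` can be sketched with the stdlib. This is illustrative — the library's exact canonicalization rules may differ — but it shows why sorted keys and fixed separators make the hash deterministic:

```python
import hashlib
import json


def canonical_hash(payload: dict) -> str:
    """Deterministic hash of a JSON-serializable payload: sorting keys and
    fixing separators makes the serialization independent of dict insertion
    order and formatting, so equal payloads always hash equally."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```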
Infer from in-memory samples, then generate models and save state:

```python
from pathlib import Path

from pydanticforge.inference.infer import TypeInferer
from pydanticforge.modelgen.emit import generate_models
from pydanticforge.state import save_schema_state

inferer = TypeInferer()
inferer.observe({"id": 1, "name": "Alice"})
inferer.observe({"id": 2, "name": "Bob", "email": "bob@example.com"})

root = inferer.root  # TypeNode | None
if root is not None:
    code = generate_models(root, root_name="User")
    Path("models.py").write_text(code)
    save_schema_state(Path(".pydanticforge/state.json"), root)
```

Regenerate models from previously saved state:

```python
from pathlib import Path

from pydanticforge.modelgen.emit import generate_models
from pydanticforge.state import load_schema_state

root = load_schema_state(Path(".pydanticforge/state.json"))
code = generate_models(root, root_name="User")
Path("models.py").write_text(code)
```

Convert state to JSON Schema and back:

```python
from pathlib import Path

from pydanticforge.json_schema import load_json_schema, save_json_schema
from pydanticforge.state import load_schema_state

root = load_schema_state(Path(".pydanticforge/state.json"))
save_json_schema(Path("schema.json"), root, title="MySchema")
round_tripped = load_json_schema(Path("schema.json"))
```

Scan a directory once and report drift:

```python
from pathlib import Path

from pydanticforge.monitor.watcher import monitor_directory_once
from pydanticforge.state import load_schema_state

state_path = Path(".pydanticforge/state.json")
expected = load_schema_state(state_path) if state_path.exists() else None
new_root, report = monitor_directory_once(
    Path("./logs"),
    expected_root=expected,
    recursive=True,
    autopatch=False,
)
print(f"Scanned {report.files_scanned} files, drift in {report.files_with_drift}")
for fd in report.drifts:
    print(f"  {fd.path}")
    for ev in fd.events:
        print(f"    [{ev.kind}] {ev.path}: expected {ev.expected}, observed {ev.observed}")
```

Diff two generated model files:

```python
from pathlib import Path

from pydanticforge.diff.semantic import diff_models, format_diff

entries = diff_models(Path("models_v1.py"), Path("models_v2.py"))
print(format_diff(entries))
breaking = [e for e in entries if e.severity == "breaking"]
```

To keep `int` and `float` as a union instead of merging to `float`:

```python
inferer = TypeInferer(strict_numbers=True)
```

Use the same `strict_numbers` when calling `join_types` (e.g. in monitor autopatch) for consistency.
- Stdin (watch / generate): newline-delimited JSON (NDJSON). Each line is one JSON value (object or array). If a line is an array, each element is treated as a separate sample.
- Files: `.json`, `.ndjson`, and `.jsonl` files. Content can be a single JSON value (object or array) or NDJSON; arrays are expanded into one sample per element.
- State file: JSON file with `schema_version` and `root` (internal type graph). Do not edit by hand; use the CLI or `save_schema_state`/`load_schema_state`.
- JSON Schema: Draft 2020-12 document import/export for state interop (`schema`, `generate`, `watch`, `monitor`).
- Monitor JSON report: `monitor --format json` emits a machine-readable summary plus per-file/per-event severities.
- Status report: `status --format json` emits deterministic schema metadata (`state_hash`, counts) and an optional drift snapshot.
- Generated models: valid Python with Pydantic v2 `BaseModel` (and optionally `RootModel` for non-object roots), `ConfigDict(extra="allow")`, and deterministic field order.
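The NDJSON conventions above (one value per line, top-level arrays expanded into samples) and the orjson-with-stdlib-fallback loading noted in the dependency table can be sketched as follows. This is an illustrative reimplementation, not the library's `iter_json_from_stream`:

```python
import json
from typing import Any, Iterable, Iterator

try:
    import orjson  # optional fast path

    def _loads(line: str) -> Any:
        return orjson.loads(line)
except ImportError:
    def _loads(line: str) -> Any:
        return json.loads(line)  # stdlib fallback


def iter_ndjson_samples(lines: Iterable[str]) -> Iterator[Any]:
    """Yield one sample per JSON value; blank lines are skipped and
    top-level arrays expand into one sample per element."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        value = _loads(line)
        if isinstance(value, list):
            yield from value
        else:
            yield value
```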
Project standards expected for open-source maintainability:
- Versioning: semantic versioning (`MAJOR.MINOR.PATCH`), with `__version__` and `pyproject.toml` kept in sync.
- Documentation drift policy: any behavior/dependency/API change must update README examples and the method reference in the same PR.
- License: MIT (`LICENSE`).
- Citation metadata: `CITATION.cff` maintained for academic/professional attribution.
```bash
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
pre-commit install
```

Quality checks:

```bash
ruff check .
ruff format --check .
mypy src
python -m pytest -q
pip-audit
```

MIT