This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
scrapling-schema is a schema-driven HTML extractor that converts HTML into structured JSON using declarative specs. Users can define extraction rules in Python (with full IDE type hints) or YAML format.
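A hedged illustration of the declarative style (key names are assumptions based on the field and transform options described below, not necessarily the library's exact format):

```yaml
# spec.yml - hypothetical example spec
title:
  css: "h1"
  text: true
  transform: ["strip"]
links:
  css: "a.nav"
  attr: "href"
  list: true
```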
## Commands

```bash
# Install in development mode
pip install -e ".[dev]"

# Run all tests
pytest

# Run specific test file
pytest tests/test_extractor.py

# Run specific test
pytest tests/test_extractor.py::test_scalar_text_default_and_strip_transform
```

```bash
# Extract data from HTML
scrapling-schema --spec spec.yml --html-file page.html

# Generate JSON Schema from spec
scrapling-schema --spec spec.yml --schema
```

## Architecture

`src/scrapling_schema/types.py` - Python type definitions for type-safe spec building
- `Schema`, `Field`, `Options` - Main spec structures with full dataclass support
- Transform classes: `RegexSub`, `Split`, `ToInt`, `ToFloat`, `Default`
- All types have `.to_dict()` methods to convert to dict format for core processing
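The dataclass-plus-`.to_dict()` pattern can be sketched as follows. This is a minimal illustration, not the library's implementation; the dict key names are assumptions:

```python
import re
from dataclasses import dataclass


@dataclass
class RegexSub:
    """Illustrative stand-in for the library's RegexSub transform."""
    pattern: str
    repl: str

    def to_dict(self) -> dict:
        # Serialize to a plain-dict form for the core engine / YAML
        # round-tripping (key names here are assumptions).
        return {"regex_sub": {"pattern": self.pattern, "repl": self.repl}}

    def apply(self, value: str) -> str:
        return re.sub(self.pattern, self.repl, value)


t = RegexSub(pattern=r"\s+", repl=" ")
print(t.to_dict())
print(t.apply("a   b\tc"))  # "a b c"
```

Keeping the typed dataclass layer separate from the dict format lets Python users get IDE hints while YAML specs feed the same core engine.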
`src/scrapling_schema/core.py` - Extraction engine and spec processing

- `extract()` - Main extraction function, takes HTML + dict spec
- `extract_from_yaml()` / `extract_from_file()` - YAML spec loaders
- `schema()` / `schema_from_yaml()` - JSON Schema generation
- Uses `scrapling.parser.Selector` for HTML parsing
- Transform pipeline: processes `transform` lists sequentially
- Field evaluation: `_eval_field()` handles CSS selection, text/attr/html extraction, nested fields, and transforms
`src/scrapling_schema/cli.py` - Command-line interface

- Handles `--spec`, `--html`, `--html-file`, `--schema` arguments
- Exit codes: 0 = success, 2 = spec error, 3 = validation error
## Data Flow

1. Spec (Python types or YAML) → dict format via `.to_dict()` or YAML parsing
2. HTML preprocessing: remove unwanted tags based on `options.clear.remove_tags`
3. Parse HTML with `scrapling.parser.Selector`
4. Evaluate fields recursively:
   - CSS selection (supports `"SELF"` for context node)
   - Extract text/attr/html
   - Apply transform pipeline
   - Handle nested fields and lists
5. Validate `required` fields
6. Return structured dict/list result
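The recursive evaluation step can be sketched with a stub node standing in for scrapling's `Selector`. All names and spec key shapes below are assumptions for illustration, not the library's actual code:

```python
# Minimal sketch of recursive field evaluation over a dict spec.
class Node:
    def __init__(self, text="", attrs=None, children=None):
        self.text = text
        self.attrs = attrs or {}
        self.children = children or {}  # css selector -> list[Node]

    def css(self, selector):
        if selector == "SELF":          # "SELF" means the context node itself
            return [self]
        return self.children.get(selector, [])


def eval_field(node, spec):
    matches = node.css(spec.get("css", "SELF"))
    if not matches and spec.get("required"):
        raise ValueError(f"required field matched nothing: {spec}")

    def one(n):
        if "fields" in spec:            # nested object: recurse per sub-field
            return {k: eval_field(n, s) for k, s in spec["fields"].items()}
        if spec.get("attr"):            # attribute extraction
            return n.attrs.get(spec["attr"])
        return n.text                   # default: text extraction

    if spec.get("list"):                # all matches vs. first match
        return [one(n) for n in matches]
    return one(matches[0]) if matches else None


page = Node(children={
    "h1": [Node(text="Hello")],
    "a": [Node(text="x", attrs={"href": "/a"}),
          Node(text="y", attrs={"href": "/b"})],
})
result = {
    "title": eval_field(page, {"css": "h1", "text": True}),
    "links": eval_field(page, {"css": "a", "attr": "href", "list": True}),
}
print(result)  # {'title': 'Hello', 'links': ['/a', '/b']}
```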
## Transforms

Transforms are applied sequentially. Built-in transforms:

- `"strip"` - whitespace trimming
- `"to_int"` / `"to_float"` - type conversion
- `RegexSub(pattern, repl)` - regex substitution
- `Split(delimiter)` - string splitting (requires `list=True`)
- `Default(value)` - fallback for empty values
- Custom callables (Python-only, not serializable to YAML)
## Field Extraction

Fields support one extraction mode at a time:

- `text=True` - extract text content
- `attr="name"` - extract attribute value
- `html=True` - extract outer HTML
- `fields={...}` - nested object/list extraction
Use `list=True` to extract multiple matches instead of just the first.
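A hedged sketch combining these options in a nested list field (key names assumed, not verified against the library):

```yaml
products:
  css: "div.product"
  list: true
  fields:
    name: {css: "h2", text: true}
    price: {css: ".price", text: true, transform: ["strip", "to_float"]}
```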
## Testing

Tests use pytest and cover:
- Basic extraction scenarios (test_extractor.py)
- JSON Schema generation (test_json_schema.py)
- Spec validation and error handling (test_spec_validation.py)
- Complex nested structures (test_more_scenarios.py)
Test HTML fixtures are defined inline as string constants.