Automatically parse HTML, generate extraction schemas, and produce production-ready code using AI agents.
Features • Quick Start • Architecture • Documentation • Examples
crawlAgent is an intelligent HTML extraction agent that uses specialized AI agents to automatically parse, understand, and extract structured data from HTML documents. Instead of manually writing XPath selectors or CSS queries, the system intelligently analyzes HTML structures, identifies content patterns, and generates production-ready extraction code.
- Intelligent Understanding: AI agents understand HTML semantics, not just syntax
- Multi-Agent Collaboration: Six specialized agents work together seamlessly
- Pattern Recognition: Automatically identifies common patterns across multiple pages
- Production-Ready: Generates robust, maintainable extraction code
- Smart Checkpointing: Resume from any step, never lose progress
| Agent | Purpose | Model |
|---|---|---|
| Analyzer Agent | Deep text-based HTML structure analysis | Claude Sonnet 4.5 (claude-sonnet-4-5-20250929) |
| Visual Analyzer | Visual layout analysis using vision models | Qwen-VL-Max (qwen-vl-max) |
| Orchestrator | Coordinates agents and synthesizes results | GPT-5 (gpt5) |
| Code Generator | Generates production-ready extraction code | Claude Sonnet 4.5 (claude-sonnet-4-5-20250929) |
| Code Validator | Validates and improves code quality | Claude Sonnet 4.5 (claude-sonnet-4-5-20250929) |
| Markdown Converter | Converts JSON results to Markdown format | Claude Sonnet 4.5 (claude-sonnet-4-5-20250929) |
- Multi-Modal Analysis: Combines text (LLM) and visual (Vision) analysis
- Automatic XPath Generation: Intelligently generates XPath expressions
- Schema Inference: Creates JSON schemas from HTML structure
- Pattern Recognition: Identifies common patterns across multiple files
- Batch Processing: Analyze multiple HTML files simultaneously
- URL Download: Automatically download HTML from URL lists
- Checkpoint System: Save progress and resume from interruptions
- Step-by-Step Results: Review intermediate results at each step
- Beautiful Logging: Colored console output with file logging
- Custom API Endpoints: Support for OpenAI-compatible APIs
- Error Recovery: Automatic retry and fallback mechanisms
- Code Validation: Automatic syntax and robustness checking
# Clone the repository
git clone https://github.com/SHUzhangshuo/crawlAgent
cd crawlAgent
# Install dependencies
pip install -r requirements.txt
# Install Playwright browsers (for visual analysis)
playwright install chromium
Copy the example environment file:
cp env.example .env
Edit .env and add your API keys:

# OpenAI API (for orchestrator and code generator)
OPENAI_API_KEY=sk-your_api_key_here
OPENAI_API_BASE=http://your-endpoint:port/v1
OPENAI_MODEL=gpt-4o-mini

# Anthropic API (for analyzer)
ANTHROPIC_API_KEY=your_anthropic_api_key_here
ANTHROPIC_BASE_URL=https://api.anthropic.com
ANTHROPIC_MODEL=claude-3-5-sonnet-20241022

# Vision Model API (for visual analysis)
VISION_API_KEY=sk-your_api_key_here
VISION_MODEL=gpt-4o
VISION_API_BASE=http://your-endpoint:port/v1
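These values are read from .env at startup. As a minimal sketch of how that loading might look (assuming python-dotenv and the variable names above; this is illustrative, not the project's actual configuration loader):

```python
# Illustrative only: one way the settings above could be loaded at startup.
import os
from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # reads .env from the current working directory

openai_config = {
    "api_key": os.getenv("OPENAI_API_KEY"),
    "base_url": os.getenv("OPENAI_API_BASE"),
    "model": os.getenv("OPENAI_MODEL", "gpt-4o-mini"),
}

anthropic_config = {
    "api_key": os.getenv("ANTHROPIC_API_KEY"),
    "base_url": os.getenv("ANTHROPIC_BASE_URL", "https://api.anthropic.com"),
    "model": os.getenv("ANTHROPIC_MODEL", "claude-3-5-sonnet-20241022"),
}

vision_config = {
    "api_key": os.getenv("VISION_API_KEY"),
    "base_url": os.getenv("VISION_API_BASE"),
    "model": os.getenv("VISION_MODEL", "gpt-4o"),
}
```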
# Use default configuration (auto-read from spread directory, auto-create new flow directory)
python main.py
# Use typical directory (learning content)
python main.py --input-type typical
# Process specified URL list file
python main.py urls.txt
# Process specified HTML directory
python main.py ./html_files
# Disable visual analysis (faster)
python main.py --no-visual
# Specify flow ID (disable auto-increment)
python main.py --flow-id 1
# Specify custom output directory
python main.py --output-dir ./results

After processing, use the generated extraction code:
from output.extraction_code import HTMLExtractor
import json
# Create extractor
extractor = HTMLExtractor()
# Extract from file
result = extractor.extract(file_path="example.html")
print(json.dumps(result, indent=2, ensure_ascii=False))
# Extract from HTML string
html_string = "<html><body><h1>Title</h1></body></html>"
result = extractor.extract(html_content=html_string)
# Batch processing
from pathlib import Path
files = list(Path("html_files").glob("*.html"))
results = extractor.extract_batch(files, is_file_paths=True)

graph TB
A[HTML Input] --> B{Input Type?}
B -->|URLs| C[URL Downloader]
B -->|Files| D[HTML Parser]
C --> D
D --> E[HTML Agent System]
E --> F[Step 1: Analyzer Agent]
F --> G[Text Analysis Results]
G --> H[flow1/checkpoint.json]
E --> I[Step 2: Visual Analyzer]
I --> J[Visual Analysis Results]
J --> K[flow2/checkpoint.json]
G --> L[Step 3: Orchestrator]
J --> L
L --> M[Synthesized Results]
M --> N[flow3/checkpoint.json]
M --> O[Step 4: Schema Generator]
O --> P[JSON Schema]
P --> Q[flow4/checkpoint.json]
P --> R[Step 5: Code Generator]
R --> S[Extraction Code]
S --> T[flow5/extraction_code.py]
T --> U[Step 6: Code Validator]
U --> V{Valid?}
V -->|No| W[Auto-Fix]
W --> U
V -->|Yes| X[Validated Code]
X --> Y[flow6/extraction_code.py]
Y --> Z[Step 6.5: Execute Code]
Z --> AA[Extraction Results]
AA --> AB[flow6/extraction_results/]
AB --> AC[Step 7: Markdown Converter]
AC --> AD[Markdown Output]
AD --> AE[flow7/markdown_output/]
style F fill:#e1f5ff
style I fill:#fff4e1
style L fill:#e8f5e9
style O fill:#f3e5f5
style R fill:#fce4ec
style U fill:#fff9c4
style AC fill:#e8eaf6
HTML Input (Files/URLs)
        ↓
[Agent 1] Analyzer Agent
        ├── Text Structure Analysis
        ├── XPath Generation
        └── Pattern Identification
        ↓
[Agent 2] Visual Analyzer (Optional)
        ├── HTML Rendering (Playwright)
        ├── Visual Layout Analysis
        └── Content Region Detection
        ↓
[Agent 3] Orchestrator
        ├── Synthesize Results
        ├── Identify Common Patterns
        └── Generate JSON Schema
        ↓
[Agent 4] Code Generator
        ├── Generate Python Code
        └── Implement Error Handling
        ↓
[Agent 5] Code Validator
        ├── Syntax Validation
        ├── Robustness Checking
        └── Auto-Fix Issues
        ↓
[Step 6.5] Code Execution
        ├── Execute on spread directory
        └── Generate JSON results
        ↓
[Agent 6] Markdown Converter
        ├── Analyze JSON content fields
        ├── Generate converter code
        └── Convert to Markdown format
        ↓
Production-Ready Code + Schema + Markdown
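Conceptually, the pipeline is a straight line: each stage consumes the previous stage's state and writes a checkpoint into its own flow directory before the next stage runs. The sketch below only illustrates that shape; the step functions are placeholders, not the project's real API:

```python
# Shape of the pipeline only: each step writes a checkpoint to its own flow directory.
# The step functions here are placeholders standing in for the real agents.
import json
from pathlib import Path

def run_pipeline(html_files, output_dir="data/output"):
    state = {"html_files": [str(p) for p in html_files]}
    steps = [
        ("text_analysis", lambda s: s),       # Agent 1: Analyzer
        ("visual_analysis", lambda s: s),     # Agent 2: Visual Analyzer (optional)
        ("synthesized", lambda s: s),         # Agent 3: Orchestrator
        ("schema", lambda s: s),              # Schema generation
        ("code_generated", lambda s: s),      # Agent 4: Code Generator
        ("code_validated", lambda s: s),      # Agent 5: Code Validator + execution
        ("markdown_converted", lambda s: s),  # Agent 6: Markdown Converter
    ]
    for flow_id, (step_name, step_fn) in enumerate(steps, start=1):
        flow_dir = Path(output_dir) / f"flow{flow_id}"
        flow_dir.mkdir(parents=True, exist_ok=True)
        state = step_fn(state)  # the real system calls the corresponding agent here
        checkpoint = {"step": step_name, "data": state}
        (flow_dir / "checkpoint.json").write_text(
            json.dumps(checkpoint, indent=2, ensure_ascii=False), encoding="utf-8"
        )
    return state
```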
sequenceDiagram
participant User
participant Main as Main System
participant Settings as Settings
participant Step1 as Step 1 (flow1)
participant Step2 as Step 2 (flow2)
participant Step3 as Step 3 (flow3)
participant Step4 as Step 4 (flow4)
participant Step5 as Step 5 (flow5)
participant Step6 as Step 6 (flow6)
participant Step7 as Step 7 (flow7)
participant Checkpoint as Checkpoint Manager
User->>Main: python main.py
Main->>Settings: Initialize directories
Main->>Checkpoint: Scan flow directories
Checkpoint-->>Main: Found checkpoints
alt Checkpoint exists
Main->>Checkpoint: Load checkpoint data
Checkpoint-->>Main: Restore state
Main->>Step1: Skip (already done)
else No checkpoint
Main->>Step1: Create flow1/
Step1->>Step1: Analyze HTML
Step1->>Checkpoint: Save checkpoint.json
end
Main->>Step2: Create flow2/
Step2->>Step2: Visual analysis
Step2->>Checkpoint: Save checkpoint.json
Main->>Step3: Create flow3/
Step3->>Step3: Synthesize results
Step3->>Checkpoint: Save checkpoint.json
Main->>Step4: Create flow4/
Step4->>Step4: Generate schema
Step4->>Checkpoint: Save checkpoint.json
Main->>Step5: Create flow5/
Step5->>Step5: Generate code
Step5->>Checkpoint: Save checkpoint.json (code_generated)
Main->>Step6: Create flow6/
Step6->>Step6: Validate code
Step6->>Step6: Fix code
Step6->>Step6: Execute code
Step6->>Checkpoint: Save checkpoint.json (code_validated)
Main->>Step7: Create flow7/
Step7->>Step7: Analyze JSON results
Step7->>Step7: Generate converter code
Step7->>Step7: Convert to Markdown
Step7->>Checkpoint: Save checkpoint.json (markdown_converted)
Step7-->>User: Return results
- Text Analysis → Analyzer Agent analyzes HTML structure → flow1/
- Visual Analysis → Visual Analyzer analyzes rendered layout (optional) → flow2/
- Coordination → Orchestrator synthesizes all results → flow3/
- Schema Generation → Orchestrator generates JSON schema → flow4/
- Code Generation → Code Generator creates extraction code → flow5/
- Code Validation → Code Validator validates and improves code → flow6/
- Code Execution → Execute validated code on spread directory → flow6/extraction_results/
- Markdown Conversion → Markdown Converter analyzes JSON and generates converter code → flow7/
crawlAgent/
├── agents/                  # AI Agent implementations
│   ├── orchestrator.py      # Orchestrator agent
│   ├── analyzer.py          # Analyzer agent
│   ├── code_generator.py    # Code generator agent
│   └── code_validator.py    # Code validator agent
├── utils/                   # Utility modules
│   ├── html_parser.py       # HTML parsing utilities
│   ├── visual_analyzer.py   # Visual analysis
│   ├── url_downloader.py    # URL downloading
│   ├── logger.py            # Logging system
│   └── checkpoint.py        # Checkpoint management
├── config/                  # Configuration
│   └── settings.py          # Settings management (includes path configuration)
├── prompts/                 # Prompt templates
│   └── prompt_templates.py
├── data/                    # Data directory
│   ├── input/               # Input directory
│   │   ├── typcial/         # Learning content directory
│   │   │   ├── urls.txt     # URL list (optional)
│   │   │   └── html/        # Pre-crawled HTML files (optional)
│   │   └── spread/          # Content to process directory
│   │       ├── urls.txt     # URL list (optional)
│   │       └── html/        # HTML files (optional)
│   └── output/              # Output directory
│       ├── flow1/           # Flow 1 output
│       ├── flow2/           # Flow 2 output
│       └── ...              # More flow outputs
├── logs/                    # Log files (gitignored)
├── main.py                  # Main entry point
├── requirements.txt         # Dependencies
├── env.example              # Environment template
└── README.md                # This file
All input/output paths are centrally configured in config/settings.py for easy extension.
data/
├── input/                   # Input directory
│   ├── typcial/             # Learning content (what the agent needs to learn)
│   │   ├── urls.txt         # URL list file (will be crawled)
│   │   └── html/            # Pre-crawled HTML files directory
│   └── spread/              # Content to process (what the generated code needs to process)
│       ├── urls.txt         # URL list file
│       └── html/            # HTML files directory
└── output/                  # Output directory (stores results from each API call)
    ├── flow1/               # Flow 1 output directory
    ├── flow2/               # Flow 2 output directory
    └── ...                  # More flow output directories
typcial directory (Learning content):

- Method 1: Place a urls.txt file; the system will automatically crawl the URLs in the list
- Method 2: Place pre-crawled HTML files directly in the html/ directory

spread directory (Content to process):

- Method 1: Place a urls.txt file; the system will process the URLs in the list
- Method 2: Place the HTML files to process directly in the html/ directory
- Results from each API call are stored in the data/output/ directory
- Auto-create flow directories: Each step automatically creates a new flow folder (flow1/, flow2/, flow3/, ...)
  - Step 1 (Text Analysis) → flow1/
  - Step 2 (Visual Analysis) → flow2/
  - Step 3 (Synthesis) → flow3/
  - Step 4 (Schema Generation) → flow4/
  - Step 5 (Code Generation) → flow5/
  - Step 6 (Code Validation & Fix) → flow6/ (includes an extraction_results/ folder with individual JSON files)
  - Step 7 (Markdown Conversion) → flow7/ (includes a markdown_output/ folder with Markdown files)
- Each flow directory contains:
  - checkpoint.json: Checkpoint data for resuming
  - step{N}_*_result.json: Step-specific result files
  - intermediate_results.json: Intermediate results for this step
  - extraction_code.py: Generated/validated extraction code (in flow5/flow6)
  - extraction_results/: Individual JSON result files for each HTML file (in flow6)
  - extraction_results_summary.json: Summary of all extraction results (in flow6)
  - markdown_converter.py: Generated Markdown converter code (in flow7)
  - markdown_output/: Individual Markdown files for each JSON result (in flow7)
  - markdown_conversion_summary.json: Summary of Markdown conversion results (in flow7)
- Manual flow ID specification: Use --flow-id N to specify a particular flow number
- The first API call takes its input from the data/input/ directory
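Under these conventions, resolving an input source boils down to: use urls.txt if it exists, otherwise fall back to the html/ directory. A minimal sketch (the helper name is illustrative, not part of the project's API):

```python
# Illustrative input resolution for a directory laid out like data/input/typcial/ or data/input/spread/.
from pathlib import Path

def resolve_input(input_dir):
    base = Path(input_dir)
    urls_file = base / "urls.txt"
    html_dir = base / "html"

    if urls_file.exists():
        # Keep non-empty lines that are not comments
        urls = [
            line.strip()
            for line in urls_file.read_text(encoding="utf-8").splitlines()
            if line.strip() and not line.strip().startswith("#")
        ]
        return "urls", urls

    if html_dir.exists():
        return "html", sorted(html_dir.glob("*.html"))

    raise FileNotFoundError(f"No urls.txt or html/ directory found in {base}")

source_type, items = resolve_input("data/input/spread")
print(source_type, len(items))
```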
graph LR
A[Settings.get_next_flow_id] --> B{Scan OUTPUT_DIR}
B --> C["Find All flowN Directories"]
C --> D[Extract Flow Numbers]
D --> E[Find Max Flow ID]
E --> F[Return Max + 1]
G[Settings.get_flow_output_dir] --> H["OUTPUT_DIR / flowN"]
H --> I[Create Directory]
I --> J[Return Path]
style A fill:#e1f5ff
style G fill:#fff4e1
Flow ID Auto-Increment Algorithm:
def get_next_flow_id():
    if not OUTPUT_DIR.exists():
        return 1
    existing_flows = []
    for item in OUTPUT_DIR.iterdir():
        if item.is_dir() and item.name.startswith('flow'):
            flow_num = int(item.name[4:])  # Extract number from 'flow{N}'
            existing_flows.append(flow_num)
    if not existing_flows:
        return 1
    return max(existing_flows) + 1

Flow Directory Structure:
data/output/
├── flow1/                              # Step 1: Text Analysis
│   ├── checkpoint.json                 # Contains: step="text_analysis", analysis_results
│   └── step1_text_analysis_result.json
│
├── flow2/                              # Step 2: Visual Analysis
│   ├── checkpoint.json                 # Contains: step="visual_analysis", visual_results, analysis_results
│   └── step2_visual_analysis_result.json
│
├── flow3/                              # Step 3: Synthesis
│   ├── checkpoint.json                 # Contains: step="synthesized", synthesized, analysis_results, visual_results
│   └── step3_synthesized_result.json
│
├── flow4/                              # Step 4: Schema Generation
│   ├── checkpoint.json                 # Contains: step="schema", schema, synthesized, ...
│   ├── extraction_schema.json
│   └── step4_schema_result.json
│
├── flow5/                              # Step 5: Code Generation
│   ├── checkpoint.json                 # Contains: step="code_generated", code, schema, ...
│   ├── extraction_code.py              # Initial generated code
│   ├── intermediate_results.json
│   └── step5_code_result.json
│
├── flow6/                              # Step 6: Code Validation & Execution
│   ├── checkpoint.json                 # Contains: step="code_validated", code, validation, ...
│   ├── extraction_code.py              # Validated and fixed code (final)
│   ├── code_validation_result.json
│   ├── extraction_results/             # Individual JSON files for each HTML
│   │   ├── page1.json
│   │   ├── page2.json
│   │   └── ...
│   ├── extraction_results_summary.json
│   └── intermediate_results.json
│
└── flow7/                              # Step 7: Markdown Conversion
    ├── checkpoint.json                 # Contains: step="markdown_converted", markdown_converter_code, ...
    ├── markdown_converter.py           # Generated Markdown converter code
    ├── markdown_output/                # Individual Markdown files for each JSON
    │   ├── page1.md
    │   ├── page2.md
    │   └── ...
    ├── markdown_conversion_summary.json
    └── intermediate_results.json
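After a run completes, the per-page results under flow6/extraction_results/ are plain JSON files and can be loaded directly. For example (the flow number is assumed; adjust it to your run):

```python
# Load every per-page extraction result from a finished run (flow number assumed).
import json
from pathlib import Path

results_dir = Path("data/output/flow6/extraction_results")
results = {}
for json_file in sorted(results_dir.glob("*.json")):
    with json_file.open(encoding="utf-8") as f:
        results[json_file.name] = json.load(f)

print(f"Loaded {len(results)} extraction results")
```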
All path configurations are in config/settings.py:
from config import Settings
# Initialize and create all required directories
Settings.initialize_directories()
# Access paths
typical_urls = Settings.TYPICAL_URLS_FILE # data/input/typcial/urls.txt
typical_html = Settings.TYPICAL_HTML_DIR # data/input/typcial/html/
spread_urls = Settings.SPREAD_URLS_FILE # data/input/spread/urls.txt
spread_html = Settings.SPREAD_HTML_DIR # data/input/spread/html/
output_dir = Settings.OUTPUT_DIR # data/output/
# Get flow output directory
flow1_output = Settings.get_flow_output_dir(1) # data/output/flow1/
flow2_output = Settings.get_flow_output_dir(2) # data/output/flow2/
# Auto-get next available flow ID and directory
next_flow_id = Settings.get_next_flow_id() # Auto-increment, returns next available ID
next_flow_dir = Settings.get_next_flow_output_dir()  # Auto-creates new flow directory

You can customize paths via environment variables:
# Custom data directory (optional, default is data/ under project root)
DATA_DIR=D:/data/custom_data
# Custom output directory (optional, default is DATA_DIR/output)
OUTPUT_DIR=D:/data/custom_output

Default behavior: Each time you run an agent, the system automatically creates a new flow folder without manual specification.
# First run - auto-creates flow1
python main.py
# Second run - auto-creates flow2
python main.py
# Third run - auto-creates flow3
python main.py

If you need to use a specific flow number:
# Use flow1
python main.py --flow-id 1
# Use flow5
python main.py --flow-id 5
# Disable auto-increment, force use flow1
python main.py --no-auto-flow

from config import Settings
# Method 1: Manually specify flow ID
flow3_output = Settings.get_flow_output_dir(3) # data/output/flow3/
# Method 2: Auto-get next available flow ID (recommended)
next_flow_id = Settings.get_next_flow_id() # Auto-increment
next_flow_dir = Settings.get_next_flow_output_dir()  # Auto-creates new directory

When adding new flows:

- Auto-increment: The system automatically creates new flow directories by default; no manual management is needed
- Custom input sources: Add new input directory configurations in config/settings.py:

  # Add to the Settings class
  CUSTOM_INPUT_DIR = INPUT_DIR / 'custom'
  CUSTOM_HTML_DIR = CUSTOM_INPUT_DIR / 'html'

- Auto-create directories: Call Settings.initialize_directories() to auto-create all configured directories
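For reference, Settings.initialize_directories() only needs to make sure each configured path exists. A minimal sketch of that idea (illustrative; the actual implementation lives in config/settings.py):

```python
# Illustrative sketch: create every configured directory if it does not exist yet.
from pathlib import Path

DATA_DIR = Path("data")
CONFIGURED_DIRS = [
    DATA_DIR / "input" / "typcial" / "html",
    DATA_DIR / "input" / "spread" / "html",
    DATA_DIR / "input" / "custom" / "html",   # hypothetical custom input source from the step above
    DATA_DIR / "output",
]

def initialize_directories():
    for directory in CONFIGURED_DIRS:
        directory.mkdir(parents=True, exist_ok=True)  # safe to call repeatedly

initialize_directories()
```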
View all path configuration information:
from config import Settings
path_info = Settings.get_path_info()
print(path_info)
# Output:
# {
# 'project_root': 'D:/data/cursorworkspace/crawlAgent',
# 'data_dir': 'D:/data/cursorworkspace/crawlAgent/data',
# 'input_dir': 'D:/data/cursorworkspace/crawlAgent/data/input',
# 'typical_dir': 'D:/data/cursorworkspace/crawlAgent/data/input/typcial',
# ...
# }

- extraction_schema.json: JSON schema with XPath expressions
- extraction_code.py: Production-ready Python extraction code
- code_validation_result.json: Code validation report
| Step | Flow Directory | Key Output Files |
|---|---|---|
| Step 1 | flow1/ | step1_text_analysis_result.json, checkpoint.json, intermediate_results.json |
| Step 2 | flow2/ | step2_visual_analysis_result.json, checkpoint.json, intermediate_results.json |
| Step 3 | flow3/ | step3_synthesized_result.json, checkpoint.json, intermediate_results.json |
| Step 4 | flow4/ | extraction_schema.json, step4_schema_result.json, checkpoint.json, intermediate_results.json |
| Step 5 | flow5/ | extraction_code.py (initial), checkpoint.json, intermediate_results.json |
| Step 6 | flow6/ | extraction_code.py (validated), code_validation_result.json, checkpoint.json, intermediate_results.json |
| Step 6.5 | flow6/extraction_results/ | page1.json, page2.json, ... (individual results), extraction_results_summary.json (in flow6/) |
| Step 7 | flow7/ | markdown_converter.py, markdown_output/, markdown_conversion_summary.json, checkpoint.json, intermediate_results.json |
- extraction_schema.json (flow4/): Complete JSON schema with XPath expressions for all extractable sections
- extraction_code.py (flow5/): Initial generated Python extraction code
- extraction_code.py (flow6/): Validated and improved Python extraction code (production-ready)
- code_validation_result.json (flow6/): Detailed validation report with syntax errors, robustness issues, and fixes applied
- extraction_results/ (flow6/): Directory containing individual JSON files for each processed HTML file
  - Each file is named after the source HTML (e.g., page1.json, article.html.json)
  - Contains extracted structured data according to the schema
- extraction_results_summary.json (flow6/): Summary file listing all processed files and their result file paths
- markdown_converter.py (flow7/): Generated Python code for converting JSON results to Markdown format
- markdown_output/ (flow7/): Directory containing individual Markdown files for each JSON result
  - Each file is named after the source JSON (e.g., page1.md, article.json.md)
  - Contains Markdown-formatted content extracted from the JSON
- markdown_conversion_summary.json (flow7/): Summary file listing all converted Markdown files and their paths
- checkpoint.json (each flow/): Complete processing state for that step, enables automatic resume
- intermediate_results.json (each flow/): Intermediate processing results for debugging and review
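To make the relationship between the schema and the generated code concrete, here is a rough sketch of applying the schema's XPath expressions with lxml. It is illustrative only: the real extraction_code.py is generated per site, and the schema layout assumed here is the one shown later in this README (a sections list with name, xpath, and is_list fields); example.html is a placeholder input.

```python
# Illustrative only: apply a schema's XPath expressions to one HTML document with lxml.
import json
from lxml import html

def extract_with_schema(html_content, schema):
    tree = html.fromstring(html_content)
    result = {}
    for section in schema.get("sections", []):
        nodes = tree.xpath(section["xpath"])
        # XPath can return elements or plain strings depending on the expression
        texts = [
            node.text_content().strip() if hasattr(node, "text_content") else str(node).strip()
            for node in nodes
        ]
        result[section["name"]] = texts if section.get("is_list") else (texts[0] if texts else None)
    return result

with open("data/output/flow4/extraction_schema.json", encoding="utf-8") as f:
    schema = json.load(f)
with open("example.html", encoding="utf-8") as f:  # placeholder input file
    print(extract_with_schema(f.read(), schema))
```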
The checkpoint system ensures you never lose progress:
- checkpoint.json: Stored in each flow directory, contains the complete processing state and all data for that step
- Automatic Recovery: System automatically scans all flow directories on startup, loads checkpoints, and resumes from the last completed step
- Per-Step Checkpoints: Each step (flow1-flow6) maintains its own checkpoint independently
- Smart Resume: When resuming, the system loads all previous step data from the latest checkpoint
- No Manual Intervention: Checkpoint recovery is automatic - no need to check logs or manually specify resume points
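A condensed sketch of the recovery scan described above (directory layout as documented; the helper name is illustrative, not the project's internal API):

```python
# Illustrative resume scan: read each flow's checkpoint.json and keep the latest one.
import json
from pathlib import Path

def find_latest_checkpoint(output_dir="data/output"):
    flow_dirs = sorted(
        (d for d in Path(output_dir).glob("flow*") if d.name[4:].isdigit()),
        key=lambda d: int(d.name[4:]),
    )
    latest = None
    for flow_dir in flow_dirs:
        checkpoint_file = flow_dir / "checkpoint.json"
        if checkpoint_file.exists():
            checkpoint = json.loads(checkpoint_file.read_text(encoding="utf-8"))
            latest = {
                "flow_dir": flow_dir,
                "step": checkpoint.get("step"),
                "data": checkpoint.get("data", {}),
            }
    return latest  # None means no checkpoints: start from Step 1

checkpoint = find_latest_checkpoint()
if checkpoint:
    print(f"Resuming after step '{checkpoint['step']}' ({checkpoint['flow_dir']})")
else:
    print("No checkpoints found, starting fresh")
```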
# Auto-create new flow directory (default behavior)
python main.py
# Resume from last checkpoint (default: enabled automatically)
python main.py
# Force restart, ignore checkpoints
python main.py --no-resume
# Specify flow ID (disable auto-increment)
python main.py --flow-id 2
# Disable auto-increment, use default flow1
python main.py --no-auto-flow
# Custom output directory
python main.py --output-dir ./custom_output
# Disable visual analysis (faster)
python main.py --no-visual
# Use typical directory (learning content)
python main.py --input-type typical

1. Directory of HTML files:

python main.py ./html_files

2. URL list file:

python main.py urls.txt

Example urls.txt:
# Comments start with #
https://example.com/page1.html
https://example.com/page2.html
https://example.com/page3.html
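URL downloading itself is handled by utils/url_downloader.py. As a rough, illustrative stand-in for what a urls.txt-driven download does (the function and output file names below are assumptions, not the project's code):

```python
# Illustrative stand-in for urls.txt-driven downloading; the project uses utils/url_downloader.py.
from pathlib import Path
import requests

def download_urls(urls_file="urls.txt", output_dir="html_files"):
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    lines = Path(urls_file).read_text(encoding="utf-8").splitlines()
    urls = [l.strip() for l in lines if l.strip() and not l.strip().startswith("#")]
    for i, url in enumerate(urls, start=1):
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        (out / f"page{i}.html").write_text(response.text, encoding="utf-8")
        print(f"Saved {url} -> page{i}.html")

download_urls()
```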
The system supports custom OpenAI-compatible API endpoints:
OPENAI_API_BASE=http://your-custom-endpoint:port/v1
ANTHROPIC_BASE_URL=http://your-custom-endpoint:port/v1

Note: URLs are used exactly as configured, without modification.
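In practice this just means the configured base URL is passed straight to the client. A sketch using the official SDKs (client construction only; not the project's internal code):

```python
# Point the official SDK clients at OpenAI/Anthropic-compatible endpoints from the environment.
import os
from openai import OpenAI
from anthropic import Anthropic

openai_client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url=os.environ.get("OPENAI_API_BASE"),     # used exactly as configured
)

anthropic_client = Anthropic(
    api_key=os.environ["ANTHROPIC_API_KEY"],
    base_url=os.environ.get("ANTHROPIC_BASE_URL"),  # used exactly as configured
)
```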
The CheckpointManager class handles all checkpoint operations:
from datetime import datetime
from pathlib import Path
from typing import Any, Dict, Optional

class CheckpointManager:
    """Manage checkpoints for resuming interrupted processing"""

    CHECKPOINT_FILE = "checkpoint.json"

    def __init__(self, output_dir: Path):
        self.output_dir = Path(output_dir)
        self.checkpoint_path = self.output_dir / self.CHECKPOINT_FILE

    def save_checkpoint(self, step: str, data: Dict[str, Any]):
        """Save checkpoint with step name and data"""
        checkpoint = {
            "step": step,
            "timestamp": datetime.now().isoformat(),
            "data": data
        }
        # Save to checkpoint.json

    def load_checkpoint(self) -> Optional[Dict[str, Any]]:
        """Load checkpoint from file"""
        # Returns checkpoint dict or None

Centralized path management in config/settings.py:
from pathlib import Path

class Settings:
    # Base directories
    PROJECT_ROOT = Path(__file__).parent.parent
    DATA_DIR = PROJECT_ROOT / 'data'
    INPUT_DIR = DATA_DIR / 'input'
    OUTPUT_DIR = DATA_DIR / 'output'

    # Input directories
    TYPICAL_DIR = INPUT_DIR / 'typcial'
    TYPICAL_HTML_DIR = TYPICAL_DIR / 'html'
    TYPICAL_URLS_FILE = TYPICAL_DIR / 'urls.txt'
    SPREAD_DIR = INPUT_DIR / 'spread'
    SPREAD_HTML_DIR = SPREAD_DIR / 'html'
    SPREAD_URLS_FILE = SPREAD_DIR / 'urls.txt'

    @classmethod
    def get_flow_output_dir(cls, flow_id: int) -> Path:
        """Get flow-specific output directory"""
        return cls.OUTPUT_DIR / f'flow{flow_id}'

    @classmethod
    def get_next_flow_id(cls) -> int:
        """Auto-increment flow ID"""
        # Scans existing flow directories
        # Returns max(flow_ids) + 1

The system dynamically imports and executes generated code:
import importlib.util

def _execute_extraction_code(code_path, output_dir, json_schema):
    # 1. Load the generated code module dynamically
    spec = importlib.util.spec_from_file_location("extraction_code", code_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)

    # 2. Find the extractor class/function
    extractor = module.HTMLExtractor()

    # 3. Process each HTML file
    for html_file in html_files:
        result = extractor.extract(html_content=html_content)

        # 4. Save an individual JSON file per input
        json_filename = f"{html_filename}.json"
        save_to_extraction_results(json_filename, result)

    # 5. Generate the summary file
    save_summary("extraction_results_summary.json")

Each agent uses specialized prompts:
- Analyzer Agent: Focuses on HTML structure, XPath generation
- Visual Analyzer: Analyzes rendered layout, visual patterns
- Orchestrator: Synthesizes results, identifies common patterns
- Code Generator: Generates production-ready Python code
- Code Validator: Validates syntax, checks robustness, suggests fixes
- Markdown Converter: Analyzes JSON content fields, generates Markdown converter code
Prompts are stored in prompts/prompt_templates.py and can be customized.
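As a purely hypothetical illustration of that pattern (the template name and placeholder below are invented, not the actual contents of prompt_templates.py), a customized template might look like:

```python
# Hypothetical example of a prompt template; the name and placeholder are illustrative.
ANALYZER_PROMPT_TEMPLATE = """You are an HTML structure analyst.

Analyze the HTML below and return, as JSON:
- the main content sections you can identify
- a robust XPath expression for each section

HTML:
{html_content}
"""

prompt = ANALYZER_PROMPT_TEMPLATE.format(
    html_content="<html><body><h1>Title</h1></body></html>"
)
print(prompt)
```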
from output.extraction_code import HTMLExtractor
import json
extractor = HTMLExtractor()
result = extractor.extract(file_path="article.html")
print(f"Title: {result.get('article_title')}")
print(f"Date: {result.get('article_date')}")
print(f"Body: {result.get('article_body')[:100]}...")from pathlib import Path
from output.extraction_code import HTMLExtractor
extractor = HTMLExtractor()
html_files = list(Path("html_files").glob("*.html"))
results = extractor.extract_batch(html_files, is_file_paths=True)
for file_path, result in zip(html_files, results):
print(f"{file_path.name}: {result.get('article_title', 'N/A')}")from output.extraction_code import HTMLExtractor
import json
extractor = HTMLExtractor()
result = extractor.extract(file_path="article.html")
with open("extracted_data.json", "w", encoding="utf-8") as f:
    json.dump(result, f, indent=2, ensure_ascii=False)

The generated schema follows this structure:
{
"schema_version": "1.0",
"description": "Schema for extracting content from HTML pages",
"sections": [
{
"name": "article_title",
"description": "Main article title",
"xpath": "//h1[@class='title']",
"is_list": false,
"attributes": {},
"notes": "Extracts the main title"
},
{
"name": "comments",
"description": "List of comments",
"xpath": "//div[@class='comment']",
"xpath_list": ["//div[@class='comment']"],
"is_list": true,
"attributes": {"class": "comment"}
}
]
}

The system automatically saves checkpoints after each step and resumes from checkpoints by default, ensuring you never lose progress.
- Automatic Resume (Default): System automatically checks for checkpoints on startup and resumes from the last completed step
- No Log Reading Required: Checkpoint recovery is automatic - you don't need to check logs to know where to resume
- Per-Step Checkpoints: Each step (flow1, flow2, flow3, etc.) has its own checkpoint.json file
- Smart Recovery: System scans all flow directories, loads checkpoints, and automatically skips completed steps
- Manual Control: Use the --no-resume flag to force restart and ignore checkpoints
- Step Results: Each step saves its result separately in its flow directory
- Progress Tracking: Never lose progress, resume from any step automatically
- Checkpoint Creation: After each step completes, a checkpoint.json file is saved in that step's flow directory
- Startup Scan: When the system starts, it scans all flow{N}/ directories in the output folder
- Checkpoint Loading: For each flow directory, it loads the checkpoint.json file and extracts:
  - The step name (e.g., "text_analysis", "code_validated")
  - All processing data (analysis results, schema, code, etc.)
- State Restoration: The system restores the complete state from the latest checkpoint
- Smart Skip: Steps that are already completed (have valid checkpoints) are automatically skipped
- Resume Execution: Processing continues from the first incomplete step
- Python: 3.8 or higher
- API Keys:
- OpenAI API key (or compatible endpoint)
- Anthropic API key (or compatible endpoint)
- Dependencies: See requirements.txt
- Optional: Playwright for visual analysis
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Built with OpenAI and Anthropic APIs
- Uses lxml for HTML parsing
- Uses Playwright for visual analysis
- Issues: GitHub Issues
- Discussions: GitHub Discussions
Made with ❤️ using AI Agents
⭐ Star this repo if you find it useful!