
πŸ€– crawlAgent

Intelligent HTML Extraction Agent System


Automatically parse HTML, generate extraction schemas, and produce production-ready code using AI agents.

Features β€’ Quick Start β€’ Architecture β€’ Documentation β€’ Examples


πŸ“– Overview

crawlAgent is an intelligent HTML extraction agent that uses specialized AI agents to automatically parse, understand, and extract structured data from HTML documents. Instead of manually writing XPath selectors or CSS queries, the system intelligently analyzes HTML structures, identifies content patterns, and generates production-ready extraction code.

🎯 What Makes It Special?

  • 🧠 Intelligent Understanding: AI agents understand HTML semantics, not just syntax
  • πŸ”„ Multi-Agent Collaboration: Six specialized agents work together seamlessly
  • πŸ“Š Pattern Recognition: Automatically identifies common patterns across multiple pages
  • πŸ› οΈ Production-Ready: Generates robust, maintainable extraction code
  • ⚑ Smart Checkpointing: Resume from any step, never lose progress

✨ Features

πŸ€– Multi-Agent Architecture

| Agent | Purpose | Model |
| --- | --- | --- |
| πŸ” Analyzer Agent | Deep text-based HTML structure analysis | Claude Sonnet 4.5 (claude-sonnet-4-5-20250929) |
| πŸ‘οΈ Visual Analyzer | Visual layout analysis using vision models | Qwen-VL-Max (qwen-vl-max) |
| 🎯 Orchestrator | Coordinates agents and synthesizes results | GPT-5 (gpt5) |
| πŸ’» Code Generator | Generates production-ready extraction code | Claude Sonnet 4.5 (claude-sonnet-4-5-20250929) |
| βœ… Code Validator | Validates and improves code quality | Claude Sonnet 4.5 (claude-sonnet-4-5-20250929) |
| πŸ“ Markdown Converter | Converts JSON results to Markdown format | Claude Sonnet 4.5 (claude-sonnet-4-5-20250929) |

🧠 Intelligent Parsing

  • Multi-Modal Analysis: Combines text (LLM) and visual (Vision) analysis
  • Automatic XPath Generation: Intelligently generates XPath expressions
  • Schema Inference: Creates JSON schemas from HTML structure
  • Pattern Recognition: Identifies common patterns across multiple files

⚑ Automation & Efficiency

  • Batch Processing: Analyze multiple HTML files simultaneously
  • URL Download: Automatically download HTML from URL lists
  • Checkpoint System: Save progress and resume from interruptions
  • Step-by-Step Results: Review intermediate results at each step

πŸ› οΈ Developer Experience

  • Beautiful Logging: Colored console output with file logging
  • Custom API Endpoints: Support for OpenAI-compatible APIs
  • Error Recovery: Automatic retry and fallback mechanisms
  • Code Validation: Automatic syntax and robustness checking

πŸš€ Quick Start

Installation

# Clone the repository
git clone https://github.com/SHUzhangshuo/crawlAgent
cd crawlAgent

# Install dependencies
pip install -r requirements.txt

# Install Playwright browsers (for visual analysis)
playwright install chromium

Configuration

  1. Copy the example environment file:

    cp env.example .env
  2. Edit .env and add your API keys:

    # OpenAI API (for orchestrator and code generator)
    OPENAI_API_KEY=sk-your_api_key_here
    OPENAI_API_BASE=http://your-endpoint:port/v1
    OPENAI_MODEL=gpt-4o-mini
    
    # Anthropic API (for analyzer)
    ANTHROPIC_API_KEY=your_anthropic_api_key_here
    ANTHROPIC_BASE_URL=https://api.anthropic.com
    ANTHROPIC_MODEL=claude-3-5-sonnet-20241022
    
    # Vision Model API (for visual analysis)
    VISION_API_KEY=sk-your_api_key_here
    VISION_MODEL=gpt-4o
    VISION_API_BASE=http://your-endpoint:port/v1

Basic Usage

# Use default configuration (auto-read from spread directory, auto-create new flow directory)
python main.py

# Use typical directory (learning content)
python main.py --input-type typical

# Process specified URL list file
python main.py urls.txt

# Process specified HTML directory
python main.py ./html_files

# Disable visual analysis (faster)
python main.py --no-visual

# Specify flow ID (disable auto-increment)
python main.py --flow-id 1

# Specify custom output directory
python main.py --output-dir ./results

Using the Generated Code

After processing, use the generated extraction code:

from output.extraction_code import HTMLExtractor
import json

# Create extractor
extractor = HTMLExtractor()

# Extract from file
result = extractor.extract(file_path="example.html")
print(json.dumps(result, indent=2, ensure_ascii=False))

# Extract from HTML string
html_string = "<html><body><h1>Title</h1></body></html>"
result = extractor.extract(html_content=html_string)

# Batch processing
from pathlib import Path
files = list(Path("html_files").glob("*.html"))
results = extractor.extract_batch(files, is_file_paths=True)

πŸ—οΈ Architecture

System Architecture Diagram

graph TB
    A[HTML Input] --> B{Input Type?}
    B -->|URLs| C[URL Downloader]
    B -->|Files| D[HTML Parser]
    C --> D
    D --> E[HTML Agent System]
    
    E --> F[Step 1: Analyzer Agent]
    F --> G[Text Analysis Results]
    G --> H[flow1/checkpoint.json]
    
    E --> I[Step 2: Visual Analyzer]
    I --> J[Visual Analysis Results]
    J --> K[flow2/checkpoint.json]
    
    G --> L[Step 3: Orchestrator]
    J --> L
    L --> M[Synthesized Results]
    M --> N[flow3/checkpoint.json]
    
    M --> O[Step 4: Schema Generator]
    O --> P[JSON Schema]
    P --> Q[flow4/checkpoint.json]
    
    P --> R[Step 5: Code Generator]
    R --> S[Extraction Code]
    S --> T[flow5/extraction_code.py]
    
    T --> U[Step 6: Code Validator]
    U --> V{Valid?}
    V -->|No| W[Auto-Fix]
    W --> U
    V -->|Yes| X[Validated Code]
    X --> Y[flow6/extraction_code.py]
    
    Y --> Z[Step 6.5: Execute Code]
    Z --> AA[Extraction Results]
    AA --> AB[flow6/extraction_results/]
    
    AB --> AC[Step 7: Markdown Converter]
    AC --> AD[Markdown Output]
    AD --> AE[flow7/markdown_output/]
    
    style F fill:#e1f5ff
    style I fill:#fff4e1
    style L fill:#e8f5e9
    style O fill:#f3e5f5
    style R fill:#fce4ec
    style U fill:#fff9c4
    style AC fill:#e8eaf6

Multi-Agent Workflow

HTML Input (Files/URLs)
    ↓
[Agent 1] Analyzer Agent
    β”œβ”€ Text Structure Analysis
    β”œβ”€ XPath Generation
    └─ Pattern Identification
    ↓
[Agent 2] Visual Analyzer (Optional)
    β”œβ”€ HTML Rendering (Playwright)
    β”œβ”€ Visual Layout Analysis
    └─ Content Region Detection
    ↓
[Agent 3] Orchestrator
    β”œβ”€ Synthesize Results
    β”œβ”€ Identify Common Patterns
    └─ Generate JSON Schema
    ↓
[Agent 4] Code Generator
    β”œβ”€ Generate Python Code
    └─ Implement Error Handling
    ↓
[Agent 5] Code Validator
    β”œβ”€ Syntax Validation
    β”œβ”€ Robustness Checking
    └─ Auto-Fix Issues
    ↓
[Step 6.5] Code Execution
    β”œβ”€ Execute on spread directory
    └─ Generate JSON results
    ↓
[Agent 6] Markdown Converter
    β”œβ”€ Analyze JSON content fields
    β”œβ”€ Generate converter code
    └─ Convert to Markdown format
    ↓
Production-Ready Code + Schema + Markdown

Processing Steps with Flow Directories

sequenceDiagram
    participant User
    participant Main as Main System
    participant Settings as Settings
    participant Step1 as Step 1 (flow1)
    participant Step2 as Step 2 (flow2)
    participant Step3 as Step 3 (flow3)
    participant Step4 as Step 4 (flow4)
    participant Step5 as Step 5 (flow5)
    participant Step6 as Step 6 (flow6)
    participant Step7 as Step 7 (flow7)
    participant Checkpoint as Checkpoint Manager

    User->>Main: python main.py
    Main->>Settings: Initialize directories
    Main->>Checkpoint: Scan flow directories
    Checkpoint-->>Main: Found checkpoints
    
    alt Checkpoint exists
        Main->>Checkpoint: Load checkpoint data
        Checkpoint-->>Main: Restore state
        Main->>Step1: Skip (already done)
    else No checkpoint
        Main->>Step1: Create flow1/
        Step1->>Step1: Analyze HTML
        Step1->>Checkpoint: Save checkpoint.json
    end
    
    Main->>Step2: Create flow2/
    Step2->>Step2: Visual analysis
    Step2->>Checkpoint: Save checkpoint.json
    
    Main->>Step3: Create flow3/
    Step3->>Step3: Synthesize results
    Step3->>Checkpoint: Save checkpoint.json
    
    Main->>Step4: Create flow4/
    Step4->>Step4: Generate schema
    Step4->>Checkpoint: Save checkpoint.json
    
    Main->>Step5: Create flow5/
    Step5->>Step5: Generate code
    Step5->>Checkpoint: Save checkpoint.json (code_generated)
    
    Main->>Step6: Create flow6/
    Step6->>Step6: Validate code
    Step6->>Step6: Fix code
    Step6->>Step6: Execute code
    Step6->>Checkpoint: Save checkpoint.json (code_validated)
    
    Main->>Step7: Create flow7/
    Step7->>Step7: Analyze JSON results
    Step7->>Step7: Generate converter code
    Step7->>Step7: Convert to Markdown
    Step7->>Checkpoint: Save checkpoint.json (markdown_converted)
    Step7-->>User: Return results

Processing Steps

  1. Text Analysis β†’ Analyzer Agent analyzes HTML structure β†’ flow1/
  2. Visual Analysis β†’ Visual Analyzer analyzes rendered layout (optional) β†’ flow2/
  3. Coordination β†’ Orchestrator synthesizes all results β†’ flow3/
  4. Schema Generation β†’ Orchestrator generates JSON schema β†’ flow4/
  5. Code Generation β†’ Code Generator creates extraction code β†’ flow5/
  6. Code Validation β†’ Code Validator validates and improves code β†’ flow6/
  7. Code Execution β†’ Execute validated code on spread directory β†’ flow6/extraction_results/
  8. Markdown Conversion β†’ Markdown Converter analyzes JSON and generates converter code β†’ flow7/

πŸ“ Project Structure

crawlAgent/
β”œβ”€β”€ agents/                  # AI Agent implementations
β”‚   β”œβ”€β”€ orchestrator.py      # Orchestrator agent
β”‚   β”œβ”€β”€ analyzer.py          # Analyzer agent
β”‚   β”œβ”€β”€ code_generator.py    # Code generator agent
β”‚   └── code_validator.py    # Code validator agent
β”œβ”€β”€ utils/                   # Utility modules
β”‚   β”œβ”€β”€ html_parser.py       # HTML parsing utilities
β”‚   β”œβ”€β”€ visual_analyzer.py   # Visual analysis
β”‚   β”œβ”€β”€ url_downloader.py    # URL downloading
β”‚   β”œβ”€β”€ logger.py            # Logging system
β”‚   └── checkpoint.py        # Checkpoint management
β”œβ”€β”€ config/                  # Configuration
β”‚   └── settings.py          # Settings management (includes path configuration)
β”œβ”€β”€ prompts/                 # Prompt templates
β”‚   └── prompt_templates.py
β”œβ”€β”€ data/                    # Data directory
β”‚   β”œβ”€β”€ input/               # Input directory
β”‚   β”‚   β”œβ”€β”€ typcial/         # Learning content directory
β”‚   β”‚   β”‚   β”œβ”€β”€ urls.txt     # URL list (optional)
β”‚   β”‚   β”‚   └── html/        # Pre-crawled HTML files (optional)
β”‚   β”‚   └── spread/          # Content to process directory
β”‚   β”‚       β”œβ”€β”€ urls.txt     # URL list (optional)
β”‚   β”‚       └── html/        # HTML files (optional)
β”‚   └── output/              # Output directory
β”‚       β”œβ”€β”€ flow1/           # Flow 1 output
β”‚       β”œβ”€β”€ flow2/           # Flow 2 output
β”‚       └── ...              # More flow outputs
β”œβ”€β”€ logs/                    # Log files (gitignored)
β”œβ”€β”€ main.py                  # Main entry point
β”œβ”€β”€ requirements.txt         # Dependencies
β”œβ”€β”€ env.example              # Environment template
└── README.md                # This file

βš™οΈ Configuration

Path Configuration

All input/output paths are centrally configured in config/settings.py for easy extension.

Directory Structure

data/
β”œβ”€β”€ input/                    # Input directory
β”‚   β”œβ”€β”€ typcial/              # Learning content (what the agent needs to learn)
β”‚   β”‚   β”œβ”€β”€ urls.txt          # URL list file (will be crawled)
β”‚   β”‚   └── html/             # Pre-crawled HTML files directory
β”‚   └── spread/               # Content to process (what the generated code needs to process)
β”‚       β”œβ”€β”€ urls.txt          # URL list file
β”‚       └── html/             # HTML files directory
└── output/                   # Output directory (stores results from each API call)
    β”œβ”€β”€ flow1/                # Flow 1 output directory
    β”œβ”€β”€ flow2/                # Flow 2 output directory
    └── ...                   # More flow output directories

Input Methods

typcial directory (Learning content):

  • Method 1: Place urls.txt file, system will automatically crawl URLs from the list
  • Method 2: Place pre-crawled HTML files directly in the html/ directory

spread directory (Content to process):

  • Method 1: Place urls.txt file, system will process URLs from the list
  • Method 2: Place HTML files to process directly in the html/ directory

Output Methods

  • Results from each API call are stored in data/output/ directory
  • Auto-create flow directories: Each step automatically creates a new flow folder (flow1/, flow2/, flow3/, ...)
    • Step 1 (Text Analysis) β†’ flow1/
    • Step 2 (Visual Analysis) β†’ flow2/
    • Step 3 (Synthesis) β†’ flow3/
    • Step 4 (Schema Generation) β†’ flow4/
    • Step 5 (Code Generation) β†’ flow5/
    • Step 6 (Code Validation & Fix) β†’ flow6/ (includes extraction_results/ folder with individual JSON files)
    • Step 7 (Markdown Conversion) β†’ flow7/ (includes markdown_output/ folder with Markdown files)
  • Each flow directory contains:
    • checkpoint.json: Checkpoint data for resuming
    • step{N}_*_result.json: Step-specific result files
    • intermediate_results.json: Intermediate results for this step
    • extraction_code.py: Generated/validated extraction code (in flow5/flow6)
    • extraction_results/: Individual JSON result files for each HTML file (in flow6)
    • extraction_results_summary.json: Summary of all extraction results (in flow6)
    • markdown_converter.py: Generated Markdown converter code (in flow7)
    • markdown_output/: Individual Markdown files for each JSON result (in flow7)
    • markdown_conversion_summary.json: Summary of Markdown conversion results (in flow7)
  • Manual flow ID specification: Use --flow-id N to specify a particular flow number
  • First API call input comes from data/input/ directory

Flow Directory Management

graph LR
    A[Settings.get_next_flow_id] --> B{Scan OUTPUT_DIR}
    B --> C["Find All flowN Directories"]
    C --> D[Extract Flow Numbers]
    D --> E[Find Max Flow ID]
    E --> F[Return Max + 1]
    
    G[Settings.get_flow_output_dir] --> H["OUTPUT_DIR / flowN"]
    H --> I[Create Directory]
    I --> J[Return Path]
    
    style A fill:#e1f5ff
    style G fill:#fff4e1

Flow ID Auto-Increment Algorithm:

def get_next_flow_id():
    if not OUTPUT_DIR.exists():
        return 1
    
    existing_flows = []
    for item in OUTPUT_DIR.iterdir():
        if item.is_dir() and item.name.startswith('flow'):
            flow_num = int(item.name[4:])  # Extract number from 'flow{N}'
            existing_flows.append(flow_num)
    
    if not existing_flows:
        return 1
    
    return max(existing_flows) + 1

Flow Directory Structure:

data/output/
β”œβ”€β”€ flow1/                    # Step 1: Text Analysis
β”‚   β”œβ”€β”€ checkpoint.json       # Contains: step="text_analysis", analysis_results
β”‚   └── step1_text_analysis_result.json
β”‚
β”œβ”€β”€ flow2/                    # Step 2: Visual Analysis
β”‚   β”œβ”€β”€ checkpoint.json       # Contains: step="visual_analysis", visual_results, analysis_results
β”‚   └── step2_visual_analysis_result.json
β”‚
β”œβ”€β”€ flow3/                    # Step 3: Synthesis
β”‚   β”œβ”€β”€ checkpoint.json       # Contains: step="synthesized", synthesized, analysis_results, visual_results
β”‚   └── step3_synthesized_result.json
β”‚
β”œβ”€β”€ flow4/                    # Step 4: Schema Generation
β”‚   β”œβ”€β”€ checkpoint.json       # Contains: step="schema", schema, synthesized, ...
β”‚   β”œβ”€β”€ extraction_schema.json
β”‚   └── step4_schema_result.json
β”‚
β”œβ”€β”€ flow5/                    # Step 5: Code Generation
β”‚   β”œβ”€β”€ checkpoint.json       # Contains: step="code_generated", code, schema, ...
β”‚   β”œβ”€β”€ extraction_code.py    # Initial generated code
β”‚   β”œβ”€β”€ intermediate_results.json
β”‚   └── step5_code_result.json
β”‚
β”œβ”€β”€ flow6/                    # Step 6: Code Validation & Execution
β”‚   β”œβ”€β”€ checkpoint.json       # Contains: step="code_validated", code, validation, ...
β”‚   β”œβ”€β”€ extraction_code.py    # Validated and fixed code (final)
β”‚   β”œβ”€β”€ code_validation_result.json
β”‚   β”œβ”€β”€ extraction_results/   # Individual JSON files for each HTML
β”‚   β”‚   β”œβ”€β”€ page1.json
β”‚   β”‚   β”œβ”€β”€ page2.json
β”‚   β”‚   └── ...
β”‚   β”œβ”€β”€ extraction_results_summary.json
β”‚   └── intermediate_results.json
β”‚
└── flow7/                    # Step 7: Markdown Conversion
    β”œβ”€β”€ checkpoint.json       # Contains: step="markdown_converted", markdown_converter_code, ...
    β”œβ”€β”€ markdown_converter.py # Generated Markdown converter code
    β”œβ”€β”€ markdown_output/      # Individual Markdown files for each JSON
    β”‚   β”œβ”€β”€ page1.md
    β”‚   β”œβ”€β”€ page2.md
    β”‚   └── ...
    β”œβ”€β”€ markdown_conversion_summary.json
    └── intermediate_results.json

Using Configuration

All path configurations are in config/settings.py:

from config import Settings

# Initialize and create all required directories
Settings.initialize_directories()

# Access paths
typical_urls = Settings.TYPICAL_URLS_FILE      # data/input/typcial/urls.txt
typical_html = Settings.TYPICAL_HTML_DIR       # data/input/typcial/html/
spread_urls = Settings.SPREAD_URLS_FILE        # data/input/spread/urls.txt
spread_html = Settings.SPREAD_HTML_DIR        # data/input/spread/html/
output_dir = Settings.OUTPUT_DIR               # data/output/

# Get flow output directory
flow1_output = Settings.get_flow_output_dir(1)  # data/output/flow1/
flow2_output = Settings.get_flow_output_dir(2)  # data/output/flow2/

# Auto-get next available flow ID and directory
next_flow_id = Settings.get_next_flow_id()  # Auto-increment, returns next available ID
next_flow_dir = Settings.get_next_flow_output_dir()  # Auto-creates new flow directory

Environment Variables

You can customize paths via environment variables:

# Custom data directory (optional, default is data/ under project root)
DATA_DIR=D:/data/custom_data

# Custom output directory (optional, default is DATA_DIR/output)
OUTPUT_DIR=D:/data/custom_output
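These overrides are picked up when config/settings.py builds its paths. A minimal sketch of how they can be honored, assuming plain os.getenv lookups (the repository's actual handling may differ):

import os
from pathlib import Path

# Hypothetical override logic: environment variables win, otherwise fall back to project defaults
PROJECT_ROOT = Path(__file__).parent.parent
DATA_DIR = Path(os.getenv('DATA_DIR', PROJECT_ROOT / 'data'))
OUTPUT_DIR = Path(os.getenv('OUTPUT_DIR', DATA_DIR / 'output'))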

Flow Management

Auto-Increment Flow ID (Recommended)

Default behavior: Each time you run an agent, the system automatically creates a new flow folder without manual specification.

# First run - auto-creates flow1
python main.py

# Second run - auto-creates flow2
python main.py

# Third run - auto-creates flow3
python main.py

Manual Flow ID Specification

If you need to use a specific flow number:

# Use flow1
python main.py --flow-id 1

# Use flow5
python main.py --flow-id 5

# Disable auto-increment, force use flow1
python main.py --no-auto-flow

Programmatic Flow Management

from config import Settings

# Method 1: Manually specify flow ID
flow3_output = Settings.get_flow_output_dir(3)  # data/output/flow3/

# Method 2: Auto-get next available flow ID (recommended)
next_flow_id = Settings.get_next_flow_id()  # Auto-increment
next_flow_dir = Settings.get_next_flow_output_dir()  # Auto-creates new directory

Extending New Flows

When adding new flows:

  1. Auto-increment: The system automatically creates new flow directories by default, no manual management needed

  2. Custom input sources: Add new input directory configurations in config/settings.py

    # Add to Settings class
    CUSTOM_INPUT_DIR = INPUT_DIR / 'custom'
    CUSTOM_HTML_DIR = CUSTOM_INPUT_DIR / 'html'
  3. Auto-create directories: Call Settings.initialize_directories() to auto-create all configured directories

View Path Information

View all path configuration information:

from config import Settings

path_info = Settings.get_path_info()
print(path_info)
# Output:
# {
#     'project_root': 'D:/data/cursorworkspace/crawlAgent',
#     'data_dir': 'D:/data/cursorworkspace/crawlAgent/data',
#     'input_dir': 'D:/data/cursorworkspace/crawlAgent/data/input',
#     'typical_dir': 'D:/data/cursorworkspace/crawlAgent/data/input/typcial',
#     ...
# }

πŸ“Š Output Files

Main Outputs

  • extraction_schema.json: JSON schema with XPath expressions
  • extraction_code.py: Production-ready Python extraction code
  • code_validation_result.json: Code validation report

Step-by-Step Output Files

| Step | Flow Directory | Key Output Files |
| --- | --- | --- |
| Step 1 | flow1/ | step1_text_analysis_result.json, checkpoint.json, intermediate_results.json |
| Step 2 | flow2/ | step2_visual_analysis_result.json, checkpoint.json, intermediate_results.json |
| Step 3 | flow3/ | step3_synthesized_result.json, checkpoint.json, intermediate_results.json |
| Step 4 | flow4/ | extraction_schema.json, step4_schema_result.json, checkpoint.json, intermediate_results.json |
| Step 5 | flow5/ | extraction_code.py (initial), checkpoint.json, intermediate_results.json |
| Step 6 | flow6/ | extraction_code.py (validated), code_validation_result.json, checkpoint.json, intermediate_results.json |
| Step 6.5 | flow6/extraction_results/ | page1.json, page2.json, ... (individual results), extraction_results_summary.json (in flow6/) |
| Step 7 | flow7/ | markdown_converter.py, markdown_output/, markdown_conversion_summary.json, checkpoint.json, intermediate_results.json |

Detailed File Descriptions

  • extraction_schema.json (flow4/): Complete JSON schema with XPath expressions for all extractable sections
  • extraction_code.py (flow5/): Initial generated Python extraction code
  • extraction_code.py (flow6/): Validated and improved Python extraction code (production-ready)
  • code_validation_result.json (flow6/): Detailed validation report with syntax errors, robustness issues, and fixes applied
  • extraction_results/ (flow6/): Directory containing individual JSON files for each processed HTML file
    • Each file named after the source HTML (e.g., page1.json, article.html.json)
    • Contains extracted structured data according to the schema
  • extraction_results_summary.json (flow6/): Summary file listing all processed files and their result file paths
  • markdown_converter.py (flow7/): Generated Python code for converting JSON results to Markdown format (a rough sketch follows this list)
  • markdown_output/ (flow7/): Directory containing individual Markdown files for each JSON result
    • Each file named after the source JSON (e.g., page1.md, article.json.md)
    • Contains Markdown-formatted content extracted from JSON
  • markdown_conversion_summary.json (flow7/): Summary file listing all converted Markdown files and their paths
  • checkpoint.json (each flow/): Complete processing state for that step, enables automatic resume
  • intermediate_results.json (each flow/): Intermediate processing results for debugging and review
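Purely as an illustration of what a generated markdown_converter.py might do, here is a minimal sketch; the field names (article_title, article_date, article_body) are hypothetical and depend on the inferred schema:

import json
from pathlib import Path

def json_result_to_markdown(json_file: Path) -> str:
    """Render one extraction result as Markdown (illustrative, hypothetical fields)."""
    data = json.loads(json_file.read_text(encoding='utf-8'))
    lines = [f"# {data.get('article_title', 'Untitled')}", ""]
    if data.get('article_date'):
        lines += [f"*Published: {data['article_date']}*", ""]
    lines.append(data.get('article_body') or "")
    return "\n".join(lines)

results_dir = Path('data/output/flow6/extraction_results')
output_dir = Path('data/output/flow7/markdown_output')
output_dir.mkdir(parents=True, exist_ok=True)
for json_file in results_dir.glob('*.json'):
    (output_dir / f"{json_file.stem}.md").write_text(json_result_to_markdown(json_file), encoding='utf-8')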

Checkpoint System

The checkpoint system ensures you never lose progress:

  • checkpoint.json: Stored in each flow directory, contains complete processing state and all data for that step
  • Automatic Recovery: System automatically scans all flow directories on startup, loads checkpoints, and resumes from the last completed step
  • Per-Step Checkpoints: Each step (flow1-flow7) maintains its own checkpoint independently
  • Smart Resume: When resuming, the system loads all previous step data from the latest checkpoint
  • No Manual Intervention: Checkpoint recovery is automatic - no need to check logs or manually specify resume points

πŸ”§ Advanced Usage

Command-Line Options

# Auto-create new flow directory (default behavior)
python main.py

# Resume from last checkpoint (default: enabled automatically)
python main.py

# Force restart, ignore checkpoints
python main.py --no-resume

# Specify flow ID (disable auto-increment)
python main.py --flow-id 2

# Disable auto-increment, use default flow1
python main.py --no-auto-flow

# Custom output directory
python main.py --output-dir ./custom_output

# Disable visual analysis (faster)
python main.py --no-visual

# Use typical directory (learning content)
python main.py --input-type typical

Input Formats

1. Directory of HTML files:

python main.py ./html_files

2. URL list file:

python main.py urls.txt

Example urls.txt:

# Comments start with #
https://example.com/page1.html
https://example.com/page2.html
https://example.com/page3.html
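As a rough sketch of what happens to such a list (the repository's own downloader lives in utils/url_downloader.py and may handle retries and encodings differently), assuming the requests library:

from pathlib import Path
import requests

# Read the URL list, skipping blank lines and '#' comments
urls = [
    line.strip()
    for line in Path('urls.txt').read_text(encoding='utf-8').splitlines()
    if line.strip() and not line.lstrip().startswith('#')
]

out_dir = Path('data/input/spread/html')
out_dir.mkdir(parents=True, exist_ok=True)
for i, url in enumerate(urls, start=1):
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    (out_dir / f'page{i}.html').write_text(response.text, encoding='utf-8')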

Custom API Endpoints

The system supports custom OpenAI-compatible API endpoints:

OPENAI_API_BASE=http://your-custom-endpoint:port/v1
ANTHROPIC_BASE_URL=http://your-custom-endpoint:port/v1

Note: URLs are used exactly as configured, without modification.
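For reference, an OpenAI-compatible endpoint is typically addressed like this with the official openai Python client; the agents in this project are assumed to construct their clients in a similar way:

import os
from openai import OpenAI

# base_url and api_key come straight from the .env values shown earlier
client = OpenAI(
    api_key=os.environ['OPENAI_API_KEY'],
    base_url=os.environ.get('OPENAI_API_BASE'),  # e.g. http://your-endpoint:port/v1
)

response = client.chat.completions.create(
    model=os.environ.get('OPENAI_MODEL', 'gpt-4o-mini'),
    messages=[{'role': 'user', 'content': 'Describe the structure of this HTML snippet: <h1>Title</h1>'}],
)
print(response.choices[0].message.content)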


πŸ”¬ Implementation Details

CheckpointManager Class

The CheckpointManager class handles all checkpoint operations:

import json
from datetime import datetime
from pathlib import Path
from typing import Any, Dict, Optional


class CheckpointManager:
    """Manage checkpoints for resuming interrupted processing"""

    CHECKPOINT_FILE = "checkpoint.json"

    def __init__(self, output_dir: Path):
        self.output_dir = Path(output_dir)
        self.checkpoint_path = self.output_dir / self.CHECKPOINT_FILE

    def save_checkpoint(self, step: str, data: Dict[str, Any]):
        """Save checkpoint with step name and data"""
        checkpoint = {
            "step": step,
            "timestamp": datetime.now().isoformat(),
            "data": data
        }
        # Write the checkpoint to checkpoint.json inside this flow directory
        self.output_dir.mkdir(parents=True, exist_ok=True)
        with open(self.checkpoint_path, "w", encoding="utf-8") as f:
            json.dump(checkpoint, f, indent=2, ensure_ascii=False)

    def load_checkpoint(self) -> Optional[Dict[str, Any]]:
        """Load checkpoint from file; returns the checkpoint dict or None"""
        if not self.checkpoint_path.exists():
            return None
        with open(self.checkpoint_path, "r", encoding="utf-8") as f:
            return json.load(f)
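Typical usage, combining it with the flow directory helpers from config/settings.py (the payload shown is illustrative):

from config import Settings

# Persist step 1 results into flow1/checkpoint.json
manager = CheckpointManager(Settings.get_flow_output_dir(1))
manager.save_checkpoint('text_analysis', {'analysis_results': {'pages_analyzed': 3}})

# After a restart, restore the saved state
restored = manager.load_checkpoint()
if restored:
    print(restored['step'], restored['timestamp'])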

Settings Configuration System

Centralized path management in config/settings.py:

from pathlib import Path


class Settings:
    # Base directories
    PROJECT_ROOT = Path(__file__).parent.parent
    DATA_DIR = PROJECT_ROOT / 'data'
    INPUT_DIR = DATA_DIR / 'input'
    OUTPUT_DIR = DATA_DIR / 'output'

    # Input directories
    TYPICAL_DIR = INPUT_DIR / 'typcial'
    TYPICAL_HTML_DIR = TYPICAL_DIR / 'html'
    TYPICAL_URLS_FILE = TYPICAL_DIR / 'urls.txt'

    SPREAD_DIR = INPUT_DIR / 'spread'
    SPREAD_HTML_DIR = SPREAD_DIR / 'html'
    SPREAD_URLS_FILE = SPREAD_DIR / 'urls.txt'

    @classmethod
    def get_flow_output_dir(cls, flow_id: int) -> Path:
        """Get flow-specific output directory"""
        return cls.OUTPUT_DIR / f'flow{flow_id}'

    @classmethod
    def get_next_flow_id(cls) -> int:
        """Auto-increment flow ID: scan existing flowN directories and return max + 1"""
        if not cls.OUTPUT_DIR.exists():
            return 1
        flow_ids = [
            int(item.name[4:])
            for item in cls.OUTPUT_DIR.iterdir()
            if item.is_dir() and item.name.startswith('flow') and item.name[4:].isdigit()
        ]
        return max(flow_ids) + 1 if flow_ids else 1

Code Execution Mechanism

The system dynamically imports and executes the generated code. In outline:

import importlib.util
import json
from pathlib import Path

from config import Settings


def _execute_extraction_code(code_path, output_dir, json_schema):
    # 1. Load the generated code module dynamically
    spec = importlib.util.spec_from_file_location("extraction_code", code_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)

    # 2. Find the extractor class defined in the generated code
    extractor = module.HTMLExtractor()

    # 3. Process each HTML file from the spread directory
    results_dir = Path(output_dir) / "extraction_results"
    results_dir.mkdir(parents=True, exist_ok=True)
    summary = []
    for html_file in sorted(Path(Settings.SPREAD_HTML_DIR).glob("*.html")):
        result = extractor.extract(html_content=html_file.read_text(encoding="utf-8"))

        # 4. Save an individual JSON result file for each HTML file
        json_path = results_dir / f"{html_file.stem}.json"
        json_path.write_text(json.dumps(result, ensure_ascii=False, indent=2), encoding="utf-8")
        summary.append({"source": html_file.name, "result_file": str(json_path)})

    # 5. Generate a summary of all extraction results
    summary_path = Path(output_dir) / "extraction_results_summary.json"
    summary_path.write_text(json.dumps(summary, ensure_ascii=False, indent=2), encoding="utf-8")

Agent Prompt Engineering

Each agent uses specialized prompts:

  1. Analyzer Agent: Focuses on HTML structure, XPath generation
  2. Visual Analyzer: Analyzes rendered layout, visual patterns
  3. Orchestrator: Synthesizes results, identifies common patterns
  4. Code Generator: Generates production-ready Python code
  5. Code Validator: Validates syntax, checks robustness, suggests fixes
  6. Markdown Converter: Analyzes JSON content fields, generates Markdown converter code

Prompts are stored in prompts/prompt_templates.py and can be customized.
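Purely as an illustration of the template style (the template name, wording, and placeholder below are hypothetical; see prompts/prompt_templates.py for the real prompts):

# Hypothetical template shape, not the shipped prompt
ANALYZER_PROMPT = """You are an HTML structure analyst.

Analyze the HTML below and return JSON describing:
1. The main content sections you can identify
2. A robust XPath expression for each section
3. Whether each section repeats (is a list)

HTML:
{html_content}
"""

def build_analyzer_prompt(html_content: str) -> str:
    return ANALYZER_PROMPT.format(html_content=html_content)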


πŸ“ Examples

Example 1: Extract from Single HTML File

from output.extraction_code import HTMLExtractor
import json

extractor = HTMLExtractor()
result = extractor.extract(file_path="article.html")

print(f"Title: {result.get('article_title')}")
print(f"Date: {result.get('article_date')}")
print(f"Body: {result.get('article_body')[:100]}...")

Example 2: Batch Processing

from pathlib import Path
from output.extraction_code import HTMLExtractor

extractor = HTMLExtractor()
html_files = list(Path("html_files").glob("*.html"))

results = extractor.extract_batch(html_files, is_file_paths=True)

for file_path, result in zip(html_files, results):
    print(f"{file_path.name}: {result.get('article_title', 'N/A')}")

Example 3: Save Results to JSON

from output.extraction_code import HTMLExtractor
import json

extractor = HTMLExtractor()
result = extractor.extract(file_path="article.html")

with open("extracted_data.json", "w", encoding="utf-8") as f:
    json.dump(result, f, indent=2, ensure_ascii=False)

πŸ” JSON Schema Format

The generated schema follows this structure:

{
  "schema_version": "1.0",
  "description": "Schema for extracting content from HTML pages",
  "sections": [
    {
      "name": "article_title",
      "description": "Main article title",
      "xpath": "//h1[@class='title']",
      "is_list": false,
      "attributes": {},
      "notes": "Extracts the main title"
    },
    {
      "name": "comments",
      "description": "List of comments",
      "xpath": "//div[@class='comment']",
      "xpath_list": ["//div[@class='comment']"],
      "is_list": true,
      "attributes": {"class": "comment"}
    }
  ]
}
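To make the schema concrete, here is a simplified sketch of how such a section list can drive extraction with lxml; the generated extraction_code.py follows the same idea but adds error handling and attribute support:

import json
from lxml import html

def extract_with_schema(html_content: str, schema: dict) -> dict:
    """Apply each section's XPath to a document (simplified sketch)."""
    tree = html.fromstring(html_content)
    result = {}
    for section in schema['sections']:
        nodes = tree.xpath(section['xpath'])
        texts = [
            node.text_content().strip() if hasattr(node, 'text_content') else str(node).strip()
            for node in nodes
        ]
        # List sections keep every match; scalar sections keep only the first match
        result[section['name']] = texts if section.get('is_list') else (texts[0] if texts else None)
    return result

with open('data/output/flow4/extraction_schema.json', encoding='utf-8') as f:
    schema = json.load(f)
print(extract_with_schema(open('example.html', encoding='utf-8').read(), schema))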

πŸ›‘οΈ Checkpoint & Resume System

The system automatically saves checkpoints after each step and resumes from checkpoints by default, ensuring you never lose progress.

Key Features

  • βœ… Automatic Resume (Default): System automatically checks for checkpoints on startup and resumes from the last completed step
  • βœ… No Log Reading Required: Checkpoint recovery is automatic - you don't need to check logs to know where to resume
  • βœ… Per-Step Checkpoints: Each step (flow1, flow2, flow3, etc.) has its own checkpoint.json file
  • βœ… Smart Recovery: System scans all flow directories, loads checkpoints, and automatically skips completed steps
  • βœ… Manual Control: Use --no-resume flag to force restart and ignore checkpoints
  • βœ… Step Results: Each step saves its result separately in its flow directory
  • βœ… Progress Tracking: Never lose progress, resume from any step automatically

How It Works

  1. Checkpoint Creation: After each step completes, a checkpoint.json file is saved in that step's flow directory
  2. Startup Scan: When the system starts, it scans all flow{N}/ directories in the output folder
  3. Checkpoint Loading: For each flow directory, it loads the checkpoint.json file and extracts:
    • The step name (e.g., "text_analysis", "code_validated")
    • All processing data (analysis results, schema, code, etc.)
  4. State Restoration: The system restores the complete state from the latest checkpoint
  5. Smart Skip: Steps that are already completed (have valid checkpoints) are automatically skipped
  6. Resume Execution: Processing continues from the first incomplete step
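A simplified sketch of that startup scan, assuming CheckpointManager is importable from utils/checkpoint.py as laid out in the project structure (the real system also maps the saved step name back to a pipeline stage):

from config import Settings
from utils.checkpoint import CheckpointManager  # assumed import path

def find_latest_checkpoint():
    """Scan data/output/flow*/ directories and return the highest-numbered checkpoint."""
    latest_id, latest = 0, None
    for flow_dir in Settings.OUTPUT_DIR.glob('flow*'):
        suffix = flow_dir.name[4:]
        if not (flow_dir.is_dir() and suffix.isdigit()):
            continue
        checkpoint = CheckpointManager(flow_dir).load_checkpoint()
        if checkpoint and int(suffix) > latest_id:
            latest_id, latest = int(suffix), checkpoint
    return latest_id, latest

flow_id, checkpoint = find_latest_checkpoint()
if checkpoint:
    print(f"Resuming after flow{flow_id}: last completed step was '{checkpoint['step']}'")
else:
    print('No checkpoints found, starting from step 1')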

πŸ“‹ Requirements

  • Python: 3.8 or higher
  • API Keys:
    • OpenAI API key (or compatible endpoint)
    • Anthropic API key (or compatible endpoint)
  • Dependencies: See requirements.txt
  • Optional: Playwright for visual analysis

🀝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.


πŸ™ Acknowledgments


πŸ“ž Support


Made with ❀️ using AI Agents

⭐ Star this repo if you find it useful!
