freds-dev/Tag2Pixel

🌍 Tag2Pixel: Geospatial ML Pipeline

Tag2Pixel is a pipeline that automates the creation of machine learning datasets and models by combining OpenStreetMap (OSM) vector data with Sentinel-2 multispectral satellite imagery.

The pipeline is "class-agnostic": you can repurpose it to detect any object that has a distinct spectral signature and is mapped in OSM (e.g., solar panels, specific crop types, or water bodies). It extracts geometries, samples the underlying satellite pixels, and trains a Random Forest model with spatial validation.

Note: This is a pixel-based pipeline. It classifies the spectral signature of a single point rather than the shape or texture of an object.


🚀 Key Features

  • Automated Data Bridge: Connects OSM tags directly to Sentinel-2 spectral bands.
  • Fast OSM Extraction: Uses osmium-tool to pre-filter large .osm.pbf files by tag before parsing, which keeps extraction down to seconds.
  • Cloud-Native Imagery: Fetches multispectral data on-the-fly from the Microsoft Planetary Computer STAC API.
  • Spatial Validation (kNNDM): Evaluates the model with spatial cross-validation using k-fold Nearest Neighbor Distance Matching (kNNDM) rather than a random split.
  • Dockerized: One command to set up the entire Python (data) and R (ML) environment.
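
To illustrate why spatial validation matters, the sketch below computes, for each sample, the distance to its nearest neighbour in a different fold. Test-to-training distances of this kind are what kNNDM tries to match to the distances expected at prediction time. This is not kNNDM itself, and the function name and toy coordinates are hypothetical.

```python
import math

def nearest_other_fold_distance(coords, folds):
    """For each point, distance to the nearest point assigned to a different fold."""
    distances = []
    for p, fold_p in zip(coords, folds):
        others = [q for q, fold_q in zip(coords, folds) if fold_q != fold_p]
        distances.append(min(math.dist(p, q) for q in others))
    return distances

# Two tight clusters of samples, about 100 units apart (toy projected coordinates).
coords = [(0, 0), (1, 0), (0, 1), (100, 0), (101, 0), (100, 1)]

# Mixed folds: each cluster contributes to both folds, so every held-out
# point has a training neighbour right next to it.
mixed_folds = [0, 1, 0, 1, 0, 1]
# Spatial folds: each cluster is held out as a whole, so held-out points
# are far from all training points.
spatial_folds = [0, 0, 0, 1, 1, 1]

mixed = nearest_other_fold_distance(coords, mixed_folds)
spatial = nearest_other_fold_distance(coords, spatial_folds)
print(max(mixed), min(spatial))
```

With mixed folds the held-out points sit almost on top of training points, which tends to make accuracy estimates optimistic when the model is later applied to unseen regions.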

🛠 How it Works

  1. OSM Extraction: Scans a .osm.pbf file for specific tags and extracts polygon centroids.
  2. Spectral Sampling: Queries the STAC API for the configured Sentinel-2 bands (10 by default) at those coordinates.
  3. Median Compositing: Takes the median over multiple time steps to reduce noise.
  4. Spatial Training: Trains a Random Forest and validates it with spatial cross-validation (kNNDM).
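
Step 3 can be pictured with a small example. The sketch below takes the per-band median across several acquisitions of the same pixel and skips missing observations. The function name, the use of None for masked values, and the reflectance numbers are illustrative assumptions, not the pipeline's actual code.

```python
from statistics import median

def median_composite(time_series):
    """Per-band median over a list of {band: value} observations of one pixel."""
    bands = time_series[0].keys()
    composite = {}
    for band in bands:
        # Ignore dates where the band is missing (assumed to be stored as None).
        values = [obs[band] for obs in time_series if obs.get(band) is not None]
        composite[band] = median(values) if values else None
    return composite

observations = [
    {"B04": 0.12, "B08": 0.30},
    {"B04": 0.90, "B08": 0.85},  # bright outlier, e.g. an undetected cloud
    {"B04": 0.10, "B08": None},  # B08 missing on this date
    {"B04": 0.11, "B08": 0.32},
]
composite = median_composite(observations)
print(composite)
```

The single bright observation barely moves the result, which is the reason a median is often preferred over a mean for this kind of compositing.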

⚙️ Configuration (config/config.yaml)

The pipeline is controlled entirely via config.yaml. Here is how to define your task:

1. Task Settings

Defines the "Target" (the class you want to find) and the "Ratio" (how many negative samples to pick per target sample).

task:
  target_class: "pv"         # Must match a name in the 'classes' list below
  target_count: 2000         # How many samples to extract for the target
  negative_ratio: 1.0        # 1.0 means 2000 target vs 2000 background samples
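
As a quick check of the arithmetic, the number of background samples follows from target_count and negative_ratio. The sketch below assumes the negative count is simply the product rounded to the nearest integer; the pipeline's exact rounding is not documented here, and the function name is hypothetical.

```python
def sample_counts(target_count, negative_ratio):
    """Derive how many negative samples a given ratio implies (assumed rounding)."""
    negatives = round(target_count * negative_ratio)
    return {
        "target": target_count,
        "negative": negatives,
        "total": target_count + negatives,
    }

counts = sample_counts(2000, 1.0)
print(counts)
```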

2. Class Definitions (The "Tags")

You define classes using OSM tag filters. You can use AND logic (inside one filter set) and OR logic (by adding multiple filter sets).

classes:
  - name: "pv"
    label: 1                 # Integer label for ML
    min_area: 200            # Minimum size in m² to consider a polygon valid
    filters:
      # Option A: match if generator:source=solar AND location=roof
      - generator:source: "solar"
        location: "roof"
      # Option B: OR match if generator:method=photovoltaic AND location=roof
      - generator:method: "photovoltaic"
        location: "roof"

  - name: "background"
    label: 0
    min_area: 500
    filters:
      # Match any building tagged as retail, supermarket, or commercial
      - building: ["retail", "supermarket", "commercial"]
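
The AND/OR semantics described above can be sketched in a few lines of Python. This is a minimal illustration, assuming that a list value means "any of these values" and that a filter set matches only when all of its keys match; the function names are hypothetical and not taken from the pipeline.

```python
def matches_filter_set(tags, filter_set):
    """AND: every key in the filter set must match the feature's tags."""
    for key, expected in filter_set.items():
        # A list of values is treated as "any of" (assumption).
        allowed = expected if isinstance(expected, list) else [expected]
        if tags.get(key) not in allowed:
            return False
    return True

def matches_class(tags, filters):
    """OR: the feature matches if any one filter set matches."""
    return any(matches_filter_set(tags, fs) for fs in filters)

pv_filters = [
    {"generator:source": "solar", "location": "roof"},
    {"generator:method": "photovoltaic", "location": "roof"},
]
background_filters = [{"building": ["retail", "supermarket", "commercial"]}]

rooftop_pv = {"generator:source": "solar", "location": "roof"}
ground_pv = {"generator:source": "solar", "location": "ground"}
print(matches_class(rooftop_pv, pv_filters), matches_class(ground_pv, pv_filters))
```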

3. Satellite Imagery (STAC)

Choose your bands and time range. The default uses all 10 m and 20 m Sentinel-2 bands.

stac:
  date_range: "2024-01-01/2024-12-31"
  cloud_cover: 20
  bands: ["B02", "B03", "B04", "B05", "B06", "B07", "B08", "B8A", "B11", "B12"]
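
To show how these settings might translate into a STAC query, the sketch below builds search parameters for one sample point. It assumes the Planetary Computer collection id "sentinel-2-l2a" and treats cloud_cover as a strict upper bound in percent on the eo:cloud_cover property. The function name is hypothetical, and the pipeline's real query may differ.

```python
def stac_search_params(stac_cfg, lon, lat, collection="sentinel-2-l2a"):
    """Build STAC item-search parameters for one point (illustrative only)."""
    return {
        "collections": [collection],
        # GeoJSON point; STAC uses longitude, latitude order.
        "intersects": {"type": "Point", "coordinates": [lon, lat]},
        "datetime": stac_cfg["date_range"],
        # Assumed interpretation: keep scenes with less cloud than the threshold.
        "query": {"eo:cloud_cover": {"lt": stac_cfg["cloud_cover"]}},
    }

stac_cfg = {
    "date_range": "2024-01-01/2024-12-31",
    "cloud_cover": 20,
    "bands": ["B02", "B03", "B04", "B05", "B06", "B07", "B08", "B8A", "B11", "B12"],
}
params = stac_search_params(stac_cfg, lon=7.6, lat=51.9)
print(params)
```

With pystac-client installed, a dictionary like this could be passed as keyword arguments to the search method of a client opened on the Planetary Computer STAC endpoint; the bands list would then select which assets to read from each returned item.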

🚦 Quick Start

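1. Requirements

  • Docker with Docker Compose (the pipeline is started with docker-compose).
  • An .osm.pbf extract covering your area of interest.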

2. Run the Pipeline

# Build and run everything
docker-compose up --build

3. Test Mode

To verify that everything works without waiting for a full 4,000-point extraction (2,000 target plus 2,000 background samples in the example above), set test_mode: true in config.yaml. Test mode uses smaller sample sizes and limits.


📊 Outputs

  • data/training/training.csv: The extracted spectral dataset.
  • data/artifacts/: The trained .rds R model.
  • output/metrics.json: Model performance (AUC, Accuracy) using spatial CV.

📄 License

MIT License.
