This repository is a MindSpore companion implementation for the SparseTable-Bench workflow described in Chapter 40 of Data Engineering for Foundation Models.
It focuses on the data engineering loop behind sparse table recognition:
- STB sample schema loading and validation
- synchronized HTML, cell text, and bbox supervision objects
- column-aware STB-Mask-Stress generation
- TEDS and TEDS-S style structural evaluation
- a minimal MindSpore training entry point for coarse structure targets
pip install -e ".[test]"

MindSpore is optional for the data tools and tests. Install it only when running the training entry point:
pip install -e ".[mindspore]"

Input JSONL samples use one table per line:
{
"image_id": "sample_001",
"image_path": "images/sample_001.png",
"width": 320,
"height": 120,
"html": "<table><tr><th>Metric</th><th>Prior</th><th>Current</th></tr><tr><td>Revenue</td><td></td><td>$12.4M</td></tr></table>",
"cells": [
{"text": "Metric", "bbox": [10, 10, 110, 40], "row": 0, "col": 0, "is_header": true},
{"text": "Prior", "bbox": [110, 10, 210, 40], "row": 0, "col": 1, "is_header": true},
{"text": "Current", "bbox": [210, 10, 310, 40], "row": 0, "col": 2, "is_header": true},
{"text": "Revenue", "bbox": [10, 40, 110, 80], "row": 1, "col": 0},
{"text": "[EMPTY_CELL]", "bbox": [110, 40, 210, 80], "row": 1, "col": 1},
{"text": "$12.4M", "bbox": [210, 40, 310, 80], "row": 1, "col": 2}
]
}

stb-ms validate --input data/stb_train.jsonl

Validation checks image dimensions, bbox bounds, positive spans, and grid occupancy conflicts.
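The four checks named above can be sketched in plain Python. This is a minimal illustration, not the code in `schema.py`: the function names are hypothetical, the `[x0, y0, x1, y1]` bbox order is inferred from the sample record, and the optional `rowspan` / `colspan` fields are assumed names that default to 1 when absent.

```python
import json


def validate_sample(sample):
    """Return a list of problems found in one sample; empty means it passed."""
    errors = []
    width, height = sample.get("width", 0), sample.get("height", 0)
    if width <= 0 or height <= 0:
        errors.append("image width and height must be positive")

    occupied = {}  # (row, col) grid slot -> index of the cell that claimed it
    for i, cell in enumerate(sample.get("cells", [])):
        # Assumed bbox order: [x0, y0, x1, y1] in pixels.
        x0, y0, x1, y1 = cell["bbox"]
        if not (0 <= x0 < x1 <= width and 0 <= y0 < y1 <= height):
            errors.append(f"cell {i}: bbox {cell['bbox']} is degenerate or out of bounds")

        # Assumed field names; a cell without them spans exactly one slot.
        rowspan = cell.get("rowspan", 1)
        colspan = cell.get("colspan", 1)
        if rowspan < 1 or colspan < 1:
            errors.append(f"cell {i}: spans must be positive")
            continue

        for r in range(cell["row"], cell["row"] + rowspan):
            for c in range(cell["col"], cell["col"] + colspan):
                if (r, c) in occupied:
                    errors.append(
                        f"cell {i}: grid slot ({r}, {c}) already used by cell {occupied[(r, c)]}"
                    )
                else:
                    occupied[(r, c)] = i
    return errors


def validate_jsonl(path):
    """Yield (line_number, problem) pairs for a JSONL file with one table per line."""
    with open(path, encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, 1):
            if line.strip():
                for problem in validate_sample(json.loads(line)):
                    yield line_no, problem
```

Collecting every problem instead of raising on the first one makes it easier to audit a whole dataset in a single pass.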
stb-ms mask-stress \
--input data/stb_train.jsonl \
--image-root data \
--output-root outputs/mask_stress \
--mask-columns 1 \
--seed 13

The generator masks selected cell regions in the image and synchronously sets the corresponding cell text to [EMPTY_CELL] while preserving row, column, bbox, and span topology.
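The synchronization described above can be illustrated with a small sketch. It is not the implementation in `mask_stress.py`, and several points are assumptions: `mask_columns` is a hypothetical helper that takes explicit column indices, header cells are left untouched, the image is a row-major list of pixel rows standing in for a real image library, and updating the `html` field to match is left out.

```python
import copy

EMPTY_CELL = "[EMPTY_CELL]"


def mask_columns(sample, columns, pixels=None, fill=255, keep_headers=True):
    """Blank the cells of the given columns in both the annotation and the image.

    Returns a modified copy of the sample and the list of masked bboxes.
    If `pixels` is given, it is painted in place.
    """
    masked_sample = copy.deepcopy(sample)  # never mutate the source record
    targets = set(columns)
    masked_boxes = []

    for cell in masked_sample["cells"]:
        if cell["col"] not in targets:
            continue
        if keep_headers and cell.get("is_header", False):
            continue  # assumption: headers stay visible so the column is still identifiable

        # Only the text changes; row, col, bbox, and spans are preserved.
        cell["text"] = EMPTY_CELL
        x0, y0, x1, y1 = cell["bbox"]
        masked_boxes.append((x0, y0, x1, y1))

        if pixels is not None:
            # Paint the same region that the annotation now marks as empty.
            for y in range(y0, y1):
                pixels[y][x0:x1] = [fill] * (x1 - x0)

    return masked_sample, masked_boxes
```

Deriving both the pixel fill and the text replacement from the same bbox in one loop keeps the image and its labels from drifting apart.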
Prediction JSONL format:
{"image_id": "sample_001", "reference_html": "<table>...</table>", "prediction_html": "<table>...</table>"}

Run:
stb-ms eval --predictions outputs/predictions.jsonl

The tool reports approximate TEDS and TEDS-S. TEDS includes text content; TEDS-S focuses on structural topology.
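To make the difference between the two scores concrete, here is a rough sketch of one way to approximate them. It is not the algorithm in `metrics.py`: true TEDS uses a tree edit distance, while this stand-in flattens each table into a token sequence and applies Levenshtein distance, normalized by the longer sequence. The function names are hypothetical. Keeping the cell text in the token stream gives a TEDS-like score; dropping it gives a TEDS-S-like score.

```python
from html.parser import HTMLParser


class _TableTokens(HTMLParser):
    """Flatten table HTML into a list of tag (and optionally text) tokens."""

    def __init__(self, keep_text):
        super().__init__()
        self.keep_text = keep_text
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Spans are part of the structure, so they are kept in the tag token.
        spans = [f"{k}={v}" for k, v in attrs if k in ("rowspan", "colspan")]
        self.tokens.append("<" + " ".join([tag] + spans) + ">")

    def handle_endtag(self, tag):
        self.tokens.append(f"</{tag}>")

    def handle_data(self, data):
        if self.keep_text and data.strip():
            self.tokens.append(data.strip())


def tokenize(html, keep_text):
    parser = _TableTokens(keep_text)
    parser.feed(html)
    parser.close()
    return parser.tokens


def edit_distance(a, b):
    """Levenshtein distance between two token lists, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def approx_similarity(reference_html, prediction_html, keep_text):
    """Return 1 - distance / max_length; keep_text=False ignores cell content."""
    ref = tokenize(reference_html, keep_text)
    pred = tokenize(prediction_html, keep_text)
    if not ref and not pred:
        return 1.0
    return 1.0 - edit_distance(ref, pred) / max(len(ref), len(pred))
```

A prediction that gets every tag right but misreads one cell scores 1.0 with `keep_text=False` and slightly less with `keep_text=True`, which matches the split the two metrics are meant to capture.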
The training entry point is intentionally small. It predicts coarse structure targets: row count, column count, empty-cell ratio, and mean normalized cell area. This is a reproducible scaffold for wiring STB data into MindSpore; benchmark-grade HTML decoding and TEDS evaluation remain in the data tools.
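The four coarse targets can be derived directly from a sample record. The sketch below is illustrative only and makes several assumptions: `structure_targets` is a hypothetical name, the empty-cell ratio is taken over annotated cells rather than grid slots, cell area is normalized by the image area, and `rowspan` / `colspan` are assumed field names that default to 1.

```python
def structure_targets(sample, empty_token="[EMPTY_CELL]"):
    """Compute coarse regression targets from one STB-style sample dict."""
    cells = sample["cells"]
    if not cells:
        raise ValueError("sample has no cells")

    # Grid extent: the furthest row/column any cell reaches, including spans.
    n_rows = max(c["row"] + c.get("rowspan", 1) for c in cells)
    n_cols = max(c["col"] + c.get("colspan", 1) for c in cells)

    # A cell counts as empty if its text is blank or the empty marker.
    n_empty = sum(1 for c in cells if c["text"] in ("", empty_token))

    # Bbox order assumed to be [x0, y0, x1, y1] in pixels.
    image_area = sample["width"] * sample["height"]
    areas = [(x1 - x0) * (y1 - y0) for x0, y0, x1, y1 in (c["bbox"] for c in cells)]

    return {
        "rows": n_rows,
        "cols": n_cols,
        "empty_ratio": n_empty / len(cells),
        "mean_cell_area": sum(areas) / len(areas) / image_area,
    }
```

For the sample record shown earlier, this gives 2 rows, 3 columns, and an empty-cell ratio of 1/6.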
stb-ms-train \
--train-jsonl outputs/mask_stress/mask_stress.jsonl \
--image-root outputs/mask_stress \
--epochs 3 \
--device-target CPU

sparsetable_bench_ms/
    schema.py        # STB sample dataclasses and validation
    mask_stress.py   # column-aware occlusion generation
    metrics.py       # TEDS / TEDS-S style metrics
    ms_dataset.py    # MindSpore GeneratorDataset adapter
    ms_model.py      # minimal MindSpore structure regressor
    train.py         # training CLI
    cli.py           # validation, masking, evaluation CLI
configs/
tests/