champiom666/SparseTable-Bench-MindSpore

SparseTable-Bench MindSpore

This repository is a MindSpore companion implementation for the SparseTable-Bench workflow described in Chapter 40 of Data Engineering for Foundation Models.

It focuses on the data engineering loop behind sparse table recognition:

  • STB sample schema loading and validation
  • synchronized HTML, cell text, and bbox supervision objects
  • column-aware STB-Mask-Stress generation
  • TEDS and TEDS-S style structural evaluation
  • a minimal MindSpore training entry point for coarse structure targets

Install

pip install -e ".[test]"

MindSpore is optional for the data tools and tests. Install it only when running the training entry point:

pip install -e ".[mindspore]"

Sample Schema

Input is JSONL with one table sample per line. The sample below is pretty-printed for readability:

{
  "image_id": "sample_001",
  "image_path": "images/sample_001.png",
  "width": 320,
  "height": 120,
  "html": "<table><tr><th>Metric</th><th>Prior</th><th>Current</th></tr><tr><td>Revenue</td><td></td><td>$12.4M</td></tr></table>",
  "cells": [
    {"text": "Metric", "bbox": [10, 10, 110, 40], "row": 0, "col": 0, "is_header": true},
    {"text": "Prior", "bbox": [110, 10, 210, 40], "row": 0, "col": 1, "is_header": true},
    {"text": "Current", "bbox": [210, 10, 310, 40], "row": 0, "col": 2, "is_header": true},
    {"text": "Revenue", "bbox": [10, 40, 110, 80], "row": 1, "col": 0},
    {"text": "[EMPTY_CELL]", "bbox": [110, 40, 210, 80], "row": 1, "col": 1},
    {"text": "$12.4M", "bbox": [210, 40, 310, 80], "row": 1, "col": 2}
  ]
}
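As a rough illustration of consuming this schema, the sketch below parses JSONL lines into plain dataclasses. The names `Cell`, `TableSample`, and `iter_samples` are hypothetical and are not taken from `schema.py`; the `[x0, y0, x1, y1]` bbox order is an assumption inferred from the example values.

```python
import json
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class Cell:
    text: str
    bbox: List[int]          # assumed order: [x0, y0, x1, y1] in pixels
    row: int
    col: int
    is_header: bool = False  # body cells in the example omit this key


@dataclass
class TableSample:
    image_id: str
    image_path: str
    width: int
    height: int
    html: str
    cells: List[Cell]


def iter_samples(lines: Iterable[str]) -> Iterator[TableSample]:
    """Yield one TableSample per non-blank JSONL line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        cells = [
            Cell(
                text=c["text"],
                bbox=list(c["bbox"]),
                row=c["row"],
                col=c["col"],
                is_header=c.get("is_header", False),
            )
            for c in record["cells"]
        ]
        yield TableSample(
            image_id=record["image_id"],
            image_path=record["image_path"],
            width=record["width"],
            height=record["height"],
            html=record["html"],
            cells=cells,
        )
```

Because `iter_samples` accepts any iterable of strings, an open file handle can be passed directly, for example `with open("data/stb_train.jsonl") as f: samples = list(iter_samples(f))`.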

Validate Samples

stb-ms validate --input data/stb_train.jsonl

Validation checks image dimensions, bbox bounds, positive spans, and grid occupancy conflicts.
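The four checks named above could look roughly like the following. This is a minimal sketch, not the logic in `schema.py`: the function name `validate_sample` and the optional `rowspan` / `colspan` keys are assumptions (the example schema shows no span fields), and the bbox order is assumed to be `[x0, y0, x1, y1]`.

```python
from typing import Dict, List, Tuple


def validate_sample(sample: dict) -> List[str]:
    """Return a list of problems found; an empty list means the sample passed."""
    errors: List[str] = []

    # Check 1: image dimensions must be positive.
    width, height = sample.get("width", 0), sample.get("height", 0)
    if width <= 0 or height <= 0:
        errors.append(f"non-positive image size {width}x{height}")

    occupied: Dict[Tuple[int, int], int] = {}  # grid slot -> index of owning cell
    for i, cell in enumerate(sample.get("cells", [])):
        # Check 2: bbox must be non-degenerate and lie inside the image.
        x0, y0, x1, y1 = cell["bbox"]
        if not (0 <= x0 < x1 <= width and 0 <= y0 < y1 <= height):
            errors.append(f"cell {i}: bbox {cell['bbox']} is degenerate or out of bounds")

        # Check 3: spans must be positive (hypothetical field names, default 1).
        rowspan = cell.get("rowspan", 1)
        colspan = cell.get("colspan", 1)
        if rowspan < 1 or colspan < 1:
            errors.append(f"cell {i}: non-positive span {rowspan}x{colspan}")
            continue

        # Check 4: no two cells may claim the same grid slot.
        for r in range(cell["row"], cell["row"] + rowspan):
            for c in range(cell["col"], cell["col"] + colspan):
                if (r, c) in occupied:
                    errors.append(
                        f"cell {i}: grid slot ({r}, {c}) already used by cell {occupied[(r, c)]}"
                    )
                else:
                    occupied[(r, c)] = i
    return errors
```

Returning a list of messages rather than raising on the first failure lets a caller report every problem in a sample at once, which tends to be more useful when cleaning a large JSONL file.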

Generate STB-Mask-Stress

stb-ms mask-stress \
  --input data/stb_train.jsonl \
  --image-root data \
  --output-root outputs/mask_stress \
  --mask-columns 1 \
  --seed 13

The generator masks selected cell regions in the image and synchronously sets the corresponding cell text to [EMPTY_CELL] while preserving row, column, bbox, and span topology.
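The synchronized update described above can be sketched in pure Python. Everything here is illustrative rather than the code in `mask_stress.py`: `mask_columns` is a hypothetical name, the image is modelled as a grayscale list of rows indexed `image[y][x]`, header cells are assumed to be left untouched, and the sample's `html` string is not rewritten in this sketch.

```python
import copy
import random
from typing import List, Tuple

EMPTY = "[EMPTY_CELL]"


def mask_columns(
    sample: dict,
    image: List[List[int]],
    num_columns: int,
    seed: int,
    fill: int = 255,
) -> Tuple[dict, List[List[int]]]:
    """Return copies of (sample, image) with up to `num_columns` columns masked."""
    rng = random.Random(seed)  # local RNG so the result depends only on the seed
    new_sample = copy.deepcopy(sample)
    new_image = [row[:] for row in image]

    # Candidate columns are those containing at least one non-header cell.
    columns = sorted({c["col"] for c in new_sample["cells"] if not c.get("is_header")})
    chosen = set(rng.sample(columns, min(num_columns, len(columns))))

    for cell in new_sample["cells"]:
        if cell.get("is_header") or cell["col"] not in chosen:
            continue
        x0, y0, x1, y1 = cell["bbox"]  # assumed order: [x0, y0, x1, y1]
        for y in range(y0, y1):
            for x in range(x0, x1):
                new_image[y][x] = fill  # paint over the cell region
        cell["text"] = EMPTY  # keep row, col, and bbox exactly as they were
    return new_sample, new_image
```

Only the pixels and the `text` field change, so the grid topology of the masked sample stays identical to the original and the two can be compared cell by cell.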

Evaluate Predictions

Prediction JSONL format:

{"image_id": "sample_001", "reference_html": "<table>...</table>", "prediction_html": "<table>...</table>"}

Run:

stb-ms eval --predictions outputs/predictions.jsonl

The tool reports approximate TEDS and TEDS-S scores. TEDS compares both table structure and cell text; TEDS-S compares structure only.
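TEDS is conventionally defined as one minus the tree edit distance between the two HTML trees, normalized by the size of the larger tree. The sketch below is a much cruder stand-in that computes a normalized edit distance over a flat token sequence instead of a tree, purely to show how the text-aware and structure-only variants differ. It is not the implementation in `metrics.py`, its scores are not comparable to published TEDS numbers, and the names `tokenize`, `edit_distance`, and `similarity` are hypothetical.

```python
from html.parser import HTMLParser
from typing import List


class _TokenCollector(HTMLParser):
    """Flatten HTML into a token list; cell text is kept only on request."""

    def __init__(self, keep_text: bool):
        super().__init__()
        self.keep_text = keep_text
        self.tokens: List[str] = []

    def handle_starttag(self, tag, attrs):
        # Spans are part of the structure, so fold them into the tag token.
        spans = sorted(
            ((k, v) for k, v in attrs if k in ("rowspan", "colspan")),
            key=lambda kv: kv[0],
        )
        self.tokens.append("<" + tag + "".join(f" {k}={v}" for k, v in spans) + ">")

    def handle_endtag(self, tag):
        self.tokens.append(f"</{tag}>")

    def handle_data(self, data):
        if self.keep_text and data.strip():
            self.tokens.append(data.strip())


def tokenize(html: str, keep_text: bool) -> List[str]:
    parser = _TokenCollector(keep_text)
    parser.feed(html)
    parser.close()
    return parser.tokens


def edit_distance(a: List[str], b: List[str]) -> int:
    """Levenshtein distance over token lists, using two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def similarity(reference_html: str, prediction_html: str, keep_text: bool) -> float:
    """keep_text=True mimics the TEDS idea; keep_text=False mimics TEDS-S."""
    ref = tokenize(reference_html, keep_text)
    pred = tokenize(prediction_html, keep_text)
    longest = max(len(ref), len(pred))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(ref, pred) / longest
```

A prediction that gets the grid right but misreads a cell scores 1.0 with `keep_text=False` and below 1.0 with `keep_text=True`, which is the distinction between the structure-only and text-aware metrics.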

MindSpore Training

The training entry point is intentionally small. It predicts coarse structure targets: row count, column count, empty-cell ratio, and mean normalized cell area. This is a reproducible scaffold for wiring STB data into MindSpore; benchmark-grade HTML decoding and TEDS evaluation remain in the data tools.

stb-ms-train \
  --train-jsonl outputs/mask_stress/mask_stress.jsonl \
  --image-root outputs/mask_stress \
  --epochs 3 \
  --device-target CPU
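The four coarse targets listed above can be derived directly from a sample's cell list. The function below is a sketch under stated assumptions, not the code in `ms_dataset.py`: `coarse_targets` is a hypothetical name, row and column counts are taken as the largest index plus one (spans are ignored), and cell area is normalized by the full image area.

```python
from typing import List

EMPTY = "[EMPTY_CELL]"


def coarse_targets(sample: dict) -> List[float]:
    """Return [row_count, col_count, empty_cell_ratio, mean_normalized_cell_area]."""
    cells = sample["cells"]
    if not cells:
        return [0.0, 0.0, 0.0, 0.0]

    rows = max(c["row"] for c in cells) + 1
    cols = max(c["col"] for c in cells) + 1
    empty_ratio = sum(c["text"] == EMPTY for c in cells) / len(cells)

    image_area = sample["width"] * sample["height"]
    total_cell_area = sum(
        (c["bbox"][2] - c["bbox"][0]) * (c["bbox"][3] - c["bbox"][1]) for c in cells
    )
    mean_area = total_cell_area / len(cells) / image_area

    return [float(rows), float(cols), empty_ratio, mean_area]
```

For the sample shown under Sample Schema, this yields 2 rows, 3 columns, an empty-cell ratio of 1/6, and a mean normalized cell area of 3500 / 38400.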

Repository Layout

sparsetable_bench_ms/
  schema.py       # STB sample dataclasses and validation
  mask_stress.py  # column-aware occlusion generation
  metrics.py      # TEDS / TEDS-S style metrics
  ms_dataset.py   # MindSpore GeneratorDataset adapter
  ms_model.py     # minimal MindSpore structure regressor
  train.py        # training CLI
  cli.py          # validation, masking, evaluation CLI
configs/
tests/
