Add synthetic curriculum generators for model training #9

@swernerx

Goal

Build a synthetic curriculum generator for Morphēa training and evaluation. It should cover diagrams, charts, logo-like marks, multi-color illustrations, and typography-like regions, so that new coverage does not have to come only from new real-image cases.

Why this matters

Real curated images are essential, but they are slow to collect, label, and review. Morphēa already benefits from generated primitive data and the Lucide rendered corpus. The next step is a broader procedural curriculum where every generated bitmap has known SVG structure, target labels, and editability expectations.

This fits current AI practice: synthetic data and domain randomization are widely used to train vision systems when labeled real data is scarce. The important caveat is the simulation-to-reality gap. Synthetic data should create controlled coverage and failure reproduction, while real curated cases still decide whether the system is genuinely improving.

Proposed curriculum families

Start with deterministic generators for:

  • Diagram mini-scenes: boxes, rounded boxes, diamonds, connector strokes, arrowheads, text-like regions, repeated node styles.
  • Chart mini-scenes: bars, axes, grid lines, line markers, legends, annotation boxes, repeated colors.
  • Logo-like marks: symmetric icons, counters/negative space, smooth curves, sharp geometric cuts, mild antialiasing.
  • Multi-color vector illustrations: overlapping colored regions, reused palette colors, simple foreground/background grouping.
  • Typography-like regions: words rendered from multiple fonts, grouped as glyph/word regions without requiring OCR.
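
As a rough illustration of what a deterministic generator for the diagram family could look like, here is a minimal sketch in Python using only the standard library. The function name `generate_diagram`, the `Box` fields, the palette, and the target keys (`primitive`, `connects`) are all assumptions made for this example, not an existing Morphēa API. It emits only the source SVG and per-object targets; rendering to PNG is left out.

```python
import random
from dataclasses import dataclass


@dataclass
class Box:
    object_id: str
    x: int
    y: int
    w: int
    h: int
    rounded: bool


def generate_diagram(seed: int, n_boxes: int = 3, size: int = 256):
    """Return (svg_text, targets) for one diagram mini-scene (hypothetical layout)."""
    rng = random.Random(seed)  # local RNG: the sample depends only on the seed
    stroke = rng.choice([1, 2, 3])
    fill = rng.choice(["#e3f2fd", "#fff3e0", "#e8f5e9"])  # illustrative palette
    slot = size // n_boxes  # one horizontal slot per box keeps boxes from overlapping

    boxes = []
    for i in range(n_boxes):
        w = rng.randint(slot // 3, slot // 2)
        h = rng.randint(30, 60)
        x = i * slot + rng.randint(4, slot - w - 4)
        y = rng.randint(20, size - h - 20)
        boxes.append(Box(f"box-{i}", x, y, w, h, rng.random() < 0.5))

    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{size}" '
        f'height="{size}" viewBox="0 0 {size} {size}">'
    ]
    targets = []
    for box in boxes:
        rx = 8 if box.rounded else 0
        parts.append(
            f'<rect id="{box.object_id}" x="{box.x}" y="{box.y}" '
            f'width="{box.w}" height="{box.h}" rx="{rx}" '
            f'fill="{fill}" stroke="#333" stroke-width="{stroke}"/>'
        )
        targets.append(
            {"id": box.object_id, "primitive": "rounded_rect" if box.rounded else "rect"}
        )

    # Connect each box to its right-hand neighbour with a straight stroke.
    for i, (a, b) in enumerate(zip(boxes, boxes[1:])):
        cid = f"conn-{i}"
        parts.append(
            f'<line id="{cid}" x1="{a.x + a.w}" y1="{a.y + a.h // 2}" '
            f'x2="{b.x}" y2="{b.y + b.h // 2}" '
            f'stroke="#333" stroke-width="{stroke}"/>'
        )
        targets.append(
            {"id": cid, "primitive": "line", "connects": [a.object_id, b.object_id]}
        )

    parts.append("</svg>")
    return "\n".join(parts), targets
```

Because every SVG element carries the same id as its target entry, a detector's output can be compared against known structure per object. The sketch assumes a small `n_boxes` relative to `size`; a real generator would also add arrowheads, diamonds, and text-like regions.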

Data strategy

  • Generate source SVG plus rendered PNG for each sample.
  • Store machine-readable targets: expected primitive classes, grouping, source object ids, allowed fallbacks, node/parameter budgets, and visual thresholds.
  • Randomize controlled factors: scale, antialiasing, color palette, stroke width, slight rotation, spacing, compression/noise, font family, and layer order.
  • Keep seeds stable and configs checked in so failures can be reproduced.
  • Split generated data into train/val/test by seed and family, not by output file order.
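
One way to make the split depend on seed and family rather than on file order is to hash the pair into a fixed bucket. The sketch below is an assumption about how this could be done, not a description of existing code: `assign_split`, `make_target_record`, the percentages, and every field name and value in the record are hypothetical placeholders for the machine-readable targets listed above.

```python
import hashlib
import json


def assign_split(family: str, seed: int, val_pct: int = 10, test_pct: int = 10) -> str:
    """Map (family, seed) to a split via a stable hash, independent of file order."""
    key = f"{family}:{seed}".encode("utf-8")
    # sha256 is stable across runs and machines, unlike Python's built-in hash().
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "val"
    return "train"


def make_target_record(family: str, seed: int, objects: list) -> dict:
    """Build one JSON-serializable target record (all field names are illustrative)."""
    return {
        "family": family,
        "seed": seed,
        "split": assign_split(family, seed),
        "objects": objects,  # e.g. [{"id": "box-0", "primitive": "rect"}]
        "allowed_fallbacks": ["path"],  # placeholder value
        "budgets": {"max_nodes": 64, "max_params": 256},  # placeholder values
        "visual_thresholds": {"max_pixel_error": 0.02},  # placeholder value
    }


record = make_target_record("diagram", 42, [{"id": "box-0", "primitive": "rect"}])
print(json.dumps(record, indent=2))
```

Regenerating the corpus with more samples or in a different order leaves every existing sample in the same split, which keeps test cases from leaking into training.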

Model strategy

  • Use the generated corpus to train/evaluate MLX-backed raster-target and primitive/ranking models.
  • Treat synthetic-only gains as provisional until validated on curated real images.
  • Add reports that separate synthetic family scores from real-image family scores so the system cannot look good merely by overfitting to generated data.
  • Use synthetic cases to reproduce failures discovered in real images before changing detectors.
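
The reporting idea above can be sketched as a small aggregation step that never mixes synthetic and real scores, plus a rule that labels synthetic-only gains as provisional. The names `summarize`, `overall`, and `classify_gain`, the `source`/`family`/`score` record shape, and the three status labels are assumptions for illustration only.

```python
from collections import defaultdict
from statistics import mean


def summarize(results):
    """Group per-sample scores by source ('synthetic' or 'real') and family."""
    grouped = defaultdict(list)
    for r in results:
        grouped[(r["source"], r["family"])].append(r["score"])
    report = {"synthetic": {}, "real": {}}
    for (source, family), scores in sorted(grouped.items()):
        report[source][family] = {"n": len(scores), "mean_score": mean(scores)}
    return report


def overall(report, source):
    """Mean of the per-family means for one source, or None if it has no data."""
    families = report[source]
    if not families:
        return None
    return mean(f["mean_score"] for f in families.values())


def classify_gain(baseline, candidate):
    """Label a change as 'validated', 'provisional', or 'no_gain'."""
    syn_b, syn_c = overall(baseline, "synthetic"), overall(candidate, "synthetic")
    real_b, real_c = overall(baseline, "real"), overall(candidate, "real")
    # Only an improvement on real images counts as validated.
    if real_b is not None and real_c is not None and real_c > real_b:
        return "validated"
    # A synthetic-only improvement stays provisional.
    if syn_b is not None and syn_c is not None and syn_c > syn_b:
        return "provisional"
    return "no_gain"
```

Averaging per family first keeps one large generated family from dominating the overall number. A stricter rule could also require that no real-image family regresses.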

Acceptance criteria

  • Add a spec or design doc for the synthetic curriculum schema and first generator family.
  • Implement a first family end to end, preferably diagrams or charts, with source SVG, rendered PNG, target labels, and an evaluation report.
  • Add a checked-in smoke config that writes generated artifacts to /tmp and trains/evaluates at least one MLX model path.
  • Reports expose per-family accuracy/editability gates and make synthetic-vs-real performance explicit.
  • Documentation states that synthetic data accelerates coverage but does not replace curated real-image promotion.
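
The per-family gates in the criteria above could be checked with something as small as the following sketch. The `Gate` dataclass, the `accuracy` and `editability` metric keys, and the threshold values in the usage lines are hypothetical; the issue does not yet define how editability is scored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    min_accuracy: float
    min_editability: float


def check_gates(metrics: dict, gates: dict) -> list:
    """Return a list of human-readable gate failures (empty list means all pass)."""
    failures = []
    for family, gate in gates.items():
        m = metrics.get(family)
        if m is None:
            # A family with a gate but no metrics is treated as a failure.
            failures.append(f"{family}: no metrics reported")
            continue
        if m["accuracy"] < gate.min_accuracy:
            failures.append(
                f"{family}: accuracy {m['accuracy']} < {gate.min_accuracy}"
            )
        if m["editability"] < gate.min_editability:
            failures.append(
                f"{family}: editability {m['editability']} < {gate.min_editability}"
            )
    return failures


# Illustrative thresholds and metrics only.
gates = {"diagram": Gate(0.9, 0.8), "chart": Gate(0.9, 0.8)}
metrics = {"diagram": {"accuracy": 0.95, "editability": 0.7}}
for line in check_gates(metrics, gates):
    print(line)
```

Keeping separate gate tables for synthetic and real-image families would make the synthetic-vs-real distinction visible in the same report.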

References

  • Synthetic data augmentation is a common answer to limited labeled data in computer vision, but surveys also stress limits and domain gaps.
  • SynthText and related text-image generators show that text/glyph appearance can be generated at large scale with character/word boxes, which is useful here even if Morphēa does not attempt OCR.
  • Domain randomization suggests varying style/noise/geometry so models learn stable structure rather than one narrow rendering style.

Labels: enhancement
