[Training v3.0 consolidation] Evaluate Parquet reader performance in FLUX to validate it can meet target throughput #357

@wolfgang-desalvador

Summary

Evaluate the Parquet reader path for the FLUX workload, after the optimizations tracked in #356 are in place, to determine whether Parquet can remain the data format for FLUX while still meeting the per-accelerator target throughput.

Dependency: This evaluation can only be carried out after #356 is implemented, since the persistent file handle and preserved row group cache are prerequisites for a meaningful Parquet performance assessment in FLUX.

Motivation

FLUX currently uses Parquet for sample storage. Before considering alternative on-disk formats, we need a clear, quantitative answer to a single question: once the Parquet reader path is optimized as proposed in #356 (persistent file handle and preserved row group cache), can it sustain the FLUX throughput target on representative storage backends?

A clean evaluation here will either confirm Parquet as the long-term format for FLUX or provide the evidence needed to motivate exploring alternatives.

Proposed methodology

  1. Define a baseline FLUX dataset and a fixed accelerator/host configuration.
  2. Run FLUX on the current Parquet reader as a baseline.
  3. Run FLUX with the optimization from #356 applied (persistent file handle, preserved row group cache).
  4. For each run, record:
    • Achieved throughput (GB/s, samples/s) per accelerator
    • CPU utilization (overall and per worker)
    • Storage-side metrics (read IOPS, average request size)
    • Reader latency distribution
  5. Compare results against the FLUX target throughput.
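
A minimal sketch of how the per-run throughput and reader-latency numbers in steps 4 and 5 could be collected. This is not the FLUX harness. `RunStats`, `measure`, and `meets_target` are hypothetical names, and the reader is assumed to be any iterable that yields a `(num_samples, num_bytes)` tuple per batch. CPU and storage-side metrics would come from separate tooling and are not covered here.

```python
import math
import time
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Tuple


@dataclass
class RunStats:
    """Aggregated measurements for one reader run (hypothetical container)."""
    samples: int
    total_bytes: int
    elapsed_s: float
    latencies_s: List[float] = field(default_factory=list)

    @property
    def samples_per_s(self) -> float:
        return self.samples / self.elapsed_s

    @property
    def gb_per_s(self) -> float:
        # Decimal gigabytes (1 GB = 1e9 bytes).
        return self.total_bytes / self.elapsed_s / 1e9

    def latency_percentile(self, pct: float) -> float:
        """Nearest-rank percentile of per-batch latency; needs >= 1 batch."""
        ordered = sorted(self.latencies_s)
        rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
        return ordered[rank - 1]


def measure(
    batches: Iterable[Tuple[int, int]],
    clock: Callable[[], float] = time.perf_counter,
) -> RunStats:
    """Drain the reader, timing each batch fetch and the run as a whole."""
    latencies: List[float] = []
    samples = 0
    total_bytes = 0
    it = iter(batches)
    start = clock()
    while True:
        t0 = clock()
        try:
            n_samples, n_bytes = next(it)
        except StopIteration:
            break
        latencies.append(clock() - t0)
        samples += n_samples
        total_bytes += n_bytes
    elapsed = clock() - start
    return RunStats(samples, total_bytes, elapsed, latencies)


def meets_target(stats: RunStats, target_gb_per_s: float) -> bool:
    """Step 5: compare achieved per-accelerator throughput with the target."""
    return stats.gb_per_s >= target_gb_per_s
```

Running `measure` once on the baseline reader and once on the reader with #356 applied, on the same dataset and host configuration, gives two `RunStats` objects to compare. The target value passed to `meets_target` is left to the caller, since this issue does not state it.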

Success criteria

  • A clear go/no-go conclusion on whether Parquet can remain the FLUX data format while meeting the target throughput.
  • A short report supporting the conclusion with reproducible measurements.
