Bug fixes and performance enhancements: object storage, checkpointing, Parquet loading #359
Conversation
- Fix all from/import statements: mlpstorage.X -> mlpstorage_py.X (33 py files)
- Fix all mock.patch() string paths: mlpstorage.X -> mlpstorage_py.X (~16 files)
- Replace 4 library-specific YAML configs with 1 workload-only s3_workload_unet3d.yaml (runtime params such as bucket, endpoint, storage_library belong in .env, not YAML)
- Add .env.example documenting all runtime parameters
- Update 22 shell scripts: pip/venv setup -> uv sync pattern
- Update tests/README.md: pip/venv -> uv, mlpstorage -> mlpstorage_py imports
- Update tests/object-store/README.md:
  - Replace 'cd mlp-storage && source .venv/bin/activate' with 'uv run python ...'
  - Update Library Selection section: YAML key -> runtime --param approach
  - Remove s3torchconnector from the library selection table (keep historical results)
  - Update prerequisites: source .venv + source .env -> uv sync

Unit tests: 763 pass (previously 0 due to ModuleNotFoundError: mlpstorage)
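Since the commit above moves every runtime parameter into `.env`, a minimal loader sketch may help illustrate the convention. This is a stdlib-only stand-in for python-dotenv (which the project actually uses), parsing the usual KEY=VALUE dotenv shape; the variable names in the test are hypothetical, not taken from `.env.example`.

```python
def load_env_file(path: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping comments and blanks.

    A stdlib stand-in for python-dotenv, for illustration only.
    """
    params = {}
    with open(path) as fh:
        for raw in fh:
            line = raw.strip()
            # Ignore blank lines, comments, and lines without '='
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            params[key.strip()] = value.strip()
    return params
```

Note that python-dotenv's `load_dotenv()` additionally exports the values into `os.environ`, which is what the benchmark scripts rely on.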
- Extract --file/--object from add_universal_arguments into a new add_storage_type_arguments() function; VectorDB/KVCache parsers no longer require it; training/checkpointing parsers call it
- Update training/checkpointing tests to pass --file in parse_args
- Wrap _collect_cluster_start/_collect_cluster_end with progress_context to show a spinner during SSH/MPI collection
- Pass validate_structure=False to ReportGenerator in test fixtures that use empty temporary directories
- Change logger.error -> logger.warning for a nonexistent results dir in get_runs_files; skip dirs with multiple metadata files
- Add a _uri_for_filename alias to ParquetReaderS3Iterable
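As an illustration of what such a filename-to-URI alias typically does, here is a hedged sketch; the function name, default storage root, and exact rules are assumptions for illustration, not the real `ParquetReaderS3Iterable` code:

```python
def uri_for_filename(filename: str, storage_root: str = "s3://bucket/prefix") -> str:
    """Map a dataset filename to an object-store URI.

    Absolute URIs pass through unchanged; relative names are joined onto the
    storage root with any leading '/' stripped so no double slash appears.
    """
    if "://" in filename:          # already a full URI
        return filename
    return f"{storage_root}/{filename.lstrip('/')}"
```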
- Make --file/--object optional (required=False) so ALL benchmark parsers can carry the flag; VectorDB and KV-cache parsers now include it so the argument is available everywhere
- Fix progress.py: replace logger.status() (a non-existent Logger method) with logger.info() in both the progress_context and create_stage_progress non-interactive fallback paths
- Update tests to assert logger.info() instead of logger.status()

dlio_benchmark changes (local fork + installed venv):
- Replace the broken \r-in-logger progress() with a Rich-based implementation using SpinnerColumn + BarColumn; falls back to plain stdout writes if Rich is unavailable
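The fallback pattern described above (Rich spinner when interactive, plain logging otherwise) can be sketched as follows. This is a simplified stand-in, not the repo's progress.py; it only shows the shape of the fix, including the `logger.info()` call that replaced the non-existent `logger.status()`.

```python
import logging
import sys
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("progress_demo")

try:
    from rich.progress import BarColumn, Progress, SpinnerColumn
    HAVE_RICH = True
except ImportError:
    HAVE_RICH = False

@contextmanager
def progress_context(message: str):
    """Show a spinner/bar when Rich is usable and stderr is a TTY; otherwise
    fall back to plain logger.info() calls (Logger has no .status() method,
    which was the original bug)."""
    if HAVE_RICH and sys.stderr.isatty():
        with Progress(SpinnerColumn(), BarColumn()) as progress:
            progress.add_task(message, total=None)  # indeterminate task
            yield
    else:
        logger.info("%s ...", message)
        yield
        logger.info("%s done", message)
```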
…rams

Reduce tests/object-store/ from 30+ files to 4 clean tests:
- run_training.sh — datagen + training via the mlpstorage CLI
- run_checkpointing.sh — checkpoint write + read via dlio_benchmark
- test_s3lib_get_bench.py — GET throughput benchmark (updated)
- test_direct_write_comparison.py — native write/read benchmark (updated)

All runtime parameters (bucket, endpoint, storage library, credentials) now come exclusively from environment variables or .env — no hardcoded site-specific values remain in any test script or config file.

Changes:
- Archive 26 per-library scripts and result docs to old-archive/
- Archive 3 per-library checkpoint YAMLs to old-archive/
- Add configs/dlio/workload/llama3_8b_checkpoint.yaml: a clean model-only YAML with all storage runtime params supplied via Hydra CLI overrides
- run_training.sh: BUCKET, STORAGE_LIBRARY, MODEL, NP all overridable
- run_checkpointing.sh: BUCKET, STORAGE_LIBRARY, NP, CHECKPOINTS all overridable
- test_s3lib_get_bench.py: use the BUCKET env var (was hardcoded mlp-s3dlio); fail fast with a clear error if the bucket is not set
- test_direct_write_comparison.py: use the BUCKET env var as the shared default; add a validation error if required buckets are not set
- Rewrite README.md: concise, accurate, uv-based instructions for all 4 tests

Unit tests: 905 passed, 4 skipped (no regressions)
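The fail-fast bucket check described for test_s3lib_get_bench.py can be sketched like this; a minimal stand-in where the helper name and error wording are assumptions, not the actual test code:

```python
import os

def require_bucket(var: str = "BUCKET") -> str:
    """Fail fast when the bucket env var is unset, instead of failing later
    with an opaque S3 error from deep inside the benchmark."""
    bucket = os.environ.get(var, "").strip()
    if not bucket:
        raise RuntimeError(
            f"{var} is not set: export it or put it in .env (see .env.example)")
    return bucket
```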
…ore tests

- pyproject.toml: point dlio-benchmark at russfellows/dlio_benchmark@dev, which contains the minio connection-pool fix and the s3torchconnector bool fix
- uv.lock: regenerated after the pyproject.toml change (resolved b1696e1)
- configs/dlio/workload: remove 17 library-specific YAML files (minio, s3dlio, s3torch variants) — all storage params are now supplied via --params CLI overrides from .env; generic YAMLs remain
- configs/dlio/workload/*.yaml (4 files): remove the spurious 'region' field
- tests/object-store/README.md: complete rewrite with accurate instructions
- tests/object-store/run_training.sh: add s3torchconnector support, spawn multiprocessing, disable checkpointing in training tests
- tests/object-store/run_checkpointing.sh: set NP=4, add s3torchconnector
- tests/object-store/run_datagen.sh: new helper script
- tests/object-store/run_cleanup.sh: new helper script
- tests/object-store/old-archive/: archive stale test utility files
…d parquet loading

Object storage (dlio.py):
- _apply_object_storage_params() now logs the .env file path it loads
- Raises FileNotFoundError with an actionable message if --object mode finds no .env

Config (config.py):
- DEFAULT_RESULTS_DIR reads the MLPERF_RESULTS_DIR env var, falls back to tempdir

Main (main.py):
- Add import os (was missing after the tempdir warning addition)
- Warn at startup when results will be written to the system temp dir

Checkpointing (streaming_checkpoint.py):
- IPC Queue/Event created from the same multiprocessing context as the child process
- Fixes the SemLock fork/spawn mismatch on non-fork start methods

MPI (utils.py):
- Add --mca btl ^vader to single-host MPI flags to prevent VADER segfaults

Dependencies (pyproject.toml, uv.lock):
- s3dlio >= 0.9.95
- python-dotenv >= 1.0.0
- dlio-benchmark pinned to russfellows/dlio_benchmark feat/parquet-dgen-streaming

Security (.gitignore):
- Block .env.* credential files; keep .env.example

Unit tests (933 passing, 4 skipped):
- tests/unit/test_config.py: 4 tests for DEFAULT_RESULTS_DIR env-var / tempdir behavior
- tests/unit/test_main_warnings.py: 4 tests for the tempdir warning in run_benchmark()
- tests/unit/test_dlio_object_storage.py: 20 tests for _apply_object_storage_params()
- tests/unit/test_parquet_reader.py: updated 7 tests for the new dlio-benchmark API (cache stores an int byte count, not a Table; no LRU eviction; close() is a no-op)

Docs:
- docs/OBJECT_STORAGE_GUIDE.md moved from .github/ to docs/
- README.md, docs/README.md, tests/README.md: cross-reference links updated

Benchmark results and analysis (new in tests/):
- tests/benchmarks/: bench_*.py scripts (concurrency, phases, put_bytes, rt_switch, write_sizes, zerocopy)
- tests/object-store/: NPZ analysis, RetinaNet bench results, s3ultra results, scaling analysis, multi-endpoint test
- tests/Checkpoint_test_results.md, DLRM_test_results.md, Flux_test_results.md
- tests/RetinaNet_test_results.md, Parquet_dataloading.md, TEST-PLAN-2026-04-25.md
- tests/DLIO-optimization-analysis-2026-04-25.md
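The streaming_checkpoint.py fix above (creating the IPC primitives from the same multiprocessing context as the child process) can be sketched as follows. This is a simplified stand-in, not the repo's actual code:

```python
import multiprocessing as mp

def make_ipc(start_method: str = "spawn"):
    """Create Queue/Event from ONE explicit context.

    Before the fix, default-context mp.Queue()/mp.Event() were paired with a
    ctx.Process; on non-fork start methods that raises
    'SemLock can only be shared between processes through inheritance'.
    """
    ctx = mp.get_context(start_method)
    queue = ctx.Queue()  # was mp.Queue() before the fix
    done = ctx.Event()   # was mp.Event() before the fix
    # The child must come from the same ctx, e.g.:
    #   proc = ctx.Process(target=worker, args=(queue, done)); proc.start()
    return ctx, queue, done
```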
…iles

tests/unit/test_benchmarks_vectordb.py:
- Fix all patch() paths and inline imports (mlpstorage.* → mlpstorage_py.*)
- Add a _validate_vdb_dependencies mock to all 14 tests that instantiate VectorDBBenchmark; that method runs in __init__ before verify_benchmark and raises DependencyError when the optional packages (pymilvus, tabulate) are not installed in the base uv env

tests/unit/test_cli.py:
- Fix three import blocks (mlpstorage.cli, mlpstorage.cli_parser, mlpstorage.config → mlpstorage_py.*)
- Fix bare Namespace → argparse.Namespace in test_num_client_hosts_zero_is_preserved

All 15 previously-failing upstream tests now pass. Full suite: 949 passed, 4 skipped.
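The mocking pattern works because the dependency check runs inside __init__: patching it at the class level lets tests construct the object without the optional packages installed. A self-contained sketch, where the demo class below is hypothetical and merely stands in for VectorDBBenchmark:

```python
from unittest import mock

class VectorDBBenchmarkDemo:
    """Hypothetical stand-in: the dependency check runs in __init__,
    so plain instantiation fails when optional packages are missing."""
    def __init__(self):
        self._validate_vdb_dependencies()
        self.ready = True

    def _validate_vdb_dependencies(self):
        # The real method raises DependencyError when pymilvus/tabulate
        # are absent; ImportError is used here for the sketch.
        raise ImportError("pymilvus is not installed")

# Patching at the class level replaces the check for every instantiation
# inside the with-block:
with mock.patch.object(VectorDBBenchmarkDemo, "_validate_vdb_dependencies"):
    bench = VectorDBBenchmarkDemo()  # no longer raises
```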
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
I thought I had converted all the "mlpstorage" to "mlpstorage_py" everywhere, so I'm surprised that you had to do those changes (again?). Did I just miss these cases earlier?
Reproducing the test locally on a clean uv sync from this PR head, I get 922 passed, 27 failed, 4 skipped — not the 949/4/0 the PR description claims. The 27 failures are in two files the PR introduces or rewrites: test_dlio_object_storage.py (17/20 of the new tests) and test_parquet_reader.py (10/7 of the rewritten tests). The remaining 922 tests pass cleanly.
Yes, I am pretty certain this cannot, and SHOULD not, work until the dlio_benchmark PR has been merged.
These require changes in how the readers work.
Here is that PR: mlcommons/DLIO_local_changes#16
Regards,
—Russ
On Apr 28, 2026, at 4:32 PM, Devasena I (@idevasena) wrote:
=========================== short test summary info ============================
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsEnvLoading::test_logs_path_when_env_file_found_in_cwd
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsEnvLoading::test_raises_file_not_found_when_no_env_file_anywhere
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsEnvLoading::test_error_message_includes_required_vars
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsEnvLoading::test_logs_when_dotenv_upward_search_succeeds
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsBucketValidation::test_raises_value_error_when_bucket_missing
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsInjection::test_injects_storage_type_s3
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsInjection::test_injects_storage_root_as_bucket
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsInjection::test_injects_default_library_s3dlio
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsInjection::test_injects_custom_library
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsInjection::test_injects_default_uri_scheme_s3
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsInjection::test_sets_force_path_style_when_endpoint_url_present
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsInjection::test_no_force_path_style_without_endpoint_url
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsInjection::test_no_force_path_style_for_direct_scheme
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsInjection::test_no_force_path_style_for_file_scheme
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsInjection::test_user_supplied_storage_root_not_overridden
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsInjection::test_user_supplied_force_path_style_not_overridden
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsInjection::test_logs_injected_params_summary
FAILED tests/unit/test_parquet_reader.py::TestParquetReaderS3Iterable::test_get_sample_caches_row_group
FAILED tests/unit/test_parquet_reader.py::TestParquetReaderS3Iterable::test_cache_grows_as_rgs_are_accessed
FAILED tests/unit/test_parquet_reader.py::TestParquetReaderS3Iterable::test_all_rg_entries_persist_within_epoch
FAILED tests/unit/test_parquet_reader.py::TestParquetReaderS3Iterable::test_close_does_not_evict_rg_cache
FAILED tests/unit/test_parquet_reader.py::TestParquetReaderS3Iterable::test_close_preserves_all_files_rg_cache
FAILED tests/unit/test_parquet_reader.py::TestParquetReaderS3Iterable::test_uri_for_absolute_passthrough
FAILED tests/unit/test_parquet_reader.py::TestParquetReaderS3Iterable::test_uri_for_relative_filename
FAILED tests/unit/test_parquet_reader.py::TestParquetReaderS3Iterable::test_uri_strips_leading_slash
FAILED tests/unit/test_parquet_reader.py::TestParquetReaderS3Iterable::test_column_filtering_records_byte_count
FAILED tests/unit/test_parquet_reader.py::TestParquetReaderS3Iterable::test_no_column_filter_reads_all
================== 27 failed, 922 passed, 4 skipped in 9.75s ===================
~/Storage_Repo_Tests/mlc-storage-pr359$ uv run pytest tests/unit -q --co | wc -l
1160
~/Storage_Repo_Tests/mlc-storage-pr359$ grep -E "passed|failed|skipped" /tmp/pr359_unit.log | tail -5
tests/unit/test_benchmarks_base.py::TestTimeSeriesCollectionIntegration::test_timeseries_skipped_for_datagen_command (fixtures used: mock_logger, request, tmp_path, tmp_path_factory).
tests/unit/test_benchmarks_base.py::TestTimeSeriesCollectionIntegration::test_timeseries_skipped_for_configview_command (fixtures used: mock_logger, request, tmp_path, tmp_path_factory).
tests/unit/test_benchmarks_base.py::TestBenchmarkProgress::test_cluster_collection_skipped_logs_debug (fixtures used: mock_logger, request, tmp_path, tmp_path_factory).
tests/unit/test_validation_helpers.py::TestValidateBenchmarkEnvironment::test_success_logs_passed_message.
================== 27 failed, 922 passed, 4 skipped in 9.75s ===================
Got it. Reviewing your DLIO PR mlcommons/DLIO_local_changes#16 and will use a local version of it for this PR's testing.
Devasena,
I am certain you are correct. So perhaps my branch tag is wrong, or the tests are now wrong? I am not certain.
Obviously, more testing would be good. Unfortunately, I don’t have any time left this week, as I spent my weekend working on these changes.
If someone else can examine the tests given the updated dlio_benchmark parquet readers, that would be great.
Regards,
—Russ
On Apr 28, 2026, at 7:04 PM, Devasena I (@idevasena) wrote:
@idevasena commented on this pull request.
In pyproject.toml <#359 (comment)>:
```diff
@@ -82,7 +85,7 @@ url = "https://download.pytorch.org/whl/cpu"
 explicit = true
 [tool.uv.sources]
-dlio-benchmark = { git = "https://github.com/mlcommons/DLIO_local_changes.git" }
+dlio-benchmark = { git = "https://github.com/russfellows/dlio_benchmark.git", branch = "feat/parquet-dgen-streaming" }
```
I thought this instance of the DLIO benchmark linking to your fork in pyproject.toml would suffice.
```diff
 [tool.uv.sources]
-dlio-benchmark = { git = "https://github.com/mlcommons/DLIO_local_changes.git" }
+dlio-benchmark = { git = "https://github.com/russfellows/dlio_benchmark.git", branch = "feat/parquet-dgen-streaming" }
```
Now that mlcommons/DLIO_local_changes#16 is merged. The pyproject.toml needs to be reverted back to https://github.com/mlcommons/DLIO_local_changes.git.
idevasena left a comment:
- Unit-test coverage step-up (949 passing) is genuinely impressive and made it easy to verify cross-PR compatibility with DLIO PR #16.
- The _apply_object_storage_params() flow that consumes .env cleanly bridges DLIO PR #16's env-var contract with the user-facing CLI; well-designed.
- The tests/Checkpoint_test_results.md / Flux_test_results.md etc. baseline-results docs are excellent provenance — every baseline number is dated and accompanied by the exact command that produced it.
- The package rename to mlpstorage_py was thorough — none of the 7 updated parquet reader tests, 8 issue-9 tests, etc. failed.
idevasena left a comment:
Test pass summary
Configuration tested:
- Host: 1 TiB RAM, 56 TB NVMe
- DLIO: mlcommons/DLIO_local_changes:main (post-PR-16-merge)
- mlpstorage: branch-3-0-1/bug-fixes-perf-enhancements (PR #359 head)
- Python 3.12, uv-managed venv
Suites that passed:
- DLIO tests/run_fast_ci.sh: 112 passed, 1 skipped
- DLIO tests/test_issue_regressions.py: all passed
- mlpstorage tests/unit/: 949 passed, 4 skipped
End-to-end runs validated:
- UNet3D datagen + run, local-FS, 4 ranks, 8 files (smoke)
- Flux datagen, local-FS, 8 ranks, 16 files (parquet generator + dgen-py path)
- Multi-endpoint S3 rank-routing logic (unit-level reproduction of obj_store_lib.py:202-231)
End-to-end run for Flux:
POSIX test: Flux full b200 run at 8 accelerators × 8610 files (would take ~30+ minutes; integration validated at smaller scale instead)
I/O Throughput: 611 MiB/s
AU Accelerator Utilization: 99.5%
Just one change needed in pyproject.toml (see the review comment above). Since the new commit will invalidate the previous approvals, I will approve after the commit.
We can take the diagnosis and fix for the issue below in a separate PR, but I wanted to highlight the reported issue #362: training hangs on the new v3 training models — flux, dlrm, and retinanet — all of which freeze at `Starting epoch 1: 1200 steps expected` and never progress. I hit this intermittently during datagen on the PR #359 branch as well, where parquet file generation hangs when using num procs > 1.
Summary
Five independent bug fixes and enhancements across object storage credential loading, results-directory configuration, IPC context handling in streaming checkpointing, MPI transport configuration, and Parquet data-loader test correctness. Also adds 28 new unit tests, a comprehensive object-storage usage guide, and dated benchmark results, and fixes 15 upstream test failures (the wrong package name `mlpstorage` instead of `mlpstorage_py`).

Tests: 949 passing, 4 skipped, 0 failures.
Changes
1. **Object storage: `.env` credential loading** (`mlpstorage_py/benchmarks/dlio.py`)

   `_apply_object_storage_params()` now logs the full path of the `.env` file it loads and raises a `FileNotFoundError` with an actionable message (listing the required env vars and pointing to `.env.example`) when `--object` mode is active but no `.env` is found. Previously it failed silently with cryptic downstream auth errors.

2. **Results directory: environment variable override** (`mlpstorage_py/config.py`, `mlpstorage_py/main.py`)

   `DEFAULT_RESULTS_DIR` now reads `MLPERF_RESULTS_DIR` from the environment, falling back to the system temp directory. A WARNING-level message is emitted at startup when results are going to the temp dir and `MLPERF_RESULTS_DIR` is not set. Also adds the missing `import os` to `main.py`.

3. **Checkpointing: IPC context fix** (`mlpstorage_py/checkpointing/streaming_checkpoint.py`)

   `Queue` and `Event` IPC objects are now created from the same multiprocessing context (`ctx`) as the child process. Previously they used the default context, causing `RuntimeError: SemLock can only be shared between processes through inheritance` on non-fork platforms (macOS, Windows, `PYTHONSTARTMETHOD=spawn`).

4. **MPI: disable VADER BTL** (`mlpstorage_py/utils.py`)

   Appends `--mca btl ^vader` to single-host MPI flags to prevent segfaults from Open MPI's VADER shared-memory BTL on certain kernel/Open MPI version combinations. Standard documented workaround; no performance impact for storage benchmarking.

5. **Dependencies** (`pyproject.toml`, `uv.lock`)

   - `python-dotenv >= 1.0.0` (required for `.env` loading in `dlio.py`)
   - `s3dlio >= 0.9.95` (byte-range GET + `stat()` for Parquet footer reads)
   - `dlio-benchmark` pinned to `russfellows/dlio_benchmark@feat/parquet-dgen-streaming`, which adds `ParquetReaderS3Iterable` with the footer-caching fix (pending merge into upstream dlio_benchmark)

6. **Security: `.gitignore` credential exclusion**

   Added `.env.*` / `!.env.example` to prevent accidental credential file commits.

7. **Test fixes: upstream `mlpstorage` → `mlpstorage_py` references**

   Fixed 15 test failures in `tests/unit/test_benchmarks_vectordb.py` and `tests/unit/test_cli.py` introduced by upstream commits that imported from `mlpstorage` (a non-existent package; the correct name is `mlpstorage_py`). Also mocked `_validate_vdb_dependencies` so VectorDB tests don't require the optional `pymilvus`/`tabulate` packages in the base environment.

8. **Updated Parquet reader tests** (`tests/unit/test_parquet_reader.py`)

   7 tests updated to match the refactored `ParquetReaderS3Iterable` API: the cache stores the compressed byte count (an `int`), not a pyarrow `Table`; `close()` is a no-op; there is no per-sample LRU eviction (the cache clears at `finalize()`).

9. **Documentation**

   - `docs/OBJECT_STORAGE_GUIDE.md` — new 292-line guide for S3/MinIO/object storage setup, `.env` configuration, and troubleshooting
   - `docs/MULTI_ENDPOINT_GUIDE.md` — expanded with `s3dlio` multi-endpoint load-balancing examples
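The `MLPERF_RESULTS_DIR` fallback described in item 2 can be sketched as follows; a simplified stand-in for the logic in `mlpstorage_py/config.py`, where the temp-dir subdirectory name is an assumption:

```python
import os
import tempfile

def default_results_dir() -> str:
    """Honor MLPERF_RESULTS_DIR when set; otherwise fall back to the system
    temp dir (main.py warns at startup when this fallback is taken)."""
    configured = os.environ.get("MLPERF_RESULTS_DIR")
    if configured:
        return configured
    # Subdirectory name chosen for illustration only.
    return os.path.join(tempfile.gettempdir(), "mlperf_storage_results")
```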