Bug fixes and performance enhancements: object storage, checkpointing, Parquet loading #359
Conversation
- Fix all from/import statements: mlpstorage.X -> mlpstorage_py.X (33 py files)
- Fix all mock.patch() string paths: mlpstorage.X -> mlpstorage_py.X (~16 files)
- Replace 4 library-specific YAML configs with 1 workload-only s3_workload_unet3d.yaml (runtime params such as bucket, endpoint, storage_library belong in .env, not YAML)
- Add .env.example documenting all runtime parameters
- Update 22 shell scripts: pip/venv setup -> uv sync pattern
- Update tests/README.md: pip/venv -> uv, mlpstorage -> mlpstorage_py imports
- Update tests/object-store/README.md:
  - Replace 'cd mlp-storage && source .venv/bin/activate' with 'uv run python ...'
  - Update Library Selection section: YAML key -> runtime --param approach
  - Remove s3torchconnector from the library selection table (keep historical results)
  - Update prerequisites: source .venv + source .env -> uv sync

Unit tests: 763 pass (previously 0 due to ModuleNotFoundError: mlpstorage)
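Since the commit above moves every runtime parameter into `.env`, a minimal loader sketch may help illustrate the convention. This is a stdlib-only stand-in for python-dotenv (which the project actually uses), parsing the usual KEY=VALUE dotenv shape; the variable names in the test are hypothetical, not taken from `.env.example`.

```python
def load_env_file(path: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping comments and blanks.

    A stdlib stand-in for python-dotenv, for illustration only.
    """
    params = {}
    with open(path) as fh:
        for raw in fh:
            line = raw.strip()
            # Ignore blank lines, comments, and lines without '='
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            params[key.strip()] = value.strip()
    return params
```

Note that python-dotenv's `load_dotenv()` additionally exports the values into `os.environ`, which is what the benchmark scripts rely on.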
- Extract --file/--object from add_universal_arguments into a new add_storage_type_arguments() function; VectorDB/KVCache parsers no longer require it; training/checkpointing parsers call it
- Update training/checkpointing tests to pass --file in parse_args
- Wrap _collect_cluster_start/_collect_cluster_end with progress_context to show a spinner during SSH/MPI collection
- Pass validate_structure=False to ReportGenerator in test fixtures that use empty temporary directories
- Change logger.error -> logger.warning for a nonexistent results dir in get_runs_files; skip dirs with multiple metadata files
- Add a _uri_for_filename alias to ParquetReaderS3Iterable
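As an illustration of what such a filename-to-URI alias typically does, here is a hedged sketch; the function name, default storage root, and exact rules are assumptions for illustration, not the real `ParquetReaderS3Iterable` code:

```python
def uri_for_filename(filename: str, storage_root: str = "s3://bucket/prefix") -> str:
    """Map a dataset filename to an object-store URI.

    Absolute URIs pass through unchanged; relative names are joined onto the
    storage root with any leading '/' stripped so no double slash appears.
    """
    if "://" in filename:          # already a full URI
        return filename
    return f"{storage_root}/{filename.lstrip('/')}"
```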
- Make --file/--object optional (required=False) so ALL benchmark parsers can carry the flag; VectorDB and KV-cache parsers now include it so the argument is available everywhere
- Fix progress.py: replace logger.status() (a non-existent Logger method) with logger.info() in both the progress_context and create_stage_progress non-interactive fallback paths
- Update tests to assert logger.info() instead of logger.status()

dlio_benchmark changes (local fork + installed venv):
- Replace the broken \r-in-logger progress() with a Rich-based implementation using SpinnerColumn + BarColumn; falls back to plain stdout writes if Rich is unavailable
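The fallback pattern described above (Rich spinner when interactive, plain logging otherwise) can be sketched as follows. This is a simplified stand-in, not the repo's progress.py; it only shows the shape of the fix, including the `logger.info()` call that replaced the non-existent `logger.status()`.

```python
import logging
import sys
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("progress_demo")

try:
    from rich.progress import BarColumn, Progress, SpinnerColumn
    HAVE_RICH = True
except ImportError:
    HAVE_RICH = False

@contextmanager
def progress_context(message: str):
    """Show a spinner/bar when Rich is usable and stderr is a TTY; otherwise
    fall back to plain logger.info() calls (Logger has no .status() method,
    which was the original bug)."""
    if HAVE_RICH and sys.stderr.isatty():
        with Progress(SpinnerColumn(), BarColumn()) as progress:
            progress.add_task(message, total=None)  # indeterminate task
            yield
    else:
        logger.info("%s ...", message)
        yield
        logger.info("%s done", message)
```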
…rams

Reduce tests/object-store/ from 30+ files to 4 clean tests:
- run_training.sh — datagen + training via the mlpstorage CLI
- run_checkpointing.sh — checkpoint write + read via dlio_benchmark
- test_s3lib_get_bench.py — GET throughput benchmark (updated)
- test_direct_write_comparison.py — native write/read benchmark (updated)

All runtime parameters (bucket, endpoint, storage library, credentials) now come exclusively from environment variables or .env — no hardcoded site-specific values remain in any test script or config file.

Changes:
- Archive 26 per-library scripts and result docs to old-archive/
- Archive 3 per-library checkpoint YAMLs to old-archive/
- Add configs/dlio/workload/llama3_8b_checkpoint.yaml: a clean model-only YAML with all storage runtime params supplied via Hydra CLI overrides
- run_training.sh: BUCKET, STORAGE_LIBRARY, MODEL, NP all overridable
- run_checkpointing.sh: BUCKET, STORAGE_LIBRARY, NP, CHECKPOINTS all overridable
- test_s3lib_get_bench.py: use the BUCKET env var (was hardcoded mlp-s3dlio); fail fast with a clear error if the bucket is not set
- test_direct_write_comparison.py: use the BUCKET env var as the shared default; add a validation error if required buckets are not set
- Rewrite README.md: concise, accurate, uv-based instructions for all 4 tests

Unit tests: 905 passed, 4 skipped (no regressions)
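The fail-fast bucket check described for test_s3lib_get_bench.py can be sketched like this; a minimal stand-in where the helper name and error wording are assumptions, not the actual test code:

```python
import os

def require_bucket(var: str = "BUCKET") -> str:
    """Fail fast when the bucket env var is unset, instead of failing later
    with an opaque S3 error from deep inside the benchmark."""
    bucket = os.environ.get(var, "").strip()
    if not bucket:
        raise RuntimeError(
            f"{var} is not set: export it or put it in .env (see .env.example)")
    return bucket
```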
…ore tests

- pyproject.toml: point dlio-benchmark at russfellows/dlio_benchmark@dev, which contains the minio connection-pool fix and the s3torchconnector bool fix
- uv.lock: regenerated after the pyproject.toml change (resolved b1696e1)
- configs/dlio/workload: remove 17 library-specific YAML files (minio, s3dlio, s3torch variants) — all storage params are now supplied via --params CLI overrides from .env; generic YAMLs remain
- configs/dlio/workload/*.yaml (4 files): remove the spurious 'region' field
- tests/object-store/README.md: complete rewrite with accurate instructions
- tests/object-store/run_training.sh: add s3torchconnector support, spawn multiprocessing, disable checkpointing in training tests
- tests/object-store/run_checkpointing.sh: set NP=4, add s3torchconnector
- tests/object-store/run_datagen.sh: new helper script
- tests/object-store/run_cleanup.sh: new helper script
- tests/object-store/old-archive/: archive stale test utility files
…d parquet loading

Object storage (dlio.py):
- _apply_object_storage_params() now logs the .env file path it loads
- Raises FileNotFoundError with an actionable message if --object mode finds no .env

Config (config.py):
- DEFAULT_RESULTS_DIR reads the MLPERF_RESULTS_DIR env var, falls back to tempdir

Main (main.py):
- Add import os (was missing after the tempdir warning addition)
- Warn at startup when results will be written to the system temp dir

Checkpointing (streaming_checkpoint.py):
- IPC Queue/Event created from the same multiprocessing context as the child process
- Fixes the SemLock fork/spawn mismatch on non-fork start methods

MPI (utils.py):
- Add --mca btl ^vader to single-host MPI flags to prevent VADER segfaults

Dependencies (pyproject.toml, uv.lock):
- s3dlio >= 0.9.95
- python-dotenv >= 1.0.0
- dlio-benchmark pinned to russfellows/dlio_benchmark feat/parquet-dgen-streaming

Security (.gitignore):
- Block .env.* credential files; keep .env.example

Unit tests (933 passing, 4 skipped):
- tests/unit/test_config.py: 4 tests for DEFAULT_RESULTS_DIR env-var / tempdir behavior
- tests/unit/test_main_warnings.py: 4 tests for the tempdir warning in run_benchmark()
- tests/unit/test_dlio_object_storage.py: 20 tests for _apply_object_storage_params()
- tests/unit/test_parquet_reader.py: updated 7 tests for the new dlio-benchmark API (cache stores an int byte count, not a Table; no LRU eviction; close() is a no-op)

Docs:
- docs/OBJECT_STORAGE_GUIDE.md moved from .github/ to docs/
- README.md, docs/README.md, tests/README.md: cross-reference links updated

Benchmark results and analysis (new in tests/):
- tests/benchmarks/: bench_*.py scripts (concurrency, phases, put_bytes, rt_switch, write_sizes, zerocopy)
- tests/object-store/: NPZ analysis, RetinaNet bench results, s3ultra results, scaling analysis, multi-endpoint test
- tests/Checkpoint_test_results.md, DLRM_test_results.md, Flux_test_results.md
- tests/RetinaNet_test_results.md, Parquet_dataloading.md, TEST-PLAN-2026-04-25.md
- tests/DLIO-optimization-analysis-2026-04-25.md
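The streaming_checkpoint.py fix above (creating the IPC primitives from the same multiprocessing context as the child process) can be sketched as follows. This is a simplified stand-in, not the repo's actual code:

```python
import multiprocessing as mp

def make_ipc(start_method: str = "spawn"):
    """Create Queue/Event from ONE explicit context.

    Before the fix, default-context mp.Queue()/mp.Event() were paired with a
    ctx.Process; on non-fork start methods that raises
    'SemLock can only be shared between processes through inheritance'.
    """
    ctx = mp.get_context(start_method)
    queue = ctx.Queue()  # was mp.Queue() before the fix
    done = ctx.Event()   # was mp.Event() before the fix
    # The child must come from the same ctx, e.g.:
    #   proc = ctx.Process(target=worker, args=(queue, done)); proc.start()
    return ctx, queue, done
```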
…iles

tests/unit/test_benchmarks_vectordb.py:
- Fix all patch() paths and inline imports (mlpstorage.* → mlpstorage_py.*)
- Add a _validate_vdb_dependencies mock to all 14 tests that instantiate VectorDBBenchmark; that method runs in __init__ before verify_benchmark and raises DependencyError when the optional packages (pymilvus, tabulate) are not installed in the base uv env

tests/unit/test_cli.py:
- Fix three import blocks (mlpstorage.cli, mlpstorage.cli_parser, mlpstorage.config → mlpstorage_py.*)
- Fix bare Namespace → argparse.Namespace in test_num_client_hosts_zero_is_preserved

All 15 previously-failing upstream tests now pass. Full suite: 949 passed, 4 skipped.
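The mocking pattern works because the dependency check runs inside __init__: patching it at the class level lets tests construct the object without the optional packages installed. A self-contained sketch, where the demo class below is hypothetical and merely stands in for VectorDBBenchmark:

```python
from unittest import mock

class VectorDBBenchmarkDemo:
    """Hypothetical stand-in: the dependency check runs in __init__,
    so plain instantiation fails when optional packages are missing."""
    def __init__(self):
        self._validate_vdb_dependencies()
        self.ready = True

    def _validate_vdb_dependencies(self):
        # The real method raises DependencyError when pymilvus/tabulate
        # are absent; ImportError is used here for the sketch.
        raise ImportError("pymilvus is not installed")

# Patching at the class level replaces the check for every instantiation
# inside the with-block:
with mock.patch.object(VectorDBBenchmarkDemo, "_validate_vdb_dependencies"):
    bench = VectorDBBenchmarkDemo()  # no longer raises
```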
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
I thought I had converted all the "mlpstorage" to "mlpstorage_py" everywhere, so I'm surprised that you had to do those changes (again?). Did I just miss these cases earlier?
Reproducing the test locally on a clean uv sync from this PR head, I get 922 passed, 27 failed, 4 skipped — not the 949/4/0 the PR description claims. The 27 failures are in two files the PR introduces or rewrites: test_dlio_object_storage.py (17/20 of the new tests) and test_parquet_reader.py (10/7 of the rewritten tests). The remaining 922 tests pass cleanly.
Yes, I am pretty certain this cannot, and SHOULD not, work until the dlio_benchmark PR has been merged.
These require changes in how the readers work.
Here is that PR: mlcommons/DLIO_local_changes#16
Regards,
—Russ
On Apr 28, 2026, at 4:32 PM, Devasena I (@idevasena) wrote:
=========================== short test summary info ============================
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsEnvLoading::test_logs_path_when_env_file_found_in_cwd
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsEnvLoading::test_raises_file_not_found_when_no_env_file_anywhere
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsEnvLoading::test_error_message_includes_required_vars
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsEnvLoading::test_logs_when_dotenv_upward_search_succeeds
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsBucketValidation::test_raises_value_error_when_bucket_missing
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsInjection::test_injects_storage_type_s3
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsInjection::test_injects_storage_root_as_bucket
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsInjection::test_injects_default_library_s3dlio
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsInjection::test_injects_custom_library
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsInjection::test_injects_default_uri_scheme_s3
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsInjection::test_sets_force_path_style_when_endpoint_url_present
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsInjection::test_no_force_path_style_without_endpoint_url
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsInjection::test_no_force_path_style_for_direct_scheme
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsInjection::test_no_force_path_style_for_file_scheme
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsInjection::test_user_supplied_storage_root_not_overridden
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsInjection::test_user_supplied_force_path_style_not_overridden
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsInjection::test_logs_injected_params_summary
FAILED tests/unit/test_parquet_reader.py::TestParquetReaderS3Iterable::test_get_sample_caches_row_group
FAILED tests/unit/test_parquet_reader.py::TestParquetReaderS3Iterable::test_cache_grows_as_rgs_are_accessed
FAILED tests/unit/test_parquet_reader.py::TestParquetReaderS3Iterable::test_all_rg_entries_persist_within_epoch
FAILED tests/unit/test_parquet_reader.py::TestParquetReaderS3Iterable::test_close_does_not_evict_rg_cache
FAILED tests/unit/test_parquet_reader.py::TestParquetReaderS3Iterable::test_close_preserves_all_files_rg_cache
FAILED tests/unit/test_parquet_reader.py::TestParquetReaderS3Iterable::test_uri_for_absolute_passthrough
FAILED tests/unit/test_parquet_reader.py::TestParquetReaderS3Iterable::test_uri_for_relative_filename
FAILED tests/unit/test_parquet_reader.py::TestParquetReaderS3Iterable::test_uri_strips_leading_slash
FAILED tests/unit/test_parquet_reader.py::TestParquetReaderS3Iterable::test_column_filtering_records_byte_count
FAILED tests/unit/test_parquet_reader.py::TestParquetReaderS3Iterable::test_no_column_filter_reads_all
================== 27 failed, 922 passed, 4 skipped in 9.75s ===================
~/Storage_Repo_Tests/mlc-storage-pr359$ uv run pytest tests/unit -q --co | wc -l
1160
~/Storage_Repo_Tests/mlc-storage-pr359$ grep -E "passed|failed|skipped" /tmp/pr359_unit.log | tail -5
tests/unit/test_benchmarks_base.py::TestTimeSeriesCollectionIntegration::test_timeseries_skipped_for_datagen_command (fixtures used: mock_logger, request, tmp_path, tmp_path_factory).
tests/unit/test_benchmarks_base.py::TestTimeSeriesCollectionIntegration::test_timeseries_skipped_for_configview_command (fixtures used: mock_logger, request, tmp_path, tmp_path_factory).
tests/unit/test_benchmarks_base.py::TestBenchmarkProgress::test_cluster_collection_skipped_logs_debug (fixtures used: mock_logger, request, tmp_path, tmp_path_factory).
tests/unit/test_validation_helpers.py::TestValidateBenchmarkEnvironment::test_success_logs_passed_message.
================== 27 failed, 922 passed, 4 skipped in 9.75s ===================
Got it. Reviewing your DLIO PR mlcommons/DLIO_local_changes#16 and will use a local version of it for this PR's testing.
Devasena,
I am certain you are correct. So perhaps my branch tag is wrong, or the tests are now wrong? I am not certain.
Obviously, more testing would be good. Unfortunately, I don’t have any time left this week, as I spent my weekend working on these changes.
If someone else can examine the tests given the updated dlio_benchmark parquet readers, that would be great.
Regards,
—Russ
On Apr 28, 2026, at 7:04 PM, Devasena I (@idevasena) wrote:
@idevasena commented on this pull request.
In pyproject.toml <#359 (comment)>:
```diff
@@ -82,7 +85,7 @@ url = "https://download.pytorch.org/whl/cpu"
 explicit = true
 [tool.uv.sources]
-dlio-benchmark = { git = "https://github.com/mlcommons/DLIO_local_changes.git" }
+dlio-benchmark = { git = "https://github.com/russfellows/dlio_benchmark.git", branch = "feat/parquet-dgen-streaming" }
```
I thought this instance of the DLIO benchmark linking to your fork in pyproject.toml would suffice.
```diff
 [tool.uv.sources]
-dlio-benchmark = { git = "https://github.com/mlcommons/DLIO_local_changes.git" }
+dlio-benchmark = { git = "https://github.com/russfellows/dlio_benchmark.git", branch = "feat/parquet-dgen-streaming" }
```
Now that mlcommons/DLIO_local_changes#16 is merged. The pyproject.toml needs to be reverted back to https://github.com/mlcommons/DLIO_local_changes.git.
idevasena left a comment:
- Unit-test coverage step-up (949 passing) is genuinely impressive and made it easy to verify cross-PR compatibility with DLIO PR #16.
- The _apply_object_storage_params() flow that consumes .env cleanly bridges DLIO PR #16's env-var contract with the user-facing CLI; well-designed.
- The tests/Checkpoint_test_results.md / Flux_test_results.md etc. baseline-results docs are excellent provenance — every baseline number is dated and accompanied by the exact command that produced it.
- The package rename to mlpstorage_py was thorough — none of the 7 updated parquet reader tests, 8 issue-9 tests, etc. failed.
idevasena left a comment:
Test pass summary
Configuration tested:
- Host: 1 TiB RAM, 56 TB NVMe
- DLIO: mlcommons/DLIO_local_changes:main (post-PR-16-merge)
- mlpstorage: branch-3-0-1/bug-fixes-perf-enhancements (PR #359 head)
- Python 3.12, uv-managed venv
Suites that passed:
- DLIO tests/run_fast_ci.sh: 112 passed, 1 skipped
- DLIO tests/test_issue_regressions.py: all passed
- mlpstorage tests/unit/: 949 passed, 4 skipped
End-to-end runs validated:
- UNet3D datagen + run, local-FS, 4 ranks, 8 files (smoke)
- Flux datagen, local-FS, 8 ranks, 16 files (parquet generator + dgen-py path)
- Multi-endpoint S3 rank-routing logic (unit-level reproduction of obj_store_lib.py:202-231)
End-to-end run for Flux:
POSIX test: Flux full b200 run at 8 accelerators × 8610 files (would take ~30+ minutes; integration validated at smaller scale instead)
I/O Throughput: 611 MiB/s
AU Accelerator Utilization: 99.5%
Just one change needed in pyproject.toml (see the review comment above). Since the new commit will invalidate the previous approvals, I will approve after the commit.
We can take the diagnosis and fix for the issue below in a separate PR, but I wanted to highlight the reported issue #362: training hangs on the new v3 training models — flux, dlrm, and retinanet — all of which freeze at `Starting epoch 1: 1200 steps expected` and never progress. I hit this intermittently during datagen on the PR #359 branch as well, where parquet file generation hangs when using num procs > 1.
Summary
Five independent bug fixes and enhancements across object storage credential loading, results-directory configuration, IPC context handling in streaming checkpointing, MPI transport configuration, and Parquet data-loader test correctness. Also adds 28 new unit tests, a comprehensive object-storage usage guide, and dated benchmark results, and fixes 15 upstream test failures (the wrong package name `mlpstorage` instead of `mlpstorage_py`).

Tests: 949 passing, 4 skipped, 0 failures.
Changes
1. **Object storage: `.env` credential loading** (`mlpstorage_py/benchmarks/dlio.py`)

   `_apply_object_storage_params()` now logs the full path of the `.env` file it loads and raises a `FileNotFoundError` with an actionable message (listing the required env vars and pointing to `.env.example`) when `--object` mode is active but no `.env` is found. Previously it failed silently with cryptic downstream auth errors.

2. **Results directory: environment variable override** (`mlpstorage_py/config.py`, `mlpstorage_py/main.py`)

   `DEFAULT_RESULTS_DIR` now reads `MLPERF_RESULTS_DIR` from the environment, falling back to the system temp directory. A WARNING-level message is emitted at startup when results are going to the temp dir and `MLPERF_RESULTS_DIR` is not set. Also adds the missing `import os` to `main.py`.

3. **Checkpointing: IPC context fix** (`mlpstorage_py/checkpointing/streaming_checkpoint.py`)

   `Queue` and `Event` IPC objects are now created from the same multiprocessing context (`ctx`) as the child process. Previously they used the default context, causing `RuntimeError: SemLock can only be shared between processes through inheritance` on non-fork platforms (macOS, Windows, `PYTHONSTARTMETHOD=spawn`).

4. **MPI: disable VADER BTL** (`mlpstorage_py/utils.py`)

   Appends `--mca btl ^vader` to single-host MPI flags to prevent segfaults from Open MPI's VADER shared-memory BTL on certain kernel/Open MPI version combinations. Standard documented workaround; no performance impact for storage benchmarking.

5. **Dependencies** (`pyproject.toml`, `uv.lock`)

   - `python-dotenv >= 1.0.0` (required for `.env` loading in `dlio.py`)
   - `s3dlio >= 0.9.95` (byte-range GET + `stat()` for Parquet footer reads)
   - `dlio-benchmark` pinned to `russfellows/dlio_benchmark@feat/parquet-dgen-streaming`, which adds `ParquetReaderS3Iterable` with the footer-caching fix (pending merge into upstream dlio_benchmark)

6. **Security: `.gitignore` credential exclusion**

   Added `.env.*` / `!.env.example` to prevent accidental credential file commits.

7. **Test fixes: upstream `mlpstorage` → `mlpstorage_py` references**

   Fixed 15 test failures in `tests/unit/test_benchmarks_vectordb.py` and `tests/unit/test_cli.py` introduced by upstream commits that imported from `mlpstorage` (a non-existent package; the correct name is `mlpstorage_py`). Also mocked `_validate_vdb_dependencies` so VectorDB tests don't require the optional `pymilvus`/`tabulate` packages in the base environment.

8. **Updated Parquet reader tests** (`tests/unit/test_parquet_reader.py`)

   7 tests updated to match the refactored `ParquetReaderS3Iterable` API: the cache stores the compressed byte count (an `int`), not a pyarrow `Table`; `close()` is a no-op; there is no per-sample LRU eviction (the cache clears at `finalize()`).

9. **Documentation**

   - `docs/OBJECT_STORAGE_GUIDE.md` — new 292-line guide for S3/MinIO/object storage setup, `.env` configuration, and troubleshooting
   - `docs/MULTI_ENDPOINT_GUIDE.md` — expanded with `s3dlio` multi-endpoint load-balancing examples
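The `MLPERF_RESULTS_DIR` fallback described in item 2 can be sketched as follows; a simplified stand-in for the logic in `mlpstorage_py/config.py`, where the temp-dir subdirectory name is an assumption:

```python
import os
import tempfile

def default_results_dir() -> str:
    """Honor MLPERF_RESULTS_DIR when set; otherwise fall back to the system
    temp dir (main.py warns at startup when this fallback is taken)."""
    configured = os.environ.get("MLPERF_RESULTS_DIR")
    if configured:
        return configured
    # Subdirectory name chosen for illustration only.
    return os.path.join(tempfile.gettempdir(), "mlperf_storage_results")
```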