
Parallel access to b-tree and data via cat_ranges and threading#218

Draft
bnlawrence wants to merge 20 commits into main from pbtree2

@bnlawrence
Collaborator

@bnlawrence bnlawrence commented Apr 6, 2026

Description

It is clear that pyfive itself could benefit from internal parallelism. This idea was outlined in #154. Some detailed thinking and architecture design resulted in #216. This is the outcome of that work, and provides both parallel chunk reading and parallel reading of b-tree information. These are both turned on by default. The API to turn them off is somewhat obscure, and might be something to address in the discussion around this pull request.

This would close #209 and #216 (#154 has already been closed in anticipation).

Considerations:

  • The use of a mixin class for reading chunks. While concerns have been expressed, I think that in the end this is the right pattern, for now at least.

  • This retains a nearly complete separation of concerns between pyfive and the environment (POSIX, FSSPEC etc), but it is not perfect. Future work will need to address that, but the benefits are so remarkable that it is worth doing this now and foreshadowing the necessary work (an issue will be forthcoming in the next few days, and will link back here).

  • This replaces the previous pull request (First cut at adding some parallelism in pyfive #209).

  • Parallel decompression of chunks is postponed for future work.
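To make the threaded range-reading idea concrete for reviewers, here is a minimal sketch of reading several byte ranges of a local file concurrently. It is not the code in this pull request: `read_ranges_threaded` and `make_demo_file` are hypothetical names invented for illustration, and the chunk layout is made up. On an fsspec filesystem, `cat_ranges` plays the analogous role of fetching many ranges in one call.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def read_ranges_threaded(path, ranges, max_workers=4):
    """Read (offset, length) byte ranges from a file concurrently.

    Hypothetical helper, not part of pyfive. Returns the byte strings
    in the same order as `ranges`.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        # os.pread takes an explicit offset and does not move the shared
        # file position, so threads can safely share one descriptor.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            return list(pool.map(lambda r: os.pread(fd, r[1], r[0]), ranges))
    finally:
        os.close(fd)


def make_demo_file(n_chunks=8, chunk_size=16):
    """Write n_chunks fixed-size chunks, each filled with its own index byte."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        for i in range(n_chunks):
            f.write(bytes([i]) * chunk_size)
    ranges = [(i * chunk_size, chunk_size) for i in range(n_chunks)]
    return f.name, ranges


if __name__ == "__main__":
    path, ranges = make_demo_file()
    try:
        chunks = read_ranges_threaded(path, ranges)
        # Each chunk should start with the byte equal to its index.
        print([c[0] for c in chunks])
    finally:
        os.remove(path)
```

Threads suit this workload because the reads are I/O bound and release the GIL while waiting. Decompression is CPU bound, which is one reason it can reasonably be treated separately, as the last bullet does.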

Checklist

  • This pull request has a descriptive title and labels
  • This pull request has a minimal description (most was discussed in the issue, but a two-liner description is still desirable)
  • Unit tests have been added (if codecov test fails)
  • Any changed dependencies have been added or removed correctly (if need be)
  • If you are working on the documentation, please ensure the current build passes
  • All tests pass

@bnlawrence bnlawrence changed the title from "Pbtree2" to "Parallel access to b-tree and data via cat_ranges and threading" Apr 6, 2026
@bnlawrence
Collaborator Author

bnlawrence commented Apr 6, 2026

[Image: 260406_remote_testing_results_summary]

These results show the benefit of parallelism for data reading, though they suggest that the parallel b-tree read should not be the default. Further investigation is necessary. Note that the POSIX results are not believable, as they reflect memory caching by the OS, as discussed here. Note also that the ssh results use `p5rem`, not `fsspec`. It is not clear to what extent server-side caching (for http and s3) is involved.

@valeriupredoi
Collaborator

@bnlawrence I fixed your ruff issues so that you have a clean CI and can focus on the functional failures, if any. You can always fix ruff issues on a first pass by running `pre-commit run -a` 🍺

@codecov

codecov bot commented Apr 9, 2026

Codecov Report

❌ Patch coverage is 93.12169% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.39%. Comparing base (3a93a0d) to head (530c66a).
⚠️ Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pyfive/h5d.py | 91.00% | 3 Missing and 6 partials ⚠️ |
| pyfive/btree.py | 94.93% | 2 Missing and 2 partials ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #218      +/-   ##
==========================================
+ Coverage   77.62%   78.39%   +0.77%     
==========================================
  Files          15       15              
  Lines        3128     3300     +172     
  Branches      499      526      +27     
==========================================
+ Hits         2428     2587     +159     
- Misses        573      578       +5     
- Partials      127      135       +8     

