feat: deduplicate shared URL downloads across test suites #338
@ylatuya ping

@rsanchez87 can you have a look at this PR as well? The idea is to speed up the build of Docker images containing all the test suites.
)
# When archive_path is provided, the archive was already downloaded
# by the DownloadManager — skip directly to extraction.
if ctx.archive_path and os.path.exists(ctx.archive_path):
Shouldn't all the download logic be in the DownloadManager? I would expect _download_single_test_vector and _download_single_archive to use the DownloadManager instead of utils.download, and to let the DownloadManager handle all the checks so that they are not duplicated.
f"Checksum mismatch for source file {os.path.basename(first_tv.source)}: {checksum} "
f"instead of '{first_tv.source_checksum}'"
# Verify existing file: clean up corrupt, skip if valid
skip_download = False
All of this logic should be handled by the DownloadManager.
return (url, local_path)

max_workers = max(1, min(jobs, len(unique_source_list)))
with ThreadPoolExecutor(max_workers=max_workers) as dl_pool:
We use Pool from multiprocessing.
@@ -328,7 +400,16 @@ def _callback_error(err: Any) -> None:

downloads = []
for tv in self.test_vectors.values():
This should go away if we are using the DownloadManager.
@dabrain34, tested with Also regression tests ✔️ I’ll test again once the requested changes by @ylatuya are implemented. Thanks!

Thanks for testing. Indeed, this is even better on slow connections, since we no longer re-download the AV1 zip file every time. I'm currently addressing the comments from ylatuya. When this is ready I will come back to you.
Introduce a centralized DownloadManager so each URL is downloaded at most once, both within and across selected suites. Saves re-fetching multi-GB archives like AV1-ARGON shared by 12 suites.

DownloadManager (fluster/utils.py):
- Thread-safe per-URL caching at resources/.cache/; concurrent get() calls on the same URL block on the in-flight download.
- BoundedSemaphore caps HTTP concurrency at 8.
- Per-URL retry budget; ChecksumMismatchError poisons immediately.
- invalidate(url) lets consumers drop a corrupt cached archive.
- Context manager: cleanup() runs via __exit__, honoring keep_file.
- filename_from_url() strips query strings for safe on-disk names.

TestSuite.download() (fluster/test_suite.py):
- Requires a DownloadManager (keyword-only). All three download paths consume pre-downloaded archives.
- Multi-TV branch pre-downloads unique URLs in parallel before the multiprocessing extraction pool.
- Raw source files are moved out of the cache (no double storage).

CLI (fluster/fluster.py):
- Three-phase: collect URLs across selected suites, parallel pre-download, per-suite extraction. Cross-suite parallelism is the main user-visible win.
- All callers (CLI + 7 scripts/gen_*.py) use the with-statement form.
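To illustrate the "concurrent get() calls on the same URL block on the in-flight download" behaviour from the commit message, here is a minimal sketch. It is not the code from this PR: the class name `DedupDownloader`, the injected `fetch` callable (a stand-in for the real HTTP download) and the in-memory bookkeeping are all assumptions made for illustration. Retries, checksums and the on-disk cache are left out.

```python
import threading
from concurrent.futures import Future
from typing import Callable, Dict


class DedupDownloader:
    """Hypothetical sketch: each URL is fetched at most once, even under concurrency."""

    def __init__(self, fetch: Callable[[str], str], max_concurrency: int = 8) -> None:
        self._fetch = fetch  # assumed stand-in for the real download function
        self._lock = threading.Lock()
        self._futures: Dict[str, "Future[str]"] = {}
        # Caps how many fetches run at the same time (8 mirrors the commit message).
        self._slots = threading.BoundedSemaphore(max_concurrency)

    def get(self, url: str) -> str:
        """Return the local path for url, downloading it only on the first call."""
        with self._lock:
            future = self._futures.get(url)
            owner = future is None
            if owner:
                # First caller for this URL: register a future others can wait on.
                future = Future()
                self._futures[url] = future
        if owner:
            try:
                with self._slots:
                    path = self._fetch(url)
            except BaseException as exc:
                # Wake up waiters with the same error instead of leaving them blocked.
                future.set_exception(exc)
                raise
            future.set_result(path)
        # Non-owners block here until the in-flight download finishes.
        return future.result()

    def invalidate(self, url: str) -> None:
        """Forget a URL so that the next get() fetches it again."""
        with self._lock:
            self._futures.pop(url, None)
```

With this shape, twelve threads asking for the same archive trigger a single call to `fetch`; the other eleven wait on the shared future and receive the same path.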
dc31a68 to a3b3d5f
Introduce a centralized DownloadManager that ensures each URL is downloaded at most once, eliminating duplicate downloads both across test suites and within a single test suite.
This considerably speeds up the download of AV1-ARGON*, which previously re-downloaded the 6GB archive for every test vector.
Fix #309
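The cross-suite deduplication described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: `cache_filename`, `collect_unique_urls` and `predownload` are hypothetical helpers, suites are reduced to a plain mapping of suite name to source URLs, and `fetch` stands in for whatever performs the actual download. Only the `max(1, min(jobs, ...))` worker bound is taken from the diff under review.

```python
import posixpath
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List
from urllib.parse import urlparse


def cache_filename(url: str) -> str:
    """Derive an on-disk name from a URL, ignoring the query string and fragment."""
    return posixpath.basename(urlparse(url).path)


def collect_unique_urls(suites: Dict[str, Iterable[str]]) -> List[str]:
    """Phase 1: gather every source URL once, preserving first-seen order."""
    return list(dict.fromkeys(url for urls in suites.values() for url in urls))


def predownload(urls: List[str], fetch: Callable[[str], str], jobs: int) -> Dict[str, str]:
    """Phase 2: fetch each unique URL in parallel and return url -> local path."""
    if not urls:
        return {}
    # Same bound as in the reviewed diff: at least one worker, never more than needed.
    max_workers = max(1, min(jobs, len(urls)))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(urls, pool.map(fetch, urls)))
```

Phase 3, per-suite extraction, would then look up each test vector's source in the returned mapping instead of downloading it again.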