feat: deduplicate shared URL downloads across test suites by dabrain34 · Pull Request #338 · fluendo/fluster

dabrain34 · 2026-02-26T15:07:45Z

Introduce a centralized DownloadManager that ensures each URL is downloaded at most once, eliminating duplicate downloads both across test suites and within a single test suite.

- Add a DownloadManager class in utils.py with download-once caching and centralized archive cleanup.
- Refactor TestSuite.download() to use pre-downloaded archives from the manager across all three download paths.
- Use a thread pool to download concurrently, and make DownloadManager thread-safe so duplicate URLs are still fetched only once.

This considerably speeds up downloading the AV1-ARGON* suites, which previously re-downloaded the 6GB archive for every test vector.
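To illustrate the download-once idea described above, here is a minimal sketch with one lock per URL. It is not the PR's implementation: the class name `DownloadOnce`, the `fetch` callback, and the example URLs are all hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


class DownloadOnce:
    """Hypothetical stand-in for the PR's DownloadManager (not its real code)."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch  # performs the real transfer, returns a local path
        self._guard = threading.Lock()  # protects creation of per-URL locks
        self._url_locks: Dict[str, threading.Lock] = {}
        self._paths: Dict[str, str] = {}

    def get(self, url: str) -> str:
        # Take (or create) the lock dedicated to this URL.
        with self._guard:
            url_lock = self._url_locks.setdefault(url, threading.Lock())
        # Callers asking for the same URL queue here; other URLs proceed.
        with url_lock:
            if url not in self._paths:
                self._paths[url] = self._fetch(url)
            return self._paths[url]


calls: List[str] = []


def fake_fetch(url: str) -> str:
    # Assumption: records the call instead of touching the network.
    calls.append(url)
    return "/tmp/cache/" + url.rsplit("/", 1)[-1]


manager = DownloadOnce(fake_fetch)
urls = ["https://example.com/argon.zip"] * 4 + ["https://example.com/other.zip"]
with ThreadPoolExecutor(max_workers=4) as pool:
    paths = list(pool.map(manager.get, urls))
```

Even though five requests are issued from four threads, `fake_fetch` runs once per distinct URL, which is the behaviour the description asks for.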

Fix #309

dabrain34 · 2026-03-16T15:35:58Z

@ylatuya ping

dabrain34 · 2026-04-21T15:47:22Z

@rsanchez87 can you have a look at this PR as well? The idea is to speed up the build of Docker images containing all the test suites.

ylatuya · 2026-04-22T07:52:15Z

-                )
+        # When archive_path is provided, the archive was already downloaded
+        # by the DownloadManager — skip directly to extraction.
+        if ctx.archive_path and os.path.exists(ctx.archive_path):


Shouldn't all the download logic be in the DownloadManager? I would expect _download_single_test_vector and _download_single_archive to use the download manager instead of utils.download, and let the download manager handle all the checks so that they are not duplicated.

ylatuya · 2026-04-22T07:53:00Z

-                    f"Checksum mismatch for source file {os.path.basename(first_tv.source)}: {checksum} "
-                    f"instead of '{first_tv.source_checksum}'"
+            # Verify existing file: clean up corrupt, skip if valid
+            skip_download = False


All of this logic should be handled by the DownloadManager
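As a rough sketch of what centralizing that check could look like, the helper below decides in one place whether a file must be fetched again. The names `file_checksum` and `needs_download` are hypothetical here, and SHA-256 is only an illustrative choice, not necessarily the algorithm fluster uses.

```python
import hashlib
import os
import tempfile


def file_checksum(path: str) -> str:
    # Hash the file in 1 MiB chunks so large archives are not loaded at once.
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_download(path: str, expected_checksum: str) -> bool:
    """Return True when the file must be (re)fetched."""
    if not os.path.exists(path):
        return True
    if file_checksum(path) == expected_checksum:
        return False  # valid copy already on disk: skip the download
    os.remove(path)  # corrupt or stale copy: clean it up first
    return True


# Usage with a throwaway file standing in for a downloaded archive.
workdir = tempfile.mkdtemp()
archive = os.path.join(workdir, "vectors.zip")
with open(archive, "wb") as handle:
    handle.write(b"payload")
good = hashlib.sha256(b"payload").hexdigest()

valid_is_skipped = not needs_download(archive, good)
corrupt_is_refetched = needs_download(archive, "0" * 64)
corrupt_was_removed = not os.path.exists(archive)
```

With a single helper like this owned by the manager, the per-path code in the test suite would only ask for a file and never repeat the verify, clean up, or skip decisions.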

ylatuya · 2026-04-22T07:59:06Z

+                    return (url, local_path)
+
+                max_workers = max(1, min(jobs, len(unique_source_list)))
+                with ThreadPoolExecutor(max_workers=max_workers) as dl_pool:


We use Pool from multiprocessing
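The comment does not say which class is meant. One possible reading, offered only as an assumption, is to stay within the multiprocessing package: `multiprocessing.pool.ThreadPool` exposes the same interface as `Pool` but runs workers as threads, which suits I/O-bound downloads. The function names and URLs below are hypothetical.

```python
from multiprocessing.pool import ThreadPool
from typing import List, Tuple


def download_one(url: str) -> Tuple[str, str]:
    # Placeholder for the real transfer; returns (url, local_path).
    return (url, "/tmp/cache/" + url.rsplit("/", 1)[-1])


def download_all(urls: List[str], jobs: int) -> List[Tuple[str, str]]:
    unique = list(dict.fromkeys(urls))  # drop duplicates, keep first-seen order
    if not unique:
        return []
    workers = max(1, min(jobs, len(unique)))
    with ThreadPool(processes=workers) as pool:
        # map() blocks until every download has finished.
        return pool.map(download_one, unique)


results = download_all(
    [
        "https://example.com/a.zip",
        "https://example.com/b.zip",
        "https://example.com/a.zip",
    ],
    jobs=8,
)
```

The worker count follows the same `max(1, min(jobs, len(...)))` clamp as the diff hunk above, so a short URL list never spawns idle workers.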

ylatuya · 2026-04-22T07:59:39Z

@@ -328,7 +400,16 @@ def _callback_error(err: Any) -> None:

                downloads = []
                for tv in self.test_vectors.values():


This should go away if we are using the DownloadManager.

rsanchez87 · 2026-04-22T12:08:46Z

> @rsanchez87 can you have a look at this PR as well? The idea is to speed up the build of Docker images containing all the test suites.

@dabrain34, tested with python3 fluster.py download AV1-ARGON-PROFILE0-CORE-ANNEX-B AV1-ARGON-PROFILE1-CORE-ANNEX-B AV1-ARGON-PROFILE2-CORE-ANNEX-B
master: 49m 40s
PR: 16m 2s (~3x faster, ZIP downloaded once instead of 3 times) ✅

Regression tests also pass ✔️

I’ll test again once the requested changes by @ylatuya are implemented. Thanks!

dabrain34 · 2026-04-22T12:49:38Z

Thanks for testing. Indeed, the gain is even bigger on slow connections, since we no longer re-download the AV1 zip file every time.

I'm currently addressing the comments from ylatuya. When this is ready I will get back to you.

Introduce a centralized DownloadManager so each URL is downloaded at most once, both within and across selected suites. Saves re-fetching multi-GB archives like AV1-ARGON shared by 12 suites.

DownloadManager (fluster/utils.py):
- Thread-safe per-URL caching at resources/.cache/; concurrent get() calls on the same URL block on the in-flight download.
- BoundedSemaphore caps HTTP concurrency at 8.
- Per-URL retry budget; ChecksumMismatchError poisons immediately.
- invalidate(url) lets consumers drop a corrupt cached archive.
- Context manager: cleanup() runs via __exit__, honoring keep_file.
- filename_from_url() strips query strings for safe on-disk names.

TestSuite.download() (fluster/test_suite.py):
- Requires a DownloadManager (keyword-only). All three download paths consume pre-downloaded archives.
- Multi-TV branch pre-downloads unique URLs in parallel before the multiprocessing extraction pool.
- Raw source files are moved out of the cache (no double storage).

CLI (fluster/fluster.py):
- Three-phase: collect URLs across selected suites, parallel pre-download, per-suite extraction. Cross-suite parallelism is the main user-visible win.
- All callers (CLI + 7 scripts/gen_*.py) use the with-statement form.
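Two of the points in the commit message, query-string stripping and cleanup through the with-statement, can be sketched as follows. This is an illustration under assumptions, not the code from the PR: the real filename_from_url() may behave differently, and `CacheSession` with its `register` method is a hypothetical name.

```python
import os
import tempfile
from typing import List
from urllib.parse import urlsplit


def filename_from_url(url: str) -> str:
    """Last path component of the URL, without query string or fragment."""
    return os.path.basename(urlsplit(url).path)


class CacheSession:
    """Hypothetical context manager; only the idea mirrors the commit message."""

    def __init__(self, cache_dir: str, keep_file: bool = False) -> None:
        self.cache_dir = cache_dir
        self.keep_file = keep_file
        self._files: List[str] = []

    def register(self, url: str) -> str:
        path = os.path.join(self.cache_dir, filename_from_url(url))
        open(path, "wb").close()  # stands in for the real download
        self._files.append(path)
        return path

    def cleanup(self) -> None:
        if self.keep_file:
            return  # the caller asked to keep downloaded archives
        for path in self._files:
            if os.path.exists(path):
                os.remove(path)
        self._files.clear()

    def __enter__(self) -> "CacheSession":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        self.cleanup()  # runs on normal exit and when an exception escapes


cache_dir = tempfile.mkdtemp()
with CacheSession(cache_dir) as session:
    archive = session.register("https://example.com/dl/argon.zip?token=abc")
    existed_inside = os.path.exists(archive)
removed_after = not os.path.exists(archive)
```

Tying cleanup to `__exit__` means callers cannot forget it, which is presumably why the commit converts every caller to the with-statement form.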

ylatuya requested changes Apr 22, 2026

dabrain34 force-pushed the dab_duplication_download branch from dc31a68 to a3b3d5f · May 22, 2026 13:27

dabrain34 marked this pull request as draft May 22, 2026 13:46

dabrain34 wants to merge 1 commit into fluendo:master from dabrain34:dab_duplication_download

3 participants
