
feat: deduplicate shared URL downloads across test suites #338

Draft
dabrain34 wants to merge 1 commit into fluendo:master from dabrain34:dab_duplication_download

@dabrain34
Contributor

@dabrain34 dabrain34 commented Feb 26, 2026

Introduce a centralized DownloadManager that ensures each URL is downloaded at most once, eliminating duplicate downloads both across test suites and within a single test suite.

  • Add DownloadManager class in utils.py with download-once caching and centralized archive cleanup
  • Refactor TestSuite.download() to use pre-downloaded archives from the manager across all three download paths
  • Use a thread pool to download concurrently and make DownloadManager thread-safe so duplicate URLs are still fetched only once.
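
The download-once behaviour described in the list above can be sketched as follows. This is only an illustration, not the code in this PR: the class name `DownloadOnce`, the injected `fetch` callable, and the example.com URLs are hypothetical, and the real DownloadManager in utils.py may be structured differently.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


class DownloadOnce:
    """Hypothetical stand-in for DownloadManager: each URL is fetched at most once."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch  # downloads a URL and returns the local path
        self._lock = threading.Lock()  # guards creation of per-URL locks
        self._url_locks: Dict[str, threading.Lock] = {}
        self._paths: Dict[str, str] = {}

    def get(self, url: str) -> str:
        # Take (or create) the lock dedicated to this URL.
        with self._lock:
            url_lock = self._url_locks.setdefault(url, threading.Lock())
        # Only one thread per URL runs the fetch; the others wait here
        # and then reuse the cached path.
        with url_lock:
            if url not in self._paths:
                self._paths[url] = self._fetch(url)
            return self._paths[url]


calls: List[str] = []


def fake_fetch(url: str) -> str:
    # Stand-in for a real HTTP download; records every call it receives.
    calls.append(url)
    return "/tmp/" + url.rsplit("/", 1)[-1]


manager = DownloadOnce(fake_fetch)
urls = ["https://example.com/a.zip"] * 4 + ["https://example.com/b.zip"]
with ThreadPoolExecutor(max_workers=4) as pool:
    paths = list(pool.map(manager.get, urls))
```

Using one lock per URL means that requests for different URLs never wait on each other, while concurrent requests for the same URL collapse into a single fetch.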

This considerably speeds up the download of the AV1-ARGON* suites, which previously re-downloaded the 6GB archive for every test vector.

Fix #309

@dabrain34
Contributor Author

@ylatuya ping

@dabrain34
Contributor Author

dabrain34 commented Apr 21, 2026

@rsanchez87 could you have a look at this PR as well? The idea is to speed up the build of Docker images containing all the test suites.

Comment thread fluster/test_suite.py Outdated
)
# When archive_path is provided, the archive was already downloaded
# by the DownloadManager — skip directly to extraction.
if ctx.archive_path and os.path.exists(ctx.archive_path):
Contributor

Shouldn't all the download logic be in the DownloadManager? I would expect _download_single_test_vector and _download_single_archive to use the download manager instead of utils.download, and to let the download manager handle all the checks so that they are not duplicated.

Comment thread fluster/test_suite.py Outdated
f"Checksum mismatch for source file {os.path.basename(first_tv.source)}: {checksum} "
f"instead of '{first_tv.source_checksum}'"
# Verify existing file: clean up corrupt, skip if valid
skip_download = False
Contributor

All of this logic should be handled by the DownloadManager

Comment thread fluster/test_suite.py
return (url, local_path)

max_workers = max(1, min(jobs, len(unique_source_list)))
with ThreadPoolExecutor(max_workers=max_workers) as dl_pool:
Contributor

We use Pool from multiprocessing
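
If the intent is to stay within the multiprocessing module while keeping threads for I/O-bound downloads, `multiprocessing.pool.ThreadPool` offers the same `Pool` interface. Whether this is what the review asks for is an assumption. The sketch below reuses the worker-count expression from the snippet under review; `fetch_one`, `predownload`, and the URLs are hypothetical placeholders.

```python
from multiprocessing.pool import ThreadPool
from typing import List, Tuple


def fetch_one(url: str) -> Tuple[str, str]:
    # Placeholder for the real download; returns (url, local_path).
    return (url, "resources/" + url.rsplit("/", 1)[-1])


def predownload(urls: List[str], jobs: int) -> List[Tuple[str, str]]:
    unique = list(dict.fromkeys(urls))  # de-duplicate while keeping order
    workers = max(1, min(jobs, len(unique)))
    # ThreadPool has the Pool API but runs workers as threads, so the
    # tasks do not need to be picklable.
    with ThreadPool(processes=workers) as pool:
        return pool.map(fetch_one, unique)


result = predownload(
    [
        "https://example.com/a.zip",
        "https://example.com/b.zip",
        "https://example.com/a.zip",
    ],
    jobs=8,
)
```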

Comment thread fluster/test_suite.py
@@ -328,7 +400,16 @@ def _callback_error(err: Any) -> None:

downloads = []
for tv in self.test_vectors.values():
Contributor

This should go away if we are using the DownloadManager.

@rsanchez87
Contributor

> @rsanchez87 could you have a look at this PR as well? The idea is to speed up the build of Docker images containing all the test suites.

@dabrain34, tested with python3 fluster.py download AV1-ARGON-PROFILE0-CORE-ANNEX-B AV1-ARGON-PROFILE1-CORE-ANNEX-B AV1-ARGON-PROFILE2-CORE-ANNEX-B
master: 49m 40s
PR: 16m 2s (~3x faster, ZIP downloaded once instead of 3 times) ✅

Also regression tests ✔️

I’ll test again once the requested changes by @ylatuya are implemented. Thanks!

@dabrain34
Contributor Author

Thanks for testing. Indeed, the gain is even larger on slow connections, since we no longer re-download the AV1 zip file every time.

I'm currently addressing the comments from ylatuya. I'll get back to you when this is ready.

Introduce a centralized DownloadManager so each URL is downloaded at
most once, both within and across selected suites. Saves re-fetching
multi-GB archives like AV1-ARGON shared by 12 suites.

DownloadManager (fluster/utils.py):
- Thread-safe per-URL caching at resources/.cache/; concurrent get()
  calls on the same URL block on the in-flight download.
- BoundedSemaphore caps HTTP concurrency at 8.
- Per-URL retry budget; ChecksumMismatchError poisons immediately.
- invalidate(url) lets consumers drop a corrupt cached archive.
- Context manager: cleanup() runs via __exit__, honoring keep_file.
- filename_from_url() strips query strings for safe on-disk names.
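
Two of the points above, the concurrency cap and the query-string stripping, can be illustrated with the standard library alone. This is a sketch under assumptions: only the name `filename_from_url` and the limit of 8 come from the commit message; `fetch_with_cap`, its signature, and the example URL are hypothetical.

```python
import os
import threading
from typing import Callable
from urllib.parse import urlsplit

MAX_HTTP = 8  # cap taken from the commit message
http_slots = threading.BoundedSemaphore(MAX_HTTP)


def filename_from_url(url: str) -> str:
    # Use only the path component, so the query string and fragment
    # never end up in the on-disk name.
    return os.path.basename(urlsplit(url).path)


def fetch_with_cap(url: str, fetch: Callable[[str], str]) -> str:
    # At most MAX_HTTP callers are inside fetch() at any time;
    # the others block until a slot is released.
    with http_slots:
        return fetch(url)


name = filename_from_url("https://example.com/files/argon.zip?token=abc")
```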

TestSuite.download() (fluster/test_suite.py):
- Requires a DownloadManager (keyword-only). All three download paths
  consume pre-downloaded archives.
- Multi-TV branch pre-downloads unique URLs in parallel before the
  multiprocessing extraction pool.
- Raw source files are moved out of the cache (no double storage).

CLI (fluster/fluster.py):
- Three-phase: collect URLs across selected suites, parallel
  pre-download, per-suite extraction. Cross-suite parallelism is the
  main user-visible win.
- All callers (CLI + 7 scripts/gen_*.py) use the with-statement form.
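
The three-phase flow described for the CLI can be sketched as below. The function `download_suites`, its parameters, the suite-to-URL mapping, and the example.com URL are hypothetical; the actual code in fluster/fluster.py works on TestSuite objects and a DownloadManager.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple


def download_suites(
    suites: Dict[str, List[str]],  # suite name -> source URLs (hypothetical shape)
    fetch: Callable[[str], str],
    extract: Callable[[str, str], None],
    jobs: int = 4,
) -> Dict[str, str]:
    # Phase 1: collect the unique URLs across all selected suites.
    unique = list(dict.fromkeys(u for urls in suites.values() for u in urls))
    # Phase 2: fetch each unique URL once, in parallel.
    workers = max(1, min(jobs, len(unique)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        archives = dict(zip(unique, pool.map(fetch, unique)))
    # Phase 3: each suite extracts from the shared, already-downloaded archives.
    for name, urls in suites.items():
        for url in urls:
            extract(name, archives[url])
    return archives


fetched: List[str] = []
extracted: List[Tuple[str, str]] = []
suites = {
    "AV1-ARGON-PROFILE0-CORE-ANNEX-B": ["https://example.com/argon.zip"],
    "AV1-ARGON-PROFILE1-CORE-ANNEX-B": ["https://example.com/argon.zip"],
}
archives = download_suites(
    suites,
    fetch=lambda u: fetched.append(u) or "cache/" + u.rsplit("/", 1)[-1],
    extract=lambda name, path: extracted.append((name, path)),
)
```

Here two suites share one archive URL, so the fetch callable runs once while extraction still runs once per suite.
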
@dabrain34 dabrain34 force-pushed the dab_duplication_download branch from dc31a68 to a3b3d5f Compare May 22, 2026 13:27
@dabrain34 dabrain34 marked this pull request as draft May 22, 2026 13:46

Development

Successfully merging this pull request may close these issues.

Downloading the AV1 test suites results in downloading multiple times a 6GB archive

3 participants