feat: Artifactory Archive Entry Download Optimization for subdirectory/file packages #417

@chkp-roniz

Problem

When APM installs a virtual subdirectory package (e.g., github/awesome-copilot/skills/review-and-refactor) via Artifactory, it currently downloads the entire repository archive (e.g., 5.9MB for awesome-copilot), extracts it to a temp directory, then copies only the target subdirectory. This is wasteful for large repos where the needed subdirectory is a tiny fraction of the total.

Proposed Solution

JFrog Artifactory supports Archive Entry Download — fetching individual files from inside a zip archive without downloading the whole archive.

API Reference: https://docs.jfrog.com/artifactory/reference/archiveEntryDownload

URL Pattern

GET https://<host>/artifactory/<repo-key>/<path/to/archive>.zip!/<path/inside/archive>

Examples

GitHub archive via Artifactory:

https://<artifactory-host>/artifactory/<repo-key>/github/awesome-copilot/archive/refs/heads/main.zip!/awesome-copilot-main/skills/review-and-refactor/SKILL.md

GitLab archive via Artifactory:

https://<artifactory-host>/artifactory/<repo-key>/<owner>/<repo>/-/archive/main/<repo>-main.zip!/<repo>-main/.apm/agents/design-reviewer.agent.md

Archive Root Prefix Convention

Both GitHub and GitLab archives contain a root directory prefix: {repo}-{ref}/

| Source | Archive URL | Root prefix |
|---|---|---|
| GitHub | .../github/awesome-copilot/archive/refs/heads/main.zip | awesome-copilot-main/ |
| GitLab | .../<owner>/<repo>/-/archive/main/<repo>-main.zip | <repo>-main/ |

The entry path must include this root prefix:

{archive_url}!/{repo}-{ref}/{path_inside_repo}
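
The pattern above can be sketched as a small helper. This is an illustration only: `build_entry_url` is a hypothetical name, not an existing APM function, and the host in the usage line is a placeholder.

```python
from urllib.parse import quote


def build_entry_url(archive_url: str, repo: str, ref: str, path_inside_repo: str) -> str:
    """Return the Archive Entry Download URL for one file inside the archive."""
    # Root prefix per the {repo}-{ref} convention; no trailing slash so the
    # join below never produces a double slash.
    root_prefix = f"{repo}-{ref}"
    # Drop stray leading slashes and percent-encode unsafe characters, keeping "/".
    entry_path = quote(path_inside_repo.lstrip("/"), safe="/")
    return f"{archive_url}!/{root_prefix}/{entry_path}"


# Placeholder host and repo key, for illustration only.
print(build_entry_url(
    "https://artifactory.example.com/artifactory/my-repo-key/github/awesome-copilot/archive/refs/heads/main.zip",
    "awesome-copilot",
    "main",
    "skills/review-and-refactor/SKILL.md",
))
```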

Implementation Approach

Where to Change

File: src/apm_cli/deps/github_downloader.py

Method: _download_subdirectory_from_artifactory() (line ~1658)

Current Flow (full archive download)

1. Download full archive zip (potentially many MB)
2. Extract to temp directory
3. Find subdirectory inside extracted files
4. Copy subdirectory to target path
5. Clean up temp directory

Proposed Flow (entry-level download)

1. Construct archive URL (already done by build_artifactory_archive_url())
2. Infer root prefix from convention: "{repo}-{ref}" (no trailing slash, so the join below does not produce "//")
3. For each file in subdirectory:
   GET {archive_url}!/{root_prefix}/{subdir}/{file}
4. Write files directly to target path
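
A minimal sketch of steps 3 and 4, assuming the list of files is already known (see the File Listing Challenge section). `download_entries` and the injected `fetch` callable are hypothetical names; in APM the fetch would presumably wrap the existing HTTP helper and return None on any non-200 response.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, Optional


def download_entries(
    archive_url: str,
    root_prefix: str,
    subdir: str,
    files: Iterable[str],
    target: Path,
    fetch: Callable[[str], Optional[bytes]],
) -> bool:
    """Fetch every file of `subdir` via entry download and write it under `target`.

    Returns False (writing nothing) as soon as one entry is missing, so the
    caller can fall back to the full archive download.
    """
    base = f"{archive_url}!/{root_prefix.strip('/')}/{subdir.strip('/')}"
    fetched: Dict[str, bytes] = {}
    for rel in files:
        data = fetch(f"{base}/{rel}")
        if data is None:
            return False
        fetched[rel] = data
    # Write only after all entries arrived, so a failure leaves no partial package.
    for rel, data in fetched.items():
        dest = target / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        dest.write_bytes(data)
    return True
```

Buffering the entries in memory before writing is a design choice that suits the small subdirectories targeted here; it would not suit large packages, which fall back to the full archive anyway.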

Root Prefix Discovery

Option A — Infer from convention (preferred):
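The root prefix is expected to be {repo}-{ref}/ for both GitHub and GitLab archives, so it can be inferred without any extra HTTP calls. The inference can miss when the host normalizes the ref (GitHub, for instance, drops the leading v of a version tag and turns / into - in the directory name); in that case the entry request returns 404 and the full-archive fallback applies.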

root_prefix = f"{repo}-{ref}"
entry_url = f"{archive_url}!/{root_prefix}/{subdir_path}/{filename}"

Option B — Discovery via partial download:
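Request a small byte range of the zip and read the root directory name from it. The first local file header, at the start of the file, names the first entry; the central directory sits at the end of the file and would need a range request for the tail. More robust than inference, but it adds a round trip and depends on the server honoring Range requests.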
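
A sketch of Option B, under two assumptions: the server honors an HTTP Range request for the start of the archive (e.g. `Range: bytes=0-255`), and the first entry of the archive sits under the root directory. It parses the first ZIP local file header rather than the central directory, which is stored at the end of a zip. `root_prefix_from_head` is a hypothetical helper name.

```python
import struct
from typing import Optional

# Fixed part of a ZIP local file header (30 bytes, little-endian): signature,
# version, flags, method, mod time, mod date, CRC-32, compressed size,
# uncompressed size, file name length, extra field length.
_LOCAL_HEADER = struct.Struct("<IHHHHHIIIHH")
_LOCAL_SIGNATURE = 0x04034B50


def root_prefix_from_head(head: bytes) -> Optional[str]:
    """Return the top-level directory of the first entry, or None if unreadable."""
    if len(head) < _LOCAL_HEADER.size:
        return None
    fields = _LOCAL_HEADER.unpack_from(head)
    if fields[0] != _LOCAL_SIGNATURE:
        return None
    name_len = fields[9]
    name = head[_LOCAL_HEADER.size:_LOCAL_HEADER.size + name_len]
    if len(name) < name_len:
        return None  # the byte range was too short to hold the whole name
    first_entry = name.decode("utf-8", errors="replace")
    return first_entry.split("/", 1)[0] or None
```

Returning None for anything unexpected keeps the caller's logic simple: treat None as "prefix unknown" and use inference or the full archive instead.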

File Listing Challenge

The archive entry API downloads individual files — it doesn't list directory contents. Options:

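  1. Fetch the full archive file list via Artifactory's File List API (to be confirmed: this API is documented for repository folders, so it may not enumerate entries inside a zip):
    GET /api/storage/{repo-key}/{path}?list&deep=1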
  2. Fetch a manifest file first (e.g., apm.yml or SKILL.md) to validate, then fall back to full archive for extraction.
  3. Hybrid approach: Use archive entry download for known files, fall back to full archive only if needed.
  4. Accept full archive for subdirectory packages but use entry download for virtual file packages (single .prompt.md, .agent.md files) — simplest and most common case.

Recommended Phased Approach

Phase 1: Virtual File Packages (Simplest)

For _download_file_from_artifactory() — currently downloads full archive to extract one file. Replace with single entry download:

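import requests

def _download_file_from_artifactory(self, host, prefix, owner, repo, file_path, ref, scheme="https"):
    # Despite its singular name, the helper's result is iterated as a list of candidate URLs.
    archive_urls = build_artifactory_archive_url(host, prefix, owner, repo, ref, scheme=scheme)
    root_prefix = f"{repo}-{ref}"  # inferred from the {repo}-{ref} convention
    headers = self._get_artifactory_headers()

    for archive_url in archive_urls:
        entry_url = f"{archive_url}!/{root_prefix}/{file_path.lstrip('/')}"
        try:
            resp = self._resilient_get(entry_url, headers=headers)
        except requests.RequestException:
            continue  # network error: try the next candidate URL
        if resp.status_code == 200:
            return resp.content
        # Any other status (e.g. 404) falls through to the next candidate URL

    # Fall back to full archive download.
    # Assumption: the current implementation is kept under this name with the same signature.
    return self._download_file_from_artifactory_full(host, prefix, owner, repo, file_path, ref, scheme=scheme)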

Savings: For a single .prompt.md file (~1KB), avoids downloading a multi-MB archive.

Phase 2: Subdirectory Packages (More Complex)

  1. Fetch the package manifest via entry download to validate (apm.yml, SKILL.md)
  2. If subdirectory has few files, fetch each via entry download
  3. For large subdirectories, fall back to full archive download

Heuristic: If the manifest lists fewer than N primitives (e.g., 20 files), use entry-level download. Otherwise full archive is more efficient.
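
The heuristic could be sketched as follows. `Strategy`, `choose_strategy`, and the default threshold of 20 are illustrative; the issue leaves N open, and a missing file count is treated here as a reason to use the full archive.

```python
from enum import Enum
from typing import Optional


class Strategy(Enum):
    ENTRY = "entry"                # one GET per file via archive entry download
    FULL_ARCHIVE = "full_archive"  # current behavior: download and unzip everything


def choose_strategy(file_count: Optional[int], max_entry_files: int = 20) -> Strategy:
    """Pick entry-level download only when the package is known to be small."""
    if file_count is None:
        # The manifest did not tell us how many files to expect.
        return Strategy.FULL_ARCHIVE
    if file_count < max_entry_files:
        return Strategy.ENTRY
    return Strategy.FULL_ARCHIVE
```

The threshold is worth tuning against real latency: each entry download is a separate round trip, so the break-even point depends on archive size as much as on file count.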

Phase 3: Smart Caching

Cache archive metadata (file list, root prefix) so subsequent installs of different subdirectories from the same repo don't re-discover.
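
One possible shape for such a cache, as an in-memory sketch. `ArchiveMetadata` and `ArchiveMetadataCache` are hypothetical names, and keying by archive URL is an assumption: a branch ref can move, so a real implementation would likely key on a resolved commit or expire entries.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class ArchiveMetadata:
    root_prefix: str             # e.g. "awesome-copilot-main"
    files: Tuple[str, ...] = ()  # full entry paths, including the root prefix


class ArchiveMetadataCache:
    """Remember what was discovered about an archive during this process."""

    def __init__(self) -> None:
        self._entries: Dict[str, ArchiveMetadata] = {}

    def get(self, archive_url: str) -> Optional[ArchiveMetadata]:
        return self._entries.get(archive_url)

    def put(self, archive_url: str, metadata: ArchiveMetadata) -> None:
        self._entries[archive_url] = metadata

    def files_under(self, archive_url: str, subdir: str) -> Optional[Tuple[str, ...]]:
        """Return cached entry paths inside `subdir`, or None on a cache miss."""
        meta = self.get(archive_url)
        if meta is None:
            return None
        prefix = f"{meta.root_prefix}/{subdir.strip('/')}/"
        return tuple(path for path in meta.files if path.startswith(prefix))
```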

Performance Impact

| Scenario | Current | Optimized |
|---|---|---|
| Single virtual file (.prompt.md) from 6MB repo | 6MB download + unzip | ~1KB download |
| Skill subdirectory (5 files) from 6MB repo | 6MB download + unzip | ~5 small downloads (~50KB total) |
| Large subdirectory (100+ files) | 6MB download + unzip | Full archive (same as current) |

Edge Cases

| Case | Behavior |
|---|---|
| Root prefix doesn't follow {repo}-{ref} convention | Fall back to full archive download |
| Entry download returns 404 (file not in archive) | Fall back to full archive download |
| Artifactory instance doesn't support archive entry API | Graceful degradation to full archive |
| Archive is a tag (not branch) | Root prefix uses tag name: {repo}-{tag}/ |
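
Before falling back, a client could try a short list of likely root prefixes. The sketch below is an assumption-laden illustration: `candidate_root_prefixes` is a hypothetical helper, and the two normalizations it applies (replacing / with - in the ref, and dropping the leading v of a version-like tag, as GitHub does for its archive directory names) should be verified against the actual upstream host.

```python
from typing import List


def candidate_root_prefixes(repo: str, ref: str) -> List[str]:
    """Return root prefixes to try, most likely first, without duplicates."""
    # Assumption: slashes in a ref (e.g. "feature/x") appear as dashes in the prefix.
    flat = ref.replace("/", "-")
    candidates = [f"{repo}-{flat}"]
    # Assumption: a tag such as "v1.2.0" may appear as "1.2.0" in the prefix.
    if len(flat) > 1 and flat[0] == "v" and flat[1].isdigit():
        candidates.append(f"{repo}-{flat[1:]}")
    # dict.fromkeys removes duplicates while preserving order.
    return list(dict.fromkeys(candidates))
```

Each extra candidate costs at most one additional request that returns 404, which is still far cheaper than downloading the full archive.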

Testing

  1. Unit tests: Mock Artifactory responses for entry download URL pattern
  2. Integration tests: Verify against real Artifactory instance with both GitHub and GitLab remote repos
  3. Fallback tests: Simulate entry download failure → verify full archive fallback works
  4. Root prefix tests: Verify prefix construction for branches, tags, and commit SHAs
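
A sketch of what the unit and fallback tests might look like with `unittest.mock`. `fetch_entry` is a stand-in written for this example, not the real downloader method; the mocked session only mimics the `get`, `status_code`, and `content` attributes of a requests-style client, and the host is a placeholder.

```python
import unittest
from unittest import mock


def fetch_entry(session, archive_url, root_prefix, file_path):
    """Return the entry's bytes, or None so the caller can fall back."""
    resp = session.get(f"{archive_url}!/{root_prefix}/{file_path}")
    return resp.content if resp.status_code == 200 else None


class FetchEntryTest(unittest.TestCase):
    ARCHIVE = "https://artifactory.example.com/artifactory/key/o/r/archive/refs/heads/main.zip"

    def test_requests_the_entry_url(self):
        session = mock.Mock()
        session.get.return_value = mock.Mock(status_code=200, content=b"# skill")
        self.assertEqual(fetch_entry(session, self.ARCHIVE, "r-main", "SKILL.md"), b"# skill")
        session.get.assert_called_once_with(f"{self.ARCHIVE}!/r-main/SKILL.md")

    def test_404_signals_fallback(self):
        session = mock.Mock()
        session.get.return_value = mock.Mock(status_code=404, content=b"")
        self.assertIsNone(fetch_entry(session, self.ARCHIVE, "r-main", "missing.md"))
```

Saved as a test module, this runs with `python -m unittest`.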

Dependencies

  • Requires Artifactory server to support archive entry download (standard feature, not an add-on)
  • No client-side library changes needed — uses standard HTTP GET
  • Backward compatible — falls back to full archive download on any failure
