perf: avoid urljoin for absolute HTML links #10903

Merged
radoering merged 3 commits into python-poetry:main from dhimasardinata:dhimas/html-absolute-link-fast-path
May 17, 2026

Conversation

@dhimasardinata
Contributor

@dhimasardinata dhimasardinata commented May 17, 2026

This mirrors the JSON Simple API shortcut from #10896 for HTML Simple API pages: absolute file links do not need urllib.parse.urljoin(), and PyPI-style repository pages commonly use absolute links.
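
A minimal sketch of the idea, not the code merged in this PR. The helper name `resolve_link` and the example URLs are hypothetical; the prefix tuple uses the `ABSOLUTE_LINK_PREFIXES` values quoted later in the review.

```python
from urllib.parse import urljoin

# Prefixes treated as "already absolute" (values as quoted in the review).
ABSOLUTE_LINK_PREFIXES = ("http://", "https://", "file://")


def resolve_link(base_url: str, href: str) -> str:
    """Hypothetical helper: return href as-is if absolute, else join it."""
    if href.startswith(ABSOLUTE_LINK_PREFIXES):
        # Fast path: skip urljoin() and its URL parsing entirely.
        return href
    return urljoin(base_url, href)


base = "https://example.org/simple/demo/"
print(resolve_link(base, "https://files.example.org/packages/demo-0.1.whl"))
print(resolve_link(base, "../../packages/demo-0.1.whl"))
```

Relative links still go through `urljoin()`, so only links that start with one of the listed prefixes take the shortcut.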

A small local benchmark over absolute https:// links showed that the prefix check, which skips urljoin(), is about 198x faster at the median (timings in seconds):

old_s min/med/max=0.177560/0.236734/0.484216
new_s min/med/max=0.000984/0.001196/0.008956
speedup=198.0x
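
The benchmark script itself is not included in the PR. A comparable measurement could be sketched as follows, assuming a hypothetical workload of 1,000 absolute https:// links; the URLs, repeat counts, and function names are illustrative, and the timings will differ from those reported above.

```python
import timeit
from urllib.parse import urljoin

ABSOLUTE_LINK_PREFIXES = ("http://", "https://", "file://")
BASE = "https://example.org/simple/demo/"
# Hypothetical workload: 1,000 absolute https:// file links.
HREFS = [f"https://files.example.org/packages/demo-{i}.whl" for i in range(1000)]


def old_path() -> list[str]:
    # Always go through urljoin(), even for absolute links.
    return [urljoin(BASE, href) for href in HREFS]


def new_path() -> list[str]:
    # Prefix check first; only non-absolute links reach urljoin().
    return [
        href if href.startswith(ABSOLUTE_LINK_PREFIXES) else urljoin(BASE, href)
        for href in HREFS
    ]


# Both paths must agree before their timings are worth comparing.
assert old_path() == new_path()

old_times = timeit.repeat(old_path, number=10, repeat=5)
new_times = timeit.repeat(new_path, number=10, repeat=5)
print(f"old_s min={min(old_times):.6f}")
print(f"new_s min={min(new_times):.6f}")
```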

Pull Request Check List

  • Added tests for changed code.
  • Updated documentation for changed code. N/A, no user-facing behavior change.

Local checks:

PYTHONPATH=src uv run --with pytest --with pytest-xdist --with responses --with pytest-mock python -m pytest tests/repositories/link_sources/test_html.py tests/repositories/link_sources/test_json.py -q
uv run --with ruff python -m ruff check src/poetry/repositories/link_sources/base.py src/poetry/repositories/link_sources/html.py src/poetry/repositories/link_sources/json.py tests/repositories/link_sources/test_html.py
uv run --with ruff python -m ruff format --check src/poetry/repositories/link_sources/base.py src/poetry/repositories/link_sources/html.py src/poetry/repositories/link_sources/json.py tests/repositories/link_sources/test_html.py
PYTHONPATH=src uv run --with poetry-core --with packaging --with mypy python -m mypy src/poetry/repositories/link_sources/base.py src/poetry/repositories/link_sources/html.py src/poetry/repositories/link_sources/json.py
git diff --check

@sourcery-ai sourcery-ai Bot left a comment

Hey - I've found 1 issue, and left some high level feedback:

  • The absolute URL detection currently relies on hard-coded string prefixes; consider using urllib.parse.urlparse(href).scheme (or reusing a shared helper with the JSON Simple API) to avoid missing valid schemes (e.g. //example.org/..., ftp:) and to reduce the risk of the two code paths diverging.
  • Since ABSOLUTE_LINK_PREFIXES is only used inside this module and is performance-related, you might want to annotate this with a short comment explaining why only these schemes are handled specially and why others still go through urljoin, to make future modifications safer.
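
To make the trade-off concrete, here is a small sketch comparing the prefix check with a scheme check via `urlparse`. The sample hrefs are hypothetical. Note that a scheme-relative reference such as `//example.org/...` has an empty scheme, so a scheme check alone would still send it through `urljoin`.

```python
from urllib.parse import urljoin, urlparse

ABSOLUTE_LINK_PREFIXES = ("http://", "https://", "file://")
BASE = "https://example.org/simple/demo/"

samples = [
    "https://files.example.org/demo-0.1.whl",
    "ftp://files.example.org/demo-0.1.whl",
    "//files.example.org/demo-0.1.whl",  # scheme-relative: no scheme at all
    "demo-0.1.whl",
]

for href in samples:
    by_prefix = href.startswith(ABSOLUTE_LINK_PREFIXES)
    by_scheme = bool(urlparse(href).scheme)
    # urljoin() is the reference result that any fast path must reproduce.
    print(f"{href!r}: prefix={by_prefix} scheme={by_scheme} -> {urljoin(BASE, href)}")
```

The `ftp://` link misses the prefix fast path, but `urljoin()` returns it unchanged, so the result stays correct and only the shortcut is lost.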
Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location path="tests/repositories/link_sources/test_html.py" line_range="99" />
<code_context>
     assert link.hashes == {"sha256": "abcd1234"}


+def test_absolute_url_skips_urljoin(
+    html_page_content: HTMLPageGetter, monkeypatch: pytest.MonkeyPatch
+) -> None:
</code_context>
<issue_to_address>
**suggestion (testing):** Consider parametrizing this test over all supported absolute URL prefixes

`ABSOLUTE_LINK_PREFIXES` includes `("http://", "https://", "file://")`, but this test only covers `https://`. Please parametrize it over all entries in `ABSOLUTE_LINK_PREFIXES` (or at least add explicit `http://` and `file://` cases) so we verify that `urljoin` is consistently skipped for every supported absolute scheme and remain robust if the tuple changes in the future.

Suggested implementation:

```python
@pytest.mark.parametrize("prefix", ABSOLUTE_LINK_PREFIXES)
def test_absolute_url_skips_urljoin(
    html_page_content: HTMLPageGetter,
    monkeypatch: pytest.MonkeyPatch,
    prefix: str,
) -> None:
    def fail_urljoin(base: str, url: str) -> str:
        raise AssertionError("urljoin should not be called for absolute URLs")

    monkeypatch.setattr(
        "poetry.repositories.link_sources.html.urllib.parse.urljoin", fail_urljoin
    )

    anchor = (
        f'<a href="{prefix}files.pythonhosted.org/packages/demo-0.1.whl">'
        "demo-0.1.whl</a><br/>"

```

1. Ensure `ABSOLUTE_LINK_PREFIXES` is imported in this test module, for example:
   `from poetry.repositories.link_sources.html import ABSOLUTE_LINK_PREFIXES, HTMLPageGetter`
   or by extending the existing import that already brings in `HTMLPageGetter`.
2. If `pytest` is not yet imported under the name `pytest` in this file, add `import pytest` to the imports so that `@pytest.mark.parametrize` is available.
</issue_to_address>
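
The suggested snippet above is cut off and depends on Poetry's test fixtures. As a self-contained illustration of the same technique (replace `urljoin` with a function that fails, then exercise every prefix), here is a sketch using only the standard library. `resolve_link` is a hypothetical stand-in for the code under test, not Poetry's API.

```python
import urllib.parse
from unittest import mock

ABSOLUTE_LINK_PREFIXES = ("http://", "https://", "file://")


def resolve_link(base_url: str, href: str) -> str:
    """Hypothetical stand-in for the code under test."""
    if href.startswith(ABSOLUTE_LINK_PREFIXES):
        return href
    # Looked up through the module at call time so the patch below is seen.
    return urllib.parse.urljoin(base_url, href)


def fail_urljoin(base: str, url: str) -> str:
    raise AssertionError("urljoin should not be called for absolute URLs")


def check_absolute_urls_skip_urljoin() -> None:
    # While the patch is active, any call to urljoin() raises.
    with mock.patch("urllib.parse.urljoin", fail_urljoin):
        for prefix in ABSOLUTE_LINK_PREFIXES:
            href = f"{prefix}files.example.org/packages/demo-0.1.whl"
            assert resolve_link("https://example.org/simple/", href) == href


check_absolute_urls_skip_urljoin()
```

Looping over `ABSOLUTE_LINK_PREFIXES` plays the role of `@pytest.mark.parametrize` here, so the check follows the tuple if it changes.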

Comment thread tests/repositories/link_sources/test_html.py Outdated
@dhimasardinata dhimasardinata force-pushed the dhimas/html-absolute-link-fast-path branch from 80cbce6 to 7a8c413 Compare May 17, 2026 10:01
Member

@radoering radoering left a comment

Makes sense, but I think we should not duplicate the logic. Maybe we should introduce a helper in LinkSource.

Edit: I see, you noticed yourself. 😄

@radoering radoering force-pushed the dhimas/html-absolute-link-fast-path branch from fc175ab to 56bc6a8 Compare May 17, 2026 17:29
@radoering
Member

@sourcery-ai review

@sourcery-ai sourcery-ai Bot left a comment

Hey - I've reviewed your changes and they look great!

@radoering radoering enabled auto-merge (squash) May 17, 2026 17:33
@radoering radoering merged commit 81875f3 into python-poetry:main May 17, 2026
54 checks passed