Implement Specifiers and SpecifierSets filtering using ranges #1120

notatallshaw wants to merge 8 commits into
Conversation
I'll remove this from pip: pypa/pip#13850. But I think it's best to either wrap a deprecation warning around it and/or make a public version of "operators".
I played around asking Copilot in VS Code to optimize this branch. It got the relative time down to about 0.7x or so on both complex and simple filtering. This is still slower for simple specifiers, but it's about 1.2x slower instead of 1.6x or so. Here's the AI-generated summary of the things it thinks it optimized:
And the diff:

diff --git a/src/packaging/specifiers.py b/src/packaging/specifiers.py
index c341232..62cc181 100644
--- a/src/packaging/specifiers.py
+++ b/src/packaging/specifiers.py
@@ -84,12 +84,22 @@ class _ExclusionBound:
def _is_family(self, other: Version) -> bool:
"""Is ``other`` a version that this sentinel sorts above?"""
v = self.version
- if not (
- other.epoch == v.epoch
- and _trim_release(other.release) == self._trimmed_release
- and other.pre == v.pre
- ):
+ if other.epoch != v.epoch or other.pre != v.pre:
+ return False
+
+ # Compare trimmed release equality without allocating a new tuple
+ # for ``other.release`` on each call.
+ other_release = other.release
+ trimmed = self._trimmed_release
+ if len(other_release) < len(trimmed):
return False
+ for idx, value in enumerate(trimmed):
+ if other_release[idx] != value:
+ return False
+ for value in other_release[len(trimmed) :]:
+ if value != 0:
+ return False
+
if self._kind == _AFTER_LOCALS:
# Local family: exact same public version (any local label).
return other.post == v.post and other.dev == v.dev
@@ -107,8 +117,25 @@ class _ExclusionBound:
return self.version < other.version
return self._kind < other._kind
assert isinstance(other, Version)
+ # Cheap reject first: if ``other`` is not above ``V``,
+ # ``self < other`` can never hold.
+ if not (self.version < other):
+ return False
# self < other iff other is NOT in the family and other > V
- return not self._is_family(other) and self.version < other
+ return not self._is_family(other)
+
+ def __gt__(self, other: object) -> bool:
+ if isinstance(other, _ExclusionBound):
+ if self.version != other.version:
+ return self.version > other.version
+ return self._kind > other._kind
+ assert isinstance(other, Version)
+ # Fast path: base version already dominates — no family check needed.
+ if self.version >= other:
+ return True
+ # Slower path: other > V, but might still be in the family
+ # (e.g. a post-release counted as V.postN with AFTER_POSTS semantics).
+ return self._is_family(other)
def __hash__(self) -> int:
return hash((self.version, self._kind))
@@ -203,7 +230,7 @@ def _filter_by_intervals(
intervals: list[_SpecifierInterval],
iterable: Iterable[Any],
key: Callable[[Any], UnparsedVersion] | None,
- prereleases: bool,
+ prereleases: bool | None,
) -> Iterator[Any]:
"""Filter versions against precomputed intervals.
@@ -211,12 +238,116 @@ def _filter_by_intervals(
use :class:`_ExclusionBound` to handle local-version semantics.
Used by both :class:`Specifier` and :class:`SpecifierSet`.
- Prerelease buffering (PEP 440 default) is NOT handled here;
- callers wrap the result with :func:`_pep440_filter_prereleases`
- when needed.
+ When ``prereleases`` is ``None``, PEP 440 default semantics apply:
+ prereleases are excluded unless no final releases match.
"""
+ if not intervals:
+ return
+
+ # PEP 440 default behavior: exclude prereleases unless no finals match.
+ if prereleases is None:
+ prereleases_buffer: list[Any] = []
+ found_final = False
+
+ if len(intervals) == 1:
+ (
+ (lower_version, lower_inclusive),
+ (
+ upper_version,
+ upper_inclusive,
+ ),
+ ) = intervals[0]
+
+ for item in iterable:
+ parsed = _coerce_version(item if key is None else key(item))
+ if parsed is None:
+ continue
+ if lower_version is not None:
+ if lower_inclusive:
+ if parsed < lower_version:
+ continue
+ elif not (parsed > lower_version):
+ continue
+ if upper_version is not None:
+ if upper_inclusive:
+ if parsed > upper_version:
+ continue
+ elif not (parsed < upper_version):
+ continue
+ if parsed.is_prerelease:
+ prereleases_buffer.append(item)
+ else:
+ found_final = True
+ yield item
+ if not found_final:
+ yield from prereleases_buffer
+ return
+
+ for item in iterable:
+ parsed = _coerce_version(item if key is None else key(item))
+ if parsed is None:
+ continue
+ # Check if version falls within any interval. Intervals are sorted
+ # and non-overlapping, so at most one can match.
+ for (lower_version, lower_inclusive), (
+ upper_version,
+ upper_inclusive,
+ ) in intervals:
+ if lower_version is not None:
+ if lower_inclusive:
+ if parsed < lower_version:
+ break
+ elif not (parsed > lower_version):
+ break
+ if upper_version is None:
+ matched = True
+ elif upper_inclusive:
+ matched = not (parsed > upper_version)
+ else:
+ matched = parsed < upper_version
+ if matched:
+ if parsed.is_prerelease:
+ prereleases_buffer.append(item)
+ else:
+ found_final = True
+ yield item
+ break
+ if not found_final:
+ yield from prereleases_buffer
+ return
+
exclude_prereleases = prereleases is False
+ if len(intervals) == 1:
+ (
+ (lower_version, lower_inclusive),
+ (
+ upper_version,
+ upper_inclusive,
+ ),
+ ) = intervals[0]
+
+ for item in iterable:
+ parsed = _coerce_version(item if key is None else key(item))
+ if parsed is None:
+ continue
+ if exclude_prereleases and parsed.is_prerelease:
+ continue
+ if lower_version is not None:
+ if lower_inclusive:
+ if parsed < lower_version:
+ continue
+ elif not (parsed > lower_version):
+ continue
+ if upper_version is not None:
+ if upper_inclusive:
+ if parsed > upper_version:
+ continue
+ elif not (parsed < upper_version):
+ continue
+ yield item
+ return
+
for item in iterable:
parsed = _coerce_version(item if key is None else key(item))
if parsed is None:
@@ -229,20 +360,67 @@ def _filter_by_intervals(
upper_version,
upper_inclusive,
) in intervals:
- if lower_version is not None and (
- parsed < lower_version
- or (parsed == lower_version and not lower_inclusive)
- ):
- break
- if (
- upper_version is None
- or parsed < upper_version
- or (parsed == upper_version and upper_inclusive)
- ):
+ if lower_version is not None:
+ if lower_inclusive:
+ if parsed < lower_version:
+ break
+ elif not (parsed > lower_version):
+ break
+ if upper_version is None:
+ matched = True
+ elif upper_inclusive:
+ matched = not (parsed > upper_version)
+ else:
+ matched = parsed < upper_version
+ if matched:
yield item
break
+def _version_in_intervals(
+ version: Version, intervals: list[_SpecifierInterval]
+) -> bool:
+ """Return whether ``version`` falls within any of ``intervals``."""
+ if not intervals:
+ return False
+
+ if len(intervals) == 1:
+ (
+ (lower_version, lower_inclusive),
+ (
+ upper_version,
+ upper_inclusive,
+ ),
+ ) = intervals[0]
+ if lower_version is not None:
+ if lower_inclusive:
+ if version < lower_version:
+ return False
+ elif not (version > lower_version):
+ return False
+ if upper_version is None:
+ return True
+ if upper_inclusive:
+ return not (version > upper_version)
+ return version < upper_version
+
+ for (lower_version, lower_inclusive), (upper_version, upper_inclusive) in intervals:
+ if lower_version is not None:
+ if lower_inclusive:
+ if version < lower_version:
+ break
+ elif not (version > lower_version):
+ break
+ if upper_version is None:
+ return True
+ if upper_inclusive:
+ if not (version > upper_version):
+ return True
+ elif version < upper_version:
+ return True
+ return False
+
+
def _pep440_filter_prereleases(
iterable: Iterable[Any], key: Callable[[Any], UnparsedVersion] | None
) -> Iterator[Any]:
@@ -820,7 +998,24 @@ class Specifier(BaseSpecifier):
True
"""
- return bool(list(self.filter([item], prereleases=prereleases)))
+ if self.operator == "===":
+ return str(item).lower() == self.version.lower()
+
+ version = _coerce_version(item)
+ if version is None:
+ return False
+
+ if prereleases is None:
+ if self._prereleases is not None:
+ prereleases = self._prereleases
+ elif self.prereleases:
+ prereleases = True
+
+ resolve_pre = True if prereleases is None else prereleases
+ if not resolve_pre and version.is_prerelease:
+ return False
+
+ return _version_in_intervals(version, self._to_intervals())
@typing.overload
def filter(
@@ -890,22 +1085,15 @@ class Specifier(BaseSpecifier):
elif self.prereleases:
prereleases = True
- # When prereleases is still None, pass True to include all versions
- # and let _pep440_filter_prereleases handle the buffering.
- resolve_pre = True if prereleases is None else prereleases
-
- filtered = _filter_by_intervals(
+ # _filter_by_intervals handles prereleases=None (PEP 440 semantics)
+ # directly, so no wrapper needed.
+ yield from _filter_by_intervals(
self._to_intervals(),
iterable,
key,
- prereleases=resolve_pre,
+ prereleases=prereleases,
)
- if prereleases is not None:
- yield from filtered
- else:
- yield from _pep440_filter_prereleases(filtered, key)
-
class SpecifierSet(BaseSpecifier):
"""This class abstracts handling of a set of version specifiers.
@@ -1244,8 +1432,30 @@ class SpecifierSet(BaseSpecifier):
if version is not None and installed and version.is_prerelease:
prereleases = True
- check_item = item if version is None else version
- return bool(list(self.filter([check_item], prereleases=prereleases)))
+ if prereleases is None:
+ default_prereleases = self.prereleases
+ if default_prereleases is not None:
+ prereleases = default_prereleases
+
+ allow_prereleases = True if prereleases is None else prereleases
+
+ if self._specs:
+ intervals = self._get_intervals()
+ if intervals is not None:
+ if version is None:
+ return False
+ if not allow_prereleases and version.is_prerelease:
+ return False
+ return _version_in_intervals(version, intervals)
+
+ candidate = item if version is None else version
+ return all(
+ s.contains(candidate, prereleases=allow_prereleases)
+ for s in self._specs
+ )
+
+ # Empty SpecifierSet matches everything unless prereleases are disabled.
+ return allow_prereleases or version is None or not version.is_prerelease
@typing.overload
def filter(
@@ -1313,37 +1523,48 @@ class SpecifierSet(BaseSpecifier):
# Determine if we're forcing a prerelease or not, if we're not forcing
# one for this particular filter call, then we'll use whatever the
# SpecifierSet thinks for whether or not we should support prereleases.
- if prereleases is None and self.prereleases is not None:
- prereleases = self.prereleases
+ if prereleases is None:
+ default_prereleases = self.prereleases
+ if default_prereleases is not None:
+ prereleases = default_prereleases
# Filter versions that match all specifiers.
if self._specs:
- resolve_pre = True if prereleases is None else prereleases
+ # Fast path: a single specifier can delegate directly.
+ # This avoids an extra PEP 440 pass in the common one-spec case.
+ if len(self._specs) == 1:
+ return self._specs[0].filter(
+ iterable,
+ prereleases=prereleases,
+ key=key,
+ )
- filtered: Iterator[Any]
intervals = self._get_intervals()
if intervals is not None:
- filtered = _filter_by_intervals(
+ # _filter_by_intervals handles prereleases=None (PEP 440
+ # semantics) directly.
+ return _filter_by_intervals(
intervals,
iterable,
key,
- prereleases=resolve_pre,
+ prereleases=prereleases,
)
- else:
- # _get_intervals returns None when specs include ===
- # (arbitrary string matching, not version comparison).
- specs = self._specs
- filtered = (
- item
- for item in iterable
- if all(
- s.contains(
- item if key is None else key(item),
- prereleases=resolve_pre,
- )
- for s in specs
+
+ # _get_intervals returns None when specs include ===
+ # (arbitrary string matching, not version comparison).
+ allow_prereleases = True if prereleases is None else prereleases
+ specs = self._specs
+ filtered: Iterator[Any] = (
+ item
+ for item in iterable
+ if all(
+ s.contains(
+ item if key is None else key(item),
+ prereleases=allow_prereleases,
)
+ for s in specs
)
+ )
if prereleases is not None:
                return filtered
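The `prereleases is None` handling that the diff folds into `_filter_by_intervals` follows the PEP 440 default: prereleases are excluded unless no final releases match. A stdlib-only sketch of that buffering strategy (the `matches` and `is_prerelease` callables here are stand-ins for the real interval check and `Version.is_prerelease`, not the PR's actual API):

```python
# Finals are yielded as they stream through, while prereleases are
# buffered and emitted only if no final release matched at all.
def filter_default_prereleases(versions, matches, is_prerelease):
    buffered = []
    found_final = False
    for v in versions:
        if not matches(v):
            continue
        if is_prerelease(v):
            buffered.append(v)
        else:
            found_final = True
            yield v
    if not found_final:
        yield from buffered

# Final releases win when present...
print(list(filter_default_prereleases(
    ["1.0", "2.0a1", "1.5"], lambda v: True, lambda v: "a" in v)))
# → ['1.0', '1.5']

# ...but prereleases are returned when nothing else matches.
print(list(filter_default_prereleases(
    ["2.0a1"], lambda v: True, lambda v: "a" in v)))
# → ['2.0a1']
```

Streaming the finals (instead of buffering everything) is what lets the single-interval fast path in the diff keep generator semantics while still honoring the fallback.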
I've rebased and cleaned this up now that #1119 has been merged. It's going to be a few weeks before I can start working on this, but my plan is the following:
This PR is now ready for review. It is completely overhauled but keeps the same concepts and logic. The diff was bigger than expected, because this PR is a stepping stone to a public ranges API. This PR now does all the range-based calculations in a new private `_ranges` module. Performance is now improved for all contains and filter operations, without performance regression anywhere else. The performance improvement is especially dramatic for complex specifiers with lots of versions, since after the initial range construction each version check reduces to a handful of bounds comparisons.
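The interval idea described above can be pictured with a toy model (integer tuples stand in for real `Version` objects; the names here are illustrative, not the PR's actual API). A set like `>=1.0,!=1.5,<2.0` becomes the sorted, non-overlapping intervals `[1.0, 1.5)` and `(1.5, 2.0)`, so membership is just a bounds check:

```python
def in_intervals(version, intervals):
    """Each interval is ((lower, lower_inclusive), (upper, upper_inclusive)),
    where a bound of None means unbounded on that side."""
    for lower, upper in intervals:
        if lower is not None:
            lo, lo_inc = lower
            if version < lo or (version == lo and not lo_inc):
                continue  # below this interval; try the next one
        if upper is None:
            return True
        hi, hi_inc = upper
        if version < hi or (version == hi and hi_inc):
            return True
    return False

# ">=1.0,!=1.5,<2.0" → [1.0, 1.5) ∪ (1.5, 2.0)
intervals = [
    (((1, 0), True), ((1, 5), False)),
    (((1, 5), False), ((2, 0), False)),
]
print(in_intervals((1, 2), intervals))  # True
print(in_intervals((1, 5), intervals))  # False (excluded by !=1.5)
print(in_intervals((2, 5), intervals))  # False (above the upper bound)
```

The real implementation additionally uses `_ExclusionBound` sentinels so local-version and prerelease families sort correctly around each bound, which plain tuples cannot capture.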
This seems good; I'm not seeing anything significant. I'll trigger a Copilot review just to see if it sees anything.
Pull request overview

This pull request refactors `packaging.specifiers.Specifier` and `SpecifierSet` filtering to use precomputed PEP 440 "version ranges" (intervals) rather than repeatedly applying per-version operator predicates, with the goal of significantly improving performance for complex specifiers.

Changes:

- Introduces a new private `_ranges` module that models specifiers as sorted, non-overlapping version ranges and provides range intersection and filtering utilities.
- Reworks `Specifier`/`SpecifierSet` to lazily compute/cache ranges and use range-based filtering and unsatisfiability checks (including prerelease-only range detection).
- Updates and expands tests to cover new caching behavior, boundary semantics, prerelease filtering behavior, and pickle cache reset/loading scenarios.
Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| `src/packaging/specifiers.py` | Switches specifier mechanics to range-based filtering/intersection; updates caching and prerelease/unsatisfiable handling to use `_ranges`. |
| `src/packaging/_ranges.py` | Adds private range/boundary primitives plus range intersection, filtering, and prerelease-only range detection helpers. |
| `tests/test_specifiers.py` | Updates expectations around caching and pickle cache reset; adds regression tests for multirange + prerelease filtering and lazy construction. |
| `tests/test_version.py` | Adds coverage for a `Version.__gt__` comparison fast path involving cached comparison keys. |
Looking at the benchmark numbers again, I see marker evaluation is a few percent slower. I'm going to quickly profile and see if this can be simply fixed. There are actually a lot of minor performance optimizations I haven't included due to the trade-off in complexity.
I noticed that, but it was under 5% so I wasn't too worried about it; if you see a quick fix, that's good. I'd focus on readability for a first iteration, so if it's too ugly to fix now, that could be a follow-up.
Bad news/good news. Bad news: investigating the marker performance loss, I found that the cold path (before any range object has been constructed) had regressed. Good news: rather than implementing a full fallback to the old code, we can add a fairly simple fast path for when we've not constructed a range object yet and the comparison reduces down to a simple check. This actually speeds up the marker evaluation benchmark a couple of percent.
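The shape of that fix can be sketched with a toy model (`TinySpec` and everything else here are hypothetical illustration, not the PR's code): build the interval object lazily, and answer a simple operator comparison directly when no range has been constructed yet.

```python
import operator

# Simple comparisons that never need the interval machinery.
_SIMPLE_OPS = {
    "==": operator.eq, "!=": operator.ne,
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
}


class TinySpec:
    def __init__(self, op, version):
        self.op = op
        self.version = version
        self.intervals = None  # constructed only when actually needed

    def contains(self, candidate):
        # Cold-path fast path: a one-off simple comparison skips range
        # construction entirely.
        if self.intervals is None and self.op in _SIMPLE_OPS:
            return _SIMPLE_OPS[self.op](candidate, self.version)
        raise NotImplementedError("fall back to the interval machinery here")


spec = TinySpec(">=", (1, 0))
print(spec.contains((1, 2)))  # True — and spec.intervals was never built
```

The trade-off is that repeated checks against the same specifier should still trigger interval construction, so the fast path only pays off for the one-shot marker-evaluation style of use.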
Why is this public?
It's not; it's inside the private `_ranges` module, but this fast path in the specifiers module needs to access it. When ranges become public I will move `trim_release` and `coerce_version` into some shared private utility module.
Having the clankers do a couple of deep sweeps, they found two edge cases.
This is a proof of concept of switching the internal mechanics of `Specifier` and `SpecifierSet` to using intervals instead of iterative version filters. In general this speeds up any mildly complex specifier and can make very complex specifiers significantly faster (more than 10x).

However, this is a large overhaul; first it is intended we land #1119, which implements just enough of this to introduce a new `is_unsatisfiable` method.

This implementation is designed to make it easy to build a public API on top of this. I imagine having some `to_range()` method which returns a PEP 440 compliant `VersionRange` object that can be manipulated with set algebra.

However, to make this more performant, especially for very simple one-off cases where it is currently slower, the internal machinery probably needs to move away from bounds logic, implementing hot paths and side channels. Will leave this in draft until after #1119 lands.