Optimise import and microsimulation init performance #408
Three changes that together reduce import + Microsimulation() time by ~40%:

1. Enum encoding: replace np.select (O(n*m)) with np.searchsorted (O(n log m)) plus cached lookup arrays
2. empty_clone: replace dynamic type creation with object.__new__()
3. Period/instant parsing: add lru_cache to avoid repeated strptime calls
Pull request overview
This PR optimizes the initialization performance of PolicyEngine microsimulations by approximately 40% (from 10.75s to 6.6s) through three key performance improvements: enum encoding optimization using np.searchsorted, replacing dynamic type creation with object.__new__() in empty_clone, and caching period/instant string parsing.
- Replaced `np.select` with `np.searchsorted` for O(n log m) enum encoding performance
- Simplified `empty_clone()` to use `object.__new__()` instead of dynamic type creation
- Added `@lru_cache` to period and instant parsing functions to avoid repeated `strptime` calls
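The enum encoding change can be illustrated with a minimal sketch. This is not the code in `policyengine_core/enums/enum.py`; the function names and the `_sorted_lookup` helper are hypothetical, and the fallback to index 0 for unknown strings is an assumption based on the follow-up commit message. It contrasts one boolean mask per enum member (`np.select`) with a single binary search against a cached, sorted name array (`np.searchsorted`).

```python
from functools import lru_cache

import numpy as np


def encode_with_select(values, names):
    # Baseline: builds one boolean mask per enum member, so O(n * m) work.
    conditions = [values == name for name in names]
    return np.select(conditions, list(range(len(names))), default=0)


@lru_cache(maxsize=None)
def _sorted_lookup(names):
    # Hypothetical cached helper: `names` must be a hashable tuple.
    # Returns the names in sorted order plus their original indices.
    arr = np.array(names)
    order = np.argsort(arr)
    return arr[order], order


def encode_with_searchsorted(values, names):
    sorted_names, order = _sorted_lookup(tuple(names))
    # Binary search: O(n log m) for n values and m enum members.
    pos = np.searchsorted(sorted_names, values)
    # Values sorting after every name yield pos == m; clip to stay in bounds.
    pos = np.clip(pos, 0, len(sorted_names) - 1)
    # A search hit is only valid if the name at that position matches exactly.
    valid = sorted_names[pos] == values
    # Assumption: unknown strings fall back to index 0.
    return np.where(valid, order[pos], 0)
```

The clip and the exact-match check are what guard against the out-of-range index and silent mismatch that an unchecked `searchsorted` result could produce for strings that are not enum members.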
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| policyengine_core/enums/enum.py | Implements searchsorted-based enum encoding with cached sorted lookup arrays, but contains a critical bug where invalid enum values can cause IndexError or incorrect results |
| policyengine_core/commons/misc.py | Simplifies empty_clone to use `object.__new__()` for 33x performance improvement |
| policyengine_core/periods/helpers.py | Adds LRU caching to instant and period string parsing functions to avoid repeated strptime calls |
| changelog_entry.yaml | Documents the three optimization changes |
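As a rough illustration of the `empty_clone` change, the sketch below compares the two approaches on a hypothetical `Parameter` class. The "dynamic" version is an assumed reconstruction of the older pattern (a throwaway subclass with a no-op `__init__`), not a copy of the code in `policyengine_core/commons/misc.py`.

```python
class Parameter:
    # Hypothetical class standing in for the objects being cloned.
    def __init__(self, name, value):
        self.name = name
        self.value = value


def empty_clone_dynamic(original):
    # Assumed older pattern: define a new subclass on every call so that
    # instantiation skips the original __init__, then swap the class back.
    class Dummy(original.__class__):
        def __init__(self):
            pass

    new = Dummy()
    new.__class__ = original.__class__
    return new


def empty_clone_fast(original):
    # Allocate an instance directly, bypassing __init__ and avoiding
    # the cost of creating a class object per call.
    return object.__new__(original.__class__)
```

Both functions return an instance of the original class with an empty `__dict__`; the second avoids building a class on each of the many calls made during parameter cloning.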
- Log warning when encoding invalid enum string values (they default to 0)
- Add tests for invalid enum value warning
- Document in changelog that random() now produces different sequences

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
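A minimal sketch of the warning behaviour described in this commit, under the assumption that invalid strings are reported once per call and encoded as 0. The logger name and the `encode_or_default` function are hypothetical and do not reflect the actual implementation.

```python
import logging

import numpy as np

logger = logging.getLogger("enum_encoding_sketch")  # hypothetical logger name


def encode_or_default(values, names):
    """Map strings to their index in `names`; unknown strings become 0."""
    values = np.asarray(values)
    index_of = {name: i for i, name in enumerate(names)}
    # Collect the distinct strings that are not enum members.
    invalid = sorted(set(values.tolist()) - set(index_of))
    if invalid:
        logger.warning("Invalid enum values %s encoded as 0", invalid)
    return np.array([index_of.get(v, 0) for v in values.tolist()], dtype=int)
```

A test for this behaviour could attach a handler to the logger (or use pytest's `caplog` fixture) and assert that exactly one warning naming the invalid value is emitted.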
Review Summary

Great performance optimization PR! The 40% speedup (10.75s → 6.6s) is excellent.

Changes Reviewed ✅

Follow-up Commit Added

I pushed a commit with two small improvements:

All tests pass locally (455 passed).
MaxGhenis
left a comment
All CI checks pass. Great performance improvements!
MaxGhenis
left a comment
LGTM
Three changes that together reduce `from policyengine_uk import Microsimulation; sim = Microsimulation()` time by ~40% (10.75s → 6.6s):

- Enum encoding - replaced `np.select` with `np.searchsorted` for O(n log m) lookup instead of O(n*m), with cached sorted lookup arrays via `@lru_cache`
- empty_clone - replaced dynamic type creation with `object.__new__()` which is ~33x faster for the 32k calls during parameter cloning
- Period/instant parsing - added `@lru_cache` to `_instant_from_string` and `_period_from_string` to avoid repeated `strptime` calls for the same period strings (called ~20k times during fiscal year parameter conversion)

Remaining bottlenecks are mostly inherent to data loading (`astype` at 0.80s, HuggingFace dataset at 0.82s) rather than algorithmic inefficiencies.
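The period/instant caching can be sketched as follows. `parse_instant` is a hypothetical stand-in for `_instant_from_string`, and the date format and cache size are assumptions; the point is only that repeated calls with the same string hit the cache instead of calling `strptime` again.

```python
import datetime
from functools import lru_cache


@lru_cache(maxsize=1024)  # cache size is an assumption
def parse_instant(text):
    # Hypothetical stand-in for the cached parser. Returning an immutable
    # tuple keeps the cached value safe to share between callers.
    parsed = datetime.datetime.strptime(text, "%Y-%m-%d")
    return (parsed.year, parsed.month, parsed.day)


# Simulate the same period string being parsed many times.
for _ in range(20_000):
    parse_instant("2025-04-06")

info = parse_instant.cache_info()
```

After the loop, `info.misses` counts the single real `strptime` call and `info.hits` counts the calls served from the cache.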