Add per-operator physical implementation hints #295
Open
tareqmahmood wants to merge 2 commits into
Add a `physical=` dict parameter to sem_filter, sem_map, sem_flat_map, sem_join, and sem_agg that lets users override the optimizer's physical operator selection.

The dict requires an "implementation" key (the physical operator class). All other keys are forwarded as constructor kwargs to that operator, overriding rule-generated defaults. Only the matching implementation receives the extra kwargs; other rules build operators normally and are filtered out post-substitution.

Changes:
- logical.py: store physical on LogicalOperator, validate at construction, include in get_logical_op_params() but not get_logical_id_params()
- dataset.py: thread physical= through semantic Dataset methods
- rules.py: guard extra kwargs injection by implementation class match
- tasks.py: post-filter expressions by implementation, warn on empty
18 tests covering:
- Expression filtering by implementation class (exact type match)
- Validation (rejects missing/invalid implementation key)
- Propagation through logical operators and copy
- Dataset API integration (sem_filter, sem_map, sem_flat_map)
- End-to-end usage pattern
Summary
Adds a `physical=` parameter to semantic Dataset operations (`sem_filter`, `sem_map`, `sem_flat_map`, `sem_join`, `sem_agg`) that lets users override the optimizer's physical implementation choice per operator.

Today, palimpzest's Cascades optimizer explores all valid physical implementations (LLMFilter, RAGFilter, MixtureOfAgentsFilter, etc.) for each logical operator and picks the best one according to the policy. This works well in general, but there are cases where the user knows which implementation they want: for benchmarking, debugging, cost control, or when the optimizer's choice is suboptimal for a specific workload.

The `physical=` dict gives users a direct way to pin a specific physical operator class and its constructor kwargs for any semantic operation, while leaving other operators in the query free for the optimizer to handle.

Usage
The `physical` dict requires an `"implementation"` key (the physical operator class). All other keys are forwarded as constructor kwargs to that operator, overriding the rule-generated defaults. This means any parameter the operator accepts (`model`, `embedding_model`, `chunk_size`, `prompt_strategy`, etc.) can be controlled.

Validation at query construction time ensures `"implementation"` is present and is a class. If the hint filters out all candidates for an operator, a warning is logged.

How it works
The feature touches four files in the optimizer pipeline:
- `LogicalOperator` (`logical.py`): stores the `physical` dict, includes it in `get_logical_op_params()` for copy support, and excludes it from `get_logical_id_params()` so it doesn't affect caching or logical plan identity.
- `Dataset` (`dataset.py`): threads `physical=` through `sem_filter`, `sem_map`, `sem_flat_map`, `sem_join`, `sem_agg` into the logical operator constructor.
- `ImplementationRule._perform_substitution` (`rules.py`): when building physical operators, extra kwargs from the `physical` dict are only injected when the `physical_op_class` matches the requested `"implementation"`. This prevents invalid kwargs from being passed to unrelated operators.
- `ApplyRule.perform` (`tasks.py`): after each implementation rule fires, a post-filter discards physical expressions whose exact type doesn't match `"implementation"`. A warning is logged if all candidates from a rule are filtered out.

How it interacts with the optimizer
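On the logical side, the hint is validated at construction and kept out of plan identity, which is why it can steer physical selection without disturbing caching. The following is a minimal sketch of that behavior using stand-in classes, not the actual palimpzest code: `LogicalOpStandIn`, `LLMFilterStandIn`, and `validate_physical_hint` are hypothetical names, and the exception types are assumptions.

```python
class LLMFilterStandIn:
    """Hypothetical stand-in for a physical operator class."""

    def __init__(self, model="default-model"):
        self.model = model


def validate_physical_hint(physical):
    """Check the hint shape: a dict whose "implementation" value is a class."""
    if physical is None:
        return None
    if not isinstance(physical, dict):
        raise TypeError("physical must be a dict")
    if "implementation" not in physical:
        raise ValueError('physical requires an "implementation" key')
    if not isinstance(physical["implementation"], type):
        raise TypeError('"implementation" must be a class')
    return physical


class LogicalOpStandIn:
    """Hypothetical stand-in for a logical operator that carries a hint."""

    def __init__(self, name, physical=None):
        self.name = name
        # Validation happens at construction time, before any optimization.
        self.physical = validate_physical_hint(physical)

    def get_logical_id_params(self):
        # Identity params: the hint is deliberately left out, so two operators
        # that differ only in their hint share the same logical identity.
        return {"name": self.name}

    def get_logical_op_params(self):
        # Full params: the hint is included so that copies keep it.
        return {**self.get_logical_id_params(), "physical": self.physical}

    def copy(self):
        return type(self)(**self.get_logical_op_params())
```

In this sketch, copying an operator preserves the hint, while a hinted and an unhinted operator with the same `name` report identical identity params.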
Every key in the `physical` dict (beyond `"implementation"`) overrides the corresponding constructor kwarg for all candidates of the matching class. For example, with `available_models=[GPT_4o_MINI, GPT_4o]`:

Operators without `physical=` are completely free: the optimizer explores all implementations and models as usual.

With `run()`: The Cascades optimizer generates all physical candidates per logical operator, then the hint filters and overrides them before costing. The optimizer costs and ranks the survivors normally.

With `optimize_and_run()`: The hint constrains which candidates enter the sentinel plan's MAB sampling. Only matching operators are sampled on the training data, avoiding wasted budget on operators the user doesn't want. After sentinel execution, the final plan selection respects the hint.

Transformation rules (filter push-down, convert reorder) are unaffected: the hint travels with the logical operator.
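The override and filtering behavior described in this section can be pictured with a small, self-contained sketch. It is an illustration under assumptions rather than the code in `rules.py` or `tasks.py`: the operator classes, `build_candidates`, and `post_filter` are hypothetical stand-ins, and models are plain strings.

```python
import logging

logger = logging.getLogger("hint_sketch")


class LLMConvertStandIn:
    """Hypothetical physical operator that accepts reasoning_effort."""

    def __init__(self, model, reasoning_effort=None):
        self.model = model
        self.reasoning_effort = reasoning_effort


class RAGConvertStandIn:
    """Hypothetical physical operator that does NOT accept reasoning_effort."""

    def __init__(self, model, chunk_size=1000):
        self.model = model
        self.chunk_size = chunk_size


def build_candidates(op_class, available_models, physical=None):
    """Build one candidate per model; inject hint kwargs only on a class match."""
    extra = {}
    if physical is not None and op_class is physical["implementation"]:
        extra = {k: v for k, v in physical.items() if k != "implementation"}
    # Hint kwargs come last so they override the rule-generated defaults.
    return [op_class(**{"model": m, **extra}) for m in available_models]


def post_filter(candidates, physical=None):
    """Keep only candidates whose exact type matches the hinted class."""
    if physical is None:
        return candidates
    kept = [c for c in candidates if type(c) is physical["implementation"]]
    if candidates and not kept:
        logger.warning("hint filtered out all candidates from this rule")
    return kept


models = ["GPT_4o_MINI", "GPT_4o"]
hint = {"implementation": LLMConvertStandIn, "reasoning_effort": "high"}

survivors = []
for rule_class in (LLMConvertStandIn, RAGConvertStandIn):
    survivors += post_filter(build_candidates(rule_class, models, hint), hint)
```

Here the hint leaves `model` unset, so both models survive as `LLMConvertStandIn` candidates with `reasoning_effort="high"`, while the `RAGConvertStandIn` candidates are built without the extra kwarg (which their constructor would reject) and then discarded with a warning. Adding `"model": "GPT_4o"` to the hint would instead force every surviving candidate onto that model.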
Validation
Unit tests in `tests/pytest/test_hints.py` cover filtering logic, propagation, copy, Dataset API, and validation.

Example 1: `run()` with hinted map

Output:

Example 2: `optimize_and_run()` with hinted map + extra kwargs

Output:

The map used `GPT_4o` with `reasoning_effort: high` as hinted. The filter was free to choose: the optimizer picked `GPT_4o_MINI` (cheaper) after sentinel sampling with both models.

Example 3: `optimize_and_run()` with MixtureOfAgents (partially pinned)

Here only `"implementation"` is specified, so the optimizer is free to choose proposer models, temperatures, and aggregator model from the available pool.

Output:

The map was pinned to `MixtureOfAgentsConvert`, but the optimizer chose the proposer model (`GPT_4o_MINI`), temperature (`0.4`), and aggregator model (`GPT_4o_MINI`) freely via sentinel sampling.

Test plan
- `tests/pytest/test_hints.py`: 18 unit tests (filtering, propagation, copy, validation, API)
- `run()`, `optimize_and_run()`, and `optimize_and_run()` + MixtureOfAgents (examples above)
- Existing test files (`test_rules.py`, `test_schemas.py`, `test_records.py`, `test_optimizer.py`)
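For readers who want a feel for what the filtering and validation checks exercise, here is a hedged sketch of three pytest-style tests. They are not copied from `tests/pytest/test_hints.py`; the helper functions, class names, and exception types are hypothetical stand-ins written only so the example runs on its own.

```python
class BaseConvertStandIn:
    """Hypothetical physical operator class."""


class SubConvertStandIn(BaseConvertStandIn):
    """Hypothetical subclass, used to show that matching is on exact type."""


def check_hint(physical):
    """Stand-in validator: requires an "implementation" key holding a class."""
    if "implementation" not in physical:
        raise ValueError('physical requires an "implementation" key')
    if not isinstance(physical["implementation"], type):
        raise TypeError('"implementation" must be a class')


def keep_exact(candidates, implementation):
    """Stand-in post-filter: exact type match, so subclasses are excluded."""
    return [c for c in candidates if type(c) is implementation]


def test_exact_type_match_excludes_subclasses():
    candidates = [BaseConvertStandIn(), SubConvertStandIn()]
    kept = keep_exact(candidates, BaseConvertStandIn)
    assert len(kept) == 1
    assert type(kept[0]) is BaseConvertStandIn


def test_rejects_missing_implementation_key():
    try:
        check_hint({"model": "some-model"})
    except ValueError:
        return
    raise AssertionError("expected ValueError for a missing key")


def test_rejects_non_class_implementation():
    try:
        check_hint({"implementation": "BaseConvertStandIn"})
    except TypeError:
        return
    raise AssertionError("expected TypeError for a non-class value")
```

The functions follow the `test_*` naming convention so pytest would collect them, but they avoid pytest-specific helpers and can also be called directly.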