fix(inference): save merged LoRA adapter in flat layout for vLLM#58
Open
Manuscrit wants to merge 2 commits into longtermrisk:v0.9 from
Conversation
…nference

When `lora_adapters` (List[str]) is supplied, the job merges all adapters into a single combined adapter via PEFT linear combination on CPU before vLLM is initialised. This keeps the merged rank identical to the input rank, so vLLM's `max_lora_rank` constraint is never violated.

Key changes:
- `InferenceConfig`: new `lora_adapters` field; validated to require ≥ 2 entries (a single adapter stays in `model` as before, preserving compatibility).
- `InferenceJobs.create()`: client-side rank-equality assertion across all adapters, raising a clear error before any GPU time is spent.
- `cli.py`: new `download_adapter()` helper (handles org/repo/subfolder paths); new `merge_lora_adapters()` runs PEFT `add_weighted_adapter` (combination_type="linear") on CPU, saves the combined adapter to `/tmp/merged_lora/`, then frees memory before vLLM loads.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
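The rank-preservation claim above can be illustrated with a pure-Python sketch of the "linear" combination that PEFT's `add_weighted_adapter` performs: each adapter's low-rank matrices are summed element-wise with scalar weights, so the merged matrices keep the same shape, and therefore the same rank bound, as the inputs. The `linear_combine` helper and the example matrices below are illustrative stand-ins, not code from this PR:

```python
def linear_combine(mats, weights):
    """Element-wise weighted sum of equally shaped weight matrices.

    Mirrors the shape-preserving behaviour of a "linear" LoRA merge:
    the result has the same (r x in_features) shape as every input,
    so the merged rank can never exceed the input rank.
    """
    rows, cols = len(mats[0]), len(mats[0][0])
    out = [[0.0] * cols for _ in range(rows)]
    for mat, w in zip(mats, weights):
        for i in range(rows):
            for j in range(cols):
                out[i][j] += w * mat[i][j]
    return out

# Two hypothetical rank-2 LoRA "A" matrices (r x in_features = 2 x 4).
a1 = [[1.0, 0.0, 0.0, 0.0],
      [0.0, 1.0, 0.0, 0.0]]
a2 = [[0.0, 0.0, 1.0, 0.0],
      [0.0, 0.0, 0.0, 1.0]]

# Equal-weight linear merge: output is still 2 x 4, so rank stays <= 2,
# and vLLM's max_lora_rank check sees the same rank as before the merge.
merged = linear_combine([a1, a2], [0.5, 0.5])
```

This is why the client-side rank-equality assertion matters: a linear combination requires all adapters to share the same rank in the first place.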
When multiple LoRA adapters are loaded, PEFT's `save_pretrained` creates a subdirectory per adapter (e.g. `/tmp/merged_lora/combined/`), but vLLM expects `adapter_config.json` at the top level. Delete the source adapters before saving so only "combined" remains, producing a flat layout.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
## Summary
- When merging multiple LoRA adapters via PEFT, `save_pretrained` creates a subdirectory per adapter (e.g. `/tmp/merged_lora/combined/adapter_config.json`), but vLLM expects a flat layout (`/tmp/merged_lora/adapter_config.json`)
- Fix: delete the source adapters before saving so only the "combined" adapter remains, producing the flat layout vLLM expects
- Without this fix, multi-LoRA inference jobs crash immediately with `FileNotFoundError: No such file or directory: /tmp/merged_lora/adapter_config.json`
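The layout problem and fix can be sketched with plain filesystem code. This is an illustrative mock, not the PR's actual PEFT code: `save_flat` and the in-memory `adapters` dict are hypothetical stand-ins for the model's adapter registry; in the real job, the equivalent step is deleting the source adapters on the PEFT model so that only the combined adapter is written, at the directory root:

```python
import json
import os
import tempfile

def save_flat(adapters, out_dir, keep="combined"):
    """Mock of the fix: drop every adapter except the merged one before
    saving. With a single adapter left, its config is written at the top
    level of out_dir (flat layout) rather than a per-adapter subdirectory.
    """
    for name in list(adapters):
        if name != keep:
            del adapters[name]  # mirrors deleting source adapters pre-save
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, "adapter_config.json"), "w") as f:
        json.dump(adapters[keep], f)

# Three adapters in memory: two sources plus the merged "combined" one.
out_dir = tempfile.mkdtemp()
save_flat(
    {"adapter_a": {"r": 16}, "adapter_b": {"r": 16}, "combined": {"r": 16}},
    out_dir,
)
# adapter_config.json now sits at the top level, where vLLM looks for it.
flat_path = os.path.join(out_dir, "adapter_config.json")
```

Without the deletion step, the per-adapter subdirectories survive and the top-level `adapter_config.json` never exists, which is exactly the `FileNotFoundError` described above.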
## Test plan
- [ ] Run a multi-LoRA inference job (2+ adapters) and verify it completes without the FileNotFoundError
- [ ] Verify /tmp/merged_lora/adapter_config.json exists at the top level after merge
- [ ] Verify single-adapter inference (no merge path) still works unchanged