
fix(inference): save merged LoRA adapter in flat layout for vLLM#58

Open
Manuscrit wants to merge 2 commits into longtermrisk:v0.9 from slacki-ai:fix/multi-lora-save-flat

Conversation

@Manuscrit
Collaborator

Summary

 - When merging multiple LoRA adapters via PEFT, `save_pretrained` creates a subdirectory per adapter (e.g. `/tmp/merged_lora/combined/adapter_config.json`), but vLLM expects a flat layout (`/tmp/merged_lora/adapter_config.json`)
 - Fix: delete the source adapters before saving so that only the "combined" adapter remains, producing the flat layout vLLM expects
 - Without this fix, multi-LoRA inference jobs crash immediately with `FileNotFoundError: No such file or directory: /tmp/merged_lora/adapter_config.json`

 ## Test plan
 - [ ] Run a multi-LoRA inference job (2+ adapters) and verify it completes without the `FileNotFoundError`
 - [ ] Verify `/tmp/merged_lora/adapter_config.json` exists at the top level after the merge
 - [ ] Verify single-adapter inference (no merge path) still works unchanged
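The layout mismatch in the summary can be illustrated with a small stdlib-only sketch. This is not the PR's code: the `flatten_adapter_dir` helper is hypothetical, and the adapter name `combined` is just the name used in this PR's description. It shows a post-save alternative to the delete-before-save fix, moving whatever PEFT wrote under `save_dir/<adapter_name>/` up to `save_dir/`, where vLLM looks for `adapter_config.json`:

```python
import os
import shutil


def flatten_adapter_dir(save_dir: str, adapter_name: str = "combined") -> None:
    """Move files saved under save_dir/<adapter_name>/ up into save_dir.

    Hypothetical post-save alternative to the PR's delete-before-save fix:
    vLLM expects adapter_config.json at the top level of save_dir, while
    PEFT's save_pretrained writes one subdirectory per loaded adapter.
    """
    subdir = os.path.join(save_dir, adapter_name)
    if not os.path.isdir(subdir):
        return  # already flat -- nothing to do
    for fname in os.listdir(subdir):
        shutil.move(os.path.join(subdir, fname), os.path.join(save_dir, fname))
    os.rmdir(subdir)
```

The PR takes the delete-before-save route instead, which avoids touching the filesystem twice; the sketch above only demonstrates what "flat layout" means in terms of paths.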

slacki-ai and others added 2 commits April 4, 2026 10:10
…nference

When `lora_adapters` (List[str]) is supplied, the job merges all adapters
into a single combined adapter via PEFT linear combination on CPU before
vLLM is initialised. This keeps the merged rank identical to the input
rank so vLLM's max_lora_rank constraint is never violated.

Key changes:
- `InferenceConfig`: new `lora_adapters` field; validated to require ≥ 2
  entries (single adapter stays in `model` as before, preserving compat).
- `InferenceJobs.create()`: client-side rank-equality assertion across all
  adapters, with a clear error before any GPU time is spent.
- `cli.py`: new `download_adapter()` helper (handles org/repo/subfolder
  paths); new `merge_lora_adapters()` runs PEFT `add_weighted_adapter`
  (combination_type="linear") on CPU, saves the combined adapter to
  /tmp/merged_lora/, then frees memory before vLLM loads.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When multiple LoRA adapters are loaded, PEFT's save_pretrained creates
subdirectories per adapter (e.g. /tmp/merged_lora/combined/). vLLM
expects adapter_config.json at the top level. Delete the source adapters
before saving so only "combined" remains, producing a flat layout.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
