fix: migrate from IBM alora to PEFT 0.18.1 native aLoRA #422
base: main
Conversation
The PR description has been updated. Please fill out the template for your PR to be reviewed.
Force-pushed from 89c4710 to c2fa5c8
Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.
🟢 Enforce conventional commit
Wonderful, this rule succeeded. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
Example logs from run with CUDA (integration + unit):
jakelorocco left a comment
Mostly LGTM; a few nits, and it looks like the tests are failing.
The new alora tests do fail in CI. Investigating.
Force-pushed from 36f4b55 to c741486
Migrated the `m train` command from the deprecated IBM alora package to PEFT 0.18+ native aLoRA support.
- Updated dependencies: removed `alora==0.2.0`, added `peft>=0.18.1`
- Replaced IBM imports with the PEFT native API (`LoraConfig`, `get_peft_model`)
- Changed invocation format: `invocation_string` → `alora_invocation_tokens` (list of token IDs)
- Added a comprehensive test suite: 4 unit tests + 2 integration tests with full adapter verification
- Tests validate config format, weight integrity, adapter loading, and inference with/without activation
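For context, here is a minimal sketch of what the invocation-format change looks like in practice; the model name and invocation string are placeholders, not values taken from this PR.

```python
# Illustrative only: turning a human-readable invocation string into the list of
# token IDs that the new alora_invocation_tokens format expects.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-3.3-2b-instruct")  # placeholder model

invocation_string = "<|start_of_role|>assistant<|end_of_role|>"  # placeholder invocation marker
alora_invocation_tokens = tokenizer(invocation_string, add_special_tokens=False)["input_ids"]
print(alora_invocation_tokens)  # a short list of integer token IDs
```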
Force-pushed from c741486 to 3e2a34e
The CI test failure is caused by not having a GPU; training needs to fall back to CPU when no GPU is available. We have the same issue running locally on Mac ARM (MPS) with the current PyTorch version, which is why we originally skipped the alora test on Mac. However, this means a Mac user cannot use alora at all, so I'm looking at whether we can detect MPS/back-level PyTorch and revert to CPU-only with a warning rather than fail.
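For illustration only, the kind of detection I mean is sketched below; the helper name and the PyTorch version cutoff are assumptions, not code from this PR.

```python
# Hypothetical sketch: detect an unusable MPS setup and fall back to CPU with a
# warning instead of failing the train command outright.
import warnings
import torch

def select_training_device() -> str:
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        # Assumed version gate for "back-level" PyTorch; the real cutoff may differ.
        major, minor = (int(x) for x in torch.__version__.split(".")[:2])
        if (major, minor) < (2, 4):
            warnings.warn("MPS training looks unsupported with this PyTorch build; using CPU.")
            return "cpu"
        return "mps"
    return "cpu"
```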
It's quite hard to get the initialization working in the train CLI such that, after detecting MPS/back-level PyTorch, it then uses CPU only. After a few attempts I feel an alternative is worthwhile: fail as now, but also add an option to use …
The MPS fallback was only partly working because multiple libraries are used in the training process. Forcing MPS to be disabled after detecting back-level PyTorch (falling back to CPU on macOS) now allows training to run and the tests to pass (the integration test takes around 43s).
@jakelorocco I'm thinking of keeping this PR just for the CLI/library improvements and working on the sample in an additional PR.
I got a failure on the alora training test on my laptop. I have a GPU, so I'm not sure why it's getting that error.
It worked fine on the remote environment, though.
Confirmed that training now works on both CUDA and MPS (M1 Max, 64 GB) 🥳. @psschwei I tried to reproduce that problem but wasn't able to. Are you still encountering that error? If so, it might be hardware-specific and require some collaboration to debug. There's a larger issue, though: using adapters as a requirement doesn't work because the … I started fixing that example and ran into some issues; I opened #423 to get feedback from the architects of the new Intrinsic and Adapter system. We should close out that issue before merging this PR.
Yes, it's still failing for me. I haven't spent any time debugging, but I assume it's something peculiar to my system, as I was able to run it elsewhere without issue. Looking at the error, it could be that there's a problem with how we offload to CPU(?). System info: GPU: NVIDIA RTX A1000. Full logs from the run (minus code coverage):
Update: I made a bunch of progress toward getting the example running, and also uncovered some issues along the way. I have a PR against @planetf1's clone (from which this PR comes) because I don't want to push into his repo without syncing with him first:
@nathan - thanks for the update - looks fine. On the example: I started there, but was going to return to it after the core was updated. The biggest issue, though, seems to be the new issue #423, which I'll also read through to understand more about the design options.
@psschwei I know it doesn't address the big issue, but you could now try …
Fix: Migrate m train to PEFT 0.18.1 Native aLoRA
Description
Migrates the `m train` command from IBM's deprecated `alora==0.2.0` package to PEFT 0.18.1+ native aLoRA support. This removes an external dependency and uses the officially supported PEFT API.
Key Changes:
- Removed the `alora==0.2.0` dependency
- Added `peft>=0.18.1`
- Replaced IBM imports with the PEFT native API (`LoraConfig`, `get_peft_model`)
- Uses the `alora_invocation_tokens` parameter (a list of token IDs) instead of `invocation_string`
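To make the change concrete, a minimal sketch of the PEFT-native setup is shown below; the model name, invocation string, and LoRA hyperparameters are placeholders rather than the values used in `cli/alora/train.py`.

```python
# Minimal sketch of PEFT 0.18+ native aLoRA configuration (placeholder values).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "ibm-granite/granite-3.3-2b-instruct"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# The invocation sequence is supplied as token IDs rather than a raw string.
invocation_tokens = tokenizer(
    "<|start_of_role|>assistant<|end_of_role|>",  # placeholder invocation string
    add_special_tokens=False,
)["input_ids"]

config = LoraConfig(
    r=32,  # placeholder hyperparameters
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj"],
    alora_invocation_tokens=invocation_tokens,  # aLoRA-specific field in PEFT 0.18+
)
peft_model = get_peft_model(model, config)
peft_model.print_trainable_parameters()
```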
Special Note:
I set peft to 0.18.1, not 0.18.0 (a minor update), since there are issues in swapping adapters and loading parameters which seemed as if they could affect the activities Mellea performs.
Hugging Face tests run on CUDA - working except for `FAILED test/backends/test_huggingface.py::test_error_during_generate_with_lock`, which seems to be a backend bug unrelated to this change.
Todos:
Implementation Checklist
Protocol Compliance
Integration
- Updated `cli/alora/train.py` with the PEFT native API
- Updated `docs/alora.md` documentation
- Added `test/cli/test_alora_train.py` (4 tests, all passing)
- Added `test/cli/test_alora_train_integration.py` (2 tests, verified on CUDA)