
@planetf1 (Contributor) commented Feb 6, 2026

Fix: Migrate m train to PEFT 0.18.1 Native aLoRA

Description

Migrates the m train command from IBM's deprecated alora==0.2.0 package to PEFT 0.18.1+ native aLoRA support. This removes an external dependency and uses the officially supported PEFT API.

Key Changes:

  • Removed alora==0.2.0 dependency
  • Updated to peft>=0.18.1
  • Replaced IBM-specific imports with PEFT native API (LoraConfig, get_peft_model)
  • Updated to use the alora_invocation_tokens parameter (a list of token IDs) instead of invocation_string (see the sketch below)
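
For reference, a minimal sketch of what the migrated configuration looks like under the PEFT-native API. The invocation string, rank, and target modules here are illustrative, not the exact values used by m train:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "ibm-granite/granite-4.0-micro"  # illustrative base model
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# aLoRA activates on a sequence of token IDs rather than a raw string,
# so the old invocation_string must be tokenized first.
invocation_tokens = tokenizer.encode(
    "<|start_of_role|>classify<|end_of_role|>", add_special_tokens=False
)

config = LoraConfig(
    r=32,                                           # illustrative rank
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj"],  # illustrative
    alora_invocation_tokens=invocation_tokens,
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model, config)
```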

Special Note:

I pinned peft to 0.18.1 rather than 0.18.0 (a minor update), since 0.18.0 has issues with swapping adapters and loading parameters that looked as if they could affect the operations mellea performs.

Hugging Face tests run on CUDA are passing, except for test/backends/test_huggingface.py::test_error_during_generate_with_lock, which appears to be an unrelated backend bug.

Todos:

  • Extend the test to do inference with the mellea backend
  • Fix up the alora 101 sample for further verification

Implementation Checklist

Protocol Compliance

  • Maintains backward compatibility - existing adapters work unchanged
  • Only affects training workflow, inference unchanged

Integration

  • Updated cli/alora/train.py with PEFT native API
  • Updated docs/alora.md documentation

Testing

  • Unit tests added to test/cli/test_alora_train.py (4 tests, all passing)
  • Integration tests added to test/cli/test_alora_train_integration.py (2 tests, verified on CUDA); a loading sketch follows below
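
As a rough illustration of the adapter-loading verification these tests perform (paths are illustrative, and the config field check assumes PEFT 0.18+ records the invocation tokens in adapter_config.json):

```python
from peft import PeftConfig, PeftModel
from transformers import AutoModelForCausalLM

adapter_path = "test_alora_adapter"  # illustrative output directory

# The saved adapter config should carry the new aLoRA field.
cfg = PeftConfig.from_pretrained(adapter_path)
assert getattr(cfg, "alora_invocation_tokens", None), (
    "aLoRA adapters trained with PEFT 0.18+ record their invocation token IDs"
)

# The adapter should load cleanly onto its base model.
base = AutoModelForCausalLM.from_pretrained(cfg.base_model_name_or_path)
model = PeftModel.from_pretrained(base, adapter_path)
```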

@github-actions bot commented Feb 6, 2026

The PR description has been updated. Please fill out the template for your PR to be reviewed.

planetf1 force-pushed the fix/issue-385-peft-migration branch from 89c4710 to c2fa5c8 on February 6, 2026 13:06
@mergify bot commented Feb 6, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert|release)(?:\(.+\))?:

@planetf1 (Contributor, Author) commented Feb 6, 2026

Example logs from a run with CUDA (integration + unit):

============================= test session starts ==============================
platform linux -- Python 3.12.12, pytest-9.0.2, pluggy-1.6.0 -- /u/jonesn/.conda/envs/mellea/bin/python3
cachedir: .pytest_cache
rootdir: /proj/dmfexp/eiger/users/jonesn/mellea
configfile: pyproject.toml
plugins: nbmake-1.5.5, asyncio-1.3.0, Faker-40.1.2, timeout-2.4.0, langsmith-0.6.6, anyio-4.12.1, cov-7.0.0
asyncio: mode=Mode.AUTO, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
timeout: 900.0s
timeout method: signal
timeout func_only: False
collecting ... collected 6 items

test/cli/test_alora_train.py::test_alora_config_creation PASSED          [ 16%]
test/cli/test_alora_train.py::test_lora_config_creation PASSED           [ 33%]
test/cli/test_alora_train.py::test_invocation_prompt_tokenization PASSED [ 50%]
test/cli/test_alora_train.py::test_imports_work PASSED                   [ 66%]
test/cli/test_alora_train_integration.py::test_alora_training_integration PASSED [ 83%]
test/cli/test_alora_train_integration.py::test_lora_training_integration PASSED [100%]

=============================== warnings summary ===============================
test/cli/test_alora_train.py::test_alora_config_creation
test/cli/test_alora_train.py::test_lora_config_creation
test/cli/test_alora_train.py::test_invocation_prompt_tokenization
test/cli/test_alora_train_integration.py::test_alora_training_integration
test/cli/test_alora_train_integration.py::test_lora_training_integration
  /u/jonesn/.conda/envs/mellea/lib/python3.12/site-packages/trl/trainer/sft_config.py:257: DeprecationWarning: `max_seq_length` is deprecated and will be removed in version 0.20.0. Use `max_length` instead.
    warnings.warn(

test/cli/test_alora_train_integration.py::test_alora_training_integration
test/cli/test_alora_train_integration.py::test_lora_training_integration
  /u/jonesn/.conda/envs/mellea/lib/python3.12/site-packages/trl/trainer/utils.py:103: DeprecationWarning: This class is deprecated and will be removed in version 0.20.0. To train on completion only, please use the parameter `completion_only_loss` of `SFTConfig` instead.
    warnings.warn(

test/cli/test_alora_train_integration.py::test_alora_training_integration
test/cli/test_alora_train_integration.py::test_lora_training_integration
  /u/jonesn/.conda/envs/mellea/lib/python3.12/site-packages/trl/trainer/sft_trainer.py:678: DeprecationWarning: Failed to apply the formatting function due to the following error: string index out of range. This may be because the function is designed for batched input. Please update it to process one example at a time (i.e., accept and return a single example). For now, we will attempt to apply the function in batched mode, but note that batched formatting is deprecated and will be removed in version 0.21.
    warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html

@jakelorocco (Contributor) left a comment

Mostly LGTM; a few nits, and it looks like the tests are failing.

@planetf1 (Contributor, Author) commented Feb 6, 2026

The new alora tests do fail in CI with:

FAILED test/cli/test_alora_train_integration.py::test_alora_training_integration - NotImplementedError: Cannot copy out of meta tensor; no data! Please use torch.nn.Module.to_empty() instead of torch.nn.Module.to() when moving module from meta to a different device.
FAILED test/cli/test_alora_train_integration.py::test_lora_training_integration - NotImplementedError: Cannot copy out of meta tensor; no data! Please use torch.nn.Module.to_empty() instead of torch.nn.Module.to() when moving module from meta to a different device.
= 2 failed, 239 passed, 109 skipped, 1 xpassed, 25 warnings in 562.56s (0:09:22) =

Investigating

planetf1 force-pushed the fix/issue-385-peft-migration branch 2 times, most recently from 36f4b55 to c741486, on February 6, 2026 14:07
Migrated m train command from deprecated IBM alora package to PEFT 0.18+ native aLoRA support.
- Updated dependencies: removed alora==0.2.0, added peft>=0.18.1
- Replaced IBM imports with PEFT native API (LoraConfig, get_peft_model)
- Changed invocation format: invocation_string → alora_invocation_tokens (list of token IDs)
- Added comprehensive test suite: 4 unit tests + 2 integration tests with full adapter verification
- Tests validate config format, weight integrity, adapter loading, and inference with/without activation
planetf1 force-pushed the fix/issue-385-peft-migration branch from c741486 to 3e2a34e on February 6, 2026 14:09
@planetf1 (Contributor, Author) commented Feb 6, 2026

The CI test failure is caused by not having a GPU; training needs to fall back to CPU when no GPU is available. I think that's now fixed.

We hit the same issue running locally on Mac ARM (MPS) with the current PyTorch version, which is why we originally skipped the alora tests on Mac. However, this means a Mac user cannot use alora at all -- so I'm looking at whether we can detect MPS with a backlevel PyTorch and fall back to CPU-only with a warning rather than failing (a sketch follows).
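
A hypothetical sketch of that detection logic. The helper name is made up; the 2.8.0 threshold matches the warning in the integration test shown later in this thread:

```python
import torch
from packaging import version

def resolve_training_device() -> str:
    """Pick a training device, falling back to CPU when MPS support is too old."""
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        # Older PyTorch builds hit gradient-scaling issues on MPS during training.
        if version.parse(torch.__version__) < version.parse("2.8.0"):
            print("Warning: MPS available but PyTorch < 2.8.0; falling back to CPU.")
            return "cpu"
        return "mps"
    return "cpu"
```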

@planetf1 (Contributor, Author) commented Feb 6, 2026

It's quite hard to get initialization in the train CLI to fall back to CPU only after detecting MPS with a backlevel PyTorch.

After a few attempts, I feel an alternative is worthwhile: fail as we do now, but also add a --device=cpu option that forces CPU only and could be used on a Mac (or any system) when we want to avoid GPU auto-detection. A rough sketch follows.
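
A rough sketch of how such an option could bypass auto-detection entirely; the CLI wiring here (typer, option name apart from --device) is an assumption, not mellea's actual code:

```python
import torch
import typer

app = typer.Typer()

@app.command()
def train(device: str = typer.Option("auto", help="auto, cpu, cuda, or mps")) -> None:
    if device != "auto":
        resolved = device  # explicit override: no GPU/MPS probing at all
    elif torch.cuda.is_available():
        resolved = "cuda"
    else:
        resolved = "cpu"
    typer.echo(f"Training on device: {resolved}")

if __name__ == "__main__":
    app()
```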

@planetf1 (Contributor, Author) commented Feb 6, 2026

The MPS fallback was only partly working because the training process uses multiple libraries. Forcibly disabling MPS after detecting a backlevel PyTorch (the CPU fallback on macOS) now allows training to run and the tests to pass (the integration test takes around 43s).

@planetf1 (Contributor, Author) commented Feb 6, 2026

@jakelorocco I'm thinking of keeping this PR just for the CLI/library improvements and working on the sample in a separate PR.

planetf1 marked this pull request as ready for review and requested a review from jakelorocco on February 6, 2026 16:25
@psschwei (Member) commented Feb 6, 2026

I got a failure on the alora training test on my laptop:
FAILED test/cli/test_alora_train_integration.py::test_alora_training_integration - NotImplementedError: Cannot copy out of meta tensor; no data! Please use torch.nn.Module.to_empty() instead of torch.nn.Module.to() when moving module from meta to a ...

I have a GPU, so I'm not sure why it's hitting that error.

@psschwei (Member) commented Feb 6, 2026

It worked fine in the remote environment, though.

@nrfulton (Member) commented Feb 8, 2026

Confirmed that training now works on both CUDA and MPS (M1 Max, 64 GB) 🥳.

@psschwei I tried to reproduce that problem but wasn't able to. Are you still encountering that error? If so, it might be hardware-specific and require some collaboration to debug.

There's a larger issue, though. Requiring adapters doesn't work because 101_example.py was never updated after the rewrite of our adapter logic to use granite-common for intrinsics.

I started fixing that example and ran into some issues; I opened #423 to get feedback from the architects of the new Intrinsic and Adapter system. We should close out that issue before merging this PR.

@psschwei (Member) commented Feb 8, 2026

Are you still encountering that error? If so, it might be hardware-specific and require some collaboration to debug.

Yes, it's still failing for me. I haven't spent any time debugging, but I assume it's something peculiar to my system, as I was able to run elsewhere without issue.

Looking at the error, there could be a problem with how we offload to CPU; a possible reconstruction follows.
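
If that's the case, a plausible reconstruction of the failure; the device_map choice and the mitigation are guesses based on the accelerate warning in the log below, not confirmed root cause:

```python
from transformers import AutoModelForCausalLM

model_id = "ibm-granite/granite-4.0-micro"

# Suspected failure path on a small (4 GiB) GPU: device_map="auto" offloads
# some weights to CPU and leaves them as meta tensors, and the Trainer's
# later model.to(device) then raises "Cannot copy out of meta tensor".
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Possible mitigation: load onto a single explicit device so no parameters
# are ever placed on the meta device.
model = AutoModelForCausalLM.from_pretrained(model_id)
model = model.to("cpu")  # or "cuda" when the model fits in VRAM
```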

System info:

GPU: NVIDIA RTX A1000
Driver: 580.119.02
Memory: 4096 MiB
OS: GNU/Linux (Fedora 43)
Kernel: 6.18.8-200.fc43.x86_64
Arch: x86_64

Full logs from the run (minus code coverage):

$ uv run pytest test/cli/test_alora_train_integration.py::test_alora_training_integration
=========================================================================== test session starts ===========================================================================
platform linux -- Python 3.12.8, pytest-9.0.0, pluggy-1.6.0
rootdir: /home/paulschw/generative-computing/mellea-pr-422
configfile: pyproject.toml
plugins: timeout-2.4.0, cov-7.0.0, anyio-4.11.0, asyncio-1.3.0, nbmake-1.5.5, Faker-37.12.0, langsmith-0.6.6
timeout: 900.0s
timeout method: signal
timeout func_only: False
asyncio: mode=Mode.AUTO, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 1 item                                                                                                                                                          

test/cli/test_alora_train_integration.py F                                                                                                                          [100%]

================================================================================ FAILURES =================================================================================
_____________________________________________________________________ test_alora_training_integration _____________________________________________________________________

    @pytest.mark.huggingface
    @pytest.mark.llm
    def test_alora_training_integration():
        """Integration test: Train a tiny aLoRA adapter and verify it works.
    
        This test:
        1. Creates a minimal training dataset (5 samples)
        2. Trains an aLoRA adapter for 1 epoch using a small model
        3. Verifies adapter files are created with correct PEFT 0.18+ format
        4. Cleans up temporary files
    
        Uses ibm-granite/granite-4.0-micro (smallest Granite model, 3B params).
        """
        from cli.alora.train import train_model
    
        # Force CPU if MPS is available but PyTorch is too old
        if _mps_needs_cpu_fallback:
            import os
    
            # Disable MPS entirely to force CPU usage
            os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "0"
            print(
                "⚠  Warning: MPS available but PyTorch < 2.8.0. "
                "Disabling MPS to run on CPU and avoid gradient scaling issues."
            )
    
        # Create temporary directory for test artifacts
        with tempfile.TemporaryDirectory() as tmpdir:
            tmpdir_path = Path(tmpdir)
    
            # Create minimal training dataset (5 samples)
            dataset_path = tmpdir_path / "train.jsonl"
            training_data = [
                {"item": "Flywheel imbalance detected.", "label": "flywheel"},
                {"item": "Connecting rod bent.", "label": "connecting rod"},
                {"item": "Piston crown cracked.", "label": "piston"},
                {"item": "Oil seepage around rings.", "label": "piston rings"},
                {"item": "Carburetor obstructed.", "label": "mini-carburetor"},
            ]
    
            with open(dataset_path, "w") as f:
                for item in training_data:
                    f.write(json.dumps(item) + "\n")
    
            # Output path for adapter
            adapter_path = tmpdir_path / "test_alora_adapter"
    
            # Train aLoRA adapter with minimal settings
            # Using smallest Granite model: granite-4.0-micro (3B params)
>           train_model(
                dataset_path=str(dataset_path),
                base_model="ibm-granite/granite-4.0-micro",
                output_file=str(adapter_path),
                adapter="alora",
                epochs=1,  # Just 1 epoch for speed
                learning_rate=6e-6,
                batch_size=1,  # Minimal batch size
                max_length=512,  # Shorter sequences
                grad_accum=1,  # No gradient accumulation
            )

test/cli/test_alora_train_integration.py:72: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
cli/alora/train.py:158: in train_model
    trainer = SafeSaveTrainer(
.venv/lib/python3.12/site-packages/trl/trainer/sft_trainer.py:506: in __init__
    super().__init__(
.venv/lib/python3.12/site-packages/transformers/utils/deprecation.py:172: in wrapped_func
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/transformers/trainer.py:619: in __init__
    self._move_model_to_device(model, args.device)
.venv/lib/python3.12/site-packages/transformers/trainer.py:895: in _move_model_to_device
    model = model.to(device)
            ^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/torch/nn/modules/module.py:1355: in to
    return self._apply(convert)
           ^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/torch/nn/modules/module.py:915: in _apply
    module._apply(fn)
.venv/lib/python3.12/site-packages/torch/nn/modules/module.py:915: in _apply
    module._apply(fn)
.venv/lib/python3.12/site-packages/torch/nn/modules/module.py:915: in _apply
    module._apply(fn)
.venv/lib/python3.12/site-packages/torch/nn/modules/module.py:915: in _apply
    module._apply(fn)
.venv/lib/python3.12/site-packages/torch/nn/modules/module.py:915: in _apply
    module._apply(fn)
.venv/lib/python3.12/site-packages/torch/nn/modules/module.py:915: in _apply
    module._apply(fn)
.venv/lib/python3.12/site-packages/torch/nn/modules/module.py:942: in _apply
    param_applied = fn(param)
                    ^^^^^^^^^
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

t = Parameter containing:
tensor(..., device='meta', size=(2560,))

    def convert(t):
        try:
            if convert_to_format is not None and t.dim() in (4, 5):
                return t.to(
                    device,
                    dtype if t.is_floating_point() or t.is_complex() else None,
                    non_blocking,
                    memory_format=convert_to_format,
                )
            return t.to(
                device,
                dtype if t.is_floating_point() or t.is_complex() else None,
                non_blocking,
            )
        except NotImplementedError as e:
            if str(e) == "Cannot copy out of meta tensor; no data!":
>               raise NotImplementedError(
                    f"{e} Please use torch.nn.Module.to_empty() instead of torch.nn.Module.to() "
                    f"when moving module from meta to a different device."
                ) from None
E               NotImplementedError: Cannot copy out of meta tensor; no data! Please use torch.nn.Module.to_empty() instead of torch.nn.Module.to() when moving module from meta to a different device.

.venv/lib/python3.12/site-packages/torch/nn/modules/module.py:1348: NotImplementedError
-------------------------------------------------------------------------- Captured stderr call ---------------------------------------------------------------------------
Loading checkpoint shards: 100%|██████████| 2/2 [00:03<00:00,  1.89s/it]
Applying formatting function to train dataset:   0%|          | 0/4 [00:00<?, ? examples/s]
Applying formatting function to train dataset: 100%|██████████| 4/4 [00:00<00:00, 1705.87 examples/s]
Adding EOS to train dataset: 100%|██████████| 4/4 [00:00<00:00, 1714.41 examples/s]
Tokenizing train dataset: 100%|██████████| 4/4 [00:00<00:00, 953.58 examples/s]
Truncating train dataset: 100%|██████████| 4/4 [00:00<00:00, 1745.44 examples/s]
Applying formatting function to eval dataset:   0%|          | 0/1 [00:00<?, ? examples/s]
Applying formatting function to eval dataset: 100%|██████████| 1/1 [00:00<00:00, 542.39 examples/s]
Adding EOS to eval dataset: 100%|██████████| 1/1 [00:00<00:00, 499.86 examples/s]
Tokenizing eval dataset: 100%|██████████| 1/1 [00:00<00:00, 315.67 examples/s]
Truncating eval dataset: 100%|██████████| 1/1 [00:00<00:00, 491.42 examples/s]
---------------------------------------------------------------------------- Captured log call ----------------------------------------------------------------------------
WARNING  accelerate.big_modeling:big_modeling.py:442 Some parameters are on the meta device because they were offloaded to the cpu.
============================================================================ warnings summary =============================================================================
test/cli/test_alora_train_integration.py::test_alora_training_integration
  /home/paulschw/generative-computing/mellea-pr-422/.venv/lib/python3.12/site-packages/trl/trainer/utils.py:103: DeprecationWarning: This class is deprecated and will be removed in version 0.20.0. To train on completion only, please use the parameter `completion_only_loss` of `SFTConfig` instead.
    warnings.warn(

test/cli/test_alora_train_integration.py::test_alora_training_integration
  /home/paulschw/generative-computing/mellea-pr-422/.venv/lib/python3.12/site-packages/trl/trainer/sft_config.py:257: DeprecationWarning: `max_seq_length` is deprecated and will be removed in version 0.20.0. Use `max_length` instead.
    warnings.warn(

test/cli/test_alora_train_integration.py::test_alora_training_integration
  /home/paulschw/generative-computing/mellea-pr-422/.venv/lib/python3.12/site-packages/trl/trainer/sft_trainer.py:678: DeprecationWarning: Failed to apply the formatting function due to the following error: string index out of range. This may be because the function is designed for batched input. Please update it to process one example at a time (i.e., accept and return a single example). For now, we will attempt to apply the function in batched mode, but note that batched formatting is deprecated and will be removed in version 0.21.
    warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html

<snip code coverage>

========================================================================= short test summary info =========================================================================
FAILED test/cli/test_alora_train_integration.py::test_alora_training_integration - NotImplementedError: Cannot copy out of meta tensor; no data! Please use torch.nn.Module.to_empty() instead of torch.nn.Module.to() when moving module from meta to a ...
===================================================================== 1 failed, 3 warnings in 19.00s ======================================================================

@nrfulton (Member) commented Feb 9, 2026

Update: I made a bunch of progress toward getting the example running and also uncovered some issues along the way. I have a PR against @planetf1's clone (from which this PR comes), because I don't want to push into his repo without syncing with him first:

planetf1#1

@planetf1 (Contributor, Author) commented Feb 9, 2026

@nathan - thanks for the update - looks fine. On the example: I started there, but was planning to return to it after the core was updated.
@psschwei the tensor error was one I saw before I added a change for CI - are you using the latest code? 4GB is also very limited; maybe a more explicit way is needed to force CPU only. I had originally added this while working on the PR, but then managed to address CI / large CUDA / macOS through detection ...

The biggest issue, though, seems to be the new issue #423, which I'll also read through to understand more about the design options.

@planetf1 (Contributor, Author) commented Feb 9, 2026

@psschwei I know it doesn't address the big issue, but you could now try --device=cpu on the latest code. I don't really have a system I can test this on ...
