
@planetf1 (Contributor) commented Feb 6, 2026

Fix: Migrate m train to PEFT 0.18.1 Native aLoRA

Description

Migrates the m train command from IBM's deprecated alora==0.2.0 package to PEFT 0.18.1+ native aLoRA support. This removes an external dependency and uses the officially supported PEFT API.

Key Changes:

  • Removed alora==0.2.0 dependency
  • Updated to peft>=0.18.1
  • Replaced IBM-specific imports with PEFT native API (LoraConfig, get_peft_model)
  • Updated to use the alora_invocation_tokens parameter (a list of token IDs) instead of invocation_string (see the sketch below)
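
For reference, a minimal sketch of what the migrated configuration looks like under the PEFT-native API. The invocation string, rank, and target modules here are illustrative, not the exact values used by m train:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "ibm-granite/granite-4.0-micro"  # illustrative base model
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# aLoRA activates on a sequence of token IDs rather than a raw string,
# so the old invocation_string must be tokenized first.
invocation_tokens = tokenizer.encode(
    "<|start_of_role|>classify<|end_of_role|>", add_special_tokens=False
)

config = LoraConfig(
    r=32,                                           # illustrative rank
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj"],  # illustrative
    alora_invocation_tokens=invocation_tokens,
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model, config)
```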

Special Note:

I pinned peft to 0.18.1 rather than 0.18.0 (a minor update), since 0.18.0 has issues with swapping adapters and loading parameters that looked as if they could affect the operations mellea performs.

Hugging Face tests run on CUDA are passing, except for test/backends/test_huggingface.py::test_error_during_generate_with_lock, which appears to be an unrelated backend bug.

Todos:

  • Extend the test to do inference with the mellea backend
  • Fix up the alora 101 sample for further verification

Implementation Checklist

Protocol Compliance

  • Maintains backward compatibility - existing adapters work unchanged
  • Only affects training workflow, inference unchanged

Integration

  • Updated cli/alora/train.py with PEFT native API
  • Updated docs/alora.md documentation

Testing

  • Unit tests added to test/cli/test_alora_train.py (4 tests, all passing)
  • Integration tests added to test/cli/test_alora_train_integration.py (2 tests, verified on CUDA); a loading sketch follows below
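
As a rough illustration of the adapter-loading verification these tests perform (paths are illustrative, and the config field check assumes PEFT 0.18+ records the invocation tokens in adapter_config.json):

```python
from peft import PeftConfig, PeftModel
from transformers import AutoModelForCausalLM

adapter_path = "test_alora_adapter"  # illustrative output directory

# The saved adapter config should carry the new aLoRA field.
cfg = PeftConfig.from_pretrained(adapter_path)
assert getattr(cfg, "alora_invocation_tokens", None), (
    "aLoRA adapters trained with PEFT 0.18+ record their invocation token IDs"
)

# The adapter should load cleanly onto its base model.
base = AutoModelForCausalLM.from_pretrained(cfg.base_model_name_or_path)
model = PeftModel.from_pretrained(base, adapter_path)
```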

@github-actions bot commented Feb 6, 2026

The PR description has been updated. Please fill out the template for your PR to be reviewed.

planetf1 force-pushed the fix/issue-385-peft-migration branch from 89c4710 to c2fa5c8 on February 6, 2026 13:06
@mergify bot commented Feb 6, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert|release)(?:\(.+\))?:

@planetf1 (Contributor, Author) commented Feb 6, 2026

Example logs from a run with CUDA (integration + unit):

============================= test session starts ==============================
platform linux -- Python 3.12.12, pytest-9.0.2, pluggy-1.6.0 -- /u/jonesn/.conda/envs/mellea/bin/python3
cachedir: .pytest_cache
rootdir: /proj/dmfexp/eiger/users/jonesn/mellea
configfile: pyproject.toml
plugins: nbmake-1.5.5, asyncio-1.3.0, Faker-40.1.2, timeout-2.4.0, langsmith-0.6.6, anyio-4.12.1, cov-7.0.0
asyncio: mode=Mode.AUTO, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
timeout: 900.0s
timeout method: signal
timeout func_only: False
collecting ... collected 6 items

test/cli/test_alora_train.py::test_alora_config_creation PASSED          [ 16%]
test/cli/test_alora_train.py::test_lora_config_creation PASSED           [ 33%]
test/cli/test_alora_train.py::test_invocation_prompt_tokenization PASSED [ 50%]
test/cli/test_alora_train.py::test_imports_work PASSED                   [ 66%]
test/cli/test_alora_train_integration.py::test_alora_training_integration PASSED [ 83%]
test/cli/test_alora_train_integration.py::test_lora_training_integration PASSED [100%]

=============================== warnings summary ===============================
test/cli/test_alora_train.py::test_alora_config_creation
test/cli/test_alora_train.py::test_lora_config_creation
test/cli/test_alora_train.py::test_invocation_prompt_tokenization
test/cli/test_alora_train_integration.py::test_alora_training_integration
test/cli/test_alora_train_integration.py::test_lora_training_integration
  /u/jonesn/.conda/envs/mellea/lib/python3.12/site-packages/trl/trainer/sft_config.py:257: DeprecationWarning: `max_seq_length` is deprecated and will be removed in version 0.20.0. Use `max_length` instead.
    warnings.warn(

test/cli/test_alora_train_integration.py::test_alora_training_integration
test/cli/test_alora_train_integration.py::test_lora_training_integration
  /u/jonesn/.conda/envs/mellea/lib/python3.12/site-packages/trl/trainer/utils.py:103: DeprecationWarning: This class is deprecated and will be removed in version 0.20.0. To train on completion only, please use the parameter `completion_only_loss` of `SFTConfig` instead.
    warnings.warn(

test/cli/test_alora_train_integration.py::test_alora_training_integration
test/cli/test_alora_train_integration.py::test_lora_training_integration
  /u/jonesn/.conda/envs/mellea/lib/python3.12/site-packages/trl/trainer/sft_trainer.py:678: DeprecationWarning: Failed to apply the formatting function due to the following error: string index out of range. This may be because the function is designed for batched input. Please update it to process one example at a time (i.e., accept and return a single example). For now, we will attempt to apply the function in batched mode, but note that batched formatting is deprecated and will be removed in version 0.21.
    warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html

@jakelorocco (Contributor) left a comment

Mostly LGTM; a few nits, and it looks like the tests are failing.

@planetf1 (Contributor, Author) commented Feb 6, 2026

The new alora tests do fail in CI with:

FAILED test/cli/test_alora_train_integration.py::test_alora_training_integration - NotImplementedError: Cannot copy out of meta tensor; no data! Please use torch.nn.Module.to_empty() instead of torch.nn.Module.to() when moving module from meta to a different device.
FAILED test/cli/test_alora_train_integration.py::test_lora_training_integration - NotImplementedError: Cannot copy out of meta tensor; no data! Please use torch.nn.Module.to_empty() instead of torch.nn.Module.to() when moving module from meta to a different device.
= 2 failed, 239 passed, 109 skipped, 1 xpassed, 25 warnings in 562.56s (0:09:22) =

Investigating

planetf1 force-pushed the fix/issue-385-peft-migration branch 2 times, most recently from 36f4b55 to c741486, on February 6, 2026 14:07
Migrated m train command from deprecated IBM alora package to PEFT 0.18+ native aLoRA support.
- Updated dependencies: removed alora==0.2.0, added peft>=0.18.1
- Replaced IBM imports with PEFT native API (LoraConfig, get_peft_model)
- Changed invocation format: invocation_string → alora_invocation_tokens (list of token IDs)
- Added comprehensive test suite: 4 unit tests + 2 integration tests with full adapter verification
- Tests validate config format, weight integrity, adapter loading, and inference with/without activation
planetf1 force-pushed the fix/issue-385-peft-migration branch from c741486 to 3e2a34e on February 6, 2026 14:09
@planetf1 (Contributor, Author) commented Feb 6, 2026

The CI test failure is caused by not having a GPU; training needs to fall back to CPU when no GPU is available. I think that's now fixed.

We hit the same issue running locally on Mac ARM (MPS) with the current PyTorch version, which is why we originally skipped the alora tests on Mac. However, this means a Mac user cannot use alora at all -- so I'm looking at whether we can detect MPS with a backlevel PyTorch and fall back to CPU-only with a warning rather than failing (a sketch follows).
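
A hypothetical sketch of that detection logic. The helper name is made up; the 2.8.0 threshold matches the warning in the integration test shown later in this thread:

```python
import torch
from packaging import version

def resolve_training_device() -> str:
    """Pick a training device, falling back to CPU when MPS support is too old."""
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        # Older PyTorch builds hit gradient-scaling issues on MPS during training.
        if version.parse(torch.__version__) < version.parse("2.8.0"):
            print("Warning: MPS available but PyTorch < 2.8.0; falling back to CPU.")
            return "cpu"
        return "mps"
    return "cpu"
```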

@planetf1 (Contributor, Author) commented Feb 6, 2026

It's quite hard to get initialization in the train CLI to fall back to CPU only after detecting MPS with a backlevel PyTorch.

After a few attempts, I feel an alternative is worthwhile: fail as we do now, but also add a --device=cpu option that forces CPU only and could be used on a Mac (or any system) when we want to avoid GPU auto-detection. A rough sketch follows.
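
A rough sketch of how such an option could bypass auto-detection entirely; the CLI wiring here (typer, option name apart from --device) is an assumption, not mellea's actual code:

```python
import torch
import typer

app = typer.Typer()

@app.command()
def train(device: str = typer.Option("auto", help="auto, cpu, cuda, or mps")) -> None:
    if device != "auto":
        resolved = device  # explicit override: no GPU/MPS probing at all
    elif torch.cuda.is_available():
        resolved = "cuda"
    else:
        resolved = "cpu"
    typer.echo(f"Training on device: {resolved}")

if __name__ == "__main__":
    app()
```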

@planetf1 (Contributor, Author) commented Feb 6, 2026

The MPS fallback was only partly working because the training process uses multiple libraries. Forcibly disabling MPS after detecting a backlevel PyTorch (the CPU fallback on macOS) now allows training to run and the tests to pass (the integration test takes around 43s).

@planetf1 (Contributor, Author) commented Feb 6, 2026

@jakelorocco I'm thinking of keeping this PR just for the CLI/library improvements and working on the sample in a separate PR.

planetf1 marked this pull request as ready for review and requested a review from jakelorocco on February 6, 2026 16:25
@psschwei (Member) commented Feb 6, 2026

I got a failure on the alora training test on my laptop:
FAILED test/cli/test_alora_train_integration.py::test_alora_training_integration - NotImplementedError: Cannot copy out of meta tensor; no data! Please use torch.nn.Module.to_empty() instead of torch.nn.Module.to() when moving module from meta to a ...

I have a GPU, so I'm not sure why it's hitting that error.

@psschwei (Member) commented Feb 6, 2026

It worked fine in the remote environment, though.

@nrfulton (Member) commented Feb 8, 2026

Confirmed that training now works on both CUDA and MPS (M1 Max, 64 GB) 🥳.

@psschwei I tried to reproduce that problem but wasn't able to. Are you still encountering that error? If so, it might be hardware-specific and require some collaboration to debug.

There's a larger issue, though. Requiring adapters doesn't work because 101_example.py was never updated after the rewrite of our adapter logic to use granite-common for intrinsics.

I started fixing that example and ran into some issues; I opened #423 to get feedback from the architects of the new Intrinsic and Adapter system. We should close out that issue before merging this PR.

@psschwei (Member) commented Feb 8, 2026

Are you still encountering that error? If so, it might be hardware-specific and require some collaboration to debug.

Yes, it's still failing for me. I haven't spent any time debugging, but I assume it's something peculiar to my system, as I was able to run elsewhere without issue.

Looking at the error, there could be a problem with how we offload to CPU; a possible reconstruction follows.
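
If that's the case, a plausible reconstruction of the failure; the device_map choice and the mitigation are guesses based on the accelerate warning in the log below, not confirmed root cause:

```python
from transformers import AutoModelForCausalLM

model_id = "ibm-granite/granite-4.0-micro"

# Suspected failure path on a small (4 GiB) GPU: device_map="auto" offloads
# some weights to CPU and leaves them as meta tensors, and the Trainer's
# later model.to(device) then raises "Cannot copy out of meta tensor".
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Possible mitigation: load onto a single explicit device so no parameters
# are ever placed on the meta device.
model = AutoModelForCausalLM.from_pretrained(model_id)
model = model.to("cpu")  # or "cuda" when the model fits in VRAM
```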

System info:

GPU: NVIDIA RTX A1000
Driver: 580.119.02
Memory: 4096 MiB
OS: GNU/Linux (Fedora 43)
Kernel: 6.18.8-200.fc43.x86_64
Arch: x86_64

Full logs from the run (minus code coverage):

$ uv run pytest test/cli/test_alora_train_integration.py::test_alora_training_integration
=========================================================================== test session starts ===========================================================================
platform linux -- Python 3.12.8, pytest-9.0.0, pluggy-1.6.0
rootdir: /home/paulschw/generative-computing/mellea-pr-422
configfile: pyproject.toml
plugins: timeout-2.4.0, cov-7.0.0, anyio-4.11.0, asyncio-1.3.0, nbmake-1.5.5, Faker-37.12.0, langsmith-0.6.6
timeout: 900.0s
timeout method: signal
timeout func_only: False
asyncio: mode=Mode.AUTO, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 1 item                                                                                                                                                          

test/cli/test_alora_train_integration.py F                                                                                                                          [100%]

================================================================================ FAILURES =================================================================================
_____________________________________________________________________ test_alora_training_integration _____________________________________________________________________

    @pytest.mark.huggingface
    @pytest.mark.llm
    def test_alora_training_integration():
        """Integration test: Train a tiny aLoRA adapter and verify it works.
    
        This test:
        1. Creates a minimal training dataset (5 samples)
        2. Trains an aLoRA adapter for 1 epoch using a small model
        3. Verifies adapter files are created with correct PEFT 0.18+ format
        4. Cleans up temporary files
    
        Uses ibm-granite/granite-4.0-micro (smallest Granite model, 3B params).
        """
        from cli.alora.train import train_model
    
        # Force CPU if MPS is available but PyTorch is too old
        if _mps_needs_cpu_fallback:
            import os
    
            # Disable MPS entirely to force CPU usage
            os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "0"
            print(
                "⚠  Warning: MPS available but PyTorch < 2.8.0. "
                "Disabling MPS to run on CPU and avoid gradient scaling issues."
            )
    
        # Create temporary directory for test artifacts
        with tempfile.TemporaryDirectory() as tmpdir:
            tmpdir_path = Path(tmpdir)
    
            # Create minimal training dataset (5 samples)
            dataset_path = tmpdir_path / "train.jsonl"
            training_data = [
                {"item": "Flywheel imbalance detected.", "label": "flywheel"},
                {"item": "Connecting rod bent.", "label": "connecting rod"},
                {"item": "Piston crown cracked.", "label": "piston"},
                {"item": "Oil seepage around rings.", "label": "piston rings"},
                {"item": "Carburetor obstructed.", "label": "mini-carburetor"},
            ]
    
            with open(dataset_path, "w") as f:
                for item in training_data:
                    f.write(json.dumps(item) + "\n")
    
            # Output path for adapter
            adapter_path = tmpdir_path / "test_alora_adapter"
    
            # Train aLoRA adapter with minimal settings
            # Using smallest Granite model: granite-4.0-micro (3B params)
>           train_model(
                dataset_path=str(dataset_path),
                base_model="ibm-granite/granite-4.0-micro",
                output_file=str(adapter_path),
                adapter="alora",
                epochs=1,  # Just 1 epoch for speed
                learning_rate=6e-6,
                batch_size=1,  # Minimal batch size
                max_length=512,  # Shorter sequences
                grad_accum=1,  # No gradient accumulation
            )

test/cli/test_alora_train_integration.py:72: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
cli/alora/train.py:158: in train_model
    trainer = SafeSaveTrainer(
.venv/lib/python3.12/site-packages/trl/trainer/sft_trainer.py:506: in __init__
    super().__init__(
.venv/lib/python3.12/site-packages/transformers/utils/deprecation.py:172: in wrapped_func
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/transformers/trainer.py:619: in __init__
    self._move_model_to_device(model, args.device)
.venv/lib/python3.12/site-packages/transformers/trainer.py:895: in _move_model_to_device
    model = model.to(device)
            ^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/torch/nn/modules/module.py:1355: in to
    return self._apply(convert)
           ^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/torch/nn/modules/module.py:915: in _apply
    module._apply(fn)
.venv/lib/python3.12/site-packages/torch/nn/modules/module.py:915: in _apply
    module._apply(fn)
.venv/lib/python3.12/site-packages/torch/nn/modules/module.py:915: in _apply
    module._apply(fn)
.venv/lib/python3.12/site-packages/torch/nn/modules/module.py:915: in _apply
    module._apply(fn)
.venv/lib/python3.12/site-packages/torch/nn/modules/module.py:915: in _apply
    module._apply(fn)
.venv/lib/python3.12/site-packages/torch/nn/modules/module.py:915: in _apply
    module._apply(fn)
.venv/lib/python3.12/site-packages/torch/nn/modules/module.py:942: in _apply
    param_applied = fn(param)
                    ^^^^^^^^^
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

t = Parameter containing:
tensor(..., device='meta', size=(2560,))

    def convert(t):
        try:
            if convert_to_format is not None and t.dim() in (4, 5):
                return t.to(
                    device,
                    dtype if t.is_floating_point() or t.is_complex() else None,
                    non_blocking,
                    memory_format=convert_to_format,
                )
            return t.to(
                device,
                dtype if t.is_floating_point() or t.is_complex() else None,
                non_blocking,
            )
        except NotImplementedError as e:
            if str(e) == "Cannot copy out of meta tensor; no data!":
>               raise NotImplementedError(
                    f"{e} Please use torch.nn.Module.to_empty() instead of torch.nn.Module.to() "
                    f"when moving module from meta to a different device."
                ) from None
E               NotImplementedError: Cannot copy out of meta tensor; no data! Please use torch.nn.Module.to_empty() instead of torch.nn.Module.to() when moving module from meta to a different device.

.venv/lib/python3.12/site-packages/torch/nn/modules/module.py:1348: NotImplementedError
-------------------------------------------------------------------------- Captured stderr call ---------------------------------------------------------------------------
Loading checkpoint shards: 100%|██████████| 2/2 [00:03<00:00,  1.89s/it]
Applying formatting function to train dataset:   0%|          | 0/4 [00:00<?, ? examples/s]
Applying formatting function to train dataset: 100%|██████████| 4/4 [00:00<00:00, 1705.87 examples/s]
Adding EOS to train dataset: 100%|██████████| 4/4 [00:00<00:00, 1714.41 examples/s]
Tokenizing train dataset: 100%|██████████| 4/4 [00:00<00:00, 953.58 examples/s]
Truncating train dataset: 100%|██████████| 4/4 [00:00<00:00, 1745.44 examples/s]
Applying formatting function to eval dataset:   0%|          | 0/1 [00:00<?, ? examples/s]
Applying formatting function to eval dataset: 100%|██████████| 1/1 [00:00<00:00, 542.39 examples/s]
Adding EOS to eval dataset: 100%|██████████| 1/1 [00:00<00:00, 499.86 examples/s]
Tokenizing eval dataset: 100%|██████████| 1/1 [00:00<00:00, 315.67 examples/s]
Truncating eval dataset: 100%|██████████| 1/1 [00:00<00:00, 491.42 examples/s]
---------------------------------------------------------------------------- Captured log call ----------------------------------------------------------------------------
WARNING  accelerate.big_modeling:big_modeling.py:442 Some parameters are on the meta device because they were offloaded to the cpu.
============================================================================ warnings summary =============================================================================
test/cli/test_alora_train_integration.py::test_alora_training_integration
  /home/paulschw/generative-computing/mellea-pr-422/.venv/lib/python3.12/site-packages/trl/trainer/utils.py:103: DeprecationWarning: This class is deprecated and will be removed in version 0.20.0. To train on completion only, please use the parameter `completion_only_loss` of `SFTConfig` instead.
    warnings.warn(

test/cli/test_alora_train_integration.py::test_alora_training_integration
  /home/paulschw/generative-computing/mellea-pr-422/.venv/lib/python3.12/site-packages/trl/trainer/sft_config.py:257: DeprecationWarning: `max_seq_length` is deprecated and will be removed in version 0.20.0. Use `max_length` instead.
    warnings.warn(

test/cli/test_alora_train_integration.py::test_alora_training_integration
  /home/paulschw/generative-computing/mellea-pr-422/.venv/lib/python3.12/site-packages/trl/trainer/sft_trainer.py:678: DeprecationWarning: Failed to apply the formatting function due to the following error: string index out of range. This may be because the function is designed for batched input. Please update it to process one example at a time (i.e., accept and return a single example). For now, we will attempt to apply the function in batched mode, but note that batched formatting is deprecated and will be removed in version 0.21.
    warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html

<snip code coverage>

========================================================================= short test summary info =========================================================================
FAILED test/cli/test_alora_train_integration.py::test_alora_training_integration - NotImplementedError: Cannot copy out of meta tensor; no data! Please use torch.nn.Module.to_empty() instead of torch.nn.Module.to() when moving module from meta to a ...
===================================================================== 1 failed, 3 warnings in 19.00s ======================================================================

@nrfulton (Member) commented Feb 9, 2026

Update: I made a bunch of progress toward getting the example running and also uncovered some issues along the way. I have a PR against @planetf1's clone (from which this PR comes), because I don't want to push into his repo without syncing with him first:

planetf1#1

@planetf1 (Contributor, Author) commented Feb 9, 2026

@nathan - thanks for the update - looks fine. On the example: I started there, but was planning to return to it after the core was updated.
@psschwei the tensor error was one I saw before I added a change for CI - are you using the latest code? 4GB is also very limited; maybe a more explicit way is needed to force CPU only. I had originally added this while working on the PR, but then managed to address CI / large CUDA / macOS through detection ...

The biggest issue, though, seems to be the new issue #423, which I'll also read through to understand more about the design options.

@planetf1 (Contributor, Author) commented Feb 9, 2026

@psschwei I know it doesn't address the big issue, but you could now try --device=cpu on the latest code. I don't really have a system I can test this on ...
