[Doc] #1330

@guidryheal-create

Description

It is not clear which scheduler strategies are available in AReaL.

The default value used in examples (None) leads to runtime errors, which makes the system confusing and fragile for first-time users.


Issue Type

  • Missing documentation
  • Incorrect information
  • Unclear or confusing
  • Typo or formatting error
  • Other (please describe)

Expected Behavior

The system should either:

  • Work with a safe default scheduler, or
  • Clearly document valid scheduler options and enforce them in config validation.

Current Behavior

Using the default configuration results in a crash:

Traceback (most recent call last):
  File "/content/AReaL/examples/vlm/geometry3k_grpo.py", line 75, in <module>
    main(sys.argv[1:])
  File "/content/AReaL/examples/vlm/geometry3k_grpo.py", line 61, in main
    with PPOTrainer(
         ^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/areal/trainer/rl_trainer.py", line 123, in __init__
    self.scheduler = self._init_scheduler()
                     ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/areal/trainer/rl_trainer.py", line 873, in _init_scheduler
    raise NotImplementedError(f"Unknown scheduler type: {cfg.type}")
NotImplementedError: Unknown scheduler type:

Additional Context

After inspecting the source code, only the following scheduler strategies are actually defined:

class SchedulingStrategyType(str, Enum):
    separation = "separation"
    colocation = "colocation"

And the configuration dataclass:

@dataclass
class SchedulingStrategy:
    type: str = field(
        default="separation",
        metadata={"choices": ["separation", "colocation"]},
    )

    target: str | None = field(
        default=None,
        metadata={"help": "The target role to be colocated with"},
    )

    fork: bool = field(
        default=True,
        metadata={
            "help": (
                "When True with colocation, the target worker spawns a new "
                "process on the same node/GPUs instead of sharing its process. "
                "Provides process isolation while sharing GPU resources."
            )
        },
    )

Suggested Fix / Clarification

  1. Documentation should explicitly state valid values:

scheduler:
  type: separation  # default

or:

scheduler:
  type: colocation

  2. Consider rejecting None earlier with a clear validation error message like:

Scheduler type must be one of: separation, colocation

instead of:

Unknown scheduler type: None
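
To illustrate the second suggestion, here is a minimal sketch of what early validation could look like. It is not AReaL's actual code: the dataclass below is a simplified copy of the one quoted above, and putting the check in a `__post_init__` hook is my assumption about where it could live.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class SchedulingStrategyType(str, Enum):
    separation = "separation"
    colocation = "colocation"


@dataclass
class SchedulingStrategy:
    # Simplified copy of the quoted dataclass, for illustration only.
    type: str = field(
        default="separation",
        metadata={"choices": ["separation", "colocation"]},
    )
    target: Optional[str] = None
    fork: bool = True

    def __post_init__(self):
        # Hypothetical check: fail at config construction time,
        # not later inside the trainer.
        valid = [member.value for member in SchedulingStrategyType]
        if self.type not in valid:
            raise ValueError(
                f"Scheduler type must be one of: {', '.join(valid)} "
                f"(got {self.type!r})"
            )
```

With a check like this, `SchedulingStrategy(type=None)` would fail immediately with a message that lists the valid options, rather than surfacing later as a NotImplementedError in the trainer.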

Metadata

Assignees

No one assigned

    Labels

    documentation: Improvements or additions to documentation
