
Splitters_stories

GitHub Action edited this page Apr 11, 2026 · 1 revision

US Data Splitting Functionality: Split dataframes into subsets for model training and evaluation.


Class relations

classDiagram
    direction LR

    class Splitter {
        <<abstract>>
        +KIND: str
        +split(inputs: schemas.Inputs, targets: schemas.Targets, groups: Index | None = None) TrainTestSplits*
        +get_n_splits(inputs: schemas.Inputs, targets: schemas.Targets, groups: Index | None = None) int*
    }
    Splitter --|> pdt.BaseModel : inherits
    Splitter --|> abc.ABC : inherits

    class TrainTestSplitter {
        +KIND: T.Literal["TrainTestSplitter"] = "TrainTestSplitter"
        +shuffle: bool = False
        +test_size: int | float = 1728
        +random_state: int = 42
        +split(inputs: schemas.Inputs, targets: schemas.Targets, groups: Index | None = None) TrainTestSplits
        +get_n_splits(inputs: schemas.Inputs, targets: schemas.Targets, groups: Index | None = None) int
    }
    TrainTestSplitter --|> Splitter : inherits

    class TimeSeriesSplitter {
        +KIND: T.Literal["TimeSeriesSplitter"] = "TimeSeriesSplitter"
        +gap: int = 0
        +n_splits: int = 4
        +test_size: int | float = 1728
        +split(inputs: schemas.Inputs, targets: schemas.Targets, groups: Index | None = None) TrainTestSplits
        +get_n_splits(inputs: schemas.Inputs, targets: schemas.Targets, groups: Index | None = None) int
    }
    TimeSeriesSplitter --|> Splitter : inherits

    class TrainTestSplits {
        <<type>>
        T.Iterator[tuple[Index, Index]]
    }

    class Index {
        <<type>>
        npt.NDArray[np.int64]
    }

    Splitter --> schemas.Inputs : "uses"
    Splitter --> schemas.Targets : "uses"
    Splitter --> TrainTestSplits : "returns"

User Stories: Splitter Management


1. User Story: Configure Splitter

Title:
As a data scientist, I want to configure a splitter that defines how datasets are divided into training and testing subsets, so that I can ensure proper model evaluation.

Description:
The abstract Splitter class is the base for configuring how datasets are split; concrete subclasses implement the chosen method (train/test or time series).

Acceptance Criteria:

  • A splitter can be instantiated with specific parameters for splitting datasets.
  • Default values are handled and configurable through initialization.
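The configuration behaviour described above can be sketched as follows. This is a minimal, dependency-free illustration, not the project's implementation: the class diagram indicates that the real Splitter inherits from pdt.BaseModel, whereas this sketch swaps in standard-library dataclasses, simplifies the get_n_splits signature, and assumes that a single train/test split counts as one split.

```python
import abc
import dataclasses
import typing as T


@dataclasses.dataclass
class Splitter(abc.ABC):
    """Abstract base holding the configuration shared by all splitters."""

    # Name of the concrete splitter; set by each subclass.
    KIND: T.ClassVar[str]

    @abc.abstractmethod
    def get_n_splits(self, inputs: T.Sized, targets: T.Sized) -> int:
        """Return how many train/test pairs this configuration yields."""


@dataclasses.dataclass
class TrainTestSplitter(Splitter):
    """Train/test configuration with the defaults listed in the diagram."""

    KIND: T.ClassVar[str] = "TrainTestSplitter"
    shuffle: bool = False
    test_size: T.Union[int, float] = 1728
    random_state: int = 42

    def get_n_splits(self, inputs: T.Sized, targets: T.Sized) -> int:
        # Assumption: one train/test pair per call to split.
        return 1


# Defaults apply when no arguments are given ...
default_splitter = TrainTestSplitter()
# ... and any of them can be overridden at initialization.
custom_splitter = TrainTestSplitter(shuffle=True, test_size=0.2)
```

Because Splitter declares an abstract method, calling Splitter() directly raises a TypeError; only concrete subclasses can be instantiated.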

2. User Story: Split Data into Train and Test Sets

Title:
As a data engineer, I want to split datasets into training and testing sets using the specified Splitter, so that I can prepare them for model training and evaluation.

Description:
The TrainTestSplitter allows for a straightforward split between the training and testing datasets based on the parameters provided.

Acceptance Criteria:

  • The job successfully divides input and target datasets into train and test subsets.
  • The shapes of the resulting datasets are logged for verification.
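As a rough illustration of these criteria, the sketch below produces one pair of train and test row positions and logs their shapes. The function name train_test_split_indices is hypothetical, and the logic is hand-rolled with NumPy rather than taken from the project; in particular, reading a float test_size as a fraction of rows and an int as a row count is an assumption about the intended semantics.

```python
import logging
import math
import typing as T

import numpy as np
import numpy.typing as npt

logger = logging.getLogger(__name__)

Index = npt.NDArray[np.int64]
TrainTestSplits = T.Iterator[T.Tuple[Index, Index]]


def train_test_split_indices(
    n_rows: int,
    test_size: T.Union[int, float] = 1728,
    shuffle: bool = False,
    random_state: int = 42,
) -> TrainTestSplits:
    """Yield a single (train_index, test_index) pair of row positions."""
    # Assumption: a float is a fraction of the rows, an int is a row count.
    if isinstance(test_size, float):
        n_test = math.ceil(n_rows * test_size)
    else:
        n_test = test_size
    if not 0 < n_test < n_rows:
        raise ValueError(f"test_size={test_size} is invalid for {n_rows} rows")
    index = np.arange(n_rows, dtype=np.int64)
    if shuffle:
        # A seeded generator keeps the shuffled split reproducible.
        index = np.random.default_rng(random_state).permutation(index)
    # The last n_test positions form the test set.
    train_index, test_index = index[:-n_test], index[-n_test:]
    logger.info("Train shape: %s, test shape: %s", train_index.shape, test_index.shape)
    yield train_index, test_index
```

With pandas objects, the yielded positions would typically be applied with inputs.iloc[train_index] and targets.iloc[test_index].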

3. User Story: Split Data for Time Series Analysis

Title:
As a data scientist, I want to use the TimeSeriesSplitter to split datasets specifically for time series analysis, so that temporal dependencies are respected in the splits.

Description:
The TimeSeriesSplitter class creates successive train/test splits in chronological order, configured by n_splits, test_size, and gap, so that each model is evaluated only on data that comes after its training data.

Acceptance Criteria:

  • The time series data is split according to specified parameters without shuffling.
  • The results accurately reflect the required time splits in the dataset.
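One way to satisfy these criteria is an expanding-window scheme, the same general idea used by scikit-learn's TimeSeriesSplit. The sketch below is a hand-rolled NumPy illustration under stated assumptions, not the project's code: the function name time_series_split_indices is hypothetical, and test_size is treated as an integer row count even though the diagram types it as int | float.

```python
import typing as T

import numpy as np
import numpy.typing as npt

Index = npt.NDArray[np.int64]


def time_series_split_indices(
    n_rows: int,
    n_splits: int = 4,
    test_size: int = 1728,
    gap: int = 0,
) -> T.Iterator[T.Tuple[Index, Index]]:
    """Yield ordered (train_index, test_index) pairs without shuffling."""
    # The test windows occupy the last n_splits * test_size rows.
    first_test_start = n_rows - n_splits * test_size
    if gap < 0 or first_test_start - gap <= 0:
        raise ValueError(
            f"{n_rows} rows are too few for n_splits={n_splits}, "
            f"test_size={test_size}, gap={gap}"
        )
    index = np.arange(n_rows, dtype=np.int64)
    for test_start in range(first_test_start, n_rows, test_size):
        # Training data ends `gap` rows before the test window starts.
        train_end = test_start - gap
        yield index[:train_end], index[test_start : test_start + test_size]
```

Each training set grows with every split while the test window keeps a fixed size, so every test row is strictly later than every training row in that split.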

4. User Story: Get Number of Splits

Title:
As a data scientist, I want to retrieve the number of splits generated by the splitter class, so that I can understand how many distinct training/test sets will be produced.

Description:
Both the TrainTestSplitter and TimeSeriesSplitter should report, through get_n_splits, how many train/test pairs their split method will produce.

Acceptance Criteria:

  • The method to get the number of splits returns an integer count accurately representing the available splits.
  • The information should be logged for reference during the training process.
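The following sketch shows how a caller might retrieve and log the split count through the shared get_n_splits interface. Both log_n_splits and the FixedSplits stand-in are hypothetical names introduced only for this illustration; any object with a compatible get_n_splits method would work in their place.

```python
import logging
import typing as T

logger = logging.getLogger(__name__)


class HasNSplits(T.Protocol):
    """Structural type for anything exposing get_n_splits."""

    def get_n_splits(self, inputs: T.Any, targets: T.Any, groups: T.Any = None) -> int:
        ...


class FixedSplits:
    """Hypothetical stand-in for a configured splitter."""

    def __init__(self, n_splits: int) -> None:
        self.n_splits = n_splits

    def get_n_splits(self, inputs: T.Any, targets: T.Any, groups: T.Any = None) -> int:
        return self.n_splits


def log_n_splits(splitter: HasNSplits, inputs: T.Any, targets: T.Any) -> int:
    """Return the split count after checking and logging it."""
    n_splits = splitter.get_n_splits(inputs, targets)
    # Guard against implementations returning a non-integer or empty count.
    if not isinstance(n_splits, int) or n_splits < 1:
        raise ValueError(f"Invalid number of splits: {n_splits!r}")
    logger.info("%s will produce %d split(s)", type(splitter).__name__, n_splits)
    return n_splits
```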

Common Acceptance Criteria

  1. Implementation Requirements:

    • The Splitter, TrainTestSplitter, and TimeSeriesSplitter classes must properly define their respective methods.
    • Interfaces for splitting and returning the number of splits should be consistent across all implementations.
  2. Error Handling:

    • Errors encountered during data splitting should be logged with clear messages, especially when invalid data is provided.
  3. Testing:

    • Unit tests validate each splitter, confirming that datasets are split correctly for the configured parameters.
    • Edge cases in splitting configurations (for example, a test_size larger than the dataset) should also be covered by the tests.
  4. Documentation:

    • Each class and method should contain thorough docstrings describing their intended use and how to interact with them.
    • Examples of initializing and using each splitter should be included for user reference.
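The error-handling requirement above could be met with a small validation step that logs a clear message before raising. This is only a sketch: the helper name check_split_inputs and the specific checks (non-empty data, matching lengths) are assumptions, since the page does not define what counts as invalid data.

```python
import logging
import typing as T

logger = logging.getLogger(__name__)


def check_split_inputs(inputs: T.Sized, targets: T.Sized) -> int:
    """Validate data before splitting and return the number of rows."""
    n_inputs, n_targets = len(inputs), len(targets)
    if n_inputs == 0:
        message = "Cannot split an empty dataset"
    elif n_inputs != n_targets:
        message = (
            f"Inputs and targets differ in length: {n_inputs} vs {n_targets}"
        )
    else:
        return n_inputs
    # Log the problem first so it appears in job logs, then fail loudly.
    logger.error(message)
    raise ValueError(message)
```

A unit test for an edge case can then assert that check_split_inputs raises ValueError for mismatched lengths, which also exercises the logging path.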

Definition of Done (DoD):

  • The Splitter, TrainTestSplitter, and TimeSeriesSplitter classes are fully functional and meet the outlined acceptance criteria.
  • All functionalities are verified through testing, ensuring reliability in data splits.
  • Documentation is complete, providing clarity for integration and use.

Code location

src/regression_model_template/utils/splitters.py

Test location

tests/utils/test_splitters.py
