Splitters_stories

US Data Splitting Functionality: Split dataframes into subsets for model training and evaluation.

Class relations

classDiagram
    direction LR

    class Splitter {
        <<abstract>>
        +KIND: str
        +split(inputs: schemas.Inputs, targets: schemas.Targets, groups: Index | None = None) TrainTestSplits*
        +get_n_splits(inputs: schemas.Inputs, targets: schemas.Targets, groups: Index | None = None) int*
    }
    Splitter --|> pdt.BaseModel : inherits
    Splitter --|> abc.ABC : inherits

    class TrainTestSplitter {
        +KIND: T.Literal["TrainTestSplitter"] = "TrainTestSplitter"
        +shuffle: bool = False
        +test_size: int | float = 1728
        +random_state: int = 42
        +split(inputs: schemas.Inputs, targets: schemas.Targets, groups: Index | None = None) TrainTestSplits
        +get_n_splits(inputs: schemas.Inputs, targets: schemas.Targets, groups: Index | None = None) int
    }
    TrainTestSplitter --|> Splitter : inherits

    class TimeSeriesSplitter {
        +KIND: T.Literal["TimeSeriesSplitter"] = "TimeSeriesSplitter"
        +gap: int = 0
        +n_splits: int = 4
        +test_size: int | float = 1728
        +split(inputs: schemas.Inputs, targets: schemas.Targets, groups: Index | None = None) TrainTestSplits
        +get_n_splits(inputs: schemas.Inputs, targets: schemas.Targets, groups: Index | None = None) int
    }
    TimeSeriesSplitter --|> Splitter : inherits

    class TrainTestSplits {
        <<type>>
        T.Iterator[tuple[Index, Index]]
    }

    class Index {
        <<type>>
        npt.NDArray[np.int64]
    }

    Splitter --> schemas.Inputs : "uses"
    Splitter --> schemas.Targets : "uses"
    Splitter --> TrainTestSplits : "returns"

User Stories: Splitter Management

1. User Story: Configure Splitter

Title:
As a data scientist, I want to configure a splitter that defines how datasets are divided into training and testing subsets, so that I can ensure proper model evaluation.

Description:
The Splitter class is the abstract base that defines the configuration and the interface (split and get_n_splits) for dividing datasets with the chosen method (train/test or time series).

Acceptance Criteria:

- A splitter can be instantiated with specific parameters for splitting datasets.
- Default values are provided and can be overridden at initialization.
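To make the configuration story concrete, here is a minimal sketch of the interface and of one configurable splitter. It rests on several assumptions: a frozen dataclass stands in for the pdt.BaseModel parent shown in the diagram, plain lists of row positions stand in for the NumPy index arrays, and the split logic is a placeholder that reads test_size as a row count only.

```python
import abc
from dataclasses import dataclass
from typing import Iterator, List, Tuple, Union

# Assumption: plain lists of row positions stand in for npt.NDArray[np.int64].
Index = List[int]
TrainTestSplits = Iterator[Tuple[Index, Index]]


class Splitter(abc.ABC):
    """Common interface shared by every splitter."""

    KIND: str

    @abc.abstractmethod
    def split(self, inputs, targets, groups=None) -> TrainTestSplits:
        """Yield (train_index, test_index) pairs."""

    @abc.abstractmethod
    def get_n_splits(self, inputs, targets, groups=None) -> int:
        """Return the number of pairs that split() yields."""


@dataclass(frozen=True)  # stand-in for pdt.BaseModel (assumption)
class TrainTestSplitter(Splitter):
    """Defaults copied from the class diagram; each can be overridden."""

    shuffle: bool = False
    test_size: Union[int, float] = 1728
    random_state: int = 42
    KIND = "TrainTestSplitter"

    def split(self, inputs, targets, groups=None) -> TrainTestSplits:
        # Placeholder logic: treat test_size as a row count and hold out the last rows.
        cut = max(len(inputs) - int(self.test_size), 0)
        yield list(range(cut)), list(range(cut, len(inputs)))

    def get_n_splits(self, inputs, targets, groups=None) -> int:
        return 1  # a single train/test pair


default_splitter = TrainTestSplitter()            # uses the documented defaults
custom_splitter = TrainTestSplitter(test_size=2)  # overrides one default
```

Instantiating Splitter directly raises a TypeError because its methods are abstract, which is what forces every concrete splitter to expose the same interface.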

2. User Story: Split Data into Train and Test Sets

Title:
As a data engineer, I want to split datasets into training and testing sets using the specified Splitter, so that I can prepare them for model training and evaluation.

Description:
The TrainTestSplitter performs a single split into training and testing subsets, controlled by its test_size, shuffle, and random_state parameters.

Acceptance Criteria:

- The job successfully divides input and target datasets into train and test subsets.
- The shapes of the resulting datasets are logged for verification.
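The two criteria above can be sketched with the standard library alone. The function name train_test_indices is hypothetical, and the rounding rule for fractional sizes (round up) is an assumption rather than the package's documented behavior.

```python
import logging
import math
import random
from typing import List, Tuple, Union

logger = logging.getLogger(__name__)


def train_test_indices(
    n_rows: int,
    test_size: Union[int, float] = 1728,
    shuffle: bool = False,
    random_state: int = 42,
) -> Tuple[List[int], List[int]]:
    """Return (train, test) row positions for a single split."""
    # Assumption: a float is a fraction of the rows (rounded up), an int is a row count.
    n_test = math.ceil(n_rows * test_size) if isinstance(test_size, float) else test_size
    if not 0 < n_test < n_rows:
        raise ValueError(f"test_size={test_size!r} leaves no usable split for {n_rows} rows")
    positions = list(range(n_rows))
    if shuffle:
        # A seeded generator keeps the shuffle reproducible across runs.
        random.Random(random_state).shuffle(positions)
    train, test = positions[:-n_test], positions[-n_test:]
    # Log the subset sizes so the split can be verified from the job output.
    logger.info("Train rows: %d, test rows: %d", len(train), len(test))
    return train, test


train, test = train_test_indices(n_rows=10, test_size=0.2)
```

Without shuffling, the test subset is simply the tail of the data, which keeps the original row order intact.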

3. User Story: Split Data for Time Series Analysis

Title:
As a data scientist, I want to use the TimeSeriesSplitter to split datasets specifically for time series analysis, so that temporal dependencies are respected in the splits.

Description:
The TimeSeriesSplitter class creates successive, time-ordered subsets of the data (configured by n_splits, test_size, and gap), allowing for model evaluation that maintains time order.

Acceptance Criteria:

- The time series data is split according to specified parameters without shuffling.
- The results accurately reflect the required time splits in the dataset.
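One way to picture these splits is an expanding training window followed by a fixed-size test window. The sketch below is modeled on the scheme used by scikit-learn's TimeSeriesSplit; whether the package delegates to that class is an assumption, the function name time_series_indices is hypothetical, and test_size is treated as an integer row count only.

```python
from typing import Iterator, List, Tuple


def time_series_indices(
    n_rows: int, n_splits: int = 4, test_size: int = 1728, gap: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) row positions in time order, never shuffled."""
    first_test_start = n_rows - n_splits * test_size
    if first_test_start - gap <= 0:
        raise ValueError(
            f"Not enough rows ({n_rows}) for n_splits={n_splits}, "
            f"test_size={test_size}, gap={gap}"
        )
    for test_start in range(first_test_start, n_rows, test_size):
        # Training rows end `gap` rows before the test window begins.
        train = list(range(test_start - gap))
        test = list(range(test_start, test_start + test_size))
        yield train, test


splits = list(time_series_indices(n_rows=10, n_splits=3, test_size=2))
```

Every training window ends before its test window starts, so no future row leaks into training, and a positive gap leaves a buffer of unused rows between the two.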

4. User Story: Get Number of Splits

Title:
As a data scientist, I want to retrieve the number of splits generated by the splitter class, so that I can understand how many distinct training/test sets will be produced.

Description:
Both the TrainTestSplitter and the TimeSeriesSplitter should report, through get_n_splits, how many train/test pairs they will produce.

Acceptance Criteria:

- The method to get the number of splits returns an integer count accurately representing the available splits.
- The information should be logged for reference during the training process.
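A minimal sketch of this story follows. The two classes are simplified stand-ins that carry only what get_n_splits needs, the return value of 1 for the train/test case is an assumption based on that splitter producing a single pair, and log_n_splits is a hypothetical helper.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass
class TrainTestSplitter:
    """Stand-in: a plain train/test split yields exactly one pair."""

    def get_n_splits(self, inputs, targets, groups=None) -> int:
        return 1


@dataclass
class TimeSeriesSplitter:
    """Stand-in: the number of pairs equals the configured n_splits."""

    n_splits: int = 4

    def get_n_splits(self, inputs, targets, groups=None) -> int:
        return self.n_splits


def log_n_splits(splitter, inputs, targets) -> int:
    """Query the splitter, log the count, and return it to the caller."""
    n_splits = splitter.get_n_splits(inputs, targets)
    logger.info("%s will produce %d split(s)", type(splitter).__name__, n_splits)
    return n_splits


counts = [log_n_splits(s, [], []) for s in (TrainTestSplitter(), TimeSeriesSplitter())]
```

Because both classes share the same method signature, the calling code does not need to know which splitter it was given.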

Common Acceptance Criteria

Implementation Requirements:
- The Splitter, TrainTestSplitter, and TimeSeriesSplitter classes must properly define their respective methods.
- Interfaces for splitting and returning the number of splits should be consistent across all implementations.
Error Handling:
- Errors encountered during data splitting should be logged with clear messages, especially when invalid data is provided.
Testing:
- Unit tests validate each splitter's functionality, confirming that datasets are split correctly based on the modes specified.
- Edge cases regarding splitting configurations should also be included in the tests.
Documentation:
- Each class and method should contain thorough docstrings describing their intended use and how to interact with them.
- Examples of initializing and using each splitter should be included for user reference.
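As an illustration of the testing requirement, the helper below checks invariants that any splitter output should satisfy. The name check_split_invariants and the specific invariants are assumptions chosen for this sketch, not tests taken from the package.

```python
from typing import Iterable, Sequence, Tuple


def check_split_invariants(
    splits: Iterable[Tuple[Sequence[int], Sequence[int]]],
    n_rows: int,
    ordered: bool = False,
) -> int:
    """Assert basic properties of each (train, test) pair and return the pair count."""
    n_pairs = 0
    for train, test in splits:
        n_pairs += 1
        assert len(train) > 0 and len(test) > 0, "empty subset"
        assert set(train).isdisjoint(test), "train and test overlap"
        assert all(0 <= i < n_rows for i in list(train) + list(test)), "index out of range"
        if ordered:
            # Time series splits must train strictly on rows before the test window.
            assert max(train) < min(test), "train rows must precede test rows"
    return n_pairs


n_checked = check_split_invariants([([0, 1, 2], [3, 4])], n_rows=5, ordered=True)
```

A unit test can pass the output of split to this helper and compare the returned count with get_n_splits, which covers both consistency requirements in one place.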

Definition of Done (DoD):

- The Splitter, TrainTestSplitter, and TimeSeriesSplitter classes are fully functional and meet the outlined acceptance criteria.
- All functionalities are verified through testing, ensuring reliability in data splits.
- Documentation is complete, providing clarity for integration and use.
