Splitters_stories
US Data Splitting Functionality : Split dataframes into subsets for model training and evaluation.
classDiagram
direction LR
class Splitter {
<<abstract>>
+KIND: str
+split(inputs: schemas.Inputs, targets: schemas.Targets, groups: Index | None = None) TrainTestSplits*
+get_n_splits(inputs: schemas.Inputs, targets: schemas.Targets, groups: Index | None = None) int*
}
Splitter --|> pdt.BaseModel : inherits
Splitter --|> abc.ABC : inherits
class TrainTestSplitter {
+KIND: T.Literal["TrainTestSplitter"] = "TrainTestSplitter"
+shuffle: bool = False
+test_size: int | float = 1728
+random_state: int = 42
+split(inputs: schemas.Inputs, targets: schemas.Targets, groups: Index | None = None) TrainTestSplits
+get_n_splits(inputs: schemas.Inputs, targets: schemas.Targets, groups: Index | None = None) int
}
TrainTestSplitter --|> Splitter : inherits
class TimeSeriesSplitter {
+KIND: T.Literal["TimeSeriesSplitter"] = "TimeSeriesSplitter"
+gap: int = 0
+n_splits: int = 4
+test_size: int | float = 1728
+split(inputs: schemas.Inputs, targets: schemas.Targets, groups: Index | None = None) TrainTestSplits
+get_n_splits(inputs: schemas.Inputs, targets: schemas.Targets, groups: Index | None = None) int
}
TimeSeriesSplitter --|> Splitter : inherits
class TrainTestSplits {
<<type>>
T.Iterator[tuple[Index, Index]]
}
class Index {
<<type>>
npt.NDArray[np.int64]
}
Splitter --> schemas.Inputs : "uses"
Splitter --> schemas.Targets : "uses"
Splitter --> TrainTestSplits : "returns"
Title:
As a data scientist, I want to configure a splitter that defines how datasets are divided into training and testing subsets, so that I can ensure proper model evaluation.
Description:
The Splitter class serves as the base to specify configurations for splitting datasets according to the chosen method (train/test or time series).
Acceptance Criteria:
- A splitter can be instantiated with specific parameters for splitting datasets.
- Default values are handled and configurable through initialization.
Title:
As a data engineer, I want to split datasets into training and testing sets using the specified Splitter, so that I can prepare them for model training and evaluation.
Description:
The TrainTestSplitter allows for a straightforward split between the training and testing datasets based on the parameters provided.
Acceptance Criteria:
- The job successfully divides input and target datasets into train and test subsets.
- The shapes of the resulting datasets are logged for verification.
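A rough sketch of such a split, with shape logging, is shown below. It uses only the standard library; train_test_indices is a hypothetical helper, and reading a float test_size as a fraction of the rows and an int as a row count is an assumption (it is the convention scikit-learn's train_test_split documents, but the source does not say which library the project relies on).

```python
import logging
import math
import random

logger = logging.getLogger(__name__)


def train_test_indices(n_rows, test_size=1728, shuffle=False, random_state=42):
    """Return (train_index, test_index) as lists of integer row positions."""
    # Assumption: a float is a fraction of the rows, an int is a row count.
    n_test = math.ceil(test_size * n_rows) if isinstance(test_size, float) else test_size
    if not 0 < n_test < n_rows:
        raise ValueError(f"test_size={test_size} is not valid for {n_rows} rows")
    index = list(range(n_rows))
    if shuffle:
        # Seeded shuffle so the same random_state reproduces the same split.
        random.Random(random_state).shuffle(index)
    return index[:-n_test], index[-n_test:]


# Toy data: 10 rows with 2 input columns and one target value per row.
inputs = [[i, i * 2] for i in range(10)]
targets = [i % 2 for i in range(10)]

train_index, test_index = train_test_indices(len(inputs), test_size=0.2)
inputs_train = [inputs[i] for i in train_index]
inputs_test = [inputs[i] for i in test_index]
targets_train = [targets[i] for i in train_index]
targets_test = [targets[i] for i in test_index]

# Log the shapes so the split can be verified from the job output.
logger.info("Inputs train shape: (%d, %d)", len(inputs_train), len(inputs_train[0]))
logger.info("Inputs test shape: (%d, %d)", len(inputs_test), len(inputs_test[0]))
logger.info("Targets train shape: (%d,)", len(targets_train))
logger.info("Targets test shape: (%d,)", len(targets_test))
```

With shuffle left at False, the test set is simply the tail of the data, which keeps the row order intact for time-sensitive datasets.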
Title:
As a data scientist, I want to use the TimeSeriesSplitter to split datasets specifically for time series analysis, so that temporal dependencies are respected in the splits.
Description:
The TimeSeriesSplitter class creates successive, time-ordered subsets of the data with a fixed test size, allowing for model evaluation that maintains time order.
Acceptance Criteria:
- The time series data is split according to specified parameters without shuffling.
- The resulting splits match the configured n_splits, test_size, and gap.
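One way to picture these criteria is an expanding-window split, sketched below with the standard library only. time_series_indices is a hypothetical helper modeled on the expanding-window behaviour of scikit-learn's TimeSeriesSplit; whether the project wraps that class is not stated in the source. The sketch accepts test_size as an integer row count only.

```python
import typing as T


def time_series_indices(
    n_rows: int, n_splits: int = 4, test_size: int = 1728, gap: int = 0
) -> T.Iterator[tuple[list[int], list[int]]]:
    """Yield (train_index, test_index) pairs in chronological order.

    Each test window holds `test_size` consecutive rows; the matching train
    window holds every row before it, minus the last `gap` rows.
    """
    first_test_start = n_rows - n_splits * test_size
    if first_test_start - gap <= 0:
        raise ValueError(
            f"{n_rows} rows are too few for n_splits={n_splits}, "
            f"test_size={test_size}, gap={gap}"
        )
    for test_start in range(first_test_start, n_rows, test_size):
        train_index = list(range(0, test_start - gap))
        test_index = list(range(test_start, test_start + test_size))
        yield train_index, test_index


# 10 rows, 3 folds of 2 test rows each, 1 row skipped between train and test.
splits = list(time_series_indices(n_rows=10, n_splits=3, test_size=2, gap=1))
for train_index, test_index in splits:
    print(f"train={train_index} test={test_index}")
```

No shuffling takes place: every train index precedes every test index in each fold, and later folds train on a longer history.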
Title:
As a data scientist, I want to retrieve the number of splits generated by the splitter class, so that I can understand how many distinct training/test sets will be produced.
Description:
Both the TrainTestSplitter and TimeSeriesSplitter should provide the number of splits available for model training.
Acceptance Criteria:
- The method to get the number of splits returns an integer count accurately representing the available splits.
- The information should be logged for reference during the training process.
Implementation Requirements:
- The Splitter, TrainTestSplitter, and TimeSeriesSplitter classes must properly define their respective methods.
- Interfaces for splitting and returning the number of splits should be consistent across all implementations.
Error Handling:
- Errors encountered during data splitting should be logged with clear messages, especially when invalid data is provided.
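The requirement above might be met with a validation step along these lines. check_split_inputs is a hypothetical helper, and the specific checks and messages are illustrative assumptions rather than the project's actual rules.

```python
import logging

logger = logging.getLogger("splitters")


def check_split_inputs(inputs, targets, test_size):
    """Log a clear message and raise ValueError when the data cannot be split."""
    if len(inputs) == 0:
        message = "Cannot split: inputs are empty."
    elif len(inputs) != len(targets):
        message = (
            f"Cannot split: {len(inputs)} input rows "
            f"but {len(targets)} target rows."
        )
    elif isinstance(test_size, int) and test_size >= len(inputs):
        message = (
            f"Cannot split: test_size={test_size} must be smaller "
            f"than the {len(inputs)} rows available."
        )
    else:
        return  # Data looks splittable; nothing to report.
    logger.error(message)
    raise ValueError(message)
```

Logging before raising keeps the message in the job logs even if a caller catches the exception further up.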
Testing:
- Unit tests validate each splitter's functionality, confirming that datasets are split correctly based on the modes specified.
- Edge cases regarding splitting configurations should also be included in the tests.
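Tests of this kind could look like the sketch below. holdout_indices is a hypothetical splitter used only so the tests have something to exercise; the functions follow the test_ naming that pytest collects, but they use plain assert statements and also run without pytest.

```python
def holdout_indices(n_rows, test_size):
    """Hypothetical splitter under test: the last `test_size` rows form the test set."""
    if not 0 < test_size < n_rows:
        raise ValueError("test_size must be between 1 and n_rows - 1")
    cut = n_rows - test_size
    return list(range(cut)), list(range(cut, n_rows))


def test_split_covers_every_row_once():
    train, test = holdout_indices(n_rows=10, test_size=3)
    assert sorted(train + test) == list(range(10))
    assert set(train).isdisjoint(test)


def test_test_rows_follow_train_rows():
    train, test = holdout_indices(n_rows=10, test_size=3)
    assert max(train) < min(test)


def test_edge_case_test_size_equal_to_row_count():
    # Edge case: a test set as large as the data leaves nothing to train on.
    try:
        holdout_indices(n_rows=3, test_size=3)
    except ValueError:
        return
    raise AssertionError("expected ValueError")
```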
Documentation:
- Each class and method should have a thorough docstring describing its intended use and how to interact with it.
- Examples of initializing and using each splitter should be included for user reference.
- The Splitter, TrainTestSplitter, and TimeSeriesSplitter classes are fully functional and meet the outlined acceptance criteria.
- All functionalities are verified through testing, ensuring reliability in data splits.
- Documentation is complete, providing clarity for integration and use.