Datasets_stories
US Datasets: Handle loading, preprocessing, and managing datasets for training, evaluation, and inference.
classDiagram
%% Base Class: Reader
class Reader {
<<abstract>>
+KIND: str
+limit: int | None = None
+read() pd.DataFrame*
+lineage(name: str, data: pd.DataFrame, targets: str | None, predictions: str | None) Lineage*
}
Reader --|> pdt.BaseModel : inherits
Reader --|> abc.ABC : inherits
%% ParquetReader Class
class ParquetReader {
+KIND: T.Literal["ParquetReader"] = "ParquetReader"
+path: str
+read() pd.DataFrame
+lineage(name: str, data: pd.DataFrame, targets: str | None, predictions: str | None) Lineage
}
Reader <|-- ParquetReader : specializes
%% Base Class: Writer
class Writer {
<<abstract>>
+KIND: str
+write(data: pd.DataFrame) None*
}
Writer --|> pdt.BaseModel : inherits
Writer --|> abc.ABC : inherits
%% ParquetWriter Class
class ParquetWriter {
+KIND: T.Literal["ParquetWriter"] = "ParquetWriter"
+path: str
+write(data: pd.DataFrame) None
}
Writer <|-- ParquetWriter : specializes
%% Aliases
Lineage --> lineage.PandasDataset : type alias
ReaderKind --> ParquetReader : type alias
WriterKind --> ParquetWriter : type alias
%% Relationships
Reader ..> pd.DataFrame : "returns"
Writer ..> pd.DataFrame : "uses"
ParquetReader ..> lineage.PandasDataset : "uses"
ParquetReader ..> pd.DataFrame : "uses"
Title:
As a data scientist, I want to load datasets into memory using a standardized reader so that I can access the data for analysis and model training.
Description:
The Reader class provides a base implementation for loading datasets into a Pandas DataFrame from various sources (e.g., files, databases, or cloud storage). This ensures consistent data handling across the project.
Acceptance Criteria:
- The reader supports loading datasets into a Pandas DataFrame.
- The `limit` parameter can restrict the number of rows read (optional).
- The `read` method is abstract and must be implemented by subclasses to support specific data sources.
- The returned DataFrame adheres to the schema defined by the dataset source.
Title:
As a data engineer, I want to generate lineage information for datasets so that I can track their origin and transformations for compliance and debugging purposes.
Description:
The lineage method generates metadata describing the dataset's origin, target columns, and prediction columns. This enables better tracking and reproducibility in the machine learning workflow.
Acceptance Criteria:
- The `lineage` method accepts the following parameters:
  - `name`: The name of the dataset.
  - `data`: The DataFrame representation of the dataset.
  - `targets`: (Optional) The name of the target column(s).
  - `predictions`: (Optional) The name of the prediction column(s).
- The method returns a `PandasDataset` object containing lineage metadata.
- Subclasses of `Reader` implement the method to ensure lineage tracking is specific to their data source.
- Implementation Requirements:
  - The `Reader` class is abstract and cannot be instantiated directly.
  - Subclasses must implement both the `read` and `lineage` methods.
- Error Handling:
  - If the `read` method is not implemented in a subclass, an appropriate error is raised.
  - If `lineage` parameters (e.g., `data`, `name`) are invalid or missing, the method raises an informative error.
- Testing:
  - Unit tests for the `Reader` base class validate correct implementation by subclasses.
  - Tests cover various data sources and edge cases, such as large datasets, missing parameters, and invalid configurations.
- Documentation:
  - Clear docstrings and examples are provided for both the `read` and `lineage` methods.
  - The purpose and usage of the `limit`, `targets`, and `predictions` parameters are well-documented.
- The `Reader` class and all required methods are implemented with clear documentation.
- Subclasses of `Reader` are tested for different data sources (e.g., file, database, cloud).
- The `lineage` method generates correct and complete metadata.
- Code adheres to the project's coding standards and passes peer review.
- Unit tests cover a wide range of scenarios and achieve high test coverage.
- Documentation includes usage examples for developers and data scientists.
Title:
As a data scientist, I want to load datasets stored in Parquet format into a Pandas DataFrame so that I can analyze and process them in memory.
Description:
The ParquetReader class provides functionality to read data from Parquet files and return it as a Pandas DataFrame. It ensures compatibility with modern data storage formats and supports optional row limits.
Acceptance Criteria:
- The `read` method reads the dataset from the specified Parquet file path.
- If a `limit` is provided, the number of rows in the DataFrame is capped accordingly.
- The method raises an informative error if the file path is invalid or the file format is unsupported.
- The returned DataFrame matches the content and structure of the original Parquet file.
Title:
As a data engineer, I want to generate lineage metadata for datasets read from Parquet files so that I can track their origin and ensure reproducibility.
Description:
The lineage method generates metadata describing the dataset's source, name, and optional target or prediction columns. This metadata integrates with lineage tracking tools for debugging and audit trails.
Acceptance Criteria:
- The `lineage` method accepts the following parameters:
  - `name`: The logical name of the dataset.
  - `data`: The Pandas DataFrame representation of the dataset.
  - `targets`: (Optional) The name of the target column(s).
  - `predictions`: (Optional) The name of the prediction column(s).
- The method leverages the `lineage.from_pandas` function to create a `PandasDataset` object.
- The `source` attribute in the metadata corresponds to the Parquet file path.
- Implementation Requirements:
  - The `ParquetReader` class extends the `Reader` base class.
  - The `KIND` attribute is set to `"ParquetReader"` for identification.
- Error Handling:
  - The `read` method raises an error if the file does not exist or is not a valid Parquet file.
  - The `lineage` method validates the presence and correctness of the input parameters (e.g., `name`, `data`).
- Testing:
  - Unit tests for `read` verify correct data loading, row limiting, and error scenarios (e.g., missing or invalid files).
  - Unit tests for `lineage` validate proper metadata generation and integration with lineage tracking tools.
- Documentation:
  - Clear docstrings are provided for the `ParquetReader` class and its methods.
  - Examples include loading a dataset and generating lineage metadata for tracking.
- Load and Limit Rows:

      reader = ParquetReader(path="data/dataset.parquet", limit=1000)
      data = reader.read()
      print(data.head())

- Generate Lineage Metadata:

      reader = ParquetReader(path="data/dataset.parquet")
      data = reader.read()
      lineage_info = reader.lineage(name="My Dataset", data=data, targets="target_column")
      print(lineage_info)

- The `ParquetReader` class is implemented with the `read` and `lineage` methods.
- The class passes all unit tests, including edge cases and error handling.
- The `lineage` metadata integrates seamlessly with lineage tracking tools.
- Documentation and usage examples are complete and accessible to developers and data scientists.
Title:
As a data engineer, I want to save a Pandas DataFrame to a specified location so that I can persist my data for later use or sharing.
Description:
The Writer base class defines an abstract interface for saving datasets to various storage backends (e.g., file systems, databases, cloud storage). Implementations of this class, like the ParquetWriter, provide specific functionality to save data in a defined format.
Acceptance Criteria:
- The `Writer` class defines an abstract `write` method to be implemented by subclasses.
- Subclasses specify the `KIND` attribute to identify the type of writer.
- Documentation exists for how to extend the `Writer` class for other formats or storage solutions.
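
As an illustration of the extension point described above, the sketch below defines the abstract `Writer` and a hypothetical `CsvWriter` subclass. It assumes Pydantic v2 and the same `pdt.BaseModel` / `abc.ABC` inheritance as the class diagram. `CsvWriter` is not part of the project; it only shows the steps a new format would need: set `KIND`, declare its own fields, and implement `write`.

```python
import abc
import pathlib
import tempfile
import typing as T

import pandas as pd
import pydantic as pdt


class Writer(abc.ABC, pdt.BaseModel):
    """Base class for dataset writers (sketch)."""

    KIND: str

    @abc.abstractmethod
    def write(self, data: pd.DataFrame) -> None:
        """Persist the DataFrame to the configured destination."""


class CsvWriter(Writer):
    """Hypothetical extension that writes CSV instead of Parquet."""

    KIND: T.Literal["CsvWriter"] = "CsvWriter"
    path: str

    def write(self, data: pd.DataFrame) -> None:
        data.to_csv(self.path, index=False)


# Write to a temporary file and read the raw text back.
with tempfile.TemporaryDirectory() as tmp:
    target = pathlib.Path(tmp) / "output.csv"
    CsvWriter(path=str(target)).write(pd.DataFrame({"x": [1, 2]}))
    written = target.read_text()
    print(written)
```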
Title:
As a data scientist, I want to save a Pandas DataFrame as a Parquet file so that I can efficiently store and retrieve large datasets.
Description:
The ParquetWriter class provides functionality to save a DataFrame in Parquet format to a local or remote path. This ensures compatibility with modern analytics workflows and data pipelines.
Acceptance Criteria:
- The `ParquetWriter` class:
  - Inherits from the `Writer` base class.
  - Implements the `write` method to save a DataFrame to a Parquet file.
  - Accepts a `path` parameter specifying the storage location (local or S3).
- The `write` method:
  - Calls `pd.DataFrame.to_parquet()` to perform the save operation.
  - Overwrites existing files at the specified path.
- The method raises an error if:
  - The path is invalid or inaccessible.
  - The data cannot be serialized to Parquet format.
- The Parquet file is written with the structure and content of the provided DataFrame.
- Implementation Requirements:
  - The `Writer` class is abstract and enforces implementation of the `write` method in subclasses.
  - The `ParquetWriter` class sets its `KIND` attribute to `"ParquetWriter"` for identification.
- Error Handling:
  - The `write` method raises informative errors for invalid paths, permissions issues, or serialization problems.
  - Validation ensures the `data` argument is a valid Pandas DataFrame.
- Testing:
  - Unit tests verify:
    - Successful writing of DataFrames to Parquet files.
    - Handling of edge cases like empty DataFrames or invalid paths.
  - Tests mock file paths and S3 URLs for isolated functionality.
- Documentation:
  - Clear docstrings are provided for the `Writer` and `ParquetWriter` classes.
  - Usage examples include writing data to both local and S3 paths.
- Write DataFrame to Local Path:

      writer = ParquetWriter(path="data/output.parquet")
      writer.write(data=df)

- Write DataFrame to S3 Path:

      writer = ParquetWriter(path="s3://my-bucket/output.parquet")
      writer.write(data=df)

- The `Writer` and `ParquetWriter` classes are implemented and adhere to the design principles.
- The `write` method in `ParquetWriter` successfully saves DataFrames in Parquet format.
- All functionality is tested with unit tests, including error scenarios.
- Documentation is complete, with clear instructions for usage and extensibility.
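
The testing notes for this story mention mocking S3 URLs so tests stay isolated. One possible shape for such a test is sketched below using `unittest.mock` from the standard library. The `ParquetWriter` here is a stand-in defined only to make the example self-contained, and the test function name is hypothetical. Patching `pd.DataFrame.to_parquet` means no Parquet engine or network access is needed.

```python
import typing as T
from unittest import mock

import pandas as pd
import pydantic as pdt


class ParquetWriter(pdt.BaseModel):
    """Stand-in for the project's writer, defined here for self-containment."""

    KIND: T.Literal["ParquetWriter"] = "ParquetWriter"
    path: str

    def write(self, data: pd.DataFrame) -> None:
        data.to_parquet(self.path)


def test_write_to_s3_delegates_to_pandas() -> None:
    frame = pd.DataFrame({"x": [1, 2]})
    writer = ParquetWriter(path="s3://my-bucket/output.parquet")
    # Replace DataFrame.to_parquet so the test never touches S3.
    with mock.patch.object(pd.DataFrame, "to_parquet") as fake_to_parquet:
        writer.write(data=frame)
    # The writer should hand its configured path to pandas exactly once.
    fake_to_parquet.assert_called_once_with("s3://my-bucket/output.parquet")


test_write_to_s3_delegates_to_pandas()
```

With pytest the final call would be dropped, since the runner discovers `test_` functions on its own.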