This backend is being built as a sequential pipeline. The `--mode` flag in `src/main.py` means "run up to this stage", not "run only this stage in isolation".
Current intended stage order:
`input` → `format` → `infer_type` → `infer_structure` → `compute_temporal_stats` → `integrity` → `audit_missingness` → `handle_missingness` → `standardize` → `univariate_metrics_plotting` → `test_transforms`
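The "run up to this stage" semantics can be sketched as a prefix of the ordered stage list. This is a minimal illustration, not the actual dispatch in `src/main.py`; the helper name `stages_to_run` is hypothetical.

```python
# Sketch of "--mode means run up to this stage" semantics.
STAGE_ORDER = [
    "input", "format", "infer_type", "infer_structure",
    "compute_temporal_stats", "integrity", "audit_missingness",
    "handle_missingness", "standardize",
    "univariate_metrics_plotting", "test_transforms",
]

def stages_to_run(mode: str) -> list[str]:
    """Return every stage up to and including `mode`, in pipeline order."""
    if mode not in STAGE_ORDER:
        raise ValueError(f"unknown mode: {mode!r}")
    return STAGE_ORDER[: STAGE_ORDER.index(mode) + 1]
```

For example, `stages_to_run("integrity")` yields the first six stages, so integrity always sees the schema inferred by the earlier stages.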
- Backend runtime code now lives under `src/`.
- Ingestion stages live under `src/ingest/`.
- Configuration lives under `src/config/`.
- Shared deterministic helpers live under `src/tools/`.
- The time column is derived by the formatter stage. It should not be manually overridden from the CLI.
- Entity identifiers are also intended to be derived, not passed in manually.
- In `infer_type.py`, secondary/entity keys should be modeled as `secondary_keys: list[str]`, not a single string, because real datasets may require a composite entity key such as `(store_id, sku_id)`.
- A dataset with entity identifiers should generally be classified as `"multiple"` even if it also has several numeric value columns.
- `infer_type` is deterministic-first. It uses helper functions in `src/tools/input_tools.py` and only uses fuzzy/agentic fallback for ambiguous secondary-key cases.
- If timestamps are mostly unique, the pipeline should short-circuit away from panel detection and classify the dataset as single-series or multivariate.
- `run_input_handler()` detects bad/non-data rows and stores them as `bad_rows`.
- `bad_rows` is intended for rows that appear in the loaded dataset and do not behave like observations because temporal fields are missing or unparseable.
- Each bad row currently carries:
  - `row_index`
  - `csv_row_number`
  - `temporal_values`
  - `reasons`
  - `raw_row`
  - `fuzzy_descriptor`
- The fuzzy descriptor is intentionally lightweight and can be used downstream for later cleanup or reporting.
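An illustrative shape for one bad-row record, with keys mirroring the fields listed above. The descriptor format here is an assumption; only its "lightweight, good enough for later cleanup/reporting" role comes from the notes:

```python
# Illustrative bad-row record; keys mirror the documented fields.
def make_bad_row(row_index, csv_row_number, temporal_values, reasons, raw_row):
    return {
        "row_index": row_index,            # dataframe row index
        "csv_row_number": csv_row_number,  # physical line number in the CSV
        "temporal_values": temporal_values,
        "reasons": reasons,                # e.g. ["unparseable_timestamp"]
        "raw_row": raw_row,
        # Lightweight descriptor for downstream cleanup/reporting
        # (format is an assumption, not the real implementation).
        "fuzzy_descriptor": "|".join(str(v)[:20] for v in raw_row),
    }
```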
- The recent confusion around `FREDtest.csv` was interpretive, not a current code bug:
  - `row_index` is the dataframe row index
  - `csv_row_number` is the physical CSV line number
  - the `Transform:` row is a bad data row after the header, not the header itself
  - the trailing comma-only row was already being detected
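Assuming a single header line and 1-based physical line numbering (both assumptions about the file layout, not confirmed by the notes), the two indices relate by a fixed offset:

```python
# Hypothetical mapping between dataframe row index and physical CSV line.
def csv_row_number(row_index: int, header_lines: int = 1) -> int:
    # dataframe row 0 sits on physical line header_lines + 1
    # when physical lines are numbered from 1
    return row_index + header_lines + 1
```

Under those assumptions the first data row (`row_index == 0`) is physical line 2, which is where a bad row immediately after the header would land.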
- Integrity checks depend on inferred schema, so `infer_type` should happen before `integrity`.
- The quality-handling stages are intended to run after integrity:
  - `audit_missingness` measures value missingness and timestamp holes
  - `handle_missingness` chooses bounded repair actions from the audit
  - `standardize` optionally applies scale transforms after missingness handling
- The current integrity implementation only supports one entity identifier for sanity checks.
- As a temporary bridge, integrity uses the first element of `secondary_keys` as `entity_col`.
- This is intentionally temporary. The correct long-term behavior is to support composite entity keys directly during duplicate and jump checks.
- The quality-handling package now exists under `src/quality_handling/` and is wired into `src/main.py` via the `audit_missingness`, `handle_missingness`, and `standardize` modes.
- `audit_missingness` is deterministic. It separates:
  - missing values inside observed rows
  - missing timestamps / coverage holes in the expected time grid
- Timestamp holes are currently audited and logged, not repaired:
  - The backend does not currently reindex to an expected grid
  - It does not insert synthetic rows for absent timestamps
  - It does not impute missing timestamps as part of `handle_missingness`
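A minimal deterministic audit illustrating the separation above, assuming an integer time axis with a fixed expected step (the real stage's grid logic and names may differ). Note that holes are only reported, never filled:

```python
# Deterministic audit sketch: cell missingness is counted per column;
# timestamp holes are detected against an expected step and only reported.
def audit(rows: list[dict], time_col: str, step: int) -> dict:
    value_missing: dict[str, int] = {}
    for row in rows:
        for col, v in row.items():
            if col != time_col and v is None:
                value_missing[col] = value_missing.get(col, 0) + 1
    times = sorted(row[time_col] for row in rows)
    # Expected-grid timestamps absent between consecutive observations.
    holes = [t + step
             for a, b in zip(times, times[1:])
             for t in range(a, b - step, step)]
    return {"value_missing": value_missing, "timestamp_holes": holes}
```

With rows at t = 0, 10, 40 and step 10, the audit reports holes at 20 and 30 but leaves the row set untouched.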
- `handle_missingness` is intended for cell-level missing-value handling only.
  - The LLM chooses among bounded strategies such as `leave_as_nan`, `forward_fill`, `interpolate`, `zero_fill`, and `drop_rows`
  - The actual data mutation is deterministic and recorded in traces/artifacts
- `standardize` is optional at the dataset level.
  - The stage first profiles scale/tail behavior deterministically
  - Then an LLM gate decides whether standardization (EDA rule 9) should run at all for the dataset
  - If the gate says `no`, the stage returns the handled dataset unchanged
  - If the gate says `yes`, the LLM may still choose `none` for individual columns
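The two-level decision above can be sketched as control flow. `gate` and `choose` stand in for the LLM calls, the column dict and the `zscore` action are illustrative, and the actual mutation stays deterministic:

```python
# Control-flow sketch of the gate-then-per-column standardize decision.
def standardize_stage(data: dict[str, list[float]], profile, gate, choose) -> dict:
    """data maps column name -> values; gate/choose stand in for LLM calls."""
    if gate(profile) == "no":
        return data                     # handled dataset passes through unchanged
    out = dict(data)
    for col, action in choose(profile).items():
        if action == "none":            # gate said yes, column still opts out
            continue
        if action == "zscore":          # deterministic transform
            vals = out[col]
            mu = sum(vals) / len(vals)
            sd = (sum((v - mu) ** 2 for v in vals) / len(vals)) ** 0.5 or 1.0
            out[col] = [(v - mu) / sd for v in vals]
    return out
```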
- The intended current behavior for SCADA / sensor-style datasets is usually:
- audit missingness
- optionally handle cell-level missing values
- skip standardization unless there is a strong modeling-oriented reason
- The user-facing plan summaries for `handle_missingness` and `standardize` are synthesized from the final normalized action list, so they should match the actual applied plan rather than raw LLM prose.
- The univariate-analysis package now exists under `src/univariate_analysis/` and is wired into `src/main.py` via the `univariate_metrics_plotting` and `test_transforms` modes.
- `univariate_metrics_plotting` covers EDA rules 10 and 11.
  - It computes deterministic univariate summary metrics for numeric features
  - It writes per-feature histogram + ECDF plots
  - KDE is only plotted when there is enough continuous support for a stable deterministic Gaussian-kernel estimate
  - The stage analyzes the handled dataset path rather than the standardized path, so raw-value interpretability is preserved for univariate EDA
- `test_transforms` covers EDA rule 12.
  - It is fully deterministic
  - It only tests transforms for columns with enough evidence that a transform might matter, currently using thresholds on non-null support, skewness, and tail ratio
  - Candidate transforms are compared by deterministic shape-improvement scores rather than LLM judgment or external fitting libraries
  - The stage reports recommendations; it does not mutate the dataset
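One possible deterministic shape-improvement score, assuming "lower absolute skewness is a better shape". The candidate set and the scoring rule are assumptions; the notes only establish that scoring is deterministic and the stage recommends without mutating:

```python
import math

def skewness(xs: list[float]) -> float:
    """Population skewness of a sample (deterministic, no external libraries)."""
    n = len(xs)
    mu = sum(xs) / n
    sd = math.sqrt(sum((x - mu) ** 2 for x in xs) / n) or 1.0
    return sum(((x - mu) / sd) ** 3 for x in xs) / n

def recommend_transform(xs: list[float]) -> str:
    # Domain guards fall back to the identity for invalid candidates.
    candidates = {
        "none": xs,
        "log1p": [math.log1p(x) for x in xs] if min(xs) > -1 else xs,
        "sqrt": [math.sqrt(x) for x in xs] if min(xs) >= 0 else xs,
    }
    # Report the candidate with the least |skew|; never mutate the dataset.
    return min(candidates, key=lambda name: abs(skewness(candidates[name])))
```

On roughly symmetric data the identity wins; on exponentially growing data `log1p` scores best because it flattens the heavy right tail.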
- The project is moving toward one consolidated state shared across the sequential pipeline.
- Until that is finalized, stage outputs should expose a composite state that includes every field that has ever been part of the pipeline state, even if a given stage has not populated some of those fields yet.
- For earlier modes such as `infer_type`, later-stage integrity-related fields may still be present with default values. That is expected.
- There is known redundancy between `report` and other top-level state fields.
- Short term, carrying redundant fields is acceptable while the consolidated state is still being worked out.
- Longer term, internal graph state should likely be normalized, with any packaged `report` artifact assembled at stage boundaries or final output.
- Deterministic evidence that was cluttering the state has been moved into trace files under `traces/` rather than kept in the main payload.
- Stage-produced dataset artifacts such as handled/standardized CSV outputs are also written under that same top-level `traces/` directory.
- `src/main.py` should prefer derived schema over manual key arguments.
- The output of a stage should reflect the state accumulated up to that stage.
- In particular, `infer_type` should return the full composite payload, not only the narrowed `type`/`primary_key`/`secondary_keys` subset.