Add Enhanced Ingestion Mode to genddl Tool #85

Open: ron-daniel1 wants to merge 20 commits into prestodb:main from ron-daniel1:rework-perf-wxd

Adds enhanced ingestion mode to cmd/genddl for generating TPC-DS data
ingestion SQL files.

Key Features:

  • Two-stage ingestion: separate source (CSV/TEXTFILE) and target (Parquet) tables
  • Catalog support: multi-catalog environments with source/target catalogs
  • Format handling: CSV with CAST/NULLIF, TEXTFILE with direct SELECT
  • Engine support: Presto (WITH clause) and Spark (USING iceberg + TBLPROPERTIES)
  • Correct schema syntax: CREATE SCHEMA catalog.schema WITH (location = 's3a://...')
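
As a rough illustration of the schema syntax listed above, the statement could be rendered with Go's text/template, which is the templating approach the .tmpl files suggest. This is only a sketch: the schemaParams type, the renderCreateSchema helper, and the catalog, schema, and bucket values are hypothetical and are not taken from this PR.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// schemaParams holds illustrative inputs; the field names are assumptions,
// not the actual struct used by genddl.
type schemaParams struct {
	Catalog  string
	Schema   string
	Location string
}

// schemaTmpl mirrors the statement shape quoted in the feature list.
var schemaTmpl = template.Must(template.New("schema").Parse(
	"CREATE SCHEMA {{.Catalog}}.{{.Schema}} WITH (location = '{{.Location}}')"))

// renderCreateSchema renders one CREATE SCHEMA statement for a catalog/schema pair.
func renderCreateSchema(p schemaParams) (string, error) {
	var sb strings.Builder
	if err := schemaTmpl.Execute(&sb, p); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	// Hypothetical values, for demonstration only.
	sql, err := renderCreateSchema(schemaParams{
		Catalog:  "iceberg",
		Schema:   "tpcds_sf1",
		Location: "s3a://example-bucket/tpcds/sf1/",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(sql)
}
```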

Files Changed:

New Templates:

  • create_source_table.sql.tmpl - Source table DDL (CSV/TEXTFILE)
  • create_target_table.sql.tmpl - Target table DDL (Parquet)

Modified: main.go

Schema struct additions:

  • Mode string - Distinguishes "enhanced_ingestion" from legacy mode
  • SourceFileFormat string - "CSV" or "TEXTFILE"
  • SourceSchema, TargetSchema string - Separate schema names
  • SourceCatalog, TargetCatalog string - Multi-catalog support
  • S3SourceLocation, S3TargetLocation string - Separate S3 paths
  • Engine string - "presto" or "spark" for engine-specific syntax

New functions:

  • isEnhancedIngestionMode() - Mode detection helper
  • generateSourceTable() - Generates 1a-create-source-*.sql
  • generateTargetTable() - Generates 1b-create-target-*.sql

Modified functions:

  • loadSchemas() - In enhanced mode, generates only the specified catalog type
    (e.g. iceberg=true → Iceberg only) instead of all 4 variants
  • generateSchemaFromDef() - Routes to enhanced or legacy generation logic
    based on mode
  • Run() - Orchestrates enhanced vs legacy workflow

Modified: insert_table.sql.tmpl

  • Added conditional CAST/NULLIF for CSV: CAST(NULLIF(column, '') AS type)
  • Added direct SELECT * for TEXTFILE format
  • Added catalog-qualified table references: source_catalog.source_schema.table
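
The template changes above can be approximated in plain Go as follows. This is a sketch of the format-dependent logic only: the column and ingest types, the buildInsert helper, and the sample table and column names are hypothetical, and qualifying the target table with its catalog is an assumption (the description only mentions catalog-qualified source references).

```go
package main

import (
	"fmt"
	"strings"
)

// column is an illustrative name/type pair for a target column.
type column struct {
	Name string
	Type string
}

// ingest carries the source/target naming used to qualify tables.
type ingest struct {
	SourceCatalog, SourceSchema string
	TargetCatalog, TargetSchema string
	SourceFileFormat            string // "CSV" or "TEXTFILE"
}

// selectList returns "*" for TEXTFILE sources and a CAST(NULLIF(...)) list for CSV,
// so that empty CSV strings become NULL before being cast to the target type.
func selectList(format string, cols []column) string {
	if strings.EqualFold(format, "TEXTFILE") {
		return "*"
	}
	exprs := make([]string, 0, len(cols))
	for _, c := range cols {
		exprs = append(exprs, fmt.Sprintf("CAST(NULLIF(%s, '') AS %s)", c.Name, c.Type))
	}
	return strings.Join(exprs, ", ")
}

// buildInsert assembles an INSERT ... SELECT with catalog-qualified table names.
func buildInsert(in ingest, table string, cols []column) string {
	return fmt.Sprintf("INSERT INTO %s.%s.%s SELECT %s FROM %s.%s.%s",
		in.TargetCatalog, in.TargetSchema, table,
		selectList(in.SourceFileFormat, cols),
		in.SourceCatalog, in.SourceSchema, table)
}

func main() {
	cols := []column{{"ss_item_sk", "BIGINT"}, {"ss_quantity", "INTEGER"}}
	in := ingest{
		SourceCatalog: "hive", SourceSchema: "tpcds_src",
		TargetCatalog: "iceberg", TargetSchema: "tpcds_tgt",
		SourceFileFormat: "CSV",
	}
	fmt.Println(buildInsert(in, "store_sales", cols))

	in.SourceFileFormat = "TEXTFILE"
	fmt.Println(buildInsert(in, "store_sales", cols))
}
```

Keeping the CAST/NULLIF decision in one helper means the TEXTFILE path cannot accidentally pick up per-column expressions.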

Testing:
✅ All 10 tests pass
✅ Backward compatible - legacy mode unchanged
✅ Generated examples match golden files

Usage:
Enhanced: go run main.go genddl config_enhanced_ingestion.json
Legacy: go run main.go genddl config.json (unchanged)

rzIBM and others added 15 commits March 17, 2026 14:08
)

* add new version of stream run for 10TB
* add README.md in queries_v2
* add hyperlink to TPCDS_FIXES_SUMMARY_PRESTO.md
* move README location
)

Bumps [golang.org/x/crypto](https://github.com/golang/crypto) from 0.36.0 to 0.45.0.
- [Commits](golang/crypto@v0.36.0...v0.45.0)

---
updated-dependencies:
- dependency-name: golang.org/x/crypto
  dependency-version: 0.45.0
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…db#80)

Bumps [filippo.io/edwards25519](https://github.com/FiloSottile/edwards25519) from 1.1.0 to 1.1.1.
- [Commits](FiloSottile/edwards25519@v1.1.0...v1.1.1)

---
updated-dependencies:
- dependency-name: filippo.io/edwards25519
  dependency-version: 1.1.1
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
- Fix data race: remove concurrent write to pseudoStage.States.RunStartTime
  inside syncedTime callback in loadjson (line 118 sets it after goroutines finish)
- Guard saveQueryJsonFile against empty QueryId to avoid calling GetQueryInfo("")
- Fix zerolog field chain in runShellScripts: assign result back to logEntry so
  stdout, stderr, exit_code, and stage fields are actually emitted
- Remove dead ValidateRequiredFlags() call in queryplan (no required flags exist)
- Log os.Remove error in genconfig stale file cleanup
- Switch FileBasedRunRecorder to encoding/csv.Writer to properly escape fields
  and log write errors instead of silently discarding them
- Replace shared package-level RunsValueOne/RunsValueZero with intPtr() helper
  to avoid aliasing risk where mutation would affect all stages
- Use errors.New instead of fmt.Errorf("%s", ...) in loadjson
- Replace custom fileNameWithoutPathAndExt with filepath.Base + filepath.Ext
- Check handleQueryError return value for SELECT COUNT(*) in table_summary
…stodb#81)

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Replace the previous run-specific delete with a generic cleanup that removes orphaned rows from five Presto query metadata tables (presto_query_creation_info, presto_query_operator_stats, presto_query_plans, presto_query_stage_stats, presto_query_statistics). Each DELETE uses a LEFT JOIN to presto_benchmarks.pbench_queries and removes rows where p.query_id IS NULL, ensuring metadata not referenced by pbench_queries is purged. This replaces the prior ad-hoc delete targeting r.run_id IN (2833).
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
…ntext propagation

- Quote SQL identifiers in table_summary.go to prevent injection via adversarial names
- Handle RowsAffected errors in mysql_run_recorder.go instead of discarding
- Check csv.Writer.Write errors in run_recorder.go
- Propagate parent context in GetCtxWithTimeout instead of using context.Background()
- Add dedicated HTTP client with 30s timeout for Pulumi API calls
- Handle JSON null in Float64Time.UnmarshalJSON
- Log queryOutputFile.Close() errors in stage.go
- Remove redundant continue in unmarshaller.go pointer loop
- Add tests for sqlIdent, Float64Time null, GetCtxWithTimeout, FileBasedRunRecorder
The query_json package was renamed to queryjson in v2.1.1.
Updated all import paths and package qualifiers across 6 files.
If the source file format is TEXTFILE, the insert can be completed without CAST.

Fixed a syntax error in schema creation for the target table.

The SQL files are created only for the specified schema and catalog, not for every table combination.
ron-daniel1 requested a review from ethanyzhang as a code owner on April 1, 2026 15:29

3 participants