Inserts Bechtel invoice CSV data into the `con_master_fact` ClickHouse table (405 columns).
## Python packages

```bash
pip install clickhouse-driver pandas
```

## ClickHouse via Docker
The project expects ClickHouse running with these port mappings:
- `9001` → native TCP (used by `clickhouse-driver`)
- `8124` → HTTP interface
```bash
docker run -d \
  --name gangster-clickhouse \
  -p 9001:9000 \
  -p 8124:8123 \
  -e CLICKHOUSE_USER=admin \
  -e CLICKHOUSE_PASSWORD=admin123 \
  -e CLICKHOUSE_DB=invoice_extraction \
  clickhouse/clickhouse-server:24.3
```

Verify it's healthy:

```bash
docker ps | grep clickhouse
```

## Default run (uses `bechtel_invoices_v2.csv`)
```bash
python insert_to_clickhouse.py --csv bechtel_invoices_v2.csv
```

## Dry run — parses and prints 2 rows, no insert
```bash
python insert_to_clickhouse.py --csv bechtel_invoices_v2.csv --dry-run
```

## Custom connection
```bash
python insert_to_clickhouse.py \
  --csv bechtel_invoices_v2.csv \
  --host localhost \
  --port 9001 \
  --user admin \
  --password admin123 \
  --database invoice_extraction
```

## Flags

| Flag | Default | Description |
|---|---|---|
| `--csv` | `bechtel_invoices_v2.csv` | Path to the invoices CSV |
| `--host` | `localhost` | ClickHouse host |
| `--port` | `9001` | ClickHouse native TCP port (not the HTTP port) |
| `--user` | `admin` | ClickHouse user |
| `--password` | `admin123` | ClickHouse password |
| `--database` | `invoice_extraction` | Target database |
| `--table` | `con_master_fact` | Target table |
| `--batch` | `500` | Rows per insert batch |
| `--dry-run` | off | Preview rows without inserting |
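The flags above map naturally onto `argparse`. A sketch of how the CLI might be declared — the actual wiring inside `insert_to_clickhouse.py` may differ, but the flags and defaults match the table:

```python
import argparse

def build_parser():
    # One add_argument per documented flag, defaults as in the table above.
    p = argparse.ArgumentParser(
        description="Insert invoice CSV data into ClickHouse")
    p.add_argument("--csv", default="bechtel_invoices_v2.csv")
    p.add_argument("--host", default="localhost")
    p.add_argument("--port", type=int, default=9001)
    p.add_argument("--user", default="admin")
    p.add_argument("--password", default="admin123")
    p.add_argument("--database", default="invoice_extraction")
    p.add_argument("--table", default="con_master_fact")
    p.add_argument("--batch", type=int, default=500)
    p.add_argument("--dry-run", action="store_true")
    return p
```

Running with no arguments yields the documented defaults; `--dry-run` becomes the boolean attribute `args.dry_run`.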
## What the script does

- Reads the CSV (63 extracted columns)
- Casts and expands every row to all 405 columns (remaining columns default to NULL/0)
- Sorts rows by `inv_invoice_date` to minimise partition spread per batch
- Inserts in batches of 500 rows into `con_master_fact`
- Creates the table automatically if it doesn't exist (reads `construction_schema/con_master_fact.sql`)
- Runs verification queries after insert
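The sort-then-batch step can be sketched with pandas (the function name is illustrative; the real script also expands each row to the full 405 columns before inserting):

```python
import pandas as pd

def iter_batches(df: pd.DataFrame, batch_size: int = 500):
    """Sort by inv_invoice_date so each batch touches as few monthly
    partitions as possible, then yield slices of at most batch_size rows."""
    ordered = df.sort_values("inv_invoice_date")
    for start in range(0, len(ordered), batch_size):
        yield ordered.iloc[start:start + batch_size]
```

Sorting first means rows for the same month land in the same batch, which keeps the number of partitions touched per INSERT block small.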
## Troubleshooting

### `Unexpected packet from server on connect`

You're hitting the HTTP port (8124) instead of the native TCP port. Use `--port 9001`.
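A quick way to confirm which port is which: ClickHouse's HTTP interface answers `GET /ping` with `Ok.`, while the native port does not speak HTTP at all. A stdlib-only probe (the function name is illustrative):

```python
import urllib.request

def is_http_port(host: str, port: int, timeout: float = 3.0) -> bool:
    """True only if host:port serves ClickHouse's HTTP /ping endpoint."""
    try:
        with urllib.request.urlopen(
                f"http://{host}:{port}/ping", timeout=timeout) as resp:
            return resp.read().strip() == b"Ok."
    except OSError:
        # Connection refused, timeout, or a non-HTTP server on that port.
        return False
```

With the Docker mapping above, `is_http_port("localhost", 8124)` should be true and `9001` false — so `clickhouse-driver` belongs on 9001.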
### `Too many partitions for single INSERT block`

A single batch spans too many invoice-date months. The script already mitigates this by sorting rows before insert and passing `max_partitions_per_insert_block=1000` per query. If the error still occurs, reduce `--batch` to 100.
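With `clickhouse-driver`, per-query settings go through the `settings=` argument of `execute`. A sketch of how the mitigation might be applied (the function and column handling are illustrative, not lifted from the script):

```python
# Raise the partition-per-block limit for our inserts only
# (the server-side default for max_partitions_per_insert_block is 100).
INSERT_SETTINGS = {"max_partitions_per_insert_block": 1000}

def insert_batch(client, table, columns, rows):
    """Insert one batch of rows with the relaxed partition limit."""
    column_list = ", ".join(columns)
    client.execute(
        f"INSERT INTO {table} ({column_list}) VALUES",
        rows,
        settings=INSERT_SETTINGS,
    )
```

Passing the setting per query avoids changing the server-wide default for other workloads.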
### Table `con_master_fact` not found

The script auto-creates the table from `construction_schema/con_master_fact.sql`. Make sure that file exists before running.
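The auto-create check might look like this — a sketch only; the existence query and the `ensure_table` helper name are assumptions, not taken from the script:

```python
from pathlib import Path

def ensure_table(client, database, table, ddl_path):
    """Create the target table from its DDL file if it is missing."""
    (count,), = client.execute(
        "SELECT count() FROM system.tables "
        "WHERE database = %(db)s AND name = %(name)s",
        {"db": database, "name": table},
    )
    if count == 0:
        # Run the CREATE TABLE statement stored in the schema file.
        client.execute(Path(ddl_path).read_text())
```

Checking `system.tables` first makes the call idempotent, so it is safe to run on every invocation.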