
Add plan for coalescing operations during streaming #770

Draft

tsg wants to merge 2 commits into main from stream_batch_operations_plan

Conversation

Member
@tsg tsg commented Mar 11, 2026

Trying something new: this adds only the plan for solving #769, so we can discuss it before implementing.

Putting it in Draft mode because this is only meant for discussion.


## Expected impact

For the customer's workload (9,737 DELETEs in a single batch):
Collaborator

Did we see other columns in the WHERE condition in users' WAL events? In the future we could make it more flexible and batch DELETE statements with different column names. Example:

`DELETE FROM t1 WHERE my_column IN ($1, $2, $3)`

Member Author

WALs always refer to changes by their identity columns, so we don't have to support complex WHERE conditions.

Collaborator

@kvch kvch left a comment

This is a good improvement. Hopefully, it will solve most of the issues for us. If not, there are several ways we can improve the process.


Batch raw WAL events instead of pre-built SQL strings, then build bulk SQL at execution time.

- N DELETEs on the same table become: `DELETE FROM t WHERE "id" IN ($1, $2, ..., $N)`
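The coalescing in this bullet can be sketched in Go. This is a hedged illustration; `buildBulkDelete` and its signature are hypothetical, not the PR's actual API:

```go
package main

import (
	"fmt"
	"strings"
)

// buildBulkDelete collapses a run of N single-row DELETEs on the same table
// into one statement: DELETE FROM t WHERE "id" IN ($1, ..., $N).
func buildBulkDelete(table, idCol string, ids []any) (string, []any) {
	placeholders := make([]string, len(ids))
	for i := range ids {
		placeholders[i] = fmt.Sprintf("$%d", i+1)
	}
	sql := fmt.Sprintf(`DELETE FROM %s WHERE %q IN (%s)`,
		table, idCol, strings.Join(placeholders, ", "))
	return sql, ids
}

func main() {
	sql, args := buildBulkDelete("t", "id", []any{10, 11, 12})
	fmt.Println(sql)       // DELETE FROM t WHERE "id" IN ($1, $2, $3)
	fmt.Println(len(args)) // 3
}
```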
Collaborator

Should we also consider composite primary keys?

Collaborator

Ah, I see that they are included.
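For composite identity keys, the same coalescing idea can use row constructors in the `IN` list. A minimal sketch; `buildCompositeDelete` and its signature are illustrative, not the PR's actual API:

```go
package main

import (
	"fmt"
	"strings"
)

// buildCompositeDelete coalesces DELETEs keyed by several identity columns
// using row constructors: DELETE FROM t WHERE ("a", "b") IN (($1, $2), ...).
func buildCompositeDelete(table string, keyCols []string, rows [][]any) (string, []any) {
	quoted := make([]string, len(keyCols))
	for i, c := range keyCols {
		quoted[i] = fmt.Sprintf("%q", c) // double-quote each identifier
	}
	var tuples []string
	var args []any
	n := 1
	for _, row := range rows {
		ph := make([]string, len(row))
		for i := range row {
			ph[i] = fmt.Sprintf("$%d", n)
			n++
		}
		tuples = append(tuples, "("+strings.Join(ph, ", ")+")")
		args = append(args, row...)
	}
	sql := fmt.Sprintf("DELETE FROM %s WHERE (%s) IN (%s)",
		table, strings.Join(quoted, ", "), strings.Join(tuples, ", "))
	return sql, args
}

func main() {
	sql, args := buildCompositeDelete("t", []string{"a", "b"}, [][]any{{1, "x"}, {2, "y"}})
	fmt.Println(sql)       // DELETE FROM t WHERE ("a", "b") IN (($1, $2), ($3, $4))
	fmt.Println(len(args)) // 4
}
```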


1. Separate DDL and DML messages
2. For DDL: build and execute queries via existing `ddlAdapter`
3. For DML: walk messages in order, building "runs" of consecutive same-(schema, table, action) events:
Collaborator

Given that we already walk the messages in order, we can make the query aggregator logic a bit smarter and consider coalescing interleaved DELETEs when no INSERT or UPDATE statement conflicts with them.

Member Author

Yes, we can add this later. In some situations we can prove that it's still correct.
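Step 3's run-building (walking messages in order and grouping consecutive same-(schema, table, action) events) could look roughly like this. The types are simplified stand-ins for the PR's `walMessage`, not the real implementation:

```go
package main

import "fmt"

// walEvent is a simplified stand-in for the plan's walMessage.
type walEvent struct {
	Schema, Table, Action string // Action: "I", "U", or "D"
}

// runKey identifies a coalescible run.
type runKey struct{ Schema, Table, Action string }

// buildRuns walks events in order and groups consecutive events sharing
// (schema, table, action) into runs, preserving overall ordering.
func buildRuns(events []walEvent) [][]walEvent {
	var runs [][]walEvent
	var cur runKey
	for _, e := range events {
		k := runKey{e.Schema, e.Table, e.Action}
		if len(runs) == 0 || k != cur {
			runs = append(runs, nil) // start a new run on any key change
			cur = k
		}
		runs[len(runs)-1] = append(runs[len(runs)-1], e)
	}
	return runs
}

func main() {
	events := []walEvent{
		{"public", "t1", "D"}, {"public", "t1", "D"},
		{"public", "t2", "I"},
		{"public", "t1", "D"},
	}
	fmt.Println(len(buildRuns(events))) // 3
}
```

Note the interleaved `INSERT` on `t2` splits the `t1` DELETEs into two runs, which is exactly the case the comment above suggests relaxing later.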

Collaborator

@kvch kvch left a comment

Solid plan, small comments

3. For DML: look up `schemaInfo` via schema observer (cached), create `walMessage` with raw `wal.Data` + `schemaInfo`
4. Send `walMessage` to batch sender

### 4. Refactor BatchWriter.sendBatch — bulk query building
Collaborator

What happens if the sending fails? Do we fall back to separate DELETE statements?

Member Author

Maybe for simplicity, for now we log it as DATALOSS. I think this is similar to what we do for batch inserts in snapshot mode.


In practice this works well for the target workload: WAL events from bulk operations on the source database (batch purges, accounting reconciliation, ETL loads) naturally produce long runs of the same operation on the same table, which coalesce effectively.
5. Execute via existing `flushQueries` / `execQueries`
6. Respect PostgreSQL's 65,535 parameter limit — split runs at ~60,000 params
Collaborator

@kvch kvch Mar 11, 2026

I guess we can avoid this limit by using `WHERE id = ANY($1::bigint[])`, because the array is just one parameter.

So the code would be:

```go
sql := fmt.Sprintf("DELETE FROM %s WHERE id = ANY($1::%s[])", table, idType)
args := []any{idArray}
```

This would perform better for multiple reasons:

- only one parameter is passed to PostgreSQL, so the parameter-binding overhead is smaller
- in the planning phase, the original suggestion takes longer because the planner has to optimize a long condition with many OR operators; with the array, there is no such optimization

Member Author

Great suggestion! Seems like we can do this when the types are well-known scalar types, and fall back to `IN` for the corner cases.
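Under that assumption (a well-known scalar identity column), the `ANY` variant stays within a single bind parameter. A hedged sketch with illustrative names; actually executing it would go through a driver such as pgx, which binds a Go slice as a Postgres array:

```go
package main

import "fmt"

// buildAnyDelete emits a one-parameter DELETE using = ANY on a typed array,
// sidestepping the 65,535 bind-parameter limit. idType is the Postgres
// scalar type of the identity column, e.g. "bigint".
func buildAnyDelete(table, idCol, idType string, ids []int64) (string, []any) {
	sql := fmt.Sprintf("DELETE FROM %s WHERE %q = ANY($1::%s[])", table, idCol, idType)
	return sql, []any{ids} // the whole slice is a single parameter
}

func main() {
	sql, args := buildAnyDelete("t", "id", "bigint", []int64{10, 11, 12})
	fmt.Println(sql)       // DELETE FROM t WHERE "id" = ANY($1::bigint[])
	fmt.Println(len(args)) // 1
}
```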

@tsg tsg mentioned this pull request Mar 12, 2026