Skip to content

POC Row event sequence stamping for submissions#1699

Merged
brontolosone merged 2 commits into
getodk:masterfrom
brontolosone:eventcounter-by-row
Dec 19, 2025
Merged

POC Row event sequence stamping for submissions#1699
brontolosone merged 2 commits into
getodk:masterfrom
brontolosone:eventcounter-by-row

Conversation

@brontolosone
Copy link
Copy Markdown
Contributor

@brontolosone brontolosone commented Nov 28, 2025

Related: getodk/central#1439 . With the note that here we are not necessarily going for fast revalidation, explained below.

This is the DB part of per-row-eventstamps for submissions. We can do other tables as well but since this is a POC, I'm leaving it at this for now.

I've chosen to try to do it with an application-transparent approach. The appeal is that it "just" works, as evident from the absence of any application modifications.

But to make that bliss possible, I went down the path of using a deferred constraint trigger to apply the stamps automatically at commit time, and there's some trickyness around that which led to some unexpected code. It probably needs an in-person walkthrough to answer all the "why not just..." type questions that would also jump to my mind if I hadn't just tried to "just ..." and found out why one can't "just ...".

Approach

  1. For max(event) (+row count, to account for deletions) on a projection to work as an etag (or cursor), we need strict ordering, so we need to prevent later-committed-rows having an earlier-generated (or, well, "sorted-earlier") eventstamp. Can't use a sequence. Can't use timestamps.
  2. The only way to do so is to have a critical section that effectively serializes commits (really, in PostgreSQL, IIUC, commits are always per-table serialized, but here we want to control the order.
  3. So the critical section should last from acquiring the new event sequence number until transaction commit. When the transaction commits or rolls back, the critical section would unlock, and then the next transaction (which can be a concurrent transaction that was blocking on the critical section) can do the same thing. Corollary: whichever transaction enters the critical section first (A), commits before any other transaction (B), as (B) can not commit until it itself enters the critical section, but that is locked until (A) commits (or rolls back). This is what we want! 🥳
  4. For performance under concurrency, we want that critical section window to be as small as possible. So we must be able to fuss around inserting/modifying rows, and only enter the critical section at the very last moment.
  5. The primary locked resource is an bigint event sequence counter (just a 1-column row in a 1-row table) that is incremented by 1 for every transaction commit event. This yields 2 additional properties for the "event tokens" that are not actually required for etags or cursors, but that are, I suppose, aesthetically pleasing:
    1. Not-required property that's there anyway 1: They are monotonically incremented, and we shouldn't see gaps. If some transaction fails (and rolls back), the increment is also rolled back.
    2. Not-required property that's there anyway 2: Anything part of the same externally-visible-effect-group (anything committed in 1 transaction) has the same event number. This property may yet show its use. For instance, if projects and forms are also eventstamped from the same counter, then if you want to know "did the project change since the last change to this collection of submissions", then you can simply check whether the project has a higher event stamp than the highest event stamp in your collection of submissions 👏 .
  6. Preemptively Asked Questions on alternative tokens (now that we have a critical section in which we can synthesize or derive those anyway):
    1. "Hey why not use a timestamp" — timestamps are not guaranteed to be distinct, especially with the truncation to millisecond-resolution that we're doing (which I find questionable since it's been done out of entanglement with a particular frontend expectation, but that's another story)
    2. "Hey why not use the xid8 transaction IDs — two reasons: main one being that the semantics of those are that they can wrap around, vacuuming can do that. Transaction IDs are only required to be unique among all currently ongoing transactions, not among all transactions that ever happened. So we'd get duplication pretty quickly I fear. Second (and minor) reason is that they're not pretty.
    3. "Hey shouldn't we start the counting at MIN_BIGINT (some negative number) so that we can do twice the number of (committed, observable) events before running into trouble" — Yeah we could but we'd forever be staring down very large negative numbers, so, trumped by aesthetics/DX I suppose?
  7. If I want to be application-transparent, and want to postpone entering the critical section until the very last moment, AFAIK the only thing that lends itself to being used for that is a deferred constraint trigger. This forces our hand:
    1. A deferred constraint trigger is always a per-row trigger. So whatever function it'll call, that function will be called for every row inserted/mutated in the transaction. This forces our hand if we want the (not strictly required!) property of "if it becomes visible at the same time, then it has the same event ID" because we'll then need to increment the event counter exactly once in the transaction whereas our trigger function may be called multiple times. So we'll need some extra logic to handle that. I've implemented that but we could kill it if we don't think it's worth it. But I think it's worth it, see "Not-required property that's there anyway 2" a bit higher up. This is where the business with the advisory lock comes from — it's taken upon processing the first row (first incantation of the deferred trigger function) and checked in every incantation to deduct if the increment has already happened in this transaction. The great thing about this is that similar triggers on separate tables can coordinate this way so that indeed should we choose to have eventstamps on projects, forms, whatever — then we can easily have them while preserving the "if it becomes visible at the same time, then it has the same event ID" property across all those tables. 🥳
    2. As it is a constraint trigger, not a "normal" trigger, it does nothing with any modifications you make to NEW. But we can make it have side effects (booooo!), thus writing a table anyway, which is how the event stamps are actually set. It's a bit of a perversion but I haven't found any alternatives. This, again, forces our hand:
      1. Performance wise it's not ideal to run a query (though indexed, with guaranteed 1 result) for every row added/modified. But what can I do 🤷‍♂️
      2. And we need some trick to avoid recursion in the trigger. The trick I employ is to have two normal (non-deferred) row triggers that set the event column to NULL for any insert or update, and then let the deferred trigger only fire (and set the event) for those rows where the event is NULL. 👏 So that's where the two triggers (plus one trigger function) (blank_submissions_event_*) that superficially may seem like excessive machinery come from.

Performance impact

A quick & dirty peek (with 10000 submission POSTs, with the worker system turned off):

This branch,      1 thread:  216 submissions/s
                  8 threads: 617 submissions/s
Master (9183257), 1 thread:  253 submissions/s
                  8 threads: 624 submissions/s

Some impact is to be expected, there's no free lunch, we're doing more stuff. I'm not panicking.

I've clocked the overhead at 89µs per row by just timing updating 10_000 submission rows with a new timestamp, comparing master with this branch. But anyway we don't usually do such large updates.

What has been done to verify that this works as intended?

CI

Why is this the best possible solution? Were any other approaches considered?

There are many ways to stroke a cat. When it comes to building blocks for etags, another approach was (is?) in #1654. For the etag use case this approach here has a different granularity tradeoff, namely, it chooses maximum granularity (at the row level), whereas #1654 was granular to the actee level. Consequently with the approach here validating the etag of a projection requires reprojection (rerunning the query with its filters, and calculating the max(event) of the submissions selected), whereas in #1654 the etag is validated through a small index (and doesn't even hit the heap!).

Besides granularity for etags, there is the topic of suitability as a cursor. The hope is that we can use this safely as a cursor. IIUC we can (safely, under concurrency) but I'd feel better if some others come to to that same conclusion independently.

How does this change affect users? Describe intentional changes to behavior and behavior that could have accidentally been affected by code changes. In other words, what are the regression risks?

Does this change require updates to the API documentation? If so, please update docs/api.yaml as part of this PR.

Before submitting this PR, please make sure you have:

  • run make test and confirmed all checks still pass OR confirm CircleCI build passes
  • verified that any code from external sources are properly credited in comments or that everything is internally sourced

@brontolosone brontolosone marked this pull request as ready for review November 28, 2025 18:27
@github-project-automation github-project-automation Bot moved this to 🕒 backlog in ODK Central Nov 28, 2025
@brontolosone brontolosone moved this from 🕒 backlog to ✏️ in progress in ODK Central Nov 28, 2025
@brontolosone brontolosone force-pushed the eventcounter-by-row branch 2 times, most recently from d9952fa to f673225 Compare November 29, 2025 12:58
Comment thread lib/model/migrations/20251127-01-submission-event-stamping-02.up.sql Outdated
Comment thread lib/model/migrations/20251127-01-submission-event-stamping-02.up.sql Outdated
Comment thread lib/model/migrations/20251127-01-submission-event-stamping-01.up.sql Outdated
Comment on lines +31 to +33
CREATE TABLE current_event (
event bigint NOT NULL
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we will add this mechanism for other tables then what will be the value of this? I am assuming it will be just existing value + number of rows in that table

Copy link
Copy Markdown
Contributor Author

@brontolosone brontolosone Dec 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Up to us. We can choose to pretend that all rows in that other table were added at the same time (which is not what I did in the migration), in which case it'd be existing value + 1, or we can indeed do what you suggest.

Another thing we could do is to use 0 for everything that needs to be retroactively eventstamped. But then we'd impact the ability to use it as a cursor - the first chunk of a collection, for retro-applied eventstamps (existing deployments), for event 0, will be potentially huge. On the plus side we'd gain some measure of causal consistency, eg a submission won't exist "before" the project it belongs to. But that causal ordering can break anyway since only few of our tables are append-only (a project can be modified, in which case its eventstamp is going to be higher than that of any of the submissions that have been made for it, until those get modified).

Comment thread lib/model/migrations/20251127-01-submission-event-stamping-02.up.sql Outdated
@brontolosone brontolosone merged commit 1b27a5f into getodk:master Dec 19, 2025
7 checks passed
@github-project-automation github-project-automation Bot moved this from ✏️ in progress to ✅ done in ODK Central Dec 19, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants