Event counter migrations, towards getodk/central#1439#1654
Event counter migrations, towards getodk/central#1439#1654brontolosone wants to merge 2 commits into
Conversation
ace390a to
6379727
Compare
| AFTER INSERT OR UPDATE OF processed | ||
| ON "public"."audits" | ||
| FOR EACH ROW | ||
| WHEN (NEW.action = ANY(ARRAY['form.update.publish', 'submission.create', 'submission.update', 'submission.update.version', 'submission.attachment.update', 'submission.delete', 'submission.restore', 'dataset.update', 'dataset.update.publish', 'entity.create', 'entity.update.version', 'entity.update.resolve', 'entity.delete', 'entity.restore', 'entity.bulk.delete', 'entity.bulk.restore'])) |
There was a problem hiding this comment.
Important assumptions:
…
- The list of form-or-dataset-actee-mutating-actions will be kept up to date.
I'm a little concerned about this assumption. New actions aren't added very frequently, but they are added with some regularity, especially for datasets. I think it'd be easy to forget to add an action to this list, in which case there will be over-caching, which could be hard to catch.
What do you think about using a blacklist instead? I think it's a good assumption that most actions will change the associated actee and should bump the counter. Actions for which that isn't the case are the exception to the rule. Example: form.submission.export.
By the way, this comment doesn't need to block merge or anything like that. I could file it as a follow-up issue. But I do think it'd be good to discuss at some point. My apologies if the idea has already been discussed and discarded, I know I'm mostly an interested observer to this work. 😅
There was a problem hiding this comment.
TLDR Blocklist sounds good, but a registry sounds even better, skip towards the end ;-)
More avenues:
-
The DB-hardliner "proper" solution to the bookkeeping problem is to make the implicit information explicit. Actions should be first-class entities in the DB (thus get a table), and be FK-referred to by ID from the audits table.
Then rows in the hypotheticalaudit_actionstable would have a boolean columnmodifies_actee, and then there'd be one place to register actions and their effects.
A side benefit is that on theauditstable, the more compact FK IDs in place of the current strings should also make that table itself, and the indices using that column, much more compact, which makes things faster for operations on that table and those indices, but also for other things in the DB as other heaps/indices are now going to fit in the page cache. -
Another way to achieve consistency (with some tradeoffs) through centralized bookkeeping, is to formulate the definitions of actions in JS, as constants of some kind or another, and then dynamically update a view in the DB to be consistent with that definition any time migrations are run. Optionally, to maintain a form of referential integrity, because one can't have FKs to a view, we'd similarly define a dynamic PostgreSQL domain (basically a reusable CHECK constraint) for the now-pseudo-FK
actioncolumn ofaudits. This would come with quite a bit of tooling (which exists) for the benefit of the JS side of things leading with definitions, and the DB side following. -
Yet another way is the "light" version of the above: a centralized registry at the JS side, which would be a good thing in itself for inventory purposes. Syncing that to the DB is a matter of manually writing a migration, but, a test can query the state in the DB, and compare it to the current state of things on the JS side, and thus alert to changes, so that developers are prompted to reflect mutations to the JS-inventory of actions in the DB. We have something similar already since db: default to declare-side indexes for foreign keys #1550.
-
And yet another way to go about things, if we really don't fully understand/trust our events, or don't want to have to, is to run a count-incrementing trigger on any mutation on any row of various relevant tables. It'll likely lead to under-caching (over-invalidation), and for each table there'll need to be SQL code that gets an acteeId based on the affected row, which means the trigger will do DB queries with potentially quite a few joins, for every insert, which is going to absolutely tank performance. Just mentioning this bad option as present but bad, and on an extreme side in the spectrum of "how much trust do we have in the
actions".
- Separately, an alternative to using FKs to an
audit_actionstable registry, would be to make theaudits.actioncolumn a PostgreSQL enum. This is fortunately possible as our longest "action" string so far has string length 37 AFAIK (empirical, just based on querying the staging DB — now if we'd only have a centralized registry of some sorts...) so, that'll fit in the enum constraints. In the DB, theactioncolumn would, under the hood, become of type integer, which has benefits stated earlier, but the big upside is that we probably won't need to modify as many queries as we'd have to do when the string representation of an action starts to live a table away fromaudits. And we could still compare the enum member state to the JS state in a test as described under option number 3.
I think in our case option number 3, combined with a PostgreSQL enum, is the most pragmatic one. We have some somewhat similar work to do if we want to pursue actees-as-uuids, see this comment.
There was a problem hiding this comment.
Some thoughts from me:
- I'm convinced that a registry would be better than a blocklist. 👍 I like the idea of developers being required to update the registry, e.g., via a test that compares what's in the JS to what's in the DB. I suggested a blocklist partly thinking that people would forget to update the list, but your suggestion means that people wouldn't forget (or at least would be reminded). It'd be good when people add a new action for them to actively decide at that time whether the action should increment the event counter.
- I guess someone could still forget to update both the DB and the JS. Is that right? Maybe that's one way in which your option (1) is better. In (1), there's no way of avoiding adding an action to the in-DB registry.
- Another decision that we currently have to make whenever we add a new audit action is whether the action is "verbose", i.e., whether or not it should be shown in the frontend audit log (
nonverbose= show in Frontend). Actions on the level of an individual submission or entity are excluded. That logic is implemented inactionCondition()in lib/model/query/audits.js. So I'm thinking, whenever a new action is added, the dev should decide (1) is the action verbose, and (2) should the action increment the event counter. I.e., eventually it'd be great if the registry could also include a property or column about whether each action is verbose. That's not needed for the work here — probably the focus should be on the event counter — but I wanted to point to a similar pattern/problem. Actually, one comment aboveactionCondition()seems quite relevant: "TODO: some sort of central table for all these things, not here." - Option (3) sounds nice to me.
- Existing queries won't have to change that much.
- A registry in JS means that we can automatically check that the trigger in the DB is up-to-date.
- The enum sounds like a nice performance gain to me. But I don't really see it as being necessary for event counters. It sounds like a separate change that could be layered on afterwards. I think it'd be ideal for this PR to be as small as possible, so maybe (3) without enums for now, then enums come later as an additional performance gain.
- The reasoning behind option (1) makes sense to me. I'm certainly intrigued by the option. If the DB needs to reference an ever-changing list, maybe the best approach is to put that list in the DB. Your point about performance also makes sense to me. Maybe (1) is also simpler in some ways than trying to keep a trigger and an enum up-to-date via a registry in JS.
- The main downside is that a bunch of queries would have to change, right?
- I guess theoretically we might someday want to access the registry in the JS code to drive some logic there. If the registry were just in the DB, then we wouldn't be able to do that. That seems like a speculative concern though.
- My initial feeling about (2) is that the tooling required would increase the overall complexity of the system/codebase too much.
- (4): Let's try to make our
actions trustworthy!
There was a problem hiding this comment.
A couple other thoughts!
One is that there's a list of audit log actions in the API docs (under "Server Audit Logs"). I doubt it's exhaustive, but there's a lot in there in case that's helpful.
The main downside is that a bunch of queries would have to change, right?
Now I'm wondering, maybe I'm overestimating the number of queries that reference audits.action. Maybe it's not such a big number.
There was a problem hiding this comment.
The enum sounds like a nice performance gain to me. But I don't really see it as being necessary for event counters. It sounds like a separate change that could be layered on afterwards. I think it'd be ideal for this PR to be as small as possible, so maybe (3) without enums for now, then enums come later as an additional performance gain.
The enum is not mainly for perf, the enum would also be
- a guarantee that no rogue string values will be used/found in the
actioncolumn - much easier to check for in-syncness with the JS registry than the body of a trigger (though we'd still have to check the body of the trigger anyway, to check whether the declared-mutating actions are all present).
There was a problem hiding this comment.
The reasoning behind option (1) makes sense to me. I'm certainly intrigued by the option. If the DB needs to reference an ever-changing list, maybe the best approach is to put that list in the DB. Your point about performance also makes sense to me. Maybe (1) is also simpler in some ways than trying to keep a trigger and an enum up-to-date via a registry in JS.
The downside is that unless we autogenerate/autoupdate the trigger (tooling for that is waiting in the wings, if we want that) or autotest-and-manually-update the trigger to embed the appropriate actions in its code (as it has in its current form), then the trigger would have to query that table of actions on every insert, which I think would make things significantly slower.
Avoiding that lookup forces our hand into rewriting the trigger body to embed the action values. We can either do it manually, manually-with-crutches (prompted by failing test that assert sync), or automatically (with appropriate tooling).
The main downside is that a bunch of queries would have to change, right?
Also that! But perhaps we want them to change. Currently we're basically using constant values (bare strings) which is a bit like using magic numbers.
I guess theoretically we might someday want to access the registry in the JS code to drive some logic there. If the registry were just in the DB, then we wouldn't be able to do that.
And that, exactly!
Typically you want the enum/table/registry to live on both sides. So that in JS code you can do
sql`insert into audits ([...], action) values ([...], ${actions.SUBMISSION_ATTACHMENT_UPDATE})`
with a string value of submission.attachment.update for actions.SUBMISSION_ATTACHMENT_UPDATE), this would allow a rolling update to our queries where we replace the old literal strings with references into that actions registry. Under the hood this
Another option would be to use integer values (or PG enums, so strings again for the humans both in JS and SQL, but integers under the hood for the DB) there that are foreign keys to that hypothetical actions table expanded upon above, so we'd gain compactness. But then we'd need to update all our queries at once to switch to using the registry, but perhaps that's as much trouble as I fear.
Why would one prefer actions.SUBMISSION_ATTACHMENT_UPDATE over developers just throwing bare strings around? Well, it makes it clear to the writer (or copy-paste-adapter of code) that there is a registry, and makes it easy for readers to look up meta-info on an action (does it meaningfully mutate its actee, do we log it, ...), IDEs may autocomplete, typos are averted, etc etc. Might as well get it over with and clean up those bare strings.
So that's the JS side of things.
On the on the SQL side of things, you also want ergonomics, so you'd have a PG enum, or type, table, or domain that gives you a similar "registry", that is, something that defines exactly the values that are our actions, to be used in queries there. Using a table is best for when you need metadata (such as the hypothetical bumps_eventcounter boolean). Using an enum as the PK (and thus elsewhere as FKs) on such a table would be ideal: the metadata benefits of a table, the compactness of integers (enums are integers under the hood), and the DX of being able to use string labels instead of those integers, and the domain restriction benefits that comes with both.
There was a problem hiding this comment.
The downside is that unless we autogenerate/autoupdate the trigger (tooling for that is waiting in the wings, if we want that) or autotest-and-manually-update the trigger to embed the appropriate actions in its code (as it has in its current form), then the trigger would have to query that table of actions on every insert, which I think would make things significantly slower.
That makes sense to me that we need a trigger then. 👍
Avoiding that lookup forces our hand into rewriting the trigger body to embed the
actionvalues. We can either do it manually, manually-with-crutches (prompted by failing test that assert sync), or automatically (with appropriate tooling).
I think my preference would be to start with manually-with-crutches. If we were to do it manually without crutches, there's a risk that people would forget to update the trigger. Re: doing it automatically, I think we could always add that tooling later.
Currently we're basically using constant values (bare strings) which is a bit like using magic numbers.
Why would one prefer
actions.SUBMISSION_ATTACHMENT_UPDATEover developers just throwing bare strings around? Well, it makes it clear to the writer (or copy-paste-adapter of code) that there is a registry, and makes it easy for readers to look up meta-info on an action (does it meaningfully mutate its actee, do we log it, ...), IDEs may autocomplete, typos are averted, etc etc. Might as well get it over with and clean up those bare strings.
I see what you're saying, but I think I'd prefer not to replace the bare strings as part of this initial work. It feels tangential to the main goal of improving caching. If we need to replace the bare strings in order to achieve the main goal, then let's do that. But if it's more of a refactor or improvement to DX, I'd say let's consider doing it separately. If we used an enum, we could leave the bare strings for now, right?
The enum is not mainly for perf, the enum would also be
- a guarantee that no rogue string values will be used/found in the
actioncolumn- much easier to check for in-syncness with the JS registry than the body of a trigger (though we'd still have to check the body of the trigger anyway, to check whether the declared-mutating actions are all present).
It still sounds to me that unlike the trigger, the enum isn't strictly necessary for the event counter. To be clear, I quite like the sound of an enum. Performance + extra validation sound great. 👍 What I'm more proposing is that the addition of event counters and of an audit action registry be completed in a series of PRs rather than one large one. Maybe you're also thinking that, given that you filed getodk/central#1462 as its own issue. I was thinking that the following might be a sequence that'd work well. What do you think?
- Merge this PR with the trigger pretty much as-is, with the hard-coded list of audit actions.
- PR 2: Add a JS registry and use it to validate the trigger (i.e., add crutches).
- PR 3: Add the enum.
- PR 4: Maybe replace bare strings.
- PR 5: Maybe something that implements the
action=nonverbosefilter using the registry.
One approach we could take is that Alex finishes up the review of this PR, then you and I continue discussing the registry idea. I'd be happy to review follow-up PRs related to the registry if you wanted me to.
|
|
||
| --- create: public.audits.eventcounter_bump_trigger --- | ||
| CREATE TRIGGER "eventcounter_bump_trigger" | ||
| AFTER INSERT OR UPDATE OF processed |
There was a problem hiding this comment.
This is more a curious question than anything else. Why would a change to processed increment the event counter? If processing an event has an effect on the actee, I would think that that would already be indicated by the insertion of new row(s) into audits.
- Example 1: Whenever a form is published, we check whether the form has an
enketoId. If it does, we do nothing (job ends). There's no need to increment the counter in that case. If the form doesn't have anenketoId, then we try to get one. If that's successful, there will be a newform.updateevent logged. The logging of that event alone should increment the counter. - Example 2: Whenever a submission is created, it is processed for entities. But I don't think such entity processing will affect the actee of the
submission.createevent (the form): it will only affect a dataset actee. That change will be indicated by the insertion of new entity event(s).
There was a problem hiding this comment.
Counterexample for submission.create: generation of secondary information which influences how submissions.csv looks — see 8cc3ab4 for what goes on there.
This information is added later (potentially much later; there's a race condition that I've shown when I demoed the pure-SQL insertion path, we'll clear that bug when we adopt the pure-SQL submission insertion path) in a separate transaction after insertion. And it manifests itself in the audits not with a new event, but only with setting the processed value of the existing row.
TLDR: Different outcomes for submissions.csv, but no sign in the audits other than that the processed column receives a value.
So I want to err on the cautionary side and interpret "something is processed" as an event. There may be cases where this results in over-invalidation, probably specific to the event action. I know I need it for submission.create, as I've shown above, but yeah there may be others where it's overzealous. We could take stock and follow up to make things more specific (but then, also think about how we're going to maintain bookkeeping, we don't want to miss events and potentially over-cache when the code changes...).
There was a problem hiding this comment.
OK, good counterexample. I thought about that case, but for some reason, I thought that any unprocessed submissions would be processed on the fly at request time for additional "select multiple" values. But looking at lib/data/briefcase.js, that doesn't seem to be the case. The CSV headers seem to be finalized before any data rows are processed or written (which makes sense). I must have been thinking about the aggregated client audit CSV instead. Submission attachments are processed for client audit rows after a submission.attachment.update event, but they're also processed on the fly if there are any client audit attachments that the worker hasn't processed. So the processed flag on the submission.attachment.update event doesn't affect the response. But it seems like that's not the case for submission.create. 👍
One counterexample is all we need, so I'm convinced that it makes sense to increment the event counter when processed is updated. I agree that we don't want extra bookkeeping in this area, so that logic shouldn't depend on whether it's a submission.create event vs. something else.
| ; | ||
|
|
||
| --- sign: eventcounter_squash(actee_id varchar(36)) --- | ||
| COMMENT ON FUNCTION "public"."eventcounter_squash"(actee_id varchar(36)) IS '{"dbsamizdat": {"version": 1, "definition_hash": "6697e80c4e5a2e69a33eb4aa6ccbea3d"}}'; |
There was a problem hiding this comment.
what's the significance of dbsamizdat?
There was a problem hiding this comment.
I use it to work with these things, letting the computer do the work of deciding whether they need updating (based on definitions I maintain out-of-tree, on my side, for now, but I'm happy to share them) and if so, making sure the DB objects get torn down and reinstated in the right dependency order, and writing out a migration file that does so which I then check into VCS, which is what you're looking at here. It's just a bit of dev tooling. But it needs those signatures (stored in comments, and as such tied to the lifecycle of the managed DB objects). Are they in your way?
There was a problem hiding this comment.
Note that you'll find those signatures also in master already, example. And we have it on good authority that that's fine.
There was a problem hiding this comment.
Thanks for the explanation 🙏
And we have it on good authority that that's fine.
The link refers to the legality of mentioning a GPL-licensed library from code.
Note that you'll find those signatures also in master already, example.
It's great if this tool helps to develop new database migrations. For people not using this tool, these comments could be considered noise when committed to master. As a user of the tool, do they have ongoing value for you after you've finished working on a PR and it's ready to merge?
There was a problem hiding this comment.
Yes, those tags are the thing that make the state diffing tick. Without state diffing one might as well write migrations by hand.
There was a problem hiding this comment.
Migrations ideally don't change after they've been merged to master, and especially not once released, so with that in mind, do the comments have ongoing value for you after you've finished working on a PR and it's ready to merge?
There was a problem hiding this comment.
Yes, the signatures in the DB have ongoing value, that's what makes the diffing tick. For instance this migration (trivial example, but an example nevertheless) was autogenerated, based on the diff, based on the signatures, as read from the DB, as put there by the previous migration.
A month or two ago I've shown how this works to @ktuite and @sadiqkhoja (was in the context of the geoprocessing functions). Let me know if you'd like to know more about how this works; this PR is probably not the best venue.
Any idea what the worst case is likely to be? |
Probably the 1 in 10_000 probability (more under concurrency) that two |
6379727 to
30b4475
Compare
| AFTER INSERT OR UPDATE OF processed | ||
| ON "public"."audits" | ||
| FOR EACH ROW | ||
| WHEN (NEW.action = ANY(ARRAY['form.update.publish', 'submission.create', 'submission.update', 'submission.update.version', 'submission.attachment.update', 'submission.delete', 'submission.restore', 'dataset.update', 'dataset.update.publish', 'entity.create', 'entity.update.version', 'entity.update.resolve', 'entity.delete', 'entity.restore', 'entity.bulk.delete', 'entity.bulk.restore'])) |
There was a problem hiding this comment.
Should form.attachment.update be in the list?
There was a problem hiding this comment.
Or maybe the question is more general - why is form.update.publish in the list, but other form.* actions are not?
There was a problem hiding this comment.
-
Why is
form.attachment.updatenot there — Take the case ofsubmissions.csv. You can't update an attachment of a form that is published, which is why I left it out; I don't want to invalidate a giant costlysubmissions.csvjust because someone is futzing around with some drafts of that same form. Consequently, probably don't use event counters as an etag on .csv exports of a draft form. -
Why is
form.update.publishthere — Again take the case ofsubmissions.csv. Publishing a form can alter the CSV-"view" on its existing submission-corpus (eg, extra column), so updates to the form should result in cache invalidation.
I agree that ideally we'd have semantics where any audit action accurately invalidates its actee for any purpose without false positives, but that's just not the reality of Central as it is currently, evident from the two examples above.
We can decide to not capture the finesses and instead just capture everything, which will make the counters yolo-safe to use, that is, whenever you generate data just etag it with whatever counter you find for the actee, without further consideration or thought. That'd result in over-invalidation/under-caching. It's one side of the spectrum as laid out in the design doc, but I'm not sure if it's attainable, I think there'll always be some thinking required.
Another ideal, and over on the other side of the spectrum (as laid out in the design doc) would be to be super precise, super granular, with the counters. For instance, if you're not listing columns with attachment counts in your .csv, then adding an attachment to a submission listed in your .csv doesn't need to invalidate its cache. So we'd really need segmented counters, and then the user of the etag will have to find the right combination of counters depending on their use case ("ok I'm doing something with attachments, but not something with drafts, so I need to combine counter X and Y for this form, ...."). Increased cognitive load (and thus risk of getting it wrong) for the benefit of decreased under-caching.
Whether you like my exclusion of form.attachment.update depends
a) on where you stand on that spectrum ideologically,
b) and separately, on how transparent this all is for a developer so that they can make an well-informed decision on whether they can use a counter as an etag.
a) is a matter of concensus and if you don't agree with the middle-ground position taken here in this code, it'd be good to bring it up, perhaps in the team Meet later today, so that we as a team can reach concensus on this.
To address b), @matthew-white and I think setting up a registry of actions will offer inroads to making this system more legible.
I'd say yes please, and enforced in the code 👍 |
30b4475 to
48ea16f
Compare
Part of getodk/central#1462 ! I can't promise it'll be enforced in the code. Supposedly there could be a DB function |
cb5c5da to
3aa0871
Compare
3aa0871 to
6d106ca
Compare
|
Converted back to draft while we explore an alternative concept. |
|
Closing, we're going to pursue the #1699 approach first and foremost. |
Towards getodk/central#1439
Per-actee event counters that spring into action upon create/update of specific
auditsrows.Important assumptions:
I've been assured that this is the case for inserts, and I've understood this to also be the case for the updates of an existing
auditsrow when a "worker" task completes — the worker task's work is committed together with the update to theprocessedtimestamp.Furthermore,
Using the event counter as an ETag
PR #1657 contains an example for using it (for forms, and datasets).
It raises a question: do we need to make explicit what types of actees are expected to work?
The way the counter is queried in #1657 always returns a result for the actee, so if someone blindly copies that exact code (coalescing null to 0) to something to do with another actee type (say, user), they might get the impression that they're doing things correctly (but the event count is always 0). Wrong expectation?
What has been done to verify that this works as intended?
Manual testing.
Why is this the best possible solution? Were any other approaches considered?
See issue getodk/central#1439 and the dev doc linked therefrom.
Just doing counts on aggregates over the audit log is not great, on my modest laptop (still much faster than a budget VPS) it's ~50 ms for some, with just 300K audit rows.
So, the solution in this PR does things a different way.
It keeps per-dataset/form actee event counters in a table. Of course, if there'd be just one holy row per actee, there'd be heavy lock contention on updating those counters, effectively serializing all mutations for a single dataset/form actee. So we don't do that. Instead, via a trigger, for each audit log addition (of the right type) there's simply a single row inserted, for the respective actee, with event count:
1. That's contention-free.But we don't want to sum over millions of rows with event count
1.Thus, on every trigger run, with probability 0.01, a squashing function is called. This function takes all rows in the
eventcounterstable for the actee, and replaces them with a single row containing the aggregate sum of theevt_cntfield of those rows. This does not get in the way of any concurrent auditlog inserts (or, indeed, in the way of the ensuingeventcountersrow creation). It does lock out concurrent incantations of itself (for the same actee), to no ill effect other than serializing concurrent auditlog insertions affecting the same actee with probability 0.01 * 0.01. It doesn't lock out any readers of the counters. It's fast! I measured an added cost of the trigger of 30µs (µs, not ms!) per call (with the squashing function's cost amortized).To read a counter, use a query like such:
On average (on busy servers) this will be a sum over about 100 rows. Due to the way the index is constructed, with inclusion of the
evt_cntvalues, this query will be an index-only scan that doesn't hit the heap (the table content, as it were) in most cases (in a nutshell, depending on whether PostgreSQL has marked the pages where the rows reside as visible in all transactions, which depends on how "settled" the situation is). Anyway, that one's plenty fast too, 2ms or so on my machine.How does this change affect users? Describe intentional changes to behavior and behavior that could have accidentally been affected by code changes. In other words, what are the regression risks?
Not really.
Does this change require updates to the API documentation? If so, please update docs/api.yaml as part of this PR.
No update required.
Before submitting this PR, please make sure you have:
make testand confirmed all checks still pass OR confirm CircleCI build passes