feat(#11071): add replication failure user count to monitoring v2#11072
feat(#11071): add replication failure user count to monitoring v2#11072dianabarsan wants to merge 1 commit into
Conversation
…er the last two months Signed-off-by: Diana Barsan <barsan@medic.org>
|
Hi @jkuester This is part 1 of monitoring addition. Next part will be an endpoint to show users who haven't succeeded to replicate at all recently + monitoring data. Separate to keep prs small. I'd appreciate your thoughts! |
|
hi @jkuester slight nudge, as there is a followup pr for this, to add the "overview" endpoint, and I would really love both of these to make it to 5.2. |
jkuester
left a comment
There was a problem hiding this comment.
👍 I left one complaint inline (which I am sure you anticipated). My main concern after reading the rest of the PR and thinking about this more is that, as things are currently in this PR, I am not sure how we could meaningfully monitor changes to the value over time (e.g. via Watchdog). Since the measurement is based on the reporting period, the actual returned value can change dramatically on the day the new reporting period begins. The cyclical pattern should be recognizable over time, but is just one more think we will have to keep in context when trying to sort through data.
Honestly, if you think it is not worth adding a new view for this so we can return the count from a rolling interval, then my vote would be to hold off on shipping this unless we have a coherent idea of how/what we want to monitor (no need to rush a value into Watchdog that will turn out to be useless, and will need a new version of /monitoring to fix....)
| const TYPE_PREFIX = `replication-fail-`; | ||
| const MAX_FAILURES = 50; | ||
| const REPORTING_PERIOD_FORMAT = 'YYYY-MM'; | ||
| const REPORTING_PERIOD_LENGTH = REPORTING_PERIOD_FORMAT.length; |
There was a problem hiding this comment.
Minor, but it is odd to have this variable for the reporting period, but not have one of type prefix... IMHO we don't really need this.
| return getPageByRange({ reportingPeriod, skip: cursor, limit }); | ||
| }; | ||
|
|
||
| const getRecentReportingPeriodsRange = () => { |
There was a problem hiding this comment.
Returning the current + previous period's data is pretty confusing to the consumer. 😓 I think I understand why you did it (it is better than just returning the current OR the previous period's data), but still....
I would like this much better if we could just a rolling sum from the last 30 days. (That kind of value will be WAY easier to track/measure in Watchdog) However, I think doing that efficiently would require a view (like we have for logs/connected_users). I know you mentioned having view PTSD 😬, but at a certain point the functionality tradeoffs are not worth it just to avoid the disk space of the view... 🤷
Description
Surfaces the number of distinct users that hit replication failures in the current or previous calendar month under
replication_failure.counton/api/v2/monitoring. Counting calendar months (rather than a rolling window) avoids artificially low counts in the first days after a month flip.closes #11071
Co-Authored-By: Claude Opus 4.7
Code review checklist
can_view_old_navigationpermission to see the old design. Test it has appropriate design for RTL languages.Compose URLs
If Build CI hasn't passed, these may 404:
License
The software is provided under AGPL-3.0. Contributions to this project are accepted under the same license.