Skip to content

feat(#11071): add replication failure user count to monitoring v2#11072

Open
dianabarsan wants to merge 1 commit into
masterfrom
11071-replication-failures-monitoring
Open

feat(#11071): add replication failure user count to monitoring v2#11072
dianabarsan wants to merge 1 commit into
masterfrom
11071-replication-failures-monitoring

Conversation

@dianabarsan
Copy link
Copy Markdown
Member

@dianabarsan dianabarsan commented May 14, 2026

Description

Surfaces the number of distinct users that hit replication failures in the current or previous calendar month under replication_failure.count on /api/v2/monitoring. Counting calendar months (rather than a rolling window) avoids artificially low counts in the first days after a month flip.

closes #11071

Co-Authored-By: Claude Opus 4.7

Code review checklist

  • UI/UX backwards compatible: Test it works for the new design (enabled by default). And test it works in the old design, enable can_view_old_navigation permission to see the old design. Test it has appropriate design for RTL languages.
  • Readable: Concise, well named, follows the style guide
  • Documented: Configuration and user documentation on cht-docs
  • Tested: Unit and/or e2e where appropriate
  • Internationalised: All user facing text
  • Backwards compatible: Works with existing data and configuration or includes a migration. Any breaking changes documented in the release notes.
  • AI disclosure: Please disclose use of AI per the guidelines.

Compose URLs

If Build CI hasn't passed, these may 404:

License

The software is provided under AGPL-3.0. Contributions to this project are accepted under the same license.

…er the last two months

Signed-off-by: Diana Barsan <barsan@medic.org>
@dianabarsan
Copy link
Copy Markdown
Member Author

Hi @jkuester

This is part 1 of monitoring addition. Next part will be an endpoint to show users who haven't succeeded to replicate at all recently + monitoring data. Separate to keep prs small.

I'd appreciate your thoughts!

@dianabarsan dianabarsan requested a review from jkuester May 15, 2026 05:18
@dianabarsan
Copy link
Copy Markdown
Member Author

hi @jkuester

slight nudge, as there is a followup pr for this, to add the "overview" endpoint, and I would really love both of these to make it to 5.2.
Thanks! Appreciate your time!

Copy link
Copy Markdown
Contributor

@jkuester jkuester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I left one complaint inline (which I am sure you anticipated). My main concern after reading the rest of the PR and thinking about this more is that, as things are currently in this PR, I am not sure how we could meaningfully monitor changes to the value over time (e.g. via Watchdog). Since the measurement is based on the reporting period, the actual returned value can change dramatically on the day the new reporting period begins. The cyclical pattern should be recognizable over time, but is just one more think we will have to keep in context when trying to sort through data.

Honestly, if you think it is not worth adding a new view for this so we can return the count from a rolling interval, then my vote would be to hold off on shipping this unless we have a coherent idea of how/what we want to monitor (no need to rush a value into Watchdog that will turn out to be useless, and will need a new version of /monitoring to fix....)

const TYPE_PREFIX = `replication-fail-`;
const MAX_FAILURES = 50;
const REPORTING_PERIOD_FORMAT = 'YYYY-MM';
const REPORTING_PERIOD_LENGTH = REPORTING_PERIOD_FORMAT.length;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor, but it is odd to have this variable for the reporting period, but not have one of type prefix... IMHO we don't really need this.

return getPageByRange({ reportingPeriod, skip: cursor, limit });
};

const getRecentReportingPeriodsRange = () => {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returning the current + previous period's data is pretty confusing to the consumer. 😓 I think I understand why you did it (it is better than just returning the current OR the previous period's data), but still....

I would like this much better if we could just a rolling sum from the last 30 days. (That kind of value will be WAY easier to track/measure in Watchdog) However, I think doing that efficiently would require a view (like we have for logs/connected_users). I know you mentioned having view PTSD 😬, but at a certain point the functionality tradeoffs are not worth it just to avoid the disk space of the view... 🤷

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add monitoring data about users who fail to replicate

2 participants