Conversation

@coodos coodos commented Jan 27, 2026

Description of change

Issue Number

closes #638

Type of change

  • Breaking (any change that would cause existing functionality to not work as expected)
  • New (a change which implements a new feature)
  • Update (a change which updates existing functionality)
  • Fix (a change which fixes an issue)
  • Docs (changes to the documentation)
  • Chore (refactoring, build scripts or anything else that isn't user-facing)

How the change has been tested

Change checklist

  • I have ensured that the CI Checks pass locally
  • I have removed any unnecessary logic
  • My code is well documented
  • I have signed my commits
  • My code follows the pattern of the application
  • I have self reviewed my code

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Improved webhook participant loading to avoid hangs and timeouts, with safer per-item handling and better failure isolation.
    • Reworked change handling to reliably reload and send updated entities after commits, reducing missed or inconsistent notifications.
  • Performance

    • Tuned database connection pooling and timeouts to improve resource utilization and responsiveness.

The subscriber's afterUpdate was using event.entity, which contains only
the changed fields (a partial entity), not the full entity with its
charter. When groups were updated via webhooks, the charter field was
often absent from the partial entity, causing Cerberus to lose track of
group charters over time. After a restart, charters were loaded fresh
from the DB.

Root cause: TypeORM's afterUpdate event provides partial entities when
using repository.save(entity). The code was not reloading the full
entity with all fields (including charter) from the database after the
transaction committed.

Changes:
- Refactored afterUpdate to pass metadata (entityId, relations) instead
  of using the partial entity from event.entity
- Created handleChangeWithReload that schedules entity reload for after
  transaction commit (inside setTimeout with 50ms delay)
- Created executeReloadAndSend that does the actual findOne with all
  relations AFTER transaction commits, ensuring charter and all fields
  are loaded
- Groups and messages sync after a 50ms delay (short, intended to let
  the transaction commit first)
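
The reload flow listed above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the `UpdateMeta`, `FindOne`, and `Send` types are hypothetical stand-ins for the TypeORM repository and the adapter, and only the function names `handleChangeWithReload` and `executeReloadAndSend` and the 50ms delay come from the description. A fixed delay is a heuristic and does not strictly guarantee that the transaction has committed.

```typescript
// Hypothetical minimal shapes; the real subscriber works with TypeORM types.
interface UpdateMeta {
  entityId: string;
  relations: string[];
}

// Stand-in for a repository lookup (e.g. findOne with `where` and `relations`).
type FindOne<T> = (id: string, relations: string[]) => Promise<T | null>;
// Stand-in for forwarding the full entity to the adapter.
type Send<T> = (entity: T) => void | Promise<void>;

// Reload the full entity by id and forward it; skip if it no longer exists.
async function executeReloadAndSend<T>(
  meta: UpdateMeta,
  findOne: FindOne<T>,
  send: Send<T>,
): Promise<boolean> {
  const full = await findOne(meta.entityId, meta.relations);
  if (full === null) {
    console.warn(`entity ${meta.entityId} not found on reload; skipping`);
    return false;
  }
  await send(full);
  return true;
}

// Defer the reload so it runs after the (assumed) commit of the update.
function handleChangeWithReload<T>(
  meta: UpdateMeta,
  findOne: FindOne<T>,
  send: Send<T>,
  delayMs = 50,
): Promise<boolean> {
  return new Promise<boolean>((resolve, reject) => {
    setTimeout(() => {
      executeReloadAndSend(meta, findOne, send).then(resolve, reject);
    }, delayMs);
  });
}
```

Passing only the id and relation names, rather than the partial entity, means the data that is sent always comes from a fresh read.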

This is the same transaction timing issue fixed in file-manager-api and
dreamsync-api. Now Cerberus will maintain full group data (including
charters) indefinitely without requiring restarts.

The group webhook processing could hang indefinitely when loading
participant users, causing Cerberus to appear stuck after running
for some time. The last log was "Extracted userId" with no progress.

Root causes:
1. Promise.all blocks if any getUserById call hangs (DB lock, timeout, etc.)
2. No timeout protection - hangs wait forever
3. No error handling - failures block entire webhook
4. Loading unnecessary relations (followers/following) added complexity

Changes:
- Use Promise.allSettled instead of Promise.all to handle failures gracefully
- Add 5-second timeout per user lookup using Promise.race
- Wrap each participant load in try-catch with detailed error logging
- Load users without heavy relations in webhook context (don't need followers/following)
- Add indexed logging to identify which participant causes issues
- Log success/failure counts for transparency

Benefits:
- Webhook completes even if some participants fail to load
- 5s timeout prevents indefinite hangs
- Better diagnostics via indexed logging
- Reduced DB load by skipping unnecessary relations

Participant loading in the webhook now finishes within ~5 seconds even
if every lookup fails, because the lookups run concurrently and each
has its own timeout. This prevents Cerberus from getting stuck.
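
A minimal sketch of the timeout-plus-allSettled pattern described above, under stated assumptions: `withTimeout`, `loadParticipants`, and the `User` shape are illustrative names, and `getUserById` is passed in as a plain function rather than the real service method. It is not the code from WebhookController.ts.

```typescript
interface User {
  id: string;
}

// Reject if `promise` does not settle within `ms` milliseconds.
function withTimeout<T>(promise: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([promise, timeout]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}

async function loadParticipants(
  userIds: string[],
  getUserById: (id: string) => Promise<User | null>,
  timeoutMs = 5000,
): Promise<User[]> {
  // Each lookup is isolated: a failure or timeout is logged and yields null.
  const loadOne = async (id: string, index: number): Promise<User | null> => {
    try {
      return await withTimeout(getUserById(id), timeoutMs, `participant[${index}]`);
    } catch (err) {
      console.error(`participant[${index}] (${id}) failed to load:`, err);
      return null;
    }
  };

  // allSettled never rejects, so one bad lookup cannot block the others.
  const results = await Promise.allSettled(userIds.map(loadOne));
  const users: User[] = [];
  for (const result of results) {
    if (result.status === "fulfilled" && result.value !== null) {
      users.push(result.value);
    }
  }
  console.log(`loaded ${users.length}/${userIds.length} participants`);
  return users;
}
```

Because all lookups start at once, the total wait is bounded by roughly one timeout interval rather than one interval per participant.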

coderabbitai bot commented Jan 27, 2026

📝 Walkthrough

Enhances webhook participant loading with per-participant timeouts to avoid hangs, adds connection-pool tuning to multiple Postgres DataSource configs, and refactors the subscriber afterUpdate flow to reload, enrich, and debounce post‑commit entity notifications.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Webhook participant loading: platforms/cerberus/src/controllers/WebhookController.ts | Replaced prior per-participant async loading with a per-item async function that safely extracts userId, wraps each repository call in a 5s Promise.race timeout, and uses Promise.allSettled to collect valid participants; adds per-index error logging. |
| Subscriber reload-and-notify workflow: platforms/cerberus/src/web3adapter/watchers/subscriber.ts | Reworked afterUpdate handling to derive entityId from multiple sources and schedule a debounced post‑commit reload. Added private handleChangeWithReload and executeReloadAndSend methods that reload/enrich the entity, convert to plain data, skip junctions/non-system messages, and call adapter.handleChange. Includes additional guards and logging. |
| Database connection pool configs: platforms/cerberus/src/database/data-source.ts, infrastructure/evault-core/src/config/database.ts, platforms/dreamsync-api/src/database/data-source.ts, platforms/eCurrency-api/src/database/data-source.ts, platforms/eReputation-api/src/database/data-source.ts, platforms/emover-api/src/database/data-source.ts, platforms/esigner-api/src/database/data-source.ts, platforms/evoting-api/src/database/data-source.ts, platforms/file-manager-api/src/database/data-source.ts, platforms/group-charter-manager-api/src/database/data-source.ts, platforms/pictique-api/src/database/data-source.ts, platforms/registry/src/config/database.ts | Added an extra block to DataSource/DataSource options across multiple platforms with connection-pool and timeout settings (max: 10, min: 2, idleTimeoutMillis: 30000, connectionTimeoutMillis: 5000, statement_timeout: 10000). Review for consistent naming/typing and environment compatibility. |
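
For illustration, the pool settings listed in the summary could look like this in a data-source file. This is a sketch, assuming that the "extra block" refers to TypeORM's `extra` option (which passes options through to the underlying driver); the connection URL and the surrounding option names are placeholders, not values from the PR.

```typescript
// Pool and timeout values as reported in the summary (times in milliseconds).
const poolOptions = {
  max: 10,                       // upper bound on pooled connections
  min: 2,                        // lower bound kept open
  idleTimeoutMillis: 30000,      // close connections idle for 30s
  connectionTimeoutMillis: 5000, // give up acquiring a connection after 5s
  statement_timeout: 10000,      // abort statements running longer than 10s
};

// Assumed shape of the options object handed to `new DataSource(...)`.
const dataSourceOptions = {
  type: "postgres" as const,
  url: "postgres://user:password@localhost:5432/app", // placeholder
  extra: poolOptions,
};
```

Keeping the values in one shared object would make it easier to hold the twelve data-source files consistent, which is the concern the summary raises.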

Sequence Diagram(s)

sequenceDiagram
    participant Event as Change Event
    participant Sub as Subscriber
    participant DB as Database
    participant Adapt as Adapter

    Event->>Sub: afterUpdate(event)
    Sub->>Sub: derive entityId (event.entity.id / databaseEntity?.id / common id fields)
    alt id missing
        Sub-->>Event: log warning & exit
    else id present
        Sub->>Sub: schedule handleChangeWithReload (debounced per table)
        Note right of Sub: debounced timer per table (skip junctions)
        Sub->>DB: reload entity by id after commit
        Note over Sub,DB: reload uses repository.findOne and enrichment
        DB-->>Sub: enriched entity/plain data
        Sub->>Sub: validate (skip locked/non-system where applicable)
        Sub->>Adapt: adapter.handleChange(envelope)
        Adapt-->>Sub: acknowledgement
        Sub->>Sub: log envelope
    end
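
The debounce step in the diagram can be sketched as a small keyed timer map. This is an assumption-laden illustration: `KeyedDebouncer` is a hypothetical name, and whether the real subscriber keys its timers by table name alone or by table plus entity id is not stated here.

```typescript
// Runs only the most recently scheduled task for each key,
// after `delayMs` has passed without a newer call for that key.
class KeyedDebouncer {
  private readonly timers = new Map<string, ReturnType<typeof setTimeout>>();

  constructor(private readonly delayMs: number) {}

  schedule(key: string, task: () => void): void {
    // A newer call for the same key replaces the pending one.
    const pending = this.timers.get(key);
    if (pending !== undefined) clearTimeout(pending);

    const timer = setTimeout(() => {
      this.timers.delete(key);
      task();
    }, this.delayMs);
    this.timers.set(key, timer);
  }
}
```

Keying by table name alone would collapse updates to different rows of the same table inside the window, so a finer key may be preferable where every row must be sent.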

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • Bekiboo
  • sosweetham

Poem

🐰
I hopped through logs and timeouts bright,
Per‑participant naps cut short at night,
I fetch, I reload, I debounce with care,
Group chats chirp now — no hangs in the air! ✨

🚥 Pre-merge checks | ✅ 3 | ❌ 2
❌ Failed checks (2 inconclusive)
| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Linked Issues check | ❓ Inconclusive | The PR addresses issue #638 (Cerberus group chat failures) through connection pool additions and webhook/subscriber improvements, but the connection between specific code changes and the expected behavior is not explicitly documented. | Clarify in the PR description how the connection pool and webhook timeout changes specifically address the notification and charter violation processing failures described in issue #638. |
| Out of Scope Changes check | ❓ Inconclusive | The PR contains mostly in-scope changes (database pool configurations and webhook/subscriber enhancements) related to fixing database hangups, but includes additional changes to subscriber reload logic that may extend beyond the immediate scope of connection pooling. | Review whether the subscriber reload-and-notify workflow (155 lines added) is necessary for the core hangup fix or if it should be separated into a distinct PR for better clarity. |
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'Fix/add connection pools to fix db hangup' directly relates to the main changes in the PR, which add connection pool configurations across multiple database data sources to address database hangup issues. |
| Description check | ✅ Passed | The PR description follows the provided template structure with Issue Number (#638) and lists all change type options, but the 'Type of change' section is not properly checked and 'How the change has been tested' is empty. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

coodos force-pushed the fix/add-connection-pools-to-fix-db-hangup branch from 83dce76 to c05d7d0 on January 27, 2026 20:59
coodos merged commit 46b062e into main on Jan 27, 2026 (7 checks passed)
coodos deleted the fix/add-connection-pools-to-fix-db-hangup branch on January 27, 2026 21:13

Development

Successfully merging this pull request may close these issues.

[Bug] Cerberus doesn't work with group chats and charters

3 participants