Feat/queue monitoring #337

Merged
RUKAYAT-CODER merged 16 commits into rinafcode:main from Power70:feat/queue-monitoring
Apr 24, 2026
Power70 commented Apr 24, 2026

Linked Issue

Closes #291


What does this PR do?

This PR introduces queue monitoring and operational controls for Bull in the teachLink backend, covering metrics, health, failed-job handling, retry analytics, and scheduled-job management. The implementation adds strongly-typed DTO validation at controller boundaries, secures endpoints with authentication/authorization guards, and prevents static routes from colliding with parameterized routes. Queue observability is expanded with periodic health checks, trend-aware statistics, retry analysis, and stuck-job recovery hooks, which are intended to make production diagnosis and response faster. The changes are designed as non-breaking enhancements that preserve existing queue behavior.


Type of change

  • ✨ New feature (non-breaking change that adds functionality)
  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • 💥 Breaking change (fix or feature that changes existing API behaviour)
  • ♻️ Refactor (no functional change, no new feature)
  • 🧪 Tests only (no production code changes)
  • 📝 Documentation only
  • 🔧 Chore (build, dependencies, CI config)

Pre-merge checklist (required)

Do not remove items. Unchecked items without an explanation will block merge.

Branch & metadata

  • Branch name follows feature/issue-<N>-<slug> / fix/issue-<N>-<slug> convention
  • Branch is up to date with the target branch (develop or main)
  • All commits and the PR title follow the Conventional Commits format with issue reference

Code quality & tests

  • npm run lint:ci — zero ESLint warnings
  • npm run format:check — Prettier reports no changes needed
  • npm run typecheck — zero TypeScript errors
  • npm run test:ci — all tests pass, coverage ≥ 70%
  • New service methods have corresponding .spec.ts unit tests
  • New API endpoints are covered by at least one e2e test
  • No existing tests were deleted (if any were, justification is provided in the PR description)

Error handling & NestJS best practices

  • All new/updated DTOs use class-validator / class-transformer decorators and are wired through NestJS pipes (e.g. global ValidationPipe or explicit)
  • All controller entry points validate external input at the boundary (no unvalidated raw any/unknown reaching the domain)
  • Controllers/services throw appropriate NestJS HTTP exceptions (e.g. BadRequestException, UnauthorizedException, ForbiddenException, NotFoundException) instead of generic Error
  • Any new error shapes are handled by existing exception filters or the filters have been updated accordingly
  • Logging goes through the shared logging abstraction (e.g. Nest Logger or central logger service) with meaningful, structured messages
  • Authentication/authorization guards (e.g. AuthGuard, role/permissions guards, custom guards) are applied to all new/modified endpoints where appropriate
  • If an endpoint is intentionally public, this is explicitly mentioned in the PR description with rationale

API documentation / Swagger

  • Swagger / OpenAPI decorators are added or updated for all new/changed controller endpoints (including DTOs, responses, and error schemas)
  • I have started the app locally and confirmed the /api (or Swagger UI) reflects new/changed endpoints correctly
  • If there are no API surface changes, this is explicitly stated in the PR description

Breaking changes

  • This PR does not introduce a breaking API change
  • OR: this PR introduces a breaking change and it is documented below, with migration notes

Breaking change description (if applicable)

Not applicable.


Changes Overview

New file

  • src/queues/dto/queue.dto.ts
    • Added strongly-typed DTOs with validation for queue operations:
      • AddJobDto, AddBulkJobsDto
      • ScheduleJobDto, ScheduleDelayedJobDto
      • FailedJobsQueryDto, StuckJobsQueryDto, AnalyticsQueryDto
      • CleanQueueDto

Updated file

  • src/queues/queue.controller.ts
    • Fixed static route precedence so /jobs/failed and /jobs/stuck no longer collide with /jobs/:id.
    • Applied DTO + ValidationPipe validation for body/query boundaries.
    • Added JwtAuthGuard + RolesGuard with admin restriction on mutation endpoints.
    • Added NotFoundException handling for missing jobs in getJob, retryJob, and removeJob.
    • Added/expanded endpoints:
      • POST /queues/jobs/failed/retry-all
      • GET /queues/metrics/history
      • GET /queues/metrics/retries
      • GET /queues/counts
      • GET /queues/jobs/scheduled
      • GET /queues/cron/jobs
      • POST /queues/jobs/delay
      • DELETE /queues/jobs/scheduled/:id
      • DELETE /queues/empty
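
The route-precedence fix listed above can be illustrated with a small first-match router. This is a simplified model written for this description, not NestJS or Express internals, and the handler names `getFailedJobs` and `getStuckJobs` are hypothetical (only `getJob` is named in this PR). It assumes routes are tried in registration order, which is why the static paths need to be declared before `/jobs/:id`.

```typescript
// Simplified first-match router; an illustrative model, not framework code.
type Route = { pattern: string; handler: string };

// True when `path` matches `pattern`, treating ":name" segments as wildcards.
function matches(pattern: string, path: string): boolean {
  const p = pattern.split("/");
  const s = path.split("/");
  if (p.length !== s.length) return false;
  return p.every((seg, i) => seg.startsWith(":") || seg === s[i]);
}

// Routes are tried in registration order and the first match wins.
function resolve(routes: Route[], path: string): string | undefined {
  const hit = routes.find((r) => matches(r.pattern, path));
  return hit ? hit.handler : undefined;
}

// Parameterized route registered first: "failed" is captured as an :id.
const collidingOrder: Route[] = [
  { pattern: "/queues/jobs/:id", handler: "getJob" },
  { pattern: "/queues/jobs/failed", handler: "getFailedJobs" },
];

// Static routes registered first: they are matched before the :id route.
const fixedOrder: Route[] = [
  { pattern: "/queues/jobs/failed", handler: "getFailedJobs" },
  { pattern: "/queues/jobs/stuck", handler: "getStuckJobs" },
  { pattern: "/queues/jobs/:id", handler: "getJob" },
];
```

With `collidingOrder`, a request for `/queues/jobs/failed` resolves to the `:id` handler; with `fixedOrder`, it reaches the static handler while numeric ids still fall through to `getJob`.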

Updated file

  • src/queues/monitoring/queue-monitoring.service.ts
    • Fixed throughput calculation to use real elapsed time between timestamped snapshots.
    • Added capturedAt to metrics history for time-series analysis.
    • Fixed stuck-job handling to fall back to job.timestamp when processedOn is null.
    • Added stall detection in health checks when waiting > 0, active = 0, throughput = 0.
    • Added guarded queue API calls with try/catch and structured logger output.
    • Added getRetryAnalytics(windowMinutes) with rates and per-job-type aggregation.
    • Added retryAllFailedJobs() with summary { requeued, skipped, errors }.
    • Extended periodic health checks (@Cron(EVERY_MINUTE)) with alert stubs and stuck-job recovery.
    • Added configurable monitoring thresholds for failure rate, backlog, active jobs, delayed jobs, and stuck duration.
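
The throughput, stall-detection, and stuck-job items above can be sketched as pure functions. This is a minimal sketch under stated assumptions, not the code in `queue-monitoring.service.ts`: the `MetricsSnapshot` shape, the cumulative `completed` counter, and all function names are hypothetical; only `capturedAt`, `timestamp`, and `processedOn` come from the description.

```typescript
// Hypothetical snapshot shape; only `capturedAt` is named in the PR description.
interface MetricsSnapshot {
  capturedAt: number; // epoch milliseconds when the snapshot was taken
  completed: number;  // assumed to be a cumulative completed-job count
  waiting: number;
  active: number;
}

// Jobs completed per second between two snapshots, using real elapsed time.
function throughputPerSecond(prev: MetricsSnapshot, curr: MetricsSnapshot): number {
  const elapsedMs = curr.capturedAt - prev.capturedAt;
  if (elapsedMs <= 0) return 0; // identical or out-of-order snapshots
  const delta = Math.max(0, curr.completed - prev.completed); // tolerate counter resets
  return delta / (elapsedMs / 1000);
}

// Stall signal: jobs are waiting, none are active, and nothing completed.
function isStalled(curr: MetricsSnapshot, throughput: number): boolean {
  return curr.waiting > 0 && curr.active === 0 && throughput === 0;
}

// Subset of job timing fields used for stuck-job detection.
interface JobTimes {
  timestamp: number;           // when the job was created
  processedOn?: number | null; // when processing started, if it has
}

// Age used for stuck detection; falls back to `timestamp` when `processedOn` is null.
function jobAgeMs(job: JobTimes, now: number): number {
  const startedAt = job.processedOn != null ? job.processedOn : job.timestamp;
  return now - startedAt;
}
```

For example, 60 jobs completed across two snapshots taken 30 seconds apart gives a throughput of 2 jobs per second, regardless of how often the health check is scheduled to run.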

Acceptance Criteria Checklist

  • Add queue metrics (GET /queues/metrics, GET /queues/metrics/history, GET /queues/counts, GET /queues/statistics)
  • Add failed job handling (GET /queues/jobs/failed, POST /queues/jobs/failed/retry-all, POST /queues/jobs/:id/retry)
  • Add retry monitoring (GET /queues/metrics/retries with per-job-type breakdown, window filtering, rates)
  • Queue monitoring available (GET /queues/health with healthy/warning/critical, stuck-job recovery, periodic alerting)
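
The retry-monitoring criterion above (per-job-type breakdown, window filtering, rates) can be sketched as a single aggregation pass. This is an illustration only, not the PR's `getRetryAnalytics` implementation: the `JobRecord` and `RetryStats` shapes, the `retries` field, and the `retryAnalytics` name are hypothetical, and how `retries` would be derived from Bull job data is left open.

```typescript
// Hypothetical record shape for illustration.
interface JobRecord {
  name: string;       // job type
  retries: number;    // number of retries recorded for the job (hypothetical field)
  finishedOn: number; // epoch milliseconds when the job completed or failed
}

interface RetryStats {
  total: number;     // jobs of this type inside the window
  retried: number;   // jobs of this type retried at least once
  retryRate: number; // retried / total, between 0 and 1
}

// Aggregates retry statistics per job type for jobs finished in the last `windowMinutes`.
function retryAnalytics(
  jobs: JobRecord[],
  windowMinutes: number,
  now: number,
): Record<string, RetryStats> {
  const cutoff = now - windowMinutes * 60 * 1000;
  const byType: Record<string, RetryStats> = {};
  for (const job of jobs) {
    if (job.finishedOn < cutoff) continue; // outside the analysis window
    const stats = byType[job.name] || { total: 0, retried: 0, retryRate: 0 };
    stats.total += 1;
    if (job.retries > 0) stats.retried += 1;
    stats.retryRate = stats.retried / stats.total;
    byType[job.name] = stats;
  }
  return byType;
}
```

Keeping the window cutoff inside the aggregation means a narrower `windowMinutes` changes both the totals and the rates, rather than only trimming the output.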

Labels

backend queue monitoring priority-medium


Test evidence (required)

Commands run locally

npm test
npx prettier --write src/queues/monitoring/queue-monitoring.service.ts
npx eslint src/queues/monitoring/queue-monitoring.service.ts --fix --max-warnings 0

Observed results

PASS src/queues/queue.spec.ts
Tests: 72 passed, 72 total
Suites: 1 passed, 1 total
Time: 4.904s

prettier: completed successfully
eslint: completed successfully with --max-warnings 0

Manual / API verification

Endpoint contract verification and behavior coverage were validated through QueueController and QueueMonitoringService unit tests in src/queues/queue.spec.ts, including:
- metrics, history, health, counts, statistics
- failed/stuck/scheduled job listing and retry flows
- retry analytics and window filtering
- route ordering and not-found behavior for /jobs/:id endpoints
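
The retry flows listed above include the bulk retry, which reports a `{ requeued, skipped, errors }` summary. A minimal sketch of that pattern follows. It is not the PR's `retryAllFailedJobs()` code: the `RetryableJob` stand-in, the `retryAll` name, the choice to count missing jobs as skipped, and `errors` being a list of messages (rather than a count) are all assumptions.

```typescript
// Minimal stand-in for a failed job; Bull's Job exposes an async retry() method.
interface RetryableJob {
  id: string;
  retry(): Promise<void>;
}

interface RetrySummary {
  requeued: number;
  skipped: number;
  errors: string[]; // assumed shape: one message per job that could not be retried
}

// Retries each job in turn and records the outcome instead of failing fast.
async function retryAll(jobs: Array<RetryableJob | null>): Promise<RetrySummary> {
  const summary: RetrySummary = { requeued: 0, skipped: 0, errors: [] };
  for (const job of jobs) {
    if (!job) {
      summary.skipped += 1; // e.g. the job was removed between listing and retrying
      continue;
    }
    try {
      await job.retry();
      summary.requeued += 1;
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      summary.errors.push(`${job.id}: ${message}`);
    }
  }
  return summary;
}
```

Catching per job keeps one bad retry from aborting the rest of the batch, and the summary gives the caller enough detail to decide whether to alert.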

Screenshots / recordings (if applicable)

Not applicable for backend-only changes.

drips-wave Bot commented Apr 24, 2026

@Power70 Great news! 🎉 Based on an automated assessment of this PR, the linked Wave issue(s) no longer count against your application limits.

You can already apply to more issues while waiting for a review of this PR. Keep up the great work! 🚀

@RUKAYAT-CODER
Contributor

Please resolve the conflicts and fix CI.
Kindly support the project with a star.

Power70 commented Apr 24, 2026

> Please resolve conflict fix CI Kindly support the project with a star

I have run all the failing tests locally and they all pass. I'm still trying to understand why they fail here.

Power70 commented Apr 24, 2026

Please resolve conflict fix CI Kindly support the project with a star

All checks are passing now

RUKAYAT-CODER merged commit 12bca5f into rinafcode:main on Apr 24, 2026
8 checks passed