Skip to content

issues #179

@chintakjoshi

Description

@chintakjoshi

Code Review Issues

  • lru_cache on Singletons Prevents Configuration Reload and Causes Stale State

    Files: config.py:346, jwt.py:203, sessions.py:577, signing_keys.py:261, token_service.py:131, lifecycle_service.py:764, otp_service.py:855, brute_force_service.py:284, webhook_service.py:825-853, rate_limit.py:62

    Nearly every service and infrastructure dependency is wrapped in @lru_cache. This creates several production risks:

    • No graceful restart on config change — if Redis or Postgres connection details change (e.g., failover, credential rotation), the process must be restarted. There's no mechanism to invalidate the cache.
    • Redis client is cached forever — if the connection breaks and the client enters a bad state, it won't be recreated. asyncpg has pool_pre_ping but Redis doesn't get equivalent treatment.
    • get_settings() is cached — environment variable changes (for sidecar-injected secrets) won't take effect without process restart.
  • Refresh Token Stored as SHA-256 Hash — Vulnerable to Database Leaks

    Files: sessions.py:560-562

    Refresh tokens are hashed with plain sha256. Unlike bcrypt for passwords, SHA-256 is fast — if the sessions table is leaked, an attacker could brute-force refresh tokens (they're JWTs with known structure, making partial preimage attacks feasible). Consider HMAC with a server-side secret key, or at minimum use a keyed hash.

  • Email HTML Injection in Verification and Reset Emails

    Files: lifecycle_service.py:107-109, lifecycle_service.py:119-122

    The verification_link is interpolated directly into HTML without escaping:

    f'<p><a href="{verification_link}">{verification_link}</a></p>'

    While the link is server-generated, a JWT token value containing " could break out of the attribute. Use proper HTML escaping.

  • No Database Connection Pool Tuning

    File: session.py:22

    create_async_engine(settings.database.url, pool_pre_ping=True)

    The engine is created with default pool settings (pool_size=5, max_overflow=10). For a centralized auth service handling traffic from multiple downstream services, this will bottleneck under production load. There's no configuration for:

    • pool_size / max_overflow
    • pool_timeout
    • pool_recycle (important for cloud databases that silently kill idle connections)
  • BaseHTTPMiddleware for Rate Limiting is a Known Anti-Pattern

    File: rate_limit.py:69

    Starlette's BaseHTTPMiddleware wraps each request in a StreamingResponse and holds the entire request body in memory. For an auth service this is somewhat tolerable (small payloads), but it also prevents proper response streaming and has known issues with background tasks. All your middleware uses this pattern. Consider migrating at least the rate limiter to a pure ASGI middleware.

  • Rate Limiting is Per-Path, Per-IP Only — No Per-User Rate Limiting

    File: rate_limit.py:168-171

    Rate limiting keys are rate_limit:{path}:{client_ip}. This means:

    • A single corporate NAT can be rate-limited across all its users
    • An attacker behind many IPs (botnet) bypasses the rate limit entirely
    • Authenticated endpoints have no per-user rate limits for token refresh, API key operations, etc.
    • The brute force service partially compensates for login, but there's no per-user limiting on /auth/token or admin endpoints.
  • _issue_token_pair Uses inspect.signature on Every Call

    File: auth.py:104-155

    This function uses inspect.signature() on every token issuance to support "legacy test doubles". In production, this is pure overhead. The signature won't change at runtime. Cache the inspection result or remove the legacy shim.

  • OAuth State Not Atomic (get + delete Race)

    File: oauth_service.py:385-390

    if hasattr(self._redis, "getdel"):
        raw_payload = await self._redis.getdel(key)
    else:
        raw_payload = await self._redis.get(key)
        if raw_payload is not None:
            await self._redis.delete(key)

    The fallback branch has a TOCTOU race — two concurrent callbacks with the same state could both read the value before either deletes it, allowing a replay. The getdel path is correct, but the fallback should use a Lua script or pipeline for atomicity.

  • UserService is Re-instantiated Per Request (No Caching)

    File: auth.py:71-72

    def get_user_service() -> UserService:
        return UserService()

    This creates a new CryptContext on every request. While passlib.CryptContext is lightweight, it's still unnecessary object churn at scale compared to a cached singleton.

  • No Graceful Shutdown / Lifespan Hook

    File: main.py:18-63

    There's no FastAPI lifespan handler to:

    • Close the SQLAlchemy engine pool on shutdown (there's a dispose_engine() but it's never called)
    • Close the Redis connection pool
    • Drain the webhook queue gracefully

    This will cause connection leak warnings and potentially lost webhook deliveries during rolling deployments.

  • Audit Events Are Written in the Same Transaction as Business Logic

    Files: auth.py:432-478, lifecycle.py:100-143

    Audit records are awaited in the same request path as login/signup. If the audit service or DB is slow, it directly impacts authentication latency. Audit writes should be fire-and-forget or queued asynchronously (similar to how webhook emission already uses a separate session factory).

  • Missing Index on sessions.hashed_refresh_token

    File: sessions.py:357-371

    _fetch_session_by_refresh_hash queries Session.hashed_refresh_token with WITH FOR UPDATE, but there's no visible index on this column. On every token refresh, this becomes a sequential scan against the sessions table, which grows continuously. This will become a performance cliff as the user base scales.

  • Webhook SSRF DNS Rebinding Not Fully Mitigated

    File: webhook_service.py:161-205

    The DNS resolution happens at registration time and again at delivery time (_resolve_target), which is good. However, between resolution and the actual HTTP request (httpx.post), DNS could rebind to an internal IP. The host=connect_host substitution partially mitigates this, but httpx may still perform its own DNS resolution depending on transport configuration. Consider pinning via httpx.AsyncClient(transport=...) with explicit address binding.

  • Email Column Has unique=True, but Soft Deletes Create Ambiguity

    File: user.py:42

    The email column has a hard UNIQUE constraint, but soft-deleted users retain their email. This means:

    • A user who erases their account permanently blocks that email from re-registration
    • The composite index ix_users_email_deleted_at exists but the UNIQUE constraint doesn't leverage it
  • Error Code Defaults Are Misleading

    File: error_handlers.py:48-57

    _DEFAULT_ERROR_CODE_BY_STATUS: dict[int, str] = {
        404: "invalid_token",
        405: "invalid_token",
        503: "session_expired",
    }

    A 404 or 405 returning invalid_token is semantically wrong and will confuse clients. A 503 returning session_expired masks infrastructure failures. These defaults should be more accurate codes.

  • 500 Errors Return code: "invalid_token"

    File: error_handlers.py:191

    return _error_response(status_code=500, detail=detail, code="invalid_token")

    Internal server errors are tagged with invalid_token, which will mislead SDK consumers into treating infrastructure failures as authentication failures and triggering re-auth loops.

  • No Cookie max-age Set on Auth Cookies

    File: browser_sessions.py:207-232

    set_cookie() doesn't include a max_age parameter. Without it, cookies become session cookies (deleted when browser closes). This conflicts with the 7-day refresh token TTL — users will lose their session on browser close even though the refresh token is still valid.

  • Docker Compose Services Generate Independent JWT Keys

    File: docker-compose.yml:111-119

    Each container (auth-service, webhook-worker, webhook-scheduler) independently generates RSA keys if not provided. In a fresh docker compose up, each service will have different JWT keys, meaning tokens issued by the auth service won't verify on the worker and vice versa. The key generation should be centralized or the keys should be shared via a volume/secret.

  • No Mechanism for Session Listing / Active Session Management

    Users have no way to:

    • List their active sessions
    • Revoke a specific session (only all sessions via password reset)
    • See where they're logged in from

    This is a standard feature expectation for production auth systems.

Metadata

Metadata

Assignees

Labels

bugSomething isn't working

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions