Skip to content

fix: configra observability + ops hardening#77

Merged
KooshaPari merged 1 commit into
mainfrom
fix/configra-obs-ops
Jun 26, 2026
Merged

fix: configra observability + ops hardening#77
KooshaPari merged 1 commit into
mainfrom
fix/configra-obs-ops

Conversation

@KooshaPari

@KooshaPari KooshaPari commented Jun 26, 2026

Copy link
Copy Markdown
Owner

User description

Weakest audit areas G/K, additive.

Made with Cursor


CodeAnt-AI Description

Add a shared observability and operations toolkit for Configra services

What Changed

  • Introduced a new configra-ops crate with shared logging, correlation IDs, metrics hooks, health probes, and graceful shutdown support
  • Added a CLI for liveness/readiness checks and version output, with JSON output and container-friendly exit codes
  • Re-exported the new observability tools from settly so services can use them without changing existing integration paths
  • Added a root Dockerfile, environment template, and deployment guides for local runs, containers, Kubernetes probes, and shutdown behavior

Impact

✅ Easier container health checks
✅ Clearer service logs and request tracing
✅ Safer shutdown during deploys

💡 Usage Guide

Checking Your Pull Request

Every time you make a pull request, our system automatically looks through it. We check for security issues, mistakes in how you're setting up your infrastructure, and common code problems. We do this to make sure your changes are solid and won't cause any trouble later.

Talking to CodeAnt AI

Got a question or need a hand with something in your pull request? You can easily get in touch with CodeAnt AI right here. Just type the following in a comment on your pull request, and replace "Your question here" with whatever you want to ask:

@codeant-ai ask: Your question here

This lets you have a chat with CodeAnt AI about your pull request, making it easier to understand and improve your code.

Example

@codeant-ai ask: Can you suggest a safer alternative to storing this secret?

Preserve Org Learnings with CodeAnt

You can record team preferences so CodeAnt AI applies them in future reviews. Reply directly to the specific CodeAnt AI suggestion (in the same thread) and replace "Your feedback here" with your input:

@codeant-ai: Your feedback here

This helps CodeAnt AI learn and adapt to your team's coding style and standards.

Example

@codeant-ai: Do not flag unused imports.

Retrigger review

Ask CodeAnt AI to review the PR again, by typing:

@codeant-ai: review

Check Your Repository Health

To analyze the health of your code repository, visit our dashboard at https://app.codeant.ai. This tool helps you identify potential issues and areas for improvement in your codebase, ensuring your repository maintains high standards of code health.

Co-authored-by: Cursor <cursoragent@cursor.com>
@codeant-ai

codeant-ai Bot commented Jun 26, 2026

Copy link
Copy Markdown

CodeAnt AI is reviewing your PR.

@codeant-ai

codeant-ai Bot commented Jun 26, 2026

Copy link
Copy Markdown

Thanks for using CodeAnt! 🎉

We're free for open-source projects. if you're enjoying it, help us grow by sharing.

Share on X ·
Reddit ·
LinkedIn

@gemini-code-assist

Copy link
Copy Markdown

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@socket-security

Copy link
Copy Markdown

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Diff Package Supply Chain
Security
Vulnerability Quality Maintenance License
Updatedcargo/​anyhow@​1.0.102 ⏵ 1.0.1038110093100100
Updatedcargo/​uuid@​1.23.3 ⏵ 1.23.410010093100100

View full report

@codeant-ai codeant-ai Bot added the size:XXL This PR changes 1000+ lines, ignoring generated files label Jun 26, 2026
Comment thread Dockerfile
FROM rust:1.83-bookworm AS builder

WORKDIR /build
COPY Cargo.toml Cargo.lock rust-toolchain.toml deny.toml ./

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: The build stage copies rust-toolchain.toml from the repository root, but this file does not exist at that path in this repo, so docker build fails at the COPY step before compilation starts. Remove that filename from the root COPY list or point to the actual existing location. [possible bug]

Severity Level: Critical 🚨
- ❌ Docker image build for configra-ops CLI always fails.
- ⚠️ Containerized configra-ops unavailable for CI and ops tooling.
Steps of Reproduction ✅
1. From the repository root `/workspace/Configra`, run `docker build .` which uses the
root `Dockerfile` (listed by `LS` at `/workspace/Configra/Dockerfile`) as the build
definition.

2. Inspect the build context at `/workspace/Configra` (per `LS`) and observe only
`Cargo.toml`, `Cargo.lock`, and `deny.toml` exist at the root, with no
`rust-toolchain.toml` file there; `Glob` shows `rust-toolchain.toml` only at
`crates/settly/rust-toolchain.toml`.

3. During the build, Docker executes the build-stage instruction at `Dockerfile:9` (`COPY
Cargo.toml Cargo.lock rust-toolchain.toml deny.toml ./`), attempting to copy
`rust-toolchain.toml` from the root of the build context.

4. Docker fails the build with a `COPY failed: file not found in build context` error for
`rust-toolchain.toml`, and the image build aborts before compiling the `configra-ops` CLI.

Fix in Cursor Fix in VSCode Claude

(Use Cmd/Ctrl + Click for best experience)

Prompt for AI Agent 🤖
This is a comment left during a code review.

**Path:** Dockerfile
**Line:** 9:9
**Comment:**
	*Possible Bug: The build stage copies `rust-toolchain.toml` from the repository root, but this file does not exist at that path in this repo, so `docker build` fails at the `COPY` step before compilation starts. Remove that filename from the root `COPY` list or point to the actual existing location.

Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix
👍 | 👎

Comment on lines +33 to +35
signal::ctrl_c()
.await
.expect("failed to install Ctrl+C handler");

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: Installing the Ctrl+C handler is forced with expect, so environments where signal handler registration fails will panic the process instead of degrading gracefully or returning an error. Propagate or handle this error path to avoid crashing during shutdown setup. [possible bug]

Severity Level: Major ⚠️
- ❌ Service can panic during Ctrl+C handler setup.
- ⚠️ Graceful shutdown helper unusable on failing platforms.
Steps of Reproduction ✅
1. A consumer calls the public `shutdown_signal()` helper re-exported from `configra_ops`
at `crates/configra-ops/src/lib.rs:17` or via `settly::infrastructure::observability` at
`crates/settly/src/infrastructure/observability.rs:3-6`, in a Tokio-based service.

2. The function `shutdown_signal()` in `crates/configra-ops/src/shutdown.rs:31-52`
constructs the `ctrl_c` future that awaits `tokio::signal::ctrl_c()` at line 33.

3. In an environment where `tokio::signal::ctrl_c()` returns an `Err` (for example, due to
OS signal API limitations or misconfiguration), the `.await` at line 33 yields an error
which is immediately passed to `.expect("failed to install Ctrl+C handler");` at line 35.

4. The `.expect` call panics the process, aborting the service during shutdown handler
installation rather than allowing the caller to log the failure or fall back to
alternative shutdown wiring; because `shutdown_signal` is the primary signal entrypoint
exported from the crate, any normal use in such environments will trigger this crash.

Fix in Cursor Fix in VSCode Claude

(Use Cmd/Ctrl + Click for best experience)

Prompt for AI Agent 🤖
This is a comment left during a code review.

**Path:** crates/configra-ops/src/shutdown.rs
**Line:** 33:35
**Comment:**
	*Possible Bug: Installing the Ctrl+C handler is forced with `expect`, so environments where signal handler registration fails will panic the process instead of degrading gracefully or returning an error. Propagate or handle this error path to avoid crashing during shutdown setup.

Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix
👍 | 👎

Comment on lines +40 to +43
signal::unix::signal(signal::unix::SignalKind::terminate())
.expect("failed to install SIGTERM handler")
.recv()
.await;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: SIGTERM handler registration also uses expect, which can panic on platforms/runtime contexts where this signal setup fails, turning a shutdown-path setup issue into process termination. Handle the error explicitly and continue with a fallback strategy. [possible bug]

Severity Level: Major ⚠️
- ❌ Service may crash on SIGTERM handler registration failure.
- ⚠️ Graceful shutdown coordination becomes fragile on some Unix setups.
Steps of Reproduction ✅
1. On a Unix platform, a service uses `configra_ops::shutdown_signal()` or
`GracefulShutdown::run()` exported from `crates/configra-ops/src/lib.rs:17` and
re-exported by `crates/settly/src/infrastructure/observability.rs:3-6` to coordinate
shutdown.

2. `GracefulShutdown::run()` in `crates/configra-ops/src/shutdown.rs:78-109` awaits
`shutdown_signal().await;` at line 86, which in turn sets up both SIGINT and SIGTERM
listeners via Tokio at lines 32-43.

3. On an environment or runtime where
`tokio::signal::unix::signal(SignalKind::terminate())` returns an `Err` (e.g., due to
restricted signal APIs or runtime limitations), the call at line 40 produces an error that
is immediately consumed by `.expect("failed to install SIGTERM handler")` at line 41.

4. This `.expect` panics the process while trying to install the SIGTERM handler, causing
the entire service using `GracefulShutdown` or `shutdown_signal` to abort rather than
proceeding with SIGINT-only shutdown or another fallback; the panic occurs on any normal
use of these helpers under such failing Unix environments.

Fix in Cursor Fix in VSCode Claude

(Use Cmd/Ctrl + Click for best experience)

Prompt for AI Agent 🤖
This is a comment left during a code review.

**Path:** crates/configra-ops/src/shutdown.rs
**Line:** 40:43
**Comment:**
	*Possible Bug: SIGTERM handler registration also uses `expect`, which can panic on platforms/runtime contexts where this signal setup fails, turning a shutdown-path setup issue into process termination. Handle the error explicitly and continue with a fallback strategy.

Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix
👍 | 👎

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5df6569fe7

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".


use clap::{Parser, Subcommand};
use configra_ops::{
HealthCheck, LoggingConfig, WorkspaceCheck, init_logging, liveness, readiness, VERSION,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Import WorkspaceCheck through a public API

In the configra-ops binary, WorkspaceCheck is imported from the crate root, but src/lib.rs does not re-export that type; it is only public under configra_ops::health::WorkspaceCheck. Any cargo build -p configra-ops will fail with an unresolved import here, so the new CLI cannot be built until the import path or re-export is fixed.

Useful? React with 👍 / 👎.

Comment thread Dockerfile
FROM rust:1.83-bookworm AS builder

WORKDIR /build
COPY Cargo.toml Cargo.lock rust-toolchain.toml deny.toml ./

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Remove the missing rust-toolchain from the Docker COPY

The root Docker build now tries to copy rust-toolchain.toml, but this commit's Docker context has no root-level file by that name; repo-wide search only finds crates/settly/rust-toolchain.toml. As a result, docker build -t configra-ops:local . fails at this COPY step before it can compile the CLI, which breaks the new deploy artifact described in the added docs.

Useful? React with 👍 / 👎.

Comment on lines +39 to +41
let workspace = WorkspaceCheck;
let checks: Vec<&dyn HealthCheck> = vec![&workspace];
readiness(VERSION, &checks)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: Readiness mode claims to run dependency checks, but it only executes WorkspaceCheck, which is a compile-time/linkability no-op that always returns healthy. This makes readiness effectively meaningless and can report healthy even when real runtime dependencies are down. Wire actual dependency checks into this command path. [incomplete implementation]

Severity Level: Major ⚠️
- ⚠️ CLI readiness never fails independent of runtime dependency health.
- ⚠️ Operators may trust a misleading readiness probe status.
Steps of Reproduction ✅
1. Execute `configra-ops health --ready` which runs `main()` in
`crates/configra-ops/src/bin/configra-ops.rs:32-67` and selects the readiness branch at
`configra-ops.rs:37-44`.

2. In that branch, the CLI constructs a single check: `let workspace = WorkspaceCheck; let
checks: Vec<&dyn HealthCheck> = vec![&workspace];` at `configra-ops.rs:39-40` and passes
it to `readiness(VERSION, &checks)` at `configra-ops.rs:41`.

3. `WorkspaceCheck` is defined in `crates/configra-ops/src/health.rs:151-163` and its
`check()` implementation at `health.rs:160-161` always returns `Ok(())`, i.e., it cannot
fail and does not consult any runtime dependency (databases, queues, external services,
etc.).

4. `readiness()` in `crates/configra-ops/src/health.rs:95-97` simply wraps
`readiness_with_metrics()`, which iterates the provided checks at `health.rs:113-123`;
since only `WorkspaceCheck` is present and always returns `Ok(())`, the resulting
`HealthReport.status` is always `Healthy` and the CLI prints `"ok"` (or a healthy JSON
payload) regardless of the actual state of any real runtime dependencies, making the
`--ready` mode effectively equivalent to a liveness check.

Fix in Cursor Fix in VSCode Claude

(Use Cmd/Ctrl + Click for best experience)

Prompt for AI Agent 🤖
This is a comment left during a code review.

**Path:** crates/configra-ops/src/bin/configra-ops.rs
**Line:** 39:41
**Comment:**
	*Incomplete Implementation: Readiness mode claims to run dependency checks, but it only executes `WorkspaceCheck`, which is a compile-time/linkability no-op that always returns healthy. This makes readiness effectively meaningless and can report healthy even when real runtime dependencies are down. Wire actual dependency checks into this command path.

Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix
👍 | 👎

liveness(VERSION)
};

let emit_json = json || std::env::var("CONFIGRA_LOG_FORMAT").map(|v| v == "json").unwrap_or(false);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: JSON mode detection is case-sensitive here, but logging format parsing in LogFormat::from_env is case-insensitive. With values like JSON, logs switch to JSON while health output stays plain text, creating inconsistent behavior. Normalize case in this check the same way as the logging parser. [api mismatch]

Severity Level: Major ⚠️
- ⚠️ Health CLI ignores uppercase CONFIGRA_LOG_FORMAT for JSON mode.
- ⚠️ Probes expecting JSON may mis-parse plain-text responses.
Steps of Reproduction ✅
1. In a shell, set `CONFIGRA_LOG_FORMAT=JSON` (uppercase) before running the CLI so that
`LogFormat::from_env()` in `crates/configra-ops/src/logging.rs:21-31` lowercases the value
and selects `LogFormat::Json`.

2. Run the health command, e.g. `configra-ops health` or `configra-ops health --ready`,
which invokes `main()` in `crates/configra-ops/src/bin/configra-ops.rs:32-67`.

3. `main()` initializes logging via `init_logging(&LoggingConfig::default())` at
`configra-ops.rs:34`, which uses `LogFormat::from_env()`; logging is now in JSON format
due to the case-insensitive parsing.

4. Later in the same function, health output format is decided by `let emit_json = json ||
std::env::var("CONFIGRA_LOG_FORMAT").map(|v| v == "json").unwrap_or(false);` at
`configra-ops.rs:46`, where the env var is compared case-sensitively to `"json"`, so with
`CONFIGRA_LOG_FORMAT=JSON` this expression evaluates to `false` and the CLI prints plain
`"ok"` / `"unhealthy"` instead of JSON, diverging from the configured log format.

Fix in Cursor Fix in VSCode Claude

(Use Cmd/Ctrl + Click for best experience)

Prompt for AI Agent 🤖
This is a comment left during a code review.

**Path:** crates/configra-ops/src/bin/configra-ops.rs
**Line:** 46:46
**Comment:**
	*Api Mismatch: JSON mode detection is case-sensitive here, but logging format parsing in `LogFormat::from_env` is case-insensitive. With values like `JSON`, logs switch to JSON while health output stays plain text, creating inconsistent behavior. Normalize case in this check the same way as the logging parser.

Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix
👍 | 👎

Comment on lines +82 to +85
pub fn ensure_active() -> CorrelationId {
let id = CorrelationId::from_env_or_new();
id.attach_to_span();
id

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: This function always generates a new correlation ID (when env is unset) and records it, even if the active span already has one. Repeated calls in the same request path will overwrite/fragment correlation context and break end-to-end traceability. Reuse the existing active correlation ID when present instead of always minting a new one. [logic error]

Severity Level: Major ⚠️
- ⚠️ Single request may emit multiple correlation IDs.
- ⚠️ Tracing a request across layers becomes significantly harder.
Steps of Reproduction ✅
1. A service crate imports `CorrelationLayer` via the re-export in
`crates/settly/src/infrastructure/observability.rs:3-6` and calls
`CorrelationLayer::ensure_active()` in two different layers of the same request-handling
path (e.g., an HTTP middleware and a downstream service function), without setting
`CONFIGRA_CORRELATION_ID` in the environment.

2. On the first call, `ensure_active()` at `crates/configra-ops/src/correlation.rs:82-85`
invokes `CorrelationId::from_env_or_new()` at `correlation.rs:30-35`; since
`CONFIGRA_CORRELATION_ID` is unset, this falls through to `CorrelationId::new()` at
`correlation.rs:15-17`, generating a fresh UUID (ID A).

3. `ensure_active()` then calls `attach_to_span()` at `correlation.rs:54-56`, which
records `correlation_id` with value A on the current `tracing::Span`, and returns the
`CorrelationId`.

4. On a later call within the same logical request, `ensure_active()` repeats the same
logic: `from_env_or_new()` again sees no env var, generates a different random UUID (ID
B), and `attach_to_span()` overwrites the span's `correlation_id` at `correlation.rs:55`;
logs or child spans created before and after the second call will now carry different
correlation IDs for the same request, fragmenting end-to-end traceability.

Fix in Cursor Fix in VSCode Claude

(Use Cmd/Ctrl + Click for best experience)

Prompt for AI Agent 🤖
This is a comment left during a code review.

**Path:** crates/configra-ops/src/correlation.rs
**Line:** 82:85
**Comment:**
	*Logic Error: This function always generates a new correlation ID (when env is unset) and records it, even if the active span already has one. Repeated calls in the same request path will overwrite/fragment correlation context and break end-to-end traceability. Reuse the existing active correlation ID when present instead of always minting a new one.

Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix
👍 | 👎

Comment on lines +135 to +140
let started = Instant::now();
let outcome = check.check();
if started.elapsed() > timeout {
return Err(format!("check exceeded {}ms budget", timeout.as_millis()));
}
outcome

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: The timeout wrapper is not enforcing a real timeout because check.check() runs to completion before elapsed time is evaluated. If a check blocks/hangs, readiness will hang indefinitely instead of failing fast. Run each check under an actual timed execution path (for example, async timeout or worker thread + join timeout) so slow/hung checks cannot block the probe. [logic error]

Severity Level: Critical 🚨
- ❌ Hung dependency check can block readiness endpoint indefinitely.
- ⚠️ Orchestrators may mis-handle unresponsive readiness for the service.
Steps of Reproduction ✅
1. An integrator implements a `HealthCheck` in their service crate using the trait from
`crates/configra-ops/src/health.rs:75-81`, with a `check()` method that performs blocking
I/O (e.g., a network call) and can hang indefinitely under certain failure modes.

2. That check is added to the `checks` slice passed into `readiness()` at
`crates/configra-ops/src/health.rs:95-97` (or `readiness_with_metrics()` at
`health.rs:100-131`), which is then exposed via an HTTP readiness endpoint or via the
`configra-ops` CLI's `Health { ready: true, .. }` branch at
`crates/configra-ops/src/bin/configra-ops.rs:37-44`.

3. `readiness_with_metrics()` iterates the checks and calls `run_with_timeout(check,
timeout)` at `health.rs:113-116`; inside `run_with_timeout()` at `health.rs:133-140`,
`check.check()` is executed synchronously at `health.rs:136` before any timeout
evaluation.

4. If `check.check()` hangs, `run_with_timeout()` never reaches the `started.elapsed() >
timeout` check at `health.rs:137-139`, so no timeout error is returned; instead, the
readiness call (and any CLI or HTTP endpoint wrapping it) blocks indefinitely, defeating
the purpose of `CONFIGRA_HEALTHCHECK_TIMEOUT_SECS` and allowing a single bad check to
stall the entire readiness probe.

Fix in Cursor Fix in VSCode Claude

(Use Cmd/Ctrl + Click for best experience)

Prompt for AI Agent 🤖
This is a comment left during a code review.

**Path:** crates/configra-ops/src/health.rs
**Line:** 135:140
**Comment:**
	*Logic Error: The timeout wrapper is not enforcing a real timeout because `check.check()` runs to completion before elapsed time is evaluated. If a check blocks/hangs, readiness will hang indefinitely instead of failing fast. Run each check under an actual timed execution path (for example, async timeout or worker thread + join timeout) so slow/hung checks cannot block the probe.

Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix
👍 | 👎

@codeant-ai

codeant-ai Bot commented Jun 26, 2026

Copy link
Copy Markdown

CodeAnt AI finished reviewing your PR.

@KooshaPari KooshaPari merged commit 9306740 into main Jun 26, 2026
12 of 17 checks passed
@KooshaPari KooshaPari deleted the fix/configra-obs-ops branch June 26, 2026 14:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:XXL This PR changes 1000+ lines, ignoring generated files

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant