Observability Agent (OA): read-only data gateway for logs, events, and metrics. Supports both K8s clusters and bare-metal/VM servers (standalone mode). OA serves this document at `GET /skill.md`.
OA runs in one of two modes, auto-detected by the presence of KUBERNETES_SERVICE_HOST.
| Mode | Detection | Targets | Log Source | Events | Metrics Source |
|---|---|---|---|---|---|
| K8s | `KUBERNETES_SERVICE_HOST` present | Pods (namespace/selector) | K8s container logs API | K8s Events | Pod annotation-based scrape |
| Standalone | `KUBERNETES_SERVICE_HOST` absent | Services (`OA_SERVICES` env) | File tail + journalctl | None | Direct URL scrape |

- Auth: `Authorization: Bearer <JWT>` (required on protected API requests)
- No-auth endpoints: `/healthz`, `/livez`, `/readyz`, `/skill.md`, `/.well-known/skill.md`

OA verifies JWTs using an HS256 shared secret.
- `OA_JWT_SECRET` (required, HS256 shared secret, min 32 chars)

JWT rules:
- Algorithm: HS256
- `exp` claim required (recommended 5–15 min)
- Missing or invalid JWT → 401
- Scoped JWT missing namespace/service/capability scope → 403

The client (AI Agent) signs an HS256 JWT using `OA_JWT_SECRET` (env) and sends it with each request. Keep the secret in runtime memory only; never expose it in logs, files, or output.
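
A minimal sketch of client-side signing, using only the Python standard library. The helper names `b64url` and `sign_hs256` and the `ttl_seconds` parameter are illustrative assumptions, not part of OA; the only facts taken from this document are the HS256 algorithm, the required `exp` claim, and the 32-character minimum for the secret.

```python
import base64
import hashlib
import hmac
import json
import time


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWT segments require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def sign_hs256(claims: dict, secret: str, ttl_seconds: int = 600) -> str:
    """Return a compact HS256 JWT carrying `claims` plus an `exp` claim.

    Hypothetical helper: ttl_seconds defaults to 10 minutes, inside the
    recommended 5-15 minute range.
    """
    if len(secret) < 32:
        raise ValueError("OA_JWT_SECRET must be at least 32 characters")
    header = {"alg": "HS256", "typ": "JWT"}
    payload = dict(claims, exp=int(time.time()) + ttl_seconds)
    signing_input = ".".join(
        b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    )
    signature = hmac.new(
        secret.encode("utf-8"), signing_input.encode("ascii"), hashlib.sha256
    ).digest()
    return signing_input + "." + b64url(signature)
```

In practice the secret would be read from the environment at call time (for example `os.environ["OA_JWT_SECRET"]`) and the returned token sent as `Authorization: Bearer <token>`, without printing or persisting either value.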
Authorization claims:
{
"sub": "agent-01",
"allowedNamespaces": ["prod", "monitoring"],
"allowedServices": ["validator-*"],
"capabilities": ["pods", "logs", "events", "metrics"],
"admin": false
}

- K8s pod discovery requires the `pods` capability and a namespace scope.
- K8s namespace scopes support exact names and `*` wildcards; `allowedNamespaces: ["*"]` permits all namespaces and `ns=*`.
- K8s selector bundles require the `pods` capability because selector targeting performs pod discovery internally.
- Bundle create/status/download enforce the bundle target scope and requested capabilities.
- Standalone `allowedServices` entries can use `*` wildcards; `allowedServices: ["*"]` permits all configured services.
- Non-admin discovery responses are redacted.
- Legacy JWTs with no authorization scope claims keep full access for compatibility.

- Create bundle: `POST /v1/bundles`
- Poll status: `GET /v1/bundles/{bundleId}` every 1–2 s, up to 30 s, until `done`
- Download: `GET /v1/bundles/{bundleId}/download` → `ndjson.gz`
- Analyze: stream-parse the NDJSON, then let the AI analyze it
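
The polling step above can be sketched as a small loop. This is an illustration under assumptions: the status response is taken to be a JSON object with a `status` field equal to `"done"` on completion (the exact response shape is not specified here), and `get_status` is a hypothetical callable that performs the authenticated `GET /v1/bundles/{bundleId}` request.

```python
import time
from typing import Callable


def wait_until_done(
    get_status: Callable[[], dict],
    interval_s: float = 1.5,
    timeout_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Poll get_status() every interval_s seconds until it reports done.

    Raises TimeoutError once another wait would exceed timeout_s.
    The "status" == "done" check is an assumed response shape.
    """
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status.get("status") == "done":
            return status
        if clock() + interval_s > deadline:
            raise TimeoutError(f"bundle not done after {timeout_s} s")
        sleep(interval_s)
```

Injecting `sleep` and `clock` keeps the loop testable without real waiting; a caller would normally leave both at their defaults.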
GET /v1/pods?ns=<namespace>&q=<substring>
- `ns`: namespace (`*` = all namespaces; requires admin or `allowedNamespaces: ["*"]`)
- `selector`: label selector
- `q`: pod name substring search

Response: namespace, name, labels, containers[], ready, phase. Admin responses also include podIP, annotations, and nodeName.
GET /v1/services
Returns registered services configured via OA_SERVICES env, filtered by JWT service scope.
Admin response example:
{
"items": [
{ "name": "solana-validator", "logs": ["/var/log/solana/validator.log"], "journal": null, "metrics": "http://localhost:9090/metrics" },
{ "name": "rpc-node", "logs": ["/var/log/solana/rpc.log"], "journal": null, "metrics": null }
]
}

OA supports two time window modes. Use only one at a time. In standalone mode, `timeWindow` applies only to journal sources.
- Relative:
{ "timeWindow": { "sinceSeconds": 600 } }
- Absolute (UTC, ISO 8601 with a trailing `Z`):
{
"timeWindow": {
"start": "2026-02-09T00:00:00Z",
"end": "2026-02-09T00:10:00Z"
}
}

Rules:
- Using both `sinceSeconds` and `start`/`end` → 400
- In standalone mode, time windows apply only to journal sources; file sources use `tailLines`
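
A client can catch the most common time window mistakes before sending a request. The sketch below is a hypothetical pre-flight check, not OA's own validator: `validate_time_window` and `parse_utc` are illustrative names, treating the 3,600-second `sinceSecondsMax` limit as a hard rejection is an assumption, and so is requiring `start` to precede `end`.

```python
from datetime import datetime, timezone


def parse_utc(value: str) -> datetime:
    """Parse a UTC timestamp such as 2026-02-09T00:00:00Z."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def validate_time_window(tw: dict, since_seconds_max: int = 3600) -> str:
    """Return "relative" or "absolute"; raise ValueError otherwise."""
    has_relative = "sinceSeconds" in tw
    has_start, has_end = "start" in tw, "end" in tw
    # Documented rule: mixing the two modes is a 400.
    if has_relative and (has_start or has_end):
        raise ValueError("use sinceSeconds or start/end, not both")
    if has_relative:
        seconds = tw["sinceSeconds"]
        # Assumption: values above sinceSecondsMax are rejected, not clamped.
        if type(seconds) is not int or not 0 < seconds <= since_seconds_max:
            raise ValueError(f"sinceSeconds must be 1..{since_seconds_max}")
        return "relative"
    if not (has_start and has_end):
        raise ValueError("timeWindow needs sinceSeconds, or start and end together")
    # Assumption: an empty or inverted range is treated as invalid.
    if parse_utc(tw["start"]) >= parse_utc(tw["end"]):
        raise ValueError("start must be before end")
    return "absolute"
```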
{
"timeWindow": { "sinceSeconds": 600 },
"target": {
"namespace": "*",
"selector": "app=web,tier=backend"
},
"include": {
"logs": { "enabled": true, "tailLines": 2000, "previous": true, "timestamps": true },
"events": { "enabled": true },
"metrics": { "enabled": true }
},
"limits": {
"maxPods": 20,
"maxTotalLogLines": 50000,
"metricsTimeoutMs": 2000
}
}

{
"timeWindow": { "sinceSeconds": 600 },
"target": {
"pods": [
{ "namespace": "default", "pod": "my-app-pod-0" }
]
},
"include": {
"logs": { "enabled": true, "tailLines": 2000, "previous": true, "timestamps": true },
"events": { "enabled": true },
"metrics": { "enabled": true }
}
}
`selector` and `pods[]` are mutually exclusive. Providing both → 400.
{
"target": {
"kind": "services",
"services": ["solana-validator", "rpc-node"]
},
"include": {
"logs": { "enabled": true, "tailLines": 2000, "includePatterns": ["ERROR"], "excludePatterns": ["healthcheck"] },
"metrics": { "enabled": true }
},
"limits": {
"maxTotalLogLines": 50000,
"metricsTimeoutMs": 2000
}
}

Standalone rules:
- Use `target.kind: "services"` with a required `target.services` array of names registered in `OA_SERVICES`, or `target.kind: "all"` for every registered service
- `kind` is `"services"` when a services array is present and no explicit kind is supplied
- `events` is ignored in standalone requests
- `previous` and `timestamps` are ignored in standalone requests
- File logs are collected via `tail -n <include.logs.tailLines>` from paths configured per service
- Journal logs are collected via `journalctl`; they use `timeWindow` when supplied, otherwise `include.logs.tailLines`
- When logs are enabled, `timeWindow` is accepted only when selected standalone services include a configured journal source; file logs are never time-filtered
- OA applies include/exclude filters first, globally merges matching records by parsed timestamp, and then applies the final `maxTotalLogLines` budget. Untimestamped records inherit the previous timestamp seen from the same source for ranking, or source read order when no previous source timestamp exists. Diagnostic skipped/error records are emitted outside this returned-line budget and counted as `diagnosticRecords` in `log_summary`
- Clients cannot request arbitrary file paths or journal units; only registered `OA_SERVICES` entries are available
- OA uses the current process OS permissions and does not elevate privileges

Standalone log API constraints:
| Field | Applies to | Behavior |
|---|---|---|
| `include.logs.tailLines` | File logs, journal logs without `timeWindow` | Passed to `tail -n` for files and `journalctl -n` for journals |
| `timeWindow.sinceSeconds` | Journal logs only | Relative journal window; rejected when logs are enabled and selected services have no journal source |
| `timeWindow.start` / `timeWindow.end` | Journal logs only | Absolute journal window; both fields required together; rejected when logs are enabled and selected services have no journal source |
| `limits.maxTotalLogLines` | Standalone log lines | Final returned-line budget after filtering and global merge; diagnostic skipped/error records are outside this cap |

K8s selector bundle note:
- Selector targets list matching pods internally before collecting logs/events/metrics.
- Scoped tokens therefore need both the `pods` capability and the requested data-source capabilities for selector bundles.

include.logs.includePatterns: string[] keeps only lines containing at least one substring (like grep).
include.logs.excludePatterns: string[] removes lines by substring match (like grep -v).
Standalone applies include/exclude filters before the final maxTotalLogLines budget. includePatterns is standalone-only; excludePatterns also works in K8s mode.
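
The substring semantics can be sketched as a single predicate. `keep_line` is an illustrative helper, not an OA function, and case-sensitive matching is an assumption borrowed from the default behavior of `grep`.

```python
from typing import Sequence


def keep_line(
    line: str,
    include_patterns: Sequence[str] = (),
    exclude_patterns: Sequence[str] = (),
) -> bool:
    """Return True when a log line survives both filters.

    Plain substring matching, case-sensitive (assumed). An empty
    include list keeps every line, like running no grep at all.
    """
    if include_patterns and not any(p in line for p in include_patterns):
        return False
    return not any(p in line for p in exclude_patterns)
```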
Example:
{
"include": {
"logs": {
"enabled": true,
"includePatterns": ["ERROR", "panic"],
"excludePatterns": ["GET /healthz", "healthcheck"]
}
}
}

Common records:

| type | Description | Key Fields |
|---|---|---|
| `meta` | Bundle metadata | bundleId, createdAt, params |

K8s records:

| type | Description | Key Fields |
|---|---|---|
| `log` | Container log | namespace, pod, container, ts, line, previous?, skipped?, reason? |
| `event` | K8s event | namespace, reason, message, ts, involvedObject |
| `metrics_text` | Pod metrics | namespace, pod, port, path, ts, ok/skipped/error, content |

Standalone records:

| type | Description | Key Fields |
|---|---|---|
| `log` | File log | service, file, ts, line, skipped?, reason? |
| `log` | Journal log | service, journal, journalScope?, journalUser?, ts, line, skipped?, reason? |
| `log_error` | User journal error | service, journal, journalScope, journalUser, ts, reason, error |
| `log_summary` | Log budget/source summary | ts, lineLimited, matchedLogRecords, returnedLogRecords, diagnosticRecords, sources[] |
| `metrics_text` | Service metrics | service, url, ts, ok/skipped/error, content |

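
A downloaded bundle can be read record by record without loading the whole file into memory. The sketch below assumes the bundle is gzip-compressed NDJSON with one JSON object per line, as described above; `iter_records` and `count_by_type` are hypothetical helper names.

```python
import gzip
import io
import json
from collections import Counter
from typing import BinaryIO, Iterator


def iter_records(stream: BinaryIO) -> Iterator[dict]:
    """Yield one dict per non-empty NDJSON line of a gzip stream."""
    with gzip.open(stream, mode="rt", encoding="utf-8") as text:
        for line in text:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_by_type(stream: BinaryIO) -> Counter:
    """Count records per `type` field (meta, log, event, ...)."""
    return Counter(rec.get("type", "unknown") for rec in iter_records(stream))
```

`stream` can be an open file (`open(path, "rb")`) or an in-memory `io.BytesIO` wrapping the download response body.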
Standalone log skip reasons:
- `file_not_found`: log file does not exist
- `read_error`: file read failed (permissions, etc.)
- `journalctl_not_found`: journalctl binary not found
- `journal_permission_denied`: journalctl reported insufficient journal permissions
- `journal_read_error`: journalctl execution failed (permissions, etc.)

Standalone metrics status:
| Status | Meaning | Fields |
|---|---|---|
| Success | Scrape OK | ok: true, content: "# HELP ..." |
| Normal skip | No metrics URL configured | skipped: true, reason: "no_metrics_url" |
| Timeout | Response timed out | ok: false, error: "timeout after 2000ms" |
| Failure | Connection failed | ok: false, error: "fetch_failed: ECONNREFUSED" |
If a pod has not restarted, previous=true logs may not exist and K8s may return 400/404. This is normal and must not fail the bundle.
OA writes a skip record in this case:
{"type":"log","namespace":"ns","pod":"p","container":"c","ts":"...","previous":true,"skipped":true,"reason":"no_previous_container"}

K8s metrics status:

| Status | Meaning | Fields |
|---|---|---|
| Success | Scrape OK | ok: true, content: "# HELP ..." |
| Normal skip | No annotation (pod does not expose metrics) | skipped: true, reason: "annotation_missing" |
| Failure | Annotation present but connection failed (anomaly signal) | ok: false, error: "timeout after 2000ms" |
- Events (K8s only): OOMKilled, CrashLoopBackOff, FailedScheduling
- Logs: panic, fatal, segfault, timeout, connection refused
- Metrics: `ok:false` is an anomaly signal (service down / network issue); `skipped:true` is normal
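
The metrics rule can be sketched as a three-way classifier over `metrics_text` records. `classify_metrics` is an illustrative name, and treating any record that is neither skipped nor `ok: true` as an anomaly is an assumption about malformed records.

```python
def classify_metrics(record: dict) -> str:
    """Map a metrics_text record to "ok", "skipped", or "anomaly"."""
    if record.get("type") != "metrics_text":
        raise ValueError("not a metrics_text record")
    if record.get("skipped") is True:
        return "skipped"  # normal: the target does not expose metrics
    if record.get("ok") is True:
        return "ok"
    # ok: false (timeout, connection refused, ...) is an anomaly signal.
    return "anomaly"
```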
- Group recurring errors by signature + count occurrences
- Record first/last occurrence timestamps
- Drill down: in K8s use a narrower selector or a single pod; in standalone use a single service, lower `tailLines`, or a shorter journal `timeWindow`
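
Grouping by signature can be sketched as follows. The normalization in `line_signature` (replacing hex values and digit runs with `#`) is one possible heuristic, not something OA prescribes, and comparing `ts` values as strings assumes every record uses the same UTC ISO 8601 format. All names are illustrative.

```python
import re
from dataclasses import dataclass
from typing import Iterable, List

# Heuristic (assumption): hex values and digit runs are volatile details.
_VOLATILE = re.compile(r"0x[0-9a-fA-F]+|\d+")

KEYWORDS = ("panic", "fatal", "segfault", "timeout", "connection refused")


def line_signature(line: str) -> str:
    """Collapse volatile numbers so recurring errors share a key."""
    return _VOLATILE.sub("#", line).strip()


@dataclass
class ErrorGroup:
    signature: str
    count: int
    first_ts: str
    last_ts: str


def group_errors(records: Iterable[dict]) -> List[ErrorGroup]:
    """Group matching log lines by signature, most frequent first."""
    groups = {}
    for rec in records:
        if rec.get("type") != "log" or rec.get("skipped"):
            continue
        line = rec.get("line", "")
        if not any(k in line.lower() for k in KEYWORDS):
            continue
        sig, ts = line_signature(line), rec.get("ts", "")
        group = groups.get(sig)
        if group is None:
            groups[sig] = ErrorGroup(sig, 1, ts, ts)
        else:
            group.count += 1
            group.first_ts = min(group.first_ts, ts)
            group.last_ts = max(group.last_ts, ts)
    return sorted(groups.values(), key=lambda g: g.count, reverse=True)
```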
K8s:

| User Input | Action |
|---|---|
| "Analyze backend logs" | GET /v1/pods?q=backend → bundle all matching pods |
| "Only my-app pod 0" | target.pods: [{namespace: "default", pod: "my-app-pod-0"}] |
| "All cluster error logs" | namespace: "*", logs only, cluster ERROR/WARN |

Standalone:

| User Input | Action |
|---|---|
| "Analyze solana validator logs" | GET /v1/services → target.services: ["solana-validator"] |
| "Check all service status" | target.kind: "all" |
| "Only rpc-node metrics" | target.services: ["rpc-node"], logs disabled, metrics only |

K8s defaults:

| Field | Default |
|---|---|
| sinceSeconds | 600 (10 min) |
| tailLines | 2000 |
| namespace | * (all) |
| containers | all |
| previous | true |
| timestamps | true (forced true in absolute time mode) |

Standalone defaults:

| Field | Default |
|---|---|
| timeWindow | none (journal sources use tailLines unless requested) |
| tailLines | 2000 |

Limits:

| Field | Value |
|---|---|
| maxTotalLogLines | 50,000 |
| sinceSecondsMax | 3,600 (1 hour) |
| metricsTimeoutMs | 2,000 |
| bundle TTL | 60 min auto-delete |

K8s limits:

| Field | Value |
|---|---|
| maxPods | 20 |
| maxMetricsPods | 20 |
Standalone mode defines services via the OA_SERVICES env:
export OA_JWT_SECRET="replace-with-at-least-32-random-chars"
export OA_SERVICES='[
{"name":"solana-validator","logs":["/var/log/solana/validator.log"],"metrics":"http://localhost:9090/metrics"},
{"name":"rpc-node","logs":["/var/log/solana/rpc.log"]}
]'
node dist/index.js

Service definition fields:
| Field | Required | Description |
|---|---|---|
| `name` | Yes | Unique service identifier |
| `logs` | No | Array of log file paths to collect |
| `journal` | No | systemd unit name (journalctl log collection) |
| `journalScope` | No | `system` (default) or `user` |
| `journalUser` | No | Username or UID required when `journalScope` is `user` |
| `metrics` | No | Prometheus metrics URL |

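
An operator can sanity-check an `OA_SERVICES` value before starting OA. The sketch below checks only the constraints stated in the table above (required unique `name`, `journalScope` of `system` or `user`, `journalUser` required for user scope); `parse_services` is a hypothetical helper and OA's own startup validation may differ.

```python
import json
from typing import List


def parse_services(raw: str) -> List[dict]:
    """Parse an OA_SERVICES JSON string and check the documented fields."""
    services = json.loads(raw)
    if not isinstance(services, list):
        raise ValueError("OA_SERVICES must be a JSON array")
    seen = set()
    for svc in services:
        name = svc.get("name")
        if not isinstance(name, str) or not name:
            raise ValueError("each service needs a non-empty name")
        if name in seen:
            raise ValueError(f"duplicate service name: {name}")
        seen.add(name)
        scope = svc.get("journalScope", "system")  # documented default
        if scope not in ("system", "user"):
            raise ValueError(f"{name}: journalScope must be system or user")
        if scope == "user" and svc.get("journalUser") in (None, ""):
            raise ValueError(f"{name}: journalUser is required for user scope")
    return services
```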
Standalone permission model:
- File and journal readability depends on the OS permissions of the OA process.
- OA does not create users, join system groups, run sudo, or bypass systemd journal permissions.
- Full system and user journal visibility is possible only when the existing process account can already read those journals.
- User journal permission and `journalUser` resolution failures are emitted as `log_error` records instead of empty log output.
- Metrics URLs are operator-provided trusted configuration and may point at localhost or private networks for compatibility.

Standalone time windows:
- File log requests read the latest configured line budget with `tail`; `sinceSeconds` and absolute windows do not seek or filter file contents.
- Journal requests use either `timeWindow` (`--since`/`--until`) or the configured line budget, not both.
- `timeWindow` on a standalone request is rejected when logs are enabled and the selected services have no configured journal source.
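
The rejection rule in the last bullet can be checked on the client before a bundle is created. This is a sketch under assumptions: `time_window_allowed` is a hypothetical helper, `services` is the list returned by `GET /v1/services` (admin form, with a `journal` field), and logs are treated as disabled when `include.logs.enabled` is absent.

```python
from typing import List


def time_window_allowed(request: dict, services: List[dict]) -> bool:
    """Return False when a standalone request's timeWindow would be rejected."""
    if "timeWindow" not in request:
        return True
    # Assumption: a missing include.logs.enabled means logs are off.
    logs = request.get("include", {}).get("logs", {})
    if not logs.get("enabled", False):
        return True
    target = request.get("target", {})
    if target.get("kind") == "all":
        selected = services
    else:
        wanted = set(target.get("services", []))
        selected = [s for s in services if s.get("name") in wanted]
    # Accepted only when at least one selected service has a journal source.
    return any(s.get("journal") for s in selected)
```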
- Always prefer the bundle API (raw endpoints are for small-scale debugging)
- Use multiple smaller bundles to drill down rather than one large time range
- `metrics_text` with `ok:false` is an anomaly signal by itself
- `skipped:true` is normal (the service/pod does not expose metrics)