Entlein/adaptive write perf#38

Draft
entlein wants to merge 66 commits into main from entlein/adaptive-write-perf

Conversation

@entlein

@entlein entlein commented May 9, 2026

will be deleted

ddelnano and others added 30 commits April 12, 2026 18:57
…house

Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
…ntext

Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
…e context prometheus backends and test out the write clickhouse experiment

Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
…ging

Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
Signed-off-by: entlein <einentlein@gmail.com>
Signed-off-by: entlein <einentlein@gmail.com>
…e schema_creation option , probably needs complete rewrite, this is to make it work e2e for the lab now

Signed-off-by: entlein <einentlein@gmail.com>
Signed-off-by: entlein <einentlein@gmail.com>
Signed-off-by: entlein <einentlein@gmail.com>
Entlein and others added 19 commits May 9, 2026 22:02
…set install

- sink/clickhouse.go: CH JSONEachRow encodes UInt64 columns as quoted
  strings by default (output_format_json_quote_64bit_integers=1). The
  rehydrate query for adaptive_attribution returned pid + n_anomalies
  as "49927" / "47", failing json.Unmarshal into a Go uint64 field.
  Switch to json.RawMessage + nsFromRaw which already handles either
  shape.
- cmd/main.go: log presets_from_cloud + already_on_cluster + builtin_count
  so 'installed=0' becomes diagnosable.
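
A minimal sketch of the quoted-or-bare decoding described in the sink/clickhouse.go bullet above. The struct and the helper `uint64FromRaw` are hypothetical names for illustration; the operator's own `nsFromRaw` is not shown in this PR text and may differ.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"strconv"
)

// attributionRow is a hypothetical row shape; the field names mirror the
// columns named in the commit message (pid, n_anomalies).
type attributionRow struct {
	PID        json.RawMessage `json:"pid"`
	NAnomalies json.RawMessage `json:"n_anomalies"`
}

// uint64FromRaw accepts either a bare JSON number (47) or a quoted decimal
// string ("47") and returns the value as a uint64.
func uint64FromRaw(raw json.RawMessage) (uint64, error) {
	s := string(bytes.TrimSpace(raw))
	if len(s) >= 2 && s[0] == '"' && s[len(s)-1] == '"' {
		s = s[1 : len(s)-1] // strip the quotes ClickHouse adds for 64-bit ints
	}
	return strconv.ParseUint(s, 10, 64)
}

func main() {
	// One JSONEachRow line mixing both encodings.
	line := []byte(`{"pid":"49927","n_anomalies":47}`)
	var row attributionRow
	if err := json.Unmarshal(line, &row); err != nil {
		panic(err)
	}
	pid, err := uint64FromRaw(row.PID)
	if err != nil {
		panic(err)
	}
	n, err := uint64FromRaw(row.NAnomalies)
	if err != nil {
		panic(err)
	}
	fmt.Println(pid, n)
}
```

Deferring the decode through `json.RawMessage` keeps the reader independent of the server's `output_format_json_quote_64bit_integers` setting.
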
Cloud-side preset retention scripts on this fork's Pixie target
legacy tables (conn_stats, stack_traces, dc_snoop) that are not
present in the rev-2 schema. Even with INSTALL_PRESET_SCRIPTS=true the
operator reports 'installed=0' because the legacy scripts were already
on the cluster and our schema lacks their target tables, so the plugin
then writes nothing.

Change the install path to: purge ALL existing ClickHouse-plugin
retention scripts on the cluster, then install the operator's 12
built-in scripts (one per socket_tracer table we created DDL for).
Side benefit: removes legacy script-name strings from the cluster that
contained 'bmlv-demo-fresh-2202'-style language we no longer use.

Also adds DeleteDataRetentionScript to internal/pixie.
Pixie's cloud-side retention plugin can't reach an in-cluster CH
service (ClusterIP, no LoadBalancer / Tailnet exposure). On AOCC
this means INSTALL_PRESET_SCRIPTS=true installs scripts that run
but never deliver rows to forensic_db.<pixie_table>.

Resurrect the rev-1 path as an opt-in alternative:
- internal/pixieapi: Adapter wrapping pxapi to return flat rows.
- internal/pxl/queryfor.go: PxL generator per (table, target,
  time slice).
- internal/sink: WritePixieRows(ctx, table, rows) - JSONEachRow
  POST to /forensic_db.<table>.
- internal/controller: WithPixieQuerier + PushPixieTables config;
  on each fresh anomaly window, fan out per-table queries against
  vizier and write the results directly to CH.
- cmd/main.go: ADAPTIVE_PUSH_PIXIE_ROWS=true wires the path.

Operator-side authorisation is the same INSERT-only ingest_writer
already in use, so no extra grants needed.
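
As a rough illustration of the body a JSONEachRow POST carries, the sketch below encodes flat rows as newline-delimited JSON objects. `encodeJSONEachRow` and the sample column values are hypothetical; the actual WritePixieRows implementation is not shown here and may build its request differently.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// encodeJSONEachRow renders each row as one JSON object per line, the layout
// ClickHouse's JSONEachRow input format reads.
func encodeJSONEachRow(rows []map[string]any) ([]byte, error) {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf) // Encode writes a trailing newline per value
	for _, row := range rows {
		if err := enc.Encode(row); err != nil {
			return nil, err
		}
	}
	return buf.Bytes(), nil
}

func main() {
	// Sample rows with made-up values, for illustration only.
	body, err := encodeJSONEachRow([]map[string]any{
		{"pod": "redis/redis-foo-abc", "time_": 1},
		{"pod": "redis/redis-foo-abc", "time_": 2},
	})
	if err != nil {
		panic(err)
	}
	fmt.Print(string(body))
}
```
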
…e returns ns/pod)

Carnot's UPIDToPodNameUDF::Exec returns absl::Substitute("$0/$1", ns, name)
(metadata_ops.h:387), so df.pod values look like 'redis/redis-foo-abc'. The
operator was filtering df.pod == 'redis-foo-abc' (bare pod), which never
matched, so every per-table push query returned 0 rows and pushPixieRows
silently skipped each table. With both namespace + pod available on the
attribution row, build the comparison key as <ns>/<pod>.
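
A small sketch of the key construction described above. `podFilterKey` is a hypothetical helper name, and the pass-through for values that already contain a slash is an added assumption, not something the commit states.

```go
package main

import (
	"fmt"
	"strings"
)

// podFilterKey builds the "<namespace>/<pod>" value to compare against
// df.pod, which carries the namespace prefix.
func podFilterKey(namespace, pod string) string {
	// Assumption: a value that already has a namespace prefix is left alone.
	if namespace == "" || strings.Contains(pod, "/") {
		return pod
	}
	return namespace + "/" + pod
}

func main() {
	fmt.Println(podFilterKey("redis", "redis-foo-abc"))
}
```
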

Plus three CodeRabbit follow-ups already on the PR:
- internal/pixie: hostname suffix-match for cluster.local TLS-bypass gate
- cmd/main.go: renumber lifecycle step comments + drop dead var _ = fmt.Sprintf
- internal/sink/clickhouse_test.go: gofumpt single-decl var
…rough

Self-hosted pixie clouds (e.g. AOCC) refuse pxapi.WithAPIKey for
freshly-deployed clusters: the cloud's request_proxyer requires user-JWT
auth, not API-key auth, returning 'failed to fetch creds for cluster'
on every ExecuteScript. The cluster ID is registered, the API key has
list-vizier access, but cluster-specific passthrough is denied.

Add direct mode: when ADAPTIVE_VIZIER_DIRECT_ADDR is set, the operator
bypasses the cloud, dials vizier-query-broker-svc.pl in-cluster, and
authenticates with a freshly-minted service JWT signed by the cluster's
own jwt_signing_key (audience='vizier', 10-min expiry, mint per query).

Requires PL_JWT_SIGNING_KEY (mounted from pl-cluster-secrets) and
PX_DISABLE_TLS=1 (pxapi gates InsecureSkipVerify behind that env when
the addr is internal — service-tls-certs is self-signed in-cluster).
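
The per-query token minting could look roughly like the sketch below, which uses only the standard library. It assumes HS256 signing and a claim set limited to aud/iat/exp; the commit does not state the algorithm, and the claims vizier actually requires are likely richer. `mintServiceJWT`, `parseServiceJWT`, and `serviceClaims` are hypothetical names.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
	"time"
)

var b64 = base64.RawURLEncoding

// serviceClaims holds only the claims named in the commit message.
type serviceClaims struct {
	Aud string `json:"aud"`
	Iat int64  `json:"iat"`
	Exp int64  `json:"exp"`
}

// sign returns the base64url HMAC-SHA256 of input (HS256 is an assumption).
func sign(input string, key []byte) string {
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(input))
	return b64.EncodeToString(mac.Sum(nil))
}

// mintServiceJWT issues a token with audience "vizier" and a 10-minute expiry.
func mintServiceJWT(key []byte, nowUnix int64) (string, error) {
	claims, err := json.Marshal(serviceClaims{Aud: "vizier", Iat: nowUnix, Exp: nowUnix + 600})
	if err != nil {
		return "", err
	}
	header := b64.EncodeToString([]byte(`{"alg":"HS256","typ":"JWT"}`))
	input := header + "." + b64.EncodeToString(claims)
	return input + "." + sign(input, key), nil
}

// parseServiceJWT checks the signature and returns the decoded claims.
func parseServiceJWT(token string, key []byte) (serviceClaims, error) {
	var c serviceClaims
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return c, errors.New("malformed token")
	}
	if !hmac.Equal([]byte(parts[2]), []byte(sign(parts[0]+"."+parts[1], key))) {
		return c, errors.New("bad signature")
	}
	payload, err := b64.DecodeString(parts[1])
	if err != nil {
		return c, err
	}
	return c, json.Unmarshal(payload, &c)
}

func main() {
	// Placeholder key; the operator would read PL_JWT_SIGNING_KEY instead.
	tok, err := mintServiceJWT([]byte("example-signing-key"), time.Now().Unix())
	if err != nil {
		panic(err)
	}
	fmt.Println(tok)
}
```

Minting a fresh token per query, as the commit describes, means the 10-minute expiry never needs refresh logic.
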

Validated end-to-end on a fresh sovereignsocdemo k3s + AOCC pixie
deploy: 538 redis_events rows ingested into forensic_db on the first
fan-out after attribution row creation. Asciicast at
~/biz/PoC/OTel/casts/aw-direct-mode-pxl-fix-20260509-115455.cast.
Signed-off-by: entlein <einentlein@gmail.com>
Single-shot fan-out only captured pixie data from PEM-startup to
query-time (~30-60s). The spec requires the operator to write pixie
rows for the FULL [event_time-Before, event_time+After] window
(default ±5min) on each kubescape anomaly.

Loop pushPixieRows every PushRefreshInterval (default 30s) until the
attribution row's t_end is in the past. Each pass queries pixie for
[lastUpper, now] so already-written rows aren't duplicated. Window
extensions from concurrent kubescape events are picked up on the
next iteration.
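
The loop shape described above can be sketched as follows. `refreshLoop` and its callback parameters are hypothetical; times are plain unix seconds and the clock is injected so the example runs instantly. The real controller presumably sleeps for PushRefreshInterval in place of `wait`.

```go
package main

import "fmt"

// refreshLoop calls push(from, to) once per pass with from set to the
// previous pass's upper bound, so consecutive windows never overlap. It
// re-reads tEnd on every pass so window extensions are picked up, and stops
// once tEnd is no longer in the future.
func refreshLoop(start int64, tEnd, now func() int64, wait func(), push func(from, to int64)) {
	lastUpper := start
	for {
		upper := now()
		if upper > lastUpper {
			push(lastUpper, upper) // query only the slice not yet written
			lastUpper = upper
		}
		if upper >= tEnd() {
			return
		}
		wait()
	}
}

func main() {
	clock := int64(30) // simulated unix seconds
	refreshLoop(0,
		func() int64 { return 90 },    // attribution row's t_end
		func() int64 { return clock }, // current time
		func() { clock += 30 },        // stand-in for a 30s sleep
		func(from, to int64) { fmt.Printf("query [%d, %d]\n", from, to) })
}
```

With the simulated clock this issues three contiguous queries: [0, 30], [30, 60], and [60, 90].
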

Validated end-to-end: a 12-attack run produced a 580s span of
redis_events (≈ ATTACK ± 5min), with all timestamps consistent across
kubescape_logs.event_time / adaptive_attribution.t_start /
redis_events.time_ to within 1s.
Signed-off-by: entlein <einentlein@gmail.com>
Signed-off-by: entlein <einentlein@gmail.com>
Signed-off-by: entlein <einentlein@gmail.com>
Signed-off-by: entlein <einentlein@gmail.com>
Signed-off-by: entlein <einentlein@gmail.com>
Signed-off-by: entlein <einentlein@gmail.com>
Signed-off-by: entlein <einentlein@gmail.com>
Signed-off-by: entlein <einentlein@gmail.com>
@coderabbitai

coderabbitai Bot commented May 9, 2026

Important

Review skipped

Draft detected.


@entlein entlein temporarily deployed to pr-actions-approval May 9, 2026 22:10 — with GitHub Actions Inactive
@entlein entlein temporarily deployed to pr-actions-approval May 14, 2026 09:35 — with GitHub Actions Inactive
The sovereign_soc suite's existingVizierWorkload() path skips px deploy
and instead calls `px get cluster --id` against an already-running
Pixie deployment, then binds the cluster UUID. It's gated on
`SOC_VIZIER_EXISTING=1` and the workflow was missing that env, so every
run hit the 5-minute deploy.go:566 "Timed out waiting for cluster ID
assignment" wall (verified across attempts 75186783826, 75822144580,
75963566008).

Requires Pixie to be pre-deployed and CS_HEALTHY on the target cluster.
@entlein entlein temporarily deployed to pr-actions-approval May 14, 2026 10:14 — with GitHub Actions Inactive
…path

The SOC_VIZIER_EXISTING=1 path runs pxDeployImpl.Deploy() with
SetClusterID=true and empty Args. In run 75968884957 on k8ss-k3s-2 the
deploy completed in ~4s (auth login only — no logged output) but the
subsequent healthcheck loop spent 9 minutes failing with
"must call SetClusterID before calling NewVizierClient on Context"
(pkg/pixie/context.go:73), meaning pxCtx.clusterID stayed at the zero
UUID. The 4-second timing rules out a successful `px get cluster --id`;
the error never surfacing rules out the FromString-error path. So the
returned bytes parsed cleanly as the zero UUID (cluster never selected
in the runner's px CLI config, or the returned row is null).

This change:
- adds a SOC_VIZIER_CLUSTER_ID env override so callers can pass the
  cluster UUID directly when `px get cluster --id` cannot be made to
  pick the right cluster from the runner container's px config;
- wraps errors from `px get cluster --id` so the real failure surfaces
  in the perf_tool log instead of being lost behind the generic
  healthcheck wrapper;
- guards against the zero-UUID case explicitly — better to fail Deploy
  with a clear message than spend 10 min in a silent healthcheck loop.
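
A sketch of the override plus zero-UUID guard, using only the standard library. `resolveClusterID` is a hypothetical helper; the actual change presumably goes through the uuid.FromString path mentioned in this commit rather than hex decoding.

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"os"
	"strings"
)

// resolveClusterID prefers the env override, falls back to the CLI output,
// and rejects anything that is not a non-zero 16-byte UUID.
func resolveClusterID(override, cliOutput string) (string, error) {
	raw := strings.TrimSpace(override)
	if raw == "" {
		raw = strings.TrimSpace(cliOutput)
	}
	b, err := hex.DecodeString(strings.ReplaceAll(raw, "-", ""))
	if err != nil {
		// Wrap so the real cause surfaces in the log.
		return "", fmt.Errorf("parse cluster ID %q: %w", raw, err)
	}
	if len(b) != 16 {
		return "", fmt.Errorf("cluster ID %q is not a 16-byte UUID", raw)
	}
	for _, x := range b {
		if x != 0 {
			return raw, nil
		}
	}
	return "", errors.New("cluster ID is the zero UUID; set SOC_VIZIER_CLUSTER_ID")
}

func main() {
	// The second argument stands in for the output of `px get cluster --id`.
	id, err := resolveClusterID(os.Getenv("SOC_VIZIER_CLUSTER_ID"),
		"00000000-0000-0000-0000-000000000000")
	fmt.Println(id, err)
}
```

Failing here turns the silent healthcheck loop into an immediate, readable Deploy error.
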

Pre-existing log lines under task=Deploy are now visible so future
diagnosis doesn't require reading uuid.FromString call traces.
@entlein entlein temporarily deployed to pr-actions-approval May 14, 2026 11:38 — with GitHub Actions Inactive
@entlein entlein deployed to pr-actions-approval May 14, 2026 14:37 — with GitHub Actions Active