Entlein/adaptive write perf by entlein · Pull Request #38 · k8sstormcenter/pixie

entlein · 2026-05-09T22:10:17Z

will be deleted

…set install

- sink/clickhouse.go: CH JSONEachRow encodes UInt64 columns as quoted strings by default (output_format_json_quote_64bit_integers=1). The rehydrate query for adaptive_attribution returned pid + n_anomalies as "49927" / "47", failing json.Unmarshal into a Go uint64 field. Switch to json.RawMessage + nsFromRaw, which already handles either shape.
- cmd/main.go: log presets_from_cloud + already_on_cluster + builtin_count so 'installed=0' becomes diagnosable.

Cloud-side preset retention scripts on this fork's Pixie target legacy tables (conn_stats, stack_traces, dc_snoop) that are not present in the rev-2 schema. Even with INSTALL_PRESET_SCRIPTS=true the operator reported 'installed=0', because the legacy scripts were already on the cluster and our schema lacks their target tables, so the plugin wrote nothing. Change the install path to: purge ALL existing ClickHouse-plugin retention scripts on the cluster, then install the operator's 12 built-in scripts (one per socket_tracer table for which we created DDL). Side benefit: removes legacy script-name strings from the cluster that contained 'bmlv-demo-fresh-2202'-style language we no longer use. Also adds DeleteDataRetentionScript to internal/pixie.

Pixie's cloud-side retention plugin can't reach an in-cluster CH service (ClusterIP, no LoadBalancer / Tailnet exposure). On AOCC this means INSTALL_PRESET_SCRIPTS=true installs scripts that run but never deliver rows to forensic_db.<pixie_table>. Resurrect the rev-1 path as an opt-in alternative:

- internal/pixieapi: Adapter wrapping pxapi to return flat rows.
- internal/pxl/queryfor.go: PxL generator per (table, target, time slice).
- internal/sink: WritePixieRows(ctx, table, rows), a JSONEachRow POST to /forensic_db.<table>.
- internal/controller: WithPixieQuerier + PushPixieTables config; on each fresh anomaly window, fan out per-table queries against vizier and write the results directly to CH.
- cmd/main.go: ADAPTIVE_PUSH_PIXIE_ROWS=true wires the path.

Operator-side authorisation is the same INSERT-only ingest_writer already in use, so no extra grants needed.
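As an illustration of the sink step, here is a rough sketch of how flat rows could be encoded as JSONEachRow and wrapped in an INSERT request. It is not the PR's WritePixieRows: `encodeJSONEachRow` and `newInsertRequest` are hypothetical names, the base URL is a placeholder, and the request shape follows ClickHouse's general HTTP interface (query passed as a URL parameter) rather than whatever path the operator actually uses.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// encodeJSONEachRow writes one JSON object per line, which is the
// JSONEachRow input format. json.Encoder appends '\n' after each value.
func encodeJSONEachRow(rows []map[string]any) ([]byte, error) {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf)
	for _, row := range rows {
		if err := enc.Encode(row); err != nil {
			return nil, err
		}
	}
	return buf.Bytes(), nil
}

// newInsertRequest builds (but does not send) the POST. The table name is
// assumed to come from a fixed allow-list, never from user input.
func newInsertRequest(baseURL, table string, rows []map[string]any) (*http.Request, error) {
	body, err := encodeJSONEachRow(rows)
	if err != nil {
		return nil, err
	}
	q := url.Values{}
	q.Set("query", "INSERT INTO forensic_db."+table+" FORMAT JSONEachRow")
	return http.NewRequest(http.MethodPost, baseURL+"/?"+q.Encode(), bytes.NewReader(body))
}

func main() {
	rows := []map[string]any{
		{"time_": 1, "req_cmd": "GET"}, // illustrative columns only
		{"time_": 2, "req_cmd": "SET"},
	}
	body, err := encodeJSONEachRow(rows)
	if err != nil {
		panic(err)
	}
	fmt.Print(string(body))

	req, err := newInsertRequest("http://clickhouse.example:8123", "redis_events", rows)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.Query().Get("query"))
}
```

Sending the request with credentials for the INSERT-only user is left out here; the sketch only shows the payload and query shape.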

…e returns ns/pod)

Carnot's UPIDToPodNameUDF::Exec returns absl::Substitute("$0/$1", ns, name) (metadata_ops.h:387), so df.pod values look like 'redis/redis-foo-abc'. The operator was filtering df.pod == 'redis-foo-abc' (bare pod), which never matched, so every per-table push query returned 0 rows and pushPixieRows silently skipped each table. With both namespace + pod available on the attribution row, build the comparison key as <ns>/<pod>.

Plus three CodeRabbit follow-ups already on the PR:

- internal/pixie: hostname suffix-match for cluster.local TLS-bypass gate
- cmd/main.go: renumber lifecycle step comments + drop dead var _ = fmt.Sprintf
- internal/sink/clickhouse_test.go: gofumpt single-decl var
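The <ns>/<pod> comparison key from the fix above can be sketched as follows. Both `podKey` and `podFilterLine` are hypothetical helpers, and the generated PxL line assumes the script has already populated a df.pod column, as the commit message implies.

```go
package main

import (
	"fmt"
	"strings"
)

// podKey builds the "<namespace>/<pod>" form that df.pod carries.
// A value that already contains a slash is returned unchanged.
func podKey(namespace, pod string) string {
	if strings.Contains(pod, "/") {
		return pod
	}
	return namespace + "/" + pod
}

// podFilterLine renders a PxL filter on df.pod. Kubernetes names cannot
// contain quotes, so no escaping is attempted in this sketch.
func podFilterLine(namespace, pod string) string {
	return fmt.Sprintf("df = df[df.pod == '%s']", podKey(namespace, pod))
}

func main() {
	fmt.Println(podFilterLine("redis", "redis-foo-abc"))
}
```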

…rough Self-hosted pixie clouds (e.g. AOCC) refuse pxapi.WithAPIKey for freshly-deployed clusters: the cloud's request_proxyer requires user-JWT auth, not API-key auth, returning 'failed to fetch creds for cluster' on every ExecuteScript. The cluster ID is registered, the API key has list-vizier access, but cluster-specific passthrough is denied. Add direct mode: when ADAPTIVE_VIZIER_DIRECT_ADDR is set, the operator bypasses the cloud, dials vizier-query-broker-svc.pl in-cluster, and authenticates with a freshly-minted service JWT signed by the cluster's own jwt_signing_key (audience='vizier', 10-min expiry, mint per query). Requires PL_JWT_SIGNING_KEY (mounted from pl-cluster-secrets) and PX_DISABLE_TLS=1 (pxapi gates InsecureSkipVerify behind that env when the addr is internal — service-tls-certs is self-signed in-cluster). Validated end-to-end on a fresh sovereignsocdemo k3s + AOCC pixie deploy: 538 redis_events rows ingested into forensic_db on the first fan-out after attribution row creation. Asciicast at ~/biz/PoC/OTel/casts/aw-direct-mode-pxl-fix-20260509-115455.cast.

Signed-off-by: entlein <einentlein@gmail.com>

Single-shot fan-out only captured pixie data from PEM-startup to query-time (~30-60s). The spec requires the operator to write pixie rows for the FULL [event_time-Before, event_time+After] window (default ±5min) on each kubescape anomaly. Loop pushPixieRows every PushRefreshInterval (default 30s) until the attribution row's t_end is in the past. Each pass queries pixie for [lastUpper, now] so already-written rows aren't duplicated. Window extensions from concurrent kubescape events are picked up on the next iteration. Validated end-to-end: a 12-attack run produced 580s span of redis_events (≈ ATTACK ± 5min), all timestamps consistent across kubescape_logs.event_time / adaptive_attribution.t_start / redis_events.time_ within 1s.

Signed-off-by: entlein <einentlein@gmail.com>

coderabbitai · 2026-05-09T22:10:24Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 087ec150-85a1-4873-805d-4ac940a90d8c

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


The sovereign_soc suite's existingVizierWorkload() path skips px deploy and instead calls `px get cluster --id` against an already-running Pixie deployment, then binds the cluster UUID. It's gated on `SOC_VIZIER_EXISTING=1` and the workflow was missing that env, so every run hit the 5-minute deploy.go:566 "Timed out waiting for cluster ID assignment" wall (verified across attempts 75186783826, 75822144580, 75963566008). Requires Pixie to be pre-deployed and CS_HEALTHY on the target cluster.

…path

The SOC_VIZIER_EXISTING=1 path runs pxDeployImpl.Deploy() with SetClusterID=true and empty Args. In run 75968884957 on k8ss-k3s-2 the deploy completed in ~4s (auth login only, no logged output), but the subsequent healthcheck loop spent 9 minutes failing with "must call SetClusterID before calling NewVizierClient on Context" (pkg/pixie/context.go:73), meaning pxCtx.clusterID stayed at the zero UUID. The 4-second timing rules out a successful `px get cluster --id`; the fact that no error surfaced rules out the FromString-error path. So the returned bytes parsed cleanly as the zero UUID (the cluster was never selected in the runner's px CLI config, or the returned row is null).

This change:

- adds a SOC_VIZIER_CLUSTER_ID env override so callers can pass the cluster UUID directly when `px get cluster --id` cannot be made to pick the right cluster from the runner container's px config;
- wraps errors from `px get cluster --id` so the real failure surfaces in the perf_tool log instead of being lost behind the generic healthcheck wrapper;
- guards against the zero-UUID case explicitly; it is better to fail Deploy with a clear message than to spend 10 min in a silent healthcheck loop.

Pre-existing log lines under task=Deploy are now visible, so future diagnosis doesn't require reading uuid.FromString call traces.

ddelnano and others added 30 commits April 12, 2026 18:57

Changes needed to get clickhouse e2e test working with external click…

86397a8

…house Signed-off-by: Dom Del Nano <ddelnano@gmail.com>

Implement parquet export format

f590005

Signed-off-by: Dom Del Nano <ddelnano@gmail.com>

Allow prometheus recorders to specifiy different kubeconfig or kubeco…

3510794

…ntext Signed-off-by: Dom Del Nano <ddelnano@gmail.com>

Fix parquet file overflow bug

5a8fb65

Signed-off-by: Dom Del Nano <ddelnano@gmail.com>

Add duck db wasm visualization file

17188d5

Signed-off-by: Dom Del Nano <ddelnano@gmail.com>

Temporary changes to make load testing easier

63f7d5f

Signed-off-by: Dom Del Nano <ddelnano@gmail.com>

Add clickhouse perf_tool suite, ability to query cross kubeconfig/kub…

839af02

…e context prometheus backends and test out the write clickhouse experiment Signed-off-by: Dom Del Nano <ddelnano@gmail.com>

Ensure px delete works with external k8s ApiService

06a8d3a

Signed-off-by: Dom Del Nano <ddelnano@gmail.com>

Add github workflow for perf clickhouse suite

1f9c121

Signed-off-by: Dom Del Nano <ddelnano@gmail.com>

Ignore non alphabetic characters in the service account json

5ecab7c

Signed-off-by: Dom Del Nano <ddelnano@gmail.com>

Add tailscale debugging info for perf workflow

5112a10

Signed-off-by: Dom Del Nano <ddelnano@gmail.com>

Initial sovereign_soc suite, which segfaults kelvin on first run

bb80ebb

Signed-off-by: Dom Del Nano <ddelnano@gmail.com>

Fix segfault issues, but fails with missing alerts clickhouse table

f1302fd

Signed-off-by: Dom Del Nano <ddelnano@gmail.com>

Add --skaffold_stderr_file to perf_tool to ease github workflow debug…

cf29e2b

…ging Signed-off-by: Dom Del Nano <ddelnano@gmail.com>

Add x86_64_sysroot in profile

026e3eb

Signed-off-by: Dom Del Nano <ddelnano@gmail.com>

Don't use verbose logging

6dd6107

Signed-off-by: Dom Del Nano <ddelnano@gmail.com>

Remove verbosity flag that was missed

267ea25

Signed-off-by: Dom Del Nano <ddelnano@gmail.com>

fix protocol_loadtest build

5c0eb9f

Signed-off-by: Dom Del Nano <ddelnano@gmail.com>

Install the px cli

d9b9adc

Signed-off-by: Dom Del Nano <ddelnano@gmail.com>

Use correct cloud

78f2853

Signed-off-by: Dom Del Nano <ddelnano@gmail.com>

Reduce test time

eb1abb3

Signed-off-by: Dom Del Nano <ddelnano@gmail.com>

Get redis-attack experiment working

dfcf602

Signed-off-by: Dom Del Nano <ddelnano@gmail.com>

Add perf github action for soc attack

1d6ad69

Signed-off-by: Dom Del Nano <ddelnano@gmail.com>

Don't let cronjobs fail the build

7cf848f

Signed-off-by: Dom Del Nano <ddelnano@gmail.com>

Only attempt job once

1790956

Signed-off-by: Dom Del Nano <ddelnano@gmail.com>

experiment with the adaptive feature

8af6f8a

Signed-off-by: entlein <einentlein@gmail.com>

settings for lab as default

756d88d

Signed-off-by: entlein <einentlein@gmail.com>

not sure about the scheduler annotations, but the main.go now sets th…

09e28ba

…e schema_creation option , probably needs complete rewrite, this is to make it work e2e for the lab now Signed-off-by: entlein <einentlein@gmail.com>

address linting issues 1

e2e124b

Signed-off-by: entlein <einentlein@gmail.com>

pinning trivvy to higher version

b7b0389

Signed-off-by: entlein <einentlein@gmail.com>

Entlein and others added 19 commits May 9, 2026 22:02

adaptive_export/cmd: add internal/script bazel dep for builtin presets

428a2aa

adaptive_export/cmd: log cluster + preset script names on install

7a88b4f

adaptive_export/cmd: skip dotted-name tables from push list (PxL limi…

7e4b786

…tation)

adaptive_export/controller: instrument pushPixieRows + per-query timeout

b8a90ca

addressing the rabbit2

98ac1f0

Signed-off-by: entlein <einentlein@gmail.com>

addressing the rabbit3

e4329d1

Signed-off-by: entlein <einentlein@gmail.com>

addressing the rabbit4

9f91360

Signed-off-by: entlein <einentlein@gmail.com>

addressing the rabbit5

bb11514

Signed-off-by: entlein <einentlein@gmail.com>

addressing the rabbit6

feb3a03

Signed-off-by: entlein <einentlein@gmail.com>

addressing the rabbit7

e84cbac

Signed-off-by: entlein <einentlein@gmail.com>

addressing the rabbit8

b599e77

Signed-off-by: entlein <einentlein@gmail.com>

addressing the rabbit9

b386ce8

Signed-off-by: entlein <einentlein@gmail.com>

addressing the rabbit10

9b74bc7

Signed-off-by: entlein <einentlein@gmail.com>

entlein temporarily deployed to pr-actions-approval May 9, 2026 22:10 — with GitHub Actions Inactive

fix perf soc eval test

833c5e5

entlein temporarily deployed to pr-actions-approval May 14, 2026 09:35 — with GitHub Actions Inactive

entlein temporarily deployed to pr-actions-approval May 14, 2026 10:14 — with GitHub Actions Inactive

entlein temporarily deployed to pr-actions-approval May 14, 2026 11:38 — with GitHub Actions Inactive

entlein added 2 commits May 14, 2026 14:32

adding load test yamls

1d6a93e

Merge branch 'entlein/adaptive-write-perf' of https://github.com/k8ss…

8083eeb

…tormcenter/pixie into entlein/adaptive-write-perf

entlein deployed to pr-actions-approval May 14, 2026 14:37 — with GitHub Actions Active

entlein wants to merge 66 commits into main from entlein/adaptive-write-perf

2 participants
