Entlein/adaptive write perf#38
Draft
entlein wants to merge 66 commits into
Conversation
…house Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
…ntext Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
…e context prometheus backends and test out the write clickhouse experiment Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
…ging Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
Signed-off-by: entlein <einentlein@gmail.com>
…e schema_creation option; probably needs a complete rewrite, but this makes it work e2e for the lab for now. Signed-off-by: entlein <einentlein@gmail.com>
…set install
- sink/clickhouse.go: ClickHouse JSONEachRow encodes UInt64 columns as quoted strings by default (output_format_json_quote_64bit_integers=1). The rehydrate query for adaptive_attribution returned pid and n_anomalies as "49927" / "47", which fails json.Unmarshal into a Go uint64 field. Switch to json.RawMessage + nsFromRaw, which already handles either shape.
- cmd/main.go: log presets_from_cloud, already_on_cluster, and builtin_count so 'installed=0' becomes diagnosable.
Cloud-side preset retention scripts on this fork's Pixie target legacy tables (conn_stats, stack_traces, dc_snoop) that are not present in the rev-2 schema. Even with INSTALL_PRESET_SCRIPTS=true the operator reported installed=0 because the legacy scripts were already on the cluster and our schema lacks their target tables, so the plugin wrote nothing. Change the install path to: purge ALL existing ClickHouse-plugin retention scripts on the cluster, then install the operator's 12 built-in scripts (one per socket_tracer table we created DDL for). Side benefit: this removes legacy script-name strings from the cluster that contained 'bmlv-demo-fresh-2202'-style language we no longer use. Also adds DeleteDataRetentionScript to internal/pixie.
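The purge-then-reinstall flow can be sketched as below. Everything here is a stand-in: `RetentionOps`, the `"clickhouse"` plugin tag, and the script shape are assumptions; the real code uses internal/pixie's client, including the new DeleteDataRetentionScript.

```go
// Hedged sketch of the purge-then-reinstall install path; types and field
// names are hypothetical, not the repository's actual API.
package main

import "fmt"

type Script struct {
	ID     string
	Plugin string
}

// RetentionOps bundles the cloud API calls this sketch needs; function fields
// keep it easy to fake in tests.
type RetentionOps struct {
	List   func() ([]Script, error)
	Delete func(id string) error
	Create func(Script) error
}

// reinstallBuiltins removes every ClickHouse-plugin script already on the
// cluster, then installs the operator's built-in set, so stale scripts
// targeting legacy tables can no longer short-circuit the install with
// installed=0.
func reinstallBuiltins(ops RetentionOps, builtins []Script) (int, error) {
	existing, err := ops.List()
	if err != nil {
		return 0, fmt.Errorf("list retention scripts: %w", err)
	}
	for _, s := range existing {
		if s.Plugin != "clickhouse" {
			continue // leave other plugins' scripts alone
		}
		if err := ops.Delete(s.ID); err != nil {
			return 0, fmt.Errorf("delete script %q: %w", s.ID, err)
		}
	}
	installed := 0
	for _, b := range builtins {
		if err := ops.Create(b); err != nil {
			return installed, fmt.Errorf("install %q: %w", b.ID, err)
		}
		installed++
	}
	return installed, nil
}

func main() {
	ops := RetentionOps{
		List:   func() ([]Script, error) { return []Script{{ID: "legacy-conn-stats", Plugin: "clickhouse"}}, nil },
		Delete: func(id string) error { fmt.Println("purged", id); return nil },
		Create: func(s Script) error { return nil },
	}
	n, err := reinstallBuiltins(ops, make([]Script, 12))
	fmt.Println(n, err)
}
```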
Pixie's cloud-side retention plugin can't reach an in-cluster ClickHouse service (ClusterIP, no LoadBalancer / Tailnet exposure). On AOCC this means INSTALL_PRESET_SCRIPTS=true installs scripts that run but never deliver rows to forensic_db.<pixie_table>. Resurrect the rev-1 path as an opt-in alternative:
- internal/pixieapi: adapter wrapping pxapi to return flat rows.
- internal/pxl/queryfor.go: PxL generator per (table, target, time slice).
- internal/sink: WritePixieRows(ctx, table, rows), a JSONEachRow POST to /forensic_db.<table>.
- internal/controller: WithPixieQuerier + PushPixieTables config; on each fresh anomaly window, fan out per-table queries against vizier and write the results directly to ClickHouse.
- cmd/main.go: ADAPTIVE_PUSH_PIXIE_ROWS=true wires the path.
Operator-side authorisation is the same INSERT-only ingest_writer already in use, so no extra grants are needed.
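A sketch of the JSONEachRow push, assuming ClickHouse's standard HTTP interface (`POST /?query=INSERT INTO … FORMAT JSONEachRow` with one JSON object per line); the function signature mirrors the WritePixieRows name above but is otherwise illustrative:

```go
// Hedged sketch of the direct JSONEachRow push; the real WritePixieRows lives
// in internal/sink and its exact signature may differ.
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// encodeJSONEachRow renders one JSON object per line, the framing that
// ClickHouse's JSONEachRow input format expects.
func encodeJSONEachRow(rows []map[string]any) ([]byte, error) {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf) // Encode appends a newline after each row.
	for _, r := range rows {
		if err := enc.Encode(r); err != nil {
			return nil, err
		}
	}
	return buf.Bytes(), nil
}

// WritePixieRows POSTs the rows to ClickHouse's HTTP interface as
// INSERT INTO forensic_db.<table> FORMAT JSONEachRow.
func WritePixieRows(ctx context.Context, chBase, table string, rows []map[string]any) error {
	body, err := encodeJSONEachRow(rows)
	if err != nil {
		return err
	}
	q := url.Values{"query": {fmt.Sprintf("INSERT INTO forensic_db.%s FORMAT JSONEachRow", table)}}
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, chBase+"/?"+q.Encode(), bytes.NewReader(body))
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("clickhouse insert %s: status %d", table, resp.StatusCode)
	}
	return nil
}

func main() {
	b, _ := encodeJSONEachRow([]map[string]any{{"time_": 1, "cmd": "GET"}})
	fmt.Print(string(b))
}
```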
…e returns ns/pod)
Carnot's UPIDToPodNameUDF::Exec returns absl::Substitute("$0/$1", ns, name)
(metadata_ops.h:387), so df.pod values look like 'redis/redis-foo-abc'. The
operator was filtering df.pod == 'redis-foo-abc' (bare pod), which never
matched, so every per-table push query returned 0 rows and pushPixieRows
silently skipped each table. With both namespace + pod available on the
attribution row, build the comparison key as <ns>/<pod>.
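The key-building fix above is tiny; a sketch (helper name is hypothetical):

```go
// Sketch: podKey is a hypothetical helper name. Carnot's UPIDToPodNameUDF
// returns absl::Substitute("$0/$1", ns, name), so the filter key must be
// namespace-qualified to ever match df.pod.
package main

import "fmt"

// podKey builds the "<ns>/<pod>" key that matches df.pod values.
func podKey(namespace, pod string) string {
	return namespace + "/" + pod
}

func main() {
	dfPod := "redis/redis-foo-abc" // shape Carnot returns
	// Bare "redis-foo-abc" never matches; the qualified key does.
	fmt.Println(dfPod == podKey("redis", "redis-foo-abc"))
}
```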
Plus three CodeRabbit follow-ups already on the PR:
- internal/pixie: hostname suffix-match for cluster.local TLS-bypass gate
- cmd/main.go: renumber lifecycle step comments + drop dead var _ = fmt.Sprintf
- internal/sink/clickhouse_test.go: gofumpt single-decl var
…rough
Self-hosted Pixie clouds (e.g. AOCC) refuse pxapi.WithAPIKey for freshly-deployed clusters: the cloud's request_proxyer requires user-JWT auth, not API-key auth, returning 'failed to fetch creds for cluster' on every ExecuteScript. The cluster ID is registered and the API key has list-vizier access, but cluster-specific passthrough is denied.

Add direct mode: when ADAPTIVE_VIZIER_DIRECT_ADDR is set, the operator bypasses the cloud, dials vizier-query-broker-svc.pl in-cluster, and authenticates with a freshly-minted service JWT signed by the cluster's own jwt_signing_key (audience='vizier', 10-min expiry, minted per query). Requires PL_JWT_SIGNING_KEY (mounted from pl-cluster-secrets) and PX_DISABLE_TLS=1 (pxapi gates InsecureSkipVerify behind that env when the addr is internal; service-tls-certs is self-signed in-cluster).

Validated end-to-end on a fresh sovereignsocdemo k3s + AOCC Pixie deploy: 538 redis_events rows ingested into forensic_db on the first fan-out after attribution row creation. Asciicast at ~/biz/PoC/OTel/casts/aw-direct-mode-pxl-fix-20260509-115455.cast.
Single-shot fan-out only captured pixie data from PEM startup to query time (~30-60s). The spec requires the operator to write pixie rows for the FULL [event_time-Before, event_time+After] window (default ±5min) on each kubescape anomaly.

Loop pushPixieRows every PushRefreshInterval (default 30s) until the attribution row's t_end is in the past. Each pass queries pixie for [lastUpper, now] so already-written rows aren't duplicated. Window extensions from concurrent kubescape events are picked up on the next iteration.

Validated end-to-end: a 12-attack run produced a 580s span of redis_events (≈ ATTACK ± 5min), with all timestamps consistent across kubescape_logs.event_time / adaptive_attribution.t_start / redis_events.time_ within 1s.
The sovereign_soc suite's existingVizierWorkload() path skips px deploy and instead calls `px get cluster --id` against an already-running Pixie deployment, then binds the cluster UUID. It's gated on `SOC_VIZIER_EXISTING=1`, and the workflow was missing that env, so every run hit the 5-minute "Timed out waiting for cluster ID assignment" wall at deploy.go:566 (verified across attempts 75186783826, 75822144580, 75963566008). Requires Pixie to be pre-deployed and CS_HEALTHY on the target cluster.
…path
The SOC_VIZIER_EXISTING=1 path runs pxDeployImpl.Deploy() with SetClusterID=true and empty Args. In run 75968884957 on k8ss-k3s-2 the deploy completed in ~4s (auth login only, no logged output) but the subsequent healthcheck loop spent 9 minutes failing with "must call SetClusterID before calling NewVizierClient on Context" (pkg/pixie/context.go:73), meaning pxCtx.clusterID stayed at the zero UUID. The 4-second timing rules out a successful `px get cluster --id`; the error never surfacing rules out the FromString-error path. So the returned bytes parsed cleanly as the zero UUID (either the cluster was never selected in the runner's px CLI config, or the returned row is null). This change:
- adds a SOC_VIZIER_CLUSTER_ID env override so callers can pass the cluster UUID directly when `px get cluster --id` cannot be made to pick the right cluster from the runner container's px config;
- wraps errors from `px get cluster --id` so the real failure surfaces in the perf_tool log instead of being lost behind the generic healthcheck wrapper;
- guards against the zero-UUID case explicitly: better to fail Deploy with a clear message than spend 10 minutes in a silent healthcheck loop.
Pre-existing log lines under task=Deploy are now visible, so future diagnosis doesn't require reading uuid.FromString call traces.
…tormcenter/pixie into entlein/adaptive-write-perf