YC's Paxel tells you how you build with AI — by shipping your coding-agent transcripts off your machine to do it. Per YC's own description, it runs a Docker container that mounts your home directory, sends transcript excerpts — your prompts, the agent's replies, tool-call snippets — to an LLM proxy, and uploads a JSON of scores, narratives, and session metadata to YC, readable by any YC employee and retained indefinitely. The redaction strips credentials — not your source code, customer data, secrets, or unreleased ideas.
paxel-local gives you the same profile with zero data leaving your machine — no upload, no proxy, no account, no network calls at all. One command reads your local transcripts and writes a branded, shareable builder profile:

    git clone https://github.com/Photobombastic/paxel-local
    cd paxel-local && python3 paxel.py   # reads your local transcripts; opens your profile

Clone, run, done: your profile pops open in the browser with an archetype, a gstack-grounded scorecard, your signature moves, and a one-click shareable poster like the one above. Nothing leaves the laptop. (That poster up top is my own real profile, generated locally and public only because I chose to share it.)
paxel-local is not Paxel itself, and it can't be: Paxel is closed-source (a curl | bash → proprietary Docker image).
This is a functional recreation: it reproduces the metric set Paxel advertises (builder
archetype, autonomy score, planning ratio, code velocity, tool diversity, work-hour
distribution, error recovery, iteration depth, standout traits) using its own reasonable
formulas — Paxel's exact algorithm isn't published. Same input (your local transcripts),
same experience, not byte-for-byte parity.
One command emits a complete, branded, shareable profile.html — open it in a browser and you get:
- An archetype — the builder you are (Architect, Brute-Force Architect, Velocity Machine, Quality Guardian, The Director, …), named from your sessions.
- A 0–10 scorecard across three dimensions (Execution, Planning, Engineering), each grounded in gstack (below), plus a plain-language read on how you steer agents (long leash vs. short leash), which we describe instead of grade (the reason is explained below).
- Your signature moves — the decision-patterns in how you direct the AI ("you review more than you write", "plan wide, then grind narrow"), drawn from real session behavior and tagged to the gstack stage each expresses.
- Your growth edge — a few specific things to try next, keyed to your own weakest signals and the gstack skill that addresses each — not generic advice.
- A "what we noticed" card grid + Share buttons (Post on X / Copy caption / Download a branded poster image — archetype, scorecard, headline numbers and your highlight cards in one PNG that works in every browser).
No manual step, no LLM call required. The archetype, scores, signature moves, and growth edges all
come from transparent local rule engines (compute_scores / pick_archetype / signature_moves
/ growth_edges in paxel.py) over the measured metrics — Paxel's real algorithm is closed, so this
is a reasoned estimate, not a replica. The counts are measured and reproducible; the verdicts are an
opinion, and the report says so. Nothing quotes your raw prompt text, so the profile stays shareable
without leaking session content.
Want a richer, prose narrative? narrative_input.md is also written — paste it into your own
Claude/GPT and it'll write you a deeper profile locally. That's optional; the HTML stands alone.
What the scores are. The counts are measured from your real sessions; the three scores are a read on your style: how you work, grounded in gstack, not a ranking of how good an engineer you are. We learned this the hard way: a fourth axis, Steering, used to be graded as `(15 − actions_per_prompt)`, which runs backwards, scoring a more autonomous engineer lower (a top agentic engineer ran it and got a 1/10). You don't fix a backwards gauge by writing "this may read backwards" underneath it. So Steering is no longer scored; it's described: we just state how you run agents (long leash vs. short leash) as a fact, with no good-or-bad end.
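To make the backwards reading concrete, here is a minimal sketch of the retired formula. Only the `15 − actions_per_prompt` term comes from the text above; the clamp to the 0–10 range and the two sample cadences are assumptions for illustration.

```python
def old_steering_score(actions_per_prompt: float) -> float:
    """Retired Steering formula: 15 - actions_per_prompt.

    The clamp to 0..10 is an assumption made for this sketch.
    """
    return max(0.0, min(10.0, 15.0 - actions_per_prompt))


# Hypothetical cadences: a hands-on builder vs. a hands-off one.
short_leash = old_steering_score(6.0)   # agent takes ~6 actions per prompt
long_leash = old_steering_score(14.0)   # agent takes ~14 actions per prompt

print(short_leash, long_leash)  # 9.0 1.0
```

The more the agent does per prompt, the lower the number, which is exactly the direction that penalizes autonomy.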
The three scored axes aren't a rubric we invented in a vacuum. Each one is derived from Garry Tan's gstack — his open-source framework that turns Claude Code into a virtual engineering team. gstack and YC's Paxel both come out of Garry-Tan-world, so grounding the scores in gstack's actual definitions of good building is plausibly closer to what Paxel itself grades against — and it's a more honest story than a number we made up.
We then audited the rubric by running the real gstack skills on it — /plan-eng-review,
/plan-ceo-review, and /review, dispatched as independent subagents so the tool's author wasn't
grading their own work. That audit hardened the design: each metric is now owned by exactly one
axis (so no two axes secretly move together), and a fifth "Product Instinct" axis — which Paxel has
— was cut, because the review showed it was mostly skill-detection plus terms recycled from other
axes. Coding transcripts don't honestly reveal product judgment, so we don't fake a score for it.
A later validity pass went one further and demoted Steering from a scored axis to a described one
(above): hands-on cadence is real and measurable, but it has no better-or-worse end, so grading it
only ever produced a backwards number. Three honest scores beat four where one points the wrong way.
gstack frames building as a sprint — Think → Plan → Build → Review → Test → Ship → Reflect — on top of three ethos pillars: Boil the Lake (completeness is cheap with AI, so do the complete thing), Search Before Building (know what exists before you build it), and User Sovereignty (AI recommends, the human decides — and per Anthropic's own research that gstack cites, experts interrupt more, not less). Each axis maps a slice of that framework onto the metrics paxel can honestly measure from transcripts:
| Axis | What it measures | Grounded in |
|---|---|---|
| Execution | Shipped output at AI leverage: committed-code rate (coverage-corrected, ≤1.4×, disclosed in report.md), fidelity (how much of what you generate actually lands in git), and delegation/parallelism | gstack's Build phase + the "Golden Age" ethos (one builder shipping like a team) |
| Planning | Think-before-build: exploring before writing, reasoning depth, and plan/spec ceremony | gstack's Think + Plan phases + "Search Before Building" |
| Engineering | Craft & low rework: getting files right early, little file-thrash, low error rate, and review/test/investigate discipline | "Boil the Lake" + the Review / Test / Reflect stages |
Steering is described, not scored. Below the three bars you'll see a one-line read — long leash (you point the agent and let it run) → short leash (you stay close and course-correct) — with the raw cadence (actions per turn, how often the agent checked in). It maps to gstack's User Sovereignty pillar, but we don't put a 0–10 on it: a deliberate hands-off operator who delegates and gets clean autonomous output back is steering by a mechanism transcripts can't see (it needs delegation→survived-to-commit attribution — on the roadmap), so any grade would just punish autonomy.
Signature moves (signature_moves) and the growth edge (growth_edges) are the same idea applied
to prose: named decision-patterns and next-steps, each gated on a real threshold (we never pad), tied to
a gstack stage, and keyed to your numbers. The growth edge points at the gstack skill that closes your
weakest gap — e.g. review ≫ test → /qa; file-hammering → /investigate; thin planning → /autoplan.
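A threshold-gated rule engine of this kind could look roughly like the sketch below. The metric names and cut-offs are hypothetical and are not taken from `growth_edges` in paxel.py; only the three pattern → skill pairings come from the text above.

```python
from typing import Dict, List


def sketch_growth_edges(metrics: Dict[str, float]) -> List[str]:
    """Return gstack skill suggestions, emitting a rule only when its gate is met.

    All keys and thresholds here are illustrative assumptions.
    """
    edges = []
    # review >> test: plenty of review activity but few detected test runs
    if metrics.get("review_runs", 0) >= 5 * max(metrics.get("test_runs", 0), 1):
        edges.append("/qa")
    # file-hammering: one file edited far more often than is typical
    if metrics.get("max_edits_per_file", 0) >= 50:
        edges.append("/investigate")
    # thin planning: little exploration before the first write
    if metrics.get("planning_ratio", 1.0) < 0.1:
        edges.append("/autoplan")
    return edges  # may be empty: nothing is padded when no gate is crossed


print(sketch_growth_edges({"review_runs": 20, "test_runs": 1}))  # ['/qa']
```

Returning an empty list when no gate fires is the point of the design: a profile with no weak signal gets no invented advice.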
How the criteria were built: one subagent per axis read the real gstack role/skill definitions
(office-hours, autoplan, plan-ceo-review, review, qa, investigate, ship, retro, …)
and the ethos, derived that axis's notion of "good," then mapped it onto paxel's available metrics —
and a later round of gstack-skill audits hardened it (above). Every term is transparent, clamped 0–1
against a justified target, and weighted to sum to 1.0 — read compute_scores in paxel.py; nothing
is hidden. Honest limits we don't paper over: paxel can't see test coverage from transcripts —
it detects test runs (named test skills and shell runners like pytest / go test /
npm test), so if you test some other way it can't see, the growth edge says so rather than claiming
"0 tests"; the git-vs-tool fidelity signal is noisier when git only sees some of your repos; and
Engineering's iteration signals only see Edit/Write work, not files you rewrite purely through the
shell. Scores are an opinion; the counts underneath are fact.
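As a rough illustration of "clamped 0–1 against a target, weighted to sum to 1.0", the sketch below combines three terms into one 0–10 axis score. The term names, targets, and weights are invented for the example, and dividing by the target before clamping is an assumed reading; the real values live in `compute_scores`.

```python
def clamp01(x: float) -> float:
    """Clamp a value into the closed interval [0, 1]."""
    return max(0.0, min(1.0, x))


def axis_score(terms):
    """Combine (value, target, weight) terms into a 0..10 score.

    Each term contributes weight * clamp01(value / target).
    """
    total_weight = sum(w for _, _, w in terms)
    assert abs(total_weight - 1.0) < 1e-9, "weights must sum to 1.0"
    return 10.0 * sum(w * clamp01(v / t) for v, t, w in terms)


# Hypothetical Planning terms: (measured value, target, weight)
planning = [
    (0.30, 0.40, 0.5),  # explore-before-write ratio vs. a 0.40 target
    (12.0, 8.0, 0.3),   # reasoning depth; over target, so it clamps to 1.0
    (1.0, 4.0, 0.2),    # plan/spec ceremony count vs. a target of 4
]
print(round(axis_score(planning), 2))  # 7.25
```

Because every term is clamped, overshooting one target cannot compensate for missing another beyond that term's own weight.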
Auto-detected and parsed (all reads local):
| Tool | Location | Status |
|---|---|---|
| Claude Code | ~/.claude/projects | full |
| Codex CLI | ~/.codex/sessions | full |
| Gemini CLI | ~/.gemini/tmp | full |
| Pi | ~/.pi/agent/sessions | full |
| opencode | ~/.local/share/opencode/storage | full |
| Cursor | state.vscdb + ~/.cursor/projects/.../agent-transcripts | full (SQLite-first + JSONL, deduped) |
Non-Claude formats are translated into a common event shape so every metric works across tools (Claude/Pi-specific signals like skills/subagents/thinking are naturally richer).

    python3 paxel.py          # all detected sources → writes profile.html (+ report.md, stats.json)
    python3 paxel.py claude   # restrict to one (or several) sources, e.g. just Claude Code

It then opens profile.html in your browser automatically when it finishes (pass
`--no-open` to skip, which is useful on a headless box or in CI, where it just prints the path).
No dependencies beyond the Python 3 standard library. No network calls anywhere. For
accurate churn it shells out to the local git CLI (git log --numstat) on the repos found
in your transcripts — still 100% on-device, nothing uploaded.
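For readers curious what reading `git log --numstat` involves, here is a minimal sketch. It is not paxel's implementation: the function names and the 90-day window are assumptions, and it omits the author filtering and repo deduplication a real pass would need.

```python
import subprocess
from typing import Tuple


def parse_numstat(output: str) -> Tuple[int, int]:
    """Sum added/deleted line counts from `git log --numstat` output."""
    added = deleted = 0
    for line in output.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # blank separator lines between commits
        a, d, _path = parts
        if a == "-" or d == "-":
            continue  # binary files report "-" instead of counts
        added += int(a)
        deleted += int(d)
    return added, deleted


def repo_churn(repo_path: str, since: str = "90 days ago") -> Tuple[int, int]:
    """Run the local git CLI inside repo_path; nothing leaves the machine."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "--numstat", "--format=", f"--since={since}"],
        capture_output=True, text=True, check=True,
    )
    return parse_numstat(result.stdout)


sample = "10\t2\tpaxel.py\n-\t-\tposter.png\n\n3\t0\tREADME.md\n"
print(parse_numstat(sample))  # (13, 2)
```

The empty `--format=` suppresses commit headers so only the tab-separated numstat rows remain to be parsed.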
Most "how you build" profilers only see the assistant's Edit/Write tool calls. But a huge
amount of real work happens through the shell — cat <<EOF > file heredocs, >/>>
redirects, sed -i, scripts that generate files. That work is invisible to the tool-call
path, which makes shell-heavy ("brute-force") builders look artificially clean.
So this reports churn three ways, honestly:
- Git churn (gold standard): `git log --numstat` over your authored commits in the window, deduped by repo identity (root commit) so multiple clones aren't double-counted. Captures every committed change however it was made. Caveat: it only covers repos still on disk. The report tells you the coverage (e.g. "4/13 repos"), because work done in directories that no longer exist can't be counted.
- Tool churn: lines via `Edit`/`Write`/`MultiEdit`. What naive profilers show.
- Shell-authored estimate: file-writing Bash calls + lines of heredoc/redirect content.
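The shell-authored estimate can be approximated with a few regular expressions, as in the sketch below. These patterns are assumptions written for illustration, not the ones in paxel.py. They are heuristics: a quoted `>` or a shell comparison can produce a false positive, and the heredoc counter assumes the heredoc opens on the first line of the command.

```python
import re

# A > or >> redirect to a file, skipping fd redirects such as 2>&1 or >&2.
REDIRECT = re.compile(r"(?<![0-9&>])>{1,2}\s*(?!&)([^\s;|&<>]+)")
# A heredoc opener such as <<EOF, <<-EOF or <<'EOF'; group 1 is the end tag.
HEREDOC = re.compile(r"<<-?\s*['\"]?(\w+)['\"]?")
# sed invoked with the in-place flag.
SED_INPLACE = re.compile(r"\bsed\s+(?:-\w+\s+)*-i")


def shell_write_estimate(command: str) -> dict:
    """Guess whether a Bash command writes a file, and count heredoc body lines."""
    writes = bool(REDIRECT.search(command) or SED_INPLACE.search(command))
    heredoc_lines = 0
    match = HEREDOC.search(command)
    if match:
        tag = match.group(1)
        for line in command.split("\n")[1:]:  # lines after the opening line
            if line.strip() == tag:
                break
            heredoc_lines += 1
    return {"writes_file": writes, "heredoc_lines": heredoc_lines}


print(shell_write_estimate("cat <<EOF > notes.txt\nline one\nline two\nEOF"))
# {'writes_file': True, 'heredoc_lines': 2}
```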
Iteration depth is reported as mean / median / p90 / max (a single mean hides the "hammered one file 100+ times" tail), and errors as a rate, so brute-forcing reads as brute-forcing.
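The sketch below shows why a single mean hides the tail. It assumes iteration depth means edits per file and uses a nearest-rank p90; both are assumptions for illustration, and paxel's own definition may differ.

```python
import statistics
from collections import Counter


def iteration_depth(edited_paths):
    """Summarize edits-per-file as mean / median / p90 / max."""
    counts = sorted(Counter(edited_paths).values())
    if not counts:
        return None
    n = len(counts)
    rank = (9 * n + 9) // 10  # ceil(0.9 * n) in integer arithmetic (nearest rank)
    return {
        "mean": statistics.mean(counts),
        "median": statistics.median(counts),
        "p90": counts[rank - 1],
        "max": counts[-1],
    }


# Nine files touched once, one file hammered 100 times.
paths = [f"f{i}.py" for i in range(9)] + ["hot.py"] * 100
summary = iteration_depth(paths)
print(summary["mean"], summary["median"], summary["max"])
```

In this sample the mean comes out near 11 edits per file even though nine of the ten files were touched once; the median and max together tell the real story.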
| file | contents |
|---|---|
| report.md | deterministic stats, human-readable |
| stats.json | all metrics, machine-readable |
| narrative_input.md | curated excerpts for the narrative pass; stays local and may contain private content from your own prompts |
| profile.html | the deliverable: branded, shareable builder profile (open in a browser) |
Note: every output stays on your machine. Add them to `.gitignore` (this repo does) so you never accidentally commit your own data.
- Multi-source (Claude Code, Codex, Gemini, Pi, opencode, Cursor), with per-source selection via args. Cursor sessions live in BOTH `state.vscdb` (SQLite) and agent-transcripts JSONL with complementary data; the SQLite copy is the event stream (it carries per-event timestamps and tool error statuses the JSONL lacks), while the JSONL twin backfills what the DB omits: the workspace path (from the project folder slug) and the edit old/new strings that drive tool churn. JSONL stands alone for sessions missing from the DB and for subagent sidechains.
- One-shot. Just re-run to rebuild as sessions accumulate.
- Genuine prompts exclude `isMeta`, `isCompactSummary`, tool-results, and `isSidechain` subagent-dispatch instructions; only human-typed turns count.
- Active time uses capped inter-event gaps (10-min cap), not raw session span, because `sessionId` is reused across resumed sessions spanning weeks (raw span over-inflates time).
- Subagent work counts toward tool/churn totals (it's work you delegated) but never toward your prompt count.
- Timestamps are converted UTC → local timezone for the work-hour histogram.
- The archetype and 0–10 axis scores are interpretive (Paxel's rubric is closed). The axes are derived from Garry Tan's gstack (see "How scores are graded" above), but the counts are measured and reproducible; the scores are an opinion laid on top.
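The capped-gap rule for active time is easy to sketch. The 10-minute cap comes from the list above; the function name and the sample timestamps are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

GAP_CAP = timedelta(minutes=10)  # cap stated in the notes above


def active_time(timestamps):
    """Sum gaps between consecutive events, counting each gap as at most GAP_CAP."""
    ordered = sorted(timestamps)
    total = timedelta()
    for prev, cur in zip(ordered, ordered[1:]):
        total += min(cur - prev, GAP_CAP)
    return total


t0 = datetime(2025, 1, 6, 9, 0, tzinfo=timezone.utc)
events = [
    t0,
    t0 + timedelta(minutes=3),
    t0 + timedelta(minutes=5),
    t0 + timedelta(days=14),             # same session resumed two weeks later
    t0 + timedelta(days=14, minutes=4),
]
print(active_time(events))     # 0:19:00
print(events[-1] - events[0])  # 14 days, 0:04:00
```

The two-week pause contributes only the 10-minute cap, so the resumed session reads as 19 minutes of work instead of 14 days.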
Honest about what it can't see. If you can close one of these, open a PR:
- `sed -i` / runtime-generated files: a command like `python build.py` writes files whose content never appears in the transcript, so the shell-authored estimate misses it. Git churn catches it if it was committed in a repo still on disk.
- `~/.claude/history.jsonl` (a separate flat prompt log) isn't parsed yet.
- Cursor workspace slugs: project `cwd` is reconstructed from the folder slug (`Users-you-dev-foo` → `/Users/you/dev/foo`); usernames or paths with dashes may mis-parse.
- Cursor JSONL carries no timestamps, models, or tool error statuses. Sessions also present in `state.vscdb` use the richer SQLite copy; JSONL-only sessions get a single file-mtime timestamp (so they land on the calendar without faking an hour-by-hour history).
- Cursor `ApplyPatch` churn counts raw patch lines (like Codex `apply_patch`), so it slightly over-estimates; git churn is unaffected.
- Codex tool churn from `apply_patch` counts raw patch lines (diff markers included), so it over-estimates; the gold-standard git churn is unaffected.
- Score grounding: axis criteria are derived from gstack, but the archetype picker (`pick_archetype`) is still a hand-rolled rule set, and the gstack→metric mappings are a first pass. Sharper targets, or mapping archetypes onto gstack's roles (CEO / Eng Manager / QA Lead / …), would be great contributions.
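The Cursor slug limitation is easiest to see in code. The function below is the naive mapping implied by the `Users-you-dev-foo` example, written here as an assumption and not copied from paxel.py.

```python
def slug_to_path(slug: str) -> str:
    """Naive reconstruction: treat every dash in the slug as a path separator."""
    return "/" + slug.replace("-", "/")


print(slug_to_path("Users-you-dev-foo"))     # /Users/you/dev/foo
print(slug_to_path("Users-you-dev-my-app"))  # /Users/you/dev/my/app
```

The second call shows the ambiguity: a project that really lives in `/Users/you/dev/my-app` produces the same slug, so the dash inside the directory name is indistinguishable from a separator.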
Issues and pull requests welcome. See CONTRIBUTING.md — it splits the work into more sources & signals (great first PRs) and the harder, more interesting one below.
Adding a source is mechanical; changing what gets measured is not. paxel reads usage style and volume, not skill — several axes can run opposite to seniority (experts prompt terser, ship cleaner, lean on the agent harder). We've already shipped and killed real scoring bugs: a term that gave a spammer a 9.4, and a "Steering" axis that scored a 30-year engineer 1/10 for being hands-off.
So scoring changes are held to an adversarial suite (tests/test_scoring_invariants.py): ten
invariants — a spammer must not out-score a shipper, raw volume must not inflate a score, verbosity
must not buy "Planning," being hands-off must never cost points — each a frozen memory of a bug.
Think a score is wrong? Don't just reweight a coefficient: add an adversarial profile that encodes
what should be true and show the current math violates it. Rules of engagement in
CONTRIBUTING.md.
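An adversarial invariant in this spirit might look like the sketch below. Everything in it is hypothetical: `toy_execution_score` is a stand-in and not `compute_scores`, and the two profiles are invented. It only illustrates the shape of "encode what should be true, then assert it".

```python
import unittest


def toy_execution_score(profile: dict) -> float:
    """Stand-in scorer (hypothetical): rewards committed lines, ignores prompt volume."""
    return min(10.0, 10.0 * profile["committed_lines"] / 5000)


SPAMMER = {"prompts": 4000, "committed_lines": 50}    # huge volume, little shipped
SHIPPER = {"prompts": 300, "committed_lines": 4000}   # modest volume, lots shipped


class ScoringInvariants(unittest.TestCase):
    def test_spammer_does_not_outscore_shipper(self):
        self.assertLessEqual(toy_execution_score(SPAMMER), toy_execution_score(SHIPPER))

    def test_raw_volume_does_not_inflate_score(self):
        # Ten times the prompts with the same shipped output must score the same.
        louder = dict(SHIPPER, prompts=SHIPPER["prompts"] * 10)
        self.assertEqual(toy_execution_score(louder), toy_execution_score(SHIPPER))
```

Saved as a test module, a file like this runs under `python3 -m unittest`. A proposed scoring change that breaks either assertion fails loudly instead of silently shifting numbers.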
There's a stdlib-only smoke harness (no pytest to install) — run it with:

    python3 -m unittest discover -s tests

It points paxel at tiny committed transcripts under tests/fixtures/<source>/, runs the whole
pipeline end-to-end per source, and asserts a valid profile comes out — plus a node --check of
the embedded poster JS and a few invariant guards. Adding a new source? Drop a minimal fixture
in tests/fixtures/<your-source>/ matching its real on-disk layout, then add an entry for it to
SRC_DIRS and EXPECTED_SOURCES in tests/test_smoke.py — and it's covered. CI (GitHub Actions)
runs all of this on every PR.
Scoring is guarded separately by tests/test_scoring_invariants.py — ten adversarial invariants
(built from the synthetic profiles in tests/adversarial_profiles.py) that any change to
compute_scores must keep green. See CONTRIBUTING.md before touching the math.
