feat: exhaustive-by-default scans with a design-judgment layer #16

Merged
gabelul merged 7 commits into main from feat/exhaustive-by-default on Jun 10, 2026

@gabelul gabelul commented Jun 9, 2026

The full direction change: Pixelslop stops optimizing for "fewer defensible measured findings only" and becomes exhaustive by default, with a labeled judgment layer on top of the measured backbone. Goal: be good enough that you never run a second design tool. Four commits.

1. Exhaustive by default

deep and thorough now default to true, because Pixelslop is usually driven by an AI agent that won't remember the flags — the default has to be the thorough one. thorough shows low-confidence findings tagged instead of hiding them. --fast is the opt-out.
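
A minimal sketch of how these defaults might resolve. The `DEFAULTS` object and `resolveScanOptions` function are hypothetical names for illustration, not Pixelslop's actual code; only the `deep` and `thorough` defaults and the `--fast` opt-out come from this PR.

```javascript
// Hypothetical sketch: names are illustrative, not Pixelslop's real internals.
const DEFAULTS = { deep: true, thorough: true };

// Resolve effective scan options from CLI-style arguments.
function resolveScanOptions(argv) {
  const options = { ...DEFAULTS };
  if (argv.includes("--fast")) {
    // --fast is the opt-out: it turns both deep and thorough back off.
    options.deep = false;
    options.thorough = false;
  }
  return options;
}

console.log(resolveScanOptions([]));         // { deep: true, thorough: true }
console.log(resolveScanOptions(["--fast"])); // { deep: false, thorough: false }
```

The point of the shape is that a caller who passes nothing gets the thorough behaviour, so an agent that forgets the flags still runs the full scan.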

2. Measured vs judgment layers

Every finding carries kind: "measured" (default, evidence-backed) or "judgment" (subjective). The report keeps them in separate labeled sections so an opinion never reads as a measured fact. The /20 stays measured-only.
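
As a rough illustration of the split, assuming a finding is a plain object: only the `kind` field and its two values are taken from this PR, while `splitByKind`, `id`, and the sample findings are made up for the example.

```javascript
// Hypothetical finding shape; only `kind` and its two values come from the PR.
function splitByKind(findings) {
  const layers = { measured: [], judgment: [] };
  for (const finding of findings) {
    // A missing kind is treated as "measured", the stated default.
    const kind = finding.kind === "judgment" ? "judgment" : "measured";
    layers[kind].push(finding);
  }
  return layers;
}

const layers = splitByKind([
  { id: "example-measured-finding" },
  { id: "example-judgment-finding", kind: "judgment", confidence: "medium" },
]);

// Only the measured layer would feed the /20 score; judgment is report-only.
const scoreInputs = layers.measured;
console.log(scoreInputs.length, layers.judgment.length); // 1 1
```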

3. The design-director pass

A 7th evaluator that looks at the screenshots and opines like a design director: does this read as AI-generated, is the composition generic, what's the missed opportunity, where does it overload the user. It emits judgment findings only, never a score. The guard against vague "make it pop" noise is a mandatory second pass where it argues against its own findings, drops what it can't defend or what a measured evaluator already caught, respects intentional bold design, and tags confidence. This is what beats the alternative instead of just matching it: a deep measured backbone plus the subjective read, not the subjective half alone.
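
The second pass is presumably carried out by the evaluator itself rather than by code, but its filtering rules can be modelled as data. This is a sketch under that assumption; `secondPass` and the fields `topic`, `defensible`, and `intentional` are hypothetical.

```javascript
// Hypothetical model of the second-pass rules; field names are assumptions.
function secondPass(draftFindings, measuredFindings) {
  const measuredTopics = new Set(measuredFindings.map((f) => f.topic));
  return draftFindings
    .filter((f) => f.defensible)                 // drop what it cannot defend
    .filter((f) => !measuredTopics.has(f.topic)) // drop what a measured evaluator already caught
    .filter((f) => !f.intentional)               // respect intentional bold design
    .map((f) => ({ ...f, kind: "judgment", confidence: f.confidence ?? "low" }));
}

const measured = [{ topic: "contrast" }];
const drafts = [
  { topic: "contrast", defensible: true, confidence: "high" },
  { topic: "generic-composition", defensible: true, confidence: "medium" },
  { topic: "make-it-pop", defensible: false },
  { topic: "loud-palette", defensible: true, intentional: true },
];

const kept = secondPass(drafts, measured);
console.log(kept.map((f) => f.topic)); // [ 'generic-composition' ]
```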

4. Project-specific personas

The 8 built-in personas were the only lens, and the documented custom-persona discovery was never actually wired. Now `personas write` and `personas list` manage validated custom personas, and the orchestrator generates 1-2 from the project's audience/brand so the persona findings fit your users.

Tests

1000 passing, zero dependencies. New: report-layers, design-director contract, personas-tool, plus updated default + evaluator-count tests.

Regular merge, not squash, so release-please keeps each feat entry.

Gabi added 4 commits June 10, 2026 01:45
Pixelslop is usually driven by an AI agent, and an agent won't remember to pass
--thorough or --deep any more than a person will. So the default behaviour was
the only behaviour, and the default was the minimal one: low-confidence findings
hidden, shallow collection. That made scans miss things.

Flip deep and thorough to default true. thorough now shows lower-confidence
findings tagged with their confidence instead of hiding them; deep doubles the
collection budgets for more evidence. The opt-out is --fast, which turns both
back off for a quick high-confidence-only pass.

Personas already defaulted to all. SKILL.md and the settings docs explain the
exhaustive-by-default posture and the --fast escape hatch.

Findings now carry a kind: "measured" (the default, evidence-backed) or
"judgment" (a subjective read). The HTML report keeps them in separate labeled
sections so an opinion never reads as a measured fact, and judgment findings show
their confidence inline. A scan with only measured findings looks exactly as
before — the judgment layer only appears when there is something in it.

This is the report foundation for the design-director pass. The /20 score stays
measured-only; judgment is additive coverage, not a score input.

The six measured evaluators score what's measurable. None of them can say
whether a page is actually any good, which is the thing a designer catches by
eye and the reason a measured-only scan misses things.

The design-director is a seventh evaluator that looks at the screenshots and
opines: does this read as AI-generated, is the composition generic, what's the
missed opportunity, where does the page make the user think too hard. It emits
judgment findings only and never touches the /20 — the score stays measured.

The guard against turning into vague "make it pop" noise is a mandatory second
pass: it argues against each of its own findings, drops the ones it can't defend
or that a measured evaluator already caught, respects intentional bold design,
and tags what survives with a confidence. The orchestrator spawns it alongside
the six and routes its findings to the report's Design judgment layer.

Pixelslop shipped 8 generic personas, and the docs claimed custom ones in
.pixelslop/personas/ were auto-discovered — but nothing actually loaded them. So
every project got the same generic lens, and a wedding-planner site was never
tested by "the bride three weeks out."

Adds a personas tool group: `personas write` validates a persona (required
fields, slug-only id, no built-in collision, no path traversal) and saves it to
.pixelslop/personas/; `personas list` returns built-ins plus custom. The
orchestrator now generates 1-2 personas from the project's audience and brand
and evaluates them alongside the built-ins, leading with the project-specific
one when it surfaces a real audience issue.
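
The four validation rules named above could look roughly like this. It is a sketch only: the required field names, the `.json` extension, and the placeholder built-in id are assumptions, since the PR does not list them.

```javascript
const path = require("node:path");

const PERSONA_DIR = ".pixelslop/personas";
// Placeholder id; the eight real built-in ids are not listed in this PR.
const BUILT_IN_IDS = new Set(["example-built-in"]);
// Assumed required fields, for illustration only.
const REQUIRED_FIELDS = ["id", "name", "description"];
const SLUG = /^[a-z0-9]+(?:-[a-z0-9]+)*$/;

// Returns a list of validation errors; an empty list means the persona is valid.
function validatePersona(persona) {
  const errors = [];
  for (const field of REQUIRED_FIELDS) {
    if (typeof persona[field] !== "string" || persona[field].trim() === "") {
      errors.push(`missing required field: ${field}`);
    }
  }
  const id = typeof persona.id === "string" ? persona.id : "";
  if (!SLUG.test(id)) errors.push("id must be a lowercase slug");
  if (BUILT_IN_IDS.has(id)) errors.push("id collides with a built-in persona");
  // Defense in depth: the resolved file must stay inside the personas directory.
  const root = path.resolve(PERSONA_DIR);
  const target = path.resolve(root, `${id}.json`);
  if (!target.startsWith(root + path.sep)) {
    errors.push("path escapes the personas directory");
  }
  return errors;
}

const ok = validatePersona({
  id: "bride-three-weeks-out",
  name: "The bride three weeks out",
  description: "Planning under time pressure on a phone.",
});
console.log(ok); // []
```

The slug check already rejects `/` and `..`, so the resolved-path check is redundant for well-formed ids; it is kept as a second line of defence in case the slug rule is ever loosened.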
@gabelul gabelul changed the title feat: make scans exhaustive by default feat: exhaustive-by-default scans with a design-judgment layer Jun 10, 2026
Gabi added 3 commits June 10, 2026 02:36
…drift

An agent invoking /pixelslop reads SKILL.md, and SKILL.md was advertising almost
none of what Pixelslop can do — the description was three releases stale, the args
list was missing --fast and --deep, and personas, the design-director, trends, and
tokens went unmentioned. Capabilities nobody knows about may as well not exist.

Rewrites the frontmatter description and args to match reality, and adds a single
canonical "Capabilities & Options" menu near the top of SKILL.md — the first thing
an agent reads — that also tells it to surface the relevant option to the user when
a scan finishes.

The durable part is the guard: skill-discoverability.test.js pulls the setting keys
straight from SETTING_DEFS and fails the build if SKILL.md doesn't mention each one,
plus curated checks for every flag, command, and capability. Add a feature without
advertising it and the build breaks. That's the mechanism that keeps this from
rotting again.
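
The core of such a drift guard can be sketched in a few lines. The `SETTING_DEFS` object and the SKILL.md text below are stand-ins, since neither is shown in this PR, and `findUnadvertisedSettings` is a hypothetical helper name.

```javascript
// Hypothetical stand-ins: the real SETTING_DEFS shape and SKILL.md are not shown here.
function findUnadvertisedSettings(settingDefs, skillMarkdown) {
  // A setting counts as advertised if its key appears anywhere in the skill text.
  return Object.keys(settingDefs).filter((key) => !skillMarkdown.includes(key));
}

const SETTING_DEFS = { deep: { default: true }, thorough: { default: true } };
const skillMd = "## Capabilities & Options\n- deep: doubles collection budgets\n";

const unadvertised = findUnadvertisedSettings(SETTING_DEFS, skillMd);
console.log(unadvertised); // [ 'thorough' ]
```

A real test would read SKILL.md from disk and assert that the returned list is empty, so an undocumented setting fails the build.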

Knowing the options isn't the same as knowing the best one. An agent could read
the capabilities menu and still just run defaults, or open with a wall of
settings questions. Neither is advice.

The skill now carries an advisory playbook: infer the user's intent — a quick
look, a pre-launch review, a CI run, tracking progress — and lead with a
recommendation plus the one tradeoff, only asking when there's a real fork. The
user shouldn't need to know --fast or --deep exist; translating intent into flags
is the agent's job. The drift guard now also asserts the advisory section stays.

…Code

SKILL.md told the agent to use AskUserQuestion in ~14 places, but that's a Claude
Code tool. Codex CLI has no choice-prompt popup (it's an open request upstream),
and the installer only rewrites paths — so a Codex-installed skill was asking the
agent to use a tool that doesn't exist there.

Adds an "Asking the user" protocol at the top: the AskUserQuestion blocks are the
question content, and each harness renders them its own way. Claude Code uses the
tool; Codex and others present a numbered menu and wait for a reply; non-interactive
runs skip the question and use the default. One SKILL.md works everywhere, no
per-harness rewriting. Drift-guarded.
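
The protocol lives in SKILL.md as instructions for the agent, not as code, but its three branches can be modelled as a small function. The harness labels, the question shape, and `renderQuestion` are all assumptions made for this sketch.

```javascript
// Hypothetical model of the three branches; harness names and question shape are assumed.
function renderQuestion(question, harness) {
  if (harness === "non-interactive") {
    // Non-interactive runs skip the question and take the default.
    return { mode: "default", answer: question.defaultOption };
  }
  if (harness === "claude-code") {
    // Claude Code has the AskUserQuestion tool, so the block is passed through.
    return { mode: "tool", tool: "AskUserQuestion", payload: question };
  }
  // Codex and other harnesses: present a numbered menu and wait for a reply.
  const menu = question.options.map((opt, i) => `${i + 1}. ${opt}`).join("\n");
  return { mode: "menu", text: `${question.prompt}\n${menu}` };
}

const question = {
  prompt: "How thorough should this scan be?",
  options: ["Full scan", "Quick pass"],
  defaultOption: "Full scan",
};

console.log(renderQuestion(question, "codex").text);
// How thorough should this scan be?
// 1. Full scan
// 2. Quick pass
```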
@gabelul gabelul merged commit 7c8d0b3 into main Jun 10, 2026
3 checks passed
@gabelul gabelul deleted the feat/exhaustive-by-default branch June 10, 2026 09:15