feat: exhaustive-by-default scans with a design-judgment layer #16
Merged
added 4 commits
June 10, 2026 01:45
Pixelslop is usually driven by an AI agent, and an agent won't remember to pass --thorough or --deep any more than a person will. So the default behaviour was the only behaviour, and the default was the minimal one: low-confidence findings hidden, shallow collection. That made scans miss things. Flip deep and thorough to default true. thorough now shows lower-confidence findings tagged with their confidence instead of hiding them; deep doubles the collection budgets for more evidence. The opt-out is --fast, which turns both back off for a quick high-confidence-only pass. Personas already defaulted to all. SKILL.md and the settings docs explain the exhaustive-by-default posture and the --fast escape hatch.
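A minimal sketch of the default flip described in this commit, not the actual implementation. `resolveScanOptions`, `visibleFindings`, the base budget of 10, and the 0.7 confidence threshold are all hypothetical names and values chosen for illustration; only the `deep`/`thorough` defaults and the `--fast` opt-out come from the commit message.

```javascript
// Hypothetical defaults; the real definitions live in Pixelslop's settings code.
const DEFAULTS = { deep: true, thorough: true };

// Resolve flags: exhaustive unless the caller opts out with --fast.
function resolveScanOptions(argv, baseBudget = 10) {
  const fast = argv.includes("--fast");
  const deep = fast ? false : DEFAULTS.deep;
  const thorough = fast ? false : DEFAULTS.thorough;
  return {
    deep,
    thorough,
    // deep doubles the collection budget (baseBudget is an assumed unit).
    budget: deep ? baseBudget * 2 : baseBudget,
  };
}

// thorough keeps low-confidence findings and tags them; otherwise they are hidden.
function visibleFindings(findings, options, threshold = 0.7) {
  if (options.thorough) {
    return findings.map((f) => ({ ...f, lowConfidence: f.confidence < threshold }));
  }
  return findings.filter((f) => f.confidence >= threshold);
}

console.log(resolveScanOptions([]));
console.log(resolveScanOptions(["--fast"]));
```

Because the opt-out is a single flag, an agent that passes nothing gets the exhaustive scan, which is the point of the change.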
Findings now carry a kind: "measured" (the default, evidence-backed) or "judgment" (a subjective read). The HTML report keeps them in separate labeled sections so an opinion never reads as a measured fact, and judgment findings show their confidence inline. A scan with only measured findings looks exactly as before — the judgment layer only appears when there is something in it. This is the report foundation for the design-director pass. The /20 score stays measured-only; judgment is additive coverage, not a score input.
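The layering rule above can be sketched roughly as follows. This is an illustration under assumptions: `layerFindings`, `renderSections`, the finding fields `title` and `confidence`, and the "Measured findings" section title are hypothetical; the `kind` values, the measured default, and the "Design judgment" layer name come from the commit messages.

```javascript
// Split findings by kind; anything without kind: "judgment" is treated as measured.
function layerFindings(findings) {
  const layers = { measured: [], judgment: [] };
  for (const finding of findings) {
    const kind = finding.kind === "judgment" ? "judgment" : "measured";
    layers[kind].push(finding);
  }
  return layers;
}

// Build report sections; the judgment section is omitted when it is empty,
// so a measured-only scan renders the same single section as before.
function renderSections(findings) {
  const { measured, judgment } = layerFindings(findings);
  const sections = [
    { title: "Measured findings", items: measured.map((f) => f.title) },
  ];
  if (judgment.length > 0) {
    sections.push({
      title: "Design judgment",
      // Judgment findings show their confidence inline.
      items: judgment.map((f) => `${f.title} (confidence: ${f.confidence})`),
    });
  }
  return sections;
}

console.log(renderSections([
  { title: "Body text contrast below threshold" },
  { kind: "judgment", title: "Hero composition reads as generic", confidence: 0.6 },
]));
```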
The six measured evaluators score what's measurable. None of them can say whether a page is actually any good, which is the thing a designer catches by eye and the reason a measured-only scan misses things. The design-director is a seventh evaluator that looks at the screenshots and opines: does this read as AI-generated, is the composition generic, what's the missed opportunity, where does the page make the user think too hard. It emits judgment findings only and never touches the /20 — the score stays measured. The guard against turning into vague "make it pop" noise is a mandatory second pass: it argues against each of its own findings, drops the ones it can't defend or that a measured evaluator already caught, respects intentional bold design, and tags what survives with a confidence. The orchestrator spawns it alongside the six and routes its findings to the report's Design judgment layer.
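The second pass is performed by the evaluator itself, but its survival rules can be modelled as a plain filter. A rough sketch, assuming hypothetical candidate fields (`topic`, `defense`, `intentional`, `confidence`) that are not taken from the Pixelslop codebase:

```javascript
// Model of the mandatory second pass over the design-director's candidates.
function secondPass(candidates, measuredFindings) {
  // Topics a measured evaluator already reported (assumed field name).
  const measuredTopics = new Set(measuredFindings.map((f) => f.topic));
  return candidates
    .filter((c) => Boolean(c.defense)) // drop what it cannot defend
    .filter((c) => !measuredTopics.has(c.topic)) // drop what measured evaluators caught
    .filter((c) => !c.intentional) // respect intentional bold design
    .map((c) => ({
      kind: "judgment", // judgment findings only, never a score input
      title: c.title,
      confidence: c.confidence,
    }));
}

const survivors = secondPass(
  [
    { topic: "composition", title: "Generic three-column layout", defense: "Matches a stock template", confidence: 0.7 },
    { topic: "contrast", title: "Text is hard to read", defense: "Low contrast", confidence: 0.8 },
    { topic: "vibe", title: "Make it pop", defense: "", confidence: 0.3 },
    { topic: "color", title: "Palette is loud", defense: "Very saturated", intentional: true, confidence: 0.5 },
  ],
  [{ topic: "contrast" }]
);
console.log(survivors);
```

Only the first candidate survives here: the second duplicates a measured finding, the third has no defense, and the fourth is marked as an intentional choice.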
Pixelslop shipped 8 generic personas, and the docs claimed custom ones in .pixelslop/personas/ were auto-discovered — but nothing actually loaded them. So every project got the same generic lens, and a wedding-planner site was never tested by "the bride three weeks out." Adds a personas tool group: `personas write` validates a persona (required fields, slug-only id, no built-in collision, no path traversal) and saves it to .pixelslop/personas/; `personas list` returns built-ins plus custom. The orchestrator now generates 1-2 personas from the project's audience and brand and evaluates them alongside the built-ins, leading with the project-specific one when it surfaces a real audience issue.
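The validation rules listed for `personas write` could look something like the sketch below. It is not the tool's code: `validatePersona`, the required field names, the slug pattern, and the `example-builtin` id are assumptions made for illustration, and the built-in ids are passed in rather than guessed.

```javascript
// Assumed required fields; the real persona schema may differ.
const REQUIRED_FIELDS = ["id", "name", "description"];
// Lowercase slug: letters and digits separated by single hyphens.
const SLUG = /^[a-z0-9]+(?:-[a-z0-9]+)*$/;

function validatePersona(persona, builtInIds) {
  const errors = [];
  for (const field of REQUIRED_FIELDS) {
    if (typeof persona[field] !== "string" || persona[field].trim() === "") {
      errors.push(`missing required field: ${field}`);
    }
  }
  const id = typeof persona.id === "string" ? persona.id : "";
  if (id.includes("..") || id.includes("/") || id.includes("\\")) {
    // The id becomes a file name under .pixelslop/personas/, so reject path segments.
    errors.push("id must not contain path segments");
  } else if (id !== "" && !SLUG.test(id)) {
    errors.push("id must be a lowercase slug");
  }
  if (builtInIds.includes(id)) {
    errors.push(`id collides with a built-in persona: ${id}`);
  }
  return { ok: errors.length === 0, errors };
}

console.log(validatePersona(
  { id: "bride-three-weeks-out", name: "The bride three weeks out", description: "Booking under time pressure" },
  ["example-builtin"]
));
```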
added 3 commits
June 10, 2026 02:36
…drift An agent invoking /pixelslop reads SKILL.md, and SKILL.md was advertising almost none of what Pixelslop can do — the description was three releases stale, the args list was missing --fast and --deep, and personas, the design-director, trends, and tokens went unmentioned. Capabilities nobody knows about may as well not exist. Rewrites the frontmatter description and args to match reality, and adds a single canonical "Capabilities & Options" menu near the top of SKILL.md — the first thing an agent reads — that also tells it to surface the relevant option to the user when a scan finishes. The durable part is the guard: skill-discoverability.test.js pulls the setting keys straight from SETTING_DEFS and fails the build if SKILL.md doesn't mention each one, plus curated checks for every flag, command, and capability. Add a feature without advertising it and the build breaks. That's the mechanism that keeps this from rotting again.
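The core of such a drift guard is small. A minimal sketch, assuming `SETTING_DEFS` is an object keyed by setting name (its real shape is not shown in this PR); `findUnadvertised` and the sample data are hypothetical, and the actual test reads SKILL.md from disk rather than taking a string.

```javascript
// Return every setting key that the skill text never mentions.
function findUnadvertised(settingDefs, skillText) {
  return Object.keys(settingDefs).filter((key) => !skillText.includes(key));
}

// Illustrative stand-ins for SETTING_DEFS and the SKILL.md contents.
const sampleDefs = { deep: {}, thorough: {}, personas: {} };
const sampleSkill = "Use --fast to skip the deep and thorough defaults.";

const missing = findUnadvertised(sampleDefs, sampleSkill);
console.log(missing); // → ["personas"]

// A test would fail the build when anything is missing, for example:
// if (missing.length > 0) throw new Error(`SKILL.md does not mention: ${missing.join(", ")}`);
```

A plain substring match also accepts a key that appears inside a longer word, so a stricter guard might match on word boundaries or on the flag form.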
Knowing the options isn't the same as knowing the best one. An agent could read the capabilities menu and still just run defaults, or open with a wall of settings questions. Neither is advice. The skill now carries an advisory playbook: infer the user's intent — a quick look, a pre-launch review, a CI run, tracking progress — and lead with a recommendation plus the one tradeoff, only asking when there's a real fork. The user shouldn't need to know --fast or --deep exist; translating intent into flags is the agent's job. The drift guard now also asserts the advisory section stays.
…Code SKILL.md told the agent to use AskUserQuestion in ~14 places, but that's a Claude Code tool. Codex CLI has no choice-prompt popup (it's an open request upstream), and the installer only rewrites paths — so a Codex-installed skill was asking the agent to use a tool that doesn't exist there. Adds an "Asking the user" protocol at the top: the AskUserQuestion blocks are the question content, and each harness renders them its own way. Claude Code uses the tool; Codex and others present a numbered menu and wait for a reply; non-interactive runs skip the question and use the default. One SKILL.md works everywhere, no per-harness rewriting. Drift-guarded.
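The protocol is written as instructions in SKILL.md rather than as code, but the branching it describes can be sketched like this. `presentQuestion`, the `env` object, the harness identifiers, and the question shape are hypothetical; the payload actually expected by `AskUserQuestion` is not shown in this PR and is not modelled here.

```javascript
// Decide how one question block should reach the user in a given harness.
function presentQuestion(question, env) {
  if (!env.interactive) {
    // Non-interactive runs skip the question and take the default.
    return { mode: "default", answer: question.default };
  }
  if (env.harness === "claude-code") {
    // Claude Code has the AskUserQuestion tool; hand the block to it as-is.
    return { mode: "tool", tool: "AskUserQuestion", question };
  }
  // Codex and others: present a numbered menu and wait for a reply.
  const menu = question.options.map((opt, i) => `${i + 1}. ${opt}`).join("\n");
  return { mode: "menu", text: `${question.prompt}\n${menu}` };
}

const sampleQuestion = {
  prompt: "How thorough should this scan be?",
  options: ["Full scan", "Quick pass"],
  default: "Full scan",
};
console.log(presentQuestion(sampleQuestion, { harness: "codex", interactive: true }).text);
```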
The full direction change: Pixelslop stops optimizing for "fewer defensible measured findings only" and becomes exhaustive by default, with a labeled judgment layer on top of the measured backbone. Goal: be good enough that you never run a second design tool. Four commits.
1. Exhaustive by default
`deep` and `thorough` now default to true, because Pixelslop is usually driven by an AI agent that won't remember the flags, so the default has to be the thorough one. `thorough` shows low-confidence findings tagged instead of hiding them. `--fast` is the opt-out.
2. Measured vs judgment layers
Every finding carries `kind: "measured"` (default, evidence-backed) or `"judgment"` (subjective). The report keeps them in separate labeled sections so an opinion never reads as a measured fact. The /20 stays measured-only.
3. The design-director pass
A 7th evaluator that looks at the screenshots and opines like a design director: does this read as AI-generated, is the composition generic, what's the missed opportunity, where does it overload the user. It emits judgment findings only, never a score. The guard against vague "make it pop" noise is a mandatory second pass where it argues against its own findings, drops what it can't defend or what a measured evaluator already caught, respects intentional bold design, and tags confidence. This is what beats the alternative instead of just matching it: a deep measured backbone plus the subjective read, not the subjective half alone.
4. Project-specific personas
The 8 built-in personas were the only lens, and the documented custom-persona discovery was never actually wired. Now `personas write/list` manage validated custom personas, and the orchestrator generates 1-2 from the project's audience/brand so the persona findings fit your users.
Tests
1000 passing, zero dependencies. New: report-layers, design-director contract, personas-tool, plus updated default + evaluator-count tests.
Regular merge, not squash, so release-please keeps each `feat` entry.