Fix AI classification workflow using legacy prompt/simple inference with no grounded validation

## Goal

Fix the AI classification workflow so repo classification is grounded, validated, and testable instead of letting `actions/ai-inference@v2` free-associate JSON under legacy prompt mode with no tools.

Parent: #42
Related: #46, #48, #50, #53, #54, #62, #69

## Triggering Evidence

Direct evidence from the supplied workflow log at `2026-05-10T07:25:19Z`:

```text
Run actions/ai-inference@v2
model: openai/gpt-4o
max-tokens: 3000
system-prompt-file: .github-stars/data/system-prompt.txt
prompt-file: .github-stars/data/user-prompt.txt
endpoint: https://models.github.ai/inference
system-prompt: You are a helpful assistant
enable-github-mcp: false
BATCH_LIMIT: 15
Using legacy prompt format
Running simple inference without tools
```

The model then returned a raw JSON array of classifications, for example:

```json
[
  {"repo":"streetwriters/notesnook","categories":["productivity","desktop-dev"],"tags":["note-taking","note-management","lang:ts"],"framework":null},
  {"repo":"vercel-labs/skills","categories":["ai-ml","dev-tools"],"tags":["ai-agent","skills-tool","lang:ts"],"framework":null},
  {"repo":"JamieMason/syncpack","categories":["productivity","dev-tools"],"tags":["dependency-management","monorepo","lang:rust"],"framework":null},
  {"repo":"cursor/agent-trace","categories":["ai-ml","documentation"],"tags":["ai-code-tracing","standard-format","lang:ts"],"framework":null}
]
```

## Problem

The classification stage is currently accepting ungrounded model output from a legacy/simple inference path.

Observed failure shape:

```text
legacy prompt format
  -> generic fallback system prompt visible in action inputs
  -> no GitHub MCP/tools
  -> no per-repo grounded evidence fetch
  -> raw model JSON returned
  -> taxonomy/schema gates may catch shape, but not attribution truth
```

This can produce plausible-looking classifications with no proof that the model read the repository, package metadata, README, topics, language stats, or any canonical source.

The visible example includes at least one suspicious classification candidate: `JamieMason/syncpack` is tagged `lang:rust` in the returned model output. That may be wrong or stale, but this issue does not need to prove that specific repo's language to prove the workflow defect. The defect is that the classifier can emit language/category/framework claims without direct evidence.

## Required Architecture

Do not rely on raw LLM output as classification truth.

Classification must become:

```text
candidate repos
  -> evidence collection
  -> typed classification prompt/input
  -> model candidate classification
  -> typed parse
  -> schema validation
  -> taxonomy validation
  -> evidence validation / confidence scoring
  -> bounded write to repos.yml
  -> workflow summary with proof
```
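
One way to read the pipeline above is as an ordered list of gates in front of the write step. The sketch below is illustrative only: `Candidate`, `Gate`, and `runGates` are hypothetical names, not existing code in this repository, and the two gates in the usage are stand-ins for the real schema, taxonomy, and evidence checks.

```typescript
// Hypothetical types for illustration; nothing here exists in the repo yet.
interface Candidate {
  repo: string;
  categories: string[];
}

type GateResult = { ok: true } | { ok: false; reason: string };

interface Gate {
  label: string; // e.g. "schema", "taxonomy", "evidence"
  check: (candidate: Candidate) => GateResult;
}

interface GateOutcome {
  writable: Candidate[]; // passed every gate; only these may reach repos.yml
  rejected: { repo: string; gate: string; reason: string }[];
}

// Run each candidate through the gates in order and stop at the first failure.
function runGates(candidates: Candidate[], gates: Gate[]): GateOutcome {
  const outcome: GateOutcome = { writable: [], rejected: [] };
  for (const candidate of candidates) {
    let failure: { gate: string; reason: string } | undefined;
    for (const gate of gates) {
      const result = gate.check(candidate);
      if (!result.ok) {
        failure = { gate: gate.label, reason: result.reason };
        break;
      }
    }
    if (failure) {
      outcome.rejected.push({ repo: candidate.repo, ...failure });
    } else {
      outcome.writable.push(candidate);
    }
  }
  return outcome;
}

// Usage with two stand-in gates and made-up repos.
const demoTaxonomy = new Set(["ai-ml", "dev-tools", "productivity"]);
const demoGates: Gate[] = [
  {
    label: "schema",
    check: (c) => (c.categories.length > 0 ? { ok: true } : { ok: false, reason: "no categories" }),
  },
  {
    label: "taxonomy",
    check: (c) => {
      const unknown = c.categories.filter((cat) => !demoTaxonomy.has(cat));
      return unknown.length === 0
        ? { ok: true }
        : { ok: false, reason: `unknown category: ${unknown.join(", ")}` };
    },
  },
];
const demoOutcome = runGates(
  [
    { repo: "example/alpha", categories: ["ai-ml"] },
    { repo: "example/beta", categories: ["blockchain"] },
  ],
  demoGates,
);
```

The point of the shape is that the write step only ever receives `writable`, and every rejection names the gate that caused it, which is what the workflow summary needs.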

## Required Changes

### 1. Replace legacy/simple inference mode

The workflow must stop using a path that reports:

```text
Using legacy prompt format
Running simple inference without tools
```

If `actions/ai-inference@v2` remains, configure it so the run no longer reports the legacy prompt format and so the prompt contract (system prompt, user prompt, expected output shape) is explicit and current. If the action cannot support the needed grounding and validation, move classification into TypeScript and call the model through a typed adapter.
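
If classification moves into TypeScript, the adapter boundary could look roughly like the following. This is a sketch under assumptions: `ModelAdapter`, `ClassificationRequest`, `buildClassificationRequest`, and `makeFakeAdapter` are hypothetical names, and the transport is left abstract rather than guessing at any endpoint's request shape. The only details taken from the log are the generic `You are a helpful assistant` system prompt, which the builder refuses, and the `max-tokens` value of 3000 used as a default.

```typescript
// Hypothetical adapter boundary; the transport behind it is deliberately abstract.
interface ChatMessage {
  role: "system" | "user";
  content: string;
}

interface ClassificationRequest {
  model: string;
  maxTokens: number;
  messages: ChatMessage[];
}

// Any model client (an action wrapper, an HTTP client, a test fake) implements this.
interface ModelAdapter {
  complete(request: ClassificationRequest): Promise<string>;
}

// The generic fallback seen in the workflow log; treated here as a misconfiguration.
const GENERIC_FALLBACK_PROMPT = "You are a helpful assistant";

function buildClassificationRequest(
  model: string,
  systemPrompt: string,
  evidencePayload: object[],
  maxTokens = 3000,
): ClassificationRequest {
  const trimmed = systemPrompt.trim();
  if (trimmed === "" || trimmed === GENERIC_FALLBACK_PROMPT) {
    throw new Error("system prompt is empty or the generic fallback; refusing to classify");
  }
  if (evidencePayload.length === 0) {
    throw new Error("no evidence collected for this batch; refusing to classify");
  }
  return {
    model,
    maxTokens,
    messages: [
      { role: "system", content: trimmed },
      // The user message carries only collected evidence, serialized as JSON.
      { role: "user", content: JSON.stringify(evidencePayload) },
    ],
  };
}

// A fake adapter for tests: returns a canned response and records what it was asked.
function makeFakeAdapter(canned: string): ModelAdapter & { requests: ClassificationRequest[] } {
  const requests: ClassificationRequest[] = [];
  return {
    requests,
    complete: (request) => {
      requests.push(request);
      return Promise.resolve(canned);
    },
  };
}
```

Keeping the model behind an interface means the parser and validator can be tested against canned outputs without any network call.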

### 2. Add evidence-backed classification input

For each repo in the batch, capture and pass grounded fields such as:

```text
repo
html_url
description
primary language
repository topics
stargazer count
fork count
archived/fork/private flags
README excerpt if available
package manifests if available / feasible
existing categories/tags/framework
last updated / pushed timestamp
source fields used for classification
```

Do not classify from the repo name alone unless the result is marked low confidence or `needs_review`.
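
A minimal sketch of an evidence record built from repository metadata follows. The input field names (`full_name`, `html_url`, `stargazers_count`, and so on) are those of the GitHub REST repository object; `RepoEvidence`, `collectEvidence`, and the 1200-character excerpt limit are assumptions made for illustration. Package manifests and existing categories/tags/framework are left out for brevity.

```typescript
// Subset of the GitHub REST repository object (GET /repos/{owner}/{repo}).
interface GitHubRepoFields {
  full_name: string;
  html_url: string;
  description: string | null;
  language: string | null;
  topics?: string[];
  stargazers_count: number;
  forks_count: number;
  archived: boolean;
  fork: boolean;
  private: boolean;
  pushed_at: string | null;
  updated_at: string | null;
}

// Hypothetical evidence record handed to the classifier; not an existing type in the repo.
interface RepoEvidence {
  repo: string;
  htmlUrl: string;
  description: string | null;
  primaryLanguage: string | null;
  topics: string[];
  stars: number;
  forks: number;
  flags: { archived: boolean; fork: boolean; private: boolean };
  readmeExcerpt: string | null;
  pushedAt: string | null;
  updatedAt: string | null;
  sourceFields: string[]; // which inputs actually carried a signal
  nameOnly: boolean; // true when nothing but the repo name is available
}

function collectEvidence(
  raw: GitHubRepoFields,
  readme: string | null,
  excerptChars = 1200, // assumed limit, chosen only to keep the prompt bounded
): RepoEvidence {
  const topics = raw.topics ?? [];
  const description = (raw.description ?? "").trim() || null;
  const trimmedReadme = (readme ?? "").trim();
  const readmeExcerpt = trimmedReadme === "" ? null : trimmedReadme.slice(0, excerptChars);

  // Record which fields can back a claim, so the validator can cite them later.
  const sourceFields: string[] = [];
  if (description) sourceFields.push("description");
  if (raw.language) sourceFields.push("language");
  if (topics.length > 0) sourceFields.push("topics");
  if (readmeExcerpt) sourceFields.push("readme");

  return {
    repo: raw.full_name,
    htmlUrl: raw.html_url,
    description,
    primaryLanguage: raw.language,
    topics,
    stars: raw.stargazers_count,
    forks: raw.forks_count,
    flags: { archived: raw.archived, fork: raw.fork, private: raw.private },
    readmeExcerpt,
    pushedAt: raw.pushed_at,
    updatedAt: raw.updated_at,
    sourceFields,
    nameOnly: sourceFields.length === 0,
  };
}
```

A record with `nameOnly: true` is the case the rule above targets: any classification produced from it should be forced to low confidence or `needs_review` regardless of what the model returns.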

### 3. Add typed parser and validator

Model output must pass a TypeScript parser before it can mutate `repos.yml`.

Required checks:

```text
valid JSON
array length matches batch, or unmatched repos are reported explicitly
repo names match requested batch
categories are in canonical taxonomy
framework is null or allowed
language tags match collected language evidence or are flagged needs_review
tags are normalized and bounded
unknown/unsupported claims are rejected or marked needs_review
```
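
The checks above could be sketched as a single function that turns raw model text into per-repo verdicts. Everything named here (`validateModelOutput`, `ValidationContext`, `CheckedItem`, the tag pattern, the `maxTags` bound) is an assumption for illustration, not the repository's existing parser. The `repo`/`categories`/`tags`/`framework` shape follows the example output in the log.

```typescript
type Verdict = "accepted" | "needs_review" | "rejected";

interface CheckedItem {
  repo: string;
  verdict: Verdict;
  reasons: string[];
}

interface ValidationContext {
  batch: string[]; // repos requested in this batch
  taxonomy: Set<string>; // canonical categories
  frameworks: Set<string>; // allowed non-null framework values
  languageTagByRepo: Map<string, string>; // repo -> lang tag derived from collected metadata
  maxTags: number;
}

// Assumed normal form for tags: lowercase segments joined by "-", ":" or ".".
const TAG_PATTERN = /^[a-z0-9]+(?:[-:.][a-z0-9]+)*$/;

function isStringArray(value: unknown): value is string[] {
  return Array.isArray(value) && value.every((v) => typeof v === "string");
}

function validateModelOutput(raw: string, ctx: ValidationContext): CheckedItem[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    throw new Error("model output is not valid JSON");
  }
  if (!Array.isArray(parsed)) throw new Error("model output is not a JSON array");

  const results: CheckedItem[] = [];
  const seen = new Set<string>();
  for (const entry of parsed as unknown[]) {
    const obj = (typeof entry === "object" && entry !== null ? entry : {}) as Record<string, unknown>;
    const repo = typeof obj.repo === "string" ? obj.repo : "<missing repo>";
    const { categories, tags, framework } = obj;
    const reject: string[] = [];
    const review: string[] = [];

    if (!ctx.batch.includes(repo)) reject.push("repo not in requested batch");
    else if (seen.has(repo)) reject.push("duplicate repo");
    seen.add(repo);

    if (!isStringArray(categories) || categories.length === 0) {
      reject.push("categories missing or not a string array");
    } else {
      for (const c of categories) if (!ctx.taxonomy.has(c)) reject.push(`unknown category: ${c}`);
    }

    if (framework !== null && !(typeof framework === "string" && ctx.frameworks.has(framework))) {
      reject.push("framework is neither null nor allowed");
    }

    if (!isStringArray(tags)) {
      reject.push("tags missing or not a string array");
    } else {
      if (tags.length > ctx.maxTags) reject.push("too many tags");
      for (const t of tags) {
        if (!TAG_PATTERN.test(t)) reject.push(`tag not normalized: ${t}`);
        else if (t.startsWith("lang:") && ctx.languageTagByRepo.get(repo) !== t) {
          // Mismatch or missing language evidence: flag instead of writing silently.
          review.push(`language tag ${t} not backed by collected metadata`);
        }
      }
    }

    const verdict: Verdict = reject.length > 0 ? "rejected" : review.length > 0 ? "needs_review" : "accepted";
    results.push({ repo, verdict, reasons: [...reject, ...review] });
  }

  // Requested repos the model dropped must show up explicitly, not vanish.
  for (const repo of ctx.batch) {
    if (!seen.has(repo)) results.push({ repo, verdict: "rejected", reasons: ["missing from model output"] });
  }
  return results;
}
```

In this sketch only items with verdict `accepted` would be eligible for the write step; `needs_review` and `rejected` items carry their reasons forward into the workflow summary.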

### 4. Add confidence / evidence status

Each classification result should carry internal evidence status before write:

```text
Direct evidence: supported by collected repo metadata or source field
Weak inference: plausible from description/topics/name but not proven
Unsupported: not grounded in available input; do not write silently
```

`repos.yml` may not need to store all of this permanently, but the workflow summary/artifact must expose enough proof for review.
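
As a rough illustration of how the three statuses might be assigned per claim, the sketch below grades a single tag against collected evidence. `gradeTag`, `overallStatus`, and `TagEvidence` are hypothetical, and the matching rules (exact topic match is direct, every word found in the description is weak) are assumptions chosen for simplicity, not a proposed final policy.

```typescript
type EvidenceStatus = "direct" | "weak" | "unsupported";

// Hypothetical minimal evidence shape; field names are illustrative.
interface TagEvidence {
  languageTag: string | null; // e.g. "lang:ts", derived from the collected primary language
  topics: string[];
  description: string;
}

// Grade one tag against collected evidence, independent of any model self-report.
function gradeTag(tag: string, evidence: TagEvidence): EvidenceStatus {
  const normalized = tag.toLowerCase();
  if (normalized.startsWith("lang:")) {
    // A language claim is direct only when it equals the collected language tag.
    return normalized === evidence.languageTag ? "direct" : "unsupported";
  }
  if (evidence.topics.some((topic) => topic.toLowerCase() === normalized)) return "direct";

  // Weak: every word of the tag appears in the description, which is suggestive but not proof.
  const words = normalized.split("-").filter((w) => w.length > 0);
  const haystack = evidence.description.toLowerCase();
  if (words.length > 0 && words.every((w) => haystack.includes(w))) return "weak";
  return "unsupported";
}

// The weakest status among all claims decides how the whole result is treated.
function overallStatus(statuses: EvidenceStatus[]): EvidenceStatus {
  // No graded claims means nothing was proven.
  if (statuses.length === 0 || statuses.includes("unsupported")) return "unsupported";
  if (statuses.includes("weak")) return "weak";
  return "direct";
}
```

Taking the weakest status as the overall result keeps one well-supported tag from masking an unsupported one in the same classification.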

### 5. Add workflow summary diagnostics

The classification workflow summary must report:

```text
model
inference mode
tools enabled/disabled
batch size
repos classified
repos rejected
repos marked needs_review
schema validation status
taxonomy validation status
evidence validation status
sample rejected reason
artifact paths
commit SHA / no-change status
```
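
A small renderer for these fields might look like the following. `ClassificationSummary` and `renderSummary` are hypothetical names, and the Markdown table layout is only one possible presentation. In GitHub Actions, the returned string can be appended to the file named by the `GITHUB_STEP_SUMMARY` environment variable; that file write is left out so the sketch stays free of I/O.

```typescript
// Hypothetical summary record; fields mirror the list above but are not an existing type.
interface ClassificationSummary {
  model: string;
  inferenceMode: string;
  toolsEnabled: boolean;
  batchSize: number;
  classified: number;
  rejected: number;
  needsReview: number;
  schemaValidation: "pass" | "fail";
  taxonomyValidation: "pass" | "fail";
  evidenceValidation: "pass" | "fail";
  sampleRejectedReason: string | null;
  artifactPaths: string[];
  commitSha: string | null; // null means no change was committed
}

// Render the summary as a two-column Markdown table.
function renderSummary(s: ClassificationSummary): string {
  const rows: [string, string][] = [
    ["Model", s.model],
    ["Inference mode", s.inferenceMode],
    ["Tools", s.toolsEnabled ? "enabled" : "disabled"],
    ["Batch size", String(s.batchSize)],
    ["Repos classified", String(s.classified)],
    ["Repos rejected", String(s.rejected)],
    ["Repos marked needs_review", String(s.needsReview)],
    ["Schema validation", s.schemaValidation],
    ["Taxonomy validation", s.taxonomyValidation],
    ["Evidence validation", s.evidenceValidation],
    ["Sample rejected reason", s.sampleRejectedReason ?? "none"],
    ["Artifact paths", s.artifactPaths.length > 0 ? s.artifactPaths.join(", ") : "none"],
    ["Commit", s.commitSha ?? "no change"],
  ];
  const lines = ["## Classification summary", "", "| Field | Value |", "| --- | --- |"];
  for (const [field, value] of rows) {
    // Escape pipes so free-text reasons cannot break the table.
    lines.push(`| ${field} | ${value.replace(/\|/g, "\\|")} |`);
  }
  return lines.join("\n") + "\n";
}
```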

### 6. Add regression fixtures

Add fixtures/tests for model-output failure modes:

```text
wrong repo returned
extra repo returned
requested repo missing from output
invalid JSON
unknown category
unknown framework
language tag contradicted by collected metadata
unsupported tag with no evidence
legacy prompt output accepted without evidence
```

This must land under the TypeScript control-plane direction from #69, not as one more YAML blob with hopes and dreams zip-tied to it.
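
A table-driven layout is one way to keep these fixtures cheap to extend. The sketch below covers only the JSON and batch-membership cases from the list; `Fixture`, `runFixtures`, and `checkBatchMembership` are hypothetical names, the repos are made up, and the stand-in validator is deliberately minimal so the harness can be shown end to end.

```typescript
type Expected = "accept" | "reject";

interface Fixture {
  label: string;
  batch: string[]; // repos that were requested
  modelOutput: string; // raw text as the model would return it
  expected: Expected;
}

// Stand-in validator covering only JSON shape and batch membership.
function checkBatchMembership(modelOutput: string, batch: string[]): Expected {
  let parsed: unknown;
  try {
    parsed = JSON.parse(modelOutput);
  } catch {
    return "reject";
  }
  if (!Array.isArray(parsed)) return "reject";
  const returned = (parsed as unknown[]).map((e) =>
    typeof e === "object" && e !== null ? (e as Record<string, unknown>).repo : undefined,
  );
  // Same length plus every requested repo present rules out extras, gaps, and duplicates.
  const exact = returned.length === batch.length && batch.every((r) => returned.includes(r));
  return exact ? "accept" : "reject";
}

const entryFor = (repo: string) => ({ repo, categories: ["dev-tools"], tags: [], framework: null });

const membershipFixtures: Fixture[] = [
  {
    label: "exact batch",
    batch: ["example/alpha", "example/beta"],
    modelOutput: JSON.stringify([entryFor("example/alpha"), entryFor("example/beta")]),
    expected: "accept",
  },
  {
    label: "wrong repo returned",
    batch: ["example/alpha"],
    modelOutput: JSON.stringify([entryFor("example/other")]),
    expected: "reject",
  },
  {
    label: "extra repo returned",
    batch: ["example/alpha"],
    modelOutput: JSON.stringify([entryFor("example/alpha"), entryFor("example/other")]),
    expected: "reject",
  },
  {
    label: "requested repo missing from output",
    batch: ["example/alpha", "example/beta"],
    modelOutput: JSON.stringify([entryFor("example/alpha")]),
    expected: "reject",
  },
  { label: "invalid JSON", batch: ["example/alpha"], modelOutput: '[{"repo": ', expected: "reject" },
];

// Returns the labels of fixtures whose actual result differs from the expectation.
function runFixtures(
  fixtures: Fixture[],
  validate: (modelOutput: string, batch: string[]) => Expected,
): string[] {
  return fixtures.filter((f) => validate(f.modelOutput, f.batch) !== f.expected).map((f) => f.label);
}

const fixtureFailures = runFixtures(membershipFixtures, checkBatchMembership);
```

An empty `fixtureFailures` list means the validator behaves as expected; running the same fixtures against a validator that accepts everything should fail every `reject` case, which is a quick check that the fixtures themselves have teeth.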

## Acceptance Criteria

- Classification no longer runs through ungrounded `legacy prompt format` / `simple inference without tools` as the accepted production path.
- Every model classification is parsed by TypeScript before mutation.
- Classification output is matched back to the requested repo batch.
- Schema validation and taxonomy validation remain hard gates.
- Evidence validation exists for language/category/framework/tag claims.
- Unsupported or contradictory classification claims are rejected or marked `needs_review`.
- Workflow summary reports model, inference mode, tools/grounding status, counts, rejected items, and validation results.
- Tests cover malformed, mismatched, unsupported, and contradicted model outputs.
- `AGENTS.md` documents that raw model JSON is candidate classification, not truth.
- Final acceptance #54 can cite a successful classification run with evidence-rich summary output.

## Proof Required

Completion comment must include:

- Workflow diff and TypeScript parser/validator diff.
- Test output for classification parser/validator fixtures.
- Successful workflow run URL.
- Workflow summary excerpt showing inference mode, validation counts, and rejected/needs-review counts.
- Example of at least one rejected or needs-review classification from a fixture or test.
- Confirmation that `repos.yml` is not mutated by unvalidated raw model output.

## Evidence Labels for Implementer

Use these labels in the completion report:

- Direct evidence: workflow log, source diff, parser/test code, test output, workflow summary, artifact, commit SHA.
- Weak inference: a classification is plausible based on description/topics but not directly proven by source metadata.
- Unsupported: model output claims a language/category/framework/tag with no supporting input evidence.
- Contradicted: model output conflicts with collected metadata or canonical taxonomy.
- Blocked: the model/action cannot report its current inference mode, repo metadata is unavailable, or a grounding source cannot be fetched.

## Non-Goals

- Do not hand-edit `repos.yml` to clean this single batch.
- Do not merely increase prompt verbosity.
- Do not treat schema-valid JSON as classification-valid truth.
- Do not rely on model self-reporting confidence unless the validator can independently ground the fields.
- Do not collapse this into #69; #69 is structural, this is a concrete production classifier failure.

## Definition of Done

The classifier treats model output as untrusted candidate data, validates it through typed code, binds classifications to collected evidence, and refuses to mutate `repos.yml` from legacy/simple ungrounded inference output.
