feat: analysis improvements, gemma4 default model & schema promotion#68

Open
jmlweb wants to merge 15 commits into main from feat/analysis-improvements

jmlweb (Owner) commented Jan 30, 2026

Summary

This PR adds several improvements to make analysis more reliable, especially for small models, and migrates the default model from gemma3:4b to gemma4:e4b.

🚀 Gemma 4 as Default Model

  • Default model: gemma3:4b → gemma4:e4b (~5GB Q4, 128K context, native function calling)
  • Added gemma4:e2b (micro), gemma4:e4b (small), gemma4:31b (standard) to MODEL_STRATEGY_MAP
  • Promoted small strategy to full schema — models like gemma4:e4b, mistral:7b, llama3:8b now use the rich analysis schema (patterns, severity, before/after examples) instead of the simplified individual schema
  • Small strategy now processes up to 10 prompts per batch (was 1 with individual schema)
  • Fixed installCommand bug in model-suggester (was pointing to llama3.2 instead of default model)
  • Updated BATCH_STRATEGIES descriptions from size-based to capability-based
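
A minimal sketch of how the model-to-strategy lookup and schema selection described above could fit together. The names `MODEL_STRATEGY_MAP` and `detectBatchStrategy` come from this PR; the function bodies, the `selectSchema` helper, the fallback for unknown models, and keeping `micro` on the individual schema are assumptions made for illustration, not the actual implementation.

```typescript
// Illustrative sketch only: names mirror the PR text, bodies are assumptions.
type BatchStrategy = "micro" | "small" | "standard";
type SchemaKind = "individual" | "full";

// Model tags and tiers as listed in the PR description.
const MODEL_STRATEGY_MAP = new Map<string, BatchStrategy>([
  ["gemma4:e2b", "micro"],
  ["gemma4:e4b", "small"],
  ["gemma4:31b", "standard"],
  ["mistral:7b", "small"],
  ["llama3:8b", "small"],
]);

// Assumption: unknown models fall back to "small"; the PR does not say.
const FALLBACK_STRATEGY: BatchStrategy = "small";

function detectBatchStrategy(model: string): BatchStrategy {
  const key = model.trim().toLowerCase();
  return MODEL_STRATEGY_MAP.get(key) ?? FALLBACK_STRATEGY;
}

// After this PR, "small" uses the full schema. Keeping "micro" on the
// individual schema is an assumption, not something the PR states.
function selectSchema(strategy: BatchStrategy): SchemaKind {
  return strategy === "micro" ? "individual" : "full";
}

console.log(detectBatchStrategy("gemma4:e4b")); // small
console.log(selectSchema(detectBatchStrategy("gemma4:e4b"))); // full
```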

🎯 Gold Standard Benchmark

  • 50 curated prompts with human-rated quality scores
  • Covers all tiers: excellent (10), good (15), fair (15), poor (10)
  • Includes correlation calculation for model accuracy measurement
  • Used for calibrating scores across different providers
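
The correlation step could look roughly like the sketch below. The PR does not say which coefficient the benchmark uses, so Pearson correlation is an assumption here, as are the `RatedPrompt` shape and the `pearsonCorrelation` name.

```typescript
// Hypothetical shape: one benchmark prompt with its human and model scores.
interface RatedPrompt {
  humanScore: number;
  modelScore: number;
}

// Pearson correlation between human-rated and model-assigned scores.
// Returns 0 when either series has no variance (correlation is undefined).
function pearsonCorrelation(samples: RatedPrompt[]): number {
  const n = samples.length;
  if (n < 2) {
    throw new Error("Need at least two samples to compute a correlation");
  }

  let sumHuman = 0;
  let sumModel = 0;
  for (const sample of samples) {
    sumHuman += sample.humanScore;
    sumModel += sample.modelScore;
  }
  const meanHuman = sumHuman / n;
  const meanModel = sumModel / n;

  let covariance = 0;
  let varianceHuman = 0;
  let varianceModel = 0;
  for (const sample of samples) {
    const dHuman = sample.humanScore - meanHuman;
    const dModel = sample.modelScore - meanModel;
    covariance += dHuman * dModel;
    varianceHuman += dHuman * dHuman;
    varianceModel += dModel * dModel;
  }

  const denominator = Math.sqrt(varianceHuman * varianceModel);
  return denominator === 0 ? 0 : covariance / denominator;
}
```

A value near 1 would suggest the model ranks prompts much as the human raters do; a value near 0 would suggest its scores carry little signal.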

✅ Semantic Validation

  • Validates that scores correlate with issue counts
  • Detects when examples are not found in original prompts
  • Auto-corrects results when validation fails
  • Prevents logically inconsistent outputs
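
The checks listed above might be sketched as follows. Everything here is hypothetical: the `AnalysisResult` shape, the 0-10 score scale, the thresholds, and the choice to drop unverifiable issues and cap the score are assumptions used to illustrate the idea, not the validator shipped in this PR.

```typescript
interface Issue {
  description: string;
  example: string; // text the model claims to quote from a prompt
}

interface AnalysisResult {
  score: number; // assumed 0-10 scale
  issues: Issue[];
}

interface ValidationOutcome {
  valid: boolean;
  problems: string[];
  corrected: AnalysisResult;
}

// Hypothetical thresholds; the real validator may use different rules.
const HIGH_SCORE = 9;
const MANY_ISSUES = 3;
const CAPPED_SCORE = 7;

function validateSemantics(
  result: AnalysisResult,
  prompts: string[],
): ValidationOutcome {
  const problems: string[] = [];
  const haystack = prompts.map((prompt) => prompt.toLowerCase());

  // Keep only issues whose example actually appears in some original prompt.
  const issues = result.issues.filter((issue) => {
    const needle = issue.example.trim().toLowerCase();
    const found =
      needle.length > 0 && haystack.some((prompt) => prompt.includes(needle));
    if (!found) {
      problems.push(`Example not found in prompts: "${issue.example}"`);
    }
    return found;
  });

  // A near-perfect score alongside many issues is logically inconsistent.
  let score = result.score;
  if (score >= HIGH_SCORE && issues.length >= MANY_ISSUES) {
    problems.push(
      `Score ${String(score)} is inconsistent with ${String(issues.length)} issues`,
    );
    score = CAPPED_SCORE;
  }

  return { valid: problems.length === 0, problems, corrected: { score, issues } };
}
```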

🔄 Temperature Fallback Retry

  • When JSON parsing fails, retries with lower temperatures
  • Sequence: 0.3 → 0.1 → 0.0
  • More deterministic outputs reduce parse failures
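
A rough sketch of the retry loop, assuming the listed sequence includes the first attempt (the PR text leaves this open). The `GenerateFn` callback stands in for the real Ollama call and is hypothetical, as is the `generateJsonWithFallback` name.

```typescript
// Temperatures tried in order, as listed in the PR description.
const FALLBACK_TEMPERATURES: readonly number[] = [0.3, 0.1, 0.0];

// Hypothetical stand-in for the provider call that returns raw model text.
type GenerateFn = (prompt: string, temperature: number) => Promise<string>;

async function generateJsonWithFallback<T>(
  generate: GenerateFn,
  prompt: string,
  temperatures: readonly number[] = FALLBACK_TEMPERATURES,
): Promise<T> {
  let lastError: unknown = new Error("No temperatures configured");

  for (const temperature of temperatures) {
    const raw = await generate(prompt, temperature);
    try {
      // Success: return the first response that parses as JSON.
      return JSON.parse(raw) as T;
    } catch (error) {
      // Parse failure: remember the error and retry at the next temperature.
      lastError = error;
    }
  }

  throw new Error(
    `JSON parsing failed at every temperature: ${String(lastError)}`,
  );
}
```

Only parse failures trigger a retry in this sketch; network or provider errors thrown by `generate` propagate immediately, which is a design choice rather than something the PR specifies.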

📝 Enhanced SYSTEM_PROMPT_MINIMAL

  • More contrastive examples showing score progression
  • Clear examples for each tier (POOR, FAIR, GOOD, EXCELLENT)
  • Better calibrated scoring guidelines

Test plan

  • All 1678 tests passing (54 test files)
  • Build succeeds
  • detectBatchStrategy('gemma4:e4b') → 'small'
  • Small strategy selects full schema (not individual)
  • Manual validation with gemma4:e4b on real prompts (when model is available in Ollama)

🤖 Generated with Claude Code

jmlweb and others added 10 commits January 30, 2026 14:03
…ompts

- Add gold-standard benchmark with 50 curated prompts for calibration
- Add semantic validator to detect score/issue inconsistencies
- Implement temperature fallback retry (0.3 -> 0.1 -> 0.0) for Ollama
- Enhance SYSTEM_PROMPT_MINIMAL with more contrastive examples
- Auto-correct results when semantic validation fails

This improves analysis reliability, especially for small models.
- Fix @typescript-eslint/restrict-template-expressions by converting numbers to strings
- Fix @typescript-eslint/no-unnecessary-condition by removing redundant checks
- Fix test expectation to match actual error message
- Sort imports in benchmark/index.ts
- Document extractRealExamples() heuristic matching logic
- Identify category ID inconsistency between base.ts and schemas.ts
- Recommend unifying category mappings
- Note that individual mode already extracts real examples from AI
- All 50 aggregator tests passing
- undici: >=7.24.0 (CRLF injection, unbounded memory)
- hono: >=4.12.4
- @hono/node-server: >=1.19.10
- @modelcontextprotocol/sdk: >=1.26.0
- @isaacs/brace-expansion: >=5.0.1
- minimatch: >=10.2.3
- rollup: >=4.59.0
- flatted: >=3.4.2
- ajv: >=8.18.0
- qs: >=6.14.2

Resolves 29 vulnerabilities (2 low, 8 moderate, 19 high) → 0
…patibility)

ajv override forced >=8.18.0 but @eslint/eslintrc requires ajv v6.
ESLint's ajv@6.x is already patched (>=6.14.0), so no override needed.
All other overrides retained. Result: 0 vulnerabilities + lint passing.
## [3.0.2](v3.0.1...v3.0.2) (2026-03-22)

### Bug Fixes

* **security:** remove ajv override that broke ESLint (ajv v6/v8 incompatibility) ([88aac5c](88aac5c))
* update dependency overrides to resolve security vulnerabilities ([685befa](685befa))

### Documentation

* add quality assessment report ([722de83](722de83))
…egy to full schema

Gemma 4 E4B offers 128K context, native function calling, and configurable
thinking modes — enabling richer analysis with the full schema that was
previously reserved for large (>7GB) models.

Key changes:
- Default model: gemma3:4b → gemma4:e4b (small strategy, ~5GB Q4)
- Add gemma4:e2b (micro) and gemma4:31b (standard) to MODEL_STRATEGY_MAP
- Promote small strategy from individual to full schema (10 prompts/batch)
- Fix installCommand bug in model-suggester (was pointing to llama3.2)
- Update BATCH_STRATEGIES descriptions to capability-based

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
jmlweb changed the title from "feat: benchmark calibration, semantic validation & improved prompts" to "feat: analysis improvements, gemma4 default model & schema promotion" on Apr 6, 2026
jmlweb and others added 5 commits April 7, 2026 01:00
…dator

- Sort imports per simple-import-sort rule
- Convert numeric template expressions to String()
- Remove unnecessary conditionals on always-truthy result.patterns
- Fix metadata possibly-undefined with type assertion (ID is verified via matchedId)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Update semantic-release 25.0.2 → 25.0.3
- Add overrides for handlebars (>=4.7.9), picomatch (>=4.0.4),
  path-to-regexp (>=8.4.0), vite (>=7.3.2), brace-expansion (>=5.0.5),
  yaml (>=2.8.3)
- Bump existing lodash/lodash-es overrides to >=4.18.0
- Resolves all pnpm audit vulnerabilities

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…aging

Each batch reports how many prompts exhibit a pattern locally. When merging
batches, these counts must be summed to reflect the true global frequency.
The previous average produced artificially low frequencies (e.g. frequency=1
for patterns appearing across all batches).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
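
The frequency fix in the commit above can be illustrated with a small sketch. The `PatternCount` shape and the `mergePatternFrequencies` name are hypothetical; the point is only that per-batch counts are summed rather than averaged.

```typescript
// Hypothetical shape: how many prompts in one batch exhibit a pattern.
interface PatternCount {
  id: string;
  frequency: number;
}

// Merge per-batch counts into global counts by summing, not averaging.
function mergePatternFrequencies(batches: PatternCount[][]): PatternCount[] {
  const totals = new Map<string, number>();
  for (const batch of batches) {
    for (const pattern of batch) {
      totals.set(pattern.id, (totals.get(pattern.id) ?? 0) + pattern.frequency);
    }
  }
  return Array.from(totals, ([id, frequency]) => ({ id, frequency }));
}

const merged = mergePatternFrequencies([
  [{ id: "vague-request", frequency: 3 }],
  [
    { id: "vague-request", frequency: 2 },
    { id: "missing-context", frequency: 1 },
  ],
]);
console.log(merged);
// vague-request totals 5 (an average would have reported 2.5)
```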
Short user messages like "si", "ok", "ya" that are responses to assistant
questions are now detected by walking the parentUuid chain in Claude Code
logs. If the nearest ancestor assistant message ends with "?", the prompt
is marked as a confirmation and excluded from analysis.

- Add parseLogEntry() for lightweight extraction of uuid/parentUuid/content
- Add isConfirmationMessage() with max 5-hop chain traversal
- Two-pass readJsonlFile: build index first, then extract with detection
- Filter confirmations in CLI before sending to AI provider
- 13 new tests covering detection, edge cases, and integration

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
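
A sketch of the chain walk described in the commit above. The `uuid` and `parentUuid` fields and the 5-hop limit come from the commit message; the `role` field, the `LogEntry` shape, the length cutoff for "short" messages, and the function body are assumptions for illustration.

```typescript
// Hypothetical, simplified view of one Claude Code log entry.
interface LogEntry {
  uuid: string;
  parentUuid: string | null;
  role: "user" | "assistant"; // assumed field
  content: string;
}

const MAX_HOPS = 5; // from the commit message
const MAX_CONFIRMATION_LENGTH = 10; // assumed cutoff for "short" messages

// True when a short user message answers an assistant question, judged by
// the nearest assistant ancestor within MAX_HOPS ending with "?".
function isConfirmationMessage(
  entry: LogEntry,
  index: Map<string, LogEntry>,
): boolean {
  if (entry.role !== "user") return false;
  if (entry.content.trim().length > MAX_CONFIRMATION_LENGTH) return false;

  let parentId = entry.parentUuid;
  for (let hop = 0; hop < MAX_HOPS && parentId !== null; hop++) {
    const parent = index.get(parentId);
    if (parent === undefined) return false; // broken chain
    if (parent.role === "assistant") {
      return parent.content.trim().endsWith("?");
    }
    parentId = parent.parentUuid;
  }
  return false;
}
```

Building the `index` map first and classifying entries afterwards matches the two-pass read mentioned in the commit: a child entry can only be resolved once its ancestors are known.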