Spelling corrections identified and validated through a multi-stage process:

1. Automated detection using pyspellchecker library
2. False positive filtering via fine-tuned LLM classifier (Gemma3:4b via Ollama, GEPA-optimized)
3. Automated fixes applied by Claude Opus 4.5
4. Final human review and approval
Adds a Python-based spellcheck CI that blocks PRs with 100% reliable typos.

Features:
- Precompiled regex patterns for performance
- Skips code blocks, inline code, and YAML frontmatter
- Directory pruning (os.walk) for efficiency
- Excludes localizedContent (English-only check)
- GitHub Actions annotations for inline PR feedback
- Symlink escape protection
- JSON schema validation

Files:
- scripts/ci_spellcheck.py: Main detection script
- data/common_typos.json: 32 typo patterns
- .github/workflows/spellcheck.yml: CI workflow

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
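For readers who want a feel for how the features listed in this commit fit together, here is a minimal sketch. It is not the contents of scripts/ci_spellcheck.py: the `TYPOS` table, the `SKIP_DIRS` set, and all function names are assumptions made up for illustration. The only external convention relied on is the GitHub Actions `::error file=...,line=...,col=...::message` workflow command.

```python
import os
import re

# Hypothetical typo table; the real patterns live in data/common_typos.json.
TYPOS = {"suported": "supported", "teh": "the"}

# Compile once at import time so every file reuses the same pattern objects.
TYPO_RE = re.compile(
    r"\b(" + "|".join(map(re.escape, TYPOS)) + r")\b", re.IGNORECASE
)
INLINE_CODE_RE = re.compile(r"`[^`]*`")

# Assumed set of directories to prune; localizedContent is named in the commit.
SKIP_DIRS = {"localizedContent", ".git", "node_modules"}


def find_typos(text):
    """Yield (line_no, col, wrong, right) for typos found in prose only."""
    in_fence = False
    in_frontmatter = False
    for line_no, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        # YAML frontmatter is only recognized when it opens on the first line.
        if line_no == 1 and stripped == "---":
            in_frontmatter = True
            continue
        if in_frontmatter:
            if stripped == "---":
                in_frontmatter = False
            continue
        # Toggle on fence markers and skip everything inside the fence.
        if stripped.startswith("```"):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        # Blank out inline code with spaces so reported columns stay accurate.
        prose = INLINE_CODE_RE.sub(lambda m: " " * len(m.group()), line)
        for match in TYPO_RE.finditer(prose):
            wrong = match.group()
            yield line_no, match.start() + 1, wrong, TYPOS[wrong.lower()]


def iter_markdown_files(root):
    """Walk root, pruning skipped directories before os.walk descends."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Assigning to the slice mutates the list os.walk is iterating over.
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
        for name in filenames:
            if name.endswith(".md"):
                yield os.path.join(dirpath, name)


def annotation(path, line_no, col, wrong, right):
    """Format a GitHub Actions error annotation for one finding."""
    return (
        f"::error file={path},line={line_no},col={col}::"
        f"Typo '{wrong}', did you mean '{right}'?"
    )
```

A CI entry point would loop over `iter_markdown_files(".")`, print one `annotation(...)` per finding, and exit non-zero if any were found. Symlink protection and schema validation are left out of this sketch.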
|
Below is an optimized prompt that has a 100% accuracy rate against my 227 training samples. I used Gemma3:4b locally with this prompt to review all docs.

Contextual Spellcheck Prompt (GEPA Optimized - 100% F1)

System Instructions

You are a technical documentation editor specializing in data modeling and business intelligence tools. Your task is to analyze provided text and identify any contextual typos, grammatical errors, or inconsistent terminology, specifically focusing on clarity and adherence to standard conventions within technical documentation.

Important Considerations & Specifics:
Output Format

Return your findings as a JSON array. Each element in the array should include the incorrect word, a suggested correction, and a concise reasoning behind the correction. For example:

Few-Shot Examples

Example 1
Text: Use XMLA, where as REST is slower.

Example 2
Text: The feature is suported in version 3.

Example 3
Text: The OLS (Object Level Security) feature restricts access to objects.

Example 4
Text: Use params to filter the results. |
|
Hey Eugene |
Dangit! I tested it on a repo, but made changes after. I will investigate. |
|
@mlonsk I've addressed the issue in my code (and found a few more typos), but it looks like the build job is not designed to handle pull requests from forks. This is what Claude Code said; I have not yet had a chance to validate it manually, so it may be incorrect.

Root Cause

The AZURE_STATIC_WEB_APPS_API_TOKEN secret is not accessible. This typically happens when:
If this is your PR (not from a fork):
If this is a fork PR:
Quick fix for the workflow:
This would allow the build to pass even when secrets aren't available (useful for fork PRs). |
I'd recommend that we look for existing solutions in the space rather than maintaining this. If there's nothing holistic, then we could still reduce the maintenance burden with more complete building blocks. We could use something like mq to extract prose and an existing CLI spell checker (e.g., hunspell).
If we do roll our own spellchecker, then this file format is way over-engineered. We just need a 2-field or 3-field structured format of wrong, right, category, where category is optional. This is easy enough as a delimited file. The false positives can simply be another file. Then we can skip JSON deserialization. The version and updated fields don't make sense for a local format that is consumed by a single script. This will simplify the ingestion and validation in the script as well. Again, this feedback applies only if we roll our own spellchecking script.
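As a rough illustration of the delimited format suggested above, the loader could be as small as the sketch below. The tab delimiter, the `#` comment convention, and the `load_typos` name are assumptions for the example, not something specified in this review.

```python
import csv


def load_typos(lines):
    """Parse tab-delimited rows of: wrong, right[, category].

    `lines` is any iterable of strings, such as an open file.
    Returns a dict mapping the lowercased wrong form to (right, category).
    """
    typos = {}
    for row_no, row in enumerate(csv.reader(lines, delimiter="\t"), start=1):
        # Skip blank lines and (assumed) '#' comment lines.
        if not row or row[0].startswith("#"):
            continue
        if len(row) not in (2, 3):
            raise ValueError(
                f"row {row_no}: expected 2 or 3 fields, got {len(row)}"
            )
        wrong, right = row[0].strip(), row[1].strip()
        # The category field is optional.
        category = row[2].strip() if len(row) == 3 else None
        typos[wrong.lower()] = (right, category)
    return typos
```

The field-count check is the only validation step here, which is the simplification the comment is pointing at. A false-positives file could be read the same way with a single field per row.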
|
I'd recommend breaking out the corrections from the script and new build step. The typos are valuable as-is, but I don't know that maintaining our own hand-rolled spellchecking is the right choice. |
Makes sense, I'll figure out how to do that. I used pyspellchecker locally to catch spelling errors and then a local LLM to catch items that are not spelling errors but are contextual errors. My main thought with the GitHub Action was catching verified typos that have occurred in the past, treating them like regressions, so to speak. But I'm still very new to all this DevOps stuff. |
I automatically scanned for typos, trained a local LLM to look for contextual errors, then I had Opus apply the fixes and I manually reviewed each one.
Also added a blocking GitHub Action for common typos. We can add an inline fix button, but it would require write permissions to the PR.