Fix/typo corrections #256

Closed
eugman wants to merge 3 commits into TabularEditor:main from eugman:fix/typo-corrections

Conversation

@eugman
Collaborator

@eugman eugman commented Feb 1, 2026

I automatically scanned for typos, trained a local LLM to look for contextual errors, then had Opus apply the fixes, and manually reviewed each one.

Also added a blocking GitHub Action for common typos. We can add an inline fix button, but it would require write permissions to the PR.

eugman and others added 2 commits February 1, 2026 14:32
Spelling corrections identified and validated through a multi-stage process:
1. Automated detection using the pyspellchecker library (a minimal sketch follows this list)
2. False positive filtering via fine-tuned LLM classifier (Gemma3:4b via Ollama, GEPA-optimized)
3. Automated fixes applied by Claude Opus 4.5
4. Final human review and approval
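
Stage 1 boils down to something like the minimal sketch below (not the exact script). It assumes pyspellchecker's stock SpellChecker API; the docs root and the *.md glob are illustrative, and stages 2-4 (LLM filtering, fixes, review) are out of scope:

```python
import re
from pathlib import Path

from spellchecker import SpellChecker  # pip install pyspellchecker

spell = SpellChecker()
WORD_RE = re.compile(r"[A-Za-z']+")

def candidate_typos(md_file: Path) -> set[str]:
    """Words in one markdown file that the dictionary does not recognize."""
    words = [w.lower() for w in WORD_RE.findall(md_file.read_text(encoding="utf-8"))]
    return spell.unknown(words)

for path in Path("docs").rglob("*.md"):  # illustrative docs root
    for word in sorted(candidate_typos(path)):
        print(f"{path}: {word} -> {spell.correction(word)}")
```
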
Adds a Python-based spellcheck CI check that blocks PRs containing typos that can be flagged with 100% reliability.

Features (a condensed sketch follows this commit message):
- Precompiled regex patterns for performance
- Skips code blocks, inline code, and YAML frontmatter
- Directory pruning (os.walk) for efficiency
- Excludes localizedContent (English-only check)
- GitHub Actions annotations for inline PR feedback
- Symlink escape protection
- JSON schema validation

Files:
- scripts/ci_spellcheck.py: Main detection script
- data/common_typos.json: 32 typo patterns
- .github/workflows/spellcheck.yml: CI workflow

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
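
A condensed, hypothetical sketch of how those features can fit together is below. It assumes data/common_typos.json holds entries shaped like {"wrong": ..., "right": ...} under a "typos" key; the real scripts/ci_spellcheck.py may be structured differently, and frontmatter skipping, symlink protection, and schema validation are omitted for brevity.

```python
import json
import os
import re

# Precompile every pattern once up front for performance.
# The file layout ({"typos": [{"wrong": ..., "right": ...}]}) is an assumption.
with open("data/common_typos.json", encoding="utf-8") as f:
    PATTERNS = [
        (re.compile(rf"\b{re.escape(entry['wrong'])}\b", re.IGNORECASE), entry["right"])
        for entry in json.load(f)["typos"]
    ]

FENCE = re.compile(r"^(`{3}|~{3})")
INLINE_CODE = re.compile(r"`[^`]*`")

def check_file(path: str) -> int:
    errors = 0
    in_fence = False
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if FENCE.match(line):
                in_fence = not in_fence  # toggle on opening/closing fences
                continue
            if in_fence:
                continue  # never spellcheck fenced code blocks
            line = INLINE_CODE.sub("", line)  # nor inline code spans
            for pattern, right in PATTERNS:
                for match in pattern.finditer(line):
                    # GitHub Actions workflow command -> inline PR annotation
                    print(f"::error file={path},line={lineno}::"
                          f"'{match.group(0)}' should be '{right}'")
                    errors += 1
    return errors

total = 0
for root, dirs, files in os.walk("."):
    # Prune directories in place; localizedContent is skipped because
    # the check is English-only.
    dirs[:] = [d for d in dirs if d not in {".git", "localizedContent"}]
    total += sum(check_file(os.path.join(root, name))
                 for name in files if name.endswith(".md"))

raise SystemExit(1 if total else 0)
```
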
@eugman
Collaborator Author

eugman commented Feb 2, 2026

Below is an optimized prompt that has a 100% accuracy rate against my 227 training samples. I used Gemma3:4b locally with this prompt to review all docs.

Contextual Spellcheck Prompt (GEPA Optimized - 100% F1)

System Instructions

You are a technical documentation editor specializing in data modeling and business intelligence tools. Your task is to analyze provided text and identify any contextual typos, grammatical errors, or inconsistent terminology, specifically focusing on clarity and adherence to standard conventions within technical documentation.

Important Considerations & Specifics:

  • File Extensions: Always capitalize file extensions (e.g., "pbix" → "PBIX", "pbit" → "PBIT").
  • Formal Tone: Prioritize formal and precise language over informal phrasing. For example, replace "lead to" with "resulted in" or "caused." Strive for direct and unambiguous phrasing.
  • Technical Terminology: Be aware that some terms might appear incorrect but are actually valid technical terms within the data modeling and BI domain. Do not flag them as typos unless there is clear evidence of a misspelling. Examples of valid technical terms not to flag include "averagex" (a DAX function). A term's validity should be confirmed by domain expertise; do not assume a term is incorrect simply because it is unfamiliar.
  • Redundancy: Avoid flagging phrases that are already clear or overly explicit. Phrases like "No data" are considered clear and do not need correction.
  • Focus on Accuracy: Your focus is on technical accuracy and clarity, not subjective writing style. Concise and unambiguous phrasing is key.
  • Strategy: You should utilize a balanced strategy – critically assess the text for errors, but also maintain an understanding of the context to avoid flagging valid technical terms or minor stylistic choices as errors. Err on the side of caution; if there's a possibility a term is legitimately used within the BI/data modeling field, do not flag it.
  • "Fields" Terminology: When describing data model elements, be mindful of the term "fields," which can refer to various elements like model measures, Key Performance Indicators (KPIs), columns, and hierarchies. Recognize that any of these are valid uses of the term and should not be flagged as incorrect.

Output Format

Return your findings as a JSON array. Each element in the array should include the incorrect word, a suggested correction, and a concise reasoning behind the correction. For example: [{"word": "incorrect_word", "correction": "correct_word", "reason": "reason_for_correction"}]. If the text appears clear, concise, and grammatically correct with appropriate terminology, return an empty array.

Few-Shot Examples

Example 1

Text: Use XMLA, where as REST is slower.
Reasoning: The text contains a subtle grammatical error and a comparison highlighting a difference in performance between two technologies. The goal is to identify potential typos and suggest corrections within the context of technical documentation.
Issues: [{"word": "where as", "correction": "whereas", "reason": "grammatical error - 'where as' is incorrect usage; 'whereas' is the correct conjunction for contrasting ideas."}, {"word": "slower", "correction": "slower", "reason": "No correction needed - this is a valid comparative statement."}]

Example 2

Text: The feature is suported in version 3.
Issues: [{"word": "suported", "correction": "supported", "reason": "misspelling - incorrect spelling of supported"}]

Example 3

Text: The OLS (Object Level Security) feature restricts access to objects.
Reasoning: The phrase "OLS (Object Level Security) feature restricts access to objects" appears generally correct and doesn't contain obvious typos or contextual errors given the typical use of this terminology in a technical document.
Issues: []

Example 4

Text: Use params to filter the results.
Issues: []
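
For completeness, the prompt can be driven against a local model through Ollama's /api/chat HTTP endpoint along these lines. The payload shape follows Ollama's documented chat API; SYSTEM_PROMPT stands in for the full prompt above, and how the docs were chunked into requests is not shown:

```python
import json

import requests

SYSTEM_PROMPT = "..."  # the full GEPA-optimized prompt shown above

def review(text: str) -> list[dict]:
    """Send one chunk of documentation text to the local model and
    return the parsed JSON array of issues (empty if none)."""
    resp = requests.post(
        "http://localhost:11434/api/chat",  # Ollama's default local port
        json={
            "model": "gemma3:4b",
            "messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": text},
            ],
            "stream": False,
        },
        timeout=120,
    )
    resp.raise_for_status()
    content = resp.json()["message"]["content"]
    try:
        issues = json.loads(content)
    except json.JSONDecodeError:
        return []  # model drifted from the JSON-array contract; skip
    return issues if isinstance(issues, list) else []

print(review("The feature is suported in version 3."))
```
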

@mlonsk
Collaborator

mlonsk commented Feb 2, 2026

Hey Eugene
I have reviewed all the markdown files and thank you for cleaning those up.
However, the workflow you built fails, and I cannot approve the PR before that is fixed :)

@eugman
Collaborator Author

eugman commented Feb 2, 2026

Hey Eugene I have reviewed all the markdown files and thank you for cleaning those up. However, the workflow you built fails, and I cannot approve the PR before that is fixed :)

Dangit! I tested it on a repo but made changes afterward. I will investigate.

@eugman
Collaborator Author

eugman commented Feb 3, 2026

@mlonsk I've addressed the issue in my code (and found a few more typos), but it looks like the build job is not designed to handle pull requests from forks. The analysis below is what Claude Code said; I have not yet had a chance to validate it manually, so it may be incorrect.

Root Cause

The AZURE_STATIC_WEB_APPS_API_TOKEN secret is not accessible. This typically happens when:

  1. PR from a fork - GitHub doesn't expose repository secrets to PRs from forks for security reasons
  2. Secret expired/deleted - The Azure SWA API token may have been rotated or deleted
  3. Secret name mismatch - The workflow uses AZURE_STATIC_WEB_APPS_API_TOKEN_DELIGHTFUL_MUD_081AFFE03 but it may not exist

Solution Options

If this is your PR (not from a fork):

  • Ask a repository maintainer to check if the secret AZURE_STATIC_WEB_APPS_API_TOKEN_DELIGHTFUL_MUD_081AFFE03 exists in the repo settings
  • The token may need to be regenerated from the Azure Portal → Static Web Apps → Manage deployment token

If this is a fork PR:

  • This is expected behavior - fork PRs can't access secrets
  • A maintainer needs to merge or run the workflow from the main repo
  • Alternatively, the workflow could be updated to use skip_deploy_on_missing_secrets: true for PR previews

Quick fix for the workflow:

```yaml
- name: Build And Deploy
  uses: Azure/static-web-apps-deploy@v1
  with:
    azure_static_web_apps_api_token: ${{ secrets.AZURE_STATIC_WEB_APPS_API_TOKEN_DELIGHTFUL_MUD_081AFFE03 }}
    skip_deploy_on_missing_secrets: true # Add this line
    ...
```

This would allow the build to pass even when secrets aren't available (useful for fork PRs).

Collaborator

I'd recommend that we look for existing solutions in the space rather than maintaining this. If there's nothing holistic, then we could still reduce the maintenance burden with more complete building blocks. We could use something like mq to extract prose and an existing CLI spell checker (e.g., hunspell).
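
As a rough illustration of that building-blocks idea (a sketch, not a definitive implementation): strip code spans with a few lines of Python, then let hunspell do the dictionary work via its -l flag, which lists misspelled words read from stdin. The mq extraction step is skipped here, and the en_US dictionary name is an assumption about the local install.

```python
import re
import subprocess
import sys

markdown = open(sys.argv[1], encoding="utf-8").read()
prose = re.sub(r"`{3}.*?`{3}", "", markdown, flags=re.DOTALL)  # drop fenced blocks
prose = re.sub(r"`[^`]*`", "", prose)                          # drop inline code

# hunspell -l prints one misspelled word per line; -d picks the dictionary
result = subprocess.run(
    ["hunspell", "-l", "-d", "en_US"],
    input=prose, capture_output=True, text=True,
)
print("\n".join(sorted(set(result.stdout.split()))))
```
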

Collaborator

If we do roll our own spellchecker, then this file format is way over-engineered. We just need a 2-field or 3-field structured format of wrong, right, category, where category is optional. This is easy enough as a delimited file (see the sketch below). The false positives can simply be another file. Then we can skip JSON deserialization. Version and updated don't make sense for a local format that is consumed by a single script. This will simplify the ingestion and validation in the script as well. Again, this feedback only applies if we roll our own spellchecking script.
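
Hypothetically, the whole format could be as small as this; the sample rows and the field order (wrong, right, optional category) are made up for illustration:

```python
import csv
import io

# Tab-delimited: wrong<TAB>right<TAB>category (category may be empty)
SAMPLE = "teh\tthe\tcommon\nrecieve\treceive\tcommon\nseperate\tseparate\t\n"

typos = {
    row[0]: row[1]
    for row in csv.reader(io.StringIO(SAMPLE), delimiter="\t")
    if row
}
print(typos)  # {'teh': 'the', 'recieve': 'receive', 'seperate': 'separate'}
```
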

@greggyb
Collaborator

greggyb commented Feb 3, 2026

I'd recommend breaking out the corrections from the script and new build step. The typos are valuable as-is, but I don't know that maintaining our own hand-rolled spellchecking is the right choice.

@eugman
Collaborator Author

eugman commented Feb 3, 2026

I'd recommend breaking out the corrections from the script and new build step. The typos are valuable as-is, but I don't know that maintaining our own hand-rolled spellchecking is the right choice.

Makes sense, I'll figure out how to do that.

I used pyspellchecker locally to catch spelling errors, and then a local LLM to catch contextual errors that aren't simple misspellings.

My main thought with the GitHub Action was to catch verified typos that have occurred in the past, treating them like regressions, so to speak. But I'm still very new to all this DevOps stuff.

@eugman eugman closed this Feb 3, 2026
@eugman eugman mentioned this pull request Feb 3, 2026