
Build Arabic diacritization Chrome extension #14

Open
GiladAmar wants to merge 6 commits into master from claude/arabic-diacritization-extension-h6yxm

@GiladAmar
Owner

No description provided.

- Update manifest.json with Arabic extension name and description
- Replace Hebrew letters and diacritics with Arabic equivalents
- Update background.js with Arabic harakat and shadda prediction logic
- Modify content.js to detect Arabic text (Unicode U+0600-U+06FF)
- Update package.json metadata for Arabic Tashkeel extension
- Add MODEL_TRAINING.md with instructions for training Arabic model
- Update README.md with Arabic-specific documentation
- Add .gitignore for build artifacts

The extension now compiles successfully and is ready for Arabic
diacritization once the model is trained and converted.
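The Arabic detection mentioned for content.js can be sketched as follows. This is a minimal illustration, assuming a simple regular-expression check against the U+0600-U+06FF block named in the commit message; the helper name `containsArabic` is hypothetical and is not taken from the PR.

```javascript
// Matches any single code point in the basic Arabic block (U+0600-U+06FF).
// No "g" flag, so RegExp.prototype.test keeps no state between calls.
const ARABIC_BLOCK = /[\u0600-\u06FF]/;

// Hypothetical helper: returns true if the string contains at least
// one character from the Arabic block.
function containsArabic(text) {
  return ARABIC_BLOCK.test(text);
}
```

A content script could call such a helper on each text node and skip nodes with no Arabic characters. Note that this range does not cover the Arabic Supplement or Presentation Forms blocks.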
@GiladAmar GiladAmar requested a review from Copilot January 21, 2026 12:14
@GiladAmar GiladAmar self-assigned this Jan 21, 2026

Copilot AI left a comment

Pull request overview

This PR adapts a Hebrew diacritization Chrome extension to work with Arabic text by replacing Hebrew-specific language processing with Arabic equivalents.

Changes:

  • Updated package metadata and extension manifest to reflect Arabic functionality
  • Replaced Hebrew Unicode ranges and character sets with Arabic equivalents
  • Renamed functions and variables from Hebrew terminology (nekudot) to Arabic terminology (tashkeel/harakat)
  • Added comprehensive documentation including installation guide and model training instructions
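For readers unfamiliar with the Arabic character set involved, the sketch below lists the eight core tashkeel marks (U+064B through U+0652) and shows how existing marks could be stripped before prediction. The constant and function names are hypothetical; the PR's actual identifiers in background.js may differ.

```javascript
// Core Arabic diacritics (tashkeel), Unicode U+064B through U+0652.
const HARAKAT = {
  FATHATAN: "\u064B",
  DAMMATAN: "\u064C",
  KASRATAN: "\u064D",
  FATHA: "\u064E",
  DAMMA: "\u064F",
  KASRA: "\u0650",
  SHADDA: "\u0651",
  SUKUN: "\u0652",
};

// Global pattern covering the same contiguous range.
const HARAKAT_PATTERN = /[\u064B-\u0652]/g;

// Hypothetical helper: removes any existing diacritics so that only
// bare letters remain.
function stripTashkeel(text) {
  return text.replace(HARAKAT_PATTERN, "");
}
```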

Reviewed changes

Copilot reviewed 6 out of 8 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| package.json | Updated package name, version, description, and keywords from Hebrew to Arabic |
| manifest.json | Changed extension name, version, and description to reflect Arabic functionality |
| content.js | Updated Unicode range detection from Hebrew to Arabic and renamed functions from nekudot to tashkeel |
| background.js | Replaced Hebrew diacritical marks with Arabic harakat, updated character sets, and adapted prediction logic |
| README.md | Rewrote documentation with Arabic-specific features, installation steps, and usage instructions |
| MODEL_TRAINING.md | Added new documentation for training and converting Arabic diacritization models |

Comment thread background.js Outdated
Comment thread background.js Outdated
Comment thread background.js Outdated
Comment thread README.md Outdated
- Replace TensorFlow.js with client-server architecture
- Create Flask server (server/tashkeel_server.py) for CATT model inference
- Update background.js to communicate with local server via HTTP
- Remove TensorFlow.js dependency from package.json
- Add notifications permission for server status alerts
- Update manifest.json with host_permissions for localhost:5000
- Comprehensive README with server setup and usage instructions
- Remove obsolete MODEL_TRAINING.md (no longer needed with catt-tashkeel package)

The extension now uses the state-of-the-art CATT model from abjadai/catt
which provides superior diacritization accuracy compared to simpler models.

Users need to run the Python server locally before using the extension.
The catt-tashkeel package automatically downloads the pre-trained model.
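The client-server exchange described in this commit might look roughly like the sketch below. The commit only states that background.js talks to a local server on localhost:5000 over HTTP; the `/tashkeel` path, the JSON request and response shape, and both function names are assumptions made for illustration.

```javascript
// Base URL taken from the host_permissions entry named in the commit.
const SERVER_URL = "http://localhost:5000";

// Hypothetical: builds the request for an assumed POST /tashkeel endpoint
// that accepts {"text": "..."} as JSON.
function buildTashkeelRequest(text) {
  return {
    url: `${SERVER_URL}/tashkeel`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ text }),
    },
  };
}

// Hypothetical: sends the text and returns the diacritized string,
// assuming the server answers with {"text": "..."}.
// fetchImpl is injectable so the function can be exercised without a server.
async function requestTashkeel(text, fetchImpl = fetch) {
  const { url, init } = buildTashkeelRequest(text);
  const response = await fetchImpl(url, init);
  if (!response.ok) {
    throw new Error(`Tashkeel server returned HTTP ${response.status}`);
  }
  const payload = await response.json();
  return payload.text;
}
```

If the server is not running, `fetch` rejects, which is the point where a background script could raise the server status notification mentioned above.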
@code-review-doctor code-review-doctor Bot left a comment

Some food for thought. View full project report here.

Comment thread server/tashkeel_server.py Outdated
Convert extension to use ONNX.js for in-browser inference

BREAKING CHANGE: Replaced Python server with fully client-side solution

- Remove Python server dependency (server/ directory deleted)
- Add ONNX.js for browser-based model inference
- Port Python tokenizer to JavaScript (tashkeel_tokenizer.js)
- Port Buckwalter transliteration to JavaScript (buckwalter.js)
- Implement complete CATT inference pipeline in background.js
- Add model export script (scripts/export_onnx.py)
- Update manifest to expose model files as web-accessible resources
- Add onnxruntime-web dependency to package.json
- Create comprehensive MODEL_SETUP.md guide
- Update README with new architecture documentation

Benefits:
- No Python server required
- Works completely offline after initial load
- Better privacy (all processing in browser)
- Faster startup (no server to start)
- More reliable (no network communication)

Model Setup Required:
Users must export CATT models to ONNX format using the provided
script before building the extension. See MODEL_SETUP.md for details.

The extension now loads ~500 MB of ONNX models directly in the browser
and runs inference with WebAssembly.
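As an illustration of the Buckwalter port listed in this commit, the sketch below maps Arabic letters and diacritics to their standard Buckwalter ASCII symbols. It is not the PR's buckwalter.js: the names are hypothetical, and it covers only the two contiguous Unicode runs U+0621-U+063A and U+0640-U+0652, leaving every other character unchanged.

```javascript
// Each run lists the Buckwalter symbols for consecutive code points,
// beginning at `start`.
const BUCKWALTER_RUNS = [
  // U+0621 (hamza) through U+063A (ghain)
  { start: 0x0621, latin: "'|>&<}AbptvjHxd*rzs$SDTZEg" },
  // U+0640 (tatweel) through U+0652 (sukun), including the harakat
  { start: 0x0640, latin: "_fqklmnhwYyFNKaui~o" },
];

// Build the Arabic -> Buckwalter lookup table once.
const ARABIC_TO_BUCKWALTER = new Map();
for (const { start, latin } of BUCKWALTER_RUNS) {
  [...latin].forEach((symbol, offset) => {
    ARABIC_TO_BUCKWALTER.set(String.fromCodePoint(start + offset), symbol);
  });
}

// Hypothetical helper: transliterates mapped characters and passes
// everything else (spaces, digits, Latin text) through untouched.
function toBuckwalter(text) {
  return Array.from(text, (ch) => ARABIC_TO_BUCKWALTER.get(ch) ?? ch).join("");
}
```

Building the table from code point runs keeps the source free of hard-to-read combining marks; a reverse map for decoding model output could be derived from the same data.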
@code-review-doctor code-review-doctor Bot left a comment

Looks good. Worth considering though. View full project report here.

Comment thread scripts/export_onnx.py Outdated
Co-authored-by: code-review-doctor[bot] <72320148+code-review-doctor[bot]@users.noreply.github.com>
3 participants