Build Arabic diacritization Chrome extension #14
Open · GiladAmar wants to merge 6 commits
Conversation
- Update manifest.json with Arabic extension name and description
- Replace Hebrew letters and diacritics with Arabic equivalents
- Update background.js with Arabic harakat and shadda prediction logic
- Modify content.js to detect Arabic text (Unicode U+0600–U+06FF)
- Update package.json metadata for the Arabic Tashkeel extension
- Add MODEL_TRAINING.md with instructions for training an Arabic model
- Update README.md with Arabic-specific documentation
- Add .gitignore for build artifacts

The extension now compiles successfully and is ready for Arabic diacritization once the model is trained and converted.
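The Arabic-detection step in content.js can be sketched as a simple Unicode-range test. This is a minimal illustration of the range the commit describes, not the PR's actual code; the function name `containsArabic` is an assumption.

```javascript
// Sketch: detect whether a piece of text contains Arabic characters.
// The U+0600–U+06FF block covers Arabic letters, the harakat
// (diacritics), and Arabic-Indic digits.
const ARABIC_RANGE = /[\u0600-\u06FF]/;

function containsArabic(text) {
  return ARABIC_RANGE.test(text);
}
```

A content script would run such a check on text nodes before sending them off for diacritization, skipping purely Latin-script content.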
Pull request overview
This PR adapts a Hebrew diacritization Chrome extension to work with Arabic text by replacing Hebrew-specific language processing with Arabic equivalents.
Changes:
- Updated package metadata and extension manifest to reflect Arabic functionality
- Replaced Hebrew Unicode ranges and character sets with Arabic equivalents
- Renamed functions and variables from Hebrew terminology (nekudot) to Arabic terminology (tashkeel/harakat)
- Added comprehensive documentation including installation guide and model training instructions
Reviewed changes
Copilot reviewed 6 out of 8 changed files in this pull request and generated 4 comments.
Summary per file:
| File | Description |
|---|---|
| package.json | Updated package name, version, description, and keywords from Hebrew to Arabic |
| manifest.json | Changed extension name, version, and description to reflect Arabic functionality |
| content.js | Updated Unicode range detection from Hebrew to Arabic and renamed functions from nekudot to tashkeel |
| background.js | Replaced Hebrew diacritical marks with Arabic harakat, updated character sets, and adapted prediction logic |
| README.md | Rewrote documentation with Arabic-specific features, installation steps, and usage instructions |
| MODEL_TRAINING.md | Added new documentation for training and converting Arabic diacritization models |
Replace TensorFlow.js with a client-server architecture:

- Create a Flask server (server/tashkeel_server.py) for CATT model inference
- Update background.js to communicate with the local server via HTTP
- Remove the TensorFlow.js dependency from package.json
- Add the notifications permission for server status alerts
- Update manifest.json with host_permissions for localhost:5000
- Rewrite the README with server setup and usage instructions
- Remove the obsolete MODEL_TRAINING.md (no longer needed with the catt-tashkeel package)

The extension now uses the state-of-the-art CATT model from abjadai/catt, which provides superior diacritization accuracy compared to simpler models. Users need to run the Python server locally before using the extension; the catt-tashkeel package automatically downloads the pre-trained model.
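The background-script side of this client-server design amounts to a single HTTP round trip to the local inference server. A minimal sketch follows; the `/diacritize` endpoint path and the `{text}` / `{result}` JSON shapes are assumptions, not confirmed details of the PR's server.

```javascript
// Sketch: send raw Arabic text to a local inference server and read
// back the diacritized string. Assumes the server listens on
// localhost:5000 (matching the manifest's host_permissions entry).
async function diacritize(text) {
  const resp = await fetch("http://localhost:5000/diacritize", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text }),
  });
  if (!resp.ok) throw new Error(`Server error: ${resp.status}`);
  const { result } = await resp.json();
  return result;
}
```

If the fetch fails, the extension could use the newly added notifications permission to tell the user the server is not running.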
Convert the extension to use ONNX.js for in-browser inference.

BREAKING CHANGE: Replaced the Python server with a fully client-side solution.

- Remove the Python server dependency (server/ directory deleted)
- Add ONNX.js for browser-based model inference
- Port the Python tokenizer to JavaScript (tashkeel_tokenizer.js)
- Port the Buckwalter transliteration to JavaScript (buckwalter.js)
- Implement the complete CATT inference pipeline in background.js
- Add a model export script (scripts/export_onnx.py)
- Update the manifest to expose model files as web-accessible resources
- Add the onnxruntime-web dependency to package.json
- Create a comprehensive MODEL_SETUP.md guide
- Update the README with the new architecture documentation

Benefits:

- No Python server required
- Works completely offline after the initial load
- Better privacy (all processing happens in the browser)
- Faster startup (no server to start)
- More reliable (no network communication)

Model setup required: users must export the CATT models to ONNX format using the provided script before building the extension; see MODEL_SETUP.md for details.

The extension now loads roughly 500 MB of ONNX models directly in the browser and performs inference using WebAssembly for performance.
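The Buckwalter port mentioned above maps Arabic script to a reversible ASCII encoding, which is a common preprocessing step for Arabic NLP models. The sketch below shows only a small subset of the standard Buckwalter table for illustration; the PR's buckwalter.js would carry the full mapping for every letter and diacritic.

```javascript
// Sketch: partial Buckwalter transliteration table (standard scheme,
// subset only). Keys are Arabic code points, values are their ASCII
// Buckwalter equivalents.
const BUCKWALTER = {
  "\u0627": "A", // alif
  "\u0628": "b", // ba
  "\u062A": "t", // ta
  "\u0643": "k", // kaf
  "\u0644": "l", // lam
  "\u0645": "m", // mim
  "\u064E": "a", // fatha
  "\u064F": "u", // damma
  "\u0650": "i", // kasra
  "\u0651": "~", // shadda
};

function toBuckwalter(text) {
  // Characters outside the table (spaces, Latin, punctuation)
  // pass through unchanged.
  return [...text].map((ch) => BUCKWALTER[ch] ?? ch).join("");
}
```

Because the mapping is one-to-one, the inverse table recovers the Arabic text after the model emits its diacritized Buckwalter output.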
Looks good; the flagged points are worth considering, though.
Co-authored-by: code-review-doctor[bot] <72320148+code-review-doctor[bot]@users.noreply.github.com>