spaCy named-entity recogniser + PyTorch GRU part-of-speech tagger for automatic CV parsing
Recruiters spend too much time pulling structured data out of free-form CVs.
This repo tackles that in two steps:
| Step | Model | Dataset | Goal |
|---|---|---|---|
| 1 NER | spaCy custom NER | Entities (DataTurks) | Find names, skills, colleges, emails, etc. |
| 2 POS | GRU sequence tagger in PyTorch | BatteryData POS (HF Datasets) | Provide syntactic features for downstream parsers. |
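To make step 2 more concrete, here is a minimal sketch of what a GRU sequence tagger can look like in PyTorch. It is not the notebook's actual model: the class name `GRUTagger`, the bidirectional encoder, the layer sizes and the 17-tag output are all assumptions chosen for illustration.

```python
import torch
from torch import nn


class GRUTagger(nn.Module):
    """Hypothetical tagger: embedding -> GRU -> per-token tag scores."""

    def __init__(self, vocab_size, num_tags, emb_dim=64, hidden_dim=128, pad_idx=0):
        super().__init__()
        # padding_idx keeps the padding token's embedding fixed at zero
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=pad_idx)
        # bidirectional is an assumption; the README only says "GRU"
        self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)   # (batch, seq_len, emb_dim)
        outputs, _ = self.gru(embedded)        # (batch, seq_len, 2 * hidden_dim)
        return self.classifier(outputs)        # (batch, seq_len, num_tags)


torch.manual_seed(0)
model = GRUTagger(vocab_size=100, num_tags=17)  # sizes are illustrative only
batch = torch.randint(1, 100, (4, 12))          # 4 fake sentences of 12 token ids
logits = model(batch)
predicted_tags = logits.argmax(dim=-1)          # one tag id per token
print(logits.shape, predicted_tags.shape)
```

For training, such a model would typically be paired with `nn.CrossEntropyLoss(ignore_index=...)` so that padded positions do not contribute to the loss.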
The notebook (pos_ner.ipynb) walks through data prep, training, evaluation and saving the trained artefacts.
    .
    ├── README.md
    ├── requirements.txt
    ├── run_notebook.py      # script that runs the main .ipynb file
    ├── src/
    │   ├── pos_ner.ipynb
    │   ├── dataset
    │   └── trained_models
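The tree lists `run_notebook.py` as the script that runs the notebook. The repository's script is not reproduced here; the sketch below only shows one plausible way to do this, by calling the `jupyter nbconvert` command line. The helper names `build_command` and `run` are hypothetical, and the notebook path is taken from the tree.

```python
import subprocess
from pathlib import Path

# Notebook location as shown in the project tree
NOTEBOOK = Path("src") / "pos_ner.ipynb"


def build_command(notebook):
    """Build the nbconvert call that executes a notebook in place."""
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute",
        "--inplace",
        str(notebook),
    ]


def run(notebook):
    """Execute the notebook; raises CalledProcessError if a cell fails."""
    subprocess.run(build_command(notebook), check=True)


# Show the command instead of running it, so the sketch has no side effects
print(" ".join(build_command(NOTEBOOK)))
```

Calling `run(NOTEBOOK)` from the repository root would then execute every cell and save the outputs back into the same file, assuming Jupyter is installed in the active environment.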
# 1. clone
git clone https://github.com/YOUR-USER/ner-pos-tagger.git
cd ner-pos-tagger
# 2. python env
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
# 3. run script
python3 run_notebook.py

| Entity | Precision | Recall | F1 score |
|---|---|---|---|
| Name | 0.973 | 0.766 | 0.857 |
| Email Address | 0.800 | 0.778 | 0.789 |
| College Name | 0.429 | 0.350 | 0.385 |
| Skills | 0.261 | 0.273 | 0.267 |
| Designation | 0.613 | 0.355 | 0.450 |
| Location | 0.622 | 0.299 | 0.404 |
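The F1 column is the harmonic mean of precision and recall, so the table can be checked for internal consistency. The short sketch below recomputes F1 from the reported precision and recall values; the helper name `f1_score` is ours, and the numbers are copied from the table.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) per entity, copied from the NER table
reported = {
    "Name": (0.973, 0.766, 0.857),
    "Email Address": (0.800, 0.778, 0.789),
    "College Name": (0.429, 0.350, 0.385),
    "Skills": (0.261, 0.273, 0.267),
    "Designation": (0.613, 0.355, 0.450),
    "Location": (0.622, 0.299, 0.404),
}

for entity, (p, r, f1) in reported.items():
    computed = f1_score(p, r)
    print(f"{entity:14s} computed={computed:.3f} reported={f1:.3f}")
```

All six recomputed values agree with the reported F1 scores to three decimal places.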

| Model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| GRU (ours) | 89.41 % | 0.898 | 0.894 | 0.893 |
| Most-frequent-tag baseline | 13.95 % | -- | -- | -- |
Take-away: the neural POS tagger lifts accuracy by about 75 percentage points over the most-frequent-tag baseline (89.41 % vs 13.95 %), and the spaCy NER reaches 0.857 F1 on person names, its best entity type.
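For reference, a most-frequent-tag baseline can be sketched in a few lines. This version assumes the baseline assigns the single most common training tag to every token; the notebook may define it differently (for example per word), and the toy tag sequences below are invented for illustration.

```python
from collections import Counter


def most_frequent_tag(train_tags):
    """Return the tag that occurs most often across all training sentences."""
    counts = Counter(tag for sentence in train_tags for tag in sentence)
    return counts.most_common(1)[0][0]


def baseline_accuracy(train_tags, test_tags):
    """Token-level accuracy when every token gets the majority tag."""
    majority = most_frequent_tag(train_tags)
    gold = [tag for sentence in test_tags for tag in sentence]
    correct = sum(tag == majority for tag in gold)
    return correct / len(gold)


# Toy data, not taken from the BatteryData POS dataset
train = [["DET", "NOUN", "VERB", "DET", "NOUN"], ["NOUN", "VERB", "ADJ"]]
test = [["DET", "NOUN", "VERB"], ["NOUN", "ADP", "NOUN"]]

acc = baseline_accuracy(train, test)
print(f"majority tag: {most_frequent_tag(train)}, accuracy: {acc:.2%}")
```

A baseline of this kind gives a floor that any trained tagger should clear comfortably, which is what the accuracy gap in the table reflects.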
- Hyper-parameter search for NER (dropout, LR scheduler)
- CRF layer on top of GRU for POS
- Export both models as a single REST/Gradio micro-service
PRs are very welcome! Please open an issue to discuss major changes first.
- Fork → Commit → Pull Request
- Follow `black` & `ruff` linting
- Write / update unit tests where sensible