KeywordX

KeywordX is a lightweight Python library for extracting and matching keywords from text using semantic similarity and entity-based boosting.
Perfect for NLP pipelines, chatbots, search systems, and event extraction.

Features

Extract keywords with semantic similarity scoring
Boost keyword matches using entities (dates, times, places, etc.)
Supports custom IDF weighting for better relevance
Easy-to-use API for integration into NLP pipelines

Working Mechanism

KeywordX processes text in multiple logical stages to extract and score contextually relevant keywords:

Input
The user provides raw text and a list of target keywords to search for.
Text Chunking
Candidate phrases are extracted using noun chunks and important lemmas via the chunk_phrases method.
These chunks represent potential keyword matches.
Embeddings Generation
Candidate phrases, target keywords, and a baseline phrase are converted into vector embeddings using the spaCy model (en_core_web_md).
The helper functions embed_texts and whiten handle embedding and normalization.
Semantic Scoring
Cosine similarity is computed between each keyword embedding and all candidate phrase embeddings.
Optional IDF weighting refines the relevance score for each match (score_matches).
Semantic Selection
For each keyword, the highest-scoring phrase is selected if it surpasses the configurable min_score threshold (default = 0.3).
This logic is implemented in KeywordExtractor.extract.
Named Entity Recognition (NER)
Named entities (like DATE, TIME, GPE, etc.) are extracted using spaCy’s NER and structured date parsing (extract_structured).
Entity Boosting & Merging
Extracted entities are mapped to logical keyword types (e.g., DATE → "date", GPE → "location").
Each entity match receives a base score (0.6), optionally boosted by user-defined entity_weights (capped at 2.0).
If an entity score exceeds the semantic match score for the same keyword, the entity match is prioritized.
Output Generation
The final output is a dictionary containing:
- "entities" → list of extracted entities (text, type, span)
- "semantic_matches" → top-scoring matches for each keyword with their similarity scores

Key Implementation Files

File	Responsibility
extractor.py	Orchestrates the full extraction pipeline, handles entity merging and validation
chunker.py	Extracts candidate phrases (noun chunks, lemmas)
embeddings.py	Handles embedding creation and normalization
matcher.py	Performs cosine similarity + IDF scoring
ner.py	Performs NER and structured date parsing
utils.py	Loads spaCy model and cleans text

Tunable Parameters

Parameter	Default	Purpose
`min_score`	0.3	Minimum semantic score required to accept a match
`base_entity_score`	0.6	Default confidence score for entity matches
`entity_weights`	—	User-defined boosts for specific entity types (max 2.0)
`baseline_text`	—	Used for embedding normalization

Installation

Install from PyPI:

pip install keywordx

The en_core_web_md spaCy model is required for the library to function. Install it using the following command:

python -m spacy download en_core_web_md

If the en_core_web_md model is not available, the library will attempt to fall back to the smaller en_core_web_sm model. However, this may result in reduced accuracy. You can install the fallback model using:

python -m spacy download en_core_web_sm

Or install from source:

git clone https://github.com/keikurono7/keywordx.git
cd keywordx
pip install -e .

Quick Start

Here is a quick example to get you started:

from keywordx import KeywordExtractor

ke = KeywordExtractor()
text = "Tomorrow I have a work meeting at 5pm in Bangalore."
keywords = ["meeting", "time", "place", "date"]

result = ke.extract(text, keywords)
print(result)

Example Output

The result will include extracted entities and semantic matches with scores:

{
  "entities": [
    {"span": [0, 8], "text": "Tomorrow", "type": "DATE"},
    {"span": [34, 37], "text": "5pm", "type": "TIME"},
    {"span": [41, 50], "text": "Bangalore", "type": "GPE"}
  ],
  "semantic_matches": [
    {"keyword": "meeting", "match": "meeting", "score": 0.99},
    {"keyword": "time", "match": "5pm", "score": 1.0},
    {"keyword": "place", "match": "Bangalore", "score": 1.0},
    {"keyword": "date", "match": "Tomorrow", "score": 1.0}
  ]
}

or you can try it out at : hugging face/keywordx

API Reference

KeywordExtractor()
Initializes the keyword extractor.
.extract(text, keywords) → dict
Extracts keywords and entities from text.
- text: input string
- keywords: list of keywords to match
Returns:
- entities: named entities (DATE, TIME, GPE, etc.)
- semantic_matches: list of matched keywords with similarity scores

Use Cases

Event and meeting extraction for calendar assistants
Chatbot intent detection
Automatic tagging of documents and notes
Context-aware search and indexing

Contributing

Contributions are welcome. For significant changes, please open an issue first to discuss the proposal.

Contributors

Madhusudan
- Email: dmpathani@gmail.com
- GitHub: Madhusudan
- Role: Code implementation
Saniya Naaz
- Email: saniyanaaz2k4@gmail.com
- GitHub: Saniya Naaz
- Role: Research work
Dr. Nandeeswar S B
- Email: hodcse.aiml@amceducation.in
- Role: Concept and idea generation
Pratham Tomar
- Github: Pratham Tomar
- Role: Upgrading model implementation

Name		Name	Last commit message	Last commit date
Latest commit History 36 Commits
benchmarks		benchmarks
demo		demo
dist		dist
examples		examples
src/keywordx		src/keywordx
tests		tests
.gitignore		.gitignore
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE.txt		LICENSE.txt
README.md		README.md
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
setup.cfg		setup.cfg

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

KeywordX

Features

Working Mechanism

Key Implementation Files

Tunable Parameters

Installation

Quick Start

Example Output

API Reference

Use Cases

Contributing

Contributors

License

This project is licensed under the MIT License. See the LICENSE file for details.

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

KeywordX

Features

Working Mechanism

Key Implementation Files

Tunable Parameters

Installation

Quick Start

Example Output

API Reference

Use Cases

Contributing

Contributors

License

This project is licensed under the MIT License. See the LICENSE file for details.

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages