Merged

0.7.1 #107

168 commits
5f28658
Create super-linter.yml
tiagolv Nov 6, 2024
0e13bc5
Create pylint.yml
tiagolv Nov 6, 2024
ceef896
update Docker images
tiagolv Nov 6, 2024
e35fb82
yake.py 60%
tiagolv Feb 7, 2025
f678242
Create resultados.yml
tiagolv Feb 17, 2025
729d101
Update resultados.yml
tiagolv Feb 17, 2025
de7bd0f
Update resultados.yml
tiagolv Feb 17, 2025
d179ba3
pylint whitespace + status update
tiagolv Feb 18, 2025
522d95a
link code tests
tiagolv Feb 18, 2025
bb9c340
Levenshtein initial refactoring
tiagolv Feb 18, 2025
79315a7
clip.py - initial refactoring
tiagolv Feb 18, 2025
ecddac4
removed original files
tiagolv Feb 18, 2025
3d7c317
cli.py updated
tiagolv Feb 18, 2025
956bb97
trailing whitespaces
tiagolv Feb 18, 2025
cf3ef99
highlight initial refactoring
tiagolv Feb 18, 2025
2aadc64
highlight.py lines
tiagolv Feb 20, 2025
56b48b2
highlight.py docstrings
tiagolv Feb 20, 2025
9a13e29
datarepresentation.py initial formatting
tiagolv Feb 20, 2025
d4cfd88
datarepresentation 40%
tiagolv Feb 20, 2025
8755e64
datarepresentation linting
tiagolv Feb 25, 2025
df183c7
datarepresentation 70%
tiagolv Feb 25, 2025
cf59a6b
variables
tiagolv Feb 25, 2025
b8cdb5c
+ variables
tiagolv Feb 25, 2025
f8672e3
25/02/2025 finished
tiagolv Feb 25, 2025
f4bd1ed
test yake.py methods
tiagolv Mar 5, 2025
0a3a482
yake.py methods and attributes distributed
tiagolv Mar 5, 2025
116a82e
yake.py 100% Refactored and optimized
tiagolv Mar 5, 2025
36b4a08
method separation in highlight.py
tiagolv Mar 6, 2025
33d692f
beta documentation for highlight.py
tiagolv Mar 6, 2025
b25cc10
+ documentation
tiagolv Mar 6, 2025
86149c4
highlights.py 90% + documentation
tiagolv Mar 6, 2025
1baf343
highlight.py dictionary creation
tiagolv Mar 7, 2025
c13a7e4
07/03/2025 end
tiagolv Mar 7, 2025
75288ef
Update yakenew.md
tiagolv Mar 12, 2025
c56b001
actions files
tiagolv Mar 12, 2025
8130fdc
Create Makefile
tiagolv Mar 12, 2025
68b0e94
Update requirements.txt
tiagolv Mar 12, 2025
245407e
more comprehensive actions
tiagolv Mar 12, 2025
80b2a4e
Update requirements.txt
tiagolv Mar 12, 2025
ef22f31
Update Makefile
tiagolv Mar 17, 2025
d935bd1
Update Makefile
tiagolv Mar 17, 2025
bfa5191
dictionaries and final separation of arguments and methods
tiagolv Mar 17, 2025
9cfe7bf
continuation of the last commit
tiagolv Mar 17, 2025
12f2789
compatible dictionaries and test of new classes
tiagolv Mar 18, 2025
f2bf058
Update Makefile
tiagolv Mar 18, 2025
a1c4d8d
datarepresentation lint
tiagolv Mar 18, 2025
c530cac
extension of the new context classes
tiagolv Mar 18, 2025
226003d
test of new context classes
tiagolv Mar 25, 2025
98b5907
datarep new approach
tiagolv Mar 25, 2025
28a08d7
datarep
tiagolv Mar 25, 2025
7f175c7
simpler structure
tiagolv Mar 25, 2025
463276b
back to using datarep with lower time complexity
tiagolv Mar 25, 2025
60a1824
datarepresentation refactored 100%
tiagolv Mar 25, 2025
78f947b
-Docker files - REST API expired
tiagolv Mar 25, 2025
c678e9c
updated ymls
tiagolv Mar 25, 2025
4d3fec2
docs-sites inicio
tiagolv Apr 23, 2025
dab8c21
data-site setup finished
tiagolv Apr 23, 2025
044e726
Add static version for GitHub Pages
tiagolv May 5, 2025
46b62f1
static docs site
tiagolv May 5, 2025
b01fa15
style
tiagolv May 5, 2025
771f4ca
repository trimming and organization
tiagolv May 5, 2025
6c2c99b
workflow for updating the site
tiagolv May 5, 2025
b89da6b
Update deploy.yml
tiagolv May 5, 2025
345ed66
Update deploy.yml
tiagolv May 5, 2025
825ec76
site CSS
tiagolv May 5, 2025
d52b8b7
updated layout
tiagolv May 6, 2025
21a8883
Update README.md
tiagolv May 6, 2025
dc56966
added search back to page
tiagolv May 6, 2025
dad527a
updated mdx for cleaner look
tiagolv May 7, 2025
c153ec0
updated gitignore
tiagolv May 7, 2025
aaa590d
Update package.json
tiagolv May 7, 2025
5e430f5
Update layout.tsx
tiagolv May 7, 2025
19d27db
Update layout.config.tsx
tiagolv May 7, 2025
88be423
cleaned pke and updated logo layout
tiagolv May 7, 2025
fc8cf81
Update config.ts
tiagolv May 7, 2025
954feca
Update layout.config.tsx
tiagolv May 7, 2025
f536633
Update layout.config.tsx
tiagolv May 7, 2025
019c9bd
updated documentation
tiagolv May 7, 2025
57dda5d
moved core files to core folder
tiagolv May 13, 2025
e374152
Update README.md
tiagolv May 13, 2025
6d85a05
cleaning repository and read-me tests
tiagolv May 13, 2025
0491657
Update README.md
tiagolv May 13, 2025
c510de0
updated homepage and readme
tiagolv May 13, 2025
0df5032
updated homepage doc site
tiagolv May 13, 2025
ae0828d
Update README.md
tiagolv May 13, 2025
b758a7f
updating homepage and doc site links
tiagolv May 13, 2025
6b01aa9
homepage final form
tiagolv May 13, 2025
f903ca2
cleaning up and link redirection working
tiagolv May 16, 2025
b7565b5
doc site update
tiagolv May 19, 2025
b996543
updated index
tiagolv May 19, 2025
6a25334
Update yake.mdx
tiagolv May 19, 2025
07dd952
updated home.mdx
tiagolv May 20, 2025
963687a
final docs website structure
tiagolv May 20, 2025
cd1b375
Update about.mdx
tiagolv May 20, 2025
d1800c8
Update about.mdx
tiagolv May 20, 2025
b3fd211
compatibility error
tiagolv May 20, 2025
00230a2
index
tiagolv May 20, 2025
0ada57f
updated formattting
tiagolv May 20, 2025
7c11fb4
icon test
tiagolv May 20, 2025
8ce7ca4
docs
tiagolv May 20, 2025
fb72306
added notebook
tiagolv May 20, 2025
921fbd3
updated collab redirections and notebook
tiagolv May 20, 2025
44446e9
Create meta.json
tiagolv May 20, 2025
d1fef97
updated order
tiagolv May 20, 2025
7343b3e
Delete meta.json
tiagolv May 20, 2025
fb3a3cf
Update getting-started.mdx
tiagolv May 20, 2025
ab73131
Update README.md
tiagolv May 20, 2025
75a81d7
sidebar test
tiagolv May 20, 2025
b0fd520
sidebar test
tiagolv May 20, 2025
ce9fc5f
_meta
tiagolv May 20, 2025
d538819
updated sidebar
tiagolv May 20, 2025
9b99efe
sidebar final config
tiagolv May 20, 2025
b96f10a
updated utils and homepage
tiagolv May 20, 2025
31d32e6
Update README.md
tiagolv May 20, 2025
807453a
updated main class documentation
tiagolv May 20, 2025
dac7781
changed to uv
tiagolv May 20, 2025
a4f7027
finishing touches
tiagolv May 22, 2025
2e2fa6c
updated workflows for uv
tiagolv May 22, 2025
2e463b8
v envs for workflows
tiagolv May 22, 2025
97e0262
Update README.md
tiagolv May 22, 2025
6f03cfe
updated workflows
tiagolv May 22, 2025
95b35f9
Merge branch 'core-seperation' of https://github.com/tiagolv/yakerf i…
tiagolv May 22, 2025
a00da1f
final formatting for pull request
tiagolv May 27, 2025
c011916
test
tiagolv Jul 6, 2025
90842d9
implemented benchmark/optimized
tiagolv Sep 23, 2025
4cda8bb
added result visualization
tiagolv Sep 29, 2025
3318540
cache LRU + negative scores fix
tiagolv Oct 24, 2025
34655cf
fixed yake.py memory leak
tiagolv Oct 29, 2025
bb62931
added korean test + yake refactor
tiagolv Oct 29, 2025
1a462b9
added tests for more coverage
tiagolv Oct 29, 2025
c50c99e
added pylint ignore
tiagolv Nov 6, 2025
a479fba
same as last comm
tiagolv Nov 6, 2025
4b4d2da
refact
tiagolv Nov 6, 2025
23523e4
full refactor
tiagolv Nov 6, 2025
150bbc6
final refact
tiagolv Nov 6, 2025
700a968
cleaner and more comprehensive documentation
tiagolv Nov 6, 2025
92a0857
updated website and documentation
tiagolv Nov 20, 2025
d931ded
final adjustments
tiagolv Jan 29, 2026
a2a7368
Update -getting-started.mdx
tiagolv Jan 29, 2026
d4fed6c
Update -getting-started.mdx
tiagolv Jan 29, 2026
985b222
repository cleanup for PR
tiagolv Jan 29, 2026
31c8dcc
final updates
tiagolv Jan 29, 2026
3a3a0e2
Update README.md
tiagolv Jan 29, 2026
735c9ae
Update README.md
tiagolv Jan 29, 2026
cfe0409
Update README.md
tiagolv Jan 29, 2026
b37d31c
workflow updates
tiagolv Jan 29, 2026
0ca10b1
updated test.yml
tiagolv Jan 29, 2026
8848179
updated to python 3.13
tiagolv Jan 29, 2026
05be4d0
3.12 update python
tiagolv Jan 29, 2026
c15096a
Update test.yml
tiagolv Jan 29, 2026
12f0b92
Update test.yml
tiagolv Jan 29, 2026
5f07b6e
Update test.yml
tiagolv Jan 29, 2026
726d918
Update test.yml
tiagolv Jan 29, 2026
4d35647
updated numpy version
tiagolv Jan 29, 2026
472a4db
Update test.yml
tiagolv Jan 29, 2026
8d51be1
test update
tiagolv Jan 29, 2026
a72998f
links + fail-safe in the test workflow
tiagolv Jan 31, 2026
c33aaef
added negative score test
tiagolv Jan 31, 2026
a11a0e7
Update README.md
tiagolv Jan 31, 2026
dbd070a
merge fixes
tiagolv Jan 31, 2026
b1c6b09
Update test_features.py
tiagolv Jan 31, 2026
e01e9a3
Update 1YAKE.ipynb
tiagolv Jan 31, 2026
60e8f18
updated pages
tiagolv Jan 31, 2026
c7e2a59
idk
tiagolv Jan 31, 2026
aab6f2e
Update README.md
tiagolv Jan 31, 2026
85410ba
conflitcts resolve
tiagolv Feb 2, 2026
c132ee1
resolve conflicts
tiagolv Feb 2, 2026
bab3f41
Merge branch 'master' into core-seperation
tiagolv Feb 2, 2026
329 changes: 329 additions & 0 deletions docs-site/content/docs/Documentation/data/features.mdx
@@ -0,0 +1,329 @@
import { Accordion, AccordionContent, AccordionItem, AccordionTrigger } from '@/components/ui/accordion'

# Features Module

The `features` module contains pure functions for calculating statistical features used to score and rank keyword candidates in YAKE.

> **Info:** This documentation provides interactive code views for each method. Click on a function name to view its implementation.

## Module Overview

```python
"""
Feature calculation module for YAKE keyword extraction.

This module contains pure functions for calculating statistical features
used to score and rank keyword candidates. Separating feature calculations
from data structures improves testability and maintainability.

Based on the modular architecture from the reference YAKE implementation.
"""

import logging
import math
from typing import Dict, Any, Tuple
import numpy as np

# Configure module logger
logger = logging.getLogger(__name__)
```

This module provides stateless functions that calculate various statistical features for both single-word terms and multi-word expressions (n-grams).

## Main Functions

<Accordion type="single" collapsible>
<AccordionItem value="calculate_term_features">
<AccordionTrigger>
<code>calculate_term_features(term, max_tf, avg_tf, std_tf, number_of_sentences)</code>
</AccordionTrigger>
<AccordionContent>
```python
def calculate_term_features(
    term: Any,
    max_tf: float,
    avg_tf: float,
    std_tf: float,
    number_of_sentences: int
) -> Dict[str, float]:
    """
    Calculate all statistical features for a single term.

    This function computes various statistical features that determine
    a term's importance as a potential keyword. Features include term
    relevance, frequency, spread, case information, and position.

    The features calculated are:
    - WRel: Term relevance based on graph connectivity (co-occurrence)
    - WFreq: Normalized term frequency
    - WSpread: Distribution across sentences
    - WCase: Capitalization pattern (prefers proper nouns)
    - WPos: Position bias (earlier terms preferred)
    - H: Overall importance score (lower is better)

    Args:
        term: SingleWord object containing term information
        max_tf: Maximum term frequency in the document
        avg_tf: Average term frequency across all terms
        std_tf: Standard deviation of term frequency
        number_of_sentences: Total number of sentences in document

    Returns:
        Dictionary with calculated features:
        - w_rel: Term relevance score
        - w_freq: Normalized frequency score
        - w_spread: Sentence spread score
        - w_case: Case sensitivity score
        - w_pos: Position score
        - pl: Left context weight
        - pr: Right context weight
        - h: Final importance score (H-score)
    """
    # Get graph metrics (cached in SingleWord)
    if hasattr(term, "get_graph_metrics"):
        metrics = term.get_graph_metrics()
    else:
        metrics = term.graph_metrics

    # Calculate WRel (term relevance based on graph connectivity)
    pwl = metrics['pwl']
    pwr = metrics['pwr']
    pl = metrics['wdl'] / max_tf if max_tf > 0 else 0
    pr = metrics['wdr'] / max_tf if max_tf > 0 else 0

    w_rel = (0.5 + (pwl * (term.tf / max_tf))) + (0.5 + (pwr * (term.tf / max_tf)))

    # Calculate WFreq (normalized term frequency)
    w_freq = term.tf / (avg_tf + std_tf) if (avg_tf + std_tf) > 0 else 0

    # Calculate WSpread (term spread across sentences)
    w_spread = len(term.sentence_ids) / number_of_sentences

    # Calculate WCase (capitalization pattern)
    w_case = max(term.tf_a, term.tf_n) / (1.0 + math.log(term.tf))

    # Calculate WPos (position feature using median)
    positions = list(term.occurs.keys())
    w_pos = math.log(math.log(3.0 + np.median(positions)))

    # Calculate H (overall importance score)
    h_score = (w_pos * w_rel) / (
        w_case + (w_freq / w_rel) + (w_spread / w_rel)
    )

    return {
        'w_rel': w_rel,
        'w_freq': w_freq,
        'w_spread': w_spread,
        'w_case': w_case,
        'w_pos': w_pos,
        'pl': pl,
        'pr': pr,
        'h': h_score
    }
```
</AccordionContent>
</AccordionItem>

<AccordionItem value="calculate_composed_features">
<AccordionTrigger>
<code>calculate_composed_features(composed_word, stopword_weight='bi')</code>
</AccordionTrigger>
<AccordionContent>
```python
def calculate_composed_features(
    composed_word: Any,
    stopword_weight: str = 'bi'
) -> Dict[str, float]:
    """
    Calculate features for multi-word expressions (n-grams).

    Combines features from individual terms to score the entire phrase,
    with special handling for stopwords based on the weighting method.

    The features are aggregated from constituent terms using different
    combination methods:
    - TF: Product of term frequencies
    - PL/PR: Multiplication with ratio adjustment
    - H: Combined score using product and ratios

    Args:
        composed_word: ComposedWord object containing the n-gram
        stopword_weight: Method for handling stopwords:
            - 'bi': Bi-gram specific weighting (default)
            - 'h': Use H-score for weighting
            - 'none': No special stopword handling

    Returns:
        Dictionary with aggregated features for the multi-word expression
    """
    # Get features from constituent terms
    sum_tf, prod_tf, ratio_tf = composed_word.get_composed_feature(
        'tf',
        discart_stopword=(stopword_weight != 'none')
    )

    sum_pl, prod_pl, ratio_pl = composed_word.get_composed_feature(
        'pl',
        discart_stopword=True
    )

    sum_pr, prod_pr, ratio_pr = composed_word.get_composed_feature(
        'pr',
        discart_stopword=True
    )

    # Calculate combined H-score
    sum_h, prod_h, ratio_h = composed_word.get_composed_feature(
        'h',
        discart_stopword=True
    )

    # Combine features based on n-gram size
    if len(composed_word.terms) == 1:
        # Single word - use its H-score directly
        h_score = composed_word.terms[0].h
    else:
        # Multi-word - combine using product and ratios
        h_score = prod_h / (sum_tf * (1.0 + sum_pl) * (1.0 + sum_pr))

    return {
        'tf': prod_tf,
        'pl': prod_pl * ratio_pl,
        'pr': prod_pr * ratio_pr,
        'h': h_score,
        'integrity': composed_word.integrity
    }
```
</AccordionContent>
</AccordionItem>
</Accordion>

## Helper Functions

<Accordion type="single" collapsible>
<AccordionItem value="normalize_features">
<AccordionTrigger>
<code>normalize_features(features, max_vals)</code>
</AccordionTrigger>
<AccordionContent>
```python
def normalize_features(
    features: Dict[str, float],
    max_vals: Dict[str, float]
) -> Dict[str, float]:
    """
    Normalize feature values to [0, 1] range.

    Divides each feature by its maximum observed value in the corpus
    to create normalized, comparable scores.

    Args:
        features: Dictionary of raw feature values
        max_vals: Dictionary of maximum values for each feature

    Returns:
        Dictionary of normalized feature values
    """
    normalized = {}
    for key, value in features.items():
        max_val = max_vals.get(key, 1.0)
        if max_val > 0:
            normalized[key] = value / max_val
        else:
            normalized[key] = 0.0
    return normalized
```
</AccordionContent>
</AccordionItem>

<AccordionItem value="safe_divide">
<AccordionTrigger>
<code>safe_divide(numerator, denominator, default=0.0)</code>
</AccordionTrigger>
<AccordionContent>
```python
def safe_divide(
    numerator: float,
    denominator: float,
    default: float = 0.0
) -> float:
    """
    Safely divide two numbers, handling division by zero.

    Args:
        numerator: Value to divide
        denominator: Value to divide by
        default: Value to return if denominator is zero (default: 0.0)

    Returns:
        Result of division, or default if denominator is zero
    """
    if denominator == 0:
        return default
    return numerator / denominator
```
</AccordionContent>
</AccordionItem>
</Accordion>
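
As a quick illustration of how the two helpers behave, the sketch below re-declares them with the same logic as the listings above and applies them to made-up values. The dictionaries `raw` and `maxima` are hypothetical and are not taken from YAKE.

```python
from typing import Dict


def safe_divide(numerator: float, denominator: float, default: float = 0.0) -> float:
    """Return numerator / denominator, or `default` when the denominator is zero."""
    if denominator == 0:
        return default
    return numerator / denominator


def normalize_features(
    features: Dict[str, float], max_vals: Dict[str, float]
) -> Dict[str, float]:
    """Divide each feature by its maximum; non-positive maxima map to 0.0."""
    normalized = {}
    for key, value in features.items():
        max_val = max_vals.get(key, 1.0)  # missing maxima fall back to 1.0
        normalized[key] = value / max_val if max_val > 0 else 0.0
    return normalized


# Hypothetical values, chosen only to exercise each branch.
raw = {"w_freq": 2.0, "w_spread": 0.5, "w_case": 3.0}
maxima = {"w_freq": 4.0, "w_spread": 0.0}  # no entry for "w_case"

scaled = normalize_features(raw, maxima)
print(scaled)  # {'w_freq': 0.5, 'w_spread': 0.0, 'w_case': 3.0}
print(safe_divide(1.0, 0.0))        # 0.0
print(safe_divide(1.0, 0.0, -1.0))  # -1.0
```

Note that a feature with no entry in `max_vals` is divided by 1.0 and therefore passes through unchanged, so the result only lies in the [0, 1] range when every feature has a maximum supplied.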

## Feature Descriptions

### Single-Term Features

- **WRel (Term Relevance)**: Measures term importance based on co-occurrence patterns with other terms
- **WFreq (Frequency)**: Term frequency normalized by the mean plus one standard deviation of the term frequencies in the document
- **WSpread (Spread)**: Fraction of the document's sentences in which the term appears
- **WCase (Case)**: Capitalization patterns (favors proper nouns and acronyms)
- **WPos (Position)**: Positional bias favoring terms that appear earlier in the document
- **H-Score**: Combined importance score (lower values indicate more important keywords)
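
To make the single-term features concrete, here is a minimal worked sketch that follows the same arithmetic as `calculate_term_features`. The term object, the document statistics, and the graph metrics `pwl` and `pwr` are all invented for illustration. `statistics.median` stands in for `np.median` so the snippet needs only the standard library, and `occurs` is assumed to be keyed by sentence index.

```python
import math
from statistics import median
from types import SimpleNamespace

# Hypothetical term: every value below is invented for illustration.
term = SimpleNamespace(
    tf=4.0,                   # total occurrences
    tf_a=0.0,                 # occurrences written as an acronym
    tf_n=2.0,                 # occurrences written as a capitalized proper noun
    sentence_ids={0, 1, 3},   # sentences that contain the term
    occurs={0: [1], 1: [7], 3: [20]},  # assumed to be keyed by sentence index
)

# Hypothetical document statistics and graph metrics.
max_tf, avg_tf, std_tf, number_of_sentences = 8.0, 2.0, 1.0, 6
pwl, pwr = 0.5, 0.25

ratio = term.tf / max_tf
w_rel = (0.5 + pwl * ratio) + (0.5 + pwr * ratio)
w_freq = term.tf / (avg_tf + std_tf)
w_spread = len(term.sentence_ids) / number_of_sentences
w_case = max(term.tf_a, term.tf_n) / (1.0 + math.log(term.tf))
w_pos = math.log(math.log(3.0 + median(term.occurs.keys())))

# Lower H means a more important term.
h = (w_pos * w_rel) / (w_case + w_freq / w_rel + w_spread / w_rel)

print(f"w_rel={w_rel}, w_spread={w_spread}")  # w_rel=1.375, w_spread=0.5
print(f"w_freq={w_freq:.4f}, w_case={w_case:.4f}, w_pos={w_pos:.4f}, h={h:.4f}")
```

Because `w_case`, `w_freq`, and `w_spread` sit in the denominator, a term that is capitalized, frequent, or widely spread gets a smaller H, while a later median position raises `w_pos` and therefore H.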

### Multi-Word Features

- **TF (Term Frequency)**: Product of constituent term frequencies
- **PL/PR (Context)**: Left and right context weights
- **Integrity**: Cohesion measure for multi-word expressions
- **H-Score**: Aggregated importance combining all constituent features
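
The aggregation step can be sketched as follows. The diff does not show `ComposedWord.get_composed_feature`, so `composed_feature` below is a hypothetical stand-in that assumes the method returns the sum, the product, and the ratio product / (sum + 1) over the non-stopword terms. The trigram and its values are invented.

```python
import math
from types import SimpleNamespace


def composed_feature(terms, name, discard_stopwords=True):
    """Hypothetical stand-in: (sum, product, product / (sum + 1)) of one feature."""
    values = [
        getattr(t, name)
        for t in terms
        if not (discard_stopwords and t.stopword)
    ]
    total = sum(values)
    product = math.prod(values)
    return total, product, product / (total + 1)


# Hypothetical trigram whose middle term is a stopword.
terms = [
    SimpleNamespace(tf=2.0, h=0.5, stopword=False),
    SimpleNamespace(tf=6.0, h=0.9, stopword=True),
    SimpleNamespace(tf=3.0, h=0.25, stopword=False),
]

sum_tf, prod_tf, ratio_tf = composed_feature(terms, "tf")
sum_h, prod_h, ratio_h = composed_feature(terms, "h")

print(sum_tf, prod_tf, ratio_tf)  # 5.0 6.0 1.0
print(sum_h, prod_h)              # 0.75 0.125
```

With stopwords discarded, the middle term contributes nothing to either aggregate, so only the two content words drive the phrase score under this assumption.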

## Usage Example

```python
from yake.data.features import calculate_term_features, calculate_composed_features
from yake.data import DataCore

# Build data representation
text = "Natural language processing is important for AI applications."
dc = DataCore(text=text, stopword_set={"is", "for"}, config={"windows_size": 1, "n": 3})
dc.build_single_terms_features()
dc.build_mult_terms_features()

# Features are automatically calculated and stored in term objects
for term in dc.terms.values():
    print(f"{term.word}: H={term.h:.4f}, WRel={term.w_rel:.4f}")

for candidate in dc.candidates.values():
    if candidate.is_valid():
        print(f"{candidate.kw}: H={candidate.h:.4f}")
```

## Integration with YAKE

This module is used internally by:
- `SingleWord.update_h()`: Calculates features for single terms
- `ComposedWord.update_h()`: Calculates features for n-grams
- `DataCore.build_single_terms_features()`: Batch feature calculation
- `DataCore.build_mult_terms_features()`: N-gram feature aggregation

## Performance Considerations

- Features are calculated once and cached in term objects
- Pure functions enable easy testing and optimization
- NumPy is used for efficient median calculations
- Feature calculation is the most computationally intensive part of YAKE
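
One way to picture the "calculated once and cached" behaviour is the sketch below. `CachedTerm` is a hypothetical class written for this illustration, not YAKE's `SingleWord`, and `functools.cached_property` is only one possible caching mechanism; the diff does not say which one the project uses.

```python
from functools import cached_property


class CachedTerm:
    """Hypothetical stand-in for a term object; not YAKE's SingleWord."""

    def __init__(self, tf: float, avg_tf: float, std_tf: float):
        self.tf = tf
        self.avg_tf = avg_tf
        self.std_tf = std_tf
        self.calls = 0  # counts how often the feature is actually computed

    @cached_property
    def w_freq(self) -> float:
        # Runs on first access only; the result is then stored on the instance.
        self.calls += 1
        spread = self.avg_tf + self.std_tf
        return self.tf / spread if spread > 0 else 0.0


term = CachedTerm(tf=4.0, avg_tf=1.5, std_tf=0.5)
first = term.w_freq
second = term.w_freq
print(first, second, term.calls)  # 2.0 2.0 1
```

The trade-off of this style of caching is that the stored value goes stale if `tf` or the document statistics change afterwards, so it fits best when features are computed after the whole document has been processed.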

## Dependencies

- `logging`: For debug and error messages
- `math`: For logarithmic calculations
- `numpy`: For efficient statistical operations
- `typing`: For type hints
2 changes: 1 addition & 1 deletion tests/README.md
@@ -1,5 +1,5 @@
# 🧪 How to run the tests
- This project uses pytes to run it´s tests.
+ This project uses pytest to run its tests.

### 📋 Pre-requirements
If not already installed install pytest:
1 change: 1 addition & 0 deletions tests/__init__.py
@@ -1,3 +1,4 @@
# -*- coding: utf-8 -*-
# pylint: skip-file

"""Unit test package for yake."""