This project implements the TextRank algorithm (Mihalcea, 2004) from scratch, including graph construction, weighted scoring, and convergence-based ranking.
TextRank is a graph-based ranking algorithm derived from PageRank (Brin & Page, 1998), which was originally developed for ranking web pages. It adapts the same iterative scoring mechanism to textual units such as words or sentences, enabling graph-based ranking of linguistic elements.
The system provides a graph-based keyword extraction pipeline applied to POS-tagged corpora in Penn Treebank format. It is designed as a complete end-to-end pipeline, from text preprocessing and graph construction to keyword extraction and evaluation outputs.
```
.
├── src/
│   ├── coocurrencias.py
│   ├── coocurrencias-lem.py
│   └── palabras_clave-lem.py
├── data/
│   └── sample_corpus_ptb/
├── docs/
│   ├── Instrucciones del programa coocurrencias.py.pdf
│   └── Instrucciones del programa palabras_clave-lem.py.pdf
└── README.md
```
The system expects POS-tagged text in Penn Treebank format, where each token is represented as `word/POS`. Example:

```
A/DT challenging/JJ problem/NN faced/VBN by/IN researchers/NNS
```
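As a minimal sketch (the function name is illustrative, not the project's actual code), a tagged line can be split into `(word, POS)` tuples like this:

```python
def parse_tagged_line(line):
    """Split a Penn Treebank tagged line into (word, POS) tuples."""
    tokens = []
    for item in line.split():
        # rpartition splits on the LAST "/", so words that themselves
        # contain a slash (e.g. "1/2/CD") are handled correctly
        word, _, pos = item.rpartition("/")
        tokens.append((word, pos))
    return tokens

print(parse_tagged_line("A/DT challenging/JJ problem/NN"))
# → [('A', 'DT'), ('challenging', 'JJ'), ('problem', 'NN')]
```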
This script generates a co-occurrence graph from a collection of tagged documents.
- Parses documents into `(word, POS)` tuples
- Normalizes tokens:
  - removes non-alphabetic characters
  - converts to lowercase
  - removes stopwords
- Applies stemming (Porter)
- Filters tokens by POS tags
- Builds a vocabulary
- Computes co-occurrences within a sliding window
- Generates a weighted graph in Pajek format
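The sliding-window step can be sketched as follows. This is illustrative, not the script's actual code, and it assumes a window of size `w` covers a token plus the `w - 1` tokens that follow it:

```python
from collections import defaultdict

def cooccurrences(tokens, window=5):
    """Count co-occurrences of token pairs within a sliding window.

    A window of size `window` spans a token and the following
    `window - 1` tokens (an assumed convention)."""
    weights = defaultdict(int)
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + window, len(tokens))):
            u, v = sorted((w, tokens[j]))  # undirected: store pairs in canonical order
            if u != v:
                weights[(u, v)] += 1
    return dict(weights)

print(cooccurrences(["a", "b", "a", "c"], window=2))
# → {('a', 'b'): 2, ('a', 'c'): 1}
```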
This script is an improved version of `coocurrencias.py`:

- Replaces stemming with lemmatization using WordNet
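One practical detail when lemmatizing with WordNet (e.g. via NLTK's `WordNetLemmatizer`, which takes a WordNet POS category as its second argument) is mapping Penn Treebank tags to those categories. A sketch of such a mapping (the helper name is illustrative):

```python
def ptb_to_wordnet(tag):
    """Map a Penn Treebank POS tag to the single-letter WordNet POS
    category expected by a WordNet lemmatizer."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun (WordNet's default)

print([ptb_to_wordnet(t) for t in ("JJ", "VBN", "NNS", "RB")])
# → ['a', 'v', 'n', 'r']
```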
Both scripts generate a graph in Pajek format:

```
*Vertices N
1 "term1"
2 "term2"
...
*Edges
1 2 weight
```
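A minimal parser for this format might look like the following (a sketch over in-memory lines, not the project's actual loader):

```python
def read_pajek(lines):
    """Parse minimal Pajek content: a *Vertices section, then *Edges.

    Returns (vocab, edges): node index -> term, and (u, v) -> weight."""
    vocab, edges = {}, {}
    section = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        low = line.lower()
        if low.startswith("*vertices"):
            section = "vertices"
        elif low.startswith("*edges"):
            section = "edges"
        elif section == "vertices":
            idx, term = line.split(maxsplit=1)
            vocab[int(idx)] = term.strip('"')
        elif section == "edges":
            u, v, w = line.split()
            edges[(int(u), int(v))] = float(w)
    return vocab, edges

sample = ['*Vertices 2', '1 "term1"', '2 "term2"', '*Edges', '1 2 3.0']
print(read_pajek(sample))
# → ({1: 'term1', 2: 'term2'}, {(1, 2): 3.0})
```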
This script implements the TextRank algorithm over the graph generated in Part 1.
The script `palabras_clave-lem.py` assumes that the terms in the graph vocabulary match exactly the lemmatized terms extracted from the documents. Therefore, the graph should be created using `coocurrencias-lem.py`.
If the graph is generated using stemming (e.g. `coocurrencias.py`), term mismatches may occur and the program will fail when mapping words to graph nodes.
- Reads vocabulary and edge weights from a Pajek file
- Maps terms to node indices
- Applies the same preprocessing as graph construction:
  - tokenization (`word/POS`)
  - normalization
  - lemmatization (same as `coocurrencias-lem.py`)
  - POS filtering

👉 Ensures consistency between graph and document processing

- Extracts nodes present in the document
- Builds co-occurrence pairs within a window
An iterative algorithm computes node importance:
```
WS(v) = (1 - d) + d * Σ_u (WS(u) * weight(u,v) / W(u))
```

where:

- `WS(v)`: score of node v
- `d`: damping factor (typically 0.85)
- `W(u)`: total weight of edges from node u
The algorithm iterates until convergence (controlled by a threshold parameter).
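The iteration above can be sketched like this (an illustrative implementation, not the script's actual code; it takes the undirected weighted edges as a dict):

```python
def textrank(edges, d=0.85, tol=1e-4, max_iter=100):
    """Iterate WS(v) = (1 - d) + d * Σ_u (WS(u) * weight(u,v) / W(u))
    until the largest per-node change falls below `tol`.

    `edges` maps undirected pairs (u, v) to weights."""
    # Build an adjacency map and the total edge weight W(u) of each node
    neigh = {}
    for (u, v), w in edges.items():
        neigh.setdefault(u, {})[v] = w
        neigh.setdefault(v, {})[u] = w
    total = {u: sum(ws.values()) for u, ws in neigh.items()}

    ws = {u: 1.0 for u in neigh}  # initial score of every node
    for _ in range(max_iter):
        new = {}
        for v in neigh:
            s = sum(ws[u] * w / total[u] for u, w in neigh[v].items())
            new[v] = (1 - d) + d * s
        delta = max(abs(new[u] - ws[u]) for u in ws)
        ws = new
        if delta < tol:  # convergence threshold reached
            break
    return ws

# On the path graph a - b - c, the central node scores highest:
scores = textrank({("a", "b"): 1.0, ("b", "c"): 1.0})
print(max(scores, key=scores.get))
# → b
```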
- Nodes are ranked by score
- Top nodes are selected dynamically:
  - minimum: 5
  - maximum: 20
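A sketch of that selection policy; the source only fixes the 5/20 bounds, so the one-third fraction here is an assumption for illustration:

```python
def select_top(scores, frac=1/3, lo=5, hi=20):
    """Pick the top-scoring nodes: a fraction of the vocabulary size
    (assumed policy), clamped between `lo` and `hi`."""
    k = max(lo, min(hi, round(len(scores) * frac)))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]

# 12 candidate terms with increasing scores; 12/3 = 4 rounds up to the minimum of 5
print(select_top({f"w{i}": i for i in range(12)}))
# → ['w11', 'w10', 'w9', 'w8', 'w7']
```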
For each document:
`document.txt.tagged.result`

Also generates `summary.csv`, containing:

- number of iterations
- convergence value
- total score
```
Tagged documents
        ↓
coocurrencias.py / coocurrencias-lem.py
        ↓
Co-occurrence graph (Pajek)
        ↓
palabras_clave-lem.py
        ↓
Keywords per document
```
The folder `sample_corpus_ptb/` contains 10 tagged documents.
```
python src/coocurrencias-lem.py -i data/sample_corpus_ptb -w 5
python src/palabras_clave-lem.py -i data/sample_corpus_ptb -g graph.txt -w 5 -d 0.85 -l 0.0001
```

- Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual web search engine.
- Mihalcea, R., & Tarau, P. (2004). TextRank: Bringing order into texts.
- Batagelj, V., & Mrvar, A. (1998). Pajek: Program for Large Network Analysis.