TextRank-based Keyword Extraction from POS-Tagged Documents

This project implements the TextRank algorithm (Mihalcea & Tarau, 2004) from scratch, including graph construction, weighted scoring, and convergence-based ranking.

TextRank is a graph-based ranking algorithm derived from PageRank (Brin & Page, 1998), which was originally developed for ranking web pages. TextRank applies the same iterative scoring mechanism to textual units such as words or sentences.

The system is an end-to-end keyword extraction pipeline for POS-tagged corpora in Penn Treebank format. It covers text preprocessing, graph construction, keyword extraction, and evaluation outputs.

📁 Project Structure

.
├── src/
│   ├── coocurrencias.py
│   ├── coocurrencias-lem.py
│   └── palabras_clave-lem.py
├── data/
│   └── sample_corpus_ptb/
├── docs/
│   ├── Instrucciones del programa coocurrencias.py.pdf
│   └── Instrucciones del programa palabras_clave-lem.py.pdf
├── README.md
└── requirements.txt

📊 Input Format

The system expects POS-tagged text in Penn Treebank format, where each token is represented as:

word/POS

Example:

A/DT challenging/JJ problem/NN faced/VBN by/IN researchers/NNS
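
As an illustration of the format, the following minimal sketch splits one tagged line into (word, POS) tuples. The function name `parse_tagged_line` is hypothetical and is not taken from the project's scripts; splitting on the last slash is an assumption made so that words containing a slash keep it.

```python
def parse_tagged_line(line):
    """Split a line of word/POS tokens into (word, POS) tuples."""
    pairs = []
    for token in line.split():
        # rpartition splits on the LAST slash, so "1/2/CD" becomes ("1/2", "CD").
        word, sep, pos = token.rpartition("/")
        if sep and word and pos:  # skip tokens without a word or a tag
            pairs.append((word, pos))
    return pairs


example = "A/DT challenging/JJ problem/NN faced/VBN by/IN researchers/NNS"
print(parse_tagged_line(example))
```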

🧩 Part 1 --- Graph Construction (TextRank Input)

🔹 coocurrencias.py

This script generates a co-occurrence graph from a collection of tagged documents.

✔️ What it does

- Parses documents into (word, POS) tuples
- Normalizes tokens:
  - removes non-alphabetic characters
  - converts to lowercase
  - removes stopwords
- Applies stemming (Porter)
- Filters tokens by POS tags
- Builds a vocabulary
- Computes co-occurrences within a sliding window
- Generates a weighted graph in Pajek format
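
The normalization and co-occurrence steps above can be sketched roughly as follows. This is not the code of coocurrencias.py: the stopword list, the choice of keeping nouns and adjectives, and the window semantics (two terms co-occur when they are fewer than `window` positions apart in the filtered sequence) are all assumptions made for illustration, and the stemming step is left out to keep the sketch dependency-free.

```python
import re
from collections import Counter

STOPWORDS = {"a", "by", "the", "of"}   # hypothetical, tiny stopword list
KEEP_POS_PREFIXES = ("NN", "JJ")       # assumption: keep nouns and adjectives


def normalize(pairs):
    """Strip non-letters, lowercase, drop stopwords and unwanted POS tags."""
    terms = []
    for word, pos in pairs:
        word = re.sub(r"[^A-Za-z]", "", word).lower()
        if word and word not in STOPWORDS and pos.startswith(KEEP_POS_PREFIXES):
            terms.append(word)
    return terms


def cooccurrences(terms, window):
    """Count unordered pairs of distinct terms fewer than `window` positions apart."""
    counts = Counter()
    for i, left in enumerate(terms):
        for right in terms[i + 1 : i + window]:
            if left != right:
                counts[tuple(sorted((left, right)))] += 1
    return counts


pairs = [("A", "DT"), ("challenging", "JJ"), ("problem", "NN"),
         ("faced", "VBN"), ("by", "IN"), ("researchers", "NNS")]
terms = normalize(pairs)
print(terms)
print(cooccurrences(terms, window=2))
```

The pair counts play the role of edge weights; a real run would accumulate them over all documents before writing the graph.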

🔹 coocurrencias-lem.py (Improved version)

This script is an improved version of coocurrencias.py.

🔧 Key improvement

Replaces stemming with lemmatization using WordNet

🔗 Output of Part 1

Both scripts generate a graph in Pajek format:

*Vertices N
1 "term1"
2 "term2"
...

*Edges
1 2 weight
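
To make the layout concrete, here is a small sketch that serializes a vocabulary and a dictionary of edge weights into this format. The function name `to_pajek` and the input shapes are hypothetical; the project's scripts may order vertices or format weights differently.

```python
def to_pajek(vocabulary, edge_weights):
    """Return a Pajek-style string for a term list and {(term, term): weight}."""
    # Pajek vertex numbers start at 1.
    index = {term: i for i, term in enumerate(vocabulary, start=1)}
    lines = [f"*Vertices {len(vocabulary)}"]
    lines += [f'{i} "{term}"' for term, i in index.items()]
    lines.append("*Edges")
    for (u, v), weight in sorted(edge_weights.items()):
        lines.append(f"{index[u]} {index[v]} {weight}")
    return "\n".join(lines) + "\n"


print(to_pajek(["problem", "researcher"], {("problem", "researcher"): 2}))
```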

🧠 Part 2 --- Keyword Extraction (TextRank)

🔹 palabras_clave-lem.py

This script implements the TextRank algorithm over the graph generated in Part 1.

⚠️ Important: The input Pajek graph must be generated using the same preprocessing pipeline as the keyword extraction step, in particular lemmatization with WordNetLemmatizer.

The script palabras_clave-lem.py assumes that the terms in the graph vocabulary match exactly the lemmatized terms extracted from the documents. Therefore, the graph should be created using coocurrencias-lem.py.

If the graph is generated using stemming (e.g. coocurrencias.py), term mismatches may occur and the program will fail when mapping words to graph nodes.

⚙️ What it does

1. Loads the graph

- Reads vocabulary and edge weights from a Pajek file
- Maps terms to node indices
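
A loader for this step might look like the sketch below. It assumes the simple layout shown in Part 1 (a `*Vertices` section followed by an `*Edges` section) and takes the file contents as a string; `parse_pajek` is a hypothetical name, not the function used in palabras_clave-lem.py.

```python
def parse_pajek(text):
    """Parse Pajek text into ({term: node index}, {(u, v): weight})."""
    term_to_index = {}
    weights = {}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("*"):
            section = line.split()[0].lower()  # "*vertices" or "*edges"
            continue
        if section == "*vertices":
            number, label = line.split(maxsplit=1)
            term_to_index[label.strip('"')] = int(number)
        elif section == "*edges":
            u, v, w = line.split()[:3]
            weights[(int(u), int(v))] = float(w)
    return term_to_index, weights


sample = '*Vertices 2\n1 "term1"\n2 "term2"\n\n*Edges\n1 2 3\n'
print(parse_pajek(sample))
```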

2. Processes each document

- tokenization (word/POS)
- normalization
- lemmatization (same as coocurrencias-lem.py)
- POS filtering

👉 Ensures consistency between graph and document processing

3. Builds document subgraph

- extracts nodes present in the document
- builds co-occurrence pairs within a window
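
One possible reading of this step is sketched below: document terms are mapped to node indices, and an edge is kept when two terms fall within the window and the pair exists in the global graph, reusing the global weight. How the actual script combines document pairs with graph weights is not stated here, so treat this as an assumption. The names `document_subgraph`, `term_to_index`, and `weights` are hypothetical, and `weights` is assumed to be keyed with the smaller index first.

```python
def document_subgraph(doc_terms, term_to_index, weights, window):
    """Build a symmetric adjacency dict {node: {neighbor: weight}} for one document."""
    # A term missing from the graph vocabulary raises KeyError here, which is
    # the failure mode described above for graphs built with stemming.
    nodes = [term_to_index[term] for term in doc_terms]
    adjacency = {}
    for i, u in enumerate(nodes):
        for v in nodes[i + 1 : i + window]:
            if u == v:
                continue
            weight = weights.get((min(u, v), max(u, v)))
            if weight is None:
                continue  # pair not present in the global graph
            adjacency.setdefault(u, {})[v] = weight
            adjacency.setdefault(v, {})[u] = weight
    return adjacency


index = {"a": 1, "b": 2, "c": 3}
graph = {(1, 2): 2.0, (2, 3): 1.0, (1, 3): 5.0}
print(document_subgraph(["a", "b", "c"], index, graph, window=2))
```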

4. Applies TextRank

An iterative algorithm computes node importance:

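WS(v) = (1 - d) + d * Σ_{u ∈ N(v)} (WS(u) * weight(u,v) / W(u))

- WS(v): score of node v
- d: damping factor (typically 0.85)
- N(v): the neighbors of node v in the document subgraph
- weight(u,v): weight of the edge between u and v
- W(u): total weight of the edges from node u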

The algorithm iterates until convergence (controlled by a threshold parameter).
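
The update rule can be sketched as a plain fixed-point iteration over an undirected, weighted adjacency dictionary. This is a minimal illustration, not the project's implementation: the initial score of 1.0, the convergence test (largest per-node change below `threshold`), and the `max_iter` cap are assumptions.

```python
def textrank(adjacency, d=0.85, threshold=1e-4, max_iter=100):
    """Iterate WS(v) = (1 - d) + d * sum(WS(u) * w(u, v) / W(u)) until stable.

    `adjacency` maps each node to {neighbor: weight} and must be symmetric.
    Returns (scores, iterations used, last maximum change).
    """
    if not adjacency:
        return {}, 0, 0.0
    scores = {v: 1.0 for v in adjacency}
    # W(u): total weight of the edges attached to u.
    strength = {u: sum(nbrs.values()) for u, nbrs in adjacency.items()}
    iteration, delta = 0, 0.0
    for iteration in range(1, max_iter + 1):
        new_scores = {}
        for v, nbrs in adjacency.items():
            rank_sum = sum(scores[u] * w / strength[u] for u, w in nbrs.items())
            new_scores[v] = (1 - d) + d * rank_sum
        delta = max(abs(new_scores[v] - scores[v]) for v in scores)
        scores = new_scores
        if delta < threshold:
            break
    return scores, iteration, delta


# Path graph 1 - 2 - 3 with unit weights: the middle node should rank highest.
path = {1: {2: 1.0}, 2: {1: 1.0, 3: 1.0}, 3: {2: 1.0}}
scores, iterations, delta = textrank(path)
print(scores, iterations, delta)
```

With every node starting at 1.0 and no isolated nodes, the scores keep summing to the number of nodes, which gives a quick sanity check on an implementation.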

5. Selects keywords

- nodes are ranked by score
- top nodes are selected dynamically:
  - minimum: 5
  - maximum: 20
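
The exact rule behind the dynamic selection is not spelled out above. The sketch below assumes one common choice, a third of the document's nodes, clamped to the 5 to 20 range; both the one-third ratio and the name `select_keywords` are assumptions rather than the script's documented behavior.

```python
def select_keywords(scores, minimum=5, maximum=20):
    """Return the top-ranked terms, keeping between `minimum` and `maximum` of them."""
    # Sort by descending score; ties are broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    target = len(ranked) // 3  # ASSUMPTION: aim for one third of the nodes
    k = max(minimum, min(maximum, target))
    return [term for term, _ in ranked[:k]]  # fewer if the document is tiny


demo = {f"t{i:02d}": float(i) for i in range(30)}
print(select_keywords(demo))
```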

6. Outputs results

For each document:

document.txt.tagged.result

Also generates:

summary.csv

containing:

- number of iterations
- convergence value
- total score
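
For illustration only, a summary with these three values per document could be written with the standard `csv` module as below. The column names, the extra `document` column, and the function name `summary_csv` are hypothetical; the real summary.csv layout may differ.

```python
import csv
import io


def summary_csv(rows):
    """Render (document, iterations, convergence, total_score) rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # Hypothetical header; the actual column names are not documented here.
    writer.writerow(["document", "iterations", "convergence", "total_score"])
    for row in rows:
        writer.writerow(row)
    return buffer.getvalue()


print(summary_csv([("doc1", 12, 0.5, 41.5)]))
```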

🔁 Complete Pipeline

Tagged documents
        ↓
coocurrencias.py / coocurrencias-lem.py
        ↓
Co-occurrence graph (Pajek)
        ↓
palabras_clave-lem.py
        ↓
Keywords per document

📦 Sample Corpus

The folder sample_corpus_ptb/ contains 10 tagged documents.

🚀 Usage

python src/coocurrencias-lem.py -i data/sample_corpus_ptb -w 5
python src/palabras_clave-lem.py -i data/sample_corpus_ptb -g graph.txt -w 5 -d 0.85 -l 0.0001

📚 References

- Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual Web search engine.
- Mihalcea, R., & Tarau, P. (2004). TextRank: Bringing order into texts.
- Batagelj, V., & Mrvar, A. (1998). Pajek – Program for Large Network Analysis.
