ariadnafruits/textrank-keyword-extraction

TextRank-based Keyword Extraction from POS-Tagged Documents

This project implements the TextRank algorithm (Mihalcea & Tarau, 2004) from scratch, including graph construction, weighted scoring, and convergence-based ranking.

TextRank is a graph-based ranking algorithm derived from PageRank (Brin & Page, 1998), which was originally developed for ranking web pages. TextRank applies the same iterative scoring mechanism to textual units such as words or sentences, so that linguistic elements can be ranked by their position in a graph.

The system is an end-to-end keyword extraction pipeline for POS-tagged corpora in Penn Treebank format. It covers text preprocessing, graph construction, keyword extraction, and evaluation outputs.


📁 Project Structure

.
├── src/
│   ├── coocurrencias.py
│   ├── coocurrencias-lem.py
│   └── palabras_clave-lem.py
├── data/
│   └── sample_corpus_ptb/
├── docs/
│   ├── Instrucciones del programa coocurrencias.py.pdf
│   └── Instrucciones del programa palabras_clave-lem.py.pdf
└── README.md

📊 Input Format

The system expects POS-tagged text in Penn Treebank format, where each token is represented as:

word/POS

Example:

A/DT challenging/JJ problem/NN faced/VBN by/IN researchers/NNS
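
A minimal sketch of how one line in this format could be split into (word, POS) pairs. The function name `parse_tagged_line` is hypothetical and is not taken from the project's scripts. Splitting on the last slash is an assumption made here so that words containing a slash keep it.

```python
def parse_tagged_line(line):
    """Split a line of word/POS tokens into (word, POS) tuples."""
    pairs = []
    for token in line.split():
        # rpartition splits on the LAST slash, so a token such as
        # "1/2/CD" yields the word "1/2" and the tag "CD".
        word, sep, pos = token.rpartition("/")
        if sep:  # tokens without any slash carry no tag and are skipped
            pairs.append((word, pos))
    return pairs


example = "A/DT challenging/JJ problem/NN faced/VBN by/IN researchers/NNS"
print(parse_tagged_line(example))
```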

🧩 Part 1: Graph Construction (TextRank Input)

🔹 coocurrencias.py

This script generates a co-occurrence graph from a collection of tagged documents.

✔️ What it does

  • Parses documents into (word, POS) tuples
  • Normalizes tokens:
    • removes non-alphabetic characters
    • converts to lowercase
    • removes stopwords
  • Applies stemming (Porter)
  • Filters tokens by POS tags
  • Builds a vocabulary
  • Computes co-occurrences within a sliding window
  • Generates a weighted graph in Pajek format
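
The normalization and co-occurrence steps listed above can be sketched as follows. This is an illustration under stated assumptions, not the code of coocurrencias.py: the stopword list is a tiny placeholder, the POS filter (nouns and adjectives) is a guess, stemming is left out to keep the example free of external libraries, and a window of size `w` is taken to mean "terms at most `w - 1` positions apart". All function names are hypothetical.

```python
import re
from collections import Counter

STOPWORDS = {"a", "by", "the", "of"}   # placeholder list (assumption)
KEPT_POS_PREFIXES = ("NN", "JJ")       # assumption: keep nouns and adjectives


def normalize(pairs):
    """Lowercase, strip non-letters, drop stopwords and unwanted POS tags."""
    terms = []
    for word, pos in pairs:
        word = re.sub(r"[^a-z]", "", word.lower())
        if word and word not in STOPWORDS and pos.startswith(KEPT_POS_PREFIXES):
            terms.append(word)
    return terms


def cooccurrences(terms, window=2):
    """Count unordered pairs of distinct terms at most window - 1 positions apart."""
    counts = Counter()
    for i, left in enumerate(terms):
        for right in terms[i + 1 : i + window]:
            if left != right:
                counts[tuple(sorted((left, right)))] += 1
    return counts


pairs = [("A", "DT"), ("challenging", "JJ"), ("problem", "NN"),
         ("faced", "VBN"), ("by", "IN"), ("researchers", "NNS")]
terms = normalize(pairs)
print(terms)
print(cooccurrences(terms, window=2))
```

The pair counts play the role of edge weights; the sorted tuple keeps the graph undirected, so (a, b) and (b, a) share one counter.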

🔹 coocurrencias-lem.py (Improved version)

This script is an improved version of coocurrencias.py.

🔧 Key improvement

  • Replaces stemming with lemmatization using WordNet

🔗 Output of Part 1

Both scripts generate a graph in Pajek format:

*Vertices N
1 "term1"
2 "term2"
...

*Edges
1 2 weight
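
For illustration, a graph could be serialized into this layout as sketched below. The function name `format_pajek` and its arguments are hypothetical; the sketch assumes unique vocabulary terms and 1-based vertex ids, as in the layout above.

```python
def format_pajek(vocabulary, edge_weights):
    """Return a Pajek-formatted string for a weighted undirected graph.

    vocabulary: list of unique terms; the term at position i gets id i + 1.
    edge_weights: dict mapping (term_a, term_b) to a numeric weight.
    """
    index = {term: i for i, term in enumerate(vocabulary, start=1)}
    lines = [f"*Vertices {len(vocabulary)}"]
    lines += [f'{i} "{term}"' for term, i in index.items()]
    lines.append("*Edges")
    for (a, b), weight in sorted(edge_weights.items()):
        lines.append(f"{index[a]} {index[b]} {weight}")
    return "\n".join(lines) + "\n"


print(format_pajek(["problem", "researchers"], {("problem", "researchers"): 2}))
```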

🧠 Part 2: Keyword Extraction (TextRank)

🔹 palabras_clave-lem.py

This script implements the TextRank algorithm over the graph generated in Part 1.

⚠️ Important: The input Pajek graph must be generated using the same preprocessing pipeline as the keyword extraction step, in particular lemmatization with WordNetLemmatizer.

The script palabras_clave-lem.py assumes that the terms in the graph vocabulary match exactly the lemmatized terms extracted from the documents. Therefore, the graph should be created using coocurrencias-lem.py.

If the graph is generated using stemming (e.g. coocurrencias.py), term mismatches may occur and the program will fail when mapping words to graph nodes.
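
The mismatch can be illustrated with a toy vocabulary. This sketch only mimics the lookup; how palabras_clave-lem.py actually fails is not shown in this README, so the `KeyError` here is an assumption about a dictionary-based mapping. The helper `map_to_nodes` is hypothetical.

```python
# Toy vocabularies: a Porter-style stem ("studi") versus a WordNet-style
# lemma ("study") for the same underlying word.
graph_vocab_stemmed = {"studi": 1, "network": 2}
graph_vocab_lemmatized = {"study": 1, "network": 2}

document_lemmas = ["study", "network"]


def map_to_nodes(terms, vocabulary):
    """Look up each term's node id; raises KeyError on an unknown term."""
    return [vocabulary[t] for t in terms]


print(map_to_nodes(document_lemmas, graph_vocab_lemmatized))  # [1, 2]

try:
    map_to_nodes(document_lemmas, graph_vocab_stemmed)
except KeyError as missing:
    print("not in graph vocabulary:", missing)
```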


⚙️ What it does

1. Loads the graph

  • Reads vocabulary and edge weights from a Pajek file
  • Maps terms to node indices
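
A minimal sketch of this loading step, assuming the Pajek layout shown in Part 1. The function name `parse_pajek` and the returned structures are hypothetical, not the script's own.

```python
def parse_pajek(text):
    """Return (index_of_term, weights) parsed from Pajek-formatted text.

    index_of_term maps each term to its integer node id;
    weights maps (node_u, node_v) to the edge weight as a float.
    """
    index_of_term = {}
    weights = {}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("*"):
            section = line.split()[0].lower()  # "*vertices" or "*edges"
            continue
        if section == "*vertices":
            node_id, label = line.split(maxsplit=1)
            index_of_term[label.strip('"')] = int(node_id)
        elif section == "*edges":
            u, v, w = line.split()[:3]
            weights[(int(u), int(v))] = float(w)
    return index_of_term, weights


sample = '*Vertices 2\n1 "problem"\n2 "researchers"\n\n*Edges\n1 2 3\n'
print(parse_pajek(sample))
```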

2. Processes each document

  • tokenization (word/POS)
  • normalization
  • lemmatization (same as coocurrencias-lem.py)
  • POS filtering

👉 Ensures consistency between graph and document processing


3. Builds document subgraph

  • extracts nodes present in the document
  • builds co-occurrence pairs within a window
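
One way to picture this step is sketched below. It assumes that edge weights are taken from the corpus-level graph and that only pairs co-occurring inside the document window are kept; the README does not confirm either detail. The function name and the adjacency-dictionary structure are hypothetical.

```python
def document_subgraph(doc_terms, index_of_term, weights, window=2):
    """Return {node: {neighbour: weight}} restricted to one document."""
    nodes = [index_of_term[t] for t in doc_terms]
    adjacency = {n: {} for n in nodes}
    for i, u in enumerate(nodes):
        for v in nodes[i + 1 : i + window]:
            if u == v:
                continue
            # Assumption: the weight comes from the global graph, where an
            # undirected edge may be stored as (u, v) or as (v, u).
            w = weights.get((u, v), weights.get((v, u), 0.0))
            if w > 0:
                adjacency[u][v] = w
                adjacency[v][u] = w
    return adjacency


index_of_term = {"a": 1, "b": 2, "c": 3}
weights = {(1, 2): 3.0, (2, 3): 1.0}
print(document_subgraph(["a", "b", "c"], index_of_term, weights, window=2))
```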

4. Applies TextRank

An iterative algorithm computes node importance:

WS(v) = (1 - d) + d * Σ_{u ∈ N(v)} (WS(u) * weight(u,v) / W(u))

  • WS(v): score of node v
  • N(v): nodes connected to v by an edge
  • d: damping factor (typically 0.85)
  • weight(u,v): weight of the edge between u and v
  • W(u): total weight of edges from node u

The algorithm iterates until convergence (controlled by a threshold parameter).
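
The iteration can be sketched as follows for an undirected weighted graph stored as `{node: {neighbour: weight}}`. This is a minimal sketch, not the code of palabras_clave-lem.py: the initial score of 1.0, the stopping rule (largest absolute score change below the threshold), and the `max_iter` safety cap are assumptions.

```python
def textrank(adjacency, d=0.85, threshold=1e-4, max_iter=200):
    """Weighted TextRank; returns (scores, iterations, last_change)."""
    scores = {v: 1.0 for v in adjacency}
    # W(u): total weight of the edges attached to each node.
    total_weight = {u: sum(nbrs.values()) for u, nbrs in adjacency.items()}
    iterations, change = 0, float("inf")
    while iterations < max_iter and change >= threshold:
        new_scores = {
            v: (1 - d) + d * sum(scores[u] * w / total_weight[u]
                                 for u, w in nbrs.items())
            for v, nbrs in adjacency.items()
        }
        # Largest score movement in this pass decides convergence.
        change = max((abs(new_scores[v] - scores[v]) for v in adjacency),
                     default=0.0)
        scores = new_scores
        iterations += 1
    return scores, iterations, change


# Path graph a - b - c: the middle node should receive the highest score.
graph = {"a": {"b": 1.0}, "b": {"a": 1.0, "c": 1.0}, "c": {"b": 1.0}}
scores, iterations, change = textrank(graph)
print(scores, iterations, change)
```

With this form of the formula, the scores of a graph without isolated nodes keep summing to the number of nodes, which gives a quick sanity check on an implementation.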


5. Selects keywords

  • nodes are ranked by score
  • top nodes are selected dynamically:
    • minimum: 5
    • maximum: 20
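
The README does not state the rule behind the dynamic selection. The sketch below assumes one plausible rule, a third of the document's nodes clamped to the range 5 to 20; the `fraction` parameter and the function name are hypothetical, and the actual script may choose differently.

```python
def select_keywords(scores, minimum=5, maximum=20, fraction=1 / 3):
    """Pick the top-scoring nodes, keeping between minimum and maximum of them."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    # Assumed rule: a fraction of the node count, clamped to [minimum, maximum].
    k = max(minimum, min(maximum, round(len(ranked) * fraction)))
    return ranked[:k]  # fewer than k nodes simply returns all of them


toy_scores = {f"term{i}": float(i) for i in range(30)}
print(select_keywords(toy_scores))
```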

6. Outputs results

For each document:

document.txt.tagged.result

Also generates:

summary.csv

containing:

  • number of iterations
  • convergence value
  • total score
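
As an illustration only, a summary with these three values per document could be produced as sketched below. The column names, the extra `document` column, and the function name are assumptions; the real layout of summary.csv is not documented here.

```python
import csv
import io


def summary_to_csv(rows):
    """rows: iterable of (document, iterations, convergence, total_score)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # Hypothetical header; the script's actual column names may differ.
    writer.writerow(["document", "iterations", "convergence", "total_score"])
    writer.writerows(rows)
    return buffer.getvalue()


print(summary_to_csv([("doc1.txt.tagged", 12, 0.0001, 41.0)]))
```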

🔁 Complete Pipeline

Tagged documents
        ↓
coocurrencias.py / coocurrencias-lem.py
        ↓
Co-occurrence graph (Pajek)
        ↓
palabras_clave-lem.py
        ↓
Keywords per document

📦 Sample Corpus

The folder data/sample_corpus_ptb/ contains 10 tagged documents.


🚀 Usage

python src/coocurrencias-lem.py -i data/sample_corpus_ptb -w 5
python src/palabras_clave-lem.py -i data/sample_corpus_ptb -g graph.txt -w 5 -d 0.85 -l 0.0001

📚 References

  • Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual web search engine.
  • Mihalcea, R., & Tarau, P. (2004). TextRank: Bringing order into texts.
  • Batagelj, V., & Mrvar, A. (1998). Pajek – Program for Large Network Analysis.

About

Graph-based keyword extraction using a custom implementation of the TextRank algorithm on POS-tagged corpora
