A comparative study of two approaches to Relation Extraction (RE) on the SemEval 2010 Task 8 dataset — a BiLSTM model with custom Relative Positional Encoding (RPE) and a hybrid BERT + SVM pipeline.
Group Project — Joshwin Sundarraj, Habiba Ahmad, Dion Recai, Manan Mal
Relation Extraction is the task of identifying semantic relationships between entity pairs in text — a core component of knowledge base construction, biomedical text mining, and automated Q&A systems. This project implements and compares two contrasting paradigms:
| Aspect | BiLSTM + RPE | BERT + SVM |
|---|---|---|
| Feature learning | End-to-end (learned) | Pretrained embeddings + classical classifier |
| Positional awareness | Custom relative positional encoding | Entity tags preserved in input |
| Sequence modelling | Bidirectional LSTM | None (mean-pooled embeddings) |
| Hyperparameter tuning | Bayesian Optimisation | GridSearchCV (5-fold CV) |
| Test Accuracy | 75.27% | 64.26% |
| Macro F1 | 70.42% | 59.0% |
SemEval 2010 Task 8 — sentence-level relation extraction benchmark.
- 19 relation classes: 9 relation types in both directions, plus an undirected Other category
- Human-annotated entity pairs with positional tags (`<e1>`, `<e2>`)
- Pre-defined train/test split
Entity tags are retained in the input for both approaches, consistent with findings by Baldini Soares et al. (2019) and Tao et al. (2019) that syntactic indicators improve relation learning.
- Entity tags (`<e1>`, `</e1>`, `<e2>`, `</e2>`) are space-separated for clean tokenisation
- Punctuation is filtered (except `<`, `>`, `\`) to reduce OOV tokens
- Sequences are padded to uniform length
- Relation labels are one-hot encoded to vectors of size 19
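The preprocessing steps above can be sketched in plain Python. This is an illustrative sketch, not the project's actual code: the helper names (`tokenise`, `pad`, `one_hot`) and the `<pad>` token are hypothetical, and the sketch also keeps `/` in the text (an assumption) so that closing tags such as `</e1>` survive as single tokens.

```python
import re
import string

# Matches any of the four entity tags, plus surrounding whitespace.
TAGS = re.compile(r"\s*(</?e[12]>)\s*")

# Punctuation to drop. Assumption: '<', '>', '\' and also '/' are kept,
# otherwise closing tags would be split apart.
DROP_TABLE = str.maketrans(
    "", "", "".join(c for c in string.punctuation if c not in "<>/\\")
)

def tokenise(sentence):
    """Space-separate entity tags, strip punctuation, split on whitespace."""
    spaced = TAGS.sub(r" \1 ", sentence)
    return spaced.translate(DROP_TABLE).split()

def pad(tokens, length, pad_token="<pad>"):
    """Truncate or right-pad a token list to a fixed length."""
    return tokens[:length] + [pad_token] * max(0, length - len(tokens))

def one_hot(index, num_classes=19):
    """One-hot vector for a relation label index."""
    vec = [0] * num_classes
    vec[index] = 1
    return vec

sentence = ("The <e1>bottle</e1> was filled with "
            "<e2>coloured water</e2> and placed on the table.")
tokens = tokenise(sentence)
print(tokens[:4])            # ['The', '<e1>', 'bottle', '</e1>']
print(len(pad(tokens, 20)))  # 20
```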
Two position encodings are computed per sample — one relative to each tagged entity. For example:
"The <e1>bottle</e1> was filled with <e2>coloured water</e2> and placed on the table."
RPE w.r.t. <e1>bottle</e1> → [-2, -1, 0, +1, ..., +13]
RPE w.r.t. <e2>coloured water</e2> → [-8, ..., -1, 0, 0, +1, ..., +6]
Every token inside the tagged entity is assigned 0, and the tag tokens themselves count as positions. Both encodings are padded to match the token sequence length.
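A minimal sketch of how these encodings could be computed, assuming the sentence has already been split into tokens with the entity tags as separate tokens. The function name `relative_positions` is hypothetical. The sketch reproduces the two vectors in the example above.

```python
def relative_positions(tokens, open_tag, close_tag):
    """Signed token distance to the entity span wrapped by the given tags.

    Tokens inside the entity get 0; tokens to the left are negative,
    tokens to the right are positive. Tag tokens count as positions.
    """
    start = tokens.index(open_tag) + 1   # first token of the entity
    end = tokens.index(close_tag) - 1    # last token of the entity
    positions = []
    for i in range(len(tokens)):
        if i < start:
            positions.append(i - start)
        elif i > end:
            positions.append(i - end)
        else:
            positions.append(0)
    return positions

tokens = ("The <e1> bottle </e1> was filled with <e2> coloured water </e2> "
          "and placed on the table").split()

rpe_e1 = relative_positions(tokens, "<e1>", "</e1>")
rpe_e2 = relative_positions(tokens, "<e2>", "</e2>")
print(rpe_e1)  # [-2, -1, 0, 1, 2, ..., 13]
print(rpe_e2)  # [-8, ..., -1, 0, 0, 1, ..., 6]
```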
- Embedding layer: GloVe pretrained embeddings (dim = 300, tuned via Bayesian Optimisation)
- Position embeddings: learned from RPE inputs
- Concatenation: word + position₁ + position₂ embeddings → 1D Spatial Dropout
- BiLSTM layer: single layer, motivated by Peters et al. (2018), who show that lower layers capture significant syntactic information
- Output: Dense layer with Softmax over 19 relation classes
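The concatenation step can be illustrated with a NumPy shape walk-through. This is a sketch under assumptions, not the trained model: the tables are random rather than learned, `VOCAB_SIZE`, `MAX_LEN`, and `POS_DIM` are made-up values, and relative positions are shifted by `MAX_LEN - 1` so that negative offsets become valid row indices. In the real model the result would then pass through spatial dropout and the BiLSTM.

```python
import numpy as np

rng = np.random.default_rng(0)

MAX_LEN = 16      # padded sequence length (illustrative)
VOCAB_SIZE = 100  # illustrative vocabulary size
WORD_DIM = 300    # GloVe dimension stated above
POS_DIM = 25      # assumption: the position embedding size is not stated

# Random stand-ins for the embedding tables (learned or pretrained in practice).
word_table = rng.normal(size=(VOCAB_SIZE, WORD_DIM))
# Relative positions lie in [-(MAX_LEN - 1), MAX_LEN - 1].
pos_table_1 = rng.normal(size=(2 * MAX_LEN - 1, POS_DIM))
pos_table_2 = rng.normal(size=(2 * MAX_LEN - 1, POS_DIM))

def embed(token_ids, rpe1, rpe2):
    """Concatenate word and both position embeddings for each token."""
    words = word_table[np.asarray(token_ids)]
    pos1 = pos_table_1[np.asarray(rpe1) + MAX_LEN - 1]  # shift to >= 0
    pos2 = pos_table_2[np.asarray(rpe2) + MAX_LEN - 1]
    return np.concatenate([words, pos1, pos2], axis=-1)

token_ids = rng.integers(0, VOCAB_SIZE, size=MAX_LEN)
rpe1 = list(range(-2, 14))
rpe2 = list(range(-8, 0)) + [0, 0] + list(range(1, 7))

features = embed(token_ids, rpe1, rpe2)
print(features.shape)  # (16, 350)
```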
| Hyperparameter | Value |
|---|---|
| Embedding dim | 300 |
| LSTM units | 96 |
| Dropout rate | 0.8 |
| Learning rate | 0.001 |
| BiLSTM dropout | 0.4 |
| Kernel regulariser (L2) | 0.0001 |
Training: 30 epochs with batch size 10 on the full training set, using the tuned hyperparameters above.
- Model: `bert-base-uncased` (768-dimensional output embeddings)
- Base model chosen over BERT-large due to comparable performance at significantly lower compute cost (Shi and Lin, 2019)
- Sentence representation: mean of all token embeddings (outperforms [CLS] token pooling in our experiments; consistent with Reimers and Gurevych, 2019)
- PCA dimensionality reduction was tested but produced significantly worse results and was dropped
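Mean pooling over token embeddings can be sketched as follows. The array here stands in for BERT's final hidden states, with shape `(batch, seq_len, 768)`. Excluding padded positions via an attention mask is an assumption of this sketch; the description above only says that all token embeddings are averaged.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring padded positions."""
    mask = np.asarray(attention_mask, dtype=float)[..., None]  # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)             # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)             # avoid divide-by-zero
    return summed / counts

# Toy input: two "sentences" of four positions, the second with two pad slots.
embeddings = np.ones((2, 4, 768))
embeddings[1, 2:] = 100.0  # values at padded positions should be ignored
mask = [[1, 1, 1, 1], [1, 1, 0, 0]]

sentence_vectors = mean_pool(embeddings, mask)
print(sentence_vectors.shape)  # (2, 768)
```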
- Kernel: linear (recommended for text classification, Hsu et al., 2009)
- Hyperparameter tuning via GridSearchCV (5-fold cross-validation) over `C` and `class_weight`
- Chosen for robustness to overfitting and computational efficiency
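A hedged sketch of this tuning setup with scikit-learn. The candidate values for `C` and `class_weight`, the `f1_macro` scoring choice, and the synthetic data are all assumptions for illustration; in the project, `X` would hold the mean-pooled BERT sentence vectors and `y` the relation labels.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for sentence vectors and relation labels.
X, y = make_classification(
    n_samples=150, n_features=20, n_informative=5, n_classes=3, random_state=0
)

# Assumed search grid: the actual candidate values are not given in this README.
param_grid = {"C": [0.1, 1, 10], "class_weight": [None, "balanced"]}

search = GridSearchCV(
    SVC(kernel="linear"),
    param_grid,
    cv=5,                # 5-fold cross-validation
    scoring="f1_macro",  # assumption: the scoring metric is not stated
)
search.fit(X, y)
print(search.best_params_)
```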
| Model | Test Accuracy | Macro F1 |
|---|---|---|
| BiLSTM + RPE | 75.27% | 70.42% |
| BERT + SVM | 64.26% | 59.0% |
| Relation | BiLSTM P | BiLSTM R | BiLSTM F1 | BERT+SVM P | BERT+SVM R | BERT+SVM F1 |
|---|---|---|---|---|---|---|
| Cause-Effect(e1,e2) | 0.89 | 0.89 | 0.89 | 0.85 | 0.63 | 0.72 |
| Cause-Effect(e2,e1) | 0.89 | 0.87 | 0.88 | 0.63 | 0.88 | 0.74 |
| Component-Whole(e1,e2) | 0.80 | 0.78 | 0.79 | 0.68 | 0.70 | 0.69 |
| Component-Whole(e2,e1) | 0.62 | 0.64 | 0.63 | 0.51 | 0.65 | 0.57 |
| Content-Container(e1,e2) | 0.72 | 0.95 | 0.82 | 0.79 | 0.83 | 0.81 |
| Content-Container(e2,e1) | 0.78 | 0.74 | 0.76 | 0.56 | 0.56 | 0.56 |
| Entity-Destination(e1,e2) | 0.85 | 0.90 | 0.87 | 0.73 | 0.85 | 0.78 |
| Entity-Destination(e2,e1) | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| Entity-Origin(e1,e2) | 0.79 | 0.87 | 0.83 | 0.59 | 0.78 | 0.67 |
| Entity-Origin(e2,e1) | 0.80 | 0.87 | 0.84 | 0.65 | 0.77 | 0.71 |
| Instrument-Agency(e1,e2) | 0.52 | 0.55 | 0.53 | 0.33 | 0.27 | 0.30 |
| Instrument-Agency(e2,e1) | 0.69 | 0.73 | 0.71 | 0.69 | 0.48 | 0.56 |
| Member-Collection(e1,e2) | 0.43 | 0.66 | 0.52 | 0.50 | 0.41 | 0.45 |
| Member-Collection(e2,e1) | 0.80 | 0.91 | 0.85 | 0.63 | 0.88 | 0.73 |
| Message-Topic(e1,e2) | 0.76 | 0.90 | 0.82 | 0.71 | 0.80 | 0.75 |
| Message-Topic(e2,e1) | 0.78 | 0.75 | 0.76 | 0.78 | 0.57 | 0.66 |
| Product-Producer(e1,e2) | 0.76 | 0.78 | 0.77 | 0.60 | 0.58 | 0.59 |
| Product-Producer(e2,e1) | 0.68 | 0.69 | 0.69 | 0.55 | 0.53 | 0.54 |
| Other | 0.54 | 0.33 | 0.41 | 0.49 | 0.22 | 0.30 |
| Macro Avg | 0.69 | 0.73 | 0.70 | 0.59 | 0.60 | 0.59 |
| Weighted Avg | 0.74 | 0.75 | 0.74 | 0.63 | 0.64 | 0.62 |
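The macro average is the unweighted mean over all 19 classes, so a class with a score of 0.00 lowers it regardless of how few samples that class has. As a worked check, the sketch below recomputes macro F1 from the rounded per-class F1 values in the table, then shows the effect of leaving out the zero-scoring class. Because the inputs are rounded to two decimals, the results are approximate.

```python
from statistics import mean

# Per-class F1 scores copied from the table, in row order (Other last).
bilstm_f1 = [0.89, 0.88, 0.79, 0.63, 0.82, 0.76, 0.87, 0.00, 0.83, 0.84,
             0.53, 0.71, 0.52, 0.85, 0.82, 0.76, 0.77, 0.69, 0.41]
bert_svm_f1 = [0.72, 0.74, 0.69, 0.57, 0.81, 0.56, 0.78, 0.00, 0.67, 0.71,
               0.30, 0.56, 0.45, 0.73, 0.75, 0.66, 0.59, 0.54, 0.30]

def macro_average(scores):
    """Unweighted mean of per-class scores."""
    return mean(scores)

def without_zero_classes(scores):
    """Drop classes that scored exactly 0.00."""
    return [s for s in scores if s > 0]

print(round(macro_average(bilstm_f1), 2))    # 0.7
print(round(macro_average(bert_svm_f1), 2))  # 0.59

# Effect of excluding Entity-Destination(e2,e1), the only zero-scoring class.
print(round(macro_average(without_zero_classes(bilstm_f1)), 2))    # 0.74
print(round(macro_average(without_zero_classes(bert_svm_f1)), 2))  # 0.62
```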
- BiLSTM + RPE outperforms BERT + SVM on all overall metrics, reinforcing the advantage of end-to-end sequential feature learning over a pipeline approach.
- Entity-Destination(e2,e1) scored 0.00 on precision, recall, and F1 for both models. This relation had only one training sample, making it impossible to learn. Potential fixes: SMOTE oversampling, deriving extra samples from sentences of the reverse relation, or removing the class.
- BERT + SVM showed competitive precision on some individual relations (e.g. Cause-Effect, Content-Container) where BERT's contextual embeddings captured sufficient signal.
- Both models underperform the SOTA BiLSTM+Attention model (Zhou et al., 2016), which achieves 85% accuracy. We attribute the gap largely to its attention mechanism, which we did not implement.
- The BERT + SVM model's macro-averaged precision (59%) falls below that of specialist SVM systems (63–84%) that use syntactic features such as POS tags, WordNet, and NomLex. This suggests that BERT embeddings alone do not substitute for feature engineering in SVMs.
- Fine-tuning BERT end-to-end for relation classification rather than using it as a frozen feature extractor
- Adding an attention mechanism to the BiLSTM (as in Zhou et al., 2016) to target the performance gap with SOTA
- Addressing class imbalance via SMOTE or data augmentation for rare relation types
- Exploring graph-based approaches (e.g. GCN over dependency parse trees) for richer structural representations
- Devlin et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers. NAACL.
- Zhou et al. (2016). Attention-based BiLSTM for Relation Classification. ACL.
- Baldini Soares et al. (2019). Matching the Blanks: Distributional Similarity for Relation Learning. ACL.
- Pennington et al. (2014). GloVe: Global Vectors for Word Representation. EMNLP.
- Reimers and Gurevych (2019). Sentence-BERT. EMNLP.
- Hendrickx et al. (2010). SemEval-2010 Task 8. SemEval Workshop.
- Rink and Harabagiu (2010). UTD: Classifying Semantic Relations. SemEval Workshop.