Semantic Textual Similarity with Supervised and Unsupervised Learning

Authors: Tejansh Sachdeva and Mitaali Singhal

Overview

Semantic Textual Similarity (STS) measures how close two pieces of text are in meaning. It plays a critical role in Natural Language Processing (NLP) applications such as information retrieval, question answering, and summarization.

This repository provides the codebase and resources for our research project, which introduces a hybrid approach combining supervised learning, unsupervised learning, and ensemble models to achieve state-of-the-art performance on STS tasks.

Key highlights include:

Use of traditional models like SVR, LightGBM, and XGBoost.
Integration of feedforward neural networks for unsupervised learning.
Novel ensemble models (XGB + LightGBM, SVR + Neural Network) for robust similarity prediction.
Experiments conducted on benchmark datasets such as SemEval 2012 Task 6.
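
The ensemble idea in the highlights above can be illustrated with a minimal sketch. It blends the predictions of two already-trained regressors by weighted averaging and clips the result to the 0 to 5 score range used by SemEval STS. The function name `ensemble_average`, the `weight_a` parameter, and the averaging scheme itself are assumptions made for illustration; the combination rule used by the scripts in the ensembled folder may differ.

```python
from typing import List, Sequence


def ensemble_average(
    preds_a: Sequence[float],
    preds_b: Sequence[float],
    weight_a: float = 0.5,
    lo: float = 0.0,
    hi: float = 5.0,
) -> List[float]:
    """Blend two models' similarity predictions by weighted averaging.

    Hypothetical helper: weighted averaging is an assumed combination rule,
    not necessarily the one implemented in this repository.
    """
    if len(preds_a) != len(preds_b):
        raise ValueError("both prediction lists must have the same length")
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must lie in [0, 1]")

    weight_b = 1.0 - weight_a
    blended = []
    for a, b in zip(preds_a, preds_b):
        score = weight_a * a + weight_b * b
        # Keep the blended score inside the valid similarity range.
        blended.append(min(hi, max(lo, score)))
    return blended


if __name__ == "__main__":
    # Example: predictions from two models (e.g. XGBoost and LightGBM).
    xgb_preds = [1.0, 4.0]
    lgbm_preds = [3.0, 6.0]
    print(ensemble_average(xgb_preds, lgbm_preds))
```

In practice the weight would be tuned on a held-out split rather than fixed at 0.5.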

Repository Structure

├── data/  
│   └── dataset.txt      # Primary dataset for experiments.  
├── supervised/  
│   ├── baseline/        # Baseline supervised models.  
│   ├── lightgbm/        # LightGBM implementation.  
│   ├── svr/             # Support Vector Regressor.  
│   └── xgboost/         # XGBoost implementation.  
├── unsupervised/  
│   └── feedfwd_nn/      # Feedforward neural network for unsupervised learning.  
├── ensembled/  
│   ├── xgb_lgm/         # Ensemble of XGBoost + LightGBM.  
│   └── svr_nn/          # Ensemble of SVR + Neural Network.  
├── requirements.txt     # Dependencies for the project.  
└── README.md            # Project documentation.

Installation

Clone the repository:

git clone https://github.com/ms923/Semantic-Analysis.git
cd Semantic-Analysis

Create a virtual environment (Python 3.11 recommended):

python3.11 -m venv env  
source env/bin/activate  # For Linux/Mac  
env\Scripts\activate     # For Windows

Install required dependencies:
```
pip install -r requirements.txt  
```

Usage

Supervised Models: Run individual models (svr, lightgbm, xgboost) from the supervised folder.
Unsupervised Models: Train the feedforward neural network in the unsupervised folder.
Ensembled Models: Combine the strengths of different methods by running scripts in the ensembled folder.

Results

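Models are evaluated by the Pearson and Spearman correlation between predicted similarity scores and human similarity judgments, and achieve state-of-the-art correlations on both metrics.
A detailed evaluation is available in the paper accompanying this repository (Research Paper.pdf).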
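
For readers unfamiliar with the two metrics, the following is a minimal, dependency-free sketch of how Pearson and Spearman correlations can be computed between gold scores and predictions. The function names are hypothetical and this is not the repository's evaluation code, which may rely on a library implementation instead. Spearman is computed here as the Pearson correlation of the ranks, with tied values given their average rank.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    gold = [0.0, 1.5, 3.0, 4.5, 5.0]   # illustrative human scores
    pred = [0.2, 1.0, 3.5, 4.0, 4.9]   # illustrative model outputs
    print("Pearson:", pearson(gold, pred))
    print("Spearman:", spearman(gold, pred))
```

Pearson rewards predictions that are linearly related to the gold scores, while Spearman only checks that the ordering of the pairs is preserved.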

Datasets

The dataset.txt file in the data folder contains the primary data used for training and evaluation.

Citation

If you use this code or approach in your research, please cite:

@article{Tejansh2024STS,  
  title={Semantic Textual Similarity with Supervised and Unsupervised Learning: Applications of SVR and Ensembling},  
  author={Tejansh Sachdeva and Mitaali Singhal},  
  year={2024}  
}

License

This project is licensed under the MIT License.
