This project is a mini-project for the NLP course (Master 1, AI, Semester 2), focused on applying and fine-tuning the CodeT5 model to generate natural-language docstrings from source code snippets.
In addition to implementing the fine-tuning pipeline, I studied and summarized the CodeT5 research paper to make sure I understood the architecture, training methodology, and evaluation metrics used in the original work.
Automatic code documentation is an important challenge in software engineering. Large models such as CodeT5 already achieve strong performance in code summarization tasks. However, fine-tuning on a carefully prepared dataset can adapt the model to a specific domain and improve the quality of generated docstrings.
In this project:
- I fine-tuned CodeT5-small on a dataset of 40k code-docstring pairs (a training sketch follows this list).
- I compared pre-trained performance vs. fine-tuned performance using multiple metrics.
- I implemented an interactive CLI tool where users can paste code and instantly get generated docstrings.
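
As a rough illustration of what the training setup can look like, the snippet below fine-tunes `Salesforce/codet5-small` with Hugging Face `transformers`; the dataset path, column names, and hyperparameters are illustrative assumptions, not the exact configuration used in this project.

```python
# Hypothetical fine-tuning sketch: the data file, field names, and
# hyperparameters are placeholders, not this project's exact settings.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

checkpoint = "Salesforce/codet5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Assumes a JSONL file with "code" and "docstring" fields.
dataset = load_dataset("json", data_files="data/train.jsonl", split="train")

def preprocess(batch):
    # Tokenize the code as the encoder input and the docstring as the target.
    inputs = tokenizer(batch["code"], max_length=256, truncation=True)
    labels = tokenizer(text_target=batch["docstring"], max_length=128, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="./checkpoints",
    per_device_train_batch_size=16,
    learning_rate=5e-5,
    num_train_epochs=3,
    logging_steps=100,
    save_strategy="epoch",
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```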
To set up the project:

```bash
# Clone the repository
git clone https://github.com/diaazg/code-comment.git
cd code-comment

# Create the environment
conda create -n codet5-docstring python=3.10 -y
conda activate codet5-docstring

# Install dependencies
pip install -r requirements.txt
```

Run the interactive CLI tool:

```bash
python main.py --model_dir ./checkpoints --device mps  # use cuda or mps depending on your device
```
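
Inside the CLI, the generation step presumably boils down to a standard `generate` call on the fine-tuned checkpoint; the sketch below shows that pattern, with the checkpoint directory and decoding parameters as assumptions rather than the tool's exact implementation.

```python
# Hypothetical generation step: the checkpoint path and decoding
# parameters are assumptions, not main.py's exact implementation.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

device = "mps" if torch.backends.mps.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("./checkpoints")
model = AutoModelForSeq2SeqLM.from_pretrained("./checkpoints").to(device)

code = "def add(a, b):\n    return a + b"
inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=256).to(device)
output_ids = model.generate(**inputs, max_length=128, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```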
I evaluated the model on 4,000 samples before and after fine-tuning.

Before fine-tuning (pre-trained CodeT5-small):
- ROUGE-1: ~0.05
- METEOR: ~0.027
- BERTScore (F1): ~0.78
After fine-tuning:
- ROUGE-1: ~0.34
- METEOR: ~0.24
- BERTScore (F1): ~0.87
Fine-tuning led to a substantial improvement, especially in semantic similarity (BERTScore), showing that the fine-tuned model better captures the meaning of the reference docstrings.
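
For reference, scores of this kind can be computed with the Hugging Face `evaluate` library; the snippet below is a generic sketch with toy predictions and references, not the project's actual evaluation script.

```python
# Generic metric-computation sketch using the `evaluate` library;
# the predictions/references below are toy examples, not project data.
import evaluate

predictions = ["Adds two numbers and returns the sum."]
references = ["Return the sum of two numbers."]

rouge = evaluate.load("rouge")
meteor = evaluate.load("meteor")
bertscore = evaluate.load("bertscore")

print(rouge.compute(predictions=predictions, references=references)["rouge1"])
print(meteor.compute(predictions=predictions, references=references)["meteor"])
print(bertscore.compute(predictions=predictions, references=references, lang="en")["f1"])
```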
This project is based on the CodeT5 model introduced in:
Wang et al., CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation, EMNLP 2021.