Public home: https://github.com/checheng117/Verifiable-Code-Generation-Post-Training-Pipeline-for-Open-LLMs
This repository contains the code, frozen configs, and audited run artifacts for Assignment 2 of CUHKSZ CSC5051/MDS5110/CSC6052: Natural Language Processing.
The assignment task is verifiable code generation. The mainline system is SFT on MBPP with a local Qwen3-14B model, evaluated against the required non-finetuning baselines on MBPP and checked for external generalization on HumanEval. GRPO is included as a genuine exploratory extension, but it is not the main scientific claim.
- Task type: reasoning / code generation.
- Main training method: supervised fine-tuning.
- Exploratory extension: GRPO after the SFT pipeline stabilized.
- Required non-finetuning baselines: zero-shot ICL and few-shot ICL.
- Required supporting checks covered in this repo: data card information, catastrophic-forgetting sanity probe, training-cost audit, and peak GPU memory audit.
All numbers below come from archived run artifacts under `outputs/runs/`.
| Method | Benchmark | compile_rate | pass@1 | solved_count | Status |
|---|---|---|---|---|---|
| Zero-shot ICL | MBPP | 0.2111 | 0.0000 | 0/90 | official non-finetuning baseline |
| Few-shot ICL | MBPP | 0.0000 | 0.0000 | 0/90 | official negative baseline |
| SFT stable | MBPP | 0.3556 | 0.0111 | 1/90 | first stable training reference |
| Clean-view SFT stable | MBPP | 0.9000 | 0.1111 | 10/90 | mainline best result |
| Clean-view SFT stable | HumanEval (post-fix) | 0.7012 | 0.5061 | 83/164 | official external result |
| GRPO retry | MBPP | 0.9000 | 0.1000 | 9/90 | exploratory extension, still below clean-view SFT |
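On the columns: `pass@1` is `solved_count` divided by the number of evaluation problems (e.g. 10/90 = 0.1111), and `compile_rate` tracks how many generations compile at all. The exact definitions live in the evaluator under `src/`; the following bookkeeping sketch was written for this README, and the per-problem record schema is an assumption:

```python
# Hypothetical per-problem records; the real evaluator's schema may differ.
results = [
    {"compiled": True, "passed": True},
    {"compiled": True, "passed": False},
    {"compiled": False, "passed": False},
]

n = len(results)
compile_rate = sum(r["compiled"] for r in results) / n
solved_count = sum(r["passed"] for r in results)
pass_at_1 = solved_count / n  # one sample per problem, so pass@1 == solved / total

print(f"compile_rate={compile_rate:.4f} pass@1={pass_at_1:.4f} solved={solved_count}/{n}")
```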
- `configs/`: development-era configs kept for iteration and backward compatibility.
- `configs/official/assignment2/`: report-aligned frozen configs for the promoted assignment-2 result rows.
- `scripts/`: training, evaluation, official submission entry points, and memory-profiling wrappers.
- `src/`: preprocessing, prompting, training, reward, sandbox, evaluation, and analysis code.
- `outputs/runs/`: archived, run-stamped audit artifacts cited by the report.
- `outputs/submission/assignment2/`: default rerun location for the official frozen configs.
- `data/`: placeholder directories plus local preprocess targets written by the preprocessing commands below.
- `MBPP` is the training dataset and internal validation benchmark.
- `HumanEval` is external evaluation only and is never used for training or reward construction.
- `MBPP` licensing: the Hugging Face dataset card labels it `CC BY 4.0`.
- `HumanEval` licensing: the official OpenAI repository is released under the `MIT` License.
- Processed data statistics used in the report:
  - `MBPP`: 464 validated rows, split into 374 train and 90 validation.
  - `MBPP` clean view: 373 clean train and 90 clean validation rows.
  - `HumanEval`: 164 evaluation-only problems.
- Main data risks noted in the report:
  - `MBPP` public tests are not exhaustive.
  - The parser/evaluator prefers one top-level function, so helper-heavy valid solutions can still be undercounted (see the sketch below).
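To illustrate that undercounting risk, here is a minimal sketch of a one-top-level-function extraction rule. It is written for this README, not taken from `src/`; the function name and the exact policy are assumptions:

```python
import ast


def extract_primary_function(source: str) -> str | None:
    """Hypothetical parser policy: keep only the first top-level `def`.

    Under such a rule, a correct solution that splits its work across
    helper functions loses the helpers and can then fail the public tests.
    """
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return None
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            return ast.get_source_segment(source, node)
    return None


snippet = '''
def _double(x):
    return 2 * x

def solve(x):
    return _double(x) + 1
'''
# Only `_double` survives extraction; `solve` (the graded entry point) is dropped.
print(extract_primary_function(snippet))
```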
```bash
git clone git@github.com:checheng117/Verifiable-Code-Generation-Post-Training-Pipeline-for-Open-LLMs.git
cd Verifiable-Code-Generation-Post-Training-Pipeline-for-Open-LLMs
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
cp .env.example .env
```

- The repo does not ship local model weights.
- The default local assumption is `models/Qwen3-14B/`.
- If needed, update `configs/model_qwen3_14b.yaml` or `configs/official/assignment2/model_qwen3_14b.yaml`; a quick smoke-test sketch follows below.
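To verify the local weights load at all before running the pipeline, a minimal sketch with the `transformers` stack (the path matches the default above; `device_map="auto"` assumes `accelerate` is installed, and the prompt is a placeholder):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "models/Qwen3-14B"  # default local path assumed by the configs
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")

prompt = "Write a Python function that returns the sum of two integers."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```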
```bash
python -m src.data.preprocess_mbpp --output-dir data/processed/mbpp
python -m src.data.preprocess_humaneval --output-dir data/processed/humaneval
python -m src.data.build_mbpp_clean_sft_view --output-dir data/processed/mbpp_clean
```
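A quick sanity check that the processed splits match the data-card counts above. The expected numbers come from the report; the `*.jsonl` file names and one-record-per-line format are assumptions about the preprocessor output:

```python
from pathlib import Path

# Expected row counts from the data card above.
EXPECTED = {
    "data/processed/mbpp/train.jsonl": 374,
    "data/processed/mbpp/validation.jsonl": 90,
    "data/processed/mbpp_clean/train.jsonl": 373,
    "data/processed/mbpp_clean/validation.jsonl": 90,
    "data/processed/humaneval/eval.jsonl": 164,
}

for path, expected in EXPECTED.items():
    rows = sum(1 for line in Path(path).open() if line.strip())
    status = "OK" if rows == expected else f"MISMATCH (expected {expected})"
    print(f"{path}: {rows} rows, {status}")
```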
Run the official assignment-2 entry points:

```bash
bash scripts/run_assignment2_official.sh train-sft-stable
bash scripts/run_assignment2_official.sh train-sft-clean
bash scripts/run_assignment2_official.sh eval-mbpp-main-table
bash scripts/run_assignment2_official.sh eval-humaneval-main-table
```

The default development entry points are still available:

```bash
bash scripts/run_baselines.sh
bash scripts/run_sft.sh
bash scripts/run_eval_humaneval.sh
```

The memory-profiling wrapper:

```bash
bash scripts/profile_sft_memory.sh both
```

- `configs/*.yaml` at the repository root are the live development defaults.
- `configs/official/assignment2/*.yaml` are the submission-facing frozen configs. They map directly to the promoted assignment-2 table rows and rerun into `outputs/submission/assignment2/`.
- `outputs/runs/*` are the archived audit artifacts from which the reported numbers are cited.
The official frozen config set contains:
- `train_sft_stable.yaml`
- `train_sft_clean_stable.yaml`
- `eval_mbpp_zero_shot_icl.yaml`
- `eval_mbpp_fewshot_icl.yaml`
- `eval_mbpp_sft_stable.yaml`
- `eval_mbpp_sft_clean_stable.yaml`
- `eval_humaneval_zero_shot_postfix.yaml`
- `eval_humaneval_sft_clean_postfix.yaml`
- Non-finetuning baselines: zero-shot ICL and few-shot ICL are both preserved and auditable.
- Data card requirements: data source, licensing, sizes, filtering, split policy, and risks are all reflected in the report-aligned metadata and run notes.
- Catastrophic forgetting: covered by the fixed general-task sanity probe.
- Training cost: the elapsed-time and hyperparameter audit is preserved.
- Peak GPU memory: closed by the isolated official SFT reruns described below.
The most important run directories are:
- `outputs/runs/20260318_200650_formal_final_eval_audit/`: frozen MBPP protocol plus zero-shot / SFT-stable reference rows.
- `outputs/runs/20260319_011958_sft_clean_rerun/`: clean-view SFT mainline best model.
- `outputs/runs/20260319_164024_humaneval_eval_fix_rerun/`: post-fix HumanEval comparison.
- `outputs/runs/20260320_133330_mbpp_icl_baselines/`: official zero-shot and few-shot ICL baseline framing.
- `outputs/runs/20260320_201837_general_ability_sanity/`: catastrophic-forgetting sanity probe.
- `outputs/runs/20260320_213907_training_cost_audit/`: original elapsed-time and hyperparameter audit.
- `outputs/runs/20260321_140603_grpo_retry_clean_init/`: bounded exploratory GRPO retry.
- `outputs/runs/20260321_203456_assignment2_sft_peak_memory_profile/`: isolated reruns used to measure SFT peak GPU memory.
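The run-stamp prefix follows a `YYYYMMDD_HHMMSS_<slug>` pattern. A sketch of generating such a directory (a hypothetical helper written for this README, not taken from `src/`):

```python
from datetime import datetime
from pathlib import Path


def make_run_dir(slug: str, root: str = "outputs/runs") -> Path:
    """Create a run-stamped directory like outputs/runs/20260318_200650_<slug>/."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    run_dir = Path(root) / f"{stamp}_{slug}"
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir


print(make_run_dir("formal_final_eval_audit"))
```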
The assignment-facing SFT peak-memory audit lives in:
- `outputs/runs/20260321_203456_assignment2_sft_peak_memory_profile/assignment2_sft_peak_memory_audit.md`
- `outputs/runs/20260321_203456_assignment2_sft_peak_memory_profile/assignment2_sft_peak_memory_audit.csv`
- `outputs/runs/20260321_203456_assignment2_sft_peak_memory_profile/assignment2_sft_peak_memory_audit.json`
Measured report-facing peaks (`torch.cuda.max_memory_reserved()`):

- SFT stable: `16236.0 MiB` (15.855 GiB)
- Clean-view SFT stable: `16236.0 MiB` (15.855 GiB)
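For reference, a minimal sketch of capturing that number with PyTorch. This is illustrative only; the repo's actual profiling goes through `scripts/profile_sft_memory.sh`, and the device index is an assumption:

```python
import torch


def report_peak_reserved(device: int = 0) -> float:
    """Print and return the peak reserved CUDA memory in MiB."""
    peak_mib = torch.cuda.max_memory_reserved(device) / (1024 ** 2)
    print(f"peak reserved: {peak_mib:.1f} MiB ({peak_mib / 1024:.3f} GiB)")
    return peak_mib


torch.cuda.reset_peak_memory_stats(0)  # clear the counter before the measured region
# ... run the SFT training loop here ...
report_peak_reserved(0)  # e.g. 16236.0 MiB == 15.855 GiB
```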
Interpretation:
- These runs are audit-only reruns of the frozen official SFT configs.
- They do not replace the archived result runs used for the main result tables.
- They close the assignment requirement that asks for explicit peak GPU memory reporting.
- The main conclusion remains SFT-centered.
- The strongest gain comes from clean-view target redesign and parser/evaluator alignment, not from RL.
- GRPO became mechanically more credible by the end, but it still did not beat clean-view SFT on the main MBPP outcome metrics.
This work builds on Qwen3, MBPP, HumanEval, and the `transformers` / `peft` / `trl` stack. Reuse should respect the upstream licenses and usage terms.