Public home: https://github.com/checheng117/Verifiable-Code-Generation-Post-Training-Pipeline-for-Open-LLMs
This repository contains the code, frozen configs, and audited run artifacts for Assignment 2 of CUHKSZ CSC5051/MDS5110/CSC6052: Natural Language Processing.
The assignment task is verifiable code generation. The mainline system is SFT on MBPP with a local Qwen3-14B model, evaluated against the required non-finetuning baselines on MBPP and checked for external generalization on HumanEval. GRPO is included as a genuine exploratory extension, but it is not the main scientific claim.
- Task type: reasoning / code generation.
- Main training method: supervised fine-tuning.
- Exploratory extension: GRPO after the SFT pipeline stabilized.
- Required non-finetuning baselines: zero-shot ICL and few-shot ICL.
- Required supporting checks covered in this repo: data card information, catastrophic-forgetting sanity probe, training-cost audit, and peak GPU memory audit.
All numbers below come from archived run artifacts under `outputs/runs/`.
| Method | Benchmark | compile_rate | pass@1 | solved_count | Status |
|---|---|---|---|---|---|
| Zero-shot ICL | MBPP | 0.2111 | 0.0000 | 0/90 | official non-finetuning baseline |
| Few-shot ICL | MBPP | 0.0000 | 0.0000 | 0/90 | official negative baseline |
| SFT stable | MBPP | 0.3556 | 0.0111 | 1/90 | first stable training reference |
| Clean-view SFT stable | MBPP | 0.9000 | 0.1111 | 10/90 | mainline best result |
| Clean-view SFT stable | HumanEval (post-fix) | 0.7012 | 0.5061 | 83/164 | official external result |
| GRPO retry | MBPP | 0.9000 | 0.1000 | 9/90 | exploratory extension, still below clean-view SFT |
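On the columns: `pass@1` is `solved_count` divided by the number of evaluation problems (e.g. 10/90 = 0.1111), and `compile_rate` tracks how many generations compile at all. The exact definitions live in the evaluator under `src/`; the following bookkeeping sketch was written for this README, and the per-problem record schema is an assumption:

```python
# Hypothetical per-problem records; the real evaluator's schema may differ.
results = [
    {"compiled": True, "passed": True},
    {"compiled": True, "passed": False},
    {"compiled": False, "passed": False},
]

n = len(results)
compile_rate = sum(r["compiled"] for r in results) / n
solved_count = sum(r["passed"] for r in results)
pass_at_1 = solved_count / n  # one sample per problem, so pass@1 == solved / total

print(f"compile_rate={compile_rate:.4f} pass@1={pass_at_1:.4f} solved={solved_count}/{n}")
```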
- `configs/`: development-era configs kept for iteration and backward compatibility.
- `configs/official/assignment2/`: report-aligned frozen configs for the promoted assignment-2 result rows.
- `scripts/`: training, evaluation, official submission entry points, and memory-profiling wrappers.
- `src/`: preprocessing, prompting, training, reward, sandbox, evaluation, and analysis code.
- `outputs/runs/`: archived, run-stamped audit artifacts cited by the report.
- `outputs/submission/assignment2/`: default rerun location for the official frozen configs.
- `data/`: placeholder directories plus local preprocess targets written by the preprocessing commands below.
- `MBPP` is the training dataset and internal validation benchmark.
- `HumanEval` is external evaluation only and is never used for training or reward construction.
- `MBPP` licensing: the Hugging Face dataset card labels it `CC BY 4.0`.
- `HumanEval` licensing: the official OpenAI repository is released under the `MIT` License.
- Processed data statistics used in the report:
  - `MBPP`: 464 validated rows, split into 374 train and 90 validation.
  - `MBPP` clean view: 373 clean train and 90 clean validation rows.
  - `HumanEval`: 164 evaluation-only problems.
- Main data risks noted in the report:
  - `MBPP` public tests are not exhaustive.
  - The parser/evaluator prefers one top-level function, so helper-heavy valid solutions can still be undercounted (see the sketch below).
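To illustrate that undercounting risk, here is a minimal sketch of a one-top-level-function extraction rule. It is written for this README, not taken from `src/`; the function name and the exact policy are assumptions:

```python
import ast


def extract_primary_function(source: str) -> str | None:
    """Hypothetical parser policy: keep only the first top-level `def`.

    Under such a rule, a correct solution that splits its work across
    helper functions loses the helpers and can then fail the public tests.
    """
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return None
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            return ast.get_source_segment(source, node)
    return None


snippet = '''
def _double(x):
    return 2 * x

def solve(x):
    return _double(x) + 1
'''
# Only `_double` survives extraction; `solve` (the graded entry point) is dropped.
print(extract_primary_function(snippet))
```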
```bash
git clone git@github.com:checheng117/Verifiable-Code-Generation-Post-Training-Pipeline-for-Open-LLMs.git
cd Verifiable-Code-Generation-Post-Training-Pipeline-for-Open-LLMs
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
cp .env.example .env
```

- The repo does not ship local model weights.
- The default local assumption is `models/Qwen3-14B/`.
- If needed, update `configs/model_qwen3_14b.yaml` or `configs/official/assignment2/model_qwen3_14b.yaml`; a quick smoke-test sketch follows below.
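To verify the local weights load at all before running the pipeline, a minimal sketch with the `transformers` stack (the path matches the default above; `device_map="auto"` assumes `accelerate` is installed, and the prompt is a placeholder):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "models/Qwen3-14B"  # default local path assumed by the configs
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")

prompt = "Write a Python function that returns the sum of two integers."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```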
```bash
python -m src.data.preprocess_mbpp --output-dir data/processed/mbpp
python -m src.data.preprocess_humaneval --output-dir data/processed/humaneval
python -m src.data.build_mbpp_clean_sft_view --output-dir data/processed/mbpp_clean
```
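A quick sanity check that the processed splits match the data-card counts above. The expected numbers come from the report; the `*.jsonl` file names and one-record-per-line format are assumptions about the preprocessor output:

```python
from pathlib import Path

# Expected row counts from the data card above.
EXPECTED = {
    "data/processed/mbpp/train.jsonl": 374,
    "data/processed/mbpp/validation.jsonl": 90,
    "data/processed/mbpp_clean/train.jsonl": 373,
    "data/processed/mbpp_clean/validation.jsonl": 90,
    "data/processed/humaneval/eval.jsonl": 164,
}

for path, expected in EXPECTED.items():
    rows = sum(1 for line in Path(path).open() if line.strip())
    status = "OK" if rows == expected else f"MISMATCH (expected {expected})"
    print(f"{path}: {rows} rows, {status}")
```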
Run the official assignment-2 entry points:

```bash
bash scripts/run_assignment2_official.sh train-sft-stable
bash scripts/run_assignment2_official.sh train-sft-clean
bash scripts/run_assignment2_official.sh eval-mbpp-main-table
bash scripts/run_assignment2_official.sh eval-humaneval-main-table
```

The default development entry points are still available:

```bash
bash scripts/run_baselines.sh
bash scripts/run_sft.sh
bash scripts/run_eval_humaneval.sh
```

The memory-profiling wrapper:

```bash
bash scripts/profile_sft_memory.sh both
```

- `configs/*.yaml` at the repository root are the live development defaults.
- `configs/official/assignment2/*.yaml` are the submission-facing frozen configs. They map directly to the promoted assignment-2 table rows and rerun into `outputs/submission/assignment2/`.
- `outputs/runs/*` are the archived audit artifacts from which the reported numbers are cited.
The official frozen config set contains:
- `train_sft_stable.yaml`
- `train_sft_clean_stable.yaml`
- `eval_mbpp_zero_shot_icl.yaml`
- `eval_mbpp_fewshot_icl.yaml`
- `eval_mbpp_sft_stable.yaml`
- `eval_mbpp_sft_clean_stable.yaml`
- `eval_humaneval_zero_shot_postfix.yaml`
- `eval_humaneval_sft_clean_postfix.yaml`
- Non-finetuning baselines: zero-shot ICL and few-shot ICL are both preserved and auditable.
- Data card requirements: data source, licensing, sizes, filtering, split policy, and risks are all reflected in the report-aligned metadata and run notes.
- Catastrophic forgetting: covered by the fixed general-task sanity probe.
- Training cost: the elapsed-time and hyperparameter audit is preserved.
- Peak GPU memory: closed by the isolated official SFT reruns described below.
The most important run directories are:
- `outputs/runs/20260318_200650_formal_final_eval_audit/`: frozen MBPP protocol plus zero-shot / SFT-stable reference rows.
- `outputs/runs/20260319_011958_sft_clean_rerun/`: clean-view SFT mainline best model.
- `outputs/runs/20260319_164024_humaneval_eval_fix_rerun/`: post-fix HumanEval comparison.
- `outputs/runs/20260320_133330_mbpp_icl_baselines/`: official zero-shot and few-shot ICL baseline framing.
- `outputs/runs/20260320_201837_general_ability_sanity/`: catastrophic-forgetting sanity probe.
- `outputs/runs/20260320_213907_training_cost_audit/`: original elapsed-time and hyperparameter audit.
- `outputs/runs/20260321_140603_grpo_retry_clean_init/`: bounded exploratory GRPO retry.
- `outputs/runs/20260321_203456_assignment2_sft_peak_memory_profile/`: isolated reruns used to measure SFT peak GPU memory.
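The run-stamp prefix follows a `YYYYMMDD_HHMMSS_<slug>` pattern. A sketch of generating such a directory (a hypothetical helper written for this README, not taken from `src/`):

```python
from datetime import datetime
from pathlib import Path


def make_run_dir(slug: str, root: str = "outputs/runs") -> Path:
    """Create a run-stamped directory like outputs/runs/20260318_200650_<slug>/."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    run_dir = Path(root) / f"{stamp}_{slug}"
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir


print(make_run_dir("formal_final_eval_audit"))
```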
The assignment-facing SFT peak-memory audit lives in:
- `outputs/runs/20260321_203456_assignment2_sft_peak_memory_profile/assignment2_sft_peak_memory_audit.md`
- `outputs/runs/20260321_203456_assignment2_sft_peak_memory_profile/assignment2_sft_peak_memory_audit.csv`
- `outputs/runs/20260321_203456_assignment2_sft_peak_memory_profile/assignment2_sft_peak_memory_audit.json`
Measured report-facing peaks (`torch.cuda.max_memory_reserved()`):

- SFT stable: `16236.0 MiB` (15.855 GiB)
- Clean-view SFT stable: `16236.0 MiB` (15.855 GiB)
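For reference, a minimal sketch of capturing that number with PyTorch. This is illustrative only; the repo's actual profiling goes through `scripts/profile_sft_memory.sh`, and the device index is an assumption:

```python
import torch


def report_peak_reserved(device: int = 0) -> float:
    """Print and return the peak reserved CUDA memory in MiB."""
    peak_mib = torch.cuda.max_memory_reserved(device) / (1024 ** 2)
    print(f"peak reserved: {peak_mib:.1f} MiB ({peak_mib / 1024:.3f} GiB)")
    return peak_mib


torch.cuda.reset_peak_memory_stats(0)  # clear the counter before the measured region
# ... run the SFT training loop here ...
report_peak_reserved(0)  # e.g. 16236.0 MiB == 15.855 GiB
```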
Interpretation:
- These runs are audit-only reruns of the frozen official SFT configs.
- They do not replace the archived result runs used for the main result tables.
- They close the assignment requirement that asks for explicit peak GPU memory reporting.
- The main conclusion remains SFT-centered.
- The strongest gain comes from clean-view target redesign and parser/evaluator alignment, not from RL.
- GRPO became mechanically more credible by the end, but it still did not beat clean-view SFT on the main MBPP outcome metrics.
This work builds on Qwen3, MBPP, HumanEval, and the `transformers` / `peft` / `trl` stack. Reuse should respect the upstream licenses and usage terms.