
Verifiable Code Generation Post-Training Pipeline for Open LLMs

Public home: https://github.com/checheng117/Verifiable-Code-Generation-Post-Training-Pipeline-for-Open-LLMs

This repository contains the code, frozen configs, and audited run artifacts for Assignment 2 of CUHKSZ CSC5051/MDS5110/CSC6052: Natural Language Processing.

The assignment task is verifiable code generation. The mainline system is SFT on MBPP with a local Qwen3-14B model, evaluated against the required non-finetuning baselines on MBPP and checked for external generalization on HumanEval. GRPO is included as a genuine exploratory extension, but it is not the main scientific claim.

Assignment Scope

  • Task type: reasoning / code generation.
  • Main training method: supervised fine-tuning.
  • Exploratory extension: GRPO after the SFT pipeline stabilized.
  • Required non-finetuning baselines: zero-shot ICL and few-shot ICL.
  • Required supporting checks covered in this repo: data card information, catastrophic-forgetting sanity probe, training-cost audit, and peak GPU memory audit.

Main Results Snapshot

All numbers below come from archived run artifacts under outputs/runs/.

| Method | Benchmark | compile_rate | pass@1 | solved_count | Status |
| --- | --- | --- | --- | --- | --- |
| Zero-shot ICL | MBPP | 0.2111 | 0.0000 | 0/90 | official non-finetuning baseline |
| Few-shot ICL | MBPP | 0.0000 | 0.0000 | 0/90 | official negative baseline |
| SFT stable | MBPP | 0.3556 | 0.0111 | 1/90 | first stable training reference |
| Clean-view SFT stable | MBPP | 0.9000 | 0.1111 | 10/90 | mainline best result |
| Clean-view SFT stable | HumanEval (post-fix) | 0.7012 | 0.5061 | 83/164 | official external result |
| GRPO retry | MBPP | 0.9000 | 0.1000 | 9/90 | exploratory extension, still below clean-view SFT |
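
For reference, the pass@1 column is consistent with single-sample scoring: solved_count divided by the number of benchmark problems. A minimal sanity check (not the repo's evaluation code) that reproduces the table ratios:

# Single-sample pass@1 reduces to solved_count / total.
# Rows taken from the table above.
rows = [
    ("SFT stable / MBPP",                  1,  90, 0.0111),
    ("Clean-view SFT stable / MBPP",      10,  90, 0.1111),
    ("Clean-view SFT stable / HumanEval", 83, 164, 0.5061),
    ("GRPO retry / MBPP",                  9,  90, 0.1000),
]
for name, solved, total, reported in rows:
    assert abs(solved / total - reported) < 5e-4, name
    print(f"{name}: pass@1 = {solved / total:.4f}")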

Repository Structure

  • configs/: development-era configs kept for iteration and backward compatibility.
  • configs/official/assignment2/: report-aligned frozen configs for the promoted assignment-2 result rows.
  • scripts/: training, evaluation, official submission entry points, and memory profiling wrappers.
  • src/: preprocessing, prompting, training, reward, sandbox, evaluation, and analysis code.
  • outputs/runs/: archived, run-stamped audit artifacts cited by the report.
  • outputs/submission/assignment2/: default rerun location for the official frozen configs.
  • data/: placeholder directories plus the locally processed datasets written by the preprocessing commands below.

Data and Licensing

  • MBPP is the training dataset and internal validation benchmark.
  • HumanEval is external evaluation only and is never used for training or reward construction.
  • MBPP licensing: the Hugging Face dataset card labels it CC BY 4.0.
  • HumanEval licensing: the official OpenAI repository is released under the MIT License.
  • Processed data statistics used in the report:
    • MBPP: 464 validated rows, split into 374 train and 90 validation.
    • MBPP clean view: 373 clean train and 90 clean validation rows.
    • HumanEval: 164 evaluation-only problems.
  • Main data risks noted in the report:
    • MBPP public tests are not exhaustive.
    • The parser/evaluator prefers one top-level function, so helper-heavy valid solutions can still be undercounted (see the sketch below).
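
To make the second risk concrete, here is a hypothetical extractor in the same spirit (the repo's actual parser may behave differently): if only the first top-level function is kept, a valid solution that delegates to a helper defined afterwards breaks under the public tests.

import ast

# Hypothetical extractor illustrating the undercounting risk above;
# NOT the repo's actual parser.
def extract_first_function(source: str) -> str:
    tree = ast.parse(source)
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            return ast.unparse(node)  # Python 3.9+
    return source

helper_heavy = '''
def solve(xs):
    return _norm(xs)

def _norm(xs):
    return sorted(set(xs))
'''

# Only `solve` survives extraction; `_norm` is dropped, so calling the
# extracted code raises NameError even though the full solution is valid.
print(extract_first_function(helper_heavy))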

Quickstart

1. Clone

git clone git@github.com:checheng117/Verifiable-Code-Generation-Post-Training-Pipeline-for-Open-LLMs.git
cd Verifiable-Code-Generation-Post-Training-Pipeline-for-Open-LLMs

2. Create the environment

python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt

3. Configure local model paths

cp .env.example .env
  • The repo does not ship local model weights.
  • The default local assumption is models/Qwen3-14B/.
  • If needed, update configs/model_qwen3_14b.yaml or configs/official/assignment2/model_qwen3_14b.yaml.
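
A minimal sketch of the assumed local-load convention (the MODEL_PATH variable and fallback are illustrative; the actual run scripts appear to take model paths from the YAML configs above):

import os
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative only: fall back to the default local directory noted above.
model_path = os.environ.get("MODEL_PATH", "models/Qwen3-14B")
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")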

4. Build local processed datasets

python -m src.data.preprocess_mbpp --output-dir data/processed/mbpp
python -m src.data.preprocess_humaneval --output-dir data/processed/humaneval
python -m src.data.build_mbpp_clean_sft_view --output-dir data/processed/mbpp_clean
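
Once preprocessing finishes, the split sizes should match the data-card numbers above (374/90 MBPP rows, 373/90 for the clean view, 164 HumanEval problems). A quick check, assuming JSONL output with one row per line (the exact file names are an assumption, not taken from the repo):

from pathlib import Path

# Assumed layout: one JSON object per line; adjust names to the real output.
def count_rows(path: str) -> int:
    return sum(1 for line in Path(path).open() if line.strip())

print("mbpp train:", count_rows("data/processed/mbpp/train.jsonl"))      # expect 374
print("mbpp valid:", count_rows("data/processed/mbpp/validation.jsonl")) # expect 90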

5. Run the official assignment-2 configs

bash scripts/run_assignment2_official.sh train-sft-stable
bash scripts/run_assignment2_official.sh train-sft-clean
bash scripts/run_assignment2_official.sh eval-mbpp-main-table
bash scripts/run_assignment2_official.sh eval-humaneval-main-table

The default development entry points are still available:

bash scripts/run_baselines.sh
bash scripts/run_sft.sh
bash scripts/run_eval_humaneval.sh

6. Run the SFT peak-memory audit

bash scripts/profile_sft_memory.sh both

Official Frozen Configs vs Archived Runs

  • configs/*.yaml at the repository root are the live development defaults.
  • configs/official/assignment2/*.yaml are the submission-facing frozen configs. They map directly to the promoted assignment-2 table rows and rerun into outputs/submission/assignment2/.
  • outputs/runs/* are the archived audit artifacts from which the reported numbers are cited.

The official frozen config set contains:

  • train_sft_stable.yaml
  • train_sft_clean_stable.yaml
  • eval_mbpp_zero_shot_icl.yaml
  • eval_mbpp_fewshot_icl.yaml
  • eval_mbpp_sft_stable.yaml
  • eval_mbpp_sft_clean_stable.yaml
  • eval_humaneval_zero_shot_postfix.yaml
  • eval_humaneval_sft_clean_postfix.yaml

Assignment Requirement Coverage

  • Non-finetuning baselines:
    • zero-shot ICL and few-shot ICL are both preserved and auditable.
  • Data card requirements:
    • data source, licensing, sizes, filtering, split policy, and risks are all reflected in the report-aligned metadata and run notes.
  • Catastrophic forgetting:
    • covered by the fixed general-task sanity probe.
  • Training cost:
    • elapsed-time and hyperparameter audit is preserved.
  • Peak GPU memory:
    • closed by the isolated official SFT reruns described below.

Audited Artifacts

The most important run directories are:

  • outputs/runs/20260318_200650_formal_final_eval_audit/: frozen MBPP protocol plus zero-shot / SFT-stable reference rows.
  • outputs/runs/20260319_011958_sft_clean_rerun/: clean-view SFT mainline best model.
  • outputs/runs/20260319_164024_humaneval_eval_fix_rerun/: post-fix HumanEval comparison.
  • outputs/runs/20260320_133330_mbpp_icl_baselines/: official zero-shot and few-shot ICL baseline framing.
  • outputs/runs/20260320_201837_general_ability_sanity/: catastrophic-forgetting sanity probe.
  • outputs/runs/20260320_213907_training_cost_audit/: original elapsed-time and hyperparameter audit.
  • outputs/runs/20260321_140603_grpo_retry_clean_init/: bounded exploratory GRPO retry.
  • outputs/runs/20260321_203456_assignment2_sft_peak_memory_profile/: isolated reruns used to measure SFT peak GPU memory.

Peak GPU Memory Audit

The assignment-facing SFT peak-memory audit lives in:

  • outputs/runs/20260321_203456_assignment2_sft_peak_memory_profile/assignment2_sft_peak_memory_audit.md
  • outputs/runs/20260321_203456_assignment2_sft_peak_memory_profile/assignment2_sft_peak_memory_audit.csv
  • outputs/runs/20260321_203456_assignment2_sft_peak_memory_profile/assignment2_sft_peak_memory_audit.json

Measured report-facing peaks (torch.cuda.max_memory_reserved()):

  • SFT stable: 16236.0 MiB (15.855 GiB)
  • clean-view SFT stable: 16236.0 MiB (15.855 GiB)
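
A minimal probe in that style (a sketch, not the repo's scripts/profile_sft_memory.sh wrapper):

import torch

# Reset the peak counter before the measured region, then read the CUDA
# allocator's reserved high-water mark afterwards.
torch.cuda.reset_peak_memory_stats()

# ... run the SFT steps to be audited here ...

peak_mib = torch.cuda.max_memory_reserved() / (1024 ** 2)
print(f"peak reserved: {peak_mib:.1f} MiB ({peak_mib / 1024:.3f} GiB)")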

Interpretation:

  • These runs are audit-only reruns of the frozen official SFT configs.
  • They do not replace the archived result runs used for the main result tables.
  • They close the assignment requirement that asks for explicit peak GPU memory reporting.

Notes on Scientific Scope

  • The main conclusion remains SFT-centered.
  • The strongest gain comes from clean-view target redesign and parser/evaluator alignment, not from RL.
  • GRPO became mechanically more credible by the end, but it still did not beat clean-view SFT on the main MBPP outcome metrics.

Acknowledgments

This work builds on Qwen3, MBPP, HumanEval, and the transformers / peft / trl stack. Reuse should respect the upstream licenses and usage terms.
