Small Gemma Models for Code Generation

📋 Overview

This research project investigates whether smaller Gemma models (1B, 4B) can achieve code generation capabilities comparable to their larger counterparts (12B, 27B) through various enhancement techniques. We explore:

Zero-shot baseline approaches
Few-shot learning approaches
Zero- and few-shot approaches with additional reflection from a secondary model
Vector search-based few-shot example retrieval

🔍 Research Questions

What are the baseline zero-shot capabilities across different Gemma model sizes?
How stable are smaller models' results in zero-shot scenarios?
What improvements can be achieved through few-shot approaches?
How does the number of examples in few-shot learning influence results?
Can additional reflection from another agent regarding the task improve results?
Can few-shot examples and additional reflection significantly improve Gemma 12B's results?
Can vector search-based few-shot example retrieval improve model performance compared to sequential selection?
How does the number of examples affect performance in vector search-based few-shot learning?

🧪 Methodology

Dataset

We use a hand-verified subset of the Mostly Basic Python Problems (MBPP) benchmark. The original MBPP dataset contains roughly 1,000 crowd-sourced Python programming problems (974 in the released version), designed to be solvable by entry-level programmers, each containing:

Task description
Reference solution
3 test cases for validation

For this study, we evaluated on 50 problems from this subset due to computational budget limitations.

Experimental Approaches

Zero-shot baseline: Direct code generation without examples
Few-shot baseline: Direct code generation with examples
Reflection Agent: Direct code generation with additional task reflection from another agent
Vector Search-based Few-Shot: Using sentence transformers (all-MiniLM-L6-v2) to compute embeddings and select semantically similar examples based on cosine similarity

Computational Resources

All experiments were run locally on Apple Silicon M4 Pro processors for reproducibility.

📊 Key Findings

Q1: Zero-Shot Capabilities Across Model Sizes

For our evaluation, we generated one code sample per problem with a temperature setting of 0.5 and assessed whether it passed all three provided test cases. We limited model output tokens to 1000. These settings remained consistent across all subsequent experiments.
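The pass/fail check described above can be sketched as follows. This is a minimal illustration, not the repository's evaluation code: it assumes the test cases are plain `assert` statements (as in MBPP), and the function name `passes_all_tests` is hypothetical. Running untrusted generated code with `exec` is unsafe outside a sandbox, and a real harness would also need a timeout.

```python
def passes_all_tests(generated_code: str, test_cases: list[str]) -> bool:
    """Return True only if the generated code runs and every assert passes."""
    namespace: dict = {}
    try:
        # Define the candidate function(s) in a fresh namespace.
        exec(generated_code, namespace)
        # MBPP-style tests are assert statements, e.g. "assert add(1, 2) == 3".
        for test in test_cases:
            exec(test, namespace)
    except Exception:
        # Syntax errors, runtime errors, and failed asserts all count as failure.
        return False
    return True


# Toy problem, made up for illustration.
candidate = "def add(a, b):\n    return a + b"
tests = ["assert add(1, 2) == 3", "assert add(0, 0) == 0", "assert add(-1, 1) == 0"]
print(passes_all_tests(candidate, tests))
```

A sample counts as correct only when all three test cases pass, so a single failing assert marks the whole attempt as failed.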

Zero-shot accuracy increased with parameter count, from ~55% for the 1B model to ~90% for the 27B model.

Q2: Results Stability in Zero-Shot Scenarios

Because we evaluate on a small sample, we assessed result stability by generating 5 responses per problem and tracking accumulated accuracy across iterations. Accumulated accuracy at N is the fraction of test problems for which at least one of the first N responses passed all test cases. This experiment was performed using Gemma3 1B.
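Accumulated accuracy as defined above can be computed from a per-task, per-iteration table of pass/fail outcomes. The sketch below assumes such a table is available as nested lists; the function name and the toy data are illustrative and not taken from the repository.

```python
def accumulated_accuracy(outcomes: list[list[bool]]) -> list[float]:
    """outcomes[t][i] is True if task t passed all tests on iteration i.

    Returns one value per N = 1..num_iterations: the fraction of tasks with
    at least one success within the first N iterations.
    """
    if not outcomes:
        return []
    num_iterations = len(outcomes[0])
    return [
        sum(any(task[:n]) for task in outcomes) / len(outcomes)
        for n in range(1, num_iterations + 1)
    ]


# Toy data: 4 tasks, 3 iterations each (made up for illustration).
toy = [
    [True, True, False],
    [False, True, True],
    [False, False, False],
    [False, False, True],
]
curve = accumulated_accuracy(toy)
print(curve)  # [0.25, 0.5, 0.75]
```

Because a task stays solved once any attempt succeeds, the curve can never decrease as N grows; it flattens when extra iterations stop solving new tasks.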

Our analysis showed that results stabilized after 2-3 iterations, informing our experimental design choices. For subsequent experiments, we standardized on 3 iterations to ensure result comparability, unless stated otherwise.

Q3: Zero-Shot vs. Few-Shot Comparison

We analyzed the impact of providing examples in the model prompt by evaluating zero-shot approaches against few-shot approaches with 3 examples. Both experiments were run three times, with accuracy calculated based on whether any of the model trials for each task was successful.
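One way to assemble a few-shot prompt with sequential example selection is sketched below. The prompt wording, the `build_prompt` name, and the `description`/`solution` keys are assumptions for illustration; the prompts actually used in the experiments may differ.

```python
def build_prompt(task: str, examples: list[dict], num_examples: int) -> str:
    """Prepend the first `num_examples` solved examples to the task description."""
    parts = []
    # Sequential selection: take examples in dataset order.
    for example in examples[:num_examples]:
        parts.append(
            f"Task: {example['description']}\nSolution:\n{example['solution']}\n"
        )
    # The target task comes last, with the solution left open for the model.
    parts.append(f"Task: {task}\nSolution:\n")
    return "\n".join(parts)


# Hypothetical example pool.
pool = [
    {"description": "Add two numbers.", "solution": "def add(a, b):\n    return a + b"},
    {"description": "Negate a number.", "solution": "def neg(x):\n    return -x"},
]
print(build_prompt("Square a number.", pool, num_examples=2))
```

Setting `num_examples=0` reduces this to the zero-shot prompt, which keeps the two conditions otherwise identical.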

We observed a slight improvement with the few-shot approach. However, the zero-shot results in this run were noticeably lower than in Q1 and Q2, so the comparison should be read with caution. More iterations or a larger evaluation set would give more reliable results, at a higher computational cost. Although Q2 suggested that results stabilize after about 3 iterations, this run indicates that this is not always the case.

Q4: Impact of Example Count in Few-Shot Learning

We analyzed how the number of examples affects model performance in few-shot scenarios, using a methodology similar to previous research questions (3 runs with accuracy measured by any successful attempt).

We observed that increasing the number of examples in few-shot scenarios significantly improved model performance, showing a clear trend of enhanced accuracy with more examples presented to the model in the prompt.

Q5: Zero-shot vs. Zero-shot Reflection Approach Comparison

We analyzed whether providing additional task reflection from a Gemma 4B model could improve results. We compared the zero-shot approach with the reflection approach for both 4B and 12B models, running just one iteration due to computational constraints.
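The reflection setup can be thought of as a two-stage pipeline: a secondary model (Gemma 4B in this study) comments on the task, and the main model then generates code with those comments in its prompt. The sketch below is an assumption about how such a pipeline might look; the prompt wording and function names are hypothetical, and the model calls are passed in as plain callables so the example runs without a model server.

```python
from typing import Callable


def generate_with_reflection(
    task: str,
    reflect: Callable[[str], str],
    solve: Callable[[str], str],
) -> str:
    """Two-stage generation: a secondary model comments on the task first."""
    # Stage 1: the reflection model analyses the task (hypothetical wording).
    reflection = reflect(
        "Briefly analyse this programming task and list edge cases to consider. "
        f"Do not write code.\n\nTask: {task}"
    )
    # Stage 2: the main model sees the task plus the reflection and writes code.
    return solve(
        f"Task: {task}\n\nNotes from another agent:\n{reflection}\n\n"
        "Write a Python function that solves the task."
    )


# Stub callables stand in for real model calls so the sketch runs offline.
def fake_reflect(prompt: str) -> str:
    return "Handle the empty list."


def fake_solve(prompt: str) -> str:
    return "def total(xs):\n    return sum(xs)"


print(generate_with_reflection("Sum a list of numbers.", fake_reflect, fake_solve))
```

In a real run, `reflect` and `solve` would wrap calls to the locally served models.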

Surprisingly, the additional reflection component did not strengthen model results, with performance remaining on par for both 4B and 12B models.

Q6: Zero-shot vs. Few-shot Reflection Approach Comparison

We analyzed whether combining few-shot learning with reflection could improve model performance compared to zero-shot approaches. For this analysis, we used the Gemma 12B model with 3 few-shot examples.

Unfortunately, the few-shot approach with reflection did not strengthen model results either. This is consistent with Q3, where few-shot prompting with 3 examples produced only a slight improvement.

Q7: Vector Search-based Few-Shot Example Retrieval

We investigated whether using vector search to select semantically similar examples could improve model performance compared to sequential example selection. We used the sentence transformer model (all-MiniLM-L6-v2) to compute embeddings and find similar examples based on cosine similarity.
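The selection step can be illustrated with a small, dependency-free sketch of top-k retrieval by cosine similarity. It uses made-up 2-dimensional vectors; in the actual setup the vectors would be sentence embeddings of the task descriptions (for instance from `SentenceTransformer("all-MiniLM-L6-v2").encode(...)` in the sentence-transformers library). The function names here are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_similar(query: list[float], candidates: list[list[float]], k: int) -> list[int]:
    """Indices of the k candidates most similar to the query, best first."""
    ranked = sorted(
        range(len(candidates)),
        key=lambda i: cosine_similarity(query, candidates[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy embeddings, made up for illustration.
query = [1.0, 0.0]
candidates = [[0.0, 1.0], [1.0, 0.1], [0.7, 0.7], [-1.0, 0.0]]
print(top_k_similar(query, candidates, 2))  # [1, 2]
```

The returned indices pick the few-shot examples for the prompt, replacing the "first k in dataset order" rule used in sequential selection.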

The results showed that vector search-based example selection provided only a small accuracy improvement.

Q8: Impact of Example Count in Vector Search-based Few-Shot Learning

Building on Q7, we analyzed how the number of examples selected through vector search affects model performance. We compared different numbers of examples (1, 3, 5, and 7) to understand if more examples lead to better results.

Accuracy improved as the number of examples selected through vector search increased.

Summary

Our research on Gemma models for code generation revealed that larger models (12B, 27B) consistently outperform smaller variants (1B, 4B) in zero-shot tasks. While multiple iterations improved success rates, they plateaued after approximately 3 attempts. Few-shot learning showed modest improvements that scaled with example count, though surprisingly, adding structured reflection components to prompts did not yield significant improvements. Vector-based example selection provided only marginal benefits over sequential selection.

The results suggest that simpler approaches like basic few-shot learning can be as effective as more complex strategies, and that balancing model size with example count may be more practical than pursuing sophisticated prompt engineering techniques.

These conclusions should be treated as tentative: the evaluation covers only 50 problems, and the experiments above (notably Q3) showed that results are not fully stable between runs.

🚀 Getting Started

Prerequisites

uv - Modern Python package installer
ollama - Local LLM runner
Python 3.11 or higher

Installation

Clone the repository:

git clone https://github.com/yourusername/small-gemma-models-for-code-generation.git
cd small-gemma-models-for-code-generation

Download Gemma models:

ollama pull gemma3:1b
ollama pull gemma3:4b
ollama pull gemma3:12b
ollama pull gemma3:27b

Create and activate virtual environment:

uv venv
source .venv/bin/activate  # On Unix/macOS
# or
.venv\Scripts\activate  # On Windows

Install dependencies:

uv pip install .

(Optional) Install development dependencies:

uv pip install -e ".[dev]"

🔧 Usage

Running Experiments

# Q1: Basic zero-shot experiments
uv run run_experiments.py --experiment_type single-model --experiment_name q1-zero-shot-gemma3:1b --model_name gemma3:1b
uv run run_experiments.py --experiment_type single-model --experiment_name q1-zero-shot-gemma3:4b --model_name gemma3:4b
uv run run_experiments.py --experiment_type single-model --experiment_name q1-zero-shot-gemma3:12b --model_name gemma3:12b
uv run run_experiments.py --experiment_type single-model --experiment_name q1-zero-shot-gemma3:27b --model_name gemma3:27b

# Q2: Stability analysis
uv run run_experiments.py --experiment_type single-model --experiment_name q2-gemma3:1b --model_name gemma3:1b --num-iterations 5

# Q3: Zero-shot vs. Few-shot Comparison
uv run run_experiments.py --experiment_type single-model --experiment_name q3-zero-shot-gemma3:1b --model_name gemma3:1b --num-iterations 3 --num-few-shot-examples 0
uv run run_experiments.py --experiment_type single-model --experiment_name q3-few-shot-gemma3:1b --model_name gemma3:1b --num-iterations 3 --num-few-shot-examples 3

# Q4: Few-shot Number of Examples Analysis
uv run run_experiments.py --experiment_type single-model --experiment_name q4-few-shot-1-gemma3:1b --model_name gemma3:1b --num-iterations 3 --num-few-shot-examples 1
uv run run_experiments.py --experiment_type single-model --experiment_name q4-few-shot-3-gemma3:1b --model_name gemma3:1b --num-iterations 3 --num-few-shot-examples 3
uv run run_experiments.py --experiment_type single-model --experiment_name q4-few-shot-5-gemma3:1b --model_name gemma3:1b --num-iterations 3 --num-few-shot-examples 5
uv run run_experiments.py --experiment_type single-model --experiment_name q4-few-shot-7-gemma3:1b --model_name gemma3:1b --num-iterations 3 --num-few-shot-examples 7

# Q5: Reflection Approach
uv run run_experiments.py --experiment_type single-model --experiment_name q5-zero-shot-gemma3:4b --model_name gemma3:4b --num-iterations 1 --num-few-shot-examples 0
uv run run_experiments.py --experiment_type single-model --experiment_name q5-zero-shot-gemma3:12b --model_name gemma3:12b --num-iterations 1 --num-few-shot-examples 0
uv run run_experiments.py --experiment_type reflection-approach --experiment_name q5-reflection-approach-gemma3:4b --model_name gemma3:4b --num-iterations 1 --num-few-shot-examples 0
uv run run_experiments.py --experiment_type reflection-approach --experiment_name q5-reflection-approach-gemma3:12b --model_name gemma3:12b --num-iterations 1 --num-few-shot-examples 0

# Q6: Reflection with Few-Shot Examples
uv run run_experiments.py --experiment_type reflection-approach --experiment_name q6-reflection-approach-few-shot-gemma3:12b --model_name gemma3:12b --num-iterations 1 --num-few-shot-examples 2

# Q7: Few-Shot Examples via Vector Search
uv run run_experiments.py --experiment_type single-model --experiment_name q7-vector-search-few-shot-gemma3:1b --model_name gemma3:1b --num-iterations 3 --num-few-shot-examples 3 --use-vector-search

# Q8: Few-Shot Examples via Vector Search Number of Examples Analysis
uv run run_experiments.py --experiment_type single-model --experiment_name q8-vector-search-few-shot-1-gemma3:1b --model_name gemma3:1b --num-iterations 3 --num-few-shot-examples 1 --use-vector-search
uv run run_experiments.py --experiment_type single-model --experiment_name q8-vector-search-few-shot-3-gemma3:1b --model_name gemma3:1b --num-iterations 3 --num-few-shot-examples 3 --use-vector-search
uv run run_experiments.py --experiment_type single-model --experiment_name q8-vector-search-few-shot-5-gemma3:1b --model_name gemma3:1b --num-iterations 3 --num-few-shot-examples 5 --use-vector-search
uv run run_experiments.py --experiment_type single-model --experiment_name q8-vector-search-few-shot-7-gemma3:1b --model_name gemma3:1b --num-iterations 3 --num-few-shot-examples 7 --use-vector-search

Evaluation

Run evaluation for specific results:

# Q1 Experiments
uv run run_evaluation.py --results-path results/q1-zero-shot-gemma3:1b*.json
uv run run_evaluation.py --results-path results/q1-zero-shot-gemma3:4b*.json
uv run run_evaluation.py --results-path results/q1-zero-shot-gemma3:12b*.json
uv run run_evaluation.py --results-path results/q1-zero-shot-gemma3:27b*.json

# Q2 Experiments
uv run run_evaluation.py --results-path results/q2-gemma3:1b*.json

# Q3 Experiments
uv run run_evaluation.py --results-path results/q3-zero-shot-gemma3:1b*.json
uv run run_evaluation.py --results-path results/q3-few-shot-gemma3:1b*.json

# Q4 Experiments
uv run run_evaluation.py --results-path results/q4-few-shot-1-gemma3:1b*.json
uv run run_evaluation.py --results-path results/q4-few-shot-3-gemma3:1b*.json
uv run run_evaluation.py --results-path results/q4-few-shot-5-gemma3:1b*.json
uv run run_evaluation.py --results-path results/q4-few-shot-7-gemma3:1b*.json

# Q5 Experiments
uv run run_evaluation.py --results-path results/q5-zero-shot-gemma3:4b*.json
uv run run_evaluation.py --results-path results/q5-zero-shot-gemma3:12b*.json
uv run run_evaluation.py --results-path results/q5-reflection-approach-gemma3:4b*.json
uv run run_evaluation.py --results-path results/q5-reflection-approach-gemma3:12b*.json

# Q6 Experiments
uv run run_evaluation.py --results-path results/q6-reflection-approach-few-shot-gemma3:12b*.json

# Q7 Experiments
uv run run_evaluation.py --results-path results/q7-vector-search-few-shot-gemma3:1b*.json

# Q8 Experiments
uv run run_evaluation.py --results-path results/q8-vector-search-few-shot-1-gemma3:1b*.json
uv run run_evaluation.py --results-path results/q8-vector-search-few-shot-3-gemma3:1b*.json
uv run run_evaluation.py --results-path results/q8-vector-search-few-shot-5-gemma3:1b*.json
uv run run_evaluation.py --results-path results/q8-vector-search-few-shot-7-gemma3:1b*.json

Generate visualizations:

uv run run_analysis.py

📝 License

Apache License 2.0
