This research project investigates whether smaller Gemma models (1B, 4B) can achieve code generation capabilities comparable to their larger counterparts (12B, 27B) through various enhancement techniques. We explore:
- Zero-shot baseline approaches
- Few-shot learning approaches
- Zero- and few-shot approaches with additional reflection from a secondary model
- Vector search-based few-shot example retrieval
We address the following research questions, referred to below as Q1 through Q8:
- Q1: What are the baseline zero-shot capabilities across different Gemma model sizes?
- Q2: How stable are smaller models' results in zero-shot scenarios?
- Q3: What improvements can be achieved through few-shot approaches?
- Q4: How does the number of examples in few-shot learning influence results?
- Q5: Can additional reflection from another agent regarding the task improve results?
- Q6: Can few-shot examples and additional reflection significantly improve Gemma 12B's results?
- Q7: Can vector search-based few-shot example retrieval improve model performance compared to sequential selection?
- Q8: How does the number of examples affect performance in vector search-based few-shot learning?
We use a hand-verified subset of the Mostly Basic Python Problems (MBPP) benchmark. The original MBPP dataset includes about 1,000 Python programming problems aimed at entry-level programmers, each containing:
- Task description
- Reference solution
- 3 test cases for validation
For this study, we used 50 test examples due to computational budget limitations.
We compared the following approaches:
- Zero-shot baseline: Direct code generation without examples
- Few-shot baseline: Direct code generation with examples
- Reflection Agent: Direct code generation with additional task reflection from another agent
- Vector Search-based Few-Shot: Using sentence transformers (all-MiniLM-L6-v2) to compute embeddings and select semantically similar examples based on cosine similarity
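The zero-shot and few-shot baselines in the list above differ only in whether solved examples precede the task in the prompt. The following is a minimal sketch of that difference. The `build_prompt` helper, the prompt wording, and the `task`/`solution` field names are illustrative assumptions, not the templates used in this repository.

```python
from typing import Optional


def build_prompt(task: str, examples: Optional[list] = None) -> str:
    """Assemble a code-generation prompt (hypothetical format).

    With no examples the prompt is zero-shot; each example adds one
    solved task before the target task, giving a few-shot prompt.
    """
    parts = ["Write a Python function that solves the task below."]
    for i, ex in enumerate(examples or [], start=1):
        parts.append(f"Example {i} task:\n{ex['task']}")
        parts.append(f"Example {i} solution:\n{ex['solution']}")
    parts.append(f"Task:\n{task}")
    return "\n\n".join(parts)


# Zero-shot: only the instruction and the task.
print(build_prompt("Add two numbers."))

# Few-shot: one solved example (made up for illustration) before the task.
demo = [{"task": "Return the square of x.", "solution": "def square(x):\n    return x * x"}]
print(build_prompt("Add two numbers.", demo))
```

Under this sketch, the number of few-shot examples is simply the length of the `examples` list.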
All experiments were run locally on Apple Silicon M4 Pro processors for reproducibility.
For our evaluation, we generated one code sample per problem with a temperature setting of 0.5 and assessed whether it passed all three provided test cases. We limited model output tokens to 1000. These settings remained consistent across all subsequent experiments.
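As a rough illustration of the pass criterion described above, the sketch below runs a generated solution and then each test case, counting the sample as correct only if every assertion holds. It assumes the test cases are `assert` statements stored as strings, as in MBPP. The function name `passes_all_tests` is hypothetical, and a real harness should run untrusted code in a sandbox with a timeout, which this sketch omits.

```python
def passes_all_tests(generated_code: str, test_cases: list) -> bool:
    """Return True only if the code runs and every test assertion holds."""
    namespace: dict = {}
    try:
        exec(generated_code, namespace)   # define the candidate function(s)
        for test in test_cases:
            exec(test, namespace)         # each test is an `assert ...` statement
    except Exception:
        return False                      # syntax error, runtime error, or failed assert
    return True


# Made-up problem with three test cases, mirroring the MBPP layout.
candidate = "def add(a, b):\n    return a + b\n"
tests = ["assert add(1, 2) == 3", "assert add(-1, 1) == 0", "assert add(0, 0) == 0"]
print(passes_all_tests(candidate, tests))  # True
```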
Model performance gradually improved with parameter count, starting from ~55% for the 1B model and reaching ~90% for the 27B model.
Because we worked with a small sample of data, we evaluated result stability by generating 5 different responses for each query and tracking accumulated accuracy across iterations. Accumulated accuracy at iteration N is the fraction of test samples for which the model succeeded at least once in its first N trials. This experiment was performed using Gemma3 1B.
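The accumulated accuracy metric can be sketched as follows. This assumes the per-trial outcomes are available as a list of booleans per problem. The function name and the toy data are illustrative, not taken from the repository.

```python
def accumulated_accuracy(results: list) -> list:
    """results[i][j] is True if trial j on problem i passed all tests.

    Returns one value per N = 1..num_trials: the share of problems
    with at least one success among their first N trials.
    """
    num_trials = len(results[0])
    return [
        sum(any(trials[:n]) for trials in results) / len(results)
        for n in range(1, num_trials + 1)
    ]


# Toy outcomes for 4 problems x 3 trials (made up for illustration).
toy = [
    [True, False, True],
    [False, True, False],
    [False, False, False],
    [False, False, True],
]
print(accumulated_accuracy(toy))  # [0.25, 0.5, 0.75]
```

Because a problem that has succeeded once stays counted as solved, the curve can never decrease with N. Stability here therefore means the curve flattening out, not run-to-run agreement.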
Our analysis showed that results stabilized after 2-3 iterations, informing our experimental design choices. For subsequent experiments, we standardized on 3 iterations to ensure result comparability, unless stated otherwise.
We analyzed the impact of providing examples in the model prompt by evaluating zero-shot approaches against few-shot approaches with 3 examples. Both experiments were run three times, with accuracy calculated based on whether any of the model trials for each task was successful.
We observed slight improvements with the few-shot approach. However, the zero-shot results in this experiment were noticeably lower than those seen in Q1 and Q2. Increasing the number of iterations or evaluating more examples could produce more reliable results, though both would require greater computational resources. While Q2 suggested that results stabilize around 3 iterations, these findings indicate this may not always be the case.
We analyzed how the number of examples affects model performance in few-shot scenarios, using a methodology similar to previous research questions (3 runs with accuracy measured by any successful attempt).
We observed that accuracy rose as more examples were added to the prompt, showing a clear upward trend in the few-shot scenario.
We analyzed whether providing additional task reflection from a Gemma 4B model could improve results. We compared the zero-shot approach with the reflection approach for both 4B and 12B models, running just one iteration due to computational constraints.
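The two-stage flow described above can be sketched as below. Model calls are abstracted as plain callables because the actual client code and prompts are not shown in this document. The `ModelFn` interface, the function names, and the prompt wording are all assumptions made for illustration.

```python
from typing import Callable

ModelFn = Callable[[str], str]  # prompt in, completion out (hypothetical interface)


def generate_with_reflection(task: str, reflect: ModelFn, generate: ModelFn) -> str:
    """Ask a secondary model to reflect on the task, then pass its notes to the coder."""
    reflection = reflect(
        f"Analyze this programming task and list edge cases and a plan:\n{task}"
    )
    prompt = (
        f"Task:\n{task}\n\n"
        f"Notes from another agent:\n{reflection}\n\n"
        "Write a Python function that solves the task."
    )
    return generate(prompt)


# Stub models so the sketch runs without a model server.
def fake_reflector(prompt: str) -> str:
    return "Handle negative numbers."


def fake_coder(prompt: str) -> str:
    assert "Handle negative numbers." in prompt  # the reflection reaches the coder
    return "def add(a, b):\n    return a + b"


print(generate_with_reflection("Add two numbers.", fake_reflector, fake_coder))
```

In the setup described here, `reflect` would wrap the Gemma 4B model and `generate` would wrap the 4B or 12B model under test.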
Surprisingly, the additional reflection component did not strengthen model results, with performance remaining on par for both 4B and 12B models.
We analyzed whether combining few-shot learning with reflection could improve model performance compared to zero-shot approaches. For this analysis, we used the Gemma 12B model with 3 few-shot examples.
Unfortunately, the few-shot approach with reflection did not strengthen model results either. This is somewhat understandable given that few-shot approaches didn't consistently improve results in Q3 and Q4.
We investigated whether using vector search to select semantically similar examples could improve model performance compared to sequential example selection. We used the sentence transformer model (all-MiniLM-L6-v2) to compute embeddings and find similar examples based on cosine similarity.
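A minimal sketch of similarity-based example selection follows. To keep it runnable without extra dependencies, it uses a toy word-count embedding as a stand-in. In the actual experiments the embeddings come from all-MiniLM-L6-v2, for instance via `SentenceTransformer("all-MiniLM-L6-v2").encode(...)` from the `sentence-transformers` package. The helper names and the example pool are hypothetical.

```python
import math
from collections import Counter
from typing import Callable, Sequence


def cosine_similarity(a: Sequence, b: Sequence) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_examples(query: str, pool: list, embed: Callable, k: int) -> list:
    """Return the k pool tasks whose embeddings are most similar to the query."""
    q = embed(query)
    ranked = sorted(pool, key=lambda task: cosine_similarity(q, embed(task)), reverse=True)
    return ranked[:k]


# Toy stand-in embedding: word counts over a tiny fixed vocabulary (assumption).
VOCAB = ["list", "string", "sum", "reverse", "numbers"]


def toy_embed(text: str) -> list:
    counts = Counter(text.lower().split())
    return [float(counts[w]) for w in VOCAB]


pool = [
    "reverse a string",
    "sum of numbers in a list",
    "reverse a list",
]
print(select_examples("sum the numbers", pool, toy_embed, k=1))
# ['sum of numbers in a list']
```

Sequential selection, the baseline in this comparison, would correspond to taking `pool[:k]` regardless of the query.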
The results showed that vector search-based example selection provided only a small accuracy improvement.
Building on Q7, we analyzed how the number of examples selected through vector search affects model performance. We compared different numbers of examples (1, 3, 5, and 7) to understand if more examples lead to better results.
Accuracy improved as the number of examples selected through vector search increased.
Our research on Gemma models for code generation revealed that larger models (12B, 27B) consistently outperform smaller variants (1B, 4B) in zero-shot tasks. While multiple iterations improved success rates, they plateaued after approximately 3 attempts. Few-shot learning showed modest improvements that scaled with example count, though surprisingly, adding structured reflection components to prompts did not yield significant improvements. Vector-based example selection provided only marginal benefits over sequential selection.
The results suggest that simpler approaches like basic few-shot learning can be as effective as more complex strategies, and that balancing model size with example count may be more practical than pursuing sophisticated prompt engineering techniques.
These conclusions should be treated as tentative: the experiments covered only 50 problems, and the results were not stable across runs, as the Q3 zero-shot results showed.
- Clone the repository:
git clone https://github.com/yourusername/small-gemma-models-for-code-generation.git
cd small-gemma-models-for-code-generation
- Download Gemma models:
ollama pull gemma3:1b
ollama pull gemma3:4b
ollama pull gemma3:12b
ollama pull gemma3:27b
- Create and activate virtual environment:
uv venv
source .venv/bin/activate # On Unix/macOS
# or
.venv\Scripts\activate # On Windows
- Install dependencies:
uv pip install .
- (Optional) Install development dependencies:
uv pip install -e ".[dev]"
# Q1: Basic zero-shot experiments
uv run run_experiments.py --experiment_type single-model --experiment_name q1-zero-shot-gemma3:1b --model_name gemma3:1b
uv run run_experiments.py --experiment_type single-model --experiment_name q1-zero-shot-gemma3:4b --model_name gemma3:4b
uv run run_experiments.py --experiment_type single-model --experiment_name q1-zero-shot-gemma3:12b --model_name gemma3:12b
uv run run_experiments.py --experiment_type single-model --experiment_name q1-zero-shot-gemma3:27b --model_name gemma3:27b
# Q2: Stability analysis
uv run run_experiments.py --experiment_type single-model --experiment_name q2-gemma3:1b --model_name gemma3:1b --num-iterations 5
# Q3: Zero-shot vs. Few-shot Comparison
uv run run_experiments.py --experiment_type single-model --experiment_name q3-zero-shot-gemma3:1b --model_name gemma3:1b --num-iterations 3 --num-few-shot-examples 0
uv run run_experiments.py --experiment_type single-model --experiment_name q3-few-shot-gemma3:1b --model_name gemma3:1b --num-iterations 3 --num-few-shot-examples 3
# Q4: Few-shot Number of Examples Analysis
uv run run_experiments.py --experiment_type single-model --experiment_name q4-few-shot-1-gemma3:1b --model_name gemma3:1b --num-iterations 3 --num-few-shot-examples 1
uv run run_experiments.py --experiment_type single-model --experiment_name q4-few-shot-3-gemma3:1b --model_name gemma3:1b --num-iterations 3 --num-few-shot-examples 3
uv run run_experiments.py --experiment_type single-model --experiment_name q4-few-shot-5-gemma3:1b --model_name gemma3:1b --num-iterations 3 --num-few-shot-examples 5
uv run run_experiments.py --experiment_type single-model --experiment_name q4-few-shot-7-gemma3:1b --model_name gemma3:1b --num-iterations 3 --num-few-shot-examples 7
# Q5: Reflection Approach
uv run run_experiments.py --experiment_type single-model --experiment_name q5-zero-shot-gemma3:4b --model_name gemma3:4b --num-iterations 1 --num-few-shot-examples 0
uv run run_experiments.py --experiment_type single-model --experiment_name q5-zero-shot-gemma3:12b --model_name gemma3:12b --num-iterations 1 --num-few-shot-examples 0
uv run run_experiments.py --experiment_type reflection-approach --experiment_name q5-reflection-approach-gemma3:4b --model_name gemma3:4b --num-iterations 1 --num-few-shot-examples 0
uv run run_experiments.py --experiment_type reflection-approach --experiment_name q5-reflection-approach-gemma3:12b --model_name gemma3:12b --num-iterations 1 --num-few-shot-examples 0
# Q6: Reflection with Few-Shot Examples
uv run run_experiments.py --experiment_type reflection-approach --experiment_name q6-reflection-approach-few-shot-gemma3:12b --model_name gemma3:12b --num-iterations 1 --num-few-shot-examples 2
# Q7: Few-Shot Examples via Vector Search
uv run run_experiments.py --experiment_type single-model --experiment_name q7-vector-search-few-shot-gemma3:1b --model_name gemma3:1b --num-iterations 3 --num-few-shot-examples 3 --use-vector-search
# Q8: Few-Shot Examples via Vector Search Number of Examples Analysis
uv run run_experiments.py --experiment_type single-model --experiment_name q8-vector-search-few-shot-1-gemma3:1b --model_name gemma3:1b --num-iterations 3 --num-few-shot-examples 1 --use-vector-search
uv run run_experiments.py --experiment_type single-model --experiment_name q8-vector-search-few-shot-3-gemma3:1b --model_name gemma3:1b --num-iterations 3 --num-few-shot-examples 3 --use-vector-search
uv run run_experiments.py --experiment_type single-model --experiment_name q8-vector-search-few-shot-5-gemma3:1b --model_name gemma3:1b --num-iterations 3 --num-few-shot-examples 5 --use-vector-search
uv run run_experiments.py --experiment_type single-model --experiment_name q8-vector-search-few-shot-7-gemma3:1b --model_name gemma3:1b --num-iterations 3 --num-few-shot-examples 7 --use-vector-search
Run evaluation for specific results:
# Q1 Experiments
uv run run_evaluation.py --results-path results/q1-zero-shot-gemma3:1b*.json
uv run run_evaluation.py --results-path results/q1-zero-shot-gemma3:4b*.json
uv run run_evaluation.py --results-path results/q1-zero-shot-gemma3:12b*.json
uv run run_evaluation.py --results-path results/q1-zero-shot-gemma3:27b*.json
# Q2 Experiments
uv run run_evaluation.py --results-path results/q2-gemma3:1b*.json
# Q3 Experiments
uv run run_evaluation.py --results-path results/q3-zero-shot-gemma3:1b*.json
uv run run_evaluation.py --results-path results/q3-few-shot-gemma3:1b*.json
# Q4 Experiments
uv run run_evaluation.py --results-path results/q4-few-shot-1-gemma3:1b*.json
uv run run_evaluation.py --results-path results/q4-few-shot-3-gemma3:1b*.json
uv run run_evaluation.py --results-path results/q4-few-shot-5-gemma3:1b*.json
uv run run_evaluation.py --results-path results/q4-few-shot-7-gemma3:1b*.json
# Q5 Experiments
uv run run_evaluation.py --results-path results/q5-zero-shot-gemma3:4b*.json
uv run run_evaluation.py --results-path results/q5-zero-shot-gemma3:12b*.json
uv run run_evaluation.py --results-path results/q5-reflection-approach-gemma3:4b*.json
uv run run_evaluation.py --results-path results/q5-reflection-approach-gemma3:12b*.json
# Q6 Experiments
uv run run_evaluation.py --results-path results/q6-reflection-approach-few-shot-gemma3:12b*.json
# Q7 Experiments
uv run run_evaluation.py --results-path results/q7-vector-search-few-shot-gemma3:1b*.json
# Q8 Experiments
uv run run_evaluation.py --results-path results/q8-vector-search-few-shot-1-gemma3:1b*.json
uv run run_evaluation.py --results-path results/q8-vector-search-few-shot-3-gemma3:1b*.json
uv run run_evaluation.py --results-path results/q8-vector-search-few-shot-5-gemma3:1b*.json
uv run run_evaluation.py --results-path results/q8-vector-search-few-shot-7-gemma3:1b*.json
Generate visualizations:
uv run run_analysis.py






