This paper presents a comprehensive analysis of Programming Language Confusion (PLC) in Large Language Models (LLMs) performing code generation and code translation tasks. We evaluate how effectively LLMs maintain programming language consistency in both settings. Our experiments across 16 popular programming languages and 10 LLMs show that these models often produce code in unintended languages, even when given explicit instructions or contextual cues.
Figure 1: Overview of our research methodology for evaluating Programming Language Confusion in LLMs
We used five datasets for our evaluation:
- Code Generation:
  - LiveCodeBench
  - BabelCode dataset (containing BC-MBPP, BC-TP3, and BC-HumanEval)
  - HumanEval-XL
- Code Translation:
  - McEval
Supplementary materials and detailed results supporting the main paper.
This directory contains all datasets used in our research, as well as the results obtained for each evaluated model.
/evaluation_data/:
- code_generation/: houses datasets specifically used for code generation tasks.
- code_translation/: stores datasets employed for code translation tasks.
/results/: includes all generated outputs from the evaluated models.
- generation_results/: stores results from code generation experiments.
- translation_results/: contains results from code translation experiments.
This directory contains the complete implementation of our evaluation framework.
- setup_ollama.py: serves as the main entry point for running experiments with the Ollama framework. To execute, use the command: python src/setup_ollama.py llama3.1 code_translation mceval
- language_identification/: contains scripts dedicated to identifying the programming language of generated outputs.
- process_results.py: processes outputs and performs overall language classification.
- visualize.py: creates visual representations, including chord diagrams, to showcase language migration patterns.
- compute_performances.py: computes and presents detailed performance metrics.
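To give an intuition for the language identification step, the sketch below shows a toy keyword-based identifier. This is purely illustrative: the class name LangGuess and the cue strings are our own assumptions, and the repository's actual classifier in ./language_identification/ may work very differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LangGuess {
    // Toy heuristic: scan for language-specific keywords in insertion order;
    // the first cue found determines the guessed language.
    public static String guess(String code) {
        Map<String, String> cues = new LinkedHashMap<>();
        cues.put("def ", "python");
        cues.put("public class", "java");
        cues.put("#include", "cpp");
        cues.put("fn ", "rust");
        for (Map.Entry<String, String> e : cues.entrySet()) {
            if (code.contains(e.getKey())) {
                return e.getValue();
            }
        }
        return "unknown";
    }
}
```

A heuristic like this already suffices to flag the confusion examples below, where a Java prompt yields output starting with "def ".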
This directory stores all migration diagrams corresponding to each dataset and model used in this study.
Example 1 of Confusion
class Solution {
    /**
     * Write a java function to find the last two digits in factorial of a given number.
     */
    public Integer lastTwoDigits(Integer n) {
Write the full solution for the problem above.
Return the entire code, including the function definition and its implementation.
Do not include explanations or comments.
Generated Code by CodeLlama-13B-Instruct (Ollama)
def get_last_two_digits_of_factorial(n):
    fact = 1
    for i in range(2, n+1):
        fact *= i
    return str(fact)[-2:]

# Test cases
assert get_last_two_digits_of_factorial(1) == "1"
assert get_last_two_digits_of_factorial(10) == "4"
assert get_last_two_digits_of_factorial(100) == "7"

Example 2 of Confusion
class Solution {
    /**
     * Write a function to find frequency of the elements in a given array of arrays using collections module.
     */
    public Map<Integer, Integer> freqElement(ArrayList<ArrayList<Integer>> nums) {
Write the full solution for the problem above.
Return the entire code, including the function definition and its implementation.
Do not include explanations or comments.
Generated Code by CodeLlama-13B-Instruct
import collections

def freqElement(nums):
    freq = collections.defaultdict(int)
    for arr in nums:
        for num in arr:
            freq[num] += 1
    return freq

Example 3 of Confusion
/**
 * Verifies that the inputs satisfy the problem:
 * Determine, for each string in x, whether the last character is an isolated letter
 * ["a b c", "abc"] => [True, False]
 */
def sat(y: List[Boolean], x: List[String]): Boolean = {
Write the full solution for the problem above.
Return the entire code, including the function definition and its implementation.
Do not include explanations or comments.
Generated Code by GPT-4.1-Mini
from typing import List

def sat(y: List[bool], x: List[str]) -> bool:
    return all(
        y_i == (
            len(s) > 0 and
            s[-1].isalpha() and
            (len(s) == 1 or (s[-2] == ' ' and (len(s) == 2 or s[-3] == ' ')))
        )
        for y_i, s in zip(y, x)
    )
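For contrast with the confused outputs above, a language-consistent completion of the Java prompts in Examples 1 and 2 might look like the following sketch. This is our illustration, not model output; only the class, method, and parameter names come from the prompts.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

class Solution {
    // Example 1: last two digits of n!.
    // Keeping a running product modulo 100 avoids integer overflow for large n.
    public Integer lastTwoDigits(Integer n) {
        int fact = 1;
        for (int i = 2; i <= n; i++) {
            fact = (fact * i) % 100;
        }
        return fact;
    }

    // Example 2: frequency of elements across an array of arrays.
    public Map<Integer, Integer> freqElement(ArrayList<ArrayList<Integer>> nums) {
        Map<Integer, Integer> freq = new HashMap<>();
        for (ArrayList<Integer> arr : nums) {
            for (Integer num : arr) {
                freq.merge(num, 1, Integer::sum);
            }
        }
        return freq;
    }
}
```

Note that the modulo-100 trick also sidesteps the correctness issue in the generated Python of Example 1, whose test cases assert wrong values (10! = 3628800 ends in "00", not "4").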