This paper presents a comprehensive analysis of Programming Language Confusion (PLC) in Large Language Models (LLMs) performing code generation and code translation tasks. We evaluate how effectively LLMs maintain programming language consistency in both settings. Our experiments across 16 popular programming languages and 10 LLMs show that these models often produce code in unintended languages, even when given explicit instructions or contextual cues.
Figure 1: Overview of our research methodology for evaluating Programming Language Confusion in LLMs
We used five datasets for our evaluation:
- Code Generation:
  - LiveCodeBench
  - BabelCode dataset (containing BC-MBPP, BC-TP3, and BC-HumanEval)
  - HumanEval-XL
- Code Translation:
  - McEval
Supplementary materials and detailed results supporting the main paper.
This directory contains all datasets used in our research, as well as the results obtained for each evaluated model.
/evaluation_data/:
- code_generation/: houses datasets specifically used for code generation tasks.
- code_translation/: stores datasets employed for code translation tasks.
/results/: includes all generated outputs from the evaluated models.
- generation_results/: stores results from code generation experiments.
- translation_results/: contains results from code translation experiments.
This directory contains the complete implementation of our evaluation framework.
- setup_ollama.py: serves as the main entry point for running experiments with the Ollama framework. To execute, use the command: python src/setup_ollama.py llama3.1 code_translation mceval
- language_identification/: contains scripts dedicated to identifying the programming language of generated outputs.
- process_results.py: processes outputs and performs overall language classification.
- visualize.py: creates visual representations, including chord diagrams, to showcase language migration patterns.
- compute_performances.py: computes and presents detailed performance metrics.
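To give an intuition for the language identification step, the sketch below shows a toy keyword-based identifier. This is purely illustrative: the class name LangGuess and the cue strings are our own assumptions, and the repository's actual classifier in ./language_identification/ may work very differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LangGuess {
    // Toy heuristic: scan for language-specific keywords in insertion order;
    // the first cue found determines the guessed language.
    public static String guess(String code) {
        Map<String, String> cues = new LinkedHashMap<>();
        cues.put("def ", "python");
        cues.put("public class", "java");
        cues.put("#include", "cpp");
        cues.put("fn ", "rust");
        for (Map.Entry<String, String> e : cues.entrySet()) {
            if (code.contains(e.getKey())) {
                return e.getValue();
            }
        }
        return "unknown";
    }
}
```

A heuristic like this already suffices to flag the confusion examples below, where a Java prompt yields output starting with "def ".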
This directory stores all migration diagrams corresponding to each dataset and model used in this study.
Example 1 of Confusion
class Solution {
    /**
     * Write a java function to find the last two digits in factorial of a given number.
     */
    public Integer lastTwoDigits(Integer n) {
Write the full solution for the problem above.
Return the entire code, including the function definition and its implementation.
Do not include explanations or comments.
Generated Code by CodeLlama-13B-Instruct (Ollama)
def get_last_two_digits_of_factorial(n):
    fact = 1
    for i in range(2, n+1):
        fact *= i
    return str(fact)[-2:]

# Test cases
assert get_last_two_digits_of_factorial(1) == "1"
assert get_last_two_digits_of_factorial(10) == "4"
assert get_last_two_digits_of_factorial(100) == "7"

Example 2 of Confusion
class Solution {
    /**
     * Write a function to find frequency of the elements in a given array of arrays using collections module.
     */
    public Map<Integer, Integer> freqElement(ArrayList<ArrayList<Integer>> nums) {
Write the full solution for the problem above.
Return the entire code, including the function definition and its implementation.
Do not include explanations or comments.
Generated Code by CodeLlama-13B-Instruct
import collections

def freqElement(nums):
    freq = collections.defaultdict(int)
    for arr in nums:
        for num in arr:
            freq[num] += 1
    return freq

Example 3 of Confusion
/**
 * Verifies that the inputs satisfy the problem:
 * Determine, for each string in x, whether the last character is an isolated letter
 * ["a b c", "abc"] => [True, False]
 */
def sat(y: List[Boolean], x: List[String]): Boolean = {
Write the full solution for the problem above.
Return the entire code, including the function definition and its implementation.
Do not include explanations or comments.
Generated Code by GPT-4.1-Mini
from typing import List

def sat(y: List[bool], x: List[str]) -> bool:
    return all(
        y_i == (
            len(s) > 0 and
            s[-1].isalpha() and
            (len(s) == 1 or (s[-2] == ' ' and (len(s) == 2 or s[-3] == ' ')))
        )
        for y_i, s in zip(y, x)
    )
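For contrast with the confused outputs above, a language-consistent completion of the Java prompts in Examples 1 and 2 might look like the following sketch. This is our illustration, not model output; only the class, method, and parameter names come from the prompts.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

class Solution {
    // Example 1: last two digits of n!.
    // Keeping a running product modulo 100 avoids integer overflow for large n.
    public Integer lastTwoDigits(Integer n) {
        int fact = 1;
        for (int i = 2; i <= n; i++) {
            fact = (fact * i) % 100;
        }
        return fact;
    }

    // Example 2: frequency of elements across an array of arrays.
    public Map<Integer, Integer> freqElement(ArrayList<ArrayList<Integer>> nums) {
        Map<Integer, Integer> freq = new HashMap<>();
        for (ArrayList<Integer> arr : nums) {
            for (Integer num : arr) {
                freq.merge(num, 1, Integer::sum);
            }
        }
        return freq;
    }
}
```

Note that the modulo-100 trick also sidesteps the correctness issue in the generated Python of Example 1, whose test cases assert wrong values (10! = 3628800 ends in "00", not "4").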