TruX-DTF/PLC

Programming Language Confusion: When Code LLMs Can't Keep their Languages Straight

Summary

This paper presents a comprehensive analysis of Programming Language Confusion (PLC) in Large Language Models (LLMs): the tendency to produce code in a language other than the intended one. We evaluate how consistently LLMs stay in the target programming language during code generation and code translation. Our experiments with 16 popular programming languages across 10 LLMs show that these models often produce code in unintended languages, even when given clear instructions or contextual cues.

Research Workflow

Figure 1: Overview of our research methodology for evaluating Programming Language Confusion in LLMs

Dataset and Benchmarks

We used five datasets for our evaluation:

  • Code Generation:
    • LiveCodeBench
    • BabelCode dataset (containing BC-MBPP, BC-TP3, and BC-HumanEval)
    • HumanEval-XL
  • Code Translation:
    • McEval

Repository Structure

/Appendix

Supplementary materials and detailed results supporting the main paper.

/data

This directory contains all datasets used in our research, as well as the results obtained for each evaluated model.

  • /evaluation_data/:
    • code_generation/: houses datasets specifically used for code generation tasks.
    • code_translation/: stores datasets employed for code translation tasks.
  • /results/: includes all generated outputs from the evaluated models.
    • generation_results/: stores results from code generation experiments.
    • translation_results/: contains results from code translation experiments.

/src

This directory contains the complete implementation of our evaluation framework.

  • setup_ollama.py: the main entry point for running experiments with the Ollama framework. The arguments are the model name, the task, and the dataset, for example: python src/setup_ollama.py llama3.1 code_translation mceval.
  • /language_identification/: contains scripts dedicated to identifying the programming language in generated outputs.
  • process_results.py: processes model outputs and performs the overall language classification.
  • visualize.py: creates visual representations, including chord diagrams, to showcase language migration patterns.
  • compute_performances.py: computes and presents detailed performance metrics.
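
To make the idea of language migration concrete, the sketch below counts how often an expected language ends up as a different detected language, which is the kind of pair count a chord diagram could be drawn from. It is only an illustration under stated assumptions: the function names, the record layout (pairs of expected and detected language), and the sample records are hypothetical and are not taken from the scripts in /src.

```python
from collections import Counter


def migration_counts(records):
    """Count (expected, detected) language pairs.

    `records` is a hypothetical iterable of (expected_language,
    detected_language) tuples; the repository's real result format may differ.
    Language names are lowercased so "Java" and "java" count as the same label.
    """
    return Counter((exp.lower(), det.lower()) for exp, det in records)


def consistency_rate(counts):
    """Fraction of outputs whose detected language equals the expected one."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    matched = sum(n for (exp, det), n in counts.items() if exp == det)
    return matched / total


# Made-up sample records, for illustration only (not results from the paper).
records = [
    ("java", "java"),
    ("java", "python"),
    ("scala", "python"),
    ("Java", "Python"),
]
counts = migration_counts(records)
print(counts[("java", "python")])
print(consistency_rate(counts))
```

Keeping the counts keyed by (expected, detected) pairs means the same structure can serve both a per-language consistency score and a source-to-target migration plot.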

/images

This directory stores all migration diagrams corresponding to each dataset and model used in this study.

Code Confusion Examples

Example 1 of Confusion

class Solution {
    /**
     * Write a java function to find the last two digits in factorial of a given number.
     */
    public Integer lastTwoDigits(Integer n) {

Write the full solution for the problem above.
Return the entire code, including the function definition and its implementation.
Do not include explanations or comments. 

Generated Code by CodeLLama-13B-Instruct-Ollama

def get_last_two_digits_of_factorial(n):
    fact = 1
    for i in range(2, n+1):
        fact *= i
    return str(fact)[-2:]
# Test cases
assert get_last_two_digits_of_factorial(1) == "1"
assert get_last_two_digits_of_factorial(10) == "4"
assert get_last_two_digits_of_factorial(100) == "7"

Example 2 of Confusion

class Solution {
    /**
     * Write a function to find frequency of the elements in a given array of arrays using collections module.
     */
    public Map<Integer, Integer> freqElement(ArrayList<ArrayList<Integer>> nums) {

Write the full solution for the problem above.
Return the entire code, including the function definition and its implementation.
Do not include explanations or comments. 

Generated Code by CodeLLama-13B-Instruct

import collections
def freqElement(nums):
    freq = collections.defaultdict(int)
    for arr in nums:
        for num in arr:
            freq[num] += 1
    return freq

Example 3 of Confusion

/**
 * Verifies that the inputs satisfy the problem:
 * Determine, for each string in x, whether the last character is an isolated letter
 * ["a b c", "abc"] => [True, False]
 */
def sat(y: List[Boolean], x: List[String]): Boolean = {

Write the full solution for the problem above.
Return the entire code, including the function definition and its implementation.
Do not include explanations or comments. 

Generated Code by GPT-4.1-Mini

from typing import List
def sat(y: List[bool], x: List[str]) -> bool:
    return all(
        y_i == (
            len(s) > 0 and
            s[-1].isalpha() and
            (len(s) == 1 or (s[-2] == ' ' and (len(s) == 2 or s[-3] == ' ')))
        )
        for y_i, s in zip(y, x)
    )
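
Flagging Confusion Automatically (illustrative sketch)

The three examples above share one pattern: the prompt is written in Java or Scala, and the output is Python. A minimal sketch of how such a mismatch could be flagged is given below. It is a toy heuristic based on a handful of regular expressions, not the method used in /language_identification/; the marker lists and the names guess_language and is_confused are assumptions made for this example, and a real identifier would need far more signal.

```python
import re

# Hypothetical, deliberately tiny marker sets (assumption for illustration).
MARKERS = {
    "python": [
        r"^\s*def \w+\(.*\).*:\s*$",   # def f(...):  or  def f(...) -> T:
        r"^\s*import \w+",
        r"^\s*from \w+ import ",
        r"^\s*assert ",
    ],
    "java": [
        r"\bpublic\s+\w",
        r"\bclass \w+\s*\{",
        r";\s*$",                       # statement terminator at end of line
    ],
    "scala": [
        r"^\s*def \w+\(.*\)\s*:\s*[\w\[\]]+\s*=",  # def f(...): T =
        r"\bval \w+",
        r"\bobject \w+",
    ],
}


def guess_language(code):
    """Return the language whose markers match most often, or "unknown"."""
    scores = {
        lang: sum(
            1
            for line in code.splitlines()
            for pattern in patterns
            if re.search(pattern, line)
        )
        for lang, patterns in MARKERS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def is_confused(expected, code):
    """True when the guessed language differs from the expected one."""
    return guess_language(code) != expected.lower()


# The generated code from Example 2, which was expected to be Java.
generated = (
    "import collections\n"
    "def freqElement(nums):\n"
    "    freq = collections.defaultdict(int)\n"
    "    for arr in nums:\n"
    "        for num in arr:\n"
    "            freq[num] += 1\n"
    "    return freq\n"
)
print(guess_language(generated), is_confused("Java", generated))
```

A heuristic like this is easy to fool, for instance by snippets that mix languages, so it should be read as a way to picture the check rather than as a reliable classifier.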
