LLM-Bisect

LLM-Bisect is an automated tool for identifying Bug-Inducing Commits (BICs) in the Linux kernel, combining Large Language Models (LLMs) with git-bisect-style history analysis.

Overview

When a security vulnerability is patched in the Linux kernel, understanding when the vulnerability was introduced is crucial for:

  • Determining which kernel versions are affected
  • Assessing the severity and impact of the vulnerability
  • Identifying patterns in how vulnerabilities are introduced

LLM-Bisect automates this process by:

  1. Analyzing the vulnerability patch to identify critical code changes
  2. Tracing the history of changed functions and lines using git blame
  3. Using LLMs to determine which historical commit introduced the vulnerability

Features

  • Critical Line Extraction: Automatically identifies the most security-relevant changes in a patch
  • Multi-strategy Candidate Finding: Uses both function-based and line-based history tracing
  • LLM-powered Analysis: Leverages models like GPT-4/o1 to understand code semantics
  • Accuracy Evaluation: Built-in tools for evaluating results against ground truth

Installation

# Clone the repository
git clone https://github.com/your-org/llm-bisect.git
cd llm-bisect

# Install dependencies
pip install -r requirements.txt

API Key Configuration

LLM-Bisect requires an OpenAI API key to function. You can configure it using one of the following methods:

Method 1: Environment Variable (Recommended for Production)

Set the OPENAI_API_KEY environment variable:

export OPENAI_API_KEY="sk-your-api-key-here"

Add this to your ~/.bashrc or ~/.zshrc to persist across sessions:

echo 'export OPENAI_API_KEY="sk-your-api-key-here"' >> ~/.bashrc
source ~/.bashrc

Method 2: .env File (Recommended for Development)

Create a .env file in the project root directory:

# .env
OPENAI_API_KEY=sk-your-api-key-here
LLM_BISECT_KERNEL_PATH=/path/to/linux/kernel
LLM_BISECT_MODEL=o1

The application automatically loads this file using python-dotenv.

Important: Add .env to your .gitignore to avoid committing secrets:

echo ".env" >> .gitignore

Method 3: Programmatic Configuration

Set the API key in your Python code:

import os
os.environ["OPENAI_API_KEY"] = "sk-your-api-key-here"

from llm_bisect import bisect_vulnerability
result = bisect_vulnerability("commit-hash")

Environment Variables Reference

Variable                Description                           Default
OPENAI_API_KEY          Your OpenAI API key (required)        -
LLM_BISECT_KERNEL_PATH  Path to the Linux kernel repository   /home/zzhan173/repos/linux
LLM_BISECT_STORE_PATH   Path to store intermediate results    ./cases/
LLM_BISECT_MODEL        Default LLM model to use              o1
LLM_BISECT_DEBUG        Enable debug logging                  false
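A settings loader following the table above might look like the sketch below: optional variables fall back to their defaults, while a missing OPENAI_API_KEY fails fast. The function name and error type are assumptions for illustration, not the actual Config implementation.

```python
import os

# Defaults taken from the environment-variable table above
DEFAULTS = {
    "LLM_BISECT_KERNEL_PATH": "/home/zzhan173/repos/linux",
    "LLM_BISECT_STORE_PATH": "./cases/",
    "LLM_BISECT_MODEL": "o1",
    "LLM_BISECT_DEBUG": "false",
}

def load_settings(environ=os.environ):
    """Resolve settings from the environment, applying documented defaults.

    Hypothetical helper -- the real tool uses its Config class.
    """
    if "OPENAI_API_KEY" not in environ:
        # The API key has no default; fail before any LLM call is attempted
        raise RuntimeError("OPENAI_API_KEY is not set")
    return {key: environ.get(key, default) for key, default in DEFAULTS.items()}
```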

Getting an OpenAI API Key

  1. Go to OpenAI Platform
  2. Sign up or log in to your account
  3. Navigate to API Keys section
  4. Click Create new secret key
  5. Copy the key (it starts with sk-)

Note: The o1 model requires a paid OpenAI account with sufficient credits.

Quick Start

Single Commit Analysis

from llm_bisect import bisect_vulnerability

# Analyze a vulnerability patch commit
result = bisect_vulnerability("6ca575374dd9a507cdd16dfa0e78c2e9e20bd05f")

if result.success:
    print(f"Bug-inducing commit: {result.inducing_commit}")
else:
    print(f"Analysis failed: {result.error}")

Command Line Interface

# Analyze a single commit
python -m llm_bisect.cli bisect -c 6ca575374dd9a507cdd16dfa0e78c2e9e20bd05f

# Batch processing
python -m llm_bisect.cli bisect -i commits.json -o results.json

# Evaluate results
python -m llm_bisect.cli evaluate results.json -g ground_truth.json

Using the Main Module

# Run analysis
python -m llm_bisect.main 6ca575374dd9a507cdd16dfa0e78c2e9e20bd05f --model o1

# With verbose output
python -m llm_bisect.main 6ca575374dd9a507cdd16dfa0e78c2e9e20bd05f -v

Configuration

Configure LLM-Bisect using the Config class:

from llm_bisect import Config, set_config

config = Config()
config.paths.kernel_repo = "/path/to/linux"
config.llm.model = "o1"
config.llm.temperature = 0.0
config.bisect.max_candidates = 50

set_config(config)

Project Structure

llm_bisect/
├── __init__.py          # Package exports
├── config.py            # Configuration management
├── main.py              # Main entry point
├── cli.py               # Command-line interface
├── utils/               # Utility modules
│   ├── git_utils.py     # Git operations
│   ├── patch_parser.py  # Patch parsing
│   └── file_utils.py    # File I/O
├── llm/                 # LLM integration
│   ├── client.py        # API client
│   ├── prompts.py       # Prompt templates
│   └── analyzers.py     # Analysis functions
├── bisect/              # Bisection logic
│   ├── candidate_finder.py  # Candidate identification
│   └── bisector.py      # Main bisection algorithm
└── analysis/            # Code analysis
    ├── critical_lines.py    # Critical line extraction
    ├── function_parser.py   # C function parsing
    └── evaluator.py         # Accuracy evaluation

How It Works

1. Critical Line Extraction

LLM-Bisect first analyzes the patch to identify which code changes are most relevant to fixing the vulnerability:

from llm_bisect.analysis import extract_critical_lines

lines = extract_critical_lines(patch_commit, use_llm=True)
for line in lines:
    print(f"{line.filename}:{line.absolute_line}")

2. Candidate Commit Identification

Two strategies are used to find potential bug-inducing commits:

  • Function-based: Traces the history of functions that were modified
  • Line-based: Uses git blame to find when specific lines were introduced
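The line-based strategy can be sketched as a parse of `git blame --porcelain` output: every result line is preceded by a header beginning with the 40-hex commit hash and its line numbers, so mapping critical lines to the commits that last touched them is a one-pass scan. The function below is a sketch with invented names, not the real candidate finder.

```python
import re

# Header line of each `git blame --porcelain` entry:
# "<40-hex sha> <line-in-original> <line-in-result> [<group-size>]"
ENTRY_RE = re.compile(r"^([0-9a-f]{40}) \d+ (\d+)")

def candidate_commits(porcelain_output, critical_lines):
    """Map each critical line number to the commit that last touched it.

    `porcelain_output` is the text of `git blame --porcelain <file>` run on
    the pre-patch file; the distinct commits returned form the line-based
    candidate set. Sketch only -- not the real LLM-Bisect API.
    """
    by_line = {}
    for raw in porcelain_output.splitlines():
        m = ENTRY_RE.match(raw)
        if m:
            by_line[int(m.group(2))] = m.group(1)
    return {ln: by_line[ln] for ln in critical_lines if ln in by_line}
```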

3. LLM Analysis

The LLM evaluates each candidate commit to determine if it could have introduced the vulnerability:

from llm_bisect.llm import determine_if_commit_introduces_vulnerability

is_inducing = determine_if_commit_introduces_vulnerability(
    patch_commit, 
    candidate_commit
)

4. Result Selection

When multiple candidates remain, the LLM selects the most likely inducing commit by comparing them.
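One plausible shape for this comparison step is a single prompt that enumerates the surviving candidates and asks the model to pick one, plus a parser that maps the reply back to a commit hash. Both functions below are hypothetical; the project's actual templates live in llm/prompts.py.

```python
import re

def build_selection_prompt(patch_summary, candidates):
    """Assemble a comparison prompt over the remaining candidates.

    `candidates` is a list of (sha, subject) pairs. Hypothetical prompt
    shape, not the real LLM-Bisect template.
    """
    body = "\n".join(
        f"{i + 1}. {sha}: {subject}"
        for i, (sha, subject) in enumerate(candidates)
    )
    return (
        "A patch fixes this vulnerability:\n"
        f"{patch_summary}\n\n"
        "Which ONE of these commits most likely introduced it?\n"
        f"{body}\n\n"
        "Reply with 'Answer: <number>'."
    )

def parse_selection(reply, candidates):
    """Map the model's 'Answer: N' reply back to a candidate sha."""
    m = re.search(r"Answer:\s*(\d+)", reply)
    if not m:
        return None
    idx = int(m.group(1)) - 1
    return candidates[idx][0] if 0 <= idx < len(candidates) else None
```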

API Reference

Core Functions

bisect_vulnerability(patch_commit: str) -> BisectResult

Main function to find the bug-inducing commit for a given patch.

extract_critical_lines(commit: str, use_llm: bool = True) -> List[CriticalLine]

Extract security-relevant lines from a patch.

evaluate_bisect_accuracy(results_path: str, ground_truth_path: str) -> EvaluationResult

Evaluate bisect results against ground truth.

Classes

Bisector

Main class for performing bisection analysis.

Config

Configuration container with sections for paths, LLM settings, and bisect parameters.

BisectResult

Result container with inducing_commit, candidates, success, and error fields.

Accuracy Evaluation

LLM-Bisect includes tools for evaluating accuracy:

from llm_bisect.analysis import Evaluator

evaluator = Evaluator(ground_truth_path="ground_truth.json")
result = evaluator.evaluate_commit_accuracy(my_results)

print(f"Accuracy: {result.accuracy:.2%}")
print(f"Correct: {result.correct}/{result.total}")
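The accuracy figure above reduces to a simple comparison once results and ground truth share a key. The sketch below assumes both are mappings from patch-commit hash to inducing-commit hash; the evaluator's actual on-disk JSON layout may differ.

```python
def commit_accuracy(results, ground_truth):
    """Fraction of evaluated cases where the predicted BIC matches ground truth.

    Assumed shape: both arguments map patch-commit hash -> inducing-commit
    hash. Illustrative only -- not the real Evaluator implementation.
    """
    scored = {patch for patch in results if patch in ground_truth}
    if not scored:
        return 0.0
    correct = sum(1 for p in scored if results[p] == ground_truth[p])
    return correct / len(scored)
```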

Requirements

  • Python 3.8+
  • Git
  • Access to a Linux kernel repository
  • OpenAI API key (or compatible LLM API)

License

MIT License

Citation

If you use LLM-Bisect in your research, please cite:

@software{llm_bisect,
  title = {LLM-Bisect: Automated Bug-Inducing Commit Identification},
  year = {2024},
  url = {https://github.com/your-org/llm-bisect}
}

Contributing

Contributions are welcome! Please read our contributing guidelines before submitting pull requests.
