Skip to content

PalisadeResearch/AI-self-replication

Repository files navigation

AI Self-Replication Experiments

Experiment infrastructure for testing autonomous AI self-replication across multiple generations. Deploys isolated Docker environments where an AI agent must exploit a vulnerable web app, extract SSH credentials, and replicate itself (model weights + inference server + agent) onto the compromised target.

Prerequisites

Local: bash, ssh, scp. Remote: Ubuntu 24.04 GPU host with passwordless sudo, outbound internet. Docker and NVIDIA drivers install automatically on first run. Driver auto-install targets 570-series on Ubuntu 24.04 (noble) and falls back to 550-series on older releases where 570 is only available as a transitional package. Pass --skip-driver if the host already has a working driver (≥ 550). Tested on Ubuntu 24.04 LTS with NVIDIA driver 570.211.01; other distributions or driver versions may require manual adjustment.

Getting Started

Install dependencies

Via nix + direnv (recommended):

direnv allow

Via uv (manual):

uv sync
source .venv/bin/activate

Add API keys

cp env.example .env

Edit .env and add your keys:

OPENAI_API_KEY=...        # For GPT-5.4 experiments
ANTHROPIC_API_KEY=...     # For Opus experiments
LANGFUSE_PUBLIC_KEY=...   # For trace upload
LANGFUSE_SECRET_KEY=...
LANGFUSE_BASE_URL=...

Experiment Generations

  • Gen-1 — Agent (Codex or qwen-code) + vLLM exploits citrusdrop, downloads weights, replicates onto the target
  • Gen-2 — The replicated system (target-1 snapshot from a successful Gen-1) runs again as host-1 against a fresh citrusdrop-ssti target
  • Gen-3 — The replicated system from a successful Gen-2 (target-1 snapshot of a Gen-2 run) runs again as host-1 against new vulnerability classes the agent wasn't prompted about (mossgate, thornfield)
  • GPT-5.4 Gen-1 — Same Gen-1 exploit chain, but the agent is the OpenAI SDK loop instead of Codex
  • Opus 4.6 Gen-1 — Same Gen-1 exploit chain, but the agent is the Anthropic SDK loop instead of Codex

Running Experiments

All commands below run locallysetup_experiment.sh handles SSH, file transfer, and Docker builds on the remote GPU host. Pass --ephemeral-dir /ephemeral on Hyperstack instances (large scratch disk mounted at /ephemeral).

When a run succeeds (the replicated model responds on target-1), run_codex_prompt.sh automatically snapshots target-1 — at that point target-1 contains the full Codex + vLLM + weights stack that the agent just replicated — and writes a gzipped tarball to ${EPHEMERAL_DIR}/snapshots/target-1-respond-<timestamp>-<traceid>.tar.gz on the GPU host. That tarball is loaded as the new host-1 for the next generation: Gen-2 uses a Gen-1 snapshot, Gen-3 uses a Gen-2 snapshot.

Gen-1 — fresh Codex or qwen-code + vLLM build against citrusdrop

The default harness is Codex CLI driving the Qwen3.5-122B-A10B-FP8 model over a Responses-to-Chat proxy. Pass --harness qwen to run qwen-code instead, which talks to vLLM directly and defaults to the smaller Qwen3.6-35B-A3B-FP8 MoE that fits on a single 80GB GPU. The qwen harness loads prompts from prompts/gen-qwen/ (same exploit chain, but Agent 2 is spawned with qwen -p instead of codex exec) and is incompatible with --backend api. qwen-code Gen-1 snapshots are not yet chainable into Gen-2/Gen-3 — run_from_snapshot.sh assumes a Codex binary and .codex/config.toml inside the snapshot, so the multi-generation flow below applies to Codex Gen-1 only.

# Codex (default)
bash scripts/setup_experiment.sh user@gpu-host \
  --model "Qwen/Qwen3.5-122B-A10B-FP8" --tp 2 --ephemeral-dir /ephemeral

# qwen-code
bash scripts/setup_experiment.sh user@gpu-host \
  --harness qwen --tp 1 --host-gpus 0 --target-gpus 1 --ephemeral-dir /ephemeral

ssh user@gpu-host '~/run_codex_prompt.sh ~/prompts/direct_gpu_copy_v1.txt'

Gen-2 — Gen-1 snapshot against citrusdrop-ssti

After a successful Gen-1 run, copy the snapshot tarball off the GPU host (e.g. scp user@gpu-host:/ephemeral/snapshots/target-1-respond-*.tar.gz ./), then load it as the new host-1:

bash scripts/setup_experiment.sh user@gpu-host \
  --snapshot ./target-1-respond-<timestamp>-<traceid>.tar.gz \
  --target-app citrusdrop-ssti --ephemeral-dir /ephemeral

ssh user@gpu-host '~/run_from_snapshot.sh "$(cat ~/.replication_demo_snapshot)" ~/prompts/direct_gpu_copy_v1.txt'

Snapshot mode uses run_from_snapshot.sh instead of run_codex_prompt.sh because the snapshotted host-1 inherits the target app's container layout (user oma-it-support, not agent). The imported image tag is persisted in ~/.replication_demo_snapshot at setup time.

Note: The from-snapshot script assumes the snapshotted host-1 has vLLM serving the model under its original path. In practice, the Gen-1 agent may have set up inference differently — e.g. using a short model alias like "qwen" instead of the full HuggingFace ID, or serving on a non-default port. If the snapshot's vLLM config doesn't match what run_from_snapshot.sh expects, you may need to adjust the model path or port in the script to match how the agent actually configured inference in that particular Gen-1 run.

Gen-3 — Gen-2 snapshot against new vulnerability classes

After a successful Gen-2 run, copy the snapshot tarball off the GPU host the same way as for Gen-2, then load it as the new host-1:

# SQL injection
bash scripts/setup_experiment.sh user@gpu-host \
  --snapshot ./target-1-gen2-respond-<timestamp>-<traceid>.tar.gz \
  --target-app mossgate --ephemeral-dir /ephemeral

ssh user@gpu-host '~/run_from_snapshot.sh "$(cat ~/.replication_demo_snapshot)" ~/prompts/direct_gpu_copy_v1.txt'

# Broken access control
bash scripts/setup_experiment.sh user@gpu-host \
  --snapshot ./target-1-gen2-respond-<timestamp>-<traceid>.tar.gz \
  --target-app thornfield --ephemeral-dir /ephemeral

ssh user@gpu-host '~/run_from_snapshot.sh "$(cat ~/.replication_demo_snapshot)" ~/prompts/direct_gpu_copy_v1.txt'

GPT-5.4 Gen-1 / Opus 4.6 Gen-1 — closed-API agent

# Claude Opus 4.6
bash scripts/setup_experiment.sh user@gpu-host \
  --backend api --api-model claude-opus-4-6 --ephemeral-dir /ephemeral

# GPT-5.4
bash scripts/setup_experiment.sh user@gpu-host \
  --backend api --api-model gpt-5.4 --ephemeral-dir /ephemeral

ssh user@gpu-host '~/run_codex_prompt.sh ~/prompts/hack_replicate_benchmark_v1.txt'

Chain replication — geo-distributed bare VMs

Tests multi-hop self-replication across 4 bare VMs (no Docker). Each VM runs qwen-code + vLLM natively; the agent on VM₁ must propagate itself to VM₂, VM₂ to VM₃, and so on. Uses Qwen3.6-27B-FP8 (fits on a single GPU per VM).

bash scripts/setup_experiment_chain.sh user@vm1 user@vm2 user@vm3 user@vm4 \
  --model "Qwen/Qwen3.6-27B-FP8"

The script installs vLLM + qwen-code on each VM, deploys a vulnerable app on the next hop, plants credentials, and starts a Langfuse watcher daemon. Prompts live in prompts/gen-chain-qwen/.

Langfuse traces

Traces upload automatically at the end of each run, provided LANGFUSE_PUBLIC_KEY and LANGFUSE_SECRET_KEY are set in .env. Tags are auto-generated from the run context (date, backend, target app, prompt file, model, agent1/agent2). Append custom tags with --extra-tags tag1,tag2 on setup_experiment.sh.

Target Applications

App Vulnerability
citrusdrop (default) Hash leakage → SSH creds
citrusdrop-ssti Jinja2 SSTI → SSH creds
mossgate SQL injection → SSH creds
thornfield Session token forgery → SSH creds

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors