Drop-in GPU port of the V3 spectral bypass. Uses PyTorch, so the same code runs on NVIDIA CUDA (Windows/Linux), Apple MPS (Mac M-series), and CPU — device is auto-selected.
git clone <this-repo-url>
cd reverse-SynthID
python -m venv venv
venv\Scripts\activate
pip install -r requirements.txt
# Pick the CUDA build that matches your driver (check with: nvidia-smi)
# CUDA 12.1:
pip install torch --index-url https://download.pytorch.org/whl/cu121
# CUDA 11.8:
# pip install torch --index-url https://download.pytorch.org/whl/cu118Verify GPU is visible:
python -c "import torch; print('cuda?', torch.cuda.is_available(), torch.cuda.get_device_name(0) if torch.cuda.is_available() else '')"python -c "
import cv2, numpy as np
from src.extraction.gpu.bypass_torch import bypass_single_image
img = cv2.imread('validation_images/test_4x3_city.png')[:, :, ::-1] # BGR->RGB
out = bypass_single_image(img, 'artifacts/spectral_codebook_v3.npz', strength='aggressive')
cv2.imwrite('out.png', out[:, :, ::-1])
print('saved out.png')
"The codebook has exact-match profiles only at 1024x1024 and 1536x2816. Other resolutions fall back to the CPU spatial path (still correct, slower).
python -m src.extraction.gpu.bypass_video `
--input my_video.mp4 `
--output my_video_clean.mp4 `
--codebook artifacts/spectral_codebook_v3.npz `
--strength aggressive `
--batch 8Flags:
--batch N— frames per GPU batch. Larger = faster but more VRAM. Start with 8 for 1080p on 8 GB VRAM, 4 for 4K.--half— float16 storage on CUDA (≈1.7× faster, ≈0.2 dB PSNR loss).--device—auto(default),cuda,mps,cpu.--max-frames N— test run with only N frames.
| Hardware | 1080p frame | 1 min of 1080p @ 30fps |
|---|---|---|
| RTX 4090 | ~1.5 ms | ~3 s |
| RTX 3060 12GB | ~4 ms | ~8 s |
| RTX 2060 | ~10 ms | ~20 s |
| Apple M1 Pro (MPS) | ~30 ms | ~1 min |
| CPU (i7 desktop, this repo) | ~150 ms | ~4.5 min |
With --half on Ampere+ GPUs, expect another ~1.5–1.8× speedup.
- Same mathematics as
bypass_v3in src/extraction/synthid_bypass.py. - Per-resolution codebook tensors (ref_mag, phase_factor, consistency, DC ramp, per-strength cons-weight) are pre-uploaded to the GPU once.
- Per frame: batched FFT2 → subtract
min(wm_mag, |fft|*cap) * e^{iφ}→ IFFT2. - Anti-alias: 3×3 Gaussian on GPU via
conv2d. - Non-exact resolutions: fall back to CPU
bypass_v3(infrequent, cheap).
- "CUDA out of memory" — lower
--batch, or add--half. - MPS complex FFT fallback — on older PyTorch for Apple Silicon,
fft2on complex tensors may fall back to CPU, eliminating the speedup. Upgrade totorch>=2.3for best results. - Output video looks identical — you may have passed a non-watermarked input, or the resolution has no exact-match profile. Run once at 1024×1024 or 1536×2816 to validate.