9 changes: 9 additions & 0 deletions .gitmodules
@@ -1,3 +1,12 @@
[submodule "contribs/quark"]
	path = contribs/quark
	url = https://github.com/amd/Quark.git
[submodule "contribs/llm-compressor"]
	path = contribs/llm-compressor
	url = https://github.com/vllm-project/llm-compressor.git
[submodule "contribs/transformers"]
	path = contribs/transformers
	url = https://github.com/huggingface/transformers.git
[submodule "contribs/vllm"]
	path = contribs/vllm
	url = https://github.com/vllm-project/vllm.git
113 changes: 113 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,113 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## What is Quanto

Quanto is an LLM quantization toolkit built on AMD Quark. It quantizes HuggingFace models to INT4/INT8/FP8/MXFP4/MXFP6 precisions with multiple memory strategies for different GPU constraints. Source code lives in `src/quanto/`.

## Commands

```bash
# Install
pip install -e ".[dev]"      # dev (pytest, ruff)
pip install -e ".[nvidia]"   # with NVIDIA extras
pip install -e ".[rocm]"     # with ROCm extras

# Tests (requires Quark — run on a remote server with amd-quark installed)
pytest tests/ -v                                                                    # all tests
pytest tests/test_unified_quantizer.py -v                                           # single file
pytest tests/test_unified_quantizer.py::TestUnifiedConfig::test_default_config -v   # single test

# Lint & format
ruff check src/          # lint
ruff check src/ --fix    # lint with autofix
ruff format src/         # format

# Quantize a model (CLI)
python -m quanto \
    --model_path model/path \
    --output_dir ./output \
    --precision mxfp4 \
    --sensitivity_analysis \
    --sensitivity_threshold 0.12

# Quantize with an explicit exclude list (e.g., the attn-excl strategy)
python -m quanto \
    --model_path model/path \
    --output_dir ./output \
    --precision mxfp4 \
    --exclude_layers_file exclude.json

# Dequantize
python -m quanto --dequantize --model_path ./quantized --output_dir ./dequantized

# Docker-based integration tests
./scripts/run_e2e_tests.sh rocm        # all ROCm tests
./scripts/run_e2e_tests.sh cuda 1,2    # specific CUDA tests
```

```python
# Quantize (Python API)
from quanto import UnifiedQuantizer, UnifiedConfig

config = UnifiedConfig(
    model_path="model/path",
    output_dir="./output",
    precision="mxfp4",
    sensitivity_analysis=True,
    sensitivity_threshold=0.12,
)
UnifiedQuantizer(config).run()
```
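
The `--exclude_layers_file` argument expects a JSON list of layer-name patterns (the CLI in `auto_quantize.py` simply `json.load`s the file). A minimal sketch of producing one; the layer names are illustrative, not taken from a real model:

```python
import json

# Illustrative exclude list; real entries depend on the model architecture.
exclude = ["lm_head", "model.layers.0.self_attn.q_proj"]
with open("exclude.json", "w") as f:
    json.dump(exclude, f)
```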

## Architecture

### Pipeline flow
`UnifiedConfig` (dataclass validation) -> `UnifiedQuantizer.run()` -> strategy dispatch -> `QuantizationResult`

### Quantization paths

**MXFP4/MXFP6** — Uses Quark's `quantize_model_per_safetensor` (file2file). Processes each safetensors shard independently without loading the full model. Produces packed uint8 weights + E8M0 scales compatible with vLLM's Quark loader.
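
For intuition, here is a minimal numpy sketch of MX-style block quantization (32-element blocks, one shared power-of-two E8M0 scale, FP4 E2M1 elements). It is illustrative only, not Quark's implementation:

```python
import numpy as np

# Representable magnitudes of FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mx_quantize_block(block: np.ndarray) -> tuple[np.ndarray, int]:
    """Quantize one 32-element block to MXFP4-style values (illustrative)."""
    amax = float(np.abs(block).max())
    # Shared E8M0 scale: a pure power of two chosen so the largest element
    # lands inside E2M1's range (whose maximum exponent is 2).
    shared_exp = int(np.floor(np.log2(amax))) - 2 if amax > 0 else 0
    scale = 2.0**shared_exp
    scaled = block / scale
    # Round each magnitude to the nearest representable E2M1 value.
    codes = E2M1[np.abs(np.abs(scaled)[:, None] - E2M1[None, :]).argmin(axis=1)]
    return np.sign(scaled) * codes * scale, shared_exp  # dequantized block + scale exponent
```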

**INT4/INT8/FP8** — Uses in-memory quantization via `ModelQuantizer` + `export_safetensors`. Three memory strategies:
- `full` — entire model on GPU
- `layerwise_cpu` — model on CPU, layers quantized one-by-one on GPU
- `lazy` — weights loaded on-demand from safetensors
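
A sketch of how `run()` plausibly dispatches between these paths, using the `_run_*` method names documented under core modules below (the branching itself, and routing `layerwise_cpu` through the in-memory path, are assumptions):

```python
# Illustrative dispatch skeleton; only the three _run_* names are
# documented in this repo, the control flow is an assumption.
class UnifiedQuantizerSketch:
    def run(self):
        if self.config.precision in ("mxfp4", "mxfp6"):
            # File-to-file: each safetensors shard is quantized independently.
            return self._run_file2file_quantization()
        if self.config.memory_strategy == "lazy":
            # Weights are loaded on demand from safetensors.
            return self._run_lazy_quantization()
        # "full" (and, assumed here, "layerwise_cpu") take the in-memory path.
        return self._run_full_gpu_quantization()
```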

### Core modules (`src/quanto/core/`)
- **`config.py`** — `UnifiedConfig` dataclass. Key fields: `precision`, `memory_strategy`, `algorithm` (rtn/awq/gptq), `sensitivity_analysis`, `sensitivity_threshold`, `exclude_layers`.
- **`unified_quantizer.py`** — Main quantizer. `run()` dispatches to `_run_file2file_quantization()` for MXFP or `_run_full_gpu_quantization()` / `_run_lazy_quantization()` for INT4/INT8. Contains `_determine_exclude_layers()` with sensitivity analysis and `_align_exclude_groups()` for vLLM fused layer compatibility.
- **`sensitivity/sequential_analyzer.py`** — Iterative sensitivity analysis. Scores each layer using the actual target precision (MXFP4 uses `OCP_MXFP4Spec`, not INT4 proxy). `_build_quant_config_for_scoring()` maps precision to the correct Quark spec class.
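
One plausible shape of that iterative loop, hedged as a sketch (the function signature and the `score_layer` callable are assumptions, not the analyzer's real API):

```python
from typing import Callable

def find_exclude_layers(
    layer_names: list[str],
    score_layer: Callable[[str, list[str]], float],  # error when `name` is quantized, given current excludes
    threshold: float,
    max_iterations: int = 10,
) -> list[str]:
    """Sketch of sequential sensitivity analysis: exclude the worst layer per pass."""
    excluded: list[str] = []
    for _ in range(max_iterations):
        remaining = [n for n in layer_names if n not in excluded]
        if not remaining:
            break
        # Score every remaining layer at the actual target precision/algorithm.
        scores = {name: score_layer(name, excluded) for name in remaining}
        worst, worst_score = max(scores.items(), key=lambda kv: kv[1])
        if worst_score <= threshold:
            break  # everything left quantizes acceptably
        excluded.append(worst)  # keep the most sensitive layer in high precision
    return excluded
```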

### Supporting modules
- **`constants.py`** — `PRECISION_TO_SCHEME` mapping, `MODEL_TYPE_MAPPINGS` (includes `solar_open` -> `qwen3_moe`, `kimi_k2` -> `kimi_k25`), `SUPPORTED_ALGORITHMS`.
- **`auto_quantize.py`** — CLI `main()` entry point. Parses args and creates `UnifiedConfig`. Supports `--exclude_layers_file` for JSON exclude lists.
- **`utils/model_utils.py`** — `detect_model_type()` and `get_template()` for Quark `LLMTemplate` lookup.
- **`utils/calibration.py`** — `CalibrationDataManager` loads from HuggingFace datasets or local files.
- **`utils/int4_pack.py`** — INT4 <-> INT32 packing/unpacking.
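
`int4_pack.py` is described only by its purpose, so here is a generic sketch of the technique under an assumed layout (eight 4-bit values per 32-bit word, lowest nibble first):

```python
import numpy as np

def pack_int4(values: np.ndarray) -> np.ndarray:
    """Pack signed int4 values (range [-8, 7]) into 32-bit words, 8 per word."""
    assert values.size % 8 == 0
    nibbles = (values.astype(np.int32).astype(np.uint32) & 0xF).reshape(-1, 8)
    shifts = np.arange(8, dtype=np.uint32) * 4  # lowest nibble first (assumed)
    return np.bitwise_or.reduce(nibbles << shifts, axis=1)

def unpack_int4(packed: np.ndarray) -> np.ndarray:
    """Inverse of pack_int4; restores the signed 4-bit values."""
    shifts = np.arange(8, dtype=np.uint32) * 4
    nibbles = ((packed[:, None] >> shifts) & 0xF).astype(np.int32)
    return np.where(nibbles >= 8, nibbles - 16, nibbles).reshape(-1)  # sign-extend
```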

### External dependency
AMD Quark is vendored as a git submodule in `contribs/quark/`. Key Quark APIs used:
- `LLMTemplate.get_config(scheme, algorithm, exclude_layers)` — generates per-architecture quantization configs
- `quantize_model_per_safetensor()` — file-to-file quantization (MXFP4 path)
- `ModelQuantizer` / `export_safetensors()` — in-memory quantization (INT4/INT8 path)
- `OCP_MXFP4Spec`, `Int4PerGroupSpec` — precision-specific quantization specs
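
Putting the documented pieces together, the in-memory path plausibly reads like this (a sketch under the signatures listed above; the import path and the commented-out calls are assumptions, not verified against Quark's documentation):

```python
# Sketch only: wiring the documented APIs together for an INT4+AWQ run.
from quanto.utils.model_utils import detect_model_type, get_template  # assumed import path

model_type = detect_model_type("model/path")  # e.g. "llama"
template = get_template(model_type)           # Quark LLMTemplate lookup
quant_config = template.get_config(
    scheme="int4",                            # via PRECISION_TO_SCHEME
    algorithm="awq",
    exclude_layers=["lm_head"],
)
# The remaining steps use Quark's in-memory APIs (signatures assumed):
# model = ModelQuantizer(quant_config).quantize(model)
# export_safetensors(model, output_dir)
```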

## Code style

- Ruff configured: 100-char line length, Python 3.10 target
- Lint rules: E, W, F, I (isort), B (bugbear), C4, UP, ARG, SIM
- Double quotes, space indentation
- `contribs/` directory is excluded from linting

## Key patterns

- **vLLM fused layer alignment**: `_align_exclude_groups()` ensures q/k/v projections and gate/up projections are excluded together, since vLLM fuses these into `qkv_proj` and `gate_up_proj`; see the sketch after this list
- **AWQ/GPTQ algorithm support**: Enforced via the validation matrix in `constants.ALGORITHM_PRECISION_SUPPORT`. Valid combinations:
- RTN: all precisions (int4, int4_64, int4_32, int8, fp8, mxfp4, mxfp6, uint4)
- AWQ: INT4 only (int4, int4_64, int4_32) — activation-aware, Quark `AwqProcessor`
- GPTQ: INT4 only (int4) — Hessian-based, Quark `GptqProcessor`
- Invalid combos (e.g., AWQ+MXFP4, GPTQ+INT8) raise `ValueError` in `UnifiedConfig.validate()`
- **Sensitivity analysis algorithm-awareness**: `SequentialSensitivityAnalyzer._build_quant_config_for_scoring()` passes actual algorithm (not RTN proxy) to `LLMTemplate.get_config()` for correct Quark spec (critical for AWQ/GPTQ accuracy)
- **Backward compat aliases**: `QuantizationConfig = UnifiedConfig`, `AutoQuantizer = UnifiedQuantizer`
- **HF hub resolution**: File2file path auto-resolves HF hub IDs to local cache via `snapshot_download`
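
Two of these patterns are easy to sketch. First, fused-group alignment (the real logic is `_align_exclude_groups()` in `unified_quantizer.py`; the group table and string-replace approach here are assumptions):

```python
# Assumed fused-group table; vLLM fuses these projections together.
FUSED_GROUPS = [
    ["q_proj", "k_proj", "v_proj"],  # fused into qkv_proj
    ["gate_proj", "up_proj"],        # fused into gate_up_proj
]

def align_exclude_groups(exclude_layers: list[str]) -> list[str]:
    """If any member of a fused group is excluded, exclude all its siblings."""
    aligned = set(exclude_layers)
    for name in exclude_layers:
        for group in FUSED_GROUPS:
            for member in group:
                if member in name:
                    # e.g. "...self_attn.q_proj" also pulls in k_proj and v_proj
                    aligned.update(name.replace(member, sibling) for sibling in group)
    return sorted(aligned)
```

Excluding only `model.layers.0.self_attn.q_proj` then also excludes the matching `k_proj` and `v_proj`, so the fused `qkv_proj` sees a single precision. Second, the algorithm/precision check, using the `ALGORITHM_PRECISION_SUPPORT` table added in `constants.py` later in this diff (the function shape is only a sketch of what `UnifiedConfig.validate()` does):

```python
ALGORITHM_PRECISION_SUPPORT = {
    "rtn": ["int4", "int4_64", "int4_32", "int8", "fp8", "mxfp4", "mxfp6", "uint4"],
    "awq": ["int4", "int4_64", "int4_32"],
    "gptq": ["int4"],
}

def validate_combo(algorithm: str, precision: str) -> None:
    """Reject unsupported algorithm/precision pairs, e.g. AWQ+MXFP4."""
    allowed = ALGORITHM_PRECISION_SUPPORT.get(algorithm, [])
    if precision not in allowed:
        raise ValueError(f"Algorithm {algorithm!r} does not support precision {precision!r}")
```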

## Testing environment

Remote server mi355-gpu-16 (aac14 cluster) with MI355 GPUs. Use podman containers with the `rocm/vllm-dev:nightly` image, which includes PyTorch, Quark, and all dependencies. See `memory/reference_mi355_server.md` for access details.
47 changes: 0 additions & 47 deletions README.md
@@ -82,53 +82,6 @@
docker build -f docker/Dockerfile.rocm.dev -t quanto:rocm-dev .
docker run --device=/dev/kfd --device=/dev/dri --group-add video -v $(pwd):/workspace -w /workspace quanto:rocm-dev bash
```

## Project Structure

```
quanto/
├── pyproject.toml             # Package configuration
├── README.md                  # This file
├── requirements.txt           # Base requirements
├── requirements-nvidia.txt    # NVIDIA-specific deps
├── requirements-rocm.txt      # ROCm-specific deps
├── contribs/
│   └── quark/                 # AMD Quark (submodule)
├── docker/
│   ├── Dockerfile.cuda        # Pre-built for CUDA
│   ├── Dockerfile.cuda.dev    # Development for CUDA
│   ├── Dockerfile.rocm        # Pre-built for ROCm
│   └── Dockerfile.rocm.dev    # Development for ROCm
├── docs/
│   └── examples.md            # Experiment results
├── examples/                  # Example scripts
├── scripts/
│   └── repack.py              # Weight packing utilities
├── src/quanto/                # Main package
│   ├── __init__.py
│   ├── __main__.py            # CLI entry point
│   ├── constants.py           # Shared constants
│   ├── core/                  # Quantization engines
│   │   ├── base_quantizer.py
│   │   ├── auto_quantize.py
│   │   ├── layerwise_quant.py
│   │   ├── lazy_layerwise_quant.py
│   │   ├── iterative_quantizer.py
│   │   └── dequantize.py
│   ├── analysis/              # Layer analysis
│   │   ├── layer_analyzer.py
│   │   └── sensitivity_analyzer.py
│   ├── export/                # Export utilities
│   │   ├── hf_export.py
│   │   └── model_assembler.py
│   └── utils/                 # Shared utilities
│       ├── calibration.py
│       ├── int4_pack.py
│       ├── logging.py
│       ├── memory.py
│       └── model_utils.py
└── tests/                     # Test suite
```

## Usage

### Basic Usage
42 changes: 17 additions & 25 deletions src/quanto/__main__.py
@@ -14,39 +14,31 @@

 def main() -> int:
     """Main entry point that dispatches to quantize or dequantize."""
-    parser = argparse.ArgumentParser(
-        description="Quanto: LLM Quantization Tool",
-        add_help=False,
-    )
-
-    # Add --dequantize flag to detect mode
-    parser.add_argument("--dequantize", action="store_true", help="Run dequantization mode")
-    parser.add_argument("--help", "-h", action="store_true", help="Show help")
+    # Check if --dequantize is in args
+    if "--dequantize" in sys.argv:
+        from quanto.core.dequantize import main as dequant_main

-    # Parse known args to detect mode
-    args, remaining = parser.parse_known_args()
+        return dequant_main()

-    if args.help:
-        parser.print_help()
-        print("\nModes:")
+    # Show top-level help only when no args or just --help with no other flags
+    if len(sys.argv) <= 1 or (len(sys.argv) == 2 and sys.argv[1] in ("--help", "-h")):
+        print("usage: python -m quanto [--dequantize] [options]")
+        print()
+        print("Quanto: LLM Quantization Tool")
+        print()
+        print("Modes:")
         print(
-            " Quantization: python -m quanto --model_path ... --output_dir ... --precision int4"
+            " Quantization: python -m quanto --model_path ... --output_dir ... --precision mxfp4"
         )
         print(" Dequantization: python -m quanto --dequantize --model_path ... --output_dir ...")
+        print()
+        print("Run 'python -m quanto --model_path x --output_dir y --help' for full quantization options.")
         return 0

-    if args.dequantize:
-        # Run dequantization
-        from quanto.core.dequantize import main as dequant_main
-
-        # Add back --dequantize flag since dequantize module expects it
-        sys.argv = [sys.argv[0], "--dequantize"] + remaining
-        return dequant_main()
-    else:
-        # Run quantization
-        from quanto.core.auto_quantize import main as quant_main
+    # Default: quantization mode
+    from quanto.core.auto_quantize import main as quant_main

-        return quant_main()
+    return quant_main()


 if __name__ == "__main__":
17 changes: 16 additions & 1 deletion src/quanto/constants.py
@@ -46,6 +46,13 @@
"phi": "phi",
"phi3": "phi3",
"phi4": "phi3",
"solar_open": "qwen3_moe",
"exaone": "llama",
"exaone4_5": "llama",
"exaone4_5_text": "llama",
"exaone_moe": "qwen3_moe",
"kimi_k2": "kimi_k25",
"kimi_k25": "kimi_k25",
}

# Default layers to exclude from quantization
@@ -78,7 +85,15 @@

# Supported quantization algorithms
SUPPORTED_ALGORITHMS: list[str] = [
"rtn",
"awq",
"gptq",
"smoothquant",
]

# Algorithm-Precision support matrix
# Defines which precisions are supported for each quantization algorithm
ALGORITHM_PRECISION_SUPPORT: dict[str, list[str]] = {
"rtn": ["int4", "int4_64", "int4_32", "int8", "fp8", "mxfp4", "mxfp6", "uint4"],
"awq": ["int4", "int4_64", "int4_32"], # AWQ is INT4-only (activation-aware)
"gptq": ["int4"], # GPTQ is INT4-only (Hessian-based)
}
93 changes: 93 additions & 0 deletions src/quanto/core/auto_quantize.py
@@ -44,4 +44,97 @@
"QuantizationConfig",
"UnifiedQuantizer",
"UnifiedConfig",
"main",
]


def main() -> int:
    """CLI entry point for quantization."""
    import argparse
    import json
    import sys

    parser = argparse.ArgumentParser(
        description="Quanto: Quantize a model",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )

    # Required
    parser.add_argument("--model_path", required=True, help="HuggingFace model ID or local path")
    parser.add_argument("--output_dir", required=True, help="Output directory for quantized model")

    # Quantization settings
    parser.add_argument(
        "--precision",
        default="mxfp4",
        choices=["int4", "int4_64", "int4_32", "int8", "fp8", "mxfp4", "mxfp6", "uint4"],
        help="Target precision",
    )
    parser.add_argument("--algorithm", default="rtn", choices=["rtn", "awq", "gptq"], help="Quantization algorithm")
    parser.add_argument("--memory_strategy", default="auto", choices=["full", "layerwise_cpu", "lazy", "auto"])

    # Sensitivity analysis
    parser.add_argument("--sensitivity_analysis", action="store_true", help="Enable iterative sensitivity analysis")
    parser.add_argument("--sensitivity_threshold", type=float, default=0.0, help="Sensitivity threshold for layer exclusion")
    parser.add_argument(
        "--sensitivity_metric",
        type=str,
        default="relative",
        choices=["relative", "mse", "mae", "cosine", "kl"],
        help="Metric used to rank sensitive layers",
    )
    parser.add_argument("--max_iterations", type=int, default=10, help="Max iterations for sensitivity analysis")

    # Layer exclusion
    parser.add_argument("--exclude_layers", nargs="*", help="Layer name patterns to exclude from quantization")
    parser.add_argument("--exclude_layers_file", help="JSON file containing exclude layer list")

    # Calibration data
    parser.add_argument("--calibration_data", default="pileval", help="Calibration dataset name or path")
    parser.add_argument("--num_calib_samples", type=int, default=128, help="Number of calibration samples")
    parser.add_argument("--seq_len", type=int, default=512, help="Sequence length for calibration")

    # Other
    parser.add_argument("--device", default="cuda", help="Device (cuda, cuda:0, cpu)")
    parser.add_argument("--trust_remote_code", action="store_true", default=True)
    parser.add_argument("--no_trust_remote_code", action="store_true", help="Disable trust_remote_code")
    parser.add_argument("--skip_evaluation", action="store_true", help="Skip perplexity evaluation")
    parser.add_argument("--sensitivity_cache_on_gpu", action="store_true", default=True)

    args = parser.parse_args()

    # Handle exclude_layers from file
    exclude_layers = args.exclude_layers
    if args.exclude_layers_file:
        with open(args.exclude_layers_file) as f:
            exclude_layers = json.load(f)

    config = UnifiedConfig(
        model_path=args.model_path,
        output_dir=args.output_dir,
        precision=args.precision,
        algorithm=args.algorithm,
        memory_strategy=args.memory_strategy,
        sensitivity_analysis=args.sensitivity_analysis,
        sensitivity_threshold=args.sensitivity_threshold,
        sensitivity_metric=args.sensitivity_metric,
        max_iterations=args.max_iterations,
        exclude_layers=exclude_layers,
        calibration_data=args.calibration_data,
        num_calib_samples=args.num_calib_samples,
        seq_len=args.seq_len,
        device=args.device,
        trust_remote_code=not args.no_trust_remote_code,
        skip_evaluation=args.skip_evaluation,
        sensitivity_cache_on_gpu=args.sensitivity_cache_on_gpu,
    )

    quantizer = UnifiedQuantizer(config)
    result = quantizer.run()

    if result.success:
        print(json.dumps(result.to_dict(), indent=2))
        return 0
    else:
        print(f"FAILED: {result.error_message}", file=sys.stderr)
        return 1