This is the official PyTorch implementation of **FLAT-LLM: Fine-grained Low-rank Activation Space Transformation for Large Language Model Compression** [[arXiv](https://arxiv.org/abs/2505.23966)] (Findings of EACL 2026).
Installation instructions can be found in INSTALL.md.
- Throughput/Memory Efficiency Evaluation
- Zero-shot/Few-shot Downstream Task Evaluation
- Head-wise QK Support
- Model Support: Llama-2, Llama-3, Mistral
- Multi-GPU Support: Llama-2 70B
- Post-pruning Quantization support: GPTQ
All scripts for reproducing our main results (Table 1) are available in the scripts directory.
- Run `llama_bi.sh` to compute decoder-wise importance scores.
- Run `compute_rank.py` to:
  - Allocate ranks with our IPRS algorithm according to the importance scores.
  - Compute the compression ratio for the V, O, and MLP layers (Q and K are not pruned) according to the required total compression ratio.
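As an illustration of the last step, the sketch below shows the arithmetic under the simplifying assumption that the compression removed from the model must come entirely from the prunable (V, O, MLP) layers, since Q and K stay dense. The function name and accounting are hypothetical; the exact computation lives in `compute_rank.py`.

```python
def per_layer_compression(total_ratio, prunable_frac):
    """Compression ratio the prunable (V, O, MLP) layers must reach so the
    whole model hits `total_ratio`, given that those layers hold
    `prunable_frac` of all parameters and Q/K are kept dense.
    Illustrative sketch only, not the repository's implementation."""
    # All removed parameters must come out of the prunable fraction,
    # so the per-layer ratio is scaled up accordingly.
    required = total_ratio / prunable_frac
    if required > 1.0:
        raise ValueError("target unreachable without pruning Q/K")
    return required

# Example: a 20% total compression target when V/O/MLP hold ~85% of
# the parameters requires pruning those layers by about 23.5%.
print(round(per_layer_compression(0.20, 0.85), 4))
```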
Run one of the following scripts to prune the corresponding model and evaluate its perplexity (PPL):

- `llama_7b.sh` (1 A100 40GB)
- `llama_13b.sh` (1 A100 40GB)
- `llama_70b.sh` (4 A100 40GB)
- `mistral.sh` (1 A100 40GB)
These reproduce the perplexity results reported in Table 1 of the paper when using wikitext2 for calibration.
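The final pruning stage uses head-wise PCA on activations. The NumPy sketch below shows the core idea for a single attention head: build a low-rank principal basis from calibration activations and project onto it. Function names, shapes, and the overall flow are illustrative assumptions, not the repository's API.

```python
import numpy as np

def headwise_pca_basis(acts, rank):
    """Rank-`rank` PCA basis for one attention head's activations.
    `acts` has shape (n_tokens, head_dim); returns U_r of shape
    (head_dim, rank).  Illustrative sketch, not FLAT-LLM's code."""
    # Covariance of this head's calibration activations
    cov = acts.T @ acts / acts.shape[0]
    # eigh returns eigenvalues in ascending order for symmetric matrices
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Keep the top-`rank` principal directions
    return eigvecs[:, ::-1][:, :rank]

rng = np.random.default_rng(0)
acts = rng.normal(size=(512, 64))      # 512 tokens, head_dim = 64
U = headwise_pca_basis(acts, rank=16)
low_rank_acts = acts @ U @ U.T         # rank-16 reconstruction
print(U.shape)                         # (64, 16)
```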
Run `scripts/gptq.sh` to load the pruned FLAT-LLM model and quantize it with GPTQ.
- `--model`: Name or path of the LLM to prune. Choices: `meta-llama/Llama-2-7b-hf`, `meta-llama/Llama-2-13b-hf`, `meta-llama/Llama-2-70b-hf`, `mistralai/Mistral-7B-v0.1`.
- `--dataset`: Calibration dataset. Choices: `wikitext2`, `c4`, `alpaca`.
- `--cache_dir`: Directory to cache model weights.
- `--prune_method`: Pruning stage. Options: `bi` (rank allocation via importance scores) and `flatllm` (final pruning using head-wise PCA).
- `--sparsity_ratio`: Target sparsity level, as an integer percentage.
- `--tol`: Tolerance threshold on cumulative eigenvalues. Default: `0.96`. (This hyperparameter is only used to monitor calibration; it is not used by the algorithm.)
- `--bi_score`: Path to save/load the importance scores and allocated ranks.
- `--seed`: Random seed for reproducibility.
- `--nsamples`: Number of calibration samples.
- `--save`: Path to save logs.
- `--save_model`: Path to save the pruned model.
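To make the role of `--tol` concrete: it checks how much of the cumulative eigenvalue mass the leading directions capture during calibration. A minimal sketch of that check (illustrative only; as noted above, this value monitors calibration and is not part of the algorithm):

```python
import numpy as np

def rank_for_tol(eigvals, tol=0.96):
    """Smallest rank whose leading eigenvalues capture at least `tol`
    of the total eigenvalue mass.  `eigvals` must be sorted in
    descending order.  Illustrative monitoring check only."""
    frac = np.cumsum(eigvals) / np.sum(eigvals)
    # First index where the cumulative fraction reaches `tol`
    return int(np.searchsorted(frac, tol) + 1)

eigvals = np.array([5.0, 3.0, 1.0, 0.7, 0.3])  # total mass = 10
print(rank_for_tol(eigvals, tol=0.96))         # 4: top-4 capture 9.7/10
```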
We evaluate zero-shot downstream task performance using the EleutherAI LM Harness. Run `scripts/eval.sh` to evaluate zero-shot/few-shot downstream accuracy.
To benchmark inference speedup, we build upon the evaluation framework from SliceGPT. Run `scripts/test_speedup.sh` to evaluate inference throughput and CUDA memory usage.
If you find FLAT-LLM useful for your research and applications, please kindly cite using this BibTeX:
@article{tian2025flat,
title={FLAT-LLM: Fine-grained Low-rank Activation Space Transformation for Large Language Model Compression},
author={Tian, Jiayi and Solgi, Ryan and Lu, Jinming and Yang, Yifan and Li, Hai and Zhang, Zheng},
journal={arXiv preprint arXiv:2505.23966},
year={2025}
}
This project is licensed under the MIT License.