This is the official PyTorch implementation of **FLAT-LLM: Fine-grained Low-rank Activation Space Transformation for Large Language Model Compression** [[arXiv](https://arxiv.org/abs/2505.23966)] (Findings of EACL 2026).
Installation instructions can be found in INSTALL.md.
- Throughput/Memory Efficiency Evaluation
- Zero-shot/Few-shot Downstream Task Evaluation
- Head-wise QK Support
- Model Support: Llama-2, Llama-3, Mistral
- Multi-GPU Support: Llama-2 70B
- Post-pruning Quantization support: GPTQ
All scripts for reproducing our main results (Table 1) are available in the scripts directory.
- Run `llama_bi.sh` to compute decoder-wise importance scores.
- Run `compute_rank.py` to:
  - Allocate ranks with our IPRS algorithm according to the importance scores.
  - Compute the compression ratio for the V, O, and MLP layers (Q and K are not pruned) according to the required total compression ratio.
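As an illustration of the last step, the sketch below shows the arithmetic under the simplifying assumption that the compression removed from the model must come entirely from the prunable (V, O, MLP) layers, since Q and K stay dense. The function name and accounting are hypothetical; the exact computation lives in `compute_rank.py`.

```python
def per_layer_compression(total_ratio, prunable_frac):
    """Compression ratio the prunable (V, O, MLP) layers must reach so the
    whole model hits `total_ratio`, given that those layers hold
    `prunable_frac` of all parameters and Q/K are kept dense.
    Illustrative sketch only, not the repository's implementation."""
    # All removed parameters must come out of the prunable fraction,
    # so the per-layer ratio is scaled up accordingly.
    required = total_ratio / prunable_frac
    if required > 1.0:
        raise ValueError("target unreachable without pruning Q/K")
    return required

# Example: a 20% total compression target when V/O/MLP hold ~85% of
# the parameters requires pruning those layers by about 23.5%.
print(round(per_layer_compression(0.20, 0.85), 4))
```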
Run one of the following scripts to prune the corresponding model and evaluate its perplexity (PPL):

- `llama_7b.sh` (1 A100 40GB)
- `llama_13b.sh` (1 A100 40GB)
- `llama_70b.sh` (4 A100 40GB)
- `mistral.sh` (1 A100 40GB)
These reproduce the perplexity results reported in Table 1 of the paper when using wikitext2 for calibration.
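The final pruning stage uses head-wise PCA on activations. The NumPy sketch below shows the core idea for a single attention head: build a low-rank principal basis from calibration activations and project onto it. Function names, shapes, and the overall flow are illustrative assumptions, not the repository's API.

```python
import numpy as np

def headwise_pca_basis(acts, rank):
    """Rank-`rank` PCA basis for one attention head's activations.
    `acts` has shape (n_tokens, head_dim); returns U_r of shape
    (head_dim, rank).  Illustrative sketch, not FLAT-LLM's code."""
    # Covariance of this head's calibration activations
    cov = acts.T @ acts / acts.shape[0]
    # eigh returns eigenvalues in ascending order for symmetric matrices
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Keep the top-`rank` principal directions
    return eigvecs[:, ::-1][:, :rank]

rng = np.random.default_rng(0)
acts = rng.normal(size=(512, 64))      # 512 tokens, head_dim = 64
U = headwise_pca_basis(acts, rank=16)
low_rank_acts = acts @ U @ U.T         # rank-16 reconstruction
print(U.shape)                         # (64, 16)
```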
Run `scripts/gptq.sh` to load the pruned FLAT-LLM model and quantize it with GPTQ.
- `--model`: Name or path of the LLM to prune. Choices: `meta-llama/Llama-2-7b-hf`, `meta-llama/Llama-2-13b-hf`, `meta-llama/Llama-2-70b-hf`, `mistralai/Mistral-7B-v0.1`.
- `--dataset`: Calibration dataset. Choices: `wikitext2`, `c4`, `alpaca`.
- `--cache_dir`: Directory to cache model weights.
- `--prune_method`: Pruning stage. Options: `bi` (rank allocation via importance scores) and `flatllm` (final pruning using head-wise PCA).
- `--sparsity_ratio`: Target sparsity level, as an integer percentage.
- `--tol`: Tolerance threshold on cumulative eigenvalues. Default: `0.96`. (This hyperparameter is only used to monitor calibration; it is not used by the algorithm.)
- `--bi_score`: Path to save/load the importance scores and allocated ranks.
- `--seed`: Random seed for reproducibility.
- `--nsamples`: Number of calibration samples.
- `--save`: Path to save logs.
- `--save_model`: Path to save the pruned model.
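To make the role of `--tol` concrete: it checks how much of the cumulative eigenvalue mass the leading directions capture during calibration. A minimal sketch of that check (illustrative only; as noted above, this value monitors calibration and is not part of the algorithm):

```python
import numpy as np

def rank_for_tol(eigvals, tol=0.96):
    """Smallest rank whose leading eigenvalues capture at least `tol`
    of the total eigenvalue mass.  `eigvals` must be sorted in
    descending order.  Illustrative monitoring check only."""
    frac = np.cumsum(eigvals) / np.sum(eigvals)
    # First index where the cumulative fraction reaches `tol`
    return int(np.searchsorted(frac, tol) + 1)

eigvals = np.array([5.0, 3.0, 1.0, 0.7, 0.3])  # total mass = 10
print(rank_for_tol(eigvals, tol=0.96))         # 4: top-4 capture 9.7/10
```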
We evaluate zero-shot downstream task performance using the EleutherAI LM Harness. Run `scripts/eval.sh` to evaluate zero-shot/few-shot downstream accuracy.
To benchmark inference speedup, we build upon the evaluation framework from SliceGPT. Run `scripts/test_speedup.sh` to evaluate inference throughput and CUDA memory usage.
If you find FLAT-LLM useful for your research and applications, please kindly cite using this BibTeX:
@article{tian2025flat,
title={FLAT-LLM: Fine-grained Low-rank Activation Space Transformation for Large Language Model Compression},
author={Tian, Jiayi and Solgi, Ryan and Lu, Jinming and Yang, Yifan and Li, Hai and Zhang, Zheng},
journal={arXiv preprint arXiv:2505.23966},
year={2025}
}
This project is licensed under the MIT License.