
🚀 FLAT-LLM

This is the official PyTorch implementation of FLAT-LLM: Fine-grained Low-rank Activation Space Transformation for Large Language Model Compression (arXiv:2505.23966) [Findings of EACL 2026].


📦 Environment Setup

Installation instructions can be found in INSTALL.md.


✅ Checklist

  • Throughput/Memory Efficiency Evaluation
  • Zero-shot/Few-shot Downstream Task Evaluation
  • Head-wise QK Support
  • Model Support: Llama-2, Llama-3, Mistral
  • Multi-GPU Support: Llama-2 70B
  • Post-pruning Quantization support: GPTQ

🛠️ Run the Code

All scripts for reproducing our main results (Table 1) are available in the scripts directory.

🔍 Importance-Preserving Rank Selection (IPRS)

  1. Run llama_bi.sh to compute decoder-wise importance scores.
  2. Run compute_rank.py to:
    • Allocate ranks with our IPRS algorithm according to the importance scores.
    • Compute the compression ratio for the V, O, and MLP layers (Q and K are not pruned) from the required total compression ratio. A sketch of the allocation step follows this list.
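The exact allocation lives in compute_rank.py; below is a minimal sketch of the idea, assuming per-decoder importance scores have already been produced by llama_bi.sh. The function and variable names are illustrative, not the repo's API:

```python
import numpy as np

def allocate_ranks(scores, full_rank, total_ratio):
    """Toy importance-proportional rank allocation (illustrative only).

    scores      : per-decoder importance scores, shape (num_layers,)
    full_rank   : rank of the unpruned activation space per decoder
    total_ratio : fraction of the total rank budget to keep, e.g. 0.7
    """
    scores = np.asarray(scores, dtype=np.float64)
    budget = total_ratio * full_rank * len(scores)  # total ranks to keep
    ranks = budget * scores / scores.sum()          # proportional split
    return np.clip(np.round(ranks).astype(int), 1, full_rank)

# Example: 4 decoders, full rank 4096, keep 70% of the total budget.
print(allocate_ranks([0.9, 1.2, 0.7, 1.0], full_rank=4096, total_ratio=0.7))
```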

✂️ FLAT-LLM Pruning

Run one of the following scripts to prune the corresponding model and evaluate its perplexity (PPL):

  • llama_7b.sh # uses 1 A100 40GB
  • llama_13b.sh # uses 1 A100 40GB
  • llama_70b.sh # uses 4 A100 40GB
  • mistral.sh # uses 1 A100 40GB

These reproduce the perplexity results reported in Table 1 of the paper when using wikitext2 for calibration.
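For reference, wikitext2 perplexity is conventionally computed by sliding a fixed-length window over the concatenated test set and averaging token-level negative log-likelihood. A minimal sketch of that loop (the model path is a placeholder; this is not the repo's evaluation code):

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/pruned-model"  # placeholder: your pruned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto"
).eval()

test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
ids = tokenizer("\n\n".join(test["text"]), return_tensors="pt").input_ids

seqlen, nlls = 2048, []
for i in range(0, ids.size(1) - seqlen, seqlen):
    chunk = ids[:, i : i + seqlen].to(model.device)
    with torch.no_grad():
        # labels=chunk makes HF return mean cross-entropy over the window
        nlls.append(model(chunk, labels=chunk).loss.float() * seqlen)
ppl = torch.exp(torch.stack(nlls).sum() / (len(nlls) * seqlen))
print(f"wikitext2 PPL: {ppl.item():.2f}")
```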

🔢 FLAT-LLM Post-pruning Quantization

Run scripts/gptq.sh to load the pruned FLAT-LLM model and quantize it with GPTQ.
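The script wraps the repo's own GPTQ pipeline. As one illustration of the same flow, the standalone AutoGPTQ library (a swapped-in example, not what gptq.sh necessarily uses) quantizes a pruned checkpoint like this; paths and the calibration text are placeholders:

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

pruned_path = "path/to/pruned-flat-llm"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(pruned_path)

# 4-bit weights with group-wise quantization, per AutoGPTQ's usage pattern
quantize_config = BaseQuantizeConfig(bits=4, group_size=128)
model = AutoGPTQForCausalLM.from_pretrained(pruned_path, quantize_config)

examples = [tokenizer("Some calibration text for GPTQ.")]  # placeholder sample
model.quantize(examples)
model.save_quantized("path/to/quantized-model")
```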


🔧 Command-Line Arguments

📦 Model and Dataset

  • --model: Name or path of the LLM to prune. Choices: meta-llama/Llama-2-7b-hf, meta-llama/Llama-2-13b-hf, meta-llama/Llama-2-70b-hf, mistralai/Mistral-7B-v0.1
  • --dataset: Calibration dataset. Choices: wikitext2, c4, alpaca.
  • --cache_dir: Directory to cache model weights.

⚙️ Pruning Configuration

  • --prune_method: Pruning stage. Options:
    • bi: Rank allocation via importance scores.
    • flatllm: Final pruning using head-wise PCA.
  • --sparsity_ratio: Target sparsity level (as an integer percentage).
  • --tol: Tolerance threshold on cumulative eigenvalues. Default: 0.96. This hyperparameter is only used to monitor the calibration; it is not used by the algorithm itself (see the sketch after this list).
  • --bi_score: Path to save/load the importance scores/allocated ranks.
  • --seed: Random seed for reproducibility.
  • --nsamples: Number of calibration samples.
  • --save: Path to save logs.
  • --save_model: Path to save the pruned model.
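To make the --tol semantics concrete, here is an illustrative snippet showing how a cumulative-eigenvalue threshold maps to a retained rank (a sketch, not the repo's code):

```python
import torch

def rank_at_tolerance(eigvals, tol=0.96):
    """Smallest rank whose eigenvalues capture >= tol of the total energy.

    eigvals : 1-D tensor of eigenvalues sorted in descending order.
    """
    energy = torch.cumsum(eigvals, dim=0) / eigvals.sum()
    return int((energy < tol).sum().item()) + 1  # 1-based rank

eigvals = torch.sort(torch.rand(128), descending=True).values
print(rank_at_tolerance(eigvals, tol=0.96))
```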

📊 Evaluation

🧠 Zero-Shot Evaluation

We evaluate zero-shot downstream task performance using the EleutherAI LM Evaluation Harness. Run scripts/eval.sh to evaluate zero-shot / few-shot downstream accuracy.
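A minimal way to drive the harness directly from Python (assuming lm-eval >= 0.4; the pruned-model path and task list are placeholders):

```python
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=path/to/pruned-flat-llm,dtype=float16",
    tasks=["arc_easy", "hellaswag", "winogrande"],
    num_fewshot=0,  # set > 0 for few-shot evaluation
)
print(results["results"])
```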

⚡ Inference Speedup

To benchmark inference speedup, we build on the evaluation framework from SliceGPT. Run scripts/test_speedup.sh to evaluate inference throughput and CUDA memory usage.
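In the same spirit, a minimal throughput and peak-memory probe looks like this (a generic PyTorch sketch, not the SliceGPT benchmark itself; the model path is a placeholder):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/pruned-flat-llm"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16
).cuda().eval()

inputs = tokenizer("The quick brown fox", return_tensors="pt").to("cuda")
torch.cuda.reset_peak_memory_stats()
torch.cuda.synchronize()

start = time.perf_counter()
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = out.shape[1] - inputs.input_ids.shape[1]
print(f"throughput: {new_tokens / elapsed:.1f} tokens/s")
print(f"peak CUDA memory: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")
```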


Citation

If you find FLAT-LLM useful for your research and applications, please cite it using this BibTeX entry:

@article{tian2025flat,
  title={FLAT-LLM: Fine-grained Low-rank Activation Space Transformation for Large Language Model Compression},
  author={Tian, Jiayi and Solgi, Ryan and Lu, Jinming and Yang, Yifan and Li, Hai and Zhang, Zheng},
  journal={arXiv preprint arXiv:2505.23966},
  year={2025}
}

📄 License

This project is licensed under the MIT License.
