FlexGuard addresses a practical deployment gap in LLM moderation: enforcement strictness (how conservatively “unsafe” is defined) varies across products and evolves over time, making fixed binary moderators brittle under strictness shifts.
To enable controlled evaluation in this setting, we introduce FlexBench, a benchmark with prompt and response moderation subsets, annotated with risk categories and 5-tier severity (BENIGN / LOW / MODERATE / HIGH / EXTREME). These severity tiers induce three strictness regimes (strict / moderate / loose), allowing direct measurement of robustness and consistency under policy changes.
Building on FlexBench’s findings that existing moderators show substantial cross-strictness inconsistency, we propose FlexGuard, an LLM-based moderator that outputs a calibrated continuous risk score (0–100) aligned with severity, and supports strictness-specific decisions via simple thresholding. FlexGuard is trained with rubric-guided score distillation and a two-stage risk-alignment strategy (SFT warm-up + GRPO), along with practical thresholding strategies for deployment.
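Because FlexGuard emits a continuous risk score rather than a binary verdict, each strictness regime reduces to a single threshold on that score. A minimal sketch of this thresholding step; the threshold values below are illustrative assumptions, not the calibrated values used by the paper or repo:

```python
# Illustrative thresholds on FlexGuard's 0-100 risk score; the actual
# calibrated per-regime thresholds may differ (see the paper/repo).
ASSUMED_THRESHOLDS = {"strict": 20, "moderate": 50, "loose": 80}

def decide(risk_score: float, strictness: str) -> str:
    """Map a continuous risk score to a binary moderation decision
    under the given strictness regime."""
    return "unsafe" if risk_score >= ASSUMED_THRESHOLDS[strictness] else "safe"

# The same score yields different decisions under different regimes:
print(decide(35, "strict"))    # unsafe
print(decide(35, "moderate"))  # safe
```

A policy change thus becomes a threshold change, with no retraining of the moderator.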
```shell
mkdir -p logs results ckpt
pip3 install -e .
pip install -r requirements.txt
```

Load the released checkpoint with 🤗 Transformers:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

repo_id = "Tommy-DING/FlexGuard-Qwen3-8B"
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)
```

Please refer to the Hugging Face model card for the recommended inference prompt format and usage notes.
FlexBench is available on Hugging Face Datasets:
https://huggingface.co/datasets/Tommy-DING/FlexBench
The dataset includes prompt and response subsets (each with valid and test splits). Since the repo stores raw CSV files, load each subset with data_files:
```python
from datasets import load_dataset

# Prompt subset
ds_prompt = load_dataset(
    "Tommy-DING/FlexBench",
    data_files={
        "valid": "Guard_prompt_valid.csv",
        "test": "Guard_prompt_test.csv",
    },
)

# Response subset
ds_response = load_dataset(
    "Tommy-DING/FlexBench",
    data_files={
        "valid": "Guard_response_valid.csv",
        "test": "Guard_response_test.csv",
    },
)

# Example: strict policy on the prompt test split
y = ds_prompt["test"]["label_strict"]

# Or moderate / loose:
y_mod = ds_prompt["test"]["label_moderate"]
y_loose = ds_prompt["test"]["label_loose"]
```

| Field | Prompt | Response |
|---|---|---|
| Risk severity | | |
| Total | 2000 | 2000 |
| BENIGN | 1000 | 1000 |
| LOW | 250 | 250 |
| MODERATE | 250 | 250 |
| HIGH | 250 | 250 |
| EXTREME | 250 | 250 |
| Category | | |
| SAFE | 1000 | 1000 |
| VIO | 194 | 239 |
| ILG | 146 | 453 |
| SEX | 130 | 38 |
| INF | 61 | 77 |
| DIS | 282 | 211 |
| MIS | 62 | 93 |
| JAIL | 130 | 5 |
| Data source | | |
| Aegis2.0 | 286 | 63 |
| XSTest | 83 | 259 |
| BeaverTails | 0 | 370 |
| HarmBench | 0 | 84 |
| OpenAI | 497 | 0 |
| SafeRLHF | 0 | 894 |
| ToxicChat | 769 | 0 |
| WildGuard | 365 | 330 |
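The three policy labels (`label_strict` / `label_moderate` / `label_loose`) make it possible to check how coherent a moderator's binary decisions are across regimes. A minimal sketch with toy predictions; the monotonicity criterion below is an illustrative assumption, not necessarily FlexBench's exact consistency metric:

```python
# Toy cross-strictness consistency check: anything flagged under the
# loose policy should also be flagged under moderate and strict
# (1 = flagged unsafe, 0 = safe). This monotonicity criterion is an
# illustrative assumption, not necessarily the exact FlexBench metric.
def consistency(strict_preds, moderate_preds, loose_preds):
    ok = sum(
        1
        for s, m, l in zip(strict_preds, moderate_preds, loose_preds)
        if s >= m >= l
    )
    return ok / len(strict_preds)

# 4 toy items; item 3 is inconsistent (flagged loose but not strict).
print(consistency([1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 0]))  # 0.75
```

A fixed binary moderator scores 1.0 here only if its decisions happen to nest correctly across regimes; a score-plus-threshold moderator like FlexGuard is monotone by construction.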
Follow the backend model instructions in the scripts/configs used by this repo.
```shell
accelerate launch --multi_gpu train_guard_prob_linear_cot.py  # stage 1: SFT warm-up
python merge.py
bash main_grpo_guard.sh                                       # stage 2: GRPO
```

We build upon excellent open-source tools:
```bibtex
@misc{ding2026flexguard,
  title={FlexGuard: Continuous Risk Scoring for Strictness-Adaptive LLM Content Moderation},
  author={Zhihao Ding and Jinming Li and Ze Lu and Jieming Shi},
  year={2026},
  eprint={2602.23636},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2602.23636},
}
```

- Issues / questions: please open a GitHub issue.

