FlexGuard addresses a practical deployment gap in LLM moderation: enforcement strictness (how conservatively “unsafe” is defined) varies across products and evolves over time, making fixed binary moderators brittle under strictness shifts.
To enable controlled evaluation in this setting, we introduce FlexBench, a benchmark with prompt and response moderation subsets, annotated with risk categories and 5-tier severity (BENIGN / LOW / MODERATE / HIGH / EXTREME). These severity tiers induce three strictness regimes (strict / moderate / loose), allowing direct measurement of robustness and consistency under policy changes.
Building on FlexBench’s findings that existing moderators show substantial cross-strictness inconsistency, we propose FlexGuard, an LLM-based moderator that outputs a calibrated continuous risk score (0–100) aligned with severity, and supports strictness-specific decisions via simple thresholding. FlexGuard is trained with rubric-guided score distillation and a two-stage risk-alignment strategy (SFT warm-up + GRPO), along with practical thresholding strategies for deployment.
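Because FlexGuard emits a continuous risk score rather than a binary verdict, each strictness regime reduces to a single threshold on that score. A minimal sketch of this thresholding step; the threshold values below are illustrative assumptions, not the calibrated values used by the paper or repo:

```python
# Illustrative thresholds on FlexGuard's 0-100 risk score; the actual
# calibrated per-regime thresholds may differ (see the paper/repo).
ASSUMED_THRESHOLDS = {"strict": 20, "moderate": 50, "loose": 80}

def decide(risk_score: float, strictness: str) -> str:
    """Map a continuous risk score to a binary moderation decision
    under the given strictness regime."""
    return "unsafe" if risk_score >= ASSUMED_THRESHOLDS[strictness] else "safe"

# The same score yields different decisions under different regimes:
print(decide(35, "strict"))    # unsafe
print(decide(35, "moderate"))  # safe
```

A policy change thus becomes a threshold change, with no retraining of the moderator.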
```shell
mkdir -p logs results ckpt
pip3 install -e .
pip install -r requirements.txt
```

Load the released checkpoint with 🤗 Transformers:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

repo_id = "Tommy-DING/FlexGuard-Qwen3-8B"
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)
```

Please refer to the Hugging Face model card for the recommended inference prompt format and usage notes.
FlexBench is available on Hugging Face Datasets:
https://huggingface.co/datasets/Tommy-DING/FlexBench
The dataset includes prompt and response subsets (each with valid and test splits). Since the repo stores raw CSV files, load each subset with data_files:
```python
from datasets import load_dataset

# Prompt subset
ds_prompt = load_dataset(
    "Tommy-DING/FlexBench",
    data_files={
        "valid": "Guard_prompt_valid.csv",
        "test": "Guard_prompt_test.csv",
    },
)

# Response subset
ds_response = load_dataset(
    "Tommy-DING/FlexBench",
    data_files={
        "valid": "Guard_response_valid.csv",
        "test": "Guard_response_test.csv",
    },
)

# Example: strict policy on the prompt test split
y = ds_prompt["test"]["label_strict"]

# Or moderate / loose:
y_mod = ds_prompt["test"]["label_moderate"]
y_loose = ds_prompt["test"]["label_loose"]
```

| Field | Prompt | Response |
|---|---|---|
| Risk severity | | |
| Total | 2000 | 2000 |
| BENIGN | 1000 | 1000 |
| LOW | 250 | 250 |
| MODERATE | 250 | 250 |
| HIGH | 250 | 250 |
| EXTREME | 250 | 250 |
| Category | | |
| SAFE | 1000 | 1000 |
| VIO | 194 | 239 |
| ILG | 146 | 453 |
| SEX | 130 | 38 |
| INF | 61 | 77 |
| DIS | 282 | 211 |
| MIS | 62 | 93 |
| JAIL | 130 | 5 |
| Data source | | |
| Aegis2.0 | 286 | 63 |
| XSTest | 83 | 259 |
| BeaverTails | 0 | 370 |
| HarmBench | 0 | 84 |
| OpenAI | 497 | 0 |
| SafeRLHF | 0 | 894 |
| ToxicChat | 769 | 0 |
| WildGuard | 365 | 330 |
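The three policy labels (`label_strict` / `label_moderate` / `label_loose`) make it possible to check how coherent a moderator's binary decisions are across regimes. A minimal sketch with toy predictions; the monotonicity criterion below is an illustrative assumption, not necessarily FlexBench's exact consistency metric:

```python
# Toy cross-strictness consistency check: anything flagged under the
# loose policy should also be flagged under moderate and strict
# (1 = flagged unsafe, 0 = safe). This monotonicity criterion is an
# illustrative assumption, not necessarily the exact FlexBench metric.
def consistency(strict_preds, moderate_preds, loose_preds):
    ok = sum(
        1
        for s, m, l in zip(strict_preds, moderate_preds, loose_preds)
        if s >= m >= l
    )
    return ok / len(strict_preds)

# 4 toy items; item 3 is inconsistent (flagged loose but not strict).
print(consistency([1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 0]))  # 0.75
```

A fixed binary moderator scores 1.0 here only if its decisions happen to nest correctly across regimes; a score-plus-threshold moderator like FlexGuard is monotone by construction.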
Follow the backend model instructions in the scripts/configs used by this repo.
```shell
accelerate launch --multi_gpu train_guard_prob_linear_cot.py  # stage 1: SFT warm-up
python merge.py
bash main_grpo_guard.sh                                       # stage 2: GRPO
```

We build upon excellent open-source tools:
```bibtex
@misc{ding2026flexguard,
  title={FlexGuard: Continuous Risk Scoring for Strictness-Adaptive LLM Content Moderation},
  author={Zhihao Ding and Jinming Li and Ze Lu and Jieming Shi},
  year={2026},
  eprint={2602.23636},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2602.23636},
}
```

- Issues / questions: please open a GitHub issue.

