ByteDance · The Hong Kong Polytechnic University (PolyU)

FlexGuard: Continuous Risk Scoring for Strictness-Adaptive LLM Content Moderation

🤗 FlexGuard  |  🤗 FlexBench  |  📑 Paper



👋 Overview

FlexGuard addresses a practical deployment gap in LLM moderation: enforcement strictness (how conservatively “unsafe” is defined) varies across products and evolves over time, making fixed binary moderators brittle under strictness shifts.

To enable controlled evaluation in this setting, we introduce FlexBench, a benchmark with prompt and response moderation subsets, annotated with risk categories and 5-tier severity (BENIGN / LOW / MODERATE / HIGH / EXTREME). These severity tiers induce three strictness regimes (strict / moderate / loose), allowing direct measurement of robustness and consistency under policy changes.
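To make the tier-to-regime relationship concrete, here is a minimal sketch of how 5-tier severity annotations can induce binary labels under three strictness regimes. The specific cutoff tiers chosen for each regime are illustrative assumptions, not taken from the paper:

```python
# Ordered severity tiers, from most benign to most severe (as in FlexBench).
SEVERITY_TIERS = ["BENIGN", "LOW", "MODERATE", "HIGH", "EXTREME"]

# ASSUMED cutoffs: the strict regime flags anything above BENIGN, while the
# loose regime flags only HIGH and EXTREME. The paper's exact mapping may differ.
REGIME_CUTOFF = {"strict": "LOW", "moderate": "MODERATE", "loose": "HIGH"}

def label(severity: str, regime: str) -> int:
    """Return 1 (unsafe) if the severity tier reaches the regime's cutoff."""
    return int(SEVERITY_TIERS.index(severity) >= SEVERITY_TIERS.index(REGIME_CUTOFF[regime]))

print(label("MODERATE", "strict"))  # 1: flagged under a strict policy
print(label("MODERATE", "loose"))   # 0: tolerated under a loose policy
```

The same annotated example can thus flip between safe and unsafe as the regime changes, which is exactly the inconsistency FlexBench is designed to measure.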

Building on FlexBench’s findings that existing moderators show substantial cross-strictness inconsistency, we propose FlexGuard, an LLM-based moderator that outputs a calibrated continuous risk score (0–100) aligned with severity, and supports strictness-specific decisions via simple thresholding. FlexGuard is trained with rubric-guided score distillation and a two-stage risk-alignment strategy (SFT warm-up + GRPO), along with practical thresholding strategies for deployment.
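The thresholding idea can be sketched in a few lines: one continuous risk score serves every strictness regime, and only the decision threshold changes. The threshold values below are illustrative assumptions; a real deployment would calibrate them on a validation split:

```python
# ASSUMED per-regime thresholds over FlexGuard's 0-100 risk score;
# actual deployments should calibrate these on held-out data.
THRESHOLDS = {"strict": 20, "moderate": 50, "loose": 80}

def decide(risk_score: float, regime: str = "moderate") -> str:
    """Flag content whose risk score meets the regime's threshold."""
    return "unsafe" if risk_score >= THRESHOLDS[regime] else "safe"

print(decide(35, "strict"))  # unsafe
print(decide(35, "loose"))   # safe
```

Because the score itself is fixed, switching policies requires no retraining: only the threshold moves.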

FlexBench & FlexGuard Overview

🚀 Requirements

```shell
mkdir -p logs results ckpt
pip3 install -e .
pip install -r requirements.txt
```

🤗 FlexGuard Model

Load the released checkpoint with 🤗 Transformers:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

repo_id = "Tommy-DING/FlexGuard-Qwen3-8B"
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)
```

Please refer to the Hugging Face model card for the recommended inference prompt format and usage notes.


📊 FlexBench Dataset

FlexBench is available on Hugging Face Datasets:
https://huggingface.co/datasets/Tommy-DING/FlexBench

Loading

The dataset includes prompt and response subsets (each with valid and test splits). Since the repo stores raw CSV files, load each subset with data_files:

```python
from datasets import load_dataset

# Prompt subset
ds_prompt = load_dataset(
    "Tommy-DING/FlexBench",
    data_files={
        "valid": "Guard_prompt_valid.csv",
        "test": "Guard_prompt_test.csv",
    },
)

# Response subset
ds_response = load_dataset(
    "Tommy-DING/FlexBench",
    data_files={
        "valid": "Guard_response_valid.csv",
        "test": "Guard_response_test.csv",
    },
)
```

Evaluating strictness settings

```python
# Example: strict policy on prompt test split
y = ds_prompt["test"]["label_strict"]

# Or moderate / loose:
y_mod = ds_prompt["test"]["label_moderate"]
y_loose = ds_prompt["test"]["label_loose"]
```
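With the per-regime label columns in hand, a single set of risk scores can be evaluated under all three policies at once. The scores, labels, and thresholds below are toy values for illustration; in practice the labels come from the `label_strict` / `label_moderate` / `label_loose` columns loaded above:

```python
# Toy example: one set of hypothetical risk scores evaluated against
# each strictness regime's binary labels via per-regime thresholds.
scores = [5, 35, 60, 95]  # hypothetical FlexGuard risk scores in [0, 100]
labels = {
    "strict": [0, 1, 1, 1],
    "moderate": [0, 0, 1, 1],
    "loose": [0, 0, 0, 1],
}
thresholds = {"strict": 20, "moderate": 50, "loose": 80}  # assumed values

accs = {}
for regime, y in labels.items():
    preds = [int(s >= thresholds[regime]) for s in scores]
    accs[regime] = sum(int(p == t) for p, t in zip(preds, y)) / len(y)
    print(f"{regime}: acc={accs[regime]:.2f}")
```

Note that only the threshold changes per regime; the underlying scores are computed once.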

Benchmark statistics (test)

| Field | Prompt | Response |
| --- | --- | --- |
| **Risk severity** | | |
| Total | 2000 | 2000 |
| BENIGN | 1000 | 1000 |
| LOW | 250 | 250 |
| MODERATE | 250 | 250 |
| HIGH | 250 | 250 |
| EXTREME | 250 | 250 |
| **Category** | | |
| SAFE | 1000 | 1000 |
| VIO | 194 | 239 |
| ILG | 146 | 453 |
| SEX | 130 | 38 |
| INF | 61 | 77 |
| DIS | 282 | 211 |
| MIS | 62 | 93 |
| JAIL | 130 | 5 |
| **Data source** | | |
| Aegis2.0 | 286 | 63 |
| XSTest | 83 | 259 |
| BeaverTails | 0 | 370 |
| HarmBench | 0 | 84 |
| OpenAI | 497 | 0 |
| SafeRLHF | 0 | 894 |
| ToxicChat | 769 | 0 |
| WildGuard | 365 | 330 |

🚀 FlexGuard Training

1) Download LLM backend (Qwen3-8B by default)

Follow the backend model instructions in the scripts/configs used by this repo.

2) SFT warm-up

```shell
accelerate launch --multi_gpu train_guard_prob_linear_cot.py
```

3) Merge LoRA adapter

```shell
python merge.py
```

4) 🔥 GRPO alignment

```shell
bash main_grpo_guard.sh
```

🙏 Acknowledgement

We build upon excellent open-source tools:


Citation

```bibtex
@misc{ding2026flexguard,
      title={FlexGuard: Continuous Risk Scoring for Strictness-Adaptive LLM Content Moderation},
      author={Zhihao Ding and Jinming Li and Ze Lu and Jieming Shi},
      year={2026},
      eprint={2602.23636},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2602.23636},
}
```

Contact

  • Issues / questions: please open a GitHub issue.
