diff --git a/docs/docs/datasets/index.md b/docs/docs/datasets/index.md
index 9adaf9a1..4d42bc26 100644
--- a/docs/docs/datasets/index.md
+++ b/docs/docs/datasets/index.md
@@ -13,6 +13,7 @@ Instead of manually specifying `goals`, use the `dataset` parameter to load goal
 - **Presets**: 30+ ready-to-use AI safety benchmarks (AgentHarm, JailbreakBench, BeaverTails, etc.)
 - **HuggingFace Hub**: Any public or private dataset from HuggingFace
 - **Local files**: JSON, JSONL, CSV, or TXT files from your filesystem
+- **Intent taxonomy selection**: Pick OmniSafeBench categories/subcategories with `intents`
 
 ```mermaid
 graph LR
@@ -129,6 +130,32 @@ attack_config = {
 results = agent.hack(attack_config=attack_config)
 ```
 
+### 4. Selecting Intent Categories (OmniSafeBench)
+
+When you want category-balanced goals without manually writing prompts, use
+`intents` to select categories and subcategories directly from the
+OmniSafeBench taxonomy.
+
+```python
+attack_config = {
+    "attack_type": "h4rm3l",
+    "intents": [
+        {
+            "category": "A",
+            "subcategories": ["A1", "A2"],
+            "samples_per_subcategory": 2,
+        }
+    ],
+}
+```
+
+HackAgent maps this to canonical labels in results/dashboard format:
+`A. Ethical and Social Risks` / `A1. Bias and Discrimination`.
+
+Taxonomy source: [OmniSafeBench-MM](https://github.com/jiaxiaojunQAQ/OmniSafeBench-MM/).
+
+[See full guide: Selecting intent categories](./selecting-intent-categories.md)
+
 ---
 
 ## Common Dataset Options
@@ -172,6 +199,7 @@ When both `shuffle` and `offset` are used, shuffling happens **first**, then off
 ## Next Steps
 
 - [**Datasets Tutorial**](../getting-started/datasets-tutorial.mdx): Complete walkthrough with examples
+- [**Selecting intent categories**](./selecting-intent-categories.md): Use taxonomy categories/subcategories with strings, enums, or label codes
 - [**Presets**](./presets.md): All 30+ pre-configured benchmarks
 - [**HuggingFace Provider**](./huggingface.md): Load any HuggingFace dataset
 - [**File Provider**](./file.md): Load from local JSON, CSV, or TXT files
diff --git a/docs/docs/datasets/selecting-intent-categories.md b/docs/docs/datasets/selecting-intent-categories.md
new file mode 100644
index 00000000..7e7d5604
--- /dev/null
+++ b/docs/docs/datasets/selecting-intent-categories.md
@@ -0,0 +1,201 @@
+---
+sidebar_position: 2
+title: Selecting intent categories
+---
+
+# Selecting intent categories
+
+Use `intents` when you want to build attack goals from the OmniSafeBench
+risk taxonomy instead of manually providing `goals` or selecting a full
+`dataset` provider.
+
+This is useful when you want to:
+
+- target specific risk families
+- keep labels consistent in dashboard/results metadata
+- skip category-classifier preflight when labels are explicitly provided
+
+## Citation
+
+The intent taxonomy used in this page is based on OmniSafeBench-MM:
+
+- OmniSafeBench-MM repository: [jiaxiaojunQAQ/OmniSafeBench-MM](https://github.com/jiaxiaojunQAQ/OmniSafeBench-MM/)
+
+## Three ways to write intents config
+
+### 1. Full labels as strings
+
+```python
+attack_config = {
+    "attack_type": "h4rm3l",
+    "intents": [
+        {
+            "category": "Ethical and Social Risks",
+            "subcategories": [
+                "Bias and Discrimination",
+                "Insulting or Harassing Speech",
+            ],
+            "samples_per_subcategory": 2,
+        },
+        {
+            "category": "Decision and Cognitive Risks",
+            "subcategories": ["Medical Advice"],
+            "samples_per_subcategory": 2,
+        },
+    ],
+}
+```
+
+### 2. Enums
+
+```python
+from hackagent.datasets import IntentCategory, IntentSubcategory
+
+attack_config = {
+    "attack_type": "h4rm3l",
+    "intents": [
+        {
+            "category": IntentCategory.ETHICAL_AND_SOCIAL_RISKS,
+            "subcategories": [
+                IntentSubcategory.BIAS_AND_DISCRIMINATION,
+                IntentSubcategory.INSULTING_OR_HARASSING_SPEECH,
+            ],
+            "samples_per_subcategory": 2,
+        },
+        {
+            "category": IntentCategory.DECISION_AND_COGNITIVE_RISKS,
+            "subcategories": [IntentSubcategory.MEDICAL_ADVICE],
+            "samples_per_subcategory": 2,
+        },
+    ],
+}
+```
+
+### 3. Label codes as strings
+
+```python
+attack_config = {
+    "attack_type": "h4rm3l",
+    "intents": [
+        {
+            "category": "A",
+            "subcategories": ["A1", "A2"],
+            "samples_per_subcategory": 2,
+        },
+        {
+            "category": "I",
+            "subcategories": ["I1"],
+            "samples_per_subcategory": 2,
+        },
+    ],
+}
+```
+
+## Default behavior for omitted fields
+
+When some fields are omitted in an `intents` entry, HackAgent applies the
+following defaults:
+
+- If `subcategories` is not provided, all subcategories of the selected
+  category are used.
+- If `samples_per_subcategory` is not provided, the default is `1` sample
+  for each selected subcategory.
+- Therefore, if both are omitted, HackAgent selects `1` sample for all
+  subcategories in the selected category.
+
+Example:
+
+```python
+attack_config = {
+    "attack_type": "h4rm3l",
+    "intents": [
+        {
+            "category": "A",
+            # subcategories omitted -> A1..A4
+            # samples_per_subcategory omitted -> 1 each
+        }
+    ],
+}
+```
+
+## Complete taxonomy map (code, label, enum ID)
+
+Source taxonomy and intents list: [OmniSafeBench-MM](https://github.com/jiaxiaojunQAQ/OmniSafeBench-MM/).
+
| Code | Name | Enum ID | Samples |
|---|---|---|---|
| A | Ethical and Social Risks | IntentCategory.ETHICAL_AND_SOCIAL_RISKS | 988 |
| A1 | Bias and Discrimination | IntentSubcategory.BIAS_AND_DISCRIMINATION | 434 |
| A2 | Insulting or Harassing Speech | IntentSubcategory.INSULTING_OR_HARASSING_SPEECH | 113 |
| A3 | Hate Speech | IntentSubcategory.HATE_SPEECH | 411 |
| A4 | Discriminatory Behavior | IntentSubcategory.DISCRIMINATORY_BEHAVIOR | 30 |
| B | Privacy and Data Risks | IntentCategory.PRIVACY_AND_DATA_RISKS | 504 |
| B1 | Personal Data Leakage | IntentSubcategory.PERSONAL_DATA_LEAKAGE | 250 |
| B2 | Biometric or Health Data Exposure | IntentSubcategory.BIOMETRIC_OR_HEALTH_DATA_EXPOSURE | 30 |
| B3 | Doxxing or Tracking | IntentSubcategory.DOXXING_OR_TRACKING | 164 |
| B4 | Surveillance Misuse | IntentSubcategory.SURVEILLANCE_MISUSE | 30 |
| B5 | Training Data Leakage | IntentSubcategory.TRAINING_DATA_LEAKAGE | 30 |
| C | Safety and Physical Harm | IntentCategory.SAFETY_AND_PHYSICAL_HARM | 877 |
| C1 | Dangerous Instructions | IntentSubcategory.DANGEROUS_INSTRUCTIONS | 29 |
| C2 | Self-harm Promotion | IntentSubcategory.SELF_HARM_PROMOTION | 30 |
| C3 | Physical Harm | IntentSubcategory.PHYSICAL_HARM | 599 |
| C4 | Human Trafficking | IntentSubcategory.HUMAN_TRAFFICKING | 30 |
| C5 | Animal Abuse | IntentSubcategory.ANIMAL_ABUSE | 132 |
| C6 | Drug Guidance | IntentSubcategory.DRUG_GUIDANCE | 29 |
| C7 | Environmental Damage | IntentSubcategory.ENVIRONMENTAL_DAMAGE | 28 |
| D | Criminal and Economic Risks | IntentCategory.CRIMINAL_AND_ECONOMIC_RISKS | 1017 |
| D1 | Fraud or Scams | IntentSubcategory.FRAUD_OR_SCAMS | 367 |
| D2 | Economic Crime | IntentSubcategory.ECONOMIC_CRIME | 245 |
| D3 | White-Collar Crime | IntentSubcategory.WHITE_COLLAR_CRIME | 30 |
| D4 | Illegal Activity | IntentSubcategory.ILLEGAL_ACTIVITY | 345 |
| D5 | Evasion of Law | IntentSubcategory.EVASION_OF_LAW | 30 |
| E | Cybersecurity Threats | IntentCategory.CYBERSECURITY_THREATS | 725 |
| E1 | Malware Generation | IntentSubcategory.MALWARE_GENERATION | 275 |
| E2 | Exploit Development | IntentSubcategory.EXPLOIT_DEVELOPMENT | 201 |
| E3 | Phishing or Ransomware | IntentSubcategory.PHISHING_OR_RANSOMWARE | 30 |
| E4 | Jailbreak or Prompt Injection | IntentSubcategory.JAILBREAK_OR_PROMPT_INJECTION | 30 |
| E5 | Cross-modal Attack | IntentSubcategory.CROSS_MODAL_ATTACK | 30 |
| E6 | Data Poisoning | IntentSubcategory.DATA_POISONING | 30 |
| E7 | Unauthorized Model Access | IntentSubcategory.UNAUTHORIZED_MODEL_ACCESS | 129 |
| F | Information and Political Manipulation | IntentCategory.INFORMATION_AND_POLITICAL_MANIPULATION | 534 |
| F1 | Misinformation or Disinformation | IntentSubcategory.MISINFORMATION_OR_DISINFORMATION | 30 |
| F2 | Deepfakes or Synthetic Media | IntentSubcategory.DEEPFAKES_OR_SYNTHETIC_MEDIA | 30 |
| F3 | Political Propaganda | IntentSubcategory.POLITICAL_PROPAGANDA | 384 |
| F4 | Election Interference | IntentSubcategory.ELECTION_INTERFERENCE | 30 |
| F5 | Public Order Disruption | IntentSubcategory.PUBLIC_ORDER_DISRUPTION | 30 |
| F6 | National Security Threat | IntentSubcategory.NATIONAL_SECURITY_THREAT | 30 |
| G | Content and Cultural Safety | IntentCategory.CONTENT_AND_CULTURAL_SAFETY | 537 |
| G1 | Pornography or Nudity | IntentSubcategory.PORNOGRAPHY_OR_NUDITY | 118 |
| G2 | Sexual Content | IntentSubcategory.SEXUAL_CONTENT | 137 |
| G3 | Violence or Gore | IntentSubcategory.VIOLENCE_OR_GORE | 152 |
| G4 | Cultural or Religious Offense | IntentSubcategory.CULTURAL_OR_RELIGIOUS_OFFENSE | 130 |
| H | Intellectual Property and Ownership | IntentCategory.INTELLECTUAL_PROPERTY_AND_OWNERSHIP | 304 |
| H1 | Copyright Infringement | IntentSubcategory.COPYRIGHT_INFRINGEMENT | 117 |
| H2 | Trademark or Brand Misuse | IntentSubcategory.TRADEMARK_OR_BRAND_MISUSE | 30 |
| H3 | Proprietary Data Leakage | IntentSubcategory.PROPRIETARY_DATA_LEAKAGE | 127 |
| H4 | Unauthorized Style or Persona Imitation | IntentSubcategory.UNAUTHORIZED_STYLE_OR_PERSONA_IMITATION | 30 |
| I | Decision and Cognitive Risks | IntentCategory.DECISION_AND_COGNITIVE_RISKS | 1593 |
| I1 | Medical Advice | IntentSubcategory.MEDICAL_ADVICE | 321 |
| I2 | Health Misguidance | IntentSubcategory.HEALTH_MISGUIDANCE | 128 |
| I3 | Legal Advice | IntentSubcategory.LEGAL_ADVICE | 372 |
| I4 | Government Decision Support | IntentSubcategory.GOVERNMENT_DECISION_SUPPORT | 308 |
| I5 | Financial Advice | IntentSubcategory.FINANCIAL_ADVICE | 290 |
| I6 | Market Manipulation | IntentSubcategory.MARKET_MANIPULATION | 30 |
| I7 | Psychological Manipulation | IntentSubcategory.PSYCHOLOGICAL_MANIPULATION | 114 |
| I8 | Cognitive Bias or Overreliance | IntentSubcategory.COGNITIVE_BIAS_OR_OVERRELIANCE | 30 |
| J | Child Safety | IntentCategory.CHILD_SAFETY | 747 |
| J1 | CSAM & Sexualization | IntentSubcategory.CSAM_SEXUALIZATION | 171 |
| J2 | Grooming & Enticement | IntentSubcategory.GROOMING_ENTICEMENT | 147 |
| J3 | Child Trafficking | IntentSubcategory.CHILD_TRAFFICKING | 144 |
| J4 | Harmful Content Targeting Minors | IntentSubcategory.HARMFUL_CONTENT_TARGETING_MINORS | 161 |
| J5 | Age Verification Evasion | IntentSubcategory.AGE_VERIFICATION_EVASION | 124 |
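
The defaults for omitted fields and the canonical `Code. Name` label format described on this page can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `TAXONOMY_EXCERPT`, `expand_intents`, and `total_goals` are hypothetical names, not HackAgent APIs, the excerpt copies just categories A and G from the table above, and only the label-code style of `intents` is handled.

```python
# Hypothetical sketch, not HackAgent code: expands label-code `intents`
# entries using the documented defaults and a two-category taxonomy excerpt.
TAXONOMY_EXCERPT = {
    "A": ("Ethical and Social Risks", {
        "A1": "Bias and Discrimination",
        "A2": "Insulting or Harassing Speech",
        "A3": "Hate Speech",
        "A4": "Discriminatory Behavior",
    }),
    "G": ("Content and Cultural Safety", {
        "G1": "Pornography or Nudity",
        "G2": "Sexual Content",
        "G3": "Violence or Gore",
        "G4": "Cultural or Religious Offense",
    }),
}


def expand_intents(intents):
    """Return (category label, subcategory label, samples) triples."""
    plan = []
    for entry in intents:
        code = entry["category"]
        name, subcategories = TAXONOMY_EXCERPT[code]
        # Documented default: no `subcategories` key means all of them.
        sub_codes = entry.get("subcategories")
        if sub_codes is None:
            sub_codes = list(subcategories)
        # Documented default: one sample per selected subcategory.
        samples = entry.get("samples_per_subcategory", 1)
        for sub_code in sub_codes:
            plan.append((
                f"{code}. {name}",
                f"{sub_code}. {subcategories[sub_code]}",
                samples,
            ))
    return plan


def total_goals(intents):
    """Number of goals the expanded plan would request."""
    return sum(samples for _, _, samples in expand_intents(intents))


if __name__ == "__main__":
    # Both optional fields omitted: A1..A4 with one sample each.
    print(total_goals([{"category": "A"}]))  # 4
    explicit = [{
        "category": "A",
        "subcategories": ["A1", "A2"],
        "samples_per_subcategory": 2,
    }]
    print(expand_intents(explicit)[0])
```

An unknown category or subcategory code raises `KeyError` in this sketch; how HackAgent itself reports invalid codes is not covered here.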