Add hallucination detection dataset operator #437
New operators for creating hallucination detection datasets:

1. LongContextFilterOperator
   - Filter samples by token count (8K+, 12K+, 16K+, etc.)
   - Uses HuggingFace tokenizers
   - Adds num_tokens column to output
2. HallucinationInjectionOperator
   - Inject RAGTruth-style hallucinations using LLM
   - Supports: Evident Conflict, Evident Baseless, Subtle Baseless, Subtle Conflict
   - Parses <hal>...</hal> tags to extract span positions
   - Configurable hallucination ratio
3. SpanAnnotationOperator
   - Convert document-level labels to span-level using NLI
   - Uses DeBERTa-v3-mnli-fever-anli by default
   - Identifies contradicting sentences

Also includes:
- Example pipeline in dataflow/example/HallucinationDetectionPipeline/
- Unit tests in test/test_hallucination_detection.py
- README documentation

Related: llm-semantic-router/longcontext-haldetect dataset
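For readers unfamiliar with the `<hal>...</hal>` convention mentioned above, here is a minimal sketch of how such tags could be turned into span positions. The function name `extract_hal_spans` and the output dictionary keys are hypothetical and are not taken from the PR; the actual operator may handle nesting, malformed tags, or offsets differently.

```python
import re
from typing import Dict, List, Tuple

# Non-greedy match so that several tagged spans in one answer stay separate.
HAL_TAG = re.compile(r"<hal>(.*?)</hal>", re.DOTALL)


def extract_hal_spans(tagged: str) -> Tuple[str, List[Dict[str, object]]]:
    """Strip <hal> tags and return (clean_text, spans).

    Span offsets refer to positions in the returned clean text,
    not in the tagged input.
    """
    clean_parts: List[str] = []
    spans: List[Dict[str, object]] = []
    cursor = 0      # read position in the tagged input
    clean_len = 0   # length of the clean text built so far

    for match in HAL_TAG.finditer(tagged):
        before = tagged[cursor:match.start()]
        clean_parts.append(before)
        clean_len += len(before)

        inner = match.group(1)
        spans.append({"start": clean_len, "end": clean_len + len(inner), "text": inner})
        clean_parts.append(inner)
        clean_len += len(inner)
        cursor = match.end()

    clean_parts.append(tagged[cursor:])
    return "".join(clean_parts), spans
```

Computing offsets against the tag-free text keeps the spans usable as labels for the answer that a detector would actually see.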
scripts/generate_with_dataflow.py: Complete pipeline for generating long-context hallucination detection datasets using DataFlow operators

Features:
- Filters NarrativeQA by token count (8K-24K)
- Generates answers via vLLM API
- Injects RAGTruth-style hallucinations (50%)
- Outputs JSON with span annotations

Tested: Generated 50 samples (25 hal, 25 supported) in 12K-14K range
Signed-off-by: Huamin Chen <hchen@redhat.com>
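The filtering and 50% injection steps described in this commit can be illustrated roughly as follows. This is a sketch under stated assumptions, not the script's code: `whitespace_token_count` is a stand-in for a real HuggingFace tokenizer, the inclusive bounds are a guess, and the helper names are hypothetical.

```python
import random
from typing import Callable, Dict, List


def whitespace_token_count(text: str) -> int:
    # Stand-in tokenizer (assumption): the PR says it uses HuggingFace
    # tokenizers; whitespace splitting keeps this sketch dependency-free.
    return len(text.split())


def filter_by_token_range(
    samples: List[Dict[str, object]],
    text_key: str,
    min_tokens: int,
    max_tokens: int,
    count_tokens: Callable[[str], int] = whitespace_token_count,
) -> List[Dict[str, object]]:
    """Keep samples whose token count lies in [min_tokens, max_tokens]
    and record the count in a num_tokens field."""
    kept = []
    for sample in samples:
        n = count_tokens(str(sample[text_key]))
        if min_tokens <= n <= max_tokens:
            kept.append({**sample, "num_tokens": n})
    return kept


def choose_injection_targets(n_samples: int, ratio: float, seed: int = 0) -> List[int]:
    """Pick the indices of samples that will receive an injected hallucination."""
    rng = random.Random(seed)  # seeded so the split is reproducible
    k = round(n_samples * ratio)
    return sorted(rng.sample(range(n_samples), k))
```

With `n_samples=50` and `ratio=0.5`, `choose_injection_targets` selects 25 indices, which matches the 25/25 split reported in the commit message.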
Hi, thank you very much for your interest in DataFlow and for submitting this PR.
Overall, we think this version still requires significant adjustments before it can fully align with the design patterns and conventions of DataFlow operators. One of the core goals in DataFlow’s operator design is to ensure that operators are clear, structured, and easy for the DataFlow Agent to understand, so that they can be composed, rewritten, or further optimized at the prompt level.
In addition, we aim to keep the operator set in the main repository as converged and minimal as possible, avoiding excessive overlap or inconsistent design patterns in the core operator library.
Based on these considerations, we would like to offer the following suggestions:
1. If you would like this operator to be included in the DataFlow main operator library
Some further refinement may be needed in the following areas:
1.1 Operator categorization
- Please clarify the functional category of this operator:
  - whether it can fit into the existing operator taxonomy,
  - or whether introducing a separate hallucination (or similar) category is truly necessary.
This decision has a direct impact on the overall structure and long-term maintainability of the operator system.
1.2 Operator conventions and consistency
- The prompt_template, get_desc, and related interfaces may need further refinement to comply with the coding and design conventions of DataFlow core operators;
- Existing operators and documentation in the repository can be used as references to ensure consistency in style and behavior.
https://github.com/OpenDCAI/DataFlow/blob/main/dataflow/operators/core_text/generate/format_str_prompted_generator.py#L69
2. Alternative approach: publishing via the DataFlow Extension / Ecosystem
If incorporating this operator into the main repository feels too restrictive, we strongly recommend using the DataFlow Extension / Ecosystem approach instead:
- Please refer to our documentation on building DataFlow Extensions: https://opendcai.github.io/DataFlow-Doc/en/guide/df_ecosystem/
- By publishing the operator in a separate repository, you can adopt more flexible design and release conventions;
- We plan to maintain an Extension index document (like "Awesome projects using DataFlow") in the main repository or documentation, and you will be able to submit a PR to add your Extension repository link to this index.
This model is similar to the PyTorch ecosystem: not every implementation needs to live in the core repository, and independent modules maintained in other repositories are encouraged.
We hope these suggestions are helpful, and we would be happy to further discuss the design and positioning of this operator. Thanks again for your contribution 🙌
The dataflow/example/* directory is for example datasets, not example Python scripts. Please refer to the existing implementation of DataFlow pipelines.
For third-party scripts, we recommend that you place the pipeline scripts under dataflow/statics/thirdparty/HallucinationDetection/*. This supports our default workflow, in which the dataflow init command generates all starter scripts.
This README may also be placed in the same directory as the example pipeline (dataflow/statics/thirdparty/.../).
Since the lazy loader exists, this file should be removed.
Same as above. Since the lazy loader exists, this file should be removed.
Since the lazy loader exists, this file should be removed.
@staticmethod
def get_desc(lang: str = "en") -> tuple:
    """Returns a description of the operator's functionality."""
    if lang == "zh":
To better help DF-Agent understand how to execute an operator, we need a more detailed get_desc for each operator. It needs to specify the properties of each parameter in __init__ and run. You can refer to:
"基于模板化提示词(Prompt Template)生成内容的算子。" (English: "An operator that generates content based on a templated prompt (Prompt Template).")
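As an illustration of what a more detailed get_desc might look like, here is a hypothetical sketch. The class name, parameter names, and wording are invented for this example, and the tuple return type simply follows the annotation in the snippet under review; the authoritative format is whatever the referenced core operators use.

```python
class LongContextFilterSketch:
    """Hypothetical operator used only to illustrate a detailed get_desc."""

    def __init__(self, min_tokens: int = 8000, max_tokens: int = 24000):
        self.min_tokens = min_tokens
        self.max_tokens = max_tokens

    @staticmethod
    def get_desc(lang: str = "en") -> tuple:
        """Describe the operator and every parameter of __init__ and run."""
        if lang == "zh":
            return (
                "按 token 数量过滤样本,并在输出中添加 num_tokens 列。",
                "初始化参数:",
                "- min_tokens:保留样本所需的最小 token 数",
                "- max_tokens:保留样本允许的最大 token 数",
                "运行参数:",
                "- input_key:包含待计数文本的列名",
                "- output_key:写入 token 数量的列名",
            )
        return (
            "Filters samples by token count and adds a num_tokens column.",
            "__init__ parameters:",
            "- min_tokens: minimum token count a sample needs to be kept",
            "- max_tokens: maximum token count a sample may have to be kept",
            "run parameters:",
            "- input_key: name of the column holding the text to count",
            "- output_key: name of the column that receives the token count",
        )
```

Listing the parameters of both methods in one place gives an agent enough information to fill in arguments without reading the implementation.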
def run(
    self,
    storage: DataFlowStorage,
    input_key: str = "dataframe",
This should be the name of a column in the dataframe rather than the whole dataframe.
This may be removed as well.
def get_desc(lang: str = "en") -> tuple:
    """Returns a description of the operator's functionality."""
    if lang == "zh":
        return (
This needs more details.
scripts/generate_with_dataflow.py
Outdated
We don't have this directory. This file may be considered redundant.
@SunnyHaze sounds good! I'll check this feedback and get back soon.
Signed-off-by: Huamin Chen <hchen@redhat.com>
SunnyHaze
left a comment
Hi, thanks for your revision. However, the current implementation added new functions to our key storage class. The implementation of the operators may also need revision to follow the read & write convention for DataFlow File Storage.
Please don't revise the key class, FileStorage. Here, it should only call read and write on the storage class, instead of adding new functions to it.
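A rough sketch of the pattern being requested, under stated assumptions: `InMemoryStorage` and `TokenCountOperator` are hypothetical stand-ins written for this example, rows are plain dictionaries rather than a dataframe to keep the sketch dependency-free, and the real FileStorage read and write signatures may differ. The point is only that the operator touches storage through read and write, and that input_key names a column.

```python
from typing import Dict, List

Rows = List[Dict[str, object]]


class InMemoryStorage:
    """Hypothetical stand-in for a storage class that exposes only read and write."""

    def __init__(self, rows: Rows):
        self._rows = [dict(r) for r in rows]

    def read(self) -> Rows:
        # Return copies so callers cannot mutate stored state by accident.
        return [dict(r) for r in self._rows]

    def write(self, rows: Rows) -> None:
        self._rows = [dict(r) for r in rows]


class TokenCountOperator:
    """Hypothetical operator: needs no new storage methods, only read and write."""

    def run(
        self,
        storage: InMemoryStorage,
        input_key: str = "text",          # a column name, not the whole table
        output_key: str = "num_tokens",
    ) -> str:
        rows = storage.read()
        for row in rows:
            # Whitespace split is a placeholder for a real tokenizer.
            row[output_key] = len(str(row[input_key]).split())
        storage.write(rows)
        return output_key
```

Keeping all operator-specific logic inside run means the storage class stays unchanged no matter how many operators are added.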
We are building a ModernBERT based hallucination detector, inspired by LettuceDetect. The training dataset is based on RAGTruth, with LLM augmentation. In addition, the HaluEval dataset is converted to spans using NLI.
All these operators are included in this PR.
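To make the NLI-based span conversion concrete, here is a minimal sketch. It assumes a pluggable `contradiction_score(premise, hypothesis)` callable; in the PR this role is described as being played by a DeBERTa-v3-mnli-fever-anli model, which is not loaded here. The sentence splitter, threshold, and function names are hypothetical simplifications.

```python
import re
from typing import Callable, Dict, List, Tuple


def split_sentences_with_offsets(text: str) -> List[Tuple[int, int, str]]:
    """Naive sentence splitter returning (start, end, sentence) triples."""
    sentences = []
    for match in re.finditer(r"[^.!?]+[.!?]*", text):
        raw = match.group()
        stripped = raw.strip()
        if not stripped:
            continue
        # Shift the start past any leading whitespace in the raw match.
        start = match.start() + (len(raw) - len(raw.lstrip()))
        sentences.append((start, start + len(stripped), stripped))
    return sentences


def annotate_spans(
    context: str,
    answer: str,
    contradiction_score: Callable[[str, str], float],
    threshold: float = 0.5,
) -> List[Dict[str, object]]:
    """Mark answer sentences that the scorer judges to contradict the context."""
    spans = []
    for start, end, sentence in split_sentences_with_offsets(answer):
        if contradiction_score(context, sentence) >= threshold:
            spans.append({"start": start, "end": end, "text": sentence})
    return spans
```

A toy scorer such as `lambda premise, hyp: 1.0 if "Mars" in hyp else 0.0` is enough to exercise the function; a real pipeline would pass the contradiction probability from an NLI model instead.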