
Conversation

@rootfs commented Jan 9, 2026

We are building a ModernBERT-based hallucination detector, inspired by LettuceDetect. The training dataset is based on RAGTruth with LLM augmentation. In addition, the HaluEval dataset is converted to span-level labels using NLI.

All these operators are included in this PR.
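For reference, a minimal sketch of that NLI-based span conversion, assuming a DeBERTa-v3 MNLI/FEVER/ANLI checkpoint (the exact model id, the naive sentence splitting, and the 0.5 threshold are illustrative assumptions, not the PR's implementation):

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed checkpoint; the PR uses a DeBERTa-v3 MNLI/FEVER/ANLI model by default.
MODEL_ID = "MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).eval()

def contradiction_prob(premise: str, hypothesis: str) -> float:
    """P(contradiction) with the context as premise and one answer sentence as hypothesis."""
    inputs = tokenizer(premise, hypothesis, truncation=True, return_tensors="pt")
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
    # Read the label order from the checkpoint config rather than hardcoding it.
    label2id = {v.lower(): k for k, v in model.config.id2label.items()}
    return probs[label2id["contradiction"]].item()

def to_spans(context: str, answer: str, threshold: float = 0.5) -> list[dict]:
    """Mark the sentences of a document-level 'hallucinated' answer that contradict the context."""
    spans, cursor = [], 0
    for sentence in answer.split(". "):  # naive sentence splitter, for illustration only
        start = answer.find(sentence, cursor)
        end = start + len(sentence)
        cursor = end
        if sentence and contradiction_prob(context, sentence) >= threshold:
            spans.append({"start": start, "end": end, "text": sentence})
    return spans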

rootfs added 3 commits January 8, 2026 23:28
New operators for creating hallucination detection datasets (a sketch of the token-count filter and <hal> span parsing follows the list):

1. LongContextFilterOperator
   - Filter samples by token count (8K+, 12K+, 16K+, etc.)
   - Uses HuggingFace tokenizers
   - Adds num_tokens column to output

2. HallucinationInjectionOperator
   - Inject RAGTruth-style hallucinations using an LLM
   - Supports: Evident Conflict, Evident Baseless, Subtle Baseless, Subtle Conflict
   - Parses <hal>...</hal> tags to extract span positions
   - Configurable hallucination ratio

3. SpanAnnotationOperator
   - Convert document-level labels to span-level using NLI
   - Uses DeBERTa-v3-mnli-fever-anli by default
   - Identifies contradicting sentences
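A minimal sketch of the two mechanical steps above, the token-count filter and the <hal> span extraction, assuming the ModernBERT tokenizer checkpoint and illustrative function names:

import re
from transformers import AutoTokenizer

# Assumed tokenizer; any HuggingFace tokenizer works for the length filter.
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")

def count_tokens(text: str) -> int:
    """Token count used to keep only long-context samples (e.g. 8K+ tokens)."""
    return len(tokenizer(text, add_special_tokens=False)["input_ids"])

HAL_TAG = re.compile(r"<hal>(.*?)</hal>", re.DOTALL)

def extract_hal_spans(tagged_answer: str) -> tuple[str, list[dict]]:
    """Strip <hal>...</hal> tags and return the clean answer plus character-level spans."""
    clean_parts, spans, cursor, last = [], [], 0, 0
    for match in HAL_TAG.finditer(tagged_answer):
        before = tagged_answer[last:match.start()]
        clean_parts.append(before)
        cursor += len(before)
        span_text = match.group(1)
        spans.append({"start": cursor, "end": cursor + len(span_text), "label": "hallucination"})
        clean_parts.append(span_text)
        cursor += len(span_text)
        last = match.end()
    clean_parts.append(tagged_answer[last:])
    return "".join(clean_parts), spans

For example, "The film <hal>won three Oscars</hal> in 1999." yields the clean answer "The film won three Oscars in 1999." with a single span covering "won three Oscars".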

Also includes:
- Example pipeline in dataflow/example/HallucinationDetectionPipeline/
- Unit tests in test/test_hallucination_detection.py
- README documentation

Related: llm-semantic-router/longcontext-haldetect dataset
scripts/generate_with_dataflow.py: Complete pipeline for generating
long-context hallucination detection datasets using DataFlow operators

Features (a rough outline is sketched after this commit message):
- Filters NarrativeQA by token count (8K-24K)
- Generates answers via the vLLM API
- Injects RAGTruth-style hallucinations into 50% of samples
- Outputs JSON with span annotations

Tested: Generated 50 samples (25 hallucinated, 25 supported) in the 12K-14K token range
Signed-off-by: Huamin Chen <hchen@redhat.com>
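A rough outline of how such a script can chain these steps through vLLM's OpenAI-compatible endpoint; the base URL, model name, dataset field names, and the count_tokens / extract_hal_spans helpers (from the sketch above) are assumptions, not the PR's actual code:

import json
import random
from datasets import load_dataset
from openai import OpenAI

# vLLM serves an OpenAI-compatible API; endpoint and model are placeholders.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "meta-llama/Llama-3.1-8B-Instruct"

def generate(prompt: str) -> str:
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3,
    )
    return response.choices[0].message.content

records = []
for row in load_dataset("narrativeqa", split="train"):
    context = row["document"]["text"]        # field names assumed from the HF schema
    question = row["question"]["text"]
    if not 8_000 <= count_tokens(context) <= 24_000:
        continue                              # long-context filter (8K-24K tokens)
    answer = generate(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
    spans = []
    if random.random() < 0.5:                 # inject hallucinations into half the samples
        tagged = generate(
            "Rewrite the answer, wrapping one invented or contradicted detail in <hal> tags.\n"
            f"Context:\n{context}\n\nAnswer:\n{answer}"
        )
        answer, spans = extract_hal_spans(tagged)
    records.append({"context": context, "question": question,
                    "answer": answer, "spans": spans, "hallucinated": bool(spans)})
    if len(records) >= 50:
        break

with open("longcontext_haldetect.json", "w") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)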
@SunnyHaze (Collaborator) left a comment

Hi, thank you very much for your interest in DataFlow and for submitting this PR.

Overall, we think this version still requires significant adjustments before it can fully align with the design patterns and conventions of DataFlow operators. One of the core goals in DataFlow’s operator design is to ensure that operators are clear, structured, and easy for the DataFlow Agent to understand, so that they can be composed, rewritten, or further optimized at the prompt level.

In addition, we aim to keep the operator set in the main repository as converged and minimal as possible, avoiding excessive overlap or inconsistent design patterns in the core operator library.

Based on these considerations, we would like to offer the following suggestions:

1. If you would like this operator to be included in the DataFlow main operator library

Some further refinement may be needed in the following areas:

1.1 Operator categorization

  • Please clarify the functional category of this operator: can it fit into the existing operator taxonomy, or is introducing a separate hallucination (or similar) category truly necessary?
  • This decision has a direct impact on the overall structure and long-term maintainability of the operator system.

1.2 Operator conventions and consistency

2. Alternative approach: publishing via the DataFlow Extension / Ecosystem

If incorporating this operator into the main repository feels too restrictive, we strongly recommend using the DataFlow Extension / Ecosystem approach instead:

  • Please refer to our documentation on building DataFlow Extensions:
    https://opendcai.github.io/DataFlow-Doc/en/guide/df_ecosystem/
  • By publishing the operator in a separate repository, you can adopt more flexible design and release conventions;
  • We plan to maintain an Extension index document (like an "Awesome projects using DataFlow" list) in the main repository or the documentation, and you will be able to submit a PR to add your Extension repository link to this index.

This model is similar to the PyTorch ecosystem: not every implementation needs to live in the core repository, and independently maintained modules in other repositories are encouraged.

We hope these suggestions are helpful, and we would be happy to further discuss the design and positioning of this operator. Thanks again for your contribution 🙌

Collaborator

The directory of dataflow/example/* is not for example Python scripts but for example datasets. Please refer to the existing implementation of DataFlow pipelines.

For third-party scripts, we recommend placing the pipeline scripts under dataflow/statics/thirdparty/HallucinationDetection/*. This supports our default usage, where the dataflow init command generates all of the starter scripts.

Collaborator

This README may also be placed in the same directory as the example pipeline (dataflow/statics/thirdparty/.../).

Collaborator

Since the lazy loader already exists, this file should be removed.

Collaborator

Same as above: since the lazy loader already exists, this file should be removed.

Collaborator

Since the lazy loader already exists, this file should be removed.

@staticmethod
def get_desc(lang: str = "en") -> tuple:
    """Returns a description of the operator's functionality."""
    if lang == "zh":
Collaborator

To better help the DF-Agent understand how to execute an operator, we need a more detailed get_desc for each operator. It should specify the properties of each parameter in __init__ and run. You can use

"An operator that generates content based on a templated prompt (Prompt Template)."

as an example.
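For illustration, a hypothetical get_desc with the kind of parameter-level detail being requested might look like the following (the parameter names and wording are assumptions; the zh branch is omitted for brevity, and the exact format should mirror the prompt-template operator referenced above):

@staticmethod
def get_desc(lang: str = "en"):
    # Parameter names below are illustrative; describe whatever the operator's
    # __init__ and run actually accept, in both "en" and "zh".
    return (
        "SpanAnnotationOperator converts document-level hallucination labels into "
        "span-level annotations using an NLI model.\n"
        "__init__ parameters:\n"
        "  model_name: HuggingFace model id of the NLI checkpoint used for inference.\n"
        "  threshold: probability above which a sentence is marked as contradicting.\n"
        "run parameters:\n"
        "  storage: DataFlowStorage used to read and write the dataframe.\n"
        "  input_context_key: name of the column holding the retrieval context.\n"
        "  input_answer_key: name of the column holding the answer to annotate.\n"
        "  output_key: name of the column that will receive the span annotations.\n"
        "Output: the input dataframe with the span-annotation column added."
    )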

def run(
    self,
    storage: DataFlowStorage,
    input_key: str = "dataframe",
Collaborator

This should be the name of a column in the dataframe, not the whole dataframe.
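A minimal sketch of the shape being suggested, written inside the operator class and assuming the FileStorage read/write convention described later in this review (the column names and the _annotate helper are hypothetical):

def run(
    self,
    storage: DataFlowStorage,
    input_context_key: str = "context",
    input_answer_key: str = "answer",
    output_key: str = "hallucination_spans",
):
    # The keys above name columns in the dataframe; the dataframe itself is
    # read from storage rather than passed in directly.
    dataframe = storage.read("dataframe")
    dataframe[output_key] = [
        self._annotate(row[input_context_key], row[input_answer_key])  # hypothetical helper
        for _, row in dataframe.iterrows()
    ]
    # Only call the existing read/write API; do not add new methods to FileStorage.
    storage.write(dataframe)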

Collaborator

This may be removed as well.

def get_desc(lang: str = "en") -> tuple:
    """Returns a description of the operator's functionality."""
    if lang == "zh":
        return (
Collaborator

This get_desc also needs more detail.

Collaborator

We don't have this directory; this file may be redundant.

@rootfs (Author) commented Jan 9, 2026

@SunnyHaze sounds good! I'll work through this feedback and get back soon.

Signed-off-by: Huamin Chen <hchen@redhat.com>
@SunnyHaze (Collaborator) left a comment

Hi, thanks for your revision. However, the current implementation adds new functions to our key storage class. The operator implementations may also need revision to follow the read & write convention for DataFlow FileStorage.

@SunnyHaze (Collaborator) commented Jan 10, 2026

Please don't revise the key class, FileStorage. The operator should only call read and write on the storage class instead of adding new functions to it.
