
Conversation

@rootfs commented Jan 9, 2026

We are building a ModernBERT-based hallucination detector, inspired by LettuceDetect. The training dataset is based on RAGTruth with LLM augmentation. In addition, the HaluEval dataset is converted to span-level labels using NLI.

All these operators are included in this PR.
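For reference, a minimal sketch of that NLI-based span conversion, assuming a DeBERTa-v3 MNLI/FEVER/ANLI checkpoint (the exact model id, the naive sentence splitting, and the 0.5 threshold are illustrative assumptions, not the PR's implementation):

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed checkpoint; the PR uses a DeBERTa-v3 MNLI/FEVER/ANLI model by default.
MODEL_ID = "MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).eval()

def contradiction_prob(premise: str, hypothesis: str) -> float:
    """P(contradiction) with the context as premise and one answer sentence as hypothesis."""
    inputs = tokenizer(premise, hypothesis, truncation=True, return_tensors="pt")
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
    # Read the label order from the checkpoint config rather than hardcoding it.
    label2id = {v.lower(): k for k, v in model.config.id2label.items()}
    return probs[label2id["contradiction"]].item()

def to_spans(context: str, answer: str, threshold: float = 0.5) -> list[dict]:
    """Mark the sentences of a document-level 'hallucinated' answer that contradict the context."""
    spans, cursor = [], 0
    for sentence in answer.split(". "):  # naive sentence splitter, for illustration only
        start = answer.find(sentence, cursor)
        end = start + len(sentence)
        cursor = end
        if sentence and contradiction_prob(context, sentence) >= threshold:
            spans.append({"start": start, "end": end, "text": sentence})
    return spans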

rootfs added 3 commits January 8, 2026 23:28
New operators for creating hallucination detection datasets (a sketch of the token-count filter and <hal> span parsing follows the list):

1. LongContextFilterOperator
   - Filter samples by token count (8K+, 12K+, 16K+, etc.)
   - Uses HuggingFace tokenizers
   - Adds num_tokens column to output

2. HallucinationInjectionOperator
   - Inject RAGTruth-style hallucinations using an LLM
   - Supports: Evident Conflict, Evident Baseless, Subtle Baseless, Subtle Conflict
   - Parses <hal>...</hal> tags to extract span positions
   - Configurable hallucination ratio

3. SpanAnnotationOperator
   - Convert document-level labels to span-level using NLI
   - Uses DeBERTa-v3-mnli-fever-anli by default
   - Identifies contradicting sentences
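A minimal sketch of the two mechanical steps above, the token-count filter and the <hal> span extraction, assuming the ModernBERT tokenizer checkpoint and illustrative function names:

import re
from transformers import AutoTokenizer

# Assumed tokenizer; any HuggingFace tokenizer works for the length filter.
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")

def count_tokens(text: str) -> int:
    """Token count used to keep only long-context samples (e.g. 8K+ tokens)."""
    return len(tokenizer(text, add_special_tokens=False)["input_ids"])

HAL_TAG = re.compile(r"<hal>(.*?)</hal>", re.DOTALL)

def extract_hal_spans(tagged_answer: str) -> tuple[str, list[dict]]:
    """Strip <hal>...</hal> tags and return the clean answer plus character-level spans."""
    clean_parts, spans, cursor, last = [], [], 0, 0
    for match in HAL_TAG.finditer(tagged_answer):
        before = tagged_answer[last:match.start()]
        clean_parts.append(before)
        cursor += len(before)
        span_text = match.group(1)
        spans.append({"start": cursor, "end": cursor + len(span_text), "label": "hallucination"})
        clean_parts.append(span_text)
        cursor += len(span_text)
        last = match.end()
    clean_parts.append(tagged_answer[last:])
    return "".join(clean_parts), spans

For example, "The film <hal>won three Oscars</hal> in 1999." yields the clean answer "The film won three Oscars in 1999." with a single span covering "won three Oscars".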

Also includes:
- Example pipeline in dataflow/example/HallucinationDetectionPipeline/
- Unit tests in test/test_hallucination_detection.py
- README documentation

Related: llm-semantic-router/longcontext-haldetect dataset
scripts/generate_with_dataflow.py: Complete pipeline for generating
long-context hallucination detection datasets using DataFlow operators

Features (a rough outline is sketched after this commit message):
- Filters NarrativeQA by token count (8K-24K)
- Generates answers via the vLLM API
- Injects RAGTruth-style hallucinations into 50% of samples
- Outputs JSON with span annotations

Tested: Generated 50 samples (25 hallucinated, 25 supported) in the 12K-14K token range
Signed-off-by: Huamin Chen <hchen@redhat.com>
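A rough outline of how such a script can chain these steps through vLLM's OpenAI-compatible endpoint; the base URL, model name, dataset field names, and the count_tokens / extract_hal_spans helpers (from the sketch above) are assumptions, not the PR's actual code:

import json
import random
from datasets import load_dataset
from openai import OpenAI

# vLLM serves an OpenAI-compatible API; endpoint and model are placeholders.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "meta-llama/Llama-3.1-8B-Instruct"

def generate(prompt: str) -> str:
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3,
    )
    return response.choices[0].message.content

records = []
for row in load_dataset("narrativeqa", split="train"):
    context = row["document"]["text"]        # field names assumed from the HF schema
    question = row["question"]["text"]
    if not 8_000 <= count_tokens(context) <= 24_000:
        continue                              # long-context filter (8K-24K tokens)
    answer = generate(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
    spans = []
    if random.random() < 0.5:                 # inject hallucinations into half the samples
        tagged = generate(
            "Rewrite the answer, wrapping one invented or contradicted detail in <hal> tags.\n"
            f"Context:\n{context}\n\nAnswer:\n{answer}"
        )
        answer, spans = extract_hal_spans(tagged)
    records.append({"context": context, "question": question,
                    "answer": answer, "spans": spans, "hallucinated": bool(spans)})
    if len(records) >= 50:
        break

with open("longcontext_haldetect.json", "w") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)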
@SunnyHaze (Collaborator) left a comment

Hi, thank you very much for your interest in DataFlow and for submitting this PR.

Overall, we think this version still requires significant adjustments before it can fully align with the design patterns and conventions of DataFlow operators. One of the core goals in DataFlow’s operator design is to ensure that operators are clear, structured, and easy for the DataFlow Agent to understand, so that they can be composed, rewritten, or further optimized at the prompt level.

In addition, we aim to keep the operator set in the main repository as converged and minimal as possible, avoiding excessive overlap or inconsistent design patterns in the core operator library.

Based on these considerations, we would like to offer the following suggestions:

1. If you would like this operator to be included in the DataFlow main operator library

Some further refinement may be needed in the following areas:

1.1 Operator categorization

  • Please clarify the functional category of this operator: can it fit into the existing operator taxonomy, or is introducing a separate hallucination (or similar) category truly necessary?
  • This decision has a direct impact on the overall structure and long-term maintainability of the operator system.

1.2 Operator conventions and consistency

2. Alternative approach: publishing via the DataFlow Extension / Ecosystem

If incorporating this operator into the main repository feels too restrictive, we strongly recommend using the DataFlow Extension / Ecosystem approach instead:

  • Please refer to our documentation on building DataFlow Extensions:
    https://opendcai.github.io/DataFlow-Doc/en/guide/df_ecosystem/
  • By publishing the operator in a separate repository, you can adopt more flexible design and release conventions;
  • We plan to maintain an Extension index document (like an "Awesome projects using DataFlow" list) in the main repository or the documentation, and you will be able to submit a PR to add your Extension repository link to this index.

This model is similar to the PyTorch ecosystem: not every implementation needs to live in the core repository, and independently maintained modules in other repositories are encouraged.

We hope these suggestions are helpful, and we would be happy to further discuss the design and positioning of this operator. Thanks again for your contribution 🙌

Collaborator

The directory of dataflow/example/* is not for example Python scripts but for example datasets. Please refer to the existing implementation of DataFlow pipelines.

For third-party scripts, we recommend placing the pipeline scripts under dataflow/statics/thirdparty/HallucinationDetection/*. This supports our default usage, where the dataflow init command generates all of the starter scripts.

Collaborator

This README may also be placed in the same directory as the example pipeline (dataflow/statics/thirdparty/.../).

Collaborator

Since the lazy loader already exists, this file should be removed.

Collaborator

Same as above: since the lazy loader already exists, this file should be removed.

Collaborator

Since the lazy loader already exists, this file should be removed.

@staticmethod
def get_desc(lang: str = "en") -> tuple:
    """Returns a description of the operator's functionality."""
    if lang == "zh":
Collaborator

To better help the DF-Agent understand how to execute an operator, we need a more detailed get_desc for each operator. It should specify the properties of each parameter in __init__ and run. You can use

"An operator that generates content based on a templated prompt (Prompt Template)."

as an example.
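For illustration, a hypothetical get_desc with the kind of parameter-level detail being requested might look like the following (the parameter names and wording are assumptions; the zh branch is omitted for brevity, and the exact format should mirror the prompt-template operator referenced above):

@staticmethod
def get_desc(lang: str = "en"):
    # Parameter names below are illustrative; describe whatever the operator's
    # __init__ and run actually accept, in both "en" and "zh".
    return (
        "SpanAnnotationOperator converts document-level hallucination labels into "
        "span-level annotations using an NLI model.\n"
        "__init__ parameters:\n"
        "  model_name: HuggingFace model id of the NLI checkpoint used for inference.\n"
        "  threshold: probability above which a sentence is marked as contradicting.\n"
        "run parameters:\n"
        "  storage: DataFlowStorage used to read and write the dataframe.\n"
        "  input_context_key: name of the column holding the retrieval context.\n"
        "  input_answer_key: name of the column holding the answer to annotate.\n"
        "  output_key: name of the column that will receive the span annotations.\n"
        "Output: the input dataframe with the span-annotation column added."
    )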

def run(
    self,
    storage: DataFlowStorage,
    input_key: str = "dataframe",
Collaborator

This should be the name of a column in the dataframe, not the whole dataframe.
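A minimal sketch of the shape being suggested, written inside the operator class and assuming the FileStorage read/write convention described later in this review (the column names and the _annotate helper are hypothetical):

def run(
    self,
    storage: DataFlowStorage,
    input_context_key: str = "context",
    input_answer_key: str = "answer",
    output_key: str = "hallucination_spans",
):
    # The keys above name columns in the dataframe; the dataframe itself is
    # read from storage rather than passed in directly.
    dataframe = storage.read("dataframe")
    dataframe[output_key] = [
        self._annotate(row[input_context_key], row[input_answer_key])  # hypothetical helper
        for _, row in dataframe.iterrows()
    ]
    # Only call the existing read/write API; do not add new methods to FileStorage.
    storage.write(dataframe)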

Collaborator

This may be removed as well.

def get_desc(lang: str = "en") -> tuple:
    """Returns a description of the operator's functionality."""
    if lang == "zh":
        return (
Collaborator

This get_desc also needs more detail.

Collaborator

We don't have this directory; this file may be redundant.

@rootfs (Author) commented Jan 9, 2026

@SunnyHaze sounds good! I'll work through this feedback and get back soon.

Signed-off-by: Huamin Chen <hchen@redhat.com>
@SunnyHaze (Collaborator) left a comment

Hi, thanks for your revision. However, the current implementation adds new functions to our key storage class. The operator implementations may also need revision to follow the read & write convention for DataFlow FileStorage.

@SunnyHaze (Collaborator) commented Jan 10, 2026

Please don't revise the key class, FileStorage. The operator should only call read and write on the storage class instead of adding new functions to it.
