Dataset Auditor #7

@AndyLone22

Create a new standalone Python script named DatasetAuditor.py.
Context: The script is used for Stable Diffusion LoRA training preparation. We need a pre-training dataset curation tool to find redundant images and filter synthetic datasets based on facial biometrics.

Core Engine: Use the insightface library (model: 'buffalo_l') with CUDAExecutionProvider to extract 512-dimensional facial embeddings.
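A minimal sketch of the extraction step, assuming the usual `FaceAnalysis` entry point from insightface and images loaded with OpenCV. The helper names (`build_face_app`, `extract_embeddings`, `largest_face`), the CPU fallback provider, the `det_size` value, and the rule of keeping only the largest face per image are illustrative assumptions, not part of the request.

```python
from pathlib import Path

# Hypothetical set of extensions to scan; adjust to the dataset.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp"}


def bbox_area(bbox):
    """Area of an (x1, y1, x2, y2) box; degenerate boxes count as 0."""
    x1, y1, x2, y2 = bbox
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def largest_face(faces):
    """Pick the face with the largest bounding box, or None if there are none."""
    return max(faces, key=lambda f: bbox_area(f.bbox), default=None)


def build_face_app(det_size=(640, 640)):
    # Imported lazily so the helpers above work without insightface installed.
    from insightface.app import FaceAnalysis

    app = FaceAnalysis(
        name="buffalo_l",
        # CPU fallback is an assumption; the issue only names CUDA.
        providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
    )
    app.prepare(ctx_id=0, det_size=det_size)  # ctx_id=0 selects the first GPU
    return app


def extract_embeddings(folder, app):
    """Map each image path to the 512-d normalized embedding of its largest face."""
    import cv2  # insightface expects BGR arrays as returned by cv2.imread

    embeddings = {}
    for path in sorted(Path(folder).iterdir()):
        if path.suffix.lower() not in IMAGE_EXTS:
            continue
        img = cv2.imread(str(path))
        if img is None:  # unreadable or corrupt file
            continue
        face = largest_face(app.get(img))
        if face is not None:
            embeddings[path] = face.normed_embedding
    return embeddings
```

Images with no detected face are simply skipped here; the real script may prefer to list them in the report instead.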

The script should scan the dataset folder, rank images by facial likeness, flag redundant images, and generate a visual HTML report showing what to prune before training starts.
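The ranking, flagging, and reporting steps could look roughly like the sketch below. It assumes embeddings are already available as a dict mapping file names to vectors. Ranking by cosine similarity to the dataset centroid, the `0.95` redundancy threshold, and all function names are assumptions chosen for illustration; the issue does not specify a reference face or a cutoff.

```python
import html
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def rank_likeness(embeddings):
    """Sort (name, score) pairs by similarity to the dataset centroid, best first."""
    center = centroid(list(embeddings.values()))
    scores = {name: cosine(vec, center) for name, vec in embeddings.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def flag_redundant(embeddings, threshold=0.95):
    """Return (name_a, name_b, similarity) for every pair at or above threshold."""
    names = sorted(embeddings)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            sim = cosine(embeddings[a], embeddings[b])
            if sim >= threshold:
                pairs.append((a, b, sim))
    return pairs


def render_report(ranked, redundant):
    """Build a single-table HTML page with a thumbnail, score, and flag per image."""
    flagged = {name for a, b, _ in redundant for name in (a, b)}
    rows = []
    for name, score in ranked:
        safe = html.escape(str(name), quote=True)
        note = "redundant" if name in flagged else ""
        rows.append(
            f'<tr><td><img src="{safe}" width="128"></td>'
            f"<td>{safe}</td><td>{score:.3f}</td><td>{note}</td></tr>"
        )
    return (
        "<html><body><table>"
        "<tr><th>Image</th><th>File</th><th>Likeness</th><th>Flag</th></tr>"
        + "".join(rows)
        + "</table></body></html>"
    )
```

The pairwise comparison is quadratic in the number of images, which is usually acceptable for LoRA-sized datasets of a few hundred files; a larger set would likely call for a vectorized or indexed similarity search.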
