beprepared is a Python-based tool for preparing high-quality image datasets for diffusion model fine-tuning. It provides a powerful workflow DSL for processing images with both automated and human-in-the-loop operations.
- **Node System** (`beprepared/node.py`)
  - Base class for all processing nodes
  - Uses a metaclass to enable DSL patterns like `Node1 >> Node2`
  - Each node processes a dataset and returns a new dataset
  - Supports chaining via the `>>` and `<<` operators
- **Dataset** (`beprepared/dataset.py`)
  - Container for the images being processed
  - Supports copying for non-destructive operations
- **Image** (`beprepared/image.py`)
  - Represents an individual image and its properties
  - Uses the PropertyBag pattern for flexible attributes
  - Allowed formats: JPEG, PNG, WEBP, GIF, TIFF, BMP
- **Properties System** (`beprepared/properties.py`)
  - `PropertyBag`: Base class for objects holding properties
  - `CachedProperty`: Properties cached in the SQLite database
  - `ConstProperty`: Immutable properties
  - `ComputedProperty`: Dynamically computed properties
- **Workspace** (`beprepared/workspace.py`)
  - Manages global state and the database
  - SQLite database for caching operations
  - Thread-local database connections
  - Object storage for image data
- **Web Interface** (`beprepared/web.py`, `beprepared/web/`)
  - FastAPI-based web server for human-in-the-loop tasks
  - Vue.js frontend for the filtering and tagging interfaces
  - Real-time progress updates via WebSockets
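The `>>` chaining in the Node System can be illustrated with a minimal, self-contained sketch. This is not beprepared's actual implementation (which uses a metaclass and real `Dataset` objects); it only shows how operator overloading via `__rshift__` produces a pipeline, using plain lists as stand-in datasets:

```python
class Node:
    """Illustrative base class: `a >> b` builds a Pipeline of the two nodes."""
    def __rshift__(self, other):
        return Pipeline([self, other])

    def eval(self, dataset):
        raise NotImplementedError


class Pipeline(Node):
    """Holds an ordered list of nodes; `>>` appends, `eval` runs them in order."""
    def __init__(self, nodes):
        self.nodes = nodes

    def __rshift__(self, other):
        return Pipeline(self.nodes + [other])

    def eval(self, dataset):
        for node in self.nodes:
            dataset = node.eval(dataset)
        return dataset


class AddOne(Node):
    def eval(self, dataset):
        return [x + 1 for x in dataset]


class Double(Node):
    def eval(self, dataset):
        return [x * 2 for x in dataset]


pipeline = AddOne() >> Double() >> AddOne()
print(pipeline.eval([1, 2, 3]))  # [5, 7, 9]
```

Each `>>` returns a new `Pipeline` rather than mutating an existing one, mirroring the non-destructive style the real nodes follow.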
- **Input/Output**
  - `Load`: Load images from a directory
  - `Save`: Save processed images
- **Image Processing**
  - `ConvertFormat`: Convert image formats
  - `Upscale` / `Downscale`: Resize images
  - `Anonymize`: Blur faces in images
  - `EdgeWatermarkRemoval`: Remove watermarks
- **Filtering**
  - `FilterBySize`: Filter by image dimensions
  - `FilterByAspectRatio`: Filter by aspect ratio
  - `HumanFilter`: Manual human filtering via the web UI
  - `SmartHumanFilter`: Intelligent human filtering
- **Captioning**
  - `JoyCaptionAlphaOne` / `JoyCaptionAlphaTwo`: JoyCaption models
  - `GPT4oCaption`: OpenAI GPT-4 Vision
  - `GeminiCaption`: Google Gemini
  - `LlamaCaption`: Meta Llama
  - `QwenVLCaption`: Qwen Vision-Language
  - `XGenMMCaption`: xGen multimodal
  - `Florence2Caption`: Microsoft Florence-2
  - `MolmoCaption`: Molmo captioning
- **Analysis**
  - `NudeNet`: NSFW content detection
  - `AestheticScore`: Aesthetic quality scoring
  - `ClipEmbed`: CLIP embeddings
- **Tagging & Caption Transforms**
  - `HumanTag`: Manual tagging via the web UI
  - `AddTags` / `RemoveTags` / `RewriteTags`: Tag manipulation
  - `LLMCaptionTransform`: Transform captions using LLMs
  - `LLMCaptionVariations`: Generate caption variations
- **Deduplication**
  - `ExactDedupe`: Exact duplicate removal
  - `FuzzyDedupe`: CLIP-based fuzzy deduplication
- **Utility**
  - `Info`: Print dataset information
  - `Concat`: Concatenate multiple datasets
  - `Take`: Take the first N images
  - `Sorted`: Sort images by property
  - `Shuffle`: Randomize order
  - `Map` / `Apply` / `Filter`: Functional operations
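The idea behind exact deduplication (as in `ExactDedupe`) can be sketched in a few self-contained lines. This is a hypothetical illustration using byte hashing, not the node's actual code:

```python
import hashlib


def exact_dedupe(images: list[bytes]) -> list[bytes]:
    """Keep only the first occurrence of each byte-identical image."""
    seen = set()
    unique = []
    for data in images:
        digest = hashlib.sha256(data).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(data)
    return unique


print(len(exact_dedupe([b"cat", b"dog", b"cat"])))  # 2
```

Fuzzy deduplication (`FuzzyDedupe`) replaces the exact hash with a similarity comparison over CLIP embeddings, so near-duplicates with different bytes are also caught.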
```python
from beprepared.node import Node
from beprepared.dataset import Dataset
from beprepared.properties import CachedProperty


class MyNode(Node):
    '''Node description for documentation'''

    def __init__(self, param1: str, param2: int = 10):
        '''Initialize the node

        Args:
            param1: Description of parameter 1
            param2: Description of parameter 2 (default: 10)
        '''
        super().__init__()
        self.param1 = param1
        self.param2 = param2

    def eval(self, dataset: Dataset) -> Dataset:
        '''Process the dataset

        Args:
            dataset: Input dataset

        Returns:
            Processed dataset
        '''
        # Process each image
        for image in dataset.images:
            # Access existing properties
            width = image.width.value
            height = image.height.value

            # Add a new cached property
            result_prop = CachedProperty('mynode_result', image)
            if not result_prop.has_value:
                # Compute and cache the result
                result = self.process_image(image)
                result_prop.value = result

            # Attach the property to the image
            image.mynode_result = result_prop
        return dataset

    def process_image(self, image):
        # Your processing logic here
        pass
```

- **Property Caching**: Use `CachedProperty` to avoid recomputing expensive operations
- **Non-destructive**: Always return a new dataset or a copy
- **Logging**: Use `self.log` for logging (automatically connected to the web UI)
- **Progress**: Use `tqdm` from `beprepared.nodes.utils` for progress bars
- **Web Integration**: For human-in-the-loop nodes, see the `HumanFilter` / `HumanTag` examples
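The property-caching practice can be sketched with an in-memory SQLite memo table. This is illustrative only; beprepared's `CachedProperty` API and schema differ, and the `cached` helper below is hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE property_cache (key TEXT PRIMARY KEY, value TEXT)")


def cached(key, compute):
    """Return the cached value for `key`, computing and storing it on a miss."""
    row = conn.execute(
        "SELECT value FROM property_cache WHERE key = ?", (key,)
    ).fetchone()
    if row is not None:
        return row[0]
    value = compute()
    conn.execute("INSERT INTO property_cache (key, value) VALUES (?, ?)", (key, value))
    return value


calls = []


def expensive_caption():
    calls.append(1)  # track how many times the expensive work actually runs
    return "a photo of a cat"


print(cached("img1:caption", expensive_caption))  # a photo of a cat
print(cached("img1:caption", expensive_caption))  # a photo of a cat (from cache)
print(len(calls))  # 1
```

Because the cache key identifies both the image and the property, re-running a workflow skips every operation whose result is already stored.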
For GPU-intensive operations, use the ParallelWorker pattern:
```python
from beprepared.nodes.parallelworker import ParallelWorker


class MyGPUNode(ParallelWorker):
    def __init__(self):
        super().__init__(num_workers=2)  # Number of GPU workers

    def load_models(self):
        # Load models once per worker
        import torch
        self.model = load_my_model()

    def process_image(self, image):
        # Process a single image on the GPU
        result = self.model(image)
        return result
```

A minimal pipeline:

```python
from beprepared import *

(
    Load("input_images")
    >> FilterBySize(min_edge=512)
    >> JoyCaptionAlphaOne
    >> Save("output")
)
```

A fuller pipeline with human-in-the-loop steps:

```python
(
    Load("raw_images")
    >> FilterBySize(min_edge=512)
    >> HumanFilter                          # Web UI for filtering
    >> Anonymize                            # Blur faces
    >> JoyCaptionAlphaOne                   # Auto-caption
    >> HumanTag(tags=["style1", "style2"])  # Manual tagging
    >> LLMCaptionTransform(                 # Enhance captions
        system_prompt="Improve this caption",
        user_prompt="Caption: {caption}"
    )
    >> Save("final_dataset")
)
```

- Create a test script in the project root:

```python
from beprepared import *
from beprepared.nodes.mynode import MyNode

(
    Load("test_images")
    >> MyNode(param1="test")
    >> Info
    >> Save("test_output")
)
```

- Run it with the CLI:

```
beprepared run test_script.py
```

The workspace uses SQLite with:
- `property_cache`: Cached properties (key, domain, value, timestamp)
- `objects`: Stored image data (objectid, data)
- `migrations`: Schema version tracking
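The schema above can be explored with stdlib `sqlite3`. The sketch below builds a toy in-memory database mirroring the documented `property_cache` columns (column names are taken from this doc; the real database layout may differ) and runs a pattern query similar to what `beprepared db list` does:

```python
import sqlite3

# Toy database mirroring the documented schema (column names assumed from this doc)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE property_cache (key TEXT, domain TEXT, value BLOB, timestamp TEXT)"
)
conn.execute(
    "INSERT INTO property_cache VALUES ('img1', 'caption', 'a cat', '2024-01-01')"
)
conn.execute(
    "INSERT INTO property_cache VALUES ('img1', 'aesthetic_score', '0.92', '2024-01-01')"
)

# List cached properties whose domain matches a pattern
rows = conn.execute(
    "SELECT key, domain FROM property_cache WHERE domain LIKE ?", ("cap%",)
).fetchall()
print(rows)  # [('img1', 'caption')]
```

Clearing cached data for a pattern is the same query shape with `DELETE` in place of `SELECT`.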
- **Immutable Images**: Once loaded, images are treated as immutable
- **Caching**: All expensive operations should be cached
- **Non-destructive**: Never modify original images
- **Progress Feedback**: Use tqdm for long-running operations
- **Web Integration**: Human-in-the-loop tasks automatically launch the web UI
- `beprepared run <workflow.py>`: Execute a workflow file
- `beprepared exec "<pipeline>"`: Quick one-liner execution
- `beprepared db list [pattern]`: List cached properties
- `beprepared db clear [pattern]`: Clear cached data
Key libraries:
- PyTorch & torchvision
- FastAPI & uvicorn (web interface)
- Vue.js (frontend)
- Pillow (image processing)
- OpenAI, LiteLLM (LLM integrations)
- CLIP, transformers (ML models)
- SQLite3 (caching)
- **Exports**: Export new nodes in `beprepared/nodes/__init__.py`
- **Documentation**: Add docstrings for the auto-generated docs
- **Error Handling**: Use try/except and log errors with `self.log.exception()`
- **Testing**: Test with small datasets first
- **GPU Memory**: Be mindful of GPU memory when processing batches
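The error-handling guideline can be sketched with stdlib `logging`, which behaves like a node's `self.log`. The `process_safely` helper is hypothetical; the point is that `log.exception()` records the full traceback while letting the run continue past a bad image:

```python
import logging

log = logging.getLogger("mynode")


def process_safely(image_id: str, process):
    """Run a per-image operation, logging the traceback instead of crashing the run."""
    try:
        return process(image_id)
    except Exception:
        # logging's exception() captures the active traceback automatically
        log.exception("failed to process %s", image_id)
        return None


print(process_safely("ok.png", lambda i: i.upper()))   # OK.PNG
print(process_safely("bad.png", lambda i: 1 / 0))      # None (error logged)
```

Returning a sentinel such as `None` lets the caller decide whether to drop the image or retry, rather than losing the whole batch to one failure.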
```
beprepared/
├── __init__.py          # Main exports
├── cli.py               # CLI interface
├── node.py              # Base Node class
├── dataset.py           # Dataset container
├── image.py             # Image class
├── properties.py        # Property system
├── workspace.py         # Global state & DB
├── web.py               # Web server
├── nodes/               # All node implementations
│   ├── __init__.py      # Node exports
│   ├── load.py          # Load node
│   ├── save.py          # Save node
│   ├── humanfilter.py   # Human filtering
│   ├── humantag.py      # Human tagging
│   └── ...              # Other nodes
└── web/                 # Frontend code
    ├── App.vue          # Main app
    ├── HumanFilter.vue  # Filter UI
    └── HumanTag.vue     # Tag UI
```