# BePrep - Image Dataset Preparation Tool

## Project Overview
BePrep is a Python-based tool for preparing high-quality image datasets for diffusion model fine-tuning. It provides a powerful workflow DSL for processing images with both automated and human-in-the-loop operations.

## Architecture

### Core Components

1. **Node System** (`beprepared/node.py`)
   - Base class for all processing nodes
   - Uses a metaclass to enable DSL patterns like `Node1 >> Node2`
   - Each node processes a dataset and returns a new dataset
   - Supports chaining via the `>>` and `<<` operators

2. **Dataset** (`beprepared/dataset.py`)
   - Container for images being processed
   - Supports copying for non-destructive operations

3. **Image** (`beprepared/image.py`)
   - Represents individual images with properties
   - Uses the PropertyBag pattern for flexible attributes
   - Allowed formats: JPEG, PNG, WEBP, GIF, TIFF, BMP

4. **Properties System** (`beprepared/properties.py`)
   - `PropertyBag`: Base class for objects holding properties
   - `CachedProperty`: Properties cached in the SQLite database
   - `ConstProperty`: Immutable properties
   - `ComputedProperty`: Dynamically computed properties

5. **Workspace** (`beprepared/workspace.py`)
   - Manages global state and the database
   - SQLite database for caching operations
   - Thread-local database connections
   - Object storage for image data

6. **Web Interface** (`beprepared/web.py`, `beprepared/web/`)
   - FastAPI-based web server for human-in-the-loop tasks
   - Vue.js frontend for filtering and tagging interfaces
   - Real-time progress updates via WebSockets

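The `>>` chaining behavior can be illustrated with a minimal, self-contained sketch. This is not the actual implementation in `beprepared/node.py` (which uses a metaclass so that bare class names can be chained); the `Pipeline` and `Add` names here are purely illustrative:

```python
class Node:
    """Minimal illustration of operator-based pipeline chaining."""
    def eval(self, dataset):
        raise NotImplementedError

    def __rshift__(self, other):
        # a >> b builds a pipeline that runs a, then b
        return Pipeline([self, other])

class Pipeline(Node):
    def __init__(self, nodes):
        self.nodes = nodes

    def __rshift__(self, other):
        # Extending a pipeline keeps it flat rather than nesting
        return Pipeline(self.nodes + [other])

    def eval(self, dataset):
        for node in self.nodes:
            dataset = node.eval(dataset)
        return dataset

class Add(Node):
    """Toy node: adds a constant to every element of a list 'dataset'."""
    def __init__(self, n):
        self.n = n

    def eval(self, dataset):
        return [x + self.n for x in dataset]

pipeline = Add(1) >> Add(2) >> Add(3)
print(pipeline.eval([0, 10]))  # [6, 16]
```

The real DSL additionally lets bare class objects participate in `>>` (presumably via the metaclass), which is why the workflow examples below can write `JoyCaptionAlphaOne` without parentheses.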
## Node Categories

### Data Loading & Saving
- `Load`: Load images from directory
- `Save`: Save processed images

### Image Processing
- `ConvertFormat`: Convert image formats
- `Upscale`/`Downscale`: Resize images
- `Anonymize`: Blur faces in images
- `EdgeWatermarkRemoval`: Remove watermarks

### Filtering & Selection
- `FilterBySize`: Filter by image dimensions
- `FilterByAspectRatio`: Filter by aspect ratio
- `HumanFilter`: Manual human filtering via web UI
- `SmartHumanFilter`: Intelligent human filtering

### Captioning (Multiple VLM providers)
- `JoyCaptionAlphaOne`/`JoyCaptionAlphaTwo`: JoyCaption models
- `GPT4oCaption`: OpenAI GPT-4 Vision
- `GeminiCaption`: Google Gemini
- `LlamaCaption`: Meta Llama
- `QwenVLCaption`: Qwen Vision-Language
- `XGenMMCaption`: xGen multimodal
- `Florence2Caption`: Microsoft Florence2
- `MolmoCaption`: Molmo captioning

### Analysis & Scoring
- `NudeNet`: NSFW content detection
- `AestheticScore`: Aesthetic quality scoring
- `ClipEmbed`: CLIP embeddings

### Tagging & Metadata
- `HumanTag`: Manual tagging via web UI
- `AddTags`/`RemoveTags`/`RewriteTags`: Tag manipulation
- `LLMCaptionTransform`: Transform captions using LLMs
- `LLMCaptionVariations`: Generate caption variations

### Deduplication
- `ExactDedupe`: Exact duplicate removal
- `FuzzyDedupe`: CLIP-based fuzzy deduplication

### Utility Nodes
- `Info`: Print dataset information
- `Concat`: Concatenate multiple datasets
- `Take`: Take first N images
- `Sorted`: Sort images by property
- `Shuffle`: Randomize order
- `Map`/`Apply`/`Filter`: Functional operations

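To make the node semantics concrete, here is a self-contained sketch of `ExactDedupe`-style duplicate removal via content hashing. This is illustrative only; the real node's implementation and API may differ:

```python
import hashlib

def exact_dedupe(blobs):
    """Keep the first occurrence of each distinct byte content."""
    seen, unique = set(), []
    for blob in blobs:
        digest = hashlib.sha256(blob).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(blob)
    return unique

# Two distinct contents survive; the repeated b"a" is dropped
print(exact_dedupe([b"a", b"b", b"a"]))  # [b'a', b'b']
```

`FuzzyDedupe` generalizes this idea by comparing CLIP embeddings for near-duplicates instead of exact byte hashes.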
## Creating New Nodes

### Basic Node Template

```python
from beprepared.node import Node
from beprepared.dataset import Dataset
from beprepared.properties import CachedProperty

class MyNode(Node):
    '''Node description for documentation'''

    def __init__(self, param1: str, param2: int = 10):
        '''Initialize the node

        Args:
            param1: Description of parameter 1
            param2: Description of parameter 2 (default: 10)
        '''
        super().__init__()
        self.param1 = param1
        self.param2 = param2

    def eval(self, dataset: Dataset) -> Dataset:
        '''Process the dataset

        Args:
            dataset: Input dataset

        Returns:
            Processed dataset
        '''
        # Process each image
        for image in dataset.images:
            # Access existing properties
            width = image.width.value
            height = image.height.value

            # Add new cached property
            result_prop = CachedProperty('mynode_result', image)
            if not result_prop.has_value:
                # Compute and cache result
                result = self.process_image(image)
                result_prop.value = result

            # Add property to image
            image.mynode_result = result_prop

        return dataset

    def process_image(self, image):
        # Your processing logic here
        pass
```

### Key Patterns

1. **Property Caching**: Use `CachedProperty` to avoid recomputing expensive operations
2. **Non-destructive**: Always return a new dataset or copy
3. **Logging**: Use `self.log` for logging (automatically connected to web UI)
4. **Progress**: Use `tqdm` from `beprepared.nodes.utils` for progress bars
5. **Web Integration**: For human-in-the-loop, see `HumanFilter`/`HumanTag` examples

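Pattern 1 boils down to check-before-compute. A dictionary-backed toy version of the idea (the real `CachedProperty` persists values in the workspace's SQLite database and is keyed per image; `CachedValue` here is purely a stand-in):

```python
class CachedValue:
    """Toy stand-in for CachedProperty: compute once, reuse afterwards."""
    _cache = {}

    def __init__(self, key):
        self.key = key

    @property
    def has_value(self):
        return self.key in self._cache

    @property
    def value(self):
        return self._cache[self.key]

    @value.setter
    def value(self, v):
        self._cache[self.key] = v

calls = []
def expensive(x):
    # Track how many times the "expensive" computation actually runs
    calls.append(x)
    return x * 2

for _ in range(3):
    prop = CachedValue("result:42")
    if not prop.has_value:
        prop.value = expensive(42)

print(prop.value, len(calls))  # 84 1 -- computed once, reused twice
```

Because the cache outlives the loop iterations (and, in the real system, the process), re-running a workflow skips every already-computed property.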
### Parallel Processing Pattern

For GPU-intensive operations, use the `ParallelWorker` pattern:

```python
from beprepared.nodes.parallelworker import ParallelWorker

class MyGPUNode(ParallelWorker):
    def __init__(self):
        super().__init__(num_workers=2)  # Number of GPU workers

    def load_models(self):
        # Load models once per worker
        import torch
        self.model = load_my_model()

    def process_image(self, image):
        # Process single image on GPU
        result = self.model(image)
        return result
```

## Workflow Examples

### Basic Workflow
```python
from beprepared import *

(
    Load("input_images")
    >> FilterBySize(min_edge=512)
    >> JoyCaptionAlphaOne
    >> Save("output")
)
```

### Complex Workflow with Human Tasks
```python
(
    Load("raw_images")
    >> FilterBySize(min_edge=512)
    >> HumanFilter                          # Web UI for filtering
    >> Anonymize                            # Blur faces
    >> JoyCaptionAlphaOne                   # Auto-caption
    >> HumanTag(tags=["style1", "style2"])  # Manual tagging
    >> LLMCaptionTransform(                 # Enhance captions
        system_prompt="Improve this caption",
        user_prompt="Caption: {caption}"
    )
    >> Save("final_dataset")
)
```

## Testing New Nodes

1. Create a test script in the project root:
```python
from beprepared import *
from beprepared.nodes.mynode import MyNode

(
    Load("test_images")
    >> MyNode(param1="test")
    >> Info
    >> Save("test_output")
)
```

2. Run it with the CLI:
```bash
beprepared run test_script.py
```

## Database Schema

The workspace uses SQLite with:
- `property_cache`: Cached properties (key, domain, value, timestamp)
- `objects`: Stored image data (objectid, data)
- `migrations`: Schema version tracking
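The schema can be sketched in plain `sqlite3`. The table and column names follow the list above; the SQL types and key constraints here are assumptions, not the actual DDL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE property_cache (
    key       TEXT NOT NULL,
    domain    TEXT NOT NULL,
    value     BLOB,
    timestamp REAL,
    PRIMARY KEY (key, domain)
);
CREATE TABLE objects (
    objectid TEXT PRIMARY KEY,
    data     BLOB
);
CREATE TABLE migrations (
    version INTEGER PRIMARY KEY
);
""")

# A cached property is a (key, domain) pair mapping to a stored value
conn.execute(
    "INSERT INTO property_cache VALUES (?, ?, ?, ?)",
    ("mynode_result", "image:abc123", b"cached", 0.0),
)
row = conn.execute(
    "SELECT value FROM property_cache WHERE key = ?", ("mynode_result",)
).fetchone()
print(row[0])  # b'cached'
```

This is also the data that `beprepared db list` and `beprepared db clear` operate on.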

## Important Conventions

1. **Immutable Images**: Once loaded, images are considered immutable
2. **Caching**: All expensive operations should be cached
3. **Non-destructive**: Never modify original images
4. **Progress Feedback**: Use tqdm for long operations
5. **Web Integration**: Human tasks automatically launch the web UI

## CLI Commands

- `beprepared run <workflow.py>`: Execute workflow file
- `beprepared exec "<pipeline>"`: Quick one-liner execution
- `beprepared db list [pattern]`: List cached properties
- `beprepared db clear [pattern]`: Clear cached data

## Dependencies

Key libraries:
- PyTorch & torchvision
- FastAPI & uvicorn (web interface)
- Vue.js (frontend)
- Pillow (image processing)
- OpenAI, LiteLLM (LLM integrations)
- CLIP, transformers (ML models)
- SQLite3 (caching)

## Development Tips

1. **Add to __init__.py**: Export new nodes in `beprepared/nodes/__init__.py`
2. **Documentation**: Add docstrings for auto-generated docs
3. **Error Handling**: Use try/except and log errors with `self.log.exception()`
4. **Testing**: Test with small datasets first
5. **GPU Memory**: Be mindful of GPU memory when processing batches
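Tip 3 in practice: wrap per-image work so a single failure is logged and skipped rather than aborting the whole run. This assumes `self.log` behaves like a standard `logging.Logger`; `safe_process` is a hypothetical helper, not part of the library:

```python
import logging

log = logging.getLogger("mynode")  # stand-in for self.log

def safe_process(images, fn):
    # Process each image, logging (with traceback) and skipping failures
    results = []
    for img in images:
        try:
            results.append(fn(img))
        except Exception:
            log.exception("failed to process %r; skipping", img)
    return results

# "bad" + 1 raises TypeError, which is logged and skipped
print(safe_process([1, 2, "bad"], lambda x: x + 1))  # [2, 3]
```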

## Project Structure
```
beprepared/
├── __init__.py         # Main exports
├── cli.py              # CLI interface
├── node.py             # Base Node class
├── dataset.py          # Dataset container
├── image.py            # Image class
├── properties.py       # Property system
├── workspace.py        # Global state & DB
├── web.py              # Web server
├── nodes/              # All node implementations
│   ├── __init__.py     # Node exports
│   ├── load.py         # Load node
│   ├── save.py         # Save node
│   ├── humanfilter.py  # Human filtering
│   ├── humantag.py     # Human tagging
│   └── ...             # Other nodes
└── web/                # Frontend code
    ├── App.vue         # Main app
    ├── HumanFilter.vue # Filter UI
    └── HumanTag.vue    # Tag UI
```