This project provides a series of workflows and tools for data cleaning, developed based on the ComfyUI platform.
This project depends on ComfyUI. Please first clone and install it:
git clone https://github.com/comfyanonymous/ComfyUI.git
Please install the relevant dependencies according to the ComfyUI project documentation.
First, clone this project's code repository and initialize submodules:
git clone https://github.com/LikeSwim/COCLP.git
cd COCLP/
git submodule update --init --recursive
Enter the project directory and install dependencies:
cd COCLP/Corpus_Cleansing_Pipeline/
pip install -r requirements.txt
This project depends on the following plugins or modules:
MinerU
rgthree-comfy
ComfyUI-to-Python-Extension
Comfyui-LG_GroupExecutor
faster-whisper
Please install the corresponding dependencies according to each project's documentation.
If you encounter the following error when using faster-whisper:
Could not load library libcudnn_ops_infer.so.8
Unable to load any of {libcudnn_cnn.so.9.1.0, libcudnn_cnn.so.9.1, libcudnn_cnn.so.9, libcudnn_cnn.so}
libcudnn_ops_infer.so.8: cannot open shared object file: No such file or directory
This indicates that the system lacks the CUDA Deep Neural Network library (cuDNN).
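The diagnosis above can be confirmed before and after installing cuDNN by asking the dynamic loader directly. The following is a minimal sketch, not part of this project: the helper name `first_loadable` and the candidate list are illustrative assumptions, with the library names taken from the error message above.

```python
import ctypes

# Library names copied from the error message above (assumption: adjust
# them to match the cuDNN version actually installed on the system).
CUDNN_CANDIDATES = [
    "libcudnn_ops_infer.so.8",
    "libcudnn_cnn.so.9",
    "libcudnn_cnn.so",
]


def first_loadable(names, loader=ctypes.CDLL):
    """Return the first library name the dynamic loader can open, or None."""
    for name in names:
        try:
            loader(name)
        except OSError:
            # The loader could not find or open this library; try the next one.
            continue
        return name
    return None


if __name__ == "__main__":
    found = first_loadable(CUDNN_CANDIDATES)
    if found:
        print(f"cuDNN is loadable as: {found}")
    else:
        print("No cuDNN library could be loaded")
```

The loader searches the directories in LD_LIBRARY_PATH as it was set when the Python process started, so the check should be rerun in a fresh shell after changing that variable.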
Install cuDNN (example for apt-based systems):
sudo apt update
sudo apt install libcudnn8 libcudnn8-dev -y
Locate the cuDNN .so files on the local system:
find / -name "libcudnn_ops*.so*" 2>/dev/null
Specify the cuDNN path:
export LD_LIBRARY_PATH=/path/to/your/cudnn/lib:$LD_LIBRARY_PATH
Put the following folders into ComfyUI's custom_nodes folder:
rgthree-comfy
ComfyUI-to-Python-Extension
Comfyui-LG_GroupExecutor
Corpus_Cleansing_Pipeline
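Before starting ComfyUI, it can help to confirm that all four folders listed above are in place. The following is a minimal sketch under stated assumptions: the helper name `missing_custom_nodes` is hypothetical, and the script is assumed to run from the directory that contains the ComfyUI checkout.

```python
from pathlib import Path

# Folder names taken from the list above.
REQUIRED_NODES = [
    "rgthree-comfy",
    "ComfyUI-to-Python-Extension",
    "Comfyui-LG_GroupExecutor",
    "Corpus_Cleansing_Pipeline",
]


def missing_custom_nodes(comfyui_root, required=REQUIRED_NODES):
    """Return the required folder names not found under <comfyui_root>/custom_nodes."""
    custom_nodes = Path(comfyui_root) / "custom_nodes"
    return [name for name in required if not (custom_nodes / name).is_dir()]


if __name__ == "__main__":
    # Assumption: ComfyUI was cloned into ./ComfyUI; change the path if not.
    missing = missing_custom_nodes("ComfyUI")
    if missing:
        print("Missing from custom_nodes:", ", ".join(missing))
    else:
        print("All required custom nodes are present")
```

The check only verifies that the folders exist; it does not confirm that each plugin's own dependencies are installed.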
Then start ComfyUI:
cd ComfyUI/
python main.py
In the example folder, the following workflow examples are provided:
Image processing workflow
PDF document processing workflow
DOCX document processing workflow
File decompression and classification workflow
Data privacy processing workflow
These workflow files can be imported into ComfyUI and run directly.
In the ComfyUI interface:
Click the top-left menu: Workflow ➡ Save as Script
Save the workflow as an executable .py file
Run in the terminal:
python your_workflow_script.py
Thanks to the following projects or individuals for their support:
Thanks to the ComfyUI, MinerU, ComfyUI-to-Python-Extension, Comfyui-LG_GroupExecutor, and faster-whisper teams for providing platform support.