Law firms and pro se litigants frequently download massive batches of disorganized public records, discovery files, and court dockets. These files arrive with cryptic UUID filenames and fragmented metadata, making chronological case review impractical.
The Legal AI Swarm OS is a hybrid multi-threaded application built for macOS Apple Silicon. It acts as an automated digital data engineer. It deploys a swarm of simultaneous bots to ingest chaotic document dumps, execute OCR, extract chronological metadata via cloud LLMs, and then perform offline semantic routing and cryptographic deduplication to build pristine, air-gapped Retrieval-Augmented Generation (RAG) databases.
The Swarm OS features a unified CustomTkinter GUI that routes operations through 4 distinct phases:
- Dual-Engine Extraction: Uses the `google-genai` SDK (`gemini-2.5-flash`) to perform two tasks simultaneously:
  - Chronological Metadata: Extracts Execution Dates and Document Types to rename files to a strict `YYYY-MM-DD_Document_Type.ext` format.
  - Full-Document OCR: Extracts the complete raw text and generates a sibling `.txt` file for lightweight vectorization.
- Format Agnostic: Intercepts less common image formats (Apple `.HEIC` photos, multi-page `.TIFF`s) and converts them to PDFs in the background before AI ingestion.
- Strict Cloud Hygiene: Calls `client.files.delete()` as soon as data is extracted, so that uploaded confidential legal documents are removed from Google's servers immediately rather than left behind.
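The rename format and the delete-after-extract rule above can be sketched as follows. This is a minimal illustration, not the application's actual code: `chronological_name` and `extract_then_delete` are hypothetical helper names, the prompt is a placeholder, and `client` is assumed to be a `google-genai` client object created elsewhere (the sketch only relies on its `files.upload`, `models.generate_content`, and `files.delete` calls).

```python
import re
from datetime import date
from pathlib import Path

MODEL = "gemini-2.5-flash"  # model name taken from the text above


def chronological_name(executed: date, doc_type: str, ext: str) -> str:
    """Build a YYYY-MM-DD_Document_Type.ext filename (hypothetical helper)."""
    # Collapse every run of non-alphanumeric characters into one underscore.
    safe_type = re.sub(r"[^A-Za-z0-9]+", "_", doc_type).strip("_")
    return f"{executed.isoformat()}_{safe_type}.{ext.lstrip('.')}"


def extract_then_delete(client, path: Path, prompt: str) -> str:
    """Upload one file, run one prompt against it, and always delete the upload."""
    uploaded = client.files.upload(file=str(path))
    try:
        response = client.models.generate_content(
            model=MODEL, contents=[uploaded, prompt]
        )
        return response.text
    finally:
        # Runs even when generate_content raises, so no upload is left behind.
        client.files.delete(name=uploaded.name)
```

Putting the delete call in a `finally` block is one way to make the cleanup unconditional; a failed extraction then still removes the uploaded copy.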
- Semantic Routing: Rather than relying on simple keywords, this phase reads the lightweight `.txt` files and asks Gemini to intelligently classify the document's context (e.g., `Medical`, `Legal_Pleadings`, `Corporate`, `Financial`).
- Cost Efficiency: By only sending the first 10,000 characters of the extracted `.txt` file instead of re-uploading massive PDFs, this phase categorizes thousands of documents for fractions of a cent. It moves both the `.txt` and the sibling media file into specific Knowledge Base subfolders.
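A rough sketch of the routing mechanics described above, under stated assumptions: the prompt wording, the `Uncategorized` fallback, and the function names are all hypothetical, and a "sibling" is taken to mean any file in the same folder that shares the `.txt` file's stem. The Gemini call itself is left out; only the truncation, the reply handling, and the paired move are shown.

```python
import shutil
from pathlib import Path

MAX_CHARS = 10_000  # character budget stated in the text above
CATEGORIES = ("Medical", "Legal_Pleadings", "Corporate", "Financial")


def build_routing_prompt(text: str) -> str:
    """Ask for one category, sending only the first MAX_CHARS characters."""
    return (
        "Classify this document into exactly one of: "
        + ", ".join(CATEGORIES)
        + ".\nReply with the category name only.\n\n"
        + text[:MAX_CHARS]
    )


def normalize_category(reply: str, fallback: str = "Uncategorized") -> str:
    """Map a model reply onto a known category, ignoring case and stray markup."""
    cleaned = reply.strip().strip("`*\"'. ")
    for category in CATEGORIES:
        if cleaned.lower() == category.lower():
            return category
    return fallback


def move_with_sibling(txt_path: Path, category: str, kb_root: Path) -> list[Path]:
    """Move the .txt file and every same-stem file into kb_root/category."""
    dest = kb_root / category
    dest.mkdir(parents=True, exist_ok=True)
    # Snapshot the matches first so moving files does not disturb iteration.
    group = sorted(
        p for p in txt_path.parent.iterdir()
        if p.is_file() and p.stem == txt_path.stem
    )
    moved = []
    for item in group:
        target = dest / item.name
        shutil.move(str(item), str(target))
        moved.append(target)
    return moved
```

Validating the reply against a fixed category list keeps an unexpected model answer from creating arbitrary folder names.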
- Fast Keyword Sieve: An entirely offline process that scans the `.txt` files for specific family names, conditions (e.g., "synovial cyst"), and medical terminology.
- Privacy First: Instantly segregates highly sensitive medical records into a quarantined folder without ever connecting to the internet.
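One way such an offline sieve could work is a single compiled, case-insensitive pattern run over each `.txt` file. The sketch below is illustrative only: the term list (apart from "synovial cyst", which comes from the text above) and the function names are hypothetical, and it moves only the `.txt` file rather than any sibling media.

```python
import re
import shutil
from pathlib import Path

# Hypothetical term list; "Doe" stands in for a family name.
SENSITIVE_TERMS = ("synovial cyst", "radiology", "Doe")


def compile_terms(terms) -> re.Pattern:
    """Build one whole-word, case-insensitive pattern covering every term."""
    # Allow any whitespace (including line breaks) between words of a phrase.
    parts = [r"\s+".join(re.escape(word) for word in term.split()) for term in terms]
    return re.compile(r"\b(?:" + "|".join(parts) + r")\b", re.IGNORECASE)


def sieve_folder(folder: Path, quarantine: Path, pattern: re.Pattern) -> list[Path]:
    """Move every .txt file that matches the pattern into the quarantine folder."""
    quarantine.mkdir(parents=True, exist_ok=True)
    moved = []
    for txt in sorted(folder.glob("*.txt")):
        text = txt.read_text(encoding="utf-8", errors="ignore")
        if pattern.search(text):
            target = quarantine / txt.name
            shutil.move(str(txt), str(target))
            moved.append(target)
    return moved
```

Nothing here touches the network: the scan is plain local file reads and regular-expression matching.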
- SHA-256 Hashing: Computes a SHA-256 digest of every file in the directory locally, giving each file a fingerprint of its exact binary content.
- RAG Hygiene: If it finds an exact byte-for-byte duplicate (even if the filenames are completely different), it quarantines the duplicate media file and its sibling `.txt` file into a `Duplicates_Bin` to prevent poisoning the downstream RAG vector database.
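The deduplication pass above might look roughly like this. It is a sketch under assumptions: only non-`.txt` files are hashed (so that the sibling text follows its media file rather than being judged on its own), the first file in sorted order is the one kept, and the function names are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large PDFs are never loaded whole into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def quarantine_duplicates(folder: Path, bin_name: str = "Duplicates_Bin") -> list[str]:
    """Move byte-identical media files and their sibling .txt into the bin."""
    seen: dict[str, Path] = {}
    moved: list[str] = []
    media = sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() != ".txt"
    )
    for path in media:
        fingerprint = sha256_of(path)
        if fingerprint not in seen:
            seen[fingerprint] = path  # first occurrence is kept in place
            continue
        bin_dir = folder / bin_name
        bin_dir.mkdir(exist_ok=True)
        for victim in (path, path.with_suffix(".txt")):
            if victim.exists():
                shutil.move(str(victim), str(bin_dir / victim.name))
                moved.append(victim.name)
    return moved
```

Because the comparison is on the digest of the file contents, two copies with unrelated filenames are still caught, while two different scans of the same page are not.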
- Multi-Threaded Swarm: Utilizes Python's `concurrent.futures.ThreadPoolExecutor` strictly capped at 8 concurrent bots to maximize processing speed while respecting Google API rate limits.
- Thread-Safe Locks: Implements `threading.Lock()` to prevent file-move collisions and CSV write-crashes when multiple bots access the same directories simultaneously.
- Ghost Correction Regex: A built-in sanitization engine that strips conversational AI filler (e.g., "Here is the extracted text:") and stray markdown formatting from the OCR output.
- Immutable Ledgering: Every action (renames, categorizations, and deduplications) is permanently logged to local CSV files (`document_catalog_gemini.csv` and `duplicate_log.csv`) for forensic auditing.
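The four pieces above fit together in a small pattern: a capped thread pool, a shared lock around the ledger, and a cleanup step applied to each OCR result. The sketch below is a simplified illustration, assuming hypothetical column names, a hypothetical `ocr_cleaned` action label, and two deliberately narrow regular expressions; the application's real patterns and ledger layout are not given in the text.

```python
import csv
import re
import threading
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

MAX_BOTS = 8  # concurrency cap stated in the text above
ledger_lock = threading.Lock()

# Leading filler such as "Here is the extracted text:" (illustrative pattern).
FILLER = re.compile(r"\A(?:here is|here's|below is)[^\n]*:[ \t]*\n*", re.IGNORECASE)
# An opening or closing markdown code fence wrapped around the whole output.
FENCES = re.compile(r"\A`{3}[a-zA-Z]*\n|\n`{3}\Z")


def strip_ghost_text(raw: str) -> str:
    """Remove conversational filler and wrapping code fences from OCR output."""
    text = FILLER.sub("", raw.strip())
    return FENCES.sub("", text.strip()).strip()


def log_action(ledger: Path, row: list[str]) -> None:
    """Append one row to the CSV ledger; the lock serializes concurrent writers."""
    with ledger_lock:
        is_new = not ledger.exists()
        with ledger.open("a", newline="", encoding="utf-8") as fh:
            writer = csv.writer(fh)
            if is_new:
                writer.writerow(["file", "action", "chars"])  # hypothetical columns
            writer.writerow(row)


def run_swarm(raw_outputs: dict[str, str], ledger: Path) -> dict[str, str]:
    """Clean every OCR result on at most MAX_BOTS threads and log each action."""
    def bot(item):
        name, raw = item
        cleaned = strip_ghost_text(raw)
        log_action(ledger, [name, "ocr_cleaned", str(len(cleaned))])
        return name, cleaned

    with ThreadPoolExecutor(max_workers=MAX_BOTS) as pool:
        return dict(pool.map(bot, raw_outputs.items()))
```

Holding the lock for the whole open-write-close sequence means the header check and the append cannot interleave between threads, at the cost of writers briefly waiting on one another.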
- Ensure your Google API Key is pasted into the GUI (it will securely save to `~/.legal_sorter_api_key.txt`).
- Select your Master Source folder.
- Select your Target folder (or select the same folder for in-place deduplication/routing).
- Select the desired Pipeline Phase and click Run Swarm Engine.
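How the GUI stores the key is not described beyond the file path in the first step. As a minimal sketch of one reasonable approach on macOS, the helpers below write the key with owner-only permissions; `save_api_key` and `load_api_key` are hypothetical names, and the owner-only file mode is an assumption about what "securely" could mean here.

```python
import os
import stat
from pathlib import Path
from typing import Optional

# Path taken from the setup step above.
DEFAULT_KEY_FILE = Path.home() / ".legal_sorter_api_key.txt"


def save_api_key(key: str, path: Path = DEFAULT_KEY_FILE) -> None:
    """Write the key so that only the current user can read or modify it."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        fh.write(key.strip())
    # The mode passed to os.open is ignored for existing files, so enforce it.
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)


def load_api_key(path: Path = DEFAULT_KEY_FILE) -> Optional[str]:
    """Return the saved key, or None when nothing usable has been saved."""
    if not path.exists():
        return None
    return path.read_text(encoding="utf-8").strip() or None
```

The key is still stored as plain text, so file permissions are the only protection in this sketch.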