This document contains structured feature requests formatted for GitHub Issues. Copy each section when creating issues in the repository.
- Good First Issues
- Pipeline & Processing
- API Enhancements
- Integrations
- User Experience
- Performance & Scalability
- Security
- Documentation
- Testing & Quality
Labels: good first issue, enhancement, code-quality
Description: HTTP client timeouts are currently hardcoded across multiple services. These should be unified and made configurable.
Current Behavior:
- Timeouts scattered: 30s in some places, 10s in others, 300s for Open WebUI
- No central configuration
Desired Behavior:
- Single http_client.py utility with configurable defaults
- Timeout settings in settings.yaml
Files to modify:
- Create dedox/utils/http_client.py
- Update dedox/services/paperless_service.py
- Update dedox/services/openwebui_sync_service.py
- Update dedox/pipeline/llm_extractor.py
Acceptance Criteria:
- Centralized HTTP client factory function
- Configurable timeout in settings.yaml
- All services use the shared client
- Tests pass
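One way the central timeout lookup could work is sketched below. This is a minimal illustration, not the DeDox implementation: the `http_client` settings section, its key names, and the `resolve_http_config` helper are all assumptions made for the example, and the dictionary stands in for a parsed settings.yaml.

```python
from dataclasses import dataclass

# Hypothetical shared default; the real value would come from settings.yaml.
DEFAULT_TIMEOUT_SECONDS = 30.0


@dataclass(frozen=True)
class HttpClientConfig:
    """Resolved settings for one outbound HTTP client."""
    service: str
    timeout_seconds: float


def resolve_http_config(settings: dict, service: str) -> HttpClientConfig:
    """Look up a per-service timeout, falling back to the shared default."""
    http = settings.get("http_client", {})
    default = float(http.get("default_timeout_seconds", DEFAULT_TIMEOUT_SECONDS))
    timeout = float(http.get("timeouts", {}).get(service, default))
    if timeout <= 0:
        raise ValueError(f"timeout for {service!r} must be positive, got {timeout}")
    return HttpClientConfig(service=service, timeout_seconds=timeout)


# Parsed equivalent of a hypothetical settings.yaml section.
settings = {
    "http_client": {
        "default_timeout_seconds": 30,
        "timeouts": {"openwebui": 300, "paperless": 10},
    }
}
print(resolve_http_config(settings, "openwebui"))  # per-service override
print(resolve_http_config(settings, "ollama"))     # falls back to the default
```

A client factory in dedox/utils/http_client.py could then build each service's client from the resolved config, so no call site carries its own literal timeout.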
Labels: good first issue, enhancement, observability
Description: Add correlation IDs to log messages to track document processing across services.
Current Behavior:
- Basic logging without request/job correlation
- Difficult to trace a single document's processing journey
Desired Behavior:
- Each document processing job gets a unique correlation ID
- All log messages include the correlation ID
- Logs can be filtered by correlation ID
Acceptance Criteria:
- Correlation ID generated at job creation
- ID passed through all pipeline stages
- Structured JSON logging option
- Documentation updated
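The criteria above could be met with the standard library alone. The sketch below assumes a context variable holds the ID for the job being processed; the `start_job` helper, the logger name, and the JSON field names are illustrative choices, not existing DeDox code.

```python
import contextvars
import json
import logging
import uuid

# Holds the correlation ID of the job currently being processed.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationIdFilter(logging.Filter):
    """Copy the current correlation ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        return True


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": getattr(record, "correlation_id", "-"),
        })


def start_job() -> str:
    """Generate a correlation ID at job creation and bind it to the context."""
    job_correlation_id = uuid.uuid4().hex
    correlation_id.set(job_correlation_id)
    return job_correlation_id


handler = logging.StreamHandler()
handler.addFilter(CorrelationIdFilter())
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("dedox.pipeline")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

job_id = start_job()
logger.info("OCR stage started")  # emitted as JSON with the job's correlation ID
```

Because context variables follow asyncio tasks, pipeline stages running inside the job's task would not need the ID passed explicitly; work handed to other processes would still have to carry it in the job payload.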
Labels: good first issue, enhancement, operations
Description:
Extend /health endpoint to report connectivity status of Paperless-ngx, Ollama, and Open WebUI.
Current Behavior:
- Basic health check only verifies DeDox is running
Desired Behavior:
{
"status": "healthy",
"services": {
"paperless": {"status": "connected", "latency_ms": 45},
"ollama": {"status": "connected", "model": "qwen2.5:14b"},
"openwebui": {"status": "connected"}
}
}
Acceptance Criteria:
- Health check tests all external service connections
- Returns degraded status if optional services unavailable
- Returns unhealthy if required services unavailable
- Response includes latency measurements
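The healthy/degraded/unhealthy rule can be expressed as a small pure function. The sketch below assumes Paperless-ngx is the only required service; which services count as required is not stated above and would need to be decided.

```python
from typing import Dict

# Assumption for illustration: Paperless-ngx is required, the others are optional.
REQUIRED_SERVICES = {"paperless"}


def overall_status(services: Dict[str, dict]) -> str:
    """Collapse per-service results into healthy, degraded, or unhealthy."""
    down = {name for name, info in services.items() if info.get("status") != "connected"}
    if down & REQUIRED_SERVICES:
        return "unhealthy"
    if down:
        return "degraded"
    return "healthy"


def build_health_response(services: Dict[str, dict]) -> dict:
    """Wrap the per-service checks in the response shape shown above."""
    return {"status": overall_status(services), "services": services}


checks = {
    "paperless": {"status": "connected", "latency_ms": 45},
    "ollama": {"status": "unreachable"},
    "openwebui": {"status": "connected"},
}
print(build_health_response(checks)["status"])  # degraded
```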
Labels: good first issue, enhancement, api
Description: Create an API endpoint that returns processing statistics.
Endpoint: GET /api/stats
Response:
{
"total_documents": 1234,
"documents_by_status": {
"completed": 1200,
"processing": 5,
"failed": 29
},
"documents_by_type": {
"invoice": 450,
"contract": 200,
"letter": 300
},
"avg_processing_time_seconds": 12.5,
"avg_confidence_score": 0.87
}
Acceptance Criteria:
- Endpoint returns accurate statistics
- Statistics cached for performance (5 min TTL)
- Filterable by date range
- Documentation added
Labels: enhancement, feature, pipeline
Description: Detect and flag potentially duplicate documents before processing to avoid redundant work and storage.
Proposed Solution:
- Calculate file hash (SHA-256) during upload
- Store hash in database
- Check for existing hash before processing
- If duplicate found:
  - Tag document with dedox:potential-duplicate
  - Link to original document
  - Skip processing (configurable)
Configuration:
processing:
  duplicate_detection:
    enabled: true
    action: "tag" # tag, skip, or ask
    similarity_threshold: 0.95 # for fuzzy matching
Acceptance Criteria:
- Exact duplicate detection via hash
- Optional fuzzy duplicate detection via OCR text similarity
- Configurable action (tag, skip, prompt review)
- API endpoint to manually mark as not-duplicate
- Tests with duplicate and similar documents
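Both detection modes can be sketched with the standard library. In the example below the in-memory set stands in for the database column of stored hashes, and `difflib` is only a placeholder similarity measure; the function names are illustrative.

```python
import hashlib
import io
from difflib import SequenceMatcher
from typing import BinaryIO


def sha256_of_stream(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Hash an upload in chunks so large files are never fully in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def is_fuzzy_duplicate(text_a: str, text_b: str, threshold: float = 0.95) -> bool:
    """Compare OCR text; difflib is a stand-in for a real similarity measure."""
    return SequenceMatcher(None, text_a, text_b).ratio() >= threshold


known_hashes = set()  # stand-in for the stored hashes in the database


def check_upload(stream: BinaryIO) -> bool:
    """Return True when exactly the same bytes were seen before."""
    file_hash = sha256_of_stream(stream)
    if file_hash in known_hashes:
        return True
    known_hashes.add(file_hash)
    return False


print(check_upload(io.BytesIO(b"%PDF-1.7 invoice")))  # False, first upload
print(check_upload(io.BytesIO(b"%PDF-1.7 invoice")))  # True, same bytes again
```

`SequenceMatcher` compares one pair at a time and scales poorly with text length, so a real fuzzy check over a large archive would likely need a fingerprinting scheme instead.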
Labels: enhancement, feature, pipeline
Description: Add a pipeline stage that validates extracted metadata against configurable business rules.
Use Cases:
- Invoice amount must be positive
- Due date must be after document date
- Contract must have both parties identified
- Tax documents must have tax year
Configuration:
# config/validation_rules.yaml
rules:
  - document_type: invoice
    validations:
      - field: total_amount
        rule: greater_than
        value: 0
        severity: error
      - field: due_date
        rule: after_field
        reference: document_date
        severity: warning
Acceptance Criteria:
- Configurable validation rules per document type
- Support for: required, regex, range, date comparisons, cross-field
- Validation failures flagged for review
- Severity levels: error (blocks), warning (flags)
- Validation report in metadata
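A rule evaluator for the two rule types in the sample configuration might look like the sketch below. It assumes the rules have already been loaded into dictionaries and that a missing field counts as a failure; both are choices made for illustration.

```python
from datetime import date


def validate(metadata: dict, rules: list) -> list:
    """Return one finding per failed rule; an empty list means all rules passed."""
    findings = []
    for rule in rules:
        field, kind = rule["field"], rule["rule"]
        value = metadata.get(field)
        if value is None:
            ok = False  # assumption: a missing field fails its rule
        elif kind == "greater_than":
            ok = value > rule["value"]
        elif kind == "after_field":
            reference = metadata.get(rule["reference"])
            ok = reference is not None and value > reference
        else:
            raise ValueError(f"unknown rule type: {kind}")
        if not ok:
            findings.append({"field": field, "rule": kind, "severity": rule["severity"]})
    return findings


invoice_rules = [
    {"field": "total_amount", "rule": "greater_than", "value": 0, "severity": "error"},
    {"field": "due_date", "rule": "after_field", "reference": "document_date",
     "severity": "warning"},
]
metadata = {"total_amount": -5, "document_date": date(2024, 1, 15),
            "due_date": date(2024, 1, 10)}
for finding in validate(metadata, invoice_rules):
    print(finding)
```

The pipeline stage could block on any finding with severity error and only flag the document for review on warnings, attaching the full list as the validation report.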
Labels: enhancement, feature, llm
Description: Allow switching between different LLM providers beyond Ollama.
Providers to support:
- Ollama (current)
- OpenAI API
- Anthropic Claude API
- Azure OpenAI
- Local llama.cpp
Configuration:
llm:
  provider: "openai" # ollama, openai, anthropic, azure, llamacpp
  api_key: "${LLM_API_KEY}"
  model: "gpt-4o"
  # Provider-specific options
  openai:
    organization: ""
  anthropic:
    version: "2024-01-01"
Acceptance Criteria:
- Abstract LLM interface
- Provider implementations
- Graceful fallback if provider unavailable
- Cost tracking per provider
- Documentation for each provider
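The abstract interface and the fallback behavior could be structured as below. The class names and the single `complete` method are assumptions for the sketch, and the two stub providers only simulate an unavailable and an available backend; no vendor API is called.

```python
from abc import ABC, abstractmethod
from typing import Sequence


class ProviderUnavailable(Exception):
    """Raised when a provider cannot be reached or refuses the request."""


class LLMProvider(ABC):
    """Minimal interface every provider implementation would satisfy."""

    name = "abstract"

    @abstractmethod
    def complete(self, prompt: str) -> str:
        """Return the model's text response for the prompt."""


class FallbackLLM(LLMProvider):
    """Try providers in order and return the first successful response."""

    name = "fallback"

    def __init__(self, providers: Sequence[LLMProvider]):
        if not providers:
            raise ValueError("at least one provider is required")
        self.providers = list(providers)

    def complete(self, prompt: str) -> str:
        errors = []
        for provider in self.providers:
            try:
                return provider.complete(prompt)
            except ProviderUnavailable as exc:
                errors.append(f"{provider.name}: {exc}")
        raise ProviderUnavailable("; ".join(errors))


# Stub providers for the demonstration; real ones would call the vendor APIs.
class OfflineProvider(LLMProvider):
    name = "offline-stub"

    def complete(self, prompt: str) -> str:
        raise ProviderUnavailable("connection refused")


class EchoProvider(LLMProvider):
    name = "echo-stub"

    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"


llm = FallbackLLM([OfflineProvider(), EchoProvider()])
print(llm.complete("Extract the sender"))  # echo: Extract the sender
```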
Labels: enhancement, feature, pipeline
Description: Assess OCR quality before metadata extraction and suggest improvements.
Features:
- Calculate average confidence score
- Detect common issues (skew, blur, low resolution)
- Suggest image preprocessing
- Flag for manual review if quality below threshold
Output:
{
"ocr_quality": {
"overall_score": 0.72,
"issues": ["low_contrast", "slight_skew"],
"suggestions": ["increase_contrast", "deskew"],
"confidence_by_region": {...}
}
}
Acceptance Criteria:
- Quality score calculation
- Issue detection (skew, blur, contrast, resolution)
- Configurable quality threshold
- Auto-retry with preprocessing (optional)
- Quality metrics stored with document
Labels: enhancement, feature, api
Description: Allow uploading and processing multiple documents in a single request.
Endpoint: POST /api/documents/batch
Features:
- Upload multiple files (multipart)
- Process as single job with sub-tasks
- Progress tracking per document
- Bulk metadata application
- ZIP file upload support
Acceptance Criteria:
- Batch upload endpoint
- ZIP file extraction
- Progress tracking per document
- Batch job status endpoint
- Configurable batch size limit
- Error handling (partial failures)
Labels: enhancement, feature, pipeline
Description: Allow reprocessing documents with custom or modified extraction prompts.
Use Case:
- Initial extraction missed important field
- User wants to try different prompt
- Testing prompt improvements
Endpoint: POST /api/documents/{id}/reprocess
Request:
{
"stages": ["llm_extraction"],
"custom_prompts": {
"sender": "Extract the company name that issued this invoice..."
},
"preserve_existing": true
}
Acceptance Criteria:
- Selective stage reprocessing
- Custom prompt override
- Option to preserve or replace existing metadata
- Comparison view (old vs new extraction)
- Prompt effectiveness tracking
Labels: enhancement, feature, api
Description: Implement WebSocket endpoints for real-time job progress updates instead of polling.
Endpoint: WS /api/ws/jobs/{job_id}
Events:
{"event": "stage_started", "stage": "ocr", "timestamp": "..."}
{"event": "stage_progress", "stage": "ocr", "progress": 50}
{"event": "stage_completed", "stage": "ocr", "duration_ms": 2500}
{"event": "job_completed", "status": "success", "document_id": "..."}
Acceptance Criteria:
- WebSocket endpoint for job progress
- Events for stage start/progress/complete
- Error event with details
- Connection management (heartbeat, reconnect)
- Fallback to polling for clients without WS
Labels: enhancement, feature, api
Description: Export documents with their metadata in various formats.
Endpoint: GET /api/documents/export
Formats:
- JSON (metadata only)
- CSV (tabular metadata)
- PDF (document with metadata cover page)
- ZIP (documents + metadata files)
Query Parameters:
- format: json, csv, pdf, zip
- ids: comma-separated document IDs (or all)
- include_ocr_text: boolean
- date_from, date_to: filter by date range
Acceptance Criteria:
- Export single document
- Bulk export with filters
- All format options
- Streaming for large exports
- Export job for very large datasets
Labels: enhancement, feature, api
Description: Allow users to save and reuse search queries and filter combinations.
Endpoints:
- POST /api/saved-searches - Create saved search
- GET /api/saved-searches - List saved searches
- GET /api/saved-searches/{id}/execute - Run saved search
Model:
{
"id": "uuid",
"name": "Unpaid Invoices",
"query": {
"document_type": "invoice",
"payment_status": "unpaid",
"due_date_before": "today"
},
"notify_on_match": true
}
Acceptance Criteria:
- CRUD for saved searches
- Execute saved search
- Optional notification on new matches
- Share searches between users
- Pre-built system searches (needs review, high urgency, etc.)
Labels: enhancement, feature, data-management
Description: Track all changes to document metadata with ability to view history and revert.
Features:
- Automatic versioning on metadata change
- View revision history
- Compare revisions
- Revert to previous version
- Track who made changes
Endpoint: GET /api/documents/{id}/revisions
Response:
{
"revisions": [
{
"revision": 3,
"timestamp": "2024-01-15T10:30:00Z",
"changed_by": "user@example.com",
"changes": {
"sender": {"old": "ACME Inc", "new": "ACME Corporation"}
}
}
]
}
Acceptance Criteria:
- Automatic revision creation on update
- Revision history endpoint
- Diff between revisions
- Revert endpoint
- Configurable retention (keep last N revisions)
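The changes object in the response above can be produced by a field-by-field diff. The sketch below is illustrative only: it assumes metadata is a flat dictionary and that the revision history is a list ordered oldest first, and the helper names are made up for the example.

```python
from typing import List, Optional


def diff_metadata(old: dict, new: dict) -> dict:
    """Map each changed field to its old and new value."""
    changes = {}
    for key in sorted(old.keys() | new.keys()):
        before, after = old.get(key), new.get(key)
        if before != after:
            changes[key] = {"old": before, "new": after}
    return changes


def append_revision(history: List[dict], old: dict, new: dict, changed_by: str,
                    timestamp: str, keep_last: Optional[int] = None) -> List[dict]:
    """Return the history with a new revision added, trimmed to keep_last entries."""
    changes = diff_metadata(old, new)
    if not changes:
        return history  # nothing changed, so no new revision
    number = history[-1]["revision"] + 1 if history else 1
    entry = {"revision": number, "timestamp": timestamp,
             "changed_by": changed_by, "changes": changes}
    updated = history + [entry]
    return updated[-keep_last:] if keep_last else updated


history = append_revision([], {"sender": "ACME Inc"}, {"sender": "ACME Corporation"},
                          "user@example.com", "2024-01-15T10:30:00Z")
print(history[0]["changes"])
```

Reverting to a revision would then mean replaying the recorded old values, which itself creates a new revision rather than rewriting history.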
Labels: enhancement, feature, search
Description: Enable searching within the OCR-extracted text content of documents.
Features:
- Full-text search index on ocr_text
- Highlight matching text in results
- Search within specific document types
- Combine with metadata filters
Endpoint: GET /api/documents/search
Query:
{
"text_query": "payment overdue",
"document_type": "invoice",
"highlight": true
}
Technical Notes:
- SQLite FTS5 for full-text indexing
- Consider PostgreSQL migration for production scale
Acceptance Criteria:
- Full-text index on OCR content
- Search endpoint with text query
- Result highlighting
- Combined text + metadata search
- Pagination for results
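As a rough sketch of the SQLite FTS5 approach, the example below indexes OCR text in an in-memory database and combines a text query with a metadata filter, highlighting, and pagination. The table and column names are assumptions, and it requires a Python build whose SQLite includes the FTS5 extension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# UNINDEXED columns are stored for filtering but excluded from the text index.
conn.execute(
    "CREATE VIRTUAL TABLE documents_fts "
    "USING fts5(doc_id UNINDEXED, document_type UNINDEXED, ocr_text)"
)
conn.executemany(
    "INSERT INTO documents_fts VALUES (?, ?, ?)",
    [
        ("doc-1", "invoice", "Reminder: your payment is overdue since March"),
        ("doc-2", "letter", "Thank you for your payment"),
        ("doc-3", "letter", "payment overdue notice"),
    ],
)


def search(text_query: str, document_type=None, limit: int = 20, offset: int = 0):
    """Full-text search with an optional type filter and paginated results."""
    # highlight() wraps matches in column 2 (ocr_text) with the given markers.
    sql = ("SELECT doc_id, highlight(documents_fts, 2, '[', ']') "
           "FROM documents_fts WHERE documents_fts MATCH ?")
    params = [text_query]
    if document_type is not None:
        sql += " AND document_type = ?"
        params.append(document_type)
    sql += " ORDER BY rank LIMIT ? OFFSET ?"
    params += [limit, offset]
    return conn.execute(sql, params).fetchall()


for doc_id, snippet in search("payment overdue", document_type="invoice"):
    print(doc_id, snippet)
```

The query string is passed to FTS5 as-is, so user input containing FTS5 operators or unbalanced quotes would need escaping or validation before reaching `MATCH`.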
Labels: enhancement, feature, integration
Description: Process documents received via email attachments.
Features:
- IMAP/POP3 mailbox monitoring
- Extract PDF/image attachments
- Use email metadata (sender, subject, date)
- Auto-categorize by sender domain
- Mark emails as processed
Configuration:
email:
  enabled: true
  protocol: imap
  server: imap.example.com
  username: "${EMAIL_USER}"
  password: "${EMAIL_PASSWORD}"
  folder: "INBOX"
  poll_interval_seconds: 300
  mark_as_read: true
  move_to_folder: "Processed"
Acceptance Criteria:
- IMAP connection and monitoring
- Attachment extraction
- Email metadata mapping
- Duplicate email detection
- Error handling for malformed emails
Labels: enhancement, feature, integration
Description: Sync documents from cloud storage providers.
Providers:
- AWS S3
- Google Cloud Storage
- Azure Blob Storage
- MinIO (S3-compatible)
Features:
- Watch bucket/container for new files
- Download and process new documents
- Optional upload of processed metadata
- Sync status tracking
Configuration:
cloud_storage:
  provider: s3
  bucket: my-documents
  prefix: inbox/
  aws_access_key: "${AWS_ACCESS_KEY}"
  aws_secret_key: "${AWS_SECRET_KEY}"
  region: eu-west-1
  poll_interval_seconds: 60
Acceptance Criteria:
- S3 integration
- GCS integration
- Azure Blob integration
- New file detection
- Processed file tracking
- Error handling and retry
Labels: enhancement, feature, integration
Description: Send notifications to Slack or Discord for document events.
Events:
- Document processing complete
- Document needs review
- High urgency document detected
- Processing error
- Daily summary
Configuration:
notifications:
  slack:
    enabled: true
    webhook_url: "${SLACK_WEBHOOK_URL}"
    channel: "#documents"
    events:
      - needs_review
      - high_urgency
      - error
Acceptance Criteria:
- Slack webhook integration
- Discord webhook integration
- Configurable events
- Rich message formatting
- Rate limiting to prevent spam
Labels: enhancement, feature, integration
Description: Sync metadata changes made in Paperless-ngx back to DeDox.
Current State:
- DeDox → Paperless only
- Manual edits in Paperless not reflected in DeDox
Proposed:
- Periodic sync of Paperless metadata
- Webhook on Paperless document update
- Conflict resolution strategy
- Audit trail of sync changes
Acceptance Criteria:
- Detect changes made in Paperless
- Update DeDox metadata from Paperless
- Conflict detection (both modified)
- Configurable conflict resolution
- Sync status tracking
Labels: enhancement, feature, integration
Description: Manage Open WebUI knowledge bases from DeDox.
Features:
- List all knowledge bases
- Create/delete knowledge bases
- Move documents between KBs
- View KB statistics
- Manage KB permissions
Endpoints:
- GET /api/openwebui/knowledge-bases
- POST /api/openwebui/knowledge-bases
- DELETE /api/openwebui/knowledge-bases/{id}
- POST /api/openwebui/knowledge-bases/{id}/documents
Acceptance Criteria:
- KB listing endpoint
- KB creation/deletion
- Document assignment to KB
- Permission management
- Sync status per KB
Labels: enhancement, feature, ui
Description: Create a dedicated interface for reviewing documents that need attention.
Features:
- Queue of documents needing review
- Side-by-side document view and metadata editor
- Quick actions (approve, reject, flag)
- Keyboard shortcuts for efficiency
- Bulk approval/rejection
Review Reasons:
- Low confidence extraction
- Missing required fields
- Potential duplicate
- Validation errors
Acceptance Criteria:
- Review queue endpoint
- Review UI page
- Document preview
- Inline metadata editing
- Keyboard navigation
- Review history
Labels: enhancement, feature, ui
Description: Generate and serve document previews/thumbnails.
Features:
- Thumbnail generation for list views
- Full preview for detail views
- PDF page previews
- Image optimization for web
Endpoints:
- GET /api/documents/{id}/thumbnail
- GET /api/documents/{id}/preview
- GET /api/documents/{id}/preview/page/{page}
Acceptance Criteria:
- Thumbnail generation on upload
- Preview endpoint with caching
- PDF multi-page preview
- Image format optimization
- Lazy loading support
Labels: enhancement, feature, ui
Description: Show a visual timeline of document processing stages.
Features:
- Stage-by-stage breakdown
- Duration per stage
- Error points highlighted
- Retry indicators
- Expandable stage details
API Response:
{
"timeline": [
{"stage": "upload", "started": "...", "completed": "...", "duration_ms": 150},
{"stage": "ocr", "started": "...", "completed": "...", "duration_ms": 2500},
{"stage": "llm_extraction", "started": "...", "completed": "...", "duration_ms": 8000, "retries": 1}
]
}
Acceptance Criteria:
- Timeline data in job response
- Visual timeline component
- Stage detail expansion
- Error visualization
- Performance comparison across documents
Labels: enhancement, feature, ui
Description: Allow editing metadata for multiple documents at once.
Use Cases:
- Set same correspondent for multiple documents
- Bulk tag assignment
- Batch date correction
- Mass category change
Endpoint: PATCH /api/documents/batch
Request:
{
"document_ids": ["id1", "id2", "id3"],
"updates": {
"correspondent": "ACME Inc",
"tags": {"add": ["reviewed"], "remove": ["pending"]}
}
}
Acceptance Criteria:
- Batch update endpoint
- Partial updates (only specified fields)
- Add/remove for array fields
- Validation across all documents
- Rollback on partial failure
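The partial-update and rollback criteria could be satisfied by staging all changes before committing any. The sketch below uses a plain dictionary as a stand-in for the document store; in a real database the same effect would come from a transaction. The function names are illustrative.

```python
import copy


def apply_update(document: dict, updates: dict) -> dict:
    """Return an updated copy; only the fields named in updates change."""
    result = copy.deepcopy(document)
    for field, value in updates.items():
        if isinstance(value, dict) and value.keys() <= {"add", "remove"}:
            # Array field: drop removed items, then append additions once.
            items = [x for x in result.get(field, []) if x not in value.get("remove", [])]
            for item in value.get("add", []):
                if item not in items:
                    items.append(item)
            result[field] = items
        else:
            result[field] = value
    return result


def apply_batch_update(store: dict, document_ids: list, updates: dict) -> None:
    """Stage every change first and commit only if all documents succeed."""
    missing = [doc_id for doc_id in document_ids if doc_id not in store]
    if missing:
        raise KeyError(f"unknown document ids: {missing}")
    staged = {doc_id: apply_update(store[doc_id], updates) for doc_id in document_ids}
    store.update(staged)  # nothing is written unless every document staged cleanly


store = {
    "id1": {"correspondent": "Old Corp", "tags": ["pending"]},
    "id2": {"tags": ["pending", "urgent"]},
}
apply_batch_update(store, ["id1", "id2"], {
    "correspondent": "ACME Inc",
    "tags": {"add": ["reviewed"], "remove": ["pending"]},
})
print(store)
```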
Labels: enhancement, feature, infrastructure
Description: Replace the in-memory task queue with Redis for reliability and scalability.
Benefits:
- Job persistence across restarts
- Distributed worker support
- Priority queues
- Dead letter queue for failed jobs
- Job scheduling
Configuration:
queue:
  backend: redis # memory, redis
  redis_url: "${REDIS_URL:redis://localhost:6379}"
  default_queue: dedox
  worker_count: 4
Acceptance Criteria:
- Redis queue implementation
- Job persistence
- Multiple worker support
- Priority queue support
- Failed job handling
- Backward compatible with memory queue
Labels: enhancement, feature, performance
Description: Cache frequently accessed data to reduce database load.
Cache Targets:
- Document metadata (short TTL)
- Statistics/counts (medium TTL)
- Configuration data (long TTL)
- LLM responses (content-based)
Configuration:
cache:
  backend: redis # memory, redis
  ttl_seconds:
    documents: 60
    statistics: 300
    config: 3600
Acceptance Criteria:
- Cache abstraction layer
- Memory and Redis backends
- Configurable TTL per cache type
- Cache invalidation on update
- Cache hit/miss metrics
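The memory backend of such a cache layer could be as small as the sketch below, which covers per-cache TTL, invalidation, and hit/miss counters. The class name and interface are assumptions; a Redis backend would expose the same methods.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """In-memory cache whose entries expire after a fixed time to live."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self.hits = 0
        self.misses = 0
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str, default: Any = None) -> Any:
        entry = self._entries.get(key)
        if entry is not None and entry[0] > self.clock():
            self.hits += 1
            return entry[1]
        self._entries.pop(key, None)  # drop the expired entry, if any
        self.misses += 1
        return default

    def set(self, key: str, value: Any) -> None:
        self._entries[key] = (self.clock() + self.ttl_seconds, value)

    def invalidate(self, key: str) -> None:
        """Remove an entry, for example after the underlying record was updated."""
        self._entries.pop(key, None)


# One cache per target, using the TTLs from the sample configuration.
caches = {name: TTLCache(ttl) for name, ttl in
          {"documents": 60, "statistics": 300, "config": 3600}.items()}
caches["documents"].set("doc-1", {"sender": "ACME Inc"})
print(caches["documents"].get("doc-1"))
```

The injectable clock is there so expiry can be tested without sleeping.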
Labels: enhancement, feature, performance
Description: Cache LLM extraction results to avoid redundant API calls.
Strategy:
- Hash OCR text + prompt
- Cache extraction results
- Configurable TTL
- Invalidate on prompt change
Benefits:
- Faster reprocessing
- Reduced LLM costs
- Faster testing/development
Acceptance Criteria:
- Cache key generation (text hash + prompt hash)
- Cache storage (file or Redis)
- TTL configuration
- Cache hit logging
- Manual cache clear option
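The key generation step might look like the following sketch. Including the model name in the prompt hash is an added assumption, on the reasoning that a different model should not reuse cached output; the key layout is illustrative.

```python
import hashlib


def extraction_cache_key(ocr_text: str, prompt: str, model: str = "") -> str:
    """Build a cache key from the text hash and the prompt (plus model) hash."""
    text_hash = hashlib.sha256(ocr_text.encode("utf-8")).hexdigest()
    prompt_hash = hashlib.sha256(f"{model}\n{prompt}".encode("utf-8")).hexdigest()
    return f"llm:{text_hash}:{prompt_hash}"


key_a = extraction_cache_key("Invoice 1001", "Extract the sender", "qwen2.5:14b")
key_b = extraction_cache_key("Invoice 1001", "Extract the sender name", "qwen2.5:14b")
print(key_a == key_b)  # False: changing the prompt yields a different key
```

Because the prompt is part of the key, a prompt change never hits stale entries; old entries simply stop being read and can age out through the TTL.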
Labels: enhancement, performance, database
Description: Optimize database queries and add proper indexing.
Improvements:
- Add indexes on frequently queried fields
- Implement query pagination
- Add query result limits
- Optimize JOIN queries
- Consider read replicas for scale
Indexes to add:
- documents.paperless_id
- documents.status
- documents.created_at
- documents.document_type (metadata)
- jobs.status
- jobs.created_at
Acceptance Criteria:
- Index migration script
- Query pagination for list endpoints
- Query explain analysis
- Performance benchmarks before/after
Labels: enhancement, security
Description: Implement rate limiting to prevent API abuse.
Configuration:
security:
  rate_limiting:
    enabled: true
    requests_per_minute: 60
    burst: 10
    by: ip # ip, user, api_key
Acceptance Criteria:
- Rate limit middleware
- Configurable limits
- Rate limit headers in response
- Different limits per endpoint
- Whitelist for internal services
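The requests_per_minute and burst settings map naturally onto a token bucket, sketched below. Token bucket is one possible algorithm, not one the issue prescribes, and the per-client dictionary is a stand-in for whatever store the middleware would use.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Allow short bursts while enforcing an average request rate."""

    def __init__(self, requests_per_minute: float, burst: int,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = requests_per_minute / 60.0  # tokens refilled per second
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.clock = clock
        self.updated = clock()

    def allow(self) -> bool:
        """Consume one token if available; otherwise reject the request."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


buckets: Dict[str, TokenBucket] = {}  # one bucket per client key (ip, user, api_key)


def is_allowed(client_key: str) -> bool:
    bucket = buckets.setdefault(client_key, TokenBucket(requests_per_minute=60, burst=10))
    return bucket.allow()


print(is_allowed("192.168.1.100"))  # True for the first request from this client
```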
Labels: enhancement, security
Description: Validate and scan uploaded files for security.
Checks:
- File type validation (magic bytes, not just extension)
- Maximum file size
- Malware scanning (ClamAV integration)
- Image bomb detection
- PDF bomb detection
Configuration:
security:
  upload:
    max_size_mb: 50
    allowed_types:
      - application/pdf
      - image/jpeg
      - image/png
      - image/tiff
    malware_scan: true
    clamav_socket: /var/run/clamav/clamd.sock
Acceptance Criteria:
- Magic byte validation
- Size limit enforcement
- Optional ClamAV integration
- Decompression bomb protection
- Detailed rejection reasons
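Magic-byte validation for the four allowed types can be done by checking the leading bytes of the upload. The sketch below covers only the type and size checks; the function names are illustrative, and malware or decompression-bomb checks are out of scope here.

```python
from typing import Optional, Tuple

# Leading byte signatures for the allowed types in the sample configuration.
MAGIC_SIGNATURES = {
    "application/pdf": [b"%PDF-"],
    "image/jpeg": [b"\xff\xd8\xff"],
    "image/png": [b"\x89PNG\r\n\x1a\n"],
    "image/tiff": [b"II*\x00", b"MM\x00*"],  # little-endian and big-endian TIFF
}


def sniff_mime_type(data: bytes) -> Optional[str]:
    """Identify the file type from its first bytes, ignoring the file name."""
    for mime, signatures in MAGIC_SIGNATURES.items():
        if any(data.startswith(sig) for sig in signatures):
            return mime
    return None


def validate_upload(data: bytes, max_size_mb: int = 50) -> Tuple[bool, str]:
    """Return (accepted, detected type or rejection reason)."""
    if len(data) > max_size_mb * 1024 * 1024:
        return False, f"file exceeds {max_size_mb} MB limit"
    mime = sniff_mime_type(data)
    if mime is None:
        return False, "unrecognized or disallowed file type"
    return True, mime


print(validate_upload(b"%PDF-1.7\n..."))  # (True, 'application/pdf')
print(validate_upload(b"MZ\x90\x00"))     # rejected: not an allowed type
```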
Labels: enhancement, security
Description: Encrypt sensitive documents stored on disk.
Features:
- AES-256 encryption
- Key management
- Selective encryption (by document type)
- Encrypted backup support
Configuration:
security:
  encryption:
    enabled: true
    algorithm: aes-256-gcm
    key_source: env # env, vault, kms
    key_env_var: DEDOX_ENCRYPTION_KEY
Acceptance Criteria:
- Encryption on write
- Decryption on read
- Key rotation support
- Vault/KMS integration option
- Migration tool for existing documents
Labels: enhancement, security, compliance
Description: Log all security-relevant actions for compliance and debugging.
Events to log:
- Authentication attempts
- Document access
- Metadata changes
- Configuration changes
- API key usage
- Admin actions
Log Format:
{
"timestamp": "2024-01-15T10:30:00Z",
"event": "document.accessed",
"user": "user@example.com",
"document_id": "uuid",
"ip_address": "192.168.1.100",
"user_agent": "..."
}
Acceptance Criteria:
- Audit log middleware
- Configurable event types
- Structured JSON logging
- Log rotation
- Log export endpoint (admin only)
Labels: documentation, good first issue
Description: Generate and serve interactive API documentation.
Features:
- OpenAPI 3.0 specification
- Swagger UI at /docs
- ReDoc at /redoc
- Example requests/responses
- Authentication documentation
Acceptance Criteria:
- OpenAPI spec generation from routes
- Swagger UI endpoint
- Request/response examples
- Authentication section
- Error response documentation
Labels: documentation, help wanted
Description: Create video tutorials for common use cases.
Topics:
- Installation and first run
- Configuring Paperless-ngx integration
- Customizing metadata extraction
- Setting up Open WebUI sync
- Troubleshooting common issues
Acceptance Criteria:
- Script/outline for each video
- Recorded videos
- YouTube playlist
- Links in README
Labels: documentation
Description: Create a comprehensive troubleshooting guide for common issues.
Topics:
- Connection issues (Paperless, Ollama, Open WebUI)
- OCR quality problems
- Extraction accuracy issues
- Performance problems
- Docker/container issues
- Authentication problems
Format:
## Problem: OCR quality is poor
### Symptoms
- Low confidence scores
- Garbled text
### Causes
- Poor image quality
- Wrong language setting
- Skewed documents
### Solutions
1. Check image resolution (minimum 300 DPI)
2. Verify language settings in config
3. Enable image preprocessing
Acceptance Criteria:
- Common issues documented
- Clear symptom descriptions
- Step-by-step solutions
- Links to relevant config options
Labels: testing, quality
Description: Create comprehensive integration tests for the full pipeline.
Test Scenarios:
- Full document processing flow
- Paperless-ngx webhook handling
- Open WebUI sync
- Error recovery
- Concurrent processing
Requirements:
- Docker-compose test environment
- Test fixtures (sample documents)
- Mocked external services option
- CI/CD integration
Acceptance Criteria:
- Integration test framework
- Test docker-compose
- Sample document fixtures
- CI workflow for integration tests
- Coverage reporting
Labels: testing, performance
Description: Create automated performance benchmarks.
Metrics:
- Documents per minute throughput
- Average processing time by stage
- Memory usage under load
- API response times
- Database query times
Features:
- Benchmark CLI command
- Historical tracking
- Regression detection
- CI integration
Acceptance Criteria:
- Benchmark test suite
- Metric collection
- Result storage/comparison
- Performance regression CI check
- Benchmark documentation
Labels: testing, quality, ml
Description: Test extraction accuracy against a labeled dataset.
Features:
- Ground truth dataset format
- Accuracy metrics (precision, recall, F1)
- Per-field metrics
- Regression tracking
- A/B testing for prompts
Dataset Format:
{
"document": "path/to/document.pdf",
"expected": {
"document_type": "invoice",
"sender": "ACME Corporation",
"total_amount": 1234.56
}
}
Acceptance Criteria:
- Test dataset format specification
- Accuracy calculation
- Per-field metrics
- CI integration
- Accuracy trend reporting
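Per-field precision, recall, and F1 could be computed as in the sketch below. The counting convention is an assumption: an exact match is a true positive, a wrong or unexpected value is a false positive, and a missed or wrong expected value is a false negative. Real evaluation might need fuzzy matching for names and amounts.

```python
def field_metrics(samples: list) -> dict:
    """Compute precision, recall, and F1 per field from (expected, predicted) pairs."""
    counts = {}
    for expected, predicted in samples:
        for field in expected.keys() | predicted.keys():
            c = counts.setdefault(field, {"tp": 0, "fp": 0, "fn": 0})
            exp, pred = expected.get(field), predicted.get(field)
            if pred is not None and pred == exp:
                c["tp"] += 1
            else:
                if pred is not None:
                    c["fp"] += 1  # produced a value that is wrong or unexpected
                if exp is not None:
                    c["fn"] += 1  # the expected value was missed
    metrics = {}
    for field, c in counts.items():
        precision = c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else 0.0
        recall = c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[field] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


samples = [
    ({"document_type": "invoice", "sender": "ACME Corporation", "total_amount": 1234.56},
     {"document_type": "invoice", "sender": "ACME Corporation", "total_amount": 1234.56}),
    ({"document_type": "invoice", "sender": "ACME Corporation", "total_amount": 99.0},
     {"document_type": "invoice", "sender": "ACME Inc"}),
]
for field, values in sorted(field_metrics(samples).items()):
    print(field, values)
```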
Labels: infrastructure, deployment
Description: Provide Kubernetes manifests for production deployment.
Components:
- Deployment for API server
- Deployment for workers
- ConfigMap for settings
- Secret for credentials
- Service for API
- Ingress for external access
- HPA for autoscaling
Acceptance Criteria:
- K8s deployment manifests
- Helm chart (optional)
- Documentation for K8s deployment
- Health/readiness probes
- Resource limits
Labels: infrastructure, monitoring
Description: Expose Prometheus-compatible metrics for monitoring.
Metrics:
- dedox_documents_total (counter)
- dedox_processing_duration_seconds (histogram)
- dedox_queue_size (gauge)
- dedox_extraction_confidence (histogram)
- dedox_errors_total (counter by type)
Endpoint: GET /metrics
Acceptance Criteria:
- Prometheus client integration
- Core metrics defined
- /metrics endpoint
- Grafana dashboard template
- Documentation
Labels: infrastructure, operations
Description: Implement backup and restore for DeDox data.
Features:
- Database backup
- Document files backup
- Configuration backup
- Scheduled backups
- Restore procedure
Commands:
dedox backup --output /path/to/backup.tar.gz
dedox restore --input /path/to/backup.tar.gz
Acceptance Criteria:
- Backup CLI command
- Restore CLI command
- Scheduled backup option
- Incremental backup support
- Documentation
Use these labels when creating issues:
| Label | Description |
|---|---|
| good first issue | Good for newcomers |
| help wanted | Extra attention needed |
| enhancement | New feature or improvement |
| bug | Something isn't working |
| documentation | Documentation improvements |
| security | Security-related |
| performance | Performance improvements |
| testing | Testing improvements |
| infrastructure | Deployment/ops related |
| api | API changes |
| pipeline | Processing pipeline |
| integration | External integrations |
| ui | User interface |
| code-quality | Code cleanup/refactoring |
- P0 - Critical: Security vulnerabilities, data loss risks
- P1 - High: Core functionality, blocking issues
- P2 - Medium: Important features, significant improvements
- P3 - Low: Nice to have, minor improvements