A Spring Boot microservices application that automatically syncs the repositories of a GitHub organization to a Milvus vector database. The system fetches README files and API definition files, chunks them, generates embeddings using Azure OpenAI, and stores them in Milvus for vector search.
This project follows a microservices architecture with six independent services plus a complete monitoring stack:
┌─────────────────────────────────────────────────────────────────┐
│                      Orchestrator Service                       │
│             (Coordinates workflow, Scheduled jobs)              │
└────┬────────────┬────────────┬────────────┬─────────────────────┘
     │            │            │            │
     ▼            ▼            ▼            ▼
┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌──────────┐
│ GitHub  │  │Document │  │Embedding│  │ Milvus  │  │Monitoring│
│ Service │  │Processor│  │ Service │  │ Service │  │ Service  │
│         │  │ Service │  │         │  │         │  │          │
└────┬────┘  └────┬────┘  └────┬────┘  └────┬────┘  └────┬─────┘
     │            │            │            │            │
     ▼            ▼            ▼            ▼            ▼
┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌──────────┐
│ GitHub  │  │Chunking │  │  Azure  │  │ Milvus  │  │Prometheus│
│   API   │  │  Logic  │  │ OpenAI  │  │   DB    │  │ & Grafana│
└─────────┘  └─────────┘  └─────────┘  └─────────┘  └──────────┘
- GitHub Service (Port 8081)
  - Fetches repositories from GitHub organization
  - Retrieves README and API definition files
  - Filters repositories by keyword
- Document Processor Service (Port 8082)
  - Chunks documents using recursive character splitting
  - Maintains context with configurable overlap
  - Preserves metadata for each chunk
- Embedding Service (Port 8083)
  - Generates embeddings using Azure OpenAI
  - Batch processing with rate limit handling
  - Automatic retry on failures
- Milvus Service (Port 8084)
  - Manages Milvus vector database collections
  - Automatic collection creation if it doesn't exist
  - Handles vector upserts and schema management
  - Works with cloud Milvus (Zilliz)
- Orchestrator Service (Port 8086)
  - Coordinates the entire sync workflow
  - Auto-sync on startup (configurable, enabled by default)
  - Scheduled execution (daily at 8:00 AM)
  - Manual trigger via REST API
  - Resilience with retry logic
- Monitoring Service (Port 8085)
  - Health monitoring and metrics aggregation
  - Automated health checks every 30 seconds
  - REST API for monitoring status
  - Exposes metrics to Prometheus
- Prometheus (Port 9090)
  - Metrics collection and storage
  - Scrapes all services every 15 seconds
  - Time-series database
  - Alert rule evaluation
- Grafana (Port 3030)
  - Real-time dashboards and visualization
  - Pre-configured dashboards for all services
  - Connected to Prometheus
  - Default credentials: admin/admin
- ✅ Auto-Sync on Startup: Automatically fetches and syncs repositories when application starts
- ✅ Automatic Collection Creation: Milvus collection created automatically if it doesn't exist
- ✅ Automated Daily Sync: Runs at 8:00 AM every day via scheduled task and GitHub Actions
- ✅ Cloud-Native Vector Storage: Uses Zilliz cloud Milvus (no local database required)
- ✅ Microservices Architecture: Independent, scalable services following SOLID principles
- ✅ Complete Monitoring System: Prometheus + Grafana with custom monitoring service
- ✅ Real-time Dashboards: 8 pre-configured Grafana panels for all metrics
- ✅ Intelligent Alerting: 8 alert rules for critical conditions
- ✅ Docker & Kubernetes Ready: Complete containerization and K8s manifests
- ✅ CI/CD Pipeline: Automated build, test, and deployment
- ✅ Dependency Updates: Weekly automated dependency and security checks
- ✅ Security Scanning: OWASP, Trivy, and license compliance checks
- ✅ Local & Cloud Support: Works in both local and GitHub Actions environments
- ✅ Resilience: Retry logic, circuit breakers, and fallback mechanisms
- ✅ Full Observability: Health checks, metrics, logging, and monitoring
- Java 21 (OpenJDK 21 or higher)
- Maven 3.6+
- Docker and Docker Compose
- GitHub Personal Access Token
- Azure OpenAI API access
- Cloud Milvus instance (Zilliz Cloud recommended)
git clone <your-repo-url>
cd Microservices_with_RepoSync
Copy the example environment file:
cp .env.example .env
Edit .env and fill in your credentials:
# GitHub Configuration
REPOSYNC_GITHUB_TOKEN=ghp_your_token_here
REPOSYNC_ORGANIZATION=your-org-name
REPOSYNC_FILTER_KEYWORD=microservices
# Azure OpenAI Configuration
AZURE_OPENAI_API_KEY=your-azure-openai-key
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT=text-embedding-ada-002
# Cloud Milvus (Zilliz) Configuration
MILVUS_URI=https://your-instance.vectordb.xxxxxxxxxx.com:19530
MILVUS_TOKEN=your-zilliz-token
MILVUS_COLLECTION_NAME=reposync_collection
# Auto-Sync Configuration (Optional)
REPOSYNC_AUTO_SYNC_ON_STARTUP=true # Default: true
# Full build with all checks
mvn clean install
# Fast build (skip tests and checkstyle)
mvn clean package -DskipTests -Dcheckstyle.skip=true
The easiest way to run the application is using the start script, which handles everything automatically:
# Run this single command - everything is automatic!
./scripts/start-local.sh
# The script will:
# 1. Check prerequisites (Java 21, Maven, Docker)
# 2. Validate your .env configuration
# 3. Build all services
# 4. Start Docker Compose
# 5. Wait for services to be healthy
# 6. Auto-sync triggers automatically after 5 seconds!
# Verify auto-sync is working
./scripts/verify-auto-sync.sh
# Watch live sync progress
docker compose logs -f orchestrator-service
What happens automatically:
- ✅ Fetches all repositories from your GitHub organization
- ✅ Extracts documents from each repository
- ✅ Chunks documents for optimal embedding
- ✅ Generates embeddings using Azure OpenAI
- ✅ Creates Milvus collection if it doesn't exist
- ✅ Stores all vectors in your cloud Milvus collection
# Start all services including monitoring stack
docker compose up -d
# Auto-sync will trigger automatically in 5 seconds after startup!
# Check logs
docker compose logs -f
# (Optional) Manually trigger sync if needed
curl -X POST http://localhost:8086/api/orchestrator/sync
# Access monitoring interfaces
# Grafana: http://localhost:3030 (admin/admin)
# Prometheus: http://localhost:9090
# Monitoring API: http://localhost:8085/api/monitoring
# Stop all services
docker compose down
Open 5 terminal windows and run:
# Terminal 1 - GitHub Service
cd github-service
mvn spring-boot:run
# Terminal 2 - Document Processor Service
cd document-processor-service
mvn spring-boot:run
# Terminal 3 - Embedding Service
cd embedding-service
mvn spring-boot:run
# Terminal 4 - Milvus Service
cd milvus-service
mvn spring-boot:run
# Terminal 5 - Orchestrator Service
cd orchestrator-service
mvn spring-boot:run
# Note: Auto-sync will trigger when orchestrator starts!
If you prefer to trigger sync manually:
# Set environment variable before starting
export REPOSYNC_AUTO_SYNC_ON_STARTUP=false
# Then start services
./scripts/start-local.sh
# Or, with Docker Compose, recreate the orchestrator so it picks up the new value
# (a plain `docker compose restart` keeps the container's existing environment)
docker compose up -d orchestrator-service
# Manually trigger sync when needed
curl -X POST http://localhost:8086/api/orchestrator/sync | jq '.'
Edit k8s/01-namespace-config.yaml and add your credentials to the Secret.
# Apply all manifests
kubectl apply -f k8s/
# Check deployment status
kubectl get pods -n reposync
kubectl get services -n reposync
# View logs
kubectl logs -f deployment/orchestrator-service -n reposync
# Get the orchestrator service external IP
kubectl get service orchestrator-service -n reposync
# Trigger sync
curl -X POST http://<EXTERNAL-IP>:8086/api/orchestrator/sync
Go to your GitHub repository → Settings → Secrets and variables → Actions
Add the following secrets:
- REPOSYNC_GITHUB_TOKEN
- REPOSYNC_ORGANIZATION
- REPOSYNC_FILTER_KEYWORD
- AZURE_OPENAI_API_KEY
- AZURE_OPENAI_ENDPOINT
- AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT
- MILVUS_URI
- MILVUS_TOKEN
- MILVUS_COLLECTION_NAME
- DOCKER_USERNAME (for CI/CD)
- DOCKER_PASSWORD (for CI/CD)
- KUBE_CONFIG (base64 encoded kubeconfig for CI/CD)
Two workflows are configured:
- Daily Sync (.github/workflows/daily-sync.yml)
  - Runs daily at 8:00 AM UTC
  - Can be triggered manually
  - Executes the complete sync workflow
- CI/CD Pipeline (.github/workflows/ci-cd.yml)
  - Runs on push to main/develop
  - Builds and tests all services
  - Builds and pushes Docker images
  - Deploys to Kubernetes
| Service | Port | Description |
|---|---|---|
| Orchestrator | 8086 | Main workflow coordinator |
| GitHub | 8081 | GitHub API integration |
| Document Processor | 8082 | Document chunking |
| Embedding | 8083 | Azure OpenAI embeddings |
| Milvus | 8084 | Vector database service |
| Monitoring | 8085 | Health & metrics aggregation |
| Prometheus | 9090 | Metrics collection |
| Grafana | 3030 | Monitoring dashboards |
Edit document-processor-service/src/main/resources/application.yml:
chunking:
  chunk-size: 1000 # Characters per chunk
  overlap: 200 # Overlap between chunks
Edit orchestrator-service/src/main/resources/application.yml:
reposync:
auto-sync-on-startup: true # Enable/disable auto-sync on startup
schedule:
cron: "0 0 8 * * *" # Daily at 8:00 AM
Or use environment variable:
export REPOSYNC_AUTO_SYNC_ON_STARTUP=false # Disable auto-sync
The application includes a comprehensive monitoring system built with Prometheus and Grafana, following SOLID principles.
# Start the monitoring stack
./docs/scripts/start-monitoring.sh
# Or manually with Docker Compose
docker compose up -d monitoring-service prometheus grafana
| Interface | URL | Credentials | Description |
|---|---|---|---|
| Grafana | http://localhost:3030 | admin/admin | Visual dashboards |
| Prometheus | http://localhost:9090 | - | Metrics & queries |
| Monitoring API | http://localhost:8085/api/monitoring | - | Health status API |
The Monitoring Service (Port 8085) provides:
- Automated Health Checks: Polls all services every 30 seconds
- Metrics Aggregation: Collects and aggregates metrics from all services
- REST API: Programmatic access to health and metrics data
- Prometheus Integration: Exposes metrics in Prometheus format
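The periodic health check described above can be pictured with a small standalone sketch. This is an illustration under stated assumptions, not the Monitoring Service's actual code: the class `HealthPoller` and its methods are hypothetical, the service names and ports are taken from this README, and it relies only on the fact that Spring Boot Actuator answers `/actuator/health` with a JSON body containing `"status":"UP"` when a service is healthy.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of a 30-second health poller (names are illustrative).
class HealthPoller {
    // Service name -> local port, as listed in this README.
    static final Map<String, Integer> SERVICES = new LinkedHashMap<>();
    static {
        SERVICES.put("github-service", 8081);
        SERVICES.put("document-processor-service", 8082);
        SERVICES.put("embedding-service", 8083);
        SERVICES.put("milvus-service", 8084);
        SERVICES.put("orchestrator-service", 8086);
    }

    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(2))
            .build();

    // Builds the Actuator health URL for a service running on localhost.
    static String healthUrl(int port) {
        return "http://localhost:" + port + "/actuator/health";
    }

    // A service counts as up only on HTTP 200 with a status of UP in the body.
    static boolean isUp(int statusCode, String body) {
        return statusCode == 200 && body != null
                && body.replace(" ", "").contains("\"status\":\"UP\"");
    }

    // Performs one health probe; any error (refused connection, timeout) means "down".
    static boolean checkOnce(int port) {
        try {
            HttpRequest request = HttpRequest.newBuilder(URI.create(healthUrl(port)))
                    .timeout(Duration.ofSeconds(3))
                    .GET()
                    .build();
            HttpResponse<String> response =
                    CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
            return isUp(response.statusCode(), response.body());
        } catch (Exception e) {
            if (e instanceof InterruptedException) {
                Thread.currentThread().interrupt();
            }
            return false;
        }
    }

    // Probes every service immediately and then once per period.
    static ScheduledExecutorService startPolling(long periodSeconds) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(
                () -> SERVICES.forEach((name, port) ->
                        System.out.println(name + " -> " + (checkOnce(port) ? "UP" : "DOWN"))),
                0, periodSeconds, TimeUnit.SECONDS);
        return scheduler;
    }
}
```

Calling `HealthPoller.startPolling(30)` would mirror the 30-second interval mentioned above; call `shutdown()` on the returned scheduler to stop it.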
# Get system-wide health status
curl http://localhost:8085/api/monitoring/health
# Get health of all services
curl http://localhost:8085/api/monitoring/services/health
# Get specific service health
curl http://localhost:8085/api/monitoring/services/github-service/health
# Get unhealthy services
curl http://localhost:8085/api/monitoring/services/unhealthy
# Trigger manual health check
curl -X POST http://localhost:8085/api/monitoring/health/check
Prometheus (Port 9090) collects metrics from all services:
- Scrape Interval: 15 seconds
- Retention: 15 days (default)
- Alert Evaluation: Every 15 seconds
Each service exposes Prometheus metrics at /actuator/prometheus:
- JVM Metrics: Memory, threads, GC, class loading
- HTTP Metrics: Request count, latency, status codes
- System Metrics: CPU, disk, uptime
- Custom Metrics: Service-specific business metrics
# Service availability
up{job=~".*-service"}
# Request rate (requests per second)
rate(http_server_requests_seconds_count[5m])
# Memory usage percentage
(jvm_memory_used_bytes{area="heap"} / jvm_memory_max_bytes{area="heap"}) * 100
# 95th percentile response time
histogram_quantile(0.95, sum(rate(http_server_requests_seconds_bucket[5m])) by (job, le))
# Error rate
rate(http_server_requests_seconds_count{status=~"5.."}[5m])
# CPU usage
system_cpu_usage * 100
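The queries above can also be sent to Prometheus programmatically through its HTTP API, whose instant-query endpoint is `/api/v1/query` with the expression passed in the `query` parameter. The sketch below only builds the request URL; the class and method names (`PromQuery`, `instantQueryUrl`) are hypothetical, and the base URL assumes the local Prometheus address used in this README.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: builds URLs for Prometheus instant queries.
class PromQuery {
    // PromQL contains characters such as { } " and spaces, so it must be URL-encoded.
    static String instantQueryUrl(String baseUrl, String promql) {
        return baseUrl + "/api/v1/query?query="
                + URLEncoder.encode(promql, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String base = "http://localhost:9090"; // assumption: local Prometheus from this README
        System.out.println(instantQueryUrl(base, "up{job=~\".*-service\"}"));
        System.out.println(instantQueryUrl(base, "system_cpu_usage * 100"));
    }
}
```

The resulting URL can be passed to `curl` or any HTTP client; Prometheus responds with JSON.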
Grafana (Port 3030) provides real-time visualization:
- Pre-configured Dashboard: RepoSync Microservices Overview
- 8 Monitoring Panels:
  - Service Availability - Real-time service status
  - HTTP Request Rate - Requests per second by service
  - Response Time (95th percentile) - Latency tracking
  - JVM Memory Usage - Heap memory monitoring
  - CPU Usage - System and process CPU
  - Thread Count - Thread pool monitoring
  - Error Rate - 4xx and 5xx errors
  - Garbage Collection Time - GC performance
- Navigate to http://localhost:3030
- Log in with admin/admin
- Dashboard is auto-provisioned and ready to use
The system includes 8 pre-configured alert rules:
| Alert | Condition | Severity |
|---|---|---|
| ServiceDown | Service down > 1 minute | Critical |
| HighMemoryUsage | Heap > 85% for 5 minutes | Warning |
| CriticalMemoryUsage | Heap > 95% for 2 minutes | Critical |
| HighCPUUsage | CPU > 80% for 5 minutes | Warning |
| HighErrorRate | Error rate > 10% | Critical |
| LowRequestRate | Request rate very low | Info |
| FrequentGC | GC > 5 times/sec for 5 min | Warning |
| HighThreadCount | Threads > 200 | Warning |
View active alerts in Prometheus: http://localhost:9090/alerts
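To make the memory thresholds in the alert table concrete, here is a minimal sketch that reads the current JVM heap usage and classifies it against the 85% and 95% limits. It is an illustration only: the class `HeapAlertCheck` is hypothetical, the real rules are evaluated by Prometheus, and the "for 5 minutes" and "for 2 minutes" durations are not modeled here.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical sketch: applies the README's heap thresholds to a single reading.
class HeapAlertCheck {
    // Thresholds taken from the HighMemoryUsage and CriticalMemoryUsage rows.
    static String classify(double heapPercent) {
        if (heapPercent > 95.0) {
            return "CRITICAL";
        }
        if (heapPercent > 85.0) {
            return "WARNING";
        }
        return "OK";
    }

    // Same ratio as the PromQL expression used / max * 100, read from the local JVM.
    static double currentHeapPercent() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long max = heap.getMax(); // -1 when the maximum is undefined
        if (max <= 0) {
            return Double.NaN;    // NaN compares false against both thresholds, so it reads as OK
        }
        return 100.0 * heap.getUsed() / max;
    }

    public static void main(String[] args) {
        double percent = currentHeapPercent();
        System.out.printf("heap=%.1f%% -> %s%n", percent, classify(percent));
    }
}
```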
Each service exposes Spring Boot Actuator health endpoints:
# Check individual services
curl http://localhost:8086/actuator/health # Orchestrator
curl http://localhost:8081/actuator/health # GitHub
curl http://localhost:8082/actuator/health # Document Processor
curl http://localhost:8083/actuator/health # Embedding
curl http://localhost:8084/actuator/health # Milvus
curl http://localhost:8085/actuator/health # Monitoring
Access Prometheus-formatted metrics:
curl http://localhost:8086/actuator/prometheus
curl http://localhost:8081/actuator/prometheus
# ... etc for all services
For detailed monitoring documentation, see:
- Monitoring Guide - Comprehensive 400+ line guide
- Monitoring Quick Start - Quick reference
- Monitoring Architecture - Architecture diagrams
- Monitoring Implementation - Implementation details
Services not showing in Prometheus:
- Check Prometheus targets: http://localhost:9090/targets
- Verify services are running: docker ps
- Check actuator endpoints: curl http://localhost:8081/actuator/prometheus
Grafana shows no data:
- Verify Prometheus connection in Configuration → Data Sources
- Check time range in dashboard
- Run queries in Prometheus UI first
High memory alerts:
- Check service logs: docker logs <container-name>
- Review JVM heap settings in Dockerfile
- Consider increasing memory allocation
# Run all tests
mvn test
# Run tests for specific service
cd github-service
mvn test
Orchestrator Service:
- POST /api/orchestrator/sync - Trigger manual sync
- GET /api/orchestrator/health - Health check
GitHub Service:
- GET /api/github/repositories?organization={org}&filterKeyword={keyword} - Get repositories
- GET /api/github/documents/{owner}/{repo} - Get documents from repository
Document Processor Service:
- POST /api/processor/chunk - Chunk single document
- POST /api/processor/chunk/batch - Chunk multiple documents
Embedding Service:
- POST /api/embedding/generate - Generate embedding for single chunk
- POST /api/embedding/generate/batch - Generate embeddings for multiple chunks
Milvus Service:
- POST /api/milvus/collection/create - Create collection
- POST /api/milvus/vectors/upsert - Upsert vectors
- GET /api/milvus/collection/{name}/exists - Check collection existence
Monitoring Service:
- GET /api/monitoring/health - Get system-wide health status
- GET /api/monitoring/services/health - Get health of all services
- GET /api/monitoring/services/{serviceName}/health - Get specific service health
- GET /api/monitoring/services/unhealthy - Get list of unhealthy services
- POST /api/monitoring/health/check - Trigger manual health check
- Single Responsibility: Each service has one clear responsibility
- Open/Closed: Services can be extended without modification
- Liskov Substitution: Services can be replaced with compatible implementations
- Interface Segregation: Clean REST APIs with specific endpoints
- Dependency Inversion: Services depend on abstractions (REST APIs), not implementations
- Fetch Repositories: GitHub Service retrieves repositories matching criteria
- Extract Documents: README and API definition files are extracted
- Chunk Documents: Documents are split into manageable chunks with overlap
- Generate Embeddings: Azure OpenAI creates vector embeddings
- Store in Milvus: Vectors with metadata are stored in Milvus collection
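The "Chunk Documents" step above can be illustrated with a simple fixed-size sliding window. Note that this is a deliberate simplification: the Document Processor Service is described as using recursive character splitting, whereas the sketch below only demonstrates how a chunk size and an overlap interact. The class `OverlapChunker` is hypothetical and not taken from the project.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: fixed-size chunking with overlap (not recursive splitting).
class OverlapChunker {
    // Each chunk starts (chunkSize - overlap) characters after the previous one,
    // so consecutive chunks share `overlap` characters of context.
    static List<String> chunk(String text, int chunkSize, int overlap) {
        if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
            throw new IllegalArgumentException("require chunkSize > overlap >= 0");
        }
        List<String> chunks = new ArrayList<>();
        int step = chunkSize - overlap;
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + chunkSize, text.length());
            chunks.add(text.substring(start, end));
            if (end == text.length()) {
                break; // the last chunk reached the end of the text
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Small values keep the overlap visible: prints [abcd, defg, ghij]
        System.out.println(chunk("abcdefghij", 4, 1));
    }
}
```

With the default settings from this README (chunk size 1000, overlap 200), a 2500-character document would yield three chunks under this scheme, starting at offsets 0, 800, and 1600.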
When you run ./scripts/start-local.sh, the system automatically:
- Starts All Services - Docker Compose brings up all microservices
- Waits for Health - Ensures all services are ready (health checks pass)
- Triggers Sync - After 5 seconds, the orchestrator automatically initiates sync
- Fetches Data - Retrieves all repositories from your GitHub organization
- Processes Documents - Extracts, chunks, and embeds all documents
- Creates Collection - If the Milvus collection doesn't exist, creates it automatically
- Stores Vectors - Upserts all embeddings to your cloud Milvus collection
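The startup behavior above (sync enabled by default, triggered a few seconds after startup, and switchable through `REPOSYNC_AUTO_SYNC_ON_STARTUP`) can be sketched in plain Java as follows. This is a hedged illustration, not the orchestrator's real implementation, which presumably relies on Spring configuration; the class `StartupSync` and its methods are hypothetical.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of "auto-sync on startup, enabled unless disabled".
class StartupSync {
    // An unset or blank variable means enabled, matching the documented default of true.
    static boolean autoSyncEnabled(String rawValue) {
        return rawValue == null || rawValue.isBlank()
                || Boolean.parseBoolean(rawValue.trim());
    }

    // Runs the sync once after the given delay, then releases the scheduler thread.
    static void scheduleStartupSync(Runnable sync, long delaySeconds) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.schedule(() -> {
            try {
                sync.run();
            } finally {
                scheduler.shutdown();
            }
        }, delaySeconds, TimeUnit.SECONDS);
    }

    public static void main(String[] args) {
        if (autoSyncEnabled(System.getenv("REPOSYNC_AUTO_SYNC_ON_STARTUP"))) {
            // The println stands in for the real sync workflow.
            scheduleStartupSync(() -> System.out.println("sync triggered"), 5);
        } else {
            System.out.println("auto-sync disabled; use POST /api/orchestrator/sync instead");
        }
    }
}
```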
# Start with auto-sync (default)
./scripts/start-local.sh
# Verify auto-sync status
./scripts/verify-auto-sync.sh
# Watch sync progress
docker compose logs -f orchestrator-service
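# Disable auto-sync (recreate the container so the new value is applied;
# `docker compose restart` alone keeps the existing environment)
export REPOSYNC_AUTO_SYNC_ON_STARTUP=false
docker compose up -d orchestrator-service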
# Manually trigger sync
curl -X POST http://localhost:8086/api/orchestrator/sync | jq '.'
- ✅ Zero Manual Steps - Just run one script
- ✅ Immediate Data - Your vector database is populated on first startup
- ✅ Auto-Recovery - Collection created if missing
- ✅ Scheduled Updates - Daily sync at 8:00 AM keeps data fresh
- ✅ Cloud-Native - Works with Zilliz cloud Milvus
For detailed documentation, see AUTO_SYNC_IMPLEMENTATION.md.
# Check if port is already in use
lsof -i :8086
# Check logs
docker compose logs <service-name>
- Ensure all services are running
- Check service URLs in configuration
- Verify network connectivity in Docker/K8s
- Verify MILVUS_URI is correct
- If you run a local Milvus container instead of Zilliz Cloud, check that it is running: docker ps | grep milvus
- Review the local Milvus logs: docker logs milvus-standalone
Comprehensive documentation is available in the docs/readmes/ directory:
- Auto-Sync Implementation - Complete auto-sync guide
- Implementation Summary - Technical implementation details
- Quick Start Guide - Get started in 5 minutes
- Local Setup Guide - Detailed local development setup
- Monitoring Guide - Comprehensive monitoring guide (400+ lines)
- Monitoring Quick Start - Quick reference
- Monitoring Architecture - Architecture diagrams
- Monitoring Implementation - Implementation details
- Local Run Guide - Running services locally
- IntelliJ IDEA Guide - IDE setup and configuration
- Setup Checklist - Complete setup checklist
- Visual Guide - Screenshots and visual walkthrough
- Project Structure - Project organization and structure
- Pipeline Architecture - CI/CD pipeline architecture
- GitHub Actions Pipeline - Complete pipeline documentation
- Dependency Updates Pipeline - Security and dependency management
- Dependency Updates Quick Start - Quick reference guide
- Integration Verification - Pipeline integration validation
- Implementation Summary - Complete implementation guide
- Build Status - Current build status
- Java 21 Build Fix - Java 21 migration notes
- Build Fix Summary - Build fixes applied
- Project Complete - Project completion summary
- Final Summary - Final project summary
- Complete Documentation Index - Navigate all documentation
The project includes comprehensive security scanning:
- OWASP Dependency Check: Weekly CVE scanning (every Monday 9 AM UTC)
- Trivy Scanner: Multi-purpose vulnerability scanning
- License Compliance: Automated third-party license tracking
- GitHub Security: Integration with GitHub Advanced Security
Security reports are generated automatically and available in GitHub Actions artifacts:
- OWASP Dependency Check Reports (HTML)
- Trivy Security Reports (SARIF)
- Dependency Tree Analysis
- License Compliance Reports
See the Dependency Updates Pipeline Documentation for details.
This project is licensed under the MIT License.
Contributions are welcome! Please feel free to submit a Pull Request.
For issues and questions, please open an issue on GitHub.