diff --git a/README.md b/README.md
index d1dd428..1104907 100644
--- a/README.md
+++ b/README.md
@@ -4,349 +4,589 @@
---
+## 개요
-### 개요
-
-**LLM-Quality-Observer** 는 대형 언어 모델(LLM)의 응답 품질을 **모니터링하고 평가**하기 위한 개인 MLOps 포트폴리오 프로젝트입니다.
-이 프로젝트의 목표는 다음과 같습니다.
-
-- LLM 기반 **Gateway API** 구성 ✅
-- 프롬프트 / 응답 / 지연 시간(latency) / 모델 버전 등을 **DB에 로깅** ✅
-- 평가 서비스(Evaluator)로 품질 점수 계산 ✅
-- 대시보드에서 품질/지연/에러율 등 **지표 시각화** ✅
-
-> 현재 상태: **v0.2.0 — 웹 대시보드 + 평가 서비스 추가 완료**
-
-### v0.2.0 주요 기능
-
-🎉 **새로운 기능**:
-- **웹 대시보드** (Next.js 14 + Tailwind CSS + Recharts)
- - **Overview**: 전체 통계 카드 + 시간별 추이 차트 + 최근 활동
- - 총 로그 수, 평가된 수, 평균 지연시간, 평균 점수
- - 품질 점수 추이 선 그래프 (최근 30일)
- - 지연시간 추이 선 그래프 (최근 30일)
- - 요청 수 추이 선 그래프 (최근 30일)
- - 최근 5개 로그 활동 미리보기
- - **Logs**: LLM 로그 목록 조회 (페이지네이션 지원)
- - **Evaluations**: 평가 결과 목록 조회 (점수별 색상 구분)
- - **Models**: 모델별 성능 비교 테이블 + 요약 카드
-- **다국어 지원** (i18n)
- - 영어(EN), 한국어(KR), 일본어(JP), 중국어(CN) 4개 언어
- - 우측 상단 언어 선택 드롭다운
- - localStorage에 언어 설정 저장
-- **Evaluator 서비스**: 룰 기반 품질 평가
-- **Dashboard API**: 읽기 전용 API 엔드포인트
- - GET `/api/dashboard/summary` - 전체 통계
- - GET `/api/dashboard/logs` - 로그 목록 (페이지네이션)
- - GET `/api/dashboard/evaluations` - 평가 목록 (페이지네이션)
- - GET `/api/dashboard/models/stats` - 모델 통계
- - GET `/api/dashboard/timeseries` - 시간별 추이 데이터 (1-30일)
+**LLM-Quality-Observer**는 대형 언어 모델(LLM)의 응답 품질을 **모니터링하고 평가**하기 위한 MLOps 플랫폼입니다.
+마이크로서비스 아키텍처 기반으로 구축되어 LLM 상호작용을 로깅하고, 자동으로 품질을 평가하며, 실시간 모니터링 대시보드를 제공합니다.
----
+### 주요 기능
+
+- ✅ **Gateway API**: LLM 요청 처리 및 자동 로깅
+- ✅ **자동 평가**: 규칙 기반 + LLM-as-a-Judge 이중 평가 시스템
+- ✅ **스케줄러**: 배치 평가 자동 실행 (APScheduler)
+- ✅ **다중 채널 알림**: Slack, Discord, Email 통합
+- ✅ **모니터링**: Prometheus 메트릭 수집 + Grafana 대시보드
+- ✅ **웹 대시보드**: Next.js 기반 실시간 품질 시각화
+- ✅ **다국어 지원**: 영어, 한국어, 일본어, 중국어
+- ✅ **CI/CD**: GitHub Actions 자동화 파이프라인
-### 아키텍처 개요
+> **현재 버전: v0.5.0** — Prometheus, Grafana, 이메일 알림 추가 완료
+
+---
-현재 v1 아키텍처:
+## 📊 아키텍처
```mermaid
-flowchart TD
- C["Client (Swagger UI / HTTP)"]
- G["Gateway API (FastAPI)"]
- DB["Postgres (table: llm_logs)"]
- E["Evaluator Service (future)"]
- D["Dashboard Service (future)"]
-
- C --> G
- G -->|LLM call + latency + logging| DB
- DB --> E
- DB --> D
+flowchart TB
+ subgraph "클라이언트"
+ ClientApp[Client/Browser]
+ end
+
+ subgraph "프론트엔드"
+ WebDashboard["Next.js Dashboard<br/>:3000"]
+ Grafana["Grafana<br/>:3001"]
+ end
+
+ subgraph "백엔드 서비스"
+ Gateway["Gateway API<br/>:18000"]
+ Evaluator["Evaluator<br/>:18001"]
+ Dashboard["Streamlit Dashboard<br/>:18002"]
+ end
+
+ subgraph "데이터베이스"
+ Postgres["PostgreSQL<br/>:5432"]
+ end
+
+ subgraph "모니터링"
+ Prometheus["Prometheus<br/>:9090"]
+ end
+
+ subgraph "외부 서비스"
+ OpenAI_Main["OpenAI GPT<br/>(Main Model)"]
+ OpenAI_Judge["OpenAI GPT<br/>(Judge Model)"]
+ end
+
+ subgraph "알림 채널"
+ Slack["Slack"]
+ Discord["Discord"]
+ Email["Email<br/>(SMTP)"]
+ end
+
+ %% 클라이언트 연결
+ ClientApp --> WebDashboard
+ ClientApp --> Gateway
+
+ %% Gateway 연결
+ Gateway --> OpenAI_Main
+ Gateway --> Postgres
+ Gateway -.메트릭.-> Prometheus
+
+ %% Evaluator 연결
+ Postgres --> Evaluator
+ Evaluator --> OpenAI_Judge
+ Evaluator --> Slack
+ Evaluator --> Discord
+ Evaluator --> Email
+ Evaluator -.메트릭.-> Prometheus
+
+ %% Dashboard 연결
+ Postgres --> Dashboard
+
+ %% 모니터링 연결
+ Prometheus --> Grafana
+
+ style Gateway fill:#4CAF50
+ style Evaluator fill:#2196F3
+ style Postgres fill:#FF9800
+ style Prometheus fill:#E91E63
+ style Grafana fill:#9C27B0
+ style OpenAI_Main fill:#00BCD4
+ style OpenAI_Judge fill:#00BCD4
```
-### 기술 스택
+### 서비스 구성
-* **언어**: Python 3.12
-* **LLM Provider**: OpenAI GPT-5 mini (`responses` API 사용)
-* **웹 프레임워크**: FastAPI
-* **DB**: PostgreSQL 16
-* **ORM**: SQLAlchemy
-* **설정 관리**: Pydantic Settings
-* **의존성 관리**: [`uv`](https://github.com/astral-sh/uv)
-* **컨테이너**: Docker, Docker Compose
+| 서비스 | 포트 | 설명 |
+|--------|------|------|
+| **Gateway API** | 18000 | LLM 요청 처리 및 로깅 (FastAPI) |
+| **Evaluator** | 18001 | 자동 평가 및 알림 (FastAPI) |
+| **Dashboard** | 18002 | Streamlit 대시보드 (레거시) |
+| **Web Dashboard** | 3000 | Next.js 웹 대시보드 |
+| **PostgreSQL** | 5432 | 로그 및 평가 결과 저장 |
+| **Prometheus** | 9090 | 메트릭 수집 |
+| **Grafana** | 3001 | 모니터링 대시보드 |
---
-### 프로젝트 구조
-
-대략적인 디렉토리 구조:
-
-```text
-LLM-Quality-Observer/
-├── services/
-│ ├── gateway-api/
-│ │ ├── app/
-│ │ │ ├── app/
-│ │ │ │ ├── main.py
-│ │ │ │ ├── config.py
-│ │ │ │ ├── llm_client.py
-│ │ │ │ ├── db.py
-│ │ │ │ ├── models.py
-│ │ │ │ ├── schemas.py
-│ │ │ └── pyproject.toml
-│ │ └── Dockerfile
-│ ├── evaluator/
-│ │ ├── app/
-│ │ │ └── pyproject.toml
-│ │ └── Dockerfile
-│ └── dashboard/
-│ ├── app/
-│ │ └── pyproject.toml
-│ └── Dockerfile
-├── infra/
-│ └── docker/
-│ └── docker-compose.local.yml
-├── configs/
-│ └── env/
-│ └── .env.local # local 환경 변수 (git ignore 대상)
-└── README.md
-```
-
-#### `services/gateway-api`
+## 🚀 빠른 시작
-LLM 호출을 담당하는 **Gateway API 서비스**입니다.
+### 사전 요구사항
-* `/health` : 헬스 체크
-* `/chat` : LLM 호출 + DB 로깅
+- Docker & Docker Compose
+- OpenAI API Key
+- (선택) Slack/Discord Webhook URL
+- (선택) Gmail SMTP 계정
-주요 파일:
+### 설치
-* `app/app/main.py`
-
- * FastAPI 엔트리 포인트
- * `/health`, `/chat` 엔드포인트 정의
- * 최초 실행 시 `llm_logs` 테이블 생성
- * LLM 응답을 DB에 저장하고 `ChatResponse`로 반환
+1. **리포지토리 클론**
+```bash
+git clone https://github.com/dongkoony/LLM-Quality-Observer.git
+cd LLM-Quality-Observer
+```
-* `app/app/config.py`
+2. **환경 변수 설정**
+```bash
+cp configs/env/.env.local.example configs/env/.env.local
+# .env.local 파일 편집하여 API 키 설정
+```
- * Pydantic `Settings` 정의
- * 환경 변수 로드:
+3. **서비스 시작**
+```bash
+cd infra/docker
+docker compose -f docker-compose.local.yml up --build
+```
- * `APP_ENV`
- * `DATABASE_URL`
- * `OPENAI_MODEL_MAIN`
- * `LLM_API_BASE_URL`
- * `LLM_API_KEY`
- * `LOG_LEVEL`
+4. **서비스 확인**
+```bash
+# Gateway API
+curl http://localhost:18000/health
-* `app/app/llm_client.py`
+# Evaluator
+curl http://localhost:18001/health
- * OpenAI Python SDK 래퍼
- * `OPENAI_MODEL_MAIN` 을 기본 모델로 사용
- * `client.responses.create(...)` 호출
- * `(response_text, latency_ms)` 튜플 반환
+# Prometheus
+open http://localhost:9090
-* `app/app/db.py`
+# Grafana
+open http://localhost:3001 # admin/admin
+```
- * SQLAlchemy 엔진 및 세션 생성
- * FastAPI `Depends` 로 사용하는 `get_db()` 제공
+---
-* `app/app/models.py`
+## 📖 사용 가이드
- * SQLAlchemy ORM 모델: `LLMLog`
- * 컬럼:
+### 1. LLM 요청 전송
- * `id`, `created_at`
- * `user_id`, `prompt`, `response`
- * `model_version`
- * `latency_ms`
- * `status` (예: `"success"`)
+```bash
+curl -X POST "http://localhost:18000/chat" \
+ -H "Content-Type: application/json" \
+ -d '{
+ "prompt": "Explain quantum computing in simple terms",
+ "user_id": "test-user",
+ "model_version": "gpt-5-mini"
+ }'
+```
-* `app/app/schemas.py`
+**응답 예시:**
+```json
+{
+ "id": 1,
+ "prompt": "Explain quantum computing...",
+ "response": "Quantum computing is...",
+ "model_version": "gpt-5-mini",
+ "latency_ms": 1234,
+ "status": "success"
+}
+```
- * Pydantic 스키마:
+### 2. 평가 실행
- * `ChatRequest` (요청)
- * `ChatResponse` (응답)
+**수동 평가:**
+```bash
+# 규칙 기반 평가
+curl -X POST "http://localhost:18001/evaluate-once?limit=10&judge_type=rule"
-* `app/pyproject.toml`
+# LLM-as-a-Judge 평가
+curl -X POST "http://localhost:18001/evaluate-once?limit=10&judge_type=llm"
+```
- * gateway-api 서비스용 Python 패키지/의존성 정의
- * 로컬 및 Docker 빌드 시 `uv sync`에 사용
+**자동 평가:** 스케줄러가 설정된 간격(기본 60분)마다 자동 실행
-#### `services/evaluator` (향후 구현)
+### 3. 대시보드 확인
-* `llm_logs` 테이블을 읽어 LLM 응답의 품질 점수를 계산하는 서비스
-* 휴리스틱, LLM-as-a-judge, 사람 피드백 등 다양한 방식의 평가를 시도할 예정
-* 현재는 `pyproject.toml`과 `Dockerfile`만 준비된 상태 (스켈레톤)
+**Grafana 대시보드:**
+1. http://localhost:3001 접속
+2. admin/admin으로 로그인
+3. Dashboards → LLM Quality Observer 선택
-#### `services/dashboard` (향후 구현)
+**포함된 메트릭:**
+- HTTP 요청 비율 및 지연시간
+- LLM 모델별 성능
+- 평가 점수 분포
+- 알림 전송 현황
+- 스케줄러 실행 상태
-* 품질 지표, 지연 시간, 에러율 등을 시각화하는 대시보드 서비스
-* Streamlit 또는 FastAPI 기반 UI를 고려
-* 마찬가지로 `pyproject.toml`과 `Dockerfile`만 준비된 상태
+### 4. 데이터베이스 조회
-#### `infra/docker`
+```bash
+# PostgreSQL 접속
+docker exec -it llm-postgres psql -U llm_user -d llm_quality
-* `docker-compose.local.yml`
+-- 최근 로그 확인
+SELECT id, created_at, user_id,
+ LEFT(prompt, 50) AS prompt,
+ model_version, latency_ms, status
+FROM llm_logs
+ORDER BY id DESC
+LIMIT 10;
- * 로컬 개발용 Docker Compose 스택:
+-- 평가 결과 확인
+SELECT l.id, l.prompt,
+ e.score_overall, e.score_instruction_following, e.score_truthfulness,
+ e.judge_type, e.comments
+FROM llm_logs l
+JOIN llm_evaluations e ON l.id = e.log_id
+ORDER BY e.created_at DESC
+LIMIT 10;
+```
- * `llm-postgres` (Postgres 16)
- * `llm-gateway-api` (FastAPI + OpenAI client)
- * `llm-evaluator` (placeholder)
- * `llm-dashboard` (placeholder)
- * 기본적으로 gateway-api를 `localhost:18000`에 바인딩
+---
-#### `configs/env`
+## 🔧 주요 기능 상세
+
+### Gateway API (v0.1.0+)
+
+**엔드포인트:**
+- `GET /health` - 헬스 체크
+- `POST /chat` - LLM 요청 처리
+- `GET /docs` - Swagger UI
+- `GET /metrics` - Prometheus 메트릭
+
+**기능:**
+- OpenAI GPT 모델 호출
+- 자동 로깅 (프롬프트, 응답, 지연시간, 상태)
+- 모델 버전 추적
+- Prometheus 메트릭 노출
+
+### Evaluator Service (v0.2.0+)
+
+**평가 방식:**
+
+1. **규칙 기반 평가** (빠름, 저렴):
+ - 응답 길이 검사
+ - 키워드 검증
+ - 포맷 준수 확인
+
+2. **LLM-as-a-Judge** (v0.3.0+, 정확, 비용 발생):
+ - GPT-4 기반 품질 평가
+ - 다차원 점수 (전체, 지시사항 준수, 진실성)
+ - 상세한 평가 코멘트
+
+**자동 스케줄러** (v0.4.0+):
+- APScheduler로 주기적 평가
+- 설정 가능한 간격 및 배치 크기
+- 자동 시작/정지
+
+**알림 시스템** (v0.4.0+, v0.5.0):
+- **Slack**: 웹훅 통합
+- **Discord**: 웹훅 통합
+- **Email** (v0.5.0): SMTP (Gmail 등)
+- 낮은 품질 즉시 알림
+- 배치 평가 요약
+
+### 모니터링 (v0.5.0)
+
+**Prometheus 메트릭:**
+- `llm_gateway_http_requests_total` - HTTP 요청 수
+- `llm_gateway_http_request_duration_seconds` - 요청 지연시간
+- `llm_gateway_llm_requests_total` - LLM 호출 수
+- `llm_evaluator_evaluations_total` - 평가 수
+- `llm_evaluator_evaluation_scores` - 점수 분포
+- `llm_evaluator_notifications_sent_total` - 알림 전송 수
+- `llm_evaluator_pending_logs` - 평가 대기 로그 수
+
+**Grafana 대시보드:**
+- 14개 시각화 패널
+- 실시간 성능 모니터링
+- 품질 추세 분석
+- 알림 현황 추적
-* `.env.local`
+---
- * docker-compose에서 참조하는 local 환경 변수 파일
- * 실제 경로/파일명은 `docker-compose.local.yml` 의 `env_file` 설정과 맞춰 사용
+## ⚙️ 설정
-예시 `.env.local`:
+### 환경 변수
-```env
-# Application
+```bash
+# 애플리케이션
APP_ENV=local
LOG_LEVEL=DEBUG
-# LLM
-OPENAI_MODEL_MAIN=gpt-5-mini
+# LLM 모델
+OPENAI_MODEL_MAIN=gpt-5-mini # Gateway에서 사용할 모델
+OPENAI_MODEL_JUDGE=gpt-4o-mini # 평가에 사용할 모델
LLM_API_BASE_URL=https://api.openai.com/v1
LLM_API_KEY=sk-...
-# Database
+# 데이터베이스
DATABASE_URL=postgresql://llm_user:llm_password@postgres:5432/llm_quality
+
+# 배치 평가 스케줄러 (v0.4.0+)
+ENABLE_AUTO_EVALUATION=true # 자동 평가 활성화
+EVALUATION_INTERVAL_MINUTES=60 # 평가 주기 (분)
+EVALUATION_BATCH_SIZE=10 # 배치 크기
+EVALUATION_JUDGE_TYPE=rule # 기본 평가 방식 (rule/llm)
+
+# 알림 설정 (v0.4.0+)
+SLACK_WEBHOOK_URL=https://hooks.slack.com/services/YOUR/WEBHOOK/URL
+DISCORD_WEBHOOK_URL=https://discord.com/api/webhooks/YOUR/WEBHOOK/URL
+NOTIFICATION_SCORE_THRESHOLD=3 # 알림 임계값 (≤ 3점)
+
+# 이메일 알림 (v0.5.0+)
+SMTP_HOST=smtp.gmail.com
+SMTP_PORT=587
+SMTP_USERNAME=your-email@gmail.com
+SMTP_PASSWORD=your-app-password
+SMTP_FROM_EMAIL=your-email@gmail.com
+SMTP_TO_EMAILS=recipient1@example.com,recipient2@example.com
+```
+
+---
+
+## 🏗️ 프로젝트 구조
+
+```
+LLM-Quality-Observer/
+├── services/
+│ ├── gateway-api/ # Gateway API 서비스
+│ │ ├── app/
+│ │ │ ├── main.py # FastAPI 앱
+│ │ │ ├── config.py # 설정
+│ │ │ ├── llm_client.py # OpenAI 클라이언트
+│ │ │ ├── db.py # 데이터베이스
+│ │ │ ├── models.py # SQLAlchemy 모델
+│ │ │ ├── schemas.py # Pydantic 스키마
+│ │ │ └── metrics.py # Prometheus 메트릭
+│ │ ├── tests/
+│ │ ├── Dockerfile
+│ │ └── pyproject.toml
+│ │
+│ ├── evaluator/ # Evaluator 서비스
+│ │ ├── app/
+│ │ │ ├── main.py # FastAPI 앱
+│ │ │ ├── rules.py # 규칙 기반 평가
+│ │ │ ├── llm_judge.py # LLM-as-a-Judge
+│ │ │ ├── scheduler.py # APScheduler
+│ │ │ ├── notifier.py # 알림 시스템
+│ │ │ ├── metrics.py # Prometheus 메트릭
+│ │ │ └── utils.py # 유틸리티
+│ │ ├── tests/
+│ │ ├── Dockerfile
+│ │ └── pyproject.toml
+│ │
+│ ├── dashboard/ # Streamlit 대시보드
+│ │ ├── app/
+│ │ │ ├── main.py
+│ │ │ ├── models.py
+│ │ │ └── config.py
+│ │ ├── Dockerfile
+│ │ └── pyproject.toml
+│ │
+│ └── web/ # Next.js 웹 대시보드
+│ └── dashboard/
+│ ├── app/
+│ ├── components/
+│ ├── locales/ # 다국어 지원
+│ └── lib/
+│
+├── infra/
+│ ├── docker/
+│ │ └── docker-compose.local.yml
+│ ├── prometheus/
+│ │ └── prometheus.yml
+│ └── grafana/
+│ ├── provisioning/
+│ ├── dashboards/
+│ └── DASHBOARD_GUIDE-ko.md
+│
+├── configs/
+│ └── env/
+│ ├── .env.local.example
+│ └── .env.local # gitignored
+│
+├── docs/
+│ ├── release_notes/ # 릴리즈 노트
+│ │ ├── RELEASE_NOTES_v0.1.0.md
+│ │ ├── RELEASE_NOTES_v0.2.0.md
+│ │ ├── RELEASE_NOTES_v0.3.0.md
+│ │ ├── RELEASE_NOTES_v0.4.0.md
+│ │ └── RELEASE_NOTES_v0.5.0.md
+│ ├── RELEASE_NOTES_v0.5.0_ko.md
+│ ├── METRICS.md
+│ ├── EMAIL_SETUP.md
+│ └── README-main-us.md
+│
+├── .github/
+│ └── workflows/
+│ └── ci.yml # GitHub Actions CI/CD
+│
+├── .flake8 # Flake8 설정
+└── README.md
```
---
-### 로컬 실행 방법 (Docker)
+## 🧪 테스트
-#### 1. 리포지토리 클론
+### 헬스 체크 테스트
```bash
-git clone https://github.com/dongkoony/LLM-Quality-Observer.git
-cd LLM-Quality-Observer
+# 모든 서비스 헬스 체크
+curl http://localhost:18000/health # Gateway API
+curl http://localhost:18001/health # Evaluator
+curl http://localhost:9090/-/healthy # Prometheus
+curl http://localhost:3001/api/health # Grafana
```
-#### 2. `.env.local` 설정
+### 통합 테스트
```bash
-cp configs/env/.env.local configs/env/.env.local.example # 필요시 백업
-# 이후 configs/env/.env.local 내용을 직접 수정
-```
+# 1. LLM 요청 전송
+curl -X POST "http://localhost:18000/chat" \
+ -H "Content-Type: application/json" \
+ -d '{"prompt": "Test", "user_id": "test"}'
-* `LLM_API_KEY` 에 OpenAI API 키 입력
-* `OPENAI_MODEL_MAIN` 을 `gpt-5-mini` 로 설정 (또는 다른 모델)
+# 2. 평가 실행
+curl -X POST "http://localhost:18001/evaluate-once?limit=1"
-#### 3. Docker Compose 실행
+# 3. 메트릭 확인
+curl http://localhost:18000/metrics | grep llm_gateway
+curl http://localhost:18001/metrics | grep llm_evaluator
-```bash
-cd infra/docker
-docker compose -f docker-compose.local.yml up --build
+# 4. Grafana 대시보드 확인
+open http://localhost:3001
```
-* Gateway API: `http://localhost:18000`
-* Postgres: 컨테이너 내부에서 `postgres:5432`
+### 자동화 테스트
+
+```bash
+# CI/CD 파이프라인 로컬 실행
+cd services/gateway-api
+pytest tests/
+
+cd ../evaluator
+pytest tests/
+
+# Lint 체크 (리포지토리 루트에서 실행)
+cd ../..
+flake8 services/
+```
---
-### Gateway API 사용법
+## 📈 모니터링 가이드
-#### Health 체크
+### Prometheus 쿼리 예시
-```bash
-curl http://localhost:18000/health
-# -> { "status": "ok" }
-```
+```promql
+# HTTP 요청 비율
+sum(rate(llm_gateway_http_requests_total[5m]))
-#### Swagger UI
+# LLM 지연시간 p95
+histogram_quantile(0.95, sum(rate(llm_gateway_llm_request_duration_seconds_bucket[5m])) by (le, model))
-브라우저에서:
+# 평가 점수 중앙값
+histogram_quantile(0.50, sum(rate(llm_evaluator_evaluation_scores_bucket{score_type="overall"}[5m])) by (le))
-```text
-http://localhost:18000/docs
+# 평가 대기 로그 수
+llm_evaluator_pending_logs
```
-에 접속 후 `POST /chat` 엔드포인트로 테스트 가능.
+### Grafana 대시보드 사용
-#### `/chat` 예시 요청
+자세한 가이드는 [Grafana 대시보드 가이드](./infra/grafana/DASHBOARD_GUIDE-ko.md) 참조
-```bash
-curl -X POST "http://localhost:18000/chat" \
- -H "Content-Type: application/json" \
- -d '{
- "prompt": "Explain what LLM-Quality-Observer is in one sentence.",
- "user_id": "test-user",
- "model_version": null
- }'
-```
+---
-예시 응답:
+## 📚 문서
-```json
-{
- "response": "LLM-Quality-Observer is a monitoring and evaluation framework that continuously assesses and tracks the quality of large language model outputs.",
- "model_version": "gpt-5-mini",
- "latency_ms": 4735.19
-}
-```
+### 릴리즈 노트
+
+- [v0.5.0 (Latest)](./docs/RELEASE_NOTES_v0.5.0_ko.md) - Prometheus, Grafana, 이메일 알림
+- [v0.4.0](./docs/release_notes/RELEASE_NOTES_v0.4.0.md) - 스케줄러, Slack/Discord 알림, CI/CD
+- [v0.3.0](./docs/release_notes/RELEASE_NOTES_v0.3.0.md) - LLM-as-a-Judge, 다국어 지원
+- [v0.2.0](./docs/release_notes/RELEASE_NOTES_v0.2.0.md) - Dashboard, CORS, 규칙 기반 평가
+- [v0.1.0](./docs/release_notes/RELEASE_NOTES_v0.1.0.md) - 초기 릴리즈 (Gateway + Evaluator)
-이 호출 시:
+### 기술 문서
-* OpenAI GPT-5 mini가 실제로 호출되고
-* 응답과 지연 시간이 계산되며
-* `llm_logs` 테이블에 로그가 저장됨
+- [메트릭 참조](./docs/METRICS.md) - Prometheus 메트릭 상세
+- [이메일 설정 가이드](./docs/EMAIL_SETUP.md) - Gmail SMTP 설정
+- [Grafana 대시보드 가이드](./infra/grafana/DASHBOARD_GUIDE-ko.md) - 대시보드 사용법
---
-### Postgres에서 로그 확인
+## 🛣️ 로드맵
-```bash
-docker exec -it llm-postgres psql -U llm_user -d llm_quality
+### 완료된 기능
-SELECT id, created_at, user_id,
- LEFT(prompt, 60) AS prompt_snippet,
- LEFT(response, 60) AS response_snippet,
- model_version,
- latency_ms,
- status
-FROM llm_logs
-ORDER BY id DESC
-LIMIT 10;
-```
+- ✅ v0.1.0: Gateway API + Evaluator 기본 구조
+- ✅ v0.2.0: 웹 대시보드 + 규칙 기반 평가
+- ✅ v0.3.0: LLM-as-a-Judge + 다국어 지원
+- ✅ v0.4.0: 자동 스케줄러 + Slack/Discord 알림
+- ✅ v0.5.0: Prometheus + Grafana + 이메일 알림
+
+### 향후 계획 (v0.6.0+)
+
+- [ ] **Alertmanager 통합**: 고급 알림 규칙 및 라우팅
+- [ ] **다중 LLM 제공자 지원**: Anthropic Claude, Google Gemini 등
+- [ ] **비용 추적**: 토큰 사용량 및 비용 모니터링
+- [ ] **A/B 테스트**: 프롬프트 및 모델 비교
+- [ ] **사용자 피드백**: RLHF 스타일 사람 평가
+- [ ] **Kubernetes 배포**: Helm 차트 및 배포 가이드
+- [ ] **API 인증**: JWT 기반 보안
+- [ ] **Rate Limiting**: 요청 제한 및 할당량 관리
+
+---
+
+## 🔒 보안
+
+### 주의사항
+
+- `.env.local` 파일을 절대 커밋하지 마세요 (gitignored)
+- OpenAI API 키를 안전하게 보관하세요
+- Slack/Discord 웹훅 URL을 공개하지 마세요
+- SMTP 비밀번호는 앱 비밀번호를 사용하세요 (Gmail)
+
+### 권장사항
+
+- 프로덕션에서는 환경 변수를 시크릿 관리자에 저장
+- API 엔드포인트에 인증 추가 (v0.6.0+)
+- HTTPS/TLS 사용
+- 정기적인 의존성 업데이트
+
+---
+
+## 🤝 기여
+
+기여를 환영합니다! 다음 절차를 따라주세요:
+
+1. Fork the repository
+2. Create a feature branch (`git checkout -b feat/amazing-feature`)
+3. Commit your changes (`git commit -m 'feat: add amazing feature'`)
+4. Push to the branch (`git push origin feat/amazing-feature`)
+5. Open a Pull Request
+
+### 개발 가이드라인
+
+- Python 코드는 Flake8 스타일 가이드 준수
+- 모든 PR은 CI 테스트 통과 필수
+- 커밋 메시지는 Conventional Commits 형식 사용
+- 새 기능에는 테스트 추가
---
-### 로드맵 (Roadmap)
+## 📄 라이선스
-향후 계획:
+이 프로젝트는 MIT 라이선스 하에 배포됩니다.
-* **Evaluator Service**
+---
- * `llm_logs` 기반 품질 점수 계산
- * 규칙/휴리스틱 기반 평가
- * LLM-as-a-judge 프롬프트 기반 평가
- * 사람 피드백(RLHF 스타일) 저장 및 활용
+## 👥 제작자
-* **Dashboard Service**
+**Dong-hyeon Shin (dongkoony)**
+- GitHub: [@dongkoony](https://github.com/dongkoony)
+- Email: dhyeon.shin@icloud.com
- * 모델/버전별 평균 점수
- * 지연 시간 분포
- * 에러율, 실패 패턴
- * 기간 / 사용자 / 모델 버전 / 태그별 필터링
+---
-* **Alerting / 알림**
- * 점수가 특정 임계값 이하로 떨어질 때 알림
- * p95 latency 가 기준치를 넘을 때 알림
- * Slack / 이메일 연동
+## 📞 문의 및 지원
-* **Cost Awareness**
+- **Issues**: [GitHub Issues](https://github.com/dongkoony/LLM-Quality-Observer/issues)
+- **Discussions**: [GitHub Discussions](https://github.com/dongkoony/LLM-Quality-Observer/discussions)
+- **Email**: dhyeon.shin@icloud.com
- * 모델/버전별 토큰 사용량 및 비용 추적
- * 품질 점수와 비용을 함께 보며 cost–quality 트레이드오프 분석
+---
----
\ No newline at end of file
+**⭐ 이 프로젝트가 도움이 되셨다면 Star를 눌러주세요!**
diff --git a/configs/env/.env.local.example b/configs/env/.env.local.example
index c38260e..5a98d34 100644
--- a/configs/env/.env.local.example
+++ b/configs/env/.env.local.example
@@ -21,3 +21,11 @@ EVALUATION_JUDGE_TYPE=rule
# SLACK_WEBHOOK_URL=https://hooks.slack.com/services/YOUR/WEBHOOK/URL
# DISCORD_WEBHOOK_URL=https://discord.com/api/webhooks/YOUR/WEBHOOK/URL
NOTIFICATION_SCORE_THRESHOLD=3
+
+# Email Notification Settings (optional)
+# SMTP_HOST=smtp.gmail.com
+# SMTP_PORT=587
+# SMTP_USERNAME=your-email@gmail.com
+# SMTP_PASSWORD=your-app-password
+# SMTP_FROM_EMAIL=your-email@gmail.com
+# SMTP_TO_EMAILS=recipient1@example.com,recipient2@example.com
diff --git a/docs/EMAIL_SETUP.md b/docs/EMAIL_SETUP.md
new file mode 100644
index 0000000..9155850
--- /dev/null
+++ b/docs/EMAIL_SETUP.md
@@ -0,0 +1,495 @@
+# Email Notification Setup Guide
+
+This guide walks you through setting up email notifications for the LLM Quality Observer system.
+
+## Table of Contents
+
+- [Overview](#overview)
+- [Prerequisites](#prerequisites)
+- [Gmail Setup](#gmail-setup)
+- [Other SMTP Providers](#other-smtp-providers)
+- [Configuration](#configuration)
+- [Testing](#testing)
+- [Troubleshooting](#troubleshooting)
+- [Security Best Practices](#security-best-practices)
+
+---
+
+## Overview
+
+LLM Quality Observer supports email notifications for:
+- **Low-quality alerts**: Sent when an LLM response scores at or below the configured threshold
+- **Batch evaluation summaries**: Sent after scheduled batch evaluation runs complete
+
+Email notifications are sent via SMTP and support multiple recipients.
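
As a rough illustration of what a multi-recipient alert looks like on the wire, the sketch below assembles a plain-text message with Python's standard library. The function name `build_alert_email` and the example addresses are hypothetical and not taken from `notifier.py`; the real service may build its messages differently.

```python
from email.message import EmailMessage


def build_alert_email(
    subject: str, body: str, sender: str, recipients: list[str]
) -> EmailMessage:
    """Assemble a plain-text alert addressed to every recipient."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = sender
    # A single To header carries all recipients, comma-separated.
    msg["To"] = ", ".join(recipients)
    msg.set_content(body)
    return msg


alert = build_alert_email(
    subject="Low-quality response detected",
    body="Score: 2/5",
    sender="alerts@example.com",
    recipients=["a@example.com", "b@example.com"],
)
print(alert["To"])
```

Putting every address in one `To` header keeps the alert to a single send per event, which matters once rate limits on the SMTP provider come into play.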
+
+---
+
+## Prerequisites
+
+- SMTP server credentials (email provider)
+- Sender email address
+- Recipient email address(es)
+- For Gmail: 2-factor authentication enabled and app password
+
+---
+
+## Gmail Setup
+
+### Step 1: Enable 2-Factor Authentication
+
+1. Go to your [Google Account](https://myaccount.google.com/)
+2. Navigate to **Security**
+3. Under "Signing in to Google", select **2-Step Verification**
+4. Follow the prompts to enable 2FA
+
+### Step 2: Generate App Password
+
+1. In **Security** settings, scroll to **2-Step Verification**
+2. At the bottom, select **App passwords**
+3. You may need to sign in again
+4. Under "Select app", choose **Mail**
+5. Under "Select device", choose **Other (Custom name)**
+6. Enter "LLM Quality Observer" and click **Generate**
+7. Copy the 16-character password (you won't be able to see it again)
+
+### Step 3: Configure Environment Variables
+
+Add these to your `.env.local` file:
+
+```bash
+# Email Notification Settings
+SMTP_HOST=smtp.gmail.com
+SMTP_PORT=587
+SMTP_USERNAME=your-email@gmail.com
+SMTP_PASSWORD=abcdefghijklmnop  # App password from Step 2, spaces removed
+SMTP_FROM_EMAIL=your-email@gmail.com
+SMTP_TO_EMAILS=recipient1@example.com,recipient2@example.com
+```
+
+**Note:** Google displays the app password in four groups separated by spaces. Remove the spaces when copying it into `.env.local`.
+
+---
+
+## Other SMTP Providers
+
+### Microsoft 365 / Outlook
+
+```bash
+SMTP_HOST=smtp.office365.com
+SMTP_PORT=587
+SMTP_USERNAME=your-email@outlook.com
+SMTP_PASSWORD=your-password
+SMTP_FROM_EMAIL=your-email@outlook.com
+SMTP_TO_EMAILS=recipient@example.com
+```
+
+**Note:** If using 2FA, generate an app password in your Microsoft account settings.
+
+### SendGrid
+
+```bash
+SMTP_HOST=smtp.sendgrid.net
+SMTP_PORT=587
+SMTP_USERNAME=apikey
+SMTP_PASSWORD=your-sendgrid-api-key
+SMTP_FROM_EMAIL=verified-sender@yourdomain.com
+SMTP_TO_EMAILS=recipient@example.com
+```
+
+**Note:** You must verify your sender email in SendGrid before sending.
+
+### AWS SES
+
+```bash
+SMTP_HOST=email-smtp.us-east-1.amazonaws.com
+SMTP_PORT=587
+SMTP_USERNAME=your-smtp-username
+SMTP_PASSWORD=your-smtp-password
+SMTP_FROM_EMAIL=verified-sender@yourdomain.com
+SMTP_TO_EMAILS=recipient@example.com
+```
+
+**Note:** Generate SMTP credentials in AWS SES console and verify your sender domain.
+
+### Mailgun
+
+```bash
+SMTP_HOST=smtp.mailgun.org
+SMTP_PORT=587
+SMTP_USERNAME=postmaster@your-domain.mailgun.org
+SMTP_PASSWORD=your-smtp-password
+SMTP_FROM_EMAIL=noreply@your-domain.mailgun.org
+SMTP_TO_EMAILS=recipient@example.com
+```
+
+---
+
+## Configuration
+
+### Environment Variables
+
+| Variable | Required | Default | Description |
+|----------|----------|---------|-------------|
+| `SMTP_HOST` | Yes | None | SMTP server hostname |
+| `SMTP_PORT` | No | 587 | SMTP server port (use 587 for TLS) |
+| `SMTP_USERNAME` | Yes | None | SMTP authentication username |
+| `SMTP_PASSWORD` | Yes | None | SMTP authentication password |
+| `SMTP_FROM_EMAIL` | Yes | None | Sender email address |
+| `SMTP_TO_EMAILS` | Yes | None | Comma-separated recipient emails |
+| `NOTIFICATION_SCORE_THRESHOLD` | No | 3 | Threshold for low-quality alerts (1-5) |
+
+### Multiple Recipients
+
+To send notifications to multiple recipients, separate email addresses with commas:
+
+```bash
+SMTP_TO_EMAILS=team-lead@company.com,on-call@company.com,quality-team@company.com
+```
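
A minimal sketch of how a comma-separated value like this could be turned into a recipient list. `parse_recipients` is a hypothetical helper used only for illustration; the evaluator's own parsing may differ.

```python
def parse_recipients(raw: str) -> list[str]:
    """Split a comma-separated address string, dropping blanks and padding."""
    return [addr.strip() for addr in raw.split(",") if addr.strip()]


print(parse_recipients("team-lead@company.com, on-call@company.com,"))
```

Stripping whitespace and skipping empty entries makes the setting tolerant of a trailing comma or a space after each comma.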
+
+### Notification Threshold
+
+The `NOTIFICATION_SCORE_THRESHOLD` determines when low-quality alerts are sent:
+
+```bash
+# Send alerts for scores of 3 or below
+NOTIFICATION_SCORE_THRESHOLD=3
+
+# Only send alerts for very low scores (2 or below)
+NOTIFICATION_SCORE_THRESHOLD=2
+```
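
The threshold is inclusive: a score equal to the threshold also triggers an alert. The sketch below states that rule in code; `needs_alert` is a hypothetical name, not a function from the evaluator.

```python
def needs_alert(score: int, threshold: int = 3) -> bool:
    """Return True when a score is at or below the alert threshold."""
    return score <= threshold


for score in (2, 3, 4):
    print(score, needs_alert(score))
```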
+
+---
+
+## Testing
+
+### Step 1: Update Configuration
+
+1. Edit your `.env.local` file with SMTP credentials
+2. Set `SMTP_TO_EMAILS` to your test email address
+
+### Step 2: Restart Services
+
+```bash
+cd infra/docker
+docker compose -f docker-compose.local.yml restart evaluator
+```
+
+### Step 3: Trigger a Low-Quality Alert
+
+#### Option A: Submit a Low-Quality Request
+
+```bash
+curl -X POST "http://localhost:18000/chat" \
+ -H "Content-Type: application/json" \
+ -d '{
+ "prompt": "x",
+ "user_id": "test-user"
+ }'
+```
+
+Wait a few seconds, then trigger evaluation:
+
+```bash
+curl -X POST "http://localhost:18001/evaluate-once?limit=5"
+```
+
+#### Option B: Use Python Test Script
+
+Create `test_email.py` in `services/evaluator`:
+
+```python
+import asyncio
+import sys
+sys.path.append('services/evaluator')
+
+from app.notifier import send_email_notification
+
+async def test():
+    result = await send_email_notification(
+        subject="🚨 Test Alert - LLM Quality Observer",
+        message="""
+Test email notification from LLM Quality Observer.
+
+If you're seeing this, your email configuration is working correctly!
+
+Score: 2/5
+Judge: rule-based
+Status: Low Quality
+        """.strip(),
+        notification_type="alert"
+    )
+    print(f"Email sent: {result}")
+
+asyncio.run(test())
+```
+
+Run the test:
+
+```bash
+cd services/evaluator
+uv run python test_email.py
+```
+
+### Step 4: Check Logs
+
+```bash
+docker compose -f docker-compose.local.yml logs evaluator | grep -i email
+```
+
+You should see:
+```
+이메일 알림 전송 성공: ['recipient@example.com']
+```
+
+---
+
+## Troubleshooting
+
+### Error: "SMTP settings not configured"
+
+**Cause:** One or more required SMTP environment variables are missing.
+
+**Solution:**
+1. Verify all required variables are set in `.env.local`
+2. Restart the evaluator service
+3. Check logs: `docker compose logs evaluator`
+
+### Error: "Authentication failed"
+
+**Cause:** Invalid SMTP credentials.
+
+**Solution:**
+- **Gmail:** Ensure you're using an app password, not your account password
+- **Other providers:** Verify username and password are correct
+- Check if 2FA is required and generate an app password
+
+### Error: "Connection timeout"
+
+**Cause:** Cannot reach SMTP server.
+
+**Solution:**
+1. Verify `SMTP_HOST` is correct
+2. Check `SMTP_PORT` (587 for STARTTLS, 465 for implicit TLS/SSL)
+3. Ensure firewall allows outbound SMTP connections
+4. Try from host machine: `telnet smtp.gmail.com 587`
+
+### Error: "Sender address rejected"
+
+**Cause:** Sender email not verified or doesn't match credentials.
+
+**Solution:**
+- **Gmail:** Use the same email as `SMTP_USERNAME`
+- **SendGrid/SES:** Verify sender email/domain in provider dashboard
+- Check `SMTP_FROM_EMAIL` matches verified sender
+
+### Emails Not Arriving
+
+**Possible causes:**
+1. **Spam folder:** Check recipient's spam/junk folder
+2. **Rate limiting:** Provider may be rate-limiting sends
+3. **Blocked sender:** Recipient's email server may be blocking emails
+
+**Solutions:**
+1. Add sender to recipient's contacts/whitelist
+2. Check evaluator logs for send confirmation
+3. Use a verified domain instead of Gmail
+4. Contact email provider support
+
+### SSL/TLS Errors
+
+**Symptoms:** "SSL handshake failed" or similar
+
+**Solution:**
+```bash
+# Try one of these port configurations (set only one at a time)
+SMTP_PORT=587 # STARTTLS (recommended)
+SMTP_PORT=465 # SSL
+SMTP_PORT=25 # Unencrypted (not recommended)
+```
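
The three ports correspond to three different ways of opening the connection. The sketch below shows how a client might pick between them with Python's `smtplib`; `connection_mode` and `open_smtp` are hypothetical helpers, and whether the evaluator's notifier supports port 465 is an assumption that should be checked against `notifier.py`.

```python
import smtplib


def connection_mode(port: int) -> str:
    """Map a conventional SMTP port to the way the session is secured."""
    if port == 465:
        return "implicit-tls"  # TLS is negotiated before any SMTP command
    if port == 587:
        return "starttls"      # plain connection upgraded via STARTTLS
    return "plain"             # e.g. port 25, no encryption


def open_smtp(host: str, port: int, timeout: float = 10.0) -> smtplib.SMTP:
    """Open a connection using the mode implied by the port."""
    mode = connection_mode(port)
    if mode == "implicit-tls":
        return smtplib.SMTP_SSL(host, port, timeout=timeout)
    client = smtplib.SMTP(host, port, timeout=timeout)
    if mode == "starttls":
        client.starttls()
    return client


print(connection_mode(587))
```

A handshake error often means the mode and the port disagree, for example attempting STARTTLS against a server that expects TLS from the first byte on port 465.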
+
+---
+
+## Security Best Practices
+
+### 1. Use App Passwords
+
+Never use your main email account password. Always generate app-specific passwords.
+
+**Why:** If compromised, you can revoke the app password without changing your main password.
+
+### 2. Restrict SMTP Credentials
+
+```bash
+# .env.local should be gitignored
+echo ".env.local" >> .gitignore
+
+# Set proper file permissions
+chmod 600 configs/env/.env.local
+```
+
+### 3. Use Environment-Specific Credentials
+
+Don't share credentials between environments:
+
+```bash
+# Development
+SMTP_FROM_EMAIL=dev-alerts@company.com
+
+# Production
+SMTP_FROM_EMAIL=prod-alerts@company.com
+```
+
+### 4. Limit Recipients
+
+Only send to authorized recipients:
+
+```bash
+# Good: Specific team addresses
+SMTP_TO_EMAILS=ml-team@company.com,on-call@company.com
+
+# Bad: Personal emails
+SMTP_TO_EMAILS=john.personal@gmail.com
+```
+
+### 5. Monitor Email Usage
+
+Track email notification metrics:
+
+```promql
+# Email notification rate
+rate(llm_evaluator_notifications_sent_total{channel="email"}[5m])
+
+# Email failure rate
+rate(llm_evaluator_notifications_sent_total{channel="email",status="error"}[5m])
+```
+
+### 6. Rotate Credentials Regularly
+
+- Rotate app passwords every 90 days
+- Update credentials immediately if compromised
+- Document who has access to SMTP credentials
+
+---
+
+## Advanced Configuration
+
+### Custom Email Templates
+
+To customize email format, modify `services/evaluator/app/notifier.py`:
+
+```python
+# Current simple format
+html_message = message.replace("\n", "<br>")
+html_part = MIMEText(f"<html><body>{html_message}</body></html>", "html")
+
+# Custom HTML template
+html_template = f"""
+<html>
+  <body>
+    <h2>🚨 Low Quality Alert</h2>
+    <p>Score: {score}/5</p>
+    <p>{message}</p>
+  </body>
+</html>
+"""
+```
+
+### Conditional Email Sending
+
+Only send emails for specific conditions:
+
+```python
+# In notifier.py, modify send_low_quality_alert()
+def send_low_quality_alert(log: LLMLog, evaluation: LLMEvaluation):
+    # subject and message are assumed to be built earlier in this function
+    # Only send email for very low scores
+    if evaluation.score_overall <= 2:
+        asyncio.run(send_email_notification(subject, message))
+
+    # Always send Slack/Discord
+    send_slack_notification(message)
+    send_discord_notification(message)
+```
+
+### Email Throttling
+
+Prevent email spam by adding rate limiting:
+
+```python
+import time
+from collections import deque
+
+# Track recent emails
+email_timestamps = deque(maxlen=10)
+
+def should_send_email():
+    now = time.time()
+    # Max 10 emails per hour
+    if len(email_timestamps) == 10:
+        if now - email_timestamps[0] < 3600:
+            return False
+    email_timestamps.append(now)
+    return True
+```
+
+---
+
+## Monitoring
+
+### Key Metrics
+
+Monitor these Prometheus metrics for email health:
+
+```promql
+# Email send rate
+rate(llm_evaluator_notifications_sent_total{channel="email"}[5m])
+
+# Email success rate
+sum(rate(llm_evaluator_notifications_sent_total{channel="email",status="success"}[5m])) /
+sum(rate(llm_evaluator_notifications_sent_total{channel="email"}[5m]))
+
+# Recent email failures
+increase(llm_evaluator_notifications_sent_total{channel="email",status="error"}[1h])
+```
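
The success-rate query divides one per-second rate by another, so it has no meaningful value when no emails were sent in the window. The sketch below reproduces that arithmetic on raw counter samples. Both helper names are hypothetical, and the rate calculation is deliberately simplified: it ignores the counter-reset handling and extrapolation that Prometheus applies.

```python
from typing import Optional


def per_second_rate(first: float, last: float, window_seconds: float) -> float:
    """Average per-second increase of a counter across a window (simplified)."""
    return max(last - first, 0.0) / window_seconds


def success_ratio(success_rate: float, total_rate: float) -> Optional[float]:
    """Share of sends that succeeded, or None when nothing was sent."""
    if total_rate == 0:
        return None
    return success_rate / total_rate


# 27 successful sends out of 30 attempts during a 5-minute window
ok = per_second_rate(0, 27, 300)
total = per_second_rate(0, 30, 300)
print(success_ratio(ok, total))
```

A dashboard panel built on this query may therefore need a "no data" state for quiet periods rather than treating the gap as a failure.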
+
+### Grafana Alerts
+
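+Create alerts for email delivery issues. The rule below is written in Prometheus alerting-rule syntax; in Grafana, the same expression can serve as the alert query: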
+
+```yaml
+- alert: EmailDeliveryFailure
+  expr: |
+    rate(llm_evaluator_notifications_sent_total{channel="email",status="error"}[5m]) > 0.1
+  for: 10m
+  labels:
+    severity: warning
+  annotations:
+    summary: "Email notifications failing"
+    description: "Email delivery failure rate: {{ $value }}"
+```
+
+---
+
+## Support
+
+For issues not covered in this guide:
+- Check evaluator service logs: `docker compose logs evaluator`
+- Review SMTP provider documentation
+- Open GitHub issue with logs and configuration (redact credentials)
+
+---
+
+## Related Documentation
+
+- [Notification Settings](../configs/env/.env.local.example)
+- [Metrics Reference](./METRICS.md)
+- [Release Notes](./release_notes/RELEASE_NOTES_v0.5.0.md)
diff --git a/docs/METRICS.md b/docs/METRICS.md
new file mode 100644
index 0000000..3c288e1
--- /dev/null
+++ b/docs/METRICS.md
@@ -0,0 +1,382 @@
+# Metrics Reference Guide
+
+This document provides a comprehensive reference for all Prometheus metrics exposed by the LLM Quality Observer system.
+
+## Table of Contents
+
+- [Gateway API Metrics](#gateway-api-metrics)
+- [Evaluator Service Metrics](#evaluator-service-metrics)
+- [Common Labels](#common-labels)
+- [Example Queries](#example-queries)
+
+---
+
+## Gateway API Metrics
+
+All Gateway API metrics are prefixed with `llm_gateway_`.
+
+### HTTP Request Metrics
+
+#### `llm_gateway_http_requests_total`
+- **Type:** Counter
+- **Description:** Total number of HTTP requests received
+- **Labels:**
+ - `method`: HTTP method (GET, POST, etc.)
+ - `endpoint`: Request endpoint (/chat, /health, /metrics)
+ - `status`: HTTP status code (200, 400, 500, etc.)
+
+#### `llm_gateway_http_request_duration_seconds`
+- **Type:** Histogram
+- **Description:** HTTP request latency in seconds
+- **Labels:**
+ - `method`: HTTP method
+ - `endpoint`: Request endpoint
+- **Buckets:** 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0
+
+#### `llm_gateway_active_requests`
+- **Type:** Gauge
+- **Description:** Number of currently active HTTP requests
+
+### LLM Request Metrics
+
+#### `llm_gateway_llm_requests_total`
+- **Type:** Counter
+- **Description:** Total number of LLM API calls made
+- **Labels:**
+ - `model`: LLM model used (gpt-5-mini, etc.)
+ - `status`: Request status (success, error)
+
+#### `llm_gateway_llm_request_duration_seconds`
+- **Type:** Histogram
+- **Description:** LLM API call latency in seconds
+- **Labels:**
+ - `model`: LLM model used
+- **Buckets:** 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 30.0, 60.0
+
+### Database Metrics
+
+#### `llm_gateway_db_queries_total`
+- **Type:** Counter
+- **Description:** Total number of database queries executed
+- **Labels:**
+ - `operation`: Query type (select, insert, update, delete)
+ - `table`: Database table name
+ - `status`: Query status (success, error)
+
+#### `llm_gateway_db_query_duration_seconds`
+- **Type:** Histogram
+- **Description:** Database query latency in seconds
+- **Labels:**
+ - `operation`: Query type
+ - `table`: Database table name
+- **Buckets:** 0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0
+
+#### `llm_gateway_logs_saved_total`
+- **Type:** Counter
+- **Description:** Total number of log entries saved to database
+- **Labels:**
+ - `status`: Save status (success, error)
+
+### Application Info
+
+#### `llm_gateway_info`
+- **Type:** Info
+- **Description:** Gateway API application metadata
+- **Labels:**
+ - `version`: Application version
+ - `environment`: Deployment environment (local, dev, prod)
+
+---
+
+## Evaluator Service Metrics
+
+All Evaluator metrics are prefixed with `llm_evaluator_`.
+
+### Evaluation Metrics
+
+#### `llm_evaluator_evaluations_total`
+- **Type:** Counter
+- **Description:** Total number of evaluations performed
+- **Labels:**
+ - `judge_type`: Type of judge used (rule, llm)
+ - `status`: Evaluation status (success, error)
+
+#### `llm_evaluator_evaluation_duration_seconds`
+- **Type:** Histogram
+- **Description:** Time taken to complete an evaluation
+- **Labels:**
+ - `judge_type`: Type of judge used
+- **Buckets:** 0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0
+
+#### `llm_evaluator_evaluation_scores`
+- **Type:** Histogram
+- **Description:** Distribution of evaluation scores
+- **Labels:**
+ - `judge_type`: Type of judge used
+ - `score_type`: Type of score (overall, instruction, truthfulness)
+- **Buckets:** 1, 2, 3, 4, 5
+
+### Batch Evaluation Metrics
+
+#### `llm_evaluator_batch_evaluations_total`
+- **Type:** Counter
+- **Description:** Total number of batch evaluation runs
+- **Labels:**
+ - `judge_type`: Type of judge used
+
+#### `llm_evaluator_batch_logs_processed`
+- **Type:** Counter
+- **Description:** Total number of logs processed in batch evaluations
+- **Labels:**
+ - `judge_type`: Type of judge used
+
+### Notification Metrics
+
+#### `llm_evaluator_notifications_sent_total`
+- **Type:** Counter
+- **Description:** Total number of notifications sent
+- **Labels:**
+ - `channel`: Notification channel (slack, discord, email)
+ - `type`: Notification type (alert, summary)
+ - `status`: Delivery status (success, error)
+
+#### `llm_evaluator_low_quality_alerts_total`
+- **Type:** Counter
+- **Description:** Total number of low-quality alerts triggered
+- **Labels:**
+ - `judge_type`: Type of judge that detected low quality
+
+### Scheduler Metrics
+
+#### `llm_evaluator_scheduler_runs_total`
+- **Type:** Counter
+- **Description:** Total number of scheduler executions
+- **Labels:**
+ - `status`: Execution status (success, error)
+
+#### `llm_evaluator_pending_logs`
+- **Type:** Gauge
+- **Description:** Current number of logs pending evaluation
+
+### LLM Judge Metrics
+
+#### `llm_evaluator_llm_judge_requests_total`
+- **Type:** Counter
+- **Description:** Total number of LLM judge API calls
+- **Labels:**
+ - `model`: Judge model used
+ - `status`: Request status (success, error)
+
+#### `llm_evaluator_llm_judge_request_duration_seconds`
+- **Type:** Histogram
+- **Description:** LLM judge API call latency
+- **Labels:**
+ - `model`: Judge model used
+- **Buckets:** 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0
+
+### Application Info
+
+#### `llm_evaluator_info`
+- **Type:** Info
+- **Description:** Evaluator service application metadata
+- **Labels:**
+ - `version`: Application version
+ - `environment`: Deployment environment
+
+---
+
+## Common Labels
+
+### Environment Labels (Added by Prometheus)
+
+All metrics automatically include these labels from Prometheus configuration:
+- `service`: Service name (gateway-api, evaluator)
+- `environment`: Deployment environment (local, dev, prod)
+- `instance`: Scrape target address (`host:port`), set by Prometheus
+- `job`: Prometheus job name
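+
+As a rough illustration of where these labels come from, the scrape configuration below attaches `service` and `environment` as static target labels, while Prometheus fills in `job` and `instance` on its own. The target host names and the container port `8000` are assumptions made for this sketch; check `infra/prometheus/prometheus.yml` for the real values.
+
+```yaml
+scrape_configs:
+  - job_name: gateway-api
+    metrics_path: /metrics
+    static_configs:
+      - targets: ["gateway-api:8000"]  # hypothetical container host:port
+        labels:
+          service: gateway-api
+          environment: local
+  - job_name: evaluator
+    metrics_path: /metrics
+    static_configs:
+      - targets: ["evaluator:8000"]    # hypothetical container host:port
+        labels:
+          service: evaluator
+          environment: local
+```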
+
+---
+
+## Example Queries
+
+### Performance Monitoring
+
+**Request latency percentiles (p50, p95, p99):**
+```promql
+# p50
+histogram_quantile(0.50, sum(rate(llm_gateway_http_request_duration_seconds_bucket[5m])) by (le, endpoint))
+
+# p95
+histogram_quantile(0.95, sum(rate(llm_gateway_http_request_duration_seconds_bucket[5m])) by (le, endpoint))
+
+# p99
+histogram_quantile(0.99, sum(rate(llm_gateway_http_request_duration_seconds_bucket[5m])) by (le, endpoint))
+```
+
+**Request rate per second:**
+```promql
+sum(rate(llm_gateway_http_requests_total[1m])) by (endpoint)
+```
+
+**Error rate percentage:**
+```promql
+(sum(rate(llm_gateway_http_requests_total{status=~"5.."}[5m])) /
+ sum(rate(llm_gateway_http_requests_total[5m]))) * 100
+```
+
+### LLM Performance
+
+**Average LLM latency by model:**
+```promql
+sum(rate(llm_gateway_llm_request_duration_seconds_sum[5m])) by (model) /
+sum(rate(llm_gateway_llm_request_duration_seconds_count[5m])) by (model)
+```
+
+**LLM errors per second by model:**
+```promql
+sum(rate(llm_gateway_llm_requests_total{status="error"}[5m])) by (model)
+```
+
+**LLM requests per minute:**
+```promql
+sum(rate(llm_gateway_llm_requests_total[1m])) by (model) * 60
+```
+
+### Evaluation Quality
+
+**Average score by judge type:**
+```promql
+sum(rate(llm_evaluator_evaluation_scores_sum{score_type="overall"}[5m])) by (judge_type) /
+sum(rate(llm_evaluator_evaluation_scores_count{score_type="overall"}[5m])) by (judge_type)
+```
+
+**Low quality alert rate:**
+```promql
+sum(rate(llm_evaluator_low_quality_alerts_total[5m])) by (judge_type)
+```
+
+**Evaluation throughput:**
+```promql
+sum(rate(llm_evaluator_evaluations_total[5m])) by (judge_type)
+```
+
+### System Health
+
+**Pending logs trend:**
+```promql
+llm_evaluator_pending_logs
+```
+
+**Scheduler success rate:**
+```promql
+# Use a window longer than the scheduler interval (default: 60 minutes)
+sum(increase(llm_evaluator_scheduler_runs_total{status="success"}[6h])) /
+sum(increase(llm_evaluator_scheduler_runs_total[6h]))
+```
+
+**Notification delivery rate:**
+```promql
+sum(rate(llm_evaluator_notifications_sent_total{status="success"}[5m])) by (channel) /
+sum(rate(llm_evaluator_notifications_sent_total[5m])) by (channel)
+```
+
+### Resource Utilization
+
+**Active requests gauge:**
+```promql
+llm_gateway_active_requests
+```
+
+**Database query rate:**
+```promql
+sum(rate(llm_gateway_db_queries_total[5m])) by (operation, table)
+```
+
+**Batch processing rate:**
+```promql
+sum(rate(llm_evaluator_batch_logs_processed[5m])) by (judge_type)
+```
+
+### Alerting Rules Examples
+
+**High error rate alert:**
+```yaml
+- alert: HighErrorRate
+  expr: |
+    (sum(rate(llm_gateway_http_requests_total{status=~"5.."}[5m])) /
+     sum(rate(llm_gateway_http_requests_total[5m]))) > 0.05
+  for: 5m
+  labels:
+    severity: warning
+  annotations:
+    summary: "High error rate detected"
+    description: "Error rate is {{ $value | humanizePercentage }}"
+```
+
+**High LLM latency alert:**
+```yaml
+- alert: HighLLMLatency
+  expr: |
+    histogram_quantile(0.95,
+      sum(rate(llm_gateway_llm_request_duration_seconds_bucket[5m])) by (le, model)
+    ) > 10
+  for: 5m
+  labels:
+    severity: warning
+  annotations:
+    summary: "High LLM latency detected"
+    description: "p95 latency is {{ $value }}s for model {{ $labels.model }}"
+```
+
+**Growing backlog alert:**
+```yaml
+- alert: GrowingEvaluationBacklog
+  expr: |
+    llm_evaluator_pending_logs > 100
+  for: 10m
+  labels:
+    severity: warning
+  annotations:
+    summary: "Evaluation backlog is growing"
+    description: "{{ $value }} logs pending evaluation"
+```
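+
+The backlog rule above fires on a fixed threshold, so it also triggers when a large backlog is already draining. One possible refinement, sketched here under assumptions, is to require that the gauge is still trending upward. The rule name `EvaluationBacklogIncreasing` and the 30-minute window are hypothetical choices.
+
+```yaml
+- alert: EvaluationBacklogIncreasing  # hypothetical rule name
+  expr: |
+    deriv(llm_evaluator_pending_logs[30m]) > 0
+      and
+    llm_evaluator_pending_logs > 100
+  for: 30m
+  labels:
+    severity: warning
+  annotations:
+    summary: "Evaluation backlog is large and still increasing"
+    description: "Pending logs are growing at {{ $value }} logs/sec"
+```
+
+`deriv()` fits a linear regression over the window, which suits a gauge such as `llm_evaluator_pending_logs`; `rate()` is meant for counters and would not be appropriate here.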
+
+---
+
+## Best Practices
+
+1. **Use rate() for counters:** Query counters through `rate()` (or `increase()`) instead of reading raw values, since raw counters only grow and reset to zero on restart
+2. **Appropriate time ranges:** Use longer time ranges (5m-15m) for alerts to avoid flapping
+3. **Label cardinality:** Be mindful of label combinations - avoid high-cardinality labels like user_id
+4. **Histogram buckets:** Current buckets cover common latency ranges, adjust if your use case differs
+5. **Recording rules:** Consider creating recording rules for frequently used complex queries
+6. **Retention:** Default Prometheus retention is 15 days - adjust based on your needs
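+
+The recording rules suggested in the list above could look roughly like the following. This is a sketch rather than the project's configuration: the group name and the recorded series names are assumptions that follow the common `level:metric:operations` naming convention.
+
+```yaml
+groups:
+  - name: llm_quality_observer_recording  # hypothetical group name
+    interval: 1m
+    rules:
+      # Precompute p95 HTTP latency per endpoint
+      - record: endpoint:llm_gateway_http_request_duration_seconds:p95_5m
+        expr: |
+          histogram_quantile(0.95,
+            sum(rate(llm_gateway_http_request_duration_seconds_bucket[5m])) by (le, endpoint))
+      # Precompute the average overall score per judge type
+      - record: judge_type:llm_evaluator_evaluation_score:avg_5m
+        expr: |
+          sum(rate(llm_evaluator_evaluation_scores_sum{score_type="overall"}[5m])) by (judge_type)
+            /
+          sum(rate(llm_evaluator_evaluation_scores_count{score_type="overall"}[5m])) by (judge_type)
+```
+
+Dashboards and alerts can then query the short recorded series instead of re-evaluating the full expression on every refresh.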
+
+---
+
+## Troubleshooting
+
+**Metrics not appearing:**
+1. Check that the services are running: `docker compose ps`
+2. Verify the metrics endpoints: `curl http://localhost:18000/metrics` (Gateway API) and `curl http://localhost:18001/metrics` (Evaluator)
+3. Check Prometheus targets: http://localhost:9090/targets
+4. Review Prometheus logs: `docker compose logs prometheus`
+
+**Incorrect values:**
+1. Verify time range in queries
+2. Check label filters are correct
+3. Ensure rate() is used with counters
+4. Validate histogram_quantile() syntax
+
+**Performance issues:**
+1. Reduce query time range
+2. Add more specific label filters
+3. Use recording rules for complex queries
+4. Increase Prometheus resources if needed
+
+---
+
+## Additional Resources
+
+- [Prometheus Documentation](https://prometheus.io/docs/)
+- [PromQL Basics](https://prometheus.io/docs/prometheus/latest/querying/basics/)
+- [Grafana Templating](https://grafana.com/docs/grafana/latest/dashboards/variables/)
diff --git a/docs/README-main-us.md b/docs/README-main-us.md
index 416a5c7..b90f20b 100644
--- a/docs/README-main-us.md
+++ b/docs/README-main-us.md
@@ -1,361 +1,591 @@
# LLM-Quality-Observer
-[🇰🇷 KR](../README.md) | [🇺🇸 EN](docs/README-main-us.md)
+[🇰🇷 KR](../README.md) | [🇺🇸 EN](README-main-us.md)
---
-LLM-Quality-Observer is a small MLOps playground project for **monitoring and evaluating LLM responses**.
-It is designed as a portfolio-ready system that shows how to:
+## Overview
-- Expose an LLM-backed chat API
-- Log prompts, responses, and latency into a database
-- (Planned) Run batch evaluation jobs on logged data
-- (Planned) Visualize quality metrics on a dashboard
+**LLM-Quality-Observer** is an MLOps platform for **monitoring and evaluating** the quality of Large Language Model (LLM) responses.
+Built on a microservices architecture, it logs LLM interactions, automatically evaluates quality, and provides real-time monitoring dashboards.
-> Status: **v1 — Gateway API + logging to Postgres is working**
+### Key Features
----
+- ✅ **Gateway API**: LLM request processing and automatic logging
+- ✅ **Automated Evaluation**: Dual evaluation system (rule-based + LLM-as-a-Judge)
+- ✅ **Scheduler**: Automated batch evaluation (APScheduler)
+- ✅ **Multi-Channel Notifications**: Slack, Discord, Email integration
+- ✅ **Monitoring**: Prometheus metrics collection + Grafana dashboards
+- ✅ **Web Dashboard**: Next.js-based real-time quality visualization
+- ✅ **Multi-Language Support**: English, Korean, Japanese, Chinese
+- ✅ **CI/CD**: GitHub Actions automation pipeline
+
+> **Current Version: v0.5.0** — Prometheus, Grafana, Email notifications added
-## Architecture Overview
+---
-Current v1 architecture:
+## 📊 Architecture
```mermaid
-flowchart TD
- C["Client (Swagger UI / HTTP)"]
- G["Gateway API (FastAPI)"]
- DB["Postgres (table: llm_logs)"]
- E["Evaluator Service (future)"]
- D["Dashboard Service (future)"]
-
- C --> G
- G -->|LLM call + latency + logging| DB
- DB --> E
- DB --> D
+flowchart TB
+    subgraph "Client"
+        ClientApp[Client/Browser]
+    end
+
+    subgraph "Frontend"
+        WebDashboard["Next.js Dashboard<br/>:3000"]
+        Grafana["Grafana<br/>:3001"]
+    end
+
+    subgraph "Backend Services"
+        Gateway["Gateway API<br/>:18000"]
+        Evaluator["Evaluator<br/>:18001"]
+        Dashboard["Streamlit Dashboard<br/>:18002"]
+    end
+
+    subgraph "Database"
+        Postgres["PostgreSQL<br/>:5432"]
+    end
+
+    subgraph "Monitoring"
+        Prometheus["Prometheus<br/>:9090"]
+    end
+
+    subgraph "External Services"
+        OpenAI_Main["OpenAI GPT<br/>(Main Model)"]
+        OpenAI_Judge["OpenAI GPT<br/>(Judge Model)"]
+    end
+
+    subgraph "Notification Channels"
+        Slack["Slack"]
+        Discord["Discord"]
+        Email["Email<br/>(SMTP)"]
+    end
+
+    %% Client connections
+    ClientApp --> WebDashboard
+    ClientApp --> Gateway
+
+    %% Gateway connections
+    Gateway --> OpenAI_Main
+    Gateway --> Postgres
+    Gateway -.metrics.-> Prometheus
+
+    %% Evaluator connections
+    Postgres --> Evaluator
+    Evaluator --> OpenAI_Judge
+    Evaluator --> Slack
+    Evaluator --> Discord
+    Evaluator --> Email
+    Evaluator -.metrics.-> Prometheus
+
+    %% Dashboard connections
+    Postgres --> Dashboard
+
+    %% Monitoring connections
+    Prometheus --> Grafana
+
+    style Gateway fill:#4CAF50
+    style Evaluator fill:#2196F3
+    style Postgres fill:#FF9800
+    style Prometheus fill:#E91E63
+    style Grafana fill:#9C27B0
+    style OpenAI_Main fill:#00BCD4
+    style OpenAI_Judge fill:#00BCD4
```
----
-
-## Tech Stack
+### Service Components
-* **Language**: Python 3.12
-* **LLM Provider**: OpenAI GPT-5 mini (via `responses` API)
-* **Web Framework**: FastAPI
-* **Database**: PostgreSQL 16
-* **ORM**: SQLAlchemy
-* **Config & Settings**: Pydantic Settings
-* **Dependency Management**: [`uv`](https://github.com/astral-sh/uv)
-* **Container / Orchestration**: Docker, Docker Compose
+| Service | Port | Description |
+|---------|------|-------------|
+| **Gateway API** | 18000 | LLM request processing and logging (FastAPI) |
+| **Evaluator** | 18001 | Automated evaluation and notifications (FastAPI) |
+| **Dashboard** | 18002 | Streamlit dashboard (legacy) |
+| **Web Dashboard** | 3000 | Next.js web dashboard |
+| **PostgreSQL** | 5432 | Log and evaluation result storage |
+| **Prometheus** | 9090 | Metrics collection |
+| **Grafana** | 3001 | Monitoring dashboard |
---
-## Project Structure
-
-High-level directory layout:
+## 🚀 Quick Start
-```text
-LLM-Quality-Observer/
-├── services/
-│ ├── gateway-api/
-│ │ ├── app/
-│ │ │ ├── app/
-│ │ │ │ ├── main.py
-│ │ │ │ ├── config.py
-│ │ │ │ ├── llm_client.py
-│ │ │ │ ├── db.py
-│ │ │ │ ├── models.py
-│ │ │ │ ├── schemas.py
-│ │ │ └── pyproject.toml
-│ │ └── Dockerfile
-│ ├── evaluator/
-│ │ ├── app/
-│ │ │ └── pyproject.toml
-│ │ └── Dockerfile
-│ └── dashboard/
-│ ├── app/
-│ │ └── pyproject.toml
-│ └── Dockerfile
-├── infra/
-│ └── docker/
-│ └── docker-compose.local.yml
-├── configs/
-│ └── env/
-│ └── .env.local # local env file (not committed)
-└── README.md
-```
-
-### services/gateway-api
-
-FastAPI-based gateway that:
-
-* exposes `/health`, `/chat` endpoints
-* calls OpenAI GPT-5 mini
-* logs interactions to Postgres
-
-Files:
+### Prerequisites
-* `app/app/main.py`
+- Docker & Docker Compose
+- OpenAI API Key
+- (Optional) Slack/Discord Webhook URL
+- (Optional) Gmail SMTP account
- * FastAPI application entrypoint
- * `/health` and `/chat` endpoints
- * Creates DB tables on startup (simple version)
- * Persists `LLMLog` rows and returns `ChatResponse`
+### Installation
-* `app/app/config.py`
+1. **Clone repository**
+```bash
+git clone https://github.com/dongkoony/LLM-Quality-Observer.git
+cd LLM-Quality-Observer
+```
- * Pydantic `Settings` class
- * Reads environment variables:
+2. **Configure environment variables**
+```bash
+cp configs/env/.env.local.example configs/env/.env.local
+# Edit .env.local to set API keys
+```
- * `APP_ENV`
- * `DATABASE_URL`
- * `OPENAI_MODEL_MAIN`
- * `LLM_API_BASE_URL`
- * `LLM_API_KEY`
- * `LOG_LEVEL`
+3. **Start services**
+```bash
+cd infra/docker
+docker compose -f docker-compose.local.yml up --build
+```
-* `app/app/llm_client.py`
+4. **Verify services**
+```bash
+# Gateway API
+curl http://localhost:18000/health
- * Wraps OpenAI Python client
- * Resolves model version (uses `OPENAI_MODEL_MAIN` by default)
- * Calls `client.responses.create(...)`
- * Returns `(response_text, latency_ms)`
+# Evaluator
+curl http://localhost:18001/health
-* `app/app/db.py`
+# Prometheus
+open http://localhost:9090
- * SQLAlchemy engine & session factory
- * `get_db()` dependency used by FastAPI
+# Grafana
+open http://localhost:3001 # admin/admin
+```
-* `app/app/models.py`
+---
- * SQLAlchemy ORM model: `LLMLog`
- * Columns:
+## 📖 Usage Guide
- * `id`, `created_at`
- * `user_id`, `prompt`, `response`
- * `model_version`
- * `latency_ms`
- * `status`
+### 1. Send LLM Request
-* `app/app/schemas.py`
+```bash
+curl -X POST "http://localhost:18000/chat" \
+ -H "Content-Type: application/json" \
+ -d '{
+ "prompt": "Explain quantum computing in simple terms",
+ "user_id": "test-user",
+ "model_version": "gpt-5-mini"
+ }'
+```
- * Pydantic models for request/response:
+**Response example:**
+```json
+{
+ "id": 1,
+ "prompt": "Explain quantum computing...",
+ "response": "Quantum computing is...",
+ "model_version": "gpt-5-mini",
+ "latency_ms": 1234,
+ "status": "success"
+}
+```
- * `ChatRequest`
- * `ChatResponse`
+### 2. Run Evaluation
-* `app/pyproject.toml`
+**Manual evaluation:**
+```bash
+# Rule-based evaluation
+curl -X POST "http://localhost:18001/evaluate-once?limit=10&judge_type=rule"
- * Python package metadata / dependencies for the gateway service
- * Used by `uv sync` both locally and inside Docker image.
+# LLM-as-a-Judge evaluation
+curl -X POST "http://localhost:18001/evaluate-once?limit=10&judge_type=llm"
+```
-### services/evaluator (planned)
+**Automatic evaluation:** The scheduler runs batch evaluations at the configured interval (default: 60 minutes)
-* Python service for **batch evaluation of logged LLM outputs**
-* Will read from Postgres (`llm_logs`), compute metrics, and write evaluation results
-* `app/pyproject.toml` is already prepared for future implementation.
-* `Dockerfile` builds a Python 3.12 image and installs dependencies via `uv`.
+### 3. Check Dashboards
-### services/dashboard (planned)
+**Grafana Dashboard:**
+1. Navigate to http://localhost:3001
+2. Login with admin/admin
+3. Go to Dashboards → LLM Quality Observer
-* UI/dashboard service (e.g. Streamlit or FastAPI + frontend)
-* Will visualize:
+**Included metrics:**
+- HTTP request rate and latency
+- LLM performance by model
+- Evaluation score distribution
+- Notification delivery status
+- Scheduler execution state
- * latency stats
- * quality scores
- * error rates
-* `app/pyproject.toml` prepared for future implementation.
-* `Dockerfile` ready for Python 3.12 + `uv` build.
+### 4. Query Database
-### infra/docker
+```bash
+# Connect to PostgreSQL
+docker exec -it llm-postgres psql -U llm_user -d llm_quality
-* `docker-compose.local.yml`
+# View recent logs
+SELECT id, created_at, user_id,
+ LEFT(prompt, 50) AS prompt,
+ model_version, latency_ms, status
+FROM llm_logs
+ORDER BY id DESC
+LIMIT 10;
- * Local development stack:
+# View evaluation results
+SELECT l.id, l.prompt,
+ e.score_overall, e.score_instruction_following, e.score_truthfulness,
+ e.judge_type, e.comments
+FROM llm_logs l
+JOIN llm_evaluations e ON l.id = e.log_id
+ORDER BY e.created_at DESC
+LIMIT 10;
+```
- * `llm-postgres` (Postgres 16)
- * `llm-gateway-api` (FastAPI + OpenAI client)
- * `llm-evaluator` (placeholder)
- * `llm-dashboard` (placeholder)
- * Binds gateway API to `localhost:18000` by default.
+---
-### configs/env
+## 🔧 Feature Details
+
+### Gateway API (v0.1.0+)
+
+**Endpoints:**
+- `GET /health` - Health check
+- `POST /chat` - LLM request processing
+- `GET /docs` - Swagger UI
+- `GET /metrics` - Prometheus metrics
+
+**Features:**
+- OpenAI GPT model calls
+- Automatic logging (prompt, response, latency, status)
+- Model version tracking
+- Prometheus metrics export
+
+### Evaluator Service (v0.2.0+)
+
+**Evaluation Methods:**
+
+1. **Rule-Based Evaluation** (fast, cheap):
+ - Response length validation
+ - Keyword verification
+ - Format compliance checks
+
+2. **LLM-as-a-Judge** (v0.3.0+, accurate, costs money):
+ - GPT-based quality evaluation (judge model set via `OPENAI_MODEL_JUDGE`)
+ - Multi-dimensional scoring (overall, instruction following, truthfulness)
+ - Detailed evaluation comments
+
+**Automated Scheduler** (v0.4.0+):
+- Periodic evaluation via APScheduler
+- Configurable interval and batch size
+- Automatic start/stop
+
+**Notification System** (v0.4.0+, v0.5.0):
+- **Slack**: Webhook integration
+- **Discord**: Webhook integration
+- **Email** (v0.5.0): SMTP (Gmail, etc.)
+- Immediate low-quality alerts
+- Batch evaluation summaries
+
+### Monitoring (v0.5.0)
+
+**Prometheus Metrics:**
+- `llm_gateway_http_requests_total` - HTTP request count
+- `llm_gateway_http_request_duration_seconds` - Request latency
+- `llm_gateway_llm_requests_total` - LLM call count
+- `llm_evaluator_evaluations_total` - Evaluation count
+- `llm_evaluator_evaluation_scores` - Score distribution
+- `llm_evaluator_notifications_sent_total` - Notification count
+- `llm_evaluator_pending_logs` - Pending log count
+
+**Grafana Dashboard:**
+- 14 visualization panels
+- Real-time performance monitoring
+- Quality trend analysis
+- Notification status tracking
-* `.env.local`
+---
- * Local configuration used by `docker-compose.local.yml`
- * Mounted as env file for gateway/evaluator/dashboard
+## ⚙️ Configuration
-Example `.env.local`:
+### Environment Variables
-```env
+```bash
# Application
APP_ENV=local
LOG_LEVEL=DEBUG
-# LLM
-OPENAI_MODEL_MAIN=gpt-5-mini
+# LLM Models
+OPENAI_MODEL_MAIN=gpt-5-mini # Model for Gateway
+OPENAI_MODEL_JUDGE=gpt-4o-mini # Model for evaluation
LLM_API_BASE_URL=https://api.openai.com/v1
LLM_API_KEY=sk-...
-# Database (used by gateway-api)
+# Database
DATABASE_URL=postgresql://llm_user:llm_password@postgres:5432/llm_quality
+
+# Batch Evaluation Scheduler (v0.4.0+)
+ENABLE_AUTO_EVALUATION=true # Enable automatic evaluation
+EVALUATION_INTERVAL_MINUTES=60 # Evaluation interval (minutes)
+EVALUATION_BATCH_SIZE=10 # Batch size
+EVALUATION_JUDGE_TYPE=rule # Default evaluation method (rule/llm)
+
+# Notification Settings (v0.4.0+)
+SLACK_WEBHOOK_URL=https://hooks.slack.com/services/YOUR/WEBHOOK/URL
+DISCORD_WEBHOOK_URL=https://discord.com/api/webhooks/YOUR/WEBHOOK/URL
+NOTIFICATION_SCORE_THRESHOLD=3 # Alert threshold (≤ 3)
+
+# Email Notifications (v0.5.0+)
+SMTP_HOST=smtp.gmail.com
+SMTP_PORT=587
+SMTP_USERNAME=your-email@gmail.com
+SMTP_PASSWORD=your-app-password
+SMTP_FROM_EMAIL=your-email@gmail.com
+SMTP_TO_EMAILS=recipient1@example.com,recipient2@example.com
```
---
-## Getting Started (Local, Docker)
+## 🏗️ Project Structure
-### Prerequisites
+```
+LLM-Quality-Observer/
+├── services/
+│ ├── gateway-api/ # Gateway API service
+│ │ ├── app/
+│ │ │ ├── main.py # FastAPI app
+│ │ │ ├── config.py # Configuration
+│ │ │ ├── llm_client.py # OpenAI client
+│ │ │ ├── db.py # Database
+│ │ │ ├── models.py # SQLAlchemy models
+│ │ │ ├── schemas.py # Pydantic schemas
+│ │ │ └── metrics.py # Prometheus metrics
+│ │ ├── tests/
+│ │ ├── Dockerfile
+│ │ └── pyproject.toml
+│ │
+│ ├── evaluator/ # Evaluator service
+│ │ ├── app/
+│ │ │ ├── main.py # FastAPI app
+│ │ │ ├── rules.py # Rule-based evaluation
+│ │ │ ├── llm_judge.py # LLM-as-a-Judge
+│ │ │ ├── scheduler.py # APScheduler
+│ │ │ ├── notifier.py # Notification system
+│ │ │ ├── metrics.py # Prometheus metrics
+│ │ │ └── utils.py # Utilities
+│ │ ├── tests/
+│ │ ├── Dockerfile
+│ │ └── pyproject.toml
+│ │
+│ ├── dashboard/ # Streamlit dashboard
+│ │ ├── app/
+│ │ │ ├── main.py
+│ │ │ ├── models.py
+│ │ │ └── config.py
+│ │ ├── Dockerfile
+│ │ └── pyproject.toml
+│ │
+│ └── web/ # Next.js web dashboard
+│ └── dashboard/
+│ ├── app/
+│ ├── components/
+│ ├── locales/ # Multi-language support
+│ └── lib/
+│
+├── infra/
+│ ├── docker/
+│ │ └── docker-compose.local.yml
+│ ├── prometheus/
+│ │ └── prometheus.yml
+│ └── grafana/
+│ ├── provisioning/
+│ ├── dashboards/
+│ └── DASHBOARD_GUIDE-us.md
+│
+├── configs/
+│ └── env/
+│ ├── .env.local.example
+│ └── .env.local # gitignored
+│
+├── docs/
+│ ├── release_notes/ # Release notes
+│ │ ├── RELEASE_NOTES_v0.1.0.md
+│ │ ├── RELEASE_NOTES_v0.2.0.md
+│ │ ├── RELEASE_NOTES_v0.3.0.md
+│ │ ├── RELEASE_NOTES_v0.4.0.md
+│ │ └── RELEASE_NOTES_v0.5.0.md
+│ ├── RELEASE_NOTES_v0.5.0_ko.md
+│ ├── METRICS.md
+│ ├── EMAIL_SETUP.md
+│ └── README-main-us.md
+│
+├── .github/
+│ └── workflows/
+│ └── ci.yml # GitHub Actions CI/CD
+│
+├── .flake8 # Flake8 configuration
+└── README.md
+```
-* Docker
-* Docker Compose plugin (e.g. `docker compose` command available)
-* OpenAI API key with some available credit
+---
-### 1. Clone the repository
+## 🧪 Testing
+
+### Health Check Tests
```bash
-git clone https://github.com//LLM-Quality-Observer.git
-cd LLM-Quality-Observer
+# Check all services
+curl http://localhost:18000/health # Gateway API
+curl http://localhost:18001/health # Evaluator
+curl http://localhost:9090/-/healthy # Prometheus
+curl http://localhost:3001/api/health # Grafana
```
-### 2. Create local env file
+### Integration Tests
```bash
-cp configs/env/.env.local configs/env/.env.local.example # optional backup
-# edit configs/env/.env.local with your values
-```
+# 1. Send LLM request
+curl -X POST "http://localhost:18000/chat" \
+ -H "Content-Type: application/json" \
+ -d '{"prompt": "Test", "user_id": "test"}'
-Make sure `LLM_API_KEY` and `OPENAI_MODEL_MAIN` are set.
+# 2. Run evaluation
+curl -X POST "http://localhost:18001/evaluate-once?limit=1"
-### 3. Run with Docker Compose
+# 3. Check metrics
+curl http://localhost:18000/metrics | grep llm_gateway
+curl http://localhost:18001/metrics | grep llm_evaluator
-```bash
-cd infra/docker
-docker compose -f docker-compose.local.yml up --build
+# 4. Verify Grafana dashboard
+open http://localhost:3001
```
-Services:
+### Automated Tests
+
+```bash
+# Run the unit tests locally
+cd services/gateway-api
+pytest tests/
-* Gateway API: `http://localhost:18000`
-* Postgres: exposed inside Docker network as `postgres:5432`
+cd ../evaluator
+pytest tests/
+
+# Lint check (run from the repository root)
+cd ../.. && flake8 services/
+```
---
-## Using the Gateway API
+## 📈 Monitoring Guide
-### Health Check
+### Prometheus Query Examples
-```bash
-curl http://localhost:18000/health
-# -> { "status": "ok" }
-```
+```promql
+# HTTP request rate
+sum(rate(llm_gateway_http_requests_total[5m]))
-### Interactive API docs (Swagger UI)
+# LLM latency p95
+histogram_quantile(0.95, sum(rate(llm_gateway_llm_request_duration_seconds_bucket[5m])) by (le, model))
-Open in browser:
+# Evaluation score median
+histogram_quantile(0.50, sum(rate(llm_evaluator_evaluation_scores_bucket{score_type="overall"}[5m])) by (le))
-```text
-http://localhost:18000/docs
+# Pending logs count
+llm_evaluator_pending_logs
```
-You can use `POST /chat` directly from Swagger.
+### Grafana Dashboard Usage
-### Chat Example
+For a detailed guide, see the [Grafana Dashboard Guide](../infra/grafana/DASHBOARD_GUIDE-us.md).
-Request:
+---
-```bash
-curl -X POST "http://localhost:18000/chat" \
- -H "Content-Type: application/json" \
- -d '{
- "prompt": "Explain what LLM-Quality-Observer is in one sentence.",
- "user_id": "test-user",
- "model_version": null
- }'
-```
+## 📚 Documentation
-Example response:
+### Release Notes
-```json
-{
- "response": "LLM-Quality-Observer is a monitoring and evaluation framework that continuously assesses and tracks the quality of LLM outputs.",
- "model_version": "gpt-5-mini",
- "latency_ms": 4735.19
-}
-```
+- [v0.5.0 (Latest)](./release_notes/RELEASE_NOTES_v0.5.0.md) - Prometheus, Grafana, Email notifications
+- [v0.4.0](./release_notes/RELEASE_NOTES_v0.4.0.md) - Scheduler, Slack/Discord notifications, CI/CD
+- [v0.3.0](./release_notes/RELEASE_NOTES_v0.3.0.md) - LLM-as-a-Judge, Multi-language support
+- [v0.2.0](./release_notes/RELEASE_NOTES_v0.2.0.md) - Dashboard, CORS, Rule-based evaluation
+- [v0.1.0](./release_notes/RELEASE_NOTES_v0.1.0.md) - Initial release (Gateway + Evaluator)
-Behind the scenes, the request/response pair is also stored in `llm_logs` table in Postgres.
+### Technical Documentation
+
+- [Metrics Reference](./METRICS.md) - Prometheus metrics details
+- [Email Setup Guide](./EMAIL_SETUP.md) - Gmail SMTP configuration
+- [Grafana Dashboard Guide](../infra/grafana/DASHBOARD_GUIDE-us.md) - Dashboard usage
---
-## Checking Logged Data (Postgres)
+## 🛣️ Roadmap
-```bash
-docker exec -it llm-postgres psql -U llm_user -d llm_quality
+### Completed Features
-SELECT id, created_at, user_id,
- LEFT(prompt, 60) AS prompt_snippet,
- LEFT(response, 60) AS response_snippet,
- model_version,
- latency_ms,
- status
-FROM llm_logs
-ORDER BY id DESC
-LIMIT 10;
-```
+- ✅ v0.1.0: Gateway API + Evaluator foundation
+- ✅ v0.2.0: Web dashboard + Rule-based evaluation
+- ✅ v0.3.0: LLM-as-a-Judge + Multi-language support
+- ✅ v0.4.0: Automated scheduler + Slack/Discord notifications
+- ✅ v0.5.0: Prometheus + Grafana + Email notifications
+
+### Future Plans (v0.6.0+)
+
+- [ ] **Alertmanager Integration**: Advanced alerting rules and routing
+- [ ] **Multi-LLM Provider Support**: Anthropic Claude, Google Gemini, etc.
+- [ ] **Cost Tracking**: Token usage and cost monitoring
+- [ ] **A/B Testing**: Prompt and model comparison
+- [ ] **User Feedback**: RLHF-style human evaluation
+- [ ] **Kubernetes Deployment**: Helm charts and deployment guides
+- [ ] **API Authentication**: JWT-based security
+- [ ] **Rate Limiting**: Request limits and quota management
---
-## Roadmap
+## 🔒 Security
-Planned enhancements for future versions:
+### Precautions
-* **Evaluator Service**
+- Never commit the `.env.local` file (it is gitignored)
+- Store the OpenAI API key securely
+- Don't expose Slack/Discord webhook URLs
+- Use app passwords for SMTP (Gmail)
- * Batch jobs to score LLM outputs using:
+### Recommendations
- * heuristics (e.g. length, keyword constraints)
- * LLM-as-a-judge prompts
- * human labels / RLHF-like feedback
- * Persist evaluation results in new tables (e.g. `llm_evaluations`)
+- In production, keep secrets in a secret manager rather than in env files
+- Add authentication to API endpoints (planned for v0.6.0+)
+- Use HTTPS/TLS
+- Update dependencies regularly
-* **Dashboard Service**
+---
- * High-level quality metrics:
+## 🤝 Contributing
- * average score per model/version
- * latency distributions
- * error rates and failure patterns
- * Filters by:
+Contributions are welcome! Please follow these steps:
- * time range, user, use case, model version
+1. Fork the repository
+2. Create a feature branch (`git checkout -b feat/amazing-feature`)
+3. Commit your changes (`git commit -m 'feat: add amazing feature'`)
+4. Push to the branch (`git push origin feat/amazing-feature`)
+5. Open a Pull Request
-* **Alerting / Notifications**
+### Development Guidelines
- * Simple rules:
+- Python code must follow Flake8 style guide
+- All PRs must pass CI tests
+- Use Conventional Commits format for commit messages
+- Add tests for new features
- * “Alert when average score drops below threshold”
- * “Alert when latency p95 > X ms”
- * Send alerts to Slack / email.
+---
-* **Cost Awareness**
+## 📄 License
- * Track token usage and cost per model/version
- * Combine with quality metrics for cost–quality tradeoff analysis.
+This project is distributed under the MIT License.
---
-## License
+## 👥 Author
-MIT (or your preferred license)
+**Dong-hyeon Shin (dongkoony)**
+- GitHub: [@dongkoony](https://github.com/dongkoony)
+- Email: dhyeon.shin@icloud.com
---
-## Notes
+## 📞 Contact & Support
+
+- **Issues**: [GitHub Issues](https://github.com/dongkoony/LLM-Quality-Observer/issues)
+- **Discussions**: [GitHub Discussions](https://github.com/dongkoony/LLM-Quality-Observer/discussions)
+- **Email**: dhyeon.shin@icloud.com
-This project is intended as a **learning and portfolio project** to demonstrate practical MLOps patterns for LLM applications:
+---
-* clean separation of services (gateway / evaluator / dashboard)
-* environment-based configuration
-* logging & observability for LLM responses
-* room to extend into production-style MLOps workflows.
+**⭐ If this project helped you, please give it a star!**
diff --git a/docs/release_notes/RELEASE_NOTES_v0.1.0.md b/docs/release_notes/RELEASE_NOTES_v0.1.0.md
new file mode 100644
index 0000000..c94f20e
--- /dev/null
+++ b/docs/release_notes/RELEASE_NOTES_v0.1.0.md
@@ -0,0 +1,395 @@
+# Release Notes - v0.1.0
+
+**Release Date**: 2025 (Initial Release)
+**Status**: Foundation Release
+
+---
+
+## Overview
+
+LLM-Quality-Observer v0.1.0 marks the initial release of the project, establishing the foundational architecture for monitoring and evaluating LLM response quality. This version implements the core microservices architecture with Gateway API and Evaluator services.
+
+---
+
+## 🎯 Key Features
+
+### Gateway API Service
+- **LLM Request Handling**: FastAPI-based service that receives chat requests and forwards them to OpenAI GPT models
+- **Database Logging**: Automatic logging of all LLM interactions to PostgreSQL
+- **RESTful API**:
+  - `POST /chat`: Accept user prompts and return LLM responses
+  - `GET /health`: Health check endpoint
+- **OpenAI Integration**: Direct integration with OpenAI API for GPT model access
+- **Request/Response Tracking**: Records prompt, response, model version, latency, and status for every interaction
+
+### Evaluator Service
+- **Evaluation Framework**: Service dedicated to evaluating logged LLM responses
+- **Database Integration**: Reads logs from PostgreSQL and stores evaluation results
+- **Basic Evaluation Logic**: Initial evaluation implementation
+- **RESTful API**:
+  - `GET /health`: Health check endpoint
+  - `POST /evaluate-once`: Manual evaluation trigger endpoint
+
+### Infrastructure
+- **PostgreSQL Database**: Centralized data storage for logs and evaluations
+  - `llm_logs` table: Stores all LLM interactions
+  - `llm_evaluations` table: Stores evaluation results
+- **Docker Compose**: Complete local development environment
+- **Microservices Architecture**: Separation of concerns between gateway and evaluation
+
+---
+
+## 🏗️ Architecture
+
+```
+Client → Gateway API (port 18000) → OpenAI GPT
+                    ↓
+          PostgreSQL (port 5432)
+                    ↑
+          Evaluator (port 18001)
+```
+
+---
+
+## 📦 Technology Stack
+
+### Gateway API
+- **Framework**: FastAPI
+- **Server**: Uvicorn
+- **ORM**: SQLAlchemy
+- **Database Driver**: psycopg2-binary
+- **LLM Client**: OpenAI Python SDK
+- **Configuration**: pydantic-settings
+- **Package Manager**: uv
+
+### Evaluator
+- **Framework**: FastAPI
+- **Server**: Uvicorn
+- **ORM**: SQLAlchemy
+- **Database Driver**: psycopg2-binary
+- **LLM Client**: OpenAI Python SDK (for future judge capability)
+- **Package Manager**: uv
+
+### Infrastructure
+- **Database**: PostgreSQL 16
+- **Containerization**: Docker, Docker Compose
+- **Python Version**: 3.12
+
+---
+
+## 🗄️ Database Schema
+
+### llm_logs Table
+```sql
+CREATE TABLE llm_logs (
+ id SERIAL PRIMARY KEY,
+ created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
+ user_id VARCHAR,
+ prompt TEXT,
+ response TEXT,
+ model_version VARCHAR,
+ latency_ms INTEGER,
+ status VARCHAR
+);
+```
+
+### llm_evaluations Table
+```sql
+CREATE TABLE llm_evaluations (
+ id SERIAL PRIMARY KEY,
+ log_id INTEGER REFERENCES llm_logs(id),
+ created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
+ score_overall FLOAT,
+ comments TEXT,
+ judge_model VARCHAR
+);
+```
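+
+As a rough illustration of how the two tables relate, the sketch below recreates a trimmed-down version of the schema in an in-memory SQLite database (standing in for PostgreSQL, so `SERIAL` becomes `INTEGER PRIMARY KEY`) and looks up logs that have no evaluation yet. The helper name `pending_log_ids` is hypothetical and not taken from the project code.
+
+```python
+import sqlite3
+
+# Simplified SQLite stand-in for the PostgreSQL schema above (assumption:
+# only the columns needed for the join are kept).
+SCHEMA = """
+CREATE TABLE llm_logs (
+    id INTEGER PRIMARY KEY,
+    prompt TEXT,
+    response TEXT,
+    status TEXT
+);
+CREATE TABLE llm_evaluations (
+    id INTEGER PRIMARY KEY,
+    log_id INTEGER REFERENCES llm_logs(id),
+    score_overall REAL
+);
+"""
+
+
+def pending_log_ids(conn: sqlite3.Connection, limit: int = 10) -> list[int]:
+    """Return ids of logs that have no row in llm_evaluations yet."""
+    rows = conn.execute(
+        """
+        SELECT l.id
+        FROM llm_logs AS l
+        LEFT JOIN llm_evaluations AS e ON e.log_id = l.id
+        WHERE e.id IS NULL
+        ORDER BY l.id
+        LIMIT ?
+        """,
+        (limit,),
+    ).fetchall()
+    return [row[0] for row in rows]
+
+
+conn = sqlite3.connect(":memory:")
+conn.executescript(SCHEMA)
+conn.executemany(
+    "INSERT INTO llm_logs (prompt, response, status) VALUES (?, ?, ?)",
+    [("q1", "a1", "success"), ("q2", "a2", "success"), ("q3", "a3", "error")],
+)
+conn.execute("INSERT INTO llm_evaluations (log_id, score_overall) VALUES (1, 4.5)")
+print(pending_log_ids(conn))
+```
+
+A `LEFT JOIN` with an `IS NULL` filter is one common way to find unevaluated rows; the actual evaluator may select pending logs differently.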
+
+---
+
+## 📖 API Documentation
+
+### Gateway API (port 18000)
+
+#### POST /chat
+**Description**: Submit a prompt and receive LLM response
+
+**Request**:
+```json
+{
+ "prompt": "What is Python?",
+ "user_id": "user123",
+ "model_version": "gpt-3.5-turbo"
+}
+```
+
+**Response**:
+```json
+{
+ "id": 1,
+ "prompt": "What is Python?",
+ "response": "Python is a high-level programming language...",
+ "model_version": "gpt-3.5-turbo",
+ "latency_ms": 1234,
+ "status": "success"
+}
+```
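+
+For callers that prefer Python over `curl`, the request above could be built roughly as follows. This is a minimal sketch using only the standard library; the helper name `build_chat_request` is hypothetical, and the URL assumes the local port mapping described in these notes.
+
+```python
+import json
+import urllib.request
+
+GATEWAY_URL = "http://localhost:18000"  # local port used throughout these notes
+
+
+def build_chat_request(prompt: str, user_id: str, model_version: str) -> urllib.request.Request:
+    """Build (but do not send) a POST /chat request with a JSON body."""
+    body = json.dumps(
+        {"prompt": prompt, "user_id": user_id, "model_version": model_version}
+    ).encode("utf-8")
+    return urllib.request.Request(
+        f"{GATEWAY_URL}/chat",
+        data=body,
+        headers={"Content-Type": "application/json"},
+        method="POST",
+    )
+
+
+req = build_chat_request("What is Python?", "user123", "gpt-3.5-turbo")
+# To actually send it (requires the gateway to be running):
+# with urllib.request.urlopen(req, timeout=30) as resp:
+#     print(json.load(resp))
+```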
+
+#### GET /health
+**Description**: Health check endpoint
+
+**Response**:
+```json
+{
+ "status": "healthy"
+}
+```
+
+### Evaluator (port 18001)
+
+#### POST /evaluate-once
+**Description**: Trigger manual evaluation of pending logs
+
+**Query Parameters**:
+- `limit` (optional): Number of logs to evaluate
+
+**Response**:
+```json
+{
+ "message": "Evaluation completed",
+ "evaluated_count": 5
+}
+```
+
+#### GET /health
+**Description**: Health check endpoint
+
+**Response**:
+```json
+{
+ "status": "healthy"
+}
+```
+
+---
+
+## 🚀 Getting Started
+
+### Prerequisites
+- Docker and Docker Compose
+- OpenAI API Key
+
+### Installation
+
+1. **Clone the repository**:
+```bash
+git clone https://github.com/dongkoony/LLM-Quality-Observer.git
+cd LLM-Quality-Observer
+```
+
+2. **Set up environment variables**:
+```bash
+cp configs/env/.env.local.example configs/env/.env.local
+# Edit .env.local and add your OpenAI API key
+```
+
+3. **Start services**:
+```bash
+cd infra/docker
+docker compose -f docker-compose.local.yml up --build
+```
+
+4. **Verify services**:
+```bash
+# Gateway API
+curl http://localhost:18000/health
+
+# Evaluator
+curl http://localhost:18001/health
+```
+
+### Usage Example
+
+**Send a chat request**:
+```bash
+curl -X POST "http://localhost:18000/chat" \
+ -H "Content-Type: application/json" \
+ -d '{
+ "prompt": "Explain machine learning in simple terms",
+ "user_id": "test-user",
+ "model_version": "gpt-3.5-turbo"
+ }'
+```
+
+**Trigger evaluation**:
+```bash
+curl -X POST "http://localhost:18001/evaluate-once?limit=10"
+```
+
+**Check database**:
+```bash
+docker exec -it llm-postgres psql -U llm_user -d llm_quality
+
+-- The statements below are SQL; run them at the psql prompt
+-- View logs
+SELECT * FROM llm_logs ORDER BY created_at DESC LIMIT 5;
+
+-- View evaluations
+SELECT * FROM llm_evaluations ORDER BY created_at DESC LIMIT 5;
+```
+
+---
+
+## 📁 Project Structure
+
+```
+LLM-Quality-Observer/
+├── services/
+│   ├── gateway-api/
+│   │   ├── app/
+│   │   │   ├── main.py           # FastAPI application
+│   │   │   ├── models.py         # SQLAlchemy models
+│   │   │   ├── schemas.py        # Pydantic schemas
+│   │   │   ├── config.py         # Configuration
+│   │   │   ├── db.py             # Database connection
+│   │   │   └── llm_client.py     # OpenAI client
+│   │   ├── Dockerfile
+│   │   └── pyproject.toml
+│   └── evaluator/
+│       ├── app/
+│       │   ├── main.py           # FastAPI application
+│       │   ├── models.py         # SQLAlchemy models
+│       │   ├── config.py         # Configuration
+│       │   └── db.py             # Database connection
+│       ├── Dockerfile
+│       └── pyproject.toml
+├── infra/
+│   └── docker/
+│       └── docker-compose.local.yml
+├── configs/
+│   └── env/
+│       └── .env.local.example
+└── README.md
+```
+
+---
+
+## 🔧 Configuration
+
+### Environment Variables
+
+**Required**:
+- `DATABASE_URL`: PostgreSQL connection string
+- `OPENAI_API_KEY`: OpenAI API key
+- `OPENAI_MODEL_MAIN`: Default LLM model (e.g., gpt-3.5-turbo)
+
+**Optional**:
+- `LOG_LEVEL`: Logging level (default: INFO)
+- `APP_ENV`: Application environment (default: local)
+
+### Example .env.local
+```bash
+APP_ENV=local
+LOG_LEVEL=INFO
+OPENAI_MODEL_MAIN=gpt-3.5-turbo
+OPENAI_API_KEY=sk-...
+DATABASE_URL=postgresql://llm_user:llm_password@postgres:5432/llm_quality
+```
+
+---
+
+## 🧪 Testing
+
+### Manual Testing
+
+**Test Gateway API**:
+```bash
+# Health check
+curl http://localhost:18000/health
+
+# Chat request
+curl -X POST "http://localhost:18000/chat" \
+ -H "Content-Type: application/json" \
+ -d '{"prompt": "Hello", "user_id": "test"}'
+```
+
+**Test Evaluator**:
+```bash
+# Health check
+curl http://localhost:18001/health
+
+# Trigger evaluation
+curl -X POST "http://localhost:18001/evaluate-once?limit=5"
+```
+
+---
+
+## 🐛 Known Limitations
+
+- **Manual Evaluation Only**: No automated scheduler for continuous evaluation
+- **Basic Evaluation Logic**: Evaluation criteria not yet fully implemented
+- **No Dashboard**: No web interface for viewing metrics and logs
+- **No Notifications**: No alerting system for quality issues
+- **Single Model Support**: Designed for OpenAI models only
+- **No Metrics Export**: No Prometheus/Grafana integration
+
+---
+
+## 🛣️ Future Roadmap
+
+- [ ] Implement comprehensive evaluation criteria (rule-based and LLM-as-a-judge)
+- [ ] Add web dashboard for visualization
+- [ ] Implement automated evaluation scheduler
+- [ ] Add notification system (Slack, Discord, Email)
+- [ ] Add Prometheus metrics and Grafana dashboards
+- [ ] Support multiple LLM providers
+- [ ] Add batch processing capabilities
+
+---
+
+## 📝 Technical Notes
+
+### Database Connection Pooling
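+SQLAlchemy handles connection pooling automatically. With the default `QueuePool`, the pool keeps up to 5 connections (`pool_size=5`) and allows up to 10 additional overflow connections (`max_overflow=10`) under load.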
+
+### Error Handling
+Failed LLM requests are logged with `status="error"` and stored in the database for analysis.
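+
+The pattern described above (time the call, then record either a success or an error row) might look roughly like this. It is a sketch under assumptions: `call_and_record` and `failing_call` are hypothetical names, and the real gateway may catch narrower exception types and write the result to PostgreSQL instead of returning a dict.
+
+```python
+import time
+from typing import Callable
+
+
+def call_and_record(llm_call: Callable[[str], str], prompt: str) -> dict:
+    """Call the LLM, timing it and capturing failures as status="error"."""
+    started = time.perf_counter()
+    try:
+        response = llm_call(prompt)
+        status = "success"
+    except Exception as exc:  # the real service may catch narrower errors
+        response = str(exc)
+        status = "error"
+    latency_ms = int((time.perf_counter() - started) * 1000)
+    return {
+        "prompt": prompt,
+        "response": response,
+        "latency_ms": latency_ms,
+        "status": status,
+    }
+
+
+def failing_call(prompt: str) -> str:
+    """Stand-in for an LLM call that fails."""
+    raise RuntimeError("upstream timeout")
+
+
+ok = call_and_record(lambda p: p.upper(), "hello")
+bad = call_and_record(failing_call, "hello")
+print(ok["status"], bad["status"])
+```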
+
+### OpenAI API Usage
+The gateway uses `client.chat.completions.create()` for standard chat completions. Response streaming is not implemented in v0.1.0.
+
+### Docker Networking
+All services communicate via Docker Compose default network. Service names (e.g., `postgres`, `gateway-api`) are used as hostnames.
+
+---
+
+## 🤝 Contributing
+
+This is the initial release. Contributions are welcome! Please:
+1. Fork the repository
+2. Create a feature branch
+3. Make your changes
+4. Submit a pull request
+
+---
+
+## 📄 License
+
+This project is distributed under the MIT License.
+
+---
+
+## 👥 Authors
+
+- Initial work - Dong-hyeon Shin ([@dongkoony](https://github.com/dongkoony))
+
+---
+
+## 🙏 Acknowledgments
+
+- FastAPI for the excellent web framework
+- OpenAI for the LLM API
+- PostgreSQL for reliable data storage
+
+---
+
+**Next Release**: v0.2.0 will introduce web dashboard and enhanced evaluation capabilities.
diff --git a/docs/release_notes/RELEASE_NOTES_v0.2.0.md b/docs/release_notes/RELEASE_NOTES_v0.2.0.md
new file mode 100644
index 0000000..dc62442
--- /dev/null
+++ b/docs/release_notes/RELEASE_NOTES_v0.2.0.md
@@ -0,0 +1,439 @@
+# Release Notes - v0.2.0
+
+**Release Date**: 2025
+**Status**: Dashboard Release
+
+---
+
+## Overview
+
+LLM-Quality-Observer v0.2.0 introduces a web-based dashboard for visualizing quality metrics and monitoring LLM performance. This release adds a new Dashboard service using Streamlit, enhances the evaluation logic with rule-based criteria, and improves API accessibility with CORS middleware.
+
+---
+
+## 🎯 What's New
+
+### Dashboard Service (NEW)
+- **Streamlit Web Interface**: Interactive dashboard for real-time monitoring
+- **Quality Metrics Visualization**: View evaluation scores, trends, and distributions
+- **Log Exploration**: Browse and filter LLM interaction logs
+- **Performance Analytics**: Track latency, success rates, and model performance
+- **FastAPI Backend**: Dashboard API endpoints for data retrieval
+- **Port**: Accessible at http://localhost:8501
+
+### Enhanced Evaluator Service
+- **Rule-Based Evaluation**: Comprehensive rule-based scoring system
+  - Response length validation
+  - Keyword presence checking
+  - Format compliance verification
+  - Language quality assessment
+- **Structured Evaluation Schema**: Improved evaluation data models
+- **Better Error Handling**: Enhanced error tracking and logging
+
+### Improved Gateway API
+- **CORS Middleware**: Enable cross-origin requests from web dashboard
+- **Dashboard API Endpoints**: New endpoints for serving dashboard data
+- **Enhanced Logging**: More detailed request/response logging
+
+---
+
+## 🏗️ Updated Architecture
+
+```
+Client/Browser → Dashboard (port 8501)
+                           ↓
+                 Gateway API (port 18000) → OpenAI GPT
+                           ↓
+                 PostgreSQL (port 5432)
+                           ↑
+                 Evaluator (port 18001)
+```
+
+---
+
+## 📦 New Technology Stack
+
+### Dashboard Service
+- **Framework**: Streamlit
+- **Visualization**: Plotly, Pandas
+- **ORM**: SQLAlchemy
+- **Database Driver**: psycopg2-binary
+- **Package Manager**: uv
+
+---
+
+## 🆕 New Features in Detail
+
+### Dashboard Pages
+
+#### Overview Page
+- **Summary Statistics**: Total requests, average score, success rate
+- **Recent Activity**: Latest LLM interactions
+- **Quick Stats**: Key performance indicators
+
+#### Quality Metrics Page
+- **Score Distribution**: Histogram of evaluation scores
+- **Score Trends**: Time series chart of quality over time
+- **Score by Model**: Compare quality across different models
+
+#### Latency Analysis Page
+- **Latency Distribution**: p50, p95, p99 percentiles
+- **Latency by Model**: Performance comparison
+- **Response Time Trends**: Track performance over time
+
+#### Logs Explorer Page
+- **Filterable Table**: Search and filter logs
+- **Column Selection**: Customize displayed columns
+- **Detailed View**: Inspect individual interactions
+- **Export Capability**: Download filtered results
+
+### Rule-Based Evaluation Criteria
+
+The evaluator now implements structured evaluation rules:
+
+**Response Length Check**:
+- Minimum length requirement
+- Maximum length limit
+- Optimal range scoring
+
+**Content Quality**:
+- Keyword presence validation
+- Banned word detection
+- Format compliance
+
+**Scoring System**:
+- Score range: 1-5
+- Weighted criteria
+- Automatic pass/fail determination
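+
+One way the weighted criteria and pass/fail decision could fit together is sketched below. The weights, the criterion names, and the pass threshold are hypothetical; these notes do not state the values the evaluator actually uses.
+
+```python
+# Hypothetical weights and pass threshold; the release notes do not state them.
+WEIGHTS = {"length": 0.4, "keywords": 0.4, "format": 0.2}
+PASS_THRESHOLD = 3.0
+
+
+def combine_scores(scores: dict[str, float]) -> tuple[float, bool]:
+    """Return (weighted score clamped to the 1-5 range, passed flag)."""
+    total_weight = sum(WEIGHTS[name] for name in scores)
+    weighted = sum(WEIGHTS[name] * value for name, value in scores.items())
+    overall = max(1.0, min(5.0, weighted / total_weight))
+    return round(overall, 2), overall >= PASS_THRESHOLD
+
+
+overall, passed = combine_scores({"length": 5.0, "keywords": 3.0, "format": 4.0})
+print(overall, passed)
+```
+
+Dividing by the sum of the weights actually used keeps the result in range even when a criterion is skipped for a given response.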
+
+---
+
+## 📖 New API Endpoints
+
+### Gateway API
+
+#### GET /api/logs
+**Description**: Retrieve LLM logs with filtering
+
+**Query Parameters**:
+- `limit` (optional): Maximum number of logs to return
+- `offset` (optional): Pagination offset
+- `user_id` (optional): Filter by user ID
+- `model_version` (optional): Filter by model
+
+**Response**:
+```json
+{
+  "logs": [
+    {
+      "id": 1,
+      "created_at": "2024-01-01T12:00:00Z",
+      "user_id": "user123",
+      "prompt": "What is Python?",
+      "response": "Python is...",
+      "model_version": "gpt-3.5-turbo",
+      "latency_ms": 1234,
+      "status": "success"
+    }
+  ],
+  "total": 100
+}
+```
+
+#### GET /api/evaluations
+**Description**: Retrieve evaluation results
+
+**Query Parameters**:
+- `limit` (optional): Maximum number of evaluations
+- `min_score` (optional): Filter by minimum score
+- `max_score` (optional): Filter by maximum score
+
+**Response**:
+```json
+{
+  "evaluations": [
+    {
+      "id": 1,
+      "log_id": 1,
+      "score_overall": 4.5,
+      "comments": "Good response quality",
+      "judge_model": "rule-based"
+    }
+  ],
+  "total": 50
+}
+```
+
+#### GET /api/stats
+**Description**: Get summary statistics
+
+**Response**:
+```json
+{
+ "total_requests": 1000,
+ "average_score": 4.2,
+ "success_rate": 0.98,
+ "average_latency_ms": 1500,
+ "total_evaluations": 950
+}
+```
+
+---
+
+## 🔧 Configuration Updates
+
+### New Environment Variables
+
+```bash
+# Dashboard Service (optional)
+DASHBOARD_TITLE=LLM Quality Observer
+DASHBOARD_REFRESH_INTERVAL=30
+
+# CORS Settings (Gateway API)
+CORS_ALLOWED_ORIGINS=http://localhost:8501,http://127.0.0.1:8501
+CORS_ALLOW_CREDENTIALS=true
+CORS_ALLOW_METHODS=*
+CORS_ALLOW_HEADERS=*
+```
+
+### Updated docker-compose.local.yml
+
+```yaml
+services:
+  # ... existing services ...
+
+  dashboard:
+    build: ../../services/dashboard
+    container_name: llm-dashboard
+    depends_on:
+      - postgres
+    env_file:
+      - ../../configs/env/.env.local
+    environment:
+      DATABASE_URL: postgresql://llm_user:llm_password@postgres:5432/llm_quality
+    ports:
+      - "8501:8501"
+
+---
+
+## 🚀 Getting Started with Dashboard
+
+### Access the Dashboard
+
+1. **Start all services**:
+```bash
+cd infra/docker
+docker compose -f docker-compose.local.yml up --build
+```
+
+2. **Open browser**:
+```
+http://localhost:8501
+```
+
+3. **Navigate pages**:
+- Use sidebar to switch between Overview, Quality Metrics, Latency Analysis, and Logs
+
+### Generate Sample Data
+
+```bash
+# Send multiple requests to populate dashboard
+for i in {1..20}; do
+  curl -X POST "http://localhost:18000/chat" \
+    -H "Content-Type: application/json" \
+    -d "{\"prompt\": \"Test prompt $i\", \"user_id\": \"user$i\"}"
+done
+
+# Trigger evaluations
+curl -X POST "http://localhost:18001/evaluate-once?limit=20"
+
+# Refresh dashboard to see data
+```
+
+---
+
+## 📁 Updated Project Structure
+
+```
+LLM-Quality-Observer/
+├── services/
+│   ├── gateway-api/
+│   │   └── app/
+│   │       └── main.py           # Added CORS and dashboard endpoints
+│   ├── evaluator/
+│   │   └── app/
+│   │       ├── main.py
+│   │       ├── rules.py          # NEW: Rule-based evaluation logic
+│   │       └── schemas.py        # NEW: Evaluation schemas
+│   └── dashboard/                # NEW: Dashboard service
+│       ├── app/
+│       │   ├── main.py           # Streamlit application
+│       │   ├── models.py         # Database models
+│       │   ├── config.py         # Configuration
+│       │   └── db.py             # Database connection
+│       ├── Dockerfile
+│       └── pyproject.toml
+└── configs/
+    └── env/
+        └── .env.local.example    # Updated with CORS settings
+```
+
+---
+
+## 🔄 Migration from v0.1.0
+
+### Database Schema (No changes)
+The database schema remains compatible with v0.1.0. No migration required.
+
+### Configuration Changes
+Add CORS settings to your `.env.local`:
+```bash
+CORS_ALLOWED_ORIGINS=http://localhost:8501
+```
+
+### Docker Compose
+Update your docker-compose setup:
+```bash
+cd infra/docker
+docker compose -f docker-compose.local.yml down
+docker compose -f docker-compose.local.yml up --build
+```
+
+---
+
+## 🧪 Testing
+
+### Test Dashboard Access
+```bash
+# Check dashboard is running
+curl http://localhost:8501
+
+# Should return Streamlit HTML
+```
+
+### Test CORS Functionality
+```javascript
+// Run from the browser console on http://localhost:8501
+fetch('http://localhost:18000/api/stats')
+  .then(r => r.json())
+  .then(console.log)
+```
+
+### Test New API Endpoints
+```bash
+# Get logs
+curl "http://localhost:18000/api/logs?limit=10"
+
+# Get evaluations
+curl http://localhost:18000/api/evaluations
+
+# Get statistics
+curl http://localhost:18000/api/stats
+```
+
+---
+
+## 🐛 Bug Fixes
+
+- Fixed README documentation links for consistency
+- Improved error handling in evaluator service
+- Enhanced database connection stability
+- Fixed gitignore to properly exclude .env.local files
+
+---
+
+## 💡 Improvements
+
+- **Better Code Organization**: Separated evaluation logic into dedicated modules
+- **Configuration Management**: Added .env.local.example template
+- **Documentation**: Updated README with dashboard features
+- **Error Logging**: More comprehensive error tracking
+
+---
+
+## 🔒 Security Updates
+
+- **CORS Configuration**: Properly configured CORS to only allow dashboard origin
+- **Environment Variables**: Sensitive data moved to .env.local (gitignored)
+
+---
+
+## ⚠️ Known Limitations
+
+- **Single Page App**: Dashboard does not support deep linking
+- **No Authentication**: Dashboard and APIs are unauthenticated
+- **Limited Caching**: Dashboard queries database on every refresh
+- **No Real-time Updates**: Manual page refresh required to see new data
+- **Rule-Based Only**: LLM-as-a-judge not yet implemented
+
+---
+
+## 🛣️ Roadmap to v0.3.0
+
+- [ ] Implement LLM-as-a-judge evaluation
+- [ ] Add multi-language support for dashboard
+- [ ] Implement additional evaluation metrics
+- [ ] Add real-time dashboard updates
+- [ ] Improve dashboard performance with caching
+
+---
+
+## 📝 Technical Notes
+
+### CORS Configuration
+The Gateway API now includes CORS middleware to allow requests from the Streamlit dashboard:
+
+```python
+from fastapi.middleware.cors import CORSMiddleware
+
+app.add_middleware(
+    CORSMiddleware,
+    allow_origins=["http://localhost:8501"],
+    allow_credentials=True,
+    allow_methods=["*"],
+    allow_headers=["*"],
+)
+```
+
+### Rule-Based Evaluation Logic
+Evaluations now follow structured rules defined in `services/evaluator/app/rules.py`:
+
+```python
+def evaluate_response_length(response: str) -> float:
+    """Score based on response length."""
+    length = len(response)
+    if length < 50:
+        return 1.0
+    elif length < 200:
+        return 3.0
+    elif length < 1000:
+        return 5.0
+    else:
+        return 4.0
+```
+
+### Streamlit Dashboard
+The dashboard uses Streamlit's session state for maintaining filters and selections across reruns.
+
+---
+
+## 📚 Documentation Updates
+
+- Updated README.md with dashboard setup instructions
+- Added dashboard usage examples
+- Documented new API endpoints
+- Updated architecture diagrams
+
+---
+
+## 🙏 Acknowledgments
+
+- Streamlit team for the excellent dashboard framework
+- FastAPI for CORS middleware support
+- Community feedback on initial v0.1.0 release
+
+---
+
+**Previous Release**: [v0.1.0](./RELEASE_NOTES_v0.1.0.md)
+**Next Release**: v0.3.0 will introduce LLM-as-a-judge and multi-language support
diff --git a/docs/release_notes/RELEASE_NOTES_v0.3.0.md b/docs/release_notes/RELEASE_NOTES_v0.3.0.md
new file mode 100644
index 0000000..cb43b48
--- /dev/null
+++ b/docs/release_notes/RELEASE_NOTES_v0.3.0.md
@@ -0,0 +1,557 @@
+# Release Notes - v0.3.0
+
+**Release Date**: 2025
+**Status**: Intelligence & Localization Release
+
+---
+
+## Overview
+
+LLM-Quality-Observer v0.3.0 introduces LLM-as-a-Judge evaluation capabilities, enabling AI-powered quality assessment alongside rule-based evaluation. This release also adds comprehensive multi-language support for the dashboard and expands evaluation metrics with detailed scoring dimensions.
+
+---
+
+## 🎯 What's New
+
+### LLM-as-a-Judge Evaluation (NEW)
+- **AI-Powered Assessment**: Use GPT-4 or similar models to evaluate response quality
+- **Dual Evaluation System**: Choose between rule-based or LLM-as-a-judge evaluation
+- **Structured Prompting**: Carefully designed evaluation prompts for consistent scoring
+- **Judge Model Configuration**: Configurable judge model selection
+- **Raw Response Storage**: Store complete judge reasoning for transparency
+
+### Enhanced Evaluation Metrics (NEW)
+- **Multi-Dimensional Scoring**:
+  - `score_overall`: Overall response quality (1-5)
+  - `score_instruction_following`: How well the response follows the prompt (1-5)
+  - `score_truthfulness`: Factual accuracy and reliability (1-5)
+- **Judge Type Tracking**: Distinguish between rule-based and LLM judge evaluations
+- **Detailed Comments**: Comprehensive feedback on evaluation decisions
+- **Raw Judge Response**: Complete LLM judge output for audit trail
+
+### Multi-Language Dashboard Support (NEW)
+- **Localization Framework**: Complete i18n implementation
+- **Supported Languages**:
+  - 🇺🇸 English (en)
+  - 🇰🇷 Korean (ko)
+  - 🇯🇵 Japanese (ja)
+  - 🇨🇳 Chinese (zh)
+- **Language Switcher**: Easy language selection in dashboard UI
+- **Localized UI Elements**: All dashboard text and labels translated
+- **Persistent Language Selection**: Language preference saved in session
+
+### Time Series Analytics (NEW)
+- **Trend Analysis**: Track quality metrics over time
+- **Time-Based Aggregation**: Hourly, daily, weekly views
+- **Performance Metrics**: Monitor latency and success rate trends
+- **Historical Comparison**: Compare current vs. past performance
+
+---
+
+## 🏗️ Architecture Updates
+
+The architecture remains the same as v0.2.0, with enhanced evaluation logic:
+
+```
+Client/Browser → Dashboard (port 8501) [Multi-language UI]
+                           ↓
+                 Gateway API (port 18000) → OpenAI GPT
+                           ↓
+                 PostgreSQL (port 5432)
+                           ↑
+                 Evaluator (port 18001) → OpenAI GPT (Judge Model)
+```
+
+---
+
+## 🗄️ Database Schema Changes
+
+### Updated llm_evaluations Table
+
+```sql
+ALTER TABLE llm_evaluations
+ADD COLUMN score_instruction_following FLOAT,
+ADD COLUMN score_truthfulness FLOAT,
+ADD COLUMN judge_type VARCHAR, -- 'rule' or 'llm'
+ADD COLUMN raw_judge_response TEXT; -- Complete LLM judge output
+```
+
+**Migration**: These columns are nullable, so existing data remains compatible.
+
+---
+
+## 🆕 New Features in Detail
+
+### LLM-as-a-Judge Implementation
+
+**Evaluation Prompt Structure**:
+```python
+"""
+You are an expert AI assistant evaluator. Evaluate the following LLM response.
+
+Prompt: {prompt}
+Response: {response}
+
+Provide scores (1-5) for:
+1. Overall Quality
+2. Instruction Following
+3. Truthfulness
+
+Return JSON:
+{
+ "score_overall": float,
+ "score_instruction_following": float,
+ "score_truthfulness": float,
+ "comments": "detailed explanation"
+}
+"""
+```
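+
+If the template above is filled in with `str.format()`, the literal braces of the JSON skeleton have to be doubled, otherwise Python treats them as placeholders. The following is a minimal sketch of that idea; `JUDGE_TEMPLATE` and `build_judge_prompt` are hypothetical names, and the wording is shortened from the prompt shown above.
+
+```python
+# Hypothetical template; literal JSON braces are doubled so str.format()
+# only substitutes {prompt} and {response}.
+JUDGE_TEMPLATE = """You are an expert AI assistant evaluator. Evaluate the following LLM response.
+
+Prompt: {prompt}
+Response: {response}
+
+Return JSON:
+{{
+  "score_overall": <1-5>,
+  "score_instruction_following": <1-5>,
+  "score_truthfulness": <1-5>,
+  "comments": "<detailed explanation>"
+}}
+"""
+
+
+def build_judge_prompt(prompt: str, response: str) -> str:
+    """Fill the template with the logged prompt and response."""
+    return JUDGE_TEMPLATE.format(prompt=prompt, response=response)
+
+
+# Braces inside the substituted values are left untouched by format().
+text = build_judge_prompt("What is Python?", "A language with {braces}.")
+print(text)
+```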
+
+**Usage**:
+```python
+# Example call from another module in services/evaluator/app/ (e.g. main.py);
+# evaluate_with_llm itself is defined in llm_judge.py
+from .llm_judge import evaluate_with_llm
+
+result = evaluate_with_llm(
+    prompt="What is Python?",
+    response="Python is a programming language...",
+    judge_model="gpt-4"
+)
+
+# Returns structured evaluation with scores and comments
+```
+
+### Multi-Dimensional Scoring
+
+Each evaluation now includes three distinct scores:
+
+1. **Overall Quality (score_overall)**:
+   - Comprehensive quality assessment
+   - Considers all aspects of the response
+   - Range: 1 (poor) to 5 (excellent)
+
+2. **Instruction Following (score_instruction_following)**:
+   - How well the response addresses the prompt
+   - Relevance and completeness
+   - Range: 1 (off-topic) to 5 (perfect match)
+
+3. **Truthfulness (score_truthfulness)**:
+   - Factual accuracy
+   - Absence of hallucinations
+   - Reliability of information
+   - Range: 1 (false) to 5 (accurate)
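+
+Because an LLM judge can return out-of-range or missing values, it may be worth validating its output before storing the three scores. The sketch below shows one possible approach; `parse_judge_output` is a hypothetical helper, and storing missing scores as `None` is an assumption that relies on the new columns being nullable.
+
+```python
+import json
+
+SCORE_FIELDS = ("score_overall", "score_instruction_following", "score_truthfulness")
+
+
+def parse_judge_output(raw: str) -> dict:
+    """Parse judge JSON, keeping each score as a float clamped to 1-5."""
+    data = json.loads(raw)
+    result = {"comments": str(data.get("comments", "")), "raw_judge_response": raw}
+    for field in SCORE_FIELDS:
+        value = data.get(field)
+        # Missing or non-numeric scores are stored as None (NULL in the DB).
+        if isinstance(value, (int, float)) and not isinstance(value, bool):
+            result[field] = max(1.0, min(5.0, float(value)))
+        else:
+            result[field] = None
+    return result
+
+
+raw = '{"score_overall": 4.5, "score_instruction_following": 7, "comments": "ok"}'
+parsed = parse_judge_output(raw)
+print(parsed["score_overall"], parsed["score_instruction_following"], parsed["score_truthfulness"])
+```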
+
+### Localization System
+
+**Language Configuration**:
+```python
+# Dashboard language files
+locales/
+├── en.json # English
+├── ko.json # Korean
+├── ja.json # Japanese
+└── zh.json # Chinese
+```
+
+**Example Translation**:
+```json
+{
+ "dashboard.title": "LLM Quality Observer",
+ "dashboard.overview": "Overview",
+ "dashboard.quality_metrics": "Quality Metrics",
+ "dashboard.select_language": "Select Language",
+ "metrics.total_requests": "Total Requests",
+ "metrics.average_score": "Average Score"
+}
+```
+
+**Usage in Dashboard**:
+```python
+from utils.localization import get_text
+
+st.title(get_text("dashboard.title"))
+st.metric(get_text("metrics.total_requests"), total_count)
+```
+
+---
+
+## 📖 Updated API Endpoints
+
+### Evaluator Service
+
+#### POST /evaluate-once
+**Enhanced**: Now supports judge type selection
+
+**Query Parameters**:
+- `limit` (optional): Number of logs to evaluate
+- `judge_type` (optional): "rule" or "llm" (default: "rule")
+
+**Request Example**:
+```bash
+# Use LLM-as-a-judge
+curl -X POST "http://localhost:18001/evaluate-once?limit=10&judge_type=llm"
+
+# Use rule-based evaluation
+curl -X POST "http://localhost:18001/evaluate-once?limit=10&judge_type=rule"
+```
+
+**Response**:
+```json
+{
+ "message": "Evaluation completed",
+ "evaluated_count": 10,
+ "judge_type": "llm",
+ "average_score_overall": 4.2,
+ "average_score_instruction_following": 4.5,
+ "average_score_truthfulness": 4.0
+}
+```
+
+### Gateway API
+
+#### GET /api/time-series
+**New**: Time series data for trend analysis
+
+**Query Parameters**:
+- `metric`: "score", "latency", or "count"
+- `interval`: "hour", "day", or "week"
+- `start_date` (optional): Start date (ISO 8601)
+- `end_date` (optional): End date (ISO 8601)
+
+**Request Example**:
+```bash
+curl "http://localhost:18000/api/time-series?metric=score&interval=day"
+```
+
+**Response**:
+```json
+{
+  "metric": "score",
+  "interval": "day",
+  "data": [
+    {
+      "timestamp": "2024-01-01T00:00:00Z",
+      "value": 4.2,
+      "count": 150
+    },
+    {
+      "timestamp": "2024-01-02T00:00:00Z",
+      "value": 4.5,
+      "count": 200
+    }
+  ]
+}
+```
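+
+To make the shape of the `data` array concrete, the sketch below groups raw `(timestamp, value)` pairs into daily buckets in plain Python. It is illustrative only: `daily_average` is a hypothetical helper, and the actual endpoint may well aggregate in SQL rather than in application code.
+
+```python
+from collections import defaultdict
+from datetime import datetime, timezone
+
+
+def daily_average(points: list[tuple[datetime, float]]) -> list[dict]:
+    """Group (timestamp, value) pairs by UTC day and average each bucket."""
+    buckets: dict[datetime, list[float]] = defaultdict(list)
+    for ts, value in points:
+        day = ts.astimezone(timezone.utc).replace(hour=0, minute=0, second=0, microsecond=0)
+        buckets[day].append(value)
+    return [
+        {
+            "timestamp": day.strftime("%Y-%m-%dT%H:%M:%SZ"),
+            "value": round(sum(values) / len(values), 2),
+            "count": len(values),
+        }
+        for day, values in sorted(buckets.items())
+    ]
+
+
+utc = timezone.utc
+sample = [
+    (datetime(2024, 1, 1, 9, 0, tzinfo=utc), 4.0),
+    (datetime(2024, 1, 1, 18, 30, tzinfo=utc), 5.0),
+    (datetime(2024, 1, 2, 7, 15, tzinfo=utc), 3.0),
+]
+print(daily_average(sample))
+```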
+
+#### GET /api/evaluations
+**Enhanced**: Now returns additional score dimensions
+
+**Response**:
+```json
+{
+  "evaluations": [
+    {
+      "id": 1,
+      "log_id": 1,
+      "score_overall": 4.5,
+      "score_instruction_following": 5.0,
+      "score_truthfulness": 4.0,
+      "judge_type": "llm",
+      "comments": "Excellent response with minor factual nuance",
+      "judge_model": "gpt-4",
+      "raw_judge_response": "{...}"
+    }
+  ]
+}
+```
+
+---
+
+## 🔧 Configuration Updates
+
+### New Environment Variables
+
+```bash
+# LLM Judge Configuration
+OPENAI_MODEL_JUDGE=gpt-4 # Model used for LLM-as-a-judge
+EVALUATION_JUDGE_TYPE=rule # Default: 'rule' or 'llm'
+
+# Dashboard Localization
+DEFAULT_LANGUAGE=en # Default UI language
+SUPPORTED_LANGUAGES=en,ko,ja,zh # Available languages
+```
+
+### Updated .env.local.example
+
+```bash
+# Existing variables...
+
+# LLM Judge Settings (v0.3.0+)
+OPENAI_MODEL_JUDGE=gpt-4
+EVALUATION_JUDGE_TYPE=rule
+
+# Dashboard Localization (v0.3.0+)
+DEFAULT_LANGUAGE=en
+```
+
+---
+
+## 📁 Updated Project Structure
+
+```
+LLM-Quality-Observer/
+├── services/
+│   ├── evaluator/
+│   │   └── app/
+│   │       ├── llm_judge.py          # NEW: LLM-as-a-judge implementation
+│   │       ├── rules.py              # Enhanced rule-based evaluation
+│   │       ├── schemas.py            # Updated with new score fields
+│   │       └── models.py             # Updated evaluation model
+│   └── dashboard/
+│       ├── app/
+│       │   ├── locales/              # NEW: Translation files
+│       │   │   ├── en.json
+│       │   │   ├── ko.json
+│       │   │   ├── ja.json
+│       │   │   └── zh.json
+│       │   └── utils/
+│       │       └── localization.py   # NEW: i18n utilities
+└── docs/
+    └── README.md                     # Updated with v0.3.0 features
+```
+
+---
+
+## 🔄 Migration from v0.2.0
+
+### Database Migration
+
+**Option 1: Automatic (on startup)**
+The services will automatically add new columns if they don't exist.
+
+**Option 2: Manual SQL**
+```sql
+-- Add new columns to llm_evaluations table
+ALTER TABLE llm_evaluations
+ADD COLUMN IF NOT EXISTS score_instruction_following FLOAT,
+ADD COLUMN IF NOT EXISTS score_truthfulness FLOAT,
+ADD COLUMN IF NOT EXISTS judge_type VARCHAR,
+ADD COLUMN IF NOT EXISTS raw_judge_response TEXT;
+```
+
+### Configuration Updates
+
+1. **Update .env.local**:
+```bash
+# Add new variables
+OPENAI_MODEL_JUDGE=gpt-4
+EVALUATION_JUDGE_TYPE=rule
+DEFAULT_LANGUAGE=en
+```
+
+2. **Restart services**:
+```bash
+cd infra/docker
+docker compose -f docker-compose.local.yml down
+docker compose -f docker-compose.local.yml up --build
+```
+
+### Re-evaluate Existing Logs
+
+Existing evaluations only have `score_overall`. To populate new metrics:
+
+```bash
+# Re-evaluate with LLM judge
+curl -X POST "http://localhost:18001/evaluate-once?limit=100&judge_type=llm"
+```
+
+---
+
+## 🚀 Using LLM-as-a-Judge
+
+### Cost Considerations
+
+LLM-as-a-judge uses OpenAI API calls:
+- **GPT-4**: ~$0.03 per evaluation (input + output tokens)
+- **GPT-3.5-turbo**: ~$0.002 per evaluation
+
+**Recommendation**: Start with rule-based evaluation, use LLM judge for:
+- Quality assurance sampling
+- Disputed evaluations
+- Production monitoring of critical flows
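+
+A back-of-the-envelope estimate can help decide how much traffic to route to the judge. The sketch below simply multiplies the approximate per-evaluation figures quoted above by volume and sampling rate; `estimate_monthly_cost` is a hypothetical helper, and real costs depend on token counts and current pricing.
+
+```python
+# Per-evaluation costs quoted above (approximate; actual pricing varies).
+COST_PER_EVAL = {"gpt-4": 0.03, "gpt-3.5-turbo": 0.002}
+
+
+def estimate_monthly_cost(logs_per_day: int, sample_rate: float, model: str, days: int = 30) -> float:
+    """Estimate judge spend when only a fraction of logs is sent to the LLM judge."""
+    return round(logs_per_day * sample_rate * COST_PER_EVAL[model] * days, 2)
+
+
+# 10,000 logs per day: 10% sampled with GPT-4 vs. everything with GPT-3.5-turbo
+print(estimate_monthly_cost(10_000, 0.10, "gpt-4"))
+print(estimate_monthly_cost(10_000, 1.0, "gpt-3.5-turbo"))
+```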
+
+### Evaluation Comparison
+
+```bash
+# Evaluate same logs with both methods
+curl -X POST "http://localhost:18001/evaluate-once?limit=10&judge_type=rule"
+curl -X POST "http://localhost:18001/evaluate-once?limit=10&judge_type=llm"
+
+# Compare results in dashboard
+# Navigate to: Quality Metrics → Score by Judge Type
+```
+
+### Best Practices
+
+1. **Judge Model Selection**:
+   - GPT-4: Higher accuracy, slower, more expensive
+   - GPT-3.5-turbo: Faster, cheaper, good for high-volume
+
+2. **Hybrid Approach**:
+   - Use rule-based for all evaluations
+   - Use LLM judge for random 10% sampling
+   - Use LLM judge when rule-based score is borderline (2.5-3.5)
+
+3. **Prompt Engineering**:
+   - Customize evaluation prompts in `services/evaluator/app/llm_judge.py`
+   - Include domain-specific criteria
+   - Provide examples for consistency
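+
+The hybrid approach from the list above (rule-based for everything, LLM judge for a random 10% sample and for borderline scores) could be expressed as a small decision function. This is a sketch, not the project's implementation; `needs_llm_judge` and the constant names are hypothetical.
+
+```python
+import random
+
+SAMPLE_RATE = 0.10        # "random 10% sampling" from the list above
+BORDERLINE = (2.5, 3.5)   # borderline rule-based score range from the list above
+
+
+def needs_llm_judge(rule_score: float, rng: random.Random) -> bool:
+    """Escalate to the LLM judge for borderline scores or a random sample."""
+    if BORDERLINE[0] <= rule_score <= BORDERLINE[1]:
+        return True
+    return rng.random() < SAMPLE_RATE
+
+
+rng = random.Random(0)  # seeded only to make the sketch reproducible
+scores = [1.0, 3.0, 4.8, 2.5, 5.0]
+print([needs_llm_judge(s, rng) for s in scores])
+```
+
+Passing the random generator in as a parameter keeps the sampling decision easy to test.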
+
+---
+
+## 🧪 Testing
+
+### Test LLM-as-a-Judge
+
+```bash
+# Send a test request
+curl -X POST "http://localhost:18000/chat" \
+ -H "Content-Type: application/json" \
+ -d '{"prompt": "Explain quantum computing", "user_id": "test"}'
+
+# Evaluate with LLM judge
+curl -X POST "http://localhost:18001/evaluate-once?limit=1&judge_type=llm"
+
+# Check evaluation result
+curl "http://localhost:18000/api/evaluations?limit=1" | jq
+```
+
+### Test Multi-Language Dashboard
+
+1. Open dashboard: http://localhost:8501
+2. Click language selector in sidebar
+3. Switch between English, Korean, Japanese, Chinese
+4. Verify all UI elements are translated
+
+### Test Time Series API
+
+```bash
+# Get daily score trends
+curl "http://localhost:18000/api/time-series?metric=score&interval=day" | jq
+
+# Get hourly latency trends
+curl "http://localhost:18000/api/time-series?metric=latency&interval=hour" | jq
+```
+
+---
+
+## 💡 Improvements
+
+- **Evaluation Quality**: Multi-dimensional scoring provides deeper insights
+- **Judge Transparency**: Raw LLM judge responses enable audit and debugging
+- **User Experience**: Multi-language support expands global accessibility
+- **Analytics**: Time series data enables trend analysis and forecasting
+- **Flexibility**: Choose evaluation method based on cost/quality tradeoff
+
+---
+
+## 🐛 Bug Fixes
+
+- Fixed evaluation schema to properly handle null values
+- Improved error handling for LLM API failures
+- Enhanced database connection retry logic
+- Fixed dashboard rendering issues with large datasets
+
+---
+
+## ⚠️ Known Limitations
+
+- **LLM Judge Cost**: Can be expensive for high-volume evaluation
+- **LLM Judge Latency**: Slower than rule-based (2-5 seconds per evaluation)
+- **Language Detection**: Dashboard doesn't auto-detect browser language
+- **Limited Translation Coverage**: Some error messages remain in English
+- **Time Series Caching**: No caching, queries can be slow for large datasets
+
+---
+
+## 🛣️ Roadmap to v0.4.0
+
+- [ ] Automated evaluation scheduler
+- [ ] Slack/Discord notification system
+- [ ] CI/CD pipeline with automated tests
+- [ ] Batch evaluation optimization
+- [ ] Prometheus metrics integration
+
+---
+
+## 📝 Technical Notes
+
+### LLM Judge Implementation
+
+The judge uses structured output to ensure consistent scoring:
+
+```python
+# services/evaluator/app/llm_judge.py
+import json
+from openai import OpenAI
+
+def evaluate_with_llm(prompt: str, response: str, judge_model: str) -> dict:
+    client = OpenAI(api_key=settings.llm_api_key)
+
+    evaluation_prompt = f"""
+    Evaluate this LLM response:
+
+    Prompt: {prompt}
+    Response: {response}
+
+    Provide JSON with scores (1-5) and comments.
+    """
+
+    result = client.chat.completions.create(
+        model=judge_model,
+        messages=[{"role": "user", "content": evaluation_prompt}],
+        response_format={"type": "json_object"}
+    )
+
+    return json.loads(result.choices[0].message.content)
+```
+
+### Localization System
+
+Uses a simple key-based translation system:
+
+```python
+# services/dashboard/app/utils/localization.py
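+import json
+
+def load_translations(language: str) -> dict:
+    # Read as UTF-8 so Korean, Japanese, and Chinese strings load on any platform
+    with open(f"locales/{language}.json", encoding="utf-8") as f:
+        return json.load(f)
+
+def get_text(key: str, language: str = "en") -> str:
+    translations = load_translations(language)
+    # Fall back to the key itself when a translation is missing
+    return translations.get(key, key)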
+```
+
+---
+
+## 📚 Documentation Updates
+
+- Updated README with LLM-as-a-judge setup
+- Added localization guide
+- Documented new scoring dimensions
+- Added cost estimation guide for LLM judge
+- Updated API documentation with new endpoints
+
+---
+
+## 🙏 Acknowledgments
+
+- OpenAI for GPT-4 API enabling intelligent evaluation
+- Community contributors for translation support
+- Users providing feedback on evaluation criteria
+
+---
+
+**Previous Release**: [v0.2.0](./RELEASE_NOTES_v0.2.0.md)
+**Next Release**: v0.4.0 will introduce automated scheduling and notifications
diff --git a/docs/release_notes/RELEASE_NOTES_v0.4.0.md b/docs/release_notes/RELEASE_NOTES_v0.4.0.md
new file mode 100644
index 0000000..4e3aa6b
--- /dev/null
+++ b/docs/release_notes/RELEASE_NOTES_v0.4.0.md
@@ -0,0 +1,667 @@
+# Release Notes - v0.4.0
+
+**Release Date**: 2025
+**Status**: Automation & DevOps Release
+
+---
+
+## Overview
+
+LLM-Quality-Observer v0.4.0 introduces automated evaluation scheduling and a multi-channel notification system, transforming the platform from a manual evaluation tool into a fully automated monitoring solution. This release also establishes a CI/CD pipeline with automated testing and code quality enforcement.
+
+---
+
+## 🎯 What's New
+
+### Automated Evaluation Scheduler (NEW)
+- **APScheduler Integration**: Continuous background evaluation without manual triggers
+- **Configurable Intervals**: Set evaluation frequency (minutes, hours, days)
+- **Batch Processing**: Efficient evaluation of pending logs in configurable batch sizes
+- **Auto-Start**: Scheduler starts automatically with the Evaluator service
+- **Health Monitoring**: Track scheduler status and execution history
+- **Graceful Shutdown**: Proper cleanup on service stop
+
+### Multi-Channel Notification System (NEW)
+- **Slack Integration**: Send alerts to Slack channels via webhooks
+- **Discord Integration**: Post notifications to Discord servers
+- **Low-Quality Alerts**: Automatic notifications when scores fall below threshold
+- **Batch Summaries**: Periodic summaries of evaluation results
+- **Configurable Thresholds**: Set custom score thresholds for alerts
+- **Rich Formatting**: Color-coded messages with detailed metrics
+
+### CI/CD Pipeline (NEW)
+- **GitHub Actions Workflow**: Automated build, lint, and test on every commit
+- **Multi-Service Testing**: Parallel testing of gateway-api, evaluator, and dashboard
+- **Code Quality Checks**: Flake8 linting for Python code style enforcement
+- **Health Check Tests**: Automated service health verification
+- **Build Validation**: Ensure all services build successfully
+- **PR Checks**: Required status checks before merge
+
+### Batch Evaluation Optimization (NEW)
+- **Efficient Queries**: Optimized database queries for pending logs
+- **Configurable Batch Size**: Process logs in customizable batches
+- **Progress Tracking**: Monitor evaluation progress in real-time
+- **Error Recovery**: Continue processing even if individual evaluations fail
+- **Utility Functions**: Helper functions for log management
+
+---
+
+## 🏗️ Architecture Updates
+
+```
+ ┌─────────────┐
+ │ Scheduler │ (APScheduler)
+ │ (cron) │
+ └──────┬──────┘
+ │
+ ↓
+Client/Browser → Dashboard (8501) → Gateway API (18000) → OpenAI GPT
+ ↓ ↓
+ PostgreSQL (5432)
+ ↑
+ Evaluator (18001) → OpenAI Judge
+ ↓
+ ┌──────┴──────┐
+ │ │
+ ┌───▼──┐ ┌───▼────┐
+ │Slack │ │Discord │
+ └──────┘ └────────┘
+```
+
+---
+
+## 🆕 New Features in Detail
+
+### APScheduler Implementation
+
+**Automatic Scheduling**:
+```python
+from apscheduler.schedulers.asyncio import AsyncIOScheduler
+
+scheduler = AsyncIOScheduler()
+
+# Schedule evaluation every 60 minutes
+scheduler.add_job(
+ run_batch_evaluation,
+ 'interval',
+ minutes=60,
+ id='batch_evaluation',
+ max_instances=1
+)
+
+scheduler.start()
+```
+
+**Configuration Options**:
+- `ENABLE_AUTO_EVALUATION`: Enable/disable automatic evaluation
+- `EVALUATION_INTERVAL_MINUTES`: Time between evaluation runs
+- `EVALUATION_BATCH_SIZE`: Number of logs per batch
+- `EVALUATION_JUDGE_TYPE`: Default judge type (rule/llm)
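As a rough sketch of how these four options could be parsed and validated at startup: the `SchedulerSettings` class and `load_scheduler_settings` function below are hypothetical names, not the service's actual config code, and the defaults are taken from the example values shown in these notes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SchedulerSettings:
    enable_auto_evaluation: bool
    evaluation_interval_minutes: int
    evaluation_batch_size: int
    evaluation_judge_type: str


def load_scheduler_settings(env):
    """Build settings from a mapping such as os.environ (hypothetical helper)."""
    judge_type = env.get("EVALUATION_JUDGE_TYPE", "rule").strip().lower()
    if judge_type not in ("rule", "llm"):
        raise ValueError(f"EVALUATION_JUDGE_TYPE must be 'rule' or 'llm', got {judge_type!r}")
    enabled = env.get("ENABLE_AUTO_EVALUATION", "true").strip().lower() in ("1", "true", "yes")
    return SchedulerSettings(
        enable_auto_evaluation=enabled,
        evaluation_interval_minutes=int(env.get("EVALUATION_INTERVAL_MINUTES", "60")),
        evaluation_batch_size=int(env.get("EVALUATION_BATCH_SIZE", "10")),
        evaluation_judge_type=judge_type,
    )
```

In a service this would typically be called once as `load_scheduler_settings(os.environ)`; accepting any mapping keeps it easy to test.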
+
+**Scheduler Status API**:
+```bash
+# Get scheduler information
+curl http://localhost:18001/scheduler/status
+
+# Response:
+{
+  "enabled": true,
+  "interval_minutes": 60,
+  "next_run_time": "2024-01-01T13:00:00Z",
+  "last_run_time": "2024-01-01T12:00:00Z",
+  "total_runs": 42
+}
+```
+
+### Notification System
+
+**Slack Notifications**:
+```json
+{
+ "text": "🚨 Low Quality Alert",
+ "attachments": [{
+ "color": "danger",
+ "fields": [
+ {"title": "Log ID", "value": "123", "short": true},
+ {"title": "Score", "value": "2.0/5.0", "short": true},
+ {"title": "Model", "value": "gpt-3.5-turbo", "short": true},
+ {"title": "Judge", "value": "llm", "short": true}
+ ],
+ "footer": "LLM Quality Observer"
+ }]
+}
+```
+
+**Discord Notifications**:
+```json
+{
+ "embeds": [{
+ "title": "🚨 Low Quality Alert",
+ "color": 15158332,
+ "fields": [
+ {"name": "Log ID", "value": "123", "inline": true},
+ {"name": "Score", "value": "2.0/5.0", "inline": true},
+ {"name": "Prompt", "value": "User question..."}
+ ],
+ "footer": {"text": "LLM Quality Observer"}
+ }]
+}
+```
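The two payloads above could be assembled from an evaluation result along these lines. This is a sketch under assumptions: the function names and the 100-character prompt snippet limit are hypothetical; the field titles, colors, and footer are copied from the JSON examples.

```python
def build_slack_alert(log_id, score, model, judge):
    """Return a Slack webhook payload shaped like the example above."""
    return {
        "text": "🚨 Low Quality Alert",
        "attachments": [{
            "color": "danger",
            "fields": [
                {"title": "Log ID", "value": str(log_id), "short": True},
                {"title": "Score", "value": f"{score:.1f}/5.0", "short": True},
                {"title": "Model", "value": model, "short": True},
                {"title": "Judge", "value": judge, "short": True},
            ],
            "footer": "LLM Quality Observer",
        }],
    }


def build_discord_alert(log_id, score, prompt, max_prompt_chars=100):
    """Return a Discord webhook payload; long prompts are cut to a snippet."""
    snippet = prompt if len(prompt) <= max_prompt_chars else prompt[:max_prompt_chars] + "..."
    return {
        "embeds": [{
            "title": "🚨 Low Quality Alert",
            "color": 15158332,  # decimal color value from the example above
            "fields": [
                {"name": "Log ID", "value": str(log_id), "inline": True},
                {"name": "Score", "value": f"{score:.1f}/5.0", "inline": True},
                {"name": "Prompt", "value": snippet},
            ],
            "footer": {"text": "LLM Quality Observer"},
        }],
    }
```

Truncating the prompt keeps alerts short and limits how much user text leaves the system.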
+
+**Notification Types**:
+
+1. **Low-Quality Alerts** (Real-time):
+   - Triggered when `score_overall ≤ NOTIFICATION_SCORE_THRESHOLD`
+   - Includes log details, prompt snippet, and scores
+   - Sent immediately after evaluation
+
+2. **Batch Summaries** (Periodic):
+   - Summary of batch evaluation results
+   - Statistics: total evaluated, average score, low-quality count
+   - Sent after each scheduler run
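The alert condition and the batch statistics described above amount to a few lines of arithmetic. The sketch below is illustrative only: the `Evaluation` dataclass and both function names are assumptions, and the result keys mirror the `last_run_stats` object in the scheduler status response.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Evaluation:
    log_id: int
    score_overall: float


def is_low_quality(evaluation, threshold):
    # Alerts fire when the score is at or below the threshold (inclusive)
    return evaluation.score_overall <= threshold


def summarize_batch(evaluations, threshold):
    """Compute total evaluated, average score, and low-quality count."""
    scores = [e.score_overall for e in evaluations]
    return {
        "evaluated_count": len(evaluations),
        "average_score": round(mean(scores), 2) if scores else None,
        "low_quality_count": sum(1 for e in evaluations if is_low_quality(e, threshold)),
    }
```

Because the comparison is inclusive, a threshold of 3 alerts on a score of exactly 3 as well.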
+
+### CI/CD Workflow
+
+**GitHub Actions Workflow** (`.github/workflows/ci.yml`):
+```yaml
+name: CI
+
+on:
+  push:
+    branches: [main, feat/*]
+  pull_request:
+    branches: [main]
+
+jobs:
+  lint:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v3
+      - name: Set up Python
+        uses: actions/setup-python@v4
+        with:
+          python-version: '3.12'
+      - name: Install dependencies
+        run: pip install flake8
+      - name: Lint with flake8
+        run: flake8 services/
+
+  build:
+    runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        service: [gateway-api, evaluator, dashboard]
+    steps:
+      - uses: actions/checkout@v3
+      - name: Build ${{ matrix.service }}
+        run: docker build services/${{ matrix.service }}
+
+  test:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v3
+      - name: Start services
+        run: docker compose -f infra/docker/docker-compose.local.yml up -d
+      - name: Run health checks
+        run: |
+          curl -f http://localhost:18000/health
+          curl -f http://localhost:18001/health
+
+**Flake8 Configuration** (`.flake8`):
+```ini
+[flake8]
+max-line-length = 100
+exclude = .git,__pycache__,.venv,venv,build,dist
+ignore = E203,W503
+```
+
+### Health Check Tests
+
+**Test Suite** (`services/*/tests/test_health.py`):
+```python
+import pytest
+from fastapi.testclient import TestClient
+from app.main import app
+
+client = TestClient(app)
+
+def test_health_endpoint():
+    response = client.get("/health")
+    assert response.status_code == 200
+    assert response.json()["status"] == "healthy"
+
+def test_health_response_time():
+    response = client.get("/health")
+    assert response.elapsed.total_seconds() < 1.0
+```
+
+---
+
+## 📖 New API Endpoints
+
+### Evaluator Service
+
+#### GET /scheduler/status
+**Description**: Get scheduler information
+
+**Response**:
+```json
+{
+ "enabled": true,
+ "interval_minutes": 60,
+ "batch_size": 10,
+ "judge_type": "rule",
+ "next_run_time": "2024-01-01T13:00:00Z",
+ "last_run_time": "2024-01-01T12:00:00Z",
+ "total_runs": 42,
+ "last_run_stats": {
+ "evaluated_count": 10,
+ "average_score": 4.2,
+ "low_quality_count": 1
+ }
+}
+```
+
+#### POST /scheduler/trigger
+**Description**: Manually trigger scheduler (in addition to automatic runs)
+
+**Response**:
+```json
+{
+ "message": "Scheduler triggered",
+ "job_id": "batch_evaluation"
+}
+```
+
+#### GET /notifications/stats
+**Description**: Get notification delivery statistics
+
+**Response**:
+```json
+{
+ "total_sent": 150,
+ "by_channel": {
+ "slack": 80,
+ "discord": 70
+ },
+ "by_type": {
+ "low_quality_alert": 30,
+ "batch_summary": 120
+ },
+ "success_rate": 0.98
+}
+```
+
+---
+
+## 🔧 Configuration Updates
+
+### New Environment Variables
+
+```bash
+# Batch Evaluation Scheduler (v0.4.0+)
+ENABLE_AUTO_EVALUATION=true # Enable automatic evaluation
+EVALUATION_INTERVAL_MINUTES=60 # Run every 60 minutes
+EVALUATION_BATCH_SIZE=10 # Process 10 logs per batch
+EVALUATION_JUDGE_TYPE=rule # Default judge type
+
+# Notification Settings (v0.4.0+)
+SLACK_WEBHOOK_URL=https://hooks.slack.com/services/YOUR/WEBHOOK/URL
+DISCORD_WEBHOOK_URL=https://discord.com/api/webhooks/YOUR/WEBHOOK/URL
+NOTIFICATION_SCORE_THRESHOLD=3 # Alert when score ≤ 3
+```
+
+### Updated .env.local.example
+
+```bash
+# Existing variables...
+
+# Batch Evaluation Scheduler (v0.4.0+)
+ENABLE_AUTO_EVALUATION=true
+EVALUATION_INTERVAL_MINUTES=60
+EVALUATION_BATCH_SIZE=10
+EVALUATION_JUDGE_TYPE=rule
+
+# Notification Settings (v0.4.0+)
+SLACK_WEBHOOK_URL=
+DISCORD_WEBHOOK_URL=
+NOTIFICATION_SCORE_THRESHOLD=3
+```
+
+---
+
+## 📁 Updated Project Structure
+
+```
+LLM-Quality-Observer/
+├── .github/
+│   └── workflows/
+│       └── ci.yml                    # NEW: GitHub Actions CI/CD
+├── .flake8                           # NEW: Flake8 configuration
+├── services/
+│   ├── gateway-api/
+│   │   └── tests/                    # NEW: Health check tests
+│   │       ├── __init__.py
+│   │       └── test_health.py
+│   ├── evaluator/
+│   │   ├── app/
+│   │   │   ├── scheduler.py          # NEW: APScheduler integration
+│   │   │   ├── notifier.py           # NEW: Slack/Discord notifications
+│   │   │   └── utils.py              # NEW: Utility functions
+│   │   └── tests/                    # NEW: Health check tests
+│   │       ├── __init__.py
+│   │       └── test_health.py
+│   └── dashboard/
+│       └── tests/
+└── docs/
+    └── release_notes/
+        └── RELEASE_NOTES_v0.4.0.md   # This file
+```
+
+---
+
+## 🔄 Migration from v0.3.0
+
+### Database Schema (No changes)
+The database schema remains compatible with v0.3.0. No migration required.
+
+### Configuration Updates
+
+1. **Update .env.local with scheduler settings**:
+```bash
+# Add these new variables
+ENABLE_AUTO_EVALUATION=true
+EVALUATION_INTERVAL_MINUTES=60
+EVALUATION_BATCH_SIZE=10
+EVALUATION_JUDGE_TYPE=rule
+NOTIFICATION_SCORE_THRESHOLD=3
+```
+
+2. **Add notification webhooks** (optional):
+```bash
+# Slack
+SLACK_WEBHOOK_URL=https://hooks.slack.com/services/YOUR/WEBHOOK/URL
+
+# Discord
+DISCORD_WEBHOOK_URL=https://discord.com/api/webhooks/YOUR/WEBHOOK/URL
+```
+
+3. **Restart Evaluator service**:
+```bash
+cd infra/docker
+docker compose -f docker-compose.local.yml restart evaluator
+```
+
+### Verify Scheduler
+
+```bash
+# Check scheduler status
+curl http://localhost:18001/scheduler/status
+
+# Check logs for scheduler startup
+docker logs llm-evaluator | grep -i scheduler
+```
+
+---
+
+## 🚀 Setting Up Notifications
+
+### Slack Setup
+
+1. **Create Slack Webhook**:
+ - Go to https://api.slack.com/apps
+ - Create new app → Incoming Webhooks
+ - Add to workspace and select channel
+ - Copy webhook URL
+
+2. **Configure in .env.local**:
+```bash
+SLACK_WEBHOOK_URL=https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXX
+```
+
+3. **Test notification**:
+```bash
+# Temporarily set NOTIFICATION_SCORE_THRESHOLD=5 in .env.local so that every score triggers an alert,
+# then recreate the evaluator so the new value is loaded
+docker compose -f docker-compose.local.yml up -d --force-recreate evaluator
+
+# Send test request and evaluate
+curl -X POST "http://localhost:18000/chat" \
+ -H "Content-Type: application/json" \
+ -d '{"prompt": "Test", "user_id": "test"}'
+
+curl -X POST "http://localhost:18001/evaluate-once?limit=1"
+```
+
+### Discord Setup
+
+1. **Create Discord Webhook**:
+ - Open Discord server settings
+ - Integrations → Webhooks → New Webhook
+ - Select channel and copy URL
+
+2. **Configure in .env.local**:
+```bash
+DISCORD_WEBHOOK_URL=https://discord.com/api/webhooks/123456789/abcdefghijklmnopqrstuvwxyz
+```
+
+3. **Test notification**: Same as Slack test above
+
+---
+
+## 🧪 Testing
+
+### Test Scheduler
+
+```bash
+# Check scheduler is running
+curl http://localhost:18001/scheduler/status
+
+# Expected response:
+{
+ "enabled": true,
+ "next_run_time": "2024-01-01T13:00:00Z",
+ ...
+}
+
+# Manually trigger evaluation
+curl -X POST "http://localhost:18001/scheduler/trigger"
+
+# Check logs
+docker logs llm-evaluator | grep "Batch evaluation"
+```
+
+### Test Notifications
+
+```bash
+# Generate low-quality response (adjust prompt for poor response)
+curl -X POST "http://localhost:18000/chat" \
+ -H "Content-Type: application/json" \
+ -d '{"prompt": "asdf", "user_id": "test"}'
+
+# Evaluate (should trigger notification if score is low)
+curl -X POST "http://localhost:18001/evaluate-once?limit=1"
+
+# Check Slack/Discord channel for notification
+```
+
+### Run CI Tests Locally
+
+```bash
+# Lint check
+pip install flake8
+flake8 services/
+
+# Build services
+docker build services/gateway-api
+docker build services/evaluator
+docker build services/dashboard
+
+# Run health check tests
+cd services/gateway-api
+pip install pytest httpx
+pytest tests/
+
+cd ../evaluator
+pytest tests/
+```
+
+---
+
+## 💡 Improvements
+
+- **Automation**: Eliminates need for manual evaluation triggers
+- **Proactive Monitoring**: Real-time alerts for quality issues
+- **Team Collaboration**: Slack/Discord integration keeps teams informed
+- **Code Quality**: Flake8 ensures consistent Python style
+- **Reliability**: CI/CD catches issues before deployment
+- **Scalability**: Batch processing handles high-volume scenarios
+- **Observability**: Scheduler status tracking and notification stats
+
+---
+
+## 🐛 Bug Fixes
+
+- Fixed database connection pool exhaustion during batch processing
+- Improved error handling for failed LLM API calls
+- Fixed race condition in concurrent evaluation requests
+- Enhanced graceful shutdown for scheduler cleanup
+- Fixed timezone handling in scheduler timestamps
+
+---
+
+## ⚠️ Known Limitations
+
+- **Scheduler Precision**: Interval-based, not cron-style scheduling
+- **Notification Rate Limits**: Slack/Discord have rate limits (respect webhook limits)
+- **Single Instance**: Scheduler designed for single evaluator instance
+- **No Notification Retry**: Failed notifications are logged but not retried
+- **Limited Test Coverage**: Only health checks, no integration tests
+
+---
+
+## 🛣️ Roadmap to v0.5.0
+
+- [ ] Prometheus metrics for monitoring
+- [ ] Grafana dashboards for visualization
+- [ ] Email notification support (SMTP)
+- [ ] Advanced alerting rules
+- [ ] Notification retry mechanism
+- [ ] Multi-instance scheduler coordination
+
+---
+
+## 📝 Technical Notes
+
+### APScheduler Configuration
+
+The scheduler uses `AsyncIOScheduler` for FastAPI compatibility:
+
+```python
+from contextlib import asynccontextmanager
+
+from apscheduler.schedulers.asyncio import AsyncIOScheduler
+from fastapi import FastAPI
+
+scheduler = AsyncIOScheduler()
+
+@asynccontextmanager
+async def lifespan(app: FastAPI):
+    # Startup: register and start the job only when auto-evaluation is enabled
+    if settings.enable_auto_evaluation:
+        scheduler.add_job(
+            run_batch_evaluation,
+            'interval',
+            minutes=settings.evaluation_interval_minutes,
+            id='batch_evaluation',
+            max_instances=1
+        )
+        scheduler.start()
+
+    yield
+
+    # Shutdown: stop the scheduler only if it was started
+    if scheduler.running:
+        scheduler.shutdown()
+
+app = FastAPI(lifespan=lifespan)
+```
+
+### Notification Implementation
+
+Uses `httpx` for async webhook calls:
+
+```python
+import httpx
+
+async def send_slack_notification(message: dict):
+    async with httpx.AsyncClient() as client:
+        response = await client.post(
+            settings.slack_webhook_url,
+            json=message,
+            timeout=10.0
+        )
+        return response.status_code == 200
+```
+
+### Batch Processing Logic
+
+```python
+from app.utils import get_pending_logs
+
+async def run_batch_evaluation():
+    # Get unevaluated logs
+    pending_logs = get_pending_logs(
+        limit=settings.evaluation_batch_size
+    )
+
+    results = []
+    for log in pending_logs:
+        try:
+            evaluation = await evaluate_log(
+                log,
+                judge_type=settings.evaluation_judge_type
+            )
+            results.append(evaluation)
+
+            # Send notification if low quality
+            if evaluation.score_overall <= settings.notification_score_threshold:
+                await send_low_quality_alert(log, evaluation)
+        except Exception as e:
+            logger.error(f"Evaluation failed for log {log.id}: {e}")
+            continue
+
+    # Send batch summary
+    await send_batch_summary(results)
+
+    return len(results)
+```
+
+---
+
+## 📚 Documentation Updates
+
+- Added scheduler configuration guide
+- Documented notification setup for Slack and Discord
+- Updated architecture diagrams with automation flow
+- Added CI/CD pipeline documentation
+- Created troubleshooting guide for common issues
+
+---
+
+## 🔒 Security Considerations
+
+- **Webhook URLs**: Store in environment variables, never commit to git
+- **Rate Limiting**: Consider implementing rate limits for notification webhooks
+- **Error Messages**: Ensure notifications don't expose sensitive data
+- **CI Secrets**: Use GitHub Secrets for sensitive CI/CD variables
+
+---
+
+## 🙏 Acknowledgments
+
+- APScheduler team for robust scheduling framework
+- Slack and Discord for webhook APIs
+- GitHub Actions for CI/CD platform
+- Community feedback on automation needs
+
+---
+
+**Previous Release**: [v0.3.0](./RELEASE_NOTES_v0.3.0.md)
+**Next Release**: v0.5.0 will introduce Prometheus/Grafana monitoring and email notifications
diff --git a/docs/release_notes/RELEASE_NOTES_v0.5.0.md b/docs/release_notes/RELEASE_NOTES_v0.5.0.md
new file mode 100644
index 0000000..e84c9ec
--- /dev/null
+++ b/docs/release_notes/RELEASE_NOTES_v0.5.0.md
@@ -0,0 +1,378 @@
+# Release Notes - v0.5.0
+
+**Release Date:** 2024-12-22
+**Focus:** Monitoring & Observability Enhancement
+
+## Overview
+
+v0.5.0 introduces comprehensive monitoring and observability capabilities through Prometheus metrics collection and Grafana dashboards, along with email notification support. This release enables real-time visibility into system performance, LLM quality metrics, and automated alerting across multiple channels.
+
+---
+
+## 🎯 Key Features
+
+### 1. Prometheus Metrics Integration
+
+Complete metrics instrumentation across all services:
+
+**Gateway API Metrics:**
+- HTTP request rate and latency (p50/p95/p99)
+- LLM API call rate and latency by model
+- Database query performance
+- Success/error rates
+
+**Evaluator Service Metrics:**
+- Evaluation rate and duration
+- Score distribution (overall, instruction-following, truthfulness)
+- Batch evaluation statistics
+- Scheduler health monitoring
+- Pending logs gauge
+
+**Notification Metrics:**
+- Notification delivery rates by channel (Slack, Discord, Email)
+- Low-quality alert frequency
+- Success/failure tracking
+
+### 2. Grafana Dashboards
+
+Pre-configured dashboard with 14 visualization panels:
+- **Overview Stats**: Request rate, evaluation rate, pending logs, notification rate
+- **HTTP Performance**: Request distribution, latency percentiles
+- **LLM Metrics**: Request rates by model, latency analysis
+- **Quality Scores**: Score distribution by judge type
+- **Notifications**: Delivery rates, alert tracking
+- **System Health**: Scheduler runs, batch processing stats
+
+### 3. Email Notification System
+
+SMTP-based email alerting:
+- Low-quality alerts with detailed evaluation context
+- Batch evaluation summaries
+- Multi-recipient support
+- HTML and plain text formatting
+- Integrated with existing Slack/Discord notifications
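A message with both plain-text and HTML bodies for several recipients can be built with the standard library alone. The sketch below assumes hypothetical helper names (`parse_recipients`, `build_alert_email`) and a made-up subject line; it only constructs the message and leaves sending to the SMTP client.

```python
from email.message import EmailMessage


def parse_recipients(raw):
    """Split a comma-separated SMTP_TO_EMAILS value into clean addresses."""
    return [addr.strip() for addr in raw.split(",") if addr.strip()]


def build_alert_email(sender, recipients, log_id, score):
    """Build a multipart/alternative message: plain text first, then HTML."""
    msg = EmailMessage()
    msg["Subject"] = f"Low Quality Alert: log {log_id}"
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    # Plain-text body for clients that do not render HTML
    msg.set_content(f"Log {log_id} scored {score:.1f}/5.0")
    # HTML alternative; mail clients pick the richest part they support
    msg.add_alternative(
        f"<p>Log <b>{log_id}</b> scored <b>{score:.1f}/5.0</b></p>",
        subtype="html",
    )
    return msg
```

The resulting `EmailMessage` can be handed to any SMTP client that accepts standard message objects.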
+
+---
+
+## 📦 What's New
+
+### New Services
+
+- **Prometheus** (port 9090): Metrics collection and time-series database
+- **Grafana** (host port 3001, container port 3000): Visualization and dashboarding platform
+
+### New Dependencies
+
+- `prometheus-client>=0.19.0` - Metrics instrumentation library
+- `aiosmtplib>=3.0` - Async SMTP client for email notifications
+- `email-validator>=2.0` - Email address validation
+
+### New Endpoints
+
+- `GET /metrics` (Gateway API) - Prometheus metrics endpoint
+- `GET /metrics` (Evaluator) - Prometheus metrics endpoint
+
+### New Configuration Options
+
+```bash
+# Email Notification Settings
+SMTP_HOST=smtp.gmail.com
+SMTP_PORT=587
+SMTP_USERNAME=your-email@gmail.com
+SMTP_PASSWORD=your-app-password
+SMTP_FROM_EMAIL=your-email@gmail.com
+SMTP_TO_EMAILS=recipient1@example.com,recipient2@example.com
+```
+
+---
+
+## 🚀 Getting Started
+
+### Starting the Full Stack
+
+```bash
+cd infra/docker
+docker compose -f docker-compose.local.yml up --build
+```
+
+This will start:
+- Gateway API (port 18000)
+- Evaluator Service (port 18001)
+- Dashboard Service (port 8501)
+- Prometheus (port 9090)
+- Grafana (port 3001)
+- PostgreSQL (port 5432)
+
+### Accessing Monitoring Tools
+
+**Prometheus:**
+```bash
+# Access Prometheus UI
+http://localhost:9090
+
+# View all metrics
+http://localhost:9090/graph
+
+# Check targets
+http://localhost:9090/targets
+```
+
+**Grafana:**
+```bash
+# Access Grafana dashboard
+http://localhost:3001
+
+# Default credentials
+Username: admin
+Password: admin
+```
+
+The LLM Quality Observer dashboard will be automatically provisioned and available under "Dashboards".
+
+### Viewing Metrics
+
+**Gateway API Metrics:**
+```bash
+curl http://localhost:18000/metrics
+```
+
+**Evaluator Metrics:**
+```bash
+curl http://localhost:18001/metrics
+```
+
+### Configuring Email Notifications
+
+1. Update your `.env.local` file:
+```bash
+SMTP_HOST=smtp.gmail.com
+SMTP_PORT=587
+SMTP_USERNAME=your-email@gmail.com
+SMTP_PASSWORD=your-app-password
+SMTP_FROM_EMAIL=your-email@gmail.com
+SMTP_TO_EMAILS=team@example.com,oncall@example.com
+```
+
+2. For Gmail, create an App Password:
+ - Go to Google Account Settings → Security → 2-Step Verification → App Passwords
+ - Generate a new app password for "Mail"
+ - Use this password in `SMTP_PASSWORD`
+
+3. Restart the evaluator service:
+```bash
+docker compose -f docker-compose.local.yml restart evaluator
+```
+
+---
+
+## 📊 Metrics Guide
+
+### Key Metrics to Monitor
+
+**Performance:**
+- `llm_gateway_http_request_duration_seconds` - API response time
+- `llm_gateway_llm_request_duration_seconds` - LLM call latency
+- `llm_evaluator_evaluation_duration_seconds` - Evaluation processing time
+
+**Quality:**
+- `llm_evaluator_evaluation_scores` - Score distribution
+- `llm_evaluator_low_quality_alerts_total` - Alert frequency
+- `llm_evaluator_evaluations_total{status="error"}` - Evaluation failures
+
+**System Health:**
+- `llm_evaluator_pending_logs` - Evaluation backlog
+- `llm_evaluator_scheduler_runs_total` - Scheduler execution rate
+- `llm_evaluator_notifications_sent_total` - Notification delivery
+
+### Example PromQL Queries
+
+**Average LLM latency by model (last 5 minutes):**
+```promql
+sum by (model) (rate(llm_gateway_llm_request_duration_seconds_sum[5m])) /
+sum by (model) (rate(llm_gateway_llm_request_duration_seconds_count[5m]))
+```
+
+**Error rate percentage:**
+```promql
+(sum(rate(llm_gateway_http_requests_total{status=~"5.."}[5m])) /
+ sum(rate(llm_gateway_http_requests_total[5m]))) * 100
+```
+
+**Evaluation backlog trend:**
+```promql
+llm_evaluator_pending_logs
+```
+
+---
+
+## 🔄 Upgrade Guide
+
+### From v0.4.0 to v0.5.0
+
+1. **Update dependencies:**
+```bash
+# Gateway API
+cd services/gateway-api
+uv sync
+
+# Evaluator
+cd ../evaluator
+uv sync
+```
+
+2. **Update environment configuration:**
+```bash
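+# Compare the example with your existing config and add the new options by hand
+# (copying the example over .env.local would overwrite your current settings)
+diff configs/env/.env.local.example configs/env/.env.local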
+
+# Add email settings if needed (optional)
+# SMTP_HOST, SMTP_PORT, SMTP_USERNAME, etc.
+```
+
+3. **Update Docker Compose:**
+```bash
+# The docker-compose.local.yml now includes Prometheus and Grafana
+# No manual changes needed - just restart
+cd infra/docker
+docker compose -f docker-compose.local.yml down
+docker compose -f docker-compose.local.yml up --build
+```
+
+4. **Verify metrics collection:**
+```bash
+# Check Gateway API metrics
+curl http://localhost:18000/metrics | grep llm_gateway
+
+# Check Evaluator metrics
+curl http://localhost:18001/metrics | grep llm_evaluator
+
+# Check Prometheus targets
+curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
+```
+
+5. **Access Grafana dashboard:**
+   - Navigate to http://localhost:3001
+ - Login with admin/admin
+ - Go to Dashboards → LLM Quality Observer
+
+### Breaking Changes
+
+None. This release is fully backward compatible with v0.4.0.
+
+### New Environment Variables (Optional)
+
+```bash
+# Email notifications (all optional)
+SMTP_HOST=smtp.gmail.com
+SMTP_PORT=587
+SMTP_USERNAME=your-email@gmail.com
+SMTP_PASSWORD=your-app-password
+SMTP_FROM_EMAIL=your-email@gmail.com
+SMTP_TO_EMAILS=recipient@example.com
+```
+
+---
+
+## 📁 File Structure Changes
+
+### New Files
+
+```
+infra/
+├── prometheus/
+│   └── prometheus.yml                    # Prometheus scrape configuration
+├── grafana/
+│   ├── provisioning/
+│   │   ├── datasources/
+│   │   │   └── prometheus.yml            # Grafana datasource config
+│   │   └── dashboards/
+│   │       └── default.yml               # Dashboard provisioning config
+│   └── dashboards/
+│       └── llm-quality-observer.json     # Main dashboard definition
+
+services/
+├── gateway-api/
+│   └── app/
+│       └── metrics.py                    # Gateway metrics definitions
+└── evaluator/
+    └── app/
+        └── metrics.py                    # Evaluator metrics definitions
+```
+
+### Modified Files
+
+```
+services/
+├── gateway-api/
+│   ├── pyproject.toml                    # Added prometheus-client
+│   └── app/
+│       └── main.py                       # Added /metrics endpoint
+├── evaluator/
+│   ├── pyproject.toml                    # Added prometheus-client, aiosmtplib
+│   └── app/
+│       ├── config.py                     # Added SMTP config
+│       ├── main.py                       # Added /metrics endpoint
+│       ├── scheduler.py                  # Added metrics recording
+│       └── notifier.py                   # Added email notification
+
+infra/docker/
+└── docker-compose.local.yml              # Added Prometheus and Grafana services
+
+configs/env/
+└── .env.local.example                    # Added SMTP configuration
+```
+
+---
+
+## 🐛 Bug Fixes
+
+- None in this release (feature-focused)
+
+---
+
+## 🔒 Security Notes
+
+- Email passwords should use app-specific passwords, not account passwords
+- SMTP credentials are stored in environment variables (not committed to git)
+- Grafana default password should be changed in production
+- Prometheus metrics do not expose sensitive data (no API keys, passwords, or user data)
+
+---
+
+## 📚 Documentation
+
+- [Prometheus Setup Guide](../../infra/prometheus/README.md)
+- [Grafana Dashboard Guide](../../infra/grafana/README.md)
+- [Email Notification Setup](../EMAIL_SETUP.md)
+- [Metrics Reference](../METRICS.md)
+
+---
+
+## 🎯 Next Steps (v0.6.0 Preview)
+
+Potential features for next release:
+- Advanced alerting rules in Prometheus
+- Custom dashboard templates
+- Metric retention policies
+- Performance optimization dashboard
+- A/B testing analytics
+- Advanced filtering and search UI
+
+---
+
+## 🤝 Contributors
+
+- Claude Sonnet 4.5 (Implementation)
+- sdhcokr (Project Lead)
+
+---
+
+## 📞 Support
+
+For issues or questions:
+- GitHub Issues: https://github.com/your-org/llm-quality-observer/issues
+- Documentation: https://github.com/your-org/llm-quality-observer/docs
+
+---
+
+**Full Changelog:** v0.4.0...v0.5.0
diff --git a/images/dashboard-overview.png b/images/dashboard-overview.png
new file mode 100644
index 0000000..812663d
Binary files /dev/null and b/images/dashboard-overview.png differ
diff --git a/infra/docker/docker-compose.local.yml b/infra/docker/docker-compose.local.yml
index 82f788e..8362b2d 100644
--- a/infra/docker/docker-compose.local.yml
+++ b/infra/docker/docker-compose.local.yml
@@ -1,4 +1,5 @@
version: "3.9"
+name: llm-quality-observer
services:
postgres:
@@ -51,5 +52,41 @@ services:
ports:
- "18002:8000"
+  prometheus:
+    image: prom/prometheus:latest
+    container_name: llm-prometheus
+    command:
+      - '--config.file=/etc/prometheus/prometheus.yml'
+      - '--storage.tsdb.path=/prometheus'
+      - '--web.console.libraries=/etc/prometheus/console_libraries'
+      - '--web.console.templates=/etc/prometheus/consoles'
+      - '--web.enable-lifecycle'
+    volumes:
+      - ../prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
+      - prometheus_data:/prometheus
+    ports:
+      - "9090:9090"
+    depends_on:
+      - gateway-api
+      - evaluator
+
+  grafana:
+    image: grafana/grafana:latest
+    container_name: llm-grafana
+    environment:
+      - GF_SECURITY_ADMIN_PASSWORD=admin
+      - GF_SECURITY_ADMIN_USER=admin
+      - GF_USERS_ALLOW_SIGN_UP=false
+    volumes:
+      - grafana_data:/var/lib/grafana
+      - ../grafana/provisioning:/etc/grafana/provisioning
+      - ../grafana/dashboards:/var/lib/grafana/dashboards
+    ports:
+      - "3001:3000"
+    depends_on:
+      - prometheus
+
volumes:
postgres_data:
+  prometheus_data:
+  grafana_data:
diff --git a/infra/grafana/DASHBOARD_GUIDE-ko.md b/infra/grafana/DASHBOARD_GUIDE-ko.md
new file mode 100644
index 0000000..0558a80
--- /dev/null
+++ b/infra/grafana/DASHBOARD_GUIDE-ko.md
@@ -0,0 +1,506 @@
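+# Grafana Dashboard Guide
+
+## LLM Quality Observer Grafana Dashboard Overview
+
+![Grafana Dashboard Overview](../../images/dashboard-overview.png)
+
+The LLM Quality Observer Grafana dashboard consists of 14 visualization panels for monitoring system performance, quality metrics, and notification activity in real time.
+
+**Accessing the dashboard:**
+- URL: http://localhost:3001
+- Default login: admin / admin
+- Path: Dashboards → LLM Quality Observer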
+
+---
+
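+## Dashboard Structure
+
+### 1️⃣ Overview Stats
+
+The four stat panels at the top show the current state of the system at a glance.
+
+#### HTTP Request Rate (req/s)
+- **Purpose**: Per-second rate of HTTP requests arriving at the Gateway API
+- **PromQL**: `sum(rate(llm_gateway_http_requests_total[5m]))`
+- **Meaning**:
+  - Shows how much traffic is entering the system
+  - Computed as the average request rate over the last 5 minutes
+- **Normal range**: Depends on usage patterns (low in test environments)
+- **Cause of "No Data"**: No requests have reached the Gateway API yet
+- **Fix**: Send a test request to the `/chat` endpoint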
+
+```bash
+curl -X POST "http://localhost:18000/chat" \
+ -H "Content-Type: application/json" \
+ -d '{"prompt": "Hello", "user_id": "test"}'
+```
+
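+#### Evaluation Rate (eval/s)
+- **Purpose**: Per-second rate of evaluations performed by the Evaluator service
+- **PromQL**: `sum(rate(llm_evaluator_evaluations_total[5m]))`
+- **Meaning**:
+  - Measures how quickly logs are being evaluated
+  - Includes both scheduled batch evaluations and manual evaluations
+- **Normal range**: Depends on the scheduler settings
+- **Cause of "No Data"**: No evaluation has run yet
+- **Fix**: Trigger an evaluation manually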
+
+```bash
+curl -X POST "http://localhost:18001/evaluate-once?limit=5"
+```
+
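+#### Pending Logs
+- **Purpose**: Number of logs that have not been evaluated yet
+- **PromQL**: `llm_evaluator_pending_logs`
+- **Meaning**:
+  - Count of logs waiting for evaluation (a Gauge metric)
+  - A steadily rising value means evaluation is not keeping up with incoming logs
+- **Normal range**:
+  - Close to 0 (all logs evaluated)
+  - Drops periodically while the scheduler is enabled
+- **Cause of "No Data"**: The Evaluator service has not updated the metric yet
+- **Fix**: Run one evaluation to create the metric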
+
+#### Notification Rate (notif/s)
+- **목적**: 초당 전송되는 알림 비율 (Slack, Discord, Email)
+- **PromQL**: `sum(rate(llm_evaluator_notifications_sent_total[5m]))`
+- **의미**:
+ - 품질 문제나 배치 완료 알림이 얼마나 자주 발생하는지
+ - 채널별로 분리되지 않은 전체 알림 수
+- **정상값**: 낮은 품질 로그가 있을 때만 증가
+- **No Data 원인**:
+ - 알림이 전송된 적 없음
+ - `NOTIFICATION_SCORE_THRESHOLD` 이하의 점수가 없음
+- **해결**: 낮은 품질 응답을 생성하여 알림 트리거
+
+---
+
+### 2️⃣ HTTP & LLM Performance
+
+Monitor API response performance and LLM call latency.
+
+#### HTTP Requests by Endpoint
+- **Purpose**: Time series of request rate by endpoint and status
+- **PromQL**: `sum(rate(llm_gateway_http_requests_total[1m])) by (endpoint, status)`
+- **Meaning**:
+ - Shows which endpoints are used most
+ - Separated by `/chat`, `/health`, `/metrics`, etc., with one series per status
+- **Visualization**: Line chart over time
+- **Use Cases**:
+ - Detect traffic increases on a specific endpoint
+ - Spot abnormal endpoint call patterns
+
+#### HTTP Request Latency (p50/p95/p99)
+- **Purpose**: HTTP request response time percentiles per endpoint
+- **PromQL**:
+ ```promql
+ histogram_quantile(0.50, sum(rate(llm_gateway_http_request_duration_seconds_bucket[5m])) by (le, endpoint)) # p50
+ histogram_quantile(0.95, sum(rate(llm_gateway_http_request_duration_seconds_bucket[5m])) by (le, endpoint)) # p95
+ histogram_quantile(0.99, sum(rate(llm_gateway_http_request_duration_seconds_bucket[5m])) by (le, endpoint)) # p99
+ ```
+- **Meaning**:
+ - **p50 (median)**: 50% of requests complete within this time
+ - **p95**: 95% of requests complete within this time
+ - **p99**: 99% of requests complete within this time
+- **Normal Values**:
+ - p50: 1-2 seconds (including LLM response time)
+ - p95: 3-5 seconds
+ - p99: 5-10 seconds
+- **Warning**: p99 above 10 seconds may indicate a performance problem
+- **No Data Cause**: Not enough data in the histogram buckets
+
+#### LLM Requests by Model
+- **Purpose**: LLM API call rate by model
+- **PromQL**: `sum(rate(llm_gateway_llm_requests_total[5m])) by (model)`
+- **Meaning**:
+ - Shows which LLM models are used most
+ - Separated by `gpt-5-mini`, `gpt-4o-mini`, etc.
+- **Use Cases**:
+ - Track usage by model
+ - Analyze model selection for cost optimization
+
+#### LLM Request Latency by Model (p50/p95/p99)
+- **Purpose**: LLM API call latency percentiles by model
+- **PromQL**:
+ ```promql
+ histogram_quantile(0.50, sum(rate(llm_gateway_llm_request_duration_seconds_bucket[5m])) by (le, model))
+ ```
+- **Meaning**:
+ - Compare response speed across LLM models
+ - Identify performance differences between models
+- **Use Cases**:
+ - Identify slow models
+ - Monitor SLA compliance
+- **Note**:
+ - gpt-5-mini is expected to be faster than gpt-4; use this panel to verify it for your workload
+ - The Judge model (gpt-4o-mini) runs in the Evaluator service, so its calls are not tracked in this Gateway panel
+
+---
+
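+### 3️⃣ Quality & Notifications
+
+Track LLM response quality and notification delivery status.
+
+#### Evaluations by Judge Type
+- **Purpose**: Evaluation rate by evaluation method
+- **PromQL**: `sum(rate(llm_evaluator_evaluations_total[5m])) by (judge_type)`
+- **Meaning**:
+ - Ratio of `rule` (rule-based) to `llm` (LLM-as-a-Judge) evaluations
+ - Shows which evaluation method is used more
+- **Configuration**: Controlled by `EVALUATION_JUDGE_TYPE` in `.env.local`
+ - `rule`: Fast and cheap, applies simple rules
+ - `llm`: Slower and incurs API cost, evaluates more complex quality aspects
+- **Use Case**: Analyze the balance between cost and accuracy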
+
+#### Overall Score Distribution (p50/p95)
+- **Purpose**: Median and 95th percentile of evaluation scores
+- **PromQL**:
+ ```promql
+ histogram_quantile(0.50, sum(rate(llm_evaluator_evaluation_scores_bucket{score_type="overall"}[5m])) by (le)) # p50
+ histogram_quantile(0.95, sum(rate(llm_evaluator_evaluation_scores_bucket{score_type="overall"}[5m])) by (le)) # p95
+ ```
+- **Meaning**:
+ - **p50**: Median evaluation score (quality of a typical response)
+ - **p95**: 95% of evaluations score at or below this value, so it reflects the upper end of the score range
+- **Score Range**: 1-5 points
+ - 1-2 points: Critical (severe quality issues)
+ - 3 points: Warning (needs improvement)
+ - 4-5 points: Good
+- **Goal**: Keep p50 at 4 or above
+- **No Data Cause**:
+ - Evaluation scores are not being recorded in the histogram
+ - Not enough evaluations have run
+
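+#### Notifications by Channel
+- **Purpose**: Time series of notification delivery rate by channel and status
+- **PromQL**: `sum(rate(llm_evaluator_notifications_sent_total[5m])) by (channel, status)`
+- **Meaning**:
+ - Tracks delivery success and failure for Slack, Discord, and Email separately
+ - Verifies the health of the notification infrastructure
+- **Channels**:
+ - `slack`: Slack webhook
+ - `discord`: Discord webhook
+ - `email`: SMTP email
+- **Status**:
+ - `success`: Delivery succeeded
+ - `error`: Delivery failed
+- **Warning**: If one channel shows a high error rate, check that channel's configuration
+
+#### Low Quality Alerts
+- **Purpose**: Frequency of low-quality alerts
+- **PromQL**: `sum(rate(llm_evaluator_low_quality_alerts_total[5m])) by (judge_type)`
+- **Meaning**:
+ - How often scores at or below `NOTIFICATION_SCORE_THRESHOLD` occur
+ - Monitors the severity of quality problems
+- **Goal**: Lower is better
+- **Use Cases**:
+ - Early detection of quality degradation trends
+ - Measure the effect of prompt or model changes
+- **No Data Cause**: No low-quality responses (a good sign!)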
+
+---
+
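+### 4️⃣ System Health
+
+Check that the scheduler and the batch evaluation system are operating.
+
+#### Scheduler Runs
+- **Purpose**: Execution rate of the automated evaluation scheduler
+- **PromQL**: `sum(rate(llm_evaluator_scheduler_runs_total[5m]))`
+- **Meaning**:
+ - Verifies that the scheduler is operating normally
+ - Monitors whether it runs at the configured interval
+- **Configuration**: `EVALUATION_INTERVAL_MINUTES` in `.env.local`
+ - Default: 60 minutes (runs every hour)
+- **Normal Value**:
+ - For a 60-minute interval: 1 run/hour = 1/3600 ≈ 0.000278 runs/s averaged over time
+ - With the `rate()` query shown above, expect a brief bump after each run and 0 in between; the raw counter, by contrast, rises in steps
+- **Warning**: No new run for longer than the configured interval suggests the scheduler has stopped
+- **Verification**:
+ ```bash
+ docker logs llm-evaluator 2>&1 | grep -i "scheduler"
+ ```
+
+#### Batch Evaluation - Logs Processed
+- **Purpose**: Cumulative count of logs processed by batch evaluation
+- **PromQL**: `sum(llm_evaluator_batch_logs_processed_total)`
+- **Meaning**:
+ - Tracks how many logs the scheduler has evaluated in total
+ - Shows system throughput
+- **Visualization**: Cumulative graph (keeps increasing)
+- **Configuration**: `EVALUATION_BATCH_SIZE` in `.env.local`
+ - Default: 10 (processes 10 logs per run)
+- **Use Cases**:
+ - Tune the batch size
+ - Analyze processing speed trends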
+
+---
+
+## Troubleshooting No Data
+
+If "No Data" appears in the dashboard, follow these steps:
+
+### 1. Check Prometheus Targets
+
+```bash
+# Check target status in the Prometheus UI (open in a browser):
+#   http://localhost:9090/targets
+
+# Or check via the API
+curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
+```
+
+**Expected Result:**
+- `gateway-api`: health = "up"
+- `evaluator`: health = "up"
+
+### 2. Check Metrics Endpoints
+
+```bash
+# Check Gateway API metrics
+curl http://localhost:18000/metrics | grep llm_gateway
+
+# Check Evaluator metrics
+curl http://localhost:18001/metrics | grep llm_evaluator
+```
+
+If metrics are missing, restart the services:
+```bash
+docker compose -f docker-compose.local.yml restart gateway-api evaluator
+```
+
+### 3. Generate Data
+
+Some metrics only produce data after there has been activity:
+
+```bash
+# 1. Send a request to the Gateway API
+curl -X POST "http://localhost:18000/chat" \
+ -H "Content-Type: application/json" \
+ -d '{"prompt": "Test prompt", "user_id": "test-user"}'
+
+# 2. Run an evaluation
+curl -X POST "http://localhost:18001/evaluate-once?limit=5"
+
+# 3. Wait 5-10 minutes, then refresh Grafana
+```
+
+### 4. Adjust Time Range
+
+Adjust the time range in the top-right corner of the Grafana dashboard:
+- Default: Last 1 hour
+- If there is no data: Change to Last 6 hours or Last 24 hours
+
+### 5. Test PromQL Queries
+
+Run queries directly in the Prometheus UI:
+
+```promql
+# Prometheus Graph page: http://localhost:9090/graph
+
+# Example queries (run one at a time):
+llm_gateway_http_requests_total
+llm_evaluator_evaluations_total
+llm_evaluator_pending_logs
+```
+
+---
+
+## PromQL Query Explanations
+
+### rate() Function
+```promql
+rate(llm_gateway_http_requests_total[5m])
+```
+- **Meaning**: Per-second rate of increase over the last 5 minutes
+- **Usage**: Converts Counter metrics into rates
+- **Unit**: Per second
+
+### histogram_quantile() Function
+```promql
+histogram_quantile(0.95, sum(rate(llm_gateway_http_request_duration_seconds_bucket[5m])) by (le))
+```
+- **Meaning**: Calculates the 95th percentile from a histogram
+- **le**: less than or equal (bucket upper bound)
+- **Usage**: Latency and score distribution analysis
+
+### sum() by (label)
+```promql
+sum(rate(llm_gateway_http_requests_total[5m])) by (endpoint)
+```
+- **Meaning**: Calculates the sum grouped by label
+- **Usage**: Splitting by endpoint, model, or channel
+
+---
+
+## Usage Tips
+
+### 1. Detect Performance Issues
+- **HTTP Request Latency p99** > 10 seconds: Performance degradation
+- **LLM Request Latency p95** > 5 seconds: LLM API delay
+
+Response:
+```bash
+# Check slow request logs (2>&1 includes output written to stderr)
+docker logs llm-gateway-api 2>&1 | grep -i "latency"
+
+# Check database query performance (a db_query_duration metric could be added)
+```
+
+### 2. Track Quality Degradation
+- **Overall Score p50** < 3: Quality issues occurring
+- **Low Quality Alerts** spike: Investigate immediately
+
+Response:
+```bash
+# Check recent low-score logs
+docker exec -it llm-postgres psql -U llm_user -d llm_quality -c \
+ "SELECT l.id, l.prompt, l.response, e.overall_score
+ FROM llm_logs l
+ JOIN llm_evaluations e ON l.id = e.log_id
+ WHERE e.overall_score <= 3
+ ORDER BY l.created_at DESC
+ LIMIT 5;"
+```
+
+### 3. Notification System Health
+- **Notifications by Channel** - error ratio > 10%: Check configuration
+
+Response:
+```bash
+# Check notification errors in the Evaluator logs
+docker logs llm-evaluator 2>&1 | grep -i "notification.*fail"
+
+# Check SMTP-related log lines
+docker logs llm-evaluator 2>&1 | grep -i "smtp"
+```
+
+### 4. Monitor Scheduler
+- **Scheduler Runs** stops increasing: Scheduler stopped
+
+Response:
+```bash
+# Check scheduler logs (case-insensitive, so this also matches APScheduler)
+docker logs llm-evaluator 2>&1 | grep -i "scheduler"
+
+# Restart the Evaluator
+docker compose -f docker-compose.local.yml restart evaluator
+```
+
+### 5. Pending Logs Accumulation
+- **Pending Logs** keeps increasing: Evaluation speed < log creation speed
+
+Response:
+```bash
+# In .env.local: increase the batch size (from 10)
+EVALUATION_BATCH_SIZE=20
+
+# In .env.local: shorten the evaluation interval (from 60)
+EVALUATION_INTERVAL_MINUTES=30
+
+# Recreate the container so it picks up the new values
+# (`restart` alone does not reload environment changes)
+docker compose -f docker-compose.local.yml up -d --force-recreate evaluator
+```
+
+---
+
+## Dashboard Customization
+
+### Adding Panels
+
+1. Click "Add panel" in the Grafana UI
+2. Enter a PromQL query
+3. Select a visualization type (Stat, Time series, Gauge, etc.)
+4. Save
+
+### Useful Additional Panel Examples
+
+#### Error Rate Percentage
+```promql
+(sum(rate(llm_gateway_http_requests_total{status=~"5.."}[5m])) /
+ sum(rate(llm_gateway_http_requests_total[5m]))) * 100
+```
+
+#### Score Comparison by Evaluation Type
+```promql
+histogram_quantile(0.50, sum(rate(llm_evaluator_evaluation_scores_bucket{score_type="instruction_following"}[5m])) by (le))
+```
+
+#### Error Rate by Model
+```promql
+sum(rate(llm_gateway_llm_requests_total{status="error"}[5m])) by (model)
+```
+
+---
+
+## Alert Rules (Planned)
+
+Prometheus Alertmanager integration is planned for v0.6.0:
+
+```yaml
+# Example: High HTTP error rate alert
+- alert: HighHTTPErrorRate
+ expr: |
+ (sum(rate(llm_gateway_http_requests_total{status=~"5.."}[5m])) /
+ sum(rate(llm_gateway_http_requests_total[5m]))) > 0.05
+ for: 5m
+ labels:
+ severity: warning
+ annotations:
+ summary: "HTTP error rate exceeds 5%"
+```
+
+---
+
+## Troubleshooting
+
+### Grafana Cannot Connect to Prometheus
+
+**Symptom**: "Post http://localhost:9090/api/v1/query_range: connection refused"
+
+**Solution**:
+1. Check the Datasource configuration
+ - Settings → Data Sources → Prometheus
+ - Verify that the URL is `http://prometheus:9090` (not localhost!)
+
+2. Check the Prometheus container status
+ ```bash
+ docker ps | grep prometheus
+ docker logs llm-prometheus
+ ```
+
+3. Check the Docker network
+ ```bash
+ docker network inspect docker_default
+ # The network name depends on the Compose project name; list candidates with `docker network ls`
+ # gateway-api, evaluator, prometheus, and grafana must all be on the same network
+ ```
+
+### Dashboard Not Auto-Loading
+
+**Solution**:
+```bash
+# Check Grafana provisioning logs
+docker logs llm-grafana 2>&1 | grep -i provision
+
+# Check permissions (run from the repository root)
+ls -la infra/grafana/
+
+# Fix permissions if needed
+chmod -R 755 infra/grafana/
+```
+
+---
+
+## References
+
+- [Prometheus Configuration Guide](../prometheus/README.md)
+- [Metrics Reference](../../docs/METRICS.md)
+- [Release Notes v0.5.0](../../docs/RELEASE_NOTES_v0.5.0_ko.md)
+- [Grafana Official Documentation](https://grafana.com/docs/)
+- [PromQL Query Guide](https://prometheus.io/docs/prometheus/latest/querying/basics/)
+
+---
+
+**Last Updated**: 2025-12-23
+**Dashboard Version**: v0.5.0
diff --git a/infra/grafana/DASHBOARD_GUIDE-us.md b/infra/grafana/DASHBOARD_GUIDE-us.md
new file mode 100644
index 0000000..95bccc1
--- /dev/null
+++ b/infra/grafana/DASHBOARD_GUIDE-us.md
@@ -0,0 +1,506 @@
+# Grafana Dashboard Guide
+
+## LLM Quality Observer Grafana Dashboard Overview
+
+ 
+
+The LLM Quality Observer Grafana dashboard consists of 14 visualization panels that provide real-time monitoring of system performance, quality metrics, and notification status.
+
+**Dashboard Access:**
+- URL: http://localhost:3001
+- Default Login: admin / admin
+- Path: Dashboards → LLM Quality Observer
+
+---
+
+## Dashboard Structure
+
+### 1️⃣ Overview Stats
+
+The top 4 stat panels provide a quick snapshot of the current system state.
+
+#### HTTP Request Rate (req/s)
+- **Purpose**: HTTP requests per second received by Gateway API
+- **PromQL**: `sum(rate(llm_gateway_http_requests_total[5m]))`
+- **Meaning**:
+ - Measures incoming traffic volume
+ - Calculates average request rate over the last 5 minutes
+- **Normal Value**: Varies by usage pattern (low in test environments)
+- **No Data Cause**: No requests to Gateway API yet
+- **Solution**: Send test request to `/chat` endpoint
+
+```bash
+curl -X POST "http://localhost:18000/chat" \
+ -H "Content-Type: application/json" \
+ -d '{"prompt": "Hello", "user_id": "test"}'
+```
+
+#### Evaluation Rate (eval/s)
+- **Purpose**: Evaluations per second performed by Evaluator service
+- **PromQL**: `sum(rate(llm_evaluator_evaluations_total[5m]))`
+- **Meaning**:
+ - Measures how fast logs are being evaluated
+ - Includes both scheduled batch evaluations and manual evaluations
+- **Normal Value**: Depends on scheduler configuration
+- **No Data Cause**: No evaluations executed yet
+- **Solution**: Trigger manual evaluation
+
+```bash
+curl -X POST "http://localhost:18001/evaluate-once?limit=5"
+```
+
+#### Pending Logs
+- **Purpose**: Number of logs not yet evaluated
+- **PromQL**: `llm_evaluator_pending_logs`
+- **Meaning**:
+ - Current count of logs waiting for evaluation (Gauge metric)
+ - Continuously increasing value indicates slow evaluation speed
+- **Normal Value**:
+ - Close to 0 (all logs evaluated)
+ - Periodically decreases when scheduler is active
+- **No Data Cause**: Evaluator service hasn't updated metrics yet
+- **Solution**: Running evaluation once will create the metric
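+
+To tell a growing backlog apart from normal fluctuation, the gauge can be queried for its trend. This is only a sketch, assuming `llm_evaluator_pending_logs` is scraped regularly; the 30-minute window and one-hour projection are illustrative choices, and 50 mirrors the red threshold configured on the Pending Logs stat panel.
+
+```promql
+# Per-second slope of the backlog; positive means logs arrive faster than they are evaluated
+deriv(llm_evaluator_pending_logs[30m])
+
+# Backlog projected one hour (3600 s) ahead; returns a series only if it would exceed 50
+predict_linear(llm_evaluator_pending_logs[30m], 3600) > 50
+```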
+
+#### Notification Rate (notif/s)
+- **Purpose**: Notifications delivered successfully per second (Slack, Discord, Email)
+- **PromQL**: `sum(rate(llm_evaluator_notifications_sent_total{status="success"}[5m]))`
+- **Meaning**:
+ - How frequently quality issues or batch completion notifications occur
+ - Total count across all channels; failed deliveries are excluded by the `status="success"` filter
+- **Normal Value**: Only increases when low-quality logs exist
+- **No Data Cause**:
+ - No notifications sent yet
+ - No scores below `NOTIFICATION_SCORE_THRESHOLD`
+- **Solution**: Generate low-quality responses to trigger notifications
+
+---
+
+### 2️⃣ HTTP & LLM Performance
+
+Monitor API response performance and LLM call latency.
+
+#### HTTP Requests by Endpoint
+- **Purpose**: Time series of request rate by endpoint and status
+- **PromQL**: `sum(rate(llm_gateway_http_requests_total[1m])) by (endpoint, status)`
+- **Meaning**:
+ - Identifies which endpoints are most frequently used
+ - Separated by `/chat`, `/health`, `/metrics`, etc., with one series per status
+- **Visualization**: Line chart over time
+- **Use Cases**:
+ - Detect traffic increases to specific endpoints
+ - Identify abnormal endpoint call patterns
+
+#### HTTP Request Latency (p50/p95/p99)
+- **Purpose**: HTTP request response time percentiles per endpoint
+- **PromQL**:
+ ```promql
+ histogram_quantile(0.50, sum(rate(llm_gateway_http_request_duration_seconds_bucket[5m])) by (le, endpoint)) # p50
+ histogram_quantile(0.95, sum(rate(llm_gateway_http_request_duration_seconds_bucket[5m])) by (le, endpoint)) # p95
+ histogram_quantile(0.99, sum(rate(llm_gateway_http_request_duration_seconds_bucket[5m])) by (le, endpoint)) # p99
+ ```
+- **Meaning**:
+ - **p50 (median)**: 50% of requests complete within this time
+ - **p95**: 95% of requests complete within this time
+ - **p99**: 99% of requests complete within this time
+- **Normal Values**:
+ - p50: 1-2 seconds (including LLM response time)
+ - p95: 3-5 seconds
+ - p99: 5-10 seconds
+- **Warning**: p99 > 10 seconds indicates potential performance issues
+- **No Data Cause**: Insufficient data in histogram buckets
+
+#### LLM Requests by Model
+- **Purpose**: LLM API call rate by model
+- **PromQL**: `sum(rate(llm_gateway_llm_requests_total[5m])) by (model)`
+- **Meaning**:
+ - Identifies which LLM models are used most frequently
+ - Separated by `gpt-5-mini`, `gpt-4o-mini`, etc.
+- **Use Cases**:
+ - Track usage by model
+ - Analyze model selection for cost optimization
+
+#### LLM Request Latency by Model (p50/p95/p99)
+- **Purpose**: LLM API call latency percentiles by model
+- **PromQL**:
+ ```promql
+ histogram_quantile(0.50, sum(rate(llm_gateway_llm_request_duration_seconds_bucket[5m])) by (le, model))
+ ```
+- **Meaning**:
+ - Compare response speed across LLM models
+ - Identify performance differences between models
+- **Use Cases**:
+ - Identify slow models
+ - Monitor SLA compliance
+- **Note**:
+ - gpt-5-mini is expected to be faster than gpt-4; use this panel to verify it for your workload
+ - The Judge model (gpt-4o-mini) runs in the Evaluator service, so its calls are not tracked in this Gateway panel
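+
+The panel title lists p50/p95/p99, but only the p50 query is shown above. Following the same pattern, the other two percentiles would presumably look like this (a sketch; the exact expressions in the dashboard JSON may differ):
+
+```promql
+# p95 latency per model
+histogram_quantile(0.95, sum(rate(llm_gateway_llm_request_duration_seconds_bucket[5m])) by (le, model))
+
+# p99 latency per model
+histogram_quantile(0.99, sum(rate(llm_gateway_llm_request_duration_seconds_bucket[5m])) by (le, model))
+```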
+
+---
+
+### 3️⃣ Quality & Notifications
+
+Track LLM response quality and notification delivery status.
+
+#### Evaluations by Judge Type
+- **Purpose**: Evaluation execution rate by evaluation method
+- **PromQL**: `sum(rate(llm_evaluator_evaluations_total[5m])) by (judge_type)`
+- **Meaning**:
+ - Ratio of `rule` (rule-based) vs `llm` (LLM-as-a-Judge)
+ - Identifies which evaluation method is used more frequently
+- **Configuration**: Controlled by `EVALUATION_JUDGE_TYPE` in `.env.local`
+ - `rule`: Fast and cheap, applies simple rules
+ - `llm`: Slower and costs money, complex quality evaluation
+- **Use Case**: Analyze cost vs accuracy tradeoff
+
+#### Overall Score Distribution (p50/p95)
+- **Purpose**: Median and 95th percentile of evaluation scores
+- **PromQL**:
+ ```promql
+ histogram_quantile(0.50, sum(rate(llm_evaluator_evaluation_scores_bucket{score_type="overall"}[5m])) by (le)) # p50
+ histogram_quantile(0.95, sum(rate(llm_evaluator_evaluation_scores_bucket{score_type="overall"}[5m])) by (le)) # p95
+ ```
+- **Meaning**:
+ - **p50**: Median evaluation score (quality of a typical response)
+ - **p95**: 95% of evaluations score at or below this value, so it reflects the upper end of the score range
+- **Score Range**: 1-5 points
+ - 1-2 points: Critical (severe quality issues)
+ - 3 points: Warning (needs improvement)
+ - 4-5 points: Good
+- **Goal**: Maintain p50 at 4 or above
+- **No Data Cause**:
+ - Evaluation scores not recorded as histograms
+ - Insufficient number of evaluations
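+
+Percentiles computed from a 1-5 histogram are coarse, so a mean score can be a useful companion. The query below is a sketch that assumes `llm_evaluator_evaluation_scores` is a standard Prometheus histogram, which also exposes `_sum` and `_count` series alongside `_bucket`:
+
+```promql
+# Mean overall score over the last 5 minutes: sum of observed scores / number of observations
+sum(rate(llm_evaluator_evaluation_scores_sum{score_type="overall"}[5m]))
+/
+sum(rate(llm_evaluator_evaluation_scores_count{score_type="overall"}[5m]))
+```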
+
+#### Notifications by Channel
+- **Purpose**: Time series of notification delivery rate by channel and status
+- **PromQL**: `sum(rate(llm_evaluator_notifications_sent_total[5m])) by (channel, status)`
+- **Meaning**:
+ - Track success/failure for Slack, Discord, Email separately
+ - Verify notification infrastructure health
+- **Channels**:
+ - `slack`: Slack webhook
+ - `discord`: Discord webhook
+ - `email`: SMTP email
+- **Status**:
+ - `success`: Delivery successful
+ - `error`: Delivery failed
+- **Warning**: High error rate for specific channel requires configuration check
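+
+The panel plots raw rates per status, so an error ratio has to be derived. One possible query, using only the `channel` and `status` labels described above, is sketched here. A channel with no `error` series yet returns no result rather than 0.
+
+```promql
+# Fraction of notification attempts that failed, per channel (0.1 = 10%)
+sum(rate(llm_evaluator_notifications_sent_total{status="error"}[5m])) by (channel)
+/
+sum(rate(llm_evaluator_notifications_sent_total[5m])) by (channel)
+```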
+
+#### Low Quality Alerts
+- **Purpose**: Frequency of low-quality alerts
+- **PromQL**: `sum(rate(llm_evaluator_low_quality_alerts_total[5m])) by (judge_type)`
+- **Meaning**:
+ - How often scores below `NOTIFICATION_SCORE_THRESHOLD` occur
+ - Monitor severity of quality issues
+- **Goal**: Lower is better
+- **Use Cases**:
+ - Early detection of quality degradation trends
+ - Measure effectiveness after prompt or model changes
+- **No Data Cause**: No low-quality responses (good sign!)
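+
+An absolute alert rate is hard to interpret when traffic changes, so it can help to normalize it by the evaluation rate. The following is a sketch that assumes each low-scoring evaluation increments the alert counter once and that both counters carry the same `judge_type` label values:
+
+```promql
+# Share of evaluations that raised a low-quality alert, per judge type
+sum(rate(llm_evaluator_low_quality_alerts_total[5m])) by (judge_type)
+/
+sum(rate(llm_evaluator_evaluations_total[5m])) by (judge_type)
+```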
+
+---
+
+### 4️⃣ System Health
+
+Monitor scheduler and batch evaluation system operation status.
+
+#### Scheduler Runs
+- **Purpose**: Automated evaluation scheduler execution rate
+- **PromQL**: `sum(rate(llm_evaluator_scheduler_runs_total[5m]))`
+- **Meaning**:
+ - Verify scheduler is operating normally
+ - Monitor execution at configured intervals
+- **Configuration**: `EVALUATION_INTERVAL_MINUTES` in `.env.local`
+ - Default: 60 minutes (runs every hour)
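+- **Normal Value**:
+ - For a 60-minute interval: 1 run/hour = 1/3600 ≈ 0.000278 runs/s averaged over time
+ - With the `rate()` query shown above, expect a brief bump after each run and 0 in between; the raw counter, by contrast, rises in steps
+- **Warning**: No new run for longer than the configured interval suggests the scheduler has stopped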
+- **Verification**:
+ ```bash
+ docker logs llm-evaluator 2>&1 | grep -i "scheduler"
+ ```
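+
+Because a per-second rate for an hourly job is a very small number, counting runs per hour is often easier to read. A sketch, assuming the default 60-minute interval:
+
+```promql
+# Scheduler runs in the last hour; roughly 1 with the default interval.
+# increase() extrapolates, so the result may not be a whole number.
+sum(increase(llm_evaluator_scheduler_runs_total[1h]))
+```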
+
+#### Batch Evaluation - Logs Processed
+- **Purpose**: Cumulative count of logs processed by batch evaluation
+- **PromQL**: `sum(llm_evaluator_batch_logs_processed_total)`
+- **Meaning**:
+ - Track total number of logs evaluated by scheduler
+ - Monitor system throughput
+- **Visualization**: Cumulative graph (continuously increasing)
+- **Configuration**: `EVALUATION_BATCH_SIZE` in `.env.local`
+ - Default: 10 (processes 10 at a time)
+- **Use Cases**:
+ - Optimize batch size
+ - Analyze processing speed trends
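+
+The cumulative counter shows totals; for throughput over a window, `increase()` can be applied to the same metric. A sketch with an illustrative one-hour window:
+
+```promql
+# Logs processed by batch evaluation during the last hour.
+# With the default batch size of 10 and one run per hour, this should stay at or below about 10.
+sum(increase(llm_evaluator_batch_logs_processed_total[1h]))
+```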
+
+---
+
+## Troubleshooting No Data
+
+If "No Data" appears in the dashboard, follow these steps:
+
+### 1. Check Prometheus Targets
+
+```bash
+# Check target status in the Prometheus UI (open in a browser):
+#   http://localhost:9090/targets
+
+# Or check via the API
+curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
+```
+
+**Expected Result:**
+- `gateway-api`: health = "up"
+- `evaluator`: health = "up"
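+
+The same health information is available inside Prometheus through the built-in `up` metric, which can be handy for a Grafana panel. This sketch assumes the job names match the ones listed above:
+
+```promql
+# 1 = last scrape succeeded, 0 = target down
+up{job=~"gateway-api|evaluator"}
+```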
+
+### 2. Check Metrics Endpoints
+
+```bash
+# Check Gateway API metrics
+curl http://localhost:18000/metrics | grep llm_gateway
+
+# Check Evaluator metrics
+curl http://localhost:18001/metrics | grep llm_evaluator
+```
+
+If metrics are missing, restart services:
+```bash
+docker compose -f docker-compose.local.yml restart gateway-api evaluator
+```
+
+### 3. Generate Data
+
+Some metrics require activity to generate data:
+
+```bash
+# 1. Send request to Gateway API
+curl -X POST "http://localhost:18000/chat" \
+ -H "Content-Type: application/json" \
+ -d '{"prompt": "Test prompt", "user_id": "test-user"}'
+
+# 2. Run evaluation
+curl -X POST "http://localhost:18001/evaluate-once?limit=5"
+
+# 3. Wait 5-10 minutes and refresh Grafana
+```
+
+### 4. Adjust Time Range
+
+Adjust time range in top-right corner of Grafana dashboard:
+- Default: Last 1 hour
+- If no data: Change to Last 6 hours or Last 24 hours
+
+### 5. Test PromQL Queries
+
+Execute queries directly in Prometheus UI:
+
+```promql
+# Prometheus Graph page: http://localhost:9090/graph
+
+# Example queries (run one at a time):
+llm_gateway_http_requests_total
+llm_evaluator_evaluations_total
+llm_evaluator_pending_logs
+```
+
+---
+
+## PromQL Query Explanations
+
+### rate() Function
+```promql
+rate(llm_gateway_http_requests_total[5m])
+```
+- **Meaning**: Per-second rate of increase over last 5 minutes
+- **Usage**: Convert Counter metrics to rates
+- **Unit**: Per second
+
+### histogram_quantile() Function
+```promql
+histogram_quantile(0.95, sum(rate(llm_gateway_http_request_duration_seconds_bucket[5m])) by (le))
+```
+- **Meaning**: Calculate 95th percentile from histogram
+- **le**: less than or equal (bucket upper bound)
+- **Usage**: Latency and score distribution analysis
+
+### sum() by (label)
+```promql
+sum(rate(llm_gateway_http_requests_total[5m])) by (endpoint)
+```
+- **Meaning**: Calculate sum grouped by label
+- **Usage**: Separate by endpoint, model, channel
+
+---
+
+## Usage Tips
+
+### 1. Detect Performance Issues
+- **HTTP Request Latency p99** > 10 seconds: Performance degradation
+- **LLM Request Latency p95** > 5 seconds: LLM API delay
+
+Response:
+```bash
+# Check slow request logs (2>&1 includes output written to stderr)
+docker logs llm-gateway-api 2>&1 | grep -i "latency"
+
+# Check database query performance (a db_query_duration metric could be added)
+```
+
+### 2. Track Quality Degradation
+- **Overall Score p50** < 3: Quality issues occurring
+- **Low Quality Alerts** spike: Requires immediate investigation
+
+Response:
+```bash
+# Check recent low-score logs
+docker exec -it llm-postgres psql -U llm_user -d llm_quality -c \
+ "SELECT l.id, l.prompt, l.response, e.overall_score
+ FROM llm_logs l
+ JOIN llm_evaluations e ON l.id = e.log_id
+ WHERE e.overall_score <= 3
+ ORDER BY l.created_at DESC
+ LIMIT 5;"
+```
+
+### 3. Notification System Health
+- **Notifications by Channel** - error rate > 10%: Check configuration
+
+Response:
+```bash
+# Check notification errors in the Evaluator logs
+docker logs llm-evaluator 2>&1 | grep -i "notification.*fail"
+
+# Check SMTP-related log lines
+docker logs llm-evaluator 2>&1 | grep -i "smtp"
+```
+
+### 4. Monitor Scheduler
+- **Scheduler Runs** stops increasing: Scheduler stopped
+
+Response:
+```bash
+# Check scheduler logs (case-insensitive, so this also matches APScheduler)
+docker logs llm-evaluator 2>&1 | grep -i "scheduler"
+
+# Restart Evaluator
+docker compose -f docker-compose.local.yml restart evaluator
+```
+
+### 5. Pending Logs Accumulation
+- **Pending Logs** continuously increasing: Evaluation speed < Log creation speed
+
+Response:
+```bash
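+# In .env.local: increase the batch size (from 10)
+EVALUATION_BATCH_SIZE=20
+
+# In .env.local: shorten the evaluation interval (from 60)
+EVALUATION_INTERVAL_MINUTES=30
+
+# Recreate the container so it picks up the new values
+# (`restart` alone does not reload environment changes)
+docker compose -f docker-compose.local.yml up -d --force-recreate evaluator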
+```
+
+---
+
+## Dashboard Customization
+
+### Adding Panels
+
+1. Click "Add panel" in Grafana UI
+2. Enter PromQL query
+3. Select visualization type (Stat, Time series, Gauge, etc.)
+4. Save
+
+### Useful Additional Panel Examples
+
+#### Error Rate Percentage
+```promql
+(sum(rate(llm_gateway_http_requests_total{status=~"5.."}[5m])) /
+ sum(rate(llm_gateway_http_requests_total[5m]))) * 100
+```
+
+#### Score Comparison by Evaluation Type
+```promql
+histogram_quantile(0.50, sum(rate(llm_evaluator_evaluation_scores_bucket{score_type="instruction_following"}[5m])) by (le))
+```
+
+#### Error Rate by Model
+```promql
+sum(rate(llm_gateway_llm_requests_total{status="error"}[5m])) by (model)
+```
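+
+The query above yields errors per second, which grows with traffic. To compare models fairly, a ratio may be more useful. This sketch reuses the `status="error"` label value from the query above, which is itself an assumption about how the Gateway labels failed LLM calls:
+
+```promql
+# Fraction of LLM calls that failed, per model
+sum(rate(llm_gateway_llm_requests_total{status="error"}[5m])) by (model)
+/
+sum(rate(llm_gateway_llm_requests_total[5m])) by (model)
+```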
+
+---
+
+## Alert Rules (Coming in Future Release)
+
+Prometheus Alertmanager integration planned for v0.6.0:
+
+```yaml
+# Example: High HTTP error rate alert
+- alert: HighHTTPErrorRate
+ expr: |
+ (sum(rate(llm_gateway_http_requests_total{status=~"5.."}[5m])) /
+ sum(rate(llm_gateway_http_requests_total[5m]))) > 0.05
+ for: 5m
+ labels:
+ severity: warning
+ annotations:
+ summary: "HTTP error rate exceeds 5%"
+```
+
+---
+
+## Troubleshooting
+
+### Grafana Cannot Connect to Prometheus
+
+**Symptom**: "Post http://localhost:9090/api/v1/query_range: connection refused"
+
+**Solution**:
+1. Check Datasource configuration
+ - Settings → Data Sources → Prometheus
+ - Verify URL is `http://prometheus:9090` (not localhost!)
+
+2. Check Prometheus container status
+ ```bash
+ docker ps | grep prometheus
+ docker logs llm-prometheus
+ ```
+
+3. Check Docker network
+ ```bash
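+ docker network inspect docker_default
+ # The network name depends on the Compose project name; list candidates with `docker network ls`
+ # gateway-api, evaluator, prometheus, and grafana must all be on the same network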
+ ```
+
+### Dashboard Not Auto-Loading
+
+**Solution**:
+```bash
+# Check Grafana provisioning logs
+docker logs llm-grafana 2>&1 | grep -i provision
+
+# Check permissions (run from the repository root)
+ls -la infra/grafana/
+
+# Fix permissions if needed
+chmod -R 755 infra/grafana/
+```
+
+---
+
+## References
+
+- [Prometheus Configuration Guide](../prometheus/README.md)
+- [Metrics Reference](../../docs/METRICS.md)
+- [Release Notes v0.5.0](../../docs/RELEASE_NOTES_v0.5.0.md)
+- [Grafana Official Documentation](https://grafana.com/docs/)
+- [PromQL Query Guide](https://prometheus.io/docs/prometheus/latest/querying/basics/)
+
+---
+
+**Last Updated**: 2025-12-23
+**Dashboard Version**: v0.5.0
diff --git a/infra/grafana/dashboards/llm-quality-observer.json b/infra/grafana/dashboards/llm-quality-observer.json
new file mode 100644
index 0000000..90478ad
--- /dev/null
+++ b/infra/grafana/dashboards/llm-quality-observer.json
@@ -0,0 +1,1228 @@
+{
+ "annotations": {
+ "list": [
+ {
+ "builtIn": 1,
+ "datasource": {
+ "type": "grafana",
+ "uid": "-- Grafana --"
+ },
+ "enable": true,
+ "hide": true,
+ "iconColor": "rgba(0, 211, 255, 1)",
+ "name": "Annotations & Alerts",
+ "type": "dashboard"
+ }
+ ]
+ },
+ "editable": true,
+ "fiscalYearStartMonth": 0,
+ "graphTooltip": 0,
+ "id": null,
+ "links": [],
+ "liveNow": false,
+ "panels": [
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "prometheus"
+ },
+ "fieldConfig": {
+ "defaults": {
+ "color": {
+ "mode": "thresholds"
+ },
+ "mappings": [],
+ "thresholds": {
+ "mode": "absolute",
+ "steps": [
+ {
+ "color": "green",
+ "value": null
+ }
+ ]
+ },
+ "unit": "short"
+ },
+ "overrides": []
+ },
+ "gridPos": {
+ "h": 4,
+ "w": 6,
+ "x": 0,
+ "y": 0
+ },
+ "id": 1,
+ "options": {
+ "colorMode": "value",
+ "graphMode": "area",
+ "justifyMode": "auto",
+ "orientation": "auto",
+ "reduceOptions": {
+ "values": false,
+ "calcs": [
+ "lastNotNull"
+ ],
+ "fields": ""
+ },
+ "textMode": "auto"
+ },
+ "pluginVersion": "10.0.0",
+ "targets": [
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "prometheus"
+ },
+ "expr": "sum(rate(llm_gateway_http_requests_total[5m]))",
+ "refId": "A"
+ }
+ ],
+ "title": "HTTP Request Rate (req/s)",
+ "type": "stat"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "prometheus"
+ },
+ "fieldConfig": {
+ "defaults": {
+ "color": {
+ "mode": "thresholds"
+ },
+ "mappings": [],
+ "thresholds": {
+ "mode": "absolute",
+ "steps": [
+ {
+ "color": "green",
+ "value": null
+ }
+ ]
+ },
+ "unit": "short"
+ },
+ "overrides": []
+ },
+ "gridPos": {
+ "h": 4,
+ "w": 6,
+ "x": 6,
+ "y": 0
+ },
+ "id": 2,
+ "options": {
+ "colorMode": "value",
+ "graphMode": "area",
+ "justifyMode": "auto",
+ "orientation": "auto",
+ "reduceOptions": {
+ "values": false,
+ "calcs": [
+ "lastNotNull"
+ ],
+ "fields": ""
+ },
+ "textMode": "auto"
+ },
+ "pluginVersion": "10.0.0",
+ "targets": [
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "prometheus"
+ },
+ "expr": "sum(rate(llm_evaluator_evaluations_total[5m]))",
+ "refId": "A"
+ }
+ ],
+ "title": "Evaluation Rate (eval/s)",
+ "type": "stat"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "prometheus"
+ },
+ "fieldConfig": {
+ "defaults": {
+ "color": {
+ "mode": "thresholds"
+ },
+ "mappings": [],
+ "thresholds": {
+ "mode": "absolute",
+ "steps": [
+ {
+ "color": "green",
+ "value": null
+ },
+ {
+ "color": "yellow",
+ "value": 10
+ },
+ {
+ "color": "red",
+ "value": 50
+ }
+ ]
+ },
+ "unit": "short"
+ },
+ "overrides": []
+ },
+ "gridPos": {
+ "h": 4,
+ "w": 6,
+ "x": 12,
+ "y": 0
+ },
+ "id": 3,
+ "options": {
+ "colorMode": "value",
+ "graphMode": "area",
+ "justifyMode": "auto",
+ "orientation": "auto",
+ "reduceOptions": {
+ "values": false,
+ "calcs": [
+ "lastNotNull"
+ ],
+ "fields": ""
+ },
+ "textMode": "auto"
+ },
+ "pluginVersion": "10.0.0",
+ "targets": [
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "prometheus"
+ },
+ "expr": "llm_evaluator_pending_logs",
+ "refId": "A"
+ }
+ ],
+ "title": "Pending Logs",
+ "type": "stat"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "prometheus"
+ },
+ "fieldConfig": {
+ "defaults": {
+ "color": {
+ "mode": "thresholds"
+ },
+ "mappings": [],
+ "thresholds": {
+ "mode": "absolute",
+ "steps": [
+ {
+ "color": "green",
+ "value": null
+ }
+ ]
+ },
+ "unit": "short"
+ },
+ "overrides": []
+ },
+ "gridPos": {
+ "h": 4,
+ "w": 6,
+ "x": 18,
+ "y": 0
+ },
+ "id": 4,
+ "options": {
+ "colorMode": "value",
+ "graphMode": "area",
+ "justifyMode": "auto",
+ "orientation": "auto",
+ "reduceOptions": {
+ "values": false,
+ "calcs": [
+ "lastNotNull"
+ ],
+ "fields": ""
+ },
+ "textMode": "auto"
+ },
+ "pluginVersion": "10.0.0",
+ "targets": [
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "prometheus"
+ },
+ "expr": "sum(rate(llm_evaluator_notifications_sent_total{status=\"success\"}[5m]))",
+ "refId": "A"
+ }
+ ],
+ "title": "Notification Rate (notif/s)",
+ "type": "stat"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "prometheus"
+ },
+ "fieldConfig": {
+ "defaults": {
+ "color": {
+ "mode": "palette-classic"
+ },
+ "custom": {
+ "axisCenteredZero": false,
+ "axisColorMode": "text",
+ "axisLabel": "",
+ "axisPlacement": "auto",
+ "barAlignment": 0,
+ "drawStyle": "line",
+ "fillOpacity": 10,
+ "gradientMode": "none",
+ "hideFrom": {
+ "tooltip": false,
+ "viz": false,
+ "legend": false
+ },
+ "lineInterpolation": "linear",
+ "lineWidth": 1,
+ "pointSize": 5,
+ "scaleDistribution": {
+ "type": "linear"
+ },
+ "showPoints": "never",
+ "spanNulls": false,
+ "stacking": {
+ "group": "A",
+ "mode": "none"
+ },
+ "thresholdsStyle": {
+ "mode": "off"
+ }
+ },
+ "mappings": [],
+ "thresholds": {
+ "mode": "absolute",
+ "steps": [
+ {
+ "color": "green",
+ "value": null
+ }
+ ]
+ },
+ "unit": "short"
+ },
+ "overrides": []
+ },
+ "gridPos": {
+ "h": 8,
+ "w": 12,
+ "x": 0,
+ "y": 4
+ },
+ "id": 5,
+ "options": {
+ "legend": {
+ "calcs": [],
+ "displayMode": "list",
+ "placement": "bottom",
+ "showLegend": true
+ },
+ "tooltip": {
+ "mode": "multi",
+ "sort": "none"
+ }
+ },
+ "pluginVersion": "10.0.0",
+ "targets": [
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "prometheus"
+ },
+ "expr": "sum(rate(llm_gateway_http_requests_total[1m])) by (endpoint, status)",
+ "legendFormat": "{{endpoint}} ({{status}})",
+ "refId": "A"
+ }
+ ],
+ "title": "HTTP Requests by Endpoint",
+ "type": "timeseries"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "prometheus"
+ },
+ "fieldConfig": {
+ "defaults": {
+ "color": {
+ "mode": "palette-classic"
+ },
+ "custom": {
+ "axisCenteredZero": false,
+ "axisColorMode": "text",
+ "axisLabel": "",
+ "axisPlacement": "auto",
+ "barAlignment": 0,
+ "drawStyle": "line",
+ "fillOpacity": 10,
+ "gradientMode": "none",
+ "hideFrom": {
+ "tooltip": false,
+ "viz": false,
+ "legend": false
+ },
+ "lineInterpolation": "linear",
+ "lineWidth": 1,
+ "pointSize": 5,
+ "scaleDistribution": {
+ "type": "linear"
+ },
+ "showPoints": "never",
+ "spanNulls": false,
+ "stacking": {
+ "group": "A",
+ "mode": "none"
+ },
+ "thresholdsStyle": {
+ "mode": "off"
+ }
+ },
+ "mappings": [],
+ "thresholds": {
+ "mode": "absolute",
+ "steps": [
+ {
+ "color": "green",
+ "value": null
+ }
+ ]
+ },
+ "unit": "s"
+ },
+ "overrides": []
+ },
+ "gridPos": {
+ "h": 8,
+ "w": 12,
+ "x": 12,
+ "y": 4
+ },
+ "id": 6,
+ "options": {
+ "legend": {
+ "calcs": [
+ "mean",
+ "max"
+ ],
+ "displayMode": "table",
+ "placement": "bottom",
+ "showLegend": true
+ },
+ "tooltip": {
+ "mode": "multi",
+ "sort": "none"
+ }
+ },
+ "pluginVersion": "10.0.0",
+ "targets": [
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "prometheus"
+ },
+ "expr": "histogram_quantile(0.50, sum(rate(llm_gateway_http_request_duration_seconds_bucket[5m])) by (le, endpoint))",
+ "legendFormat": "p50 - {{endpoint}}",
+ "refId": "A"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "prometheus"
+ },
+ "expr": "histogram_quantile(0.95, sum(rate(llm_gateway_http_request_duration_seconds_bucket[5m])) by (le, endpoint))",
+ "legendFormat": "p95 - {{endpoint}}",
+ "refId": "B"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "prometheus"
+ },
+ "expr": "histogram_quantile(0.99, sum(rate(llm_gateway_http_request_duration_seconds_bucket[5m])) by (le, endpoint))",
+ "legendFormat": "p99 - {{endpoint}}",
+ "refId": "C"
+ }
+ ],
+ "title": "HTTP Request Latency (p50/p95/p99)",
+ "type": "timeseries"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "prometheus"
+ },
+ "fieldConfig": {
+ "defaults": {
+ "color": {
+ "mode": "palette-classic"
+ },
+ "custom": {
+ "axisCenteredZero": false,
+ "axisColorMode": "text",
+ "axisLabel": "",
+ "axisPlacement": "auto",
+ "barAlignment": 0,
+ "drawStyle": "line",
+ "fillOpacity": 10,
+ "gradientMode": "none",
+ "hideFrom": {
+ "tooltip": false,
+ "viz": false,
+ "legend": false
+ },
+ "lineInterpolation": "linear",
+ "lineWidth": 1,
+ "pointSize": 5,
+ "scaleDistribution": {
+ "type": "linear"
+ },
+ "showPoints": "never",
+ "spanNulls": false,
+ "stacking": {
+ "group": "A",
+ "mode": "none"
+ },
+ "thresholdsStyle": {
+ "mode": "off"
+ }
+ },
+ "mappings": [],
+ "thresholds": {
+ "mode": "absolute",
+ "steps": [
+ {
+ "color": "green",
+ "value": null
+ }
+ ]
+ },
+ "unit": "short"
+ },
+ "overrides": []
+ },
+ "gridPos": {
+ "h": 8,
+ "w": 12,
+ "x": 0,
+ "y": 12
+ },
+ "id": 7,
+ "options": {
+ "legend": {
+ "calcs": [],
+ "displayMode": "list",
+ "placement": "bottom",
+ "showLegend": true
+ },
+ "tooltip": {
+ "mode": "multi",
+ "sort": "none"
+ }
+ },
+ "pluginVersion": "10.0.0",
+ "targets": [
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "prometheus"
+ },
+ "expr": "sum(rate(llm_gateway_llm_requests_total[1m])) by (model, status)",
+ "legendFormat": "{{model}} ({{status}})",
+ "refId": "A"
+ }
+ ],
+ "title": "LLM Requests by Model",
+ "type": "timeseries"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "prometheus"
+ },
+ "fieldConfig": {
+ "defaults": {
+ "color": {
+ "mode": "palette-classic"
+ },
+ "custom": {
+ "axisCenteredZero": false,
+ "axisColorMode": "text",
+ "axisLabel": "",
+ "axisPlacement": "auto",
+ "barAlignment": 0,
+ "drawStyle": "line",
+ "fillOpacity": 10,
+ "gradientMode": "none",
+ "hideFrom": {
+ "tooltip": false,
+ "viz": false,
+ "legend": false
+ },
+ "lineInterpolation": "linear",
+ "lineWidth": 1,
+ "pointSize": 5,
+ "scaleDistribution": {
+ "type": "linear"
+ },
+ "showPoints": "never",
+ "spanNulls": false,
+ "stacking": {
+ "group": "A",
+ "mode": "none"
+ },
+ "thresholdsStyle": {
+ "mode": "off"
+ }
+ },
+ "mappings": [],
+ "thresholds": {
+ "mode": "absolute",
+ "steps": [
+ {
+ "color": "green",
+ "value": null
+ }
+ ]
+ },
+ "unit": "s"
+ },
+ "overrides": []
+ },
+ "gridPos": {
+ "h": 8,
+ "w": 12,
+ "x": 12,
+ "y": 12
+ },
+ "id": 8,
+ "options": {
+ "legend": {
+ "calcs": [
+ "mean",
+ "max"
+ ],
+ "displayMode": "table",
+ "placement": "bottom",
+ "showLegend": true
+ },
+ "tooltip": {
+ "mode": "multi",
+ "sort": "none"
+ }
+ },
+ "pluginVersion": "10.0.0",
+ "targets": [
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "prometheus"
+ },
+ "expr": "histogram_quantile(0.50, sum(rate(llm_gateway_llm_request_duration_seconds_bucket[5m])) by (le, model))",
+ "legendFormat": "p50 - {{model}}",
+ "refId": "A"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "prometheus"
+ },
+ "expr": "histogram_quantile(0.95, sum(rate(llm_gateway_llm_request_duration_seconds_bucket[5m])) by (le, model))",
+ "legendFormat": "p95 - {{model}}",
+ "refId": "B"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "prometheus"
+ },
+ "expr": "histogram_quantile(0.99, sum(rate(llm_gateway_llm_request_duration_seconds_bucket[5m])) by (le, model))",
+ "legendFormat": "p99 - {{model}}",
+ "refId": "C"
+ }
+ ],
+ "title": "LLM Request Latency by Model (p50/p95/p99)",
+ "type": "timeseries"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "prometheus"
+ },
+ "fieldConfig": {
+ "defaults": {
+ "color": {
+ "mode": "palette-classic"
+ },
+ "custom": {
+ "axisCenteredZero": false,
+ "axisColorMode": "text",
+ "axisLabel": "",
+ "axisPlacement": "auto",
+ "barAlignment": 0,
+ "drawStyle": "line",
+ "fillOpacity": 10,
+ "gradientMode": "none",
+ "hideFrom": {
+ "tooltip": false,
+ "viz": false,
+ "legend": false
+ },
+ "lineInterpolation": "linear",
+ "lineWidth": 1,
+ "pointSize": 5,
+ "scaleDistribution": {
+ "type": "linear"
+ },
+ "showPoints": "never",
+ "spanNulls": false,
+ "stacking": {
+ "group": "A",
+ "mode": "none"
+ },
+ "thresholdsStyle": {
+ "mode": "off"
+ }
+ },
+ "mappings": [],
+ "thresholds": {
+ "mode": "absolute",
+ "steps": [
+ {
+ "color": "green",
+ "value": null
+ }
+ ]
+ },
+ "unit": "short"
+ },
+ "overrides": []
+ },
+ "gridPos": {
+ "h": 8,
+ "w": 12,
+ "x": 0,
+ "y": 20
+ },
+ "id": 9,
+ "options": {
+ "legend": {
+ "calcs": [],
+ "displayMode": "list",
+ "placement": "bottom",
+ "showLegend": true
+ },
+ "tooltip": {
+ "mode": "multi",
+ "sort": "none"
+ }
+ },
+ "pluginVersion": "10.0.0",
+ "targets": [
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "prometheus"
+ },
+ "expr": "sum(rate(llm_evaluator_evaluations_total[1m])) by (judge_type, status)",
+ "legendFormat": "{{judge_type}} ({{status}})",
+ "refId": "A"
+ }
+ ],
+ "title": "Evaluations by Judge Type",
+ "type": "timeseries"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "prometheus"
+ },
+ "fieldConfig": {
+ "defaults": {
+ "color": {
+ "mode": "palette-classic"
+ },
+ "custom": {
+ "axisCenteredZero": false,
+ "axisColorMode": "text",
+ "axisLabel": "",
+ "axisPlacement": "auto",
+ "barAlignment": 0,
+ "drawStyle": "line",
+ "fillOpacity": 10,
+ "gradientMode": "none",
+ "hideFrom": {
+ "tooltip": false,
+ "viz": false,
+ "legend": false
+ },
+ "lineInterpolation": "linear",
+ "lineWidth": 1,
+ "pointSize": 5,
+ "scaleDistribution": {
+ "type": "linear"
+ },
+ "showPoints": "never",
+ "spanNulls": false,
+ "stacking": {
+ "group": "A",
+ "mode": "none"
+ },
+ "thresholdsStyle": {
+ "mode": "off"
+ }
+ },
+ "mappings": [],
+ "max": 5,
+ "min": 0,
+ "thresholds": {
+ "mode": "absolute",
+ "steps": [
+ {
+ "color": "green",
+ "value": null
+ }
+ ]
+ },
+ "unit": "short"
+ },
+ "overrides": []
+ },
+ "gridPos": {
+ "h": 8,
+ "w": 12,
+ "x": 12,
+ "y": 20
+ },
+ "id": 10,
+ "options": {
+ "legend": {
+ "calcs": [
+ "mean"
+ ],
+ "displayMode": "table",
+ "placement": "bottom",
+ "showLegend": true
+ },
+ "tooltip": {
+ "mode": "multi",
+ "sort": "none"
+ }
+ },
+ "pluginVersion": "10.0.0",
+ "targets": [
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "prometheus"
+ },
+ "expr": "histogram_quantile(0.50, sum(rate(llm_evaluator_evaluation_scores_bucket{score_type=\"overall\"}[5m])) by (le, judge_type))",
+ "legendFormat": "p50 - {{judge_type}}",
+ "refId": "A"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "prometheus"
+ },
+ "expr": "histogram_quantile(0.95, sum(rate(llm_evaluator_evaluation_scores_bucket{score_type=\"overall\"}[5m])) by (le, judge_type))",
+ "legendFormat": "p95 - {{judge_type}}",
+ "refId": "B"
+ }
+ ],
+ "title": "Overall Score Distribution (p50/p95)",
+ "type": "timeseries"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "prometheus"
+ },
+ "fieldConfig": {
+ "defaults": {
+ "color": {
+ "mode": "palette-classic"
+ },
+ "custom": {
+ "axisCenteredZero": false,
+ "axisColorMode": "text",
+ "axisLabel": "",
+ "axisPlacement": "auto",
+ "barAlignment": 0,
+ "drawStyle": "line",
+ "fillOpacity": 10,
+ "gradientMode": "none",
+ "hideFrom": {
+ "tooltip": false,
+ "viz": false,
+ "legend": false
+ },
+ "lineInterpolation": "linear",
+ "lineWidth": 1,
+ "pointSize": 5,
+ "scaleDistribution": {
+ "type": "linear"
+ },
+ "showPoints": "never",
+ "spanNulls": false,
+ "stacking": {
+ "group": "A",
+ "mode": "none"
+ },
+ "thresholdsStyle": {
+ "mode": "off"
+ }
+ },
+ "mappings": [],
+ "thresholds": {
+ "mode": "absolute",
+ "steps": [
+ {
+ "color": "green",
+ "value": null
+ }
+ ]
+ },
+ "unit": "short"
+ },
+ "overrides": []
+ },
+ "gridPos": {
+ "h": 8,
+ "w": 12,
+ "x": 0,
+ "y": 28
+ },
+ "id": 11,
+ "options": {
+ "legend": {
+ "calcs": [],
+ "displayMode": "list",
+ "placement": "bottom",
+ "showLegend": true
+ },
+ "tooltip": {
+ "mode": "multi",
+ "sort": "none"
+ }
+ },
+ "pluginVersion": "10.0.0",
+ "targets": [
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "prometheus"
+ },
+ "expr": "sum(rate(llm_evaluator_notifications_sent_total[1m])) by (channel, status)",
+ "legendFormat": "{{channel}} ({{status}})",
+ "refId": "A"
+ }
+ ],
+ "title": "Notifications by Channel",
+ "type": "timeseries"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "prometheus"
+ },
+ "fieldConfig": {
+ "defaults": {
+ "color": {
+ "mode": "palette-classic"
+ },
+ "custom": {
+ "axisCenteredZero": false,
+ "axisColorMode": "text",
+ "axisLabel": "",
+ "axisPlacement": "auto",
+ "barAlignment": 0,
+ "drawStyle": "line",
+ "fillOpacity": 10,
+ "gradientMode": "none",
+ "hideFrom": {
+ "tooltip": false,
+ "viz": false,
+ "legend": false
+ },
+ "lineInterpolation": "linear",
+ "lineWidth": 1,
+ "pointSize": 5,
+ "scaleDistribution": {
+ "type": "linear"
+ },
+ "showPoints": "never",
+ "spanNulls": false,
+ "stacking": {
+ "group": "A",
+ "mode": "none"
+ },
+ "thresholdsStyle": {
+ "mode": "off"
+ }
+ },
+ "mappings": [],
+ "thresholds": {
+ "mode": "absolute",
+ "steps": [
+ {
+ "color": "green",
+ "value": null
+ }
+ ]
+ },
+ "unit": "short"
+ },
+ "overrides": []
+ },
+ "gridPos": {
+ "h": 8,
+ "w": 12,
+ "x": 12,
+ "y": 28
+ },
+ "id": 12,
+ "options": {
+ "legend": {
+ "calcs": [],
+ "displayMode": "list",
+ "placement": "bottom",
+ "showLegend": true
+ },
+ "tooltip": {
+ "mode": "multi",
+ "sort": "none"
+ }
+ },
+ "pluginVersion": "10.0.0",
+ "targets": [
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "prometheus"
+ },
+ "expr": "sum(rate(llm_evaluator_low_quality_alerts_total[1m])) by (judge_type)",
+ "legendFormat": "{{judge_type}}",
+ "refId": "A"
+ }
+ ],
+ "title": "Low Quality Alerts",
+ "type": "timeseries"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "prometheus"
+ },
+ "fieldConfig": {
+ "defaults": {
+ "color": {
+ "mode": "palette-classic"
+ },
+ "custom": {
+ "axisCenteredZero": false,
+ "axisColorMode": "text",
+ "axisLabel": "",
+ "axisPlacement": "auto",
+ "barAlignment": 0,
+ "drawStyle": "line",
+ "fillOpacity": 10,
+ "gradientMode": "none",
+ "hideFrom": {
+ "tooltip": false,
+ "viz": false,
+ "legend": false
+ },
+ "lineInterpolation": "linear",
+ "lineWidth": 1,
+ "pointSize": 5,
+ "scaleDistribution": {
+ "type": "linear"
+ },
+ "showPoints": "never",
+ "spanNulls": false,
+ "stacking": {
+ "group": "A",
+ "mode": "none"
+ },
+ "thresholdsStyle": {
+ "mode": "off"
+ }
+ },
+ "mappings": [],
+ "thresholds": {
+ "mode": "absolute",
+ "steps": [
+ {
+ "color": "green",
+ "value": null
+ }
+ ]
+ },
+ "unit": "short"
+ },
+ "overrides": []
+ },
+ "gridPos": {
+ "h": 8,
+ "w": 12,
+ "x": 0,
+ "y": 36
+ },
+ "id": 13,
+ "options": {
+ "legend": {
+ "calcs": [],
+ "displayMode": "list",
+ "placement": "bottom",
+ "showLegend": true
+ },
+ "tooltip": {
+ "mode": "multi",
+ "sort": "none"
+ }
+ },
+ "pluginVersion": "10.0.0",
+ "targets": [
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "prometheus"
+ },
+ "expr": "sum(rate(llm_evaluator_scheduler_runs_total[5m])) by (status)",
+ "legendFormat": "{{status}}",
+ "refId": "A"
+ }
+ ],
+ "title": "Scheduler Runs",
+ "type": "timeseries"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "prometheus"
+ },
+ "fieldConfig": {
+ "defaults": {
+ "color": {
+ "mode": "palette-classic"
+ },
+ "custom": {
+ "axisCenteredZero": false,
+ "axisColorMode": "text",
+ "axisLabel": "",
+ "axisPlacement": "auto",
+ "barAlignment": 0,
+ "drawStyle": "line",
+ "fillOpacity": 10,
+ "gradientMode": "none",
+ "hideFrom": {
+ "tooltip": false,
+ "viz": false,
+ "legend": false
+ },
+ "lineInterpolation": "linear",
+ "lineWidth": 1,
+ "pointSize": 5,
+ "scaleDistribution": {
+ "type": "linear"
+ },
+ "showPoints": "never",
+ "spanNulls": false,
+ "stacking": {
+ "group": "A",
+ "mode": "none"
+ },
+ "thresholdsStyle": {
+ "mode": "off"
+ }
+ },
+ "mappings": [],
+ "thresholds": {
+ "mode": "absolute",
+ "steps": [
+ {
+ "color": "green",
+ "value": null
+ }
+ ]
+ },
+ "unit": "short"
+ },
+ "overrides": []
+ },
+ "gridPos": {
+ "h": 8,
+ "w": 12,
+ "x": 12,
+ "y": 36
+ },
+ "id": 14,
+ "options": {
+ "legend": {
+ "calcs": [
+ "mean"
+ ],
+ "displayMode": "table",
+ "placement": "bottom",
+ "showLegend": true
+ },
+ "tooltip": {
+ "mode": "multi",
+ "sort": "none"
+ }
+ },
+ "pluginVersion": "10.0.0",
+ "targets": [
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "prometheus"
+ },
+ "expr": "sum(rate(llm_evaluator_batch_logs_processed_sum[5m]))",
+ "legendFormat": "logs processed/s",
+ "refId": "A"
+ }
+ ],
+ "title": "Batch Evaluation - Logs Processed",
+ "type": "timeseries"
+ }
+ ],
+ "refresh": "10s",
+ "schemaVersion": 38,
+ "style": "dark",
+ "tags": [
+ "llm",
+ "quality",
+ "monitoring"
+ ],
+ "templating": {
+ "list": []
+ },
+ "time": {
+ "from": "now-1h",
+ "to": "now"
+ },
+ "timepicker": {},
+ "timezone": "",
+ "title": "LLM Quality Observer",
+ "uid": "llm-quality-observer",
+ "version": 1,
+ "weekStart": ""
+}
diff --git a/infra/grafana/provisioning/dashboards/default.yml b/infra/grafana/provisioning/dashboards/default.yml
new file mode 100644
index 0000000..60aaad4
--- /dev/null
+++ b/infra/grafana/provisioning/dashboards/default.yml
@@ -0,0 +1,12 @@
+apiVersion: 1
+
+providers:
+ - name: 'LLM Quality Observer'
+ orgId: 1
+ folder: ''
+ type: file
+ disableDeletion: false
+ updateIntervalSeconds: 10
+ allowUiUpdates: true
+ options:
+ path: /var/lib/grafana/dashboards
diff --git a/infra/grafana/provisioning/datasources/prometheus.yml b/infra/grafana/provisioning/datasources/prometheus.yml
new file mode 100644
index 0000000..737c1cd
--- /dev/null
+++ b/infra/grafana/provisioning/datasources/prometheus.yml
@@ -0,0 +1,10 @@
+apiVersion: 1
+
+datasources:
+ - name: Prometheus
+ type: prometheus
+ access: proxy
+ url: http://prometheus:9090
+ uid: prometheus
+ isDefault: true
+ editable: true
diff --git a/infra/prometheus/prometheus.yml b/infra/prometheus/prometheus.yml
new file mode 100644
index 0000000..5f7b956
--- /dev/null
+++ b/infra/prometheus/prometheus.yml
@@ -0,0 +1,24 @@
+global:
+ scrape_interval: 15s
+ evaluation_interval: 15s
+ external_labels:
+ monitor: 'llm-quality-observer'
+
+scrape_configs:
+ - job_name: 'gateway-api'
+ static_configs:
+ - targets: ['gateway-api:8000']
+ labels:
+ service: 'gateway-api'
+ environment: 'local'
+
+ - job_name: 'evaluator'
+ static_configs:
+ - targets: ['evaluator:8000']
+ labels:
+ service: 'evaluator'
+ environment: 'local'
+
+ - job_name: 'prometheus'
+ static_configs:
+ - targets: ['localhost:9090']
diff --git a/services/evaluator/app/config.py b/services/evaluator/app/config.py
index 2b6d610..4352e1a 100644
--- a/services/evaluator/app/config.py
+++ b/services/evaluator/app/config.py
@@ -25,6 +25,14 @@ class Settings(BaseSettings):
     discord_webhook_url: str | None = None  # Discord webhook URL
     notification_score_threshold: int = 3  # Score threshold for alerts (alert when at or below)
+    # Email Notification Settings
+    smtp_host: str | None = None  # SMTP server host
+    smtp_port: int = 587  # SMTP port (default 587 - TLS)
+    smtp_username: str | None = None  # SMTP username
+    smtp_password: str | None = None  # SMTP password
+    smtp_from_email: str | None = None  # Sender email address
+    smtp_to_emails: str | None = None  # Recipient email addresses (comma-separated)
+
     class Config:
         env_file = ".env"
         env_file_encoding = "utf-8"
diff --git a/services/evaluator/app/main.py b/services/evaluator/app/main.py
index a59c733..ccb48cc 100644
--- a/services/evaluator/app/main.py
+++ b/services/evaluator/app/main.py
@@ -1,8 +1,10 @@
from typing import Literal
from contextlib import asynccontextmanager
-from fastapi import FastAPI, Depends, Query, HTTPException
+from fastapi import FastAPI, Depends, Query, HTTPException, Response
from sqlalchemy.orm import Session
import logging
+import time
+from prometheus_client import generate_latest, CONTENT_TYPE_LATEST
from .db import Base, engine, get_db
from .models import LLMLog, LLMEvaluation
@@ -11,6 +13,8 @@
from .config import settings
from .scheduler import start_scheduler, stop_scheduler
from .utils import get_pending_logs
+from .metrics import record_evaluation, update_pending_logs_count
+from .notifier import send_low_quality_alert
# 로깅 설정
logging.basicConfig(
@@ -57,6 +61,12 @@ def health_check():
}
+@app.get("/metrics")
+def metrics():
+    """Prometheus metrics endpoint."""
+    return Response(content=generate_latest(), media_type=CONTENT_TYPE_LATEST)
+
+
@app.post("/evaluate-once")
def evaluate_once(
limit: int = Query(10, ge=1, le=100, description="한 번에 평가할 최대 로그 개수"),
@@ -129,6 +139,11 @@ def evaluate_once(
# DB에 추가
db.add(evaluation)
+ db.commit()  # Commit so that evaluation.id is populated
+
+ # Send a low-quality alert
+ send_low_quality_alert(log, evaluation)
+
evaluated_count += 1
except HTTPException as e:
@@ -140,9 +155,6 @@ def evaluate_once(
db.rollback()
raise HTTPException(status_code=500, detail=f"Evaluation failed: {str(e)}")
- # 3. 모든 평가 결과 커밋
- db.commit()
-
# 4. 결과 반환
return {
"evaluated": evaluated_count,
diff --git a/services/evaluator/app/metrics.py b/services/evaluator/app/metrics.py
new file mode 100644
index 0000000..3a793da
--- /dev/null
+++ b/services/evaluator/app/metrics.py
@@ -0,0 +1,186 @@
+"""
+Prometheus metric definitions and collection - Evaluator Service
+"""
+
+from prometheus_client import Counter, Histogram, Gauge, Info
+
+# Application info
+app_info = Info('llm_evaluator_app', 'LLM Evaluator Service application info')
+app_info.info({'version': '0.5.0', 'service': 'evaluator'})
+
+# Evaluation metrics
+evaluations_total = Counter(
+    'llm_evaluator_evaluations_total',
+    'Total evaluations performed',
+    ['judge_type', 'status']  # judge_type: rule/llm, status: success/error
+)
+
+evaluation_duration_seconds = Histogram(
+    'llm_evaluator_evaluation_duration_seconds',
+    'Evaluation processing time',
+    ['judge_type'],
+    buckets=(0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0, float('inf'))
+)
+
+evaluation_scores = Histogram(
+    'llm_evaluator_evaluation_scores',
+    'Distribution of evaluation scores',
+    ['judge_type', 'score_type'],  # score_type: overall/instruction/truthfulness
+    buckets=(1, 2, 3, 4, 5)
+)
+
+# Batch evaluation metrics
+batch_evaluations_total = Counter(
+    'llm_evaluator_batch_evaluations_total',
+    'Total batch evaluation runs',
+    ['judge_type']
+)
+
+batch_logs_processed = Histogram(
+    'llm_evaluator_batch_logs_processed',
+    'Number of logs processed per batch',
+    buckets=(1, 5, 10, 20, 50, 100)
+)
+
+# Notification metrics
+notifications_sent_total = Counter(
+    'llm_evaluator_notifications_sent_total',
+    'Total notifications sent',
+    ['channel', 'type', 'status']  # channel: slack/discord/email, type: alert/summary
+)
+
+low_quality_alerts_total = Counter(
+    'llm_evaluator_low_quality_alerts_total',
+    'Total low quality alerts triggered',
+    ['judge_type']
+)
+
+# Scheduler metrics
+scheduler_runs_total = Counter(
+    'llm_evaluator_scheduler_runs_total',
+    'Total scheduler runs',
+    ['status']  # success/error
+)
+
+pending_logs_gauge = Gauge(
+    'llm_evaluator_pending_logs',
+    'Number of logs waiting for evaluation'
+)
+
+# LLM judge call metrics
+llm_judge_requests_total = Counter(
+    'llm_evaluator_llm_judge_requests_total',
+    'Total LLM judge API requests',
+    ['model', 'status']
+)
+
+llm_judge_request_duration_seconds = Histogram(
+    'llm_evaluator_llm_judge_request_duration_seconds',
+    'LLM judge API request latency',
+    ['model'],
+    buckets=(0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0, float('inf'))
+)
+
+
+def record_evaluation(judge_type: str, status: str, duration_seconds: float, scores: dict | None = None):
+    """
+    Record evaluation metrics.
+
+    Args:
+        judge_type: 'rule' or 'llm'
+        status: 'success' or 'error'
+        duration_seconds: evaluation duration in seconds
+        scores: {'overall': int, 'instruction': int, 'truthfulness': int}
+    """
+    evaluations_total.labels(judge_type=judge_type, status=status).inc()
+    evaluation_duration_seconds.labels(judge_type=judge_type).observe(duration_seconds)
+
+    if scores and status == 'success':
+        if 'overall' in scores:
+            evaluation_scores.labels(
+                judge_type=judge_type,
+                score_type='overall'
+            ).observe(scores['overall'])
+
+        if 'instruction' in scores and scores['instruction'] is not None:
+            evaluation_scores.labels(
+                judge_type=judge_type,
+                score_type='instruction'
+            ).observe(scores['instruction'])
+
+        if 'truthfulness' in scores and scores['truthfulness'] is not None:
+            evaluation_scores.labels(
+                judge_type=judge_type,
+                score_type='truthfulness'
+            ).observe(scores['truthfulness'])
+
+
+def record_batch_evaluation(judge_type: str, logs_processed: int):
+    """
+    Record batch evaluation metrics.
+
+    Args:
+        judge_type: 'rule' or 'llm'
+        logs_processed: number of logs processed
+    """
+    batch_evaluations_total.labels(judge_type=judge_type).inc()
+    batch_logs_processed.observe(logs_processed)
+
+
+def record_notification(channel: str, notification_type: str, status: str):
+    """
+    Record notification delivery metrics.
+
+    Args:
+        channel: 'slack', 'discord', 'email'
+        notification_type: 'alert' or 'summary'
+        status: 'success' or 'error'
+    """
+    notifications_sent_total.labels(
+        channel=channel,
+        type=notification_type,
+        status=status
+    ).inc()
+
+
+def record_low_quality_alert(judge_type: str):
+    """
+    Record a low-quality alert metric.
+
+    Args:
+        judge_type: 'rule' or 'llm'
+    """
+    low_quality_alerts_total.labels(judge_type=judge_type).inc()
+
+
+def record_scheduler_run(status: str):
+    """
+    Record a scheduler run metric.
+
+    Args:
+        status: 'success' or 'error'
+    """
+    scheduler_runs_total.labels(status=status).inc()
+
+
+def update_pending_logs_count(count: int):
+    """
+    Update the number of pending logs.
+
+    Args:
+        count: number of logs currently waiting for evaluation
+    """
+    pending_logs_gauge.set(count)
+
+
+def record_llm_judge_request(model: str, status: str, duration_seconds: float):
+    """
+    Record LLM judge API call metrics.
+
+    Args:
+        model: model name
+        status: 'success' or 'error'
+        duration_seconds: request duration in seconds
+    """
+    llm_judge_requests_total.labels(model=model, status=status).inc()
+    llm_judge_request_duration_seconds.labels(model=model).observe(duration_seconds)
diff --git a/services/evaluator/app/notifier.py b/services/evaluator/app/notifier.py
index 90b38cd..8134d44 100644
--- a/services/evaluator/app/notifier.py
+++ b/services/evaluator/app/notifier.py
@@ -1,24 +1,30 @@
"""
Notification system module.
-Sends evaluation result notifications through Slack and Discord webhooks.
+Sends evaluation result notifications through Slack and Discord webhooks and email.
"""
import logging
+import asyncio
from typing import Optional
import httpx
+import aiosmtplib
+from email.mime.text import MIMEText
+from email.mime.multipart import MIMEMultipart
from .config import settings
from .models import LLMLog, LLMEvaluation
+from .metrics import record_notification, record_low_quality_alert
logger = logging.getLogger(__name__)
-def send_slack_notification(message: str) -> bool:
+def send_slack_notification(message: str, notification_type: str = "alert") -> bool:
"""
Sends a message through the Slack webhook.
Args:
message: message to send
+ notification_type: 'alert' or 'summary'
Returns:
bool: whether the send succeeded
@@ -35,19 +41,22 @@ def send_slack_notification(message: str) -> bool:
timeout=10.0,
)
response.raise_for_status()
+ record_notification("slack", notification_type, "success")
logger.info("Slack notification sent")
return True
except Exception as e:
+ record_notification("slack", notification_type, "error")
logger.error(f"Failed to send Slack notification: {str(e)}")
return False
-def send_discord_notification(message: str) -> bool:
+def send_discord_notification(message: str, notification_type: str = "alert") -> bool:
"""
Sends a message through the Discord webhook.
Args:
message: message to send
+ notification_type: 'alert' or 'summary'
Returns:
bool: whether the send succeeded
@@ -64,13 +73,79 @@ def send_discord_notification(message: str) -> bool:
timeout=10.0,
)
response.raise_for_status()
+ record_notification("discord", notification_type, "success")
logger.info("Discord notification sent")
return True
except Exception as e:
+ record_notification("discord", notification_type, "error")
logger.error(f"Failed to send Discord notification: {str(e)}")
return False
+async def send_email_notification(subject: str, message: str, notification_type: str = "alert", html_content: str | None = None) -> bool:
+    """
+    Sends an email notification over SMTP.
+
+    Args:
+        subject: email subject
+        message: message to send (plain text)
+        notification_type: 'alert' or 'summary'
+        html_content: HTML content (generated from the message if omitted)
+
+    Returns:
+        bool: whether the send succeeded
+    """
+    if not all([
+        settings.smtp_host,
+        settings.smtp_username,
+        settings.smtp_password,
+        settings.smtp_from_email,
+        settings.smtp_to_emails,
+    ]):
+        logger.debug("SMTP settings are incomplete.")
+        return False
+
+    try:
+        # Parse the list of recipient addresses
+        to_emails = [email.strip() for email in settings.smtp_to_emails.split(",")]
+
+        # Build the MIME message
+        msg = MIMEMultipart("alternative")
+        msg["Subject"] = subject
+        msg["From"] = settings.smtp_from_email
+        msg["To"] = ", ".join(to_emails)
+
+        # Plain-text version
+        text_part = MIMEText(message, "plain")
+
+        # HTML version (fall back to a basic template when none is provided)
+        if html_content is None:
+            html_message = message.replace("\n", "<br>")
+            html_content = f"{html_message}"
+
+        html_part = MIMEText(html_content, "html")
+
+        msg.attach(text_part)
+        msg.attach(html_part)
+
+        # Connect and send over SMTP (port 587 uses STARTTLS)
+        async with aiosmtplib.SMTP(
+            hostname=settings.smtp_host,
+            port=settings.smtp_port,
+            start_tls=True,
+        ) as smtp:
+            await smtp.login(settings.smtp_username, settings.smtp_password)
+            await smtp.send_message(msg)
+
+        record_notification("email", notification_type, "success")
+        logger.info(f"Email notification sent: {to_emails}")
+        return True
+    except Exception as e:
+        record_notification("email", notification_type, "error")
+        logger.error(f"Failed to send email notification: {str(e)}")
+        return False
+
+
def send_low_quality_alert(log: LLMLog, evaluation: LLMEvaluation):
"""
Sends an alert for evaluation results with a low quality score.
@@ -83,7 +158,18 @@ def send_low_quality_alert(log: LLMLog, evaluation: LLMEvaluation):
# No alert when the score is at or above the threshold
return
- # 메시지 구성
+    # Pick a color based on the score
+    if evaluation.overall_score <= 2:
+        score_color = "#dc3545"  # red
+        status_badge = "Critical"
+    elif evaluation.overall_score <= 3:
+        score_color = "#fd7e14"  # orange
+        status_badge = "Warning"
+    else:
+        score_color = "#ffc107"  # yellow
+        status_badge = "Low Quality"
+
+    # Plain-text message (for Slack/Discord)
message = f"""
🚨 **Low Quality Alert**
@@ -100,11 +186,138 @@ def send_low_quality_alert(log: LLMLog, evaluation: LLMEvaluation):
**Created:** {log.created_at.strftime('%Y-%m-%d %H:%M:%S')}
""".strip()
- # Slack과 Discord에 동시 전송
- slack_sent = send_slack_notification(message)
- discord_sent = send_discord_notification(message)
+    # HTML email template
+    html_email = f"""
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ 🚨 LLM Quality Alert
+
+
+ {status_badge} - Immediate Attention Required
+
+ |
+
+
+
+
+ |
+
+ {evaluation.overall_score}/5
+
+ Quality Score
+ |
+
+
+
+
+
+
+
+
+
+ |
+ Judge Information
+ {evaluation.judge_model}
+ Label: {evaluation.label}
+ |
+
+
+
+
+
+ 📝 Prompt
+
+ {log.prompt[:200]}{"..." if len(log.prompt) > 200 else ""}
+
+
+
+
+
+ 💬 Response
+
+ {log.response[:200]}{"..." if len(log.response) > 200 else ""}
+
+
+
+
+
+ 💡 Analysis
+
+ {evaluation.comment or 'No additional comments'}
+
+
+
+
+
+
+ |
+ Log ID: #{log.id}
+ |
+
+ Created: {log.created_at.strftime('%Y-%m-%d %H:%M:%S')} UTC
+ |
+
+
+
+ |
+
+
+
+
+ |
+
+ This is an automated alert from LLM Quality Observer
+
+
+ Powered by LLM-Quality-Observer • {log.created_at.strftime('%Y')}
+
+ |
+
+
+
+ |
+
+
+
+
+""".strip()
+
+    # Record the low-quality alert metric
+    judge_type = "llm" if "llm" in evaluation.judge_model or "gpt" in evaluation.judge_model else "rule"
+    record_low_quality_alert(judge_type)
+
+    # Send to Slack, Discord, and email
+    slack_sent = send_slack_notification(message, notification_type="alert")
+    discord_sent = send_discord_notification(message, notification_type="alert")
- if slack_sent or discord_sent:
+    # Send the email (with the HTML template)
+    email_sent = False
+    try:
+        email_subject = f"🚨 LLM Quality Alert - Score: {evaluation.overall_score}/5"
+        email_sent = asyncio.run(send_email_notification(
+            email_subject,
+            message,
+            notification_type="alert",
+            html_content=html_email
+        ))
+    except Exception as e:
+        logger.error(f"Email notification error: {str(e)}")
+
+    if slack_sent or discord_sent or email_sent:
logger.info(f"Low quality alert sent for log_id={log.id}, score={evaluation.overall_score}")
else:
logger.warning(f"Failed to send alert for log_id={log.id}")
@@ -130,5 +343,12 @@ def send_batch_evaluation_summary(evaluated_count: int, judge_type: str, judge_m
**Judge Model:** {judge_model}
""".strip()
- send_slack_notification(message)
- send_discord_notification(message)
+    send_slack_notification(message, notification_type="summary")
+    send_discord_notification(message, notification_type="summary")
+
+    # Send the email (run the async function from a synchronous context)
+    try:
+        email_subject = f"✅ Batch Evaluation Complete - {evaluated_count} logs evaluated"
+        asyncio.run(send_email_notification(email_subject, message, notification_type="summary"))
+    except Exception as e:
+        logger.error(f"Email notification error: {str(e)}")
diff --git a/services/evaluator/app/scheduler.py b/services/evaluator/app/scheduler.py
index caa82c1..b447acc 100644
--- a/services/evaluator/app/scheduler.py
+++ b/services/evaluator/app/scheduler.py
@@ -4,6 +4,7 @@
"""
import logging
+import time
from apscheduler.schedulers.background import BackgroundScheduler
from apscheduler.triggers.interval import IntervalTrigger
from sqlalchemy.orm import Session
@@ -15,6 +16,12 @@
from .rules import basic_rule_evaluate
from .llm_judge import run_judge
from .notifier import send_low_quality_alert, send_batch_evaluation_summary
+from .metrics import (
+ record_evaluation,
+ record_batch_evaluation,
+ record_scheduler_run,
+ update_pending_logs_count,
+)
logger = logging.getLogger(__name__)
@@ -49,6 +56,7 @@ def run_batch_evaluation():
judge_model_name = ""
for log in pending_logs:
+ eval_start = time.time()
try:
if judge_type == "rule":
# Rule-based evaluation
@@ -84,6 +92,15 @@ def run_batch_evaluation():
db.add(evaluation)
db.commit()
evaluated_count += 1
+ eval_duration = time.time() - eval_start
+
+ # Record metrics
+ scores = {
+ 'overall': evaluation.overall_score,
+ 'instruction': evaluation.score_instruction_following,
+ 'truthfulness': evaluation.score_truthfulness,
+ }
+ record_evaluation(judge_type, "success", eval_duration, scores)
# Send an alert when the quality score is low
send_low_quality_alert(log, evaluation)
@@ -94,6 +111,8 @@ def run_batch_evaluation():
)
except Exception as e:
+ eval_duration = time.time() - eval_start
+ record_evaluation(judge_type, "error", eval_duration)
logger.error(f"Failed to evaluate log_id={log.id}: {str(e)}")
db.rollback()
continue
@@ -105,12 +124,19 @@ def run_batch_evaluation():
judge_type=judge_type,
judge_model=judge_model_name
)
+ # Record batch metrics
+ record_batch_evaluation(judge_type, evaluated_count)
+
+ # Record a successful scheduler run
+ record_scheduler_run("success")
logger.info(
f"Batch evaluation completed: {evaluated_count}/{len(pending_logs)} logs evaluated"
)
except Exception as e:
+ # Record a failed scheduler run
+ record_scheduler_run("error")
logger.error(f"Batch evaluation failed: {str(e)}")
db.rollback()
finally:
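The per-log timing added to `run_batch_evaluation` above (start a timer, then record one sample on either the success or the error path) can be sketched in isolation as follows. This is a minimal illustration under stated assumptions, not the project's code: `timed_evaluation` and the in-memory `recorded` list are hypothetical, and `record_evaluation` here is a stand-in that only mimics the call shape used in the diff. The sketch uses `time.perf_counter()` rather than `time.time()` because it is monotonic; the diff itself uses `time.time()`.

```python
import time
from contextlib import contextmanager

# Hypothetical in-memory stand-in for record_evaluation from .metrics.
recorded = []


def record_evaluation(judge_type, status, duration, scores=None):
    recorded.append((judge_type, status, duration, scores))


@contextmanager
def timed_evaluation(judge_type):
    """Time the wrapped block and record exactly one sample per evaluation."""
    start = time.perf_counter()
    result = {}  # the caller may store its scores here
    try:
        yield result
    except Exception:
        # Error path: record the duration without scores, then re-raise.
        record_evaluation(judge_type, "error", time.perf_counter() - start)
        raise
    else:
        # Success path: record the duration together with any scores.
        record_evaluation(
            judge_type, "success", time.perf_counter() - start, result.get("scores")
        )


# Successful evaluation.
with timed_evaluation("rule") as result:
    result["scores"] = {"overall": 4}

# Failing evaluation: the sample is still recorded before the error propagates.
try:
    with timed_evaluation("rule"):
        raise ValueError("boom")
except ValueError:
    pass
```

Keeping the timer and both recording paths in one place makes it harder to forget the error-path sample when the evaluation body changes.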
diff --git a/services/evaluator/pyproject.toml b/services/evaluator/pyproject.toml
index 09187a5..b10681f 100644
--- a/services/evaluator/pyproject.toml
+++ b/services/evaluator/pyproject.toml
@@ -14,6 +14,9 @@ dependencies = [
"openai",
"apscheduler>=3.10",
"httpx",
+ "prometheus-client>=0.19.0",
+ "aiosmtplib>=3.0",
+ "email-validator>=2.0",
]
[build-system]
diff --git a/services/gateway-api/app/main.py b/services/gateway-api/app/main.py
index 437a697..5b19e3d 100644
--- a/services/gateway-api/app/main.py
+++ b/services/gateway-api/app/main.py
@@ -1,8 +1,10 @@
-from fastapi import FastAPI, Depends, Query
+from fastapi import FastAPI, Depends, Query, Response
from fastapi.middleware.cors import CORSMiddleware
from sqlalchemy.orm import Session
from sqlalchemy import func, select, distinct
import math
+import time
+from prometheus_client import generate_latest, CONTENT_TYPE_LATEST
from .db import Base, engine, get_db
from .models import LLMLog, LLMEvaluation
@@ -21,12 +23,21 @@
)
from .llm_client import call_llm
from .config import settings
+from .metrics import (
+ MetricsMiddleware,
+ record_llm_request,
+ record_db_query,
+ record_log_saved,
+)
# Create tables on first run (simple version)
Base.metadata.create_all(bind=engine)
app = FastAPI(title="LLM Quality Observer - Gateway API")
+# Add the Prometheus metrics middleware
+app.add_middleware(MetricsMiddleware)
+
# Add CORS settings (required for API calls from the web dashboard)
app.add_middleware(
CORSMiddleware,
@@ -42,6 +53,12 @@ def health_check():
return {"status": "ok"}
+@app.get("/metrics")
+def metrics():
+ """Prometheus metrics endpoint."""
+ return Response(content=generate_latest(), media_type=CONTENT_TYPE_LATEST)
+
+
def resolve_model_version(request_model: str | None) -> str:
"""
If the model_version in the request is missing or is the Swagger default ("string"),
@@ -58,9 +75,19 @@ def chat(request: ChatRequest, db: Session = Depends(get_db)):
used_model = resolve_model_version(request.model_version)
# Call the LLM (passing the name of the model to use)
+ llm_start = time.time()
response_text, latency_ms = call_llm(request.prompt, used_model)
+ llm_duration = time.time() - llm_start
+
+ # Record LLM metrics
+ record_llm_request(
+ model=used_model,
+ status="success",
+ duration_seconds=llm_duration
+ )
# Save the log to the DB
+ db_start = time.time()
log = LLMLog(
user_id=request.user_id,
prompt=request.prompt,
@@ -72,6 +99,11 @@ def chat(request: ChatRequest, db: Session = Depends(get_db)):
db.add(log)
db.commit()
db.refresh(log)
+ db_duration = time.time() - db_start
+
+ # Record DB metrics
+ record_db_query(operation="insert", table="llm_logs", duration_seconds=db_duration)
+ record_log_saved(status="success")
# Response to the client
return ChatResponse(
diff --git a/services/gateway-api/app/metrics.py b/services/gateway-api/app/metrics.py
new file mode 100644
index 0000000..37fef1b
--- /dev/null
+++ b/services/gateway-api/app/metrics.py
@@ -0,0 +1,161 @@
+"""
+Prometheus metric definitions and collection.
+"""
+
+from prometheus_client import Counter, Histogram, Gauge, Info
+import time
+
+# Application info
+app_info = Info('llm_gateway_app', 'LLM Gateway API application info')
+app_info.info({'version': '0.5.0', 'service': 'gateway-api'})
+
+# HTTP request metrics
+http_requests_total = Counter(
+ 'llm_gateway_http_requests_total',
+ 'Total HTTP requests',
+ ['method', 'endpoint', 'status']
+)
+
+http_request_duration_seconds = Histogram(
+ 'llm_gateway_http_request_duration_seconds',
+ 'HTTP request latency',
+ ['method', 'endpoint']
+)
+
+# LLM call metrics
+llm_requests_total = Counter(
+ 'llm_gateway_llm_requests_total',
+ 'Total LLM requests',
+ ['model', 'status']
+)
+
+llm_request_duration_seconds = Histogram(
+ 'llm_gateway_llm_request_duration_seconds',
+ 'LLM request latency in seconds',
+ ['model'],
+ buckets=(0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0, float('inf'))
+)
+
+llm_tokens_total = Counter(
+ 'llm_gateway_llm_tokens_total',
+ 'Total tokens processed',
+ ['model', 'type'] # the 'type' label is one of: prompt, completion
+)
+
+# Database metrics
+db_queries_total = Counter(
+ 'llm_gateway_db_queries_total',
+ 'Total database queries',
+ ['operation', 'table']
+)
+
+db_query_duration_seconds = Histogram(
+ 'llm_gateway_db_query_duration_seconds',
+ 'Database query latency',
+ ['operation', 'table']
+)
+
+# Log persistence metrics
+logs_saved_total = Counter(
+ 'llm_gateway_logs_saved_total',
+ 'Total logs saved to database',
+ ['status']
+)
+
+# Current-state gauges
+active_requests = Gauge(
+ 'llm_gateway_active_requests',
+ 'Number of active HTTP requests'
+)
+
+
+class MetricsMiddleware:
+ """
+ FastAPI middleware that automatically collects HTTP request metrics.
+ """
+
+ def __init__(self, app):
+ self.app = app
+
+ async def __call__(self, scope, receive, send):
+ if scope['type'] != 'http':
+ await self.app(scope, receive, send)
+ return
+
+ method = scope['method']
+ path = scope['path']
+
+ # Skip the /metrics endpoint
+ if path == '/metrics':
+ await self.app(scope, receive, send)
+ return
+
+ active_requests.inc()
+ start_time = time.time()
+
+ async def send_wrapper(message):
+ if message['type'] == 'http.response.start':
+ status_code = message['status']
+ duration = time.time() - start_time
+
+ # Record metrics
+ http_requests_total.labels(
+ method=method,
+ endpoint=path,
+ status=status_code
+ ).inc()
+
+ http_request_duration_seconds.labels(
+ method=method,
+ endpoint=path
+ ).observe(duration)
+
+ await send(message)
+
+ try:
+ await self.app(scope, receive, send_wrapper)
+ finally:
+ active_requests.dec()
+
+
+def record_llm_request(model: str, status: str, duration_seconds: float, tokens: dict | None = None):
+ """
+ Record LLM request metrics.
+
+ Args:
+ model: model name
+ status: 'success' or 'error'
+ duration_seconds: request duration (seconds)
+ tokens: {'prompt': int, 'completion': int}
+ """
+ llm_requests_total.labels(model=model, status=status).inc()
+ llm_request_duration_seconds.labels(model=model).observe(duration_seconds)
+
+ if tokens:
+ if 'prompt' in tokens:
+ llm_tokens_total.labels(model=model, type='prompt').inc(tokens['prompt'])
+ if 'completion' in tokens:
+ llm_tokens_total.labels(model=model, type='completion').inc(tokens['completion'])
+
+
+def record_db_query(operation: str, table: str, duration_seconds: float):
+ """
+ Record database query metrics.
+
+ Args:
+ operation: 'insert', 'select', 'update', 'delete'
+ table: table name
+ duration_seconds: query duration (seconds)
+ """
+ db_queries_total.labels(operation=operation, table=table).inc()
+ db_query_duration_seconds.labels(operation=operation, table=table).observe(duration_seconds)
+
+
+def record_log_saved(status: str):
+ """
+ Record log persistence metrics.
+
+ Args:
+ status: 'success' or 'error'
+ """
+ logs_saved_total.labels(status=status).inc()
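The `send_wrapper` technique used by `MetricsMiddleware` above (intercepting the `http.response.start` ASGI message to learn the status code) can be exercised without FastAPI or Prometheus. The sketch below is illustrative only and uses just the standard library: `StatusRecordingMiddleware`, `demo_app`, and the `observed` list are hypothetical names, with the list standing in for the real counters and histograms.

```python
import asyncio
import time

# Hypothetical stand-in for the Prometheus counter and histogram.
observed = []


class StatusRecordingMiddleware:
    """Minimal ASGI middleware that records method, path, status, and duration."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            # Pass non-HTTP traffic (lifespan, websocket) through untouched.
            await self.app(scope, receive, send)
            return

        start = time.perf_counter()

        async def send_wrapper(message):
            # The status code is only visible on the response-start message.
            if message["type"] == "http.response.start":
                observed.append(
                    (
                        scope["method"],
                        scope["path"],
                        str(message["status"]),
                        time.perf_counter() - start,
                    )
                )
            await send(message)

        await self.app(scope, receive, send_wrapper)


async def demo_app(scope, receive, send):
    """Tiny ASGI app that always answers 200 with the body b'ok'."""
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"ok"})


async def main():
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    scope = {"type": "http", "method": "GET", "path": "/health"}
    await StatusRecordingMiddleware(demo_app)(scope, receive, send)
    return sent


sent_messages = asyncio.run(main())
```

Because the sample is taken when `http.response.start` is sent, the recorded duration is the time until the response headers go out, which for streaming responses can be shorter than the time to deliver the full body.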
diff --git a/services/gateway-api/pyproject.toml b/services/gateway-api/pyproject.toml
index a10122e..6e9ede0 100644
--- a/services/gateway-api/pyproject.toml
+++ b/services/gateway-api/pyproject.toml
@@ -13,6 +13,7 @@ dependencies = [
"httpx",
"python-dotenv",
"openai",
+ "prometheus-client>=0.19.0",
]
[build-system]