48 changes: 42 additions & 6 deletions README.md
@@ -105,14 +105,14 @@ Full migration script (uninstalls Engram, migrates data, reconfigures agents)
- **Vector Search** — Optional hybrid search combining FTS5 with embedding similarity via RRF fusion (requires `cortex_vectors` build tag).
- **HTTP REST API** — 17 JSON endpoints for observations, sessions, search, graph, and scoring.

## MCP Tools (19 total)
## MCP Tools (22 total)

### Core Tools (Engram-Compatible)
### Core Tools (Engram-Compatible — 14)

| Tool | Purpose |
|------|---------|
| `mem_save` | Save observation with What/Why/Where/Learned format |
| `mem_search` | Full-text search with filters |
| `mem_search` | Full-text search with filters + optional graph expansion |
| `mem_context` | Recent session context aggregation |
| `mem_session_summary` | End-of-session comprehensive save (**mandatory**) |
| `mem_get_observation` | Full content by ID |
@@ -126,7 +126,7 @@ Full migration script (uninstalls Engram, migrates data, reconfigures agents)
| `mem_stats` | Memory system statistics |
| `mem_timeline` | Chronological drill-in around an observation |

### Cortex-Exclusive Tools
### Cortex-Exclusive Tools (8)

| Tool | Purpose |
|------|---------|
@@ -135,6 +135,9 @@ Full migration script (uninstalls Engram, migrates data, reconfigures agents)
| `mem_score` | Get/recalculate importance score |
| `mem_archive` | Archive an observation (soft-delete) |
| `mem_search_hybrid` | Hybrid FTS5 + vector search with RRF fusion |
| `mem_search_temporal` | Search as of a specific date (temporal reasoning) |
| `mem_consolidate` | Find duplicate topic keys for consolidation |
| `mem_project_dna` | Generate structured project summary from observations |

Full tool reference → [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md)
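
The RRF fusion behind `mem_search_hybrid` merges the FTS5 ranking and the vector-similarity ranking by summing `1/(k + rank)` per document across both lists. A minimal sketch (the constant `k = 60` is the conventional RRF default, not confirmed from Cortex's source; IDs are invented):

```go
package main

import (
	"fmt"
	"sort"
)

// rrfFuse merges ranked result lists with Reciprocal Rank Fusion:
// each id scores the sum of 1/(k + rank) over every list it appears in,
// so items ranked well by either retriever float to the top.
func rrfFuse(k float64, lists ...[]string) []string {
	scores := map[string]float64{}
	for _, list := range lists {
		for rank, id := range list {
			scores[id] += 1.0 / (k + float64(rank+1))
		}
	}
	ids := make([]string, 0, len(scores))
	for id := range scores {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool { return scores[ids[i]] > scores[ids[j]] })
	return ids
}

func main() {
	fts := []string{"obs-3", "obs-1", "obs-7"} // FTS5 ranking
	vec := []string{"obs-1", "obs-9", "obs-3"} // vector-similarity ranking
	fmt.Println(rrfFuse(60, fts, vec))         // [obs-1 obs-3 obs-9 obs-7]
}
```

Documents appearing in both lists (`obs-1`, `obs-3`) outrank documents seen by only one retriever, which is the point of the fusion step.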

@@ -151,11 +154,38 @@ Full tool reference → [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md)
| Importance Scoring | :x: | :white_check_mark: |
| Auto-Archival | :x: | :white_check_mark: |
| Entity Linking | :x: | :white_check_mark: |
| Vector Search | :x: | :white_check_mark: (optional) |
| Vector Search (Ollama/OpenAI) | :x: | :white_check_mark: |
| Temporal Search (as-of date) | :x: | :white_check_mark: |
| Memory Consolidation | :x: | :white_check_mark: |
| Project DNA Generation | :x: | :white_check_mark: |
| Memory Health Dashboard | :x: | :white_check_mark: (TUI) |
| HTTP REST API | :white_check_mark: | :white_check_mark: |

Full comparison (including vs claude-mem) → [docs/COMPARISON.md](docs/COMPARISON.md)

## Benchmarks

Evaluated on [LOCOMO](https://github.com/snap-research/locomo) (ACL 2024) — 1,986 questions across 10 long-term conversations:

| Search Mode | single-hop | multi-hop | temporal | Improvement |
|-------------|-----------|-----------|----------|-------------|
| FTS5 only | 0.002 | 0.001 | 0.000 | baseline |
| **FTS5 + Ollama** (local) | **0.025** | **0.016** | **0.026** | **12-16x** |
| **FTS5 + OpenAI** (cloud) | **0.026** | **0.016** | **0.037** | **13-37x** |

> Scores are F1 token overlap on retrieval. Cortex finds relevant memories — the agent's LLM generates answers on top.
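
F1 token overlap can be sketched as follows; this is the standard bag-of-tokens formulation, and Cortex's exact tokenization is not shown in this diff:

```go
package main

import (
	"fmt"
	"strings"
)

// f1Overlap computes token-level F1 between an expected answer and
// retrieved text: the harmonic mean of precision and recall over
// shared (lowercased, whitespace-split) tokens.
func f1Overlap(expected, got string) float64 {
	exp := strings.Fields(strings.ToLower(expected))
	g := strings.Fields(strings.ToLower(got))
	if len(exp) == 0 || len(g) == 0 {
		return 0
	}
	counts := map[string]int{}
	for _, t := range exp {
		counts[t]++
	}
	common := 0
	for _, t := range g {
		if counts[t] > 0 {
			counts[t]--
			common++
		}
	}
	if common == 0 {
		return 0
	}
	precision := float64(common) / float64(len(g))
	recall := float64(common) / float64(len(exp))
	return 2 * precision * recall / (precision + recall)
}

func main() {
	fmt.Printf("%.2f\n", f1Overlap("moved to Paris in May", "she moved to Paris")) // 0.67
}
```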

**Embedding providers:**

| Provider | Cost | Privacy | Speed | Quality |
|----------|------|---------|-------|---------|
| **Ollama** (recommended) | Free | 100% local | Fast | Excellent |
| OpenAI | ~$0.02/1M tokens | Cloud | Moderate | Best temporal |
| None (FTS5) | Free | Local | Instant | Keyword only |

Full methodology, results, and reproducibility → [docs/BENCHMARKS.md](docs/BENCHMARKS.md)

Configuration guide → [docs/RECOMMENDATIONS.md](docs/RECOMMENDATIONS.md)

## CLI Reference

| Command | Description |
@@ -172,6 +202,9 @@ Full comparison (including vs claude-mem) → [docs/COMPARISON.md](docs/COMPARISON.md)
| `cortex export [--project P]` | Export to JSON |
| `cortex import --from-engram` | Import from Engram database |
| `cortex import --from-json` | Import from JSON file |
| `cortex reindex [--project P]` | Generate vector embeddings for all observations |
| `cortex doctor` | Run health checks on the database |
| `cortex gc [--days N]` | Garbage collect archived observations (default: 90 days) |
| `cortex migrate <up\|down\|status>` | Manage database migrations |
| `cortex version` | Show version |

@@ -187,7 +220,8 @@ database:

search:
default_limit: 20
vector: false # Enable with cortex_vectors build tag
vector: false # Enable with cortex_vectors build tag
embedding_provider: ollama # ollama (local), openai (cloud), or none

memory:
auto_archive_days: 90
@@ -207,6 +241,8 @@ Full config reference → [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md)
| [Agent Setup](docs/AGENT-SETUP.md) | Per-agent configuration + Memory Protocol |
| [Architecture](docs/ARCHITECTURE.md) | Design, scoring formula, graph, API, schema |
| [Plugins](docs/PLUGINS.md) | Claude Code hooks + OpenCode TypeScript plugin |
| [Benchmarks](docs/BENCHMARKS.md) | LOCOMO/DMR results, methodology, reproducibility |
| [Recommendations](docs/RECOMMENDATIONS.md) | Embedding provider guide, tuning, agent setup |
| [Comparison](docs/COMPARISON.md) | Why Cortex vs Engram vs claude-mem |
| [Contributing](CONTRIBUTING.md) | Contribution workflow + standards |

4 changes: 4 additions & 0 deletions bench/.gitignore
@@ -0,0 +1,4 @@
datasets/
results/*.json
!results/.gitkeep
!datasets/.gitkeep
50 changes: 50 additions & 0 deletions bench/README.md
@@ -0,0 +1,50 @@
# Cortex Benchmarks

Evaluation suite for Cortex memory retrieval against standard benchmarks.

## Benchmarks

| Benchmark | Questions | Focus | Source |
|-----------|-----------|-------|--------|
| **LOCOMO** | 7,512 | Long-term conversational memory (5 question types) | [snap-research/locomo](https://github.com/snap-research/locomo) |
| **DMR** | ~500 | Deep memory retrieval across sessions | [MemGPT/MSC-Self-Instruct](https://huggingface.co/datasets/MemGPT/MSC-Self-Instruct) |
| **LongMemEval** | 500 | Long-term interactive memory (5 abilities) | [xiaowu0162/LongMemEval](https://github.com/xiaowu0162/LongMemEval) |

## Quick Start

```bash
# 1. Download datasets
chmod +x download.sh
./download.sh

# 2. Run benchmarks
cd .. && go test ./bench/... -v -timeout 30m

# 3. Run specific benchmark with reporting
go run ./bench/locomo -data bench/datasets/locomo10.json -report bench/results/locomo.json
```

## Evaluation Methodology

1. **Ingest** — Parse dataset conversations into Cortex sessions + observations
2. **Query** — For each question, run `mem_search` (FTS5 + optional graph boost)
3. **Score** — F1 token overlap (always) + LLM-as-Judge (optional, needs API key)
4. **Aggregate** — Per-type and overall accuracy
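
The four steps above can be sketched as a single loop (the names here are illustrative, not the bench package's actual API; the retriever and scorer are stand-ins):

```go
package main

import "fmt"

// Question pairs a query with its reference answer and question type.
type Question struct {
	Text, Expected, Type string
}

// evaluate runs each question through a search function, scores the
// retrieved text against the reference (F1 or LLM judge), and
// aggregates mean score per question type.
func evaluate(qs []Question, search func(string) string, score func(expected, got string) float64) map[string]float64 {
	sums := map[string]float64{}
	counts := map[string]int{}
	for _, q := range qs {
		got := search(q.Text)       // step 2: query
		s := score(q.Expected, got) // step 3: score
		sums[q.Type] += s           // step 4: aggregate
		counts[q.Type]++
	}
	for typ := range sums {
		sums[typ] /= float64(counts[typ])
	}
	return sums
}

func main() {
	qs := []Question{{Text: "where did she move?", Expected: "paris", Type: "single-hop"}}
	search := func(string) string { return "paris" } // stand-in retriever
	exact := func(e, g string) float64 {             // stand-in scorer
		if e == g {
			return 1
		}
		return 0
	}
	fmt.Println(evaluate(qs, search, exact)["single-hop"]) // 1
}
```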

### LLM Judge (Optional)

Set an API key to enable LLM-based answer evaluation:

```bash
export OPENAI_API_KEY=sk-... # Uses gpt-4o
# OR
export ANTHROPIC_API_KEY=sk-... # Uses claude-sonnet
```

Without an API key, scoring falls back to F1 token overlap only.

## Dataset Licenses

- LOCOMO: CC BY-NC 4.0
- MSC-Self-Instruct: Apache 2.0
- LongMemEval: MIT
141 changes: 141 additions & 0 deletions bench/common/llm_judge.go
@@ -0,0 +1,141 @@
package common

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"strings"
	"time"
)

// JudgeConfig configures the LLM-as-Judge evaluator.
type JudgeConfig struct {
	Provider string // "openai" or "anthropic"
	APIKey   string
	Model    string // e.g. "gpt-4o" or "claude-sonnet-4-20250514"
}

// DefaultJudgeConfig returns config from environment variables.
func DefaultJudgeConfig() *JudgeConfig {
	if key := os.Getenv("OPENAI_API_KEY"); key != "" {
		return &JudgeConfig{Provider: "openai", APIKey: key, Model: "gpt-4o"}
	}
	if key := os.Getenv("ANTHROPIC_API_KEY"); key != "" {
		return &JudgeConfig{Provider: "anthropic", APIKey: key, Model: "claude-sonnet-4-20250514"}
	}
	return nil
}

// JudgeAnswer uses an LLM to judge whether the answer is correct given the reference.
// Returns 1.0 (correct) or 0.0 (incorrect); on failure it returns -1 with an error.
func JudgeAnswer(cfg *JudgeConfig, question, expected, got string) (float64, error) {
	if cfg == nil {
		return -1, fmt.Errorf("no LLM judge configured (set OPENAI_API_KEY or ANTHROPIC_API_KEY)")
	}

	prompt := fmt.Sprintf(
		"You are evaluating a memory system's answer.\n\n"+
			"Question: %s\n"+
			"Expected answer: %s\n"+
			"System answer: %s\n\n"+
			"Is the system answer correct? Reply with ONLY 'CORRECT' or 'INCORRECT'.",
		question, expected, got,
	)

	var response string
	var err error

	switch cfg.Provider {
	case "openai":
		response, err = callOpenAI(cfg, prompt)
	case "anthropic":
		response, err = callAnthropic(cfg, prompt)
	default:
		return -1, fmt.Errorf("unknown judge provider: %s", cfg.Provider)
	}

	if err != nil {
		return -1, err
	}

	response = strings.TrimSpace(strings.ToUpper(response))
	if strings.Contains(response, "CORRECT") && !strings.Contains(response, "INCORRECT") {
		return 1.0, nil
	}
	return 0.0, nil
}

func callOpenAI(cfg *JudgeConfig, prompt string) (string, error) {
	body := map[string]any{
		"model": cfg.Model,
		"messages": []map[string]string{
			{"role": "user", "content": prompt},
		},
		"max_tokens": 10,
	}
	data, _ := json.Marshal(body)

	req, _ := http.NewRequest("POST", "https://api.openai.com/v1/chat/completions", bytes.NewReader(data))
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Authorization", "Bearer "+cfg.APIKey)

	client := &http.Client{Timeout: 30 * time.Second}
	resp, err := client.Do(req)
	if err != nil {
		return "", fmt.Errorf("openai: %w", err)
	}
	defer func() { _ = resp.Body.Close() }()

	var result struct {
		Choices []struct {
			Message struct {
				Content string `json:"content"`
			} `json:"message"`
		} `json:"choices"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		return "", fmt.Errorf("openai decode: %w", err)
	}
	if len(result.Choices) == 0 {
		return "", fmt.Errorf("openai: no choices returned")
	}
	return result.Choices[0].Message.Content, nil
}

func callAnthropic(cfg *JudgeConfig, prompt string) (string, error) {
	body := map[string]any{
		"model":      cfg.Model,
		"max_tokens": 10,
		"messages": []map[string]string{
			{"role": "user", "content": prompt},
		},
	}
	data, _ := json.Marshal(body)

	req, _ := http.NewRequest("POST", "https://api.anthropic.com/v1/messages", bytes.NewReader(data))
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("x-api-key", cfg.APIKey)
	req.Header.Set("anthropic-version", "2023-06-01")

	client := &http.Client{Timeout: 30 * time.Second}
	resp, err := client.Do(req)
	if err != nil {
		return "", fmt.Errorf("anthropic: %w", err)
	}
	defer func() { _ = resp.Body.Close() }()

	var result struct {
		Content []struct {
			Text string `json:"text"`
		} `json:"content"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		return "", fmt.Errorf("anthropic decode: %w", err)
	}
	if len(result.Content) == 0 {
		return "", fmt.Errorf("anthropic: no content returned")
	}
	return result.Content[0].Text, nil
}