diff --git a/docs/HINDSIGHT_EVALUATION.md b/docs/HINDSIGHT_EVALUATION.md
new file mode 100644
index 00000000..caccbc80
--- /dev/null
+++ b/docs/HINDSIGHT_EVALUATION.md
@@ -0,0 +1,412 @@
+# HUF × Hindsight Memory Evaluation
+
+## Executive Recommendation
+
+**Recommended option: B (Hindsight as an embedded sidecar service, opt-in per agent).**
+
+**Why:**
+- HUF’s current knowledge stack is strong for **static/document knowledge retrieval** (ingestion + chunking + FTS/vector search + prompt injection), but it is not a full **learning memory** system with reflection/consolidation.
+- Hindsight is specialized for **learned memory over time** (retain/recall/reflect, memory banks, directives/dispositions).
+- Sidecar integration preserves HUF’s current strengths while adding capabilities HUF currently does not have, without forcing a risky full-platform rewrite.
+
+---
+
+## Scope & Method Notes
+
+### What was requested
+You asked for a five-phase evaluation, including cloning and code-level study of Hindsight and a comparative architecture analysis.
+
+### What was done
+- I attempted to clone Hindsight to `/tmp/hindsight` exactly as requested.
+- The clone failed due to network restrictions in this environment (`CONNECT tunnel failed, response 403`).
+- I then performed a best-effort fallback study using:
+  - Hindsight GitHub repo pages and raw files accessible via the web tool.
+  - Hindsight docs site pages (Developer Guide / API sections / FAQs / integrations pages).
+  - Full local HUF code review in this repository.
+
+### Confidence labeling
+- **High confidence:** HUF-side architecture findings (direct source code review).
+- **Medium confidence:** Hindsight-side deep internals (direct local clone/source traversal was blocked, so conclusions rely on public docs + accessible repo pages).
+
+---
+
+## Phase 1: Hindsight Study Notes
+
+## 1) `AGENTS.md` / `CLAUDE.md`
+
+### Findings
+- `AGENTS.md` in Hindsight is minimal and points to `CLAUDE.md`.
+- `CLAUDE.md` is also concise and primarily references docs and workflows rather than containing extensive internal code conventions.
+
+### Practical implication for integration
+- Architectural truth for Hindsight is primarily in docs and API behavior rather than deep project-local agent instructions.
+
+## 2) `README.md`
+
+### Findings
+- Positions Hindsight as memory that helps agents **learn over time**, not only retrieve history.
+- Exposes two integration modes:
+  1. **LLM wrapper** (low-friction memory insertion around LLM calls).
+  2. **Explicit API operations** (`retain`, `recall`, `reflect`).
+- Deployment options include local Docker, embedded modes, and external PostgreSQL-oriented setup.
+
+### Practical implication for integration
+- Hindsight can be integrated incrementally via API without replacing current HUF execution architecture.
+
+## 3) `skills/` and docs references
+
+### Findings
+- Hindsight docs repeatedly advertise a `hindsight-docs` skill for coding assistants.
+- Docs structure suggests productized API-first consumption with SDKs/clients and integrations.
+
+### Practical implication
+- Hindsight is designed to be consumed as infrastructure from other agent frameworks, which fits HUF’s extensible tool architecture.
+
+## 4) `docs/` (developer/API behavior)
+
+### Findings (from docs pages)
+- Core operations are conceptually:
+  - **Retain:** ingest experiences/content into memory.
+  - **Recall:** retrieve relevant memories.
+  - **Reflect:** synthesize disposition-aware responses / higher-order reasoning over retained memory.
+- **Memory banks** are isolated containers and appear to be the main tenancy boundary.
+- Docs and examples show additional governance concepts (e.g., directives/dispositions) and memory lifecycle patterns.
+- FAQ and integration guidance emphasize re-retain/update workflows and full-context retention rather than simplistic pre-summarized facts.
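The retain/recall/reflect operations and bank isolation listed above can be pictured with a toy stand-in. This is a minimal sketch for reasoning about the contract only: `MemoryClient` and `InMemoryClient` are hypothetical names, the method signatures are assumptions rather than Hindsight's actual client API, and keyword overlap stands in for whatever retrieval Hindsight really performs.

```python
from collections import defaultdict
from typing import Protocol


class MemoryClient(Protocol):
    """Hypothetical interface; NOT Hindsight's actual client API."""

    def retain(self, bank_id: str, content: str) -> None: ...
    def recall(self, bank_id: str, query: str, top_k: int = 5) -> list[str]: ...
    def reflect(self, bank_id: str, query: str) -> str: ...


class InMemoryClient:
    """Toy stand-in: keyword overlap instead of real memory retrieval."""

    def __init__(self) -> None:
        # One list of retained texts per bank; banks never share entries.
        self._banks: dict[str, list[str]] = defaultdict(list)

    def retain(self, bank_id: str, content: str) -> None:
        self._banks[bank_id].append(content)

    def recall(self, bank_id: str, query: str, top_k: int = 5) -> list[str]:
        terms = set(query.lower().split())
        scored = []
        # .get() avoids creating an empty bank when recalling from an unknown ID.
        for position, text in enumerate(self._banks.get(bank_id, [])):
            overlap = len(terms & set(text.lower().split()))
            if overlap:
                scored.append((-overlap, position, text))
        # Highest overlap first; ties keep insertion order.
        return [text for _, _, text in sorted(scored)[:top_k]]

    def reflect(self, bank_id: str, query: str) -> str:
        # A real reflect step would involve an LLM; this only joins recalled text.
        return " ".join(self.recall(bank_id, query))


client: MemoryClient = InMemoryClient()
client.retain("bank-a", "User prefers weekly summary emails")
client.retain("bank-b", "User prefers dark mode")
print(client.recall("bank-a", "what does the user prefer"))
print(client.recall("bank-b", "what does the user prefer"))
```

Each recall only sees what was retained in its own bank, which is the property the tenancy-boundary finding depends on.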
+
+### Practical implication
+- Hindsight’s abstraction layer (bank + retain/recall/reflect) maps naturally to HUF agent+user/session identity and could be wrapped as tools.
+
+## 5) `hindsight-api-slim` module (best-effort)
+
+### Findings
+- I could confirm the high-level repo tree (`hindsight-api-slim/hindsight_api/...`) but could not clone the repo or inspect each source file in this environment.
+- Public docs indicate server-side API orchestration around retain/recall/reflect, embeddings, and LLM wrapping.
+
+### Practical implication
+- Exact module-level internals should be validated in a follow-up once clone/network access is available, but the API contract appears stable enough for sidecar prototyping.
+
+## 6) `hindsight-integrations`
+
+### Findings
+- Hindsight publishes direct integrations and external API modes across agent ecosystems/tools.
+- Integration docs for external API mode reinforce the pattern: keep app logic where it is, call Hindsight memory operations remotely.
+
+### Practical implication
+- The sidecar model is aligned with how Hindsight expects production adoption.
+
+---
+
+## Phase 2: HUF Current State (Code-Based)
+
+## Knowledge ingestion/retrieval architecture
+
+### Ingestion
+- `KnowledgeInput` validates source type (File/Text/URL), hashes content for dedupe, and queues async processing after insert.
+- `process_knowledge_input()` in `indexer.py` performs extraction, sentence-aware chunking, and backend indexing with per-source locking and status transitions.
+
+### Chunking
+- `chunkers/sentence.py` uses LlamaIndex `SentenceSplitter` with a fallback chunker.
+
+### Backends
+- `Knowledge Source` supports `sqlite_fts` and `sqlite_vec` types.
+- `sqlite_fts.py` implements FTS5 schema + BM25 ranking in per-source SQLite artifacts.
+
+### Retrieval
+- `retriever.py` `knowledge_search()` searches one or many sources, respects source readiness/permissions, and returns top-k merged by score.
+- `context_builder.py` injects mandatory knowledge snippets into the prompt under a token budget.
+- `knowledge/tool.py` exposes an optional explicit `knowledge_search` tool and a knowledge source listing tool.
+
+## Conversation/session memory architecture
+
+### Persistence model
+- Conversation/session storage is first-class via:
+  - `Agent Conversation`
+  - `Agent Message`
+  - `Agent Run`
+- `conversation_manager.py` manages session-id-based active conversation lookup/creation and appends ordered messages.
+
+### Memory semantics
+- HUF persists full conversation history and supports rolling summary fields and structured `conversation_data` JSON.
+- Current memory is largely **session/conversation persistence and retrieval**, not a separate long-term reflective memory engine.
+
+## Agent configuration & memory-related controls
+
+- Agent-level controls include:
+  - `persist_conversation`
+  - `persist_user_history`
+  - context strategy/history limits/summary ratio
+  - knowledge source bindings (`Agent Knowledge` child rows with `mode`, `priority`, `max_chunks`, `token_budget`)
+- Knowledge can be **Mandatory** (prompt-injected) or **Optional** (tool-invoked).
+
+---
+
+## Phase 3: Comparative Analysis
+
+## 3.1 Capability Comparison
+
+| Capability | HUF (current) | Hindsight |
+|---|---|---|
+| Knowledge ingestion | Strong document ingestion (File/Text/URL), async processing, dedupe hash, chunking pipeline. | Strong memory ingestion via retain semantics oriented to episodic/experience memory. |
+| Retrieval method | SQLite FTS5 BM25 (and optional sqlite_vec), top-k retrieval, mandatory prompt injection + tool-based retrieval. | Recall over memory banks; docs describe richer memory retrieval and memory-aware operations. |
+| Memory consolidation | Limited: conversation summary and `conversation_data`; no dedicated multi-stage consolidation pipeline. | Core feature: retain/recall/reflect with consolidation into higher-order memory representations (per docs claims). |
+| Temporal reasoning | Minimal explicit temporal reasoning beyond chronological conversation logs. | Explicitly marketed for long-horizon memory and temporal coherence. |
+| Entity extraction & linking | Not a dedicated subsystem in current HUF knowledge stack. | Described as memory-structured reasoning with richer linkage semantics. |
+| Cross-session learning | Partial persistence (history per session/user) but no autonomous learned memory graph/model across sessions by default. | Designed for long-term learning across interactions in banks. |
+| Per-user memory isolation | Available via session_id + `persist_user_history` patterns in HUF model. | Native bank isolation; bank-per-user/per-agent patterns are first-class. |
+| Reflection / self-improvement | No dedicated reflect operation in knowledge subsystem. | Reflect operation is a first-class API primitive. |
+| LLM provider support | Broad via LiteLLM + custom provider routing in HUF. | Supports multiple providers (docs indicate provider/model configurable) but memory API is the core abstraction. |
+| Deployment model | In-process within Frappe app; MariaDB + per-source SQLite artifacts. | Separate service/embedded modes; frequently paired with PostgreSQL/pgvector style deployment options. |
+
+## 3.2 Strengths Assessment
+
+### HUF strengths
+
+**Better than Hindsight:**
+- Tight integration with Frappe DocTypes, permissions, business workflows, and MCP tool ecosystem.
+- Lower operational complexity for static knowledge retrieval in current architecture.
+- Strong tenant/application alignment with existing HUF agent config UX.
+
+**Does what Hindsight does not (in this context):**
+- Native coupling to HUF-specific automation flows (Doc Event, scheduled execution, agent tools bound to business data).
+
+**Architectural constraints:**
+- SQLite-based knowledge artifacts can become operationally awkward for large multi-tenant high-write scenarios.
+- Memory is still retrieval-centric and session-centric; advanced consolidation/reflection must be custom-built.
+
+### Hindsight strengths
+
+**Better than HUF:**
+- Purpose-built long-term agent memory lifecycle (retain/recall/reflect) rather than only RAG-style retrieval.
+- Better abstraction for learned memory and behavior adaptation over time.
+
+**Does what HUF does not:**
+- Reflection primitive as a first-class operation.
+- Bank abstraction for isolated memory lifecycles with memory directives/dispositions.
+
+**Architectural constraints:**
+- Additional infra overhead (service lifecycle, storage dependencies, monitoring).
+- Potentially higher LLM cost due to multi-stage memory operations (retain + reflect in addition to generation).
+- External project dependency and version drift risk.
+
+## 3.3 Overlap & Conflict
+
+### Overlap
+- Both systems solve “bring past information into current reasoning.”
+- Both expose retrieval-like mechanisms and can be called as tools during generation.
+
+### Potential conflicts if run simultaneously
+- Duplicate memory writes (same interaction retained in HUF conversation store and Hindsight bank).
+- Contradictory context if HUF mandatory knowledge and Hindsight recall produce divergent facts.
+- Extra token/cost pressure if both memory pipelines inject large context.
+
+### Same problem or different problem?
+- **Conclusion:** Mostly **different but adjacent** problems.
+  - HUF knowledge system: best for curated/static corpus retrieval.
+  - Hindsight memory system: best for evolving learned memory and adaptation.
+
+---
+
+## Phase 4: Integration Decision Analysis
+
+## Option A: Knowledge Transfer Only (port ideas into HUF)
+
+### What to port
+- Retain/recall/reflect lifecycle semantics.
+- Memory bank abstraction with per-agent/per-user boundaries.
+- Consolidation layers (raw events → distilled observations → stable preferences/mental models).
+- Temporal indexing and entity linking.
+
+### Effort estimate
+- **High (8–16+ weeks)** for a robust first version, due to design, schema, migration, evaluation tooling, and cost/quality tuning.
+
+### What you lose
+- Time-to-value: a native build is slow to deliver.
+- Engineering effort, spent re-inventing mechanisms already available in Hindsight.
+- Certainty of benchmark parity with Hindsight.
+
+### When A is attractive
+- If the strategic requirement is zero external runtime dependencies and full in-house control.
+
+---
+
+## Option B: Hindsight as Embedded Sidecar (Recommended)
+
+### Integration approach
+- Keep HUF RAG stack as-is for static corpora.
+- Add optional Hindsight memory toolset per agent:
+  - `hindsight_retain`
+  - `hindsight_recall`
+  - `hindsight_reflect`
+- Add runtime orchestration policy:
+  - On each user turn: selective retain.
+  - Before response: recall and optionally reflect for eligible agents.
+
+### Identity mapping proposal
+- **Bank ID pattern:**
+  - `huf:{site}:{agent}:{user}` for user-isolated memory.
+  - `huf:{site}:{agent}:shared` when shared memory is desired.
+- Map from existing HUF controls:
+  - `persist_user_history=1` ⇒ per-user bank.
+  - `persist_user_history=0` ⇒ shared agent bank.
+
+### Ops overhead
+- New service deployment + health checks + secrets management.
+- Optional PostgreSQL/pgvector infra (depending on Hindsight deployment mode chosen).
+- Increased LLM spend from memory operations (retain/reflect) requiring policy controls.
+
+### UI/UX feasibility
+- Add per-agent toggle in Agent DocType:
+  - `enable_hindsight_memory`
+  - `hindsight_mode` (`off` / `assistive` / `primary`)
+  - per-turn budget caps and max retrieved memories.
+
+### Why B wins
+- Fastest path to real long-term memory capability with bounded platform risk.
+- Preserves existing HUF value and customer workflows.
+- Reversible: can be disabled per agent if quality or cost is not acceptable.
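The identity mapping proposed for Option B could be sketched as follows. This is an illustrative sketch, not existing HUF code: `resolve_bank_id` borrows its name from the appendix pseudocode, while the `_segment` helper, the allowed character set, and the rejection of a user literally named `shared` are assumptions added so that one segment cannot spill into another or collide with the shared bank.

```python
import re

# Characters outside this set are replaced so ':' can only act as a separator.
_UNSAFE = re.compile(r"[^A-Za-z0-9._@-]")
_SHARED = "shared"


def _segment(value: str) -> str:
    """Normalize one bank-ID segment (hypothetical helper)."""
    cleaned = _UNSAFE.sub("_", value.strip())
    if not cleaned:
        raise ValueError("bank ID segments must not be empty")
    return cleaned


def resolve_bank_id(site: str, agent: str, user: str, persist_user_history: bool) -> str:
    """Build `huf:{site}:{agent}:{user}` or `huf:{site}:{agent}:shared`."""
    if persist_user_history:
        scope = _segment(user)
        if scope == _SHARED:
            # Assumption: reserve the name so a user cannot land in the shared bank.
            raise ValueError("user name collides with the shared bank scope")
    else:
        scope = _SHARED
    return f"huf:{_segment(site)}:{_segment(agent)}:{scope}"


print(resolve_bank_id("site1.example.com", "Support Agent", "ann@example.com", True))
print(resolve_bank_id("site1.example.com", "Support Agent", "ann@example.com", False))
```

Replacing unsafe characters can map two distinct names to the same segment (for example `a b` and `a_b`), so a production resolver would likely need a reversible encoding or a hash suffix; that choice is left open here.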
+
+---
+
+## Option C: Hindsight as Core Memory Layer (replace HUF memory/knowledge)
+
+### What breaks
+- Existing HUF knowledge source UX and indexing workflows would need major re-plumbing.
+- Mandatory/optional knowledge semantics and per-source SQLite artifacts would become legacy or require migration adapters.
+
+### Migration concerns
+- Need full migration from Knowledge Source / Knowledge Input model to Hindsight-compatible retain corpus and bank assignment.
+- Risk of retrieval behavior regressions for static documentation use cases.
+
+### Architecture conflict
+- HUF is centered on Frappe + MariaDB app patterns with lightweight per-source SQLite knowledge files.
+- Forcing Hindsight as canonical backend introduces non-trivial operational and conceptual coupling.
+
+### Maintenance burden
+- Highest dependency risk and lock-in to external roadmap changes.
+
+### Verdict on C
+- Not recommended at this stage.
+
+---
+
+## Phase 5: Recommendation, Plan, Architecture, Risks
+
+## 1) Final recommendation
+
+**Choose Option B now.**
+
+**Opinion (explicit):** this is the best risk-adjusted route because it adds true learned memory without destabilizing HUF’s production-critical Frappe knowledge and automation model.
+
+## 2) Phase-1 implementation (2-week scope)
+
+### Week 1
+1. Add config and feature flags:
+   - New agent fields: `enable_hindsight_memory`, `hindsight_bank_strategy`, `hindsight_recall_top_k`, `hindsight_reflect_enabled`, budget fields.
+2. Implement Hindsight API client wrapper in `huf/ai/`.
+3. Implement bank-id resolver from `(site, agent, user, persist_user_history)`.
+4. Add resilient fail-open behavior (if Hindsight unavailable, continue with current HUF behavior).
+
+### Week 2
+5. Integrate into `AgentManager` execution pipeline:
+   - pre-response recall
+   - post-turn retain
+   - optional reflect call
+6. Store memory-operation telemetry in `Agent Run` metadata (latency, token cost, result counts).
+7. Add admin docs + runbook + feature toggle in UI.
+8. Pilot with 1–2 internal agents and collect quality/cost metrics.
+
+### Concrete HUF touch points for Phase-1
+
+| Area | Existing file(s) | Change required | Notes |
+|---|---|---|---|
+| Agent settings model | `huf/huf/doctype/agent/agent.json`, `huf/huf/doctype/agent/agent.py`, `huf/huf/doctype/agent/agent.js` | Add optional Hindsight fields (enable flag, bank strategy, recall top_k, reflect toggle, timeout/budget caps). | Keep defaults off to preserve current behavior. |
+| Runtime orchestration | `huf/ai/agent_integration.py` | Add sidecar call hooks around generation (pre: recall, post: retain, optional reflect). | Use timeouts and fail-open fallback. |
+| Conversation identity mapping | `huf/ai/conversation_manager.py` | Reuse `session_id`/`external_id` patterns when deriving bank IDs. | Align with `persist_user_history`. |
+| Knowledge coexistence policy | `huf/ai/knowledge/context_builder.py`, `huf/ai/knowledge/tool.py` | Add merge strategy for HUF knowledge context + Hindsight recall context. | Prevent duplicate/contradictory context injection. |
+| Run telemetry | `huf/huf/doctype/agent_run/agent_run.json` and run-write path in `agent_integration.py` | Store Hindsight latency/cost/error metrics in metadata JSON. | Needed for pilot evaluation. |
+| Config/secrets | `huf/hook.py` / environment config path in deployment setup | Add sidecar URL + auth secret + per-site toggle. | Must be site-aware for multi-tenant installs. |
+
+## 3) Architecture sketch
+
+```text
+User -> HUF Agent Runtime
+          |\
+          | \--(existing)--> HUF Knowledge Search (sqlite_fts/sqlite_vec) -> Prompt Context
+          |
+          \----(optional per-agent)--> Hindsight Sidecar API
+                                       |-- retain(user turn / tool results)
+                                       |-- recall(current query)
+                                       \-- reflect(optional synthesis)
+
+HUF persists canonical conversation/run records in Frappe DocTypes.
+Hindsight stores learned memory in bank-scoped memory store.
+```
+
+## 4) Risk register
+
+| Risk | Impact | Likelihood | Mitigation |
+|---|---|---:|---|
+| Sidecar outage | Agent quality drop / possible failures | Medium | Fail-open design: skip Hindsight and continue with HUF core path. |
+| Token/cost blow-up from retain+reflect | Budget overrun | High | Per-agent rate/budget caps, sampling policy, batched retain, reflect toggle default off. |
+| Conflicting context (HUF knowledge vs Hindsight recall) | Incorrect responses | Medium | Prompt policy + source attribution + ranking rules + conflict resolution heuristic. |
+| Tenant isolation mistakes in bank mapping | Data leakage | Low/Med | Deterministic bank-id format + tests + explicit ACL validation. |
+| External dependency drift | Maintenance overhead | Medium | Pin versions, contract tests, fallback compatibility layer. |
+| Performance latency | Slower responses | Medium | Async/non-blocking retain, recall timeout budgets, cached bank metadata. |
+
+## 5) Decision revisit criteria
+
+Revisit the Option B decision if any of the following occur:
+1. Hindsight introduces a mode that fully satisfies HUF static knowledge ingestion/search use cases with equal or better UX/cost.
+2. HUF scale reaches thresholds where unified external memory infra clearly outperforms per-source SQLite artifacts (e.g., sustained high-concurrency writes, large tenant counts).
+3. Sidecar memory quality remains below target after 2 release cycles despite tuning.
+4. Compliance/ops policy disallows external sidecar dependencies for customer deployments.
+
+## 6) Pilot success criteria (go/no-go)
+
+Use explicit criteria so Option B can be validated quickly:
+
+| Metric | Target in pilot | Why it matters |
+|---|---:|---|
+| Response quality (human eval on memory-heavy prompts) | +20% vs baseline agent without Hindsight | Proves memory value beyond static RAG. |
+| P95 latency overhead | <= +900 ms | Keeps UX acceptable. |
+| Extra token cost per successful run | <= +25% median | Controls operational spend. |
+| Sidecar failure impact | 0 hard failures (fail-open only) | Ensures reliability/safety in production. |
+| Memory leakage incidents across users | 0 | Validates bank-isolation mapping. |
+
+If two or more targets are missed for two consecutive weeks, keep Hindsight behind a feature flag and continue with Option A-style incremental native improvements.
+
+---
+
+## Implementation Appendix: Example integration pseudocode
+
+```python
+# Inside the agent execution flow; `hindsight`, `llm`, and the helper functions are placeholders.
+if agent.enable_hindsight_memory:
+    bank_id = resolve_bank_id(site, agent.name, user, agent.persist_user_history)
+
+    recalled = hindsight.recall(
+        bank_id=bank_id,
+        query=user_prompt,
+        top_k=agent.hindsight_recall_top_k,
+    )
+
+    prompt = inject_hindsight_memories(prompt, recalled)
+
+response = llm.generate(prompt)
+
+if agent.enable_hindsight_memory:
+    hindsight.retain(
+        bank_id=bank_id,
+        content=serialize_turn(user_prompt, response, tool_events),
+    )
+
+    if agent.hindsight_reflect_enabled:
+        reflection = hindsight.reflect(bank_id=bank_id, query=user_prompt)
+        store_reflection_metadata(run_id, reflection)
+```
+
+---
+
+## Final decision statement
+
+**Ship Option B (sidecar) first, keep HUF RAG as canonical static knowledge, and treat Hindsight as an opt-in learned-memory subsystem.**
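As a supplement to the appendix pseudocode, the fail-open behavior named in the Week 1 plan and the risk register could be wrapped around each sidecar call roughly like this. It is a sketch under assumptions: `fail_open` is a hypothetical helper, the 0.9 s default merely mirrors the pilot's P95 latency target, and a thread pool is only one way to enforce a timeout.

```python
import logging
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)
_pool = ThreadPoolExecutor(max_workers=4)


def fail_open(call: Callable[[], T], default: T, timeout_s: float = 0.9) -> T:
    """Run a sidecar call; on timeout or any error, log it and return `default`."""
    future = _pool.submit(call)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # Best effort only: a call that is already running is not interrupted.
        future.cancel()
        logger.warning("memory sidecar call timed out after %.1fs", timeout_s)
    except Exception:
        logger.exception("memory sidecar call failed")
    return default


def unreachable_recall() -> list[str]:
    # Simulates the sidecar being down.
    raise ConnectionError("sidecar unreachable")


memories = fail_open(unreachable_recall, default=[])
print(memories)
```

With an empty default, the agent continues on the existing HUF path with no injected memories, which is what the "0 hard failures" pilot criterion requires.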