[BUG] Inconsistent Intent Indexing Due to LLM-Based Appropriateness Evaluation
Problem Summary
New users creating discovery intents experience inconsistent behavior where some intents appear in discovery results immediately while others don't, requiring manual member settings updates to trigger re-indexing. This creates a poor onboarding experience and confusing UX.
Root Cause
The intent indexing system uses an LLM-based appropriateness evaluator to determine if an intent should be added to an index. When the member prompt is generic (e.g., "everything"), the LLM produces inconsistent scores for semantically similar intents:
- Intent A: "Creating believable dynamic NPCs in video games" → Score 0.72 → ✅ Indexed
- Intent B: "Enhancing simulations using Generative AI" → Score 0.68 → ❌ Not indexed
Both intents are topically similar, but the probabilistic nature of LLM evaluation causes different scores.
Technical Details
Location: protocol/src/agents/core/intent_indexer/evaluator.ts
When a member prompt is "everything", the evaluator asks:
"Does '<intent>' match the member's sharing preference: 'everything'?"
The LLM struggles to interpret "everything" as "allow all intents" and instead tries to semantically match the intent against the word "everything", producing unreliable scores.
Threshold: Intents with scores ≤ 0.7 are rejected (not added to the intentIndexes table).
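For reference, the gating step behaves roughly like the sketch below. The evaluator signature matches the test case at the bottom of this issue; `addToIntentIndexes` is an assumed helper name.
```ts
// Rough sketch of the current gate (helper name assumed):
const score = await evaluateIntentAppropriateness(intent, indexPrompt, memberPrompt);
if (score > 0.7) {
  await addToIntentIndexes(intent); // indexed, appears in discovery
}
// Scores <= 0.7 fall through: the intent is silently skipped.
```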
Impact
- Poor Onboarding: New users don't see discovery matches immediately
- Inconsistent UX: Identical workflows produce different results
- User Confusion: Manual workaround (updating settings) shouldn't be necessary
- Lost Connections: Legitimate matches are missed due to arbitrary LLM scoring
Reproduction Steps
- Create new account
- Join an index with member prompt "everything"
- Create discovery intent: "Enhancing simulations using Generative AI"
- Observe no discovery results
- Go to "Manage what you're sharing", add a space, save
- Discovery results now appear
Short-Term Solutions
Option 1: Generic Prompt Detection (Recommended)
Detect generic member prompts and skip LLM evaluation:
```ts
const GENERIC_PROMPTS = ['everything', 'all', 'anything', 'share everything'];
// Normalize and check against the known generic patterns.
function isGenericPrompt(prompt: string): boolean {
  return GENERIC_PROMPTS.includes(prompt.trim().toLowerCase());
}
if (isGenericPrompt(memberPrompt)) {
  return 1.0; // Always allow
}
```
Pros: Quick fix, preserves evaluation for specific prompts
Cons: Hardcoded list of generic patterns
Option 2: Deterministic LLM Settings
Use temperature: 0 for consistent responses:
```ts
const response = await llm([...], { temperature: 0 });
```
Pros: More predictable scores
Cons: Doesn't fix "everything" semantic mismatch
Option 3: Result Caching
Cache evaluation results by intent+prompt hash (see the sketch below):
Pros: Guarantees consistency for same inputs
Cons: Doesn't prevent initial bad scores
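A minimal sketch of such a cache, assuming an in-memory map and an `llmEvaluate` stand-in for the existing LLM call:
```ts
import { createHash } from 'node:crypto';

declare function llmEvaluate(intent: string, prompt: string): Promise<number>;

const evalCache = new Map<string, number>();

async function cachedEvaluate(intent: string, memberPrompt: string): Promise<number> {
  // Key on a stable hash of (intent, prompt) so identical inputs reuse the first score.
  const key = createHash('sha256').update(`${intent}\u0000${memberPrompt}`).digest('hex');
  const cached = evalCache.get(key);
  if (cached !== undefined) return cached;
  const score = await llmEvaluate(intent, memberPrompt);
  evalCache.set(key, score);
  return score;
}
```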
Option 4: Lower Threshold for Discovery Intents
Use a 0.5 threshold for sourceType: 'discovery_form' (see the snippet below):
Pros: More permissive for user-created intents
Cons: Band-aid solution, doesn't address root cause
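The change could be as small as the following (only sourceType: 'discovery_form' comes from the codebase; the rest is illustrative):
```ts
// Permissive threshold for user-created discovery intents, stricter elsewhere.
const threshold = sourceType === 'discovery_form' ? 0.5 : 0.7;
const accepted = score > threshold;
```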
Recommended: Combined Approach
Implement all four for maximum stability (a combined sketch follows the list):
- Generic prompt detection (fixes "everything" case)
- Temperature=0 (improves determinism)
- Caching (ensures consistency)
- Lower threshold for discovery (more permissive)
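Sketched end to end, reusing `isGenericPrompt` from Option 1 (all other names illustrative):
```ts
declare function llmEvaluate(
  intent: string,
  prompt: string,
  opts?: { temperature: number },
): Promise<number>;

const scoreCache = new Map<string, number>();

async function shouldIndexIntent(
  intent: string,
  memberPrompt: string,
  sourceType: string,
): Promise<boolean> {
  if (isGenericPrompt(memberPrompt)) return true;                 // 1. generic prompt fast path
  const key = `${intent}::${memberPrompt}`;                       // 3. cache by intent+prompt
  let score = scoreCache.get(key);
  if (score === undefined) {
    score = await llmEvaluate(intent, memberPrompt, { temperature: 0 }); // 2. deterministic sampling
    scoreCache.set(key, score);
  }
  const threshold = sourceType === 'discovery_form' ? 0.5 : 0.7;  // 4. permissive for discovery
  return score > threshold;
}
```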
Long-Term Solutions
1. Remove LLM Evaluation Entirely
Trust users to manage their own sharing preferences.
The current system assumes users don't know what they want to share, but:
- Discovery intents are explicitly created by users
- Users can already configure member prompts
- The evaluation adds latency and cost
- False negatives harm UX more than false positives
Proposed: Simple keyword/tag-based filtering instead of LLM evaluation.
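One possible shape for that filter, assuming member preferences become a plain keyword list:
```ts
// Deterministic, zero-cost check: no LLM call, no score, no threshold.
function passesKeywordFilter(intentPayload: string, memberKeywords: string[]): boolean {
  if (memberKeywords.length === 0) return true; // no restriction configured => share
  const text = intentPayload.toLowerCase();
  return memberKeywords.some((kw) => text.includes(kw.toLowerCase()));
}
```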
2. Intent-Level Sharing Controls
Let users mark individual intents as public/private.
Instead of evaluating all intents against one prompt (see the data-model sketch after this list):
- Default: All discovery intents are shared
- Users can explicitly hide specific intents
- More intuitive and predictable
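An illustrative data model (field names assumed):
```ts
interface DiscoveryIntent {
  id: string;
  payload: string;
  visibility: 'shared' | 'hidden'; // 'shared' by default at creation
}

// Indexing becomes a plain filter instead of a probabilistic evaluation.
const discoverable = (intents: DiscoveryIntent[]) =>
  intents.filter((intent) => intent.visibility === 'shared');
```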
3. Improved Prompt Engineering
Make the evaluation task more concrete:
Current: "Does X match 'everything'?"
Better: "Should this intent be visible to others? Answer Yes/No"
Provide calibration examples in the system prompt.
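A sketch of what that system prompt could look like (exact wording is an assumption):
```ts
const SYSTEM_PROMPT = `Decide whether the intent below should be visible to other members.
Answer only "Yes" or "No".

Calibration examples:
- Preference: "everything" | Intent: "Learning Rust" -> Yes
- Preference: "only professional topics" | Intent: "Looking for hiking buddies" -> No`;
```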
4. Hybrid Approach: Rules + ML
Use deterministic rules for common cases, LLM for edge cases:
```ts
// Fast path: deterministic rules
if (memberPrompt === 'everything') return 1.0;
if (intentPayload.includes(memberPrompt)) return 1.0;
// Slow path: LLM evaluation for specific prompts
return await llmEvaluate(...);
```
5. User Feedback Loop
Learn from user behavior (an illustrative event shape follows the list):
- Track when users manually update settings after rejections
- Use this as signal that evaluation was wrong
- Fine-tune evaluation logic or retrain models
- Eventually remove LLM entirely if patterns emerge
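The first signal could be captured with an event shape like this (names and fields are illustrative):
```ts
// Emitted when a manual settings update causes a previously rejected intent to index.
interface ReindexFeedbackEvent {
  intentId: string;
  originalScore: number;              // the score that fell below the threshold
  trigger: 'manual_settings_update';
  indexedAfterRetry: boolean;         // true suggests the rejection was a false negative
}

function recordFeedback(event: ReindexFeedbackEvent): void {
  console.log(JSON.stringify({ type: 'reindex_feedback', ...event })); // storage layer assumed
}
```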
Metrics to Track
- Intent Indexing Rate: % of intents indexed on first try
- Re-evaluation Frequency: How often users trigger manual re-indexing
- False Negative Rate: Intents incorrectly rejected
- User Time-to-First-Match: How long until users see discovery results
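These could all be derived from a handful of counters, for example:
```ts
interface IndexingCounters {
  intentsCreated: number;
  indexedOnFirstTry: number;       // Intent Indexing Rate = indexedOnFirstTry / intentsCreated
  manualReindexes: number;         // Re-evaluation Frequency
  confirmedFalseNegatives: number; // False Negative Rate numerator
  msToFirstMatch: number[];        // distribution behind User Time-to-First-Match
}
```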
Recommended Action
- Immediate (v1): Implement combined short-term approach
- Short-term (v2): Switch to intent-level sharing controls
- Long-term (v3): Remove LLM evaluation, use simple filtering
Related Code
- Evaluator: protocol/src/agents/core/intent_indexer/evaluator.ts
- Indexer: protocol/src/agents/core/intent_indexer/index.ts
- Events: protocol/src/lib/events.ts
- Discovery: protocol/src/lib/discover.ts
Test Case
```ts
describe('Intent Indexing', () => {
  it('should index all discovery intents when member prompt is "everything"', async () => {
    const testIntents = [
      "Creating believable dynamic NPCs in video games",
      "Enhancing simulations using Generative AI",
      "Building AI-powered educational tools",
    ];
    for (const intent of testIntents) {
      const score = await evaluateIntentAppropriateness(
        intent,
        null, // no index prompt
        "everything" // generic member prompt
      );
      expect(score).toBe(1.0); // Should always allow
    }
  });
});
```
Additional Context
This issue was discovered through extensive user testing where identical workflows produced different outcomes based on LLM randomness. The current system prioritizes false negatives (missing matches) over false positives (irrelevant matches), which is the wrong trade-off for a discovery system.