
[BUG] Inconsistent Intent Indexing Due to LLM-Based Appropriateness Evaluation #301

@yanekyuk

Description

Problem Summary

New users creating discovery intents see inconsistent behavior: some intents appear in discovery results immediately while others don't, and the missing ones are only indexed after a manual member-settings update triggers re-evaluation. This creates a poor onboarding experience and a confusing UX.

Root Cause

The intent indexing system uses an LLM-based appropriateness evaluator to determine if an intent should be added to an index. When the member prompt is generic (e.g., "everything"), the LLM produces inconsistent scores for semantically similar intents:

  • Intent A: "Creating believable dynamic NPCs in video games" → Score 0.72 → ✅ Indexed
  • Intent B: "Enhancing simulations using Generative AI" → Score 0.68 → ❌ Not indexed

Both intents are topically similar, but the probabilistic nature of LLM evaluation causes different scores.

Technical Details

Location: protocol/src/agents/core/intent_indexer/evaluator.ts

When a member prompt is "everything", the evaluator asks:

"Does '<intent>' match the member's sharing preference: 'everything'?"

The LLM struggles to interpret "everything" as "allow all intents" and instead tries to semantically match the intent against the word "everything", producing unreliable scores.

Threshold: Intents with scores ≤ 0.7 are rejected (not added to the intentIndexes table).
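
For context, the gate amounts to something like the sketch below (evaluateIntentAppropriateness matches the test case further down; addToIntentIndexes and the surrounding structure are illustrative, not the actual evaluator.ts code):

// Sketch of the indexing gate described above; helper names are hypothetical.
const INDEXING_THRESHOLD = 0.7;

async function maybeIndexIntent(intent: string, indexPrompt: string | null, memberPrompt: string) {
  const score = await evaluateIntentAppropriateness(intent, indexPrompt, memberPrompt);
  if (score <= INDEXING_THRESHOLD) {
    return; // rejected: the intent never reaches the intentIndexes table
  }
  await addToIntentIndexes(intent); // hypothetical insert helper
}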

Impact

  1. Poor Onboarding: New users don't see discovery matches immediately
  2. Inconsistent UX: Identical workflows produce different results
  3. User Confusion: Manual workaround (updating settings) shouldn't be necessary
  4. Lost Connections: Legitimate matches are missed due to arbitrary LLM scoring

Reproduction Steps

  1. Create new account
  2. Join an index with member prompt "everything"
  3. Create discovery intent: "Enhancing simulations using Generative AI"
  4. Observe no discovery results
  5. Go to "Manage what you're sharing", add a space, save
  6. Discovery results now appear

Short-Term Solutions

Option 1: Generic Prompt Detection (Recommended)

Detect generic member prompts and skip LLM evaluation:

const GENERIC_PROMPTS = ['everything', 'all', 'anything', 'share everything'];

// Normalize and check against known catch-all phrases
const isGenericPrompt = (prompt: string): boolean =>
  GENERIC_PROMPTS.includes(prompt.trim().toLowerCase());

if (isGenericPrompt(memberPrompt)) {
  return 1.0; // Always allow: skip LLM evaluation entirely
}

Pros: Quick fix, preserves evaluation for specific prompts
Cons: Hardcoded list of generic patterns

Option 2: Deterministic LLM Settings

Use temperature: 0 for consistent responses:

const response = await llm([...], { temperature: 0 });

Pros: More predictable scores
Cons: Doesn't fix the "everything" semantic mismatch

Option 3: Result Caching

Cache evaluation results by intent+prompt hash:
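
A minimal sketch, assuming an in-memory Map and an llmEvaluate wrapper around the existing evaluator (a production version would want a persistent store and TTLs):

import { createHash } from 'crypto';

// Illustrative in-memory cache keyed by a hash of intent + member prompt
const evaluationCache = new Map<string, number>();

const cacheKey = (intent: string, memberPrompt: string): string =>
  createHash('sha256').update(`${intent}::${memberPrompt}`).digest('hex');

async function evaluateWithCache(intent: string, memberPrompt: string): Promise<number> {
  const key = cacheKey(intent, memberPrompt);
  const cached = evaluationCache.get(key);
  if (cached !== undefined) return cached;

  const score = await llmEvaluate(intent, memberPrompt); // assumed wrapper around the existing evaluator
  evaluationCache.set(key, score);
  return score;
}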

Pros: Guarantees consistency for same inputs
Cons: Doesn't prevent initial bad scores

Option 4: Lower Threshold for Discovery Intents

Use 0.5 threshold for sourceType: 'discovery_form':
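
A sketch of the check, assuming the intent record carries its sourceType:

// Hypothetical: relax the cutoff only for user-created discovery intents
const threshold = intent.sourceType === 'discovery_form' ? 0.5 : 0.7;
const shouldIndex = score > threshold;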

Pros: More permissive for user-created intents
Cons: Band-aid solution, doesn't address root cause

Recommended: Combined Approach

Implement all four for maximum stability:

  1. Generic prompt detection (fixes "everything" case)
  2. Temperature=0 (improves determinism)
  3. Caching (ensures consistency)
  4. Lower threshold for discovery (more permissive)
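
Sketched together, reusing isGenericPrompt, cacheKey, evaluationCache, and llmEvaluate from the sketches above (the temperature option on llmEvaluate is an assumption):

async function shouldIndexIntent(
  intent: string,
  memberPrompt: string,
  sourceType: string
): Promise<boolean> {
  // 1. Generic prompt detection: skip the LLM entirely
  if (isGenericPrompt(memberPrompt)) return true;

  // 2 + 3. Deterministic, cached LLM evaluation
  const key = cacheKey(intent, memberPrompt);
  let score = evaluationCache.get(key);
  if (score === undefined) {
    score = await llmEvaluate(intent, memberPrompt, { temperature: 0 });
    evaluationCache.set(key, score);
  }

  // 4. Lower threshold for user-created discovery intents
  const threshold = sourceType === 'discovery_form' ? 0.5 : 0.7;
  return score > threshold;
}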

Long-Term Solutions

1. Remove LLM Evaluation Entirely

Trust users to manage their own sharing preferences.

The current system assumes users don't know what they want to share, but:

  • Discovery intents are explicitly created by users
  • Users can already configure member prompts
  • The evaluation adds latency and cost
  • False negatives harm UX more than false positives

Proposed: Simple keyword/tag-based filtering instead of LLM evaluation.
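
A hedged sketch of what that filtering could look like (blockedTags is a hypothetical per-member setting, not an existing field):

// Hypothetical replacement for the LLM evaluator: share by default,
// filter out only intents matching explicitly blocked tags.
function passesKeywordFilter(intent: string, blockedTags: string[]): boolean {
  const text = intent.toLowerCase();
  return !blockedTags.some((tag) => text.includes(tag.toLowerCase()));
}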

2. Intent-Level Sharing Controls

Let users mark individual intents as public/private.

Instead of evaluating all intents against one prompt:

  • Default: All discovery intents are shared
  • Users can explicitly hide specific intents
  • More intuitive and predictable
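
A minimal data-model sketch, assuming a visibility field on the intent record (illustrative, not the current schema):

// Hypothetical intent record with per-intent visibility instead of
// prompt-based LLM evaluation.
interface DiscoveryIntent {
  id: string;
  payload: string;
  sourceType: string; // e.g. 'discovery_form'
  visibility: 'public' | 'private'; // default 'public' for discovery intents
}

const shouldIndex = (intent: DiscoveryIntent): boolean =>
  intent.visibility === 'public';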

3. Improved Prompt Engineering

Make the evaluation task more concrete:

Current: "Does X match 'everything'?"
Better: "Should this intent be visible to others? Answer Yes/No"

Provide calibration examples in the system prompt.
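
For example (wording and examples are illustrative, not a tested prompt):

// Hypothetical reframed system prompt with calibration examples
const EVALUATION_SYSTEM_PROMPT = `
You decide whether a member's intent should be visible to others.
Answer only "Yes" or "No".

Examples:
- Preference "everything", intent "Learning to paint" -> Yes
- Preference "only professional topics", intent "Dating advice" -> No
`;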

4. Hybrid Approach: Rules + ML

Use deterministic rules for common cases, LLM for edge cases:

// Hybrid evaluator: deterministic fast path, LLM slow path
async function evaluateAppropriateness(memberPrompt: string, intentPayload: string): Promise<number> {
  // Fast path: deterministic rules cover the common cases
  if (memberPrompt === 'everything') return 1.0;
  if (intentPayload.includes(memberPrompt)) return 1.0;

  // Slow path: LLM evaluation for specific prompts
  return llmEvaluate(memberPrompt, intentPayload);
}

5. User Feedback Loop

Learn from user behavior:

  • Track when users manually update settings after rejections
  • Use this as signal that evaluation was wrong
  • Fine-tune evaluation logic or retrain models
  • Eventually remove LLM entirely if patterns emerge

Metrics to Track

  1. Intent Indexing Rate: % of intents indexed on first try
  2. Re-evaluation Frequency: How often users trigger manual re-indexing
  3. False Negative Rate: Intents incorrectly rejected
  4. User Time-to-First-Match: How long until users see discovery results

Recommended Action

  1. Immediate (v1): Implement combined short-term approach
  2. Short-term (v2): Switch to intent-level sharing controls
  3. Long-term (v3): Remove LLM evaluation, use simple filtering

Related Code

  • Evaluator: protocol/src/agents/core/intent_indexer/evaluator.ts
  • Indexer: protocol/src/agents/core/intent_indexer/index.ts
  • Events: protocol/src/lib/events.ts
  • Discovery: protocol/src/lib/discover.ts

Test Case

describe('Intent Indexing', () => {
  it('should index all discovery intents when member prompt is "everything"', async () => {
    const testIntents = [
      "Creating believable dynamic NPCs in video games",
      "Enhancing simulations using Generative AI",
      "Building AI-powered educational tools"
    ];
    
    for (const intent of testIntents) {
      const score = await evaluateIntentAppropriateness(
        intent,
        null,  // no index prompt
        "everything"  // generic member prompt
      );
      
      expect(score).toBe(1.0);  // Should always allow
    }
  });
});

Additional Context

This issue was discovered through extensive user testing in which identical workflows produced different outcomes purely because of LLM randomness. The current system errs on the side of false negatives (missed matches) to avoid false positives (irrelevant matches), which is the wrong trade-off for a discovery system.
