
[BUG] Inconsistent Intent Indexing Due to LLM-Based Appropriateness Evaluation #301

@yanekyuk

Description

Problem Summary

New users creating discovery intents see inconsistent behavior: some intents appear in discovery results immediately while others don't, and the missing ones are only indexed after a manual member-settings update triggers re-evaluation. This creates a poor onboarding experience and a confusing UX.

Root Cause

The intent indexing system uses an LLM-based appropriateness evaluator to determine if an intent should be added to an index. When the member prompt is generic (e.g., "everything"), the LLM produces inconsistent scores for semantically similar intents:

  • Intent A: "Creating believable dynamic NPCs in video games" → Score 0.72 → ✅ Indexed
  • Intent B: "Enhancing simulations using Generative AI" → Score 0.68 → ❌ Not indexed

Both intents are topically similar, but the probabilistic nature of LLM evaluation causes different scores.

Technical Details

Location: protocol/src/agents/core/intent_indexer/evaluator.ts

When a member prompt is "everything", the evaluator asks:

"Does '<intent>' match the member's sharing preference: 'everything'?"

The LLM struggles to interpret "everything" as "allow all intents" and instead tries to semantically match the intent against the word "everything", producing unreliable scores.

Threshold: Intents with scores ≤ 0.7 are rejected (not added to the intentIndexes table).
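
For context, the gate amounts to something like the sketch below (evaluateIntentAppropriateness matches the test case further down; addToIntentIndexes and the surrounding structure are illustrative, not the actual evaluator.ts code):

// Sketch of the indexing gate described above; helper names are hypothetical.
const INDEXING_THRESHOLD = 0.7;

async function maybeIndexIntent(intent: string, indexPrompt: string | null, memberPrompt: string) {
  const score = await evaluateIntentAppropriateness(intent, indexPrompt, memberPrompt);
  if (score <= INDEXING_THRESHOLD) {
    return; // rejected: the intent never reaches the intentIndexes table
  }
  await addToIntentIndexes(intent); // hypothetical insert helper
}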

Impact

  1. Poor Onboarding: New users don't see discovery matches immediately
  2. Inconsistent UX: Identical workflows produce different results
  3. User Confusion: Manual workaround (updating settings) shouldn't be necessary
  4. Lost Connections: Legitimate matches are missed due to arbitrary LLM scoring

Reproduction Steps

  1. Create new account
  2. Join an index with member prompt "everything"
  3. Create discovery intent: "Enhancing simulations using Generative AI"
  4. Observe no discovery results
  5. Go to "Manage what you're sharing", add a space, save
  6. Discovery results now appear

Short-Term Solutions

Option 1: Generic Prompt Detection (Recommended)

Detect generic member prompts and skip LLM evaluation:

const GENERIC_PROMPTS = ['everything', 'all', 'anything', 'share everything'];

// Normalize and check against known catch-all phrases
const isGenericPrompt = (prompt: string): boolean =>
  GENERIC_PROMPTS.includes(prompt.trim().toLowerCase());

if (isGenericPrompt(memberPrompt)) {
  return 1.0; // Always allow: skip LLM evaluation entirely
}

Pros: Quick fix, preserves evaluation for specific prompts
Cons: Hardcoded list of generic patterns

Option 2: Deterministic LLM Settings

Use temperature: 0 for consistent responses:

const response = await llm([...], { temperature: 0 });

Pros: More predictable scores
Cons: Doesn't fix the "everything" semantic mismatch

Option 3: Result Caching

Cache evaluation results by intent+prompt hash:
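
A minimal sketch, assuming an in-memory Map and an llmEvaluate wrapper around the existing evaluator (a production version would want a persistent store and TTLs):

import { createHash } from 'crypto';

// Illustrative in-memory cache keyed by a hash of intent + member prompt
const evaluationCache = new Map<string, number>();

const cacheKey = (intent: string, memberPrompt: string): string =>
  createHash('sha256').update(`${intent}::${memberPrompt}`).digest('hex');

async function evaluateWithCache(intent: string, memberPrompt: string): Promise<number> {
  const key = cacheKey(intent, memberPrompt);
  const cached = evaluationCache.get(key);
  if (cached !== undefined) return cached;

  const score = await llmEvaluate(intent, memberPrompt); // assumed wrapper around the existing evaluator
  evaluationCache.set(key, score);
  return score;
}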

Pros: Guarantees consistency for same inputs
Cons: Doesn't prevent initial bad scores

Option 4: Lower Threshold for Discovery Intents

Use 0.5 threshold for sourceType: 'discovery_form':
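
A sketch of the check, assuming the intent record carries its sourceType:

// Hypothetical: relax the cutoff only for user-created discovery intents
const threshold = intent.sourceType === 'discovery_form' ? 0.5 : 0.7;
const shouldIndex = score > threshold;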

Pros: More permissive for user-created intents
Cons: Band-aid solution, doesn't address root cause

Recommended: Combined Approach

Implement all four for maximum stability:

  1. Generic prompt detection (fixes "everything" case)
  2. Temperature=0 (improves determinism)
  3. Caching (ensures consistency)
  4. Lower threshold for discovery (more permissive)
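
Sketched together, reusing isGenericPrompt, cacheKey, evaluationCache, and llmEvaluate from the sketches above (the temperature option on llmEvaluate is an assumption):

async function shouldIndexIntent(
  intent: string,
  memberPrompt: string,
  sourceType: string
): Promise<boolean> {
  // 1. Generic prompt detection: skip the LLM entirely
  if (isGenericPrompt(memberPrompt)) return true;

  // 2 + 3. Deterministic, cached LLM evaluation
  const key = cacheKey(intent, memberPrompt);
  let score = evaluationCache.get(key);
  if (score === undefined) {
    score = await llmEvaluate(intent, memberPrompt, { temperature: 0 });
    evaluationCache.set(key, score);
  }

  // 4. Lower threshold for user-created discovery intents
  const threshold = sourceType === 'discovery_form' ? 0.5 : 0.7;
  return score > threshold;
}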

Long-Term Solutions

1. Remove LLM Evaluation Entirely

Trust users to manage their own sharing preferences.

The current system assumes users don't know what they want to share, but:

  • Discovery intents are explicitly created by users
  • Users can already configure member prompts
  • The evaluation adds latency and cost
  • False negatives harm UX more than false positives

Proposed: Simple keyword/tag-based filtering instead of LLM evaluation.
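
A hedged sketch of what that filtering could look like (blockedTags is a hypothetical per-member setting, not an existing field):

// Hypothetical replacement for the LLM evaluator: share by default,
// filter out only intents matching explicitly blocked tags.
function passesKeywordFilter(intent: string, blockedTags: string[]): boolean {
  const text = intent.toLowerCase();
  return !blockedTags.some((tag) => text.includes(tag.toLowerCase()));
}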

2. Intent-Level Sharing Controls

Let users mark individual intents as public/private.

Instead of evaluating all intents against one prompt:

  • Default: All discovery intents are shared
  • Users can explicitly hide specific intents
  • More intuitive and predictable
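
A minimal data-model sketch, assuming a visibility field on the intent record (illustrative, not the current schema):

// Hypothetical intent record with per-intent visibility instead of
// prompt-based LLM evaluation.
interface DiscoveryIntent {
  id: string;
  payload: string;
  sourceType: string; // e.g. 'discovery_form'
  visibility: 'public' | 'private'; // default 'public' for discovery intents
}

const shouldIndex = (intent: DiscoveryIntent): boolean =>
  intent.visibility === 'public';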

3. Improved Prompt Engineering

Make the evaluation task more concrete:

Current: "Does X match 'everything'?"
Better: "Should this intent be visible to others? Answer Yes/No"

Provide calibration examples in the system prompt.
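
For example (wording and examples are illustrative, not a tested prompt):

// Hypothetical reframed system prompt with calibration examples
const EVALUATION_SYSTEM_PROMPT = `
You decide whether a member's intent should be visible to others.
Answer only "Yes" or "No".

Examples:
- Preference "everything", intent "Learning to paint" -> Yes
- Preference "only professional topics", intent "Dating advice" -> No
`;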

4. Hybrid Approach: Rules + ML

Use deterministic rules for common cases, LLM for edge cases:

// Hybrid evaluator: deterministic fast path, LLM slow path
async function evaluateAppropriateness(memberPrompt: string, intentPayload: string): Promise<number> {
  // Fast path: deterministic rules cover the common cases
  if (memberPrompt === 'everything') return 1.0;
  if (intentPayload.includes(memberPrompt)) return 1.0;

  // Slow path: LLM evaluation for specific prompts
  return llmEvaluate(memberPrompt, intentPayload);
}

5. User Feedback Loop

Learn from user behavior:

  • Track when users manually update settings after rejections
  • Use this as signal that evaluation was wrong
  • Fine-tune evaluation logic or retrain models
  • Eventually remove LLM entirely if patterns emerge

Metrics to Track

  1. Intent Indexing Rate: % of intents indexed on first try
  2. Re-evaluation Frequency: How often users trigger manual re-indexing
  3. False Negative Rate: Intents incorrectly rejected
  4. User Time-to-First-Match: How long until users see discovery results

Recommended Action

  1. Immediate (v1): Implement combined short-term approach
  2. Short-term (v2): Switch to intent-level sharing controls
  3. Long-term (v3): Remove LLM evaluation, use simple filtering

Related Code

  • Evaluator: protocol/src/agents/core/intent_indexer/evaluator.ts
  • Indexer: protocol/src/agents/core/intent_indexer/index.ts
  • Events: protocol/src/lib/events.ts
  • Discovery: protocol/src/lib/discover.ts

Test Case

describe('Intent Indexing', () => {
  it('should index all discovery intents when member prompt is "everything"', async () => {
    const testIntents = [
      "Creating believable dynamic NPCs in video games",
      "Enhancing simulations using Generative AI",
      "Building AI-powered educational tools"
    ];
    
    for (const intent of testIntents) {
      const score = await evaluateIntentAppropriateness(
        intent,
        null,  // no index prompt
        "everything"  // generic member prompt
      );
      
      expect(score).toBe(1.0);  // Should always allow
    }
  });
});

Additional Context

This issue was discovered through extensive user testing in which identical workflows produced different outcomes purely because of LLM randomness. The current system errs on the side of false negatives (missed matches) to avoid false positives (irrelevant matches), which is the wrong trade-off for a discovery system.
