
enhance topic detection #788

@deardarlingoose

currently the approach is naive: "new topic every 300 words"

we may want to look into better implementations

this is an llm analysis with only the 1st step proofread:

Approach 1: Embedding-Based TextTiling (Recommended)

How it works:

  1. Embed each sentence using sentence-transformers or your LLM endpoint
  2. Sliding window: compare embeddings of sentences before/after each potential boundary
  3. Compute cosine similarity between adjacent windows
  4. Mark boundaries where similarity drops below threshold

Pseudocode

boundaries = []
for i in range(window_size, len(sentences) - window_size):
    # average the sentence embeddings on each side of candidate boundary i
    left_window = mean(embeddings[i-window_size:i])
    right_window = mean(embeddings[i:i+window_size])
    similarity = cosine_similarity(left_window, right_window)
    if similarity < threshold:
        boundaries.append(i)
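
A minimal runnable sketch of the pseudocode above, using only the standard library. The helper names (mean_vector, find_boundaries) and the toy 2-d vectors are assumptions for illustration; real embeddings would come from sentence-transformers or the LLM endpoint and have many more dimensions.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine_similarity(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def find_boundaries(embeddings, window_size=2, threshold=0.5):
    """Return sentence indices where left/right window similarity drops below threshold."""
    boundaries = []
    for i in range(window_size, len(embeddings) - window_size):
        left = mean_vector(embeddings[i - window_size:i])
        right = mean_vector(embeddings[i:i + window_size])
        if cosine_similarity(left, right) < threshold:
            boundaries.append(i)
    return boundaries

# Toy "embeddings": three sentences on one topic, then three on another.
toy = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1], [0.0, 1.0], [0.1, 0.9], [0.1, 1.0]]
boundaries = find_boundaries(toy, window_size=2, threshold=0.5)
print(boundaries)
```

With these toy vectors the only flagged position is index 3, where the second topic starts. On real transcripts several adjacent indices may fall below the threshold around one topic shift, so some de-duplication would probably be needed.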

Pros:

Cons:

  • Need to add sentence-transformers dependency or use LLM embeddings
  • Requires threshold tuning
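
One possible way to reduce manual threshold tuning, sketched here as an assumption rather than something proposed in the analysis: derive the threshold from the similarity scores of the transcript itself (mean minus k standard deviations). The function name relative_threshold, the factor k and the sample scores are hypothetical.

```python
import statistics

def relative_threshold(similarities, k=1.0):
    """Threshold relative to this transcript: mean - k * population stdev."""
    return statistics.mean(similarities) - k * statistics.pstdev(similarities)

# Hypothetical similarity scores for six candidate boundaries.
scores = [0.9, 0.85, 0.88, 0.3, 0.87, 0.9]
threshold = relative_threshold(scores, k=1.0)
flagged = [i for i, s in enumerate(scores) if s < threshold]
print(threshold, flagged)
```

This trades one fixed number for a factor k that still needs choosing, but k tends to transfer better between transcripts whose overall similarity levels differ.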

Implementation effort: Medium - add embeddings, implement sliding window


Approach 2: LLM Boundary Classification

How it works:

  1. Sliding window over transcript
  2. For each potential boundary, ask LLM: "Does the topic change between these two segments?"
  3. Binary classification per boundary

BOUNDARY_PROMPT = """
Given these two consecutive transcript segments, determine if the topic changes.

BEFORE:
{before_text}

AFTER:
{after_text}

Does a new topic begin in the AFTER segment? Answer only: YES or NO
"""

Pros:

  • Uses existing LLM infrastructure
  • Semantic understanding, not just similarity
  • Can explain why it's a boundary

Cons:

  • Many LLM calls (one per potential boundary)
  • Slower, more expensive
  • Need to define what "topic change" means

Implementation effort: Low - just new prompt + loop


Approach 3: Two-Pass LLM Segmentation

How it works:

  1. Pass 1: Send full transcript to LLM, ask for boundary positions
  2. Pass 2: For each segment, generate title/summary (existing code)

SEGMENTATION_PROMPT = """
Analyze this meeting transcript and identify where major topic changes occur.
Return the word indices where new topics begin.

Transcript (with word indices):
[0] Hello [1] everyone [2] let's [3] discuss [4] the [5] budget...

Return JSON: {"boundaries": [0, 145, 312, 489]}
"""

Pros:

  • Single LLM call for segmentation
  • Global context - sees entire transcript
  • Can identify natural topic boundaries

Cons:

  • Context window limits for long transcripts
  • Need chunking strategy for long meetings
  • Less predictable boundary positions
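
One possible chunking strategy for long meetings, sketched under assumptions: split the word sequence into overlapping chunks, run the segmentation prompt per chunk, then map chunk-local indices back to global ones and drop near-duplicates produced by the overlap. chunk_ranges, merge_boundaries and the min_gap parameter are hypothetical.

```python
def chunk_ranges(word_count, chunk_size, overlap):
    """Return (start, end) word ranges covering the transcript with the given overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    ranges = []
    start = 0
    while start < word_count:
        end = min(start + chunk_size, word_count)
        ranges.append((start, end))
        if end == word_count:
            break
        start = end - overlap
    return ranges

def merge_boundaries(per_chunk, ranges, min_gap):
    """Convert chunk-local indices to global ones and drop boundaries closer than min_gap."""
    global_idx = sorted(
        {start + b for (start, _), local in zip(ranges, per_chunk) for b in local}
    )
    merged = []
    for b in global_idx:
        if not merged or b - merged[-1] >= min_gap:
            merged.append(b)
    return merged

ranges = chunk_ranges(word_count=10, chunk_size=4, overlap=1)
# Hypothetical chunk-local boundaries as returned by the LLM for each chunk.
per_chunk = [[0, 3], [0, 2], [1]]
merged = merge_boundaries(per_chunk, ranges, min_gap=3)
print(ranges, merged)
```

A model may report index 0 of a chunk simply because the chunk starts there, so chunk-initial boundaries probably deserve extra scrutiny before they are kept.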

Implementation effort: Medium - new processor, handle long transcripts


Approach 4: Hybrid (Best of Both)

How it works:

  1. Embedding pre-filter: Compute similarity scores, identify candidate boundaries
  2. LLM refinement: For candidates with ambiguous scores (0.6-0.8 range), ask LLM to confirm
  3. Final segmentation: Merge confirmed boundaries
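
The three steps above can be sketched as follows. The analysis only gives the ambiguous band (0.6-0.8); treating scores below 0.6 as automatic boundaries and scores above 0.8 as automatic non-boundaries is an assumption, as are the names hybrid_boundaries and confirm_with_llm.

```python
def hybrid_boundaries(scores, confirm_with_llm, low=0.6, high=0.8):
    """
    scores: dict mapping candidate position -> similarity across that position.
    Below `low`: accept without an LLM call (assumed).
    Between `low` and `high`: ask the LLM to confirm.
    Above `high`: reject without an LLM call (assumed).
    Returns (confirmed positions, number of LLM calls made).
    """
    confirmed = []
    llm_calls = 0
    for pos in sorted(scores):
        similarity = scores[pos]
        if similarity < low:
            confirmed.append(pos)
        elif similarity <= high:
            llm_calls += 1
            if confirm_with_llm(pos):
                confirmed.append(pos)
    return confirmed, llm_calls

# Hypothetical candidates; the stub "LLM" confirms only position 25.
scores = {10: 0.3, 25: 0.7, 40: 0.75, 55: 0.95}
confirmed, llm_calls = hybrid_boundaries(scores, lambda pos: pos == 25)
print(confirmed, llm_calls)
```

Here four candidates cost two LLM calls, which is the accuracy/cost tradeoff this approach is after.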

Pros:

  • Fast for obvious boundaries
  • LLM only called for ambiguous cases
  • Best accuracy/cost tradeoff

Cons:

  • More complex implementation
  • Two systems to maintain

Implementation effort: High
