enhance topic detection #788

currently the approach is naive "topic every 300 words"

we may want to look into better implementations

this is an LLM analysis with only the 1st step proofread:

Approach 1: Embedding-Based TextTiling (Recommended)

  How it works:
  1. Embed each sentence using sentence-transformers or your LLM endpoint
  2. Sliding window: compare embeddings of sentences before/after each potential boundary
  3. Compute cosine similarity between adjacent windows
  4. Mark boundaries where similarity drops below threshold

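  # Sketch (NumPy); embeddings is an array of shape (n_sentences, dim)
  import numpy as np

  boundaries = []
  # +1 so the last position with a full right-hand window is also checked
  for i in range(window_size, len(embeddings) - window_size + 1):
      left = embeddings[i - window_size:i].mean(axis=0)    # window before the gap
      right = embeddings[i:i + window_size].mean(axis=0)   # window after the gap
      similarity = left @ right / (np.linalg.norm(left) * np.linalg.norm(right))
      if similarity < threshold:
          boundaries.append(i)  # a new topic starts at sentence i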

  Pros:
  - Fast (embeddings computed once, comparisons are O(n))
  - No per-boundary LLM calls
  - Proven approach (https://github.com/saeedabc/llm-text-tiling, https://github.com/Ighina/DeepTiling)
  - Works well with meeting transcripts (https://arxiv.org/abs/2106.12978)

  Cons:
  - Need to add sentence-transformers dependency or use LLM embeddings
  - Requires threshold tuning
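
On the threshold tuning point: one possible heuristic (an assumption, not something tested on our transcripts) is to derive the threshold from each transcript's own similarity scores instead of hard-coding it, e.g. mean minus `k` standard deviations. The function names and the `k` knob below are hypothetical, and `scores` stands for the per-gap similarities collected by the loop above.

```python
import statistics


def relative_threshold(similarities, k=1.0):
    """Return mean - k * std of the similarity scores (k is a hypothetical knob)."""
    if len(similarities) < 2:
        raise ValueError("need at least two similarity scores")
    return statistics.mean(similarities) - k * statistics.pstdev(similarities)


def boundaries_below(similarities, threshold, offset=0):
    """Indices whose similarity falls below threshold, shifted by offset.

    offset would be window_size when the first score belongs to sentence
    index window_size, as in the sliding-window loop.
    """
    return [i + offset for i, s in enumerate(similarities) if s < threshold]


# Toy scores with one obvious dip at position 2.
scores = [0.9, 0.85, 0.3, 0.8, 0.9]
threshold = relative_threshold(scores)
print(boundaries_below(scores, threshold, offset=2))
```

A larger `k` yields fewer boundaries, so there is still one value to tune, but it should transfer across transcripts better than an absolute cutoff.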

  Implementation effort: Medium - add embeddings, implement sliding window

  ---
  Approach 2: LLM Boundary Classification

  How it works:
  1. Sliding window over transcript
  2. For each potential boundary, ask LLM: "Does the topic change between these two segments?"
  3. Binary classification per boundary

  BOUNDARY_PROMPT = """
  Given these two consecutive transcript segments, determine if the topic changes.

  BEFORE:
  {before_text}

  AFTER:
  {after_text}

  Does a new topic begin in the AFTER segment? Answer only: YES or NO
  """

  Pros:
  - Uses existing LLM infrastructure
  - Semantic understanding, not just similarity
  - Can explain why it's a boundary

  Cons:
  - Many LLM calls (one per potential boundary)
  - Slower, more expensive
  - Need to define what "topic change" means

  Implementation effort: Low - just new prompt + loop
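
The "new prompt + loop" could look roughly like the sketch below. `ask_llm` is a hypothetical callable standing in for whatever LLM client we already have (prompt in, text out), and `fake_llm` exists only so the example runs without a model. It assumes the transcript has already been split into consecutive segments.

```python
from typing import Callable, List

BOUNDARY_PROMPT = """Given these two consecutive transcript segments, determine if the topic changes.

BEFORE:
{before_text}

AFTER:
{after_text}

Does a new topic begin in the AFTER segment? Answer only: YES or NO
"""


def classify_boundaries(segments: List[str], ask_llm: Callable[[str], str]) -> List[int]:
    """Return indices i where segments[i] is judged to start a new topic."""
    boundaries = []
    for i in range(1, len(segments)):
        prompt = BOUNDARY_PROMPT.format(
            before_text=segments[i - 1], after_text=segments[i]
        )
        # Tolerate answers like "yes" or "YES." from the model.
        answer = ask_llm(prompt).strip().upper()
        if answer.startswith("YES"):
            boundaries.append(i)
    return boundaries


def fake_llm(prompt: str) -> str:
    """Demo stand-in: says YES when the AFTER part mentions the budget."""
    after_part = prompt.split("AFTER:")[1]
    return "YES" if "budget" in after_part else "NO"


segments = ["welcome everyone", "now the budget for next year", "thanks, see you next week"]
print(classify_boundaries(segments, fake_llm))
```

This makes one call per adjacent pair, which is where the cost listed under Cons comes from.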

  ---
  Approach 3: Two-Pass LLM Segmentation

  How it works:
  1. Pass 1: Send full transcript to LLM, ask for boundary positions
  2. Pass 2: For each segment, generate title/summary (existing code)

  SEGMENTATION_PROMPT = """
  Analyze this meeting transcript and identify where major topic changes occur.
  Return the word indices where new topics begin.

  Transcript (with word indices):
  [0] Hello [1] everyone [2] let's [3] discuss [4] the [5] budget...

  Return JSON: {"boundaries": [0, 145, 312, 489]}
  """

  Pros:
  - Single LLM call for segmentation
  - Global context - sees entire transcript
  - Can identify natural topic boundaries
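
The prompt above needs a word-indexed transcript going in and some validation of the JSON coming out, since the model may return out-of-range or duplicate indices. A minimal sketch of both sides, with hypothetical helper names and the assumption that the reply is a bare JSON object:

```python
import json
from typing import List


def index_words(transcript: str) -> str:
    """Render 'Hello everyone' as '[0] Hello [1] everyone'."""
    return " ".join(f"[{i}] {w}" for i, w in enumerate(transcript.split()))


def parse_boundaries(response: str, n_words: int) -> List[int]:
    """Parse {"boundaries": [...]} and keep only in-range, unique, sorted indices."""
    raw = json.loads(response).get("boundaries", [])
    valid = {
        b for b in raw
        if isinstance(b, int) and not isinstance(b, bool) and 0 <= b < n_words
    }
    valid.add(0)  # the first topic always starts at word 0
    return sorted(valid)


def split_segments(transcript: str, boundaries: List[int]) -> List[str]:
    """Cut the transcript into segments that start at each boundary index."""
    words = transcript.split()
    ends = boundaries[1:] + [len(words)]
    return [" ".join(words[s:e]) for s, e in zip(boundaries, ends)]


text = "Hello everyone let's discuss the budget"
print(index_words(text))
bounds = parse_boundaries('{"boundaries": [3, 0, 99, 3]}', len(text.split()))
print(split_segments(text, bounds))
```

The resulting segments can then go to the existing title/summary step in pass 2.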

  Cons:
  - Context window limits for long transcripts
  - Need chunking strategy for long meetings
  - Less predictable boundary positions
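
For the chunking strategy, one option is overlapping word-index windows, so a topic change near a chunk edge is seen with context on both sides. This is only a sketch: the function names, chunk size, and overlap are hypothetical, and it assumes each chunk is sent with its global word indices so the returned boundaries need no re-mapping.

```python
from typing import List, Tuple


def chunk_ranges(n_words: int, chunk_size: int, overlap: int) -> List[Tuple[int, int]]:
    """Return (start, end) word ranges covering n_words, adjacent ranges sharing `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    ranges = []
    start = 0
    while start < n_words:
        end = min(start + chunk_size, n_words)
        ranges.append((start, end))
        if end == n_words:
            break
        start = end - overlap  # step back so chunks overlap
    return ranges


def merge_boundaries(per_chunk: List[List[int]], min_gap: int) -> List[int]:
    """Merge boundaries from all chunks, dropping ones closer than min_gap to the previous kept one."""
    merged: List[int] = []
    for b in sorted({b for chunk in per_chunk for b in chunk}):
        if not merged or b - merged[-1] >= min_gap:
            merged.append(b)
    return merged


print(chunk_ranges(10, 4, 1))
print(merge_boundaries([[0, 145], [147, 312]], min_gap=10))
```

The `min_gap` filter handles the case where two chunks report the same topic change a few words apart inside the overlap.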

  Implementation effort: Medium - new processor, handle long transcripts

  ---
  Approach 4: Hybrid (Best of Both)

  How it works:
  1. Embedding pre-filter: Compute similarity scores, identify candidate boundaries
  2. LLM refinement: For candidates with ambiguous scores (0.6-0.8 range), ask LLM to confirm
  3. Final segmentation: Merge confirmed boundaries
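
A rough sketch of the three steps, using the 0.6-0.8 ambiguous band from step 2. Treating scores below 0.6 as automatic boundaries and scores above 0.8 as "same topic" is my assumption; `confirm` is a hypothetical callable that would wrap the LLM check for a given candidate index.

```python
from typing import Callable, List, Tuple


def hybrid_boundaries(
    candidates: List[Tuple[int, float]],
    confirm: Callable[[int], bool],
    low: float = 0.6,
    high: float = 0.8,
) -> List[int]:
    """candidates holds (sentence_index, similarity) pairs from the embedding pass."""
    boundaries = []
    for index, similarity in candidates:
        if similarity < low:
            boundaries.append(index)  # clear drop: accept without an LLM call
        elif similarity <= high and confirm(index):
            boundaries.append(index)  # ambiguous: keep only if the LLM confirms
        # similarity above `high`: treated as the same topic, no LLM call
    return sorted(boundaries)


asked = []


def fake_confirm(index: int) -> bool:
    """Demo stand-in for the LLM check; records which candidates were asked about."""
    asked.append(index)
    return index == 12


candidates = [(5, 0.3), (12, 0.7), (20, 0.75), (31, 0.95)]
print(hybrid_boundaries(candidates, fake_confirm))
print(asked)
```

In the toy run only the two mid-band candidates reach `fake_confirm`, which is the cost saving this approach is after.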

  Pros:
  - Fast for obvious boundaries
  - LLM only called for ambiguous cases
  - Best accuracy/cost tradeoff

  Cons:
  - More complex implementation
  - Two systems to maintain

  Implementation effort: High
