Currently the approach is naive: "topic every 300 words".
We may want to look into better implementations.
This is an LLM analysis with only a first-step proofread:
Approach 1: Embedding-Based TextTiling (Recommended)
How it works:
- Embed each sentence using sentence-transformers or your LLM endpoint
- Sliding window: compare embeddings of sentences before/after each potential boundary
- Compute cosine similarity between adjacent windows
- Mark boundaries where similarity drops below threshold
Pseudocode
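boundaries = []
for i in range(window_size, len(sentences) - window_size):
    # Mean embedding of the window_size sentences on each side of position i
    left_window = mean(embeddings[i-window_size:i])
    right_window = mean(embeddings[i:i+window_size])
    similarity = cosine_similarity(left_window, right_window)
    # A drop in similarity suggests a topic boundary before sentence i
    if similarity < threshold:
        boundaries.append(i)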
Pros:
Cons:
- Need to add sentence-transformers dependency or use LLM embeddings
- Requires threshold tuning
Implementation effort: Medium - add embeddings, implement sliding window
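The pseudocode for Approach 1 can be sketched as runnable Python. This is a minimal, dependency-free illustration and not a tuned implementation: it assumes the sentence embeddings have already been computed elsewhere (by sentence-transformers or an LLM endpoint) and are passed in as plain lists of floats. The names mean_vector, find_boundaries, and the default window_size and threshold values are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def mean_vector(vectors: Sequence[Vector]) -> List[float]:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_boundaries(
    embeddings: Sequence[Vector],
    window_size: int = 2,      # assumed default, needs tuning
    threshold: float = 0.5,    # assumed default, needs tuning
) -> List[int]:
    """Return sentence indices i where the windows before and after i diverge."""
    boundaries = []
    for i in range(window_size, len(embeddings) - window_size):
        left = mean_vector(embeddings[i - window_size:i])
        right = mean_vector(embeddings[i:i + window_size])
        if cosine_similarity(left, right) < threshold:
            boundaries.append(i)
    return boundaries


# Toy data: four "sentences" about one topic, then four about another.
toy_embeddings = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4
print(find_boundaries(toy_embeddings, window_size=2, threshold=0.5))
```

With a fixed threshold, several neighbouring positions can fall below it around a single topic shift, so a real implementation would probably keep only the local minimum in each run.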
Approach 2: LLM Boundary Classification
How it works:
- Sliding window over transcript
- For each potential boundary, ask LLM: "Does the topic change between these two segments?"
- Binary classification per boundary
BOUNDARY_PROMPT = """
Given these two consecutive transcript segments, determine if the topic changes.
BEFORE:
{before_text}
AFTER:
{after_text}
Does a new topic begin in the AFTER segment? Answer only: YES or NO
"""
Pros:
- Uses existing LLM infrastructure
- Semantic understanding, not just similarity
- Can explain why it's a boundary, if the prompt asks for a reason (the YES/NO prompt above does not)
Cons:
- Many LLM calls (one per potential boundary)
- Slower, more expensive
- Need to define what "topic change" means
Implementation effort: Low - just new prompt + loop
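The "new prompt + loop" for Approach 2 might look roughly like the sketch below. It assumes the transcript has already been split into consecutive segments and that the existing LLM client can be wrapped as a function taking a prompt string and returning the reply text. That wrapper (ask_llm) and the function name classify_boundaries are hypothetical placeholders, not existing code.

```python
from typing import Callable, List

BOUNDARY_PROMPT = """
Given these two consecutive transcript segments, determine if the topic changes.
BEFORE:
{before_text}
AFTER:
{after_text}
Does a new topic begin in the AFTER segment? Answer only: YES or NO
"""


def classify_boundaries(
    segments: List[str],
    ask_llm: Callable[[str], str],  # hypothetical wrapper around the LLM client
) -> List[int]:
    """Return indices i such that a new topic starts at segments[i]."""
    boundaries = []
    # One LLM call per pair of adjacent segments.
    for i in range(1, len(segments)):
        prompt = BOUNDARY_PROMPT.format(
            before_text=segments[i - 1],
            after_text=segments[i],
        )
        answer = ask_llm(prompt).strip().upper()
        # Anything other than a reply starting with YES counts as "no boundary".
        if answer.startswith("YES"):
            boundaries.append(i)
    return boundaries
```

Treating every non-YES reply as "no boundary" is a conservative choice for this sketch; logging unexpected replies would make prompt problems easier to spot.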
Approach 3: Two-Pass LLM Segmentation
How it works:
- Pass 1: Send full transcript to LLM, ask for boundary positions
- Pass 2: For each segment, generate title/summary (existing code)
SEGMENTATION_PROMPT = """
Analyze this meeting transcript and identify where major topic changes occur.
Return the word indices where new topics begin.
Transcript (with word indices):
[0] Hello [1] everyone [2] let's [3] discuss [4] the [5] budget...
Return JSON: {"boundaries": [0, 145, 312, 489]}
"""
Pros:
- Single LLM call for segmentation
- Global context - sees entire transcript
- Can identify natural topic boundaries
Cons:
- Context window limits for long transcripts
- Need chunking strategy for long meetings
- Less predictable boundary positions
Implementation effort: Medium - new processor, handle long transcripts
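For Approach 3, the parts around the LLM call can be sketched as follows: building the word-indexed transcript shown in the prompt, and validating the returned JSON before using it. The helper names (index_transcript, parse_boundaries, split_segments) are hypothetical, and the sketch assumes the model replies with a bare JSON object of the form {"boundaries": [...]}. The LLM call itself and the chunking of long transcripts are left out.

```python
import json
from typing import List


def index_transcript(words: List[str]) -> str:
    """Render words as '[0] Hello [1] everyone ...' for the prompt."""
    return " ".join(f"[{i}] {word}" for i, word in enumerate(words))


def parse_boundaries(response_text: str, num_words: int) -> List[int]:
    """Parse the LLM reply and keep only usable boundary indices.

    Drops non-integers, out-of-range values, and duplicates, and always
    includes 0 so the first segment starts at the beginning.
    """
    data = json.loads(response_text)
    raw = data.get("boundaries", [])
    valid = {
        b for b in raw
        if isinstance(b, int) and not isinstance(b, bool) and 0 <= b < num_words
    }
    valid.add(0)
    return sorted(valid)


def split_segments(words: List[str], boundaries: List[int]) -> List[List[str]]:
    """Cut the word list at the given sorted start indices."""
    ends = boundaries[1:] + [len(words)]
    return [words[start:end] for start, end in zip(boundaries, ends)]


words = "Hello everyone let's discuss the budget".split()
print(index_transcript(words))
```

Validating the indices before slicing matters here because the boundary positions come from free-form model output and may be out of range or unsorted.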
Approach 4: Hybrid (Best of Both)
How it works:
- Embedding pre-filter: Compute similarity scores, identify candidate boundaries
- LLM refinement: For candidates with ambiguous scores (0.6-0.8 range), ask LLM to confirm
- Final segmentation: Merge confirmed boundaries
Pros:
- Fast for obvious boundaries
- LLM only called for ambiguous cases
- Best accuracy/cost tradeoff
Cons:
- More complex implementation
- Two systems to maintain
Implementation effort: High
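The hybrid decision logic can be sketched in a few lines. The sketch assumes one similarity score per candidate position (as produced by the embedding step in Approach 1) and a confirm callback that wraps the LLM check from Approach 2. Both the function name hybrid_boundaries and the reading of the 0.6-0.8 range (below 0.6 accepted outright, above 0.8 rejected outright) are assumptions, not something the analysis above specifies.

```python
from typing import Callable, List


def hybrid_boundaries(
    similarities: List[float],
    confirm: Callable[[int], bool],  # hypothetical LLM check for position i
    low: float = 0.6,                # below this: accept without asking the LLM
    high: float = 0.8,               # above this: reject without asking the LLM
) -> List[int]:
    """Combine cheap similarity scores with LLM confirmation for unclear cases."""
    boundaries = []
    for i, score in enumerate(similarities):
        if score < low:
            # Clear drop in similarity: obvious boundary.
            boundaries.append(i)
        elif score <= high and confirm(i):
            # Ambiguous score: only keep it if the LLM agrees.
            boundaries.append(i)
        # score > high: windows are similar enough, no boundary.
    return boundaries
```

The LLM is only called for scores inside the ambiguous band, so the number of calls depends on how wide that band is set.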