Skip to content

✨(back) add management command to de-index inactive collections and reindex in conv#441

Open
maxenceh wants to merge 4 commits into
mainfrom
maxenceh/deindex-collections
Open

✨(back) add management command to de-index inactive collections and reindex in conv#441
maxenceh wants to merge 4 commits into
mainfrom
maxenceh/deindex-collections

Conversation

@maxenceh
Copy link
Copy Markdown
Collaborator

@maxenceh maxenceh commented Apr 30, 2026

Purpose

Automatic de-indexing and re-indexing of inactive conversation document collections.

Proposal

  • Adds a Django management command (deindex_inactive_collections) that nullifies collection_id on conversations inactive for more than RAG_COLLECTION_INACTIVITY_DAYS days, freeing vector store space.
  • When a user resumes a de-indexed conversation, the backend automatically re-indexes only the attachments not already inlined in the LLM context window (tool_call_only), restoring RAG search capability transparently.
  • The re-indexing pipeline is schedulable via a new Helm CronJob (backend.deindexCronJob).
  • On the frontend, a "Picking up where you left off" loading state with an illustration replaces the previous inline loader during re-indexing.
demo-reindexing.mov

Summary by CodeRabbit

  • New Features

    • Automatic de‑indexing of inactive collections with configurable inactivity window and transparent re‑index-on-resume; conversations now track collection indexing state.
  • UI

    • Resume loader, autoscroll to resume illustrations, and detailed resume error modal that lists failed documents.
  • Documentation

    • RAG Collection Lifecycle docs updated and new RAG_COLLECTION_INACTIVITY_DAYS config (default 30); changelog updated.
  • Tests

    • Added coverage for reindexing, de‑index command, and resume UI behavior.
  • Chores

    • Added runnable deindex make target and optional scheduled CronJob deployment.

Review Change Stack

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 30, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Adds scheduled de-indexing of inactive conversation collections and transparent re-index-on-resume, refactors document-context listing/rendering, introduces index-state model fields and migration, integrates reindex into the agent, provides a deindex management command + Helm CronJob, frontend resume UX, tests, settings, and docs.

Changes

Reindex & Deindex

Layer / File(s) Summary
Reindex core implementation
src/backend/chat/clients/conversation_reindexer.py
Async generator atomically claims conversation, filters READY text attachments, creates/uses document-store collection, stores documents in background threads, marks attachments, and yields progress events (call, done/partial/error).
Document-context listing & rendering
src/backend/chat/document_context_builder.py
Introduces `build_documents_listing(...) -> DocumentsListing
Agent integration
src/backend/chat/clients/pydantic_ai.py
AIAgentService._run_agent caches DocumentsListing, detects de-indexed conversations (index_state DEINDEXED/ERROR) with own READY text docs, streams reindex_conversation(...) events for full-context docs, refreshes conversation, and threads conversation_has_own_documents through RAG gating.
Models, enums, migration
src/backend/chat/enums.py, src/backend/chat/models.py, src/backend/chat/migrations/0008_*.py
Adds CollectionIndexState enum, ChatConversation.index_state, ChatConversationAttachment.is_indexed, and a migration with backfill marking existing collection_ids as indexed.
De-index management command
src/backend/chat/management/commands/deindex_inactive_collections.py
Command queries inactive conversations by cutoff (RAG_COLLECTION_INACTIVITY_DAYS), claims rows, clears collection fields, updates attachments, invokes backend delete_collection/adelete_collection, restores on failure, and reports totals.
De-index command tests
src/backend/chat/tests/management/commands/test_deindex_inactive_collections.py
Tests inactive deletion, recent-activity skip, null-collection skip, error handling that continues, and preservation of updated_at.
Reindex generator tests
src/backend/chat/tests/clients/pydantic_ai/test_reindex_conversation.py
Async tests covering happy path, no-ready attachments, collection creation failure, per-attachment failures (partial), binary filtering, in-context skipping, selective out-of-context storage, concurrency safety, INDEXING-claim blocking, and all-attachments-fail.
Document-context tests
src/backend/chat/tests/test_document_context_builder.py
Tests updated to call build_documents_listing and assert structured listing results; no public API changes beyond builder return type.
Frontend resume UI components
src/frontend/apps/conversations/src/features/chat/components/Chat.tsx, MessageItem.tsx, ToolInvocationItem.tsx
Modal message accepts ReactNode; effect shows resume error modal with failed-docs list and auto-scrolls; streaming excludes conversation_resume from tool label selection; ToolInvocationItem shows resume loader on call and suppresses on result.
Frontend resume UI tests
src/frontend/apps/conversations/src/features/chat/components/__tests__/ToolInvocationItem.test.tsx
Jest + RTL tests assert resume loader for state: 'call' and empty render for state: 'result'.
Helm CronJob template and values
src/helm/conversations/templates/backend_cronjob_deindex.yaml, src/helm/conversations/values.yaml
Optional CronJob to run deindex management command on configurable schedule with standard backend wiring and volume mounts.
Documentation & changelog
docs/attachments.md, CHANGELOG.md
Adds “RAG Collection Lifecycle” doc section describing de-index and re-index rules, and CHANGELOG entry.
Settings, Makefile, test tweak
src/backend/conversations/settings.py, Makefile, src/backend/chat/tests/views/...
Adds RAG_COLLECTION_INACTIVITY_DAYS setting (default 30), fixes .PHONY: shell, adds deindex_inactive_collections make target, and updates one test factory to set collection_id="123".

Sequence Diagram

sequenceDiagram
  participant Client
  participant reindex_conversation
  participant DefaultStorage
  participant DocumentStore
  participant Database
  Client->>reindex_conversation: start reindex(conversation, in_context_ids)
  reindex_conversation->>Database: claim index_state -> INDEXING
  reindex_conversation->>Client: emit ToolCallPart(conversation_resume)
  reindex_conversation->>DocumentStore: acreate_collection() or reuse
  loop per selected READY text attachment
    reindex_conversation->>DefaultStorage: read attachment bytes (to_thread)
    reindex_conversation->>DocumentStore: store_document(bytes) (thread)
    DocumentStore-->>reindex_conversation: returns rag_document_id
    reindex_conversation->>Database: mark attachment is_indexed & save rag_document_id
  end
  reindex_conversation->>Database: set index_state, set collection_id if applicable
  reindex_conversation->>Client: emit ToolResultPart(state="done"/"partial"/"error")
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

Possibly related PRs

Suggested labels

backend, enhancement, frontend

Suggested reviewers

  • providenz
  • qbey
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 78.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly addresses the main change: adding a management command to de-index inactive collections and supporting automatic re-indexing on conversation resume.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch maxenceh/deindex-collections

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@maxenceh maxenceh linked an issue Apr 30, 2026 that may be closed by this pull request
4 tasks
@maxenceh maxenceh force-pushed the maxenceh/deindex-collections branch from b90789f to 7c8c601 Compare April 30, 2026 15:12
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/backend/chat/clients/pydantic_ai.py`:
- Around line 1156-1164: Two concurrent resumes can both see
conversation.collection_id empty and both call _reindex_conversation, producing
duplicate collections; make the claim-and-reindex atomic by adding a
single-claim step (e.g., a distributed/DB-backed lock or an atomic
compare-and-set) that verifies conversation.collection_id is still empty and
writes a placeholder/claim value before creating/filling the backend collection.
Update the flow so is_conversation_deindexed triggers _reindex_conversation only
when the claim succeeds (or move the claim logic into _reindex_conversation at
its start), and ensure the final save of conversation.collection_id uses the
same atomic operation (compare-and-swap or transactional update) so only the
claimer proceeds to populate and persist the collection id.
- Around line 640-716: The _reindex_conversation routine must report and persist
the true outcome: do not set self.conversation.collection_id when collection
creation failed or when no documents were successfully stored; on
collection-creation exception yield a ToolResultPart with state "error" (already
done) and return without modifying conversation; when processing attachments,
track successful_store_count and failed_documents, only set and save
self.conversation.collection_id if successful_store_count > 0 (use
document_store.collection_id), and yield ToolResultPart with state "done" when
failed_documents is empty and success_count > 0, state "partial" including
failed_documents and success_count when some succeeded, and state "none" or
"no_documents" (or "error") when zero succeeded so callers like _run_agent can
distinguish the zero-success path from a valid reindex; update code references
in _reindex_conversation, document_store, failed_documents,
self.conversation.collection_id, and the yielded ToolResultPart to implement
this logic.

In `@src/backend/chat/management/commands/deindex_inactive_collections.py`:
- Around line 34-40: The current flow calls backend_cls(...).delete_collection()
and then updates the DB with
ChatConversation.objects.filter(pk=conversation.pk).update(collection_id=None),
which can leave a stale collection_id if the DB update fails; change the flow to
persist a de-indexed state before deleting the external resource (or atomically
clear the DB id first and restore on failure): add or use a boolean flag (e.g.,
conversation.collection_deindexed or similar) and set it (via queryset.update to
avoid auto_now) before calling delete_collection(), then call
backend.delete_collection() inside try/except and on exception clear the
deindexed flag or log and surface the error; alternatively, perform
queryset.update(collection_id=None) first and if backend.delete_collection()
raises, re-set the collection_id in an except block so the system remains
recoverable; reference delete_collection(), backend_cls, and
ChatConversation.objects.filter(...).update(...) when making the change.

In `@src/frontend/apps/conversations/src/features/chat/components/Chat.tsx`:
- Around line 357-390: The effect only shows the modal when
result.failed_documents exists; update the logic in the useEffect (the block
that finds lastAssistant, resumePart and computes result) to also detect a full
re-index failure by checking result.state === 'error' or presence of
result.error and then call setChatErrorModal (same pattern used for
failed_documents) to surface the backend error; keep the existing de-duplication
using lastResumeErrorMessageIdRef.current === lastAssistant.id and include the
backend error text in the modal message (e.g., show result.error or a translated
generic message) so users see feedback when conversation_resume returned an
error.
- Around line 121-123: The modal currently wraps chatErrorModal.message (from
useState in Chat.tsx) inside a <Text> component which breaks rich React nodes
(like <ul>); update the Chat component's error modal render to render
chatErrorModal.message directly (or conditionally wrap only when the message is
a plain string) instead of always placing it inside <Text>, keeping the existing
state shape { title, message: React.ReactNode } and the setChatErrorModal usage
intact.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: eb1fae3e-895c-4f7c-a3b1-102b706c46c7

📥 Commits

Reviewing files that changed from the base of the PR and between 890dc10 and b90789f.

📒 Files selected for processing (16)
  • Makefile
  • src/backend/chat/clients/pydantic_ai.py
  • src/backend/chat/management/__init__.py
  • src/backend/chat/management/commands/__init__.py
  • src/backend/chat/management/commands/deindex_inactive_collections.py
  • src/backend/chat/tests/clients/pydantic_ai/test_reindex_conversation.py
  • src/backend/chat/tests/management/__init__.py
  • src/backend/chat/tests/management/commands/__init__.py
  • src/backend/chat/tests/management/commands/test_deindex_inactive_collections.py
  • src/backend/conversations/settings.py
  • src/frontend/apps/conversations/src/features/chat/components/Chat.tsx
  • src/frontend/apps/conversations/src/features/chat/components/MessageItem.tsx
  • src/frontend/apps/conversations/src/features/chat/components/ToolInvocationItem.tsx
  • src/frontend/apps/conversations/src/i18n/translations.json
  • src/helm/conversations/templates/backend_cronjob_deindex.yaml
  • src/helm/conversations/values.yaml

Comment thread src/backend/chat/clients/pydantic_ai.py Outdated
Comment thread src/backend/chat/clients/pydantic_ai.py
Comment thread src/backend/chat/management/commands/deindex_inactive_collections.py Outdated
@maxenceh maxenceh force-pushed the maxenceh/deindex-collections branch 9 times, most recently from 31b8b16 to 7771e07 Compare May 22, 2026 08:35
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

♻️ Duplicate comments (1)
src/frontend/apps/conversations/src/features/chat/components/Chat.tsx (1)

121-124: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Render modal content without forcing a <Text> wrapper.

chatErrorModal.message is typed as React.ReactNode, but it is still always wrapped in <Text>. Rich content (like the failed-documents <ul>) should be rendered directly.

💡 Suggested fix
-        <Text>{chatErrorModal?.message}</Text>
+        {typeof chatErrorModal?.message === 'string' ? (
+          <Text>{chatErrorModal.message}</Text>
+        ) : (
+          chatErrorModal?.message
+        )}
#!/bin/bash
# Verify that ReactNode modal content is still wrapped by <Text> in Chat.tsx.
rg -n "message:\s*React\.ReactNode|<Text>\{chatErrorModal\?\.message\}</Text>" src/frontend/apps/conversations/src/features/chat/components/Chat.tsx

Also applies to: 901-901

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/frontend/apps/conversations/src/features/chat/components/Chat.tsx` around
lines 121 - 124, The modal state chatErrorModal is typed to allow
React.ReactNode for message but the render uses a strict
<Text>{chatErrorModal?.message}</Text> wrapper; change the Chat component's
modal rendering (in Chat.tsx) to render chatErrorModal.message directly (e.g.,
{chatErrorModal?.message}) instead of wrapping it in <Text>, so rich nodes like
<ul> are rendered correctly, and ensure any styling or layout previously
provided by the <Text> wrapper is moved to the surrounding container if needed.
🧹 Nitpick comments (1)
src/backend/chat/management/commands/deindex_inactive_collections.py (1)

57-59: ⚡ Quick win

Surface partial failures as command failure for cron observability.

When errors occur, the command still reports success and exits 0. Consider raising CommandError when count_error > 0 so Kubernetes CronJob status and alerts reflect partial failures.

Suggested fix
-from django.core.management.base import BaseCommand
+from django.core.management.base import BaseCommand, CommandError
@@
-        self.stdout.write(
-            self.style.SUCCESS(f"De-indexed {count_success} collection(s). {count_error} error(s).")
-        )
+        message = f"De-indexed {count_success} collection(s). {count_error} error(s)."
+        if count_error:
+            self.stdout.write(self.style.WARNING(message))
+            raise CommandError(message)
+        self.stdout.write(self.style.SUCCESS(message))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/backend/chat/management/commands/deindex_inactive_collections.py` around
lines 57 - 59, The final success message currently always returns exit 0; modify
the command to raise a django.core.management.base.CommandError when count_error
> 0 so cron systems see partial failures: import CommandError, then after
computing count_success and count_error, if count_error > 0 raise
CommandError(f"De-indexed {count_success} collection(s). {count_error}
error(s).") else keep the existing self.stdout.write(self.style.SUCCESS(...))
call; reference the existing symbols count_success, count_error,
self.stdout.write and self.style.SUCCESS when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/backend/chat/clients/conversation_reindexer.py`:
- Around line 50-51: The early returns that occur after claiming the
"reindexing" sentinel (e.g., when ready_attachments is falsy) do not
release/reset the claim and can leave conversations stuck; modify the code so
the claim is always cleared before any no-op return by invoking the DB/claim
release/reset API (e.g., call the existing reset_claim/release_claim method or
unset the "reindexing" sentinel using the conversation id and
collection_id="reindexing") and wrap the work in a try/finally (or ensure both
return paths call the release) so both the ready_attachments early return and
the other return path also clear the claim.

In `@src/backend/chat/management/commands/deindex_inactive_collections.py`:
- Around line 34-51: The current clear/restore uses pk-only updates and can
overwrite a concurrent change; instead perform a conditional claim and
conditional restore: when iterating conversations, replace the initial
ChatConversation.objects.filter(pk=conversation.pk).update(collection_id=None)
with a conditional claim that includes the expected collection_id and the
inactivity predicate (e.g., ChatConversation.objects.filter(pk=conversation.pk,
collection_id=old_collection_id,
<inactivity_predicate>).update(collection_id=None)) and only proceed to
instantiate backend_cls(...) and call backend.delete_collection if the claim
returned 1; on failure do not touch the row. When restoring, only restore if the
row is still NULL by using ChatConversation.objects.filter(pk=conversation.pk,
collection_id=None).update(collection_id=old_collection_id) so you won't
overwrite a concurrent update.

In `@src/backend/chat/tests/clients/pydantic_ai/test_reindex_conversation.py`:
- Around line 75-89: The test asserts on an in-memory ChatConversation instance
that may be stale after the reindexing; refresh the conversation from the DB
before asserting collection_id is None. Wrap the synchronous refresh or fetch in
sync_to_async (e.g. call sync_to_async(conversation.refresh_from_db)() or
re-query via sync_to_async(ChatConversation.objects.get)(pk=conversation.pk))
after iterating reindex_conversation and before the final assertions; apply the
same change to the similar test at lines ~217-233.

In `@src/helm/conversations/templates/backend_cronjob_deindex.yaml`:
- Line 31: Locate the Helm template directives in backend_cronjob_deindex.yaml
that are written as "{{- end}}" (appears twice) and add a space before the
closing braces so they read "{{- end }}"; update each occurrence (e.g., the
directive tokens near the end blocks) to use the spaced form to satisfy
Helm/Sonar spacing rules.

---

Duplicate comments:
In `@src/frontend/apps/conversations/src/features/chat/components/Chat.tsx`:
- Around line 121-124: The modal state chatErrorModal is typed to allow
React.ReactNode for message but the render uses a strict
<Text>{chatErrorModal?.message}</Text> wrapper; change the Chat component's
modal rendering (in Chat.tsx) to render chatErrorModal.message directly (e.g.,
{chatErrorModal?.message}) instead of wrapping it in <Text>, so rich nodes like
<ul> are rendered correctly, and ensure any styling or layout previously
provided by the <Text> wrapper is moved to the surrounding container if needed.

---

Nitpick comments:
In `@src/backend/chat/management/commands/deindex_inactive_collections.py`:
- Around line 57-59: The final success message currently always returns exit 0;
modify the command to raise a django.core.management.base.CommandError when
count_error > 0 so cron systems see partial failures: import CommandError, then
after computing count_success and count_error, if count_error > 0 raise
CommandError(f"De-indexed {count_success} collection(s). {count_error}
error(s).") else keep the existing self.stdout.write(self.style.SUCCESS(...))
call; reference the existing symbols count_success, count_error,
self.stdout.write and self.style.SUCCESS when making the change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 4902684d-d9b1-4e21-a882-b8430d49d4e7

📥 Commits

Reviewing files that changed from the base of the PR and between b90789f and 7771e07.

⛔ Files ignored due to path filters (1)
  • src/frontend/apps/conversations/src/features/chat/assets/chat-bubbles-illustration.svg is excluded by !**/*.svg
📒 Files selected for processing (22)
  • CHANGELOG.md
  • Makefile
  • docs/attachments.md
  • src/backend/chat/clients/conversation_reindexer.py
  • src/backend/chat/clients/pydantic_ai.py
  • src/backend/chat/document_context_builder.py
  • src/backend/chat/management/__init__.py
  • src/backend/chat/management/commands/__init__.py
  • src/backend/chat/management/commands/deindex_inactive_collections.py
  • src/backend/chat/tests/clients/pydantic_ai/test_reindex_conversation.py
  • src/backend/chat/tests/management/__init__.py
  • src/backend/chat/tests/management/commands/__init__.py
  • src/backend/chat/tests/management/commands/test_deindex_inactive_collections.py
  • src/backend/chat/tests/test_document_context_builder.py
  • src/backend/chat/tests/views/chat/conversations/test_conversation_with_document_url.py
  • src/backend/conversations/settings.py
  • src/frontend/apps/conversations/src/features/chat/components/Chat.tsx
  • src/frontend/apps/conversations/src/features/chat/components/MessageItem.tsx
  • src/frontend/apps/conversations/src/features/chat/components/ToolInvocationItem.tsx
  • src/frontend/apps/conversations/src/features/chat/components/__tests__/ToolInvocationItem.test.tsx
  • src/helm/conversations/templates/backend_cronjob_deindex.yaml
  • src/helm/conversations/values.yaml

Comment thread src/backend/chat/clients/conversation_reindexer.py
Comment thread src/backend/chat/management/commands/deindex_inactive_collections.py Outdated
Comment thread src/helm/conversations/templates/backend_cronjob_deindex.yaml Outdated
@maxenceh maxenceh requested a review from providenz May 25, 2026 08:40
Copy link
Copy Markdown
Collaborator

@providenz providenz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems pretty solid to me. Recent model updates and mass deindexing need to be handled.


ready_attachments = [
attachment
async for attachment in models.ChatConversationAttachment.objects.filter(
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rag_document_id should be updated (it has been added on ChatConversationAttachment since this code was written )

old_collection_id = conversation.collection_id
try:
# Clear DB first so a crash during delete_collection doesn't leave a stale pointer.
ChatConversation.objects.filter(pk=conversation.pk).update(collection_id=None)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rag_document_id (ChatConversationAttachment) should be nulled

claimed = await models.ChatConversation.objects.filter(
pk=conversation.pk,
collection_id__isnull=True,
).aupdate(collection_id="reindexing")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should avoid this kind of magic strings.
Any usage of the conversation carrying this magic string could feed it to the rag backend

We should maybe add new filed to carry the index state (and on the attachments too).
Maybe worth considering a kind of recovery for the sentinel ? what happens if the process die beetween the claim and the db save ?

Comment thread src/backend/chat/clients/conversation_reindexer.py
Comment thread src/backend/chat/management/commands/deindex_inactive_collections.py Outdated
Comment thread src/backend/chat/management/commands/deindex_inactive_collections.py Outdated
Comment thread src/backend/chat/clients/pydantic_ai.py
Comment thread src/helm/conversations/values.yaml
}
except (ValueError, IndexError):
in_context_ids = set()
async for event in reindex_conversation(self.conversation, in_context_ids):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happens if we uplaod a doc to a deindexed conversation ? reindex runs before self._handle_input_documents() so it's worth checking (if not already)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, that was a bug (creating another collection with the new doc). Fixed by updating collection_id after reindexing

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a test to cover this case : test_reindex_existing_and_new_attachment_creates_single_collection

Comment thread src/backend/chat/clients/conversation_reindexer.py
maxenceh added 3 commits May 26, 2026 18:54
Add management command and cron job
Collections are deindexed after N days
If user reopen the conversations, we reindex them
Loader while we reindex document
Modal if some documents fail
Add a claim to prevent concurrent request to
create orphan collections
@maxenceh maxenceh force-pushed the maxenceh/deindex-collections branch 2 times, most recently from 7c9055b to 20281ca Compare May 27, 2026 13:07
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/backend/chat/clients/pydantic_ai.py (1)

643-653: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't mark the conversation INDEXED before the first document is stored.

If any later parse_and_store_document(...) call fails, _handle_input_documents() aborts the turn but this block has already persisted index_state=INDEXED. On the next request _run_agent() only retries UNINDEXED/ERROR, so the conversation can get stuck in a false healthy state with an empty or partial collection.

Suggested direction
         if not document_store.collection_id:
             # Create a new collection for the conversation
             collection_id = document_store.create_collection(
                 name=f"conversation-{self.conversation.pk}",
             )
             self.conversation.collection_id = str(collection_id)
-            self.conversation.index_state = CollectionIndexState.INDEXED
+            self.conversation.index_state = CollectionIndexState.INDEXING
             await self.conversation.asave(
                 update_fields=["collection_id", "index_state", "updated_at"]
             )
 
-        for document in documents:
-            key = None
-            if isinstance(document, DocumentUrl):
-                ...
-            else:
-                ...
+        try:
+            for document in documents:
+                key = None
+                if isinstance(document, DocumentUrl):
+                    ...
+                else:
+                    ...
+        except Exception:
+            self.conversation.index_state = CollectionIndexState.ERROR
+            await self.conversation.asave(update_fields=["index_state", "updated_at"])
+            raise
+        else:
+            self.conversation.index_state = CollectionIndexState.INDEXED
+            await self.conversation.asave(update_fields=["index_state", "updated_at"])
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/backend/chat/clients/pydantic_ai.py` around lines 643 - 653, The code
currently sets self.conversation.index_state = CollectionIndexState.INDEXED
immediately after creating an empty document collection (in the block using
document_store_backend(self.conversation.collection_id) and
document_store.create_collection), which can leave the conversation marked
INDEXED even if subsequent parse_and_store_document calls fail; change this to
not mark INDEXED on creation—set the index_state to UNINDEXED (or leave unset)
when creating the empty collection and persist that, and instead update
self.conversation.index_state = CollectionIndexState.INDEXED only after the
first successful parse_and_store_document (or wherever documents are actually
persisted) so the state accurately reflects a populated/fully indexed
collection.
♻️ Duplicate comments (1)
src/frontend/apps/conversations/src/features/chat/components/Chat.tsx (1)

380-395: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Render rich resume-error content outside the typography wrapper.

This branch now stores a fragment with a <ul>, but the modal still does <Text>{chatErrorModal?.message}</Text> on Line 901. That wraps block content inside the text component, so the failed-documents list won't render correctly.

💡 Suggested fix
-      <Modal
-        isOpen={!!chatErrorModal}
-        onClose={() => {
-          setChatErrorModal(null);
-        }}
-        title={chatErrorModal?.title}
-        hideCloseButton={true}
-        closeOnClickOutside={true}
-        closeOnEsc={true}
-        size={ModalSize.MEDIUM}
-      >
-        <Text>{chatErrorModal?.message}</Text>
-      </Modal>
+      <Modal
+        isOpen={!!chatErrorModal}
+        onClose={() => {
+          setChatErrorModal(null);
+        }}
+        title={chatErrorModal?.title}
+        hideCloseButton={true}
+        closeOnClickOutside={true}
+        closeOnEsc={true}
+        size={ModalSize.MEDIUM}
+      >
+        {typeof chatErrorModal?.message === 'string' ? (
+          <Text>{chatErrorModal.message}</Text>
+        ) : (
+          chatErrorModal?.message
+        )}
+      </Modal>
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/frontend/apps/conversations/src/features/chat/components/Chat.tsx` around
lines 380 - 395, The chat error modal currently stores complex JSX (a <ul> list)
via setChatErrorModal but the modal later wraps chatErrorModal?.message inside a
<Text> component (see chatErrorModal?.message usage), which breaks rendering of
block-level content; update the modal state shape so message is a ReactNode (not
always string) and change the modal rendering to render chatErrorModal?.message
directly (remove the surrounding <Text> wrapper) or conditionally wrap only when
the message is a plain string; update usages around setChatErrorModal and the
modal component that renders chatErrorModal?.message to handle both string and
JSX nodes correctly.
🧹 Nitpick comments (2)
CHANGELOG.md (1)

21-21: ⚡ Quick win

Consider mentioning re-indexing for completeness.

The entry focuses on de-indexing but the feature also includes automatic re-indexing when a user resumes an inactive conversation. Since re-indexing is a significant user-facing behavior (including new frontend UX), consider rewording for clarity.

Suggested rewording
-- ✨(back) de-index collections after inactivity 
+- ✨(back) de-index and re-index collections based on activity

or

-- ✨(back) de-index collections after inactivity 
+- ✨(back) lifecycle management for RAG collections (de-index inactive, re-index on resume)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@CHANGELOG.md` at line 21, Update the CHANGELOG entry that currently reads
"✨(back) de-index collections after inactivity" to explicitly mention that
collections are automatically re-indexed when a user resumes an inactive
conversation and note the accompanying frontend UX change; replace or expand
that line to something like "de-index collections after inactivity and
automatically re-index when conversation resumes (includes new frontend UX for
re-indexing)" so release notes clearly reflect both backend and user-facing
behavior.
src/backend/chat/clients/conversation_reindexer.py (1)

111-115: ⚡ Quick win

Treat the reindexer’s collection_id update as correct for current backends; optional to use acreate_collection()’s return for future-proofing.

  • BaseRagBackend.acreate_collection() delegates to create_collection(), and both shipped RAG backends assign self.collection_id before returning (AlbertRagBackend.acreate_collection()/create_collection(), FindRagBackend.create_collection()), so the current document_store.collection_id usage in src/backend/chat/clients/conversation_reindexer.py (lines 111-115, 164-165) matches reality.
  • Still, capturing the returned id in the reindexer would make this resilient to a future backend that doesn’t mutate collection_id.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/backend/chat/clients/conversation_reindexer.py` around lines 111 - 115,
The reindexer currently relies on document_store.collection_id being set by
backend implementations (via BaseRagBackend.acreate_collection delegating to
create_collection), which works today but is fragile; modify the block in
conversation_reindexer.py where
document_store.acreate_collection(name=f"conversation-{conversation.pk}") is
awaited to capture its return value (e.g., new_id = await
document_store.acreate_collection(...)) and then use that returned id to set the
collection_id used later (assign to document_store.collection_id or to the local
existing_collection_id variable) so the code doesn't rely solely on backends
mutating state (refer to document_store.acreate_collection,
BaseRagBackend.acreate_collection, AlbertRagBackend, FindRagBackend, and the
conversation reindex flow).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@src/backend/chat/clients/pydantic_ai.py`:
- Around line 643-653: The code currently sets self.conversation.index_state =
CollectionIndexState.INDEXED immediately after creating an empty document
collection (in the block using
document_store_backend(self.conversation.collection_id) and
document_store.create_collection), which can leave the conversation marked
INDEXED even if subsequent parse_and_store_document calls fail; change this to
not mark INDEXED on creation—set the index_state to UNINDEXED (or leave unset)
when creating the empty collection and persist that, and instead update
self.conversation.index_state = CollectionIndexState.INDEXED only after the
first successful parse_and_store_document (or wherever documents are actually
persisted) so the state accurately reflects a populated/fully indexed
collection.

---

Duplicate comments:
In `@src/frontend/apps/conversations/src/features/chat/components/Chat.tsx`:
- Around line 380-395: The chat error modal currently stores complex JSX (a <ul>
list) via setChatErrorModal but the modal later wraps chatErrorModal?.message
inside a <Text> component (see chatErrorModal?.message usage), which breaks
rendering of block-level content; update the modal state shape so message is a
ReactNode (not always string) and change the modal rendering to render
chatErrorModal?.message directly (remove the surrounding <Text> wrapper) or
conditionally wrap only when the message is a plain string; update usages around
setChatErrorModal and the modal component that renders chatErrorModal?.message
to handle both string and JSX nodes correctly.

---

Nitpick comments:
In `@CHANGELOG.md`:
- Line 21: Update the CHANGELOG entry that currently reads "✨(back) de-index
collections after inactivity" to explicitly mention that collections are
automatically re-indexed when a user resumes an inactive conversation and note
the accompanying frontend UX change; replace or expand that line to something
like "de-index collections after inactivity and automatically re-index when
conversation resumes (includes new frontend UX for re-indexing)" so release
notes clearly reflect both backend and user-facing behavior.

In `@src/backend/chat/clients/conversation_reindexer.py`:
- Around line 111-115: The reindexer currently relies on
document_store.collection_id being set by backend implementations (via
BaseRagBackend.acreate_collection delegating to create_collection), which works
today but is fragile; modify the block in conversation_reindexer.py where
document_store.acreate_collection(name=f"conversation-{conversation.pk}") is
awaited to capture its return value (e.g., new_id = await
document_store.acreate_collection(...)) and then use that returned id to set the
collection_id used later (assign to document_store.collection_id or to the local
existing_collection_id variable) so the code doesn't rely solely on backends
mutating state (refer to document_store.acreate_collection,
BaseRagBackend.acreate_collection, AlbertRagBackend, FindRagBackend, and the
conversation reindex flow).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: e1fd1c21-f9f5-446e-ab5e-f36466d540e6

📥 Commits

Reviewing files that changed from the base of the PR and between 7771e07 and 20281ca.

⛔ Files ignored due to path filters (1)
  • src/frontend/apps/conversations/src/features/chat/assets/chat-bubbles-illustration.svg is excluded by !**/*.svg
📒 Files selected for processing (25)
  • CHANGELOG.md
  • Makefile
  • docs/attachments.md
  • src/backend/chat/clients/conversation_reindexer.py
  • src/backend/chat/clients/pydantic_ai.py
  • src/backend/chat/document_context_builder.py
  • src/backend/chat/enums.py
  • src/backend/chat/management/__init__.py
  • src/backend/chat/management/commands/__init__.py
  • src/backend/chat/management/commands/deindex_inactive_collections.py
  • src/backend/chat/migrations/0008_chatconversation_index_state_attachment_is_indexed.py
  • src/backend/chat/models.py
  • src/backend/chat/tests/clients/pydantic_ai/test_reindex_conversation.py
  • src/backend/chat/tests/management/__init__.py
  • src/backend/chat/tests/management/commands/__init__.py
  • src/backend/chat/tests/management/commands/test_deindex_inactive_collections.py
  • src/backend/chat/tests/test_document_context_builder.py
  • src/backend/chat/tests/views/chat/conversations/test_conversation_with_document_url.py
  • src/backend/conversations/settings.py
  • src/frontend/apps/conversations/src/features/chat/components/Chat.tsx
  • src/frontend/apps/conversations/src/features/chat/components/MessageItem.tsx
  • src/frontend/apps/conversations/src/features/chat/components/ToolInvocationItem.tsx
  • src/frontend/apps/conversations/src/features/chat/components/__tests__/ToolInvocationItem.test.tsx
  • src/helm/conversations/templates/backend_cronjob_deindex.yaml
  • src/helm/conversations/values.yaml
🚧 Files skipped from review as they are similar to previous changes (12)
  • src/frontend/apps/conversations/src/features/chat/components/MessageItem.tsx
  • src/frontend/apps/conversations/src/features/chat/components/ToolInvocationItem.tsx
  • src/backend/chat/enums.py
  • Makefile
  • src/backend/chat/models.py
  • src/backend/chat/tests/views/chat/conversations/test_conversation_with_document_url.py
  • src/helm/conversations/values.yaml
  • src/backend/chat/tests/management/commands/test_deindex_inactive_collections.py
  • src/backend/conversations/settings.py
  • docs/attachments.md
  • src/backend/chat/document_context_builder.py
  • src/backend/chat/management/commands/deindex_inactive_collections.py

@maxenceh maxenceh force-pushed the maxenceh/deindex-collections branch from 20281ca to 3985d8a Compare May 27, 2026 13:46
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/backend/chat/clients/pydantic_ai.py (1)

643-653: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't persist INDEXED before any document store succeeds.

This marks the conversation as indexed right after create_collection(), but the later parse_and_store_document() calls can still fail. In that case _handle_input_documents() aborts the turn and leaves the conversation in INDEXED, so the next resume skips the DEINDEXED/ERROR recovery path even though nothing may be searchable yet. Set INDEXED only after at least one document is stored, or flip the conversation to ERROR in the exception path.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/backend/chat/clients/pydantic_ai.py` around lines 643 - 653, The code
sets self.conversation.index_state = CollectionIndexState.INDEXED immediately
after document_store.create_collection(), which can leave the conversation
incorrectly marked INDEXED if subsequent parse_and_store_document() calls fail;
change the flow so you only set INDEXED after at least one successful
parse_and_store_document() (or alternatively set CollectionIndexState.ERROR in
the exception path). Specifically, keep using document_store_backend(...) and
create_collection(...) to obtain and persist self.conversation.collection_id,
but delay setting index_state to INDEXED and calling asave(...) until after a
successful parse/store step (the code that calls parse_and_store_document()/
_handle_input_documents()); also catch exceptions around the parse/store loop
and set self.conversation.index_state = CollectionIndexState.ERROR and await
asave(update_fields=["index_state","updated_at"]) if storing fails so recovery
paths work correctly.
♻️ Duplicate comments (1)
src/backend/chat/clients/pydantic_ai.py (1)

1244-1257: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Exclude current-turn uploads from the resume reindex pass.

This reindex runs before _handle_input_documents(). Since reindex_conversation() reindexes any READY text attachment with not a.is_indexed, a newly uploaded DocumentUrl can be indexed here and then parsed again in _handle_input_documents(), duplicating chunks in the same collection. Filter the current request's attachments out of the resume pass, or move the resume reindex after input-document handling.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/backend/chat/clients/pydantic_ai.py` around lines 1244 - 1257, The resume
reindex step (reindex_conversation called before _handle_input_documents in the
block around is_conversation_deindexed) can index newly uploaded DocumentUrl
attachments from the current request and cause duplicate chunks; fix by
excluding current-turn uploads from that resume pass: collect IDs (or another
unique marker) of input_documents/document uploads from the current request and
filter them out when building in_context_ids or when iterating
reindex_conversation so any DocumentUrl whose id is in the current request is
skipped, or alternatively move the reindex_conversation call to after
_handle_input_documents; update the logic in the is_conversation_deindexed
branch (functions/methods: _build_documents_listing, reindex_conversation,
_handle_input_documents, conversation.arefresh_from_db) to implement one of
these options.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/backend/chat/management/commands/deindex_inactive_collections.py`:
- Around line 27-35: The deindex path currently sets
ChatConversation.index_state to CollectionIndexState.DEINDEXED and clears
attachment fields (ChatConversationAttachment.is_indexed and rag_document_id)
before calling delete_collection, but on delete_collection failure only
collection_id is restored; update the rollback to also restore index_state and
attachment metadata (or defer clearing attachments until after delete_collection
succeeds) so that on backend delete errors you either (a) don't clear
attachments until delete succeeds, or (b) when catching the delete exception
restore ChatConversation.index_state and restore
ChatConversationAttachment.is_indexed and rag_document_id together (apply the
same fix for the analogous block at the later occurrence around lines 58-61).
Use the existing identifiers _claim / ChatConversation,
ChatConversationAttachment, delete_collection, and CollectionIndexState to
locate the code.
- Around line 83-90: The queryset for ChatConversation currently filters only
for collection_id__isnull=False but still includes empty-string values; update
the queryset construction (the
ChatConversation.objects.filter(...).only(...).iterator(...) block) to exclude
empty strings as well (e.g., add .exclude(collection_id="") or change the filter
to include collection_id__gt="") so rows with collection_id == "" are skipped
and delete_collection("") is never called.

In
`@src/backend/chat/migrations/0008_chatconversation_index_state_attachment_is_indexed.py`:
- Around line 23-33: The migration's CharField choices for the conversation
index state are missing the "deindexed" option; update the choices list in the
models.CharField (the migration that defines choices
["unindexed","indexing","indexed","error"]) to include "deindexed" (matching
CollectionIndexState.DEINDEXED) so historical DB state aligns with the runtime
enum and the deindex command; ensure the choice string and human-readable label
follow the existing pattern (e.g., ("deindexed", "DEINDEXED")).

---

Outside diff comments:
In `@src/backend/chat/clients/pydantic_ai.py`:
- Around line 643-653: The code sets self.conversation.index_state =
CollectionIndexState.INDEXED immediately after
document_store.create_collection(), which can leave the conversation incorrectly
marked INDEXED if subsequent parse_and_store_document() calls fail; change the
flow so you only set INDEXED after at least one successful
parse_and_store_document() (or alternatively set CollectionIndexState.ERROR in
the exception path). Specifically, keep using document_store_backend(...) and
create_collection(...) to obtain and persist self.conversation.collection_id,
but delay setting index_state to INDEXED and calling asave(...) until after a
successful parse/store step (the code that calls parse_and_store_document()/
_handle_input_documents()); also catch exceptions around the parse/store loop
and set self.conversation.index_state = CollectionIndexState.ERROR and await
asave(update_fields=["index_state","updated_at"]) if storing fails so recovery
paths work correctly.

---

Duplicate comments:
In `@src/backend/chat/clients/pydantic_ai.py`:
- Around line 1244-1257: The resume reindex step (reindex_conversation called
before _handle_input_documents in the block around is_conversation_deindexed)
can index newly uploaded DocumentUrl attachments from the current request and
cause duplicate chunks; fix by excluding current-turn uploads from that resume
pass: collect IDs (or another unique marker) of input_documents/document uploads
from the current request and filter them out when building in_context_ids or
when iterating reindex_conversation so any DocumentUrl whose id is in the
current request is skipped, or alternatively move the reindex_conversation call
to after _handle_input_documents; update the logic in the
is_conversation_deindexed branch (functions/methods: _build_documents_listing,
reindex_conversation, _handle_input_documents, conversation.arefresh_from_db) to
implement one of these options.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 76e0308c-f446-43af-8e0d-f0efbf5f22ed

📥 Commits

Reviewing files that changed from the base of the PR and between 20281ca and 3985d8a.

📒 Files selected for processing (12)
  • docs/attachments.md
  • src/backend/chat/clients/conversation_reindexer.py
  • src/backend/chat/clients/pydantic_ai.py
  • src/backend/chat/document_context_builder.py
  • src/backend/chat/enums.py
  • src/backend/chat/management/commands/deindex_inactive_collections.py
  • src/backend/chat/migrations/0008_chatconversation_index_state_attachment_is_indexed.py
  • src/backend/chat/models.py
  • src/backend/chat/tests/clients/pydantic_ai/test_reindex_conversation.py
  • src/backend/chat/tests/management/commands/test_deindex_inactive_collections.py
  • src/backend/chat/tests/test_document_context_builder.py
  • src/backend/conversations/settings.py
🚧 Files skipped from review as they are similar to previous changes (7)
  • src/backend/chat/tests/management/commands/test_deindex_inactive_collections.py
  • src/backend/chat/enums.py
  • src/backend/chat/document_context_builder.py
  • src/backend/chat/clients/conversation_reindexer.py
  • src/backend/chat/tests/clients/pydantic_ai/test_reindex_conversation.py
  • src/backend/chat/models.py
  • docs/attachments.md

Comment on lines +27 to +35
claimed = ChatConversation.objects.filter(
pk=conv.pk,
collection_id=conv.collection_id,
updated_at__lt=threshold,
).update(collection_id=None, index_state=CollectionIndexState.DEINDEXED)
if claimed:
ChatConversationAttachment.objects.filter(conversation_id=conv.pk).update(
is_indexed=False, rag_document_id=None
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Rollback the full deindex state on backend delete errors.

_claim() already flips the conversation to DEINDEXED and clears every attachment's is_indexed / rag_document_id, but the error path restores only collection_id. If delete_collection() fails transiently and the backend collection still exists, the next resume will see not a.is_indexed everywhere and reindex those documents into the existing collection, duplicating content. Either defer clearing attachment metadata until the delete succeeds, or restore index_state and attachment metadata together on failure.

Also applies to: 58-61

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/backend/chat/management/commands/deindex_inactive_collections.py` around
lines 27 - 35, The deindex path currently sets ChatConversation.index_state to
CollectionIndexState.DEINDEXED and clears attachment fields
(ChatConversationAttachment.is_indexed and rag_document_id) before calling
delete_collection, but on delete_collection failure only collection_id is
restored; update the rollback to also restore index_state and attachment
metadata (or defer clearing attachments until after delete_collection succeeds)
so that on backend delete errors you either (a) don't clear attachments until
delete succeeds, or (b) when catching the delete exception restore
ChatConversation.index_state and restore ChatConversationAttachment.is_indexed
and rag_document_id together (apply the same fix for the analogous block at the
later occurrence around lines 58-61). Use the existing identifiers _claim /
ChatConversation, ChatConversationAttachment, delete_collection, and
CollectionIndexState to locate the code.

Comment on lines +83 to +90
queryset = (
ChatConversation.objects.filter(
collection_id__isnull=False,
updated_at__lt=threshold,
)
.only("id", "collection_id")
.iterator(chunk_size=100)
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Skip blank collection_id values too.

collection_id is nullable and blankable, and the migration backfill already treats "" as "no collection". This queryset will still claim those rows and call delete_collection(""), which can create avoidable errors and mark never-indexed conversations as deindexed. Add an empty-string exclusion here.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/backend/chat/management/commands/deindex_inactive_collections.py` around
lines 83 - 90, The queryset for ChatConversation currently filters only for
collection_id__isnull=False but still includes empty-string values; update the
queryset construction (the
ChatConversation.objects.filter(...).only(...).iterator(...) block) to exclude
empty strings as well (e.g., add .exclude(collection_id="") or change the filter
to include collection_id__gt="") so rows with collection_id == "" are skipped
and delete_collection("") is never called.

@maxenceh maxenceh force-pushed the maxenceh/deindex-collections branch from 3985d8a to 6da02ff Compare May 27, 2026 14:01
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (2)
src/backend/chat/tests/clients/pydantic_ai/test_reindex_conversation.py (1)

92-94: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Refresh conversation before asserting persisted no-op state.

These assertions read the in-memory instance only. Add await conversation.arefresh_from_db() before checking collection_id to ensure DB-side regressions are caught.

Also applies to: 243-245

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/backend/chat/tests/clients/pydantic_ai/test_reindex_conversation.py`
around lines 92 - 94, The test is asserting on the in-memory Conversation
instance instead of its persisted state; before asserting
conversation.collection_id is None (and similarly at the other occurrence around
lines 243-245), call await conversation.arefresh_from_db() to refresh the model
from the DB so the assertion verifies the persisted no-op state—update the test
to await conversation.arefresh_from_db() immediately before asserting
collection_id and any other persisted fields.
src/backend/chat/migrations/0008_chatconversation_index_state_attachment_is_indexed.py (1)

24-29: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add missing deindexed choice to migration state.

The migration choices omit ("deindexed", "DEINDEXED") while runtime logic uses that state. This creates model-state drift and unnecessary future AlterField churn.

Suggested fix
                 choices=[
                     ("unindexed", "UNINDEXED"),
+                    ("deindexed", "DEINDEXED"),
                     ("indexing", "INDEXING"),
                     ("indexed", "INDEXED"),
                     ("error", "ERROR"),
                 ],
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@src/backend/chat/migrations/0008_chatconversation_index_state_attachment_is_indexed.py`
around lines 24 - 29, The migration's choices list for the state field in
0008_chatconversation_index_state_attachment_is_indexed.py is missing the
("deindexed", "DEINDEXED") option used at runtime; update the choices array (the
tuple list assigned to choices) to include ("deindexed", "DEINDEXED") alongside
("unindexed","UNINDEXED"), ("indexing","INDEXING"), ("indexed","INDEXED"),
("error","ERROR") so the migration matches runtime model states and avoids
future AlterField churn.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/backend/chat/clients/pydantic_ai.py`:
- Around line 539-540: The early return that enables RAG when
conversation_has_own_documents is true is unsafe; update the guard in the
function containing that check so it only returns True if
conversation_has_own_documents AND the conversation has a searchable collection
(e.g., conversation.collection_id is non-empty) or a verified indexed-attachment
flag, rather than relying solely on attachment presence; apply the same fix to
the duplicate check around lines 1274-1276 (both places that reference
conversation_has_own_documents) so RAG tools are only enabled when a valid
collection_id or confirmed indexed state exists.
- Around line 649-653: Don't set conversation.index_state =
CollectionIndexState.INDEXED and persist it (self.conversation.asave(...))
before attempting to index; instead, perform parse_and_store_document (the
indexing call that may fail) first and only on its success assign
self.conversation.collection_id = str(collection_id), set
self.conversation.index_state = CollectionIndexState.INDEXED, and then await
self.conversation.asave(update_fields=["collection_id","index_state","updated_at"]);
also handle exceptions from parse_and_store_document to set a failure state
(e.g., CollectionIndexState.FAILED) or leave it unchanged and persist that state
so a failed indexing run doesn't leave the conversation marked INDEXED.

---

Duplicate comments:
In
`@src/backend/chat/migrations/0008_chatconversation_index_state_attachment_is_indexed.py`:
- Around line 24-29: The migration's choices list for the state field in
0008_chatconversation_index_state_attachment_is_indexed.py is missing the
("deindexed", "DEINDEXED") option used at runtime; update the choices array (the
tuple list assigned to choices) to include ("deindexed", "DEINDEXED") alongside
("unindexed","UNINDEXED"), ("indexing","INDEXING"), ("indexed","INDEXED"),
("error","ERROR") so the migration matches runtime model states and avoids
future AlterField churn.

In `@src/backend/chat/tests/clients/pydantic_ai/test_reindex_conversation.py`:
- Around line 92-94: The test is asserting on the in-memory Conversation
instance instead of its persisted state; before asserting
conversation.collection_id is None (and similarly at the other occurrence around
lines 243-245), call await conversation.arefresh_from_db() to refresh the model
from the DB so the assertion verifies the persisted no-op state—update the test
to await conversation.arefresh_from_db() immediately before asserting
collection_id and any other persisted fields.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: b30dc389-3f87-464d-b822-97c5c8412451

📥 Commits

Reviewing files that changed from the base of the PR and between 3985d8a and 6da02ff.

📒 Files selected for processing (13)
  • docs/attachments.md
  • src/backend/chat/clients/conversation_reindexer.py
  • src/backend/chat/clients/pydantic_ai.py
  • src/backend/chat/document_context_builder.py
  • src/backend/chat/enums.py
  • src/backend/chat/management/commands/deindex_inactive_collections.py
  • src/backend/chat/migrations/0008_chatconversation_index_state_attachment_is_indexed.py
  • src/backend/chat/models.py
  • src/backend/chat/tests/clients/pydantic_ai/test_reindex_conversation.py
  • src/backend/chat/tests/management/commands/test_deindex_inactive_collections.py
  • src/backend/chat/tests/test_document_context_builder.py
  • src/backend/conversations/settings.py
  • src/helm/conversations/templates/backend_cronjob_deindex.yaml
🚧 Files skipped from review as they are similar to previous changes (7)
  • src/backend/conversations/settings.py
  • src/backend/chat/enums.py
  • src/helm/conversations/templates/backend_cronjob_deindex.yaml
  • src/backend/chat/clients/conversation_reindexer.py
  • src/backend/chat/tests/management/commands/test_deindex_inactive_collections.py
  • docs/attachments.md
  • src/backend/chat/management/commands/deindex_inactive_collections.py

Comment thread src/backend/chat/clients/pydantic_ai.py Outdated
Comment thread src/backend/chat/clients/pydantic_ai.py Outdated
@maxenceh maxenceh force-pushed the maxenceh/deindex-collections branch 2 times, most recently from 2a75d4c to d637900 Compare May 27, 2026 15:22
improve deindexing behaviour by adding index_state
scale up deindexing script
@maxenceh maxenceh force-pushed the maxenceh/deindex-collections branch from d637900 to 8b93c21 Compare May 27, 2026 15:45
@sonarqubecloud
Copy link
Copy Markdown

Quality Gate Failed Quality Gate failed

Failed conditions
4.3% Duplication on New Code (required ≤ 3%)

See analysis details on SonarQube Cloud

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

(backend) de-index collections after inactivity

2 participants