
test: add crash_recovery cases #241

Open

liututu12 wants to merge 1 commit into main from test/test_crash_recovery_0318

Conversation

liututu12 (Collaborator) commented Mar 18, 2026

Add power failure recovery related test cases

Greptile Summary

This PR adds nine new test files covering crash-recovery (power-failure simulation) scenarios for Zvec collection operations: document insert, update, upsert, delete, index creation/deletion, and column add/alter/drop. The shared pattern across all tests is sound — create a collection in the main process, spawn a subprocess that performs the target operation in a loop, kill it mid-flight with SIGKILL, then reopen the collection and assert it is usable.
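The shared pattern can be sketched roughly as follows. This is a hypothetical, POSIX-only stand-in (a JSONL file plays the role of the Zvec collection, and all helper names are invented); the real tests use the Zvec API and pytest fixtures, so treat this as the shape of the harness, not its implementation:

```python
import json
import os
import signal
import subprocess
import sys
import tempfile
import textwrap
import time

# Hypothetical stand-in for a Zvec collection: a JSONL file the
# subprocess appends to in a loop until it is SIGKILLed (POSIX only).
workdir = tempfile.mkdtemp()
collection_path = os.path.join(workdir, "collection.jsonl")
script_path = os.path.join(workdir, "worker.py")

with open(script_path, "w") as f:
    f.write(textwrap.dedent("""\
        import json, sys, time
        path = sys.argv[1]
        i = 0
        while True:  # loops until the parent kills it mid-flight
            with open(path, "a") as out:
                out.write(json.dumps({"id": str(i)}) + "\\n")
            i += 1
            time.sleep(0.001)
    """))

proc = subprocess.Popen([sys.executable, script_path, collection_path])
time.sleep(0.5)                   # let the operation run for a while
proc.send_signal(signal.SIGKILL)  # simulate the power failure
proc.wait()
os.remove(script_path)

# "Reopen" the collection and assert it is still usable; a torn
# final write is tolerated, mirroring crash-recovery semantics.
docs = []
with open(collection_path) as f:
    for line in f:
        try:
            docs.append(json.loads(line))
        except json.JSONDecodeError:
            pass  # partial trailing record from the kill is expected
assert len(docs) >= 1
print(f"recovered {len(docs)} docs, collection still readable")
```

The key design point the review praises is the same one this sketch shows: the crash happens in a separate process, so the parent's verification exercises a genuinely cold reopen.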

Key issues found:

  • test_collection_crash_recovery_deletedoc.py — The subprocess script's document-ID generation has a double-multiplication bug: the outer loop variable i already steps by batch_size (values 0, 10, 20…), but IDs are computed as range(i * batch_size, (i+1) * batch_size), producing ranges 0–9, 100–109, 200–209 … instead of 0–9, 10–19, 20–29. Documents in the 10–99 range are never deleted, and IDs above the inserted range may be targeted.
  • test_collection_crash_recovery_deletedoc.py — Two f-string error messages in the recovery-verification section reference fetched_doc (undefined) instead of fetched_docs, which would raise a secondary NameError on assertion failure, masking the real error.
  • test_collection_crash_recovery_deletedoc.py — Two is_doc_equal calls use the hardcoded key fetched_docs["1"] instead of fetched_docs[doc.id], meaning the check always validates document "1" regardless of the current iteration.
  • test_collection_crash_recovery_updatedoc.py — The post-crash doc-count assertion hardcodes == 201 rather than using the dynamically captured initial_doc_count + 1.
  • All nine files: threading is imported but never used.

Confidence Score: 2/5

  • Not safe to merge — the delete-doc crash recovery test contains logic bugs that make its assertions incorrect or non-functional.
  • Eight of the nine test files follow a correct and well-structured pattern. However, test_collection_crash_recovery_deletedoc.py has three distinct correctness bugs: an off-by-a-factor-of-batch_size error in ID generation (deletes the wrong documents), undefined variable references in two error-message f-strings, and hardcoded dictionary keys that silently verify the wrong document. Together these mean the deletion crash-recovery test does not actually validate what it claims to validate. The updatedoc file also has a brittle hardcoded count assertion. These issues should be fixed before merging.
  • Pay close attention to test_collection_crash_recovery_deletedoc.py (multiple logic bugs in both the embedded subprocess script and the recovery-verification section) and test_collection_crash_recovery_updatedoc.py (hardcoded magic number assertion).

Important Files Changed

Filename Overview
python/tests/detail/test_collection_crash_recovery_addcolumn.py Adds crash recovery tests for column addition operations across 6 data types (INT32–DOUBLE). Pattern is well-structured: create collection, spawn subprocess, kill it mid-operation, verify recovery. Unused threading import is the only issue.
python/tests/detail/test_collection_crash_recovery_altercolumn.py Adds crash recovery tests for alter_column operations. The post-crash alter_column in step 3.8 is intentionally wrapped in a try/except (expected to fail if already altered), which is acceptable. Unused threading import present.
python/tests/detail/test_collection_crash_recovery_createindex.py Adds crash recovery tests for index creation (INVERT, HNSW, FLAT, IVF). Uses appropriate field types per index kind. The basic_schema fixture parameter is accepted but unused in test methods. Unused threading import present.
python/tests/detail/test_collection_crash_recovery_deletedoc.py Contains three bugs: (1) double-multiplication in doc ID calculation produces wrong IDs in the subprocess script, (2) undefined variable fetched_doc in f-string error messages at two locations, and (3) hardcoded dictionary key "1" instead of doc.id at two locations. Also has misleading "insert" log messages in deletion code.
python/tests/detail/test_collection_crash_recovery_deleteindex.py Adds crash recovery tests for index deletion. The subprocess script correctly checks for index existence before attempting deletion. Unused threading import present.
python/tests/detail/test_collection_crash_recovery_dropcolumn.py Adds crash recovery tests for drop_column. The subprocess does add-then-drop in a loop to ensure the column exists before each drop attempt, which is a sensible design. Post-crash drop is try/except guarded. Unused threading import present.
python/tests/detail/test_collection_crash_recovery_insertdoc.py Adds crash recovery tests for bulk document insertion. The subprocess uses a correct j-based inner loop counter for ID generation. Unused threading import present.
python/tests/detail/test_collection_crash_recovery_updatedoc.py Adds crash recovery tests for bulk document updates. Contains a hardcoded magic number 201 in the doc count assertion (should be initial_doc_count + 1). The is_doc_equal calls use query result docs as the expected value rather than the original inserted docs, which may not catch regressions. Unused threading import present.
python/tests/detail/test_collection_crash_recovery_upsertdoc.py Adds crash recovery tests for bulk upsert. Similar structure to updatedoc but cleaner. Module docstring incorrectly refers to "updatedoc" rather than "upsertdoc". Unused threading import present.

Sequence Diagram

sequenceDiagram
    participant T as Test (Main Process)
    participant F as Filesystem (tempdir)
    participant S as Subprocess (zvec operations)
    participant Z as Zvec Collection

    T->>F: Create collection path in tempdir
    T->>Z: zvec.create_and_open(path, schema, option)
    T->>Z: insert N documents
    T->>Z: del coll (close)
    T->>F: Write subprocess script to tempfile
    T->>S: subprocess.Popen([python, script, args_json])
    S->>Z: zvec.open(collection_path)
    S->>Z: Perform operation in loop (insert/delete/update/create_index/etc.)
    Note over T,S: Main process waits 2–3 seconds
    T--xS: SIGKILL (simulate power failure)
    T->>F: os.remove(script file)
    T->>Z: zvec.open(collection_path) — recovery attempt
    T->>Z: Verify doc count >= 1
    T->>Z: Fetch & validate documents
    T->>Z: Query with filter
    T->>Z: Test insert/update/delete post-recovery
    T->>Z: Re-run the crashed operation (create_index / add_column / etc.)
    T->>Z: Final query to confirm functionality

Last reviewed commit: "test: add crash_reco..."

Greptile also left 6 inline comments on this PR.

liututu12 (Collaborator, Author) left a comment:

test: add power failure recovery test cases

Comment on lines +225 to +229:

```python
if current_batch_size == batch_size:
    doc_ids = [str(i) for i in range(i * batch_size, (i + 1) * batch_size)]
else:
    doc_ids = [str(i) for i in range(i * batch_size, num_docs_to_delete)]
```

P0 Double-multiplication bug produces wrong document IDs

The outer loop variable i iterates as 0, 10, 20, ... 190 (step = batch_size), but each iteration then computes range(i * batch_size, (i+1) * batch_size). This squares the offset: when i = 10 the generated IDs are 100–109 instead of 10–19, when i = 20 the IDs are 200–209, etc. As a result the test will attempt to delete documents that were never inserted (or delete the wrong ones), and many documents in the 10–99 range are never touched.

The fix is to use i directly as the starting offset (it already represents the absolute start position):

Suggested change:

```python
# before
if current_batch_size == batch_size:
    doc_ids = [str(i) for i in range(i * batch_size, (i + 1) * batch_size)]
else:
    doc_ids = [str(i) for i in range(i * batch_size, num_docs_to_delete)]

# after
if current_batch_size == batch_size:
    doc_ids = [str(j) for j in range(i, i + batch_size)]
else:
    doc_ids = [str(j) for j in range(i, num_docs_to_delete)]
```

Comment on lines +418 to +421
single_doc,
recovered_collection.schema,
), f"result doc={fetched_doc},doc_exp={single_doc}"
break
The reason will be displayed to describe this comment to others. Learn more.

P1 fetched_doc is undefined — will raise NameError on assertion failure

In this branch, fetched_docs (plural) is assigned but the error message f-string references fetched_doc (singular), which is never bound in this scope. If the is_doc_equal assertion fails, Python will raise a NameError on fetched_doc instead of reporting the actual assertion failure, hiding the real error.

The same issue occurs at approximately line 452 in the second loop over query_result.

Suggested change:

```python
# before
), f"result doc={fetched_doc},doc_exp={single_doc}"

# after
), f"result doc={fetched_docs},doc_exp={single_doc}"
```
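The masking behaviour is easy to demonstrate in isolation (hypothetical names, mirroring the plural/singular typo from the review):

```python
def verify():
    fetched_docs = {"1": "doc-1"}  # only the plural name is bound
    # The assertion message references the singular name, which is undefined.
    assert "2" in fetched_docs, f"result doc={fetched_doc}"

try:
    verify()
except NameError as exc:
    # Python evaluates the f-string only after the condition fails,
    # so the NameError replaces the intended AssertionError.
    print(f"NameError masks the assertion: {exc}")
```

The assertion message is evaluated lazily, which is why the typo stays hidden until the very failure it is supposed to explain.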

Comment on lines +428 to +431:

```python
assert doc.id in fetched_docs
assert is_doc_equal(
    fetched_docs["1"], exp_doc, recovered_collection.schema
), f"result doc={fetched_docs},doc_exp={exp_doc}"
```
The reason will be displayed to describe this comment to others. Learn more.

P1 Hardcoded key "1" always looks up the wrong document

fetched_docs["1"] always fetches the document with ID "1" regardless of which document is currently being iterated (doc.id). For any document whose ID is not "1" this will either raise a KeyError (if doc "1" was deleted) or silently verify the wrong document, making the integrity check meaningless.

The same hardcoded key appears again at approximately line 462 in the second loop.

Suggested change:

```python
# before
assert doc.id in fetched_docs
assert is_doc_equal(
    fetched_docs["1"], exp_doc, recovered_collection.schema
), f"result doc={fetched_docs},doc_exp={exp_doc}"

# after
assert doc.id in fetched_docs
assert is_doc_equal(
    fetched_docs[doc.id], exp_doc, recovered_collection.schema
), f"result doc={fetched_docs},doc_exp={exp_doc}"
```

```python
# Verify existing documents have correct structure
if len(query_result) > 0:
    for doc in query_result[:100]:  # Limit to first 100 for efficiency
        fetched_docs = recovered_collection.fetch([doc.id])
```
The reason will be displayed to describe this comment to others. Learn more.

P2 Hardcoded magic number 201 makes the test brittle

This assertion hardcodes the expected document count. If the setup ever changes (e.g., the number of initially inserted documents changes from 200 or the seed document ID 2001 is moved), this assertion will silently pass or fail for the wrong reason. The commented-out line directly above shows the intent was to use a dynamic value. Consider deriving the expected count from initial_doc_count + 1 (captured before closing the collection):

Suggested change:

```python
fetched_docs = recovered_collection.fetch([doc.id])
assert recovered_collection.stats.doc_count == initial_doc_count + 1
```

```python
    import psutil  # Used for more reliable process management
except ImportError:
    psutil = None  # If psutil is not installed, set it to None
from distance_helper import *
```
The reason will be displayed to describe this comment to others. Learn more.

P2 Unused threading import across all nine new test files

threading is imported at the top of every new test file but is never referenced anywhere in the file body. This applies to all 9 files introduced in this PR:

  • test_collection_crash_recovery_addcolumn.py:29
  • test_collection_crash_recovery_altercolumn.py:29
  • test_collection_crash_recovery_createindex.py:29
  • test_collection_crash_recovery_deletedoc.py:29
  • test_collection_crash_recovery_deleteindex.py:29
  • test_collection_crash_recovery_dropcolumn.py:29
  • test_collection_crash_recovery_insertdoc.py:29
  • test_collection_crash_recovery_updatedoc.py:29
  • test_collection_crash_recovery_upsertdoc.py:29

Remove the import from all files.
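Stray imports like these are easy to catch mechanically; a linter such as flake8 or ruff reports them as F401. As a minimal sketch of the same idea using only the standard-library ast module (the embedded source string is a hypothetical example):

```python
import ast

source = """\
import threading
import os

print(os.getpid())
"""

tree = ast.parse(source)
# Collect every top-level name bound by an import statement.
imported = {
    alias.asname or alias.name.split(".")[0]
    for node in ast.walk(tree)
    if isinstance(node, (ast.Import, ast.ImportFrom))
    for alias in node.names
}
# Collect every bare name actually referenced in the module.
used = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}

print(sorted(imported - used))  # names imported but never referenced
```

This sketch misses attribute-only and string-annotation uses, which is why a real linter is the better tool; here it correctly flags threading.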

Comment on lines +212 to +214
print(f"[Subprocess] Starting Zvec insert document operations on {collection_path} at: {time.strftime('%Y-%m-%d %H:%M:%S')}")
print(f"[Subprocess] Will insert {num_docs_to_delete} documents in batches of {batch_size}")


P2 Misleading print statement says "insert" instead of "delete"

The log message reads "Starting Zvec insert document operations" inside run_zvec_deletedoc_operations, which is a deletion function. This makes debugging confusing. Likewise, the next line says "Will insert N documents" instead of "Will delete N documents". These copy-paste leftovers should be corrected to reference deletion.

