
test: add crash_recovery cases #241

Open

liututu12 wants to merge 1 commit into main from test/test_crash_recovery_0318

Conversation

liututu12 (Collaborator) commented Mar 18, 2026

Add power failure recovery related test cases

Greptile Summary

This PR adds nine new test files covering crash-recovery (power-failure simulation) scenarios for Zvec collection operations: document insert, update, upsert, delete, index creation/deletion, and column add/alter/drop. The shared pattern across all tests is sound — create a collection in the main process, spawn a subprocess that performs the target operation in a loop, kill it mid-flight with SIGKILL, then reopen the collection and assert it is usable.
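The shared pattern can be sketched roughly as follows. This is a hypothetical, POSIX-only stand-in (a JSONL file plays the role of the Zvec collection, and all helper names are invented); the real tests use the Zvec API and pytest fixtures, so treat this as the shape of the harness, not its implementation:

```python
import json
import os
import signal
import subprocess
import sys
import tempfile
import textwrap
import time

# Hypothetical stand-in for a Zvec collection: a JSONL file the
# subprocess appends to in a loop until it is SIGKILLed (POSIX only).
workdir = tempfile.mkdtemp()
collection_path = os.path.join(workdir, "collection.jsonl")
script_path = os.path.join(workdir, "worker.py")

with open(script_path, "w") as f:
    f.write(textwrap.dedent("""\
        import json, sys, time
        path = sys.argv[1]
        i = 0
        while True:  # loops until the parent kills it mid-flight
            with open(path, "a") as out:
                out.write(json.dumps({"id": str(i)}) + "\\n")
            i += 1
            time.sleep(0.001)
    """))

proc = subprocess.Popen([sys.executable, script_path, collection_path])
time.sleep(0.5)                   # let the operation run for a while
proc.send_signal(signal.SIGKILL)  # simulate the power failure
proc.wait()
os.remove(script_path)

# "Reopen" the collection and assert it is still usable; a torn
# final write is tolerated, mirroring crash-recovery semantics.
docs = []
with open(collection_path) as f:
    for line in f:
        try:
            docs.append(json.loads(line))
        except json.JSONDecodeError:
            pass  # partial trailing record from the kill is expected
assert len(docs) >= 1
print(f"recovered {len(docs)} docs, collection still readable")
```

The key design point the review praises is the same one this sketch shows: the crash happens in a separate process, so the parent's verification exercises a genuinely cold reopen.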

Key issues found:

  • test_collection_crash_recovery_deletedoc.py — The subprocess script's document-ID generation has a double-multiplication bug: the outer loop variable i already steps by batch_size (values 0, 10, 20…), but IDs are computed as range(i * batch_size, (i+1) * batch_size), producing ranges 0–9, 100–109, 200–209 … instead of 0–9, 10–19, 20–29. Documents in the 10–99 range are never deleted, and IDs above the inserted range may be targeted.
  • test_collection_crash_recovery_deletedoc.py — Two f-string error messages in the recovery-verification section reference fetched_doc (undefined) instead of fetched_docs, which would raise a secondary NameError on assertion failure, masking the real error.
  • test_collection_crash_recovery_deletedoc.py — Two is_doc_equal calls use the hardcoded key fetched_docs["1"] instead of fetched_docs[doc.id], meaning the check always validates document "1" regardless of the current iteration.
  • test_collection_crash_recovery_updatedoc.py — The post-crash doc-count assertion hardcodes == 201 rather than using the dynamically captured initial_doc_count + 1.
  • All nine files: threading is imported but never used.

Confidence Score: 2/5

  • Not safe to merge — the delete-doc crash recovery test contains logic bugs that make its assertions incorrect or non-functional.
  • Eight of the nine test files follow a correct and well-structured pattern. However, test_collection_crash_recovery_deletedoc.py has three distinct correctness bugs: an off-by-a-factor-of-batch_size error in ID generation (deletes the wrong documents), undefined variable references in two error-message f-strings, and hardcoded dictionary keys that silently verify the wrong document. Together these mean the deletion crash-recovery test does not actually validate what it claims to validate. The updatedoc file also has a brittle hardcoded count assertion. These issues should be fixed before merging.
  • Pay close attention to test_collection_crash_recovery_deletedoc.py (multiple logic bugs in both the embedded subprocess script and the recovery-verification section) and test_collection_crash_recovery_updatedoc.py (hardcoded magic number assertion).

Important Files Changed

Filename Overview
python/tests/detail/test_collection_crash_recovery_addcolumn.py Adds crash recovery tests for column addition operations across 6 data types (INT32–DOUBLE). Pattern is well-structured: create collection, spawn subprocess, kill it mid-operation, verify recovery. Unused threading import is the only issue.
python/tests/detail/test_collection_crash_recovery_altercolumn.py Adds crash recovery tests for alter_column operations. The post-crash alter_column in step 3.8 is intentionally wrapped in a try/except (expected to fail if already altered), which is acceptable. Unused threading import present.
python/tests/detail/test_collection_crash_recovery_createindex.py Adds crash recovery tests for index creation (INVERT, HNSW, FLAT, IVF). Uses appropriate field types per index kind. The basic_schema fixture parameter is accepted but unused in test methods. Unused threading import present.
python/tests/detail/test_collection_crash_recovery_deletedoc.py Contains three bugs: (1) double-multiplication in doc ID calculation produces wrong IDs in the subprocess script, (2) undefined variable fetched_doc in f-string error messages at two locations, and (3) hardcoded dictionary key "1" instead of doc.id at two locations. Also has misleading "insert" log messages in deletion code.
python/tests/detail/test_collection_crash_recovery_deleteindex.py Adds crash recovery tests for index deletion. The subprocess script correctly checks for index existence before attempting deletion. Unused threading import present.
python/tests/detail/test_collection_crash_recovery_dropcolumn.py Adds crash recovery tests for drop_column. The subprocess does add-then-drop in a loop to ensure the column exists before each drop attempt, which is a sensible design. Post-crash drop is try/except guarded. Unused threading import present.
python/tests/detail/test_collection_crash_recovery_insertdoc.py Adds crash recovery tests for bulk document insertion. The subprocess uses a correct j-based inner loop counter for ID generation. Unused threading import present.
python/tests/detail/test_collection_crash_recovery_updatedoc.py Adds crash recovery tests for bulk document updates. Contains a hardcoded magic number 201 in the doc count assertion (should be initial_doc_count + 1). The is_doc_equal calls use query result docs as the expected value rather than the original inserted docs, which may not catch regressions. Unused threading import present.
python/tests/detail/test_collection_crash_recovery_upsertdoc.py Adds crash recovery tests for bulk upsert. Similar structure to updatedoc but cleaner. Module docstring incorrectly refers to "updatedoc" rather than "upsertdoc". Unused threading import present.

Sequence Diagram

sequenceDiagram
    participant T as Test (Main Process)
    participant F as Filesystem (tempdir)
    participant S as Subprocess (zvec operations)
    participant Z as Zvec Collection

    T->>F: Create collection path in tempdir
    T->>Z: zvec.create_and_open(path, schema, option)
    T->>Z: insert N documents
    T->>Z: del coll (close)
    T->>F: Write subprocess script to tempfile
    T->>S: subprocess.Popen([python, script, args_json])
    S->>Z: zvec.open(collection_path)
    S->>Z: Perform operation in loop (insert/delete/update/create_index/etc.)
    Note over T,S: Main process waits 2–3 seconds
    T--xS: SIGKILL (simulate power failure)
    T->>F: os.remove(script file)
    T->>Z: zvec.open(collection_path) — recovery attempt
    T->>Z: Verify doc count >= 1
    T->>Z: Fetch & validate documents
    T->>Z: Query with filter
    T->>Z: Test insert/update/delete post-recovery
    T->>Z: Re-run the crashed operation (create_index / add_column / etc.)
    T->>Z: Final query to confirm functionality

Last reviewed commit: "test: add crash_reco..."

Greptile also left 6 inline comments on this PR.

liututu12 (Collaborator, Author) left a comment:

test: add power failure recovery test cases

Comment on lines +225 to +229:

```python
if current_batch_size == batch_size:
    doc_ids = [str(i) for i in range(i * batch_size, (i + 1) * batch_size)]
else:
    doc_ids = [str(i) for i in range(i * batch_size, num_docs_to_delete)]
```

P0 Double-multiplication bug produces wrong document IDs

The outer loop variable i iterates as 0, 10, 20, ... 190 (step = batch_size), but each iteration then computes range(i * batch_size, (i+1) * batch_size). This squares the offset: when i = 10 the generated IDs are 100–109 instead of 10–19, when i = 20 the IDs are 200–209, etc. As a result the test will attempt to delete documents that were never inserted (or delete the wrong ones), and many documents in the 10–99 range are never touched.

The fix is to use i directly as the starting offset (it already represents the absolute start position):

Suggested change:

```python
# before
if current_batch_size == batch_size:
    doc_ids = [str(i) for i in range(i * batch_size, (i + 1) * batch_size)]
else:
    doc_ids = [str(i) for i in range(i * batch_size, num_docs_to_delete)]

# after
if current_batch_size == batch_size:
    doc_ids = [str(j) for j in range(i, i + batch_size)]
else:
    doc_ids = [str(j) for j in range(i, num_docs_to_delete)]
```

Comment on lines +418 to +421
single_doc,
recovered_collection.schema,
), f"result doc={fetched_doc},doc_exp={single_doc}"
break
The reason will be displayed to describe this comment to others. Learn more.

P1 fetched_doc is undefined — will raise NameError on assertion failure

In this branch, fetched_docs (plural) is assigned but the error message f-string references fetched_doc (singular), which is never bound in this scope. If the is_doc_equal assertion fails, Python will raise a NameError on fetched_doc instead of reporting the actual assertion failure, hiding the real error.

The same issue occurs at approximately line 452 in the second loop over query_result.

Suggested change:

```python
# before
), f"result doc={fetched_doc},doc_exp={single_doc}"

# after
), f"result doc={fetched_docs},doc_exp={single_doc}"
```
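The masking behaviour is easy to demonstrate in isolation (hypothetical names, mirroring the plural/singular typo from the review):

```python
def verify():
    fetched_docs = {"1": "doc-1"}  # only the plural name is bound
    # The assertion message references the singular name, which is undefined.
    assert "2" in fetched_docs, f"result doc={fetched_doc}"

try:
    verify()
except NameError as exc:
    # Python evaluates the f-string only after the condition fails,
    # so the NameError replaces the intended AssertionError.
    print(f"NameError masks the assertion: {exc}")
```

The assertion message is evaluated lazily, which is why the typo stays hidden until the very failure it is supposed to explain.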

Comment on lines +428 to +431:

```python
assert doc.id in fetched_docs
assert is_doc_equal(
    fetched_docs["1"], exp_doc, recovered_collection.schema
), f"result doc={fetched_docs},doc_exp={exp_doc}"
```
The reason will be displayed to describe this comment to others. Learn more.

P1 Hardcoded key "1" always looks up the wrong document

fetched_docs["1"] always fetches the document with ID "1" regardless of which document is currently being iterated (doc.id). For any document whose ID is not "1" this will either raise a KeyError (if doc "1" was deleted) or silently verify the wrong document, making the integrity check meaningless.

The same hardcoded key appears again at approximately line 462 in the second loop.

Suggested change:

```python
# before
assert doc.id in fetched_docs
assert is_doc_equal(
    fetched_docs["1"], exp_doc, recovered_collection.schema
), f"result doc={fetched_docs},doc_exp={exp_doc}"

# after
assert doc.id in fetched_docs
assert is_doc_equal(
    fetched_docs[doc.id], exp_doc, recovered_collection.schema
), f"result doc={fetched_docs},doc_exp={exp_doc}"
```

```python
# Verify existing documents have correct structure
if len(query_result) > 0:
    for doc in query_result[:100]:  # Limit to first 100 for efficiency
        fetched_docs = recovered_collection.fetch([doc.id])
```
The reason will be displayed to describe this comment to others. Learn more.

P2 Hardcoded magic number 201 makes the test brittle

This assertion hardcodes the expected document count. If the setup ever changes (e.g., the number of initially inserted documents changes from 200 or the seed document ID 2001 is moved), this assertion will silently pass or fail for the wrong reason. The commented-out line directly above shows the intent was to use a dynamic value. Consider deriving the expected count from initial_doc_count + 1 (captured before closing the collection):

Suggested change:

```python
fetched_docs = recovered_collection.fetch([doc.id])
assert recovered_collection.stats.doc_count == initial_doc_count + 1
```

```python
    import psutil  # Used for more reliable process management
except ImportError:
    psutil = None  # If psutil is not installed, set it to None
from distance_helper import *
```
The reason will be displayed to describe this comment to others. Learn more.

P2 Unused threading import across all nine new test files

threading is imported at the top of every new test file but is never referenced anywhere in the file body. This applies to all 9 files introduced in this PR:

  • test_collection_crash_recovery_addcolumn.py:29
  • test_collection_crash_recovery_altercolumn.py:29
  • test_collection_crash_recovery_createindex.py:29
  • test_collection_crash_recovery_deletedoc.py:29
  • test_collection_crash_recovery_deleteindex.py:29
  • test_collection_crash_recovery_dropcolumn.py:29
  • test_collection_crash_recovery_insertdoc.py:29
  • test_collection_crash_recovery_updatedoc.py:29
  • test_collection_crash_recovery_upsertdoc.py:29

Remove the import from all files.
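Stray imports like these are easy to catch mechanically; a linter such as flake8 or ruff reports them as F401. As a minimal sketch of the same idea using only the standard-library ast module (the embedded source string is a hypothetical example):

```python
import ast

source = """\
import threading
import os

print(os.getpid())
"""

tree = ast.parse(source)
# Collect every top-level name bound by an import statement.
imported = {
    alias.asname or alias.name.split(".")[0]
    for node in ast.walk(tree)
    if isinstance(node, (ast.Import, ast.ImportFrom))
    for alias in node.names
}
# Collect every bare name actually referenced in the module.
used = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}

print(sorted(imported - used))  # names imported but never referenced
```

This sketch misses attribute-only and string-annotation uses, which is why a real linter is the better tool; here it correctly flags threading.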

Comment on lines +212 to +214
print(f"[Subprocess] Starting Zvec insert document operations on {collection_path} at: {time.strftime('%Y-%m-%d %H:%M:%S')}")
print(f"[Subprocess] Will insert {num_docs_to_delete} documents in batches of {batch_size}")


P2 Misleading print statement says "insert" instead of "delete"

The log message reads "Starting Zvec insert document operations" inside run_zvec_deletedoc_operations, which is a deletion function. This makes debugging confusing. Likewise, the next line says "Will insert N documents" instead of "Will delete N documents". These copy-paste leftovers should be corrected to reference deletion.

