
Dev #229

Merged

jorgedanisc merged 7 commits into main from dev on Mar 20, 2026
Conversation

@jorgedanisc
Contributor
No description provided.

@jorgedanisc jorgedanisc self-assigned this Mar 20, 2026
Copilot AI review requested due to automatic review settings March 20, 2026 03:56
@github-actions
Contributor

📋 Release Preview

Package   Current   Next            Tests
ducrs     3.2.0     ⏭️ No release
ducjs     3.2.0     ⏭️ No release
ducpy     3.2.0     3.2.1           ✅ Passed
ducpdf    3.7.0     ⏭️ No release
ducsvg    3.6.0     ⏭️ No release
duc2pdf   3.2.0     ⏭️ No release

@jorgedanisc jorgedanisc merged commit 8091c5c into main Mar 20, 2026
5 checks passed
Copilot AI left a comment

Pull request overview

This PR updates the DUC SQLite schema to enhance FTS5-based search (diacritic-insensitive tokenization + prefix indexing), introduces a migration to rebuild existing FTS tables, and adds a Python (ducpy) search API with tests while ensuring Python-side schema migrations are applied when opening .duc databases.

Changes:

  • Rebuild/upgrade FTS5 virtual tables to use unicode61 remove_diacritics 2 and prefix=..., plus add explicit backfill (rebuild) steps.
  • Add schema migration 3000002 → 3000003 and bump PRAGMA user_version to 3000003.
  • Add ducpy.search module (hybrid FTS + Python reranking) with new pytest coverage, and update DucSQL to apply SQL migrations on open.
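The tokenizer and prefix settings described above can be sketched directly against SQLite's FTS5. This is a minimal sketch, not the actual DUC schema: the table and column names are illustrative, and it assumes a stock SQLite build with FTS5 enabled.

```python
import sqlite3

# Illustrative FTS5 table using the settings this PR adopts:
# unicode61 tokenizer with remove_diacritics 2, plus 2- and 3-char
# prefix indexes for fast prefix queries.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE VIRTUAL TABLE search_demo USING fts5('
    'label, '
    'tokenize = "unicode61 remove_diacritics 2", '
    "prefix = '2 3')"
)
conn.execute("INSERT INTO search_demo(label) VALUES ('Façade Détail')")

# Diacritic-insensitive: a plain ASCII query matches the accented text.
row = conn.execute(
    "SELECT label FROM search_demo WHERE search_demo MATCH 'facade'"
).fetchone()
assert row is not None

# Prefix query served by the prefix indexes.
row2 = conn.execute(
    "SELECT label FROM search_demo WHERE search_demo MATCH 'de*'"
).fetchone()
assert row2 is not None
```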

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.

Summary per file:

  • schema/search.sql: Adds tokenizer/prefix settings and rebuild statements for FTS tables; switches to IF NOT EXISTS for FTS tables/triggers.
  • schema/migrations/3000002_to_3000003.sql: New migration that drops/recreates FTS tables and rebuilds the indexes; bumps user_version.
  • schema/duc.sql: Bumps PRAGMA user_version to 3000003.
  • packages/ducrs/src/api/version_control.rs: Removes outdated module doc bullet.
  • packages/ducpy/src/ducpy/builders/sql_builder.py: Adds migration discovery/application to keep opened DBs current.
  • packages/ducpy/src/ducpy/search/search_elements.py: New Python search implementation and result types.
  • packages/ducpy/src/ducpy/search/__init__.py: Exposes the search API from the new package.
  • packages/ducpy/src/ducpy/__init__.py: Re-exports search helpers at top-level ducpy.
  • packages/ducpy/src/tests/src/test_search_module.py: New tests covering search behavior (exact, prefix, grouping, limit, empty query).
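The migration-on-open flow can be sketched as follows. This is a hypothetical sketch: the real logic in ducpy's sql_builder discovers schema/migrations/*.sql files on disk, so the inline MIGRATIONS table and the function name here are assumptions for illustration.

```python
import sqlite3

# Hypothetical stand-in for migration discovery; the real code reads
# schema/migrations/*.sql files keyed by version.
MIGRATIONS = {
    # from_version -> (to_version, migration SQL)
    3000002: (3000003, "PRAGMA user_version = 3000003;"),
}

def apply_migrations(conn: sqlite3.Connection) -> int:
    """Apply pending migrations until user_version has no successor."""
    current = conn.execute("PRAGMA user_version").fetchone()[0]
    while current in MIGRATIONS:
        to_v, sql = MIGRATIONS[current]
        conn.executescript(sql)
        new_version = conn.execute("PRAGMA user_version").fetchone()[0]
        if new_version <= current:
            # A script that forgets its final PRAGMA would otherwise
            # loop here forever.
            raise RuntimeError(
                f"migration {current} -> {to_v} did not advance user_version"
            )
        current = new_version
    return current

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA user_version = 3000002")
print(apply_migrations(conn))  # 3000003
```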


Comment on lines +6 to +11
-- 1. Drop old FTS tables (triggers are automatically dropped)
DROP TABLE IF EXISTS search_elements;
DROP TABLE IF EXISTS search_element_text;
DROP TABLE IF EXISTS search_element_doc;
DROP TABLE IF EXISTS search_element_model;
DROP TABLE IF EXISTS search_blocks;

Copilot AI Mar 20, 2026

Dropping the FTS tables will NOT drop the existing sync triggers, because the triggers are defined on elements / element_text / etc (not on the FTS tables). On an upgrade, the old trg_* triggers will still exist, so the subsequent CREATE TRIGGER trg_* statements will fail with "trigger already exists" (and/or keep stale trigger definitions if you switch to IF NOT EXISTS). Explicitly DROP TRIGGER IF EXISTS ... for each trg_* before recreating the tables/triggers, and consider fixing the misleading comment here.
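The pitfall can be reproduced in isolation with plain (non-FTS) stand-in tables; the table and trigger names below are simplified, not the real schema. A trigger defined ON the base table survives dropping the table it writes into, so a later CREATE TRIGGER with the same name would fail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE elements(id INTEGER PRIMARY KEY, label TEXT);
    CREATE TABLE search_elements(element_id INTEGER, label TEXT);
    CREATE TRIGGER trg_search_elements_ai AFTER INSERT ON elements BEGIN
        INSERT INTO search_elements(element_id, label)
        VALUES (new.id, new.label);
    END;

    -- Migration step: drop and recreate the search table.
    DROP TABLE search_elements;
    CREATE TABLE search_elements(element_id INTEGER, label TEXT);
""")

# The trigger lives on `elements`, so it survived the DROP TABLE:
triggers = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'trigger'"
)]
print(triggers)  # ['trg_search_elements_ai']

# The fix: drop explicitly before recreating.
conn.executescript("DROP TRIGGER IF EXISTS trg_search_elements_ai;")
```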

Comment on lines +104 to +105
conn.executescript(sql)
current = conn.execute("PRAGMA user_version").fetchone()[0]

Copilot AI Mar 20, 2026

_apply_migrations can get stuck in an infinite loop if a migration script fails to advance PRAGMA user_version (e.g., script error, missing final PRAGMA, or PRAGMA guarded by a conditional). Add a progress check after executescript (e.g., ensure the new user_version is > current and ideally equals the expected to_v) and raise a clear error if not.

Suggested change
-conn.executescript(sql)
-current = conn.execute("PRAGMA user_version").fetchone()[0]
+prev_version = current
+conn.executescript(sql)
+new_version = conn.execute("PRAGMA user_version").fetchone()[0]
+if new_version <= prev_version:
+    raise RuntimeError(
+        "Migration did not advance PRAGMA user_version: "
+        f"attempted to migrate from {prev_version} to {to_v}, "
+        f"but user_version is still {new_version}."
+    )
+if new_version != to_v:
+    raise RuntimeError(
+        "Migration set unexpected PRAGMA user_version: "
+        f"attempted to migrate from {prev_version} to {to_v}, "
+        f"but user_version is {new_version}."
+    )
+current = new_version

Comment on lines +45 to +51
@dataclass(slots=True)
class DucSearchResult:
    """Compatibility placeholder for exported result types."""


@dataclass(slots=True)
class DucElementSearchResult:

Copilot AI Mar 20, 2026

DucSearchResult is defined twice: first as a (empty) dataclass, then later reassigned to a DucElementSearchResult | DucFileSearchResult union type alias. The second assignment overwrites the class at runtime, which is confusing and can break consumers expecting DucSearchResult to be a class. Either remove the placeholder dataclass, or rename it (e.g., DucSearchResultBase) and keep DucSearchResult exclusively as the type alias (or vice versa).

Comment on lines +479 to +509
for _variant_name, _expression, variant_boost in _build_query_variants(query):
    for element in elements:
        if element.get("is_deleted"):
            continue

        element_id = element.get("id")
        element_type = element.get("type")
        if not element_id or not element_type:
            continue

        aggregate = aggregates.get(element_id)
        if aggregate is None:
            aggregate = _ElementAggregate(
                element_id=element_id,
                raw_element_type=element_type,
                label=element.get("label") or "",
                description=element.get("description"),
            )
            aggregates[element_id] = aggregate

        for field_name, source_weight in field_weights.items():
            raw_text = element.get(field_name)
            score, _similarity = _evaluate_match_text(
                query,
                raw_text,
                fts_rank=None,
                source_weight=source_weight,
                variant_boost=variant_boost,
            )
            if score > 0.0 and raw_text:
                aggregate.add_match(raw_text, score)

Copilot AI Mar 20, 2026

In _collect_candidates_from_parsed_duc, the loop over _build_query_variants(query) ignores the generated FTS expressions and only uses variant_boost. Because the boosts are all <= 1.0 and _evaluate_match_text doesn't depend on the expression, this repeats the same work up to 3× without affecting the final best score (the 1.0-boost pass dominates). Consider removing the variants loop for the non-SQLite path (or refactor so variants actually change the matching logic) to avoid unnecessary CPU work.

Suggested change
-for _variant_name, _expression, variant_boost in _build_query_variants(query):
-    for element in elements:
-        if element.get("is_deleted"):
-            continue
-        element_id = element.get("id")
-        element_type = element.get("type")
-        if not element_id or not element_type:
-            continue
-        aggregate = aggregates.get(element_id)
-        if aggregate is None:
-            aggregate = _ElementAggregate(
-                element_id=element_id,
-                raw_element_type=element_type,
-                label=element.get("label") or "",
-                description=element.get("description"),
-            )
-            aggregates[element_id] = aggregate
-        for field_name, source_weight in field_weights.items():
-            raw_text = element.get(field_name)
-            score, _similarity = _evaluate_match_text(
-                query,
-                raw_text,
-                fts_rank=None,
-                source_weight=source_weight,
-                variant_boost=variant_boost,
-            )
-            if score > 0.0 and raw_text:
-                aggregate.add_match(raw_text, score)
+# For non-SQLite search, query variants do not affect matching logic,
+# and the highest boost is 1.0, so we only need a single pass here.
+variant_boost = 1.0
+for element in elements:
+    if element.get("is_deleted"):
+        continue
+    element_id = element.get("id")
+    element_type = element.get("type")
+    if not element_id or not element_type:
+        continue
+    aggregate = aggregates.get(element_id)
+    if aggregate is None:
+        aggregate = _ElementAggregate(
+            element_id=element_id,
+            raw_element_type=element_type,
+            label=element.get("label") or "",
+            description=element.get("description"),
+        )
+        aggregates[element_id] = aggregate
+    for field_name, source_weight in field_weights.items():
+        raw_text = element.get(field_name)
+        score, _similarity = _evaluate_match_text(
+            query,
+            raw_text,
+            fts_rank=None,
+            source_weight=source_weight,
+            variant_boost=variant_boost,
+        )
+        if score > 0.0 and raw_text:
+            aggregate.add_match(raw_text, score)

asset_path = _asset_input_path(asset_name)
assert asset_path.exists(), f"Missing asset file: {asset_path}"

json_path = Path(test_output_dir+"/search_results") / f"{test_name}.json"

Copilot AI Mar 20, 2026

Path construction here uses string concatenation with a hard-coded / (test_output_dir + "/search_results"), which is not portable on Windows and bypasses pathlib's normalization. Prefer Path(test_output_dir) / "search_results" / f"{test_name}.json" (and avoid the double-space before the f-string).

Suggested change
-json_path = Path(test_output_dir+"/search_results") / f"{test_name}.json"
+json_path = Path(test_output_dir) / "search_results" / f"{test_name}.json"

Comment on lines +3 to +5
from .search_elements import (DucElementSearchResult, DucFileSearchResult,
                              DucSearchResponse, DucSearchResult,
                              search_duc_elements)

Copilot AI Mar 20, 2026

The import list formatting/indentation is inconsistent here, which makes the module harder to scan and is likely to fight auto-formatters. Reformat this into a standard multi-line import (one name per line or a single line if it fits) to keep style consistent.

Suggested change
-from .search_elements import (DucElementSearchResult, DucFileSearchResult,
-                              DucSearchResponse, DucSearchResult,
-                              search_duc_elements)
+from .search_elements import (
+    DucElementSearchResult,
+    DucFileSearchResult,
+    DucSearchResponse,
+    DucSearchResult,
+    search_duc_elements,
+)

@github-actions
Contributor

github-actions bot commented Apr 6, 2026

🎉 This PR is included in version 3.2.1 🎉

The release is available on:

Your semantic-release bot 📦🚀

@github-actions
Contributor

github-actions bot commented Apr 6, 2026

🎉 This PR is included in version 3.7.1 🎉

The release is available on:

Your semantic-release bot 📦🚀

@github-actions
Contributor

github-actions bot commented Apr 6, 2026

🎉 This PR is included in version 3.6.1 🎉

The release is available on:

Your semantic-release bot 📦🚀
