[BUG] Empty Extension Match Causes Binary Files to Be Indexed #229

@Crimsonyx412

Project

vgrep

Description

The should_index() function includes an empty string "" in its extension match pattern, causing all files without extensions to be indexed. This includes compiled binaries, core dumps, lock files, and other non-text files that should be excluded.

Error Message

# May produce errors like:
Failed to read file: stream did not contain valid UTF-8
# Or silently corrupt the index with binary content

Debug Logs

$ RUST_LOG=debug vgrep index
# Shows attempts to read binary files without extensions

System Information

OS: Ubuntu 22.04
vgrep version: 0.1.0

Screenshots

No response

Steps to Reproduce

  1. Create a directory with mixed files:
   echo "valid source" > test.rs
   cp /bin/ls ./my_binary  # Or any binary without extension
  2. Run vgrep index
  3. Observe that vgrep attempts to index my_binary
  4. Check logs/output for UTF-8 errors or observe the binary in the database

Expected Behavior

Only known text/source file types should be indexed. Files without extensions should only be indexed if they match specific known names (Dockerfile, Makefile, etc.).
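One way to express this expected behavior is sketched below. It is not the vgrep implementation: `should_index_sketch`, `has_known_extension`, and `KNOWN_EXTENSIONLESS` are hypothetical names, and the extension list is shortened to two entries for illustration. The allowlist holds only the two file names mentioned above.

```rust
use std::path::Path;

// Hypothetical allowlist of extensionless file names. Only the two names
// mentioned in this issue are listed; a real list would likely be longer.
const KNOWN_EXTENSIONLESS: &[&str] = &["Dockerfile", "Makefile"];

// Shortened, illustrative extension list. Note that "" is NOT included.
fn has_known_extension(ext: &str) -> bool {
    matches!(ext, "rs" | "py")
}

// Sketch: files with an extension are checked against the extension list;
// files without one are indexed only if their name is on the allowlist.
fn should_index_sketch(path: &Path) -> bool {
    match path.extension().and_then(|e| e.to_str()) {
        Some(ext) => has_known_extension(&ext.to_lowercase()),
        None => path
            .file_name()
            .and_then(|n| n.to_str())
            .map_or(false, |name| KNOWN_EXTENSIONLESS.contains(&name)),
    }
}
```

With this shape, `test.rs` and `Dockerfile` are accepted while `my_binary` is rejected, because an extensionless name has to appear on the allowlist explicitly.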

Actual Behavior

All files without extensions match the | "" arm of the extension pattern, so vgrep attempts to index every one of them.

Additional Context

Files affected:

  • src/core/indexer.rs: should_index() (line 310)
  • src/watcher.rs: should_index() (line 244)

Problematic code:

matches!(
    ext.as_str(),
    "rs" | "py" | ... | ""  // Matches ALL extensionless files
)
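The following sketch illustrates why the "" arm matches extensionless files. It assumes that `ext` is derived from `Path::extension()` with an empty-string fallback, which this issue does not confirm. `ext_of` and `buggy_should_index` are hypothetical names, and the extension list is shortened.

```rust
use std::path::Path;

// Assumption: `ext` is the lowercased extension, falling back to ""
// when the path has no extension (Path::extension() returns None).
fn ext_of(path: &Path) -> String {
    path.extension()
        .and_then(|e| e.to_str())
        .map(|e| e.to_lowercase())
        .unwrap_or_default()
}

// Mirrors the shape of the reported pattern with a shortened list.
fn buggy_should_index(path: &Path) -> bool {
    let ext = ext_of(path);
    // The trailing "" arm accepts every path whose extension is missing.
    matches!(ext.as_str(), "rs" | "py" | "")
}
```

Under that assumption, `buggy_should_index(Path::new("my_binary"))` returns true because the missing extension collapses to "", while a path such as `image.png` is still rejected.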

Metadata

Labels: bug, invalid