Describe the bug
FSCrawler appears to calculate incorrect MD5 checksums for binary files larger than 64KB. The checksum seems to be computed from only the first 65,536 bytes instead of the entire file.
For a 2.09 MB binary file detected as application/octet-stream:
- The true MD5 (computed over the full file) differs from the MD5 checksum that FSCrawler indexed.
- The FSCrawler checksum matches the MD5 of exactly the first 65,536 bytes.
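The comparison above can be reproduced outside FSCrawler with a small standalone program. This is only a sketch: the class name `Md5PrefixCheck` and the helper `md5Hex` are made up for illustration and are not part of FSCrawler. It prints the MD5 of the whole file and the MD5 of the first 65,536 bytes, so either value can be compared with the indexed checksum.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5PrefixCheck {

    /** Returns the hex MD5 of at most maxBytes bytes from the start of the file. */
    static String md5Hex(Path file, long maxBytes) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] buf = new byte[8192];
        long remaining = maxBytes;
        try (InputStream in = Files.newInputStream(file)) {
            while (remaining > 0) {
                int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
                if (n < 0) {
                    break; // end of file reached before the limit
                }
                md.update(buf, 0, n);
                remaining -= n;
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.err.println("usage: java Md5PrefixCheck <file>");
            return;
        }
        Path file = Paths.get(args[0]);
        System.out.println("full file          : " + md5Hex(file, Long.MAX_VALUE));
        System.out.println("first 65,536 bytes : " + md5Hex(file, 65_536));
    }
}
```

If the checksum stored in Elasticsearch equals the second line rather than the first, the file was hashed only up to the 64 KB boundary.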
I attempted to configure a different parser via tika-config.xml but wasn't able to work around this issue.
Looking at the code, I suspect the issue is in TikaDocParser.java: for application/octet-stream files, Tika's EmptyParser may not read the full stream, so the DigestInputStream hashes only the bytes that were read during detection. However, I'm not certain this is the exact cause.
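To illustrate the suspected mechanism in isolation, here is a minimal sketch using only the JDK. It is not FSCrawler's or Tika's actual code: `readPrefix` is a hypothetical stand-in for any consumer that stops reading after 65,536 bytes, and the class name `DigestDrainSketch` is made up. It shows that a `DigestInputStream` only hashes bytes that are actually read through it, and that draining the rest of the stream before calling `digest()` yields the full-content hash.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

public class DigestDrainSketch {

    static final int PEEK = 65_536;

    /** Hypothetical consumer that stops after reading at most limit bytes. */
    static void readPrefix(InputStream in, int limit) throws IOException {
        byte[] buf = new byte[8192];
        int total = 0;
        while (total < limit) {
            int n = in.read(buf, 0, Math.min(buf.length, limit - total));
            if (n < 0) {
                break;
            }
            total += n;
        }
    }

    /** Hashes data through a DigestInputStream, optionally draining what the consumer left unread. */
    static byte[] digest(byte[] data, boolean drain) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (DigestInputStream in = new DigestInputStream(new ByteArrayInputStream(data), md)) {
            readPrefix(in, PEEK);
            if (drain) {
                byte[] buf = new byte[8192];
                while (in.read(buf) != -1) {
                    // discard the bytes; reading them is what updates the digest
                }
            }
        }
        return md.digest();
    }

    public static void main(String[] args) throws Exception {
        byte[] data = new byte[200_000];
        Arrays.fill(data, (byte) 7);
        byte[] prefixMd5 = MessageDigest.getInstance("MD5").digest(Arrays.copyOf(data, PEEK));
        byte[] fullMd5 = MessageDigest.getInstance("MD5").digest(data);
        System.out.println("without drain equals prefix MD5: " + Arrays.equals(digest(data, false), prefixMd5));
        System.out.println("with drain equals full MD5     : " + Arrays.equals(digest(data, true), fullMd5));
    }
}
```

Both comparisons should come out true, which matches the symptom described here. Whether FSCrawler's parsing path really stops reading at that point is still an assumption.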
Job Settings
name: "files"
fs:
  tika_config_path: "/usr/share/fscrawler/tika-config/tikaConfig.xml"
  filename_as_id: true
  index_content: true
  includes:
  - "*.mex"
  url: /data
  update_rate: 5s
  indexed_chars: 0
  ignore_above: "5mb"
  lang_detect: false
  raw_metadata: true
  continue_on_error: true
  checksum: "MD5"
  attributes_support: true
  ocr:
    enabled: false
elasticsearch:
  urls:
  - "http://elasticsearch:9200"
  ssl_verification: false
  type: "elasticsearch"
  distribution_version: "es7"
  bulk_size: 2
  byte_size: "500kb"
  flush_interval: "2s"
rest:
  enabled: true
  port: 8080
  host: 0.0.0.0
  url: http://0.0.0.0:8080/fscrawlerLogs
15:20:50,800 INFO [f.p.e.c.f.FsParserAbstract] Run #X: job [files]: indexed [1], deleted [0]
No errors or warnings - indexing completes successfully with the incorrect checksum value.
Expected behavior
The MD5 checksum should be calculated from the entire file contents.
Versions:
- OS: Docker (Linux container) on Windows 11 with WSL2
- Version: 2.10-SNAPSHOT (ES7 build)
- Elasticsearch: 7.17.0
Questions
- Is this a known issue?
- Are there any workarounds or configuration options that might help?