MD5 checksum incorrect for binary files > 64KB (application/octet-stream)

**Describe the bug**

FSCrawler appears to calculate incorrect MD5 checksums for binary files larger than 64KB. The checksum seems to be computed from only the first 65,536 bytes instead of the entire file.

For a 2.09MB binary file detected as `application/octet-stream`:
- The true MD5 of the full file differs from the MD5 checksum that FSCrawler indexed.

The FSCrawler checksum matches the MD5 of exactly the first 65,536 bytes.  
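
For anyone who wants to reproduce the comparison, here is a minimal standalone sketch that prints both values for a given file. It is not part of FSCrawler; the class name `Md5PrefixCheck` and the helper `md5Hex` are hypothetical names chosen for this illustration, and only JDK classes are used.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for this issue, not FSCrawler code.
public class Md5PrefixCheck {

    /** Returns the hex MD5 of at most `limit` bytes read from the start of `file`. */
    static String md5Hex(Path file, long limit) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] buffer = new byte[8192];
        long remaining = limit;
        try (InputStream in = Files.newInputStream(file)) {
            while (remaining > 0) {
                int n = in.read(buffer, 0, (int) Math.min(buffer.length, remaining));
                if (n < 0) {
                    break; // end of file reached before the limit
                }
                md.update(buffer, 0, n);
                remaining -= n;
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java Md5PrefixCheck <file>");
            return;
        }
        Path file = Paths.get(args[0]);
        // Compare these two values with the checksum stored in the index.
        System.out.println("full file:         " + md5Hex(file, Long.MAX_VALUE));
        System.out.println("first 65536 bytes: " + md5Hex(file, 65_536));
    }
}
```

If the indexed checksum equals the second line rather than the first, that matches the behavior described here.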

I attempted to configure a different parser via `tika-config.xml` but wasn't able to work around this issue.

Looking at the code, I suspect the issue is in `TikaDocParser.java`: for `application/octet-stream` files, Tika's `EmptyParser` may not read the full stream, so the `DigestInputStream` hashes only the bytes read during detection. I'm not certain this is the exact cause.
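
To illustrate the suspected mechanism in isolation, the sketch below uses only the JDK's `DigestInputStream`. It is an assumption about what might be happening, not FSCrawler's or Tika's actual code: the class `PartialDigestDemo`, the method `digestAfterReading`, and the 65,536-byte read are all made up for the demonstration. A `DigestInputStream` only updates its digest with bytes that are actually read through it, so a consumer that stops early leaves a digest of the prefix; draining the rest of the stream before calling `digest()` yields the full-file hash.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.Random;

// Hypothetical demonstration class, not FSCrawler or Tika code.
public class PartialDigestDemo {

    /**
     * Reads up to `bytesToRead` bytes through a DigestInputStream, simulating a
     * consumer that stops early. If `drain` is true, the remaining bytes are
     * read (and discarded) so that the digest covers the whole input.
     */
    static byte[] digestAfterReading(byte[] data, int bytesToRead, boolean drain)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (DigestInputStream in = new DigestInputStream(new ByteArrayInputStream(data), md)) {
            byte[] buffer = new byte[8192];
            int remaining = bytesToRead;
            while (remaining > 0) {
                int n = in.read(buffer, 0, Math.min(buffer.length, remaining));
                if (n < 0) {
                    break; // input shorter than bytesToRead
                }
                remaining -= n;
            }
            if (drain) {
                // Read the unread tail so the digest sees every byte.
                while (in.read(buffer) != -1) {
                    // discard
                }
            }
        }
        return md.digest();
    }

    public static void main(String[] args) throws Exception {
        byte[] data = new byte[2 * 1024 * 1024]; // stand-in for a ~2MB binary file
        new Random(42).nextBytes(data);

        byte[] expected = MessageDigest.getInstance("MD5").digest(data);
        byte[] partial = digestAfterReading(data, 65_536, false);
        byte[] drained = digestAfterReading(data, 65_536, true);

        System.out.println("partial matches full file: " + Arrays.equals(expected, partial));
        System.out.println("drained matches full file: " + Arrays.equals(expected, drained));
    }
}
```

If the root cause is what I suspect, a fix along these lines (draining the stream before reading the digest) might be enough, but I haven't verified that against the FSCrawler code path.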

**Job Settings**

```yml
name: "files"
fs:
  tika_config_path: "/usr/share/fscrawler/tika-config/tikaConfig.xml"
  filename_as_id: true
  index_content: true
  includes:
    - "*.mex"
  url: /data
  update_rate: 5s
  indexed_chars: 0
  ignore_above: "5mb"
  lang_detect: false
  raw_metadata: true
  continue_on_error: true
  checksum: "MD5"
  attributes_support: true
  ocr:
    enabled: false
elasticsearch:
  urls:
    - "http://elasticsearch:9200"
  ssl_verification: false
  type: "elasticsearch"
  distribution_version: "es7"
  bulk_size: 2
  byte_size: "500kb"
  flush_interval: "2s"
rest:
  enabled: true
  port: 8080
  host: 0.0.0.0
  url: http://0.0.0.0:8080/fscrawler
```

**Logs**

```
15:20:50,800 INFO  [f.p.e.c.f.FsParserAbstract] Run #X: job [files]: indexed [1], deleted [0]
```

No errors or warnings are logged; indexing completes successfully with the incorrect checksum value.

**Expected behavior**

The MD5 checksum should be calculated from the entire file contents.

**Versions**

- OS: Docker (Linux container) on Windows 11 with WSL2
- Version: 2.10-SNAPSHOT (ES7 build)
- Elasticsearch: 7.17.0


**Questions**

1. Is this a known issue?
2. Are there any workarounds or configuration options that might help?


MD5 checksum incorrect for binary files > 64KB (application/octet-stream) #2252