MD5 checksum incorrect for binary files > 64KB (application/octet-stream) #2252

@thshorrock

Describe the bug

FSCrawler appears to calculate incorrect MD5 checksums for binary files larger than 64KB. The checksum seems to be computed from only the first 65,536 bytes instead of the entire file.

For a 2.09MB binary file detected as application/octet-stream:

  • The true MD5 (full file) differs from the MD5 checksum indexed by FSCrawler

The FSCrawler checksum matches the MD5 of exactly the first 65,536 bytes.
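
For anyone wanting to reproduce the comparison, here is a minimal sketch that prints the MD5 of a whole file next to the MD5 of its first 65,536 bytes. It uses only the JDK and is independent of FSCrawler; the class name `Md5PrefixCheck` and its helper `md5Hex` are made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of FSCrawler.
public class Md5PrefixCheck {
    // The prefix length at which the indexed checksum appears to stop.
    static final long PREFIX_BYTES = 65_536;

    // MD5 of at most `limit` leading bytes of the file, as lowercase hex.
    static String md5Hex(Path file, long limit) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] buf = new byte[8192];
        long remaining = limit;
        try (InputStream in = Files.newInputStream(file)) {
            while (remaining > 0) {
                int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
                if (n < 0) {
                    break; // end of file reached before the limit
                }
                md.update(buf, 0, n);
                remaining -= n;
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java Md5PrefixCheck <file>");
            return;
        }
        Path file = Paths.get(args[0]);
        System.out.println("full file:         " + md5Hex(file, Long.MAX_VALUE));
        System.out.println("first 65536 bytes: " + md5Hex(file, PREFIX_BYTES));
    }
}
```

If the second line matches the checksum stored in the index and the first does not, that reproduces the behaviour described here.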

I attempted to configure a different parser via tika-config.xml but wasn't able to work around this issue.

Looking at the code, I suspect the issue may be in TikaDocParser.java: for application/octet-stream files, Tika's EmptyParser may not read the full stream, so the DigestInputStream only hashes the bytes read during detection. However, I'm not certain this is the exact cause.
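
To illustrate the suspected mechanism, here is a minimal sketch of how `java.security.DigestInputStream` behaves when its consumer stops reading early. This is not FSCrawler's code: `PartialDigestDemo` and `digestAfterReading` are hypothetical names, and the 65,536-byte read only stands in for whatever the detection step actually consumes. It shows that the digest covers only the bytes read through the stream, and that draining the stream before calling `digest()` yields the full-content MD5.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical demo, not FSCrawler or Tika code.
public class PartialDigestDemo {
    // Reads `bytesToRead` bytes through a DigestInputStream (standing in for a
    // consumer that stops early), optionally drains the rest, then returns the MD5.
    static byte[] digestAfterReading(byte[] content, int bytesToRead, boolean drain)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] buf = new byte[8192];
        try (DigestInputStream in =
                 new DigestInputStream(new ByteArrayInputStream(content), md)) {
            int remaining = bytesToRead;
            while (remaining > 0) {
                int n = in.read(buf, 0, Math.min(buf.length, remaining));
                if (n < 0) {
                    break;
                }
                remaining -= n;
            }
            if (drain) {
                // Consume whatever the early reader left behind so it gets hashed too.
                while (in.read(buf) != -1) {
                    // nothing to do: reading updates the digest
                }
            }
        }
        return md.digest();
    }

    public static void main(String[] args) throws Exception {
        byte[] content = new byte[100_000];
        for (int i = 0; i < content.length; i++) {
            content[i] = (byte) (i % 251);
        }
        byte[] prefixMd5 = MessageDigest.getInstance("MD5").digest(Arrays.copyOf(content, 65_536));
        byte[] fullMd5 = MessageDigest.getInstance("MD5").digest(content);

        byte[] partial = digestAfterReading(content, 65_536, false);
        byte[] drained = digestAfterReading(content, 65_536, true);

        System.out.println("partial read == MD5 of first 65536 bytes: " + Arrays.equals(partial, prefixMd5));
        System.out.println("drained read == MD5 of full content:      " + Arrays.equals(drained, fullMd5));
    }
}
```

If the hypothesis holds, a fix along the lines of the `drain` branch (reading the remainder of the stream before taking the digest) might be enough, though I have not confirmed where in the code that would belong.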

Job Settings

name: "files"
fs:
  tika_config_path: "/usr/share/fscrawler/tika-config/tikaConfig.xml"
  filename_as_id: true
  index_content: true
  includes:
    - "*.mex"
  url: /data
  update_rate: 5s
  indexed_chars: 0
  ignore_above: "5mb"
  lang_detect: false
  raw_metadata: true
  continue_on_error: true
  checksum: "MD5"
  attributes_support: true
  ocr:
    enabled: false
elasticsearch:
  urls:
    - "http://elasticsearch:9200"
  ssl_verification: false
  type: "elasticsearch"
  distribution_version: "es7"
  bulk_size: 2
  byte_size: "500kb"
  flush_interval: "2s"
rest:
  enabled: true
  port: 8080
  host: 0.0.0.0
  url: http://0.0.0.0:8080/fscrawler

Logs

15:20:50,800 INFO  [f.p.e.c.f.FsParserAbstract] Run #X: job [files]: indexed [1], deleted [0]

No errors or warnings - indexing completes successfully with the incorrect checksum value.

Expected behavior

The MD5 checksum should be calculated from the entire file contents.

Versions:

  • OS: Docker (Linux container) on Windows 11 with WSL2
  • Version: 2.10-SNAPSHOT (ES7 build)
  • Elasticsearch: 7.17.0

Questions

  1. Is this a known issue?
  2. Are there any workarounds or configuration options that might help?
