Performance improvement for File Details web/api #441

@lfreijo

Description

Is your feature request related to a problem? Please describe.

The file detail page (/api/v4/file/result/<sha256>/) becomes unusably slow on large deployments. On our production instance with ~200M submissions and ~1.2B results in Elasticsearch, page loads take 30-60 seconds. Meanwhile, search pages return instantly.

The root cause is get_file_submission_meta() in assemblyline/datastore/helper.py, which builds the query:

query = f"files.sha256:{sha256} OR results:{sha256}*"

The files.sha256:{sha256} term query is fast (~26ms), but the results:{sha256}* wildcard query must scan the entire submission index because the results field stores full result key strings (e.g., {sha256}.ServiceName.vVersion.cConfig). On our 200M-document submission index, this single wildcard query takes 35+ seconds, even with the results field mapped as the ES wildcard type. Force-merging the index to reduce segment count made it worse, not better, because larger merged segments produce larger ngram automatons for the wildcard field type.

The other queries on the file detail page (result keys, parent files, child files) are all sub-100ms. The submission metadata facet is the sole bottleneck.
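The result-key layout described above is why only a prefix wildcard can match: the file hash is embedded as the first 64 characters of each stored key. A small illustration (the service name, version, and config tag here are hypothetical, not real keys from the index):

```python
# Illustrative only: result keys embed the file's SHA256 as their
# first 64 characters, followed by service name, version, and config.
sha256 = "a" * 64  # hypothetical file hash

result_key = f"{sha256}.Extract.v4_6_0.c7f0061d"  # hypothetical key

# The submission document stores full keys like this in `results`,
# so matching a file requires the prefix wildcard f"{sha256}*" ...
assert result_key.startswith(sha256)

# ... whereas slicing off the 64-character prefix recovers exactly the
# hash a simple term query could match.
assert result_key[:64] == sha256
```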

Describe the solution you'd like

Add a denormalized file_sha256s keyword field to the Submission model that stores all unique SHA256 hashes associated with the submission -- both originally submitted files (from files[].sha256) and extracted/processed files (the 64-character SHA256 prefix from each entry in results[]).

This field should be populated at write time whenever the results array is updated (i.e., when the Dispatcher adds result keys as services complete). The get_file_submission_meta() query then becomes a simple term match:

query = f"file_sha256s:{sha256}"

On our production data, this drops the submission metadata query from 35,000ms to 128ms -- a 270x improvement -- with identical hit counts.

The field should be:

  • Type: keyword (array)
  • Updated whenever results is modified, by extracting result_key[:64] for each entry
  • Added to the Submission ODM model in assemblyline/odm/models/submission.py
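A minimal sketch of the write-time population step (the helper name and input shapes are assumptions for illustration; the real change would live in the Submission ODM and the Dispatcher's write path):

```python
def build_file_sha256s(files, results):
    """Collect the unique SHA256s for the denormalized field.

    `files` is the submission's files[] list of {"sha256": ...} dicts;
    `results` is the list of full result key strings. Hypothetical
    helper, not actual Assemblyline code.
    """
    hashes = {f["sha256"] for f in files}
    # Each result key starts with the 64-character SHA256 of the file
    # the result belongs to, so the prefix is the hash.
    hashes.update(key[:64] for key in results)
    return sorted(hashes)


# Example: one submitted file plus a result for an extracted file.
files = [{"sha256": "a" * 64}]
results = ["a" * 64 + ".ServiceA.v1.c0ff", "b" * 64 + ".ServiceB.v1.c000"]
print(build_file_sha256s(files, results))  # two distinct hashes
```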

Describe alternatives you've considered

  1. Force-merging the submission index to reduce Elasticsearch segment count. This actually degraded wildcard query performance because the wildcard field type uses an ngram-based index internally, and fewer, larger segments produce larger automatons that are slower to traverse. After force merge, the query went from 17s to 35s.

  2. Splitting the submission index into time-based indices (e.g., monthly) via ILM. This would reduce the scan size per index, but the file detail page needs to find all submissions containing a given file across all time, so it would still fan out across every time-based index. Net improvement would be modest.

  3. Using an Elasticsearch ingest pipeline to populate the field without code changes. This is what we implemented as a workaround -- an ingest pipeline extracts SHA256 prefixes from result keys into file_sha256s on every write, and the default pipeline setting on the submission index ensures new documents are processed. This works but is fragile: it exists outside the application code, must be manually applied to each deployment, and the query-side change still requires patching helper.py in the container. A native implementation in the ODM and datastore helper would be more robust and benefit all deployments.
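For reference, the ingest pipeline in option 3 could look roughly like the following (the processor script and field handling are illustrative, not the exact pipeline we deployed):

```json
{
  "description": "Populate file_sha256s from files[].sha256 and results[] prefixes",
  "processors": [
    {
      "script": {
        "source": "def hashes = new HashSet(); if (ctx.files != null) { for (def f : ctx.files) { hashes.add(f.sha256); } } if (ctx.results != null) { for (def key : ctx.results) { if (key.length() >= 64) { hashes.add(key.substring(0, 64)); } } } ctx.file_sha256s = new ArrayList(hashes);"
      }
    }
  ]
}
```

It would be registered via `PUT _ingest/pipeline/<name>` and attached through the `index.default_pipeline` setting on the submission index, as described above.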

  4. Changing the results field mapping from wildcard to keyword to enable faster prefix queries. This would require a full reindex of the submission index and would not fundamentally change the query pattern -- prefix/wildcard queries on keyword arrays with 200M documents are still slow.

Additional context

We tested and deployed this fix on our production Assemblyline 4.6.0 instance:

  • Cluster: 20-node Elasticsearch cluster, 18 data nodes
  • Index sizes: submission_hot = 200M docs / 1.6TB, result_hot = 1.24B docs / 9.6TB
  • Measured improvement: File detail page load dropped from 30-60s to 250ms

The workaround (ES ingest pipeline + patched helper.py) is fully backwards compatible. The file_sha256s field is additive -- removing the code change reverts to the original wildcard query with no data loss or schema issues.

The affected code path is:

  • assemblyline/datastore/helper.py:540-549 -- get_file_submission_meta()
  • Called from assemblyline_ui/api/v4/file.py -- /api/v4/file/result/<sha256>/ endpoint
  • Called from assemblyline_ui/api/v4/submission.py -- /api/v4/submission/<sid>/file/<sha256>/ endpoint

There are two additional call sites using similar wildcard patterns on the results field (in get_or_create_summary(), at lines 350 and 452), though these sit in deletion/cleanup paths rather than on the user-facing hot path.

Metadata

Labels

assess (We still haven't decided if this will be worked on or not), enhancement (New feature or request)
