Performance improvement for File Details web/api #441

@lfreijo

Description

Is your feature request related to a problem? Please describe.

The file detail page (/api/v4/file/result/<sha256>/) becomes unusably slow on large deployments. On our production instance with ~200M submissions and ~1.2B results in Elasticsearch, page loads take 30-60 seconds. Meanwhile, search pages return instantly.

The root cause is get_file_submission_meta() in assemblyline/datastore/helper.py, which builds the query:

query = f"files.sha256:{sha256} OR results:{sha256}*"

The files.sha256:{sha256} term query is fast (~26ms), but the results:{sha256}* wildcard query must scan the entire submission index because the results field stores full result key strings (e.g., {sha256}.ServiceName.vVersion.cConfig). On our 200M-document submission index, this single wildcard query takes 35+ seconds, even with the results field mapped as the ES wildcard type. Force-merging the index to reduce segment count made it worse, not better, because larger merged segments produce larger ngram automatons for the wildcard field type.

The other queries on the file detail page (result keys, parent files, child files) are all sub-100ms. The submission metadata facet is the sole bottleneck.
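The result-key layout described above is why only a prefix wildcard can match: the file hash is embedded as the first 64 characters of each stored key. A small illustration (the service name, version, and config tag here are hypothetical, not real keys from the index):

```python
# Illustrative only: result keys embed the file's SHA256 as their
# first 64 characters, followed by service name, version, and config.
sha256 = "a" * 64  # hypothetical file hash

result_key = f"{sha256}.Extract.v4_6_0.c7f0061d"  # hypothetical key

# The submission document stores full keys like this in `results`,
# so matching a file requires the prefix wildcard f"{sha256}*" ...
assert result_key.startswith(sha256)

# ... whereas slicing off the 64-character prefix recovers exactly the
# hash a simple term query could match.
assert result_key[:64] == sha256
```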

Describe the solution you'd like

Add a denormalized file_sha256s keyword field to the Submission model that stores all unique SHA256 hashes associated with the submission -- both originally submitted files (from files[].sha256) and extracted/processed files (the 64-character SHA256 prefix from each entry in results[]).

This field should be populated at write time whenever the results array is updated (i.e., when the Dispatcher adds result keys as services complete). The get_file_submission_meta() query then becomes a simple term match:

query = f"file_sha256s:{sha256}"

On our production data, this drops the submission metadata query from 35,000ms to 128ms -- a 270x improvement -- with identical hit counts.

The field should be:

  • Type: keyword (array)
  • Updated whenever results is modified, by extracting result_key[:64] for each entry
  • Added to the Submission ODM model in assemblyline/odm/models/submission.py
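A minimal sketch of the write-time population step (the helper name and input shapes are assumptions for illustration; the real change would live in the Submission ODM and the Dispatcher's write path):

```python
def build_file_sha256s(files, results):
    """Collect the unique SHA256s for the denormalized field.

    `files` is the submission's files[] list of {"sha256": ...} dicts;
    `results` is the list of full result key strings. Hypothetical
    helper, not actual Assemblyline code.
    """
    hashes = {f["sha256"] for f in files}
    # Each result key starts with the 64-character SHA256 of the file
    # the result belongs to, so the prefix is the hash.
    hashes.update(key[:64] for key in results)
    return sorted(hashes)


# Example: one submitted file plus a result for an extracted file.
files = [{"sha256": "a" * 64}]
results = ["a" * 64 + ".ServiceA.v1.c0ff", "b" * 64 + ".ServiceB.v1.c000"]
print(build_file_sha256s(files, results))  # two distinct hashes
```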

Describe alternatives you've considered

  1. Force-merging the submission index to reduce Elasticsearch segment count. This actually degraded wildcard query performance because the wildcard field type uses an ngram-based index internally, and fewer, larger segments produce larger automatons that are slower to traverse. After force merge, the query went from 17s to 35s.

  2. Splitting the submission index into time-based indices (e.g., monthly) via ILM. This would reduce the scan size per index, but the file detail page needs to find all submissions containing a given file across all time, so it would still fan out across every time-based index. Net improvement would be modest.

  3. Using an Elasticsearch ingest pipeline to populate the field without code changes. This is what we implemented as a workaround -- an ingest pipeline extracts SHA256 prefixes from result keys into file_sha256s on every write, and the default pipeline setting on the submission index ensures new documents are processed. This works but is fragile: it exists outside the application code, must be manually applied to each deployment, and the query-side change still requires patching helper.py in the container. A native implementation in the ODM and datastore helper would be more robust and benefit all deployments.
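For reference, the ingest pipeline in option 3 could look roughly like the following (the processor script and field handling are illustrative, not the exact pipeline we deployed):

```json
{
  "description": "Populate file_sha256s from files[].sha256 and results[] prefixes",
  "processors": [
    {
      "script": {
        "source": "def hashes = new HashSet(); if (ctx.files != null) { for (def f : ctx.files) { hashes.add(f.sha256); } } if (ctx.results != null) { for (def key : ctx.results) { if (key.length() >= 64) { hashes.add(key.substring(0, 64)); } } } ctx.file_sha256s = new ArrayList(hashes);"
      }
    }
  ]
}
```

It would be registered via `PUT _ingest/pipeline/<name>` and attached through the `index.default_pipeline` setting on the submission index, as described above.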

  4. Changing the results field mapping from wildcard to keyword to enable faster prefix queries. This would require a full reindex of the submission index and would not fundamentally change the query pattern -- prefix/wildcard queries on keyword arrays with 200M documents are still slow.

Additional context

We tested and deployed this fix on our production Assemblyline 4.6.0 instance:

  • Cluster: 20-node Elasticsearch cluster, 18 data nodes
  • Index sizes: submission_hot = 200M docs / 1.6TB, result_hot = 1.24B docs / 9.6TB
  • Measured improvement: File detail page load dropped from 30-60s to 250ms

The workaround (ES ingest pipeline + patched helper.py) is fully backwards compatible. The file_sha256s field is additive -- removing the code change reverts to the original wildcard query with no data loss or schema issues.

The affected code path is:

  • assemblyline/datastore/helper.py:540-549 -- get_file_submission_meta()
  • Called from assemblyline_ui/api/v4/file.py -- /api/v4/file/result/<sha256>/ endpoint
  • Called from assemblyline_ui/api/v4/submission.py -- /api/v4/submission/<sid>/file/<sha256>/ endpoint

There are two additional call sites using similar wildcard patterns on the results field (in get_or_create_summary(), at lines 350 and 452), though these sit in deletion/cleanup paths rather than on the user-facing hot path.

Metadata

Labels

assess (We still haven't decided if this will be worked on or not), enhancement (New feature or request)
