Memory optimization: RefSet document retention during dereferencing #133

@char0n

Description

Problem

During dereferencing, the RefSet holds every external file's full parsed ApiDOM tree
(ParseResultElement) in memory for the entire session. For large APIs with many external files
(e.g., DigitalOcean API with 1000+ files), this can consume 500 MB - 5 GB+ of memory.

Root cause

  • Reference.value holds the entire parsed ApiDOM tree, not a fragment
  • With immutable: true (the default), each file is stored twice: a mutable clone
    (cloneDeep(parseResult)) plus the immutable original
  • No eviction policy — once parsed, a document stays in RefSet until refSet.clean() at the very
    end
  • No streaming — entire file must be fully parsed before any JSON Pointer fragment is extracted
  • Recursive $ref resolution can pull in hundreds/thousands of transitive dependencies

Data flow

$ref encountered → check RefSet
  → miss → fetch file → parse → cloneDeep(parseResult) → add to RefSet
  → hit → return cached Reference

After dereferencing completes, refSet.clean() releases everything at once.

Key files

  • packages/apidom-reference/src/ReferenceSet.ts — RefSet container (no size limit, no eviction)
  • packages/apidom-reference/src/Reference.ts — holds full ParseResultElement in value
  • packages/apidom-reference/src/dereference/strategies/openapi-3-1/visitor.ts —
    toReference() method (parse + clone + cache)
  • packages/apidom-reference/src/dereference/strategies/openapi-3-1/index.ts — cleanup at end

Existing infrastructure we can leverage

HTTP response cache (HTTPResolverAxios)

The HTTP resolver already has an in-memory cache for raw Buffer responses. If a parsed document
is evicted from RefSet and needed again:

  • Re-fetch cost: zero (HTTP cache hit)
  • Re-parse cost: CPU only (no network I/O)
  • Memory trade-off: Buffer (raw bytes, small) stays cached, ParseResultElement (10-100x
    larger) gets evicted

This makes eviction-based strategies practical — re-parsing from cached buffers is cheap.

consume: true refractor option

When parsing referenced files internally, consume: true can be passed to the refractor to
reduce peak memory during each file's refraction. This option already exists.

Remediation plan

Phase 1: Quick wins (low risk, high impact)

1.1 Use consume: true when parsing referenced files

In toReference(), pass consume: true to the parse/refract pipeline for external files. Each
file's refraction uses less peak memory.

1.2 Skip cloneDeep in mutable mode

When immutable: false, the mutable reference is still built from a deep clone even though no
clone is needed; the parse result could be stored directly.
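A minimal sketch of the fix, assuming immutable semantics are signaled by an options flag — toReferenceValue, DereferenceOptions, and the structuredClone-based deep clone are illustrative names, not the actual apidom internals:

```typescript
// Sketch, not apidom code: clone only when immutable semantics were
// actually requested.
interface DereferenceOptions {
  immutable: boolean;
}

function toReferenceValue<T>(parseResult: T, options: DereferenceOptions): T {
  if (options.immutable) {
    // immutable mode: hand out a deep clone, keep the original pristine
    return structuredClone(parseResult);
  }
  // mutable mode: reuse the parse result directly, no clone
  return parseResult;
}

const doc = { openapi: '3.1.0', paths: {} };
const mutableValue = toReferenceValue(doc, { immutable: false });
const immutableValue = toReferenceValue(doc, { immutable: true });
```

In mutable mode the reference and the parse result are now the same object, which is exactly the point: one copy in memory instead of two.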

1.3 Add maxRefSetSize option

Allow users to cap how many parsed documents RefSet holds. When exceeded, evict
least-recently-used entries. Re-parse from HTTP cache or file system if needed again.

Default: unlimited (backward compatible)
Recommended for large APIs: maxRefSetSize: 50 or similar

Phase 2: LRU eviction on RefSet (medium risk, high impact)

2.1 LRU cache for RefSet

Replace the simple array in ReferenceSet with an LRU cache:

  • Track access order
  • When capacity exceeded, evict least-recently-used Reference
  • Store eviction metadata (URI + depth) to know how to re-parse if needed
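A capacity-bounded LRU keyed by URI could look roughly like this. It relies on the fact that a JavaScript Map iterates in insertion order; RefSetLRU, its method names, and the eviction callback are assumptions for this sketch, not existing apidom types:

```typescript
// Minimal LRU sketch: re-inserting an entry on every access keeps the
// least-recently-used entry first in the Map's iteration order.
class RefSetLRU<V> {
  private entries = new Map<string, V>();

  constructor(
    private maxSize: number,
    // called with the evicted URI so the caller can record re-parse
    // metadata (URI + depth) before the value is dropped
    private onEvict?: (uri: string, value: V) => void,
  ) {}

  get(uri: string): V | undefined {
    const value = this.entries.get(uri);
    if (value !== undefined) {
      // refresh recency: move the entry to the back of the Map
      this.entries.delete(uri);
      this.entries.set(uri, value);
    }
    return value;
  }

  set(uri: string, value: V): void {
    this.entries.delete(uri);
    this.entries.set(uri, value);
    if (this.entries.size > this.maxSize) {
      // the first key in iteration order is the least recently used
      const lru = this.entries.keys().next().value as string;
      const evicted = this.entries.get(lru)!;
      this.entries.delete(lru);
      this.onEvict?.(lru, evicted);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}

const evictedUris: string[] = [];
const cache = new RefSetLRU<string>(2, (uri) => evictedUris.push(uri));
cache.set('a.yaml', 'A');
cache.set('b.yaml', 'B');
cache.get('a.yaml');      // touch a.yaml, so b.yaml becomes the LRU entry
cache.set('c.yaml', 'C'); // capacity exceeded: b.yaml is evicted
```

A Map-based LRU keeps both get and set at O(1), which matters because the RefSet is consulted on every $ref.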

2.2 Lazy re-parse on cache miss

When an evicted reference is needed again:

  1. Check HTTP cache for raw buffer → re-parse
  2. If not in HTTP cache, re-fetch from network/filesystem → parse
  3. Re-add to RefSet (may evict another entry)
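The lookup chain above can be sketched as follows. The Map-based caches and the fetch/parse functions are stand-ins, not apidom's actual HTTPResolverAxios API; only the shape of the fallback chain is the point:

```typescript
// Stand-in types: a RefSet of parsed documents, an HTTP cache of raw
// response bodies, and a fetch function as the last resort.
type Parsed = { uri: string; tree: unknown };

function resolveReference(
  uri: string,
  refSet: Map<string, Parsed>,
  httpCache: Map<string, string>,
  fetchRemote: (uri: string) => string,
  parse: (raw: string) => unknown,
): Parsed {
  // 1. parsed document still in RefSet → done
  const hit = refSet.get(uri);
  if (hit !== undefined) return hit;

  // 2. raw bytes cached → re-parse (CPU only, no network I/O)
  let raw = httpCache.get(uri);
  if (raw === undefined) {
    // 3. full miss → re-fetch, then cache the raw bytes
    raw = fetchRemote(uri);
    httpCache.set(uri, raw);
  }

  const parsed: Parsed = { uri, tree: parse(raw) };
  // re-add to RefSet; with a bounded RefSet this may evict another entry
  refSet.set(uri, parsed);
  return parsed;
}

let fetches = 0;
const refSet = new Map<string, Parsed>();
const httpCache = new Map<string, string>([['a.json', '{"x":1}']]);
const fetchRemote = (_uri: string) => { fetches += 1; return '{"x":2}'; };

// evicted from RefSet but raw bytes still in the HTTP cache: no fetch
const fromHttpCache = resolveReference('a.json', refSet, httpCache, fetchRemote, JSON.parse);

// simulate total eviction: now a real fetch is needed
refSet.clear();
httpCache.clear();
const fromNetwork = resolveReference('a.json', refSet, httpCache, fetchRemote, JSON.parse);
```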

Cost model:

  • HTTP cache hit: ~10-100ms (parse only)
  • HTTP cache miss: ~100-1000ms (fetch + parse)
  • Memory saved per eviction: ~1-50 MB per document

Phase 3: Fragment-only retention (medium risk, highest impact)

3.1 After extracting JSON Pointer fragment, release full document

The dereference visitor uses JSON Pointer to extract a specific fragment from the parsed
document. After extraction:

  • Keep only the fragment in memory
  • Release the full ParseResultElement
  • If another $ref points to a different fragment in the same file, re-parse from cache

This is the most aggressive optimization — reduces per-file memory from "entire document" to
"just the referenced fragment."
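The shape of the idea, using a plain-object RFC 6901 evaluator as a stand-in for ApiDOM's JSON Pointer evaluation — retainFragment is a hypothetical helper, not an existing function:

```typescript
// Minimal RFC 6901 (JSON Pointer) evaluator over plain objects, a
// stand-in for ApiDOM's own pointer evaluation.
function evaluatePointer(doc: unknown, pointer: string): unknown {
  if (pointer === '') return doc;
  return pointer
    .slice(1) // drop the leading '/'
    .split('/')
    .map((token) => token.replace(/~1/g, '/').replace(/~0/g, '~'))
    .reduce<any>((node, token) => node?.[token], doc);
}

// Fragment-only retention: parse (or re-parse from the HTTP cache),
// extract the fragment, and let the full document be collected.
function retainFragment(parse: () => unknown, pointer: string): unknown {
  const fullDocument = parse(); // large: the entire parsed tree
  const fragment = evaluatePointer(fullDocument, pointer);
  // fullDocument goes out of scope here; only the usually much
  // smaller fragment stays referenced.
  return fragment;
}

const spec = {
  components: { schemas: { Pet: { type: 'object' } } },
};
const pet = retainFragment(() => spec, '/components/schemas/Pet');
```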

3.2 Fragment cache per URI

Cache at the fragment level: Map<string, Map<string, Element>> where outer key is URI, inner
key is JSON Pointer.
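A sketch of that two-level cache — FragmentCache is an assumed name, not an existing apidom class:

```typescript
// Two-level fragment cache: outer key is the document URI, inner key
// is the JSON Pointer into that document.
class FragmentCache<Element> {
  private byUri = new Map<string, Map<string, Element>>();

  get(uri: string, pointer: string): Element | undefined {
    return this.byUri.get(uri)?.get(pointer);
  }

  set(uri: string, pointer: string, fragment: Element): void {
    let fragments = this.byUri.get(uri);
    if (fragments === undefined) {
      fragments = new Map<string, Element>();
      this.byUri.set(uri, fragments);
    }
    fragments.set(pointer, fragment);
  }

  // drop all fragments of one document, e.g. when its source changes
  evictUri(uri: string): void {
    this.byUri.delete(uri);
  }
}

const fragments = new FragmentCache<object>();
fragments.set('pets.yaml', '/components/schemas/Pet', { type: 'object' });
fragments.set('pets.yaml', '/components/schemas/Error', { type: 'string' });

const hit = fragments.get('pets.yaml', '/components/schemas/Pet');
fragments.evictUri('pets.yaml');
const afterEvict = fragments.get('pets.yaml', '/components/schemas/Pet');
```

Keying the outer map by URI keeps whole-document eviction cheap: one delete drops every fragment of that file at once.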

Phase 4: Immutable mode optimization (low risk)

4.1 Copy-on-write instead of upfront cloneDeep

Currently immutable: true deep-clones every referenced document upfront. Instead:

  • Store only the immutable original
  • Clone lazily when the mutable version is actually modified
  • Many referenced documents are never modified — saves the clone entirely
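One way to sketch the laziness (a simplification: this clones on the first mutable access rather than intercepting individual writes, and CowReference is an illustrative name):

```typescript
// Copy-on-write sketch: store only the shared original and clone
// lazily, the first time a mutable view is actually requested.
class CowReference<T extends object> {
  private mutableCopy: T | undefined;
  cloneCount = 0; // instrumentation for this sketch only

  constructor(private readonly original: T) {}

  // read-only access: always the shared original, never a clone
  get value(): T {
    return this.original;
  }

  // mutable access: deep-clone once, on first use
  get mutableValue(): T {
    if (this.mutableCopy === undefined) {
      this.mutableCopy = structuredClone(this.original);
      this.cloneCount += 1;
    }
    return this.mutableCopy;
  }
}

const ref = new CowReference({ info: { title: 'Pets' } });
const readOnly = ref.value;               // no clone happens
ref.mutableValue.info.title = 'Pets v2';  // first mutable access clones
ref.mutableValue.info.title = 'Pets v3';  // reuses the same clone
```

Documents that are only ever read through value never pay the clone at all, which is the common case the section describes.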

4.2 Structural sharing

For documents that are mostly read-only with small modifications, use structural sharing (clone
only the modified path, share the rest).
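Path copying over plain objects illustrates the technique (setIn is a hypothetical helper; ApiDOM elements would need their own equivalent):

```typescript
// Structural sharing sketch: rebuild only the objects along the
// modified path; every untouched subtree stays shared by reference.
function setIn<T extends object>(obj: T, path: string[], value: unknown): T {
  if (path.length === 0) return value as T;
  const [head, ...rest] = path;
  const child = (obj as Record<string, unknown>)[head] ?? {};
  return {
    ...obj,                                      // shallow copy of this level only
    [head]: setIn(child as object, rest, value), // recurse down the path
  } as T;
}

const original = {
  info: { title: 'Pets', version: '1.0.0' },
  paths: { '/pets': { get: { summary: 'List pets' } } },
};
const updated = setIn(original, ['info', 'title'], 'Pets v2');
```

For a change k levels deep, only k small objects are allocated; everything else, including the typically large paths subtree, is shared between the old and new versions.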

Estimated impact

Phase                              Memory reduction            Complexity  Risk
Phase 1 (consume + maxRefSetSize)  20-40%                      Low         Low
Phase 2 (LRU eviction)             50-70%                      Medium      Medium
Phase 3 (fragment-only)            80-90%                      High        Medium
Phase 4 (copy-on-write)            ~50% of immutable overhead  Medium      Low

For a 1000-file API (estimated):

Configuration                    Memory
Current (no limit, immutable)    2-5 GB
+ Phase 1 (consume + cap at 50)  500 MB - 1 GB
+ Phase 2 (LRU, 50 entries)      200-500 MB
+ Phase 3 (fragments only)       50-200 MB
