S14S Identify

Enterprise Identifier Registry — a REST API for consolidating customer identities across multiple source systems using probabilistic record linkage.

When different systems (CRM, billing, support, etc.) each maintain their own customer records, S14S Identify serves as a single source of truth. It uses the Fellegi-Sunter model to determine whether an incoming record refers to an existing person, and links them automatically via an aliases array.

Quick Start
API Overview
Source System Registration
Customer Matching
Match Quality and F1 Feedback Loop
Near-Miss Candidates and Candidate Review
Nickname Normalization
Typeahead Search
Search Tokens and Candidate Blocking
- Token Design
- Double Metaphone
Input Sanitization
- E.164 Phone Normalization
- Address Standardization (USPS Pub 28)
Audit Trail
Atomicity and Transactions
Soft Deletes
Data Model
Testing
Project Structure

Quick Start

# Install dependencies
npm install

# Start MongoDB (single-node replica set for transaction support)
docker compose up -d

# Seed 1000 sample customers
npm run seed

# Run the server
npm start

# Run tests with coverage
npm test

# Run build (enforces 100% coverage)
npm run build

The Swagger UI is available at http://localhost:3000/api-docs once the server is running.

API Overview

All endpoints are served under /sources, /customers, and /match-quality. The x-user-id header identifies who is performing the action (used for audit tracking).

Sources

Method	Endpoint	Description
`POST`	`/sources`	Register a new source system
`GET`	`/sources`	List all active sources (`?include_deleted=true` for all)
`GET`	`/sources/:id`	Get a source system by ID
`PUT`	`/sources/:id`	Update a source system
`DELETE`	`/sources/:id`	Soft delete a source system

Customers

Method	Endpoint	Description
`POST`	`/customers`	Create or match a customer
`GET`	`/customers`	List all active customers (`?under_review=true` for review queue)
`GET`	`/customers/search?q=`	Typeahead search by name prefix
`GET`	`/customers/:id`	Get a customer by ID (`?source_system=` for original record)
`PUT`	`/customers/:id`	Update a customer
`PATCH`	`/customers/:id`	Merge another customer into this one
`DELETE`	`/customers/:id`	Soft delete a customer
`GET`	`/customers/:id/aliases`	Get cross-system identity links
`GET`	`/customers/:id/changes`	Get change history
`POST`	`/customers/:id/aliases/:aliasId/feedback`	Report a false positive match
`GET`	`/customers/:id/aliases/:aliasId/candidates`	Get near-miss match candidates for an alias
`POST`	`/customers/:id/aliases/:aliasId/candidates/:candidateId/approve`	Approve a candidate (merges records)
`POST`	`/customers/:id/aliases/:aliasId/candidates/:candidateId/reject`	Reject a candidate
`GET`	`/match-quality`	F1 score, precision, and recall metrics
`GET`	`/match-quality/tune`	Suggested weight adjustments based on feedback
`GET`	`/match-quality/feedback`	List match feedback records

POST Behavior

The POST /customers endpoint does not blindly create records. It first validates that the source_system is registered (see Source System Registration), then checks for source key collisions before running the Fellegi-Sunter matching algorithm:

Unregistered source system — returns 400 if the source_system is not registered in the sources collection.
Source key collision with aligned data — if a record with the same source_system + source_key already exists and the data aligns (confidence >= review threshold 0.70), the alias payload is updated and the API returns 200.
Source key collision with misaligned data — if the existing record's data doesn't align, the API returns 409 with collision details and the confidence score.
Score >= auto-approve threshold (currently 95% confidence) — the record is identified as an existing person. The incoming data is added as an alias to the matched record, and the API returns 200 with the existing customer.
Score >= review threshold (70%) but < auto-approve — a new customer record is created, and the near-miss candidates are attached to the alias as candidates for manual review. The API returns 201 with the candidates included.
Score < review threshold — a new customer record is created with no candidates. The API returns 201.

This means source systems can POST freely without worrying about duplicates. The matching engine handles deduplication automatically. First names are normalized from nicknames to formal equivalents before matching (e.g., "Chuck" becomes "Charles"), while the original name is preserved in the alias's original_payload.

The auto-approve threshold is not a fixed constant: it is tunable via the F1 feedback loop based on real-world match accuracy.

Logic Flow Diagrams

Each resource and action has its own control flow diagram below, grouped by resource.

POST /sources

graph TD
    A[POST /sources] --> B{Validate Body}
    B -- Missing name --> E400[Return 400 Bad Request]
    B -- Invalid reviewers --> E400
    B -- Valid --> C[Lowercase reviewer emails]
    C --> D[Create Source document]
    D --> E{Name unique?}
    E -- No → duplicate key --> E409[Return 409 Conflict]
    E -- Yes --> F[Save with created_by / created_at]
    F --> G[Return 201 Created]

GET /sources

graph TD
    A[GET /sources] --> B{include_deleted=true?}
    B -- Yes --> C[Query: all sources]
    B -- No --> D[Query: deleted_at = null]
    C --> E[Sort by name ascending]
    D --> E
    E --> F[Return 200 with source array]

GET /sources/:id

graph TD
    A[GET /sources/:id] --> B[Query by ID + deleted_at: null]
    B --> C{Found?}
    C -- No --> D[Return 404]
    C -- Yes --> E[Return 200 with source]

PUT /sources/:id

graph TD
    A[PUT /sources/:id] --> B[Find source by ID]
    B --> C{Found + active?}
    C -- No --> D[Return 404]
    C -- Yes --> E{Validate body}
    E -- Invalid --> F[Return 400]
    E -- Valid --> G[Apply updates + set updated_by/updated_at]
    G --> H{Name changed + unique?}
    H -- Duplicate --> I[Return 409 Conflict]
    H -- OK --> J[Save document]
    J --> K[Return 200 with updated source]

DELETE /sources/:id

graph TD
    A[DELETE /sources/:id] --> B[Find source by ID + active]
    B --> C{Found?}
    C -- No --> D[Return 404]
    C -- Yes --> E[Set deleted_by + deleted_at]
    E --> F[Save document]
    F --> G[Return 200 OK]

POST /customers

graph TD
    A[POST /customers] --> B{Sanitize & Validate Input}
    B -- Errors --> C[Return 400 Bad Request]
    B -- Valid --> SV{Source System Registered?}
    SV -- No --> C
    SV -- Yes --> SC{Source Key Collision?}
    SC -- Yes --> SCA{Data Aligned? ≥ 0.70}
    SCA -- Yes --> SCU[Update Alias Payload → 200]
    SCA -- No --> SCC[Return 409 Collision]
    SC -- No --> N[Normalize Nicknames to Formal Names]
    N --> D[Generate Search Tokens]
    D --> E[Query DB for Candidates via Token Index]
    E --> F{Iterate Candidates}
    F -- For each --> G[Calculate Fellegi-Sunter Score]
    G --> F
    F -- All scored --> H{Best Score ≥ 0.95?}
    H -- Yes --> I[Add Alias to Matched Record]
    I --> K[Save Record + Audit Trail]
    K --> L[Return 200 with Matched Customer]
    H -- No --> NM{Any Score ≥ 0.70?}
    NM -- Yes --> J2[Create New Record + Pending Candidates]
    NM -- No --> J[Create New Customer Record]
    J --> K2[Save Record]
    J2 --> K2
    K2 --> L2[Return 201 Created]

    style H fill:#f9f,stroke:#333
    linkStyle default stroke:#333

Steps:

Sanitization: All incoming data is cleaned, validated, and nickname-normalized.
Source Validation: The source_system must be registered or the request is rejected with 400.
Collision Check: If a record with the same source_system + source_key already exists, the system uses Fellegi-Sunter to determine if the data aligns (update alias, 200) or conflicts (409 collision).
Token Generation: Search tokens (phonetic, prefix, exact) are created from the sanitized data.
Candidate Blocking: The database is queried for records sharing at least one token. Each token is resolved through the index on search_tokens (an O(log N) lookup), so no collection scan is needed.
Scoring: The small set of candidates (C) is scored using the Fellegi-Sunter algorithm.
Decision: Three-tier outcome based on score:
- >= auto-approve (0.95): alias linked automatically (200)
- >= review threshold (0.70): new record created with candidates on the alias for manual review (201)
- < review threshold: new record created, no candidates (201)
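The three-tier decision can be sketched as a small pure function. This is a minimal illustration, not the project's code: `decideOutcome`, its argument shape, and its return shape are hypothetical, and the thresholds are the documented defaults.

```javascript
// Hypothetical helper illustrating the three-tier decision described above.
// Threshold defaults mirror the documented values; the real engine may differ.
const MATCH_THRESHOLD = 0.95;  // auto-approve
const REVIEW_THRESHOLD = 0.70; // near-miss floor

// scoredCandidates: [{ id, confidence }] produced by the scoring step.
function decideOutcome(scoredCandidates, match = MATCH_THRESHOLD, review = REVIEW_THRESHOLD) {
  const sorted = [...scoredCandidates].sort((a, b) => b.confidence - a.confidence);
  const best = sorted[0];
  if (best && best.confidence >= match) {
    // Existing person: link the incoming record as an alias.
    return { status: 200, action: "link_alias", matchedId: best.id, candidates: [] };
  }
  // New record: keep only near misses at or above the review threshold.
  const nearMisses = sorted.filter((c) => c.confidence >= review);
  return { status: 201, action: "create", matchedId: null, candidates: nearMisses };
}
```

Keeping the decision separate from persistence makes the thresholds easy to exercise in unit tests when they are retuned.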

GET /customers

graph TD
    A[GET /customers] --> B{under_review=true?}
    B -- Yes --> C[Query MatchFeedback for unresolved false_positives]
    C --> D[Deduplicate by customer_id, preserve feedback order]
    D --> E[Fetch customers by IDs]
    B -- No --> F{include_deleted=true?}
    F -- Yes --> G[Query: all customers]
    F -- No --> H[Query: deleted_at = null]
    G --> I[Apply pagination: page, limit]
    H --> I
    E --> I
    I --> J[Set headers: X-Total-Count, X-Page, X-Limit, Link]
    J --> K[Return 200 with customer array]

GET /customers/search

graph TD
    A["GET /customers/search?q=..."] --> B{q length ≥ 2?}
    B -- No --> C[Return 400]
    B -- Yes --> D[Lowercase query]
    D --> E[Generate prefix tokens from query]
    E --> F[Expand nicknames to formal equivalents]
    F --> G["Query: search_tokens $in prefix tokens, deleted_at: null"]
    G --> H[Apply limit, default 20, max 100]
    H --> I[Return 200 with matching customers]

GET /customers/:id

graph TD
    A[GET /customers/:id] --> B[Find by ID]
    B --> C{Found?}
    C -- No --> D[Return 404]
    C -- Yes --> E{merged_into set?}
    E -- Yes --> F["Return 301 + Location: /customers/:targetId"]
    E -- No --> G{deleted_at set?}
    G -- Yes --> D
    G -- No --> H{source_system param?}
    H -- Yes --> I[Find alias for source_system]
    I --> J{Alias found?}
    J -- No --> D
    J -- Yes --> K[Overlay original_payload fields onto response]
    K --> L[Return 200]
    H -- No --> L

PUT /customers/:id

graph TD
    A[PUT /customers/:id] --> B{Validate update body}
    B -- Errors --> C[Return 400]
    B -- Valid --> D[Find customer by ID + active]
    D --> E{Found?}
    E -- No --> F[Return 404]
    E -- Yes --> G["Find alias matching x-source-system header"]
    G --> H{Alias found?}
    H -- No --> F
    H -- Yes --> I[Update alias fields + original_payload]
    I --> J[Compute audit delta: from → to]
    J --> K[Append to change_history]
    K --> L["Set _needsResolution flag"]
    L --> M["Pre-save: resolveCustomerFields from all aliases"]
    M --> N[Regenerate search_tokens]
    N --> O[Save document]
    O --> P[Return 200 with updated customer]

PATCH /customers/:id (Merge)

graph TD
    A["PATCH /customers/:id {merge: sourceId}"] --> B{Target exists + active?}
    B -- No --> C[Return 404]
    B -- Yes --> D{Source exists + active?}
    D -- No --> E[Return 404 / 400]
    D -- Yes --> F{Source already merged?}
    F -- Yes --> G[Return 400]
    F -- No --> H{Target = Source?}
    H -- Yes --> G
    H -- No --> I[Start MongoDB Transaction]

    I --> J[Transfer source aliases → target]
    J --> K[Clear source aliases]
    K --> L[Audit entry on target: merge]
    L --> M["Soft-delete source: merged_into = target ID"]
    M --> N[Audit entry on source: merged]
    N --> O["Create MatchFeedback: false_negative"]
    O --> P[Save target + source + feedback in transaction]

    P --> Q{Transaction success?}
    Q -- No --> R[Rollback → Return 500]
    Q -- Yes --> S[Return 200 with merged target]

DELETE /customers/:id

graph TD
    A[DELETE /customers/:id] --> B[Find by ID + active]
    B --> C{Found?}
    C -- No --> D[Return 404]
    C -- Yes --> E[Set deleted_by + deleted_at]
    E --> F[Append soft_delete to change_history]
    F --> G[Save document]
    G --> H[Return 200 OK]

GET /customers/:id/aliases

graph TD
    A[GET /customers/:id/aliases] --> B[Find customer by ID]
    B --> C{Found?}
    C -- No --> D[Return 404]
    C -- Yes --> E[Return 200 with aliases array]

GET /customers/:id/changes

graph TD
    A[GET /customers/:id/changes] --> B[Find customer by ID]
    B --> C{Found?}
    C -- No --> D[Return 404]
    C -- Yes --> E[Return 200 with change_history array]

POST /customers/:id/aliases/:aliasId/feedback

graph TD
    A[POST /:id/aliases/:aliasId/feedback] --> B[Find customer + alias]
    B --> C{Found?}
    C -- No --> D[Return 404]
    C -- Yes --> E{Alias has match_confidence?}
    E -- No / null --> F[Return 400: not an auto-match]
    E -- Yes --> G[Create MatchFeedback record]
    G --> H["type: false_positive"]
    H --> I[Set customer_id, alias_id, original_confidence]
    I --> J[Set reported_by, notes, resolved: false]
    J --> K[Save feedback]
    K --> L[Return 201 with feedback record]

GET /customers/:id/aliases/:aliasId/candidates

graph TD
    A[GET /:id/aliases/:aliasId/candidates] --> B[Find customer + alias]
    B --> C{Found?}
    C -- No --> D[Return 404]
    C -- Yes --> E[Return 200 with candidates array]

POST /.../candidates/:candidateId/approve

graph TD
    A["POST /:id/aliases/:aliasId/candidates/:candidateId/approve"] --> B[Find source customer + alias + candidate]
    B --> C{All found?}
    C -- No --> D[Return 404]
    C -- Yes --> E{Candidate status = pending?}
    E -- No --> F[Return 400: already reviewed]
    E -- Yes --> G[Find target customer by candidate_id]
    G --> H{Target exists?}
    H -- No --> D
    H -- Yes --> I["Mark candidate: approved"]
    I --> J[Reject all other pending candidates on alias]
    J --> K[Start MongoDB Transaction]

    K --> L[Transfer source aliases → target]
    L --> M[Clear source aliases]
    M --> N[Audit entry on target: merge]
    N --> O["Soft-delete source: merged_into = target"]
    O --> P["Create MatchFeedback: false_negative"]
    P --> Q[Save target + source + feedback in transaction]

    Q --> R{Transaction success?}
    R -- No --> S[Rollback → Return 500]
    R -- Yes --> T[Return 200 with target customer]

POST /.../candidates/:candidateId/reject

graph TD
    A["POST /:id/aliases/:aliasId/candidates/:candidateId/reject"] --> B[Find customer + alias + candidate]
    B --> C{All found?}
    C -- No --> D[Return 404]
    C -- Yes --> E{Candidate status = pending?}
    E -- No --> F[Return 400: already reviewed]
    E -- Yes --> G["Mark candidate: rejected"]
    G --> H[Set reviewed_by + reviewed_at]
    H --> I[Save customer document]
    I --> J[Return 200 with candidate]

GET /match-quality

graph TD
    A[GET /match-quality] --> B[Count total auto-matches from aliases]
    B --> C[Count false_positive feedback records]
    C --> D[Count false_negative feedback records]
    D --> E["TP = auto-matches − false_positives"]
    E --> F["Precision = TP / (TP + FP)"]
    F --> G["Recall = TP / (TP + FN)"]
    G --> H["F1 = 2 × P × R / (P + R)"]
    H --> I[Return 200 with metrics]

GET /match-quality/tune

graph TD
    A[GET /match-quality/tune] --> B[Count FP and FN feedback]
    B --> C["fpRate = FP / (FP + FN)"]
    C --> D{fpRate > 50%?}
    D -- Yes --> E["Action: tighten"]
    E --> F["Reduce m values −2%, increase u +0.1%"]
    F --> G[Raise match_threshold +1%]
    D -- No --> H{fnRate > 50%?}
    H -- Yes --> I["Action: loosen"]
    I --> J["Increase m values +2%, reduce u −0.05%"]
    J --> K[Lower match_threshold −1%]
    H -- No --> L["Action: none"]
    G --> M[Return 200 with current + suggested weights]
    K --> M
    L --> M

GET /match-quality/feedback

graph TD
    A[GET /match-quality/feedback] --> B{type filter?}
    B -- Yes --> C[Filter by false_positive or false_negative]
    B -- No --> D[No type filter]
    C --> E{resolved filter?}
    D --> E
    E -- Yes --> F[Filter by resolved status]
    E -- No --> G[No resolved filter]
    F --> H[Sort by reported_at descending]
    G --> H
    H --> I[Return 200 with feedback array]

Review Workflow (Cross-Endpoint)

This diagram shows how multiple endpoints connect to form the candidate review workflow:

graph TD
    A[POST /customers → 201 with candidates] --> B[Reviewer views candidates]
    B --> C[GET /:id/aliases/:aliasId/candidates]
    C --> D{Reviewer Decision}
    D -- Approve --> E[POST /.../candidates/:id/approve]
    D -- Reject --> F[POST /.../candidates/:id/reject]
    E --> G["Merge records (transaction)"]
    G --> H["Record false_negative feedback"]
    H --> I[F1 metrics updated]
    F --> J[Candidate marked rejected]

Feedback Loop (Cross-Endpoint)

This diagram shows how feedback flows through the F1 tuning system:

graph TD
    A[User reports false positive] --> B[POST /:id/aliases/:aliasId/feedback]
    C[User merges missed match] --> D[PATCH /:id with merge]
    D --> E["Auto-records false_negative"]
    B --> F[GET /match-quality → F1 score]
    E --> F
    F --> G[GET /match-quality/tune → suggested adjustments]
    G -. adjust threshold & weights .-> H[POST /customers matching engine]

Source System Registration

Before a source system can submit customer records via POST /customers, it must be registered using the /sources API. This ensures that every alias traces back to a known system and enables access control per source.

Source Model

Field	Type	Description
`name`	String	Unique identifier for the source system (e.g., `CRM`, `BILLING`)
`entra_ad_group`	String	Microsoft Entra AD group name for future access control
`reviewers`	Array	Users authorized to review match candidates from this source
`created_by`	String	User who registered the source
`created_at`	Date	Registration timestamp
`deleted_at`	Date	Soft-deletion timestamp (null if active)

Each reviewer has first_name, last_name, and email (stored lowercase). The reviewers list is replaced entirely on update, not merged.

Source Key Uniqueness

The aliases.source_system + aliases.source_key compound index is unique — a given source system can only have one record per source key across all customers. When a POST arrives with a source_system + source_key that already exists:

Data aligns (Fellegi-Sunter confidence >= 0.70): the alias's original_payload is updated and the existing customer is returned (200).
Data misaligned: a 409 collision is returned with the existing customer ID and confidence score, indicating the records need manual review or a source-side data correction.

This prevents silent overwrites and duplicate aliases while still allowing source systems to re-submit corrected data.

Customer Matching

Fellegi-Sunter Model

The matching engine implements the Fellegi-Sunter model (1969), the foundational framework for probabilistic record linkage used across government agencies, healthcare systems, and financial institutions worldwide.

The core idea: for each comparison field, we define two probabilities:

m-probability P(agree | true match) — how often this field agrees when two records truly refer to the same person
u-probability P(agree | not a match) — how often this field agrees purely by coincidence among unrelated records

From these, we compute:

Agreement weight = log2(m / u) — reward for a field matching. Fields that are highly distinctive (low u) produce large positive weights.
Disagreement weight = log2((1 - m) / (1 - u)) — penalty for a field not matching. Fields with high m produce large negative penalties when they disagree.

The raw composite score is converted to a normalized probability between 0 and 1:

P(match) = (score - minPossibleScore) / (maxPossibleScore - minPossibleScore)

Where maxPossibleScore is the sum of all agreement weights (perfect match) and minPossibleScore is the sum of all disagreement weights (complete mismatch).

Jaro-Winkler Similarity

For name and address fields, exact matching is too rigid: typos, abbreviations, and data entry inconsistencies are common. The engine uses Jaro-Winkler similarity for fuzzy string comparison.

Jaro-Winkler is specifically designed for short strings like names. It:

Computes the Jaro similarity based on the number and order of matching characters
Applies a Winkler prefix bonus that gives extra weight to strings sharing a common prefix (reflecting the empirical observation that typos are less common at the start of a name)

The result is a similarity score between 0 (completely different) and 1 (identical). Each field has a configurable similarity threshold that determines whether the comparison counts as "agreement."
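For reference, the comparison can be sketched in plain JavaScript. This is a textbook implementation written for illustration, not necessarily the one the engine uses; the function names and the 0.1 prefix scale (the conventional default) are assumptions.

```javascript
// Jaro similarity: based on matching characters within a sliding range
// and the number of transpositions among them.
function jaro(a, b) {
  if (a === b) return 1;
  if (a.length === 0 || b.length === 0) return 0;
  const range = Math.max(0, Math.floor(Math.max(a.length, b.length) / 2) - 1);
  const aMatched = new Array(a.length).fill(false);
  const bMatched = new Array(b.length).fill(false);
  let matches = 0;
  for (let i = 0; i < a.length; i++) {
    const lo = Math.max(0, i - range);
    const hi = Math.min(b.length - 1, i + range);
    for (let j = lo; j <= hi; j++) {
      if (!bMatched[j] && a[i] === b[j]) {
        aMatched[i] = true;
        bMatched[j] = true;
        matches++;
        break;
      }
    }
  }
  if (matches === 0) return 0;
  // Walk matched characters in order and count those out of sequence.
  let k = 0;
  let outOfOrder = 0;
  for (let i = 0; i < a.length; i++) {
    if (!aMatched[i]) continue;
    while (!bMatched[k]) k++;
    if (a[i] !== b[k]) outOfOrder++;
    k++;
  }
  const t = outOfOrder / 2;
  return (matches / a.length + matches / b.length + (matches - t) / matches) / 3;
}

// Winkler adjustment: boost the score for a shared prefix of up to 4 characters.
function jaroWinkler(a, b, prefixScale = 0.1) {
  const j = jaro(a, b);
  const maxPrefix = Math.min(4, a.length, b.length);
  let prefix = 0;
  while (prefix < maxPrefix && a[prefix] === b[prefix]) prefix++;
  return j + prefix * prefixScale * (1 - j);
}

// A field "agrees" when similarity meets its configured threshold.
const agrees = jaroWinkler("jon", "john") >= 0.85; // true for this pair
```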

For email and phone fields, exact matching is used — these identifiers are either right or wrong.

Field Configuration

Field	m	u	Compare	Threshold	Rationale
`first_name`	0.95	0.005	Jaro-Winkler	0.85	True matches agree 95% of the time; random pairs share a first name ~0.5%
`last_name`	0.95	0.002	Jaro-Winkler	0.85	Last names are more distinctive than first names
`email`	0.90	0.0001	Exact	1.0	Emails are nearly unique; lower m accounts for people with multiple addresses
`phone`	0.85	0.0005	Exact	1.0	Phones change more often; still highly distinctive
`address_composite`	0.80	0.005	Jaro-Winkler	0.80	Addresses change frequently; composite of street, city, state, zip

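The email field carries the most discriminating power due to its extreme m/u ratio: an email match contributes log2(0.90 / 0.0001) ≈ 13.1 to the score, the largest agreement weight of any field. An email mismatch is penalized less heavily (log2(0.10 / 0.9999) ≈ -3.3) than a name mismatch (≈ -4.3), because the lower m acknowledges that people legitimately hold multiple addresses.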
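The weight formulas and the normalization can be checked with a short sketch using the m/u values from the table above. The function names are hypothetical, and one detail is an assumption: the minimum and maximum bounds are summed over the fields actually compared, so that skipped (missing) fields do not distort the result.

```javascript
// m/u values copied from the Field Configuration table.
const FIELDS = {
  first_name: { m: 0.95, u: 0.005 },
  last_name: { m: 0.95, u: 0.002 },
  email: { m: 0.90, u: 0.0001 },
  phone: { m: 0.85, u: 0.0005 },
  address_composite: { m: 0.80, u: 0.005 },
};

// Agreement and disagreement weights for one field.
function fieldWeights({ m, u }) {
  return {
    agree: Math.log2(m / u),
    disagree: Math.log2((1 - m) / (1 - u)),
  };
}

// comparisons: { field: true | false | null }; null means the field was
// missing and is skipped. Returns the normalized score in [0, 1].
function matchProbability(comparisons) {
  let score = 0;
  let min = 0;
  let max = 0;
  for (const [field, agrees] of Object.entries(comparisons)) {
    if (agrees === null || !(field in FIELDS)) continue;
    const w = fieldWeights(FIELDS[field]);
    score += agrees ? w.agree : w.disagree;
    max += w.agree;
    min += w.disagree;
  }
  if (max === min) return 0; // nothing was compared
  return (score - min) / (max - min);
}
```

For example, `fieldWeights(FIELDS.email).agree` is roughly 13.1, while `fieldWeights(FIELDS.first_name).agree` is roughly 7.6.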

Match Threshold and F1 Tuning

The match threshold is currently set at 0.95 (95% confidence). Rather than relying on a fixed value chosen once, the threshold is designed to be tuned over time using the F1 feedback loop.

The initial threshold favors minimizing false positives — incorrectly merging two distinct people is far more damaging than creating a duplicate record that can be merged later. As real-world feedback accumulates, the /match-quality/tune endpoint analyzes the ratio of false positives to false negatives and suggests adjustments:

Too many false positives (incorrect merges): raise the threshold and tighten field weights
Too many false negatives (missed matches): lower the threshold and loosen field weights

When both records have no overlapping data (all fields empty), the score is 0. Fields where both records are missing are skipped entirely — they neither help nor hurt the score.

Match Quality and F1 Feedback Loop

The matching engine is not a black box — it has a built-in feedback system that measures accuracy and suggests improvements. This closes the loop between automated matching and human review.

Metrics

GET /match-quality computes precision, recall, and F1 score from accumulated feedback:

Metric	Formula	Meaning
Precision	TP / (TP + FP)	Of all auto-matches, how many were correct?
Recall	TP / (TP + FN)	Of all true matches, how many did the system find?
F1	2 * P * R / (P + R)	Harmonic mean — balances precision and recall

Where:

True Positive (TP): system matched correctly (auto-match with no false positive feedback)
False Positive (FP): system matched incorrectly (reported via POST /:id/aliases/:aliasId/feedback)
False Negative (FN): system missed a match (detected when a manual merge is performed via PATCH /:id)

Manual merges automatically record a false_negative feedback entry, so the system learns from both explicit feedback and operational corrections.
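The arithmetic behind these metrics is small enough to sketch directly. The function name and input shape are hypothetical, and returning 0 when a denominator is zero is an assumption about how the empty case is handled.

```javascript
// Derives precision, recall, and F1 from raw counts, following the
// definitions above: TP = auto-matches minus reported false positives.
function matchQuality({ autoMatches, falsePositives, falseNegatives }) {
  const tp = autoMatches - falsePositives;
  const precision = tp + falsePositives > 0 ? tp / (tp + falsePositives) : 0;
  const recall = tp + falseNegatives > 0 ? tp / (tp + falseNegatives) : 0;
  const f1 = precision + recall > 0 ? (2 * precision * recall) / (precision + recall) : 0;
  return { tp, precision, recall, f1 };
}

// 100 auto-matches, 5 reported as wrong, 10 manual merges:
// precision is 0.95, recall is about 0.905, F1 is about 0.927.
const quality = matchQuality({ autoMatches: 100, falsePositives: 5, falseNegatives: 10 });
```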

Weight Tuning

GET /match-quality/tune analyzes the balance of false positives vs. false negatives and returns:

Current field weights (m/u values) and threshold
Suggested adjustments (small 1-2% changes to avoid oscillation)
Rationale for the suggestion

The tuning endpoint is advisory — it does not auto-apply changes. This gives operators visibility into how the system would adjust and the opportunity to review before applying.

Feedback Pattern	Action	Effect
FP > 50% of feedback	Tighten	Raise threshold +1%, reduce m values, increase u values
FN > 50% of feedback	Loosen	Lower threshold -1%, increase m values, reduce u values
Balanced	None	No adjustment needed
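The decision in the table can be sketched as follows. This is illustrative only: `suggestTuning` is a hypothetical name, the "+1%" step is interpreted here as an absolute change of 0.01 to the threshold, and the m/u adjustments are omitted for brevity.

```javascript
// Advisory only: returns a suggestion and never applies it.
function suggestTuning({ falsePositives, falseNegatives, threshold }) {
  const total = falsePositives + falseNegatives;
  if (total === 0) return { action: "none", threshold };
  const fpRate = falsePositives / total;
  const fnRate = falseNegatives / total;
  if (fpRate > 0.5) {
    // Too many incorrect merges: raise the bar for auto-approval.
    return { action: "tighten", threshold: Math.min(1, threshold + 0.01) };
  }
  if (fnRate > 0.5) {
    // Too many missed matches: lower the bar.
    return { action: "loosen", threshold: Math.max(0, threshold - 0.01) };
  }
  return { action: "none", threshold };
}
```

An exact 50/50 split falls through to "none", which matches the "Balanced" row.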

Review Queue

GET /customers?under_review=true returns customers with unresolved false positive feedback, ordered by most recent feedback first. This provides a work queue for operators to review and resolve disputed matches.

Feedback records can be filtered via GET /match-quality/feedback?resolved=false to see only unresolved items.

Near-Miss Candidates and Candidate Review

When a new record is created (201 response), the system also evaluates all candidates that scored at or above the review threshold (0.70) but below the auto-approve threshold (currently 0.95, tuned by the F1 feedback loop). These "near-miss" candidates are stored as candidates on the alias that triggered the record creation and included in the 201 response.

Candidates are children of alias records, not customer records — each alias represents a specific ingestion event, and the candidates are the near-miss matches for that event.

This enables a targeted review workflow:

POST /customers — if no auto-match, the 201 response includes a candidates array with candidate IDs and confidence scores
GET /customers/:id/aliases/:aliasId/candidates — retrieve the list of candidates for review
POST /customers/:id/aliases/:aliasId/candidates/:candidateId/approve — approve the match (merges the records atomically via transaction, records false negative feedback for F1 tuning)
POST /customers/:id/aliases/:aliasId/candidates/:candidateId/reject — reject the match (confirms distinct individuals)

Approval Behavior

When a candidate is approved:

The new customer's aliases are transferred to the candidate (target)
The new customer is soft-deleted with merged_into set to the target
All remaining candidates on the alias are automatically rejected
A false_negative MatchFeedback record is created to feed the F1 tuning loop
Future GET requests for the source return 301 → target

Candidate Schema

Defined in src/models/candidate.js and embedded within the Alias schema:

// Alias.candidates[]
{
  candidate_id: ObjectId,  // Reference to the potential match
  confidence: Number,      // Fellegi-Sunter score (0.70 – 0.95)
  algorithm: String,       // "fellegi-sunter"
  search_tokens: [String], // Tokens used for candidate blocking (analysis/audit)
  status: String,          // "pending" | "approved" | "rejected"
  reviewed_by: String,     // x-user-id of the reviewer
  reviewed_at: Date        // When the review decision was made
}

Thresholds

Threshold	Value	Meaning
`MATCH_THRESHOLD`	0.95	Auto-approve — records are merged automatically
`REVIEW_THRESHOLD`	0.70	Near-miss floor — candidates below this are discarded

Both thresholds are tunable. The MATCH_THRESHOLD is adjusted by the F1 feedback loop based on false positive/negative reports.

Nickname Normalization

Common nicknames and diminutives are automatically normalized to their formal equivalents on ingestion. This improves match recall — a record submitted as "Chuck" will match an existing "Charles" because both are stored under the same canonical name.

Nickname	Stored As	Nickname	Stored As
Chuck	Charles	Bill	William
Bob	Robert	Mike	Michael
Jim	James	Liz	Elizabeth
Becky	Rebecca	Kate	Katherine

The dictionary covers ~130 common English nicknames (both male and female). Names not in the dictionary are stored as-is.
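A dictionary lookup of this kind might look like the sketch below. Only the eight pairs from the table are included, and `normalizeFirstName` is a hypothetical name rather than the project's actual function.

```javascript
// Subset of the nickname dictionary, keyed by lowercased nickname.
const NICKNAMES = new Map([
  ["chuck", "Charles"], ["bill", "William"],
  ["bob", "Robert"], ["mike", "Michael"],
  ["jim", "James"], ["liz", "Elizabeth"],
  ["becky", "Rebecca"], ["kate", "Katherine"],
]);

// Returns the formal name if the input is a known nickname,
// otherwise the trimmed input unchanged.
function normalizeFirstName(name) {
  const trimmed = name.trim();
  return NICKNAMES.get(trimmed.toLowerCase()) ?? trimmed;
}
```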

The original name is always preserved in the alias's original_payload field. To retrieve the original record as submitted by a specific source system, use:

GET /customers/:id?source_system=CRM

This overlays the original_payload fields from the matching alias onto the response, returning first_name: "Chuck" instead of the canonical first_name: "Charles".
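Conceptually the overlay is a merge of the alias payload over the canonical record. The sketch below assumes a shallow merge and a hypothetical `overlayOriginalPayload` helper; the real handler may treat nested fields such as address differently.

```javascript
// Returns the customer as originally submitted by one source system,
// or null when that system has no alias (the API responds 404).
function overlayOriginalPayload(customer, sourceSystem) {
  const alias = (customer.aliases || []).find((a) => a.source_system === sourceSystem);
  if (!alias) return null;
  return { ...customer, ...alias.original_payload };
}
```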

Typeahead search also expands nicknames — searching for "chuck" will query both "chuck" and "charles" prefix tokens.

Typeahead Search

GET /customers/search?q=jo&limit=20 provides fast prefix-based name search optimized for 10M+ records. The search uses indexed prefix tokens (fp: and lp:) for O(log N) lookups — no collection scans.

Minimum query length: 2 characters
Matches against the beginning of first_name or last_name, case-insensitive
Excludes soft-deleted customers
Default limit: 20, max: 100
Nicknames are expanded (searching "chuck" also matches "charles" prefix tokens)

Prefix tokens are generated at ingestion time for every prefix of the lowercased name, from length 2 up to the full name length. For example, "John" produces tokens fp:jo, fp:joh, fp:john. These are stored in the same search_tokens array and backed by the same compound index used for match candidate blocking.
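Prefix token generation can be sketched in a few lines. The helper name `prefixTokens` is hypothetical; the token format follows the example above.

```javascript
// Builds one token per prefix of the lowercased name, starting at length 2.
function prefixTokens(name, prefix) {
  const value = name.trim().toLowerCase();
  const tokens = [];
  for (let len = 2; len <= value.length; len++) {
    tokens.push(`${prefix}:${value.slice(0, len)}`);
  }
  return tokens;
}

prefixTokens("John", "fp"); // ["fp:jo", "fp:joh", "fp:john"]
```

A name of length n yields n - 1 tokens, so storage grows linearly with name length.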

Search Tokens and Candidate Blocking

Running Fellegi-Sunter against every record in the database doesn't scale. The standard solution in record linkage is blocking: narrowing the candidate set before scoring. S14S Identify uses search tokens (phonetic, prefix, and exact) stored directly on each customer document and backed by a MongoDB multikey index.

When a new record arrives, the system generates tokens from the incoming data, queries for any existing customers sharing at least one token ($in), and only runs Fellegi-Sunter against those candidates. This means the matching engine touches a small fraction of the database regardless of total size.
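The blocking lookup reduces to a single `$in` filter. The sketch below only builds the filter object; the helper name is hypothetical, and excluding soft-deleted records at this step is an assumption based on the compound index including deleted_at.

```javascript
// Builds the MongoDB filter for candidate blocking: any customer that
// shares at least one token with the incoming record.
function buildCandidateFilter(tokens) {
  const unique = [...new Set(tokens)];
  return { search_tokens: { $in: unique }, deleted_at: null };
}

// Possible usage with the MongoDB Node.js driver
// (the "customers" collection name is an assumption):
//   const candidates = await db.collection("customers")
//     .find(buildCandidateFilter(tokens))
//     .toArray();
```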

Token Design

Each customer stores a flat search_tokens array of prefixed strings:

["fn:JN", "fn:AN", "fp:jo", "fp:joh", "fp:john", "ln:SM0", "ln:XMT", "lp:sm", "lp:smi", "lp:smit", "lp:smith", "em:john", "ed:example.com", "ph:5309", "ph:8675309", "sn:123", "ss:MN", "zp:62701"]

Prefixes prevent cross-field collisions (e.g., a ZIP code matching a phone suffix). A single compound index on { search_tokens: 1, deleted_at: 1 } handles all token lookups efficiently.

Field	Prefix	Token Logic
`first_name`	`fn:`	Double Metaphone primary + alternate (if different)
`first_name`	`fp:`	Lowercased prefix substrings (length 2 to full) for typeahead
`last_name`	`ln:`	Double Metaphone primary + alternate (if different)
`last_name`	`lp:`	Lowercased prefix substrings (length 2 to full) for typeahead
`email`	`em:`, `ed:`	Lowercased local part and domain
`phone`	`ph:`	Last 4 digits and last 7 digits of E.164 number
`address.street`	`sn:`, `ss:`	Street number (exact) + Double Metaphone of street name words
`address.zip`	`zp:`	First 5 digits of ZIP code

Tokens use any-match semantics ($in) for maximum recall — a single shared token is enough to pull a candidate into the scoring pool. Fellegi-Sunter then handles precision by computing a rigorous probabilistic score across all fields.
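The non-phonetic tokens in the table are simple string slices. The sketch below covers email, phone, and ZIP only; `exactTokens` is a hypothetical helper, and it assumes the inputs have already been sanitized (lowercased email, E.164 phone).

```javascript
// Generates em:/ed:/ph:/zp: tokens following the table above.
function exactTokens({ email, phone, zip }) {
  const tokens = [];
  if (email && email.includes("@")) {
    const [local, domain] = email.toLowerCase().split("@");
    tokens.push(`em:${local}`, `ed:${domain}`);
  }
  if (phone) {
    const digits = phone.replace(/\D/g, "");
    // Last 4 and last 7 digits tolerate differing country or area codes.
    if (digits.length >= 7) tokens.push(`ph:${digits.slice(-4)}`, `ph:${digits.slice(-7)}`);
  }
  if (zip) {
    const digits = zip.replace(/\D/g, "");
    if (digits.length >= 5) tokens.push(`zp:${digits.slice(0, 5)}`);
  }
  return tokens;
}
```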

Street name tokenization skips common USPS suffix words (ST, AVE, BLVD), directionals (N, S, E, W), secondary unit designators (APT, STE), and single-character words, since these are too common to be discriminating.

Double Metaphone

Phonetic tokens are generated using the Double Metaphone algorithm, which encodes how a word sounds rather than how it's spelled. This enables matching across common name variations:

Name	Primary	Alternate
Michael	MKL	MXL
Smith	SM0	XMT
Schmidt	XMT	SMT

Double Metaphone produces two encodings — a primary and an alternate — to handle words with ambiguous pronunciation. Both are stored as tokens when they differ, maximizing recall. Notice how "Smith" and "Schmidt" share the code XMT, which means a search for either name will surface the other as a candidate.

Address Standardization (USPS Pub 28)

Before token generation, all addresses are standardized according to USPS Publication 28, the postal addressing standard. This ensures consistent tokenization regardless of how the address was originally entered.

Input	Standardized
`123 Main Street`	`123 Main ST`
`North Oak Avenue`	`N Oak AVE`
`456 Elm Boulevard Apartment 4`	`456 Elm BLVD APT 4`
`789 Cedar Blvd.`	`789 Cedar BLVD`

The standardizer normalizes:

Street suffixes — "Street" to "ST", "Avenue" to "AVE", "Boulevard" to "BLVD", etc.
Directionals — "North" to "N", "Southwest" to "SW", etc.
Secondary unit designators — "Apartment" to "APT", "Suite" to "STE", etc.
ZIP+4 formatting — "627011234" to "62701-1234"
Periods and extra whitespace — "St." to "ST", multiple spaces collapsed

Standardization is applied during input sanitization, so all stored addresses are already in canonical form.
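A word-by-word substitution gives the flavor of the street-line rules. This sketch is deliberately naive: it carries only a handful of Pub 28 entries, the names are hypothetical, and it abbreviates every matching word regardless of position, which a production standardizer would need to refine (for example, a street literally named "North").

```javascript
// Small subset of USPS Pub 28 abbreviations, keyed by lowercased word.
const USPS_ABBREVIATIONS = new Map([
  ["street", "ST"], ["st", "ST"],
  ["avenue", "AVE"], ["ave", "AVE"],
  ["boulevard", "BLVD"], ["blvd", "BLVD"],
  ["north", "N"], ["south", "S"], ["east", "E"], ["west", "W"],
  ["southwest", "SW"],
  ["apartment", "APT"], ["apt", "APT"],
  ["suite", "STE"], ["ste", "STE"],
]);

// Strips periods, collapses whitespace, and abbreviates known words.
function standardizeStreet(input) {
  return input
    .replace(/\./g, "")
    .trim()
    .split(/\s+/)
    .map((word) => USPS_ABBREVIATIONS.get(word.toLowerCase()) ?? word)
    .join(" ");
}
```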

Input Sanitization

All input is sanitized before storage or matching. The sanitization layer validates and normalizes data, returning an array of all validation errors at once rather than failing on the first error.

POST validation requires:

source_system and source_key (identifies the originating system)
first_name, last_name, email
Valid email format
Valid phone format (if provided; phone is optional)

PUT validation enforces:

Fields that are present cannot be set to empty (prevents accidental data erasure)
Email format is validated if email is being updated
Phone format is validated if phone is being updated

All string fields are trimmed. Email is lowercased. Address state is uppercased.
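
The "collect every error, then return" behaviour can be sketched like this. The function name validatePost, the error message wording, and the deliberately loose EMAIL_PATTERN are all assumptions made for illustration; phone validation is left out because the service delegates it to libphonenumber-js.

```javascript
// Loose email shape check; the real validator may be stricter.
const EMAIL_PATTERN = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;
const REQUIRED_FIELDS = ['source_system', 'source_key', 'first_name', 'last_name', 'email'];

// Returns an array of error messages; an empty array means the body is valid.
function validatePost(body) {
  const errors = [];

  for (const field of REQUIRED_FIELDS) {
    const value = typeof body[field] === 'string' ? body[field].trim() : '';
    if (!value) errors.push(`${field} is required`);
  }

  // Only check the format when an email was actually supplied,
  // so a missing email is reported once rather than twice.
  const email = typeof body.email === 'string' ? body.email.trim().toLowerCase() : '';
  if (email && !EMAIL_PATTERN.test(email)) errors.push('email is not a valid address');

  return errors;
}

console.log(validatePost({ source_system: 'CRM', email: 'not-an-email' }));
```

With this input the caller receives four messages in one response (three missing fields plus the malformed email) instead of discovering them one request at a time.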

E.164 Phone Normalization

Phone numbers are stored in E.164 format, the international standard for phone number formatting defined by the ITU. E.164 numbers:

Begin with a + followed by the country code
Contain no spaces, dashes, or parentheses
Are a maximum of 15 digits

Examples of normalization:

Input	Normalized
`(214) 867-5309`	`+12148675309`
`214-867-5309`	`+12148675309`
`+1 214 867 5309`	`+12148675309`
`020 7946 0958` (GB)	`+442079460958`

The normalization uses Google's libphonenumber library (via libphonenumber-js), which validates against real telephony rules — not just digit counts. Invalid area codes, impossible exchanges, and malformed numbers are all rejected.

The default country is US, but an explicit country code can be provided for international numbers.
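
The three structural properties of E.164 listed above translate into a short pattern. The sketch below is a shape check only, with the hypothetical name looksLikeE164; it cannot tell whether a number is actually dialable and is no substitute for the libphonenumber validation the service performs. It could serve as a sanity assertion on values after normalization.

```javascript
// "+" followed by 1 to 15 digits in total; the first digit is non-zero
// because country codes never begin with 0.
const E164_SHAPE = /^\+[1-9]\d{1,14}$/;

function looksLikeE164(value) {
  return typeof value === 'string' && E164_SHAPE.test(value);
}

console.log(looksLikeE164('+12148675309')); // true
console.log(looksLikeE164('(214) 867-5309')); // false
```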

Audit Trail

Every mutation is tracked with full context:

Who made the change (x-user-id header, defaults to anonymous)
When the change was made (timestamp)
What changed (field-level delta with from and to values)

The change_history array on each customer record is append-only. Deltas are computed by comparing the original document state against the updated state across all auditable fields:

first_name, last_name, email, phone,
address.street, address.city, address.state, address.zip

Example delta entry:

{
  "changed_by": "admin-user",
  "changed_at": "2026-03-06T14:30:00.000Z",
  "delta": {
    "email": {
      "from": "john@oldmail.com",
      "to": "john@newmail.com"
    },
    "address.state": {
      "from": "IL",
      "to": "TX"
    }
  }
}

Alias additions are tracked as well, recording the source_system and source_key that was linked.
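
A field-level delta of this shape can be computed by walking the auditable field list and comparing values at each dotted path. The sketch below assumes plain objects as input; the helper names getPath and computeDelta are hypothetical and may not match auditDelta.js.

```javascript
const AUDITABLE_FIELDS = [
  'first_name', 'last_name', 'email', 'phone',
  'address.street', 'address.city', 'address.state', 'address.zip',
];

// Reads a dotted path such as "address.state"; returns undefined if any segment is missing.
function getPath(obj, path) {
  return path.split('.').reduce((node, key) => (node == null ? undefined : node[key]), obj);
}

// Returns { field: { from, to } } for every auditable field whose value changed.
function computeDelta(before, after) {
  const delta = {};
  for (const field of AUDITABLE_FIELDS) {
    const from = getPath(before, field);
    const to = getPath(after, field);
    if (from !== to) delta[field] = { from, to };
  }
  return delta;
}

const delta = computeDelta(
  { email: 'john@oldmail.com', address: { state: 'IL' } },
  { email: 'john@newmail.com', address: { state: 'TX' } },
);
console.log(Object.keys(delta)); // [ 'email', 'address.state' ]
```

Fields outside the list, such as search_tokens, never appear in the delta, which keeps the history focused on customer-visible data.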

Atomicity and Transactions

Every write operation in the system is designed to be atomic — either it completes fully or has no effect. The strategy depends on the scope of the operation.

Single-Document Operations

Most endpoints modify a single MongoDB document per request. MongoDB guarantees that single-document writes are atomic without any additional infrastructure:

Endpoint	Documents Modified	Atomicity
`POST /customers` (new)	1 Customer	Single-document atomic
`POST /customers` (match)	1 Customer (alias appended)	Single-document atomic
`PUT /customers/:id`	1 Customer	Single-document atomic
`DELETE /customers/:id`	1 Customer (soft-delete fields set)	Single-document atomic
`POST /.../feedback`	1 MatchFeedback	Single-document atomic

Because the alias array, change history, and search tokens all live on the Customer document itself, operations that add an alias, record audit deltas, and regenerate tokens are all committed in a single write. There is no window where an alias is added but the audit trail is missing, or where tokens are stale relative to the stored fields.

Multi-Document Transactions

Two operations modify multiple documents atomically and use MongoDB transactions: the manual merge (PATCH /customers/:id) and the candidate approval (POST /customers/:id/aliases/:aliasId/candidates/:candidateId/approve). Both follow the same three-document pattern:

Target customer — receives transferred aliases, updated search tokens, and a merge audit entry
Source customer — marked as soft-deleted with merged_into pointer and a merge audit entry
MatchFeedback record — a false_negative entry is created to feed the F1 metrics

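If these three writes were independent, a failure partway through would leave the system in an inconsistent state. For example, a failure after the target is saved but before the source is saved would leave the aliases transferred while the source still appears as an active customer; a failure before the feedback record is saved would leave a completed merge that never reaches the F1 metrics.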

To prevent this, the merge handler wraps all three saves in a MongoDB transaction:

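// Open a session so all three saves share one transaction.
const session = await mongoose.startSession();
try {
  // withTransaction commits when the callback resolves and aborts if it throws.
  await session.withTransaction(async () => {
    await target.save({ session });        // aliases, search tokens, merge audit entry
    await source.save({ session });        // soft-delete fields and merged_into pointer
    await falseNegative.save({ session }); // MatchFeedback record for the F1 metrics
  });
} finally {
  // Release the session whether the transaction committed or aborted.
  await session.endSession();
}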

If any save fails, the entire transaction is rolled back and the database remains unchanged. The client receives a 500 error and can safely retry.

Replica Set Requirement

MongoDB transactions require a replica set. The docker-compose configuration runs MongoDB as a single-node replica set (--replSet rs0), which provides full transaction support with negligible overhead compared to a standalone instance. The healthcheck automatically initializes the replica set on first boot.

For production deployments, a multi-node replica set is recommended for both transaction support and high availability.

Soft Deletes

Records are never physically removed. A DELETE request sets:

deleted_by — who performed the deletion
deleted_at — when the deletion occurred

Soft-deleted records are excluded from all queries by default. Use ?include_deleted=true on the list endpoint to include them. The matching engine only considers active (non-deleted) records as candidates.
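
One way to express these rules is a pair of small helpers: one that produces the fields set by DELETE, and one that produces the query filter applied to list requests. Both names (softDeleteFields, activeFilter) are hypothetical, and the sketch assumes that active records have deleted_at unset or null. In MongoDB, a filter of { deleted_at: null } matches documents where the field is null or absent.

```javascript
// Fields written by a soft delete. Falls back to "anonymous" when no x-user-id was sent.
function softDeleteFields(userId, now = new Date()) {
  return { deleted_by: userId || 'anonymous', deleted_at: now };
}

// Query filter for list endpoints. Only the literal string "true" opts in to deleted records.
function activeFilter(query = {}) {
  return query.include_deleted === 'true' ? {} : { deleted_at: null };
}

console.log(activeFilter({})); // { deleted_at: null }
console.log(activeFilter({ include_deleted: 'true' })); // {}
```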

Record Merging

When a customer record is merged into another (e.g., via manual administrative action), the deprecated record is soft-deleted and a merged_into pointer is set. The entire operation — alias transfer, source soft-delete, and feedback recording — is executed within a transaction to guarantee all-or-nothing semantics.

Attempts to retrieve the deprecated record via GET /customers/:id return 301 Moved Permanently with a Location header pointing to the new master record. Clients that follow redirects reach the master record transparently, and the permanent status signals that stored references should be updated.
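
The redirect decision can be isolated in a pure function, which makes it easy to test without an HTTP server. This is a sketch under assumptions: the name resolveCustomerGet, the 404 for a missing record, and the exact form of the Location path are not taken from the source, and the handling of records that are soft-deleted without being merged is not covered here.

```javascript
// Maps a loaded customer (or null) to the response the GET handler should send.
function resolveCustomerGet(customer) {
  if (!customer) {
    return { status: 404 };
  }
  if (customer.merged_into) {
    // Merged records point at their master via merged_into.
    return { status: 301, headers: { Location: `/customers/${customer.merged_into}` } };
  }
  return { status: 200, body: customer };
}

console.log(resolveCustomerGet({ _id: 'a1', merged_into: 'b2' }));
// { status: 301, headers: { Location: '/customers/b2' } }
```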

Data Model

Customer

Field	Type	Description
`first_name`	String	Required. Nicknames normalized to formal equivalents on ingestion
`last_name`	String	Required
`email`	String	Required, stored lowercase
`phone`	String	Stored in E.164 format
`address`	Object	`{ street, city, state, zip }` — USPS standardized
`aliases`	Array	Cross-system identity links (see below)
`change_history`	Array	Audit trail entries
`search_tokens`	Array<String>	Phonetic/exact tokens for candidate blocking (not exposed in API responses)
`created_by`	String	User who created the record
`created_at`	Date	Creation timestamp
`updated_by`	String	Last user to modify the record
`updated_at`	Date	Last modification timestamp
`deleted_by`	String	User who soft-deleted the record
`deleted_at`	Date	Soft-deletion timestamp
`merged_into`	String	ID of the master record if this customer was merged

Alias

Field	Type	Description
`source_system`	String	Originating system identifier (e.g., `CRM`, `BILLING`)
`source_key`	String	Primary key in the originating system
`original_payload`	Mixed	Complete original POST body, preserved as-is
`added_by`	String	User who linked this alias
`added_at`	Date	When the alias was linked
`match_confidence`	Number	Fellegi-Sunter score (0–1) when matched; null for record creation
`match_algorithm`	String	Algorithm used (e.g., `fellegi-sunter`); null for record creation
`candidates`	Array	Near-miss match candidates for this alias (see Candidate Review)

MatchFeedback

Field	Type	Description
`type`	String	`false_positive` or `false_negative`
`customer_id`	ObjectId	The customer record involved
`alias_id`	ObjectId	The alias reported as incorrect (false positives only)
`related_customer_id`	ObjectId	The other customer involved (false negatives / merges)
`original_confidence`	Number	The match confidence when the alias was linked
`original_algorithm`	String	Algorithm used (e.g., `fellegi-sunter`)
`reported_by`	String	User who reported the feedback
`reported_at`	Date	When feedback was reported
`resolved`	Boolean	Whether the issue has been resolved (default: false)
`resolved_at`	Date	When the issue was resolved
`notes`	String	Optional explanation

Indexes

deleted_at — fast filtering for active/deleted records
email — candidate lookup during matching
aliases.source_system + aliases.source_key — unique compound index preventing duplicate source keys
search_tokens + deleted_at — compound multikey index for token-based candidate blocking and typeahead search

Testing

The project enforces 100% code coverage across statements, branches, functions, and lines. The build fails if any metric drops below 100%.

# Run tests with coverage report
npm test

# Build (tests + 100% coverage enforcement)
npm run build

Tests use mongodb-memory-server for a real MongoDB instance in-memory — no mocking of the database layer. This ensures tests exercise the actual Mongoose queries and validations.

Current status: 334 tests across 14 test suites.

Project Structure

src/
  app.js                          Express application setup
  server.js                       Entry point (connects to DB, starts server)
  database/
    connection.js                 MongoDB connection management
  middleware/
    auditContext.js                Extracts x-user-id header for audit tracking
  models/
    alias.js                      Alias subdocument schema (embeds candidates)
    candidate.js                  Candidate subdocument schema (near-miss matches)
    changeRecord.js               Change record subdocument schema
    customer.js                   Mongoose schema and indexes
    matchFeedback.js              F1 feedback records (false positives/negatives)
    source.js                     Source system registration schema
  routes/
    customerRoutes.js             REST endpoints (List, CRUD, search, merge, feedback)
    matchQualityRoutes.js         F1 metrics, tuning suggestions, feedback listing
    sourceRoutes.js               Source system CRUD endpoints
  services/
    auditDelta.js                 Field-level change delta computation
    addressStandardizer.js        USPS Pub 28 address standardization
    customerMatchingService.js    Fellegi-Sunter matching (findMatch: O(log N + C), score: O(1))
    inputSanitizer.js             Input validation, E.164 normalization, nickname normalization
    matchQualityService.js        F1 computation, weight adjustment suggestions
    nicknameDictionary.js         ~130 nickname-to-formal-name mappings
    searchTokenService.js         Phonetic, prefix, and exact token generation
  swagger/
    swaggerConfig.js              OpenAPI spec generation

scripts/
  seed.js                         Seed database with sample customers

tests/
  database/connection.test.js
  middleware/auditContext.test.js
  models/
    alias.test.js
    changeRecord.test.js
    customer.test.js
  routes/
    customerRoutes.test.js
    matchQuality.test.js
    sourceRoutes.test.js
  services/
    addressStandardizer.test.js
    auditDelta.test.js
    customerMatchingService.test.js
    inputSanitizer.test.js
    searchTokenService.test.js
  swagger/swaggerConfig.test.js
