Skip to content

jamezycesar-collab/cat-pretrade-data-model

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

CAT Pre-Trade Data Model

An enterprise-grade pre-trade data model designed for Consolidated Audit Trail (CAT) compliance under SEC Rule 613. Built from a large investment bank perspective, this model covers the full order lifecycle across all CAT-reportable asset classes, participant types, and trading venues.

The model implements the Enterprise Party Model pattern for counterparty management and the ISDA/FpML Instrument Taxonomy for multi-asset instrument representation, providing a production-ready foundation for regulatory reporting infrastructure.


Table of Contents


Quick Start

For Data Architects — Start with docs/CAT_PreTrade_Canonical_Model.pptx for the unified conceptual + logical view of all 29 entities.

For Data Engineers — Go directly to ddl/CAT_PreTrade_DDL_DeltaLake.sql or ddl/CAT_PreTrade_DDL_Hive.sql and deploy to your target environment.

For Business Analysts — Open reference/CAT_PreTrade_Business_Dictionary.xlsx to understand every entity, attribute, and business term in the model.

For Compliance Officers — Read docs/CAT_PreTrade_Data_Model.docx for the full narrative covering regulatory context, design rationale, and governance framework.


Repository Structure

cat-pretrade-data-model/
├── README.md                          # This file
├── docs/                              # Presentations and narrative documents
│   ├── CAT_PreTrade_Canonical_Model.pptx   # Unified canonical model (conceptual + logical, 16 slides)
│   ├── CAT_PreTrade_Conceptual_Model.pptx  # Conceptual ER diagrams (7 slides)
│   ├── CAT_PreTrade_Logical_Model.pptx     # Logical ER diagrams with attributes (11 slides)
│   └── CAT_PreTrade_Data_Model.docx        # Narrative document (regulatory context, governance)
├── ddl/                               # Physical model DDL scripts
│   ├── CAT_PreTrade_DDL_DeltaLake.sql      # Delta Lake / Databricks (29 tables, Spark SQL)
│   └── CAT_PreTrade_DDL_Hive.sql           # Apache Hive / Hadoop (29 tables, HiveQL)
└── reference/                         # Business dictionary and reference materials
    ├── CAT_PreTrade_Business_Dictionary.xlsx # 6-sheet workbook (glossary, entities, attributes)
    ├── CAT_PreTrade_Schema_Summary.txt      # Quick-reference schema listing
    ├── Implementation_Comparison.md         # Delta Lake vs Hive trade-off analysis
    └── FILE_MANIFEST.txt                    # Complete file manifest with descriptions

Model Architecture

Scope

Dimension Coverage
Regulation SEC Rule 613 — Consolidated Audit Trail (CAT)
Asset Classes Equities (NMS/non-NMS), Listed Options, Fixed Income, Digital Assets
Participants Broker-Dealers, Exchanges/ATSs, Buy-Side (Large Traders, Institutional)
Platforms Cloud Lakehouse (Delta Lake/Iceberg) and Hadoop/Hive
Total Entities 29 tables across 5 subject areas
Relationships 47 documented foreign key relationships

Subject Area Map

                    ┌─────────────────┐
                    │  PARTY MODEL    │
                    │   9 entities    │
                    └────────┬────────┘
                             │
          ┌──────────────────┼──────────────────┐
          │                  │                  │
┌─────────┴─────────┐ ┌─────┴─────┐ ┌─────────┴─────────┐
│ INSTRUMENT MODEL  │ │   VENUE   │ │ ORDER MANAGEMENT  │
│    9 entities     │ │ 2 entities│ │    6 entities      │
└───────────────────┘ └───────────┘ └─────────┬─────────┘
                                              │
                                    ┌─────────┴─────────┐
                                    │   OPERATIONS      │
                                    │   3 entities      │
                                    └───────────────────┘

How to Use This Model

1. Reviewing the Data Model

The model is documented at three levels of abstraction. Start with the level that matches your role.

Canonical Model (docs/CAT_PreTrade_Canonical_Model.pptx) — The single authoritative reference. Contains paired slides for each subject area: a conceptual view showing entity relationships with UML <<entity>> stereotype notation, followed by a logical view showing full attribute definitions with PK/FK markers and data types. Also includes a cross-domain relationship map and a relationship matrix of the top 15 foreign key paths. This is the best starting point for understanding the complete model.

Conceptual Model (docs/CAT_PreTrade_Conceptual_Model.pptx) — High-level entity-relationship diagrams organized by subject area. Shows entity boxes with UML connectors and cardinality labels (1, 0..1, 0..*). Use this for stakeholder presentations, design reviews, and communicating model scope.

Logical Model (docs/CAT_PreTrade_Logical_Model.pptx) — Detailed entity attribute tables showing every column, its PK/FK designation, and its role in the model. Entity headers are color-coded by subject area. Use this for development planning, ETL design, and data quality rule definition.

Narrative Document (docs/CAT_PreTrade_Data_Model.docx) — 8-section document covering executive summary, regulatory context (SEC Rule 613 requirements), conceptual model rationale, logical model decisions, physical model specifications, data governance framework, and appendices. Use this for compliance documentation and architecture review boards.

2. Deploying to Delta Lake (Databricks)

The Delta Lake DDL is optimized for Databricks, AWS EMR, or Azure HDInsight with Delta Lake support.

# Upload the DDL script to your Databricks workspace
databricks workspace import ddl/CAT_PreTrade_DDL_DeltaLake.sql /Shared/CAT/ddl/ --language SQL

# Or execute directly via Databricks SQL
databricks sql execute --sql "$(cat ddl/CAT_PreTrade_DDL_DeltaLake.sql)"

Key features of the Delta Lake DDL:

  • ZSTD compression for optimal storage efficiency
  • Z-ORDER optimization hints for common query patterns (event_date, asset_class)
  • Change Data Feed enabled for CDC downstream consumers
  • Liquid Clustering support for dynamic partition optimization
  • TBLPROPERTIES with autoOptimize.optimizeWrite = true and autoOptimize.autoCompact = true
  • Partitioned by event_date and asset_class on fact tables

Pre-deployment checklist:

  1. Ensure the target catalog and schema exist (default: cat_pretrade)
  2. Verify cluster has Delta Lake 2.0+ support
  3. Review partition strategy against your data volume (adjust if processing < 1M events/day)
  4. Configure Z-ORDER columns based on your most common query filters

3. Deploying to Hive (Hadoop)

The Hive DDL is designed for traditional Hadoop environments with Hive Metastore.

# Execute via Beeline
beeline -u "jdbc:hive2://your-hive-server:10000" -f ddl/CAT_PreTrade_DDL_Hive.sql

# Or via Hive CLI
hive -f ddl/CAT_PreTrade_DDL_Hive.sql

Key features of the Hive DDL:

  • Stored as Parquet with Snappy compression
  • Bucketed tables (256–512 buckets) for optimized join performance
  • ACID transactional support enabled
  • CLUSTERED BY on primary keys for co-located data access
  • Partitioned by event_date for time-range queries

Pre-deployment checklist:

  1. Enable ACID transactions: SET hive.support.concurrency = true
  2. Set hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
  3. Verify Hive 3.0+ for full ACID support
  4. Configure bucket count based on expected data volume (target 128MB–256MB per bucket)

For a detailed comparison of both platforms, see reference/Implementation_Comparison.md.

4. Extending the Model

Adding a new asset class (e.g., Commodities):

  1. Create a new subtype table following the inheritance pattern:
    CREATE TABLE commodity_attributes (
      instrument_id BIGINT COMMENT 'FK to instrument.instrument_id',
      commodity_type VARCHAR(50),
      delivery_month VARCHAR(10),
      contract_size DECIMAL(18,6),
      settlement_type VARCHAR(20)
    )
  2. Add a row to asset_class_reference with the new class code
  3. Update instrument_classification to include the new taxonomy entries
  4. Add the new asset class to venue_asset_class for applicable trading venues
  5. Update the canonical model presentation with the new entity

Adding a new participant type:

  1. Add the new role type to party_role.role_type enumeration
  2. Create any participant-specific attribute tables if needed
  3. Update party_classification with the new classification scheme
  4. Map the participant to existing order event flows

Adding a new event type:

  1. Insert the new event into event_type_reference
  2. Ensure the order_event table's cat_event_type accommodates the new code
  3. If the event requires additional attributes beyond the base schema, consider creating a detail extension table linked via order_event_id

5. Using the Business Dictionary

Open reference/CAT_PreTrade_Business_Dictionary.xlsx — it contains 6 sheets:

Sheet Contents Audience
Business Glossary 30 CAT and market structure terms with definitions All stakeholders
Entity Catalog All 29 entities with descriptions, subject area, and row volume estimates Data architects, engineers
Attribute Dictionary 79+ attributes with data types, business rules, and sample values Developers, data quality
CAT Event Types 13 CAT event types with their lifecycle stages and reporting rules Compliance, operations
Physical Model Specs Platform-specific implementation details (compression, partitioning) Infrastructure, DBA
Data Governance Data ownership, retention policies, access controls, and quality rules Governance, compliance

Subject Areas

Party Model (9 entities)

The party model follows the Enterprise Supertype/Subtype pattern. PARTY is the root supertype that specializes into LEGAL_ENTITY (firms, exchanges) and NATURAL_PERSON (traders, authorized representatives). All party interactions flow through PARTY_ROLE, which enables a single party to hold multiple roles simultaneously (e.g., a firm can be both a broker-dealer and an exchange member).

Entity Description
party Supertype — all market participants
legal_entity Firms, exchanges, ATSs (LEI-identified)
natural_person Individual traders, authorized persons
party_role Role a party assumes (broker, trader, market maker)
party_relationship Parent/child, affiliation, and control relationships
party_identifier Cross-reference IDs (CRD, LEI, MPID, EIN)
party_address Physical and mailing addresses
party_classification Industry and regulatory classifications
account Trading and customer accounts linked to party roles

Instrument Model (9 entities)

Built on the ISDA/FpML Instrument Taxonomy with a single INSTRUMENT supertype that links to asset-class-specific attribute tables via shared instrument_id. This allows each asset class to carry its own specialized fields while sharing common properties like symbol, status, and currency.

Entity Description
instrument Supertype — all tradeable instruments
equity_attributes Share class, market cap tier, NMS indicator
option_attributes Strike, put/call, expiry, underlying reference
fixed_income_attributes Coupon, credit rating, seniority, issuer
digital_asset_attributes Blockchain, token standard, security classification
instrument_identifier Cross-reference IDs (CUSIP, ISIN, FIGI, OCC symbol)
instrument_classification Taxonomy codes (CFI, product type, asset class)
instrument_listing Venue-specific listing details (lot size, currency)
corporate_action_linkage Stock splits, mergers, symbol changes

Venue Model (2 entities)

Entity Description
venue Exchanges, ATSs, and dark pools (MIC-coded)
venue_asset_class Asset classes supported by each venue

Order Management (6 entities)

The ORDER_EVENT table is the central fact table in an event-sourced architecture. Every order lifecycle event (new order, route, cancel, replace, fill) creates an immutable record. This supports both real-time surveillance and historical audit reconstruction.

Entity Description
order_event Central fact — immutable order lifecycle events
order_route Inter-market and intra-firm routing records
quote_event Quote lifecycle events for market makers
order_modification Field-level change tracking (cancel/replace)
error_correction Post-submission corrections to CAT reports
reporting_submission CAT file submission tracking and status

Operations (3 entities)

Entity Description
error_correction Correction records linked to original events
reporting_submission Submission batches with record counts and status
clock_sync_audit Clock synchronization compliance records

Key Design Patterns

Event Sourcing

All order events are immutable, append-only records. This preserves the complete audit trail required by Rule 613 and enables point-in-time reconstruction of any order's lifecycle. Never update order_event — instead, create a new event with the updated state.

SCD Type 2 (Slowly Changing Dimensions)

All dimension entities (party, instrument, venue) carry effective_date and end_date columns. When a party's status changes or an instrument's attributes update, a new record is inserted with the new effective date and the previous record's end_date is set. This preserves historical accuracy for regulatory lookback periods.

Party Model Pattern

Rather than storing counterparty information directly on transactions, the model uses an indirection layer: PARTY → PARTY_ROLE → ORDER_EVENT. This allows a single legal entity to participate in multiple roles (broker, market maker, custodian) without data duplication, and supports complex organizational hierarchies through PARTY_RELATIONSHIP.

Instrument Inheritance

The instrument model uses table-per-subclass inheritance. INSTRUMENT holds common fields while EQUITY_ATTRIBUTES, OPTION_ATTRIBUTES, FIXED_INCOME_ATTRIBUTES, and DIGITAL_ASSET_ATTRIBUTES hold asset-class-specific fields. This keeps the base table lean while supporting any asset class.

Medallion Architecture

The physical model is designed for deployment in a Medallion (multi-hop) architecture:

Zone Purpose Tables
Bronze (Raw) Ingested CAT data as-is All 29 tables (raw format)
Silver (Curated) Validated, deduped, enriched All 29 tables (cleansed)
Gold (Consumption) Aggregated, business-ready Materialized views, dashboards
Operations Compliance monitoring error_correction, reporting_submission, clock_sync_audit

Data Governance

Ownership

Subject Area Data Owner Data Steward
Party Model Client Data Management KYC/AML Team
Instrument Model Reference Data Management Securities Master Team
Venue Model Market Data Connectivity Team
Order Management Trading Technology Surveillance Team
Operations Regulatory Reporting CAT Compliance Team

Retention Policy

Per SEC Rule 613 requirements, all CAT data must be retained for a minimum of 6 years from the date of creation. The model supports this through partitioned storage by event_date, enabling efficient lifecycle management:

  • Hot tier (0–90 days): Full SSD/NVMe storage for real-time queries
  • Warm tier (90 days–2 years): Standard storage with query optimization
  • Cold tier (2–6 years): Archive storage with on-demand access
  • Purge (6+ years): Automated deletion per retention schedule

Data Quality Rules

The DDL scripts include column-level comments documenting business rules. Key validation rules include:

  • order_event_id must be globally unique across all reporting firms
  • event_timestamp must be synchronized to within 50 milliseconds of NIST (for industry members)
  • cat_order_id must follow the CAT-prescribed format
  • reporter_party_role_id must reference an active party role with a valid CRD number
  • instrument_id must resolve to an active instrument with at least one valid identifier

Data Lineage

Market → OMS/EMS → CAT Reporter → Bronze (raw) → Silver (validated) → Gold (enriched)
                                                                          ↓
                                                               CAT Processor (FINRA)

Sample Queries

Reconstruct an order's full lifecycle

SELECT
  oe.order_event_id,
  oe.cat_event_type,
  oe.event_timestamp,
  oe.cat_order_id,
  pr.role_type AS reporter_type,
  le.entity_name AS reporting_firm,
  i.primary_symbol,
  oe.side,
  oe.order_type,
  oe.price,
  oe.quantity
FROM order_event oe
JOIN party_role pr ON oe.reporter_party_role_id = pr.party_role_id
JOIN party p ON pr.party_id = p.party_id
JOIN legal_entity le ON p.party_id = le.party_id
JOIN instrument i ON oe.instrument_id = i.instrument_id
WHERE oe.cat_order_id = 'CAT-2026-0001-ABCD'
ORDER BY oe.event_timestamp;

Find all routes for a specific order

SELECT
  oe.cat_order_id,
  r.route_id,
  sender_pr.role_type AS sender_role,
  v.venue_name AS destination_venue,
  r.routed_order_id
FROM order_route r
JOIN order_event oe ON r.order_event_id = oe.order_event_id
JOIN party_role sender_pr ON r.sender_party_role_id = sender_pr.party_role_id
JOIN venue v ON r.destination_venue_id = v.venue_id
WHERE oe.cat_order_id = 'CAT-2026-0001-ABCD';

Clock synchronization compliance report

SELECT
  le.entity_name AS firm_name,
  csa.system_name,
  csa.offset_ms,
  csa.compliance_status,
  csa.audit_timestamp
FROM clock_sync_audit csa
JOIN party_role pr ON csa.party_role_id = pr.party_role_id
JOIN party p ON pr.party_id = p.party_id
JOIN legal_entity le ON p.party_id = le.party_id
WHERE ABS(csa.offset_ms) > 50
ORDER BY ABS(csa.offset_ms) DESC;

Daily submission status dashboard

SELECT
  rs.submission_date,
  le.entity_name AS reporting_firm,
  rs.record_count,
  rs.status,
  rs.rejection_count,
  ROUND(rs.rejection_count * 100.0 / NULLIF(rs.record_count, 0), 2) AS rejection_rate_pct
FROM reporting_submission rs
JOIN party_role pr ON rs.reporter_party_role_id = pr.party_role_id
JOIN party p ON pr.party_id = p.party_id
JOIN legal_entity le ON p.party_id = le.party_id
WHERE rs.submission_date >= DATE_SUB(CURRENT_DATE, 7)
ORDER BY rs.submission_date DESC, le.entity_name;

Multi-asset instrument lookup

SELECT
  i.instrument_id,
  i.primary_symbol,
  i.instrument_type,
  ic.asset_class,
  ic.product_type,
  ii.id_type,
  ii.id_value
FROM instrument i
JOIN instrument_classification ic ON i.instrument_id = ic.instrument_id
JOIN instrument_identifier ii ON i.instrument_id = ii.instrument_id
WHERE ii.is_primary = TRUE
  AND i.instrument_status = 'ACTIVE'
ORDER BY ic.asset_class, i.primary_symbol;

Platform Comparison

Feature Delta Lake Hive
Compression ZSTD Snappy
File Format Delta (Parquet-based) Parquet
ACID Transactions Native Hive 3.0+
Time Travel Yes (30-day default) No
Schema Evolution Column mapping mode Limited
Optimization Z-ORDER, Auto-Optimize, Liquid Clustering Bucketing (256–512)
CDC Support Change Data Feed Manual
Best For Cloud-native, Databricks, real-time + batch On-prem Hadoop, batch-heavy

For a detailed comparison, see reference/Implementation_Comparison.md.


Contributing

  1. Fork this repository
  2. Create a feature branch (git checkout -b feature/add-commodity-asset-class)
  3. Follow the existing naming conventions for entities and attributes
  4. Update the business dictionary and canonical model for any schema changes
  5. Submit a pull request with a description of the change and its regulatory impact

Naming Conventions

  • Tables: lowercase, underscore-separated (order_event, party_role)
  • Columns: lowercase, underscore-separated (instrument_id, event_timestamp)
  • Primary keys: {entity}_id (e.g., party_id, order_event_id)
  • Foreign keys: {referenced_entity}_id or descriptive name (e.g., reporter_party_role_id)
  • Timestamps: suffixed with _timestamp or _date depending on precision
  • Booleans: prefixed with is_ or has_ (e.g., is_primary, margin_eligible)

License

This data model is provided for educational and reference purposes. Adapt and deploy according to your organization's regulatory requirements and internal governance policies. This model does not constitute legal or compliance advice — consult with your regulatory compliance team before production deployment.


Version 1.0 | March 2026

About

Enterprise-grade pre-trade data model for CAT (SEC Rule 613) — 29 entities, Delta Lake + Hive DDL, UML diagrams

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages