An enterprise-grade pre-trade data model designed for Consolidated Audit Trail (CAT) compliance under SEC Rule 613. Built from a large investment bank perspective, this model covers the full order lifecycle across all CAT-reportable asset classes, participant types, and trading venues.
The model implements the Enterprise Party Model pattern for counterparty management and the ISDA/FpML Instrument Taxonomy for multi-asset instrument representation, providing a production-ready foundation for regulatory reporting infrastructure.
- Quick Start
- Repository Structure
- Model Architecture
- How to Use This Model
- Subject Areas
- Key Design Patterns
- Data Governance
- Sample Queries
- Platform Comparison
- Contributing
- License
For Data Architects — Start with docs/CAT_PreTrade_Canonical_Model.pptx for the unified conceptual + logical view of all 29 entities.
For Data Engineers — Go directly to ddl/CAT_PreTrade_DDL_DeltaLake.sql or ddl/CAT_PreTrade_DDL_Hive.sql and deploy to your target environment.
For Business Analysts — Open reference/CAT_PreTrade_Business_Dictionary.xlsx to understand every entity, attribute, and business term in the model.
For Compliance Officers — Read docs/CAT_PreTrade_Data_Model.docx for the full narrative covering regulatory context, design rationale, and governance framework.
cat-pretrade-data-model/
├── README.md # This file
├── docs/ # Presentations and narrative documents
│ ├── CAT_PreTrade_Canonical_Model.pptx # Unified canonical model (conceptual + logical, 16 slides)
│ ├── CAT_PreTrade_Conceptual_Model.pptx # Conceptual ER diagrams (7 slides)
│ ├── CAT_PreTrade_Logical_Model.pptx # Logical ER diagrams with attributes (11 slides)
│ └── CAT_PreTrade_Data_Model.docx # Narrative document (regulatory context, governance)
├── ddl/ # Physical model DDL scripts
│ ├── CAT_PreTrade_DDL_DeltaLake.sql # Delta Lake / Databricks (29 tables, Spark SQL)
│ └── CAT_PreTrade_DDL_Hive.sql # Apache Hive / Hadoop (29 tables, HiveQL)
└── reference/ # Business dictionary and reference materials
├── CAT_PreTrade_Business_Dictionary.xlsx # 6-sheet workbook (glossary, entities, attributes)
├── CAT_PreTrade_Schema_Summary.txt # Quick-reference schema listing
├── Implementation_Comparison.md # Delta Lake vs Hive trade-off analysis
└── FILE_MANIFEST.txt # Complete file manifest with descriptions
| Dimension | Coverage |
|---|---|
| Regulation | SEC Rule 613 — Consolidated Audit Trail (CAT) |
| Asset Classes | Equities (NMS/non-NMS), Listed Options, Fixed Income, Digital Assets |
| Participants | Broker-Dealers, Exchanges/ATSs, Buy-Side (Large Traders, Institutional) |
| Platforms | Cloud Lakehouse (Delta Lake/Iceberg) and Hadoop/Hive |
| Total Entities | 29 tables across 5 subject areas |
| Relationships | 47 documented foreign key relationships |
┌─────────────────┐
│ PARTY MODEL │
│ 9 entities │
└────────┬────────┘
│
┌──────────────────┼──────────────────┐
│ │ │
┌─────────┴─────────┐ ┌─────┴─────┐ ┌─────────┴─────────┐
│ INSTRUMENT MODEL │ │ VENUE │ │ ORDER MANAGEMENT │
│ 9 entities │ │ 2 entities│ │ 6 entities │
└───────────────────┘ └───────────┘ └─────────┬─────────┘
│
┌─────────┴─────────┐
│ OPERATIONS │
│ 3 entities │
└───────────────────┘
The model is documented at three levels of abstraction. Start with the level that matches your role.
Canonical Model (docs/CAT_PreTrade_Canonical_Model.pptx) — The single authoritative reference. Contains paired slides for each subject area: a conceptual view showing entity relationships with UML <<entity>> stereotype notation, followed by a logical view showing full attribute definitions with PK/FK markers and data types. Also includes a cross-domain relationship map and a relationship matrix of the top 15 foreign key paths. This is the best starting point for understanding the complete model.
Conceptual Model (docs/CAT_PreTrade_Conceptual_Model.pptx) — High-level entity-relationship diagrams organized by subject area. Shows entity boxes with UML connectors and cardinality labels (1, 0..1, 0..*). Use this for stakeholder presentations, design reviews, and communicating model scope.
Logical Model (docs/CAT_PreTrade_Logical_Model.pptx) — Detailed entity attribute tables showing every column, its PK/FK designation, and its role in the model. Entity headers are color-coded by subject area. Use this for development planning, ETL design, and data quality rule definition.
Narrative Document (docs/CAT_PreTrade_Data_Model.docx) — 8-section document covering executive summary, regulatory context (SEC Rule 613 requirements), conceptual model rationale, logical model decisions, physical model specifications, data governance framework, and appendices. Use this for compliance documentation and architecture review boards.
The Delta Lake DDL is optimized for Databricks, AWS EMR, or Azure HDInsight with Delta Lake support.
# Upload the DDL script to your Databricks workspace
databricks workspace import ddl/CAT_PreTrade_DDL_DeltaLake.sql /Shared/CAT/ddl/ --language SQL
# Or execute directly via Databricks SQL
databricks sql execute --sql "$(cat ddl/CAT_PreTrade_DDL_DeltaLake.sql)"Key features of the Delta Lake DDL:
- ZSTD compression for optimal storage efficiency
- Z-ORDER optimization hints for common query patterns (event_date, asset_class)
- Change Data Feed enabled for CDC downstream consumers
- Liquid Clustering support for dynamic partition optimization
- TBLPROPERTIES with
autoOptimize.optimizeWrite = trueandautoOptimize.autoCompact = true - Partitioned by
event_dateandasset_classon fact tables
Pre-deployment checklist:
- Ensure the target catalog and schema exist (default:
cat_pretrade) - Verify cluster has Delta Lake 2.0+ support
- Review partition strategy against your data volume (adjust if processing < 1M events/day)
- Configure Z-ORDER columns based on your most common query filters
The Hive DDL is designed for traditional Hadoop environments with Hive Metastore.
# Execute via Beeline
beeline -u "jdbc:hive2://your-hive-server:10000" -f ddl/CAT_PreTrade_DDL_Hive.sql
# Or via Hive CLI
hive -f ddl/CAT_PreTrade_DDL_Hive.sqlKey features of the Hive DDL:
- Stored as Parquet with Snappy compression
- Bucketed tables (256–512 buckets) for optimized join performance
- ACID transactional support enabled
- CLUSTERED BY on primary keys for co-located data access
- Partitioned by
event_datefor time-range queries
Pre-deployment checklist:
- Enable ACID transactions:
SET hive.support.concurrency = true - Set
hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager - Verify Hive 3.0+ for full ACID support
- Configure bucket count based on expected data volume (target 128MB–256MB per bucket)
For a detailed comparison of both platforms, see reference/Implementation_Comparison.md.
Adding a new asset class (e.g., Commodities):
- Create a new subtype table following the inheritance pattern:
CREATE TABLE commodity_attributes ( instrument_id BIGINT COMMENT 'FK to instrument.instrument_id', commodity_type VARCHAR(50), delivery_month VARCHAR(10), contract_size DECIMAL(18,6), settlement_type VARCHAR(20) )
- Add a row to
asset_class_referencewith the new class code - Update
instrument_classificationto include the new taxonomy entries - Add the new asset class to
venue_asset_classfor applicable trading venues - Update the canonical model presentation with the new entity
Adding a new participant type:
- Add the new role type to
party_role.role_typeenumeration - Create any participant-specific attribute tables if needed
- Update
party_classificationwith the new classification scheme - Map the participant to existing order event flows
Adding a new event type:
- Insert the new event into
event_type_reference - Ensure the
order_eventtable'scat_event_typeaccommodates the new code - If the event requires additional attributes beyond the base schema, consider creating a detail extension table linked via
order_event_id
Open reference/CAT_PreTrade_Business_Dictionary.xlsx — it contains 6 sheets:
| Sheet | Contents | Audience |
|---|---|---|
| Business Glossary | 30 CAT and market structure terms with definitions | All stakeholders |
| Entity Catalog | All 29 entities with descriptions, subject area, and row volume estimates | Data architects, engineers |
| Attribute Dictionary | 79+ attributes with data types, business rules, and sample values | Developers, data quality |
| CAT Event Types | 13 CAT event types with their lifecycle stages and reporting rules | Compliance, operations |
| Physical Model Specs | Platform-specific implementation details (compression, partitioning) | Infrastructure, DBA |
| Data Governance | Data ownership, retention policies, access controls, and quality rules | Governance, compliance |
The party model follows the Enterprise Supertype/Subtype pattern. PARTY is the root supertype that specializes into LEGAL_ENTITY (firms, exchanges) and NATURAL_PERSON (traders, authorized representatives). All party interactions flow through PARTY_ROLE, which enables a single party to hold multiple roles simultaneously (e.g., a firm can be both a broker-dealer and an exchange member).
| Entity | Description |
|---|---|
party |
Supertype — all market participants |
legal_entity |
Firms, exchanges, ATSs (LEI-identified) |
natural_person |
Individual traders, authorized persons |
party_role |
Role a party assumes (broker, trader, market maker) |
party_relationship |
Parent/child, affiliation, and control relationships |
party_identifier |
Cross-reference IDs (CRD, LEI, MPID, EIN) |
party_address |
Physical and mailing addresses |
party_classification |
Industry and regulatory classifications |
account |
Trading and customer accounts linked to party roles |
Built on the ISDA/FpML Instrument Taxonomy with a single INSTRUMENT supertype that links to asset-class-specific attribute tables via shared instrument_id. This allows each asset class to carry its own specialized fields while sharing common properties like symbol, status, and currency.
| Entity | Description |
|---|---|
instrument |
Supertype — all tradeable instruments |
equity_attributes |
Share class, market cap tier, NMS indicator |
option_attributes |
Strike, put/call, expiry, underlying reference |
fixed_income_attributes |
Coupon, credit rating, seniority, issuer |
digital_asset_attributes |
Blockchain, token standard, security classification |
instrument_identifier |
Cross-reference IDs (CUSIP, ISIN, FIGI, OCC symbol) |
instrument_classification |
Taxonomy codes (CFI, product type, asset class) |
instrument_listing |
Venue-specific listing details (lot size, currency) |
corporate_action_linkage |
Stock splits, mergers, symbol changes |
| Entity | Description |
|---|---|
venue |
Exchanges, ATSs, and dark pools (MIC-coded) |
venue_asset_class |
Asset classes supported by each venue |
The ORDER_EVENT table is the central fact table in an event-sourced architecture. Every order lifecycle event (new order, route, cancel, replace, fill) creates an immutable record. This supports both real-time surveillance and historical audit reconstruction.
| Entity | Description |
|---|---|
order_event |
Central fact — immutable order lifecycle events |
order_route |
Inter-market and intra-firm routing records |
quote_event |
Quote lifecycle events for market makers |
order_modification |
Field-level change tracking (cancel/replace) |
error_correction |
Post-submission corrections to CAT reports |
reporting_submission |
CAT file submission tracking and status |
| Entity | Description |
|---|---|
error_correction |
Correction records linked to original events |
reporting_submission |
Submission batches with record counts and status |
clock_sync_audit |
Clock synchronization compliance records |
All order events are immutable, append-only records. This preserves the complete audit trail required by Rule 613 and enables point-in-time reconstruction of any order's lifecycle. Never update order_event — instead, create a new event with the updated state.
All dimension entities (party, instrument, venue) carry effective_date and end_date columns. When a party's status changes or an instrument's attributes update, a new record is inserted with the new effective date and the previous record's end_date is set. This preserves historical accuracy for regulatory lookback periods.
Rather than storing counterparty information directly on transactions, the model uses an indirection layer: PARTY → PARTY_ROLE → ORDER_EVENT. This allows a single legal entity to participate in multiple roles (broker, market maker, custodian) without data duplication, and supports complex organizational hierarchies through PARTY_RELATIONSHIP.
The instrument model uses table-per-subclass inheritance. INSTRUMENT holds common fields while EQUITY_ATTRIBUTES, OPTION_ATTRIBUTES, FIXED_INCOME_ATTRIBUTES, and DIGITAL_ASSET_ATTRIBUTES hold asset-class-specific fields. This keeps the base table lean while supporting any asset class.
The physical model is designed for deployment in a Medallion (multi-hop) architecture:
| Zone | Purpose | Tables |
|---|---|---|
| Bronze (Raw) | Ingested CAT data as-is | All 29 tables (raw format) |
| Silver (Curated) | Validated, deduped, enriched | All 29 tables (cleansed) |
| Gold (Consumption) | Aggregated, business-ready | Materialized views, dashboards |
| Operations | Compliance monitoring | error_correction, reporting_submission, clock_sync_audit |
| Subject Area | Data Owner | Data Steward |
|---|---|---|
| Party Model | Client Data Management | KYC/AML Team |
| Instrument Model | Reference Data Management | Securities Master Team |
| Venue Model | Market Data | Connectivity Team |
| Order Management | Trading Technology | Surveillance Team |
| Operations | Regulatory Reporting | CAT Compliance Team |
Per SEC Rule 613 requirements, all CAT data must be retained for a minimum of 6 years from the date of creation. The model supports this through partitioned storage by event_date, enabling efficient lifecycle management:
- Hot tier (0–90 days): Full SSD/NVMe storage for real-time queries
- Warm tier (90 days–2 years): Standard storage with query optimization
- Cold tier (2–6 years): Archive storage with on-demand access
- Purge (6+ years): Automated deletion per retention schedule
The DDL scripts include column-level comments documenting business rules. Key validation rules include:
order_event_idmust be globally unique across all reporting firmsevent_timestampmust be synchronized to within 50 milliseconds of NIST (for industry members)cat_order_idmust follow the CAT-prescribed formatreporter_party_role_idmust reference an active party role with a valid CRD numberinstrument_idmust resolve to an active instrument with at least one valid identifier
Market → OMS/EMS → CAT Reporter → Bronze (raw) → Silver (validated) → Gold (enriched)
↓
CAT Processor (FINRA)
SELECT
oe.order_event_id,
oe.cat_event_type,
oe.event_timestamp,
oe.cat_order_id,
pr.role_type AS reporter_type,
le.entity_name AS reporting_firm,
i.primary_symbol,
oe.side,
oe.order_type,
oe.price,
oe.quantity
FROM order_event oe
JOIN party_role pr ON oe.reporter_party_role_id = pr.party_role_id
JOIN party p ON pr.party_id = p.party_id
JOIN legal_entity le ON p.party_id = le.party_id
JOIN instrument i ON oe.instrument_id = i.instrument_id
WHERE oe.cat_order_id = 'CAT-2026-0001-ABCD'
ORDER BY oe.event_timestamp;SELECT
oe.cat_order_id,
r.route_id,
sender_pr.role_type AS sender_role,
v.venue_name AS destination_venue,
r.routed_order_id
FROM order_route r
JOIN order_event oe ON r.order_event_id = oe.order_event_id
JOIN party_role sender_pr ON r.sender_party_role_id = sender_pr.party_role_id
JOIN venue v ON r.destination_venue_id = v.venue_id
WHERE oe.cat_order_id = 'CAT-2026-0001-ABCD';SELECT
le.entity_name AS firm_name,
csa.system_name,
csa.offset_ms,
csa.compliance_status,
csa.audit_timestamp
FROM clock_sync_audit csa
JOIN party_role pr ON csa.party_role_id = pr.party_role_id
JOIN party p ON pr.party_id = p.party_id
JOIN legal_entity le ON p.party_id = le.party_id
WHERE ABS(csa.offset_ms) > 50
ORDER BY ABS(csa.offset_ms) DESC;SELECT
rs.submission_date,
le.entity_name AS reporting_firm,
rs.record_count,
rs.status,
rs.rejection_count,
ROUND(rs.rejection_count * 100.0 / NULLIF(rs.record_count, 0), 2) AS rejection_rate_pct
FROM reporting_submission rs
JOIN party_role pr ON rs.reporter_party_role_id = pr.party_role_id
JOIN party p ON pr.party_id = p.party_id
JOIN legal_entity le ON p.party_id = le.party_id
WHERE rs.submission_date >= DATE_SUB(CURRENT_DATE, 7)
ORDER BY rs.submission_date DESC, le.entity_name;SELECT
i.instrument_id,
i.primary_symbol,
i.instrument_type,
ic.asset_class,
ic.product_type,
ii.id_type,
ii.id_value
FROM instrument i
JOIN instrument_classification ic ON i.instrument_id = ic.instrument_id
JOIN instrument_identifier ii ON i.instrument_id = ii.instrument_id
WHERE ii.is_primary = TRUE
AND i.instrument_status = 'ACTIVE'
ORDER BY ic.asset_class, i.primary_symbol;| Feature | Delta Lake | Hive |
|---|---|---|
| Compression | ZSTD | Snappy |
| File Format | Delta (Parquet-based) | Parquet |
| ACID Transactions | Native | Hive 3.0+ |
| Time Travel | Yes (30-day default) | No |
| Schema Evolution | Column mapping mode | Limited |
| Optimization | Z-ORDER, Auto-Optimize, Liquid Clustering | Bucketing (256–512) |
| CDC Support | Change Data Feed | Manual |
| Best For | Cloud-native, Databricks, real-time + batch | On-prem Hadoop, batch-heavy |
For a detailed comparison, see reference/Implementation_Comparison.md.
- Fork this repository
- Create a feature branch (
git checkout -b feature/add-commodity-asset-class) - Follow the existing naming conventions for entities and attributes
- Update the business dictionary and canonical model for any schema changes
- Submit a pull request with a description of the change and its regulatory impact
- Tables: lowercase, underscore-separated (
order_event,party_role) - Columns: lowercase, underscore-separated (
instrument_id,event_timestamp) - Primary keys:
{entity}_id(e.g.,party_id,order_event_id) - Foreign keys:
{referenced_entity}_idor descriptive name (e.g.,reporter_party_role_id) - Timestamps: suffixed with
_timestampor_datedepending on precision - Booleans: prefixed with
is_orhas_(e.g.,is_primary,margin_eligible)
This data model is provided for educational and reference purposes. Adapt and deploy according to your organization's regulatory requirements and internal governance policies. This model does not constitute legal or compliance advice — consult with your regulatory compliance team before production deployment.
Version 1.0 | March 2026