thanthan9794/stripe_ingestion
Stripe Raw Ingestion (Databricks)

This repository implements a raw ingestion pipeline from the Stripe API into Databricks, with incremental loading and workflow orchestration.

The project is intentionally focused on raw ingestion only:

  • no business transformations in the ingestion step
  • no upserts/merges into raw tables
  • full payload retention for auditability and replay

What This Pipeline Does

For each configured Stripe entity, the job:

  1. Reads the latest resume point from checkpoint events.
  2. Calls Stripe list APIs with:
    • created[gte] watermark filter
    • starting_after cursor for pagination
  3. Writes each object as an immutable raw row to a Delta table.
  4. Appends checkpoint events after each committed page.
  5. Produces a new resume point for the next run.

This design supports daily scheduled ingestion and deterministic recovery from interruption.

Ingested Entities

Current entities are defined in src/stripe_raw_ingestion/entities.py:

  • payment_intents
  • charges
  • balance_transactions
  • refunds
  • invoices

Design Considerations

1) Append-only raw storage

Raw tables are written with .mode("append").
No MERGE, no deletes, no in-place updates.

2) Incremental loading strategy

The pipeline combines three mechanisms:

  • watermark: created[gte]=start_created_gte
  • cursor: starting_after=<last_object_id>
  • lookback overlap: next run starts from max_created_seen - lookback_seconds

3) Resume and recovery

Checkpointing is event-based and append-only (stripe_ingestion_checkpoints):

  • RUN_STARTED
  • PAGE_COMMITTED
  • RUN_COMPLETED
  • RUN_FAILED

Recovery behavior:

  • if latest run is incomplete/failed: resume from last cursor
  • if latest run completed: restart from lookback-adjusted watermark
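The recovery rule can be expressed as a pure function over the checkpoint events. The event shape here is an assumption for illustration (`run_id`, `event_type`, `watermark`, `cursor`, `max_created_seen`, in chronological order); the real `state_store.py` may differ:

```python
def resume_point(events: list[dict], lookback_seconds: int) -> dict:
    """Derive where the next run starts from append-only checkpoint events."""
    if not events:
        return {"start_created_gte": 0, "starting_after": None}
    last_run = events[-1]["run_id"]
    run_events = [e for e in events if e["run_id"] == last_run]
    if any(e["event_type"] == "RUN_COMPLETED" for e in run_events):
        # Completed run: restart from the lookback-adjusted watermark.
        max_seen = max(e.get("max_created_seen", 0) for e in run_events)
        return {"start_created_gte": max(0, max_seen - lookback_seconds),
                "starting_after": None}
    # Incomplete/failed run: resume from the last committed page's cursor.
    pages = [e for e in run_events if e["event_type"] == "PAGE_COMMITTED"]
    cursor = pages[-1]["cursor"] if pages else None
    return {"start_created_gte": run_events[0]["watermark"],
            "starting_after": cursor}
```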

4) API reliability and backoff

HTTP client handles:

  • pagination
  • network timeouts/connection errors
  • retryable HTTP statuses (429, 5xx)
  • exponential backoff + jitter
  • optional Retry-After support

5) Raw contract is analytics-friendly

Each row stores:

  • _raw_json (entire Stripe object)
  • source IDs/timestamps (_stripe_id, _stripe_created_at)
  • ingestion lineage (_ingest_run_id, _ingested_at, _entity)
  • request/page context (_page_number, cursors, status/request_id)
  • payload fingerprint (_payload_sha256)

Project Structure

  • src/stripe_raw_ingestion/main.py
    Entrypoint and Spark session initialization.
  • src/stripe_raw_ingestion/config.py
    Runtime configuration + secret lookup via Databricks secrets.
  • src/stripe_raw_ingestion/pipeline.py
    Entity ingestion loop, incremental logic, checkpoint orchestration.
  • src/stripe_raw_ingestion/http_client.py
    Stripe HTTP client with retry/backoff policy.
  • src/stripe_raw_ingestion/raw_writer.py
    Raw Delta write contract and append logic.
  • src/stripe_raw_ingestion/state_store.py
    Checkpoint event table + resume-point computation.
  • resources/stripe_ingestion_job.yml
    Databricks Job definition for wheel task execution.
  • databricks.yml
    Bundle configuration, targets, and artifact build.

Runtime Parameters

The wheel task reads parameters (via CLI option parsing) for:

  • catalog
  • schema
  • bundle_target
  • stripe_secret_scope
  • stripe_secret_key
  • entities

Other ingestion defaults currently in config:

  • checkpoint table
  • page size
  • lookback seconds
  • max retries
  • request timeout
  • backoff base/max/jitter
  • Stripe API base URL

Deploy and Run

databricks bundle validate
databricks bundle deploy --target dev
databricks bundle run stripe_ingestion_job --target dev

Onboard another entity

To onboard another API or entity family, you can reuse the same pattern:

  1. Add entity endpoint/table mapping in entities.py.
  2. Reuse existing HTTP pagination + retry client.
  3. Keep raw write contract unchanged.
  4. Reuse checkpoint event model for incremental resume.
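Step 1 might look like the following. This is a hypothetical shape for an `entities.py` entry (the repo's real structure may differ), using `disputes` as an example of a newly onboarded entity:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EntityConfig:
    name: str        # logical entity name, used in lineage columns
    endpoint: str    # Stripe list endpoint path
    raw_table: str   # target raw Delta table name

# Hypothetical new entry; existing entities would follow the same pattern.
ENTITIES = {
    "disputes": EntityConfig(
        name="disputes",
        endpoint="/v1/disputes",
        raw_table="stripe_raw_disputes",
    ),
}
```

Because the HTTP client, raw write contract, and checkpoint model are entity-agnostic, adding the mapping is typically the only code change needed.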
