This repository implements a raw ingestion pipeline from the Stripe API to Databricks, with incremental loading and workflow orchestration.
The project is intentionally focused on raw ingestion only:
- no business transformations in the ingestion step
- no upserts/merges into raw tables
- full payload retention for auditability and replay
For each configured Stripe entity, the job:
- Reads the latest resume point from checkpoint events.
- Calls Stripe list APIs with:
  - a `created[gte]` watermark filter
  - a `starting_after` cursor for pagination
- Writes each object as an immutable raw row to a Delta table.
- Appends checkpoint events after each committed page.
- Produces a new resume point for the next run.
This design supports daily scheduled ingestion and deterministic recovery from interruption.
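As a rough sketch of that per-entity loop (all helper names here, such as `read_resume_point`, `list_page`, `append_raw_rows`, and `append_checkpoint_event`, are hypothetical stand-ins for the real pipeline/state_store/raw_writer functions):

```python
# Illustrative per-entity loop; helpers are hypothetical, not the repo's actual API.
def ingest_entity(entity: str, run_id: str) -> None:
    resume = read_resume_point(entity)              # derived from checkpoint events
    append_checkpoint_event(entity, run_id, "RUN_STARTED")
    cursor, page_number = resume["starting_after"], 0
    while True:
        page = list_page(entity,
                         created_gte=resume["created_gte"],
                         starting_after=cursor)     # Stripe list call
        if page.objects:
            append_raw_rows(entity, run_id, page, page_number)  # append-only Delta write
            cursor = page.objects[-1]["id"]         # next starting_after cursor
            append_checkpoint_event(entity, run_id, "PAGE_COMMITTED",
                                    cursor=cursor, page_number=page_number)
            page_number += 1
        if not page.has_more:                       # Stripe signals end of list
            break
    append_checkpoint_event(entity, run_id, "RUN_COMPLETED")
```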
Current entities are defined in `src/stripe_raw_ingestion/entities.py`:
`payment_intents`, `charges`, `balance_transactions`, `refunds`, `invoices`
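The registry presumably maps each entity to a Stripe list endpoint and a target table. A plausible shape (the field and table names are assumptions; the endpoints are Stripe's real list APIs):

```python
# Illustrative entity registry; EntityConfig and the table names are assumptions.
from dataclasses import dataclass

@dataclass(frozen=True)
class EntityConfig:
    name: str        # entity name used in job parameters
    endpoint: str    # Stripe list endpoint
    raw_table: str   # target raw Delta table

ENTITIES = [
    EntityConfig("payment_intents", "/v1/payment_intents", "raw_stripe_payment_intents"),
    EntityConfig("charges", "/v1/charges", "raw_stripe_charges"),
    EntityConfig("balance_transactions", "/v1/balance_transactions", "raw_stripe_balance_transactions"),
    EntityConfig("refunds", "/v1/refunds", "raw_stripe_refunds"),
    EntityConfig("invoices", "/v1/invoices", "raw_stripe_invoices"),
]
```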
Raw tables are written with `.mode("append")`.
No `MERGE`, no deletes, no in-place updates.
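In PySpark terms, the write contract amounts to a plain Delta append (the DataFrame variable and three-level table name are illustrative):

```python
# Append-only raw write: no MERGE, no deletes, no in-place updates.
(raw_df.write
       .format("delta")
       .mode("append")
       .saveAsTable("main.stripe_raw.charges"))  # illustrative catalog.schema.table
```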
The pipeline combines three mechanisms (sketched below):
- watermark: `created[gte]=start_created_gte`
- cursor: `starting_after=<last_object_id>`
- lookback overlap: next run starts from `max_created_seen - lookback_seconds`
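A minimal sketch of how the three mechanisms combine into request parameters (function names and the lookback constant are assumptions):

```python
from typing import Optional

LOOKBACK_SECONDS = 300  # illustrative; the real value comes from config

def next_watermark(max_created_seen: int) -> int:
    # Lookback overlap: re-read a small trailing window so late-arriving
    # objects are not missed; duplicate raw rows are acceptable by design.
    return max_created_seen - LOOKBACK_SECONDS

def list_params(created_gte: int, starting_after: Optional[str],
                page_size: int = 100) -> dict:
    # Watermark filter plus pagination cursor for a Stripe list call.
    params = {"created[gte]": created_gte, "limit": page_size}
    if starting_after:
        params["starting_after"] = starting_after
    return params
```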
Checkpointing is event-based and append-only (`stripe_ingestion_checkpoints`), with four event types:
`RUN_STARTED`, `PAGE_COMMITTED`, `RUN_COMPLETED`, `RUN_FAILED`
Recovery behavior (sketched below):
- if the latest run is incomplete or failed: resume from the last committed cursor
- if the latest run completed: restart from the lookback-adjusted watermark
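A sketch of resume-point computation under those rules, assuming checkpoint rows carry `run_id`, `event_type`, `cursor`, `created_gte`, and `max_created_seen` fields (the actual column names may differ):

```python
LOOKBACK_SECONDS = 300  # illustrative

def compute_resume_point(events: list) -> dict:
    """events: checkpoint rows for one entity, newest first (assumed shape)."""
    latest_run_id = events[0]["run_id"]
    run = [e for e in events if e["run_id"] == latest_run_id]
    pages = [e for e in run if e["event_type"] == "PAGE_COMMITTED"]
    if any(e["event_type"] == "RUN_COMPLETED" for e in run) and pages:
        # Completed run: restart from the lookback-adjusted watermark.
        max_created = max(e["max_created_seen"] for e in pages)
        return {"created_gte": max_created - LOOKBACK_SECONDS,
                "starting_after": None}
    if pages:
        # Incomplete/failed run: resume from the last committed cursor.
        return {"created_gte": pages[0]["created_gte"],
                "starting_after": pages[0]["cursor"]}
    # Nothing committed yet: fall back to the run's starting watermark.
    return {"created_gte": run[-1]["created_gte"], "starting_after": None}
```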
The HTTP client handles:
- pagination
- network timeouts/connection errors
- retryable HTTP statuses (`429`, `5xx`)
- exponential backoff + jitter
- optional `Retry-After` support
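A self-contained sketch of that retry policy using `requests` (the repository's client may differ in names and defaults):

```python
import random
import time
import requests

RETRYABLE = {429, 500, 502, 503, 504}

def get_with_retries(url: str, params: dict, api_key: str,
                     max_retries: int = 5, timeout: float = 30.0,
                     backoff_base: float = 1.0, backoff_max: float = 60.0) -> dict:
    """Illustrative retry loop: exponential backoff with jitter,
    honoring Retry-After when the server sends it."""
    for attempt in range(max_retries + 1):
        retry_after = None
        try:
            # Stripe accepts the secret key as the HTTP basic-auth username.
            resp = requests.get(url, params=params, timeout=timeout,
                                auth=(api_key, ""))
            if resp.status_code not in RETRYABLE:
                resp.raise_for_status()  # non-retryable 4xx raises immediately
                return resp.json()
            retry_after = resp.headers.get("Retry-After")
        except (requests.Timeout, requests.ConnectionError):
            pass  # network error: fall through to backoff
        if attempt == max_retries:
            raise RuntimeError(f"Giving up on {url} after {max_retries} retries")
        delay = float(retry_after) if retry_after else min(backoff_max,
                                                           backoff_base * 2 ** attempt)
        time.sleep(delay + random.uniform(0, 0.5))  # jitter
```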
Each row stores:
- `_raw_json` (entire Stripe object)
- source IDs/timestamps (`_stripe_id`, `_stripe_created_at`)
- ingestion lineage (`_ingest_run_id`, `_ingested_at`, `_entity`)
- request/page context (`_page_number`, cursors, status/request_id)
- payload fingerprint (`_payload_sha256`)
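Row construction plausibly looks like this (column names follow the list above; `_request_id` and `_cursor` stand in for the unspecified context columns):

```python
import hashlib
import json
from datetime import datetime, timezone

def build_raw_row(obj: dict, entity: str, run_id: str,
                  page_number: int, request_id: str, cursor: str) -> dict:
    """Illustrative raw-row construction; not the repository's exact code."""
    # Canonical JSON (sorted keys, compact separators) keeps the SHA-256
    # fingerprint stable for identical payloads across runs.
    raw_json = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return {
        "_raw_json": raw_json,
        "_stripe_id": obj["id"],
        "_stripe_created_at": obj["created"],  # Unix epoch seconds
        "_ingest_run_id": run_id,
        "_ingested_at": datetime.now(timezone.utc).isoformat(),
        "_entity": entity,
        "_page_number": page_number,
        "_request_id": request_id,             # illustrative context column
        "_cursor": cursor,                     # illustrative context column
        "_payload_sha256": hashlib.sha256(raw_json.encode("utf-8")).hexdigest(),
    }
```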
- `src/stripe_raw_ingestion/main.py`: Entrypoint and Spark session initialization.
- `src/stripe_raw_ingestion/config.py`: Runtime configuration + secret lookup via Databricks secrets.
- `src/stripe_raw_ingestion/pipeline.py`: Entity ingestion loop, incremental logic, checkpoint orchestration.
- `src/stripe_raw_ingestion/http_client.py`: Stripe HTTP client with retry/backoff policy.
- `src/stripe_raw_ingestion/raw_writer.py`: Raw Delta write contract and append logic.
- `src/stripe_raw_ingestion/state_store.py`: Checkpoint event table + resume-point computation.
- `resources/stripe_ingestion_job.yml`: Databricks Job definition for wheel task execution.
- `databricks.yml`: Bundle configuration, targets, and artifact build.
The wheel task reads parameters (via CLI option parsing) for:
`catalog`, `schema`, `bundle_target`, `stripe_secret_scope`, `stripe_secret_key`, `entities`
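An illustrative `argparse` version of that option parsing (flag spellings are assumptions):

```python
import argparse

def parse_args(argv=None) -> argparse.Namespace:
    # Hypothetical CLI for the wheel task parameters listed above.
    p = argparse.ArgumentParser(prog="stripe_raw_ingestion")
    p.add_argument("--catalog", required=True)
    p.add_argument("--schema", required=True)
    p.add_argument("--bundle_target", required=True)
    p.add_argument("--stripe_secret_scope", required=True)
    p.add_argument("--stripe_secret_key", required=True)
    p.add_argument("--entities", default="all",
                   help="comma-separated entity names, e.g. charges,refunds")
    return p.parse_args(argv)
```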
Other ingestion defaults currently in config:
- checkpoint table
- page size
- lookback seconds
- max retries
- request timeout
- backoff base/max/jitter
- Stripe API base URL
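Collected as a dataclass, those defaults might look like this (values are illustrative; only the checkpoint table name comes from this document, and the base URL is Stripe's public API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IngestionDefaults:
    # Illustrative shape; actual names and values live in config.py.
    checkpoint_table: str = "stripe_ingestion_checkpoints"
    page_size: int = 100
    lookback_seconds: int = 300
    max_retries: int = 5
    request_timeout_seconds: float = 30.0
    backoff_base_seconds: float = 1.0
    backoff_max_seconds: float = 60.0
    backoff_jitter_seconds: float = 0.5
    stripe_api_base_url: str = "https://api.stripe.com"
```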
```bash
databricks bundle validate
databricks bundle deploy --target dev
databricks bundle run stripe_ingestion_job --target dev
```

To onboard another API or entity family, you can reuse the same pattern:
- Add the entity endpoint/table mapping in `entities.py` (example after this list).
- Reuse the existing HTTP pagination + retry client.
- Keep the raw write contract unchanged.
- Reuse the checkpoint event model for incremental resume.
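For example, onboarding a hypothetical `payouts` entity would just extend the registry sketched earlier; the client, writer, and checkpoint model are reused as-is:

```python
# Hypothetical new entity: one registry entry, no other code changes.
ENTITIES.append(EntityConfig("payouts", "/v1/payouts", "raw_stripe_payouts"))
```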