This library provides an implementation of the DBSP language for incremental streaming computations. It is a tool primarily meant for research. See it as the PyTorch of streaming.
As of v2.0.0 it has zero runtime dependencies and is written in pure Python.
Here you can find a single-notebook implementation of almost everything in the DBSP paper. It mirrors what is in this library in an accessible way, and with more examples.
DBSP is differential dataflow's less expressive successor. It is a competing theory and framework to other stream processing systems such as Flink and Spark.
Its value is most easily understood in that it is capable of transforming "batch" possibly-iterative relational queries into "streaming incremental ones". This however only conveys a fraction of the theory's power.
As an extreme example, this library ships an incremental interpreter for stratified Datalog with negation (pydbsp.datalog_stratified.IncrementalDatalogStratified). Datalog is a query language similar to SQL, focused on efficiently supporting recursion. By implementing Datalog interpretation with DBSP, we get an interpreter whose queries can both change during runtime and respond to new data being streamed in.
Given the following SQL views over a transactions(id, from_account, to_account, amount) table:
create view credits as
select to_account as account, sum(amount) as credits
from transactions group by to_account;
create view debits as
select from_account as account, sum(amount) as debits
from transactions group by from_account;
create view balance as
select credits.account, credits - debits as balance
from credits inner join debits on credits.account = debits.account;
create materialized view total as
select sum(balance) from balance;The PyDBSP equivalent wires the same dataflow incrementally. Transactions arrive one per outer tick, and after every push we read out the latest total. Records are (id, from_account, to_account, amount) tuples:
from pydbsp.circuit import Circuit
from pydbsp.compute import ComputeCtx
from pydbsp.core import Antichain, dbsp_time
from pydbsp.evaluate import Evaluator
from pydbsp.indexed_relational_operators import (
IndexedDeltaLiftedDeltaLiftedJoin, LiftGroupBy, LiftIndex,
)
from pydbsp.indexed_zset import IndexedZSetAddition
from pydbsp.operator import (
Differentiate, Input, Integrate, Lift1, LiftStreamIntroduction,
)
from pydbsp.storage import DictStorage
from pydbsp.zset import ZSet, ZSetAddition
g_rec = ZSetAddition()
g_kv = ZSetAddition()
g_idx = IndexedZSetAddition(g_kv, lambda kv: kv[0])
e = Evaluator(
circuit=Circuit(),
storage=DictStorage(),
ctx=ComputeCtx(lattice=dbsp_time(2)),
group=g_rec,
)
src = Input(frontier=Antichain(dbsp_time(1))).connect(e.circuit, ())
src_2d = LiftStreamIntroduction(group=g_rec).connect(e.circuit, (src,))
cum = Integrate(group=g_rec).connect(e.circuit, (src_2d,))
sum_amount = lambda items: sum(r[3] * w for r, w in items)
credits = LiftGroupBy(aggregate=sum_amount).connect(
e.circuit,
(LiftIndex(indexer=lambda r: r[2]).connect(e.circuit, (cum,)),),
)
debits = LiftGroupBy(aggregate=sum_amount).connect(
e.circuit,
(LiftIndex(indexer=lambda r: r[1]).connect(e.circuit, (cum,)),),
)
dc = Differentiate(group=g_kv).connect(e.circuit, (credits,))
dd = Differentiate(group=g_kv).connect(e.circuit, (debits,))
balance_delta = IndexedDeltaLiftedDeltaLiftedJoin(
proj=lambda k, c, d: (k, c[1] - d[1]),
group_a=g_idx, group_b=g_idx, out_group=g_kv,
).connect(
e.circuit,
(
LiftIndex(indexer=lambda kv: kv[0]).connect(e.circuit, (dc,)),
LiftIndex(indexer=lambda kv: kv[0]).connect(e.circuit, (dd,)),
),
)
balance = Integrate(group=g_kv).connect(e.circuit, (balance_delta,))
global_idx = LiftIndex(indexer=lambda _kv: ()).connect(e.circuit, (balance,))
total_gb = LiftGroupBy(
aggregate=lambda items: sum(b[1] * w for b, w in items),
).connect(e.circuit, (global_idx,))
total = Lift1(
f=lambda z: next((c for (_k, c), _w in z.inner.items()), 0),
).connect(e.circuit, (total_gb,))
# Stream one transaction per outer tick.
for tick, txn in enumerate([
(0, 1, 2, 50), # cumulative total after this tick: 0
(1, 2, 3, 30), # cumulative total after this tick: 20
(2, 3, 1, 20), # cumulative total after this tick: 0
]):
e.push(src, ZSet({txn: 1}))
print(f"tick {tick}: total = {e.read(total, (tick, 0))}")
print("final balance:", e.read(balance, (2, 0)).inner)
# → {(1, -30): 1, (2, 20): 1, (3, 10): 1}Every credit is some account's debit, so total nets to zero whenever the inner join of credits and debits has caught up with every account. Early ticks may report a non-zero total while some account is still missing from one side of the join.
- Lean declaration index — auto-generated inventory of the 527 top-level
def/lemma/theorementries in the DBSP Lean formalisation, with a one-line signature for each. Useful as a grep target rather than an editorial reference. - Progress algebra — the antichain machinery PyDBSP is built on: the time lattice, down-sets, antichains, the per-axis
shift/retreat/drop_axis/insert_axisprimitives, settled frontiers, and the forward / backward propagation walks. - Implementation of the DBSP Paper — single-notebook port of almost everything in the paper.
- Quickstart — six primitives tour with two canonical pipelines.
- Tutorial — six-section tour of the public API: Z-sets, the evaluator, the doubly-incremental DLD join, sort-merge indexing, transitive closure, and Datalog.
- SQL operator walkthrough — incremental forms of the SQL operators from §4.2 of the DBSP paper.
- Datalog —
IndexedIncrementalDatalogBodyon a transitive-closure program, batched vs. drip-fed. - Stratified Datalog with negation —
IncrementalDatalogStratifiedon two negation programs (transitive complement; 4-cycle without an overlapping 3-cycle). - RDFS materialization — LUBM1 through
IndexedIncrementalRDFSBodycross-checked against a Datalog re-encoding. - Benchmarks — sort-merge-indexed reachability and Datalog TC across the bundled graphs.
- Internals dissection — eight cells tracing the jamie SQL aggregate pipeline at progressively finer resolution.
There are many examples living in each tests/test_*.py file.