Improve import time and diskspace by raphaeljolivet · Pull Request #255 · brightway-lca/brightway2-data

raphaeljolivet · 2026-01-15T10:50:18Z

This pull request is an attempt to improve the performance of the backend LCI database:

⚡ Reduce import time : from 6 min to 1.3 min
🗃 Reduce size on disk : from 1.2 Gb to 250 Mb (dropping some extra metadata)

I just packaged and published it on PyPI. You can uninstall bw2data and install this version instead to give it a try.

I would be happy to get your feedback.

Main changes

The following changes have been made :

From `Pickle` to `JSON`

Changed the data column from Pickle to a JSON field.

Pickle makes the data impossible to query with SQL and is prone to backward incompatibilities (the format can change between Python versions).

SQLite, by contrast, has great native support for JSON.
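
As a minimal sketch of what a JSON column makes possible, the snippet below filters on a field inside the stored JSON directly in SQL with `json_extract`. The table and field names are made up for illustration and are not the actual bw2data schema. It also assumes the SQLite build bundled with Python ships the JSON functions, which is the case for recent builds.

```python
import json
import sqlite3

# Hypothetical table for illustration only, not the real bw2data schema.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE activity (id INTEGER PRIMARY KEY, data TEXT)")

rows = [
    {"name": "steel production", "location": "FR", "unit": "kilogram"},
    {"name": "electricity, high voltage", "location": "DE", "unit": "kilowatt hour"},
]
con.executemany(
    "INSERT INTO activity (data) VALUES (?)",
    [(json.dumps(row),) for row in rows],
)

# Filter on a field inside the JSON text without deserializing it in Python.
names_in_fr = [
    name
    for (name,) in con.execute(
        "SELECT json_extract(data, '$.name') FROM activity "
        "WHERE json_extract(data, '$.location') = ?",
        ("FR",),
    )
]
```

The same query against a pickled blob would require loading and unpickling every row in Python first.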

Added columns to `Activity` and `Exchange` tables

Added all useful columns to the activity & exchange tables.

Regular operations used to fetch the whole bundled data blob and extract the relevant information from it.
This was not efficient, especially for process().

Instead, all useful information is now spread into new columns: it is removed from data upon save and added back into data upon load.
This ensures maximal backward compatibility with existing code.
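
The save/load round trip described above can be sketched roughly as follows. This is an illustration only: `COLUMN_FIELDS`, `split_for_save` and `merge_on_load` are hypothetical names, and the real list of promoted fields is whatever the PR defines.

```python
# Hypothetical set of fields promoted to dedicated columns.
COLUMN_FIELDS = ("name", "location", "unit")


def split_for_save(data: dict) -> tuple[dict, dict]:
    """Split a dataset into (column values, remaining data blob)."""
    columns = {key: data.get(key) for key in COLUMN_FIELDS}
    remaining = {k: v for k, v in data.items() if k not in COLUMN_FIELDS}
    return columns, remaining


def merge_on_load(columns: dict, remaining: dict) -> dict:
    """Rebuild the dict that calling code expects from columns + blob."""
    merged = dict(remaining)
    # Columns that were NULL (None) are treated as absent from the original dict.
    merged.update({k: v for k, v in columns.items() if v is not None})
    return merged
```

Calling code only ever sees the merged dict, which is why the change can stay transparent. One caveat of this particular sketch: a field explicitly set to `None` before saving comes back as missing after loading.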

Deactivated some checks on import

Some checks were unnecessary and took time during bulk import:

Deactivate the typo check on import by default
Don't run VACUUM if the database was empty before write()

Added an option to drop extra metadata

An option drop_metadata has been added to bw2data.config.

If activated, extra metadata is dropped upon import, saving disk space (from 1.2 GB to 280 MB) while everything keeps working fine.

We could make this option more precise and provide a list of metadata fields to keep.
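
One possible shape for such a keep-list is sketched below. It is purely hypothetical: `DEFAULT_KEEP` and `drop_extra_metadata` are not part of bw2data, and the right default set of fields would need to be decided.

```python
# Hypothetical default; the real set of fields to keep is an open question.
DEFAULT_KEEP = frozenset({"name", "location", "unit", "exchanges"})


def drop_extra_metadata(data: dict, keep=DEFAULT_KEEP) -> dict:
    """Return a copy of `data` restricted to the fields listed in `keep`."""
    return {key: value for key, value in data.items() if key in keep}
```

A user who needs, say, the `comment` field could then pass `keep=DEFAULT_KEEP | {"comment"}` instead of turning the option off entirely.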

Improved import performance

Use raw SQLite queries for bulk inserts, via executemany(): much faster than going through Peewee.
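
For readers unfamiliar with the pattern, here is a minimal sketch using the standard library `sqlite3` module. The table layout is invented for the example and does not match the real exchange table.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical table layout, for illustration only.
con.execute(
    "CREATE TABLE exchange (input_code TEXT, output_code TEXT, amount REAL)"
)

exchanges = [(f"in-{i}", f"out-{i}", float(i)) for i in range(10_000)]

# One prepared statement and one transaction for all rows,
# instead of one ORM object and one INSERT per row.
with con:
    con.executemany(
        "INSERT INTO exchange (input_code, output_code, amount) VALUES (?, ?, ?)",
        exchanges,
    )

row_count = con.execute("SELECT COUNT(*) FROM exchange").fetchone()[0]
```

The `with con:` block commits once at the end, so the per-row overhead is limited to binding parameters.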

📊 Results

Here are the results for a typical import of ecoinvent 3.9:

| Method | Import time (min) | process() time (s) | Size (GB) |
|---|---|---|---|
| bw2data-legacy | 6.0 | 27 | 1.20 |
| bw2data-perf | 1.3 | 7 | 1.00 |
| bw2data-perf+drop_meta | 1.3 | 7 | 0.28 |

Results for a typical import of ecoinvent 3.9:

Total import time goes from 6 minutes to 1m30s
process() goes from 25s to 7s
Size of lci/databases.db goes from 1.2 GB to 260 MB (dropping extra metadata) or 1.0 GB (keeping all extra metadata)

🔃 Compatibility

These changes should be transparent for other packages: the content of data is the same after loading from the database.

I've tried:

The test suite of lca_algebraic: ✅ OK
Activity Browser: LCI computation + Monte Carlo + Sankey work fine, but some columns are empty (categories, ...) in the UI.
I saw that the code in AB does some nasty stuff, accessing the SQLite DB directly rather than relying on the bw2data abstraction.

⌛ TODO

Automatic migration

The SQLite db doesn't contain a user_version yet.

This could be useful to keep track of the evolution of the database, to fail when trying to open a database with a different version,
and even to offer automatic migration.
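
A minimal sketch of such a check, using SQLite's `PRAGMA user_version`. The helper names and the version number are hypothetical; nothing like this exists in bw2data yet.

```python
import sqlite3

EXPECTED_SCHEMA_VERSION = 2  # hypothetical version number


def get_schema_version(con: sqlite3.Connection) -> int:
    # A freshly created SQLite database reports user_version 0.
    return con.execute("PRAGMA user_version").fetchone()[0]


def set_schema_version(con: sqlite3.Connection, version: int) -> None:
    # PRAGMA statements do not accept bound parameters,
    # so the value is formatted in after forcing it to an int.
    con.execute(f"PRAGMA user_version = {int(version)}")


def check_schema(con: sqlite3.Connection) -> None:
    found = get_schema_version(con)
    if found != EXPECTED_SCHEMA_VERSION:
        raise RuntimeError(
            f"Database schema version {found} does not match "
            f"expected version {EXPECTED_SCHEMA_VERSION}"
        )
```

An automatic migration could hook in where the error is raised and apply the upgrade steps between `found` and the expected version.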

Integrate full text search tables into lci database

I find it weird that data is spread across many files in many different formats (SQLite, NPY, pickle).
Ideally, most information should fit into a single homogeneous database.

Now that relevant information is stored in separate columns, we could even save an additional 60 MB of space by putting the FTS tables into the lci database and using the external content feature.

Compress migrations folder

The migrations folder takes 160 MB; gzipping the JSON files would reduce it to only 5 MB!
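
Reading and writing gzipped JSON needs only the standard library, as in the sketch below. The helper names are hypothetical, and the actual compression ratio depends on the files.

```python
import gzip
import json
from pathlib import Path


def write_json_gz(path: Path, payload) -> None:
    """Serialize `payload` as JSON and gzip it in one pass."""
    with gzip.open(path, "wt", encoding="utf-8") as stream:
        json.dump(payload, stream)


def read_json_gz(path: Path):
    """Load a gzipped JSON file back into Python objects."""
    with gzip.open(path, "rt", encoding="utf-8") as stream:
        return json.load(stream)
```

Existing readers would need a fallback to plain `.json` files for projects created before the change.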


- Changed `data` column from Pickle to JSON field
- Add many useful columns to activity & exchange tables
- Normalize data: avoid duplicate information by removing the corresponding field from 'data' upon save, and adding it back upon load
- Deactivate typo check on import by default
- Don't run VACUUM if database was empty before write()
- Use raw SQLite queries for many inserts, using executemany(): much faster than using Peewee
- Add an option to drop extra metadata upon imports

Results for a typical import of ecoinvent 3.9:

- Total import time goes from 6 minutes to 1m30s
- process() goes from 25s to 7s
- Size of lci/databases.db goes from 1.2 GB to 260 MB (dropping extra metadata) or 1.0 GB (keeping all extra metadata)

cmutel · 2026-03-04T21:05:55Z

Thanks @raphaeljolivet - sorry, didn't see this notification and just stumbled across it now.

These look like great changes in general, and certainly are in line with how I want the library to evolve. However, I worry that doing things as a big bang is a bit risky, and is certainly much harder to review. It feels like this can be broken into several parts and examined in detail.

Let me think a bit about this and I will come back to you.

cmutel · 2026-05-13T18:00:03Z

Hi Raphael — thank you for this detailed PR and for publishing bw2data-perf so people could actually try it. The benchmark numbers are compelling.

We've started pulling pieces of this work into mainline in separate focused PRs. So far:

Add memoization cache to get_id() with signal-based invalidation #266 (merged) extracts the get_id() memoization cache from your first commit
Use raw SQLite executemany for bulk database writes #268 (open) extracts the raw SQLite executemany approach for bulk writes — just that part, without the schema or serialization changes

Both carry a Co-authored-by credit pointing to you.

The larger structural changes — schema denormalization with new columns, pickle → JSON with orjson, drop_metadata — are more invasive and we want to think through the migration story carefully before landing them. I'll continue to reference this PR as that work progresses.

raphaeljolivet · 2026-05-18T07:23:33Z

Thanks !

That was a huge PR that I should have split into smaller parts.
I'm happy to see you integrate the low-hanging fruit.

Have a good.

raphaeljolivet added 2 commits December 22, 2025 16:06

Adding memoization (cache) to get_id, improves time for import of eco…

ca981dd

…ninvent by 1 minute.

cmutel mentioned this pull request May 13, 2026

Use raw SQLite executemany for bulk database writes #268

Merged

2 tasks


raphaeljolivet wants to merge 2 commits into brightway-lca:main from raphaeljolivet:import-perfs
