Improve import time and diskspace #255

Open
raphaeljolivet wants to merge 2 commits into brightway-lca:main from raphaeljolivet:import-perfs

@raphaeljolivet
Contributor

@raphaeljolivet raphaeljolivet commented Jan 15, 2026

This pull request is an attempt to improve the performance of the LCI database backend:

  • ⚡ Reduce import time: from 6 min to 1.3 min
  • 🗃 Reduce size on disk: from 1.2 GB to 250 MB (when dropping some extra metadata)

I just packaged and published it on PyPI. You can uninstall bw2data and install this version instead to give it a try.

I would be happy to have your feedback.

Main changes

The following changes have been made:

From Pickle to JSON

Changed the data column from a Pickle field to a JSON field.

Storing data as pickle makes it impossible to query with SQL and is prone to backward incompatibilities (the format can change with the Python version).

SQLite, by contrast, has great native support for JSON.
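
To illustrate what a JSON column makes possible, here is a minimal sketch that filters on a field stored inside a JSON document, directly in SQL. The table and field names are hypothetical and do not reflect the actual bw2data schema; it also assumes that the SQLite build shipped with Python includes the JSON functions.

```python
import json
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical table: one JSON document per activity.
con.execute("CREATE TABLE activity (id INTEGER PRIMARY KEY, data TEXT)")
rows = [
    (json.dumps({"name": "steel production", "location": "FR"}),),
    (json.dumps({"name": "electricity, high voltage", "location": "DE"}),),
]
con.executemany("INSERT INTO activity (data) VALUES (?)", rows)

# Filter on a field inside the JSON document without decoding it in Python.
names = [
    name
    for (name,) in con.execute(
        "SELECT json_extract(data, '$.name') FROM activity "
        "WHERE json_extract(data, '$.location') = ?",
        ("FR",),
    )
]
```

The same query against a pickled blob would require loading and unpickling every row in Python first.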

Added columns to Activity and Exchange tables

All useful fields now have their own columns in the Activity and Exchange tables.

Regular operations used to fetch the whole bundled data blob and extract the relevant information from it.
This was inefficient, especially for process().

Instead, all useful information is now stored in the new columns: it is removed from data upon save and added back into data upon load.
This ensures maximal backward compatibility with existing code.
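
The save/load round trip described above can be sketched as follows. This is not the code from the PR: the list of promoted fields and the function names are hypothetical, and the sketch treats a `None` column as "field absent".

```python
# Hypothetical subset of fields promoted to dedicated columns.
COLUMN_FIELDS = ("name", "location", "unit")


def split_for_save(data):
    """Return (columns, remaining_data) without mutating the input."""
    columns = {key: data.get(key) for key in COLUMN_FIELDS}
    remaining = {k: v for k, v in data.items() if k not in COLUMN_FIELDS}
    return columns, remaining


def merge_on_load(columns, remaining):
    """Rebuild the original dictionary from the columns and the leftover data."""
    merged = dict(remaining)
    # A NULL column (None) is treated as "field absent" in this sketch.
    merged.update({k: v for k, v in columns.items() if v is not None})
    return merged


original = {"name": "steel production", "location": "FR", "comment": "example"}
columns, remaining = split_for_save(original)
restored = merge_on_load(columns, remaining)
```

Callers only ever see `restored`, which is why the change can stay transparent to code that reads the data dictionary.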

Deactivated some checks on import

Some checks were unnecessary and slowed down bulk imports:

  • Deactivate the typo check on import by default
  • Don't run VACUUM if the database was empty before write()
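
A minimal sketch of the second point, assuming a write that replaces existing rows: VACUUM only pays off when pages were freed by a deletion, so it can be skipped when the table was empty. The `write` function and the `activity` table here are hypothetical, not the bw2data implementation.

```python
import sqlite3


def write(con, codes):
    """Replace the table content; return True if VACUUM was run."""
    had_rows = con.execute("SELECT EXISTS(SELECT 1 FROM activity)").fetchone()[0] == 1
    with con:  # single transaction for delete + insert
        con.execute("DELETE FROM activity")
        con.executemany("INSERT INTO activity (code) VALUES (?)", [(c,) for c in codes])
    if had_rows:
        # Only reclaim space when something was actually deleted.
        con.execute("VACUUM")
    return had_rows


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE activity (code TEXT)")
first_vacuumed = write(con, ["a", "b"])   # table was empty: no VACUUM
second_vacuumed = write(con, ["c"])       # table had rows: VACUUM runs
```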

Added an option to drop extra metadata

An option drop_metadata has been added to bw2data.config.

If activated, extra metadata is dropped upon import, which saves disk space (from 1.2 GB to 280 MB) while everything keeps working fine.

We could make this option more precise and provide a list of metadata fields to keep.
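
A "fields to keep" variant could look like the sketch below. The field list and the helper name are purely illustrative assumptions; they are not part of bw2data or of this PR.

```python
# Hypothetical allow-list of metadata fields that survive the import.
KEEP_FIELDS = frozenset({"name", "location", "unit", "reference product"})


def drop_extra_metadata(data, keep=KEEP_FIELDS):
    """Return a copy of `data` restricted to the fields listed in `keep`."""
    return {key: value for key, value in data.items() if key in keep}


activity = {
    "name": "steel production",
    "location": "FR",
    "comment": "long free-text description",
    "authors": ["someone"],
}
slimmed = drop_extra_metadata(activity)
```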

Improved import performance

Use raw SQLite queries with executemany() for bulk inserts: this is much faster than going through Peewee.
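
For readers unfamiliar with the pattern, here is a minimal sketch of a bulk insert with `sqlite3.Connection.executemany`, run inside a single transaction. The `exchange` table and its columns are hypothetical and much simpler than the real schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE exchange (input_code TEXT, output_code TEXT, amount REAL)")

# Illustrative rows; a real import would stream these from the source database.
exchanges = [(f"in-{i}", f"out-{i % 10}", float(i)) for i in range(1000)]

# One transaction and one prepared statement for all rows.
with con:
    con.executemany(
        "INSERT INTO exchange (input_code, output_code, amount) VALUES (?, ?, ?)",
        exchanges,
    )

(row_count,) = con.execute("SELECT COUNT(*) FROM exchange").fetchone()
```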

📊 Results

Here are the results for a typical import of ecoinvent 3.9:

| Method | Import time (min) | process() time (s) | Size (GB) |
|---|---|---|---|
| bw2data-legacy | 6.0 | 27 | 1.20 |
| bw2data-perf | 1.3 | 7 | 1.00 |
| bw2data-perf+drop_meta | 1.3 | 7 | 0.28 |

Results for a typical import of ecoinvent 3.9:

  • Total import time goes from 6 minutes to 1 min 30 s
  • process() goes from 25 s to 7 s
  • Size of lci/databases.db goes from 1.2 GB to 260 MB (dropping extra metadata) or 1.0 GB (keeping all extra metadata)

🔃 Compatibility

These changes should be transparent for other packages: the content of data is the same after loading from the database.

I've tried:

  • The test suite of lca_algebraic: ✅ OK
  • Activity Browser: LCI computation, Monte Carlo and Sankey work fine, but some columns (categories, ...) are empty in the UI.
    I saw that the AB code does some nasty stuff: it accesses the SQLite DB directly rather than relying on the bw2data abstraction.

⌛ TODO

Automatic migration

The SQLite db doesn't set user_version yet.

This could be useful to keep track of the evolution of the database schema, to fail when trying to open a database with a different version,
and even to propose automatic migration.
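
As a rough sketch of that idea, SQLite's `PRAGMA user_version` stores an integer in the database header that can be checked on open. The expected version number and the function names below are hypothetical.

```python
import sqlite3

EXPECTED_SCHEMA_VERSION = 2  # hypothetical version number


def get_version(con):
    """Read the integer stored in the database header."""
    return con.execute("PRAGMA user_version").fetchone()[0]


def check_version(con):
    """Fail early if the database was written with another schema version."""
    found = get_version(con)
    if found != EXPECTED_SCHEMA_VERSION:
        raise RuntimeError(
            f"Schema version {found} found, {EXPECTED_SCHEMA_VERSION} expected"
        )


con = sqlite3.connect(":memory:")
initial_version = get_version(con)  # a fresh database reports 0

# PRAGMA statements do not accept bound parameters, so the integer is formatted in.
con.execute(f"PRAGMA user_version = {int(EXPECTED_SCHEMA_VERSION)}")
check_version(con)  # passes once the version has been set
```

A migration step would sit where the `RuntimeError` is raised, upgrading the schema and then bumping the stored version.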

Integrate full text search tables into lci database

I find it weird that data is spread across many files in many different formats (SQLite, NPY, pickle).
Ideally, most information should fit into a single homogeneous database.

Now that the relevant information is stored in separate columns, we could even save an additional 60 MB of space by putting the FTS tables into the lci database and using the external content feature.
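
The external content idea can be sketched as follows, assuming FTS5 is available in the SQLite build used by Python. The table and column names are hypothetical. The index refers to rows of the content table instead of storing its own copy of the text, which is where the space saving would come from.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE activity (id INTEGER PRIMARY KEY, name TEXT, location TEXT)")
con.executemany(
    "INSERT INTO activity (name, location) VALUES (?, ?)",
    [("steel production", "FR"), ("electricity production, wind", "DE")],
)

# External-content FTS5 table: indexes `activity.name` without duplicating it.
con.execute(
    "CREATE VIRTUAL TABLE activity_fts USING fts5("
    "name, content='activity', content_rowid='id')"
)
# Build the index from the rows already present in the content table.
con.execute("INSERT INTO activity_fts(activity_fts) VALUES ('rebuild')")

matches = con.execute(
    "SELECT name, location FROM activity WHERE id IN ("
    "SELECT rowid FROM activity_fts WHERE activity_fts MATCH ?)",
    ("steel",),
).fetchall()
```

With external content tables the index is not updated automatically; in practice triggers on the content table (or an explicit rebuild after import) would be needed to keep it in sync.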

Compress migrations folder

The migrations folder takes 160 MB; gzipping the JSON files would reduce it to only 5 MB!
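
Reading and writing gzipped JSON needs only the standard library, as in this minimal sketch. The file name and payload are made up for illustration and do not match the real migration files.

```python
import gzip
import json
import tempfile
from pathlib import Path


def write_json_gz(path, payload):
    """Serialize `payload` as gzip-compressed UTF-8 JSON."""
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        json.dump(payload, handle)


def read_json_gz(path):
    """Load a gzip-compressed JSON file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


# Repetitive, made-up payload standing in for a migration file.
payload = {"fields": ["name"], "data": [[["old name"], {"name": "new name"}]] * 100}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "migration.json.gz"
    write_json_gz(path, payload)
    restored = read_json_gz(path)
    compressed_size = path.stat().st_size

raw_size = len(json.dumps(payload).encode("utf-8"))
```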

- Changed `data` column from Pickle to JSON field
- Add many useful columns to activity & exchange tables
- Normalize data: avoid duplicate information by removing the corresponding field from 'data' upon save and adding it back upon load
- Deactivate typo check on import by default
- Don't run VACUUM if the database was empty before write()
- Use raw SQLite queries with executemany() for bulk inserts: much faster than using Peewee
- Add an option to drop extra metadata upon import

@cmutel
Member

cmutel commented Mar 4, 2026

Thanks @raphaeljolivet - sorry, didn't see this notification and just stumbled across it now.

These look like great changes in general, and certainly are in line with how I want the library to evolve. However, I worry that doing things as a big bang is a bit risky, and is certainly much harder to review. It feels like this can be broken into several parts and examined in detail.

Let me think a bit about this and I will come back to you.

@cmutel
Member

cmutel commented May 13, 2026

Hi Raphael — thank you for this detailed PR and for publishing bw2data-perf so people could actually try it. The benchmark numbers are compelling.

We've started pulling pieces of this work into mainline in separate focused PRs. So far:

Both carry a Co-authored-by credit pointing to you.

The larger structural changes — schema denormalization with new columns, pickle → JSON with orjson, drop_metadata — are more invasive and we want to think through the migration story carefully before landing them. I'll continue to reference this PR as that work progresses.

@raphaeljolivet
Contributor Author

Thanks!

That was a huge PR that I should have split into smaller parts.
I'm happy to see you integrate the low-hanging fruit.

Have a good.
