Improve import time and disk space #255
…ninvent by 1 minute.
- Changed `data` column from Pickle to JSON field
- Add many useful columns to activity & exchange tables
- Normalize data: avoid duplicate information by removing the corresponding field from `data` upon save, and adding it back upon load
- Deactivate typo check on import by default
- Don't do "vacuum" if the database was empty before write()
- Use raw SQLite queries for many inserts, using executemany(): much faster than using Peewee.
- Add an option to drop extra metadata upon imports

Results for a typical import of ecoinvent 3.9:
- Total import time goes from 6 minutes to 1m30s
- process() goes from 25s to 7s
- Size of lci/databases.db goes from 1.2 Gb to 260 Mb (dropping extra metadata) or 1.0 Gb (keeping all extra metadata)
Thanks @raphaeljolivet - sorry, didn't see this notification and just stumbled across it now. These look like great changes in general, and certainly are in line with how I want the library to evolve. However, I worry that doing things as a big bang is a bit risky, and is certainly much harder to review. It feels like this can be broken into several parts and examined in detail. Let me think a bit about this and I will come back to you.
Hi Raphael — thank you for this detailed PR and for publishing We've started pulling pieces of this work into mainline in separate focused PRs. So far:
Both carry a The larger structural changes — schema denormalization with new columns, pickle → JSON with |
Thanks! That was a huge PR that I should have split into smaller parts. Have a good.
This pull request is an attempt to improve the performance of the backend `lcidb`: I just packaged and published it on PyPI. You can uninstall bw2data and install this version instead to give it a try.
I would be happy to have your feedback.
Main changes
The following changes have been made :
From `Pickle` to `JSON`
Changed the `data` column from a Pickle field to a JSON field.
The use of `pickle` makes it impossible to query the data with SQL and is prone to backward incompatibilities (the format changes with the Python version).
Instead, SQLite has great native support for JSON.
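To illustrate what a JSON column makes possible, here is a minimal sketch using only the standard `sqlite3` module. The table layout and the sample rows are assumptions made for the example, not the actual bw2data schema; it also assumes the SQLite build ships the JSON functions, which is the case for most recent builds.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified table: the real schema has more columns.
conn.execute("CREATE TABLE activity (id INTEGER PRIMARY KEY, data TEXT)")

rows = [
    {"name": "steel production", "location": "FR", "unit": "kilogram"},
    {"name": "electricity, high voltage", "location": "DE", "unit": "kilowatt hour"},
]
conn.executemany(
    "INSERT INTO activity (data) VALUES (?)",
    [(json.dumps(row),) for row in rows],
)

# With JSON stored as text, SQLite can filter on fields inside the blob,
# which is not possible with a pickled payload.
names_in_fr = [
    name
    for (name,) in conn.execute(
        "SELECT json_extract(data, '$.name') FROM activity "
        "WHERE json_extract(data, '$.location') = ?",
        ("FR",),
    )
]
print(names_in_fr)
```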
Added columns to `Activity` and `Exchange` tables
Add all useful columns to the activity & exchange tables.
Regular operations used to fetch the whole bundled `data` blob and extract the relevant information from there.
This was not efficient, especially for `process()`.
Instead, all useful information is spread into new columns: it is removed from `data` upon save and added back into `data` upon load.
This ensures maximal backward compatibility with existing code.
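The save/load round trip described above can be sketched as follows. This is only an illustration of the idea: `PROMOTED_FIELDS`, `split_for_save` and `merge_on_load` are hypothetical names, and the list of promoted fields is an assumption, not the set of columns actually added by this PR.

```python
import json

# Hypothetical list of fields that get their own column.
PROMOTED_FIELDS = ("name", "location", "unit")


def split_for_save(dataset):
    """Move promoted fields out of the dataset; return (columns, JSON blob)."""
    remaining = dict(dataset)  # do not mutate the caller's dict
    columns = {field: remaining.pop(field, None) for field in PROMOTED_FIELDS}
    return columns, json.dumps(remaining)


def merge_on_load(columns, data_json):
    """Rebuild the original dataset from the columns and the JSON blob."""
    data = json.loads(data_json)
    # Columns that were absent at save time are stored as None and skipped here,
    # so a field explicitly set to None would not survive this simplified sketch.
    data.update({k: v for k, v in columns.items() if v is not None})
    return data


original = {
    "name": "steel production",
    "location": "FR",
    "unit": "kilogram",
    "comment": "extra metadata stays in the JSON blob",
}
columns, blob = split_for_save(original)
restored = merge_on_load(columns, blob)
```

Callers see the same dictionary before and after, while the stored blob no longer duplicates what the columns already hold.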
Deactivated some checks on import
Some checks were unnecessary and slowed down bulk imports:
Added an option to drop extra metadata
An option `drop_metadata` has been added to `bw2data.config`.
If activated, extra metadata is dropped upon import, saving disk space (from 1.2 Gb to 280Mb) while everything still works fine.
We could make this option more precise and provide a list of metadata fields to keep.
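A more precise version of the option, based on a list of fields to keep, might look like the sketch below. `KEEP_FIELDS` and `drop_extra_metadata` are hypothetical names introduced for illustration, and the field list is an assumption; none of this is part of the current `bw2data.config` API.

```python
# Hypothetical allow-list of fields that must survive the import.
KEEP_FIELDS = frozenset({"name", "location", "unit", "exchanges"})


def drop_extra_metadata(dataset, keep=KEEP_FIELDS):
    """Return a copy of the dataset restricted to the fields listed in `keep`."""
    return {key: value for key, value in dataset.items() if key in keep}


dataset = {
    "name": "steel production",
    "location": "FR",
    "unit": "kilogram",
    "exchanges": [],
    "comment": "long free-text description that is not needed for computation",
}
slim = drop_extra_metadata(dataset)
```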
Improved import performance
Use raw SQLite queries for bulk inserts, using executemany(): this is much faster than going through Peewee.
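As a rough sketch of the bulk-insert pattern, the example below uses the standard `sqlite3` module directly. The `exchange` table and its columns are simplified assumptions, not the schema used in this PR.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified exchange table.
conn.execute(
    "CREATE TABLE exchange (input_code TEXT, output_code TEXT, amount REAL)"
)

exchanges = [(f"in-{i}", f"out-{i % 10}", float(i)) for i in range(1000)]

# One prepared statement, executed for every row inside a single transaction.
with conn:
    conn.executemany(
        "INSERT INTO exchange (input_code, output_code, amount) VALUES (?, ?, ?)",
        exchanges,
    )

row_count = conn.execute("SELECT COUNT(*) FROM exchange").fetchone()[0]
```

The statement is prepared once and all rows go into a single transaction, which avoids the per-row overhead of building ORM objects.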
📊 Results
Here are the results for a typical import of ecoinvent 3.9 :
🔃 Compatibility
These changes should be transparent to other packages: the content of `data` is the same after loading from the database.
I've tried:
- `lca_algebraic`: ✅ Ok
- Activity browser: LCI computation, Monte Carlo and Sankey work fine, but some columns are empty (categories, ...) in the UI.
I saw that the code in `AB` does some nasty stuff, accessing the SQLite DB directly rather than relying on the bw2data abstraction.
⌛ TODO
Automatic migration
The SQLite db doesn't contain a `user_version` yet.
This could be useful to keep track of the evolution of the database, to fail when trying to open a database with a different version,
and even to propose automatic migration.
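A version check based on SQLite's `PRAGMA user_version` could look like the sketch below. `SCHEMA_VERSION` and `check_schema_version` are hypothetical names, and the policy (stamp an unversioned database, fail on mismatch) is only one possible choice.

```python
import sqlite3

SCHEMA_VERSION = 2  # hypothetical version number of the new schema


def check_schema_version(conn, expected=SCHEMA_VERSION):
    """Stamp an unversioned database, or fail if the stored version differs."""
    (found,) = conn.execute("PRAGMA user_version").fetchone()
    if found == 0:
        # SQLite reports 0 for any database that was never stamped.
        # PRAGMA values cannot be bound as parameters, hence the int() cast.
        conn.execute(f"PRAGMA user_version = {int(expected)}")
        return expected
    if found != expected:
        raise RuntimeError(
            f"Database schema version {found} does not match expected {expected}"
        )
    return found


conn = sqlite3.connect(":memory:")
version = check_schema_version(conn)
```

An automatic migration would replace the `RuntimeError` branch with a sequence of upgrade steps from `found` to `expected`.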
Integrate full text search tables into lci database
I find it weird that data is spread across many files in many different formats (SQLite, NPY, pickle).
Ideally, most information should fit into a single homogeneous database.
Now that the relevant information is stored in separate columns, we could even save an additional 60Mb of space by having the FTS tables in the lci database and using the external content feature.
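The external content feature mentioned here can be sketched with FTS5 as follows. The `activity` table, the `activity_fts` index name and the indexed column are assumptions for the example, and it requires an SQLite build with FTS5 enabled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified activity table holding the searchable text.
conn.execute(
    "CREATE TABLE activity (id INTEGER PRIMARY KEY, name TEXT, location TEXT)"
)
conn.executemany(
    "INSERT INTO activity (name, location) VALUES (?, ?)",
    [
        ("steel production", "FR"),
        ("electricity production", "DE"),
        ("steel recycling", "DE"),
    ],
)

# External content table: the index stores no copy of the text,
# it reads it from `activity` when needed.
conn.execute(
    "CREATE VIRTUAL TABLE activity_fts "
    "USING fts5(name, content='activity', content_rowid='id')"
)
# Build the index from the current content of `activity`.
conn.execute("INSERT INTO activity_fts(activity_fts) VALUES ('rebuild')")

matches = conn.execute(
    "SELECT id, name FROM activity WHERE id IN "
    "(SELECT rowid FROM activity_fts WHERE activity_fts MATCH ?) ORDER BY id",
    ("steel",),
).fetchall()
```

With this layout the index has to be kept in sync with the content table, typically with triggers or a rebuild after bulk imports.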
Compress migrations folder
The migration folder takes 160Mb; gzipping the JSON files would reduce it to only 5Mb!
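Reading and writing gzipped JSON needs only the standard library, as in the sketch below. The helper names, the file name and the payload are made up for the example; they do not reflect the actual layout of the migrations folder.

```python
import gzip
import json
import tempfile
from pathlib import Path


def write_json_gz(path, payload):
    """Serialize `payload` as gzip-compressed JSON."""
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        json.dump(payload, fh)


def read_json_gz(path):
    """Load a gzip-compressed JSON file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return json.load(fh)


# Hypothetical, highly repetitive payload standing in for a migration file.
payload = {
    "fields": ["name", "unit"],
    "data": [[["kg", "kilogram"], {"unit": "kilogram"}]] * 1000,
}

with tempfile.TemporaryDirectory() as tmp:
    gz_path = Path(tmp) / "migration.json.gz"
    write_json_gz(gz_path, payload)
    compressed_size = gz_path.stat().st_size
    restored = read_json_gz(gz_path)

raw_size = len(json.dumps(payload).encode("utf-8"))
```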