Try HDF5 / PyTables / Pandas integration to speed time series I/O

One of the ideas I had back in 2012-2013 when we were developing ODM2 was to use the HDF5 file format in certain cases to improve performance, because of the benefits of HDF5:
- High performance read/write of files, especially for very large files (much faster than text formats, such as CSV, JSON or XML).
- Compressed binary format for portable files that is very space efficient, both on disk and for exchange.
- Supports data slicing of files that are bigger than memory.
- Hierarchical Data Format (HDF) can contain simple dataset structures, has self-describing metadata, and can support heterogeneous data types.
- NOTE that NetCDF uses an HDF5 container.
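
As a rough illustration of the compression and slicing points above, here is a minimal sketch using the Pandas HDF5 interface (which requires PyTables to be installed). The file name, the `ts` key, the `value` column, and the synthetic data are all made up for the example; this is not ODM2 code.

```python
import os
import tempfile

import numpy as np
import pandas as pd  # HDF5 support in pandas relies on the PyTables package

# Hypothetical one-minute sensor series (synthetic values, about 69 days long).
index = pd.date_range("2016-01-01", periods=100_000, freq="min")
values = np.random.default_rng(0).normal(size=len(index))
frame = pd.DataFrame({"value": values}, index=index)

path = os.path.join(tempfile.mkdtemp(), "timeseries.h5")

# format="table" makes the dataset queryable on disk;
# complib/complevel turn on compression for the stored data.
frame.to_hdf(path, key="ts", mode="w", format="table", complib="zlib", complevel=5)

# Read back a single day. The `where` clause is applied while reading,
# so rows outside the window are never returned to Python.
day = pd.read_hdf(
    path, key="ts", where="index >= '2016-01-10' & index < '2016-01-11'"
)
```

The same `where` pattern should apply to files that are much larger than memory, since only the selected rows are returned.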

The two use cases I had in mind were:
- **High performance web services**, to exchange or deliver ODM2 "Datasets" via specialized web services (because read/write is fast, and because the format is compact and doesn't take as much I/O bandwidth). I think a YODA file, in a tabular array format, could be put in an HDF5 container.
- **High performance database functionality**, via a hybrid of a standard RDBMS ODM2 instance that stores TimeSeriesResults in HDF5 files. At the time, it seemed that a lot of people doing nuclear physics were using similar approaches. Also, Aquatic Informatics does something like this, storing all their Time Series data in a "proprietary" binary file that their MS SQL Server database points to. Roelof Versteeg also has done this for very large datasets.
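
To make the hybrid idea a bit more concrete, here is a minimal sketch in which a relational table holds the result metadata plus a pointer to where the values live in an HDF5 file. It uses SQLite only to stay self-contained; the `results` table, its columns, and the `store_series`/`load_series` helpers are hypothetical and are not the ODM2 schema or the ODM2PythonAPI.

```python
import os
import sqlite3
import tempfile

import numpy as np
import pandas as pd  # HDF5 support in pandas relies on the PyTables package

h5_path = os.path.join(tempfile.mkdtemp(), "values.h5")

# Relational side: one metadata row per result, with a pointer into HDF5.
# Table and column names are illustrative only.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE results ("
    "result_id INTEGER PRIMARY KEY, variable TEXT, units TEXT, "
    "h5_file TEXT, h5_key TEXT)"
)


def store_series(result_id, variable, units, series):
    """Write the values to HDF5 and register their location in the database."""
    key = f"result_{result_id}"
    series.to_frame("value").to_hdf(h5_path, key=key, mode="a", format="table")
    db.execute(
        "INSERT INTO results VALUES (?, ?, ?, ?, ?)",
        (result_id, variable, units, h5_path, key),
    )


def load_series(result_id):
    """Look up the pointer in the database, then read the values from HDF5."""
    h5_file, h5_key = db.execute(
        "SELECT h5_file, h5_key FROM results WHERE result_id = ?", (result_id,)
    ).fetchone()
    return pd.read_hdf(h5_file, key=h5_key)["value"]


# Hypothetical 15-minute water temperature series.
index = pd.date_range("2016-01-01", periods=48, freq="15min")
store_series(1, "water_temperature", "degC", pd.Series(np.linspace(4.0, 6.0, 48), index=index))
temps = load_series(1)
```

In this arrangement the database stays small and queryable for metadata, while the bulk values are read directly from the binary file.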

Given that the ODM2PythonAPI uses the Pandas library, I think we could tap into HDF5 very easily for one or more of these uses. Here are a few links to information:
- http://pandas.pydata.org/pandas-docs/stable/io.html#hdf5-pytables
- http://www.pytables.org/index.html
- http://www.pytables.org/FAQ.html#is-pytables-a-replacement-for-a-relational-database
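
For a sense of how little code the Pandas/PyTables integration needs, here is a small sketch that appends batches of values to an on-disk table with `pd.HDFStore` and then streams them back in chunks. The `ts` key, the `value` column, and the batch contents are assumptions for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd  # HDF5 support in pandas relies on the PyTables package

path = os.path.join(tempfile.mkdtemp(), "store.h5")

with pd.HDFStore(path, mode="w") as store:
    # Append three hypothetical daily batches to the same on-disk table.
    for day in pd.date_range("2016-01-01", periods=3, freq="D"):
        index = pd.date_range(day, periods=1440, freq="min")
        batch = pd.DataFrame({"value": np.arange(1440, dtype="float64")}, index=index)
        store.append("ts", batch)

    # Stream the table back in fixed-size chunks instead of loading it whole.
    total_rows = 0
    running_sum = 0.0
    for chunk in store.select("ts", chunksize=1000):
        total_rows += len(chunk)
        running_sum += chunk["value"].sum()
```

Appending and chunked reads are the pieces that would matter most for long-running time series, since neither step needs the full dataset in memory.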

In writing this, I have come across some recent posts about people who are not happy with HDF5 (such as this: http://cyrille.rossant.net/moving-away-hdf5/ or https://www.rustprooflabs.com/2014/11/data-processing-in-python). I read the comments on the first of these two articles, and it sounds like most of the issues have been with improper direct use of the C library (and the lack of a JavaScript library).

People who use the Python libraries (h5py and PyTables) seem to have very positive experiences. PyTables is actively supported by ContinuumIO and NumFocus and is an important package in the SciPy ecosystem, and HDF5 is the basis of Matlab's binary file format. All of this suggests that HDF5 is still well loved and useful to many.

If we pursue the use of HDF5, we would want to use refined libraries (such as the Pandas/PyTables integration) and proceed in small steps, such as in the time-series data caching work that Jeff was just describing to me related to improving the CSV delivery of EnviroDIY datasets.
 
I'm interested in your thoughts!

cc: @horsburgh, @sreeder, @emiliom, @lsetiawan, @miguelcleon 
