Issue: cooltools (and, potentially, pairtools) generate a variety of computational "artifacts", i.e. derivatives of primary datasets such as P(s), compartments, saddleplots, insulation scores, etc. Currently, we lack a consistent way to name and store these datasets and their metadata on disk. This results in messy and inconsistent project folders, missing metadata, and hours wasted on ad-hoc code that matches artifacts with their primary datasets. The lack of a consistent naming scheme also hinders further development of reporting scripts.
Proposal: come up with (a) a storage format and (b) a naming schema that would automate storage, discovery, and access to computational artifacts.
Potential solutions.
(A) File format. We need a container that can store computational artifacts of various kinds (tables, text, binary arrays, etc.) and provide random access and append/rewrite functionality. Potential solutions:
- a folder
- an hdf5 file
- a database
- a zero-compression zip file
- ...
My personal favorite is a zero-compression (aka STORE) zip file. It is a widely supported format (MS Office formats are zip files!) and can be accessed from the command line, Python, and R. Like a folder with files, it offers random access and append/rewrite functionality, but it also has the advantage of being easily transferable between machines (admittedly, this is not a very strong advantage).
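To make the zip option concrete, here is a minimal sketch using Python's standard `zipfile` module in STORE mode. The helper names (`write_artifact`, `read_artifact`, `list_artifacts`) and the member names are hypothetical, chosen only for illustration; nothing like this exists in cooltools yet.

```python
import os
import tempfile
import zipfile


def write_artifact(container, name, data):
    """Append one artifact (bytes) to a zero-compression zip container."""
    # Mode "a" creates the archive if it does not exist, otherwise appends.
    with zipfile.ZipFile(container, mode="a", compression=zipfile.ZIP_STORED) as zf:
        zf.writestr(name, data)


def read_artifact(container, name):
    """Read a single member without unpacking the rest of the archive."""
    with zipfile.ZipFile(container, mode="r") as zf:
        return zf.read(name)


def list_artifacts(container):
    """Return member names in the order they were written."""
    with zipfile.ZipFile(container, mode="r") as zf:
        return zf.namelist()


# Demo in a temporary directory; the file and member names are made up.
with tempfile.TemporaryDirectory() as tmp:
    container = os.path.join(tmp, "WT.1000.mcool.arts")
    write_artifact(container, "insulation/scores.tsv", b"chrom\tstart\tend\tscore\n")
    write_artifact(container, "notes.txt", b"free-form text")
    names = list_artifacts(container)
    scores = read_artifact(container, "insulation/scores.tsv")
    with zipfile.ZipFile(container) as zf:
        stored = all(i.compress_type == zipfile.ZIP_STORED for i in zf.infolist())
```

One caveat: Python's `zipfile` can append members, but it does not overwrite an existing member in place. Writing the same name twice adds a second entry, so a true "rewrite" would mean rebuilding the archive.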
HDF5 can serve as a key-value store as well. The downsides are that it treats all datasets as arrays, doesn't work well with NFS (according to @mimakaev), and requires special CLI tools/libraries to manipulate.
Various databases/key-value stores are another alternative, but it's not clear to me why they would serve better for
(B) Schema.
My initial proposal is that, for each primary dataset, we would create a file or folder with a name derived from the filename of the dataset. E.g., if the primary dataset is called 'WT.1000.mcool', the artifact file/folder would be called 'WT.1000.mcool.arts' or something like that. The most important point is probably that there should be a single, well-defined procedure that maps a primary dataset to its artifact file/folder and vice versa.
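As a sketch of what such a two-way procedure could look like, assuming the '.arts' suffix from the example above (the suffix and the function names `artifact_for` and `dataset_for` are placeholders, not a settled convention):

```python
from pathlib import Path

# Hypothetical suffix, taken from the 'WT.1000.mcool.arts' example.
ARTIFACT_SUFFIX = ".arts"


def artifact_for(dataset):
    """Map a primary dataset path to its artifact container path."""
    p = Path(dataset)
    return p.with_name(p.name + ARTIFACT_SUFFIX)


def dataset_for(artifact):
    """Map an artifact container path back to its primary dataset path."""
    p = Path(artifact)
    if not p.name.endswith(ARTIFACT_SUFFIX):
        raise ValueError(f"not an artifact container: {p}")
    return p.with_name(p.name[: -len(ARTIFACT_SUFFIX)])


arts = artifact_for("data/WT.1000.mcool")
primary = dataset_for(arts)
```

Appending the suffix to the full filename, rather than replacing the extension, keeps the mapping invertible and avoids collisions between, say, 'WT.1000.mcool' and 'WT.1000.cool'.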
Then, inside the artifact container, each computational tool would claim its own folder, presumably named after the tool itself. The structure of the files inside that folder would be left up to the tool's creators. We could, however, suggest a default schema that would standardize metadata storage and fields.
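One possible default layout, sketched under the assumption that the container is a STORE-mode zip: each tool writes its files under '<tool>/' and a 'metadata.json' next to them. The tool names, the metadata filename, and the metadata fields below are all hypothetical.

```python
import json
import os
import tempfile
import zipfile


def save_tool_output(container, tool, files, metadata):
    """Write a tool's files plus a metadata.json under '<tool>/'."""
    with zipfile.ZipFile(container, mode="a", compression=zipfile.ZIP_STORED) as zf:
        for name, data in files.items():
            zf.writestr(f"{tool}/{name}", data)
        zf.writestr(f"{tool}/metadata.json", json.dumps(metadata, indent=2))


def list_tools(container):
    """Discover which tools have written into the container."""
    with zipfile.ZipFile(container, mode="r") as zf:
        return sorted({n.split("/", 1)[0] for n in zf.namelist() if "/" in n})


def load_metadata(container, tool):
    """Read back the metadata a tool stored alongside its output."""
    with zipfile.ZipFile(container, mode="r") as zf:
        return json.loads(zf.read(f"{tool}/metadata.json"))


# Demo with made-up tools and parameters.
with tempfile.TemporaryDirectory() as tmp:
    container = os.path.join(tmp, "WT.1000.mcool.arts")
    save_tool_output(
        container,
        "insulation",
        {"scores.tsv": b"chrom\tstart\tend\tscore\n"},
        {"source": "WT.1000.mcool", "params": {"window": 10000}},
    )
    save_tool_output(
        container,
        "expected",
        {"cis.tsv": b"chrom\tdiag\tvalue\n"},
        {"source": "WT.1000.mcool", "params": {}},
    )
    tools = list_tools(container)
    meta = load_metadata(container, "insulation")
```

With this layout, a reporting script could discover available artifacts by listing top-level folders and reading each 'metadata.json', without knowing anything else about a tool's internal file structure.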
Ideas/suggestions?..
This issue generalizes #38.