An Infrastructure for Extendable Input/Output File Format Support in DAPHNE#993
An Infrastructure for Extendable Input/Output File Format Support in DAPHNE#993yazanandoni wants to merge 24 commits into
Conversation
…tomatically when daphne starts
pdamme
left a comment
There was a problem hiding this comment.
Thank you so much for this PR, @yazanandoni, and sorry for the delay. The contribution of this PR is twofold:
- An extensibility infrastructure for file formats (readers/writers as plug-ins)
- Concrete efficient plug-ins for CSV and Parquet
The extensibility infrastructure is highly valuable to DAPHNE as it can help us (and expert users) to easily add support for more file formats without requiring more and more dependencies for the main system. The concrete plug-ins are also highly welcome as they can improve the performance of reads/writes in DAPHNE.
The following comments refer only to the extensibility infrastructure, which I would like to merge first as it enables follow-up work on more plug-ins. The concrete plug-ins will be handled later.
The code of the proposed extensibility infrastructure is largely fine, but several improvements and corrections are required before we can merge it in.
Mandatory points:
- All files must adhere to the code style to make the CI checks pass.
- There are some smaller issues with the extensibility infrastructure, which should be fixed, e.g.:
- Don't read the plug-in priority from the JSON file, since the priority is inherently related to the registration of the plug-in, not to the plug-in itself.
FileIORegistry::getReader()/getWriter()may not take the priority into account correctly. They would prefer a low-priority reader/writer (plug-in that has already been loaded) over high-priority lazy specification (plug-in that has not been loaded yet).registerLazy()should ensure that at least one ofreaderSymbol/writerSymbolis specified.Read.h/Write.h: HDFS is handled after the plug-ins and will most likely not work anymore.- Do we really need a mutex for the access to the catalog?
- Readers and writers use the same options map, and could overwrite each other.
- The test cases need some improvements, e.g.:
- The tests cases for the extensibility infrastructure should not depend on concrete plug-ins and data files in
scripts/examples/, they should only use test files from thetest/directory. - The test output files should be
.gitignore'd. - The new test cases should focus on the extensibility infrastructure, not on concrete new extensions.
- The test cases should not use Arrow (or other third-party libraries) in order to avoid additional dependencies.
- Some test cases print debug messages, these must be silenced.
- The tests cases for the extensibility infrastructure should not depend on concrete plug-ins and data files in
- The code needs a thorough polishing pass, e.g.:
- Polishing identifiers, comments, and error messages to make them more understandable and more consistent with the rest of the code base.
- Ensure correct license headers in all files.
- Undo several unrelated changes.
- Refactoring of the directory and file structure of the new plug-ins (both for built-in readers/writers and new custom ones), e.g.:
- Currently, these plug-ins reside in
scripts/examples/. However, they are not just usage examples, but actual reusable code. Maybe a new directoryextensions/would be more appropriate for them. - Currently, the built-in plug-ins consists of a large cpp file, which seems to concatenate the code of the original built-in readers/writers from
src/runtime/local/io/. Retaining the separation into multiple files would be a clearer structure. - The
FileIOCatalogParsershould be moved from tosrc/parser/(where all other parsers reside, including those for the kernel extension catalog).
- Currently, these plug-ins reside in
- Several artifacts that we don't need on the main branch need to be removed, e.g., experiments and compiled binaries.
Performance impact. I conducted some little experiments on file read/write performance (writing randomly generated dense and sparse matrices to CSV and the DAPHNE Binary Data Format, writing randomly generated frames with and without string columns to CSV, reading these matrices and frames again, and reading sparse matrices from MatrixMarket). Consistently with what you reported in you thesis, I observed that reading via a plug-in yields only a negligible overhead compared to the built-in readers, but writing is noticeably slower when done via a plug-in (slow-down of roughly 10-20% for numerical data and 30-100% for string data). We should clearly understand (and ideally fix) these performance issues before we make plug-ins the default path for reading/writing files.
I will finalize the code in this PR and merge it in (initially only the extensibility infrastructure).
- This fix became necessary after rebasing this branch on the upstream main branch.
e614402 to
012ca81
Compare
- There seem to be some inconsistencies between my local clang-format and the one in the CI container.
…lt-in IO extensions - Initial experiments showed that using the source code of DAPHNE's CSV readers/writers through IO plug-ins can yield a significant runtime overhead (especially for writing string-heavy data). - A deeper investigation revealed that this overhead can be attributed to a subtle difference in the compiler flags used when compiling the built-in write-kernel and the IO extension. - More precisely, the write-kernel uses `-std=gnu++20`, while the extension used `-std=c++17`, which can apparently cause a significant performance difference. - This commit harmonizes the compiler flags by also using `-std=gnu++20` for the extension. - With that, the runtime overhead mentioned above vanishes.
|
Update on the performance impact: I further investigated the slow-down observed when using the readers/writers through an IO plug-in. It turned out that the root cause of the runtime overheads are slightly different compiler flags used for compiling the built-in write-kernel (original DAPHNE) and the built-in IO plug-ins shared lib (this PR). The former uses Besides that, I've also tried to reproduce the runtime overheads in a small stand-alone example without the entire DAPHNE code around, i.e., a small To sum up, to the best of my understanding, there are no noteworthy runtime overheads from using a reader/writer through an IO plug-in compared to using the same reader/writer from inside DAPHNE, as long as the plug-in was compiled with the same flags. Hence, I don't see a performance risk in making the proposed plug-in architecture the default (and the only) path for using file readers/writers in DAPHNE. |
No description provided.