Skip to content

An Infrastructure for Extendable Input/Output File Format Support in DAPHNE#993

Open
yazanandoni wants to merge 24 commits into
daphne-project:mainfrom
yazanandoni:Extendability_Up_To_Date
Open

An Infrastructure for Extendable Input/Output File Format Support in DAPHNE#993
yazanandoni wants to merge 24 commits into
daphne-project:mainfrom
yazanandoni:Extendability_Up_To_Date

Conversation

@yazanandoni

Copy link
Copy Markdown

No description provided.

@pdamme pdamme self-requested a review June 1, 2026 13:27

@pdamme pdamme left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much for this PR, @yazanandoni, and sorry for the delay. The contribution of this PR is twofold:

  1. An extensibility infrastructure for file formats (readers/writers as plug-ins)
  2. Concrete efficient plug-ins for CSV and Parquet

The extensibility infrastructure is highly valuable to DAPHNE as it can help us (and expert users) to easily add support for more file formats without requiring more and more dependencies for the main system. The concrete plug-ins are also highly welcome as they can improve the performance of reads/writes in DAPHNE.


The following comments refer only to the extensibility infrastructure, which I would like to merge first as it enables follow-up work on more plug-ins. The concrete plug-ins will be handled later.

The code of the proposed extensibility infrastructure is largely fine, but several improvements and corrections are required before we can merge it in.

Mandatory points:

  1. All files must adhere to the code style to make the CI checks pass.
  2. There are some smaller issues with the extensibility infrastructure, which should be fixed, e.g.:
    • Don't read the plug-in priority from the JSON file, since the priority is inherently related to the registration of the plug-in, not to the plug-in itself.
    • FileIORegistry::getReader()/getWriter() may not take the priority into account correctly. They would prefer a low-priority reader/writer (plug-in that has already been loaded) over high-priority lazy specification (plug-in that has not been loaded yet).
    • registerLazy() should ensure that at least one of readerSymbol/writerSymbol is specified.
    • Read.h/Write.h: HDFS is handled after the plug-ins and will most likely not work anymore.
    • Do we really need a mutex for the access to the catalog?
    • Readers and writers use the same options map, and could overwrite each other.
  3. The test cases need some improvements, e.g.:
    • The tests cases for the extensibility infrastructure should not depend on concrete plug-ins and data files in scripts/examples/, they should only use test files from the test/ directory.
    • The test output files should be .gitignore'd.
    • The new test cases should focus on the extensibility infrastructure, not on concrete new extensions.
    • The test cases should not use Arrow (or other third-party libraries) in order to avoid additional dependencies.
    • Some test cases print debug messages, these must be silenced.
  4. The code needs a thorough polishing pass, e.g.:
    • Polishing identifiers, comments, and error messages to make them more understandable and more consistent with the rest of the code base.
    • Ensure correct license headers in all files.
    • Undo several unrelated changes.
  5. Refactoring of the directory and file structure of the new plug-ins (both for built-in readers/writers and new custom ones), e.g.:
    • Currently, these plug-ins reside in scripts/examples/. However, they are not just usage examples, but actual reusable code. Maybe a new directory extensions/ would be more appropriate for them.
    • Currently, the built-in plug-ins consists of a large cpp file, which seems to concatenate the code of the original built-in readers/writers from src/runtime/local/io/. Retaining the separation into multiple files would be a clearer structure.
    • The FileIOCatalogParser should be moved from to src/parser/ (where all other parsers reside, including those for the kernel extension catalog).
  6. Several artifacts that we don't need on the main branch need to be removed, e.g., experiments and compiled binaries.

Performance impact. I conducted some little experiments on file read/write performance (writing randomly generated dense and sparse matrices to CSV and the DAPHNE Binary Data Format, writing randomly generated frames with and without string columns to CSV, reading these matrices and frames again, and reading sparse matrices from MatrixMarket). Consistently with what you reported in you thesis, I observed that reading via a plug-in yields only a negligible overhead compared to the built-in readers, but writing is noticeably slower when done via a plug-in (slow-down of roughly 10-20% for numerical data and 30-100% for string data). We should clearly understand (and ideally fix) these performance issues before we make plug-ins the default path for reading/writing files.


I will finalize the code in this PR and merge it in (initially only the extensibility infrastructure).

@pdamme pdamme force-pushed the Extendability_Up_To_Date branch from e614402 to 012ca81 Compare June 5, 2026 19:41
pdamme added 3 commits June 5, 2026 22:21
- There seem to be some inconsistencies between my local clang-format and the one in the CI container.
…lt-in IO extensions

- Initial experiments showed that using the source code of DAPHNE's CSV readers/writers through IO plug-ins can yield a significant runtime overhead (especially for writing string-heavy data).
- A deeper investigation revealed that this overhead can be attributed to a subtle difference in the compiler flags used when compiling the built-in write-kernel and the IO extension.
- More precisely, the write-kernel uses `-std=gnu++20`, while the extension used `-std=c++17`, which can apparently cause a significant performance difference.
- This commit harmonizes the compiler flags by also using `-std=gnu++20` for the extension.
- With that, the runtime overhead mentioned above vanishes.
@pdamme

pdamme commented Jun 16, 2026

Copy link
Copy Markdown
Collaborator

Update on the performance impact: I further investigated the slow-down observed when using the readers/writers through an IO plug-in. It turned out that the root cause of the runtime overheads are slightly different compiler flags used for compiling the built-in write-kernel (original DAPHNE) and the built-in IO plug-ins shared lib (this PR). The former uses -std=gnu++20 (see build/build.ninja, any of the src/runtime/local/kernels/CMakeFiles/KernelObjLib.dir/kernels_*.cpp.o targets), while the latter used -std=c++17 (see scripts/examples/extensions/builtInIO/MakeFile). Once we consistently use -std=gnu++20 for the plug-in, the runtime overhead vanishes.

Besides that, I've also tried to reproduce the runtime overheads in a small stand-alone example without the entire DAPHNE code around, i.e., a small main.cpp that runs the same piece of code (expensive writing to a text file, including string manipulations) either through a C++ function directly in the main.cpp or through a function from a shared lib (accessed through dlopen() and dlsym(), just like the extensibility infrastructure proposed in this PR does it). I could not reproduce any overheads.

To sum up, to the best of my understanding, there are no noteworthy runtime overheads from using a reader/writer through an IO plug-in compared to using the same reader/writer from inside DAPHNE, as long as the plug-in was compiled with the same flags. Hence, I don't see a performance risk in making the proposed plug-in architecture the default (and the only) path for using file readers/writers in DAPHNE.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants