An Infrastructure for Extendable Input/Output File Format Support in DAPHNE by yazanandoni · Pull Request #993 · daphne-project/daphne

yazanandoni · 2025-12-07T16:41:53Z

No description provided.

pdamme

Thank you so much for this PR, @yazanandoni, and sorry for the delay. The contribution of this PR is twofold:

An extensibility infrastructure for file formats (readers/writers as plug-ins)
Concrete efficient plug-ins for CSV and Parquet

The extensibility infrastructure is highly valuable to DAPHNE as it can help us (and expert users) to easily add support for more file formats without requiring more and more dependencies for the main system. The concrete plug-ins are also highly welcome as they can improve the performance of reads/writes in DAPHNE.

The following comments refer only to the extensibility infrastructure, which I would like to merge first as it enables follow-up work on more plug-ins. The concrete plug-ins will be handled later.

The code of the proposed extensibility infrastructure is largely fine, but several improvements and corrections are required before we can merge it in.

Mandatory points:

All files must adhere to the code style to make the CI checks pass.
There are some smaller issues with the extensibility infrastructure, which should be fixed, e.g.:
- Don't read the plug-in priority from the JSON file, since the priority is inherently related to the registration of the plug-in, not to the plug-in itself.
- FileIORegistry::getReader()/getWriter() may not take the priority into account correctly: they would prefer a low-priority reader/writer (from a plug-in that has already been loaded) over a high-priority lazy specification (for a plug-in that has not been loaded yet).
- registerLazy() should ensure that at least one of readerSymbol/writerSymbol is specified.
- Read.h/Write.h: HDFS is handled after the plug-ins and will most likely not work anymore.
- Do we really need a mutex for the access to the catalog?
- Readers and writers use the same options map, and could overwrite each other.
The test cases need some improvements, e.g.:
- The test cases for the extensibility infrastructure should not depend on concrete plug-ins or data files in scripts/examples/; they should only use test files from the test/ directory.
- The test output files should be .gitignore'd.
- The new test cases should focus on the extensibility infrastructure, not on concrete new extensions.
- The test cases should not use Arrow (or other third-party libraries) in order to avoid additional dependencies.
- Some test cases print debug messages; these must be silenced.
The code needs a thorough polishing pass, e.g.:
- Polishing identifiers, comments, and error messages to make them more understandable and more consistent with the rest of the code base.
- Ensure correct license headers in all files.
- Undo several unrelated changes.
Refactoring of the directory and file structure of the new plug-ins (both for built-in readers/writers and new custom ones), e.g.:
- Currently, these plug-ins reside in scripts/examples/. However, they are not just usage examples, but actual reusable code. Maybe a new directory extensions/ would be more appropriate for them.
- Currently, the built-in plug-ins consist of one large cpp file, which seems to concatenate the code of the original built-in readers/writers from src/runtime/local/io/. Retaining the separation into multiple files would be a clearer structure.
- The FileIOCatalogParser should be moved to src/parser/ (where all other parsers reside, including those for the kernel extension catalog).
Several artifacts that we don't need on the main branch need to be removed, e.g., experiments and compiled binaries.
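
Two of the registry points above (priority-aware lookup and the registerLazy() check) can be illustrated with a minimal sketch. It is not the code of this PR: `PluginEntry`, `SketchRegistry`, and all field names are hypothetical, no library is actually loaded, and it assumes that a larger number means a higher priority, which the discussion above does not specify.

```cpp
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical, simplified registry entry; not DAPHNE's actual data structure.
struct PluginEntry {
    std::string libPath;
    std::string readerSymbol; // empty: the plug-in offers no reader
    std::string writerSymbol; // empty: the plug-in offers no writer
    int priority;             // assumption: larger value = higher priority
    bool loaded;              // false while the entry is only a lazy specification
};

class SketchRegistry {
    std::map<std::string, std::vector<PluginEntry>> entriesByExt;

  public:
    // Registers a lazy specification; rejects entries that offer neither a reader nor a writer.
    void registerLazy(const std::string &ext, const std::string &libPath, const std::string &readerSymbol,
                      const std::string &writerSymbol, int priority) {
        if (readerSymbol.empty() && writerSymbol.empty())
            throw std::invalid_argument("plug-in for '" + ext + "' specifies neither a reader nor a writer symbol");
        entriesByExt[ext].push_back(PluginEntry{libPath, readerSymbol, writerSymbol, priority, false});
    }

    // Selects the highest-priority entry that offers a reader, regardless of
    // whether it has been loaded yet, and only then marks it as loaded.
    PluginEntry getReader(const std::string &ext) {
        PluginEntry *best = nullptr;
        auto it = entriesByExt.find(ext);
        if (it != entriesByExt.end())
            for (PluginEntry &e : it->second)
                if (!e.readerSymbol.empty() && (!best || e.priority > best->priority))
                    best = &e;
        if (!best)
            throw std::out_of_range("no reader registered for '" + ext + "'");
        best->loaded = true; // a real implementation would load the shared library here
        return *best;
    }
};
```

The key design choice in this sketch is that the loaded/lazy state plays no role in the selection; it only decides whether loading work remains after the winner has been chosen.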

Performance impact. I conducted some small experiments on file read/write performance (writing randomly generated dense and sparse matrices to CSV and the DAPHNE Binary Data Format, writing randomly generated frames with and without string columns to CSV, reading these matrices and frames again, and reading sparse matrices from MatrixMarket). Consistent with what you reported in your thesis, I observed that reading via a plug-in incurs only a negligible overhead compared to the built-in readers, but writing is noticeably slower when done via a plug-in (slow-down of roughly 10-20% for numerical data and 30-100% for string data). We should clearly understand (and ideally fix) these performance issues before we make plug-ins the default path for reading/writing files.

I will finalize the code in this PR and merge it in (initially only the extensibility infrastructure).

…lt-in IO extensions

- Initial experiments showed that using the source code of DAPHNE's CSV readers/writers through IO plug-ins can yield a significant runtime overhead (especially for writing string-heavy data).
- A deeper investigation revealed that this overhead can be attributed to a subtle difference in the compiler flags used when compiling the built-in write-kernel and the IO extension.
- More precisely, the write-kernel uses `-std=gnu++20`, while the extension used `-std=c++17`, which can apparently cause a significant performance difference.
- This commit harmonizes the compiler flags by also using `-std=gnu++20` for the extension.
- With that, the runtime overhead mentioned above vanishes.

pdamme · 2026-06-16T13:50:54Z

Update on the performance impact: I further investigated the slow-down observed when using the readers/writers through an IO plug-in. It turned out that the root cause of the runtime overhead is a slight difference in the compiler flags used for compiling the built-in write-kernel (original DAPHNE) and the shared library of the built-in IO plug-ins (this PR). The former uses -std=gnu++20 (see build/build.ninja, any of the src/runtime/local/kernels/CMakeFiles/KernelObjLib.dir/kernels_*.cpp.o targets), while the latter used -std=c++17 (see scripts/examples/extensions/builtInIO/MakeFile). Once we consistently use -std=gnu++20 for the plug-in, the runtime overhead vanishes.
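
A quick way to check which language mode a translation unit was actually compiled in is to inspect the predefined macros. The following is a small sketch, not part of this PR; `languageMode` is a hypothetical helper name. It relies on GCC and Clang defining `__STRICT_ANSI__` for strict modes such as -std=c++17 and leaving it undefined for GNU modes such as -std=gnu++20.

```cpp
#include <string>

// Describes the language mode of the translation unit this function is compiled in.
// Hypothetical helper for illustration; compile it into both the kernel library
// and the plug-in to compare the results.
std::string languageMode() {
    // __cplusplus is 201703L for C++17 and 202002L for C++20.
    std::string mode = "__cplusplus=" + std::to_string(__cplusplus);
#ifdef __STRICT_ANSI__
    mode += " (strict ISO mode, as with -std=c++NN)";
#else
    mode += " (GNU extensions enabled, as with -std=gnu++NN)";
#endif
    return mode;
}
```

Printing the result of this function from both sides would make a flag mismatch like the one described above visible without reading the build files.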

Besides that, I've also tried to reproduce the runtime overheads in a small stand-alone example without the entire DAPHNE code around, i.e., a small main.cpp that runs the same piece of code (expensive writing to a text file, including string manipulations) either through a C++ function directly in the main.cpp or through a function from a shared lib (accessed through dlopen() and dlsym(), just like the extensibility infrastructure proposed in this PR does). In this setup, I could not reproduce any overhead.
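
The dlopen()/dlsym() call path mentioned above can be sketched as follows. This is not the stand-alone experiment itself: `callViaDlsym` and `ParseFn` are hypothetical names, and instead of a separately built shared library the sketch resolves a symbol from the libraries already loaded into the process (by passing nullptr to dlopen()), which assumes a POSIX system and a dynamically linked C library.

```cpp
#include <dlfcn.h> // POSIX: dlopen, dlsym, dlclose
#include <optional>

// Signature assumed for the looked-up function (hypothetical, for illustration only).
using ParseFn = int (*)(const char *);

// Resolves `symbol` at run time and calls it with `arg`.
// Returns std::nullopt if the handle or the symbol cannot be obtained.
std::optional<int> callViaDlsym(const char *symbol, const char *arg) {
    // nullptr yields a handle for the running program and the libraries loaded
    // at start-up; a plug-in loader would pass the path of the shared library.
    void *handle = dlopen(nullptr, RTLD_NOW);
    if (!handle)
        return std::nullopt;
    void *sym = dlsym(handle, symbol);
    if (!sym) {
        dlclose(handle);
        return std::nullopt;
    }
    // Converting the void* from dlsym() to a function pointer is the usual POSIX idiom.
    auto fn = reinterpret_cast<ParseFn>(sym);
    int result = fn(arg);
    dlclose(handle);
    return result;
}
```

For example, `callViaDlsym("atoi", "42")` goes through the dynamically resolved pointer, whereas `std::atoi("42")` is the direct call. After the one-time lookup, both paths execute the same function, so the remaining difference is the indirect call through the pointer. On older glibc versions, linking may additionally require -ldl.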

To sum up, to the best of my understanding, there are no noteworthy runtime overheads from using a reader/writer through an IO plug-in compared to using the same reader/writer from inside DAPHNE, as long as the plug-in was compiled with the same flags. Hence, I don't see a performance risk in making the proposed plug-in architecture the default (and the only) path for using file readers/writers in DAPHNE.

pdamme self-requested a review June 1, 2026 13:27

yazanandoni added 19 commits June 5, 2026 13:51

27.05

bd33864

added the option to put custom options as a frame

fa172c7

before changing read

470e903

added extendability argument in the command line

f187858

use of one registry by utilizing the daphne context

f86fb05

added all built-in readers as a plug in

6b88dc0

added benchmarks and created a clear function for the Registry

5e1cd5a

changes and fixes

418cd23

added writer support and changed write.h

2244404

added library loading when plug-in is used and changed the write kernel

9d49d9a

separated the benchmarks and changed the csv plugin

823e5c1

fixed some bugs

194e846

engine and priority support

ad49432

removed built-ins and made them only as a plug-in that gets loaded au…

8092e07

…tomatically when daphne starts

added the script tests

736b028

adjustments

ce001fa

added eval_runner and cleaned some code

602bc6d

final commit

6c7d094

final commit

e406f21

pdamme requested changes Jun 5, 2026

pdamme added 2 commits June 5, 2026 21:00

fix: parsing DaphneDSL read/write built-ins with optional extra options.

911e8c0

- This fix became necessary after rebasing this branch on the upstream main branch.

style: applied clang-format to all h/cpp files touched by this PR

012ca81

pdamme force-pushed the Extendability_Up_To_Date branch from e614402 to 012ca81 June 5, 2026 19:41

pdamme added 3 commits June 5, 2026 22:21

fix: removed unwanted debug output

cd75afa

style: make CI code style checks pass

299f718

- There seem to be some inconsistencies between my local clang-format and the one in the CI container.

ShreyasGS mentioned this pull request Jun 21, 2026

feat(io): ORC reader for DenseMatrix and Frame (initial prototype, refs #985) #1006

Open

2 tasks

yazanandoni wants to merge 24 commits into daphne-project:main from yazanandoni:Extendability_Up_To_Date
