Carquet

A fast, pure C library for reading and writing Apache Parquet files.


Highlights

  • Pure C11 with three external dependencies (zstd, zlib, lz4) -- all auto-fetched by CMake
  • ~200KB binary vs ~50MB+ for Arrow
  • Built-in CLI for file inspection (schema, info, head, tail, stat, ...) and C code generation (codegen)
  • 70x faster reads than Arrow C++ on uncompressed data (mmap zero-copy), 150x faster than PyArrow
  • 1.2-2.6x faster compressed reads than Arrow C++ on the same file (cross-read benchmark)
  • Writes 1.0-2.3x faster than Arrow C++ across codecs and platforms
  • Reads 10M uncompressed rows in 0.25ms (mmap zero-copy on Apple M3)
  • Full Parquet spec: all types, encodings, compression codecs, nested schemas, bloom filters, page indexes
  • SIMD-optimized (SSE4.2, AVX2, AVX-512, NEON, SVE) with runtime detection and scalar fallbacks
  • PyArrow, DuckDB, Spark compatible out of the box

Performance

Carquet vs Arrow C++ 23.0.1 at 10M rows (the most representative size). Higher ratio = Carquet faster.

          x86 (Xeon D-1531)     ARM (Apple M3)
Codec     Write     Read        Write     Read
snappy    1.55x     1.25x       1.10x     1.53x
zstd      1.31x     1.04x       1.37x     1.28x
lz4       1.02x     0.83x       1.25x     0.96x
none      1.13x     40.6x*      1.33x     70.4x*

* Uncompressed reads use mmap zero-copy -- see note below.

Compressed reads involve full decompression and decoding of every value, with no shortcuts -- and both libraries use the same system lz4/zstd shared libraries, so raw codec speed is identical. The most meaningful comparison is the same-file cross-read table below, where both libraries read the exact same Parquet file: on that apples-to-apples test, Carquet reads compressed data 1.2-2.6x faster than Arrow C++.

Benchmark methodology

All benchmarks use identical data (deterministic LCG PRNG), identical Parquet settings (no dictionary, BYTE_STREAM_SPLIT for floats, page checksums, mmap reads), trimmed median of 11-51 iterations, with OS page cache purged between write and read phases and cooldown between configurations. Schema: 3 columns (INT64, DOUBLE, INT32). Compared against Arrow C++ 23.0.1 low-level Parquet reader (bypassing Arrow Table materialization) and PyArrow 23.0.1.

The same-file cross-read benchmark is the fairest comparison: both libraries read the exact same Parquet file (written by one, read by both). This eliminates differences in page sizes, encoding choices, and row group layout.

Uncompressed reads marked with * use Carquet's mmap zero-copy path: for PLAIN-encoded, uncompressed, fixed-size, required columns, the batch reader returns pointers directly into the memory-mapped file with no memcpy. Arrow always materializes into its own buffers. The compressed read numbers are the most representative measure of end-to-end read throughput.

Full x86 results (Intel Xeon D-1531, Linux)

12 threads @ 2.2GHz, 32GB RAM, Ubuntu 24.04 -- ZSTD level 1

10M rows vs Arrow C++

Codec    Carquet Write   Arrow C++ Write   W ratio   Carquet Read   Arrow C++ Read   R ratio   Size
none     1557ms          1766ms            1.13x     1.25ms         50.8ms           40.6x*    190.7MB
snappy   1002ms          1549ms            1.55x     78ms           97.8ms           1.25x     125.1MB
zstd     1311ms          1714ms            1.31x     76.8ms         80.2ms           1.04x     95.3MB
lz4      1521ms          1554ms            1.02x     59.1ms         49.0ms           0.83x     122.9MB

1M rows vs Arrow C++

Codec    Carquet Write   Arrow C++ Write   W ratio   Carquet Read   Arrow C++ Read   R ratio
none     180ms           196ms             1.09x     0.22ms         6.2ms            28x*
snappy   141ms           148ms             1.05x     8.1ms          11.6ms           1.44x
zstd     131ms           185ms             1.41x     10.3ms         9.1ms            0.88x
lz4      143ms           149ms             1.04x     8.5ms          6.1ms            0.72x

100K rows vs Arrow C++

Codec    Carquet Write   Arrow C++ Write   W ratio   Carquet Read   Arrow C++ Read   R ratio
none     14.1ms          18.4ms            1.30x     0.11ms         2.18ms           19.8x*
snappy   10.1ms          10.6ms            1.05x     1.27ms         5.97ms           4.70x
zstd     8.7ms           14.1ms            1.62x     1.58ms         3.88ms           2.46x
lz4      9.6ms           11.0ms            1.14x     0.77ms         2.78ms           3.61x

Same-file cross-read (10M rows)

Both libraries read the same Parquet file -- the fairest apples-to-apples comparison.

Codec    Writer    Carquet Read   Arrow C++ Read   Ratio
none     Carquet   0.99ms         73.6ms           74x*
none     Arrow     7.6ms          51.2ms           6.8x*
snappy   Carquet   41.0ms         107ms            2.61x
snappy   Arrow     43.4ms         101ms            2.33x
zstd     Carquet   46.1ms         88.4ms           1.92x
zstd     Arrow     49.1ms         79.5ms           1.62x
lz4      Carquet   34.8ms         74.8ms           2.15x
lz4      Arrow     27.4ms         52.0ms           1.90x

10M rows vs PyArrow

Codec    Carquet Write   PyArrow Write   W ratio   Carquet Read   PyArrow Read   R ratio
none     1557ms          1806ms          1.16x     1.25ms         213ms          170x*
snappy   1002ms          1649ms          1.65x     78ms           384ms          4.91x
zstd     1311ms          1796ms          1.37x     76.8ms         369ms          4.81x
lz4      1521ms          1676ms          1.10x     59.1ms         281ms          4.76x

* Zero-copy mmap path

Full ARM results (Apple M3, macOS)

MacBook Air M3, 16GB RAM, macOS 26.2, Arrow C++ 23.0.1, PyArrow 23.0.1 -- ZSTD level 1

10M rows vs Arrow C++

Codec    Carquet Write   Arrow C++ Write   W ratio   Carquet Read   Arrow C++ Read   R ratio   Size
none     99.4ms          131.9ms           1.33x     0.25ms         17.59ms          70.4x*    190.7MB
snappy   231.0ms         253.1ms           1.10x     16.15ms        24.75ms          1.53x     125.1MB
zstd     253.3ms         347.5ms           1.37x     22.91ms        29.38ms          1.28x     95.3MB
lz4      198.3ms         248.8ms           1.25x     18.90ms        18.05ms          0.96x     122.9MB

1M rows vs Arrow C++

Codec    Carquet Write   Arrow C++ Write   W ratio   Carquet Read   Arrow C++ Read   R ratio
none     7.57ms          12.91ms           1.71x     0.05ms         1.77ms           35.4x*
snappy   13.43ms         24.50ms           1.82x     1.52ms         2.55ms           1.68x
zstd     15.05ms         34.12ms           2.27x     2.29ms         3.06ms           1.34x
lz4      13.09ms         25.11ms           1.92x     1.03ms         1.74ms           1.69x

100K rows vs Arrow C++

Codec    Carquet Write   Arrow C++ Write   W ratio   Carquet Read   Arrow C++ Read   R ratio
none     1.13ms          1.56ms            1.38x     0.02ms         0.23ms           11.5x*
snappy   1.64ms          2.50ms            1.52x     0.37ms         0.90ms           2.43x
zstd     1.69ms          3.52ms            2.08x     0.64ms         1.31ms           2.05x
lz4      1.58ms          2.49ms            1.58x     0.25ms         0.57ms           2.28x

Same-file cross-read (10M rows)

Both libraries read the same Parquet file -- the fairest apples-to-apples comparison.

Codec    Writer    Carquet Read   Arrow C++ Read   Ratio
none     Carquet   0.36ms         18.33ms          50.9x*
none     Arrow     1.01ms         17.60ms          17.4x*
snappy   Carquet   20.54ms        24.52ms          1.19x
snappy   Arrow     14.91ms        23.65ms          1.59x
zstd     Carquet   23.11ms        34.71ms          1.50x
zstd     Arrow     22.03ms        29.87ms          1.36x
lz4      Carquet   10.96ms        18.54ms          1.69x
lz4      Arrow     10.54ms        17.43ms          1.65x

10M rows vs PyArrow

Codec    Carquet Write   PyArrow Write   W ratio   Carquet Read   PyArrow Read   R ratio
none     99.4ms          193.4ms         1.95x     0.25ms         37.64ms        150.6x*
snappy   231.0ms         306.3ms         1.33x     16.15ms        48.01ms        2.97x
zstd     253.3ms         405.7ms         1.60x     22.91ms        61.63ms        2.69x
lz4      198.3ms         309.4ms         1.56x     18.90ms        40.09ms        2.12x

1M rows vs PyArrow

Codec    Carquet Write   PyArrow Write   W ratio   Carquet Read   PyArrow Read   R ratio
none     7.57ms          18.41ms         2.43x     0.05ms         2.63ms         52.6x*
snappy   13.43ms         30.73ms         2.29x     1.52ms         3.65ms         2.40x
zstd     15.05ms         39.84ms         2.65x     2.29ms         4.43ms         1.93x
lz4      13.09ms         30.27ms         2.31x     1.03ms         3.10ms         3.01x

100K rows vs PyArrow

Codec    Carquet Write   PyArrow Write   W ratio   Carquet Read   PyArrow Read   R ratio
none     1.13ms          1.95ms          1.73x     0.02ms         0.23ms         11.5x*
snappy   1.64ms          2.98ms          1.82x     0.37ms         0.59ms         1.59x
zstd     1.69ms          4.15ms          2.46x     0.64ms         0.81ms         1.27x
lz4      1.58ms          3.05ms          1.93x     0.25ms         0.40ms         1.60x

* Zero-copy mmap path

Building

Requirements

  • C11 compiler (GCC 4.9+, Clang 3.4+, MSVC 2015+)
  • CMake 3.16+
  • zstd, zlib, lz4 (auto-fetched if missing)
  • OpenMP (optional, for parallel column reading)

Quick Start

git clone https://github.com/Vitruves/carquet.git
cd carquet
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

Build Options

Option                 Default   Description
CARQUET_BUILD_DEV      OFF       Build everything (tests, examples, benchmarks)
CARQUET_BUILD_TESTS    OFF       Build test suite only
CARQUET_BUILD_CLI      ON        Build carquet CLI tool
CARQUET_BUILD_SHARED   OFF       Build shared library instead of static
CARQUET_NATIVE_ARCH    OFF       -march=native for max performance
CARQUET_ENABLE_SVE     OFF       ARM SVE (experimental)

All x86 SIMD (SSE, AVX, AVX2, AVX-512) and ARM NEON are auto-detected and enabled by default.

All build options

Option                              Default   Description
CARQUET_BUILD_EXAMPLES              OFF       Build example programs
CARQUET_BUILD_BENCHMARKS            OFF       Build benchmark and profiling programs
CARQUET_BUILD_ARROW_CPP_BENCHMARK   OFF       Optional Arrow C++ comparison benchmark
CARQUET_BUILD_INTEROP               OFF       Build interoperability tests
CARQUET_BUILD_FUZZ                  OFF       Build fuzz targets
CARQUET_ENABLE_SSE                  ON        SSE optimizations (x86, auto-detected)
CARQUET_ENABLE_AVX                  ON        AVX optimizations (x86, auto-detected)
CARQUET_ENABLE_AVX2                 ON        AVX2 optimizations (x86, auto-detected)
CARQUET_ENABLE_AVX512               ON        AVX-512 optimizations (x86, auto-detected)
CARQUET_ENABLE_NEON                 ON        NEON optimizations (ARM, auto-detected)

Installation

cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)
sudo cmake --install build

This installs:

  • libcarquet.a (or .so / .dylib with -DCARQUET_BUILD_SHARED=ON)
  • include/carquet/ headers
  • carquet CLI binary

After installation, link your project with -lcarquet.

The installed CLI can also inspect files and generate a standalone file reader for you:

carquet info data.parquet
carquet codegen -f data.parquet -o reader.c

Development Build

cmake -B build -DCARQUET_BUILD_DEV=ON
cmake --build build -j$(nproc)
cd build && ctest --output-on-failure

CLI Tool

Carquet ships with a command-line tool for inspecting Parquet files and generating C reader code. Built and installed by default alongside the library.

Commands:
  schema     Print file schema
  info       Print detailed file metadata
  head       Print first N rows
  tail       Print last N rows
  count      Print total row count
  columns    List column names (one per line)
  stat       Print column statistics
  validate   Verify file integrity
  sample     Print N random rows
  codegen    Generate C reader code

carquet schema data.parquet
carquet head -n 20 data.parquet
carquet stat data.parquet
carquet validate data.parquet

Code Generation

Generate a complete, compilable C reader from any Parquet file's schema:

carquet codegen -f data.parquet -o reader.c
# Generated: reader.c
# Compile:   clang -o reader reader.c -I.../include -L.../build -lcarquet ...

./reader                    # reads data.parquet (embedded as default)
./reader other.parquet      # override with different file

Options:

Flag                 Description
-f, --file FILE      Parquet file to inspect schema from
-o, --output FILE    Output source file (default: stdout)
--mmap               Use memory-mapped I/O in generated code
--skeleton           Generate empty process_batch for custom logic
-c, --columns COLS   Comma-separated column filter
-b, --batch-size N   Batch size (default: 1024)

C API

Manual

The top-level README is intentionally short; for day-to-day usage, prefer the versioned manual in docs/.

Write a Parquet File

#include <carquet/carquet.h>

int main(void) {
    carquet_error_t err = CARQUET_ERROR_INIT;

    // Define schema
    carquet_schema_t* schema = carquet_schema_create(&err);
    carquet_schema_add_column(schema, "id",    CARQUET_PHYSICAL_INT64,  NULL, CARQUET_REPETITION_REQUIRED, 0, 0);
    carquet_schema_add_column(schema, "value", CARQUET_PHYSICAL_DOUBLE, NULL, CARQUET_REPETITION_REQUIRED, 0, 0);

    // Configure writer
    carquet_writer_options_t opts;
    carquet_writer_options_init(&opts);
    opts.compression = CARQUET_COMPRESSION_ZSTD;

    // Write
    carquet_writer_t* w = carquet_writer_create("output.parquet", schema, &opts, &err);

    int64_t ids[]    = {1, 2, 3, 4, 5};
    double values[]  = {1.1, 2.2, 3.3, 4.4, 5.5};
    carquet_writer_write_batch(w, 0, ids, 5, NULL, NULL);
    carquet_writer_write_batch(w, 1, values, 5, NULL, NULL);
    carquet_writer_close(w);

    carquet_schema_free(schema);
    return 0;
}

Read a Parquet File

#include <carquet/carquet.h>
#include <stdio.h>

int main(void) {
    carquet_error_t err = CARQUET_ERROR_INIT;

    // Open with mmap for best read performance
    carquet_reader_options_t opts;
    carquet_reader_options_init(&opts);
    opts.use_mmap = true;

    carquet_reader_t* r = carquet_reader_open("output.parquet", &opts, &err);
    if (!r) { printf("Error: %s\n", err.message); return 1; }

    printf("Rows: %lld, Columns: %d\n",
           (long long)carquet_reader_num_rows(r),
           carquet_reader_num_columns(r));

    // Batch reader for efficient iteration
    carquet_batch_reader_config_t cfg;
    carquet_batch_reader_config_init(&cfg);
    cfg.batch_size = 65536;

    carquet_batch_reader_t* br = carquet_batch_reader_create(r, &cfg, &err);
    carquet_row_batch_t* batch = NULL;

    while (carquet_batch_reader_next(br, &batch) == CARQUET_OK && batch) {
        const void* data;
        const uint8_t* nulls;
        int64_t n;
        carquet_row_batch_column(batch, 0, &data, &nulls, &n);
        const int64_t* ids = (const int64_t*)data;
        // process ids[0..n-1] ...
        carquet_row_batch_free(batch);
        batch = NULL;
    }

    carquet_batch_reader_free(br);
    carquet_reader_close(r);
    return 0;
}

Nullable Columns

// Schema with nullable column
carquet_schema_add_column(schema, "name", CARQUET_PHYSICAL_BYTE_ARRAY,
                          NULL, CARQUET_REPETITION_OPTIONAL, 0, 0);

// Write with definition levels (1 = present, 0 = null)
carquet_byte_array_t names[] = {{(uint8_t*)"Alice", 5}, {(uint8_t*)"Bob", 3}};
int16_t def_levels[] = {1, 0, 1};  // Alice, NULL, Bob (3 rows, 2 values)
carquet_writer_write_batch(writer, col, names, 3, def_levels, NULL);

Nested Types (Lists, Maps)

// list<int32>
int32_t list_leaf = carquet_schema_add_list(
    schema, "tags", CARQUET_PHYSICAL_INT32, NULL,
    CARQUET_REPETITION_OPTIONAL, 0, 0);

// map<string, int32>
int32_t map_val = carquet_schema_add_map(
    schema, "props",
    CARQUET_PHYSICAL_BYTE_ARRAY, NULL, 0,   // key: string
    CARQUET_PHYSICAL_INT32, NULL, 0,         // value: int32
    CARQUET_REPETITION_OPTIONAL, 0);

// Write list data: row0=[100,200], row1=NULL, row2=[300]
int32_t vals[] = {100, 200, 300};
int16_t def[]  = {  3,   3,   0,   3};
int16_t rep[]  = {  0,   1,   0,   0};
carquet_writer_write_batch(writer, col, vals, 4, def, rep);

Column Projection

carquet_batch_reader_config_t cfg;
carquet_batch_reader_config_init(&cfg);

// Read only specific columns
const char* names[] = {"id", "timestamp"};
cfg.column_names = names;
cfg.num_column_names = 2;

Predicate Pushdown

Skip entire row groups that cannot match a query, based on column statistics:

// Filter callback: only read row groups where column 0 might have values > threshold
bool filter_fn(const carquet_reader_t* reader, int32_t rg, void* ctx) {
    int64_t threshold = *(int64_t*)ctx;
    bool might_match = true;
    carquet_reader_row_group_matches(reader, rg, 0,
        CARQUET_COMPARE_GT, &threshold, sizeof(threshold), &might_match);
    return might_match;
}

int64_t threshold = 1000;
cfg.row_group_filter = filter_fn;
cfg.row_group_filter_ctx = &threshold;
// Non-matching row groups are skipped with zero I/O

I/O Coalescing

Pre-buffer multiple columns in a single read (reduces seeks on the fread path; a no-op for mmap):

int32_t cols[] = {0, 2, 5};
carquet_reader_prebuffer(reader, 0, cols, 3, &err);
// Subsequent column reads from row group 0 use the cached data

Compression

Codec    Enum                          Best For
ZSTD     CARQUET_COMPRESSION_ZSTD      Best overall (great ratio + speed)
LZ4      CARQUET_COMPRESSION_LZ4_RAW   Read-heavy workloads (fastest decompression)
Snappy   CARQUET_COMPRESSION_SNAPPY    Wide compatibility
GZIP     CARQUET_COMPRESSION_GZIP      Maximum compatibility with older tools

opts.compression = CARQUET_COMPRESSION_ZSTD;
opts.compression_level = 1;  // 0 = codec default; ZSTD: 1-22, GZIP: 1-9

Writer Options

carquet_writer_options_t opts;
carquet_writer_options_init(&opts);
opts.compression        = CARQUET_COMPRESSION_ZSTD;
opts.row_group_size     = 128 * 1024 * 1024;  // 128 MB row groups
opts.write_statistics   = true;                // min/max for predicate pushdown
opts.write_crc          = true;                // CRC32 page verification
opts.write_bloom_filters = true;               // bloom filters per column
opts.write_page_index   = true;                // column/offset page indexes

Error Handling

carquet_error_t err = CARQUET_ERROR_INIT;
carquet_reader_t* r = carquet_reader_open("data.parquet", NULL, &err);
if (!r) {
    printf("[%s] %s\n", carquet_status_name(err.code), err.message);
    printf("Hint: %s\n", carquet_error_recovery_hint(err.code));
    return 1;
}

All functions return carquet_status_t or use carquet_error_t* out-parameters. Programming errors (NULL where a valid pointer is required) trigger assertions; runtime errors (bad files, OOM) return error codes.

Interoperability

Carquet files are fully compatible with PyArrow, DuckDB, Spark, and any Parquet reader:

import pyarrow.parquet as pq
table = pq.read_table("carquet_output.parquet")  # just works

-- DuckDB
SELECT * FROM read_parquet('carquet_output.parquet');

Bidirectional interop testing:

cmake -B build -DCARQUET_BUILD_INTEROP=ON && cmake --build build
python3 interop/run_interop.py

Parquet Feature Support

Feature                  Status
Physical types           All 8 (BOOLEAN through FIXED_LEN_BYTE_ARRAY)
Logical types            STRING, DATE, TIME, TIMESTAMP, DECIMAL, UUID, JSON
Encodings                PLAIN, RLE, DICTIONARY, DELTA_BINARY_PACKED, DELTA_LENGTH_BYTE_ARRAY, DELTA_BYTE_ARRAY, BYTE_STREAM_SPLIT
Compression              UNCOMPRESSED, SNAPPY, GZIP, LZ4, ZSTD
Nested schemas           Groups, lists, maps with definition/repetition levels
Bloom filters            Read, write, and query (carquet_bloom_filter_check_*)
Page indexes             Column index + offset index (read + write + per-page stats access)
Statistics               Min/max/null count per column chunk
Predicate pushdown       Row group filtering via statistics; page-level via column index
Key-value metadata       Read and write arbitrary footer metadata
Per-column options       Per-column encoding, compression, statistics, bloom filter
Buffer writer            Write Parquet to in-memory buffer
CRC32                    Page-level verification (HW-accelerated on ARM)
Memory-mapped I/O        Zero-copy reads for uncompressed PLAIN data
Column projection        Read only selected columns
I/O coalescing           Pre-buffer multi-column reads in a single I/O
Speculative footer       Single-I/O file open for most files
OpenMP parallel reads    When available
Encryption               Not supported

Running Benchmarks

# Build with max optimizations
cmake -B build -DCMAKE_BUILD_TYPE=Release -DCARQUET_NATIVE_ARCH=ON -DCARQUET_BUILD_DEV=ON
cmake --build build -j$(nproc)

cd build
./benchmark_carquet                     # Carquet standalone
python3 ../benchmark/run_benchmark.py   # Full comparison (+ PyArrow, + Arrow C++)

# Skip 100M-row (xlarge) configs — they write ~2GB files per codec
# and can take 30+ minutes depending on hardware
python3 ../benchmark/run_benchmark.py --skip-xlarge

# Override ZSTD level (default: 1)
CARQUET_BENCH_ZSTD_LEVEL=3 python3 ../benchmark/run_benchmark.py

Optional Arrow C++ benchmark

cmake -B build -DCMAKE_BUILD_TYPE=Release -DCARQUET_NATIVE_ARCH=ON \
  -DCARQUET_BUILD_BENCHMARKS=ON \
  -DCARQUET_BUILD_ARROW_CPP_BENCHMARK=ON
cmake --build build -j$(nproc)

# Or point at a custom Arrow install
cmake -B build ... -DCARQUET_ARROW_CPP_ROOT=/path/to/arrow-prefix

The Arrow C++ benchmark uses the low-level parquet::ParquetFileReader API (bypassing Arrow Table materialization overhead) with parallel row group readers. In the same-file cross-read mode, both libraries read the exact same Parquet file, which eliminates differences in page sizes, encoding, and row group layout. Both benchmarks use identical data and row group sizing, no dictionary encoding, page checksums, mmap reads, and BYTE_STREAM_SPLIT for floats.

API Reference

Full API is in include/carquet/carquet.h. Key types:

Type                     Purpose
carquet_reader_t         File reader (open from path, FILE*, or memory buffer)
carquet_writer_t         File writer
carquet_batch_reader_t   High-level batch iteration
carquet_schema_t         Schema definition and introspection
carquet_error_t          Rich error info (code, message, source location, recovery hint)

Core API functions

Reader

carquet_reader_t* carquet_reader_open(const char* path, const carquet_reader_options_t* opts, carquet_error_t* err);
carquet_reader_t* carquet_reader_open_buffer(const void* buf, size_t size, const carquet_reader_options_t* opts, carquet_error_t* err);
void              carquet_reader_close(carquet_reader_t* reader);
int64_t           carquet_reader_num_rows(const carquet_reader_t* reader);
int32_t           carquet_reader_num_columns(const carquet_reader_t* reader);

Batch Reader

carquet_batch_reader_t* carquet_batch_reader_create(carquet_reader_t* reader, const carquet_batch_reader_config_t* cfg, carquet_error_t* err);
carquet_status_t        carquet_batch_reader_next(carquet_batch_reader_t* br, carquet_row_batch_t** batch);
carquet_status_t        carquet_row_batch_column(const carquet_row_batch_t* batch, int32_t col, const void** data, const uint8_t** nulls, int64_t* n);

Writer

carquet_writer_t*  carquet_writer_create(const char* path, const carquet_schema_t* schema, const carquet_writer_options_t* opts, carquet_error_t* err);
carquet_status_t   carquet_writer_write_batch(carquet_writer_t* w, int32_t col, const void* values, int64_t n, const int16_t* def, const int16_t* rep);
carquet_status_t   carquet_writer_close(carquet_writer_t* w);

Schema

carquet_schema_t* carquet_schema_create(carquet_error_t* err);
carquet_status_t  carquet_schema_add_column(carquet_schema_t* s, const char* name, carquet_physical_type_t type, const carquet_logical_type_t* logical, carquet_field_repetition_t rep, int32_t type_len, int32_t parent);
int32_t           carquet_schema_add_list(carquet_schema_t* s, const char* name, carquet_physical_type_t elem_type, const carquet_logical_type_t* elem_logical, carquet_field_repetition_t rep, int32_t type_len, int32_t parent);
int32_t           carquet_schema_add_map(carquet_schema_t* s, const char* name, carquet_physical_type_t key_type, const carquet_logical_type_t* key_logical, int32_t key_len, carquet_physical_type_t val_type, const carquet_logical_type_t* val_logical, int32_t val_len, carquet_field_repetition_t rep, int32_t parent);

Filtering

int32_t carquet_reader_filter_row_groups(const carquet_reader_t* reader, int32_t col, carquet_compare_op_t op, const void* value, int32_t value_size, int32_t* matching, int32_t max);

Project Structure

include/carquet/   Public API (carquet.h, types.h, error.h)
src/
  core/            Arena allocator, buffer, bitpack, endian
  encoding/        PLAIN, RLE, DELTA, DICTIONARY, BYTE_STREAM_SPLIT
  compression/     Snappy (internal), GZIP, ZSTD, LZ4 (wrappers)
  thrift/          Thrift compact protocol for Parquet metadata
  simd/            Runtime dispatch + x86 (SSE/AVX2/AVX-512) + ARM (NEON/SVE)
  reader/          File, row group, column, page, batch readers + mmap
  writer/          File, row group, column, page writers
  metadata/        Schema, statistics, bloom filters, page indexes
  cli/             CLI tool and code generator
  util/            CRC32, xxHash
tests/             18 test files
examples/          basic_write_read, data_types, compression_codecs, nullable_columns, advanced_features
benchmark/         Performance benchmarks and comparison tools

License

MIT
