Merged
93 changes: 92 additions & 1 deletion README.md
@@ -8,6 +8,7 @@
A C++ local database library with cross language bindings. Aiming to be a fast, lightweight, and easy-to-use data communication solution for RPC and coupled modeling in scientific computing.

## What's new
- **2026-03-04 (Memory Overflow Improvement)**: Enhanced the `MemoryStream` implementation to handle data larger than 4 GB without overflowing `chunk_data_t.size` (previously a `u32`). This allows more robust in-memory handling of large datasets. (PR #22)
- **2026-02-28 (Release Improvement)**: Fixed bugs in the Windows build process. (PR #20)
- **2025-12-31 (Bug Fix)**: Fixed an issue where shared memory segments were not being properly unregistered from the resource tracker upon closing, which could lead to resource leaks. (PR #17)
- **2025-12-15 (Release Improvement)**: Enabled distribution of pre-compiled binary wheels for macOS (Intel/Apple Silicon) and Linux (x86_64/aarch64), eliminating the need for local compilation tools during installation. (PR #15)
@@ -22,7 +23,97 @@ You can install the Python package of fastdb via pip:
pip install fastdb4py
```

-**Note:** Pre-compiled binary wheels are provided for major platforms (macOS, Linux). For other systems (including Windows), the package will build from source, requiring a C++ compiler and CMake.
+**Note:** Pre-compiled binary wheels are provided for major platforms (macOS Intel/Apple Silicon, Linux Ubuntu, Windows AMD64). For other systems, the package will build from source, requiring a C++ compiler and CMake.

## Usage

### 1. Define a Feature (Schema)

To use `fastdb`, you first need to define your data schema by subclassing `fastdb4py.Feature`.
Use type hints to define the fields of your feature.

```python
import fastdb4py

class Point(fastdb4py.Feature):
    x: fastdb4py.F64
    y: fastdb4py.F64
```

### 2. Create and Initialize a Database

You can create a new database or truncate an existing one using `fastdb4py.ORM.truncate`.
This function takes a list of `TableDefn` objects, each specifying a feature class and an initial capacity (number of rows).

```python
# Specify the path for the database
DB_PATH = "my_fastdb_data"

# Create a new database with a table for 'Point' features, capacity 1000
# The name parameter is optional; if not provided, a default name will be generated based on the feature class name.
# In this example, we explicitly set the table name to 'points'.
db = fastdb4py.ORM.truncate([
    fastdb4py.TableDefn(Point, 1000, name='points'),
])
```

### 3. Write Data

You can access the table using the feature class as a key.
Features can be accessed by index or iterated over.

```python
# Access the table 'points' with schema defined by the Point feature class
points_table = db[Point]['points']
# If you did not specify the table name when creating the database, you can access it using the default name:
# points_table = db[Point][Point]
# or
# points_table = db[Point]['Point']

# A newly created in-memory database is already writable.
# (If you later load a database from file, make sure it is opened in write mode.)

for i in range(10):
    # Access the feature at index i
    p = points_table[i]

    # Set field values
    p.x = i * 1.5
    p.y = i * 2.5

# Save the database to disk
db.save(DB_PATH)
```

### 4. Read and Modify Data (Columnar Access)

`fastdb` supports high-performance columnar access using NumPy arrays.
This allows for vectorized operations on your data.

```python
# Load the database from disk
db = fastdb4py.ORM.load(DB_PATH, from_file=True)
points_table = db[Point]['points']

# The length of the table (number of rows) can be obtained using len()
print(f"Number of points: {len(points_table)}")

# Access fields as numpy arrays via the `.column` property
xs = points_table.column.x
ys = points_table.column.y

print(f"First 5 X values: {xs[:5]}")

# Modify data in bulk using numpy operations
# This modifies the data in memory directly!
xs += 10.0

# Verify the change via object access
print(f"Point 0 x: {points_table[0].x}") # Should be 0 * 1.5 + 10.0 = 10.0
```
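Since `.column` exposes NumPy arrays, whole-table computations can be vectorized. The snippet below is a standalone sketch that uses plain NumPy arrays as stand-ins for the `x`/`y` columns written above; it assumes, as the example does, that column arrays support normal NumPy operations:

```python
import numpy as np

# Stand-ins for points_table.column.x / points_table.column.y,
# filled with the same values the write loop above produced.
xs = np.arange(10) * 1.5
ys = np.arange(10) * 2.5

# Vectorized math over whole columns -- no Python-level loop needed.
dists = np.sqrt(xs**2 + ys**2)   # distance of each point from the origin

# Boolean masks select matching rows in one shot.
near = dists < 10.0
print(f"{int(near.sum())} points lie within radius 10")
```

In-place updates such as `xs += 10.0` from the example above follow the same pattern: because the arrays view the table's memory directly, vectorized NumPy operations modify the stored data without copies.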

## Development Environment
This project uses a DevContainer for its development environment. Please refer to the `.devcontainer/devcontainer.example.json` file for configuration details.
2 changes: 1 addition & 1 deletion fastcarto/fastdb/include/fastdb.h
@@ -56,7 +56,7 @@ namespace wx

 struct chunk_data_t
 {
-    u32 size;
+    size_t size;
     u8* pdata;
 };
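To see why this field was widened from `u32` to `size_t`, here is a quick standalone sketch (plain Python, no fastdb4py required): an unsigned 32-bit field stores only `n % 2**32`, so any payload over 4 GiB is reported far smaller than it really is.

```python
# Sketch: how an unsigned 32-bit size field truncates payloads over 4 GiB.
# Standalone illustration; does not use fastdb4py.

U32_MODULUS = 2**32  # a u32 wraps modulo 2^32

def u32_truncate(n: int) -> int:
    """Return the value an unsigned 32-bit field would actually store for n."""
    return n % U32_MODULUS

payload = int(4.05 * 1024**3)   # ~4.05 GiB, just past the 4 GiB u32 limit
reported = u32_truncate(payload)

print(f"actual size:  {payload} bytes")
print(f"u32-reported: {reported} bytes")

# The truncated size is wildly wrong -- the mismatch the size_t change fixes.
assert reported != payload
assert reported == payload - U32_MODULUS
```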

2 changes: 1 addition & 1 deletion fastcarto/fastdb/src/FastVectorDbBuild.cpp
@@ -43,7 +43,7 @@ namespace wx

 chunk_data_t MemoryStream::Impl::data()
 {
-    return chunk_data_t{ (u32)m_buffer.size(), m_buffer.data() };
+    return chunk_data_t{ m_buffer.size(), m_buffer.data() };
 }

void MemoryStream::Impl::reset()
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -1,6 +1,6 @@
[project]
name = "fastdb4py"
-version = "0.1.10"
+version = "0.1.11"
description = "FastCarto database bindings"
readme = "README.md"
requires-python = ">=3.10"
7 changes: 6 additions & 1 deletion python/fastdb4py/orm/__init__.py
@@ -322,7 +322,12 @@ def close(self):
         Make sure to unlink the shared memory via the unlink() method if you want to completely remove it for all processes.
         """
         if self._shm:
-            resource_tracker.unregister(self._shm._name, 'shared_memory')
+            # We no longer manually unregister the shared memory here.
+            # This may cause multiprocessing resource-tracker warnings when the process
+            # that shares the memory transfers ownership to other processes and exits
+            # without unlinking it, but those warnings are generally safe to ignore as
+            # long as the shared memory is properly unlinked once no longer needed.
+            # May be optimized in the future if necessary.
+            # resource_tracker.unregister(self._shm._name, 'shared_memory')
             self._shm.close()
             self._shm = None
             self._origin = None
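The `close()`/`unlink()` distinction above mirrors the standard library's `multiprocessing.shared_memory` API, which this code builds on. A minimal standalone sketch (pure stdlib; the names are illustrative):

```python
from multiprocessing import shared_memory

# close() detaches this process's mapping; other attached processes keep access.
# unlink() destroys the segment for everyone; exactly one owner should call it
# once all users are done, otherwise the segment (and its memory) leaks.

shm = shared_memory.SharedMemory(create=True, size=1024)
try:
    shm.buf[:5] = b"hello"                             # visible to any attacher
    peer = shared_memory.SharedMemory(name=shm.name)   # e.g. another process
    assert bytes(peer.buf[:5]) == b"hello"
    peer.close()                                       # peer detaches; segment still alive
finally:
    shm.close()     # our mapping is gone...
    shm.unlink()    # ...and now the segment itself is too
```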
120 changes: 120 additions & 0 deletions tests/reproduce_overflow.py
@@ -0,0 +1,120 @@
#################################################################
# Test Script: reproduce_overflow.py
# Purpose: To verify if writing > 4GB to MemoryStream causes size overflow in chunk_data_t.size (u32).
# Usage: python reproduce_overflow.py [size_in_gb]
# Note: This test can consume a lot of RAM due to C++ std::vector resizing strategy. Use with caution.
#################################################################
import sys
import os
import time

try:
    from fastdb4py import core
except ImportError:
    # Try a local import if the package is installed in an unusual way or we are
    # running from source; adjust the path as needed
    sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), '../../python')))
    try:
        from fastdb4py import core
    except ImportError:
        print("Error: Could not import fastdb4py.core")
        sys.exit(1)

def run_test(target_size_gb=4.05):
    """
    Tries to write > 4GB data to MemoryStream to verify if size overflows a 32-bit integer.
    target_size_gb: The size to write in GB. Should be > 4.0 to trigger overflow.
    """
    target_bytes = int(target_size_gb * 1024 * 1024 * 1024)

    # Use 64MB chunks to avoid one giant python allocation
    chunk_size = 64 * 1024 * 1024
    chunk = b'\x00' * chunk_size

    print(f"[-] Allocating ~{target_size_gb:.2f} GB in MemoryStream...")
    print("[-] WARNING: This test can consume > 8GB of RAM due to C++ std::vector resizing strategy.")
    print("[-] If your machine has < 8GB RAM, this script might be killed by the OOM killer.")

    ms = core.WxMemoryStream()
    total_written = 0
    start_time = time.time()

    try:
        while total_written < target_bytes:
            remaining = target_bytes - total_written

            # Determine how much to write in this iteration
            current_write_size = min(remaining, chunk_size)

            # Slice the buffer if needed (a memoryview would avoid the copy, but a bytes slice works for now)
            bytes_to_write = chunk if current_write_size == chunk_size else chunk[:current_write_size]

            # Depending on the SWIG typemap, write() might take 1 or 2 args.
            # Based on inspection, it likely takes 1 argument (buffer) due to %typemap(in) (void* pdata, size_t size)
            try:
                ms.write(bytes_to_write)
            except TypeError:
                # Fallback if the typemap doesn't match a single argument
                ms.write(bytes_to_write, len(bytes_to_write))

            total_written += current_write_size

            # Print progress roughly every 512MB
            if total_written % (512 * 1024 * 1024) < chunk_size:
                elapsed = time.time() - start_time
                speed = (total_written / (1024**3)) / elapsed if elapsed > 0 else 0
                print(f"    Written: {total_written / (1024**3):.2f} GB ({speed:.2f} GB/s)")

    except MemoryError:
        print("\n[ERROR] Out of Memory! Python failed to allocate memory.")
        sys.exit(1)
    except Exception as e:
        print(f"\n[ERROR] Unexpected error during write: {e}")
        # Even if the write crashed, still check what was written so far

    # Verify result
    print(f"\n[-] Finished writing. Total intended: {total_written} bytes")

    try:
        result = ms.data()
        reported_size = result.size  # This is the u32 field from the C++ struct

        print(f"[-] Reported Size (from chunk_data_t.size): {reported_size} bytes")

        expected_modulo = total_written % (2**32)

        if reported_size != total_written:
            print(f"[FAIL] Size mismatch! {reported_size} != {total_written}")
            if reported_size == expected_modulo:
                diff = total_written - reported_size
                print(f"[FAIL] Bug Reproduced: Size overflowed exactly by {diff} bytes (approx {diff/1024/1024/1024:.0f}GB).")
                print("       This confirms that chunk_data_t.size is truncated to a 32-bit unsigned integer.")

                # Check for the magic-string issue:
                # If we wrote > 4GB, the reported size is small (modulo 2^32).
                # If the modulo is very small (e.g. < 16 bytes), the magic string (first 16 bytes) is effectively lost
                # because the system thinks the file is only N bytes long.
                if reported_size < 16:
                    print("[CRITICAL] Magic string compromised! File size is reported as < 16 bytes.")
                else:
                    print("[WARN] Magic string might be present in the first 16 bytes, but the file is truncated.")
            else:
                print(f"[FAIL] Size mismatch but not an exact modulo match. Got {reported_size}")
        else:
            print("[PASS] Size matches correctly (No overflow observed).")
            print("       (Did you write enough data? You typically need > 4GB to trigger this.)")

    except Exception as e:
        print(f"[ERROR] Failed to inspect result: {e}")

if __name__ == "__main__":
    size_gb = 4.05  # Default to slightly over 4GB
    if len(sys.argv) > 1:
        try:
            size_gb = float(sys.argv[1])
        except ValueError:
            print("Usage: python reproduce_overflow.py [size_in_gb]")
            sys.exit(1)

    run_test(size_gb)