Draft
2 changes: 1 addition & 1 deletion .github/workflows/ci.yml
@@ -15,7 +15,7 @@ jobs:
- name: Set up JDK
uses: coursier/setup-action@v1
with:
jvm: temurin:21
jvm: temurin:25

- name: Cache Coursier
uses: actions/cache@v4
14 changes: 8 additions & 6 deletions docs/LIMITATIONS.md
@@ -1,6 +1,6 @@
# XL Current Limitations and Future Roadmap

**Last Updated**: 2025-12-27 (Docs Cleanup)
**Last Updated**: 2026-04-26
**Current Phase**: Core domain + OOXML + streaming I/O complete; formula system complete (**81 functions** + cross-sheet support); tables + benchmarks complete; row/column serialization complete; **security hardening complete** (ZIP bomb detection, XXE prevention, formula injection guards in both in-memory and streaming writes).

This document provides a comprehensive overview of what XL can and cannot do today, with clear links to future implementation plans.
@@ -84,18 +84,20 @@ This document provides a comprehensive overview of what XL can and cannot do tod

### 🟡 Medium Impact (Reduces Functionality)

#### 4. Merged Cells in Streaming Writes
**Status**: Fully supported in the in‑memory OOXML path; not emitted by streaming writers.
#### 4. Merged Cells in Pure Row-Stream Writes
**Status**: Fully supported in the in‑memory OOXML path and `writeWorkbookStream`; not available in pure row-stream generation.

**Current State**:
- In‑memory:
- `Sheet.mergedRanges: Set[CellRange]` tracks merged regions.
- `OoxmlWorksheet.toXml` emits `<mergeCells>` / `<mergeCell>` for those ranges.
- Streaming write (`writeStream`, `writeStreamsSeq`):
- Only writes `sheetData` with plain rows and cells; no merged cell metadata is currently generated.
- In-memory workbook SAX/StAX write (`writeWorkbookStream`):
- Delegates to the full OOXML writer and preserves merged cell metadata.
- Pure row-stream write (`writeStream`, `writeStreamsSeq`):
- Writes rows from `Stream[RowData]`; there is no API for supplying merged cell metadata.

**Impact**:
- In‑memory read/write round‑trips preserve merges.
- In‑memory read/write and CLI workbook writes preserve merges.
- Pure streaming‑generated workbooks will not contain merged ranges.
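
The gap described above comes from where merge metadata lives in the worksheet XML: `<mergeCells>` is a sheet-level element that follows `</sheetData>`, so a writer that only receives rows has nothing to emit there. The sketch below shows the shape of that element. `MergeRef` and `MergeCellsSketch` are hypothetical names for illustration, not XL's `CellRange` or `OoxmlWorksheet` code.

```scala
// Hypothetical stand-in for a merged range; this is not XL's CellRange type.
final case class MergeRef(topLeft: String, bottomRight: String):
  def ref: String = s"$topLeft:$bottomRight"

object MergeCellsSketch:
  /** Renders the OOXML <mergeCells> element, or an empty string when there are no merges. */
  def render(ranges: Seq[MergeRef]): String =
    if ranges.isEmpty then ""
    else
      // One <mergeCell> child per range, wrapped in a parent that declares the count.
      val cells = ranges.map(r => s"""<mergeCell ref="${r.ref}"/>""").mkString
      s"""<mergeCells count="${ranges.size}">$cells</mergeCells>"""
```

A row-stream writer would need the full set of ranges before closing the worksheet part, which is metadata a `Stream[RowData]` does not carry.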

---
2 changes: 1 addition & 1 deletion docs/QUICK-START.md
@@ -193,7 +193,7 @@ Stream.range(1, 1_000_001)

**Memory**: ~10MB constant (even for 10M rows!)

**Limitations**: Streaming writers use inline strings and minimal styles (no rich formatting or merges).
**Limitations**: Pure row-stream writers use inline strings and minimal styles, and do not accept workbook metadata such as merges. Use in-memory writes or `writeWorkbookStream` when you need full workbook metadata.
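
To make "inline strings" concrete, here is a minimal sketch of the two SpreadsheetML cell encodings. `CellXmlSketch` and its helpers are hypothetical names used only for illustration; they are not part of XL's API, and XL's own emitters may differ in detail.

```scala
object CellXmlSketch:
  // Minimal escaping for characters that must not appear raw in XML element content.
  private def escape(s: String): String =
    s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")

  /** Inline string cell: the text travels with the cell, so no workbook-level table is needed. */
  def inlineCell(ref: String, text: String): String =
    s"""<c r="$ref" t="inlineStr"><is><t>${escape(text)}</t></is></c>"""

  /** Shared string cell: only an index is written; the text lives in sharedStrings.xml. */
  def sharedCell(ref: String, sstIndex: Int): String =
    s"""<c r="$ref" t="s"><v>$sstIndex</v></c>"""
```

The inline form repeats the text in every cell, which is why row-streamed files tend to be larger, while the shared form needs an index that is only known once all strings have been seen.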

---

1 change: 1 addition & 0 deletions docs/RELEASING.md
@@ -26,6 +26,7 @@ Mill auto-imports these as `MILL_*` prefixed environment variables.

| Module | Artifact ID | Description |
|--------|-------------|-------------|
| xl | `xl_3` | Aggregate artifact for the full XL library |
| xl-core | `xl-core_3` | Core domain model, macros, DSL |
| xl-ooxml | `xl-ooxml_3` | OOXML readers and writers |
| xl-cats-effect | `xl-cats-effect_3` | IO interpreters, streaming |
34 changes: 21 additions & 13 deletions docs/STATUS.md
@@ -1,6 +1,6 @@
# XL Project Status

**Last Updated**: 2026-01-21
**Last Updated**: 2026-04-26

## Current State

Expand Down Expand Up @@ -60,6 +60,7 @@
- ✅ ExcelIO[IO] interpreter
- ✅ `readStream` / `readSheetStream` / `readStreamByIndex` – constant‑memory streaming read (fs2.io.readInputStream + fs2‑data‑xml)
- ✅ `writeStream` / `writeStreamsSeq` – constant‑memory streaming write (fs2‑data‑xml)
- ✅ `writeWorkbookStream` – lower-allocation SAX/StAX write for in-memory workbooks; preserves merges, comments, tables, row/column properties, and freeze panes
- ✅ **`writeFast`** – SAX/StAX streaming write (opt-in via `ExcelIO.writeFast()` or `WriterConfig(backend = XmlBackend.SaxStax)`)
- ✅ Benchmark: 100k rows in ~1.8s read (~10MB constant memory) / ~1.1s write (~10MB constant memory)

@@ -149,7 +150,7 @@
- Precedent/dependent queries: O(1) lookups via adjacency lists
- Safe evaluation: sheet.evaluateWithDependencyCheck() (production-ready)
- Performance: Handles 10k formula cells in <10ms
- ⚠️ Merged cells are fully supported in the in-memory OOXML path, but not emitted by streaming writers.
- ⚠️ Merged cells are supported by the in-memory OOXML path and `writeWorkbookStream`. Pure row-stream generation (`writeStream` / `writeStreamsSeq`) has no merge API.
- ❌ Hyperlinks not serialized.
- ✅ Column/row properties (width, height, hidden, outlineLevel, collapsed) are fully serialized via DirectSaxEmitter.

@@ -171,14 +172,19 @@
- ❌ Data validation
- ❌ Named ranges

### Streaming I/O Limitations (CRITICAL)
### Streaming I/O Limitations

**Write Path** (✅ Working):
- ✅ True constant-memory streaming with `writeStream`
**Row-stream write path** (✅ Working):
- ✅ True constant-memory row streaming with `writeStream` / `writeStreamsSeq`
- ✅ O(1) memory regardless of file size
- ⚠️ No SST support (inline strings only - larger files)
- ⚠️ Minimal styles (default only - no rich formatting)
- ⚠️ [Content_Types].xml written before SST decision made
- ⚠️ No row-stream API for workbook metadata such as merged ranges, comments, tables, and freeze panes

**In-memory workbook SAX/StAX write path** (✅ Working):
- ✅ `writeWorkbookStream` writes an already-materialized `Workbook` through the SAX/StAX backend
- ✅ Preserves full workbook metadata handled by the OOXML writer, including merges, comments, tables, row/column properties, and freeze panes
- ⚠️ Not a row-input streaming API; the `Workbook` is already in memory

**Read Path** (✅ P6.6 Complete):
- ✅ **True constant-memory streaming** - uses `fs2.io.readInputStream`
@@ -196,12 +202,14 @@

### Security & Safety

**Not Implemented** (P11):
- ❌ ZIP bomb detection
- ❌ XXE (XML External Entity) prevention
- ❌ Formula injection guards
- ❌ XLSM macro preservation (should never execute)
- ❌ File size limits
**Implemented**:
- ✅ ZIP bomb detection
- ✅ XXE (XML External Entity) prevention
- ✅ Formula injection guards in in-memory and streaming writes

**Remaining**:
- ❌ XLSM macro preservation policy and tests (macros are never executed)
- ❌ Configurable file size limits
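
As an illustration of what a ZIP bomb check typically looks at, the sketch below flags an entry by its declared sizes. This is a generic technique under stated assumptions, not XL's implementation: `ZipBombSketch`, `Limits`, and both threshold values are hypothetical, and this document does not state the limits XL applies.

```scala
object ZipBombSketch:
  /** Hypothetical limits chosen for illustration only. */
  final case class Limits(maxRatio: Long = 100L, maxEntryBytes: Long = 512L * 1024 * 1024)

  /** True when an entry's declared sizes exceed the configured limits. */
  def isSuspicious(compressedBytes: Long, uncompressedBytes: Long, limits: Limits = Limits()): Boolean =
    val tooLarge = uncompressedBytes > limits.maxEntryBytes
    // Guard against division by zero: no compressed payload but non-empty output is treated as suspicious.
    val ratioExceeded =
      if compressedBytes <= 0 then uncompressedBytes > 0
      else uncompressedBytes / compressedBytes > limits.maxRatio
    tooLarge || ratioExceeded
```

Declared sizes in a ZIP header can be wrong or forged, so a robust guard would also count bytes while inflating and stop once a limit is crossed.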

### Advanced Features

@@ -214,14 +222,14 @@
- ✅ **Excel Tables** (WI-10): Structured data with headers, AutoFilter, styling
- ✅ **Benchmarks** (WI-15): JMH performance suite (XL vs POI)
- ✅ **SAX Write** (WI-17): Fast SAX/StAX streaming write path
- ✅ **Security Hardening** (WI-30): ZIP bomb detection, XXE prevention, formula injection guards

**Not Started** (Future):
- ❌ P6b: Full case class codec derivation (Magnolia/Shapeless)
- ❌ P9: Advanced macros (path macro, style literal)
- ❌ P10: Drawings (images, shapes)
- ❌ P11: Charts
- ❌ Pivot Tables (remaining part of P12)
- ❌ P13: Security hardening (ZIP bomb, XXE prevention)

---

5 changes: 3 additions & 2 deletions docs/design/architecture.md
@@ -19,7 +19,7 @@ graph TD
end

subgraph Evaluator
Eval[xl-evaluator<br/>Formula Parser]
Eval[xl-evaluator<br/>Formula parser + evaluator]
end

subgraph Test
@@ -37,7 +37,7 @@ graph TD
- `xl-core`: Pure domain model (`Cell`, `Sheet`, `Workbook`, styles, codecs, optics, macros).
- `xl-ooxml`: Pure OOXML mapping layer (`XlsxReader` / `XlsxWriter`, `OoxmlWorkbook`, `OoxmlWorksheet`, `SharedStrings`, `Styles`).
- `xl-cats-effect`: Effectful interpreters (`Excel[F]` / `ExcelIO`) and true streaming I/O built on Cats Effect, fs2, and fs2-data-xml.
- `xl-evaluator`: Formula parser (`TExpr` GADT, `FormulaParser`, `FormulaPrinter`); evaluator planned (WI-08).
- `xl-evaluator`: Formula parser, printer, evaluator, function registry, dependency graph, and cross-sheet formula support.
- `xl-testkit`: Reusable generators and law test helpers for the other modules.

## I/O Flow
@@ -72,6 +72,7 @@ flowchart LR
- **Streaming path**:
- `ExcelIO.readStream` / `readSheetStream` open the ZIP and stream a worksheet’s XML through fs2‑data‑xml, yielding a `Stream[F, RowData]` with constant memory use (SST is still materialized once if present).
- `ExcelIO.writeStream` / `writeStreamsSeq` write static parts once, then stream worksheet XML events directly to a `ZipOutputStream` from a `Stream[F, RowData]` without ever materializing all rows.
- `ExcelIO.writeWorkbookStream` is different: it accepts an already-materialized `Workbook`, then uses the SAX/StAX OOXML backend to reduce writer allocation while preserving the full metadata handled by `XlsxWriter`.

See also:
- `docs/design/io-modes.md` – deeper comparison of in-memory vs streaming modes.
4 changes: 2 additions & 2 deletions docs/design/decisions.md
@@ -89,13 +89,13 @@
- **Alternatives Considered**:
- Streaming-only: Would lose SST/styles (unacceptable for <100k row use cases)
- In-memory only: Would OOM on large files (unacceptable for ETL pipelines)
- Two-phase streaming: Deferred to P7.5 (complex, not MVP-critical)
- Full-feature row streaming: Deferred; workbook-level SST/styles/metadata make it substantially more complex than row emission
- **Consequences**:
- ✅ Best-of-both-worlds (full features OR constant memory)
- ✅ Users choose based on needs
- ❌ Two implementations to maintain
- ❌ Potential confusion about which to use
- **Mitigation**: Clear documentation in README and performance-guide.md; streaming read was fixed in P6.6 (fs2.io.readInputStream) and now matches streaming write on O(1) memory.
- **Mitigation**: Clear documentation in README and performance-guide.md; streaming read was fixed in P6.6 (fs2.io.readInputStream), and in-memory workbook writes can use the SAX/StAX backend through `writeWorkbookStream`.
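
The deferred full-feature row streaming alternative hinges on workbook-level state such as the shared strings table. A minimal two-pass sketch of that idea follows: scan once to index distinct strings, then encode rows against the index. `TwoPassSstSketch` and its methods are hypothetical names, and rows are modelled as plain `Seq[String]` rather than XL's `RowData`.

```scala
object TwoPassSstSketch:
  /** Pass 1: assign each distinct string an index in first-seen order. */
  def buildIndex(rows: Iterable[Seq[String]]): Map[String, Int] =
    rows.iterator.flatten.foldLeft(Map.empty[String, Int]) { (acc, s) =>
      if acc.contains(s) then acc else acc.updated(s, acc.size)
    }

  /** Pass 2: replace each string with its shared-string index. */
  def encode(rows: Iterable[Seq[String]], index: Map[String, Int]): Seq[Seq[Int]] =
    rows.iterator.map(_.map(index)).toSeq
```

The index grows with the number of distinct strings and the input has to be traversed twice, so the approach is no longer constant memory over a single pass. That trade-off is what makes it substantially more complex than plain row emission.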

## ADR-012: Compression defaults to DEFLATED
**Date**: 2025-11 (P6.7 planned)
25 changes: 13 additions & 12 deletions docs/design/io-modes.md
@@ -71,7 +71,7 @@ Characteristics:

---

### Streaming Path
### Row-Streaming Path

Write (`writeStream` / `writeStreamsSeq`):
- Static parts (`[Content_Types].xml`, workbook relationships, minimal `styles.xml`) are written once up front.
@@ -86,9 +86,11 @@ Read (`readStream` / `readSheetStream` / `readStreamByIndex`):
Characteristics:
- Memory: **O(1)** for worksheet data (plus the in‑memory SST and minimal bookkeeping).
- Features:
- Write: inline strings only, default styles, no merged cells or advanced sheet metadata.
- Write: inline strings only, default styles, no row-stream API for merged cells or advanced sheet metadata.
- Read: values and basic types; you typically use it for ETL/analytics rather than formatting‑preserving workflows.

For an already-materialized `Workbook`, `writeWorkbookStream` uses the SAX/StAX OOXML backend. It is a lower-allocation full-workbook write path, not a row-input streaming API, and it preserves the full metadata handled by `XlsxWriter`.

```
Read:
User calls readStream(path)
@@ -124,7 +126,7 @@ Features: Limited (reads values only, minimal style info)
**Alternatives Considered**:
1. **Streaming only**: Would lose SST/styles (larger files, no formatting)
2. **In-memory only**: Would OOM on large files
3. **Two-phase streaming**: Adds complexity, still under development (P7.5)
3. **Full-feature row streaming**: Adds complexity because SST/styles/metadata require workbook-level state

**Chosen**: Two modes with clear guidance on when to use each

@@ -232,12 +234,12 @@ Today:

## Why Not a Unified Implementation?

**Could we make streaming support full features?**
**Could pure row streaming support full features?**

**Challenge 1**: SST requires string deduplication
- Need to see all strings before writing sharedStrings.xml
- But [Content_Types].xml written first (before strings known)
- **Solution**: Two-phase approach (P7.5) or optimistic SST inclusion
- **Solution**: Two-phase row streaming or optimistic SST inclusion

**Challenge 2**: Styles require deduplication across workbook
- Need to merge all sheet style registries
@@ -251,12 +253,12 @@ Today:
- Relationships typically before referenced parts
- Changing order might break some readers (untested)

**Conclusion**: Streaming with full features is possible (P7.5) but requires:
**Conclusion**: Pure row streaming with full features is possible but requires:
- Two-pass approach (scan data, write with indices)
- OR disk-backed registries for SST/styles
- OR optimistic overhead (include SST/styles even if empty)

**Timeline**: 3-4 weeks of implementation (deferred to post-MVP)
For already-materialized workbooks, the implemented `writeWorkbookStream` path uses the SAX/StAX backend and preserves the full metadata handled by `XlsxWriter`.

---

@@ -330,11 +332,10 @@ test("streaming read uses constant memory"):
- Default: DEFLATED + compact (smaller files)
- Debug mode: STORED + pretty for inspection

### P7.5: Two-Phase Streaming Writer (3-4 weeks)
- Support SST and styles in streaming mode
- Two-pass approach: scan → write
- Disk-backed registries for very large datasets
- Achieves O(1) memory with full features
### SAX/StAX Workbook Writer
- `writeWorkbookStream` writes an already-materialized workbook through the SAX/StAX backend.
- Preserves SST, styles, merged cells, comments, tables, row/column properties, and freeze panes handled by `XlsxWriter`.
- Pure row-stream writers remain the true O(1) row-input path for generated datasets.

---

15 changes: 8 additions & 7 deletions docs/reference/performance-guide.md
@@ -12,7 +12,7 @@ XL provides two distinct I/O implementations with different performance characte
| **10k-50k rows** | Full styling | In-memory | In-memory | ~25-50MB | Good |
| **50k-100k rows** | Full styling | In-memory | In-memory | ~50-100MB | Fair |
| **100k+ rows** | Minimal | **Streaming** (sequential) | **Streaming** | ~10-50MB (SST dependent) | Excellent |
| **100k+ rows** | Full styling | In-memory for now | In-memory for now | O(n) | See roadmap P7.5 |
| **100k+ rows** | Full styling / metadata | In-memory | `writeWorkbookStream` if already materialized | O(n) workbook, lower writer allocation | Good |

## I/O Modes Explained

@@ -56,14 +56,15 @@ excel.read(path).map { wb =>

---

### Streaming Write Mode
### Row-Stream Write Mode

**Use For**: Large data generation (100k+ rows) when you can live with minimal styling and inline strings.

**Characteristics**:
- O(1) constant memory for worksheet data (~10MB regardless of row count).
- fs2‑data‑xml event streaming; no intermediate XML trees.
- **Limitations**: No SST support (inline strings only), minimal styles (no rich formatting or merged cells).
- **Limitations**: No SST support (inline strings only), minimal styles, and no API for workbook metadata such as merged cells.
- For an already-materialized workbook that needs metadata preservation with lower writer allocation, use `ExcelIO.writeWorkbookStream`.

**API**:
```scala
@@ -385,10 +386,10 @@ excel.read(largeFile)
- ✅ DEFLATED compression by default (5-10x smaller files)
- ✅ Configurable compression mode

### P7.5 (3-4 weeks)
- ✅ Two-phase streaming writer with full SST/styles
- ✅ O(1) memory for writes with rich formatting
- ✅ Disk-backed SST for very large datasets
### SAX/StAX OOXML Writer
- ✅ Lower-allocation `writeWorkbookStream` path for already-materialized workbooks
- ✅ Preserves full OOXML metadata handled by `XlsxWriter`
- ✅ Pure row-stream writers remain available for true O(1) row input

### Future (Post-1.0)
- ⬜ Parallel XML serialization (4-8 threads)
31 changes: 18 additions & 13 deletions docs/reference/testing-guide.md
@@ -1,6 +1,6 @@
# Testing & Laws — Property Suites, Round-Trips, and Coverage

**Current Status**: 636/636 tests passing across 4 modules
**Current Status**: CI runs the full Mill test graph across the library, evaluator, CLI, and support modules. Use `./mill __.test` as the authoritative count.

## Test Infrastructure

@@ -14,11 +14,14 @@
xl-core/test/src/com/tjclp/xl/
xl-ooxml/test/src/com/tjclp/xl/ooxml/
xl-cats-effect/test/src/com/tjclp/xl/io/
xl-evaluator/test/src/com/tjclp/xl/formula/
xl-cli/test/src/com/tjclp/xl/cli/
xl-testkit/test/src/com/tjclp/xl/testkit/
```

## Test Coverage by Module

### xl-core: 221 tests
### xl-core: domain, style, codec, optics, and law suites

#### Addressing Laws (17 tests)
- **Column/Row round-trips**: `from0` → `index0` identity
@@ -83,9 +86,9 @@ xl-cats-effect/test/src/com/tjclp/xl/io/
- **Whitespace**: Preserved correctly with `xml:space="preserve"`
- **OOXML mapping**: `TextRun` → `<r><rPr>...</rPr><t>...</t></r>`

### xl-ooxml: 24 tests
### xl-ooxml: OOXML round-trip, surgical write, metadata, table, security, and performance suites

#### Round-Trip Tests (24 tests)
#### Round-Trip and Regression Tests
- **Text cells**: String values preserve exactly
- **Number cells**: Numeric precision maintained
- **Boolean cells**: True/false round-trip
@@ -98,9 +101,9 @@ xl-cats-effect/test/src/com/tjclp/xl/io/
- **RichText**: Multi-run formatted text preserved
- **XML determinism**: Same input → same byte output

### xl-cats-effect: 18 tests
### xl-cats-effect: streaming and effectful IO suites

#### Streaming I/O (18 tests)
#### Streaming I/O
- **writeStream / writeStreamsSeq**: Event-based ZIP write via fs2-data-xml
- **readStream / readSheetStream / readStreamByIndex**: Event-based worksheet reads with fs2-data-xml + fs2.io.readInputStream
- **Constant memory**: O(1) memory usage verified (100k rows @ ~50MB)
@@ -184,10 +187,12 @@ property("get-set") {

### Run All Tests
```bash
./mill __.test # All modules (263 tests)
./mill xl-core.test # Core only (221 tests)
./mill xl-ooxml.test # OOXML only (24 tests)
./mill xl-cats-effect.test # Streaming only (18 tests)
./mill __.test # All test modules
./mill xl-core.test # Core only
./mill xl-ooxml.test # OOXML only
./mill xl-cats-effect.test # Streaming/IO only
./mill xl-evaluator.test # Formula parser/evaluator only
./mill xl-cli.test # CLI only
```

### Run Specific Test
@@ -201,7 +206,7 @@ property("get-set") {
GitHub Actions runs:
1. `./mill __.checkFormat` (Scalafmt verification)
2. `./mill __.compile` (Compilation check)
3. `./mill __.test` (All 263 tests)
3. `./mill __.test` (All test modules)

## Coverage Goals

@@ -215,9 +220,9 @@ GitHub Actions runs:

## Test Quality Metrics

- **All tests pass**: 263/263
- **All tests pass**: enforced by `./mill __.test` in CI
- **Zero flaky tests**: Deterministic, reproducible
- **Fast execution**: <5 seconds for full suite
- **Fast execution**: focused suites run quickly; full-suite time is tracked by CI
- **Property-based**: 60%+ of tests use ScalaCheck
- **Law coverage**: All algebras (Monoid, Lens, Optional) verified
- **Edge cases**: Boundary values, error paths tested