diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml index 569faa57..de5ba43c 100644 --- a/.github/workflows/ci.yml +++ b/.github/workflows/ci.yml @@ -15,7 +15,7 @@ jobs: - name: Set up JDK uses: coursier/setup-action@v1 with: - jvm: temurin:21 + jvm: temurin:25 - name: Cache Coursier uses: actions/cache@v4 diff --git a/docs/LIMITATIONS.md b/docs/LIMITATIONS.md index d5a2e534..c34a4681 100644 --- a/docs/LIMITATIONS.md +++ b/docs/LIMITATIONS.md @@ -1,6 +1,6 @@ # XL Current Limitations and Future Roadmap -**Last Updated**: 2025-12-27 (Docs Cleanup) +**Last Updated**: 2026-04-26 **Current Phase**: Core domain + OOXML + streaming I/O complete; formula system complete (**81 functions** + cross-sheet support); tables + benchmarks complete; row/column serialization complete; **security hardening complete** (ZIP bomb detection, XXE prevention, formula injection guards in both in-memory and streaming writes). This document provides a comprehensive overview of what XL can and cannot do today, with clear links to future implementation plans. @@ -84,18 +84,20 @@ This document provides a comprehensive overview of what XL can and cannot do tod ### 🟡 Medium Impact (Reduces Functionality) -#### 4. Merged Cells in Streaming Writes -**Status**: Fully supported in the in‑memory OOXML path; not emitted by streaming writers. +#### 4. Merged Cells in Pure Row-Stream Writes +**Status**: Fully supported in the in‑memory OOXML path and `writeWorkbookStream`; not available in pure row-stream generation. **Current State**: - In‑memory: - `Sheet.mergedRanges: Set[CellRange]` tracks merged regions. - `OoxmlWorksheet.toXml` emits `` / `` for those ranges. -- Streaming write (`writeStream`, `writeStreamsSeq`): - - Only writes `sheetData` with plain rows and cells; no merged cell metadata is currently generated. +- In-memory workbook SAX/StAX write (`writeWorkbookStream`): + - Delegates to the full OOXML writer and preserves merged cell metadata. +- Pure row-stream write (`writeStream`, `writeStreamsSeq`): + - Writes rows from `Stream[RowData]`; there is no API for supplying merged cell metadata. **Impact**: -- In‑memory read/write round‑trips preserve merges. +- In‑memory read/write and CLI workbook writes preserve merges. - Pure streaming‑generated workbooks will not contain merged ranges. --- diff --git a/docs/QUICK-START.md b/docs/QUICK-START.md index 2c46b094..84edbca5 100644 --- a/docs/QUICK-START.md +++ b/docs/QUICK-START.md @@ -193,7 +193,7 @@ Stream.range(1, 1_000_001) **Memory**: ~10MB constant (even for 10M rows!) -**Limitations**: Streaming writers use inline strings and minimal styles (no rich formatting or merges). +**Limitations**: Pure row-stream writers use inline strings and minimal styles, and do not accept workbook metadata such as merges. Use in-memory writes or `writeWorkbookStream` when you need full workbook metadata. --- diff --git a/docs/RELEASING.md b/docs/RELEASING.md index b2ba00a1..bc2e5111 100644 --- a/docs/RELEASING.md +++ b/docs/RELEASING.md @@ -26,6 +26,7 @@ Mill auto-imports these as `MILL_*` prefixed environment variables. | Module | Artifact ID | Description | |--------|-------------|-------------| +| xl | `xl_3` | Aggregate artifact for the full XL library | | xl-core | `xl-core_3` | Core domain model, macros, DSL | | xl-ooxml | `xl-ooxml_3` | OOXML readers and writers | | xl-cats-effect | `xl-cats-effect_3` | IO interpreters, streaming | diff --git a/docs/STATUS.md b/docs/STATUS.md index 4588850a..6520b8ca 100644 --- a/docs/STATUS.md +++ b/docs/STATUS.md @@ -1,6 +1,6 @@ # XL Project Status -**Last Updated**: 2026-01-21 +**Last Updated**: 2026-04-26 ## Current State @@ -60,6 +60,7 @@ - ✅ ExcelIO[IO] interpreter - ✅ `readStream` / `readSheetStream` / `readStreamByIndex` – constant‑memory streaming read (fs2.io.readInputStream + fs2‑data‑xml) - ✅ `writeStream` / `writeStreamsSeq` – constant‑memory streaming write (fs2‑data‑xml) +- ✅ `writeWorkbookStream` – lower-allocation SAX/StAX write for in-memory workbooks; preserves merges, comments, tables, row/column properties, and freeze panes - ✅ **`writeFast`** – SAX/StAX streaming write (opt-in via `ExcelIO.writeFast()` or `WriterConfig(backend = XmlBackend.SaxStax)`) - ✅ Benchmark: 100k rows in ~1.8s read (~10MB constant memory) / ~1.1s write (~10MB constant memory) @@ -149,7 +150,7 @@ - Precedent/dependent queries: O(1) lookups via adjacency lists - Safe evaluation: sheet.evaluateWithDependencyCheck() (production-ready) - Performance: Handles 10k formula cells in <10ms -- ⚠️ Merged cells are fully supported in the in-memory OOXML path, but not emitted by streaming writers. +- ⚠️ Merged cells are supported by the in-memory OOXML path and `writeWorkbookStream`. Pure row-stream generation (`writeStream` / `writeStreamsSeq`) has no merge API. - ❌ Hyperlinks not serialized. - ✅ Column/row properties (width, height, hidden, outlineLevel, collapsed) are fully serialized via DirectSaxEmitter. @@ -171,14 +172,19 @@ - ❌ Data validation - ❌ Named ranges -### Streaming I/O Limitations (CRITICAL) +### Streaming I/O Limitations -**Write Path** (✅ Working): -- ✅ True constant-memory streaming with `writeStream` +**Row-stream write path** (✅ Working): +- ✅ True constant-memory row streaming with `writeStream` / `writeStreamsSeq` - ✅ O(1) memory regardless of file size - ⚠️ No SST support (inline strings only - larger files) - ⚠️ Minimal styles (default only - no rich formatting) -- ⚠️ [Content_Types].xml written before SST decision made +- ⚠️ No row-stream API for workbook metadata such as merged ranges, comments, tables, and freeze panes + +**In-memory workbook SAX/StAX write path** (✅ Working): +- ✅ `writeWorkbookStream` writes an already-materialized `Workbook` through the SAX/StAX backend +- ✅ Preserves full workbook metadata handled by the OOXML writer, including merges, comments, tables, row/column properties, and freeze panes +- ⚠️ Not a row-input streaming API; the `Workbook` is already in memory **Read Path** (✅ P6.6 Complete): - ✅ **True constant-memory streaming** - uses `fs2.io.readInputStream` @@ -196,12 +202,14 @@ ### Security & Safety -**Not Implemented** (P11): -- ❌ ZIP bomb detection -- ❌ XXE (XML External Entity) prevention -- ❌ Formula injection guards -- ❌ XLSM macro preservation (should never execute) -- ❌ File size limits +**Implemented**: +- ✅ ZIP bomb detection +- ✅ XXE (XML External Entity) prevention +- ✅ Formula injection guards in in-memory and streaming writes + +**Remaining**: +- ❌ XLSM macro preservation policy and tests (macros are never executed) +- ❌ Configurable file size limits ### Advanced Features @@ -214,6 +222,7 @@ - ✅ **Excel Tables** (WI-10): Structured data with headers, AutoFilter, styling - ✅ **Benchmarks** (WI-15): JMH performance suite (XL vs POI) - ✅ **SAX Write** (WI-17): Fast SAX/StAX streaming write path +- ✅ **Security Hardening** (WI-30): ZIP bomb detection, XXE prevention, formula injection guards **Not Started** (Future): - ❌ P6b: Full case class codec derivation (Magnolia/Shapeless) @@ -221,7 +230,6 @@ - ❌ P10: Drawings (images, shapes) - ❌ P11: Charts - ❌ Pivot Tables (remaining part of P12) -- ❌ P13: Security hardening (ZIP bomb, XXE prevention) --- diff --git a/docs/design/architecture.md b/docs/design/architecture.md index 1c7c22c1..6be162cf 100644 --- a/docs/design/architecture.md +++ b/docs/design/architecture.md @@ -19,7 +19,7 @@ graph TD end subgraph Evaluator - Eval[xl-evaluator
Formula Parser] + Eval[xl-evaluator
Formula parser + evaluator] end subgraph Test @@ -37,7 +37,7 @@ graph TD - `xl-core`: Pure domain model (`Cell`, `Sheet`, `Workbook`, styles, codecs, optics, macros). - `xl-ooxml`: Pure OOXML mapping layer (`XlsxReader` / `XlsxWriter`, `OoxmlWorkbook`, `OoxmlWorksheet`, `SharedStrings`, `Styles`). - `xl-cats-effect`: Effectful interpreters (`Excel[F]` / `ExcelIO`) and true streaming I/O built on Cats Effect, fs2, and fs2-data-xml. -- `xl-evaluator`: Formula parser (`TExpr` GADT, `FormulaParser`, `FormulaPrinter`); evaluator planned (WI-08). +- `xl-evaluator`: Formula parser, printer, evaluator, function registry, dependency graph, and cross-sheet formula support. - `xl-testkit`: Reusable generators and law test helpers for the other modules. ## I/O Flow @@ -72,6 +72,7 @@ flowchart LR - **Streaming path**: - `ExcelIO.readStream` / `readSheetStream` open the ZIP and stream a worksheet’s XML through fs2‑data‑xml, yielding a `Stream[F, RowData]` with constant memory use (SST is still materialized once if present). - `ExcelIO.writeStream` / `writeStreamsSeq` write static parts once, then stream worksheet XML events directly to a `ZipOutputStream` from a `Stream[F, RowData]` without ever materializing all rows. + - `ExcelIO.writeWorkbookStream` is different: it accepts an already-materialized `Workbook`, then uses the SAX/StAX OOXML backend to reduce writer allocation while preserving the full metadata handled by `XlsxWriter`. See also: - `docs/design/io-modes.md` – deeper comparison of in-memory vs streaming modes. diff --git a/docs/design/decisions.md b/docs/design/decisions.md index d7c88345..2d12dcf2 100644 --- a/docs/design/decisions.md +++ b/docs/design/decisions.md @@ -89,13 +89,13 @@ - **Alternatives Considered**: - Streaming-only: Would lose SST/styles (unacceptable for <100k row use cases) - In-memory only: Would OOM on large files (unacceptable for ETL pipelines) - - Two-phase streaming: Deferred to P7.5 (complex, not MVP-critical) + - Full-feature row streaming: Deferred; workbook-level SST/styles/metadata make it substantially more complex than row emission - **Consequences**: - ✅ Best-of-both-worlds (full features OR constant memory) - ✅ Users choose based on needs - ❌ Two implementations to maintain - ❌ Potential confusion about which to use -- **Mitigation**: Clear documentation in README and performance-guide.md; streaming read was fixed in P6.6 (fs2.io.readInputStream) and now matches streaming write on O(1) memory. +- **Mitigation**: Clear documentation in README and performance-guide.md; streaming read was fixed in P6.6 (fs2.io.readInputStream), and in-memory workbook writes can use the SAX/StAX backend through `writeWorkbookStream`. ## ADR-012: Compression defaults to DEFLATED **Date**: 2025-11 (P6.7 planned) diff --git a/docs/design/io-modes.md b/docs/design/io-modes.md index cbc2aa56..a4caf209 100644 --- a/docs/design/io-modes.md +++ b/docs/design/io-modes.md @@ -71,7 +71,7 @@ Characteristics: --- -### Streaming Path +### Row-Streaming Path Write (`writeStream` / `writeStreamsSeq`): - Static parts (`[Content_Types].xml`, workbook relationships, minimal `styles.xml`) are written once up front. @@ -86,9 +86,11 @@ Read (`readStream` / `readSheetStream` / `readStreamByIndex`): Characteristics: - Memory: **O(1)** for worksheet data (plus the in‑memory SST and minimal bookkeeping). - Features: - - Write: inline strings only, default styles, no merged cells or advanced sheet metadata. + - Write: inline strings only, default styles, no row-stream API for merged cells or advanced sheet metadata. - Read: values and basic types; you typically use it for ETL/analytics rather than formatting‑preserving workflows. +For an already-materialized `Workbook`, `writeWorkbookStream` uses the SAX/StAX OOXML backend. It is a lower-allocation full-workbook write path, not a row-input streaming API, and it preserves the full metadata handled by `XlsxWriter`. + ``` Read: User calls readStream(path) @@ -124,7 +126,7 @@ Features: Limited (reads values only, minimal style info) **Alternatives Considered**: 1. **Streaming only**: Would lose SST/styles (larger files, no formatting) 2. **In-memory only**: Would OOM on large files -3. **Two-phase streaming**: Adds complexity, still under development (P7.5) +3. **Full-feature row streaming**: Adds complexity because SST/styles/metadata require workbook-level state **Chosen**: Two modes with clear guidance on when to use each @@ -232,12 +234,12 @@ Today: ## Why Not a Unified Implementation? -**Could we make streaming support full features?** +**Could pure row streaming support full features?** **Challenge 1**: SST requires string deduplication - Need to see all strings before writing sharedStrings.xml - But [Content_Types].xml written first (before strings known) -- **Solution**: Two-phase approach (P7.5) or optimistic SST inclusion +- **Solution**: Two-phase row streaming or optimistic SST inclusion **Challenge 2**: Styles require deduplication across workbook - Need to merge all sheet style registries @@ -251,12 +253,12 @@ Today: - Relationships typically before referenced parts - Changing order might break some readers (untested) -**Conclusion**: Streaming with full features is possible (P7.5) but requires: +**Conclusion**: Pure row streaming with full features is possible but requires: - Two-pass approach (scan data, write with indices) - OR disk-backed registries for SST/styles - OR optimistic overhead (include SST/styles even if empty) -**Timeline**: 3-4 weeks of implementation (deferred to post-MVP) +For already-materialized workbooks, the implemented `writeWorkbookStream` path uses the SAX/StAX backend and preserves the full metadata handled by `XlsxWriter`. --- @@ -330,11 +332,10 @@ test("streaming read uses constant memory"): - Default: DEFLATED + compact (smaller files) - Debug mode: STORED + pretty for inspection -### P7.5: Two-Phase Streaming Writer (3-4 weeks) -- Support SST and styles in streaming mode -- Two-pass approach: scan → write -- Disk-backed registries for very large datasets -- Achieves O(1) memory with full features +### SAX/StAX Workbook Writer +- `writeWorkbookStream` writes an already-materialized workbook through the SAX/StAX backend. +- Preserves SST, styles, merged cells, comments, tables, row/column properties, and freeze panes handled by `XlsxWriter`. +- Pure row-stream writers remain the true O(1) row-input path for generated datasets. --- diff --git a/docs/reference/performance-guide.md b/docs/reference/performance-guide.md index 4e5f2476..3ec47cc4 100644 --- a/docs/reference/performance-guide.md +++ b/docs/reference/performance-guide.md @@ -12,7 +12,7 @@ XL provides two distinct I/O implementations with different performance characte | **10k-50k rows** | Full styling | In-memory | In-memory | ~25-50MB | Good | | **50k-100k rows** | Full styling | In-memory | In-memory | ~50-100MB | Fair | | **100k+ rows** | Minimal | **Streaming** (sequential) | **Streaming** | ~10-50MB (SST dependent) | Excellent | -| **100k+ rows** | Full styling | In-memory for now | In-memory for now | O(n) | See roadmap P7.5 | +| **100k+ rows** | Full styling / metadata | In-memory | `writeWorkbookStream` if already materialized | O(n) workbook, lower writer allocation | Good | ## I/O Modes Explained @@ -56,14 +56,15 @@ excel.read(path).map { wb => --- -### Streaming Write Mode +### Row-Stream Write Mode **Use For**: Large data generation (100k+ rows) when you can live with minimal styling and inline strings. **Characteristics**: - O(1) constant memory for worksheet data (~10MB regardless of row count). - fs2‑data‑xml event streaming; no intermediate XML trees. -- **Limitations**: No SST support (inline strings only), minimal styles (no rich formatting or merged cells). +- **Limitations**: No SST support (inline strings only), minimal styles, and no API for workbook metadata such as merged cells. +- For an already-materialized workbook that needs metadata preservation with lower writer allocation, use `ExcelIO.writeWorkbookStream`. **API**: ```scala @@ -385,10 +386,10 @@ excel.read(largeFile) - ✅ DEFLATED compression by default (5-10x smaller files) - ✅ Configurable compression mode -### P7.5 (3-4 weeks) -- ✅ Two-phase streaming writer with full SST/styles -- ✅ O(1) memory for writes with rich formatting -- ✅ Disk-backed SST for very large datasets +### SAX/StAX OOXML Writer +- ✅ Lower-allocation `writeWorkbookStream` path for already-materialized workbooks +- ✅ Preserves full OOXML metadata handled by `XlsxWriter` +- ✅ Pure row-stream writers remain available for true O(1) row input ### Future (Post-1.0) - ⬜ Parallel XML serialization (4-8 threads) diff --git a/docs/reference/testing-guide.md b/docs/reference/testing-guide.md index 970d650f..0536355c 100644 --- a/docs/reference/testing-guide.md +++ b/docs/reference/testing-guide.md @@ -1,6 +1,6 @@ # Testing & Laws — Property Suites, Round-Trips, and Coverage -**Current Status**: 636/636 tests passing across 4 modules +**Current Status**: CI runs the full Mill test graph across the library, evaluator, CLI, and support modules. Use `./mill __.test` as the authoritative count. ## Test Infrastructure @@ -14,11 +14,14 @@ xl-core/test/src/com/tjclp/xl/ xl-ooxml/test/src/com/tjclp/xl/ooxml/ xl-cats-effect/test/src/com/tjclp/xl/io/ +xl-evaluator/test/src/com/tjclp/xl/formula/ +xl-cli/test/src/com/tjclp/xl/cli/ +xl-testkit/test/src/com/tjclp/xl/testkit/ ``` ## Test Coverage by Module -### xl-core: 221 tests ✅ +### xl-core: domain, style, codec, optics, and law suites ✅ #### Addressing Laws (17 tests) - **Column/Row round-trips**: `from0` → `index0` identity @@ -83,9 +86,9 @@ xl-cats-effect/test/src/com/tjclp/xl/io/ - **Whitespace**: Preserved correctly with `xml:space="preserve"` - **OOXML mapping**: `TextRun` → `......` -### xl-ooxml: 24 tests ✅ +### xl-ooxml: OOXML round-trip, surgical write, metadata, table, security, and performance suites ✅ -#### Round-Trip Tests (24 tests) +#### Round-Trip and Regression Tests - **Text cells**: String values preserve exactly - **Number cells**: Numeric precision maintained - **Boolean cells**: True/false round-trip @@ -98,9 +101,9 @@ xl-cats-effect/test/src/com/tjclp/xl/io/ - **RichText**: Multi-run formatted text preserved - **XML determinism**: Same input → same byte output -### xl-cats-effect: 18 tests ✅ +### xl-cats-effect: streaming and effectful IO suites ✅ -#### Streaming I/O (18 tests) +#### Streaming I/O - **writeStream / writeStreamsSeq**: Event-based ZIP write via fs2-data-xml - **readStream / readSheetStream / readStreamByIndex**: Event-based worksheet reads with fs2-data-xml + fs2.io.readInputStream - **Constant memory**: O(1) memory usage verified (100k rows @ ~50MB) @@ -184,10 +187,12 @@ property("get-set") { ### Run All Tests ```bash -./mill __.test # All modules (263 tests) -./mill xl-core.test # Core only (221 tests) -./mill xl-ooxml.test # OOXML only (24 tests) -./mill xl-cats-effect.test # Streaming only (18 tests) +./mill __.test # All test modules +./mill xl-core.test # Core only +./mill xl-ooxml.test # OOXML only +./mill xl-cats-effect.test # Streaming/IO only +./mill xl-evaluator.test # Formula parser/evaluator only +./mill xl-cli.test # CLI only ``` ### Run Specific Test @@ -201,7 +206,7 @@ property("get-set") { GitHub Actions runs: 1. `./mill __.checkFormat` (Scalafmt verification) 2. `./mill __.compile` (Compilation check) -3. `./mill __.test` (All 263 tests) +3. `./mill __.test` (All test modules) ## Coverage Goals @@ -215,9 +220,9 @@ GitHub Actions runs: ## Test Quality Metrics -- **All tests pass**: 263/263 ✅ +- **All tests pass**: enforced by `./mill __.test` in CI ✅ - **Zero flaky tests**: Deterministic, reproducible -- **Fast execution**: <5 seconds for full suite +- **Fast execution**: focused suites run quickly; full-suite time is tracked by CI - **Property-based**: 60%+ of tests use ScalaCheck - **Law coverage**: All algebras (Monoid, Lens, Optional) verified - **Edge cases**: Boundary values, error paths tested diff --git a/xl-cats-effect/src/com/tjclp/xl/io/ExcelIO.scala b/xl-cats-effect/src/com/tjclp/xl/io/ExcelIO.scala index ae043467..d9618dd7 100644 --- a/xl-cats-effect/src/com/tjclp/xl/io/ExcelIO.scala +++ b/xl-cats-effect/src/com/tjclp/xl/io/ExcelIO.scala @@ -4,8 +4,8 @@ import cats.effect.{Async, Sync, Resource} import cats.syntax.all.* import fs2.{Stream, Pipe} import java.nio.file.{Files as JFiles, Path} -import java.io.{BufferedOutputStream, FileOutputStream, FileInputStream} -import java.util.zip.{ZipOutputStream, ZipEntry, CRC32, ZipInputStream, ZipFile} +import java.io.{BufferedOutputStream, FileOutputStream} +import java.util.zip.{ZipOutputStream, ZipEntry, CRC32, ZipFile} import java.nio.charset.StandardCharsets import com.tjclp.xl.addressing.CellRange import com.tjclp.xl.api.Workbook @@ -26,7 +26,6 @@ import com.tjclp.xl.ooxml.metadata.{LightMetadata, WorkbookMetadataReader} import com.tjclp.xl.ooxml.style.WorkbookStyles import com.tjclp.xl.io.streaming.{SaxSharedStringsReader, SaxSingleCellReader, StreamingCellDetails} import fs2.data.xml -import fs2.data.xml.XmlEvent /** * Cats Effect interpreter for Excel operations. @@ -972,191 +971,26 @@ class ExcelIO[F[_]: Async](warningHandler: XlsxReader.Warning => F[Unit]) .drain /** - * Write workbook using streaming writer with O(1) output memory and style preservation. + * Write an in-memory workbook using the SAX/StAX OOXML writer. * - * Hybrid approach: workbook is loaded in-memory (for style extraction), but output uses streaming - * writer for O(1) output memory. Styles are fully preserved via the two-pass approach with - * StyleIndex. + * This path is intended for CLI `--stream` operations that already materialize a Workbook for + * command semantics but want the lower-allocation writer. It delegates to the full OOXML writer + * so workbook metadata such as merges, comments, tables, row/column properties, and freeze panes + * is preserved. * - * Best for: Large modified workbooks where output memory is the bottleneck. Trade-off: Extra I/O - * pass (temp file) for dimension detection. - * - * @param wb - * Workbook to write (in-memory) - * @param path - * Output file path - * @param config - * Writer configuration + * For true O(1) row input, use `writeStream`, `writeStreamsSeq`, or + * `writeStreamsSeqWithAutoDetect`. */ def writeWorkbookStream( wb: Workbook, path: Path, config: com.tjclp.xl.ooxml.WriterConfig = com.tjclp.xl.ooxml.WriterConfig.default ): F[Unit] = - import com.tjclp.xl.ooxml.style.{StyleIndex, OoxmlStyles} - if wb.sheets.isEmpty then Async[F].raiseError(new IllegalArgumentException("Workbook must have at least one sheet")) else - Sync[F] - .delay { - // Build unified style index from workbook (extracts all styles) - val (styleIndex, remappings) = StyleIndex.fromWorkbook(wb) - val ooxmlStyles = OoxmlStyles(styleIndex) - - // Prepare sheet data with remapped style IDs - val sheetsWithIndices = wb.sheets.zipWithIndex.map { case (sheet, idx) => - val sheetIndex = idx + 1 - val remapping = remappings.getOrElse(idx, Map.empty[Int, Int]) - (sheet, sheetIndex, remapping) - } - - (ooxmlStyles, sheetsWithIndices) - } - .flatMap { case (ooxmlStyles, sheetsWithIndices) => - // Create temp files for two-phase approach - Stream - .bracket( - Sync[F].delay { - sheetsWithIndices.map { case (sheet, sheetIndex, _) => - val tempFile = JFiles.createTempFile(s"xl-wbstream-$sheetIndex-", ".xml") - val bounds = new BoundsAccumulator() - (sheet, sheetIndex, tempFile, bounds) - } - } - ) { resources => - // Cleanup: delete all temp files - Sync[F].delay { - resources.foreach { case (_, _, tempFile, _) => - JFiles.deleteIfExists(tempFile) - } - }.void - } - .flatMap { resources => - // Phase 1: Stream each sheet to temp file with style remapping - val writePhase = Stream - .emits(resources.zip(sheetsWithIndices)) - .flatMap { case ((_, _, tempFile, bounds), (sheet, _, remapping)) => - streamSheetStyled(sheet, remapping) - .evalTap(row => Sync[F].delay(bounds.update(row.toRowData))) - .through(styledRowsToTempXml(tempFile, config)) - } - - // Phase 2: Assemble final ZIP with styles and dimensions - val assemblePhase = Stream.eval( - assembleWorkbookStreamZip(path, resources, ooxmlStyles, config) - ) - - writePhase ++ assemblePhase - } - .compile - .drain - } - - // Helper: Stream styled rows to temp XML file (body only) - private def styledRowsToTempXml( - tempFile: Path, - config: com.tjclp.xl.ooxml.WriterConfig - ): Pipe[F, StyledRowData, Unit] = - rows => - Stream - .bracket( - Sync[F].delay(new BufferedOutputStream(new FileOutputStream(tempFile.toFile))) - )(os => Sync[F].delay(os.close())) - .flatMap { os => - StreamingXmlWriter - .worksheetBodyStyled(rows, config.formulaInjectionPolicy) - .through(xml.render.raw()) - .through(fs2.text.utf8.encode) - .chunks - .evalMap(chunk => Sync[F].delay(os.write(chunk.toArray))) - } - - // Helper: Assemble workbook ZIP with styles - private def assembleWorkbookStreamZip( - path: Path, - resources: Seq[(Sheet, Int, Path, BoundsAccumulator)], - ooxmlStyles: com.tjclp.xl.ooxml.style.OoxmlStyles, - config: com.tjclp.xl.ooxml.WriterConfig - ): F[Unit] = - Sync[F].delay { - import com.tjclp.xl.ooxml.* - import com.tjclp.xl.addressing.SheetName - - val zip = new ZipOutputStream(new FileOutputStream(path.toFile)) - try - val sheets = resources.map { case (sheet, idx, _, _) => (sheet.name.value, idx) } - - // Content types and relationships - val contentTypes = ContentTypes.forSheetIndices( - sheetIndices = sheets.map(_._2), - hasStyles = true, - hasSharedStrings = false - ) - val rootRels = Relationships.root() - val workbookRels = Relationships.workbook( - sheetCount = sheets.size, - hasStyles = true, - hasSharedStrings = false - ) - - // Workbook with sheet refs - val sheetRefs = sheets.map { case (name, idx) => - SheetRef(SheetName.unsafe(name), idx, s"rId$idx") - } - val ooxmlWb = OoxmlWorkbook(sheets = sheetRefs.toVector) - - // Write static parts (with full styles, not minimal!) - writePartSync(zip, "[Content_Types].xml", contentTypes.toXml, config) - writePartSync(zip, "_rels/.rels", rootRels.toXml, config) - writePartSync(zip, "xl/workbook.xml", ooxmlWb.toXml, config) - writePartSync(zip, "xl/_rels/workbook.xml.rels", workbookRels.toXml, config) - writePartSync(zip, "xl/styles.xml", ooxmlStyles.toXml, config) // Full styles! - - // Write each worksheet with dimension - resources.foreach { case (_, sheetIndex, tempBodyFile, bounds) => - val wsEntry = new ZipEntry(s"xl/worksheets/sheet$sheetIndex.xml") - wsEntry.setMethod(ZipEntry.DEFLATED) - zip.putNextEntry(wsEntry) - - // Header with dimension - writeWorksheetHeader(zip, bounds.dimension) - - // Body from temp file - JFiles.copy(tempBodyFile, zip) - - // Footer - writeWorksheetFooter(zip) - - zip.closeEntry() - } - finally zip.close() - } - - // Helper: Convert Sheet to StyledRowData stream with remapped style IDs - private def streamSheetStyled( - sheet: Sheet, - remapping: Map[Int, Int] - ): Stream[F, StyledRowData] = - val rowMap = sheet.cells.values - .groupBy(_.ref.row.index1) // Group by 1-based row - .view - .mapValues { cells => - val cellValues = cells.map(c => c.ref.col.index0 -> c.value).toMap - val cellStyles = cells.flatMap { c => - c.styleId.map { sid => - // Remap local styleId to global index - val globalId = remapping.getOrElse(sid.value, sid.value) - c.ref.col.index0 -> globalId - } - }.toMap - (cellValues, cellStyles) - } - .toMap - - Stream.emits(rowMap.toSeq.sortBy(_._1)).map { case (rowIdx, (cellMap, styleMap)) => - StyledRowData(rowIdx, cellMap, styleMap) - } + val saxConfig = config.copy(backend = com.tjclp.xl.ooxml.XmlBackend.SaxStax) + writeWith(wb, path, saxConfig) // Helper: Assemble multi-sheet ZIP with dimension elements private def assembleMultiSheetZip( diff --git a/xl-cats-effect/src/com/tjclp/xl/io/StreamingXmlWriter.scala b/xl-cats-effect/src/com/tjclp/xl/io/StreamingXmlWriter.scala index bf0d288e..b5c67e0d 100644 --- a/xl-cats-effect/src/com/tjclp/xl/io/StreamingXmlWriter.scala +++ b/xl-cats-effect/src/com/tjclp/xl/io/StreamingXmlWriter.scala @@ -429,8 +429,8 @@ object StreamingXmlWriter: /** * Generate row events for styled worksheet body. * - * Emits events incrementally as rows arrive, including s="N" style attributes. Used by - * writeWorkbookStream for style-preserving output. + * Emits events incrementally as rows arrive, including s="N" style attributes, for row-streaming + * paths that provide explicit style ids. * * @param rows * Stream of styled row data diff --git a/xl-cats-effect/test/src/com/tjclp/xl/io/ExcelIOSpec.scala b/xl-cats-effect/test/src/com/tjclp/xl/io/ExcelIOSpec.scala index 4f8613ac..37437f5d 100644 --- a/xl-cats-effect/test/src/com/tjclp/xl/io/ExcelIOSpec.scala +++ b/xl-cats-effect/test/src/com/tjclp/xl/io/ExcelIOSpec.scala @@ -13,7 +13,7 @@ import com.tjclp.xl.addressing.{ARef, CellRange, Column, Row} import com.tjclp.xl.cells.{CellError, CellValue} import com.tjclp.xl.macros.ref import com.tjclp.xl.display.NumFmtFormatter -import com.tjclp.xl.ooxml.{WriterConfig, XlsxReader} +import com.tjclp.xl.ooxml.{SstPolicy, WriterConfig, XlsxReader, XlsxWriter} /** Tests for Excel streaming API */ @SuppressWarnings( @@ -1146,6 +1146,47 @@ class ExcelIOSpec extends CatsEffectSuite: case other => fail(s"Expected Text, got: $other") } + tempDir.test("writeWorkbookStream: WriterConfig.secure escapes clean read workbook") { dir => + val excel = ExcelIO.instance[IO] + val initial = Workbook("Data") + val sheet = initial.sheets(0) + .put(ref"A1", CellValue.Text("=DANGER")) + .put(ref"A2", CellValue.Text("+EVIL")) + .put(ref"A3", CellValue.Text("Normal text")) + val sourceWb = + initial + .update(initial.sheets(0).name, _ => sheet) + .getOrElse(fail("Failed to create workbook")) + + val source = dir.resolve("secure-source.xlsx") + val output = dir.resolve("secure-output.xlsx") + + for + _ <- IO { + XlsxWriter.writeWith( + sourceWb, + source, + WriterConfig.default.copy(sstPolicy = SstPolicy.Always) + ) match + case Left(err) => fail(s"Source write failed: $err") + case Right(()) => () + } + readWb <- IO(XlsxReader.read(source).fold(err => fail(s"Read failed: $err"), identity)) + _ = assert(readWb.sourceContext.exists(_.isClean), "Read workbook should be clean") + _ <- excel.writeWorkbookStream(readWb, output, WriterConfig.secure) + rewritten <- IO(XlsxReader.read(output).fold(err => fail(s"Re-read failed: $err"), identity)) + yield + val readSheet = rewritten.sheets.head + + readSheet(ref"A1").value match + case CellValue.Text(t) => assertEquals(t, "'=DANGER", "= should be escaped") + case other => fail(s"Expected Text, got: $other") + + readSheet(ref"A2").value match + case CellValue.Text(t) => assertEquals(t, "'+EVIL", "+ should be escaped") + case other => fail(s"Expected Text, got: $other") + } + // ========== Dimension Support Tests ========== tempDir.test("writeStreamWithAutoDetect: includes dimension element in output") { dir => diff --git a/xl-cli/src/com/tjclp/xl/cli/Main.scala b/xl-cli/src/com/tjclp/xl/cli/Main.scala index e147966c..86e1b521 100644 --- a/xl-cli/src/com/tjclp/xl/cli/Main.scala +++ b/xl-cli/src/com/tjclp/xl/cli/Main.scala @@ -95,7 +95,7 @@ object Main // Sheet-level write: --file, --sheet, and --output (required) // Sheet-level write: --file, --sheet, and --output (required) - // --stream enables O(1) output memory with style preservation (hybrid streaming) + // --stream uses SAX/StAX workbook writes for modifying commands. val sheetWriteSubcmds = putCmd orElse putfCmd orElse styleCmd orElse rowCmd orElse colCmd orElse autoFitCmd orElse batchCmd orElse importCmd orElse addSheetCmd orElse removeSheetCmd orElse renameSheetCmd orElse moveSheetCmd orElse copySheetCmd orElse mergeCmd orElse unmergeCmd orElse commentCmd orElse removeCommentCmd orElse clearCmd orElse fillCmd orElse sortCmd orElse freezeCmd orElse unfreezeCmd orElse copyCmd @@ -1158,7 +1158,7 @@ Use --dry-run to validate JSON without writing.""" // Other commands: regular execution path case _ => - // For write commands: stream flag enables O(1) output memory (hybrid streaming) + // For write commands: stream flag uses the SAX/StAX workbook writer // For read commands: stream flag enables O(1) input memory (true streaming) val isReadCmd = cmd match case _: CliCommand.Search | _: CliCommand.Stats | _: CliCommand.Bounds | diff --git a/xl-cli/src/com/tjclp/xl/cli/commands/CellCommands.scala b/xl-cli/src/com/tjclp/xl/cli/commands/CellCommands.scala index fecf388c..6794fccf 100644 --- a/xl-cli/src/com/tjclp/xl/cli/commands/CellCommands.scala +++ b/xl-cli/src/com/tjclp/xl/cli/commands/CellCommands.scala @@ -12,12 +12,12 @@ import com.tjclp.xl.ooxml.writer.WriterConfig /** * Cell-level command handlers. * - * Commands for merge/unmerge operations. All methods accept a `stream` parameter for O(1) output - * memory. + * Commands for merge/unmerge operations. All methods accept a `stream` parameter to use the + * SAX/StAX workbook writer. */ object CellCommands: - /** Write workbook using standard or streaming writer based on mode */ + /** Write workbook using the standard or SAX/StAX backend based on mode */ private def writeWorkbook( wb: Workbook, outputPath: Path, @@ -37,7 +37,7 @@ object CellCommands: * Merge cells in range. * * @param stream - * If true, uses streaming writer for O(1) output memory + * If true, uses the SAX/StAX workbook writer */ def merge( wb: Workbook, @@ -65,7 +65,7 @@ object CellCommands: * Unmerge cells in range. * * @param stream - * If true, uses streaming writer for O(1) output memory + * If true, uses the SAX/StAX workbook writer */ def unmerge( wb: Workbook, @@ -101,7 +101,7 @@ object CellCommands: * Flags can be combined (e.g., --styles --comments clears both). * * @param stream - * If true, uses streaming writer for O(1) output memory + * If true, uses the SAX/StAX workbook writer */ def clear( wb: Workbook, diff --git a/xl-cli/src/com/tjclp/xl/cli/commands/CommentCommands.scala b/xl-cli/src/com/tjclp/xl/cli/commands/CommentCommands.scala index 5f69c1e6..2534eeae 100644 --- a/xl-cli/src/com/tjclp/xl/cli/commands/CommentCommands.scala +++ b/xl-cli/src/com/tjclp/xl/cli/commands/CommentCommands.scala @@ -13,12 +13,12 @@ import com.tjclp.xl.ooxml.writer.WriterConfig /** * Command handlers for cell comment operations. * - * Supports adding and removing comments from cells. All methods accept a `stream` parameter for - * O(1) output memory. + * Supports adding and removing comments from cells. All methods accept a `stream` parameter to use + * the SAX/StAX workbook writer. */ object CommentCommands: - /** Write workbook using standard or streaming writer based on mode */ + /** Write workbook using the standard or SAX/StAX backend based on mode */ private def writeWorkbook( wb: Workbook, outputPath: Path, @@ -52,7 +52,7 @@ object CommentCommands: * @param config * Writer configuration * @param stream - * If true, uses streaming writer for O(1) output memory + * If true, uses the SAX/StAX workbook writer * @return * Success message */ @@ -102,7 +102,7 @@ ${saveSuffix(outputPath, stream)}""" * @param config * Writer configuration * @param stream - * If true, uses streaming writer for O(1) output memory + * If true, uses the SAX/StAX workbook writer * @return * Success message */ diff --git a/xl-cli/src/com/tjclp/xl/cli/commands/ImportCommands.scala b/xl-cli/src/com/tjclp/xl/cli/commands/ImportCommands.scala index 00324800..5ec9c0cb 100644 --- a/xl-cli/src/com/tjclp/xl/cli/commands/ImportCommands.scala +++ b/xl-cli/src/com/tjclp/xl/cli/commands/ImportCommands.scala @@ -20,7 +20,7 @@ import com.tjclp.xl.ooxml.writer.WriterConfig */ object ImportCommands: - /** Write workbook using standard or streaming writer based on mode */ + /** Write workbook using the standard or SAX/StAX backend based on mode */ private def writeWorkbook( wb: Workbook, outputPath: Path, @@ -125,7 +125,7 @@ object ImportCommands: * Overwrites any existing data at that position. * * @param stream - * If true, uses streaming writer for O(1) output memory (hybrid streaming) + * If true, uses the SAX/StAX workbook writer */ private def importToPosition( wb: Workbook, @@ -163,7 +163,7 @@ ${saveSuffix(outputPath, stream)}""" * Sheet name must not already exist in the workbook. * * @param stream - * If true, uses streaming writer for O(1) output memory (hybrid streaming) + * If true, uses the SAX/StAX workbook writer */ private def importToNewSheet( wb: Workbook, diff --git a/xl-cli/src/com/tjclp/xl/cli/commands/SheetCommands.scala b/xl-cli/src/com/tjclp/xl/cli/commands/SheetCommands.scala index a9c6870d..01154f34 100644 --- a/xl-cli/src/com/tjclp/xl/cli/commands/SheetCommands.scala +++ b/xl-cli/src/com/tjclp/xl/cli/commands/SheetCommands.scala @@ -14,11 +14,11 @@ import com.tjclp.xl.ooxml.writer.WriterConfig * Sheet management command handlers. * * Commands for add, remove, rename, move, copy sheet operations. All methods accept a `stream` - * parameter for O(1) output memory. + * parameter to use the SAX/StAX workbook writer. */ object SheetCommands: - /** Write workbook using standard or streaming writer based on mode */ + /** Write workbook using the standard or SAX/StAX backend based on mode */ private def writeWorkbook( wb: Workbook, outputPath: Path, @@ -38,7 +38,7 @@ object SheetCommands: * Add a new empty sheet to workbook. * * @param stream - * If true, uses streaming writer for O(1) output memory + * If true, uses the SAX/StAX workbook writer */ def addSheet( wb: Workbook, @@ -106,7 +106,7 @@ object SheetCommands: * Remove sheet from workbook. * * @param stream - * If true, uses streaming writer for O(1) output memory + * If true, uses the SAX/StAX workbook writer */ def removeSheet( wb: Workbook, @@ -132,7 +132,7 @@ object SheetCommands: * Rename a sheet. * * @param stream - * If true, uses streaming writer for O(1) output memory + * If true, uses the SAX/StAX workbook writer */ def renameSheet( wb: Workbook, @@ -161,7 +161,7 @@ object SheetCommands: * Move sheet to new position. * * @param stream - * If true, uses streaming writer for O(1) output memory + * If true, uses the SAX/StAX workbook writer */ def moveSheet( wb: Workbook, @@ -229,7 +229,7 @@ object SheetCommands: * Copy sheet to new name. * * @param stream - * If true, uses streaming writer for O(1) output memory + * If true, uses the SAX/StAX workbook writer */ def copySheet( wb: Workbook, @@ -261,7 +261,7 @@ object SheetCommands: * @param veryHide * If true, uses "veryHidden" state (not accessible from Excel UI, only via VBA) * @param stream - * If true, uses streaming writer for O(1) output memory + * If true, uses the SAX/StAX workbook writer */ def hideSheet( wb: Workbook, @@ -290,7 +290,7 @@ object SheetCommands: * Show a hidden sheet (make it visible). * * @param stream - * If true, uses streaming writer for O(1) output memory + * If true, uses the SAX/StAX workbook writer */ def showSheet( wb: Workbook, diff --git a/xl-cli/src/com/tjclp/xl/cli/commands/StreamingWriteCommands.scala b/xl-cli/src/com/tjclp/xl/cli/commands/StreamingWriteCommands.scala index 8e07fa74..b0910e4b 100644 --- a/xl-cli/src/com/tjclp/xl/cli/commands/StreamingWriteCommands.scala +++ b/xl-cli/src/com/tjclp/xl/cli/commands/StreamingWriteCommands.scala @@ -28,7 +28,7 @@ import scala.xml.Elem * * Provides two modes: * 1. True streaming (CSV import): End-to-end O(1) memory - * 2. Hybrid streaming (workbook write): In-memory workbook → O(1) output memory + * 2. SAX/StAX workbook write: In-memory workbook → lower-allocation full OOXML writer */ object StreamingWriteCommands: @@ -70,10 +70,10 @@ object StreamingWriteCommands: /** * Hybrid streaming workbook write. * - * Takes an in-memory workbook and writes it using the streaming writer for O(1) output memory. - * Styles are fully preserved via StyleIndex. + * Takes an in-memory workbook and writes it using the SAX/StAX OOXML backend. Styles and workbook + * metadata are preserved via the full writer pipeline. * - * Best for: Large modified workbooks where output memory is the bottleneck. + * Best for: Modified workbooks that need the full OOXML feature set with lower writer allocation. * * @param wb * Workbook to write (already in memory) @@ -707,7 +707,9 @@ object StreamingWriteCommands: // Ops that require full workbook context (not supported in streaming mode) case _: BatchParser.BatchOp.AddComment | _: BatchParser.BatchOp.RemoveComment | _: BatchParser.BatchOp.Clear | _: BatchParser.BatchOp.AutoFit | - _: BatchParser.BatchOp.AddSheet | _: BatchParser.BatchOp.RenameSheet => + _: BatchParser.BatchOp.AddSheet | _: BatchParser.BatchOp.RenameSheet | + _: BatchParser.BatchOp.Freeze | BatchParser.BatchOp.Unfreeze | + _: BatchParser.BatchOp.CopyRange => throw new Exception( "This batch operation is not supported in streaming mode. " + "Remove --stream to use full workbook mode." diff --git a/xl-cli/src/com/tjclp/xl/cli/commands/WriteCommands.scala b/xl-cli/src/com/tjclp/xl/cli/commands/WriteCommands.scala index 532bc625..129440a7 100644 --- a/xl-cli/src/com/tjclp/xl/cli/commands/WriteCommands.scala +++ b/xl-cli/src/com/tjclp/xl/cli/commands/WriteCommands.scala @@ -32,15 +32,15 @@ import com.tjclp.xl.cli.{FillDirection, SortDirection, SortKey, SortMode} * * All write commands accept a `stream` parameter: * - false (default): Use standard writer (O(n) memory) - * - true: Use streaming writer (O(1) output memory, preserves styles) + * - true: Use the lower-allocation SAX/StAX workbook writer */ object WriteCommands: /** - * Write workbook using standard or streaming writer based on mode. + * Write workbook using the standard or SAX/StAX backend based on mode. * - * When stream=true, uses writeWorkbookStream for O(1) output memory. Styles are fully preserved - * in both modes. + * When stream=true, uses writeWorkbookStream. Styles and workbook metadata are preserved in both + * modes. */ private def writeWorkbook( wb: Workbook, @@ -101,7 +101,7 @@ object WriteCommands: * @param csvSplit * Enable CSV auto-split in Mode 2 (fill pattern with comma-containing value) * @param stream - * If true, uses streaming writer for O(1) output memory + * If true, uses the SAX/StAX workbook writer */ def put( wb: Workbook, @@ -240,7 +240,7 @@ object WriteCommands: * Write formula to cell(s). * * @param stream - * If true, uses streaming writer for O(1) output memory + * If true, uses the SAX/StAX workbook writer */ def putFormula( wb: Workbook, @@ -426,7 +426,7 @@ object WriteCommands: * replace the entire style instead of merging. * * @param stream - * If true, uses streaming writer for O(1) output memory + * If true, uses the SAX/StAX workbook writer */ def style( wb: Workbook, @@ -518,7 +518,7 @@ object WriteCommands: * Set row properties (height, hide/show). * * @param stream - * If true, uses streaming writer for O(1) output memory + * If true, uses the SAX/StAX workbook writer */ def row( wb: Workbook, @@ -555,7 +555,7 @@ object WriteCommands: * (A:F). * * @param stream - * If true, uses streaming writer for O(1) output memory + * If true, uses the SAX/StAX workbook writer */ def col( wb: Workbook, @@ -606,7 +606,7 @@ object WriteCommands: * Auto-fit all columns (or specified range) based on content. * * @param stream - * If true, uses streaming writer for O(1) output memory + * If true, uses the SAX/StAX workbook writer */ def autoFit( wb: Workbook, @@ -722,7 +722,7 @@ object WriteCommands: * Apply multiple operations atomically (JSON from stdin or file). * * @param stream - * If true, uses streaming writer for O(1) output memory + * If true, uses the SAX/StAX workbook writer */ def batch( wb: Workbook, @@ -756,7 +756,7 @@ object WriteCommands: * Formulas are shifted relative to the source position using Excel's anchor rules. * * @param stream - * If true, uses streaming writer for O(1) output memory + * If true, uses the SAX/StAX workbook writer */ def fill( wb: Workbook, @@ -972,7 +972,7 @@ object WriteCommands: * @param hasHeader * If true, first row is excluded from sort * @param stream - * If true, uses streaming writer for O(1) output memory + * If true, uses the SAX/StAX workbook writer */ def sort( wb: Workbook, @@ -1019,7 +1019,7 @@ object WriteCommands: * Rows above and columns to the left of the reference are locked in place when scrolling. * * @param stream - * If true, uses streaming writer for O(1) output memory + * If true, uses the SAX/StAX workbook writer */ def freeze( wb: Workbook, @@ -1049,7 +1049,7 @@ object WriteCommands: * Remove freeze panes from the sheet. * * @param stream - * If true, uses streaming writer for O(1) output memory + * If true, uses the SAX/StAX workbook writer */ def unfreeze( wb: Workbook, @@ -1079,7 +1079,7 @@ object WriteCommands: * @param valuesOnly * If true, formula cells are materialized to their cached/computed value. * @param stream - * If true, uses streaming writer for O(1) output memory. + * If true, uses the SAX/StAX workbook writer. */ def copyRange( wb: Workbook, diff --git a/xl-cli/src/com/tjclp/xl/cli/raster/ImageMagick.scala b/xl-cli/src/com/tjclp/xl/cli/raster/ImageMagick.scala index e444789b..f0632694 100644 --- a/xl-cli/src/com/tjclp/xl/cli/raster/ImageMagick.scala +++ b/xl-cli/src/com/tjclp/xl/cli/raster/ImageMagick.scala @@ -275,7 +275,7 @@ object ImageMagick extends Rasterizer: for versionOutput <- process.stdout.through(fs2.text.utf8.decode).compile.string _ <- process.stderr.compile.drain - yield versionOutput.linesIterator.nextOption + yield versionOutput.linesIterator.nextOption() } .handleError(_ => None) } diff --git a/xl-cli/test/src/com/tjclp/xl/cli/StreamingWriteSpec.scala b/xl-cli/test/src/com/tjclp/xl/cli/StreamingWriteSpec.scala index 8eeb7cf7..00656905 100644 --- a/xl-cli/test/src/com/tjclp/xl/cli/StreamingWriteSpec.scala +++ b/xl-cli/test/src/com/tjclp/xl/cli/StreamingWriteSpec.scala @@ -13,11 +13,12 @@ import com.tjclp.xl.addressing.{ARef, Column, Row} import com.tjclp.xl.cells.CellValue import com.tjclp.xl.extensions.style import com.tjclp.xl.cli.commands.{ + CommentCommands, + CellCommands, ImportCommands, SheetCommands, StreamingWriteCommands, - WriteCommands, - CellCommands + WriteCommands } import com.tjclp.xl.cli.helpers.{CsvParser, StreamingCsvParser} import com.tjclp.xl.io.ExcelIO @@ -271,11 +272,7 @@ class StreamingWriteSpec extends FunSuite: // ========== Cell Operations in Streaming Mode ========== - // Note: Streaming mode currently uses a simplified XML writer that doesn't - // include mergeCells element in the output. This is a known limitation. - // Merged cell metadata is stored in-memory but not serialized via streaming path. - // For full fidelity, use non-streaming mode. - test("streaming: merge cells (values preserved, merge metadata may not be)".ignore) { + test("streaming: merge cells preserves merge metadata") { val outputPath = tempXlsx() try val wb = Workbook(Sheet("Test").put(ARef.from0(0, 0), CellValue.Text("Merged"))) @@ -292,6 +289,33 @@ class StreamingWriteSpec extends FunSuite: finally Files.deleteIfExists(outputPath) } + test("streaming: comments are preserved") { + val outputPath = tempXlsx() + try + val wb = Workbook(Sheet("Test").put(ARef.from0(0, 0), CellValue.Text("Commented"))) + val result = CommentCommands + .addComment( + wb, + Some(wb.sheets.head), + "A1", + "Review this value", + Some("QA"), + outputPath, + config, + stream = true + ) + .unsafeRunSync() + + assert(result.contains("Added comment to A1"), result) + assert(result.contains("Saved (streaming)"), result) + + val imported = ExcelIO.instance[IO].read(outputPath).unsafeRunSync() + val comment = imported.sheets.head.comments.get(ARef.from0(0, 0)) + assertEquals(comment.flatMap(_.author), Some("QA")) + assertEquals(comment.map(_.text.toPlainText), Some("Review this value")) + finally Files.deleteIfExists(outputPath) + } + test("streaming: clear cells") { val outputPath = tempXlsx() try @@ -612,6 +636,47 @@ class StreamingWriteSpec extends FunSuite: Files.deleteIfExists(jsonPath) } + test("streaming batch: freeze is rejected with clear unsupported message") { + val sourcePath = tempXlsx() + val outputPath = tempXlsx() + val jsonPath = tempJson("""[{"op":"freeze","ref":"B2"}]""") + try + ExcelIO.instance[IO].write(Workbook(Sheet("Test")), sourcePath).unsafeRunSync() + + val ex = intercept[Exception] { + StreamingWriteCommands + .batch(sourcePath, outputPath, Some("Test"), jsonPath.toString) + .unsafeRunSync() + } + + assert(ex.getMessage.contains("not supported in streaming mode"), ex.getMessage) + finally + Files.deleteIfExists(sourcePath) + Files.deleteIfExists(outputPath) + Files.deleteIfExists(jsonPath) + } + + test("streaming batch: copy range is rejected with clear unsupported message") { + val sourcePath = tempXlsx() + val outputPath = tempXlsx() + val jsonPath = tempJson("""[{"op":"copy","source":"A1","target":"B1"}]""") + try + val wb = Workbook(Sheet("Test").put(ARef.from0(0, 0), CellValue.Text("copy me"))) + ExcelIO.instance[IO].write(wb, sourcePath).unsafeRunSync() + + val ex = intercept[Exception] { + StreamingWriteCommands + .batch(sourcePath, outputPath, Some("Test"), jsonPath.toString) + .unsafeRunSync() + } + + assert(ex.getMessage.contains("not supported in streaming mode"), ex.getMessage) + finally + Files.deleteIfExists(sourcePath) + Files.deleteIfExists(outputPath) + Files.deleteIfExists(jsonPath) + } + test("streaming batch: formula dragging shifts references correctly") { val sourcePath = tempXlsx() val outputPath = tempXlsx() diff --git a/xl-ooxml/src/com/tjclp/xl/ooxml/SharedStrings.scala b/xl-ooxml/src/com/tjclp/xl/ooxml/SharedStrings.scala index ff7a12fa..c0d3c31a 100644 --- a/xl-ooxml/src/com/tjclp/xl/ooxml/SharedStrings.scala +++ b/xl-ooxml/src/com/tjclp/xl/ooxml/SharedStrings.scala @@ -298,12 +298,14 @@ object SharedStrings extends XmlReadable[SharedStrings]: * @return * SharedStrings with deduplicated entries and total count */ - def fromWorkbook(wb: Workbook): SharedStrings = + def fromWorkbook(wb: Workbook, escapeFormulas: Boolean = false): SharedStrings = // Stream entries using iterator for lazy evaluation val allEntries = wb.sheets.iterator.flatMap { sheet => sheet.cells.values.iterator.flatMap { cell => cell.value match - case com.tjclp.xl.cells.CellValue.Text(s) => Iterator.single(Left(s): SSTEntry) + case com.tjclp.xl.cells.CellValue.Text(s) => + val safeText = if escapeFormulas then CellValue.escape(s) else s + Iterator.single(Left(safeText): SSTEntry) case com.tjclp.xl.cells.CellValue.RichText(rt) => Iterator.single(Right(rt): SSTEntry) case _ => Iterator.empty } diff --git a/xl-ooxml/src/com/tjclp/xl/ooxml/XlsxWriter.scala b/xl-ooxml/src/com/tjclp/xl/ooxml/XlsxWriter.scala index 074aa0d6..fbfd168f 100644 --- a/xl-ooxml/src/com/tjclp/xl/ooxml/XlsxWriter.scala +++ b/xl-ooxml/src/com/tjclp/xl/ooxml/XlsxWriter.scala @@ -93,8 +93,10 @@ object XlsxWriter: config: WriterConfig ): XLResult[Unit] = try + val escapeFormulas = formulaEscapingRequested(config) + workbook.sourceContext match - case Some(ctx) if ctx.isClean => + case Some(ctx) if ctx.isClean && !escapeFormulas => // Clean workbook + file target → verbatim copy (ultra-fast) target match case OutputPath(path) => @@ -157,6 +159,9 @@ object XlsxWriter: Files.deleteIfExists(tempPath) throw e + private def formulaEscapingRequested(config: WriterConfig): Boolean = + config.formulaInjectionPolicy == FormulaInjectionPolicy.Escape + /** * Copy source file verbatim to destination (for clean workbooks). * @@ -1194,6 +1199,8 @@ object XlsxWriter: target: OutputTarget, config: WriterConfig ): Unit = + val escapeFormulas = formulaEscapingRequested(config) + // Determine modification tracking (all sheets modified if no source) val tracker = sourceContext.map(_.modificationTracker).getOrElse { ModificationTracker( @@ -1212,6 +1219,13 @@ object XlsxWriter: case None => (RelationshipGraph.empty, Set.empty[String], Set.empty[String]) + // Escaping must be applied to every text-bearing worksheet part; preserved sheets can contain + // dangerous inline strings or shared-string references that would bypass WriterConfig.secure. + val sheetsToRegenerate = + if escapeFormulas || tracker.modifiedMetadata || tracker.reorderedSheets then + workbook.sheets.indices.toSet + else tracker.modifiedSheets + val sharedStringsPath = "xl/sharedStrings.xml" val sourceHasSharedStrings = sourceContext.exists(_.partManifest.contains(sharedStringsPath)) @@ -1220,7 +1234,9 @@ object XlsxWriter: // - If modified sheets only use existing SST strings → copy verbatim (byte-perfect preservation) // - If no source SST → generate according to policy val (sst, regenerateSharedStrings) = - if sourceHasSharedStrings then + if escapeFormulas && sourceHasSharedStrings then + (Some(SharedStrings.fromWorkbook(workbook, escapeFormulas = true)), true) + else if sourceHasSharedStrings then // Parse preserved SST (sourceContext guaranteed to exist if sourceHasSharedStrings is true) val parsedSST = sourceContext.map(ctx => parsePreservedSST(ctx.sourcePath)).getOrElse(None) @@ -1281,18 +1297,21 @@ object XlsxWriter: else // No source SST - generate if policy allows val generated = config.sstPolicy match - case SstPolicy.Always => Some(SharedStrings.fromWorkbook(workbook)) + case SstPolicy.Always => Some(SharedStrings.fromWorkbook(workbook, escapeFormulas)) case SstPolicy.Never => None case SstPolicy.Auto => - if SharedStrings.shouldUseSST(workbook) then Some(SharedStrings.fromWorkbook(workbook)) + if SharedStrings.shouldUseSST(workbook) then + Some(SharedStrings.fromWorkbook(workbook, escapeFormulas)) else None (generated, generated.isDefined) val sharedStringsInOutput = sourceHasSharedStrings || regenerateSharedStrings val sstForSheets = if regenerateSharedStrings then sst else None - // Build style index (automatic optimization based on sourceContext) - val (styleIndex, sheetRemappings) = StyleIndex.fromWorkbook(workbook) + // Build style index (automatic optimization based on sourceContext). Any sheet regenerated + // from a source-backed workbook needs a local-to-workbook style remapping. + val (styleIndex, sheetRemappings) = + StyleIndex.fromWorkbook(workbook, sheetsRequiringRemapping = sheetsToRegenerate) // Parse preserved styles metadata (namespaces and dxfs) if source available val (preservedStylesAttrs, preservedStylesScope, preservedDxfs) = sourceContext match @@ -1404,13 +1423,6 @@ object XlsxWriter: else if sourceHasSharedStrings then sourceContext.foreach(ctx => copyPreservedPart(ctx.sourcePath, sharedStringsPath, zip)) - // When metadata is modified or sheets are reordered, we must regenerate ALL sheets - // because new sheets don't exist in source, sheet indices may have changed, - // or sheet positions must be remapped to match the new order. - val sheetsToRegenerate = - if tracker.modifiedMetadata || tracker.reorderedSheets then workbook.sheets.indices.toSet - else tracker.modifiedSheets - // Write sheets: regenerate modified, copy unmodified (if source available) workbook.sheets.zipWithIndex.foreach { case (sheet, idx) => if sheetsToRegenerate.contains(idx) then @@ -1456,8 +1468,6 @@ object XlsxWriter: ) } - val escapeFormulas = config.formulaInjectionPolicy == FormulaInjectionPolicy.Escape - // For SaxStax backend with no preserved metadata, use direct SAX emission // (bypasses intermediate OOXML types for 5-7x performance improvement) (config.backend, preservedMetadata) match diff --git a/xl-ooxml/src/com/tjclp/xl/ooxml/style/StyleIndex.scala b/xl-ooxml/src/com/tjclp/xl/ooxml/style/StyleIndex.scala index f61f1fe0..8edd18fa 100644 --- a/xl-ooxml/src/com/tjclp/xl/ooxml/style/StyleIndex.scala +++ b/xl-ooxml/src/com/tjclp/xl/ooxml/style/StyleIndex.scala @@ -54,14 +54,20 @@ object StyleIndex: * * @param wb * The workbook to index + * @param sheetsRequiringRemapping + * Source-backed sheets that will be regenerated from the domain model and therefore need + * sheet-local style IDs remapped to workbook-level style IDs. * @return * (StyleIndex for writing, Map[sheetIndex -> Map[localStyleId -> globalStyleId]]) */ - def fromWorkbook(wb: Workbook): (StyleIndex, Map[Int, Map[Int, Int]]) = + def fromWorkbook( + wb: Workbook, + sheetsRequiringRemapping: Set[Int] = Set.empty + ): (StyleIndex, Map[Int, Map[Int, Int]]) = wb.sourceContext match case Some(ctx) => // Has source: surgical mode (preserve original structure) - fromWorkbookWithSource(wb, ctx) + fromWorkbookWithSource(wb, ctx, sheetsRequiringRemapping) case None => // No source: full deduplication (optimal compression) fromWorkbookWithoutSource(wb) @@ -162,16 +168,20 @@ object StyleIndex: * This variant is used during surgical modification to avoid corruption: * - Preserves ALL original styles from source (including duplicates) * - Deduplicates only truly new styles from modified sheets - * - Generates remappings for ALL sheets (not just modified ones) + * - Generates remappings for regenerated sheets (not just modified ones) * - * All sheets need remappings because metadata-only changes (add-sheet, reorder) cause ALL sheets - * to be regenerated from the domain model, not copied verbatim from source. Without remappings, - * unmodified-but-regenerated sheets would fall back to styleId 0, stripping all styles (TJC-751). + * Sheets need remappings when they are regenerated from the domain model, not copied verbatim + * from source. Metadata-only changes (add-sheet, reorder) and escape-driven secure writes can + * force otherwise-unmodified sheets through regeneration. Without remappings, those sheets would + * fall back to styleId 0, stripping all styles (TJC-751). * * @param wb * The workbook with modified sheets * @param ctx * Source context providing modification tracker and original file path + * @param sheetsRequiringRemapping + * Additional sheet indices that the writer will regenerate even if the modification tracker is + * clean, such as secure formula-escaping rewrites. * @return * (StyleIndex with all original + new styles, Map[sheetIdx -> remapping]) */ @@ -184,7 +194,8 @@ object StyleIndex: ) private def fromWorkbookWithSource( wb: Workbook, - ctx: SourceContext + ctx: SourceContext, + sheetsRequiringRemapping: Set[Int] ): (StyleIndex, Map[Int, Map[Int, Int]]) = val tracker = ctx.modificationTracker val needsRemappingForAll = tracker.modifiedMetadata || tracker.reorderedSheets @@ -251,10 +262,15 @@ object StyleIndex: // Step 4: Process sheets for style remapping // When metadata or reorder changes force ALL sheets to be regenerated from domain model, // every sheet needs a remapping — otherwise unmodified sheets fall back to styleId 0, - // stripping all styles (TJC-751). When only specific sheets are modified, unmodified - // sheets are copied verbatim from source and don't need remappings. + // stripping all styles (TJC-751). Secure writes can also force clean sheets through + // regeneration, so the writer passes those sheet indices explicitly. val remappings = wb.sheets.zipWithIndex.map { case (sheet, sheetIdx) => - if needsRemappingForAll || tracker.modifiedSheets.contains(sheetIdx) then + val needsRemapping = + needsRemappingForAll || + tracker.modifiedSheets.contains(sheetIdx) || + sheetsRequiringRemapping.contains(sheetIdx) + + if needsRemapping then val registry = sheet.styleRegistry val remapping = mutable.Map[Int, Int]() diff --git a/xl-ooxml/test/src/com/tjclp/xl/ooxml/FormulaInjectionSpec.scala b/xl-ooxml/test/src/com/tjclp/xl/ooxml/FormulaInjectionSpec.scala index 48711ba9..ae85b927 100644 --- a/xl-ooxml/test/src/com/tjclp/xl/ooxml/FormulaInjectionSpec.scala +++ b/xl-ooxml/test/src/com/tjclp/xl/ooxml/FormulaInjectionSpec.scala @@ -6,7 +6,10 @@ import com.tjclp.xl.cells.CellValue import com.tjclp.xl.codec.CellCodec.given import com.tjclp.xl.macros.ref import com.tjclp.xl.sheets.Sheet +import com.tjclp.xl.sheets.syntax.* +import java.nio.charset.StandardCharsets import java.nio.file.{Files, Path} +import java.util.zip.ZipFile /** * Security tests for formula injection prevention. @@ -23,6 +26,15 @@ import java.nio.file.{Files, Path} @SuppressWarnings(Array("org.wartremover.warts.IterableOps")) class FormulaInjectionSpec extends FunSuite: + private def readZipEntry(path: Path, entryName: String): String = + val zip = new ZipFile(path.toFile) + try + val entry = Option(zip.getEntry(entryName)).getOrElse(fail(s"Missing ZIP entry $entryName")) + val input = zip.getInputStream(entry) + try new String(input.readAllBytes(), StandardCharsets.UTF_8) + finally input.close() + finally zip.close() + // ========== CellValue.escape() tests ========== test("escape() prefixes = with single quote") { @@ -258,6 +270,60 @@ class FormulaInjectionSpec extends FunSuite: finally Files.deleteIfExists(tempFile) } + test("WriterConfig.secure escapes clean read workbook instead of preserving source sheets") { + val boldStyle = CellStyle.default.withFont(Font("Arial", 12.0, bold = true)) + val sheet = Sheet("Test") + .put(ref"A1", CellValue.Text("=DANGER"), boldStyle) + .put("A2" -> "+EVIL") + .put("A3" -> "Normal text") + + val wb = Workbook(sheet) + val source = Files.createTempFile("test-secure-source-", ".xlsx") + val output = Files.createTempFile("test-secure-output-", ".xlsx") + + try + XlsxWriter.writeWith( + wb, + source, + WriterConfig.default.copy(sstPolicy = SstPolicy.Always) + ) match + case Left(err) => fail(s"Source write failed: $err") + case Right(()) => () + + val readWb = XlsxReader.read(source).fold(err => fail(s"Read failed: $err"), identity) + assert(readWb.sourceContext.exists(_.isClean), "Read workbook should be clean") + + XlsxWriter.writeWith(readWb, output, WriterConfig.secure) match + case Left(err) => fail(s"Secure write failed: $err") + case Right(()) => () + + val rewritten = XlsxReader.read(output).fold(err => fail(s"Re-read failed: $err"), identity) + val readSheet = rewritten.sheets.head + + readSheet.cells.get(ref"A1").map(_.value) match + case Some(CellValue.Text(t)) => assertEquals(t, "'=DANGER") + case other => fail(s"Expected Text, got: $other") + + val a1Style = readSheet.getCellStyle(ref"A1") + assert(a1Style.nonEmpty, "A1 style should survive secure clean-source rewrite") + assertEquals(a1Style.map(_.font.bold), Some(true)) + + readSheet.cells.get(ref"A2").map(_.value) match + case Some(CellValue.Text(t)) => assertEquals(t, "'+EVIL") + case other => fail(s"Expected Text, got: $other") + + val sharedStringsXml = readZipEntry(output, "xl/sharedStrings.xml") + assert(sharedStringsXml.contains("'=DANGER"), "SST should contain escaped dangerous text") + assert( + !sharedStringsXml.contains("=DANGER"), + "SST should not preserve raw dangerous text" + ) + + finally + Files.deleteIfExists(source) + Files.deleteIfExists(output) + } + test("FormulaInjectionPolicy enum values") { // Verify enum values exist and are distinct assert(FormulaInjectionPolicy.Escape != FormulaInjectionPolicy.None) diff --git a/xl-ooxml/test/src/com/tjclp/xl/ooxml/XlsxWriterRichTextEdgeCasesSpec.scala b/xl-ooxml/test/src/com/tjclp/xl/ooxml/XlsxWriterRichTextEdgeCasesSpec.scala index 955689cb..d25ac6e8 100644 --- a/xl-ooxml/test/src/com/tjclp/xl/ooxml/XlsxWriterRichTextEdgeCasesSpec.scala +++ b/xl-ooxml/test/src/com/tjclp/xl/ooxml/XlsxWriterRichTextEdgeCasesSpec.scala @@ -27,9 +27,7 @@ import munit.FunSuite @SuppressWarnings(Array("org.wartremover.warts.OptionPartial")) class XlsxWriterRichTextEdgeCasesSpec extends FunSuite: - // TODO: Enable when rawRPrXml preservation is fully implemented - // Currently family attribute may not be preserved in all cases - test("preserves RichText rPr formatting byte-perfect (rawRPrXml)".ignore) { + test("preserves RichText rPr formatting byte-perfect (rawRPrXml)") { // Regression test for commit 802e020 // Bug: RichText formatting lost (Times New Roman, underline styles) // Solution: Added rawRPrXml field to TextRun for preserving original XML @@ -73,9 +71,9 @@ class XlsxWriterRichTextEdgeCasesSpec extends FunSuite: "Vertical alignment (superscript/subscript) lost" ) - // Should have font family attribute + // Should have font family element and attribute assert( - sstXml.contains("family="), + sstXml.contains(" fail(s"Failed to write: $err"), identity) - // Verify sheet1.xml and SST have xml:space="preserve" + // Verify sheet1.xml points at SST and SST has xml:space="preserve" val outputZip = new ZipFile(output.toFile) val sheetXml = readEntryString(outputZip, outputZip.getEntry("xl/worksheets/sheet1.xml")) val sstXml = readEntryString(outputZip, outputZip.getEntry("xl/sharedStrings.xml")) - // Simple text with leading space (A1) + // Simple text with leading space (A1) is stored in sharedStrings.xml. val a1Match = """]*>.*?""".r.findFirstIn(sheetXml) assert(a1Match.isDefined, "Cell A1 not found") + assert(a1Match.get.contains("""t="s""""), "A1 should reference sharedStrings.xml") assert( - a1Match.get.contains("""xml:space="preserve""""), - "A1 (leading space) should have xml:space=\"preserve\"" + sstXml.contains(""" Leading space"""), + "A1 shared string should have xml:space=\"preserve\"" ) // RichText with double spaces (B1) @@ -149,9 +146,7 @@ class XlsxWriterRichTextEdgeCasesSpec extends FunSuite: Files.deleteIfExists(output) } - // TODO: Enable when empty row preservation is fully implemented in regenerated sheets - // Currently empty rows may get cells during modification (expected behavior for regenerated sheets) - test("preserves empty rows from original (row with no cells but with attributes)".ignore) { + test("preserves empty rows from original (row with no cells but with attributes)") { // Regression test for commit e1f36fe // Bug: Empty rows (like Row 1) were dropped during regeneration // Solution: Filter and preserve preserved.rows.filter(_.cells.isEmpty), combine with rows containing cells @@ -184,7 +179,7 @@ class XlsxWriterRichTextEdgeCasesSpec extends FunSuite: ) // Row 1 should have attributes but no cells - val row1Match = """]*>.*?""".r.findFirstIn(sheetXml) + val row1Match = """]*/>|]*>""".r.findFirstIn(sheetXml) assert(row1Match.isDefined, "Row 1 element not found") assert( !row1Match.get.contains("]*>.*?""".r.findFirstIn(sheetXml) + val row2Match = """]*/>|]*>""".r.findFirstIn(sheetXml) assert(row2Match.isDefined, "Row 2 element not found") assert( row2Match.get.contains("s="), @@ -223,9 +218,7 @@ class XlsxWriterRichTextEdgeCasesSpec extends FunSuite: Files.deleteIfExists(output) } - // TODO: Enable when row-level style output in regenerated sheets matches exact Excel pattern - // Currently row-level styles may be validated/removed during regeneration (per OOXML spec) - test("preserves row-level style when cells have different styles (Excel pattern)".ignore) { + test("preserves row-level style when cells have different styles (Excel pattern)") { // Regression test for commit 802e020 // Bug: Row s and customFormat invalid when cells have varied styles // Solution: Validate and remove if cells have different styles (per OOXML spec)