Add configurable list formatting for CSV/TSV serialization#3134
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #3134 +/- ##
==========================================
+ Coverage 80.11% 83.87% +3.76%
==========================================
Files 148 148
Lines 16992 17010 +18
Branches 3508 3515 +7
==========================================
+ Hits 13613 14267 +654
+ Misses 2638 1950 -688
- Partials 741 793 +52
e788b02 to
2264cc3
Compare
Pull request overview
This PR adds configurable list formatting for CSV/TSV serialization to address issue #3041, enabling users to control how multivalued fields are serialized (with or without brackets, custom delimiters, and whitespace handling).
Changes:
- Adds schema-level annotations (`list_syntax`, `list_delimiter`, `list_strip_whitespace`) to control multivalued field formatting in CSV/TSV output
- Implements CLI options (`--list-syntax`, `--list-delimiter`, `--list-strip-whitespace`) to override schema annotations
- Extends CSV/TSV loaders and dumpers to handle plaintext-style lists (e.g., `a|b|c`) in addition to python-style lists (e.g., `[a|b|c]`)
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| docs/data/csvs.md | Comprehensive documentation of the new configuration options with examples and usage instructions |
| packages/linkml/src/linkml/converter/cli.py | Adds three new CLI options for list formatting that apply to both input and output CSV/TSV operations |
| packages/linkml_runtime/src/linkml_runtime/dumpers/delimited_file_dumper.py | Implements list formatting configuration for CSV/TSV output, reading from schema annotations or CLI overrides |
| packages/linkml_runtime/src/linkml_runtime/loaders/delimited_file_loader.py | Implements list formatting configuration for CSV/TSV input, including helper functions for annotation reading and whitespace stripping |
| tests/linkml_runtime/test_loaders_dumpers/test_csv_tsv_loader_dumper.py | Comprehensive integration tests covering plaintext mode, custom delimiters, whitespace handling, and edge cases |
| tests/linkml_runtime/test_utils/test_csv_utils.py | Unit tests for annotation reading (contains a test schema that uses slot-level annotations inconsistently with implementation) |
Re: patch coverage
Update: Added CLI integration tests in commit a1db0e5. The CLI options ( Test coverage includes:
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
cmungall
left a comment
make the tests more consistent with other tests
Consolidated unaddressed feedback: list formatting (PR #3134)
Gathering all outstanding feedback from multiple sources so nothing falls through the cracks.

1. Chris's CHANGES_REQUESTED review (Feb 5): addressed in code, awaiting re-review
All 5 inline comments have been addressed:
Status: Code changes pushed, Chris has not re-reviewed yet.

2. Chris's design spec from Dec 15 rolling meeting notes
The agreed-upon design from our Dec 15 meeting (Chris & Mark rolling notes):

```yaml
# Schema annotation spec
attributes:
  name_list:
    multivalued: true
    annotations:
      list_syntax: python  ## allowed: python | plaintext
      list_delimiter: "; "  ## must include space explicitly. No effect if list_syntax == 'python'
```

Mapping to json-flattener:
Cascading: Use SchemaView. Default to schema-level annotations; slot-level annotations override.
Additional guidance from Chris:
Need to verify: Does the current implementation match this spec exactly? Specifically:

3. Chris's helper function visibility comment (PR inline)
Chris said "consider making more clearly intended as public (doctests might work better if it's intended as private)". I made them public and noted the repo has no doctest infrastructure; filed #3146 to track adding it. Chris hasn't responded to this.

4. Copilot review: invalid
9b1e739 to
d0aa8d5
Compare
@cmungall Heads up on one deviation from our Dec 15 spec. We discussed slot-level annotations overriding schema-level defaults, but the current implementation only supports schema-level annotations.
Why: json-flattener's
I think schema-level-only is the right call here. The main use case driving this (MIxS-style "semicolon-delimited, no brackets") is uniform across all multivalued fields in a given schema anyway. If a per-slot use case comes up later, we can add it via json-flattener at that point.
Also rebased onto current main and resolved the conflict with #3118's new converter tests. All 54 tests pass (25 converter + 29 CSV/TSV runtime). Ready for re-review when you get a chance.

Scope and known limitations
What this PR does and doesn't touch
This PR modifies the That means after this merges,
Schema-level only annotations
Our Dec 15 spec discussed slot-level annotation overrides via SchemaView. The implementation is schema-level only; see my earlier comment for why (json-flattener's
Known edge cases
- Update CLI help and docs to say "when loading and dumping" (not just loading)
- Simplify CLI help text to avoid formatting issues with special characters
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Move annotations from slot-level to schema-level in test schema to match actual implementation behavior
- Remove unused variables from skipped test
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Tests for --list-syntax, --list-delimiter, and --list-strip-whitespace options in linkml-convert CLI.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Convert test_csv_utils.py from hybrid unittest/pytest to pure pytest
- Convert class-based tests to plain functions in test_csv_tsv_loader_dumper.py
- Remove verbose comment blocks and chatty docstrings
- Rename helper functions to public (drop underscore prefix): get_list_config_from_annotations, enhance_configmap_for_multivalued_primitives, strip_whitespace_from_lists
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Mechanical conversion: drop class wrapper, remove self, remove import unittest. No behavior change. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Only accept case-insensitive "true" or "false" for the list_strip_whitespace annotation, with a warning for invalid values. Aligns with the direction in #3144 to avoid YAML 1.1 boolean conventions (yes/no, on/off, 0/1) in CSV-related configuration. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
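The strict parsing rule in this commit message can be illustrated with a short sketch. This is not the PR's code: the function name `parse_strict_bool` and the choice to fall back to a caller-supplied default after warning are assumptions made for illustration.

```python
import logging

logger = logging.getLogger(__name__)


def parse_strict_bool(value, default=True):
    """Hypothetical helper: accept only case-insensitive 'true' or 'false'.

    Anything else (yes/no, on/off, 0/1, ...) logs a warning and returns
    `default`. The fallback behavior is an assumption, not confirmed by the PR.
    """
    text = str(value).strip().lower()
    if text == "true":
        return True
    if text == "false":
        return False
    logger.warning(
        "Invalid value %r for list_strip_whitespace; expected 'true' or 'false'",
        value,
    )
    return default
```

With this sketch, `parse_strict_bool("TRUE")` is accepted, while YAML 1.1 style values such as `"yes"` or `"0"` are rejected with a warning rather than silently coerced.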
When enabled (via schema annotation or CLI flag), raises ValueError before serializing if any multivalued field value contains the list delimiter character. This catches round-trip corruption at write time rather than silently producing corrupt output. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
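A minimal sketch of the write-time check described in this commit message, assuming rows are plain dicts and multivalued fields are Python lists. The names `find_delimiter_conflicts` and `refuse_delimiter_in_rows` are hypothetical and are not the PR's actual helpers.

```python
def find_delimiter_conflicts(rows, delimiter):
    """Return (row_index, field, value) for each list item containing the delimiter."""
    conflicts = []
    for index, row in enumerate(rows):
        for field, value in row.items():
            if not isinstance(value, list):
                continue  # only multivalued fields are checked
            for item in value:
                if delimiter in str(item):
                    conflicts.append((index, field, item))
    return conflicts


def refuse_delimiter_in_rows(rows, delimiter="|"):
    """Raise ValueError before serialization if any list item contains the delimiter."""
    conflicts = find_delimiter_conflicts(rows, delimiter)
    if conflicts:
        index, field, item = conflicts[0]
        raise ValueError(
            f"Row {index}, field {field!r}: value {item!r} contains the list "
            f"delimiter {delimiter!r}; refusing to write ambiguous output"
        )
```

Single-valued fields are deliberately ignored here, since only list items are joined with the inner delimiter.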
…ed param
Address Kevin Schaper's review feedback on PR #3134:
1. Rename `list_syntax` annotation and CLI option to `list_wrapper` with permissible values: `square` [a|b], `curly` {a|b}, `paren` (a|b), `none` a|b (default: square). Replaces the old `python`/`plaintext` naming.
2. Extract shared list formatting functions (get_list_config_from_annotations, strip_whitespace_from_lists, enhance_configmap_for_multivalued_primitives, check_data_for_delimiter) into a new `list_utils.py` module in linkml_runtime/utils/, instead of defining them in the loader and importing into the dumper.
3. Remove unused `index_slot` parameter from get_list_config_from_annotations; the function only reads schema-level annotations and never used this parameter.
Introduces a ListConfig dataclass to bundle the four list-formatting settings (markers, delimiter, strip_whitespace, refuse_delimiter_in_data) that were previously passed as loose variables. Config resolution (defaults -> schema annotations -> CLI overrides) now happens in a single get_list_config() function rather than being duplicated in both the loader and dumper. Addresses review feedback from sneakers-the-rat and Kevin Schaper.
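The layered resolution described in this commit message (defaults, then schema annotations, then CLI overrides) might look roughly like the sketch below. The class name `ListConfigSketch`, the function `resolve_list_config`, and the dict-based inputs are assumptions for illustration; the real `ListConfig` and `get_list_config()` signatures are not shown in this thread. Field names and defaults follow the settings listed in the PR summary.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class ListConfigSketch:
    """Hypothetical stand-in for the PR's ListConfig dataclass."""

    markers: tuple = ("[", "]")
    delimiter: str = "|"
    strip_whitespace: bool = True
    refuse_delimiter_in_data: bool = False


def resolve_list_config(schema_settings=None, cli_settings=None):
    """Apply defaults, then schema-level settings, then CLI overrides.

    A value of None in a layer means "not set" and leaves the earlier value
    in place; unknown keys are ignored.
    """
    known = {f.name for f in fields(ListConfigSketch)}
    config = ListConfigSketch()
    for layer in (schema_settings or {}, cli_settings or {}):
        updates = {k: v for k, v in layer.items() if k in known and v is not None}
        config = replace(config, **updates)
    return config
```

Resolving once and passing a single frozen object to both loader and dumper avoids the duplicated precedence logic the commit message mentions.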
368acb7 to
1e833ca
Compare
…esolver Mirrors the ListConfig pattern from PR #3134: introduces a BooleanConfig dataclass that bundles truthy/falsy values (loading) and output format (dumping). Config resolution (defaults -> schema annotations -> CLI overrides) now happens in a single get_boolean_config() function in boolean_utils.py rather than being split across the loader and dumper. The loader and dumper are now much leaner — boolean-specific logic (coercion, output conversion, slot introspection) lives in one place.
New Summary from 2026-02-11
Adds configurable multivalued field formatting for CSV/TSV serialization, via schema-level annotations and CLI options.
Before: Multivalued fields always serialize with brackets: `[value1|value2|value3]`
After: With `list_wrapper: none`, fields serialize without brackets: `value1|value2|value3`
Closes #3041. Addresses the core of #2581 (filed by @matentzn as a blocker for supporting common delimited formats like pipe-separated, semicolon-separated, etc.).
Origin and design
This follows the design @cmungall and I agreed on in our Dec 15 rolling meeting notes:
With mapping to json-flattener: `list_wrapper: none` → `csv_list_markers=("", "")`, `list_delimiter` → `csv_inner_delimiter`.
Deviation from spec: schema-level only
The Dec 15 spec discussed slot-level annotations overriding schema-level defaults via SchemaView. The implementation is schema-level only. json-flattener's `GlobalConfig` defines `csv_list_markers` and `csv_inner_delimiter` at the top level with no per-column configuration path, so slot-level overrides would require extending json-flattener itself. The primary use case (MIxS-style "semicolon-delimited, no brackets") is uniform across all multivalued fields in a schema, so this felt like the right scope for now.
No changes to csvutils.py
Per Chris's guidance ("prefer no changes in csvutils.py"), configuration is handled in the loader and dumper rather than in the shared utility layer.
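The annotation-to-json-flattener mapping described under "Origin and design" can be sketched as a small lookup. The function name `flattener_list_settings` is hypothetical, and the sketch returns a plain dict rather than constructing json-flattener's `GlobalConfig`, so that it stays self-contained; passing these keys on to `GlobalConfig` is assumed from the field names given above.

```python
# Wrapper names and their bracket pairs, as listed in the PR summary.
LIST_WRAPPER_MARKERS = {
    "square": ("[", "]"),
    "curly": ("{", "}"),
    "paren": ("(", ")"),
    "none": ("", ""),
}


def flattener_list_settings(list_wrapper="square", list_delimiter="|"):
    """Translate annotation values into the two json-flattener setting names."""
    try:
        markers = LIST_WRAPPER_MARKERS[list_wrapper]
    except KeyError:
        raise ValueError(
            f"Unknown list_wrapper {list_wrapper!r}; "
            f"expected one of {sorted(LIST_WRAPPER_MARKERS)}"
        ) from None
    return {"csv_list_markers": markers, "csv_inner_delimiter": list_delimiter}
```

Because both settings are global in this mapping, every multivalued column in a file shares one wrapper and one delimiter, which matches the schema-level-only scope discussed above.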
SSSOM alignment
@matentzn suggested checking how SSSOM handles multivalued field packing. With `list_wrapper: none` and `list_delimiter: "|"`, our output matches SSSOM's TSV spec exactly (plain `a|b|c`, no brackets, strip whitespace). LinkML generalizes what SSSOM hardcodes, which is appropriate for a general-purpose modeling language where different schemas need different conventions.
The SSSOM ecosystem is actively working on delimiter-in-value escaping (sssom#507, sssom-java#17). This PR doesn't implement escaping either, but the annotation-based configuration provides the right foundation to add it later.
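To make the SSSOM-style packing concrete, here is a rough sketch of formatting and parsing a single cell without brackets. `format_list` and `parse_list` are hypothetical names, not functions from this PR or from json-flattener, and treating an empty cell as an empty list is an assumption.

```python
def format_list(values, markers=("", ""), delimiter="|"):
    """Join values with the delimiter and wrap them in the given markers."""
    opening, closing = markers
    return f"{opening}{delimiter.join(str(v) for v in values)}{closing}"


def parse_list(cell, markers=("", ""), delimiter="|", strip_whitespace=True):
    """Split a cell back into items, removing markers if both are present."""
    opening, closing = markers
    text = cell
    if opening and closing and text.startswith(opening) and text.endswith(closing):
        text = text[len(opening):len(text) - len(closing)]
    if text == "":
        return []  # assumption: an empty cell means an empty list
    items = text.split(delimiter)
    if strip_whitespace:
        items = [item.strip() for item in items]
    return items
```

Since no escaping is implemented, a value that itself contains the delimiter does not survive a round trip: `format_list(["x|y", "z"])` produces `x|y|z`, which parses back as three items. That is the corruption case the delimiter-in-data check is meant to catch.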
What's not in scope
- `linkml-validate` loader: This PR modifies the `linkml_runtime` loader/dumper (used by `linkml-convert`), not the separate `linkml.validator.loaders.delimited_file_loader`. Filed #3147 ("linkml-validate CSV/TSV loader lacks schema-aware parsing (boolean coercion, list splitting)") to track unifying them.
- The `refuse_delimiter_in_data` annotation/CLI flag raises a `ValueError` before serialization if any value contains the delimiter, preventing silent data corruption. Full escaping (e.g., SSSOM 1.1's backslash approach) can be added later.
- `list_elements_ordered: true`. Worth noting but orthogonal to this PR.
Configuration reference
Schema annotations
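| Annotation | Values | Default | Description |
|---|---|---|---|
| `list_wrapper` | `square`, `curly`, `paren`, `none` | `square` | `square` uses `[a\|b]`, `curly` uses `{a\|b}`, `paren` uses `(a\|b)`, `none` has no wrapper `a\|b` |
| `list_delimiter` | | `\|` (pipe) | |
| `list_strip_whitespace` | `true`, `false` | `true` | |
| `refuse_delimiter_in_data` | `true`, `false` | `false` | `ValueError` if any multivalued field value contains the delimiter, preventing silent data corruption |

CLI options (override schema annotations)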
```
linkml-convert -s schema.yaml -C Container -S items -t tsv \
  --list-wrapper none \
  --list-delimiter "|" \
  --list-strip-whitespace \
  --refuse-delimiter-in-data \
  input.yaml
```

| Option | Description |
|---|---|
| `--list-wrapper` | `square`, `curly`, `paren`, or `none` |
| `--list-delimiter` | |
| `--list-strip-whitespace` / `--no-list-strip-whitespace` | |
| `--refuse-delimiter-in-data` / `--no-refuse-delimiter-in-data` | |

Review feedback addressed
From @cmungall's review (Feb 5):
From Copilot:
`list_wrapper` values
Coverage:
Files changed
- `docs/data/csvs.md`: documentation
- `packages/linkml/src/linkml/converter/cli.py`: CLI options
- `packages/linkml_runtime/src/linkml_runtime/dumpers/delimited_file_dumper.py`: output formatting
- `packages/linkml_runtime/src/linkml_runtime/loaders/delimited_file_loader.py`: input parsing
- `tests/linkml/test_utils/test_converter.py`: CLI tests
- `tests/linkml_runtime/test_loaders_dumpers/test_csv_tsv_loader_dumper.py`: runtime tests
- `tests/linkml_runtime/test_utils/test_csv_utils.py`: utility tests