Add configurable list formatting for CSV/TSV serialization #3134

Merged
turbomam merged 27 commits into main from issue-3041-annotation-based-csv-delimiters on Feb 27, 2026

Conversation

turbomam (Member) commented Feb 4, 2026

New Summary from 2026-02-11

Adds configurable multivalued field formatting for CSV/TSV serialization, via schema-level annotations and CLI options.

Before: Multivalued fields always serialize with brackets: [value1|value2|value3]
After: With list_wrapper: none, fields serialize without brackets: value1|value2|value3

Closes #3041. Addresses the core of #2581 (filed by @matentzn as a blocker for supporting common delimited formats like pipe-separated, semicolon-separated, etc.).

Origin and design

This follows the design @cmungall and I agreed on in our Dec 15 rolling meeting notes:

annotations:
  list_wrapper: none   # square (default) | curly | paren | none
  list_delimiter: "; "     # any string; space must be explicit

Mapping to json-flattener: list_wrapper: none → csv_list_markers=("", ""), and list_delimiter → csv_inner_delimiter.
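
As a rough illustration of the mapping above, the sketch below turns a list_wrapper name into an opening/closing marker pair and joins the values with list_delimiter. The names LIST_MARKERS and format_list are hypothetical and are not part of linkml_runtime or json-flattener; only the wrapper names and their marker characters come from this PR description.

```python
# Hypothetical helper, for illustration only (not the linkml_runtime code).
# Wrapper names and marker pairs follow the PR description.
LIST_MARKERS = {
    "square": ("[", "]"),
    "curly": ("{", "}"),
    "paren": ("(", ")"),
    "none": ("", ""),
}


def format_list(values, list_wrapper="square", list_delimiter="|"):
    """Join values with the delimiter and wrap them in the chosen markers."""
    opening, closing = LIST_MARKERS[list_wrapper]
    return f"{opening}{list_delimiter.join(values)}{closing}"


values = ["value1", "value2", "value3"]
print(format_list(values))                       # [value1|value2|value3]
print(format_list(values, list_wrapper="none"))  # value1|value2|value3
print(format_list(["a", "b"], list_wrapper="none", list_delimiter="; "))  # a; b
```

The tuple in LIST_MARKERS plays the role that csv_list_markers plays in json-flattener, and list_delimiter plays the role of csv_inner_delimiter.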

Deviation from spec: schema-level only

The Dec 15 spec discussed slot-level annotations overriding schema-level defaults via SchemaView. The implementation is schema-level only. json-flattener's GlobalConfig defines csv_list_markers and csv_inner_delimiter at the top level with no per-column configuration path, so slot-level overrides would require extending json-flattener itself. The primary use case (MIxS-style "semicolon-delimited, no brackets") is uniform across all multivalued fields in a schema, so this felt like the right scope for now.

No changes to csvutils.py

Per Chris's guidance ("prefer no changes in csvutils.py"), configuration is handled in the loader and dumper rather than in the shared utility layer.

SSSOM alignment

@matentzn suggested checking how SSSOM handles multivalued field packing. With list_wrapper: none and list_delimiter: "|", our output matches SSSOM's TSV spec exactly (plain a|b|c, no brackets, strip whitespace). LinkML generalizes what SSSOM hardcodes — appropriate for a general-purpose modeling language where different schemas need different conventions.

The SSSOM ecosystem is actively working on delimiter-in-value escaping (sssom#507, sssom-java#17). This PR doesn't implement escaping either, but the annotation-based configuration provides the right foundation to add it later.

What's not in scope

  • linkml-validate loader: This PR modifies the linkml_runtime loader/dumper (used by linkml-convert), not the separate linkml.validator.loaders.delimited_file_loader. Filed #3147 ("linkml-validate CSV/TSV loader lacks schema-aware parsing (boolean coercion, list splitting)") to track unifying them.
  • Pandera / column-oriented data: @tfliss and @sneakers-the-rat raised broader tabular concerns in Discussion #1996. This is row-oriented only — those feel like follow-up work for the tabular data library discussion.
  • Delimiter-in-value escaping: Neither this PR nor SSSOM 1.0 implements escaping. Instead, the refuse_delimiter_in_data annotation/CLI flag raises a ValueError before serialization if any value contains the delimiter, which prevents silent data corruption. Full escaping (e.g., SSSOM 1.1's backslash approach) can be added later.
  • RDF order preservation: @gouttegd clarified that multivalued slot order non-preservation is a LinkML-wide property (not SSSOM-specific), since the RDF translation rules use unstructured triples even when list_elements_ordered: true. Worth noting but orthogonal to this PR.
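
The refuse_delimiter_in_data behaviour described in the list above can be sketched roughly as follows. This is a minimal illustration under assumptions, not the PR's implementation: the function name assert_no_delimiter and the error message wording are hypothetical.

```python
# Hypothetical sketch of a "refuse delimiter in data" guard.
# It only illustrates the idea of failing before serialization.
def assert_no_delimiter(values, list_delimiter="|"):
    """Raise ValueError if any list item contains the list delimiter."""
    for value in values:
        if list_delimiter in str(value):
            raise ValueError(
                f"Value {value!r} contains the list delimiter {list_delimiter!r}"
            )


assert_no_delimiter(["a", "b", "c"])  # passes silently

try:
    assert_no_delimiter(["a|b", "c"])
except ValueError as err:
    print(f"refused: {err}")
```

Failing at write time keeps the bad value out of the file, at the cost of requiring the user to pick a different delimiter or clean the data first.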

Configuration reference

Schema annotations

id: https://example.org/myschema
name: myschema
annotations:
  list_wrapper: none
  list_delimiter: "|"
  list_strip_whitespace: "true"
  refuse_delimiter_in_data: "true"

| Annotation | Values | Default | Description |
|---|---|---|---|
| list_wrapper | square, curly, paren, none | square | square uses [a\|b], curly uses {a\|b}, paren uses (a\|b), none has no wrapper: a\|b |
| list_delimiter | any string | \| (pipe) | Character(s) used to separate list items |
| list_strip_whitespace | true, false | true | Strip whitespace around delimiters when loading and dumping |
| refuse_delimiter_in_data | true, false | false | Raise ValueError if any multivalued field value contains the delimiter, preventing silent data corruption |
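
On the loading side, these settings amount to removing the wrapper, splitting on the delimiter, and optionally stripping whitespace. The sketch below is an assumption-laden illustration of that idea; parse_list is a hypothetical name, and the treatment of an empty cell as an empty list is an assumption, not something this PR states.

```python
# Hypothetical loader-side sketch (not the linkml_runtime implementation).
def parse_list(cell, markers=("", ""), list_delimiter="|", strip_whitespace=True):
    """Split one CSV/TSV cell into a list of strings."""
    opening, closing = markers
    text = cell.strip() if strip_whitespace else cell
    # Remove the wrapper only when both markers are configured and present.
    if opening and closing and text.startswith(opening) and text.endswith(closing):
        text = text[len(opening):len(text) - len(closing)]
    if text == "":
        # Assumption: an empty cell is treated as an empty list.
        return []
    items = text.split(list_delimiter)
    if strip_whitespace:
        items = [item.strip() for item in items]
    return items


print(parse_list("a | b | c"))                   # ['a', 'b', 'c']
print(parse_list("[a|b]", markers=("[", "]")))   # ['a', 'b']
```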

CLI options (override schema annotations)

linkml-convert -s schema.yaml -C Container -S items -t tsv \
  --list-wrapper none \
  --list-delimiter "|" \
  --list-strip-whitespace \
  --refuse-delimiter-in-data \
  input.yaml

| CLI Option | Default | Description |
|---|---|---|
| --list-wrapper | None (use schema) | square, curly, paren, or none |
| --list-delimiter | None (use schema) | Delimiter string |
| --list-strip-whitespace / --no-list-strip-whitespace | None (use schema) | Strip whitespace from list values |
| --refuse-delimiter-in-data / --no-refuse-delimiter-in-data | None (use schema) | Raise error if any value contains the delimiter |
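
The precedence implied here (built-in defaults, then schema annotations, then CLI options, with None meaning "not set, use the schema") can be sketched as below. ListSettings and resolve_settings are hypothetical names chosen for this example, and the sketch assumes the annotation strings have already been converted to Python values.

```python
from dataclasses import dataclass, replace


# Hypothetical container; the field names mirror the annotation names above.
@dataclass(frozen=True)
class ListSettings:
    list_wrapper: str = "square"
    list_delimiter: str = "|"
    list_strip_whitespace: bool = True
    refuse_delimiter_in_data: bool = False


def resolve_settings(schema_annotations, cli_options):
    """Apply schema annotations, then CLI options; None means 'not set'."""
    settings = ListSettings()
    for layer in (schema_annotations, cli_options):
        overrides = {key: value for key, value in layer.items() if value is not None}
        settings = replace(settings, **overrides)
    return settings


schema = {"list_wrapper": "none", "list_delimiter": ";"}
cli = {"list_wrapper": None, "list_delimiter": "|"}
print(resolve_settings(schema, cli))
```

In this example the wrapper comes from the schema (the CLI left it unset) while the delimiter comes from the CLI, and the two untouched fields keep their defaults.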

Review feedback addressed

From @cmungall's review (Feb 5):

  • ✅ Converted all tests to pure idiomatic pytest (no unittest classes, no hybrid styles)
  • ✅ Removed verbose agent-conversation-style comments
  • ✅ Made helper functions public (dropped underscore prefix)

From Copilot:

  • ✅ Fixed schema-level vs slot-level annotation mismatch in tests
  • ✅ Removed unused variables
  • ✅ Added warning log for invalid list_wrapper values

Coverage:

  • ✅ Added CLI integration tests — 25 converter tests + 29 CSV/TSV runtime tests pass

Files changed

  • docs/data/csvs.md — documentation
  • packages/linkml/src/linkml/converter/cli.py — CLI options
  • packages/linkml_runtime/src/linkml_runtime/dumpers/delimited_file_dumper.py — output formatting
  • packages/linkml_runtime/src/linkml_runtime/loaders/delimited_file_loader.py — input parsing
  • tests/linkml/test_utils/test_converter.py — CLI tests
  • tests/linkml_runtime/test_loaders_dumpers/test_csv_tsv_loader_dumper.py — runtime tests
  • tests/linkml_runtime/test_utils/test_csv_utils.py — utility tests

turbomam changed the title from "Add annotation-based CSV delimiter configuration" to "Add annotation-based xSV delimiter configuration" on Feb 4, 2026
codecov bot commented Feb 4, 2026

Codecov Report

❌ Patch coverage is 61.11111% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.87%. Comparing base (921970d) to head (41f33ab).
⚠️ Report is 28 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| packages/linkml/src/linkml/converter/cli.py | 61.11% | 2 Missing and 5 partials ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3134      +/-   ##
==========================================
+ Coverage   80.11%   83.87%   +3.76%     
==========================================
  Files         148      148              
  Lines       16992    17010      +18     
  Branches     3508     3515       +7     
==========================================
+ Hits        13613    14267     +654     
+ Misses       2638     1950     -688     
- Partials      741      793      +52     

| Flag | Coverage Δ |
|---|---|
| linkml | 80.12% <61.11%> (+0.01%) ⬆️ |
| runtime | 80.12% <61.11%> (+0.01%) ⬆️ |

turbomam force-pushed the issue-3041-annotation-based-csv-delimiters branch from e788b02 to 2264cc3 on February 4, 2026 17:49
turbomam changed the title from "Add annotation-based xSV delimiter configuration" to "Add configurable list formatting for CSV/TSV serialization" on Feb 5, 2026
turbomam marked this pull request as ready for review on February 5, 2026 15:23
Copilot AI review requested due to automatic review settings February 5, 2026 15:23
Contributor

Copilot AI left a comment

Pull request overview

This PR adds configurable list formatting for CSV/TSV serialization to address issue #3041, enabling users to control how multivalued fields are serialized (with or without brackets, custom delimiters, and whitespace handling).

Changes:

  • Adds schema-level annotations (list_syntax, list_delimiter, list_strip_whitespace) to control multivalued field formatting in CSV/TSV output
  • Implements CLI options (--list-syntax, --list-delimiter, --list-strip-whitespace) to override schema annotations
  • Extends CSV/TSV loaders and dumpers to handle plaintext-style lists (e.g., a|b|c) in addition to python-style lists (e.g., [a|b|c])

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

| File | Description |
|---|---|
| docs/data/csvs.md | Comprehensive documentation of the new configuration options with examples and usage instructions |
| packages/linkml/src/linkml/converter/cli.py | Adds three new CLI options for list formatting that apply to both input and output CSV/TSV operations |
| packages/linkml_runtime/src/linkml_runtime/dumpers/delimited_file_dumper.py | Implements list formatting configuration for CSV/TSV output, reading from schema annotations or CLI overrides |
| packages/linkml_runtime/src/linkml_runtime/loaders/delimited_file_loader.py | Implements list formatting configuration for CSV/TSV input, including helper functions for annotation reading and whitespace stripping |
| tests/linkml_runtime/test_loaders_dumpers/test_csv_tsv_loader_dumper.py | Comprehensive integration tests covering plaintext mode, custom delimiters, whitespace handling, and edge cases |
| tests/linkml_runtime/test_utils/test_csv_utils.py | Unit tests for annotation reading (contains a test schema that uses slot-level annotations inconsistently with implementation) |

turbomam (Member, Author) commented Feb 5, 2026

Re: patch coverage

The 12 uncovered lines are in cli.py where the new CLI options are passed through to the loader/dumper.

Update: Added CLI integration tests in commit a1db0e5. The CLI options (--list-syntax, --list-delimiter, --list-strip-whitespace) are now tested directly via CliRunner.

Test coverage includes:

  • 4 CLI tests in test_converter.py (linkml package)
  • 33 tests in test_csv_tsv_loader_dumper.py and test_csv_utils.py (linkml_runtime package)

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Member

@cmungall cmungall left a comment

make the tests more consistent with other tests

turbomam (Member, Author) commented Feb 11, 2026

Consolidated unaddressed feedback — list formatting (PR #3134)

Gathering all outstanding feedback from multiple sources so nothing falls through the cracks.

1. Chris's CHANGES_REQUESTED review (Feb 5) — addressed in code, awaiting re-review

All 5 inline comments have been addressed:

  • ✅ Converted test file from hybrid unittest/pytest to pure pytest (9b1e739, 19b9006)
  • ✅ Removed verbose agent-conversation-style comments
  • ✅ Removed stale KeyConfig context note
  • ✅ Converted test classes to plain test_* functions
  • ✅ Made helper functions public (dropped underscore prefix)

Status: Code changes pushed, Chris has not re-reviewed yet.

2. Chris's design spec from Dec 15 rolling meeting notes

The agreed-upon design from our Dec 15 meeting (Chris & Mark rolling notes):

# Schema annotation spec
attributes:
  name_list:
    multivalued: true
    annotations:
      list_syntax: python  ## allowed: python | plaintext
      list_delimiter: "; "  ## must include space explicitly. No effect if list_syntax == 'python'

Mapping to json-flattener:

  • If list_syntax == "python" → use defaults
  • Else → csv_list_markers = ("", ""), csv_inner_delimiter = $list_delimiter

Cascading: Use SchemaView — default to schema-level annotations, slot-level annotations override.
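
For reference, the Dec 15 draft (slot-level annotations overriding schema-level ones, then mapping list_syntax to json-flattener settings) could be sketched as below. This illustrates the draft spec only; the function names are hypothetical, plain dicts stand in for SchemaView lookups, and the default markers and delimiter are taken from the bracketed, pipe-delimited default described in this PR.

```python
# Hypothetical sketch of the Dec 15 draft spec, using plain dicts
# in place of SchemaView annotation lookups.
DEFAULT_MARKERS = ("[", "]")
DEFAULT_DELIMITER = "|"


def effective_annotations(schema_annotations, slot_annotations):
    """Slot-level values win; schema-level values fill the gaps."""
    return {**schema_annotations, **slot_annotations}


def flattener_settings(annotations):
    """Return (csv_list_markers, csv_inner_delimiter) for the draft spec."""
    if annotations.get("list_syntax", "python") == "python":
        return DEFAULT_MARKERS, DEFAULT_DELIMITER
    return ("", ""), annotations.get("list_delimiter", DEFAULT_DELIMITER)


schema_level = {"list_syntax": "plaintext", "list_delimiter": "; "}
slot_level = {"list_syntax": "python"}

print(flattener_settings(effective_annotations(schema_level, {})))
print(flattener_settings(effective_annotations(schema_level, slot_level)))
```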

Additional guidance from Chris:

  • "prefer no changes in packages/linkml_runtime/src/linkml_runtime/utils/csvutils.py"
  • "remember that json-flattener isn't really schema aware"
  • "pass csv list marker (a tuple) and inner delimiter (or a csv style syntax enum) in schema"

Need to verify: Does the current implementation match this spec exactly? Specifically:

  • Slot-level annotation override of schema-level annotations
  • Correct mapping to json-flattener's csv_list_markers and csv_inner_delimiter
  • No changes in csvutils.py

3. Chris's helper function visibility comment (PR inline)

Chris said "consider making more clearly intended as public (doctests might work better if it's intended as private)". I made them public and noted the repo has no doctest infrastructure — filed #3146 to track adding it. Chris hasn't responded to this.

4. Copilot review — invalid list_syntax values

Copilot flagged that invalid list_syntax values (e.g. "foobar") silently default to python style. Fixed in 971b3ec — added a warning log. ✅
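
A warn-and-fall-back check of the kind described here might look like the following sketch. It is illustrative only: normalize_list_syntax is a hypothetical name, and the allowed values are the python/plaintext pair used at this point in the PR's history.

```python
import logging

logger = logging.getLogger(__name__)

# Allowed values at this stage of the PR (before the later rename).
ALLOWED_LIST_SYNTAX = {"python", "plaintext"}


def normalize_list_syntax(value, default="python"):
    """Return a valid list_syntax, logging a warning on unknown input."""
    if value in ALLOWED_LIST_SYNTAX:
        return value
    logger.warning(
        "Invalid list_syntax value %r; falling back to %r", value, default
    )
    return default


print(normalize_list_syntax("plaintext"))  # plaintext
print(normalize_list_syntax("foobar"))     # python (after logging a warning)
```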

5. Discussion #1996 context

My progress update in the "Improved ways of working with tabular data" discussion links this PR. Broader context from the discussion:

  • @tfliss raised interaction with Pandera generator for inlined-as-simple-dict and range classes
  • @sneakers-the-rat raised the two orientations of tabular data (column-oriented vs row-oriented)
  • Chris's original post discusses whether we should have a separate tabular data library with plugin architecture

6. Feb 9 rolling notes

Chris noted "Finish linkml PRs" as a current action item.

7. Related issues

Next steps

  1. Verify implementation matches Chris's Dec 15 spec (especially slot-level override cascade and json-flattener mapping)
  2. Rebase if needed
  3. Request re-review from @cmungall

🤖 Generated with Claude Code

turbomam force-pushed the issue-3041-annotation-based-csv-delimiters branch from 9b1e739 to d0aa8d5 on February 11, 2026 16:58
@turbomam
Member Author

@cmungall Heads up on one deviation from our Dec 15 spec. We discussed slot-level annotations overriding schema-level defaults, but the current implementation only supports schema-level annotations.

Why: json-flattener's GlobalConfig defines csv_list_markers and csv_inner_delimiter at the top level, not on KeyConfig. These get applied uniformly to all columns — there's no per-column configuration path. So slot-level overrides would require extending json-flattener itself.

I think schema-level-only is the right call here. The main use case driving this (MIxS-style "semicolon-delimited, no brackets") is uniform across all multivalued fields in a given schema anyway. If a per-slot use case comes up later, we can add it via json-flattener at that point.

Also rebased onto current main and resolved the conflict with #3118's new converter tests. All 54 tests pass (25 converter + 29 CSV/TSV runtime). Ready for re-review when you get a chance.

@turbomam
Member Author

Scope and known limitations

What this PR does and doesn't touch

This PR modifies the linkml_runtime loader/dumper (used by linkml-convert). It does not touch the separate linkml.validator.loaders.delimited_file_loader (used by linkml-validate), which is a simpler 79-line loader built on bare csv.DictReader without json-flattener.

That means after this merges, linkml-convert will correctly split a|b into ['a', 'b'], but linkml-validate on the same CSV will still see it as the raw string a|b. I filed #3147 to track unifying these two loaders — that felt like a separate effort.
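
The difference can be reproduced with the standard library alone. The snippet below reads a tiny TSV with a bare csv.DictReader, which is what the validator's loader is described as using, and shows that the multivalued cell stays a single string until something splits it. The column names and data are made up for the example.

```python
import csv
import io

# Made-up TSV content with one pipe-delimited multivalued cell.
tsv_text = "id\taliases\nx1\ta|b\n"

rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
raw_value = rows[0]["aliases"]

# A bare DictReader has no schema knowledge, so the cell is one string.
print(repr(raw_value))  # 'a|b'

# A schema-aware loader would split it into list items.
split_value = raw_value.split("|")
print(split_value)  # ['a', 'b']
```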

Schema-level only annotations

Our Dec 15 spec discussed slot-level annotation overrides via SchemaView. The implementation is schema-level only — see my earlier comment for why (json-flattener's GlobalConfig has no per-column delimiter support).

Known edge cases

  • Delimiter-in-value: If a value contains the delimiter character, round-tripping will break. No escaping mechanism yet. Tracked in a skipped test with a note.
  • Empty multivalued fields: Skipped test due to a json-flattener json_clean issue — empty lists don't roundtrip cleanly.
  • Pandera / column-oriented data: @tfliss and @sneakers-the-rat raised broader tabular concerns in Discussion #1996 ("Improved ways of working with tabular data"). This PR is row-oriented only and doesn't address those; they feel like follow-up work for the tabular data library discussion.
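
The delimiter-in-value edge case from the list above is easy to reproduce with plain string operations. This sketch uses join and split as stand-ins for the dumper and loader; it does not call linkml_runtime.

```python
# Plain join/split stand in for dumping and loading a multivalued cell.
original = ["a|b", "c"]

serialized = "|".join(original)
restored = serialized.split("|")

print(serialized)            # a|b|c
print(restored)              # ['a', 'b', 'c']
print(restored == original)  # False
```

Two items go in and three come out, which is the kind of silent corruption that an escaping mechanism, or a refusal to serialize, would prevent.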

list_strip_whitespace accepts only true/false

Tightened in e4d955e to only accept case-insensitive "true" or "false". Previously accepted YAML 1.1 conventions (yes/no, 0/1). Changed to stay consistent with the direction in #3144 per Chris's feedback about not mixing boolean conventions.
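
A strict parser along these lines is sketched below. It is an illustration, not the code from e4d955e: parse_strict_bool is a hypothetical name, and falling back to a caller-supplied default after warning is an assumption about what happens with invalid values.

```python
import logging

logger = logging.getLogger(__name__)


def parse_strict_bool(value, default):
    """Accept only 'true' or 'false' (case-insensitive); otherwise warn."""
    text = str(value).lower()
    if text == "true":
        return True
    if text == "false":
        return False
    # Assumption: invalid values fall back to the default after a warning.
    logger.warning("Invalid boolean value %r; using default %r", value, default)
    return default


print(parse_strict_bool("TRUE", default=False))  # True
print(parse_strict_bool("yes", default=True))    # True, via the default
```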

turbomam and others added 12 commits February 26, 2026 09:12
- Update CLI help and docs to say "when loading and dumping" (not just loading)
- Simplify CLI help text to avoid formatting issues with special characters

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Move annotations from slot-level to schema-level in test schema to match
  actual implementation behavior
- Remove unused variables from skipped test

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Tests for --list-syntax, --list-delimiter, and --list-strip-whitespace
options in linkml-convert CLI.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Convert test_csv_utils.py from hybrid unittest/pytest to pure pytest
- Convert class-based tests to plain functions in test_csv_tsv_loader_dumper.py
- Remove verbose comment blocks and chatty docstrings
- Rename helper functions to public (drop underscore prefix):
  get_list_config_from_annotations, enhance_configmap_for_multivalued_primitives,
  strip_whitespace_from_lists

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Mechanical conversion: drop class wrapper, remove self, remove
import unittest. No behavior change.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Only accept case-insensitive "true" or "false" for the
list_strip_whitespace annotation, with a warning for invalid values.
Aligns with the direction in #3144 to avoid YAML 1.1 boolean
conventions (yes/no, on/off, 0/1) in CSV-related configuration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When enabled (via schema annotation or CLI flag), raises ValueError
before serializing if any multivalued field value contains the list
delimiter character. This catches round-trip corruption at write time
rather than silently producing corrupt output.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ed param

Address Kevin Schaper's review feedback on PR #3134:

1. Rename `list_syntax` annotation and CLI option to `list_wrapper` with
   permissible values: `square` [a|b], `curly` {a|b}, `paren` (a|b),
   `none` a|b (default: square). Replaces the old `python`/`plaintext`
   naming.

2. Extract shared list formatting functions (get_list_config_from_annotations,
   strip_whitespace_from_lists, enhance_configmap_for_multivalued_primitives,
   check_data_for_delimiter) into a new `list_utils.py` module in
   linkml_runtime/utils/, instead of defining them in the loader and
   importing into the dumper.

3. Remove unused `index_slot` parameter from
   get_list_config_from_annotations — the function only reads schema-level
   annotations and never used this parameter.
Introduces a ListConfig dataclass to bundle the four list-formatting
settings (markers, delimiter, strip_whitespace, refuse_delimiter_in_data)
that were previously passed as loose variables. Config resolution
(defaults -> schema annotations -> CLI overrides) now happens in a
single get_list_config() function rather than being duplicated in
both the loader and dumper.

Addresses review feedback from sneakers-the-rat and Kevin Schaper.
turbomam force-pushed the issue-3041-annotation-based-csv-delimiters branch from 368acb7 to 1e833ca on February 26, 2026 14:12
turbomam added a commit that referenced this pull request Feb 26, 2026
…esolver

Mirrors the ListConfig pattern from PR #3134: introduces a BooleanConfig
dataclass that bundles truthy/falsy values (loading) and output format
(dumping). Config resolution (defaults -> schema annotations -> CLI
overrides) now happens in a single get_boolean_config() function in
boolean_utils.py rather than being split across the loader and dumper.

The loader and dumper are now much leaner — boolean-specific logic
(coercion, output conversion, slot introspection) lives in one place.
@turbomam turbomam merged commit ca63aae into main Feb 27, 2026
17 checks passed
@turbomam turbomam deleted the issue-3041-annotation-based-csv-delimiters branch February 27, 2026 23:08
turbomam added a commit that referenced this pull request Mar 4, 2026
turbomam added a commit that referenced this pull request Mar 10, 2026
Development

Successfully merging this pull request may close these issues.

CSV/TSV loader does not split brackets-free, multivalued primitive slots (with pipe delimiter)

6 participants