
Results format is not token efficient #225

@marr75

I noticed the results format is JSON with a key-value pair for every cell. This is an extremely token-inefficient encoding of tabular data.

There's some good prior art/research on encoding tabular data for LLMs. One of the best papers is SpreadsheetLLM, from Microsoft Research. A simpler, higher-performing encoding is just to use markdown tables - they're superior even to CSV. You could go a step further and use their "SheetCompressor" method (I believe it's open-sourced), but that's probably unnecessary with LLMs more modern than the one they conducted their research on (GPT-4).


Format Comparison

Current JSON Format

  • Size: 2,790 bytes, 771 tokens
  • Overhead: Each cell requires key name repetition (70 key strings for 10 rows × 7 columns)
  • Readability: Requires JSON parsing or pretty-printing to read
  • Scanning: Difficult to compare values across rows (illustrated in the snippet below)
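
For illustration, a result in the current shape presumably looks something like this (rows and column names are invented for this example; the actual DBHub payload may differ in detail):

```json
[
  { "id": 1, "name": "Alice", "email": "alice@example.com", "role": "admin" },
  { "id": 2, "name": "Bob", "email": "bob@example.com", "role": "member" }
]
```

Every key string repeats once per row, which is exactly the overhead counted above.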

Proposed Markdown Format

  • Size: 1,244 bytes, 361 tokens (53% fewer tokens, 55% fewer bytes)
  • Overhead: Column names appear once in the header row
  • Readability: Immediately scannable in raw form
  • Scanning: Easy to compare values vertically and horizontally (see the table below)
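
The same hypothetical rows as a markdown table - column names appear exactly once, in the header:

```markdown
| id | name  | email             | role   |
| -- | ----- | ----------------- | ------ |
| 1  | Alice | alice@example.com | admin  |
| 2  | Bob   | bob@example.com   | member |
```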

Benefits

  1. Reduced bandwidth: ~55% fewer bytes (and ~53% fewer tokens) for tabular data
  2. Better UX: Results are human-readable without additional processing
  3. Tool compatibility: Markdown tables work in GitHub, Slack, documentation, and most viewers
  4. Easier debugging: Less need to pipe through jq or format JSON to understand results
  5. Fewer tokens = better AI understanding (see Microsoft Research's work on SpreadsheetLLM and LLMLingua)
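
For reference, a minimal sketch of the conversion in TypeScript (the function name, signature, and escaping rules here are mine, not DBHub's API; NULL rendering and column ordering would need project-specific decisions):

```typescript
// Sketch: render an array of row objects as a markdown table.
// Assumes every row shares the first row's columns.
function rowsToMarkdown(rows: Record<string, unknown>[]): string {
  if (rows.length === 0) return "(no rows)";
  const columns = Object.keys(rows[0]);
  // Escape pipes and flatten newlines so cell contents can't break the table.
  const cell = (v: unknown): string =>
    String(v ?? "").replace(/\|/g, "\\|").replace(/\n/g, " ");
  const header = `| ${columns.join(" | ")} |`;
  const divider = `| ${columns.map(() => "---").join(" | ")} |`;
  const body = rows.map(
    (row) => `| ${columns.map((c) => cell(row[c])).join(" | ")} |`
  );
  return [header, divider, ...body].join("\n");
}
```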

Attachments:

  • dbhub_current_json_format.json
  • dbhub_proposed_markdown_format.md
