I noticed the results format is JSON with key-value pairs for every cell. This is an extremely inefficient tokenization of tabular data.
There's some good prior art/research on encoding tabular data for LLMs. One of the best papers is SpreadsheetLLM, from Microsoft Research. A higher-performing, simpler encoding is to just use markdown tables - they beat even CSV. You could go a step further with their "SheetCompressor" method (I believe it's open-sourced), but that's probably unnecessary with models more modern than the ones they ran their research on (GPT-4).
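To make the proposal concrete, here's a minimal sketch of the conversion in TypeScript. The function name, the `Row` type, and the escaping rules are my own illustrative assumptions, not dbhub's actual API:

```typescript
// Sketch: serialize query results (an array of row objects) into a markdown table.
// Assumes all rows share the columns of the first row; "|" and newlines in cell
// values are escaped/flattened so they don't break the table layout.
type Row = Record<string, string | number | boolean | null>;

function rowsToMarkdownTable(rows: Row[]): string {
  if (rows.length === 0) return "(no rows)";
  const columns = Object.keys(rows[0]);
  const escape = (value: unknown): string =>
    String(value ?? "NULL").replace(/\|/g, "\\|").replace(/\n/g, " ");
  const header = `| ${columns.join(" | ")} |`;
  const separator = `| ${columns.map(() => "---").join(" | ")} |`;
  const body = rows.map(
    (row) => `| ${columns.map((col) => escape(row[col])).join(" | ")} |`
  );
  return [header, separator, ...body].join("\n");
}

// Hypothetical usage with invented data:
const rows: Row[] = [
  { id: 1, name: "Alice", role: "admin" },
  { id: 2, name: "Bob", role: "viewer" },
];
console.log(rowsToMarkdownTable(rows));
```

Column names are emitted once in the header row, which is where most of the token savings comes from; each additional row costs only its values plus delimiters.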
### Format Comparison

#### Current JSON Format
- Size: 2,790 bytes, 771 tokens
- Overhead: Each cell requires key name repetition (70 key strings for 10 rows × 7 columns)
- Readability: Requires JSON parsing or pretty-printing to read
- Scanning: Difficult to compare values across rows
#### Proposed Markdown Format
- Size: 1,244 bytes, 361 tokens (53% reduction)
- Overhead: Column names appear once in the header row
- Readability: Immediately scannable in raw form
- Scanning: Easy to compare values vertically and horizontally (see the toy example after this list)
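For concreteness, here's the same hypothetical two-row result (invented data, not the attached files) in both encodings:

```json
[
  { "id": 1, "name": "Alice", "role": "admin" },
  { "id": 2, "name": "Bob", "role": "viewer" }
]
```

```markdown
| id | name | role |
| --- | --- | --- |
| 1 | Alice | admin |
| 2 | Bob | viewer |
```

The JSON encoding repeats all three key strings for every row added; the markdown table pays for the keys once, and each row costs only its values and pipe delimiters.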
### Benefits
- Reduced bandwidth: ~55% smaller payload for tabular data
- Better UX: Results are human-readable without additional processing
- Tool compatibility: Markdown tables work in GitHub, Slack, documentation, and most viewers
- Easier debugging: Less need to pipe results through `jq` or pretty-print JSON to understand them
- Smaller tokenization = better AI understanding (see work by Microsoft Research on SpreadsheetLLM and LLMLingua)
dbhub_current_json_format.json
dbhub_proposed_markdown_format.md