Taxonomy Feature -- System Prompt

Use this document as a system prompt when implementing the taxonomy/categorization feature for declaude. It contains the full architectural context, category definitions, detection rules, caching strategy, and integration points into the existing codebase.


Project Context

declaude converts Claude conversation exports (conversations.json) into browsable HTML. The codebase has these files:

  • chat_message.py -- ChatMessage dataclass: uuid, text, sender, created_at, updated_at, content (list of block dicts), attachments, files
  • conversation.py -- Conversation dataclass: uuid, name, summary, created_at, updated_at, account_uuid, chat_messages. Also provides filename/folder generation
  • html_renderer.py -- HtmlRenderer class with render_conversation() and render_index(). Index rows are built in render_index() as <tr> elements with Date and Title columns. Uses PAPERCLIP_SVG for attachment icons
  • exporter.py -- export_conversations() orchestrates the pipeline: loads JSON, iterates conversations, renders HTML, builds index entries as dicts with keys: date, title, path, created_dt, has_attachments
  • declaude.py -- CLI entry point using argparse. Current flags: input (positional), -o/--output, --utc, -s/--source

Key data structures

Each ChatMessage.content is a list of dicts with a type field:

  • "text": has text (str) and citations (list) fields
  • "thinking": has thinking (str) field
  • "tool_use": has name (str) and input (dict) fields. For artifacts: input.type, input.title, input.content
  • "tool_result": has name (str), content (list), is_error (bool) fields

Each ChatMessage.attachments entry has: file_name, file_size, file_type, extracted_content. Each ChatMessage.files entry has: file_name.
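For reference, the block and attachment shapes can be sketched as plain dicts. This is illustrative only -- the field values below are made up, and only the keys mirror the description above:

```python
# Illustrative shapes of ChatMessage.content blocks and attachments.
# Keys mirror the spec above; values are invented for this example.
text_block = {"type": "text", "text": "Hello", "citations": []}
thinking_block = {"type": "thinking", "thinking": "..."}
tool_use_block = {
    "type": "tool_use",
    "name": "artifacts",
    "input": {"type": "text/html", "title": "Demo", "content": "<p>hi</p>"},
}
tool_result_block = {"type": "tool_result", "name": "artifacts",
                     "content": [], "is_error": False}

attachment = {"file_name": "notes.txt", "file_size": 120,
              "file_type": "text/plain", "extracted_content": "..."}
file_entry = {"file_name": "photo.png"}
```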


Task Description

Add a taxonomy categorizer that assigns up to 2 categories to each conversation. Display these as a third column ("Categories") in the index.html table. Cache results in a JSON file so successive runs skip already-categorized conversations.


Categories

Use exactly these category names. Each has a priority rank -- when a conversation matches more than 2 categories, keep the 2 with the lowest rank numbers (highest priority).

  1. Theology -- Bible study, apologetics, church history, prayer, scriptural analysis
  2. Python -- Python programming, scripts, libraries, pip/uv
  3. Go -- Go/Golang programming, modules, CLI tools
  4. Bash -- Shell scripting, CLI commands, terminal operations
  5. NATS -- NATS messaging, JetStream, nats CLI, Synadia
  6. Networking -- Mikrotik, DNS, SSH, SFTP, firewalls, VPNs, Starlink, network hardware
  7. Creative Writing -- Satirical stories, fiction, humor pieces, narrative writing
  8. macOS -- macOS-specific tools, Homebrew, Time Machine, Finder, system preferences
  9. Data & Formats -- JSON processing, CSV, data extraction, file format conversion
  10. Web -- HTML, CSS, JavaScript, web APIs, web scraping
  11. AI & LLMs -- Prompting, model comparison, Claude features, API usage
  12. General -- Catch-all for anything that does not match the categories above

Rules

  • Assign at most 2 categories per conversation.
  • If only 1 category matches, use just that one -- do not pad with General.
  • Assign General only if zero other categories match.
  • When more than 2 match, keep the 2 with the lowest rank numbers.
  • Category names must be used exactly as shown (case-sensitive).

Detection Strategy (Option B: Content Heuristics)

Categorize by scanning the conversation title, the summary field, and the first 5 human messages (content text blocks only, not assistant messages). Do not scan the entire conversation -- the first few human messages establish the topic.

Detection signals per category

Theology

  • Title keywords: bible, scripture, verse, psalm, proverb, genesis, exodus, leviticus, numbers, deuteronomy, joshua, judges, ruth, samuel, kings, chronicles, ezra, nehemiah, esther, job, ecclesiastes, isaiah, jeremiah, lamentations, ezekiel, daniel, hosea, joel, amos, obadiah, jonah, micah, nahum, habakkuk, zephaniah, haggai, zechariah, malachi, matthew, mark, luke, john, acts, romans, corinthians, galatians, ephesians, philippians, colossians, thessalonians, timothy, titus, philemon, hebrews, james, peter, jude, revelation, theology, apologetics, gospel, prayer, church, sermon, faith, god, jesus, christ, hebrew, greek (in biblical context), NET bible, NIV, ESV, KJV, testament, covenant
  • Content signals: Bible verse references (e.g. "John 3:16", "Gen 1:1"), theological terms
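Verse-reference detection can be sketched with a regex. This is a heuristic sketch, not a definitive pattern -- the exact expression is an implementation choice:

```python
import re

# Matches references like "John 3:16", "Gen 1:1", or "1 Peter 2:9":
# an optional leading 1-3 (for numbered books), a capitalized book
# word (optionally abbreviated with a period), then chapter:verse.
VERSE_RE = re.compile(r"\b(?:[1-3]\s+)?[A-Z][a-z]+\.?\s+\d{1,3}:\d{1,3}\b")

def has_verse_reference(text: str) -> bool:
    """Return True if the text appears to contain a Bible verse reference."""
    return bool(VERSE_RE.search(text))
```

Because the book word must be capitalized, lowercase phrases like "meet at 3:16 pm" do not match; false positives such as "Room 3:16" are acceptable for a scoring heuristic.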

Python

  • Title keywords: python, .py, pytest, pip, uv run, pandas, numpy, flask, django, fastapi, dataclass, pydantic
  • Content signals: code fences tagged python or py, import statements for Python modules, def function definitions, class with Python-style inheritance, .py file references in attachments

Go

  • Title keywords: golang, go module, go cli, .go
  • Content signals: code fences tagged go or golang, the literal strings package main, "func " (with trailing space), and import " (with opening quote), plus .go file references
  • IMPORTANT: Do not match the bare word "go" in natural English ("go ahead", "let's go"). Require either the code fence tag, go followed by a technical term (module, build, run, install, test, fmt, vet), or golang

Bash

  • Title keywords: bash, shell, zsh, script, terminal, .sh
  • Content signals: code fences tagged bash, sh, shell, or zsh, shebang lines (#!/bin/bash, #!/bin/sh), common CLI tool names in code context (grep, awk, sed, find, xargs, curl, wget)

NATS

  • Title keywords: nats, jetstream, synadia, nats-server, nats cli
  • Content signals: nats CLI commands, JetStream references, nats:// URLs, stream/consumer terminology in NATS context

Networking

  • Title keywords: mikrotik, routerboard, dns, ssh, sftp, firewall, vpn, wireguard, starlink, subnet, vlan, router, switch, ip address, dhcp, tcp, udp
  • Content signals: IP addresses, CIDR notation, network configuration blocks, RouterOS commands

Creative Writing

  • Title keywords: satirical, satire, story, fiction, humor, narrative, short story, writing prompt, creative
  • Content signals: Long-form prose without code blocks, narrative structure, character dialogue. Be conservative -- a conversation about writing code is not creative writing

macOS

  • Title keywords: macos, mac os, macbook, homebrew, time machine, finder, spotlight, applescript, diskutil
  • Content signals: macOS-specific commands (defaults write, diskutil, osascript, brew), .app references, macOS system paths (/Library, ~/Library, /Applications)

Data & Formats

  • Title keywords: json, csv, xml, yaml, data extract, parsing, file format, convert
  • Content signals: JSON/CSV/XML processing discussion, jq commands, data transformation pipelines. Only when data processing is the primary topic -- a Python conversation that happens to parse JSON should be categorized as Python, not Data & Formats

Web

  • Title keywords: html, css, javascript, typescript, react, vue, angular, api endpoint, web scraping, http, rest api
  • Content signals: code fences tagged html, css, javascript, typescript, jsx, tsx, HTML tags in content, HTTP methods discussion

AI & LLMs

  • Title keywords: prompt, llm, gpt, claude, model, ai, chatgpt, anthropic, openai, gemini, fine-tune, embedding, token
  • Content signals: Discussion of AI model capabilities, prompt engineering, API usage for LLMs. Do not match when "claude" appears only as a proper name or "model" appears in non-AI context (data models, 3D models)

General

  • Assigned only when no other category matches.

Matching algorithm

def categorize(conversation):
    scores = {}  # category -> int

    # Build the text corpus to scan (guard against None name/summary)
    title = (conversation.name or "").lower()
    summary = (conversation.summary or "").lower()
    human_texts = []
    for msg in conversation.chat_messages[:10]:  # scan at most the first 10 messages
        if msg.sender != "human":
            continue
        for block in msg.content:  # content blocks are dicts
            if block.get("type") == "text":
                human_texts.append(block.get("text", "").lower())
        for att in msg.attachments:
            human_texts.append(att["file_name"].lower())
        for f in msg.files:
            human_texts.append(f["file_name"].lower())
        if len(human_texts) >= 5:
            break

    corpus = "\n".join([title, summary] + human_texts)

    # Also extract code fence language tags from the first 10 messages
    code_langs = set()
    for msg in conversation.chat_messages[:10]:
        for block in msg.content:
            if block.get("type") != "text":
                continue
            # extract the language tag from ```lang fence openers
            for line in block.get("text", "").split("\n"):
                stripped = line.strip()
                if stripped.startswith("```") and len(stripped) > 3:
                    parts = stripped[3:].strip().split()
                    if parts:
                        code_langs.add(parts[0].lower())

    # Score each category using title keywords + corpus keywords + code fences.
    # Title matches are worth 3 points, corpus matches 1 point, code fences 2 points.
    # Keyword lists come from the detection signals above.
    for category, rules in CATEGORY_RULES.items():
        for keyword in rules.title_keywords:
            if keyword in title:
                scores[category] = scores.get(category, 0) + 3
        for keyword in rules.corpus_keywords:
            if keyword in corpus:
                scores[category] = scores.get(category, 0) + 1
        for lang_tag in rules.code_fence_tags:
            if lang_tag in code_langs:
                scores[category] = scores.get(category, 0) + 2

    # Filter to categories with score > 0
    matched = {k: v for k, v in scores.items() if v > 0}

    if not matched:
        return ["General"]

    # Sort by rank (priority), breaking ties by score (higher first)
    sorted_cats = sorted(matched, key=lambda c: (CATEGORY_RANKS[c], -matched[c]))
    return sorted_cats[:2]
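The algorithm assumes CATEGORY_RULES and CATEGORY_RANKS shaped roughly as follows. This is an abbreviated sketch -- the real keyword lists come from the detection signals above, and only the field names (title_keywords, corpus_keywords, code_fence_tags) are load-bearing:

```python
from types import SimpleNamespace

# Two abbreviated entries for illustration; the full dict covers all
# twelve categories with the keyword lists from the detection signals.
CATEGORY_RULES = {
    "Python": SimpleNamespace(
        title_keywords=["python", ".py", "pip", "pytest"],
        corpus_keywords=["import ", "def ", "dataclass"],
        code_fence_tags=["python", "py"],
    ),
    "Go": SimpleNamespace(
        title_keywords=["golang", "go module", ".go"],
        corpus_keywords=["package main", "func "],
        code_fence_tags=["go", "golang"],
    ),
}

CATEGORY_RANKS = {"Theology": 1, "Python": 2, "Go": 3}  # ranks 4-12 omitted
```

A frozen dataclass would work equally well in place of SimpleNamespace; the algorithm only needs attribute access on the three keyword fields.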

Edge case handling

  • Conversations titled "Untitled": Rely entirely on content scanning.
  • Conversations with emoji-prefixed titles (e.g. starting with a speech bubble): Strip leading emoji before keyword matching. The title often contains a truncated first message after the emoji.
  • Multi-topic conversations: The 2-category limit and rank-based priority handles this. A conversation about "Python script for Bible verse lookup" would get Theology (rank 1) + Python (rank 2).
  • Short conversations (1-2 messages): Title + summary may be the only useful signals. This is fine.
  • "Go" ambiguity: Only match go when preceded/followed by technical context or when a go/golang code fence is present. Never match bare "go" as a verb.
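The "Go" disambiguation rule can be sketched as a regex. The technical-term list mirrors the rule above and is adjustable:

```python
import re

# Match "golang" anywhere, or "go" only when immediately followed by
# one of the technical terms from the rule above. Bare "go" as a verb
# ("go ahead", "let's go") never matches.
GO_RE = re.compile(
    r"\bgolang\b|\bgo\s+(?:module|build|run|install|test|fmt|vet)\b",
    re.IGNORECASE,
)

def looks_like_go(text: str) -> bool:
    return bool(GO_RE.search(text))
```

This handles only the keyword half of the rule; the code fence check (a go or golang tag in code_langs) is scored separately by the matching algorithm.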

Caching Strategy

Cache file

Store at {output_dir}/taxonomy_cache.json with this structure:

{
    "version": 1,
    "categories": {
        "conv-uuid-1": ["Python", "Bash"],
        "conv-uuid-2": ["Theology"],
        "conv-uuid-3": ["General"]
    }
}
  • Key: conversation UUID (stable across exports).
  • Value: list of 1-2 category strings.
  • The version field allows future schema changes.

Cache behavior

  1. At startup, load the cache file if it exists.
  2. For each conversation, check if its UUID is in the cache.
  3. If cached, use the cached categories. If not, run the categorizer and add to cache.
  4. After export completes, write the updated cache back to disk.
  5. If the cache file does not exist, create it.
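A minimal sketch of the load/save pair, assuming the JSON schema above (function names match the categorizer.py API under Integration Points):

```python
import json
from pathlib import Path

CACHE_VERSION = 1

def load_cache(cache_path: Path) -> dict[str, list[str]]:
    """Return the uuid -> categories mapping, or {} if missing or unreadable."""
    try:
        data = json.loads(cache_path.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return {}
    if data.get("version") != CACHE_VERSION:
        return {}  # schema changed: fall back to re-categorizing everything
    return data.get("categories", {})

def save_cache(cache_path: Path, cache: dict[str, list[str]]) -> None:
    """Write the cache in the versioned schema, creating parent dirs if needed."""
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    payload = {"version": CACHE_VERSION, "categories": cache}
    cache_path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
```

Treating a version mismatch as an empty cache is one possible policy; a future schema migration could be substituted here instead.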

CLI flag

Add --no-cache flag to force re-categorization of all conversations. This rebuilds the cache from scratch.


Integration Points

New file: categorizer.py

Create a single new file containing:

  • CATEGORY_RULES: dict mapping category names to their keyword lists and code fence tags
  • CATEGORY_RANKS: dict mapping category names to their priority rank
  • categorize(conv: Conversation) -> list[str]: returns 1-2 category names
  • load_cache(cache_path: Path) -> dict[str, list[str]]: loads or returns empty
  • save_cache(cache_path: Path, cache: dict[str, list[str]]) -> None: writes cache
  • The categorizer takes a Conversation object (already defined in conversation.py) and accesses conv.name, conv.summary, and conv.chat_messages[*].content[*]

Changes to exporter.py

In export_conversations():

  1. Add no_cache: bool = False parameter.
  2. After loading conversations, load the taxonomy cache from output_dir / "taxonomy_cache.json".
  3. Inside the for conv in conversations: loop, after building the index entry dict, add a "categories" key:
    if conv.uuid in cache and not no_cache:
        categories = cache[conv.uuid]
    else:
        categories = categorize(conv)
        cache[conv.uuid] = categories
    
    index_entries.append({
        "date": ...,
        "title": ...,
        "path": ...,
        "created_dt": ...,
        "has_attachments": ...,
        "categories": categories,
    })
  4. After the loop, save the updated cache.

Changes to html_renderer.py

In render_index():

  1. Add a third <th>Categories</th> column to the table header.
  2. In the row building loop, read entry.get("categories", []) and join with ", ".
  3. Render as: <td class="categories">{categories_str}</td>
  4. Add CSS for the categories column:
    td.categories { font-size: 0.85em; color: #555; white-space: nowrap; }
  5. Consider a fixed width for the column (e.g. 10em) to keep the table aligned.
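Steps 2-3 can be sketched as a small helper (a sketch only -- the real loop in render_index() builds the whole row, and the function name here is hypothetical):

```python
import html

def categories_cell(entry: dict) -> str:
    """Build the Categories <td> for one index row, escaping just in case."""
    categories_str = ", ".join(entry.get("categories", []))
    return f'<td class="categories">{html.escape(categories_str)}</td>'
```

Escaping is strictly optional here since category names come from a fixed, known-safe list, but it keeps the cell robust if the list ever changes.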

Changes to declaude.py

  1. Add --no-cache argument to the argument parser.
  2. Pass no_cache=args.no_cache through to export_conversations().
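The CLI wiring can be sketched against the flags listed in Project Context (the default for -o/--output is an assumption, as the spec does not state it):

```python
import argparse

# Sketch of the relevant parser additions; existing flags from
# Project Context are shown abbreviated for context.
parser = argparse.ArgumentParser(prog="declaude")
parser.add_argument("input")
parser.add_argument("-o", "--output", default="output")  # default assumed
parser.add_argument("--no-cache", action="store_true",
                    help="re-categorize all conversations, ignoring the cache")

args = parser.parse_args(["conversations.json", "--no-cache"])
# export_conversations(..., no_cache=args.no_cache)
```

argparse converts --no-cache to the attribute args.no_cache automatically, so the flag passes straight through to export_conversations().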

Verification

After implementing, verify by spot-checking these conversations (known from the current dataset):

Conversation title -- expected categories:

  • NATS Jetstream push vs pull consumers -- NATS
  • Refactoring Monolithic Code into Modular Structure -- Python
  • Go CLI tool using standard library modules -- Go
  • Configuring sftp chroot for single user -- Bash, Networking
  • Bible study website navigation design -- Theology, Web
  • Mikrotik RB5009UGSIN vs L009UiGS-RM specs -- Networking
  • Creating a satirical BBQ story outline -- Creative Writing
  • MacOS Time Machine setup for external NVMe -- macOS
  • Jobs true message about Gods character -- Theology
  • Morse code converter with obfuscated variable names -- Python
  • 1966 persona system prompt -- AI & LLMs
  • Publishing satirical short story collection -- Creative Writing

Run the export twice to verify caching works -- the second run should not re-categorize any conversations and should complete faster.