Taxonomy Feature -- System Prompt

Use this document as a system prompt when implementing the taxonomy/categorization feature for declaude. It contains the full architectural context, category definitions, detection rules, caching strategy, and integration points into the existing codebase.


Project Context

declaude converts Claude conversation exports (conversations.json) into browsable HTML. The codebase has these files:

  • chat_message.py -- ChatMessage dataclass: uuid, text, sender, created_at, updated_at, content (list of block dicts), attachments, files
  • conversation.py -- Conversation dataclass: uuid, name, summary, created_at, updated_at, account_uuid, chat_messages. Also provides filename/folder generation
  • html_renderer.py -- HtmlRenderer class with render_conversation() and render_index(). Index rows are built in render_index() as <tr> elements with Date and Title columns. Uses PAPERCLIP_SVG for attachment icons
  • exporter.py -- export_conversations() orchestrates the pipeline: loads JSON, iterates conversations, renders HTML, builds index entries as dicts with keys: date, title, path, created_dt, has_attachments
  • declaude.py -- CLI entry point using argparse. Current flags: input (positional), -o/--output, --utc, -s/--source

Key data structures

Each ChatMessage.content is a list of dicts with a type field:

  • "text": has text (str) and citations (list) fields
  • "thinking": has thinking (str) field
  • "tool_use": has name (str) and input (dict) fields. For artifacts: input.type, input.title, input.content
  • "tool_result": has name (str), content (list), is_error (bool) fields

Each ChatMessage.attachments entry has: file_name, file_size, file_type, extracted_content. Each ChatMessage.files entry has: file_name.
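For reference, the block and attachment shapes can be sketched as plain dicts. This is illustrative only -- the field values below are made up, and only the keys mirror the description above:

```python
# Illustrative shapes of ChatMessage.content blocks and attachments.
# Keys mirror the spec above; values are invented for this example.
text_block = {"type": "text", "text": "Hello", "citations": []}
thinking_block = {"type": "thinking", "thinking": "..."}
tool_use_block = {
    "type": "tool_use",
    "name": "artifacts",
    "input": {"type": "text/html", "title": "Demo", "content": "<p>hi</p>"},
}
tool_result_block = {"type": "tool_result", "name": "artifacts",
                     "content": [], "is_error": False}

attachment = {"file_name": "notes.txt", "file_size": 120,
              "file_type": "text/plain", "extracted_content": "..."}
file_entry = {"file_name": "photo.png"}
```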


Task Description

Add a taxonomy categorizer that assigns up to 2 categories to each conversation. Display these as a third column ("Categories") in the index.html table. Cache results in a JSON file so successive runs skip already-categorized conversations.


Categories

Use exactly these category names. Each has a priority rank -- when a conversation matches more than 2 categories, keep the 2 with the lowest rank numbers (highest priority).

  1. Theology -- Bible study, apologetics, church history, prayer, scriptural analysis
  2. Python -- Python programming, scripts, libraries, pip/uv
  3. Go -- Go/Golang programming, modules, CLI tools
  4. Bash -- Shell scripting, CLI commands, terminal operations
  5. NATS -- NATS messaging, JetStream, nats CLI, Synadia
  6. Networking -- Mikrotik, DNS, SSH, SFTP, firewalls, VPNs, Starlink, network hardware
  7. Creative Writing -- Satirical stories, fiction, humor pieces, narrative writing
  8. macOS -- macOS-specific tools, Homebrew, Time Machine, Finder, system preferences
  9. Data & Formats -- JSON processing, CSV, data extraction, file format conversion
  10. Web -- HTML, CSS, JavaScript, web APIs, web scraping
  11. AI & LLMs -- Prompting, model comparison, Claude features, API usage
  12. General -- Catch-all for anything that does not match the categories above

Rules

  • Assign at most 2 categories per conversation.
  • If only 1 category matches, use just that one -- do not pad with General.
  • Assign General only if zero other categories match.
  • When more than 2 match, keep the 2 with the lowest rank numbers.
  • Category names must be used exactly as shown (case-sensitive).

Detection Strategy (Option B: Content Heuristics)

Categorize by scanning the conversation title, the summary field, and the first 5 human messages (content text blocks only, not assistant messages). Do not scan the entire conversation -- the first few human messages establish the topic.

Detection signals per category

Theology

  • Title keywords: bible, scripture, verse, psalm, proverb, genesis, exodus, leviticus, numbers, deuteronomy, joshua, judges, ruth, samuel, kings, chronicles, ezra, nehemiah, esther, job, ecclesiastes, isaiah, jeremiah, lamentations, ezekiel, daniel, hosea, joel, amos, obadiah, jonah, micah, nahum, habakkuk, zephaniah, haggai, zechariah, malachi, matthew, mark, luke, john, acts, romans, corinthians, galatians, ephesians, philippians, colossians, thessalonians, timothy, titus, philemon, hebrews, james, peter, jude, revelation, theology, apologetics, gospel, prayer, church, sermon, faith, god, jesus, christ, hebrew, greek (in biblical context), NET bible, NIV, ESV, KJV, testament, covenant
  • Content signals: Bible verse references (e.g. "John 3:16", "Gen 1:1"), theological terms
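Verse-reference detection can be sketched with a regex. This is a heuristic sketch, not a definitive pattern -- the exact expression is an implementation choice:

```python
import re

# Matches references like "John 3:16", "Gen 1:1", or "1 Peter 2:9":
# an optional leading 1-3 (for numbered books), a capitalized book
# word (optionally abbreviated with a period), then chapter:verse.
VERSE_RE = re.compile(r"\b(?:[1-3]\s+)?[A-Z][a-z]+\.?\s+\d{1,3}:\d{1,3}\b")

def has_verse_reference(text: str) -> bool:
    """Return True if the text appears to contain a Bible verse reference."""
    return bool(VERSE_RE.search(text))
```

Because the book word must be capitalized, lowercase phrases like "meet at 3:16 pm" do not match; false positives such as "Room 3:16" are acceptable for a scoring heuristic.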

Python

  • Title keywords: python, .py, pytest, pip, uv run, pandas, numpy, flask, django, fastapi, dataclass, pydantic
  • Content signals: code fences tagged python or py, import statements for Python modules, def function definitions, class with Python-style inheritance, .py file references in attachments

Go

  • Title keywords: golang, go module, go cli, .go
  • Content signals: code fences tagged go or golang, the literal strings package main, "func " (with trailing space), and import " (with opening quote), plus .go file references
  • IMPORTANT: Do not match the bare word "go" in natural English ("go ahead", "let's go"). Require either the code fence tag, go followed by a technical term (module, build, run, install, test, fmt, vet), or golang

Bash

  • Title keywords: bash, shell, zsh, script, terminal, .sh
  • Content signals: code fences tagged bash, sh, shell, or zsh, shebang lines (#!/bin/bash, #!/bin/sh), common CLI tool names in code context (grep, awk, sed, find, xargs, curl, wget)

NATS

  • Title keywords: nats, jetstream, synadia, nats-server, nats cli
  • Content signals: nats CLI commands, JetStream references, nats:// URLs, stream/consumer terminology in NATS context

Networking

  • Title keywords: mikrotik, routerboard, dns, ssh, sftp, firewall, vpn, wireguard, starlink, subnet, vlan, router, switch, ip address, dhcp, tcp, udp
  • Content signals: IP addresses, CIDR notation, network configuration blocks, RouterOS commands

Creative Writing

  • Title keywords: satirical, satire, story, fiction, humor, narrative, short story, writing prompt, creative
  • Content signals: Long-form prose without code blocks, narrative structure, character dialogue. Be conservative -- a conversation about writing code is not creative writing

macOS

  • Title keywords: macos, mac os, macbook, homebrew, time machine, finder, spotlight, applescript, diskutil
  • Content signals: macOS-specific commands (defaults write, diskutil, osascript, brew), .app references, macOS system paths (/Library, ~/Library, /Applications)

Data & Formats

  • Title keywords: json, csv, xml, yaml, data extract, parsing, file format, convert
  • Content signals: JSON/CSV/XML processing discussion, jq commands, data transformation pipelines. Only when data processing is the primary topic -- a Python conversation that happens to parse JSON should be categorized as Python, not Data & Formats

Web

  • Title keywords: html, css, javascript, typescript, react, vue, angular, api endpoint, web scraping, http, rest api
  • Content signals: code fences tagged html, css, javascript, typescript, jsx, tsx, HTML tags in content, HTTP methods discussion

AI & LLMs

  • Title keywords: prompt, llm, gpt, claude, model, ai, chatgpt, anthropic, openai, gemini, fine-tune, embedding, token
  • Content signals: Discussion of AI model capabilities, prompt engineering, API usage for LLMs. Do not match when "claude" appears only as a proper name or "model" appears in non-AI context (data models, 3D models)

General

  • Assigned only when no other category matches.

Matching algorithm

def categorize(conversation):
    scores = {}  # category -> int

    # Build the text corpus to scan (guard against None name/summary)
    title = (conversation.name or "").lower()
    summary = (conversation.summary or "").lower()
    human_texts = []
    for msg in conversation.chat_messages[:10]:  # scan at most the first 10 messages
        if msg.sender != "human":
            continue
        for block in msg.content:  # content blocks are dicts
            if block.get("type") == "text":
                human_texts.append(block.get("text", "").lower())
        for att in msg.attachments:
            human_texts.append(att["file_name"].lower())
        for f in msg.files:
            human_texts.append(f["file_name"].lower())
        if len(human_texts) >= 5:
            break

    corpus = "\n".join([title, summary] + human_texts)

    # Also extract code fence language tags from the first 10 messages
    code_langs = set()
    for msg in conversation.chat_messages[:10]:
        for block in msg.content:
            if block.get("type") != "text":
                continue
            # extract the language tag from ```lang fence openers
            for line in block.get("text", "").split("\n"):
                stripped = line.strip()
                if stripped.startswith("```") and len(stripped) > 3:
                    parts = stripped[3:].strip().split()
                    if parts:
                        code_langs.add(parts[0].lower())

    # Score each category using title keywords + corpus keywords + code fences.
    # Title matches are worth 3 points, corpus matches 1 point, code fences 2 points.
    # Keyword lists come from the detection signals above.
    for category, rules in CATEGORY_RULES.items():
        for keyword in rules.title_keywords:
            if keyword in title:
                scores[category] = scores.get(category, 0) + 3
        for keyword in rules.corpus_keywords:
            if keyword in corpus:
                scores[category] = scores.get(category, 0) + 1
        for lang_tag in rules.code_fence_tags:
            if lang_tag in code_langs:
                scores[category] = scores.get(category, 0) + 2

    # Filter to categories with score > 0
    matched = {k: v for k, v in scores.items() if v > 0}

    if not matched:
        return ["General"]

    # Sort by rank (priority), breaking ties by score (higher first)
    sorted_cats = sorted(matched, key=lambda c: (CATEGORY_RANKS[c], -matched[c]))
    return sorted_cats[:2]
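The algorithm assumes CATEGORY_RULES and CATEGORY_RANKS shaped roughly as follows. This is an abbreviated sketch -- the real keyword lists come from the detection signals above, and only the field names (title_keywords, corpus_keywords, code_fence_tags) are load-bearing:

```python
from types import SimpleNamespace

# Two abbreviated entries for illustration; the full dict covers all
# twelve categories with the keyword lists from the detection signals.
CATEGORY_RULES = {
    "Python": SimpleNamespace(
        title_keywords=["python", ".py", "pip", "pytest"],
        corpus_keywords=["import ", "def ", "dataclass"],
        code_fence_tags=["python", "py"],
    ),
    "Go": SimpleNamespace(
        title_keywords=["golang", "go module", ".go"],
        corpus_keywords=["package main", "func "],
        code_fence_tags=["go", "golang"],
    ),
}

CATEGORY_RANKS = {"Theology": 1, "Python": 2, "Go": 3}  # ranks 4-12 omitted
```

A frozen dataclass would work equally well in place of SimpleNamespace; the algorithm only needs attribute access on the three keyword fields.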

Edge case handling

  • Conversations titled "Untitled": Rely entirely on content scanning.
  • Conversations with emoji-prefixed titles (e.g. starting with a speech bubble): Strip leading emoji before keyword matching. The title often contains a truncated first message after the emoji.
  • Multi-topic conversations: The 2-category limit and rank-based priority handles this. A conversation about "Python script for Bible verse lookup" would get Theology (rank 1) + Python (rank 2).
  • Short conversations (1-2 messages): Title + summary may be the only useful signals. This is fine.
  • "Go" ambiguity: Only match go when preceded/followed by technical context or when a go/golang code fence is present. Never match bare "go" as a verb.
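The "Go" disambiguation rule can be sketched as a regex. The technical-term list mirrors the rule above and is adjustable:

```python
import re

# Match "golang" anywhere, or "go" only when immediately followed by
# one of the technical terms from the rule above. Bare "go" as a verb
# ("go ahead", "let's go") never matches.
GO_RE = re.compile(
    r"\bgolang\b|\bgo\s+(?:module|build|run|install|test|fmt|vet)\b",
    re.IGNORECASE,
)

def looks_like_go(text: str) -> bool:
    return bool(GO_RE.search(text))
```

This handles only the keyword half of the rule; the code fence check (a go or golang tag in code_langs) is scored separately by the matching algorithm.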

Caching Strategy

Cache file

Store at {output_dir}/taxonomy_cache.json with this structure:

{
    "version": 1,
    "categories": {
        "conv-uuid-1": ["Python", "Bash"],
        "conv-uuid-2": ["Theology"],
        "conv-uuid-3": ["General"]
    }
}
  • Key: conversation UUID (stable across exports).
  • Value: list of 1-2 category strings.
  • The version field allows future schema changes.

Cache behavior

  1. At startup, load the cache file if it exists.
  2. For each conversation, check if its UUID is in the cache.
  3. If cached, use the cached categories. If not, run the categorizer and add to cache.
  4. After export completes, write the updated cache back to disk.
  5. If the cache file does not exist, create it.
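A minimal sketch of the load/save pair, assuming the JSON schema above (function names match the categorizer.py API under Integration Points):

```python
import json
from pathlib import Path

CACHE_VERSION = 1

def load_cache(cache_path: Path) -> dict[str, list[str]]:
    """Return the uuid -> categories mapping, or {} if missing or unreadable."""
    try:
        data = json.loads(cache_path.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return {}
    if data.get("version") != CACHE_VERSION:
        return {}  # schema changed: fall back to re-categorizing everything
    return data.get("categories", {})

def save_cache(cache_path: Path, cache: dict[str, list[str]]) -> None:
    """Write the cache in the versioned schema, creating parent dirs if needed."""
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    payload = {"version": CACHE_VERSION, "categories": cache}
    cache_path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
```

Treating a version mismatch as an empty cache is one possible policy; a future schema migration could be substituted here instead.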

CLI flag

Add --no-cache flag to force re-categorization of all conversations. This rebuilds the cache from scratch.


Integration Points

New file: categorizer.py

Create a single new file containing:

  • CATEGORY_RULES: dict mapping category names to their keyword lists and code fence tags
  • CATEGORY_RANKS: dict mapping category names to their priority rank
  • categorize(conv: Conversation) -> list[str]: returns 1-2 category names
  • load_cache(cache_path: Path) -> dict[str, list[str]]: loads or returns empty
  • save_cache(cache_path: Path, cache: dict[str, list[str]]) -> None: writes cache
  • The categorizer takes a Conversation object (already defined in conversation.py) and accesses conv.name, conv.summary, and conv.chat_messages[*].content[*]

Changes to exporter.py

In export_conversations():

  1. Add no_cache: bool = False parameter.
  2. After loading conversations, load the taxonomy cache from output_dir / "taxonomy_cache.json".
  3. Inside the for conv in conversations: loop, after building the index entry dict, add a "categories" key:
    if conv.uuid in cache and not no_cache:
        categories = cache[conv.uuid]
    else:
        categories = categorize(conv)
        cache[conv.uuid] = categories
    
    index_entries.append({
        "date": ...,
        "title": ...,
        "path": ...,
        "created_dt": ...,
        "has_attachments": ...,
        "categories": categories,
    })
  4. After the loop, save the updated cache.

Changes to html_renderer.py

In render_index():

  1. Add a third <th>Categories</th> column to the table header.
  2. In the row building loop, read entry.get("categories", []) and join with ", ".
  3. Render as: <td class="categories">{categories_str}</td>
  4. Add CSS for the categories column:
    td.categories { font-size: 0.85em; color: #555; white-space: nowrap; }
  5. Consider a fixed width for the column (e.g. 10em) to keep the table aligned.
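Steps 2-3 can be sketched as a small helper (a sketch only -- the real loop in render_index() builds the whole row, and the function name here is hypothetical):

```python
import html

def categories_cell(entry: dict) -> str:
    """Build the Categories <td> for one index row, escaping just in case."""
    categories_str = ", ".join(entry.get("categories", []))
    return f'<td class="categories">{html.escape(categories_str)}</td>'
```

Escaping is strictly optional here since category names come from a fixed, known-safe list, but it keeps the cell robust if the list ever changes.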

Changes to declaude.py

  1. Add --no-cache argument to the argument parser.
  2. Pass no_cache=args.no_cache through to export_conversations().
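The CLI wiring can be sketched against the flags listed in Project Context (the default for -o/--output is an assumption, as the spec does not state it):

```python
import argparse

# Sketch of the relevant parser additions; existing flags from
# Project Context are shown abbreviated for context.
parser = argparse.ArgumentParser(prog="declaude")
parser.add_argument("input")
parser.add_argument("-o", "--output", default="output")  # default assumed
parser.add_argument("--no-cache", action="store_true",
                    help="re-categorize all conversations, ignoring the cache")

args = parser.parse_args(["conversations.json", "--no-cache"])
# export_conversations(..., no_cache=args.no_cache)
```

argparse converts --no-cache to the attribute args.no_cache automatically, so the flag passes straight through to export_conversations().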

Verification

After implementing, verify by spot-checking these conversations (known from the current dataset):

Conversation title -- expected categories:

  • NATS Jetstream push vs pull consumers -- NATS
  • Refactoring Monolithic Code into Modular Structure -- Python
  • Go CLI tool using standard library modules -- Go
  • Configuring sftp chroot for single user -- Bash, Networking
  • Bible study website navigation design -- Theology, Web
  • Mikrotik RB5009UGSIN vs L009UiGS-RM specs -- Networking
  • Creating a satirical BBQ story outline -- Creative Writing
  • MacOS Time Machine setup for external NVMe -- macOS
  • Jobs true message about Gods character -- Theology
  • Morse code converter with obfuscated variable names -- Python
  • 1966 persona system prompt -- AI & LLMs
  • Publishing satirical short story collection -- Creative Writing

Run the export twice to verify caching works -- the second run should not re-categorize any conversations and should complete faster.