Cogito

"I think, therefore I am" — but what, exactly, are you thinking?

Cogito is a methodology for building a cognitive self-model from your personal knowledge base and git history.

The premise: your Obsidian vault (or any git-versioned knowledge base) contains two kinds of signal — what you changed (git diffs) and what you wrote (note content). Together they are a timestamped record of your thinking across time. This pipeline extracts both, embeds them, and surfaces patterns you were not aware of.

Your data stays local. What you open-source is the method.

Chinese README →

Three directions

Cogito approaches self-modeling as three separable questions. You can run any combination.

Direction A — Attention Map

When did you work on what? How did your focus shift over time?

extract_diffs.py       → data/diffs.jsonl        (timestamped diff records)
embed_and_cluster.py   → data/clusters.jsonl      (embedded + KMeans clustered)
visualise.py           → data/self_model_3d.html  (interactive timeline)

Each dot is one commit. Position in space = semantic content. X axis = time. Color = topic cluster. Dot size = output intensity (net lines changed).
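As a rough illustration of what a timestamped diff record could look like, the sketch below parses the text output of `git log --numstat` into one record per commit. It is not the code in extract_diffs.py. The `@@` header marker, the `sha` field name, and the choice of net_lines = added minus deleted are assumptions made for this example; only the hour, weekday, and net_lines field names come from this README.

```python
import json
from datetime import datetime

# Hypothetical marker that makes commit header lines easy to spot. Produce input with:
#   git log --numstat --date=iso-strict --pretty=format:"@@%H|%ad"
HEADER_PREFIX = "@@"


def parse_numstat_log(text):
    """Turn `git log --numstat` output into one dict per commit."""
    records = []
    current = None
    for line in text.splitlines():
        if line.startswith(HEADER_PREFIX):
            sha, iso_date = line[len(HEADER_PREFIX):].split("|", 1)
            ts = datetime.fromisoformat(iso_date)
            current = {
                "sha": sha,
                "timestamp": ts.isoformat(),
                "hour": ts.hour,
                "weekday": ts.weekday(),  # 0 = Monday
                "net_lines": 0,           # assumed: added minus deleted
            }
            records.append(current)
        elif current is not None and line.count("\t") >= 2:
            added, deleted, _path = line.split("\t", 2)
            # Binary files are reported as "-" and carry no line counts.
            if added.isdigit() and deleted.isdigit():
                current["net_lines"] += int(added) - int(deleted)
    return records


sample = (
    "@@abc123|2024-03-05T14:30:00+02:00\n"
    "10\t2\tnotes/a.md\n"
    "-\t-\timg.png\n"
    "\n"
    "@@def456|2024-03-09T09:05:00+02:00\n"
    "0\t7\tnotes/b.md\n"
)
records = parse_numstat_log(sample)
for record in records:
    print(json.dumps(record))  # one JSON object per line, as in a .jsonl file
```

Parsing plain text keeps the sketch free of dependencies. The setup section lists gitpython, so the actual script may well read commits through that library instead.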

Direction B — Behavioral Patterns

When are you most active? What does your work rhythm look like?

direction_b/temporal_patterns.py  → data/temporal_report.html  (hour/weekday/intensity heatmaps)

Uses the hour, weekday, and net_lines fields already in diffs.jsonl. No extra API calls. Four charts: commit frequency heatmap, output intensity heatmap, monthly volume, and monthly volume broken down by domain.
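The two heatmaps can be pictured as weekday-by-hour grids filled from those fields. The following is a minimal sketch, not the code in temporal_patterns.py. It assumes weekday runs from 0 (Monday) to 6 and that intensity is the absolute value of net_lines summed per cell; both are assumptions for this example.

```python
def build_heatmaps(records):
    """Return (counts, intensity) as 7 x 24 grids indexed [weekday][hour]."""
    counts = [[0] * 24 for _ in range(7)]
    intensity = [[0] * 24 for _ in range(7)]
    for rec in records:
        day, hour = rec["weekday"], rec["hour"]
        counts[day][hour] += 1
        # Assumed: deletions count as output too, so use the absolute value.
        intensity[day][hour] += abs(rec["net_lines"])
    return counts, intensity


demo_records = [
    {"weekday": 1, "hour": 14, "net_lines": 8},
    {"weekday": 1, "hour": 14, "net_lines": -3},
    {"weekday": 5, "hour": 9, "net_lines": -7},
]
counts, intensity = build_heatmaps(demo_records)
```

Either grid can be handed directly to a heatmap function in a plotting library such as plotly or matplotlib, both of which appear in the install line.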

Direction C — Identity Structure

What concepts appear across all your domains without you being aware of it?

direction_c/embed_notes.py            → data/notes_chunks.jsonl       (note content, chunked + embedded)
direction_c/cross_domain_analysis.py  → data/cross_domain_map.html    (combined behavioral + content map)
                                      → data/cross_domain_clusters.json  (machine-readable, for AI querying)

This is the hard direction. It combines two data layers:

Behavioral layer (diffs.jsonl) — what you did
Cognitive layer (notes_chunks.jsonl) — what you thought and wrote

Clusters that span multiple domains (creative writing, technical work, legal documents, job search) are candidates for your cognitive identity anchors — the patterns you return to everywhere without choosing to.
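The cross-domain test described above can be sketched as a simple grouping step once every item carries a cluster label and a domain. The field names (`cluster`, `domain`, `text`), the `min_domains` threshold, and the shape of the summary are hypothetical; the real schema of cross_domain_clusters.json is not shown in this README.

```python
import json
from collections import defaultdict


def cross_domain_clusters(items, min_domains=2, samples_per_cluster=2):
    """Keep only clusters whose members come from at least `min_domains` domains."""
    by_cluster = defaultdict(list)
    for item in items:
        by_cluster[item["cluster"]].append(item)

    summary = []
    for cluster_id, members in sorted(by_cluster.items()):
        domains = sorted({m["domain"] for m in members})
        if len(domains) >= min_domains:
            summary.append({
                "cluster": cluster_id,
                "domains": domains,
                "size": len(members),
                # The first few members stand in for representative samples.
                "samples": [m["text"] for m in members[:samples_per_cluster]],
            })
    return summary


demo_items = [
    {"cluster": 0, "domain": "creative writing", "text": "a character who is never classified"},
    {"cluster": 0, "domain": "technical work", "text": "schema for records that fit no category"},
    {"cluster": 1, "domain": "technical work", "text": "fix the import path"},
    {"cluster": 1, "domain": "technical work", "text": "bump dependency versions"},
]
anchors = cross_domain_clusters(demo_items)
print(json.dumps(anchors, indent=2))
```

Here cluster 0 survives because it spans two domains, while cluster 1 is dropped because all of its members come from one.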

The output is not a dashboard. It's a conversation starter.

A 2D scatter plot of 5000+ points tells you nothing. The real output is cross_domain_clusters.json — a structured summary of which domains cluster together, with representative samples. Feed it to an AI that knows your context and ask: what does this mean about me?

The system surfaces the pattern. The AI explains what it means. You decide if it's true.

This is the part no existing self-tracking tool does: not visualization, but interpretation grounded in your actual behavioral record.

The core research problem

Systems don't acknowledge humans as humans.

Seat time ≠ competence. Keywords ≠ capability. A formal record ≠ what actually happened. Every existing classification system optimizes for what it can measure, not for what is real.

Cogito is one answer to this: if you can reconstruct a person's cognitive structure from their behavioral traces, you have evidence that doesn't depend on institutional categories. The person who falls through every existing classification still left a record — in the diffs, in the notes, in the patterns that span everything they ever worked on.

This is also why git diffs matter more than just reading notes:

A note is a final state. A diff is a decision.

When you delete a sentence and rewrite it, that's a data point about how your thinking changed. Standard RAG over your notes throws this away. Cogito keeps it.
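To make the difference concrete, here is a small sketch that uses Python's standard difflib to recover what was removed and what replaced it between two versions of a note. The function name and the sentence-per-item input format are assumptions for illustration; Cogito itself works from git diffs rather than from this helper.

```python
import difflib


def sentence_changes(before, after):
    """Compare two lists of sentences and return the removed and added ones."""
    removed, added = [], []
    for line in difflib.ndiff(before, after):
        if line.startswith("- "):
            removed.append(line[2:])
        elif line.startswith("+ "):
            added.append(line[2:])
        # Lines starting with "  " are unchanged; "? " lines are hints. Both are skipped.
    return {"removed": removed, "added": added}


old_version = ["I need a better job title.", "Notes on embeddings."]
new_version = ["I need work that fits how I think.", "Notes on embeddings."]
change = sentence_changes(old_version, new_version)
```

Reading only `new_version` would show the final sentence. The `removed` list is the part that a notes-only index never sees.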

Setup

pip install gitpython openai umap-learn scikit-learn plotly numpy python-dotenv matplotlib networkx jieba

Create a .env file in your vault root:

OPENAI_API_KEY=sk-...

Edit VAULT_DIR at the top of each script to point to your vault (must be a git repo).
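Since the vault must be a git repo, a quick check before running the pipeline can save a confusing failure later. The helper below is a hypothetical convenience, not part of the Cogito scripts; it only tests for a `.git` entry in the folder.

```python
import tempfile
from pathlib import Path


def is_git_repo(vault_dir):
    """True if the folder has a .git entry (a directory, or a file in worktrees)."""
    return (Path(vault_dir).expanduser() / ".git").exists()


# Demonstration on a throwaway folder instead of a real vault.
with tempfile.TemporaryDirectory() as tmp:
    before_init = is_git_repo(tmp)
    (Path(tmp) / ".git").mkdir()
    after_init = is_git_repo(tmp)
```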

Not using Obsidian?

Cogito works on any folder of text files — Obsidian is just the most natural starting point because it already has a git plugin. The ingest.py script normalises other sources into the same format:

| Source | Command | Directions supported |
| --- | --- | --- |
| Any `.md` / `.txt` folder | `python ingest.py --source folder --path /path/to/notes` | C only |
| Notion export (zip or folder) | `python ingest.py --source notion --path /path/to/export.zip` | C only |
| Google Docs via Google Takeout | `python ingest.py --source gdocs --path /path/to/takeout.zip` | C only |
| Any git repo | point `VAULT_DIR` at it in the scripts | A + B + C |

Output lands in data/ingested/<source>/. Then run direction_c/embed_notes.py pointing at that folder.

Why non-git sources only support Direction C: Directions A and B require commit history (timestamps, diffs, net lines changed). A Google Doc or Notion export is a snapshot, not a history. You get the content layer but not the behavioral layer.

If you want the full pipeline on Google Docs or Notion, export to a git-tracked folder and commit regularly. Even one commit per writing session is enough signal for Direction A.

Approximate API cost

| Step | Records | Est. cost |
| --- | --- | --- |
| Direction A (diffs) | ~4000 commits | ~$0.50–1.00 USD |
| Direction C (notes) | ~3000 chunks | ~$0.30–0.60 USD |

Uses text-embedding-3-small. Resume support built in — safe to interrupt and restart.
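One common way to get this kind of resume behavior is to append each result to a .jsonl file as soon as it is ready and, on restart, skip every id already present. The sketch below shows that pattern under stated assumptions: the `id` and `text` fields, the function names, and the stand-in `fake_embed` are all hypothetical, and the real scripts may handle resumption differently.

```python
import json
import tempfile
from pathlib import Path


def load_done_ids(out_path):
    """Collect ids already written, ignoring a line cut off by an interrupt."""
    done = set()
    path = Path(out_path)
    if path.exists():
        with path.open(encoding="utf-8") as f:
            for line in f:
                try:
                    done.add(json.loads(line)["id"])
                except (json.JSONDecodeError, KeyError):
                    continue  # blank or partial line; that record is redone
    return done


def process_with_resume(records, out_path, embed):
    """Embed only records not yet in out_path; return how many were written."""
    done = load_done_ids(out_path)
    written = 0
    with open(out_path, "a", encoding="utf-8") as out:
        for rec in records:
            if rec["id"] in done:
                continue
            row = {"id": rec["id"], "embedding": embed(rec["text"])}
            out.write(json.dumps(row) + "\n")
            out.flush()  # keep finished rows on disk if the run is interrupted
            written += 1
    return written


def fake_embed(text):
    """Stand-in for a real embedding call, so the demo makes no API requests."""
    return [float(len(text))]


with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "embedded.jsonl"
    batch = [{"id": "a", "text": "x"}, {"id": "b", "text": "yy"}]
    first_run = process_with_resume(batch, out_file, fake_embed)
    second_run = process_with_resume(batch + [{"id": "c", "text": "zzz"}], out_file, fake_embed)
    total_rows = len(out_file.read_text(encoding="utf-8").splitlines())
```

The second run pays only for the one new record, which is what keeps a restart cheap.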

Related work

Plot Ark — the institutional-scale version of this problem: xAPI behavioral analytics for learning systems
career-ops — OSS job search pipeline; SQLite architecture RFC (#919) contributed here

What this is not

Not a therapy tool
Not a productivity tracker
Not a replacement for introspection

It's a mirror with a longer memory than you have.

License

MIT — use it, fork it, run it on your own data.

Built by someone who couldn't figure out what was driving them, so they built a system to find out.
