|
2 | 2 |
|
3 | 3 | ## Use Case |
4 | 4 |
|
5 | | -- For a directory of markdown notes, determine what are the top five topical clusters. |
| 5 | +- For a directory of markdown notes, determine the top five topical clusters. |
6 | 6 | - Because hashtags and front matter tags are normalized, related terms will group on tags. |
7 | 7 | - Works with markdown note-taking applications like Obsidian, Zettlr, LogSeq, and FOAM. |
8 | 8 |
|
|
12 | 12 |
|
13 | 13 | ## Requirements |
14 | 14 |
|
15 | | -- Break out YAML front matter tags and Camel case hash tags as plain words. |
16 | | - - Example, `key-word` becomes `key word` for analysis. |
17 | | - - Example, `KeyWord` becomes `key word` for analysis. |
18 | | - - Conversion happens before n-gram analysis of body text. |
| 15 | +- Break out YAML front matter tags and Camel case hash tags as plain words. |
| 16 | + - Example, `key-word` becomes `key word` for analysis. |
| 17 | + - Example, `KeyWord` becomes `key word` for analysis. |
| 18 | + - Conversion happens before n-gram analysis of body text. |
19 | 19 | - Ignore short common headers. The best way is to tokenize only headers that are three words or longer. |
20 | 20 | - The ability to have custom stop words to clean up cluster results. Use this for brands, fractional words, and other words that show up in clusters but aren't useful. |
| 21 | +- Use Jupyter for exploring concepts; for implementation, use a command-line script that can focus on specific directories. |
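The normalization, header, and stop-word requirements above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's actual implementation: the function names `normalize_tag`, `keep_header`, and `remove_stop_words`, and the `min_words` parameter, are hypothetical and do not come from the repository.

```python
import re


def normalize_tag(tag: str) -> str:
    """Turn a front matter tag or hashtag into lowercase plain words (hypothetical helper)."""
    tag = tag.lstrip("#")
    # Split camel case: add a space where a lowercase letter or digit meets an uppercase letter.
    tag = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", tag)
    # Treat hyphens and underscores as word separators.
    tag = re.sub(r"[-_]+", " ", tag)
    return " ".join(tag.lower().split())


def keep_header(header: str, min_words: int = 3) -> bool:
    """Keep only headers with at least `min_words` words (assumed threshold of three)."""
    text = header.lstrip("#").strip()
    return len(text.split()) >= min_words


def remove_stop_words(tokens, custom_stop_words):
    """Drop tokens that match a custom stop word, ignoring case."""
    stop = {word.lower() for word in custom_stop_words}
    return [token for token in tokens if token.lower() not in stop]


print(normalize_tag("key-word"))
print(normalize_tag("KeyWord"))
print(keep_header("## Use Case"))
```

Running normalization before n-gram analysis means both tag styles contribute the same words as the body text, which is what lets related terms group together.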
| 22 | + |
| 23 | + |
| 24 | + |
| 25 | +## Interpretation |
| 26 | + |
| 27 | +### Scatter Plot: Content Semantic Map |
| 28 | + |
| 29 | +Each dot represents one markdown note from the corpus under `ZETTEL_ROOT`, a markdown repo. |
| 30 | + |
| 31 | +Here's how to interpret the scatter plot the script produces: |
| 32 | + |
| 33 | + |
| 34 | + |
| 35 | +- **Color/cluster membership** indicates semantic similarity—notes of the same color share similar concepts and vocabulary |
| 36 | +- **Physical proximity** means notes are highly semantically related; dots clustered together contain overlapping ideas |
| 37 | +- **Distance between clusters** shows conceptual separation—far clusters represent distinct topics |
| 38 | +- **Cluster density** reflects thematic cohesion—tight clusters have focused meaning; loose clusters contain diverse but related concepts |
| 39 | +- **Isolated outliers** (dots far from clusters) represent unique notes that don't align well with major themes |
| 40 | +- **Top terms printed for each cluster** (C0, C1, etc.) reveal the dominant concepts defining that cluster |
| 41 | +- **Dimensionality reduction caveat**: the 2D plot compresses high-dimensional semantic space, so visual distance is approximate |
| 42 | + |
| 43 | +The key insight: **examine cluster labels and look for outliers**, then review the notes associated with them to validate whether the semantic grouping makes sense for your content. |
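To make "look for outliers" concrete, here is one possible heuristic, offered as a sketch only. The source does not say how outliers are identified. This example assumes you already have 2D plot coordinates and a cluster label per note, and it flags notes that sit unusually far from their own cluster centroid. The names `flag_outliers`, `points`, `labels`, and the threshold factor `k` are all hypothetical.

```python
from collections import defaultdict
from math import dist
from statistics import mean, pstdev


def flag_outliers(points, labels, k=2.0):
    """Return indices of points unusually far from their own cluster centroid.

    Assumption: `points` are (x, y) plot coordinates and `labels` are cluster ids.
    A point is flagged when its distance exceeds mean + k * standard deviation
    of the distances within its cluster.
    """
    groups = defaultdict(list)
    for idx, label in enumerate(labels):
        groups[label].append(idx)

    outliers = []
    for members in groups.values():
        # Centroid of the cluster in plot coordinates.
        cx = mean(points[i][0] for i in members)
        cy = mean(points[i][1] for i in members)
        distances = {i: dist(points[i], (cx, cy)) for i in members}
        threshold = mean(distances.values()) + k * pstdev(distances.values())
        outliers.extend(i for i, d in distances.items() if d > threshold)
    return sorted(outliers)


# Ten notes close together plus one far away, all in the same cluster.
points = [(i * 0.1, 0.0) for i in range(10)] + [(10.0, 10.0)]
labels = [0] * len(points)
print(flag_outliers(points, labels))
```

Because the 2D layout is only an approximation of the underlying semantic space, treat flagged indices as candidates to review by hand, not as a verdict.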
| 44 | + |
| 45 | + |
21 | 46 |
|
22 | 47 | ## User Story |
23 | 48 |
|
| 49 | +### "Is my writing on topic?" |
| 50 | + |
| 51 | +- User has a markdown note-taking application with files stored as plain text. They want to get an idea of what they have been writing about. |
| 52 | + - After running the script, they can see the top eight clusters of note-taking topics. |
| 53 | + - After careful consideration, the user focuses on a specific cluster to create a report. |
| 54 | +- For the desired cluster, the tool reports the observed context. User sees a tight mapping of dots. |
| 55 | + |
| 56 | +### "Where to prune research set? Tighten work up?" |
24 | 57 |
|
25 | | -- User have a markdown note-taking application with files stored as plain text. They want to get an idea of what they have been writing about. |
26 | | - - After running the script, they can see the top eight clusters of note-taking topics. |
27 | | - - After careful consideration, the user focuses on a specific cluster to create a report. |
| 58 | +- User is examining a body of research, looking for a concentration to write a paper on, but also wants to be aware of distractions. |
| 59 | + - All relevant research, the proposal, and the paper outline are put in the same directory. |
| 60 | + - This may include draft materials, relevant commentary, and research notes. |
| 61 | +- User runs the script against the directory to see if there are any outliers to validate, then decides on tangents. |
| 62 | +- An outlier is found: a cluster of n-grams that has out-of-place words. User searches the corpus to move those notes out of the project. |
| 63 | + - There is a level of curation, determining whether the note is on purpose for the project. |
| 64 | + - In some cases, the outlier indicates a relevant topic that needs more research or expanding of context. |
28 | 65 |
|
29 | 66 |
|
30 | 67 | > Copyright 2026 [JWH Consolidated LLC](https://www.jwhco.com/?utm_source=repository&utm_medium=github.com&utm_content=visualize-content-clusters) All rights reserved. |
31 | 68 |
|
32 | | -/EOF/ |
| 69 | +/EOF/ |