joyboseroy/darshana-graph

darshana-graph

A text-grounded knowledge graph of Indian philosophy, covering Hindu darshanas, the Buddhist Pali Canon, and Jain philosophical texts, built entirely from public-domain and openly-licensed source texts, with LLM-assisted concept tagging constrained to a closed vocabulary.

Dataset on HuggingFace: joyboseroy/darshana-graph

Debate simulator tool: vada-simulator, a citation-grounded multi-agent debate engine built on this graph. Agents representing different schools can only cite real graph edges; fabricated or borrowed citations are programmatically rejected.

What this is

Most digital resources for Indian philosophy are single-text, single-translator. This project instead aligns the same root texts (principally the Bhagavad Gita and Brahma Sutras) across multiple independent historical commentators, so the same verse or sutra can be read side by side through 18 distinct commentators spanning Advaita, Vishishtadvaita, Dvaita, Dvaitadvaita, Achintya Bhedabheda, and more, plus the full Pali Canon and core Jain texts.

Every concept and relationship in the resulting graph is anchored to a specific passage in a real source text, with a verbatim evidence quote. The LLM tagging step is pure classification over text already in front of the model. There is no retrieval and no reliance on the model's prior knowledge of Indian philosophy.

Repository structure

darshana-graph/
  corpus/                          raw scraped/converted text, per source
    bg.json                        Bhagavad Gita, 13 commentators
    bg_prabhupada_1972.json        Prabhupada's "As It Is" (1972)
    brahma_sutras.json             Thibaut: Shankara + Ramanuja
    brahma_sutras_madhva.json      Madhva's Brahma Sutra Bhashya
    brahma_sutras_nimbarka.json    Nimbarka + Srinivasa (partial)
    gambhirananda_*.json           Gambhirananda's Upanishads + Brahma Sutra
    upanishads_muller.json         Muller's 10 principal Upanishads
    darshanas.json                 Samkhya, Yoga, Nyaya, Vaisheshika
    tattvartha_sutra.json          Jain Tattvartha Sutra
    jainism.json                   Acaranga + Sutrakritanga
    buddhism.json                  Full Pali Canon (110k+ segments)
    buddhism_philosophical_subset.json   curated subset used for tagging
  corpus/tagged/                   LLM-tagged output, one .jsonl per source
  hf_dataset/                      final merged files pushed to HuggingFace
    darshana_corpus.jsonl
    darshana_graph.jsonl
    README.md                      HuggingFace dataset card
  scrape.py                        Vedanta scrapers (Gita API, sacred-texts.com)
  scrape_new.py                    Buddhism (bilara-data), Jainism, Darshanas
  scrape_tattvartha.py             Tattvartha Sutra (wisdomlib, resumable)
  scrape_nimbarka.py               Nimbarka commentary (wisdomlib, resumable)
  convert_gita.py                  gita/gita repo -> corpus schema
  convert_muller.py                Max Muller SBE text files -> corpus schema
  convert_madhva.py                Madhva djvu OCR text -> corpus schema
  convert_prabhupada_1972.py       Prabhupada 1972 PDF text -> corpus schema
  extract_gambhirananda.py         OCR pipeline for scanned Gambhirananda PDFs
  filter_buddhism_philosophical.py Sample/filter Pali Canon for tagging
  fix_duplicate_ids.py             ID-collision detection and repair
  tag_corpus.py                    LLM tagging pipeline (Groq + closed vocab)
  audit_tagged.py                  Coverage/quality audit + tension preview
  prepare_hf_dataset.py            Merge corpus + tagged output for release
  inventory.py                     Full corpus status/coverage report
  test_sources.py                  Connectivity check for all scrape targets

Pipeline overview

  1. Scrape/convert source texts into a unified JSON schema (see Schema below). Each script in the repo root handles one source or source family; run inventory.py at any point to see what's collected and what's missing.
  2. Fix data hygiene issues as needed. fix_duplicate_ids.py checks every corpus file for ID collisions, a real issue that affected three files in this project, traced to scrapers that reset per-page block counters across hundreds of HTML pages.
  3. Tag the corpus with tag_corpus.py, which calls an LLM (default: Llama 3.1 8B via Groq) to extract concepts and typed relationships from each record's verse text plus up to two associated commentaries. The model is constrained to a closed vocabulary defined in the script; any output outside that vocabulary is dropped post-hoc by validate_and_filter(), regardless of whether the underlying JSON parsed successfully.
  4. Audit with audit_tagged.py for coverage percentages, relation/school distributions, and a preview of cross-school "tension": concept pairs where different schools assert different relation types.
  5. Merge and release with prepare_hf_dataset.py, which produces the two files pushed to HuggingFace.
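
The post-hoc vocabulary filter described in step 3 can be sketched as follows. This is a minimal illustration, not the code in tag_corpus.py: the function name filter_to_vocabulary, the edge field names (source, target, relation, evidence), the concept list, the case normalization, and the rule that an edge needs a non-empty evidence quote are all assumptions made for the example. Only the four relation names are taken from this README.

```python
# Hypothetical closed vocabulary; the real lists are defined in tag_corpus.py.
ALLOWED_CONCEPTS = {"atman", "brahman", "jiva", "maya", "karma", "dharma", "moksha"}
ALLOWED_RELATIONS = {
    "IS_IDENTICAL_TO",
    "IS_DISTINCT_FROM",
    "IS_QUALIFIED_ASPECT_OF",
    "IS_CAUSE_OF",
}


def filter_to_vocabulary(raw_edges):
    """Keep only edges whose concepts and relation are all in the closed vocabulary."""
    kept = []
    for edge in raw_edges:
        if not isinstance(edge, dict):
            continue  # malformed item: drop it rather than guess
        source = str(edge.get("source", "")).strip().lower()
        target = str(edge.get("target", "")).strip().lower()
        relation = str(edge.get("relation", "")).strip().upper()
        evidence = str(edge.get("evidence", "")).strip()
        if source not in ALLOWED_CONCEPTS or target not in ALLOWED_CONCEPTS:
            continue  # concept outside the vocabulary
        if relation not in ALLOWED_RELATIONS:
            continue  # relation outside the vocabulary
        if not evidence:
            continue  # assumption: an edge without a quote is not kept
        kept.append(
            {"source": source, "target": target, "relation": relation, "evidence": evidence}
        )
    return kept


# Parsed model output: valid JSON, but two of three edges use out-of-vocabulary terms.
model_output = [
    {"source": "Atman", "target": "brahman", "relation": "is_identical_to",
     "evidence": "not different from Brahman at all"},
    {"source": "atman", "target": "qi", "relation": "IS_CAUSE_OF", "evidence": "..."},
    {"source": "jiva", "target": "karma", "relation": "IS_INSPIRED_BY", "evidence": "..."},
]
clean_edges = filter_to_vocabulary(model_output)
print(clean_edges)
```

Filtering after parsing, rather than trusting the prompt alone, means a well-formed but off-vocabulary response is discarded the same way a malformed one would be.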

Schema

See the HuggingFace dataset card for the full schema definition, relation vocabulary, and per-tradition breakdown.

Setup

pip install requests beautifulsoup4 lxml groq --break-system-packages

# For OCR-based extraction (Gambhirananda PDFs):
sudo apt-get install -y poppler-utils tesseract-ocr
pip install pytesseract pillow pdf2image --break-system-packages

export GROQ_API_KEY="your-key-here"

External clones needed before running the Buddhism/Gita pipelines:

git clone --depth=1 https://github.com/suttacentral/bilara-data
git clone --depth=1 https://github.com/gita/gita

Reproducing the pipeline

# 1. Scrape everything (each is independently resumable where noted)
python scrape.py --all
python scrape_new.py --all --bilara ./bilara-data
python scrape_tattvartha.py --resume
python convert_gita.py --gita-dir ./gita/data

# 2. Check for ID collisions before tagging
python fix_duplicate_ids.py

# 3. Tag (slow on Groq free tier; a paid Developer tier key removes the
#    rate-limit bottleneck for a few dollars total at this corpus size)
python tag_corpus.py --all --workers 6 --resume

# 4. Audit and merge
python audit_tagged.py
python prepare_hf_dataset.py
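
For orientation, the ID-collision check in step 2 can be approximated in a few lines of Python. This is a sketch under stated assumptions, not the logic of fix_duplicate_ids.py: the id field name, the helper names, and the suffix-based repair strategy are all hypothetical.

```python
from collections import Counter


def find_duplicate_ids(records, id_field="id"):
    """Return {id: count} for every ID that occurs more than once."""
    counts = Counter(record[id_field] for record in records)
    return {rid: n for rid, n in counts.items() if n > 1}


def make_ids_unique(records, id_field="id"):
    """Return copies of the records, adding a suffix to each repeated ID."""
    used = set()
    repaired = []
    for record in records:
        base = record[id_field]
        candidate, n = base, 1
        while candidate in used:
            n += 1
            candidate = f"{base}__dup{n}"
        used.add(candidate)
        repaired.append({**record, id_field: candidate})
    return repaired


# A per-page block counter that resets produces the same ID on two pages.
records = [
    {"id": "p1_b1", "text": "first page, block 1"},
    {"id": "p1_b2", "text": "first page, block 2"},
    {"id": "p1_b1", "text": "second page, block 1"},
]
print(find_duplicate_ids(records))
repaired = make_ids_unique(records)
print([r["id"] for r in repaired])
```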

Example: the same verse, four readings

Bhagavad Gita 2.20 describes the soul as unborn, eternal, and untouched by the body's death. All four schools quoted below agree on that much. Where they sharply diverge is on what this soul's relationship to Brahman, the ultimate reality, actually is.

Shankara (Advaita Vedanta): "The Self spoken of here is, in its true nature, not different from Brahman at all. What appears as an individual soul, distinct and embodied, is a limitation imposed by ignorance. Remove the ignorance, and what remains is Brahman alone, without a second."

Ramanuja (Vishishtadvaita): "The soul is eternally real and eternally distinct from the Lord, even in liberation. It is not annihilated into Brahman but exists as an eternal mode, dependent on and inseparable from the Supreme, the way light depends on its source without becoming identical to it."

Madhva (Dvaita): "This verse proves the soul's eternal, irreducible distinctness. There is no stage, in bondage or in release, where the individual soul becomes one with God. To read identity into this verse is to ignore its plain sense in favor of a borrowed doctrine."

Prabhupada (Achintya Bhedabheda): "The living entity is an eternal, individual spirit soul, simultaneously one with and different from the Supreme, in a manner beyond ordinary logical resolution."

One school reads this as the one Self appearing as many. Two read it as eternally distinct souls dependent on or related to one Supreme. One holds that the soul is both one with and different from the Supreme. The verse never changes. The reading does.

A second example, Bhagavad Gita 18.61, on whether the Lord directing all beings through maya overrides individual free will:

Shankara: maya is the power of ignorance through which the one Self appears divided into many; in truth, nothing moves and no one is directed.

Ramanuja: the Lord directs each soul strictly according to that soul's own past actions, using maya as the instrument of just governance, never overriding the soul's own agency.

Madhva: maya is not illusion but God's real power to ordain the movements of real, distinct souls, an assertion of divine sovereignty rather than a denial of individual reality.

Example: pulled directly from the tagged graph data

This isn't a hand-picked illustration. The script generate_readme_examples.py queries the actual tagged dataset and surfaces real disagreements with real evidence quotes.

Atman and Brahman: the central debate of Vedanta

Advaita Vedanta (822 relevant passages found): holds atman and brahman are identical.

Vishishtadvaita (191 relevant passages found): holds atman and brahman are distinct. From the text (Brahma Sutras): "Our text teaches that the creation of the aggregate of sentient and non-sentient things results from the mere wish of a being free from all connexion with non-sentient matter."

Dvaita (57 relevant passages found): holds atman and brahman are distinct, but for a different reason than Vishishtadvaita. From the text (Brahma Sutras): "difference of degree is clearly seen in the bliss enjoyed by the souls from the best of men upwards to Brahma the four-faced."

Achintya Bhedabheda (21 relevant passages found): holds atman and brahman are identical, in the specific devotional sense that "everything is born of Him, everything is sustained by Him, and everything, after annihilation, rests in Him."

Two schools land on "identical," two land on "distinct," and the two that land on "distinct" don't even agree with each other on what distinct means. That's the kind of disagreement this dataset is built to surface.

Atman and Jiva: is the individual soul the same as the self?

Advaita Vedanta (39 relevant passages found): holds atman and jiva are distinct. From the text (Brahma Sutras): "the waking being may be either the original soul, or he may be God, or some other individual soul."

Vishishtadvaita (12 relevant passages found): holds atman and jiva are identical. From the Bhagavad Gita: "the Jiva itself is eternal, indestructible, and incomprehensible."

Dvaita (8 relevant passages found): holds atman and jiva are identical, but reads the relationship very differently within its broader system. From the text (Brahma Sutras): "Now, being but one individual he goes forth separated."

Generate more examples yourself with python generate_readme_examples.py --concept-a X --concept-b Y, or run --top-pairs N to see the most contested concept pairs in the dataset.

How differently do these commentators actually argue?

Beyond what each school concludes, the corpus lets you measure how they argue. Running stylometric_comparison.py on commentators with enough substantial prose passages in the corpus shows real differences in argumentative style:

| Commentator | School | Avg length (chars) | Cites scripture explicitly (% of passages) | Names and refutes an opponent (% of passages) |
| --- | --- | --- | --- | --- |
| Shankara | Advaita | 1,848 | 2.8% | 7.2% |
| Ramanuja | Vishishtadvaita | 1,136 | 5.2% | 4.6% |
| Madhva | Dvaita | 513 | 17.1% | 2.0% |
| Prabhupada | Achintya Bhedabheda | 1,740 | 8.2% | 0.3% |
| Nimbarka | Dvaitadvaita | 328 | 0.0% | 16.5% |
| Srinivasa | Dvaitadvaita (sub-commentary) | 2,138 | 1.4% | 43.1% |
| Pujyapada | Jain | 1,306 | 0.3% | 1.1% |

Within the Dvaita-Dvaitadvaita family specifically, three commentators writing across roughly six centuries show a clear trajectory: Madhva (13th century) leans heavily on direct scriptural citation (17.1% of passages) and rarely refutes opponents explicitly (2.0%). Nimbarka, founder of the related but distinct Dvaitadvaita school, refutes more often (13.2%). Srinivasa, writing a sub-commentary defending Nimbarka's school against later criticism, refutes opponents in 42.0% of his passages, by far the highest rate of any commentator measured, consistent with a school whose argumentative posture hardened as it had to defend itself on multiple fronts over time.

Within the Pali Canon itself, distinct collections show measurably different prose styles even without any cross-tradition comparison: the Dhammapada's verses average 31 characters per segment, the most aphoristic and compressed text in the collection, while Samyutta Nikaya and Udana prose segments average around 70-74 characters, consistent with their more discursive, doctrinally elaborated style.

A caveat worth stating plainly: words-per-sentence is not meaningful for several commentators and for the Pali Canon collections generally, since much of this text is captured as short segments without standard sentence-ending punctuation. We report this honestly via a %NoPunct diagnostic column in the script's output rather than silently showing a misleading number; treat any words-per-sentence figure with high %NoPunct as unreliable.
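
To make the diagnostic concrete, here is one way a %NoPunct figure could be computed. It is a sketch only: the function name pct_no_punct, the set of sentence-ending characters, and the handling of trailing quotes and brackets are assumptions, and stylometric_comparison.py may define the column differently.

```python
SENTENCE_ENDINGS = (".", "!", "?")
# Closing quotes and brackets are ignored, so 'He said, "Yes."' still counts as punctuated.
TRAILING_CLOSERS = "\"'”’)]"


def pct_no_punct(segments):
    """Percentage of segments that do not end in sentence-ending punctuation."""
    if not segments:
        return 0.0
    missing = 0
    for segment in segments:
        core = segment.rstrip().rstrip(TRAILING_CLOSERS)
        if not core.endswith(SENTENCE_ENDINGS):
            missing += 1
    return 100.0 * missing / len(segments)


# Toy segments for illustration only; not taken from the corpus.
segments = [
    "All conditioned things are impermanent.",
    "the five aggregates",
    "And what is suffering?",
    "form, feeling, perception",
]
print(f"%NoPunct = {pct_no_punct(segments):.1f}")
```

When this share is high, splitting on punctuation yields very few "sentences," so any words-per-sentence average is really a words-per-segment average under another name.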

Embeddings vs LLM tagging: two methods measuring different things

A natural question once you have both an LLM-tagged graph and a passage corpus is whether a model-free method agrees with the LLM's findings. We tried this with embedding_disagreement_finder.py: group commentary passages by school using literal concept-name matching, embed them locally, and measure cosine distance between each school's centroid per concept.

| Concept | Avg. cross-school distance | Schools compared | Reliable sample size? |
| --- | --- | --- | --- |
| moksha | 0.363 | 7 | Mixed, some pairs thin |
| jiva | 0.313 | 3 | No, 3 schools only, small samples |
| maya | 0.289 | 7 | Mixed |
| dharma | 0.277 | 8 | Mostly yes |
| atman | 0.250 | 5 | No, samples range 1 to 47 passages |
| karma | 0.162 | 7 | Mostly yes |
| brahman | 0.153 | 6 | Yes, samples range 19 to 1,173 passages |

The result is genuinely interesting, but not in the way we expected. Atman's ranking is unreliable, since the literal word "atman" appears far less often than equivalent phrasing like "the Self" or "soul," leaving some schools with single-digit sample sizes. Brahman's ranking, by contrast, rests on a solid sample (1,173 passages for Advaita alone, hundreds for other schools), and yet brahman shows the lowest embedding distance of any concept tested, despite being the single most contested concept in the LLM-tagged graph (253 contested pairs found, atman-brahman the most frequent).

We think this is a real finding rather than a failure of the method: schools can discuss the same concept in similar topical register, vocabulary, and sentence structure while asserting opposite metaphysical claims about it. An embedding model trained for topical and stylistic similarity has no particular reason to separate "X is identical to Y" from "X is distinct from Y" if both sentences otherwise read as typical philosophical prose about the same subject. This is a useful caution against treating off-the-shelf sentence embeddings as a proxy for philosophical agreement: semantic similarity of discussion is not the same thing as propositional agreement.

Full per-school passage counts are reproducible via python embedding_disagreement_finder.py --concept X. Disentangling topical similarity from propositional agreement, perhaps via a model fine-tuned or prompted to embed claims rather than passages, is a meaningful direction for future work.
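
The centroid comparison described in this section reduces to a small amount of arithmetic, sketched below in plain Python. The function names and the toy two-dimensional vectors are illustrative assumptions; the real script embeds passages with a local model and works with much longer vectors.

```python
import math
from itertools import combinations


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(a, b):
    """1 minus cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def avg_cross_school_distance(embeddings_by_school):
    """Mean cosine distance over all pairs of school centroids for one concept."""
    centroids = {
        school: centroid(vectors)
        for school, vectors in embeddings_by_school.items()
        if vectors
    }
    pairs = list(combinations(sorted(centroids), 2))
    if not pairs:
        raise ValueError("need at least two schools with passages")
    total = sum(cosine_distance(centroids[a], centroids[b]) for a, b in pairs)
    return total / len(pairs)


# Toy embeddings: schools A and C point the same way, school B is orthogonal.
toy = {
    "A": [[1.0, 0.0], [1.0, 0.0]],
    "B": [[0.0, 1.0]],
    "C": [[1.0, 0.0]],
}
print(round(avg_cross_school_distance(toy), 3))
```

Because each school is collapsed to a single centroid, a school with one passage carries the same weight as a school with a thousand, which is why the sample-size column in the table matters.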

What the contested-pairs data actually shows

Running generate_readme_examples.py --top-pairs 20 surfaces 253 contested concept pairs across the tagged graph. Two are worth a closer look, since both have enough passages per school to trust the pattern rather than just the headline.

Atman and brahman remains the most contested pair, and the picture is more nuanced than a simple two-sided dispute. Advaita asserts identity (822 passages). Dvaita and Vishishtadvaita both assert distinctness (57 and 191 passages respectively), but for different stated reasons: Vishishtadvaita grounds distinctness in an account of creation arising from the Lord's own undifferentiated will, while Dvaita grounds it in differences of degree among liberated souls. Dvaitadvaita, with the second-largest sample of any non-Advaita school here (139 passages, behind Vishishtadvaita's 191), stakes out a genuine third position: not identity, not distinctness, but a "qualified aspect of" relationship that sits between the two extremes, consistent with its bhedabheda (difference-and-non-difference) stance.

Atman and jiva shows an interesting reversal of the usual pattern. Here it's Advaita that asserts distinctness (39 passages, the largest sample in this pair), while Dvaitadvaita (13 passages), Vishishtadvaita (12 passages), and Dvaita (8 passages) all assert some form of identity between the individual self and the individual soul. Advaita's distinctness claim here doesn't contradict its identity claim for atman-brahman. It reflects the standard Advaita move of treating jiva as the empirically conditioned, body-bound appearance of atman, with atman itself ultimately identical to brahman once that conditioning is seen through.

Both examples are reproducible directly: python generate_readme_examples.py --concept-a atman --concept-b brahman (or jiva) shows the full passage breakdown per school.

The most contested concepts in the corpus, ranked

Running generate_readme_examples.py --top-pairs 20 against the full tagged graph surfaces 253 contested concept pairs in total (pairs where two or more schools assert a different relation type for the same two concepts). The top 20 by number of distinct schools weighing in:

| Rank | Concept pair | Schools involved |
| --- | --- | --- |
| 1 | atman-soul | 6 |
| 2 | atman-jiva | 5 |
| 3 | atman-brahman | 5 |
| 4 | atman-body | 5 |
| 5 | dharma-moksha | 5 |
| 6 | atman-maya | 5 |
| 7 | karma-moksha | 5 |
| 8 | dharma-karma | 5 |
| 9 | atman-moksha | 5 |
| 10 | atman-dharma | 5 |
| 11 | brahman-karma | 5 |
| 12 | atman-karma | 5 |
| 13 | atman-paramatman | 5 |
| 14 | brahman-dharma | 5 |
| 15 | atman-samsara | 5 |
| 16 | jiva-karma | 5 |
| 17 | atman-manas | 4 |
| 18 | brahman-moksha | 4 |
| 19 | atman-ether | 4 |
| 20 | brahman-maya | 4 |

Atman appears in 12 of these 20 pairs, which is unsurprising given it sits at the center of Vedanta's defining dispute, but it does mean this ranking is partly a ranking of "what atman is contested in relation to" rather than 20 independently surprising disputes. Number of schools involved is also not the same as sample reliability: pairs like atman-brahman and atman-jiva rest on double- and triple-digit passage counts per school, while pairs near the bottom of this list (atman-manas, atman-ether, brahman-maya) rest on single-digit passage counts for at least one school and should be read as suggestive rather than confirmed. Full per-school passage counts and evidence quotes for any pair are reproducible directly: python generate_readme_examples.py --concept-a X --concept-b Y.
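
The definition of a contested pair used here (two or more schools asserting a different relation type for the same two concepts) can be sketched as below. This is an illustration, not the logic of generate_readme_examples.py: the edge field names, the choice to summarize each school by its most frequent relation, and the choice to ignore edge direction are all assumptions. Ignoring direction is a simplification that would blur a directional relation such as IS_CAUSE_OF.

```python
from collections import Counter, defaultdict


def contested_pairs(edges):
    """Return (pair, {school: dominant relation}) for pairs where schools disagree."""
    by_pair = defaultdict(lambda: defaultdict(Counter))
    for edge in edges:
        pair = tuple(sorted((edge["source"], edge["target"])))  # direction ignored
        by_pair[pair][edge["school"]][edge["relation"]] += 1

    results = []
    for pair, schools in by_pair.items():
        dominant = {s: counts.most_common(1)[0][0] for s, counts in schools.items()}
        if len(set(dominant.values())) >= 2:
            results.append((pair, dominant))
    # Rank by number of schools weighing in, then alphabetically for stability.
    results.sort(key=lambda item: (-len(item[1]), item[0]))
    return results


# Toy edges for illustration only.
edges = [
    {"source": "atman", "target": "brahman", "school": "Advaita", "relation": "IS_IDENTICAL_TO"},
    {"source": "atman", "target": "brahman", "school": "Advaita", "relation": "IS_IDENTICAL_TO"},
    {"source": "atman", "target": "brahman", "school": "Dvaita", "relation": "IS_DISTINCT_FROM"},
    {"source": "brahman", "target": "atman", "school": "Vishishtadvaita", "relation": "IS_DISTINCT_FROM"},
    {"source": "karma", "target": "moksha", "school": "Advaita", "relation": "IS_CAUSE_OF"},
    {"source": "karma", "target": "moksha", "school": "Dvaita", "relation": "IS_CAUSE_OF"},
]
ranked = contested_pairs(edges)
for pair, dominant in ranked:
    print(pair, dominant)
```

Ranking by school count alone reproduces the caveat above: a pair can rank highly while one of its schools contributes a single edge.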

Relation-type usage by school

Beyond which concepts schools disagree about, it's worth asking whether schools differ in how they use the relation vocabulary overall. Running audit_tagged.py --relation-profile against the tagged graph, restricted to schools with at least 20 specific-attribution edges, gives this breakdown:

| School | Total edges | IS_QUALIFIED_ASPECT_OF | IS_IDENTICAL_TO | IS_DISTINCT_FROM | IS_CAUSE_OF |
| --- | --- | --- | --- | --- | --- |
| Vishishtadvaita | 1,051 | 73.4% | 8.4% | 8.8% | 8.1% |
| Achintya Bhedabheda | 1,366 | 49.6% | 11.5% | 10.0% | 19.5% |
| Dvaitadvaita | 1,824 | 46.9% | 9.0% | 20.3% | 20.7% |
| Advaita | 3,930 | 36.8% | 30.0% | 14.8% | 15.5% |
| Dvaita | 1,331 | 36.4% | 10.4% | 21.6% | 27.5% |

(Samkhya, Yoga, Vaisheshika, and the two Jain schools are excluded here for having fewer than 20 specific-attribution edges each, too few for a percentage to be meaningful; see the script's full output for those raw counts.)

Read this table with one caveat firmly in place: IS_QUALIFIED_ASPECT_OF dominates every single row, consistent with the global over-representation already documented in the paper (roughly 4.5 times more frequent than any other relation type across the whole graph), so this table is showing the same known extraction artifact at different intensities per school, not five independently clean philosophical signatures.

That said, one pattern survives the caveat: Advaita's IS_IDENTICAL_TO rate (30.0%) is more than double every other school's, which is an aggregate, relation-vocabulary-level echo of the same atman-brahman identity claim already documented at the single-concept-pair level elsewhere in this corpus, arrived at here through a completely different aggregation (whole-school relation usage rather than one contested pair). That two independent slices of the same graph agree is a small but genuine cross-check on the extraction pipeline, not just a restatement of one finding.
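
A per-school relation profile of this kind is a grouped percentage, roughly as in the sketch below. The field names, the function name relation_profile, and the rounding are assumptions for illustration; the 20-edge minimum mirrors the threshold stated above, but audit_tagged.py may apply it differently.

```python
from collections import Counter, defaultdict


def relation_profile(edges, min_edges=20):
    """Percentage of each relation type per school, skipping thinly sampled schools."""
    counts = defaultdict(Counter)
    for edge in edges:
        counts[edge["school"]][edge["relation"]] += 1

    profile = {}
    for school, relation_counts in counts.items():
        total = sum(relation_counts.values())
        if total < min_edges:
            continue  # too few edges for a percentage to mean much
        profile[school] = {
            relation: round(100.0 * n / total, 1)
            for relation, n in relation_counts.items()
        }
    return profile


# Toy edges for illustration only: 30 for one school, 5 for another.
toy_edges = (
    [{"school": "Advaita", "relation": "IS_IDENTICAL_TO"}] * 9
    + [{"school": "Advaita", "relation": "IS_QUALIFIED_ASPECT_OF"}] * 21
    + [{"school": "Yoga", "relation": "IS_CAUSE_OF"}] * 5
)
profile = relation_profile(toy_edges)
print(profile)
```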

The IS_QUALIFIED_ASPECT_OF over-triggering is not school-specific

Running audit_tagged.py --relation-profile-by-text breaks down relation-type usage by source text rather than by school, testing whether the already-documented IS_QUALIFIED_ASPECT_OF over-representation (see paper Section 5.3) is driven by which school wrote a commentary or by something about the root text itself. The answer is unambiguous: every one of the 28 source texts with sufficient data shows IS_QUALIFIED_ASPECT_OF as the most or second-most frequent relation, regardless of tradition, language, or genre.

| Source text | Total edges | IS_QUALIFIED_ASPECT_OF | IS_CAUSE_OF | IS_DISTINCT_FROM |
| --- | --- | --- | --- | --- |
| Bhagavad Gita | 4,550 | 50.8% | 18.8% | 12.2% |
| Brahma Sutras | 7,737 | 42.6% | 20.5% | 19.5% |
| Yoga Sutras | 1,062 | 55.9% | 17.9% | 11.8% |
| Tattvartha Sutra (Jain) | 1,023 | 59.5% | 17.9% | 16.0% |
| Samyutta Nikaya | 1,767 | 33.1% | 26.3% | 25.6% |
| Dhammapada | 106 | 28.3% | 37.7% | 21.7% |
| Acaranga Sutra (Jain) | 159 | 51.6% | 27.0% | 15.1% |

This rules out a school-specific cause: the over-triggering appears in Hindu sutra literature, Hindu devotional poetry, Jain canonical texts, and Pali Buddhist suttas alike, four bodies of literature with no shared commentarial lineage. The likely explanation is that the model defaults to this label when the relationship between two concepts is real but underspecified by the immediate textual context, which happens at a similar baseline rate regardless of which tradition or language the source text comes from. This points future prompt engineering toward addressing the model's general handling of relational uncertainty, rather than tuning per-tradition or per-school prompting.

Doctrine-level stylistic variation within the Pali Canon

The curated Buddhist philosophical subset carries a theme_tag field spanning eight core doctrinal categories (dependent origination, the five aggregates, the six sense bases, the unanswered questions, the eightfold path, the seven factors of awakening, the bases of spiritual power, and the four noble truths), each represented by exactly 300 passages, a uniquely balanced cut of this corpus with no sample-size caveat needed. Running stylometric_comparison.py --by-theme shows these eight doctrines cluster together on length and sentence complexity (72-89 characters, 8.8-11.8 words per pseudo-sentence) but vary meaningfully in vocabulary diversity, from 0.128 (the unanswered questions) to 0.222 (the five aggregates). The five aggregates teaching's enumerative structure, naming five distinct components of experience, plausibly drives its higher lexical variety; the unanswered questions material's low diversity is consistent with canonical passages on this topic relying on a small set of repeated formulaic phrasing rather than varied discursive language, though we have not confirmed this by close reading. (Five further theme tags in this field, dhammapada, udana, itivuttaka, sutta_nipata, and khuddakapatha, are collection names rather than doctrines and reproduce the collection-level findings already reported above rather than offering an independent confirmation.)

Theme tagging validated against the actual translated vocabulary

A natural worry with any hand-curated theme tagging is whether the labels actually track the content. We checked this directly by searching for each theme's expected vocabulary (in Sujato's English translation, not Pali loanwords, since the translation renders Pali technical terms into plain English, "not-self" rather than "anatta," "noble truth" rather than "ariya-sacca") across all 13 theme buckets.

The result is a clean validation: each theme's signature term is sharply concentrated in its own bucket and nearly absent elsewhere. "Psychic power" appears 95 times in the bases_of_power theme and at most once in any other theme. "Eightfold" appears 62 times in eightfold_path versus at most 3 elsewhere. "Awakening factors" appears 67 times in seven_factors_awakening versus at most 1 elsewhere. "Noble truth" and "suffering" both peak sharply in four_noble_truths. This gives real confidence that the theme_tag field tracks genuine doctrinal content rather than being a loose or arbitrary label.

One caveat: "not-self" appears in only 10 of 300 five_aggregates passages, lower than the doctrine's centrality to that theme might suggest, likely because these are sequential canonical passages and the explicit not-self framing appears at specific points in a longer discourse rather than in every individual segment. "Mendicant," the term of address used throughout Sujato's translation when the Buddha speaks to monks, appears as background noise across nearly every theme and should not be read as a doctrinal signal.
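
The vocabulary check described above amounts to counting a term per theme bucket. A minimal sketch follows; theme_tag is the field named in this README, while the text field name, the function name, and the decision to count every occurrence (rather than the number of passages containing the term) are assumptions.

```python
from collections import defaultdict


def term_counts_by_theme(passages, term):
    """Case-insensitive occurrence count of a term in each theme bucket."""
    needle = term.lower()
    counts = defaultdict(int)
    for passage in passages:
        counts[passage["theme_tag"]] += passage["text"].lower().count(needle)
    return dict(counts)


# Toy passages for illustration only; not quoted from the corpus.
passages = [
    {"theme_tag": "eightfold_path", "text": "the noble eightfold path"},
    {"theme_tag": "eightfold_path", "text": "This Eightfold path, the eightfold way"},
    {"theme_tag": "four_noble_truths", "text": "the noble truth of suffering"},
]
counts = term_counts_by_theme(passages, "eightfold")
print(counts)
```

A term is a useful signature when its count is high in one bucket and near zero in the others; a term such as "mendicant" that is spread evenly across buckets carries no theme signal.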

Why Pali Canon material shows a 0.0% refutation rate, and why that's misleading

The stylometric script's citation and refutation regex markers were written against Sanskrit commentarial idiom (phrases like "the opponent argues" or "as it is said") and return 0.0% for every Pali Canon collection and theme tested, with no exceptions. Reading the actual text behind this number shows the 0.0% is a measurement artifact, not evidence that Pali material lacks argumentative content. The undeclared_questions theme, for instance, is built entirely around philosophical debate: passages name actual historical interlocutors ("the Jain ascetic of the Ñātika clan"), record direct questions about why certain metaphysical positions were not addressed ("Sir, why didn't you answer Vacchagotta's question?"), and stage refusals to assert competing metaphysical claims ("a realized one neither still exists nor no longer exists after death"). This is genuine philosophical argumentation, conducted through named dialogue and structured refusal rather than third-person commentarial refutation phrasing, and our regex markers, built for the latter register, cannot detect the former at all.

This means every refutation-rate claim in this corpus's stylometric analysis should be read as describing Sanskrit Vedanta commentarial style specifically, not philosophical argumentativeness in general. A genuinely cross-tradition argumentation measure would need separate, idiom-specific markers for Pali dialogue structure (named interlocutors, question-refusal patterns, the recurring "neither X nor not-X" tetralemma phrasing) rather than reusing Sanskrit-commentary markers and reporting a flat zero as if it meant absence.

A first working measure of argumentation in the Pali Canon

The existing citation/refutation regex markers, built for Sanskrit commentarial idiom, return 0.0% on every Pali Canon theme and collection in this corpus, which we initially read as a real absence before checking the underlying text directly. Reading actual passages from the undeclared_questions theme showed this was a measurement gap, not a real zero: the material is full of named interlocutors, direct questions about unanswered metaphysical positions, and the recurring "neither X nor not-X" tetralemma refusal pattern, genuine argumentative structure conducted through dialogue rather than commentarial third-person refutation.

We added a small set of Pali-dialogue-specific markers (named-interlocutor introductions, question-refusal exchanges, tetralemma phrasing) and reran the theme comparison. The result is a sharp, theme-specific signal: undeclared_questions registers 13.7% (41 of 300 passages), more than thirteen times higher than every other theme, which all fall between 0.0% and 1.0%. This is the first non-zero, theme-differentiated argumentation measure obtained for any Pali Canon material in this project, and it lands precisely in the one doctrine built explicitly around philosophical refusal and debate, which is a reasonable face-validity check on the markers even though they were built from a handful of directly-read examples rather than a systematic survey.
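
To show what markers of this kind might look like, here is a sketch with three regular expressions, one per marker family. These patterns are hypothetical stand-ins written for this example from the passages quoted in this README; they are not the markers used in the script, and they would need testing against the corpus before any rate they produce could be trusted.

```python
import re

# Hypothetical patterns, one per marker family named above.
PALI_DIALOGUE_MARKERS = [
    # Tetralemma-style refusal: "neither ... nor ..." within a short window.
    re.compile(r"\bneither\b.{1,80}?\bnor\b", re.IGNORECASE | re.DOTALL),
    # Question-refusal exchange: someone asks why a question went unanswered.
    re.compile(r"\bwhy (?:didn['’]t|did not) you answer\b", re.IGNORECASE),
    # Named interlocutor: a role word followed by a capitalized name.
    re.compile(r"\b(?:wanderer|ascetic|brahmin)\s+(?:of\s+the\s+)?[A-ZÀ-ÖØ-Þ]\w+"),
]


def has_dialogue_marker(passage):
    """True if any marker pattern matches the passage."""
    return any(pattern.search(passage) for pattern in PALI_DIALOGUE_MARKERS)


def marker_rate(passages):
    """Percentage of passages matching at least one marker."""
    if not passages:
        return 0.0
    hits = sum(1 for passage in passages if has_dialogue_marker(passage))
    return 100.0 * hits / len(passages)


# The first three strings are the fragments quoted in this README; the last is a toy control.
sample = [
    "the Jain ascetic of the Ñātika clan",
    "Sir, why didn't you answer Vacchagotta's question?",
    "a realized one neither still exists nor no longer exists after death",
    "Mendicants, the five aggregates are impermanent.",
]
print(f"{marker_rate(sample):.1f}% of passages match a dialogue marker")
```

Counting a passage once no matter how many markers it matches keeps the figure comparable to a per-passage rate such as the 41-of-300 result above.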

Buddhism stands apart from the Hindu/Jain schools more than they stand apart from each other

For dharma, the one bridge concept with a reliable Theravada sample (249 source passages, capped at 40 for the pairwise comparison, the same cap used for every other school), Theravada's embedding distance from every Hindu and Jain school ranges from 0.436 to 0.532, averaging 0.491 across all seven Theravada-involving pairs. Every comparison among the seven Hindu and Jain schools themselves, by contrast, ranges from 0.148 to 0.262, averaging 0.205. Theravada sits roughly 2.4 times further from the Hindu/Jain group, on this single concept, than the Hindu/Jain schools sit from each other.

This is consistent with dharma carrying substantially different technical content across traditions: Hindu dharma as cosmic and social duty, Theravada dhamma as both the Buddha's teaching and the Abhidhamma's momentary phenomena, Jain dharma as a named substance category in Jain physics, distinct senses sharing one transliterated word. We were unable to extend this comparison to the other bridge concepts we mapped (atman/anatta, moksha/nibbana, karma/kamma): none returned a usable Theravada sample, either because the literal-string passage search found no matches or, in the case of moksha and atman, because the Hindu and Jain samples themselves were too thin (1 to 13 passages per school) for the comparison to be meaningful regardless. This finding should therefore be read as a single, well-sampled data point about one concept, not a general claim about Buddhist-Hindu-Jain distance.

Insights from running the analysis scripts

These are results from the stylometric and embedding scripts included in this repository, run on the corpus as released. Numbers are flagged where the underlying sample is thin or where sentence-level statistics are unreliable, rather than presented uniformly.

Citation versus refutation: two different ways to make an argument

Across the eight commentators with enough data for a meaningful comparison, scriptural citation density and explicit refutation rate are moderately negatively correlated (Pearson r approximately -0.32 across all eight, -0.46 restricted to the four commentators with the cleanest sentence-level data: Shankara, Prabhupada, Srinivasa, and Pujyapada). Commentators who lean heavily on quoting scripture as proof tend to spend less time explicitly naming and refuting opponents, and vice versa. Madhva sits at one extreme (17.1% scripture citation, 2.0% refutation), Srinivasa at the other (0.7% citation, 42.0% refutation). With only eight data points this is suggestive rather than statistically conclusive, but it points at something real: there appear to be at least two distinct argumentative postures available within the same broad tradition. One proves its point by accumulating textual authority; the other proves its point by dismantling the alternative.
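
For readers who want to recompute the correlation from the script's output, Pearson's r can be written out directly. The sketch below is a generic implementation; the input lists are hypothetical placeholder rates, not the commentator figures, so its printed value is not one of the coefficients reported above.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical placeholder rates (percent of passages), for illustration only.
citation_rate = [1.0, 2.0, 3.0]
refutation_rate = [6.0, 4.0, 2.0]
print(pearson_r(citation_rate, refutation_rate))
```

With so few points, a single commentator can move r substantially, which is one reason to read the reported coefficients as suggestive.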

A school's argumentative posture can harden across generations

Madhva (13th century, founder of Dvaita), Nimbarka (founder of the related Dvaitadvaita school), and Srinivasa (a later sub-commentator defending Dvaitadvaita) show a striking generational trend in refutation rate: 2.0%, 13.2%, and 42.0% respectively. This is consistent with a school's argumentative posture shifting over time from confidently citing scripture to extensively defending itself against rival schools, though we have not verified this against independent historical scholarship and present it as a pattern worth a specialist's attention rather than a settled claim.

The least and most polemical commentators in the corpus

Restricting to the four commentators with the most reliable sentence-level data, Prabhupada (0.3% refutation rate) and Pujyapada (1.1%) are by far the least likely to explicitly name and refute an opposing view, while Srinivasa (42.0%) is the most. Shankara sits in between (7.2%). This is a small, reliable sample rather than a comprehensive ranking, since most other commentators in the corpus have too little continuous prose with standard punctuation for the sentence-level statistics to be trustworthy (see the %NoPunct diagnostic in the script's output).

The Pali Canon has its own internal genre structure

Without any comparison to Hindu or Jain material at all, the six Pali Canon collections tagged in this corpus show real differences in passage length: the Dhammapada averages 31 characters per segment, the most compressed and aphoristic text in the set, while Samyutta Nikaya and Udana average 70-74 characters, consistent with their more discursive, doctrinally elaborated style. Sentence-level statistics (words per sentence) are not meaningful for any Pali Canon collection in this corpus, since 89-100% of segments lack standard sentence-ending punctuation by construction (they are typically single clauses or stock phrases rather than full paragraphs).
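
The two segment-level figures used above (mean characters per segment and the share of segments without sentence-ending punctuation) can be approximated as follows. This is a sketch under stated assumptions: the helper names are hypothetical, "sentence-ending punctuation" is taken to mean a final `.`, `!`, or `?` after ignoring trailing quotes and brackets, and the sample strings are toy inputs. The script's actual %NoPunct definition may differ.

```python
def lacks_sentence_end(segment):
    """True if the segment does not end in . ! or ? (ignoring trailing quotes/brackets)."""
    trimmed = segment.rstrip().rstrip("\"')]")
    return not trimmed.endswith((".", "!", "?"))


def segment_stats(segments):
    """Return (mean characters per segment, percent of segments lacking sentence-end punctuation)."""
    kept = [s.strip() for s in segments if s.strip()]  # drop empty segments
    if not kept:
        raise ValueError("no non-empty segments")
    mean_chars = sum(len(s) for s in kept) / len(kept)
    pct_no_punct = 100.0 * sum(lacks_sentence_end(s) for s in kept) / len(kept)
    return mean_chars, pct_no_punct


# Toy segments, not corpus text.
sample = ["a short clause", "A full sentence.", "another fragment"]
mean_chars, pct_no_punct = segment_stats(sample)
print(f"mean chars: {mean_chars:.1f}, no sentence-end punctuation: {pct_no_punct:.0f}%")
```

When the second figure is high for a collection, words-per-sentence statistics for that collection are best ignored, which is the reason given above for not reporting them for the Pali Canon.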

What we could not yet test

Jainism's two major sub-traditions, Digambara and Shvetambara, are not separately distinguishable in the current corpus: the Tattvartha Sutra is tagged only under Pujyapada's Digambara commentary, with no Shvetambara commentator currently represented in the tagged data, even though Shvetambara acceptance of the text is mentioned in the dataset's documentation.

A within-Advaita stylistic comparison (Shankara versus the corpus's other four nominally Advaita commentators: Sridhara, Anandagiri, Nilakantha, Dhanpati) is also not currently possible at the embedding level, since only Shankara's passages carry a populated school field in the tagged commentary data; the others are present in the corpus but not yet attributable to a specific school in a way the analysis scripts can use.

A third, more severe gap is documented separately above: the extracted graph has zero specific school attribution anywhere for Buddhist material (see "Buddhism stands apart..." and the Known Limitations section), which is a stronger version of the general-attribution problem rather than a parallel case to the two gaps above.

Future work

Beyond fixing the gaps noted above, this corpus and pipeline support a substantial range of further work we have not attempted:

  • A formal, randomly sampled, multi-annotator precision evaluation of the LLM-tagged concept graph, which currently rests on informal spot-checking rather than a reported metric.
  • Tracing how the meaning of a single concept (such as maya) shifts across the roughly thirteen centuries separating Shankara from Prabhupada, using each commentator's known era as an additional axis alongside school.
  • Cross-tradition concept bridging: quantifying how much functional overlap exists between concepts traditionally treated as opposites or unrelated across traditions, such as Buddhist anatta and Vedantic atman, or dukkha and the Hindu and Jain treatments of bondage and suffering.
  • Reconstructing implied historical debates by pairing each commentator's explicit refutation passages with the specific rival-school passage most likely being refuted, surfacing arguments that were never written as direct dialogue but functioned as one.
  • A school-conditioned dialogue or question-answering system that draws only on a specific commentator's extracted concept relationships as its grounding, rather than general language model knowledge, to test whether a model can argue consistently from a specific historical position rather than merely describing it. A first version of this exists: vada-simulator, a multi-agent debate engine where each school can only cite real graph edges. Open questions remain: validating consistency over longer multi-turn exchanges, extending beyond two or three schools at once, and formally evaluating whether cited evidence actually represents each school's strongest case.
  • Transitive reasoning over the concept graph: given two directly asserted relationships, checking whether any third passage explicitly confirms or contradicts the logically implied relationship between concepts that were never directly compared by the original commentator.
  • A larger, more linguistically validated stylometric feature set, extending beyond the simple regular-expression and sentence-length statistics used here to syntactic complexity measures and a learned, rather than hand-specified, citation and refutation classifier.
  • Resolving the Jainism Digambara and Shvetambara tagging gap and the within-Advaita school-attribution gap noted above, both of which are data completeness issues rather than fundamental limitations of the approach.
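
As one illustration of the transitive-reasoning item above, the sketch below enumerates two-hop paths in an edge list and reports whether the implied pair is also asserted directly. It assumes edges can be read as `(source, relation, target)` tuples, which is a hypothetical shape rather than the released schema, and it makes no claim that any given relation type is actually transitive; that would need to be decided per relation before treating an implied pair as meaningful. The concept names are placeholders.

```python
from collections import defaultdict


def implied_pairs(edges):
    """For each two-hop path a -> b -> c, list the relations asserted directly for a -> c."""
    outgoing = defaultdict(set)  # concept -> concepts it points to
    direct = defaultdict(set)    # (a, c) -> relation names asserted directly
    for src, rel, dst in edges:
        outgoing[src].add(dst)
        direct[(src, dst)].add(rel)

    results = []
    for a in sorted(outgoing):
        for b in sorted(outgoing[a]):
            for c in sorted(outgoing.get(b, ())):
                if c in (a, b):
                    continue  # skip cycles back to the start and self-loops
                results.append((a, b, c, sorted(direct.get((a, c), ()))))
    return results


# Toy edges with placeholder concept names.
REL = "IS_QUALIFIED_ASPECT_OF"
toy_edges = [("A", REL, "B"), ("B", REL, "C"), ("A", REL, "C"), ("C", REL, "D")]
paths = implied_pairs(toy_edges)
unconfirmed = [(a, c) for a, _, c, rels in paths if not rels]
print(unconfirmed)
```

Pairs in `unconfirmed` are the candidates for a follow-up passage search: concepts linked only through an intermediate, never compared directly in the extracted edges.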

Several additional analyses are enabled by the existing corpus, graph, and scripts without requiring new data collection, but were not completed in this release:

  • School-distinctiveness ranking via embedding centroid divergence across a wider concept list than atman/brahman alone
  • A formal stylometry-based school classifier (logistic regression or random forest on citation/refutation/length features), pending a larger set of reliably-punctuated commentators than currently available
  • Within-Advaita stylistic comparison once Sridhara, Anandagiri, Nilakantha, and Dhanpati receive verified school attribution in the source data (see known gaps above); we deliberately avoided assigning this label ourselves without source verification
  • Century-stratified stylistic drift analysis, pending resolution of disputed dates for Nimbarka and Srinivasa
  • Graph relation-type profiles and concept centrality (degree, PageRank) computed per school over the existing edge set
  • Cross-file duplicate-ID auditing as a general data-hygiene check beyond the within-file fix already applied
  • An ablation comparing extraction quality across LLMs of different scale on a small fixed verse sample
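
For the per-school centrality item in the list above, degree is the simplest measure to start with. The following is a minimal sketch assuming each edge is a dict with `source`, `target`, and `school` keys; these field names and the school and concept labels in the example are hypothetical and should be mapped to the released graph's actual fields.

```python
from collections import Counter, defaultdict


def degree_by_school(edges):
    """Map school -> Counter of concept degree (incoming plus outgoing edges)."""
    degrees = defaultdict(Counter)
    for edge in edges:
        school = edge["school"]
        # Each edge adds one to the degree of both of its endpoints.
        degrees[school][edge["source"]] += 1
        degrees[school][edge["target"]] += 1
    return degrees


# Toy edges with placeholder labels.
toy_edges = [
    {"source": "concept_a", "target": "concept_b", "school": "school_x"},
    {"source": "concept_c", "target": "concept_b", "school": "school_x"},
    {"source": "concept_a", "target": "concept_c", "school": "school_y"},
]
by_school = degree_by_school(toy_edges)
for school in sorted(by_school):
    print(school, by_school[school].most_common(1))
```

PageRank would need an iterative computation or a graph library on top of the same grouping; degree counts are enough to see which concepts dominate each school's subgraph.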

We would welcome contributions on any of these, or on directions we have not anticipated; the corpus and all analysis code are released specifically to make this kind of follow-on work possible.

Known limitations

See the "Known limitations" section of the HuggingFace dataset card for the full accounting. In short: no human expert review, single-pass LLM tagging with an estimated 70-85% precision, a tendency for the tagging model to over-use the IS_QUALIFIED_ASPECT_OF relation and the general school tag rather than committing to a more specific label, partial coverage for Nimbarka/Srinivasa due to source-site reliability during scraping, and a complete absence of specific school attribution for Buddhist material (see "Why Pali Canon material shows a 0.0% refutation rate" above for the related stylometric gap, and the note below for the graph-attribution gap specifically).

  • Buddhist material has no specific school attribution at all in the extracted graph: all 2,306 edges sourced from Buddhist passages carry a general school label, with zero edges anywhere attributed specifically to theravada, despite theravada being a valid label in the predefined vocabulary. This is more severe than the roughly 70% general-attribution rate documented elsewhere in the graph, and we suspect it relates to Buddhist passages standing alone as root text rather than being paired with a named, school-attributed commentary the way Vedanta passages are, though we have not confirmed this explanation. Until addressed, any Buddhist-focused use of this corpus should rely on text-level search (as in our stylometric analysis) rather than the extracted graph.
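
A check of this kind can be reproduced by counting how many edges per tradition carry the general label. The sketch below assumes each edge is a dict with `tradition` and `school` keys; both field names are hypothetical, as are the toy values other than the `general` label itself, so adapt them to the released edge format.

```python
from collections import Counter


def general_share_by_tradition(edges, general_label="general"):
    """Map tradition -> (edges labelled general, total edges, percent general)."""
    totals = Counter()
    general = Counter()
    for edge in edges:
        totals[edge["tradition"]] += 1
        if edge["school"] == general_label:
            general[edge["tradition"]] += 1
    return {t: (general[t], n, 100.0 * general[t] / n) for t, n in totals.items()}


# Toy edges, not the released graph.
toy_edges = [
    {"tradition": "tradition_1", "school": "general"},
    {"tradition": "tradition_1", "school": "general"},
    {"tradition": "tradition_2", "school": "general"},
    {"tradition": "tradition_2", "school": "school_x"},
]
shares = general_share_by_tradition(toy_edges)
for tradition, (n_general, n_total, pct) in sorted(shares.items()):
    print(f"{tradition}: {n_general}/{n_total} general ({pct:.0f}%)")
```

A tradition whose share is 100% has no usable school-level signal in the graph, which is the situation described above for the Buddhist edges.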

License

Code in this repository (the scraping, conversion, and tagging pipeline) is released under MIT. The corpus and graph data follow the licensing described in the HuggingFace dataset card: CC-BY-4.0 for the aggregation/tagging work, with underlying source texts retaining their original licenses (public domain, CC0 for Sujato's Pali Canon translations via SuttaCentral, or explicit free-reproduction grants as noted per source).

Acknowledgements

Built on public-domain translations by George Thibaut, Max Muller, S. Subba Rau, Roma Bose, Hermann Jacobi, and Vijay K. Jain; the SuttaCentral bilara-data project and Bhikkhu Sujato's Pali Canon translations; and the gita/gita open dataset of Bhagavad Gita translations and commentaries.
