Talmud Illuminated scraper

About the project

The code here scrapes the content of the blog Talmud Illuminated.
The goal of the Talmud Illuminated Project is to bring the benefits of study to everyone who wants them. These benefits are many, and more can be added.
This scraping project was started on Tisha B'Av 2021. Tisha B'Av is the saddest day in Jewish history. It is a fast day on which Torah and Talmud study are prohibited. However, this project was a perfect fit for Tisha B'Av. In scraping, one is not learning Torah, just creating a structure for the data. So it may be a mitzvah, and a mitzvah is not prohibited on Tisha B'Av. At the same time, it is not business (which is also discouraged on this day).

About scraping

The code iterates through every Talmud volume (masechet) by name, and through every page of that masechet, up to its number of pages.
For each page, it runs a search through the Google Blogger API, gets a response in JSON, and parses that response. The search may return several pages, and the parser finds the one it is looking for by its title.
For example, the code may be looking for the string "bava kamma 48". Among the JSON results, it picks the one whose title field is "bava kamma 48".
All pages are stored in this project under the content folder and committed to GitHub.
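
The loop and the title match described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the class `TitleMatchSketch`, the `Post` holder, and both method names are hypothetical, the search results are hard-coded instead of being fetched from the Blogger API, and matching the title case-insensitively is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Optional;

public class TitleMatchSketch {

    /** Hypothetical stand-in for one post taken from a parsed JSON search response. */
    static final class Post {
        final String title;
        final String content;

        Post(String title, String content) {
            this.title = title;
            this.content = content;
        }
    }

    /** Builds one search string per page of a masechet, e.g. "bava kamma 48". */
    static List<String> queriesFor(String masechet, int firstPage, int lastPage) {
        List<String> queries = new ArrayList<>();
        for (int page = firstPage; page <= lastPage; page++) {
            queries.add(masechet.toLowerCase(Locale.ROOT) + " " + page);
        }
        return queries;
    }

    /**
     * Returns the first result whose title equals the query.
     * Ignoring case and surrounding whitespace is an assumption of this sketch.
     */
    static Optional<Post> findByTitle(List<Post> results, String query) {
        String wanted = query.trim();
        return results.stream()
                .filter(p -> p.title != null && p.title.trim().equalsIgnoreCase(wanted))
                .findFirst();
    }

    public static void main(String[] args) {
        // Hard-coded results standing in for what a search might return.
        List<Post> results = List.of(
                new Post("Bava Kamma 47", "placeholder text for page 47"),
                new Post("Bava Kamma 48", "placeholder text for page 48"),
                new Post("Bava Kamma 49", "placeholder text for page 49"));

        for (String query : queriesFor("Bava Kamma", 47, 49)) {
            Optional<Post> match = findByTitle(results, query);
            System.out.println(query + " -> " + match.map(p -> p.content).orElse("not found"));
        }
    }
}
```

Comparing the full title, and not just checking that it contains the query, keeps a search for one page from picking up a neighboring result whose title merely includes the same text.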

About the dates

The scraping project was essentially completed in six days, from Tisha B'Av to Tu B'Av, 2021. That year, Tu B'Av fell on Shabbat. The Talmud Illuminated project was started in 2008, on the previous Tu B'Av that fell on Shabbat. Here is the start. This part of the project thus took exactly 13 years on the Jewish calendar.

Instructions

Crawl (run Full Crawl configuration)
QA (run QA configuration)
MakeSite (run MakeSite configuration)
Copy the site to TalmudIlluminatedContent repo
- cp -r site/* ../TalmudIlluminatedContent/
Deploy from there (open TalmudIlluminatedContent project and run deploy.sh)

Custom GPT:
MosesAI

