```sh
git clone https://github.com/0xRoko/hjpp && cd scraper
```
Install dependencies:

```sh
pnpm install
# or
npm install
# or
yarn install
```

All data is stored in the `.data` directory.
Directories:

- `definition-pages` - contains the fetched definition pages; each page is stored in a separate file named `sha256(url)`
- `sitemap` - contains the raw sitemap files
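The `sha256(url)` naming scheme can be sketched as follows (a minimal illustration using Node's built-in `crypto` module; the `pageFilename` helper is hypothetical, not part of the CLI):

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: derive the storage filename for a fetched page
// from its URL, following the sha256(url) scheme described above.
function pageFilename(url: string): string {
  return createHash("sha256").update(url).digest("hex");
}

// Every URL maps to a stable 64-character hex filename.
console.log(pageFilename("https://example.com/some-definition-page"));
```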
Files:

- `sitemap.json` - contains all sitemap records (sitemap record + sha256 hash for each URL)
- `status.json` - contains the status of fetched pages (record + recordStatus)
- `definitions.json` - contains the parsed definitions
This is the final output of the parser (produced by `pnpm hjpp parse`); it contains all definitions.
Definition example:
```json
{
  "id": "sezanj",
  "rijec": "sȅžanj",
  "detalji": "<b>sȅžanj</b> <i>m</i> 〈G -žnja, N <i>mn</i> -žnji, G sȇžānjā〉",
  "vrsta": "imenica",
  "izvedeniOblici": null,
  "definicija": "<i>pov.</i> mjera za dužinu koja odgovara razmaku raširenih ruku",
  "sintagma": null,
  "frazeologija": null,
  "etimologija": "vidi <a href=\"/r/segnuti\"><b>ségnuti</b> </a>",
  "onomastika": null,
  "hjp": {
    "keyWord": "sežanj",
    "url": "....",
    "id": "d19mWhQ="
  }
}
```

Each definition has a unique ID, derived from the `rijec` field using a slugify function. If there are multiple definitions for the same word, the ID gets a suffix:
```ts
const hashedWord = createHash("SHA3-512").update(def.rijec);
const suffix = hashedWord.digest("base64url").slice(0, 2);
const urlEncodedSuffix = encodeURIComponent(suffix);
const newId = `${id}-${urlEncodedSuffix}`;
```

Since fields like `definicija`, `etimologija` and `sintagma` contain HTML (purified by the parser), they can also contain links to other definitions. The parser updates `href` attributes to point to the correct definition. The default format is `r/<id>`.
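The link rewriting could look roughly like this (a sketch only, assuming the parser operates on HTML strings; `rewriteLinks` and the resolver callback are illustrative, not the actual implementation):

```typescript
// Illustrative sketch: rewrite href attributes in a definition's HTML
// so each link points to the final r/<id> of the target definition.
function rewriteLinks(html: string, resolveId: (word: string) => string): string {
  return html.replace(/href="\/r\/([^"]+)"/g, (_match, word: string) => {
    // Map the linked word to its final definition ID (slug, possibly suffixed).
    return `href="/r/${resolveId(word)}"`;
  });
}

const html = '<a href="/r/segnuti"><b>ségnuti</b></a>';
// Identity resolver: IDs already match their slugs, so the HTML is unchanged.
console.log(rewriteLinks(html, (w) => w));
```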
Parsing steps:

```sh
pnpm hjpp sitemap
```

This fetches all sitemaps and parses them into a single `sitemap.json` file. It also builds `status.json` for tracking fetched pages.

Rerunning this command WILL reset the `status.json` file (fetch progress will be lost).

```sh
pnpm hjpp fetch
```

This fetches all definition pages and stores them in the `.data/definition-pages` directory. The status of fetched pages is stored in the `status.json` file, because you will most likely get IP banned if you are not using proxies (you will be notified if that happens).

```sh
pnpm hjpp parse
```

This parses all definition pages and stores all definitions in the `.data/definitions.json` file. Note: parse progress is not saved, so make sure all definitions are fetched first.

To refetch everything, run:

```sh
pnpm hjpp fetch --force
```

Or run:

```sh
pnpm hjpp sitemap
```

which refetches all sitemaps and resets the `status.json` file, and then run as usual:

```sh
pnpm hjpp fetch
```
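Once `definitions.json` exists, consuming it is straightforward. A minimal sketch (the `Definition` shape below only covers fields from the example above, and this lookup code is not part of the project; it assumes the file is a JSON array of definition objects):

```typescript
import { readFileSync } from "node:fs";

// Partial shape of a parsed definition (fields from the example above).
interface Definition {
  id: string;
  rijec: string;
  definicija: string | null;
}

// Load definitions and index entries by id for O(1) lookup.
// Assumption: the file is a JSON array of definition objects.
function loadDefinitions(path: string): Map<string, Definition> {
  const defs: Definition[] = JSON.parse(readFileSync(path, "utf8"));
  return new Map(defs.map((d) => [d.id, d]));
}

// Usage (after running `pnpm hjpp parse`):
// const byId = loadDefinitions(".data/definitions.json");
// console.log(byId.get("sezanj")?.rijec);
```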