HIIK Helper is a Python project for collecting conflict-related news articles and converting them into structured HIIK-style event data. It combines a Scrapy crawler with OpenAI/instructor batch workflows and Pydantic models.
hiik_helper/main.pystarts the Scrapy crawler.hiik_helper/spiders/hiik_default_spider.pycrawls Karen News article pages and stores discovered article content.hiik_helper/spiders/hiik_xml_spider.pyis a placeholder XML feed spider.hiik_helper/article_generator.pygenerates synthetic articles from existing examples.hiik_helper/article_extractor_openai.pycreates OpenAI batch requests for extracting HIIK parameters from articles.hiik_helper/utils.pyreads and writes batch JSONL output andHiikCorpusrecords.hiik_helper/pydantic_models/contains the article and HIIK corpus schemas.found_articles.jsonandvisited_urls.jsonare crawler state/data files used by the spider.
- Python 3.11, as declared in
pyproject.toml - Poetry
- Scrapy
- OpenAI API access for generation and extraction workflows
Install dependencies:
poetry installFor OpenAI-backed workflows, set:
export OPENAI_API_KEY="your-api-key"The default crawler starts from https://karennews.org/ and follows article links matching the configured article URL patterns.
poetry run python hiik_helper/main.pyThe crawler reads visited_urls.json before running and writes updates to:
found_articles.jsonvisited_urls.json
The helper script run-article-extractor.sh runs the same crawler through a local absolute Python path. Update that path before using it on another machine.
ArticleGenerator samples articles from found_articles.json, builds few-shot prompts, and writes generated article records.
Entrypoint:
poetry run python hiik_helper/article_generator_script.pyMost actions in the script are currently commented out. Enable the specific generation or batch-read section you want to run.
Typical generated files include:
generated_articles.jsonbatch_file.jsonldata/created_articles/batch_data/*.jsonl
ArticleExtractor converts article text into OpenAI batch messages and expects responses matching HiikCorpus.HiikArticle.HiikParameters.
Entrypoint:
poetry run python hiik_helper/article_extractor_openai_script.pyThe extraction script is also mostly configured through commented sections. The normal flow is:
- Read generated articles into an
ArticleCorpus. - Convert them into
HiikCorpusarticle shells. - Create a batch JSONL request file.
- Parse the OpenAI batch output JSONL.
- Append structured records to a HIIK corpus JSONL file.
Generated data can become large quickly and is ignored by Git:
data/batch_file.jsonlgenerated_articles.jsonfound_articles.jsonvisited_urls.json
Some of these files may already be tracked in the repository history. Adding them to .gitignore prevents new untracked copies from appearing, but it does not stop Git from tracking files that are already in the index. Use git rm --cached <path> only when you intentionally want to stop tracking an existing file without deleting the local copy.
README.mddocuments the intended workflow, but several scripts are still exploratory and use commented blocks for selecting actions.- Some imports assume scripts are run from specific working directories rather than as installed package modules.
- The XML spider is not production-ready.
- There are no active tests beyond an empty
testspackage.