Message queue to coordinate data processing #134

@pm5

Description

As we accumulate more scraped articles and snapshots, selecting the data to work on at each processing step will become slower; e.g. selecting all new snapshots to parse with SQL would be slow and hard to run in parallel.

A message queue is needed to coordinate the processing work and to make it easy to generate processing logs and stats. Kafka is good and popular but adds to the maintenance costs. An alternative would be to use the existing MySQL database as transport and storage for the queue, since we need persistent messages to ensure correct data output.

We can use Kombu from the Celery project, which supports an SQLAlchemy transport that stores the messages in MySQL. Kombu supports both simple standard-library-style queues and more complex worker models.
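A minimal sketch of what this could look like, using Kombu's SimpleQueue API over its SQLAlchemy transport (the connection URL, queue name, and message fields below are placeholders, not decided yet):

```python
from kombu import Connection

# "sqla+mysql://..." selects Kombu's SQLAlchemy transport; messages are
# persisted as rows in tables Kombu creates in the MySQL database.
with Connection("sqla+mysql://user:password@localhost/queue_db") as conn:
    queue = conn.SimpleQueue("snapshots-to-parse")

    # Producer side: ZeroScraper enqueues a snapshot for parsing.
    queue.put({"snapshot_id": 12345})

    # Consumer side: ArticleParser pulls work off the queue.
    message = queue.get(block=True, timeout=5)
    print(message.payload)  # {'snapshot_id': 12345}
    message.ack()           # removed from MySQL only after successful processing

    queue.close()
```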

We can work on the message formats and flows first, until ArticleParser is decoupled from ZeroScraper's database schema at the Python level. After that there would be multiple paths to choose from for future work, such as running multiple ArticleParsers in parallel, tracking processing tasks in a dashboard, reducing the database maintenance workload (since we could easily tell whether a set of snapshots has all been processed successfully), separating publishers from ArticleParsers, etc.
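As a starting point, a message format could be as simple as a small JSON document that identifies the snapshot and carries enough metadata for logs and stats, without exposing ZeroScraper's schema to ArticleParser. All field names here are hypothetical:

```python
import datetime

# Hypothetical message: just enough to locate the snapshot and to
# generate processing logs/stats; ArticleParser never needs to touch
# ZeroScraper's tables directly.
message = {
    "type": "snapshot.new",                                # event name
    "snapshot_id": 12345,                                  # snapshot key in ZeroScraper
    "site_id": 42,                                         # which site it was scraped from
    "scraped_at": datetime.datetime.utcnow().isoformat(),  # set by the producer
}
```

Because each worker acks a message only after a successful parse, several ArticleParsers could consume from the same queue in parallel without stepping on each other.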
