Message queue to coordinate data processing #134

@pm5

Description

As we accumulate more scraped articles and snapshots, selecting the data to work on at each processing step will become slower; e.g. selecting all new snapshots to parse with SQL would be slow and hard to run in parallel.

A message queue is needed to coordinate the processing work and to make it easy to generate processing logs and stats. Kafka is good and popular but adds to the maintenance costs. An alternative would be to use the existing MySQL database as transport and storage for the queue, since we need persistent messages to ensure correct data output.

We can use Kombu from the Celery project, which supports an SQLAlchemy transport that stores the messages in MySQL. Kombu supports both simple standard-library-style queues and more complex worker models.
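A minimal sketch of what this could look like, using Kombu's SimpleQueue API over its SQLAlchemy transport (the connection URL, queue name, and message fields below are placeholders, not decided yet):

```python
from kombu import Connection

# "sqla+mysql://..." selects Kombu's SQLAlchemy transport; messages are
# persisted as rows in tables Kombu creates in the MySQL database.
with Connection("sqla+mysql://user:password@localhost/queue_db") as conn:
    queue = conn.SimpleQueue("snapshots-to-parse")

    # Producer side: ZeroScraper enqueues a snapshot for parsing.
    queue.put({"snapshot_id": 12345})

    # Consumer side: ArticleParser pulls work off the queue.
    message = queue.get(block=True, timeout=5)
    print(message.payload)  # {'snapshot_id': 12345}
    message.ack()           # removed from MySQL only after successful processing

    queue.close()
```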

We can work on the message formats and flows first, until ArticleParser is decoupled from ZeroScraper's database schema at the Python level. After that there would be multiple paths to choose from for future work, such as running multiple ArticleParsers in parallel, tracking processing tasks in a dashboard, reducing the database maintenance workload (since we could easily tell whether a set of snapshots has all been processed successfully), separating publishers from ArticleParsers, etc.
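As a starting point, a message format could be as simple as a small JSON document that identifies the snapshot and carries enough metadata for logs and stats, without exposing ZeroScraper's schema to ArticleParser. All field names here are hypothetical:

```python
import datetime

# Hypothetical message: just enough to locate the snapshot and to
# generate processing logs/stats; ArticleParser never needs to touch
# ZeroScraper's tables directly.
message = {
    "type": "snapshot.new",                                # event name
    "snapshot_id": 12345,                                  # snapshot key in ZeroScraper
    "site_id": 42,                                         # which site it was scraped from
    "scraped_at": datetime.datetime.utcnow().isoformat(),  # set by the producer
}
```

Because each worker acks a message only after a successful parse, several ArticleParsers could consume from the same queue in parallel without stepping on each other.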
