docs/deploy/servers/data-support.rst
38 additions & 38 deletions
@@ -1,6 +1,42 @@
 Data support
 ============

+Set up incremental updates
+--------------------------
+
+This creates a cron job to run a ``scrapy crawl`` command. The `DatabaseStore <https://kingfisher-collect.readthedocs.io/en/latest/contributing/extensions/database_store.html>`__ extension implements the incremental updates.
+
+#. `Choose a spider <https://kingfisher-collect.readthedocs.io/en/latest/spiders.html>`__ that collects the desired data. Prefer the spider that:
+
+   - Accepts a ``from_date`` spider argument, preferably at the same granularity as the cron schedule
+   - Is fastest: for example, ``_bulk``, instead of ``_api``
+   - Reduces processing: for example, a spider that yields compiled releases
+
+   If needed, improve the spider in `Kingfisher Collect <https://github.com/open-contracting/kingfisher-collect>`__.
+#. Add an entry to the ``python_apps.kingfisher_collect.crawls`` section of the ``pillar/kingfisher_main.sls`` file:
+
+   ``identifier``
+      An uppercase, underscore-separated name, like ``DOMINICAN_REPUBLIC``.
+   ``spider``
+      The spider's name, like ``dominican_republic_api``.
+   ``crawl_time``
+      The current date, like ``'2025-05-06'`` (though, any date works).
+   ``spider_arguments`` (optional)
+      Any `spider arguments <https://kingfisher-collect.readthedocs.io/en/latest/spiders.html#spider-arguments>`__.
+
+      If the spider doesn't yield compiled releases, add ``-a compile_releases=true``.
+   ``cardinal`` (optional)
+      ``True``, to enable a pipeline involving `Cardinal <https://cardinal.readthedocs.io/en/latest/>`__.
+   ``users`` (optional)
+      A list of additional :ref:`PostgreSQL users<pg-users>` that need read access to the database.
+   ``day`` (optional)
+      The day of the month on which to run the cron job.
+
+      Required if an incremental update takes longer than a day.
+
+#. If an *initial crawl* would take longer than a day, run the `scrapy crawl <https://github.com/open-contracting/deploy/blob/main/salt/kingfisher/collect/files/cron.sh>`__ command manually.
+#. :doc:`Deploy the server<../deploy>`.
+
 Create a data support main server
 ---------------------------------

@@ -19,8 +55,8 @@ Dependents

 #. Notify RBC Group of the new domain name for the new PostgreSQL server.

-Update Salt and halt jobs
-~~~~~~~~~~~~~~~~~~~~~~~~~
+Update Salt configuration and halt jobs
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 #. Check that ``docker.uid`` in the server's Pillar file matches the entry in the ``/etc/passwd`` file for the ``docker.user`` (``deployer``).
 #. Change ``cron.present`` to ``cron.absent`` in the ``salt/pelican/backend/init.sls`` file.
@@ -63,42 +99,6 @@ Kingfisher Collect

 Once DNS has propagated, :ref:`update-spiders`.

-Set up incremental updates
-^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-This creates a cron job to run a ``scrapy crawl`` command. The `DatabaseStore <https://kingfisher-collect.readthedocs.io/en/latest/contributing/extensions/database_store.html>`__ extension implements the incremental updates.
-
-#. `Choose a spider <https://kingfisher-collect.readthedocs.io/en/latest/spiders.html>`__ that collects the desired data. Prefer the spider that:
-
-   - Accepts a ``from_date`` spider argument, preferably at the same granularity as the cron schedule
-   - Is fastest: for example, ``_bulk``, instead of ``_api``
-   - Reduces processing: for example, a spider that yields compiled releases
-
-   If needed, improve the spider in `Kingfisher Collect <https://github.com/open-contracting/kingfisher-collect>`__.
-#. Add an entry to the ``python_apps.kingfisher_collect.crawls`` section of the ``pillar/kingfisher_main.sls`` file:
-
-   ``identifier``
-      An uppercase, underscore-separated name, like ``DOMINICAN_REPUBLIC``.
-   ``spider``
-      The spider's name, like ``dominican_republic_api``.
-   ``crawl_time``
-      The current date, like ``'2025-05-06'`` (though, any date works).
-   ``spider_arguments`` (optional)
-      Any `spider arguments <https://kingfisher-collect.readthedocs.io/en/latest/spiders.html#spider-arguments>`__.
-
-      If the spider doesn't yield compiled releases, add ``-a compile_releases=true``.
-   ``cardinal`` (optional)
-      ``True``, to enable a pipeline involving `Cardinal <https://cardinal.readthedocs.io/en/latest/>`__.
-   ``users`` (optional)
-      A list of additional :ref:`PostgreSQL users<pg-users>` that need read access to the database.
-   ``day`` (optional)
-      The day of the month on which to run the cron job.
-
-      Required if an incremental update takes longer than a day.
-
-#. If an *initial crawl* would take longer than a day, run the `scrapy crawl <https://github.com/open-contracting/deploy/blob/main/salt/kingfisher/collect/files/cron.sh>`__ command manually.
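
For illustration, here is a rough sketch of what an entry in the ``python_apps.kingfisher_collect.crawls`` section of ``pillar/kingfisher_main.sls`` might look like, based on the keys documented in "Set up incremental updates". It assumes that ``crawls`` is a YAML list with one mapping per crawl, and it omits any sibling keys the real Pillar file contains. The ``users`` and ``day`` values are hypothetical placeholders, and the ``spider_arguments`` line assumes the chosen spider doesn't yield compiled releases.

```yaml
# pillar/kingfisher_main.sls (sketch only; surrounding keys omitted)
python_apps:
  kingfisher_collect:
    crawls:
      # Assumption: each crawl is one item in a list.
      - identifier: DOMINICAN_REPUBLIC
        spider: dominican_republic_api
        # Quoted so that YAML treats the date as a string.
        crawl_time: '2025-05-06'
        # Optional. Shown here on the assumption that this spider
        # doesn't yield compiled releases.
        spider_arguments: -a compile_releases=true
        # Optional. Enables the pipeline involving Cardinal.
        cardinal: True
        # Optional. Hypothetical PostgreSQL user that needs read access.
        users:
          - example_reader
        # Optional. Hypothetical day of the month for the cron job;
        # required only if an incremental update takes longer than a day.
        day: 1
```

Only ``identifier``, ``spider`` and ``crawl_time`` are described as required; the remaining keys can be dropped for a spider that already yields compiled releases and finishes an incremental update within a day.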