Conversation
Signed-off-by: Samk <sampurnapyne1710@gmail.com>
I have made the changes as per the suggestions. If any other changes are required, do let me know. Looking forward to your review!
@TG1999 @keshav-space
keshav-space left a comment:
Thanks @Samk1710, see comments below.
Hi @keshav-space
keshav-space left a comment:
Thanks @Samk1710, see the comments below. Also, please explain why you need to break the request into individual months in backfill_from_year.
Thanks @keshav-space for the review and clarifications. I have implemented all the suggestions.
@Samk1710 here is the response to #1 (comment)
This does not explain the need for monthly iteration. In every EUVD advisory we get
Again, this is a lot of words that answers nothing. No, this does not make the logic simpler: you are breaking the response into months, but you are still using the API
@keshav-space Regarding datePublished: I chose to fetch by dateUpdated because advisories can be updated even after publication. For example, the advisory above was published in 2009 and updated in 2017, so fetching by datePublished would miss the 2017 update. And if we fetch by dateUpdated but organize by datePublished, we would have to keep touching the backfill data during daily syncs, which also adds an extra layer of parsing. Also for fetching we can't do
Hence we can do either a yearly backfill or a monthly backfill, since both would be restartable and require a similar number of API calls. I chose monthly because, as I mentioned, that is how most of the data sources I have looked at are structured. If you have further questions or want changes to the code implementation, kindly let me know.
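The datePublished vs. dateUpdated point above can be illustrated with a small sketch. The field names follow the EUVD advisory fields discussed in this thread; the records themselves are made up for the example:

```python
from datetime import date

# Toy advisories mirroring the example from the discussion:
# "A" was published in 2009 but updated in 2017.
advisories = [
    {"id": "A", "datePublished": date(2009, 1, 1), "dateUpdated": date(2017, 6, 1)},
    {"id": "B", "datePublished": date(2018, 3, 1), "dateUpdated": date(2018, 3, 1)},
]

since = date(2017, 1, 1)

# Filtering by datePublished misses the 2017 update to advisory "A"...
by_published = [a["id"] for a in advisories if a["datePublished"] >= since]

# ...while filtering by dateUpdated catches it.
by_updated = [a["id"] for a in advisories if a["dateUpdated"] >= since]
```

Here `by_published` contains only `"B"`, while `by_updated` contains both `"A"` and `"B"`, which is why a sync keyed on dateUpdated does not lose post-publication updates.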
Exactly. The API response is already paginated, so there is no need for double pagination such as chunking the response by months, years, or anything else.
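A minimal sketch of the single-pagination approach: the `fetch_page(page, size)` callable stands in for the actual HTTP call (its name and signature are illustrative, not the EUVD API's), and the loop simply walks pages until one comes back empty:

```python
def iterate_pages(fetch_page, page_size=100):
    """Walk a paginated API one page at a time until an empty page.

    ``fetch_page(page, size)`` is a stand-in for the HTTP request; it
    returns the list of records on that page (empty once past the end).
    No month/year chunking is needed: pagination alone covers the
    whole dataset.
    """
    page = 0
    while True:
        items = fetch_page(page, page_size)
        if not items:
            break
        yield from items
        page += 1
```

Because the page counter is the only cursor, adding a second layer of date-based chunking on top of this loop only multiplies the number of requests without changing what is collected.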
Use
If the concern is that the pipeline may fail during a network call, then backfill and daily modes do not solve that problem: a daily run can also fail if we get a huge number of updates, and there is no built-in mechanism to automatically resume from the last failure on the next run. Get rid of the backfill and daily modes. Instead, we should have something simpler, like this:

```python
class EUVDAdvisoryMirror(BasePipeline):
    url = "https://euvdservices.enisa.europa.eu/api/search"

    @classmethod
    def steps(cls):
        return (
            cls.load_checkpoint,
            cls.create_session,
            cls.collect_new_advisory,
            cls.save_checkpoint,
        )
```
This way, on the first run the pipeline will automatically collect all advisories, and on subsequent runs it will collect only incremental updates. In case of a pipeline failure, it will always collect data since the last successful run.
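The checkpoint behavior described above can be sketched as follows. This is a toy model, not the project's actual `BasePipeline` API: `CheckpointStore` and `fetch_updated_since` are hypothetical names, and a real pipeline would persist the checkpoint (e.g. in the database) rather than keep it in memory:

```python
from datetime import datetime, timezone


class CheckpointStore:
    """Toy in-memory checkpoint; a real pipeline would persist this."""

    def __init__(self):
        self.last_run = None  # None means no successful run yet


def collect_new_advisories(store, fetch_updated_since):
    """First run: the checkpoint is None, so everything is collected.
    Later runs: only records updated since the last successful run.
    The checkpoint is saved only after success, so the window covered
    by a failed run is re-covered in full on the next run."""
    since = store.last_run
    started = datetime.now(timezone.utc)
    collected = fetch_updated_since(since)
    store.last_run = started  # save_checkpoint: reached only on success
    return collected
```

The key design point is that the checkpoint is written after collection succeeds, so a crash mid-run leaves the old checkpoint in place and the next run naturally picks up everything since the last success.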
Thanks @keshav-space for the suggestions. I will update the implementation shortly.
Hey @keshav-space
EUVD Mirror Pipeline
The script has two modes: backfill and daily.
As of now the PR focuses on the code; once it is reviewed and approved, I can add the locally collected backfill data.