21 changes: 16 additions & 5 deletions fog/ingest/server/src/controller.rs
@@ -411,11 +411,19 @@ where
return;
}
}
Err(err) => {
// Failing to publish a report is not fatal, because we attempt to publish
Err(Error::PublishReport) => {
// This is already logged adequately within


Somewhat unrelated to this PR, but in regards to the logging inside publish_report. I see a comment noting that if it "doesn't resolve itself" then we need to take action. Could you provide context for how I might identify that it is not going to resolve itself?

Contributor Author

Yeah this is a good question, and maybe we should change the code more aggressively.

So the idea here was that we try to publish, but we don't block if we can't. Because it's not critical that we republish the report with a higher expiry number, as long as one of our publications succeeds before the key actually expires.

There are other things we could do here:

  1. Use retries, and log an error but only if e.g. 3 retries fail
  2. Keep track of how often the report publication has failed, and start blocking and looping infinitely if it fails too many times in a row.
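Option 1 could be sketched roughly as follows. This is illustrative only, not code from this branch: `publish_with_retries`, `MAX_RETRIES`, and the closure-based shape are all hypothetical.

```rust
/// Hypothetical sketch of option 1: attempt publication a few times,
/// and only surface an error if every attempt fails. The caller would
/// then log that error at error severity.
const MAX_RETRIES: usize = 3;

fn publish_with_retries<E>(
    mut publish_report: impl FnMut() -> Result<(), E>,
) -> Result<(), E> {
    let mut last_err = None;
    for _attempt in 0..MAX_RETRIES {
        match publish_report() {
            // One success is enough; transient failures are swallowed.
            Ok(()) => return Ok(()),
            Err(err) => last_err = Some(err),
        }
    }
    // All retries failed: return the last error for high-severity logging.
    Err(last_err.expect("MAX_RETRIES > 0"))
}
```

The point of this shape is that a single transient DB hiccup never reaches the error log, but a persistent outage still does.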

I wanted to avoid making significant changes in a release branch, so I just wanted to adjust the log level in this commit.

> Could you provide context for how I might identify that it is not going to resolve itself?

If this warning persists every time we attempt to publish, then eventually the key will expire and users can't publish.
If we manage to process a block without emitting this warning, then the system has recovered.

I don't think random connection issues are going to cause this. If we randomly lose connection to postgres and this publication action fails, it has to be resolved when we try to write block data to postgres, because that write blocks indefinitely until it succeeds, and we won't hit this publication action again until that write succeeds.
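Option 2 from the list above could look roughly like the following sketch. It is illustrative only and assumes a hypothetical `ReportPublisher` wrapper; none of these names exist in this branch.

```rust
use std::{thread, time::Duration};

/// Hypothetical sketch of option 2: count consecutive publication
/// failures, and switch to blocking retries once too many occur in a
/// row, so the key cannot silently expire while we keep failing.
const MAX_CONSECUTIVE_FAILURES: u32 = 10;

struct ReportPublisher {
    consecutive_failures: u32,
}

impl ReportPublisher {
    fn try_publish(&mut self, mut publish_report: impl FnMut() -> Result<(), String>) {
        match publish_report() {
            // Any success resets the counter: the system has recovered.
            Ok(()) => self.consecutive_failures = 0,
            Err(_err) => {
                self.consecutive_failures += 1;
                if self.consecutive_failures >= MAX_CONSECUTIVE_FAILURES {
                    // Too many failures in a row: block and loop until
                    // publication succeeds.
                    while publish_report().is_err() {
                        thread::sleep(Duration::from_secs(1));
                    }
                    self.consecutive_failures = 0;
                }
            }
        }
    }
}
```

The trade-off versus option 1 is that this variant eventually stops making progress on blocks rather than letting the key expire, which is a bigger behavior change than seems appropriate for a release branch.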

Contributor Author

We can consider whether to make this do something else in master branch (1.2.0) I think?

// self.publish_report
//
// Failing to publish a report is not fatal, because we
// attempt to publish
// on every block, and even if it only succeeds half
// the time, it won't make much difference in the pubkey
// expiry window.
}
Err(err) => {
// If the error is something other than PublishReport then we should log it
// at a high severity level
log::error!(
self.logger,
"Could not publish ingest report at block index {}: {}",
@@ -1148,17 +1156,20 @@ where
};
let report_id = self.config.fog_report_id.as_ref();

log::info!(self.logger, "publishing report to sql");
log::info!(self.logger, "publishing report to DB");
self.recovery_db
.set_report(ingress_public_key, report_id, &report_data)
.map(|x| {
counters::LAST_PUBLISHED_PUBKEY_EXPIRY.set(report_data.pubkey_expiry as i64);
x
})
.map_err(|err| {
log::error!(
// This is a warning because transient report DB write errors are not a cause
// for action, as long as this write succeeds eventually it's
// fine.
log::warn!(
self.logger,
"Could not publish report and check on ingress key status: {}",
"Could not publish report to DB and check on ingress key status: {}",
err
);
// Note: At this revision, we don't have generic constraints for converting