
Feature/data 2175 kill timeout spark#45

Merged
claudiazi merged 12 commits into master from feature/DATA-2175-kill-timeout-spark on Jan 21, 2025

Conversation

@claudiazi claudiazi commented Jan 17, 2025

Feature:

  • Add the Spark app name to the Spark job.
  • Add a kill_spark_job function that terminates the Spark job by sending the yarn kill command through SSM (sketched below).
  • Override on_kill to terminate the Spark job via kill_spark_job when the Airflow task is marked as failed or is killed manually through the Airflow UI or CLI.
  • Extend the error handling in execute: when anything in execute fails, the Spark job is also terminated via kill_spark_job.
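
A minimal sketch of the kill path, assuming boto3's SSM client and the AWS-RunShellScript document; the function and parameter names here are illustrative, not necessarily the ones used in the operator:

import boto3


def kill_spark_job(emr_master_instance_id: str, application_id: str) -> None:
    """Terminate the running Spark application by executing
    `yarn application -kill` on the EMR master node via SSM."""
    kill_command = f"yarn application -kill {application_id}"
    ssm_client = boto3.client("ssm")  # assumes AWS credentials are already configured
    ssm_client.send_command(
        InstanceIds=[emr_master_instance_id],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": [kill_command]},
    )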

Tested in datastg.

@claudiazi claudiazi requested a review from a team as a code owner January 17, 2025 09:19
Member

@siklosid siklosid left a comment

Lgtm. It would be nice to add more context to the PR description, e.g. some documentation about what the best practice is for handling timeouts in Airflow and how you arrived at this implementation.

Also it would be nice to handle the case where we cancel/clear the job in the UI, so that it also kills the running Spark job.
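
For reference, Airflow calls an operator's on_kill hook when a running task instance is stopped externally (cleared, marked failed, or killed from the UI/CLI), so routing that hook through kill_spark_job covers this case. A minimal sketch with hypothetical operator and helper names:

from airflow.models import BaseOperator


class SparkSubmitViaSsmOperator(BaseOperator):  # hypothetical operator name

    def kill_spark_job(self, emr_master_instance_id: str, application_id: str) -> None:
        ...  # send `yarn application -kill` via SSM, as in the sketch above

    def on_kill(self) -> None:
        # Invoked by Airflow when the running task is stopped externally,
        # e.g. cleared or marked failed from the UI while still running.
        application_id = self._get_application_id()  # hypothetical helper
        if application_id:
            self.kill_spark_job(self._emr_master_instance_id, application_id)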

job_args=_parse_args(self._template_parameters),
spark_args=_parse_spark_args(self._task.spark_args),
spark_conf_args=_parse_spark_args(self._task.spark_conf_args, '=', 'conf '),
spark_app_name=self._task.spark_conf_args.get("spark.app.name", ""),
Member

🎉 🎆

Parameters={"commands": [kill_command]}
)
raise AirflowException(
f"Spark job exceeded the execution timeout of {self._execution_timeout} seconds and was terminated.")
Member

Can this only happen if it's a timeout? What if we cancel the job in the Airflow UI?

Author

True!!

)
if application_id:
self.kill_spark_job(emr_master_instance_id, application_id)
raise AirflowException("Task timed out and the Spark job was terminated.")
Member

We are already raising an exception in kill_spark_job, so I'm not even sure the code will ever get here.

Author

ah true! > <
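
One possible restructuring of this path (a sketch only, not necessarily what the follow-up commit does): let kill_spark_job just send the SSM command and return, so the caller's raise is reachable and each call site can keep its own error message.

from airflow.exceptions import AirflowException


def terminate_and_fail(operator, emr_master_instance_id, application_id):
    # Assumes operator.kill_spark_job only sends the SSM kill command and
    # returns; the raise below is then reachable, and the timeout path and
    # the on_kill path can report different messages.
    if application_id:
        operator.kill_spark_job(emr_master_instance_id, application_id)
    raise AirflowException("Task timed out and the Spark job was terminated.")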

@claudiazi claudiazi requested review from a team and siklosid January 20, 2025 10:39
Member

@siklosid siklosid left a comment

Lgtm. @claudiazi Do you know why there were so many code style changes? It made the review a bit harder, because it was not clear what was an actual change and what was not.

@claudiazi
Author

@siklosid Sorry!! I ran black on the code 🙈

@claudiazi claudiazi merged commit 9ad0c1e into master Jan 21, 2025
1 check passed
@siklosid
Member

@claudiazi I think black is part of the pre-commit hook. Does this mean it was not blacked before? Or was it blacked with different settings?

@claudiazi
Author

@siklosid It seems that pre-commit was not working at all. 🤔
Ahh, because pre-commit was not installed in install-dev.
