Skip to content

Initial Kubernetes deployment changes#17

Draft
JackKammerer wants to merge 13 commits intoFaaSr:mainfrom
JackKammerer:kubernetes-deployment
Draft

Initial Kubernetes deployment changes#17
JackKammerer wants to merge 13 commits intoFaaSr:mainfrom
JackKammerer:kubernetes-deployment

Conversation

@JackKammerer
Copy link
Copy Markdown

These are the basic changes I made to allow the backend to work with Kubernetes. Most of this is in-line with the other implementations. The most important functions are the new helpers in kubernetes_helper.py and the invoke_kubernetes function.

The biggest change I need to make before this can be considered complete is to remove the "verify=False" which is included in the call to make the POST request in make_kubernetes_request in kubernetes_helper.py. This is only included as the NRP Kubernetes cluster I am testing with uses self-signed certificates (along with local instances). There is a way to solve this issue, but it is about implementing a way to specify a certificate authority in the FaaSr config, and then ensuring the request has access to the certificate for https verification. I will work on fixing this and adding it to the pull request when I am finished.

If there are any other changes I should make, please let me know!

@Ashish-Ramrakhiani
Copy link
Copy Markdown
Collaborator

Thanks for the PR Jack! The overall structure is solid and follows the
existing backend patterns well. I tested this end-to-end on an Oracle
Cloud VM running k3s and got two parallel Kubernetes jobs completing
successfully with logs written to S3. A few issues to address before merge:

1. verify=False (you already flagged this)
The POST in make_kubernetes_request() uses verify=False disabling SSL
verification entirely. Same issue exists in PR #28. The fix needs a
CABundle or SSLCertificate field in the server config so users can
provide their CA cert. Until then this is a security risk.

2. validate_jwt_token() rejects modern Kubernetes tokens
This caused failures during testing. Modern Kubernetes (v1.21+) and k3s
use bound service account tokens with URL-based issuers like
https://kubernetes.default.svc.cluster.local and nested JSON structure —
not the legacy "kubernetes/serviceaccount" format your validation expects.
The secret.name field also doesn't exist in modern tokens. The validation
needs to handle both formats:

3. GH_PAT not forwarded to container env vars
invoke_kubernetes() only passes PAYLOAD_URL and OVERWRITTEN to the
container. When the container tries to fetch function code from a private
GitHub repo it gets 404 because GH_PAT is missing. invoke_slurm() already
handles this — just add the same pattern:

4. Typo: "ComputerServers" instead of "ComputeServers"
In the UseSecretStore check inside invoke_kubernetes():

The condition always evaluates False so secrets are always included in
the payload even when UseSecretStore is True.

5. ttlSecondsAfterFinished: 10 is too short for debugging
Pods and job objects disappear before logs can be inspected. Recommend
increasing to 300 or making it configurable via TimeLimit in the server
config (consistent with SLURM and GCP).

6. activeDeadlineSeconds: 300 hardcoded
5 minutes will silently kill long-running scientific workflows. Should be
configurable via TimeLimit in the server config.

@JackKammerer
Copy link
Copy Markdown
Author

Made more changes:

  • Added support for the following fields in the Kubernetes compute server: "AllowSelfSignedCertificate", "SSLCertificate", "MaxCPU", "MaxMemory", "TimeLimit", "AdditionalTimeToLive", "NumberOfRetries". These are all configurable for the endpoint server. The most important is the "SSLCertificate", which is a string that is the authority certificate for the server.
  • GH_PAT is now passed through with the other environment variables.
  • The old, long-term JWT tokens and the new, temporary JWT tokens are both validated and used for authentication.
  • Code ran correctly against the NRP backend.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants