-
Notifications
You must be signed in to change notification settings - Fork 1.2k
feat: Adjust ray offline store to support abfs(s) ADLS Azure Storage #5911
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sdk/python/feast/infra/offline_stores/contrib/ray_offline_store/ray.py
Outdated
Show resolved
Hide resolved
| destination_path = storage.file_options.uri | ||
| if not destination_path.startswith(("s3://", "gs://", "hdfs://")): | ||
| if not destination_path.startswith( | ||
| ("s3://", "gs://", "hdfs://", "abfs://", "abfss://") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Remote storage URI schemes supported by the Ray offline store
# S3: Amazon S3
# GCS: Google Cloud Storage
# HDFS: Hadoop Distributed File System
# Azure: Azure Storage Gen2
REMOTE_STORAGE_SCHEMES = ("s3://", "gs://", "hdfs://", "abfs://", "abfss://")
Can we define a constant for supported remote storage URI schemes at top and use it later at all three locations?
a77ca9e to
58d8df3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Devin Review found 1 new potential issue.
🔴 1 issue in files not directly in the diff
🔴 Ray compute engine job.py not updated to support Azure storage schemes (sdk/python/feast/infra/compute_engines/ray/job.py:208)
The PR adds Azure storage support (abfs://, abfss://) to the Ray offline store by introducing a REMOTE_STORAGE_SCHEMES constant, but fails to update the RayDAGRetrievalJob.persist() method in job.py which still uses a hardcoded tuple ("s3://", "gs://", "hdfs://").
Click to expand
Impact
When users try to persist datasets to Azure storage using the Ray compute engine (via RayDAGRetrievalJob), the code will incorrectly:
- Try to check if the Azure path exists as a local file (
os.path.exists(destination_path)) - Try to create local directories (
os.makedirs(os.path.dirname(destination_path), exist_ok=True))
This happens because Azure paths like abfss://container@account.dfs.core.windows.net/path are not recognized as remote storage.
Actual vs Expected
- Actual:
job.py:208uses("s3://", "gs://", "hdfs://")which doesn't include Azure schemes - Expected: Should use
REMOTE_STORAGE_SCHEMESfromray.pywhich includes("s3://", "gs://", "hdfs://", "abfs://", "abfss://")
Code comparison
ray.py:70 defines:
REMOTE_STORAGE_SCHEMES = ("s3://", "gs://", "hdfs://", "abfs://", "abfss://")But job.py:208 still uses:
if not destination_path.startswith(("s3://", "gs://", "hdfs://")):Recommendation: Import and use REMOTE_STORAGE_SCHEMES from feast.infra.offline_stores.contrib.ray_offline_store.ray or define a shared constant in a common module that both files can import.
View issue and 6 additional flags in Devin Review.
@jbauer12 I think this comment make sense, we need to update job.py as well |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Devin Review found 2 new potential issues.
🔴 1 issue in files not directly in the diff
🔴 Ray compute engine job.py not updated to support Azure storage schemes (sdk/python/feast/infra/compute_engines/ray/job.py:208)
The PR adds Azure storage support (abfs://, abfss://) to the Ray offline store by introducing a REMOTE_STORAGE_SCHEMES constant, but fails to update the RayDAGRetrievalJob.persist() method in job.py which still uses a hardcoded tuple ("s3://", "gs://", "hdfs://").
Click to expand
Impact
When users try to persist datasets to Azure storage using the Ray compute engine (via RayDAGRetrievalJob), the code will incorrectly:
- Try to check if the Azure path exists as a local file (
os.path.exists(destination_path)) - Try to create local directories (
os.makedirs(os.path.dirname(destination_path), exist_ok=True))
This happens because Azure paths like abfss://container@account.dfs.core.windows.net/path are not recognized as remote storage.
Actual vs Expected
- Actual:
job.py:208uses("s3://", "gs://", "hdfs://")which doesn't include Azure schemes - Expected: Should use
REMOTE_STORAGE_SCHEMESfromray.pywhich includes("s3://", "gs://", "hdfs://", "abfs://", "abfss://")
Code comparison
ray.py:70 defines:
REMOTE_STORAGE_SCHEMES = ("s3://", "gs://", "hdfs://", "abfs://", "abfss://")But job.py:208 still uses:
if not destination_path.startswith(("s3://", "gs://", "hdfs://")):Recommendation: Import and use REMOTE_STORAGE_SCHEMES from feast.infra.offline_stores.contrib.ray_offline_store.ray or define a shared constant in a common module that both files can import.
View issues and 8 additional flags in Devin Review.
sdk/python/feast/infra/offline_stores/contrib/ray_offline_store/ray.py
Outdated
Show resolved
Hide resolved
bb4d75a to
698bd1a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sdk/python/feast/infra/offline_stores/contrib/ray_offline_store/ray.py
Outdated
Show resolved
Hide resolved
Signed-off-by: Jonas Bauer <jbauer@easy2parts.com>
Signed-off-by: Jonas Bauer <jbauer@easy2parts.com>
Signed-off-by: Jonas Bauer <jbauer@easy2parts.com>
Signed-off-by: Jonas Bauer <jbauer@easy2parts.com>
Signed-off-by: Jonas Bauer <jbauer@easy2parts.com>
Signed-off-by: Jonas Bauer <jbauer@easy2parts.com>
Signed-off-by: Jonas Bauer <jbauer@easy2parts.com>
Signed-off-by: Jonas Bauer <jbauer@easy2parts.com>
177d70d to
764be5a
Compare
What this PR does / why we need it:
Adjustments to support ADLS Gen2 Azure Storage in Ray Offline Store.
Which issue(s) this PR fixes:
Fixes##5844