-
Notifications
You must be signed in to change notification settings - Fork 155
added support for huggingface dataset repositories #1007
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
added support for huggingface dataset repositories #1007
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR extends the HuggingFace import functionality to support both models and datasets by adding URL parsing logic and routing to appropriate API endpoints.
- Adds dataset support alongside existing model support through HuggingFace API
- Introduces URL path parsing to differentiate between models and datasets
- Updates API URL constants and function signatures to handle both resource types
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/lib/hf/list.go | Splits tree URL format into model and dataset variants; adds kind parameter to route to correct API endpoint |
| pkg/lib/hf/download.go | Splits resolve URL format into model and dataset variants; adds kind parameter for downloads |
| pkg/cmd/kitimport/util_test.go | Extends test cases to verify dataset URL parsing and returns expected kind alongside repository |
| pkg/cmd/kitimport/util.go | Updates extractRepoFromURL to parse dataset URLs and return both normalized path and kind |
| pkg/cmd/kitimport/hfimport.go | Threads kind parameter through ListFiles and DownloadFiles calls |
| pkg/cmd/kitimport/cmd.go | Updates call to extractRepoFromURL to handle new return signature with kind |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
pkg/cmd/kitimport/util.go
Outdated
| if len(segments) == 3 && segments[0] == "datasets" { | ||
| kind = "dataset" | ||
| normalizedPath = strings.Join(segments[1:], "/") | ||
| } else if len(segments) == 2 { |
Copilot
AI
Nov 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider defining constants for magic strings "dataset" and "model" (e.g., const KindDataset = \"dataset\" and const KindModel = \"model\") to ensure consistency across the codebase and prevent typos.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
created Typed enums called RepositoryType and hfRepoType ,No string literals like "dataset" or "model" used in comparisons
| var treeURL string | ||
| if kind == "dataset" { | ||
| treeURL = fmt.Sprintf(datasetTreeURLFmt, modelRepo, ref) | ||
| } else { | ||
| treeURL = fmt.Sprintf(modelTreeURLFmt, modelRepo, ref) | ||
| } |
Copilot
AI
Nov 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The string literal "dataset" is used for comparison. Consider defining constants for these kind values to ensure consistency and prevent typos across the codebase.
| var resolveURLFmt string | ||
| if kind == "dataset" { | ||
| resolveURLFmt = datasetResolveURLFmt | ||
| } else { | ||
| resolveURLFmt = modelResolveURLFmt | ||
| } |
Copilot
AI
Nov 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The string literal "dataset" is used for comparison. Consider defining constants for these kind values to ensure consistency and prevent typos across the codebase.
pkg/cmd/kitimport/util.go
Outdated
| if segments[0] == "datasets" { | ||
| return "", "", fmt.Errorf("invalid dataset URL format: expected datasets/org/repo, got '%s'", path) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if this check is appropriate for all uses of extractRepoFromURL -- it makes sense for huggingface URLs but kit import supports any git repository as well; what if I try to import from my-gitlab.org/datasets/my-dataset?
This is mixing huggingface-specific logic into this function without making it clear; this function should handle both git repositories (arbitrary names) and huggingface ones.
We should either handle huggingface specifically or rename the function to reflect its actual purpose and update all usages to be appropriate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extractRepoFromURL() is generic now and works for any git repository (GitLab, GitHub, etc.). my-gitlab.org/datasets/my-dataset would be handled correctly by extractRepoFromURL() as a generic org/repo pair. HuggingFace-specific parsing (including "datasets" prefix handling) is isolated in parseHuggingFaceRepo(). Usage in cmd.go (lines 134-142) tries HuggingFace parsing first, then falls back to generic parsing.
pkg/cmd/kitimport/util.go
Outdated
| if len(segments) != 2 { | ||
| return "", fmt.Errorf("could not extract organization and repository from '%s'", path) | ||
|
|
||
| var kind string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably use a typed enum for kind to make things more explicit
type hfRepoType int
const (
hfModelRepoType hfRepoType = iota
hfDatasetRepoType
)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implemented. Added hfRepoType enum in hfimport.go (lines 178-183) and RepositoryType in pkg/lib/hf/repo.go. All comparisons use typed constants (RepoTypeDataset, RepoTypeModel) instead of string literals.
pkg/cmd/kitimport/util.go
Outdated
| // extractRepoFromURL attempts to normalize a string or URL into a repository name as is used on GitHub and Huggingface. | ||
| // Returns an error we cannot automatically handle the input URL/string. | ||
| // - https://example.com/segment1/segment2 --> segment1/segment2 | ||
| // - 'organization/repository' --> 'organization/repository' | ||
| func extractRepoFromURL(rawUrl string) (string, error) { | ||
| // Returns the normalized repository path (org/repo), the kind ("model" or "dataset"), and an error if parsing fails. | ||
| // - https://huggingface.co/org/repo --> (org/repo, "model", nil) | ||
| // - https://huggingface.co/datasets/org/repo --> (org/repo, "dataset", nil) | ||
| // - 'organization/repository' --> (organization/repository, "model", nil) | ||
| // - 'datasets/organization/repository' --> (organization/repository, "dataset", nil) | ||
| func extractRepoFromURL(rawUrl string) (string, string, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changes change the purpose of this function, making it specific for handling huggingface repositories.
// extractRepoFromURL attempts to normalize a string or URL into a repository name as is used on GitHub and Huggingface.vs
Returns the normalized repository path (org/repo), the kind ("model" or "dataset"),
-- the 'kind' here only makes sense in a huggingface context, not GitHub (or any Git repository, for that matter).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extractRepoFromURL() remains generic and only returns the normalized repo path (org/repo) , no kind/type. HuggingFace-specific logic (including dataset detection) is handled by parseHuggingFaceRepo() in hfimport.go.
|
Thanks for the review @amisevsk, Its my short sight that i missed these things. I am working on resolving these comments |
@amisevsk please check and let me know if theres any other changes are required, thank you for your time. |
|
PR checks are failing |
|
@gorkem working on it |
Signed-off-by: Arnab Chatterjee <arnabchat2001@gmail.com>
…tion, enums use int with iota Signed-off-by: Arnab Chatterjee <arnabchat2001@gmail.com>
Signed-off-by: Arnab Chatterjee <arnabchat2001@gmail.com>
Added the standard Apache License 2.0 header to pkg/lib/hf/repo.go for compliance and clarity regarding licensing. Signed-off-by: Arnab Chatterjee <arnabchat2001@gmail.com>
Signed-off-by: Arnab Chatterjee <arnabchat2001@gmail.com>
5fcf2b2 to
eb35533
Compare
|
@gorkem , CI checks should pass now |
Signed-off-by: Arnab Chatterjee <arnabchat2001@gmail.com>
amisevsk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fixes @arnab2001. Looking pretty good, just a few comments before we can merge.
pkg/cmd/kitimport/hfimport.go
Outdated
| func parseHuggingFaceRepo(rawURL string) (string, hfRepoType, error) { | ||
| u, err := url.Parse(rawURL) | ||
| if err != nil { | ||
| return "", 0, fmt.Errorf("failed to parse url: %w", err) | ||
| } | ||
|
|
||
| if u.Host != "" && !strings.Contains(u.Host, "huggingface.co") { | ||
| return "", 0, fmt.Errorf("not a Hugging Face repository") | ||
| } | ||
|
|
||
| path := strings.Trim(u.Path, "/") | ||
| segments := strings.FieldsFunc(path, func(r rune) bool { return r == '/' }) | ||
|
|
||
| if len(segments) >= 3 && segments[0] == "datasets" { | ||
| return strings.Join(segments[1:3], "/"), hfRepoTypeDataset, nil | ||
| } | ||
| if len(segments) == 2 && segments[0] == "datasets" { | ||
| return "", 0, fmt.Errorf("could not extract repository from path '%s'", path) | ||
| } | ||
|
|
||
| if len(segments) >= 2 { | ||
| return strings.Join(segments[len(segments)-2:], "/"), hfRepoTypeModel, nil | ||
| } | ||
|
|
||
| return "", 0, fmt.Errorf("could not extract repository from path '%s'", path) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should move this function into lib/hf and have it return the lib/hf-defined repo types (e.g. it could go into repo.go. This would allow us to avoid the duplicate hfRepoType enum and the mapRepoType function.
pkg/cmd/kitimport/hfimport.go
Outdated
| func parseHuggingFaceRepo(rawURL string) (string, hfRepoType, error) { | ||
| u, err := url.Parse(rawURL) | ||
| if err != nil { | ||
| return "", 0, fmt.Errorf("failed to parse url: %w", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider defining an explicit repoTypeUnknown in the enum (placing it first in the list) and handling that in the lib/hf functions to improve zero-values for the error case.
pkg/cmd/kitimport/hfimport.go
Outdated
| path := strings.Trim(u.Path, "/") | ||
| segments := strings.FieldsFunc(path, func(r rune) bool { return r == '/' }) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you're trimming / from the path, FieldsFunc will be equivalent to strings.Split(path, "/") except harder to read.
Signed-off-by: Arnab Chatterjee <arnabchat2001@gmail.com>
|
@amisevsk Done. Moved ParseHuggingFaceRepo to pkg/lib/hf/repo.go, added RepoTypeUnknown as the zero value, and simplified FieldsFunc to strings.Split. Removed the duplicate hfRepoType enum and mapRepoType function from hfimport.go. |
| return "", RepoTypeUnknown, fmt.Errorf("failed to parse url: %w", err) | ||
| } | ||
|
|
||
| if u.Host != "" && !strings.Contains(u.Host, "huggingface.co") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this check, inputs like https://huggingface.co.evil.com/org/repo pass the “is HF” check and proceed as if they were Hugging Face. Can you tighten the parsing and validation of the host?
| path := strings.Trim(u.Path, "/") | ||
| segments := strings.Split(path, "/") | ||
|
|
||
| if len(segments) >= 3 && segments[0] == "datasets" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kit import huggingface.co/datasets/org/repo --tool hf will fail because huggingface.co/datasets/org/repo (no scheme) will be treated as RepoTypeModel because when the scheme is missing segments[0] is the Host (aka huggingface.co)
|
Working on it |
Description
add Hugging Face dataset support to kit import by detecting dataset URLs and hitting the correct api/datasets/resolve endpoints, so both models and datasets can be packaged
Testing
go test ./...KITOPS_HOME=$(mktemp -d) ./kit import https://huggingface.co/datasets/nvidia/ProfBench --ref main --tag test/profbench:latest --tool hf/kit unpack test/profbench:latest --dir "$(mktemp -d)"Linked issues
closes #1004
AI-Assisted Code