Skip to content

Conversation

@arnab2001
Copy link

Description

add Hugging Face dataset support to kit import by detecting dataset URLs and hitting the correct api/datasets/resolve endpoints, so both models and datasets can be packaged

Testing

go test ./...
KITOPS_HOME=$(mktemp -d) ./kit import https://huggingface.co/datasets/nvidia/ProfBench --ref main --tag test/profbench:latest --tool hf
/kit unpack test/profbench:latest --dir "$(mktemp -d)"

Screenshot 2025-11-04 at 6 04 19 PM

Linked issues

closes #1004

AI-Assisted Code

  • This PR contains AI-generated code that I have reviewed and tested
  • I take full responsibility for all code in this PR, regardless of how it was created

@gorkem gorkem requested review from amisevsk and Copilot November 4, 2025 15:08
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR extends the HuggingFace import functionality to support both models and datasets by adding URL parsing logic and routing to appropriate API endpoints.

  • Adds dataset support alongside existing model support through HuggingFace API
  • Introduces URL path parsing to differentiate between models and datasets
  • Updates API URL constants and function signatures to handle both resource types

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
pkg/lib/hf/list.go Splits tree URL format into model and dataset variants; adds kind parameter to route to correct API endpoint
pkg/lib/hf/download.go Splits resolve URL format into model and dataset variants; adds kind parameter for downloads
pkg/cmd/kitimport/util_test.go Extends test cases to verify dataset URL parsing and returns expected kind alongside repository
pkg/cmd/kitimport/util.go Updates extractRepoFromURL to parse dataset URLs and return both normalized path and kind
pkg/cmd/kitimport/hfimport.go Threads kind parameter through ListFiles and DownloadFiles calls
pkg/cmd/kitimport/cmd.go Updates call to extractRepoFromURL to handle new return signature with kind

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 183 to 186
if len(segments) == 3 && segments[0] == "datasets" {
kind = "dataset"
normalizedPath = strings.Join(segments[1:], "/")
} else if len(segments) == 2 {
Copy link

Copilot AI Nov 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider defining constants for magic strings "dataset" and "model" (e.g., const KindDataset = \"dataset\" and const KindModel = \"model\") to ensure consistency across the codebase and prevent typos.

Copilot uses AI. Check for mistakes.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

created Typed enums called RepositoryType and hfRepoType ,No string literals like "dataset" or "model" used in comparisons

Comment on lines 54 to 59
var treeURL string
if kind == "dataset" {
treeURL = fmt.Sprintf(datasetTreeURLFmt, modelRepo, ref)
} else {
treeURL = fmt.Sprintf(modelTreeURLFmt, modelRepo, ref)
}
Copy link

Copilot AI Nov 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The string literal "dataset" is used for comparison. Consider defining constants for these kind values to ensure consistency and prevent typos across the codebase.

Copilot uses AI. Check for mistakes.
Comment on lines 60 to 65
var resolveURLFmt string
if kind == "dataset" {
resolveURLFmt = datasetResolveURLFmt
} else {
resolveURLFmt = modelResolveURLFmt
}
Copy link

Copilot AI Nov 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The string literal "dataset" is used for comparison. Consider defining constants for these kind values to ensure consistency and prevent typos across the codebase.

Copilot uses AI. Check for mistakes.
Comment on lines 187 to 189
if segments[0] == "datasets" {
return "", "", fmt.Errorf("invalid dataset URL format: expected datasets/org/repo, got '%s'", path)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if this check is appropriate for all uses of extractRepoFromURL -- it makes sense for huggingface URLs but kit import supports any git repository as well; what if I try to import from my-gitlab.org/datasets/my-dataset?

This is mixing huggingface-specific logic into this function without making it clear; this function should handle both git repositories (arbitrary names) and huggingface ones.

We should either handle huggingface specifically or rename the function to reflect its actual purpose and update all usages to be appropriate.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extractRepoFromURL() is generic now and works for any git repository (GitLab, GitHub, etc.). my-gitlab.org/datasets/my-dataset would be handled correctly by extractRepoFromURL() as a generic org/repo pair. HuggingFace-specific parsing (including "datasets" prefix handling) is isolated in parseHuggingFaceRepo(). Usage in cmd.go (lines 134-142) tries HuggingFace parsing first, then falls back to generic parsing.

if len(segments) != 2 {
return "", fmt.Errorf("could not extract organization and repository from '%s'", path)

var kind string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably use a typed enum for kind to make things more explicit

type hfRepoType int

const (
    hfModelRepoType hfRepoType = iota
    hfDatasetRepoType
)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented. Added hfRepoType enum in hfimport.go (lines 178-183) and RepositoryType in pkg/lib/hf/repo.go. All comparisons use typed constants (RepoTypeDataset, RepoTypeModel) instead of string literals.

Comment on lines 165 to 171
// extractRepoFromURL attempts to normalize a string or URL into a repository name as is used on GitHub and Huggingface.
// Returns an error we cannot automatically handle the input URL/string.
// - https://example.com/segment1/segment2 --> segment1/segment2
// - 'organization/repository' --> 'organization/repository'
func extractRepoFromURL(rawUrl string) (string, error) {
// Returns the normalized repository path (org/repo), the kind ("model" or "dataset"), and an error if parsing fails.
// - https://huggingface.co/org/repo --> (org/repo, "model", nil)
// - https://huggingface.co/datasets/org/repo --> (org/repo, "dataset", nil)
// - 'organization/repository' --> (organization/repository, "model", nil)
// - 'datasets/organization/repository' --> (organization/repository, "dataset", nil)
func extractRepoFromURL(rawUrl string) (string, string, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes change the purpose of this function, making it specific for handling huggingface repositories.

// extractRepoFromURL attempts to normalize a string or URL into a repository name as is used on GitHub and Huggingface.

vs

Returns the normalized repository path (org/repo), the kind ("model" or "dataset"),

-- the 'kind' here only makes sense in a huggingface context, not GitHub (or any Git repository, for that matter).

Copy link
Author

@arnab2001 arnab2001 Nov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extractRepoFromURL() remains generic and only returns the normalized repo path (org/repo) , no kind/type. HuggingFace-specific logic (including dataset detection) is handled by parseHuggingFaceRepo() in hfimport.go.

@arnab2001
Copy link
Author

Thanks for the review @amisevsk, Its my short sight that i missed these things. I am working on resolving these comments

@arnab2001
Copy link
Author

arnab2001 commented Nov 5, 2025

  • Created typed enums: RepositoryType in pkg/lib/hf/repo.go and hfRepoType in pkg/cmd/kitimport/hfimport.go, now no string literals ("dataset", "model") are used in comparisons.
    All comparisons use typed constants now

  • Seperation: extractRepoFromURL() in util.go is now generic and handles GitHub/Git URLs only, parseHuggingFaceRepo() in hfimport.go handles all HuggingFace-specific logic (including dataset detection). extractRepoFromURL() is no longer mixing concerns

  • Both enums use int with iota as suggested
    Proper mapping function mapRepoType() converts between the two enum types

@amisevsk please check and let me know if theres any other changes are required, thank you for your time.

@arnab2001 arnab2001 requested a review from amisevsk November 5, 2025 04:58
@gorkem
Copy link
Member

gorkem commented Dec 3, 2025

PR checks are failing

@arnab2001
Copy link
Author

@gorkem working on it

Signed-off-by: Arnab Chatterjee <arnabchat2001@gmail.com>
…tion, enums use int with iota

Signed-off-by: Arnab Chatterjee <arnabchat2001@gmail.com>
Signed-off-by: Arnab Chatterjee <arnabchat2001@gmail.com>
Added the standard Apache License 2.0 header to pkg/lib/hf/repo.go for compliance and clarity regarding licensing.

Signed-off-by: Arnab Chatterjee <arnabchat2001@gmail.com>
Signed-off-by: Arnab Chatterjee <arnabchat2001@gmail.com>
@arnab2001 arnab2001 force-pushed the added-support-for-huggingface-dataset-repositories branch from 5fcf2b2 to eb35533 Compare December 5, 2025 18:27
@arnab2001
Copy link
Author

@gorkem , CI checks should pass now

Signed-off-by: Arnab Chatterjee <arnabchat2001@gmail.com>
Copy link
Contributor

@amisevsk amisevsk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fixes @arnab2001. Looking pretty good, just a few comments before we can merge.

Comment on lines 194 to 219
func parseHuggingFaceRepo(rawURL string) (string, hfRepoType, error) {
u, err := url.Parse(rawURL)
if err != nil {
return "", 0, fmt.Errorf("failed to parse url: %w", err)
}

if u.Host != "" && !strings.Contains(u.Host, "huggingface.co") {
return "", 0, fmt.Errorf("not a Hugging Face repository")
}

path := strings.Trim(u.Path, "/")
segments := strings.FieldsFunc(path, func(r rune) bool { return r == '/' })

if len(segments) >= 3 && segments[0] == "datasets" {
return strings.Join(segments[1:3], "/"), hfRepoTypeDataset, nil
}
if len(segments) == 2 && segments[0] == "datasets" {
return "", 0, fmt.Errorf("could not extract repository from path '%s'", path)
}

if len(segments) >= 2 {
return strings.Join(segments[len(segments)-2:], "/"), hfRepoTypeModel, nil
}

return "", 0, fmt.Errorf("could not extract repository from path '%s'", path)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should move this function into lib/hf and have it return the lib/hf-defined repo types (e.g. it could go into repo.go. This would allow us to avoid the duplicate hfRepoType enum and the mapRepoType function.

func parseHuggingFaceRepo(rawURL string) (string, hfRepoType, error) {
u, err := url.Parse(rawURL)
if err != nil {
return "", 0, fmt.Errorf("failed to parse url: %w", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider defining an explicit repoTypeUnknown in the enum (placing it first in the list) and handling that in the lib/hf functions to improve zero-values for the error case.

Comment on lines 204 to 205
path := strings.Trim(u.Path, "/")
segments := strings.FieldsFunc(path, func(r rune) bool { return r == '/' })
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you're trimming / from the path, FieldsFunc will be equivalent to strings.Split(path, "/") except harder to read.

Signed-off-by: Arnab Chatterjee <arnabchat2001@gmail.com>
@arnab2001
Copy link
Author

@amisevsk Done. Moved ParseHuggingFaceRepo to pkg/lib/hf/repo.go, added RepoTypeUnknown as the zero value, and simplified FieldsFunc to strings.Split. Removed the duplicate hfRepoType enum and mapRepoType function from hfimport.go.

return "", RepoTypeUnknown, fmt.Errorf("failed to parse url: %w", err)
}

if u.Host != "" && !strings.Contains(u.Host, "huggingface.co") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this check, inputs like https://huggingface.co.evil.com/org/repo pass the “is HF” check and proceed as if they were Hugging Face. Can you tighten the parsing and validation of the host?

path := strings.Trim(u.Path, "/")
segments := strings.Split(path, "/")

if len(segments) >= 3 && segments[0] == "datasets" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kit import huggingface.co/datasets/org/repo --tool hf will fail because huggingface.co/datasets/org/repo (no scheme) will be treated as RepoTypeModel because when the scheme is missing segments[0] is the Host (aka huggingface.co)

@arnab2001
Copy link
Author

Working on it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Kit import does not support huggingface dataset repositories

3 participants