nix-hug


Declarative Hugging Face model and dataset management for Nix. nix-hug pins models to exact revisions, fetches only the files you need, builds offline-compatible HuggingFace Hub caches, and supports importing from and exporting to the local HuggingFace cache.

The CLI downloads models into the Nix store:

$ nix run github:longregen/nix-hug -- fetch MiniMaxAI/MiniMax-M2.5
nix-hug-lib.fetchModel {
  url = "MiniMaxAI/MiniMax-M2.5";
  rev = "abc123...";
  fileTreeHash = "sha256-...";
};

The output can then be used in Nix:

# Smoke test: an app that just loads the model in Python
let
  minimax = nix-hug-lib.fetchModel {
    url = "MiniMaxAI/MiniMax-M2.5";
    rev = "abc123...";
    fileTreeHash = "sha256-...";
  };
  cache = nix-hug-lib.buildCache {
    models = [ minimax ];
  };
  python = pkgs.python3.withPackages (p: [ p.transformers p.torch ]);
in
  pkgs.writeShellApplication {
    name = "say-minimax-inefficiently";
    runtimeInputs = [ python ];
    text = ''
      export HF_HUB_CACHE=${cache}
      export TRANSFORMERS_OFFLINE=1
      python -c "
      from transformers import AutoModelForCausalLM
      model = AutoModelForCausalLM.from_pretrained('MiniMaxAI/MiniMax-M2.5')
      print(model)
      "
    '';
  }

Table of Contents

  • Quick Start
  • How It Works
  • HuggingFace Cache Integration
  • Installation
  • CLI Reference
  • Nix Library
  • URL Formats
  • Development
  • License

Quick Start

Add nix-hug to your flake inputs:

{
  inputs = {
    nixpkgs.url = "github:NixOS/nixpkgs";
    nix-hug.url = "github:longregen/nix-hug";
  };
}

Use the CLI to fetch a model. It resolves the revision, computes hashes, and prints a Nix expression you can paste into your configuration:

$ nix-hug fetch mistralai/Mistral-7B-Instruct-v0.3 --include '*.safetensors'

Use the output in your flake to build an offline HuggingFace Hub cache:

let
  nix-hug-lib = nix-hug.lib.${system};

  mistral = nix-hug-lib.fetchModel {
    url = "mistralai/Mistral-7B-Instruct-v0.3";
    rev = "abc123...";  # pinned commit hash from CLI output
    filters = { include = [ ".*\\.safetensors" ]; };
    fileTreeHash = "sha256-...";
  };

  cache = nix-hug-lib.buildCache {
    models = [ mistral ];
  };
in
  pkgs.mkShell {
    HF_HUB_CACHE = cache;
    TRANSFORMERS_OFFLINE = "1";
  }

Python run inside this shell finds the model without network access (the transformers library reads the HF_HUB_CACHE environment variable):

from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")

How It Works

nix-hug has two parts: a Bash-based CLI and a Nix library. The CLI's fetch subcommand resolves the git ref to a commit hash via the Hugging Face API. It then fetches the repository's file tree metadata and computes a SHA256 hash of the directory structure as HuggingFace libraries will consume it. The CLI's output is a Nix expression that pins that fileTreeHash and the resolved revision.

When that expression is consumed, the Nix library evaluates it and performs the same steps as the CLI: fetchGit clones the Hugging Face repository at the pinned revision, which retrieves all small files (configs, tokenizer data, etc.) but only LFS pointer files for the large weights. fetchurl then downloads each LFS file from HuggingFace's CDN, using the LFS SHA256 OID as the content hash. Filters can be provided to selectively download only some of these large files, which is useful when a repository contains model files you don't need (for example, a single ".safetensors" file from a repository that also ships ONNX exports, or one quantization among many in the same repo). Finally, a derivation assembles the result: the git checkout with the real model files replacing the LFS pointers.

buildCache takes fetched models and datasets and arranges them into the directory layout that HuggingFace Hub's Python libraries expect:

models--org--repo/
  refs/
    main            # contains the pinned commit hash
  snapshots/
    <rev>/          # the actual model files
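
As a concrete walkthrough (hypothetical output; it assumes the flake from the Installation section below, which exposes a model-cache package):

$ nix build .#model-cache -o result
$ cat result/models--stas--tiny-random-llama-2/refs/main
3579d71fd57e04f5a364d824d3a2ec3e913dbb67
$ ls result/models--stas--tiny-random-llama-2/snapshots/
3579d71fd57e04f5a364d824d3a2ec3e913dbb67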

Set HF_HUB_CACHE to this store path and any library that reads the Hub cache (transformers, diffusers, sentence-transformers) will find the model without making network requests. Note that the datasets library is known to cause problems in some cases (contributions welcome).

Everything is content-addressed. The same inputs produce the same store paths. Models can be shared across machines, cached in CI, and pinned in lockfiles the same way as any other Nix dependency.

HuggingFace Cache Integration

nix-collect-garbage removes store paths not referenced by a GC root, and for large models, re-downloading after collection is expensive. The export command copies a model from the Nix store into the local HuggingFace cache directory; import copies it back. Both use the same directory layout that transformers, diffusers, and other HF libraries read from. The cache location is determined by $HF_HUB_CACHE, then $HF_HOME/hub, falling back to $XDG_CACHE_HOME/huggingface/hub/.
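
A typical round-trip might look like this (a sketch using commands documented in the CLI Reference below):

$ nix-hug export openai-community/gpt2   # Nix store -> local HF cache
$ nix-collect-garbage                    # the store path may be collected
$ nix-hug import openai-community/gpt2   # back into the store from the local cache, no re-download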

Installation

As a flake input

{
  inputs = {
    nixpkgs.url = "github:NixOS/nixpkgs";
    nix-hug.url = "github:longregen/nix-hug";
  };

  outputs = { nixpkgs, nix-hug, ... }:
    let
      system = "x86_64-linux";
      pkgs = nixpkgs.legacyPackages.${system};
      nix-hug-lib = nix-hug.lib.${system};

      my-model = nix-hug-lib.fetchModel {
        url = "stas/tiny-random-llama-2";
        rev = "3579d71fd57e04f5a364d824d3a2ec3e913dbb67";
        fileTreeHash = "sha256-mD+VYvxsLFH7+jiumTZYcE3f3kpMKeimaR0eElkT7FI=";
      };

      model-cache = nix-hug-lib.buildCache {
        models = [ my-model ];
      };
    in {
      packages.${system} = {
        inherit my-model model-cache;
        default = nix-hug.packages.${system}.default;
      };

      devShells.${system}.default = pkgs.mkShell {
        buildInputs = [ nix-hug.packages.${system}.default ];
      };
    };
}

Run directly

$ nix run github:longregen/nix-hug -- fetch mistralai/Mistral-7B-Instruct-v0.3

CLI Reference

Global options:

  • --debug: enable verbose logging
  • --version: print version
  • --help: show help
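
For example, to get verbose logging for a fetch without downloading anything (this assumes the global flag is passed before the subcommand; --dry-run is documented under fetch below):

$ nix-hug --debug fetch openai-community/gpt2 --dry-run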

fetch

Downloads a model or dataset from Hugging Face and prints a Nix expression with pinned revision and hashes.

$ nix-hug fetch <url> [options]

Options:

  • --ref REF: git reference to resolve (default: main)
  • --include PATTERN: include files matching a glob pattern
  • --exclude PATTERN: exclude files matching a glob pattern
  • --file FILENAME: include a specific file by name
  • --dry-run: show what would be fetched without downloading

Examples:

# Fetch only safetensors weights
$ nix-hug fetch mistralai/Mistral-7B-Instruct-v0.3 --include '*.safetensors'

# Fetch a dataset
$ nix-hug fetch rajpurkar/squad --include '*.json'

# Fetch a single config file
$ nix-hug fetch google-bert/bert-base-uncased --file config.json

The CLI auto-detects whether a repository is a model or dataset by querying the Hugging Face API.

ls

Lists files in a repository without downloading anything. Accepts the same filter options as fetch.

$ nix-hug ls mistralai/Mistral-7B-Instruct-v0.3
$ nix-hug ls stanfordnlp/imdb --include '*.parquet'

export

Fetches a model or dataset and copies it into the local HuggingFace cache directory. This makes the model available to transformers, diffusers, and other HF libraries, and preserves it outside the Nix store (surviving garbage collection).

The cache location is determined by $HF_HUB_CACHE, $HF_HOME/hub, or defaults to $XDG_CACHE_HOME/huggingface/hub/.

Accepts the same filter options as fetch.

$ nix-hug export openai-community/gpt2
$ nix-hug export openai-community/gpt2 --include '*.safetensors'

import

Imports a model or dataset from the local HuggingFace cache into the Nix store. If you already have models downloaded by transformers, diffusers, or huggingface-cli, this avoids re-downloading files that are already on disk. Use nix-hug scan to see what's available before importing.

The imported store path has the same layout as one produced by nix-hug fetch, so the output can be used with buildCache and nix build.

The cache location is determined by $HF_HUB_CACHE, $HF_HOME/hub, or defaults to $XDG_CACHE_HOME/huggingface/hub/.

$ nix-hug import <url> [options]

Options:

  • --ref REF: match a specific revision
  • --include PATTERN: include files matching a glob pattern
  • --exclude PATTERN: exclude files matching a glob pattern
  • --file FILENAME: include a specific file by name

Examples:

$ nix-hug import openai-community/gpt2
$ nix-hug import openai-community/gpt2 --include '*.safetensors'

scan

Lists all models and datasets in the local HuggingFace cache. Useful for discovering what's available before running import.

The cache location is determined by $HF_HUB_CACHE, $HF_HOME/hub, or defaults to $XDG_CACHE_HOME/huggingface/hub/.

$ nix-hug scan

Shows each cached repository with its type, revision, size, file count, whether it's already in the Nix store, and any ref labels.

Nix Library

The library is available as nix-hug.lib.${system} from the flake output.

fetchModel / fetchDataset

Fetches a model or dataset from Hugging Face and returns a derivation.

nix-hug-lib.fetchModel {
  url = "stas/tiny-random-llama-2";
  rev = "3579d71fd57e04f5a364d824d3a2ec3e913dbb67";
  fileTreeHash = "sha256-mD+VYvxsLFH7+jiumTZYcE3f3kpMKeimaR0eElkT7FI=";
}

fetchDataset has the same interface:

nix-hug-lib.fetchDataset {
  url = "rajpurkar/squad";
  rev = "abc123...";
  filters = { include = [ ".*\\.json" ]; };
  fileTreeHash = "sha256-...";
}

Parameters:

  • url (required): repository identifier (see URL Formats)
  • rev (required): git commit hash (40 characters)
  • fileTreeHash (required): SHA256 hash of the HF API file tree response
  • filters (optional): filter object with include, exclude, or files

The filters attribute accepts one of three forms:

  • { include = [ "regex" ... ]; } keeps only matching LFS files
  • { exclude = [ "regex" ... ]; } skips matching LFS files
  • { files = [ "filename" ... ]; } selects specific files by exact name

Non-LFS files (configs, tokenizer files) are always included unless files is used.
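
For instance, a files filter that fetches just two named files. This is a sketch against the google-bert/bert-base-uncased repository used earlier, assuming both filenames exist in that repository; rev and fileTreeHash are placeholders you would take from the CLI output:

nix-hug-lib.fetchModel {
  url = "google-bert/bert-base-uncased";
  rev = "abc123...";
  filters = { files = [ "config.json" "tokenizer.json" ]; };
  fileTreeHash = "sha256-...";
}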

buildCache

Combines fetched models and datasets into a HuggingFace Hub-compatible cache directory using symlinks (no data duplication).

nix-hug-lib.buildCache {
  models = [ my-model another-model ];
  datasets = [ my-dataset ];
}

Use the result as HF_HUB_CACHE:

$ export HF_HUB_CACHE=/nix/store/...-hf-hub-cache
$ export TRANSFORMERS_OFFLINE=1
$ python your_script.py

URL Formats

Models:

  • mistralai/Mistral-7B-Instruct-v0.3
  • hf:mistralai/Mistral-7B-Instruct-v0.3
  • https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3

Datasets:

  • rajpurkar/squad
  • hf-datasets:rajpurkar/squad
  • datasets/rajpurkar/squad
  • https://huggingface.co/datasets/rajpurkar/squad

When you use a bare org/repo path, the CLI queries the Hugging Face API to determine whether the repository is a model or dataset.
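
For example, an explicit prefix makes the repository type unambiguous up front (presumably letting the CLI skip that detection query):

$ nix-hug fetch hf-datasets:rajpurkar/squad --include '*.json'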

Development

$ nix develop
$ ./cli/nix-hug --help

Run the tests:

$ nix flake check

License

This software is provided free under the MIT License.
