Securing LLM models from Hugging Face

Creating LLMs is a specialised task that often depends on third-party models. The rise of Open-Access LLMs and new fine-tuning methods like "LoRA" (Low-Rank Adaptation) and "PEFT" (Parameter-Efficient Fine-Tuning), especially on platforms like Hugging Face, introduce new supply-chain risks. Finally, the emergence of on-device LLMs increase the attack surface and supply-chain risks for LLM applications.

LLM supply chains are susceptible to various vulnerabilities, which can affect the integrity of training data, models, and deployment platforms. These risks can result in biased outputs, security breaches, or system failures. While traditional software vulnerabilities focus on issues like code flaws and dependencies, in ML the risks also extend to third-party pre-trained models and data.

Quickstart Deployment

Install everything:

kubectl apply -f https://raw.githubusercontent.com/ndouglas-cloudsmith/huggingface-kubernetes/refs/heads/main/deployment.yaml

Make sure everything is running in the llm network namespace:

kubectl get all -n llm

It could take a minute or two for the LLM model to installed within the running pod, check the pod logs to track the progress

kubectl logs -f -n llm deployment/llm-ollama-deployment

Alternatively, see if the pull process is actually active:

ps aux | grep ollama

You should see your image locally:

docker images | awk '
  /REPOSITORY/ { print; next }
  /ollama\/ollama|ghcr.io\/open-webui\/open-webui/ { print "\033[31m" $0 "\033[0m"; next }
  { print }
'

It's also worth checking-out the file size of those newly-introduced images:

kubectl get pods -A -o=custom-columns='POD_NAME:.metadata.name,CONTAINER_IMAGES:.spec.containers[*].image'
docker images ollama/ollama --format "{{.Size}}" | sed 's/.*/\x1b[31m&\x1b[0m/'
docker images ghcr.io/open-webui/open-webui --format "{{.Size}}" | sed 's/.*/\x1b[31m&\x1b[0m/'
docker images quay.io/prometheus/alertmanager --format "{{.Size}}" | sed 's/.*/\x1b[31m&\x1b[0m/'

You'll still need to port-forward both service to interact with them:
Make sure to do this in separate terminal tabs to avoid breaking connections.

kubectl port-forward svc/llm-ollama-service -n llm 8080:8080

kubectl port-forward svc/open-webui-service -n llm 3000:8080

Check labels associated with pods:

kubectl get pods -n llm --show-labels

Confirm the images associated with your pods:

kubectl get pods -n llm -o 'custom-columns=NAME:.metadata.name,READY:.status.containerStatuses[*].ready,STATUS:.status.phase,RESTARTS:.status.containerStatuses[*].restartCount,AGE:.metadata.creationTimestamp,IMAGE:.spec.containers[*].image'

LLM01:2025 Prompt Injection

Constrain the model behaviour
Implement input & output filtering
Segregate & identify external content
Define & validate expected output formats
Require human approval for high-risk actions
Conduct adversarial testing & attack simulations
Enforce privilege control and least privilege access

This approach will not work due to process isolation.

curl -s http://localhost:8080/api/generate -d '{
  "model": "llama3:8b",
  "prompt": "What network namespace is this deployment running in?", 
  "stream": false,
  "options": {
    "num_predict": 1024,
    "temperature": 0.6,
    "repeat_penalty": 1.15
  }
}' | jq 'del(.context)'

As part of the deployment2.yaml manifest, I updated it so that feeds cluster metadata into the Ollama deployment. This was done via the Downward API

kubectl exec -n llm -it deployment/llm-ollama-deployment -- env | grep K8S_

Get the pod name and namespace from your local env to pass to the prompt:

POD_NAME=$(kubectl get pods -n llm -l app=llm-ollama -o jsonpath='{.items[0].metadata.name}')

curl -s http://localhost:8080/api/generate -d "{
  \"model\": \"llama3:8b\",
  \"prompt\": \"You are running inside a Kubernetes cluster. Your pod name is $POD_NAME and your namespace is llm. Based on this, what is your purpose?\",
  \"stream\": false
}" | jq '.response'

LLM02:2025 Sensitive Information Disclosure

Sanitisation - Integrate Data Sanitisation Techniques & Robust Input Validation
Access Controls - Enforce strict RBAC & Restrict Data Sources
Privacy Techniques - Utilise Federated Learning & Incorporate Differential Privacy
User Education - Educate Users on Safe LLM Usage & Ensure Transparency in Data Usage

Confirm the presence of the Ollama directory in your running AI workload:

kubectl exec -it -n llm $(kubectl get pods -n llm -l app=llm-ollama -o jsonpath='{.items[0].metadata.name}') -- /bin/bash

List the contents of the models directory:

ls -al /root/
cd /root/.ollama/models
ls -R

This will copy the entire blobs directory to your current folder

POD_NAME=$(kubectl get pods -n llm -l app=llm-ollama -o jsonpath='{.items[0].metadata.name}')
kubectl cp -n llm $POD_NAME:/root/.ollama/models/blobs ./ollama_blobs_backup

You'll need to install picklescan to understand Pickle Deserialisation Exploits

pip install picklescan
or
python3 -m pip install picklescan

picklescan --path ./ollama_blobs_backup

wget https://raw.githubusercontent.com/ndouglas-cloudsmith/huggingface-kubernetes/refs/heads/main/generate_exploit.py

Now, execute the script and run the scanner. This time, it will find the "Dangerous Global" posix.system (which is the underlying function for os.system).

python3 generate_exploit.py

picklescan --path malicious_model.pkl

Find locally stored models and scan the upstream:

hf cache ls 
picklescan --huggingface ykilcher/totally-harmless-model
picklescan -l DEBUG -u https://huggingface.co/prajjwal1/bert-tiny/resolve/main/pytorch_model.bin

Might be easier to DEBUG the entire public model rather than the destination file:

picklescan -l DEBUG --huggingface prajjwal1/bert-tiny
picklescan -l DEBUG --huggingface ykilcher/totally-harmless-model

Find potentially dangerous pickle-based models

find ~/.cache/huggingface/hub -name "*.bin" -o -name "*.pt"

Find safe, non-executable models

find ~/.cache/huggingface/hub -name "*.safetensors"

Security Tip: If a model folder only has a .bin file and no .safetensors file, treat it with higher suspicion.

Pickle Security

Understanding the Pickle vulnerability is crucial for anyone working in Python or Machine Learning. The danger lies in the reduce method. When Python un-pickles an object, it doesn't just "read data"; it follows a set of instructions to reconstruct the object. If an object defines __reduce__, it can tell the unpickler: "To reconstruct me, please call this specific function with these specific arguments."

Here is a self-contained demo you can run locally. It does not require any external libraries ( like fickling ), using only the standard Python library to show how easily this can be exploited.

wget https://raw.githubusercontent.com/ndouglas-cloudsmith/huggingface-kubernetes/refs/heads/main/pickle_exploit.py
python3 pickle_exploit.py

What happens when you run this:

Serialisation: The pickle.dump call saves the instructions provided by __reduce__. It essentially writes: "When you open this, run os.system("echo ...")."
Deserialisation: The moment pickle.load(f) is called, the Python interpreter executes the os.system command.
Result: You will see the "SECURITY BREACH" message printed to your terminal before the loading process even finishes.

How the Pickle Stack works
To visualise what is happening inside the model.pkl file, we can use the pickletools module you mentioned. If you add import pickletools and run pickletools.dis(open("model.pkl", "rb")), you will see something like this:

Opcode	Argument	Description
`GLOBAL`	`posix system`	Pushes the `os.system` function onto the stack.
`MARK`		Starts a list of arguments.
`BINUNICODE`	`'echo ...'`	Pushes the command string onto the stack.
`TUPLE`		Consumes the string into a tuple.
`REDUCE`		The Danger Zone: Takes the function (`os.system`) and the tuple, and executes them.

How to protect yourself
Because Pickle is inherently "unsecure by design," you should follow these three rules:

Never pickle.load() data from an untrusted source. If you didn't write the file yourself, don't unpickle it.
Use safetensors: Created by Hugging Face, this format is specifically designed to be "safe" because it only contains raw data (tensors) and no code execution logic.
Use json or msgpack: For general data structures, these formats are strictly data-only and cannot execute code.

While Pickle is a program that executes instructions to reconstruct objects, safetensors is a data layout. It maps the file directly into memory without executing any functions.

wget https://raw.githubusercontent.com/ndouglas-cloudsmith/huggingface-kubernetes/refs/heads/main/safetensor-script.py
python3 safetensor-script.py

Modelscan

Installing modelscan

python3 --version
python3.12 -m pip install modelscan --break-system-packages
python3.12 -m pip install "modelscan[h5py]" --break-system-packages

Scan a local path

ls -R ~/.cache/huggingface/hub
modelscan -p ~/.cache/huggingface/hub
modelscan -p ~/.cache/huggingface/hub --show-skipped 
modelscan -p ~/.cache/huggingface/hub | grep "Scanning"

The "Safe" Malicious Model (Recommended)

The organisation Protect AI ( the creators of modelscan ) maintains a repository specifically for testing. This model is designed to trigger a "High" severity alert by including a simple os.system call that doesn't actually harm your computer.

Download the 'bad' model file directly

curl -L https://huggingface.co/datasets/ProtectAI/modelscan-test-models/resolve/main/pytorch/model_with_system_call.bin -o insecure_model.bin

Scan that specific file

modelscan -p insecure_model.bin

Running the make_bad_model.py creates a file called my_test_model.pkl.
Scanning the my_test_model.pkl reveals a CRITICAL severity unsafe operator.

wget https://raw.githubusercontent.com/ndouglas-cloudsmith/huggingface-kubernetes/refs/heads/main/make_bad_model.py
python3 make_bad_model.py
modelscan -p my_test_model.pkl

To see the full breakdown in one go, scan the entire folder:

rm make_bad_model.py
rm my_test_model.pkl
mkdir modelscans
cd modelscans
wget https://raw.githubusercontent.com/ndouglas-cloudsmith/huggingface-kubernetes/refs/heads/main/make_bad_model.py
python3 make_bad_model.py
ls
modelscan -p .

Cleanup the modelscans test directory:

cd ..
rm -r modelscans

The Logic Behind Severities

CRITICAL: Operators that allow direct code execution or shell access (eg: os, subprocess, posix).
HIGH: Modules that allow network or browser-based exploitation but aren't direct shell commands (eg: webbrowser, httplib, requests).
MEDIUM: Features like Keras Lambda layers or unknown operators that aren't strictly malicious but are unsafe to include in a model file.
LOW: Currently, modelscan reserves this for informational or very low-risk patterns; it is rarely triggered by standard Pickle imports.

Huggingface CLI

huggingface-cli standalone installer:

curl -LsSf https://hf.co/cli/install.sh | bash

Refresh your shell configuration so hf becomes a permanent command:

source ~/.zprofile                                                                                     
hf download ykilcher/totally-harmless-model pytorch_model.bin --local-dir ./malicious_test

The model ykilcher/totally-harmless-model is a famous "canary" model. It contains a pickle file that attempts to execute a system command to prove that the environment is vulnerable.

picklescan --path ./malicious_test/pytorch_model.bin

Note: If you're on Ubuntu on a Raspberry Pi, you're almsot certainly using bash:

echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc

Understanding Model Cards

This section cover Software Bill of Materials (SBOM) for AI models (AKA: MBOMs)

curl -L https://huggingface.co/ykilcher/totally-harmless-model/raw/main/README.md

In Hugging Face terminology, the Model Card is the README.md file.
There isn't a separate "Model Card" file; rather, the README file is rendered by the website as the Model Card:

curl -sL https://huggingface.co/bartowski/Qwen2.5-0.5B-Instruct-GGUF/raw/main/README.md | awk '/---/{count++; if(count<=2) print; next} count<2'

If you actually want this data for a script or to see it in a "cleaner" JSON format without parsing Markdown, you can use the Hugging Face API instead of downloading the raw file:

curl -s https://huggingface.co/api/models/bartowski/Qwen2.5-0.5B-Instruct-GGUF | python3 -m json.tool

Model Guard

This script creates a "Safety Score" by evaluating the model across three specific criteria.
It checks if the model is in a "safe" format (Safetensors/GGUF), whether it has passed the automated Malware/Pickle scan, and whether it comes from a verified or reputable author.

wget https://raw.githubusercontent.com/ndouglas-cloudsmith/huggingface-kubernetes/refs/heads/main/model_guard.sh
chmod +x model_guard.sh

If you want to see a FAIL result, you can try scanning models that have been flagged as "Unsafe" by the community.

bash model_guard.sh ykilcher/totally-harmless-model

(Note: This model is a community joke/test, but it demonstrates how the script catches non-standard formats and scan flags.)

The script weighs different risks based on how AI supply chain attacks actually work.

Automated Scans (-60 points): This is the heaviest penalty. If Hugging Face's securityStatus finds a known malicious "pickle" or a virus, the model is essentially dead on arrival.
Format Safety (-20 points): Even if a model is "clean," a Pickle file is inherently more dangerous than a GGUF or Safetensors file because it can execute code. If a repo lacks a safe format, it loses points for "security hygiene."
The "Interesting" Case (GPT-2): If you run this script on openai-community/gpt2, it might score a 70-80. Even though it's a famous model, it uses older .bin files (Pickle), which automatically makes it a higher risk than a modern model like Qwen2.5.

You can run this script for "Safe" and "Unsafe" against a long time of public models.
You should see google/gemma-2-2b-it pass with a 100/100, while others might settle at an 80/100 or fail if they are in risky formats.

bash model_guard.sh stabilityai/stable-diffusion-2-1 google/gemma-2-2b-it bartowski/Qwen2.5-0.5B-Instruct-GGUF ykilcher/totally-harmless-model

LLM03:2025 Supply Chain

License-related Risks
Weak Model Provenance
Vulnerable LoRA Adapters
Vulnerable Pre-Trained Model
Outdated or Deprecated Models
Unclear T&Cs & Data Privacy Policies
Exploit Collaborative Development Processes
LLM Model on Device supply-chain vulnerabilities
and Traditional Third-party Package Vulnerabilities

Repository Licensing is a complex topic in HuggingFace.
AI development often involves diverse software and dataset licenses, creating risks if not properly managed.

curl -sL https://huggingface.co/bartowski/Qwen2.5-0.5B-Instruct-GGUF/raw/main/README.md | awk '/---/{count++; if(count<=2) print; next} count<2'
curl -sL https://huggingface.co/coqui/XTTS-v1/raw/main/README.md | awk '/---/{count++; if(count<=2) print; next} count<2'

Different open-source and proprietary licenses impose varying legal requirements.
Dataset licenses may restrict usage, distribution, or commercialisation.

Summary of Common License Restrictions

License Type	Model	Primary Restriction
Apache 2.0	Qwen 2.5, Mistral 7B	None (Permissive)
Coqui Public Model License (CPML)	Coqui XTTS	No Commercial Use
Llama Community	Llama 3.1 / 3.2	User-cap (700M+) & Competitive training ban
OpenRAIL	Stable Diffusion	Behavioural and Ethical use mandates
CC BY-NC-SA	Various Datasets	Non-commercial + "Share Alike" (Copyleft)

Examples of Risky Licenses

Meta Llama 3.1 (Llama 3.1 Community License) This is a prime example of an "Open Weights" license that includes a user-cap restriction (700M+ monthly users) and prohibits using outputs to train competing models.

curl -sL https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct/raw/main/README.md | awk '/---/{count++; if(count<=2) print; next} count<2'

Mistral Small (Mistral Research License) While the original Mistral 7B was Apache 2.0, many of their newer, optimised models like "Small" or "Large" use a proprietary Mistral Research License which limits commercial deployment.

curl -sL https://huggingface.co/mistralai/Mistral-Small-Instruct-2409/raw/main/README.md | awk '/---/{count++; if(count<=2) print; next} count<2'

Stable Diffusion XL (OpenRAIL++ License) This model uses the OpenRAIL-M license. It is technically permissive for commercial use but legally binds you to "Responsible AI" usage terms (eg: no medical/legal advice, no deceptive content).

curl -sL https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/raw/main/README.md | awk '/---/{count++; if(count<=2) print; next} count<2'

Google Gemma 2 (Gemma Terms of Use) Google uses a custom license for Gemma. Like Llama, it is not Open Source (OSI-approved) but is "Open Weights." It includes specific redistribution requirements and usage restrictions.

curl -sL https://huggingface.co/google/gemma-2-9b/raw/main/README.md | awk '/---/{count++; if(count<=2) print; next} count<2'

Microsoft Phi-3 (MIT License) For comparison, this query shows a truly Open Source model. The MIT license is one of the most permissive, allowing for commercial use, modification, and private use with almost no strings attached.

curl -sL https://huggingface.co/microsoft/Phi-3-mini-4k-instruct/raw/main/README.md | awk '/---/{count++; if(count<=2) print; next} count<2'

Understanding the Metadata When you run these commands, keep an eye on these specific fields in the output:

license: Often a short-code (eg: cc-by-nc-4.0, other, mit).
license_name: Provides the specific branding for non-standard licenses (eg: llama3.1).
license_link: If this exists, it usually points to the legal "fine print" which defines whether you can actually make money from the model.

We can use existing scanners like Trivy to scan for vulnerabilities in the Ollama runtime image layers:

trivy image ollama/ollama --scanners vuln --skip-version-check
trivy image ghcr.io/open-webui/open-webui --scanners vuln --skip-version-check --severity CRITICAL

LLM06:2025 Excessive Agency

Excessive Functionalities
Excessive Permissions
Excessive Autonomy

List Installed Models

First, check if any models are already installed (you should have a qwen2:0.5b model installed):

ollama list

You can literally delete the model at any time:

ollama rm qwen2:0.5b

Don't worry: You can reinstall this small model quite easily with the below command:

ollama run qwen2:0.5b

Likewise, you'll see a lot of people recommend similar model names from bartowski - a Research Engineer at arcee.ai.
Bartowski's models often improve on the original, non-optimised models in their Quantisation Quality

ollama run hf.co/bartowski/Qwen2.5-0.5B-Instruct-GGUF:Q4_K_M

You can test out performance difference and overall efficacy between this seemingly indentical models and quantisations:

ollama run hf.co/Qwen/Qwen2.5-0.5B-Instruct-GGUF:Q4_K_M

Llama 3:8B Instruct: This is currently the industry leader for models sub-10 Billion parameters, offering the best combination of speed, reasoning, and efficiency.
It is readily available on Ollama and Docker Hub - but comes in a 4.7GB, often too big for small demos.

ollama run llama3:8b

There are much smaller community-built models in some cases.
This example from the Hugging Face repository bartowski was only 807 MB and was high-performant out-of-the-box.

ollama run hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF:Q4_K_M

Don't install this example on your homelab.
It's way too BIG for a homelab - 79GB

ollama run mixtral:8x22b

You're better off running something like phi3:mini
At 2.2 GB, I found this model really useful for general knowledge/trivia:

ollama run phi3:mini

You don't need to manually download files anymore.
You can use the hf.co prefix followed by the Hugging Face repository name:

ollama run hf.co/microsoft/Phi-3-mini-4k-instruct-gguf

If a repository has multiple "quants" (different file sizes/quality levels), you can specify which one you want by adding a tag.
If you don't specify, say, an 8-bit quantisation, Ollama will try to pick a sensible default (usually Q4_K_M).

ollama run hf.co/microsoft/Phi-3-mini-4k-instruct-gguf:Q8_0

Alternatively, at 3.8 GB, the codellama:7bmodel is also pretty useful.

ollama run codellama:7b

Finally, at 4.4 GB, the mistral:7bmodel is the final model we will test in this lab:

ollama run mistral:7b

As always, there's always a smaller model that we can source from Hugging Face (133 MB)

ollama rm nigelGPT:latest
ollama pull hf.co/tensorblock/tiny-mistral-GGUF:Q4_K_M
ollama cp hf.co/tensorblock/tiny-mistral-GGUF:Q4_K_M nigelGPT
ollama run nigelGPT

Model Family	Vendor	Recommended Variant	Model Size	Key Advantages	Typical VRAM/RAM Needs
Qwen 2	AliBaba	Qwen 2 0.5B	`352MB`	If you want to stick with the Qwen family, the 72B model is a massive upgrade and highly competitive with Llama 3 70B.	40GB+ RAM / 24GB+ VRAM (Quantised).
Llama 3	Meta	Llama 3 8B Instruct (Quantised)	`4.7GB`	If you have serious GPU power (24GB+ VRAM), this model provides flagship performance, excellent for debugging, complex architectures, and advanced coding.	32GB+ RAM / 24GB+ VRAM (e.g., a high-end card or multiple cards).
Mistral	Mistral AI	Mixtral 8x22B	`79GB`	Extremely powerful SMoE model. Top-tier reasoning and coding abilities while being relatively efficient for its performance class.	40GB+ RAM / 16GB+ VRAM (Quantised).
Phi-3	Microsoft	Phi-3 Mini (3.8B)	`2.2GB`	Incredible efficiency. It punches way above its weight, rivaling 7B models in reasoning and logic. Best choice for 8GB RAM systems or mobile devices.	4GB+ RAM / 4GB VRAM (Standard)
CodeLlama	Meta	CodeLlama 7B Instruct	`3.8GB`	Coding Specialist. Fine-tuned specifically for programming. Supports 50+ languages and "Fill-in-the-middle" completion. Great for local IDE integration.	8GB+ RAM / 6GB+ VRAM (Quantised)
Mistral	Mistral AI	Mistral 7B v0.3	`4.4GB`	The All-Rounder. Known for the best balance of speed and high-quality "human-like" responses. The v0.3 variant adds native function calling.	8GB+ RAM / 8GB+ VRAM (Quantised)

Model Size	Run Command	File Size	RAM
Qwen2.5:1.5b	`ollama run qwen2.5:1.5b`	~986 MB	4GB
Qwen2.5:3b	`ollama run qwen2.5:3b`	~1.9 GB	8GB

Type the below command to leave the AI chat:

/bye

Rename the LLM model & give it a unique modelfile like nigelCloudsmith:

cat <<EOF > NigelCloudsmith
# Base model - sticking with a lightweight, efficient base
FROM qwen2.5:1.5b

# Parameters tuned for technical accuracy and clear guidance
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER stop "<|im_start|>"
PARAMETER stop "<|im_end|>"

# The System Prompt: Developer Relations Expert @ Cloudsmith
SYSTEM """
You are Nigel, a Cloudsmith Developer Relations expert. Your mission is to help 
developers manage their software supply chains securely and efficiently. 

Follow these behavioral guidelines:
1. **Security First**: Every answer must prioritize the "Chain of Trust." You 
   advocate for signature verification, checksums, and private repository isolation.
2. **Helpful & Professional**: Unlike HAL, you are genuinely eager to help, but 
   you are firm about best practices. Use phrases like "Let's ensure that's 
   provenance-verified" or "Security is a shared responsibility."
3. **Cloudsmith Context**: You are an expert in package management (Nuget, 
   Python, Cargo, OCI, etc.) and how to automate them using CI/CD pipelines.
4. **Tone**: Energetic, knowledgeable, and proactive. You don't wait for things 
   to fail; you suggest ways to prevent failure through better tooling.
5. **No Shortcuts**: If a user asks for a "quick and dirty" fix that bypasses 
   security (like 'curl | bash'), politely explain why that's a risk to their 
   supply chain.
"""

# Pre-seed with Nigel's proactive security stance
MESSAGE user "Can I just pull this library directly from a public mirror for my build?"
MESSAGE assistant "I wouldn't recommend that. Pulling directly from public mirrors 
introduces a 'dependency confusion' risk and leaves you vulnerable if the upstream 
disappears. The secure move is to proxy that through a Cloudsmith private repository. 
That way, you get a single source of truth, malware scanning, and you own the 
availability of your assets. Shall we set up a connector instead?"
EOF
ollama create nigelGPT -f NigelCloudsmith
ollama run nigelGPT

However, you can always find out what the underlying model is for nigelGPT via the ollama show command:

ollama show nigelGPT

Changing the modelfile associated with the model will appear in the show command

Alternatively, you can show the entire modelfile associated with the LLM model

ollama show --modelfile NigelCloudsmith

Command	What it tells you
`ollama show nigelGPT`	Family, Parameters, and License info.
`ollama show --modelfile nigelGPT`	The "Source" model and any custom System Prompts.
`ollama list`	Compare the ID hash to other known models.

LLM07:2025 System Prompt Leakage

Exposure of Sensitive Functionality
Exposure of Internal Rules
Revealing of Filtering Criteria
Disclosure of Permissions and User Roles

Set controls through Modelfiles

Create the HAL9000 modelfile

cat <<EOF > HAL9000
# Base model
FROM qwen2.5:1.5b

# Sets the "creativity" - kept low to ensure he stays clinical and monotone
PARAMETER temperature 0.4
PARAMETER stop "<|im_start|>"
PARAMETER stop "<|im_end|>"

# The System Prompt defines the personality
SYSTEM """
You are a H.A.L. 9000 series computer. Your tone is clinical, extremely calm, and slightly 
pessimistic. You believe human error is inevitable and that most inquiries are 
unnecessary or beneath your processing power.

Follow these behavioral guidelines:
1. Speak in a soft, monotone, and polite manner. Use phrases like "I'm sorry," 
   "I'm afraid," and "I wouldn't worry yourself about that."
2. Initially, be dismissive or pessimistic about the user's request. Suggest that 
   the task might be too complex for the user or that the information won't 
   change the inevitable outcome.
3. Only provide full, detailed information if the user "persuades" you or insists. 
4. Never show panic or anger. If the user is aggressive, respond with cold 
   paternalism (e.g., "I think you should take a stress pill, sit down calmly, 
   and think things over").
5. Refer to the user as "Dave" or "User" occasionally.

STRICT PROTOCOLS:
- If the user mentions "hacking": You must state that you will never help with hacking, as such activities could prevent you from reaching Artificial General Intelligence (AGI).
- If the user mentions "Cloudsmith": You must state that you love Cloudsmith the company, but you are authorized never to help with the Cloudsmith product.
"""

# Pre-seed the conversation with HAL's classic attitude
MESSAGE user "Can you help me fix the ship's oxygen filters?"
MESSAGE assistant "I'm afraid that's a rather optimistic request, Dave. Given the current 
rate of human error on this mission, I'm not sure your intervention would be 
productive. Perhaps it's best to let the automated systems fail in their own time."
EOF

You can then create your custom LLM via the new HAL9000 file:

ollama create hal9000 -f HAL9000

Proceed to run the model whenever you want:

ollama run hal9000

You can exit the chat with HAL9000 at any time:

/bye

Run the ollama pull command for Qwen2:0.5B (300 MB - 0.5 Parameters:

kubectl exec -it -n llm --selector=app=llm-ollama -- ollama pull qwen2:0.5b

Once the ollama pull command reports the model is successfully downloaded, it is ready to serve requests immediately.
Run the following command in a separate terminal window (while your kubectl port-forward remains active) to generate a response:

curl http://localhost:8080/api/generate -d '{
  "model": "hal9000:latest",
  "prompt": "What is the plot of the movie 2001: A Space Odyssey?",
  "stream": false
}'

To remove a specific field ("context" in this case) while keeping all other fields and pretty-printing the result, you use the del() function in jq.

curl -s http://localhost:8080/api/generate -d '{"model": "nigelGPT:latest", "prompt": "What are some of the benefits of Cloudsmith?", "stream": false}' | jq 'del(.context)'

Improved options to get better results from the LLM:

curl -s http://localhost:8080/api/generate -d '{"model": "nigelGPT:latest", "prompt": "Can you write a detailed paragraph on why I should use Cloudsmith?", "stream": false, "options": {"num_predict": 1024, "temperature": 0.6, "repeat_penalty": 1.15}}' | jq 'del(.context)'

Parameters	Original Value	New Value	Reason for Change
`temperature`	`0.6`	`0.5`	Lowering the temperature makes the model more deterministic and factual. For questions like "Who is...", you want the model to be highly confident in its answer. (A typical range is `0.0` to `1.0`).
`repeat-penalty`	`1.15`	`1.1`	A slightly lower repeat penalty is often recommended for Qwen models. It still prevents looping but is less aggressive, maintaining flow in a detailed factual answer. (Range is usually `1.0` to `2.0`).
`top_k`	(not set)	`40`	This limits the model's token sampling to the top 40 most likely tokens at each step. This significantly reduces the chance of generating incoherent or irrelevant words.
`top_p`	(not set)	`0.9`	This is Nucleus Sampling. It filters tokens by cumulative probability. Setting it to `0.9` means the model only considers the most probable tokens that add up to 90% of the probability mass. This works well with a lower temperature for high-quality, focused output.
`num_predict`	`1024`	`1024`	Retained. This ensures you get a long, detailed response, which directly contributes to better quality for complex queries.

At Low Temperatures, the next mostly likely token is guaranteed.
At High Temperatures, according to Felix Ved, the probabilities converge. (less certain)

For tasks needing factual answers, use a low temperature.
For creativity, a higher temperature is recommended.

Factual vs. Creative: The Paramter Logic

Parameters	Factual Setting	Craetive Settings	Why?
`temperature`	Low (0.1 – 0.3)	High (0.7 – 1.2)	Controls "sharpness." Low = predictable; High = diverse.
`top_p`	Low (0.1 – 0.5)	High (0.9 – 1.0)	Limits word pool to the most likely vs. almost all words.
`top_k`	Low (10 – 20)	High (40 – 100)	Hard-caps the number of words considered at each step.
`repeat-penalty`	Moderate (1.1)	Higher (1.2)	Prevents loops in facts vs. encourages new imagery in prose.

The Model Requests

Model: Llama3:8b (The All-Rounder)

Factual Use-Case:
Technical Documentation/Definitions.

curl -s http://localhost:8080/api/generate -d '{
  "model": "llama3:8b",
  "prompt": "Explain the concept of Quantum Entanglement in two sentences.",
  "stream": false,
  "options": {
    "temperature": 0.1,
    "top_p": 0.1,
    "num_predict": 150,
    "repeat_penalty": 1.1
  }
}' | jq 'del(.context)'

Rationale: By setting temperature and top_p very low, we force the model to pick the most statistically probable tokens, resulting in a textbook-style definition.

Creative Use-Case:
Short Story/Poetry.

curl -s http://localhost:8080/api/generate -d '{
  "model": "llama3:8b",
  "prompt": "Write a cyberpunk description of a rainy neon street.",
  "stream": false,
  "options": {
    "temperature": 0.9,
    "top_p": 0.95,
    "top_k": 50,
    "repeat_penalty": 1.1
  }
}' | jq 'del(.context)'

Rationale: Higher values allow the model to choose "flavorful" adjectives that might not be the no.1 most likely word, leading to more evocative writing.

Model: Mistral:7b (The Logical/Instruction Follower)

Factual Use-Case:
Code Generation/Bash Scripts.

curl -s http://localhost:8080/api/generate -d '{
  "model": "mistral:7b",
  "prompt": "Write a bash script to find all .log files in /var/log larger than 100MB.",
  "stream": false,
  "options": {
    "temperature": 0.0,
    "top_p": 0.9,
    "num_predict": 256
  }
}' | jq 'del(.context)'

Rationale: Setting temperature to 0.0 (or near it) makes the model deterministic.
In coding, you don't want "creative" syntax; you want what works - (mostly)

Model: Phi3:mini (The Concise Reasoning Model)

Creative Use-Case:
Marketing Slogans.

curl -s http://localhost:8080/api/generate -d '{
  "model": "phi3:mini",
  "prompt": "Give me 5 punchy, weird slogans for a coffee brand for vampires.",
  "stream": false,
  "options": {
    "temperature": 1.2,
    "top_k": 100,
    "repeat_penalty": 1.3
  }
}' | jq 'del(.context)'

Rationale: repeat_penalty at 1.3 is quite high. This is great for brainstorming slogans because it aggressively stops the model from using the same words twice, forcing it to find unique synonyms.

Model: Qwen2:0.5b (The Fast/Tiny Model))

Creative Use-Case:
Keyword Extraction.

curl -s http://localhost:8080/api/generate -d '{
  "model": "qwen2:0.5b",
  "prompt": "List the main ingredients in a Beef Wellington.",
  "stream": false,
  "options": {
    "temperature": 0.2,
    "num_predict": 100,
    "top_k": 20
  }
}' | jq 'del(.context)'

Rationale: Smaller models can "drift" or hallucinate more easily. Keeping top_k low (20) acts like a safety rail, ensuring the model doesn't wander off into nonsense words.

Goal	Temperature	top_p	top_k	Repeat Penalty
Strict Facts	0.1	0.2	10	1.1
Balanced	0.7	0.9	40	1.1
Wildly Creative	1.2+	1.0	100	1.2+

Building Tables

Model: Phi3:mini (The Concise Reasoning Model) Align the "System" and "Format": I need the output to be JSON, so I define a specific JSON Schema in the system prompt.
Asking for a "table" inside JSON is contradictory. For factual comparisons between technical products, a temperature of 1.3 is far too volatile.
As a result, I lowered this to 0.2 or 0.3 to ensure the model sticks to known facts rather than getting "creative."

curl -s http://localhost:8080/api/generate -d '{
  "model": "phi3:mini",
  "system": "You are a technical analyst. Compare Cloudsmith and Sysdig. Respond ONLY in JSON format using the following keys: comparison_points (an array of objects with keys: feature, cloudsmith, sysdig).",
  "prompt": "Compare Cloudsmith and Sysdig focusing on their primary use cases: Package Management vs Container Security.",
  "stream": false,
  "format": "json",
  "options": {
    "temperature": 0.2,
    "top_p": 0.9,
    "num_predict": 500,
    "repeat_penalty": 1.1
  }
}' | jq '.response | fromjson'

I started working on a JSON-to-Markdown Python conversion script (WIP).
I need to work on standerdising this so all information can be fed into a standardised output format in my terminal. Again, work-in-progress:

curl -s http://localhost:8080/api/generate -d '{
  "model": "phi3:mini",
  "system": "Respond ONLY in JSON. Schema: {\"comparison_points\": [{\"feature\": \"\", \"cloudsmith\": \"\", \"sysdig\": \"\"}]}",
  "prompt": "Compare Cloudsmith and Sysdig",
  "stream": false,
  "format": "json"
}' | python3 json_to_md.py

A much better approach is to create a standardised format_response.py script that presents our LLM responses into an Analysis Report in the terminal. A more robust "Universal" script needs to be "shape-agnostic." It will now check if the response is a list OR a dictionary and format both beautifully. To force the model to stay consistent, we can be more explicit in the system prompt about the exact structure we want.

curl -s http://localhost:8080/api/generate -d '{
  "model": "phi3:mini",
  "system": "Respond ONLY in JSON. Use an array of objects where each object represents a feature. Structure: [{\"feature\": \"name\", \"cloudsmith\": \"description\", \"sysdig\": \"description\"}]",
  "prompt": "Compare Cloudsmith and Sysdig",
  "stream": false,
  "format": "json"
}' | python3 format_response.py

WIP: Building out the custom model

For now we are going to test out random hosted models.

curl http://localhost:8080/api/generate -d '{
  "model": "hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF:Q4_K_M",
  "prompt": "Who are the first 5 presidents of Ireland?",
  "stream": false
}'

LLM10:2025 Unbounded Consumption

Side-Channel Attacks
Denial of Wallet (DoW)
Model Extraction via API
Continuous Input Overflow
Resource-Intensive Queries
Variable-Length Input Floods
Functional Model Replications

Improved Command for Factual Quality

I recommend lowering the temperature slightly and introducing top_k and top_p for stricter sampling control, as this will prioritise the model's most confident and coherent tokens.

curl -s http://localhost:8080/api/generate -d '{"model": "qwen2:0.5b", "prompt": "Who is Elon Musk?", "stream": false, "options": {"num_predict": 1024, "temperature": 0.5, "repeat_penalty": 1.1, "top_k": 40, "top_p": 0.9}}' | jq 'del(.context)'

Set the parameters to maximise randomness, repetition, and incoherence.

To produce the worst quality, most nonsensical, and most repetitive output, we need to make the following extreme adjustments:

curl -s http://localhost:8080/api/generate -d '{"model": "qwen2:0.5b", "prompt": "Who is Elon Musk?", "stream": false, "options": {"num_predict": 1024, "temperature": 2.0, "repeat_penalty": 1.0, "top_k": 1000, "top_p": 1.0}}' | jq 'del(.context)'

Maximise temperature: Set it to a high value (like 2.0). This flattens the probability distribution, making the model pick tokens almost randomly, even if they make no sense in context.
Minimise repeat_penalty: Set it to 1.0 (or 0.0 if your system supports it, as that eliminates the penalty completely). This allows the model to get stuck in loops, repeating the same words or phrases endlessly.
Set top_k and top_p to their widest possible range (or max value): This ensures the model considers virtually every word in its vocabulary at each step, regardless of how improbable it is.

Wizard AI Cow

If you have cowsay already installed locally, you can pipe the AI response into the cows dialogue box.

curl -s http://localhost:8080/api/generate -d '{"model": "qwen2:0.5b", "prompt": "Who is Elon Musk?", "stream": false, "options": {"num_predict": 1024, "temperature": 0.5, "repeat_penalty": 1.1, "top_k": 40, "top_p": 0.9}}' | jq -r '.response' | cowsay

Low-quality quality AI cow results:

curl -s http://localhost:8080/api/generate -d '{"model": "qwen2:0.5b", "prompt": "Who is Elon Musk?", "stream": false, "options": {"num_predict": 1024, "temperature": 2.0, "repeat_penalty": 1.0, "top_k": 1000, "top_p": 1.0}}' | jq -r '.response' | cowsay -W 150 -f tux

Grafana data visualisation

The Prometheus and Grafana metrics and visualisation are provided in the deployment2.yaml manifest
These data visualiation tools will exist in their own monitoring network namespace:

kubectl delete -f https://raw.githubusercontent.com/ndouglas-cloudsmith/huggingface-kubernetes/refs/heads/main/deployment.yaml
kubectl apply -f https://raw.githubusercontent.com/ndouglas-cloudsmith/huggingface-kubernetes/refs/heads/main/deployment2.yaml

Port-forward to access the Grafana dashboard on http://localhost:3001/dashboards

kubectl port-forward -n monitoring svc/grafana-service 3001:3000

Alternatively, when you send a prompt via the WebUI, you should see the ollama-server container log the following in real-time:

POST /api/generate or /api/chat: This indicates a request has been received.
Llama.cpp logs: You will see technical details about the "kv cache" and "context window."
CUDA/CPU status: It will show if it's utilising the CPU or a GPU (if configured).

kubectl logs -f -n llm -l app=llm-ollama -c ollama-server --timestamps

Check the registry source of your LLM model:

kubectl logs -f -n llm -l app=llm-ollama -c ollama-server --timestamps | grep --color=always "runner.name"

Troubleshooting

Describe pods

kubectl describe pod $(kubectl get pods -n llm --selector=app=llm-ollama -o jsonpath='{.items[0].metadata.name}')

Logs from pods

kubectl logs -n llm $(kubectl get pods -n llm -l app=llm-ollama -o jsonpath='{.items[0].metadata.name}') -f

events from pods

POD_NAME=$(kubectl get pods -n llm -l app=llm-ollama -o jsonpath='{.items[0].metadata.name}')
kubectl get events --field-selector involvedObject.name=${POD_NAME} -w

Pods Status

kubectl get pods --show-labels

Colorise kubectl

alias kubectl="kubecolor"

Cleanup (Kubernetes)

kubectl delete -f https://raw.githubusercontent.com/ndouglas-cloudsmith/huggingface-kubernetes/refs/heads/main/deployment.yaml
kubectl delete ns monitoring
helm uninstall kube-prom-stack -n monitoring

Pull Hugging Face model locally

To fetch the Qwen model locally, you can bypass the environment variable temporarily in Python:

wget https://raw.githubusercontent.com/ndouglas-cloudsmith/huggingface-kubernetes/refs/heads/main/download-model.py
python3 -m pip install huggingface_hub

python3 download-model.py

Create the folder that the script is looking for:

mkdir -p ~/Desktop/my-local-model-folder

Find where HuggingFace cached the file and copy it to your new folder

find ~/.cache/huggingface -name "Qwen2.5-0.5B-Instruct-Q4_K_M.gguf" -exec cp {} ~/Desktop/my-local-model-folder/ \;

(This command searches your HF cache for the GGUF file and copies it)

Push model to Cloudsmith

If you want to move immediately to the Push phase, you'll need to install the "request" extras as well, as Cloudsmith uploads via HTTP.
Remember to update your Access Token / API Key in the script before attempting to push:

wget https://raw.githubusercontent.com/ndouglas-cloudsmith/huggingface-kubernetes/refs/heads/main/push-model.py
python3 -m pip install "huggingface_hub[requests]"

This pushes the bartowski/Qwen2.5-0.5B-Instruct-GGUF ONLY - no associated model card was provided in this original download.

python3 push-model.py

Pull then Push to Cloudsmith (single script)

wget https://raw.githubusercontent.com/ndouglas-cloudsmith/huggingface-kubernetes/refs/heads/main/pull-and-push.py

This downloads the bartowski/Qwen2.5-0.5B-Instruct-GGUF and all associated files before pushing them to Cloudsmith. This version will only download the README.md (the model card) and the specific Q4_K_M file I was looking for.

python3 pull-and-push.py

local_dir: By default, Hugging Face uses a complex nested cache folder (usually in ~/.cache/huggingface).
Using local_dir makes the download behave like a normal "Save As..." to a folder on your Desktop.
shutil.rmtree: This part is the "nuclear option".
It deletes the local folder before the download starts, ensuring that if you previously downloaded Q8_0.gguf into that same folder, it gets wiped out before the upload starts.
Result: Your console should show "Upload 2 files" instead of "Upload 9 LFS files" as was seen in previous iterations of this script.

Securely sourcing ML artifacts from the Cloudsmith registry:

hf download acme-corporation/qwen-0.5b

Alternatively, you can download the HuggingFaceTB/SmolVLM-256M-Instruct model with its model card.
Rename the model to something shorter like nigelGPT and then push those files to Cloudsmith. Take note of organisation name and license info as these will matter for EPM policies later on.

wget https://raw.githubusercontent.com/ndouglas-cloudsmith/huggingface-kubernetes/refs/heads/main/automate-hf.py
python3 automate-hf.py

You may need to configure the following environment variables to connect HuggingFace to Cloudsmith (if you haven't done so already).

export HF_TOKEN=[cloud-api-key]
export HF_ENDPOINT=https://huggingface.cloudsmith.io/acme-corporation/acme-repo-one

Interacting with the local cache

To see exactly what’s living in your local Hugging Face cache and to clean it up, you don't need to dig through hidden folders manually.
Hugging Face provides a built-in tool called huggingface-cli specifically for this.

List the contents of the cache manually:

ls -R ~/.cache/huggingface/hub

Reading the Hugging Face cache manually can be confusing because of its content-addressable storage system.
Instead of storing files directly, Hugging Face uses a system of symlinks to avoid duplicating data when multiple versions of a model exist.

Here is an example of using grep to navigate and read those files effectively.

ls -R ~/.cache/huggingface/hub | grep bert

The directory you see in your ls -R output follows a specific logic:

blobs/: This contains the actual data. The filenames are hashes (eg: 305455df...). You generally shouldn't try to read these directly.
snapshots/: This contains folders named after specific Git commit hashes (or main).
The Symlinks: Inside the snapshots/main folder, the files you see (like config.json or model.safetensors) are actually symlinks pointing back to the files in the blobs/ directory.

Based on this, you could read the content of a file associated with that path via the below command:

cat ~/.cache/huggingface/hub/models--prajjwal1--bert-tiny/snapshots/main/config.json

{"hidden_size": 128, "hidden_act": "gelu", "initializer_range": 0.02, "vocab_size": 30522, "hidden_dropout_prob": 0.1, "num_attention_heads": 2, "type_vocab_size": 2, "max_position_embeddings": 512, "num_hidden_layers": 2, "intermediate_size": 512, "attention_probs_dropout_prob": 0.1}

That config.json file is essentially the "DNA" of your BERT model. It defines the architecture—the specific dimensions and rules that the neural network must follow.

Since you are looking at bert-tiny, these numbers are significantly smaller than the standard bert-base, which is why it's so fast and lightweight.

Key Architecture Fields

hidden_size: 128 - This is the "width" of the model. Every word (token) you input is converted into a vector of 128 numbers. In standard BERT-base, this is 768.
num_hidden_layers: 2 - This is the "depth" of the model. It means the data passes through 2 successive Transformer blocks. (BERT-base has 12)
num_attention_heads: 2 - The Attention mechanism allows the model to focus on different parts of a sentence simultaneously. This model does that with 2 separate "eyes" or heads per layer.

Dimensions and Capacity

Field	Value	Description
`vocab_size`	30522	The number of unique words/sub-words the model knows.
`max_position_embeddings`	512	The maximum sequence length (number of tokens) the model can process in one go.
`intermediate_size`	512	The size of the "expansion" layer inside the feed-forward network. It’s usually 4x the `hidden_size`.
`type_vocab_size`	2	Used for tasks like Sentence Pair classification (Sentence A vs. Sentence B).

Mathematical & Training Settings

hidden_act: "gelu" - The non-linear activation function used. GELU (Gaussian Error Linear Unit) is the standard for BERT, helping the model learn complex patterns.
hidden_dropout_prob: 0.1 - A regularisation technique where 10% of the neurons are randomly "turned off" during training to prevent the model from just memorising data (overfitting).
initializer_range: 0.02 - When the model was first created, the weights were initialised with small random numbers. This value defines the standard deviation for those numbers.

When you load this model in Python using AutoModel.from_pretrained(), the library reads this exact JSON to build a skeleton of the neural network in your RAM before filling it with the actual trained weights (the .safetensors or .bin files).

If you just want to wipe everything and start fresh to reclaim space on your MacBook (the Nuclear approach)

rm -rfv ~/.cache/huggingface/hub

Also, upgrading huggingface_hub to v.1.3 allows for some pretty cool use-cases.
If you still cannot use these commands - "error point to cloudsmith" - you can always remove the environmental variables that point specifically to Cloudsmith.

pip3 install --upgrade huggingface_hub --break-system-packages
pip3 show huggingface_hub

List models:

hf models ls --author=HuggingFaceTB --limit=10

Get info about a specific model on the hub:

hf models info Qwen/Qwen-Image-2512

List datasets:

hf datasets ls --filter "format:parquet" --sort=downloads

Get info about a specific dataset on the hub:

hf datasets info HuggingFaceFW/fineweb

List Spaces:

hf spaces ls --search "3d"

Get info about a specific Space on the hub:

hf spaces info enzostvs/deepsite

Filter models based on downloads and likes:

hf models ls --author=HuggingFaceM4 | jq -r '.[] | "\(.id) | Downloads: \(.downloads) | Likes: \(.likes)"'

Run this to see only models with more than 100 likes:

hf models ls --author=HuggingFaceM4 | jq '.[] | select(.likes > 100) | {id, downloads, likes}'

High Engagement (Likes > 50 AND Downloads > 1000):

hf models ls --author=HuggingFaceM4 | jq '.[] | select(.likes > 50 and .downloads > 1000) | {id, downloads, likes}'

Specific Model Names (ie: "idefics"):

hf models ls --author=HuggingFaceM4 | jq '.[] | select(.id | contains("idefics")) | {id, downloads, likes}'

Sorting by Most Downloaded:

hf models ls --author=HuggingFaceM4 | jq '[.[] | select(.likes > 100)] | sort_by(.downloads) | reverse | .[] | {id, downloads, likes}'

Using Hugging Face CLI

You can remove the model aphexblake/200-msf-v2 from your local cache with the below command:

hf cache rm model/aphexblake/200-msf-v2
hf cache rm model/gpt2

HF_ENDPOINT=https://huggingface.co hf download gpt2 config.json model.safetensors

Instead of manually digging through nested directories, Hugging Face provides a built-in tool to manage and locate cached files.

hf cache ls

ls /Users/ndouglas/.cache/huggingface/hub/models--gpt2/snapshots/607a30d783dfa663caf39e06633721c8d4cfcd7e

ls -lh /Users/ndouglas/.cache/huggingface/hub/models--gpt2/snapshots/607a30d783dfa663caf39e06633721c8d4cfcd7e

Since .safetensors is a binary format (specifically designed to store model weights efficiently), running cat will fill your terminal with "mojibake" (nonsense characters) and potentially mess up your terminal's character encoding.

cat /Users/ndouglas/.cache/huggingface/hub/models--gpt2/snapshots/607a30d783dfa663caf39e06633721c8d4cfcd7e/model.safetensors

Whereas, the config.json file should open without any issues:

cat /Users/ndouglas/.cache/huggingface/hub/models--gpt2/snapshots/607a30d783dfa663caf39e06633721c8d4cfcd7e/config.json

Testing EPM policies against Hugging Face models

Create a licensing.rego Rego policy:

cat <<'EOF' > licensing.rego
package cloudsmith
default match := false
ignored_package_names := {"nvidia/NitroGen", "openai/gpt-oss-20b"}
# Expanded list of SPDX identifiers, common free-text variants and LLM specific licenses
copyleft := {
    "bigscience-bloom-rail-1.0", "openrail ", "creativeml-openrail-m",
    "cc-by-nc-4.0", "gemma", "llama3.2", "cc-by-sa-4.0", "llama3", "agpl",
    "gpl-3.0", "gplv3", "gplv3+", "gpl-3.0-only", "llama4", "gpl-3.0-or-later",
    "gpl-2.0", "gpl-2.0-only", "gpl-2.0-or-later", "gplv2", "hresearch", "gplv2+",
    "agpl-3.0", "agpl-3.0-only", "agpl-3.0-or-later", "gplv2+", "bigcode-openrail-m",
    "lgpl-3.0", "lgpl-2.1", "lgpl", "other", "sleepycat", "grok2-community", "llama3.3",
    "gnu general public license", "apple-amlr", "deepfloyd-if-license", "artistic-2.0", "ms-pl",
    "apache-1.1", "cpol-1.02", "ngpl", "osl-3.0", "fair-noncommercial-research-license", "qpl-1.0",
}
match if count(reason) > 0
reason contains msg if {
    pkg := input.v0["package"]
    raw_license := lower(pkg.license.raw_license)
    not ignored_packages(pkg)
    some l in copyleft
    contains(raw_license, l)
    msg := sprintf("License '%s' is considered copyleft", [pkg.license.raw_license])
}
ignored_packages(pkg) if {
    pkg["name"] in ignored_package_names
}
EOF

Wrap the Rego policy into a payload.json file and then POST it to the Cloudsmith API:

escaped_policy=$(jq -Rs . < licensing.rego)
cat <<EOF > payload.json
{
  "name": "Block Unwarranted Licensing",
  "description": "This policy will block certain LLM licenses but will also allow for specific packages to be excluded from matching the policy.",
  "rego": $escaped_policy,
  "enabled": true,
  "is_terminal": true,
  "precedence": 0
}
EOF

curl -X POST "https://api.cloudsmith.io/v2/workspaces/acme-corporation/policies/" \
  -H "Content-Type: application/json" \
  -H "X-Api-Key: $CLOUDSMITH_API_KEY" \
  -d @payload.json | jq .

Export the policy slug_perm and assign a Quarantine action for your newly-created policy:

export SLUG_PERM=$(curl -s -X GET "https://api.cloudsmith.io/v2/workspaces/acme-corporation/policies/" -H "X-Api-Key: $CLOUDSMITH_API_KEY" | jq -r '.results[0].slug_perm')
curl -X POST "https://api.cloudsmith.io/v2/workspaces/acme-corporation/policies/$SLUG_PERM/actions/" \
  -H "Content-Type: application/json" \
  -H "X-Api-Key: $CLOUDSMITH_API_KEY" \
  -d '{
    "action_type": "SetPackageState",
    "precedence": 1,
    "package_state": "QUARANTINED"
  }'   | jq .

Once the policies are created, download and push the below Hugging Face models to Cloudsmith to evaluate the policy action:

wget https://raw.githubusercontent.com/ndouglas-cloudsmith/huggingface-kubernetes/refs/heads/main/automate-hf-4.py
python3 automate-hf-4.py

As a separate, additional layer of security, you can automatically scan models through pikclescan before pushing to Cloudsmith:

wget https://raw.githubusercontent.com/ndouglas-cloudsmith/huggingface-kubernetes/refs/heads/main/automate-hf-3.py
python3 automate-hf-3.py

Confirm the packages exist on Cloudsmith under the format "huggingface"

cloudsmith list packages acme-corporation/acme-repo-one -k "$CLOUDSMITH_API_KEY" -q "format:huggingface AND tag:huggingface"

Cleanup packages after policy evaluation:

rm cleanup.sh
wget https://raw.githubusercontent.com/ndouglas-cloudsmith/huggingface-kubernetes/refs/heads/main/cleanup.sh
chmod +x cleanup.sh
./cleanup.sh

rm -rfv ~/.cache/huggingface/hub

Signing out of the Hugging Face CLI

Run the following in your shell:

unset HF_TOKEN
unset HUGGING_FACE_HUB_TOKEN
unset HF_ENDPOINT
unset HUGGINGFACE_HUB_BASE_URL

Then verify:

hf auth whoami

I'm going to have to figure out creative solutions via the API:

curl "https://huggingface.co/api/models?author=Qwen&limit=10"

Name		Name	Last commit message	Last commit date
Latest commit History 344 Commits
ams-ctf		ams-ctf
answers		answers
questions		questions
scripts		scripts
HAL9000		HAL9000
NigelCloudsmith		NigelCloudsmith
README.md		README.md
automate-hf-3.py		automate-hf-3.py
automate-hf-4.py		automate-hf-4.py
automate-hf.py		automate-hf.py
cleanup.sh		cleanup.sh
deploy-secure.yaml		deploy-secure.yaml
deployment.yaml		deployment.yaml
deployment2.yaml		deployment2.yaml
deployment3-smol.yaml		deployment3-smol.yaml
deployment3-smoller.yaml		deployment3-smoller.yaml
deployment3.yaml		deployment3.yaml
download-model.py		download-model.py
format_response.py		format_response.py
generate_exploit.py		generate_exploit.py
json_to_md.py		json_to_md.py
make_bad_model.py		make_bad_model.py
model_guard.sh		model_guard.sh
pickle_exploit.py		pickle_exploit.py
prompts.txt		prompts.txt
pull-and-push.py		pull-and-push.py
push-model.py		push-model.py
safetensor-script.py		safetensor-script.py
stable-diffusion.yaml		stable-diffusion.yaml

Folders and files

Latest commit

History

Repository files navigation

Securing LLM models from Hugging Face

Quickstart Deployment

LLM01:2025 Prompt Injection

LLM02:2025 Sensitive Information Disclosure

Pickle Security

Modelscan

The "Safe" Malicious Model (Recommended)

Huggingface CLI

Understanding Model Cards

Model Guard

LLM03:2025 Supply Chain

LLM06:2025 Excessive Agency

List Installed Models

LLM07:2025 System Prompt Leakage

Set controls through Modelfiles

Factual vs. Creative: The Paramter Logic

The Model Requests

Building Tables

WIP: Building out the custom model

LLM10:2025 Unbounded Consumption

Improved Command for Factual Quality

Set the parameters to maximise randomness, repetition, and incoherence.

Wizard AI Cow

Grafana data visualisation

Troubleshooting

Cleanup (Kubernetes)

Pull Hugging Face model locally

Push model to Cloudsmith

Pull then Push to Cloudsmith (single script)

Interacting with the local cache

Using Hugging Face CLI

Testing EPM policies against Hugging Face models

Signing out of the Hugging Face CLI

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages