A minimal, hackable Vision-Language Model built on Karpathy’s nanochat — add image understanding and multimodal chat for under $200 in compute.
pytorch vlm finetuning llm llms vlms multimodal-llm vision-tokenization nanochat vision-language-tokenizer
Updated Feb 11, 2026 · Python