
[v1.105.1] UnboundLocalError: local variable 'used_tool_json' referenced before assignment when using Jinja #1909

@50t0r25

Description

Describe the Issue
I am encountering an exception in KoboldCpp 1.105.1 on Windows whenever the --jinja option is enabled (toggling jinja_tools on or off makes no difference). Any request sent to the OpenAI-compatible API endpoint fails with an UnboundLocalError for the variable used_tool_json.

This appears to be a regression, as version 1.104 worked correctly with the same settings.
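
For reference, the failure is easy to trigger: any chat-completions request reproduces it while --jinja is active. Below is a minimal reproduction sketch (it assumes the default port 5001 and the standard OpenAI-compatible /v1/chat/completions path that KoboldCpp announces at startup; the payload is a trimmed version of the failing request shown in the log further down):

import json
import urllib.request

# Hypothetical minimal reproduction: POST a chat-completions request to the
# local OpenAI-compatible endpoint that KoboldCpp starts on port 5001.
payload = {
    "messages": [
        {"role": "system", "content": "You are a polite professional AI assistant."},
        {"role": "user", "content": "sup"},
    ],
    "max_tokens": 32,
    "stream": False,
}

req = urllib.request.Request(
    "http://localhost:5001/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# On 1.105.1 with --jinja enabled, the server raises the UnboundLocalError shown
# in the traceback below instead of returning a completion.
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode("utf-8"))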

Additional Information:
Here is the full terminal output from a failing run:

***
Welcome to KoboldCpp - Version 1.105.1
For command line arguments, please refer to --help
***
Auto Selected CUDA Backend (flag=0)

Loading Chat Completions Adapter: C:\Users\User01\AppData\Local\Temp\_MEI31642\kcpp_adapters\AutoGuess.json
Chat Completions Adapter Loaded
Loaded existing savedatafile at 'C:\Users\User01\Documents\KoboldCPP\kcpp_savedata.jsondb'.
System: Windows 10.0.22631 AMD64 AMD64 Family 23 Model 96 Stepping 1, AuthenticAMD
Detected Available GPU Memory: 6144 MB
Detected Available RAM: 9985 MB
Initializing dynamic library: koboldcpp_cublas.dll
==========
Namespace(admin=False, admindir='', adminpassword='', analyze='', autofit=False, batchsize=512, benchmark=None, blasthreads=None, chatcompletionsadapter='AutoGuess', cli=False, config=None, contextsize=16384, debugmode=0, defaultgenamt=512, draftamount=8, draftgpulayers=999, draftgpusplit=None, draftmodel=None, embeddingsgpu=False, embeddingsmaxctx=0, embeddingsmodel='', enableguidance=False, exportconfig='', exporttemplate='', failsafe=False, flashattention=True, forceversion=False, foreground=False, gendefaults='', gendefaultsoverwrite=False, genlimit=0, gpulayers=22, highpriority=False, hordeconfig=None, hordegenlen=0, hordekey='', hordemaxctx=0, hordemodelname='', hordeworkername='', host='', ignoremissing=False, jinja=True, jinja_tools=True, launch=False, lora=None, loramult=1.0, lowvram=False, maingpu=-1, maxrequestsize=32, mmproj=None, mmprojcpu=False, model=[], model_param='C:/Users/User01/Documents/KoboldCPP/LLMs/Impish_Bloodmoon_12B.i1-Q4_K_M.gguf', moecpu=0, moeexperts=-1, multiplayer=False, multiuser=1, noavx2=False, noblas=False, nobostoken=False, nocertify=False, nofastforward=False, nommap=False, nomodel=False, noshift=False, onready='', overridekv=None, overridenativecontext=0, overridetensors=None, password=None, pipelineparallel=False, port=5001, port_param=5001, preloadstory=None, prompt='', quantkv=0, quiet=False, ratelimit=0, remotetunnel=False, ropeconfig=[0.0, 10000.0], savedatafile='C:/Users/User01/Documents/KoboldCPP/kcpp_savedata.jsondb', sdclamped=0, sdclampedsoft=0, sdclip1='', sdclip2='', sdclipgpu=False, sdconfig=None, sdconvdirect='off', sdflashattention=False, sdgendefaults=False, sdlora='', sdloramult=1.0, sdmodel='', sdnotile=False, sdoffloadcpu=False, sdphotomaker='', sdquant=0, sdt5xxl='', sdthreads=7, sdtiledvae=768, sdvae='', sdvaeauto=False, sdvaecpu=False, showgui=False, singleinstance=False, skiplauncher=False, smartcache=0, smartcontext=False, ssl=None, tensor_split=None, testmemory=False, threads=7, ttsgpu=False, ttsmaxlen=4096, ttsmodel='', ttsthreads=0, ttswavtokenizer='', unpack='', useclblast=None, usecpu=False, usecuda=['normal', '0', 'mmq'], usemlock=True, usemmap=False, useswa=False, usevulkan=None, version=False, visionmaxres=1024, websearch=False, whispermodel='')
==========
Loading Text Model: C:\Users\User01\Documents\KoboldCPP\LLMs\Impish_Bloodmoon_12B.i1-Q4_K_M.gguf

The reported GGUF Arch is: llama
Arch Category: 0

---
Identified as GGUF model.
Attempting to Load...
---
Using automatic RoPE scaling for GGUF. If the model has custom RoPE settings, they'll be used directly instead!
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
CUDA MMQ: True
---
Initializing CUDA/HIP, please wait, the following step may take a few minutes (only for first launch)...
---
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1660 Ti, compute capability 7.5, VMM: yes
The following devices will have suboptimal performance due to a lack of tensor cores:
  Device 0: NVIDIA GeForce GTX 1660 Ti
Consider compiling with CMAKE_CUDA_ARCHITECTURES=61-virtual;80-virtual and DGGML_CUDA_FORCE_MMQ to force the use of the Pascal code for Turing.
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce GTX 1660 Ti) (0000:01:00.0) - 5128 MiB free
llama_model_loader: loaded meta data with 55 key-value pairs and 363 tensors from C:\Users\User01\Documents\KoboldCPP\LLMs\Impish_Bloodmoon_12B.i1-Q4_K_M.gguf (version GGUF V3 (latest))
print_info: file format = GGUF V3 (latest)
print_info: file size   = 6.96 GiB (4.88 BPW)
init_tokenizer: initializing tokenizer for type 2
load: printing all EOG tokens:
load:   - 131072 ('<|im_end|>')
load: special tokens cache size = 1002
load: token to piece cache size = 0.8499 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: no_alloc         = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 5120
print_info: n_embd_inp       = 5120
print_info: n_layer          = 40
print_info: n_head           = 32
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 14336
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: n_expert_groups  = 0
print_info: n_group_used     = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_yarn_log_mul= 0.0000
print_info: rope_finetuned   = unknown
print_info: model type       = 13B
print_info: model params     = 12.25 B
print_info: general.name     = Impish_Bloodmoon_12B
print_info: vocab type       = BPE
print_info: n_vocab          = 131074
print_info: n_merges         = 269443
print_info: BOS token        = 1 '<s>'
print_info: EOS token        = 131072 '<|im_end|>'
print_info: EOT token        = 131072 '<|im_end|>'
print_info: UNK token        = 0 '<unk>'
print_info: PAD token        = 10 '<pad>'
print_info: LF token         = 1010 'ÄS'
print_info: EOG token        = 131072 '<|im_end|>'
print_info: max token length = 150
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: relocated tensors: 1 of 363
warning: failed to VirtualLock 377495552-byte buffer (after previously locking 0 bytes): Invalid access to memory location.

load_tensors: offloading output layer to GPU
load_tensors: offloading 21 repeating layers to GPU
load_tensors: offloaded 22/41 layers to GPU
load_tensors:          CPU model buffer size =   360.01 MiB
load_tensors:        CUDA0 model buffer size =  3809.79 MiB
load_tensors:    CUDA_Host model buffer size =  2953.52 MiB
load_all_data: using async uploads for device CUDA0, buffer type CUDA0, backend CUDA0
..........................................................................................
Automatic RoPE Scaling: Using model internal value.
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 16640
llama_context: n_ctx_seq     = 16640
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = enabled
llama_context: kv_unified    = true
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (16640) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context:  CUDA_Host  output buffer size =     0.50 MiB
llama_kv_cache:        CPU KV buffer size =  1235.00 MiB
llama_kv_cache:      CUDA0 KV buffer size =  1365.00 MiB
llama_kv_cache: size = 2600.00 MiB ( 16640 cells,  40 layers,  1/1 seqs), K (f16): 1300.00 MiB, V (f16): 1300.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 2
llama_context: max_nodes = 2904
llama_context: reserving full memory module
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 1
llama_context:      CUDA0 compute buffer size =   304.01 MiB
llama_context:  CUDA_Host compute buffer size =    42.51 MiB
llama_context: graph nodes  = 1247
llama_context: graph splits = 211 (with bs=512), 2 (with bs=1)
Threadpool set to 7 threads and 7 blasthreads...
attach_threadpool: call
Starting model warm up, please wait a moment...
Load Text Model OK: True
Chat completion heuristic: ChatML (Generic)
Embedded KoboldAI Lite loaded.
Embedded API docs loaded.
Llama.cpp UI loaded.
======
Active Modules: TextGeneration
Inactive Modules: ImageGeneration VoiceRecognition MultimodalVision MultimodalAudio NetworkMultiplayer ApiKeyPassword WebSearchProxy TextToSpeech VectorEmbeddings AdminControl
Enabled APIs: KoboldCppApi OpenAiApi OllamaApi
Starting Kobold API on port 5001 at http://localhost:5001/api/
Starting OpenAI Compatible API on port 5001 at http://localhost:5001/v1/
Starting llama.cpp secondary WebUI at http://localhost:5001/lcpp/
======
Please connect to custom endpoint at http://localhost:5001

Input: {"n": 1, "max_context_length": 16384, "max_length": 896, "rep_pen": 1.05, "temperature": 0.7, "top_p": 0.95, "top_k": 50, "top_a": 0, "typical": 1, "tfs": 1, "rep_pen_range": 360, "rep_pen_slope": 0.7, "sampler_order": [6, 0, 1, 3, 4, 2, 5], "memory": "<|im_start|>system\nYou are a polite and shy professional AI assistant, your name is Kobold. You reply in a short concise manner, simulating texting.\n", "trim_stop": true, "genkey": "KCPP4772", "min_p": 0, "dynatemp_range": 0, "dynatemp_exponent": 1, "smoothing_factor": 0, "smoothing_curve": 1, "nsigma": 0, "banned_tokens": [], "render_special": false, "logprobs": false, "presence_penalty": 0, "logit_bias": {}, "stop_sequence": ["<|im_end|>\n<|im_start|>user", "<|im_end|>\n<|im_start|>assistant"], "use_default_badwordsids": false, "bypass_eos": false, "prompt": "<|im_end|>\n<|im_start|>user\nsup<|im_end|>\n<|im_start|>assistant\n"}

Processing Prompt [BATCH] (44 / 44 tokens)
Generating (10 / 896 tokens)
(EOS token triggered! ID:131072)
[19:09:41] CtxLimit:54/16384, Amt:10/896, Init:0.15s, Process:0.92s (48.03T/s), Generate:1.55s (6.43T/s), Total:2.47s
Output: Hello, how can I assist you today?

Input: {"messages": [{"role": "system", "content": "You are a polite professional AI assistant, your name is Kobold. You reply in a short concise manner."}, {"role": "user", "content": "sup"}], "stream": true, "reasoning_format": "auto", "temperature": 0.7, "max_tokens": -1, "dynatemp_range": 0, "dynatemp_exponent": 1, "top_k": 40, "top_p": 0.95, "min_p": 0, "xtc_probability": 0, "xtc_threshold": 0.1, "typ_p": 1, "repeat_last_n": 360, "repeat_penalty": 1.05, "presence_penalty": 0, "frequency_penalty": 0, "dry_multiplier": 0, "dry_base": 1.75, "dry_allowed_length": 2, "dry_penalty_last_n": -1, "samplers": ["top_k", "typ_p", "top_p", "min_p", "temperature"], "timings_per_token": true, "continue_assistant_turn": true}
----------------------------------------
Exception happened during processing of request from ('::1', 52769, 0, 0)
Traceback (most recent call last):
  File "socketserver.py", line 316, in _handle_request_noblock
  File "socketserver.py", line 347, in process_request
  File "socketserver.py", line 360, in finish_request
  File "koboldcpp.py", line 3194, in __call__
  File "http\server.py", line 647, in __init__
  File "socketserver.py", line 747, in __init__
  File "http\server.py", line 427, in handle
  File "http\server.py", line 415, in handle_one_request
  File "koboldcpp.py", line 4595, in do_POST
  File "koboldcpp.py", line 3039, in transform_genparams
UnboundLocalError: local variable 'used_tool_json' referenced before assignment
----------------------------------------
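
For what it's worth, this class of error usually means a local variable is only assigned inside a conditional branch (e.g. one gated on tool-call handling) but is read unconditionally afterwards. A hypothetical sketch of that pattern and the usual initialize-before-branch fix, using made-up names rather than the actual transform_genparams code:

# Hypothetical sketch of the failure pattern, not the real koboldcpp.py code.
def transform_genparams_sketch(genparams, jinja_tools_enabled):
    if jinja_tools_enabled and genparams.get("tools"):
        used_tool_json = genparams["tools"]   # only assigned on this branch
    # ... other processing ...
    if used_tool_json:                        # UnboundLocalError when the branch above was skipped
        return used_tool_json


# The usual fix is to give the variable a default before any branch touches it.
def transform_genparams_fixed(genparams, jinja_tools_enabled):
    used_tool_json = None                     # safe default, so the later read always succeeds
    if jinja_tools_enabled and genparams.get("tools"):
        used_tool_json = genparams["tools"]
    if used_tool_json:
        return used_tool_json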
