nexus-3b-labs · Akshat120 · Jan 30, 2026
diff --git a/THINKING_MODE.md b/THINKING_MODE.md
@@ -0,0 +1,201 @@
+# Thinking Mode with Adapters
+
+This document explains how the thinking mode (`<think>` tags) works in NexusAI, especially when using fine-tuned adapters.
+
+---
+
+## Overview
+
+NexusAI supports a "thinking mode" where the model shows its reasoning process before responding:
+
+```
+<think>User is asking about electricity. I should explain in Tesla's voice...</think>
+Alternating current flows in harmony with nature's rhythms...
+```
+
+This feature works differently depending on whether you're using the **base model** or a **fine-tuned adapter**.
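
The example output above can be split into its reasoning and answer parts with a small parser. This is a minimal sketch, not NexusAI's actual code: `split_thinking` and `ParsedReply` are hypothetical names, and it assumes at most one leading `<think>...</think>` block.

```python
import re
from typing import NamedTuple, Optional


class ParsedReply(NamedTuple):
    thinking: Optional[str]  # text inside <think>...</think>, or None if absent
    answer: str              # everything after the closing tag


# Matches one leading <think>...</think> block; DOTALL lets the reasoning span lines.
_THINK_RE = re.compile(r"^\s*<think>(.*?)</think>\s*", re.DOTALL)


def split_thinking(raw: str) -> ParsedReply:
    """Separate the reasoning block from the visible answer (hypothetical helper)."""
    match = _THINK_RE.match(raw)
    if match is None:
        # No thinking block: the whole output is the answer.
        return ParsedReply(thinking=None, answer=raw.strip())
    return ParsedReply(
        thinking=match.group(1).strip(),
        answer=raw[match.end():].strip(),
    )
```

An empty `answer` next to a non-empty `thinking` value corresponds to the "stops after `</think>`" symptom covered under Troubleshooting.
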
+
+---
+
+## How It Works
+
+### 1. Base Model (No Adapter)
+
+When no adapter is loaded:
+- Uses **Qwen's native thinking** via `enable_thinking=True` in the chat template
+- Adds thinking instructions to the system prompt
+- Includes a one-shot example to guide the format
+
+The model generates its own reasoning style.
+
+### 2. Adapter WITHOUT Thinking Support
+
+When an adapter is loaded but wasn't trained with `<think>` tags:
+- Thinking mode is **automatically disabled**
+- The toggle button turns amber and is locked
+- The model uses the direct response format
+
+This prevents the model from generating incomplete responses (stopping after `</think>`).
+
+### 3. Adapter WITH Thinking Support
+
+When an adapter was trained with `<think>` tags in its training data:
+- Check "Adapter trained with `<think>` format" when loading it
+- Qwen's native thinking is **disabled** (`enable_thinking=False`)
+- No thinking instructions are added to the prompt
+- The adapter generates thinking **naturally from its training**
+
+This ensures the adapter uses its own trained thinking style (e.g., Tesla's voice) rather than Qwen's generic reasoning.
+
+---
+
+## Training Data Format
+
+### Standard Format (No Thinking)
+
+```json
+{"prompt": "Hello", "response": "Hi there! How can I help?", "score": 10}
+```
+
+### Thinking Format
+
+```json
+{"prompt": "Hello", "response": "<think>User greeted me warmly.</think>Hi there! How can I help?", "score": 10}
+```
+
+### Example: Tesla Persona with Thinking
+
+```json
+{"prompt": "Hello Tesla", "response": "<think>A visitor greets me. I shall welcome them in my characteristic manner.</think>Greetings, seeker of truth. What stirs your mind today?", "score": 10}
+{"prompt": "Tell me about AC", "response": "<think>They wish to learn of alternating current. I shall explain with passion.</think>Alternating current flows in harmony with nature's rhythms—efficient and transformable.", "score": 10}
+```
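
Before training a thinking adapter, it can help to confirm that every record follows the thinking format shown above. This is a minimal sketch, assuming the JSONL layout from these examples (`prompt`, `response`, `score`). `check_thinking_records` is a hypothetical helper, not part of the repository.

```python
import json
import re
from typing import List

# A thinking-format response is one <think>...</think> block
# followed by a non-empty visible answer.
THINK_RESPONSE_RE = re.compile(r"^<think>.+?</think>\s*\S", re.DOTALL)


def check_thinking_records(jsonl_text: str) -> List[int]:
    """Return 1-based line numbers of records that do not follow the thinking format."""
    bad_lines = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(line)
        if not THINK_RESPONSE_RE.match(record.get("response", "")):
            bad_lines.append(line_no)
    return bad_lines
```

Records that contain only a `<think>` block with nothing after it are flagged too, since training on them would teach the adapter to stop after `</think>`.
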
+
+---
+
+## Technical Implementation
+
+### Backend Logic (`main.py`)
+
+The chat handler computes three flags. When none of them is set (base model with thinking disabled), it falls back to a direct response:
+
+```python
+# 1. No adapter + thinking enabled → use Qwen native thinking
+use_native_thinking = request.enable_thinking and not state.adapter_loaded
+
+# 2. Adapter with thinking support → let adapter handle it
+adapter_handles_thinking = state.adapter_loaded and state.adapter_supports_thinking
+
+# 3. Adapter without thinking support → direct response only
+use_direct_response = state.adapter_loaded and not state.adapter_supports_thinking
+```
+
+#### Chat Template Parameters
+
+| Scenario | `enable_thinking` | Prompt Modification |
+|----------|-------------------|---------------------|
+| Base model + thinking | `True` | Add instructions + one-shot |
+| Adapter with thinking | `False` | None (adapter trained) |
+| Adapter without thinking | `False` | "Answer directly..." |
+| Base model, thinking disabled | `False` | "Answer directly..." |
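
The flags and the table above can be folded into one pure function, which makes the four outcomes easy to unit-test. This is a sketch under stated assumptions: `ThinkingPlan` and `plan_thinking` are hypothetical names, and `prompt_change` holds short labels rather than the real prompt text. As the flag definitions are written, an adapter with thinking support takes the adapter path whatever the request's toggle value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThinkingPlan:
    mode: str              # "native", "adapter", or "direct"
    enable_thinking: bool  # value handed to the chat template
    prompt_change: str     # label for the system-prompt modification


def plan_thinking(adapter_loaded: bool,
                  adapter_supports_thinking: bool,
                  user_enabled: bool) -> ThinkingPlan:
    """Map adapter state and the request toggle to a thinking plan (hypothetical helper)."""
    if adapter_loaded and adapter_supports_thinking:
        # The adapter learned <think>...</think> from its data: leave the prompt alone.
        return ThinkingPlan("adapter", False, "none")
    if adapter_loaded:
        # Adapter without thinking support: force a direct answer.
        return ThinkingPlan("direct", False, "answer directly")
    if user_enabled:
        # Base model with thinking requested: Qwen native thinking plus instructions.
        return ThinkingPlan("native", True, "instructions + one-shot")
    # Base model with thinking switched off.
    return ThinkingPlan("direct", False, "answer directly")
```

Only the base-model path ever sets `enable_thinking` to `True`, which matches the first row of the table.
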
+
+### API Changes
+
+#### Load Adapter Request
+
+```json
+POST /v1/adapter/load
+{
+  "adapter_name": "tesla_adapter",
+  "system_prompt": "You are Nikola Tesla...",
+  "supports_thinking": true
+}
+```
+
+#### Model Status Response
+
+```json
+GET /v1/model/status
+{
+  "adapter_loaded": true,
+  "adapter_supports_thinking": true,
+  "thinking_supported": true,
+  ...
+}
+```
+
+### Frontend Changes
+
+- Added checkbox: "Adapter trained with `<think>` format"
+- Thinking toggle enabled when:
+  - No adapter loaded, OR
+  - Adapter loaded with `supports_thinking=true`
+- Thinking toggle disabled (amber) when:
+  - Adapter loaded without thinking support
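
The toggle rule above is the same boolean that the status endpoint reports as `thinking_supported`. The UI itself lives in JSX. The sketch below restates the rule in Python for illustration only, and `thinking_toggle_state` is a hypothetical name.

```python
def thinking_toggle_state(status: dict) -> dict:
    """Derive the toggle's UI state from a /v1/model/status payload (hypothetical helper)."""
    adapter_loaded = status.get("adapter_loaded", False)
    supports_thinking = status.get("adapter_supports_thinking", False)
    # Usable with no adapter, or with an adapter trained on <think> data.
    usable = (not adapter_loaded) or supports_thinking
    # "enabled" means the control is interactive, not that thinking is switched on.
    return {"enabled": usable, "locked": not usable}
```
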
+
+---
+
+## Why This Design?
+
+### Problem: Qwen's Native Thinking Conflicts with Trained Adapters
+
+Qwen3 models have built-in thinking support. When `enable_thinking=True`:
+- Qwen generates its **own** reasoning style
+- This overrides whatever the adapter was trained on
+- Result: Generic reasoning instead of persona-specific thinking
+
+### Solution: Let Adapters Control Their Own Thinking
+
+When an adapter is trained with `<think>` tags:
+1. Disable Qwen's native thinking (`enable_thinking=False`)
+2. Don't add any thinking instructions to the prompt
+3. The adapter naturally generates `<think>...</think>response` from training
+
+This preserves the adapter's unique voice and reasoning style.
+
+---
+
+## Quick Reference
+
+| State | Thinking Toggle | Behavior |
+|-------|-----------------|----------|
+| Base model | Enabled | Qwen native thinking |
+| Base model | Disabled | Direct response |
+| Adapter (no thinking) | Locked/Disabled | Direct response |
+| Adapter (with thinking) | Enabled | Adapter's trained thinking |
+| Adapter (with thinking) | Disabled | Adapter's trained thinking (the backend ignores the toggle in this state) |
+
+---
+
+## Files Modified
+
+- `main.py` — Backend logic for thinking mode
+- `nexus-lab-ui/src/App.jsx` — Frontend checkbox and toggle logic
+- `training_data.jsonl` — Example data with `<think>` tags
+
+---
+
+## Troubleshooting
+
+### Adapter generates Qwen-style thinking instead of trained style
+
+**Cause:** `enable_thinking=True` is being passed to Qwen's chat template.
+
+**Fix:** Ensure "Adapter trained with `<think>` format" is checked when loading.
+
+### Model stops after `</think>` with no response
+
+**Cause:** The adapter wasn't trained with `<think>` tags but was loaded with "Adapter trained with `<think>` format" checked.
+
+**Fix:** Either:
+1. Uncheck "Adapter trained with `<think>` format", or
+2. Retrain the adapter with `<think>` tags in responses
+
+### Thinking toggle is locked/amber
+
+**Expected:** This happens when an adapter without thinking support is loaded.
+
+**To enable:** Load an adapter trained with thinking, or unload the adapter.
+
+---
+
+*Last updated: January 2026*
diff --git a/main.py b/main.py
@@ -42,6 +42,7 @@ def __init__(self):
         self.model_name = None
         self.adapter_loaded = False
         self.active_adapter = None
+        self.adapter_supports_thinking = False  # True if adapter was trained with <think> tags
         # Default system prompt
         self.system_prompt = "You are a helpful AI assistant."
 
@@ -156,6 +157,7 @@ class ModelParamsRequest(BaseModel):
 class LoadAdapterRequest(BaseModel):
     system_prompt: str = ""
     adapter_name: str = ""
+    supports_thinking: bool = False  # True if adapter was trained with <think> format
 
 @app.post("/v1/adapter/load")
 async def load_adapter_handler(request: LoadAdapterRequest):
@@ -187,9 +189,10 @@ async def load_adapter_handler(request: LoadAdapterRequest):
         state.model.eval()
         state.adapter_loaded = True
         state.active_adapter = request.adapter_name or "Legacy Adapter"
+        state.adapter_supports_thinking = request.supports_thinking
         state.system_prompt = request.system_prompt if request.system_prompt else "You are a helpful AI assistant."
-        print(f"Adapter loaded. System Prompt: {state.system_prompt}")
-        return {"status": "Adapter loaded", "adapter": state.active_adapter}
+        print(f"Adapter loaded. System Prompt: {state.system_prompt}, Supports Thinking: {state.adapter_supports_thinking}")
+        return {"status": "Adapter loaded", "adapter": state.active_adapter, "supports_thinking": state.adapter_supports_thinking}
     except Exception as e:
         print(f"Error loading adapter: {e}")
         state.adapter_loaded = False
@@ -387,8 +390,9 @@ async def get_model_status():
         "current_model": state.model_name,
         "active_adapter": state.active_adapter,
         "adapter_loaded": state.adapter_loaded,
-        # Thinking mode is disabled when an adapter is loaded (adapters aren't trained on <think> format)
-        "thinking_supported": not state.adapter_loaded,
+        "adapter_supports_thinking": state.adapter_supports_thinking,
+        # Thinking is supported if: no adapter loaded, OR adapter was trained with thinking
+        "thinking_supported": not state.adapter_loaded or state.adapter_supports_thinking,
     }
 
 @app.post("/v1/model/unload")
@@ -449,6 +453,7 @@ async def unload_adapter_handler():
         state.model.eval()
         state.adapter_loaded = False
         state.active_adapter = None
+        state.adapter_supports_thinking = False
         state.system_prompt = "You are a helpful AI assistant."
         print("Adapter unloaded. Reverted to base model.")
         return {"status": "Adapter unloaded"}
@@ -527,20 +532,53 @@ async def chat_handler(request: ChatRequest):
         if state.tokenizer.chat_template:
             messages = [{"role": "system", "content": state.system_prompt}]
 
-            # When an adapter is loaded, skip thinking injection — adapters are trained
-            # on direct prompt→response without <think> tags, so they stop after </think>.
-            use_thinking = request.enable_thinking and not state.adapter_loaded
+            # Determine thinking mode based on adapter state:
+            # 1. No adapter + thinking enabled → use Qwen native thinking
+            # 2. Adapter with thinking support → let adapter handle it (no native thinking, no prompt modification)
+            # 3. Adapter without thinking support → direct response only
 
-            if use_thinking:
+            adapter_handles_thinking = state.adapter_loaded and state.adapter_supports_thinking
+            use_native_thinking = request.enable_thinking and not state.adapter_loaded
+            use_direct_response = state.adapter_loaded and not state.adapter_supports_thinking
+
+            print(f"[DEBUG] adapter_loaded={state.adapter_loaded}, adapter_supports_thinking={state.adapter_supports_thinking}, "
+                  f"request.enable_thinking={request.enable_thinking}, adapter_handles_thinking={adapter_handles_thinking}, "
+                  f"use_native_thinking={use_native_thinking}, use_direct_response={use_direct_response}")
+
+            if use_native_thinking:
+                # No adapter: use Qwen's native thinking with prompt instructions
                 messages[0]["content"] += "\n\nYou MUST begin by reasoning step-by-step inside <think>...</think> tags. Do NOT speak to the user inside the tags. usage: <think>internal thought</think> final response"
-                # One-shot example to guide the model
                 messages.append({"role": "user", "content": "Hello"})
                 messages.append({"role": "assistant", "content": "<think>The user is greeting me. I should respond in character.</think>Greetings. I am ready to assist."})
+            elif adapter_handles_thinking:
+                # Adapter trained with thinking: let it handle naturally, no modifications needed
+                # The adapter learned <think>...</think>response format from training data
+                pass
+            elif use_direct_response:
+                # Adapter without thinking support: force direct response
+                messages[0]["content"] += "\n\nAnswer directly without showing your thinking process."
             else:
+                # User disabled thinking, no adapter
                 messages[0]["content"] += "\n\nAnswer directly without showing your thinking process."
 
             messages.append({"role": "user", "content": request.message})
-            input_ids = state.tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to(state.model.device)
+
+            # Build kwargs for apply_chat_template
+            chat_template_kwargs = {
+                "return_tensors": "pt",
+                "add_generation_prompt": True,
+            }
+            # Qwen3 native enable_thinking: only use when no adapter and user wants thinking
+            # For adapters with thinking support, set False so adapter's trained format is used
+            chat_template_kwargs["enable_thinking"] = use_native_thinking
+
+            try:
+                input_ids = state.tokenizer.apply_chat_template(messages, **chat_template_kwargs).to(state.model.device)
+            except TypeError:
+                # Tokenizer doesn't support enable_thinking param — use standard call
+                input_ids = state.tokenizer.apply_chat_template(
+                    messages, return_tensors="pt", add_generation_prompt=True
+                ).to(state.model.device)
             # Explicit attention_mask (all 1s for single sequence) so the model doesn't warn when pad_token_id == eos_token_id
             attention_mask = input_ids.new_ones(input_ids.shape, dtype=torch.long)
         else: