diff --git a/THINKING_MODE.md b/THINKING_MODE.md
new file mode 100644
index 0000000..5bb7938
--- /dev/null
+++ b/THINKING_MODE.md
@@ -0,0 +1,201 @@
+# Thinking Mode with Adapters
+
+This document explains how the thinking mode (`<think>` tags) works in NexusAI, especially when using fine-tuned adapters.
+
+---
+
+## Overview
+
+NexusAI supports a "thinking mode" where the model shows its reasoning process before responding:
+
+```
+<think>User is asking about electricity. I should explain in Tesla's voice...</think>
+Alternating current flows in harmony with nature's rhythms...
+```
+
+This feature works differently depending on whether you're using the **base model** or a **fine-tuned adapter**.
+
+---
+
+## How It Works
+
+### 1. Base Model (No Adapter)
+
+When no adapter is loaded:
+- Uses **Qwen's native thinking** via `enable_thinking=True` in the chat template
+- Adds thinking instructions to the system prompt
+- Includes a one-shot example to guide the format
+
+The model generates its own reasoning style.
+
+### 2. Adapter WITHOUT Thinking Support
+
+When an adapter is loaded but wasn't trained with `<think>` tags:
+- Thinking mode is **automatically disabled**
+- The toggle button turns amber and is locked
+- Model uses direct response format
+
+This prevents the model from generating incomplete responses (stopping after `</think>`).
+
+### 3. Adapter WITH Thinking Support
+
+When an adapter is trained with `<think>` tags in the training data:
+- Check "Adapter trained with `<think>` format" when loading
+- Qwen's native thinking is **disabled** (`enable_thinking=False`)
+- No thinking instructions added to prompt
+- The adapter generates thinking **naturally from its training**
+
+This ensures the adapter uses its own trained thinking style (e.g., Tesla's voice) rather than Qwen's generic reasoning.
+
+---
+
+## Training Data Format
+
+### Standard Format (No Thinking)
+
+```json
+{"prompt": "Hello", "response": "Hi there! How can I help?", "score": 10}
+```
+
+### Thinking Format
+
+```json
+{"prompt": "Hello", "response": "<think>User greeted me warmly.</think>Hi there! How can I help?", "score": 10}
+```
+
+### Example: Tesla Persona with Thinking
+
+```json
+{"prompt": "Hello Tesla", "response": "<think>A visitor greets me. I shall welcome them in my characteristic manner.</think>Greetings, seeker of truth. What stirs your mind today?", "score": 10}
+{"prompt": "Tell me about AC", "response": "<think>They wish to learn of alternating current. I shall explain with passion.</think>Alternating current flows in harmony with nature's rhythms—efficient and transformable.", "score": 10}
+```
+
+---
+
+## Technical Implementation
+
+### Backend Logic (`main.py`)
+
+The chat handler determines thinking mode based on three scenarios:
+
+```python
+# 1. No adapter + thinking enabled → use Qwen native thinking
+use_native_thinking = request.enable_thinking and not state.adapter_loaded
+
+# 2. Adapter with thinking support → let adapter handle it
+adapter_handles_thinking = state.adapter_loaded and state.adapter_supports_thinking
+
+# 3. Adapter without thinking support → direct response only
+use_direct_response = state.adapter_loaded and not state.adapter_supports_thinking
+```
+
+#### Chat Template Parameters
+
+| Scenario | `enable_thinking` | Prompt Modification |
+|----------|-------------------|---------------------|
+| Base model + thinking | `True` | Add instructions + one-shot |
+| Adapter with thinking | `False` | None (adapter trained) |
+| Adapter without thinking | `False` | "Answer directly..." |
+| User disabled thinking | `False` | "Answer directly..." |
+
+### API Changes
+
+#### Load Adapter Request
+
+```json
+POST /v1/adapter/load
+{
+  "adapter_name": "tesla_adapter",
+  "system_prompt": "You are Nikola Tesla...",
+  "supports_thinking": true
+}
+```
+
+#### Model Status Response
+
+```json
+GET /v1/model/status
+{
+  "adapter_loaded": true,
+  "adapter_supports_thinking": true,
+  "thinking_supported": true,
+  ...
+}
+```
+
+### Frontend Changes
+
+- Added checkbox: "Adapter trained with `<think>` format"
+- Thinking toggle enabled when:
+  - No adapter loaded, OR
+  - Adapter loaded with `supports_thinking=true`
+- Thinking toggle disabled (amber) when:
+  - Adapter loaded without thinking support
+
+---
+
+## Why This Design?
+
+### Problem: Qwen's Native Thinking Conflicts with Trained Adapters
+
+Qwen3 models have built-in thinking support. When `enable_thinking=True`:
+- Qwen generates its **own** reasoning style
+- This overrides whatever the adapter was trained on
+- Result: Generic reasoning instead of persona-specific thinking
+
+### Solution: Let Adapters Control Their Own Thinking
+
+When an adapter is trained with `<think>` tags:
+1. Disable Qwen's native thinking (`enable_thinking=False`)
+2. Don't add any thinking instructions to the prompt
+3. The adapter naturally generates `<think>...</think>response` from training
+
+This preserves the adapter's unique voice and reasoning style.
+
+---
+
+## Quick Reference
+
+| State | Thinking Toggle | Behavior |
+|-------|-----------------|----------|
+| Base model | Enabled | Qwen native thinking |
+| Base model | Disabled | Direct response |
+| Adapter (no thinking) | Locked/Disabled | Direct response |
+| Adapter (with thinking) | Enabled | Adapter's trained thinking |
+| Adapter (with thinking) | Disabled | Direct response |
+
+---
+
+## Files Modified
+
+- `main.py` — Backend logic for thinking mode
+- `nexus-lab-ui/src/App.jsx` — Frontend checkbox and toggle logic
+- `training_data.jsonl` — Example data with `<think>` tags
+
+---
+
+## Troubleshooting
+
+### Adapter generates Qwen-style thinking instead of trained style
+
+**Cause:** `enable_thinking=True` is being passed to Qwen's chat template.
+
+**Fix:** Ensure "Adapter trained with `<think>` format" is checked when loading.
+
+### Model stops after `</think>` with no response
+
+**Cause:** Adapter wasn't trained with `<think>` tags but thinking mode is enabled.
+
+**Fix:** Either:
+1. Uncheck "Adapter trained with `<think>` format", or
+2. Retrain the adapter with `<think>` tags in responses
+
+### Thinking toggle is locked/amber
+
+**Expected:** This happens when an adapter without thinking support is loaded.
+
+**To enable:** Load an adapter trained with thinking, or unload the adapter.
+
+---
+
+*Last updated: January 2026*
diff --git a/main.py b/main.py
index a57e3ed..85c7646 100644
--- a/main.py
+++ b/main.py
@@ -42,6 +42,7 @@ def __init__(self):
         self.model_name = None
         self.adapter_loaded = False
         self.active_adapter = None
+        self.adapter_supports_thinking = False  # True if adapter was trained with <think> tags
 
         # Default system prompt
         self.system_prompt = "You are a helpful AI assistant."
@@ -156,6 +157,7 @@ class ModelParamsRequest(BaseModel):
 class LoadAdapterRequest(BaseModel):
     system_prompt: str = ""
     adapter_name: str = ""
+    supports_thinking: bool = False  # True if adapter was trained with <think> format
 
 @app.post("/v1/adapter/load")
 async def load_adapter_handler(request: LoadAdapterRequest):
@@ -187,9 +189,10 @@ async def load_adapter_handler(request: LoadAdapterRequest):
         state.model.eval()
         state.adapter_loaded = True
         state.active_adapter = request.adapter_name or "Legacy Adapter"
+        state.adapter_supports_thinking = request.supports_thinking
         state.system_prompt = request.system_prompt if request.system_prompt else "You are a helpful AI assistant."
-        print(f"Adapter loaded. System Prompt: {state.system_prompt}")
-        return {"status": "Adapter loaded", "adapter": state.active_adapter}
+        print(f"Adapter loaded. System Prompt: {state.system_prompt}, Supports Thinking: {state.adapter_supports_thinking}")
+        return {"status": "Adapter loaded", "adapter": state.active_adapter, "supports_thinking": state.adapter_supports_thinking}
     except Exception as e:
         print(f"Error loading adapter: {e}")
         state.adapter_loaded = False
@@ -387,8 +390,9 @@ async def get_model_status():
         "current_model": state.model_name,
         "active_adapter": state.active_adapter,
         "adapter_loaded": state.adapter_loaded,
-        # Thinking mode is disabled when an adapter is loaded (adapters aren't trained on <think> format)
-        "thinking_supported": not state.adapter_loaded,
+        "adapter_supports_thinking": state.adapter_supports_thinking,
+        # Thinking is supported if: no adapter loaded, OR adapter was trained with thinking
+        "thinking_supported": not state.adapter_loaded or state.adapter_supports_thinking,
     }
 
 @app.post("/v1/model/unload")
@@ -449,6 +453,7 @@ async def unload_adapter_handler():
         state.model.eval()
         state.adapter_loaded = False
         state.active_adapter = None
+        state.adapter_supports_thinking = False
         state.system_prompt = "You are a helpful AI assistant."
         print("Adapter unloaded. Reverted to base model.")
         return {"status": "Adapter unloaded"}
@@ -527,20 +532,53 @@ async def chat_handler(request: ChatRequest):
     if state.tokenizer.chat_template:
         messages = [{"role": "system", "content": state.system_prompt}]
 
-        # When an adapter is loaded, skip thinking injection — adapters are trained
-        # on direct prompt→response without <think> tags, so they stop after </think>.
-        use_thinking = request.enable_thinking and not state.adapter_loaded
+        # Determine thinking mode based on adapter state:
+        # 1. No adapter + thinking enabled → use Qwen native thinking
+        # 2. Adapter with thinking support → let adapter handle it (no native thinking, no prompt modification)
+        # 3. Adapter without thinking support → direct response only
 
-        if use_thinking:
+        adapter_handles_thinking = state.adapter_loaded and state.adapter_supports_thinking
+        use_native_thinking = request.enable_thinking and not state.adapter_loaded
+        use_direct_response = state.adapter_loaded and not state.adapter_supports_thinking
+
+        print(f"[DEBUG] adapter_loaded={state.adapter_loaded}, adapter_supports_thinking={state.adapter_supports_thinking}, "
+              f"request.enable_thinking={request.enable_thinking}, adapter_handles_thinking={adapter_handles_thinking}, "
+              f"use_native_thinking={use_native_thinking}, use_direct_response={use_direct_response}")
+
+        if use_native_thinking:
+            # No adapter: use Qwen's native thinking with prompt instructions
             messages[0]["content"] += "\n\nYou MUST begin by reasoning step-by-step inside <think>...</think> tags. Do NOT speak to the user inside the tags. usage: <think>internal thought</think> final response"
-            # One-shot example to guide the model
             messages.append({"role": "user", "content": "Hello"})
             messages.append({"role": "assistant", "content": "<think>The user is greeting me. I should respond in character.</think>Greetings. I am ready to assist."})
+        elif adapter_handles_thinking:
+            # Adapter trained with thinking: let it handle naturally, no modifications needed
+            # The adapter learned <think>...</think>response format from training data
+            pass
+        elif use_direct_response:
+            # Adapter without thinking support: force direct response
+            messages[0]["content"] += "\n\nAnswer directly without showing your thinking process."
         else:
+            # User disabled thinking, no adapter
             messages[0]["content"] += "\n\nAnswer directly without showing your thinking process."
 
         messages.append({"role": "user", "content": request.message})
-        input_ids = state.tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to(state.model.device)
+
+        # Build kwargs for apply_chat_template
+        chat_template_kwargs = {
+            "return_tensors": "pt",
+            "add_generation_prompt": True,
+        }
+        # Qwen3 native enable_thinking: only use when no adapter and user wants thinking
+        # For adapters with thinking support, set False so adapter's trained format is used
+        chat_template_kwargs["enable_thinking"] = use_native_thinking
+
+        try:
+            input_ids = state.tokenizer.apply_chat_template(messages, **chat_template_kwargs).to(state.model.device)
+        except TypeError:
+            # Tokenizer doesn't support enable_thinking param — use standard call
+            input_ids = state.tokenizer.apply_chat_template(
+                messages, return_tensors="pt", add_generation_prompt=True
+            ).to(state.model.device)
         # Explicit attention_mask (all 1s for single sequence) so the model doesn't warn when pad_token_id == eos_token_id
         attention_mask = input_ids.new_ones(input_ids.shape, dtype=torch.long)
     else:
diff --git a/nexus-lab-ui/src/App.jsx b/nexus-lab-ui/src/App.jsx
index 8b5a313..b120750 100644
--- a/nexus-lab-ui/src/App.jsx
+++ b/nexus-lab-ui/src/App.jsx
@@ -41,6 +41,7 @@ export default function App() {
   const [adapterName, setAdapterName] = useState("my_adapter");
   const [availableAdapters, setAvailableAdapters] = useState([]);
   const [selectedAdapter, setSelectedAdapter] = useState("");
+  const [adapterSupportsThinking, setAdapterSupportsThinking] = useState(false);
   const [enableThinking, setEnableThinking] = useState(true);
 
   // Sidebar Resizing State
@@ -357,17 +358,21 @@ export default function App() {
+ +
+
+ Training with Thinking Support +
+

+ To use thinking mode with your adapter, include <think> tags in your training responses: +

+
+{`{"prompt": "what's 2+2?", "response": "<think>Simple math.</think>That's 4!", "score": 10}
+{"prompt": "hello", "response": "<think>User greeted me.</think>Hey! How are you?", "score": 10}`}
+

+ Then check "Adapter trained with <think> format" when loading the adapter. +

+
@@ -950,6 +984,78 @@ export default function App() { + {/* Section 4: How Adapters Work */} +
+
+ 4. How Adapters Work +
+
+

+ Adapters are lightweight LoRA weights (~10MB) that modify the base model's behavior without replacing it. Here's how they work in NexusAI: +

+ +
+
+
+ 1 +
+
+
Training Format
+

+ Adapters are trained on direct prompt → response pairs from your JSONL data. They learn your style, tone, and content without any special formatting. +

+
+
+ +
+
+ 2 +
+
+
Load & Unload
+

+ Load: Adapter weights are merged with the base model for inference.
+ Unload: Reverts to the pure base model — no adapter influence remains. +

+
+
+ +
+
+ ! +
+
+
Thinking Mode & Adapters
+

+ By default, thinking is disabled when an adapter is loaded. Why? Adapters are usually trained on direct responses, so they don't know about <think>...</think> tags and will stop generating after them.

+
+
+
+ +
+
Want Thinking with Your Adapter?
+

+ Include <think> examples in your training data: +

+
+ {`{"prompt": "hi", "response": "<think>User greeted me...</think>Hey! How are you?"}`}
+
+ +
+
+ Base Model
+ Thinking mode available. General-purpose responses.
+
+ With Adapter
+ Direct responses only. Custom style/persona active.
+
+
+
+ ) : (