201 changes: 201 additions & 0 deletions THINKING_MODE.md
@@ -0,0 +1,201 @@
# Thinking Mode with Adapters

This document explains how the thinking mode (`<think>` tags) works in NexusAI, especially when using fine-tuned adapters.

---

## Overview

NexusAI supports a "thinking mode" where the model shows its reasoning process before responding:

```
<think>User is asking about electricity. I should explain in Tesla's voice...</think>
Alternating current flows in harmony with nature's rhythms...
```

This feature works differently depending on whether you're using the **base model** or a **fine-tuned adapter**.

---

## How It Works

### 1. Base Model (No Adapter)

When no adapter is loaded:
- Uses **Qwen's native thinking** via `enable_thinking=True` in the chat template
- Adds thinking instructions to the system prompt
- Includes a one-shot example to guide the format

The model generates its own reasoning style.

### 2. Adapter WITHOUT Thinking Support

When an adapter is loaded but wasn't trained with `<think>` tags:
- Thinking mode is **automatically disabled**
- The toggle button turns amber and is locked
- Model uses direct response format

This prevents the model from generating incomplete responses (stopping after `</think>`).

### 3. Adapter WITH Thinking Support

When an adapter is trained with `<think>` tags in the training data:
- Check "Adapter trained with `<think>` format" when loading
- Qwen's native thinking is **disabled** (`enable_thinking=False`)
- No thinking instructions added to prompt
- The adapter generates thinking **naturally from its training**

This ensures the adapter uses its own trained thinking style (e.g., Tesla's voice) rather than Qwen's generic reasoning.

---

## Training Data Format

### Standard Format (No Thinking)

```json
{"prompt": "Hello", "response": "Hi there! How can I help?", "score": 10}
```

### Thinking Format

```json
{"prompt": "Hello", "response": "<think>User greeted me warmly.</think>Hi there! How can I help?", "score": 10}
```

### Example: Tesla Persona with Thinking

```json
{"prompt": "Hello Tesla", "response": "<think>A visitor greets me. I shall welcome them in my characteristic manner.</think>Greetings, seeker of truth. What stirs your mind today?", "score": 10}
{"prompt": "Tell me about AC", "response": "<think>They wish to learn of alternating current. I shall explain with passion.</think>Alternating current flows in harmony with nature's rhythms—efficient and transformable.", "score": 10}
```
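
Before training, it can help to confirm that every record either has no tags at all or has a complete `<think>...</think>` block followed by a non-empty answer. The following is a minimal sketch under that assumption; `check_record` is a hypothetical helper, not part of NexusAI, and it only assumes the `prompt`/`response`/`score` fields shown above.

```python
import json
import re

# Hypothetical helper (not part of NexusAI): classify one JSONL training record.
# A thinking record must be "<think>reasoning</think>answer" with both parts non-empty.
THINK_RE = re.compile(r"^<think>(.+?)</think>(.+)$", re.DOTALL)

def check_record(line: str) -> str:
    """Return 'thinking', 'direct', or 'invalid' for one JSONL line."""
    record = json.loads(line)
    response = record.get("response", "")
    if "<think>" not in response and "</think>" not in response:
        # No tags at all: a standard record, valid if it has any content.
        return "direct" if response.strip() else "invalid"
    match = THINK_RE.match(response)
    if match and match.group(1).strip() and match.group(2).strip():
        return "thinking"
    # Unbalanced tags, empty reasoning, or nothing after </think>.
    return "invalid"

samples = [
    '{"prompt": "Hello", "response": "Hi there! How can I help?", "score": 10}',
    '{"prompt": "Hello", "response": "<think>User greeted me warmly.</think>Hi there!", "score": 10}',
    '{"prompt": "Hello", "response": "<think>Only reasoning.</think>", "score": 10}',
]
print([check_record(s) for s in samples])
```

A record that ends right after `</think>` is flagged as invalid because an adapter trained on such data would likely learn to stop without answering.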

---

## Technical Implementation

### Backend Logic (`main.py`)

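The chat handler derives three flags from the adapter state and the request. When none of them applies (base model with thinking disabled), it falls through to a direct response: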

```python
# 1. No adapter + thinking enabled → use Qwen native thinking
use_native_thinking = request.enable_thinking and not state.adapter_loaded

# 2. Adapter with thinking support → let adapter handle it
adapter_handles_thinking = state.adapter_loaded and state.adapter_supports_thinking

# 3. Adapter without thinking support → direct response only
use_direct_response = state.adapter_loaded and not state.adapter_supports_thinking
```

#### Chat Template Parameters

| Scenario | `enable_thinking` | Prompt Modification |
|----------|-------------------|---------------------|
| Base model + thinking | `True` | Add instructions + one-shot |
| Adapter with thinking | `False` | None (adapter trained) |
| Adapter without thinking | `False` | "Answer directly..." |
| User disabled thinking | `False` | "Answer directly..." |
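
The table can be read as a small decision function. The sketch below is illustrative only: `plan_thinking` and `ThinkingPlan` are hypothetical names, and the branch order is assumed to follow the `if`/`elif` chain in the chat handler.

```python
from typing import NamedTuple

class ThinkingPlan(NamedTuple):
    enable_thinking: bool  # value passed to the chat template
    prompt_change: str     # "instructions+one-shot", "none", or "direct"

def plan_thinking(adapter_loaded: bool, supports_thinking: bool, user_enabled: bool) -> ThinkingPlan:
    """Hypothetical mirror of the handler's branch order (not the actual code)."""
    if user_enabled and not adapter_loaded:
        # Base model + thinking: native thinking plus instructions and one-shot.
        return ThinkingPlan(True, "instructions+one-shot")
    if adapter_loaded and supports_thinking:
        # Adapter trained with <think>: prompt left untouched.
        # Note: as written, this branch does not consult user_enabled.
        return ThinkingPlan(False, "none")
    # Adapter without thinking support, or base model with thinking disabled.
    return ThinkingPlan(False, "direct")

print(plan_thinking(adapter_loaded=False, supports_thinking=False, user_enabled=True))
print(plan_thinking(adapter_loaded=True, supports_thinking=True, user_enabled=True))
print(plan_thinking(adapter_loaded=True, supports_thinking=False, user_enabled=True))
```

Keeping the decision in one pure function like this makes each row of the table easy to unit test without loading a model.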

### API Changes

#### Load Adapter Request

```json
POST /v1/adapter/load
{
  "adapter_name": "tesla_adapter",
  "system_prompt": "You are Nikola Tesla...",
  "supports_thinking": true
}
```

#### Model Status Response

```json
GET /v1/model/status
{
  "adapter_loaded": true,
  "adapter_supports_thinking": true,
  "thinking_supported": true,
  ...
}
```
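
For scripted use, the load request can be assembled with the standard library. This is a sketch, not an official client: `BASE_URL` and `build_load_request` are assumptions for illustration, and the host and port must be adjusted to wherever the server actually runs.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: change to your deployment

def build_load_request(adapter_name: str, system_prompt: str,
                       supports_thinking: bool) -> urllib.request.Request:
    """Build (but do not send) a POST to /v1/adapter/load."""
    payload = {
        "adapter_name": adapter_name,
        "system_prompt": system_prompt,
        "supports_thinking": supports_thinking,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/adapter/load",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_load_request("tesla_adapter", "You are Nikola Tesla...", True)
print(req.get_method(), req.full_url)

# Sending the request needs a running server:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```

Omitting `supports_thinking` falls back to `false` on the server side, so adapters trained with `<think>` tags need the flag set explicitly.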

### Frontend Changes

- Added checkbox: "Adapter trained with `<think>` format"
- Thinking toggle enabled when:
  - No adapter loaded, OR
  - Adapter loaded with `supports_thinking=true`
- Thinking toggle disabled (amber) when:
  - Adapter loaded without thinking support
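
The toggle rule reduces to a single boolean expression over the status response. The snippet below is a Python stand-in for that rule, not the actual `App.jsx` code; `toggle_state` is a hypothetical name, and the field names are taken from the status response shown above.

```python
def toggle_state(status: dict) -> str:
    """Return 'enabled' or 'locked' for the thinking toggle (illustrative only)."""
    adapter_loaded = status.get("adapter_loaded", False)
    supports = status.get("adapter_supports_thinking", False)
    # Same rule the status endpoint uses for `thinking_supported`.
    thinking_supported = (not adapter_loaded) or supports
    return "enabled" if thinking_supported else "locked"

print(toggle_state({"adapter_loaded": False}))
print(toggle_state({"adapter_loaded": True, "adapter_supports_thinking": True}))
print(toggle_state({"adapter_loaded": True, "adapter_supports_thinking": False}))
```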

---

## Why This Design?

### Problem: Qwen's Native Thinking Conflicts with Trained Adapters

Qwen3 models have built-in thinking support. When `enable_thinking=True`:
- Qwen generates its **own** reasoning style
- This overrides whatever the adapter was trained on
- Result: Generic reasoning instead of persona-specific thinking

### Solution: Let Adapters Control Their Own Thinking

When an adapter is trained with `<think>` tags:
1. Disable Qwen's native thinking (`enable_thinking=False`)
2. Don't add any thinking instructions to the prompt
3. The adapter naturally generates `<think>...</think>response` from training

This preserves the adapter's unique voice and reasoning style.
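
Since the adapter emits `<think>...</think>response` as plain text, a client usually needs to separate the two parts before display. The following is a minimal sketch of that split; `split_thinking` is a hypothetical helper and assumes at most one leading `<think>` block per reply.

```python
import re
from typing import Optional, Tuple

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_thinking(text: str) -> Tuple[Optional[str], str]:
    """Split raw model output into (thinking, answer).

    Returns (None, text) when no <think> block is present.
    """
    match = THINK_BLOCK.search(text)
    if match is None:
        return None, text.strip()
    thinking = match.group(1).strip()
    answer = text[match.end():].strip()  # everything after </think>
    return thinking, answer

raw = "<think>A visitor greets me.</think>Greetings, seeker of truth."
thinking, answer = split_thinking(raw)
print(thinking)
print(answer)
```

An empty `answer` with a non-empty `thinking` corresponds to the "model stops after `</think>`" symptom covered under Troubleshooting.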

---

## Quick Reference

| State | Thinking Toggle | Behavior |
|-------|-----------------|----------|
| Base model | Enabled | Qwen native thinking |
| Base model | Disabled | Direct response |
| Adapter (no thinking) | Locked/Disabled | Direct response |
| Adapter (with thinking) | Enabled | Adapter's trained thinking |
| Adapter (with thinking) | Disabled | Direct response |

---

## Files Modified

- `main.py` — Backend logic for thinking mode
- `nexus-lab-ui/src/App.jsx` — Frontend checkbox and toggle logic
- `training_data.jsonl` — Example data with `<think>` tags

---

## Troubleshooting

### Adapter generates Qwen-style thinking instead of trained style

**Cause:** `enable_thinking=True` is being passed to Qwen's chat template.

**Fix:** Ensure "Adapter trained with `<think>` format" is checked when loading.

### Model stops after `</think>` with no response

**Cause:** The adapter wasn't trained with `<think>` tags, but it was loaded with "Adapter trained with `<think>` format" checked, so thinking mode stays enabled.

**Fix:** Either:
1. Uncheck "Adapter trained with `<think>` format", or
2. Retrain the adapter with `<think>` tags in responses

### Thinking toggle is locked/amber

**Expected:** This happens when an adapter without thinking support is loaded.

**To enable:** Load an adapter trained with thinking, or unload the adapter.

---

*Last updated: January 2026*
58 changes: 48 additions & 10 deletions main.py
@@ -42,6 +42,7 @@ def __init__(self):
self.model_name = None
self.adapter_loaded = False
self.active_adapter = None
self.adapter_supports_thinking = False # True if adapter was trained with <think> tags
# Default system prompt
self.system_prompt = "You are a helpful AI assistant."

@@ -156,6 +157,7 @@ class ModelParamsRequest(BaseModel):
class LoadAdapterRequest(BaseModel):
system_prompt: str = ""
adapter_name: str = ""
supports_thinking: bool = False # True if adapter was trained with <think> format

@app.post("/v1/adapter/load")
async def load_adapter_handler(request: LoadAdapterRequest):
@@ -187,9 +189,10 @@ async def load_adapter_handler(request: LoadAdapterRequest):
state.model.eval()
state.adapter_loaded = True
state.active_adapter = request.adapter_name or "Legacy Adapter"
state.adapter_supports_thinking = request.supports_thinking
state.system_prompt = request.system_prompt if request.system_prompt else "You are a helpful AI assistant."
print(f"Adapter loaded. System Prompt: {state.system_prompt}")
return {"status": "Adapter loaded", "adapter": state.active_adapter}
print(f"Adapter loaded. System Prompt: {state.system_prompt}, Supports Thinking: {state.adapter_supports_thinking}")
return {"status": "Adapter loaded", "adapter": state.active_adapter, "supports_thinking": state.adapter_supports_thinking}
except Exception as e:
print(f"Error loading adapter: {e}")
state.adapter_loaded = False
@@ -387,8 +390,9 @@ async def get_model_status():
"current_model": state.model_name,
"active_adapter": state.active_adapter,
"adapter_loaded": state.adapter_loaded,
# Thinking mode is disabled when an adapter is loaded (adapters aren't trained on <think> format)
"thinking_supported": not state.adapter_loaded,
"adapter_supports_thinking": state.adapter_supports_thinking,
# Thinking is supported if: no adapter loaded, OR adapter was trained with thinking
"thinking_supported": not state.adapter_loaded or state.adapter_supports_thinking,
}

@app.post("/v1/model/unload")
@@ -449,6 +453,7 @@ async def unload_adapter_handler():
state.model.eval()
state.adapter_loaded = False
state.active_adapter = None
state.adapter_supports_thinking = False
state.system_prompt = "You are a helpful AI assistant."
print("Adapter unloaded. Reverted to base model.")
return {"status": "Adapter unloaded"}
@@ -527,20 +532,53 @@ async def chat_handler(request: ChatRequest):
if state.tokenizer.chat_template:
messages = [{"role": "system", "content": state.system_prompt}]

# When an adapter is loaded, skip thinking injection — adapters are trained
# on direct prompt→response without <think> tags, so they stop after </think>.
use_thinking = request.enable_thinking and not state.adapter_loaded
# Determine thinking mode based on adapter state:
# 1. No adapter + thinking enabled → use Qwen native thinking
# 2. Adapter with thinking support → let adapter handle it (no native thinking, no prompt modification)
# 3. Adapter without thinking support → direct response only

if use_thinking:
adapter_handles_thinking = state.adapter_loaded and state.adapter_supports_thinking
use_native_thinking = request.enable_thinking and not state.adapter_loaded
use_direct_response = state.adapter_loaded and not state.adapter_supports_thinking

print(f"[DEBUG] adapter_loaded={state.adapter_loaded}, adapter_supports_thinking={state.adapter_supports_thinking}, "
f"request.enable_thinking={request.enable_thinking}, adapter_handles_thinking={adapter_handles_thinking}, "
f"use_native_thinking={use_native_thinking}, use_direct_response={use_direct_response}")

if use_native_thinking:
# No adapter: use Qwen's native thinking with prompt instructions
messages[0]["content"] += "\n\nYou MUST begin by reasoning step-by-step inside <think>...</think> tags. Do NOT speak to the user inside the tags. usage: <think>internal thought</think> final response"
# One-shot example to guide the model
messages.append({"role": "user", "content": "Hello"})
messages.append({"role": "assistant", "content": "<think>The user is greeting me. I should respond in character.</think>Greetings. I am ready to assist."})
elif adapter_handles_thinking:
# Adapter trained with thinking: let it handle naturally, no modifications needed
# The adapter learned <think>...</think>response format from training data
pass
elif use_direct_response:
# Adapter without thinking support: force direct response
messages[0]["content"] += "\n\nAnswer directly without showing your thinking process."
else:
# User disabled thinking, no adapter
messages[0]["content"] += "\n\nAnswer directly without showing your thinking process."

messages.append({"role": "user", "content": request.message})
input_ids = state.tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to(state.model.device)

# Build kwargs for apply_chat_template
chat_template_kwargs = {
"return_tensors": "pt",
"add_generation_prompt": True,
}
# Qwen3 native enable_thinking: only use when no adapter and user wants thinking
# For adapters with thinking support, set False so adapter's trained format is used
chat_template_kwargs["enable_thinking"] = use_native_thinking

try:
input_ids = state.tokenizer.apply_chat_template(messages, **chat_template_kwargs).to(state.model.device)
except TypeError:
# Tokenizer doesn't support enable_thinking param — use standard call
input_ids = state.tokenizer.apply_chat_template(
messages, return_tensors="pt", add_generation_prompt=True
).to(state.model.device)
# Explicit attention_mask (all 1s for single sequence) so the model doesn't warn when pad_token_id == eos_token_id
attention_mask = input_ids.new_ones(input_ids.shape, dtype=torch.long)
else: