[BUG]: [Documentation Bug] Missing observation_template parameter in math_multiturn.md causes 0% completion rate #26

Description

@tinycrown

Description of the bug

Summary

Following the official math_multiturn.md documentation results in 0% completion rate and zero rewards. The documentation is missing a critical parameter: observation_template: '{observation}'.

What Should Happen

Training should achieve ~70% mean score and 100% completion rate (as shown in official training logs from 2025-12-20).

What Actually Happens

  • Completion rate: 0% (all 128 samples incomplete)
  • Mean reward: 0.0 (no rewards computed)
  • Turns per episode: 11.0 (hits maximum limit)
  • Response length: 5248 tokens (extremely long, model rambling)

Root Cause

The documentation shows:

```yaml
interaction:
  config:
    env_port: <env_port>
    env_host: <client_endpoint>
```

But the working configuration requires:

```yaml
interaction:
  config:
    env_endpoint: http://<host>:<port>
    observation_template: '{observation}'  # ⚠️ MISSING FROM DOCS
```

The `observation_template` parameter is not mentioned anywhere in the documentation (verified by reviewing the complete docs).
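
To illustrate why a missing template could matter, here is a minimal sketch of how an interaction layer might turn raw environment feedback into the next message for the model. The helper `render_observation` and its fallback behaviour are hypothetical and not taken from the OpenTinker source; only the template string `'{observation}'` comes from the working configuration.

```python
def render_observation(template, observation):
    """Hypothetical helper: fill the environment's feedback into a template.

    Assumption: the template uses str.format-style placeholders, as the
    value '{observation}' from the working configuration suggests.
    """
    if not template:
        # Assumed failure mode for this sketch only: without a template,
        # no feedback text is produced for the model.
        return ""
    return template.format(observation=observation)


feedback = "stdout: 42"
print(render_observation("{observation}", feedback))  # feedback passed through unchanged
print(repr(render_observation(None, feedback)))       # empty string under this sketch's assumption
```

If the real implementation behaves anything like this, a model that never sees the environment's feedback would keep generating until the turn limit, which would be consistent with the metrics reported above.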

Impact
This blocks all users following the official guide from successfully training math agents.
  

### Steps To Reproduce

1. Follow `docs/math_multiturn.md` exactly as written.
2. Start the math environment server:
   ```bash
   python opentinker/environment/math/math_tool_server.py --port 5000
   ```
3. Start training with the documented configuration:
   ```bash
   python opentinker/client/math_tool_rl.py \
       tokenizer_path=Qwen/Qwen2.5-3B-Instruct \
       batch_size=16 \
       interaction.config.env_port=5000 \
       interaction.config.env_host=<your_host>
   ```
4. Observe validation results showing:
   ```
   val/completion_rate: 0.0
   val/mean_reward: 0.0
   val/incomplete_samples: 64/64 (100%)
   ```

Expected: ~70% mean score, 100% completion rate

Actual: 0% completion rate, zero rewards

## Proposed Fix

Update `docs/math_multiturn.md` to include:

```yaml
interaction:
  config:
    env_endpoint: http://<host>:<port>
    job_id: <job_id>
    max_steps: 5
    observation_template: '{observation}'  # Add this line with explanation
```

Also add a note explaining that this parameter is required for the model to correctly parse environment feedback.
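
Until the docs are updated, a small pre-flight check can catch this before a long training run. The sketch below is an assumption-laden example: `check_interaction_config` and `REQUIRED_KEYS` are hypothetical names, and the set of required keys is inferred only from the working configuration in this issue, not from the OpenTinker code.

```python
# Keys that the working configuration in this issue contains (assumed required).
REQUIRED_KEYS = ("env_endpoint", "observation_template")


def check_interaction_config(config):
    """Return a list of problems found in an interaction `config` mapping."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in config]
    template = config.get("observation_template")
    if template is not None and "{observation}" not in template:
        problems.append("observation_template lacks the '{observation}' placeholder")
    return problems


# The configuration as documented versus the one from the official logs.
documented = {"env_port": 5000, "env_host": "<client_endpoint>"}
working = {
    "env_endpoint": "http://<host>:<port>",
    "job_id": "<job_id>",
    "max_steps": 5,
    "observation_template": "{observation}",
}

print(check_interaction_config(documented))  # reports both missing keys
print(check_interaction_config(working))     # reports no problems
```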

### Additional Information

## Evidence: Documentation vs. Working Configuration

### Metrics Comparison

| Metric | Following Docs | Official Logs | Status |
|--------|---------------|---------------|--------|
| `completion_rate` | 0.0% | 100% | ❌ |
| `mean_reward` | 0.0 | 0.617 | ❌ |
| `num_turns/mean` | 11.0 | 3.375 | ❌ 3x more |
| `response_length/mean` | 5248 | 594 | ❌ 8.8x longer |
| `incomplete_samples` | 100% | 0% | ❌ |
| `mean_game_steps` | 0.0 | 1.19 | ❌ |
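
The "3x more" and "8.8x longer" figures in the table can be re-derived from the step-1 values in the two training logs. A short check, using only numbers quoted in this issue:

```python
# Values copied from the step-1 lines of the two training logs in this issue.
broken = {"num_turns/mean": 11.0, "response_length/mean": 5248.328125}
official = {"num_turns/mean": 3.375, "response_length/mean": 594.03125}

# Ratio of the broken run to the official run for each metric.
ratios = {name: broken[name] / official[name] for name in broken}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

The ratios come out at roughly 3.3x for turns and 8.8x for response length, matching the table.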

---

### Our Training Log (Following Documentation - NOT Working)

<details>
<summary>Click to expand full log</summary>

```
[2026-01-19 22:32:20,211][utils.http_training_client][INFO] - Initialized tracking with backends: ['console', 'wandb']
[ServiceClient] Passing multi_turn config to server: {'max_user_turns': 5, 'max_assistant_turns': 5, 'max_tokens_per_turn': 1024, 'weave_project': None, 'experiment_name': 'math_code_interpreter'}
[2026-01-19 22:32:20,213][utils.http_training_client][INFO] - Setting generation config: {'temperature': 1, 'top_p': 1, 'max_new_tokens': 8192}

Training configuration:

Algorithm: agent_loop
Epochs: 5
Batch size: 16
Max turns: 5
[2026-01-19 22:32:20,246][utils.http_training_client][INFO] - Training for 5 epochs (2343 steps per epoch, 11715 total steps)
[2026-01-19 22:34:37,354][utils.http_training_client][INFO] - Workers initialized successfully
[2026-01-19 22:34:37,355][utils.http_training_client][INFO] - Running validation before training...

[2026-01-19 22:39:55,982][utils.http_training_client][INFO] - Validation game stats: win_rate=0.00%
[2026-01-19 22:39:55,983][utils.http_training_client][INFO] - Pre-training validation: { 'val/mean_score': 0.0, 'val/std_score': 0.0, 'val/max_score': 0.0, 'val/min_score': 0.0, 'val_game/total_samples': 64, 'val_game/games_in_step': 0, 'val_game/incomplete_samples': 64, 'val_game/completion_rate': 0.0, 'val_game/mean_final_reward': 0.0, 'val_game/mean_sum_reward': 0.0, 'val_game/mean_sum_reward_all': 0.0, 'val_game/mean_avg_reward': 0.0, 'val_game/mean_game_steps': 0.0, 'val_game/mean_reward': 0.0, 'val_game/total_interactions': 320 }

step:1 - global_seqlen/mean:5245950976.0 - actor/entropy:11.931066513061523 - training_step_reward:0.0 - actor/kl_loss:0.0 - actor/pg_loss:0.0 - critic/score/mean:0.0 - critic/score/max:0.0 - critic/score/min:0.0 - critic/rewards/mean:0.0 - critic/rewards/max:0.0 - critic/rewards/min:0.0 - response_length/mean:5248.328125 - response_length/max:5290.0 - response_length/min:4354.0 - num_turns/min:11.0 - num_turns/max:11.0 - num_turns/mean:11.0 - timing_s/agent_loop/generate_sequences/mean:419.48593199058087 - timing_s/agent_loop/tool_calls/mean:0.0 - game/total_samples:128 - game/games_in_step:0 - game/incomplete_samples:128 - game/completion_rate:0.0 - game/mean_final_reward:0.0 - game/mean_sum_reward:0.0 - game/mean_avg_reward:0.0 - game/mean_game_steps:0.0 - game/mean_reward:0.0 - game/total_interactions:640
```


**Key Issues:**
- ❌ `completion_rate: 0.0` - No tasks completed
- ❌ `incomplete_samples: 128/128` - All samples failed
- ❌ `mean_reward: 0.0` - No rewards computed
- ❌ `num_turns/mean: 11.0` - Hit maximum turn limit
- ❌ `response_length/mean: 5248` - Extremely long responses
- ❌ `tool_calls/mean: 0.0` - No tool usage
- ❌ `mean_game_steps: 0.0` - Environment not progressing

</details>

---

### Official Training Log (With `observation_template` - Working)

<details>
<summary>Click to expand full log</summary>

```
[2025-12-20 22:28:36,413][http_training_client][INFO] - Initialized tracking with backends: ['console', 'wandb']
[ServiceClient] Passing multi_turn config to server: {'max_user_turns': 5, 'max_assistant_turns': 5, 'max_tokens_per_turn': 1024, 'weave_project': None, 'experiment_name': 'math_code_interpreter'}
[2025-12-20 22:28:36,417][http_training_client][INFO] - Setting generation config: {'temperature': 1, 'top_p': 1, 'max_new_tokens': 8192}

Environment config: {'actor_rollout_ref': {'rollout': {'multi_turn': {'interaction_config_path': '/tmp/math_code_interpreter_interaction_config_tkywyksc.yaml', 'interaction_config_content': "interaction:\n- class_name: opentinker.environment.gym_environment_interaction.GymEnvironmentInteraction\n config:\n env_endpoint: http://172.22.224.251/:8088\n job_id: 0d46716b\n max_steps: 5\n observation_template: '{observation}'\n name: math_code_interpreter\n"}}}}

Training configuration:

Algorithm: agent_loop
Epochs: 1
Batch size: 16
Max turns: 5
[2025-12-20 22:28:36,474][http_training_client][INFO] - Training for 1 epochs (468 steps per epoch, 468 total steps)
[2025-12-20 22:30:39,653][http_training_client][INFO] - Workers initialized successfully
[2025-12-20 22:30:39,654][http_training_client][INFO] - Running validation before training...

[2025-12-20 22:31:11,794][http_training_client][INFO] - Validation game stats: win_rate=0.00%
[2025-12-20 22:31:11,795][http_training_client][INFO] - Pre-training validation: { 'val/mean_score': 0.7, 'val/std_score': 0.45825756949558394, 'val/max_score': 1.0, 'val/min_score': 0.0, 'val_game/total_samples': 104, 'val_game/games_in_step': 104, 'val_game/incomplete_samples': 0, 'val_game/completion_rate': 1.0, 'val_game/mean_final_reward': 0.7019230769230769, 'val_game/mean_sum_reward': 0.7019230769230769, 'val_game/mean_sum_reward_all': 0.7019230769230769, 'val_game/mean_avg_reward': 0.6538461538461539, 'val_game/mean_game_steps': 1.1442307692307692, 'val_game/mean_reward': 0.7019230769230769, 'val_game/total_interactions': 119 }

step:1 - global_seqlen/mean:664253632.0 - actor/entropy:0.29363444447517395 - training_step_reward:0.6171875 - actor/pg_loss:0.02784726768732071 - critic/score/mean:0.6171875 - critic/score/max:1.0 - critic/score/min:0.0 - critic/rewards/mean:0.6171875 - critic/rewards/max:1.0 - critic/rewards/min:0.0 - response_length/mean:594.03125 - response_length/max:1944.0 - response_length/min:179.0 - num_turns/min:3.0 - num_turns/max:7.0 - num_turns/mean:3.375 - timing_s/agent_loop/generate_sequences/mean:7.187697933011805 - timing_s/agent_loop/tool_calls/mean:0.0 - game/total_samples:128 - game/games_in_step:128 - game/incomplete_samples:0 - game/completion_rate:1.0 - game/mean_final_reward:0.6171875 - game/mean_sum_reward:0.6171875 - game/mean_avg_reward:0.5807291666666666 - game/mean_game_steps:1.1875 - game/mean_reward:0.6171875 - game/total_interactions:152
```

**Key Success Indicators:**
- ✅ `completion_rate: 1.0` - All tasks completed
- ✅ `incomplete_samples: 0/128` - No failures
- ✅ `mean_reward: 0.617` - Rewards computed correctly
- ✅ `num_turns/mean: 3.375` - Efficient completion
- ✅ `response_length/mean: 594` - Concise responses
- ✅ `mean_game_steps: 1.19` - Environment working

</details>
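
For anyone comparing their own run against the two logs above, the `step:N - key:value - ...` lines are easy to parse. The following is a minimal sketch: `parse_step_line` and `looks_stalled` are hypothetical helpers written for this issue, and the "stalled" heuristic is simply the symptom reported here (no completed games and zero reward), not an official diagnostic.

```python
def parse_step_line(line):
    """Parse 'step:1 - key:value - key:value ...' into (step, metrics)."""
    metrics = {}
    for field in line.strip().split(" - "):
        # Split on the last colon so the metric name is kept whole.
        key, _, value = field.rpartition(":")
        metrics[key] = float(value)
    step = int(metrics.pop("step"))
    return step, metrics


def looks_stalled(metrics):
    """Heuristic from this issue: no completed games and no reward."""
    return (
        metrics.get("game/completion_rate") == 0.0
        and metrics.get("training_step_reward") == 0.0
    )


# Shortened copies of the two step-1 lines quoted in the logs above.
broken_line = "step:1 - training_step_reward:0.0 - num_turns/mean:11.0 - game/completion_rate:0.0"
official_line = "step:1 - training_step_reward:0.6171875 - num_turns/mean:3.375 - game/completion_rate:1.0"

for line in (broken_line, official_line):
    step, metrics = parse_step_line(line)
    print(step, looks_stalled(metrics))
```

The broken run is flagged as stalled and the official run is not.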

---

## Configuration Difference

Apart from addressing the server through `env_endpoint` instead of `env_host`/`env_port` (see Root Cause), the **only** difference between our broken setup and the working setup is the presence of:

```yaml
observation_template: '{observation}'
```

This parameter is not mentioned anywhere in the official documentation.

## Environment

- **Repository**: https://github.com/open-tinker/OpenTinker (latest main branch, as of 2026-01-20)
- **Model**: Qwen/Qwen2.5-3B-Instruct
- **GPU**: 4x NVIDIA H800 (80GB each)
- **CUDA**: 12.8.1
- **Python**: 3.12
- **Documentation reviewed**: Complete `math_multiturn.md` and `README.md` (verified via browser inspection)
