[[fallback_providers]] in config.toml not used as runtime fallback when primary provider goes offline #1003

### Description

When a local provider (e.g. LM Studio) is configured as [default_model] and one or more [[fallback_providers]] (e.g. Gemini, OpenRouter) are defined in config.toml, agents do not fail over to the cloud fallbacks when the local provider goes offline. Instead, they keep retrying the offline endpoint until the request fails.

### Expected Behavior

When the primary provider is offline, the agent fails over to Gemini and responds successfully.

### Actual Behavior

The agent keeps retrying LM Studio, then returns an error. Gemini is never called.

### Steps to Reproduce

1. Configure config.toml with a local LM Studio instance as [default_model] and Gemini as [[fallback_providers]]:

       [default_model]
       provider = "lmstudio"
       model = "google/gemma-4-26b-a4b"
       base_url = "http://192.168.1.1:1234/v1"

       [[fallback_providers]]
       provider = "gemini"
       model = "gemini-2.5-pro"
       api_key_env = "GEMINI_API_KEY"

2. Start the daemon and verify that agents respond normally via LM Studio.
3. Shut down LM Studio.
4. Send a message to any agent.

### OpenFang Version

0.5.5

### Operating System

Linux (x86_64)

### Logs / Screenshots

**Root Cause**

[[fallback_providers]] are added at boot time to self.default_driver, which is a FallbackDriver chain. However, resolve_driver() in kernel.rs creates a fresh primary driver for each agent call (an HTTP client whose creation always succeeds) and returns default_driver only if this fresh creation fails, which never happens for HTTP-based providers like LM Studio.

At runtime, when LM Studio is unreachable, the error is classified as Timeout / is_retryable = true, causing retries of the same provider. The fallback chain in default_driver is never reached.
                                                                                                               
Specifically, in resolve_driver() in kernel.rs:

    // This branch is never taken for HTTP providers: create_driver() always succeeds
    Err(e) => {
        if agent_provider == default_provider ... {
            Arc::clone(&self.default_driver)  // ← FallbackDriver with Gemini
        }
    }
    // ...
    Ok(primary)  // ← always returns bare LM Studio driver, no fallbacks wired in
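
To make the failure mode concrete, here is a minimal, self-contained sketch. It does not use OpenFang's actual types: the `Driver` trait, `HttpDriver`, `FallbackChain`, and `call_with_retries` are hypothetical stand-ins. It assumes an HTTP-style driver that can always be constructed and only fails when a request is sent, and it shows that retrying a bare driver never reaches the fallback, while the same retry loop over a chain does.

```rust
use std::sync::Arc;

/// Hypothetical, simplified stand-in for a provider driver.
trait Driver {
    fn complete(&self, prompt: &str) -> Result<String, String>;
}

/// Simulates an HTTP-backed provider: construction always succeeds,
/// and reachability is only discovered when a request is sent.
struct HttpDriver {
    name: &'static str,
    online: bool,
}

impl Driver for HttpDriver {
    fn complete(&self, prompt: &str) -> Result<String, String> {
        if self.online {
            Ok(format!("{}: reply to '{}'", self.name, prompt))
        } else {
            Err(format!("{}: timeout", self.name))
        }
    }
}

/// Tries each driver in order and returns the first success.
struct FallbackChain {
    drivers: Vec<Arc<dyn Driver>>,
}

impl Driver for FallbackChain {
    fn complete(&self, prompt: &str) -> Result<String, String> {
        let mut last_err = String::from("empty chain");
        for driver in &self.drivers {
            match driver.complete(prompt) {
                Ok(reply) => return Ok(reply),
                Err(e) => last_err = e,
            }
        }
        Err(last_err)
    }
}

/// Retries the same driver, as happens for a retryable error such as a timeout.
fn call_with_retries(driver: &dyn Driver, prompt: &str, attempts: u32) -> Result<String, String> {
    let mut last_err = String::from("no attempts made");
    for _ in 0..attempts {
        match driver.complete(prompt) {
            Ok(reply) => return Ok(reply),
            Err(e) => last_err = e,
        }
    }
    Err(last_err)
}
```

With an offline `lmstudio` driver, `call_with_retries` on the bare driver returns the timeout error after every attempt, which matches the reported behavior. Passing a `FallbackChain` containing the same offline driver plus an online `gemini` driver returns the Gemini reply on the first attempt.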
                                                                                                               
**Fix**

When an agent uses the default provider with no custom overrides and no per-agent [[fallback_models]], wire the global [[fallback_providers]] into the agent's driver chain at resolution time, so that any runtime failure (not just an init failure) triggers the fallback.

The following was added after the existing fallback_models block in resolve_driver():
   
    let uses_global_defaults = agent_provider == default_provider
        && !has_custom_key
        && !has_custom_url
        && !self.config.fallback_providers.is_empty()
        && !Arc::ptr_eq(&primary, &self.default_driver);
    if uses_global_defaults {
        let mut chain = vec![(primary, String::new())];
        for fb in &self.config.fallback_providers {
            if fb.provider == *agent_provider {
                continue; // skip duplicate primary
            }
            // ... create driver, push to chain
        }
        if chain.len() > 1 {
            return Ok(Arc::new(FallbackDriver::with_models(chain)));
        }
    }
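
The selection logic in the fix can be exercised in isolation. The sketch below is an illustration only: `FallbackProvider` and `plan_chain` are hypothetical names, drivers are replaced by plain (provider, model) pairs, and the `Arc::ptr_eq` check is left out because no real drivers are created here. It assumes the same conditions as the snippet above: same provider as the default, no custom key or URL, and a non-empty fallback list.

```rust
/// Hypothetical, trimmed-down view of one [[fallback_providers]] entry.
#[derive(Debug, Clone, PartialEq)]
struct FallbackProvider {
    provider: String,
    model: String,
}

/// Returns the ordered (provider, model) chain for an agent,
/// or None when the agent should keep its bare primary driver.
fn plan_chain(
    agent_provider: &str,
    default_provider: &str,
    has_custom_key: bool,
    has_custom_url: bool,
    fallbacks: &[FallbackProvider],
) -> Option<Vec<(String, String)>> {
    let uses_global_defaults = agent_provider == default_provider
        && !has_custom_key
        && !has_custom_url
        && !fallbacks.is_empty();
    if !uses_global_defaults {
        return None;
    }

    // Primary goes first; the empty model string mirrors the snippet in the issue.
    let mut chain = vec![(agent_provider.to_string(), String::new())];
    for fb in fallbacks {
        if fb.provider == agent_provider {
            continue; // skip a fallback that duplicates the primary provider
        }
        chain.push((fb.provider.clone(), fb.model.clone()));
    }

    // Wrap only if at least one real fallback survived the filter.
    if chain.len() > 1 {
        Some(chain)
    } else {
        None
    }
}
```

For a fallback list containing both `lmstudio` and `gemini`, an agent on the default `lmstudio` provider gets a two-entry chain (primary, then Gemini). If the list contains only `lmstudio`, or the agent has a custom key, the function returns `None` and no wrapping takes place.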
                                                                                                               
**Notes**

- Agents with explicit [[fallback_models]] in their manifest are unaffected, since those already create a FallbackDriver correctly.
- The duplicate-provider filter (fb.provider == *agent_provider) avoids adding a second unreachable LM Studio entry to the chain, which would otherwise happen when [[fallback_providers]] includes the same provider as [default_model] (a common configuration pattern).
- The Arc::ptr_eq guard prevents double-wrapping in the rare case where fresh driver creation failed and primary is already default_driver.
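
The `Arc::ptr_eq` guard mentioned in the last note compares allocations, not contents. As a small sketch (the helper name `already_default` is hypothetical and a `String` stands in for the driver type), two `Arc`s are considered the same driver only if one was cloned from the other:

```rust
use std::sync::Arc;

/// Returns true when `candidate` points at the very same allocation as
/// `default_driver`, i.e. it is already the fallback-wrapped driver.
fn already_default<T: ?Sized>(candidate: &Arc<T>, default_driver: &Arc<T>) -> bool {
    Arc::ptr_eq(candidate, default_driver)
}
```

A value obtained via `Arc::clone(&default_driver)` passes this check, while a freshly constructed `Arc` with equal contents does not, so only the former is skipped by the guard.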
