igc driver TX queue stall causing complete transmission failure on Intel I225-V #276

@paulwaldmann

Describe the bug
The igc driver for Intel I225-V NICs experiences TX queue stalls causing complete loss of transmission capability on the affected interface. The NIC stops transmitting packets, causing all VLANs on the parent interface to become unreachable. This occurs periodically after some hours/days of uptime. A full system reboot is required to restore functionality.

To Reproduce

  1. Use Intel I225-V NIC (igc0) as VLAN trunk interface with multiple VLANs
  2. Run system under normal load for several hours/days
  3. At some point, VLAN hosts become unreachable
  4. Ping attempts from firewall to VLAN hosts fail with "No buffer space available" or "Host is down"
  5. TX queue shows STALLED state:
    sysctl dev.igc.0.iflib.txq0.ring_state
    dev.igc.0.iflib.txq0.ring_state: pidx_head: 1590 pidx_tail: 1590 cidx: 1592 state: STALLED
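To catch the moment of failure rather than discovering it hours later, the ring state from step 5 could be polled and logged. The following is a minimal sketch, not a tested tool: the function names `ring_state_of` and `watch_txq`, the 10-second default interval, and the log path in the usage note are all illustrative assumptions. Only the sysctl OID and its output format come from this report. Whether brief STALLED samples also occur during normal operation is not established here, so a single hit should be read with care.

```sh
#!/bin/sh
# Extract the trailing "state:" field from a ring_state line read on stdin.
# The greedy match skips the "ring_state:" prefix and lands on the last "state: ".
ring_state_of() {
    sed -n 's/.*state: \([A-Z_]*\).*/\1/p'
}

# Poll one TX queue and print a timestamped line whenever it reports STALLED.
# $1: sysctl OID (default taken from this report)
# $2: poll interval in seconds (hypothetical default)
watch_txq() {
    oid=${1:-dev.igc.0.iflib.txq0.ring_state}
    interval=${2:-10}
    while :; do
        line=$(sysctl "$oid") || return 1
        if [ "$(printf '%s\n' "$line" | ring_state_of)" = "STALLED" ]; then
            printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$line"
        fi
        sleep "$interval"
    done
}
```

It could be run in the background as `watch_txq dev.igc.0.iflib.txq0.ring_state 10 >> /var/log/igc0-txq.log`, which would give an approximate timestamp to correlate with other system logs.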
    

Expected behavior
The igc interface should transmit packets reliably without TX queue stalls. VLAN traffic should flow normally.

Describe alternatives you considered

  1. Interface bounce (ifconfig igc0 down/up): Clears TX stall temporarily but VLANs remain non-functional
  2. PCI device reset (devctl reset pci0:2:0:0): Clears TX stall (STALLED → IDLE) but VLANs still cannot transmit
  3. VLAN interface bounce: No effect
  4. Full reboot: Only solution that restores functionality
  5. Hardware offload changes (LRO/TSO): Previously disabled, issue persisted; re-enabled with no improvement

Screenshots
N/A

Relevant log files

TX Queue Stall State:

root@glo-nofw001:~ # sysctl dev.igc.0.iflib.txq0.ring_state
dev.igc.0.iflib.txq0.ring_state: pidx_head: 1590 pidx_tail: 1590 cidx: 1592 state: STALLED

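Note: cidx (1592) is numerically ahead of pidx (1590). On a circular ring this is not invalid in itself: it is what a ring looks like once the producer has wrapped around and sits two slots behind the consumer, i.e. the ring is almost full. Combined with state STALLED, it suggests the ring filled up and is no longer being drained, which is consistent with the "No buffer space available" errors shown below.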
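As a cross-check on these indices, free space can be computed the way a typical circular ring does it, keeping one slot empty to tell a full ring from an empty one. That convention and the ring size of 2048 are assumptions made for illustration; neither was read from this system or confirmed against the iflib source. The helper name `ring_free` is likewise hypothetical.

```sh
#!/bin/sh
# Free slots in a circular ring that always keeps one slot unused.
# $1: ring size (assumed), $2: producer index (pidx_head), $3: consumer index (cidx)
ring_free() {
    size=$1
    pidx=$2
    cidx=$3
    if [ "$cidx" -eq "$pidx" ]; then
        # Indices equal: ring is empty, everything but the reserved slot is free.
        echo $((size - 1))
    elif [ "$cidx" -gt "$pidx" ]; then
        # Producer has wrapped and is catching up to the consumer from behind.
        echo $((cidx - pidx - 1))
    else
        # Producer is ahead of the consumer without having wrapped.
        echo $((size - 1 - pidx + cidx))
    fi
}

# Indices from the ring_state output in this report; 2048 is a placeholder size.
ring_free 2048 1590 1592
```

Under those assumptions the reported indices leave a single free slot. For the case where cidx is greater than pidx the result does not depend on the assumed ring size.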

Corrupt MAC Statistics (physically impossible values):

root@glo-nofw001:~ # sysctl dev.igc.0.mac_stats | grep -iE "error|drop|miss|coll"
dev.igc.0.mac_stats.mgmt_pkts_drop: 387698107785060
dev.igc.0.mac_stats.recv_length_errors: 387698107785060
dev.igc.0.mac_stats.missed_packets: 387698107785060
dev.igc.0.mac_stats.collision_count: 775396215570120
dev.igc.0.mac_stats.late_coll: 387698107785060
dev.igc.0.mac_stats.multiple_coll: 387698107785060
dev.igc.0.mac_stats.single_coll: 387698107785060
dev.igc.0.mac_stats.excess_coll: 387698107785060

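These values (387+ trillion collisions) are physically impossible on a 1Gbps link. Seven unrelated counters report the identical value, and collision_count is exactly twice that value, which points to corrupted register reads rather than real network events.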

RX buffer exhaustion counter:

root@glo-nofw001:~ # sysctl dev.igc.0.mac_stats.recv_no_buff
dev.igc.0.mac_stats.recv_no_buff: 381401685730590
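The counter values in this report share an arithmetic pattern worth checking: each one divides evenly by 0xFFFFFFFF (4294967295), the value a 32-bit register read returns when every bit is set. The sketch below tests the MAC counters above, recv_no_buff, and the interface output-error total reported further down. The arithmetic is exact. The interpretation is an inference and is not confirmed here: it would fit a driver that adds each register read to a running total while the device returns all-ones, as can happen when a PCIe device stops responding. The helper name `multiple_of_all_ones` is illustrative.

```sh
#!/bin/sh
# 0xFFFFFFFF: what a 32-bit register read yields when all bits are set.
ALL_ONES=4294967295

# Print the quotient if $1 is an exact multiple of 0xFFFFFFFF, else "no".
multiple_of_all_ones() {
    if [ $(( $1 % ALL_ONES )) -eq 0 ]; then
        echo $(( $1 / ALL_ONES ))
    else
        echo no
    fi
}

# Values copied from this report:
#   387698107785060  most mac_stats error counters
#   775396215570120  mac_stats.collision_count
#   381401685730590  mac_stats.recv_no_buff
#   800049327843420  interface "output errors"
for v in 387698107785060 775396215570120 381401685730590 800049327843420; do
    printf '%s -> %s\n' "$v" "$(multiple_of_all_ones "$v")"
done
```

All four values are exact multiples, with quotients 90268, 180536, 88802 and 186276. If the inference holds, those quotients would roughly count how many statistics updates have read all-ones since the hardware stopped answering.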

Interface shows output errors:

"output errors":"800049327843420"

Netisr workstream imbalance (CPU 7 congested):

WSID CPU   Name     Len WMark   Disp'd  HDisp'd   QDrop   Queued  Handled
   7   7   ip         0    13   815074        0        0   103382   918456
   7   7   ip6        0    22        0        0        0   327851   327851

Ping failures during issue:

root@glo-nofw001:~ # ping 10.10.113.10
PING 10.10.113.10 (10.10.113.10): 56 data bytes
ping: sendto: No buffer space available

root@glo-nofw001:~ # ping 192.168.1.2
PING 192.168.1.2 (192.168.1.2): 56 data bytes
ping: sendto: No buffer space available

tcpdump shows no traffic on igc0:

root@glo-nofw001:~ # tcpdump -i igc0 -c 10 -n
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on igc0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
^C
0 packets captured

ARP entries show incomplete for affected VLANs:

root@glo-nofw001:~ # arp -an | grep -E "10.10.113|10.10.106|10.10.100"
? (10.10.100.10) at (incomplete) on vlan0.100 expired [vlan]
? (10.10.106.10) at (incomplete) on vlan0.106 expired [vlan]
? (10.10.113.10) at (incomplete) on vlan0.113 expired [vlan]

Other igc ports on same board show clean stats (igc1/igc2/igc3):

"input errors":"0","packets transmitted":"88437","output errors":"0","collisions":"0"

Additional context

  • Issue is specific to igc0; other igc ports (igc1, igc2, igc3) on same appliance function correctly with clean statistics
  • The issue recurs periodically, suggesting a race condition or state corruption that builds up over time
  • System has 32GB RAM with 25GB free when issue occurs; mbuf statistics show no denial/exhaustion
  • PCI device reset clears the STALLED state but does not restore transmission capability, suggesting corruption beyond the TX ring (possibly DMA mappings or interrupt handlers)
  • Since if_igc is compiled into the kernel, the driver cannot be reloaded without a full reboot
  • NIC is rev 0x03 (B3 stepping) which was supposed to address earlier I225-V hardware errata
  • This is occurring on official Deciso hardware with onboard NICs

Environment

  • Hardware: Deciso DEC3862 (official OPNsense appliance)
  • Software: OPNsense 25.7.10 (amd64)
  • OS: FreeBSD 14.3-RELEASE-p5
  • NIC: Intel I225-V Ethernet Controller (igc driver)
    igc0@pci0:2:0:0: vendor=0x8086 device=0x15f3 subvendor=0x8086 subdevice=0x0000
    rev=0x03 (B3 stepping)
    
  • Interface configuration: VLAN trunk with 25+ VLANs

Labels: upstream (Third party issue)
