Symptom
A Nexus 9k switch running NX-OS 9.3(x) or later may experience a kernel panic. The output of "show logging onboard stack" shows an internal interface such as "veth0-1" going down, after which the supervisor engine becomes unresponsive and stops replying to EOBC heartbeat packets from all slots in the chassis. For example:
SWITCH# show logging onboard stack
----------------------------
Module: 27
----------------------------
Switch OBFL Log: Enabled
**************************************************************
STACK TRACE GENERATED AT Fri Jan 1 12:00:00 2021 BST
**************************************************************
...
[31961149.512521] device veth0-1 left promiscuous mode
...
[31961150.172610] EOBC Heartbeat missed 16 from slot:27. BT101 R69 UR117 bmp 0x3ee0000f/3ee0000f
[31961150.172614] EOBC Heartbeat missed 16 from slot:28. BT101 R69 UR110 bmp 0x3ee0000f/3ee0000f
[31961150.172618] EOBC Heartbeat missed 16 from slot:29. BT101 R69 UR243 bmp 0x3ee0000f/3ee0000f
[31961150.212549] EPC Heartbeat down from slot:21
[31961150.212563] EPC Heartbeat down from slot:22
[31961150.212567] EPC Heartbeat down from slot:23
Conditions
- Nexus 9k (both multi-slot EOR and compact TOR switches impacted)
- Running NX-OS 9.3(x)
- "show logging onboard stack" shows signs of an internal interface going down, e.g. "device veth0-1 left promiscuous mode" and/or "tahoe0 down"
- No other specific conditions known at this time
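
As a rough aid for checking saved "show logging onboard stack" output against the conditions above, the following Python sketch scans the text for the messages quoted in this report. It is only an illustration: the function names (scan_obfl_stack, matches_signature) are hypothetical, the patterns are taken from the sample output above, and the exact message wording may differ between releases.

```python
import re

# Patterns are taken from the sample output and conditions in this report.
# Assumption: the message wording is the same on the release being checked.
SIGNATURES = {
    "internal_interface_down": re.compile(
        r"device (\S+) left promiscuous mode|\b(tahoe0) down"
    ),
    "eobc_heartbeat_missed": re.compile(
        r"EOBC Heartbeat missed \d+ from slot:(\d+)"
    ),
    "epc_heartbeat_down": re.compile(
        r"EPC Heartbeat down from slot:(\d+)"
    ),
}


def scan_obfl_stack(text):
    """Return, per signature, the interface names or slot numbers found."""
    hits = {name: [] for name in SIGNATURES}
    for line in text.splitlines():
        for name, pattern in SIGNATURES.items():
            match = pattern.search(line)
            if match:
                # Exactly one capture group is populated per match.
                value = next(g for g in match.groups() if g is not None)
                hits[name].append(value)
    return hits


def matches_signature(hits):
    """True if an internal interface went down (the condition listed above)."""
    return bool(hits["internal_interface_down"])


if __name__ == "__main__":
    sample = (
        "[31961149.512521] device veth0-1 left promiscuous mode\n"
        "[31961150.172610] EOBC Heartbeat missed 16 from slot:27. BT101 R69\n"
        "[31961150.212549] EPC Heartbeat down from slot:21\n"
    )
    result = scan_obfl_stack(sample)
    print(result)
    print("Possible match:", matches_signature(result))
```

A positive result only indicates that the log resembles the signature described here; it does not confirm the root cause.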
Further Problem Description
Logging enhancements have been introduced in NX-OS images via CSCvz65993 to give TAC and the BU more information when this issue occurs. The issue is believed to be caused by deletion of the default namespace / default VRF, which can take several internal interfaces offline. The additional logging is expected to help determine which of the following scenarios applies:
1. A process / KLM triggered the namespace deletion, in which case the process / KLM information is captured.
2. The deletion is the result of a bug in the Linux kernel.