Symptom
In a Remote Leaf (RL) setup running 5.2(5c), COOP appears to stall for a long period when fetching the EP records from the remote spine after a failover.
The remote leaf connection to the IPN is BGP-based. The failover consists only of bringing down an uplink and a VPN link in the IPN (WAN plus IPsec VPN) of the customer fabric.
The BGP sessions to the IPN routers and the COOP TCP connection go down and then come back up, but it can take 45 minutes or longer before EPs are seen again, and there is an outage during that time.
The spine repository is flushed after the COOP flap (and is therefore empty), but it is not repopulated from the remote leaf side immediately. The EP repo sync is triggered only when a timer of up to 1 hour expires.
Conditions
Remote leaves are connected via WAN, not directly to the IPN.
Workaround
Use one of the following:
- Clear the endpoints via the EPM CLI on the RL, which re-injects the EPs. [clear system internal epm endpoint ]
- COOP has a CLI command that triggers an on-demand repo refresh. Issue it on the RL node: "clear coop internal inconsistency"
- Flap all of the RL's uplinks (shut/no shut from the IPN router side) at the same time, which guarantees that all COOP sessions go down completely.
Further Problem Description
RCA: We confirmed that the TCP connection was still up on the RL side but had been disconnected on the oracle (spine) side. The disconnect on the oracle side triggered fast aging, which deleted all routes learned from the RL. When the connection was re-established with the RL on a new socket, the RL still considered the old connection up, so the disconnect on the old socket and the connect on the new socket happened in quick succession on the RL. As a result, no repo refresh was triggered.
Fix: From the time the connection was broken on the oracle side, the RL was also seeing ping failures. The fix uses this: if, at the time of connect, the last successful pong was received more than 150 seconds earlier, a repo refresh is triggered.
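The check described in the fix can be sketched roughly as follows. This is only an illustrative model, not the actual COOP code: the class `CitizenSession`, its methods, and the choice to treat "no pong ever received" as stale are all assumptions made for the example. Only the 150-second threshold comes from the description above.

```python
from dataclasses import dataclass
from typing import Optional

# Threshold taken from the fix description; everything else is hypothetical.
PONG_STALE_THRESHOLD_S = 150.0


@dataclass
class CitizenSession:
    """Hypothetical model of the RL side of a COOP session."""

    last_pong_at: Optional[float] = None  # time of last successful pong, in seconds
    refreshes: int = 0                    # number of repo refreshes triggered

    def on_pong(self, now: float) -> None:
        # Record the time of the most recent successful pong.
        self.last_pong_at = now

    def on_connect(self, now: float) -> bool:
        # On (re)connect, refresh the repo if the last pong is older than the
        # threshold. Treating "never received a pong" as stale is an assumption.
        stale = (
            self.last_pong_at is None
            or (now - self.last_pong_at) > PONG_STALE_THRESHOLD_S
        )
        if stale:
            self.trigger_repo_refresh()
        return stale

    def trigger_repo_refresh(self) -> None:
        # Placeholder for re-publishing all local EP records to the oracle.
        self.refreshes += 1
```

With this check, a reconnect that follows a long run of ping failures re-publishes the EPs even when the old-socket disconnect and new-socket connect occur back to back on the RL, which is the case the RCA identifies as missing a refresh.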