...
SmartConnect SSIP or network connectivity on a node can be disrupted if a link aggregation interface is configured in LACP mode and one of the member ports in the lagg interface stops participating in the LACP aggregation.
The issue occurs when a node is configured with either of these link aggregation interfaces:

    10gige-agg-1
    ext-agg-1

and one of its member ports is not participating in the lagg interface (the member marked >> below):

    lagg0: flags=8843 metric 0 mtu 1500
            options=6c07bb
            ether 00:07:43:09:3c:77
            inet6 fe80::207:43ff:fe09:3c77%lagg0 prefixlen 64 scopeid 0x8 zone 1
            inet 10.25.58.xx netmask 0xffffff00 broadcast 10.25.58.xxx zone 1
            nd6 options=21
            media: Ethernet autoselect
            status: active
            laggproto lacp lagghash l2,l3,l4
            laggport: cxgb0 flags=1c
         >> laggport: cxgb1 flags=0

Because of a bug in the network manager software (Flexnet), OneFS then internally sets the link aggregation interface to No Carrier status:

    # isi network interface list
    LNN  Name          Status      Owners                       IP Addresses
    --------------------------------------------------------------------------
    1    10gige-1      No Carrier  -                            -
    1    10gige-2      Up          -                            -
    1    10gige-agg-1  No Carrier  groupnet0.subnet10g.pool10g  10.25.58.46

Possible failures causing the issue:

    1. Failed switch port
    2. Incorrect LACP configuration at the switch port
    3. Bad cable/SFP, or other physical issue
    4. A switch connected to a port has failed or was rebooted
    5. BXE driver bug reporting a port state as not full duplex (KB 511208)

Failures 1 to 4 are external to the cluster, and the issue should go away as soon as they are fixed. Failure 5 can be a persistent failure induced by a known OneFS-BXE bug (KB 511208).
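To spot a non-participating member port like the one in the ifconfig output above, the laggport lines can be scanned for flags=0. The following is a minimal sketch, not an OneFS tool: the function name inactive_laggports is hypothetical, and it assumes the usual ifconfig layout in which each interface header starts in column 1 and each member appears on its own indented "laggport:" line.

```shell
# Hypothetical helper: reads `ifconfig` output on stdin and prints
# "<lagg interface> <member port>" for every member whose flags value is 0.
inactive_laggports() {
    awk '
        # Interface header in column 1, e.g. "lagg0: flags=8843 metric 0 mtu 1500"
        /^[A-Za-z][A-Za-z0-9_]*: / {
            iface = $1
            sub(/:$/, "", iface)
            next
        }
        # Member line, e.g. "laggport: cxgb1 flags=0"
        $1 == "laggport:" {
            flags = $3
            sub(/^flags=/, "", flags)
            sub(/<.*$/, "", flags)   # drop a trailing <...> flag list if present
            if (flags == "0") print iface, $2
        }
    '
}

# Usage on a node (not run here):
#   ifconfig | inactive_laggports
```

An empty result means every listed member reports a non-zero flags value. The sketch only reads text; it does not change any interface configuration.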
A. If the node has the lowest node ID in the pool and the SmartConnect SSIP is configured on it:
   a. If failure 1, 2, or 3 occurs, the SSIP moves to the next lowest node ID that is free of any failure.
   b. If failure 4 is present, the SSIP is not available on any node, and DU (data unavailability) is expected until the workaround is implemented, the patch is installed, or the switch is fixed or becomes available again after a reboot.
   c. If failure 5 is present:
      i. If only one port has failed, the SSIP moves to the next available lowest node ID not affected by the issue.
      ii. [DU] If all nodes in the cluster are BXE nodes and all are affected by the bug, the SSIP is not available; expect DU until the workaround or patch is applied.

B. If the link aggregation in LACP mode is configured in a subnet-pool whose defined gateway is the default route on the node:
   a. If the issue occurs while the node is running and the default route is already set, the default route remains configured and available, and connectivity for already connected clients should continue working.
   b. [DU] If the node is rebooted while any of the persistent failures is present, the default route is not available after the node comes back up, causing DU until the external issue is fixed, the workaround is applied, or the patch is installed.

If any of the failures is present during an upgrade to 8.0.0.6 or 8.1.0.2, DU is expected after the rolling reboot because of the cases described in A->c->ii or B->b. Before upgrading, verify that none of the described failures is present.
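One way to approach the pre-upgrade check is to look for aggregate interfaces already reported as No Carrier. The sketch below filters the output of isi network interface list. It assumes the column layout shown in this article (LNN, Name, Status, ...) and that aggregate interface names contain "-agg"; the function name agg_no_carrier is hypothetical. It is a rough aid and does not replace checking the member ports themselves.

```shell
# Hypothetical helper: reads `isi network interface list` output on stdin and
# prints "<LNN> <interface name>" for aggregate interfaces in No Carrier status.
agg_no_carrier() {
    awk '
        # Data rows start with a numeric LNN; the header and separator rows do not.
        $1 ~ /^[0-9]+$/ && $2 ~ /-agg/ && $3 == "No" && $4 == "Carrier" {
            print $1, $2
        }
    '
}

# Usage on the cluster (not run here):
#   isi network interface list | agg_no_carrier
```

Any line printed identifies a node and aggregate interface to investigate before starting the upgrade.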
Workaround

Use this workaround to immediately restore the link aggregation interface when only one member port is persistently down (failed switch, failed cable/SFP, BXE bug, or other persistent issue).

Step 1: Identify the failed member port on the link aggregation interface (marked >> below):

    # ifconfig
    lagg1: flags=8843 metric 0 mtu 1500
            options=507bb
            ether 00:0e:1e:58:20:70
            inet6 fe80::20e:1eff:fe58:2070%lagg1 prefixlen 64 scopeid 0x8 zone 1
            inet 172.16.240.xxx netmask 0xffff0000 broadcast 172.16.255.xxx zone 1
            nd6 options=21
            media: Ethernet autoselect
            status: active
            laggproto lacp lagghash l2,l3,l4
         >> laggport: bxe1 flags=0
            laggport: bxe0 flags=1c

Step 2: Manually remove the failed member port with the command:

    ifconfig lagg1 -laggport bxe1

The network should recover within 10-20 seconds of running the command. This change is lost after a reboot.

After the external failure on the port has been identified and fixed, and the port is available again, add the port back into the link aggregation configuration with the command:

    ifconfig lagg1 laggport bxe1
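Steps 1 and 2 can be combined into a dry-run helper that only suggests the removal command. The following is a minimal sketch, assuming the ifconfig layout shown above; the function name suggest_laggport_removal is hypothetical. It prints a command only when a lagg has exactly one member with flags=0 and at least one other member with non-zero flags, matching the "only one member port is down" condition of the workaround. It never runs ifconfig itself.

```shell
# Hypothetical helper: reads `ifconfig` output on stdin and prints (does not run)
# the removal command for each lagg with exactly one inactive member.
suggest_laggport_removal() {
    awk '
        # Interface header in column 1, e.g. "lagg1: flags=8843 metric 0 mtu 1500"
        /^[A-Za-z][A-Za-z0-9_]*: / {
            iface = $1
            sub(/:$/, "", iface)
            next
        }
        # Member line, e.g. "laggport: bxe1 flags=0"
        $1 == "laggport:" {
            flags = $3
            sub(/^flags=/, "", flags)
            sub(/<.*$/, "", flags)   # drop a trailing <...> flag list if present
            if (flags == "0") { down[iface]++; port[iface] = $2 }
            else              { up[iface]++ }
        }
        END {
            for (i in down)
                if (down[i] == 1 && up[i] >= 1)
                    printf "ifconfig %s -laggport %s\n", i, port[i]
        }
    '
}

# Usage on a node (not run here):
#   ifconfig | suggest_laggport_removal
```

Printing the command instead of executing it leaves the decision with the administrator, which matters because the removal is a manual, non-persistent change.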
A permanent fix will be included in the following OneFS maintenance releases once they become available:

    OneFS 8.0.0.7
    OneFS 8.1.0.4

A Roll-Up patch is now available for:

    8.0.0.6 (bug 226984) - patch-226984
    8.1.0.2 (bug 226323) - patch-226323