...
Unexpected device reload when device-tracking feature (SISF) is enabled and left running for a long time. When the crash occurs the below messages are generated:

Exception to IOS Thread:
Frame pointer 0x4DDE51E0, PC = 0x69692C50
UNIX-EXT-SIGNAL: Segmentation fault(11), Process = SISF Main Thread

The crash is caused by a memory leak with the below symptoms:

1. In the output of "show process memory platform sorted" the RSS counter for linux_iosd-imag process is continuously increasing.

Switch#show process memory platform sorted
System memory: 7764904K total, 3943700K used, 3821204K free, Lowest: 3820208K
   Pid    Text      Data   Stack   Dynamic       RSS   Name
----------------------------------------------------------------------
 10260  227197   2000240     136       364   2000240   linux_iosd-imag

AND

2. In the output of "show process memory platform detailed name iosd smaps | begin IOS_PRIV_OPER_DB" the Size and Rss counters are rapidly increasing:

Switch#show proc mem plat detailed name iosd smaps | begin IOS_PRIV_OPER_DB
8c80000000-8ccf5a1000 rw-s 00000000 00:28 88415 /tmp/rp/tdldb/0/IOS_PRIV_OPER_DB
Size: 1300100 kB
Rss: 1295284 kB
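The second symptom can be tracked over time by saving the smaps output at intervals and comparing the counters. The following is a minimal Python sketch of that idea, not a supported tool: the function names parse_oper_db_usage and growth_kb_per_hour are hypothetical, and it assumes the CLI output has already been collected as plain text by some other means.

```python
import re


def parse_oper_db_usage(output, marker="IOS_PRIV_OPER_DB"):
    """Return (size_kb, rss_kb) for the mapping whose path ends in marker.

    Returns None if the mapping or either counter is not found.
    """
    # Match the mapped file path (starts with "/"), not the echoed
    # "| begin IOS_PRIV_OPER_DB" part of the command line.
    mapping = re.search(r"/\S*" + re.escape(marker) + r"\b", output)
    if mapping is None:
        return None
    tail = output[mapping.end():]
    # \b keeps "KernelPageSize:" and similar fields from matching "Size:".
    size = re.search(r"\bSize:\s+(\d+)\s*kB", tail)
    rss = re.search(r"\bRss:\s+(\d+)\s*kB", tail)
    if size is None or rss is None:
        return None
    return int(size.group(1)), int(rss.group(1))


def growth_kb_per_hour(samples):
    """Average Rss growth in kB/hour between the first and last sample.

    samples is a list of (unix_time_seconds, rss_kb) tuples in time order.
    """
    (t0, rss0), (t1, rss1) = samples[0], samples[-1]
    hours = (t1 - t0) / 3600.0
    if hours <= 0:
        raise ValueError("samples must span a positive time interval")
    return (rss1 - rss0) / hours


if __name__ == "__main__":
    captured = (
        "Switch#show proc mem plat detailed name iosd smaps | begin IOS_PRIV_OPER_DB\n"
        "8c80000000-8ccf5a1000 rw-s 00000000 00:28 88415 /tmp/rp/tdldb/0/IOS_PRIV_OPER_DB\n"
        "Size: 1300100 kB\n"
        "Rss: 1295284 kB\n"
    )
    print(parse_oper_db_usage(captured))
```

A steadily positive growth rate across many samples is consistent with the leak described here; what rate counts as "rapid" is left to the operator, since the source does not give a threshold.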
This problem may happen over time if device-tracking MAC entries are repeatedly deleted and re-created.
A periodic reboot of the affected device will reset the memory usage. The "device-tracking tdl-disable" service-internal command may be used to disable the SISF Crimson DB entirely and avoid this issue; however, this means that no SISF data will be available via the Crimson DB to any consumer (e.g. DNAC). This command requires service-internal to remain enabled in order to persist through reloads. This workaround prevents further memory leakage due to this issue, but does not recover memory that has already been leaked; only a reboot does that. The command only disables notifications (of discovered MAC and IP bindings) to DNAC, so it can also be applied on switches with no DNAC implementation whatsoever. With TDL disabled, there is no functional change on the switch itself. An SMU hot patch is available for download for the C9300, C9400, C9500 and C9600 platforms.
When a Crimson DB entry is created, some elements like calendar_time are automatically allocated by the TDL library. The SISF code was also allocating that element without checking whether it already existed, leading to a leak. Over the course of time, after millions of creation / deletion cycles of device-tracking MAC entries, the switch runs out of Crimson memory. This issue was introduced in a new feature in 17.2.1 and 16.12.3. It exists in all releases from there until the releases where it has been fixed; it is fixed in 16.12.5, 17.3.2 and 17.4.1 (note that it is not fixed in 17.2.x). It is in PI code and may therefore be observed on any platform that supports both Crimson and SISF.
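The release ranges above can be turned into a quick check. The following Python sketch encodes only the versions stated in this note; the function name is_affected is hypothetical, and it assumes that release trains not mentioned here (for example 17.1.x, or 16.12.x before 16.12.3) do not contain the defect and that everything from 17.4.1 onward carries the fix.

```python
import re


def is_affected(version):
    """Return True if an IOS XE version string falls in the affected range.

    Encodes: introduced in 16.12.3 and 17.2.1; fixed in 16.12.5, 17.3.2
    and 17.4.1; never fixed in 17.2.x. Trains not named in the note are
    assumed unaffected (an assumption, not a statement from the source).
    """
    # Accept suffixed rebuilds such as "17.3.2a" by reading only the
    # leading three numeric fields.
    m = re.match(r"(\d+)\.(\d+)\.(\d+)", version.strip())
    if m is None:
        raise ValueError("unrecognized version string: %r" % version)
    v = tuple(int(part) for part in m.groups())
    train = v[:2]
    if train == (16, 12):
        return (16, 12, 3) <= v < (16, 12, 5)
    if train == (17, 2):
        return v >= (17, 2, 1)  # no fixed release in this train
    if train == (17, 3):
        return v < (17, 3, 2)
    return False


if __name__ == "__main__":
    for ver in ("16.12.4", "16.12.5", "17.2.3", "17.3.1", "17.3.2a", "17.4.1"):
        print(ver, is_affected(ver))
```

A device reported as unaffected by this check may still have the SMU hot patch or a different defect; the check reflects the version ranges only.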