Symptom
A Cat9k switch or a stack of switches might experience a failure when performing the 1-step upgrade procedure via the following CLI:
install add file flash: activate commit
The failure happens during the ACTIVATE phase of the install script, and the following messages are seen:
Jan 29 16:08:44.386: %INSTALL-5-INSTALL_START_INFO: Switch 1 R0/0: install_engine: Started install one-shot flash:
Jan 29 16:20:45.137: %INSTALL-3-OPERATION_ERROR_MESSAGE: Switch 1 R0/0: install_engine: Failed to install_add_activate_commit package flash:, Error: FAILED: install_activate exit(1)
Jan 29 16:20:45.137: %INSTALL-3-OPERATION_ERROR_MESSAGE: Switch 1 R0/0: install_engine: Failed to install_add_activate_commit package flash:, Error: [3]: FAILED: Activate failed in switch
Conditions
This is a rare timing issue.
No specific conditions are known to increase the likelihood of encountering this issue.
The basic set of known conditions is:
- Cat9k switch or stack of switches
- operating in INSTALL mode
- 1-step upgrade procedure is used for IOS upgrade
++> install add file flash: activate commit
Workaround
If this issue is encountered, try recovering the installation by removing the inactive packages and starting the installer again:
install remove inactive
install add file flash: activate commit
A reload of the device is known to recover the problem, and the subsequent upgrade attempt will very likely be successful.
Further Problem Description
This is a very rare timing issue. The lock conflict occurs when the periodic.sh script invokes the cleanup.sh script and, at the exact moment the files are locked, the installer script also invokes cleanup.sh to operate on the same set of files.
In the production code the periodic.sh script is scheduled to run every 5 minutes (300 seconds), which makes it very unlikely that the two invocations align closely enough to cause this failure.
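The lock conflict described above can be illustrated with a minimal sketch. This is not the actual content of periodic.sh, cleanup.sh, or the installer script; the names LOCK_DIR, run_cleanup, and release_lock are hypothetical, and a directory-based lock is assumed purely for illustration. It only shows why a second cleanup call fails when it overlaps with one already in progress, and succeeds when it does not.

```shell
#!/bin/sh
# Hypothetical illustration only: NOT the actual Cisco scripts.
# LOCK_DIR, run_cleanup and release_lock are invented for this sketch.

LOCK_DIR="$(mktemp -d)/cleanup.lock"

# Try to take the lock. mkdir is atomic, so only one caller can succeed
# while the lock directory exists.
run_cleanup() {
    caller="$1"
    if mkdir "$LOCK_DIR" 2>/dev/null; then
        echo "$caller: lock acquired, cleaning up"
        return 0
    else
        echo "$caller: FAILED, files are locked by another cleanup"
        return 1
    fi
}

# Release the lock once the cleanup has finished.
release_lock() {
    rmdir "$LOCK_DIR"
}

# The periodic job starts its cleanup and still holds the lock...
run_cleanup "periodic"

# ...when the installer calls cleanup over the same files: conflict.
if run_cleanup "installer"; then
    installer_first_try=0
else
    installer_first_try=1
fi

# Once the periodic run has finished, a new attempt no longer overlaps.
release_lock
if run_cleanup "installer"; then
    installer_retry=0
else
    installer_retry=1
fi
release_lock
```

In this sketch the first installer attempt fails only because it overlaps with the periodic run; the same call succeeds once the lock has been released, which is consistent with a later upgrade attempt being very likely to succeed.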
Starting with 16.12.5 and later 16.12.x releases, as well as in all 17.x releases, the bash scripts have been modified so that this timing issue can no longer occur.
Therefore, when upgrading from 16.12.5 or 17.x to later releases, there is no longer even a theoretical chance of this issue occurring.