Okay, I know what happened:
A few days ago we found an issue that affects some nodes during boot. It was hard to debug, but we found that there is a package on the hub that links to a very old version of 0-fs,
and that package sometimes overrides the latest version of 0-fs,
which then causes the boot to fail.
What we did was delete this package from the repo. Nodes that boot now will not see this package at all and will work with the latest package. Great!
The problem is with nodes that are already up and running. These will keep running perfectly fine with no issue until a new zos version is available:
While installing the update, the nodes will see that they have a local package that no longer exists on the remote repository, and they are programmed to delete such packages.
The problem is that the old 0-fs package had copies of some libraries that actually shouldn’t be there, including the librt.so file that shows up in the error above. Since zos does not have a package manager and relies on flist content to manage packages, the nodes simply deleted this file, not knowing that it’s also required by other binaries on the system.
System broken!
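
To make the failure mode concrete, here is a minimal sketch in Python. It is not the actual zos code: the function `naive_sync`, the package names, and the file paths are all made up for illustration. It only assumes what is described above, that a package is just a list of files and that a package missing from the remote repo gets all of its files deleted.

```python
# Hypothetical model of the failure; names and paths are illustrative only.

def naive_sync(local_packages, remote_names, files_on_disk):
    """Drop every local package that is gone from the remote repo and
    delete all files it listed, without checking who else needs them."""
    remaining = set(files_on_disk)
    for name, paths in local_packages.items():
        if name not in remote_names:
            remaining -= paths
    return remaining

local_packages = {
    # The old package also carried a stray copy of a shared library.
    "0-fs-old": {"/old/bin/0-fs", "/lib/librt.so"},
    # This binary needs librt.so at runtime but does not list it.
    "other-tool": {"/bin/other-tool"},
}
remote_names = {"other-tool"}  # "0-fs-old" was removed from the repo
files_on_disk = {"/old/bin/0-fs", "/lib/librt.so", "/bin/other-tool"}

after_update = naive_sync(local_packages, remote_names, files_on_disk)
print(sorted(after_update))  # ['/bin/other-tool']
```

The binary survives the update but the library it depends on does not, which is the kind of breakage the affected nodes hit.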
- We interrupted the system update by forcing the version back to v3.7.2, so nodes that haven’t updated yet don’t see a change.
- Nodes that were booted after the package was removed from the repo won’t be affected at all, since they never heard of that package. Even if they got the update they should be fine.
- For nodes that are (luckily) still running, we will need to make sure that the update does not remove those files from the system.
Note that this problem would have hit mainnet with any update (not necessarily the patch). It would have appeared with the next release and in that case would have caused a complete blackout, so we are still lucky we caught and stopped this now.
The unlucky nodes that got affected by this need to be manually rebooted. This guarantees that they are immune to this issue, since after a reboot they no longer see the removed package.
We will work on a fix to be included (in a patch or the next release) to make sure that packages gone from the hub repository DO NOT cause files to be deleted from the system.
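
One possible shape for such a fix, sketched in Python under the same simplifying assumption that a package is just a set of file paths. This is not the planned zos implementation, and `forget_gone_packages` is a hypothetical name. The idea is to forget the record of a package that is gone from the repo while leaving its files untouched, and to report those files so they can be logged.

```python
# Hypothetical sketch of a conservative cleanup; not the real zos fix.

def forget_gone_packages(local_packages, remote_names):
    """Split local packages into those still on the remote repo and the
    file paths left behind by packages that are gone. Nothing is deleted."""
    kept = {}
    orphaned_paths = set()
    for name, paths in local_packages.items():
        if name in remote_names:
            kept[name] = paths
        else:
            # Forget the package record, but keep its files on disk.
            orphaned_paths |= paths
    return kept, orphaned_paths

installed = {
    "0-fs-old": {"/old/bin/0-fs", "/lib/librt.so"},
    "other-tool": {"/bin/other-tool"},
}
disk = {"/old/bin/0-fs", "/lib/librt.so", "/bin/other-tool"}

kept, orphaned_paths = forget_gone_packages(installed, {"other-tool"})
print(sorted(kept))            # ['other-tool']
print(sorted(orphaned_paths))  # ['/lib/librt.so', '/old/bin/0-fs']
```

The trade-off is that stale files stay on the node until the next reboot, which is much cheaper than deleting a library that other binaries still load.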