Some nodes went down because of an upgrade bug

Mainnet was receiving an important security patch when an unexpected upgrade bug put some of the nodes into a bad state.

Unfortunately, affected nodes will not be able to come back unless they are manually rebooted. Affected nodes will show something like this in the logs:

[+] storaged: 2023-08-22T12:27:12Z fatal exiting error="failed to initialize storage module: failed to scan devices: stderr: lsblk: error while loading shared libraries: librt.so.1: cannot open shared object file: No such file or directory\n: exit status 127"

If your node is showing this, then you will have to manually reboot it to get it back online. This can’t be done remotely.

We managed to stop the update and (hopefully) save most of the grid. Currently fewer than 200 nodes are affected.

We are still investigating what caused this. There is still no clear clue, since the patch only targeted a completely different service than the one that is failing.

I will keep you updated when we find out more.

Updates do not interrupt running workloads, but an affected node can’t process new requests. Rebooting the nodes should get them back into the correct state.

Okay, I know what happened:

A few days ago we found an issue that affects some nodes during boot. It was hard to debug, but we found that there was a package on the hub that linked to a very old version of 0-fs, and that package sometimes overrode the latest version of 0-fs, which then caused the boot to fail.

What we did was delete this package from the repo. Booting nodes will no longer see this package at all and will work with the latest package. Great!

The problem is with nodes that are already up and running. These will keep running perfectly fine with no issue until there is a new zos version available:

While installing the update, the nodes will see that they have a local package that does not exist on the remote repository anymore! They are programmed to delete those packages.

The problem is that the old 0-fs package had copies of some libraries that actually shouldn’t be there! It included the librt.so.1 file that shows up in the error above. Since zos does not have a package manager and relies on flist content to manage packages, the nodes simply deleted this file, not knowing that it’s also required by other binaries on the system.

System broken!
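
To make this concrete, here is a rough sketch (not the actual zos code; all names are made up) of the kind of flist-based cleanup that bites here: files shipped by a package that has disappeared from the hub get removed, with no check for whether anything else on the system still needs them.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// Package is a hypothetical stand-in for an flist-backed package:
// just a name and the list of files it extracted onto the system.
type Package struct {
	Name  string
	Files []string
}

// syncPackages mimics the problematic update logic: any locally
// installed package that no longer exists on the remote hub gets
// its files deleted, without checking whether other packages or
// the base system still depend on those files.
func syncPackages(local []Package, remote map[string]bool, root string) {
	for _, pkg := range local {
		if remote[pkg.Name] {
			continue // still published on the hub, keep it
		}
		// Package disappeared from the hub: delete everything it shipped.
		// This is exactly how a shared library like librt.so.1 can vanish.
		for _, f := range pkg.Files {
			_ = os.Remove(filepath.Join(root, f))
			fmt.Println("removed", f)
		}
	}
}

func main() {
	local := []Package{
		{Name: "0-fs-old", Files: []string{"lib/librt.so.1", "bin/0-fs"}},
	}
	remote := map[string]bool{} // the old package was deleted from the hub
	syncPackages(local, remote, "/tmp/demo-root")
}
```

Nothing in that loop knows that librt.so.1 is also needed by tools like lsblk, which is exactly the gap.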

  • We interrupted the system update by forcing the version back to v3.7.2, so nodes that haven’t updated yet don’t see a change.
  • Nodes that booted after the package was already gone from the repo won’t be affected at all, since they never knew about that package, so even if they get the update they should be fine.
  • For nodes that are (luckily) still running, we will need to make sure that the update does not remove those files from the system.

Note that this problem would have hit mainnet with any update (not necessarily this patch). It would have appeared with the next release, and in that case it would have caused a complete blackout, so we are still lucky we caught and stopped this now.

The unlucky nodes that got affected by this need to be manually rebooted, and this guarantees that they become immune to this issue.

We will work on a fix to be included (in a patch or the next release) to make sure that packages removed from the hub repository DO NOT delete files from the system.

Update on the status of the issue:

  • We fixed the code issue that caused the deletion of the library files.
  • We also added a dummy flist that matches the name of the old package that started all this, but it doesn’t contain any files. In that case the system will not delete any files from that flist; it will only try to extract what is in the flist. Since the flist is empty, no extra files will be extracted, but most importantly, no files are deleted either.

The combination of both fixes should make the next release safe to apply.
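
As a rough illustration of the behavior after both fixes (again just a sketch with made-up names, not the actual patch): a package that is gone from the hub, or that resolves to an empty flist like the dummy one, no longer triggers any deletion.

```go
package main

import "fmt"

// Package is the same hypothetical stand-in as in the earlier sketch:
// a name plus the files the package ships.
type Package struct {
	Name  string
	Files []string
}

// guardedSync sketches the fixed behavior: a package that is gone from
// the hub, or that resolves to an empty flist, never triggers deletions;
// the update only adds or refreshes files.
func guardedSync(local []Package, remote map[string][]string) {
	for _, pkg := range local {
		files, ok := remote[pkg.Name]
		if !ok {
			// Gone from the hub: skip it instead of deleting its files.
			fmt.Println("skipping removed package", pkg.Name)
			continue
		}
		if len(files) == 0 {
			// Empty dummy flist: nothing to extract, nothing to delete.
			fmt.Println("empty flist for", pkg.Name, "- nothing to do")
			continue
		}
		for _, f := range files {
			fmt.Println("extracting", f) // extract/refresh only, never remove
		}
	}
}

func main() {
	local := []Package{{Name: "0-fs-old", Files: []string{"lib/librt.so.1"}}}
	remote := map[string][]string{"0-fs-old": {}} // dummy empty flist on the hub
	guardedSync(local, remote)
}
```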

After some brainstorming with @azmy, I ran a test to try to identify nodes that were affected by this bug and are still in the broken state. As far as we know, these nodes appear “Up” in the sense that they are reporting uptime, but are not reachable over RMB.

So I pulled a list of online nodes and sent them a status request over RMB. The following 141 nodes did not respond:

22
23
36
39
53
126
151
315
316
589
591
793
1008
1066
1073
1082
1111
1123
1152
1187
1193
1208
1229
1233
1302
1316
1320
1352
1359
1369
1396
1450
1493
1510
1511
1518
1577
1579
1715
1730
1742
1749
1984
2066
2184
2186
2281
2293
2323
2388
2396
2467
2552
2556
2621
2687
2755
2783
2976
2979
2988
3095
3096
3099
3105
3106
3111
3122
3202
3260
3283
3356
3446
3447
3453
3471
3505
3512
3517
3582
3603
3638
3647
3657
3669
3688
3731
3760
3761
3796
3814
3830
3913
4018
4104
4151
4179
4198
4288
4289
4369
4473
4475
4476
4481
4485
4486
4491
4498
4505
4535
4669
4705
4745
4820
4903
5010
5018
5022
5059
5109
5212
5315
5369
5371
5492
5497
5606
5685
5701
5745
5923
5928
5929
5931
5932
5933
6024
6035
6073
6074

While this is not conclusive evidence that a node was affected by the bug, if any of your node ids are in the list, it would be a good idea to reboot them.
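
For anyone who wants to repeat this check, here is a rough sketch of the approach. The actual RMB status call is abstracted behind a ping function (the exact client API isn’t shown in this thread), so treat all names as placeholders: send each online node a status request with a short timeout and collect the IDs that never answer.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// pingFunc is a placeholder for whatever status request you send over
// RMB; it should return an error if the node does not answer before
// the context expires.
type pingFunc func(ctx context.Context, nodeID uint32) error

// findUnresponsive sends each node a status request with a short timeout
// and returns the IDs of nodes that never answered, mirroring the check
// described above.
func findUnresponsive(nodes []uint32, ping pingFunc, timeout time.Duration) []uint32 {
	var down []uint32
	for _, id := range nodes {
		ctx, cancel := context.WithTimeout(context.Background(), timeout)
		if err := ping(ctx, id); err != nil {
			down = append(down, id)
		}
		cancel()
	}
	return down
}

func main() {
	// Example with a fake ping that marks every node as unresponsive.
	nodes := []uint32{22, 23, 36}
	fake := func(ctx context.Context, nodeID uint32) error {
		<-ctx.Done()
		return ctx.Err()
	}
	fmt.Println("unresponsive:", findUnresponsive(nodes, fake, 2*time.Second))
}
```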

Great work @scott! I will try to review the logs of some random nodes from that list.

A lot of those are mine and are not working with farmerbot. I will do a reboot, though the problem existed before the update bug.

That’s a good point. This is the same check that farmerbot is using to see that nodes woke up successfully, so any affected nodes would likely be in this list too. We’ve heard that rebooting such nodes can also help temporarily, though that won’t be fully resolved until the next Zos release.