Some nodes went down because of an upgrade bug

Mainnet was receiving an important security patch when an unexpected upgrade bug put some of the nodes into a bad state.

Unfortunately, affected nodes will not be able to come back unless they are manually rebooted. Affected nodes will show something like this in the logs:

[+] storaged: 2023-08-22T12:27:12Z fatal exiting error="failed to initialize storage module: failed to scan devices: stderr: lsblk: error while loading shared libraries: librt.so.1: cannot open shared object file: No such file or directory\n: exit status 127"

If your node is showing this, then you will have to manually reboot it to get it back online. This can’t be done remotely.

We managed to stop the update and (hopefully) save most of the grid. Currently fewer than 200 nodes are affected.

We are still investigating what caused this. There is still no clear clue, since the patch only targeted a completely different service than the one that is failing.

I will keep you updated when we find out more.

Updates do not interrupt running workloads, but an affected node can’t process new requests. Rebooting the nodes should get them back into the correct state.

Okay, I know what happened:

A few days ago we found an issue that affects some nodes during boot. It was hard to debug, but we found that there was a package on the hub that linked to a very old version of 0-fs, and that package sometimes overrode the latest version of 0-fs, which then caused the boot to fail.

What we did was delete this package from the repo. Booting nodes will no longer see this package at all and will work with the latest package. Great!

The problem is with nodes that are already up and running. These will keep running perfectly fine with no issue until there is a new zos version available:

While installing the update, the nodes will see that they have a local package that does not exist on the remote repository anymore! They are programmed to delete those packages.

The problem is that the old 0-fs package had copies of some libraries that actually shouldn’t be there! It included the librt.so.1 file that shows up in the error above. Since zos does not have a package manager and relies on flist content to manage packages, the nodes simply deleted this file, not knowing that it’s also required by other binaries on the system.

System broken!
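
To make this concrete, here is a rough sketch (not the actual zos code; all names are made up) of the kind of flist-based cleanup that bites here: files shipped by a package that has disappeared from the hub get removed, with no check for whether anything else on the system still needs them.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// Package is a hypothetical stand-in for an flist-backed package:
// just a name and the list of files it extracted onto the system.
type Package struct {
	Name  string
	Files []string
}

// syncPackages mimics the problematic update logic: any locally
// installed package that no longer exists on the remote hub gets
// its files deleted, without checking whether other packages or
// the base system still depend on those files.
func syncPackages(local []Package, remote map[string]bool, root string) {
	for _, pkg := range local {
		if remote[pkg.Name] {
			continue // still published on the hub, keep it
		}
		// Package disappeared from the hub: delete everything it shipped.
		// This is exactly how a shared library like librt.so.1 can vanish.
		for _, f := range pkg.Files {
			_ = os.Remove(filepath.Join(root, f))
			fmt.Println("removed", f)
		}
	}
}

func main() {
	local := []Package{
		{Name: "0-fs-old", Files: []string{"lib/librt.so.1", "bin/0-fs"}},
	}
	remote := map[string]bool{} // the old package was deleted from the hub
	syncPackages(local, remote, "/tmp/demo-root")
}
```

Nothing in that loop knows that librt.so.1 is also needed by tools like lsblk, which is exactly the gap.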

  • We interrupted the system update by forcing the version back to v3.7.2, so nodes that haven’t updated yet don’t see a change.
  • Nodes that booted after the package was already gone from the repo won’t be affected at all, since they never knew about that package, so even if they get the update they should be fine.
  • For nodes that are (luckily) still running, we will need to make sure that the update does not remove those files from the system.

Note that this problem would have hit mainnet with any update (not necessarily this patch). It would have appeared with the next release, and in that case it would have caused a complete blackout, so we are still lucky we caught and stopped this now.

The unlucky nodes that got affected by this need to be manually rebooted, and this guarantees that they become immune to this issue.

We will work on a fix to be included (in a patch or the next release) to make sure that packages removed from the hub repository DO NOT delete files from the system.

Update on the status of the issue:

  • We fixed the code issue that caused the deletion of the library files.
  • We also added a dummy flist that matches the name of the old package that started all this, but it doesn’t contain any files. In that case the system will not delete any files from that flist; it will only try to extract what is in the flist. Since the flist is empty, no extra files will be extracted, but most importantly, no files are deleted either.

The combination of both fixes should make the next release safe to apply.
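
As a rough illustration of the behavior after both fixes (again just a sketch with made-up names, not the actual patch): a package that is gone from the hub, or that resolves to an empty flist like the dummy one, no longer triggers any deletion.

```go
package main

import "fmt"

// Package is the same hypothetical stand-in as in the earlier sketch:
// a name plus the files the package ships.
type Package struct {
	Name  string
	Files []string
}

// guardedSync sketches the fixed behavior: a package that is gone from
// the hub, or that resolves to an empty flist, never triggers deletions;
// the update only adds or refreshes files.
func guardedSync(local []Package, remote map[string][]string) {
	for _, pkg := range local {
		files, ok := remote[pkg.Name]
		if !ok {
			// Gone from the hub: skip it instead of deleting its files.
			fmt.Println("skipping removed package", pkg.Name)
			continue
		}
		if len(files) == 0 {
			// Empty dummy flist: nothing to extract, nothing to delete.
			fmt.Println("empty flist for", pkg.Name, "- nothing to do")
			continue
		}
		for _, f := range files {
			fmt.Println("extracting", f) // extract/refresh only, never remove
		}
	}
}

func main() {
	local := []Package{{Name: "0-fs-old", Files: []string{"lib/librt.so.1"}}}
	remote := map[string][]string{"0-fs-old": {}} // dummy empty flist on the hub
	guardedSync(local, remote)
}
```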

After some brainstorming with @azmy, I ran a test to try to identify nodes that were affected by this bug and are still in the broken state. As far as we know, these nodes appear “Up” in the sense that they are reporting uptime, but are not reachable over RMB.

So I pulled a list of online nodes and sent them a status request over RMB. The following 141 nodes did not respond:

22
23
36
39
53
126
151
315
316
589
591
793
1008
1066
1073
1082
1111
1123
1152
1187
1193
1208
1229
1233
1302
1316
1320
1352
1359
1369
1396
1450
1493
1510
1511
1518
1577
1579
1715
1730
1742
1749
1984
2066
2184
2186
2281
2293
2323
2388
2396
2467
2552
2556
2621
2687
2755
2783
2976
2979
2988
3095
3096
3099
3105
3106
3111
3122
3202
3260
3283
3356
3446
3447
3453
3471
3505
3512
3517
3582
3603
3638
3647
3657
3669
3688
3731
3760
3761
3796
3814
3830
3913
4018
4104
4151
4179
4198
4288
4289
4369
4473
4475
4476
4481
4485
4486
4491
4498
4505
4535
4669
4705
4745
4820
4903
5010
5018
5022
5059
5109
5212
5315
5369
5371
5492
5497
5606
5685
5701
5745
5923
5928
5929
5931
5932
5933
6024
6035
6073
6074

While this is not conclusive evidence that a node was affected by the bug, if any of your node ids are in the list, it would be a good idea to reboot them.
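
For anyone who wants to repeat this check, here is a rough sketch of the approach. The actual RMB status call is abstracted behind a ping function (the exact client API isn’t shown in this thread), so treat all names as placeholders: send each online node a status request with a short timeout and collect the IDs that never answer.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// pingFunc is a placeholder for whatever status request you send over
// RMB; it should return an error if the node does not answer before
// the context expires.
type pingFunc func(ctx context.Context, nodeID uint32) error

// findUnresponsive sends each node a status request with a short timeout
// and returns the IDs of nodes that never answered, mirroring the check
// described above.
func findUnresponsive(nodes []uint32, ping pingFunc, timeout time.Duration) []uint32 {
	var down []uint32
	for _, id := range nodes {
		ctx, cancel := context.WithTimeout(context.Background(), timeout)
		if err := ping(ctx, id); err != nil {
			down = append(down, id)
		}
		cancel()
	}
	return down
}

func main() {
	// Example with a fake ping that marks every node as unresponsive.
	nodes := []uint32{22, 23, 36}
	fake := func(ctx context.Context, nodeID uint32) error {
		<-ctx.Done()
		return ctx.Err()
	}
	fmt.Println("unresponsive:", findUnresponsive(nodes, fake, 2*time.Second))
}
```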

Great work @scott! I will try to review the logs of some random nodes from that list.

A lot of those are mine and are not working with farmerbot. I will do a reboot, though the problem existed before the update bug.

That’s a good point. This is the same check that farmerbot is using to see that nodes woke up successfully, so any affected nodes would likely be in this list too. We’ve heard that rebooting such nodes can also help temporarily, though that won’t be fully resolved until the next Zos release.