Node boot frequently fails [Closed]

Hi all, using the farmerbot, my 4 nodes frequently shut down and boot up, as expected. The problem is that a node often gets stuck and fails to boot, so it shows as offline even though the hardware is running.
It’s always the same point where the boot gets stuck:


I have to manually shut down the node and restart it. Most of the time it then boots OK, but sometimes I even need to restart a second time.
Does anyone have an idea what the problem could be?
Thanks!

Hi @TFFarmer,

I can try to help, and if it doesn’t resolve quickly we can open an issue on GitHub.

Here are some troubleshooting steps:

  1. Check that the BIOS settings are correct.
  2. Check the connection and remove anything linked to Wi-Fi or similar, as you need a direct Ethernet connection.

I wonder if the unreachable problem might be on the TFGrid side.

This looks to me like some hardware component or internal connection has become intermittent. I would start by reseating any cards and connectors in the path of the SSD storage and networking. You could also try using a different NIC, either swapping among those built in or adding one in a PCI slot.


Another idea for troubleshooting is to boot a live Linux distribution, like Ubuntu, and see if any errors appear there. Sometimes this gives more helpful information than what is shown by Zos.

The thing is that it started happening on all 4 servers, it is not just 1 server. So I don’t think it is related to hardware in the machines, but could it be something upstream? Like router settings?

Is there a checklist somewhere for correctly adding a node to the grid?

A couple of questions:

1.) Does it always freeze at the pictured point?
2.) Has anything at all changed in your setup other than adding fbot, specifically new devices or changes to network gear?
3.) Can you map out your network?

The presence of the issue on multiple nodes likely indicates that the problem is with something they all have in common, which points me towards the network, either locally or as a communication issue with the grid.


It is always at this exact same point the code stops running, yes.
I had to factory reset my router a while ago, but the behaviour was present before that as well.
I have 4 servers, a Mikrotik router and provider modem in bridge mode.

When did you first see it start happening? That’s the information we’re after. You have been a farmer for a long while and things don’t just start; something has to change, whether that be a bug or something on your end. The more specific details we add, the better the differential.

There’s an entire worldwide infrastructure involved in a 3Node just booting properly, so it can be very difficult to isolate the problem.

To summarize so far:

You have 4 nodes that are doing this.

They freeze at the same point every time they freeze.

They sometimes boot completely normally.

The nodes have no public IP configurations.

Your network is made up of your ISP modem bridged to a Mikrotik device.

Are all of your nodes plugged in to the mikrotik?

Hi,
Yes, all nodes are plugged into the Mikrotik router, as is the laptop running the farmerbot.
I had not noticed this happening before I started using farmerbot, but of course, before using farmerbot the servers were always running…
Quite some time ago, Jan de Landtsheer made some changes to the Mikrotik regarding IPv6. Those changes have not been redone since the router reset, but since everything was working I assumed they were not really necessary.

Also, when I try to start farmerbot, I get this message:
mainnet-rmbpeer-1 exited with code 1
mainnet-rmbpeer-1 | cannot create substrate twin db object: Rpc error: RPC error: Networking or low-level protocol error: Error when opening the TCP socket: Address not available (os error 99): RPC error: Networking or low-level protocol error: Error when opening the TCP socket: Address not available (os error 99)

mainnet-rmbpeer-1 exited with code 1
dependency failed to start: container mainnet-grid3_client-1 is unhealthy
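
Since the message is "Address not available" and IPv6 settings on the router came up, one thing that may be worth checking is which address families the chain endpoint resolves to from the machine running farmerbot. The following is only a sketch: the function name `address_families` is made up for illustration, `tfchain.grid.tf` is the endpoint that appears in the farmerbot logs in this thread, and port 443 is an assumption for a `wss://` URL. It does not prove anything about the cause; it only shows whether IPv6 addresses are in play.

```python
import socket


def address_families(host, port=443):
    """Return a dict mapping 'IPv4'/'IPv6' to the sorted addresses host resolves to."""
    names = {socket.AF_INET: "IPv4", socket.AF_INET6: "IPv6"}
    found = {}
    # getaddrinfo returns one tuple per (family, address) the resolver offers.
    for family, _type, _proto, _canon, sockaddr in socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM
    ):
        label = names.get(family)
        if label:
            found.setdefault(label, set()).add(sockaddr[0])
    return {label: sorted(addrs) for label, addrs in found.items()}


# Example call (hostname from the logs, port assumed):
# print(address_families("tfchain.grid.tf"))
```

If IPv6 addresses are listed but the machine or the Docker network has no usable IPv6 address, that could be one possible explanation to rule out.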

This is another somewhat obscure network-related issue. It’s the first time I’m seeing it in connection with the farmerbot.

I wonder if the issue only became apparent with more frequent reboots of the servers. Do you generally have other devices connected to the Mikrotik?

Yes, I do have everything connected to the Mikrotik: workstations and a Wi-Fi access point…

root@erwin-Inspiron-1750:~/mainnet# docker compose up
[+] Running 5/5
✔ Network mainnet_default Created 0.4s
✔ Container mainnet-redis-1 Created 1.2s
✔ Container mainnet-grid3_client-1 Created 1.0s
✔ Container mainnet-rmbpeer-1 Created 1.0s
✔ Container mainnet-farmerbot-1 Created 0.7s
Attaching to mainnet-farmerbot-1, mainnet-grid3_client-1, mainnet-redis-1, mainnet-rmbpeer-1
mainnet-redis-1 | 1:C 05 Jun 2023 15:12:07.894 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
mainnet-redis-1 | 1:C 05 Jun 2023 15:12:07.894 # Redis version=7.0.8, bits=64, commit=00000000, modified=0, pid=1, just started
mainnet-redis-1 | 1:C 05 Jun 2023 15:12:07.894 # Configuration loaded
mainnet-redis-1 | 1:M 05 Jun 2023 15:12:07.896 * Increased maximum number of open files to 10032 (it was originally set to 1024).
mainnet-redis-1 | 1:M 05 Jun 2023 15:12:07.896 * monotonic clock: POSIX clock_gettime
mainnet-redis-1 | 1:M 05 Jun 2023 15:12:07.897 * Running mode=standalone, port=6379.
mainnet-redis-1 | 1:M 05 Jun 2023 15:12:07.897 # Server initialized
mainnet-redis-1 | 1:M 05 Jun 2023 15:12:07.897 # WARNING Memory overcommit must be enabled! Without it, a background save or replication may fail under low memory condition. Being disabled, it can can also cause failures without low memory condition, see https://github.com/jemalloc/jemalloc/issues/1328. To fix this issue add ‘vm.overcommit_memory = 1’ to /etc/sysctl.conf and then reboot or run the command ‘sysctl vm.overcommit_memory=1’ for this to take effect.
mainnet-redis-1 | 1:M 05 Jun 2023 15:12:07.899 * Loading RDB produced by version 7.0.8
mainnet-redis-1 | 1:M 05 Jun 2023 15:12:07.899 * RDB age 10 seconds
mainnet-redis-1 | 1:M 05 Jun 2023 15:12:07.899 * RDB memory usage when created 1.77 Mb
mainnet-redis-1 | 1:M 05 Jun 2023 15:12:07.909 * Done loading RDB, keys loaded: 1232, keys expired: 0.
mainnet-redis-1 | 1:M 05 Jun 2023 15:12:07.910 * DB loaded from disk: 0.011 seconds
mainnet-redis-1 | 1:M 05 Jun 2023 15:12:07.910 * Ready to accept connections
mainnet-grid3_client-1 | yarn run v1.22.19
mainnet-grid3_client-1 | warning package.json: No license field
mainnet-grid3_client-1 | $ /node_modules/.bin/grid_http_server -c /root/config.json
mainnet-rmbpeer-1 | cannot create substrate twin db object: Rpc error: RPC error: Networking or low-level protocol error: Error when opening the TCP socket: Address not available (os error 99): RPC error: Networking or low-level protocol error: Error when opening the TCP socket: Address not available (os error 99)
mainnet-rmbpeer-1 exited with code 1
mainnet-rmbpeer-1 | cannot create substrate twin db object: Rpc error: RPC error: Networking or low-level protocol error: Error when opening the TCP socket: Address not available (os error 99): RPC error: Networking or low-level protocol error: Error when opening the TCP socket: Address not available (os error 99)
mainnet-rmbpeer-1 exited with code 1
mainnet-rmbpeer-1 | cannot create substrate twin db object: Rpc error: RPC error: Networking or low-level protocol error: Error when opening the TCP socket: Address not available (os error 99): RPC error: Networking or low-level protocol error: Error when opening the TCP socket: Address not available (os error 99)
mainnet-rmbpeer-1 exited with code 1
mainnet-grid3_client-1 | 2023-06-05 15:14:27 API-WS: disconnected from wss://tfchain.grid.tf/ws: 1006:: connection failed
mainnet-rmbpeer-1 | cannot create substrate twin db object: Rpc error: RPC error: Networking or low-level protocol error: Error when opening the TCP socket: Address not available (os error 99): RPC error: Networking or low-level protocol error: Error when opening the TCP socket: Address not available (os error 99)
mainnet-rmbpeer-1 exited with code 1
mainnet-rmbpeer-1 | cannot create substrate twin db object: Rpc error: RPC error: Networking or low-level protocol error: Error when opening the TCP socket: Address not available (os error 99): RPC error: Networking or low-level protocol error: Error when opening the TCP socket: Address not available (os error 99)
mainnet-rmbpeer-1 exited with code 1
mainnet-rmbpeer-1 | cannot create substrate twin db object: Rpc error: RPC error: Networking or low-level protocol error: Error when opening the TCP socket: Address not available (os error 99): RPC error: Networking or low-level protocol error: Error when opening the TCP socket: Address not available (os error 99)
mainnet-rmbpeer-1 exited with code 1
mainnet-grid3_client-1 | 2023-06-05 15:16:41 API-WS: disconnected from wss://tfchain.grid.tf/ws: 1006:: connection failed
mainnet-rmbpeer-1 | cannot create substrate twin db object: Rpc error: RPC error: Networking or low-level protocol error: Error when opening the TCP socket: Address not available (os error 99): RPC error: Networking or low-level protocol error: Error when opening the TCP socket: Address not available (os error 99)
mainnet-rmbpeer-1 exited with code 1
dependency failed to start: container mainnet-grid3_client-1 is unhealthy
root@erwin-Inspiron-1750:~/mainnet#
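
To separate a farmerbot problem from a general connectivity problem, a plain TCP connection test from the same laptop can help. Below is a minimal sketch, not part of farmerbot: `check_tcp` is a hypothetical helper name, the host is the one shown in the grid3_client log line, and port 443 is an assumption based on the `wss://` scheme.

```python
import socket


def check_tcp(host, port, timeout=5.0):
    """Try a single TCP connection; return (True, "connected") or (False, error text)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except OSError as exc:
        # exc.errno is the OS error number (like the "os error 99" in the log);
        # it can be None for timeouts or name-resolution failures.
        return False, f"{type(exc).__name__}: errno={exc.errno} {exc}"


# Example call (hostname from the log, port assumed):
# print(check_tcp("tfchain.grid.tf", 443))
```

Running it both on the host and inside one of the containers would show whether the failure is specific to the Docker network or affects the whole machine.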

Acknowledged and raised a ticket via Support channel.

I did a little more searching. OS error 101 is also known as ENETUNREACH. This post has some useful info for better understanding what it means, especially in the comment at the bottom:

Network is unreachable can be received by a perfectly fine system correctly configured: any router in the path to the destination can send back this error using the out of band channel meant for this: ICMP packet. The ICMP Destination network unreachable will translate into the application getting an ENETUNREACH.

The first place I’d look is at your router. You said you “had to factory reset” it a while ago. Just speculating here, but maybe the root cause of the error 101 and whatever caused you to do the factory reset are the same? Maybe there are some logs or diagnostics you can check on the Mikrotik?
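
OS error numbers can be looked up on the machine itself with Python's standard `errno` module. This is a small sketch with a made-up helper name, `describe_errno`. It looks up both 101, discussed in this post, and 99, which is the number the rmbpeer log earlier in the thread actually reports. The numbering is platform-specific, so the result should be read on the Linux machine that produced the log.

```python
import errno
import os


def describe_errno(code):
    """Return 'NAME (code): message' for an OS error number on this platform."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name} ({code}): {os.strerror(code)}"


# On typical x86/ARM Linux systems 99 is EADDRNOTAVAIL and 101 is ENETUNREACH;
# other platforms number these errors differently.
for code in (99, 101):
    print(describe_errno(code))
```

The two errors are not the same thing: ENETUNREACH means no route to the destination network, while EADDRNOTAVAIL means the requested address could not be assigned, so it is worth confirming which one the logs show before digging into the router.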


Just ordered a new router, let’s see if that fixes things…

Please let us know if it works!