How to ensure high availability when running my workloads on the TFGrid

A question asked on the forum; the discussion deserves attention here so it doesn't get lost on Telegram:

Hello! Getting excited about this ThreeFold, which I stumbled upon not so long ago. One question, to which I can't seem to find the answer quickly: how about high availability? Or: what if my node is turned off for some hours, is there any company going out of business because some of their systems/sites are not hosted anymore?

The impact on rewards for farmers is described elsewhere in this forum. Here I'd like to reflect on the user-side perspective. A balance has to be struck, of course, between speed (which is higher when the hardware sits close together) and reliability and security (which increase when data and workloads run in different locations).

Users of the TFGrid can set up their infrastructure in a way that guarantees their workloads remain up and running, but it's up to them to make that happen.
What tooling can they use?

  • they can run a Kubernetes cluster out of the box, using the hardware infrastructure offered by different nodes. This enables load balancing as well as failover (see the deployment sketch after this list).
  • they can set up multiple gateways on the grid, so that if one goes down there is still another way to reach the workloads (see the failover sketch after this list).
  • for persistent storage and archiving, the quantum-safe storage splits up your data and spreads the chunks over different nodes. There is an 'operational backup' mechanism built in that ensures availability of the storage with very small redundancy, way lower than a full back-up, and it is operationally available: if a storage node goes down, the data remains available and no switch to a backup instance is needed (see the parity sketch after this list). The mechanism is explained on our wiki here: https://library.threefold.me/info/threefold#/technology/qsss/threefold__qss_algorithm
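
As a concrete illustration of the Kubernetes option, here is a minimal sketch using the official `kubernetes` Python client: it creates a Deployment with three replicas and a pod anti-affinity rule so that no two replicas land on the same node. The namespace, image and labels are placeholders of my own, not anything prescribed by the grid.

```python
# Minimal sketch: spread replicas across nodes with pod anti-affinity,
# so losing one node leaves the other replicas serving traffic.
# Assumes a kubeconfig for your grid-hosted cluster; names are placeholders.
from kubernetes import client, config

config.load_kube_config()  # reads ~/.kube/config for the cluster

labels = {"app": "my-web"}

deployment = client.V1Deployment(
    api_version="apps/v1",
    kind="Deployment",
    metadata=client.V1ObjectMeta(name="my-web"),
    spec=client.V1DeploymentSpec(
        replicas=3,  # three copies of the workload
        selector=client.V1LabelSelector(match_labels=labels),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels=labels),
            spec=client.V1PodSpec(
                containers=[client.V1Container(name="web", image="nginx:stable")],
                affinity=client.V1Affinity(
                    pod_anti_affinity=client.V1PodAntiAffinity(
                        # never schedule two replicas on the same node
                        required_during_scheduling_ignored_during_execution=[
                            client.V1PodAffinityTerm(
                                label_selector=client.V1LabelSelector(match_labels=labels),
                                topology_key="kubernetes.io/hostname",
                            )
                        ]
                    )
                ),
            ),
        ),
    ),
)

client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)
```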
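
For the multiple-gateway option, failover can be as simple as trying a list of gateway URLs in order and using the first one that answers. A minimal client-side sketch, with hypothetical gateway URLs:

```python
# Minimal client-side failover sketch: try each gateway in order and
# return the first healthy response. The gateway URLs are placeholders.
import urllib.error
import urllib.request

GATEWAYS = [
    "https://gw1.example.com/myapp",  # primary gateway (hypothetical)
    "https://gw2.example.com/myapp",  # fallback gateway (hypothetical)
]

def fetch(path: str = "/", timeout: float = 3.0) -> bytes:
    last_error = None
    for base in GATEWAYS:
        try:
            with urllib.request.urlopen(base + path, timeout=timeout) as resp:
                return resp.read()  # first gateway that answers wins
        except (urllib.error.URLError, TimeoutError) as err:
            last_error = err  # this gateway is down, try the next one
    raise RuntimeError(f"all gateways unreachable: {last_error}")

if __name__ == "__main__":
    print(fetch("/")[:200])
```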
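
And to get a feel for how "very small redundancy, way lower than a full back-up" works, here is a toy single-parity sketch; the real QSS algorithm, documented on the wiki linked above, is more sophisticated. Data is cut into k chunks plus one XOR parity chunk, and any single lost chunk is rebuilt from the others:

```python
# Toy illustration of "small redundancy instead of a full backup":
# data is split into k chunks plus ONE parity chunk (the XOR of all k).
# Storage overhead is 1/k of the data, yet any single lost chunk can be
# rebuilt on the fly from the others - no full copy, no failover switch.

def _xor(chunks: list[bytes]) -> bytes:
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, b in enumerate(chunk):
            out[i] ^= b
    return bytes(out)

def split(data: bytes, k: int = 4) -> list[bytes]:
    """Return k data chunks + 1 parity chunk, each stored on a different node."""
    size = -(-len(data) // k)  # ceil division: bytes per chunk
    chunks = [data[i * size:(i + 1) * size].ljust(size, b"\0") for i in range(k)]
    return chunks + [_xor(chunks)]

def rebuild(chunks: list[bytes | None]) -> list[bytes]:
    """Recover one missing chunk (a None entry) by XOR-ing all the others."""
    missing = chunks.index(None)
    chunks[missing] = _xor([c for c in chunks if c is not None])
    return chunks

# Example: lose chunk 2 ("node down") and recover it transparently.
stored = split(b"hello threefold grid, keep my data available please")
stored[2] = None
restored = rebuild(stored)
original = b"".join(restored[:-1]).rstrip(b"\0")
assert original == b"hello threefold grid, keep my data available please"
```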

So yes, tooling is available for a user to keep his workloads up and running permanently, and it requires fewer hardware resources than a classic system would need. Plus: it gives the user the flexibility to spread his workloads and archives over different locations, so a disaster such as a fire won't impact uptime.
