Self healing, high availability, and the grid today

scott · January 18, 2023, 12:47am

One common question around here goes something like: well, what happens if the node my workload is running on goes down? A common misconception about the ThreeFold Grid is that workloads are somehow distributed among available nodes in such a way that the failure of an individual node just means that work is shifted elsewhere. While self healing and self driving capabilities are an end goal for the Grid, we’re not there yet. Currently each deployment is tied to a single node, and it’s the responsibility of deployers to provide any desired redundancy.

Of course, even the big cloud providers have occasional outages, and IT architects of high stakes deployments will build in a layer of fault tolerance spanning multiple physical locations or even multiple cloud providers. For end users of these sites and services, the failure of a single hardware system supporting the operation would seldom be noticed. So, what do these configurations look like? Can you implement them on the Grid today? What might the future look like? Let’s find out.

High Availability

Often shortened to “HA”, high availability refers to a system with provisions to minimize downtime through redundant components. There are different places where HA can be implemented in a tech stack. Broadly, we could classify these as low level, operating in hardware or in the hypervisor that manages VMs, and high level, consisting of multiple systems or VMs orchestrated together to provide application level redundancy. Both of these could be working in the same stack, or either individually could be sufficient for a given use case.
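To put a rough number on what redundancy buys, here is a minimal sketch of the standard availability arithmetic. It assumes that the redundant components fail independently of each other and that any single surviving copy is enough to keep the service up, which real deployments only approximate. The function names are my own, chosen for illustration.

```python
HOURS_PER_YEAR = 24 * 365


def combined_availability(single: float, replicas: int) -> float:
    """Availability of `replicas` independent copies when any one suffices.

    The service is down only if every copy is down at the same time,
    so the combined unavailability is (1 - single) ** replicas.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return 1.0 - (1.0 - single) ** replicas


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime per year for a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    for replicas in (1, 2, 3):
        a = combined_availability(0.99, replicas)
        print(replicas, round(a, 6), round(downtime_hours_per_year(a), 3))
```

Under those assumptions, one system that is up 99% of the time is down for roughly 87.6 hours a year, while two of them in parallel are down together for less than one hour. Correlated failures, like two nodes sharing the same power or network connection, erode that gain.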

The ThreeFold Grid is mostly designed with high level redundancy in mind. There are some good reasons that low level methods aren’t as well suited to a distributed cloud. Still, it’s worth understanding a bit about how the low level methods work.

Low level HA

RAID

One of the best known HA technologies is RAID (ignoring RAID 0, which only provides increased performance, not HA). It allows multiple disks to be joined into an array, with the data stored on the array spread across the disks in a manner that includes some redundancy. The failure of some number of disks in the array can then be tolerated without losing any data, and failed disks can even be replaced without shutting down the computer, with their contents rebuilt from the remaining working disks.
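As a toy illustration of how lost data can be rebuilt from the surviving disks, here is a sketch of single parity using XOR, the idea behind parity based RAID levels such as RAID 5. It is not how any particular RAID implementation is written: real arrays stripe data, rotate parity across disks, and work at the block device level. Here each "disk" is just a byte string and the function names are made up for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    if not blocks:
        raise ValueError("need at least one block")
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of integer byte values per position.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks, parity):
    """Recover the one missing data block from the survivors plus parity."""
    return xor_blocks(list(surviving_blocks) + [parity])


if __name__ == "__main__":
    data = [b"grid", b"node", b"farm"]   # three data "disks"
    parity = xor_blocks(data)            # stored on a fourth "disk"

    # Pretend the second disk failed and rebuild it from the rest.
    rebuilt = rebuild_missing([data[0], data[2]], parity)
    print(rebuilt == data[1])
```

This scheme tolerates the loss of exactly one block. Losing two at once leaves nothing to rebuild from, which is why timely replacement of failed disks matters.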

While RAID began as a system implemented in special hardware devices, known as RAID controllers, these days RAID can be deployed purely as a software construct with some distinct advantages over the legacy systems such as portability, flexibility, and lower cost. Zero OS does not support hardware RAID systems, but it is possible to deploy software RAID within VMs on the Grid.

That said, software RAID is not an ideal solution on the Grid because flexibility is lacking in how disks are reserved. It’s not straightforward to ensure that two disk reservations are placed onto separate disks on a node, and of course, this requires that the node has multiple disks in the first place. Providing better support for deployments intending to use software RAID is one way that we can increase HA options on the Grid. However, this only covers the case of disk failure, and relying on a RAID setup can also mean relying on farmers to replace failed disks in a timely manner.

Hypervisor clusters

Some hypervisors like Proxmox and VMware include clustering capabilities that can restart a VM on a different machine in the cluster if one machine goes down, or even run a duplicate VM in parallel that can instantly take over in case of a fault. These systems also require a storage system for storing the VM data that is robust against the failure of nodes in the cluster. That could be a separate NAS/SAN storage system (perhaps itself using RAID), or a distributed storage system like Ceph running among the cluster nodes.

Using multiple computer systems in hypervisor clusters, potentially along with dedicated networking gear, requires a fairly tight coupling of those systems. To ensure that only one system is actually running the VM at a time, and thus avert potential data corruption, clusters use “fencing” techniques that include giving the individual systems the ability to power off the others or instruct networking gear to block their traffic. Techniques for running two synchronized duplicate VMs in parallel have sometimes even used special cables to connect the two computers.

While it is true that a good number of ThreeFold nodes run in farms where multiple nodes are present in the same physical location, solutions relying on colocated nodes aren’t appropriate, in general, for the distributed nature of the Grid where many nodes run solo. Besides, the trend in clustering is towards Kubernetes, which works well on the Grid.

High level HA

Following the segue, Kubernetes is a high level system which can provide high availability in a more abstract environment. We already have preconfigured Kubernetes deployments available on the Grid. Kubernetes works well running inside of VMs, and it orchestrates containers as the working unit. This is an important distinction, because designing deployments using containers has some fundamental differences versus using VMs. Most importantly, it is generally considered more challenging to containerize an application and deploy it using Kubernetes than it is to get it running inside a VM.

It is also possible to build high level HA deployments using VMs. Kubernetes has the advantage of behaving somewhat like the clusters discussed above, in that components that were running on a failed system can be restarted on another running system within the cluster automatically. For the remainder of this section, I’ll describe some techniques that are generally applicable using either approach.

The internet as you know it

Here’s an example of a web application architecture. It could be a blog or an ecommerce website, where the blog posts or item descriptions are stored in a database. These are then retrieved and displayed for site visitors, who can take additional actions that might generate more entries for the database, like comments, reviews, and orders. Aside from simple static sites that don’t need a database, pretty much the entire internet is built on some variation of this basic template:

   +--------------+          
   |              |          
   |     User     |          
   |              |          
   +--------------+          
         ^ |                 
         | |                 
         | v                 
   +--------------+          
   |              |          
   |  Web Server  |          
   |              |          
   +--------------+          
         ^ |                 
         | |                 
         | v                 
   +--------------+          
   |              |          
   |   Database   |          
   |              |          
   +--------------+

The database could be hosted on the same system as the web server (could be a VM, a bare metal server, or even a single node Kubernetes cluster), making that system the single point of failure. By adding an additional component, called a load balancer, along with redundant servers, we can eliminate the single point of failure. The load balancers will try to distribute traffic evenly between the two servers when they’re both online, then direct all traffic to one server if the other fails. Let’s also duplicate our database on another system and add a backup:

         +--------------+                     
         |              |                     
         |     User     |                     
         |              |                     
         +--------------+                     
             /      \                         
            /        \                        
           /          \                       
+--------------+  +--------------+            
|              |  |              |            
|Load Balancer |  |Load Balancer |            
|              |  |              |            
+--------------+  +--------------+            
      |       \    /       |                  
      |        \  /        |                  
      |         \/         |                  
      |         /\         |                  
      |        /  \        |                  
      |       /    \       |                  
+--------------+  +--------------+            
|              |  |              |            
|  Web Server  |  |  Web Server  |            
|              |  |              |            
+--------------+  +--------------+            
      |            /                          
      |           /                           
      |          /                            
      |         /                             
      |        /                              
+--------------+  +--------------+           
|              |  |              |           
|   Database   |--|   Database   |           
|              |  |   Replica    |           
+--------------+  +--------------+           
           \           /                      
            \         /                       
             \       /                        
          +--------------+                    
          |              |                    
          |   Database   |                    
          |   Backup     |                    
          +--------------+

In this example, the load balancers, web servers, and databases could be running on two machines, with one stack on each, or on six total machines. The backup would be separate, in any case, providing a final fallback option if everything above it were to fail. In case the main database goes offline, the replica can take over.
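The load balancer behavior described above can be sketched in a few lines. This is a simplified model, not the code of any real load balancer: it rotates through the web servers in order and skips any that have been marked as down. How a server gets marked down (health checks, timeouts) is left out, and the class and backend names are hypothetical.

```python
class RoundRobinBalancer:
    """Rotate requests across backends, skipping ones marked as down."""

    def __init__(self, backends):
        if not backends:
            raise ValueError("need at least one backend")
        self._backends = list(backends)
        self._healthy = set(self._backends)
        self._next = 0  # index of the backend to try first on the next pick

    def mark_down(self, backend):
        self._healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self._backends:
            self._healthy.add(backend)

    def pick(self):
        """Return the next healthy backend, or raise if none are left."""
        for _ in range(len(self._backends)):
            candidate = self._backends[self._next]
            self._next = (self._next + 1) % len(self._backends)
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("no healthy backends available")


if __name__ == "__main__":
    lb = RoundRobinBalancer(["web-a", "web-b"])
    print([lb.pick() for _ in range(4)])  # alternates while both are up
    lb.mark_down("web-a")
    print([lb.pick() for _ in range(3)])  # everything goes to web-b
```

In practice you would reach for an existing load balancer or reverse proxy rather than writing your own. The point is only that the logic for routing around a failed server is simple once something is tracking which servers are healthy.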

An architecture like this can absolutely be deployed on the Grid today, using existing and freely available tools. There are some details glossed over in the diagram, especially how DNS is configured to connect users who type a certain URL in their browser to both of the load balancers. Because that step depends on DNS and on how each client handles an address that doesn’t respond, the failure of one load balancer won’t be handled totally seamlessly. There will be at least some temporary delays in connecting.
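To show where that delay can come from, here is a sketch of what a client might do when a name resolves to more than one address. This assumes one common setup, not spelled out above, where an address record is published for each load balancer under the same name. The `connect` callable stands in for opening a real connection, and the addresses are from the range reserved for documentation, so everything here is hypothetical.

```python
def connect_with_fallback(addresses, connect):
    """Try each address in order and return the first that connects.

    `connect` is any callable that takes an address and either returns a
    connection object or raises OSError (timeouts included).
    """
    errors = []
    for address in addresses:
        try:
            return address, connect(address)
        except OSError as exc:
            # A dead load balancer typically costs a full timeout here
            # before the client moves on to the next address.
            errors.append((address, exc))
    raise ConnectionError(f"all addresses failed: {errors}")


if __name__ == "__main__":
    def fake_connect(address):
        # Simulate the first load balancer being offline.
        if address == "192.0.2.10":
            raise TimeoutError("no response")
        return f"connected to {address}"

    print(connect_with_fallback(["192.0.2.10", "192.0.2.20"], fake_connect))
```

Whether and how quickly a given browser or client falls back like this is up to that client, which is exactly the kind of behavior a deployer on the Grid can't control.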

My point at the moment in bringing that up is that eventually we hit some system outside of the Grid that has its own limitations. As part of the bigger project of “a new internet”, we might someday find replacements for such systems that can provide an even better solution than the best HA schemes out there today.

Self healing

While high availability systems are able to tolerate or recover from a certain number of faults, they aren’t able to restore their original state independent of external intervention. Self healing refers to the additional capability of an IT system to bring new resources online automatically to replace those that have gone offline. This is an emerging field of cloud automation, which applies closed loop feedback to maintain a steady state. It’s something like using a thermostat to hold a given temperature, rather than adding wood to a fire manually as needed.
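The thermostat idea translates into code as a reconcile loop: compare the desired state with what is actually running, then act to close the gap. The sketch below shows a single pass of such a loop. The `is_healthy` and `deploy` callables are placeholders for whatever health check and deployment tooling you use. They are not part of any Grid API.

```python
import itertools


def reconcile(desired_count, running, is_healthy, deploy):
    """One pass of a closed control loop over a set of instances.

    Drops instances that fail the health check and deploys replacements
    until `desired_count` healthy instances exist again.
    """
    healthy = [instance for instance in running if is_healthy(instance)]
    missing = desired_count - len(healthy)
    for _ in range(max(0, missing)):
        healthy.append(deploy())
    return healthy


if __name__ == "__main__":
    offline = {"vm-2"}                # pretend this VM's node went down
    new_ids = itertools.count(1)

    state = reconcile(
        desired_count=3,
        running=["vm-1", "vm-2", "vm-3"],
        is_healthy=lambda vm: vm not in offline,
        deploy=lambda: f"replacement-{next(new_ids)}",
    )
    print(state)
```

A real system would run this pass on a schedule and would also need to handle failed deployments, payment for the new capacity, and the case where the controller itself goes down.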

In a traditional cloud environment, self healing capabilities would be enabled through the use of a machine operable interface to an account with the cloud provider. Any additional capacity provisioned would then be billed to the account as usual. On the ThreeFold Grid, all systems are open, so it’s possible to automate any deployment process, and this means that self healing systems can be implemented on top of the existing Grid foundation. That is to say, while these features are on the roadmap, there’s nothing stopping anyone from getting started today.

Your virtual system administrator

One concept we use to describe what self healing on the Grid could look like is the virtual system administrator, as part of your digital twin. This is a software entity that is programmed with the capability to deploy, and heal, different applications on your behalf. Of course, this means that the twin itself must be inherently robust against failure, and therefore should perhaps use some high availability concepts along with healing itself as needed. This would be the root of what we talk about when we say “putting people at the center of their digital lives”: building a solid kernel that can be responsible for keeping data and services safe and online.

QSFS

I should mention our Quantum Safe Filesystem as an example of what self healing technology can look like, specifically for the case of storage. In Grid v2, we already had an implementation of QSFS that could provision new capacity automatically to replace any backend stores that went offline and reconstruct the desired level of fault tolerance. This capability will be recreated on Grid v3 as a foundational element of self healing tech stacks on the Grid.

Kubernetes and beyond

These days we mostly talk about VMs on the Grid, since indeed, all workloads run as VMs in two styles, micro and full. This distinction is important, because micro VMs are really based on containers and containers are largely seen as the future of software deployment. We have containers operating as “first class citizens” on the Grid, which opens the possibility for novel approaches to managing them. Currently Kubernetes is the leading solution for container orchestration, and we proudly support Kubernetes running in VMs on the Grid. However, this ultimately is an extra layer of abstraction (and resource consumption) whose function could be fulfilled by a Grid native alternative.

What could that look like and what could it enable? Those are questions for another time, but it seems pretty clear to me that there is always a next step in the natural evolution of every system.

An autonomous p2p infrastructure

I’ll close this piece by zooming out a bit, to recognize that ThreeFold technology enables more than autonomous software that can recover from failures. It also enables an autonomous infrastructure at the root layer of people and hardware. There’s no other project in the world that I know of which brings a complete solution for internet infrastructure based on independently controlled capacity that can scale anywhere there’s electricity and network connectivity.

I also want to encourage a bit of a paradigm shift in how we think about deployments and solutions on the Grid. We tend to focus a lot, including as I’ve done here, on how the existing ways of doing things can be ported to the Grid with minimal adaptation and effort. But aren’t we really interested in a new way of doing things? Realizing the promise of the peer to peer network that the internet was originally intended to be will require redesigning systems to avoid centralized points of failure in the first place.

Of course, blockchain networks are the most notable recent innovation in peer to peer networks, chugging along as nodes come and go. I wrote a bit recently about one project that is going a step further to eliminate the central point of failure in blockchain nodes themselves. To me, this is a great example of the type of thinking and building that’s well suited for a decentralized capacity network like the Grid.

Thanks for joining me on this journey of learning and dreaming. I’d love to hear any questions or reflections you have. Let’s keep on buidling something great together.