This has been a hot topic in the community lately, and I’ve been meaning to write a post about where we are and what’s possible. I don’t blame you for assuming that features like this already exist in the ThreeFold architecture. Older versions of the wiki, especially, were a little fuzzy about the current state of the tech versus what’s on the roadmap.
It turns out that building the foundation to enable autonomous and self-healing applications took a while, but I think we’re now in a very solid position to talk about how to move these features forward.
The short answer is that systems with these properties are tricky to build. Removing one single point of failure tends to create another one upstream. Say you have a server hosting a website. You add another server for redundancy. Now you need a component, a load balancer, that knows about both servers and routes traffic to and from them. But then you need a redundant load balancer. At some point, you run into the fact that your site’s domain is linked to one or more IP addresses that belong to components that could fail. Some cloud providers offer services to dynamically reassign IPs in case of such failures, or you could try a service offering high-speed DNS failover.
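To make that chain of dependencies concrete, here’s a minimal sketch in Python of the check-and-route decision a load balancer (or DNS failover service) makes. The backend IPs and the /health endpoint are made up for illustration, and of course whatever machine runs this logic becomes the new single point of failure.

```python
# Minimal sketch of the check-and-route decision described above.
# The backend IPs and the /health endpoint are hypothetical.
import urllib.request

BACKENDS = ["203.0.113.10", "203.0.113.20"]  # hypothetical web servers


def healthy(ip: str, timeout: float = 2.0) -> bool:
    """Return True if the backend answers its health endpoint with 200."""
    try:
        with urllib.request.urlopen(f"http://{ip}/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False


def pick_backend() -> str:
    """Route to the first healthy backend, like a tiny load balancer would."""
    for ip in BACKENDS:
        if healthy(ip):
            return ip
    raise RuntimeError("no healthy backend available")


if __name__ == "__main__":
    print("routing traffic to", pick_backend())
```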
To be able to offer a complete alternative, we still need a minimal set of highly available systems, probably in a datacenter or equivalent environment, at least to handle public network access. Once traffic is inside the grid, designing architectures that offer fault tolerance using a collection of nodes run by newbies on salvaged hardware in their garages is totally possible.
That said, designing in this way is a particular paradigm. Many cloud users just want to spin up a VM, install some software, and let it run. This VM has volatile state in RAM, which will never be recoverable in the event of a sudden power outage, and non-volatile state on disk, which could be synced elsewhere and recovered (using QSFS, for example). Migrating this VM to another node and restoring its operation somewhat gracefully should be possible, but it won’t be seamless.
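As a rough illustration of syncing that non-volatile state, here’s a minimal sketch that periodically mirrors a local data directory onto a network-backed mount. Both paths are hypothetical; /mnt/qsfs stands in for a QSFS (or any other network filesystem) mount that survives the loss of this node.

```python
# Rough sketch of syncing a VM's recoverable disk state to network-backed
# storage. Both paths are hypothetical placeholders.
import subprocess
import time

SOURCE = "/var/lib/app/"        # local disk state worth recovering
DESTINATION = "/mnt/qsfs/app/"  # assumed network-backed mount point


def sync_once() -> None:
    """Mirror the source directory onto the destination with rsync."""
    subprocess.run(["rsync", "-a", "--delete", SOURCE, DESTINATION], check=True)


if __name__ == "__main__":
    # Crude periodic sync loop. RAM state is still lost when the node dies,
    # which is why recovery on another node can't be fully seamless.
    while True:
        sync_once()
        time.sleep(300)  # every five minutes
```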
The grid is the only decentralized cloud network I know of that actually gives users this capability. Flux and Akash only support the execution of containers, which are not expected to retain state. Akash now has some limited persistent data support, but in general, these environments expect users to take responsibility for storing data safely elsewhere. Solutions must be designed as containers, and if they ingest data, they must also connect to a storage service that can handle concurrent connections.
Zero OS has native container support (actually micro VMs) and, with some more work, can provide self-healing redundancy for containerized workloads, backed by self-healing storage on the same network. TF Chain ultimately acts as the backstop in the “failure chain”: a decentralized, highly available, replicated database that coordinates node activities.
So why isn’t this the top priority? Well, for one, it’s already possible to run containerized workloads in a redundant manner on the grid using Kubernetes, if you know what you’re doing. Secondly, a network that allows me to provision a VM and start hacking at the command line, perhaps following any of countless tutorials online that start with “find yourself an Ubuntu VPS”, is way more interesting than a network that only supports containers, even if that VM remains a single point of failure.
I think the world generally agrees, because we’ve seen way more interest in using the grid since the generic VM feature was introduced to expand the offerings beyond Kubernetes VMs and native containers. That is to say, I think the course of development so far makes sense in helping to grow grid utilization and our community, and in getting broader testing for core features. There are plenty of fun and useful things you can do with capacity that doesn’t have 99.9% uptime (dev and test workloads are actually a big market). For the rest, there are gold certified nodes and Kubernetes, for now.
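For anyone curious what that Kubernetes path looks like in practice, here’s a minimal sketch using the official Python client. It assumes a kubeconfig that already points at a cluster (such as one deployed on the grid); the deployment name and image are placeholders.

```python
# Minimal sketch of redundancy via Kubernetes, using the official Python
# client. Assumes ~/.kube/config already points at a working cluster.
from kubernetes import client, config


def main() -> None:
    config.load_kube_config()  # read credentials from the local kubeconfig

    container = client.V1Container(
        name="web",
        image="nginx:stable",  # any stateless workload
        ports=[client.V1ContainerPort(container_port=80)],
    )
    template = client.V1PodTemplateSpec(
        metadata=client.V1ObjectMeta(labels={"app": "web"}),
        spec=client.V1PodSpec(containers=[container]),
    )
    deployment = client.V1Deployment(
        api_version="apps/v1",
        kind="Deployment",
        metadata=client.V1ObjectMeta(name="web"),
        spec=client.V1DeploymentSpec(
            replicas=2,  # the redundancy: two replicas of the pod
            selector=client.V1LabelSelector(match_labels={"app": "web"}),
            template=template,
        ),
    )
    # Ask the control plane to create the deployment and keep it running.
    client.AppsV1Api().create_namespaced_deployment(
        namespace="default", body=deployment
    )


if __name__ == "__main__":
    main()
```

In a real setup you’d also add pod anti-affinity so the two replicas actually land on different nodes rather than just being two copies on the same one.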