This guide will demonstrate a couple of strategies for backing up and restoring VMs on the ThreeFold Grid. So far there isn't a built-in backup solution within Zos, which means we need to approach this from within the VM.
Here’s a brief overview:
- Works for full and micro VMs, deployed via the Dashboard or via other tools (that includes Dashboard applications, which are micro VMs)
- Backs up an entire VM, including installed software and all data
- Optionally uses a second VM to receive the data, for the case where the source VM's disk is more than 50% full
Prepare source VM
We're going to make a copy of the source VM while it is running. Generally this works just fine, but with one caveat:
If any write operations happen during the backup, it’s not guaranteed whether the old or new version of the file will be stored in the backup. This can cause data corruption in the backup (there’s no risk to the original data), especially for databases.
There are two ways to mitigate this:
- Stop services, like databases, that write to disk before making the backup
- Make a separate backup of any databases, using a supported method of database backup
It won't be possible to cover either of these exhaustively, since there are many possible permutations. The subsections below show how to view and stop services in typical cases. If you're not sure which apply to you, just work through them all and follow the instructions that do.
Backing up databases separately is beyond the scope of this guide. You should be able to find instructions in the documentation for your database or via a web search.
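That said, just to make the idea concrete, here's a rough sketch of what a separate dump might look like for two common databases. The database name, unit names, and output paths are placeholders, and your setup may well differ:
# MySQL / MariaDB example - dumps one database to a file
mysqldump --single-transaction mydatabase > /root/mydatabase.sql
# PostgreSQL example - runs the dump as the postgres user (assumes sudo is available)
sudo -u postgres pg_dump mydatabase > /root/mydatabase.sql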
Systemd
If your VM is a full VM, then it most likely has systemd as the init system. You can view services managed by systemd by running this command:
systemctl list-units --type=service --state=running --no-pager
If this results in an error that systemd isn't found, no problem: just skip ahead to the next section. Otherwise, have a look through the list for anything that might be writing to disk. Stop those services like this:
# For example, we can stop the MySQL database (the unit is usually just called mysql)
systemctl stop mysql
Whatever you do, don't stop ssh.service or sshd.service. You could lose the ability to connect to your VM over SSH.
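If you want to double check that a particular service actually stopped, systemctl can confirm it (mysql is just an example unit name here):
# Prints "inactive" and a non-zero exit code once the service is stopped
systemctl is-active mysql || echo "mysql is stopped"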
Ubuntu / Debian automatic updates
If your VM has systemd (full VM) and it’s an Ubuntu or Debian machine, there’s a chance you have automatic updates enabled. Needless to say, an automatic update is exactly the kind of thing we’d like to avoid during our backup.
To temporarily disable automatic updates, run this command:
dpkg-reconfigure unattended-upgrades
And then select “No” on the menu that pops up. To reenable later, just run the same command and select “Yes”.
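If you'd rather verify the setting directly, on typical Ubuntu and Debian installs it ends up in this file, where a value of "0" means disabled (the exact path can vary with your setup):
cat /etc/apt/apt.conf.d/20auto-upgrades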
Finally, let's double check that apt is not running:
pgrep apt || echo Good to go
If you see some numbers printed on screen, that means an apt process is running. Try waiting a while and checking again. If you see "Good to go" echoed back, then proceed.
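If you want to be extra cautious, you can widen the net to also catch dpkg and the unattended-upgrades worker. This may occasionally match an unrelated process, but a false positive just means waiting a little longer:
# Lists any matching processes with their full command line
pgrep -af 'apt|dpkg|unattended' || echo Good to go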
Zinit
Our ThreeFold micro VMs and application deployments come with zinit as the process manager. Check for zinit and list running services like this:
zinit list
Stop services like this:
# Web servers aren't as risky as databases, but it won't hurt to stop nginx
zinit stop nginx
Docker
Finally, check if Docker is installed and if any containers are running:
docker ps
The simplest way to proceed here is to just stop Docker while making the backup, which will also stop any containers:
# With systemd
systemctl stop docker.service
# With zinit
zinit stop dockerd
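Alternatively, if you'd prefer to leave the Docker daemon itself running, you could stop just the containers. This assumes you're fine with every running container being stopped at once:
# Stops all running containers (complains if there are none)
docker stop $(docker ps -q)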
Final check
After you've stopped (or at least tried to stop) all the services that might write data to disk and corrupt the backup, have a final look at the processes running on the machine:
# Shows non-kernel processes, for a fairly concise view
ps --ppid 2 -p 2 --deselect
If there are lingering processes that seem like they should have been stopped already, you can try the steps above again.
Backup time
Now we have a fork in the road. If your VM has adequate disk space you can keep things simple and make the backup to the VM’s own disk. Otherwise, you will need another machine to receive the backup (since life is already getting complicated in this case, I’ll assume that machine is always going to be a second VM).
Let’s have a look at the disk space situation:
# That's "h" for "human readable"
df -h
There's going to be a bit of noise in these results, such as tmpfs entries, which actually live in RAM rather than on disk. The general rule is to look at the Mounted on column for either a bare / (the root filesystem) or something starting with /mnt, which is a typical mount point.
The formula here is basically to add up the used disk space and see if there’s some place where all of it can fit (either under root or under some mount). That will be pretty simple if no disks are mounted—then the question is basically whether the root filesystem usage is less than 50%.
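For the simple case where nothing extra is mounted, a one-liner is enough to see whether root usage is under that 50% mark:
# Shows just the root filesystem
df -h /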
If this is all feeling a bit overwhelming, you can always opt to back up to a second VM. Then all that matters is that the second VM has enough space. If you plan to use a second VM, skip ahead to the rsync section. Otherwise, continue with tar.
Backing up with tar
Here we will create a compressed archive containing the entire backup. All commands below should be run as root.
First, make sure the pv utility is installed for progress monitoring:
apt update && apt install -y pv
Before running the long and scary looking command below, here's a quick rundown of what it does:
- Creates a tar archive of the whole filesystem
- Excludes system directories that are either generic or don't contain permanent data
- Excludes the backup file itself (to avoid an infinite loop)
- Pipes the output through pv to show progress
- Compresses the result with gzip
tar -c --exclude='/boot/*' --exclude='/dev/*' --exclude='/proc/*' --exclude='/sys/*' --exclude='/tmp/*' --exclude='/run/*' --exclude='/lost+found/' --exclude=/backup.tar.gz / | pv | gzip > /backup.tar.gz
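Once the command finishes, a quick sanity check doesn't hurt. gzip -t only verifies that the compressed stream is readable; it doesn't extract anything:
gzip -t /backup.tar.gz && echo "Archive looks intact"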
You'll now have a backup at /backup.tar.gz, which you can, for example, download to your local computer using scp or an SFTP client like Filezilla.
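For reference, the scp download run from your own computer could look something like this (the address is a placeholder for your VM's IP):
# Run this on your local machine, not on the VM
scp root@<VM IP address>:/backup.tar.gz .
Once you have copied the backup file elsewhere, you can remove it from the VM to free up the space: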
rm /backup.tar.gz
With that, you are done, until it’s time to restore. See instructions below for info on doing that.
Backing up with rsync
In this case, we will backup to a second VM over a network connection using rsync
. We’ll use this naming convention throughout:
- VM1 - the original VM that we are backing up
- VM2 - the new VM that’s receiving the backup
At this point, you’ll need to deploy your VM2. Here’s what I recommend:
- Micro or full VM at your preference
- Use “custom” capacity:
- 1 vCPU
- 1024 MB of RAM
- SSD big enough to receive all data from VM1 and hold the backup archive (go ahead and reserve 2x the stored data amount—this VM is only temporary so the cost is not important)
- Public IPv4 address (this is a simple way to get reliable communication between the VMs, but isn’t strictly required)
Install rsync
Both VMs will need rsync installed. We'll also make sure nano and pv are installed on VM2 while we're at it:
# VM1
apt update && apt install -y rsync
# VM2
apt update && apt install -y rsync nano pv
Establish SSH between VMs
Using rsync requires SSH connectivity between the two VMs. We'll generate a new SSH key on VM1 and add it to the authorized keys on VM2:
# VM1
ssh-keygen -t ed25519
# Hit enter to accept all defaults
cat ~/.ssh/id_ed25519.pub
Copy the key and paste it into the authorized keys file on VM2:
# VM2
nano ~/.ssh/authorized_keys
# Paste in the key, then ctrl-o, enter to save and ctrl-x to exit
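Before kicking off the transfer, it's worth a quick test from VM1 that the SSH connection actually works (substitute the real IP; you'll be asked to accept the host key the first time):
# VM1
ssh root@<VM2 IP address> echo "SSH between VMs works"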
Here’s a gif showing these steps, with VM1 on the left and VM2 on the right (click to enlarge):
Do the backup
The rsync command below is adapted from this Arch Linux wiki page. Briefly, it:
- Syncs all files from VM1's root to a folder on VM2, /backup
- Preserves file ownership and attributes
- Shows progress
- Excludes system directories that are either generic or don't contain permanent data
# VM1
# Substitute in the IP address of VM2, for example: root@1.2.3.4:/backup
rsync -aAXHv --exclude='/boot/*' --exclude='/dev/*' --exclude='/proc/*' --exclude='/sys/*' --exclude='/tmp/*' --exclude='/run/*' --exclude='/lost+found/' / root@<VM2 IP address>:/backup
Once that completes, we can again use tar to create a compressed version as a single file:
# VM2
tar -c -C /backup . | pv | gzip > /backup.tar.gz
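If you'd like some reassurance before downloading, you can peek at the archive's contents without extracting anything:
# VM2
tar -tzf /backup.tar.gz | head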
Now you can download the backup.tar.gz file to its final destination using scp or an SFTP client like Filezilla. Once you have the backup safely stored, you can decommission VM2.
Restoring the backup
As before, we have a fork in the road. The first and simpler case is restoring by extracting the tar archive directly into a new VM. This works in many but not all cases: due to some quirks, it can actually be impossible to restore the root filesystem into certain application deployments using tar. In that case, you can work around it by using a second VM and rsync again.
If you’re not sure, you can try the first method and if it fails, move on to the second. The symptom to look out for would be “disk quota exceeded” errors. In that case, destroy the VM you tried to recover into and start fresh.
Restoring with tar
First you will need to deploy the VM or application solution to restore the backup into. If you are restoring an application solution, follow the steps again to stop any running services. If you are restoring into a fresh VM then there’s no need to worry about this.
Upload and extract
Upload the backup.tar.gz into the root directory of the new VM, using the method of your choice. Then run:
cd /
tar -xf backup.tar.gz
# No tricks for showing progress here
# Just grab some coffee, or whatever, and hope for the best
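If you'd like a progress bar after all and don't mind installing pv on this VM too, the same extraction can be piped through it. This is just an optional variation of the command above, not an extra step:
# Optional: extract with a progress bar instead
apt update && apt install -y pv
pv /backup.tar.gz | tar -xzf - -C /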
Reenable auto updates
Don't forget to reenable automatic updates, if applicable (this applies to both the original VM and the restored VM):
dpkg-reconfigure unattended-upgrades # Choose "Yes"
Reboot
Finally, go ahead and reboot the VM. This serves two purposes. First, it will bring up all services and generally bring the machine to a “normal” state. Second, it will ensure that no issue blocking the VM from booting up was introduced along the way—better to discover that now than much later when the VM reboots unexpectedly due to the host node losing power.
# Full VM
reboot
# Micro VM
reboot -f
# If all else fails
echo b > /proc/sysrq-trigger
Cleanup
It will take a little while before the VM comes back and you can connect via SSH again. Have a look around and make sure everything looks normal, then clean up the backup archive:
rm /backup.tar.gz
Restoring with rsync
To set this up, we will need two VMs. Deploy them as follows:
- VM1 - deploy this VM with the same specs as the VM you backed up. If it was an application solution, deploy the same solution with the same specs
- VM2 - this is the temp VM for extracting and transferring the backup. Micro or full VM at your preference with “custom” capacity:
- 1 vCPU
- 1024 MB of RAM
- SSD big enough to hold the backup archive and its extracted contents (go ahead and reserve 2x the backup size—this VM is only temporary so the cost is not important)
- Public IPv4 address (this is a simple way to get reliable communication between the VMs, but isn’t strictly required)
Stop running services
If you are restoring an application solution, first follow the steps above again to make sure that all services are stopped.
Install rsync
Both VMs will need rsync installed. We'll also make sure nano is installed on VM2 while we're at it:
# VM1
apt update && apt install -y rsync
# VM2
apt update && apt install -y rsync nano
Establish SSH between VMs
This is exactly the same process as before. Just scroll up until you see the gif if you need to reference it.
Upload, extract, and transfer
Now upload the backup.tar.gz file to the root directory of VM2, using the method of your choice. When that's done, extract it like so:
# VM2
cd /
mkdir /backup
tar -C /backup -xf backup.tar.gz
Then, from VM1, initiate the transfer via rsync:
# VM1
# Substitute in the IP address of VM2, for example: root@1.2.3.4:/backup
rsync -aAXHv --exclude='/boot/*' --exclude='/dev/*' --exclude='/proc/*' --exclude='/sys/*' --exclude='/tmp/*' --exclude='/run/*' --exclude='/lost+found/' root@<VM2 IP address>:/backup/ /
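If you'd like to preview what would be written before letting it loose on the root filesystem, the same command with -n (--dry-run) added only prints what it would transfer:
# VM1, dry run only - nothing is actually written
rsync -aAXHvn --exclude='/boot/*' --exclude='/dev/*' --exclude='/proc/*' --exclude='/sys/*' --exclude='/tmp/*' --exclude='/run/*' --exclude='/lost+found/' root@<VM2 IP address>:/backup/ /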
Reenable auto updates
Don’t forget to reenable automatic updates, if applicable:
# VM1
dpkg-reconfigure unattended-upgrades # Choose "Yes"
Reboot
Ensure the restore was successful and bring the VM back up to a normal state by rebooting:
# Full VM
reboot
# Micro VM
reboot -f
# If all else fails
echo b > /proc/sysrq-trigger
Postlog
I hope this guide is clear and helpful, but if you have any questions, please do post them below. We should eventually get a backup feature built into Zos that's substantially easier to use than what I described here, but for now, at least we know it's possible.