GPU support updates

We are quite busy nowadays with GPUs, but we wanted to give you a quick update on how things are going with GPU support. Let's start with the UI.

Explorer

  • A new filter for GPU-supported nodes is going to be added
  • GPU count
  • Filtering capabilities based on the model / device

On the details pages we will show the card information and its status (reserved or available). The ID that needs to be used during deployments is also easily accessible and has a copy-to-clipboard button.

[image]

Here's an example of how it looks when a GPU is reserved:

[image]

The PR has videos inside: https://github.com/threefoldtech/tfgrid-sdk-ts/pull/706

Dashboard

The dashboard is where nodes are reserved. The farmer should be able to set the extra fees on the form, and the user should be able to reserve a node and get its details (cost including the extra fees, GPU information). [In progress]

Playground

Currently the playground is the easiest way to deploy a VM. A new GPU option has been added to the filters.


That means it will limit the search criteria to the nodes you rented that have GPUs. Once it finds nodes, it will also show a list of the available GPUs to use in the VM.

Statistics View


A simple card with the GPU count has been added.

Testing

There's an in-progress test suite we are developing, and all kinds of help are appreciated to expand and review our testing plan.

Now, the upcoming part is more technical, covering GPU support itself as a feature across components like the chain, ZOS, and the underlying libraries.

TFChain

  • On TFChain there will be a small migration to store whether the node has a GPU (has_gpu) or the count of GPU devices on the node, plus the extra price the farmer wants to receive as a fee (ADR Link)

GraphQL

You can query the nodes on devnet using a query like this:

query MyQuery {
  nodes(where: {hasGpu_eq: true}) {
    hasGpu
    id
  }
}
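
For reference, here's a minimal sketch of running that same query from Go over plain HTTP. The GraphQL endpoint URL below is an assumption about devnet (not confirmed in this post), so adjust it to the network you are targeting.

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// Same query as above: list node ids that report a GPU.
	query := `query MyQuery { nodes(where: {hasGpu_eq: true}) { hasGpu id } }`
	body, _ := json.Marshal(map[string]string{"query": query})

	// Assumed devnet GraphQL endpoint.
	resp, err := http.Post("https://graphql.dev.grid.tf/graphql", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var result map[string]interface{}
	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		panic(err)
	}
	fmt.Println(result)
}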

ZOS and 3Nodes

Asking a 3Node for its GPUs returns entries that look like this:

 {
   id: '0000:0e:00.0/1002/744c',
   vendor: 'Advanced Micro Devices, Inc. [AMD/ATI]',
   device: 'Navi 31 [Radeon RX 7900 XT/7900 XTX]',
   contract: 31540
 }

Here vendor and device are quite self-explanatory; contract is the contract that has the GPU reserved, and id specifies that exact slot together with the vendor and device IDs, separated by slashes: <SLOT>/<VENDOR>/<DEVICE>.
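
For illustration, a Go representation of such a record could look like the sketch below. The struct and field tags are illustrative only; the real type lives in the ZOS / grid client codebases and may differ.

// Illustrative only: the GPU record shape reported by a 3Node.
type GPU struct {
	ID       string `json:"id"`       // <SLOT>/<VENDOR>/<DEVICE>, e.g. "0000:0e:00.0/1002/744c"
	Vendor   string `json:"vendor"`   // e.g. "Advanced Micro Devices, Inc. [AMD/ATI]"
	Device   string `json:"device"`   // e.g. "Navi 31 [Radeon RX 7900 XT/7900 XTX]"
	Contract uint64 `json:"contract"` // contract that has this GPU reserved
}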

Grid Proxy

Grid Proxy provides a REST interface on top of GraphQL, plus some extra logic on top of the indexer database. For example, a node record now includes num_gpu and extraFee, and the stats endpoint reports a total gpus count:

{"id":"0006258208-000008-fdc69","nodeId":93,"farmId":3042,"twinId":3181,"country":"Belgium","gridVersion":5,"city":"Berchem","uptime":595827,"created":1686828732,"farmingPolicyId":1,"updatedAt":1687430796,"capacity":{"total_resources":{"cru":32,"sru":512110190592,"hru":0,"mru":33665941504},"used_resources":{"cru":0,"sru":107374182400,"hru":0,"mru":3366594150}},"location":{"country":"Belgium","city":"Berchem","longitude":4.4294,"latitude":51.1904},"publicConfig":{"domain":"","gw4":"","gw6":"","ipv4":"","ipv6":""},"status":"up","certificationType":"Diy","dedicated":false,"rentContractId":31805,"rentedByTwinId":26,"serialNumber":"M80-CA013501994","power":{"state":"","target":""},"num_gpu":1,"extraFee":0}
{"nodes":82,"farms":4257,"countries":11,"totalCru":1010001108,"totalSru":100088947416594522,"totalMru":100012174259244860,"totalHru":321326156620739,"publicIps":417,"accessNodes":8,"gateways":6,"twins":3013,"contracts":31826,"nodesDistribution":{"":1,"42656c6769756d":1,"Belgium":40,"EG-1656487239":1,"Egy":1,"Egypt":23,"Germany":1,"SomeCountry":1,"United States":11,"egypt":1,"someCountry":1},"gpus":3}

Grid Proxy will also provide search capabilities to filter by vendor or device. [Still in progress]
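
As a rough sketch, querying the proxy over REST from Go could look like this. The base URL and the has_gpu filter parameter are assumptions about the devnet proxy, so double-check them against the Grid Proxy documentation.

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// Only the fields we care about here; the full node record has many more (see above).
type node struct {
	NodeID   int   `json:"nodeId"`
	NumGPU   int   `json:"num_gpu"`
	ExtraFee int64 `json:"extraFee"`
}

func main() {
	// Assumed devnet proxy URL and filter parameters.
	resp, err := http.Get("https://gridproxy.dev.grid.tf/nodes?status=up&has_gpu=true")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var nodes []node
	if err := json.NewDecoder(resp.Body).Decode(&nodes); err != nil {
		panic(err)
	}
	for _, n := range nodes {
		fmt.Printf("node %d: %d GPU(s), extra fee %d\n", n.NodeID, n.NumGPU, n.ExtraFee)
	}
}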

Farmerbot

There's an upcoming update to be able to find nodes with GPUs. For more details, please check the issue.

Clients

Golang

Client PR https://github.com/threefoldtech/tfgrid-sdk-go/pull/207/ adds support for the new listGPUs call and for deploying a machine with GPUs.

Here's an example of how to deploy using the Go client (part of our integration tests):

func TestVMWithGPUDeployment(t *testing.T) {
	tfPluginClient, err := setup()
	assert.NoError(t, err)

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
	defer cancel()

	publicKey, privateKey, err := GenerateSSHKeyPair()
	assert.NoError(t, err)

	twinID := uint64(tfPluginClient.TwinID)
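	// Filter for an up node rented by this twin, with enough free capacity and a GPU.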
	nodeFilter := types.NodeFilter{
		Status:   &statusUp,
		FreeSRU:  convertGBToBytes(20),
		FreeMRU:  convertGBToBytes(8),
		RentedBy: &twinID,
		HasGPU:   &trueVal,
	}

	nodes, err := deployer.FilterNodes(ctx, tfPluginClient, nodeFilter)
	if err != nil {
		t.Skip("no available nodes found")
	}
	nodeID := uint32(nodes[0].NodeID)

	nodeClient, err := tfPluginClient.NcPool.GetNodeClient(tfPluginClient.SubstrateConn, nodeID)
	assert.NoError(t, err)

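	// List the GPUs available on the rented node; their ids are passed to the VM workload.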
	gpus, err := nodeClient.GPUs(ctx)
	assert.NoError(t, err)

	network := workloads.ZNet{
		Name:        "gpuNetwork",
		Description: "network for testing gpu",
		Nodes:       []uint32{nodeID},
		IPRange: gridtypes.NewIPNet(net.IPNet{
			IP:   net.IPv4(10, 20, 0, 0),
			Mask: net.CIDRMask(16, 32),
		}),
		AddWGAccess: false,
	}

	disk := workloads.Disk{
		Name:   "gpuDisk",
		SizeGB: 20,
	}

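	// The VM requests the GPUs by their <SLOT>/<VENDOR>/<DEVICE> ids.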
	vm := workloads.VM{
		Name:       "gpu",
		Flist:      "https://hub.grid.tf/tf-official-vms/ubuntu-22.04.flist",
		CPU:        4,
		Planetary:  true,
		Memory:     1024 * 8,
		GPUs:       ConvertGPUsToStr(gpus),
		Entrypoint: "/init.sh",
		EnvVars: map[string]string{
			"SSH_KEY": publicKey,
		},
		Mounts: []workloads.Mount{
			{DiskName: disk.Name, MountPoint: "/data"},
		},
		NetworkName: network.Name,
	}

	err = tfPluginClient.NetworkDeployer.Deploy(ctx, &network)
	assert.NoError(t, err)

	defer func() {
		err = tfPluginClient.NetworkDeployer.Cancel(ctx, &network)
		assert.NoError(t, err)
	}()

	dl := workloads.NewDeployment("gpu", nodeID, "", nil, network.Name, []workloads.Disk{disk}, nil, []workloads.VM{vm}, nil)
	err = tfPluginClient.DeploymentDeployer.Deploy(ctx, &dl)
	assert.NoError(t, err)

	defer func() {
		err = tfPluginClient.DeploymentDeployer.Cancel(ctx, &dl)
		assert.NoError(t, err)
	}()

	vm, err = tfPluginClient.State.LoadVMFromGrid(nodeID, vm.Name, dl.Name)
	assert.NoError(t, err)
	assert.Equal(t, vm.GPUs, ConvertGPUsToStr(gpus))

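	// Give the VM time to boot, then check that the GPU vendor shows up in lspci output.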
	time.Sleep(30 * time.Second)
	output, err := RemoteRun("root", vm.YggIP, "lspci -v", privateKey)
	assert.NoError(t, err)
	assert.Contains(t, string(output), gpus[0].Vendor)
}
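
The helpers used above (setup, GenerateSSHKeyPair, convertGBToBytes, ConvertGPUsToStr, RemoteRun) come from our integration-test utilities. As a rough illustration, ConvertGPUsToStr simply maps the GPU entries reported by the node to the id strings set on the VM workload; the sketch below uses a hypothetical gpuInfo type as a stand-in for the real type returned by nodeClient.GPUs.

// Hypothetical stand-in for the GPU entry returned by the node client.
type gpuInfo struct {
	ID     string // "<SLOT>/<VENDOR>/<DEVICE>"
	Vendor string
	Device string
}

// Sketch only: collect the GPU ids that the VM workload should request.
func convertGPUsToStr(gpus []gpuInfo) []string {
	ids := make([]string, 0, len(gpus))
	for _, g := range gpus {
		ids = append(ids, g.ID)
	}
	return ids
}
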
Terraform

Support for GPUs is almost there. Here's an example:

terraform {
  required_providers {
    grid = {
      source = "threefoldtechdev.com/providers/grid"
    }
  }
}
provider "grid" {
}

locals {
  name = "testvm"
}


resource "grid_network" "net1" {
  name        = local.name
  nodes       = [93]
  ip_range    = "10.1.0.0/16"
  description = "newer network"
}
resource "grid_deployment" "d1" {
  name         = local.name
  node         = 93
  network_name = grid_network.net1.name
  vms {
    name       = "vm1"
    flist      = "https://hub.grid.tf/tf-official-apps/base:latest.flist"
    cpu        = 2
    memory     = 1024
    entrypoint = "/sbin/zinit init"
    env_vars = {
      SSH_KEY = file("~/.ssh/id_rsa.pub")
    }
    planetary = true
    gpus = [
      "0000:0e:00.0/1002/744c"
    ]
  }
}

grid-cli

It will include a --gpu flag to deploy on a specific node you rented.

TypeScript

There's a lot depending on the updates here, e.g. all of our frontend components.

Client PR: https://github.com/threefoldtech/tfgrid-sdk-ts/pull/666
Here there are a couple of updates regarding finding nodes with GPUs, querying a node for GPU information, and deploying with GPU support.

An example script to deploy with GPU support:

import { DiskModel, FilterOptions, MachineModel, MachinesModel, NetworkModel } from "../src";
import { config, getClient } from "./client_loader";
import { log } from "./utils";

async function main() {
  const grid3 = await getClient();

  // create network Object
  const n = new NetworkModel();
  n.name = "vmgpuNetwork";
  n.ip_range = "10.249.0.0/16";

  // create disk Object
  const disk = new DiskModel();
  disk.name = "vmgpuDisk";
  disk.size = 100;
  disk.mountpoint = "/testdisk";

  const vmQueryOptions: FilterOptions = {
    cru: 8,
    mru: 16, // GB
    sru: 100,
    availableFor: grid3.twinId,
    hasGPU: true,
    rentedBy: grid3.twinId,
  };

  // create vm node Object
  const vm = new MachineModel();
  vm.name = "vmgpu";
  vm.node_id = +(await grid3.capacity.filterNodes(vmQueryOptions))[0].nodeId; // TODO: allow random choice
  vm.disks = [disk];
  vm.public_ip = false;
  vm.planetary = true;
  vm.cpu = 8;
  vm.memory = 1024 * 16;
  vm.rootfs_size = 0;
  vm.flist = "https://hub.grid.tf/tf-official-vms/ubuntu-22.04.flist";
  vm.entrypoint = "/";
  vm.env = {
    SSH_KEY: config.ssh_key,
  };
  vm.gpu = ["0000:0e:00.0/1002/744c"]; // gpu card's id, you can check the available gpu from the dashboard

  // create VMs Object
  const vms = new MachinesModel();
  vms.name = "vmgpu";
  vms.network = n;
  vms.machines = [vm];
  vms.metadata = "";
  vms.description = "test deploying VM with GPU via ts grid3 client";

  // deploy vms
  const res = await grid3.machines.deploy(vms);
  log(res);

  // get the deployment
  const l = await grid3.machines.getObj(vms.name);
  log(l);

   // delete
  const d = await grid3.machines.delete({ name: vms.name });
  log(d);

  await grid3.disconnect();
}

main();

Manuals

It's an in-progress effort; we are still quite busy getting code complete, but we will have the manuals as soon as possible.

Questions

Right now, it seems the sizes of the images are going to be quite large. If you can provide more insights or help improve the situation, it would be amazing. For now, our solution is to provision a full virtual machine and have the user manually install the drivers and frameworks they need.

Thank you

What are the different TF Manual links for GPU support?
