
Running DeepSeek-R1 on Bare-Metal GPU using Talos Linux Kubernetes

A practical guide to running DeepSeek-R1 on bare-metal GPU servers using Talos Linux and Kubernetes with NVIDIA GPU support.

Setting up a Kubernetes cluster on bare metal with GPU workloads can be a challenging task. Here I will walk through the entire process: renting a dedicated GPU server, installing Talos Linux, deploying a Kubernetes cluster, and running the DeepSeek LLM.

Running LLMs on dedicated hardware gives you predictable performance, no API rate limits, and complete control over your inference stack. The trade-off is operational complexity, but with the right tooling, it becomes manageable.

[Diagram] GPU connection path within the Kubernetes cluster: Pod/Workload (DeepSeek-R1 via Ollama) → NVIDIA GPU Operator / k8s-device-plugin → container runtime with the NVIDIA container toolkit → Talos Linux OS, kernel modules, and drivers → GPU hardware over PCIe

Why Talos Linux?

Talos Linux is a modern, immutable Linux distribution designed specifically for running Kubernetes. It provides several advantages for GPU workloads:

- An immutable, minimal OS image, so drivers are baked into the image rather than installed by hand
- Fully API-driven management through talosctl, with no SSH or shell to maintain
- System extensions for adding the NVIDIA driver and container toolkit without breaking immutability
- Low ongoing operational overhead once the cluster is bootstrapped

Rent a GPU Server

Hetzner offers competitive GPU server pricing with two options:

| Server | GPU | VRAM | RAM | Price |
|--------|-----|------|-----|-------|
| GEX44 | NVIDIA RTX 4000 SFF | 20 GB | 64 GB DDR4 | $205/mo |
| GEX130 | NVIDIA RTX 6000 | 48 GB | 128 GB DDR5 | $931/mo |

I chose the GEX44 for this setup. After submitting the order, the server was ready within an hour. Order the server with rescue mode enabled, since we will be installing Talos Linux from it.

ssh root@176.9.98.109
-------------------------------------------------------------------
 Welcome to the Hetzner Rescue System.
-------------------------------------------------------------------

 This Rescue System is based on Debian GNU/Linux 12 (bookworm)
 You can install software like you would in a normal system.

Hardware data:
   CPU1: 13th Gen Intel(R) Core(TM) i5-13500 (Cores 20)
   Memory: 64127 MB (Non-ECC)
   Disk /dev/nvme0n1: 1920 GB (=> 1788 GiB)
   Disk /dev/nvme1n1: 1920 GB (=> 1788 GiB)

root@rescue ~ #

Talos Linux Installation

Image Selection

We need to download Talos Linux with NVIDIA drivers. The Talos Image Factory lets you build custom images with specific extensions.

Talos uses a "system extensions" concept that allows adding drivers and additional components to the base image without compromising its immutable design. For GPU workloads, we need NVIDIA-specific extensions:
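The post does not list the exact extensions it selected, but a typical Image Factory schematic for NVIDIA support looks like the sketch below. The specific variants are an assumption on my part; the proprietary "production" extensions are shown, and open-driver and LTS variants also exist:

# Image Factory schematic (assumed extension variants)
customization:
  systemExtensions:
    officialExtensions:
      - siderolabs/nonfree-kmod-nvidia-production
      - siderolabs/nvidia-container-toolkit-production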

Important
The versions must match between extensions. These extensions provide the necessary NVIDIA drivers and container runtime support needed to access the GPU within Kubernetes.

Select Bare-metal Machine type with the latest Talos version (1.9.5) and amd64 architecture. After selecting extensions, choose the Disk Image (raw) option.

Installation

We cannot attach a custom ISO to the dedicated server, but we can boot into Hetzner's rescue system and write the Talos image directly to the drive:

# Download the Talos image
cd /tmp
wget -O /tmp/talos.raw.zst https://factory.talos.dev/image/26124abcbd408be693df9fe852c80ef1e6cc178e34d7d7d8430a28d1130b4227/v1.9.5/metal-amd64.raw.zst

# Check available disks
lsblk
# nvme0n1  259:0    0  1.7T  0 disk
# nvme1n1  259:1    0  1.7T  0 disk

# Write the image to disk
zstd -d -c talos.raw.zst | dd of=/dev/nvme0n1 bs=4M

# Mount EFI and create boot entry
mkdir -p /mnt/efi
mount /dev/nvme0n1p1 /mnt/efi
efibootmgr -c -d /dev/nvme0n1 -p 1 -L "Talos Linux" -l '\EFI\BOOT\BOOTX64.EFI'

# Reboot into Talos
reboot
Note
After reboot, SSH access will be lost since Talos is fully API-driven. All management happens through talosctl.
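At this point the node sits in maintenance mode, waiting for a machine configuration. It can still be queried without credentials; a quick sanity check (not part of the original walkthrough) is to list its disks:

talosctl --nodes 176.9.98.109 get disks --insecure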

Cluster Configuration

Install talosctl from the Talos documentation, then generate cluster secrets:

# Generate secrets and config
mkdir cluster-config && cd cluster-config
talosctl gen secrets --output-file secrets.yaml

export CLUSTER_NAME="gpu-cluster"
export NODE_IP="176.9.98.109"
export API_ENDPOINT="https://$NODE_IP:6443"

talosctl gen config \
  --with-secrets secrets.yaml \
  --output-types talosconfig \
  --output talosconfig \
  $CLUSTER_NAME \
  $API_ENDPOINT
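The commands above only generate the client configuration. A convenient follow-up, assuming you want talosctl to use it by default, is to point it at the new file and the node:

export TALOSCONFIG=$PWD/talosconfig
talosctl config endpoint $NODE_IP
talosctl config node $NODE_IP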

GPU Kernel Modules

Create the machine configuration with GPU support. The kernel modules must be loaded for the GPU to be accessible:

# nodes/n1.yaml
machine:
  install:
    disk: none
    diskSelector:
      size: '< 2TB'
    # Image Factory installer with the same schematic ID, so the NVIDIA extensions survive installs and upgrades
    image: factory.talos.dev/installer/26124abcbd408be693df9fe852c80ef1e6cc178e34d7d7d8430a28d1130b4227:v1.9.5
  network:
    hostname: n1
    interfaces:
    - interface: eth0
      dhcp: true
  kernel:
    modules:
      - name: nvidia
      - name: nvidia_uvm
      - name: nvidia_drm
      - name: nvidia_modeset
  sysctls:
    net.core.bpf_jit_harden: 1
  files:
    - op: create
      content: |
        [plugins]
          [plugins."io.containerd.cri.v1.runtime"]
            [plugins."io.containerd.cri.v1.runtime".containerd]
              default_runtime_name = "nvidia"
      path: /etc/cri/conf.d/20-customization.part

The files section configures containerd to use the NVIDIA runtime by default, ensuring containers can access the GPU.

Bootstrap the Cluster

Apply the configuration and bootstrap:

# Generate rendered config
talosctl gen config \
  --output rendered/n1.yaml \
  --output-types controlplane \
  --dns-domain local.$CLUSTER_NAME \
  --with-cluster-discovery=false \
  --with-secrets secrets.yaml \
  --config-patch @patches/allow-controlplane-workloads.yaml \
  --config-patch @nodes/n1.yaml \
  $CLUSTER_NAME \
  $API_ENDPOINT

# Apply config to node
talosctl --nodes $NODE_IP apply-config --file rendered/n1.yaml --insecure

# Bootstrap the cluster
talosctl --nodes $NODE_IP bootstrap
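The patches/allow-controlplane-workloads.yaml file referenced above is not shown in the post. Since this is a single-node cluster, it presumably lifts the control-plane scheduling restriction; a minimal version would look like this (an assumption, not the author's exact patch):

# patches/allow-controlplane-workloads.yaml
cluster:
  allowSchedulingOnControlPlanes: true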

Monitor the cluster status with the Talos dashboard:

talosctl --nodes 176.9.98.109 dashboard
n1 (controlplane)
Stage: Running
Ready: true
Kubernetes: Healthy (v1.32.2)
Kubelet: Healthy

Export the kubeconfig once the cluster is ready:

talosctl -n $NODE_IP kubeconfig
kubectl get nodes -o wide
# NAME   STATUS   ROLES           AGE   VERSION   INTERNAL-IP
# n1     Ready    control-plane   33m   v1.32.2   176.9.98.109
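Before looking for the GPU itself, it is worth confirming that the NVIDIA system extensions and kernel modules actually loaded. These read-only checks are added here for convenience:

talosctl -n $NODE_IP get extensions
talosctl -n $NODE_IP read /proc/modules | grep nvidia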

Discover GPU Devices

In Talos Linux, everything is API-driven, so commands like lspci will not work. Use talosctl to discover GPU devices:

talosctl get pcidevices -n $NODE_IP | grep NVIDIA
NODE NAMESPACE TYPE ID CLASS VENDOR PRODUCT
176.9.98.109 hardware PCIDevice 0000:01:00.0 Display controller NVIDIA AD104GL [RTX 4000 SFF]
176.9.98.109 hardware PCIDevice 0000:01:00.1 Multimedia controller NVIDIA Audio device

Add GPU Capacity

Even though the GPU is detected and ready on the node, the kubelet does not expose it as schedulable capacity by default. Device plugins register with the kubelet to advertise additional resources such as GPUs.

[Diagram] NVIDIA device plugin exposes GPU resources to Kubernetes: the nvidia-device-plugin DaemonSet discovers the GPU on the node and registers the nvidia.com/gpu resource with the kubelet over its gRPC API

Deploy the NVIDIA device plugin:

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.0/deployments/static/nvidia-device-plugin.yml

Verify GPU capacity is now available:

kubectl get no -o json | jq '.items[0].status.capacity'
# {
#   "cpu": "20",
#   "memory": "65626196Ki",
#   "nvidia.com/gpu": "1",
#   "pods": "110"
# }

The nvidia.com/gpu: "1" confirms the GPU is now available for Kubernetes workloads.
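Before deploying anything heavier, a throwaway CUDA pod is an easy way to confirm that a workload can actually claim the GPU. This is a sketch rather than part of the original setup, and the CUDA image tag is an assumption:

# gpu-test.yaml: request one GPU and print nvidia-smi output
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04  # assumed tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1

Apply it with kubectl apply -f gpu-test.yaml, then check kubectl logs gpu-test once the pod completes; the driver and GPU details should appear in the output.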

Running DeepSeek-R1

Deploy Ollama

Create a namespace that allows privileged containers (required for GPU access). Labeling it for the privileged Pod Security Standard accomplishes this:

kubectl apply -f - <<EOF
apiVersion: v1
kind: Namespace
metadata:
  name: ollama
  labels:
    pod-security.kubernetes.io/enforce: privileged
EOF

Deploy Ollama using the Helm chart:

helm repo add ollama-helm https://otwld.github.io/ollama-helm/
helm repo update
helm install ollama ollama-helm/ollama --namespace ollama \
    --set ollama.gpu.enabled=true,ollama.gpu.type=nvidia

Verify the deployment:

kubectl get pods -n ollama -o wide
# NAME                      READY   STATUS    RESTARTS   AGE
# ollama-776884645f-l66cv   1/1     Running   0          2m39s

kubectl get deploy -n ollama ollama -o yaml | grep nvidia
# nvidia.com/gpu: "1"
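As an extra check, you can run nvidia-smi inside the Ollama pod; the NVIDIA container runtime injects the binary into GPU containers, assuming the containerd configuration from earlier is in effect:

kubectl exec -n ollama deploy/ollama -- nvidia-smi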

Pull and Run the Model

Set up port-forward to access Ollama from your local machine:

kubectl port-forward svc/ollama -n ollama 11434:11434
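With the tunnel up, the Ollama API should answer on localhost (11434 is Ollama's default port). A quick check, plus the setting needed if you drive it with a local ollama CLI, might look like:

curl http://localhost:11434/api/version
export OLLAMA_HOST=http://localhost:11434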

The DeepSeek-R1 32B model with 4-bit quantization requires approximately 16 GB of GPU memory (32 billion parameters at roughly 0.5 bytes each), plus some headroom for the KV cache. The RTX 4000 SFF with 20 GB of VRAM is sufficient with room to spare.

ollama run deepseek-r1:32b
pulling manifest
pulling 6150cb382311... 100%  19 GB
pulling 369ca498f347... 100%  387 B
pulling 6e4c38e1172f... 100%  1.1 KB
verifying sha256 digest
writing manifest
success
>>> How can I setup GPU workloads in Kubernetes?
<think>
Okay, so I want to set up GPU workloads in Kubernetes. I've heard that Kubernetes is good for container orchestration, but adding GPUs into the mix seems a bit more complicated. Let me figure this out step by step.

First, I know that my cluster needs to have nodes with GPUs. So I need to ensure that at least some of my worker nodes have NVIDIA GPUs installed. But how do I check if they're properly recognized? Maybe using nvidia-smi...

The model is now running on your dedicated GPU hardware with full control over the inference stack.

Wrapping Up

Running DeepSeek-R1 on bare metal with Talos Linux provides:

- Predictable inference performance with no API rate limits
- Complete control over the inference stack
- Dedicated GPU hardware at a fixed monthly price

The operational overhead is real but manageable. Talos Linux eliminates most of the traditional sysadmin burden, and Kubernetes provides familiar deployment patterns. Once set up, the cluster runs reliably with minimal intervention.

Next, I plan to explore the TensorRT-LLM inference toolkit for optimized performance.

