Setting up a Kubernetes cluster on bare metal for GPU workloads can be challenging. Here I will walk through the entire process: renting a dedicated GPU server, installing Talos Linux, deploying a Kubernetes cluster, and running the DeepSeek-R1 model.
Running LLMs on dedicated hardware gives you predictable performance, no API rate limits, and complete control over your inference stack. The trade-off is operational complexity, but with the right tooling, it becomes manageable.
GPU connection path within the Kubernetes cluster
Why Talos Linux?
Talos Linux is a modern, immutable Linux distribution designed specifically for running Kubernetes. It provides several advantages for GPU workloads:
- Security-focused: Minimal attack surface with no SSH, no shell, and no login
- API-driven: Everything is managed through a secure API
- Immutable infrastructure: Prevents configuration drift and improves reliability
- Purpose-built for Kubernetes: Optimized specifically for container workloads
- Seamless upgrades: Simple, atomic upgrades with rollback capability
- NVIDIA support: Provides system extensions for NVIDIA drivers
Rent a GPU Server
Hetzner offers competitive GPU server pricing with two options:
| Server | GPU | VRAM | RAM | Price |
|---|---|---|---|---|
| GEX44 | NVIDIA RTX 4000 SFF | 20 GB | 64 GB DDR4 | $205/mo |
| GEX130 | NVIDIA RTX 6000 | 48 GB | 128 GB DDR5 | $931/mo |
I chose the GEX44 for this setup. After submitting the order, the server was ready within an hour. Make sure the server is booted into rescue mode, since we will be installing Talos Linux from Hetzner's rescue system.
Talos Linux Installation
Image Selection
We need to download Talos Linux with NVIDIA drivers. The Talos Image Factory lets you build custom images with specific extensions.
Talos uses a "system extensions" concept that allows adding drivers and additional components to the base image without compromising its immutable design. For GPU workloads, we need NVIDIA-specific extensions:
- siderolabs/nvidia-container-toolkit-production: NVIDIA container runtime extension (550.144.03-v1.17.3)
- siderolabs/nonfree-kmod-nvidia-production: NVIDIA driver kernel module (550.144.03-v1.9.5)
Select Bare-metal Machine type with the latest Talos version (1.9.5) and amd64 architecture. After selecting extensions, choose the Disk Image (raw) option.
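If you prefer scripting over the web UI, the same image can be produced by posting a schematic to the Image Factory API; the returned schematic ID goes into the download URL used in the next step. A minimal sketch with the two NVIDIA extensions:
# Create an Image Factory schematic; the JSON response contains the schematic ID
curl -X POST --data-binary @- https://factory.talos.dev/schematics <<EOF
customization:
  systemExtensions:
    officialExtensions:
      - siderolabs/nonfree-kmod-nvidia-production
      - siderolabs/nvidia-container-toolkit-production
EOF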
Installation
Hetzner dedicated servers cannot be booted from a custom ISO, but we can boot into Hetzner's rescue system and write the Talos image directly to the drive:
# Download the Talos image
cd /tmp
wget -O /tmp/talos.raw.zst https://factory.talos.dev/image/26124abcbd408be693df9fe852c80ef1e6cc178e34d7d7d8430a28d1130b4227/v1.9.5/metal-amd64.raw.zst
# Check available disks
lsblk
# nvme0n1 259:0 0 1.7T 0 disk
# nvme1n1 259:1 0 1.7T 0 disk
# Write the image to disk
zstd -d -c talos.raw.zst | dd of=/dev/nvme0n1 bs=4M
# Mount EFI and create boot entry
mkdir -p /mnt/efi
mount /dev/nvme0n1p1 /mnt/efi
efibootmgr -c -d /dev/nvme0n1 -p 1 -L "Talos Linux" -l '\EFI\BOOT\BOOTX64.EFI'
# Reboot into Talos
reboot
From this point on, all interaction with the node happens through talosctl.
Cluster Configuration
Install talosctl from the Talos documentation, then generate cluster secrets:
# Generate secrets and config
mkdir cluster-config && cd cluster-config
talosctl gen secrets --output-file secrets.yaml
export CLUSTER_NAME="gpu-cluster"
export NODE_IP="176.9.98.109"
export API_ENDPOINT="https://$NODE_IP:6443"
talosctl gen config \
--with-secrets secrets.yaml \
--output-types talosconfig \
--output talosconfig \
$CLUSTER_NAME \
$API_ENDPOINT
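Before applying anything, point talosctl at the node. Assuming the talosconfig generated above sits in the current directory, that looks like:
# Use the generated talosconfig and target the node for subsequent commands
export TALOSCONFIG=$PWD/talosconfig
talosctl config endpoint $NODE_IP
talosctl config node $NODE_IP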
GPU Kernel Modules
Create the machine configuration with GPU support. The kernel modules must be loaded for the GPU to be accessible:
# nodes/n1.yaml
machine:
  install:
    disk: none
    diskSelector:
      size: '< 2TB'
    image: ghcr.io/siderolabs/installer:v1.9.5
  network:
    hostname: n1
    interfaces:
      - interface: eth0
        dhcp: true
  kernel:
    modules:
      - name: nvidia
      - name: nvidia_uvm
      - name: nvidia_drm
      - name: nvidia_modeset
  sysctls:
    net.core.bpf_jit_harden: 1
  files:
    - op: create
      content: |
        [plugins]
          [plugins."io.containerd.cri.v1.runtime"]
            [plugins."io.containerd.cri.v1.runtime".containerd]
              default_runtime_name = "nvidia"
      path: /etc/cri/conf.d/20-customization.part
The files section configures containerd to use the NVIDIA runtime by default, ensuring containers can access the GPU.
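The gen config command in the next step also references patches/allow-controlplane-workloads.yaml. Since this is a single-node cluster, the control plane node has to accept regular workloads; the patch is essentially just the standard Talos setting:
# patches/allow-controlplane-workloads.yaml
cluster:
  allowSchedulingOnControlPlanes: true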
Bootstrap the Cluster
Apply the configuration and bootstrap:
# Generate rendered config
talosctl gen config \
--output rendered/n1.yaml \
--output-types controlplane \
--dns-domain local.$CLUSTER_NAME \
--with-cluster-discovery=false \
--with-secrets secrets.yaml \
--config-patch @patches/allow-controlplane-workloads.yaml \
--config-patch @nodes/n1.yaml \
$CLUSTER_NAME \
$API_ENDPOINT
# Apply config to node
talosctl --nodes $NODE_IP apply-config --file rendered/n1.yaml --insecure
# Bootstrap the cluster
talosctl --nodes $NODE_IP bootstrap
Monitor the cluster status with the Talos dashboard:
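For a single node that is just:
# Interactive dashboard with logs, service health, and resource usage for the node
talosctl --nodes $NODE_IP dashboard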
Export the kubeconfig once the cluster is ready:
talosctl -n $NODE_IP kubeconfig
kubectl get nodes -o wide
# NAME STATUS ROLES AGE VERSION INTERNAL-IP
# n1 Ready control-plane 33m v1.32.2 176.9.98.109
Discover GPU Devices
In Talos Linux, everything is API-driven, so commands like lspci will not work. Use talosctl to discover GPU devices:
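The PCIDevice resources live in the hardware namespace; listing them and filtering for the NVIDIA entries looks roughly like this:
# List PCI devices discovered by Talos and keep the NVIDIA entries
talosctl --nodes $NODE_IP get pcidevices | grep -i nvidia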
| NODE | NAMESPACE | TYPE | ID | CLASS | VENDOR | PRODUCT |
|---|---|---|---|---|---|---|
| 176.9.98.109 | hardware | PCIDevice | 0000:01:00.0 | Display controller | NVIDIA | AD104GL [RTX 4000 SFF] |
| 176.9.98.109 | hardware | PCIDevice | 0000:01:00.1 | Multimedia controller | NVIDIA | Audio device |
Add GPU Capacity
Even though the GPU is detected and ready on the node, Kubelet does not expose it as capacity. Device plugins communicate with Kubelet to expose additional resources like GPUs.
NVIDIA device plugin exposes GPU resources to Kubernetes
Deploy the NVIDIA device plugin:
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.0/deployments/static/nvidia-device-plugin.yml
Verify GPU capacity is now available:
kubectl get no -o json | jq '.items[0].status.capacity'
# {
# "cpu": "20",
# "memory": "65626196Ki",
# "nvidia.com/gpu": "1",
# "pods": "110"
# }
The nvidia.com/gpu: "1" confirms the GPU is now available for Kubernetes workloads.
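A quick way to confirm end-to-end GPU access is a throwaway pod that requests the GPU and runs nvidia-smi. This is just a sketch; the gpu-test name and the nvidia/cuda base image tag are illustrative choices:
# gpu-test.yaml: minimal pod that requests the GPU and prints nvidia-smi output
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
    - name: nvidia-smi
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
Apply it with kubectl apply -f gpu-test.yaml and check kubectl logs gpu-test; the output should list the RTX 4000.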
Running DeepSeek-R1
Deploy Ollama
Create a namespace that allows privileged containers (required for GPU access):
kubectl apply -f - <<EOF
apiVersion: v1
kind: Namespace
metadata:
  name: ollama
  labels:
    pod-security.kubernetes.io/enforce: privileged
EOF
Deploy Ollama using the Helm chart:
helm repo add ollama-helm https://otwld.github.io/ollama-helm/
helm repo update
helm install ollama ollama-helm/ollama --namespace ollama \
--set ollama.gpu.enabled=true,ollama.gpu.type=nvidia
Verify the deployment:
kubectl get pods -n ollama -o wide
# NAME READY STATUS RESTARTS AGE
# ollama-776884645f-l66cv 1/1 Running 0 2m39s
kubectl get deploy -n ollama ollama -o yaml | grep nvidia
# nvidia.com/gpu: "1"
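To double-check that the Ollama pod actually sees the GPU, nvidia-smi can be run inside it:
# Run nvidia-smi inside the Ollama pod to confirm the GPU is visible
kubectl exec -n ollama deploy/ollama -- nvidia-smi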
Pull and Run the Model
Set up port-forward to access Ollama from your local machine:
kubectl port-forward svc/ollama -n ollama 11434:11434
The DeepSeek-R1 32B model with 4-bit quantization requires approximately 16GB of GPU memory. The RTX 4000 with 20GB VRAM is sufficient with room to spare.
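Assuming the port-forward above is running and Ollama is listening on its default port 11434, the model can be pulled and queried through the Ollama HTTP API, for example:
# Pull the 4-bit quantized 32B model (large download, so this takes a while)
curl http://localhost:11434/api/pull -d '{"model": "deepseek-r1:32b"}'
# Run a quick test prompt against the model
curl http://localhost:11434/api/generate -d '{"model": "deepseek-r1:32b", "prompt": "Why is the sky blue?", "stream": false}'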
The model is now running on your dedicated GPU hardware with full control over the inference stack.
Wrapping Up
Running DeepSeek-R1 on bare metal with Talos Linux provides:
- Predictable performance: No shared resources or noisy neighbors
- No API limits: Run as many inferences as your hardware allows
- Cost efficiency: At $205/month, heavy usage beats per-token pricing
- Complete control: Tune the model, batch requests, cache responses
The operational overhead is real but manageable. Talos Linux eliminates most of the traditional sysadmin burden, and Kubernetes provides familiar deployment patterns. Once set up, the cluster runs reliably with minimal intervention.
Next, I plan to explore the TensorRT-LLM inference toolkit for optimized performance.