Setting up a Kubernetes cluster on bare metal for GPU workloads can be challenging. Here I will walk through the entire process: renting a dedicated GPU server, installing Talos Linux, deploying a Kubernetes cluster, and running the DeepSeek-R1 model.
Running LLMs on dedicated hardware gives you predictable performance, no API rate limits, and complete control over your inference stack. The trade-off is operational complexity, but with the right tooling, it becomes manageable.
GPU connection path within the Kubernetes cluster
Why Talos Linux?
Talos Linux is a modern, immutable Linux distribution designed specifically for running Kubernetes. It provides several advantages for GPU workloads:
- Security-focused: Minimal attack surface with no SSH, no shell, and no login
- API-driven: Everything is managed through a secure API
- Immutable infrastructure: Prevents configuration drift and improves reliability
- Purpose-built for Kubernetes: Optimized specifically for container workloads
- Seamless upgrades: Simple, atomic upgrades with rollback capability
- NVIDIA support: Provides system extensions for NVIDIA drivers
Rent a GPU Server
Hetzner offers competitive GPU server pricing with two options:
| Server | GPU | VRAM | RAM | Price |
|---|---|---|---|---|
| GEX44 | NVIDIA RTX 4000 SFF | 20 GB | 64 GB DDR4 | $205/mo |
| GEX130 | NVIDIA RTX 6000 | 48 GB | 128 GB DDR5 | $931/mo |
I chose the GEX44 for this setup. After submitting the order, the server was ready within an hour. Make sure the server is booted into rescue mode, since we will be installing Talos Linux from Hetzner's rescue system.
Talos Linux Installation
Image Selection
We need to download Talos Linux with NVIDIA drivers. The Talos Image Factory lets you build custom images with specific extensions.
Talos uses a "system extensions" concept that allows adding drivers and additional components to the base image without compromising its immutable design. For GPU workloads, we need NVIDIA-specific extensions:
- siderolabs/nvidia-container-toolkit-production: NVIDIA container runtime extension (550.144.03-v1.17.3)
- siderolabs/nonfree-kmod-nvidia-production: NVIDIA driver kernel module (550.144.03-v1.9.5)
Select Bare-metal Machine type with the latest Talos version (1.9.5) and amd64 architecture. After selecting extensions, choose the Disk Image (raw) option.
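If you prefer scripting over the web UI, the same image can be produced by posting a schematic to the Image Factory API; the returned schematic ID goes into the download URL used in the next step. A minimal sketch with the two NVIDIA extensions:
# Create an Image Factory schematic; the JSON response contains the schematic ID
curl -X POST --data-binary @- https://factory.talos.dev/schematics <<EOF
customization:
  systemExtensions:
    officialExtensions:
      - siderolabs/nonfree-kmod-nvidia-production
      - siderolabs/nvidia-container-toolkit-production
EOF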
Installation
Hetzner dedicated servers cannot be booted from a custom ISO, but we can boot into Hetzner's rescue system and write the Talos image directly to the drive:
# Download the Talos image
cd /tmp
wget -O /tmp/talos.raw.zst https://factory.talos.dev/image/26124abcbd408be693df9fe852c80ef1e6cc178e34d7d7d8430a28d1130b4227/v1.9.5/metal-amd64.raw.zst
# Check available disks
lsblk
# nvme0n1 259:0 0 1.7T 0 disk
# nvme1n1 259:1 0 1.7T 0 disk
# Write the image to disk
zstd -d -c talos.raw.zst | dd of=/dev/nvme0n1 bs=4M
# Mount EFI and create boot entry
mkdir -p /mnt/efi
mount /dev/nvme0n1p1 /mnt/efi
efibootmgr -c -d /dev/nvme0n1 -p 1 -L "Talos Linux" -l '\EFI\BOOT\BOOTX64.EFI'
# Reboot into Talos
reboot
From this point on, all interaction with the node happens through talosctl.
Cluster Configuration
Install talosctl from the Talos documentation, then generate cluster secrets:
# Generate secrets and config
mkdir cluster-config && cd cluster-config
talosctl gen secrets --output-file secrets.yaml
export CLUSTER_NAME="gpu-cluster"
export NODE_IP="176.9.98.109"
export API_ENDPOINT="https://$NODE_IP:6443"
talosctl gen config \
--with-secrets secrets.yaml \
--output-types talosconfig \
--output talosconfig \
$CLUSTER_NAME \
$API_ENDPOINT
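Before applying anything, point talosctl at the node. Assuming the talosconfig generated above sits in the current directory, that looks like:
# Use the generated talosconfig and target the node for subsequent commands
export TALOSCONFIG=$PWD/talosconfig
talosctl config endpoint $NODE_IP
talosctl config node $NODE_IP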
GPU Kernel Modules
Create the machine configuration with GPU support. The kernel modules must be loaded for the GPU to be accessible:
# nodes/n1.yaml
machine:
  install:
    disk: none
    diskSelector:
      size: '< 2TB'
    image: ghcr.io/siderolabs/installer:v1.9.5
  network:
    hostname: n1
    interfaces:
      - interface: eth0
        dhcp: true
  kernel:
    modules:
      - name: nvidia
      - name: nvidia_uvm
      - name: nvidia_drm
      - name: nvidia_modeset
  sysctls:
    net.core.bpf_jit_harden: 1
  files:
    - op: create
      content: |
        [plugins]
          [plugins."io.containerd.cri.v1.runtime"]
            [plugins."io.containerd.cri.v1.runtime".containerd]
              default_runtime_name = "nvidia"
      path: /etc/cri/conf.d/20-customization.part
The files section configures containerd to use the NVIDIA runtime by default, ensuring containers can access the GPU.
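The gen config command in the next step also references patches/allow-controlplane-workloads.yaml. Since this is a single-node cluster, the control plane node has to accept regular workloads; the patch is essentially just the standard Talos setting:
# patches/allow-controlplane-workloads.yaml
cluster:
  allowSchedulingOnControlPlanes: true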
Bootstrap the Cluster
Apply the configuration and bootstrap:
# Generate rendered config
talosctl gen config \
--output rendered/n1.yaml \
--output-types controlplane \
--dns-domain local.$CLUSTER_NAME \
--with-cluster-discovery=false \
--with-secrets secrets.yaml \
--config-patch @patches/allow-controlplane-workloads.yaml \
--config-patch @nodes/n1.yaml \
$CLUSTER_NAME \
$API_ENDPOINT
# Apply config to node
talosctl --nodes $NODE_IP apply-config --file rendered/n1.yaml --insecure
# Bootstrap the cluster
talosctl --nodes $NODE_IP bootstrap
Monitor the cluster status with the Talos dashboard:
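For a single node that is just:
# Interactive dashboard with logs, service health, and resource usage for the node
talosctl --nodes $NODE_IP dashboard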
Export the kubeconfig once the cluster is ready:
talosctl -n $NODE_IP kubeconfig
kubectl get nodes -o wide
# NAME STATUS ROLES AGE VERSION INTERNAL-IP
# n1 Ready control-plane 33m v1.32.2 176.9.98.109
Discover GPU Devices
In Talos Linux, everything is API-driven, so commands like lspci will not work. Use talosctl to discover GPU devices:
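The PCIDevice resources live in the hardware namespace; listing them and filtering for the NVIDIA entries looks roughly like this:
# List PCI devices discovered by Talos and keep the NVIDIA entries
talosctl --nodes $NODE_IP get pcidevices | grep -i nvidia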
| NODE | NAMESPACE | TYPE | ID | CLASS | VENDOR | PRODUCT |
|---|---|---|---|---|---|---|
| 176.9.98.109 | hardware | PCIDevice | 0000:01:00.0 | Display controller | NVIDIA | AD104GL [RTX 4000 SFF] |
| 176.9.98.109 | hardware | PCIDevice | 0000:01:00.1 | Multimedia controller | NVIDIA | Audio device |
Add GPU Capacity
Even though the GPU is detected and ready on the node, Kubelet does not expose it as capacity. Device plugins communicate with Kubelet to expose additional resources like GPUs.
NVIDIA device plugin exposes GPU resources to Kubernetes
Deploy the NVIDIA device plugin:
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.0/deployments/static/nvidia-device-plugin.yml
Verify GPU capacity is now available:
kubectl get no -o json | jq '.items[0].status.capacity'
# {
# "cpu": "20",
# "memory": "65626196Ki",
# "nvidia.com/gpu": "1",
# "pods": "110"
# }
The nvidia.com/gpu: "1" confirms the GPU is now available for Kubernetes workloads.
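A quick way to confirm end-to-end GPU access is a throwaway pod that requests the GPU and runs nvidia-smi. This is just a sketch; the gpu-test name and the nvidia/cuda base image tag are illustrative choices:
# gpu-test.yaml: minimal pod that requests the GPU and prints nvidia-smi output
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
    - name: nvidia-smi
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
Apply it with kubectl apply -f gpu-test.yaml and check kubectl logs gpu-test; the output should list the RTX 4000.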
Running DeepSeek-R1
Deploy Ollama
Create a namespace that allows privileged containers (required for GPU access):
kubectl apply -f - <<EOF
apiVersion: v1
kind: Namespace
metadata:
  name: ollama
  labels:
    pod-security.kubernetes.io/enforce: privileged
EOF
Deploy Ollama using the Helm chart:
helm repo add ollama-helm https://otwld.github.io/ollama-helm/
helm repo update
helm install ollama ollama-helm/ollama --namespace ollama \
--set ollama.gpu.enabled=true,ollama.gpu.type=nvidia
Verify the deployment:
kubectl get pods -n ollama -o wide
# NAME READY STATUS RESTARTS AGE
# ollama-776884645f-l66cv 1/1 Running 0 2m39s
kubectl get deploy -n ollama ollama -o yaml | grep nvidia
# nvidia.com/gpu: "1"
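To double-check that the Ollama pod actually sees the GPU, nvidia-smi can be run inside it:
# Run nvidia-smi inside the Ollama pod to confirm the GPU is visible
kubectl exec -n ollama deploy/ollama -- nvidia-smi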
Pull and Run the Model
Set up port-forward to access Ollama from your local machine:
kubectl port-forward svc/ollama -n ollama 11434:11434
The DeepSeek-R1 32B model with 4-bit quantization requires approximately 16GB of GPU memory. The RTX 4000 with 20GB VRAM is sufficient with room to spare.
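Assuming the port-forward above is running and Ollama is listening on its default port 11434, the model can be pulled and queried through the Ollama HTTP API, for example:
# Pull the 4-bit quantized 32B model (large download, so this takes a while)
curl http://localhost:11434/api/pull -d '{"model": "deepseek-r1:32b"}'
# Run a quick test prompt against the model
curl http://localhost:11434/api/generate -d '{"model": "deepseek-r1:32b", "prompt": "Why is the sky blue?", "stream": false}'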
The model is now running on your dedicated GPU hardware with full control over the inference stack.
Wrapping Up
Running DeepSeek-R1 on bare metal with Talos Linux provides:
- Predictable performance: No shared resources or noisy neighbors
- No API limits: Run as many inferences as your hardware allows
- Cost efficiency: At $205/month, heavy usage beats per-token pricing
- Complete control: Tune the model, batch requests, cache responses
The operational overhead is real but manageable. Talos Linux eliminates most of the traditional sysadmin burden, and Kubernetes provides familiar deployment patterns. Once set up, the cluster runs reliably with minimal intervention.
Next, I plan to explore the TensorRT-LLM inference toolkit for optimized performance.