Dynamic Resource Allocation in Kubernetes: The End of GPU Hunger Games


How Kubernetes v1.34 finally solved the "my ML job is stuck waiting for a GPU that's sitting idle on node-42" problem
Picture this: It's 3 AM, your critical ML training job has been "Pending" for 6 hours, and somewhere in your 200-node cluster, there's a perfectly good GPU just sitting there, twiddling its digital thumbs. The scheduler can't see it, your pod can't claim it, and you're debugging YAML like it's 2019.
Welcome to the pre-DRA world of Kubernetes resource management, where GPUs were treated like mysterious black boxes that required incantations (device plugins), manual node labeling, and a lot of prayer.
Dynamic Resource Allocation (DRA) changes all of that. Think of it as Kubernetes finally learning to speak "GPU" fluently instead of just pointing and grunting.
Before DRA, getting a GPU in Kubernetes was like trying to order food at a restaurant where:
Here's what we used to do:
# The old way - crossing fingers and hoping
apiVersion: v1
kind: Pod
spec:
  nodeSelector:
    accelerator: nvidia-tesla-k80  # Hope this label exists
  containers:
  - name: training
    resources:
      limits:
        nvidia.com/gpu: 1  # Hope this device plugin works
Problems with this approach:
DRA introduces three new Kubernetes resources that work together like a well-orchestrated team:
Think of DeviceClass as the restaurant menu that actually describes what's available:
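apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: high-memory-gpu
spec:
  selectors:
  - cel:
      # Attribute and capacity names are defined by the driver; the ones here are illustrative.
      # Memory: 24G or more. Compute capability: 8.0 or higher (Ampere+), compared as a version, not a string.
      expression: |
        device.driver == "gpu.nvidia.com" &&
        device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("24G")) >= 0 &&
        device.attributes["gpu.nvidia.com"].cudaComputeCapability.compareTo(semver("8.0.0")) >= 0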
This says: "I'm defining a class of devices that are NVIDIA GPUs with at least 24GB memory and compute capability 8.0 or higher."
ResourceClaim is like placing a specific order:
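apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: transformer-training-gpu
  namespace: ml-research
spec:
  devices:
    requests:
    - name: primary-gpu
      exactly:
        deviceClassName: high-memory-gpu
        allocationMode: ExactCount
        count: 1
        # Per-request CEL filters go under selectors; constraints only relate devices to each other.
        # The attribute name is illustrative and depends on the driver.
        selectors:
        - cel:
            expression: 'device.attributes["gpu.nvidia.com"].cudaDriverVersion.compareTo(semver("12.0.0")) >= 0'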
This says: "I need one high-memory GPU with CUDA 12.0 or newer for my transformer training."
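To actually use that claim, a Pod has to reference it by name. A minimal sketch, assuming the ResourceClaim above already exists in the ml-research namespace; the pod name and image are placeholders, not part of the original setup:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: transformer-training         # hypothetical pod name
  namespace: ml-research
spec:
  restartPolicy: Never
  resourceClaims:
  - name: gpu                        # pod-local name for the claim
    resourceClaimName: transformer-training-gpu
  containers:
  - name: training
    image: registry.example.com/ml/trainer:latest  # placeholder image
    resources:
      claims:
      - name: gpu                    # must match spec.resourceClaims[].name
```

The pod stays Pending until the scheduler can allocate a device that satisfies the claim, and it is only scheduled onto a node where that device is available.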
ResourceSlice objects (created automatically by device drivers) tell Kubernetes what's actually available:
apiVersion: resource.k8s.io/v1
kind: ResourceSlice
metadata:
  name: node-gpu-worker-01
spec:
  driver: gpu.nvidia.com
  nodeName: gpu-worker-01
  pool:
    name: nvidia-driver-pool
    generation: 1
    resourceSliceCount: 1
  devices:
  - name: gpu-0
    # Attribute names are illustrative; each value carries an explicit type.
    attributes:
      cudaDriverVersion:
        version: "12.2.0"
      cudaComputeCapability:
        version: "8.6.0"
      pcieGeneration:
        int: 4
    capacity:
      memory:
        value: 24Gi
Let's say you're building a platform that serves three different teams:
1. Research Team (needs the latest hardware):
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: research-gpu
spec:
  selectors:
  - cel:
      # Illustrative attribute names. Memory: 48G or more (RTX 6000 class).
      expression: |
        device.driver == "gpu.nvidia.com" &&
        device.attributes["gpu.nvidia.com"].architecture == "Ada Lovelace" &&
        device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("48G")) >= 0
2. Production Inference (needs reliable, efficient hardware):
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: inference-gpu
spec:
  selectors:
  - cel:
      # Illustrative attribute names. Memory: 16G minimum.
      expression: |
        device.driver == "gpu.nvidia.com" &&
        device.attributes["gpu.nvidia.com"].tensorCores == true &&
        device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("16G")) >= 0
3. Development Team (can use older hardware):
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: dev-gpu
spec:
  selectors:
  - cel:
      # 8G is fine
      expression: |
        device.driver == "gpu.nvidia.com" &&
        device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("8G")) >= 0
Now, each team can request exactly what they need:
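# Template from which each pod gets its own ResourceClaim
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: research-gpu-claim
  namespace: research
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: research-gpu
          allocationMode: ExactCount
          count: 2  # Multi-GPU training
---
# Research deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: research-training
  namespace: research
spec:
  selector:
    matchLabels:
      app: research-training
  template:
    metadata:
      labels:
        app: research-training
    spec:
      resourceClaims:
      - name: gpu
        resourceClaimTemplateName: research-gpu-claim
      containers:
      - name: trainer
        image: pytorch/pytorch:nightly
        resources:
          claims:
          - name: gpu  # refers to spec.resourceClaims[].name
        # No CUDA_VISIBLE_DEVICES wiring here: the DRA driver exposes the allocated GPUs to the container.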
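When a request asks for more than one device, as in the multi-GPU case above, constraints are the place to say how those devices must relate to each other. A rough sketch, reusing the research-gpu class and assuming the driver publishes a productName attribute (attribute names vary by driver, so treat this one as hypothetical): both GPUs must report the same product.

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: research-gpu-matched         # hypothetical name
  namespace: research
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: research-gpu
          allocationMode: ExactCount
          count: 2
      constraints:
      - requests: ["gpu"]
        # Hypothetical attribute; every device allocated for the request must share its value
        matchAttribute: gpu.nvidia.com/productName
```

Filtering individual devices still belongs in a DeviceClass or in the request's own selectors; constraints only compare the devices that end up allocated together.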
When you create a ResourceClaim, here's the invisible choreography:
CUDA_VISIBLE_DEVICES automatically
DRA isn't just about GPUs. It works with any specialized hardware:
Smart NICs for high-frequency trading:
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: ultra-low-latency-nic
spec:
  selectors:
  - cel:
      # Hypothetical driver and attribute names, for illustration only
      expression: |
        device.driver == "connectx.mellanox.com" &&
        device.attributes["connectx.mellanox.com"].latency == "sub-microsecond"
FPGAs for signal processing:
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: signal-processing-fpga
spec:
  selectors:
  - cel:
      # Hypothetical driver and attribute names, for illustration only
      expression: |
        device.driver == "fpga.xilinx.com" &&
        device.attributes["fpga.xilinx.com"].logicCells >= 1000000
Before DRA:
nvidia-smi"
With DRA:
kubectl get resourceclaims - see exactly what's requested
kubectl get resourceslices - see what hardware is available
kubectl describe pod my-training-pod - clear resource allocation status
You don't have to rip everything out at once. Here's a gradual migration path:
Phase 1: Start with new workloads using DRA
Phase 2: Create DeviceClasses that match your existing device plugin labels
Phase 3: Migrate existing workloads using ResourceClaimTemplates in deployments
Phase 4: Retire device plugins once everything is migrated
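For Phase 2, one way to mirror an existing node label such as accelerator: nvidia-tesla-k80 is a DeviceClass that selects on a product attribute. This is only a sketch: the productName attribute and its value are assumptions about what your driver publishes, so check your ResourceSlices for the real names first.

```yaml
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: nvidia-tesla-k80             # mirrors the old accelerator label value
spec:
  selectors:
  - cel:
      # Hypothetical attribute name and value; check your ResourceSlices
      expression: |
        device.driver == "gpu.nvidia.com" &&
        device.attributes["gpu.nvidia.com"].productName == "Tesla K80"
```

Naming the class after the old label keeps the mapping obvious while workloads move over in Phase 3.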
Early benchmarks show DRA actually improves scheduling performance:
DRA is just the beginning. Future enhancements might include:
Dynamic Resource Allocation transforms Kubernetes from a platform that tolerates specialized hardware to one that embraces it. No more fighting with device plugins, no more mysterious "Pending" pods, no more late-night debugging sessions trying to figure out why your GPU job won't start.
It's Kubernetes growing up and finally understanding that not all resources are created equal — and that's perfectly fine.
Ready to try DRA? Check the official documentation and start with a simple GPU DeviceClass. Your future self (and your ML team) will thank you.
Have war stories from the pre-DRA days? Found interesting ways to use ResourceClaims? Share them — the Kubernetes community thrives on real-world experiences.