AI · Towards Data Science · about 4 hours ago

GPU Time-Slicing for Concurrent LLM Agents on Kubernetes

1 min read

Kubernetes reports two healthy pods sharing one GPU via CUDA time-slicing, yet p99 latency for the small, latency-sensitive agent worsened by 66% while median latency and throughput barely changed. Pod status checks do not surface the memory contention and queue starvation that time-slicing under the NVIDIA device plugin introduces.
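To see how a tail regression can hide behind a stable median, here is a minimal sketch that compares p50 and p99 between two latency samples. The helper names (`percentile`, `tail_report`) and every latency value are hypothetical and synthetic; they are not the article's measurements and are chosen only to illustrate the arithmetic.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile of samples, with q in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def tail_report(baseline_ms, shared_ms):
    """Percent change in p50 and p99 from baseline to shared-GPU runs."""
    report = {}
    for name, q in (("p50", 50), ("p99", 99)):
        before = percentile(baseline_ms, q)
        after = percentile(shared_ms, q)
        report[name] = (after - before) / before * 100
    return report


# Synthetic latencies in milliseconds (assumed values, not measured data):
# most requests are nearly unchanged, but the slowest 2% get much slower.
baseline = [100.0] * 98 + [150.0] * 2
shared = [101.0] * 98 + [210.0] * 2

report = tail_report(baseline, shared)
for name, change in report.items():
    print(f"{name} change: {change:+.1f}%")
```

In this made-up sample the median moves by about one percent while p99 moves by tens of percent, which is why a dashboard that tracks only medians, throughput, or pod health can miss the kind of regression described above.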


#kubernetes #gpu #llm
