Skip to main content

No-Kubernetes Deployment

llm-d's reference deployment runs on Kubernetes — workers are managed by Kubernetes Deployments, the EPP discovers them through an InferencePool, and the platform handles networking and lifecycle. Many environments don't have a Kubernetes control plane, though: HPC schedulers like Slurm or LSF launch workers dynamically, Ray-based stacks run workers as actors, bare-metal inference farms operate without K8s, and a single workstation with a couple of GPUs is often enough for development.

The No-Kubernetes path runs the same routing stack — the llm-d EPP, Envoy, and one or more model servers — directly as host processes or containers. The EPP gets its endpoint inventory from a YAML file on disk via the file-discovery plugin instead of watching an InferencePool over the Kubernetes API; everything else (EPP plugin set, scoring, Envoy ext_proc, vLLM arguments) is unchanged.

important

Without Kubernetes, some pieces of the llm-d stack are out of scope: InferenceObjective-driven FlowControl, the InferenceModelRewrite model-name rewriter, and PodMonitor-based Prometheus discovery. Scoring, prefix-cache affinity, saturation-based admission, and Prometheus metrics on --metrics-port all work; see the parity caveats for the full list.

Deploy

See the no-Kubernetes deployment guide for manifests and step-by-step deployment.

Architecture

No-Kubernetes Deployment

Traffic flows the same way as the optimized-baseline path:

client -> Envoy listener :8081
-> ext_proc gRPC :9002 (EPP picks endpoint, sets header)
-> ORIGINAL_DST cluster
-> reads x-gateway-destination-endpoint from EPP response
-> <address>:<port> of the vLLM worker chosen by the EPP

The EPP's datastore is populated entirely from endpoints.yaml. With watchFile: true the file is hot-reloaded on every atomic rewrite — adding, removing, or relabelling a worker takes effect without restarting the EPP. The plugin set, weights, and scheduling profile match the Optimized Baseline, so routing behaviour is identical to the Kubernetes-based deployment.

Further Reading