PromQL Query Reference

Ready-to-use PromQL queries for monitoring llm-d deployments. Use these in the Prometheus UI or as the basis for Grafana panels.

To generate traffic and populate error metrics for testing, use the traffic generation script.
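Outside the Prometheus UI, the same queries can be sent to the Prometheus HTTP API instant-query endpoint (`/api/v1/query`). The following is a minimal sketch, assuming a Prometheus server reachable at `http://localhost:9090`; that address and the helper names `build_query_url` and `run_query` are illustrative placeholders, not part of llm-d.

```python
import json
import urllib.parse
import urllib.request

# Assumed address; replace with the URL of your Prometheus server.
PROMETHEUS_URL = "http://localhost:9090"

# "Overall Error Rate" query from Tier 1 of this reference.
OVERALL_ERROR_RATE = (
    "sum(rate(llm_d_router_epp_request_error_total[5m])) / "
    "sum(rate(llm_d_router_epp_request_total[5m]))"
)


def build_query_url(base_url: str, promql: str) -> str:
    """Return the instant-query URL for a PromQL expression."""
    encoded = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{encoded}"


def run_query(base_url: str, promql: str, timeout: float = 10.0) -> dict:
    """Send the query and return the decoded JSON response."""
    url = build_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

Calling `run_query(PROMETHEUS_URL, OVERALL_ERROR_RATE)` against a live server returns the JSON document that the Prometheus UI renders as a table or graph.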

Tier 1: Immediate Failure & Saturation Indicators

Start here when something looks wrong.

| Metric Need | PromQL Query |
| --- | --- |
| Overall Error Rate | `sum(rate(llm_d_router_epp_request_error_total[5m])) / sum(rate(llm_d_router_epp_request_total[5m]))` |
| Per-Model Error Rate | `sum by(model_name) (rate(llm_d_router_epp_request_error_total[5m])) / sum by(model_name) (rate(llm_d_router_epp_request_total[5m]))` |
| Request Preemptions | `sum by(pod, instance) (rate(vllm:num_preemptions[5m]))` |
| Overall Latency P90 | `histogram_quantile(0.90, sum by(le) (rate(llm_d_router_epp_request_duration_seconds_bucket[5m])))` |
| Overall Latency P99 | `histogram_quantile(0.99, sum by(le) (rate(llm_d_router_epp_request_duration_seconds_bucket[5m])))` |
| TTFT P99 per model | `histogram_quantile(0.99, sum by(le, model_name) (rate(vllm:time_to_first_token_seconds_bucket[5m])))` |
| Inter-Token Latency P99 | `histogram_quantile(0.99, sum by(le, model_name) (rate(vllm:inter_token_latency_seconds_bucket[5m])))` |
| Request Rate | `sum by(model_name) (rate(llm_d_router_epp_request_total[5m]))` |
| GPU Utilization | `avg by(gpu, node) (DCGM_FI_DEV_GPU_UTIL or nvidia_gpu_duty_cycle)` |
| EPP E2E Latency P99 | `histogram_quantile(0.99, sum by(le) (rate(llm_d_router_epp_scheduler_e2e_duration_seconds_bucket[5m])))` |
| EPP Plugin Latency P99 | `histogram_quantile(0.99, sum by(le, plugin_type) (rate(llm_d_router_epp_plugin_duration_seconds_bucket[5m])))` |

Tier 2: Diagnostic Drill-Down

Basic Model Serving

| Metric Need | PromQL Query |
| --- | --- |
| KV Cache Utilization | `avg by(pod, model_name) (vllm:kv_cache_usage_perc)` |
| Request Queue Depth | `sum by(pod, model_name) (vllm:num_requests_waiting)` |
| Active Requests | `avg by(pod) (vllm:num_requests_running)` |
| Total Throughput (tokens/sec) | `sum by(model_name, pod) (rate(vllm:prompt_tokens_total[5m]) + rate(vllm:generation_tokens_total[5m]))` |
| Generation Token Rate | `sum by(model_name, pod) (rate(vllm:generation_tokens_total[5m]))` |

Routing & Load Balancing

| Metric Need | PromQL Query |
| --- | --- |
| QPS per pod | `sum by(pod) (rate(llm_d_router_epp_request_total[5m]))` |
| Token distribution per pod | `sum by(pod) (rate(vllm:prompt_tokens_total[5m]) + rate(vllm:generation_tokens_total[5m]))` |
| Routing decision latency P99 | `histogram_quantile(0.99, sum by(le) (rate(llm_d_router_epp_plugin_duration_seconds_bucket[5m])))` |

Prefix Caching

| Metric Need | PromQL Query |
| --- | --- |
| Cache hit rate | `sum(rate(vllm:prefix_cache_hits_total[5m])) / sum(rate(vllm:prefix_cache_queries_total[5m]))` |
| Per-pod hit rate | `sum by(pod) (rate(vllm:prefix_cache_hits_total[5m])) / sum by(pod) (rate(vllm:prefix_cache_queries_total[5m]))` |
| EPP prefix indexer size | `llm_d_router_epp_prefix_indexer_size` |
| EPP prefix hit ratio P90 | `histogram_quantile(0.90, sum by(le) (rate(llm_d_router_epp_prefix_indexer_hit_ratio_bucket[5m])))` |

Prefill/Decode Disaggregation

| Metric Need | PromQL Query |
| --- | --- |
| Prefill worker utilization | `avg by(pod) (vllm:num_requests_running{pod=~".*prefill.*"})` |
| Decode KV cache utilization | `avg by(pod) (vllm:kv_cache_usage_perc{pod=~".*decode.*"})` |
| P/D decision ratio | `sum(rate(llm_d_router_epp_pd_decision_total{decision_type="prefill-decode"}[5m])) / sum(rate(llm_d_router_epp_pd_decision_total[5m]))` |
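
The pod selectors above rely on PromQL regex matchers (`=~`), which are fully anchored: the pattern must match the entire label value, so the surrounding `.*` is what makes them behave like "contains". A minimal Python sketch of the same matching follows. The pod names are hypothetical, and Python's `re` module stands in for the RE2 syntax Prometheus uses, which is equivalent for patterns this simple.

```python
import re

# Hypothetical pod names, for illustration only.
pods = [
    "ms-llm-d-prefill-7c9f",
    "ms-llm-d-decode-5d2a",
    "epp-6b8c",
]


def matches(pattern: str, value: str) -> bool:
    """Mimic PromQL's =~ matcher, which anchors the regex at both ends."""
    return re.fullmatch(pattern, value) is not None


prefill_pods = [p for p in pods if matches(".*prefill.*", p)]
decode_pods = [p for p in pods if matches(".*decode.*", p)]

# Without the surrounding .* the anchored pattern selects no pods here.
bare = [p for p in pods if matches("prefill", p)]
```

If your prefill and decode pods follow a different naming scheme, adjust the patterns or select on a dedicated label instead.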

Flow Control

These metrics require the flowControl feature gate to be enabled on the EPP.

| Metric Need | PromQL Query |
| --- | --- |
| Queue size | `sum(llm_d_router_epp_flow_control_queue_size)` |
| Queue size by priority | `sum by(priority) (llm_d_router_epp_flow_control_queue_size)` |
| Queue wait time P99 | `histogram_quantile(0.99, sum by(le) (rate(llm_d_router_epp_flow_control_request_queue_duration_seconds_bucket[5m])))` |
| Pool saturation | `llm_d_router_epp_flow_control_pool_saturation` |

Notes

Metric name prefixes: Current deployments use llm_d_router_epp_*. Older deployments may use inference_objective_* or inference_extension_*. If panels show "No data", update the prefix in the affected queries.
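
When a panel shows "No data", one way to find the right prefix is to generate the same query under each candidate prefix and try them in turn. The sketch below assumes that only the prefix differs between versions and that the rest of the metric name is unchanged, which is not confirmed by this reference; which legacy prefix applies to a given metric is also not stated here, so check the metric list exposed by your EPP. The helper names are hypothetical.

```python
import re

CURRENT_PREFIX = "llm_d_router_epp_"
# Older prefixes named in the note above.
LEGACY_PREFIXES = ("inference_objective_", "inference_extension_")


def with_prefix(promql: str, new_prefix: str) -> str:
    """Replace the current metric prefix in a query with another prefix."""
    return re.sub(rf"\b{re.escape(CURRENT_PREFIX)}", new_prefix, promql)


def candidate_queries(promql: str) -> list[str]:
    """Return the query as written, followed by one variant per legacy prefix.

    Assumption: the metric name after the prefix is identical across versions.
    """
    return [promql] + [with_prefix(promql, p) for p in LEGACY_PREFIXES]
```

For example, `candidate_queries("sum(rate(llm_d_router_epp_request_total[5m]))")` yields three queries to paste into the Prometheus UI one after another.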

Histograms: Always keep the le label in the aggregation when using histogram_quantile(), which needs the bucket boundaries carried by that label:

histogram_quantile(0.99, sum by(le) (rate(metric_name_bucket[5m])))
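
To see why the `le` label matters, it helps to look at what `histogram_quantile()` does with it. The following Python function is a simplified sketch of the linear interpolation PromQL applies to classic histogram buckets, not the Prometheus implementation itself; the bucket values in the usage note are made up for illustration.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (le, count) buckets.

    Simplified sketch: assumes 0 < q <= 1, cumulative counts, and a
    final +Inf bucket that holds the total number of observations.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, count_below = 0.0, 0.0
    for upper_bound, cumulative in buckets:
        if cumulative >= rank:
            if math.isinf(upper_bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            in_bucket = cumulative - count_below
            fraction = (rank - count_below) / in_bucket
            # Interpolate linearly between the bucket's lower and upper bound.
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, count_below = upper_bound, cumulative
    return math.nan
```

With buckets `[(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]`, the 0.95 quantile lands in the 0.5 to 1.0 bucket and is interpolated to roughly 0.81 seconds. If `le` were aggregated away, the function would receive a single count with no boundaries and could not produce an estimate.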

Error metrics only appear after the first error occurs. Use the traffic generation script to populate them for testing.
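
Until the first error is recorded, an error-rate query returns an empty result rather than zero. A client reading the Prometheus HTTP API response can treat that case explicitly. The sketch below assumes the standard instant-query response shape (`status`, `data.result`, and samples of the form `[timestamp, "value"]`); the function name and the choice of 0.0 as the default are illustrative.

```python
def first_value(response: dict, default: float = 0.0) -> float:
    """Return the first sample of an instant-vector response, or a default when empty."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    result = response["data"]["result"]
    if not result:
        # No series yet, e.g. the error counter has never been incremented.
        return default
    # Each sample looks like {"metric": {...}, "value": [timestamp, "value-as-string"]}.
    return float(result[0]["value"][1])
```

Defaulting to zero is a judgment call: it keeps dashboards readable on a fresh deployment, but it also hides the difference between "no errors" and "metric not scraped", so a separate check that the request-total metric exists may be worthwhile.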