PromQL Query Reference

Ready-to-use PromQL queries for monitoring llm-d deployments. Use these in the Prometheus UI or as the basis for Grafana panels.
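
Outside the UI, the same queries can be sent to the Prometheus HTTP API's instant-query endpoint, `/api/v1/query`. The sketch below only builds the request URL and sends nothing. The base URL `http://localhost:9090` and the helper name `build_query_url` are assumptions for illustration; substitute the address of your own Prometheus server.

```python
from urllib.parse import urlencode

# Hypothetical base URL; replace with the address of your Prometheus server.
PROMETHEUS_URL = "http://localhost:9090"


def build_query_url(promql: str, base_url: str = PROMETHEUS_URL) -> str:
    """Return an instant-query URL for the Prometheus HTTP API."""
    # urlencode percent-escapes the parentheses, brackets, and quotes in the query.
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# One of the queries from this page, used unchanged.
request_rate = "sum by(model_name) (rate(llm_d_router_epp_request_total[5m]))"
url = build_query_url(request_rate)
print(url)
```

Encoding the query with `urlencode` rather than concatenating strings keeps label selectors such as `{pod=~".*"}` intact when they are passed as a URL parameter.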

To generate traffic and populate error metrics for testing, use the traffic generation script.

Tier 1: Immediate Failure & Saturation Indicators

Start here when something looks wrong.

Metric Need	PromQL Query
Overall Error Rate	`sum(rate(llm_d_router_epp_request_error_total[5m])) / sum(rate(llm_d_router_epp_request_total[5m]))`
Per-Model Error Rate	`sum by(model_name) (rate(llm_d_router_epp_request_error_total[5m])) / sum by(model_name) (rate(llm_d_router_epp_request_total[5m]))`
Request Preemptions	`sum by(pod, instance) (rate(vllm:num_preemptions_total[5m]))`
Overall Latency P90	`histogram_quantile(0.90, sum by(le) (rate(llm_d_router_epp_request_duration_seconds_bucket[5m])))`
Overall Latency P99	`histogram_quantile(0.99, sum by(le) (rate(llm_d_router_epp_request_duration_seconds_bucket[5m])))`
TTFT P99 per model	`histogram_quantile(0.99, sum by(le, model_name) (rate(vllm:time_to_first_token_seconds_bucket[5m])))`
Inter-Token Latency P99	`histogram_quantile(0.99, sum by(le, model_name) (rate(vllm:inter_token_latency_seconds_bucket[5m])))`
Request Rate	`sum by(model_name) (rate(llm_d_router_epp_request_total[5m]))`
GPU Utilization	`avg by(gpu, node) (DCGM_FI_DEV_GPU_UTIL or nvidia_gpu_duty_cycle)`
EPP E2E Latency P99	`histogram_quantile(0.99, sum by(le) (rate(llm_d_router_epp_scheduler_e2e_duration_seconds_bucket[5m])))`
EPP Plugin Latency P99	`histogram_quantile(0.99, sum by(le, plugin_type) (rate(llm_d_router_epp_plugin_duration_seconds_bucket[5m])))`

Tier 2: Diagnostic Drill-Down

Basic Model Serving

Metric Need	PromQL Query
KV Cache Utilization	`avg by(pod, model_name) (vllm:kv_cache_usage_perc)`
Request Queue Depth	`sum by(pod, model_name) (vllm:num_requests_waiting)`
Active Requests	`avg by(pod) (vllm:num_requests_running)`
Total Throughput (tokens/sec)	`sum by(model_name, pod) (rate(vllm:prompt_tokens_total[5m]) + rate(vllm:generation_tokens_total[5m]))`
Generation Token Rate	`sum by(model_name, pod) (rate(vllm:generation_tokens_total[5m]))`

Routing & Load Balancing

Metric Need	PromQL Query
QPS per pod	`sum by(pod) (rate(llm_d_router_epp_request_total[5m]))`
Token distribution per pod	`sum by(pod) (rate(vllm:prompt_tokens_total[5m]) + rate(vllm:generation_tokens_total[5m]))`
Routing decision latency P99	`histogram_quantile(0.99, sum by(le) (rate(llm_d_router_epp_plugin_duration_seconds_bucket[5m])))`

Prefix Caching

Metric Need	PromQL Query
Cache hit rate	`sum(rate(vllm:prefix_cache_hits_total[5m])) / sum(rate(vllm:prefix_cache_queries_total[5m]))`
Per-pod hit rate	`sum by(pod) (rate(vllm:prefix_cache_hits_total[5m])) / sum by(pod) (rate(vllm:prefix_cache_queries_total[5m]))`
EPP prefix indexer size	`llm_d_router_epp_prefix_indexer_size`
EPP prefix hit ratio P90	`histogram_quantile(0.90, sum by(le) (rate(llm_d_router_epp_prefix_indexer_hit_ratio_bucket[5m])))`

Prefill/Decode Disaggregation

Metric Need	PromQL Query
Prefill worker utilization	`avg by(pod) (vllm:num_requests_running{pod=~".*prefill.*"})`
Decode KV cache utilization	`avg by(pod) (vllm:kv_cache_usage_perc{pod=~".*decode.*"})`
P/D decision ratio	`sum(rate(llm_d_router_epp_pd_decision_total{decision_type="prefill-decode"}[5m])) / sum(rate(llm_d_router_epp_pd_decision_total[5m]))`
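
PromQL's `=~` matcher is fully anchored: the regular expression must match the entire label value, so selecting every pod whose name contains `prefill` needs a pattern such as `.*prefill.*`. The sketch below approximates that behavior with Python's `re.fullmatch`. The pod names are hypothetical, and Prometheus uses RE2 syntax, which agrees with Python for simple patterns like these.

```python
import re

# Hypothetical pod names, for illustration only.
pods = ["llm-d-prefill-0", "llm-d-decode-0", "prefill"]


def promql_regex_match(pattern: str, value: str) -> bool:
    """Approximate PromQL's =~ matcher, which anchors the regex at both ends."""
    return re.fullmatch(pattern, value) is not None


# A pattern wrapped in .* behaves like a substring match.
wrapped = [p for p in pods if promql_regex_match(".*prefill.*", p)]
# A bare name matches only a label value that is exactly that name.
bare = [p for p in pods if promql_regex_match("prefill", p)]

print(wrapped)
print(bare)
```

If your prefill and decode pods carry a dedicated role label, matching on that label is usually more robust than matching on the pod name.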

Flow Control

Requires the flowControl feature gate enabled on the EPP.

Metric Need	PromQL Query
Queue size	`sum(llm_d_router_epp_flow_control_queue_size)`
Queue size by priority	`sum by(priority) (llm_d_router_epp_flow_control_queue_size)`
Queue wait time P99	`histogram_quantile(0.99, sum by(le) (rate(llm_d_router_epp_flow_control_request_queue_duration_seconds_bucket[5m])))`
Pool saturation	`llm_d_router_epp_flow_control_pool_saturation`

Notes

Metric name prefixes: Current deployments use llm_d_router_epp_*. Older deployments may use inference_objective_* or inference_extension_*. If panels show "No data", update the metric names in the queries to match the prefix your deployment exports.

Histograms: Always include by(le) when using histogram_quantile():

`histogram_quantile(0.99, sum by(le) (rate(metric_name_bucket[5m])))`
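
To see why the `le` label has to survive aggregation, it helps to look at what a quantile estimate over cumulative buckets needs as input. The function below is a simplified sketch of bucket-based linear interpolation, not Prometheus's actual implementation. It assumes classic histograms, buckets sorted by upper bound, a final `+Inf` bucket, and a lowest bucket starting at 0. The name `estimate_quantile` and the sample counts are made up for illustration.

```python
import math
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets."""
    total = buckets[-1][1]  # the +Inf bucket holds the total observation count
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper) or count == prev_count:
                # Nothing to interpolate against: fall back to the lower bound.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return math.nan


# Made-up cumulative counts keyed by upper bound (the le label), in seconds.
sample = [(0.1, 50.0), (0.5, 90.0), (1.0, 99.0), (math.inf, 100.0)]
print(estimate_quantile(0.75, sample))
```

The upper bounds are the only latency information the estimate has, so summing without `by(le)` collapses every bucket into one series and leaves nothing to interpolate between.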

Error metrics only appear after the first error occurs. Use the traffic generation script to populate them for testing.
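
One consequence is that the error-rate queries return no data, rather than 0, on a deployment that has not yet seen an error. A client that consumes query results can treat the missing error series as zero errors. The sketch below shows one way to do that; the function name `error_ratio` and the convention of using `None` for "no data" are assumptions, not part of llm-d or Prometheus.

```python
from typing import Optional


def error_ratio(error_rate: Optional[float], total_rate: Optional[float]) -> Optional[float]:
    """Return errors/total, treating a missing error series as zero errors."""
    if not total_rate:
        # No traffic, or no request data at all: the ratio is undefined.
        return None
    return (error_rate or 0.0) / total_rate


print(error_ratio(None, 12.5))  # error series absent, traffic present
print(error_ratio(0.5, 10.0))   # both series present
print(error_ratio(None, None))  # nothing scraped yet
```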
