The Gateway API Inference Extension project addresses the unique challenges of running AI inference workloads on Kubernetes by introducing two new Custom Resource Definitions (CRDs): InferenceModel and InferencePool. The extension improves request routing and load balancing through an intelligent endpoint-selection process that uses real-time metrics reported by the model servers, raising GPU utilization and overall system performance.
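A minimal sketch of how the two CRDs might be declared together. This is illustrative only: the resource names, labels, and model identifier below are made up, and the field names follow the project's alpha API, which may differ between versions.

```yaml
# Hypothetical example; names and API version are assumptions, not canon.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-llama-pool            # hypothetical pool name
spec:
  targetPortNumber: 8000           # port the model-server pods listen on
  selector:
    app: vllm-llama                # labels selecting the serving pods
  extensionRef:
    name: endpoint-picker          # endpoint-selection extension service
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: chat-model                 # hypothetical model resource name
spec:
  modelName: example/llama-8b      # model name clients request (assumed)
  criticality: Critical            # priority hint for load shedding
  poolRef:
    name: vllm-llama-pool          # pool that serves this model
```

The split mirrors the division of concerns described above: the InferencePool groups the GPU-backed serving pods and points at the endpoint-selection extension, while the InferenceModel maps a client-facing model name onto that pool.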