A Kubernetes-native platform for serving machine learning models at scale
LiteLLM is a unified interface to call 100+ LLMs using the OpenAI format, providing a proxy server for multiple LLM providers.
LMCache is an LLM serving engine extension that stores and reuses KV caches across requests to reduce time-to-first-token (TTFT) and increase throughput. It integrates with vLLM to provide GPU-accelerated inference with shared KV cache management.
34 images