Chainguard Container for vllm-openai

vLLM is a high-throughput and memory-efficient inference engine for Large Language Models (LLMs). It provides an OpenAI-compatible API server for production LLM deployments with GPU acceleration.

Chainguard Containers are regularly-updated, secure-by-default container images.

Download this Container Image

For those with access, this container image is available on cgr.dev:

docker pull cgr.dev/ORGANIZATION/vllm-openai:latest

Be sure to replace the ORGANIZATION placeholder with the name used for your organization's private repository within the Chainguard Registry.

Compatibility Notes

Chainguard's vLLM image is comparable to the vllm/vllm-openai image, with several key differences:

Package Differences

The following packages have been modified or removed compared to upstream:

  • CuPy: Removed entirely. CuPy's vendor backends require cuDNN 8, which conflicts with cuDNN 9 shipped in this image. CuPy is optional for vLLM and primarily used by Ray
  • torch-c-dlpack-ext: Not pre-installed. You may see a warning about EnvTensorAllocator not being enabled during startup. This is an optional TVM optimization and does not affect vLLM functionality
  • LMCache: Not pre-installed. For distributed KV cache sharing across multiple vLLM instances, install via pip install lmcache at runtime, as shown in the sketch below. See the vLLM LMCache Examples for configuration
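
As a rough sketch, LMCache can be added to a running container. This assumes pip is on the container's PATH (as the note above implies) and that the runtime user can write to the Python environment; the container name is illustrative:

docker exec -it <vllm-container-name> pip install lmcache

For a persistent setup, bake the package into a derived image instead (see the multi-stage build sketch later on this page).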

Expert Parallel (EP) Kernels for MoE Models

For Expert Parallel deployment with Mixture-of-Experts (MoE) models such as DeepSeek-V2/V3, the EP kernels are not pre-installed in this image. Unlike the upstream image, which ships pre-built EP kernels, Chainguard provides a build script instead, because upstream pins specific component versions and applies custom patches to components like NVSHMEM, pplx-kernels, and DeepEP.

Building EP Kernels

The image includes /vllm-workspace/install_python_libraries.sh to build the required components. Before running it, ensure you have:

  1. A GPU with the appropriate CUDA architecture
  2. TORCH_CUDA_ARCH_LIST set for your GPU (e.g., "8.0;9.0" for A100/H100)

# Start an interactive container with persistent storage for the build
docker run --rm -it --gpus all \
  --shm-size 8g \
  -v ep_kernels_cache:/vllm-workspace/ep_kernels_workspace \
  -e TORCH_CUDA_ARCH_LIST="8.0;9.0" \
  cgr.dev/$ORGANIZATION/vllm-openai:latest \
  bash

# Inside the container, run the build script
/vllm-workspace/install_python_libraries.sh /vllm-workspace/ep_kernels_workspace

The script builds:

  • NVSHMEM (with DeepSeek patches) - NVIDIA shared memory library
  • pplx-kernels - Perplexity AI's optimized MoE kernels
  • DeepEP - DeepSeek's Expert Parallel kernels

Build time is approximately 10-20 minutes depending on hardware. The built kernels persist in the mounted volume for reuse.
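
To confirm that the kernels persisted, you can list the named volume's contents. This is a hedged sketch that reuses the pattern of the interactive example above (where a command is passed to the image) and assumes ls is available in the container:

# List the build artifacts stored in the named volume
docker run --rm \
  -v ep_kernels_cache:/vllm-workspace/ep_kernels_workspace \
  cgr.dev/$ORGANIZATION/vllm-openai:latest \
  ls /vllm-workspace/ep_kernels_workspace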

Using Pre-built EP Kernels

After building once, mount the volume when running vLLM:

docker run --rm -it --gpus all \
  --shm-size 8g \
  -v ep_kernels_cache:/vllm-workspace/ep_kernels_workspace \
  -p 8000:8000 \
  cgr.dev/$ORGANIZATION/vllm-openai:latest \
  --model deepseek-ai/DeepSeek-V2-Lite \
  --tensor-parallel-size 2 \
  --host 0.0.0.0

See the Expert Parallel Deployment Guide for detailed configuration options.

Running vLLM

Prerequisites

The examples below require an NVIDIA GPU, a working NVIDIA driver on the host, and the NVIDIA Container Toolkit so that Docker can expose GPUs to containers via --gpus all.

Basic Usage

Set the following environment variable to the name of your organization:

ORGANIZATION=my-organization

Start the vLLM OpenAI-compatible server with a model:

docker run --rm -it \
  --gpus all \
  --shm-size 1g \
  -p 8000:8000 \
  -v /path/to/cache:/root/.cache \
  cgr.dev/$ORGANIZATION/vllm-openai:latest \
  --model facebook/opt-125m \
  --host 0.0.0.0 \
  --port 8000

The server exposes OpenAI-compatible endpoints at http://localhost:8000.

Testing the API

Once the server is running, test it with curl:

# Check health
curl http://localhost:8000/health

# List models
curl http://localhost:8000/v1/models

# Generate completions
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "prompt": "Deep learning is",
    "max_tokens": 50
  }'
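
The chat endpoint is exposed as well. Note that /v1/chat/completions requires the served model to provide a chat template (chat and instruct models generally do; a base model such as facebook/opt-125m may not), so the model name below is a placeholder:

# Chat-style request; replace MODEL_NAME with a chat-capable model passed to --model
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MODEL_NAME",
    "messages": [{"role": "user", "content": "What is deep learning?"}],
    "max_tokens": 50
  }'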

GPU Memory and Shared Memory

For optimal performance, increase the container's shared memory size with --shm-size; vLLM uses shared memory for inter-process communication during tensor-parallel inference:

docker run --rm -it \
  --gpus all \
  --shm-size 8g \
  -p 8000:8000 \
  cgr.dev/$ORGANIZATION/vllm-openai:latest \
  --model meta-llama/Llama-2-7b-hf \
  --host 0.0.0.0
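
Gated models such as meta-llama/Llama-2-7b-hf also require Hugging Face authentication before their weights can be downloaded. One way to provide it, sketched below, is to pass a token through the environment (HF_TOKEN is read by the Hugging Face client libraries; the token value is a placeholder):

docker run --rm -it \
  --gpus all \
  --shm-size 8g \
  -p 8000:8000 \
  -e HF_TOKEN=<your-hugging-face-token> \
  -v /path/to/cache:/root/.cache \
  cgr.dev/$ORGANIZATION/vllm-openai:latest \
  --model meta-llama/Llama-2-7b-hf \
  --host 0.0.0.0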

CUDA Compatibility

If the host's CUDA driver is older than the CUDA toolkit shipped in the container, you can use CUDA forward compatibility:

docker run --rm -it \
  --gpus all \
  -e LD_LIBRARY_PATH="/usr/local/cuda-12.9/compat" \
  cgr.dev/$ORGANIZATION/vllm-openai:latest \
  --model facebook/opt-125m

Refer to NVIDIA's CUDA Compatibility documentation for details on installing compatibility packages.
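
To check whether forward compatibility is needed, inspect the host's driver and the highest CUDA version it supports:

# Reports the installed driver version and the maximum CUDA version it supports
nvidia-smi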

Documentation and Resources

What are Chainguard Containers?

Chainguard's free tier of Starter container images is built with Wolfi, our minimal Linux undistro.

All other Chainguard Containers are built with Chainguard OS, Chainguard's minimal Linux operating system designed to produce container images that meet the requirements of a more secure software supply chain.

The main features of Chainguard Containers include a minimal footprint with only essential dependencies, regular rebuilds that pick up security updates, and signed provenance attestations and SBOMs.

For cases where you need container images with shells and package managers to build or debug, most Chainguard Containers come paired with a development, or -dev, variant.

In all other cases, including Chainguard Containers tagged as :latest or with a specific version number, the container images include only an open-source application and its runtime dependencies. These minimal container images typically do not contain a shell or package manager.

Although the -dev container image variants have similar security features as their more minimal versions, they include additional software that is typically not necessary in production environments. We recommend using multi-stage builds to copy artifacts from the -dev variant into a more minimal production image.
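
As an illustrative sketch of that pattern, the following multi-stage Dockerfile installs an extra Python package in the -dev variant and copies only the result into the minimal runtime image; the latest-dev tag, target path, and package name are assumptions for illustration, not taken from this listing:

# Build stage: use the -dev variant, which includes pip and a shell
FROM cgr.dev/ORGANIZATION/vllm-openai:latest-dev AS builder
RUN pip install --no-cache-dir some-extra-package --target /opt/extra

# Runtime stage: copy the installed package into the minimal image
FROM cgr.dev/ORGANIZATION/vllm-openai:latest
COPY --from=builder /opt/extra /opt/extra
# Assumes the base image does not already rely on PYTHONPATH
ENV PYTHONPATH=/opt/extra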

Need additional packages?

To improve security, Chainguard Containers include only essential dependencies. Need more packages? Chainguard customers can use Custom Assembly to add packages, either through the Console, chainctl, or API.

To use Custom Assembly in the Chainguard Console: navigate to the image you'd like to customize in your Organization's list of images, and click on the Customize image button at the top of the page.

Learn More

Refer to our Chainguard Containers documentation on Chainguard Academy. Chainguard also offers VMs and Libraries; contact us for access.

Trademarks

This software listing is packaged by Chainguard. The trademarks set forth in this offering are owned by their respective companies, and use of them does not imply any affiliation, sponsorship, or endorsement by such companies.

Licenses

Chainguard's container images contain software packages that are direct or transitive dependencies. The following licenses were found in the "latest" tag of this image:

  • Apache-2.0

  • Artistic-1.0-Perl

  • BSD-1-Clause

  • BSD-2-Clause

  • BSD-3-Clause

  • BSD-3-Clause-Open-MPI

  • BSD-4-Clause-UC

For a complete list of licenses, please refer to this Image's SBOM.

Compliance

A FIPS-validated version of this image is available for FedRAMP compliance. A STIG is included with the FIPS image.


Related images

vllm-openai-fips (FIPS)

Category
AI
