GPU Memory Warnings

vLLM includes an optional GPU memory monitoring system that logs a warning when GPU memory usage exceeds a configurable threshold. The early warning gives you time to adjust the configuration before an Out-Of-Memory (OOM) crash occurs.

Enabling Warnings

To enable GPU memory warnings, use the --enable-gpu-memory-warning flag:

vllm serve facebook/opt-125m --enable-gpu-memory-warning

Configuration

You can configure the warning threshold using --gpu-memory-warning-threshold (default: 0.9, i.e., 90%):

vllm serve facebook/opt-125m \
    --enable-gpu-memory-warning \
    --gpu-memory-warning-threshold 0.85

How It Works

When enabled, vLLM periodically checks GPU memory usage, measured as the ratio of reserved memory to total memory. If this ratio exceeds the threshold, vLLM emits a warning in the log.

Example warning:

WARNING 01-06 21:00:00 gpu_memory_monitor.py:134] GPU 0 memory usage high: 92.5% (reserved: 3.65GB / 3.95GB, allocated: 3.50GB). Consider reducing --max-num-seqs, --max-model-len, or using a smaller model to avoid OOM.
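The threshold check described above can be sketched in a few lines of Python. This is only an illustration, not vLLM's implementation: the names `MemorySnapshot`, `usage_ratio`, and `check_memory` are hypothetical, and the message text is loosely modeled on the example warning. In a PyTorch process the byte counts could plausibly be read with `torch.cuda.memory_reserved()`, `torch.cuda.memory_allocated()`, and `torch.cuda.mem_get_info()`, but the documentation here does not say which calls the monitor uses, so the sketch takes them as plain inputs.

```python
from dataclasses import dataclass
from typing import Optional

GIB = 1024 ** 3  # bytes per GiB, used only for display


@dataclass
class MemorySnapshot:
    """Hypothetical container for one memory reading on one GPU."""
    reserved_bytes: int
    allocated_bytes: int
    total_bytes: int


def usage_ratio(snapshot: MemorySnapshot) -> float:
    """Reserved memory divided by total memory, as described above."""
    return snapshot.reserved_bytes / snapshot.total_bytes


def check_memory(snapshot: MemorySnapshot,
                 threshold: float = 0.9,
                 device: int = 0) -> Optional[str]:
    """Return a warning message if usage exceeds the threshold, else None."""
    ratio = usage_ratio(snapshot)
    if ratio <= threshold:
        return None
    return (
        f"GPU {device} memory usage high: {ratio:.1%} "
        f"(reserved: {snapshot.reserved_bytes / GIB:.2f}GB / "
        f"{snapshot.total_bytes / GIB:.2f}GB, "
        f"allocated: {snapshot.allocated_bytes / GIB:.2f}GB)"
    )


if __name__ == "__main__":
    # Byte counts chosen to resemble the example warning above.
    snap = MemorySnapshot(
        reserved_bytes=int(3.65 * GIB),
        allocated_bytes=int(3.50 * GIB),
        total_bytes=int(3.95 * GIB),
    )
    print(check_memory(snap, threshold=0.9))   # above threshold: message
    print(check_memory(snap, threshold=0.95))  # below threshold: None
```

Because the comparison is against a ratio rather than an absolute byte count, the same threshold value applies unchanged across GPUs of different sizes.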

To prevent log spam, warnings are rate-limited: by default, at most one warning is emitted every 60 seconds.
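Rate limiting of this kind can be sketched as a small helper that remembers when the last warning was emitted. The class name `WarningRateLimiter` and its parameters are hypothetical and are not taken from vLLM's code. The injectable `clock` argument is an assumption added for testability.

```python
import time
from typing import Callable, Optional


class WarningRateLimiter:
    """Allow at most one warning per interval (hypothetical helper)."""

    def __init__(self,
                 interval_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.interval_s = interval_s
        self._clock = clock
        self._last_emit: Optional[float] = None  # None until the first warning

    def should_emit(self) -> bool:
        """Return True and record the time if a warning may be emitted now."""
        now = self._clock()
        if (self._last_emit is not None
                and now - self._last_emit < self.interval_s):
            return False  # still inside the quiet period
        self._last_emit = now
        return True


if __name__ == "__main__":
    limiter = WarningRateLimiter(interval_s=60.0)
    print(limiter.should_emit())  # first call is always allowed
    print(limiter.should_emit())  # immediate second call is suppressed
```

The sketch uses a monotonic clock rather than wall-clock time, so system clock adjustments cannot shorten or extend the quiet period.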

When to Use

This feature is particularly useful when:

  • Running on GPUs with limited VRAM.
  • Experimenting with new models or configurations.
  • Debugging OOM issues.

The feature is disabled by default and has zero overhead when disabled.