If you're running a production AI workload in 2026 and need maximum tokens-per-second performance, one of the most effective combinations currently available is vLLM on an NVIDIA Blackwell B200 dedicated server. The B200 represents the latest generation of high-performance AI accelerators, and vLLM has evolved into a leading high-throughput serving engine with strong support for modern GPU architectures.
This guide walks you through the entire setup, from bare-metal prerequisites to production-tuned serving, on a dedicated server. In real-world deployments (including environments such as COLO BIRD), these patterns are commonly used for large-scale inference workloads.
Before jumping into commands, it's worth understanding why the B200 justifies its own setup guide rather than a generic vLLM install.
The NVIDIA Blackwell B200 features up to 192 GB of HBM3e memory with ~8 TB/s-class bandwidth per GPU, along with high-bandwidth NVLink interconnects (system-dependent, up to ~1.8 TB/s). It also expands support for lower-precision inference formats such as FP8, with emerging FP4 capabilities on next-generation tensor cores.
In practice, early benchmarks and internal testing suggest up to ~4× higher throughput compared to Hopper-class GPUs (e.g., H100/H200) under optimized conditions. On large-scale deployments, multi-GPU systems can reach tens of thousands of tokens per second depending on model size, batch configuration, and latency targets.
These performance gains come from a combination of hardware improvements and ongoing optimizations in the vLLM ecosystem, including more efficient attention kernels (such as FlashInfer), improved Mixture-of-Experts routing, and faster matrix operations. Proper configuration is required to fully realize these gains.
Before you start, confirm the following on your dedicated server:
GPU: NVIDIA B200 (or GB200 NVL72/NVL36). Most steps also apply to B100-class GPUs with minor adjustments.
Driver: Latest NVIDIA driver supporting Blackwell GPUs (check official compatibility matrix).
CUDA: CUDA 12.8+ (or the latest Blackwell-compatible release).
OS: Ubuntu 22.04 LTS or 24.04 LTS (recommended).
RAM: ≥ 512 GB system RAM (recommended for large models and high batch sizes; host memory can become a bottleneck).
Docker: Docker Engine 20.10+ with NVIDIA Container Toolkit.
Storage: Fast NVMe SSD. A 70B model needs roughly 140 GB for the weights alone in BF16 (about 2 bytes per parameter), plus headroom for any additional checkpoints.
Verify your installation:
nvidia-smi
You should see your GPU, driver version, and CUDA version in the output. If not, install or update the NVIDIA driver before proceeding.
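For a more compact check of just the fields that matter here, nvidia-smi's query mode can be used (the fields below are standard nvidia-smi query fields):
# One line per GPU: name, driver version, total memory
nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv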
The most reliable way to run vLLM on modern GPUs is inside a container, ensuring compatibility with NVIDIA’s optimized CUDA and PyTorch builds.
# Add NVIDIA package repository
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Verify GPU access inside Docker:
docker run --rm --gpus all nvidia/cuda:12.8.0-base-ubuntu22.04 nvidia-smi
The vLLM project publishes official Docker images with GPU-optimized kernels and compatible PyTorch builds.
docker pull vllm/vllm-openai:latest
# For production, pin a version:
docker pull vllm/vllm-openai:v0.10.1
These images include PyTorch builds compatible with newer GPU architectures.
Attempting a bare-metal install may fail with errors such as "CUDA error: no kernel image is available for execution on the device".
Using containers is strongly recommended for Blackwell-class systems.
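If you want to confirm that the PyTorch build inside the container actually carries kernels for your GPU architecture before serving, one quick check (overriding the image entrypoint; the exact architecture list depends on the PyTorch build shipped in the image) is:
# Print the compute capability of GPU 0 and the CUDA architectures PyTorch was compiled for
docker run --rm --gpus all --entrypoint python3 vllm/vllm-openai:v0.10.1 \
-c "import torch; print(torch.cuda.get_device_capability(0)); print(torch.cuda.get_arch_list())"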
Download model weights before launching vLLM to avoid startup delays. We’ll use Llama 3 70B Instruct as an example.
pip install huggingface_hub
huggingface-cli login
huggingface-cli download meta-llama/Meta-Llama-3-70B-Instruct \
--cache-dir /data/models \
--local-dir /data/models/llama-3-70b-instruct
Note: Large MoE models such as DeepSeek-R1 can exceed 300–400 GB, so ensure sufficient storage and download bandwidth.
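Before starting a large download, it's worth confirming the target volume actually has room (the paths here match the example above):
# Check free space on the model volume and the size of anything already stored there
df -h /data/models
du -sh /data/models/*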
Single GPU:
docker run --gpus all \
--ipc=host \
-p 8000:8000 \
-v /data/models:/models \
-e HUGGING_FACE_HUB_TOKEN=your_hf_token_here \
vllm/vllm-openai:v0.10.1 \
--model /models/llama-3-70b-instruct \
--served-model-name llama-3-70b-instruct \
--dtype float16 \
--gpu-memory-utilization 0.90 \
--max-model-len 32768 \
--enable-chunked-prefill
The --served-model-name flag registers a short alias for API clients; without it, requests must reference the model by its full path (/models/llama-3-70b-instruct).
Multi-GPU (for example, an 8× B200 node):
docker run --gpus all \
--ipc=host \
-p 8000:8000 \
-v /data/models:/models \
-e HUGGING_FACE_HUB_TOKEN=your_hf_token_here \
vllm/vllm-openai:v0.10.1 \
--model /models/deepseek-r1 \
--served-model-name deepseek-r1 \
--tensor-parallel-size 8 \
--gpu-memory-utilization 0.92 \
--max-model-len 65536 \
--enable-chunked-prefill
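Before relying on tensor parallelism, it can help to confirm the GPUs in the box are actually connected over NVLink rather than falling back to PCIe:
# Show the GPU-to-GPU interconnect matrix (look for NV* links between all GPU pairs)
nvidia-smi topo -m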
curl http://localhost:8000/health
curl http://localhost:8000/v1/models
# Test request:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3-70b-instruct",
"messages": [{"role": "user", "content": "Explain PagedAttention in one paragraph."}],
"max_tokens": 200
}'
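To eyeball interactive latency (including time-to-first-token), the same endpoint accepts the OpenAI-style streaming flag. This is a minimal sketch, assuming the server was launched with the alias used above:
# Streaming variant: tokens arrive as server-sent events; -N disables curl's output buffering
curl -N http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3-70b-instruct",
"messages": [{"role": "user", "content": "Count to five."}],
"max_tokens": 50,
"stream": true
}'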
Key parameters to tune for high-throughput environments:
--max-num-seqs (batch size control)
--gpu-memory-utilization
--max-model-len
--enable-prefix-caching
Start with conservative values and increase concurrency until tail latency approaches your SLA limit; the example below shows how these flags slot into a launch command.
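As a rough illustration (the numbers here are placeholders, not tuned recommendations), the tuning flags combine with the earlier single-GPU launch like this:
# Example: single-GPU launch with explicit batching, prefix caching, and a tighter context window
docker run --gpus all \
--ipc=host \
-p 8000:8000 \
-v /data/models:/models \
vllm/vllm-openai:v0.10.1 \
--model /models/llama-3-70b-instruct \
--served-model-name llama-3-70b-instruct \
--max-num-seqs 256 \
--gpu-memory-utilization 0.90 \
--max-model-len 16384 \
--enable-prefix-caching \
--enable-chunked-prefill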
Real-time monitoring is critical. Use the following commands:
watch -n 1 nvidia-smi
# Track vLLM-specific metrics
curl http://localhost:8000/metrics
What to track:
GPU utilization
KV cache usage
Time-to-first-token
Throughput
For production visibility, integrate these metrics with Prometheus + Grafana.
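For a quick look without a dashboard, the Prometheus endpoint can be filtered directly. The metric names below follow vLLM's vllm: prefix but can vary slightly between versions, so treat the pattern as a starting point:
# Pull KV cache usage, in-flight request count, and TTFT histograms from the metrics endpoint
curl -s http://localhost:8000/metrics | grep -E "vllm:(gpu_cache_usage_perc|num_requests_running|time_to_first_token)"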
Large Mixture-of-Experts models (100B+ parameters) benefit significantly from multi-GPU setups and tensor parallelism; a quick sizing check follows the list below. Ensure you have:
Sufficient GPU count
Proper sharding (--tensor-parallel-size)
Adequate CPU + RAM support
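A back-of-envelope check for GPU count, counting weights only (roughly 1 byte per parameter in FP8, 2 bytes in BF16) and ignoring KV cache and activations, which need substantial additional headroom:
# Minimum B200 count to hold 400B FP8 weights (192 GB HBM per GPU); real deployments need more
python3 -c "import math; print(math.ceil(400e9 * 1 / 192e9))"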
| Error / Issue | Solution |
|---|---|
| CUDA kernel errors | Use official containerized builds. |
| OOM errors | Reduce sequence length or memory utilization. |
| Low GPU usage | Increase batching (--max-num-seqs). |
| Container exits unexpectedly | Increase shared memory size (--shm-size). |
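If a container is not using --ipc=host, the shared-memory fix from the table looks like this (16g is an arbitrary example; size it to your batch configuration):
# Enlarge the container's /dev/shm instead of sharing the host IPC namespace
docker run --gpus all --shm-size=16g -p 8000:8000 \
-v /data/models:/models \
vllm/vllm-openai:v0.10.1 \
--model /models/llama-3-70b-instruct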
Performance varies widely, but under optimized conditions:
70B models → High throughput on a single GPU.
400B+ models → Require multi-GPU scaling.
MoE models → Benefit heavily from parallelism.
Always benchmark with your real workload:
python benchmarks/benchmark_throughput.py \
--model /models/llama-3-70b-instruct \
--backend vllm \
--input-len 1024 \
--output-len 512
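The benchmark script above ships with the vLLM source tree rather than the pip package, so clone the project repository first if you don't already have it checked out:
# Fetch the vLLM source to get the benchmarks/ directory
git clone https://github.com/vllm-project/vllm.git
cd vllm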
To maximize performance, always use containers, run CUDA 12.8+, tune batching and memory, and monitor your GPU + KV cache behavior. With proper configuration, Blackwell-class GPUs can deliver significant throughput gains over previous generations for production LLM inference.
Dedicated servers in particular provide unshared GPU access (no virtualization or partitioning), direct NVMe storage, full NVLink bandwidth, and predictable costs at scale.
COLO BIRD provides high-performance bare-metal dedicated servers ready for heavy AI workloads. Running large MoE deployments or needing unshared Blackwell / Hopper GPU access? Deploy a custom server tailored for vLLM and maximize your tokens-per-second today.