How I Got vLLM Running on a Brand‑New Blackwell GPU (and Survived)
Three‑and‑a‑half bleary‑eyed days, a pile of cmake errors, and one very confused cat later, I can finally say: vLLM now runs on my RTX 5070 Ti under Debian 12 with CUDA 12.8. If you’re holding a Blackwell card and wondering why nothing “just works,” this post is for you.
Why the pain?
- No official sm_120 wheel — PyTorch nightlies top out at Hopper (sm_90); Blackwell needs sm_120. (See the quick check below if you want to confirm what your card and current wheel report.)
- CUDA 12.8 — bleeding‑edge runtime, but the only one the 50‑series driver supports.
- FlashInfer — fastest attention kernels out there, but they insist on a matching compiler stack.
In other words, the perfect storm of “it should work in theory” and “why is my coffee mug screaming?”
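Before burning a weekend on source builds, it’s worth confirming what the hardware and your current wheel actually report. A minimal sketch; it assumes the driver is already installed and that whatever PyTorch you currently have is importable (on a fresh box only the nvidia-smi half will work):
# The driver reports Blackwell consumer cards as compute capability 12.0.
nvidia-smi --query-gpu=name,compute_cap --format=csv
# A stock wheel that stops at sm_90 will not list sm_120 here, which is the whole problem.
python3 -c "import torch; print(torch.cuda.get_arch_list()); print(torch.cuda.get_device_capability(0))"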
Hardware & Software Baseline
- GPU: NVIDIA RTX 5070 Ti (16 GB VRAM, Blackwell)
- OS: Debian 12 (Bookworm) — stock kernel 6.1
- Driver: 570.xx or newer (the R570 branch that ships alongside CUDA 12.8; older branches don’t support the 50‑series)
- CUDA Toolkit: 12.8
- Python: 3.11 (venv + Docker)
Step 1 — Install Driver & CUDA 12.8
First we exorcise nouveau, add the CUDA repo, and pray we don’t typo the URL.
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/3bf863cc.pub \
| sudo gpg --dearmor -o /etc/apt/keyrings/nvidia-drivers.gpg
echo 'deb [signed-by=/etc/apt/keyrings/nvidia-drivers.gpg] https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/ /' \
  | sudo tee /etc/apt/sources.list.d/nvidia-cuda.list
sudo apt update
sudo apt install -y nvidia-driver cuda
sudo reboot
After reboot, verify with nvidia-smi. You should see CUDA 12.8 and a shiny Blackwell device name. If you don’t, welcome to Groundhog Day — repeat until reality matches documentation.
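If you’d rather check explicitly than squint at the banner, query the driver and the toolkit separately. A minimal sketch, assuming the toolkit landed in /usr/local/cuda (the default for the repo packages):
# Driver side: the name and version come from the kernel module that just loaded.
nvidia-smi --query-gpu=name,driver_version --format=csv
# Toolkit side: nvcc is what has to emit sm_120 code in the next steps.
/usr/local/cuda/bin/nvcc --version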
Step 2 — Build PyTorch for Blackwell
Nightly wheels choke on sm_120, so we compile. AVX‑512 makes a real difference on modern Intel chips; PyTorch compiles those CPU kernels by default and picks them at runtime, so the flag that actually matters below is TORCH_CUDA_ARCH_LIST.
git clone --recursive https://github.com/pytorch/pytorch
cd pytorch
export USE_CUDA=1
export CUDA_HOME=/usr/local/cuda
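# 12.0 == sm_120 (Blackwell consumer); append "+PTX" if you also want forward‑compatible PTX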
export TORCH_CUDA_ARCH_LIST="12.0"
export MAX_JOBS=$(nproc)
export CMAKE_PREFIX_PATH=$(python3 -c "import sysconfig; print(sysconfig.get_paths()['data'])")
pip install -r requirements.txt
pip install ninja cmake pyyaml
python setup.py clean
python setup.py bdist_wheel
pip install dist/torch-*.whl
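Before going further, a ten‑second sanity check that the wheel you just installed really targets Blackwell is worth it. A minimal sketch; run it in the same venv the wheel went into:
python3 - <<'EOF'
import torch
# The arch list must contain sm_120 and the capability tuple should be (12, 0).
print(torch.__version__, torch.version.cuda)
print(torch.cuda.get_arch_list())
print(torch.cuda.get_device_capability(0))
# Tiny kernel launch: if the arch list is wrong, this is where "no kernel image" shows up.
x = torch.randn(1024, 1024, device="cuda")
print((x @ x).sum().item())
EOF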
Step 3 — Build the Docker Image (vLLM + FlashInfer)
Because nothing says “I love repetition” like compiling again inside a container.
FROM nvcr.io/nvidia/pytorch:25.03-py3
ENV TORCH_CUDA_ARCH_LIST='12.0+PTX'
RUN apt-get update && apt-get install -y git cmake ccache python3-dev
RUN git clone https://github.com/flashinfer-ai/flashinfer.git --recursive /flashinfer
WORKDIR /flashinfer
RUN pip install -e . -v
RUN git clone https://github.com/vllm-project/vllm.git /vllm
WORKDIR /vllm
RUN pip install -r requirements/build.txt
RUN python setup.py develop
CMD ["bash"]
mkdir -p ~/vllm/ccache   # only useful if you actually mount it into the build (e.g. a BuildKit cache mount); as written, the build won't touch it
docker build -t vllm-cu128 -f Dockerfile .
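Once the image builds, a quick smoke test saves head‑scratching later. A minimal sketch, assuming the vllm-cu128 tag from above; the imports only succeed if both source builds went through:
docker run --rm --gpus all vllm-cu128 \
  python -c "import torch, flashinfer, vllm; print(torch.version.cuda, torch.cuda.get_device_capability(0))"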
Step 4 — Pull Some Models
Time to hoard weights like a dragon.
huggingface-cli login
huggingface-cli download Qwen/QwQ-32B-AWQ \
--local-dir /models/QwQ-32B-AWQ --local-dir-use-symlinks False
huggingface-cli download TheBloke/WhiteRabbitNeo-13B-AWQ \
--local-dir /models/WhiteRabbitNeo-13B-AWQ
huggingface-cli download nvidia/DeepSeek-R1-FP4 \
--local-dir /models/DeepSeek-R1-FP4
Step 5 — Launch vLLM
The moment of truth. If this fails, scroll back to Step 1 and pretend you never saw this line.
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
-p 8000:8000 -v /models:/models \
-e VLLM_ATTENTION_BACKEND=FLASHINFER \
vllm-cu128 \
python -m vllm.entrypoints.openai.api_server \
--model /models/QwQ-32B-AWQ \
--quantization awq \
--gpu-memory-utilization 0.90 \
--enable-chunked-prefill \
--enable-prefix-caching \
--enable-reasoning \
--reasoning-parser deepseek_r1 \
--max-model-len 32768
The server starts on localhost:8000 and speaks the OpenAI‑compatible API. A quick sanity check:
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model":"Qwen/QwQ-32B-AWQ","prompt":"Write a haiku about snow.","max_tokens":64}'
Benchmarks (Real‑World)
- DeepSeek‑R1‑FP4 (7B) — ~120 tok/s @ 4096 ctx
- WhiteRabbitNeo‑13B‑AWQ — ~68 tok/s @ 4096 ctx
- QwQ‑32B‑AWQ — 15‑18 tok/s @ 32k ctx
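If you want to sanity‑check numbers like these on your own box without a full benchmark harness, the server’s usage accounting is enough for a back‑of‑the‑envelope estimate. A minimal sketch, assuming jq and bc are installed and the Step 5 server is running:
# Time one generation and divide completion tokens by wall‑clock time. Crude, but comparable.
START=$(date +%s.%N)
TOKENS=$(curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"/models/QwQ-32B-AWQ","prompt":"Explain KV caching in one paragraph.","max_tokens":512}' \
  | jq '.usage.completion_tokens')
END=$(date +%s.%N)
echo "$TOKENS tokens, ~$(echo "scale=1; $TOKENS / ($END - $START)" | bc) tok/s"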
Troubleshooting Nuggets
- invalid device function → your TORCH_CUDA_ARCH_LIST is wrong or you’re loading the stock wheel (a quick check is below this list).
- OOM at load time → lower --gpu-memory-utilization or add --cpu-offload-gb 6.
- FlashInfer segfaults → rebuild it after you install your custom PyTorch wheel.
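For the first nugget, the fastest way to tell whether a process is picking up the custom wheel or a stray stock one is to ask the wheel itself. A minimal sketch; run it with the exact interpreter or venv the failing process uses:
python3 -c "import torch; print(torch.__file__); print(torch.cuda.get_arch_list())"
# If sm_120 is missing from that list, you're on the wrong wheel: check pip list | grep torch
# and which venv/PYTHONPATH is active before rebuilding anything.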
Final Thoughts
Blackwell support will land in upstream PyTorch soon enough, but if you want today’s performance, rolling your own stack is totally doable. I hope the guide saves you a few coffee refills and at least one existential crisis.
Happy compiling, and ping me on LinkedIn if you hit a weird edge case.