How I Got vLLM Running on a Brand‑New Blackwell GPU (and Survived)
Three‑and‑a‑half bleary‑eyed days, a pile of cmake errors, and one very confused cat later, I can finally say: vLLM now runs on my RTX 5070 Ti under Debian 12 with CUDA 12.8. If you’re holding a Blackwell card and wondering why nothing “just works,” this post is for you.
Why the pain?
- No official sm_120 wheel — PyTorch nightlies top out at Hopper (sm_90); Blackwell needs sm_120. (See the quick check below if you want to confirm what your card and current wheel report.)
- CUDA 12.8 — bleeding‑edge runtime, but the only one the 50‑series driver supports.
- FlashInfer — fastest attention kernels out there, but they insist on a matching compiler stack.
In other words, the perfect storm of “it should work in theory” and “why is my coffee mug screaming?”
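Before burning a weekend on source builds, it’s worth confirming what the hardware and your current wheel actually report. A minimal sketch; it assumes the driver is already installed and that whatever PyTorch you currently have is importable (on a fresh box only the nvidia-smi half will work):
# The driver reports Blackwell consumer cards as compute capability 12.0.
nvidia-smi --query-gpu=name,compute_cap --format=csv
# A stock wheel that stops at sm_90 will not list sm_120 here, which is the whole problem.
python3 -c "import torch; print(torch.cuda.get_arch_list()); print(torch.cuda.get_device_capability(0))"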
Hardware & Software Baseline
- GPU: NVIDIA RTX 5070 Ti (16 GB VRAM, Blackwell)
- OS: Debian 12 (Bookworm) — stock kernel 6.1
- Driver: 570.xx or newer (the R570 branch that ships alongside CUDA 12.8; older branches don’t support the 50‑series)
- CUDA Toolkit: 12.8
- Python: 3.11 (venv + Docker)
Step 1 — Install Driver & CUDA 12.8
First we exorcise nouveau, add the CUDA repo, and pray we don’t typo the URL.
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/3bf863cc.pub \
| sudo gpg --dearmor -o /etc/apt/keyrings/nvidia-drivers.gpg
echo 'deb [signed-by=/etc/apt/keyrings/nvidia-drivers.gpg] https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/ /' \
  | sudo tee /etc/apt/sources.list.d/nvidia-cuda.list
sudo apt update
sudo apt install -y nvidia-driver cuda
sudo reboot
After reboot, verify with nvidia-smi. You should see CUDA 12.8 and a shiny Blackwell device name. If you don’t, welcome to Groundhog Day — repeat until reality matches documentation.
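If you’d rather check explicitly than squint at the banner, query the driver and the toolkit separately. A minimal sketch, assuming the toolkit landed in /usr/local/cuda (the default for the repo packages):
# Driver side: the name and version come from the kernel module that just loaded.
nvidia-smi --query-gpu=name,driver_version --format=csv
# Toolkit side: nvcc is what has to emit sm_120 code in the next steps.
/usr/local/cuda/bin/nvcc --version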
Step 2 — Build PyTorch for Blackwell
Nightly wheels choke on sm_120, so we compile. AVX‑512 makes a real difference on modern Intel chips; PyTorch compiles those CPU kernels by default and picks them at runtime, so the flag that actually matters below is TORCH_CUDA_ARCH_LIST.
git clone --recursive https://github.com/pytorch/pytorch
cd pytorch
export USE_CUDA=1
export CUDA_HOME=/usr/local/cuda
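# 12.0 == sm_120 (Blackwell consumer); append "+PTX" if you also want forward‑compatible PTX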
export TORCH_CUDA_ARCH_LIST="12.0"
export MAX_JOBS=$(nproc)
export CMAKE_PREFIX_PATH=$(python3 -c "import sysconfig; print(sysconfig.get_paths()['data'])")
pip install -r requirements.txt
pip install ninja cmake pyyaml
python setup.py clean
python setup.py bdist_wheel
pip install dist/torch-*.whl
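Before going further, a ten‑second sanity check that the wheel you just installed really targets Blackwell is worth it. A minimal sketch; run it in the same venv the wheel went into:
python3 - <<'EOF'
import torch
# The arch list must contain sm_120 and the capability tuple should be (12, 0).
print(torch.__version__, torch.version.cuda)
print(torch.cuda.get_arch_list())
print(torch.cuda.get_device_capability(0))
# Tiny kernel launch: if the arch list is wrong, this is where "no kernel image" shows up.
x = torch.randn(1024, 1024, device="cuda")
print((x @ x).sum().item())
EOF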
Step 3 — Build the Docker Image (vLLM + FlashInfer)
Because nothing says “I love repetition” like compiling again inside a container.
FROM nvcr.io/nvidia/pytorch:25.03-py3
ENV TORCH_CUDA_ARCH_LIST='12.0+PTX'
RUN apt-get update && apt-get install -y git cmake ccache python3-dev
RUN git clone https://github.com/flashinfer-ai/flashinfer.git --recursive /flashinfer
WORKDIR /flashinfer
RUN pip install -e . -v
RUN git clone https://github.com/vllm-project/vllm.git /vllm
WORKDIR /vllm
RUN pip install -r requirements/build.txt
RUN python setup.py develop
CMD ["bash"]
mkdir -p ~/vllm/ccache   # only useful if you actually mount it into the build (e.g. a BuildKit cache mount); as written, the build won't touch it
docker build -t vllm-cu128 -f Dockerfile .
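Once the image builds, a quick smoke test saves head‑scratching later. A minimal sketch, assuming the vllm-cu128 tag from above; the imports only succeed if both source builds went through:
docker run --rm --gpus all vllm-cu128 \
  python -c "import torch, flashinfer, vllm; print(torch.version.cuda, torch.cuda.get_device_capability(0))"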
Step 4 — Pull Some Models
Time to hoard weights like a dragon.
huggingface-cli login
huggingface-cli download Qwen/QwQ-32B-AWQ \
--local-dir /models/QwQ-32B-AWQ --local-dir-use-symlinks False
huggingface-cli download TheBloke/WhiteRabbitNeo-13B-AWQ \
--local-dir /models/WhiteRabbitNeo-13B-AWQ
huggingface-cli download nvidia/DeepSeek-R1-FP4 \
--local-dir /models/DeepSeek-R1-FP4
Step 5 — Launch vLLM
The moment of truth. If this fails, scroll back to Step 1 and pretend you never saw this line.
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
-p 8000:8000 -v /models:/models \
-e VLLM_ATTENTION_BACKEND=FLASHINFER \
vllm-cu128 \
python -m vllm.entrypoints.openai.api_server \
--model /models/QwQ-32B-AWQ \
--quantization awq \
--gpu-memory-utilization 0.90 \
--enable-chunked-prefill \
--enable-prefix-caching \
--enable-reasoning \
--reasoning-parser deepseek_r1 \
--max-model-len 32768
The server starts on localhost:8000 and speaks the OpenAI‑compatible API. A quick sanity check:
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model":"Qwen/QwQ-32B-AWQ","prompt":"Write a haiku about snow.","max_tokens":64}'
Benchmarks (Real‑World)
- DeepSeek‑R1‑FP4 (7B) — ~120 tok/s @ 4096 ctx
- WhiteRabbitNeo‑13B‑AWQ — ~68 tok/s @ 4096 ctx
- QwQ‑32B‑AWQ — 15‑18 tok/s @ 32k ctx
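If you want to sanity‑check numbers like these on your own box without a full benchmark harness, the server’s usage accounting is enough for a back‑of‑the‑envelope estimate. A minimal sketch, assuming jq and bc are installed and the Step 5 server is running:
# Time one generation and divide completion tokens by wall‑clock time. Crude, but comparable.
START=$(date +%s.%N)
TOKENS=$(curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"/models/QwQ-32B-AWQ","prompt":"Explain KV caching in one paragraph.","max_tokens":512}' \
  | jq '.usage.completion_tokens')
END=$(date +%s.%N)
echo "$TOKENS tokens, ~$(echo "scale=1; $TOKENS / ($END - $START)" | bc) tok/s"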
Troubleshooting Nuggets
- invalid device function → your TORCH_CUDA_ARCH_LIST is wrong or you’re loading the stock wheel (a quick check is below this list).
- OOM at load time → lower --gpu-memory-utilization or add --cpu-offload-gb 6.
- FlashInfer segfaults → rebuild it after you install your custom PyTorch wheel.
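For the first nugget, the fastest way to tell whether a process is picking up the custom wheel or a stray stock one is to ask the wheel itself. A minimal sketch; run it with the exact interpreter or venv the failing process uses:
python3 -c "import torch; print(torch.__file__); print(torch.cuda.get_arch_list())"
# If sm_120 is missing from that list, you're on the wrong wheel: check pip list | grep torch
# and which venv/PYTHONPATH is active before rebuilding anything.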
Final Thoughts
Blackwell support will land in upstream PyTorch soon enough, but if you want today’s performance, rolling your own stack is totally doable. I hope the guide saves you a few coffee refills and at least one existential crisis.
Happy compiling, and ping me on LinkedIn if you hit a weird edge case.