Alen Peric | October 2025

Learning Ray on K3s: From Setup to Production-Ready CyberLLM RAG

When I first decided to deploy Ray on my K3s cluster, I thought it would just be a quick experiment—a weekend test of distributed AI workloads. Instead, it evolved into a full-blown guide on building, scaling, and validating a production-grade cybersecurity RAG system. What follows isn’t theory—it’s the real sequence of wins, failures, and fixes that led to a stable, scalable deployment.


🧩 Step 1: Preparing the Cluster

The setup began with a modest K3s cluster—lightweight, GPU-enabled, and ideal for experimentation. I confirmed the nodes were ready:

kubectl cluster-info
kubectl get nodes -o wide
[Screenshot: cluster overview with GPU resources]

With Kubernetes healthy, I installed KubeRay, the operator that makes Ray play nicely with K8s:

helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
helm install kuberay-operator kuberay/kuberay-operator --version 1.4.2
[Screenshot: operator pods running]

That simple install marked the first milestone—the infrastructure was ready to host distributed AI jobs.
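
Beyond eyeballing kubectl get pods, a quick scripted check can confirm the operator is healthy. Here is a minimal sketch using the official kubernetes Python client; the namespace and label selector are assumptions about a default Helm install, not something taken from the repo:

# Sanity-check the KubeRay operator pod (assumes ~/.kube/config and a default-namespace install).
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
pods = v1.list_namespaced_pod(
    namespace="default",
    label_selector="app.kubernetes.io/name=kuberay-operator",  # assumed Helm chart label
)
for pod in pods.items:
    print(pod.metadata.name, pod.status.phase)  # expect: Running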


💥 Step 2: The OOMKilled Lesson

My first Ray cluster barely lasted five minutes. I’d allocated just 1500Mi of memory to the head node. Kubernetes responded with predictable cruelty:

Last State: Terminated (Reason: OOMKilled, Exit Code: 137)
Restart Count: 14

The fix was obvious in hindsight—bump resources:

# container resources for the Ray head pod (headGroupSpec in the RayCluster spec)
resources:
  requests:
    memory: 4Gi
  limits:
    memory: 8Gi

After that change, the cluster stabilized instantly. Lesson learned: if Ray is crashing, check your memory first.

[Screenshot: pod events showing the OOMKilled issue resolved]

⚙️ Step 3: Building the CyberLLM RAG MVP

Once the cluster was stable, I began the core task: building the RAG system that would power cybersecurity queries. The data ingestion pipeline aggregated cyber threat intelligence sources into a single searchable corpus.

To start the build:

python scripts/build_rag.py

Initial throughput hovered around 127 docs/sec—not bad, but not great. Then I parallelized with Ray, optimized embedding batches, and saw it jump to 10,576 docs/sec. That’s an 83× boost from distributed execution alone.
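
The speedup comes from fanning document batches out to Ray workers instead of embedding serially. A minimal sketch of that pattern, assuming sentence-transformers for the 384-dimensional MiniLM embeddings (model name, batch size, and worker count here are illustrative, not the repo’s actual settings):

# Parallel embedding sketch: a pool of Ray actors, each holding its own encoder.
import numpy as np
import ray
from sentence_transformers import SentenceTransformer

ray.init(address="auto")  # attach to the running cluster

@ray.remote(num_cpus=2)
class Embedder:
    def __init__(self):
        # 384-dim MiniLM, matching the index dimension used by the RAG build
        self.model = SentenceTransformer("all-MiniLM-L6-v2")

    def embed(self, batch):
        return self.model.encode(batch, batch_size=64, show_progress_bar=False)

def chunked(items, size=512):
    for i in range(0, len(items), size):
        yield items[i:i + size]

docs = ["example chunk one", "example chunk two"]  # replace with the ingested corpus
workers = [Embedder.remote() for _ in range(8)]    # size to the cluster's CPU budget
futures = [workers[i % len(workers)].embed.remote(batch)
           for i, batch in enumerate(chunked(docs))]
embeddings = np.vstack(ray.get(futures))           # one row per document chunk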

[Screenshot: terminal logs of the throughput increase]
[Screenshot: Ray parallelism visualization]

🚀 Step 4: Serving with RayJob

With the vector store ready, I deployed the API via a RayJob manifest:

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: cyber-rag-service
spec:
  entrypoint: >-
    python src/serving/ray_serve_app.py
    --vector-store /data/embeddings
    --model meta-llama/Llama-2-7b-chat-hf
    --host 0.0.0.0
    --port 8000
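
The entrypoint launches a Ray Serve application. The repo’s actual src/serving/ray_serve_app.py is more involved (argument parsing, retriever and LLM loading), but the general shape of such an app, with illustrative names only, looks roughly like this:

# Rough shape of a Ray Serve RAG endpoint; names and structure are illustrative.
import time
from ray import serve
from starlette.requests import Request

@serve.deployment(num_replicas=1, ray_actor_options={"num_gpus": 1})
class CyberRAG:
    def __init__(self, vector_store_path: str, model_name: str):
        self.vector_store_path = vector_store_path
        self.model_name = model_name
        # real app: load the vector index and the LLM here

    async def __call__(self, request: Request):
        body = await request.json()
        question = body["question"]
        top_k = body.get("top_k", 5)
        # real app: retrieve top_k chunks, generate an answer, attach references
        return {"question": question, "top_k": top_k, "answer": "stub"}

serve.start(http_options={"host": "0.0.0.0", "port": 8000})
serve.run(CyberRAG.bind("/data/embeddings", "meta-llama/Llama-2-7b-chat-hf"),
          route_prefix="/query")

while True:          # keep the RayJob entrypoint alive so Serve keeps handling requests
    time.sleep(60)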

After applying the manifest and forwarding the head service port:

kubectl port-forward svc/raycluster-basic-head-svc 8000:8000

I tested the endpoint:

curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"question":"What is ATT&CK technique T1059?","top_k":5}'

It returned a clean, structured JSON answer—with references. That was the first real moment of proof: this cluster was alive, intelligent, and useful.

[Screenshot: API response for the ATT&CK technique query]

✅ Step 5: Testing and Validation

Skipping tests is how you build fragile systems. I didn’t skip them.

python scripts/test_core.py
python scripts/test_pipeline.py

Output:

[TEST 1] Python Syntax   PASSED (14/14)
[TEST 2] Core Logic      PASSED (6/6)
[TEST 3] Data Process    PASSED (1,272 docs)
[TEST 4] K8s Manifests   PASSED (2/2)
...
Overall: ✅ 100% PASS

Green across the board. It wasn’t just functional—it was validated.
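
On top of the repo’s test scripts, an end-to-end smoke check against the live endpoint is cheap to keep around. A minimal sketch, assuming the service is port-forwarded to localhost:8000 and that the response carries an answer field:

# End-to-end smoke check against the forwarded /query endpoint (field name is assumed).
import requests

resp = requests.post(
    "http://localhost:8000/query",
    json={"question": "What is ATT&CK technique T1059?", "top_k": 5},
    timeout=60,
)
resp.raise_for_status()
body = resp.json()
assert "answer" in body, "expected an 'answer' field in the response"
print(body["answer"][:200])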


📊 Step 6: Observability and Scaling

Once everything worked, I opened the dashboards:

kubectl port-forward svc/raycluster-basic-head-svc 8265:8265  # Ray dashboard
kubectl port-forward svc/kps-grafana 3000:80                  # Grafana

The Ray dashboard visualized live scaling as worker pods spawned and terminated under load. In Grafana, I watched latency metrics dip as autoscaling kicked in—proof the cluster wasn’t just running, it was adapting in real time.
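
For a quick look at scaling without the dashboards, the Ray API itself reports node membership and aggregate resources. A small polling sketch (interval and iteration count are arbitrary):

# Poll node count and cluster resources while load ramps up.
import time
import ray

ray.init(address="auto")  # run inside the cluster, e.g. via `ray job submit`
for _ in range(10):
    alive = [n for n in ray.nodes() if n["Alive"]]
    print(f"nodes={len(alive)} resources={ray.cluster_resources()}")
    time.sleep(30)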

[Screenshot: Ray dashboard scaling workers]
[Screenshot: Grafana latency and GPU utilization]

📈 Key Metrics

RAG Performance

Metric                      Value
Knowledge base              ~100K documents
Embedding dimension         384 (MiniLM)
Index build time            20–30 min
Retrieval latency (p95)     150–250 ms
Generation latency (p95)    5–8 s
Throughput                  5–10 req/s (single GPU)

Training Benchmarks

Model        Method   Time    Memory   Output
Llama-2 7B   QLoRA    2–3 h   18 GB    200 MB adapter
Llama-2 7B   CPT      4–6 h   35 GB    13 GB model
Mistral 7B   QLoRA    2–3 h   16 GB    200 MB adapter
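
For context on the QLoRA rows: the ~200 MB outputs are LoRA adapters trained on top of a 4-bit quantized base model, which is what keeps memory in the 16–18 GB range. A minimal sketch of that setup with transformers and peft (hyperparameters here are illustrative, not the repo’s):

# Illustrative QLoRA setup: 4-bit base model + LoRA adapters (hyperparameters assumed).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",   # gated model; requires Hugging Face access
    quantization_config=bnb,
    device_map="auto",
)
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # the trainable slice becomes the ~200 MB adapter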

Resource Profile


📁 Repository Snapshot

cyber-llm-rag/
├── scripts/
│   ├── download_data.py
│   ├── build_rag.py
│   └── test_pipeline.py
├── src/
│   ├── rag/
│   ├── training/
│   └── serving/
├── k8s/rayjob-cyber-rag.yaml
├── data/
├── models/
└── notebooks/

➡️ Full repo: github.com/alenperic/Cyber-LLM-RAG
Includes complete code, manifests, and documentation for training, serving, and testing.


💡 Lessons Learned

- Memory first: an undersized Ray head node will OOM-loop before anything else has a chance to go wrong.
- Distribute early: moving embedding onto Ray workers took ingestion from 127 to 10,576 docs/sec.
- Validate end to end: the test scripts caught issues that ad-hoc demo queries never would.
- Watch it run: the Ray dashboard and Grafana turned autoscaling from a black box into something observable.


🏁 Final Thoughts

What began as a tinkering project turned into a production-ready cybersecurity AI system—a Retrieval-Augmented Generation pipeline fine-tuned for CTI data, deployed with Ray Serve, scaling automatically on K3s, and validated end-to-end.

If you’re starting your own Ray journey:
expect some crashes, expect to tune aggressively, but also expect that once it clicks, you’ll have a self-healing, GPU-powered AI system that feels alive.


Sources & Links

Disclaimer: This blog was written with assistance from genAI and large language models (LLMs).