Quick Answer
For CPU or small GPUs, use Ollama with quantized GGUF models (Llama 3.1 8B at Q4 runs in about 8GB of RAM). For production serving, run vLLM on a dedicated GPU (RTX 4090 or A100, owned or rented). Expose either one through its OpenAI-compatible API behind Caddy/Traefik with HTTPS.
- Setup time: 1-3 hours
- Cost: $5/mo CPU VPS to $500/mo A100
- Throughput: 10-200 tokens/sec depending on setup
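To compare the cost and throughput ranges above with per-token API pricing, a rough conversion helps. The sketch below assumes a flat monthly price, a steady generation rate, and a 30-day month. Pairing $5/mo with 10 tokens/sec and $500/mo with 200 tokens/sec is an assumption that simply takes the two ends of each range, and `costPerMillionTokens` is a hypothetical helper name.

```typescript
// Rough cost per million generated tokens for a flat-rate server.
// Assumption: the box generates at `tokensPerSecond` for `utilization`
// of every month (1.0 = busy around the clock).
function costPerMillionTokens(
  monthlyCostUsd: number,
  tokensPerSecond: number,
  utilization: number = 1.0
): number {
  const secondsPerMonth = 30 * 24 * 60 * 60; // 30-day month
  const tokensPerMonth = tokensPerSecond * secondsPerMonth * utilization;
  return (monthlyCostUsd / tokensPerMonth) * 1_000_000;
}

// Assumed pairings of the ranges quoted above.
const cpuBox = costPerMillionTokens(5, 10); // about $0.19 per 1M tokens
const a100 = costPerMillionTokens(500, 200); // about $0.96 per 1M tokens
const a100QuarterLoad = costPerMillionTokens(500, 200, 0.25); // about $3.86 per 1M tokens
```

Utilization dominates the result: the same A100 costs four times as much per token when it is busy only a quarter of the time.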
What You'll Need
- VPS with 16GB+ RAM (CPU) or GPU VPS (RunPod, Vast.ai, Hetzner GPU)
- Docker installed
- Domain + Caddy/Traefik for HTTPS
- Ollama or vLLM
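Whether a given box is big enough comes down to simple arithmetic on the model weights. The sketch below assumes weights take `parameters × bits per weight / 8` bytes and adds 20% headroom for the KV cache and runtime. The headroom figure and the roughly 4.8 bits per weight used for Q4_K_M are assumptions, and both function names are hypothetical.

```typescript
// Approximate memory needed for model weights, in GB (1 GB = 1e9 bytes).
function weightMemoryGb(paramsBillions: number, bitsPerWeight: number): number {
  return (paramsBillions * bitsPerWeight) / 8;
}

// True if the weights plus a safety margin fit in the available RAM or VRAM.
// Assumption: 20% headroom for KV cache, activations, and the runtime itself.
function fitsInMemory(
  paramsBillions: number,
  bitsPerWeight: number,
  memoryGb: number,
  headroom: number = 0.2
): boolean {
  return weightMemoryGb(paramsBillions, bitsPerWeight) * (1 + headroom) <= memoryGb;
}

const Q4_K_M_BITS = 4.8; // assumed effective bits per weight for Q4_K_M

const llama8bFp16Gb = weightMemoryGb(8, 16); // 16 GB of weights
const llama8bQ4On16Gb = fitsInMemory(8, Q4_K_M_BITS, 16); // true
const llama70bQ4On16Gb = fitsInMemory(70, Q4_K_M_BITS, 16); // false
```

Running this check before pulling a model is cheaper than discovering an out-of-memory crash after a multi-gigabyte download.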
Steps
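- Choose model size by hardware.
  - CPU (16GB RAM): Llama 3.1 8B Q4_K_M, Qwen 2.5 7B
  - Single GPU 24GB (RTX 3090/4090): Qwen 2.5 32B Q4 (about 20GB of weights). Llama 3.1 70B Q4 needs roughly 40GB and does not fit on one 24GB card.
  - A100 80GB: Llama 3.1 70B at Q4 (about 40GB), or at 8-bit with a short context. FP16 70B needs about 140GB and Mixtral 8x22B Q4 needs more than 80GB, so both require more than one GPU.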
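- Install Ollama (simplest). Run `curl -fsSL https://ollama.com/install.sh | sh`, then pull a model with `ollama pull llama3.1:8b`. Ollama serves an OpenAI-compatible API under `/v1` on port 11434.
- Or install vLLM (production). Run `docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest --model meta-llama/Llama-3.1-8B-Instruct`. The Llama repository on Hugging Face is gated, so the container also needs a Hugging Face access token passed in as an environment variable. vLLM batches concurrent requests, so its throughput under load is far higher than Ollama's.
- Put HTTPS in front. Caddy one-liner: `caddy reverse-proxy --from your-domain.com --to localhost:8000` (use port 11434 for Ollama). The Caddyfile equivalent is a `your-domain.com` site block containing `reverse_proxy localhost:8000`. Caddy obtains and renews Let's Encrypt certs automatically.
- Add auth. Ollama ships no auth, and vLLM only supports a single shared key via `--api-key`. For per-client keys, put a simple Node.js proxy in front that rejects requests without a valid `Authorization: Bearer <key>` header. Caddy's basic auth expects `Authorization: Basic` instead, which OpenAI SDKs do not send.
- Test with OpenAI SDK. `new OpenAI({ baseURL: 'https://your-domain.com/v1', apiKey: 'your-key' })`. This works because both servers expose the OpenAI API format.
- Monitor. Track GPU utilization (`nvidia-smi`), tokens/sec, and queue depth, with Prometheus + Grafana on the VPS.
- Scale. Use vLLM tensor parallelism across multiple GPUs, or run multiple single-GPU nodes behind a load balancer.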
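The auth step above boils down to one check on the `Authorization` header. Here is a minimal sketch of that check as a pure function, so it can sit inside any Node.js proxy. The key values, `VALID_KEYS`, `extractBearerToken`, and `authorize` are all hypothetical names for illustration; a real proxy would load keys from a secret store and might prefer a constant-time comparison.

```typescript
// Hypothetical keys for illustration only; load real keys from a secret store.
const VALID_KEYS: Set<string> = new Set(["example-key-1", "example-key-2"]);

interface AuthResult {
  status: number; // HTTP status the proxy should answer with
  reason: string;
}

// Pull the token out of an "Authorization: Bearer <key>" header value.
function extractBearerToken(header: string | undefined): string | null {
  if (header === undefined) return null;
  const match = /^Bearer\s+(\S+)$/i.exec(header.trim());
  return match ? (match[1] ?? null) : null;
}

// Decide whether a request may be forwarded to Ollama/vLLM.
function authorize(header: string | undefined): AuthResult {
  const token = extractBearerToken(header);
  if (token === null) {
    return { status: 401, reason: "missing or malformed Authorization header" };
  }
  if (!VALID_KEYS.has(token)) {
    return { status: 401, reason: "unknown API key" };
  }
  return { status: 200, reason: "ok" };
}
```

The proxy would call `authorize(req.headers.authorization)` on each request and forward to the model server only when the status is 200.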
Common Mistakes
- Running 70B on 16GB: OOM crashes. Check that the quantized model fits in VRAM/RAM before pulling it.
- No rate limiting: One runaway client saturates the box. Add per-key limits.
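- Ignoring context window: KV cache grows linearly with context length and with concurrent requests. In FP16, a 7B model with full multi-head attention needs ~500MB per 1K tokens; models with grouped-query attention such as Llama 3.1 8B need ~128MB. Don't over-allocate.
- No HTTPS: Plain HTTP sends API keys and prompts in cleartext, and browsers block HTTP calls made from HTTPS pages. Always use Caddy/Traefik.
- Skipping quantization: FP16 8B needs about 16GB of VRAM for weights alone. Q4_K_M needs about 5GB with minor quality loss.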
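The KV cache figure in the list above depends on model architecture, so it is worth computing per model. The sketch below assumes the usual formula of 2 (key and value) × layers × KV heads × head dimension × bytes per value for each token. The shape values are taken to be those in the published model configs and should be checked against the model's `config.json`; `ModelShape` and both function names are hypothetical.

```typescript
interface ModelShape {
  layers: number; // transformer layers
  kvHeads: number; // key/value heads (fewer than query heads under GQA)
  headDim: number; // dimension of each head
  bytesPerValue: number; // 2 for FP16/BF16
}

// Bytes of KV cache one token occupies: a key and a value per layer and KV head.
function kvCacheBytesPerToken(m: ModelShape): number {
  return 2 * m.layers * m.kvHeads * m.headDim * m.bytesPerValue;
}

// Total KV cache in MiB for a context length and a number of concurrent sequences.
function kvCacheMiB(m: ModelShape, contextTokens: number, sequences: number = 1): number {
  return (kvCacheBytesPerToken(m) * contextTokens * sequences) / (1024 * 1024);
}

// Assumed shapes; verify against each model's config.json.
const llama31_8b: ModelShape = { layers: 32, kvHeads: 8, headDim: 128, bytesPerValue: 2 };
const llama2_7b: ModelShape = { layers: 32, kvHeads: 32, headDim: 128, bytesPerValue: 2 };

const gqaPer1k = kvCacheMiB(llama31_8b, 1024); // 128 MiB
const mhaPer1k = kvCacheMiB(llama2_7b, 1024); // 512 MiB
const gqaFullContext = kvCacheMiB(llama31_8b, 131072); // 16384 MiB (16 GiB)
```

Under these assumptions, a single full 128K-token sequence on Llama 3.1 8B needs about as much memory for its cache as the FP16 weights do, which is why capping the context length matters.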
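For the per-key limits mentioned in the list above, a token bucket is one common approach, though not the only one. This is a minimal in-memory sketch: each API key gets a bucket that holds up to `capacity` requests and refills at `refillPerSecond`. The class names and parameters are hypothetical, and time is passed in explicitly so the logic stays easy to test.

```typescript
// One bucket per API key: a burst of up to `capacity`, refilled over time.
class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;
  private readonly capacity: number;
  private readonly refillPerSecond: number;

  constructor(capacity: number, refillPerSecond: number, nowMs: number) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.tokens = capacity; // start full
    this.lastRefillMs = nowMs;
  }

  // Refill for the elapsed time, then try to spend `cost` tokens.
  tryConsume(nowMs: number, cost: number = 1): boolean {
    if (nowMs > this.lastRefillMs) {
      const elapsedSec = (nowMs - this.lastRefillMs) / 1000;
      this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSecond);
      this.lastRefillMs = nowMs;
    }
    if (this.tokens >= cost) {
      this.tokens -= cost;
      return true;
    }
    return false;
  }
}

// Keeps a separate bucket per key so one client cannot starve the others.
class PerKeyLimiter {
  private readonly buckets: Map<string, TokenBucket> = new Map();
  private readonly capacity: number;
  private readonly refillPerSecond: number;

  constructor(capacity: number, refillPerSecond: number) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
  }

  allow(apiKey: string, nowMs: number): boolean {
    let bucket = this.buckets.get(apiKey);
    if (bucket === undefined) {
      bucket = new TokenBucket(this.capacity, this.refillPerSecond, nowMs);
      this.buckets.set(apiKey, bucket);
    }
    return bucket.tryConsume(nowMs);
  }
}
```

A proxy would call `limiter.allow(key, Date.now())` after the auth check and answer with HTTP 429 when it returns false. State lives in process memory here, so multiple proxy instances would need a shared store.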
Top Tools
| Tool | Best For | Price |
|---|---|---|
| Ollama | Easiest setup | Free |
| vLLM | High throughput | Free |
| llama.cpp | CPU / edge | Free |
| Caddy | HTTPS proxy | Free |
| Hetzner GPU | Cheap GPU VPS | $70-500/mo |
Conclusion
Self-hosting LLMs in 2026 is easier than ever. Start with Ollama on a cheap GPU VPS, then graduate to vLLM when throughput matters. You get full control, full privacy, and predictable cost.
