
How to Deploy an LLM on a VPS in 2026 (Step-by-Step Guide)


Self-host an open LLM (Llama, Mistral, Qwen) on your own VPS using vLLM or Ollama, with GPU-on-demand or CPU-only for smaller models.

Misar Team·Aug 23, 2025·3 min read

Quick Answer

For CPU or small GPUs: use Ollama with quantized GGUF models (Llama 3.1 8B runs on 8GB RAM). For production serving: vLLM on a dedicated GPU (RTX 4090, A100, or rented). Expose via OpenAI-compatible API behind Caddy/Traefik with HTTPS.

  • Setup time: 1-3 hours
  • Cost: $5/mo CPU VPS to $500/mo A100
  • Throughput: 10-200 tokens/sec depending on setup

What You'll Need

  • VPS with 16GB+ RAM (CPU) or GPU VPS (RunPod, Vast.ai, Hetzner GPU)
  • Docker installed
  • Domain + Caddy/Traefik for HTTPS
  • Ollama or vLLM

Steps

  • Choose model size by hardware.
  • CPU (16GB RAM): Llama 3.1 8B Q4_K_M, Qwen 2.5 7B
  • Single GPU 24GB (RTX 3090/4090): Qwen 2.5 32B Q4, Llama 3.1 8B at full precision (70B Q4 needs roughly 40GB+ and does not fit in 24GB)
  • A100 80GB: Llama 3.1 70B Q4 or FP8 (full-precision FP16 70B needs roughly 140GB, i.e. two A100s), Mixtral 8x22B Q4
  • Install Ollama (simplest). curl -fsSL https://ollama.com/install.sh | sh. Pull model: ollama pull llama3.1:8b. Ollama exposes OpenAI-compatible API on :11434.
  • Or install vLLM (production). docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest --model meta-llama/Llama-3.1-8B-Instruct. Gated models like Llama need a Hugging Face token (pass -e HUGGING_FACE_HUB_TOKEN=...). Far higher throughput than Ollama under concurrent load.
  • Put HTTPS in front. Caddy one-liner: your-domain.com { reverse_proxy localhost:8000 }. Auto Let's Encrypt certs.
  • Add auth. Ollama/vLLM don't ship auth. Use a simple Node.js or Caddy basic-auth proxy — reject requests without Authorization: Bearer <key>.
  • Test with OpenAI SDK. new OpenAI({ baseURL: 'https://your-domain.com/v1', apiKey: 'your-key' }). Works because both expose OpenAI format.
  • Monitor. Track GPU utilization (nvidia-smi), tokens/sec, queue depth. Prometheus + Grafana on the VPS.
  • Scale. Multi-GPU with vLLM tensor parallelism. Or run multiple single-GPU nodes behind a load balancer.
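A quick sanity check for the "choose model size by hardware" step: weight memory is roughly parameter count times bytes per weight, plus runtime overhead. A rough Python sketch (the 1.2x overhead factor and bits-per-weight values are estimates, not vendor specs):

```python
def model_memory_gb(params_billion: float, bits_per_weight: float,
                    overhead: float = 1.2) -> float:
    """Rough VRAM/RAM needed to load a model's weights.

    params_billion: parameter count in billions (e.g. 8 for Llama 3.1 8B)
    bits_per_weight: 16 for FP16, ~4.5 for Q4_K_M, 8 for FP8/Q8
    overhead: multiplier for runtime buffers and activations (estimate)
    """
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total * overhead / 1e9

# Llama 3.1 8B at Q4_K_M: about 5-6 GB, fits a 16GB-RAM CPU box
print(round(model_memory_gb(8, 4.5), 1))    # 5.4
# Llama 3.1 70B at FP16: about 168 GB, needs multiple 80GB GPUs
print(round(model_memory_gb(70, 16), 1))    # 168.0
```

This is also why 70B Q4 does not fit a 24GB card: model_memory_gb(70, 4.5) comes out well above 24.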
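For the "add auth" step, the article suggests a Node.js or Caddy proxy; the core check such a proxy performs is sketched here in Python for illustration. The key store is hypothetical; in practice load keys from an environment variable or secret manager:

```python
import hmac

# Hypothetical key store for the sketch.
VALID_KEYS = {"sk-local-abc123"}

def is_authorized(auth_header: str | None) -> bool:
    """Accept only 'Authorization: Bearer <key>' with a known key."""
    if not auth_header or not auth_header.startswith("Bearer "):
        return False
    presented = auth_header[len("Bearer "):]
    # Constant-time comparison to avoid timing leaks.
    return any(hmac.compare_digest(presented, k) for k in VALID_KEYS)

print(is_authorized("Bearer sk-local-abc123"))  # True
print(is_authorized("Bearer wrong"))            # False
print(is_authorized(None))                      # False
```

The same logic maps directly onto a Caddy basic-auth block or an Express middleware; the point is to reject anything without a known bearer key before it reaches the model server.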
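The "test with OpenAI SDK" step works because Ollama and vLLM both speak the OpenAI chat-completions wire format. The same request can be built with nothing but the standard library; the domain, key, and model name below are placeholders for your own setup:

```python
import json
import urllib.request

BASE_URL = "https://your-domain.com/v1"   # your Caddy-fronted endpoint
API_KEY = "your-key"                      # the key your auth proxy expects

payload = {
    "model": "llama3.1:8b",               # model name as your server knows it
    "messages": [{"role": "user", "content": "Say hello in one word."}],
    "max_tokens": 16,
}

req = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
)
# Uncomment against a live server:
# resp = urllib.request.urlopen(req)
# print(json.load(resp)["choices"][0]["message"]["content"])
```

Any OpenAI-compatible client (the official SDKs included) sends exactly this shape, which is why pointing baseURL at your VPS just works.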

Common Mistakes

  • Running 70B on 16GB: OOM crashes. Check quantization fits VRAM/RAM.
  • No rate limiting: One runaway client saturates the box. Add per-key limits.
  • Ignoring context window: KV cache grows with context, roughly 130MB per 1K tokens for an 8B model and several times that for 70B. Don't over-allocate max context length.
  • No HTTPS: browser clients block mixed-content requests and your API keys travel in plaintext. Always front with Caddy/Traefik.
  • Skipping quantization: Full-precision 8B needs 16GB VRAM. Q4_K_M needs 5GB with minor quality loss.
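The context-window warning can be made precise: per-token KV cache is 2 (K and V) x layers x KV heads x head dimension x bytes per value. Plugging in Llama 3.1 8B's published architecture (32 layers, 8 KV heads via GQA, head dim 128) at FP16:

```python
def kv_cache_bytes_per_token(layers: int, kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    """Per-token KV cache: one K and one V vector per layer."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# Llama 3.1 8B, FP16 cache
per_tok = kv_cache_bytes_per_token(32, 8, 128)
print(per_tok)                 # 131072 bytes = 128 KB per token
print(per_tok * 8192 / 2**30)  # 1.0 -> an 8K context costs ~1 GB
```

So an 8K-context 8B deployment reserves about a gigabyte per concurrent sequence just for cache; budget it before raising the server's max context.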

Top Tools

| Tool        | Best For       | Price      |
| ----------- | -------------- | ---------- |
| Ollama      | Easiest setup  | Free       |
| vLLM        | High throughput| Free       |
| Llama.cpp   | CPU / edge     | Free       |
| Caddy       | HTTPS proxy    | Free       |
| Hetzner GPU | Cheap GPU VPS  | $70-500/mo |

FAQs

Q: Ollama vs vLLM?

Ollama: simplest to set up, fine for a single user. vLLM: more setup, but continuous batching gives roughly 10x the throughput under concurrent load.

Q: Which GPU for production?

RTX 4090 (24GB) for indie. A100 (80GB) for scale. H100 for frontier.

Q: Can I run on CPU only?

Yes — 8B quantized model on 16GB RAM. ~5-10 tok/sec. Fine for batch jobs.

Q: Is this cheaper than OpenAI API?

It depends on the comparison. Against frontier-model API pricing, self-hosting a capable open model can pay off at moderate volume; against budget APIs serving the same open models, it rarely does. Compare your monthly GPU cost to your token volume times the API's per-token price.
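The break-even arithmetic is one division. The prices below are illustrative assumptions for the sketch, not quotes:

```python
# Illustrative break-even: rented GPU vs. pay-per-token API.
gpu_cost_per_month = 300.0      # e.g. a rented 24GB GPU VPS, $/month
api_price_per_m_tokens = 0.50   # blended $/1M tokens for a comparable API

breakeven_m_tokens = gpu_cost_per_month / api_price_per_m_tokens
print(breakeven_m_tokens)  # 600.0 -> self-hosting wins past 600M tokens/mo
```

Swap in the actual API price you would otherwise pay; against a $30/1M-token frontier model the break-even drops to around 10M tokens a month.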

Q: Data privacy vs hosted API?

Prompts and responses never leave your server, and no model provider sees your data. Strictly speaking your VPS host still runs the hardware, so review their terms if that matters.

Q: Can I serve multiple models?

Not in a single vLLM process: vLLM serves one model per instance. Run one instance per model behind a router (e.g. a LiteLLM proxy), or use Ollama, which can load and swap between multiple pulled models on one box.

Conclusion

Self-hosting LLMs in 2026 is easier than ever. Start with Ollama on a cheap GPU VPS, graduate to vLLM when throughput matters. Full control, full privacy, predictable cost.

Tags: self-hosted-llm, vllm, ollama, vps, open-source-ai