Self-host Ollama on your own VPS
Why self-host Ollama
Ollama hit 52 million monthly downloads in early 2026. Running it on a VPS gives you a private, always-on LLM API endpoint that your other apps (Open WebUI, n8n, Dify, your own agents) can hit. No usage caps, no per-token bills, no third party seeing your prompts. Fair warning: small models only on standard VPS plans; large models need a GPU, which we don't currently offer.
Ollama needs at least 8 GB of RAM for usable models. Plans start at $31.19/mo.
Quick start: Ollama on Servury
Tested on Ubuntu 24.04. Pick an 8 GB+ plan, deploy, and SSH in.
# 1. Install Ollama (the one-liner)
curl -fsSL https://ollama.com/install.sh | sh
# 2. Pull a model that fits your VPS RAM
ollama pull llama3.2:3b # ~3 GB RAM, fast
ollama pull qwen2.5:7b # ~6 GB RAM, better quality
ollama pull llama3.1:8b # ~8 GB RAM, recommended
# 3. Expose the API to your other apps (LAN or VPN only, never raw internet)
sudo systemctl edit ollama
# Add under [Service]:
# Environment="OLLAMA_HOST=0.0.0.0:11434"
sudo systemctl restart ollama
# 4. Hit the OpenAI-compatible endpoint:
curl http://YOUR_SERVER_IP:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.1:8b","messages":[{"role":"user","content":"hello"}]}'
Security note: Ollama has no built-in auth. Always firewall port 11434 to your VPN/Tailscale range, or put a reverse proxy with auth in front. Never expose it open to the internet.
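One way to lock this down is a ufw rule that allows port 11434 only from your VPN range. A sketch, not a drop-in config: the 100.64.0.0/10 subnet below is Tailscale's default CGNAT range; substitute your own WireGuard subnet (for example 10.8.0.0/24) if that's what you use.

```shell
# Allow Ollama's port only from the Tailscale CGNAT range (100.64.0.0/10);
# swap in your WireGuard subnet if you use that instead
sudo ufw allow from 100.64.0.0/10 to any port 11434 proto tcp

# Drop everything else hitting 11434 from the public internet
sudo ufw deny 11434/tcp
sudo ufw reload

# Verify: the allow rule must sit above the deny (ufw matches in order)
sudo ufw status numbered
```

Rule order matters: ufw evaluates rules top to bottom, so add the allow rule before the deny.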
Plan picker by model size
VPS-200 plan: llama3.2:3b, phi3:3.8b. Fine for embeddings, classification, and simple chat.
VDS-300 or VPS-250: llama3.1:8b, qwen2.5:7b, mistral:7b. The sweet spot for most use cases.
VDS-450 plan: larger reasoning models. Slow on CPU but workable for batch jobs.
Real talk: Without a GPU, even 7B models run at 5-15 tokens/sec on CPU. Fine for background tasks, batch processing, and embedding generation. If you need real-time chat speeds for 13B+ models, you'll need a GPU host elsewhere and just point your apps at it.
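You don't have to take our numbers on faith: `ollama run --verbose` prints timing stats after each reply, so you can measure throughput on your own plan. The model name and prompt below are just examples.

```shell
# One-off throughput check: --verbose appends timing stats after the reply,
# including "eval rate" in tokens/s (the number that matters for chat speed)
ollama run llama3.1:8b --verbose "Summarize the plot of Hamlet in two sentences."
```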
Frequently asked questions
What is Ollama?
Ollama is the most popular tool for running open-source LLMs locally. It exposes an OpenAI-compatible API so your existing apps can swap from cloud providers to your local instance with one config change. Hit 52 million monthly downloads in Q1 2026.
Can I run Ollama without a GPU?
Yes. CPU inference works for 3B-8B models at 5-15 tokens/sec on a decent VPS. Fine for background jobs, embeddings, classification, and async chat. Real-time interactive chat at large model sizes is where GPUs matter.
How much RAM do I need?
Rule of thumb: model size in GB + 2 GB of overhead. A Q4-quantized llama3.1:8b needs about 6 GB of RAM, so an 8 GB plan is the minimum. 7B models = 8 GB plan. 13B models = 16+ GB plan. 70B models are not realistic on CPU.
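A quick way to sanity-check the rule of thumb on your server is to read MemAvailable from /proc/meminfo (Linux only) and compare it against your model's size plus ~2 GB:

```shell
# Rule-of-thumb check on Linux: MemAvailable should exceed
# (quantized model size + ~2 GB overhead). Reads /proc/meminfo directly.
awk '/MemAvailable/ {printf "%.1f GB available\n", $2 / 1024 / 1024}' /proc/meminfo
```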
Why use a VPS instead of running Ollama on my laptop?
Three reasons: (1) it stays online when your laptop sleeps, so apps and agents that depend on it keep working, (2) other devices on your network or VPN can hit one shared instance, (3) your laptop's RAM and CPU stay free for actual work.
Can I expose Ollama publicly?
You should not. Ollama has no auth. Firewall port 11434 to a VPN like WireGuard or Tailscale, or put a reverse proxy with token auth in front. Treat it like an internal service.
Does this work with Open WebUI, n8n, Dify, etc.?
Yes. All major AI tooling speaks the OpenAI API format, and Ollama exposes it. Set the base URL to http://your-vps-ip:11434/v1 and use any placeholder API key (Ollama ignores the key, but most clients require one to be set). Done.
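To confirm the wiring from the machine running your app, you can hit the same endpoint most OpenAI-compatible clients probe on startup. The IP and key value here are placeholders:

```shell
# /v1/models lists available models; the Bearer value is arbitrary
# because Ollama ignores it, but some clients refuse to start without one
curl -H "Authorization: Bearer placeholder" http://your-vps-ip:11434/v1/models
```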
Can I run multiple models?
Yes: ollama pull as many as you want. Ollama keeps them on disk and only loads one into RAM when it's used. The active model stays in RAM until it has been idle for a few minutes (five by default), then unloads automatically.
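You can see the disk-vs-RAM split yourself with two built-in commands; the model names below are just the ones from the quick start:

```shell
ollama pull llama3.2:3b
ollama pull qwen2.5:7b

ollama list   # everything downloaded to disk
ollama ps     # only what's currently loaded in RAM, with time until unload
```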
Where are servers located?
Montreal (owned hardware), New York, London, Paris, Frankfurt, Netherlands, and Singapore. For Ollama, pick the location closest to whatever apps will be calling its API.