
Self-host Ollama on your own VPS

Ollama hit 52 million monthly downloads in early 2026. Running it on a VPS gives you a private, always-on LLM API endpoint that your other apps (Open WebUI, n8n, Dify, your own agents) can hit. No usage caps, no per-token bills, no third party seeing your prompts. Fair warning: standard VPS plans handle small models only; larger models need a GPU, which we don't currently offer.

Ollama needs at least 8 GB RAM for usable models. Starting at $31.19/mo.

Why self-host Ollama

Prompts never leave your box. Any hosted API, OpenRouter and OpenAI included, sees your prompts. With self-hosted Ollama, inference happens entirely on your VPS.
OpenAI-compatible API. Most apps that talk to OpenAI just need a base URL change to point at your Ollama instance instead (see the example after this list).
Always-on. Your laptop sleeps. A VPS gives Ollama the persistent uptime that agents and webhooks need.
Anonymous infra. Pay with XMR or BTC, no email required. Your inference layer stays unlinked from your real identity.
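
A minimal sketch of that base URL swap, assuming an app that reads OpenAI-style environment variables. OPENAI_BASE_URL and OPENAI_API_KEY are common conventions, not universal; check each app's docs for its exact setting.

# Point an OpenAI-compatible app at your Ollama VPS instead of api.openai.com
export OPENAI_BASE_URL="http://YOUR_SERVER_IP:11434/v1"
export OPENAI_API_KEY="ollama"   # Ollama ignores the key, but most clients require one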

Quick start: Ollama on Servury

Tested on Ubuntu 24.04. Pick an 8 GB+ plan, deploy, SSH in.

# 1. Install Ollama (the one-liner)
curl -fsSL https://ollama.com/install.sh | sh

# 2. Pull a model that fits your VPS RAM
ollama pull llama3.2:3b      # ~3 GB RAM, fast
ollama pull qwen2.5:7b       # ~6 GB RAM, better quality
ollama pull llama3.1:8b      # ~8 GB RAM, recommended

# 3. Expose the API to your other apps (LAN or VPN only, never raw internet)
sudo systemctl edit ollama
# Add under [Service]:
#   Environment="OLLAMA_HOST=0.0.0.0:11434"
sudo systemctl restart ollama

# 4. Hit the OpenAI-compatible endpoint:
curl http://YOUR_SERVER_IP:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.1:8b","messages":[{"role":"user","content":"hello"}]}'

Security note: Ollama has no built-in auth. Always firewall port 11434 to your VPN/Tailscale range, or put a reverse proxy with auth in front. Never expose it open to the internet.
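
As an example, here is a minimal ufw ruleset that only admits a VPN range. This is a sketch: it assumes ufw, and 100.64.0.0/10 is Tailscale's default range, so substitute your own VPN subnet if it differs.

# If ufw isn't enabled yet, allow SSH first so you don't lock yourself out
sudo ufw allow OpenSSH
# Allow Ollama's port only from the VPN range, drop everything else on 11434
sudo ufw allow from 100.64.0.0/10 to any port 11434 proto tcp
sudo ufw deny 11434/tcp
sudo ufw enable
sudo ufw status numbered   # confirm the allow rule sits above the deny rule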

Plan picker by model size

3B - 4B models

llama3.2:3b, phi3:3.8b. Fine for embeddings, classification, simple chat.

8 GB RAM · 4 vCPU
VPS-200 plan

7B - 8B models

llama3.1:8b, qwen2.5:7b, mistral:7b. Sweet spot for most use cases.

10+ GB RAM · 6+ vCPU
VDS-300 or VPS-250

13B+ models

Larger reasoning models. Slow on CPU but workable for batch jobs.

20+ GB RAM · 10+ vCPU
VDS-450 plan

Real talk: Without a GPU, even 7B models run at 5-15 tokens/sec on CPU. Fine for background tasks, batch processing, and embedding generation. If you need real-time chat speeds for 13B+ models, you'll need a GPU host elsewhere and just point your apps at it.
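
One way to check the throughput you're actually getting on your plan (assumes the quick-start model is already pulled):

# --verbose prints timing stats after the reply, including "eval rate" in tokens/s
ollama run llama3.1:8b --verbose "Summarize what a reverse proxy does in two sentences."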

Frequently asked questions

What is Ollama?

Ollama is the most popular tool for running open-source LLMs locally. It exposes an OpenAI-compatible API so your existing apps can swap from cloud providers to your local instance with one config change. It hit 52 million monthly downloads in Q1 2026.

Can I run Ollama without a GPU?

Yes. CPU inference works for 3B-8B models at 5-15 tokens/sec on a decent VPS. Fine for background jobs, embeddings, classification, and async chat. Real-time interactive chat at large model sizes is where GPUs matter.
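
For the embeddings use case, recent Ollama versions also serve an OpenAI-style /v1/embeddings endpoint. A sketch, using nomic-embed-text (a small dedicated embedding model that isn't in the plan picker above but stays fast on CPU):

# Pull a small embedding model and request a vector over the OpenAI-compatible API
ollama pull nomic-embed-text
curl http://YOUR_SERVER_IP:11434/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model":"nomic-embed-text","input":"text to embed"}'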

How much RAM do I need?

Rule of thumb: model size in GB + 2 GB overhead. A Q4-quantized llama3.1:8b needs about 6 GB RAM, so an 8 GB plan is the minimum. 7B models = 8 GB plan. 13B models = 16+ GB plan. 70B models are not realistic on CPU.
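
To sanity-check the rule of thumb on your own VPS, these commands show what a model actually occupies (a quick sketch using the quick-start model):

ollama show llama3.1:8b   # parameter count and quantization of the pulled model
ollama ps                 # memory footprint of whatever is currently loaded
free -h                   # overall RAM headroom on the VPS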

Why use a VPS instead of running Ollama on my laptop?

Three reasons: (1) it stays online when your laptop sleeps, so apps and agents that depend on it keep working, (2) other devices on your network or VPN can hit one shared instance, (3) your laptop's RAM and CPU stay free for actual work.

Can I expose Ollama publicly?

You should not. Ollama has no auth. Firewall port 11434 to a VPN like WireGuard or Tailscale, or put a reverse proxy with token auth in front. Treat it like an internal service.
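
If you use Tailscale, a slightly tighter option than firewalling alone is binding Ollama to the VPS's Tailscale address instead of 0.0.0.0. A sketch: the drop-in filename is arbitrary, and this assumes Tailscale is already installed and connected on the VPS.

# Use this in place of the 0.0.0.0 override from step 3 of the quick start
TS_IP=$(tailscale ip -4)
sudo mkdir -p /etc/systemd/system/ollama.service.d
cat <<EOF | sudo tee /etc/systemd/system/ollama.service.d/tailscale.conf
[Service]
Environment="OLLAMA_HOST=${TS_IP}:11434"
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama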

Does this work with Open WebUI, n8n, Dify, etc.?

Yes. All major AI tooling speaks the OpenAI API format and Ollama exposes it. Set the base URL to http://your-vps-ip:11434/v1 and use any placeholder API key. Done.
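
A quick way to confirm the endpoint those tools will use is reachable (the key value is a placeholder; Ollama accepts anything there):

# Lists your pulled models over the OpenAI-compatible API
curl http://your-vps-ip:11434/v1/models \
  -H "Authorization: Bearer placeholder"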

Can I run multiple models?

Yes. Pull as many as you want with ollama pull. Ollama keeps them on disk and only loads them into RAM when used. The active model uses RAM until idle, then unloads automatically.
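
For example, with two models pulled you can compare what's on disk versus what's loaded. The idle unload timer is around five minutes by default and can be tuned with the OLLAMA_KEEP_ALIVE environment variable in the same systemd override as OLLAMA_HOST.

ollama pull llama3.2:3b
ollama pull qwen2.5:7b
ollama list   # everything stored on disk
ollama ps     # only what is currently loaded in RAM, and for how much longer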

Where are servers located?

Montreal (owned hardware), New York, London, Paris, Frankfurt, Netherlands, and Singapore. For Ollama, pick the location closest to whatever apps will be calling its API.