Self-Host Ollama + Open WebUI: Run Local LLMs on Your Own Server

I’ve been messing with LLMs since early 2025, and honestly? The biggest pain point was always cost.

ChatGPT Plus is what, $20/month now? Claude Pro is another $20. If you want coding-specific models, throw in another subscription. Multiply that across a team or family, and you’re bleeding money just to ask a bot to rewrite emails.

I tried the API route too — pay-per-token seemed smart until I racked up a $80 bill in one weekend because I forgot to set a usage cap. My wallet still hasn’t forgiven me.

So I went looking for a self-hosted alternative. What I found was Ollama + Open WebUI, and honestly, it’s changed how I use AI entirely.

What Are We Building Here?

Two Docker containers, that’s it:

Ollama — the engine that runs the models. Supports everything from tiny 1B-parameter models (fast, dumb) up to massive 70B models (slow, smart).
Open WebUI — a beautiful, ChatGPT-like interface sitting on top of Ollama. Web search, file uploads, multi-model chats, the works.

No GPU required if you stick with smaller models. I run mine on a $40/month VPS with just CPU — works fine for coding assistance and writing.

Step 1: Get a Server

Before you do anything, you need a box to run this on. If you already have a homelab, great. If not, grab a VPS somewhere.

I use a VPS with 16GB RAM and 4 vCPUs. It’s overkill for small models but gives me room to run 7B-13B models comfortably. Minimum spec? 4GB RAM, 2 cores, 20GB disk. You’ll be surprised what runs on cheap hardware.

Step 2: Install Docker + Docker Compose

If you’ve hung around this blog, you probably already have Docker. If not:

curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker $USER

Log out and back in, then verify with docker ps. Simple as that.

Step 3: Deploy Ollama and Open WebUI

Create a directory and a docker-compose.yml:

version: '3.8'

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    volumes:
      - ./ollama_data:/root/.ollama
    ports:
      - "11434:11434"
    restart: unless-stopped

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    depends_on:
      - ollama
    volumes:
      - ./open-webui_data:/app/backend/data
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    restart: unless-stopped

Now run it:

docker compose up -d

That’s it. Two commands and you’re done. Open WebUI is now running on port 3000.

Step 4: Pull a Model

Head to http://your-server:3000, create an account, and you’ll see a clean chat interface. But it’s empty — you need to pull a model first.

I’d start with Llama 3.1 8B — it’s the sweet spot for most people. Fast, good at coding, decent reasoning:

docker exec -it ollama ollama pull llama3.1:8b

This downloads about 4.7GB. Grab a coffee. Or two.

Once it’s done, refresh Open WebUI and you’ll see the model in the dropdown. Start chatting.

Here’s what I run depending on the task:

Model	Size	RAM Needed	Best For
Llama 3.1 8B	4.7GB	8GB	General purpose, coding
Mistral 7B	4.1GB	8GB	Fast, good at reasoning
Qwen 2.5 7B	4.3GB	8GB	Excellent at writing
DeepSeek Coder V2	8.5GB	16GB	Code generation
Llama 3.1 70B	39GB	64GB	Best quality, needs serious hardware

Why This Beats OpenAI (for Most Things)

I still use ChatGPT sometimes. But for daily work, my self-hosted setup wins for three reasons:

No censorship. I can ask my local model anything — analyze sensitive documents, discuss controversial code approaches, whatever. There’s no content policy shutting me down mid-sentence.

No data leaks. This is the big one. Every prompt I send to ChatGPT goes to some server in Virginia. My local model never leaves my VPS. If you’re working with client data, proprietary code, or anything private, this alone justifies the setup.

No surprise bills. $0/month, no matter how many queries I send. I’ve left agents running overnight on autopilot and woke up to zero charges. Try that with an API.

What I Wish I Knew Before Starting

A few things I learned the hard way:

Big models are slow on CPU. Like, painfully slow. A 70B model on CPU generates about 1 token per second. Llama 3.1 8B does 15-20 tokens/second on a decent CPU. Stick to 7B-13B unless you’ve got a GPU.

Open WebUI has a built-in RAG system. You can upload PDFs, documents, even whole codebases, and the model will answer questions from your data. Game changer for documentation.

Set up a reverse proxy. Don’t expose port 3000 directly. Slap Nginx Proxy Manager or Traefik in front, add HTTPS with let’s encrypt, and consider setting up authentication. Or better yet, put it behind a VPN:

Storage adds up quick. Llama 3.1 8B is 4.7GB. If you install a few models, you’ll chew through 20-30GB fast. Plan your disk accordingly.

What’s Next?

Once you’ve got the basics running, try:

Enable web search in Open WebUI settings — your model can browse the internet for current info
Try Ollama’s multimodal models — Llama 3.2 Vision can analyze images
Set up model aliases so you can switch between cheap/fast and expensive/smart models on the fly
Hook it up to n8n — I’ve got an automation that routes support emails through my local LLM for drafting replies

Self-hosting an LLM isn’t some exotic thing reserved for people with racks of GPUs. It’s a Docker compose file and a couple of commands. Everyone should run their own AI.

Go give it a shot. You’ll wonder why you didn’t do it sooner.