Deploy a Private AI Chat Server on Ubuntu with Ollama and Open WebUI (GPU Support)

If you want a fast, private, and low-cost AI chat system that stays on your server, Ollama plus Open WebUI is a rock-solid choice. In this tutorial, you will deploy a local large language model stack on Ubuntu using Docker, enable NVIDIA GPU acceleration for speed, put it behind HTTPS with Caddy, and learn how to back it up, update it, and troubleshoot common issues.

What you will build: a three-container stack (Ollama + Open WebUI + Caddy) running on Ubuntu 22.04/24.04. Ollama hosts models (like Llama 3.1), Open WebUI provides a friendly chat interface with built-in auth, and Caddy terminates TLS with a free certificate. All traffic to Ollama is kept internal so only the web UI is exposed.

Prerequisites

- An Ubuntu 22.04 or 24.04 server with sudo access and outbound internet. - A domain name pointing to your server’s public IP (A/AAAA record). - Ports 80 and 443 reachable from the internet. - Optional but recommended: an NVIDIA GPU (T4, A10, RTX 30/40, etc.). CPU-only mode also works, just slower. - 16 GB RAM minimum recommended for 7–8B models; more for larger models.

Step 1 — Install Docker and Compose

Install Docker Engine and the Compose plugin.

sudo apt update
sudo apt install -y ca-certificates curl gnupg
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | \
sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] \
  https://download.docker.com/linux/ubuntu $(. /etc/os-release; echo $VERSION_CODENAME) stable" | \
sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io \
  docker-buildx-plugin docker-compose-plugin

sudo usermod -aG docker $USER
newgrp docker
docker compose version

Step 2 — Enable NVIDIA GPU (optional but recommended)

If your server has an NVIDIA GPU, install the NVIDIA drivers on the host and the NVIDIA Container Toolkit so Docker can use the GPU. If you plan to run CPU-only, skip this step.

# Install the NVIDIA Container Toolkit
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt update
sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Sanity check: should show GPU info
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

If the sanity check fails, verify drivers with nvidia-smi on the host and ensure Secure Boot isn’t blocking the kernel modules.

Step 3 — Create docker-compose.yml

This Compose file deploys Ollama, Open WebUI, and Caddy. The Ollama API is bound to localhost for safety; only Caddy listens publicly with HTTPS. Replace your.domain.com and the email in the Caddyfile later.

mkdir -p ~/private-ai && cd ~/private-ai
cat > docker-compose.yml <<'YAML'
version: "3.9"
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    environment:
      - OLLAMA_KEEP_ALIVE=24h
    ports:
      - "127.0.0.1:11434:11434"
    volumes:
      - ollama:/root/.ollama
    gpus: all
    networks: [ai]

  open-webui:
    image: ghcr.io/open-webui/open-webui:latest
    container_name: open-webui
    depends_on: [ollama]
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - WEBUI_AUTH=True
    volumes:
      - open-webui:/app/backend/data
    ports:
      - "127.0.0.1:3000:8080"
    restart: unless-stopped
    networks: [ai]

  caddy:
    image: caddy:2
    container_name: caddy
    depends_on: [open-webui]
    volumes:
      - ./Caddyfile:/etc/caddy/Caddyfile:ro
      - caddy_data:/data
      - caddy_config:/config
    ports:
      - "80:80"
      - "443:443"
    restart: unless-stopped
    networks: [ai]

networks:
  ai:
    driver: bridge

volumes:
  ollama:
  open-webui:
  caddy_data:
  caddy_config:
YAML

Note: If you do not have an NVIDIA GPU or Docker Compose errors on the gpus: all line, remove that line and run CPU-only. Performance will be slower.

Step 4 — Add a Caddyfile for HTTPS

Create a simple Caddyfile that reverse-proxies your domain to Open WebUI and provisions a free TLS certificate automatically.

cat > Caddyfile <<'CADDY'
your.domain.com {
  encode gzip
  reverse_proxy open-webui:8080
  tls you@example.com
  header {
    Strict-Transport-Security "max-age=31536000; includeSubDomains; preload"
    X-Content-Type-Options "nosniff"
    X-Frame-Options "DENY"
    Referrer-Policy "no-referrer-when-downgrade"
  }
}
CADDY

Ensure your DNS A/AAAA record for your.domain.com points to this server’s public IP and that ports 80 and 443 are open in any firewall or cloud security group.

Step 5 — Launch the stack

Start everything with one command:

docker compose up -d
docker compose ps

Visit https://your.domain.com in a browser. The first user who signs up becomes admin in Open WebUI. Keep your credentials safe. By default, the Ollama API is not exposed publicly; it is only reachable by Open WebUI inside the Docker network.

Step 6 — Pull a model and test

Use Ollama to download a model. Smaller models start faster and fit more GPUs; larger models are smarter but need more VRAM.

# Example: Llama 3.1 8B
docker exec -it ollama ollama pull llama3.1:8b

# Quick API test (CPU/GPU both work)
curl -s http://127.0.0.1:11434/api/generate \
  -d '{"model":"llama3.1:8b","prompt":"Say hello in one sentence."}' | jq .

Open WebUI will list the model automatically. Start chatting at https://your.domain.com and choose the model in the UI. If VRAM is limited, try 7B/8B variants or quantized builds (e.g., Q4_K_M).

Step 7 — Tuning for performance and memory

- GPU VRAM: 8B models typically need 6–10 GB VRAM depending on quantization; 13B often needs 10–16 GB. If you run out of memory, pick a smaller or more heavily quantized model. - Context length: use smaller context (e.g., 4096) for speed; larger contexts consume more RAM/VRAM. - Batching: in Open WebUI, keep concurrent chats lower on small GPUs. - Keep-alive: OLLAMA_KEEP_ALIVE=24h keeps models warm and reduces first-token latency at the cost of memory.

Step 8 — Back up and update safely

Your data lives in Docker volumes. Back them up regularly (especially Open WebUI data if you store conversations). These commands create tar archives in the current directory.

# Stop the stack before a consistent backup (optional but recommended)
docker compose down

# Back up Ollama models and cache
docker run --rm -v ollama:/data -v "$(pwd)":/backup busybox \
  tar czf /backup/ollama-$(date +%F).tgz -C /data .

# Back up Open WebUI data (users, settings, history)
docker run --rm -v open-webui:/data -v "$(pwd)":/backup busybox \
  tar czf /backup/openwebui-$(date +%F).tgz -C /data .

# Bring the stack back up
docker compose up -d

To update to the latest versions, pull new images and recreate containers without losing data:

cd ~/private-ai
docker compose pull
docker compose up -d

Step 9 — Troubleshooting

- GPU not detected in containers: run docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi. If it fails, reinstall the NVIDIA driver and the container toolkit, and restart Docker. - Permission denied with Docker: add your user to the docker group and re-login (usermod -aG docker $USER + newgrp docker). - HTTPS not provisioning: make sure the domain’s DNS points to the server, and ports 80/443 are open and not used by another service (stop Apache/NGINX if present). - 502/Bad Gateway from Caddy: check docker compose logs open-webui to ensure it started; it can take 10–30 seconds on the first run. - Out-of-memory or crashes when chatting: choose a smaller model (e.g., 7B/8B), use a more aggressive quantization, or reduce context length. - Compose complains about gpus: remove the gpus: all line and start CPU-only, or run Ollama via docker run --gpus all instead of Compose.

What’s next?

You now have a production-ready private AI chat server with HTTPS and optional GPU acceleration. Explore model variants (instruction-tuned, coding, reasoning), enable role-based access in Open WebUI, add nightly backups, and monitor GPU/CPU usage with tools like nvtop and cAdvisor. For advanced setups, place the stack behind a VPN or zero-trust proxy and add per-user rate limits.

LifeBytes Journal

Search This Blog