Overview
This step-by-step guide shows you how to deploy a private, local AI stack using Ollama and Open WebUI with Docker. Ollama runs large language models (LLMs) locally, and Open WebUI provides a friendly browser interface. The setup supports both NVIDIA GPUs for acceleration and CPU-only machines. You will get persistent storage, a clean Docker Compose file, and optional reverse proxy hardening.
Why this stack?
Running models locally gives you fast, private inference with full control. Ollama supports popular models like Llama 3, Mistral, and Phi 3. Open WebUI offers chat history, model switching, prompt templates, and a polished UX. Docker keeps everything reproducible and easy to update. With GPU enabled, throughput improves dramatically; without a GPU, it still works on modern CPUs.
Prerequisites
You need a 64-bit Linux host (Ubuntu 22.04+ recommended), macOS, or Windows with WSL2. Install Docker Engine and the Docker Compose plugin. If you have an NVIDIA GPU on Linux, install the proprietary driver and the NVIDIA Container Toolkit so containers can see the GPU. Confirm Docker is functional with a simple hello-world container before proceeding.
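A quick sanity check: the standard commands below should all succeed, and the hello-world run confirms the engine can pull and start containers.
docker --version
docker compose version
docker run --rm hello-world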
Check your GPU (optional but recommended)
On Linux with NVIDIA, verify the driver and CUDA stack are working. The command below should list your GPU. If it fails, resolve driver issues before continuing.
nvidia-smi
Create a project directory
Create a working folder to store the Docker Compose file and your persistent volumes. The same directory will hold your reverse proxy config if you enable it later. For example, use ~/ollama-stack on Linux or a similar path on other systems.
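A minimal sketch on Linux, using the ~/ollama-stack path suggested above:
mkdir -p ~/ollama-stack
cd ~/ollama-stack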
Write the Docker Compose file
Save the following as docker-compose.yml. It defines two services: the Ollama model server and Open WebUI. Volumes ensure model files and chat history survive container updates.
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama-data:/root/.ollama
    environment:
      - OLLAMA_KEEP_ALIVE=24h
    # GPU: uncomment this block if you have NVIDIA drivers and Container Toolkit installed
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           count: all
    #           capabilities: ["gpu"]

  open-webui:
    image: ghcr.io/open-webui/open-webui:latest
    container_name: open-webui
    restart: unless-stopped
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama
    volumes:
      - openwebui-data:/app/backend/data

volumes:
  ollama-data:
  openwebui-data:
Start the services
From the folder that contains docker-compose.yml, bring up the stack in the background. The first run will download images; future starts are much faster.
docker compose up -d
docker compose ps
Pull a model
Ollama does not ship with models. Pull one to get started. Llama 3 8B is a great baseline on consumer GPUs or modern CPUs. You can pull models through Open WebUI, but using the CLI is immediate and reliable.
docker exec -it ollama ollama pull llama3:8b
# Alternative models:
# docker exec -it ollama ollama pull mistral
# docker exec -it ollama ollama pull phi3
Open the interface
Visit http://localhost:3000 in your browser. On first load, Open WebUI initializes its database. Select your model (for example, llama3:8b) and start chatting. If you are accessing from another device on the LAN, replace localhost with the host’s IP address.
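If the page loads but no models are listed, check that the Ollama API itself is reachable from the host; /api/tags returns the models Ollama knows about (this assumes the default 11434 port mapping from the Compose file):
curl http://localhost:11434/api/tags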
Enable GPU acceleration (Linux/NVIDIA)
If you have an NVIDIA GPU, install the NVIDIA Container Toolkit so Docker can pass the GPU into the container. After installation, uncomment the GPU block in the Compose file and redeploy. On CPU-only systems, keep the GPU configuration commented out; Ollama will fall back to CPU.
# On Ubuntu:
# 1) Install drivers via the "Additional Drivers" tool or:
# sudo apt-get update && sudo apt-get install -y nvidia-driver-535
# 2) Install the NVIDIA Container Toolkit (requires NVIDIA's apt repository; see NVIDIA's install docs):
# sudo apt-get install -y nvidia-container-toolkit
# sudo nvidia-ctk runtime configure --runtime=docker
# sudo systemctl restart docker
# 3) Recreate the stack:
docker compose down
docker compose up -d
# Validate GPU usage:
docker exec -it ollama nvidia-smi
Secure with a reverse proxy (optional)
If you plan to expose the interface on the internet, add a reverse proxy with TLS and HTTP Basic Auth. The example below uses Caddy for automatic HTTPS on a public domain. Replace example.com with your domain and set a strong username and password.
services:
  caddy:
    image: caddy:latest
    container_name: caddy
    restart: unless-stopped
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./Caddyfile:/etc/caddy/Caddyfile:ro
    depends_on:
      - open-webui

# Caddyfile (place next to docker-compose.yml)
# example.com {
#   basicauth {
#     admin JDJhJDEwJHh5eXouLi4  # use `caddy hash-password --plaintext YOURPASS`
#   }
#   reverse_proxy open-webui:8080
# }
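To generate the bcrypt hash referenced in the Caddyfile without installing Caddy on the host, you can run the hasher inside the official image (YOURPASS is a placeholder; paste the output into the Caddyfile):
docker run --rm caddy:latest caddy hash-password --plaintext 'YOURPASS'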
Persist and back up your data
Models and chat history live in Docker volumes named ollama-data and openwebui-data. Back them up by stopping the stack and using docker run --rm -v VOL:/data -v "$PWD":/backup alpine tar to archive each volume. Restoring is the reverse: create an empty volume and extract the tarball into it.
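A minimal backup-and-restore sketch. Note that Compose usually prefixes volume names with the project (directory) name, so check docker volume ls and adjust the names below:
docker compose down
# Back up each volume to a tarball in the current directory
docker run --rm -v ollama-data:/data -v "$PWD":/backup alpine tar czf /backup/ollama-data.tar.gz -C /data .
docker run --rm -v openwebui-data:/data -v "$PWD":/backup alpine tar czf /backup/openwebui-data.tar.gz -C /data .
# Restore: with the (empty) volume in place, extract the archive into it
docker run --rm -v ollama-data:/data -v "$PWD":/backup alpine tar xzf /backup/ollama-data.tar.gz -C /data
docker compose up -d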
Update the stack
To update to the latest images, pull and recreate. Your data volumes remain intact. If a model gets corrupted or partially downloaded, remove it with ollama rm MODEL and pull again.
docker compose pull
docker compose up -d
docker exec -it ollama ollama list
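If a model needs to be removed and re-pulled as described above, the same CLI applies (llama3:8b shown as an example):
docker exec -it ollama ollama rm llama3:8b
docker exec -it ollama ollama pull llama3:8b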
Troubleshooting
If Open WebUI cannot reach Ollama, ensure OLLAMA_BASE_URL uses the service name ollama (not localhost) inside Docker. If you see “could not select device driver with capabilities gpu,” the NVIDIA Container Toolkit is missing or misconfigured; confirm nvidia-smi works on the host and restart Docker. If ports are already in use, change 11434 and 3000 in the Compose file to free ports. For slow responses on CPU, try a smaller model (for example, phi3) or a more aggressively quantized variant of your current one.
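Container logs are the quickest way to diagnose most of these issues; the standard Compose commands below tail each service:
docker compose logs -f ollama
docker compose logs -f open-webui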
Performance tips
Use the GPU when available for large models and higher throughput. Pick a model that fits your VRAM; quantized 8B models typically fit in 8–12 GB. Increase OLLAMA_KEEP_ALIVE to avoid cold starts. Model loading is noticeably faster on SSD-backed storage. For multi-user setups, place the stack behind a reverse proxy and consider segmenting access per user.
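To confirm whether a loaded model is actually using the GPU, recent Ollama versions include ollama ps, which reports the models currently in memory and the processor they run on:
docker exec -it ollama ollama ps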
What you built
You now have a private, locally hosted AI chat interface backed by Ollama and Open WebUI, packaged with Docker for easy management. It runs fully offline, can leverage your GPU, and is simple to back up and update. From here, explore custom prompt templates, load additional models, or integrate the HTTP API for programmatic inference.
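As a starting point for programmatic use, Ollama's generate endpoint accepts a simple JSON payload (the model name assumes the llama3:8b pulled earlier):
curl http://localhost:11434/api/generate -d '{
  "model": "llama3:8b",
  "prompt": "Why is the sky blue?",
  "stream": false
}'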