Self-Host Local LLMs with Ollama and Open WebUI on Docker (GPU Optional)

Overview

Want fast, private AI chat and prompts without sending data to the cloud? In this tutorial, you will deploy Ollama (for running local LLMs) and Open WebUI (a friendly web interface) using Docker. The setup works on CPU or NVIDIA GPU, persists your models, and can be updated with a single command. By the end, you will have a local AI stack that supports models like Llama 3 and Mistral, all running on your own hardware.

Prerequisites

- A Linux server (Ubuntu 22.04/24.04 or similar) with 8 GB RAM or more. CPU works; GPU is optional for acceleration.

- Docker Engine and Docker Compose plugin installed.

- Optional GPU: NVIDIA driver and NVIDIA Container Toolkit.

1) Install Docker and Compose

If Docker is not installed, use the official convenience script:

curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker $USER
newgrp docker
docker --version
docker compose version

If you prefer APT, follow Docker’s repository instructions for your distribution, then verify Docker and the Compose plugin versions.

2) (Optional) Enable NVIDIA GPU for Containers

Install the NVIDIA driver (matching your GPU) and the NVIDIA Container Toolkit:

# On Ubuntu
sudo apt-get update
sudo apt-get install -y software-properties-common
sudo add-apt-repository ppa:graphics-drivers/ppa -y
sudo apt-get update
sudo apt-get install -y nvidia-driver-535

# NVIDIA Container Toolkit
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -fsSL https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Validate GPU visibility in containers
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi

If you see your GPU listed by nvidia-smi inside the container, you are ready for GPU acceleration.

3) Create the Docker Compose File

Create a working directory and a compose file:

mkdir -p ~/ai-stack && cd ~/ai-stack
nano docker-compose.yml

Paste the CPU-friendly baseline stack:

version: "3.8"
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama
    environment:
      - OLLAMA_KEEP_ALIVE=5m

  open-webui:
    image: ghcr.io/open-webui/open-webui:latest
    container_name: open-webui
    restart: unless-stopped
    depends_on:
      - ollama
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    volumes:
      - open-webui:/app/backend/data

volumes:
  ollama:
  open-webui:

For NVIDIA GPU acceleration, add this line under the ollama service (indentation matters):

    gpus: all

The stack exposes Ollama’s API on port 11434 and the web interface on port 3000. Change host ports if needed.

4) Launch the Stack

Start the containers:

docker compose up -d
docker compose ps

Open your browser to http://SERVER_IP:3000 to access Open WebUI. The interface will detect your Ollama instance automatically via the configured URL.

5) Pull a Model and Test

Use Ollama to download a model. Llama 3 8B Instruct is a popular starting point:

docker exec -it ollama ollama pull llama3:8b-instruct

You can now chat inside Open WebUI by selecting the model. To test the API directly:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3:8b-instruct",
  "prompt": "Explain vector embeddings in one paragraph."
}'

6) Persist, Update, and Backup

Your models live in the ollama Docker volume, and Open WebUI settings (history, prompts, users) live in the open-webui volume. To update safely:

docker compose pull
docker compose up -d

List and remove models to manage disk space:

docker exec -it ollama ollama list
docker exec -it ollama ollama rm llama3:8b-instruct

Quick volume backups can be made with a temporary container:

# Backup Ollama models
docker run --rm -v ollama:/data -v "$PWD":/backup busybox tar czf /backup/ollama-backup.tgz -C /data .

# Backup Open WebUI data
docker run --rm -v open-webui:/data -v "$PWD":/backup busybox tar czf /backup/openwebui-backup.tgz -C /data .

7) Secure Access

By default, the web UI is reachable on your server’s IP and port 3000. For a safer setup, bind Open WebUI to localhost and place it behind a TLS reverse proxy. Edit the port mapping to 127.0.0.1:3000:8080 and use a reverse proxy like Caddy or Nginx with a domain and HTTPS.

Example Caddyfile (auto TLS):

ai.example.com {
  reverse_proxy 127.0.0.1:3000
}

Run Caddy with Docker:

docker run -d --name caddy \
  -p 80:80 -p 443:443 \
  -v $PWD/Caddyfile:/etc/caddy/Caddyfile \
  -v caddy_data:/data \
  caddy:latest

Open WebUI includes auth and user management; create an admin on first login and restrict sign-ups in settings if you do not want public registration.

Troubleshooting

- Port conflicts: If 3000 or 11434 are in use, change the host ports in docker-compose.yml and restart.

- GPU not detected: Ensure docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi works. Verify drivers and container toolkit, then add gpus: all to the Ollama service.

- Slow downloads: Models are large. Use a reliable connection and enough disk space; consider a local cache or mirror if multiple hosts will pull the same model.

- Memory errors: Choose smaller models (e.g., 7B/8B) or reduce context length in the UI. On CPU-only systems, expect slower generation; try quantized variants when available.

What You Achieved

You now have a modern local AI stack running Ollama and Open WebUI on Docker, with optional GPU acceleration, persistent storage, and HTTPS-ready reverse proxy integration. This setup is easy to update, secure, and extend—perfect for private experimentation, helpdesk assistants, coding copilots, or offline R&D.

LifeBytes Journal

Search This Blog