Overview
Running local large language models (LLMs) is now practical with modern GPUs. In this tutorial, you will learn how to deploy Ollama and Open WebUI on Ubuntu using Docker and NVIDIA GPU acceleration. Ollama handles model downloads and inference, while Open WebUI provides a clean, browser-based chat interface. By the end, you will have a secure, upgradable stack for self-hosted AI on your own server.
Prerequisites
You will need Ubuntu 22.04 or 24.04 (server or desktop), an NVIDIA GPU with recent drivers, and sudo access. Ensure your GPU is visible to the OS with nvidia-smi. If you are starting from a clean install, run sudo ubuntu-drivers autoinstall, reboot, and verify that nvidia-smi shows your card and driver version.
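On a clean install, that sequence amounts to the following; the reboot is required before the driver check will succeed:
sudo ubuntu-drivers autoinstall
sudo reboot
nvidia-smi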
Step 1: Install Docker Engine
Install Docker from the official repository to get current features and security patches. First, add Docker’s GPG key and repository, then install Docker Engine and the Compose plugin.
sudo apt update && sudo apt install -y ca-certificates curl gnupg lsb-release
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(. /etc/os-release; echo $VERSION_CODENAME) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt update && sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo usermod -aG docker $USER && newgrp docker
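Before moving on, it is worth confirming that Docker and the Compose plugin respond; the hello-world image is an optional but quick end-to-end check:
docker --version
docker compose version
docker run --rm hello-world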
Step 2: Enable NVIDIA GPU in Containers
Install the NVIDIA Container Toolkit so Docker can access your GPU. This allows GPU passthrough to Ollama in a container.
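Note that on Ubuntu the nvidia-container-toolkit package typically comes from NVIDIA's own apt repository rather than the default sources. The commands below follow NVIDIA's documented repository setup at the time of writing; if they fail, check the current Container Toolkit installation guide:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update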
sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
docker run --rm --gpus all nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi
If the last command prints your GPU details, Docker sees your GPU. If not, verify your NVIDIA driver installation and rerun the steps above.
Step 3: Create a Docker Compose file
Use Docker Compose to run Ollama and Open WebUI together. Create a folder such as ~/ai-stack and add a compose.yml file with the following contents.
version: "3.8"
services:
  ollama:
    image: ollama/ollama:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - OLLAMA_KEEP_ALIVE=48h
    volumes:
      - ollama:/root/.ollama
    ports:
      - "11434:11434"
  openwebui:
    image: ghcr.io/open-webui/open-webui:latest
    depends_on:
      - ollama
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    volumes:
      - openwebui:/app/backend/data
    ports:
      - "3000:8080"
volumes:
  ollama:
  openwebui:
If you prefer AMD GPUs, replace ollama/ollama:latest with ollama/ollama:rocm and ensure your host has ROCm drivers configured. You will also need to pass the /dev/kfd and /dev/dri devices; consult Ollama’s ROCm documentation for the exact mappings.
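As a rough sketch only (verify the exact mappings against Ollama's ROCm documentation), the ollama service would change along these lines:
  ollama:
    image: ollama/ollama:rocm
    devices:
      - /dev/kfd
      - /dev/dri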
Step 4: Launch the Stack and Pull a Model
Start the services and tail the logs to confirm that everything is healthy.
docker compose up -d
docker compose logs -f ollama
Pull a model with a good balance between quality and VRAM needs, such as Llama 3.1 8B in a quantized format. You can do this via the terminal or from Open WebUI’s Model Manager.
docker exec -it $(docker compose ps -q ollama) ollama pull llama3.1:8b
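Once the pull completes, you can optionally confirm that Ollama is serving the model by querying its API from the host (this assumes the default 11434 port mapping from the Compose file above):
curl http://localhost:11434/api/tags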
Open your browser to http://SERVER_IP:3000. On first launch, create your admin account. In Settings, select the default model and start chatting. The first generation may be slower while the model warms the cache.
Step 5: Security, Updates, and Backups
If you plan to expose the service over the internet, place Open WebUI behind a reverse proxy with TLS (Caddy, Traefik, or Nginx) and restrict access by IP or set up SSO. For private networks, at minimum change the default ports and disable self-registration from the Admin panel to prevent unauthorized accounts.
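As one example of the reverse-proxy approach, a minimal Caddyfile forwarding a placeholder domain to Open WebUI could look like this; Caddy provisions TLS certificates automatically for publicly resolvable hostnames:
chat.example.com {
    reverse_proxy localhost:3000
}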
To update to the latest versions safely, pull new images and recreate the containers. Volumes preserve your models and settings.
docker compose pull && docker compose up -d
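If disk space is a concern, you can optionally remove the superseded images once the new containers are confirmed healthy:
docker image prune -f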
Back up your volumes regularly, especially before upgrades. A simple snapshot approach with tar works for low-downtime maintenance.
docker run --rm -v ollama:/data -v $(pwd):/backup busybox sh -c "cd /data && tar czf /backup/ollama-vol-$(date +%F).tgz ."
docker run --rm -v openwebui:/data -v $(pwd):/backup busybox sh -c "cd /data && tar czf /backup/openwebui-vol-$(date +%F).tgz ."
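Restoring follows the same pattern in reverse: stop the stack, extract the archive into the volume, and start the services again (substitute the date in the filename with that of your backup):
docker compose down
docker run --rm -v ollama:/data -v $(pwd):/backup busybox sh -c "cd /data && tar xzf /backup/ollama-vol-YYYY-MM-DD.tgz"
docker compose up -d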
Performance Tips
Choose quantized models that fit your VRAM; for 8–12 GB GPUs, Q4_K_M variants are often the best starting point. Monitor VRAM usage with watch -n1 nvidia-smi. In Open WebUI, tune the context length and batch size to match your GPU memory. Keep the driver, CUDA runtime, and container toolkit reasonably current for stability and speed. If you run multiple simultaneous chats, consider increasing concurrency carefully and watch for swapping or GPU OOM errors.
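For concurrency specifically, Ollama reads tuning knobs from environment variables, which you can add to the ollama service in the Compose file. The values below are illustrative starting points, not universal recommendations:
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - OLLAMA_KEEP_ALIVE=48h
      - OLLAMA_NUM_PARALLEL=2        # parallel requests per loaded model
      - OLLAMA_MAX_LOADED_MODELS=1   # keep one model resident to avoid VRAM contention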
Troubleshooting
Container cannot see the GPU: Ensure nvidia-smi works on the host, then verify nvidia-container-toolkit is installed and that you ran nvidia-ctk runtime configure followed by systemctl restart docker. Test with docker run --rm --gpus all nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi.
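You can also confirm that the toolkit actually registered the NVIDIA runtime with Docker; nvidia-ctk runtime configure writes to /etc/docker/daemon.json, and docker info should list an nvidia runtime:
cat /etc/docker/daemon.json
docker info | grep -i runtimes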
Model fails to load or out-of-memory: Pick a smaller quantized model or reduce context length. Close other GPU applications. Ensure the Ollama service has GPU access and that the model variant matches your hardware constraints.
Ports already in use: Change the published ports in your Compose file (e.g., map 3001:8080) and rerun docker compose up -d.
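If you are not sure what is occupying a port, ss on the host will show the listening process:
sudo ss -ltnp | grep -E ':3000|:11434'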
With this setup, you can run private, high-performance LLMs on your own hardware with a friendly web interface, straightforward updates, and simple backups. Scale up by adding larger models, enabling GPU persistence mode, or deploying behind a production reverse proxy with TLS and access controls.