Overview
Running local large language models (LLMs) is now practical with modern GPUs. In this tutorial, you will learn how to deploy Ollama and Open WebUI on Ubuntu using Docker and NVIDIA GPU acceleration. Ollama handles model downloads and inference, while Open WebUI provides a clean, browser-based chat interface. By the end, you will have a secure, upgradable stack for self-hosted AI on your own server.
Prerequisites
You will need Ubuntu 22.04 or 24.04 (server or desktop), an NVIDIA GPU with recent drivers, and sudo access. Ensure your GPU is visible to the OS with nvidia-smi. If you are starting from a clean install, run sudo ubuntu-drivers autoinstall, reboot, and verify that nvidia-smi shows your card and driver version.
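On a clean install, that sequence amounts to the following; the reboot is required before the driver check will succeed:
sudo ubuntu-drivers autoinstall
sudo reboot
nvidia-smi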
Step 1: Install Docker Engine
Install Docker from the official repository to get current features and security patches. First, add Docker’s GPG key and repository, then install Docker Engine and the Compose plugin.
sudo apt update && sudo apt install -y ca-certificates curl gnupg lsb-release
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(. /etc/os-release; echo $VERSION_CODENAME) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt update && sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo usermod -aG docker $USER && newgrp docker
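Before moving on, it is worth confirming that Docker and the Compose plugin respond; the hello-world image is an optional but quick end-to-end check:
docker --version
docker compose version
docker run --rm hello-world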
Step 2: Enable NVIDIA GPU in Containers
Install the NVIDIA Container Toolkit so Docker can access your GPU. This allows GPU passthrough to Ollama in a container.
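Note that on Ubuntu the nvidia-container-toolkit package typically comes from NVIDIA's own apt repository rather than the default sources. The commands below follow NVIDIA's documented repository setup at the time of writing; if they fail, check the current Container Toolkit installation guide:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update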
sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
docker run --rm --gpus all nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi
If the last command prints your GPU details, Docker sees your GPU. If not, verify your NVIDIA driver installation and rerun the steps above.
Step 3: Create a Docker Compose file
Use Docker Compose to run Ollama and Open WebUI together. Create a folder such as ~/ai-stack and add a compose.yml file with the following contents.
version: "3.8"
services:
  ollama:
    image: ollama/ollama:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - OLLAMA_KEEP_ALIVE=48h
    volumes:
      - ollama:/root/.ollama
    ports:
      - "11434:11434"
  openwebui:
    image: ghcr.io/open-webui/open-webui:latest
    depends_on:
      - ollama
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    volumes:
      - openwebui:/app/backend/data
    ports:
      - "3000:8080"
volumes:
  ollama:
  openwebui:
If you prefer AMD GPUs, replace ollama/ollama:latest with ollama/ollama:rocm and ensure your host has ROCm drivers configured. You will also need to pass the /dev/kfd and /dev/dri devices; consult Ollama’s ROCm documentation for the exact mappings.
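As a rough sketch only (verify the exact mappings against Ollama's ROCm documentation), the ollama service would change along these lines:
  ollama:
    image: ollama/ollama:rocm
    devices:
      - /dev/kfd
      - /dev/dri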
Step 4: Launch the Stack and Pull a Model
Start the services and tail the logs to confirm that everything is healthy.
docker compose up -d
docker compose logs -f ollama
Pull a model with a good balance between quality and VRAM needs, such as Llama 3.1 8B in a quantized format. You can do this via the terminal or from Open WebUI’s Model Manager.
docker exec -it $(docker compose ps -q ollama) ollama pull llama3.1:8b
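Once the pull completes, you can optionally confirm that Ollama is serving the model by querying its API from the host (this assumes the default 11434 port mapping from the Compose file above):
curl http://localhost:11434/api/tags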
Open your browser to http://SERVER_IP:3000. On first launch, create your admin account. In Settings, select the default model and start chatting. The first generation may be slower while the model warms the cache.
Step 5: Security, Updates, and Backups
If you plan to expose the service over the internet, place Open WebUI behind a reverse proxy with TLS (Caddy, Traefik, or Nginx) and restrict access by IP or set up SSO. For private networks, at minimum change the default ports and disable self-registration from the Admin panel to prevent unauthorized accounts.
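As one example of the reverse-proxy approach, a minimal Caddyfile forwarding a placeholder domain to Open WebUI could look like this; Caddy provisions TLS certificates automatically for publicly resolvable hostnames:
chat.example.com {
    reverse_proxy localhost:3000
}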
To update to the latest versions safely, pull new images and recreate the containers. Volumes preserve your models and settings.
docker compose pull && docker compose up -d
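If disk space is a concern, you can optionally remove the superseded images once the new containers are confirmed healthy:
docker image prune -f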
Back up your volumes regularly, especially before upgrades. A simple snapshot approach with tar works for low-downtime maintenance.
docker run --rm -v ollama:/data -v $(pwd):/backup busybox sh -c "cd /data && tar czf /backup/ollama-vol-$(date +%F).tgz ."
docker run --rm -v openwebui:/data -v $(pwd):/backup busybox sh -c "cd /data && tar czf /backup/openwebui-vol-$(date +%F).tgz ."
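Restoring follows the same pattern in reverse: stop the stack, extract the archive into the volume, and start the services again (substitute the date in the filename with that of your backup):
docker compose down
docker run --rm -v ollama:/data -v $(pwd):/backup busybox sh -c "cd /data && tar xzf /backup/ollama-vol-YYYY-MM-DD.tgz"
docker compose up -d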
Performance Tips
Choose quantized models that fit your VRAM; for 8–12 GB GPUs, Q4_K_M variants are often the best starting point. Monitor VRAM usage with watch -n1 nvidia-smi. In Open WebUI, tune the context length and batch size to match your GPU memory. Keep the driver, CUDA runtime, and container toolkit reasonably current for stability and speed. If you run multiple simultaneous chats, consider increasing concurrency carefully and watch for swapping or GPU OOM errors.
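For concurrency specifically, Ollama reads tuning knobs from environment variables, which you can add to the ollama service in the Compose file. The values below are illustrative starting points, not universal recommendations:
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - OLLAMA_KEEP_ALIVE=48h
      - OLLAMA_NUM_PARALLEL=2        # parallel requests per loaded model
      - OLLAMA_MAX_LOADED_MODELS=1   # keep one model resident to avoid VRAM contention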
Troubleshooting
Container cannot see the GPU: Ensure nvidia-smi works on the host, then verify nvidia-container-toolkit is installed and that you ran nvidia-ctk runtime configure followed by systemctl restart docker. Test with docker run --rm --gpus all nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi.
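You can also confirm that the toolkit actually registered the NVIDIA runtime with Docker; nvidia-ctk runtime configure writes to /etc/docker/daemon.json, and docker info should list an nvidia runtime:
cat /etc/docker/daemon.json
docker info | grep -i runtimes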
Model fails to load or out-of-memory: Pick a smaller quantized model or reduce context length. Close other GPU applications. Ensure the Ollama service has GPU access and that the model variant matches your hardware constraints.
Ports already in use: Change the published ports in your Compose file (e.g., map 3001:8080) and rerun docker compose up -d.
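If you are not sure what is occupying a port, ss on the host will show the listening process:
sudo ss -ltnp | grep -E ':3000|:11434'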
With this setup, you can run private, high-performance LLMs on your own hardware with a friendly web interface, straightforward updates, and simple backups. Scale up by adding larger models, enabling GPU persistence mode, or deploying behind a production reverse proxy with TLS and access controls.