Overview
This step-by-step guide shows you how to run modern local large language models (LLMs) on Ubuntu using Ollama and Open WebUI with Docker and NVIDIA GPU acceleration. You will deploy a private, browser-based chat interface backed by fast local inference, ideal for secure prototyping, offline work, and cost control. The tutorial covers prerequisites, installation, configuration, persistence, and troubleshooting.
Prerequisites
Before you begin, make sure you have the following on your Ubuntu 22.04/24.04 host:
- 64-bit Ubuntu with at least 16 GB RAM (more is better for larger models).
- NVIDIA GPU (Turing or newer recommended) with recent drivers installed.
- Admin (sudo) access and a stable internet connection.
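If you want to confirm these prerequisites from a terminal before starting, a quick check looks like this (the NVIDIA driver itself is installed in step 1, so at this point only the hardware needs to be visible):
# Confirm the Ubuntu release and architecture:
lsb_release -ds && uname -m
# Confirm available RAM:
free -h
# Confirm an NVIDIA GPU is present (works even before the driver is installed):
lspci | grep -i nvidia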
1) Install NVIDIA Drivers and Container Toolkit
Install or verify the NVIDIA driver, then set up the NVIDIA Container Toolkit so Docker can use the GPU inside containers.
sudo apt update
ubuntu-drivers devices
# Choose the recommended driver (e.g., nvidia-driver-550) and install:
sudo apt install -y nvidia-driver-550
sudo reboot
After reboot, verify your GPU:
nvidia-smi
Install the NVIDIA Container Toolkit:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit.gpg
curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
sudo apt install -y nvidia-container-toolkit
# The next two commands register the NVIDIA runtime with Docker. If Docker is not
# installed yet, finish step 2 first and then run them.
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
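Once Docker is available, you can sanity-check the integration by confirming that the nvidia runtime is registered and that a container can see the GPU. The CUDA image tag below is only an example; any recent nvidia/cuda base image works:
docker info | grep -i runtimes
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi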
2) Install Docker and the Compose Plugin
If Docker is not installed, add the official repository and install Docker Engine and the Compose plugin:
sudo apt-get update
sudo apt-get install -y ca-certificates curl gnupg
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | \
sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] \
https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | \
sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo usermod -aG docker $USER
# Apply the new group membership in the current shell (or log out and back in):
newgrp docker
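A quick way to confirm the installation before moving on:
docker --version
docker compose version
docker run --rm hello-world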
3) Create a Docker Network and Volumes
A dedicated Docker network allows containers to talk to each other by name. Volumes ensure models and app data persist across container restarts.
docker network create ollama-net
docker volume create ollama
docker volume create open-webui
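You can confirm that the network and volumes exist:
docker network ls | grep ollama-net
docker volume ls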
4) Run Ollama with GPU Acceleration
Start the Ollama container, expose the API port (11434), and attach the GPU. The volume keeps downloaded models persistent.
docker run -d \
--name ollama \
--gpus all \
--network ollama-net \
-p 11434:11434 \
-v ollama:/root/.ollama \
--restart unless-stopped \
ollama/ollama:latest
Verify the API is reachable:
curl http://localhost:11434/api/tags
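To confirm that Ollama actually has GPU access, you can run nvidia-smi inside the container (the NVIDIA runtime injects the driver utilities, so this should work even though the image does not ship them) and review the startup logs:
docker exec ollama nvidia-smi
docker logs ollama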
5) Deploy Open WebUI and Link It to Ollama
Open WebUI provides a clean browser-based chat interface for Ollama. Connect it to the Ollama container via the internal Docker network.
docker run -d \
--name open-webui \
--network ollama-net \
-p 3000:8080 \
-e OLLAMA_BASE_URL=http://ollama:11434 \
-v open-webui:/app/backend/data \
--restart unless-stopped \
ghcr.io/open-webui/open-webui:main
Open your browser to http://localhost:3000 and complete the initial setup (the first account you create becomes the admin). Open WebUI will use the Ollama base URL you provided and list available models once you pull them.
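If you prefer to manage both containers with the Compose plugin installed in step 2, the two docker run commands translate roughly into the Compose file below. This is a sketch under the assumption that you keep the ports, volumes, and tags from steps 4 and 5, not an officially published file; adjust it to your environment:
cat > docker-compose.yml <<'EOF'
services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    volumes:
      - open-webui:/app/backend/data
    restart: unless-stopped
    depends_on:
      - ollama
volumes:
  ollama:
    external: true
  open-webui:
    external: true
EOF
docker compose up -d
The external volume entries reuse the volumes created in step 3, and Compose's default project network provides the same name-based connectivity as ollama-net, so no extra network definition is needed.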
6) Pull a Model and Run Your First Chat
Use the Ollama CLI inside the container to download a model. For a good balance of performance and quality on consumer GPUs, start with an 8B or 7B quantized build.
docker exec -it ollama ollama pull llama3.1:8b-instruct-q4_K_M
Test generation via the API (stream is set to false so curl returns a single JSON response instead of a stream of tokens):
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b-instruct-q4_K_M",
  "prompt": "In one sentence, explain what Ollama does.",
  "stream": false
}'
In Open WebUI (http://localhost:3000), select the pulled model from the model dropdown and start chatting. You can adjust context size and GPU usage per chat in the advanced parameters.
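You can also manage and test models entirely from the CLI inside the container, which is handy for a quick smoke test without the browser:
# List models that have been pulled:
docker exec -it ollama ollama list
# Run a one-off prompt from the command line:
docker exec -it ollama ollama run llama3.1:8b-instruct-q4_K_M "Give one practical use case for a local LLM."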
7) Enable Persistence, Updates, and Autostart
Because you used named volumes, your models and Open WebUI data (users, chats, settings) persist through upgrades. To update, pull the latest images and recreate containers:
docker pull ollama/ollama:latest
docker pull ghcr.io/open-webui/open-webui:main
docker stop open-webui ollama
docker rm open-webui ollama
# Re-run the docker run commands from steps 4 and 5
The --restart unless-stopped policy ensures both services start automatically after a system reboot.
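If you adopted the Compose sketch from step 5, the same update cycle reduces to two commands run from the directory containing docker-compose.yml:
docker compose pull
docker compose up -d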
8) Performance Tuning Tips
- Use quantized model variants (e.g., q4_K_M or q5_K_M) for faster inference with lower VRAM usage.
- For larger GPUs, try higher quality quantization (q6_K or q8_0) or larger models (e.g., 12B/13B) if VRAM allows.
- In Open WebUI’s advanced options, set num_ctx (e.g., 4096 or 8192) and increase num_gpu to offload more layers to the GPU. Start conservatively and scale up while stable (see the API example after this list).
- Monitor GPU and memory with nvidia-smi while generating to right-size model and context length.
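The same num_ctx and num_gpu settings can also be passed per request through the options field of the Ollama API, which is convenient for scripted experiments. The values below are illustrative only; a large num_gpu simply asks Ollama to offload as many layers as fit:
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b-instruct-q4_K_M",
  "prompt": "Explain what a KV cache is in two sentences.",
  "stream": false,
  "options": { "num_ctx": 8192, "num_gpu": 99 }
}'
# In a second terminal, watch VRAM and utilization while the request runs:
watch -n 1 nvidia-smi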
9) Securing Access
If you plan to access Open WebUI over the network, enable authentication in Settings and front it with a reverse proxy such as Caddy or Nginx for TLS. Avoid exposing the Ollama API directly to the internet. Restrict firewall rules to trusted IPs or a VPN.
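One concrete hardening step: Docker-published ports bypass ufw rules by default, so the most reliable way to keep the Ollama API private is to bind it to the loopback interface when you create the container in step 4. Open WebUI still reaches it by name over the internal Docker network; only host-local tools can use port 11434. A variant of the step 4 command, assuming the same names and volumes:
docker run -d \
  --name ollama \
  --gpus all \
  --network ollama-net \
  -p 127.0.0.1:11434:11434 \
  -v ollama:/root/.ollama \
  --restart unless-stopped \
  ollama/ollama:latest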
Troubleshooting
GPU not used: If inference is slow and nvidia-smi shows 0% usage, confirm the container sees GPUs (docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi). Re-run step 1 to fix NVIDIA toolkit integration, then restart Docker.
Permission denied to Docker: If docker commands require sudo, add your user to the docker group (sudo usermod -aG docker $USER; newgrp docker).
Port already in use: Change the host side of the -p mappings (e.g., -p 11435:11434 or -p 3001:8080). Container-to-container traffic is unaffected, because OLLAMA_BASE_URL (http://ollama:11434) uses the container port over the internal network, not the host port.
Out of memory or crashes: Use a smaller or more heavily quantized model, reduce num_ctx, or close other GPU workloads. Check container logs (docker logs ollama and docker logs open-webui).
Model not listed in WebUI: Ensure Open WebUI can reach Ollama via the internal name (ollama). Both containers must be on the same Docker network. Verify with curl http://ollama:11434/api/tags inside the open-webui container (docker exec -it open-webui sh).
What You Achieved
You now have a private, GPU-accelerated LLM stack on Ubuntu using Docker, Ollama, and Open WebUI. This setup lets you iterate quickly, control data residency, reduce costs, and stay productive even without an internet connection. You can add more models with docker exec -it ollama ollama pull <model> and switch between them in the WebUI as your use cases evolve.