Running a private, local large language model (LLM) stack has become straightforward thanks to Ollama and Open WebUI. In this tutorial, you will set up Ollama on Ubuntu 24.04 for local inference and connect it to Open WebUI for a clean, feature-rich chat interface. Optional steps cover NVIDIA GPU acceleration and a one-container alternative. The end result is a fast, private, and flexible AI workstation or lab setup.
Prerequisites
You need an Ubuntu 24.04 machine with at least 16 GB RAM (more is better for larger models), 20+ GB free disk space, and a stable internet connection. For GPU acceleration, an NVIDIA GPU with recent drivers (CUDA 12+ capable) is recommended. Administrative shell access (sudo) is required.
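To confirm the machine meets these requirements before you start, a few standard checks are enough (the lspci line simply tells you whether an NVIDIA GPU is present at all):
free -h
df -h /
lsb_release -a
lspci | grep -i nvidia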
Step 1 — Update the system
Start by updating the OS and rebooting to ensure a clean baseline:
sudo apt update && sudo apt -y upgrade
sudo reboot
Step 2 — (Optional) Enable NVIDIA GPU acceleration
If you have an NVIDIA GPU, install the recommended proprietary driver. This enables Ollama to offload model layers to the GPU for significant speedups.
sudo apt install -y ubuntu-drivers-common
sudo ubuntu-drivers autoinstall
sudo reboot
After reboot, verify the driver is active:
nvidia-smi
If you see your GPU listed with a driver version, you are ready for GPU acceleration. If not, check Secure Boot status, ensure the driver matches your GPU, and review dmesg for driver signing or module load errors.
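These checks can help narrow down the usual causes (mokutil reports the Secure Boot state and may need to be installed first; ubuntu-drivers lists the driver Ubuntu recommends for your hardware):
mokutil --sb-state
ubuntu-drivers devices
sudo dmesg | grep -iE 'nvidia|nouveau'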
Step 3 — Install Ollama
Ollama is a lightweight runtime for local LLMs. Install it with the official script and enable the systemd service:
curl -fsSL https://ollama.com/install.sh | sh
sudo systemctl enable --now ollama
systemctl status ollama
By default, Ollama listens on localhost:11434 and will auto-detect NVIDIA GPUs if drivers are present. To test the API, run:
curl -s http://127.0.0.1:11434/api/tags
Pull a model (e.g., Llama 3 8B) and do a quick prompt test:
ollama pull llama3
ollama run llama3
Type a sample question to confirm token generation. Exit the interactive session with /bye or Ctrl+D.
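You can also exercise the HTTP API directly. Assuming the llama3 model pulled above, a non-streaming request like this should return a JSON object containing a response field:
curl -s http://127.0.0.1:11434/api/generate -d '{"model": "llama3", "prompt": "Why is the sky blue?", "stream": false}'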
Step 4 — Install Docker (for Open WebUI)
Open WebUI is easiest to run in a container. Install the Docker Engine and Compose plugin:
sudo apt-get install -y ca-certificates curl gnupg
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] \
https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo $VERSION_CODENAME) stable" | \
sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo usermod -aG docker $USER
newgrp docker
Step 5 — Run Open WebUI connected to host Ollama
This keeps Ollama on the host and runs Open WebUI in Docker. The --add-host flag maps the name host.docker.internal to Docker's host-gateway address, so the container can reach the Ollama API running on the host (if the connection fails, see the note after the command).
docker run -d --name open-webui \
-p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
-e OLLAMA_API_BASE=http://host.docker.internal:11434 \
-v open-webui:/app/backend/data \
--restart unless-stopped \
ghcr.io/open-webui/open-webui:latest
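One caveat: the Ollama service installed earlier binds to 127.0.0.1 by default, so the container may not be able to reach it at the host-gateway address. If Open WebUI cannot connect, a common fix is a systemd drop-in that makes Ollama listen on all interfaces (note that this also exposes the API to your LAN unless a firewall restricts it):
sudo mkdir -p /etc/systemd/system/ollama.service.d
printf '[Service]\nEnvironment="OLLAMA_HOST=0.0.0.0"\n' | sudo tee /etc/systemd/system/ollama.service.d/override.conf
sudo systemctl daemon-reload
sudo systemctl restart ollama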
Open your browser to http://SERVER_IP:3000, create the initial admin account, and confirm that the Ollama connection shows as healthy. From Settings, you can select the default model, import additional models, and configure conversation settings such as system prompts and context length.
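From the server itself, you can quickly confirm the interface is answering before switching to a browser; this should print an HTTP status code such as 200:
curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:3000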
Step 6 — Verify GPU usage and performance
Start a chat in Open WebUI and watch GPU utilization:
watch -n 1 nvidia-smi
If GPU utilization stays at 0%, verify that the NVIDIA driver is active. Ollama automatically offloads supported layers to the GPU when one is available; very small models may not utilize the GPU heavily.
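Ollama can also report placement directly: on recent versions, ollama ps lists the loaded models and shows how each is split between CPU and GPU.
ollama ps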
Optional — All-in-one container (Open WebUI + Ollama)
If you prefer to containerize everything, use the combined image. This requires the NVIDIA Container Toolkit for GPU access from Docker.
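Note: depending on your release and enabled repositories, nvidia-container-toolkit may not be available from Ubuntu's default archives. In that case, add NVIDIA's apt repository first (these commands follow NVIDIA's install documentation; check the current docs if they fail):
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update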
sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
docker run -d --name openwebui-ollama \
--gpus all \
-p 3000:8080 \
-v open-webui:/app/backend/data \
-v ollama:/root/.ollama \
--restart unless-stopped \
ghcr.io/open-webui/open-webui:ollama
This image embeds Ollama and automatically wires it to Open WebUI. It will download models into the ollama volume. If you do not have a GPU, remove the --gpus all flag.
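You can pull models from the Open WebUI admin settings, or from the shell, assuming the bundled ollama binary is on the container's PATH:
docker exec -it openwebui-ollama ollama pull llama3
docker exec -it openwebui-ollama ollama list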
Security and hardening tips
By default, Ollama listens only on localhost. Keep it that way unless you are placing a reverse proxy in front (Nginx, Caddy, or Traefik) for TLS. Restrict access to port 3000 using a firewall (UFW or security group) and enable Open WebUI authentication during initial setup. If you must expose the interface, ensure HTTPS termination and strong credentials, and consider IP allowlists.
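For example, with UFW you could allow the web interface only from a trusted subnet (replace 192.168.1.0/24 with your own network):
sudo ufw allow OpenSSH
sudo ufw allow from 192.168.1.0/24 to any port 3000 proto tcp
sudo ufw enable
sudo ufw status verbose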
Troubleshooting
If Open WebUI cannot connect to Ollama, confirm the API is reachable: curl -s http://127.0.0.1:11434/api/tags. If it works, recheck the container run command and the host-gateway mapping. For slow inference, try a smaller model (e.g., mistral, phi3, or qwen2:0.5b/1.5b) and ensure swap is available. If downloads fail, verify DNS and outbound firewall rules.
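When you need more detail, the service and container logs are usually the fastest way to find the cause:
journalctl -u ollama -n 100 --no-pager
docker logs open-webui --tail 100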
You now have a modern, local AI stack on Ubuntu 24.04 that is private, fast, and extensible. Explore models like llama3:instruct, qwen2, mistral, or codellama for coding, and tailor prompts, templates, and context settings in Open WebUI for your specific workflows.