Running a private, local large language model (LLM) stack has become straightforward thanks to Ollama and Open WebUI. In this tutorial, you will set up Ollama on Ubuntu 24.04 for local inference and connect it to Open WebUI for a clean, feature-rich chat interface. Optional steps cover NVIDIA GPU acceleration and a one-container alternative. The end result is a fast, private, and flexible AI workstation or lab setup.
Prerequisites
You need an Ubuntu 24.04 machine with at least 16 GB RAM (more is better for larger models), 20+ GB free disk space, and a stable internet connection. For GPU acceleration, an NVIDIA GPU with recent drivers (CUDA 12+ capable) is recommended. Administrative shell access (sudo) is required.
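To confirm the machine meets these requirements before you start, a few standard checks are enough (the lspci line simply tells you whether an NVIDIA GPU is present at all):
free -h
df -h /
lsb_release -a
lspci | grep -i nvidia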
Step 1 — Update the system
Start by updating the OS and rebooting to ensure a clean baseline:
sudo apt update && sudo apt -y upgrade
sudo reboot
Step 2 — (Optional) Enable NVIDIA GPU acceleration
If you have an NVIDIA GPU, install the recommended proprietary driver. This enables Ollama to offload model layers to the GPU for significant speedups.
sudo apt install -y ubuntu-drivers-common
sudo ubuntu-drivers autoinstall
sudo reboot
After reboot, verify the driver is active:
nvidia-smi
If you see your GPU listed with a driver version, you are ready for GPU acceleration. If not, check Secure Boot status, ensure the driver matches your GPU, and review dmesg for driver signing or module load errors.
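These checks can help narrow down the usual causes (mokutil reports the Secure Boot state and may need to be installed first; ubuntu-drivers lists the driver Ubuntu recommends for your hardware):
mokutil --sb-state
ubuntu-drivers devices
sudo dmesg | grep -iE 'nvidia|nouveau'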
Step 3 — Install Ollama
Ollama is a lightweight runtime for local LLMs. Install it with the official script and enable the systemd service:
curl -fsSL https://ollama.com/install.sh | sh
sudo systemctl enable --now ollama
systemctl status ollama
By default, Ollama listens on localhost:11434 and will auto-detect NVIDIA GPUs if drivers are present. To test the API, run:
curl -s http://127.0.0.1:11434/api/tags
Pull a model (e.g., Llama 3 8B) and do a quick prompt test:
ollama pull llama3
ollama run llama3
Type a sample question to confirm token generation. Exit the interactive session with /bye or Ctrl+D.
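You can also exercise the HTTP API directly. Assuming the llama3 model pulled above, a non-streaming request like this should return a JSON object containing a response field:
curl -s http://127.0.0.1:11434/api/generate -d '{"model": "llama3", "prompt": "Why is the sky blue?", "stream": false}'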
Step 4 — Install Docker (for Open WebUI)
Open WebUI is easiest to run in a container. Install the Docker Engine and Compose plugin:
sudo apt-get install -y ca-certificates curl gnupg
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] \
https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo $VERSION_CODENAME) stable" | \
sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo usermod -aG docker $USER
newgrp docker
Step 5 — Run Open WebUI connected to host Ollama
This keeps Ollama on the host and runs Open WebUI in Docker. The --add-host flag maps the name host.docker.internal to Docker's host-gateway address, so the container can reach the Ollama API running on the host (if the connection fails, see the note after the command).
docker run -d --name open-webui \
-p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
-e OLLAMA_API_BASE=http://host.docker.internal:11434 \
-v open-webui:/app/backend/data \
--restart unless-stopped \
ghcr.io/open-webui/open-webui:latest
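One caveat: the Ollama service installed earlier binds to 127.0.0.1 by default, so the container may not be able to reach it at the host-gateway address. If Open WebUI cannot connect, a common fix is a systemd drop-in that makes Ollama listen on all interfaces (note that this also exposes the API to your LAN unless a firewall restricts it):
sudo mkdir -p /etc/systemd/system/ollama.service.d
printf '[Service]\nEnvironment="OLLAMA_HOST=0.0.0.0"\n' | sudo tee /etc/systemd/system/ollama.service.d/override.conf
sudo systemctl daemon-reload
sudo systemctl restart ollama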
Open your browser to http://SERVER_IP:3000, create the initial admin account, and confirm that the Ollama connection shows as healthy. From Settings, you can select the default model, import additional models, and configure conversation settings such as system prompts and context length.
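From the server itself, you can quickly confirm the interface is answering before switching to a browser; this should print an HTTP status code such as 200:
curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:3000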
Step 6 — Verify GPU usage and performance
Start a chat in Open WebUI and watch GPU utilization:
watch -n 1 nvidia-smi
If GPU utilization stays at 0%, verify that the NVIDIA driver is active. Ollama automatically offloads supported layers to the GPU when one is available; very small models may not utilize the GPU heavily.
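Ollama can also report placement directly: on recent versions, ollama ps lists the loaded models and shows how each is split between CPU and GPU.
ollama ps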
Optional — All-in-one container (Open WebUI + Ollama)
If you prefer to containerize everything, use the combined image. This requires the NVIDIA Container Toolkit for GPU access from Docker.
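Note: depending on your release and enabled repositories, nvidia-container-toolkit may not be available from Ubuntu's default archives. In that case, add NVIDIA's apt repository first (these commands follow NVIDIA's install documentation; check the current docs if they fail):
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update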
sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
docker run -d --name openwebui-ollama \
--gpus all \
-p 3000:8080 \
-v open-webui:/app/backend/data \
-v ollama:/root/.ollama \
--restart unless-stopped \
ghcr.io/open-webui/open-webui:ollama
This image embeds Ollama and automatically wires it to Open WebUI. It will download models into the ollama volume. If you do not have a GPU, remove the --gpus all flag.
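You can pull models from the Open WebUI admin settings, or from the shell, assuming the bundled ollama binary is on the container's PATH:
docker exec -it openwebui-ollama ollama pull llama3
docker exec -it openwebui-ollama ollama list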
Security and hardening tips
By default, Ollama listens only on localhost. Keep it that way unless you are placing a reverse proxy in front (Nginx, Caddy, or Traefik) for TLS. Restrict access to port 3000 using a firewall (UFW or security group) and enable Open WebUI authentication during initial setup. If you must expose the interface, ensure HTTPS termination and strong credentials, and consider IP allowlists.
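For example, with UFW you could allow the web interface only from a trusted subnet (replace 192.168.1.0/24 with your own network):
sudo ufw allow OpenSSH
sudo ufw allow from 192.168.1.0/24 to any port 3000 proto tcp
sudo ufw enable
sudo ufw status verbose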
Troubleshooting
If Open WebUI cannot connect to Ollama, confirm the API is reachable: curl -s http://127.0.0.1:11434/api/tags. If it works, recheck the container run command and the host-gateway mapping. For slow inference, try a smaller model (e.g., mistral, phi3, or qwen2:0.5b/1.5b) and ensure swap is available. If downloads fail, verify DNS and outbound firewall rules.
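When you need more detail, the service and container logs are usually the fastest way to find the cause:
journalctl -u ollama -n 100 --no-pager
docker logs open-webui --tail 100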
You now have a modern, local AI stack on Ubuntu 24.04 that is private, fast, and extensible. Explore models like llama3:instruct, qwen2, mistral, or codellama for coding, and tailor prompts, templates, and context settings in Open WebUI for your specific workflows.