Run Local LLMs on Ubuntu: Install Ollama and Open WebUI with NVIDIA GPU Support

Local large language models have matured fast. With Ollama and Open WebUI, you can run modern models like Llama 3.1 or Phi-3 locally, enjoy a clean chat interface, and use your NVIDIA GPU for real speed. This guide walks through a clean setup on Ubuntu 22.04 or 24.04 using Docker and the NVIDIA Container Toolkit.

What you will build: Ollama as the model runtime, Open WebUI as the web interface running in Docker, NVIDIA GPU acceleration for real throughput, and an optional Docker Compose stack that runs both pieces as containers.

Prerequisites: An Ubuntu 22.04/24.04 machine, an NVIDIA GPU with at least 8 GB VRAM (more is better), 16+ GB system RAM recommended, sudo access, and an internet connection.

1) Install NVIDIA Driver and Validate GPU

First, ensure your system sees the GPU and the driver is installed. If you already have a working driver and nvidia-smi reports correctly, you can skip to Docker installation.

Detect the GPU:
lspci | grep -i nvidia

Install the recommended driver:
sudo apt update
sudo ubuntu-drivers autoinstall
sudo reboot

Verify after reboot:
nvidia-smi
You should see driver, CUDA version, and GPU utilization stats.
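If you want just the essentials, nvidia-smi can also print a compact readout of the driver and available VRAM (a convenience, not a required step):
nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv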

2) Install Docker and NVIDIA Container Toolkit

We will run both Ollama and Open WebUI via containers. Install Docker first, then enable GPU support inside containers.

Install Docker Engine:
sudo apt update && sudo apt install -y ca-certificates curl gnupg
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo $VERSION_CODENAME) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo usermod -aG docker $USER && newgrp docker
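The group change only applies to new shells (newgrp covers the current one), so a quick sanity check that Docker works without sudo is worthwhile:
docker --version
docker run --rm hello-world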

Install NVIDIA Container Toolkit:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list > /dev/null
sudo apt update && sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
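Before testing inside a container, you can confirm Docker registered the nvidia runtime (it should appear in the Runtimes line):
docker info | grep -i runtime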

Test GPU in containers:
docker run --rm --gpus all nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi
If you see the same GPU readout, the toolkit works.

3) Install Ollama

Ollama brings one-line model setup and fast local inference. Use the official installer:

curl -fsSL https://ollama.com/install.sh | sh

The install registers a systemd service and puts the binary at /usr/local/bin/ollama. Start or check status:

sudo systemctl enable --now ollama
systemctl status ollama
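The service listens on port 11434; a quick way to confirm the API is up:
curl http://localhost:11434/api/version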

Quick test with CPU or GPU:
ollama run llama3.1:8b
Type a prompt; enter /bye (or press Ctrl+D) to exit. If you have a supported NVIDIA setup, Ollama will prefer the GPU automatically.
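If you would rather download the model ahead of time instead of triggering the download from the chat prompt, pull it explicitly and list what is installed:
ollama pull llama3.1:8b
ollama list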

4) Deploy Open WebUI with Docker

Open WebUI gives you a polished chat interface, prompt templates, and conversation history. We will point it at the local Ollama API; on Linux, host.docker.internal only resolves inside the container if you map it to the host gateway (the --add-host flag below), or you can use your host's IP instead.

Run Open WebUI:
docker run -d --name openwebui --restart unless-stopped -p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
-e WEBUI_AUTH=True \
-v openwebui-data:/app/backend/data \
ghcr.io/open-webui/open-webui:main

Open a browser to http://SERVER_IP:3000. Create the first admin user and configure the default model (for example, llama3.1:8b).

5) Using the GPU with Open WebUI + Ollama

If your driver and toolkit are correct, Ollama will use the GPU. To confirm, watch GPU usage while making a request:

watch -n 1 nvidia-smi
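Ollama itself also reports how a loaded model is split between CPU and GPU; after sending a request, check the PROCESSOR column:
ollama ps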

To tune GPU behavior, set environment variables for the Ollama service. For example, to pin Ollama to the first GPU and use a quantized KV cache (which requires flash attention to be enabled):

sudo systemctl edit ollama
Add lines under [Service]:
Environment="CUDA_VISIBLE_DEVICES=0"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
Save and reload:
sudo systemctl daemon-reload && sudo systemctl restart ollama
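To confirm the drop-in took effect, inspect the service's environment:
systemctl show ollama --property=Environment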

You can also pick quantized models to fit your VRAM. Examples: llama3.1:8b-instruct-q4_K_M (balanced) or llama3.1:8b-instruct-q5_K_M (higher quality); for small GPUs, a compact model like phi3:mini is a good fit. Check each model's page in the Ollama library for the exact quantization tags available.
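To see how large a given quantization actually is before committing VRAM to it (the tag below assumes the library naming mentioned above; adjust to what the library lists):
ollama pull llama3.1:8b-instruct-q4_K_M
ollama show llama3.1:8b-instruct-q4_K_M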

6) Optional: Docker Compose Setup

To run everything together and persist data, use Docker Compose. If you installed Ollama natively in step 3, stop that service first (sudo systemctl disable --now ollama), since the containerized Ollama binds the same port 11434:

mkdir -p ~/llm-stack && cd ~/llm-stack
nano compose.yml

Paste:

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: ["gpu"]

  openwebui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: openwebui
    restart: unless-stopped
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - WEBUI_AUTH=True
    depends_on:
      - ollama
    volumes:
      - openwebui-data:/app/backend/data

volumes:
  ollama:
  openwebui-data:

Start the stack:
docker compose up -d
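Once the containers are up, pull a model into the containerized Ollama and check that both services are running:
docker exec -it ollama ollama pull llama3.1:8b
docker compose ps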

7) Security, Backups, and Troubleshooting

Secure access: Keep Open WebUI behind a reverse proxy (Caddy, Nginx) with HTTPS. At minimum, enable built-in auth (already set with WEBUI_AUTH=True). For remote access, consider Tailscale or WireGuard rather than exposing port 3000 to the internet.
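If the machine sits on a trusted LAN and you use ufw, one simple option is to allow port 3000 only from your local subnet (192.168.1.0/24 is just an example; substitute your own range):
sudo ufw allow from 192.168.1.0/24 to any port 3000 proto tcp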

Persist and back up data: the native Ollama install stores models under the service user's home at /usr/share/ollama/.ollama/models (or ~/.ollama/models when you run ollama as your own user); the Compose setup keeps them in the ollama volume. Open WebUI keeps its data in the openwebui-data volume. Back up the volumes with:

docker run --rm -v ollama:/src -v $PWD:/dst alpine tar czf /dst/ollama-backup.tgz -C /src .
docker run --rm -v openwebui-data:/src -v $PWD:/dst alpine tar czf /dst/openwebui-backup.tgz -C /src .
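Restoring is the reverse operation, extracting an archive back into an (empty) volume:
docker run --rm -v ollama:/dst -v $PWD:/src alpine tar xzf /src/ollama-backup.tgz -C /dst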

Common issues (the log checks after this list help narrow them down):

- Ollama says it cannot find a GPU: check driver and run nvidia-smi; ensure nvidia-container-toolkit is installed and Docker restarted.
- Model fails to load due to VRAM limits: pick a smaller or more aggressively quantized variant (q4, q3), or set num_ctx lower in the model settings.
- High CPU usage: disable background indexing features in the UI and avoid running multiple heavy models simultaneously.
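When any of these come up, the service and container logs usually tell you which case you are in:
journalctl -u ollama --no-pager -n 50
docker logs --tail 50 openwebui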

8) Next Steps

Try function-calling and RAG with Open WebUI extensions, schedule model updates with ollama pull, and benchmark different quantizations for your GPU. With this stack, you own your data, enjoy low latency, and can iterate quickly without cloud costs.
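For scheduled updates, a cron entry works well; run crontab -e and add a line like the one below (the schedule and model tag are only examples):
0 3 * * 1 /usr/local/bin/ollama pull llama3.1:8b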
