How to Run Local LLMs on Ubuntu with Ollama and Open WebUI (GPU-Accelerated)

Running a private, fast large language model (LLM) on your own hardware is now practical thanks to Ollama and Open WebUI. This guide shows how to install Ollama on Ubuntu (with NVIDIA GPU acceleration) and connect it to Open WebUI for a clean, chat-style interface. You will get a secure, local setup that can run modern models such as Llama 3.1 and Mistral without sending data to the cloud.

What You’ll Build

You will install Ollama as a system service on Ubuntu 22.04/24.04, download a model, and run Open WebUI in Docker. The GPU will be used by Ollama to accelerate inference while Open WebUI provides a browser-based interface. The result is a local chat environment accessible at http://localhost:3000.

Prerequisites

• Ubuntu 22.04 LTS or 24.04 LTS on a machine with at least 16 GB RAM (more is better).
• An NVIDIA GPU (6 GB+ VRAM recommended) and the proprietary NVIDIA driver.
• Sudo privileges and internet access.

Step 1: Prepare Ubuntu and NVIDIA Drivers

Update the system, then install the recommended proprietary NVIDIA driver and reboot:
sudo apt update && sudo apt upgrade -y
ubuntu-drivers devices
sudo ubuntu-drivers autoinstall
sudo reboot
ubuntu-drivers devices lists the drivers available for your GPU, and autoinstall picks the recommended one. After the reboot, confirm the GPU is visible with nvidia-smi. If you see your GPU model and a driver version, you’re ready.
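For a quick sanity check beyond the default nvidia-smi table, the query flags below print just the GPU name, driver version, and total VRAM (these are standard nvidia-smi options; exact output formatting varies by driver version):
nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv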

Step 2: Install Ollama

Ollama is a lightweight runtime for local LLMs. Install it with: curl -fsSL https://ollama.com/install.sh | sh. The installer places the binary and registers the ollama systemd service. Start or restart the service if needed: sudo systemctl restart ollama. By default, the Ollama API listens on http://127.0.0.1:11434.
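Before pulling any models, confirm the service is up and the API answers (the root URL returns a short status message):
systemctl status ollama --no-pager
curl http://127.0.0.1:11434
The curl call should print "Ollama is running".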

Step 3: Pull a Model and Test

Pull a modern, efficient model, for example ollama pull llama3.1:8b or ollama pull mistral:7b. Test generation on the CLI with ollama run llama3.1:8b and type a prompt, or test the API: curl http://localhost:11434/api/generate -d '{"model":"llama3.1:8b","prompt":"Hello"}'. If the model loads and responds, your base setup works; Ollama uses the GPU automatically when the model (or part of it) fits in VRAM.
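By default, /api/generate streams one JSON object per token. For a single consolidated response, add "stream": false to the request body (same endpoint, minimal example):
curl -s http://localhost:11434/api/generate -d '{"model":"llama3.1:8b","prompt":"Explain quantization in one sentence.","stream":false}'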

Step 4: Install Docker (for Open WebUI)

If Docker is missing, install it quickly:
sudo apt-get install -y ca-certificates curl gnupg
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo $VERSION_CODENAME) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt update && sudo apt install -y docker-ce docker-ce-cli containerd.io
Optional but recommended: sudo usermod -aG docker $USER then log out/in.
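Verify the engine before moving on (hello-world is Docker's tiny official test image; keep sudo until your new group membership takes effect):
docker --version
sudo docker run --rm hello-world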

Step 5: Launch Open WebUI

Open WebUI connects to the local Ollama API and gives a feature-rich chat interface with history and prompt templates. On Linux, map the host gateway inside the container so it can reach Ollama on the host:

docker run -d --name openwebui --restart unless-stopped \
  -p 3000:8080 \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  --add-host=host.docker.internal:host-gateway \
  -v openwebui-data:/app/backend/data \
  ghcr.io/open-webui/open-webui:latest

Open a browser, visit http://localhost:3000, and create the first account when prompted (the first sign-up becomes the admin). In Settings → Connections, ensure the Ollama base URL is http://host.docker.internal:11434. You can now select a model (e.g., llama3.1:8b) and chat.
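If the model dropdown stays empty, check that the container started cleanly and can reach the host (docker logs shows Open WebUI's startup output, including any connection errors to Ollama):
docker ps --filter name=openwebui
docker logs openwebui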

Step 6: Optimize Memory and Context

If VRAM is limited, choose a smaller quantization (the default llama3.1:8b tag is typically a 4-bit Q4_K_M build; other quantization tags are listed on the model's page in the Ollama library). Inspect a pulled model's quantization and parameters with ollama show llama3.1:8b. To increase the context window, create a custom model from a Modelfile:
printf "FROM llama3.1:8b\nPARAMETER num_ctx 8192\n" > Modelfile
ollama create llama3.1-8k -f Modelfile
Then select llama3.1-8k in Open WebUI.
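To confirm the new variant exists and picked up the larger context, list your models and inspect its parameters; a one-shot prompt makes a quick smoke test (note that a larger num_ctx costs more VRAM, so raise it gradually):
ollama list
ollama show llama3.1-8k
ollama run llama3.1-8k "Reply with OK if you can read this."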

Step 7: Secure and Expose (Optional)

Keep Ollama bound to localhost unless you genuinely need remote access. If you must expose the chat interface, put Open WebUI behind a reverse proxy with TLS (e.g., Nginx or Caddy) and leave Open WebUI's built-in login enabled (it is on by default; never disable authentication on an exposed instance). With UFW, allow SSH, optionally open port 3000 for other machines on your LAN, then enable the firewall:
sudo ufw allow 22/tcp
sudo ufw allow 3000/tcp
sudo ufw enable
Prefer a proper domain and HTTPS for anything reachable from outside your network.
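If you choose Caddy, its built-in reverse-proxy mode is a quick way to get automatic HTTPS in front of Open WebUI (a minimal sketch assuming a hypothetical domain llm.example.com whose DNS already points at this machine; Caddy obtains the certificate automatically, which requires ports 80 and 443 to be open):
sudo ufw allow 80/tcp && sudo ufw allow 443/tcp
sudo caddy reverse-proxy --from llm.example.com --to localhost:3000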

Updating and Backups

Update Ollama by re-running the install script (or via your package manager), then restart the service: sudo systemctl restart ollama. To update Open WebUI, pull the new image, remove the old container, and re-run the docker run command from Step 5; your chats live in the openwebui-data volume, so they survive the recreate:
docker pull ghcr.io/open-webui/open-webui:latest
docker stop openwebui && docker rm openwebui
Back up that volume with a throwaway Alpine container:
docker run --rm -v openwebui-data:/data -v $PWD:/backup alpine tar czf /backup/openwebui-data.tgz -C / data
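To restore that archive into the volume later, reverse the tar direction (stop the container first so nothing writes to the volume during the restore):
docker stop openwebui
docker run --rm -v openwebui-data:/data -v $PWD:/backup alpine tar xzf /backup/openwebui-data.tgz -C /
docker start openwebui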

Troubleshooting

• GPU not used: verify that nvidia-smi shows a running process and VRAM in use during inference (see the monitoring commands after this list). Ensure the proprietary driver is loaded, and if you previously installed CUDA separately, check for driver/library version mismatches.
• Open WebUI can’t reach Ollama: confirm that curl http://localhost:11434 works on the host and that the container was started with --add-host=host.docker.internal:host-gateway. Switching to --network host also works if acceptable (then set OLLAMA_BASE_URL to http://127.0.0.1:11434 and reach the UI on port 8080, since -p mappings are ignored with host networking).
• Slow first response: the first prompt loads weights into memory; subsequent prompts are faster. Consider smaller models or different quantization if latency is too high.
• Out-of-memory: reduce context (num_ctx), switch to a lower-precision quantization, or choose a smaller model (e.g., llama3.1:8b-instruct with Q4).
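Two commands help diagnose the GPU and connectivity issues above (journalctl applies because the installer registers Ollama as a systemd service):
watch -n 1 nvidia-smi
journalctl -u ollama -f
The first shows GPU utilization and VRAM while a prompt runs; the second follows the Ollama service logs, which typically include GPU detection messages at startup.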

You’re Done

You now have a private, GPU-accelerated LLM stack on Ubuntu. Ollama handles fast local inference, and Open WebUI gives you a slick, familiar chat interface. With careful model selection, quantization, and context tuning, this setup can power assistants, coding copilots, and knowledge bots entirely on your hardware.
