Run Local LLMs on Ubuntu: Install Ollama and Open WebUI (GPU Optional)

Overview

Running a local large language model (LLM) is easier than ever thanks to Ollama and Open WebUI. Ollama handles model downloads and inference on your machine (CPU or GPU), while Open WebUI provides a clean, chat-style interface accessible from your browser. This guide shows how to install both on Ubuntu 22.04 or 24.04, enable GPU acceleration if available, and securely expose the interface on your network.

Prerequisites

You will need an Ubuntu server or desktop (22.04 or 24.04), at least 16 GB of RAM (recommended for 7–8B models), 15–30 GB of free disk space for models and caches, and a stable internet connection. For NVIDIA acceleration, install the proprietary NVIDIA driver (version 535 or newer). AMD GPUs can work via ROCm on supported cards, and CPU-only mode is fine for testing.
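
If you want to confirm the basics before installing, a few standard commands cover it (the nvidia-smi check only applies if the proprietary driver is already installed):

free -h                      # total and available RAM
df -h /                      # free disk space on the root filesystem
lspci | grep -iE 'vga|3d'    # list GPUs the kernel can see
nvidia-smi                   # NVIDIA only: confirms the driver is loaded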

Step 1 — Update Ubuntu and install Docker

First, refresh packages and install Docker so we can run Open WebUI easily. Run:

sudo apt update && sudo apt upgrade -y
sudo apt install -y ca-certificates curl gnupg docker.io
sudo systemctl enable --now docker
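
To confirm Docker is working, and optionally to run it without sudo, check the version, run the test image, and add your user to the docker group (log out and back in for the group change to take effect):

docker --version
sudo docker run --rm hello-world    # pulls a tiny test image and prints a confirmation message
sudo usermod -aG docker $USER       # optional: allows running docker without sudo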

Step 2 — Install Ollama

Ollama provides a one-line installer for Linux. It sets up binaries and a systemd service so the API runs on port 11434. Run:

curl -fsSL https://ollama.com/install.sh | sh
sudo systemctl enable --now ollama
ollama --version

If the service starts correctly, Ollama listens on http://127.0.0.1:11434. You can verify with ss -ltn | grep 11434 or journalctl -u ollama -b.
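
A quick way to confirm the API is reachable is to query it with curl; the root endpoint returns a short status message and /api/version reports the installed version:

curl http://127.0.0.1:11434            # should respond with "Ollama is running"
curl http://127.0.0.1:11434/api/version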

Step 3 — Pull your first model

Ollama supports many open models. For a balanced starting point, pull Llama 3 8B and test an interactive prompt:

ollama pull llama3:8b
ollama run llama3:8b

When prompted, try a simple question to ensure responses work. You can list installed models with ollama list and remove any you no longer need with ollama rm <model>. Other popular choices include mistral:7b and phi3:3.8b.
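
You can also run a one-shot, non-interactive prompt, which is handy for scripted checks:

ollama run llama3:8b "Explain in one sentence what a systemd service is."
ollama list    # show installed models and their sizes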

Step 4 — Optional: Enable GPU acceleration

If your machine has a supported GPU, Ollama will try to use it automatically. For NVIDIA cards, ensure the proprietary driver is loaded (check with nvidia-smi). If you updated drivers, restart the Ollama service with sudo systemctl restart ollama. For AMD, ensure your GPU and Ubuntu version support ROCm; consult vendor documentation. If Ollama falls back to CPU, you can still run smaller or quantized models.
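
To confirm the GPU is actually being used, watch its memory and utilization while a prompt is running; recent Ollama releases also report placement directly:

watch -n 1 nvidia-smi    # GPU memory and utilization should rise while a model answers
ollama ps                # shows loaded models and whether they run on GPU or CPU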

Step 5 — Install Open WebUI (Docker)

Open WebUI provides a modern chat interface for Ollama. We will run it in Docker and publish it on port 3000. Note that the container has its own network namespace, so 127.0.0.1 inside it refers to the container itself, not the host; we therefore map host.docker.internal to the host gateway and point OLLAMA_BASE_URL at it:

docker run -d --name open-webui \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:latest

After a few seconds, open your browser to http://<server-ip>:3000. On first launch, create an admin account. In Settings, verify the Ollama endpoint is set to http://host.docker.internal:11434 and select your default model, such as llama3:8b.
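
Depending on how Ollama is bound, the container may still not reach it: the Linux installer configures Ollama to listen on 127.0.0.1 by default, which the Docker gateway cannot reach. A systemd override makes Ollama listen on all interfaces (pair it with the firewall rules in the next step so port 11434 is not exposed more widely than intended):

sudo systemctl edit ollama
# In the editor that opens, add:
#   [Service]
#   Environment="OLLAMA_HOST=0.0.0.0"
sudo systemctl restart ollama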

Step 6 — Secure access

If this server is on a private LAN, you can restrict access to trusted devices with UFW. For example, to allow port 3000 only from your desktop IP:

sudo ufw allow from <your-desktop-ip> to any port 3000 proto tcp

If you plan to expose Open WebUI over the internet, place it behind a reverse proxy with TLS. For Nginx, forward / to http://127.0.0.1:3000 and use Let’s Encrypt for HTTPS. Alternatives like Caddy can automate certificates with a simple config. Always use strong admin passwords and consider IP allowlists or SSO when possible.
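
As a rough sketch (assuming a hypothetical domain chat.example.com and Nginx installed on the same host), a minimal server block might look like the following; Open WebUI uses WebSockets, so the upgrade headers matter:

server {
    listen 80;
    server_name chat.example.com;   # hypothetical domain

    location / {
        proxy_pass http://127.0.0.1:3000;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;   # WebSocket support
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}

With that in place, sudo certbot --nginx -d chat.example.com (from the certbot package) can obtain and install a Let's Encrypt certificate automatically.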

Step 7 — Optimize models and performance

Model size and quantization matter. Smaller or quantized models consume less RAM and run faster on CPU; larger models yield better quality but need more resources. Look for model tags that include quantization (e.g., q4) if you have limited memory. You can run parallel requests by tuning environment variables, but start with defaults to validate stability. Keep an eye on system metrics with htop and nvidia-smi to avoid OOM issues.
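
To check a model's quantization and size before committing to it, and to experiment with parallelism once the defaults prove stable, the following is a small sketch; the environment variables are set through a systemd override, the same way OLLAMA_HOST was set earlier, and the values shown are only examples:

ollama show llama3:8b    # prints parameter count, quantization, and context length
# Optional parallelism knobs (add to the ollama systemd override, then restart the service):
#   Environment="OLLAMA_NUM_PARALLEL=2"        # concurrent requests per loaded model
#   Environment="OLLAMA_MAX_LOADED_MODELS=2"   # models kept in memory at once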

Step 8 — Update and maintenance

Keep components current for security and performance. To update Ollama, re-run the install script from Step 2, then restart the service:

sudo systemctl restart ollama

Update models with:

ollama pull llama3:8b

Update Open WebUI by pulling the latest image and recreating the container (your data volume persists):

docker pull ghcr.io/open-webui/open-webui:latest
docker stop open-webui && docker rm open-webui
docker run -d --name open-webui -p 3000:8080 --add-host=host.docker.internal:host-gateway -e OLLAMA_BASE_URL=http://host.docker.internal:11434 -v open-webui:/app/backend/data ghcr.io/open-webui/open-webui:latest

Troubleshooting

If Open WebUI cannot connect to Ollama, confirm the service is listening on 11434 and that OLLAMA_BASE_URL points to an address that is reachable from inside the container (remember that 127.0.0.1 inside the container is the container itself, not the host). Check logs with journalctl -u ollama -f and docker logs -f open-webui. If responses are very slow or the process is killed, try a smaller or more heavily quantized model. For GPU issues, confirm drivers are installed and that the process sees the GPU (NVIDIA: nvidia-smi). If Docker cannot bind port 3000, pick another free port, e.g., -p 8081:8080.

What you can do next

With the base setup running, experiment with tools like RAG to bring your documents into the chat, add prompt templates for repeatable workflows, or run multiple models for different tasks. You can also script batch inference using the Ollama REST API at /api/generate and integrate the service into dashboards or automations. This stack gives you full control over privacy, cost, and performance without relying on external providers.
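
For example, a minimal non-streaming call to the generate endpoint looks like this (the model name and prompt are just placeholders):

curl http://127.0.0.1:11434/api/generate -d '{
  "model": "llama3:8b",
  "prompt": "Summarize what a reverse proxy does in two sentences.",
  "stream": false
}'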
