Overview
This guide shows you how to run Llama 3.1 locally on Ubuntu using Ollama (model runtime) and Open WebUI (a friendly browser UI). We will deploy everything with Docker and Docker Compose. If you have an NVIDIA GPU, you can enable GPU acceleration using the NVIDIA Container Toolkit. The steps work on Ubuntu 22.04 and 24.04.
Prerequisites
Before you begin, you need a 64-bit Ubuntu server or desktop with internet access. For CPU-only inference, 16 GB of RAM is recommended, and more is better for larger models. For GPU acceleration, you need an NVIDIA GPU with a recent driver installed (version 535 or newer is recommended) and visible to the system. You will also need Docker and Docker Compose (the plugin packaged with Docker CE is fine).
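To confirm the basics before installing anything, you can check your Ubuntu release, available memory, and (if applicable) the GPU driver from a terminal; the exact output will vary by machine:
lsb_release -a
free -h
nvidia-smi   # only relevant if you have an NVIDIA GPU with the driver installed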
Step 1 — Install Docker Engine and Compose
If Docker is not installed, run the following commands to install Docker CE and the Compose plugin. This method uses the official repository.
sudo apt update
sudo apt install -y ca-certificates curl gnupg
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | \
sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] \
https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo $VERSION_CODENAME) stable" | \
sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo usermod -aG docker $USER
newgrp docker
docker compose version
If the last command prints a version, you are good to go.
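As an optional sanity check, you can also run Docker's hello-world image, which simply confirms that the daemon can pull and start containers:
docker run --rm hello-world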
Step 2 — Optional: Enable NVIDIA GPU for Containers
Skip this step if you want CPU-only inference. To use your NVIDIA GPU with Docker, install the NVIDIA Container Toolkit and configure Docker to use it.
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
nvidia-smi
The nvidia-smi command checks the host driver and should print your GPU details. If it fails, verify your NVIDIA driver installation before continuing.
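To confirm that containers (not just the host) can see the GPU, you can run nvidia-smi inside a CUDA base image; the tag below is only an example, so pick one that matches your installed driver:
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi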
Step 3 — Create a Docker Compose file
We will run two services: ollama on port 11434 and open-webui on port 3000. The volumes persist downloaded models and WebUI settings. Create a working folder, then add the compose file.
mkdir -p ~/ollama-openwebui
cd ~/ollama-openwebui
nano docker-compose.yml
Paste the following CPU-friendly configuration:
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama-data:/root/.ollama
    environment:
      - OLLAMA_KEEP_ALIVE=24h

  open-webui:
    image: ghcr.io/open-webui/open-webui:latest
    container_name: open-webui
    restart: unless-stopped
    depends_on:
      - ollama
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    volumes:
      - openwebui-data:/app/backend/data

volumes:
  ollama-data:
  openwebui-data:
If you have an NVIDIA GPU and completed Step 2, enable GPU acceleration for Ollama by adding runtime: nvidia under the ollama service and listing the NVIDIA variables in its environment block. Keep a single environment: key; the block below replaces the existing one. Indentation matters:
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=compute,utility
      - OLLAMA_KEEP_ALIVE=24h
Note: For quick one-off tests outside Compose you can use docker run --gpus all. Recent Compose versions also let you request GPUs with a deploy.resources block (shown below), but the runtime: nvidia stanza above is the most broadly compatible in compose files.
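For reference, a minimal sketch of the Compose-native GPU request looks like this; on a host where the NVIDIA Container Toolkit is configured, it has the same effect as runtime: nvidia:
  ollama:
    # ...existing settings from the ollama service...
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]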
Step 4 — Start the Stack
Start both containers in the background:
docker compose up -d
docker compose ps
Open WebUI will be available at http://SERVER_IP:3000 and the Ollama API at http://SERVER_IP:11434. The first time you visit Open WebUI, you will be asked to sign up; the first account you create becomes the local admin user.
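To verify that Ollama is up before opening the browser, you can query its API from the server; it should respond with a short status message and a version string:
curl http://localhost:11434
curl http://localhost:11434/api/version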
Step 5 — Download a Model (Llama 3.1 8B)
You can pull models through the WebUI, but the quickest way is the Ollama CLI inside the container. This example pulls Llama 3.1 8B:
docker exec -it ollama ollama pull llama3.1:8b
After the download completes, go to Open WebUI, select llama3.1:8b from the model picker, and start chatting. The first generation loads the model into memory and will be slower; later prompts are faster.
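If you prefer the terminal, you can confirm the model is available and run a quick one-off prompt inside the container:
docker exec -it ollama ollama list
docker exec -it ollama ollama run llama3.1:8b "Write a haiku about Ubuntu"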
Step 6 — Useful Tweaks
Model storage location: Models live in the ollama-data volume. To move them to a larger disk, bind-mount a host path instead of a named volume, for example - /data/ollama:/root/.ollama.
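As a sketch, the bind-mount variant of the ollama service would look like this, assuming /data/ollama is a directory on the larger disk:
  ollama:
    volumes:
      - /data/ollama:/root/.ollama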
Memory usage: Large models can exhaust RAM or VRAM. Try a smaller model or a more heavily quantized tag (e.g., a Q4_K_M variant) if one is offered, and close other apps on low-memory systems.
API access: Apps can call the Ollama server at http://SERVER_IP:11434. Open WebUI simply forwards to this endpoint.
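For example, you can request a completion with a plain HTTP call to the Ollama API (replace SERVER_IP and the prompt as needed):
curl http://SERVER_IP:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Explain containers in one sentence.",
  "stream": false
}'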
Security: Do not expose ports 11434 or 3000 to the internet without protection. If you need remote access, place the services behind a reverse proxy such as Nginx or Traefik with TLS and authentication, or use a private VPN.
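If everything runs on one machine, or sits behind a reverse proxy on the same host, a simple hardening step is to publish the ports on localhost only so other machines cannot reach them directly:
    ports:
      - "127.0.0.1:11434:11434"
Use "127.0.0.1:3000:8080" in the same way for Open WebUI.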
Troubleshooting
Error: could not select device driver "" with capabilities: [[gpu]] — The NVIDIA runtime is not configured. Re-run sudo nvidia-ctk runtime configure --runtime=docker and sudo systemctl restart docker. Confirm nvidia-smi works on the host and add the runtime: nvidia snippet in your compose file.
Port already in use — Change the host ports in the compose file (for example, use 11435:11434 or 3001:8080) and run docker compose up -d again.
Slow or out-of-memory on CPU — Use a smaller or more heavily quantized model, or lower context length in Open WebUI settings. Consider upgrading RAM or enabling a GPU.
Permission issues on volumes — If you switch to bind mounts, make sure the host directory is writable by the user the container process runs as. Adjust ownership if needed, for example sudo chown -R 1000:1000 /data/ollama, matching the UID and GID the container uses.
Stop and Update
To stop the stack, run:
docker compose down
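If you also want to delete the downloaded models and WebUI data stored in the named volumes, add the -v flag; this removes them permanently:
docker compose down -v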
To update images, pull the newest builds and recreate containers:
docker compose pull
docker compose up -d
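After updating, you can optionally reclaim disk space held by old, now-unused image layers (Docker asks for confirmation before deleting anything):
docker image prune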
Conclusion
You now have a local AI stack running Llama 3.1 with Ollama and Open WebUI on Ubuntu. This setup is easy to maintain, private by default, and can leverage GPU acceleration when available. From here, experiment with different models, tune generation settings, and integrate the Ollama API into your own applications.