Overview
Running a private, fast, and flexible AI chat server on your own computer is easier than ever. In this guide, you will set up Ollama (the local LLM runtime) and Open WebUI (a polished chat interface) on Linux or Windows. You will be able to pull popular models like Llama 3 or Mistral, enable GPU acceleration when available, and access a modern web UI for chatting, prompt management, file uploads, and more. This tutorial focuses on simplicity, security, and reliability, using Docker for the web UI and native installation for Ollama.
What You’ll Need
- A 64-bit machine running Ubuntu 22.04/24.04, Debian 12, or Windows 11/10 (latest updates).
- 16 GB RAM recommended; more helps with larger models.
- Optional GPU for acceleration: an NVIDIA or AMD card with recent drivers (Ollama uses CUDA for NVIDIA and ROCm for AMD); a quick hardware-check sketch follows this list.
- Internet access to download models (they can be several gigabytes).
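Not sure what your Linux machine has? A quick check with standard utilities (the last command applies only to NVIDIA cards):
free -h                       # total RAM
nproc                         # CPU cores
lsblk -d -o NAME,ROTA,SIZE    # ROTA 0 means SSD
nvidia-smi                    # driver version and VRAM, NVIDIA only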
Step 1 — Install Ollama
On Linux (Ubuntu/Debian):
curl -fsSL https://ollama.com/install.sh | sh
sudo systemctl enable --now ollama
Check that the API is listening on port 11434:
curl http://localhost:11434/api/tags
On Windows:
Install with Winget:
winget install Ollama.Ollama
After install, Ollama runs in the background. Verify the API:
curl http://localhost:11434/api/tags
Tip: If you have a GPU, make sure the latest driver is installed. On Linux (NVIDIA), confirm with nvidia-smi. Ollama will automatically use the GPU when possible and print a GPU line in its logs during the first model load.
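To confirm GPU detection on Linux, you can grep the service logs; this sketch assumes the default systemd unit name ollama created by the installer (on Windows, Ollama keeps its logs under %LOCALAPPDATA%\Ollama):
# look for a line reporting the detected GPU and compute library
journalctl -u ollama -b --no-pager | grep -iE 'gpu|cuda|rocm'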
Step 2 — Pull a Model and Test Locally
Pull a high-quality, general-purpose model. Two popular choices are Llama 3 and Mistral:
ollama pull llama3
ollama pull mistral
Run a quick prompt to validate generation:
ollama run llama3
When prompted, type: Explain what containers are in simple terms.
Type /bye (or press Ctrl+D) to exit; Ctrl+C stops a response that is still generating.
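You can also pass the prompt directly on the command line for a one-shot, non-interactive test:
ollama run llama3 "Explain what containers are in simple terms."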
Step 3 — Install Docker (for Open WebUI)
On Linux (quick installer):
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker $USER
newgrp docker
Confirm:
docker version
On Windows:
Install Docker Desktop from the official site, enable WSL 2 integration, and confirm it runs correctly.
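A quick way to confirm the Windows setup from PowerShell (assuming WSL 2 and Docker Desktop are installed as described above):
wsl --status                  # should report WSL 2 as the default version
docker version                # Client and Server sections should both appear
docker run --rm hello-world   # pulls a tiny test image and prints a greeting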
Step 4 — Deploy Open WebUI
Open WebUI provides a rich chat interface on top of the Ollama API. You will run it in Docker and point it to your local Ollama instance.
Linux (Docker host networking):
docker run -d --name open-webui --restart unless-stopped --network host -e OLLAMA_BASE_URL=http://127.0.0.1:11434 -v open-webui:/app/backend/data ghcr.io/open-webui/open-webui:latest
Windows/macOS (port mapping):
docker run -d --name open-webui --restart unless-stopped -p 3000:8080 -e OLLAMA_BASE_URL=http://host.docker.internal:11434 -v open-webui:/app/backend/data ghcr.io/open-webui/open-webui:latest
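Either way, a quick sanity check that the container is up before you open the browser:
docker ps --filter name=open-webui   # STATUS should show "Up"
docker logs --tail 20 open-webui     # startup messages, with no repeated errors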
Open your browser to http://localhost:3000 (Windows/macOS port mapping) or http://localhost:8080 (Linux host networking, where Open WebUI listens on its default port). On first launch, create an admin account. In Settings, select your default model (e.g., llama3) and adjust UI preferences. The data volume open-webui persists chats and configurations across container restarts.
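If you ever want to see where that data actually lives on disk, Docker can tell you (the volume name open-webui comes from the run commands above):
docker volume inspect open-webui   # "Mountpoint" is the on-disk path of the volume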
Step 5 — Basic Security and Remote Access
By default, the Ollama API (port 11434) listens only on localhost, which is the safest setup for a single-user machine. Note that a plain Docker mapping like -p 3000:8080 publishes Open WebUI on all interfaces; to keep it local-only, bind it to loopback with -p 127.0.0.1:3000:8080. If you must access Open WebUI remotely, place it behind a reverse proxy with TLS and a password. For example, using Caddy on a server with a public DNS record:
example.com {
reverse_proxy 127.0.0.1:3000
}
Caddy will fetch a free Let’s Encrypt certificate automatically. Keep the Ollama API on localhost and expose only the web UI via the proxy.
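If you want an extra password gate in front of Open WebUI's own login, Caddy can add HTTP basic auth. This is a sketch: the directive is spelled basic_auth in recent Caddy releases (basicauth in older ones), and the hash below is a placeholder you generate yourself with caddy hash-password.
example.com {
    # replace the hash with the output of: caddy hash-password
    basic_auth {
        admin $2a$14$REPLACE_WITH_GENERATED_HASH
    }
    reverse_proxy 127.0.0.1:3000
}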
Step 6 — Verify the Stack
- Test the Ollama API directly: curl http://localhost:11434/api/tags
- Generate a short reply via the API: curl -s http://localhost:11434/api/generate -d '{"model":"llama3","prompt":"Say hi in one sentence."}' (a non-streaming variant you can script follows this list)
- Open WebUI itself: visit http://localhost:3000 (or http://localhost:8080 if you used host networking) and send a message. Check the model switcher if you pulled multiple models.
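For scripting, you can disable streaming and extract just the text; this sketch assumes jq is installed:
curl -s http://localhost:11434/api/generate \
  -d '{"model":"llama3","prompt":"Say hi in one sentence.","stream":false}' \
  | jq -r '.response'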
Performance Tips
- Prefer smaller or quantized models for limited RAM/VRAM (e.g., a 7B model at Q4_K_M); see the pull sketch after this list.
- Close other heavy apps to reduce RAM pressure and swapping.
- If you have a strong GPU, pull a larger model (e.g., 13B) for better quality.
- Store models on a fast SSD; Ollama caches models under your user profile (e.g., ~/.ollama on Linux).
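As an example of picking a specific quantization, model pages on the Ollama library list tags like the one below; exact tag names vary by model, so treat this as a sketch and check the library listing first:
ollama pull llama3:8b-instruct-q4_K_M   # example tag; verify it on the model's library page
ollama list                             # shows each local model and its size on disk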
Backups and Maintenance
- Models: back up ~/.ollama (Linux) or the Ollama data directory on Windows to avoid re-downloading large files.
- Web UI data: the Docker volume open-webui stores conversations and settings. You can back it up with: docker run --rm -v open-webui:/data -v "$PWD":/backup alpine sh -c 'cd /data && tar czf /backup/open-webui-data.tgz .' (a matching restore sketch follows this list).
- Update Ollama occasionally: curl -fsSL https://ollama.com/install.sh | sh
- Update Open WebUI: pull the new image with docker pull ghcr.io/open-webui/open-webui:latest, then recreate the container so it actually uses the new image (a docker restart alone keeps running the old one); see the update sketch after this list.
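Two sketches that go with the list above: restoring the Open WebUI volume from the backup archive, and updating the container. For the update, re-run whichever docker run command you used in Step 4; the Linux host-networking variant is shown here as the example.
# restore the volume from the backup archive created above
docker run --rm -v open-webui:/data -v "$PWD":/backup alpine sh -c 'cd /data && tar xzf /backup/open-webui-data.tgz'

# update: pull the new image, remove the old container, recreate it (data stays in the volume)
docker pull ghcr.io/open-webui/open-webui:latest
docker rm -f open-webui
docker run -d --name open-webui --restart unless-stopped --network host -e OLLAMA_BASE_URL=http://127.0.0.1:11434 -v open-webui:/app/backend/data ghcr.io/open-webui/open-webui:latest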
Troubleshooting
- Port in use: If 3000 or 11434 is busy, change the mapping (e.g., -p 8080:8080) or stop the conflicting service.
- Docker cannot reach Ollama on Linux: ensure you used --network host, or alternatively add --add-host=host.docker.internal:host-gateway and set OLLAMA_BASE_URL=http://host.docker.internal:11434 (in that case Ollama must also listen beyond loopback, e.g., OLLAMA_HOST=0.0.0.0); a sketch of this bridge-network variant follows this list.
- GPU not used: update drivers and reboot; on Linux check nvidia-smi; on Windows confirm GPU activity in Task Manager during inference.
- Out-of-memory: switch to a smaller model or a stronger quantization; close other applications.
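For reference, here is a sketch of the bridge-network variant mentioned above. It assumes you also tell the Ollama service to listen beyond loopback; the OLLAMA_HOST override via systemctl edit follows the Ollama docs, but double-check the details on your distro.
# make Ollama listen on all interfaces so the container can reach it via the Docker bridge
sudo systemctl edit ollama
# in the editor that opens, add these two lines, then save:
#   [Service]
#   Environment="OLLAMA_HOST=0.0.0.0"
sudo systemctl restart ollama

# run Open WebUI on the default bridge network instead of host networking
docker rm -f open-webui
docker run -d --name open-webui --restart unless-stopped \
  --add-host=host.docker.internal:host-gateway \
  -p 127.0.0.1:3000:8080 \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:latest
Keep in mind that OLLAMA_HOST=0.0.0.0 makes the Ollama API reachable from your network, so firewall port 11434 if that matters in your environment.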
You’re Done
You now have a private, GPU-ready AI chat server powered by Ollama and Open WebUI. This setup is fast, secure (localhost by default), and extensible. You can add specialized models for coding or RAG, script API calls for automation, and back everything up with a couple of commands. Enjoy building with local AI—no cloud required.