If you want a fast, private, and internet-free AI assistant on your computer, running a local large language model (LLM) with Ollama and Open WebUI is one of the easiest modern approaches. Ollama handles model downloads and inference, while Open WebUI provides a clean, ChatGPT-style interface in your browser. This guide walks you through a simple, cross-platform setup on Windows, macOS, and Linux, plus performance tips and troubleshooting.
What You Will Build
You will install Ollama to run models like Llama 3 or Qwen locally, then add Open WebUI with Docker to get a polished chat interface. Everything runs on your machine; your prompts and data never leave your device unless you choose to expose the service.
Prerequisites
You need a 64-bit computer with at least 8 GB RAM (16 GB is better) and 8–20 GB of free disk space per model, depending on model size and quantization. A modern CPU works fine; a supported GPU (Apple Silicon, NVIDIA CUDA, or AMD ROCm on Linux) can significantly boost speed.
Step 1: Install Ollama
Windows: Download and run the installer from https://ollama.com/download. After installation, open PowerShell and verify with ollama --version. Ollama runs a local service on http://localhost:11434.
macOS: If you use Homebrew, run brew install ollama. Alternatively, grab the macOS installer from the Ollama site. Verify with ollama --version. On Apple Silicon (M1/M2/M3), Ollama uses Metal acceleration automatically.
Linux: Run the official script: curl -fsSL https://ollama.com/install.sh | sh. Then start the service if needed with ollama serve. Verify with ollama --version and test the API at http://localhost:11434.
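To double-check that the service is reachable, you can query the local API with curl; the root endpoint replies with a short status message, and recent versions also expose a version endpoint (both URLs assume the default port):
curl http://localhost:11434
curl http://localhost:11434/api/version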
Step 2: Download and Test a Model
Ollama makes model management simple. You can pull and run a model in one step. For a good balance of speed and quality on most machines, try an 8B or 7B model.
Examples:
• Llama 3.1 (8B): ollama run llama3.1
• Qwen2.5 (7B Instruct): ollama run qwen2.5:7b-instruct
• Mistral (7B Instruct): ollama run mistral:instruct
When prompted, type a question to confirm the model responds. The first run downloads the model; subsequent runs start almost instantly. If you prefer a quantized variant to save RAM, look for tags like q4_K_M in the model name (for example, llama3.1:8b-instruct-q4_K_M).
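If you would rather download a model ahead of time without opening a chat, you can pull it separately and then list what is installed (the model name here is just an example):
ollama pull llama3.1
ollama list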
Step 3: Install Open WebUI with Docker
Open WebUI provides a modern chat interface in your browser and connects to Ollama’s API. You will run it in a Docker container for easy updates and isolation. Install Docker Desktop (Windows/macOS) or Docker Engine (Linux) if you do not already have it.
Run the container (Windows/macOS): docker run -d --name open-webui --restart unless-stopped -p 3000:8080 -v open-webui:/app/backend/data -e OLLAMA_BASE_URL=http://host.docker.internal:11434 ghcr.io/open-webui/open-webui:main
Run the container (Linux): docker run -d --name open-webui --restart unless-stopped -p 3000:8080 -v open-webui:/app/backend/data --add-host=host.docker.internal:host-gateway -e OLLAMA_BASE_URL=http://host.docker.internal:11434 ghcr.io/open-webui/open-webui:main
After the container starts, browse to http://localhost:3000. Create your admin account when prompted. You should see available models from Ollama, and you can start chatting immediately.
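If the page does not load, a quick sanity check is to confirm the container is running and watch its startup logs (the name matches the --name flag used above):
docker ps --filter name=open-webui
docker logs -f open-webui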
Step 4: Connect and Customize
If Open WebUI cannot see your models, open Settings in the interface and confirm the Ollama API endpoint is http://host.docker.internal:11434. On Linux, this alias only works if the container was started with the --add-host flag shown above; from inside the container, http://127.0.0.1:11434 points at the container itself rather than your host. You can add multiple backends later, including remote Ollama servers on your LAN.
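To confirm that Ollama itself is serving models, you can query its tags endpoint from the host; the JSON response should list every model you have pulled:
curl http://localhost:11434/api/tags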
Customize the default model, temperature, and context length in Open WebUI settings. For general-purpose tasks, a temperature between 0.2 and 0.7 works well. Increase context for longer documents if your model supports it; keep in mind that higher context increases RAM usage.
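If you would rather bake defaults like these into a model of your own, Ollama supports a simple Modelfile; the sketch below assumes you have already pulled llama3.1, and the name, temperature, context length, and system prompt are only illustrative values. Create a file named Modelfile containing:
FROM llama3.1
PARAMETER temperature 0.3
PARAMETER num_ctx 8192
SYSTEM "You are a concise, helpful assistant."
Then build and select it like any other model:
ollama create my-assistant -f Modelfile
ollama run my-assistant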
Performance Tips
Use your GPU when available: Ollama uses Metal on Apple Silicon, CUDA on NVIDIA, and ROCm on supported AMD GPUs (Linux). Ensure your drivers/toolkits are current. On Linux with NVIDIA, verify with nvidia-smi. If no GPU is detected, Ollama falls back to the CPU.
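You can also ask Ollama which device a loaded model is actually using; on recent versions, the process list reports whether a model is running on the GPU or the CPU:
ollama ps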
Pick the right size and quantization: Smaller models like 7B are fast and light. Quantized builds (for example, q4_K_M) reduce memory usage with minimal quality loss. For higher quality and still-good speed on capable Macs/GPUs, try 8B or 14B quantized variants.
Limit loaded models: If you experiment with multiple models, keep one active at a time. You can stop idle chats or restart the Ollama service to free memory quickly.
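On recent Ollama versions you can unload a model explicitly instead of restarting the service, and you can shorten how long idle models stay in memory via the keep-alive setting (the model name and one-minute value are just examples):
ollama stop llama3.1
OLLAMA_KEEP_ALIVE=1m ollama serve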
Keep data on fast storage: Place your model directory on SSD/NVMe for noticeably faster loading. Avoid external spinning disks for best results.
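If your models currently sit on a slow disk, Ollama reads its storage location from the OLLAMA_MODELS environment variable, so on Linux you can point it at a faster path of your choosing (the path below is just an example; if Ollama runs as a systemd service, set the variable in the service's environment instead):
OLLAMA_MODELS=/mnt/nvme/ollama-models ollama serve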
Troubleshooting
Open WebUI cannot reach Ollama: Ensure the container has the correct API URL. On Linux, include --add-host=host.docker.internal:host-gateway when starting the container; without it, the host alias does not resolve, and http://127.0.0.1:11434 from inside the container does not reach the Ollama service on your host.
Port already in use: If 11434 (Ollama) or 3000 (Open WebUI) is taken, change the binding. Example for Ollama: OLLAMA_HOST=127.0.0.1:11435 ollama serve. For Docker, change the left side of the mapping: -p 4000:8080.
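To see which process is holding a port before changing anything, a standard check on macOS and Linux is lsof (or ss on Linux):
sudo lsof -i :3000
ss -ltnp | grep 11434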
CUDA/ROCm issues: Update to the latest NVIDIA driver (CUDA 12+) or a supported ROCm version on AMD. Restart after driver updates. If GPU still is not used, confirm that smaller models run fine on CPU, then revisit driver/toolkit installation.
Docker permission errors (Linux): If sudo is required, either use it or add your user to the docker group and re-login: sudo usermod -aG docker $USER.
Security and Privacy
By default, Ollama binds to localhost, which is ideal for privacy; Open WebUI is only as private as its Docker port mapping (see the note below). If you decide to access your chatbot from other devices, place it behind a reverse proxy with authentication (e.g., Traefik, Nginx Proxy Manager) and TLS. Never expose 11434 or 3000 directly to the internet without protection.
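Note that Docker's -p 3000:8080 publishes Open WebUI on all of your machine's network interfaces, so other devices on your LAN can reach the login page. If you only ever use the chatbot from the same computer, bind the mapping to loopback when you create the container:
-p 127.0.0.1:3000:8080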
Update and Uninstall
To update Ollama, use your package manager (macOS Homebrew: brew upgrade ollama) or reinstall via the official installer/script. Check the version with ollama --version. You can update models at any time by pulling newer tags.
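For example, re-pulling a tag you already have fetches the latest published build of that model:
ollama pull llama3.1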
To update Open WebUI, pull the latest image and recreate the container:
docker pull ghcr.io/open-webui/open-webui:main
docker stop open-webui && docker rm open-webui
docker run ... (same command you used above)
To remove everything, stop and remove the container and volume: docker rm -f open-webui and docker volume rm open-webui. You can remove downloaded models with ollama rm <model>.
What You Achieved
You now have a fully private AI chatbot running locally with a modern web interface. This stack is flexible: swap models in seconds, run specialized assistants for coding or writing, and scale performance with better GPUs. Most importantly, your prompts and outputs stay on your machine, giving you both speed and peace of mind.