TL;DR: I've been running LLMs at home, and I'll show you what I did.
I've historically been pretty negative on LLMs, especially the frothy attempts at AGI and the overall business trend of jamming AI into any available process regardless of user consent. But when I found out that there were LLMs I could run on a laptop? LLMs that didn't have to "know" about the entire internet, but could just understand plain text and follow simple instructions, maybe operate some tools? I'm sold. It turns out what I was against wasn't LLMs, but LLMs as a service. I still think the business model is bullshit, but I'll save you that discussion.
So, running LLMs at home. There are several steps to it. The first and easiest way to get started is to download Ollama and run Llama 3.2 by typing ollama run llama3.2 at a command prompt. This will download a 3B (3 billion parameter) model that can hold a conversation and do function calling. You can use other models, but be aware that they can be slow depending on their parameter count and your hardware.
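If you want to poke around before settling on a model, the basic Ollama CLI commands cover it. A quick sketch (the qwen2.5 tag is just an example of another model you could pull):

# See which models are already downloaded
ollama list
# Pull another model without starting a chat (example tag)
ollama pull qwen2.5:7b
# Run a one-shot prompt instead of an interactive session
ollama run llama3.2 "Summarize what Tailscale does in one sentence."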
Once you have an Ollama model running, you can confirm it's running on your GPU with ollama ps. To reach it from other machines, enable Ollama on the Tailscale interface and then expose it over Tailscale with tailscale serve --bg --https 443 localhost:11434. After that, I can point any machine in the house at the Ollama endpoint.
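To sanity-check the endpoint, I hit Ollama's HTTP API directly. This is a sketch: the hostname is a placeholder for whatever name Tailscale assigns your desktop, and if Ollama refuses non-local requests, the OLLAMA_HOST and OLLAMA_ORIGINS environment variables (set for the Ollama server process) control the bind address and allowed origins.

# Confirm the model is loaded and which processor (GPU/CPU) it's using
ollama ps
# From any other machine on the tailnet (hostname is a placeholder)
curl https://desktop.your-tailnet.ts.net/api/generate \
  -d '{"model": "llama3.2", "prompt": "Say hello in five words.", "stream": false}'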
The downside to using Ollama directly is that it's a command-line experience. You can't really save your chats, and you can't share them with anyone. You'll also have an awkward time comparing different models' output on the same inputs, or measuring how long responses take to generate. Ollama will also unload models that haven't been used in a while; you can set the OLLAMA_KEEP_ALIVE environment variable or specify keep_alive in the request parameters to change that.
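Both knobs are easy to try. A sketch, assuming you want models to stay resident much longer than the default few minutes:

# For the Ollama server process: keep models loaded for a day
export OLLAMA_KEEP_ALIVE=24h
# Or per request: keep_alive of -1 keeps this model loaded indefinitely
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.2", "prompt": "warm up", "keep_alive": -1}'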
There are a number of options for managing chats, depending on your use case, but broadly speaking you will pick either a web application or a native application. The web applications vary a lot by use case, but if you are running locally you will want either KoboldCpp (mostly for roleplay), text-generation-webui (mode-based chat), or Open WebUI (technical / no fun). The native applications vary a lot too: you have things like Enchanted that provide a nice GUI on top of Ollama, or you can run LM Studio (technical), AnythingLLM (sort of wiki-based), or GPT4All (fine, I guess). There are even local assistants like Goose or Tabby.
I ended up using Open WebUI on my basement server, for a couple of reasons.
- Open WebUI has a mobile web application mode, so I can get to it on my phone or iPad.
- I have Tailscale, so I can access it while away from home.
- Open WebUI can use speech-to-text (STT) and text-to-speech (TTS), so I can chat without looking.
- Open WebUI can connect to multiple remote services through OpenAI endpoints and Ollama endpoints, centralizing chat history. There's even support for Claude, which does not provide an OpenAI endpoint. (There's a configuration sketch below.)
- Open WebUI allows me to download new models to different Ollama instances from the UI. This is very smooth.
- There are lots of little features, like holding down Shift to delete chats with a single click.
Finally, the biggest reason for using Open WebUI is that any native application is going to tie you to a particular platform, and typically has much less responsive support. Open WebUI is all Python, all open source, and has all the bells and whistles you could want.
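If you'd rather wire up the connections ahead of time instead of clicking through the admin settings, Open WebUI reads them from environment variables. The sketch below assumes OLLAMA_BASE_URL, OPENAI_API_BASE_URL, and OPENAI_API_KEY are the right names (double-check against the current Open WebUI docs), and the hostname is a placeholder.

# Point Open WebUI at the Ollama instance on my desktop
export OLLAMA_BASE_URL=https://desktop.your-tailnet.ts.net
# Optionally add an OpenAI-compatible endpoint as well
export OPENAI_API_BASE_URL=https://api.openai.com/v1
export OPENAI_API_KEY=sk-placeholder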
Initially, I ran Open WebUI as a Docker container on Windows before moving it down to the basement server and exposing it over HTTPS (needed for microphone access) through tailscale serve. This turned out to be a bit of a pain: Docker Desktop provides a host.docker.internal hosts entry automatically, but Docker Engine does not, so you have to add an explicit --add-host=host.docker.internal:host-gateway to the docker run command. On top of that, the container doesn't know about Tailscale, so accessing external hosts is very manual.
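For reference, the Docker Engine invocation ends up looking roughly like this; the image tag and port mapping follow Open WebUI's own Docker instructions, but double-check them against the current docs:

docker run -d \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main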
What I ended up doing was leveraging my bootstrap solution and running Open WebUI inside a VirtualBox VM with Tailscale enabled for better control. Here's how I did that:
- Install Vagrant, VirtualBox, and Tailscale.
- Set up a Vagrantfile, an Ansible playbook, and a role requirements file for installing Open WebUI (sketched below).
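The whole thing is three files sitting in one directory. Here's a sketch of the bring-up; the requirements.yml contents are my assumption based on the role the playbook imports:

# Project layout: Vagrantfile, playbook.yml, requirements.yml
mkdir openwebui && cd openwebui
# requirements.yml just pulls the Tailscale role used in the playbook
cat > requirements.yml <<'EOF'
- src: artis3n.tailscale
EOF
# Bring up the VM; Vagrant runs the Ansible provisioner automatically
vagrant up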
The Vagrantfile is straightforward, although you do have to give it a fair amount of memory or it will crash.
Vagrant.configure("2") do |config|
config.vm.box = "ubuntu/jammy64"
config.vm.hostname = "openwebui"
config.vm.provider "virtualbox" do |v|
v.name = "openwebui"
v.memory = 16384
v.cpus = 6
end
config.vm.provision :ansible do |ansible|
ansible.verbose = false
ansible.compatibility_mode = "2.0"
ansible.playbook = "playbook.yml"
ansible.galaxy_role_file = "requirements.yml"
ansible.galaxy_roles_path = "/etc/ansible/roles"
ansible.galaxy_command = "sudo ansible-galaxy install --role-file=%{role_file} --roles-path=%{roles_path} --force"
end
config.trigger.before :destroy do |trigger|
trigger.run_remote = {inline: "tailscale logout"}
trigger.on_error = :continue
end
end
The playbook.yml file is more interesting, as uv is a complete solution in itself: if you have it installed, then /usr/local/bin/uvx --python 3.11 open-webui@latest serve --host 127.0.0.1 --port 3000 will download everything (and update it!) and then serve it as appropriate. It logs to stdout, so I stuck it in a systemd service so I could refer back to the logs and have some service management.
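Once it's in systemd, the usual service-management commands apply; nothing here is specific to Open WebUI:

# Is it running, and what has it logged?
sudo systemctl status open-webui
sudo journalctl -u open-webui --since today
# Follow the logs while it downloads models on first start
sudo journalctl -u open-webui -f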
When the service starts up, it immediately tries to download around 30 files adding up to multiple gigabytes: these are the internal RAG and TTS/STT models. If you don't want them, you can disable them. I found the RAG pipeline to be very slow, to alter documents, and to occasionally miss documents altogether, so I recommend at least setting RAG_EMBEDDING_MODEL_AUTO_UPDATE=false and RAG_RERANKING_MODEL_AUTO_UPDATE=false.
---
- hosts: all
  become: true
  no_log: false
  gather_facts: true
  tasks:
    - name: install-tailscale
      import_role:
        name: artis3n.tailscale
      vars:
        tailscale_authkey: ""
        tailscale_args: "--ssh"
        insecurely_log_authkey: false

    - name: Download uv installer script
      get_url:
        url: https://astral.sh/uv/install.sh
        dest: /tmp/uv_install.sh
        mode: '0755'

    - name: Execute uv installer script
      shell: /tmp/uv_install.sh
      environment:
        XDG_BIN_HOME: /usr/local/bin
      args:
        creates: /usr/local/bin/uv  # Prevents re-running if already installed

    - name: Set correct permissions
      file:
        path: /usr/local/bin/uv
        mode: '0755'
        owner: root
        group: root

    - name: Clean up installer script
      file:
        path: /tmp/uv_install.sh
        state: absent

    - name: Create systemd service for Open WebUI
      copy:
        dest: /etc/systemd/system/open-webui.service
        content: |
          [Unit]
          Description=Open WebUI
          After=network.target

          [Service]
          Type=simple
          RemainAfterExit=yes
          TimeoutStartSec=0
          WorkingDirectory=/vagrant/open-webui-data
          Environment=RAG_EMBEDDING_MODEL_AUTO_UPDATE=false
          Environment=WHISPER_MODEL_AUTO_UPDATE=true
          Environment=RAG_RERANKING_MODEL_AUTO_UPDATE=false
          Environment=ANTHROPIC_API_KEY=
          Environment=DATA_DIR=/vagrant/open-webui-data
          ExecStart=/usr/local/bin/uvx --python 3.11 open-webui@latest serve --host 127.0.0.1 --port 3000
          Restart=always
          User=vagrant

          [Install]
          WantedBy=multi-user.target

    - name: Reload systemd
      systemd:
        daemon_reload: yes

    - name: Enable and start Open WebUI service
      systemd:
        name: open-webui
        enabled: yes
        state: started

    - name: Set tailscale serve of Open WebUI
      ansible.builtin.command: tailscale serve --bg --https 443 localhost:3000
      become: true
Now, I can have my Windows desktop run Ollama, make it available to the server in the basement, and access it from anywhere with Tailscale.
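A quick end-to-end check from any device on the tailnet looks something like this; the hostnames are placeholders for whatever MagicDNS names your machines end up with:

# Ollama on the Windows desktop, exposed by tailscale serve
curl https://desktop.your-tailnet.ts.net/api/tags
# Open WebUI in the basement VM
curl -I https://openwebui.your-tailnet.ts.net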