Running LLMs at Home

TL;DR: I've been running LLMs at home, and I'll show you what I did.

I've historically been pretty negative on LLMs, especially the frothy attempts at AGI and the overall business trend of jamming AI into any available process regardless of user consent. But when I found out that there were LLMs I could run on a laptop? LLMs that didn't have to "know" about the entire internet, but could just understand plain text and follow simple instructions, maybe operate some tools? I'm sold. It turns out what I was against wasn't LLMs, but LLMs as a service. I still think the business model is bullshit, but I'll save you that discussion.

So, running LLMs at home. There are several steps to it. The first and easiest way to get started is to download Ollama and run Llama 3.2 by typing ollama run llama3.2 at a command prompt. This will download a 3B (3 billion parameter) model that can hold a conversation and do function calling. You can run other models, but be aware that larger models with more parameters will be noticeably slower.
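
If it helps, here's roughly what that first session looks like (a sketch; the install script URL is the one Ollama documents for Linux, and macOS/Windows have regular installers instead):

# install Ollama (Linux convenience script)
curl -fsSL https://ollama.com/install.sh | sh

# download and chat with the 3B model
ollama run llama3.2

# see which models are installed locally
ollama list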

Once you have an Ollama model running, you can check that it's running on your GPU with ollama ps. To share it, enable Ollama on the Tailscale interface and then expose it over HTTPS with tailscale serve --bg --https 443 localhost:11434. After that, I can point any machine in the house at the Ollama endpoint.
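
Concretely, that looks something like the following (a sketch: OLLAMA_HOST is the documented way to make Ollama listen beyond localhost, but where you set it depends on how the Ollama server was installed; for a systemd install it goes in the service's environment rather than your shell):

# confirm the model is loaded and whether it's running on the GPU
ollama ps

# let Ollama listen on more than localhost (covers the tailscale interface)
export OLLAMA_HOST=0.0.0.0

# publish it to the tailnet over HTTPS
tailscale serve --bg --https 443 localhost:11434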

The downside to using Ollama directly is that it's a command line experience. You can't really save your chats, and you can't share them with anyone. It's also awkward to compare different models' outputs for the same prompt, or to measure how long responses took to generate. Ollama will also unload models that haven't been used in a while; you can set the OLLAMA_KEEP_ALIVE environment variable or pass keep_alive in the request parameters to change that.
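
For example (a sketch; the variable has to be set where the Ollama server runs, and as I understand the docs, -1 means keep the model loaded indefinitely):

# server-wide: keep models loaded for an hour after last use
export OLLAMA_KEEP_ALIVE=1h

# per request: keep llama3.2 resident indefinitely
curl http://localhost:11434/api/generate -d '{"model": "llama3.2", "keep_alive": -1}'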

There are a couple of options to manage chats, depending on your use case, but broadly speaking you will pick either a web application or a native application. The web applications vary a lot by use case, but if you are running locally you will want either KoboldCpp (mostly for roleplay), text-generation-webui (mode-based chat), or Open WebUI (technical / no fun). The native applications vary a lot: you have things like Enchanted that provide a nice GUI on top of Ollama, or you can run LM Studio (technical), AnythingLLM (sort of wiki-based), or GPT4All (fine, I guess). There are even local assistants like Goose or Tabby.

I ended up using Open WebUI on my basement server, for a number of reasons.

  • Open WebUI has a mobile web application mode, so I can get to it on my phone or iPad.
  • I have Tailscale, so I can access it while out of the house.
  • Open WebUI can use speech-to-text (STT) and text-to-speech (TTS), so I can chat without looking.
  • Open WebUI can connect to multiple remote services through OpenAI and Ollama endpoints, centralizing chat history. There's even support for Claude, which doesn't provide an OpenAI-compatible endpoint.
  • Open WebUI allows me to download new models to different Ollama instances from the UI. This is very smooth.
  • There are lots of little features, like holding down Shift to delete a chat with one click.

Finally, the biggest reason for using Open WebUI is that any native application is going to tie you down to a particular platform, and typically has much less responsive support. Open WebUI is all Python, all open source, and has all the bells and whistles you could want.

Initially, I ran Open WebUI as a Docker container on Windows before moving it down to the basement server and exposing it with HTTPS (needed for microphone access) through tailscale serve. This turned out to be a bit of a pain: Docker Desktop provides a host.docker.internal hosts entry automatically, but Docker Engine does not, so you have to add an explicit --add-host=host.docker.internal:host-gateway to the docker run command, and the container doesn't know about Tailscale, so reaching external hosts is very manual.
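
For reference, the Docker Engine invocation looked roughly like this (a sketch based on Open WebUI's documented docker run command; the port and volume choices are just examples):

docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main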

What I ended up doing was leveraging my bootstrap solution and running Open WebUI inside a VirtualBox VM with Tailscale enabled, for better control. Here's how I did that:

  1. Install Vagrant, VirtualBox, and Tailscale.
  2. Set up some scripts for installing Open WebUI.

The Vagrantfile is straightforward, although you do have to give the VM a fair amount of memory or Open WebUI will crash.

Vagrant.configure("2") do |config|   
    config.vm.box = "ubuntu/jammy64"
    config.vm.hostname = "openwebui"

    config.vm.provider "virtualbox" do |v|
        v.name = "openwebui"
        v.memory = 16384
        v.cpus = 6
    end

    config.vm.provision :ansible do |ansible|
        ansible.verbose = false
        ansible.compatibility_mode = "2.0"
        ansible.playbook = "playbook.yml"
        ansible.galaxy_role_file = "requirements.yml"
        ansible.galaxy_roles_path = "/etc/ansible/roles"
        ansible.galaxy_command = "sudo ansible-galaxy install --role-file=%{role_file} --roles-path=%{roles_path} --force"        
    end

    config.trigger.before :destroy do |trigger|
        trigger.run_remote = {inline: "tailscale logout"}
        trigger.on_error = :continue
    end
end

The playbook.yml file is more interesting, as uv is a complete solution in itself: if you have it installed, then /usr/local/bin/uvx --python 3.11 open-webui@latest serve --host 127.0.0.1 --port 3000 will download everything (and keep it updated!) and then serve it as appropriate. It logs to stdout, so I stuck it in a systemd service so I could refer back to the logs and have some service management.

When the service starts up, it immediately tries to download 30 files of multiple gigabytes; these are the internal RAG and TTS/STT models. If you don't want them, you can disable them. I found the RAG to be very slow, to alter documents, and to occasionally miss documents altogether, so I recommend at least setting RAG_EMBEDDING_MODEL_AUTO_UPDATE=false and RAG_RERANKING_MODEL_AUTO_UPDATE=false.

---
- hosts: all
  become: true
  no_log: false
  gather_facts: true

  tasks:
    - name: install-tailscale
      import_role:
        name: artis3n.tailscale
      vars:
        tailscale_authkey: ""  # supply your own Tailscale auth key (left blank here)
        tailscale_args: "--ssh"
        insecurely_log_authkey: false

    - name: Download uv installer script
      get_url:
        url: https://astral.sh/uv/install.sh
        dest: /tmp/uv_install.sh
        mode: '0755'

    - name: Execute uv installer script
      shell: /tmp/uv_install.sh
      environment:
        XDG_BIN_HOME: /usr/local/bin
      args:
        creates: /usr/local/bin/uv  # Prevents re-running if already installed

    - name: Set correct permissions
      file:
        path: /usr/local/bin/uv
        mode: '0755'
        owner: root
        group: root

    - name: Clean up installer script
      file:
        path: /tmp/uv_install.sh
        state: absent

    - name: Ensure the shared data directory exists
      file:
        path: /vagrant/open-webui-data
        state: directory
        owner: vagrant
        group: vagrant

    - name: Create systemd service for Open WebUI
      copy:
        dest: /etc/systemd/system/open-webui.service
        content: |
          [Unit]
          Description=Open WebUI
          After=network.target

          [Service]
          Type=simple
          RemainAfterExit=yes
          TimeoutStartSec=0
          WorkingDirectory=/vagrant/open-webui-data
          Environment=RAG_EMBEDDING_MODEL_AUTO_UPDATE=false
          Environment=WHISPER_MODEL_AUTO_UPDATE=true
          Environment=RAG_RERANKING_MODEL_AUTO_UPDATE=false
          Environment=ANTHROPIC_API_KEY=
          Environment=DATA_DIR=/vagrant/open-webui-data
          ExecStart=/usr/local/bin/uvx --python 3.11 open-webui@latest serve --host 127.0.0.1 --port 3000
          Restart=always
          User=vagrant        

          [Install]
          WantedBy=multi-user.target      

    - name: Reload systemd
      systemd:
        daemon_reload: yes

    - name: Enable and start Open WebUI service
      systemd:
        name: open-webui
        enabled: yes
        state: started

    - name: Set tailscale serve of Open WebUI
      ansible.builtin.command: tailscale serve --bg --https 443 localhost:3000
      become: true
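
Once the play has run, the "refer back to logs and have some service management" part is just ordinary systemd tooling; nothing here is Open WebUI-specific:

# is it up, and what did it print recently?
systemctl status open-webui

# follow the logs live (the model downloads on first start show up here)
journalctl -u open-webui -f

# restart to pick up a new release, per the update behavior of open-webui@latest noted above
sudo systemctl restart open-webui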

Now, I can have my Windows desktop run Ollama, make it available to the server in the basement, and access it from anywhere with Tailscale.
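
The last bit of wiring is telling Open WebUI where that remote Ollama lives. You can do that in the admin Connections settings, or (a sketch; OLLAMA_BASE_URL is Open WebUI's documented setting, and the hostname below is a made-up tailnet name to substitute with your own) as one more Environment line in the unit above:

Environment=OLLAMA_BASE_URL=https://desktop.tailnet-name.ts.net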
