Making An LLM That Just Works For My Brother
This is a technical blog post with a heartwarming personal story attached at the end.
I've streamlined and generalized my work with LLMs into a reproducible project. GroundedLLM is a turnkey implementation packaged as a Docker Compose stack: you plug in a Tavily API key and a Google Gemini API key, run docker compose up -d, and it brings up a fully configured search agent accessible through Open WebUI.
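To give a sense of what "fully configured" means, once the stack is up you can also talk to the agent programmatically through Open WebUI's OpenAI-compatible API. This is a minimal sketch, not part of GroundedLLM itself: the port, the agent's model name, and the API key (generated in Open WebUI's settings) are all placeholders for whatever your install uses.

```python
# Minimal sketch: query the search agent through Open WebUI's
# OpenAI-compatible endpoint after `docker compose up -d` finishes.
# The URL, model name, and API key below are illustrative placeholders.
import requests

OPENWEBUI_URL = "http://localhost:3000/api/chat/completions"
API_KEY = "sk-..."  # an API key created in Open WebUI's settings

response = requests.post(
    OPENWEBUI_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "search-agent",  # whatever name the Letta agent is exposed under
        "messages": [{"role": "user", "content": "What changed in Haystack 2.x?"}],
    },
    timeout=120,
)
print(response.json()["choices"][0]["message"]["content"])
```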
Why it exists: I have a weekly FaceTime call with my brother Felix, now that he's moved to London and our schedules don't align the way they used to. We spend the time catching up and geeking out about technical stuff. I told him about some of the hilarious failures that LLMs have produced.
And my brother asked why LLMs didn't Just Work.
Why Don't LLMs Just Work?
In a way, it's completely explainable. We experience reality as a controlled hallucination that needs to be grounded by our senses. It's not that LLMs hallucinate – it's that they have no control over their hallucinations because they don't have enough input to ground them. Typically, LLMs cover for their lack of awareness through exhaustive amounts of training data, but for anything after their training cutoff, they are helpless. They have no way to get at fresh data – and even if they do, they have no way to retain that data once it leaves their context window. They cannot look, they cannot learn, and they cannot remember.
(In fairness, Google has been rolling out grounding with Google Search and does have some personalization based on user preferences – but it's intentionally limited due to privacy concerns, and has very little agency.)
I think LLMs need at least four things to be useful:
- Brain: the LLM needs strong cognitive abilities to understand and use its tools.
- Senses: the LLM needs some way to ground itself in its environment and avoid hallucinations.
- Agency: the LLM needs to be able to navigate itself out of uncertainty.
- Memory: the LLM needs to be able to keep track of its environmental state past its context window.
When I was learning to cook, the reason it worked so well was that I had all four pieces in place: Claude Sonnet 3.7 (Brain), Tavily Search and Mealie (Senses), and Letta (Agency and Memory). With those four things, I had a system that was well grounded – if it needed recipes, it used the search tool rather than hallucinating websites and recipes that didn't exist. If I gave it a goal, it could run through several different tools and options to meet that goal, and it knew to take my preferences and limitations into account.
So. Make an LLM that Just Worked for my brother, using these components. A turnkey solution.
Making Search and Extract Tools
The first step was to set up a tool server and create some intelligent extraction and search tools.
Setting up a tool server was easy: I had been using Hayhooks to act as a tool server for Letta, and in the intervening weeks they added MCP support to Hayhooks, which made it even easier to leverage.
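For context, a Hayhooks tool is just a pipeline wrapper class that Hayhooks exposes as an HTTP endpoint (and, with the new MCP support, as an MCP tool). The sketch below shows the general shape; the pipeline file, component names, and parameters are illustrative placeholders rather than the actual GroundedLLM code.

```python
# Sketch of a Hayhooks pipeline wrapper. Everything named here
# (search.yml, the "search" and "llm" components) is illustrative.
from pathlib import Path

from hayhooks import BasePipelineWrapper
from haystack import Pipeline


class PipelineWrapper(BasePipelineWrapper):
    def setup(self) -> None:
        # Build or load the Haystack pipeline once at startup.
        yaml_text = (Path(__file__).parent / "search.yml").read_text()
        self.pipeline = Pipeline.loads(yaml_text)

    def run_api(self, question: str) -> str:
        """Search the web and answer the question with cited sources."""
        result = self.pipeline.run({"search": {"query": question}})
        return result["llm"]["replies"][0]
```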
I wanted a decent page extraction tool that could convert multiple formats to Markdown, clean the Markdown, and pass it to an LLM. In addition, I knew from past experience that it's incredibly easy to blow past Anthropic's tier 1 rate limit when you pair a search tool with an agent, so I wanted the search results processed somewhere other than Claude Sonnet – I picked Gemini 2.0 Flash. It took a bit more work to understand how conditional routing and joining work in Haystack, but plugging it into another LLM was also straightforward. A few more tweaks, and I had HTTP/2. For the extract prompt, I added some tweaks to cover the case where an llms.txt was passed in, plus a verbatim option in case the agent really needed to get at the full text.
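As a rough illustration of the extract side, here is a stripped-down Haystack pipeline that fetches a page, converts it to a document, and hands the text to Gemini 2.0 Flash. The real pipeline adds multi-format conversion, Markdown cleanup, conditional routing, and the llms.txt/verbatim handling described above; using Gemini's OpenAI-compatible endpoint here is my assumption for the sketch, not necessarily how GroundedLLM wires it.

```python
# Stripped-down extract pipeline: fetch a URL, convert it to a Document,
# and ask Gemini 2.0 Flash to turn it into clean Markdown. Illustrative only.
from haystack import Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.converters import HTMLToDocument
from haystack.components.fetchers import LinkContentFetcher
from haystack.components.generators import OpenAIGenerator
from haystack.utils import Secret

template = """Convert the following page to clean Markdown, keeping headings and code blocks:
{% for doc in documents %}{{ doc.content }}{% endfor %}"""

pipe = Pipeline()
pipe.add_component("fetcher", LinkContentFetcher())
pipe.add_component("converter", HTMLToDocument())
pipe.add_component("prompt", PromptBuilder(template=template))
pipe.add_component(
    "llm",
    OpenAIGenerator(  # Gemini via its OpenAI-compatible endpoint (an assumption for this sketch)
        api_key=Secret.from_env_var("GEMINI_API_KEY"),
        api_base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
        model="gemini-2.0-flash",
    ),
)
pipe.connect("fetcher.streams", "converter.sources")
pipe.connect("converter.documents", "prompt.documents")
pipe.connect("prompt.prompt", "llm.prompt")

result = pipe.run({"fetcher": {"urls": ["https://docs.haystack.deepset.ai/docs/intro"]}})
print(result["llm"]["replies"][0])
```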
Next, I needed Tavily running as a standalone tool. For the cooking app, Tavily Search was configured through Composio, but that wouldn't work here because Composio is not a turnkey solution. Fortunately, Tavily has a Python library that was easy to integrate; I put together a custom component that wrapped Tavily and turned search results into documents. I set up the search prompt so that the search LLM does query expansion and suggests follow-up queries, and I implemented the "search, then extract" approach that best practices recommend.
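The custom component itself is small; below is the general shape of it. The field names follow what the Tavily Python client returns, but the class name and parameters are illustrative rather than copied from GroundedLLM.

```python
# Sketch of a custom Haystack component wrapping Tavily search and
# turning the results into Documents. Names and defaults are illustrative.
import os

from haystack import Document, component
from tavily import TavilyClient


@component
class TavilySearch:
    def __init__(self, max_results: int = 5):
        self.client = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])
        self.max_results = max_results

    @component.output_types(documents=list[Document])
    def run(self, query: str):
        response = self.client.search(
            query=query,
            max_results=self.max_results,
            search_depth="advanced",
        )
        docs = [
            Document(
                content=hit["content"],
                meta={"title": hit["title"], "url": hit["url"], "score": hit["score"]},
            )
            for hit in response["results"]
        ]
        return {"documents": docs}
```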
Provisioning The Agent
The next step was getting an agent created and configured automatically when Docker Compose started up, and then also hooking up that agent to Open WebUI.
At this point I had enough of an agent running that I simply asked it what I should do, and how other people had solved this problem: it told me I should run an initialization container that would run some commands and then exit.
Setting up the initializer was actually fun once I realized how little work was involved. Just like before, all I needed to do was call Hayhooks to do the actual provisioning, then pass it off to a Letta custom component to create the agent, and to an Open WebUI custom component that integrated that agent using a tweaked version of the Letta Pipe.
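The Letta half of that provisioning boils down to a couple of API calls. Here is roughly what it looks like with the letta-client SDK; the base URL, agent name, persona file, and model/embedding handles are placeholders, and the Open WebUI registration step is omitted.

```python
# Rough sketch of the Letta half of provisioning: create a search agent
# with persona and human memory blocks. All names and handles are placeholders.
from pathlib import Path

from letta_client import Letta

client = Letta(base_url="http://letta:8283")  # the Letta container on the compose network

agent = client.agents.create(
    name="search-agent",
    memory_blocks=[
        {"label": "persona", "value": Path("persona_memory.md").read_text()},
        {"label": "human", "value": ""},
    ],
    model="google_ai/gemini-2.0-flash",         # illustrative model handle
    embedding="openai/text-embedding-3-small",  # illustrative embedding handle
)
print(f"Created agent {agent.id}")
```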
Prompting The Agent
The next step was setting up the search agent to do query decomposition so that it could effectively break down big questions into little questions that it could search for. I used DeepRAG: Thinking to Retrieval Step by Step for Large Language Models as the primary source, although the general technique of query decomposition is well known.
All of this went into persona_memory.md. I passed it through Anthropic's prompt improvement tool a couple of times until it was happy.
The next pass through, I realized that the first thing Letta should do is introduce itself and its capabilities:
On the first interaction with the user, explain to them that you are capable of remembering information between chats – especially if they use a phrase like "store this in core memory" – and you can demonstrate this: ask the user their name and where they live and infer their timezone and locale from their response. Store the user's information in your core memory for future reference, and refer to them by their name in future interactions.
And it should proactively pay attention to the user's habits:
- If the user mentions their interests, background, or preferences, record them in human core memory.
- If the user mentions a search preference, e.g. a specific version of documentation, preferred websites to use as sources, or preferred questions to ask, take those preferences into account when using tools.
And then I told it how llms.txt files work – a Markdown index of a site's key pages, served at /llms.txt specifically for LLM consumption – for situations where it was dealing with a site that wasn't accessible through Tavily. Using llms.txt and some cues, the agent was actually capable of doing its own impromptu scrape of a site until it got what it needed.
Setting up MCP Tools
It was about this time that aws-documentation-mcp-server came out, immediately making the prior work I'd done on scraping and importing AWS documentation pointless. I plugged it into the system.
At this point, I realized I could plug any MCP server in, as long as I had it running over SSE. I added Wikipedia.
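To make the "anything over SSE" point concrete, this is roughly what connecting to such a server and listing its tools looks like with the MCP Python SDK; the URL is a placeholder for whichever container the server runs in.

```python
# Sketch: connect to an MCP server over SSE and list the tools it exposes.
# The URL below is a placeholder for the container running the server.
import asyncio

from mcp import ClientSession
from mcp.client.sse import sse_client


async def list_mcp_tools(url: str) -> None:
    async with sse_client(url) as (read_stream, write_stream):
        async with ClientSession(read_stream, write_stream) as session:
            await session.initialize()
            tools = await session.list_tools()
            for tool in tools.tools:
                print(f"{tool.name}: {tool.description}")


asyncio.run(list_mcp_tools("http://wikipedia-mcp:8080/sse"))
```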
I also needed to tell the agent how Open WebUI manages images, video, and audio, so that media in its responses would show up properly in the chat.
And finally, I added Letta MCP Server, so I could have an agent that futzed with agents.
I'd like to add more, but one of the annoying things about tools is that you can't just add everything to the agent. It's the paradox of choice: more tools mean more opportunities for the LLM to get confused. In my experience, even a powerful LLM like Claude Sonnet 3.7 tops out at around 20 tools before it starts breaking down. At some point, putting everything into a single agent stops working.
If I were to add more tools, I would need more agents, each with their own focused persona and toolkit, e.g. the personal assistant agent would manage Google Calendar and TickTick.
Also, MCP worries me. I've stuck my MCP servers in Docker containers to isolate them, but I have a gut feeling that throwing a bunch of MCP servers at an LLM is like taping a knife to a Roomba: something is going to happen, and it's not going to be pretty.
What's Not Included
My last blog post looked at classic "vector datastore backend" retrieval-augmented generation (RAG) as an answer, possibly the answer. My particular focus was on downloading and indexing the AWS documentation so that it would be available for an LLM.
I can't see a turnkey solution implementing a full-on datastore. It's only a solution when you know exactly what the problem domain will be, and even then you have the scraping, cleaning, and indexing on top of it. It's not practical. I've thought about adding a caching layer with nginx as a forward proxy, but even then the gains are minimal.
In almost every situation I will face as an end user, I am searching an already indexed repository and it is Somebody Else's Problem – it's a Typesense or an Algolia or Sourcegraph situation. Generally if it's a company resource, someone has put the time and effort into making it available, and all I have to do is plug into their APIs. They want me to search them. If I'm really unlucky then I might have to futz with a headless browser to make search work, but in almost every case a search and extract is the right answer.
The only reason I can see to have a local datastore is if the remote data source is unavailable, or you're working with local documents that don't have an index. Even in that case, I would start with ripgrep or Sonic and only scale up to DuckDB or SQLite with a vector store if absolutely necessary. There is also the option of integrating local documents with Letta's archival memory via the Data Sources API, e.g. you could load all the files in a git repository into a data source and attach it to a "repository agent" set up with nomic-embed-code that could answer questions about that codebase.
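For the "start with ripgrep" case, a local search tool can be as small as a subprocess wrapper. This is a hypothetical sketch, not anything shipped in GroundedLLM; the path and limits are placeholders.

```python
# Sketch of a minimal local search tool: shell out to ripgrep and return
# matches as plain text an agent can read. Paths and limits are illustrative.
import subprocess


def search_local_docs(pattern: str, root: str = "/data/docs", max_matches: int = 50) -> str:
    """Case-insensitive search over local documents using ripgrep."""
    result = subprocess.run(
        ["rg", "--ignore-case", "--line-number", "--max-count", "5", pattern, root],
        capture_output=True,
        text=True,
    )
    if result.returncode == 1:  # ripgrep exits 1 when there are no matches
        return "No matches found."
    lines = result.stdout.splitlines()[:max_matches]
    return "\n".join(lines)


print(search_local_docs("archival memory"))
```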
The End Result
When our next FaceTime call started, both Felix and Jen (Felix's wife) were there, and we caught up on everything. Jen mentioned she had a work problem that involved digging data out of several different websites and correlating it, and I said "Yes. I have a Thing That Will Help With That. That Is What It Does."
Felix installed the Docker containers and set up the API keys, and in five minutes the search agent had the answers. Felix asked for a different presentation, and the agent formatted the data in a table and linked the citations.
I told Felix he should double check that the information was correct, but the fact that it was able to gather that information at all and present it in one place was already a huge step forward.
Someone who didn't even know they wanted to use the search agent found that it solved a real world problem for them.
That is a solid win.