This is going to be a more technical blog post: I'm documenting the work of putting together a locally based large language model (LLM) to help me with recipes and cooking. You can read my last post for details.
Using Open WebUI has been a good choice, but some glaring limitations show up once you start using LLMs in earnest. The biggest one is that LLMs have no memory. This is a problem for me, because I want my LLM to understand my context: I have a rice cooker, I prefer steaming things in the microwave to boiling or blanching, I have this food in the fridge. I don't want to have to keep telling it this.
This is a good place to start talking about how LLMs work and to give a brief rundown of the internals. The underlying architecture of LLMs and generative AI is really interesting once you get into it, and the rundown should explain why I'm approaching the problem this way.
How LLMs Work
A large language model is basically a stateless function that consumes text as an input and produces text as an output. It does this by converting words and symbols into tokens, each token represented internally by a high-dimensional vector of numbers. An LLM consumes text by placing it into the context window, but the context window is limited to a set number of tokens, and larger context windows take far more time and computing power to process.
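To make tokenization concrete, here's a tiny example using the tiktoken library (the encoding name and sentence are arbitrary; local models ship their own tokenizers, but the idea is the same):

```python
# A quick look at tokenization, assuming the tiktoken library is installed.
# Local models use their own tokenizers, but the idea is the same.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("Steam the broccoli in the microwave for three minutes.")
print(tokens)              # a short list of integers, one id per token
print(enc.decode(tokens))  # decodes back to the original sentence
```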
When you hold a conversation with an LLM, every single message in the chat is sent to the context window of the LLM for processing. It does not hold the previous messages in memory – it must read through the entire thing over again, every time. This has consequences: there's an upper limit to how long your conversations can be, and the more you talk, the more text there is for the LLM to process and the slower it gets.
graph TD
A[New Input] --> B
B --> C{LLM Processing}
subgraph Context Window
B[New Input Message]
D[System Message<br>'You are a helpful assistant...']
E[Previous User Message 1]
F[Previous LLM Response 1]
G[Current User Message]
end
D --> C
E --> C
F --> C
G --> C
C --> H[Token Generation]
H --> I[LLM Response Output]
style B fill:#f0f0f0,stroke:#333
style C fill:#e1f5fe,stroke:#0288d1
style D fill:#e8f5e9,stroke:#4caf50
style H fill:#fff3e0,stroke:#ff9800
There is no memory outside of this. If it's not in the context window, it doesn't exist.
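Here's a minimal sketch of a chat loop against a local OpenAI-compatible endpoint (Ollama exposes one at /v1); the base URL, API key, and model name are placeholders for whatever you're running. The important part is that the entire history list goes out on every single turn:

```python
# A minimal chat loop against a local OpenAI-compatible endpoint.
# The base_url, api_key, and model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")
history = [{"role": "system", "content": "You are a helpful cooking assistant."}]

while True:
    user_text = input("> ")
    history.append({"role": "user", "content": user_text})
    # The whole history is sent every turn; the model itself keeps nothing.
    reply = client.chat.completions.create(model="llama3.1:8b", messages=history)
    answer = reply.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    print(answer)
```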
So, how do LLMs do things on their own if they can only produce text? As far as I know (and I am still learning this), there are two strategies: they can talk to the program running the LLM, and they can talk to themselves. Between these two options, LLMs can essentially fake memory.
Function Calling
When the LLM produces a message intended to be processed by the program, that is called a tool message. Usually the tool message is set up as structured output like JSON, and contains the method and arguments that the LLM would like the program to execute on its behalf. Once the tool is executed, the result is passed back to the LLM (again, as structured text), and the LLM then produces a response back to the user, or makes other tool calls as appropriate.
sequenceDiagram
participant User
participant LLM
participant Environment
User->>LLM: Task Input
Note over LLM: Analyze task and<br/>determine tool needed
LLM->>Environment: Tool Call {function, args}
Note over Environment: Execute function<br/>with arguments
Environment->>LLM: Tool Result
Note over LLM: Process result and<br/>formulate response
LLM->>User: Final Response
This approach is called function calling, and it usually requires the LLM to be sufficiently advanced (at least 3 billion parameters) to understand what tools it has, and how and when to call them appropriately. There are some LLMs, such as Hermes 3, that explicitly tout their ability to do function calling reliably, and there is even a function calling leaderboard.
How does the LLM know that it has tools available? It is told explicitly, in English, what each tool is and how it works.
Here's an example of function calling to ask what's in the kitchen:
You are a function calling AI model. You are provided with function signatures within <tools></tools> XML tags. You may call one or more functions to assist with the user query. Don't make assumptions about what values to plug into functions. Here are the available tools:
<tools>
{"type": "function", "function":
"name": "get_kitchen_inventory"
"description":"get_kitchen_inventory(category: str) -> dict
- Get current inventory data for kitchen items by category using local database.
Args:
category (str): The category to query ('ingredients', 'appliances', or 'all').
Returns:
dict: A dictionary containing inventory data.
Keys:
- 'ingredients': List of available ingredients with:
- 'name': Name of ingredient
- 'amount': Quantity available
- 'unit': Unit of measurement
- 'expiry_date': Expiration date if applicable
- 'appliances': List of available cooking equipment with:
- 'name': Name of appliance
- 'type': Type of appliance (e.g., 'cooker', 'processor')
- 'max_capacity': Maximum capacity if applicable
- 'features': List of special features
- 'last_updated': Timestamp of last inventory update",
"parameters": {"type": "object", "properties": {"category": {"type": "string", "enum": ["ingredients", "appliances", "all"]}}, "required": ["category"]}}
</tools>
Use the following pydantic model json schema for each tool call you will make:
{"properties": {"arguments": {"title": "Arguments", "type": "object"}, "name": {"title": "Name", "type": "string"}}, "required": ["arguments", "name"], "title": "FunctionCall", "type": "object"}
For each function call return a json object with function name and arguments within <tool_call></tool_call> XML tags as follows:
<tool_call>
{"arguments": {"category": "all"}, "name": "get_kitchen_inventory"}
</tool_call>
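For the other half of the exchange, here's a rough sketch of what the program does with that output: find the <tool_call>, run the matching Python function, and hand the result back as text. The inventory data here is made up, and in practice a framework handles this plumbing for you:

```python
# Sketch of the program side: extract the JSON from the <tool_call> tags,
# run the matching function, and return the result as text for the LLM.
# The inventory data is hypothetical.
import json
import re

def get_kitchen_inventory(category: str) -> dict:
    inventory = {
        "ingredients": [
            {"name": "broccoli", "amount": 2, "unit": "heads", "expiry_date": "2025-01-20"},
        ],
        "appliances": [
            {"name": "rice cooker", "type": "cooker", "max_capacity": "5 cups", "features": ["timer"]},
        ],
    }
    return inventory if category == "all" else {category: inventory[category]}

TOOLS = {"get_kitchen_inventory": get_kitchen_inventory}

def handle_llm_output(text: str) -> str | None:
    """Return a tool result to feed back to the LLM, or None if there's no tool call."""
    match = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", text, re.DOTALL)
    if not match:
        return None  # plain response for the user, no tool call to run
    call = json.loads(match.group(1))
    result = TOOLS[call["name"]](**call["arguments"])
    return json.dumps(result)  # goes back into the context window as a tool message
```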
It is very strange that we are using English to tell a program how to talk to another program, but again, the LLM has no way to understand anything that isn't convertible into tokens. If you are relying on the LLM to make contextually appropriate calls for you, this is basically it.
Function calls are not free. They add to the amount of work that the LLM has to do, and they set a floor on how simple and stupid the LLM can be. Function calls are important and necessary because they allow the LLM to manage its context window. By calling a function and adding the result to its context window, the LLM can know things that you haven't explicitly told it. This is key for retrieval augmented generation (RAG), which applies function calling to databases and search engines, typically using similarity searches built on embedding models and vector databases.
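As a bare-bones illustration of the retrieval half of RAG, here's a similarity search over a few kitchen notes using the sentence-transformers library; the model name and the notes are just examples, and a real setup would use a vector database instead of a Python list:

```python
# A tiny retrieval example, assuming sentence-transformers is installed.
# The model name and notes are placeholders; real RAG uses a vector database.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
notes = [
    "Brown rice takes about 45 minutes in the rice cooker.",
    "Steam broccoli in the microwave for 3 minutes with a splash of water.",
    "The blender struggles with whole ice cubes.",
]
note_vectors = model.encode(notes, normalize_embeddings=True)

query = "How long does rice take?"
query_vector = model.encode([query], normalize_embeddings=True)[0]
scores = note_vectors @ query_vector      # cosine similarity (vectors are normalized)
best = notes[int(np.argmax(scores))]
print(best)  # this snippet is what gets added to the LLM's context window
```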
Self Calling
Because the LLM has the ability to call tools, we can create tools that enable it to talk to itself and leave notes.
For example, we can create a tool that says "store some relevant information about this chat", and when it's called, the program puts that information in a datastore (a database or memory) and returns success. Later, the LLM can query for that information in another chat and recall it.
This is basically the Memento strategy applied to LLMs: you wake up every day as an amnesiac, and you work out who you are and what you're supposed to do from the notes to yourself.
Now that we have function calling and self messaging, we essentially have a basic form of memory.
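The simplest possible version of this is a pair of tools backed by a small database. This is just a sketch of the concept (crude keyword matching and all), not how any particular framework implements it:

```python
# Two "leave yourself a note" tools backed by a SQLite table.
# A sketch of the concept, not how Letta/MemGPT actually implements memory.
import sqlite3

db = sqlite3.connect("memories.db")
db.execute("CREATE TABLE IF NOT EXISTS memories (note TEXT)")

def store_memory(note: str) -> str:
    """Tool: save a fact worth remembering (e.g. 'prefers microwave steaming')."""
    db.execute("INSERT INTO memories (note) VALUES (?)", (note,))
    db.commit()
    return "stored"

def search_memory(keyword: str) -> list[str]:
    """Tool: look up previously stored notes that mention a keyword."""
    rows = db.execute(
        "SELECT note FROM memories WHERE note LIKE ?", (f"%{keyword}%",)
    ).fetchall()
    return [row[0] for row in rows]
```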
sequenceDiagram
participant User
participant LLM
participant Memory
Note over Memory: Stores previous<br/>interactions
User->>LLM: Input Message
LLM->>Memory: Retrieve relevant context
Memory->>LLM: Return context
Note over LLM: Process input with<br/>retrieved context
LLM->>User: Response
LLM->>Memory: Store interaction
However, memory is more complex than that, and there are different forms of memory needed for it to work effectively.
MemGPT
The full suite of memory is detailed in the MemGPT paper, which describes four different kinds of memory at different levels of priority: core, message, archival, and recall.
sequenceDiagram
participant User
participant MemGPT
participant Core Memory
participant Message Memory
participant Archival Memory
participant Recall Memory
Note over Core Memory: System Prompt<br/>Persona<br/>Core Context
Note over Message Memory: Recent Messages<br/>(Fixed Size Buffer)
Note over Archival Memory: Long-term Storage<br/>(Vector Database)
Note over Recall Memory: Dynamic Memory<br/>(Search Results)
User->>MemGPT: Input Message
MemGPT->>Message Memory: Check context window space
alt Context window full
MemGPT->>Archival Memory: Store oldest messages
Message Memory->>Message Memory: Free up space
end
MemGPT->>Core Memory: Load core context
MemGPT->>Message Memory: Load recent messages
MemGPT->>Archival Memory: Search relevant memories
Archival Memory->>Recall Memory: Load relevant memories
Note over MemGPT: Process all memory<br/>streams together
MemGPT->>User: Generate Response
MemGPT->>Message Memory: Store interaction
alt Memory management needed
MemGPT->>Archival Memory: Archive important information
MemGPT->>Message Memory: Clean up old messages
end
Although there's a lot here, MemGPT has some interesting features in how long-term memories are consolidated. Archival memory is unbounded, but an LLM can only fit so much in its context window and needs to consolidate relevant information, so at periodic intervals MemGPT summarizes and prioritizes what is relevant. One nice detail is that because everything is text, you don't have to guess what MemGPT remembers: you can just go read the old memories in the database, which keep the general structure of a conversation even if the details have been summarized away.
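To make the consolidation step concrete, here's a rough sketch of what summarize-and-archive could look like, reusing the same kind of OpenAI-compatible client as before; this is a paraphrase of the idea in the paper, not Letta's actual code:

```python
# Sketch of periodic consolidation: when the message history gets long,
# summarize the oldest messages, stash the summary in an archive, and keep
# only the recent messages verbatim. A paraphrase of the idea, not Letta's code.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

def consolidate(history: list[dict], archive: list[str], keep_last: int = 10) -> list[dict]:
    if len(history) <= keep_last:
        return history
    old, recent = history[:-keep_last], history[-keep_last:]
    summary = client.chat.completions.create(
        model="llama3.1:8b",  # placeholder model name
        messages=[{
            "role": "user",
            "content": "Summarize the key facts worth remembering from this conversation:\n"
                       + "\n".join(m["content"] for m in old),
        }],
    ).choices[0].message.content
    archive.append(summary)  # stands in for archival memory; searchable later
    # Recent messages stay verbatim; the rest is replaced by the summary.
    return [{"role": "system", "content": "Earlier conversation summary: " + summary}] + recent
```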
Adding Memory
The implementation of MemGPT is called Letta. It's open source and can run locally. It's easy enough to install using the Docker quickstart, but I didn't want to have the PostgreSQL database inside the same VM as Letta, because I have a tendency to type vagrant destroy when upgrading and futzing with things.
Here's the Letta playbook and the pgvector playbook. Note that in order to use their Cloud API Development platform, you must expose your Letta endpoint for public access, so you must use the SECURE flag.
Once I had Letta installed, I installed the Letta pipe and created an agent – and I had an LLM with memory!
There's very little I can do with this agent right now, but the next step is to give it recipe information and have it work through cooking with me step by step.