Step by Step RAG

This is a technical post about implementing retrieval augmented generation (RAG) for LLMs. It's intentionally high level and doesn't get into all the details and possibilities.

The Problem

Using Letta for answering questions and walking me through cooking has been great. However, my experience using Claude Sonnet and other LLMs for general information retrieval has been not-so-great.

I am consistently getting technical answers that are out of date at best and hallucinated at worst. This can range from getting various parameters and environment variables wrong to inventing entirely new frameworks (complete with code samples!) on the fly. To avoid this, the LLM needs external data to ground it.

For most people, using a solution like NotebookLM, RagFlow, R2R, or kotaemon is perfectly sufficient, and I have not gone through all the possible options on RagHUB. But I have a specific problem: I want all the AWS documentation to be available and citable by Letta. All of it.

Starting with RAG

Retrieval augmented generation (RAG) happens when an LLM retrieves external data and incorporates it into its response. Although RAG is commonly associated with databases and document stores, the retrieval can come from any tool that returns additional data to be folded into the response. For example, Letta uses Tavily to search the web when it doesn't have the answer, and has used it to look up knife skills and expiration dates on food.

Got a tool that runs ripgrep against your GitHub repo? It's RAG. Got a tool that pings n8n to query Google Sheets? Also RAG.
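
For instance, a minimal sketch of the ripgrep case might look like this (the function name and repo path are made up):

import subprocess

def search_repo(pattern: str) -> str:
    """Return matching lines from a local checkout: retrieval, no embeddings required."""
    result = subprocess.run(
        ["rg", "--line-number", "--max-count", "5", pattern, "/path/to/repo"],
        capture_output=True,
        text=True,
    )
    return result.stdout or "No matches found."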

So in this case, we want to make a tool available to Letta that does a semantic search on AWS documentation. We don't tell Letta how this happens: we want Letta focused on the domain and we also want to keep the tool calling as simple as possible, as too much detail can confuse the LLM.

Here's what the tool looks like:

import requests

def query_aws_documents(question: str):
    """
    Parameters
    ----------
    question: str
        The question to ask the AWS documentation.

    Returns
    -------
    answer: str
        The answer to the question.
    """

    response = requests.post(
        "http://devserver:1416/query_aws_docs/run",
        json={"question": question},
    )
    return response.json()["result"]

That's the whole thing. Keep it simple, keep it focused, keep it isolated. This approach means that if I want to set up an MCP server with Claude Desktop or Cline, I can do so without any problems.
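
Before pointing Letta at it, the endpoint can be smoke-tested directly (the question is just an example):

import requests

response = requests.post(
    "http://devserver:1416/query_aws_docs/run",
    json={"question": "How do I configure a lifecycle rule on an S3 bucket?"},
    timeout=60,
)
print(response.json()["result"])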

Querying the AWS Documentation

Now we have the question of what is on the other end of that HTTP request. There is the option of using a remote RAG system like Vectorize or Vectara, but I want the full end-to-end RAG experience, so I'm using HayHooks, a REST API server for Haystack. Haystack has been good to work with: the documentation is solid, the design is logical, and it has excellent logging options.

HayHooks has a query pipeline that puts together the components needed for retrieval. The example they give uses Elasticsearch, but you can start off with a pipeline template.

The implementation of the pipeline is hidden from the tool. I could be directly pasting a document into Gemini 2.0 Flash Lite and leveraging the 1M token context window, and it would still be RAG as long as it was the pipeline doing it. Unfortunately, the AWS documentation is too large to fit in a single context window, so we need to start looking at more complex solutions. This is where we start getting into what people traditionally think of as RAG.

In situations where the document store is too large to fit into the context window, RAG has to return the "best" answer, typically using a hybrid approach that combines keyword search with semantic search implemented through embeddings. There is a huge amount of change going on, and just reviewing 2024 is enough to realize that no matter what solution you use, it will be out of date in a year.

I'm starting simple. Here's the pipeline wrapper I'm using:

from haystack import Pipeline
from hayhooks import BasePipelineWrapper, log

class PipelineWrapper(BasePipelineWrapper):
    def create_pipeline(self) -> Pipeline:
        # The get_* / create_* helpers are small factory functions defined elsewhere in the module.
        text_embedder = get_text_embedder()
        retriever = get_retriever()
        prompt_builder = get_chat_prompt_builder()
        chat_generator = create_chat_generator()
        
        query_pipeline = Pipeline()
        query_pipeline.add_component("embedder", text_embedder)
        query_pipeline.add_component("retriever", retriever)
        query_pipeline.add_component("prompt_builder", prompt_builder)
        query_pipeline.add_component("llm", chat_generator)
        query_pipeline.connect("embedder.embedding", "retriever.query_embedding")
        query_pipeline.connect("retriever", "prompt_builder")
        query_pipeline.connect("prompt_builder.prompt", "llm.messages")
        return query_pipeline

    def setup(self) -> None:    
        self.pipeline = self.create_pipeline()

    def run_api(self, question: str) -> str:
        log.trace(f"Running pipeline with question: {question}")
        result = self.pipeline.run(
            {
                "embedder": {"text": question},
                "prompt_builder": {"question": question},
            }
        )

        return result["llm"]["replies"][0].text

There are just four things in the pipeline: the embedder, the retriever, the prompt builder, and the LLM. The embedder takes the question and turns it into an embedding. The retriever takes the embedding and returns a list of documents. The prompt builder takes the question and the documents and turns them into a prompt. And finally, the LLM takes the prompt and returns a response.
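
The factory helpers are deliberately boring. Here's one way they could be wired up, using an in-memory store, a sentence-transformers embedder, and an OpenAI chat model as stand-ins (swap in whatever store and models you're actually running):

from haystack.components.builders import ChatPromptBuilder
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.dataclasses import ChatMessage
from haystack.document_stores.in_memory import InMemoryDocumentStore

TEMPLATE = """Answer the question using only the documents below.

{% for doc in documents %}
{{ doc.content }}
{% endfor %}

Question: {{ question }}"""

def get_text_embedder():
    # Must match the model used by the indexing pipeline's document embedder.
    return SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")

def get_retriever():
    # In practice this is the same persistent store the indexing pipeline writes to.
    return InMemoryEmbeddingRetriever(document_store=InMemoryDocumentStore(), top_k=5)

def get_chat_prompt_builder():
    return ChatPromptBuilder(template=[ChatMessage.from_user(TEMPLATE)])

def create_chat_generator():
    # Reads OPENAI_API_KEY from the environment.
    return OpenAIChatGenerator(model="gpt-4o-mini")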

Fortunately, you don't need to imagine what the pipeline looks like. HayHooks has a Swagger UI that has a draw route.

(Diagram: the query pipeline, as rendered by the HayHooks draw route.)

In my case, I'm using the following:

There are lots of options for choosing the right embedder, but honestly, if I had to do more with this, I would probably go with the hybrid approach and pick OpenSearchEmbeddingRetriever and OpenSearchBM25Retriever before messing with embedding models.
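
If I did go hybrid, the retrieval half would look roughly like this, merging BM25 and embedding results with reciprocal rank fusion (hosts, model, and settings are placeholders):

from haystack import Pipeline
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.joiners import DocumentJoiner
from haystack_integrations.components.retrievers.opensearch import (
    OpenSearchBM25Retriever,
    OpenSearchEmbeddingRetriever,
)
from haystack_integrations.document_stores.opensearch import OpenSearchDocumentStore

document_store = OpenSearchDocumentStore(hosts="http://localhost:9200")

hybrid = Pipeline()
hybrid.add_component("embedder", SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2"))
hybrid.add_component("bm25_retriever", OpenSearchBM25Retriever(document_store=document_store))
hybrid.add_component("embedding_retriever", OpenSearchEmbeddingRetriever(document_store=document_store))
hybrid.add_component("joiner", DocumentJoiner(join_mode="reciprocal_rank_fusion"))

hybrid.connect("embedder.embedding", "embedding_retriever.query_embedding")
hybrid.connect("bm25_retriever", "joiner")
hybrid.connect("embedding_retriever", "joiner")

# The joined documents then feed the prompt builder exactly as before:
# hybrid.run({"embedder": {"text": question}, "bm25_retriever": {"query": question}})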

Indexing the AWS Documentation

There still remains the problem of indexing the AWS documentation for search. This is where HayHooks comes in with an indexing pipeline.

We assume for the purposes of the indexing pipeline that the documents are already downloaded and converted to Markdown. Using HayHooks, we can upload files from the command line or call the REST API directly.
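
Calling the REST API directly is just a multipart POST against the pipeline's run route. A rough sketch, assuming the pipeline is deployed under the name indexing and that HayHooks picks the uploads up from a files form field:

import requests
from pathlib import Path

docs = Path("files_to_index/markdown_docs/s3").glob("*.md")
files = [("files", (p.name, p.open("rb"), "text/markdown")) for p in docs]

response = requests.post("http://devserver:1416/indexing/run", files=files)
print(response.json())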

Again, the pipeline is conceptually very simple:

from typing import List, Optional

from fastapi import UploadFile
from haystack import Pipeline
from hayhooks import BasePipelineWrapper

class PipelineWrapper(BasePipelineWrapper):
    def setup(self) -> None:
        document_store = create_document_store()
        markdown_converter = create_markdown_converter()
        document_cleaner = create_document_cleaner()
        document_splitter = create_document_splitter()
        document_embedder = create_document_embedder()
        document_writer = create_document_writer(document_store)

        pipe = Pipeline()
        pipe.add_component(instance=markdown_converter, name="markdown_converter")
        pipe.add_component(instance=document_cleaner, name="document_cleaner")
        pipe.add_component(instance=document_splitter, name="document_splitter")
        pipe.add_component(instance=document_embedder, name="document_embedder")
        pipe.add_component(instance=document_writer, name="document_writer")

        pipe.connect("markdown_converter", "document_cleaner")
        pipe.connect("document_cleaner", "document_splitter")
        pipe.connect("document_splitter", "document_embedder")
        pipe.connect("document_embedder", "document_writer")

        self.pipeline = pipe

    def run_api(self, files: Optional[List[UploadFile]] = None) -> dict:
        # elided for brevity
        ...

As far as the pipeline implementation goes, the factory helpers just wrap standard Haystack components.
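
As a rough sketch, assuming sentence-transformers embeddings and the standard preprocessors (the models and split settings here are illustrative, not necessarily what's deployed):

from haystack.components.converters import MarkdownToDocument
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.document_stores.types import DuplicatePolicy

def create_document_store():
    # In practice this should be a persistent store shared with the query pipeline.
    return InMemoryDocumentStore()

def create_markdown_converter():
    return MarkdownToDocument()

def create_document_cleaner():
    return DocumentCleaner()

def create_document_splitter():
    # Chunk size is a guess; tune it against the retrieval results.
    return DocumentSplitter(split_by="word", split_length=200, split_overlap=20)

def create_document_embedder():
    # Must be the same model as the query pipeline's text embedder.
    return SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")

def create_document_writer(document_store):
    return DocumentWriter(document_store=document_store, policy=DuplicatePolicy.OVERWRITE)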

And then, because FastAPI can only handle uploading 1000 files at once, I have a script that uploads them one product directory at a time:

#!/usr/bin/env bash

INDEX_DIR="$(dirname "$0")/files_to_index/markdown_docs"
MAX_FILES=1000

for PRODUCT_DIR in "$INDEX_DIR"/*; do
    if [ -d "$PRODUCT_DIR" ]; then
        # Count number of files in directory
        FILE_COUNT=$(find "$PRODUCT_DIR" -type f | wc -l)
        
        if [ "$FILE_COUNT" -gt "$MAX_FILES" ]; then
            echo "Warning: $PRODUCT_DIR contains $FILE_COUNT files. HayHooks cannot process more than $MAX_FILES files."
            echo "Skipping directory..."
            continue
        fi
        
        echo "Indexing $PRODUCT_DIR ($FILE_COUNT files)"
        uv run hayhooks pipeline run indexing --dir "$PRODUCT_DIR"
    fi
done

Download and Conversion

In order to index markdown documents, there must be markdown documents. This is actually the easy part.

  • Download the AWS documentation using awsdocs, although I could have gone with Trafilatura instead.
  • Convert the HTML to Markdown using pandoc, although ReaderLM-v2 would have been cooler.
  • Store the HTML and generated Markdown in Git.

Storing the raw documents in Git allows for diffing and versioning between snapshots, and gives me something to fall back on if I've munged a conversion completely.
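
The pandoc step is just a loop over the downloaded HTML. A rough sketch (the directory layout is made up; gfm is pandoc's GitHub-flavoured Markdown writer):

import subprocess
from pathlib import Path

html_root = Path("aws-docs/html")
md_root = Path("aws-docs/markdown")

for html_file in html_root.rglob("*.html"):
    md_file = md_root / html_file.relative_to(html_root).with_suffix(".md")
    md_file.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["pandoc", str(html_file), "-f", "html", "-t", "gfm", "-o", str(md_file)],
        check=True,
    )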

I was not aware that there are LLM-specific web crawlers like Firecrawl. I feel a bit odd about it, because I know that Anubis is only necessary because of the massive number of badly behaved crawlers that turn on "stealth mode" specifically to get around anti-bot measures. It's relatively easy to run with Docker Compose or self-host, so I went ahead and did that.

At some point, I suspect that there's just going to be a packaged markdown dump of websites so that crawlers can just download everything at once.

RAG Evaluation

After putting together a RAG pipeline, you're supposed to evaluate it. I have the good fortune of a pre-existing dataset of AWS questions and answers. I'm not sure I really want to go that far, but I may pick out a few of them just to spot-check.
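
If I do, the spot check can be as simple as replaying a handful of questions against the endpoint and eyeballing the answers (the questions here are placeholders):

import requests

spot_checks = [
    "What is the maximum size of a single S3 object?",
    "How do I enable termination protection on an EC2 instance?",
]

for question in spot_checks:
    answer = requests.post(
        "http://devserver:1416/query_aws_docs/run",
        json={"question": question},
        timeout=60,
    ).json()["result"]
    print(f"Q: {question}\nA: {answer}\n")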

Open WebUI Integration

One nice thing about HayHooks is that it's OpenAI-compatible. If I don't want to use Letta and just want a streaming response ASAP, I can implement run_chat_completion and point Open WebUI at the OpenAI-compatible endpoint to chat with the pipeline directly.

Just add this to your pipeline:

from typing import Generator, List, Union

from hayhooks import get_last_user_message, log, streaming_generator

    def run_chat_completion(
        self, model: str, messages: List[dict], body: dict
    ) -> Union[str, Generator]:
        log.trace(f"Running pipeline with model: {model}, messages: {messages}, body: {body}")

        question = get_last_user_message(messages)
        log.trace(f"Question: {question}")

        # Streaming pipeline run, will return a generator.
        # The run args must match your own pipeline's inputs; the "fetcher"/"prompt"
        # arguments and URLS here are from a different example pipeline, not the
        # query pipeline above.
        return streaming_generator(
            pipeline=self.pipeline,
            pipeline_run_args={"fetcher": {"urls": URLS}, "prompt": {"query": question}},
        )

There's a Docker Compose setup you can run that comes pre-configured with Open WebUI, if you just want to try it out.

DevDocs

I found out about it too late, but I could have also used DevDocs as part of a RAG solution. The MCP server only searches by keyword, but that might not actually matter.

Given the existence of Gemini 2.0 Flash, I could just throw all the pages that matched into the context window and let the LLM sort it out. It's cheap, and it will get even cheaper once context caching arrives.

It would be far faster overall than running through local embeddings, cheaper than paying for remote embeddings, and the LLM would have a far more coherent understanding of the documentation than would be available from the chunks returned from embeddings. I would lose semantic search, but honestly I don't think that's a problem in technical documentation, because I usually know exactly what I'm looking for.
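
For what it's worth, that approach is only a few lines with the google-genai client. A sketch, with the model name, file layout, and question as placeholders:

from pathlib import Path
from google import genai

client = genai.Client()  # reads GEMINI_API_KEY from the environment

# Pretend these are the pages a keyword search (DevDocs, ripgrep, whatever) matched.
matched_pages = list(Path("aws-docs/markdown/s3").glob("*.md"))
context = "\n\n".join(p.read_text() for p in matched_pages)

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=f"{context}\n\nQuestion: How do lifecycle rules interact with versioning?",
)
print(response.text)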

Future Directions

Now that I know how to implement RAG, there's really nothing stopping me from doing more of it. Much of what I want to do next is about learning more effectively.

There are a lot of other things I could do:

  • Add PDFs of manuals and academic papers with pyzotero and Docling.
  • Transcribe podcasts using the audio API.
  • Chuck in all the HOWTOs, tips, and random notes that get scattered around the place.
  • Throw in a bunch of YouTube transcripts of conferences.
  • Add citations to the RAG with confidence scores and retrieval timestamps.
  • Integrate Letta's memory system into searches for context-aware retrieval.
  • Accumulate knowledge by scraping smart people's blogs and keep track of new projects and technologies without having to crawl social media or Reddit.

But more importantly, I know that RAG isn't hard. It's really just another ETL job, one you could implement in two shell scripts with Ollama and SQLite. The only bits that are strange are the embedding models, and they are all good enough not to matter too much.
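
To make that concrete, here's a back-of-the-envelope sketch of the "embed and store" half against Ollama's embeddings endpoint, with SQLite as the store (the model, schema, and chunks are placeholders; the second script would pull rows back and rank them by cosine similarity):

import json
import sqlite3
import requests

def embed(text: str) -> list[float]:
    # nomic-embed-text is one embedding model Ollama can serve locally.
    response = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": text},
    )
    return response.json()["embedding"]

db = sqlite3.connect("docs.db")
db.execute("CREATE TABLE IF NOT EXISTS chunks (id INTEGER PRIMARY KEY, text TEXT, embedding TEXT)")

for chunk in ["First chunk of some document...", "Second chunk..."]:
    db.execute(
        "INSERT INTO chunks (text, embedding) VALUES (?, ?)",
        (chunk, json.dumps(embed(chunk))),
    )
db.commit()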
