This is Part III in the RAG series. Find the previous posts here:
Part I - Retrieval Augmented Generation (RAG) - I
Part II - RAG - II
I recently ran a poll asking what content readers would find interesting, and 100% of responses from the (albeit small) reader base voted for more freelancer spotlights and their projects. So those are coming soon :)
Meanwhile, I can share one of my own projects, which also ties in nicely with how I ended RAG Part II: a promise to share my experience building RAG chatbots for clients.
The overarching theme of this post: everyone is building chat-with-data chatbots, but few are building the debugging and evaluation systems needed to monitor them, and that makes these systems hard to deploy to production.
It is extremely easy to build a basic RAG pipeline using the OpenAI Embeddings and Chat Completion APIs, a vector store like Pinecone, and open-source frameworks like LangChain and LlamaIndex, which provide starter modules to set this up within an hour.
For example, you can set up a RAG system in 4 lines of code (!) using LangChain abstractions like WebBaseLoader and VectorstoreIndexCreator.
from langchain.document_loaders import WebBaseLoader
from langchain.indexes import VectorstoreIndexCreator
loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
index = VectorstoreIndexCreator().from_loaders([loader])
That's it. The vector index can now be queried.
index.query("What is Task Decomposition?")
Now comes the hard part: is this thing ready for production?
And as for common practice here, as Simon Willison puts it, most people's evaluation strategy is based on "vibes" :) Vibes are great, and an essential part of testing, but they don't scale.
Anyone building RAG systems needs to figure out a robust evaluation strategy. I get the feeling that most people's evaluation strategies (including my own) are based mainly on "vibes" - Simon Willison
I am currently working on building exactly this.
What should this tool look like?
Working backwards from the common themes in client feedback is a good starting point to dive into this.
Some useful context: I am part of a project building RAG chatbots for a research center that offers several video-based courses to paying users. They have several course modules, each an hour-long video, so making this content searchable would be very useful. RAG-powered chatbots, where users can ask questions on specific topics to find the relevant module, make sense here.
In testing these chatbots, most client feedback falls under the following themes:
"Most responses are great, but these few questions give incorrect/partial answers."
"The cited sources for this question do not include this module."
"The tone in this response is too ChatGPT-like and does not sound like our content."
"The bot is misunderstanding this phrase as regular English words rather than our internal course terminology."
This is exactly the last-mile problem with baseline RAG systems. They mostly work great, but certain edge cases are extremely difficult to debug and prevent the chatbot from being production-ready.
So where do you start investigating? My current work is building out a framework to systematize and potentially automate parts of this investigation. Here is my current process.
Check that source chunks were retrieved and the reply isn't a hallucination
Retrieving relevant chunks of text is critical to response quality. So the first thing I check is that at least one source chunk was found relevant to the user query, so the LLM had some context to work with.
If the "tone seems too ChatGPT-like", most likely none of the chunks met the similarity threshold to qualify for response generation, and ChatGPT took over completely.
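Here's a minimal sketch of that check, assuming the index built in the snippet above and LangChain's similarity_search_with_relevance_scores (scores normalized to 0-1); the 0.75 threshold is a made-up starting point you would tune per embedding model and vector store.
RELEVANCE_THRESHOLD = 0.75  # illustrative value, not a recommendation

def check_retrieval(query, vectorstore, k=4):
    # scores are normalized to [0, 1], higher = more similar
    docs_and_scores = vectorstore.similarity_search_with_relevance_scores(query, k=k)
    relevant = [(doc, score) for doc, score in docs_and_scores if score >= RELEVANCE_THRESHOLD]
    if not relevant:
        # nothing cleared the bar: the answer is likely "pure ChatGPT", flag it for review
        print(f"WARNING: no chunks above {RELEVANCE_THRESHOLD} for query: {query!r}")
    return relevant

check_retrieval("What is Task Decomposition?", index.vectorstore)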
If the retrieved context does seem relevant to the user query, but the response is still missing information, dig into why that information didn't make it into the response.
This might entail tweaking retrieval params like the number of chunks returned from the vector store, or the chunk size: chunks that are too big or too small can cause the desired chunk to rank lower on similarity.
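To make those two knobs concrete, here's a sketch reusing the loader from the 4-line setup above; the specific numbers are illustrative, not recommendations.
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.indexes import VectorstoreIndexCreator

# knob 1: chunk size and overlap, set at ingestion time
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
index = VectorstoreIndexCreator(text_splitter=splitter).from_loaders([loader])

# knob 2: number of chunks handed to the LLM, set at query time
retriever = index.vectorstore.as_retriever(search_kwargs={"k": 8})
qa = RetrievalQA.from_chain_type(llm=ChatOpenAI(temperature=0), retriever=retriever)
qa.run("What is Task Decomposition?")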
If the retrieved context is not directly pertinent to the user query, check if the semantic matching over-indexed on the wrong tokens
for example, if the user query asks about a specific piece of content in a document named DocX, but the semantic search returns text chunks that merely contain the word DocX, the search has worked as designed but did not meet the user's search intent.
here, a hybrid search system would work better than vanilla semantic search: an entity-recognition step first identifies DocX as the document title, the document pool is then filtered to only DocX, followed by semantic search within DocX.
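A rough sketch of that hybrid flow, under a few assumptions: each chunk carries its document title in metadata (here under a "source" key), the catalogue of titles is known up front, and the metadata-filter syntax shown is the Chroma/Pinecone style (it varies by vector store). DocX/DocY/DocZ are hypothetical titles.
KNOWN_TITLES = ["DocX", "DocY", "DocZ"]  # hypothetical catalogue of document titles

def hybrid_search(query, vectorstore, k=4):
    # step 1: a cheap "entity recognition" pass - look for a known document title in the query
    mentioned = [t for t in KNOWN_TITLES if t.lower() in query.lower()]
    if mentioned:
        # step 2: restrict the pool to that document, then run semantic search within it
        return vectorstore.similarity_search(query, k=k, filter={"source": mentioned[0]})
    # fallback: vanilla semantic search over the whole corpus
    return vectorstore.similarity_search(query, k=k)

hybrid_search("What does DocX say about task decomposition?", index.vectorstore)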
if the quality still doesn't improve, you might need to re-ingest your documents with some creative enhancements. Some examples I've encountered include (a small sketch follows this list):
generating a list of questions pertinent to each chunk and embedding them jointly with the chunk to add more context (the doc2query approach)
keeping a pool of edge cases whose answers are served directly, bypassing the retrieval step
adding metadata to the text in each chunk, e.g. section/chapter headings and document-header information.
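Here's a sketch of what the first and third of these could look like, reusing the loader and splitter from the earlier snippets; the prompt and the 'section' metadata field are assumptions for illustration, not something LangChain gives you out of the box.
from langchain.chat_models import ChatOpenAI
from langchain.schema import Document

llm = ChatOpenAI(temperature=0)

def enrich_chunk(doc: Document) -> Document:
    # doc2query-style: generate a few questions this chunk can answer
    questions = llm.predict(
        "Write 3 short questions that the following course excerpt answers:\n\n" + doc.page_content
    )
    # prepend section-heading metadata (assumes a 'section' key was set at ingestion)
    heading = doc.metadata.get("section", "")
    enriched = f"{heading}\n{questions}\n{doc.page_content}".strip()
    return Document(page_content=enriched, metadata=doc.metadata)

enriched_docs = [enrich_chunk(d) for d in splitter.split_documents(loader.load())]
# enriched_docs can now be embedded and written to the vector store in place of the raw chunks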
And finally, there is the big challenge that looms over all of the above, and frankly what makes working with LLMs both fun and frustrating: after you have tweaked a few things and your responses are finally pulling in more context, how do you do this at scale?
Anyway, let's not dRAG this any further ;)