Retrieval-Augmented Generation (RAG), Simply Explained

AI, But Simple Issue #62

Hello from the AI, but simple team! If you enjoy our content, consider supporting us so we can keep doing what we do.

Our newsletter is no longer sustainable to run at no cost, so we’re relying on different measures to cover operational expenses. Thanks again for reading!

Retrieval-Augmented Generation (RAG) is a technique used to reduce Large Language Model (LLM) hallucinations by letting LLMs access external or private data (chosen by the user) that was not included in their training.

Say you ask ChatGPT (or your LLM of choice): “Should I drink milk as part of my breakfast?”

It should provide a detailed response about milk’s nutritional benefits, and parts of that response will likely draw on scientific articles about this exact subject. These huge models are able to surface this information because they have “digested” millions of texts and other materials; GPT-4 is said to have been trained on 13 trillion tokens, which is roughly 10 trillion words.

Their outputs are generic responses based on patterns the model has recognized in the vast sea of data it was trained on.

But what if you ask ChatGPT or Claude: “What did the health newsletter in my email say about drinking milk in the morning?”

ChatGPT-4: “I don’t have access to your personal email or its contents, so I can’t see what your health newsletter said about drinking milk in the morning.”

The Foundation of Contextual AI

Where ChatGPT’s general knowledge couldn’t produce a response about a specific detail (the health newsletter) in your emails, an LLM with the help of RAG can fill that gap.

Put simply, RAG allows LLMs to access knowledge outside of their training data without having to be retrained on the information you want. This is done by establishing a link between the LLM and the specific context.

The RAG mechanism can be broken down into four main pieces:

  1. External Data Sources: These sources come from API connections, databases, alternate repositories, and libraries. A variety of formats are acceptable, not just .txt files or PDFs. Using a framework like LangChain or Haystack can help chunk unstructured data to ease the rest of the process.

  2. Embedding Data into Vectors: An embedding language model (e.g., one from Hugging Face) takes in the given data sources and produces vector embeddings, which are numerical representations of the data, stored in a vector database (e.g., Pinecone).

  3. Relevancy Search: By converting natural language into vectors, the model performs a faster, more efficient search based on the semantic meaning of words rather than literal keyword matching. If the user input’s vector representation is similar to a piece of information in the embedded vector database (measured using the dot product of the vectors), that information is pulled, along with any metadata stored in the original database.

  4. Augmenting Context into New Prompt: After locating the necessary data, RAG will augment the context into the prompt, similar to how we can provide a PDF to ChatGPT along with our text prompt. This new prompt will be interpreted by the LLM using both the new context and the initial user input. The goal is that the additional context will be able to inform the LLM on specific data (for example, within a company’s records) that the user may be asking about.

# new prompt that the LLM receives
"Use the following context to answer the question.

Context:
<retrieved_docs>  # augmented data

Question: <user_query>  # initial user prompt
Answer:"  # LLM response generates here
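
To make these four steps concrete, here is a minimal sketch in Python. It assumes the sentence-transformers package for the embedding step and uses a plain NumPy dot product for the relevancy search; the documents, query, and model name are illustrative stand-ins, not part of any particular RAG framework.

# Minimal RAG sketch: embed documents, retrieve by dot product, augment the prompt.
import numpy as np
from sentence_transformers import SentenceTransformer

# Step 1: external data (stand-in documents; real systems chunk PDFs, emails, API results, etc.)
documents = [
    "Health newsletter, May 12: drinking milk in the morning adds protein and calcium.",
    "Company handbook: expense reports are due on the first Friday of each month.",
]

# Step 2: embed the data into vectors
embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vectors = embedder.encode(documents, normalize_embeddings=True)

# Step 3: relevancy search via dot product between the query vector and the document vectors
query = "What did the health newsletter say about drinking milk in the morning?"
query_vector = embedder.encode([query], normalize_embeddings=True)[0]
best_doc = documents[int(np.argmax(doc_vectors @ query_vector))]

# Step 4: augment the retrieved context into the prompt that the LLM receives
prompt = (
    "Use the following context to answer the question.\n\n"
    f"Context:\n{best_doc}\n\n"
    f"Question: {query}\nAnswer:"
)
print(prompt)  # this augmented prompt is then sent to whichever LLM you are using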

A model without RAG capabilities is like a new employee who is very enthusiastic about the job. This employee possesses broad knowledge about any topic, except for specific internal knowledge of the company.

This employee can talk confidently about almost any topic but may get specific company information wrong or make it up—think model hallucinations.

With RAG, the LLM “employee” is able to continuously learn about the field and company, enhancing its productivity and its work’s reliability.

By feeding specific context to the LLM, RAG is able to reduce inaccuracies in its output, which is crucial for any field built on real-time, readily available data.

Advantages of RAG

RAG has another immediate advantage: additional control over LLM data access and better protection for sensitive information.

For example, if an LLM was trained on important company secrets, such as employees’ personal details and/or salaries, a user could try to query for this information.

Simply guiding the LLM to ignore such requests may work, but only until a malicious actor finds a way around the installed safeguards. Since the data is kept separate from the LLM’s initial knowledge base, a permissions system can be set up to give only authorized users access to certain information, as sketched below.
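
As a hedged sketch of what such a permissions layer could look like, the snippet below filters retrieved documents by role before anything reaches the prompt. The roles, documents, and metadata fields are hypothetical.

# Hypothetical permissions check applied at retrieval time, before any context reaches the LLM.
RETRIEVABLE_DOCS = [
    {"text": "Q3 marketing plan ...", "allowed_roles": {"employee", "manager"}},
    {"text": "Employee salary table ...", "allowed_roles": {"hr_admin"}},
]

def retrieve_for_user(query_hits: list[int], user_role: str) -> list[str]:
    """Return only the retrieved documents this user's role is allowed to see."""
    return [
        RETRIEVABLE_DOCS[i]["text"]
        for i in query_hits
        if user_role in RETRIEVABLE_DOCS[i]["allowed_roles"]
    ]

# A regular employee asking about salaries gets nothing sensitive back, so the
# salary table never appears in the augmented prompt.
print(retrieve_for_user(query_hits=[0, 1], user_role="employee"))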

Another advantage is that RAG keeps models current. Real-time data is, by its nature, always changing, like weather data or financial indices.

As long as the APIs and external databases update and those changes are re-embedded, the vector database stays in sync, so the model is always up-to-date!
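
One simple way to keep the embeddings in sync is a periodic job that re-embeds whatever changed since the last run and upserts it into the vector store. Everything in this sketch (the fetch function, the record IDs, the toy embedding) is a placeholder.

# Hypothetical sync loop: re-embed changed records and upsert them into the vector store.
from datetime import datetime, timezone

vector_store: dict[str, list[float]] = {}  # record_id -> embedding (stand-in for a real vector DB)
last_sync = datetime(2025, 1, 1, tzinfo=timezone.utc)

def fetch_changed_records(since: datetime) -> list[dict]:
    """Placeholder for an API/database call returning records updated after `since`."""
    return [{"id": "weather-nyc", "text": "NYC forecast: light rain, 14 C."}]

def embed(text: str) -> list[float]:
    """Placeholder embedding; swap in a real embedding model here."""
    return [float(len(text))]

for record in fetch_changed_records(last_sync):
    vector_store[record["id"]] = embed(record["text"])  # upsert keeps retrieval current

last_sync = datetime.now(timezone.utc)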

Introducing Agentic RAG

The next step in AI for LLMs seems to be incorporating agentic capabilities with RAG to form Agentic RAG.

Remember that Agentic AI infrastructure creates AI that is proactive in completing tasks and carrying out queries—read more about agentic AI in our past issue here. In implementations of Agentic RAG, the LLM’s productivity shoots upward.

An agent reasons and plans appropriate steps to carry out a user’s input, all on its own, resulting in better output quality than just a one-step prompt.

It can plan which tools to use, refine its responses using probabilistic token distributions, and leverage its “memory” (context window) and reasoning to improve strategies, iterate on itself, and recognize patterns.
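
A stripped-down version of that plan-act-observe loop is sketched below. The call_llm stub and the tool registry are hypothetical; a real agent would route these calls to an actual LLM and real tools.

# Toy agentic loop: the model decides whether to call a tool, observes the result, and iterates.
def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call that returns either a tool request or a final answer."""
    return "TOOL:search_docs:Q3 revenue strategy"

TOOLS = {
    "search_docs": lambda q: f"[retrieved passages about {q}]",  # stand-in retrieval tool
}

def run_agent(user_query: str, max_steps: int = 3) -> str:
    memory = [f"User: {user_query}"]  # the agent's 'memory' is just its growing context window
    for _ in range(max_steps):
        decision = call_llm("\n".join(memory))
        if decision.startswith("TOOL:"):  # plan: the model picked a tool
            _, tool_name, tool_arg = decision.split(":", 2)
            memory.append(f"Observation: {TOOLS[tool_name](tool_arg)}")  # act and observe
        else:
            return decision  # final answer
    return "Stopped after max_steps."

print(run_agent("What was our Q3 revenue strategy?"))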

The trade-off of these improved capabilities is that such systems are more expensive than traditional RAG and much more experimental.

Also, the more agents running in the background, accessing different APIs and data sources, the higher the chance of a misstep. This increased complexity introduces a level of uncertainty that currently keeps these systems from widescale use.

Standardizing AI's Connection to the World

While Agentic RAG represents evolution in capability, a similar challenge has emerged: standardizing how AI systems connect to data sources in the first place.

RAG gives LLMs a way to interact with external APIs, databases, and information, while the Model Context Protocol (MCP) gives developers a standardized way to integrate those data pipelines into their applications.

MCP was developed by Anthropic and released in late 2024 to serve as a "one-size-fits-all" solution, enabling different data sources to work together seamlessly with AI systems.

When connecting these data sources, developers face a major challenge: each API and database has its own unique integration requirements. Different APIs use different authentication methods, data formats, and error-handling approaches. As you add more sources, the complexity grows exponentially.

Managing all these different integration patterns becomes like trying to organize a tangled mess of cables behind your desk—except each cable has a unique set of rules to follow when plugging in.

Model Context Protocol (MCP)

Model Context Protocol eliminates these complications by creating a hub where LLM and data relationships are cleanly maintained.

MCP relies on three parts: the host, the client, and the server.

The MCP Host receives and interprets the user’s input. The MCP Host will communicate the user’s request to the MCP Server, but the MCP Client will be the one that actually carries the message.

*Note: RAG is working behind the scenes to augment the new data into a prompt for the LLM.

The MCP Client will convert the user’s input into machine-readable form and provide it to the MCP Server, which will activate the appropriate APIs and data-pulling techniques.

Each MCP Client maintains a 1:1 connection with a single MCP Server, but the MCP Host can contain many different MCP Clients. Examples of MCP Servers include ones that expose a local filesystem, a GitHub account, or a company database to the model, as in the toy illustration below.
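
The snippet below just encodes that topology in plain Python: one host holding several clients, each bound to exactly one server. The server names are made up.

# Toy illustration of the MCP topology: one host, many clients, each client tied to one server.
from dataclasses import dataclass, field

@dataclass
class MCPServer:
    name: str  # e.g., a server exposing a filesystem or a calendar API

@dataclass
class MCPClient:
    server: MCPServer  # 1:1 -- each client talks to exactly one server

@dataclass
class MCPHost:
    clients: list[MCPClient] = field(default_factory=list)  # a host can hold many clients

host = MCPHost(clients=[
    MCPClient(MCPServer("filesystem")),
    MCPClient(MCPServer("email")),
])
print([client.server.name for client in host.clients])  # ['filesystem', 'email']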

The transport layer between client and server is also important, since it's what makes MCP truly standardized.

When an MCP Client needs to communicate with a server, it packages the MCP protocol message into JSON-RPC format, which carries the client’s message to the MCP Server.

  • This JSON-RPC format carries the data itself and also includes the rules for how that data should be handled, processed, and validated.

It’s like if you sent a package in the mail that includes both the shipping box and the delivery instructions rolled into one. When the server receives this package, it knows exactly how to unpack it and process the request. On the return trip, the server's response gets packaged back into JSON-RPC format and converted into the MCP commands that the client can understand.
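
To make that "package" concrete, here is an illustrative request of the kind an MCP client sends. The envelope fields (jsonrpc, id, method, params) come from JSON-RPC 2.0; the tool name and arguments are made up for illustration.

# Illustrative JSON-RPC 2.0 request of the kind an MCP client sends to a server.
import json

request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",  # ask the server to run one of the tools it exposes
    "params": {
        "name": "search_newsletters",  # hypothetical tool on this server
        "arguments": {"query": "drinking milk in the morning"},
    },
}
print(json.dumps(request, indent=2))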

This standardization is crucial because it means any MCP client can talk to any MCP Server, regardless of what programming language they're built with or what underlying systems they connect to.

  • Without this common “language,” each integration would need custom translation code, which is exactly the kind of complexity MCP was designed to eliminate.

MCP can significantly enhance RAG systems by changing how database searches are performed. In traditional RAG setups, the system automatically searches the vector database every time a user asks a question, even if it isn’t needed.

This can prove troublesome and inefficient for complex queries, expending unnecessary API calls and lengthening wait times for responses.

MCP is smarter: it exposes the vector database through a server action, treating the database search as a strategic tool rather than an automatic reflex. Because of this, the LLM can decide when a database search would actually be helpful.

If you ask, “What's 2+2?”, the system won’t waste resources searching through company documents. But if you ask, “What was our Q3 revenue strategy?”, it will strategically choose to search the relevant databases.
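
A minimal sketch of that setup, assuming the official MCP Python SDK's FastMCP helper, might expose the vector search as a tool like this; search_vector_db is a hypothetical stand-in for a real Pinecone or pgvector lookup.

# Sketch: expose vector search as an MCP tool so the LLM calls it only when retrieval helps.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("company-knowledge")

def search_vector_db(query: str) -> list[str]:
    """Placeholder for a real vector-database lookup."""
    return [f"[passages relevant to: {query}]"]

@mcp.tool()
def search_company_docs(query: str) -> str:
    """Search internal company documents. The model invokes this only when retrieval would help."""
    return "\n".join(search_vector_db(query))

if __name__ == "__main__":
    mcp.run()  # questions like "What's 2+2?" never trigger this tool; strategy questions can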

Despite how promising it looks, MCP faces several limitations that organizations must consider.

Security remains a primary concern: MCP lacks native end-to-end encryption, so sensitive data traveling between LLMs, servers, and APIs can potentially be exposed, requiring additional security infrastructure.

The protocol also suffers from compatibility issues. Different LLMs require developers to tailor their approaches to suit each one: Claude prefers XML encodings, while GPT works better with Markdown.

Perhaps most problematically, MCP creates dependency risks, where significant changes to underlying APIs require complete reconfiguration. API developers could also alter their systems without users’ knowledge, breaking existing integrations and potentially eroding trust.

The Bottom Line

To be clear, the LLM is still the initial decision-maker, which is then enabled by MCP and RAG. RAG provides the method for introducing relevant context, while MCP serves as the standardized protocol that makes these connections possible and manageable at scale.

RAG and MCP are complementary technologies that together solve the fundamental challenge of connecting AI to real-world environments. These techniques improve the LLM’s ability to interact with those environments with greater control and efficiency.

As these technologies mature and their limitations are addressed, we're moving toward a future where AI systems can seamlessly interact with any data source or tool. The question isn't whether to choose RAG or MCP, but how to leverage both to build AI workflows that are not just intelligent—but intelligently connected.

Here’s a special thanks to our biggest supporters:

  • Sushant Waidande

  • Sai Krishna Pashapu

If you enjoy our content, consider supporting us so we can keep doing what we do. Please share this with a friend!

Want to reach 50,000+ ML engineers? Let’s work together:

Feedback, inquiries? Send us an email at [email protected].

If you like, you can also donate to our team to push out better newsletters every week!

That’s it for this week’s issue of AI, but simple. See you next week!

—AI, but simple team