Semantic Chunks for RAG
To stay within the context window of an LLM, we usually break text into smaller pieces; this process is called chunking.
Although LLMs can generate text that is both meaningful and grammatically correct, they suffer from a problem called hallucination: they confidently generate wrong answers, making things up in a way that sounds convincing. This has been a major problem since LLMs were introduced, and it leads to incorrect, factually wrong answers. Retrieval Augmented Generation (RAG) was introduced to address it.
In RAG, we take a list of documents (or chunks of documents), encode each chunk into a numerical representation called a vector embedding, and store these embeddings in a database called a vector store. The models that encode chunks into embeddings are called encoding models or bi-encoders. These encoders are trained on large corpora of data, which makes them capable of representing a chunk of a document as a single vector embedding.
Retrieval quality depends heavily on how chunks are formed and stored in the vector store. Finding the right chunk size for a given text is, in general, a hard problem.
Retrieval can be improved with various retrieval methods, but it can also be improved with a better chunking strategy.
Different chunking methods:
- Fixed size chunking
- Recursive Chunking
- Document Specific Chunking
- Semantic Chunking
- Agentic Chunking
Fixed Size Chunking: This is the most common and straightforward approach to chunking: we simply decide the number of tokens in our chunk and, optionally, whether there should be any overlap between them. In general, we will want to keep some overlap between chunks to make sure that the semantic context doesn’t get lost between chunks. Fixed-sized chunking will be the best path in most common cases. Compared to other forms of chunking, fixed-sized chunking is computationally cheap and simple to use since it doesn’t require the use of any NLP libraries.
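As a minimal sketch of fixed-size chunking with overlap, assuming LangChain is installed (the import path and parameter values below are illustrative and may differ between library versions):

```python
# Fixed-size chunking sketch: chunks are driven purely by size, with a small
# overlap so context is not lost at chunk boundaries.
from langchain.text_splitter import CharacterTextSplitter

text = open("sample.txt").read()  # hypothetical input file

splitter = CharacterTextSplitter(
    separator=" ",      # split on whitespace so chunk size dominates
    chunk_size=500,     # target number of characters per chunk
    chunk_overlap=50,   # overlap to preserve context across boundaries
)
chunks = splitter.split_text(text)
print(len(chunks), chunks[0][:100])
```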
Recursive Chunking: Recursive chunking divides the input text into smaller chunks in a hierarchical and iterative manner using a set of separators. If the initial attempt at splitting the text doesn't produce chunks of the desired size or structure, the method recursively calls itself on the resulting chunks with a different separator or criterion until the desired chunk size or structure is achieved. This means that while the chunks aren't going to be exactly the same size, they'll still "aspire" to be of a similar size. It keeps the benefits of fixed-size chunks and overlap.
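A sketch of the same idea with LangChain's RecursiveCharacterTextSplitter (the separator list shown mirrors its usual default; exact import paths vary by version):

```python
# Recursive chunking sketch: tries larger separators first, then falls back
# to smaller ones until the resulting chunks fit the target size.
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", " ", ""],  # paragraph -> line -> word -> character
)
chunks = splitter.split_text(text)  # `text` as in the previous sketch
```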
Document Specific Chunking: It takes into consideration the structure of the document. Instead of using a set number of characters or a recursive process, it creates chunks that align with the logical sections of the document, like paragraphs or subsections. By doing this it maintains the author's organization of the content, thereby keeping the text coherent. It makes the retrieved information more relevant and useful, particularly for structured documents with clearly defined sections. It can handle formats such as Markdown, HTML, etc.
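For a structured format such as Markdown, a hedged sketch using LangChain's MarkdownHeaderTextSplitter (import path may vary; the sample document is made up):

```python
# Document-specific chunking sketch: split a Markdown string along its
# headers so chunks follow the author's own sectioning.
from langchain.text_splitter import MarkdownHeaderTextSplitter

markdown_doc = "# Intro\nSome text.\n## Details\nMore text."
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "Header 1"), ("##", "Header 2")]
)
sections = splitter.split_text(markdown_doc)  # Documents carrying header metadata
```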
Semantic Chunking: Semantic chunking considers the relationships within the text. It divides the text into meaningful, semantically complete chunks. This approach ensures the information's integrity during retrieval, leading to a more accurate and contextually appropriate outcome. It is slower than the previous chunking strategies.
Agentic Chunking: The hypothesis here is to process documents the way a human would:
- We start at the top of the document, treating the first part as a chunk.
- We continue down the document, deciding if a new sentence or piece of information belongs with the first chunk or should start a new one.
- We keep this up until we reach the end of the document.
This approach is still being tested and isn’t quite ready for the big leagues due to the time it takes to process multiple LLM calls and the cost of those calls. There’s no implementation available in public libraries just yet.
Here we will experiment with Semantic Chunking and the Recursive Retriever.
Steps for comparing the methods:
- Load the Document
- Chunk the Document using the following two methods: Semantic Chunking and the Recursive Retriever.
- Assess qualitative and quantitative improvements with RAGAS
Semantic Chunks
Semantic chunking involves taking the embeddings of every sentence in the document, comparing the similarity of all sentences with each other, and then grouping sentences with the most similar embeddings together.
By focusing on the text’s meaning and context, Semantic Chunking significantly enhances the quality of retrieval. It’s a top-notch choice when maintaining the semantic integrity of the text is vital.
The hypothesis here is that we can use embeddings of individual sentences to make more meaningful chunks. The basic idea is as follows (a minimal sketch follows the list):
- Split the document into sentences based on separators (., ?, !).
- Index each sentence by its position.
- Group: choose a buffer size and add that many sentences on either side of the selected sentence.
- Calculate the embedding distance between consecutive groups of sentences.
- Merge groups based on similarity, i.e., keep similar sentences together.
- Split where consecutive groups are not similar.
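A minimal sketch of this procedure, assuming FastEmbed and NumPy are installed; the regex sentence splitter, buffer size, and percentile threshold are illustrative choices, not part of any library API:

```python
# Semantic chunking sketch: embed sentence groups, measure the distance
# between neighbouring groups, and split where the distance spikes.
import re
import numpy as np
from fastembed import TextEmbedding

def semantic_chunks(text, buffer=1, percentile=90):
    # 1. Split into sentences on ., ?, ! (naive splitter for illustration).
    sentences = [s.strip() for s in re.split(r"(?<=[.?!])\s+", text) if s.strip()]
    if len(sentences) < 2:
        return sentences

    # 2-3. For each sentence, build a group with `buffer` sentences on either side.
    groups = [
        " ".join(sentences[max(0, i - buffer): i + buffer + 1])
        for i in range(len(sentences))
    ]

    # Embed each group (FastEmbed returns an iterator of NumPy vectors).
    model = TextEmbedding()  # default small English model
    embs = np.array(list(model.embed(groups)))
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)

    # 4. Cosine distance between consecutive groups.
    distances = 1 - np.sum(embs[:-1] * embs[1:], axis=1)

    # 5-6. Keep similar neighbours together; split where the distance
    #      exceeds a percentile threshold.
    threshold = np.percentile(distances, percentile)
    chunks, current = [], [sentences[0]]
    for i, d in enumerate(distances):
        if d > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i + 1])
    chunks.append(" ".join(current))
    return chunks
```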
Technology Stack Used
- LangChain: LangChain is an open-source framework designed to simplify the creation of applications using large language models (LLMs). It provides a standard interface for chains, lots of integrations with other tools, and end-to-end chains for common applications.
- LLM: Groq’s Language Processing Unit (LPU) is a cutting-edge technology designed to significantly enhance AI computing performance, especially for Large Language Models (LLMs). The primary goal of the Groq LPU system is to provide real-time, low-latency experiences with exceptional inference performance.
- Embedding Model: FastEmbed is a lightweight, fast Python library built for embedding generation.
- Evaluation: Ragas offers metrics tailored for evaluating each component of your RAG pipeline in isolation.
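Below is a hedged sketch of how these pieces might be wired together. Package names and APIs vary across versions of LangChain, langchain-groq, and Ragas, and the model names, dataset fields, and variables such as `docs` and `questions` are assumptions for illustration:

```python
# Sketch: FastEmbed embeddings + LangChain's experimental SemanticChunker,
# Groq as the LLM, and a Ragas evaluation pass over the results.
from langchain_community.embeddings.fastembed import FastEmbedEmbeddings
from langchain_experimental.text_splitter import SemanticChunker
from langchain_groq import ChatGroq

embeddings = FastEmbedEmbeddings(model_name="BAAI/bge-base-en-v1.5")  # assumed model
semantic_splitter = SemanticChunker(embeddings)
chunks = semantic_splitter.split_documents(docs)  # `docs` loaded earlier

# Requires GROQ_API_KEY in the environment; used downstream in the RAG chain
# that produces the generated answers evaluated below.
llm = ChatGroq(model_name="mixtral-8x7b-32768", temperature=0)  # assumed model name

# Ragas evaluation (v0.1-style API; expects question/answer/contexts/ground_truth).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness, answer_relevancy, context_precision, context_recall,
)

eval_dataset = Dataset.from_dict({
    "question": questions,            # hypothetical lists built from your test set
    "answer": generated_answers,
    "contexts": retrieved_contexts,
    "ground_truth": reference_answers,
})
scores = evaluate(eval_dataset, metrics=[faithfulness, answer_relevancy,
                                         context_precision, context_recall])
print(scores)
```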
Semantic chunking is a crucial technique in natural language processing that enhances the efficiency of information retrieval and understanding. By breaking down text into manageable pieces, or chunks, systems can better analyze and respond to user queries. This method is particularly effective when integrated with various retrieval strategies, allowing for both granular and broad searches.
System Integration
Efficient chunking aligns with system capabilities. For example:
- Full-Text Search: Use larger chunks to allow algorithms to explore broader contexts effectively. This is useful for searching books based on extensive excerpts or chapters.
- Granular Search Systems: Employ smaller chunks to precisely retrieve information relevant to user queries. For instance, if a user asks, “How do I reset my password?”, the system can retrieve a specific sentence or paragraph addressing that action directly.
Semantic Memory
Semantic memory functions similarly to how the human brain stores and retrieves knowledge. It utilizes embeddings to create a semantic memory by representing concepts or entities as vectors in a high-dimensional space. This approach allows models to learn relationships between concepts and make inferences based on the similarity or distance between vector representations. For example, the semantic memory can be trained to understand that “Word” and “Excel” are related concepts because they are both document types and Microsoft products, despite differing file formats and features.
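To make this concrete, here is a small sketch (assuming FastEmbed and NumPy are installed) that compares the embeddings of related and unrelated terms; the exact scores depend on the embedding model:

```python
# Cosine similarity between concept embeddings: related concepts
# ("Word", "Excel") should score higher than unrelated ones.
import numpy as np
from fastembed import TextEmbedding

terms = ["Microsoft Word", "Microsoft Excel", "banana bread recipe"]
model = TextEmbedding()
vecs = dict(zip(terms, model.embed(terms)))

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vecs["Microsoft Word"], vecs["Microsoft Excel"]))      # expected: higher
print(cosine(vecs["Microsoft Word"], vecs["banana bread recipe"]))  # expected: lower
```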
Embeddings in Practice
Software developers can leverage pre-trained embedding models or train their own with custom datasets. Pre-trained models are beneficial as they have been trained on extensive data and can be utilized immediately for various applications. However, custom embedding models may be necessary when dealing with specialized vocabularies or domain-specific language.
Considerations for Retrieval Methods
There are various retrieval strategies to consider:
- Similarity Search: A simple method that uses embeddings to find relevant text chunks.
- Metadata Filtering: When metadata is available, filtering data based on it before performing a similarity search can yield better results.
- Statistical Retrieval Methods: Techniques like TF-IDF and BM25 utilize term frequency and distribution to identify relevant text chunks.
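As an illustration of a statistical method, a hedged sketch using LangChain's BM25Retriever (which wraps the rank_bm25 package; both are assumed to be installed, and the sample chunks are made up):

```python
# BM25 keyword retrieval over a handful of text chunks.
from langchain_community.retrievers import BM25Retriever

chunks = [
    "To reset your password, open Settings and choose 'Reset password'.",
    "Our refund policy allows returns within 30 days.",
    "Two-factor authentication can be enabled from the security page.",
]
retriever = BM25Retriever.from_texts(chunks, k=2)
for doc in retriever.invoke("How do I reset my password?"):
    print(doc.page_content)
```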
Contextual Retrieval
Not all retrieved text chunks are taken as they are. Sometimes, it is beneficial to include more context around the actual retrieved text chunk. The actual retrieved text chunk is referred to as a “child chunk”, while the larger context it belongs to is called a “parent chunk”. Additionally, providing weights to retrieved documents can enhance relevance; for example, a time-weighted approach can help prioritize the most recent documents.
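A sketch of the parent/child idea using LangChain's ParentDocumentRetriever, assuming Chroma and FastEmbed are available (any vector store and embedding model could stand in, and `docs` is a hypothetical list of loaded Documents):

```python
# Retrieve small "child" chunks for precision, but return the larger
# "parent" chunk they belong to for context.
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings.fastembed import FastEmbedEmbeddings
from langchain_community.vectorstores import Chroma

vectorstore = Chroma(collection_name="children",
                     embedding_function=FastEmbedEmbeddings())
docstore = InMemoryStore()  # holds the full parent chunks

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    child_splitter=RecursiveCharacterTextSplitter(chunk_size=200),
    parent_splitter=RecursiveCharacterTextSplitter(chunk_size=1000),
)
retriever.add_documents(docs)  # `docs`: list of Documents loaded earlier
parents = retriever.invoke("How do I reset my password?")
```

For the time-weighted idea mentioned above, LangChain also ships a TimeWeightedVectorStoreRetriever that applies a recency decay on top of similarity scores; it could be swapped in following the same pattern.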
