Relevant Segment Extraction (RSE)

Article Summary

Relevant segment extraction (RSE) is a method of reconstructing multi-chunk segments of contiguous text out of retrieved chunks. This step occurs after vector search (and optionally reranking), but before presenting the retrieved context to the LLM. This method ensures that nearby chunks are presented to the LLM in the order they appear in the original document. It also adds in chunks that are not marked...

Motivation

When chunking documents for RAG, choosing the right chunk size is an exercise in managing tradeoffs. Large chunks provide better context to the LLM than small chunks, but they also make it harder to precisely retrieve specific pieces of information. Some queries (like simple factoid questions) are best handled by small chunks, while other queries (like higher-level questions) require very large chunks. There are some queries that can be answered with a single sentence from the document, while there are other queries that require entire sections or chapters to properly answer. Most real-world RAG use cases face a combination of these types of queries.

What we really need is a more dynamic system that can retrieve short chunks when that's all that's needed, but can also retrieve very large chunks when required. How do we do that?

Our solution is motivated by one simple insight: relevant chunks tend to be clustered within their original documents.

Key Components

Chunk text key-value store

RSE requires being able to retrieve chunk text from a database quickly, using a doc_id and chunk_index as keys. This is because not all chunks that need to be included in a given segment will have been returned in the initial search results. Therefore some sort of key-value store may need to be used in addition to the vector database.

Method Details

Document chunking

Standard document chunking methods can be used. The only special requirement here is that documents are chunked with no overlap. This allows us to reconstruct sections of the document (i.e., segments) by concatenating chunks.

RSE optimization

After the standard chunk retrieval process is completed, which ideally includes a reranking step, the RSE process can begin. The first step is to combine the absolute relevance value (i.e., the similarity score) and the relevance rank. This provides a more robust starting point than just using the similarity score on its own or just using the rank on its own.
Then we subtract a constant threshold value (let's say 0.2) from each chunk's value, such that irrelevant chunks have a negative value (as low as -0.2), and relevant chunks have a positive value (as high as 0.8). By calculating chunk values this way we can define segment value as just the sum of the chunk values.

For example, suppose chunks 0-4 in a document have the following chunk values: [-0.2, -0.2, 0.4, 0.8, -0.1]. The segment that includes only chunks 2-3 would have value 0.4+0.8=1.2.

Finding the best segments then becomes a constrained version of the maximum sum subarray problem. We use a brute force search with a few heuristics to make it efficient. This generally takes ~5-10ms.

Setup

First, some setup. You'll need a Cohere API key to run some of these cells, as we use their excellent reranker to calculate relevance scores.

```python
import os
import numpy as np
from typing import List, Optional
from scipy.stats import beta
import matplotlib.pyplot as plt
import cohere
from dotenv import load_dotenv

# Load environment variables from a .env file
load_dotenv()
os.environ["CO_API_KEY"] = os.getenv('CO_API_KEY')  # Cohere API key
```

We define a few helper functions. We'll use the Cohere Rerank API to calculate relevance values for our chunks. Normally, we'd start with a vector and/or keyword search to narrow down the list of candidates, but since we're just dealing with a single document here we can just send all chunks directly to the reranker, keeping things a bit simpler.

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter


def split_into_chunks(text: str, chunk_size: int) -> List[str]:
    """
    Split a given text into chunks of specified size using RecursiveCharacterTextSplitter.

    Args:
        text (str): The input text to be split into chunks.
        chunk_size (int): The maximum size of each chunk.

    Returns:
        list[str]: A list of text chunks.

    Example:
        >>> text = "This is a sample text to be split into chunks."
        >>> chunks = split_into_chunks(text, chunk_size=10)
        >>> print(chunks)
        ['This is a', 'sample', 'text to', 'be split', 'into', 'chunks.']
    """
    # chunk_overlap=0 is required so segments can be rebuilt by concatenating chunks
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=0, length_function=len)
    docs = text_splitter.create_documents([text])
    chunks = [doc.page_content for doc in docs]
    return chunks


def transform(x: float) -> float:
    """
    Transformation function to map the absolute relevance value to a value that is
    more uniformly distributed between 0 and 1. The relevance values given by the
    Cohere reranker tend to be very close to 0 or 1. The beta function used here
    helps to spread out the values more uniformly.

    Args:
        x (float): The absolute relevance value returned by the Cohere reranker

    Returns:
        float: The transformed relevance value
    """
    a, b = 0.4, 0.4  # These can be adjusted to change the distribution shape
    return beta.cdf(x, a, b)


def rerank_chunks(query: str, chunks: List[str]):
    """
    Use Cohere Rerank API to rerank the search results

    Args:
        query (str): The search query
        chunks (list): List of chunks to be reranked

    Returns:
        similarity_scores (list): List of similarity scores for each chunk
        chunk_values (list): List of relevance values (fusion of rank and similarity) for each chunk
    """
    model = "rerank-english-v3.0"
    client = cohere.Client(api_key=os.environ["CO_API_KEY"])
    decay_rate = 30

    reranked_results = client.rerank(model=model, query=query, documents=chunks)
    results = reranked_results.results
    reranked_indices = [result.index for result in results]
    reranked_similarity_scores = [result.relevance_score for result in results]  # in order of reranked_indices

    # convert back to order of original documents and calculate the chunk values
    similarity_scores = [0] * len(chunks)
    chunk_values = [0] * len(chunks)
    for i, index in enumerate(reranked_indices):
        absolute_relevance_value = transform(reranked_similarity_scores[i])
        similarity_scores[index] = absolute_relevance_value
        # decay the relevance value based on the rank
        chunk_values[index] = np.exp(-i / decay_rate) * absolute_relevance_value

    return similarity_scores, chunk_values


def plot_relevance_scores(chunk_values: List[float], start_index: Optional[int] = None, end_index: Optional[int] = None) -> None:
    """
    Visualize the relevance scores of each chunk in the document to the search query

    Args:
        chunk_values (list): List of relevance values for each chunk
        start_index (int): Start index of the chunks to be plotted
        end_index (int): End index of the chunks to be plotted (not inclusive)

    Returns:
        None

    Plots:
        Scatter plot of the relevance scores of each chunk in the document to the search query
    """
    plt.figure(figsize=(12, 5))
    plt.title("Similarity of each chunk in the document to the search query")
    plt.ylim(0, 1)
    plt.xlabel("Chunk index")
    plt.ylabel("Query-chunk similarity")
    if start_index is None:
        start_index = 0
    if end_index is None:
        end_index = len(chunk_values)
    plt.scatter(range(start_index, end_index), chunk_values[start_index:end_index])
```

```python
# File path for the input document
FILE_PATH = "../data/nike_2023_annual_report.txt"

with open(FILE_PATH, 'r') as file:
    text = file.read()

chunks = split_into_chunks(text, chunk_size=800)

print(f"Split the document into {len(chunks)} chunks")
```

Output:

Split the document into 500 chunks

Visualize chunk relevance distribution across single document

```python
# Example query that requires a longer result than a single chunk
query = "Nike consolidated financial statements"

similarity_scores, chunk_values = rerank_chunks(query, chunks)
```

```python
plot_relevance_scores(chunk_values)
```

How to interpret the chunk relevance plot above

In the plot above, the x-axis represents the chunk index. The first chunk in the document has index 0, the next chunk has index 1, etc. The y-axis represents the relevance of each chunk to the query.
Viewing it this way lets us see how relevant chunks tend to be clustered in one or more sections of a document.

Note: the relevance values in this plot are actually a combination of the raw relevance value and the relevance ranks. An exponential decay function is applied to the ranks, and that is then multiplied by the raw relevance value. Using this combination provides a more robust measure of relevance than using just one or the other.

Zooming in

Now let's zoom in on that cluster of relevant chunks for a closer look.

```python
plot_relevance_scores(chunk_values, 320, 340)
```

What's interesting to note here is that only 7 of these 20 chunks have been marked as relevant by our reranker. And many of the non-relevant chunks are sandwiched between relevant chunks. Looking at the span of 323-336, exactly half of those chunks are marked as relevant and the other half are marked as not relevant.

Let's see what this part of the document contains:

```python
def print_document_segment(chunks: List[str], start_index: int, end_index: int):
    """
    Print the text content of a segment of the document

    Args:
        chunks (list): List of text chunks
        start_index (int): Start index of the segment
        end_index (int): End index of the segment (not inclusive)

    Returns:
        None

    Prints:
        The text content of the specified segment of the document
    """
    for i in range(start_index, end_index):
        print(f"\nChunk {i}")
        print(chunks[i])


print_document_segment(chunks, 320, 340)
```

We can see that the Consolidated Statement of Income starts in chunk 323, and everything up to chunk 333 contains consolidated financial statements, which is what we're looking for. So every chunk in that range is indeed relevant and necessary for our query, yet only about half of those chunks were marked as relevant by the reranker.
So in addition to providing more complete context to the LLM, by combining these clusters of relevant chunks we actually find important chunks that otherwise would have been ignored.

What can we do with these clusters of relevant chunks?

The core idea is that clusters of relevant chunks, in their original contiguous form, provide much better context to the LLM than individual chunks can. Now for the hard part: how do we actually identify these clusters? If we can calculate chunk values in such a way that the value of a segment is just the sum of the values of its constituent chunks, then finding the optimal segment is a version of the maximum subarray problem, for which a…
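
To make the segment search described above more concrete, here is a minimal sketch of a constrained brute-force search. It is an illustration only: the function name `find_best_segments`, the parameters `max_segment_length` and `overall_max_length`, and the greedy "pick the best remaining segment, then repeat" loop are assumptions made for this example, not necessarily the implementation RSE actually uses. The only details taken from the text are the threshold subtraction (0.2) and the definition of segment value as the sum of chunk values.

```python
from typing import List, Tuple


def find_best_segments(
    chunk_values: List[float],
    threshold: float = 0.2,        # value from the text; subtracted from every chunk
    max_segment_length: int = 20,  # hypothetical cap on chunks per segment
    overall_max_length: int = 30,  # hypothetical cap on chunks across all segments
) -> List[Tuple[int, int, float]]:
    """Return non-overlapping (start, end, value) segments, end exclusive."""
    # Shift values so that irrelevant chunks count against a segment.
    values = [v - threshold for v in chunk_values]
    n = len(values)
    taken = [False] * n  # chunks already assigned to a segment
    segments: List[Tuple[int, int, float]] = []
    total_length = 0

    while total_length < overall_max_length:
        best, best_value = None, 0.0
        for start in range(n):
            # Heuristic: a best segment never starts on a non-positive chunk.
            if taken[start] or values[start] <= 0:
                continue
            running = 0.0
            max_end = min(n, start + max_segment_length,
                          start + overall_max_length - total_length)
            for end in range(start + 1, max_end + 1):
                if taken[end - 1]:
                    break  # would overlap a segment chosen earlier
                running += values[end - 1]
                # Heuristic: a best segment never ends on a non-positive chunk.
                if values[end - 1] > 0 and running > best_value:
                    best, best_value = (start, end), running
        if best is None:
            break  # no remaining segment has positive value
        start, end = best
        segments.append((start, end, best_value))
        for i in range(start, end):
            taken[i] = True
        total_length += end - start
    return segments


# Raw values that become [-0.2, -0.2, 0.4, 0.8, -0.1] after subtracting 0.2
raw_values = [0.0, 0.0, 0.6, 1.0, 0.1]
print([(s, e, round(v, 2)) for s, e, v in find_best_segments(raw_values)])
# → [(2, 4, 1.2)]
```

With the example values from earlier in the article, the search returns the segment covering chunks 2-3 with value 1.2. Because a mildly negative chunk only subtracts a small amount, a segment can also bridge a non-relevant chunk that sits between two relevant ones, which is the behavior we saw in the financial statements example.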

What if the answer is contained in a single chunk?

In the case where only a single chunk, or a few isolated chunks, are relevant to the query, we don't want to create large segments out of them. We just want to return those specific chunks. RSE can handle that scenario well too. Since there are no clusters of relevant chunks, it basically reduces to standard top-k retrieval in that case. We'll leave it as an exercise to the reader to see what happens to the chunk relevance plot and…
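
As a small illustration of the isolated-chunk case, the sketch below runs a plain maximum-sum-subarray scan (Kadane's algorithm) over chunk values that have already had the threshold subtracted. This is a simplified stand-in chosen for brevity, not the constrained search RSE uses, and the function name `best_single_segment` and the sample values are hypothetical.

```python
from typing import List, Tuple


def best_single_segment(values: List[float]) -> Tuple[int, int, float]:
    """Kadane's algorithm: return (start, end, value) of the best run, end exclusive."""
    best_start, best_end, best_sum = 0, 0, 0.0
    cur_start, cur_sum = 0, 0.0
    for i, v in enumerate(values):
        if cur_sum <= 0:
            # A run with non-positive value is never worth extending; restart here.
            cur_start, cur_sum = i, v
        else:
            cur_sum += v
        if cur_sum > best_sum:
            best_start, best_end, best_sum = cur_start, i + 1, cur_sum
    return best_start, best_end, best_sum


# Two relevant chunks (indices 1 and 6) separated by four irrelevant ones.
isolated = [-0.2, 0.7, -0.2, -0.2, -0.2, -0.2, 0.6, -0.2]
print(best_single_segment(isolated))
# → (1, 2, 0.7)
```

The irrelevant chunks between the two relevant ones cost more than joining them would gain, so the best segment is just the single chunk at index 1. How wide a gap can be bridged depends on the threshold and on how high the relevant chunks score.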
