Semantic Chunking for Document Processing
This code implements a semantic chunking approach for processing and retrieving information from PDF documents, first proposed by Greg Kamradt and subsequently implemented in LangChain. Unlike traditional methods that split text based on fixed character or word counts, semantic chunking aims to create more meaningful and context-aware text segments.
Motivation
Traditional text splitting methods often break documents at arbitrary points, potentially disrupting the flow of information and context. Semantic chunking addresses this issue by attempting to split text at more natural breakpoints, preserving semantic coherence within each chunk.
Key Components
- PDF processing and text extraction
- Semantic chunking using LangChain’s SemanticChunker
- Vector store creation using FAISS and OpenAI embeddings
- Retriever setup for querying the processed documents
Method Details
Document Preprocessing
- The PDF is read and converted to a string using a custom read_pdf_to_string function.
Semantic Chunking
- Utilizes LangChain’s SemanticChunker with OpenAI embeddings.
- Three breakpoint types are available:
- ‘percentile’: Splits at differences greater than the Xth percentile.
- ‘standard_deviation’: Splits at differences greater than X standard deviations.
- ‘interquartile’: Uses the interquartile distance to determine split points.
- In this implementation, the ‘percentile’ method is used with a threshold of 90; a configuration sketch for all three breakpoint types follows this list.
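As a rough illustration (not part of the implementation itself), the three breakpoint types can be configured on LangChain’s SemanticChunker as shown below; only the percentile method with a threshold of 90 is actually used here, and the amounts for the other two splitters are illustrative values.

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

# Split where the embedding distance between adjacent sentences exceeds the 90th percentile
percentile_splitter = SemanticChunker(embeddings, breakpoint_threshold_type='percentile', breakpoint_threshold_amount=90)

# Split where the distance exceeds 3 standard deviations above the mean (illustrative amount)
std_dev_splitter = SemanticChunker(embeddings, breakpoint_threshold_type='standard_deviation', breakpoint_threshold_amount=3)

# Split using the interquartile range scaled by 1.5 (illustrative amount)
iqr_splitter = SemanticChunker(embeddings, breakpoint_threshold_type='interquartile', breakpoint_threshold_amount=1.5)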
Vector Store Creation
- OpenAI embeddings are used to create vector representations of the semantic chunks.
- A FAISS vector store is created from these embeddings for efficient similarity search.
Retriever Setup
- A retriever is configured to fetch the top 2 most relevant chunks for a given query.
Key Features
- Context-Aware Splitting: Attempts to maintain semantic coherence within chunks.
- Flexible Configuration: Allows for different breakpoint types and thresholds.
- Integration with Advanced NLP Tools: Uses OpenAI embeddings for both chunking and retrieval.
Benefits of this Approach
- Improved Coherence: Chunks are more likely to contain complete thoughts or ideas.
- Better Retrieval Relevance: By preserving context, retrieval accuracy may be enhanced.
- Adaptability: The chunking method can be adjusted based on the nature of the documents and retrieval needs.
- Potential for Better Understanding: LLMs or downstream tasks may perform better with more coherent text segments.
Implementation Details
- Uses OpenAI’s embeddings for both the semantic chunking process and the final vector representations.
- Employs FAISS for creating an efficient searchable index of the chunks.
- The retriever is set up to return the top 2 most relevant chunks; as shown in the sketch after this list, this can be adjusted as needed.
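As a hedged sketch of that adjustability, the snippet below assumes the FAISS vectorstore built later in this notebook and uses standard LangChain retriever options; the values 5 and 0.75 are arbitrary examples.

# Return the top 5 chunks instead of the 2 used in this implementation
retriever_top5 = vectorstore.as_retriever(search_kwargs={"k": 5})

# Alternatively, only return chunks whose similarity score clears a threshold
retriever_thresholded = vectorstore.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"score_threshold": 0.75, "k": 5},
)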
Example Usage
The code includes a test query: “What is the main cause of climate change?”. This demonstrates how the semantic chunking and retrieval system can be used to find relevant information from the processed document.
Semantic chunking represents an advanced approach to document processing for retrieval systems. By attempting to maintain semantic coherence within text segments, it has the potential to improve the quality of retrieved information and enhance the performance of downstream NLP tasks. This technique is particularly valuable for processing long, complex documents where maintaining context is crucial, such as scientific papers, legal documents, or comprehensive reports.
Import libraries
import os
import sys
from dotenv import load_dotenv
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..'))) # Add the parent directory to the path since we work with notebooks
from helper_functions import *
from evaluation.evalute_rag import *
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings
# Load environment variables from a .env file
load_dotenv()
# Set the OpenAI API key environment variable
os.environ["OPENAI_API_KEY"] = os.getenv('OPENAI_API_KEY')
Define file path
path = "../data/Understanding_Climate_Change.pdf"
Read PDF to string
content = read_pdf_to_string(path)
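read_pdf_to_string is imported from helper_functions and is not defined in this notebook; a minimal sketch of what such a helper might look like, assuming the pypdf package (the actual implementation may differ):

from pypdf import PdfReader

def read_pdf_to_string(path: str) -> str:
    # Illustrative sketch of the helper_functions utility, not its actual source
    reader = PdfReader(path)
    return "".join(page.extract_text() or "" for page in reader.pages)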
Breakpoint types:
- ‘percentile’: all differences between consecutive sentences are calculated, and a split is made at any difference greater than the Xth percentile.
- ‘standard_deviation’: a split is made at any difference greater than X standard deviations.
- ‘interquartile’: the interquartile distance is used to determine split points.
text_splitter = SemanticChunker(OpenAIEmbeddings(), breakpoint_threshold_type='percentile', breakpoint_threshold_amount=90) # choose which embeddings, breakpoint type, and threshold to use
Split original text to semantic chunks
docs = text_splitter.create_documents([content])
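A quick, purely illustrative sanity check on the result:

# Inspect how many semantic chunks were produced and preview the first one
print(f"Number of chunks: {len(docs)}")
print(docs[0].page_content[:500])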
Create vector store and retriever
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(docs, embeddings) # FAISS is expected to come in via the wildcard import from helper_functions; otherwise: from langchain_community.vectorstores import FAISS
chunks_query_retriever = vectorstore.as_retriever(search_kwargs={"k": 2})
Test the retriever
test_query = "What is the main cause of climate change?"
context = retrieve_context_per_question(test_query, chunks_query_retriever)
show_context(context)
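retrieve_context_per_question and show_context are also imported from helper_functions; their exact implementations are not shown here. A minimal sketch of equivalent helpers, assuming a standard LangChain retriever:

def retrieve_context_per_question(question, retriever):
    # Illustrative sketch: fetch the relevant documents and return their text
    docs = retriever.invoke(question)
    return [doc.page_content for doc in docs]

def show_context(context):
    # Illustrative sketch: print each retrieved chunk with an index
    for i, chunk in enumerate(context, start=1):
        print(f"Context {i}:")
        print(chunk)
        print()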
