This code implements a semantic chunking approach for processing and retrieving information from PDF documents, first proposed by Greg Kamradt and subsequently implemented in LangChain. Unlike traditional methods that split text based on fixed character or word counts, semantic chunking aims to create more meaningful and context-aware text segments.
Motivation
Traditional text splitting methods often break documents at arbitrary points, potentially disrupting the flow of information and context. Semantic chunking addresses this issue by attempting to split text at more natural breakpoints, preserving semantic coherence within each chunk.
Key Components
- PDF processing and text extraction
- Semantic chunking using LangChain’s SemanticChunker
- Vector store creation using FAISS and OpenAI embeddings
- Retriever setup for querying the processed documents
Method Details
Document Preprocessing
- The PDF is read and converted to a string using a custom read_pdf_to_string function (a sketch of such a helper appears below).
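The helper itself is defined in helper_functions and is not shown in this notebook. Below is a minimal sketch of what such a function might look like, assuming the pypdf library; the actual helper may use a different PDF backend.
from pypdf import PdfReader

def read_pdf_to_string_sketch(path: str) -> str:
    """Hypothetical stand-in for the read_pdf_to_string helper:
    extracts the text of every page and concatenates it."""
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)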
Semantic Chunking
- Utilizes LangChain’s SemanticChunker with OpenAI embeddings.
- Three breakpoint types are available:
- ‘percentile’: Splits where the embedding distance between consecutive sentences exceeds the Xth percentile of all such distances.
- ‘standard_deviation’: Splits where the distance exceeds X standard deviations above the mean.
- ‘interquartile’: Uses the interquartile range of the distances to determine split points.
- In this implementation, the ‘percentile’ method is used with a threshold of 90 (see the sketch after this list for the underlying idea).
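To make the ‘percentile’ strategy concrete, here is a simplified sketch of the underlying idea (not LangChain’s actual implementation): compute the cosine distance between each pair of consecutive sentence embeddings, then split wherever the distance exceeds the chosen percentile of all distances.
import numpy as np

def percentile_breakpoints(sentence_embeddings, threshold_percentile=90):
    """Return indices of sentences that should start a new chunk, based on
    cosine distance between consecutive sentence embeddings
    (simplified illustration of the 'percentile' strategy)."""
    emb = np.asarray(sentence_embeddings, dtype=float)
    norms = np.linalg.norm(emb, axis=1)
    # Cosine similarity between each sentence and the next one
    sims = np.sum(emb[:-1] * emb[1:], axis=1) / (norms[:-1] * norms[1:])
    distances = 1.0 - sims
    # Split wherever the distance exceeds the chosen percentile
    cutoff = np.percentile(distances, threshold_percentile)
    return [i + 1 for i, d in enumerate(distances) if d > cutoff]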
Vector Store Creation
- OpenAI embeddings are used to create vector representations of the semantic chunks.
- A FAISS vector store is created from these embeddings for efficient similarity search.
Retriever Setup
- A retriever is configured to fetch the top 2 most relevant chunks for a given query.
Key Features
- Context-Aware Splitting: Attempts to maintain semantic coherence within chunks.
- Flexible Configuration: Allows for different breakpoint types and thresholds.
- Integration with Advanced NLP Tools: Uses OpenAI embeddings for both chunking and retrieval.
Benefits of this Approach
- Improved Coherence: Chunks are more likely to contain complete thoughts or ideas.
- Better Retrieval Relevance: By preserving context, retrieval accuracy may be enhanced.
- Adaptability: The chunking method can be adjusted based on the nature of the documents and retrieval needs.
- Potential for Better Understanding: LLMs or downstream tasks may perform better with more coherent text segments.
Implementation Details
- Uses OpenAI’s embeddings for both the semantic chunking process and the final vector representations.
- Employs FAISS for creating an efficient searchable index of the chunks.
- The retriever is set up to return the top 2 most relevant chunks; the k value can be adjusted as needed (see the example below).
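For example, widening the retriever is a one-line change (this mirrors the as_retriever call used later in the notebook):
# Return the top 5 chunks instead of 2; a larger k trades precision for recall
retriever_top5 = vectorstore.as_retriever(search_kwargs={"k": 5})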
Example Usage
The code includes a test query: “What is the main cause of climate change?”. This demonstrates how the semantic chunking and retrieval system can be used to find relevant information from the processed document.
Semantic chunking represents an advanced approach to document processing for retrieval systems. By attempting to maintain semantic coherence within text segments, it has the potential to improve the quality of retrieved information and enhance the performance of downstream NLP tasks. This technique is particularly valuable for processing long, complex documents where maintaining context is crucial, such as scientific papers, legal documents, or comprehensive reports.
Import libraries
import os
import sys
from dotenv import load_dotenv
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..'))) # Add the parent directory to the path since we work with notebooks
from helper_functions import *
from evaluation.evalute_rag import *
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings
# Load environment variables from a .env file
load_dotenv()
# Set the OpenAI API key environment variable
os.environ["OPENAI_API_KEY"] = os.getenv('OPENAI_API_KEY')
Define file path
path = "../data/Understanding_Climate_Change.pdf"
Read PDF to string
content = read_pdf_to_string(path)
Breakpoint types:
- ‘percentile’: all distances between consecutive sentences are calculated, and the text is split at any distance greater than the Xth percentile.
- ‘standard_deviation’: the text is split at any distance greater than X standard deviations above the mean.
- ‘interquartile’: the interquartile range of the distances is used to determine split points.
text_splitter = SemanticChunker(OpenAIEmbeddings(), breakpoint_threshold_type='percentile', breakpoint_threshold_amount=90) # choose which embeddings, breakpoint type, and threshold to use
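For reference, the other two breakpoint strategies are selected the same way. The threshold values below are illustrative, not used in this notebook, and should be tuned to your documents.
sd_splitter = SemanticChunker(OpenAIEmbeddings(), breakpoint_threshold_type='standard_deviation', breakpoint_threshold_amount=3)
iqr_splitter = SemanticChunker(OpenAIEmbeddings(), breakpoint_threshold_type='interquartile', breakpoint_threshold_amount=1.5)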
Split original text to semantic chunks
docs = text_splitter.create_documents([content])
Create vector store and retriever
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(docs, embeddings)
chunks_query_retriever = vectorstore.as_retriever(search_kwargs={"k": 2})
Test the retriever
test_query = "What is the main cause of climate change?"
context = retrieve_context_per_question(test_query, chunks_query_retriever)
show_context(context)
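Both retrieve_context_per_question and show_context come from helper_functions. As a rough sketch, assuming a standard LangChain retriever, the retrieval helper might look like this (the real implementation may differ):
def retrieve_context_per_question_sketch(question, retriever):
    """Hypothetical stand-in: fetch the most relevant chunks for a
    question and return their raw text."""
    docs = retriever.invoke(question)  # returns the top-k Document objects
    return [doc.page_content for doc in docs]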