This code implements a context enrichment window technique for document retrieval in a vector database. It enhances the standard retrieval process by adding surrounding context to each retrieved chunk, improving the coherence and completeness of the returned information.
Motivation
Traditional vector search often returns isolated chunks of text, which may lack necessary context for full understanding. This approach aims to provide a more comprehensive view of the retrieved information by including neighboring text chunks.
Key Components
- PDF processing and text chunking
- Vector store creation using FAISS and OpenAI embeddings
- Custom retrieval function with context window
- Comparison between standard and context-enriched retrieval
Method Details
Document Preprocessing
- The PDF is read and converted to a string.
- The text is split into chunks, and the surrounding sentences are kept alongside each chunk.
Vector Store Creation
- OpenAI embeddings are used to create vector representations of the chunks.
- A FAISS vector store is created from these embeddings.
Context-Enriched Retrieval
LlamaIndex provides a dedicated parser for this task: SentenceWindowNodeParser. It splits documents into individual sentences, but each resulting node also carries the surrounding sentences through a relation structure stored in its metadata. At query time, MetadataReplacementPostProcessor reconnects these related sentences by swapping each retrieved sentence for its full window.
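As a minimal sketch of that mechanism, assuming a toy three-sentence document and a window size of 1 (neither value comes from this notebook):

from llama_index.core import Document
from llama_index.core.node_parser import SentenceWindowNodeParser

# Toy example (assumed): each node holds one sentence, and its neighbours are
# stored under the "window" metadata key.
toy_parser = SentenceWindowNodeParser(
window_size=1,
window_metadata_key="window",
original_text_metadata_key="original_sentence",
)
toy_nodes = toy_parser.get_nodes_from_documents(
[Document(text="First sentence. Second sentence. Third sentence.")]
)
print(toy_nodes[1].metadata["original_sentence"])  # "Second sentence."
print(toy_nodes[1].metadata["window"])             # all three sentences joined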
Retrieval Comparison
The notebook includes a section to compare standard retrieval with the context-enriched approach.
Benefits of this Approach
- Provides more coherent and contextually rich results
- Maintains the advantages of vector search while mitigating its tendency to return isolated text fragments
- Allows for flexible adjustment of the context window size
This context enrichment window technique offers a promising way to improve the quality of retrieved information in vector-based document search systems. By providing surrounding context, it helps maintain the coherence and completeness of the retrieved information, potentially leading to better understanding and more accurate responses in downstream tasks such as question answering.
Import libraries and environment variables
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.readers import SimpleDirectoryReader
from llama_index.vector_stores.faiss import FaissVectorStore
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceWindowNodeParser, SentenceSplitter
from llama_index.core import VectorStoreIndex
from llama_index.core.postprocessor import MetadataReplacementPostProcessor
import faiss
import os
import sys
from dotenv import load_dotenv
from pprint import pprint
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..'))) # Add the parent directory to the path since we work with notebooks
# Load environment variables from a .env file
load_dotenv()
# Set the OpenAI API key environment variable
os.environ["OPENAI_API_KEY"] = os.getenv('OPENAI_API_KEY')
# Llamaindex global settings for llm and embeddings
EMBED_DIMENSION=512
Settings.llm = OpenAI(model="gpt-3.5-turbo")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small", dimensions=EMBED_DIMENSION)
Read docs
path = "../data/"
reader = SimpleDirectoryReader(input_dir=path, required_exts=['.pdf'])
documents = reader.load_data()
print(documents[0])
Create vector store and retriever
# Create FaissVectorStore to store embeddings
faiss_index = faiss.IndexFlatL2(EMBED_DIMENSION)
vector_store = FaissVectorStore(faiss_index=faiss_index)
Ingestion Pipelines
Ingestion Pipeline with Sentence Splitter
base_pipeline = IngestionPipeline(
transformations=[SentenceSplitter()],
vector_store=vector_store
)
base_nodes = base_pipeline.run(documents=documents)
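As an optional sanity check (an assumed inspection step, not part of the original pipeline), you can verify how many chunks the sentence splitter produced and preview the first one:

# Assumed inspection step: count the base chunks and preview the first one
print(f"Number of base nodes: {len(base_nodes)}")
print(base_nodes[0].text[:200])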
Ingestion Pipeline with Sentence Window
node_parser = SentenceWindowNodeParser(
# How many sentences on both sides to capture.
# Setting this to 3 captures 3 sentences before and 3 after, i.e. 7 sentences in total.
window_size=3,
# the metadata key to be used in MetadataReplacementPostProcessor
window_metadata_key="window",
# the metadata key that holds the original sentence
original_text_metadata_key="original_sentence"
)
# Create a pipeline with defined document transformations and vectorstore
pipeline = IngestionPipeline(
transformations=[node_parser],
vector_store=vector_store,
)
windowed_nodes = pipeline.run(documents=documents)
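To see what the window parser stored, peek at one node's metadata (an optional inspection step added here for illustration):

# Each windowed node keeps the original sentence and its full window in metadata
sample_node = windowed_nodes[0]
print(sample_node.metadata["original_sentence"])
print(sample_node.metadata["window"])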
Querying
query = "Explain the role of deforestation and fossil fuels in climate change"
Querying without Metadata Replacement
# Create vector index from base nodes
base_index = VectorStoreIndex(base_nodes)
# Instantiate query engine from vector index
base_query_engine = base_index.as_query_engine(
similarity_top_k=1,
)
# Send query to the engine to get related node(s)
base_response = base_query_engine.query(query)
print(base_response)
Print Metadata of the Retrieved Node
pprint(base_response.source_nodes[0].node.metadata)
Querying with Metadata Replacement
“Metadata replacement” might intuitively sound a little off-topic, since retrieval runs on the base sentences. But LlamaIndex stores these “before/after” sentences in each node's metadata, so to rebuild the windows of sentences we need the MetadataReplacementPostProcessor.
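As a standalone toy sketch of what the post-processor does (the hand-built node below is an assumption for illustration, not data from this notebook), it replaces a retrieved node's text with the contents of its "window" metadata key before the LLM sees it:

from llama_index.core.schema import TextNode, NodeWithScore

# Assumed toy node whose metadata mimics what SentenceWindowNodeParser produces
toy_node = TextNode(
text="Second sentence.",
metadata={"window": "First sentence. Second sentence. Third sentence."},
)
postprocessor = MetadataReplacementPostProcessor(target_metadata_key="window")
replaced = postprocessor.postprocess_nodes([NodeWithScore(node=toy_node, score=1.0)])
print(replaced[0].node.get_content())  # prints the full three-sentence window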
# Create window index from nodes created from SentenceWindowNodeParser
windowed_index = VectorStoreIndex(windowed_nodes)
# Instantiate query engine with MetadataReplacementPostProcessor
windowed_query_engine = windowed_index.as_query_engine(
similarity_top_k=1,
node_postprocessors=[
MetadataReplacementPostProcessor(
target_metadata_key="window" # `window_metadata_key` key defined in SentenceWindowNodeParser
)
],
)
# Send query to the engine to get related node(s)
windowed_response = windowed_query_engine.query(query)
print(windowed_response)
Print Metadata of the Retrieved Node
# Window and original sentence are added to the metadata
pprint(windowed_response.source_nodes[0].node.metadata)
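Finally, a side-by-side print (a convenience snippet assumed here, not part of the original notebook) makes the comparison between the two approaches explicit:

# Compare the answer grounded in a single chunk vs. the sentence-window context
print("Standard retrieval answer:\n", base_response, "\n")
print("Context-enriched retrieval answer:\n", windowed_response)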