This code implements a basic Retrieval-Augmented Generation (RAG) system for processing and querying PDF documents. The system encodes the document content into a vector store, which can then be queried to retrieve relevant information.

Key Components

PDF processing and text extraction
Text chunking for manageable processing
Vector store creation using FAISS and OpenAI embeddings
Retriever setup for querying the processed documents
Evaluation of the RAG system

Method Details

Document Preprocessing

The PDF is loaded using PyPDFLoader.
The text is split into chunks using RecursiveCharacterTextSplitter with specified chunk size and overlap.
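
To make chunk size and overlap concrete, here is a minimal sketch of fixed-size character chunking. It is a simplification and not RecursiveCharacterTextSplitter itself, which additionally tries to break on separators such as paragraph and line boundaries. The function name chunk_text is hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Simplified illustration only: unlike RecursiveCharacterTextSplitter,
    it cuts purely by character count and ignores separators.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "abcdefghij" * 3  # 30 characters
chunks = chunk_text(sample, chunk_size=12, chunk_overlap=4)
for c in chunks:
    print(len(c), c)
```

Each chunk repeats the last chunk_overlap characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.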

Text Cleaning

A custom function, replace_t_with_space, is applied to the text chunks. As its name suggests, it replaces tab characters with spaces, which removes a common artifact of PDF text extraction.
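
A minimal sketch of this kind of cleaning step, assuming the helper replaces tab characters with spaces as its name suggests. The notebook's helper receives the output of split_documents, whereas this hypothetical replace_tabs_with_spaces works on plain strings to stay self-contained.

```python
def replace_tabs_with_spaces(chunks: list[str]) -> list[str]:
    """Return copies of the chunks with every tab character replaced by a space."""
    return [chunk.replace("\t", " ") for chunk in chunks]


# Hypothetical sample chunks containing tab characters
raw_chunks = ["Greenhouse\tGases", "no tabs here", "a\t\tb"]
cleaned = replace_tabs_with_spaces(raw_chunks)
print(cleaned)
```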

Vector Store Creation

OpenAI embeddings are used to create vector representations of the text chunks.
A FAISS vector store is created from these embeddings for efficient similarity search.
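
Conceptually, similarity search ranks the stored vectors by their distance to the query vector. The sketch below does this by brute force with Euclidean distance on toy 3-dimensional vectors. The vectors and the top_k helper are made up for illustration, and real embeddings have many more dimensions. FAISS performs this kind of nearest-neighbor search with optimized implementations.

```python
import heapq
import math


def top_k(query: list[float], vectors: dict[str, list[float]], k: int = 2) -> list[tuple[str, float]]:
    """Return the k entries closest to the query by Euclidean distance."""
    scored = ((math.dist(query, vec), name) for name, vec in vectors.items())
    return [(name, dist) for dist, name in heapq.nsmallest(k, scored)]


# Toy "embeddings" (hypothetical values, not produced by any real model)
store = {
    "greenhouse gases": [0.9, 0.1, 0.0],
    "fossil fuels": [0.8, 0.3, 0.1],
    "ocean currents": [0.1, 0.9, 0.2],
    "ice cores": [0.0, 0.2, 0.9],
}
query_vector = [1.0, 0.0, 0.0]

results = top_k(query_vector, store, k=2)
for name, dist in results:
    print(f"{name}: {dist:.3f}")
```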

Retriever Setup

A retriever is configured to return the two most relevant chunks (k=2) for a given query.

Encoding Function

The encode_pdf function encapsulates the entire process of loading, chunking, cleaning, and encoding the PDF into a vector store.

Key Features

Modular Design: The encoding process is encapsulated in a single function for easy reuse.
Configurable Chunking: Allows adjustment of chunk size and overlap.
Efficient Retrieval: Uses FAISS for fast similarity search.
Evaluation: Includes a function to evaluate the RAG system’s performance.

Usage Example

The code includes a test query: “What is the main cause of climate change?”. This demonstrates how to use the retriever to fetch relevant context from the processed document.

Evaluation

The system includes an evaluate_rag function to assess the performance of the retriever, though the specific metrics used are not detailed in the provided code.
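
Since the metrics behind evaluate_rag are not shown, the following is only an illustration of one simple way a retriever could be scored, not a description of what evaluate_rag does. It computes a hit rate: the fraction of questions for which at least one retrieved chunk contains an expected keyword. The names hit_rate and keyword_retriever, the corpus, and the labelled questions are all hypothetical.

```python
from typing import Callable


def hit_rate(retrieve: Callable[[str], list[str]], labelled: list[tuple[str, str]]) -> float:
    """Fraction of questions whose retrieved chunks mention the expected keyword."""
    if not labelled:
        return 0.0
    hits = 0
    for question, keyword in labelled:
        retrieved = retrieve(question)
        if any(keyword.lower() in chunk.lower() for chunk in retrieved):
            hits += 1
    return hits / len(labelled)


# Stand-in retriever (hypothetical): returns up to two chunks sharing a word with the question
corpus = [
    "Greenhouse gases trap heat from the sun.",
    "Burning fossil fuels releases large amounts of CO2.",
    "Ice core samples provide a historical record.",
]


def keyword_retriever(question: str, k: int = 2) -> list[str]:
    words = set(question.lower().replace("?", "").split())
    matches = [c for c in corpus if words & set(c.lower().replace(".", "").split())]
    return matches[:k]


labelled_questions = [
    ("What do greenhouse gases do?", "heat"),
    ("What do fossil fuels release?", "CO2"),
    ("What do tree rings show?", "tree rings"),
]
score = hit_rate(keyword_retriever, labelled_questions)
print(f"hit rate: {score:.2f}")
```

A keyword match is a crude proxy for relevance. It is used here only because it needs no language model or API key.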

Benefits of this Approach

Scalability: Can handle large documents by processing them in chunks.
Flexibility: Easy to adjust parameters like chunk size and number of retrieved results.
Efficiency: Utilizes FAISS for fast similarity search in high-dimensional spaces.
Integration with Advanced NLP: Uses OpenAI embeddings to represent each chunk as a dense vector that captures its meaning.

This simple RAG system provides a solid foundation for building more complex information retrieval and question-answering systems. By encoding document content into a searchable vector store, it enables efficient retrieval of relevant information in response to queries. This approach is particularly useful for applications requiring quick access to specific information within large documents or document collections.

Import libraries and environment variables

In [4]:

import os
import sys
from dotenv import load_dotenv

# Load environment variables from a .env file
load_dotenv()

# Prompt for the OpenAI API key only if it is not already set (comment out if not using OpenAI)
if not os.getenv("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = input("Please enter your OpenAI API key: ")

# Add the parent directory to the path, since the notebook runs from a subdirectory
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), "..")))

# The names used below (PyPDFLoader, RecursiveCharacterTextSplitter, FAISS, ...)
# are expected to come from these wildcard imports
from helper_functions import *
from evaluation.evalute_rag import *

Read Docs

In [2]:

path = "../data/Understanding_Climate_Change.pdf"

Encode document

In [ ]:

def encode_pdf(path, chunk_size=1000, chunk_overlap=200):
    """
    Encodes a PDF book into a vector store using OpenAI embeddings.
    Args:
        path: The path to the PDF file.
        chunk_size: The desired size of each text chunk.
        chunk_overlap: The amount of overlap between consecutive chunks.
    Returns:
        A FAISS vector store containing the encoded book content.
    """
    # Load PDF documents
    loader = PyPDFLoader(path)
    documents = loader.load()
    # Split documents into chunks
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap, length_function=len
    )
    texts = text_splitter.split_documents(documents)
    # Clean the chunks (replace tab characters with spaces)
    cleaned_texts = replace_t_with_space(texts)
    # Create embeddings (tested with OpenAI and Amazon Bedrock)
    embeddings = get_langchain_embedding_provider(EmbeddingProvider.OPENAI)
    # embeddings = get_langchain_embedding_provider(EmbeddingProvider.AMAZON_BEDROCK)
    # Create vector store
    vectorstore = FAISS.from_documents(cleaned_texts, embeddings)
    return vectorstore

In [4]:

chunks_vector_store = encode_pdf(path, chunk_size=1000, chunk_overlap=200)

Create retriever

In [5]:

chunks_query_retriever = chunks_vector_store.as_retriever(search_kwargs={"k": 2})

Test retriever

In [6]:

test_query = "What is the main cause of climate change?"
context = retrieve_context_per_question(test_query, chunks_query_retriever)
show_context(context)

c:\Users\N7\PycharmProjects\llm_tasks\RAG_TECHNIQUES\.venv\Lib\site-packages\langchain_core\_api\deprecation.py:139: LangChainDeprecationWarning: The method `BaseRetriever.get_relevant_documents` was deprecated in langchain-core 0.1.46 and will be removed in 0.3.0. Use invoke instead.
  warn_deprecated(

Context 1:
driven by human activities, particularly the emission of greenhou se gases.
Chapter 2: Causes of Climate Change
Greenhouse Gases
The primary cause of recent climate change is the increase in greenhouse gases in the
atmosphere. Greenhouse gases, such as carbon dioxide (CO2), methane (CH4), and nitrous
oxide (N2O), trap heat from the sun, creating a "greenhouse effect." This effect is essential
for life on Earth, as it keeps the planet warm enough to support life. However, human
activities have intensified this natural process, leading to a warmer climate.
Fossil Fuels
Burning fossil fuels for energy releases large amounts of CO2. This includes coal, oil, and
natural gas used for electricity, heating, and transportation. The industrial revolution marked
the beginning of a significant increase in fossil fuel consumption, which continues to rise
today.
Coal
Context 2:
Most of these climate changes are attributed to very small variations in Earth's orbit that
change the amount of solar energy our planet receives. During the Holocene epoch, which
began at the end of the last ice age, human societies f lourished, but the industrial era has seen
unprecedented changes.
Modern Observations
Modern scientific observations indicate a rapid increase in global temperatures, sea levels,
and extreme weather events. The Intergovernmental Panel on Climate Change (IPCC) has
documented these changes extensively. Ice core samples, tree rings, and ocean sediments
provide a historical record that scientists use to understand past climate conditions and
predict future trends. The evidence overwhelmingly shows that recent changes are primarily
driven by human activities, particularly the emission of greenhou se gases.
Chapter 2: Causes of Climate Change
Greenhouse Gases
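
The deprecation warning in the output above indicates that the retriever was queried through get_relevant_documents and recommends invoke instead. Below is a minimal sketch of a context-retrieval helper written against invoke. It assumes only that the retriever exposes invoke(query) and returns objects with a page_content attribute, as LangChain retrievers and Document objects do. The function name retrieve_context and the FakeRetriever stand-in (used so the example runs without an API key) are hypothetical and are not the notebook's helper.

```python
from dataclasses import dataclass


def retrieve_context(question: str, retriever) -> list[str]:
    """Fetch documents via retriever.invoke and return their text content."""
    docs = retriever.invoke(question)
    return [doc.page_content for doc in docs]


# --- Stand-ins so the sketch runs without LangChain or an API key ---
@dataclass
class FakeDocument:
    page_content: str


class FakeRetriever:
    """Mimics the invoke(query) -> list-of-documents interface."""

    def __init__(self, docs: list[str], k: int = 2):
        self.docs = docs
        self.k = k

    def invoke(self, query: str) -> list[FakeDocument]:
        # A real retriever would rank by similarity; this one returns the first k
        return [FakeDocument(d) for d in self.docs[: self.k]]


fake = FakeRetriever(["chunk about greenhouse gases", "chunk about fossil fuels", "unrelated chunk"])
context = retrieve_context("What is the main cause of climate change?", fake)
for i, text in enumerate(context, start=1):
    print(f"Context {i}: {text}")
```

With the notebook's objects, the equivalent call would be retrieve_context(test_query, chunks_query_retriever).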

Evaluate results

In [ ]:

# Note: this currently works with OpenAI only
evaluate_rag(chunks_query_retriever)
