This code implements a basic Retrieval-Augmented Generation (RAG) system for processing and querying CSV documents. The system encodes the document content into a vector store, which can then be queried to retrieve relevant information.
CSV File Structure and Use Case
The CSV file contains dummy customer data, comprising various attributes like first name, last name, company, etc. This dataset will be utilized for a RAG use case, facilitating the creation of a customer information Q&A system.
Key Components
- Loading and splitting CSV files
- Vector store creation using FAISS and OpenAI embeddings
- Retriever setup for querying the processed documents
- Creating a question-answering chain over the CSV data
Method Details
Document Preprocessing
- The CSV is loaded using LangChain's CSVLoader
- The data is split into chunks (one document per row by default)
Vector Store Creation
- OpenAI embeddings are used to create vector representations of the text chunks.
- A FAISS vector store is created from these embeddings for efficient similarity search.
Retriever Setup
- A retriever is configured to fetch the most relevant chunks for a given query.
Benefits of this Approach
- Scalability: Can handle large documents by processing them in chunks.
- Flexibility: Easy to adjust parameters like chunk size and number of retrieved results.
- Efficiency: Utilizes FAISS for fast similarity search in high-dimensional spaces.
- Integration with Advanced NLP: Uses OpenAI embeddings for state-of-the-art text representation.
This simple RAG system provides a solid foundation for building more complex information retrieval and question-answering systems. By encoding document content into a searchable vector store, it enables efficient retrieval of relevant information in response to queries. This approach is particularly useful for applications requiring quick access to specific information within a CSV file.
Import libraries
from langchain_community.document_loaders.csv_loader import CSVLoader
from pathlib import Path
from langchain_openai import ChatOpenAI,OpenAIEmbeddings
import os
from dotenv import load_dotenv
# Load environment variables from a .env file
load_dotenv()
# Set the OpenAI API key environment variable
os.environ["OPENAI_API_KEY"] = os.getenv('OPENAI_API_KEY')
llm = ChatOpenAI(model="gpt-3.5-turbo-0125")
import pandas as pd
file_path = '../data/customers-100.csv' # insert the path of the CSV file
data = pd.read_csv(file_path)
# Preview the CSV file
data.head()
 | Index | Customer Id | First Name | Last Name | Company | City | Country | Phone 1 | Phone 2 | Email | Subscription Date | Website |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | DD37Cf93aecA6Dc | Sheryl | Baxter | Rasmussen Group | East Leonard | Chile | 229.077.5154 | 397.884.0519×718 | zunigavanessa@smith.info | 2020-08-24 | http://www.stephenson.com/ |
1 | 2 | 1Ef7b82A4CAAD10 | Preston | Lozano | Vega-Gentry | East Jimmychester | Djibouti | 5153435776 | 686-620-1820×944 | vmata@colon.com | 2021-04-23 | http://www.hobbs.com/ |
2 | 3 | 6F94879bDAfE5a6 | Roy | Berry | Murillo-Perry | Isabelborough | Antigua and Barbuda | +1-539-402-0259 | (496)978-3969×58947 | beckycarr@hogan.com | 2020-03-25 | http://www.lawrence.com/ |
3 | 4 | 5Cef8BFA16c5e3c | Linda | Olsen | Dominguez, Mcmillan and Donovan | Bensonview | Dominican Republic | 001-808-617-6467×12895 | +1-813-324-8756 | stanleyblackwell@benson.org | 2020-06-02 | http://www.good-lyons.com/ |
4 | 5 | 053d585Ab6b3159 | Joanna | Bender | Martin, Lang and Andrade | West Priscilla | Slovakia (Slovak Republic) | 001-234-203-0635×76146 | 001-199-446-3860×3486 | colinalvarado@miles.net | 2021-04-17 | https://goodwin-ingram.com/ |
Load and process the CSV data
loader = CSVLoader(file_path=file_path)
docs = loader.load_and_split()
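For intuition: CSVLoader produces one document per CSV row, rendering each cell as a `column: value` line. A minimal pure-Python sketch of that behavior (the inline sample data is illustrative, not the real customers-100.csv):

```python
import csv
import io

def rows_to_documents(csv_text: str) -> list[str]:
    """Mimic CSVLoader: turn each CSV row into one 'column: value' text block."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return ["\n".join(f"{col}: {val}" for col, val in row.items()) for row in reader]

sample = "First Name,Last Name,Company\nSheryl,Baxter,Rasmussen Group\n"
docs_sketch = rows_to_documents(sample)
print(docs_sketch[0])
# First Name: Sheryl
# Last Name: Baxter
# Company: Rasmussen Group
```

Because every row becomes its own small document, the retriever can later pull back exactly the customer records relevant to a query.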
Initialize the FAISS vector store and OpenAI embeddings
import faiss
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_community.vectorstores import FAISS
embeddings = OpenAIEmbeddings()
# Determine the embedding dimensionality by embedding a sample string
index = faiss.IndexFlatL2(len(embeddings.embed_query(" ")))
vector_store = FAISS(
    embedding_function=embeddings,
    index=index,
    docstore=InMemoryDocstore(),
    index_to_docstore_id={}
)
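Under the hood, `IndexFlatL2` performs an exhaustive (brute-force) L2 search over every stored vector. A tiny pure-Python sketch of that search, using made-up 3-dimensional vectors rather than real OpenAI embeddings:

```python
def l2_nearest(query, vectors, k=1):
    """Return indices of the k stored vectors closest to query by squared L2 distance."""
    dists = [
        (sum((q - v) ** 2 for q, v in zip(query, vec)), i)
        for i, vec in enumerate(vectors)
    ]
    return [i for _, i in sorted(dists)[:k]]

store = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
print(l2_nearest([1.0, 0.0, 0.1], store, k=2))  # → [1, 2]
```

FAISS does the same comparison with optimized native code, which is what makes similarity search over many high-dimensional embeddings fast.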
Add the split CSV documents to the vector store
vector_store.add_documents(documents=docs)
Create the retrieval chain
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
retriever = vector_store.as_retriever()
# Set up system prompt
system_prompt = (
"You are an assistant for question-answering tasks. "
"Use the following pieces of retrieved context to answer "
"the question. If you don't know the answer, say that you "
"don't know. Use three sentences maximum and keep the "
"answer concise."
"\n\n"
"{context}"
)
prompt = ChatPromptTemplate.from_messages([
("system", system_prompt),
("human", "{input}"),
])
# Create the question-answer chain
question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)
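Conceptually, `create_stuff_documents_chain` "stuffs" the page content of all retrieved documents into the prompt's `{context}` placeholder before calling the LLM. A rough pure-Python sketch of that formatting step (the separator and prompt text here are illustrative assumptions, not LangChain's exact internals):

```python
def stuff_prompt(system_template: str, retrieved_chunks: list[str]) -> str:
    """Fill the {context} placeholder with retrieved chunks, joined by blank lines."""
    return system_template.format(context="\n\n".join(retrieved_chunks))

template = "Use the following context to answer the question.\n\n{context}"
chunks = ["Company: Rasmussen Group", "City: East Leonard"]
print(stuff_prompt(template, chunks))
```

The retrieval chain then wraps this: for each query it first fetches chunks via the retriever, builds the stuffed prompt, and passes it to the LLM.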
Query the RAG chain with a question based on the CSV data
answer = rag_chain.invoke({"input": "Which company does Sheryl Baxter work for?"})
answer['answer']
'Sheryl Baxter works for Rasmussen Group.'