Simple RAG (Retrieval-Augmented Generation) System for CSV Files
This code implements a basic Retrieval-Augmented Generation (RAG) system for processing and querying CSV documents. The system encodes the document content into a vector store, which can then be queried to retrieve relevant information.
CSV File Structure and Use Case
The CSV file contains dummy customer data, comprising various attributes like first name, last name, company, etc. This dataset will be utilized for a RAG use case, facilitating the creation of a customer information Q&A system.
Key Components
- Loading and splitting CSV files
- Vector store creation using FAISS and OpenAI embeddings
- Query engine setup for querying the processed documents
- Creating a question-and-answer system over the CSV data
Method Details
Document Preprocessing
- The CSV is loaded using LlamaIndex's PagedCSVReader
- This reader converts each row into a LlamaIndex Document, along with the respective column names of the table. No further splitting is applied.
Vector Store Creation
- OpenAI embeddings are used to create vector representations of the text chunks.
- A FAISS vector store is created from these embeddings for efficient similarity search.
Query Engine Setup
- A query engine is configured to fetch the most relevant chunks for a given query and then answer the question.
Benefits of this Approach
- Scalability: Can handle large documents by processing them in chunks.
- Flexibility: Easy to adjust parameters like chunk size and number of retrieved results.
- Efficiency: Utilizes FAISS for fast similarity search in high-dimensional spaces.
- Integration with Advanced NLP: Uses OpenAI embeddings for state-of-the-art text representation.
This simple RAG system provides a solid foundation for building more complex information retrieval and question-answering systems. By encoding document content into a searchable vector store, it enables efficient retrieval of relevant information in response to queries. This approach is particularly useful for applications requiring quick access to specific information within a CSV file.
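Before running the code below, the required packages must be installed. The package names here are inferred from the imports used in this notebook (an assumption; adjust them to your environment):
%pip install llama-index llama-index-vector-stores-faiss llama-index-readers-file faiss-cpu pandas python-dotenv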
Imports & Environment Variables
from llama_index.core.readers import SimpleDirectoryReader
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.readers.file import PagedCSVReader
from llama_index.vector_stores.faiss import FaissVectorStore
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core import VectorStoreIndex
import faiss
import os
import pandas as pd
from dotenv import load_dotenv
# Load environment variables from a .env file
load_dotenv()
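# Optional sanity check: fail early with a clear message if the key is missing,
# rather than letting the assignment below fail on a None value
if os.getenv("OPENAI_API_KEY") is None:
    raise EnvironmentError("OPENAI_API_KEY not found; add it to your .env file")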
# Set the OpenAI API key environment variable
os.environ["OPENAI_API_KEY"] = os.getenv('OPENAI_API_KEY')
# LlamaIndex global settings for LLM and embeddings
EMBED_DIMENSION = 512
Settings.llm = OpenAI(model="gpt-3.5-turbo")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small", dimensions=EMBED_DIMENSION)
CSV File Structure and Use Case
file_path = '../data/customers-100.csv'  # insert the path of the CSV file
data = pd.read_csv(file_path)
# Preview the csv file
data.head()
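A couple of quick pandas checks (optional) confirm the size and column names of the dataset before it is indexed:
# Optional: inspect the number of rows and the column names
print(data.shape)
print(list(data.columns))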
Vector Store
# Create FaissVectorStore to store embeddings
faiss_index = faiss.IndexFlatL2(EMBED_DIMENSION)
vector_store = FaissVectorStore(faiss_index=faiss_index)
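IndexFlatL2 performs exact (brute-force) L2 search, which is more than enough for a 100-row CSV. For much larger datasets, an approximate FAISS index could be swapped in instead; the sketch below uses an HNSW index as one possible alternative, where 32 (the graph connectivity) is an illustrative value:
# Optional alternative for larger datasets: approximate search with an HNSW index
hnsw_index = faiss.IndexHNSWFlat(EMBED_DIMENSION, 32)
approx_vector_store = FaissVectorStore(faiss_index=hnsw_index)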
Load and Process CSV Data as Documents
csv_reader = PagedCSVReader()
reader = SimpleDirectoryReader(
    input_files=[file_path],
    file_extractor={".csv": csv_reader}
)
docs = reader.load_data()
# Check a sample chunk
print(docs[0].text)
Index: 1
Customer Id: DD37Cf93aecA6Dc
First Name: Sheryl
Last Name: Baxter
Company: Rasmussen Group
City: East Leonard
Country: Chile
Phone 1: 229.077.5154
Phone 2: 397.884.0519x718
Email: zunigavanessa@smith.info
Subscription Date: 2020-08-24
Website: http://www.stephenson.com/
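Each row of the CSV becomes one Document, so the number of loaded documents should match the number of rows in the dataframe; a quick optional check:
# Sanity check: one Document per CSV row
print(len(docs), len(data))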
Ingestion Pipeline
pipeline = IngestionPipeline(
    vector_store=vector_store,
    documents=docs
)
nodes = pipeline.run()
Create Query Engine
vector_store_index = VectorStoreIndex(nodes)
query_engine = vector_store_index.as_query_engine(similarity_top_k=2)
Query the RAG bot with a question based on the CSV data
response = query_engine.query("which company does sheryl Baxter work for?")
response.response
'Rasmussen Group'
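Beyond the final answer, the rows that were retrieved to support it can be inspected through the response's source nodes; a minimal sketch:
# Inspect the retrieved CSV rows and their similarity scores
for source_node in response.source_nodes:
    print("score:", source_node.score)
    print(source_node.node.get_content()[:200])
    print("---")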
