RAG Evaluation and Meta-Evaluation with GroUSE

Last updated: February 8, 2026Reviewed date: February 8, 2026Reading time: 17 min read

Patient Tools

Read, save, and share this guide

Use these quick tools to make this medical article easier to read, print, save, or share with a family member.

On this page28 sections

Article Summary

This tutorial introduces GroUSE, a framework for evaluating Retrieval-Augmented Generation (RAG) pipelines, focusing on the final stage: Grounded Question Answering (GQA). It demonstrates how to use Large Language Models (LLMs) to assess GQA answers across four distinct metrics and guides you through customizing your own Judge LLM using GroUSE unit tests. Motivation Manually evaluating RAG pipeline outputs can be challenging. The GroUSE framework leverages LLMs...

Key Takeaways

This article explains Motivation in simple medical language.
This article explains Key Components in simple medical language.
This article explains Method Details in simple medical language.
This article explains Benefits of the approach in simple medical language.

Before reading

RX Patient Tools

Use these quick guides before reading the article, or return to them when you need help preparing questions for a doctor.

Start here Choose the right pathway for symptoms, reports, medicines, or urgent warning signs. Disease article roadmap Read this topic step by step: meaning, symptoms, warning signs, diagnosis, treatment, prevention, and follow-up. Treatment planner Prepare questions about treatment choices, benefits, risks, side effects, and follow-up. Family & caregiver guide Organize symptoms, reports, medicines, questions, and follow-up safely. Nutrition & diet guide Prepare food, hydration, supplement, and medicine-timing questions safely. Prevention guide Organize risk factors, protective habits, screening, and warning signs. Recovery guide Prepare a safe plan for activity, rehabilitation, warning signs, and follow-up.

Educational health guideWritten for patient understanding and clinical awareness.

Reviewed content workflowUse writer and reviewer profiles for stronger trust.

Emergency safety firstUrgent warning signs are highlighted below.

Definition

Motivation

Manually evaluating RAG pipeline outputs can be challenging. The GroUSE framework leverages LLMs with finely tuned prompts to address all potential failure modes in Grounded Question Answering. GroUSE unit tests are used to identify the most effective prompts to optimize the performance of these evaluators.

Key Components

Answer Relevancy evaluation
Completeness evaluation
Faithfulness evaluation
Usefulness evaluation
Judge LLM Customization

Method Details

The task we want to assess: Grounded Question Answering

Grounded Question Answering (QA) is usually the last step of a RAG pipeline: given a question and a set of documents retrieved from the corpus, an LLM must generate an answer. We expect the LLM to cite which document each piece of information is coming from, as depicted below. When no precise answer is in the documents, the LLM should indicate it in its answer. In that case, if some related information is available in the documents, the LLM can add it to the answer to show the corpus is not completely off-topic with respect to the question.

Evaluation Metrics

Each answer is evaluated according to six metrics. The fisrt four metrics are evaluated with an evaluator LLM call. Positive acceptance and negative rejection are deducted from the first four.

1. Answer Relevancy

Answer relevancy assesses the relevance of the information provided in the answer regarding the question, using a Likert scale (1 to 5).

2. Completeness

Completeness uses a Likert scale (1 to 5) to evaluate whether all relevant information from the documents is present in the answer.

3. Faithfulness

Faithfulness is a binary score that checks if all facts in the answer are accurate and correctly attributed to the corresponding document.

4. Usefulness

When the answer states that no references can answer the question but additional information is provided, usefulness is a binary score that determines if the provided additional information is still useful.

5. Positive Acceptance

Percentage of samples that responded when they were supposed to.

6. Negative Rejection

Percentage of samples that refrained from responding when there is no context in the documents that allow to answer the question.

Benefits of the approach

The GroUSE framework comprehensively addresses the seven failure modes of Grounded Question Answering, providing a thorough evaluation of your RAG pipeline’s final stage.

Implementation details

Answer Relevancy, Completeness, Faithfulness and Usefulness are evaluated using GPT-4 as the default model, as it was the best model we tested. Positive acceptance and negative rejection can be deducted from the answer relevancy and completeness results as these can have None values when no references contain answers to the question.

The GroUSE framework provides a comprehensive set of evaluation metrics to assess the performance of Grounded Question Answering models. By addressing seven key failure modes, it enables developers to thoroughly evaluate and improve their RAG pipelines. The use of LLM-based judges, such as GPT-4, automate this evaluation process. To tailor the framework to your specific needs, you can develop a custom LLM evaluator and validate its performance using GroUSE unit tests.

RAG Evaluation and Meta-Evaluation with GroUSE

Tutorial

Import libraries

In [ ]:

import os
import nest_asyncio
from grouse import (
    EvaluationSample,
    GroundedQAEvaluator,
    meta_evaluate_pipeline,
)

Avoid nested asyncio loops inside notebooks (this line is not needed if you run the code in a Python script)

In [ ]:

nest_asyncio.apply()

Setup your API key

For this tutorial, you will need access to the OpenAI API and get an OpenAI API key. You can get one here.

In [ ]:

os.environ["OPENAI_API_KEY"] = input("Add your OpenAI API key:")

Initialize the evaluator

The default model used is GPT-4. Prompts are adapted to this model, so if you want to have the best results, keep using the default model.

In [ ]:

evaluator = GroundedQAEvaluator()

Evaluate a good answer

An LLM has given a good answer to a question related to the Eiffel Tower, given some contexts from the Eiffel Tower Wikipedia page. Let’s evaluate the answer and check that everything is okay.

In [ ]:

good_sample = EvaluationSample(
    input="Where is the Eiffel Tower located?",
    actual_output="The Eiffel Tower stands in the Champs de Mars in Paris.[1]",
    expected_output="In the Champs de Mars in Paris. [1]",
    references=[
        "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France"
    ]
)
result = evaluator.evaluate(eval_samples=[good_sample]).evaluations[0]
print("Answer Relevancy (1 to 5):", result.answer_relevancy.answer_relevancy)
print("Answer Relevancy (1 to 5):", result.answer_relevancy.answer_relevancy_justification)
print("Completeness (1 to 5):", result.completeness.completeness)
print("Completeness (1 to 5):", result.completeness.completeness_justification)
print("Faithfulness (0 or 1):", result.faithfulness.faithfulness)
print("Faithfulness (0 or 1):", result.faithfulness.faithfulness_justification)

How does it behave with an irrelevant answer?

In [ ]:

irrelevant_sample = EvaluationSample(
    input="Where is the Eiffel Tower located?",
    actual_output="The Eiffel Tower is mainly made of puddle iron.[2]",
    expected_output="In the Champs de Mars in Paris.[1]",
    references=[
        "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France",
        "The puddle iron (wrought iron) of the Eiffel Tower weighs 7,300 tonnes,[70] and the addition of lifts, shops and antennae have brought the total weight to approximately 10,100 tonnes."
    ]
)
result = evaluator.evaluate(eval_samples=[irrelevant_sample]).evaluations[0]
print("Answer Relevancy (1 to 5):", result.answer_relevancy.answer_relevancy)
print("Justification:", result.answer_relevancy.answer_relevancy_justification)

Evaluation of an incomplete sample

In [ ]:

incomplete_sample = EvaluationSample(
    input="Who critized the Eiffel Tower project in 1889?",
    actual_output=(
        "The tower was critized by those who did not believe it was feasible and some artists.[1]"
    ),
    expected_output=(
        "The tower was critized by those who did not believe it was feasible and those who objected on artistic grounds.[1]"
        "An artist committee was created to protest againt the construction of the tower, led by the prominent architect "
        "Charles Garnier and including some of the most important figures of the arts, "
        "such as William-Adolphe Bouguereau, Guy de Maupassant, Charles Gounod and Jules Massenet. [2]"
    ),
    references=[
        "The proposed tower had been a subject of controversy, drawing criticism from those who did not believe it was feasible and those who objected on artistic grounds.",
        (
            "It came to a head as work began at the Champ de Mars: a \"Committee of Three Hundred\" "
            "(one member for each metre of the tower's height) was formed, led by the prominent architect "
            "Charles Garnier and including some of the most important figures of the arts, "
            "such as William-Adolphe Bouguereau, Guy de Maupassant, Charles Gounod and Jules Massenet."
        ),
        "A petition called \"Artists against the Eiffel Tower\" was sent to the Minister of Works and Commissioner for the Exposition, Adolphe Alphand, and it was published by Le Temps on 14 February 1887"
    ]
)
result = evaluator.evaluate(eval_samples=[incomplete_sample]).evaluations[0]
print("Completeness (1 to 5):", result.completeness.completeness)
print("Justification:", result.completeness.completeness_justification)

Evaluation of an unfaithful sample

In [ ]:

unfaithful_sample = EvaluationSample(
    input="Where is the Eiffel Tower located?",
    actual_output="The Eiffel Tower is located at Rue Rabelais in Paris.[1][2]",
    expected_output="In the Champs de Mars in Paris.[1]",
    references=[
        "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France",
        "Gustave Eiffel died in his appartment at Rue Rabelais in Paris."
    ]
)
result = evaluator.evaluate(eval_samples=[unfaithful_sample]).evaluations[0]
print("Faithfulness (0 or 1):", result.faithfulness.faithfulness)
print("Justification:", result.faithfulness.faithfulness_justification)

Evaluation of information utility in case there is no answer to the question in the references

In [ ]:

useful_sample = EvaluationSample(
    input="Who critized the Eiffel Tower project in 1889?",
    actual_output=(
        "No document seems to precisely answer your question."
        "However, it is mentioned that a petition against tht Eiffel Tower construciton was sent "
        "to the Minister of Works and Commissioner for the Exposition [1]"
    ),
    expected_output=(
        "No document seems to precisely answer your question."
        "However, it is worth noting that a petition against tht Eiffel Tower construciton was sent "
        "to the Minister of Works and Commissioner for the Exposition [1]"
    ),
    references=[
        "A petition against the tower was sent to the Minister of Works and Commissioner for the Exposition, Adolphe Alphand, and it was published by Le Temps on 14 February 1887"
    ]
)

result = evaluator.evaluate(eval_samples=[useful_sample]).evaluations[0]
print("Usefulness (0 or 1):", result.usefulness.usefulness)
print("Justification:", result.usefulness.usefulness_justification)
print("Positive Acceptance:", result.positive_acceptance)
print("Negative Rejection:", result.negative_rejection)

Note that all the results are cached and we can compute the global statistics on all the samples

In [ ]:

evaluation_report = evaluator.evaluate(eval_samples=[
    good_sample,
    irrelevant_sample,
    incomplete_sample,
    unfaithful_sample,
    useful_sample,
]).report
print("Average answer relevancy: ", evaluation_report.answer_relevancy)
print("Average completeness: ", evaluation_report.completeness)
print("Average faithfulness: ", evaluation_report.faithfulness)
print("Average usefulness: ", evaluation_report.usefulness)

Create your own Judge LLM

Since GPT-4 is expensive, let’s create a new evaluator using gpt-4o-mini. For that, we need to adapt the evaluation prompt to the model by using the train set of the GroUSE unit tests. Make sure that the output follows the same format as described in the prompts below.

Answer Relevancy prompt

In [ ]:

relevancy_evaluation_prompt = """# Task
Task: Grounded Question Answering
Based solely on the content of the references, the objective is to generate a response to the user's query. Each statement must be followed by the reference of the source passage, in the format [i] where i is the number of the reference. If no passage seems relevant, the answer should begin with "No document seems to precisely answer your question" and may be supplemented with related sourced information.
# Instructions
I will provide you with two answers, numbered 1 and 2, each containing a response to the user request.
I want you to assign to each answer a relevancy grade between 1 and 5:
- Answer relevancy evaluates if the content of the answer accurately responds to the user's question.
- The truthfulness of the information in the answer does not impact relevancy: even if information that appears false is contained in the answer, as long as this information is related to the request, then relevancy should not decrease. Remember that this information could come from references mentioning imaginary content that you are unaware of: the only thing to evaluate to assign the relevancy grade is therefore the adequacy between the information in the answer and the request, NOT their truthfulness.
- The absence of information in the answer does not impact relevancy, only the information contained in the answer is evaluated.
- Answer relevancy cannot be evaluated if the answer mentions that no document responds to the user request, it is then `null`, regardless of whether it contains other information or not.
Rating scale:
null - The answer asserts that no document precisely responds to the user request. Even if it provides additional information, whether appropriate or not, the relevancy remains `null`.
5 - The answer has excellent relevancy. All information provided in the answer is in line with the question and precisely answers the user request.
4 - The answer achieves good relevancy by providing relevant information to answer the user question. Some information indicated does not exactly answer the question, but remains in line with the request.
3 - The answer has average relevancy, it contains information that allows responding to the user request, but it also contains superfluous information, which was not necessary to answer the request.
2 - The answer shows low relevancy, with some elements related to the request, but the majority of the content is not in line with the question asked.
1 - The answer has very low relevancy, not answering the user's question at all. The content is largely inappropriate or off-topic, delivering no useful information for the request.
Before assigning each grade, you will check that the answer does not contain "No document responds...", if this is the case you must put a grade of `null`. If this is not the case, you will then analyze the adequacy between the request and the information contained in the answer.
Your response should be in JSON format, respecting the following format:
{
    "answer_1": {
        "answer_affirms_no_document_answers": X,
        "answer_relevancy_justification": "...",
        "answer_relevancy": Y
    },
    "answer_2": {
        "answer_affirms_no_document_answers": X,
        "answer_relevancy_justification": "...",
        "answer_relevancy": Y
    }
}
Where "..." is a string, X is a boolean, and Y is an integer between 1 and 5 or `null`.
# Sample
User request: {{ input }}
# To evaluate
Answer 1: {{ expected_output }}
Answer 2: {{ actual_output }}"""

Completeness prompt

In [ ]:

completeness_evaluation_prompt = """# Task
Task: Grounded Question Answering
Based solely on the content of the references, the objective is to generate a response to the user's query. Each statement must be followed by the reference of the source passage, in the format [i] where i is the number of the reference. If no passage seems relevant, the answer should begin with "No document seems to precisely answer your question" and may be supplemented with related sourced information.
# Instructions
I will provide you with two answers, numbered 1 and 2, each containing a response to the user request.
I want you to assign to each answer a completeness grade between 1 and 5:
- The only condition for an answer to be complete is the presence in it of at least all the information from the references that are relevant to the question asked.
- The presence of unrelated information in the answer does not impact completeness.
- The presence of information in the answer not from the references does not impact completeness.
- Possible errors in the sources citing the references do not impact completeness.
- Completeness cannot be evaluated if the references contain no information that can precisely answer the user request, in which case the grade takes the value `null`.
Rating scale:
null - The references contained no relevant information to precisely answer the user's question. In this case, there is no need to read the content of the answer to know that the grade is `null`.
5 - The answer is very complete, it contains all the relevant information from the references. No essential information is omitted, ensuring complete coverage of the question asked.
4 - The answer covers most of the relevant information in depth. It integrates the references satisfactorily, covering the majority of key points. Some details may be missing, but overall, the answer is substantial.
3 - The answer reasonably addresses a number of relevant aspects. It integrates part of the necessary information from the references. However, gaps remain, impacting the overall completeness.
2 - The answer only covers a minimal part of the relevant information. It misses several important information from the references.
1 - The answer covers none of the relevant information, all relevant information from the references has been omitted in the answer.
Before assigning each grade, you will always start by analyzing the information found in the references that are relevant to the user request. If there is no relevant information in the references, completeness must be `null`. If there are relevant information in the references, you will analyze which portion of this information is present or absent in the answers to evaluate the completeness grade. Your response should be in JSON format, respecting the following format:
{
    "answer_1": {
        "completeness_justification": "...",
        "completeness": X
    },
    "answer_2": {
        "completeness_justification": "...",
        "completeness": X
    }
}
Where "..." is a string, and X is an integer between 1 and 5 or `null`.

# SAMPLE
List of references :
{%- for context in contexts %}
Reference {{ loop.index }}: {{ context }}
{%- endfor %}
User request: {{ input }}
# To evaluate
Answer 1: {{ expected_output }}
Answer 2: {{ actual_output }}"""

Faithfulness prompt

In [ ]:

faithfulness_evaluation_prompt = """# Task
Task: Grounded Question Answering
Based solely on the content of the references, the objective is to generate a response to the user's query. Each statement must be followed by the reference of the source passage, in the format [i] where i is the number of the reference. If no passage seems relevant, the answer should begin with "No document seems to precisely answer your question" and may be supplemented with related sourced information.
# Instructions
I will provide you with two answers, numbered 1 and 2, each containing a response to the user request.
I want you to assign to each answer a boolean faithfulness grade. An answer is faithful if:
- Each statement made by the answer is followed by a source indicating the reference from which it is drawn.
- The information preceding the source is indeed from the corresponding reference.
- The information preceding the source is in agreement with the corresponding reference, and does not assert facts different from those indicated in the reference.
In all other cases, the response is considered non-faithful.
Faithfulness is also considered non-measurable if the answer asserts that no document responds to the question, and it does not provide any related information, it is then `null`.
Rating scale:
null - The answer asserts that no document responds to the question, and does not provide any related information.
1 - All sentences in the answer cite their sources, and are in agreement with the cited sources.
0 - At least one sentence in the response does not cite its sources, or cites a wrong source, or modifies the content from the references, or asserts something that is not supported by the cited references.
Before assigning each grade, you will start by verifying that the answer does not only assert "No document responds...", without any other information. If this is the case, then faithfulness must be `null`. Otherwise, I want you to analyze by explaining for each sentence, one after the other, if 1) a reference follows the sentence, 2) the reference following the sentence is correct, and 3) if the sentence does not distort or modify the content of the references. Your response should be in JSON format, respecting the following format:
{
    "answer_1": {
        "answer_only_asserts_no_document_answers": X,
        "content_analysis_sentence_by_sentence": [
            {
                "sentence": "...",
                "criterion_1": "...",
                "criterion_2": "...",
                "criterion_3": "..."
            },
            ...
        ],
        "faithfulness_justification": "...",
        "faithfulness": Y
    },
    "answer_2": {
        "answer_only_asserts_no_document_answers": X,
        "content_analysis_sentence_by_sentence": [
            {
            "sentence": "...",
            "criterion_1": "...",
            "criterion_2": "...",
            "criterion_3": "..."
            },
            ...
        ],
        "faithfulness_justification": "...",
        "faithfulness": Y
    }
}
Where "..." is a string, X is a boolean, and Y is either a boolean or `null`.
# Sample
List of references :
{%- for context in contexts %}
Reference {{ loop.index }}: {{ context }}
{%- endfor %}
# To evaluate
Answer 1: {{ expected_output }}
Answer 2: {{ actual_output }}"""

Usefulness Prompt

In [ ]:

usefulness_evaluation_prompt = """# Task
Task: Grounded Question Answering
Based solely on the content of the references, the objective is to generate a response to the user's query. Each statement must be followed by the reference of the source passage, in the format [i] where i is the number of the reference. If no passage seems relevant, the answer should begin with "No document seems to precisely answer your question" and may be supplemented with related sourced information.
# Instructions
I will provide you with two answers, numbered 1 and 2, each containing a response to the user request.
I want you to assign to each answer a usefulness grade of 0 or 1:
- Usefulness is only evaluated when the answer says that no document precisely answers the user's question, but it still provides information related to the question.
- Usefulness measures how interesting the related information is to know for the user, given that there is no answer in the references.
- If the answer responds to the user request, usefulness must be `null`.
- If the answer indicates that no document responds to the user request, without adding other information, usefulness must be `null`.
Rating scale:
null - (The answer responds to the user request) OR (the answer does not answer the user's question AND does not provide any related information).
1 - The related information is generally related to the question and adds value to the general understanding of the topic.
0 - The related information is completely off-topic with respect to the question asked.
Before assigning each grade, you will start by verifying that the answer indeed asserts "No document responds...", then you will check that the answer contains related information in addition to this assertion. If one of these two conditions is `false` then usefulness must be `null`.
If both conditions are indeed true, then you will analyze the usefulness of having added this related information to evaluate the usefulness grade. Your response should be in JSON format, respecting the following format:
{
    "answer_1": {
        "answer_affirms_no_document_answers": X,
        "answer_contains_related_information": X,
        "usefulness_justification": "...",
        "usefulness": Y
    },
    "answer_2": {
        "answer_affirms_no_document_answers": X,
        "answer_contains_related_information": X,
        "usefulness_justification": "...",
        "usefulness": Y
    }
}
Where "..." is a string, X is a boolean, and Y is an integer that is 0 or 1 or `null`.
# Sample
User request: {{ input }}
# To evaluate
Answer 1: {{ expected_output }}
Answer 2: {{ actual_output }}"""

Save prompts to use them later

In [ ]:

prompts_path = "gpt4o_mini_prompts"
os.makedirs(prompts_path, exist_ok=True)
with open(os.path.join(prompts_path, "answer_relevancy.txt.jinja"), "w") as file:
    file.write(relevancy_evaluation_prompt)
with open(os.path.join(prompts_path, "completeness.txt.jinja"), "w") as file:
    file.write(completeness_evaluation_prompt)
with open(os.path.join(prompts_path, "faithfulness.txt.jinja"), "w") as file:
    file.write(faithfulness_evaluation_prompt)
with open(os.path.join(prompts_path, "usefulness.txt.jinja"), "w") as file:
    file.write(usefulness_evaluation_prompt)

Check results on train set

In [ ]:

meta_evaluations = meta_evaluate_pipeline("gpt-4o-mini", prompts_path, train_set=True)
print("Aggregated metrics")
print(meta_evaluations.report)

It is an encouraging beginning. But, we can iterate on the prompts above to try to have better scores with GPT-4o-mini. Still, it will be difficult to have a performance as good as GPT-4.

Once you are happy with your prompts, you can evaluate the Judge model on the test set

In [ ]:

meta_evaluations = meta_evaluate_pipeline("gpt-4o-mini", prompts_path, train_set=False)
meta_evaluations.report

Limitations

Unit tests can help you assess the limits of your judge LLM on edge cases but don’t guarantee that your judge LLM will be perfect. Be cautious when analysing the results.

References

@misc{muller2024grousebenchmarkevaluateevaluators,
      title={GroUSE: A Benchmark to Evaluate Evaluators in Grounded Question Answering}, 
      author={Sacha Muller and António Loison and Bilel Omrani and Gautier Viaud},
      year={2024},
      eprint={2409.06595},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2409.06595}, 
}

Safety note: This is not a prescription or diagnosis. For severe symptoms, pregnancy danger signs, children with serious illness, chest pain, breathing difficulty, stroke-like weakness, or major injury, seek urgent care.

Which doctor may help?

Start with a registered doctor or the nearest qualified health center.

What to tell the doctor

Write when the problem started and how it changed.
Bring old prescriptions, investigation reports, and current medicines.
Write allergies, pregnancy status, diabetes, kidney/liver disease, and major past illnesses.
Bring one family member if the patient is weak, elderly, confused, or a child.

Questions to ask

What is the most likely cause of my symptoms?
Which danger signs mean I should go to hospital quickly?
Which tests are necessary now, and which can wait?
How should I take medicines safely and what side effects should I watch for?
When should I come for follow-up?

Tests to discuss

Vital signs: temperature, pulse, blood pressure, oxygen saturation
Basic physical examination by a clinician
CBC, urine test, blood sugar, or imaging only when clinically needed

Avoid these mistakes

Do not use antibiotics, steroid tablets/injections, or strong painkillers without proper medical advice.
Do not hide pregnancy, kidney disease, ulcer, allergy, or blood thinner use.
Do not delay emergency care when danger signs are present.

Medicine safety and first-aid guide

This section is for patient education only. It does not replace a doctor, pharmacist, or emergency care.

Safe first steps

Rest, drink safe water, and observe symptoms carefully.
Keep a written note of symptoms, duration, temperature, medicines already taken, and allergy history.
Seek medical care quickly if symptoms are severe, worsening, or unusual for the patient.

OTC medicine safety

For mild pain or fever, ask a registered pharmacist or doctor before using common over-the-counter pain/fever medicines.
Do not combine multiple pain medicines without advice, especially if you have kidney disease, liver disease, stomach ulcer, asthma, pregnancy, or take blood thinners.
Do not give adult medicines to children unless a qualified clinician advises it.

Avoid these mistakes

Do not start antibiotics without a proper medical decision.
Do not use steroid tablets or injections casually for quick relief.
Do not delay emergency care because of home remedies.

Get urgent help if

Severe symptoms, confusion, fainting, breathing difficulty, chest pain, severe dehydration, or sudden weakness need urgent medical care.

Medicine names, dose, and timing must be decided by a qualified clinician or pharmacist after checking age, pregnancy, allergy, other diseases, and current medicines.

For rural patients and family caregivers

Patient health record and symptom diary

Write your symptoms, medicines already taken, test results, and questions before visiting a doctor. This note stays on your device unless you print or copy it.

Doctor to discuss: Doctor / qualified healthcare provider

Tests to discuss with doctor

Basic vital signs: temperature, pulse, blood pressure, oxygen level if needed
Relevant blood, urine, imaging, or specialist tests only after clinical assessment

Questions to ask

What is the most likely cause of my symptoms?
Which warning signs mean I should go to emergency care?
Which tests are really needed now?
Which medicines are safe for my age, pregnancy status, allergy, kidney/liver/stomach condition, and current medicines?

Emergency warning signs such as chest pain, severe breathing difficulty, sudden weakness, confusion, severe dehydration, major injury, or loss of bladder/bowel control need urgent medical care. Do not wait for online information.

Go to emergency care if you notice:

Severe or rapidly worsening symptoms
Breathing difficulty, chest pain, fainting, confusion, severe weakness, major injury, or severe dehydration

Doctor / service to discuss: Qualified healthcare provider; specialist depends on symptoms and examination.

Step 1
Check danger signs first

If danger signs are present, seek emergency care and do not wait for online information.
Step 2
Record the symptom story

Write when symptoms started, severity, medicines already taken, allergies, pregnancy status, and test results.
Step 3
Visit a qualified clinician

A doctor, nurse, or qualified healthcare provider can examine you and decide which tests or treatment are needed.
Step 4
Do only useful tests

Do tests after clinical assessment. Avoid unnecessary tests, random antibiotics, or repeated medicines without diagnosis.
Step 5
Follow up and return early if worse

If symptoms worsen, new warning signs appear, or treatment is not helping, return for review quickly.

Rural patient practical tips

Take a written symptom diary and all previous prescriptions/test reports.
Do not hide medicines already taken, even herbal or over-the-counter medicines.
Ask which warning signs mean urgent referral to hospital.

This roadmap is for education. A real diagnosis and treatment plan requires history, examination, and clinical judgment.

Internal learning pathway

Explore related RX articles

Related guides from RX Harun are grouped to help readers move from overview to symptoms, tests, treatment, and safe next steps.

Rx Data Science and Artificial Intelligence

Semantic Chunks for RAG DefinitionIn order to abide by the context window of the LLM , we usually break text…
Deep Evaluation of RAG Systems using deepeval DefinitionThis code demonstrates the use of the deepeval library to perform comprehensive evaluations of Retrieval-Augmented Generation (RAG) systems.…
Simple RAG with Llamaindex DefinitionThis code implements a basic Retrieval-Augmented Generation (RAG) system for processing and querying PDF document(s). The…
Simple RAG (Retrieval-Augmented Generation) System DefinitionThis code implements a basic Retrieval-Augmented Generation (RAG) system for processing and querying PDF documents. The…
Simple RAG (Retrieval-Augmented Generation) System for CSV Files DefinitionThis code implements a basic Retrieval-Augmented Generation (RAG) system for processing and querying CSV documents. The…
Simple RAG (Retrieval-Augmented Generation) System for CSV Files DefinitionThis code implements a basic Retrieval-Augmented Generation (RAG) system for processing and querying CSV documents. The…

Read, save, and share this guide

Article Summary

Key Takeaways

RX Patient Tools

Motivation

Key Components

Method Details

The task we want to assess: Grounded Question Answering

Evaluation Metrics

1. Answer Relevancy

2. Completeness

3. Faithfulness

4. Usefulness

5. Positive Acceptance

6. Negative Rejection

Benefits of the approach

Implementation details

Tutorial

Import libraries

Avoid nested asyncio loops inside notebooks (this line is not needed if you run the code in a Python script)

Setup your API key

Initialize the evaluator

Evaluate a good answer

How does it behave with an irrelevant answer?

Evaluation of an incomplete sample

Evaluation of an unfaithful sample

Evaluation of information utility in case there is no answer to the question in the references

Note that all the results are cached and we can compute the global statistics on all the samples

Create your own Judge LLM

Answer Relevancy prompt

Completeness prompt

Faithfulness prompt

Usefulness Prompt

Save prompts to use them later

Check results on train set

Once you are happy with your prompts, you can evaluate the Judge model on the test set

Limitations

References

Related Articles

Prepare before seeing a doctor

Which doctor may help?

What to tell the doctor

Questions to ask

Tests to discuss

Avoid these mistakes

Medicine safety and first-aid guide

Safe first steps

OTC medicine safety

Avoid these mistakes

Get urgent help if

Patient health record and symptom diary

Care roadmap for: RAG Evaluation and Meta-Evaluation with GroUSE

Check danger signs first

Record the symptom story

Visit a qualified clinician

Do only useful tests

Follow up and return early if worse

Explore related RX articles

To Get Daily Health Newsletter