Improve BERT inference

Patient Tools

Read, save, and share this guide

Use these quick tools to make this medical article easier to read, print, save, or share with a family member.

Patient Mode

Understand this article easily

Switch between simple English and easy Bangla patient notes. This is for education and does not replace a doctor consultation.

Today’s best-performing language processing models use huge neural architectures with hundreds of millions of parameters. State-of-the-art transformer-based architectures like BERT are available as pretrained models for anyone to use for any language task. The big models have outstanding accuracy, but they are difficult to use...

For severe symptoms, danger signs, pregnancy, child illness, or sudden worsening, seek urgent medical care.

বাংলা রোগী নোট এখনো যোগ করা হয়নি। পোস্ট এডিটরে “RX Bangla Patient Mode” বক্স থেকে সহজ বাংলা সারাংশ যোগ করুন।

এই তথ্য শিক্ষা ও সচেতনতার জন্য। এটি ডাক্তারি পরীক্ষা, রোগ নির্ণয় বা প্রেসক্রিপশনের বিকল্প নয়।

Article Summary

Today’s best-performing language processing models use huge neural architectures with hundreds of millions of parameters. State-of-the-art transformer-based architectures like BERT are available as pretrained models for anyone to use for any language task. The big models have outstanding accuracy, but they are difficult to use in practice. These models are resource hungry due to a large number of parameters. These issues become worse when serving...

Key Takeaways

  • This article explains Overview of Optimum Intel and quantization aware training in simple medical language.
  • This article explains Overview of ONNXRuntime, and OpenVINO™ Execution Provider in simple medical language.
  • This article explains Putting the tools together to achieve better performance in simple medical language.
  • This article explains Converting PyTorch FP32 model to INT8 ONNX model with QAT in simple medical language.
Educational health guideWritten for patient understanding and clinical awareness.
Reviewed content workflowUse writer and reviewer profiles for stronger trust.
Emergency safety firstUrgent warning signs are highlighted below.

Seek urgent medical care if you notice

These warning signs are general safety guidance. Local emergency numbers and clinical judgment should always come first.

  • Severe symptoms, breathing difficulty, fainting, confusion, or rapidly worsening illness.
  • New weakness, severe pain, high fever, or symptoms after a serious injury.
  • Any symptom that feels urgent, unusual, or unsafe for the patient.
1

Emergency now

Use emergency care for severe, sudden, rapidly worsening, or life-threatening symptoms.

2

See a doctor

Book a professional medical evaluation if symptoms persist, worsen, recur often, affect daily activities, or occur in a high-risk patient.

3

Learn safely

Use this article to understand possible causes, tests, treatment options, prevention, and questions to ask your clinician.

Before reading

RX Patient Tools

Use these quick guides before reading the article, or return to them when you need help preparing questions for a doctor.

Start here Choose the right pathway for symptoms, reports, medicines, or urgent warning signs. Disease article roadmap Read this topic step by step: meaning, symptoms, warning signs, diagnosis, treatment, prevention, and follow-up. Treatment planner Prepare questions about treatment choices, benefits, risks, side effects, and follow-up. Family & caregiver guide Organize symptoms, reports, medicines, questions, and follow-up safely. Nutrition & diet guide Prepare food, hydration, supplement, and medicine-timing questions safely. Prevention guide Organize risk factors, protective habits, screening, and warning signs. Recovery guide Prepare a safe plan for activity, rehabilitation, warning signs, and follow-up.

Today’s best-performing language processing models use huge neural architectures with hundreds of millions of parameters. State-of-the-art transformer-based architectures like BERT are available as pretrained models for anyone to use for any language task.

The big models have outstanding accuracy, but they are difficult to use in practice. These models are resource hungry due to a large number of parameters. These issues become worse when serving the fine-tuned model and it requires a lot of memory and time to process a single message. A state-of-the-art model is not good if it can handle only one message per second. To improve the throughput, we need to accelerate the well-performing BERT model, by reducing the computation or the number of operations with the help of quantization.

Overview of Optimum Intel and quantization aware training

Optimum Intel is an extension for the Hugging Face Optimum library with OpenVINO™ runtime as a backend for the Transformers architectures. It also provides an interface to Intel® NNCF (Neural Network Compression Framework) package. It helps implement Intel’s optimizations through NNCF with changes to just a few lines of code in the training pipeline.

Quantization aware training (QAT) is a widely used technique for optimizing models during training. It inserts nodes into the neural network during training that simulates the effect of lower precision. This allows the training algorithm to consider quantization errors as part of the overall training loss that gets minimized during training. QAT has better accuracy and reliability than carrying out quantization after the model has been trained. The output after training with our tool is a quantized PyTorch model, ONNX model, and IR.xml.

Overview of ONNXRuntime, and OpenVINO™ Execution Provider

ONNX Runtime is an open source project that is designed to accelerate machine learning across a wide range of frameworks, operating systems, languages, and hardware platforms. It enables the acceleration of machine learning inferencing across all of your deployment targets using a single set of APIs.

Intel and Microsoft joined hands to create the OpenVINO™ Execution Provider (OVEP) for ONNX Runtime, which enables ONNX models for running inference using ONNX Runtime APIs while using the OpenVINO™ Runtime as a backend. With the OpenVINO™ Execution Provider, ONNX Runtime delivers better inferencing performance on the same hardware compared to generic acceleration on Intel® CPU, GPU, and VPU. Now you’ve got a basic understanding of quantization, ONNX Runtime, and OVEP, let’s take the best of both worlds and stitch the story together.

Putting the tools together to achieve better performance

In our next steps, we will be doing quantization aware training using Optimum-Intel and Inference using Optimum-ORT with OpenVINO™ Execution Provider through Azure Machine Learning. Optimum can be used to load optimized models from the Hugging Face Hub and create pipelines to run accelerated inferences.

Converting PyTorch FP32 model to INT8 ONNX model with QAT

When utilizing the Hugging Face training pipelines all you need is to update a few lines of code and you can invoke the NNCF optimizations for quantizing the model. The output of this would be an optimized INT8 PyTorch model, ONNX model, and OpenVINO™ IR. See the flow diagram below:

Improve BERT inference

For this case study, we have chosen the bert-squad pretrained model from Hugging Face. This has been pretrained on the SQuAD dataset for the question-answering use case. QAT can be applied by replacing the Transformers Trainer with the Optimum (OVTrainer). See below:

from trainer_qa import QuestionAnsweringOVTrainer
Run the training pipeline
1. Import OVConfig:
from optimum.intel.openvino import OVConfig
from trainer_qa import QuestionAnsweringOVTrainer
2. Initialize a config from the
    ov_config = OVConfig()
    
3. Initialize our Trainer
    trainer = QuestionAnsweringOVTrainer()

Comparison of FP32 model and INT8 ONNX model with Netron model visualization tool

When compared with FP32, the INT8 model has QuantizeLinear and DequantizeLinear operations added to mimic the lower precision after the QAT.

Improve BERT inference
Fig1: FP32 model
Improve BERT inference
Fig2: INT8 model

 

To replicate this example check out the reference code with detailed instructions on QAT and Inference using OpenVINO and Azure Machine Learning Jupyter Notebooks on GitHub.

Performance improvement results

Accuracy Original FP32 QAT INT8 Explanation
F1 93.1 92.83 In this case, it’s computed over the individual words in the prediction against those in the True Answer. The number of shared words between the prediction and the truth is the basis of the F1 score: precision is the ratio of the number of shared words to the total number of words in the prediction, and recall is the ratio of the number of shared words to the total number of words in the ground truth.
Eval_exact 86.91 86.94 This metric is as simple as it sounds. For each question + answer pair, if the characters of the model’s prediction exactly match the characters of (one of) the True Answer(s), EM = 1, otherwise EM = 0. This is a strict all-or-nothing metric; being off by a single character results in a score of 0. When assessing against a negative example, if the model predicts any text at all, it automatically receives a 0 for that example.

Comparison of ONNXRUNTIME_PERF_TEST application for ONNX-FP32 and ONNX-INT8 models

We’ve used ONNXRuntime APIs for running inference for the BERT model. As you can see the performance for the INT8 optimized model improved almost to 2.95x when compared to FP32 without much compromise in the accuracy.

Improve BERT inference

Quantized PyTorch, ONNX, and INT8 models can also be served using OpenVINO™ Model Server for high-scalability and optimization for Intel® solutions so that you can take advantage of all the power of the Intel® Xeon® processor or Intel’s AI accelerators and expose it over a network interface.

Optimize speed and performance

As neural networks move from servers to the edge, optimizing speed and size becomes even more important. In this blog, we gave an overview of how to use open source tooling to make it easy to improve performance.

Doctor visit helper

Prepare before seeing a doctor

A simple rural-patient checklist to help you explain symptoms clearly, ask better questions, and avoid unsafe self-treatment.

Safety note: This is not a prescription or diagnosis. For severe symptoms, pregnancy danger signs, children with serious illness, chest pain, breathing difficulty, stroke-like weakness, or major injury, seek urgent care.

Which doctor may help?

Start with a registered doctor or the nearest qualified health center.

What to tell the doctor

  • Write when the problem started and how it changed.
  • Bring old prescriptions, investigation reports, and current medicines.
  • Write allergies, pregnancy status, diabetes, kidney/liver disease, and major past illnesses.
  • Bring one family member if the patient is weak, elderly, confused, or a child.

Questions to ask

  • What is the most likely cause of my symptoms?
  • Which danger signs mean I should go to hospital quickly?
  • Which tests are necessary now, and which can wait?
  • How should I take medicines safely and what side effects should I watch for?
  • When should I come for follow-up?

Tests to discuss

  • Vital signs: temperature, pulse, blood pressure, oxygen saturation
  • Basic physical examination by a clinician
  • CBC, urine test, blood sugar, or imaging only when clinically needed

Avoid these mistakes

  • Do not use antibiotics, steroid tablets/injections, or strong painkillers without proper medical advice.
  • Do not hide pregnancy, kidney disease, ulcer, allergy, or blood thinner use.
  • Do not delay emergency care when danger signs are present.

Medicine safety and first-aid guide

This section is for patient education only. It does not replace a doctor, pharmacist, or emergency care.

Safe first steps

  • Rest, drink safe water, and observe symptoms carefully.
  • Keep a written note of symptoms, duration, temperature, medicines already taken, and allergy history.
  • Seek medical care quickly if symptoms are severe, worsening, or unusual for the patient.

OTC medicine safety

  • For mild pain or fever, ask a registered pharmacist or doctor before using common over-the-counter pain/fever medicines.
  • Do not combine multiple pain medicines without advice, especially if you have kidney disease, liver disease, stomach ulcer, asthma, pregnancy, or take blood thinners.
  • Do not give adult medicines to children unless a qualified clinician advises it.

Avoid these mistakes

  • Do not start antibiotics without a proper medical decision.
  • Do not use steroid tablets or injections casually for quick relief.
  • Do not delay emergency care because of home remedies.

Get urgent help if

  • Severe symptoms, confusion, fainting, breathing difficulty, chest pain, severe dehydration, or sudden weakness need urgent medical care.
Medicine names, dose, and timing must be decided by a qualified clinician or pharmacist after checking age, pregnancy, allergy, other diseases, and current medicines.

For rural patients and family caregivers

Patient health record and symptom diary

Write your symptoms, medicines already taken, test results, and questions before visiting a doctor. This note stays on your device unless you print or copy it.

Doctor to discuss: Doctor / qualified healthcare provider
Tests to discuss with doctor
  • Basic vital signs: temperature, pulse, blood pressure, oxygen level if needed
  • Relevant blood, urine, imaging, or specialist tests only after clinical assessment
Questions to ask
  • What is the most likely cause of my symptoms?
  • Which warning signs mean I should go to emergency care?
  • Which tests are really needed now?
  • Which medicines are safe for my age, pregnancy status, allergy, kidney/liver/stomach condition, and current medicines?

Emergency warning signs such as chest pain, severe breathing difficulty, sudden weakness, confusion, severe dehydration, major injury, or loss of bladder/bowel control need urgent medical care. Do not wait for online information.

Safe pathway to proper treatment

Care roadmap for: Improve BERT inference

Use this simple roadmap to understand the next safe steps. It is educational and does not replace examination by a doctor.

Go to emergency care if you notice:
  • Severe or rapidly worsening symptoms
  • Breathing difficulty, chest pain, fainting, confusion, severe weakness, major injury, or severe dehydration
Doctor / service to discuss: Qualified healthcare provider; specialist depends on symptoms and examination.
  1. Step 1

    Check danger signs first

    If danger signs are present, seek emergency care and do not wait for online information.

  2. Step 2

    Record the symptom story

    Write when symptoms started, severity, medicines already taken, allergies, pregnancy status, and test results.

  3. Step 3

    Visit a qualified clinician

    A doctor, nurse, or qualified healthcare provider can examine you and decide which tests or treatment are needed.

  4. Step 4

    Do only useful tests

    Do tests after clinical assessment. Avoid unnecessary tests, random antibiotics, or repeated medicines without diagnosis.

  5. Step 5

    Follow up and return early if worse

    If symptoms worsen, new warning signs appear, or treatment is not helping, return for review quickly.

Rural patient practical tips
  • Take a written symptom diary and all previous prescriptions/test reports.
  • Do not hide medicines already taken, even herbal or over-the-counter medicines.
  • Ask which warning signs mean urgent referral to hospital.

This roadmap is for education. A real diagnosis and treatment plan requires history, examination, and clinical judgment.

RX Patient Help

Ask a health question safely

Write your symptom story. A health professional or site editor can review it before any answer is prepared. This box is not for emergency care.

Emergency first: Severe chest pain, breathing trouble, unconsciousness, stroke signs, severe injury, heavy bleeding, or rapidly worsening symptoms need urgent local medical care now.

Frequently Asked Questions

Is this article a replacement for a doctor?

No. It is educational content only. Patients should consult a qualified clinician for diagnosis and treatment.

When should I seek urgent care?

Seek urgent care for severe symptoms, rapidly worsening condition, breathing difficulty, severe pain, neurological changes, or any emergency warning sign.

References

Add references, clinical guidelines, textbooks, journal articles, or trusted medical sources here. You can edit this area from the RX Article Professional Blocks panel.