Optimizing and deploying transformer INT8 inference

Patient Tools

Read, save, and share this guide

Use these quick tools to make this medical article easier to read, print, save, or share with a family member.

Patient Mode

Understand this article easily

Switch between simple English and easy Bangla patient notes. This is for education and does not replace a doctor consultation.

Transformer-based models have revolutionized the natural language processing (NLP) domain. Ever since its inception, transformer architecture has been integrated into models like Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-trained Transformer (GPT) for performing tasks such as text generation or summarization and question and...

For severe symptoms, danger signs, pregnancy, child illness, or sudden worsening, seek urgent medical care.

বাংলা রোগী নোট এখনো যোগ করা হয়নি। পোস্ট এডিটরে “RX Bangla Patient Mode” বক্স থেকে সহজ বাংলা সারাংশ যোগ করুন।

এই তথ্য শিক্ষা ও সচেতনতার জন্য। এটি ডাক্তারি পরীক্ষা, রোগ নির্ণয় বা প্রেসক্রিপশনের বিকল্প নয়।

Article Summary

Transformer-based models have revolutionized the natural language processing (NLP) domain. Ever since its inception, transformer architecture has been integrated into models like Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-trained Transformer (GPT) for performing tasks such as text generation or summarization and question and answering to name a few. The newer models are getting bigger in size by stacking more transformer layers and larger...

Key Takeaways

  • This article explains Setup in simple medical language.
  • This article explains INT8 optimization in simple medical language.
  • This article explains Results in simple medical language.
  • This article explains At the end in simple medical language.
Educational health guideWritten for patient understanding and clinical awareness.
Reviewed content workflowUse writer and reviewer profiles for stronger trust.
Emergency safety firstUrgent warning signs are highlighted below.

Seek urgent medical care if you notice

These warning signs are general safety guidance. Local emergency numbers and clinical judgment should always come first.

  • Severe symptoms, breathing difficulty, fainting, confusion, or rapidly worsening illness.
  • New weakness, severe pain, high fever, or symptoms after a serious injury.
  • Any symptom that feels urgent, unusual, or unsafe for the patient.
1

Emergency now

Use emergency care for severe, sudden, rapidly worsening, or life-threatening symptoms.

2

See a doctor

Book a professional medical evaluation if symptoms persist, worsen, recur often, affect daily activities, or occur in a high-risk patient.

3

Learn safely

Use this article to understand possible causes, tests, treatment options, prevention, and questions to ask your clinician.

Before reading

RX Patient Tools

Use these quick guides before reading the article, or return to them when you need help preparing questions for a doctor.

Start here Choose the right pathway for symptoms, reports, medicines, or urgent warning signs. Disease article roadmap Read this topic step by step: meaning, symptoms, warning signs, diagnosis, treatment, prevention, and follow-up. Treatment planner Prepare questions about treatment choices, benefits, risks, side effects, and follow-up. Family & caregiver guide Organize symptoms, reports, medicines, questions, and follow-up safely. Nutrition & diet guide Prepare food, hydration, supplement, and medicine-timing questions safely. Prevention guide Organize risk factors, protective habits, screening, and warning signs. Recovery guide Prepare a safe plan for activity, rehabilitation, warning signs, and follow-up.

Transformer-based models have revolutionized the natural language processing (NLP) domain. Ever since its inception, transformer architecture has been integrated into models like Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-trained Transformer (GPT) for performing tasks such as text generation or summarization and question and answering to name a few. The newer models are getting bigger in size by stacking more transformer layers and larger input sequence lengths, which in turn, has led to improvements in model accuracy but comes at a cost of higher inference times.

NVIDIA TensorRT is an SDK for high-performance deep learning inference on NVIDIA GPUs. It includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for inference. One of the key features of TensorRT is that it allows the models to be deployed in reduced precisions like FP16 and INT8 without compromising on accuracy. Recently, Bing announced the support of running their transformer models on Azure T4 GPUs leveraging TensorRT INT8 optimization. Starting with TensorRT 8.0, users can now see down to 1.2ms inference latency using INT8 optimization on BERT Large.

Many of these transformer models from different frameworks (such as PyTorch and TensorFlow) can be converted to the Open Neural Network Exchange (ONNX) format, which is the open standard format representing AI and deep learning models for further optimizations. ONNX Runtime is a high-performance inference engine to run machine learning models, with multi-platform support and a flexible execution provider interface to integrate hardware-specific libraries. As shown in Figure 1, ONNX Runtime integrates TensorRT as one execution provider for model inference acceleration on NVIDIA GPUs by harnessing the TensorRT optimizations. Based on the TensorRT capability, ONNX Runtime partitions the model graph and offloads the parts that TensorRT supports to TensorRT execution provider for efficient model execution on NVIDIA hardware.

Optimizing and deploying transformer INT8 inference
Figure 1: Different execution providers supported by ONNX Runtime.

In this blog, we will be using the HuggingFace BERT model, apply TensorRT INT8 optimizations, and accelerate the inference with ONNX Runtime with TensorRT execution provider.

Setup

To get started, you can clone the transformer repository from the HuggingFace Github page.

$ git clone https://github.com/huggingface/transformers.git
$ cd transformers

Then, you can build and launch the docker container using the following steps which uses the NGC PyTorch container image.

$ docker build . -f examples/research_projects/quantization-qdqbert/Dockerfile -t bert_quantization:latest

$ docker run --gpus all --privileged --rm -it --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 bert_quantization:latest

Once inside the container, navigate to the quantization directory.

$ cd transformers/examples/research_projects/quantization-qdqbert/

INT8 optimization

Model quantization is becoming popular in the deep learning optimization methods to use the 8-bit integers calculations for using the faster and cheaper 8-bit Tensor Cores. This, in turn, can be used to compute convolution and matrix-multiplication operations yielding more throughput, which is particularly effective on compute-limited layers.

Quantization Toolkit

TensorRT Quantization Toolkit for PyTorch provides a convenient tool to train and evaluate PyTorch models with simulated quantization. This library can automatically or manually add quantization to PyTorch models and the quantized model can be exported to ONNX and imported by TensorRT 8.0 and later.

If you already have an ONNX model, you can directly apply ONNX Runtime quantization tool with Post Training Quantization (PTQ)  for running with ONNX Runtime-TensorRT quantization. Please refer to this example for more details. This blog focuses on starting with a PyTorch model.

HuggingFace QDQBERT model

The HuggingFace QDQBERT model starts from the HuggingFace BERT model, and uses TensorRT Quantization Toolkit for PyTorch to insert Q/DQ nodes into the network. Fake quantization operations (pairs of QuantizeLinear/DequantizeLinear ops) are added to (1) linear layer inputs and weights, (2) matmul inputs, (3) residual add inputs, in the BERT model. After that, the QDQBERT model is exported to ONNX format, which can be imported into TensorRT. The QDQBERT model can be loaded from any checkpoint of HuggingFace BERT model (for example bert-large-uncased), and perform Quantization Aware Training (QAT) or Post Training Quantization (PTQ) afterwards.

Launch the following command to first perform calibration:

python3 run_quant_qa.py \
--model_name_or_path bert-large-uncased \
--dataset_name squad \
--max_seq_length 128 \
--doc_stride 32 \
--output_dir calib/bert-large-uncased \
--do_calib \
--calibrator percentile \
--percentile 99.99

And then the QAT can be launched by executing the script:

python3 run_quant_qa.py \
--model_name_or_path calib/bert-large-uncased \
--dataset_name squad \
--do_train \
--do_eval \
--per_device_train_batch_size 12 \
--learning_rate 4e-5 \
--num_train_epochs 2 \
--max_seq_length 128 \
--doc_stride 32 \
--output_dir finetuned_int8/bert-large-uncased \
--tokenizer_name bert-large-uncased \
--save_steps 0

At a high level, TensorRT processes ONNX models with Q/DQ operators similarly to how TensorRT processes any other ONNX model: TensorRT imports an ONNX model containing Q/DQ operations. It performs a set of optimizations that are dedicated to Q/DQ processing. It continues to perform the general optimization passes. It builds a platform-specific, execution-plan file for inference execution. This plan file contains quantized operations and weights.

Thus, you can now export the fine-tuned model with Q/DQ operations to the ONNX format using the following:

python3 run_quant_qa.py \
--model_name_or_path finetuned_int8/bert-large-uncased \
--output_dir ./ \
--save_onnx \
--per_device_eval_batch_size 1 \
--max_seq_length 128 \
--doc_stride 32 \
--dataset_name squad \
--tokenizer_name bert-large-uncased

Starting from TensorRT 8.0, TensorRT processes Q/DQ networks with new optimizations, which increases Q/DQ model performance and provides predictable and user-controlled arithmetic precision transitions.

Results

Experiments of inferencing performance are performed on NVIDIA A100, using ONNX Runtime 1.11 and TensorRT 8.2 with HuggingFace BERT-large model. The inference task is SQuAD, with INT8 quantization by the HuggingFace QDQBERT-large model.

The benchmarking can be done using either trtexec:

trtexec --onnx=model.onnx --explicitBatch --workspace=16384 --int8 --shapes=input_ids:64x128,attention_mask:64x128,token_type_ids:64x128 --verbose

We also have the python script which uses the ONNX Runtime with TensorRT execution provider and can also be used instead:

python3 ort-infer-benchmark.py

With the optimizations of ONNX Runtime with TensorRT EP, we are seeing up to seven times speedup over PyTorch inference for BERT Large and BERT Base, with latency under 2 ms and 1 ms respectively for BS=1. The figures below show the inference latency comparison when running the BERT Large with sequence length 128 on NVIDIA A100.

Optimizing and deploying transformer INT8 inference
Figure 2: Compute latency comparison between ONNX Runtime-TensorRT and PyTorch for running BERT-Large on NVIDIA A100 GPU for sequence length 128.

You can also check the accuracy of the INT8 model using the following script:

python3 evaluate-hf-trt-qa.py \
--onnx_model_path=./model.onnx \
--output_dir ./ \
--per_device_eval_batch_size 64 \
--max_seq_length 128 \
--doc_stride 32 \
--dataset_name squad \
--tokenizer_name bert-large-uncased \
--int8 \
--seed 42

Accuracy metrics with ONNX Runtime-TensorRT 8.2 EP for the SQuAD task are:

INT8 FP16 FP32
F1 score 87.52263875 87.69072304 87.96610141

At the end

ONNX Runtime-TensorRT INT8 quantization shows very promising results on NVIDIA GPUs. We’d love to hear any feedback or suggestions as you try it in your production scenarios. You can submit feedback by participating in our GitHub repos (TensorRT and ONNX Runtime).

Doctor visit helper

Prepare before seeing a doctor

A simple rural-patient checklist to help you explain symptoms clearly, ask better questions, and avoid unsafe self-treatment.

Safety note: This is not a prescription or diagnosis. For severe symptoms, pregnancy danger signs, children with serious illness, chest pain, breathing difficulty, stroke-like weakness, or major injury, seek urgent care.

Which doctor may help?

Start with a registered doctor or the nearest qualified health center.

What to tell the doctor

  • Write when the problem started and how it changed.
  • Bring old prescriptions, investigation reports, and current medicines.
  • Write allergies, pregnancy status, diabetes, kidney/liver disease, and major past illnesses.
  • Bring one family member if the patient is weak, elderly, confused, or a child.

Questions to ask

  • What is the most likely cause of my symptoms?
  • Which danger signs mean I should go to hospital quickly?
  • Which tests are necessary now, and which can wait?
  • How should I take medicines safely and what side effects should I watch for?
  • When should I come for follow-up?

Tests to discuss

  • Vital signs: temperature, pulse, blood pressure, oxygen saturation
  • Basic physical examination by a clinician
  • CBC, urine test, blood sugar, or imaging only when clinically needed

Avoid these mistakes

  • Do not use antibiotics, steroid tablets/injections, or strong painkillers without proper medical advice.
  • Do not hide pregnancy, kidney disease, ulcer, allergy, or blood thinner use.
  • Do not delay emergency care when danger signs are present.

Medicine safety and first-aid guide

This section is for patient education only. It does not replace a doctor, pharmacist, or emergency care.

Safe first steps

  • Rest, drink safe water, and observe symptoms carefully.
  • Keep a written note of symptoms, duration, temperature, medicines already taken, and allergy history.
  • Seek medical care quickly if symptoms are severe, worsening, or unusual for the patient.

OTC medicine safety

  • For mild pain or fever, ask a registered pharmacist or doctor before using common over-the-counter pain/fever medicines.
  • Do not combine multiple pain medicines without advice, especially if you have kidney disease, liver disease, stomach ulcer, asthma, pregnancy, or take blood thinners.
  • Do not give adult medicines to children unless a qualified clinician advises it.

Avoid these mistakes

  • Do not start antibiotics without a proper medical decision.
  • Do not use steroid tablets or injections casually for quick relief.
  • Do not delay emergency care because of home remedies.

Get urgent help if

  • Severe symptoms, confusion, fainting, breathing difficulty, chest pain, severe dehydration, or sudden weakness need urgent medical care.
Medicine names, dose, and timing must be decided by a qualified clinician or pharmacist after checking age, pregnancy, allergy, other diseases, and current medicines.

For rural patients and family caregivers

Patient health record and symptom diary

Write your symptoms, medicines already taken, test results, and questions before visiting a doctor. This note stays on your device unless you print or copy it.

Doctor to discuss: Doctor / qualified healthcare provider
Tests to discuss with doctor
  • Basic vital signs: temperature, pulse, blood pressure, oxygen level if needed
  • Relevant blood, urine, imaging, or specialist tests only after clinical assessment
Questions to ask
  • What is the most likely cause of my symptoms?
  • Which warning signs mean I should go to emergency care?
  • Which tests are really needed now?
  • Which medicines are safe for my age, pregnancy status, allergy, kidney/liver/stomach condition, and current medicines?

Emergency warning signs such as chest pain, severe breathing difficulty, sudden weakness, confusion, severe dehydration, major injury, or loss of bladder/bowel control need urgent medical care. Do not wait for online information.

Safe pathway to proper treatment

Care roadmap for: Optimizing and deploying transformer INT8 inference

Use this simple roadmap to understand the next safe steps. It is educational and does not replace examination by a doctor.

Go to emergency care if you notice:
  • Severe or rapidly worsening symptoms
  • Breathing difficulty, chest pain, fainting, confusion, severe weakness, major injury, or severe dehydration
Doctor / service to discuss: Qualified healthcare provider; specialist depends on symptoms and examination.
  1. Step 1

    Check danger signs first

    If danger signs are present, seek emergency care and do not wait for online information.

  2. Step 2

    Record the symptom story

    Write when symptoms started, severity, medicines already taken, allergies, pregnancy status, and test results.

  3. Step 3

    Visit a qualified clinician

    A doctor, nurse, or qualified healthcare provider can examine you and decide which tests or treatment are needed.

  4. Step 4

    Do only useful tests

    Do tests after clinical assessment. Avoid unnecessary tests, random antibiotics, or repeated medicines without diagnosis.

  5. Step 5

    Follow up and return early if worse

    If symptoms worsen, new warning signs appear, or treatment is not helping, return for review quickly.

Rural patient practical tips
  • Take a written symptom diary and all previous prescriptions/test reports.
  • Do not hide medicines already taken, even herbal or over-the-counter medicines.
  • Ask which warning signs mean urgent referral to hospital.

This roadmap is for education. A real diagnosis and treatment plan requires history, examination, and clinical judgment.

RX Patient Help

Ask a health question safely

Write your symptom story. A health professional or site editor can review it before any answer is prepared. This box is not for emergency care.

Emergency first: Severe chest pain, breathing trouble, unconsciousness, stroke signs, severe injury, heavy bleeding, or rapidly worsening symptoms need urgent local medical care now.

Frequently Asked Questions

Is this article a replacement for a doctor?

No. It is educational content only. Patients should consult a qualified clinician for diagnosis and treatment.

When should I seek urgent care?

Seek urgent care for severe symptoms, rapidly worsening condition, breathing difficulty, severe pain, neurological changes, or any emergency warning sign.

References

Add references, clinical guidelines, textbooks, journal articles, or trusted medical sources here. You can edit this area from the RX Article Professional Blocks panel.