optimize large scale transformer model inference

Large-scale transformer models, such as GPT-2 and GPT-3, are among the most useful self-supervised language models for natural language processing tasks such as language translation, question answering, passage summarization, and text generation. After successfully shipping the first deep learning model for IntelliCode completion, our recent research effort brings GPT-C, a multi-layer generative decoder transformer architecture that is part of the DeepDev transformer platform for code and text from the Developer Division (DevDiv) Data&AI Applied Science team, to empower IntelliCode with whole-line code completion suggestions in Visual Studio and Visual Studio Code.

To meet the computing power required by large-scale transformers, our initial aim was to deploy the GPT-C model in production using the Azure Machine Learning service with a cluster of virtual machines powered by NVIDIA Tesla V100 GPUs. However, there were some limitations:

  • Cloud-based deployment requires transmitting user code over the network for inference, which increases the risk of exposing sensitive data.
  • The service is not accessible in disconnected or offline mode. Developers must stay connected to the internet while they work, which may not be an option for people in areas with poor internet connections.
  • Typical language models generate full token sequences left to right, using a beam search decoding algorithm to search for the best solutions in a batch-oriented manner. GPT-C is no exception. This scenario imposes a large memory overhead, resulting in high latency and serving costs. A 12-layer generative transformer model requires 374 MB of memory and takes around 80 ms of GPU time per inference call. The cost of scaling it to our large user base would make it impractical.

With its resource-efficient and high-performance nature, ONNX Runtime can help address these limitations in GPT-C model production.

Large scale transformer model with ONNX Runtime

ONNX (Open Neural Network Exchange) and ONNX Runtime play an important role in accelerating and simplifying transformer model inference in production. ONNX is an open standard format for representing machine learning models. Models trained with various frameworks, such as PyTorch and TensorFlow, can be converted to ONNX. Built on the ONNX standard, ONNX Runtime is an optimized inference engine for efficiently running any model converted to the ONNX format across different hardware and operating systems with minimal effort. Because ONNX is interoperable across frameworks, ONNX Runtime improves development efficiency from model training to inference. Through various optimization techniques, ONNX Runtime can run a wide variety of models with high performance across hardware platforms.

To deliver the IntelliCode line completion experience at a low cost, we decided to deploy GPT-C on the client side. This means that the GPT-C model needs to run efficiently on CPU across a wide range of client devices. Thanks to ONNX Runtime, our first attempt reduced memory usage from about 370 MB to 80 MB. ONNX Runtime enables transformer optimizations that achieve more than 2x speedup over PyTorch with a large sequence length on CPUs.

PyTorch offers a built-in ONNX exporter for exporting PyTorch models to ONNX. On top of that, ONNX Runtime provides a GPT2 conversion tool that simplifies the conversion of GPT-2 models with past states. Our GPT-C transformer model is easily converted from PyTorch to ONNX with this tool, and then runs on ONNX Runtime with good performance.

In addition to the model itself, beam search is another important component in our deployment. In the initial version, the beam search modules were implemented in managed code (C# and TypeScript). They score and re-rank the output tensors received from the previous ONNX Runtime model inference step. When the scoring and re-ranking are done, the model retrieves the output tensors from the beam search module and conducts another round of inference. Because of the inefficiency of the managed code implementation, E2E client-side GPT-C inference suffered from a relatively poor response time of around 1 second of CPU time for each line completion inference.

To improve the E2E performance of client-side GPT-C further, we extended the GPT2 conversion tool to support GPT-2 models with native one-step beam search. This was collaborative work between the DevDiv Data&AI Applied Science team, the Microsoft Turing team, and the ONNX Runtime team. As a result, we improved both the training and the deployment of GPT-2 models, making it simpler and more efficient for GPT-2 models with native one-step beam search to fully access hardware acceleration through ONNX Runtime.

One-step beam search optimization through ONNX Runtime for large scale transformer model

As shown in Figure 1, GPT-C uses the native one-step beam search in its compute graph. Specifically, one-step beam search is compiled as TorchScript code that serves as a bridge between the GPT-C beam search module and ONNX Runtime. The GPT2 conversion tool then calls the ONNX conversion APIs to convert one-step beam search into ONNX operators and appends them to the end of the converted GPT-C transformer model's ONNX compute graph. After a GPT-2 model with native one-step beam search has been converted into a single ONNX graph, ONNX Runtime quantization is applied to further reduce the size of the model. When the GPT-C ONNX model is deployed, the IntelliCode client-side model service retrieves the output tensors from ONNX Runtime and sends them back for the next inference step until all beams reach the end of the line.
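
To illustrate why quantization shrinks a model, here is a minimal sketch of symmetric 8-bit linear quantization in plain Python. The article does not say which quantization scheme or settings were used for GPT-C, so the scheme, the function names `quantize_int8` and `dequantize`, and the sample weights are all assumptions chosen for illustration. The idea is that each 4-byte float weight is stored as a 1-byte integer plus one shared scale factor.

```python
from array import array


def quantize_int8(weights):
    """Symmetric 8-bit linear quantization: each weight w is approximated by q * scale."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    # Guard against an all-zero tensor, where any scale works.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the stored integers back to approximate float weights."""
    return [v * scale for v in q]


# Hypothetical weights, not taken from GPT-C.
weights = [0.5, -1.27, 0.03, 1.0, -0.6]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Storage cost: 32-bit floats versus signed 8-bit integers.
fp32_bytes = len(weights) * array("f").itemsize
int8_bytes = len(q) * array("b", q).itemsize
print(fp32_bytes, int8_bytes)
```

The rounding error per weight is bounded by half the scale, which is the trade-off accepted in exchange for the smaller footprint.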

Figure 1. How the GPT-C model is deployed in Visual Studio and Visual Studio Code.

We measured the latency of the GPT-C ONNX model on both CPU and GPU configurations. The CPU measurement was done on a laptop with an Intel® Core® i7-8650U CPU. Compared with our initial client-side GPT-C implementation, performance improved by up to 4.0x, to around 300 ms per inference.

For GPU, we used one NVIDIA V100-PCIE-16GB GPU on an Azure Standard_NC12s_v3 VM and tested it in an FP16 configuration. Compared with PyTorch, ONNX Runtime showed up to 5x better memory efficiency and up to 4x faster inference.

Technical insights about one-step beam search ONNX optimization

Because beam search requires multiple steps with certain stop conditions while the ONNX graph is static, we standardize the interface by exporting only one step of the beam search to ONNX. To enable multi-step beam search, all we need is a simple loop with a proper stop condition. Unfortunately, we ran into problems because the beam search algorithm requires loop operations for selecting beams and a set for storing finished beams, neither of which is natively supported in the ONNX spec yet. To overcome this, we use two matrices to store the beam indices and scores at each step. In addition, we use a vector of indicators to track whether the input beams are finished.
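
The "simple loop with a proper stop condition" can be sketched in plain Python as follows. This is only an illustration under stated assumptions: `toy_step` is a hypothetical stand-in for a single ONNX Runtime call on the exported one-step graph, `EOS = 3` is an arbitrary end-of-text id, and the state is held in Python lists rather than in tensors. The toy step does no re-ranking; it only shows how token indices, scores, and finished indicators are carried from one step to the next.

```python
EOS = 3  # hypothetical end-of-text token id for this toy example


def toy_step(state):
    """Stand-in for one inference call: extend every unfinished beam by one token."""
    tokens, scores, finished = [], [], []
    for seq, score, done in zip(state["tokens"], state["scores"], state["finished"]):
        if done:
            # Finished beams are carried over unchanged.
            tokens.append(seq)
            scores.append(score)
            finished.append(True)
        else:
            nxt = seq[-1] + 1  # toy "model": always predicts the next integer
            tokens.append(seq + [nxt])
            scores.append(score - 1.0)
            finished.append(nxt == EOS)
    return {"tokens": tokens, "scores": scores, "finished": finished}


def run_beam_search(one_step, state, max_steps):
    """Call the one-step function until every beam is finished or max_steps is hit."""
    steps = 0
    while steps < max_steps and not all(state["finished"]):
        state = one_step(state)
        steps += 1
    return state, steps


start = {"tokens": [[0], [1], [2]], "scores": [0.0, 0.0, 0.0], "finished": [False] * 3}
final, steps = run_beam_search(toy_step, start, max_steps=10)
print(steps, final["tokens"])
```

Keeping the loop on the host side means the exported graph itself stays static, which is the constraint described above.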

Figure 2. Left: input beams. Right: candidate output beams. Meaning of tuples: (index, score or probability, finished or unfinished). Here the beam size (k) is 3 and index 2 is the <end-of-text> token. For the 3rd row, the input is finished (i.e., it has reached <end-of-text>), so we construct and insert k “fake” candidates, with the 1st candidate carrying the same score as the input and a pad index (or an arbitrary index). The 2nd and later candidates are given a score of -Inf, so they are dropped when finding the top-k (shaded) among all candidates.

As shown in the example in Figure 2, input beams are fed into the model to get a probability distribution over the next tokens. Since model inference is expensive, we run the model only on the unfinished beams, whose count is denoted k0 (k0 = 2 in the example), and select the top-k (the beam size) candidates for each unfinished input beam, which results in a k0 x k table. Then the candidates for the finished beams are constructed and inserted back into the candidate table using the ONNX scatter operator, so we end up with a table of k x k candidates. For the next round of beam search, the next k beams are selected by the top-k operator, which automatically discards the placeholder candidates that carry a -Inf score. This avoids the branching, not yet supported in ONNX, that would otherwise be needed to handle finished and unfinished inputs separately.
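
The candidate table described above can be sketched in plain Python. This is an illustrative reconstruction, not the exported graph: the real implementation uses ONNX scatter and top-k operators on tensors, while this sketch uses Python lists and an ordinary `if` for readability. The names `build_candidates` and `select_top_k`, the pad index `PAD = 0`, and all scores are assumptions; `EOS = 2` and k = 3 follow the Figure 2 caption.

```python
import math

NEG_INF = -math.inf
EOS = 2  # end-of-text index, as in the Figure 2 caption
PAD = 0  # hypothetical pad index used for finished beams


def build_candidates(beam_scores, finished, step_log_probs, k):
    """Build a k x k table of (score, parent_beam, token, finished) candidates.

    step_log_probs holds one row of next-token log-probabilities per
    unfinished beam (k0 rows), since the model runs only on those beams.
    """
    table = []
    rows = iter(step_log_probs)
    for beam, (score, done) in enumerate(zip(beam_scores, finished)):
        if done:
            # One candidate keeps the beam's score; the rest get -Inf so
            # that top-k never selects them.
            row = [(score, beam, PAD, True)] + [(NEG_INF, beam, PAD, True)] * (k - 1)
        else:
            log_probs = next(rows)
            best = sorted(range(len(log_probs)), key=lambda t: log_probs[t], reverse=True)[:k]
            row = [(score + log_probs[t], beam, t, t == EOS) for t in best]
        table.append(row)
    return table


def select_top_k(table, k):
    """Flatten the table and keep the k highest-scoring candidates."""
    flat = [cand for row in table for cand in row]
    return sorted(flat, key=lambda cand: cand[0], reverse=True)[:k]


# Three input beams; the third is already finished, so k0 = 2.
beam_scores = [-0.5, -1.0, -0.875]
finished = [False, False, True]
step_log_probs = [
    [-1.0, -0.25, -3.0, -2.0],   # unfinished beam 0
    [-2.5, -1.5, -0.125, -3.0],  # unfinished beam 1
]
table = build_candidates(beam_scores, finished, step_log_probs, k=3)
next_beams = select_top_k(table, k=3)
print(next_beams)
```

In this example the finished beam survives with its original score, and the beam that just emitted index 2 is marked finished for the next round.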

By putting beam search into the ONNX graph, we benefit from ONNX Runtime’s optimizations and reduce the overhead of transferring data between ONNX Runtime and the scripting language, which helps reduce model inference latency. Another benefit is that it helps bridge the gap between model training and deployment. The programming language used for a large-scale transformer model in production is often not the one used to train the model. For example, we found that most models are trained and tested using Python packages such as PyTorch but deployed in C#, C++, or JavaScript packages. This means that the exact same beam search algorithm has to be implemented in different languages, which can cause inconsistency and maintenance issues. Given that the beam search algorithm is mostly standardized, exporting beam search directly to the ONNX graph avoids substantial code changes during deployment.

Try it now

We are delighted to offer this innovation to the public developer and data science community. You can now use high-performance inference with ONNX Runtime for a given GPT-2 model with one-step beam search by following these steps:

  1. Train a GPT-2 model or load a pre-trained one.
  2. Convert the GPT-2 model with one-step beam search to ONNX format.
  3. Run the converted model with ONNX Runtime on the target platform of your choice.

Check out this end-to-end tutorial.

Ongoing work

We will continue optimizing the performance of large-scale transformer models in ONNX Runtime. There are still opportunities for further improvement, such as integrating multi-step beam search into the ONNX model.
