Distributed Data Parallel


Model training has long been, and will likely remain, one of the most frustrating parts of machine learning development: it takes a long time, and there is little developers can do to avoid it. If you have the luxury (especially at this moment in time) of having multiple GPUs, you are likely to find Distributed Data Parallel (DDP) helpful. DDP performs model training across multiple GPUs in a transparent fashion. The GPUs can sit in a single machine or be spread across multiple machines. DDP can use all of them to maximize computing power, significantly shortening the time needed for training.

For a long time, DDP was available only on Linux. That changed in PyTorch 1.7, when Microsoft introduced DDP support on Windows, and the support has been continuously improved since. In this article, we’d like to show you how it can help with the training experience on Windows.

Walkthrough

For reference, we’ll set up two machines with the same spec on Azure, one running Windows and the other running Linux, then perform model training on both with the same code and dataset.

We use a very handy Azure resource called the Data Science Virtual Machine (DSVM), a VM image with many machine learning tools preinstalled. At the time of writing, the DSVM image includes PyTorch 1.8.1 (Anaconda), which is what we use for this demonstration.

You can search for this resource directly, or follow the normal VM creation process and choose the desired DSVM image.

In this article, we use the size “Standard NC24s_v3”, which puts four NVIDIA Tesla V100 GPUs at our disposal.

To better understand how DDP works, here are some basic concepts we need to learn first.

One important concept we need to understand is the “process group”, the fundamental tool that powers DDP. A process group is, as the name suggests, a group of processes, each responsible for the training workload of one dedicated GPU. We also need some method to coordinate the processes (and, more importantly, the GPUs behind them) so that they can communicate with each other. This is called the “backend” in PyTorch (--dist-backend in the script parameters). In PyTorch 1.8 we use Gloo as the backend because the NCCL and MPI backends are currently not available on Windows. See the PyTorch documentation for more information about backends. Finally, we need a place for the backend to exchange information. This is called the “store” in PyTorch (--dist-url in the script parameters). See the PyTorch documentation to find out more about stores.

Other concepts that might be a bit confusing are “world size” and “rank”. World size is essentially the number of processes participating in the training job. As mentioned before, each process is responsible for one dedicated GPU, so world size also equals the total number of GPUs used. Pretty straightforward, right? Now let’s talk about “rank”. Rank is an index number assigned to each process, which can be used to identify one specific process. Note that a process with rank 0 is always needed because it acts as the “controller” that coordinates all the processes. If the process with rank 0 doesn’t exist, the entire training is a no-go.
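The relationship between these concepts can be sketched in a few lines of Python. This is only an illustration: `ProcessGroupSettings` and `settings_for_all_ranks` are hypothetical names introduced here, not part of PyTorch. The four fields mirror the `backend`, `init_method`, `world_size`, and `rank` arguments of `torch.distributed.init_process_group`.

```python
from dataclasses import asdict, dataclass
from typing import Dict, List, Union


@dataclass(frozen=True)
class ProcessGroupSettings:
    """Hypothetical container for what one process needs to join the group."""

    backend: str      # how the processes communicate, e.g. "gloo"
    init_method: str  # where they find each other, e.g. a file:// or tcp:// URL
    world_size: int   # total number of processes (one per GPU)
    rank: int         # index of this process, from 0 to world_size - 1

    def __post_init__(self) -> None:
        if self.world_size < 1:
            raise ValueError("world_size must be at least 1")
        if not 0 <= self.rank < self.world_size:
            raise ValueError("rank must satisfy 0 <= rank < world_size")

    def as_kwargs(self) -> Dict[str, Union[str, int]]:
        """Keyword arguments in the shape init_process_group accepts."""
        return asdict(self)


def settings_for_all_ranks(
    backend: str, init_method: str, world_size: int
) -> List[ProcessGroupSettings]:
    """One settings object per process; rank 0 is always among them."""
    return [
        ProcessGroupSettings(backend, init_method, world_size, rank)
        for rank in range(world_size)
    ]


group = settings_for_all_ranks("gloo", "file:///D:\\pg", world_size=4)
print([settings.rank for settings in group])  # [0, 1, 2, 3]
```

Inside each process, the settings would then be handed over with `torch.distributed.init_process_group(**settings.as_kwargs())`.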

With the necessary knowledge in our backpack, let’s get started with the actual training. We use a small subset of ImageNet 2012 as the dataset. Let’s assume we have downloaded the dataset and placed it somewhere in the filesystem; we’ll use “D:\imagenet-small” for this demonstration.

Obviously, we also need a training script. We use the imagenet training script from the PyTorch Examples repo, with ResNet50 as the target model. The script can be seen as a normal training script plus the DDP capabilities provided by packages like “torch.distributed” and “torch.multiprocessing”. It doesn’t contain too much logic, and you can easily set up your own script based on it. You can also refer to this Getting Started tutorial for more inspiration.
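As a rough illustration of what such a script adds on top of ordinary training code, here is a minimal single-machine sketch. It is not the PyTorch Examples script: the tiny linear model stands in for ResNet50, the random tensors stand in for the dataset, and the function names are chosen for illustration. PyTorch is imported inside the functions only so that the file can be loaded on a machine without it.

```python
def main_worker(gpu: int, world_size: int, init_method: str) -> None:
    """Runs once per GPU; on a single machine the GPU index doubles as the rank."""
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    dist.init_process_group(
        backend="gloo", init_method=init_method, world_size=world_size, rank=gpu
    )
    torch.cuda.set_device(gpu)
    device = f"cuda:{gpu}"

    model = torch.nn.Linear(10, 2).to(device)  # stand-in for the real model
    ddp_model = DDP(model, device_ids=[gpu])   # wraps the model for gradient sync
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)

    # One training step on random data; a real script loops over a DataLoader.
    inputs = torch.randn(8, 10, device=device)
    targets = torch.randn(8, 2, device=device)
    loss = torch.nn.functional.mse_loss(ddp_model(inputs), targets)
    optimizer.zero_grad()
    loss.backward()  # gradients are averaged across processes during backward
    optimizer.step()

    dist.destroy_process_group()


def launch(init_method: str) -> None:
    """Start one worker process for every visible GPU."""
    import torch
    import torch.multiprocessing as mp

    world_size = torch.cuda.device_count()
    # mp.spawn calls main_worker(i, world_size, init_method) for each i in range(nprocs).
    mp.spawn(main_worker, args=(world_size, init_method), nprocs=world_size)
```

On Windows, `launch(...)` should be called from under an `if __name__ == "__main__":` guard, because the worker processes are started with the spawn method and re-import the main module.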

On a single machine, we can simply use FileStore, which is easier to set up. The complete command looks like this:

> python main.py D:\imagenet-small --arch resnet50 --dist-url file:///D:\pg --dist-backend gloo --world-size 1 --multiprocessing-distributed --rank 0

You probably noticed that we are using “--world-size 1” and “--rank 0”. With --multiprocessing-distributed, these values describe the machine (the number of machines and the index of this machine) rather than the individual processes, and the script calculates the actual world size and ranks based on the available GPUs. Here the actual world size is the same as the number of GPUs available, which is four, and each process is automatically assigned its own rank, starting from zero.
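The arithmetic behind this can be sketched as follows. The sketch reflects our reading of the example script, where the two command-line values are treated as a machine count and a machine index when `--multiprocessing-distributed` is set; the function name `expand_to_processes` is hypothetical.

```python
from typing import List, Tuple


def expand_to_processes(
    num_nodes: int, node_rank: int, gpus_per_node: int
) -> Tuple[int, List[int]]:
    """Return the global world size and the ranks of the processes on one machine."""
    world_size = num_nodes * gpus_per_node
    first_rank = node_rank * gpus_per_node
    return world_size, [first_rank + gpu for gpu in range(gpus_per_node)]


# One machine with four GPUs, as in the command above.
print(expand_to_processes(num_nodes=1, node_rank=0, gpus_per_node=4))
# (4, [0, 1, 2, 3])
```

With two such machines, the second one (machine index 1) would host ranks 4 through 7 in a world of size 8.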

If you’re not a fan of command-line arguments, you can also use environment variables to initialize the DDP arguments. This might be helpful if you need to automate the deployment. More details can be found in the “Environment variable initialization” section of the PyTorch documentation.
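As a minimal sketch, the variables read by the `env://` initialization method can be set from Python before the process group is created. `MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, and `RANK` are the names PyTorch documents; the helper names, address, and port below are placeholders chosen for illustration.

```python
import os
from typing import Dict


def ddp_environment(
    master_addr: str, master_port: int, world_size: int, rank: int
) -> Dict[str, str]:
    """Build the environment variables used by env:// initialization."""
    return {
        "MASTER_ADDR": master_addr,       # address of the machine hosting rank 0
        "MASTER_PORT": str(master_port),  # a free port on that machine
        "WORLD_SIZE": str(world_size),    # total number of processes
        "RANK": str(rank),                # index of this process
    }


def init_from_environment(backend: str = "gloo") -> None:
    """Join the process group using whatever the environment provides."""
    import torch.distributed as dist  # imported lazily; requires PyTorch

    dist.init_process_group(backend=backend, init_method="env://")


# Placeholder values for a single process on the local machine.
os.environ.update(ddp_environment("127.0.0.1", 29500, world_size=1, rank=0))
```

In an automated deployment, the same four variables would typically be set by the launcher or the job scheduler instead, with a different `RANK` for each process.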

If everything goes well, the training job will start shortly after.

Troubleshooting

If something doesn’t go well, here are some troubleshooting tips that might be helpful:

  • If you’re using FileStore on Windows, make sure the file used is not locked by other processes, which can happen if you forcefully kill the training processes. This can lead to freezing of the DDP training process, because the script fails to initialize the FileStore. A workaround is to manually kill previous training processes and delete the file before you conduct the next training.
  • If you’re using TCPStore, make sure the network is accessible and the port is actually available. Otherwise, the training may freeze because the script fails to initialize the TCPStore. The process with rank zero binds to and listens on the port you provided, and the other processes try to connect to that port. You can use network monitoring tools like “netstat” to help debug TCP connection issues.
  • You can use tools like nvidia-smi to monitor the GPU load while performing the training. Ideally, we want all the GPUs fully utilized and running at 100 percent usage. If you find that the GPU load is low, you may want to increase the batch size and/or the number of DataLoader workers.
  • Be aware that the number of GPUs used in DDP also affects the effective batch size. For example, if we use a batch size of 128 on a single GPU and then switch to DDP with two GPUs, we have two options: a) split the batch and use a batch size of 64 on each GPU; b) use a batch size of 128 on each GPU, resulting in an effective batch size of 256. Apart from the limits of GPU memory, the choice is mostly up to you. You can tweak the script to go either way. Remember to also adjust the initial learning rate if you choose option b) and expect a similar training result.
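Two of the tips above lend themselves to a small sketch: clearing a leftover FileStore file before a new run, and working out the batch size options. The helper names are hypothetical, the learning rate of 0.1 is a placeholder, and scaling the learning rate in proportion to the effective batch size is one common heuristic that we assume here, not something the training script does for you.

```python
from pathlib import Path
from typing import NamedTuple


def remove_stale_store_file(path: str) -> bool:
    """Delete a leftover FileStore file; return True if a file was removed.

    If another process still holds the file, the operating system may refuse
    (PermissionError on Windows), which signals that a previous training
    process needs to be stopped first.
    """
    try:
        Path(path).unlink()
    except FileNotFoundError:
        return False
    return True


class BatchPlan(NamedTuple):
    per_gpu_batch: int
    effective_batch: int
    learning_rate: float


def plan_batches(base_batch: int, base_lr: float, num_gpus: int, split: bool) -> BatchPlan:
    """Option a) splits the batch across GPUs; option b) keeps it on each GPU."""
    if split:
        if base_batch % num_gpus:
            raise ValueError("base_batch must be divisible by num_gpus")
        return BatchPlan(base_batch // num_gpus, base_batch, base_lr)
    # Assumed heuristic: scale the learning rate linearly with the effective batch.
    return BatchPlan(base_batch, base_batch * num_gpus, base_lr * num_gpus)


print(plan_batches(128, 0.1, num_gpus=2, split=True))   # option a)
print(plan_batches(128, 0.1, num_gpus=2, split=False))  # option b)
```

For the two-GPU example, option a) yields 64 samples per GPU with the learning rate unchanged, while option b) yields an effective batch of 256 and, under the assumed heuristic, a doubled learning rate.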

Benchmark

Back to our benchmarking mission. First, we ran the training without DDP to establish a baseline. Then we tried the DDP setup with two GPUs, and finally with four GPUs. These are the results:

| Duration by OS | 1 GPU (no DDP) | 2 GPUs | 4 GPUs |
| --- | --- | --- | --- |
| Linux | 56m 58s | 31m 7s | 17m 20s |
| Windows | 58m 55s | 31m 55s | 19m 3s |

To better visualize it, we plot it as the chart below:

[Chart: training duration on Linux and Windows with 1, 2, and 4 GPUs]

As we can see from the data, the acceleration from additional GPUs meets our overall expectations. Using two GPUs cuts training duration almost in half (about 1.8x faster on both systems), and using four GPUs cuts it to a little under a third (about 3.3x faster on Linux and 3.1x on Windows).
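To make the comparison concrete, the speedups can be computed directly from the durations reported above. The following sketch does only that arithmetic; the helper names are illustrative.

```python
import re
from typing import List

GPU_COUNTS = [1, 2, 4]
DURATIONS = {
    "Linux": ["56m 58s", "31m 7s", "17m 20s"],
    "Windows": ["58m 55s", "31m 55s", "19m 3s"],
}


def to_seconds(text: str) -> int:
    """Convert a duration such as '31m 7s' to seconds."""
    match = re.fullmatch(r"(\d+)m (\d+)s", text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes, seconds = map(int, match.groups())
    return minutes * 60 + seconds


def speedups(durations: List[str]) -> List[float]:
    """Speedup of each run relative to the first (single-GPU) run."""
    seconds = [to_seconds(duration) for duration in durations]
    return [seconds[0] / value for value in seconds]


for system, runs in DURATIONS.items():
    for gpus, speedup in zip(GPU_COUNTS, speedups(runs)):
        print(f"{system}: {gpus} GPU(s) -> {speedup:.2f}x, efficiency {speedup / gpus:.0%}")
```

Dividing the speedup by the GPU count gives a rough scaling efficiency, which makes runs with different GPU counts easier to compare.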

In terms of accuracy, here’s the loss curve we see on both Windows and Linux:

[Chart: training loss curves on Windows and Linux]

The loss curve shows that the shorter training time does not come at the cost of a worse training result. We can still expect the model to be gradually trained over time.

This is of course only a small demonstration of how DDP on Windows can give users a performance boost comparable to the one on Linux, without compromising accuracy. We at Microsoft are working closely with the PyTorch team to keep improving the PyTorch experience on Windows. DDP support on Windows is a huge leap forward in training performance. We encourage you to try it, and we’d love to hear your feedback.
