Use Multiple GPUs

Single Machine Data Parallel

When the data is too large to feed into a single GPU

  • data is scattered across the GPUs
  • the model is replicated on each GPU
  • PyTorch gathers the outputs from all GPUs,
    • then computes the loss and gradients, and updates the weights

The code change is just one line:


import torch

model = Net().to("cuda:0")
model = torch.nn.DataParallel(model)  # add this line

# normal training loop ... (sketched below)
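The scatter, replicate, and gather steps above all happen inside an ordinary training loop. A minimal sketch, where the loader of (input, label) batches and the choice of loss are assumptions for illustration:


loss_fn = torch.nn.CrossEntropyLoss()        # assumed loss, for illustration only
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for inputs, labels in loader:                # loader is assumed to yield batches
    inputs = inputs.to("cuda:0")             # inputs go to the default device;
    labels = labels.to("cuda:0")             # DataParallel scatters them to all GPUs
    optimizer.zero_grad()
    outputs = model(inputs)                  # outputs are gathered back on cuda:0
    loss = loss_fn(outputs, labels)
    loss.backward()
    optimizer.step()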

Single Machine Model Parallel

When the model itself is too large to fit in the memory of a single GPU

  • data is sent to GPU 0
    • the first part of the model runs on GPU 0
  • the intermediate result is sent to GPU 1
    • the second part of the model runs on GPU 1
  • the final output is computed
    • data comes in on GPU 0, but the output ends up on GPU 1

Code:

This requires manually transferring data between GPUs inside a self-defined nn.Module:


class Net(torch.nn.Module):
    def __init__(self, *gpus):
        super().__init__()
        self.gpu0 = torch.device(gpus[0])  # pass the GPUs to the nn.Module constructor
        self.gpu1 = torch.device(gpus[1])
        self.sub_net1 = torch.nn.Linear(10, 10).to(self.gpu0)
        self.sub_net2 = torch.nn.Linear(10, 5).to(self.gpu1)

    def forward(self, x):
        y = self.sub_net1(x.to(self.gpu0))
        z = self.sub_net2(y.to(self.gpu1))  # .to() across devices is blocking
        return z


model = Net("cuda:0", "cuda:1")
# training loop...


  • pass the GPUs to the nn.Module constructor
  • assign each operation (layer) to its corresponding GPU
  • in forward: manually transfer the data across GPUs (see the training-loop sketch below)
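Because the output lives on the last GPU, the labels must be moved there before computing the loss; otherwise the training loop is unchanged. A minimal sketch, where the loader and the loss are assumptions for illustration:


loss_fn = torch.nn.MSELoss()                      # assumed loss, for illustration only
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for inputs, labels in loader:                     # loader is assumed to yield batches
    optimizer.zero_grad()
    outputs = model(inputs)                       # forward moves data from cuda:0 to cuda:1 internally
    loss = loss_fn(outputs, labels.to("cuda:1"))  # labels must be on the same GPU as the output
    loss.backward()                               # autograd handles the cross-GPU backward pass
    optimizer.step()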

Distributed Data Parallel

  • GPUs across all machines receive different data and process it in parallel at the same time
  • the backward pass synchronizes the gradients across all GPUs

Code:

def one_machine(machine_rank, world_size, backend):
    torch.distributed.init_process_group(
        backend, rank=machine_rank, world_size=world_size
    )
    gpus = {
        0: [0, 1],
        1: [2, 3],
    }[machine_rank]  # or one GPU per process, to avoid the Global Interpreter Lock
    model = Net().to(gpus[0])  # default to the first GPU on this machine
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=gpus)
    # training loop...

# illustrative launcher: in practice each machine runs its own spawn with its machine_rank
for machine_rank in range(world_size):
    torch.multiprocessing.spawn(
        one_machine, args=(world_size, backend),
        nprocs=world_size, join=True  # join=True blocks until all processes finish
    )
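Inside the training loop, each process should also draw a different shard of the data; torch.utils.data.distributed.DistributedSampler handles this. A minimal sketch, where the dataset, loss function, and optimizer are assumptions for illustration:


sampler = torch.utils.data.distributed.DistributedSampler(dataset)  # shards the data by rank
loader = torch.utils.data.DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)                  # different shuffle each epoch
    for inputs, labels in loader:
        inputs = inputs.to(gpus[0])
        labels = labels.to(gpus[0])
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), labels)
        loss.backward()                       # DDP all-reduces gradients across processes here
        optimizer.step()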


Distributed Model and Data Parallel

Multiple machines, each with multiple GPUs



def one_machine(machine_rank, world_size, backend):
    torch.distributed.init_process_group(
        backend, rank=machine_rank, world_size=world_size
    )
    gpus = {
        0: [0, 1],
        1: [2, 3],
    }[machine_rank]
    model = Net(*gpus)  # the model-parallel Net spreads its layers over this machine's GPUs
    model = torch.nn.parallel.DistributedDataParallel(model)
    # training loop...

for machine_rank in range(world_size):
    torch.multiprocessing.spawn(
        one_machine, args=(world_size, backend), nprocs=world_size, join=True
    )
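One thing both distributed snippets leave implicit: init_process_group needs a rendezvous point so the ranks can find each other. With the default env:// init method this means setting MASTER_ADDR and MASTER_PORT before launching; a minimal sketch, where the values are placeholders:


import os

os.environ["MASTER_ADDR"] = "10.0.0.1"   # placeholder: address of the rank-0 machine
os.environ["MASTER_PORT"] = "29500"      # placeholder: a free port reachable by all machines

world_size = 2                           # placeholder: total number of participating processes
backend = "nccl"                         # NCCL is the usual backend for GPU training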

Summary

If the data and the model can fit on a single GPU:

  • parallelism is not a concern, since training on a single GPU is the most efficient approach

If there are multiple GPUs on your server and you want to speed up training with minimal code changes:

  • Single-machine multi-GPU DataParallel

If there are multiple GPUs on your server, you want to speed up training, and you are willing to write a little more code:

  • Single-machine multi-GPU Distributed Data Parallel

If you need to break the machine boundary and train across multiple machines:

  • multi-machine Distributed Data Parallel (optionally combined with model parallel)
