PyTorch DataLoader
Training Streams
Recall the Typical Training Process:
import torch
from torch.optim import SGD

loader = ...  # a DataLoader over the training set
model = MyNet()
criterion = torch.nn.CrossEntropyLoss()
optimizer = SGD(model.parameters(), lr=0.01)  # note: parameters() must be called

for epoch in range(10):
    for batch, labels in loader:
        outputs = model(batch)              # forward pass
        loss = criterion(outputs, labels)   # compute loss
        optimizer.zero_grad()               # clear old gradients
        loss.backward()                     # backpropagate
        optimizer.step()                    # update weights
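As a sketch of what loader might be (the TensorDataset wrapper and random data here are illustrative assumptions, not part of the original example):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical data: 100 samples with 10 features each, 3 classes.
features = torch.randn(100, 10)
labels = torch.randint(0, 3, (100,))
loader = DataLoader(TensorDataset(features, labels), batch_size=8, shuffle=True)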
Loader
Two styles of dataset:
- Streams (iterable-style)
  - provide one sample per iteration
- Map (map-style)
  - allows access to samples in any order, e.g. randomly
- The user may augment and manipulate the data through the DataLoader.
class IterableStyleDataset(torch.utils.data.IterableDataset):
    def __iter__(self):
        # Support for streams ...
        ...

class MapStyleDataset(torch.utils.data.Dataset):
    def __getitem__(self, key):
        # Map from (non-int) keys ...
        ...

    def __len__(self):
        # Support sampling ...
        ...
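As a concrete illustration, a minimal runnable version of both styles; the counting stream and the list-backed dataset are invented purely for demonstration:

import torch
from torch.utils.data import DataLoader

class CountingStream(torch.utils.data.IterableDataset):
    # Iterable-style: yields a stream of integers 0..9.
    def __iter__(self):
        return iter(range(10))

class ListDataset(torch.utils.data.Dataset):
    # Map-style: random access into an in-memory list.
    def __init__(self, data):
        self.data = data
    def __getitem__(self, idx):
        return self.data[idx]
    def __len__(self):
        return len(self.data)

for x in DataLoader(CountingStream(), batch_size=4):
    print(x)  # tensors [0,1,2,3], [4,5,6,7], [8,9]

for x in DataLoader(ListDataset(list(range(10))), batch_size=4, shuffle=True):
    print(x)  # random batches of 4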
DataLoader Object
PyTorch's DataLoader allows us to load batches of samples from a dataset:
from torch.utils.data import DataLoader, RandomSampler

dataloader = DataLoader(
    dataset,
    batch_size=8,                    # balance speed and convergence
    num_workers=2,                   # non-blocking when > 0
    sampler=RandomSampler(dataset),  # only for map-style datasets
    pin_memory=True,                 # page-lock the memory holding the data
)
- A random read order may saturate the storage drive.
- pin_memory keeps the data in page-locked RAM that cannot be swapped out,
  so it can be transferred to the GPU faster.
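A minimal sketch of how pinned memory pays off during the transfer; the device selection and loop body are assumptions for illustration:

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
for batch, labels in dataloader:
    # With pin_memory=True, non_blocking=True lets the host-to-device
    # copy overlap with computation instead of stalling.
    batch = batch.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)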
Performance
Two main constraints:
- CPU IPS (instructions per second)
- storage IOPS (I/O operations per second)
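One way to see which constraint binds is to time the loader separately from the compute; this is a rough diagnostic sketch, not from the original slides:

import time

data_time = step_time = 0.0
t0 = time.perf_counter()
for batch, labels in loader:
    t1 = time.perf_counter()
    data_time += t1 - t0   # time spent waiting on the data pipeline
    # ... forward/backward/step ...
    t0 = time.perf_counter()
    step_time += t0 - t1   # time spent computing
print(f"data: {data_time:.1f}s  compute: {step_time:.1f}s")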
CPU
You want the CPUs to be performing:
- preprocessing
- decompression
- copying, to get the data to the GPU
The rule: you don't want them idling or busy-waiting on thread/process synchronization primitives, I/O, etc.
The easiest way to improve CPU utilization in PyTorch is to use the worker-process support built into DataLoader.
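For instance (the worker count and the persistent_workers flag are illustrative and should be tuned per machine):

dataloader = DataLoader(
    dataset,
    batch_size=8,
    num_workers=4,            # 4 worker processes load and preprocess in parallel
    persistent_workers=True,  # keep workers alive across epochs
    pin_memory=True,
)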