mindspore.dataset.DistributedSampler
- class mindspore.dataset.DistributedSampler(num_shards, shard_id, shuffle=True, num_samples=None, offset=-1)
A sampler that accesses one shard of the dataset. It divides the dataset into multiple subsets for distributed training.
Note
The shuffling modes supported for different datasets are as follows:
List of support for shuffling mode

Shuffling Mode     MindDataset   TFRecordDataset   Others
Shuffle.ADAPTIVE   Supported     Not Supported     Not Supported
Shuffle.GLOBAL     Supported     Supported         Supported
Shuffle.PARTIAL    Supported     Not Supported     Not Supported
Shuffle.FILES      Supported     Supported         Not Supported
Shuffle.INFILE     Supported     Not Supported     Not Supported
- Parameters:
num_shards (int) – Number of shards to divide the dataset into.
shard_id (int) – Shard ID of the current shard, which should be within the range of [0, num_shards - 1].
shuffle (Union[bool, Shuffle], optional) – Specify the shuffle mode. Default: True, which performs mindspore.dataset.Shuffle.GLOBAL. If shuffle is False, no shuffling will be performed. There are several levels of shuffling; the desired level is selected with the mindspore.dataset.Shuffle enum:
- Shuffle.ADAPTIVE: When the number of dataset samples is less than or equal to 100 million, Shuffle.GLOBAL is used. When the number of dataset samples is greater than 100 million, Shuffle.PARTIAL is used. The shuffle is performed once every 1 million samples.
- Shuffle.GLOBAL: Global shuffle of all rows of data in dataset. The memory usage is large.
- Shuffle.PARTIAL: Partial shuffle of data in dataset for every 1 million samples. The memory usage is less than Shuffle.GLOBAL.
- Shuffle.FILES: Shuffle the file sequence but keep the order of data within each file.
- Shuffle.INFILE: Keep the file sequence the same but shuffle the data within each file.
num_samples (int, optional) – The number of samples to draw. Default: None, which means sample all elements.
offset (int, optional) – The starting shard ID where the elements in the dataset are sent to, which should be no more than num_shards. This parameter is only valid when a ConcatDataset takes a mindspore.dataset.DistributedSampler as its sampler. It will affect the number of samples per shard. Default: -1, which means each shard has the same number of samples.
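The Shuffle.ADAPTIVE rule and the idea of a partial shuffle can be illustrated with a small pure-Python sketch. This is not MindSpore code: the helper names `resolve_adaptive` and `partial_shuffle` are hypothetical, mode names are plain strings, and shuffling independently inside consecutive blocks is only one plausible reading of "partial shuffle for every 1 million samples"; the library's internal behaviour may differ.

```python
import random

# Thresholds taken from the parameter description above.
ADAPTIVE_THRESHOLD = 100_000_000  # 100 million samples
PARTIAL_BLOCK = 1_000_000         # 1 million samples


def resolve_adaptive(num_dataset_samples):
    """Return the mode name that ADAPTIVE would pick (hypothetical helper)."""
    if num_dataset_samples <= ADAPTIVE_THRESHOLD:
        return "GLOBAL"
    return "PARTIAL"


def partial_shuffle(indices, block_size=PARTIAL_BLOCK, seed=None):
    """Shuffle indices independently inside consecutive blocks.

    Assumption: this block-wise scheme is an illustration only, not the
    algorithm MindSpore uses.
    """
    rng = random.Random(seed)
    shuffled = []
    for start in range(0, len(indices), block_size):
        block = list(indices[start:start + block_size])
        rng.shuffle(block)  # rows never leave their own block
        shuffled.extend(block)
    return shuffled
```

With a small `block_size`, for example `partial_shuffle(list(range(10)), block_size=4, seed=0)`, each group of four indices is permuted only within itself, which is why such a scheme needs less memory than a global shuffle.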
- Raises:
TypeError – If num_shards is not of type int.
TypeError – If shard_id is not of type int.
TypeError – If shuffle is not of type bool or Shuffle.
TypeError – If num_samples is not of type int.
TypeError – If offset is not of type int.
ValueError – If num_samples is a negative value.
RuntimeError – If num_shards is not a positive value.
RuntimeError – If shard_id is less than 0, or greater than or equal to num_shards.
RuntimeError – If offset is greater than num_shards.
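The documented checks can be summarised as a minimal pure-Python sketch. The function `check_sampler_args` is a hypothetical helper written for illustration, not part of MindSpore, and the order in which the real constructor performs its checks is an assumption. The shuffle type check is omitted because it depends on the mindspore.dataset.Shuffle enum.

```python
def check_sampler_args(num_shards, shard_id, num_samples=None, offset=-1):
    """Mirror the documented exceptions (hypothetical helper, not MindSpore code)."""
    # Type checks: each of these arguments must be an int.
    for name, value in (("num_shards", num_shards),
                        ("shard_id", shard_id),
                        ("offset", offset)):
        if not isinstance(value, int):
            raise TypeError(f"{name} must be of type int")
    if num_samples is not None:
        if not isinstance(num_samples, int):
            raise TypeError("num_samples must be of type int")
        if num_samples < 0:
            raise ValueError("num_samples must not be negative")

    # Range checks, using the exception types listed above.
    if num_shards <= 0:
        raise RuntimeError("num_shards must be a positive value")
    if shard_id < 0 or shard_id >= num_shards:
        raise RuntimeError("shard_id must be within [0, num_shards - 1]")
    if offset > num_shards:
        raise RuntimeError("offset must be no more than num_shards")
```

Under these rules, `check_sampler_args(10, 5)` passes silently, while `check_sampler_args(10, 10)` raises RuntimeError because valid shard IDs stop at num_shards - 1.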
Examples
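>>> import mindspore.dataset as ds
>>> # image_folder_dataset_dir is a placeholder; point it at your own dataset directory.
>>> image_folder_dataset_dir = "/path/to/image_folder_dataset_directory"
>>> # Create a distributed sampler with 10 shards in total. This shard is shard 5.
>>> sampler = ds.DistributedSampler(10, 5)
>>> dataset = ds.ImageFolderDataset(image_folder_dataset_dir,
...                                 num_parallel_workers=8,
...                                 sampler=sampler)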
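To build intuition for what "accessing a shard" means, the following pure-Python sketch assigns row indices to shards with a strided (round-robin) scheme and wraps around so that every shard receives the same number of samples. The function `shard_indices` is hypothetical, and the strided assignment with wrap-around is an assumption chosen for illustration; the index order MindSpore actually produces may differ.

```python
import math


def shard_indices(num_rows, num_shards, shard_id):
    """Return the row indices seen by one shard (illustrative only)."""
    if not 0 <= shard_id < num_shards:
        raise ValueError("shard_id must be within [0, num_shards - 1]")
    # Every shard gets the same count, rounded up.
    per_shard = math.ceil(num_rows / num_shards)
    # Strided assignment; the modulo wraps around to pad the last shards.
    return [(shard_id + i * num_shards) % num_rows for i in range(per_shard)]


# Ten rows split across four shards: three indices per shard.
shards = [shard_indices(10, 4, s) for s in range(4)]
```

In this sketch the shards together cover every row, and the last two shards reuse rows 0 and 1 as padding so that all four shards have equal length.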