mindspore.dataset.DistributedSampler

class mindspore.dataset.DistributedSampler(num_shards, shard_id, shuffle=True, num_samples=None, offset=-1)

A sampler that accesses one shard of the dataset. It divides the dataset into multiple subsets for distributed training.

Parameters
  • num_shards (int) – Number of shards to divide the dataset into.

  • shard_id (int) – ID of the current shard, which should be within the range [0, num_shards-1].

  • shuffle (bool, optional) – If True, the indices are shuffled; otherwise they are not (default=True).

  • num_samples (int, optional) – The number of samples to draw (default=None, which means sample all elements).

  • offset (int, optional) – The starting shard ID to which the elements of the dataset are sent, which should be no more than num_shards. This parameter is only valid when a ConcatDataset takes a DistributedSampler as its sampler. It affects the number of samples per shard (default=-1, which means each shard has the same number of samples).
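
To make the roles of num_shards, shard_id, shuffle and num_samples concrete, here is a minimal pure-Python sketch of one way a dataset's row indices could be split across shards. It is not MindSpore's implementation. The function name `shard_indices`, the `seed` argument, the strided assignment and the wrap-around padding used to give every shard the same number of samples are all assumptions made for illustration, and `offset` is not modeled.

```python
import math
import random


def shard_indices(num_rows, num_shards, shard_id, shuffle=True,
                  num_samples=None, seed=0):
    """Return the row indices one shard would read (illustrative only)."""
    if num_shards <= 0:
        raise RuntimeError("num_shards must be positive")
    if not 0 <= shard_id < num_shards:
        raise RuntimeError("shard_id must be in [0, num_shards - 1]")

    order = list(range(num_rows))
    if shuffle:
        # Assumption: all shards share one seed so they see the same permutation.
        random.Random(seed).shuffle(order)

    # Assumption: every shard gets ceil(num_rows / num_shards) samples,
    # wrapping around to the start when the rows do not divide evenly.
    per_shard = math.ceil(num_rows / num_shards)
    picked = [order[(shard_id + i * num_shards) % num_rows]
              for i in range(per_shard)]

    # num_samples, when given, caps how many samples this shard draws.
    if num_samples is not None:
        picked = picked[:num_samples]
    return picked


# 10 rows over 3 shards without shuffling: each shard gets 4 indices.
print([shard_indices(10, 3, k, shuffle=False) for k in range(3)])
# [[0, 3, 6, 9], [1, 4, 7, 0], [2, 5, 8, 1]]
```

In this sketch the last two shards reuse rows 0 and 1 so that all shards have equal length; a real sampler may pad or truncate differently.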

Examples

>>> import mindspore.dataset as ds
>>> # Create a distributed sampler with 10 shards in total. This shard is shard 5.
>>> sampler = ds.DistributedSampler(10, 5)
>>> dataset = ds.ImageFolderDataset(image_folder_dataset_dir,
...                                 num_parallel_workers=8,
...                                 sampler=sampler)

Raises
  • TypeError – If num_shards is not an integer value.

  • TypeError – If shard_id is not an integer value.

  • TypeError – If shuffle is not a boolean value.

  • TypeError – If num_samples is not an integer value.

  • TypeError – If offset is not an integer value.

  • ValueError – If num_samples is a negative value.

  • RuntimeError – If num_shards is not a positive value.

  • RuntimeError – If shard_id is less than 0 or greater than or equal to num_shards.

  • RuntimeError – If offset is greater than num_shards.
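
The conditions listed above can be read as a sequence of argument checks. The sketch below mirrors them in plain Python so the mapping from condition to exception type is easy to see. The helper name `check_sampler_args`, the order of the checks, and the decision to reject bool where an integer is expected are assumptions, not the library's actual validation code.

```python
def _is_int(value):
    # Assumption: bool is not accepted where an integer is expected,
    # even though bool is a subclass of int in Python.
    return isinstance(value, int) and not isinstance(value, bool)


def check_sampler_args(num_shards, shard_id, shuffle=True,
                       num_samples=None, offset=-1):
    """Raise the exception each documented condition maps to (illustrative)."""
    if not _is_int(num_shards):
        raise TypeError("num_shards must be an integer")
    if not _is_int(shard_id):
        raise TypeError("shard_id must be an integer")
    if not isinstance(shuffle, bool):
        raise TypeError("shuffle must be a boolean")
    if num_samples is not None and not _is_int(num_samples):
        raise TypeError("num_samples must be an integer")
    if not _is_int(offset):
        raise TypeError("offset must be an integer")

    if num_samples is not None and num_samples < 0:
        raise ValueError("num_samples must not be negative")

    if num_shards <= 0:
        raise RuntimeError("num_shards must be positive")
    if shard_id < 0 or shard_id >= num_shards:
        raise RuntimeError("shard_id must be in [0, num_shards - 1]")
    if offset > num_shards:
        raise RuntimeError("offset must not exceed num_shards")


check_sampler_args(10, 5)  # valid arguments: no exception is raised
```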

add_child(sampler)

Add a sub-sampler to the given sampler. The sub-sampler receives all data output by the parent sampler and applies its own sampling logic to return new samples.

Parameters

sampler (Sampler) – Object used to choose samples from the dataset. Only built-in samplers (DistributedSampler, PKSampler, RandomSampler, SequentialSampler, SubsetRandomSampler, WeightedRandomSampler) are supported.

Examples

>>> sampler = ds.SequentialSampler(start_index=0, num_samples=3)
>>> sampler.add_child(ds.RandomSampler(num_samples=2))
>>> dataset = ds.Cifar10Dataset(cifar10_dataset_dir, sampler=sampler)

get_child()

Get the child sampler of the given sampler.

Returns

Sampler, the child sampler of the given sampler.

Examples

>>> sampler = ds.SequentialSampler(start_index=0, num_samples=3)
>>> sampler.add_child(ds.RandomSampler(num_samples=2))
>>> child_sampler = sampler.get_child()

get_num_samples()

Every sampler has a num_samples value, which is either a number or None. A sampler may also have a child sampler; if it does, the child's sample count is likewise either a number or None. These conditions determine the sample count that is actually used. The following table shows the possible results of calling this function.

child sampler   num_samples   child_samples   result
T               x             y               min(x, y)
T               x             None            x
T               None          y               y
T               None          None            None
None            x             n/a             x
None            None          n/a             None

Returns

int, the number of samples, or None.
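
The table above can be expressed as a small decision function. The following sketch reproduces its six rows in plain Python. The name `resolve_num_samples` and its arguments are hypothetical and chosen only for illustration; the real method takes no arguments and reads these values from the sampler itself.

```python
def resolve_num_samples(num_samples, has_child=False, child_samples=None):
    """Return the effective sample count according to the table (illustrative)."""
    if not has_child:
        # No child sampler: the sampler's own value is used as-is.
        return num_samples
    if num_samples is not None and child_samples is not None:
        # Both counts are numeric: the smaller one wins.
        return min(num_samples, child_samples)
    # At most one count is numeric: use it, or None if neither is set.
    return num_samples if num_samples is not None else child_samples


print(resolve_num_samples(3, has_child=True, child_samples=2))  # 2
print(resolve_num_samples(None, has_child=True, child_samples=2))  # 2
print(resolve_num_samples(3))  # 3
```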

Examples

>>> sampler = ds.SequentialSampler(start_index=0, num_samples=3)
>>> num_samples = sampler.get_num_samples()