mindspore.dataset.DistributedSampler

class mindspore.dataset.DistributedSampler(num_shards, shard_id, shuffle=True, num_samples=None, offset=-1)

A sampler that accesses one shard of the dataset. It divides the dataset into multiple subsets for distributed training.

Parameters
  • num_shards (int) – Number of shards to divide the dataset into.

  • shard_id (int) – Shard ID of the current shard, which should be within the range [0, num_shards - 1].

  • shuffle (bool, optional) – If True, the indices are shuffled; otherwise they are not. Default: True.

  • num_samples (int, optional) – The number of samples to draw. Default: None, which means sample all elements.

  • offset (int, optional) – The starting shard ID where the elements in the dataset are sent to, which should be no more than num_shards. This parameter is only valid when a ConcatDataset takes a mindspore.dataset.DistributedSampler as its sampler. It affects the number of samples per shard. Default: -1, which means each shard has the same number of samples.

Raises
  • TypeError – If num_shards is not of type int.

  • TypeError – If shard_id is not of type int.

  • TypeError – If shuffle is not of type bool.

  • TypeError – If num_samples is not of type int.

  • TypeError – If offset is not of type int.

  • ValueError – If num_samples is a negative value.

  • RuntimeError – If num_shards is not a positive value.

  • RuntimeError – If shard_id is less than 0, or greater than or equal to num_shards.

  • RuntimeError – If offset is greater than num_shards.
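The checks listed above can be sketched in plain Python. This is only an illustration: check_sampler_args is a hypothetical helper, not part of MindSpore, and the order in which the library performs its checks, as well as its exact error messages, are assumptions.

```python
def check_sampler_args(num_shards, shard_id, shuffle=True,
                       num_samples=None, offset=-1):
    """Hypothetical helper mirroring the documented checks; not MindSpore code."""
    # Note: bool is a subclass of int in Python, so isinstance(True, int) is
    # True. Whether the library accepts bools here is not stated in the source.
    for name, value in (("num_shards", num_shards),
                        ("shard_id", shard_id),
                        ("offset", offset)):
        if not isinstance(value, int):
            raise TypeError(f"{name} must be of type int")
    if not isinstance(shuffle, bool):
        raise TypeError("shuffle must be of type bool")
    if num_samples is not None:
        if not isinstance(num_samples, int):
            raise TypeError("num_samples must be of type int")
        if num_samples < 0:
            raise ValueError("num_samples must not be negative")
    if num_shards <= 0:
        raise RuntimeError("num_shards must be positive")
    if shard_id < 0 or shard_id >= num_shards:
        raise RuntimeError("shard_id must be in [0, num_shards - 1]")
    if offset > num_shards:
        raise RuntimeError("offset must not exceed num_shards")


# A valid configuration passes silently.
check_sampler_args(num_shards=10, shard_id=5)

# shard_id equal to num_shards is out of range.
try:
    check_sampler_args(num_shards=10, shard_id=10)
except RuntimeError as err:
    error_message = str(err)
```

Running a check like this before constructing the sampler may surface configuration mistakes, such as passing the world size as shard_id, earlier than the data pipeline would.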

Examples

>>> import mindspore.dataset as ds
>>> # Create a distributed sampler with 10 shards in total; this instance reads shard 5.
>>> sampler = ds.DistributedSampler(10, 5)
>>> dataset = ds.ImageFolderDataset(image_folder_dataset_dir,
...                                 num_parallel_workers=8,
...                                 sampler=sampler)
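To illustrate what "accessing one shard" means, here is a minimal pure-Python sketch that splits dataset indices across shards. The function shard_indices, its seed parameter, and the round-robin assignment are assumptions made for illustration; the source does not state which assignment strategy MindSpore actually uses.

```python
import random


def shard_indices(dataset_size, num_shards, shard_id, shuffle=True, seed=0):
    """Return the indices one shard would visit (illustrative only)."""
    if not 0 <= shard_id < num_shards:
        raise ValueError("shard_id must be in [0, num_shards - 1]")
    indices = list(range(dataset_size))
    if shuffle:
        # Every shard uses the same seed, so all shards agree on the
        # permutation and their index sets never overlap.
        random.Random(seed).shuffle(indices)
    # Round-robin split: shard k takes positions k, k + num_shards, ...
    return indices[shard_id::num_shards]


# Four workers, each computing its own shard of a 20-element dataset.
shards = [shard_indices(20, 4, k, shuffle=True, seed=42) for k in range(4)]
```

In this sketch the four shards are disjoint and together cover every index exactly once, which is the property distributed training relies on.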
add_child(sampler)

Add a sub-sampler to the given sampler. The parent receives all data output by the sub-sampler and applies its own sampling logic to return new samples.

Parameters

sampler (Sampler) – Object used to choose samples from the dataset. Only built-in samplers (mindspore.dataset.DistributedSampler, mindspore.dataset.PKSampler, mindspore.dataset.RandomSampler, mindspore.dataset.SequentialSampler, mindspore.dataset.SubsetRandomSampler, mindspore.dataset.WeightedRandomSampler) are supported.

Examples

>>> import mindspore.dataset as ds
>>> sampler = ds.SequentialSampler(start_index=0, num_samples=3)
>>> sampler.add_child(ds.RandomSampler(num_samples=4))
>>> dataset = ds.Cifar10Dataset(cifar10_dataset_dir, sampler=sampler)
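The parent/child relationship in the example above can be modelled conceptually in plain Python. The two functions below are hypothetical stand-ins, not MindSpore classes, and the sketch assumes the parent simply treats the child's output as its input sequence.

```python
import random


def random_sampler(indices, num_samples, seed=0):
    """Stand-in for a random child sampler: draw distinct indices."""
    return random.Random(seed).sample(list(indices), num_samples)


def sequential_sampler(indices, start_index=0, num_samples=None):
    """Stand-in for a sequential parent sampler: take a contiguous slice."""
    stop = None if num_samples is None else start_index + num_samples
    return list(indices)[start_index:stop]


dataset_indices = range(10)
# The child draws 4 random indices from the dataset ...
child_output = random_sampler(dataset_indices, num_samples=4)
# ... and the parent applies its own logic to that output, keeping the first 3.
parent_output = sequential_sampler(child_output, start_index=0, num_samples=3)
```

Under this model the final sample count is limited by both the parent and the child, which matches the min(x, y) rule described for get_num_samples().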
get_child()

Get the child sampler of the given sampler.

Returns

Sampler, the child sampler of the given sampler.

Examples

>>> import mindspore.dataset as ds
>>> sampler = ds.SequentialSampler(start_index=0, num_samples=3)
>>> sampler.add_child(ds.RandomSampler(num_samples=2))
>>> child_sampler = sampler.get_child()

get_num_samples()

Get the num_samples value of the current sampler instance. This parameter can optionally be passed in when defining the sampler (default: None). If the current sampler has no child sampler, the method returns that value directly. If it has a child sampler, the method also queries the child and combines the two values as described below.

The following table shows the various possible combinations, and the final results returned.

child sampler    num_samples    child_samples    result
T                x              y                min(x, y)
T                x              None             x
T                None           y                y
T                None           None             None
None             x              n/a              x
None             None           n/a              None

Returns

int, the number of samples, or None.

Examples

>>> import mindspore.dataset as ds
>>> sampler = ds.SequentialSampler(start_index=0, num_samples=3)
>>> num_samples = sampler.get_num_samples()
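The combination rules in the table can be written as a small function. This is a sketch of the documented rules only: resolve_num_samples and its parameters are hypothetical names, and the library's internal implementation may differ.

```python
def resolve_num_samples(num_samples, has_child=False, child_samples=None):
    """Combine a sampler's num_samples with its child's value (illustrative)."""
    if not has_child:
        # No child sampler: the sampler's own value is returned unchanged.
        return num_samples
    if num_samples is not None and child_samples is not None:
        # Both are set: the smaller one bounds the number of samples.
        return min(num_samples, child_samples)
    # At most one is set: return whichever that is, or None if neither is.
    return num_samples if num_samples is not None else child_samples


# A parent limited to 3 samples with a child limited to 4 yields 3.
combined = resolve_num_samples(3, has_child=True, child_samples=4)
```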