mindspore.dataset.RandomDataset

class mindspore.dataset.RandomDataset(total_rows=None, schema=None, columns_list=None, num_samples=None, num_parallel_workers=None, cache=None, shuffle=None, num_shards=None, shard_id=None)

A source dataset that generates random data.

Parameters

total_rows (int, optional) – Number of samples for the dataset to generate. Default: None, in which case the number of samples is random.
schema (Union[str, Schema], optional) – Data format policy, which specifies the data types and shapes of the data columns. Both a JSON file path and an object constructed by mindspore.dataset.Schema are acceptable. Default: None.
columns_list (list[str], optional) – List of column names of the dataset. Default: None, in which case the columns are named "c0", "c1", "c2", and so on.
num_samples (int, optional) – The number of samples to be included in the dataset. Default: None, which means all samples.
num_parallel_workers (int, optional) – Number of worker threads used to read the data. Default: None, which uses the global default number of workers (8); this can be set by mindspore.dataset.config.set_num_parallel_workers().
cache (DatasetCache, optional) – Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache. Default: None, which means no cache is used.
shuffle (bool, optional) – Whether to shuffle the dataset. Default: None, in which case the expected order behavior is shown in the table below.
num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples is the maximum number of samples per shard. Used in data parallel training.
shard_id (int, optional) – The shard ID within num_shards. Default: None. This argument can only be specified when num_shards is also specified.

Raises

RuntimeError – If num_shards is specified but shard_id is None.
RuntimeError – If shard_id is specified but num_shards is None.
ValueError – If num_parallel_workers exceeds the maximum number of threads.
ValueError – If shard_id is not in the range [0, num_shards).
TypeError – If total_rows is not of type int.
TypeError – If num_shards is not of type int.
TypeError – If num_parallel_workers is not of type int.
TypeError – If shuffle is not of type bool.
TypeError – If columns_list is not of type list.
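
The sharding rules listed above can be summarized in a short sketch. The helper below, check_shard_args, is a hypothetical name introduced only for illustration; it is not part of MindSpore and is not how the library implements its checks. It simply mirrors the documented conditions for num_shards and shard_id.

```python
def check_shard_args(num_shards=None, shard_id=None):
    """Hypothetical helper mirroring the documented shard rules (not MindSpore code)."""
    # num_shards must be an int when given.
    if num_shards is not None and not isinstance(num_shards, int):
        raise TypeError("num_shards must be of type int")
    # The two arguments must be specified together.
    if num_shards is not None and shard_id is None:
        raise RuntimeError("num_shards is specified but shard_id is None")
    if shard_id is not None and num_shards is None:
        raise RuntimeError("shard_id is specified but num_shards is None")
    # shard_id must fall in the half-open range [0, num_shards).
    if num_shards is not None and not 0 <= shard_id < num_shards:
        raise ValueError("shard_id must be in range [0, num_shards)")
    return num_shards, shard_id


# Valid combinations: both omitted, or both given with shard_id in range.
check_shard_args()
check_shard_args(num_shards=4, shard_id=3)
```

Under these rules, a call such as check_shard_args(num_shards=4, shard_id=4) would raise ValueError, because 4 lies outside [0, 4).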

Tutorial Examples:

Load & Process Data With Dataset Pipeline

Examples

>>> from mindspore import dtype as mstype
>>> import mindspore.dataset as ds
>>>
>>> schema = ds.Schema()
>>> schema.add_column('image', de_type=mstype.uint8, shape=[2])
>>> schema.add_column('label', de_type=mstype.uint8, shape=[1])
>>> # Create a random dataset with the columns defined in the schema
>>> ds1 = ds.RandomDataset(schema=schema, total_rows=50, num_parallel_workers=4)
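
Since schema also accepts a JSON file path, the same two columns could be described in a file instead of a Schema object. The sketch below writes such a file using only the standard library. The key layout ("columns", "type", "rank", "shape") and the file name random_schema.json are assumptions made for illustration and should be checked against the mindspore.dataset.Schema documentation before use.

```python
import json
import os
import tempfile

# Assumed schema layout: one entry per column with its type, rank, and shape.
schema_dict = {
    "columns": {
        "image": {"type": "uint8", "rank": 1, "shape": [2]},
        "label": {"type": "uint8", "rank": 1, "shape": [1]},
    }
}

# Write the schema to a temporary location (hypothetical file name).
schema_path = os.path.join(tempfile.mkdtemp(), "random_schema.json")
with open(schema_path, "w", encoding="utf-8") as f:
    json.dump(schema_dict, f, indent=2)

# The path could then be passed as the schema argument, for example:
# ds.RandomDataset(schema=schema_path, total_rows=50, num_parallel_workers=4)
```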

Pre-processing Operation

`mindspore.dataset.Dataset.apply`	Apply a function in this dataset.
`mindspore.dataset.Dataset.concat`	Concatenate the dataset objects in the input list.
`mindspore.dataset.Dataset.filter`	Filter dataset by predicate.
`mindspore.dataset.Dataset.flat_map`	Map func to each row in dataset and flatten the result.
`mindspore.dataset.Dataset.map`	Apply each operation in operations to this dataset.
`mindspore.dataset.Dataset.project`	The specified columns will be selected from the dataset and passed into the pipeline with the order specified.
`mindspore.dataset.Dataset.rename`	Rename the columns in input datasets.
`mindspore.dataset.Dataset.repeat`	Repeat this dataset count times.
`mindspore.dataset.Dataset.reset`	Reset the dataset for next epoch.
`mindspore.dataset.Dataset.save`	Save the dynamic data processed by the dataset pipeline in common dataset format.
`mindspore.dataset.Dataset.shuffle`	Shuffle the dataset by creating a cache with the size of buffer_size.
`mindspore.dataset.Dataset.skip`	Skip the first N elements of this dataset.
`mindspore.dataset.Dataset.split`	Split the dataset into smaller, non-overlapping datasets.
`mindspore.dataset.Dataset.take`	Take the first specified number of samples from the dataset.
`mindspore.dataset.Dataset.zip`	Zip the datasets in the input tuple of datasets.

Batch

`mindspore.dataset.Dataset.batch`	Combine batch_size consecutive rows into batches, applying per_batch_map to the samples first.
`mindspore.dataset.Dataset.bucket_batch_by_length`	Bucket elements according to their lengths.
`mindspore.dataset.Dataset.padded_batch`	Combine batch_size consecutive rows into batches, applying pad_info to the samples first.

Iterator

`mindspore.dataset.Dataset.create_dict_iterator`	Create an iterator over the dataset that yields samples of type dict, where the key is the column name and the value is the data.
`mindspore.dataset.Dataset.create_tuple_iterator`	Create an iterator over the dataset that yields samples of type list, whose elements are the data for each column.

Attribute

`mindspore.dataset.Dataset.get_batch_size`	Return the size of batch.
`mindspore.dataset.Dataset.get_class_indexing`	Get the mapping dictionary from category names to category indexes.
`mindspore.dataset.Dataset.get_col_names`	Return the names of the columns in dataset.
`mindspore.dataset.Dataset.get_dataset_size`	Return the number of batches in an epoch.
`mindspore.dataset.Dataset.get_repeat_count`	Get the number of repetitions in RepeatDataset.
`mindspore.dataset.Dataset.input_indexs`	Get the column index, which represents the corresponding relationship between the data column order and the network when using the sink mode.
`mindspore.dataset.Dataset.num_classes`	Get the number of classes in a dataset.
`mindspore.dataset.Dataset.output_shapes`	Get the shapes of output data.
`mindspore.dataset.Dataset.output_types`	Get the types of output data.

Apply Sampler

`mindspore.dataset.MappableDataset.add_sampler`	Add a child sampler for the current dataset.
`mindspore.dataset.MappableDataset.use_sampler`	Replace the last child sampler of the current dataset, leaving the parent sampler unchanged.

Others

`mindspore.dataset.Dataset.recv`	The dataset communication interface receives data sent by the source Dataset using `mindspore.dataset.Dataset.send`.
`mindspore.dataset.Dataset.send`	The dataset communication interface sends data to the target Dataset, which can be received through `mindspore.dataset.Dataset.recv`.
`mindspore.dataset.Dataset.sync_update`	Release a blocking condition and trigger callback with given data.
`mindspore.dataset.Dataset.sync_wait`	Add a blocking condition to the input Dataset; a synchronize action will be applied.
`mindspore.dataset.Dataset.to_json`	Serialize a pipeline into a JSON string and dump it into a file if filename is provided.