mindspore.dataset

This module provides APIs to load and process various datasets, such as MNIST, CIFAR-10, CIFAR-100, VOC, ImageNet and CelebA. It also supports datasets in special formats, including mindrecord, tfrecord and manifest. Users can also create samplers with this module to sample data.

class mindspore.dataset.CLUEDataset(dataset_files, task='AFQMC', usage='train', num_samples=None, num_parallel_workers=None, shuffle=Shuffle.GLOBAL, num_shards=None, shard_id=None)

A source dataset that reads and parses CLUE datasets. CLUE, the Chinese Language Understanding Evaluation Benchmark, is a collection of datasets, baselines, pre-trained models, corpus and leaderboard. Supported CLUE classification tasks: ‘AFQMC’, ‘TNEWS’, ‘IFLYTEK’, ‘CMNLI’, ‘WSC’ and ‘CSL’.

Citation of CLUE dataset.

@article{CLUEbenchmark,
title   = {CLUE: A Chinese Language Understanding Evaluation Benchmark},
author  = {Liang Xu, Xuanwei Zhang, Lu Li, Hai Hu, Chenjie Cao, Weitang Liu, Junyi Li, Yudong Li,
           Kai Sun, Yechen Xu, Yiming Cui, Cong Yu, Qianqian Dong, Yin Tian, Dian Yu, Bo Shi, Jun Zeng,
           Rongzhao Wang, Weijian Xie, Yanting Li, Yina Patterson, Zuoyu Tian, Yiwen Zhang, He Zhou,
           Shaoweihua Liu, Qipeng Zhao, Cong Yue, Xinrui Zhang, Zhengliang Yang, Zhenzhong Lan},
journal = {arXiv preprint arXiv:2004.05986},
year    = {2020},
howpublished = {https://github.com/CLUEbenchmark/CLUE},
description  = {CLUE, a Chinese Language Understanding Evaluation benchmark. It contains eight different
                tasks, including single-sentence classification, sentence pair classification, and machine
                reading comprehension.}
}
Parameters
  • dataset_files (Union[str, list[str]]) – String or list of files to be read or glob strings to search for a pattern of files. The list will be sorted in a lexicographical order.

  • task (str, optional) – The kind of task, one of ‘AFQMC’, ‘TNEWS’, ‘IFLYTEK’, ‘CMNLI’, ‘WSC’ and ‘CSL’. (default=AFQMC).

  • usage (str, optional) – The part of the dataset to read, one of ‘train’, ‘test’ or ‘eval’ (default=”train”).

  • num_samples (int, optional) – Number of samples (rows) to read (default=None, reads the full dataset).

  • num_parallel_workers (int, optional) – Number of workers to read the data (default=None, number set in the config).

  • shuffle (Union[bool, Shuffle level], optional) –

    Perform reshuffling of the data every epoch (default=Shuffle.GLOBAL). If shuffle is False, no shuffling will be performed. If shuffle is True, the behavior is the same as setting shuffle to Shuffle.GLOBAL. Otherwise, there are two levels of shuffling:

    • Shuffle.GLOBAL: Shuffle both the files and samples.

    • Shuffle.FILES: Shuffle files only.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into (default=None).

  • shard_id (int, optional) – The shard ID within num_shards (default=None). This argument can only be specified when num_shards is also specified.

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_files = ["/path/to/1", "/path/to/2"] # contains 1 or multiple text files
>>> dataset = ds.CLUEDataset(dataset_files=dataset_files, task='AFQMC', usage='train')
apply(apply_func)

Apply a function in this dataset.

Parameters

apply_func (function) – A function that must take one ‘Dataset’ as an argument and return a preprocessed ‘Dataset’.

Returns

Dataset, applied by the function.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>>
>>> # Declare an apply_func function which returns a Dataset object
>>> def apply_func(ds):
>>>     ds = ds.batch(2)
>>>     return ds
>>>
>>> # Use apply to call apply_func
>>> data = data.apply(apply_func)
Raises
  • TypeError – If apply_func is not a function.

  • TypeError – If apply_func doesn’t return a Dataset.

batch(batch_size, drop_remainder=False, num_parallel_workers=None, per_batch_map=None, input_columns=None, output_columns=None, column_order=None, pad_info=None)

Combine batch_size number of consecutive rows into batches.

For any child node, a batch is treated as a single row. For any column, all the elements within that column must have the same shape. If a per_batch_map callable is provided, it will be applied to the batches of tensors.

Note

The order of using repeat and batch affects the number of batches and the input that per_batch_map receives. It is recommended that the repeat operation be used after the batch operation.
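To make the note above concrete, the following is a plain-Python model of the batch counts. It does not call MindSpore: `count_batches` is a hypothetical helper, and the model assumes that repeat simply replays the rows back to back.

```python
import math

def count_batches(num_rows, batch_size, drop_remainder=False):
    """Number of batches produced from num_rows consecutive rows."""
    if drop_remainder:
        return num_rows // batch_size
    return math.ceil(num_rows / batch_size)

rows, batch_size, repeats = 10, 4, 2

# batch, then repeat: every epoch ends with its own short batch.
batch_then_repeat = count_batches(rows, batch_size) * repeats

# repeat, then batch: the repeated rows are batched as one long stream,
# so a batch may contain rows from two different epochs.
repeat_then_batch = count_batches(rows * repeats, batch_size)

print(batch_then_repeat, repeat_then_batch)
```

Under these assumptions the two orderings yield different numbers of batches, which is why batching before repeating is the recommended order.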

Parameters
  • batch_size (int or function) – The number of rows each batch is created with. An int or callable which takes exactly 1 parameter, BatchInfo.

  • drop_remainder (bool, optional) – Determines whether or not to drop the last, possibly incomplete, batch (default=False). If True and there are fewer than batch_size rows available to make the last batch, those rows will be dropped and not propagated to the child node.

  • num_parallel_workers (int, optional) – Number of workers to process the dataset in parallel (default=None).

  • per_batch_map (callable, optional) – Per batch map callable. A callable which takes (list[Tensor], list[Tensor], …, BatchInfo) as input parameters. Each list[Tensor] represents a batch of Tensors on a given column. The number of lists should match with number of entries in input_columns. The last parameter of the callable should always be a BatchInfo object.

  • input_columns (list[str], optional) – List of names of the input columns. The size of the list should match with signature of the per_batch_map callable.

  • output_columns (list[str], optional) – [Not currently implemented] List of names assigned to the columns outputted by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. (default=None, output columns will have the same name as the input columns, i.e., the columns will be replaced).

  • column_order (list[str], optional) – [Not currently implemented] List of all the desired columns to propagate to the child node. This list must be a subset of all the columns in the dataset after all operations are applied. The order of the columns in each row propagated to the child node follow the order they appear in this list. The parameter is mandatory if the len(input_columns) != len(output_columns). (default=None, all columns will be propagated to the child node, the order of the columns will remain the same).

  • pad_info (dict, optional) – Whether to perform padding on selected columns. pad_info={"col1":([224,224],0)} would pad the column named "col1" to a tensor of size [224,224] and fill the missing values with 0.
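The effect of a pad_info entry can be sketched in plain Python on nested lists. This is only an illustration of the (shape, fill value) convention described above. `pad_2d` is a hypothetical helper, not part of MindSpore, and it assumes the input already fits inside the target shape.

```python
def pad_2d(rows, shape, fill):
    """Pad a ragged list of lists to shape (height, width) using fill."""
    height, width = shape
    # Extend every existing row to the target width.
    padded = [list(r) + [fill] * (width - len(r)) for r in rows]
    # Append fully padded rows until the target height is reached.
    padded += [[fill] * width for _ in range(height - len(padded))]
    return padded

# Same layout as pad_info: column name -> (target shape, fill value).
pad_info = {"col1": ([3, 3], 0)}
target_shape, fill_value = pad_info["col1"]
result = pad_2d([[1, 2], [3]], target_shape, fill_value)
print(result)
```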

Returns

BatchDataset, dataset batched.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>>
>>> # Create a dataset where every 100 rows is combined into a batch
>>> # and drops the last incomplete batch if there is one.
>>> data = data.batch(100, True)
bucket_batch_by_length(column_names, bucket_boundaries, bucket_batch_sizes, element_length_function=None, pad_info=None, pad_to_bucket_boundary=False, drop_remainder=False)

Bucket elements according to their lengths. Each bucket will be padded and batched when they are full.

A length function is called on each row in the dataset. The row is then bucketed based on its length and bucket_boundaries. When a bucket reaches its corresponding size specified in bucket_batch_sizes, the entire bucket will be padded according to pad_info, and then batched. Each batch will be full, except possibly the last batch of each bucket.

Parameters
  • column_names (list[str]) – Columns passed to element_length_function.

  • bucket_boundaries (list[int]) – A list consisting of the upper boundaries of the buckets. Must be strictly increasing. If there are n boundaries, n+1 buckets are created: one bucket for [0, bucket_boundaries[0]), one bucket for [bucket_boundaries[i], bucket_boundaries[i+1]) for each 0<=i<n-1, and one bucket for [bucket_boundaries[n-1], inf).

  • bucket_batch_sizes (list[int]) – A list consisting of the batch sizes for each bucket. Must contain len(bucket_boundaries)+1 elements.

  • element_length_function (Callable, optional) – A function that takes in len(column_names) arguments and returns an int. If no value is provided, then len(column_names) must be 1, and the size of the first dimension of that column will be taken as the length (default=None).

  • pad_info (dict, optional) – Represents how to batch each column. The key corresponds to the column name, and the value must be a tuple of 2 elements. The first element corresponds to the shape to pad to, and the second element corresponds to the value to pad with. If a column is not specified, then that column will be padded to the longest in the current batch, and 0 will be used as the padding value. Any None dimensions will be padded to the longest in the current batch, unless pad_to_bucket_boundary is True. If no padding is wanted, set pad_info to None (default=None).

  • pad_to_bucket_boundary (bool, optional) – If True, will pad each None dimension in pad_info to the bucket_boundary minus 1. If there are any elements that fall into the last bucket, an error will occur (default=False).

  • drop_remainder (bool, optional) – If True, will drop the last batch for each bucket if it is not a full batch (default=False).
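The half-open bucket intervals described for bucket_boundaries can be illustrated with the standard `bisect` module. This is a sketch of the interval rule only, not the MindSpore implementation, and `bucket_index` is a hypothetical name.

```python
from bisect import bisect_right

def bucket_index(length, bucket_boundaries):
    """Index of the bucket whose half-open interval contains length."""
    # bisect_right counts the boundaries that are <= length, which is
    # exactly the index of the interval [lower, upper) holding length.
    return bisect_right(bucket_boundaries, length)

bucket_boundaries = [5, 10]  # buckets: [0, 5), [5, 10), [10, inf)
indices = [bucket_index(n, bucket_boundaries) for n in (0, 4, 5, 9, 10, 25)]
print(indices)
```

With n boundaries the index ranges over n+1 buckets, which is why bucket_batch_sizes must contain len(bucket_boundaries)+1 elements.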

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>>
>>> # Create a dataset where rows are grouped into buckets by length,
>>> # then padded and batched bucket by bucket.
>>> column_names = ["col1", "col2"]
>>> bucket_boundaries = [5, 10]
>>> bucket_batch_sizes = [5, 1, 1]
>>> element_length_function = (lambda col1, col2: max(len(col1), len(col2)))
>>>
>>> # Will pad col1 to shape [2, bucket_boundaries[i]] where i is the
>>> # index of the bucket that is currently being batched.
>>> # Will pad col2 to a shape where each dimension is the longest in all
>>> # the elements currently being batched.
>>> pad_info = {"col1": ([2, None], -1)}
>>> pad_to_bucket_boundary = True
>>>
>>> data = data.bucket_batch_by_length(column_names, bucket_boundaries,
>>>                                    bucket_batch_sizes,
>>>                                    element_length_function, pad_info,
>>>                                    pad_to_bucket_boundary)
concat(datasets)

Concatenate the datasets in the input list of datasets. The “+” operator is also supported to concatenate.

Note

The column names, and the rank and type of the column data, must be the same in all input datasets.

Parameters

datasets (Union[list, class Dataset]) – A list of datasets or a single class Dataset to be concatenated together with this dataset.

Returns

ConcatDataset, dataset concatenated.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # ds1 and ds2 are instances of Dataset object
>>>
>>> # Create a dataset by concatenating ds1 and ds2 with "+" operator
>>> data1 = ds1 + ds2
>>> # Create a dataset by concatenating ds1 and ds2 with concat operation
>>> data1 = ds1.concat(ds2)
create_dict_iterator(num_epochs=-1, output_numpy=False)

Create an iterator over the dataset. The data retrieved will be a dictionary.

The order of the columns in the dictionary may not be the same as the original order.

Parameters
  • num_epochs (int, optional) – Maximum number of epochs that iterator can be iterated (default=-1, iterator can be iterated infinite number of epochs).

  • output_numpy (bool, optional) – Whether or not to output NumPy datatype, if output_numpy=False, iterator will output MSTensor (default=False).

Returns

Iterator, dictionary of column name-ndarray pairs.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>>
>>> # create an iterator
>>> # The columns in the data obtained by the iterator might be changed.
>>> iterator = data.create_dict_iterator()
>>> for item in iterator:
>>>     # print the data in column1
>>>     print(item["column1"])
create_tuple_iterator(columns=None, num_epochs=-1, output_numpy=False)

Create an iterator over the dataset. The data retrieved will be a list of ndarrays of data.

To specify which columns to list and the order needed, use columns. If columns is not provided, the order of the columns will not be changed.

Parameters
  • columns (list[str], optional) – List of columns to be used to specify the order of columns (default=None, means all columns).

  • num_epochs (int, optional) – Maximum number of epochs that iterator can be iterated. (default=-1, iterator can be iterated infinite number of epochs)

  • output_numpy (bool, optional) – Whether or not to output NumPy datatype. If output_numpy=False, iterator will output MSTensor (default=False).

Returns

Iterator, list of ndarrays.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>>
>>> # Create an iterator
>>> # The columns in the data obtained by the iterator will not be changed.
>>> iterator = data.create_tuple_iterator()
>>> for item in iterator:
>>>     # convert the returned tuple to a list and print
>>>     print(list(item))
device_que(prefetch_size=None, send_epoch_end=True)

Return a transferred Dataset that transfers data through a device.

Parameters
  • prefetch_size (int, optional) – Prefetch number of records ahead of the user’s request (default=None).

  • send_epoch_end (bool, optional) – Whether to send end of sequence to device or not (default=True).

Note

If the device is Ascend, the features of the data are transferred one by one. Each transfer is limited to 256M.

Returns

TransferDataset, dataset for transferring.

filter(predicate, input_columns=None, num_parallel_workers=1)

Filter dataset by predicate.

Note

If input_columns not provided or empty, all columns will be used.

Parameters
  • predicate (callable) – Python callable which returns a boolean value. Rows for which it returns False are filtered out.

  • input_columns (list[str], optional) – List of names of the input columns (default=None, the predicate will be applied to all columns in the dataset).

  • num_parallel_workers (int, optional) – Number of workers to process the dataset in parallel (default=1).

Returns

FilterDataset, dataset filtered.
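The keep-or-drop convention of the predicate can be shown without MindSpore. In this plain-Python sketch, rows are modeled as dicts and `keep_small` is a hypothetical predicate. The only point illustrated is that a row survives when the predicate returns True.

```python
def keep_small(data):
    """Predicate: True keeps the row, False filters it out."""
    return data < 11

# A stand-in for a single-column dataset holding the values 0 to 63.
rows = [{"data": i} for i in range(64)]

# The predicate receives the values of the selected input column(s).
kept = [row for row in rows if keep_small(row["data"])]
print(len(kept), kept[-1])
```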

Examples

>>> import mindspore.dataset as ds
>>> # dataset is a generator dataset with one column "data" holding the values 0 ~ 63
>>> # Keep the rows whose value is less than 11; rows greater than or equal to 11 are filtered out
>>> dataset_f = dataset.filter(predicate=lambda data: data < 11, input_columns=["data"])
flat_map(func)

Map func to each row in dataset and flatten the result.

The specified func is a function that must take one ‘Ndarray’ as input and return a ‘Dataset’.
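The map-then-flatten behavior can be modeled with `itertools.chain`. In this plain-Python sketch, lists stand in for datasets, and `flat_map` and `expand` are hypothetical helpers, not MindSpore functions.

```python
from itertools import chain

def flat_map(rows, func):
    """Apply func to every row and concatenate the resulting sequences."""
    return list(chain.from_iterable(func(row) for row in rows))

def expand(row):
    # Stand-in for a function that builds a whole dataset from one row:
    # here the value n becomes a "dataset" containing n copies of n.
    return [row] * row

flattened = flat_map([1, 2, 3], expand)
print(flattened)
```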

Parameters

func (function) – A function that must take one ‘Ndarray’ as an argument and return a ‘Dataset’.

Returns

Dataset, applied by the function.

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>>
>>> # Declare a function which returns a Dataset object
>>> def flat_map_func(x):
>>>     data_dir = text.to_str(x[0])
>>>     d = ds.ImageFolderDataset(data_dir)
>>>     return d
>>> # data is an instance of a Dataset object.
>>> data = ds.TextFileDataset(DATA_FILE)
>>> data = data.flat_map(flat_map_func)
Raises
  • TypeError – If func is not a function.

  • TypeError – If func doesn’t return a Dataset.

get_batch_size()

Get the size of a batch.

Returns

Number, the number of rows in a batch.

get_col_names()

Get the names of the columns in the dataset.

get_dataset_size()

Get the number of batches in an epoch.

Returns

Number, number of batches.

get_repeat_count()

Get the number of times the dataset is repeated by RepeatDataset, or 1 if it is not repeated.

Returns

Number, the repeat count.

map(operations=None, input_columns=None, output_columns=None, column_order=None, num_parallel_workers=None, python_multiprocessing=False, cache=None, callbacks=None)

Apply each operation in operations to this dataset.

The order of operations is determined by the position of each operation in the operations parameter. operations[0] will be applied first, then operations[1], then operations[2], etc.

Each operation will be passed one or more columns from the dataset as input, and zero or more columns will be outputted. The first operation will be passed the columns specified in input_columns as input. If there is more than one operator in operations, the outputted columns of the previous operation are used as the input columns for the next operation. The columns outputted by the very last operation will be assigned names specified by output_columns.

Only the columns specified in column_order will be propagated to the child node. These columns will be in the same order as specified in column_order.
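The column flow described above can be modeled on a single row represented as a dict. This is a simplified plain-Python sketch of the documented rules, not the MindSpore implementation. `map_row` is a hypothetical helper, and it assumes the input columns stay available for column_order, as in the examples of this section.

```python
def map_row(row, operations, input_columns, output_columns=None, column_order=None):
    """Apply operations to one row (a dict of column name -> value)."""
    values = tuple(row[name] for name in input_columns)
    for op in operations:
        result = op(*values)
        # The outputs of one operation are the inputs of the next one.
        values = result if isinstance(result, tuple) else (result,)

    if output_columns is None:
        output_columns = input_columns
    if len(values) != len(output_columns):
        raise ValueError("output_columns must name every output column")
    outputs = dict(zip(output_columns, values))

    if column_order is None:
        if len(input_columns) != len(output_columns):
            raise ValueError("column_order is required when the column count changes")
        renamed = dict(zip(input_columns, output_columns))
        # Replace each input column in place, keeping the original order.
        mapped = {}
        for name, value in row.items():
            new_name = renamed.get(name, name)
            mapped[new_name] = outputs.get(new_name, value)
        return mapped

    # Only the columns listed in column_order are propagated, in that order.
    available = {**row, **outputs}
    return {name: available[name] for name in column_order}

renamed_row = map_row({"image": 1, "label": 0}, [lambda x: x + 1],
                      ["image"], ["decoded_image"])
ops = [lambda x, y: (x, x + y, x + y + 1),
       lambda x, y, z: x * y * z,
       lambda x: (x % 2, x % 3, x % 5, x % 7)]
reordered_row = map_row({"col0": 1, "col1": 2, "col2": 3}, ops,
                        ["col2", "col0"], ["mod2", "mod3", "mod5", "mod7"],
                        ["mod7", "mod3", "col1"])
print(renamed_row, reordered_row)
```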

Parameters
  • operations (Union[list[TensorOp], list[functions]]) – List of operations to be applied on the dataset. Operations are applied in the order they appear in this list.

  • input_columns (list[str], optional) – List of the names of the columns that will be passed to the first operation as input. The size of this list must match the number of input columns expected by the first operator. (default=None, the first operation will be passed as many columns as it requires, starting from the first column).

  • output_columns (list[str], optional) – List of names assigned to the columns outputted by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. (default=None, output columns will have the same name as the input columns, i.e., the columns will be replaced).

  • column_order (list[str], optional) – List of all the desired columns to propagate to the child node. This list must be a subset of all the columns in the dataset after all operations are applied. The order of the columns in each row propagated to the child node follow the order they appear in this list. The parameter is mandatory if the len(input_columns) != len(output_columns). (default=None, all columns will be propagated to the child node, the order of the columns will remain the same).

  • num_parallel_workers (int, optional) – Number of threads used to process the dataset in parallel (default=None, the value from the configuration will be used).

  • python_multiprocessing (bool, optional) – Parallelize Python operations with multiple worker processes. This option could be beneficial if the Python operation is computational heavy (default=False).

  • cache (DatasetCache, optional) – Tensor cache to use. (default=None which means no cache is used). The cache feature is under development and is not recommended.

  • callbacks (DSCallback, list[DSCallback], optional) – List of Dataset callbacks to be called (default=None).

Returns

MapDataset, dataset after mapping operation.

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.vision.c_transforms as c_transforms
>>>
>>> # data is an instance of Dataset which has 2 columns, "image" and "label".
>>> # ds_pyfunc is an instance of Dataset which has 3 columns, "col0", "col1", and "col2".
>>> # Each column is a 2D array of integers.
>>>
>>> # Set the global configuration value for num_parallel_workers to be 2.
>>> # Operations which use this configuration value will use 2 worker threads,
>>> # unless otherwise specified in the operator's constructor.
>>> # set_num_parallel_workers can be called again later if a different
>>> # global configuration value for the number of worker threads is desired.
>>> ds.config.set_num_parallel_workers(2)
>>>
>>> # Define two operations, where each operation accepts 1 input column and outputs 1 column.
>>> decode_op = c_transforms.Decode(rgb=True)
>>> random_jitter_op = c_transforms.RandomColorAdjust((0.8, 0.8), (1, 1), (1, 1), (0, 0))
>>>
>>> # 1) Simple map example
>>>
>>> operations = [decode_op]
>>> input_columns = ["image"]
>>>
>>> # Apply decode_op on column "image". This column will be replaced by the outputted
>>> # column of decode_op. Since column_order is not provided, both columns "image"
>>> # and "label" will be propagated to the child node in their original order.
>>> ds_decoded = data.map(operations, input_columns)
>>>
>>> # Rename column "image" to "decoded_image".
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(operations, input_columns, output_columns)
>>>
>>> # Specify the order of the columns.
>>> column_order = ["label", "image"]
>>> ds_decoded = data.map(operations, input_columns, None, column_order)
>>>
>>> # Rename column "image" to "decoded_image" and also specify the order of the columns.
>>> column_order = ["label", "decoded_image"]
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(operations, input_columns, output_columns, column_order)
>>>
>>> # Rename column "image" to "decoded_image" and keep only this column.
>>> column_order = ["decoded_image"]
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(operations, input_columns, output_columns, column_order)
>>>
>>> # A simple example using pyfunc: Renaming columns and specifying column order
>>> # work in the same way as the previous examples.
>>> input_columns = ["col0"]
>>> operations = [(lambda x: x + 1)]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns)
>>>
>>> # 2) Map example with more than one operation
>>>
>>> # If this list of operations is used with map, decode_op will be applied
>>> # first, then random_jitter_op will be applied.
>>> operations = [decode_op, random_jitter_op]
>>>
>>> input_columns = ["image"]
>>>
>>> # Create a dataset where the images are decoded, then randomly color jittered.
>>> # decode_op takes column "image" as input and outputs one column. The column
>>> # outputted by decode_op is passed as input to random_jitter_op.
>>> # random_jitter_op will output one column. Column "image" will be replaced by
>>> # the column outputted by random_jitter_op (the very last operation). All other
>>> # columns are unchanged. Since column_order is not specified, the order of the
>>> # columns will remain the same.
>>> ds_mapped = data.map(operations, input_columns)
>>>
>>> # Create a dataset that is identical to ds_mapped, except the column "image"
>>> # that is outputted by random_jitter_op is renamed to "image_transformed".
>>> # Specifying column order works in the same way as examples in 1).
>>> output_columns = ["image_transformed"]
>>> ds_mapped_and_renamed = data.map(operations, input_columns, output_columns)
>>>
>>> # Multiple operations using pyfunc: Renaming columns and specifying column order
>>> # work in the same way as examples in 1).
>>> input_columns = ["col0"]
>>> operations = [(lambda x: x + x), (lambda x: x - 1)]
>>> output_columns = ["col0_mapped"]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns, output_columns)
>>>
>>> # 3) Example where number of input columns is not equal to number of output columns
>>>
>>> # operations[0] is a lambda that takes 2 columns as input and outputs 3 columns.
>>> # operations[1] is a lambda that takes 3 columns as input and outputs 1 column.
>>> # operations[2] is a lambda that takes 1 column as input and outputs 4 columns.
>>> #
>>> # Note: The number of output columns of operations[i] must equal the number of
>>> # input columns of operations[i+1]. Otherwise, this map call will result
>>> # in an error.
>>> operations = [(lambda x, y: (x, x + y, x + y + 1)),
>>>               (lambda x, y, z: x * y * z),
>>>               (lambda x: (x % 2, x % 3, x % 5, x % 7))]
>>>
>>> # Note: Since the number of input columns is not the same as the number of
>>> # output columns, the output_columns and column_order parameters must be
>>> # specified. Otherwise, this map call will also result in an error.
>>> input_columns = ["col2", "col0"]
>>> output_columns = ["mod2", "mod3", "mod5", "mod7"]
>>>
>>> # Propagate all columns to the child node in this order:
>>> column_order = ["col0", "col2", "mod2", "mod3", "mod5", "mod7", "col1"]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns, output_columns, column_order)
>>>
>>> # Propagate some columns to the child node in this order:
>>> column_order = ["mod7", "mod3", "col1"]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns, output_columns, column_order)
num_classes()

Get the number of classes in a dataset.

Returns

Number, number of classes.

output_shapes()

Get the shapes of output data.

Returns

List, list of shapes of each column.

output_types()

Get the types of output data.

Returns

List, list of data types.

project(columns)

Project certain columns in input dataset.

The specified columns will be selected from the dataset and passed down the pipeline in the order specified. The other columns are discarded.

Parameters

columns (list[str]) – List of names of the columns to project.

Returns

ProjectDataset, dataset projected.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>> columns_to_project = ["column3", "column1", "column2"]
>>>
>>> # Create a dataset that consists of column3, column1, column2
>>> # in that order, regardless of the original order of columns.
>>> data = data.project(columns=columns_to_project)
rename(input_columns, output_columns)

Rename the columns in the input dataset.

Parameters
  • input_columns (list[str]) – List of names of the input columns.

  • output_columns (list[str]) – List of names of the output columns.

Returns

RenameDataset, dataset renamed.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> input_columns = ["input_col1", "input_col2", "input_col3"]
>>> output_columns = ["output_col1", "output_col2", "output_col3"]
>>>
>>> # Create a dataset where input_col1 is renamed to output_col1, and
>>> # input_col2 is renamed to output_col2, and input_col3 is renamed
>>> # to output_col3.
>>> data = data.rename(input_columns=input_columns, output_columns=output_columns)
repeat(count=None)

Repeat this dataset count times. Repeat indefinitely if the count is None or -1.

Note

The order of using repeat and batch affects the number of batches. It is recommended that the repeat operation be used after the batch operation. If dataset_sink_mode is False, the repeat operation is invalid. If dataset_sink_mode is True, the repeat count must equal the number of training epochs. Otherwise, errors could occur because the amount of data does not match the amount that training requires.

Parameters

count (int) – Number of times the dataset is repeated (default=None).

Returns

RepeatDataset, dataset repeated.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>>
>>> # Create a dataset where the dataset is repeated for 50 epochs
>>> repeated = data.repeat(50)
>>>
>>> # Create a dataset where each epoch is shuffled individually
>>> shuffled_and_repeated = data.shuffle(10)
>>> shuffled_and_repeated = shuffled_and_repeated.repeat(50)
>>>
>>> # Create a dataset where the dataset is first repeated for
>>> # 50 epochs before shuffling. The shuffle operator will treat
>>> # the entire 50 epochs as one big dataset.
>>> repeat_and_shuffle = data.repeat(50)
>>> repeat_and_shuffle = repeat_and_shuffle.shuffle(10)
reset()

Reset the dataset for next epoch.

save(file_name, num_files=1, file_type='mindrecord')

Save the dynamic data processed by the dataset pipeline in a common dataset format. Supported dataset formats: ‘mindrecord’ only.

Implicit type casting exists when saving data as ‘mindrecord’. The table below shows how to do type casting.

Implicit Type Casting when Saving as ‘mindrecord’

Type in ‘dataset’    Type in ‘mindrecord’    Details
bool                 None                    Not supported
int8                 int32
uint8                bytes(1D uint8)         Drop dimension
int16                int32
uint16               int32
int32                int32
uint32               int64
int64                int64
uint64               None                    Not supported
float16              float32
float32              float32
float64              float64
string               string                  Multi-dimensional string not supported

Note

  1. To save the samples in order, set dataset’s shuffle to False and num_files to 1.

  2. Before calling the function, do not use batch operator, repeat operator or data augmentation operators with random attribute in map operator.

  3. Mindrecord does not support DE_UINT64, multi-dimensional DE_UINT8 (the dimension is dropped) or multi-dimensional DE_STRING.

Parameters
  • file_name (str) – Path to dataset file.

  • num_files (int, optional) – Number of dataset files (default=1).

  • file_type (str, optional) – Dataset format (default=’mindrecord’).

shuffle(buffer_size)

Randomly shuffles the rows of this dataset using the following algorithm:

  1. Make a shuffle buffer that contains the first buffer_size rows.

  2. Randomly select an element from the shuffle buffer to be the next row propagated to the child node.

  3. Get the next row (if any) from the parent node and put it in the shuffle buffer.

  4. Repeat steps 2 and 3 until there are no more rows left in the shuffle buffer.

A seed can be provided to be used on the first epoch. In every subsequent epoch, the seed is changed to a new, randomly generated value.
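The four steps above can be sketched in plain Python. This is an illustrative model of the buffer algorithm, not the MindSpore implementation. `buffered_shuffle` is a hypothetical helper, and it uses Python's `random.Random` in place of the library's own seeding.

```python
import random

def buffered_shuffle(rows, buffer_size, seed=None):
    """Shuffle rows with a fixed-size buffer, following steps 1 to 4."""
    rng = random.Random(seed)
    source = iter(rows)

    # Step 1: fill the buffer with the first buffer_size rows.
    buffer = []
    for row in source:
        buffer.append(row)
        if len(buffer) == buffer_size:
            break

    shuffled = []
    while buffer:  # Step 4: loop until the buffer is empty.
        # Step 2: pick a random element as the next output row.
        index = rng.randrange(len(buffer))
        shuffled.append(buffer[index])
        # Step 3: refill the freed slot with the next row, if any.
        try:
            buffer[index] = next(source)
        except StopIteration:
            buffer.pop(index)
    return shuffled

result = buffered_shuffle(range(10), buffer_size=4, seed=58)
print(result)
```

Because only buffer_size rows are candidates at any time, a row can move forward by at most buffer_size - 1 positions. This is why a buffer as large as the dataset is needed for a global shuffle.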

Parameters

buffer_size (int) – The size of the buffer (must be larger than 1) for shuffling. Setting buffer_size equal to the number of rows in the entire dataset will result in a global shuffle.

Returns

ShuffleDataset, dataset shuffled.

Raises

RuntimeError – If sync operators exist before shuffle.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> # Optionally set the seed for the first epoch
>>> ds.config.set_seed(58)
>>>
>>> # Create a shuffled dataset using a shuffle buffer of size 4
>>> data = data.shuffle(4)
skip(count)

Skip the first count elements of this dataset.

Parameters

count (int) – Number of elements in the dataset to be skipped.

Returns

SkipDataset, dataset skipped.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> # Create a dataset which skips first 3 elements from data
>>> data = data.skip(3)
split(sizes, randomize=True)

Split the dataset into smaller, non-overlapping datasets.

This is a general purpose split function which can be called from any operator in the pipeline. There is another, optimized split function, which will be called automatically if ds.split is called where ds is a MappableDataset.

Parameters
  • sizes (Union[list[int], list[float]]) –

    If a list of integers [s1, s2, …, sn] is provided, the dataset will be split into n datasets of size s1, size s2, …, size sn respectively. If the sum of all sizes does not equal the original dataset size, an error will occur. If a list of floats [f1, f2, …, fn] is provided, all floats must be between 0 and 1 and must sum to 1, otherwise an error will occur. The dataset will be split into n Datasets of size round(f1*K), round(f2*K), …, round(fn*K) where K is the size of the original dataset. If after rounding:

    • Any size equals 0, an error will occur.

    • The sum of split sizes < K, the difference will be added to the first split.

    • The sum of split sizes > K, the difference will be removed from the first large enough split such that it will have at least 1 row after removing the difference.

  • randomize (bool, optional) – Determines whether or not to split the data randomly (default=True). If True, the data will be randomly split. Otherwise, each split will be created with consecutive rows from the dataset.
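The rounding rules for a list of floats can be sketched as follows. This is a plain-Python model of the documented behavior. `absolute_split_sizes` is a hypothetical helper, and it assumes Python's built-in `round` (which rounds halves to the nearest even integer) as the rounding function.

```python
def absolute_split_sizes(fractions, dataset_size):
    """Turn fractional split sizes into row counts that sum to dataset_size."""
    sizes = [round(f * dataset_size) for f in fractions]
    if any(size == 0 for size in sizes):
        raise RuntimeError("a split would have size 0 after rounding")

    difference = dataset_size - sum(sizes)
    if difference > 0:
        # Too few rows assigned: give the missing rows to the first split.
        sizes[0] += difference
    elif difference < 0:
        # Too many rows assigned: take them from the first split that
        # still has at least 1 row afterwards.
        for i, size in enumerate(sizes):
            if size + difference > 0:
                sizes[i] += difference
                break
    return sizes

even = absolute_split_sizes([0.9, 0.1], 10)
short = absolute_split_sizes([1 / 3, 1 / 3, 1 / 3], 10)
over = absolute_split_sizes([0.5, 0.5], 3)
print(even, short, over)
```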

Note

  1. Dataset cannot be sharded if split is going to be called.

  2. It is strongly recommended to not shuffle the dataset, but use randomize=True instead. Shuffling the dataset may not be deterministic, which means the data in each split will be different in each epoch.

Raises
  • RuntimeError – If get_dataset_size returns None or is not supported for this dataset.

  • RuntimeError – If sizes is list of integers and sum of all elements in sizes does not equal the dataset size.

  • RuntimeError – If sizes is list of float and there is a split with size 0 after calculations.

  • RuntimeError – If the dataset is sharded prior to calling split.

  • ValueError – If sizes is list of float and not all floats are between 0 and 1, or if the floats don’t sum to 1.

Returns

tuple(Dataset), a tuple of datasets that have been split.

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_files = "/path/to/text_file/*"
>>>
>>> # TextFileDataset is not a mappable dataset, so this non-optimized split will be called.
>>> # Since many datasets have shuffle on by default, set shuffle to False if split will be called!
>>> data = ds.TextFileDataset(dataset_files, shuffle=False)
>>> train, test = data.split([0.9, 0.1])
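The rounding rules for a list of floats can be sketched in plain Python. This is an illustrative sketch, not the library's implementation: the helper name `split_sizes` is hypothetical, and Python's built-in `round` (which rounds halves to the nearest even integer) is assumed for the rounding step.

```python
def split_sizes(fractions, dataset_size):
    """Hypothetical helper: turn float fractions into absolute split sizes."""
    sizes = [round(f * dataset_size) for f in fractions]
    if any(size == 0 for size in sizes):
        raise RuntimeError("a split has size 0 after rounding")

    difference = dataset_size - sum(sizes)
    if difference > 0:
        # Sum of split sizes < K: add the difference to the first split.
        sizes[0] += difference
    elif difference < 0:
        # Sum of split sizes > K: remove the excess from the first split
        # that still has at least 1 row afterwards.
        excess = -difference
        for i, size in enumerate(sizes):
            if size - excess >= 1:
                sizes[i] -= excess
                break
    return sizes


print(split_sizes([0.9, 0.1], 10))
print(split_sizes([1 / 3, 1 / 3, 1 / 3], 10))
```

With K=10 and three equal fractions, each split rounds to 3 rows and the remaining row goes to the first split.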
sync_update(condition_name, num_batch=None, data=None)

Release a blocking condition and trigger callback with given data.

Parameters
  • condition_name (str) – The condition name that is used to toggle sending next row.

  • num_batch (Union[int, None]) – The number of batches (rows) that are released. When num_batch is None, it will default to the number specified by the sync_wait operator (default=None).

  • data (Union[dict, None]) – The data passed to the callback (default=None).

sync_wait(condition_name, num_batch=1, callback=None)

Add a blocking condition to the input Dataset.

Parameters
  • condition_name (str) – The condition name that is used to toggle sending the next row.

  • num_batch (int) – The number of batches without blocking at the start of each epoch.

  • callback (function) – The callback function that will be invoked when sync_update is called.

Raises

RuntimeError – If condition name already exists.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> data = data.sync_wait("callback1")
>>> data = data.batch(batch_size)
>>> for batch_data in data.create_dict_iterator():
>>>     # Release the blocking condition; do not reassign data here.
>>>     data.sync_update("callback1")
take(count=-1)

Take at most the given number of elements from the dataset.

Note

  1. If count is greater than the number of elements in the dataset or equal to -1, all the elements in dataset will be taken.

  2. The order of using take and batch matters. If take is applied before batch, it takes the given number of rows; otherwise, it takes the given number of batches.

Parameters

count (int, optional) – Number of elements to be taken from the dataset (default=-1).

Returns

TakeDataset, dataset taken.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> # Create a dataset that includes at most 50 elements.
>>> data = data.take(50)
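The effect of the order of take and batch can be illustrated with plain Python iterators. This is a sketch only: the functions `batch` and `take` below are hypothetical stand-ins that mimic the documented behavior on ordinary iterables, not the Dataset methods themselves.

```python
from itertools import islice


def batch(rows, batch_size):
    """Group consecutive rows into lists of at most batch_size rows."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, batch_size))
        if not chunk:
            return
        yield chunk


def take(items, count=-1):
    """Yield at most count items; -1 means take everything."""
    return islice(items, None if count == -1 else count)


rows = range(10)
# take before batch: 3 rows are taken, then grouped into batches of 2.
print(list(batch(take(rows, 3), 2)))
# batch before take: rows are grouped first, then 3 batches are taken.
print(list(take(batch(rows, 2), 3)))
```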
to_device(send_epoch_end=True)

Transfer data through CPU, GPU or Ascend devices.

Parameters

send_epoch_end (bool, optional) – Whether to send end of sequence to device or not (default=True).

Note

If device is Ascend, features of data will be transferred one by one. Each transfer is limited to 256M of data.

Returns

TransferDataset, dataset for transferring.

Raises
  • TypeError – If device_type is empty.

  • ValueError – If device_type is not ‘Ascend’, ‘GPU’ or ‘CPU’.

  • RuntimeError – If dataset is unknown.

  • RuntimeError – If distribution file path is given but failed to read.

zip(datasets)

Zip the datasets in the input tuple of datasets. Columns in the input datasets must not have the same name.

Parameters

datasets (Union[tuple, class Dataset]) – A tuple of datasets or a single class Dataset to be zipped together with this dataset.

Returns

ZipDataset, dataset zipped.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # ds1 and ds2 are instances of Dataset object
>>> # Create a dataset which is the combination of ds1 and ds2
>>> data = ds1.zip(ds2)
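Conceptually, zipping combines the columns of corresponding rows, which is why column names must be unique across the inputs. The sketch below models rows as plain dicts; `zip_rows` is a hypothetical helper, and stopping at the shorter input is an assumption inherited from Python's built-in `zip`, not a documented property of ZipDataset.

```python
def zip_rows(left_rows, right_rows):
    """Merge rows pairwise; rows are dicts mapping column name to value."""
    for left, right in zip(left_rows, right_rows):
        clash = left.keys() & right.keys()
        if clash:
            raise ValueError(f"duplicate column names: {sorted(clash)}")
        yield {**left, **right}


ds1_rows = [{"image": "a.png"}, {"image": "b.png"}]
ds2_rows = [{"label": 0}, {"label": 1}]
print(list(zip_rows(ds1_rows, ds2_rows)))
```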
class mindspore.dataset.CSVDataset(dataset_files, field_delim=',', column_defaults=None, column_names=None, num_samples=None, num_parallel_workers=None, shuffle=Shuffle.GLOBAL, num_shards=None, shard_id=None)[source]

A source dataset that reads and parses comma-separated values (CSV) datasets.

Parameters
  • dataset_files (Union[str, list[str]]) – String or list of files to be read or glob strings to search for a pattern of files. The list will be sorted in a lexicographical order.

  • field_delim (str, optional) – A string that indicates the char delimiter to separate fields (default=’,’).

  • column_defaults (list, optional) – List of default values for the CSV field (default=None). Each item in the list must be of a valid type (float, int, or string). If this is not provided, all columns are treated as string type.

  • column_names (list[str], optional) – List of column names of the dataset (default=None). If this is not provided, infers the column_names from the first row of CSV file.

  • num_samples (int, optional) – Number of samples (rows) to read (default=None, reads the full dataset).

  • num_parallel_workers (int, optional) – Number of workers to read the data (default=None, number set in the config).

  • shuffle (Union[bool, Shuffle level], optional) –

    Perform reshuffling of the data every epoch (default=Shuffle.GLOBAL). If shuffle is False, no shuffling will be performed. If shuffle is True, the behavior is the same as setting shuffle to Shuffle.GLOBAL. Otherwise, there are two levels of shuffling:

    • Shuffle.GLOBAL: Shuffle both the files and samples.

    • Shuffle.FILES: Shuffle files only.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into (default=None).

  • shard_id (int, optional) – The shard ID within num_shards (default=None). This argument can only be specified when num_shards is also specified.

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_files = ["/path/to/1", "/path/to/2"] # contains 1 or multiple CSV files
>>> dataset = ds.CSVDataset(dataset_files=dataset_files, column_names=['col1', 'col2', 'col3', 'col4'])
apply(apply_func)

Apply a function in this dataset.

Parameters

apply_func (function) – A function that must take one ‘Dataset’ as an argument and return a preprocessed ‘Dataset’.

Returns

Dataset, applied by the function.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>>
>>> # Declare an apply_func function which returns a Dataset object
>>> def apply_func(ds):
>>>     ds = ds.batch(2)
>>>     return ds
>>>
>>> # Use apply to call apply_func
>>> data = data.apply(apply_func)
Raises
  • TypeError – If apply_func is not a function.

  • TypeError – If apply_func doesn’t return a Dataset.

batch(batch_size, drop_remainder=False, num_parallel_workers=None, per_batch_map=None, input_columns=None, output_columns=None, column_order=None, pad_info=None)

Combine batch_size number of consecutive rows into batches.

For any child node, a batch is treated as a single row. For any column, all the elements within that column must have the same shape. If a per_batch_map callable is provided, it will be applied to the batches of tensors.

Note

The order of using repeat and batch affects the number of batches and per_batch_map. It is recommended that the repeat operation be used after the batch operation.

Parameters
  • batch_size (int or function) – The number of rows each batch is created with. An int or callable which takes exactly 1 parameter, BatchInfo.

  • drop_remainder (bool, optional) – Determines whether or not to drop the last possibly incomplete batch (default=False). If True, and if there are fewer than batch_size rows available to make the last batch, then those rows will be dropped and not propagated to the child node.

  • num_parallel_workers (int, optional) – Number of workers to process the dataset in parallel (default=None).

  • per_batch_map (callable, optional) – Per batch map callable. A callable which takes (list[Tensor], list[Tensor], …, BatchInfo) as input parameters. Each list[Tensor] represents a batch of Tensors on a given column. The number of lists should match with number of entries in input_columns. The last parameter of the callable should always be a BatchInfo object.

  • input_columns (list[str], optional) – List of names of the input columns. The size of the list should match with signature of the per_batch_map callable.

  • output_columns (list[str], optional) – [Not currently implemented] List of names assigned to the columns outputted by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. (default=None, output columns will have the same name as the input columns, i.e., the columns will be replaced).

  • column_order (list[str], optional) – [Not currently implemented] List of all the desired columns to propagate to the child node. This list must be a subset of all the columns in the dataset after all operations are applied. The order of the columns in each row propagated to the child node follow the order they appear in this list. The parameter is mandatory if the len(input_columns) != len(output_columns). (default=None, all columns will be propagated to the child node, the order of the columns will remain the same).

  • pad_info (dict, optional) – Whether to perform padding on selected columns. pad_info={“col1”:([224,224],0)} would pad column with name “col1” to a tensor of size [224,224] and fill the missing with 0.

Returns

BatchDataset, dataset batched.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>>
>>> # Create a dataset where every 100 rows is combined into a batch
>>> # and drops the last incomplete batch if there is one.
>>> data = data.batch(100, True)
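The arithmetic behind drop_remainder, and behind the note on the order of repeat and batch, can be worked through in plain Python. The helper `num_batches` is hypothetical and only counts batches; it is a sketch of the documented behavior, not library code.

```python
import math


def num_batches(num_rows, batch_size, drop_remainder=False):
    """Count the batches produced from num_rows rows."""
    if drop_remainder:
        # The last incomplete batch is dropped.
        return num_rows // batch_size
    # The last incomplete batch is kept.
    return math.ceil(num_rows / batch_size)


rows, batch_size, repeats = 10, 4, 2

# batch, then repeat: each epoch has its own incomplete last batch.
batch_then_repeat = num_batches(rows, batch_size) * repeats

# repeat, then batch: rows from different epochs share a batch.
repeat_then_batch = num_batches(rows * repeats, batch_size)

print(batch_then_repeat, repeat_then_batch)
```

With 10 rows, a batch size of 4 and 2 repeats, batching first gives 6 batches, while repeating first gives 5 batches, one of which mixes rows from two epochs.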
bucket_batch_by_length(column_names, bucket_boundaries, bucket_batch_sizes, element_length_function=None, pad_info=None, pad_to_bucket_boundary=False, drop_remainder=False)

Bucket elements according to their lengths. Each bucket will be padded and batched when they are full.

A length function is called on each row in the dataset. The row is then bucketed based on its length and bucket_boundaries. When a bucket reaches its corresponding size specified in bucket_batch_sizes, the entire bucket will be padded according to pad_info, and then batched. Each batch will be full, except possibly the last batch for each bucket.

Parameters
  • column_names (list[str]) – Columns passed to element_length_function.

  • bucket_boundaries (list[int]) – A list consisting of the upper boundaries of the buckets. Must be strictly increasing. If there are n boundaries, n+1 buckets are created: one bucket for [0, bucket_boundaries[0]), one bucket for [bucket_boundaries[i], bucket_boundaries[i+1]) for each 0<=i<n-1, and one bucket for [bucket_boundaries[n-1], inf).

  • bucket_batch_sizes (list[int]) – A list consisting of the batch sizes for each bucket. Must contain len(bucket_boundaries)+1 elements.

  • element_length_function (Callable, optional) – A function that takes in len(column_names) arguments and returns an int. If no value is provided, then len(column_names) must be 1, and the size of the first dimension of that column will be taken as the length (default=None).

  • pad_info (dict, optional) – Represents how to batch each column. The key corresponds to the column name, and the value must be a tuple of 2 elements. The first element corresponds to the shape to pad to, and the second element corresponds to the value to pad with. If a column is not specified, then that column will be padded to the longest in the current batch, and 0 will be used as the padding value. Any None dimensions will be padded to the longest in the current batch, unless if pad_to_bucket_boundary is True. If no padding is wanted, set pad_info to None (default=None).

  • pad_to_bucket_boundary (bool, optional) – If True, will pad each None dimension in pad_info to the bucket_boundary minus 1. If there are any elements that fall into the last bucket, an error will occur (default=False).

  • drop_remainder (bool, optional) – If True, will drop the last batch for each bucket if it is not a full batch (default=False).

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>>
>>> # Create a dataset where rows are bucketed by length, and each bucket
>>> # is padded and batched when it is full.
>>> column_names = ["col1", "col2"]
>>> bucket_boundaries = [5, 10]
>>> bucket_batch_sizes = [5, 1, 1]
>>> element_length_function = (lambda col1, col2: max(len(col1), len(col2)))
>>>
>>> # Will pad col1 to shape [2, bucket_boundaries[i]] where i is the
>>> # index of the bucket that is currently being batched.
>>> # Will pad col2 to a shape where each dimension is the longest in all
>>> # the elements currently being batched.
>>> pad_info = {"col1": ([2, None], -1)}
>>> pad_to_bucket_boundary = True
>>>
>>> data = data.bucket_batch_by_length(column_names, bucket_boundaries,
>>>                                    bucket_batch_sizes,
>>>                                    element_length_function, pad_info,
>>>                                    pad_to_bucket_boundary)
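How rows are assigned to buckets and when a batch is emitted can be sketched without the library. In this illustrative sketch each row is represented only by its length, padding is omitted, and the helper name `bucket_batches` is hypothetical.

```python
from bisect import bisect_right


def bucket_batches(lengths, bucket_boundaries, bucket_batch_sizes,
                   drop_remainder=False):
    """Yield (bucket_index, batch) pairs as buckets fill up."""
    buckets = [[] for _ in range(len(bucket_boundaries) + 1)]
    for length in lengths:
        # Bucket 0 holds [0, b[0]); bucket i holds [b[i-1], b[i]);
        # the last bucket holds [b[n-1], inf).
        index = bisect_right(bucket_boundaries, length)
        buckets[index].append(length)
        if len(buckets[index]) == bucket_batch_sizes[index]:
            yield index, buckets[index]
            buckets[index] = []
    if not drop_remainder:
        # Emit the partially filled buckets as final, smaller batches.
        for index, bucket in enumerate(buckets):
            if bucket:
                yield index, bucket


for index, batch in bucket_batches([1, 7, 2, 12, 3], [5, 10], [2, 1, 1]):
    print(index, batch)
```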
concat(datasets)

Concatenate the datasets in the input list of datasets. The “+” operator is also supported to concatenate.

Note

The column names, and the rank and type of the column data, must be the same in the input datasets.

Parameters

datasets (Union[list, class Dataset]) – A list of datasets or a single class Dataset to be concatenated together with this dataset.

Returns

ConcatDataset, dataset concatenated.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # ds1 and ds2 are instances of Dataset object
>>>
>>> # Create a dataset by concatenating ds1 and ds2 with "+" operator
>>> data1 = ds1 + ds2
>>> # Create a dataset by concatenating ds1 and ds2 with concat operation
>>> data1 = ds1.concat(ds2)
create_dict_iterator(num_epochs=-1, output_numpy=False)

Create an iterator over the dataset. The data retrieved will be a dictionary.

The order of the columns in the dictionary may not be the same as the original order.

Parameters
  • num_epochs (int, optional) – Maximum number of epochs that the iterator can be iterated (default=-1, the iterator can be iterated for an infinite number of epochs).

  • output_numpy (bool, optional) – Whether or not to output NumPy datatype, if output_numpy=False, iterator will output MSTensor (default=False).

Returns

Iterator, dictionary of column name-ndarray pair.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>>
>>> # create an iterator
>>> # The order of the columns in the data obtained by the iterator might be changed.
>>> iterator = data.create_dict_iterator()
>>> for item in iterator:
>>>     # print the data in column1
>>>     print(item["column1"])
create_tuple_iterator(columns=None, num_epochs=-1, output_numpy=False)

Create an iterator over the dataset. The data retrieved will be a list of ndarrays of data.

To specify which columns to list and the order needed, use columns. If columns is not provided, the order of the columns will not be changed.

Parameters
  • columns (list[str], optional) – List of columns to be used to specify the order of columns (default=None, means all columns).

  • num_epochs (int, optional) – Maximum number of epochs that the iterator can be iterated (default=-1, the iterator can be iterated for an infinite number of epochs).

  • output_numpy (bool, optional) – Whether or not to output NumPy datatype. If output_numpy=False, iterator will output MSTensor (default=False).

Returns

Iterator, list of ndarrays.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>>
>>> # Create an iterator
>>> # The columns in the data obtained by the iterator will not be changed.
>>> iterator = data.create_tuple_iterator()
>>> for item in iterator:
>>>     # convert the returned tuple to a list and print
>>>     print(list(item))
device_que(prefetch_size=None, send_epoch_end=True)

Return a transferred Dataset that transfers data through a device.

Parameters
  • prefetch_size (int, optional) – Prefetch number of records ahead of the user’s request (default=None).

  • send_epoch_end (bool, optional) – Whether to send end of sequence to device or not (default=True).

Note

If device is Ascend, features of data will be transferred one by one. Each transfer is limited to 256M of data.

Returns

TransferDataset, dataset for transferring.

filter(predicate, input_columns=None, num_parallel_workers=1)

Filter dataset by predicate.

Note

If input_columns is not provided or is empty, all columns will be used.

Parameters
  • predicate (callable) – Python callable which returns a boolean value. Rows for which it returns False are filtered out.

  • input_columns (list[str], optional) – List of names of the input columns (default=None, the predicate will be applied on all columns in the dataset).

  • num_parallel_workers (int, optional) – Number of workers to process the dataset in parallel (default=1).

Returns

FilterDataset, dataset filtered.

Examples

>>> import mindspore.dataset as ds
>>> # dataset is generated from the values 0 to 63 in a column named "data".
>>> # Filter out the rows whose value is greater than or equal to 11.
>>> dataset_f = dataset.filter(predicate=lambda data: data < 11, input_columns=["data"])
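The predicate semantics can be checked on plain Python data: rows for which the predicate returns True are kept, and the rest are removed. The helper `filter_rows` is a hypothetical stand-in that models rows as dicts; it is a sketch, not the library's implementation.

```python
def filter_rows(rows, predicate, input_columns=None):
    """Keep the rows for which predicate returns True."""
    for row in rows:
        # With no input_columns, the predicate receives every column.
        names = input_columns or list(row)
        if predicate(*(row[name] for name in names)):
            yield row


rows = [{"data": value} for value in range(64)]
kept = list(filter_rows(rows, lambda data: data < 11, ["data"]))
print([row["data"] for row in kept])
```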
flat_map(func)

Map func to each row in dataset and flatten the result.

The specified func is a function that must take one ‘Ndarray’ as input and return a ‘Dataset’.

Parameters

func (function) – A function that must take one ‘Ndarray’ as an argument and return a ‘Dataset’.

Returns

Dataset, applied by the function.

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>>
>>> # Declare a function which returns a Dataset object
>>> def flat_map_func(x):
>>>     data_dir = text.to_str(x[0])
>>>     d = ds.ImageFolderDataset(data_dir)
>>>     return d
>>> # data is an instance of a Dataset object.
>>> data = ds.TextFileDataset(DATA_FILE)
>>> data = data.flat_map(flat_map_func)
Raises
  • TypeError – If func is not a function.

  • TypeError – If func doesn’t return a Dataset.
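The map-then-flatten behavior can be pictured with ordinary iterables. In this sketch a list stands in for the ‘Dataset’ that func would return, and the free function `flat_map` is a hypothetical analogue of the method.

```python
from itertools import chain


def flat_map(rows, func):
    """Apply func to each row and concatenate the resulting sequences."""
    return chain.from_iterable(func(row) for row in rows)


# Each input row n expands into n output rows.
print(list(flat_map([1, 2, 3], lambda n: [n] * n)))
```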

get_batch_size()

Get the size of a batch.

Returns

Number, the number of rows in a batch.

get_col_names()

Get the names of the columns in the dataset.

get_dataset_size()[source]

Get the number of batches in an epoch.

Returns

Number, number of batches.

get_repeat_count()

Get the repeat count of the RepeatDataset, or 1 if the dataset is not repeated.

Returns

Number, the count of repeat.

map(operations=None, input_columns=None, output_columns=None, column_order=None, num_parallel_workers=None, python_multiprocessing=False, cache=None, callbacks=None)

Apply each operation in operations to this dataset.

The order of operations is determined by the position of each operation in the operations parameter. operations[0] will be applied first, then operations[1], then operations[2], etc.

Each operation will be passed one or more columns from the dataset as input, and zero or more columns will be outputted. The first operation will be passed the columns specified in input_columns as input. If there is more than one operator in operations, the outputted columns of the previous operation are used as the input columns for the next operation. The columns outputted by the very last operation will be assigned names specified by output_columns.

Only the columns specified in column_order will be propagated to the child node. These columns will be in the same order as specified in column_order.

Parameters
  • operations (Union[list[TensorOp], list[functions]]) – List of operations to be applied on the dataset. Operations are applied in the order they appear in this list.

  • input_columns (list[str], optional) – List of the names of the columns that will be passed to the first operation as input. The size of this list must match the number of input columns expected by the first operator. (default=None, the first operation will be passed however many columns are required, starting from the first column).

  • output_columns (list[str], optional) – List of names assigned to the columns outputted by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. (default=None, output columns will have the same name as the input columns, i.e., the columns will be replaced).

  • column_order (list[str], optional) – List of all the desired columns to propagate to the child node. This list must be a subset of all the columns in the dataset after all operations are applied. The order of the columns in each row propagated to the child node follows the order they appear in this list. The parameter is mandatory if len(input_columns) != len(output_columns). (default=None, all columns will be propagated to the child node, the order of the columns will remain the same).

  • num_parallel_workers (int, optional) – Number of threads used to process the dataset in parallel (default=None, the value from the configuration will be used).

  • python_multiprocessing (bool, optional) – Parallelize Python operations with multiple worker processes. This option could be beneficial if the Python operation is computationally heavy (default=False).

  • cache (DatasetCache, optional) – Tensor cache to use. (default=None which means no cache is used). The cache feature is under development and is not recommended.

  • callbacks (DSCallback, list[DSCallback], optional) – List of Dataset callbacks to be called (default=None).

Returns

MapDataset, dataset after mapping operation.

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.vision.c_transforms as c_transforms
>>>
>>> # data is an instance of Dataset which has 2 columns, "image" and "label".
>>> # ds_pyfunc is an instance of Dataset which has 3 columns, "col0", "col1", and "col2".
>>> # Each column is a 2D array of integers.
>>>
>>> # Set the global configuration value for num_parallel_workers to be 2.
>>> # Operations which use this configuration value will use 2 worker threads,
>>> # unless otherwise specified in the operator's constructor.
>>> # set_num_parallel_workers can be called again later if a different
>>> # global configuration value for the number of worker threads is desired.
>>> ds.config.set_num_parallel_workers(2)
>>>
>>> # Define two operations, where each operation accepts 1 input column and outputs 1 column.
>>> decode_op = c_transforms.Decode(rgb_format=True)
>>> random_jitter_op = c_transforms.RandomColorAdjust((0.8, 0.8), (1, 1), (1, 1), (0, 0))
>>>
>>> # 1) Simple map example
>>>
>>> operations = [decode_op]
>>> input_columns = ["image"]
>>>
>>> # Apply decode_op on column "image". This column will be replaced by the outputted
>>> # column of decode_op. Since column_order is not provided, both columns "image"
>>> # and "label" will be propagated to the child node in their original order.
>>> ds_decoded = data.map(operations, input_columns)
>>>
>>> # Rename column "image" to "decoded_image".
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(operations, input_columns, output_columns)
>>>
>>> # Specify the order of the columns.
>>> column_order = ["label", "image"]
>>> ds_decoded = data.map(operations, input_columns, None, column_order)
>>>
>>> # Rename column "image" to "decoded_image" and also specify the order of the columns.
>>> column_order = ["label", "decoded_image"]
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(operations, input_columns, output_columns, column_order)
>>>
>>> # Rename column "image" to "decoded_image" and keep only this column.
>>> column_order = ["decoded_image"]
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(operations, input_columns, output_columns, column_order)
>>>
>>> # A simple example using pyfunc: Renaming columns and specifying column order
>>> # work in the same way as the previous examples.
>>> input_columns = ["col0"]
>>> operations = [(lambda x: x + 1)]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns)
>>>
>>> # 2) Map example with more than one operation
>>>
>>> # If this list of operations is used with map, decode_op will be applied
>>> # first, then random_jitter_op will be applied.
>>> operations = [decode_op, random_jitter_op]
>>>
>>> input_columns = ["image"]
>>>
>>> # Create a dataset where the images are decoded, then randomly color jittered.
>>> # decode_op takes column "image" as input and outputs one column. The column
>>> # outputted by decode_op is passed as input to random_jitter_op.
>>> # random_jitter_op will output one column. Column "image" will be replaced by
>>> # the column outputted by random_jitter_op (the very last operation). All other
>>> # columns are unchanged. Since column_order is not specified, the order of the
>>> # columns will remain the same.
>>> ds_mapped = data.map(operations, input_columns)
>>>
>>> # Create a dataset that is identical to ds_mapped, except the column "image"
>>> # that is outputted by random_jitter_op is renamed to "image_transformed".
>>> # Specifying column order works in the same way as examples in 1).
>>> output_columns = ["image_transformed"]
>>> ds_mapped_and_renamed = data.map(operations, input_columns, output_columns)
>>>
>>> # Multiple operations using pyfunc: Renaming columns and specifying column order
>>> # work in the same way as examples in 1).
>>> input_columns = ["col0"]
>>> operations = [(lambda x: x + x), (lambda x: x - 1)]
>>> output_columns = ["col0_mapped"]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns, output_columns)
>>>
>>> # 3) Example where number of input columns is not equal to number of output columns
>>>
>>> # operations[0] is a lambda that takes 2 columns as input and outputs 3 columns.
>>> # operations[1] is a lambda that takes 3 columns as input and outputs 1 column.
>>> # operations[2] is a lambda that takes 1 column as input and outputs 4 columns.
>>> #
>>> # Note: The number of output columns of operations[i] must equal the number of
>>> # input columns of operations[i+1]. Otherwise, this map call will also result
>>> # in an error.
>>> operations = [(lambda x, y: (x, x + y, x + y + 1)),
>>>               (lambda x, y, z: x * y * z),
>>>               (lambda x: (x % 2, x % 3, x % 5, x % 7))]
>>>
>>> # Note: Since the number of input columns is not the same as the number of
>>> # output columns, the output_columns and column_order parameters must be
>>> # specified. Otherwise, this map call will also result in an error.
>>> input_columns = ["col2", "col0"]
>>> output_columns = ["mod2", "mod3", "mod5", "mod7"]
>>>
>>> # Propagate all columns to the child node in this order:
>>> column_order = ["col0", "col2", "mod2", "mod3", "mod5", "mod7", "col1"]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns, output_columns, column_order)
>>>
>>> # Propagate some columns to the child node in this order:
>>> column_order = ["mod7", "mod3", "col1"]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns, output_columns, column_order)
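The way outputs of one operation feed the next can be traced on a single row in plain Python. This sketch covers only the chaining and the naming of the final outputs; column_order is not modeled, rows are plain dicts, and `chain_operations` is a hypothetical helper rather than part of the library.

```python
def chain_operations(row, operations, input_columns, output_columns=None):
    """Apply operations in order and name the columns of the last output."""
    # The first operation receives the columns named in input_columns.
    values = tuple(row[name] for name in input_columns)
    for op in operations:
        result = op(*values)
        # An operation may return one column or a tuple of columns.
        values = result if isinstance(result, tuple) else (result,)
    names = input_columns if output_columns is None else output_columns
    if len(names) != len(values):
        raise ValueError("output_columns must name every output column")
    return dict(zip(names, values))


operations = [(lambda x, y: (x, x + y, x + y + 1)),
              (lambda x, y, z: x * y * z),
              (lambda x: (x % 2, x % 3, x % 5, x % 7))]
row = {"col0": 1, "col1": 10, "col2": 2}
print(chain_operations(row, operations, ["col2", "col0"],
                       ["mod2", "mod3", "mod5", "mod7"]))
```

For this row, the first lambda yields (2, 3, 4), the second yields 24, and the third yields the four remainders of 24.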
num_classes()

Get the number of classes in a dataset.

Returns

Number, number of classes.

output_shapes()

Get the shapes of output data.

Returns

List, list of shapes of each column.

output_types()

Get the types of output data.

Returns

List, list of data types of each column.

project(columns)

Project certain columns in input dataset.

The specified columns will be selected from the dataset and passed down the pipeline in the order specified. The other columns are discarded.

Parameters

columns (list[str]) – List of names of the columns to project.

Returns

ProjectDataset, dataset projected.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>> columns_to_project = ["column3", "column1", "column2"]
>>>
>>> # Create a dataset that consists of column3, column1, column2
>>> # in that order, regardless of the original order of columns.
>>> data = data.project(columns=columns_to_project)
rename(input_columns, output_columns)

Rename the columns in input datasets.

Parameters
  • input_columns (list[str]) – List of names of the input columns.

  • output_columns (list[str]) – List of names of the output columns.

Returns

RenameDataset, dataset renamed.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> input_columns = ["input_col1", "input_col2", "input_col3"]
>>> output_columns = ["output_col1", "output_col2", "output_col3"]
>>>
>>> # Create a dataset where input_col1 is renamed to output_col1, and
>>> # input_col2 is renamed to output_col2, and input_col3 is renamed
>>> # to output_col3.
>>> data = data.rename(input_columns=input_columns, output_columns=output_columns)
repeat(count=None)

Repeat this dataset count times. Repeat indefinitely if the count is None or -1.

Note

The order of using repeat and batch affects the number of batches. It is recommended that the repeat operation be used after the batch operation. If dataset_sink_mode is False, the repeat operation is invalid. If dataset_sink_mode is True, the repeat count must be equal to the number of training epochs. Otherwise, errors could occur since the amount of data is not the amount training requires.

Parameters

count (int) – Number of times the dataset is repeated (default=None).

Returns

RepeatDataset, dataset repeated.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>>
>>> # Create a dataset where the dataset is repeated for 50 epochs
>>> repeated = data.repeat(50)
>>>
>>> # Create a dataset where each epoch is shuffled individually
>>> shuffled_and_repeated = data.shuffle(10)
>>> shuffled_and_repeated = shuffled_and_repeated.repeat(50)
>>>
>>> # Create a dataset where the dataset is first repeated for
>>> # 50 epochs before shuffling. The shuffle operator will treat
>>> # the entire 50 epochs as one big dataset.
>>> repeat_and_shuffle = data.repeat(50)
>>> repeat_and_shuffle = repeat_and_shuffle.shuffle(10)
reset()

Reset the dataset for next epoch.

save(file_name, num_files=1, file_type='mindrecord')

Save the dynamic data processed by the dataset pipeline in a common dataset format. Supported dataset formats: ‘mindrecord’ only.

Implicit type casting exists when saving data as ‘mindrecord’. The table below shows how to do type casting.

Implicit Type Casting when Saving as ‘mindrecord’

Type in ‘dataset’    Type in ‘mindrecord’    Details
bool                 None                    Not supported
int8                 int32
uint8                bytes(1D uint8)         Drop dimension
int16                int32
uint16               int32
int32                int32
uint32               int64
int64                int64
uint64               None                    Not supported
float16              float32
float32              float32
float64              float64
string               string                  Multi-dimensional string not supported
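The casting table can also be written as a lookup, which makes the unsupported types explicit. This is only a restatement of the table as data: the names `MINDRECORD_CAST` and `mindrecord_type` are hypothetical, and the types are represented as plain strings.

```python
# Mapping from the type in 'dataset' to the type in 'mindrecord'.
# None marks a type that the table lists as not supported.
MINDRECORD_CAST = {
    "bool": None,
    "int8": "int32",
    "uint8": "bytes",  # stored as 1D uint8 bytes; the dimension is dropped
    "int16": "int32",
    "uint16": "int32",
    "int32": "int32",
    "uint32": "int64",
    "int64": "int64",
    "uint64": None,
    "float16": "float32",
    "float32": "float32",
    "float64": "float64",
    "string": "string",  # multi-dimensional string is not supported
}


def mindrecord_type(dataset_type):
    """Return the 'mindrecord' type for a 'dataset' type, per the table."""
    target = MINDRECORD_CAST[dataset_type]
    if target is None:
        raise TypeError(f"{dataset_type} is not supported by 'mindrecord'")
    return target


print(mindrecord_type("uint16"))
```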

Note

  1. To save the samples in order, set dataset’s shuffle to False and num_files to 1.

  2. Before calling the function, do not use batch operator, repeat operator or data augmentation operators with random attribute in map operator.

  3. Mindrecord does not support DE_UINT64, multi-dimensional DE_UINT8(drop dimension) nor multi-dimensional DE_STRING.

Parameters
  • file_name (str) – Path to dataset file.

  • num_files (int, optional) – Number of dataset files (default=1).

  • file_type (str, optional) – Dataset format (default=’mindrecord’).

shuffle(buffer_size)

Randomly shuffles the rows of this dataset using the following algorithm:

  1. Make a shuffle buffer that contains the first buffer_size rows.

  2. Randomly select an element from the shuffle buffer to be the next row propagated to the child node.

  3. Get the next row (if any) from the parent node and put it in the shuffle buffer.

  4. Repeat steps 2 and 3 until there are no more rows left in the shuffle buffer.

A seed can be provided to be used on the first epoch. In every subsequent epoch, the seed is changed to a new, randomly generated value.

Parameters

buffer_size (int) – The size of the buffer (must be larger than 1) for shuffling. Setting buffer_size equal to the number of rows in the entire dataset will result in a global shuffle.

Returns

ShuffleDataset, dataset shuffled.

Raises

RuntimeError – If sync operators exist before shuffle.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> # Optionally set the seed for the first epoch
>>> ds.config.set_seed(58)
>>>
>>> # Create a shuffled dataset using a shuffle buffer of size 4
>>> data = data.shuffle(4)
skip(count)

Skip the first count elements of this dataset.

Parameters

count (int) – Number of elements in the dataset to be skipped.

Returns

SkipDataset, dataset skipped.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> # Create a dataset which skips first 3 elements from data
>>> data = data.skip(3)
split(sizes, randomize=True)

Split the dataset into smaller, non-overlapping datasets.

This is a general purpose split function which can be called from any operator in the pipeline. There is another, optimized split function, which will be called automatically if ds.split is called where ds is a MappableDataset.

Parameters
  • sizes (Union[list[int], list[float]]) –

    If a list of integers [s1, s2, …, sn] is provided, the dataset will be split into n datasets of size s1, size s2, …, size sn respectively. If the sum of all sizes does not equal the original dataset size, an error will occur. If a list of floats [f1, f2, …, fn] is provided, all floats must be between 0 and 1 and must sum to 1, otherwise an error will occur. The dataset will be split into n Datasets of size round(f1*K), round(f2*K), …, round(fn*K) where K is the size of the original dataset. If after rounding:

    • Any size equals 0, an error will occur.

    • The sum of split sizes < K, the difference will be added to the first split.

    • The sum of split sizes > K, the difference will be removed from the first large enough split such that it will have at least 1 row after removing the difference.

  • randomize (bool, optional) – Determines whether or not to split the data randomly (default=True). If True, the data will be randomly split. Otherwise, each split will be created with consecutive rows from the dataset.
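
The rounding rules for a list of floats can be sketched in plain Python. This is a minimal illustration under stated assumptions, not MindSpore's code: the function name absolute_split_sizes is hypothetical, and Python's built-in round (which rounds halves to the nearest even integer) stands in for whatever rounding the library applies.

```python
import math


def absolute_split_sizes(fractions, dataset_size):
    """Convert fractional split sizes into row counts (illustrative sketch only)."""
    if not all(0 <= f <= 1 for f in fractions) or not math.isclose(sum(fractions), 1.0):
        raise ValueError("fractions must lie between 0 and 1 and sum to 1")

    sizes = [round(f * dataset_size) for f in fractions]
    if 0 in sizes:
        raise RuntimeError("a split would be empty after rounding")

    difference = dataset_size - sum(sizes)
    if difference > 0:
        # Too few rows assigned: give the remainder to the first split.
        sizes[0] += difference
    elif difference < 0:
        # Too many rows assigned: shrink the first split that keeps at least 1 row.
        for i, size in enumerate(sizes):
            if size + difference >= 1:
                sizes[i] = size + difference
                break
    return sizes
```

For example, absolute_split_sizes([0.5, 0.5], 5) rounds both splits to 2 and then adds the missing row to the first split, giving [3, 2].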

Note

  1. Dataset cannot be sharded if split is going to be called.

  2. It is strongly recommended to not shuffle the dataset, but use randomize=True instead. Shuffling the dataset may not be deterministic, which means the data in each split will be different in each epoch.

Raises
  • RuntimeError – If get_dataset_size returns None or is not supported for this dataset.

  • RuntimeError – If sizes is a list of integers and the sum of all elements in sizes does not equal the dataset size.

  • RuntimeError – If sizes is a list of floats and there is a split with size 0 after calculations.

  • RuntimeError – If the dataset is sharded prior to calling split.

  • ValueError – If sizes is a list of floats and not all floats are between 0 and 1, or if the floats don’t sum to 1.

Returns

tuple(Dataset), a tuple of datasets that have been split.

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_files = "/path/to/text_file/*"
>>>
>>> # TextFileDataset is not a mappable dataset, so this non-optimized split will be called.
>>> # Since many datasets have shuffle on by default, set shuffle to False if split will be called!
>>> data = ds.TextFileDataset(dataset_files, shuffle=False)
>>> train, test = data.split([0.9, 0.1])
sync_update(condition_name, num_batch=None, data=None)

Release a blocking condition and trigger callback with given data.

Parameters
  • condition_name (str) – The condition name that is used to toggle sending next row.

  • num_batch (Union[int, None]) – The number of batches (rows) that are released. When num_batch is None, it will default to the number specified by the sync_wait operator (default=None).

  • data (Union[dict, None]) – The data passed to the callback (default=None).

sync_wait(condition_name, num_batch=1, callback=None)

Add a blocking condition to the input Dataset.

Parameters
  • condition_name (str) – The condition name that is used to toggle sending next row.

  • num_batch (int) – The number of batches without blocking at the start of each epoch (default=1).

  • callback (function) – The callback function that will be invoked when sync_update is called (default=None).

Raises

RuntimeError – If condition name already exists.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> # batch_size is an int chosen by the user.
>>> data = data.sync_wait("callback1")
>>> data = data.batch(batch_size)
>>> for batch_data in data.create_dict_iterator():
...     # Release the next batch; do not reassign data here.
...     data.sync_update("callback1")
take(count=-1)

Take at most the given number of elements from the dataset.

Note

  1. If count is greater than the number of elements in the dataset or equal to -1, all the elements in dataset will be taken.

  2. The order of take and batch matters. If take is applied before batch, it takes the given number of rows; otherwise it takes the given number of batches.
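
The effect of ordering take and batch can be illustrated with plain Python lists. This is a sketch only; the helper functions batch and take below are hypothetical stand-ins that mimic the described behaviour and do not use the MindSpore API.

```python
def batch(rows, batch_size):
    """Group consecutive rows into lists of batch_size (last one may be shorter)."""
    rows = list(rows)
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]


def take(items, count=-1):
    """Keep at most count items; -1 keeps everything."""
    items = list(items)
    return items if count == -1 else items[:count]


rows = list(range(10))

# take before batch: 4 rows are taken, then grouped into 2 batches.
take_then_batch = batch(take(rows, 4), 2)

# batch before take: 5 batches are built, then the first 4 batches are taken.
batch_then_take = take(batch(rows, 2), 4)
```

Here take_then_batch holds 4 rows in total, while batch_then_take holds 8.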

Parameters

count (int, optional) – Number of elements to be taken from the dataset (default=-1).

Returns

TakeDataset, dataset taken.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> # Create a dataset that includes the first 50 elements of data.
>>> data = data.take(50)
to_device(send_epoch_end=True)

Transfer data through CPU, GPU or Ascend devices.

Parameters

send_epoch_end (bool, optional) – Whether to send end of sequence to device or not (default=True).

Note

If the device is Ascend, the features of the data will be transferred one by one. At most 256M of data can be transmitted at a time.

Returns

TransferDataset, dataset for transferring.

Raises
  • TypeError – If device_type is empty.

  • ValueError – If device_type is not ‘Ascend’, ‘GPU’ or ‘CPU’.

  • RuntimeError – If dataset is unknown.

  • RuntimeError – If distribution file path is given but failed to read.

zip(datasets)

Zip the datasets in the input tuple of datasets. Columns in the input datasets must not have the same name.
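
Row-wise zipping with the column-name restriction can be sketched in plain Python, with rows modelled as dictionaries keyed by column name. The function name zip_rows is hypothetical, and stopping at the shorter input is a property of Python's built-in zip that is assumed here rather than taken from the description above.

```python
def zip_rows(left_rows, right_rows):
    """Merge rows pairwise; column names must not collide (illustrative sketch)."""
    for left, right in zip(left_rows, right_rows):
        clash = left.keys() & right.keys()
        if clash:
            raise ValueError(f"duplicate column names: {sorted(clash)}")
        # Each output row carries the columns of both inputs.
        yield {**left, **right}


rows1 = [{"image": "a.jpg"}, {"image": "b.jpg"}]
rows2 = [{"label": 0}, {"label": 1}]
zipped = list(zip_rows(rows1, rows2))
```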

Parameters

datasets (Union[tuple, class Dataset]) – A tuple of datasets or a single class Dataset to be zipped together with this dataset.

Returns

ZipDataset, dataset zipped.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # ds1 and ds2 are instances of Dataset object
>>> # Create a dataset which is the combination of ds1 and ds2
>>> data = ds1.zip(ds2)
class mindspore.dataset.CelebADataset(dataset_dir, num_parallel_workers=None, shuffle=None, usage='all', sampler=None, decode=False, extensions=None, num_samples=None, num_shards=None, shard_id=None)[source]

A source dataset for reading and parsing CelebA dataset. Currently supported: list_attr_celeba.txt only.

Note

The generated dataset has two columns [‘image’, ‘attr’]. The image tensor is of type uint8. The attribute tensor is of type uint32 and is one-hot encoded.

Citation of CelebA dataset.

@article{DBLP:journals/corr/LiuLWT14,
author    = {Ziwei Liu and Ping Luo and Xiaogang Wang and Xiaoou Tang},
title     = {Deep Learning Face Attributes in the Wild},
journal   = {CoRR},
volume    = {abs/1411.7766},
year      = {2014},
url       = {http://arxiv.org/abs/1411.7766},
archivePrefix = {arXiv},
eprint    = {1411.7766},
timestamp = {Tue, 10 Dec 2019 15:37:26 +0100},
biburl    = {https://dblp.org/rec/journals/corr/LiuLWT14.bib},
bibsource = {dblp computer science bibliography, https://dblp.org},
howpublished = {http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html},
description  = {CelebFaces Attributes Dataset (CelebA) is a large-scale face attributes dataset
                with more than 200K celebrity images, each with 40 attribute annotations.
                The images in this dataset cover large pose variations and background clutter.
                CelebA has large diversities, large quantities, and rich annotations, including
                * 10,177 number of identities,
                * 202,599 number of face images, and
                * 5 landmark locations, 40 binary attributes annotations per image.
                The dataset can be employed as the training and test sets for the following computer
                vision tasks: face attribute recognition, face detection, landmark (or facial part)
                localization, and face editing & synthesis.}
}
Parameters
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • num_parallel_workers (int, optional) – Number of workers to read the data (default=value set in the config).

  • shuffle (bool, optional) – Whether to perform shuffle on the dataset (default=None).

  • usage (str) – One of ‘all’, ‘train’, ‘valid’ or ‘test’ (default=’all’).

  • sampler (Sampler, optional) – Object used to choose samples from the dataset (default=None).

  • decode (bool, optional) – Whether to decode the images after reading (default=False).

  • extensions (list[str], optional) – List of file extensions to be included in the dataset (default=None).

  • num_samples (int, optional) – The number of images to be included in the dataset. (default=None, all images).

  • num_shards (int, optional) – Number of shards that the dataset will be divided into (default=None).

  • shard_id (int, optional) – The shard ID within num_shards (default=None). This argument can only be specified when num_shards is also specified.

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_dir = "/path/to/celeba_directory"
>>> dataset = ds.CelebADataset(dataset_dir=dataset_dir, usage='train')
apply(apply_func)

Apply a function in this dataset.

Parameters

apply_func (function) – A function that must take one ‘Dataset’ as an argument and return a preprocessed ‘Dataset’.

Returns

Dataset, applied by the function.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>>
>>> # Declare an apply_func function which returns a Dataset object
>>> def apply_func(dataset):
...     dataset = dataset.batch(2)
...     return dataset
>>>
>>> # Use apply to call apply_func
>>> data = data.apply(apply_func)
Raises
  • TypeError – If apply_func is not a function.

  • TypeError – If apply_func doesn’t return a Dataset.

batch(batch_size, drop_remainder=False, num_parallel_workers=None, per_batch_map=None, input_columns=None, output_columns=None, column_order=None, pad_info=None)

Combine batch_size number of consecutive rows into batches.

For any child node, a batch is treated as a single row. For any column, all the elements within that column must have the same shape. If a per_batch_map callable is provided, it will be applied to the batches of tensors.

Note

The order of using repeat and batch affects the number of batches and per_batch_map. It is recommended that the repeat operation be used after the batch operation.

Parameters
  • batch_size (int or function) – The number of rows each batch is created with. An int or callable which takes exactly 1 parameter, BatchInfo.

  • drop_remainder (bool, optional) – Determines whether or not to drop the last possibly incomplete batch (default=False). If True, and if there are fewer than batch_size rows available to make the last batch, then those rows will be dropped and not propagated to the child node.

  • num_parallel_workers (int, optional) – Number of workers to process the dataset in parallel (default=None).

  • per_batch_map (callable, optional) – Per batch map callable. A callable which takes (list[Tensor], list[Tensor], …, BatchInfo) as input parameters. Each list[Tensor] represents a batch of Tensors on a given column. The number of lists should match with number of entries in input_columns. The last parameter of the callable should always be a BatchInfo object.

  • input_columns (list[str], optional) – List of names of the input columns. The size of the list should match with signature of the per_batch_map callable.

  • output_columns (list[str], optional) – [Not currently implemented] List of names assigned to the columns outputted by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. (default=None, output columns will have the same name as the input columns, i.e., the columns will be replaced).

  • column_order (list[str], optional) – [Not currently implemented] List of all the desired columns to propagate to the child node. This list must be a subset of all the columns in the dataset after all operations are applied. The order of the columns in each row propagated to the child node follow the order they appear in this list. The parameter is mandatory if the len(input_columns) != len(output_columns). (default=None, all columns will be propagated to the child node, the order of the columns will remain the same).

  • pad_info (dict, optional) – Whether to perform padding on selected columns. pad_info={“col1”:([224,224],0)} would pad column with name “col1” to a tensor of size [224,224] and fill the missing with 0.
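
The effect of drop_remainder can be illustrated with a plain Python list standing in for the dataset rows. This is a minimal sketch; the function name batch_rows is hypothetical and the code does not call MindSpore.

```python
def batch_rows(rows, batch_size, drop_remainder=False):
    """Group consecutive rows into batches (illustrative sketch only)."""
    batches = [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]
    # Drop the final batch only when it holds fewer than batch_size rows.
    if drop_remainder and batches and len(batches[-1]) < batch_size:
        batches.pop()
    return batches


rows = list(range(10))
kept = batch_rows(rows, 4)                          # last batch holds 2 rows
dropped = batch_rows(rows, 4, drop_remainder=True)  # incomplete batch removed
```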

Returns

BatchDataset, dataset batched.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>>
>>> # Create a dataset where every 100 rows is combined into a batch
>>> # and drops the last incomplete batch if there is one.
>>> data = data.batch(100, True)
bucket_batch_by_length(column_names, bucket_boundaries, bucket_batch_sizes, element_length_function=None, pad_info=None, pad_to_bucket_boundary=False, drop_remainder=False)

Bucket elements according to their lengths. Each bucket will be padded and batched when it is full.

A length function is called on each row in the dataset. The row is then bucketed based on its length and bucket_boundaries. When a bucket reaches its corresponding size specified in bucket_batch_sizes, the entire bucket will be padded according to pad_info, and then batched. Each batch will be full, except possibly the last batch for each bucket.

Parameters
  • column_names (list[str]) – Columns passed to element_length_function.

  • bucket_boundaries (list[int]) – A list consisting of the upper boundaries of the buckets. Must be strictly increasing. If there are n boundaries, n+1 buckets are created: One bucket for [0, bucket_boundaries[0]), one bucket for [bucket_boundaries[i], bucket_boundaries[i+1]) for each 0<=i<n-1, and one bucket for [bucket_boundaries[n-1], inf).

  • bucket_batch_sizes (list[int]) – A list consisting of the batch sizes for each bucket. Must contain len(bucket_boundaries)+1 elements.

  • element_length_function (Callable, optional) – A function that takes in len(column_names) arguments and returns an int. If no value is provided, then len(column_names) must be 1, and the size of the first dimension of that column will be taken as the length (default=None).

  • pad_info (dict, optional) – Represents how to batch each column. The key corresponds to the column name, and the value must be a tuple of 2 elements. The first element corresponds to the shape to pad to, and the second element corresponds to the value to pad with. If a column is not specified, then that column will be padded to the longest in the current batch, and 0 will be used as the padding value. Any None dimensions will be padded to the longest in the current batch, unless pad_to_bucket_boundary is True. If no padding is wanted, set pad_info to None (default=None).

  • pad_to_bucket_boundary (bool, optional) – If True, will pad each None dimension in pad_info to the bucket_boundary minus 1. If there are any elements that fall into the last bucket, an error will occur (default=False).

  • drop_remainder (bool, optional) – If True, will drop the last batch for each bucket if it is not a full batch (default=False).
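
How a length is mapped to one of the n+1 buckets can be sketched with the standard bisect module. This is an illustration of the half-open intervals described for bucket_boundaries, not MindSpore's implementation; the function names bucket_index and assign_buckets are hypothetical.

```python
from bisect import bisect_right


def bucket_index(length, bucket_boundaries):
    """Return the bucket index for a row of the given length."""
    # bisect_right counts the boundaries that are <= length, so a length
    # equal to a boundary falls into the next bucket (intervals are half-open).
    return bisect_right(bucket_boundaries, length)


def assign_buckets(lengths, bucket_boundaries):
    """Distribute lengths over len(bucket_boundaries) + 1 buckets."""
    buckets = [[] for _ in range(len(bucket_boundaries) + 1)]
    for length in lengths:
        buckets[bucket_index(length, bucket_boundaries)].append(length)
    return buckets


buckets = assign_buckets([1, 4, 5, 9, 10, 42], [5, 10])
```

With boundaries [5, 10], lengths 1 and 4 land in the first bucket, 5 and 9 in the second, and 10 and 42 in the last.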

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>>
>>> # Create a dataset where rows are bucketed by length, and each bucket
>>> # is padded and batched once it is full.
>>> column_names = ["col1", "col2"]
>>> bucket_boundaries = [5, 10]
>>> bucket_batch_sizes = [5, 1, 1]
>>> element_length_function = (lambda col1, col2: max(len(col1), len(col2)))
>>>
>>> # Will pad col1 to shape [2, bucket_boundaries[i]] where i is the
>>> # index of the bucket that is currently being batched.
>>> # Will pad col2 to a shape where each dimension is the longest in all
>>> # the elements currently being batched.
>>> pad_info = {"col1": ([2, None], -1)}
>>> pad_to_bucket_boundary = True
>>>
>>> data = data.bucket_batch_by_length(column_names, bucket_boundaries,
...                                    bucket_batch_sizes,
...                                    element_length_function, pad_info,
...                                    pad_to_bucket_boundary)
concat(datasets)

Concatenate the datasets in the input list of datasets. The “+” operator is also supported to concatenate.

Note

The column name, and rank and type of the column data must be the same in the input datasets.

Parameters

datasets (Union[list, class Dataset]) – A list of datasets or a single class Dataset to be concatenated together with this dataset.

Returns

ConcatDataset, dataset concatenated.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # ds1 and ds2 are instances of Dataset object
>>>
>>> # Create a dataset by concatenating ds1 and ds2 with "+" operator
>>> data1 = ds1 + ds2
>>> # Create a dataset by concatenating ds1 and ds2 with concat operation
>>> data1 = ds1.concat(ds2)
create_dict_iterator(num_epochs=-1, output_numpy=False)

Create an iterator over the dataset. The data retrieved will be a dictionary.

The order of the columns in the dictionary may not be the same as the original order.

Parameters
  • num_epochs (int, optional) – Maximum number of epochs that the iterator can be iterated (default=-1, the iterator can be iterated for an infinite number of epochs).

  • output_numpy (bool, optional) – Whether or not to output NumPy datatype. If output_numpy=False, the iterator will output MSTensor (default=False).

Returns

Iterator, dictionary of column name-ndarray pair.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>>
>>> # create an iterator
>>> # The columns in the data obtained by the iterator might be changed.
>>> iterator = data.create_dict_iterator()
>>> for item in iterator:
...     # print the data in column1
...     print(item["column1"])
create_tuple_iterator(columns=None, num_epochs=-1, output_numpy=False)

Create an iterator over the dataset. The data retrieved will be a list of ndarrays of data.

To specify which columns to list and the order needed, use columns. If columns is not provided, the order of the columns will not be changed.

Parameters
  • columns (list[str], optional) – List of columns to be used to specify the order of columns (default=None, means all columns).

  • num_epochs (int, optional) – Maximum number of epochs that iterator can be iterated. (default=-1, iterator can be iterated infinite number of epochs)

  • output_numpy (bool, optional) – Whether or not to output NumPy datatype. If output_numpy=False, iterator will output MSTensor (default=False).

Returns

Iterator, list of ndarrays.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>>
>>> # Create an iterator
>>> # The columns in the data obtained by the iterator will not be changed.
>>> iterator = data.create_tuple_iterator()
>>> for item in iterator:
...     # convert the returned tuple to a list and print
...     print(list(item))
device_que(prefetch_size=None, send_epoch_end=True)

Return a transferred Dataset that transfers data through a device.

Parameters
  • prefetch_size (int, optional) – Prefetch number of records ahead of the user’s request (default=None).

  • send_epoch_end (bool, optional) – Whether to send end of sequence to device or not (default=True).

Note

If the device is Ascend, the features of the data will be transferred one by one. At most 256M of data can be transmitted at a time.

Returns

TransferDataset, dataset for transferring.

filter(predicate, input_columns=None, num_parallel_workers=1)

Filter dataset by predicate.

Note

If input_columns is not provided or is empty, all columns will be used.

Parameters
  • predicate (callable) – Python callable which returns a boolean value. If it returns False, the element is filtered out.

  • input_columns (list[str], optional) – List of names of the input columns (default=None, the predicate will be applied on all columns in the dataset).

  • num_parallel_workers (int, optional) – Number of workers to process the dataset in parallel (default=1).
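
The predicate semantics can be illustrated in plain Python, with rows modelled as dictionaries keyed by column name. This is a sketch only; the function name filter_rows is hypothetical and the code does not use MindSpore.

```python
def filter_rows(rows, predicate, input_columns=None):
    """Keep the rows for which predicate returns True (illustrative sketch)."""
    for row in rows:
        # With no input_columns, every column is passed to the predicate.
        columns = input_columns or list(row)
        if predicate(*(row[name] for name in columns)):
            yield row


rows = [{"data": value} for value in range(64)]
kept = list(filter_rows(rows, lambda data: data < 11, input_columns=["data"]))
```

Only the rows with values 0 to 10 remain in kept; every row for which the predicate returns False is dropped.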

Returns

FilterDataset, dataset filtered.

Examples

>>> import mindspore.dataset as ds
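>>> # dataset is an instance of Dataset object that generates a "data" column with values 0 ~ 63.
>>> # Keep the rows whose value is less than 11; rows greater than or equal to 11 are filtered out.
>>> dataset_f = dataset.filter(predicate=lambda data: data < 11, input_columns=["data"])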
flat_map(func)

Map func to each row in dataset and flatten the result.

The specified func is a function that must take one ‘Ndarray’ as input and return a ‘Dataset’.
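
The map-then-flatten behaviour can be sketched in plain Python, with lists standing in for datasets. The function name flat_map and the sample data below are hypothetical and serve only to show how the per-row results are chained into one sequence.

```python
def flat_map(rows, func):
    """Apply func to each row and chain the resulting sequences (sketch only)."""
    for row in rows:
        # func returns a whole sequence per row; its items are emitted one by one.
        yield from func(row)


# Hypothetical stand-in for "each row names a folder that expands to its files".
folders = {"cats": ["cat1.jpg", "cat2.jpg"], "dogs": ["dog1.jpg"]}
flattened = list(flat_map(["cats", "dogs"], lambda name: folders[name]))
```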

Parameters

func (function) – A function that must take one ‘Ndarray’ as an argument and return a ‘Dataset’.

Returns

Dataset, applied by the function.

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>>
>>> # Declare a function which returns a Dataset object
>>> def flat_map_func(x):
...     data_dir = text.to_str(x[0])
...     d = ds.ImageFolderDataset(data_dir)
...     return d
>>> # DATA_FILE is the path to a text file in which each line is an image folder path.
>>> data = ds.TextFileDataset(DATA_FILE)
>>> data = data.flat_map(flat_map_func)
Raises
  • TypeError – If func is not a function.

  • TypeError – If func doesn’t return a Dataset.

get_batch_size()

Get the size of a batch.

Returns

Number, the number of rows in a batch.

get_col_names()

Get the names of the columns in the dataset.

get_dataset_size()[source]

Get the number of batches in an epoch.

Returns

Number, number of batches.

get_repeat_count()

Get the repeat count set in RepeatDataset, or 1 if the dataset is not repeated.

Returns

Number, the count of repeat.

map(operations=None, input_columns=None, output_columns=None, column_order=None, num_parallel_workers=None, python_multiprocessing=False, cache=None, callbacks=None)

Apply each operation in operations to this dataset.

The order of operations is determined by the position of each operation in the operations parameter. operations[0] will be applied first, then operations[1], then operations[2], etc.

Each operation will be passed one or more columns from the dataset as input, and zero or more columns will be outputted. The first operation will be passed the columns specified in input_columns as input. If there is more than one operator in operations, the outputted columns of the previous operation are used as the input columns for the next operation. The columns outputted by the very last operation will be assigned names specified by output_columns.

Only the columns specified in column_order will be propagated to the child node. These columns will be in the same order as specified in column_order.
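
The way outputs of one operation feed the next can be sketched in plain Python. This is an illustration of the chaining rule only, not MindSpore's implementation; the function name apply_operations is hypothetical, and plain integers stand in for column data.

```python
def apply_operations(operations, inputs):
    """Chain operations: the outputs of one become the inputs of the next."""
    values = tuple(inputs)
    for operation in operations:
        result = operation(*values)
        # An operation may return one column or a tuple of columns.
        values = result if isinstance(result, tuple) else (result,)
    return values


operations = [
    lambda x, y: (x, x + y, x + y + 1),        # 2 columns in, 3 columns out
    lambda x, y, z: x * y * z,                 # 3 columns in, 1 column out
    lambda x: (x % 2, x % 3, x % 5, x % 7),    # 1 column in, 4 columns out
]

outputs = apply_operations(operations, (2, 3))
named = dict(zip(["mod2", "mod3", "mod5", "mod7"], outputs))
```

Because the first operation receives 2 columns and the last one returns 4, the number of input and output columns differs, which is the case in which output_columns and column_order must be given.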

Parameters
  • operations (Union[list[TensorOp], list[functions]]) – List of operations to be applied on the dataset. Operations are applied in the order they appear in this list.

  • input_columns (list[str]) – List of the names of the columns that will be passed to the first operation as input. The size of this list must match the number of input columns expected by the first operator. (default=None, the first operation will be passed as many columns as it requires, starting from the first column).

  • output_columns (list[str], optional) – List of names assigned to the columns outputted by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. (default=None, output columns will have the same name as the input columns, i.e., the columns will be replaced).

  • column_order (list[str], optional) – List of all the desired columns to propagate to the child node. This list must be a subset of all the columns in the dataset after all operations are applied. The order of the columns in each row propagated to the child node follows the order they appear in this list. The parameter is mandatory if len(input_columns) != len(output_columns). (default=None, all columns will be propagated to the child node, the order of the columns will remain the same).

  • num_parallel_workers (int, optional) – Number of threads used to process the dataset in parallel (default=None, the value from the configuration will be used).

  • python_multiprocessing (bool, optional) – Parallelize Python operations with multiple worker processes. This option could be beneficial if the Python operation is computationally heavy (default=False).

  • cache (DatasetCache, optional) – Tensor cache to use. (default=None which means no cache is used). The cache feature is under development and is not recommended.

  • callbacks (DSCallback, list[DSCallback], optional) – List of Dataset callbacks to be called (default=None).

Returns

MapDataset, dataset after mapping operation.

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.vision.c_transforms as c_transforms
>>>
>>> # data is an instance of Dataset which has 2 columns, "image" and "label".
>>> # ds_pyfunc is an instance of Dataset which has 3 columns, "col0", "col1", and "col2".
>>> # Each column is a 2D array of integers.
>>>
>>> # Set the global configuration value for num_parallel_workers to be 2.
>>> # Operations which use this configuration value will use 2 worker threads,
>>> # unless otherwise specified in the operator's constructor.
>>> # set_num_parallel_workers can be called again later if a different
>>> # global configuration value for the number of worker threads is desired.
>>> ds.config.set_num_parallel_workers(2)
>>>
>>> # Define two operations, where each operation accepts 1 input column and outputs 1 column.
>>> decode_op = c_transforms.Decode(rgb_format=True)
>>> random_jitter_op = c_transforms.RandomColorAdjust((0.8, 0.8), (1, 1), (1, 1), (0, 0))
>>>
>>> # 1) Simple map example
>>>
>>> operations = [decode_op]
>>> input_columns = ["image"]
>>>
>>> # Apply decode_op on column "image". This column will be replaced by the outputted
>>> # column of decode_op. Since column_order is not provided, both columns "image"
>>> # and "label" will be propagated to the child node in their original order.
>>> ds_decoded = data.map(operations, input_columns)
>>>
>>> # Rename column "image" to "decoded_image".
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(operations, input_columns, output_columns)
>>>
>>> # Specify the order of the columns.
>>> column_order = ["label", "image"]
>>> ds_decoded = data.map(operations, input_columns, None, column_order)
>>>
>>> # Rename column "image" to "decoded_image" and also specify the order of the columns.
>>> column_order = ["label", "decoded_image"]
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(operations, input_columns, output_columns, column_order)
>>>
>>> # Rename column "image" to "decoded_image" and keep only this column.
>>> column_order = ["decoded_image"]
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(operations, input_columns, output_columns, column_order)
>>>
>>> # A simple example using pyfunc: Renaming columns and specifying column order
>>> # work in the same way as the previous examples.
>>> input_columns = ["col0"]
>>> operations = [(lambda x: x + 1)]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns)
>>>
>>> # 2) Map example with more than one operation
>>>
>>> # If this list of operations is used with map, decode_op will be applied
>>> # first, then random_jitter_op will be applied.
>>> operations = [decode_op, random_jitter_op]
>>>
>>> input_columns = ["image"]
>>>
>>> # Create a dataset where the images are decoded, then randomly color jittered.
>>> # decode_op takes column "image" as input and outputs one column. The column
>>> # outputted by decode_op is passed as input to random_jitter_op.
>>> # random_jitter_op will output one column. Column "image" will be replaced by
>>> # the column outputted by random_jitter_op (the very last operation). All other
>>> # columns are unchanged. Since column_order is not specified, the order of the
>>> # columns will remain the same.
>>> ds_mapped = data.map(operations, input_columns)
>>>
>>> # Create a dataset that is identical to ds_mapped, except the column "image"
>>> # that is outputted by random_jitter_op is renamed to "image_transformed".
>>> # Specifying column order works in the same way as examples in 1).
>>> output_columns = ["image_transformed"]
>>> ds_mapped_and_renamed = data.map(operations, input_columns, output_columns)
>>>
>>> # Multiple operations using pyfunc: Renaming columns and specifying column order
>>> # work in the same way as examples in 1).
>>> input_columns = ["col0"]
>>> operations = [(lambda x: x + x), (lambda x: x - 1)]
>>> output_columns = ["col0_mapped"]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns, output_columns)
>>>
>>> # 3) Example where number of input columns is not equal to number of output columns
>>>
>>> # operations[0] is a lambda that takes 2 columns as input and outputs 3 columns.
>>> # operations[1] is a lambda that takes 3 columns as input and outputs 1 column.
>>> # operations[2] is a lambda that takes 1 column as input and outputs 4 columns.
>>> #
>>> # Note: The number of output columns of operations[i] must equal the number of
>>> # input columns of operations[i+1]. Otherwise, this map call will result
>>> # in an error.
>>> operations = [(lambda x, y: (x, x + y, x + y + 1)),
...               (lambda x, y, z: x * y * z),
...               (lambda x: (x % 2, x % 3, x % 5, x % 7))]
>>>
>>> # Note: Since the number of input columns is not the same as the number of
>>> # output columns, the output_columns and column_order parameters must be
>>> # specified. Otherwise, this map call will also result in an error.
>>> input_columns = ["col2", "col0"]
>>> output_columns = ["mod2", "mod3", "mod5", "mod7"]
>>>
>>> # Propagate all columns to the child node in this order:
>>> column_order = ["col0", "col2", "mod2", "mod3", "mod5", "mod7", "col1"]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns, output_columns, column_order)
>>>
>>> # Propagate some columns to the child node in this order:
>>> column_order = ["mod7", "mod3", "col1"]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns, output_columns, column_order)
num_classes()

Get the number of classes in a dataset.

Returns

Number, number of classes.

output_shapes()

Get the shapes of output data.

Returns

List, list of shapes of each column.

output_types()

Get the types of output data.

Returns

List, list of data types of each column.

project(columns)

Project certain columns in input dataset.

The specified columns will be selected from the dataset and passed down the pipeline in the order specified. The other columns are discarded.
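
Column selection and reordering can be sketched in plain Python, with a row modelled as a dictionary keyed by column name. The function name project_row is hypothetical; the example relies on Python dictionaries preserving insertion order.

```python
def project_row(row, columns):
    """Keep only the named columns, in the requested order (sketch only)."""
    return {name: row[name] for name in columns}


row = {"column1": 1, "column2": 2, "column3": 3, "column4": 4}
projected = project_row(row, ["column3", "column1", "column2"])
```

The resulting row lists column3 first and no longer contains column4.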

Parameters

columns (list[str]) – List of names of the columns to project.

Returns

ProjectDataset, dataset projected.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>> columns_to_project = ["column3", "column1", "column2"]
>>>
>>> # Create a dataset that consists of column3, column1, column2
>>> # in that order, regardless of the original order of columns.
>>> data = data.project(columns=columns_to_project)
rename(input_columns, output_columns)

Rename the columns in input datasets.

Parameters
  • input_columns (list[str]) – List of names of the input columns.

  • output_columns (list[str]) – List of names of the output columns.

Returns

RenameDataset, dataset renamed.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> input_columns = ["input_col1", "input_col2", "input_col3"]
>>> output_columns = ["output_col1", "output_col2", "output_col3"]
>>>
>>> # Create a dataset where input_col1 is renamed to output_col1, and
>>> # input_col2 is renamed to output_col2, and input_col3 is renamed
>>> # to output_col3.
>>> data = data.rename(input_columns=input_columns, output_columns=output_columns)
repeat(count=None)

Repeat this dataset count times. Repeat indefinitely if the count is None or -1.

Note

The order in which repeat and batch are applied affects the number of batches. It is recommended that the repeat operation be used after the batch operation. If dataset_sink_mode is False, the repeat operation is invalid. If dataset_sink_mode is True, the repeat count must be equal to the number of training epochs. Otherwise, errors could occur because the amount of data does not match the amount that training requires.
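
As a rough illustration of the note above, the sketch below counts batches for both orderings in plain Python. It does not call MindSpore: count_batches and its parameters are hypothetical names, and it assumes drop_remainder=False and that batching after repeat treats the repeated rows as one continuous stream.

```python
import math


def count_batches(num_rows, batch_size, repeat_count, batch_first):
    """Count the batches produced by a batch/repeat pipeline (hypothetical helper).

    Assumes drop_remainder=False, so a trailing partial batch is kept.
    """
    if batch_first:
        # batch -> repeat: each epoch is batched on its own, then replayed.
        return math.ceil(num_rows / batch_size) * repeat_count
    # repeat -> batch: the repeated rows form one long stream that is batched once.
    return math.ceil(num_rows * repeat_count / batch_size)


# 10 rows, batch size 4, repeated twice.
batch_then_repeat = count_batches(10, 4, 2, batch_first=True)
repeat_then_batch = count_batches(10, 4, 2, batch_first=False)
```

Under these assumptions, batch followed by repeat gives 3 batches per epoch (6 in total), while repeat followed by batch gives 5 batches, one of which mixes rows from two epochs.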

Parameters

count (int) – Number of times the dataset is repeated (default=None).

Returns

RepeatDataset, dataset repeated.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>>
>>> # Create a dataset where the dataset is repeated for 50 epochs
>>> repeated = data.repeat(50)
>>>
>>> # Create a dataset where each epoch is shuffled individually
>>> shuffled_and_repeated = data.shuffle(10)
>>> shuffled_and_repeated = shuffled_and_repeated.repeat(50)
>>>
>>> # Create a dataset where the dataset is first repeated for
>>> # 50 epochs before shuffling. The shuffle operator will treat
>>> # the entire 50 epochs as one big dataset.
>>> repeat_and_shuffle = data.repeat(50)
>>> repeat_and_shuffle = repeat_and_shuffle.shuffle(10)
reset()

Reset the dataset for next epoch.

save(file_name, num_files=1, file_type='mindrecord')

Save the dynamic data processed by the dataset pipeline in a common dataset format. Only the ‘mindrecord’ format is currently supported.

Implicit type casting exists when saving data as ‘mindrecord’. The table below shows how to do type casting.

Implicit Type Casting when Saving as ‘mindrecord’

Type in ‘dataset’    Type in ‘mindrecord’    Details
bool                 None                    Not supported
int8                 int32
uint8                bytes(1D uint8)         Drop dimension
int16                int32
uint16               int32
int32                int32
uint32               int64
int64                int64
uint64               None                    Not supported
float16              float32
float32              float32
float64              float64
string               string                  Multi-dimensional string not supported

Note

  1. To save the samples in order, set dataset’s shuffle to False and num_files to 1.

  2. Before calling the function, do not use batch operator, repeat operator or data augmentation operators with random attribute in map operator.

  3. Mindrecord does not support DE_UINT64, multi-dimensional DE_UINT8(drop dimension) nor multi-dimensional DE_STRING.
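
The casting table can be turned into a small pre-flight check before calling save. The following is a minimal sketch in plain Python: MINDRECORD_CAST and unsupported_columns are hypothetical names, the mapping is copied from the table above, and column types are assumed to be available as plain type-name strings.

```python
# Mapping copied from the implicit type casting table above.
# None marks a type that cannot be saved as 'mindrecord'.
MINDRECORD_CAST = {
    "bool": None,
    "int8": "int32",
    "uint8": "bytes",
    "int16": "int32",
    "uint16": "int32",
    "int32": "int32",
    "uint32": "int64",
    "int64": "int64",
    "uint64": None,
    "float16": "float32",
    "float32": "float32",
    "float64": "float64",
    "string": "string",
}


def unsupported_columns(column_types):
    """Return the names of columns whose type has no 'mindrecord' equivalent.

    column_types is a hypothetical dict of column name -> type name.
    Type names that are missing from the table are also reported.
    """
    return [name for name, type_name in column_types.items()
            if MINDRECORD_CAST.get(type_name) is None]


bad = unsupported_columns({"image": "uint8", "mask": "bool", "label": "int16"})
```

The sketch only checks element types. It does not cover the dimension restrictions on uint8 and string columns listed in the note.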

Parameters
  • file_name (str) – Path to dataset file.

  • num_files (int, optional) – Number of dataset files (default=1).

  • file_type (str, optional) – Dataset format (default=’mindrecord’).

shuffle(buffer_size)

Randomly shuffles the rows of this dataset using the following algorithm:

  1. Make a shuffle buffer that contains the first buffer_size rows.

  2. Randomly select an element from the shuffle buffer to be the next row propagated to the child node.

  3. Get the next row (if any) from the parent node and put it in the shuffle buffer.

  4. Repeat steps 2 and 3 until there are no more rows left in the shuffle buffer.

A seed can be provided to be used on the first epoch. In every subsequent epoch, the seed is changed to a new, randomly generated value.
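
The four steps above can be sketched in plain Python as follows. This is an illustration of the described algorithm, not MindSpore's implementation: buffer_shuffle is a hypothetical name, and Python's random module stands in for the library's random number generator.

```python
import random


def buffer_shuffle(rows, buffer_size, seed=None):
    """Yield rows in shuffled order using a fixed-size shuffle buffer."""
    rng = random.Random(seed)
    source = iter(rows)
    # Step 1: fill the buffer with the first buffer_size rows.
    buffer = []
    for row in source:
        buffer.append(row)
        if len(buffer) == buffer_size:
            break
    while buffer:
        # Step 2: pick a random element from the buffer as the next output row.
        index = rng.randrange(len(buffer))
        picked = buffer[index]
        try:
            # Step 3: refill the freed slot with the next source row, if any.
            buffer[index] = next(source)
        except StopIteration:
            # Source exhausted: shrink the buffer so that step 4 drains it.
            buffer[index] = buffer[-1]
            buffer.pop()
        yield picked


shuffled = list(buffer_shuffle(range(10), buffer_size=4, seed=58))
```

In this sketch, the row at source position i can never be emitted before output position i - buffer_size + 1, which is why a buffer as large as the dataset is needed for a global shuffle.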

Parameters

buffer_size (int) – The size of the buffer (must be larger than 1) for shuffling. Setting buffer_size equal to the number of rows in the entire dataset will result in a global shuffle.

Returns

ShuffleDataset, dataset shuffled.

Raises

RuntimeError – If sync operators exist before shuffle.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> # Optionally set the seed for the first epoch
>>> ds.config.set_seed(58)
>>>
>>> # Create a shuffled dataset using a shuffle buffer of size 4
>>> data = data.shuffle(4)
skip(count)

Skip the first N elements of this dataset.

Parameters

count (int) – Number of elements in the dataset to be skipped.

Returns

SkipDataset, dataset skipped.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> # Create a dataset which skips first 3 elements from data
>>> data = data.skip(3)
split(sizes, randomize=True)

Split the dataset into smaller, non-overlapping datasets.

Parameters
  • sizes (Union[list[int], list[float]]) –

    If a list of integers [s1, s2, …, sn] is provided, the dataset will be split into n datasets of size s1, size s2, …, size sn respectively. If the sum of all sizes does not equal the original dataset size, an error will occur. If a list of floats [f1, f2, …, fn] is provided, all floats must be between 0 and 1 and must sum to 1, otherwise an error will occur. The dataset will be split into n Datasets of size round(f1*K), round(f2*K), …, round(fn*K) where K is the size of the original dataset. If after rounding:

    • Any size equals 0, an error will occur.

    • The sum of split sizes < K, the difference will be added to the first split.

    • The sum of split sizes > K, the difference will be removed from the first large enough split such that it will have at least 1 row after removing the difference.

  • randomize (bool, optional) – Determines whether or not to split the data randomly (default=True). If True, the data will be randomly split. Otherwise, each split will be created with consecutive rows from the dataset.
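
The rounding rules for a list of floats can be sketched as follows. This is a plain-Python restatement of the rules above, not the library's code: split_sizes is a hypothetical name, validation of the fractions themselves is omitted, and Python's built-in round (which rounds halves to the nearest even integer) is assumed to match the rounding used by split.

```python
def split_sizes(fractions, total_rows):
    """Turn a list of fractions into absolute split sizes, following the rules above."""
    sizes = [round(f * total_rows) for f in fractions]
    if any(size == 0 for size in sizes):
        raise RuntimeError("a split has size 0 after rounding")
    difference = total_rows - sum(sizes)
    if difference > 0:
        # Too few rows assigned: the first split absorbs the shortfall.
        sizes[0] += difference
    elif difference < 0:
        # Too many rows assigned: shrink the first split that keeps at least 1 row.
        for i, size in enumerate(sizes):
            if size + difference > 0:
                sizes[i] = size + difference
                break
    return sizes


even = split_sizes([0.9, 0.1], 100)
short = split_sizes([1 / 3, 1 / 3, 1 / 3], 10)
over = split_sizes([0.5, 0.5], 7)
```

Here short shows the case where the rounded sizes sum to less than K (the first split grows by one row), and over shows the case where they sum to more than K (the first split shrinks by one row).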

Note

  1. There is an optimized split function, which will be called automatically when the dataset that calls this function is a MappableDataset.

  2. Dataset should not be sharded if split is going to be called. Instead, create a DistributedSampler and specify a split to shard after splitting. If dataset is sharded after a split, it is strongly recommended to set the same seed in each instance of execution, otherwise each shard may not be part of the same split (see Examples).

  3. It is strongly recommended to not shuffle the dataset, but use randomize=True instead. Shuffling the dataset may not be deterministic, which means the data in each split will be different in each epoch. Furthermore, if sharding occurs after split, each shard may not be part of the same split.

Raises
  • RuntimeError – If get_dataset_size returns None or is not supported for this dataset.

  • RuntimeError – If sizes is list of integers and sum of all elements in sizes does not equal the dataset size.

  • RuntimeError – If sizes is list of float and there is a split with size 0 after calculations.

  • RuntimeError – If the dataset is sharded prior to calling split.

  • ValueError – If sizes is list of float and not all floats are between 0 and 1, or if the floats don’t sum to 1.

Returns

tuple(Dataset), a tuple of datasets that have been split.

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_dir = "/path/to/imagefolder_directory"
>>>
>>> # Since many datasets have shuffle on by default, set shuffle to False if split will be called!
>>> data = ds.ImageFolderDataset(dataset_dir, shuffle=False)
>>>
>>> # Set the seed, and tell split to use this seed when randomizing.
>>> # This is needed because sharding will be done later
>>> ds.config.set_seed(58)
>>> train, test = data.split([0.9, 0.1])
>>>
>>> # To shard the train dataset, use a DistributedSampler
>>> train_sampler = ds.DistributedSampler(10, 2)
>>> train.use_sampler(train_sampler)
sync_update(condition_name, num_batch=None, data=None)

Release a blocking condition and trigger callback with given data.

Parameters
  • condition_name (str) – The condition name that is used to toggle sending next row.

  • num_batch (Union[int, None]) – The number of batches (rows) that are released. When num_batch is None, it will default to the number specified by the sync_wait operator (default=None).

  • data (Union[dict, None]) – The data passed to the callback (default=None).

sync_wait(condition_name, num_batch=1, callback=None)

Add a blocking condition to the input Dataset.

Parameters
  • num_batch (int) – the number of batches without blocking at the start of each epoch.

  • condition_name (str) – The condition name that is used to toggle sending next row.

  • callback (function) – The callback function that will be invoked when sync_update is called.

Raises

RuntimeError – If condition name already exists.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> data = data.sync_wait("callback1")
>>> data = data.batch(batch_size)
>>> for batch_data in data.create_dict_iterator():
>>>     data.sync_update("callback1")
take(count=-1)

Take at most the given number of elements from the dataset.

Note

  1. If count is greater than the number of elements in the dataset or equal to -1, all the elements in dataset will be taken.

  2. The order of take and batch matters. If take is applied before batch, the given number of rows is taken; otherwise, the given number of batches is taken.
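
The two notes above can be illustrated with plain Python lists. This is only a sketch of the documented behavior: batch and take below are hypothetical stand-ins that operate on lists, not the Dataset methods.

```python
def batch(rows, batch_size):
    """Group rows into lists of batch_size; the last batch may be shorter."""
    rows = list(rows)
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]


def take(items, count=-1):
    """Keep at most count items; -1 keeps everything."""
    items = list(items)
    return items if count == -1 else items[:count]


rows = list(range(10))
take_then_batch = batch(take(rows, 3), 2)  # 3 rows are taken, then grouped
batch_then_take = take(batch(rows, 2), 3)  # 3 batches of 2 rows each are taken
```

In the first pipeline only 3 rows reach the batch step, whereas in the second pipeline 3 whole batches (6 rows) are kept.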

Parameters

count (int, optional) – Number of elements to be taken from the dataset (default=-1).

Returns

TakeDataset, dataset taken.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> # Create a dataset where the dataset includes 50 elements.
>>> data = data.take(50)
to_device(send_epoch_end=True)

Transfer data through CPU, GPU or Ascend devices.

Parameters

send_epoch_end (bool, optional) – Whether to send end of sequence to device or not (default=True).

Note

If the device is Ascend, features of data will be transferred one by one. Each transfer is limited to 256M of data.

Returns

TransferDataset, dataset for transferring.

Raises
  • TypeError – If device_type is empty.

  • ValueError – If device_type is not ‘Ascend’, ‘GPU’ or ‘CPU’.

  • RuntimeError – If dataset is unknown.

  • RuntimeError – If distribution file path is given but failed to read.

use_sampler(new_sampler)

Make the current dataset use the new_sampler provided.

Parameters

new_sampler (Sampler) – The sampler to use for the current dataset.

Returns

Dataset, that uses new_sampler.

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_dir = "/path/to/imagefolder_directory"
>>> # Note: A SequentialSampler is created by default
>>> data = ds.ImageFolderDataset(dataset_dir)
>>>
>>> # Use a DistributedSampler instead of the SequentialSampler
>>> new_sampler = ds.DistributedSampler(10, 2)
>>> data.use_sampler(new_sampler)
zip(datasets)

Zip the datasets in the input tuple of datasets. Columns in the input datasets must not have the same name.
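
As a minimal sketch of what zipping means for the rows, the following plain-Python generator merges rows pairwise and rejects duplicate column names. zip_rows and the row dictionaries are hypothetical, the choice of ValueError is made for this sketch only, and stopping at the shorter input is an assumption inherited from Python's built-in zip.

```python
def zip_rows(rows_a, rows_b):
    """Merge rows pairwise; raise if the two inputs share a column name."""
    # Built-in zip stops at the shorter input (an assumption of this sketch).
    for row_a, row_b in zip(rows_a, rows_b):
        clash = set(row_a) & set(row_b)
        if clash:
            raise ValueError(f"duplicate column names: {sorted(clash)}")
        # Each output row carries the columns of both inputs.
        yield {**row_a, **row_b}


ds1_rows = [{"image": "img0"}, {"image": "img1"}]
ds2_rows = [{"label": 0}, {"label": 1}]
zipped = list(zip_rows(ds1_rows, ds2_rows))
```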

Parameters

datasets (Union[tuple, class Dataset]) – A tuple of datasets or a single class Dataset to be zipped together with this dataset.

Returns

ZipDataset, dataset zipped.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # ds1 and ds2 are instances of Dataset object
>>> # Create a dataset which is the combination of ds1 and ds2
>>> data = ds1.zip(ds2)
class mindspore.dataset.Cifar100Dataset(dataset_dir, usage=None, num_samples=None, num_parallel_workers=None, shuffle=None, sampler=None, num_shards=None, shard_id=None)[source]

A source dataset that reads cifar100 data.

The generated dataset has three columns [‘image’, ‘coarse_label’, ‘fine_label’]. The type of the image tensor is uint8. The coarse and fine labels are each a scalar uint32 tensor. This dataset can take in a sampler. ‘sampler’ and ‘shuffle’ are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using ‘sampler’ and ‘shuffle’

Parameter ‘sampler’    Parameter ‘shuffle’    Expected Order Behavior
None                   None                   random order
None                   True                   random order
None                   False                  sequential order
Sampler object         None                   order defined by sampler
Sampler object         True                   not allowed
Sampler object         False                  not allowed

Citation of Cifar100 dataset.

@techreport{Krizhevsky09,
author       = {Alex Krizhevsky},
title        = {Learning multiple layers of features from tiny images},
institution  = {},
year         = {2009},
howpublished = {http://www.cs.toronto.edu/~kriz/cifar.html},
description  = {This dataset is just like the CIFAR-10, except it has 100 classes containing 600 images
                each. There are 500 training images and 100 testing images per class. The 100 classes in
                the CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the
                class to which it belongs) and a "coarse" label (the superclass to which it belongs).}
}
Parameters
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • usage (str, optional) – Usage of this dataset, can be “train”, “test” or “all”. “train” will read from 50,000 train samples, “test” will read from 10,000 test samples, “all” will read from all 60,000 samples (default=None, all samples).

  • num_samples (int, optional) – The number of images to be included in the dataset. (default=None, all images).

  • num_parallel_workers (int, optional) – Number of workers to read the data (default=None, number set in the config).

  • shuffle (bool, optional) – Whether to perform shuffle on the dataset (default=None, expected order behavior shown in the table).

  • sampler (Sampler, optional) – Object used to choose samples from the dataset (default=None, expected order behavior shown in the table).

  • num_shards (int, optional) – Number of shards that the dataset will be divided into (default=None).

  • shard_id (int, optional) – The shard ID within num_shards (default=None). This argument can only be specified when num_shards is also specified.

Raises
  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and sharding are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If shard_id is invalid (< 0 or >= num_shards).
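
The order table and the error conditions above can be restated as one small function. This is a plain-Python sketch of the documented rules, not MindSpore's validation code: resolve_order is a hypothetical name, and the order of the checks is an assumption.

```python
def resolve_order(sampler=None, shuffle=None, num_shards=None, shard_id=None):
    """Return the expected order behavior, or raise as listed under Raises."""
    if sampler is not None and shuffle is not None:
        raise RuntimeError("sampler and shuffle cannot be specified at the same time")
    if sampler is not None and (num_shards is not None or shard_id is not None):
        raise RuntimeError("sampler and sharding cannot be specified at the same time")
    if num_shards is not None and shard_id is None:
        raise RuntimeError("num_shards is specified but shard_id is None")
    if shard_id is not None and num_shards is None:
        raise RuntimeError("shard_id is specified but num_shards is None")
    if num_shards is not None and not 0 <= shard_id < num_shards:
        raise ValueError("shard_id must satisfy 0 <= shard_id < num_shards")
    if sampler is not None:
        return "order defined by sampler"
    # According to the table, shuffle=None behaves like shuffle=True.
    return "sequential order" if shuffle is False else "random order"


default_order = resolve_order()
sequential = resolve_order(shuffle=False)
```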

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_dir = "/path/to/cifar100_dataset_directory"
>>>
>>> # 1) Get all samples from CIFAR100 dataset in sequence
>>> cifar100_dataset = ds.Cifar100Dataset(dataset_dir=dataset_dir, shuffle=False)
>>>
>>> # 2) Randomly select 350 samples from CIFAR100 dataset
>>> cifar100_dataset = ds.Cifar100Dataset(dataset_dir=dataset_dir, num_samples=350, shuffle=True)
>>>
>>> # In CIFAR100 dataset, each dictionary has 3 keys: "image", "fine_label" and "coarse_label"
apply(apply_func)

Apply a function in this dataset.

Parameters

apply_func (function) – A function that must take one ‘Dataset’ as an argument and return a preprocessed ‘Dataset’.

Returns

Dataset, applied by the function.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>>
>>> # Declare an apply_func function which returns a Dataset object
>>> def apply_func(ds):
>>>     ds = ds.batch(2)
>>>     return ds
>>>
>>> # Use apply to call apply_func
>>> data = data.apply(apply_func)
Raises
  • TypeError – If apply_func is not a function.

  • TypeError – If apply_func doesn’t return a Dataset.

batch(batch_size, drop_remainder=False, num_parallel_workers=None, per_batch_map=None, input_columns=None, output_columns=None, column_order=None, pad_info=None)

Combine batch_size number of consecutive rows into batches.

For any child node, a batch is treated as a single row. For any column, all the elements within that column must have the same shape. If a per_batch_map callable is provided, it will be applied to the batches of tensors.

Note

The order in which repeat and batch are applied affects the number of batches and the behavior of per_batch_map. It is recommended that the repeat operation be used after the batch operation.

Parameters
  • batch_size (int or function) – The number of rows each batch is created with. An int or callable which takes exactly 1 parameter, BatchInfo.

  • drop_remainder (bool, optional) – Determines whether or not to drop the last possibly incomplete batch (default=False). If True, and if there are less than batch_size rows available to make the last batch, then those rows will be dropped and not propagated to the child node.

  • num_parallel_workers (int, optional) – Number of workers to process the dataset in parallel (default=None).

  • per_batch_map (callable, optional) – Per batch map callable. A callable which takes (list[Tensor], list[Tensor], …, BatchInfo) as input parameters. Each list[Tensor] represents a batch of Tensors on a given column. The number of lists should match with number of entries in input_columns. The last parameter of the callable should always be a BatchInfo object.

  • input_columns (list[str], optional) – List of names of the input columns. The size of the list should match with signature of the per_batch_map callable.

  • output_columns (list[str], optional) – [Not currently implemented] List of names assigned to the columns outputted by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. (default=None, output columns will have the same name as the input columns, i.e., the columns will be replaced).

  • column_order (list[str], optional) – [Not currently implemented] List of all the desired columns to propagate to the child node. This list must be a subset of all the columns in the dataset after all operations are applied. The order of the columns in each row propagated to the child node follows the order in which they appear in this list. The parameter is mandatory if len(input_columns) != len(output_columns). (default=None, all columns will be propagated to the child node, the order of the columns will remain the same).

  • pad_info (dict, optional) – Whether to perform padding on selected columns. pad_info={“col1”:([224,224],0)} would pad the column named “col1” to a tensor of size [224,224] and fill the missing values with 0.
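
To make drop_remainder and padding concrete, here is a minimal plain-Python sketch. It handles only 1-D rows and a single column, and batch_rows, pad_1d, pad_to and fill are hypothetical names chosen for this illustration. The real operator works on tensors and takes its padding settings from pad_info.

```python
def pad_1d(values, length, fill):
    """Pad a 1-D list to the given length with fill (hypothetical helper)."""
    return list(values) + [fill] * (length - len(values))


def batch_rows(rows, batch_size, drop_remainder=False, pad_to=None, fill=0):
    """Group consecutive 1-D rows into batches, optionally padding each row first."""
    if pad_to is not None:
        # Padding gives every row the same shape, which batching requires.
        rows = [pad_1d(row, pad_to, fill) for row in rows]
    else:
        rows = [list(row) for row in rows]
    batches = [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]
    if drop_remainder and batches and len(batches[-1]) < batch_size:
        batches.pop()  # the last batch is incomplete, so it is dropped
    return batches


rows = [[1], [2, 3], [4, 5, 6], [7], [8, 9]]
kept = batch_rows(rows, 2, pad_to=3)
dropped = batch_rows(rows, 2, drop_remainder=True, pad_to=3)
```

With five rows and a batch size of 2, kept ends with a one-row batch, while dropped contains only the two full batches.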

Returns

BatchDataset, dataset batched.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>>
>>> # Create a dataset where every 100 rows is combined into a batch
>>> # and drops the last incomplete batch if there is one.
>>> data = data.batch(100, True)
bucket_batch_by_length(column_names, bucket_boundaries, bucket_batch_sizes, element_length_function=None, pad_info=None, pad_to_bucket_boundary=False, drop_remainder=False)

Bucket elements according to their lengths. Each bucket will be padded and batched when it is full.

A length function is called on each row in the dataset. The row is then bucketed based on its length and bucket_boundaries. When a bucket reaches its corresponding size specified in bucket_batch_sizes, the entire bucket will be padded according to pad_info, and then batched. Each batch will be full, except possibly the last batch for each bucket.

Parameters
  • column_names (list[str]) – Columns passed to element_length_function.

  • bucket_boundaries (list[int]) – A list consisting of the upper boundaries of the buckets. Must be strictly increasing. If there are n boundaries, n+1 buckets are created: One bucket for [0, bucket_boundaries[0]), one bucket for [bucket_boundaries[i], bucket_boundaries[i+1]) for each 0<=i<n-1, and one bucket for [bucket_boundaries[n-1], inf).

  • bucket_batch_sizes (list[int]) – A list consisting of the batch sizes for each bucket. Must contain len(bucket_boundaries)+1 elements.

  • element_length_function (Callable, optional) – A function that takes in len(column_names) arguments and returns an int. If no value is provided, then len(column_names) must be 1, and the size of the first dimension of that column will be taken as the length (default=None).

  • pad_info (dict, optional) – Represents how to batch each column. The key corresponds to the column name, and the value must be a tuple of 2 elements. The first element corresponds to the shape to pad to, and the second element corresponds to the value to pad with. If a column is not specified, then that column will be padded to the longest in the current batch, and 0 will be used as the padding value. Any None dimensions will be padded to the longest in the current batch, unless pad_to_bucket_boundary is True. If no padding is wanted, set pad_info to None (default=None).

  • pad_to_bucket_boundary (bool, optional) – If True, will pad each None dimension in pad_info to the bucket_boundary minus 1. If there are any elements that fall into the last bucket, an error will occur (default=False).

  • drop_remainder (bool, optional) – If True, will drop the last batch for each bucket if it is not a full batch (default=False).
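
The bucket layout described for bucket_boundaries can be sketched with the standard bisect module. This is an illustrative plain-Python version under simplifying assumptions: bucket_index and bucket_rows are hypothetical names, the length of a row is taken to be len(row), padding is omitted, and leftover rows are emitted at the end as if drop_remainder=False.

```python
import bisect


def bucket_index(length, bucket_boundaries):
    """Return the index of the bucket that a row of the given length falls into."""
    # bisect_right counts the boundaries that are <= length, which is the bucket index.
    return bisect.bisect_right(bucket_boundaries, length)


def bucket_rows(rows, bucket_boundaries, bucket_batch_sizes):
    """Yield (bucket index, batch) pairs; a batch is emitted when its bucket is full."""
    buckets = [[] for _ in range(len(bucket_boundaries) + 1)]
    for row in rows:
        i = bucket_index(len(row), bucket_boundaries)
        buckets[i].append(row)
        if len(buckets[i]) == bucket_batch_sizes[i]:
            yield i, buckets[i]
            buckets[i] = []
    # Emit the partially filled buckets that remain.
    for i, leftover in enumerate(buckets):
        if leftover:
            yield i, leftover


batches = list(bucket_rows(["aaa", "bbbbbbb", "cccc", "dddddddddddd"],
                           bucket_boundaries=[5, 10],
                           bucket_batch_sizes=[2, 1, 1]))
```

With boundaries [5, 10], lengths 0 to 4 fall into bucket 0, lengths 5 to 9 into bucket 1, and lengths of 10 or more into bucket 2.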

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>>
>>> # Create a dataset where rows are bucketed by length, and each bucket
>>> # is padded and batched once it is full.
>>> column_names = ["col1", "col2"]
>>> bucket_boundaries = [5, 10]
>>> bucket_batch_sizes = [5, 1, 1]
>>> element_length_function = (lambda col1, col2: max(len(col1), len(col2)))
>>>
>>> # Will pad col1 to shape [2, bucket_boundaries[i]] where i is the
>>> # index of the bucket that is currently being batched.
>>> # Will pad col2 to a shape where each dimension is the longest in all
>>> # the elements currently being batched.
>>> pad_info = {"col1": ([2, None], -1)}
>>> pad_to_bucket_boundary = True
>>>
>>> data = data.bucket_batch_by_length(column_names, bucket_boundaries,
>>>                                    bucket_batch_sizes,
>>>                                    element_length_function, pad_info,
>>>                                    pad_to_bucket_boundary)
concat(datasets)

Concatenate the datasets in the input list of datasets. The “+” operator is also supported to concatenate.

Note

The column name, and rank and type of the column data must be the same in the input datasets.

Parameters

datasets (Union[list, class Dataset]) – A list of datasets or a single class Dataset to be concatenated together with this dataset.

Returns

ConcatDataset, dataset concatenated.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # ds1 and ds2 are instances of Dataset object
>>>
>>> # Create a dataset by concatenating ds1 and ds2 with "+" operator
>>> data1 = ds1 + ds2
>>> # Create a dataset by concatenating ds1 and ds2 with concat operation
>>> data1 = ds1.concat(ds2)
create_dict_iterator(num_epochs=-1, output_numpy=False)

Create an iterator over the dataset. The data retrieved will be a dictionary.

The order of the columns in the dictionary may not be the same as the original order.

Parameters
  • num_epochs (int, optional) – Maximum number of epochs that iterator can be iterated (default=-1, iterator can be iterated infinite number of epochs).

  • output_numpy (bool, optional) – Whether or not to output NumPy datatype, if output_numpy=False, iterator will output MSTensor (default=False).

Returns

Iterator, dictionary of column name-ndarray pair.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>>
>>> # create an iterator
>>> # The order of the columns in the data obtained by the iterator might be changed.
>>> iterator = data.create_dict_iterator()
>>> for item in iterator:
>>>     # print the data in column1
>>>     print(item["column1"])
create_tuple_iterator(columns=None, num_epochs=-1, output_numpy=False)

Create an iterator over the dataset. The data retrieved will be a list of ndarrays of data.

To specify which columns to list and the order needed, use columns. If columns is not provided, the order of the columns will not be changed.

Parameters
  • columns (list[str], optional) – List of columns to be used to specify the order of columns (default=None, means all columns).

  • num_epochs (int, optional) – Maximum number of epochs that iterator can be iterated. (default=-1, iterator can be iterated infinite number of epochs)

  • output_numpy (bool, optional) – Whether or not to output NumPy datatype. If output_numpy=False, iterator will output MSTensor (default=False).

Returns

Iterator, list of ndarrays.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>>
>>> # Create an iterator
>>> # The order of the columns in the data obtained by the iterator will not be changed.
>>> iterator = data.create_tuple_iterator()
>>> for item in iterator:
>>>     # convert the returned tuple to a list and print
>>>     print(list(item))
device_que(prefetch_size=None, send_epoch_end=True)

Return a transferred Dataset that transfers data through a device.

Parameters
  • prefetch_size (int, optional) – Prefetch number of records ahead of the user’s request (default=None).

  • send_epoch_end (bool, optional) – Whether to send end of sequence to device or not (default=True).

Note

If the device is Ascend, features of data will be transferred one by one. Each transfer is limited to 256M of data.

Returns

TransferDataset, dataset for transferring.

filter(predicate, input_columns=None, num_parallel_workers=1)

Filter dataset by predicate.

Note

If input_columns is not provided or is empty, all columns will be used.

Parameters
  • predicate (callable) – Python callable which returns a boolean value. Elements for which it returns False are filtered out.

  • input_columns (list[str], optional) – List of names of the input columns, when default=None, the predicate will be applied on all columns in the dataset.

  • num_parallel_workers (int, optional) – Number of workers to process the dataset in parallel (default=1).
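
The role of predicate and input_columns can be sketched in plain Python. filter_rows and the list-of-dicts row format are hypothetical, and the sketch ignores parallel workers.

```python
def filter_rows(rows, predicate, input_columns=None):
    """Keep the rows for which predicate returns True.

    rows is a hypothetical list of dicts mapping column name -> value.
    """
    kept = []
    for row in rows:
        # With no input_columns, every column is passed, in row order.
        columns = input_columns or list(row)
        if predicate(*(row[name] for name in columns)):
            kept.append(row)
    return kept


rows = [{"data": i} for i in range(64)]
small = filter_rows(rows, lambda data: data < 11, input_columns=["data"])
```

Rows for which the predicate returns False (here, values of 11 and above) do not appear in the result.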

Returns

FilterDataset, dataset filtered.

Examples

>>> import mindspore.dataset as ds
>>> # generator data(0 ~ 63)
>>> # filter out the data that is greater than or equal to 11
>>> dataset_f = dataset.filter(predicate=lambda data: data < 11, input_columns=["data"])
flat_map(func)

Map func to each row in dataset and flatten the result.

The specified func is a function that must take one ‘Ndarray’ as input and return a ‘Dataset’.
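
Conceptually, flat_map behaves like mapping followed by chaining. The sketch below shows that idea with plain Python iterables standing in for datasets. The flat_map function here is a hypothetical stand-in, not the Dataset method.

```python
from itertools import chain


def flat_map(rows, func):
    """Apply func to each row and chain the resulting iterables into one stream."""
    return chain.from_iterable(func(row) for row in rows)


# Hypothetical stand-in: each row is a count, and func expands it into a small "dataset".
expanded = list(flat_map([1, 2, 3], lambda n: [n] * n))
```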

Parameters

func (function) – A function that must take one ‘Ndarray’ as an argument and return a ‘Dataset’.

Returns

Dataset, applied by the function.

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>>
>>> # Declare a function which returns a Dataset object
>>> def flat_map_func(x):
>>>     data_dir = text.to_str(x[0])
>>>     d = ds.ImageFolderDataset(data_dir)
>>>     return d
>>> # data is an instance of a Dataset object.
>>> data = ds.TextFileDataset(DATA_FILE)
>>> data = data.flat_map(flat_map_func)
Raises
  • TypeError – If func is not a function.

  • TypeError – If func doesn’t return a Dataset.

get_batch_size()

Get the size of a batch.

Returns

Number, the number of data in a batch.

get_col_names()

Get the names of the columns in the dataset.

get_dataset_size()[source]

Get the number of batches in an epoch.

Returns

Number, number of batches.

get_repeat_count()

Get the repeat count if this is a RepeatDataset; otherwise return 1.

Returns

Number, the count of repeat.

map(operations=None, input_columns=None, output_columns=None, column_order=None, num_parallel_workers=None, python_multiprocessing=False, cache=None, callbacks=None)

Apply each operation in operations to this dataset.

The order of operations is determined by the position of each operation in the operations parameter. operations[0] will be applied first, then operations[1], then operations[2], etc.

Each operation will be passed one or more columns from the dataset as input, and zero or more columns will be outputted. The first operation will be passed the columns specified in input_columns as input. If there is more than one operator in operations, the outputted columns of the previous operation are used as the input columns for the next operation. The columns outputted by the very last operation will be assigned names specified by output_columns.

Only the columns specified in column_order will be propagated to the child node. These columns will be in the same order as specified in column_order.
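
The flow of columns through map can be sketched for a single row in plain Python. This is a simplified model, not MindSpore's implementation: map_row is a hypothetical name, a row is modeled as a dict, and the output columns are assumed to take the position of the first input column when column_order is not given.

```python
def map_row(row, operations, input_columns, output_columns=None, column_order=None):
    """Apply operations in order to one row (a dict of column name -> value)."""
    output_columns = output_columns or input_columns
    values = tuple(row[name] for name in input_columns)
    for op in operations:
        result = op(*values)
        # An operation may return one value or a tuple of values;
        # its output is the input of the next operation.
        values = result if isinstance(result, tuple) else (result,)
    mapped = {}
    for name, value in row.items():
        if name == input_columns[0]:
            # The outputs of the last operation replace the input columns.
            mapped.update(zip(output_columns, values))
        elif name not in input_columns:
            mapped[name] = value
    if column_order is not None:
        # Only the listed columns are propagated, in the listed order.
        mapped = {name: mapped[name] for name in column_order}
    return mapped


row = {"col0": 1, "col1": 10}
operations = [lambda x: x + x, lambda x: x - 1]
renamed = map_row(row, operations, ["col0"], ["col0_mapped"])
reordered = map_row(row, operations, ["col0"], ["col0_mapped"], ["col1", "col0_mapped"])
```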

Parameters
  • operations (Union[list[TensorOp], list[functions]]) – List of operations to be applied on the dataset. Operations are applied in the order they appear in this list.

  • input_columns (list[str]) – List of the names of the columns that will be passed to the first operation as input. The size of this list must match the number of input columns expected by the first operator. (default=None, the first operation will be passed as many columns as it requires, starting from the first column).

  • output_columns (list[str], optional) – List of names assigned to the columns outputted by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. (default=None, output columns will have the same name as the input columns, i.e., the columns will be replaced).

  • column_order (list[str], optional) – List of all the desired columns to propagate to the child node. This list must be a subset of all the columns in the dataset after all operations are applied. The order of the columns in each row propagated to the child node follows the order in which they appear in this list. The parameter is mandatory if len(input_columns) != len(output_columns). (default=None, all columns will be propagated to the child node, the order of the columns will remain the same).

  • num_parallel_workers (int, optional) – Number of threads used to process the dataset in parallel (default=None, the value from the configuration will be used).

  • python_multiprocessing (bool, optional) – Parallelize Python operations with multiple worker processes. This option could be beneficial if the Python operation is computationally heavy (default=False).

  • cache (DatasetCache, optional) – Tensor cache to use. (default=None which means no cache is used). The cache feature is under development and is not recommended.

  • callbacks (DSCallback, list[DSCallback], optional) – List of Dataset callbacks to be called (default=None).

Returns

MapDataset, dataset after mapping operation.

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.vision.c_transforms as c_transforms
>>>
>>> # data is an instance of Dataset which has 2 columns, "image" and "label".
>>> # ds_pyfunc is an instance of Dataset which has 3 columns, "col0", "col1", and "col2".
>>> # Each column is a 2D array of integers.
>>>
>>> # Set the global configuration value for num_parallel_workers to be 2.
>>> # Operations which use this configuration value will use 2 worker threads,
>>> # unless otherwise specified in the operator's constructor.
>>> # set_num_parallel_workers can be called again later if a different
>>> # global configuration value for the number of worker threads is desired.
>>> ds.config.set_num_parallel_workers(2)
>>>
>>> # Define two operations, where each operation accepts 1 input column and outputs 1 column.
>>> decode_op = c_transforms.Decode(rgb_format=True)
>>> random_jitter_op = c_transforms.RandomColorAdjust((0.8, 0.8), (1, 1), (1, 1), (0, 0))
>>>
>>> # 1) Simple map example
>>>
>>> operations = [decode_op]
>>> input_columns = ["image"]
>>>
>>> # Apply decode_op on column "image". This column will be replaced by the outputted
>>> # column of decode_op. Since column_order is not provided, both columns "image"
>>> # and "label" will be propagated to the child node in their original order.
>>> ds_decoded = data.map(operations, input_columns)
>>>
>>> # Rename column "image" to "decoded_image".
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(operations, input_columns, output_columns)
>>>
>>> # Specify the order of the columns.
>>> column_order = ["label", "image"]
>>> ds_decoded = data.map(operations, input_columns, None, column_order)
>>>
>>> # Rename column "image" to "decoded_image" and also specify the order of the columns.
>>> column_order = ["label", "decoded_image"]
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(operations, input_columns, output_columns, column_order)
>>>
>>> # Rename column "image" to "decoded_image" and keep only this column.
>>> column_order = ["decoded_image"]
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(operations, input_columns, output_columns, column_order)
>>>
>>> # A simple example using pyfunc: Renaming columns and specifying column order
>>> # work in the same way as the previous examples.
>>> input_columns = ["col0"]
>>> operations = [(lambda x: x + 1)]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns)
>>>
>>> # 2) Map example with more than one operation
>>>
>>> # If this list of operations is used with map, decode_op will be applied
>>> # first, then random_jitter_op will be applied.
>>> operations = [decode_op, random_jitter_op]
>>>
>>> input_columns = ["image"]
>>>
>>> # Create a dataset where the images are decoded, then randomly color jittered.
>>> # decode_op takes column "image" as input and outputs one column. The column
>>> # outputted by decode_op is passed as input to random_jitter_op.
>>> # random_jitter_op will output one column. Column "image" will be replaced by
>>> # the column outputted by random_jitter_op (the very last operation). All other
>>> # columns are unchanged. Since column_order is not specified, the order of the
>>> # columns will remain the same.
>>> ds_mapped = data.map(operations, input_columns)
>>>
>>> # Create a dataset that is identical to ds_mapped, except the column "image"
>>> # that is outputted by random_jitter_op is renamed to "image_transformed".
>>> # Specifying column order works in the same way as examples in 1).
>>> output_columns = ["image_transformed"]
>>> ds_mapped_and_renamed = data.map(operations, input_columns, output_columns)
>>>
>>> # Multiple operations using pyfunc: Renaming columns and specifying column order
>>> # work in the same way as examples in 1).
>>> input_columns = ["col0"]
>>> operations = [(lambda x: x + x), (lambda x: x - 1)]
>>> output_columns = ["col0_mapped"]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns, output_columns)
>>>
>>> # 3) Example where number of input columns is not equal to number of output columns
>>>
>>> # operations[0] is a lambda that takes 2 columns as input and outputs 3 columns.
>>> # operations[1] is a lambda that takes 3 columns as input and outputs 1 column.
>>> # operations[2] is a lambda that takes 1 column as input and outputs 4 columns.
>>> #
>>> # Note: The number of output columns of operations[i] must equal the number of
>>> # input columns of operations[i+1]. Otherwise, this map call will result
>>> # in an error.
>>> operations = [(lambda x, y: (x, x + y, x + y + 1)),
>>>               (lambda x, y, z: x * y * z),
>>>               (lambda x: (x % 2, x % 3, x % 5, x % 7))]
>>>
>>> # Note: Since the number of input columns is not the same as the number of
>>> # output columns, the output_columns and column_order parameters must be
>>> # specified. Otherwise, this map call will also result in an error.
>>> input_columns = ["col2", "col0"]
>>> output_columns = ["mod2", "mod3", "mod5", "mod7"]
>>>
>>> # Propagate all columns to the child node in this order:
>>> column_order = ["col0", "col2", "mod2", "mod3", "mod5", "mod7", "col1"]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns, output_columns, column_order)
>>>
>>> # Propagate some columns to the child node in this order:
>>> column_order = ["mod7", "mod3", "col1"]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns, output_columns, column_order)
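The column handling in example 3) can be traced on a single row with plain Python. The sketch below is not MindSpore code: apply_map is a hypothetical helper that models a row as a dict of scalars, and it assumes that, when column_order is given, both the original columns and the newly named output columns can be selected (as the examples above suggest).

```python
def apply_map(row, operations, input_columns, output_columns=None, column_order=None):
    """Model the column handling of map() on one row (a dict). Illustration only."""
    # The first operation receives the columns named in input_columns.
    values = tuple(row[name] for name in input_columns)
    for op in operations:
        result = op(*values)
        # Normalize a single output to a 1-tuple so it can feed the next operation.
        values = result if isinstance(result, tuple) else (result,)

    if output_columns is None:
        output_columns = input_columns
    if len(values) != len(output_columns):
        raise ValueError("output_columns must match the outputs of the last operation")

    if column_order is None:
        if len(input_columns) != len(output_columns):
            raise ValueError("column_order is required when the number of columns changes")
        # Replace each input column in place, keeping the original column order.
        replacements = dict(zip(input_columns, zip(output_columns, values)))
        new_row = {}
        for name, value in row.items():
            if name in replacements:
                new_name, new_value = replacements[name]
                new_row[new_name] = new_value
            else:
                new_row[name] = value
        return new_row

    # Assumption: original and output columns are both available for selection.
    available = dict(row)
    available.update(zip(output_columns, values))
    return {name: available[name] for name in column_order}


row = {"col0": 1, "col1": 2, "col2": 3}
operations = [(lambda x, y: (x, x + y, x + y + 1)),
              (lambda x, y, z: x * y * z),
              (lambda x: (x % 2, x % 3, x % 5, x % 7))]
result = apply_map(row, operations, ["col2", "col0"],
                   ["mod2", "mod3", "mod5", "mod7"], ["mod7", "mod3", "col1"])
```

Here x is taken from "col2" and y from "col0", because the first operation receives the columns in the order given by input_columns.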
num_classes()

Get the number of classes in a dataset.

Returns

Number, number of classes.

output_shapes()

Get the shapes of output data.

Returns

List, list of shapes of each column.

output_types()

Get the types of output data.

Returns

List of data types.

project(columns)

Project certain columns in input dataset.

The specified columns will be selected from the dataset and passed down the pipeline in the order specified. The other columns are discarded.

Parameters

columns (list[str]) – List of names of the columns to project.

Returns

ProjectDataset, dataset projected.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>> columns_to_project = ["column3", "column1", "column2"]
>>>
>>> # Create a dataset that consists of column3, column1, column2
>>> # in that order, regardless of the original order of columns.
>>> data = data.project(columns=columns_to_project)
rename(input_columns, output_columns)

Rename the columns in input datasets.

Parameters
  • input_columns (list[str]) – List of names of the input columns.

  • output_columns (list[str]) – List of names of the output columns.

Returns

RenameDataset, dataset renamed.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> input_columns = ["input_col1", "input_col2", "input_col3"]
>>> output_columns = ["output_col1", "output_col2", "output_col3"]
>>>
>>> # Create a dataset where input_col1 is renamed to output_col1, and
>>> # input_col2 is renamed to output_col2, and input_col3 is renamed
>>> # to output_col3.
>>> data = data.rename(input_columns=input_columns, output_columns=output_columns)
repeat(count=None)

Repeat this dataset count times. Repeat indefinitely if the count is None or -1.

Note

The order in which repeat and batch are used affects the number of batches. It is recommended that the repeat operation be used after the batch operation. If dataset_sink_mode is False, the repeat operation is invalid. If dataset_sink_mode is True, the repeat count must equal the number of training epochs. Otherwise, errors could occur because the amount of data does not match the amount that training requires.

Parameters

count (int) – Number of times the dataset is repeated (default=None).

Returns

RepeatDataset, dataset repeated.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>>
>>> # Create a dataset where the dataset is repeated for 50 epochs
>>> repeated = data.repeat(50)
>>>
>>> # Create a dataset where each epoch is shuffled individually
>>> shuffled_and_repeated = data.shuffle(10)
>>> shuffled_and_repeated = shuffled_and_repeated.repeat(50)
>>>
>>> # Create a dataset where the dataset is first repeated for
>>> # 50 epochs before shuffling. The shuffle operator will treat
>>> # the entire 50 epochs as one big dataset.
>>> repeat_and_shuffle = data.repeat(50)
>>> repeat_and_shuffle = repeat_and_shuffle.shuffle(10)
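The note about the order of repeat and batch can be checked with simple arithmetic. The sketch below is plain Python, not MindSpore code; count_batches is a hypothetical helper, and it assumes that batching after repeat treats the repeated rows as one continuous stream, so a batch may span an epoch boundary.

```python
import math


def count_batches(num_rows, batch_size, repeat_count, repeat_first, drop_remainder=False):
    """Count the batches produced by the two orderings of repeat and batch."""
    def batches(rows):
        if drop_remainder:
            return rows // batch_size
        return math.ceil(rows / batch_size)

    if repeat_first:
        # repeat -> batch: all repeated rows are cut into batches together.
        return batches(num_rows * repeat_count)
    # batch -> repeat: each epoch is batched on its own, then replayed.
    return batches(num_rows) * repeat_count


batch_then_repeat = count_batches(10, 4, 2, repeat_first=False)
repeat_then_batch = count_batches(10, 4, 2, repeat_first=True)
```

With 10 rows, a batch size of 4 and 2 repeats, batching first gives 3 batches per epoch (6 in total), while repeating first gives 5 batches over the 20 rows.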
reset()

Reset the dataset for next epoch.

save(file_name, num_files=1, file_type='mindrecord')

Save the dynamic data processed by the dataset pipeline in a common dataset format. Supported dataset formats: ‘mindrecord’ only.

Implicit type casting exists when saving data as ‘mindrecord’. The table below shows how to do type casting.

Implicit Type Casting when Saving as ‘mindrecord’

Type in ‘dataset’    Type in ‘mindrecord’    Details
bool                 None                    Not supported
int8                 int32
uint8                bytes(1D uint8)         Drop dimension
int16                int32
uint16               int32
int32                int32
uint32               int64
int64                int64
uint64               None                    Not supported
float16              float32
float32              float32
float64              float64
string               string                  Multi-dimensional string not supported

Note

  1. To save the samples in order, set the dataset’s shuffle to False and num_files to 1.

  2. Before calling this function, do not use the batch operator, the repeat operator, or data augmentation operators with random attributes in a map operator.

  3. Mindrecord does not support DE_UINT64, multi-dimensional DE_UINT8 (the dimension is dropped), or multi-dimensional DE_STRING.

Parameters
  • file_name (str) – Path to dataset file.

  • num_files (int, optional) – Number of dataset files (default=1).

  • file_type (str, optional) – Dataset format (default=’mindrecord’).
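Because some column types cannot be saved, it can help to check a pipeline's column types before calling save. The sketch below is plain Python and only restates the casting table above as a lookup; MINDRECORD_CAST and unsupported_columns are hypothetical names, and types are given as strings for illustration.

```python
# Mapping copied from the implicit type casting table; None marks an unsupported type.
MINDRECORD_CAST = {
    "bool": None,
    "int8": "int32",
    "uint8": "bytes",
    "int16": "int32",
    "uint16": "int32",
    "int32": "int32",
    "uint32": "int64",
    "int64": "int64",
    "uint64": None,
    "float16": "float32",
    "float32": "float32",
    "float64": "float64",
    "string": "string",
}


def unsupported_columns(column_types):
    """Return the names of columns whose type is not supported by 'mindrecord'.

    Types missing from the table are also reported as unsupported.
    """
    return [name for name, dtype in column_types.items()
            if MINDRECORD_CAST.get(dtype) is None]


problem_columns = unsupported_columns(
    {"image": "uint8", "flag": "bool", "id": "uint64", "label": "int32"})
```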

shuffle(buffer_size)

Randomly shuffles the rows of this dataset using the following algorithm:

  1. Make a shuffle buffer that contains the first buffer_size rows.

  2. Randomly select an element from the shuffle buffer to be the next row propagated to the child node.

  3. Get the next row (if any) from the parent node and put it in the shuffle buffer.

  4. Repeat steps 2 and 3 until there are no more rows left in the shuffle buffer.

A seed can be provided to be used on the first epoch. In every subsequent epoch, the seed is changed to a new, randomly generated value.

Parameters

buffer_size (int) – The size of the buffer (must be larger than 1) for shuffling. Setting buffer_size equal to the number of rows in the entire dataset will result in a global shuffle.

Returns

ShuffleDataset, dataset shuffled.

Raises

RuntimeError – If sync operators exist before shuffle.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> # Optionally set the seed for the first epoch
>>> ds.config.set_seed(58)
>>>
>>> # Create a shuffled dataset using a shuffle buffer of size 4
>>> data = data.shuffle(4)
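The four steps of the shuffle algorithm described above can be sketched in plain Python. This is an illustration of the buffer technique only, not MindSpore's implementation; buffer_shuffle and its seed argument are hypothetical names.

```python
import random
from itertools import islice

_EMPTY = object()  # sentinel marking an exhausted source


def buffer_shuffle(rows, buffer_size, seed=None):
    """Yield rows in an order produced by a fixed-size shuffle buffer."""
    if buffer_size < 2:
        raise ValueError("buffer_size must be larger than 1")
    rng = random.Random(seed)
    source = iter(rows)
    # Step 1: fill the buffer with the first buffer_size rows.
    buffer = list(islice(source, buffer_size))
    # Step 4: repeat steps 2 and 3 until the buffer is empty.
    while buffer:
        # Step 2: pick a random element as the next output row.
        index = rng.randrange(len(buffer))
        chosen = buffer[index]
        # Step 3: refill the freed slot with the next source row, if any.
        next_row = next(source, _EMPTY)
        if next_row is _EMPTY:
            buffer[index] = buffer[-1]
            buffer.pop()
        else:
            buffer[index] = next_row
        yield chosen


shuffled = list(buffer_shuffle(range(10), 4, seed=58))
```

The k-th output row (counting from 0) can only come from the first buffer_size + k input rows, which is why a small buffer gives a local shuffle and a buffer as large as the dataset gives a global one.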
skip(count)

Skip the first N elements of this dataset.

Parameters

count (int) – Number of elements in the dataset to be skipped.

Returns

SkipDataset, dataset skipped.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> # Create a dataset which skips first 3 elements from data
>>> data = data.skip(3)
split(sizes, randomize=True)

Split the dataset into smaller, non-overlapping datasets.

Parameters
  • sizes (Union[list[int], list[float]]) –

    If a list of integers [s1, s2, …, sn] is provided, the dataset will be split into n datasets of size s1, size s2, …, size sn respectively. If the sum of all sizes does not equal the original dataset size, an error will occur. If a list of floats [f1, f2, …, fn] is provided, all floats must be between 0 and 1 and must sum to 1, otherwise an error will occur. The dataset will be split into n Datasets of size round(f1*K), round(f2*K), …, round(fn*K) where K is the size of the original dataset. If after rounding:

    • Any size equals 0, an error will occur.

    • The sum of split sizes < K, the difference will be added to the first split.

    • The sum of split sizes > K, the difference will be removed from the first large enough split such that it will have at least 1 row after removing the difference.

  • randomize (bool, optional) – Determines whether or not to split the data randomly (default=True). If True, the data will be randomly split. Otherwise, each split will be created with consecutive rows from the dataset.

Note

  1. There is an optimized split function, which will be called automatically when the dataset that calls this function is a MappableDataset.

  2. Dataset should not be sharded if split is going to be called. Instead, create a DistributedSampler and specify a split to shard after splitting. If dataset is sharded after a split, it is strongly recommended to set the same seed in each instance of execution, otherwise each shard may not be part of the same split (see Examples).

  3. It is strongly recommended to not shuffle the dataset, but use randomize=True instead. Shuffling the dataset may not be deterministic, which means the data in each split will be different in each epoch. Furthermore, if sharding occurs after split, each shard may not be part of the same split.

Raises
  • RuntimeError – If get_dataset_size returns None or is not supported for this dataset.

  • RuntimeError – If sizes is list of integers and sum of all elements in sizes does not equal the dataset size.

  • RuntimeError – If sizes is list of float and there is a split with size 0 after calculations.

  • RuntimeError – If the dataset is sharded prior to calling split.

  • ValueError – If sizes is list of float and not all floats are between 0 and 1, or if the floats don’t sum to 1.

Returns

tuple(Dataset), a tuple of datasets that have been split.

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_dir = "/path/to/imagefolder_directory"
>>>
>>> # Since many datasets have shuffle on by default, set shuffle to False if split will be called!
>>> data = ds.ImageFolderDataset(dataset_dir, shuffle=False)
>>>
>>> # Set the seed, and tell split to use this seed when randomizing.
>>> # This is needed because sharding will be done later
>>> ds.config.set_seed(58)
>>> train, test = data.split([0.9, 0.1])
>>>
>>> # To shard the train dataset, use a DistributedSampler
>>> train_sampler = ds.DistributedSampler(10, 2)
>>> train.use_sampler(train_sampler)
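The rounding rules for a list of floats in sizes can be worked through with a small plain-Python sketch. It is not MindSpore code; split_sizes is a hypothetical helper, and it assumes Python's built-in round, which rounds halves to the nearest even integer and may differ from the rounding used by the library.

```python
import math


def split_sizes(fractions, total):
    """Turn fractions that sum to 1 into integer split sizes for total rows."""
    if not all(0 < f <= 1 for f in fractions) or not math.isclose(sum(fractions), 1.0):
        raise ValueError("fractions must be between 0 and 1 and sum to 1")
    sizes = [round(f * total) for f in fractions]
    if any(size == 0 for size in sizes):
        raise RuntimeError("a split has size 0 after rounding")
    difference = total - sum(sizes)
    if difference > 0:
        # Too few rows assigned: add the difference to the first split.
        sizes[0] += difference
    elif difference < 0:
        # Too many rows assigned: shrink the first split that keeps at least 1 row.
        excess = -difference
        for i, size in enumerate(sizes):
            if size > excess:
                sizes[i] -= excess
                break
        else:
            raise RuntimeError("no split is large enough to absorb the difference")
    return sizes


even_split = split_sizes([0.5, 0.5], 5)
uneven_split = split_sizes([0.25, 0.25, 0.5], 6)
```

In the first call both halves round to 2, so the missing row is added to the first split. In the second call the rounded sizes sum to 7, so one row is removed from the first split.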
sync_update(condition_name, num_batch=None, data=None)

Release a blocking condition and trigger callback with given data.

Parameters
  • condition_name (str) – The condition name that is used to toggle sending next row.

  • num_batch (Union[int, None]) – The number of batches (rows) that are released. When num_batch is None, it will default to the number specified by the sync_wait operator (default=None).

  • data (Union[dict, None]) – The data passed to the callback (default=None).

sync_wait(condition_name, num_batch=1, callback=None)

Add a blocking condition to the input Dataset.

Parameters
  • condition_name (str) – The condition name that is used to toggle sending next row.

  • num_batch (int) – The number of batches without blocking at the start of each epoch (default=1).

  • callback (function) – The callback function that will be invoked when sync_update is called (default=None).

Raises

RuntimeError – If condition name already exists.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> batch_size = 32  # illustrative value
>>> data = data.sync_wait("callback1")
>>> data = data.batch(batch_size)
>>> for batch_data in data.create_dict_iterator():
>>>     data.sync_update("callback1")
take(count=-1)

Take at most the given number of elements from the dataset.

Note

  1. If count is greater than the number of elements in the dataset or equal to -1, all the elements in dataset will be taken.

  2. The order of using take and batch matters. If take is used before the batch operation, it takes the given number of rows; otherwise, it takes the given number of batches.

Parameters

count (int, optional) – Number of elements to be taken from the dataset (default=-1).

Returns

TakeDataset, dataset taken.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> # Create a dataset where the dataset includes 50 elements.
>>> data = data.take(50)
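The difference between taking rows and taking batches can be illustrated with plain-Python generators. This is a sketch of the ordering effect only, not MindSpore code; batch and take below are hypothetical stand-ins for the dataset operations.

```python
from itertools import islice


def batch(rows, batch_size):
    """Group consecutive rows into lists of batch_size (the last may be shorter)."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, batch_size))
        if not chunk:
            return
        yield chunk


def take(rows, count=-1):
    """Yield at most count elements; count == -1 takes every element."""
    return iter(rows) if count == -1 else islice(rows, count)


rows = range(10)
take_then_batch = list(batch(take(rows, 4), 2))  # 4 rows, grouped into 2 batches
batch_then_take = list(take(batch(rows, 2), 4))  # 4 batches, covering 8 rows
```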
to_device(send_epoch_end=True)

Transfer data through CPU, GPU or Ascend devices.

Parameters

send_epoch_end (bool, optional) – Whether to send end of sequence to device or not (default=True).

Note

If the device is Ascend, the features of the data will be transferred one by one. The amount of data transferred at a time is limited to 256M.

Returns

TransferDataset, dataset for transferring.

Raises
  • TypeError – If device_type is empty.

  • ValueError – If device_type is not ‘Ascend’, ‘GPU’ or ‘CPU’.

  • RuntimeError – If dataset is unknown.

  • RuntimeError – If distribution file path is given but failed to read.

use_sampler(new_sampler)

Make the current dataset use the provided new_sampler.

Parameters

new_sampler (Sampler) – The sampler to use for the current dataset.

Returns

Dataset, that uses new_sampler.

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_dir = "/path/to/imagefolder_directory"
>>> # Note: A SequentialSampler is created by default
>>> data = ds.ImageFolderDataset(dataset_dir)
>>>
>>> # Use a DistributedSampler instead of the SequentialSampler
>>> new_sampler = ds.DistributedSampler(10, 2)
>>> data.use_sampler(new_sampler)
zip(datasets)

Zip the datasets in the input tuple of datasets. Columns in the input datasets must not have the same name.

Parameters

datasets (Union[tuple, class Dataset]) – A tuple of datasets or a single class Dataset to be zipped together with this dataset.

Returns

ZipDataset, dataset zipped.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # ds1 and ds2 are instances of Dataset object
>>> # Create a dataset which is the combination of ds1 and ds2
>>> data = ds1.zip(ds2)
class mindspore.dataset.Cifar10Dataset(dataset_dir, usage=None, num_samples=None, num_parallel_workers=None, shuffle=None, sampler=None, num_shards=None, shard_id=None)[source]

A source dataset that reads cifar10 data.

The generated dataset has two columns [‘image’, ‘label’]. The type of the image tensor is uint8. The label is a scalar uint32 tensor. This dataset can take in a sampler. ‘sampler’ and ‘shuffle’ are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using ‘sampler’ and ‘shuffle’

Parameter ‘sampler’    Parameter ‘shuffle’    Expected Order Behavior
None                   None                   random order
None                   True                   random order
None                   False                  sequential order
Sampler object         None                   order defined by sampler
Sampler object         True                   not allowed
Sampler object         False                  not allowed

Citation of Cifar10 dataset.

@techreport{Krizhevsky09,
author       = {Alex Krizhevsky},
title        = {Learning multiple layers of features from tiny images},
institution  = {},
year         = {2009},
howpublished = {http://www.cs.toronto.edu/~kriz/cifar.html},
description  = {The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes,
                with 6000 images per class. There are 50000 training images and 10000 test images.}
}
Parameters
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • usage (str, optional) – Usage of this dataset, can be “train”, “test” or “all”. “train” will read from 50,000 train samples, “test” will read from 10,000 test samples, “all” will read from all 60,000 samples (default=None, all samples).

  • num_samples (int, optional) – The number of images to be included in the dataset. (default=None, all images).

  • num_parallel_workers (int, optional) – Number of workers to read the data (default=None, number set in the config).

  • shuffle (bool, optional) – Whether to perform shuffle on the dataset (default=None, expected order behavior shown in the table).

  • sampler (Sampler, optional) – Object used to choose samples from the dataset (default=None, expected order behavior shown in the table).

  • num_shards (int, optional) – Number of shards that the dataset will be divided into (default=None).

  • shard_id (int, optional) – The shard ID within num_shards (default=None). This argument can only be specified when num_shards is also specified.

Raises
  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and sharding are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If shard_id is invalid (< 0 or >= num_shards).

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_dir = "/path/to/cifar10_dataset_directory"
>>>
>>> # 1) Get all samples from CIFAR10 dataset in sequence
>>> dataset = ds.Cifar10Dataset(dataset_dir=dataset_dir, shuffle=False)
>>>
>>> # 2) Randomly select 350 samples from CIFAR10 dataset
>>> dataset = ds.Cifar10Dataset(dataset_dir=dataset_dir, num_samples=350, shuffle=True)
>>>
>>> # 3) Get samples from CIFAR10 dataset for shard 0 in a 2-way distributed training
>>> dataset = ds.Cifar10Dataset(dataset_dir=dataset_dir, num_shards=2, shard_id=0)
>>>
>>> # In CIFAR10 dataset, each dictionary has keys "image" and "label"
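The table of ‘sampler’ and ‘shuffle’ combinations can be restated as a small decision function. The sketch below is plain Python for illustration; expected_order is a hypothetical helper and does not call MindSpore.

```python
def expected_order(sampler=None, shuffle=None):
    """Return the order behavior listed in the table for a sampler/shuffle pair."""
    if sampler is not None:
        if shuffle is not None:
            # The table marks a sampler combined with True or False as not allowed.
            raise RuntimeError("sampler and shuffle cannot be specified at the same time")
        return "order defined by sampler"
    # Without a sampler, only an explicit shuffle=False gives sequential order.
    return "sequential order" if shuffle is False else "random order"


default_order = expected_order()
sequential = expected_order(shuffle=False)
```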
apply(apply_func)

Apply a function in this dataset.

Parameters

apply_func (function) – A function that must take one ‘Dataset’ as an argument and return a preprocessed ‘Dataset’.

Returns

Dataset, applied by the function.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>>
>>> # Declare an apply_func function which returns a Dataset object
>>> def apply_func(ds):
>>>     ds = ds.batch(2)
>>>     return ds
>>>
>>> # Use apply to call apply_func
>>> data = data.apply(apply_func)
Raises
  • TypeError – If apply_func is not a function.

  • TypeError – If apply_func doesn’t return a Dataset.

batch(batch_size, drop_remainder=False, num_parallel_workers=None, per_batch_map=None, input_columns=None, output_columns=None, column_order=None, pad_info=None)

Combine batch_size number of consecutive rows into batches.

For any child node, a batch is treated as a single row. For any column, all the elements within that column must have the same shape. If a per_batch_map callable is provided, it will be applied to the batches of tensors.

Note

The order in which repeat and batch are used affects the number of batches and the behavior of per_batch_map. It is recommended that the repeat operation be used after the batch operation.

Parameters
  • batch_size (int or function) – The number of rows each batch is created with. An int or callable which takes exactly 1 parameter, BatchInfo.

  • drop_remainder (bool, optional) – Determines whether or not to drop the last possibly incomplete batch (default=False). If True, and if there are fewer than batch_size rows available to make the last batch, then those rows will be dropped and not propagated to the child node.

  • num_parallel_workers (int, optional) – Number of workers to process the dataset in parallel (default=None).

  • per_batch_map (callable, optional) – Per batch map callable. A callable which takes (list[Tensor], list[Tensor], …, BatchInfo) as input parameters. Each list[Tensor] represents a batch of Tensors on a given column. The number of lists should match with number of entries in input_columns. The last parameter of the callable should always be a BatchInfo object.

  • input_columns (list[str], optional) – List of names of the input columns. The size of the list should match with signature of the per_batch_map callable.

  • output_columns (list[str], optional) – [Not currently implemented] List of names assigned to the columns outputted by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. (default=None, output columns will have the same name as the input columns, i.e., the columns will be replaced).

  • column_order (list[str], optional) – [Not currently implemented] List of all the desired columns to propagate to the child node. This list must be a subset of all the columns in the dataset after all operations are applied. The order of the columns in each row propagated to the child node follows the order they appear in this list. The parameter is mandatory if len(input_columns) != len(output_columns). (default=None, all columns will be propagated to the child node, the order of the columns will remain the same).

  • pad_info (dict, optional) – Whether to perform padding on selected columns. pad_info={“col1”:([224,224],0)} would pad column with name “col1” to a tensor of size [224,224] and fill the missing with 0.

Returns

BatchDataset, dataset batched.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>>
>>> # Create a dataset where every 100 rows is combined into a batch
>>> # and drops the last incomplete batch if there is one.
>>> data = data.batch(100, True)
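The effect of drop_remainder and of padding rows to a common shape can be sketched in plain Python. This is a simplified model with one column of 1-D rows, not MindSpore code; batch_rows, pad_to and pad_value are hypothetical names standing in for the behavior that pad_info describes.

```python
def batch_rows(rows, batch_size, drop_remainder=False, pad_to=None, pad_value=0):
    """Group consecutive 1-D rows into batches, optionally padding each row."""
    batches, current = [], []
    for row in rows:
        row = list(row)
        if pad_to is not None:
            # Rows shorter than pad_to are filled with pad_value; longer rows are kept as is.
            row = row + [pad_value] * (pad_to - len(row))
        current.append(row)
        if len(current) == batch_size:
            batches.append(current)
            current = []
    # The last batch may be incomplete; keep it unless drop_remainder is set.
    if current and not drop_remainder:
        batches.append(current)
    return batches


rows = [[1], [2, 3], [4, 5, 6], [7], [8, 9]]
full_batches = batch_rows(rows, 2, drop_remainder=True, pad_to=3)
all_batches = batch_rows(rows, 2, pad_to=3)
```

Padding is what lets rows of different lengths satisfy the requirement that all elements within a column have the same shape.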
bucket_batch_by_length(column_names, bucket_boundaries, bucket_batch_sizes, element_length_function=None, pad_info=None, pad_to_bucket_boundary=False, drop_remainder=False)

Bucket elements according to their lengths. Each bucket will be padded and batched when it is full.

A length function is called on each row in the dataset. The row is then bucketed based on its length and bucket_boundaries. When a bucket reaches its corresponding size specified in bucket_batch_sizes, the entire bucket will be padded according to pad_info, and then batched. Each batch will be full, except for maybe the last batch for each bucket.

Parameters
  • column_names (list[str]) – Columns passed to element_length_function.

  • bucket_boundaries (list[int]) – A list consisting of the upper boundaries of the buckets. Must be strictly increasing. If there are n boundaries, n+1 buckets are created: One bucket for [0, bucket_boundaries[0]), one bucket for [bucket_boundaries[i], bucket_boundaries[i+1]) for each 0<=i<n-1, and one bucket for [bucket_boundaries[n-1], inf).

  • bucket_batch_sizes (list[int]) – A list consisting of the batch sizes for each bucket. Must contain len(bucket_boundaries)+1 elements.

  • element_length_function (Callable, optional) – A function that takes in len(column_names) arguments and returns an int. If no value is provided, then len(column_names) must be 1, and the size of the first dimension of that column will be taken as the length (default=None).

  • pad_info (dict, optional) – Represents how to batch each column. The key corresponds to the column name, and the value must be a tuple of 2 elements. The first element corresponds to the shape to pad to, and the second element corresponds to the value to pad with. If a column is not specified, then that column will be padded to the longest in the current batch, and 0 will be used as the padding value. Any None dimensions will be padded to the longest in the current batch, unless pad_to_bucket_boundary is True. If no padding is wanted, set pad_info to None (default=None).

  • pad_to_bucket_boundary (bool, optional) – If True, will pad each None dimension in pad_info to the bucket_boundary minus 1. If there are any elements that fall into the last bucket, an error will occur (default=False).

  • drop_remainder (bool, optional) – If True, will drop the last batch for each bucket if it is not a full batch (default=False).

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>>
>>> # Create a dataset where rows are bucketed by the longer of "col1" and "col2",
>>> # then padded and batched per bucket.
>>> column_names = ["col1", "col2"]
>>> bucket_boundaries = [5, 10]
>>> bucket_batch_sizes = [5, 1, 1]
>>> element_length_function = (lambda col1, col2: max(len(col1), len(col2)))
>>>
>>> # Will pad col1 to shape [2, bucket_boundaries[i]] where i is the
>>> # index of the bucket that is currently being batched.
>>> # Will pad col2 to a shape where each dimension is the longest in all
>>> # the elements currently being batched.
>>> pad_info = {"col1": ([2, None], -1)}
>>> pad_to_bucket_boundary = True
>>>
>>> data = data.bucket_batch_by_length(column_names, bucket_boundaries,
>>>                                    bucket_batch_sizes,
>>>                                    element_length_function, pad_info,
>>>                                    pad_to_bucket_boundary)
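How rows are assigned to buckets and released as padded batches can be sketched in plain Python. This is a simplified model with one column of 1-D rows, where the row length is len(row) and padding goes to the longest row in the batch; bucket_index, pad_bucket and bucket_batch are hypothetical helpers, not MindSpore code.

```python
from bisect import bisect_right


def bucket_index(length, bucket_boundaries):
    """Return the bucket for a length: [0, b0) -> 0, [b0, b1) -> 1, ..., [b_last, inf) -> n."""
    return bisect_right(bucket_boundaries, length)


def pad_bucket(bucket, pad_value=0):
    """Pad every row in a bucket to the longest row currently in it."""
    longest = max(len(row) for row in bucket)
    return [list(row) + [pad_value] * (longest - len(row)) for row in bucket]


def bucket_batch(rows, bucket_boundaries, bucket_batch_sizes, drop_remainder=False):
    """Bucket rows by length and emit a padded batch whenever a bucket is full."""
    buckets = [[] for _ in bucket_batch_sizes]
    batches = []
    for row in rows:
        index = bucket_index(len(row), bucket_boundaries)
        buckets[index].append(row)
        if len(buckets[index]) == bucket_batch_sizes[index]:
            batches.append(pad_bucket(buckets[index]))
            buckets[index] = []
    if not drop_remainder:
        # Emit the partially filled buckets as the last batch of each bucket.
        batches.extend(pad_bucket(bucket) for bucket in buckets if bucket)
    return batches


rows = [[1], [1, 2, 3, 4], [1, 2], [1, 2, 3], [1, 2, 3, 4, 5, 6, 7]]
batches = bucket_batch(rows, [3, 6], [2, 2, 1])
```

Because rows in one batch have similar lengths, less padding is needed than when batching rows in their original order.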
concat(datasets)

Concatenate the datasets in the input list of datasets. The “+” operator is also supported to concatenate.

Note

The column name, and rank and type of the column data must be the same in the input datasets.

Parameters

datasets (Union[list, class Dataset]) – A list of datasets or a single class Dataset to be concatenated together with this dataset.

Returns

ConcatDataset, dataset concatenated.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # ds1 and ds2 are instances of Dataset object
>>>
>>> # Create a dataset by concatenating ds1 and ds2 with "+" operator
>>> data1 = ds1 + ds2
>>> # Create a dataset by concatenating ds1 and ds2 with concat operation
>>> data1 = ds1.concat(ds2)
create_dict_iterator(num_epochs=-1, output_numpy=False)

Create an iterator over the dataset. The data retrieved will be a dictionary.

The order of the columns in the dictionary may not be the same as the original order.

Parameters
  • num_epochs (int, optional) – Maximum number of epochs that iterator can be iterated (default=-1, iterator can be iterated infinite number of epochs).

  • output_numpy (bool, optional) – Whether or not to output NumPy datatype, if output_numpy=False, iterator will output MSTensor (default=False).

Returns

Iterator, dictionary of column name-ndarray pair.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>>
>>> # create an iterator
>>> # The order of the columns obtained by the iterator may differ from the original order.
>>> iterator = data.create_dict_iterator()
>>> for item in iterator:
>>>     # print the data in column1
>>>     print(item["column1"])
create_tuple_iterator(columns=None, num_epochs=-1, output_numpy=False)

Create an iterator over the dataset. The data retrieved will be a list of ndarrays of data.

To specify which columns to list and the order needed, use columns. If columns is not provided, the order of the columns will not be changed.

Parameters
  • columns (list[str], optional) – List of columns to be used to specify the order of columns (default=None, means all columns).

  • num_epochs (int, optional) – Maximum number of epochs that iterator can be iterated. (default=-1, iterator can be iterated infinite number of epochs)

  • output_numpy (bool, optional) – Whether or not to output NumPy datatype. If output_numpy=False, iterator will output MSTensor (default=False).

Returns

Iterator, list of ndarrays.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>>
>>> # Create an iterator
>>> # The columns in the data obtained by the iterator will not be changed.
>>> iterator = data.create_tuple_iterator()
>>> for item in iterator:
>>>     # convert the returned tuple to a list and print
>>>     print(list(item))
device_que(prefetch_size=None, send_epoch_end=True)

Return a transferred Dataset that transfers data through a device.

Parameters
  • prefetch_size (int, optional) – Prefetch number of records ahead of the user’s request (default=None).

  • send_epoch_end (bool, optional) – Whether to send end of sequence to device or not (default=True).

Note

If the device is Ascend, the features of the data will be transferred one by one. The amount of data transferred at a time is limited to 256M.

Returns

TransferDataset, dataset for transferring.

filter(predicate, input_columns=None, num_parallel_workers=1)

Filter dataset by predicate.

Note

If input_columns is not provided or is empty, all columns will be used.

Parameters
  • predicate (callable) – Python callable which returns a boolean value. Rows for which it returns False are filtered out.

  • input_columns (list[str], optional) – List of names of the input columns (default=None, the predicate will be applied on all columns in the dataset).

  • num_parallel_workers (int, optional) – Number of workers to process the dataset in parallel (default=1).

Returns

FilterDataset, dataset filtered.

Examples

>>> import mindspore.dataset as ds
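>>> # dataset is an instance of Dataset object with one column "data" holding values 0 ~ 63.
>>> # Keep the rows whose value is less than 11, i.e. filter out values greater than or equal to 11.
>>> dataset_f = dataset.filter(predicate=lambda data: data < 11, input_columns=["data"])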
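The way a predicate is applied to the selected columns can be sketched in plain Python. The rows are modeled as dicts and filter_rows is a hypothetical helper; this is not MindSpore code.

```python
def filter_rows(rows, predicate, input_columns=None):
    """Yield the rows for which predicate returns True on the selected columns."""
    for row in rows:
        # With no input_columns, every column is passed to the predicate.
        names = input_columns or list(row)
        if predicate(*(row[name] for name in names)):
            yield row


rows = [{"data": value} for value in range(64)]
kept = list(filter_rows(rows, lambda data: data < 11, ["data"]))
```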
flat_map(func)

Map func to each row in dataset and flatten the result.

The specified func is a function that must take one ‘Ndarray’ as input and return a ‘Dataset’.

Parameters

func (function) – A function that must take one ‘Ndarray’ as an argument and return a ‘Dataset’.

Returns

Dataset, applied by the function.

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>>
>>> # Declare a function which returns a Dataset object
>>> def flat_map_func(x):
>>>     data_dir = text.to_str(x[0])
>>>     d = ds.ImageFolderDataset(data_dir)
>>>     return d
>>> # data is an instance of a Dataset object.
>>> data = ds.TextFileDataset(DATA_FILE)
>>> data = data.flat_map(flat_map_func)
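The "map, then flatten" behavior can be shown with plain Python iterables. In this sketch each row is expanded into a list rather than a Dataset, and flat_map is a hypothetical stand-in for the dataset method.

```python
from itertools import chain


def flat_map(rows, func):
    """Apply func to each row and concatenate the resulting sequences."""
    return chain.from_iterable(func(row) for row in rows)


rows = [["a", 2], ["b", 3]]
# Each row expands into row[1] copies of row[0]; the results are joined in order.
expanded = list(flat_map(rows, lambda row: [row[0]] * row[1]))
```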
Raises
  • TypeError – If func is not a function.

  • TypeError – If func doesn’t return a Dataset.

get_batch_size()

Get the size of a batch.

Returns

Number, the number of rows in a batch.

get_col_names()

Get the names of the columns in the dataset.

get_dataset_size()[source]

Get the number of batches in an epoch.

Returns

Number, number of batches.

get_repeat_count()

Get the repeat count of the RepeatDataset, or 1 if the dataset is not repeated.

Returns

Number, the count of repeat.

map(operations=None, input_columns=None, output_columns=None, column_order=None, num_parallel_workers=None, python_multiprocessing=False, cache=None, callbacks=None)

Apply each operation in operations to this dataset.

The order of operations is determined by the position of each operation in the operations parameter. operations[0] will be applied first, then operations[1], then operations[2], etc.

Each operation will be passed one or more columns from the dataset as input, and zero or more columns will be outputted. The first operation will be passed the columns specified in input_columns as input. If there is more than one operator in operations, the outputted columns of the previous operation are used as the input columns for the next operation. The columns outputted by the very last operation will be assigned names specified by output_columns.

Only the columns specified in column_order will be propagated to the child node. These columns will be in the same order as specified in column_order.

Parameters
  • operations (Union[list[TensorOp], list[functions]]) – List of operations to be applied on the dataset. Operations are applied in the order they appear in this list.

  • input_columns (list[str]) – List of the names of the columns that will be passed to the first operation as input. The size of this list must match the number of input columns expected by the first operator. (default=None, the first operation will be passed however many columns are required, starting from the first column).

  • output_columns (list[str], optional) – List of names assigned to the columns outputted by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. (default=None, output columns will have the same name as the input columns, i.e., the columns will be replaced).

  • column_order (list[str], optional) – List of all the desired columns to propagate to the child node. This list must be a subset of all the columns in the dataset after all operations are applied. The order of the columns in each row propagated to the child node follows the order they appear in this list. The parameter is mandatory if len(input_columns) != len(output_columns). (default=None, all columns will be propagated to the child node, the order of the columns will remain the same).

  • num_parallel_workers (int, optional) – Number of threads used to process the dataset in parallel (default=None, the value from the configuration will be used).

  • python_multiprocessing (bool, optional) – Parallelize Python operations with multiple worker processes. This option could be beneficial if the Python operation is computationally heavy (default=False).

  • cache (DatasetCache, optional) – Tensor cache to use. (default=None which means no cache is used). The cache feature is under development and is not recommended.

  • callbacks (DSCallback, list[DSCallback], optional) – List of Dataset callbacks to be called (default=None).

Returns

MapDataset, dataset after mapping operation.

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.vision.c_transforms as c_transforms
>>>
>>> # data is an instance of Dataset which has 2 columns, "image" and "label".
>>> # ds_pyfunc is an instance of Dataset which has 3 columns, "col0", "col1", and "col2".
>>> # Each column is a 2D array of integers.
>>>
>>> # Set the global configuration value for num_parallel_workers to be 2.
>>> # Operations which use this configuration value will use 2 worker threads,
>>> # unless otherwise specified in the operator's constructor.
>>> # set_num_parallel_workers can be called again later if a different
>>> # global configuration value for the number of worker threads is desired.
>>> ds.config.set_num_parallel_workers(2)
>>>
>>> # Define two operations, where each operation accepts 1 input column and outputs 1 column.
>>> decode_op = c_transforms.Decode(rgb_format=True)
>>> random_jitter_op = c_transforms.RandomColorAdjust((0.8, 0.8), (1, 1), (1, 1), (0, 0))
>>>
>>> # 1) Simple map example
>>>
>>> operations = [decode_op]
>>> input_columns = ["image"]
>>>
>>> # Apply decode_op on column "image". This column will be replaced by the outputted
>>> # column of decode_op. Since column_order is not provided, both columns "image"
>>> # and "label" will be propagated to the child node in their original order.
>>> ds_decoded = data.map(operations, input_columns)
>>>
>>> # Rename column "image" to "decoded_image".
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(operations, input_columns, output_columns)
>>>
>>> # Specify the order of the columns.
>>> column_order = ["label", "image"]
>>> ds_decoded = data.map(operations, input_columns, None, column_order)
>>>
>>> # Rename column "image" to "decoded_image" and also specify the order of the columns.
>>> column_order = ["label", "decoded_image"]
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(operations, input_columns, output_columns, column_order)
>>>
>>> # Rename column "image" to "decoded_image" and keep only this column.
>>> column_order = ["decoded_image"]
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(operations, input_columns, output_columns, column_order)
>>>
>>> # A simple example using pyfunc: Renaming columns and specifying column order
>>> # work in the same way as the previous examples.
>>> input_columns = ["col0"]
>>> operations = [(lambda x: x + 1)]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns)
>>>
>>> # 2) Map example with more than one operation
>>>
>>> # If this list of operations is used with map, decode_op will be applied
>>> # first, then random_jitter_op will be applied.
>>> operations = [decode_op, random_jitter_op]
>>>
>>> input_columns = ["image"]
>>>
>>> # Create a dataset where the images are decoded, then randomly color jittered.
>>> # decode_op takes column "image" as input and outputs one column. The column
>>> # outputted by decode_op is passed as input to random_jitter_op.
>>> # random_jitter_op will output one column. Column "image" will be replaced by
>>> # the column outputted by random_jitter_op (the very last operation). All other
>>> # columns are unchanged. Since column_order is not specified, the order of the
>>> # columns will remain the same.
>>> ds_mapped = data.map(operations, input_columns)
>>>
>>> # Create a dataset that is identical to ds_mapped, except the column "image"
>>> # that is outputted by random_jitter_op is renamed to "image_transformed".
>>> # Specifying column order works in the same way as examples in 1).
>>> output_columns = ["image_transformed"]
>>> ds_mapped_and_renamed = data.map(operations, input_columns, output_columns)
>>>
>>> # Multiple operations using pyfunc: Renaming columns and specifying column order
>>> # work in the same way as examples in 1).
>>> input_columns = ["col0"]
>>> operations = [(lambda x: x + x), (lambda x: x - 1)]
>>> output_columns = ["col0_mapped"]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns, output_columns)
>>>
>>> # 3) Example where number of input columns is not equal to number of output columns
>>>
>>> # operations[0] is a lambda that takes 2 columns as input and outputs 3 columns.
>>> # operations[1] is a lambda that takes 3 columns as input and outputs 1 column.
>>> # operations[2] is a lambda that takes 1 column as input and outputs 4 columns.
>>> #
>>> # Note: The number of output columns of operations[i] must equal the number of
>>> # input columns of operations[i+1]. Otherwise, this map call will result
>>> # in an error.
>>> operations = [(lambda x, y: (x, x + y, x + y + 1)),
>>>               (lambda x, y, z: x * y * z),
>>>               (lambda x: (x % 2, x % 3, x % 5, x % 7))]
>>>
>>> # Note: Since the number of input columns is not the same as the number of
>>> # output columns, the output_columns and column_order parameters must be
>>> # specified. Otherwise, this map call will also result in an error.
>>> input_columns = ["col2", "col0"]
>>> output_columns = ["mod2", "mod3", "mod5", "mod7"]
>>>
>>> # Propagate all columns to the child node in this order:
>>> column_order = ["col0", "col2", "mod2", "mod3", "mod5", "mod7", "col1"]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns, output_columns, column_order)
>>>
>>> # Propagate some columns to the child node in this order:
>>> column_order = ["mod7", "mod3", "col1"]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns, output_columns, column_order)
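
How the columns flow from one operation to the next can be traced in plain Python. The sketch below is only an illustration of the chaining rule described for map, not MindSpore's implementation; the helper chain_operations is hypothetical, and plain integers stand in for the column tensors.

```python
def chain_operations(operations, *columns):
    """Feed the outputs of each operation into the next one, in list order."""
    values = columns
    for op in operations:
        result = op(*values)
        # An operation may return one column or a tuple of columns.
        values = result if isinstance(result, tuple) else (result,)
    return values


operations = [
    (lambda x, y: (x, x + y, x + y + 1)),       # 2 columns in, 3 columns out
    (lambda x, y, z: x * y * z),                # 3 columns in, 1 column out
    (lambda x: (x % 2, x % 3, x % 5, x % 7)),   # 1 column in, 4 columns out
]

# The values 3 and 1 stand in for one row of columns "col2" and "col0".
outputs = chain_operations(operations, 3, 1)
result = dict(zip(["mod2", "mod3", "mod5", "mod7"], outputs))
print(result)
```

The number of values returned by one operation has to match the number of arguments of the next, which is why a mismatch in the real pipeline results in an error.
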
num_classes()

Get the number of classes in a dataset.

Returns

Number, number of classes.

output_shapes()

Get the shapes of output data.

Returns

List, list of shapes of each column.

output_types()

Get the types of output data.

Returns

List of data types.

project(columns)

Project certain columns in input dataset.

The specified columns will be selected from the dataset and passed down the pipeline in the order specified. The other columns are discarded.

Parameters

columns (list[str]) – List of names of the columns to project.

Returns

ProjectDataset, dataset projected.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>> columns_to_project = ["column3", "column1", "column2"]
>>>
>>> # Create a dataset that consists of column3, column1, column2
>>> # in that order, regardless of the original order of columns.
>>> data = data.project(columns=columns_to_project)
rename(input_columns, output_columns)

Rename the columns in input datasets.

Parameters
  • input_columns (list[str]) – List of names of the input columns.

  • output_columns (list[str]) – List of names of the output columns.

Returns

RenameDataset, dataset renamed.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> input_columns = ["input_col1", "input_col2", "input_col3"]
>>> output_columns = ["output_col1", "output_col2", "output_col3"]
>>>
>>> # Create a dataset where input_col1 is renamed to output_col1, and
>>> # input_col2 is renamed to output_col2, and input_col3 is renamed
>>> # to output_col3.
>>> data = data.rename(input_columns=input_columns, output_columns=output_columns)
repeat(count=None)

Repeat this dataset count times. Repeat indefinitely if the count is None or -1.

Note

The order of using repeat and batch reflects the number of batches. It is recommended that the repeat operation be used after the batch operation. If dataset_sink_mode is False, the repeat operation is invalid. If dataset_sink_mode is True, repeat count must be equal to the epoch of training. Otherwise, errors could occur since the amount of data is not the amount training requires.

Parameters

count (int) – Number of times the dataset is repeated (default=None).

Returns

RepeatDataset, dataset repeated.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>>
>>> # Create a dataset where the dataset is repeated for 50 epochs
>>> repeated = data.repeat(50)
>>>
>>> # Create a dataset where each epoch is shuffled individually
>>> shuffled_and_repeated = data.shuffle(10)
>>> shuffled_and_repeated = shuffled_and_repeated.repeat(50)
>>>
>>> # Create a dataset where the dataset is first repeated for
>>> # 50 epochs before shuffling. The shuffle operator will treat
>>> # the entire 50 epochs as one big dataset.
>>> repeat_and_shuffle = data.repeat(50)
>>> repeat_and_shuffle = repeat_and_shuffle.shuffle(10)
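
The note about the order of repeat and batch comes down to simple arithmetic. The following sketch counts batches for both orders; num_batches is a hypothetical helper written for this illustration and does not call MindSpore.

```python
import math


def num_batches(num_rows, batch_size, repeat_count, batch_first, drop_remainder=False):
    """Count batches for batch->repeat (batch_first=True) or repeat->batch."""
    def batches(rows):
        if drop_remainder:
            return rows // batch_size
        return math.ceil(rows / batch_size)

    if batch_first:
        # Every epoch is batched on its own, so a partial batch appears per epoch.
        return batches(num_rows) * repeat_count
    # Rows of consecutive epochs are merged before batching.
    return batches(num_rows * repeat_count)


print(num_batches(10, 4, 2, batch_first=True))
print(num_batches(10, 4, 2, batch_first=False))
```

With 10 rows, a batch size of 4 and 2 repeats, batching first gives 6 batches, while repeating first gives 5 batches, some of which mix rows from two epochs.
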
reset()

Reset the dataset for next epoch.

save(file_name, num_files=1, file_type='mindrecord')

Save the dynamic data processed by the dataset pipeline in common dataset format. Supported dataset formats: ‘mindrecord’ only.

Implicit type casting exists when saving data as ‘mindrecord’. The table below shows how to do type casting.

Implicit Type Casting when Saving as ‘mindrecord’

Type in ‘dataset’    Type in ‘mindrecord’    Details
bool                 None                    Not supported
int8                 int32
uint8                bytes(1D uint8)         Drop dimension
int16                int32
uint16               int32
int32                int32
uint32               int64
int64                int64
uint64               None                    Not supported
float16              float32
float32              float32
float64              float64
string               string                  Multi-dimensional string not supported

Note

  1. To save the samples in order, set dataset’s shuffle to False and num_files to 1.

  2. Before calling the function, do not use batch operator, repeat operator or data augmentation operators with random attribute in map operator.

  3. Mindrecord does not support DE_UINT64, multi-dimensional DE_UINT8(drop dimension) nor multi-dimensional DE_STRING.

Parameters
  • file_name (str) – Path to dataset file.

  • num_files (int, optional) – Number of dataset files (default=1).

  • file_type (str, optional) – Dataset format (default=’mindrecord’).
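
Before calling save, it can help to check which columns have no ‘mindrecord’ counterpart. The sketch below encodes the type casting table as a dictionary; MINDRECORD_CAST and unsupported_columns are hypothetical names for this illustration, and the column types are given as plain strings rather than MindSpore type objects.

```python
# Mapping taken from the implicit type casting table for 'mindrecord'.
# None marks a type that cannot be saved.
MINDRECORD_CAST = {
    "bool": None,
    "int8": "int32",
    "uint8": "bytes",
    "int16": "int32",
    "uint16": "int32",
    "int32": "int32",
    "uint32": "int64",
    "int64": "int64",
    "uint64": None,
    "float16": "float32",
    "float32": "float32",
    "float64": "float64",
    "string": "string",
}


def unsupported_columns(column_types):
    """Return the names of columns whose type has no 'mindrecord' counterpart.

    Types that are missing from the table are also reported as unsupported.
    """
    return [name for name, dtype in column_types.items()
            if MINDRECORD_CAST.get(dtype) is None]


columns = {"image": "uint8", "label": "uint64", "mask": "bool", "score": "float16"}
print(unsupported_columns(columns))
```

The check covers types only; the restrictions on multi-dimensional uint8 and string data from the table are not modeled here.
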

shuffle(buffer_size)

Randomly shuffles the rows of this dataset using the following algorithm:

  1. Make a shuffle buffer that contains the first buffer_size rows.

  2. Randomly select an element from the shuffle buffer to be the next row propagated to the child node.

  3. Get the next row (if any) from the parent node and put it in the shuffle buffer.

  4. Repeat steps 2 and 3 until there are no more rows left in the shuffle buffer.

A seed can be provided to be used on the first epoch. In every subsequent epoch, the seed is changed to a new, randomly generated value.

Parameters

buffer_size (int) – The size of the buffer (must be larger than 1) for shuffling. Setting buffer_size equal to the number of rows in the entire dataset will result in a global shuffle.

Returns

ShuffleDataset, dataset shuffled.

Raises

RuntimeError – If sync operators exist before shuffle.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> # Optionally set the seed for the first epoch
>>> ds.config.set_seed(58)
>>>
>>> # Create a shuffled dataset using a shuffle buffer of size 4
>>> data = data.shuffle(4)
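
The four steps of the shuffle algorithm can be written out in a few lines of plain Python. This is a minimal sketch of the buffer-based approach described for shuffle, not the MindSpore implementation; buffer_shuffle is a hypothetical helper, and Python's random module is used in place of the library's own random number generator.

```python
import random


def buffer_shuffle(rows, buffer_size, seed=None):
    """Yield rows in shuffled order using a fixed-size shuffle buffer."""
    if buffer_size < 2:
        raise ValueError("buffer_size must be larger than 1")
    rng = random.Random(seed)
    source = iter(rows)
    end = object()  # sentinel marking an exhausted source

    # Step 1: fill the buffer with the first buffer_size rows.
    buffer = []
    for row in source:
        buffer.append(row)
        if len(buffer) == buffer_size:
            break

    # Steps 2-4: emit a random buffered row, refill from the source,
    # and repeat until the buffer is empty.
    while buffer:
        yield buffer.pop(rng.randrange(len(buffer)))
        next_row = next(source, end)
        if next_row is not end:
            buffer.append(next_row)


shuffled = list(buffer_shuffle(range(10), 4, seed=58))
print(shuffled)
```

Because only buffer_size rows are candidates at any time, a row can move at most buffer_size - 1 positions toward the front. A buffer as large as the dataset removes that limit, which corresponds to the global shuffle mentioned in the parameter description.
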
skip(count)

Skip the first N elements of this dataset.

Parameters

count (int) – Number of elements in the dataset to be skipped.

Returns

SkipDataset, dataset skipped.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> # Create a dataset which skips first 3 elements from data
>>> data = data.skip(3)
split(sizes, randomize=True)

Split the dataset into smaller, non-overlapping datasets.

Parameters
  • sizes (Union[list[int], list[float]]) –

    If a list of integers [s1, s2, …, sn] is provided, the dataset will be split into n datasets of size s1, size s2, …, size sn respectively. If the sum of all sizes does not equal the original dataset size, an error will occur. If a list of floats [f1, f2, …, fn] is provided, all floats must be between 0 and 1 and must sum to 1, otherwise an error will occur. The dataset will be split into n Datasets of size round(f1*K), round(f2*K), …, round(fn*K) where K is the size of the original dataset. If after rounding:

    • Any size equals 0, an error will occur.

    • The sum of split sizes < K, the difference will be added to the first split.

    • The sum of split sizes > K, the difference will be removed from the first large enough split such that it will have at least 1 row after removing the difference.

  • randomize (bool, optional) – Determines whether or not to split the data randomly (default=True). If True, the data will be randomly split. Otherwise, each split will be created with consecutive rows from the dataset.

Note

  1. There is an optimized split function, which will be called automatically when the dataset that calls this function is a MappableDataset.

  2. Dataset should not be sharded if split is going to be called. Instead, create a DistributedSampler and specify a split to shard after splitting. If dataset is sharded after a split, it is strongly recommended to set the same seed in each instance of execution, otherwise each shard may not be part of the same split (see Examples).

  3. It is strongly recommended to not shuffle the dataset, but use randomize=True instead. Shuffling the dataset may not be deterministic, which means the data in each split will be different in each epoch. Furthermore, if sharding occurs after split, each shard may not be part of the same split.

Raises
  • RuntimeError – If get_dataset_size returns None or is not supported for this dataset.

  • RuntimeError – If sizes is list of integers and sum of all elements in sizes does not equal the dataset size.

  • RuntimeError – If sizes is list of float and there is a split with size 0 after calculations.

  • RuntimeError – If the dataset is sharded prior to calling split.

  • ValueError – If sizes is list of float and not all floats are between 0 and 1, or if the floats don’t sum to 1.

Returns

tuple(Dataset), a tuple of datasets that have been split.

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_dir = "/path/to/imagefolder_directory"
>>>
>>> # Since many datasets have shuffle on by default, set shuffle to False if split will be called!
>>> data = ds.ImageFolderDataset(dataset_dir, shuffle=False)
>>>
>>> # Set the seed, and tell split to use this seed when randomizing.
>>> # This is needed because sharding will be done later
>>> ds.config.set_seed(58)
>>> train, test = data.split([0.9, 0.1])
>>>
>>> # To shard the train dataset, use a DistributedSampler
>>> train_sampler = ds.DistributedSampler(10, 2)
>>> train.use_sampler(train_sampler)
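
The rounding rules for a list of floats can be checked with a short calculation. The sketch below follows the rules stated for the sizes parameter; absolute_split_sizes is a hypothetical helper, and it assumes Python's built-in round, whose handling of exact halves (round half to even) may differ from the library's.

```python
import math


def absolute_split_sizes(fractions, dataset_size):
    """Convert fractional split sizes into row counts that sum to dataset_size."""
    if not all(0 <= f <= 1 for f in fractions) or not math.isclose(sum(fractions), 1.0):
        raise ValueError("fractions must be between 0 and 1 and sum to 1")

    sizes = [round(f * dataset_size) for f in fractions]
    if any(size == 0 for size in sizes):
        raise RuntimeError("a split has size 0 after rounding")

    difference = dataset_size - sum(sizes)
    if difference > 0:
        # Too few rows assigned: the first split receives the extra rows.
        sizes[0] += difference
    elif difference < 0:
        # Too many rows assigned: shrink the first split that keeps at least 1 row.
        for i, size in enumerate(sizes):
            if size + difference >= 1:
                sizes[i] = size + difference
                break
    return sizes


print(absolute_split_sizes([0.9, 0.1], 100))
print(absolute_split_sizes([0.5, 0.5], 5))
print(absolute_split_sizes([0.375, 0.375, 0.25], 4))
```

The second call shows the case where the rounded sizes sum to less than K, and the third shows the case where they sum to more than K.
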
sync_update(condition_name, num_batch=None, data=None)

Release a blocking condition and trigger callback with given data.

Parameters
  • condition_name (str) – The condition name that is used to toggle sending next row.

  • num_batch (Union[int, None]) – The number of batches (rows) that are released. When num_batch is None, it will default to the number specified by the sync_wait operator (default=None).

  • data (Union[dict, None]) – The data passed to the callback (default=None).

sync_wait(condition_name, num_batch=1, callback=None)

Add a blocking condition to the input Dataset.

Parameters
  • num_batch (int) – the number of batches without blocking at the start of each epoch.

  • condition_name (str) – The condition name that is used to toggle sending next row.

  • callback (function) – The callback function that will be invoked when sync_update is called.

Raises

RuntimeError – If condition name already exists.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> data = data.sync_wait("callback1")
>>> data = data.batch(batch_size)
>>> for batch_data in data.create_dict_iterator():
>>>     data.sync_update("callback1")
take(count=-1)

Takes at most given numbers of elements from the dataset.

Note

  1. If count is greater than the number of elements in the dataset or equal to -1, all the elements in dataset will be taken.

  2. The order of using take and batch matters. If take is before batch operation, then take given number of rows; otherwise take given number of batches.

Parameters

count (int, optional) – Number of elements to be taken from the dataset (default=-1).

Returns

TakeDataset, dataset taken.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> # Create a dataset where the dataset includes 50 elements.
>>> data = data.take(50)
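
The second note, on the order of take and batch, can be seen with plain lists. In this sketch, batch and take are hypothetical stand-ins that operate on Python lists rather than on Dataset objects.

```python
def batch(rows, batch_size):
    """Group consecutive rows into lists of at most batch_size rows."""
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]


def take(elements, count=-1):
    """Keep at most count elements; -1 keeps everything."""
    elements = list(elements)
    return elements if count == -1 else elements[:count]


rows = list(range(10))
take_then_batch = batch(take(rows, 5), 2)   # takes 5 rows, then batches them
batch_then_take = take(batch(rows, 2), 2)   # batches first, then takes 2 batches
print(take_then_batch)
print(batch_then_take)
```

In the first case the count refers to rows, in the second to whole batches.
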
to_device(send_epoch_end=True)

Transfer data through CPU, GPU or Ascend devices.

Parameters

send_epoch_end (bool, optional) – Whether to send end of sequence to device or not (default=True).

Note

If device is Ascend, features of data will be transferred one by one. The limitation of data transmission per time is 256M.

Returns

TransferDataset, dataset for transferring.

Raises
  • TypeError – If device_type is empty.

  • ValueError – If device_type is not ‘Ascend’, ‘GPU’ or ‘CPU’.

  • RuntimeError – If dataset is unknown.

  • RuntimeError – If distribution file path is given but failed to read.

use_sampler(new_sampler)

Make the current dataset use the provided new_sampler.

Parameters

new_sampler (Sampler) – The sampler to use for the current dataset.

Returns

Dataset, that uses new_sampler.

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_dir = "/path/to/imagefolder_directory"
>>> # Note: A SequentialSampler is created by default
>>> data = ds.ImageFolderDataset(dataset_dir)
>>>
>>> # Use a DistributedSampler instead of the SequentialSampler
>>> new_sampler = ds.DistributedSampler(10, 2)
>>> data.use_sampler(new_sampler)
zip(datasets)

Zip the datasets in the input tuple of datasets. Columns in the input datasets must not have the same name.

Parameters

datasets (Union[tuple, class Dataset]) – A tuple of datasets or a single class Dataset to be zipped together with this dataset.

Returns

ZipDataset, dataset zipped.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # ds1 and ds2 are instances of Dataset object
>>> # Create a dataset which is the combination of ds1 and ds2
>>> data = ds1.zip(ds2)
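
The requirement that zipped datasets must not share column names can be illustrated with rows represented as dictionaries. zip_rows is a hypothetical helper for this sketch; it relies on Python's built-in zip, so it stops at the shorter input, which is a property of the sketch and not a statement about MindSpore's behavior.

```python
def zip_rows(rows_a, rows_b):
    """Merge two sequences of dict rows column-wise, rejecting shared column names."""
    merged = []
    for row_a, row_b in zip(rows_a, rows_b):
        shared = set(row_a) & set(row_b)
        if shared:
            raise ValueError(f"duplicate column names: {sorted(shared)}")
        merged.append({**row_a, **row_b})
    return merged


images = [{"image": "img0"}, {"image": "img1"}]
labels = [{"label": 0}, {"label": 1}]
zipped = zip_rows(images, labels)
print(zipped)
```

When two datasets do share a column name, one of them can be renamed first, for example with rename.
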
class mindspore.dataset.CocoDataset(dataset_dir, annotation_file, task='Detection', num_samples=None, num_parallel_workers=None, shuffle=None, decode=False, sampler=None, num_shards=None, shard_id=None)[source]

A source dataset for reading and parsing COCO dataset.

CocoDataset supports four kinds of tasks: 2017 Train/Val/Test Detection, Keypoints, Stuff, Panoptic.

The generated dataset has multiple columns:

  • task=’Detection’, column: [[‘image’, dtype=uint8], [‘bbox’, dtype=float32], [‘category_id’, dtype=uint32], [‘iscrowd’, dtype=uint32]].

  • task=’Stuff’, column: [[‘image’, dtype=uint8], [‘segmentation’,dtype=float32], [‘iscrowd’,dtype=uint32]].

  • task=’Keypoint’, column: [[‘image’, dtype=uint8], [‘keypoints’, dtype=float32], [‘num_keypoints’, dtype=uint32]].

  • task=’Panoptic’, column: [[‘image’, dtype=uint8], [‘bbox’, dtype=float32], [‘category_id’, dtype=uint32], [‘iscrowd’, dtype=uint32], [‘area’, dtype=uint32]].

This dataset can take in a sampler. ‘sampler’ and ‘shuffle’ are mutually exclusive. CocoDataset doesn’t support PKSampler. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using ‘sampler’ and ‘shuffle’

Parameter ‘sampler’    Parameter ‘shuffle’    Expected Order Behavior
None                   None                   random order
None                   True                   random order
None                   False                  sequential order
Sampler object         None                   order defined by sampler
Sampler object         True                   not allowed
Sampler object         False                  not allowed

Citation of Coco dataset.

@article{DBLP:journals/corr/LinMBHPRDZ14,
author        = {Tsung{-}Yi Lin and Michael Maire and Serge J. Belongie and
                 Lubomir D. Bourdev and  Ross B. Girshick and James Hays and
                 Pietro Perona and Deva Ramanan and Piotr Doll{\'{a}}r and C. Lawrence Zitnick},
title         = {Microsoft {COCO:} Common Objects in Context},
journal       = {CoRR},
volume        = {abs/1405.0312},
year          = {2014},
url           = {http://arxiv.org/abs/1405.0312},
archivePrefix = {arXiv},
eprint        = {1405.0312},
timestamp     = {Mon, 13 Aug 2018 16:48:13 +0200},
biburl        = {https://dblp.org/rec/journals/corr/LinMBHPRDZ14.bib},
bibsource     = {dblp computer science bibliography, https://dblp.org},
description   = {COCO is a large-scale object detection, segmentation, and captioning dataset.
                 It contains 91 common object categories with 82 of them having more than 5,000
                 labeled instances. In contrast to the popular ImageNet dataset, COCO has fewer
                 categories but more instances per category.}
}
Parameters
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • annotation_file (str) – Path to the annotation JSON.

  • task (str) – Set the task type for reading COCO data. Supported task types: ‘Detection’, ‘Stuff’, ‘Panoptic’ and ‘Keypoint’ (default=’Detection’).

  • num_samples (int, optional) – The number of images to be included in the dataset (default=None, all images).

  • num_parallel_workers (int, optional) – Number of workers to read the data (default=None, number set in the configuration file).

  • shuffle (bool, optional) – Whether to perform shuffle on the dataset (default=None, expected order behavior shown in the table).

  • decode (bool, optional) – Decode the images after reading (default=False).

  • sampler (Sampler, optional) – Object used to choose samples from the dataset (default=None, expected order behavior shown in the table).

  • num_shards (int, optional) – Number of shards that the dataset will be divided into (default=None).

  • shard_id (int, optional) – The shard ID within num_shards (default=None). This argument can only be specified when num_shards is also specified.

Raises
  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and sharding are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • RuntimeError – If parsing the JSON file fails.

  • ValueError – If task is not in [‘Detection’, ‘Stuff’, ‘Panoptic’, ‘Keypoint’].

  • ValueError – If annotation_file does not exist.

  • ValueError – If dataset_dir does not exist.

  • ValueError – If shard_id is invalid (< 0 or >= num_shards).

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_dir = "/path/to/coco_dataset_directory/image_folder"
>>> annotation_file = "/path/to/coco_dataset_directory/annotation_folder/annotation.json"
>>>
>>> # 1) Read COCO data for Detection task
>>> coco_dataset = ds.CocoDataset(dataset_dir, annotation_file=annotation_file, task='Detection')
>>>
>>> # 2) Read COCO data for Stuff task
>>> coco_dataset = ds.CocoDataset(dataset_dir, annotation_file=annotation_file, task='Stuff')
>>>
>>> # 3) Read COCO data for Panoptic task
>>> coco_dataset = ds.CocoDataset(dataset_dir, annotation_file=annotation_file, task='Panoptic')
>>>
>>> # 4) Read COCO data for Keypoint task
>>> coco_dataset = ds.CocoDataset(dataset_dir, annotation_file=annotation_file, task='Keypoint')
>>>
>>> # In COCO dataset, each dictionary has keys "image" and "annotation"
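
The ‘sampler’ and ‘shuffle’ table for this class can be read as a small decision function. The sketch below restates that table in Python; expected_order is a hypothetical helper that only returns the description of the resulting order and does not construct a dataset.

```python
def expected_order(sampler=None, shuffle=None):
    """Return the order behavior for a combination of 'sampler' and 'shuffle'."""
    if sampler is not None:
        if shuffle is not None:
            # 'sampler' and 'shuffle' are mutually exclusive.
            raise RuntimeError("sampler and shuffle cannot be specified at the same time")
        return "order defined by sampler"
    if shuffle is False:
        return "sequential order"
    # shuffle is None or True
    return "random order"


print(expected_order())
print(expected_order(shuffle=False))
print(expected_order(sampler=object()))
```
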
apply(apply_func)

Apply a function to this dataset.

Parameters

apply_func (function) – A function that must take one ‘Dataset’ as an argument and return a preprocessed ‘Dataset’.

Returns

Dataset, applied by the function.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>>
>>> # Declare an apply_func function which returns a Dataset object
>>> def apply_func(ds):
>>>     ds = ds.batch(2)
>>>     return ds
>>>
>>> # Use apply to call apply_func
>>> data = data.apply(apply_func)
Raises
  • TypeError – If apply_func is not a function.

  • TypeError – If apply_func doesn’t return a Dataset.

batch(batch_size, drop_remainder=False, num_parallel_workers=None, per_batch_map=None, input_columns=None, output_columns=None, column_order=None, pad_info=None)

Combine batch_size number of consecutive rows into batches.

For any child node, a batch is treated as a single row. For any column, all the elements within that column must have the same shape. If a per_batch_map callable is provided, it will be applied to the batches of tensors.

Note

The order of using repeat and batch reflects the number of batches and per_batch_map. It is recommended that the repeat operation be used after the batch operation.

Parameters
  • batch_size (int or function) – The number of rows each batch is created with. An int or callable which takes exactly 1 parameter, BatchInfo.

  • drop_remainder (bool, optional) – Determines whether or not to drop the last possibly incomplete batch (default=False). If True, and if there are fewer than batch_size rows available to make the last batch, then those rows will be dropped and not propagated to the child node.

  • num_parallel_workers (int, optional) – Number of workers to process the dataset in parallel (default=None).

  • per_batch_map (callable, optional) – Per batch map callable. A callable which takes (list[Tensor], list[Tensor], …, BatchInfo) as input parameters. Each list[Tensor] represents a batch of Tensors on a given column. The number of lists should match with number of entries in input_columns. The last parameter of the callable should always be a BatchInfo object.

  • input_columns (list[str], optional) – List of names of the input columns. The size of the list should match with signature of the per_batch_map callable.

  • output_columns (list[str], optional) – [Not currently implemented] List of names assigned to the columns outputted by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. (default=None, output columns will have the same name as the input columns, i.e., the columns will be replaced).

  • column_order (list[str], optional) – [Not currently implemented] List of all the desired columns to propagate to the child node. This list must be a subset of all the columns in the dataset after all operations are applied. The order of the columns in each row propagated to the child node follows the order they appear in this list. The parameter is mandatory if len(input_columns) != len(output_columns). (default=None, all columns will be propagated to the child node, the order of the columns will remain the same).

  • pad_info (dict, optional) – Whether to perform padding on selected columns. pad_info={“col1”:([224,224],0)} would pad column with name “col1” to a tensor of size [224,224] and fill the missing with 0.

Returns

BatchDataset, dataset batched.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>>
>>> # Create a dataset where every 100 rows is combined into a batch
>>> # and drops the last incomplete batch if there is one.
>>> data = data.batch(100, True)
bucket_batch_by_length(column_names, bucket_boundaries, bucket_batch_sizes, element_length_function=None, pad_info=None, pad_to_bucket_boundary=False, drop_remainder=False)

Bucket elements according to their lengths. Each bucket will be padded and batched when they are full.

A length function is called on each row in the dataset. The row is then bucketed based on its length and bucket_boundaries. When a bucket reaches its corresponding size specified in bucket_batch_sizes, the entire bucket will be padded according to pad_info, and then batched. Each batch will be full, except possibly the last batch for each bucket.

Parameters
  • column_names (list[str]) – Columns passed to element_length_function.

  • bucket_boundaries (list[int]) – A list consisting of the upper boundaries of the buckets. Must be strictly increasing. If there are n boundaries, n+1 buckets are created: one bucket for [0, bucket_boundaries[0]), one bucket for [bucket_boundaries[i], bucket_boundaries[i+1]) for each 0<=i<n-1, and one bucket for [bucket_boundaries[n-1], inf).

  • bucket_batch_sizes (list[int]) – A list consisting of the batch sizes for each bucket. Must contain len(bucket_boundaries)+1 elements.

  • element_length_function (Callable, optional) – A function that takes in len(column_names) arguments and returns an int. If no value is provided, then len(column_names) must be 1, and the size of the first dimension of that column will be taken as the length (default=None).

  • pad_info (dict, optional) – Represents how to batch each column. The key corresponds to the column name, and the value must be a tuple of 2 elements. The first element corresponds to the shape to pad to, and the second element corresponds to the value to pad with. If a column is not specified, then that column will be padded to the longest in the current batch, and 0 will be used as the padding value. Any None dimensions will be padded to the longest in the current batch, unless if pad_to_bucket_boundary is True. If no padding is wanted, set pad_info to None (default=None).

  • pad_to_bucket_boundary (bool, optional) – If True, will pad each None dimension in pad_info to the bucket_boundary minus 1. If there are any elements that fall into the last bucket, an error will occur (default=False).

  • drop_remainder (bool, optional) – If True, will drop the last batch for each bucket if it is not a full batch (default=False).

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>>
>>> # Create a dataset where rows are grouped into buckets by length, and each
>>> # bucket is padded and batched once it reaches its batch size.
>>> column_names = ["col1", "col2"]
>>> bucket_boundaries = [5, 10]
>>> bucket_batch_sizes = [5, 1, 1]
>>> element_length_function = (lambda col1, col2: max(len(col1), len(col2)))
>>>
>>> # Will pad col1 to shape [2, bucket_boundaries[i] - 1] where i is the
>>> # index of the bucket that is currently being batched.
>>> # Will pad col2 to a shape where each dimension is the longest in all
>>> # the elements currently being batched.
>>> pad_info = {"col1": ([2, None], -1)}
>>> pad_to_bucket_boundary = True
>>>
>>> data = data.bucket_batch_by_length(column_names, bucket_boundaries,
>>>                                    bucket_batch_sizes,
>>>                                    element_length_function, pad_info,
>>>                                    pad_to_bucket_boundary)
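
The bucketing rule described for bucket_boundaries can be sketched in plain Python. This is only an illustration of the documented intervals, not MindSpore's implementation; the helper name bucket_index is hypothetical.

```python
from bisect import bisect_right


def bucket_index(length, bucket_boundaries):
    """Return the index of the bucket that a row of the given length falls into.

    Hypothetical helper: bucket 0 covers [0, b[0]), bucket i covers
    [b[i-1], b[i]), and the last bucket covers [b[n-1], inf).
    """
    return bisect_right(bucket_boundaries, length)


bucket_boundaries = [5, 10]
bucket_batch_sizes = [5, 1, 1]  # one batch size per bucket: n + 1 entries

lengths = [0, 4, 5, 9, 10, 37]
indices = [bucket_index(n, bucket_boundaries) for n in lengths]
print(indices)  # [0, 0, 1, 1, 2, 2]
```

A length equal to a boundary lands in the bucket that starts at that boundary, because every interval is closed on the left and open on the right.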
concat(datasets)

Concatenate the datasets in the input list of datasets. The “+” operator is also supported to concatenate.

Note

The column names, and the rank and type of the column data, must be the same in the input datasets.

Parameters

datasets (Union[list, class Dataset]) – A list of datasets or a single class Dataset to be concatenated together with this dataset.

Returns

ConcatDataset, dataset concatenated.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # ds1 and ds2 are instances of Dataset object
>>>
>>> # Create a dataset by concatenating ds1 and ds2 with "+" operator
>>> data1 = ds1 + ds2
>>> # Create a dataset by concatenating ds1 and ds2 with concat operation
>>> data1 = ds1.concat(ds2)
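
As a rough mental model of the note above, concatenation can be pictured on plain lists of rows. The sketch below is an assumption-laden illustration (rows as dicts, a hypothetical concat_rows helper), not the library's code, and it checks column names only.

```python
def concat_rows(first, second):
    """Append the rows of second to the rows of first.

    Rows are modelled as dicts (column name -> value). Only the column
    names are checked here; rank and type checks are left out of the sketch.
    """
    if first and second and set(first[0]) != set(second[0]):
        raise ValueError("datasets must have the same column names")
    return first + second


ds1_rows = [{"col1": 0}, {"col1": 1}]
ds2_rows = [{"col1": 2}]
combined = concat_rows(ds1_rows, ds2_rows)
print(len(combined))  # 3
```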
create_dict_iterator(num_epochs=-1, output_numpy=False)

Create an iterator over the dataset. The data retrieved will be a dictionary.

The order of the columns in the dictionary may not be the same as the original order.

Parameters
  • num_epochs (int, optional) – Maximum number of epochs that the iterator can be iterated (default=-1, the iterator can be iterated for an infinite number of epochs).

  • output_numpy (bool, optional) – Whether or not to output NumPy datatype, if output_numpy=False, iterator will output MSTensor (default=False).

Returns

Iterator, dictionary of column name-ndarray pair.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>>
>>> # create an iterator
>>> # The order of the columns in the data obtained by the iterator might be changed.
>>> iterator = data.create_dict_iterator()
>>> for item in iterator:
>>>     # print the data in column1
>>>     print(item["column1"])
create_tuple_iterator(columns=None, num_epochs=-1, output_numpy=False)

Create an iterator over the dataset. The data retrieved will be a list of ndarrays of data.

To specify which columns to list and the order needed, use columns. If columns is not provided, the order of the columns will not be changed.

Parameters
  • columns (list[str], optional) – List of columns to be used to specify the order of columns (default=None, means all columns).

  • num_epochs (int, optional) – Maximum number of epochs that the iterator can be iterated (default=-1, the iterator can be iterated for an infinite number of epochs).

  • output_numpy (bool, optional) – Whether or not to output NumPy datatype. If output_numpy=False, iterator will output MSTensor (default=False).

Returns

Iterator, list of ndarrays.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>>
>>> # Create an iterator
>>> # The columns in the data obtained by the iterator will not be changed.
>>> iterator = data.create_tuple_iterator()
>>> for item in iterator:
>>>     # convert the returned tuple to a list and print
>>>     print(list(item))
device_que(prefetch_size=None, send_epoch_end=True)

Return a transferred Dataset that transfers data through a device.

Parameters
  • prefetch_size (int, optional) – Prefetch number of records ahead of the user’s request (default=None).

  • send_epoch_end (bool, optional) – Whether to send end of sequence to device or not (default=True).

Note

If the device is Ascend, features of data will be transferred one by one. Each transfer is limited to 256M.

Returns

TransferDataset, dataset for transferring.

filter(predicate, input_columns=None, num_parallel_workers=1)

Filter dataset by predicate.

Note

If input_columns is not provided or is empty, all columns will be used.

Parameters
  • predicate (callable) – Python callable which returns a boolean value. Rows for which it returns False are filtered out.

  • input_columns (list[str], optional) – List of names of the input columns (default=None, the predicate will be applied to all columns in the dataset).

  • num_parallel_workers (int, optional) – Number of workers to process the dataset in parallel (default=1).

Returns

FilterDataset, dataset filtered.

Examples

>>> import mindspore.dataset as ds
>>> # dataset holds generator data (0 ~ 63) in column "data"
>>> # filter out the data that is greater than or equal to 11
>>> dataset_f = dataset.filter(predicate=lambda data: data < 11, input_columns=["data"])
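
The predicate semantics can be checked without a dataset at hand. The following is a minimal sketch on plain Python dicts; filter_rows is a hypothetical helper and does not reflect how FilterDataset is implemented.

```python
def filter_rows(rows, predicate, input_columns=None):
    """Keep only the rows for which predicate returns True.

    rows is a list of dicts (column name -> value). If input_columns is
    None or empty, every column is passed to the predicate, in row order.
    """
    kept = []
    for row in rows:
        names = input_columns or list(row)
        if predicate(*(row[name] for name in names)):
            kept.append(row)
    return kept


rows = [{"data": i} for i in range(64)]
filtered = filter_rows(rows, lambda data: data < 11, input_columns=["data"])
print(len(filtered))  # 11 rows remain: 0 through 10
```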
flat_map(func)

Map func to each row in dataset and flatten the result.

The specified func is a function that must take one ‘Ndarray’ as input and return a ‘Dataset’.

Parameters

func (function) – A function that must take one ‘Ndarray’ as an argument and return a ‘Dataset’.

Returns

Dataset, applied by the function.

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>>
>>> # Declare a function which returns a Dataset object
>>> def flat_map_func(x):
>>>     data_dir = text.to_str(x[0])
>>>     d = ds.ImageFolderDataset(data_dir)
>>>     return d
>>> # data is an instance of a Dataset object.
>>> data = ds.TextFileDataset(DATA_FILE)
>>> data = data.flat_map(flat_map_func)
Raises
  • TypeError – If func is not a function.

  • TypeError – If func doesn’t return a Dataset.
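
The "map, then flatten" behavior can be pictured with ordinary iterables. In this sketch each row is a plain integer and func returns a list standing in for a Dataset; both choices, and the helper names, are assumptions made for illustration.

```python
from itertools import chain


def flat_map(rows, func):
    """Apply func to each row and chain the resulting iterables into one stream."""
    return list(chain.from_iterable(func(row) for row in rows))


def expand(row):
    # Hypothetical func: each input row yields a small "dataset" of three rows.
    return [row * 10 + k for k in range(3)]


flattened = flat_map([1, 2], expand)
print(flattened)  # [10, 11, 12, 20, 21, 22]
```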

get_batch_size()

Get the size of a batch.

Returns

Number, the number of data in a batch.

get_class_indexing()[source]

Get the class index.

Returns

Dict, A str-to-int mapping from label name to index.

get_col_names()

Get the names of the columns in the dataset.

get_dataset_size()[source]

Get the number of batches in an epoch.

Returns

Number, number of batches.

get_repeat_count()

Get the repeat count of the RepeatDataset, or 1 if the dataset is not repeated.

Returns

Number, the count of repeat.

map(operations=None, input_columns=None, output_columns=None, column_order=None, num_parallel_workers=None, python_multiprocessing=False, cache=None, callbacks=None)

Apply each operation in operations to this dataset.

The order of operations is determined by the position of each operation in the operations parameter. operations[0] will be applied first, then operations[1], then operations[2], etc.

Each operation will be passed one or more columns from the dataset as input, and zero or more columns will be outputted. The first operation will be passed the columns specified in input_columns as input. If there is more than one operator in operations, the outputted columns of the previous operation are used as the input columns for the next operation. The columns outputted by the very last operation will be assigned names specified by output_columns.

Only the columns specified in column_order will be propagated to the child node. These columns will be in the same order as specified in column_order.

Parameters
  • operations (Union[list[TensorOp], list[functions]]) – List of operations to be applied on the dataset. Operations are applied in the order they appear in this list.

  • input_columns (list[str]) – List of the names of the columns that will be passed to the first operation as input. The size of this list must match the number of input columns expected by the first operator. (default=None, the first operation will be passed however many columns are required, starting from the first column).

  • output_columns (list[str], optional) – List of names assigned to the columns outputted by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. (default=None, output columns will have the same name as the input columns, i.e., the columns will be replaced).

  • column_order (list[str], optional) – List of all the desired columns to propagate to the child node. This list must be a subset of all the columns in the dataset after all operations are applied. The order of the columns in each row propagated to the child node follows the order they appear in this list. The parameter is mandatory if len(input_columns) != len(output_columns). (default=None, all columns will be propagated to the child node, the order of the columns will remain the same).

  • num_parallel_workers (int, optional) – Number of threads used to process the dataset in parallel (default=None, the value from the configuration will be used).

  • python_multiprocessing (bool, optional) – Parallelize Python operations with multiple worker processes. This option could be beneficial if the Python operation is computationally heavy (default=False).

  • cache (DatasetCache, optional) – Tensor cache to use. (default=None which means no cache is used). The cache feature is under development and is not recommended.

  • callbacks (DSCallback, list[DSCallback], optional) – List of Dataset callbacks to be called (default=None).

Returns

MapDataset, dataset after mapping operation.

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.vision.c_transforms as c_transforms
>>>
>>> # data is an instance of Dataset which has 2 columns, "image" and "label".
>>> # ds_pyfunc is an instance of Dataset which has 3 columns, "col0", "col1", and "col2".
>>> # Each column is a 2D array of integers.
>>>
>>> # Set the global configuration value for num_parallel_workers to be 2.
>>> # Operations which use this configuration value will use 2 worker threads,
>>> # unless otherwise specified in the operator's constructor.
>>> # set_num_parallel_workers can be called again later if a different
>>> # global configuration value for the number of worker threads is desired.
>>> ds.config.set_num_parallel_workers(2)
>>>
>>> # Define two operations, where each operation accepts 1 input column and outputs 1 column.
>>> decode_op = c_transforms.Decode(rgb_format=True)
>>> random_jitter_op = c_transforms.RandomColorAdjust((0.8, 0.8), (1, 1), (1, 1), (0, 0))
>>>
>>> # 1) Simple map example
>>>
>>> operations = [decode_op]
>>> input_columns = ["image"]
>>>
>>> # Apply decode_op on column "image". This column will be replaced by the outputted
>>> # column of decode_op. Since column_order is not provided, both columns "image"
>>> # and "label" will be propagated to the child node in their original order.
>>> ds_decoded = data.map(operations, input_columns)
>>>
>>> # Rename column "image" to "decoded_image".
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(operations, input_columns, output_columns)
>>>
>>> # Specify the order of the columns.
>>> column_order = ["label", "image"]
>>> ds_decoded = data.map(operations, input_columns, None, column_order)
>>>
>>> # Rename column "image" to "decoded_image" and also specify the order of the columns.
>>> column_order = ["label", "decoded_image"]
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(operations, input_columns, output_columns, column_order)
>>>
>>> # Rename column "image" to "decoded_image" and keep only this column.
>>> column_order = ["decoded_image"]
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(operations, input_columns, output_columns, column_order)
>>>
>>> # A simple example using pyfunc: Renaming columns and specifying column order
>>> # work in the same way as the previous examples.
>>> input_columns = ["col0"]
>>> operations = [(lambda x: x + 1)]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns)
>>>
>>> # 2) Map example with more than one operation
>>>
>>> # If this list of operations is used with map, decode_op will be applied
>>> # first, then random_jitter_op will be applied.
>>> operations = [decode_op, random_jitter_op]
>>>
>>> input_columns = ["image"]
>>>
>>> # Create a dataset where the images are decoded, then randomly color jittered.
>>> # decode_op takes column "image" as input and outputs one column. The column
>>> # outputted by decode_op is passed as input to random_jitter_op.
>>> # random_jitter_op will output one column. Column "image" will be replaced by
>>> # the column outputted by random_jitter_op (the very last operation). All other
>>> # columns are unchanged. Since column_order is not specified, the order of the
>>> # columns will remain the same.
>>> ds_mapped = data.map(operations, input_columns)
>>>
>>> # Create a dataset that is identical to ds_mapped, except the column "image"
>>> # that is outputted by random_jitter_op is renamed to "image_transformed".
>>> # Specifying column order works in the same way as examples in 1).
>>> output_columns = ["image_transformed"]
>>> ds_mapped_and_renamed = data.map(operations, input_columns, output_columns)
>>>
>>> # Multiple operations using pyfunc: Renaming columns and specifying column order
>>> # work in the same way as examples in 1).
>>> input_columns = ["col0"]
>>> operations = [(lambda x: x + x), (lambda x: x - 1)]
>>> output_columns = ["col0_mapped"]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns, output_columns)
>>>
>>> # 3) Example where number of input columns is not equal to number of output columns
>>>
>>> # operations[0] is a lambda that takes 2 columns as input and outputs 3 columns.
>>> # operations[1] is a lambda that takes 3 columns as input and outputs 1 column.
>>> # operations[2] is a lambda that takes 1 column as input and outputs 4 columns.
>>> #
>>> # Note: The number of output columns of operations[i] must equal the number of
>>> # input columns of operations[i+1]. Otherwise, this map call will result
>>> # in an error.
>>> operations = [(lambda x, y: (x, x + y, x + y + 1)),
>>>               (lambda x, y, z: x * y * z),
>>>               (lambda x: (x % 2, x % 3, x % 5, x % 7))]
>>>
>>> # Note: Since the number of input columns is not the same as the number of
>>> # output columns, the output_columns and column_order parameters must be
>>> # specified. Otherwise, this map call will also result in an error.
>>> input_columns = ["col2", "col0"]
>>> output_columns = ["mod2", "mod3", "mod5", "mod7"]
>>>
>>> # Propagate all columns to the child node in this order:
>>> column_order = ["col0", "col2", "mod2", "mod3", "mod5", "mod7", "col1"]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns, output_columns, column_order)
>>>
>>> # Propagate some columns to the child node in this order:
>>> column_order = ["mod7", "mod3", "col1"]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns, output_columns, column_order)
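
To see how the outputs of one operation feed the next, the chaining rule can be traced on plain numbers. This sketch covers only the chaining step and leaves column naming and ordering aside; chain_operations is a hypothetical helper, and scalar integers stand in for the 2D integer arrays of the example.

```python
def chain_operations(operations, *inputs):
    """Feed the outputs of each operation into the next one, in list order."""
    values = inputs
    for op in operations:
        result = op(*values)
        # Normalize a single return value to a 1-tuple so it can be unpacked again.
        values = result if isinstance(result, tuple) else (result,)
    return values


operations = [(lambda x, y: (x, x + y, x + y + 1)),
              (lambda x, y, z: x * y * z),
              (lambda x: (x % 2, x % 3, x % 5, x % 7))]

# Two input values (standing in for columns "col2" and "col0") become four outputs.
mod2, mod3, mod5, mod7 = chain_operations(operations, 2, 3)
print(mod2, mod3, mod5, mod7)  # 0 0 0 4
```

With inputs 2 and 3, the first operation returns (2, 5, 6), the second returns their product 60, and the third returns the four remainders of 60.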
num_classes()

Get the number of classes in a dataset.

Returns

Number, number of classes.

output_shapes()

Get the shapes of output data.

Returns

List, list of shapes of each column.

output_types()

Get the types of output data.

Returns

List of data types.

project(columns)

Project certain columns in input dataset.

The specified columns will be selected from the dataset and passed down the pipeline in the order specified. The other columns are discarded.

Parameters

columns (list[str]) – List of names of the columns to project.

Returns

ProjectDataset, dataset projected.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>> columns_to_project = ["column3", "column1", "column2"]
>>>
>>> # Create a dataset that consists of column3, column1, column2
>>> # in that order, regardless of the original order of columns.
>>> data = data.project(columns=columns_to_project)
rename(input_columns, output_columns)

Rename the columns in input datasets.

Parameters
  • input_columns (list[str]) – List of names of the input columns.

  • output_columns (list[str]) – List of names of the output columns.

Returns

RenameDataset, dataset renamed.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> input_columns = ["input_col1", "input_col2", "input_col3"]
>>> output_columns = ["output_col1", "output_col2", "output_col3"]
>>>
>>> # Create a dataset where input_col1 is renamed to output_col1, and
>>> # input_col2 is renamed to output_col2, and input_col3 is renamed
>>> # to output_col3.
>>> data = data.rename(input_columns=input_columns, output_columns=output_columns)
repeat(count=None)

Repeat this dataset count times. Repeat indefinitely if the count is None or -1.

Note

The order of using repeat and batch affects the number of batches. It is recommended that the repeat operation be used after the batch operation. If dataset_sink_mode is False, the repeat operation is invalid. If dataset_sink_mode is True, repeat count must be equal to the epoch of training. Otherwise, errors could occur since the amount of data is not the amount training requires.

Parameters

count (int) – Number of times the dataset is repeated (default=None).

Returns

RepeatDataset, dataset repeated.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>>
>>> # Create a dataset where the dataset is repeated for 50 epochs
>>> repeated = data.repeat(50)
>>>
>>> # Create a dataset where each epoch is shuffled individually
>>> shuffled_and_repeated = data.shuffle(10)
>>> shuffled_and_repeated = shuffled_and_repeated.repeat(50)
>>>
>>> # Create a dataset where the dataset is first repeated for
>>> # 50 epochs before shuffling. The shuffle operator will treat
>>> # the entire 50 epochs as one big dataset.
>>> repeat_and_shuffle = data.repeat(50)
>>> repeat_and_shuffle = repeat_and_shuffle.shuffle(10)
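
The effect of ordering repeat and batch on the number of batches is simple arithmetic. The sketch below assumes that incomplete batches are kept (drop_remainder=False) and that repeating before batching lets a batch span an epoch boundary; num_batches is a hypothetical helper.

```python
import math


def num_batches(num_rows, batch_size, repeat_count, batch_first):
    """Count the batches a pipeline produces, assuming drop_remainder=False."""
    if batch_first:
        # batch -> repeat: each epoch is batched on its own, then replayed.
        return math.ceil(num_rows / batch_size) * repeat_count
    # repeat -> batch: the repeated rows form one long stream that is batched once.
    return math.ceil(num_rows * repeat_count / batch_size)


batch_then_repeat = num_batches(10, 4, 2, batch_first=True)
repeat_then_batch = num_batches(10, 4, 2, batch_first=False)
print(batch_then_repeat, repeat_then_batch)  # 6 5
```

With 10 rows, a batch size of 4 and 2 repeats, batching first gives 3 batches per epoch (6 in total), while repeating first gives 5 batches over the 20-row stream.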
reset()

Reset the dataset for next epoch.

save(file_name, num_files=1, file_type='mindrecord')

Save the dynamic data processed by the dataset pipeline in common dataset format. Supported dataset formats: ‘mindrecord’ only.

Implicit type casting exists when saving data as ‘mindrecord’. The table below shows how to do type casting.

Implicit Type Casting when Saving as ‘mindrecord’

Type in ‘dataset’    Type in ‘mindrecord’    Details
bool                 None                    Not supported
int8                 int32
uint8                bytes(1D uint8)         Drop dimension
int16                int32
uint16               int32
int32                int32
uint32               int64
int64                int64
uint64               None                    Not supported
float16              float32
float32              float32
float64              float64
string               string                  Multi-dimensional string not supported

Note

  1. To save the samples in order, set dataset’s shuffle to False and num_files to 1.

  2. Before calling the function, do not use batch operator, repeat operator or data augmentation operators with random attribute in map operator.

  3. Mindrecord does not support DE_UINT64, multi-dimensional DE_UINT8(drop dimension) nor multi-dimensional DE_STRING.

Parameters
  • file_name (str) – Path to dataset file.

  • num_files (int, optional) – Number of dataset files (default=1).

  • file_type (str, optional) – Dataset format (default=’mindrecord’).
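
When planning a pipeline that ends in save, it can help to look up the target type ahead of time. The mapping below is transcribed from the implicit type casting table for save; the dictionary and the mindrecord_type helper are hypothetical conveniences, not part of the MindSpore API.

```python
# Dataset type -> type stored in 'mindrecord'; None marks an unsupported type.
MINDRECORD_CAST = {
    "bool": None,
    "int8": "int32",
    "uint8": "bytes",  # stored as 1D uint8 bytes, dimensions are dropped
    "int16": "int32",
    "uint16": "int32",
    "int32": "int32",
    "uint32": "int64",
    "int64": "int64",
    "uint64": None,
    "float16": "float32",
    "float32": "float32",
    "float64": "float64",
    "string": "string",  # multi-dimensional strings are not supported
}


def mindrecord_type(dataset_type):
    """Return the type a column is cast to, or raise if it cannot be saved."""
    target = MINDRECORD_CAST[dataset_type]
    if target is None:
        raise TypeError(f"{dataset_type} cannot be saved as 'mindrecord'")
    return target


print(mindrecord_type("uint32"))  # int64
```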

shuffle(buffer_size)

Randomly shuffles the rows of this dataset using the following algorithm:

  1. Make a shuffle buffer that contains the first buffer_size rows.

  2. Randomly select an element from the shuffle buffer to be the next row propagated to the child node.

  3. Get the next row (if any) from the parent node and put it in the shuffle buffer.

  4. Repeat steps 2 and 3 until there are no more rows left in the shuffle buffer.

A seed can be provided to be used on the first epoch. In every subsequent epoch, the seed is changed to a new, randomly generated value.

Parameters

buffer_size (int) – The size of the buffer (must be larger than 1) for shuffling. Setting buffer_size equal to the number of rows in the entire dataset will result in a global shuffle.

Returns

ShuffleDataset, dataset shuffled.

Raises

RuntimeError – If sync operators exist before shuffle.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> # Optionally set the seed for the first epoch
>>> ds.config.set_seed(58)
>>>
>>> # Create a shuffled dataset using a shuffle buffer of size 4
>>> data = data.shuffle(4)
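
The four-step algorithm described for shuffle can be sketched in plain Python. This is a minimal illustration that uses the standard random module, not the library's implementation; buffered_shuffle and its seed argument are hypothetical.

```python
import random
from itertools import islice

_EXHAUSTED = object()  # sentinel marking the end of the input


def buffered_shuffle(rows, buffer_size, seed=None):
    """Shuffle rows with a fixed-size buffer, following steps 1 to 4."""
    rng = random.Random(seed)
    source = iter(rows)
    # Step 1: fill the buffer with the first buffer_size rows.
    buffer = list(islice(source, buffer_size))
    shuffled = []
    while buffer:  # Step 4: loop until the buffer is empty.
        # Step 2: pick a random buffered row as the next output row.
        index = rng.randrange(len(buffer))
        shuffled.append(buffer[index])
        # Step 3: refill the freed slot from the input, if any rows remain.
        replacement = next(source, _EXHAUSTED)
        if replacement is _EXHAUSTED:
            buffer.pop(index)
        else:
            buffer[index] = replacement
    return shuffled


result = buffered_shuffle(range(10), buffer_size=4, seed=58)
print(sorted(result) == list(range(10)))  # True: every row appears exactly once
```

Because a row can only be emitted once it has entered the buffer, the output at position k always comes from the first k + buffer_size input rows. This is why a small buffer only shuffles locally, while a buffer as large as the dataset gives a global shuffle.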
skip(count)

Skip the first count elements of this dataset.

Parameters

count (int) – Number of elements in the dataset to be skipped.

Returns

SkipDataset, dataset skipped.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> # Create a dataset which skips first 3 elements from data
>>> data = data.skip(3)
split(sizes, randomize=True)

Split the dataset into smaller, non-overlapping datasets.

Parameters
  • sizes (Union[list[int], list[float]]) –

    If a list of integers [s1, s2, …, sn] is provided, the dataset will be split into n datasets of size s1, size s2, …, size sn respectively. If the sum of all sizes does not equal the original dataset size, an error will occur. If a list of floats [f1, f2, …, fn] is provided, all floats must be between 0 and 1 and must sum to 1, otherwise an error will occur. The dataset will be split into n Datasets of size round(f1*K), round(f2*K), …, round(fn*K) where K is the size of the original dataset. If after rounding:

    • Any size equals 0, an error will occur.

    • The sum of split sizes < K, the difference will be added to the first split.

    • The sum of split sizes > K, the difference will be removed from the first large enough split such that it will have at least 1 row after removing the difference.

  • randomize (bool, optional) – Determines whether or not to split the data randomly (default=True). If True, the data will be randomly split. Otherwise, each split will be created with consecutive rows from the dataset.

Note

  1. There is an optimized split function, which will be called automatically when the dataset that calls this function is a MappableDataset.

  2. Dataset should not be sharded if split is going to be called. Instead, create a DistributedSampler and specify a split to shard after splitting. If dataset is sharded after a split, it is strongly recommended to set the same seed in each instance of execution, otherwise each shard may not be part of the same split (see Examples).

  3. It is strongly recommended to not shuffle the dataset, but use randomize=True instead. Shuffling the dataset may not be deterministic, which means the data in each split will be different in each epoch. Furthermore, if sharding occurs after split, each shard may not be part of the same split.

Raises
  • RuntimeError – If get_dataset_size returns None or is not supported for this dataset.

  • RuntimeError – If sizes is list of integers and sum of all elements in sizes does not equal the dataset size.

  • RuntimeError – If sizes is list of float and there is a split with size 0 after calculations.

  • RuntimeError – If the dataset is sharded prior to calling split.

  • ValueError – If sizes is list of float and not all floats are between 0 and 1, or if the floats don’t sum to 1.

Returns

tuple(Dataset), a tuple of datasets that have been split.

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_dir = "/path/to/imagefolder_directory"
>>>
>>> # Since many datasets have shuffle on by default, set shuffle to False if split will be called!
>>> data = ds.ImageFolderDataset(dataset_dir, shuffle=False)
>>>
>>> # Set the seed, and tell split to use this seed when randomizing.
>>> # This is needed because sharding will be done later
>>> ds.config.set_seed(58)
>>> train, test = data.split([0.9, 0.1])
>>>
>>> # To shard the train dataset, use a DistributedSampler
>>> train_sampler = ds.DistributedSampler(10, 2)
>>> train.use_sampler(train_sampler)
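
The rounding rules for a list of floats can be worked through by hand or with a few lines of Python. The sketch below follows the three rules stated for sizes; absolute_split_sizes is a hypothetical helper, and it assumes Python's built-in round, which rounds halves to the nearest even integer.

```python
def absolute_split_sizes(fractions, dataset_size):
    """Turn fractional split sizes into row counts that sum to dataset_size."""
    sizes = []
    for fraction in fractions:
        size = round(fraction * dataset_size)
        if size == 0:
            raise RuntimeError("a split has size 0 after rounding")
        sizes.append(size)

    difference = dataset_size - sum(sizes)
    if difference > 0:
        # Too few rows assigned: give the missing rows to the first split.
        sizes[0] += difference
    elif difference < 0:
        # Too many rows assigned: shrink the first split that keeps at least 1 row.
        for i, size in enumerate(sizes):
            if size + difference > 0:
                sizes[i] += difference
                break
    return sizes


print(absolute_split_sizes([0.9, 0.1], 10))  # [9, 1]
print(absolute_split_sizes([0.5, 0.5], 7))   # [3, 4]
print(absolute_split_sizes([0.5, 0.5], 5))   # [3, 2]
```

In the second call both halves round to 4, so one row is removed from the first split. In the third call both halves round to 2, so the missing row is added to the first split.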
sync_update(condition_name, num_batch=None, data=None)

Release a blocking condition and trigger callback with given data.

Parameters
  • condition_name (str) – The condition name that is used to toggle sending next row.

  • num_batch (Union[int, None]) – The number of batches (rows) that are released. When num_batch is None, it will default to the number specified by the sync_wait operator (default=None).

  • data (Union[dict, None]) – The data passed to the callback (default=None).

sync_wait(condition_name, num_batch=1, callback=None)

Add a blocking condition to the input Dataset.

Parameters
  • condition_name (str) – The condition name that is used to toggle sending next row.

  • num_batch (int) – The number of batches without blocking at the start of each epoch (default=1).

  • callback (function) – The callback function that will be invoked when sync_update is called (default=None).

Raises

RuntimeError – If condition name already exists.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> data = data.sync_wait("callback1")
>>> data = data.batch(batch_size)
>>> for batch_data in data.create_dict_iterator():
>>>     data.sync_update("callback1")
take(count=-1)

Take at most the given number of elements from the dataset.

Note

  1. If count is greater than the number of elements in the dataset or equal to -1, all the elements in dataset will be taken.

  2. The order of using take and batch matters. If take is before batch operation, then take given number of rows; otherwise take given number of batches.

Parameters

count (int, optional) – Number of elements to be taken from the dataset (default=-1).

Returns

TakeDataset, dataset taken.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> # Create a dataset where the dataset includes 50 elements.
>>> data = data.take(50)
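
The second note, on the order of take and batch, can be illustrated with plain lists. The batch and take helpers below are hypothetical stand-ins that only mimic the counting behavior described in the note.

```python
def batch(rows, batch_size):
    """Group consecutive rows into lists of batch_size (last one may be shorter)."""
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]


def take(items, count=-1):
    """Return at most count items; -1 or an oversized count returns everything."""
    return list(items) if count == -1 else list(items[:count])


rows = list(range(10))
take_then_batch = batch(take(rows, 4), 2)  # 4 rows -> 2 batches of 2 rows
batch_then_take = take(batch(rows, 2), 4)  # 5 batches -> the first 4 batches
print(len(take_then_batch), len(batch_then_take))  # 2 4
```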
to_device(send_epoch_end=True)

Transfer data through CPU, GPU or Ascend devices.

Parameters

send_epoch_end (bool, optional) – Whether to send end of sequence to device or not (default=True).

Note

If the device is Ascend, features of data will be transferred one by one. Each transfer is limited to 256M.

Returns

TransferDataset, dataset for transferring.

Raises
  • TypeError – If device_type is empty.

  • ValueError – If device_type is not ‘Ascend’, ‘GPU’ or ‘CPU’.

  • RuntimeError – If dataset is unknown.

  • RuntimeError – If distribution file path is given but failed to read.

use_sampler(new_sampler)

Will make the current dataset use the new_sampler provided.

Parameters

new_sampler (Sampler) – The sampler to use for the current dataset.

Returns

Dataset, that uses new_sampler.

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_dir = "/path/to/imagefolder_directory"
>>> # Note: A SequentialSampler is created by default
>>> data = ds.ImageFolderDataset(dataset_dir)
>>>
>>> # Use a DistributedSampler instead of the SequentialSampler
>>> new_sampler = ds.DistributedSampler(10, 2)
>>> data.use_sampler(new_sampler)
zip(datasets)

Zip the datasets in the input tuple of datasets. Columns in the input datasets must not have the same name.

Parameters

datasets (Union[tuple, class Dataset]) – A tuple of datasets or a single class Dataset to be zipped together with this dataset.

Returns

ZipDataset, dataset zipped.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # ds1 and ds2 are instances of Dataset object
>>> # Create a dataset which is the combination of ds1 and ds2
>>> data = ds1.zip(ds2)
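
Zipping can be pictured as merging rows column-wise. In this sketch rows are dicts and both inputs have the same number of rows; the zip_rows helper is hypothetical, and how the library handles inputs of different lengths is not covered here.

```python
def zip_rows(left, right):
    """Merge rows pairwise; column names must not overlap."""
    merged = []
    for a, b in zip(left, right):
        overlap = set(a) & set(b)
        if overlap:
            raise ValueError(f"duplicate column names: {sorted(overlap)}")
        merged.append({**a, **b})
    return merged


ds1_rows = [{"image": "img0"}, {"image": "img1"}]
ds2_rows = [{"label": 0}, {"label": 1}]
zipped = zip_rows(ds1_rows, ds2_rows)
print(zipped[0])  # {'image': 'img0', 'label': 0}
```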
class mindspore.dataset.DistributedSampler(num_shards, shard_id, shuffle=True, num_samples=None, offset=-1)[source]

A sampler that accesses a shard of the dataset.

Parameters
  • num_shards (int) – Number of shards to divide the dataset into.

  • shard_id (int) – Shard ID of the current shard within num_shards.

  • shuffle (bool, optional) – If True, the indices are shuffled (default=True).

  • num_samples (int, optional) – The number of samples to draw (default=None, all elements).

  • offset (int, optional) – The starting sample ID where access to elements in the dataset begins (default=-1).

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_dir = "path/to/imagefolder_directory"
>>>
>>> # creates a distributed sampler with 10 shards in total. This shard is shard 5.
>>> sampler = ds.DistributedSampler(10, 5)
>>> data = ds.ImageFolderDataset(dataset_dir, num_parallel_workers=8, sampler=sampler)
Raises
  • ValueError – If num_shards is not positive.

  • ValueError – If shard_id is smaller than 0, or greater than or equal to num_shards.

  • ValueError – If shuffle is not a boolean value.

get_num_samples()

All samplers can contain a numeric num_samples value (or it can be set to None). A child sampler can exist or be None. If a child sampler exists, then the child sampler count can be a numeric value or None. These conditions impact the resultant sampler count that is used. The following table shows the possible results from calling this function.

child sampler    num_samples    child_samples    result
T                x              y                min(x, y)
T                x              None             x
T                None           y                y
T                None           None             None
None             x              n/a              x
None             None           n/a              None

Returns

int, The number of samples, or None
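
The six cases in the table reduce to a short function. This is a sketch of the tabulated outcomes only; resolve_num_samples and its arguments are hypothetical names, not the sampler's internal code.

```python
def resolve_num_samples(num_samples, has_child=False, child_samples=None):
    """Combine a sampler's num_samples with its child's count, per the table."""
    if not has_child or child_samples is None:
        # No child, or the child has no count: the sampler's own value decides.
        return num_samples
    if num_samples is None:
        return child_samples
    return min(num_samples, child_samples)


print(resolve_num_samples(5, has_child=True, child_samples=3))     # 3
print(resolve_num_samples(None, has_child=True, child_samples=3))  # 3
print(resolve_num_samples(5))                                      # 5
```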

class mindspore.dataset.GeneratorDataset(source, column_names=None, column_types=None, schema=None, num_samples=None, num_parallel_workers=1, shuffle=None, sampler=None, num_shards=None, shard_id=None, python_multiprocessing=True)[source]

A source dataset that generates data from Python by invoking Python data source each epoch.

This dataset can take in a sampler. ‘sampler’ and ‘shuffle’ are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using ‘sampler’ and ‘shuffle’

Parameter ‘sampler’    Parameter ‘shuffle’    Expected Order Behavior
None                   None                   random order
None                   True                   random order
None                   False                  sequential order
Sampler object         None                   order defined by sampler
Sampler object         True                   not allowed
Sampler object         False                  not allowed
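
The table can also be read as a small decision function. The sketch below encodes only the tabulated outcomes; expected_order is a hypothetical helper, and the exception type used for the disallowed combinations is an assumption.

```python
def expected_order(sampler=None, shuffle=None):
    """Return the row order implied by the 'sampler' and 'shuffle' arguments."""
    if sampler is not None:
        if shuffle is not None:
            # 'sampler' and 'shuffle' are mutually exclusive.
            raise ValueError("sampler and shuffle cannot be specified together")
        return "order defined by sampler"
    return "sequential order" if shuffle is False else "random order"


print(expected_order())               # random order
print(expected_order(shuffle=False))  # sequential order
```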

Parameters
  • source (Union[Callable, Iterable, Random Accessible]) – A generator callable object, an iterable Python object or a random accessible Python object. Callable source is required to return a tuple of NumPy arrays as a row of the dataset on source().next(). Iterable source is required to return a tuple of NumPy arrays as a row of the dataset on iter(source).next(). Random accessible source is required to return a tuple of NumPy arrays as a row of the dataset on source[idx].

  • column_names (list[str], optional) – List of column names of the dataset (default=None). Users are required to provide either column_names or schema.

  • column_types (list[mindspore.dtype], optional) – List of column data types of the dataset (default=None). If provided, sanity check will be performed on generator output.

  • schema (Union[Schema, str], optional) – Path to the JSON schema file or schema object (default=None). Users are required to provide either column_names or schema. If both are provided, schema will be used.

  • num_samples (int, optional) – The number of samples to be included in the dataset (default=None, all samples).

  • num_parallel_workers (int, optional) – Number of subprocesses used to fetch the dataset in parallel (default=1).

  • shuffle (bool, optional) – Whether or not to perform shuffle on the dataset. Random accessible input is required. (default=None, expected order behavior shown in the table).

  • sampler (Union[Sampler, Iterable], optional) – Object used to choose samples from the dataset. Random accessible input is required (default=None, expected order behavior shown in the table).

  • num_shards (int, optional) – Number of shards that the dataset will be divided into (default=None). When this argument is specified, ‘num_samples’ will not be used. Random accessible input is required.

  • shard_id (int, optional) – The shard ID within num_shards (default=None). This argument must be specified only when num_shards is also specified. Random accessible input is required.

  • python_multiprocessing (bool, optional) – Parallelize Python operations with multiple worker processes. This option could be beneficial if the Python operation is computationally heavy (default=True).

Examples

>>> import numpy as np
>>> import mindspore.dataset as ds
>>>
>>> # 1) Multidimensional generator function as callable input
>>> def GeneratorMD():
>>>     for i in range(64):
>>>         yield (np.array([[i, i + 1], [i + 2, i + 3]]),)
>>> # Create multi_dimension_generator_dataset with GeneratorMD and column name "multi_dimensional_data"
>>> multi_dimension_generator_dataset = ds.GeneratorDataset(GeneratorMD, ["multi_dimensional_data"])
>>>
>>> # 2) Multi-column generator function as callable input
>>> def GeneratorMC(maxid = 64):
>>>     for i in range(maxid):
>>>         yield (np.array([i]), np.array([[i, i + 1], [i + 2, i + 3]]))
>>> # Create multi_column_generator_dataset with GeneratorMC and column names "col1" and "col2"
>>> multi_column_generator_dataset = ds.GeneratorDataset(GeneratorMC, ["col1", "col2"])
>>>
>>> # 3) Iterable dataset as iterable input
>>> class MyIterable():
>>>     def __iter__(self):
>>>         return # User implementation
>>> # Create iterable_generator_dataset with MyIterable object
>>> iterable_generator_dataset = ds.GeneratorDataset(MyIterable(), ["col1"])
>>>
>>> # 4) Random accessible dataset as random accessible input
>>> class MyRA():
>>>     def __getitem__(self, index):
>>>         return # User implementation
>>> # Create ra_generator_dataset with MyRA object
>>> ra_generator_dataset = ds.GeneratorDataset(MyRA(), ["col1"])
>>> # List/Dict/Tuple is also random accessible
>>> list_generator = ds.GeneratorDataset([(np.array(0),), (np.array(1),), (np.array(2),)], ["col1"])
>>>
>>> # 5) Built-in Sampler
>>> # my_ds is a random accessible object, as in 4)
>>> my_generator = ds.GeneratorDataset(my_ds, ["img", "label"], sampler=ds.RandomSampler())
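The three kinds of source differ only in how a row is fetched. The following is a minimal sketch in plain Python, not tied to the library: the class and function names are hypothetical, and tuples of ints stand in for tuples of NumPy arrays so that the sketch runs without dependencies. The __len__ method is included on the assumption that a sampler needs to know the number of rows.

```python
# Pure-Python stand-ins for the three kinds of source described above.
# Tuples of ints replace tuples of NumPy arrays (an assumption for brevity).

def generator_source():
    """Callable source: calling it returns a generator of row tuples."""
    for i in range(3):
        yield (i, i * 10)


class IterableSource:
    """Iterable source: iter(source) yields row tuples."""

    def __iter__(self):
        return iter([(0, 0), (1, 10), (2, 20)])


class RandomAccessSource:
    """Random accessible source: source[idx] returns one row tuple."""

    def __init__(self):
        self._rows = [(0, 0), (1, 10), (2, 20)]

    def __getitem__(self, index):
        return self._rows[index]

    def __len__(self):
        return len(self._rows)


# Each source hands out its first row in a different way.
first_from_callable = next(generator_source())
first_from_iterable = next(iter(IterableSource()))
first_from_random_access = RandomAccessSource()[0]
```

Only the random accessible form can be indexed in arbitrary order, which is why shuffle, sampler and sharding require it.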
apply(apply_func)

Apply a function to this dataset.

Parameters

apply_func (function) – A function that must take one ‘Dataset’ as an argument and return a preprocessed ‘Dataset’.

Returns

Dataset, applied by the function.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>>
>>> # Declare an apply_func function which returns a Dataset object
>>> def apply_func(dataset):
>>>     dataset = dataset.batch(2)
>>>     return dataset
>>>
>>> # Use apply to call apply_func
>>> data = data.apply(apply_func)
Raises
  • TypeError – If apply_func is not a function.

  • TypeError – If apply_func doesn’t return a Dataset.
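The contract of apply can be illustrated without the library. The sketch below uses a hypothetical ToyDataset class (not part of mindspore.dataset) and shows the two documented checks: the argument must be callable, and it must return a dataset.

```python
class ToyDataset:
    """Hypothetical stand-in for a Dataset; holds rows in a list."""

    def __init__(self, rows):
        self.rows = list(rows)

    def batch(self, batch_size):
        # Group consecutive rows into lists of at most batch_size rows.
        grouped = [self.rows[i:i + batch_size]
                   for i in range(0, len(self.rows), batch_size)]
        return ToyDataset(grouped)

    def apply(self, apply_func):
        # Mirror the documented checks: the argument must be a function
        # and it must return a dataset.
        if not callable(apply_func):
            raise TypeError("apply_func must be a function")
        result = apply_func(self)
        if not isinstance(result, ToyDataset):
            raise TypeError("apply_func must return a dataset")
        return result


def batch_by_two(dataset):
    return dataset.batch(2)


batched = ToyDataset(range(5)).apply(batch_by_two)
```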

batch(batch_size, drop_remainder=False, num_parallel_workers=None, per_batch_map=None, input_columns=None, output_columns=None, column_order=None, pad_info=None)

Combine batch_size number of consecutive rows into batches.

For any child node, a batch is treated as a single row. For any column, all the elements within that column must have the same shape. If a per_batch_map callable is provided, it will be applied to the batches of tensors.

Note

The order in which repeat and batch are applied affects the number of batches and the behavior of per_batch_map. It is recommended that the repeat operation be used after the batch operation.

Parameters
  • batch_size (int or function) – The number of rows each batch is created with. An int or callable which takes exactly 1 parameter, BatchInfo.

  • drop_remainder (bool, optional) – Determines whether or not to drop the last possibly incomplete batch (default=False). If True, and if there are fewer than batch_size rows available to make the last batch, then those rows will be dropped and not propagated to the child node.

  • num_parallel_workers (int, optional) – Number of workers to process the dataset in parallel (default=None).

  • per_batch_map (callable, optional) – Per batch map callable. A callable which takes (list[Tensor], list[Tensor], …, BatchInfo) as input parameters. Each list[Tensor] represents a batch of Tensors on a given column. The number of lists should match the number of entries in input_columns. The last parameter of the callable should always be a BatchInfo object.

  • input_columns (list[str], optional) – List of names of the input columns. The size of the list should match the signature of the per_batch_map callable.

  • output_columns (list[str], optional) – [Not currently implemented] List of names assigned to the columns outputted by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. (default=None, output columns will have the same name as the input columns, i.e., the columns will be replaced).

  • column_order (list[str], optional) – [Not currently implemented] List of all the desired columns to propagate to the child node. This list must be a subset of all the columns in the dataset after all operations are applied. The order of the columns in each row propagated to the child node follows the order they appear in this list. The parameter is mandatory if len(input_columns) != len(output_columns). (default=None, all columns will be propagated to the child node, the order of the columns will remain the same).

  • pad_info (dict, optional) – Whether to perform padding on selected columns. pad_info={"col1": ([224, 224], 0)} would pad the column named "col1" to a tensor of size [224, 224] and fill the missing values with 0.

Returns

BatchDataset, dataset batched.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>>
>>> # Create a dataset where every 100 rows is combined into a batch
>>> # and drops the last incomplete batch if there is one.
>>> data = data.batch(100, True)
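The effect of drop_remainder can be seen on a plain list. This is a minimal sketch of the grouping rule only, with a hypothetical batch_rows helper; it is not the library's implementation and ignores per_batch_map and pad_info.

```python
def batch_rows(rows, batch_size, drop_remainder=False):
    """Group consecutive rows into lists of batch_size rows."""
    batches = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        # The last batch may be short; drop it only if requested.
        if len(batch) < batch_size and drop_remainder:
            break
        batches.append(batch)
    return batches


rows = list(range(10))
kept = batch_rows(rows, 4)                          # last batch has 2 rows
dropped = batch_rows(rows, 4, drop_remainder=True)  # short batch discarded
```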
bucket_batch_by_length(column_names, bucket_boundaries, bucket_batch_sizes, element_length_function=None, pad_info=None, pad_to_bucket_boundary=False, drop_remainder=False)

Bucket elements according to their lengths. Each bucket will be padded and batched when they are full.

A length function is called on each row in the dataset. The row is then bucketed based on its length and bucket_boundaries. When a bucket reaches its corresponding size specified in bucket_batch_sizes, the entire bucket will be padded according to pad_info, and then batched. Each batch will be full, except possibly the last batch for each bucket.

Parameters
  • column_names (list[str]) – Columns passed to element_length_function.

  • bucket_boundaries (list[int]) – A list consisting of the upper boundaries of the buckets. Must be strictly increasing. If there are n boundaries, n+1 buckets are created: one bucket for [0, bucket_boundaries[0]), one bucket for [bucket_boundaries[i], bucket_boundaries[i+1]) for each 0<=i<n-1, and one bucket for [bucket_boundaries[n-1], inf).

  • bucket_batch_sizes (list[int]) – A list consisting of the batch sizes for each bucket. Must contain len(bucket_boundaries)+1 elements.

  • element_length_function (Callable, optional) – A function that takes in len(column_names) arguments and returns an int. If no value is provided, then len(column_names) must be 1, and the size of the first dimension of that column will be taken as the length (default=None).

  • pad_info (dict, optional) – Represents how to batch each column. The key corresponds to the column name, and the value must be a tuple of 2 elements. The first element corresponds to the shape to pad to, and the second element corresponds to the value to pad with. If a column is not specified, then that column will be padded to the longest in the current batch, and 0 will be used as the padding value. Any None dimensions will be padded to the longest in the current batch, unless pad_to_bucket_boundary is True. If no padding is wanted, set pad_info to None (default=None).

  • pad_to_bucket_boundary (bool, optional) – If True, will pad each None dimension in pad_info to the bucket_boundary minus 1. If there are any elements that fall into the last bucket, an error will occur (default=False).

  • drop_remainder (bool, optional) – If True, will drop the last batch for each bucket if it is not a full batch (default=False).

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>>
>>> # Create a dataset where rows are bucketed by length, and each bucket is
>>> # padded and batched once it holds its configured number of rows.
>>> column_names = ["col1", "col2"]
>>> bucket_boundaries = [5, 10]
>>> bucket_batch_sizes = [5, 1, 1]
>>> element_length_function = (lambda col1, col2: max(len(col1), len(col2)))
>>>
>>> # Will pad col1 to shape [2, bucket_boundaries[i]] where i is the
>>> # index of the bucket that is currently being batched.
>>> # Will pad col2 to a shape where each dimension is the longest in all
>>> # the elements currently being batched.
>>> pad_info = {"col1": ([2, None], -1)}
>>> pad_to_bucket_boundary = True
>>>
>>> data = data.bucket_batch_by_length(column_names, bucket_boundaries,
>>>                                    bucket_batch_sizes,
>>>                                    element_length_function, pad_info,
>>>                                    pad_to_bucket_boundary)
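The bucketing rule can be sketched in plain Python. The helpers below (bucket_index, bucket_batches, pad_batch) are hypothetical names; rows are 1-D lists, the row length is len(row), and every batch is padded to its longest row, which corresponds to the default behaviour when a column is not listed in pad_info. This is an illustration of the description above, not the library's code.

```python
from bisect import bisect_right


def bucket_index(length, bucket_boundaries):
    """Return the bucket a row of the given length falls into.

    With n boundaries there are n + 1 buckets; a length equal to a
    boundary belongs to the bucket that starts at that boundary.
    """
    return bisect_right(bucket_boundaries, length)


def pad_batch(batch, pad_value):
    """Pad every row to the longest row in the batch."""
    longest = max(len(row) for row in batch)
    return [row + [pad_value] * (longest - len(row)) for row in batch]


def bucket_batches(rows, bucket_boundaries, bucket_batch_sizes, pad_value=0):
    """Yield padded batches; each bucket is emitted when it is full."""
    buckets = [[] for _ in bucket_batch_sizes]
    for row in rows:
        idx = bucket_index(len(row), bucket_boundaries)
        buckets[idx].append(row)
        if len(buckets[idx]) == bucket_batch_sizes[idx]:
            yield pad_batch(buckets[idx], pad_value)
            buckets[idx] = []
    # Leftover rows form one final, possibly short, batch per bucket.
    for bucket in buckets:
        if bucket:
            yield pad_batch(bucket, pad_value)


rows = [[1], [1, 2, 3], [1] * 6, [1, 2], [1] * 12]
batches = list(bucket_batches(rows, [5, 10], [2, 1, 1]))
```

With boundaries [5, 10], lengths 0 to 4 go to the first bucket, 5 to 9 to the second, and 10 or more to the third.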
concat(datasets)

Concatenate the datasets in the input list of datasets. The “+” operator is also supported to concatenate.

Note

The column names, and the rank and type of the column data, must be the same in the input datasets.

Parameters

datasets (Union[list, class Dataset]) – A list of datasets or a single class Dataset to be concatenated together with this dataset.

Returns

ConcatDataset, dataset concatenated.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # ds1 and ds2 are instances of Dataset object
>>>
>>> # Create a dataset by concatenating ds1 and ds2 with "+" operator
>>> data1 = ds1 + ds2
>>> # Create a dataset by concatenating ds1 and ds2 with concat operation
>>> data1 = ds1.concat(ds2)
create_dict_iterator(num_epochs=-1, output_numpy=False)

Create an iterator over the dataset. The data retrieved will be a dictionary.

The order of the columns in the dictionary may not be the same as the original order.

Parameters
  • num_epochs (int, optional) – Maximum number of epochs that the iterator can be iterated (default=-1, the iterator can be iterated for an infinite number of epochs).

  • output_numpy (bool, optional) – Whether or not to output NumPy datatype, if output_numpy=False, iterator will output MSTensor (default=False).

Returns

Iterator, dictionary of column name-ndarray pair.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>>
>>> # create an iterator
>>> # The order of the columns in the returned dictionary may differ from the original order.
>>> iterator = data.create_dict_iterator()
>>> for item in iterator:
>>>     # print the data in column1
>>>     print(item["column1"])
create_tuple_iterator(columns=None, num_epochs=-1, output_numpy=False)

Create an iterator over the dataset. The data retrieved will be a list of ndarrays of data.

To specify which columns to list and the order needed, use columns. If columns is not provided, the order of the columns will not be changed.

Parameters
  • columns (list[str], optional) – List of columns to be used to specify the order of columns (default=None, means all columns).

  • num_epochs (int, optional) – Maximum number of epochs that the iterator can be iterated (default=-1, the iterator can be iterated for an infinite number of epochs).

  • output_numpy (bool, optional) – Whether or not to output NumPy datatype. If output_numpy=False, iterator will output MSTensor (default=False).

Returns

Iterator, list of ndarrays.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>>
>>> # Create an iterator
>>> # The columns in the data obtained by the iterator will not be changed.
>>> iterator = data.create_tuple_iterator()
>>> for item in iterator:
>>>     # convert the returned tuple to a list and print
>>>     print(list(item))
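The difference between the two iterator styles is only in how a row is packaged. The sketch below uses hypothetical helper functions over plain tuples; it illustrates the row shapes and the effect of the columns argument, and does not model epochs or tensor types.

```python
def dict_iterator(rows, column_names):
    """Yield each row as a {column name: value} dictionary."""
    for row in rows:
        yield dict(zip(column_names, row))


def tuple_iterator(rows, column_names, columns=None):
    """Yield each row as a list, optionally reordered by `columns`."""
    order = columns if columns is not None else column_names
    positions = [column_names.index(name) for name in order]
    for row in rows:
        yield [row[pos] for pos in positions]


column_names = ["image", "label"]
rows = [("img0", 0), ("img1", 1)]

as_dicts = list(dict_iterator(rows, column_names))
as_lists = list(tuple_iterator(rows, column_names, columns=["label", "image"]))
unchanged = list(tuple_iterator(rows, column_names))
```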
device_que(prefetch_size=None, send_epoch_end=True)

Return a transferred Dataset that transfers data through a device.

Parameters
  • prefetch_size (int, optional) – Prefetch number of records ahead of the user’s request (default=None).

  • send_epoch_end (bool, optional) – Whether to send end of sequence to device or not (default=True).

Note

If the device is Ascend, features of data will be transferred one by one. Each transfer is limited to 256M of data.

Returns

TransferDataset, dataset for transferring.

filter(predicate, input_columns=None, num_parallel_workers=1)

Filter dataset by predicate.

Note

If input_columns is not provided or is empty, all columns will be used.

Parameters
  • predicate (callable) – Python callable which returns a boolean value. Rows for which it returns False are filtered out.

  • input_columns (list[str], optional) – List of names of the input columns (default=None, the predicate will be applied on all columns in the dataset).

  • num_parallel_workers (int, optional) – Number of workers to process the dataset in parallel (default=1).

Returns

FilterDataset, dataset filtered.

Examples

>>> import mindspore.dataset as ds
>>> # generator data (0 ~ 63)
>>> # Filter out the data that is greater than or equal to 11
>>> dataset_f = dataset.filter(predicate=lambda data: data < 11, input_columns=["data"])
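The predicate semantics can be checked on plain dictionaries. In this sketch, filter_rows is a hypothetical helper: rows are dicts, the values of input_columns are passed to the predicate as positional arguments, and a row is kept only when the predicate returns True.

```python
def filter_rows(rows, predicate, input_columns=None):
    """Keep the rows for which predicate returns True.

    rows is a list of dicts; input_columns selects which values are
    passed to predicate (all columns, in row order, when None or empty).
    """
    kept = []
    for row in rows:
        names = input_columns if input_columns else list(row)
        if predicate(*(row[name] for name in names)):
            kept.append(row)
    return kept


rows = [{"data": i, "label": i % 2} for i in range(64)]
small = filter_rows(rows, lambda data: data < 11, input_columns=["data"])
```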
flat_map(func)

Map func to each row in dataset and flatten the result.

The specified func is a function that must take one ‘Ndarray’ as input and return a ‘Dataset’.

Parameters

func (function) – A function that must take one ‘Ndarray’ as an argument and return a ‘Dataset’.

Returns

Dataset, applied by the function.

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>>
>>> # Declare a function which returns a Dataset object
>>> def flat_map_func(x):
>>>     data_dir = text.to_str(x[0])
>>>     d = ds.ImageFolderDataset(data_dir)
>>>     return d
>>> # data is an instance of a Dataset object.
>>> data = ds.TextFileDataset(DATA_FILE)
>>> data = data.flat_map(flat_map_func)
Raises
  • TypeError – If func is not a function.

  • TypeError – If func doesn’t return a Dataset.
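Conceptually, flat_map turns every row into a dataset and chains those datasets end to end. A minimal sketch with lists standing in for datasets follows; flat_map and expand here are hypothetical plain-Python functions, not the library's method.

```python
from itertools import chain


def flat_map(rows, func):
    """Apply func to each row and chain the resulting datasets together."""
    return list(chain.from_iterable(func(row) for row in rows))


def expand(row):
    # Hypothetical func: turn one row into a small dataset of its own.
    return [(row, i) for i in range(2)]


flattened = flat_map(["a", "b"], expand)
```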

get_batch_size()

Get the size of a batch.

Returns

Number, the number of data in a batch.

get_col_names()

Get the names of the columns in the dataset.

get_dataset_size()

Get the number of batches in an epoch.

Returns

Number, number of batches.

get_repeat_count()

Get the repeat count of a RepeatDataset, or 1 if the dataset is not repeated.

Returns

Number, the count of repeat.

map(operations=None, input_columns=None, output_columns=None, column_order=None, num_parallel_workers=None, python_multiprocessing=False, cache=None, callbacks=None)

Apply each operation in operations to this dataset.

The order of operations is determined by the position of each operation in the operations parameter. operations[0] will be applied first, then operations[1], then operations[2], etc.

Each operation will be passed one or more columns from the dataset as input, and zero or more columns will be outputted. The first operation will be passed the columns specified in input_columns as input. If there is more than one operator in operations, the outputted columns of the previous operation are used as the input columns for the next operation. The columns outputted by the very last operation will be assigned names specified by output_columns.

Only the columns specified in column_order will be propagated to the child node. These columns will be in the same order as specified in column_order.

Parameters
  • operations (Union[list[TensorOp], list[functions]]) – List of operations to be applied on the dataset. Operations are applied in the order they appear in this list.

  • input_columns (list[str]) – List of the names of the columns that will be passed to the first operation as input. The size of this list must match the number of input columns expected by the first operator. (default=None, the first operation will be passed however many columns are required, starting from the first column).

  • output_columns (list[str], optional) – List of names assigned to the columns outputted by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. (default=None, output columns will have the same name as the input columns, i.e., the columns will be replaced).

  • column_order (list[str], optional) – List of all the desired columns to propagate to the child node. This list must be a subset of all the columns in the dataset after all operations are applied. The order of the columns in each row propagated to the child node follows the order they appear in this list. The parameter is mandatory if len(input_columns) != len(output_columns). (default=None, all columns will be propagated to the child node, the order of the columns will remain the same).

  • num_parallel_workers (int, optional) – Number of threads used to process the dataset in parallel (default=None, the value from the configuration will be used).

  • python_multiprocessing (bool, optional) – Parallelize Python operations with multiple worker processes. This option could be beneficial if the Python operation is computationally heavy (default=False).

  • cache (DatasetCache, optional) – Tensor cache to use. (default=None which means no cache is used). The cache feature is under development and is not recommended.

  • callbacks (DSCallback, list[DSCallback], optional) – List of Dataset callbacks to be called (default=None).

Returns

MapDataset, dataset after mapping operation.

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.vision.c_transforms as c_transforms
>>>
>>> # data is an instance of Dataset which has 2 columns, "image" and "label".
>>> # ds_pyfunc is an instance of Dataset which has 3 columns, "col0", "col1", and "col2".
>>> # Each column is a 2D array of integers.
>>>
>>> # Set the global configuration value for num_parallel_workers to be 2.
>>> # Operations which use this configuration value will use 2 worker threads,
>>> # unless otherwise specified in the operator's constructor.
>>> # set_num_parallel_workers can be called again later if a different
>>> # global configuration value for the number of worker threads is desired.
>>> ds.config.set_num_parallel_workers(2)
>>>
>>> # Define two operations, where each operation accepts 1 input column and outputs 1 column.
>>> decode_op = c_transforms.Decode(rgb_format=True)
>>> random_jitter_op = c_transforms.RandomColorAdjust((0.8, 0.8), (1, 1), (1, 1), (0, 0))
>>>
>>> # 1) Simple map example
>>>
>>> operations = [decode_op]
>>> input_columns = ["image"]
>>>
>>> # Apply decode_op on column "image". This column will be replaced by the outputted
>>> # column of decode_op. Since column_order is not provided, both columns "image"
>>> # and "label" will be propagated to the child node in their original order.
>>> ds_decoded = data.map(operations, input_columns)
>>>
>>> # Rename column "image" to "decoded_image".
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(operations, input_columns, output_columns)
>>>
>>> # Specify the order of the columns.
>>> column_order = ["label", "image"]
>>> ds_decoded = data.map(operations, input_columns, None, column_order)
>>>
>>> # Rename column "image" to "decoded_image" and also specify the order of the columns.
>>> column_order = ["label", "decoded_image"]
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(operations, input_columns, output_columns, column_order)
>>>
>>> # Rename column "image" to "decoded_image" and keep only this column.
>>> column_order = ["decoded_image"]
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(operations, input_columns, output_columns, column_order)
>>>
>>> # A simple example using pyfunc: Renaming columns and specifying column order
>>> # work in the same way as the previous examples.
>>> input_columns = ["col0"]
>>> operations = [(lambda x: x + 1)]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns)
>>>
>>> # 2) Map example with more than one operation
>>>
>>> # If this list of operations is used with map, decode_op will be applied
>>> # first, then random_jitter_op will be applied.
>>> operations = [decode_op, random_jitter_op]
>>>
>>> input_columns = ["image"]
>>>
>>> # Create a dataset where the images are decoded, then randomly color jittered.
>>> # decode_op takes column "image" as input and outputs one column. The column
>>> # outputted by decode_op is passed as input to random_jitter_op.
>>> # random_jitter_op will output one column. Column "image" will be replaced by
>>> # the column outputted by random_jitter_op (the very last operation). All other
>>> # columns are unchanged. Since column_order is not specified, the order of the
>>> # columns will remain the same.
>>> ds_mapped = data.map(operations, input_columns)
>>>
>>> # Create a dataset that is identical to ds_mapped, except the column "image"
>>> # that is outputted by random_jitter_op is renamed to "image_transformed".
>>> # Specifying column order works in the same way as examples in 1).
>>> output_columns = ["image_transformed"]
>>> ds_mapped_and_renamed = data.map(operations, input_columns, output_columns)
>>>
>>> # Multiple operations using pyfunc: Renaming columns and specifying column order
>>> # work in the same way as examples in 1).
>>> input_columns = ["col0"]
>>> operations = [(lambda x: x + x), (lambda x: x - 1)]
>>> output_columns = ["col0_mapped"]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns, output_columns)
>>>
>>> # 3) Example where number of input columns is not equal to number of output columns
>>>
>>> # operations[0] is a lambda that takes 2 columns as input and outputs 3 columns.
>>> # operations[1] is a lambda that takes 3 columns as input and outputs 1 column.
>>> # operations[2] is a lambda that takes 1 column as input and outputs 4 columns.
>>> #
>>> # Note: The number of output columns of operations[i] must equal the number of
>>> # input columns of operations[i+1]. Otherwise, this map call will also result
>>> # in an error.
>>> operations = [(lambda x, y: (x, x + y, x + y + 1)),
>>>               (lambda x, y, z: x * y * z),
>>>               (lambda x: (x % 2, x % 3, x % 5, x % 7))]
>>>
>>> # Note: Since the number of input columns is not the same as the number of
>>> # output columns, the output_columns and column_order parameters must be
>>> # specified. Otherwise, this map call will also result in an error.
>>> input_columns = ["col2", "col0"]
>>> output_columns = ["mod2", "mod3", "mod5", "mod7"]
>>>
>>> # Propagate all columns to the child node in this order:
>>> column_order = ["col0", "col2", "mod2", "mod3", "mod5", "mod7", "col1"]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns, output_columns, column_order)
>>>
>>> # Propagate some columns to the child node in this order:
>>> column_order = ["mod7", "mod3", "col1"]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns, output_columns, column_order)
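How the outputs of one operation feed the next can be traced on a single row. The sketch below is a plain-Python illustration with a hypothetical run_operations helper; it covers only the chaining and the naming of the final outputs, and deliberately leaves out column_order and how the remaining columns are propagated.

```python
def run_operations(row, operations, input_columns, output_columns=None):
    """Chain operations over one row (a dict) and name the final outputs."""
    values = tuple(row[name] for name in input_columns)
    for op in operations:
        result = op(*values)
        # Normalise a single return value to a 1-tuple so it can be
        # unpacked into the next operation.
        values = result if isinstance(result, tuple) else (result,)
    names = output_columns if output_columns is not None else input_columns
    if len(names) != len(values):
        raise ValueError("number of output names must match number of outputs")
    return dict(zip(names, values))


operations = [
    lambda x, y: (x, x + y, x + y + 1),       # 2 columns in, 3 out
    lambda x, y, z: x * y * z,                # 3 columns in, 1 out
    lambda x: (x % 2, x % 3, x % 5, x % 7),   # 1 column in, 4 out
]
row = {"col0": 2, "col1": 7, "col2": 3}
outputs = run_operations(row, operations, ["col2", "col0"],
                         ["mod2", "mod3", "mod5", "mod7"])
```

For this row the intermediate values are (3, 5, 6), then 90, then the four remainders of 90. Omitting output_columns raises an error in the sketch because two input names cannot label four outputs, which mirrors the requirement stated in example 3.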
num_classes()

Get the number of classes in a dataset.

Returns

Number, number of classes.

output_shapes()

Get the shapes of output data.

Returns

List, list of shapes of each column.

output_types()

Get the types of output data.

Returns

List, list of data types.

project(columns)

Project certain columns in input dataset.

The specified columns will be selected from the dataset and passed down the pipeline in the order specified. The other columns are discarded.

Parameters

columns (list[str]) – List of names of the columns to project.

Returns

ProjectDataset, dataset projected.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>> columns_to_project = ["column3", "column1", "column2"]
>>>
>>> # Create a dataset that consists of column3, column1, column2
>>> # in that order, regardless of the original order of columns.
>>> data = data.project(columns=columns_to_project)
rename(input_columns, output_columns)

Rename the columns in input datasets.

Parameters
  • input_columns (list[str]) – List of names of the input columns.

  • output_columns (list[str]) – List of names of the output columns.

Returns

RenameDataset, dataset renamed.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> input_columns = ["input_col1", "input_col2", "input_col3"]
>>> output_columns = ["output_col1", "output_col2", "output_col3"]
>>>
>>> # Create a dataset where input_col1 is renamed to output_col1, and
>>> # input_col2 is renamed to output_col2, and input_col3 is renamed
>>> # to output_col3.
>>> data = data.rename(input_columns=input_columns, output_columns=output_columns)
repeat(count=None)

Repeat this dataset count times. Repeat indefinitely if the count is None or -1.

Note

The order in which repeat and batch are applied affects the number of batches. It is recommended that the repeat operation be used after the batch operation. If dataset_sink_mode is False, the repeat operation is invalid. If dataset_sink_mode is True, the repeat count must be equal to the number of training epochs. Otherwise, errors could occur since the amount of data is not the amount training requires.

Parameters

count (int) – Number of times the dataset is repeated (default=None).

Returns

RepeatDataset, dataset repeated.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>>
>>> # Create a dataset where the dataset is repeated for 50 epochs
>>> repeated = data.repeat(50)
>>>
>>> # Create a dataset where each epoch is shuffled individually
>>> shuffled_and_repeated = data.shuffle(10)
>>> shuffled_and_repeated = shuffled_and_repeated.repeat(50)
>>>
>>> # Create a dataset where the dataset is first repeated for
>>> # 50 epochs before shuffling. The shuffle operator will treat
>>> # the entire 50 epochs as one big dataset.
>>> repeat_and_shuffle = data.repeat(50)
>>> repeat_and_shuffle = repeat_and_shuffle.shuffle(10)
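The note about the order of repeat and batch can be verified with a small count. The following sketch uses hypothetical list-based helpers, not the library's operators, on 10 rows with a batch size of 4 and a repeat count of 2.

```python
from itertools import chain


def batch_rows(rows, batch_size):
    """Group consecutive rows into lists of at most batch_size rows."""
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]


def repeat_rows(rows, count):
    """Concatenate count copies of rows."""
    return list(chain.from_iterable([rows] * count))


rows = list(range(10))

# batch, then repeat: 3 batches per epoch (4 + 4 + 2 rows), repeated twice.
batch_then_repeat = repeat_rows(batch_rows(rows, 4), 2)

# repeat, then batch: 20 rows are batched together, so a batch can span
# the boundary between two epochs.
repeat_then_batch = batch_rows(repeat_rows(rows, 2), 4)
```

Batching first yields 6 batches and keeps every batch inside one epoch; repeating first yields 5 batches, one of which mixes rows from two epochs.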
reset()

Reset the dataset for next epoch.

save(file_name, num_files=1, file_type='mindrecord')

Save the dynamic data processed by the dataset pipeline in a common dataset format. Supported dataset formats: ‘mindrecord’ only.

Implicit type casting is applied when saving data as ‘mindrecord’. The table below shows how each type is cast.

Implicit Type Casting when Saving as ‘mindrecord’

Type in ‘dataset’    Type in ‘mindrecord’    Details
bool                 None                    Not supported
int8                 int32
uint8                bytes(1D uint8)         Drop dimension
int16                int32
uint16               int32
int32                int32
uint32               int64
int64                int64
uint64               None                    Not supported
float16              float32
float32              float32
float64              float64
string               string                  Multi-dimensional string not supported

Note

  1. To save the samples in order, set dataset’s shuffle to False and num_files to 1.

  2. Before calling the function, do not use batch operator, repeat operator or data augmentation operators with random attribute in map operator.

  3. Mindrecord does not support DE_UINT64, multi-dimensional DE_UINT8 (drop dimension) or multi-dimensional DE_STRING.

Parameters
  • file_name (str) – Path to dataset file.

  • num_files (int, optional) – Number of dataset files (default=1).

  • file_type (str, optional) – Dataset format (default=’mindrecord’).

shuffle(buffer_size)

Randomly shuffles the rows of this dataset using the following algorithm:

  1. Make a shuffle buffer that contains the first buffer_size rows.

  2. Randomly select an element from the shuffle buffer to be the next row propagated to the child node.

  3. Get the next row (if any) from the parent node and put it in the shuffle buffer.

  4. Repeat steps 2 and 3 until there are no more rows left in the shuffle buffer.

A seed can be provided to be used on the first epoch. In every subsequent epoch, the seed is changed to a new, randomly generated value.

Parameters

buffer_size (int) – The size of the buffer (must be larger than 1) for shuffling. Setting buffer_size equal to the number of rows in the entire dataset will result in a global shuffle.

Returns

ShuffleDataset, dataset shuffled.

Raises

RuntimeError – If sync operators exist before shuffle.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> # Optionally set the seed for the first epoch
>>> ds.config.set_seed(58)
>>>
>>> # Create a shuffled dataset using a shuffle buffer of size 4
>>> data = data.shuffle(4)
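The four steps of the shuffle algorithm can be written out directly. This is a minimal plain-Python sketch under the assumption that the row picked in step 2 is replaced in place by the next row from the parent; buffer_shuffle is a hypothetical name and the random number generator is Python's, not the library's.

```python
import random


def buffer_shuffle(rows, buffer_size, seed=None):
    """Shuffle rows with a fixed-size buffer, following the steps above."""
    rng = random.Random(seed)
    source = iter(rows)
    # Step 1: fill the buffer with the first buffer_size rows.
    buffer = []
    for row in source:
        buffer.append(row)
        if len(buffer) == buffer_size:
            break
    output = []
    # Steps 2-4: emit a random buffered row, refill from the source, repeat.
    while buffer:
        pick = rng.randrange(len(buffer))
        output.append(buffer[pick])
        try:
            buffer[pick] = next(source)   # step 3: replace the emitted row
        except StopIteration:
            buffer.pop(pick)              # source exhausted: shrink the buffer
    return output


rows = list(range(20))
shuffled = buffer_shuffle(rows, buffer_size=4, seed=58)
```

Because the buffer only ever holds rows from a sliding window, the row emitted at position k always comes from the first k + buffer_size input rows. A small buffer therefore gives only a local shuffle, while a buffer as large as the dataset gives a global one.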
skip(count)

Skip the first N elements of this dataset.

Parameters

count (int) – Number of elements in the dataset to be skipped.

Returns

SkipDataset, dataset skipped.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> # Create a dataset which skips first 3 elements from data
>>> data = data.skip(3)
split(sizes, randomize=True)

Split the dataset into smaller, non-overlapping datasets.

Parameters
  • sizes (Union[list[int], list[float]]) –

    If a list of integers [s1, s2, …, sn] is provided, the dataset will be split into n datasets of size s1, size s2, …, size sn respectively. If the sum of all sizes does not equal the original dataset size, an error will occur. If a list of floats [f1, f2, …, fn] is provided, all floats must be between 0 and 1 and must sum to 1, otherwise an error will occur. The dataset will be split into n Datasets of size round(f1*K), round(f2*K), …, round(fn*K) where K is the size of the original dataset. If after rounding:

    • Any size equals 0, an error will occur.

    • The sum of split sizes < K, the difference will be added to the first split.

    • The sum of split sizes > K, the difference will be removed from the first large enough split such that it will have at least 1 row after removing the difference.

  • randomize (bool, optional) – Determines whether or not to split the data randomly (default=True). If True, the data will be randomly split. Otherwise, each split will be created with consecutive rows from the dataset.

Note

  1. There is an optimized split function, which will be called automatically when the dataset that calls this function is a MappableDataset.

  2. Dataset should not be sharded if split is going to be called. Instead, create a DistributedSampler and specify a split to shard after splitting. If dataset is sharded after a split, it is strongly recommended to set the same seed in each instance of execution, otherwise each shard may not be part of the same split (see Examples).

  3. It is strongly recommended to not shuffle the dataset, but use randomize=True instead. Shuffling the dataset may not be deterministic, which means the data in each split will be different in each epoch. Furthermore, if sharding occurs after split, each shard may not be part of the same split.

Raises
  • RuntimeError – If get_dataset_size returns None or is not supported for this dataset.

  • RuntimeError – If sizes is list of integers and sum of all elements in sizes does not equal the dataset size.

  • RuntimeError – If sizes is list of float and there is a split with size 0 after calculations.

  • RuntimeError – If the dataset is sharded prior to calling split.

  • ValueError – If sizes is list of float and not all floats are between 0 and 1, or if the floats don’t sum to 1.

Returns

tuple(Dataset), a tuple of datasets that have been split.

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_dir = "/path/to/imagefolder_directory"
>>>
>>> # Since many datasets have shuffle on by default, set shuffle to False if split will be called!
>>> data = ds.ImageFolderDataset(dataset_dir, shuffle=False)
>>>
>>> # Set the seed, and tell split to use this seed when randomizing.
>>> # This is needed because sharding will be done later
>>> ds.config.set_seed(58)
>>> train, test = data.split([0.9, 0.1])
>>>
>>> # To shard the train dataset, use a DistributedSampler
>>> train_sampler = ds.DistributedSampler(10, 2)
>>> train.use_sampler(train_sampler)
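The rounding rules for fractional sizes can be worked through in a few lines. The sketch below follows the three rules listed under sizes; split_sizes is a hypothetical helper, and it uses Python's built-in round, which rounds halves to the nearest even integer. Whether the library rounds halves the same way is not stated above, so treat the tie-breaking as an assumption.

```python
def split_sizes(fractions, dataset_size):
    """Turn fractional split sizes into row counts, per the rules above."""
    sizes = [round(f * dataset_size) for f in fractions]
    if any(size == 0 for size in sizes):
        raise RuntimeError("a split would have size 0 after rounding")
    difference = dataset_size - sum(sizes)
    if difference > 0:
        # Too few rows assigned: give the extras to the first split.
        sizes[0] += difference
    elif difference < 0:
        # Too many rows assigned: shrink the first split that keeps >= 1 row.
        for i, size in enumerate(sizes):
            if size + difference >= 1:
                sizes[i] = size + difference
                break
    return sizes


even_split = split_sizes([0.9, 0.1], 100)
too_few = split_sizes([0.5, 0.5], 5)    # both halves round down to 2
too_many = split_sizes([0.5, 0.5], 3)   # both halves round up to 2
```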
sync_update(condition_name, num_batch=None, data=None)

Release a blocking condition and trigger callback with given data.

Parameters
  • condition_name (str) – The condition name that is used to toggle sending next row.

  • num_batch (Union[int, None]) – The number of batches (rows) that are released. When num_batch is None, it will default to the number specified by the sync_wait operator (default=None).

  • data (Union[dict, None]) – The data passed to the callback (default=None).

sync_wait(condition_name, num_batch=1, callback=None)

Add a blocking condition to the input Dataset.

Parameters
  • condition_name (str) – The condition name that is used to toggle sending next row.

  • num_batch (int) – The number of batches that pass without blocking at the start of each epoch.

  • callback (function) – The callback function that will be invoked when sync_update is called.

Raises

RuntimeError – If condition name already exists.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> data = data.sync_wait("callback1")
>>> data = data.batch(batch_size)
>>> for batch_data in data.create_dict_iterator():
>>>     data.sync_update("callback1")
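The relationship between sync_wait and sync_update can be pictured as a gate that hands out a limited number of permits. The sketch below is a toy single-threaded model and makes several assumptions: `BlockingGate` is a hypothetical class, and returning None stands in for the real pipeline blocking until sync_update is called.

```python
class BlockingGate:
    """Toy model of a sync_wait / sync_update pair (not MindSpore code)."""

    def __init__(self, rows, num_batch=1, callback=None):
        self._rows = iter(rows)
        self._default = num_batch      # rows released per update by default
        self._permits = num_batch      # rows allowed before the first block
        self._callback = callback

    def try_next(self):
        """Return the next row, or None if the gate is blocked or the rows are exhausted."""
        if self._permits == 0:
            return None
        self._permits -= 1
        return next(self._rows, None)

    def update(self, num_batch=None, data=None):
        """Release more rows and pass data to the callback, if any."""
        self._permits += self._default if num_batch is None else num_batch
        if self._callback is not None:
            self._callback(data)

seen = []
gate = BlockingGate(rows=[10, 20, 30], num_batch=1, callback=seen.append)
first = gate.try_next()      # one row passes without blocking
blocked = gate.try_next()    # no permit left, so the gate is blocked
gate.update(data={"loss": 0.5})
second = gate.try_next()     # released by the update
```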
take(count=-1)

Takes at most given numbers of elements from the dataset.

Note

  1. If count is greater than the number of elements in the dataset or equal to -1, all the elements in dataset will be taken.

  2. The order of using take and batch matters. If take is before batch operation, then take given number of rows; otherwise take given number of batches.

Parameters

count (int, optional) – Number of elements to be taken from the dataset (default=-1).

Returns

TakeDataset, dataset taken.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> # Create a dataset where the dataset includes 50 elements.
>>> data = data.take(50)
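The note on ordering take and batch can be checked with plain lists. This is a sketch using hypothetical `batch` and `take` helpers that mimic the documented behavior; it does not call MindSpore.

```python
from itertools import islice

def batch(rows, batch_size):
    """Group consecutive rows into lists of batch_size (the last may be short)."""
    rows = list(rows)
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]

def take(items, count=-1):
    """Return at most count items; -1 means all of them."""
    items = list(items)
    return items if count == -1 else list(islice(items, count))

rows = list(range(10))
take_then_batch = batch(take(rows, 4), 2)   # 4 rows -> 2 batches
batch_then_take = take(batch(rows, 2), 4)   # 5 batches -> 4 batches (8 rows)
```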
to_device(send_epoch_end=True)

Transfer data through CPU, GPU or Ascend devices.

Parameters

send_epoch_end (bool, optional) – Whether to send end of sequence to device or not (default=True).

Note

If the device is Ascend, features of data will be transferred one by one. Each transfer is limited to 256M of data.

Returns

TransferDataset, dataset for transferring.

Raises
  • TypeError – If device_type is empty.

  • ValueError – If device_type is not ‘Ascend’, ‘GPU’ or ‘CPU’.

  • RuntimeError – If dataset is unknown.

  • RuntimeError – If distribution file path is given but failed to read.

use_sampler(new_sampler)

Make the current dataset use the provided new_sampler.

Parameters

new_sampler (Sampler) – The sampler to use for the current dataset.

Returns

Dataset, that uses new_sampler.

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_dir = "/path/to/imagefolder_directory"
>>> # Note: A SequentialSampler is created by default
>>> data = ds.ImageFolderDataset(dataset_dir)
>>>
>>> # Use a DistributedSampler instead of the SequentialSampler
>>> new_sampler = ds.DistributedSampler(10, 2)
>>> data.use_sampler(new_sampler)
zip(datasets)

Zip the datasets in the input tuple of datasets. Columns in the input datasets must not have the same name.

Parameters

datasets (Union[tuple, class Dataset]) – A tuple of datasets or a single class Dataset to be zipped together with this dataset.

Returns

ZipDataset, dataset zipped.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # ds1 and ds2 are instances of Dataset object
>>> # Create a dataset which is the combination of ds1 and ds2
>>> data = ds1.zip(ds2)
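The requirement that zipped datasets must not share column names can be sketched with rows represented as dictionaries. `zip_rows` is a hypothetical helper, and raising ValueError on a clash is an assumption of this sketch rather than documented library behavior.

```python
def zip_rows(rows_a, rows_b):
    """Merge two lists of row dictionaries column-wise, pairing rows by position."""
    merged = []
    for row_a, row_b in zip(rows_a, rows_b):
        clash = set(row_a) & set(row_b)
        if clash:
            raise ValueError(f"duplicate column names: {sorted(clash)}")
        merged.append({**row_a, **row_b})
    return merged

images = [{"image": "img0"}, {"image": "img1"}]
labels = [{"label": 0}, {"label": 1}]
zipped = zip_rows(images, labels)
```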
class mindspore.dataset.GraphData(dataset_file, num_parallel_workers=None, working_mode='local', hostname='127.0.0.1', port=50051, num_client=1, auto_shutdown=True)[source]

Reads the graph dataset used for GNN training from the shared file and database.

Parameters
  • dataset_file (str) – One of file names in the dataset.

  • num_parallel_workers (int, optional) – Number of workers to process the dataset in parallel (default=None).

  • working_mode (str, optional) –

    Set working mode, now supports ‘local’/’client’/’server’ (default=’local’).

    • ’local’, used in non-distributed training scenarios.

    • ’client’, used in distributed training scenarios. The client does not load data, but obtains data from the server.

    • ’server’, used in distributed training scenarios. The server loads the data and is available to the client.

  • hostname (str, optional) – Hostname of the graph data server. This parameter is only valid when working_mode is set to ‘client’ or ‘server’ (default=’127.0.0.1’).

  • port (int, optional) – Port of the graph data server. The range is 1024-65535. This parameter is only valid when working_mode is set to ‘client’ or ‘server’ (default=50051).

  • num_client (int, optional) – Maximum number of clients expected to connect to the server. The server will allocate resources according to this parameter. This parameter is only valid when working_mode is set to ‘server’ (default=1).

  • auto_shutdown (bool, optional) – Only valid when working_mode is set to ‘server’. When the number of connected clients reaches num_client and no client is connecting, the server exits automatically (default=True).

Examples

>>> import mindspore.dataset as ds
>>>
>>> data_graph = ds.GraphData('dataset_file', 2)
>>> nodes = data_graph.get_all_nodes(0)
>>> features = data_graph.get_node_feature(nodes, [1])
get_all_edges(edge_type)[source]

Get all edges in the graph.

Parameters

edge_type (int) – Specify the type of edge.

Returns

Array of edges.

Return type

numpy.ndarray

Examples

>>> import mindspore.dataset as ds
>>>
>>> data_graph = ds.GraphData('dataset_file', 2)
>>> edges = data_graph.get_all_edges(0)
Raises

TypeError – If edge_type is not integer.

get_all_neighbors(node_list, neighbor_type)[source]

Get neighbor_type neighbors of the nodes in node_list.

Parameters
  • node_list (Union[list, numpy.ndarray]) – The given list of nodes.

  • neighbor_type (int) – Specify the type of neighbor.

Returns

Array of nodes.

Return type

numpy.ndarray

Examples

>>> import mindspore.dataset as ds
>>>
>>> data_graph = ds.GraphData('dataset_file', 2)
>>> nodes = data_graph.get_all_nodes(0)
>>> neighbors = data_graph.get_all_neighbors(nodes, 0)
Raises
  • TypeError – If node_list is not list or ndarray.

  • TypeError – If neighbor_type is not integer.

get_all_nodes(node_type)[source]

Get all nodes in the graph.

Parameters

node_type (int) – Specify the type of node.

Returns

Array of nodes.

Return type

numpy.ndarray

Examples

>>> import mindspore.dataset as ds
>>>
>>> data_graph = ds.GraphData('dataset_file', 2)
>>> nodes = data_graph.get_all_nodes(0)
Raises

TypeError – If node_type is not integer.

get_edge_feature(edge_list, feature_types)[source]

Get feature_types feature of the edges in edge_list.

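Parameters
  • edge_list (Union[list, numpy.ndarray]) – The given list of edges.

  • feature_types (Union[list, numpy.ndarray]) – The given list of feature types.

Returns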

Array of features.

Return type

numpy.ndarray

Examples

>>> import mindspore.dataset as ds
>>>
>>> data_graph = ds.GraphData('dataset_file', 2)
>>> edges = data_graph.get_all_edges(0)
>>> features = data_graph.get_edge_feature(edges, [1])
Raises
  • TypeError – If edge_list is not list or ndarray.

  • TypeError – If feature_types is not list or ndarray.

get_neg_sampled_neighbors(node_list, neg_neighbor_num, neg_neighbor_type)[source]

Get neg_neighbor_type negative sampled neighbors of the nodes in node_list.

Parameters
  • node_list (Union[list, numpy.ndarray]) – The given list of nodes.

  • neg_neighbor_num (int) – Number of neighbors sampled.

  • neg_neighbor_type (int) – Specify the type of negative neighbor.

Returns

Array of nodes.

Return type

numpy.ndarray

Examples

>>> import mindspore.dataset as ds
>>>
>>> data_graph = ds.GraphData('dataset_file', 2)
>>> nodes = data_graph.get_all_nodes(0)
>>> neg_neighbors = data_graph.get_neg_sampled_neighbors(nodes, 5, 0)
Raises
  • TypeError – If node_list is not list or ndarray.

  • TypeError – If neg_neighbor_num is not integer.

  • TypeError – If neg_neighbor_type is not integer.

get_node_feature(node_list, feature_types)[source]

Get feature_types feature of the nodes in node_list.

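Parameters
  • node_list (Union[list, numpy.ndarray]) – The given list of nodes.

  • feature_types (Union[list, numpy.ndarray]) – The given list of feature types.

Returns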

Array of features.

Return type

numpy.ndarray

Examples

>>> import mindspore.dataset as ds
>>>
>>> data_graph = ds.GraphData('dataset_file', 2)
>>> nodes = data_graph.get_all_nodes(0)
>>> features = data_graph.get_node_feature(nodes, [1])
Raises
  • TypeError – If node_list is not list or ndarray.

  • TypeError – If feature_types is not list or ndarray.

get_nodes_from_edges(edge_list)[source]

Get nodes from the edges.

Parameters

edge_list (Union[list, numpy.ndarray]) – The given list of edges.

Returns

Array of nodes.

Return type

numpy.ndarray

Raises

TypeError – If edge_list is not list or ndarray.

get_sampled_neighbors(node_list, neighbor_nums, neighbor_types)[source]

Get sampled neighbor information.

The API supports multi-hop neighbor sampling. That is, the previous sampling result is used as the input of next-hop sampling. A maximum of 6 hops is allowed.

The sampling result is tiled into a list in the format of [input node, 1-hop sampling result, 2-hop sampling result …]

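Parameters
  • node_list (Union[list, numpy.ndarray]) – The given list of nodes.

  • neighbor_nums (Union[list, numpy.ndarray]) – Number of neighbors sampled per hop.

  • neighbor_types (Union[list, numpy.ndarray]) – Neighbor type sampled per hop.

Returns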

Array of nodes.

Return type

numpy.ndarray

Examples

>>> import mindspore.dataset as ds
>>>
>>> data_graph = ds.GraphData('dataset_file', 2)
>>> nodes = data_graph.get_all_nodes(0)
>>> neighbors = data_graph.get_sampled_neighbors(nodes, [2, 2], [0, 0])
Raises
  • TypeError – If node_list is not list or ndarray.

  • TypeError – If neighbor_nums is not list or ndarray.

  • TypeError – If neighbor_types is not list or ndarray.
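The tiled output format of get_sampled_neighbors can be illustrated on a small adjacency list. The following is a sketch under stated assumptions, not the library's algorithm: `sample_multi_hop` is a hypothetical helper, neighbor types are ignored, sampling is done with replacement, and -1 is used as a filler when a node has no neighbors.

```python
import random

def sample_multi_hop(graph, node_list, neighbor_nums, rng, default=-1):
    """For each input node return [node, 1-hop samples, 2-hop samples, ...]."""
    result = []
    for node in node_list:
        row, frontier = [node], [node]
        for num in neighbor_nums:
            next_frontier = []
            for src in frontier:
                neighbors = graph.get(src, [])
                if neighbors:
                    # Sample with replacement so every hop has a fixed width.
                    next_frontier.extend(rng.choices(neighbors, k=num))
                else:
                    next_frontier.extend([default] * num)
            row.extend(next_frontier)
            # The samples of this hop are the input of the next hop.
            frontier = next_frontier
        result.append(row)
    return result

# Hypothetical adjacency lists for a four-node graph.
graph = {1: [2, 3], 2: [1, 4], 3: [1], 4: [2]}
rows = sample_multi_hop(graph, [1, 2], [2, 2], random.Random(0))
```

With neighbor_nums of [2, 2], each row in this sketch holds 1 input node, 2 one-hop samples and 4 two-hop samples, for 7 entries in total.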

graph_info()[source]

Get the meta information of the graph, including the number of nodes, the type of nodes, the feature information of nodes, the number of edges, the type of edges, and the feature information of edges.

Returns

Meta information of the graph. The keys are node_type, edge_type, node_num, edge_num, node_feature_type and edge_feature_type.

Return type

dict

random_walk(target_nodes, meta_path, step_home_param=1.0, step_away_param=1.0, default_node=-1)[source]

Random walk in nodes.

Parameters
  • target_nodes (list[int]) – Start node list in random walk.

  • meta_path (list[int]) – Node type for each walk step.

  • step_home_param (float, optional) – Return hyperparameter in the node2vec algorithm (default=1.0).

  • step_away_param (float, optional) – In-out hyperparameter in the node2vec algorithm (default=1.0).

  • default_node (int, optional) – Default node if no more neighbors are found (default=-1). A default value of -1 indicates that no node is given.

Returns

Array of nodes.

Return type

numpy.ndarray

Examples

>>> import mindspore.dataset as ds
>>>
>>> data_graph = ds.GraphData('dataset_file', 2)
>>> nodes = data_graph.random_walk([1,2], [1,2,1,2,1])
Raises
  • TypeError – If target_nodes is not list or ndarray.

  • TypeError – If meta_path is not list or ndarray.
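A meta-path walk can be sketched in a few lines. This is an illustration under assumptions, not the library's implementation: `meta_path_walk` is a hypothetical helper, each entry of meta_path is read as the type of the node visited at that step, and the walk is uniform, which is what node2vec reduces to when both the return and in-out hyperparameters are 1.0.

```python
import random

def meta_path_walk(neighbors, node_types, target_nodes, meta_path, rng, default_node=-1):
    """Walk from each start node, visiting one node of each type in meta_path."""
    walks = []
    for start in target_nodes:
        walk, current = [start], start
        for wanted_type in meta_path:
            candidates = [n for n in neighbors.get(current, [])
                          if node_types[n] == wanted_type]
            if not candidates:
                # Pad the rest of the walk when no neighbor of that type exists.
                walk.extend([default_node] * (len(meta_path) + 1 - len(walk)))
                break
            current = rng.choice(candidates)
            walk.append(current)
        walks.append(walk)
    return walks

# Hypothetical bipartite graph: nodes 1 and 2 have type 1, nodes 3 and 4 have type 2.
node_types = {1: 1, 2: 1, 3: 2, 4: 2}
neighbors = {1: [3, 4], 2: [3], 3: [1, 2], 4: [1]}
walks = meta_path_walk(neighbors, node_types, [1, 2], [2, 1, 2], random.Random(0))
```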

class mindspore.dataset.ImageFolderDataset(dataset_dir, num_samples=None, num_parallel_workers=None, shuffle=None, sampler=None, extensions=None, class_indexing=None, decode=False, num_shards=None, shard_id=None, cache=None)[source]

A source dataset that reads images from a tree of directories.

All images within one folder have the same label. The generated dataset has two columns [‘image’, ‘label’]. The shape of the image column is [image_size] if decode flag is False, or [H,W,C] otherwise. The type of the image tensor is uint8. The label is a scalar int32 tensor. This dataset can take in a sampler. ‘sampler’ and ‘shuffle’ are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using ‘sampler’ and ‘shuffle’

  Parameter ‘sampler’ | Parameter ‘shuffle’ | Expected Order Behavior
  --------------------+---------------------+-------------------------
  None                | None                | random order
  None                | True                | random order
  None                | False               | sequential order
  Sampler object      | None                | order defined by sampler
  Sampler object      | True                | not allowed
  Sampler object      | False               | not allowed
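
The combinations in the table can be encoded as a small decision function. `resolve_order` is a hypothetical helper written for illustration; it mirrors the table and the RuntimeError listed under Raises, but it is not the library's validation code.

```python
def resolve_order(sampler=None, shuffle=None):
    """Return the expected order behavior for a sampler/shuffle combination."""
    if sampler is not None:
        if shuffle is not None:
            # Any explicit shuffle value conflicts with a sampler.
            raise RuntimeError("sampler and shuffle are mutually exclusive")
        return "order defined by sampler"
    return "sequential order" if shuffle is False else "random order"
```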

Parameters
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • num_samples (int, optional) – The number of images to be included in the dataset (default=None, all images).

  • num_parallel_workers (int, optional) – Number of workers to read the data (default=None, set in the config).

  • shuffle (bool, optional) – Whether or not to perform shuffle on the dataset (default=None, expected order behavior shown in the table).

  • sampler (Sampler, optional) – Object used to choose samples from the dataset (default=None, expected order behavior shown in the table).

  • extensions (list[str], optional) – List of file extensions to be included in the dataset (default=None).

  • class_indexing (dict, optional) – A str-to-int mapping from folder name to index (default=None, the folder names will be sorted alphabetically and each class will be given a unique index starting from 0).

  • decode (bool, optional) – Decode the images after reading (default=False).

  • num_shards (int, optional) – Number of shards that the dataset will be divided into (default=None).

  • shard_id (int, optional) – The shard ID within num_shards (default=None). This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) – Tensor cache to use. (default=None which means no cache is used). The cache feature is under development and is not recommended.

Raises
  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and sharding are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • RuntimeError – If class_indexing is not a dictionary.

  • ValueError – If shard_id is invalid (< 0 or >= num_shards).

Examples

>>> import mindspore.dataset as ds
>>>
>>> # Set path to the imagefolder directory.
>>> # This directory needs to contain sub-directories which contain the images
>>> dataset_dir = "/path/to/imagefolder_directory"
>>>
>>> # 1) Read all samples (image files) in dataset_dir with 8 threads
>>> imagefolder_dataset = ds.ImageFolderDataset(dataset_dir, num_parallel_workers=8)
>>>
>>> # 2) Read all samples (image files) from folder cat and folder dog with label 0 and 1
>>> imagefolder_dataset = ds.ImageFolderDataset(dataset_dir, class_indexing={"cat":0, "dog":1})
>>>
>>> # 3) Read all samples (image files) in dataset_dir with extensions .JPEG and .png (case sensitive)
>>> imagefolder_dataset = ds.ImageFolderDataset(dataset_dir, extensions=[".JPEG", ".png"])
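When class_indexing is left as None, the folder names are sorted alphabetically and numbered from 0. A minimal sketch of that default mapping, using a hypothetical `default_class_indexing` helper and made-up folder names:

```python
def default_class_indexing(folder_names):
    """Map folder names to indices in alphabetical order, starting from 0."""
    return {name: index for index, name in enumerate(sorted(folder_names))}

mapping = default_class_indexing(["dog", "cat", "bird"])
```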
apply(apply_func)

Apply a function in this dataset.

Parameters

apply_func (function) – A function that must take one ‘Dataset’ as an argument and return a preprocessed ‘Dataset’.

Returns

Dataset, applied by the function.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>>
>>> # Declare an apply_func function which returns a Dataset object
>>> def apply_func(ds):
>>>     ds = ds.batch(2)
>>>     return ds
>>>
>>> # Use apply to call apply_func
>>> data = data.apply(apply_func)
Raises
  • TypeError – If apply_func is not a function.

  • TypeError – If apply_func doesn’t return a Dataset.

batch(batch_size, drop_remainder=False, num_parallel_workers=None, per_batch_map=None, input_columns=None, output_columns=None, column_order=None, pad_info=None)

Combine batch_size number of consecutive rows into batches.

For any child node, a batch is treated as a single row. For any column, all the elements within that column must have the same shape. If a per_batch_map callable is provided, it will be applied to the batches of tensors.

Note

The order of using repeat and batch affects the number of batches and the behavior of per_batch_map. It is recommended that the repeat operation be used after the batch operation.

Parameters
  • batch_size (int or function) – The number of rows each batch is created with. An int or callable which takes exactly 1 parameter, BatchInfo.

  • drop_remainder (bool, optional) – Determines whether or not to drop the last possibly incomplete batch (default=False). If True, and if there are fewer than batch_size rows available to make the last batch, then those rows will be dropped and not propagated to the child node.

  • num_parallel_workers (int, optional) – Number of workers to process the dataset in parallel (default=None).

  • per_batch_map (callable, optional) – Per batch map callable. A callable which takes (list[Tensor], list[Tensor], …, BatchInfo) as input parameters. Each list[Tensor] represents a batch of Tensors on a given column. The number of lists should match with number of entries in input_columns. The last parameter of the callable should always be a BatchInfo object.

  • input_columns (list[str], optional) – List of names of the input columns. The size of the list should match with signature of the per_batch_map callable.

  • output_columns (list[str], optional) – [Not currently implemented] List of names assigned to the columns outputted by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. (default=None, output columns will have the same name as the input columns, i.e., the columns will be replaced).

  • column_order (list[str], optional) – [Not currently implemented] List of all the desired columns to propagate to the child node. This list must be a subset of all the columns in the dataset after all operations are applied. The order of the columns in each row propagated to the child node follow the order they appear in this list. The parameter is mandatory if the len(input_columns) != len(output_columns). (default=None, all columns will be propagated to the child node, the order of the columns will remain the same).

  • pad_info (dict, optional) – Whether to perform padding on selected columns. pad_info={“col1”:([224,224],0)} would pad column with name “col1” to a tensor of size [224,224] and fill the missing with 0.

Returns

BatchDataset, dataset batched.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>>
>>> # Create a dataset where every 100 rows is combined into a batch
>>> # and drops the last incomplete batch if there is one.
>>> data = data.batch(100, True)
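The effect of drop_remainder, and the reason repeat is recommended after batch, can be seen on a five-row list. The `batch` and `repeat` helpers below are hypothetical pure-Python stand-ins, not MindSpore operations.

```python
def batch(rows, batch_size, drop_remainder=False):
    """Group consecutive rows into batches, optionally dropping a short last batch."""
    rows = list(rows)
    batches = [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]
    if drop_remainder and batches and len(batches[-1]) < batch_size:
        batches.pop()
    return batches

def repeat(items, count):
    """Repeat a sequence count times."""
    return list(items) * count

rows = [0, 1, 2, 3, 4]
batch_then_repeat = repeat(batch(rows, 2), 2)   # 3 batches per epoch, 6 in total
repeat_then_batch = batch(repeat(rows, 2), 2)   # 5 batches, one mixes two epochs
dropped = batch(rows, 2, drop_remainder=True)
```

In this sketch, repeating before batching produces the batch [4, 0], which mixes the end of one epoch with the start of the next.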
bucket_batch_by_length(column_names, bucket_boundaries, bucket_batch_sizes, element_length_function=None, pad_info=None, pad_to_bucket_boundary=False, drop_remainder=False)

Bucket elements according to their lengths. Each bucket will be padded and batched when they are full.

A length function is called on each row in the dataset. The row is then bucketed based on its length and bucket_boundaries. When a bucket reaches its corresponding size specified in bucket_batch_sizes, the entire bucket will be padded according to pad_info, and then batched. Each batch will be full, except for maybe the last batch for each bucket.

Parameters
  • column_names (list[str]) – Columns passed to element_length_function.

  • bucket_boundaries (list[int]) – A list consisting of the upper boundaries of the buckets. Must be strictly increasing. If there are n boundaries, n+1 buckets are created: One bucket for [0, bucket_boundaries[0]), one bucket for [bucket_boundaries[i], bucket_boundaries[i+1]) for each 0 <= i < n-1, and one bucket for [bucket_boundaries[n-1], inf).

  • bucket_batch_sizes (list[int]) – A list consisting of the batch sizes for each bucket. Must contain len(bucket_boundaries)+1 elements.

  • element_length_function (Callable, optional) – A function that takes in len(column_names) arguments and returns an int. If no value is provided, then len(column_names) must be 1, and the size of the first dimension of that column will be taken as the length (default=None).

  • pad_info (dict, optional) – Represents how to batch each column. The key corresponds to the column name, and the value must be a tuple of 2 elements. The first element corresponds to the shape to pad to, and the second element corresponds to the value to pad with. If a column is not specified, then that column will be padded to the longest in the current batch, and 0 will be used as the padding value. Any None dimensions will be padded to the longest in the current batch, unless pad_to_bucket_boundary is True. If no padding is wanted, set pad_info to None (default=None).

  • pad_to_bucket_boundary (bool, optional) – If True, will pad each None dimension in pad_info to the bucket_boundary minus 1. If there are any elements that fall into the last bucket, an error will occur (default=False).

  • drop_remainder (bool, optional) – If True, will drop the last batch for each bucket if it is not a full batch (default=False).

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>>
>>> # Create a dataset where rows are bucketed by length, and each bucket
>>> # is padded and batched once it is full.
>>> column_names = ["col1", "col2"]
>>> bucket_boundaries = [5, 10]
>>> bucket_batch_sizes = [5, 1, 1]
>>> element_length_function = (lambda col1, col2: max(len(col1), len(col2)))
>>>
>>> # Will pad col1 to shape [2, bucket_boundaries[i]] where i is the
>>> # index of the bucket that is currently being batched.
>>> # Will pad col2 to a shape where each dimension is the longest in all
>>> # the elements currently being batched.
>>> pad_info = {"col1": ([2, None], -1)}
>>> pad_to_bucket_boundary = True
>>>
>>> data = data.bucket_batch_by_length(column_names, bucket_boundaries,
>>>                                    bucket_batch_sizes,
>>>                                    element_length_function, pad_info,
>>>                                    pad_to_bucket_boundary)
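How a length maps to a bucket, and when a bucket is emitted, can be sketched with the standard bisect module. `bucket_index` and `bucket_batches` are hypothetical helpers; padding is omitted, and emitting leftover buckets at the end is an assumption that corresponds to drop_remainder=False.

```python
from bisect import bisect_right

def bucket_index(length, bucket_boundaries):
    """Return the index of the bucket that an element of this length falls into."""
    return bisect_right(bucket_boundaries, length)

def bucket_batches(lengths, bucket_boundaries, bucket_batch_sizes):
    """Group element lengths into batches, emitting a bucket's batch when it is full."""
    buckets = [[] for _ in range(len(bucket_boundaries) + 1)]
    batches = []
    for length in lengths:
        i = bucket_index(length, bucket_boundaries)
        buckets[i].append(length)
        if len(buckets[i]) == bucket_batch_sizes[i]:
            batches.append(buckets[i])
            buckets[i] = []
    # Leftover, partially filled buckets become the final (short) batches.
    batches.extend(b for b in buckets if b)
    return batches

batches = bucket_batches([1, 7, 3, 12, 4, 2], [5, 10], [3, 1, 1])
```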
concat(datasets)

Concatenate the datasets in the input list of datasets. The “+” operator is also supported to concatenate.

Note

The column name, and rank and type of the column data must be the same in the input datasets.

Parameters

datasets (Union[list, class Dataset]) – A list of datasets or a single class Dataset to be concatenated together with this dataset.

Returns

ConcatDataset, dataset concatenated.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # ds1 and ds2 are instances of Dataset object
>>>
>>> # Create a dataset by concatenating ds1 and ds2 with "+" operator
>>> data1 = ds1 + ds2
>>> # Create a dataset by concatenating ds1 and ds2 with concat operation
>>> data1 = ds1.concat(ds2)
create_dict_iterator(num_epochs=-1, output_numpy=False)

Create an iterator over the dataset. The data retrieved will be a dictionary.

The order of the columns in the dictionary may not be the same as the original order.

Parameters
  • num_epochs (int, optional) – Maximum number of epochs that the iterator can be iterated (default=-1, the iterator can be iterated for an infinite number of epochs).

  • output_numpy (bool, optional) – Whether or not to output NumPy datatype, if output_numpy=False, iterator will output MSTensor (default=False).

Returns

Iterator, dictionary of column name-ndarray pair.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>>
>>> # create an iterator
>>> # The order of the columns in the data obtained by the iterator might be changed.
>>> iterator = data.create_dict_iterator()
>>> for item in iterator:
>>>     # print the data in column1
>>>     print(item["column1"])
create_tuple_iterator(columns=None, num_epochs=-1, output_numpy=False)

Create an iterator over the dataset. The data retrieved will be a list of ndarrays of data.

To specify which columns to list and the order needed, use columns. If columns is not provided, the order of the columns will not be changed.

Parameters
  • columns (list[str], optional) – List of columns to be used to specify the order of columns (default=None, means all columns).

  • num_epochs (int, optional) – Maximum number of epochs that the iterator can be iterated (default=-1, the iterator can be iterated for an infinite number of epochs).

  • output_numpy (bool, optional) – Whether or not to output NumPy datatype. If output_numpy=False, iterator will output MSTensor (default=False).

Returns

Iterator, list of ndarrays.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>>
>>> # Create an iterator
>>> # The columns in the data obtained by the iterator will not be changed.
>>> iterator = data.create_tuple_iterator()
>>> for item in iterator:
>>>     # convert the returned tuple to a list and print
>>>     print(list(item))
device_que(prefetch_size=None, send_epoch_end=True)

Return a transferred Dataset that transfers data through a device.

Parameters
  • prefetch_size (int, optional) – Prefetch number of records ahead of the user’s request (default=None).

  • send_epoch_end (bool, optional) – Whether to send end of sequence to device or not (default=True).

Note

If the device is Ascend, features of data will be transferred one by one. Each transfer is limited to 256M of data.

Returns

TransferDataset, dataset for transferring.

filter(predicate, input_columns=None, num_parallel_workers=1)

Filter dataset by predicate.

Note

If input_columns is not provided or is empty, all columns will be used.

Parameters
  • predicate (callable) – Python callable which returns a boolean value. Elements for which it returns False are filtered out.

  • input_columns (list[str], optional) – List of names of the input columns (default=None, the predicate will be applied on all columns in the dataset).

  • num_parallel_workers (int, optional) – Number of workers to process the dataset in parallel (default=1).

Returns

FilterDataset, dataset filtered.

Examples

>>> import mindspore.dataset as ds
>>> # dataset is an instance of Dataset whose "data" column holds the values 0 ~ 63
>>> # filter out the data that is greater than or equal to 11
>>> dataset_f = dataset.filter(predicate=lambda data: data < 11, input_columns = ["data"])
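The same predicate can be tried on plain Python rows to confirm which elements survive. `filter_rows` is a hypothetical helper that models rows as dictionaries; it is a sketch, not the library's implementation.

```python
rows = [{"data": value} for value in range(64)]

def filter_rows(rows, predicate, input_columns):
    """Keep the rows for which predicate(*selected columns) is True."""
    return [row for row in rows if predicate(*(row[c] for c in input_columns))]

kept = filter_rows(rows, lambda data: data < 11, ["data"])
```

Only the rows with values 0 through 10 remain, because the predicate returns False for everything else.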
flat_map(func)

Map func to each row in dataset and flatten the result.

The specified func is a function that must take one ‘Ndarray’ as input and return a ‘Dataset’.

Parameters

func (function) – A function that must take one ‘Ndarray’ as an argument and return a ‘Dataset’.

Returns

Dataset, applied by the function.

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>>
>>> # Declare a function which returns a Dataset object
>>> def flat_map_func(x):
>>>     data_dir = text.to_str(x[0])
>>>     d = ds.ImageFolderDataset(data_dir)
>>>     return d
>>> # data is an instance of a Dataset object.
>>> data = ds.TextFileDataset(DATA_FILE)
>>> data = data.flat_map(flat_map_func)
Raises
  • TypeError – If func is not a function.

  • TypeError – If func doesn’t return a Dataset.
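The map-then-flatten behavior can be sketched with itertools. In this illustration, `flat_map` is a hypothetical helper, lists stand in for Dataset objects, and the directory contents are made up.

```python
from itertools import chain

def flat_map(rows, func):
    """Apply func to every row and concatenate the resulting sequences."""
    return list(chain.from_iterable(func(row) for row in rows))

# Hypothetical stand-in: each input row names a directory, and func
# returns that directory's "dataset" as a list of file names.
directories = {"cats": ["c0.png", "c1.png"], "dogs": ["d0.png"]}
flattened = flat_map(["cats", "dogs"], lambda name: directories[name])
```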

get_batch_size()

Get the size of a batch.

Returns

Number, the number of data in a batch.

get_col_names()

Get the names of the columns in the dataset.

get_dataset_size()[source]

Get the number of batches in an epoch.

Returns

Number, number of batches.

get_repeat_count()

Get the repeat count of the RepeatDataset, or 1 if the dataset is not repeated.

Returns

Number, the count of repeat.

map(operations=None, input_columns=None, output_columns=None, column_order=None, num_parallel_workers=None, python_multiprocessing=False, cache=None, callbacks=None)

Apply each operation in operations to this dataset.

The order of operations is determined by the position of each operation in the operations parameter. operations[0] will be applied first, then operations[1], then operations[2], etc.

Each operation will be passed one or more columns from the dataset as input, and zero or more columns will be outputted. The first operation will be passed the columns specified in input_columns as input. If there is more than one operator in operations, the outputted columns of the previous operation are used as the input columns for the next operation. The columns outputted by the very last operation will be assigned names specified by output_columns.

Only the columns specified in column_order will be propagated to the child node. These columns will be in the same order as specified in column_order.

Parameters
  • operations (Union[list[TensorOp], list[functions]]) – List of operations to be applied on the dataset. Operations are applied in the order they appear in this list.

  • input_columns (list[str]) – List of the names of the columns that will be passed to the first operation as input. The size of this list must match the number of input columns expected by the first operator. (default=None, the first operation will be passed as many columns as it requires, starting from the first column).

  • output_columns (list[str], optional) – List of names assigned to the columns outputted by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. (default=None, output columns will have the same name as the input columns, i.e., the columns will be replaced).

  • column_order (list[str], optional) – List of all the desired columns to propagate to the child node. This list must be a subset of all the columns in the dataset after all operations are applied. The order of the columns in each row propagated to the child node follow the order they appear in this list. The parameter is mandatory if the len(input_columns) != len(output_columns). (default=None, all columns will be propagated to the child node, the order of the columns will remain the same).

  • num_parallel_workers (int, optional) – Number of threads used to process the dataset in parallel (default=None, the value from the configuration will be used).

  • python_multiprocessing (bool, optional) – Parallelize Python operations with multiple worker processes. This option could be beneficial if the Python operation is computational heavy (default=False).

  • cache (DatasetCache, optional) – Tensor cache to use. (default=None which means no cache is used). The cache feature is under development and is not recommended.

  • callbacks (DSCallback, list[DSCallback], optional) – List of Dataset callbacks to be called (default=None).

Returns

MapDataset, dataset after mapping operation.

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.vision.c_transforms as c_transforms
>>>
>>> # data is an instance of Dataset which has 2 columns, "image" and "label".
>>> # ds_pyfunc is an instance of Dataset which has 3 columns, "col0", "col1", and "col2".
>>> # Each column is a 2D array of integers.
>>>
>>> # Set the global configuration value for num_parallel_workers to be 2.
>>> # Operations which use this configuration value will use 2 worker threads,
>>> # unless otherwise specified in the operator's constructor.
>>> # set_num_parallel_workers can be called again later if a different
>>> # global configuration value for the number of worker threads is desired.
>>> ds.config.set_num_parallel_workers(2)
>>>
>>> # Define two operations, where each operation accepts 1 input column and outputs 1 column.
>>> decode_op = c_transforms.Decode(rgb_format=True)
>>> random_jitter_op = c_transforms.RandomColorAdjust((0.8, 0.8), (1, 1), (1, 1), (0, 0))
>>>
>>> # 1) Simple map example
>>>
>>> operations = [decode_op]
>>> input_columns = ["image"]
>>>
>>> # Apply decode_op on column "image". This column will be replaced by the outputted
>>> # column of decode_op. Since column_order is not provided, both columns "image"
>>> # and "label" will be propagated to the child node in their original order.
>>> ds_decoded = data.map(operations, input_columns)
>>>
>>> # Rename column "image" to "decoded_image".
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(operations, input_columns, output_columns)
>>>
>>> # Specify the order of the columns.
>>> column_order = ["label", "image"]
>>> ds_decoded = data.map(operations, input_columns, None, column_order)
>>>
>>> # Rename column "image" to "decoded_image" and also specify the order of the columns.
>>> column_order = ["label", "decoded_image"]
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(operations, input_columns, output_columns, column_order)
>>>
>>> # Rename column "image" to "decoded_image" and keep only this column.
>>> column_order = ["decoded_image"]
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(operations, input_columns, output_columns, column_order)
>>>
>>> # A simple example using pyfunc: Renaming columns and specifying column order
>>> # work in the same way as the previous examples.
>>> input_columns = ["col0"]
>>> operations = [(lambda x: x + 1)]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns)
>>>
>>> # 2) Map example with more than one operation
>>>
>>> # If this list of operations is used with map, decode_op will be applied
>>> # first, then random_jitter_op will be applied.
>>> operations = [decode_op, random_jitter_op]
>>>
>>> input_columns = ["image"]
>>>
>>> # Create a dataset where the images are decoded, then randomly color jittered.
>>> # decode_op takes column "image" as input and outputs one column. The column
>>> # outputted by decode_op is passed as input to random_jitter_op.
>>> # random_jitter_op will output one column. Column "image" will be replaced by
>>> # the column outputted by random_jitter_op (the very last operation). All other
>>> # columns are unchanged. Since column_order is not specified, the order of the
>>> # columns will remain the same.
>>> ds_mapped = data.map(operations, input_columns)
>>>
>>> # Create a dataset that is identical to ds_mapped, except the column "image"
>>> # that is outputted by random_jitter_op is renamed to "image_transformed".
>>> # Specifying column order works in the same way as examples in 1).
>>> output_columns = ["image_transformed"]
>>> ds_mapped_and_renamed = data.map(operations, input_columns, output_columns)
>>>
>>> # Multiple operations using pyfunc: Renaming columns and specifying column order
>>> # work in the same way as examples in 1).
>>> input_columns = ["col0"]
>>> operations = [(lambda x: x + x), (lambda x: x - 1)]
>>> output_columns = ["col0_mapped"]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns, output_columns)
>>>
>>> # 3) Example where number of input columns is not equal to number of output columns
>>>
>>> # operations[0] is a lambda that takes 2 columns as input and outputs 3 columns.
>>> # operations[1] is a lambda that takes 3 columns as input and outputs 1 column.
>>> # operations[2] is a lambda that takes 1 column as input and outputs 4 columns.
>>> #
>>> # Note: The number of output columns of operations[i] must equal the number of
>>> # input columns of operations[i+1]. Otherwise, this map call will result
>>> # in an error.
>>> operations = [(lambda x, y: (x, x + y, x + y + 1)),
>>>               (lambda x, y, z: x * y * z),
>>>               (lambda x: (x % 2, x % 3, x % 5, x % 7))]
>>>
>>> # Note: Since the number of input columns is not the same as the number of
>>> # output columns, the output_columns and column_order parameters must be
>>> # specified. Otherwise, this map call will also result in an error.
>>> input_columns = ["col2", "col0"]
>>> output_columns = ["mod2", "mod3", "mod5", "mod7"]
>>>
>>> # Propagate all columns to the child node in this order:
>>> column_order = ["col0", "col2", "mod2", "mod3", "mod5", "mod7", "col1"]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns, output_columns, column_order)
>>>
>>> # Propagate some columns to the child node in this order:
>>> column_order = ["mod7", "mod3", "col1"]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns, output_columns, column_order)
num_classes()[source]

Get the number of classes in dataset.

Returns

Number, number of classes.

output_shapes()

Get the shapes of output data.

Returns

List, list of shapes of each column.

output_types()

Get the types of output data.

Returns

List of data types.

project(columns)

Project certain columns in input dataset.

The specified columns will be selected from the dataset and passed down the pipeline in the order specified. The other columns are discarded.

Parameters

columns (list[str]) – List of names of the columns to project.

Returns

ProjectDataset, dataset projected.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>> columns_to_project = ["column3", "column1", "column2"]
>>>
>>> # Create a dataset that consists of column3, column1, column2
>>> # in that order, regardless of the original order of columns.
>>> data = data.project(columns=columns_to_project)
rename(input_columns, output_columns)

Rename the columns in input datasets.

Parameters
  • input_columns (list[str]) – List of names of the input columns.

  • output_columns (list[str]) – List of names of the output columns.

Returns

RenameDataset, dataset renamed.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> input_columns = ["input_col1", "input_col2", "input_col3"]
>>> output_columns = ["output_col1", "output_col2", "output_col3"]
>>>
>>> # Create a dataset where input_col1 is renamed to output_col1, and
>>> # input_col2 is renamed to output_col2, and input_col3 is renamed
>>> # to output_col3.
>>> data = data.rename(input_columns=input_columns, output_columns=output_columns)
repeat(count=None)

Repeat this dataset count times. Repeat indefinitely if the count is None or -1.

Note

The order of using repeat and batch affects the number of batches. It is recommended that the repeat operation be used after the batch operation. If dataset_sink_mode is False, the repeat operation is invalid. If dataset_sink_mode is True, the repeat count must equal the number of training epochs. Otherwise, errors could occur because the amount of data does not match the amount that training requires.
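
As a rough illustration of why the order matters, the plain-Python sketch below compares the two orders for a 10-row dataset and a batch size of 4. It does not use MindSpore: `batch` and `repeat` here are hypothetical helpers that only mimic how rows and batches are counted.

```python
def batch(rows, batch_size, drop_remainder=False):
    """Group consecutive rows into lists of at most batch_size rows."""
    batches = [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]
    if drop_remainder and batches and len(batches[-1]) < batch_size:
        batches.pop()
    return batches


def repeat(items, count):
    """Concatenate count copies of items."""
    return list(items) * count


rows = list(range(10))

# Batch first, then repeat: every epoch keeps its own incomplete last batch.
batch_then_repeat = repeat(batch(rows, 4), 2)

# Repeat first, then batch: a batch may span the boundary between two epochs.
repeat_then_batch = batch(repeat(rows, 2), 4)

print(len(batch_then_repeat))  # 6
print(len(repeat_then_batch))  # 5
print(repeat_then_batch[2])    # [8, 9, 0, 1]
```

In this toy model, batching before repeating keeps every batch inside a single epoch, which matches the recommendation in the note.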

Parameters

count (int) – Number of times the dataset is repeated (default=None).

Returns

RepeatDataset, dataset repeated.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>>
>>> # Create a dataset where the dataset is repeated for 50 epochs
>>> repeated = data.repeat(50)
>>>
>>> # Create a dataset where each epoch is shuffled individually
>>> shuffled_and_repeated = data.shuffle(10)
>>> shuffled_and_repeated = shuffled_and_repeated.repeat(50)
>>>
>>> # Create a dataset where the dataset is first repeated for
>>> # 50 epochs before shuffling. The shuffle operator will treat
>>> # the entire 50 epochs as one big dataset.
>>> repeat_and_shuffle = data.repeat(50)
>>> repeat_and_shuffle = repeat_and_shuffle.shuffle(10)
reset()

Reset the dataset for next epoch.

save(file_name, num_files=1, file_type='mindrecord')

Save the dynamic data processed by the dataset pipeline in a common dataset format. Supported dataset formats: ‘mindrecord’ only.

Implicit type casting exists when saving data as ‘mindrecord’. The table below shows how to do type casting.

Implicit Type Casting when Saving as ‘mindrecord’

Type in ‘dataset’    Type in ‘mindrecord’    Details
bool                 None                    Not supported
int8                 int32
uint8                bytes(1D uint8)         Drop dimension
int16                int32
uint16               int32
int32                int32
uint32               int64
int64                int64
uint64               None                    Not supported
float16              float32
float32              float32
float64              float64
string               string                  Multi-dimensional string not supported

Note

  1. To save the samples in order, set dataset’s shuffle to False and num_files to 1.

  2. Before calling the function, do not use batch operator, repeat operator or data augmentation operators with random attribute in map operator.

  3. Mindrecord does not support DE_UINT64, multi-dimensional DE_UINT8(drop dimension) nor multi-dimensional DE_STRING.
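
The casting table above can be expressed as a simple lookup, for example to check a column type before calling save. This is only a sketch: `MINDRECORD_CAST` and `mindrecord_type` are hypothetical names, not part of the library, and the type names are plain strings copied from the table.

```python
# Mapping copied from the casting table; None marks an unsupported type.
MINDRECORD_CAST = {
    "bool": None,
    "int8": "int32",
    "uint8": "bytes(1D uint8)",
    "int16": "int32",
    "uint16": "int32",
    "int32": "int32",
    "uint32": "int64",
    "int64": "int64",
    "uint64": None,
    "float16": "float32",
    "float32": "float32",
    "float64": "float64",
    "string": "string",
}


def mindrecord_type(dataset_type):
    """Return the type a column is stored as, per the table."""
    if dataset_type not in MINDRECORD_CAST:
        raise KeyError(f"type not listed in the table: {dataset_type}")
    target = MINDRECORD_CAST[dataset_type]
    if target is None:
        raise TypeError(f"{dataset_type} is not supported by 'mindrecord'")
    return target


print(mindrecord_type("int16"))   # int32
print(mindrecord_type("uint32"))  # int64
```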

Parameters
  • file_name (str) – Path to dataset file.

  • num_files (int, optional) – Number of dataset files (default=1).

  • file_type (str, optional) – Dataset format (default=’mindrecord’).

shuffle(buffer_size)

Randomly shuffles the rows of this dataset using the following algorithm:

  1. Make a shuffle buffer that contains the first buffer_size rows.

  2. Randomly select an element from the shuffle buffer to be the next row propagated to the child node.

  3. Get the next row (if any) from the parent node and put it in the shuffle buffer.

  4. Repeat steps 2 and 3 until there are no more rows left in the shuffle buffer.

A seed can be provided to be used on the first epoch. In every subsequent epoch, the seed is changed to a new, randomly generated value.
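
The four steps above can be sketched in plain Python as follows. This is an illustrative model of the described algorithm, not the library's implementation: `buffer_shuffle` is a hypothetical helper, and the random number generator (and therefore the exact output order for a given seed) is an assumption.

```python
import random


def buffer_shuffle(rows, buffer_size, seed=None):
    """Yield rows in an order produced by the shuffle-buffer steps."""
    if buffer_size < 2:
        raise ValueError("buffer_size must be larger than 1")
    rng = random.Random(seed)
    source = iter(rows)

    # Step 1: fill the buffer with the first buffer_size rows.
    buffer = []
    for row in source:
        buffer.append(row)
        if len(buffer) == buffer_size:
            break

    # Step 4: repeat steps 2 and 3 until the buffer is empty.
    while buffer:
        # Step 2: pick a random element of the buffer as the next output row.
        index = rng.randrange(len(buffer))
        chosen = buffer[index]
        # Step 3: refill the freed slot with the next incoming row, if any.
        try:
            buffer[index] = next(source)
        except StopIteration:
            buffer.pop(index)
        yield chosen


shuffled = list(buffer_shuffle(range(10), buffer_size=4, seed=58))
print(sorted(shuffled) == list(range(10)))  # True: every row appears exactly once
```

Because a row can only be emitted once it has entered the buffer, a small buffer_size only shuffles rows locally. When buffer_size equals the number of rows, every row is in the buffer from the start, which corresponds to the global shuffle mentioned for the buffer_size parameter.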

Parameters

buffer_size (int) – The size of the buffer (must be larger than 1) for shuffling. Setting buffer_size equal to the number of rows in the entire dataset will result in a global shuffle.

Returns

ShuffleDataset, dataset shuffled.

Raises

RuntimeError – If sync operators exist before shuffle.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> # Optionally set the seed for the first epoch
>>> ds.config.set_seed(58)
>>>
>>> # Create a shuffled dataset using a shuffle buffer of size 4
>>> data = data.shuffle(4)
skip(count)

Skip the first N elements of this dataset.

Parameters

count (int) – Number of elements in the dataset to be skipped.

Returns

SkipDataset, dataset skipped.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> # Create a dataset which skips first 3 elements from data
>>> data = data.skip(3)
split(sizes, randomize=True)

Split the dataset into smaller, non-overlapping datasets.

Parameters
  • sizes (Union[list[int], list[float]]) –

    If a list of integers [s1, s2, …, sn] is provided, the dataset will be split into n datasets of size s1, size s2, …, size sn respectively. If the sum of all sizes does not equal the original dataset size, an error will occur. If a list of floats [f1, f2, …, fn] is provided, all floats must be between 0 and 1 and must sum to 1, otherwise an error will occur. The dataset will be split into n Datasets of size round(f1*K), round(f2*K), …, round(fn*K) where K is the size of the original dataset. If after rounding:

    • Any size equals 0, an error will occur.

    • The sum of split sizes < K, the difference will be added to the first split.

    • The sum of split sizes > K, the difference will be removed from the first large enough split such that it will have at least 1 row after removing the difference.

  • randomize (bool, optional) – Determines whether or not to split the data randomly (default=True). If True, the data will be randomly split. Otherwise, each split will be created with consecutive rows from the dataset.
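
The rounding rules for a list of floats can be sketched in plain Python as below. This is a hedged illustration, not the library's code: `split_sizes` is a hypothetical helper, treating the bounds as 0 < f <= 1 is an assumption, and Python's built-in round (which rounds halves to the nearest even integer) may not match the library's rounding in every case.

```python
import math


def split_sizes(fractions, dataset_size):
    """Compute split sizes from fractions following the documented rules."""
    if not all(0 < f <= 1 for f in fractions):
        raise ValueError("every fraction must be between 0 and 1")
    if not math.isclose(sum(fractions), 1.0):
        raise ValueError("fractions must sum to 1")

    sizes = [round(f * dataset_size) for f in fractions]
    if any(size == 0 for size in sizes):
        raise RuntimeError("a split has size 0 after rounding")

    difference = dataset_size - sum(sizes)
    if difference > 0:
        # Too few rows assigned: add the missing rows to the first split.
        sizes[0] += difference
    elif difference < 0:
        # Too many rows assigned: shrink the first split that still keeps
        # at least 1 row after the excess is removed.
        excess = -difference
        for i, size in enumerate(sizes):
            if size - excess >= 1:
                sizes[i] -= excess
                break
    return sizes


print(split_sizes([0.9, 0.1], 10))  # [9, 1]
print(split_sizes([0.5, 0.5], 5))   # [3, 2]
print(split_sizes([0.5, 0.5], 3))   # [1, 2]
```

In the second call both halves round to 2, so the missing row goes to the first split. In the third call both halves round to 2, so the extra row is removed from the first split.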

Note

  1. There is an optimized split function, which will be called automatically when the dataset that calls this function is a MappableDataset.

  2. Dataset should not be sharded if split is going to be called. Instead, create a DistributedSampler and specify a split to shard after splitting. If dataset is sharded after a split, it is strongly recommended to set the same seed in each instance of execution, otherwise each shard may not be part of the same split (see Examples).

  3. It is strongly recommended to not shuffle the dataset, but use randomize=True instead. Shuffling the dataset may not be deterministic, which means the data in each split will be different in each epoch. Furthermore, if sharding occurs after split, each shard may not be part of the same split.

Raises
  • RuntimeError – If get_dataset_size returns None or is not supported for this dataset.

  • RuntimeError – If sizes is list of integers and sum of all elements in sizes does not equal the dataset size.

  • RuntimeError – If sizes is list of float and there is a split with size 0 after calculations.

  • RuntimeError – If the dataset is sharded prior to calling split.

  • ValueError – If sizes is list of float and not all floats are between 0 and 1, or if the floats don’t sum to 1.

Returns

tuple(Dataset), a tuple of datasets that have been split.

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_dir = "/path/to/imagefolder_directory"
>>>
>>> # Since many datasets have shuffle on by default, set shuffle to False if split will be called!
>>> data = ds.ImageFolderDataset(dataset_dir, shuffle=False)
>>>
>>> # Set the seed, and tell split to use this seed when randomizing.
>>> # This is needed because sharding will be done later
>>> ds.config.set_seed(58)
>>> train, test = data.split([0.9, 0.1])
>>>
>>> # To shard the train dataset, use a DistributedSampler
>>> train_sampler = ds.DistributedSampler(10, 2)
>>> train.use_sampler(train_sampler)
sync_update(condition_name, num_batch=None, data=None)

Release a blocking condition and trigger callback with given data.

Parameters
  • condition_name (str) – The condition name that is used to toggle sending next row.

  • num_batch (Union[int, None]) – The number of batches (rows) that are released. When num_batch is None, it will default to the number specified by the sync_wait operator (default=None).

  • data (Union[dict, None]) – The data passed to the callback (default=None).

sync_wait(condition_name, num_batch=1, callback=None)

Add a blocking condition to the input Dataset.

Parameters
  • num_batch (int) – the number of batches without blocking at the start of each epoch.

  • condition_name (str) – The condition name that is used to toggle sending next row.

  • callback (function) – The callback function that will be invoked when sync_update is called.

Raises

RuntimeError – If condition name already exists.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> data = data.sync_wait("callback1")
>>> data = data.batch(batch_size)
>>> for batch_data in data.create_dict_iterator():
>>>     data = data.sync_update("callback1")
take(count=-1)

Take at most the given number of elements from the dataset.

Note

  1. If count is greater than the number of elements in the dataset or equal to -1, all the elements in dataset will be taken.

  2. The order of using take and batch matters. If take is before batch operation, then take given number of rows; otherwise take given number of batches.
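
The second point can be illustrated with a small plain-Python model. `batch` and `take` below are hypothetical stand-ins that only mimic the counting behavior described in the note; they are not the library's operators.

```python
def batch(rows, batch_size):
    """Group consecutive rows into lists of at most batch_size rows."""
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]


def take(items, count=-1):
    """Keep at most count items; -1 or a too-large count keeps everything."""
    if count == -1 or count > len(items):
        return list(items)
    return list(items[:count])


rows = list(range(10))

# take before batch: 2 rows are taken, then grouped into one small batch.
print(batch(take(rows, 2), 4))  # [[0, 1]]

# take after batch: 2 whole batches are taken.
print(take(batch(rows, 4), 2))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```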

Parameters

count (int, optional) – Number of elements to be taken from the dataset (default=-1).

Returns

TakeDataset, dataset taken.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> # Create a dataset where the dataset includes 50 elements.
>>> data = data.take(50)
to_device(send_epoch_end=True)

Transfer data through CPU, GPU or Ascend devices.

Parameters

send_epoch_end (bool, optional) – Whether to send end of sequence to device or not (default=True).

Note

If device is Ascend, features of data will be transferred one by one. The limitation of data transmission per time is 256M.

Returns

TransferDataset, dataset for transferring.

Raises
  • TypeError – If device_type is empty.

  • ValueError – If device_type is not ‘Ascend’, ‘GPU’ or ‘CPU’.

  • RuntimeError – If dataset is unknown.

  • RuntimeError – If distribution file path is given but failed to read.

use_sampler(new_sampler)

Make the current dataset use the provided new_sampler.

Parameters

new_sampler (Sampler) – The sampler to use for the current dataset.

Returns

Dataset, that uses new_sampler.

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_dir = "/path/to/imagefolder_directory"
>>> # Note: A SequentialSampler is created by default
>>> data = ds.ImageFolderDataset(dataset_dir)
>>>
>>> # Use a DistributedSampler instead of the SequentialSampler
>>> new_sampler = ds.DistributedSampler(10, 2)
>>> data.use_sampler(new_sampler)
zip(datasets)

Zip the datasets in the input tuple of datasets. Columns in the input datasets must not have the same name.

Parameters

datasets (Union[tuple, class Dataset]) – A tuple of datasets or a single class Dataset to be zipped together with this dataset.

Returns

ZipDataset, dataset zipped.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # ds1 and ds2 are instances of Dataset object
>>> # Create a dataset which is the combination of ds1 and ds2
>>> data = ds1.zip(ds2)
class mindspore.dataset.ManifestDataset(dataset_file, usage='train', num_samples=None, num_parallel_workers=None, shuffle=None, sampler=None, class_indexing=None, decode=False, num_shards=None, shard_id=None)[source]

A source dataset that reads images from a manifest file.

The generated dataset has two columns [‘image’, ‘label’]. The shape of the image column is [image_size] if decode flag is False, or [H,W,C] otherwise. The type of the image tensor is uint8. The label is a scalar uint64 tensor. This dataset can take in a sampler. ‘sampler’ and ‘shuffle’ are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using ‘sampler’ and ‘shuffle’

Parameter ‘sampler’    Parameter ‘shuffle’    Expected Order Behavior
None                   None                   random order
None                   True                   random order
None                   False                  sequential order
Sampler object         None                   order defined by sampler
Sampler object         True                   not allowed
Sampler object         False                  not allowed

Parameters
  • dataset_file (str) – File to be read.

  • usage (str, optional) – acceptable usages include train, eval and inference (default=”train”).

  • num_samples (int, optional) – The number of images to be included in the dataset. (default=None, all images).

  • num_parallel_workers (int, optional) – Number of workers to read the data (default=None, number set in the config).

  • shuffle (bool, optional) – Whether to perform shuffle on the dataset (default=None, expected order behavior shown in the table).

  • sampler (Sampler, optional) – Object used to choose samples from the dataset (default=None, expected order behavior shown in the table).

  • class_indexing (dict, optional) – A str-to-int mapping from label name to index (default=None, the folder names will be sorted alphabetically and each class will be given a unique index starting from 0).

  • decode (bool, optional) – decode the images after reading (default=False).

  • num_shards (int, optional) – Number of shards that the dataset will be divided into (default=None).

  • shard_id (int, optional) – The shard ID within num_shards (default=None). This argument can only be specified when num_shards is also specified.

Raises
  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and sharding are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • RuntimeError – If class_indexing is not a dictionary.

  • ValueError – If shard_id is invalid (< 0 or >= num_shards).

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_file = "/path/to/manifest_file.manifest"
>>>
>>> # 1) Read all samples specified in manifest_file dataset with 8 threads for training
>>> manifest_dataset = ds.ManifestDataset(dataset_file, usage="train", num_parallel_workers=8)
>>>
>>> # 2) Read samples (specified in manifest_file.manifest) for shard 0
>>> # in a 2-way distributed training setup
>>> manifest_dataset = ds.ManifestDataset(dataset_file, num_shards=2, shard_id=0)
apply(apply_func)

Apply a function in this dataset.

Parameters

apply_func (function) – A function that must take one ‘Dataset’ as an argument and return a preprocessed ‘Dataset’.

Returns

Dataset, applied by the function.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>>
>>> # Declare an apply_func function which returns a Dataset object
>>> def apply_func(ds):
>>>     ds = ds.batch(2)
>>>     return ds
>>>
>>> # Use apply to call apply_func
>>> data = data.apply(apply_func)
Raises
  • TypeError – If apply_func is not a function.

  • TypeError – If apply_func doesn’t return a Dataset.

batch(batch_size, drop_remainder=False, num_parallel_workers=None, per_batch_map=None, input_columns=None, output_columns=None, column_order=None, pad_info=None)

Combine batch_size number of consecutive rows into batches.

For any child node, a batch is treated as a single row. For any column, all the elements within that column must have the same shape. If a per_batch_map callable is provided, it will be applied to the batches of tensors.

Note

The order of using repeat and batch affects the number of batches and per_batch_map. It is recommended that the repeat operation be used after the batch operation.

Parameters
  • batch_size (int or function) – The number of rows each batch is created with. An int or callable which takes exactly 1 parameter, BatchInfo.

  • drop_remainder (bool, optional) – Determines whether or not to drop the last possibly incomplete batch (default=False). If True, and if there are less than batch_size rows available to make the last batch, then those rows will be dropped and not propagated to the child node.

  • num_parallel_workers (int, optional) – Number of workers to process the dataset in parallel (default=None).

  • per_batch_map (callable, optional) – Per batch map callable. A callable which takes (list[Tensor], list[Tensor], …, BatchInfo) as input parameters. Each list[Tensor] represents a batch of Tensors on a given column. The number of lists should match with number of entries in input_columns. The last parameter of the callable should always be a BatchInfo object.

  • input_columns (list[str], optional) – List of names of the input columns. The size of the list should match with signature of the per_batch_map callable.

  • output_columns (list[str], optional) – [Not currently implemented] List of names assigned to the columns outputted by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. (default=None, output columns will have the same name as the input columns, i.e., the columns will be replaced).

  • column_order (list[str], optional) – [Not currently implemented] List of all the desired columns to propagate to the child node. This list must be a subset of all the columns in the dataset after all operations are applied. The order of the columns in each row propagated to the child node follows the order they appear in this list. The parameter is mandatory if len(input_columns) != len(output_columns). (default=None, all columns will be propagated to the child node, the order of the columns will remain the same).

  • pad_info (dict, optional) – Whether to perform padding on selected columns. pad_info={“col1”:([224,224],0)} would pad column with name “col1” to a tensor of size [224,224] and fill the missing with 0.

Returns

BatchDataset, dataset batched.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>>
>>> # Create a dataset where every 100 rows is combined into a batch
>>> # and drops the last incomplete batch if there is one.
>>> data = data.batch(100, True)
bucket_batch_by_length(column_names, bucket_boundaries, bucket_batch_sizes, element_length_function=None, pad_info=None, pad_to_bucket_boundary=False, drop_remainder=False)

Bucket elements according to their lengths. Each bucket will be padded and batched when it is full.

A length function is called on each row in the dataset. The row is then bucketed based on its length and bucket_boundaries. When a bucket reaches its corresponding size specified in bucket_batch_sizes, the entire bucket will be padded according to pad_info, and then batched. Each batch will be full, except possibly the last batch for each bucket.

Parameters
  • column_names (list[str]) – Columns passed to element_length_function.

  • bucket_boundaries (list[int]) – A list consisting of the upper boundaries of the buckets. Must be strictly increasing. If there are n boundaries, n+1 buckets are created: one bucket for [0, bucket_boundaries[0]), one bucket for [bucket_boundaries[i], bucket_boundaries[i+1]) for each 0<=i<n-1, and one bucket for [bucket_boundaries[n-1], inf).

  • bucket_batch_sizes (list[int]) – A list consisting of the batch sizes for each bucket. Must contain len(bucket_boundaries)+1 elements.

  • element_length_function (Callable, optional) – A function that takes in len(column_names) arguments and returns an int. If no value is provided, then len(column_names) must be 1, and the size of the first dimension of that column will be taken as the length (default=None).

  • pad_info (dict, optional) – Represents how to batch each column. The key corresponds to the column name, and the value must be a tuple of 2 elements. The first element corresponds to the shape to pad to, and the second element corresponds to the value to pad with. If a column is not specified, then that column will be padded to the longest in the current batch, and 0 will be used as the padding value. Any None dimensions will be padded to the longest in the current batch, unless if pad_to_bucket_boundary is True. If no padding is wanted, set pad_info to None (default=None).

  • pad_to_bucket_boundary (bool, optional) – If True, will pad each None dimension in pad_info to the bucket_boundary minus 1. If there are any elements that fall into the last bucket, an error will occur (default=False).

  • drop_remainder (bool, optional) – If True, will drop the last batch for each bucket if it is not a full batch (default=False).
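
To make the bucket boundaries concrete, the sketch below assigns row lengths to buckets using the half-open intervals described for bucket_boundaries. It is a plain-Python illustration only: `bucket_index` is a hypothetical helper, and the padding and batching steps are not modeled.

```python
from bisect import bisect_right


def bucket_index(length, bucket_boundaries):
    """Return the index of the bucket that a row of the given length falls into."""
    if any(a >= b for a, b in zip(bucket_boundaries, bucket_boundaries[1:])):
        raise ValueError("bucket_boundaries must be strictly increasing")
    # Each boundary is an exclusive upper bound, so a length equal to a
    # boundary belongs to the next bucket; bisect_right returns that index.
    return bisect_right(bucket_boundaries, length)


bucket_boundaries = [5, 10]
bucket_batch_sizes = [5, 1, 1]  # one batch size per bucket: n + 1 entries
assert len(bucket_batch_sizes) == len(bucket_boundaries) + 1

for length in (0, 4, 5, 9, 10, 25):
    print(length, "->", bucket_index(length, bucket_boundaries))
# 0 -> 0, 4 -> 0, 5 -> 1, 9 -> 1, 10 -> 2, 25 -> 2
```

With these values, rows of length 0 to 4 wait until 5 of them have accumulated, while rows in the other two buckets are batched one at a time.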

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>>
>>> # Create a dataset where rows are bucketed by length, and each
>>> # bucket is padded and batched once it is full.
>>> column_names = ["col1", "col2"]
>>> bucket_boundaries = [5, 10]
>>> bucket_batch_sizes = [5, 1, 1]
>>> element_length_function = (lambda col1, col2: max(len(col1), len(col2)))
>>>
>>> # Will pad col1 to shape [2, bucket_boundaries[i]] where i is the
>>> # index of the bucket that is currently being batched.
>>> # Will pad col2 to a shape where each dimension is the longest in all
>>> # the elements currently being batched.
>>> pad_info = {"col1": ([2, None], -1)}
>>> pad_to_bucket_boundary = True
>>>
>>> data = data.bucket_batch_by_length(column_names, bucket_boundaries,
>>>                                    bucket_batch_sizes,
>>>                                    element_length_function, pad_info,
>>>                                    pad_to_bucket_boundary)
concat(datasets)

Concatenate the datasets in the input list of datasets. The “+” operator is also supported to concatenate.

Note

The column name, and rank and type of the column data must be the same in the input datasets.

Parameters

datasets (Union[list, class Dataset]) – A list of datasets or a single class Dataset to be concatenated together with this dataset.

Returns

ConcatDataset, dataset concatenated.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # ds1 and ds2 are instances of Dataset object
>>>
>>> # Create a dataset by concatenating ds1 and ds2 with "+" operator
>>> data1 = ds1 + ds2
>>> # Create a dataset by concatenating ds1 and ds2 with concat operation
>>> data1 = ds1.concat(ds2)
create_dict_iterator(num_epochs=-1, output_numpy=False)

Create an iterator over the dataset. The data retrieved will be a dictionary.

The order of the columns in the dictionary may not be the same as the original order.

Parameters
  • num_epochs (int, optional) – Maximum number of epochs that iterator can be iterated (default=-1, iterator can be iterated infinite number of epochs).

  • output_numpy (bool, optional) – Whether or not to output NumPy datatype, if output_numpy=False, iterator will output MSTensor (default=False).

Returns

Iterator, dictionary of column name-ndarray pair.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>>
>>> # create an iterator
>>> # The columns in the data obtained by the iterator might be changed.
>>> iterator = data.create_dict_iterator()
>>> for item in iterator:
>>>     # print the data in column1
>>>     print(item["column1"])
create_tuple_iterator(columns=None, num_epochs=-1, output_numpy=False)

Create an iterator over the dataset. The data retrieved will be a list of ndarrays of data.

To specify which columns to list and the order needed, use columns. If columns is not provided, the order of the columns will not be changed.

Parameters
  • columns (list[str], optional) – List of columns to be used to specify the order of columns (default=None, means all columns).

  • num_epochs (int, optional) – Maximum number of epochs that iterator can be iterated. (default=-1, iterator can be iterated infinite number of epochs)

  • output_numpy (bool, optional) – Whether or not to output NumPy datatype. If output_numpy=False, iterator will output MSTensor (default=False).

Returns

Iterator, list of ndarrays.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>>
>>> # Create an iterator
>>> # The columns in the data obtained by the iterator will not be changed.
>>> iterator = data.create_tuple_iterator()
>>> for item in iterator:
>>>     # convert the returned tuple to a list and print
>>>     print(list(item))
device_que(prefetch_size=None, send_epoch_end=True)

Return a transferred Dataset that transfers data through a device.

Parameters
  • prefetch_size (int, optional) – Prefetch number of records ahead of the user’s request (default=None).

  • send_epoch_end (bool, optional) – Whether to send end of sequence to device or not (default=True).

Note

If device is Ascend, features of data will be transferred one by one. The limitation of data transmission per time is 256M.

Returns

TransferDataset, dataset for transferring.

filter(predicate, input_columns=None, num_parallel_workers=1)

Filter dataset by predicate.

Note

If input_columns is not provided or is empty, all columns will be used.

Parameters
  • predicate (callable) – Python callable which returns a boolean value. Elements for which it returns False are filtered out.

  • input_columns (list[str], optional) – List of names of the input columns, when default=None, the predicate will be applied on all columns in the dataset.

  • num_parallel_workers (int, optional) – Number of workers to process the dataset in parallel (default=1).
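
The way predicate and input_columns interact can be modeled in plain Python as follows. This is a sketch under assumptions: `filter_rows` is a hypothetical helper, rows are represented as dicts, and passing all columns in their original order when input_columns is empty is inferred from the note above.

```python
def filter_rows(rows, predicate, input_columns=None):
    """Keep the rows for which predicate returns True."""
    kept = []
    for row in rows:
        # With no input_columns, every column is passed to the predicate.
        columns = input_columns if input_columns else list(row)
        if predicate(*(row[name] for name in columns)):
            kept.append(row)
    return kept


rows = [{"data": i} for i in range(64)]
kept = filter_rows(rows, lambda data: data < 11, input_columns=["data"])
print(len(kept))         # 11
print(kept[-1]["data"])  # 10
```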

Returns

FilterDataset, dataset filtered.

Examples

>>> import mindspore.dataset as ds
>>> # dataset is a generator dataset whose "data" column holds the values 0 ~ 63.
>>> # Filter out the data that is greater than or equal to 11.
>>> dataset_f = dataset.filter(predicate=lambda data: data < 11, input_columns=["data"])
flat_map(func)

Map func to each row in dataset and flatten the result.

The specified func is a function that must take one ‘Ndarray’ as input and return a ‘Dataset’.

Parameters

func (function) – A function that must take one ‘Ndarray’ as an argument and return a ‘Dataset’.

Returns

Dataset, applied by the function.

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>>
>>> # Declare a function which returns a Dataset object
>>> def flat_map_func(x):
>>>     data_dir = text.to_str(x[0])
>>>     d = ds.ImageFolderDataset(data_dir)
>>>     return d
>>> # data is an instance of a Dataset object.
>>> data = ds.TextFileDataset(DATA_FILE)
>>> data = data.flat_map(flat_map_func)
Raises
  • TypeError – If func is not a function.

  • TypeError – If func doesn’t return a Dataset.

get_batch_size()

Get the size of a batch.

Returns

Number, the number of data in a batch.

get_class_indexing()[source]

Get the class index.

Returns

Dict, A str-to-int mapping from label name to index.

get_col_names()

Get the names of the columns in the dataset.

get_dataset_size()[source]

Get the number of batches in an epoch.

Returns

Number, number of batches.

get_repeat_count()

Get the repeat count of the RepeatDataset, or 1 if the dataset is not repeated.

Returns

Number, the count of repeat.

map(operations=None, input_columns=None, output_columns=None, column_order=None, num_parallel_workers=None, python_multiprocessing=False, cache=None, callbacks=None)

Apply each operation in operations to this dataset.

The order of operations is determined by the position of each operation in the operations parameter. operations[0] will be applied first, then operations[1], then operations[2], etc.

Each operation will be passed one or more columns from the dataset as input, and zero or more columns will be outputted. The first operation will be passed the columns specified in input_columns as input. If there is more than one operator in operations, the outputted columns of the previous operation are used as the input columns for the next operation. The columns outputted by the very last operation will be assigned names specified by output_columns.

Only the columns specified in column_order will be propagated to the child node. These columns will be in the same order as specified in column_order.
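The chaining rule described above can be sketched in plain Python, with one row modelled as a dict. The helper apply_operations is hypothetical; it only models how operations feed each other and how output_columns names the final result, not column_order or the handling of untouched columns.

```python
def apply_operations(row, operations, input_columns, output_columns=None):
    """Run operations in order on the selected columns of one row (a dict)."""
    values = [row[name] for name in input_columns]
    for op in operations:
        result = op(*values)
        # An operation may return a single column or a tuple of columns.
        values = list(result) if isinstance(result, tuple) else [result]
    names = input_columns if output_columns is None else output_columns
    if len(names) != len(values):
        raise ValueError("output_columns must match the outputs of the last operation")
    return dict(zip(names, values))


row = {"col0": 1, "col1": 5, "col2": 1}
operations = [
    lambda x, y: (x, x + y, x + y + 1),      # 2 columns in, 3 columns out
    lambda x, y, z: x * y * z,               # 3 columns in, 1 column out
    lambda x: (x % 2, x % 3, x % 5, x % 7),  # 1 column in, 4 columns out
]
mapped = apply_operations(row, operations, ["col2", "col0"],
                          ["mod2", "mod3", "mod5", "mod7"])
print(mapped)
```

The number of outputs of each operation has to match the number of inputs of the next one; otherwise the call to op(*values) fails in this sketch.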

Parameters
  • operations (Union[list[TensorOp], list[functions]]) – List of operations to be applied on the dataset. Operations are applied in the order they appear in this list.

  • input_columns (list[str]) – List of the names of the columns that will be passed to the first operation as input. The size of this list must match the number of input columns expected by the first operator. (default=None, the first operation will be passed however many columns are required, starting from the first column).

  • output_columns (list[str], optional) – List of names assigned to the columns outputted by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. (default=None, output columns will have the same name as the input columns, i.e., the columns will be replaced).

  • column_order (list[str], optional) – List of all the desired columns to propagate to the child node. This list must be a subset of all the columns in the dataset after all operations are applied. The order of the columns in each row propagated to the child node follows the order they appear in this list. This parameter is mandatory if len(input_columns) != len(output_columns). (default=None, all columns will be propagated to the child node, the order of the columns will remain the same).

  • num_parallel_workers (int, optional) – Number of threads used to process the dataset in parallel (default=None, the value from the configuration will be used).

  • python_multiprocessing (bool, optional) – Parallelize Python operations with multiple worker processes. This option could be beneficial if the Python operation is computationally heavy (default=False).

  • cache (DatasetCache, optional) – Tensor cache to use. (default=None which means no cache is used). The cache feature is under development and is not recommended.

  • callbacks (DSCallback, list[DSCallback], optional) – List of Dataset callbacks to be called (default=None).

Returns

MapDataset, dataset after mapping operation.

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.vision.c_transforms as c_transforms
>>>
>>> # data is an instance of Dataset which has 2 columns, "image" and "label".
>>> # ds_pyfunc is an instance of Dataset which has 3 columns, "col0", "col1", and "col2".
>>> # Each column is a 2D array of integers.
>>>
>>> # Set the global configuration value for num_parallel_workers to be 2.
>>> # Operations which use this configuration value will use 2 worker threads,
>>> # unless otherwise specified in the operator's constructor.
>>> # set_num_parallel_workers can be called again later if a different
>>> # global configuration value for the number of worker threads is desired.
>>> ds.config.set_num_parallel_workers(2)
>>>
>>> # Define two operations, where each operation accepts 1 input column and outputs 1 column.
>>> decode_op = c_transforms.Decode(rgb_format=True)
>>> random_jitter_op = c_transforms.RandomColorAdjust((0.8, 0.8), (1, 1), (1, 1), (0, 0))
>>>
>>> # 1) Simple map example
>>>
>>> operations = [decode_op]
>>> input_columns = ["image"]
>>>
>>> # Apply decode_op on column "image". This column will be replaced by the outputted
>>> # column of decode_op. Since column_order is not provided, both columns "image"
>>> # and "label" will be propagated to the child node in their original order.
>>> ds_decoded = data.map(operations, input_columns)
>>>
>>> # Rename column "image" to "decoded_image".
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(operations, input_columns, output_columns)
>>>
>>> # Specify the order of the columns.
>>> column_order = ["label", "image"]
>>> ds_decoded = data.map(operations, input_columns, None, column_order)
>>>
>>> # Rename column "image" to "decoded_image" and also specify the order of the columns.
>>> column_order = ["label", "decoded_image"]
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(operations, input_columns, output_columns, column_order)
>>>
>>> # Rename column "image" to "decoded_image" and keep only this column.
>>> column_order = ["decoded_image"]
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(operations, input_columns, output_columns, column_order)
>>>
>>> # A simple example using pyfunc: Renaming columns and specifying column order
>>> # work in the same way as the previous examples.
>>> input_columns = ["col0"]
>>> operations = [(lambda x: x + 1)]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns)
>>>
>>> # 2) Map example with more than one operation
>>>
>>> # If this list of operations is used with map, decode_op will be applied
>>> # first, then random_jitter_op will be applied.
>>> operations = [decode_op, random_jitter_op]
>>>
>>> input_columns = ["image"]
>>>
>>> # Create a dataset where the images are decoded, then randomly color jittered.
>>> # decode_op takes column "image" as input and outputs one column. The column
>>> # outputted by decode_op is passed as input to random_jitter_op.
>>> # random_jitter_op will output one column. Column "image" will be replaced by
>>> # the column outputted by random_jitter_op (the very last operation). All other
>>> # columns are unchanged. Since column_order is not specified, the order of the
>>> # columns will remain the same.
>>> ds_mapped = data.map(operations, input_columns)
>>>
>>> # Create a dataset that is identical to ds_mapped, except the column "image"
>>> # that is outputted by random_jitter_op is renamed to "image_transformed".
>>> # Specifying column order works in the same way as examples in 1).
>>> output_columns = ["image_transformed"]
>>> ds_mapped_and_renamed = data.map(operations, input_columns, output_columns)
>>>
>>> # Multiple operations using pyfunc: Renaming columns and specifying column order
>>> # work in the same way as examples in 1).
>>> input_columns = ["col0"]
>>> operations = [(lambda x: x + x), (lambda x: x - 1)]
>>> output_columns = ["col0_mapped"]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns, output_columns)
>>>
>>> # 3) Example where number of input columns is not equal to number of output columns
>>>
>>> # operations[0] is a lambda that takes 2 columns as input and outputs 3 columns.
>>> # operations[1] is a lambda that takes 3 columns as input and outputs 1 column.
>>> # operations[2] is a lambda that takes 1 column as input and outputs 4 columns.
>>> #
>>> # Note: The number of output columns of operations[i] must equal the number of
>>> # input columns of operations[i+1]. Otherwise, this map call will result
>>> # in an error.
>>> operations = [(lambda x, y: (x, x + y, x + y + 1)),
>>>               (lambda x, y, z: x * y * z),
>>>               (lambda x: (x % 2, x % 3, x % 5, x % 7))]
>>>
>>> # Note: Since the number of input columns is not the same as the number of
>>> # output columns, the output_columns and column_order parameters must be
>>> # specified. Otherwise, this map call will also result in an error.
>>> input_columns = ["col2", "col0"]
>>> output_columns = ["mod2", "mod3", "mod5", "mod7"]
>>>
>>> # Propagate all columns to the child node in this order:
>>> column_order = ["col0", "col2", "mod2", "mod3", "mod5", "mod7", "col1"]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns, output_columns, column_order)
>>>
>>> # Propagate some columns to the child node in this order:
>>> column_order = ["mod7", "mod3", "col1"]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns, output_columns, column_order)
num_classes()[source]

Get the number of classes in a dataset.

Returns

Number, number of classes.

output_shapes()

Get the shapes of output data.

Returns

List, list of shapes of each column.

output_types()

Get the types of output data.

Returns

List of data types.

project(columns)

Project certain columns in input dataset.

The specified columns will be selected from the dataset and passed down the pipeline in the order specified. The other columns are discarded.

Parameters

columns (list[str]) – List of names of the columns to project.

Returns

ProjectDataset, dataset projected.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>> columns_to_project = ["column3", "column1", "column2"]
>>>
>>> # Create a dataset that consists of column3, column1, column2
>>> # in that order, regardless of the original order of columns.
>>> data = data.project(columns=columns_to_project)
rename(input_columns, output_columns)

Rename the columns in input datasets.

Parameters
  • input_columns (list[str]) – List of names of the input columns.

  • output_columns (list[str]) – List of names of the output columns.

Returns

RenameDataset, dataset renamed.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> input_columns = ["input_col1", "input_col2", "input_col3"]
>>> output_columns = ["output_col1", "output_col2", "output_col3"]
>>>
>>> # Create a dataset where input_col1 is renamed to output_col1, and
>>> # input_col2 is renamed to output_col2, and input_col3 is renamed
>>> # to output_col3.
>>> data = data.rename(input_columns=input_columns, output_columns=output_columns)
repeat(count=None)

Repeat this dataset count times. Repeat indefinitely if the count is None or -1.

Note

The order of using repeat and batch affects the number of batches. It is recommended that the repeat operation be used after the batch operation. If dataset_sink_mode is False, the repeat operation is invalid. If dataset_sink_mode is True, the repeat count must be equal to the number of training epochs. Otherwise, errors could occur since the amount of data is not the amount training requires.
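The effect of operator order on the batch count can be checked with simple arithmetic. The sketch below is plain Python with a hypothetical helper, and it assumes that incomplete batches are kept (drop_remainder=False).

```python
import math


def batches_per_pipeline(num_rows, batch_size, repeat_count):
    """Count batches for both operator orders, keeping incomplete batches."""
    # batch(...).repeat(...): every epoch ends with its own partial batch.
    batch_then_repeat = math.ceil(num_rows / batch_size) * repeat_count
    # repeat(...).batch(...): rows from different epochs share batches.
    repeat_then_batch = math.ceil(num_rows * repeat_count / batch_size)
    return batch_then_repeat, repeat_then_batch


counts = batches_per_pipeline(num_rows=10, batch_size=4, repeat_count=2)
print(counts)
```

With 10 rows, a batch size of 4 and 2 repeats, batching first yields 6 batches, while repeating first yields 5 because the last rows of one epoch are batched together with the first rows of the next.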

Parameters

count (int) – Number of times the dataset is repeated (default=None).

Returns

RepeatDataset, dataset repeated.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>>
>>> # Create a dataset where the dataset is repeated for 50 epochs
>>> repeated = data.repeat(50)
>>>
>>> # Create a dataset where each epoch is shuffled individually
>>> shuffled_and_repeated = data.shuffle(10)
>>> shuffled_and_repeated = shuffled_and_repeated.repeat(50)
>>>
>>> # Create a dataset where the dataset is first repeated for
>>> # 50 epochs before shuffling. The shuffle operator will treat
>>> # the entire 50 epochs as one big dataset.
>>> repeat_and_shuffle = data.repeat(50)
>>> repeat_and_shuffle = repeat_and_shuffle.shuffle(10)
reset()

Reset the dataset for next epoch.

save(file_name, num_files=1, file_type='mindrecord')

Save the dynamic data processed by the dataset pipeline in a common dataset format. Supported dataset formats: ‘mindrecord’ only.

Implicit type casting takes place when saving data as ‘mindrecord’. The table below shows how each type is cast.

Implicit Type Casting when Saving as ‘mindrecord’

Type in ‘dataset’    Type in ‘mindrecord’    Details

bool                 None                    Not supported
int8                 int32
uint8                bytes(1D uint8)         Drop dimension
int16                int32
uint16               int32
int32                int32
uint32               int64
int64                int64
uint64               None                    Not supported
float16              float32
float32              float32
float64              float64
string               string                  Multi-dimensional string not supported

Note

  1. To save the samples in order, set dataset’s shuffle to False and num_files to 1.

  2. Before calling this function, do not use the batch operator, the repeat operator, or data augmentation operators with random attributes in the map operator.

  3. Mindrecord does not support DE_UINT64, multi-dimensional DE_UINT8(drop dimension) nor multi-dimensional DE_STRING.

Parameters
  • file_name (str) – Path to dataset file.

  • num_files (int, optional) – Number of dataset files (default=1).

  • file_type (str, optional) – Dataset format (default=’mindrecord’).

shuffle(buffer_size)

Randomly shuffles the rows of this dataset using the following algorithm:

  1. Make a shuffle buffer that contains the first buffer_size rows.

  2. Randomly select an element from the shuffle buffer to be the next row propagated to the child node.

  3. Get the next row (if any) from the parent node and put it in the shuffle buffer.

  4. Repeat steps 2 and 3 until there are no more rows left in the shuffle buffer.

A seed can be provided to be used on the first epoch. In every subsequent epoch, the seed is changed to a new, randomly generated value.
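The four steps above can be sketched in plain Python. The generator buffer_shuffle is a hypothetical stand-in that follows the documented algorithm; it is not MindSpore's implementation and does not model the per-epoch reseeding.

```python
import random
from itertools import islice


def buffer_shuffle(rows, buffer_size, seed=None):
    """Yield rows in the order produced by a fixed-size shuffle buffer."""
    rng = random.Random(seed)
    source = iter(rows)
    # Step 1: fill the buffer with the first buffer_size rows.
    buffer = list(islice(source, buffer_size))
    done = object()  # sentinel marking an exhausted source
    # Step 4: keep going until the buffer is empty.
    while buffer:
        # Step 2: emit a randomly selected row from the buffer.
        yield buffer.pop(rng.randrange(len(buffer)))
        # Step 3: refill the buffer with the next row, if any.
        next_row = next(source, done)
        if next_row is not done:
            buffer.append(next_row)


shuffled = list(buffer_shuffle(range(10), buffer_size=4, seed=58))
print(shuffled)
```

Because a row can only be emitted once it has entered the buffer, a small buffer_size keeps rows close to their original position; a buffer as large as the dataset removes that restriction.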

Parameters

buffer_size (int) – The size of the buffer (must be larger than 1) for shuffling. Setting buffer_size equal to the number of rows in the entire dataset will result in a global shuffle.

Returns

ShuffleDataset, dataset shuffled.

Raises

RuntimeError – If sync operators exist before shuffle.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> # Optionally set the seed for the first epoch
>>> ds.config.set_seed(58)
>>>
>>> # Create a shuffled dataset using a shuffle buffer of size 4
>>> data = data.shuffle(4)
skip(count)

Skip the first N elements of this dataset.

Parameters

count (int) – Number of elements in the dataset to be skipped.

Returns

SkipDataset, dataset skipped.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> # Create a dataset which skips first 3 elements from data
>>> data = data.skip(3)
split(sizes, randomize=True)

Split the dataset into smaller, non-overlapping datasets.

Parameters
  • sizes (Union[list[int], list[float]]) –

    If a list of integers [s1, s2, …, sn] is provided, the dataset will be split into n datasets of size s1, size s2, …, size sn respectively. If the sum of all sizes does not equal the original dataset size, an error will occur. If a list of floats [f1, f2, …, fn] is provided, all floats must be between 0 and 1 and must sum to 1, otherwise an error will occur. The dataset will be split into n Datasets of size round(f1*K), round(f2*K), …, round(fn*K) where K is the size of the original dataset. If after rounding:

    • Any size equals 0, an error will occur.

    • The sum of split sizes < K, the difference will be added to the first split.

    • The sum of split sizes > K, the difference will be removed from the first large enough split such that it will have at least 1 row after removing the difference.

  • randomize (bool, optional) – Determines whether or not to split the data randomly (default=True). If True, the data will be randomly split. Otherwise, each split will be created with consecutive rows from the dataset.
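The rounding rules for a list of floats can be sketched as follows. The function split_sizes is a hypothetical plain-Python helper that follows the rules listed above; it relies on Python's built-in round, which rounds halves to the nearest even integer, and the library's exact rounding behaviour may differ.

```python
def split_sizes(fractions, dataset_size):
    """Turn fractions that sum to 1 into absolute split sizes."""
    sizes = [round(fraction * dataset_size) for fraction in fractions]
    if any(size == 0 for size in sizes):
        raise RuntimeError("a split has size 0 after rounding")
    difference = dataset_size - sum(sizes)
    if difference > 0:
        # Too few rows assigned: give the extra rows to the first split.
        sizes[0] += difference
    elif difference < 0:
        # Too many rows assigned: shrink the first split that keeps >= 1 row.
        for index, size in enumerate(sizes):
            if size + difference >= 1:
                sizes[index] += difference
                break
    return sizes


print(split_sizes([0.9, 0.1], 100))
print(split_sizes([0.5, 0.5], 5))
```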

Note

  1. There is an optimized split function, which will be called automatically when the dataset that calls this function is a MappableDataset.

  2. Dataset should not be sharded if split is going to be called. Instead, create a DistributedSampler and specify a split to shard after splitting. If dataset is sharded after a split, it is strongly recommended to set the same seed in each instance of execution, otherwise each shard may not be part of the same split (see Examples).

  3. It is strongly recommended to not shuffle the dataset, but use randomize=True instead. Shuffling the dataset may not be deterministic, which means the data in each split will be different in each epoch. Furthermore, if sharding occurs after split, each shard may not be part of the same split.

Raises
  • RuntimeError – If get_dataset_size returns None or is not supported for this dataset.

  • RuntimeError – If sizes is list of integers and sum of all elements in sizes does not equal the dataset size.

  • RuntimeError – If sizes is list of float and there is a split with size 0 after calculations.

  • RuntimeError – If the dataset is sharded prior to calling split.

  • ValueError – If sizes is list of float and not all floats are between 0 and 1, or if the floats don’t sum to 1.

Returns

tuple(Dataset), a tuple of datasets that have been split.

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_dir = "/path/to/imagefolder_directory"
>>>
>>> # Since many datasets have shuffle on by default, set shuffle to False if split will be called!
>>> data = ds.ImageFolderDataset(dataset_dir, shuffle=False)
>>>
>>> # Set the seed, and tell split to use this seed when randomizing.
>>> # This is needed because sharding will be done later
>>> ds.config.set_seed(58)
>>> train, test = data.split([0.9, 0.1])
>>>
>>> # To shard the train dataset, use a DistributedSampler
>>> train_sampler = ds.DistributedSampler(10, 2)
>>> train.use_sampler(train_sampler)
sync_update(condition_name, num_batch=None, data=None)

Release a blocking condition and trigger callback with given data.

Parameters
  • condition_name (str) – The condition name that is used to toggle sending next row.

  • num_batch (Union[int, None]) – The number of batches (rows) that are released. When num_batch is None, it will default to the number specified by the sync_wait operator (default=None).

  • data (Union[dict, None]) – The data passed to the callback (default=None).

sync_wait(condition_name, num_batch=1, callback=None)

Add a blocking condition to the input Dataset.

Parameters
  • num_batch (int) – The number of batches without blocking at the start of each epoch.

  • condition_name (str) – The condition name that is used to toggle sending next row.

  • callback (function) – The callback function that will be invoked when sync_update is called.

Raises

RuntimeError – If condition name already exists.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> data = data.sync_wait("callback1")
>>> data = data.batch(batch_size)
>>> for batch_data in data.create_dict_iterator():
>>>     data.sync_update("callback1")
take(count=-1)

Take at most the given number of elements from the dataset.

Note

  1. If count is greater than the number of elements in the dataset or equal to -1, all the elements in dataset will be taken.

  2. The order of using take and batch matters. If take is before batch operation, then take given number of rows; otherwise take given number of batches.
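The second note can be illustrated with plain Python lists. The batch helper below is hypothetical and stands in for the batch operation; list slicing stands in for take.

```python
def batch(rows, batch_size):
    """Group consecutive rows into lists of at most batch_size rows."""
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]


rows = list(range(10))

# take before batch: 3 rows are taken, then grouped into batches of 2.
take_then_batch = batch(rows[:3], 2)
# batch before take: rows are grouped first, then 3 batches are taken.
batch_then_take = batch(rows, 2)[:3]

print(take_then_batch)
print(batch_then_take)
```

The first pipeline passes on 3 rows in total, the second one 6 rows, although both use the same count.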

Parameters

count (int, optional) – Number of elements to be taken from the dataset (default=-1).

Returns

TakeDataset, dataset taken.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> # Create a dataset where the dataset includes 50 elements.
>>> data = data.take(50)
to_device(send_epoch_end=True)

Transfer data through CPU, GPU or Ascend devices.

Parameters

send_epoch_end (bool, optional) – Whether to send end of sequence to device or not (default=True).

Note

If the device is Ascend, the features of the data will be transferred one by one. Each transmission is limited to 256M of data.

Returns

TransferDataset, dataset for transferring.

Raises
  • TypeError – If device_type is empty.

  • ValueError – If device_type is not ‘Ascend’, ‘GPU’ or ‘CPU’.

  • RuntimeError – If dataset is unknown.

  • RuntimeError – If distribution file path is given but failed to read.

use_sampler(new_sampler)

Make the current dataset use the provided new_sampler.

Parameters

new_sampler (Sampler) – The sampler to use for the current dataset.

Returns

Dataset, that uses new_sampler.

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_dir = "/path/to/imagefolder_directory"
>>> # Note: A SequentialSampler is created by default
>>> data = ds.ImageFolderDataset(dataset_dir)
>>>
>>> # Use a DistributedSampler instead of the SequentialSampler
>>> new_sampler = ds.DistributedSampler(10, 2)
>>> data.use_sampler(new_sampler)
zip(datasets)

Zip the datasets in the input tuple of datasets. Columns in the input datasets must not have the same name.
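The column-name restriction can be illustrated with a plain-Python sketch in which each row is a dict. The generator zip_rows is hypothetical; it stops at the shorter input because Python's built-in zip does, which is a property of this sketch rather than a statement about ZipDataset.

```python
def zip_rows(left_rows, right_rows):
    """Merge rows pairwise; each row is a dict of column name -> value."""
    for left, right in zip(left_rows, right_rows):
        clashes = set(left) & set(right)
        if clashes:
            raise ValueError(f"duplicate column names: {sorted(clashes)}")
        yield {**left, **right}


ds1_rows = [{"image": "a.png"}, {"image": "b.png"}]
ds2_rows = [{"label": 0}, {"label": 1}]
zipped = list(zip_rows(ds1_rows, ds2_rows))
print(zipped)
```

If both inputs had a column called "image", the merged row could not hold both values, which is why duplicate names are rejected.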

Parameters

datasets (Union[tuple, class Dataset]) – A tuple of datasets or a single class Dataset to be zipped together with this dataset.

Returns

ZipDataset, dataset zipped.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # ds1 and ds2 are instances of Dataset object
>>> # Create a dataset which is the combination of ds1 and ds2
>>> data = ds1.zip(ds2)
class mindspore.dataset.MindDataset(dataset_file, columns_list=None, num_parallel_workers=None, shuffle=None, num_shards=None, shard_id=None, sampler=None, padded_sample=None, num_padded=None, num_samples=None)[source]

A source dataset that reads MindRecord files.

Parameters
  • dataset_file (Union[str, list[str]]) – One of file names or file list in dataset.

  • columns_list (list[str], optional) – List of columns to be read (default=None).

  • num_parallel_workers (int, optional) – The number of readers (default=None).

  • shuffle (bool, optional) – Whether or not to perform shuffle on the dataset (default=None, performs shuffle).

  • num_shards (int, optional) – Number of shards that the dataset will be divided into (default=None).

  • shard_id (int, optional) – The shard ID within num_shards (default=None). This argument can only be specified when num_shards is also specified.

  • sampler (Sampler, optional) – Object used to choose samples from the dataset (default=None, sampler is exclusive with shuffle and block_reader). Support list: SubsetRandomSampler, PkSampler, RandomSampler, SequentialSampler, DistributedSampler.

  • padded_sample (dict, optional) – Samples to be appended to the dataset, whose keys are the same as columns_list.

  • num_padded (int, optional) – Number of padding samples. Dataset size plus num_padded should be divisible by num_shards.

  • num_samples (int, optional) – The number of samples to be included in the dataset (default=None, all samples).
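The divisibility requirement on num_padded can be worked out with modular arithmetic. The helper required_padding below is hypothetical and only computes the smallest value that satisfies the rule stated above.

```python
def required_padding(dataset_size, num_shards):
    """Smallest num_padded so that dataset_size + num_padded divides evenly."""
    return (-dataset_size) % num_shards


num_padded = required_padding(dataset_size=10, num_shards=4)
print(num_padded)
```

For 10 samples and 4 shards this gives 2 padding samples, so that each shard receives 3 samples.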

Raises
  • ValueError – If num_shards is specified but shard_id is None.

  • ValueError – If shard_id is specified but num_shards is None.

apply(apply_func)

Apply a function in this dataset.

Parameters

apply_func (function) – A function that must take one ‘Dataset’ as an argument and return a preprocessed ‘Dataset’.

Returns

Dataset, applied by the function.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>>
>>> # Declare an apply_func function which returns a Dataset object
>>> def apply_func(ds):
>>>     ds = ds.batch(2)
>>>     return ds
>>>
>>> # Use apply to call apply_func
>>> data = data.apply(apply_func)
Raises
  • TypeError – If apply_func is not a function.

  • TypeError – If apply_func doesn’t return a Dataset.

batch(batch_size, drop_remainder=False, num_parallel_workers=None, per_batch_map=None, input_columns=None, output_columns=None, column_order=None, pad_info=None)

Combine batch_size number of consecutive rows into batches.

For any child node, a batch is treated as a single row. For any column, all the elements within that column must have the same shape. If a per_batch_map callable is provided, it will be applied to the batches of tensors.

Note

The order of using repeat and batch affects the number of batches and per_batch_map. It is recommended that the repeat operation be used after the batch operation.

Parameters
  • batch_size (int or function) – The number of rows each batch is created with. An int or callable which takes exactly 1 parameter, BatchInfo.

  • drop_remainder (bool, optional) – Determines whether or not to drop the last possibly incomplete batch (default=False). If True, and if there are fewer than batch_size rows available to make the last batch, then those rows will be dropped and not propagated to the child node.

  • num_parallel_workers (int, optional) – Number of workers to process the dataset in parallel (default=None).

  • per_batch_map (callable, optional) – Per batch map callable. A callable which takes (list[Tensor], list[Tensor], …, BatchInfo) as input parameters. Each list[Tensor] represents a batch of Tensors on a given column. The number of lists should match with number of entries in input_columns. The last parameter of the callable should always be a BatchInfo object.

  • input_columns (list[str], optional) – List of names of the input columns. The size of the list should match with signature of the per_batch_map callable.

  • output_columns (list[str], optional) – [Not currently implemented] List of names assigned to the columns outputted by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. (default=None, output columns will have the same name as the input columns, i.e., the columns will be replaced).

  • column_order (list[str], optional) – [Not currently implemented] List of all the desired columns to propagate to the child node. This list must be a subset of all the columns in the dataset after all operations are applied. The order of the columns in each row propagated to the child node follow the order they appear in this list. The parameter is mandatory if the len(input_columns) != len(output_columns). (default=None, all columns will be propagated to the child node, the order of the columns will remain the same).

  • pad_info (dict, optional) – Whether to perform padding on selected columns. pad_info={“col1”:([224,224],0)} would pad column with name “col1” to a tensor of size [224,224] and fill the missing with 0.
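The meaning of a pad_info entry can be illustrated with a simplified plain-Python sketch. The helper pad_batch is hypothetical and handles only one-dimensional list columns, whereas the real parameter accepts multi-dimensional shapes such as [224, 224].

```python
def pad_batch(rows, pad_info):
    """Pad the 1-D list columns named in pad_info to a fixed length."""
    padded = []
    for row in rows:
        new_row = dict(row)
        for name, (shape, fill_value) in pad_info.items():
            target_length = shape[0]
            values = list(row[name])
            if len(values) > target_length:
                raise ValueError(f"column {name!r} is longer than the pad shape")
            new_row[name] = values + [fill_value] * (target_length - len(values))
        padded.append(new_row)
    return padded


rows = [{"col1": [1, 2]}, {"col1": [3]}]
padded_rows = pad_batch(rows, {"col1": ([3], 0)})
print(padded_rows)
```

After padding, every element of "col1" has the same shape, which is the precondition for combining the rows into one batch.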

Returns

BatchDataset, dataset batched.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>>
>>> # Create a dataset where every 100 rows is combined into a batch
>>> # and drops the last incomplete batch if there is one.
>>> data = data.batch(100, True)
bucket_batch_by_length(column_names, bucket_boundaries, bucket_batch_sizes, element_length_function=None, pad_info=None, pad_to_bucket_boundary=False, drop_remainder=False)

Bucket elements according to their lengths. Each bucket will be padded and batched when they are full.

A length function is called on each row in the dataset. The row is then bucketed based on its length and bucket_boundaries. When a bucket reaches its corresponding size specified in bucket_batch_sizes, the entire bucket will be padded according to pad_info, and then batched. Each batch will be full, except possibly the last batch for each bucket.

Parameters
  • column_names (list[str]) – Columns passed to element_length_function.

  • bucket_boundaries (list[int]) – A list consisting of the upper boundaries of the buckets. Must be strictly increasing. If there are n boundaries, n+1 buckets are created: one bucket for [0, bucket_boundaries[0]), one bucket for [bucket_boundaries[i], bucket_boundaries[i+1]) for each 0<=i<n-1, and one bucket for [bucket_boundaries[n-1], inf).

  • bucket_batch_sizes (list[int]) – A list consisting of the batch sizes for each bucket. Must contain len(bucket_boundaries)+1 elements.

  • element_length_function (Callable, optional) – A function that takes in len(column_names) arguments and returns an int. If no value is provided, then len(column_names) must be 1, and the size of the first dimension of that column will be taken as the length (default=None).

  • pad_info (dict, optional) – Represents how to batch each column. The key corresponds to the column name, and the value must be a tuple of 2 elements. The first element corresponds to the shape to pad to, and the second element corresponds to the value to pad with. If a column is not specified, then that column will be padded to the longest in the current batch, and 0 will be used as the padding value. Any None dimensions will be padded to the longest in the current batch, unless if pad_to_bucket_boundary is True. If no padding is wanted, set pad_info to None (default=None).

  • pad_to_bucket_boundary (bool, optional) – If True, will pad each None dimension in pad_info to the bucket_boundary minus 1. If there are any elements that fall into the last bucket, an error will occur (default=False).

  • drop_remainder (bool, optional) – If True, will drop the last batch for each bucket if it is not a full batch (default=False).
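The mapping from an element length to its bucket can be sketched with the standard bisect module. The helper bucket_index is hypothetical and follows the half-open intervals described for bucket_boundaries.

```python
from bisect import bisect_right


def bucket_index(length, bucket_boundaries):
    """Return the index of the bucket whose half-open range contains length."""
    # bisect_right counts the boundaries that are <= length, which is
    # exactly the index of the bucket [lower, upper) holding this length.
    return bisect_right(bucket_boundaries, length)


bucket_boundaries = [5, 10]
indices = [bucket_index(n, bucket_boundaries) for n in (3, 5, 9, 10, 42)]
print(indices)
```

With boundaries [5, 10], lengths below 5 go to bucket 0, lengths from 5 to 9 go to bucket 1, and everything from 10 upwards goes to the last bucket, for which bucket_batch_sizes must also contain an entry.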

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>>
>>> # Create a dataset where rows are bucketed by length and each
>>> # bucket is padded and batched once it is full.
>>> column_names = ["col1", "col2"]
>>> bucket_boundaries = [5, 10]
>>> bucket_batch_sizes = [5, 1, 1]
>>> element_length_function = (lambda col1, col2: max(len(col1), len(col2)))
>>>
>>> # Will pad col1 to shape [2, bucket_boundaries[i]] where i is the
>>> # index of the bucket that is currently being batched.
>>> # Will pad col2 to a shape where each dimension is the longest in all
>>> # the elements currently being batched.
>>> pad_info = {"col1": ([2, None], -1)}
>>> pad_to_bucket_boundary = True
>>>
>>> data = data.bucket_batch_by_length(column_names, bucket_boundaries,
>>>                                    bucket_batch_sizes,
>>>                                    element_length_function, pad_info,
>>>                                    pad_to_bucket_boundary)
concat(datasets)

Concatenate the datasets in the input list of datasets. The “+” operator is also supported to concatenate.

Note

The column name, and rank and type of the column data must be the same in the input datasets.

Parameters

datasets (Union[list, class Dataset]) – A list of datasets or a single class Dataset to be concatenated together with this dataset.

Returns

ConcatDataset, dataset concatenated.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # ds1 and ds2 are instances of Dataset object
>>>
>>> # Create a dataset by concatenating ds1 and ds2 with "+" operator
>>> data1 = ds1 + ds2
>>> # Create a dataset by concatenating ds1 and ds2 with concat operation
>>> data1 = ds1.concat(ds2)
create_dict_iterator(num_epochs=-1, output_numpy=False)

Create an iterator over the dataset. The data retrieved will be a dictionary.

The order of the columns in the dictionary may not be the same as the original order.

Parameters
  • num_epochs (int, optional) – Maximum number of epochs that iterator can be iterated (default=-1, iterator can be iterated infinite number of epochs).

  • output_numpy (bool, optional) – Whether or not to output NumPy datatype. If output_numpy=False, iterator will output MSTensor (default=False).

Returns

Iterator, dictionary of column name-ndarray pair.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>>
>>> # create an iterator
>>> # The order of the columns in the data obtained by the iterator might be changed.
>>> iterator = data.create_dict_iterator()
>>> for item in iterator:
>>>     # print the data in column1
>>>     print(item["column1"])
create_tuple_iterator(columns=None, num_epochs=-1, output_numpy=False)

Create an iterator over the dataset. The data retrieved will be a list of ndarrays of data.

To specify which columns to list and the order needed, use columns. If columns is not provided, the order of the columns will not be changed.

Parameters
  • columns (list[str], optional) – List of columns to be used to specify the order of columns (default=None, means all columns).

  • num_epochs (int, optional) – Maximum number of epochs that iterator can be iterated. (default=-1, iterator can be iterated infinite number of epochs)

  • output_numpy (bool, optional) – Whether or not to output NumPy datatype. If output_numpy=False, iterator will output MSTensor (default=False).

Returns

Iterator, list of ndarrays.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>>
>>> # Create an iterator
>>> # The columns in the data obtained by the iterator will not be changed.
>>> iterator = data.create_tuple_iterator()
>>> for item in iterator:
>>>     # convert the returned tuple to a list and print
>>>     print(list(item))
device_que(prefetch_size=None, send_epoch_end=True)

Return a transferred Dataset that transfers data through a device.

Parameters
  • prefetch_size (int, optional) – Prefetch number of records ahead of the user’s request (default=None).

  • send_epoch_end (bool, optional) – Whether to send end of sequence to device or not (default=True).

Note

If the device is Ascend, features of data will be transferred one by one. Each transfer is limited to 256M of data.

Returns

TransferDataset, dataset for transferring.

filter(predicate, input_columns=None, num_parallel_workers=1)

Filter dataset by predicate.

Note

If input_columns is not provided or is empty, all columns will be used.

Parameters
  • predicate (callable) – Python callable which returns a boolean value. Rows for which it returns False are filtered out.

  • input_columns (list[str], optional) – List of names of the input columns (default=None, the predicate will be applied on all columns in the dataset).

  • num_parallel_workers (int, optional) – Number of workers to process the dataset in parallel (default=1).

Returns

FilterDataset, dataset filter.

Examples

>>> import mindspore.dataset as ds
>>> # dataset is generated data (0 ~ 63) in a column named "data"
>>> # Keep the rows whose value is less than 11; rows greater than or equal to 11 are filtered out
>>> dataset_f = dataset.filter(predicate=lambda data: data < 11, input_columns=["data"])
flat_map(func)

Map func to each row in dataset and flatten the result.

The specified func is a function that must take one ‘Ndarray’ as input and return a ‘Dataset’.

Parameters

func (function) – A function that must take one ‘Ndarray’ as an argument and return a ‘Dataset’.

Returns

Dataset, applied by the function.

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>>
>>> # Declare a function which returns a Dataset object
>>> def flat_map_func(x):
>>>     data_dir = text.to_str(x[0])
>>>     d = ds.ImageFolderDataset(data_dir)
>>>     return d
>>> # data is an instance of a Dataset object.
>>> data = ds.TextFileDataset(DATA_FILE)
>>> data = data.flat_map(flat_map_func)
Raises
  • TypeError – If func is not a function.

  • TypeError – If func doesn’t return a Dataset.

get_batch_size()

Get the size of a batch.

Returns

Number, the number of data in a batch.

get_col_names()

Get the names of the columns in the dataset.

get_dataset_size()[source]

Get the number of batches in an epoch.

Returns

Number, number of batches.

get_repeat_count()

Get the repeat count of the RepeatDataset, or 1 if the dataset is not repeated.

Returns

Number, the count of repeat.

map(operations=None, input_columns=None, output_columns=None, column_order=None, num_parallel_workers=None, python_multiprocessing=False, cache=None, callbacks=None)

Apply each operation in operations to this dataset.

The order of operations is determined by the position of each operation in the operations parameter. operations[0] will be applied first, then operations[1], then operations[2], etc.

Each operation will be passed one or more columns from the dataset as input, and zero or more columns will be outputted. The first operation will be passed the columns specified in input_columns as input. If there is more than one operator in operations, the outputted columns of the previous operation are used as the input columns for the next operation. The columns outputted by the very last operation will be assigned names specified by output_columns.

Only the columns specified in column_order will be propagated to the child node. These columns will be in the same order as specified in column_order.
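The chaining rule described above can be illustrated with a minimal pure-Python sketch. It does not call MindSpore: apply_operations is a hypothetical helper that only mimics how the columns outputted by one operation become the input columns of the next, and plain integers stand in for column data.

```python
def apply_operations(operations, inputs):
    """Hypothetical helper: feed `inputs` through `operations` in list order."""
    columns = tuple(inputs)
    for op in operations:
        result = op(*columns)
        # An operation may return a single column or a tuple of columns.
        columns = result if isinstance(result, tuple) else (result,)
    return columns


operations = [
    lambda x, y: (x, x + y, x + y + 1),      # 2 columns in, 3 columns out
    lambda x, y, z: x * y * z,               # 3 columns in, 1 column out
    lambda x: (x % 2, x % 3, x % 5, x % 7),  # 1 column in, 4 columns out
]

# Two input columns go in; the last operation decides how many come out.
outputs = apply_operations(operations, (2, 3))
print(outputs)
```

If the number of columns returned by one operation does not match the number of arguments of the next, the call fails in this sketch with a TypeError, which mirrors the requirement that adjacent operations agree on their column counts.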

Parameters
  • operations (Union[list[TensorOp], list[functions]]) – List of operations to be applied on the dataset. Operations are applied in the order they appear in this list.

  • input_columns (list[str], optional) – List of the names of the columns that will be passed to the first operation as input. The size of this list must match the number of input columns expected by the first operator. (default=None, the first operation will be passed as many columns as it requires, starting from the first column).

  • output_columns (list[str], optional) – List of names assigned to the columns outputted by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. (default=None, output columns will have the same name as the input columns, i.e., the columns will be replaced).

  • column_order (list[str], optional) – List of all the desired columns to propagate to the child node. This list must be a subset of all the columns in the dataset after all operations are applied. The order of the columns in each row propagated to the child node follow the order they appear in this list. The parameter is mandatory if the len(input_columns) != len(output_columns). (default=None, all columns will be propagated to the child node, the order of the columns will remain the same).

  • num_parallel_workers (int, optional) – Number of threads used to process the dataset in parallel (default=None, the value from the configuration will be used).

  • python_multiprocessing (bool, optional) – Parallelize Python operations with multiple worker processes. This option could be beneficial if the Python operation is computational heavy (default=False).

  • cache (DatasetCache, optional) – Tensor cache to use. (default=None which means no cache is used). The cache feature is under development and is not recommended.

  • callbacks (DSCallback, list[DSCallback], optional) – List of Dataset callbacks to be called (default=None).

Returns

MapDataset, dataset after mapping operation.

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.vision.c_transforms as c_transforms
>>>
>>> # data is an instance of Dataset which has 2 columns, "image" and "label".
>>> # ds_pyfunc is an instance of Dataset which has 3 columns, "col0", "col1", and "col2".
>>> # Each column is a 2D array of integers.
>>>
>>> # Set the global configuration value for num_parallel_workers to be 2.
>>> # Operations which use this configuration value will use 2 worker threads,
>>> # unless otherwise specified in the operator's constructor.
>>> # set_num_parallel_workers can be called again later if a different
>>> # global configuration value for the number of worker threads is desired.
>>> ds.config.set_num_parallel_workers(2)
>>>
>>> # Define two operations, where each operation accepts 1 input column and outputs 1 column.
>>> decode_op = c_transforms.Decode(rgb_format=True)
>>> random_jitter_op = c_transforms.RandomColorAdjust((0.8, 0.8), (1, 1), (1, 1), (0, 0))
>>>
>>> # 1) Simple map example
>>>
>>> operations = [decode_op]
>>> input_columns = ["image"]
>>>
>>> # Apply decode_op on column "image". This column will be replaced by the outputted
>>> # column of decode_op. Since column_order is not provided, both columns "image"
>>> # and "label" will be propagated to the child node in their original order.
>>> ds_decoded = data.map(operations, input_columns)
>>>
>>> # Rename column "image" to "decoded_image".
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(operations, input_columns, output_columns)
>>>
>>> # Specify the order of the columns.
>>> column_order = ["label", "image"]
>>> ds_decoded = data.map(operations, input_columns, None, column_order)
>>>
>>> # Rename column "image" to "decoded_image" and also specify the order of the columns.
>>> column_order = ["label", "decoded_image"]
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(operations, input_columns, output_columns, column_order)
>>>
>>> # Rename column "image" to "decoded_image" and keep only this column.
>>> column_order = ["decoded_image"]
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(operations, input_columns, output_columns, column_order)
>>>
>>> # A simple example using pyfunc: Renaming columns and specifying column order
>>> # work in the same way as the previous examples.
>>> input_columns = ["col0"]
>>> operations = [(lambda x: x + 1)]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns)
>>>
>>> # 2) Map example with more than one operation
>>>
>>> # If this list of operations is used with map, decode_op will be applied
>>> # first, then random_jitter_op will be applied.
>>> operations = [decode_op, random_jitter_op]
>>>
>>> input_columns = ["image"]
>>>
>>> # Create a dataset where the images are decoded, then randomly color jittered.
>>> # decode_op takes column "image" as input and outputs one column. The column
>>> # outputted by decode_op is passed as input to random_jitter_op.
>>> # random_jitter_op will output one column. Column "image" will be replaced by
>>> # the column outputted by random_jitter_op (the very last operation). All other
>>> # columns are unchanged. Since column_order is not specified, the order of the
>>> # columns will remain the same.
>>> ds_mapped = data.map(operations, input_columns)
>>>
>>> # Create a dataset that is identical to ds_mapped, except the column "image"
>>> # that is outputted by random_jitter_op is renamed to "image_transformed".
>>> # Specifying column order works in the same way as examples in 1).
>>> output_columns = ["image_transformed"]
>>> ds_mapped_and_renamed = data.map(operations, input_columns, output_columns)
>>>
>>> # Multiple operations using pyfunc: Renaming columns and specifying column order
>>> # work in the same way as examples in 1).
>>> input_columns = ["col0"]
>>> operations = [(lambda x: x + x), (lambda x: x - 1)]
>>> output_columns = ["col0_mapped"]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns, output_columns)
>>>
>>> # 3) Example where number of input columns is not equal to number of output columns
>>>
>>> # operations[0] is a lambda that takes 2 columns as input and outputs 3 columns.
>>> # operations[1] is a lambda that takes 3 columns as input and outputs 1 column.
>>> # operations[2] is a lambda that takes 1 column as input and outputs 4 columns.
>>> #
>>> # Note: The number of output columns of operations[i] must equal the number of
>>> # input columns of operations[i+1]. Otherwise, this map call will result
>>> # in an error.
>>> operations = [(lambda x, y: (x, x + y, x + y + 1)),
>>>               (lambda x, y, z: x * y * z),
>>>               (lambda x: (x % 2, x % 3, x % 5, x % 7))]
>>>
>>> # Note: Since the number of input columns is not the same as the number of
>>> # output columns, the output_columns and column_order parameters must be
>>> # specified. Otherwise, this map call will also result in an error.
>>> input_columns = ["col2", "col0"]
>>> output_columns = ["mod2", "mod3", "mod5", "mod7"]
>>>
>>> # Propagate all columns to the child node in this order:
>>> column_order = ["col0", "col2", "mod2", "mod3", "mod5", "mod7", "col1"]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns, output_columns, column_order)
>>>
>>> # Propagate some columns to the child node in this order:
>>> column_order = ["mod7", "mod3", "col1"]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns, output_columns, column_order)
num_classes()

Get the number of classes in a dataset.

Returns

Number, number of classes.

output_shapes()

Get the shapes of output data.

Returns

List, list of shapes of each column.

output_types()

Get the types of output data.

Returns

List of data types.

project(columns)

Project certain columns in input dataset.

The specified columns will be selected from the dataset and passed down the pipeline in the order specified. The other columns are discarded.

Parameters

columns (list[str]) – List of names of the columns to project.

Returns

ProjectDataset, dataset projected.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>> columns_to_project = ["column3", "column1", "column2"]
>>>
>>> # Create a dataset that consists of column3, column1, column2
>>> # in that order, regardless of the original order of columns.
>>> data = data.project(columns=columns_to_project)
rename(input_columns, output_columns)

Rename the columns in input datasets.

Parameters
  • input_columns (list[str]) – List of names of the input columns.

  • output_columns (list[str]) – List of names of the output columns.

Returns

RenameDataset, dataset renamed.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> input_columns = ["input_col1", "input_col2", "input_col3"]
>>> output_columns = ["output_col1", "output_col2", "output_col3"]
>>>
>>> # Create a dataset where input_col1 is renamed to output_col1, and
>>> # input_col2 is renamed to output_col2, and input_col3 is renamed
>>> # to output_col3.
>>> data = data.rename(input_columns=input_columns, output_columns=output_columns)
repeat(count=None)

Repeat this dataset count times. Repeat indefinitely if the count is None or -1.

Note

The order of using repeat and batch reflects the number of batches. It is recommended that the repeat operation be used after the batch operation. If dataset_sink_mode is False, the repeat operation is invalid. If dataset_sink_mode is True, repeat count must be equal to the epoch of training. Otherwise, errors could occur since the amount of data is not the amount training requires.
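Why the order of repeat and batch changes the number of batches can be checked with simple arithmetic. The sketch below is plain Python rather than a MindSpore pipeline; num_batches is a hypothetical helper and the row counts are made-up illustration values.

```python
import math


def num_batches(num_rows, batch_size, drop_remainder=False):
    """Hypothetical helper: how many batches num_rows rows are grouped into."""
    if drop_remainder:
        return num_rows // batch_size
    return math.ceil(num_rows / batch_size)


rows, batch_size, repeat_count = 10, 4, 2

# batch first, then repeat: every epoch yields ceil(10 / 4) = 3 batches,
# and the last batch of each epoch holds only 2 rows.
batch_then_repeat = num_batches(rows, batch_size) * repeat_count

# repeat first, then batch: the 20 repeated rows are batched as one stream,
# so a batch can mix rows from two consecutive epochs.
repeat_then_batch = num_batches(rows * repeat_count, batch_size)

print(batch_then_repeat, repeat_then_batch)
```

Batching before repeating keeps epoch boundaries aligned with batch boundaries, which is one reason the recommended order is batch followed by repeat.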

Parameters

count (int) – Number of times the dataset is repeated (default=None).

Returns

RepeatDataset, dataset repeated.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>>
>>> # Create a dataset where the dataset is repeated for 50 epochs
>>> repeated = data.repeat(50)
>>>
>>> # Create a dataset where each epoch is shuffled individually
>>> shuffled_and_repeated = data.shuffle(10)
>>> shuffled_and_repeated = shuffled_and_repeated.repeat(50)
>>>
>>> # Create a dataset where the dataset is first repeated for
>>> # 50 epochs before shuffling. The shuffle operator will treat
>>> # the entire 50 epochs as one big dataset.
>>> repeat_and_shuffle = data.repeat(50)
>>> repeat_and_shuffle = repeat_and_shuffle.shuffle(10)
reset()

Reset the dataset for next epoch.

save(file_name, num_files=1, file_type='mindrecord')

Save the dynamic data processed by the dataset pipeline in common dataset format. Supported dataset formats: ‘mindrecord’ only.

Implicit type casting exists when saving data as ‘mindrecord’. The table below shows how to do type casting.

Implicit Type Casting when Saving as ‘mindrecord’

Type in ‘dataset’    Type in ‘mindrecord’    Details
bool                 None                    Not supported
int8                 int32
uint8                bytes(1D uint8)         Drop dimension
int16                int32
uint16               int32
int32                int32
uint32               int64
int64                int64
uint64               None                    Not supported
float16              float32
float32              float32
float64              float64
string               string                  Multi-dimensional string not supported
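
The casting table can also be read as a simple lookup. The sketch below only restates the table in plain Python; MINDRECORD_TYPE and saved_type are hypothetical names for illustration and are not part of the MindSpore API.

```python
# Mapping copied from the table above; None marks a type that is not supported.
MINDRECORD_TYPE = {
    "bool": None,
    "int8": "int32",
    "uint8": "bytes",      # stored as 1D uint8 bytes, dimensions are dropped
    "int16": "int32",
    "uint16": "int32",
    "int32": "int32",
    "uint32": "int64",
    "int64": "int64",
    "uint64": None,
    "float16": "float32",
    "float32": "float32",
    "float64": "float64",
    "string": "string",    # multi-dimensional strings are not supported
}


def saved_type(dataset_type):
    """Hypothetical helper: type a column would have after saving."""
    target = MINDRECORD_TYPE[dataset_type]
    if target is None:
        raise TypeError(f"{dataset_type} cannot be saved as mindrecord")
    return target


print(saved_type("uint16"), saved_type("float16"))
```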

Note

  1. To save the samples in order, set dataset’s shuffle to False and num_files to 1.

  2. Before calling the function, do not use batch operator, repeat operator or data augmentation operators with random attribute in map operator.

  3. Mindrecord does not support DE_UINT64, multi-dimensional DE_UINT8(drop dimension) nor multi-dimensional DE_STRING.

Parameters
  • file_name (str) – Path to dataset file.

  • num_files (int, optional) – Number of dataset files (default=1).

  • file_type (str, optional) – Dataset format (default=’mindrecord’).

shuffle(buffer_size)

Randomly shuffles the rows of this dataset using the following algorithm:

  1. Make a shuffle buffer that contains the first buffer_size rows.

  2. Randomly select an element from the shuffle buffer to be the next row propagated to the child node.

  3. Get the next row (if any) from the parent node and put it in the shuffle buffer.

  4. Repeat steps 2 and 3 until there are no more rows left in the shuffle buffer.

A seed can be provided to be used on the first epoch. In every subsequent epoch, the seed is changed to a new, randomly generated value.
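
The four steps above can be sketched in plain Python. This is an illustration of the buffer-based algorithm only, not the MindSpore implementation; buffer_shuffle is a hypothetical name, and Python's random module stands in for the internal random number generator.

```python
import random


def buffer_shuffle(rows, buffer_size, seed=None):
    """Yield rows in an order produced by a fixed-size shuffle buffer."""
    rng = random.Random(seed)
    source = iter(rows)

    # Step 1: fill the buffer with the first buffer_size rows.
    buffer = []
    for row in source:
        buffer.append(row)
        if len(buffer) >= buffer_size:
            break

    # Step 4: repeat steps 2 and 3 until the buffer is empty.
    while buffer:
        # Step 2: pick a random element of the buffer as the next output row.
        index = rng.randrange(len(buffer))
        selected = buffer[index]
        # Step 3: refill the freed slot from the parent, if rows remain.
        try:
            buffer[index] = next(source)
        except StopIteration:
            buffer.pop(index)
        yield selected


shuffled = list(buffer_shuffle(range(10), buffer_size=4, seed=58))
print(shuffled)
```

Because only buffer_size rows are candidates at any time, the first output row always comes from the first buffer_size input rows. A small buffer therefore gives only a local shuffle, while a buffer as large as the dataset gives a global one.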

Parameters

buffer_size (int) – The size of the buffer (must be larger than 1) for shuffling. Setting buffer_size equal to the number of rows in the entire dataset will result in a global shuffle.

Returns

ShuffleDataset, dataset shuffled.

Raises

RuntimeError – If sync operators exist before shuffle.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> # Optionally set the seed for the first epoch
>>> ds.config.set_seed(58)
>>>
>>> # Create a shuffled dataset using a shuffle buffer of size 4
>>> data = data.shuffle(4)
skip(count)

Skip the first N elements of this dataset.

Parameters

count (int) – Number of elements in the dataset to be skipped.

Returns

SkipDataset, dataset skipped.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> # Create a dataset which skips first 3 elements from data
>>> data = data.skip(3)
split(sizes, randomize=True)

Split the dataset into smaller, non-overlapping datasets.

Parameters
  • sizes (Union[list[int], list[float]]) –

    If a list of integers [s1, s2, …, sn] is provided, the dataset will be split into n datasets of size s1, size s2, …, size sn respectively. If the sum of all sizes does not equal the original dataset size, an error will occur. If a list of floats [f1, f2, …, fn] is provided, all floats must be between 0 and 1 and must sum to 1, otherwise an error will occur. The dataset will be split into n Datasets of size round(f1*K), round(f2*K), …, round(fn*K) where K is the size of the original dataset. If after rounding:

    • Any size equals 0, an error will occur.

    • The sum of split sizes < K, the difference will be added to the first split.

    • The sum of split sizes > K, the difference will be removed from the first large enough split such that it will have at least 1 row after removing the difference.

  • randomize (bool, optional) – Determines whether or not to split the data randomly (default=True). If True, the data will be randomly split. Otherwise, each split will be created with consecutive rows from the dataset.
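
The rounding rules for a list of floats can be sketched as follows. This is a plain-Python illustration, not the MindSpore implementation: split_sizes is a hypothetical helper, and it uses Python's built-in round, which rounds exact halves to the nearest even integer; the library may treat ties differently.

```python
def split_sizes(fractions, dataset_size):
    """Hypothetical helper: turn fractions into split sizes that sum to dataset_size."""
    sizes = [round(f * dataset_size) for f in fractions]
    if any(size == 0 for size in sizes):
        raise RuntimeError("a split has size 0 after rounding")

    difference = dataset_size - sum(sizes)
    if difference > 0:
        # Sum of split sizes < K: add the difference to the first split.
        sizes[0] += difference
    elif difference < 0:
        # Sum of split sizes > K: remove the excess from the first split
        # that still keeps at least 1 row afterwards.
        excess = -difference
        for i, size in enumerate(sizes):
            if size - excess >= 1:
                sizes[i] -= excess
                break
    return sizes


print(split_sizes([0.33, 0.33, 0.34], 10))  # rounded sizes sum to 9, one short
print(split_sizes([0.5, 0.5], 3))           # rounded sizes sum to 4, one over
```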

Note

  1. There is an optimized split function, which will be called automatically when the dataset that calls this function is a MappableDataset.

  2. Dataset should not be sharded if split is going to be called. Instead, create a DistributedSampler and specify a split to shard after splitting. If dataset is sharded after a split, it is strongly recommended to set the same seed in each instance of execution, otherwise each shard may not be part of the same split (see Examples).

  3. It is strongly recommended to not shuffle the dataset, but use randomize=True instead. Shuffling the dataset may not be deterministic, which means the data in each split will be different in each epoch. Furthermore, if sharding occurs after split, each shard may not be part of the same split.

Raises
  • RuntimeError – If get_dataset_size returns None or is not supported for this dataset.

  • RuntimeError – If sizes is list of integers and sum of all elements in sizes does not equal the dataset size.

  • RuntimeError – If sizes is list of float and there is a split with size 0 after calculations.

  • RuntimeError – If the dataset is sharded prior to calling split.

  • ValueError – If sizes is list of float and not all floats are between 0 and 1, or if the floats don’t sum to 1.

Returns

tuple(Dataset), a tuple of datasets that have been split.

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_dir = "/path/to/imagefolder_directory"
>>>
>>> # Since many datasets have shuffle on by default, set shuffle to False if split will be called!
>>> data = ds.ImageFolderDataset(dataset_dir, shuffle=False)
>>>
>>> # Set the seed, and tell split to use this seed when randomizing.
>>> # This is needed because sharding will be done later
>>> ds.config.set_seed(58)
>>> train, test = data.split([0.9, 0.1])
>>>
>>> # To shard the train dataset, use a DistributedSampler
>>> train_sampler = ds.DistributedSampler(10, 2)
>>> train.use_sampler(train_sampler)
sync_update(condition_name, num_batch=None, data=None)

Release a blocking condition and trigger callback with given data.

Parameters
  • condition_name (str) – The condition name that is used to toggle sending next row.

  • num_batch (Union[int, None]) – The number of batches (rows) that are released. When num_batch is None, it will default to the number specified by the sync_wait operator (default=None).

  • data (Union[dict, None]) – The data passed to the callback (default=None).

sync_wait(condition_name, num_batch=1, callback=None)

Add a blocking condition to the input Dataset.

Parameters
  • num_batch (int) – The number of batches without blocking at the start of each epoch.

  • condition_name (str) – The condition name that is used to toggle sending next row.

  • callback (function) – The callback function that will be invoked when sync_update is called.

Raises

RuntimeError – If condition name already exists.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> data = data.sync_wait("callback1")
>>> data = data.batch(batch_size)
>>> for batch_data in data.create_dict_iterator():
>>>     data.sync_update("callback1")
take(count=-1)

Take at most the given number of elements from the dataset.

Note

  1. If count is greater than the number of elements in the dataset or equal to -1, all the elements in dataset will be taken.

  2. The order of using take and batch matters. If take is before batch operation, then take given number of rows; otherwise take given number of batches.

Parameters

count (int, optional) – Number of elements to be taken from the dataset (default=-1).

Returns

TakeDataset, dataset taken.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> # Create a dataset where the dataset includes 50 elements.
>>> data = data.take(50)
to_device(send_epoch_end=True)

Transfer data through CPU, GPU or Ascend devices.

Parameters

send_epoch_end (bool, optional) – Whether to send end of sequence to device or not (default=True).

Note

If the device is Ascend, features of data will be transferred one by one. Each transfer is limited to 256M of data.

Returns

TransferDataset, dataset for transferring.

Raises
  • TypeError – If device_type is empty.

  • ValueError – If device_type is not ‘Ascend’, ‘GPU’ or ‘CPU’.

  • RuntimeError – If dataset is unknown.

  • RuntimeError – If distribution file path is given but failed to read.

use_sampler(new_sampler)

Make the current dataset use the provided new_sampler.

Parameters

new_sampler (Sampler) – The sampler to use for the current dataset.

Returns

Dataset, that uses new_sampler.

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_dir = "/path/to/imagefolder_directory"
>>> # Note: A SequentialSampler is created by default
>>> data = ds.ImageFolderDataset(dataset_dir)
>>>
>>> # Use a DistributedSampler instead of the SequentialSampler
>>> new_sampler = ds.DistributedSampler(10, 2)
>>> data.use_sampler(new_sampler)
zip(datasets)

Zip the datasets in the input tuple of datasets. Columns in the input datasets must not have the same name.

Parameters

datasets (Union[tuple, class Dataset]) – A tuple of datasets or a single class Dataset to be zipped together with this dataset.

Returns

ZipDataset, dataset zipped.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # ds1 and ds2 are instances of Dataset object
>>> # Create a dataset which is the combination of ds1 and ds2
>>> data = ds1.zip(ds2)
class mindspore.dataset.MnistDataset(dataset_dir, usage=None, num_samples=None, num_parallel_workers=None, shuffle=None, sampler=None, num_shards=None, shard_id=None)[source]

A source dataset for reading and parsing the MNIST dataset.

The generated dataset has two columns [‘image’, ‘label’]. The type of the image tensor is uint8. The label is a scalar uint32 tensor. This dataset can take in a sampler. ‘sampler’ and ‘shuffle’ are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using ‘sampler’ and ‘shuffle’

Parameter ‘sampler’    Parameter ‘shuffle’    Expected Order Behavior
None                   None                   random order
None                   True                   random order
None                   False                  sequential order
Sampler object         None                   order defined by sampler
Sampler object         True                   not allowed
Sampler object         False                  not allowed
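
The table can be restated as a small decision function. The sketch below is plain Python for illustration; expected_order is a hypothetical name, and any non-None object stands in for a Sampler.

```python
def expected_order(sampler=None, shuffle=None):
    """Hypothetical helper: order behavior for a sampler/shuffle combination."""
    if sampler is not None:
        if shuffle is not None:
            # 'sampler' and 'shuffle' are mutually exclusive.
            raise RuntimeError("sampler and shuffle cannot both be specified")
        return "order defined by sampler"
    if shuffle is False:
        return "sequential order"
    # shuffle is None or True without a sampler.
    return "random order"


print(expected_order())
print(expected_order(shuffle=False))
print(expected_order(sampler=object()))
```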

Citation of Mnist dataset.

@article{lecun2010mnist,
title        = {MNIST handwritten digit database},
author       = {LeCun, Yann and Cortes, Corinna and Burges, CJ},
journal      = {ATT Labs [Online]},
volume       = {2},
year         = {2010},
howpublished = {http://yann.lecun.com/exdb/mnist},
description  = {The MNIST database of handwritten digits has a training set of 60,000 examples,
                and a test set of 10,000 examples. It is a subset of a larger set available from
                NIST. The digits have been size-normalized and centered in a fixed-size image.}
}
Parameters
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • usage (str, optional) – Usage of this dataset, can be “train”, “test” or “all”. “train” will read from 60,000 train samples, “test” will read from 10,000 test samples, “all” will read from all 70,000 samples. (default=None, all samples)

  • num_samples (int, optional) – The number of images to be included in the dataset (default=None, all images).

  • num_parallel_workers (int, optional) – Number of workers to read the data (default=None, set in the config).

  • shuffle (bool, optional) – Whether or not to perform shuffle on the dataset (default=None, expected order behavior shown in the table).

  • sampler (Sampler, optional) – Object used to choose samples from the dataset (default=None, expected order behavior shown in the table).

  • num_shards (int, optional) – Number of shards that the dataset will be divided into (default=None).

  • shard_id (int, optional) – The shard ID within num_shards (default=None). This argument can only be specified when num_shards is also specified.

Raises
  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and sharding are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If shard_id is invalid (< 0 or >= num_shards).

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_dir = "/path/to/mnist_folder"
>>> # Read 3 samples from MNIST dataset
>>> mnist_dataset = ds.MnistDataset(dataset_dir=dataset_dir, num_samples=3)
>>> # Note: In mnist_dataset dataset, each dictionary has keys "image" and "label"
apply(apply_func)

Apply a function to this dataset.

Parameters

apply_func (function) – A function that must take one ‘Dataset’ as an argument and return a preprocessed ‘Dataset’.

Returns

Dataset, applied by the function.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>>
>>> # Declare an apply_func function which returns a Dataset object
>>> def apply_func(dataset):
>>>     dataset = dataset.batch(2)
>>>     return dataset
>>>
>>> # Use apply to call apply_func
>>> data = data.apply(apply_func)
Raises
  • TypeError – If apply_func is not a function.

  • TypeError – If apply_func doesn’t return a Dataset.

batch(batch_size, drop_remainder=False, num_parallel_workers=None, per_batch_map=None, input_columns=None, output_columns=None, column_order=None, pad_info=None)

Combine batch_size number of consecutive rows into batches.

For any child node, a batch is treated as a single row. For any column, all the elements within that column must have the same shape. If a per_batch_map callable is provided, it will be applied to the batches of tensors.

Note

The order of using repeat and batch affects the number of batches and per_batch_map. It is recommended that the repeat operation be used after the batch operation.

Parameters
  • batch_size (int or function) – The number of rows each batch is created with. An int or callable which takes exactly 1 parameter, BatchInfo.

  • drop_remainder (bool, optional) – Determines whether or not to drop the last possibly incomplete batch (default=False). If True, and if there are less than batch_size rows available to make the last batch, then those rows will be dropped and not propagated to the child node.

  • num_parallel_workers (int, optional) – Number of workers to process the dataset in parallel (default=None).

  • per_batch_map (callable, optional) – Per batch map callable. A callable which takes (list[Tensor], list[Tensor], …, BatchInfo) as input parameters. Each list[Tensor] represents a batch of Tensors on a given column. The number of lists should match with number of entries in input_columns. The last parameter of the callable should always be a BatchInfo object.

  • input_columns (list[str], optional) – List of names of the input columns. The size of the list should match with signature of the per_batch_map callable.

  • output_columns (list[str], optional) – [Not currently implemented] List of names assigned to the columns outputted by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. (default=None, output columns will have the same name as the input columns, i.e., the columns will be replaced).

  • column_order (list[str], optional) – [Not currently implemented] List of all the desired columns to propagate to the child node. This list must be a subset of all the columns in the dataset after all operations are applied. The order of the columns in each row propagated to the child node follow the order they appear in this list. The parameter is mandatory if the len(input_columns) != len(output_columns). (default=None, all columns will be propagated to the child node, the order of the columns will remain the same).

  • pad_info (dict, optional) – Whether to perform padding on selected columns. pad_info={“col1”:([224,224],0)} would pad column with name “col1” to a tensor of size [224,224] and fill the missing with 0.
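
The effect of pad_info on a single column can be sketched in plain Python. The example is restricted to 1-D rows for brevity; pad_rows is a hypothetical helper, and rejecting rows that are longer than the target shape is an assumption of this sketch rather than documented behavior.

```python
def pad_rows(rows, pad_shape, pad_value):
    """Hypothetical helper: pad every 1-D row to length pad_shape[0]."""
    target = pad_shape[0]
    padded = []
    for row in rows:
        if len(row) > target:
            # Assumption of this sketch: longer rows are treated as an error.
            raise ValueError("row is longer than the requested pad shape")
        padded.append(list(row) + [pad_value] * (target - len(row)))
    return padded


# Analogous to pad_info={"col1": ([4], 0)} applied to a 1-D column.
batch = pad_rows([[1, 2, 3], [4], [5, 6]], pad_shape=[4], pad_value=0)
print(batch)
```

After padding, all rows of the column have the same shape, which is what batching requires.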

Returns

BatchDataset, dataset batched.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>>
>>> # Create a dataset where every 100 rows is combined into a batch
>>> # and drops the last incomplete batch if there is one.
>>> data = data.batch(100, True)
bucket_batch_by_length(column_names, bucket_boundaries, bucket_batch_sizes, element_length_function=None, pad_info=None, pad_to_bucket_boundary=False, drop_remainder=False)

Bucket elements according to their lengths. Each bucket will be padded and batched when it is full.

A length function is called on each row in the dataset. The row is then bucketed based on its length and bucket_boundaries. When a bucket reaches its corresponding size specified in bucket_batch_sizes, the entire bucket will be padded according to pad_info, and then batched. Each batch will be full, except for maybe the last batch for each bucket.

Parameters
  • column_names (list[str]) – Columns passed to element_length_function.

  • bucket_boundaries (list[int]) – A list consisting of the upper boundaries of the buckets. Must be strictly increasing. If there are n boundaries, n+1 buckets are created: One bucket for [0, bucket_boundaries[0]), one bucket for [bucket_boundaries[i], bucket_boundaries[i+1]) for each 0<=i<n-1, and one bucket for [bucket_boundaries[n-1], inf).

  • bucket_batch_sizes (list[int]) – A list consisting of the batch sizes for each bucket. Must contain len(bucket_boundaries)+1 elements.

  • element_length_function (Callable, optional) – A function that takes in len(column_names) arguments and returns an int. If no value is provided, then len(column_names) must be 1, and the size of the first dimension of that column will be taken as the length (default=None).

  • pad_info (dict, optional) – Represents how to batch each column. The key corresponds to the column name, and the value must be a tuple of 2 elements. The first element corresponds to the shape to pad to, and the second element corresponds to the value to pad with. If a column is not specified, then that column will be padded to the longest in the current batch, and 0 will be used as the padding value. Any None dimensions will be padded to the longest in the current batch, unless if pad_to_bucket_boundary is True. If no padding is wanted, set pad_info to None (default=None).

  • pad_to_bucket_boundary (bool, optional) – If True, will pad each None dimension in pad_info to the bucket_boundary minus 1. If there are any elements that fall into the last bucket, an error will occur (default=False).

  • drop_remainder (bool, optional) – If True, will drop the last batch for each bucket if it is not a full batch (default=False).
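
The way bucket_boundaries partitions row lengths into buckets can be sketched in plain Python. The function bucket_index below is a hypothetical helper written for illustration only; it is not a MindSpore API and it says nothing about how the operator is implemented internally.

```python
from bisect import bisect_right


def bucket_index(length, bucket_boundaries):
    """Return the index of the bucket that a row of the given length falls into."""
    # bisect_right counts how many boundaries are <= length, which is exactly
    # the bucket index for half-open buckets [lower, upper).
    return bisect_right(bucket_boundaries, length)


bucket_boundaries = [5, 10]  # 3 buckets: [0, 5), [5, 10), [10, inf)
lengths = [0, 4, 5, 9, 10, 42]
indices = [bucket_index(n, bucket_boundaries) for n in lengths]
print(indices)  # [0, 0, 1, 1, 2, 2]
```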

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>>
>>> # Create a dataset where rows are bucketed by length, and each bucket
>>> # is padded and batched once it is full.
>>> column_names = ["col1", "col2"]
>>> bucket_boundaries = [5, 10]
>>> bucket_batch_sizes = [5, 1, 1]
>>> element_length_function = (lambda col1, col2: max(len(col1), len(col2)))
>>>
>>> # Will pad col1 to shape [2, bucket_boundaries[i]] where i is the
>>> # index of the bucket that is currently being batched.
>>> # Will pad col2 to a shape where each dimension is the longest in all
>>> # the elements currently being batched.
>>> pad_info = {"col1": ([2, None], -1)}
>>> pad_to_bucket_boundary = True
>>>
>>> data = data.bucket_batch_by_length(column_names, bucket_boundaries,
>>>                                    bucket_batch_sizes,
>>>                                    element_length_function, pad_info,
>>>                                    pad_to_bucket_boundary)
concat(datasets)

Concatenate the datasets in the input list of datasets. The “+” operator is also supported to concatenate.

Note

The column name, and rank and type of the column data must be the same in the input datasets.

Parameters

datasets (Union[list, class Dataset]) – A list of datasets or a single class Dataset to be concatenated together with this dataset.

Returns

ConcatDataset, dataset concatenated.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # ds1 and ds2 are instances of Dataset object
>>>
>>> # Create a dataset by concatenating ds1 and ds2 with "+" operator
>>> data1 = ds1 + ds2
>>> # Create a dataset by concatenating ds1 and ds2 with concat operation
>>> data1 = ds1.concat(ds2)
create_dict_iterator(num_epochs=-1, output_numpy=False)

Create an iterator over the dataset. The data retrieved will be a dictionary.

The order of the columns in the dictionary may not be the same as the original order.

Parameters
  • num_epochs (int, optional) – Maximum number of epochs that iterator can be iterated (default=-1, iterator can be iterated infinite number of epochs).

  • output_numpy (bool, optional) – Whether or not to output NumPy datatype, if output_numpy=False, iterator will output MSTensor (default=False).

Returns

Iterator, dictionary of column name-ndarray pair.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>>
>>> # create an iterator
>>> # The order of the columns in the returned dictionary may differ from the original order.
>>> iterator = data.create_dict_iterator()
>>> for item in iterator:
>>>     # print the data in column1
>>>     print(item["column1"])
create_tuple_iterator(columns=None, num_epochs=-1, output_numpy=False)

Create an iterator over the dataset. The data retrieved will be a list of ndarrays of data.

To specify which columns to list and the order needed, use columns. If columns is not provided, the order of the columns will not be changed.

Parameters
  • columns (list[str], optional) – List of columns to be used to specify the order of columns (default=None, means all columns).

  • num_epochs (int, optional) – Maximum number of epochs that iterator can be iterated. (default=-1, iterator can be iterated infinite number of epochs)

  • output_numpy (bool, optional) – Whether or not to output NumPy datatype. If output_numpy=False, iterator will output MSTensor (default=False).

Returns

Iterator, list of ndarrays.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>>
>>> # Create an iterator
>>> # The columns in the data obtained by the iterator will not be changed.
>>> iterator = data.create_tuple_iterator()
>>> for item in iterator:
>>>     # convert the returned tuple to a list and print
>>>     print(list(item))
device_que(prefetch_size=None, send_epoch_end=True)

Return a transferred Dataset that transfers data through a device.

Parameters
  • prefetch_size (int, optional) – Prefetch number of records ahead of the user’s request (default=None).

  • send_epoch_end (bool, optional) – Whether to send end of sequence to device or not (default=True).

Note

If the device is Ascend, the features of the data will be transferred one by one. Each transmission is limited to 256M.

Returns

TransferDataset, dataset for transferring.

filter(predicate, input_columns=None, num_parallel_workers=1)

Filter dataset by predicate.

Note

If input_columns is not provided or is empty, all columns will be used.

Parameters
  • predicate (callable) – Python callable which returns a boolean value. Rows for which it returns False are filtered out.

  • input_columns (list[str], optional) – List of names of the input columns (default=None, the predicate will be applied to all columns in the dataset).

  • num_parallel_workers (int, optional) – Number of workers to process the dataset in parallel (default=1).

Returns

FilterDataset, dataset filtered.

Examples

>>> import mindspore.dataset as ds
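>>> # dataset is an instance of Dataset whose column "data" holds generated values (0 ~ 63)
>>> # Keep only the rows whose value is less than 11; rows with a value
>>> # greater than or equal to 11 are filtered out.
>>> dataset_f = dataset.filter(predicate=lambda data: data < 11, input_columns=["data"])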
flat_map(func)

Map func to each row in dataset and flatten the result.

The specified func is a function that must take one ‘Ndarray’ as input and return a ‘Dataset’.
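
The "map, then flatten" behavior can be pictured with ordinary Python iterables. In the sketch below, plain lists stand in for Dataset objects and the helper flat_map is a hypothetical stand-in, not the MindSpore method.

```python
from itertools import chain


def flat_map(rows, func):
    """Apply func to every row and chain the per-row results into one sequence."""
    return chain.from_iterable(func(row) for row in rows)


rows = [1, 2, 3]
# Each row n is expanded into a small "dataset" containing n copies of n.
result = list(flat_map(rows, lambda n: [n] * n))
print(result)  # [1, 2, 2, 3, 3, 3]
```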

Parameters

func (function) – A function that must take one ‘Ndarray’ as an argument and return a ‘Dataset’.

Returns

Dataset, applied by the function.

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>>
>>> # Declare a function which returns a Dataset object
>>> def flat_map_func(x):
>>>     data_dir = text.to_str(x[0])
>>>     d = ds.ImageFolderDataset(data_dir)
>>>     return d
>>> # data is an instance of a Dataset object.
>>> data = ds.TextFileDataset(DATA_FILE)
>>> data = data.flat_map(flat_map_func)
Raises
  • TypeError – If func is not a function.

  • TypeError – If func doesn’t return a Dataset.

get_batch_size()

Get the size of a batch.

Returns

Number, the number of rows in a batch.

get_col_names()

Get the names of the columns in the dataset.

get_dataset_size()[source]

Get the number of batches in an epoch.

Returns

Number, number of batches.

get_repeat_count()

Get the repeat count of the RepeatDataset, or 1 if the dataset is not repeated.

Returns

Number, the count of repeat.

map(operations=None, input_columns=None, output_columns=None, column_order=None, num_parallel_workers=None, python_multiprocessing=False, cache=None, callbacks=None)

Apply each operation in operations to this dataset.

The order of operations is determined by the position of each operation in the operations parameter. operations[0] will be applied first, then operations[1], then operations[2], etc.

Each operation will be passed one or more columns from the dataset as input, and zero or more columns will be outputted. The first operation will be passed the columns specified in input_columns as input. If there is more than one operator in operations, the outputted columns of the previous operation are used as the input columns for the next operation. The columns outputted by the very last operation will be assigned names specified by output_columns.

Only the columns specified in column_order will be propagated to the child node. These columns will be in the same order as specified in column_order.
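
The chaining rule described above (the output columns of one operation become the input columns of the next) can be sketched without MindSpore. The helper apply_operations is a hypothetical name used only for this illustration; it operates on plain Python values rather than dataset columns.

```python
def apply_operations(operations, *columns):
    """Feed the output columns of each operation into the next operation."""
    for op in operations:
        result = op(*columns)
        # An operation may return a single column or a tuple of columns.
        columns = result if isinstance(result, tuple) else (result,)
    return columns


operations = [
    lambda x, y: (x, x + y, x + y + 1),      # 2 columns in, 3 columns out
    lambda x, y, z: x * y * z,               # 3 columns in, 1 column out
    lambda x: (x % 2, x % 3, x % 5, x % 7),  # 1 column in, 4 columns out
]
outputs = apply_operations(operations, 2, 3)
print(outputs)  # (0, 0, 0, 4)
```

Because the last operation returns four values, four names would be needed in output_columns for the equivalent map call.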

Parameters
  • operations (Union[list[TensorOp], list[functions]]) – List of operations to be applied on the dataset. Operations are applied in the order they appear in this list.

  • input_columns (list[str]) – List of the names of the columns that will be passed to the first operation as input. The size of this list must match the number of input columns expected by the first operator. (default=None, the first operation will be passed as many columns as it requires, starting from the first column).

  • output_columns (list[str], optional) – List of names assigned to the columns outputted by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. (default=None, output columns will have the same name as the input columns, i.e., the columns will be replaced).

  • column_order (list[str], optional) – List of all the desired columns to propagate to the child node. This list must be a subset of all the columns in the dataset after all operations are applied. The order of the columns in each row propagated to the child node follows the order in which they appear in this list. This parameter is mandatory if len(input_columns) != len(output_columns). (default=None, all columns will be propagated to the child node, and the order of the columns will remain the same).

  • num_parallel_workers (int, optional) – Number of threads used to process the dataset in parallel (default=None, the value from the configuration will be used).

  • python_multiprocessing (bool, optional) – Parallelize Python operations with multiple worker processes. This option could be beneficial if the Python operation is computational heavy (default=False).

  • cache (DatasetCache, optional) – Tensor cache to use. (default=None which means no cache is used). The cache feature is under development and is not recommended.

  • callbacks (DSCallback, list[DSCallback], optional) – List of Dataset callbacks to be called (default=None).

Returns

MapDataset, dataset after mapping operation.

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.vision.c_transforms as c_transforms
>>>
>>> # data is an instance of Dataset which has 2 columns, "image" and "label".
>>> # ds_pyfunc is an instance of Dataset which has 3 columns, "col0", "col1", and "col2".
>>> # Each column is a 2D array of integers.
>>>
>>> # Set the global configuration value for num_parallel_workers to be 2.
>>> # Operations which use this configuration value will use 2 worker threads,
>>> # unless otherwise specified in the operator's constructor.
>>> # set_num_parallel_workers can be called again later if a different
>>> # global configuration value for the number of worker threads is desired.
>>> ds.config.set_num_parallel_workers(2)
>>>
>>> # Define two operations, where each operation accepts 1 input column and outputs 1 column.
>>> decode_op = c_transforms.Decode(rgb_format=True)
>>> random_jitter_op = c_transforms.RandomColorAdjust((0.8, 0.8), (1, 1), (1, 1), (0, 0))
>>>
>>> # 1) Simple map example
>>>
>>> operations = [decode_op]
>>> input_columns = ["image"]
>>>
>>> # Apply decode_op on column "image". This column will be replaced by the outputted
>>> # column of decode_op. Since column_order is not provided, both columns "image"
>>> # and "label" will be propagated to the child node in their original order.
>>> ds_decoded = data.map(operations, input_columns)
>>>
>>> # Rename column "image" to "decoded_image".
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(operations, input_columns, output_columns)
>>>
>>> # Specify the order of the columns.
>>> column_order = ["label", "image"]
>>> ds_decoded = data.map(operations, input_columns, None, column_order)
>>>
>>> # Rename column "image" to "decoded_image" and also specify the order of the columns.
>>> column_order = ["label", "decoded_image"]
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(operations, input_columns, output_columns, column_order)
>>>
>>> # Rename column "image" to "decoded_image" and keep only this column.
>>> column_order = ["decoded_image"]
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(operations, input_columns, output_columns, column_order)
>>>
>>> # A simple example using pyfunc: Renaming columns and specifying column order
>>> # work in the same way as the previous examples.
>>> input_columns = ["col0"]
>>> operations = [(lambda x: x + 1)]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns)
>>>
>>> # 2) Map example with more than one operation
>>>
>>> # If this list of operations is used with map, decode_op will be applied
>>> # first, then random_jitter_op will be applied.
>>> operations = [decode_op, random_jitter_op]
>>>
>>> input_columns = ["image"]
>>>
>>> # Create a dataset where the images are decoded, then randomly color jittered.
>>> # decode_op takes column "image" as input and outputs one column. The column
>>> # outputted by decode_op is passed as input to random_jitter_op.
>>> # random_jitter_op will output one column. Column "image" will be replaced by
>>> # the column outputted by random_jitter_op (the very last operation). All other
>>> # columns are unchanged. Since column_order is not specified, the order of the
>>> # columns will remain the same.
>>> ds_mapped = data.map(operations, input_columns)
>>>
>>> # Create a dataset that is identical to ds_mapped, except the column "image"
>>> # that is outputted by random_jitter_op is renamed to "image_transformed".
>>> # Specifying column order works in the same way as examples in 1).
>>> output_columns = ["image_transformed"]
>>> ds_mapped_and_renamed = data.map(operations, input_columns, output_columns)
>>>
>>> # Multiple operations using pyfunc: Renaming columns and specifying column order
>>> # work in the same way as examples in 1).
>>> input_columns = ["col0"]
>>> operations = [(lambda x: x + x), (lambda x: x - 1)]
>>> output_columns = ["col0_mapped"]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns, output_columns)
>>>
>>> # 3) Example where number of input columns is not equal to number of output columns
>>>
>>> # operations[0] is a lambda that takes 2 columns as input and outputs 3 columns.
>>> # operations[1] is a lambda that takes 3 columns as input and outputs 1 column.
>>> # operations[2] is a lambda that takes 1 column as input and outputs 4 columns.
>>> #
>>> # Note: The number of output columns of operations[i] must equal the number of
>>> # input columns of operations[i+1]. Otherwise, this map call will result
>>> # in an error.
>>> operations = [(lambda x, y: (x, x + y, x + y + 1)),
>>>               (lambda x, y, z: x * y * z),
>>>               (lambda x: (x % 2, x % 3, x % 5, x % 7))]
>>>
>>> # Note: Since the number of input columns is not the same as the number of
>>> # output columns, the output_columns and column_order parameters must be
>>> # specified. Otherwise, this map call will also result in an error.
>>> input_columns = ["col2", "col0"]
>>> output_columns = ["mod2", "mod3", "mod5", "mod7"]
>>>
>>> # Propagate all columns to the child node in this order:
>>> column_order = ["col0", "col2", "mod2", "mod3", "mod5", "mod7", "col1"]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns, output_columns, column_order)
>>>
>>> # Propagate some columns to the child node in this order:
>>> column_order = ["mod7", "mod3", "col1"]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns, output_columns, column_order)
num_classes()

Get the number of classes in a dataset.

Returns

Number, number of classes.

output_shapes()

Get the shapes of output data.

Returns

List, list of shapes of each column.

output_types()

Get the types of output data.

Returns

List of data types.

project(columns)

Project certain columns in input dataset.

The specified columns will be selected from the dataset and passed down the pipeline in the order specified. The other columns are discarded.

Parameters

columns (list[str]) – List of names of the columns to project.

Returns

ProjectDataset, dataset projected.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>> columns_to_project = ["column3", "column1", "column2"]
>>>
>>> # Create a dataset that consists of column3, column1, column2
>>> # in that order, regardless of the original order of columns.
>>> data = data.project(columns=columns_to_project)
rename(input_columns, output_columns)

Rename the columns in input datasets.

Parameters
  • input_columns (list[str]) – List of names of the input columns.

  • output_columns (list[str]) – List of names of the output columns.

Returns

RenameDataset, dataset renamed.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> input_columns = ["input_col1", "input_col2", "input_col3"]
>>> output_columns = ["output_col1", "output_col2", "output_col3"]
>>>
>>> # Create a dataset where input_col1 is renamed to output_col1, and
>>> # input_col2 is renamed to output_col2, and input_col3 is renamed
>>> # to output_col3.
>>> data = data.rename(input_columns=input_columns, output_columns=output_columns)
repeat(count=None)

Repeat this dataset count times. Repeat indefinitely if the count is None or -1.

Note

The order of using repeat and batch affects the number of batches. It is recommended that the repeat operation be used after the batch operation. If dataset_sink_mode is False, the repeat operation is invalid. If dataset_sink_mode is True, the repeat count must be equal to the number of training epochs. Otherwise, errors could occur because the amount of data does not match the amount that training requires.
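
Why the order matters can be seen with simple arithmetic. The sketch below assumes drop_remainder=False and assumes that, when repeat comes first, a batch may span the boundary between two repetitions; num_batches is a hypothetical helper, not a MindSpore function.

```python
import math


def num_batches(num_rows, batch_size, repeat_count, batch_first):
    """Count the batches produced by combining batch and repeat in either order."""
    if batch_first:
        # batch -> repeat: every repetition yields the same set of batches.
        return math.ceil(num_rows / batch_size) * repeat_count
    # repeat -> batch: the repeated rows form one long stream that is batched.
    return math.ceil(num_rows * repeat_count / batch_size)


print(num_batches(10, 4, 3, batch_first=True))   # 9
print(num_batches(10, 4, 3, batch_first=False))  # 8
```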

Parameters

count (int) – Number of times the dataset is repeated (default=None).

Returns

RepeatDataset, dataset repeated.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>>
>>> # Create a dataset where the dataset is repeated for 50 epochs
>>> repeated = data.repeat(50)
>>>
>>> # Create a dataset where each epoch is shuffled individually
>>> shuffled_and_repeated = data.shuffle(10)
>>> shuffled_and_repeated = shuffled_and_repeated.repeat(50)
>>>
>>> # Create a dataset where the dataset is first repeated for
>>> # 50 epochs before shuffling. The shuffle operator will treat
>>> # the entire 50 epochs as one big dataset.
>>> repeat_and_shuffle = data.repeat(50)
>>> repeat_and_shuffle = repeat_and_shuffle.shuffle(10)
reset()

Reset the dataset for next epoch.

save(file_name, num_files=1, file_type='mindrecord')

Save the dynamic data processed by the dataset pipeline in a common dataset format. Supported dataset formats: ‘mindrecord’ only.

Implicit type casting exists when saving data as ‘mindrecord’. The table below shows how to do type casting.

Implicit Type Casting when Saving as ‘mindrecord’

Type in ‘dataset’    Type in ‘mindrecord’    Details
bool                 None                    Not supported
int8                 int32
uint8                bytes(1D uint8)         Drop dimension
int16                int32
uint16               int32
int32                int32
uint32               int64
int64                int64
uint64               None                    Not supported
float16              float32
float32              float32
float64              float64
string               string                  Multi-dimensional string not supported

Note

  1. To save the samples in order, set dataset’s shuffle to False and num_files to 1.

  2. Before calling the function, do not use batch operator, repeat operator or data augmentation operators with random attribute in map operator.

  3. Mindrecord does not support DE_UINT64, multi-dimensional DE_UINT8(drop dimension) nor multi-dimensional DE_STRING.
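
For quick reference, the casting table above can be written as a lookup. This is only a restatement of the documented table in Python; the names MINDRECORD_TYPE and mindrecord_type are hypothetical and do not exist in mindspore.dataset.

```python
# Dataset type -> type stored in mindrecord (None means "not supported").
MINDRECORD_TYPE = {
    "bool": None,
    "int8": "int32",
    "uint8": "bytes",      # stored as 1D uint8 bytes, dimensions are dropped
    "int16": "int32",
    "uint16": "int32",
    "int32": "int32",
    "uint32": "int64",
    "int64": "int64",
    "uint64": None,
    "float16": "float32",
    "float32": "float32",
    "float64": "float64",
    "string": "string",    # multi-dimensional strings are not supported
}


def mindrecord_type(dataset_type):
    """Return the mindrecord type for a dataset type, or raise if unsupported."""
    target = MINDRECORD_TYPE[dataset_type]
    if target is None:
        raise TypeError(f"{dataset_type} cannot be saved as mindrecord")
    return target


print(mindrecord_type("uint32"))  # int64
```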

Parameters
  • file_name (str) – Path to dataset file.

  • num_files (int, optional) – Number of dataset files (default=1).

  • file_type (str, optional) – Dataset format (default=’mindrecord’).

shuffle(buffer_size)

Randomly shuffles the rows of this dataset using the following algorithm:

  1. Make a shuffle buffer that contains the first buffer_size rows.

  2. Randomly select an element from the shuffle buffer to be the next row propagated to the child node.

  3. Get the next row (if any) from the parent node and put it in the shuffle buffer.

  4. Repeat steps 2 and 3 until there are no more rows left in the shuffle buffer.
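
The four steps above can be sketched in pure Python. This is a minimal illustration of the buffer-based algorithm as described, not the MindSpore implementation; buffer_shuffle and its seed argument are hypothetical names introduced here.

```python
import random
from itertools import islice

_END = object()  # sentinel marking an exhausted source


def buffer_shuffle(rows, buffer_size, seed=None):
    """Yield rows in buffer-shuffled order."""
    rng = random.Random(seed)
    source = iter(rows)
    # Step 1: fill the shuffle buffer with the first buffer_size rows.
    buffer = list(islice(source, buffer_size))
    # Step 4: repeat until no rows are left in the buffer.
    while buffer:
        # Step 2: pick a random element of the buffer as the next output row.
        index = rng.randrange(len(buffer))
        picked = buffer[index]
        # Step 3: refill the freed slot with the next source row, if any.
        following = next(source, _END)
        if following is _END:
            buffer[index] = buffer[-1]
            buffer.pop()
        else:
            buffer[index] = following
        yield picked


shuffled = list(buffer_shuffle(range(10), buffer_size=4, seed=58))
print(shuffled)
```

With buffer_size=4, the first output row can only be one of the first four input rows, which is why a small buffer gives only a local shuffle.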

A seed can be provided to be used on the first epoch. In every subsequent epoch, the seed is changed to a new, randomly generated value.

Parameters

buffer_size (int) – The size of the buffer (must be larger than 1) for shuffling. Setting buffer_size equal to the number of rows in the entire dataset will result in a global shuffle.

Returns

ShuffleDataset, dataset shuffled.

Raises

RuntimeError – If sync operators exist before shuffle.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> # Optionally set the seed for the first epoch
>>> ds.config.set_seed(58)
>>>
>>> # Create a shuffled dataset using a shuffle buffer of size 4
>>> data = data.shuffle(4)
skip(count)

Skip the first N elements of this dataset.

Parameters

count (int) – Number of elements in the dataset to be skipped.

Returns

SkipDataset, dataset skipped.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> # Create a dataset which skips first 3 elements from data
>>> data = data.skip(3)
split(sizes, randomize=True)

Split the dataset into smaller, non-overlapping datasets.

Parameters
  • sizes (Union[list[int], list[float]]) –

    If a list of integers [s1, s2, …, sn] is provided, the dataset will be split into n datasets of size s1, size s2, …, size sn respectively. If the sum of all sizes does not equal the original dataset size, an error will occur. If a list of floats [f1, f2, …, fn] is provided, all floats must be between 0 and 1 and must sum to 1, otherwise an error will occur. The dataset will be split into n Datasets of size round(f1*K), round(f2*K), …, round(fn*K) where K is the size of the original dataset. If after rounding:

    • Any size equals 0, an error will occur.

    • The sum of split sizes < K, the difference will be added to the first split.

    • The sum of split sizes > K, the difference will be removed from the first large enough split such that it will have at least 1 row after removing the difference.

  • randomize (bool, optional) – Determines whether or not to split the data randomly (default=True). If True, the data will be randomly split. Otherwise, each split will be created with consecutive rows from the dataset.
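
The rounding rules for a list of floats can be sketched as follows. The helper split_sizes is hypothetical and only mirrors the rules stated above; note that Python's built-in round rounds halves to the nearest even integer, which may differ from the rounding used by the library.

```python
def split_sizes(fractions, dataset_size):
    """Turn fractions that sum to 1 into integer split sizes."""
    sizes = [round(f * dataset_size) for f in fractions]
    if any(size == 0 for size in sizes):
        raise RuntimeError("a split has size 0 after rounding")
    difference = dataset_size - sum(sizes)
    if difference > 0:
        # Too few rows assigned: add the missing rows to the first split.
        sizes[0] += difference
    elif difference < 0:
        # Too many rows assigned: shrink the first split that keeps >= 1 row.
        excess = -difference
        for i, size in enumerate(sizes):
            if size - excess >= 1:
                sizes[i] -= excess
                break
    return sizes


print(split_sizes([0.9, 0.1], 100))  # [90, 10]
print(split_sizes([0.5, 0.5], 3))    # [1, 2]
```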

Note

  1. There is an optimized split function, which will be called automatically when the dataset that calls this function is a MappableDataset.

  2. Dataset should not be sharded if split is going to be called. Instead, create a DistributedSampler and specify a split to shard after splitting. If dataset is sharded after a split, it is strongly recommended to set the same seed in each instance of execution, otherwise each shard may not be part of the same split (see Examples).

  3. It is strongly recommended to not shuffle the dataset, but use randomize=True instead. Shuffling the dataset may not be deterministic, which means the data in each split will be different in each epoch. Furthermore, if sharding occurs after split, each shard may not be part of the same split.

Raises
  • RuntimeError – If get_dataset_size returns None or is not supported for this dataset.

  • RuntimeError – If sizes is list of integers and sum of all elements in sizes does not equal the dataset size.

  • RuntimeError – If sizes is list of float and there is a split with size 0 after calculations.

  • RuntimeError – If the dataset is sharded prior to calling split.

  • ValueError – If sizes is list of float and not all floats are between 0 and 1, or if the floats don’t sum to 1.

Returns

tuple(Dataset), a tuple of datasets that have been split.

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_dir = "/path/to/imagefolder_directory"
>>>
>>> # Since many datasets have shuffle on by default, set shuffle to False if split will be called!
>>> data = ds.ImageFolderDataset(dataset_dir, shuffle=False)
>>>
>>> # Set the seed, and tell split to use this seed when randomizing.
>>> # This is needed because sharding will be done later
>>> ds.config.set_seed(58)
>>> train, test = data.split([0.9, 0.1])
>>>
>>> # To shard the train dataset, use a DistributedSampler
>>> train_sampler = ds.DistributedSampler(10, 2)
>>> train.use_sampler(train_sampler)
sync_update(condition_name, num_batch=None, data=None)

Release a blocking condition and trigger callback with given data.

Parameters
  • condition_name (str) – The condition name that is used to toggle sending next row.

  • num_batch (Union[int, None]) – The number of batches (rows) that are released. When num_batch is None, it will default to the number specified by the sync_wait operator (default=None).

  • data (Union[dict, None]) – The data passed to the callback (default=None).

sync_wait(condition_name, num_batch=1, callback=None)

Add a blocking condition to the input Dataset.

Parameters
  • condition_name (str) – The condition name that is used to toggle sending next row.

  • num_batch (int) – The number of batches without blocking at the start of each epoch (default=1).

  • callback (function) – The callback function that will be invoked when sync_update is called (default=None).

Raises

RuntimeError – If condition name already exists.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> # batch_size is assumed to have been defined beforehand.
>>> data = data.sync_wait("callback1")
>>> data = data.batch(batch_size)
>>> for batch_data in data.create_dict_iterator():
>>>     data.sync_update("callback1")
take(count=-1)

Take at most the given number of elements from the dataset.

Note

  1. If count is greater than the number of elements in the dataset or equal to -1, all the elements in dataset will be taken.

  2. The order of using take and batch matters. If take is before batch operation, then take given number of rows; otherwise take given number of batches.
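
The second note can be illustrated with plain lists standing in for datasets. The helpers batch and take below are hypothetical stand-ins written for this sketch, not the MindSpore operators.

```python
def batch(rows, batch_size):
    """Group consecutive rows into lists of at most batch_size rows."""
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]


def take(items, count=-1):
    """Keep at most count items; -1 keeps everything."""
    return items if count == -1 else items[:count]


rows = list(range(10))
take_then_batch = batch(take(rows, 5), 2)  # take counts rows
batch_then_take = take(batch(rows, 2), 3)  # take counts batches
print(take_then_batch)  # [[0, 1], [2, 3], [4]]
print(batch_then_take)  # [[0, 1], [2, 3], [4, 5]]
```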

Parameters

count (int, optional) – Number of elements to be taken from the dataset (default=-1).

Returns

TakeDataset, dataset taken.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> # Create a dataset that includes at most the first 50 elements of data.
>>> data = data.take(50)
to_device(send_epoch_end=True)

Transfer data through CPU, GPU or Ascend devices.

Parameters

send_epoch_end (bool, optional) – Whether to send end of sequence to device or not (default=True).

Note

If the device is Ascend, the features of the data will be transferred one by one. Each transmission is limited to 256M.

Returns

TransferDataset, dataset for transferring.

Raises
  • TypeError – If device_type is empty.

  • ValueError – If device_type is not ‘Ascend’, ‘GPU’ or ‘CPU’.

  • RuntimeError – If dataset is unknown.

  • RuntimeError – If distribution file path is given but failed to read.

use_sampler(new_sampler)

Will make the current dataset use the new_sampler provided.

Parameters

new_sampler (Sampler) – The sampler to use for the current dataset.

Returns

Dataset, that uses new_sampler.

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_dir = "/path/to/imagefolder_directory"
>>> # Note: A SequentialSampler is created by default
>>> data = ds.ImageFolderDataset(dataset_dir)
>>>
>>> # Use a DistributedSampler instead of the SequentialSampler
>>> new_sampler = ds.DistributedSampler(10, 2)
>>> data.use_sampler(new_sampler)
zip(datasets)

Zip the datasets in the input tuple of datasets. Columns in the input datasets must not have the same name.

Parameters

datasets (Union[tuple, class Dataset]) – A tuple of datasets or a single class Dataset to be zipped together with this dataset.

Returns

ZipDataset, dataset zipped.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # ds1 and ds2 are instances of Dataset object
>>> # Create a dataset which is the combination of ds1 and ds2
>>> data = ds1.zip(ds2)
class mindspore.dataset.NumpySlicesDataset(data, column_names=None, num_samples=None, num_parallel_workers=1, shuffle=None, sampler=None, num_shards=None, shard_id=None)[source]

Create a dataset with given data slices, mainly for loading Python data into dataset.

This dataset can take in a sampler. ‘sampler’ and ‘shuffle’ are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using ‘sampler’ and ‘shuffle’

Parameter ‘sampler’    Parameter ‘shuffle’    Expected Order Behavior
None                   None                   random order
None                   True                   random order
None                   False                  sequential order
Sampler object         None                   order defined by sampler
Sampler object         True                   not allowed
Sampler object         False                  not allowed

Parameters
  • data (Union[list, tuple, dict]) – Input data as a list, tuple, dict or other NumPy formats. The input data will be sliced along the first dimension to generate the rows of the dataset. Loading large data this way is not recommended, because all of the data is loaded into memory.

  • column_names (list[str], optional) – List of column names of the dataset (default=None). If column_names is not provided, when data is dict, column_names will be its keys, otherwise it will be like column_1, column_2 …

  • num_samples (int, optional) – The number of samples to be included in the dataset (default=None, all samples).

  • num_parallel_workers (int, optional) – Number of subprocesses used to fetch the dataset in parallel (default=1).

  • shuffle (bool, optional) – Whether or not to perform shuffle on the dataset. Random accessible input is required. (default=None, expected order behavior shown in the table).

  • sampler (Union[Sampler, Iterable], optional) – Object used to choose samples from the dataset. Random accessible input is required (default=None, expected order behavior shown in the table).

  • num_shards (int, optional) – Number of shards that the dataset will be divided into (default=None). When this argument is specified, ‘num_samples’ will not be used. Random accessible input is required.

  • shard_id (int, optional) – The shard ID within num_shards (default=None). This argument must be specified only when num_shards is also specified. Random accessible input is required.
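
How input data is sliced along the first dimension into rows can be pictured with the plain-Python sketch below. The function slice_rows is a hypothetical illustration, not part of mindspore.dataset; to avoid guessing the automatically generated column names, it requires explicit column_names for list and tuple input.

```python
def slice_rows(data, column_names=None):
    """Slice column-oriented data into a list of rows (dicts keyed by column name)."""
    if isinstance(data, dict):
        # For a dict, the keys serve as column names unless names are given.
        names = list(data) if column_names is None else column_names
        columns = list(data.values())
    else:
        # A tuple holds one entry per column; a list is treated as one column.
        columns = list(data) if isinstance(data, tuple) else [data]
        if column_names is None:
            raise ValueError("this sketch needs column_names for list/tuple input")
        names = column_names
    # zip(*columns) walks all columns in lockstep along the first dimension.
    return [dict(zip(names, values)) for values in zip(*columns)]


print(slice_rows({"a": [1, 2], "b": [3, 4]}))
# [{'a': 1, 'b': 3}, {'a': 2, 'b': 4}]
```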

Examples

>>> import mindspore.dataset as ds
>>>
>>> # 1) Input data can be a list
>>> data = [1, 2, 3]
>>> dataset1 = ds.NumpySlicesDataset(data, column_names=["column_1"])
>>>
>>> # 2) Input data can be a dictionary, and column_names will be its keys
>>> data = {"a": [1, 2], "b": [3, 4]}
>>> dataset2 = ds.NumpySlicesDataset(data)
>>>
>>> # 3) Input data can be a tuple of lists (or NumPy arrays), each tuple element refers to data in each column
>>> data = ([1, 2], [3, 4], [5, 6])
>>> dataset3 = ds.NumpySlicesDataset(data, column_names=["column_1", "column_2", "column_3"])
>>>
>>> # 4) Load data from CSV file
>>> import pandas as pd
>>> df = pd.read_csv("file.csv")
>>> dataset4 = ds.NumpySlicesDataset(dict(df), shuffle=False)
apply(apply_func)

Apply a function in this dataset.

Parameters

apply_func (function) – A function that must take one ‘Dataset’ as an argument and return a preprocessed ‘Dataset’.

Returns

Dataset, applied by the function.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>>
>>> # Declare an apply_func function which returns a Dataset object
>>> def apply_func(ds):
>>>     ds = ds.batch(2)
>>>     return ds
>>>
>>> # Use apply to call apply_func
>>> data = data.apply(apply_func)
Raises
  • TypeError – If apply_func is not a function.

  • TypeError – If apply_func doesn’t return a Dataset.

batch(batch_size, drop_remainder=False, num_parallel_workers=None, per_batch_map=None, input_columns=None, output_columns=None, column_order=None, pad_info=None)

Combine batch_size number of consecutive rows into batches.

For any child node, a batch is treated as a single row. For any column, all the elements within that column must have the same shape. If a per_batch_map callable is provided, it will be applied to the batches of tensors.

Note

The order of using repeat and batch affects the number of batches and the behavior of per_batch_map. It is recommended that the repeat operation be used after the batch operation.
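The effect of the ordering on the number of batches can be worked out with plain arithmetic. The sketch below does not call mindspore.dataset; the helper num_batches and the figures (10 rows, a batch size of 4, 2 repeats) are assumptions chosen only for illustration.

```python
import math

def num_batches(num_rows, batch_size, drop_remainder=False):
    """Number of batches produced from num_rows rows (illustrative helper)."""
    if drop_remainder:
        return num_rows // batch_size
    return math.ceil(num_rows / batch_size)

rows, batch_size, repeats = 10, 4, 2

# batch first, then repeat: every epoch ends with one short batch of 2 rows
batch_then_repeat = num_batches(rows, batch_size) * repeats

# repeat first, then batch: the 20 rows are batched as one continuous stream
repeat_then_batch = num_batches(rows * repeats, batch_size)

print(batch_then_repeat, repeat_then_batch)
```

Batching before repeating keeps every batch inside a single epoch, which is one reason the recommended order is batch followed by repeat.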

Parameters
  • batch_size (int or function) – The number of rows each batch is created with. An int or callable which takes exactly 1 parameter, BatchInfo.

  • drop_remainder (bool, optional) – Determines whether or not to drop the last possibly incomplete batch (default=False). If True, and if there are fewer than batch_size rows available to make the last batch, then those rows will be dropped and not propagated to the child node.

  • num_parallel_workers (int, optional) – Number of workers to process the dataset in parallel (default=None).

  • per_batch_map (callable, optional) – Per batch map callable. A callable which takes (list[Tensor], list[Tensor], …, BatchInfo) as input parameters. Each list[Tensor] represents a batch of Tensors on a given column. The number of lists should match with number of entries in input_columns. The last parameter of the callable should always be a BatchInfo object.

  • input_columns (list[str], optional) – List of names of the input columns. The size of the list should match with signature of the per_batch_map callable.

  • output_columns (list[str], optional) – [Not currently implemented] List of names assigned to the columns outputted by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. (default=None, output columns will have the same name as the input columns, i.e., the columns will be replaced).

  • column_order (list[str], optional) – [Not currently implemented] List of all the desired columns to propagate to the child node. This list must be a subset of all the columns in the dataset after all operations are applied. The order of the columns in each row propagated to the child node follows the order they appear in this list. The parameter is mandatory if len(input_columns) != len(output_columns). (default=None, all columns will be propagated to the child node, the order of the columns will remain the same).

  • pad_info (dict, optional) – Whether to perform padding on selected columns. pad_info={“col1”:([224,224],0)} would pad the column named “col1” to a tensor of size [224,224] and fill the missing values with 0.

Returns

BatchDataset, dataset batched.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>>
>>> # Create a dataset where every 100 rows is combined into a batch
>>> # and drops the last incomplete batch if there is one.
>>> data = data.batch(100, True)
bucket_batch_by_length(column_names, bucket_boundaries, bucket_batch_sizes, element_length_function=None, pad_info=None, pad_to_bucket_boundary=False, drop_remainder=False)

Bucket elements according to their lengths. Each bucket will be padded and batched when they are full.

A length function is called on each row in the dataset. The row is then bucketed based on its length and bucket_boundaries. When a bucket reaches its corresponding size specified in bucket_batch_sizes, the entire bucket will be padded according to pad_info, and then batched. Each batch will be full, except possibly the last batch for each bucket.

Parameters
  • column_names (list[str]) – Columns passed to element_length_function.

  • bucket_boundaries (list[int]) – A list consisting of the upper boundaries of the buckets. Must be strictly increasing. If there are n boundaries, n+1 buckets are created: one bucket for [0, bucket_boundaries[0]), one bucket for [bucket_boundaries[i], bucket_boundaries[i+1]) for each 0<=i<n-1, and one bucket for [bucket_boundaries[n-1], inf).

  • bucket_batch_sizes (list[int]) – A list consisting of the batch sizes for each bucket. Must contain len(bucket_boundaries)+1 elements.

  • element_length_function (Callable, optional) – A function that takes in len(column_names) arguments and returns an int. If no value is provided, then len(column_names) must be 1, and the size of the first dimension of that column will be taken as the length (default=None).

  • pad_info (dict, optional) – Represents how to batch each column. The key corresponds to the column name, and the value must be a tuple of 2 elements. The first element corresponds to the shape to pad to, and the second element corresponds to the value to pad with. If a column is not specified, then that column will be padded to the longest in the current batch, and 0 will be used as the padding value. Any None dimensions will be padded to the longest in the current batch, unless if pad_to_bucket_boundary is True. If no padding is wanted, set pad_info to None (default=None).

  • pad_to_bucket_boundary (bool, optional) – If True, will pad each None dimension in pad_info to the bucket_boundary minus 1. If there are any elements that fall into the last bucket, an error will occur (default=False).

  • drop_remainder (bool, optional) – If True, will drop the last batch for each bucket if it is not a full batch (default=False).
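How a row length maps to a bucket under the half-open intervals described for bucket_boundaries can be sketched with the standard library. The function bucket_index is a hypothetical helper for illustration, not part of mindspore.dataset.

```python
from bisect import bisect_right

def bucket_index(length, bucket_boundaries):
    """Return the index of the bucket that a row of the given length falls into."""
    # bisect_right counts the boundaries that are <= length; with half-open
    # intervals [lower, upper) that count is exactly the bucket index.
    return bisect_right(bucket_boundaries, length)

bucket_boundaries = [5, 10]  # buckets: [0, 5), [5, 10), [10, inf)
for length in (0, 4, 5, 9, 10, 37):
    print(length, bucket_index(length, bucket_boundaries))
```

With two boundaries there are three buckets, which is why bucket_batch_sizes must contain len(bucket_boundaries)+1 elements.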

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>>
>>> # Create a dataset where rows are bucketed by length, and each bucket
>>> # is padded and batched once it is full.
>>> column_names = ["col1", "col2"]
>>> bucket_boundaries = [5, 10]
>>> bucket_batch_sizes = [5, 1, 1]
>>> element_length_function = (lambda col1, col2: max(len(col1), len(col2)))
>>>
>>> # Will pad col1 to shape [2, bucket_boundaries[i]] where i is the
>>> # index of the bucket that is currently being batched.
>>> # Will pad col2 to a shape where each dimension is the longest in all
>>> # the elements currently being batched.
>>> pad_info = {"col1": ([2, None], -1)}
>>> pad_to_bucket_boundary = True
>>>
>>> data = data.bucket_batch_by_length(column_names, bucket_boundaries,
>>>                                    bucket_batch_sizes,
>>>                                    element_length_function, pad_info,
>>>                                    pad_to_bucket_boundary)
concat(datasets)

Concatenate the datasets in the input list of datasets. The “+” operator is also supported to concatenate.

Note

The column names, and the rank and type of the column data, must be the same in all input datasets.

Parameters

datasets (Union[list, class Dataset]) – A list of datasets or a single class Dataset to be concatenated together with this dataset.

Returns

ConcatDataset, dataset concatenated.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # ds1 and ds2 are instances of Dataset object
>>>
>>> # Create a dataset by concatenating ds1 and ds2 with "+" operator
>>> data1 = ds1 + ds2
>>> # Create a dataset by concatenating ds1 and ds2 with concat operation
>>> data1 = ds1.concat(ds2)
create_dict_iterator(num_epochs=-1, output_numpy=False)

Create an iterator over the dataset. The data retrieved will be a dictionary.

The order of the columns in the dictionary may not be the same as the original order.

Parameters
  • num_epochs (int, optional) – Maximum number of epochs that the iterator can be iterated (default=-1, the iterator can be iterated for an infinite number of epochs).

  • output_numpy (bool, optional) – Whether or not to output NumPy datatype, if output_numpy=False, iterator will output MSTensor (default=False).

Returns

Iterator, dictionary of column name-ndarray pair.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>>
>>> # create an iterator
>>> # The order of the columns obtained by the iterator may differ from the original order.
>>> iterator = data.create_dict_iterator()
>>> for item in iterator:
>>>     # print the data in column1
>>>     print(item["column1"])
create_tuple_iterator(columns=None, num_epochs=-1, output_numpy=False)

Create an iterator over the dataset. The data retrieved will be a list of ndarrays of data.

To specify which columns to list and the order needed, use columns. If columns is not provided, the order of the columns will not be changed.

Parameters
  • columns (list[str], optional) – List of columns to be used to specify the order of columns (default=None, means all columns).

  • num_epochs (int, optional) – Maximum number of epochs that the iterator can be iterated (default=-1, the iterator can be iterated for an infinite number of epochs).

  • output_numpy (bool, optional) – Whether or not to output NumPy datatype. If output_numpy=False, iterator will output MSTensor (default=False).

Returns

Iterator, list of ndarrays.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>>
>>> # Create an iterator
>>> # The columns in the data obtained by the iterator will not be changed.
>>> iterator = data.create_tuple_iterator()
>>> for item in iterator:
>>>     # convert the returned tuple to a list and print
>>>     print(list(item))
device_que(prefetch_size=None, send_epoch_end=True)

Return a transferred Dataset that transfers data through a device.

Parameters
  • prefetch_size (int, optional) – Prefetch number of records ahead of the user’s request (default=None).

  • send_epoch_end (bool, optional) – Whether to send end of sequence to device or not (default=True).

Note

If the device is Ascend, features of data will be transferred one by one. Each transmission is limited to 256M of data.

Returns

TransferDataset, dataset for transferring.

filter(predicate, input_columns=None, num_parallel_workers=1)

Filter dataset by predicate.

Note

If input_columns is not provided or is empty, all columns will be used.

Parameters
  • predicate (callable) – Python callable which returns a boolean value. Rows for which it returns False are filtered out.

  • input_columns (list[str], optional) – List of names of the input columns (default=None, the predicate will be applied on all columns in the dataset).

  • num_parallel_workers (int, optional) – Number of workers to process the dataset in parallel (default=1).

Returns

FilterDataset, dataset filtered.

Examples

>>> import mindspore.dataset as ds
>>> # dataset is an instance of Dataset with a column "data" holding the values 0 to 63
>>> # Keep the rows whose value is less than 11 (values >= 11 are filtered out)
>>> dataset_f = dataset.filter(predicate=lambda data: data < 11, input_columns=["data"])
flat_map(func)

Map func to each row in dataset and flatten the result.

The specified func is a function that must take one ‘Ndarray’ as input and return a ‘Dataset’.

Parameters

func (function) – A function that must take one ‘Ndarray’ as an argument and return a ‘Dataset’.

Returns

Dataset, applied by the function.

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>>
>>> # Declare a function which returns a Dataset object
>>> def flat_map_func(x):
>>>     data_dir = text.to_str(x[0])
>>>     d = ds.ImageFolderDataset(data_dir)
>>>     return d
>>> # data is an instance of a Dataset object.
>>> data = ds.TextFileDataset(DATA_FILE)
>>> data = data.flat_map(flat_map_func)
Raises
  • TypeError – If func is not a function.

  • TypeError – If func doesn’t return a Dataset.

get_batch_size()

Get the size of a batch.

Returns

Number, the number of rows in a batch.

get_col_names()

Get the names of the columns in the dataset.

get_dataset_size()

Get the number of batches in an epoch.

Returns

Number, number of batches.

get_repeat_count()

Get the repeat count of a RepeatDataset, or 1 if the dataset is not repeated.

Returns

Number, the count of repeat.

map(operations=None, input_columns=None, output_columns=None, column_order=None, num_parallel_workers=None, python_multiprocessing=False, cache=None, callbacks=None)

Apply each operation in operations to this dataset.

The order of operations is determined by the position of each operation in the operations parameter. operations[0] will be applied first, then operations[1], then operations[2], etc.

Each operation will be passed one or more columns from the dataset as input, and zero or more columns will be outputted. The first operation will be passed the columns specified in input_columns as input. If there is more than one operator in operations, the outputted columns of the previous operation are used as the input columns for the next operation. The columns outputted by the very last operation will be assigned names specified by output_columns.

Only the columns specified in column_order will be propagated to the child node. These columns will be in the same order as specified in column_order.
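The way outputs flow from one operation to the next can be modelled on a single row with plain Python. This is only a simplified sketch of the chaining rule, not how map is implemented: chain_operations is a hypothetical helper, the row holds scalars instead of arrays, and the handling of column_order is left out.

```python
def chain_operations(operations, inputs):
    """Feed the outputs of each operation to the next one, in list order."""
    values = tuple(inputs)
    for op in operations:
        result = op(*values)
        # An operation may return one value or a tuple of values (columns).
        values = result if isinstance(result, tuple) else (result,)
    return values

row = {"col0": 1, "col1": 2, "col2": 3}
input_columns = ["col2", "col0"]
output_columns = ["mod2", "mod3", "mod5", "mod7"]
operations = [lambda x, y: (x, x + y, x + y + 1),      # 2 columns in, 3 out
              lambda x, y, z: x * y * z,               # 3 columns in, 1 out
              lambda x: (x % 2, x % 3, x % 5, x % 7)]  # 1 column in, 4 out

outputs = chain_operations(operations, [row[name] for name in input_columns])
# The outputs of the last operation take the names given in output_columns.
named_outputs = dict(zip(output_columns, outputs))
print(named_outputs)
```

If the number of values returned by one operation did not match the number of arguments of the next, the call op(*values) would fail, which mirrors the requirement that consecutive operations agree on their column counts.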

Parameters
  • operations (Union[list[TensorOp], list[functions]]) – List of operations to be applied on the dataset. Operations are applied in the order they appear in this list.

  • input_columns (list[str], optional) – List of the names of the columns that will be passed to the first operation as input. The size of this list must match the number of input columns expected by the first operator. (default=None, the first operation will be passed as many columns as it requires, starting from the first column).

  • output_columns (list[str], optional) – List of names assigned to the columns outputted by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. (default=None, output columns will have the same name as the input columns, i.e., the columns will be replaced).

  • column_order (list[str], optional) – List of all the desired columns to propagate to the child node. This list must be a subset of all the columns in the dataset after all operations are applied. The order of the columns in each row propagated to the child node follows the order they appear in this list. The parameter is mandatory if len(input_columns) != len(output_columns). (default=None, all columns will be propagated to the child node, the order of the columns will remain the same).

  • num_parallel_workers (int, optional) – Number of threads used to process the dataset in parallel (default=None, the value from the configuration will be used).

  • python_multiprocessing (bool, optional) – Parallelize Python operations with multiple worker processes. This option could be beneficial if the Python operation is computationally heavy (default=False).

  • cache (DatasetCache, optional) – Tensor cache to use. (default=None which means no cache is used). The cache feature is under development and is not recommended.

  • callbacks (DSCallback, list[DSCallback], optional) – List of Dataset callbacks to be called (default=None).

Returns

MapDataset, dataset after mapping operation.

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.vision.c_transforms as c_transforms
>>>
>>> # data is an instance of Dataset which has 2 columns, "image" and "label".
>>> # ds_pyfunc is an instance of Dataset which has 3 columns, "col0", "col1", and "col2".
>>> # Each column is a 2D array of integers.
>>>
>>> # Set the global configuration value for num_parallel_workers to be 2.
>>> # Operations which use this configuration value will use 2 worker threads,
>>> # unless otherwise specified in the operator's constructor.
>>> # set_num_parallel_workers can be called again later if a different
>>> # global configuration value for the number of worker threads is desired.
>>> ds.config.set_num_parallel_workers(2)
>>>
>>> # Define two operations, where each operation accepts 1 input column and outputs 1 column.
>>> decode_op = c_transforms.Decode(rgb_format=True)
>>> random_jitter_op = c_transforms.RandomColorAdjust((0.8, 0.8), (1, 1), (1, 1), (0, 0))
>>>
>>> # 1) Simple map example
>>>
>>> operations = [decode_op]
>>> input_columns = ["image"]
>>>
>>> # Apply decode_op on column "image". This column will be replaced by the outputted
>>> # column of decode_op. Since column_order is not provided, both columns "image"
>>> # and "label" will be propagated to the child node in their original order.
>>> ds_decoded = data.map(operations, input_columns)
>>>
>>> # Rename column "image" to "decoded_image".
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(operations, input_columns, output_columns)
>>>
>>> # Specify the order of the columns.
>>> column_order = ["label", "image"]
>>> ds_decoded = data.map(operations, input_columns, None, column_order)
>>>
>>> # Rename column "image" to "decoded_image" and also specify the order of the columns.
>>> column_order = ["label", "decoded_image"]
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(operations, input_columns, output_columns, column_order)
>>>
>>> # Rename column "image" to "decoded_image" and keep only this column.
>>> column_order = ["decoded_image"]
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(operations, input_columns, output_columns, column_order)
>>>
>>> # A simple example using pyfunc: Renaming columns and specifying column order
>>> # work in the same way as the previous examples.
>>> input_columns = ["col0"]
>>> operations = [(lambda x: x + 1)]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns)
>>>
>>> # 2) Map example with more than one operation
>>>
>>> # If this list of operations is used with map, decode_op will be applied
>>> # first, then random_jitter_op will be applied.
>>> operations = [decode_op, random_jitter_op]
>>>
>>> input_columns = ["image"]
>>>
>>> # Create a dataset where the images are decoded, then randomly color jittered.
>>> # decode_op takes column "image" as input and outputs one column. The column
>>> # outputted by decode_op is passed as input to random_jitter_op.
>>> # random_jitter_op will output one column. Column "image" will be replaced by
>>> # the column outputted by random_jitter_op (the very last operation). All other
>>> # columns are unchanged. Since column_order is not specified, the order of the
>>> # columns will remain the same.
>>> ds_mapped = data.map(operations, input_columns)
>>>
>>> # Create a dataset that is identical to ds_mapped, except the column "image"
>>> # that is outputted by random_jitter_op is renamed to "image_transformed".
>>> # Specifying column order works in the same way as examples in 1).
>>> output_columns = ["image_transformed"]
>>> ds_mapped_and_renamed = data.map(operations, input_columns, output_columns)
>>>
>>> # Multiple operations using pyfunc: Renaming columns and specifying column order
>>> # work in the same way as examples in 1).
>>> input_columns = ["col0"]
>>> operations = [(lambda x: x + x), (lambda x: x - 1)]
>>> output_columns = ["col0_mapped"]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns, output_columns)
>>>
>>> # 3) Example where number of input columns is not equal to number of output columns
>>>
>>> # operations[0] is a lambda that takes 2 columns as input and outputs 3 columns.
>>> # operations[1] is a lambda that takes 3 columns as input and outputs 1 column.
>>> # operations[2] is a lambda that takes 1 column as input and outputs 4 columns.
>>> #
>>> # Note: The number of output columns of operations[i] must equal the number of
>>> # input columns of operations[i+1]. Otherwise, this map call will result
>>> # in an error.
>>> operations = [(lambda x, y: (x, x + y, x + y + 1)),
>>>               (lambda x, y, z: x * y * z),
>>>               (lambda x: (x % 2, x % 3, x % 5, x % 7))]
>>>
>>> # Note: Since the number of input columns is not the same as the number of
>>> # output columns, the output_columns and column_order parameters must be
>>> # specified. Otherwise, this map call will also result in an error.
>>> input_columns = ["col2", "col0"]
>>> output_columns = ["mod2", "mod3", "mod5", "mod7"]
>>>
>>> # Propagate all columns to the child node in this order:
>>> column_order = ["col0", "col2", "mod2", "mod3", "mod5", "mod7", "col1"]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns, output_columns, column_order)
>>>
>>> # Propagate some columns to the child node in this order:
>>> column_order = ["mod7", "mod3", "col1"]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns, output_columns, column_order)
num_classes()

Get the number of classes in a dataset.

Returns

Number, number of classes.

output_shapes()

Get the shapes of output data.

Returns

List, list of shapes of each column.

output_types()

Get the types of output data.

Returns

List of data types.

project(columns)

Project certain columns in input dataset.

The specified columns will be selected from the dataset and passed down the pipeline in the order specified. The other columns are discarded.

Parameters

columns (list[str]) – List of names of the columns to project.

Returns

ProjectDataset, dataset projected.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>> columns_to_project = ["column3", "column1", "column2"]
>>>
>>> # Create a dataset that consists of column3, column1, column2
>>> # in that order, regardless of the original order of columns.
>>> data = data.project(columns=columns_to_project)
rename(input_columns, output_columns)

Rename the columns in input datasets.

Parameters
  • input_columns (list[str]) – List of names of the input columns.

  • output_columns (list[str]) – List of names of the output columns.

Returns

RenameDataset, dataset renamed.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> input_columns = ["input_col1", "input_col2", "input_col3"]
>>> output_columns = ["output_col1", "output_col2", "output_col3"]
>>>
>>> # Create a dataset where input_col1 is renamed to output_col1, and
>>> # input_col2 is renamed to output_col2, and input_col3 is renamed
>>> # to output_col3.
>>> data = data.rename(input_columns=input_columns, output_columns=output_columns)
repeat(count=None)

Repeat this dataset count times. Repeat indefinitely if the count is None or -1.

Note

The order of using repeat and batch affects the number of batches. It is recommended that the repeat operation be used after the batch operation. If dataset_sink_mode is False, the repeat operation is invalid. If dataset_sink_mode is True, the repeat count must be equal to the number of training epochs. Otherwise, errors could occur since the amount of data is not the amount training requires.

Parameters

count (int) – Number of times the dataset is repeated (default=None).

Returns

RepeatDataset, dataset repeated.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>>
>>> # Create a dataset where the dataset is repeated for 50 epochs
>>> repeated = data.repeat(50)
>>>
>>> # Create a dataset where each epoch is shuffled individually
>>> shuffled_and_repeated = data.shuffle(10)
>>> shuffled_and_repeated = shuffled_and_repeated.repeat(50)
>>>
>>> # Create a dataset where the dataset is first repeated for
>>> # 50 epochs before shuffling. The shuffle operator will treat
>>> # the entire 50 epochs as one big dataset.
>>> repeat_and_shuffle = data.repeat(50)
>>> repeat_and_shuffle = repeat_and_shuffle.shuffle(10)
reset()

Reset the dataset for next epoch.

save(file_name, num_files=1, file_type='mindrecord')

Save the dynamic data processed by the dataset pipeline in a common dataset format. Supported dataset formats: ‘mindrecord’ only.

Implicit type casting exists when saving data as ‘mindrecord’. The table below shows how each type is cast.

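Implicit Type Casting when Saving as ‘mindrecord’

| Type in ‘dataset’ | Type in ‘mindrecord’ | Details                                |
|-------------------|----------------------|----------------------------------------|
| bool              | None                 | Not supported                          |
| int8              | int32                |                                        |
| uint8             | bytes(1D uint8)      | Drop dimension                         |
| int16             | int32                |                                        |
| uint16            | int32                |                                        |
| int32             | int32                |                                        |
| uint32            | int64                |                                        |
| int64             | int64                |                                        |
| uint64            | None                 | Not supported                          |
| float16           | float32              |                                        |
| float32           | float32              |                                        |
| float64           | float64              |                                        |
| string            | string               | Multi-dimensional string not supported |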
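The casting table can also be written as a lookup, which may be handy for checking column types before calling save. The dictionary only restates the table; the helper unsupported_columns and the example column types are hypothetical and not part of mindspore.dataset.

```python
# Dataset type -> type stored in 'mindrecord' (None means not supported),
# restating the casting table above.
MINDRECORD_CAST = {
    "bool": None,
    "int8": "int32",
    "uint8": "bytes",  # stored as 1D uint8 bytes, dimensions are dropped
    "int16": "int32",
    "uint16": "int32",
    "int32": "int32",
    "uint32": "int64",
    "int64": "int64",
    "uint64": None,
    "float16": "float32",
    "float32": "float32",
    "float64": "float64",
    "string": "string",  # multi-dimensional strings are not supported
}

def unsupported_columns(column_types):
    """Return the names of columns whose type cannot be saved as 'mindrecord'."""
    return [name for name, dtype in column_types.items()
            if MINDRECORD_CAST.get(dtype) is None]

# Hypothetical column types of a pipeline
print(unsupported_columns({"image": "uint8", "label": "int32", "mask": "bool"}))
```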

Note

  1. To save the samples in order, set dataset’s shuffle to False and num_files to 1.

  2. Before calling the function, do not use batch operator, repeat operator or data augmentation operators with random attribute in map operator.

  3. Mindrecord does not support DE_UINT64, multi-dimensional DE_UINT8(drop dimension) nor multi-dimensional DE_STRING.

Parameters
  • file_name (str) – Path to dataset file.

  • num_files (int, optional) – Number of dataset files (default=1).

  • file_type (str, optional) – Dataset format (default=’mindrecord’).

shuffle(buffer_size)

Randomly shuffles the rows of this dataset using the following algorithm:

  1. Make a shuffle buffer that contains the first buffer_size rows.

  2. Randomly select an element from the shuffle buffer to be the next row propagated to the child node.

  3. Get the next row (if any) from the parent node and put it in the shuffle buffer.

  4. Repeat steps 2 and 3 until there are no more rows left in the shuffle buffer.

A seed can be provided to be used on the first epoch. In every subsequent epoch, the seed is changed to a new, randomly generated value.
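The four steps above can be sketched in plain Python. This is an illustration of the buffer-based algorithm as described, not the library's implementation; the generator name buffered_shuffle and its seed argument are assumptions made for the example.

```python
import random

_END = object()  # sentinel marking an exhausted source

def buffered_shuffle(rows, buffer_size, seed=None):
    """Yield the rows in shuffled order using a shuffle buffer of buffer_size rows."""
    rng = random.Random(seed)
    source = iter(rows)

    # Step 1: fill the buffer with the first buffer_size rows.
    buffer = []
    for row in source:
        buffer.append(row)
        if len(buffer) >= buffer_size:
            break

    while buffer:
        # Step 2: pick a random row from the buffer and emit it.
        index = rng.randrange(len(buffer))
        buffer[index], buffer[-1] = buffer[-1], buffer[index]
        yield buffer.pop()
        # Step 3: refill the buffer from the source, if any rows are left.
        row = next(source, _END)
        if row is not _END:
            buffer.append(row)
        # Step 4: the loop repeats until the buffer is empty.

shuffled = list(buffered_shuffle(range(10), buffer_size=4, seed=58))
print(shuffled)
```

A row can only be emitted once it has entered the buffer, so with a small buffer the output stays close to the input order. When buffer_size equals the number of rows, every row is in the buffer before the first one is emitted.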

Parameters

buffer_size (int) – The size of the buffer (must be larger than 1) for shuffling. Setting buffer_size equal to the number of rows in the entire dataset will result in a global shuffle.

Returns

ShuffleDataset, dataset shuffled.

Raises

RuntimeError – If sync operators exist before shuffle.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> # Optionally set the seed for the first epoch
>>> ds.config.set_seed(58)
>>>
>>> # Create a shuffled dataset using a shuffle buffer of size 4
>>> data = data.shuffle(4)
skip(count)

Skip the first N elements of this dataset.

Parameters

count (int) – Number of elements in the dataset to be skipped.

Returns

SkipDataset, dataset skipped.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> # Create a dataset which skips first 3 elements from data
>>> data = data.skip(3)
split(sizes, randomize=True)

Split the dataset into smaller, non-overlapping datasets.

Parameters
  • sizes (Union[list[int], list[float]]) –

    If a list of integers [s1, s2, …, sn] is provided, the dataset will be split into n datasets of size s1, size s2, …, size sn respectively. If the sum of all sizes does not equal the original dataset size, an error will occur. If a list of floats [f1, f2, …, fn] is provided, all floats must be between 0 and 1 and must sum to 1, otherwise an error will occur. The dataset will be split into n Datasets of size round(f1*K), round(f2*K), …, round(fn*K) where K is the size of the original dataset. If after rounding:

    • Any size equals 0, an error will occur.

    • The sum of split sizes < K, the difference will be added to the first split.

    • The sum of split sizes > K, the difference will be removed from the first large enough split such that it will have at least 1 row after removing the difference.

  • randomize (bool, optional) – Determines whether or not to split the data randomly (default=True). If True, the data will be randomly split. Otherwise, each split will be created with consecutive rows from the dataset.
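The rounding rules for a list of floats can be sketched as follows. The function split_sizes is a hypothetical helper written from the description of sizes, not taken from the library, and it assumes Python's built-in round (which rounds halves to the nearest even integer). It also assumes the floats have already been validated to lie between 0 and 1 and to sum to 1.

```python
def split_sizes(fractions, total_rows):
    """Turn fractional split sizes into row counts that sum to total_rows."""
    sizes = [round(f * total_rows) for f in fractions]
    if any(size == 0 for size in sizes):
        raise RuntimeError("a split has size 0 after rounding")

    difference = total_rows - sum(sizes)
    if difference > 0:
        # Too few rows assigned: give the remainder to the first split.
        sizes[0] += difference
    elif difference < 0:
        # Too many rows assigned: shrink the first split that still keeps
        # at least 1 row after the excess is removed.
        excess = -difference
        for i, size in enumerate(sizes):
            if size - excess >= 1:
                sizes[i] -= excess
                break
    return sizes

print(split_sizes([0.9, 0.1], 100))
print(split_sizes([1 / 3, 1 / 3, 1 / 3], 10))
print(split_sizes([0.5, 0.5], 11))
```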

Note

  1. There is an optimized split function, which will be called automatically when the dataset that calls this function is a MappableDataset.

  2. Dataset should not be sharded if split is going to be called. Instead, create a DistributedSampler and specify a split to shard after splitting. If dataset is sharded after a split, it is strongly recommended to set the same seed in each instance of execution, otherwise each shard may not be part of the same split (see Examples).

  3. It is strongly recommended to not shuffle the dataset, but use randomize=True instead. Shuffling the dataset may not be deterministic, which means the data in each split will be different in each epoch. Furthermore, if sharding occurs after split, each shard may not be part of the same split.

Raises
  • RuntimeError – If get_dataset_size returns None or is not supported for this dataset.

  • RuntimeError – If sizes is list of integers and sum of all elements in sizes does not equal the dataset size.

  • RuntimeError – If sizes is list of float and there is a split with size 0 after calculations.

  • RuntimeError – If the dataset is sharded prior to calling split.

  • ValueError – If sizes is list of float and not all floats are between 0 and 1, or if the floats don’t sum to 1.

Returns

tuple(Dataset), a tuple of datasets that have been split.

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_dir = "/path/to/imagefolder_directory"
>>>
>>> # Since many datasets have shuffle on by default, set shuffle to False if split will be called!
>>> data = ds.ImageFolderDataset(dataset_dir, shuffle=False)
>>>
>>> # Set the seed, and tell split to use this seed when randomizing.
>>> # This is needed because sharding will be done later
>>> ds.config.set_seed(58)
>>> train, test = data.split([0.9, 0.1])
>>>
>>> # To shard the train dataset, use a DistributedSampler
>>> train_sampler = ds.DistributedSampler(10, 2)
>>> train.use_sampler(train_sampler)
sync_update(condition_name, num_batch=None, data=None)

Release a blocking condition and trigger callback with given data.

Parameters
  • condition_name (str) – The condition name that is used to toggle sending next row.

  • num_batch (Union[int, None]) – The number of batches (rows) that are released. When num_batch is None, it will default to the number specified by the sync_wait operator (default=None).

  • data (Union[dict, None]) – The data passed to the callback (default=None).

sync_wait(condition_name, num_batch=1, callback=None)

Add a blocking condition to the input Dataset.

Parameters
  • condition_name (str) – The condition name that is used to toggle sending next row.

  • num_batch (int) – The number of batches without blocking at the start of each epoch.

  • callback (function) – The callback function that will be invoked when sync_update is called.

Raises

RuntimeError – If condition name already exists.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> data = data.sync_wait("callback1")
>>> data = data.batch(batch_size)  # batch_size is an int defined elsewhere
>>> for batch_data in data.create_dict_iterator():
>>>     data.sync_update("callback1")
take(count=-1)

Take at most the given number of elements from the dataset.

Note

  1. If count is greater than the number of elements in the dataset or equal to -1, all the elements in dataset will be taken.

  2. The order of using take and batch matters. If take is before the batch operation, then the given number of rows is taken; otherwise, the given number of batches is taken.
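The difference between the two orderings can be seen with ordinary lists. The functions batch and take below are small stand-ins written for this illustration; they only mimic the row and batch counting and are not the Dataset methods.

```python
def batch(rows, batch_size):
    """Group consecutive rows into lists of batch_size rows."""
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]

def take(items, count=-1):
    """Keep at most count items; -1 keeps everything."""
    items = list(items)
    return items if count == -1 else items[:count]

rows = list(range(10))

# take before batch: 4 rows are taken, then grouped into 2 batches
take_then_batch = batch(take(rows, 4), 2)

# batch before take: 4 batches are taken, covering 8 rows
batch_then_take = take(batch(rows, 2), 4)

print(take_then_batch)
print(batch_then_take)
```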

Parameters

count (int, optional) – Number of elements to be taken from the dataset (default=-1).

Returns

TakeDataset, dataset taken.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> # Create a dataset that includes at most the first 50 elements.
>>> data = data.take(50)
to_device(send_epoch_end=True)

Transfer data through CPU, GPU or Ascend devices.

Parameters

send_epoch_end (bool, optional) – Whether to send end of sequence to device or not (default=True).

Note

If the device is Ascend, features of data will be transferred one by one. Each transmission is limited to 256M of data.

Returns

TransferDataset, dataset for transferring.

Raises
  • TypeError – If device_type is empty.

  • ValueError – If device_type is not ‘Ascend’, ‘GPU’ or ‘CPU’.

  • RuntimeError – If dataset is unknown.

  • RuntimeError – If distribution file path is given but failed to read.

use_sampler(new_sampler)

Make the current dataset use the provided new_sampler.

Parameters

new_sampler (Sampler) – The sampler to use for the current dataset.

Returns

Dataset, that uses new_sampler.

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_dir = "/path/to/imagefolder_directory"
>>> # Note: A SequentialSampler is created by default
>>> data = ds.ImageFolderDataset(dataset_dir)
>>>
>>> # Use a DistributedSampler instead of the SequentialSampler
>>> new_sampler = ds.DistributedSampler(10, 2)
>>> data.use_sampler(new_sampler)
zip(datasets)

Zip the datasets in the input tuple of datasets. Columns in the input datasets must not have the same name.

Parameters

datasets (Union[tuple, class Dataset]) – A tuple of datasets or a single class Dataset to be zipped together with this dataset.

Returns

ZipDataset, dataset zipped.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # ds1 and ds2 are instances of Dataset object
>>> # Create a dataset which is the combination of ds1 and ds2
>>> data = ds1.zip(ds2)
class mindspore.dataset.PKSampler(num_val, num_class=None, shuffle=False, class_column='label', num_samples=None)[source]

Samples K elements for each P class in the dataset.

Parameters
  • num_val (int) – Number of elements to sample for each class.

  • num_class (int, optional) – Number of classes to sample (default=None, all classes).

  • shuffle (bool, optional) – If True, the class IDs are shuffled (default=False).

  • class_column (str, optional) – Name of column with class labels for MindDataset (default=’label’).

  • num_samples (int, optional) – The number of samples to draw (default=None, all elements).

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_dir = "path/to/imagefolder_directory"
>>>
>>> # creates a PKSampler that will get 3 samples from every class.
>>> sampler = ds.PKSampler(3)
>>> data = ds.ImageFolderDataset(dataset_dir, num_parallel_workers=8, sampler=sampler)
Raises
get_num_samples()

All samplers can contain a numeric num_samples value (or it can be set to None). A child sampler can exist or be None. If a child sampler exists, then the child sampler count can be a numeric value or None. These conditions impact the resultant sampler count that is used. The following table shows the possible results from calling this function.

child sampler  num_samples  child_samples  result
-------------  -----------  -------------  ---------
T              x            y              min(x, y)
T              x            None           x
T              None         y              y
T              None         None           None
None           x            n/a            x
None           None         n/a            None

Returns

int, the number of samples, or None.
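
The table above can be read as a small decision function. The following is a sketch of that logic only, assuming nothing about the sampler classes themselves; `resolve_num_samples` and its arguments are hypothetical names.

```python
def resolve_num_samples(num_samples, has_child=False, child_samples=None):
    """Return the effective sample count according to the table above.

    num_samples   -- this sampler's own count, or None
    has_child     -- whether a child sampler exists
    child_samples -- the child sampler's count, or None
    """
    if not has_child or child_samples is None:
        # No usable child count: fall back to this sampler's own value.
        return num_samples
    if num_samples is None:
        # Only the child provides a count.
        return child_samples
    # Both counts are numeric: the smaller one wins.
    return min(num_samples, child_samples)
```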

class mindspore.dataset.PaddedDataset(padded_samples)[source]

Create a dataset with fake data provided by the user. It is mainly used to be added to the original dataset and assigned to the corresponding shard.

Parameters

padded_samples (list(dict)) – Samples provided by user.

Raises
  • TypeError – If padded_samples is not an instance of list.

  • TypeError – If the element of padded_samples is not an instance of dict.

  • ValueError – If the padded_samples is empty.

Examples

>>> import numpy as np
>>> import mindspore.dataset as ds
>>> data1 = [{'image': np.zeros(1, np.uint8)}, {'image': np.zeros(2, np.uint8)}]
>>> ds1 = ds.PaddedDataset(data1)
apply(apply_func)

Apply a function in this dataset.

Parameters

apply_func (function) – A function that must take one ‘Dataset’ as an argument and return a preprocessed ‘Dataset’.

Returns

Dataset, applied by the function.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>>
>>> # Declare an apply_func function which returns a Dataset object
>>> def apply_func(ds):
>>>     ds = ds.batch(2)
>>>     return ds
>>>
>>> # Use apply to call apply_func
>>> data = data.apply(apply_func)
Raises
  • TypeError – If apply_func is not a function.

  • TypeError – If apply_func doesn’t return a Dataset.

batch(batch_size, drop_remainder=False, num_parallel_workers=None, per_batch_map=None, input_columns=None, output_columns=None, column_order=None, pad_info=None)

Combine batch_size number of consecutive rows into batches.

For any child node, a batch is treated as a single row. For any column, all the elements within that column must have the same shape. If a per_batch_map callable is provided, it will be applied to the batches of tensors.

Note

The order of using repeat and batch affects the number of batches and per_batch_map. It is recommended that the repeat operation be used after the batch operation.

Parameters
  • batch_size (int or function) – The number of rows each batch is created with. An int or callable which takes exactly 1 parameter, BatchInfo.

  • drop_remainder (bool, optional) – Determines whether or not to drop the last possibly incomplete batch (default=False). If True, and if there are fewer than batch_size rows available to make the last batch, then those rows will be dropped and not propagated to the child node.

  • num_parallel_workers (int, optional) – Number of workers to process the dataset in parallel (default=None).

  • per_batch_map (callable, optional) – Per batch map callable. A callable which takes (list[Tensor], list[Tensor], …, BatchInfo) as input parameters. Each list[Tensor] represents a batch of Tensors on a given column. The number of lists should match with number of entries in input_columns. The last parameter of the callable should always be a BatchInfo object.

  • input_columns (list[str], optional) – List of names of the input columns. The size of the list should match with signature of the per_batch_map callable.

  • output_columns (list[str], optional) – [Not currently implemented] List of names assigned to the columns outputted by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. (default=None, output columns will have the same name as the input columns, i.e., the columns will be replaced).

  • column_order (list[str], optional) – [Not currently implemented] List of all the desired columns to propagate to the child node. This list must be a subset of all the columns in the dataset after all operations are applied. The order of the columns in each row propagated to the child node follows the order they appear in this list. The parameter is mandatory if len(input_columns) != len(output_columns). (default=None, all columns will be propagated to the child node, the order of the columns will remain the same).

  • pad_info (dict, optional) – Whether to perform padding on selected columns. pad_info={"col1": ([224, 224], 0)} would pad the column named "col1" to a tensor of size [224, 224] and fill the missing values with 0.

Returns

BatchDataset, dataset batched.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>>
>>> # Create a dataset where every 100 rows is combined into a batch
>>> # and drops the last incomplete batch if there is one.
>>> data = data.batch(100, True)
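
The interaction between drop_remainder and the ordering of repeat and batch can be sketched in plain Python. This models a dataset as a list and is not MindSpore's implementation; `batch` and `repeat` are hypothetical helpers.

```python
def batch(rows, batch_size, drop_remainder=False):
    """Group consecutive rows into batches of batch_size rows."""
    rows = list(rows)
    batches = [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]
    if drop_remainder and batches and len(batches[-1]) < batch_size:
        batches.pop()  # discard the incomplete final batch
    return batches


def repeat(elements, count):
    """Concatenate count copies of the input."""
    return list(elements) * count


rows = list(range(10))

# batch then repeat: 3 batches per epoch (the last one is short), 6 in total.
batch_then_repeat = repeat(batch(rows, 4), 2)

# repeat then batch: 20 rows are batched together, giving 5 full batches,
# one of which mixes rows from two epochs.
repeat_then_batch = batch(repeat(rows, 2), 4)
```

In this model, batching before repeating keeps every batch inside a single epoch, which is consistent with the recommendation in the note above.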
bucket_batch_by_length(column_names, bucket_boundaries, bucket_batch_sizes, element_length_function=None, pad_info=None, pad_to_bucket_boundary=False, drop_remainder=False)

Bucket elements according to their lengths. Each bucket will be padded and batched when they are full.

A length function is called on each row in the dataset. The row is then bucketed based on its length and bucket_boundaries. When a bucket reaches its corresponding size specified in bucket_batch_sizes, the entire bucket will be padded according to pad_info, and then batched. Each batch will be full, except possibly the last batch for each bucket.

Parameters
  • column_names (list[str]) – Columns passed to element_length_function.

  • bucket_boundaries (list[int]) – A list consisting of the upper boundaries of the buckets. Must be strictly increasing. If there are n boundaries, n+1 buckets are created: one bucket for [0, bucket_boundaries[0]), one bucket for [bucket_boundaries[i], bucket_boundaries[i+1]) for each 0<=i<n-1, and one bucket for [bucket_boundaries[n-1], inf).

  • bucket_batch_sizes (list[int]) – A list consisting of the batch sizes for each bucket. Must contain len(bucket_boundaries)+1 elements.

  • element_length_function (Callable, optional) – A function that takes in len(column_names) arguments and returns an int. If no value is provided, then len(column_names) must be 1, and the size of the first dimension of that column will be taken as the length (default=None).

  • pad_info (dict, optional) – Represents how to batch each column. The key corresponds to the column name, and the value must be a tuple of 2 elements. The first element corresponds to the shape to pad to, and the second element corresponds to the value to pad with. If a column is not specified, then that column will be padded to the longest in the current batch, and 0 will be used as the padding value. Any None dimensions will be padded to the longest in the current batch, unless if pad_to_bucket_boundary is True. If no padding is wanted, set pad_info to None (default=None).

  • pad_to_bucket_boundary (bool, optional) – If True, will pad each None dimension in pad_info to the bucket_boundary minus 1. If there are any elements that fall into the last bucket, an error will occur (default=False).

  • drop_remainder (bool, optional) – If True, will drop the last batch for each bucket if it is not a full batch (default=False).

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>>
>>> # Create a dataset where rows are bucketed by length, and each bucket
>>> # is padded and batched once it is full.
>>> column_names = ["col1", "col2"]
>>> bucket_boundaries = [5, 10]
>>> bucket_batch_sizes = [5, 1, 1]
>>> element_length_function = (lambda col1, col2: max(len(col1), len(col2)))
>>>
>>> # Will pad col1 to shape [2, bucket_boundaries[i]] where i is the
>>> # index of the bucket that is currently being batched.
>>> # Will pad col2 to a shape where each dimension is the longest in all
>>> # the elements currently being batched.
>>> pad_info = {"col1": ([2, None], -1)}
>>> pad_to_bucket_boundary = True
>>>
>>> data = data.bucket_batch_by_length(column_names, bucket_boundaries,
>>>                                    bucket_batch_sizes,
>>>                                    element_length_function, pad_info,
>>>                                    pad_to_bucket_boundary)
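
How a row length maps to a bucket can be sketched with the standard library. This is an illustration of the half-open intervals described for bucket_boundaries, not MindSpore's implementation; `bucket_index` is a hypothetical helper.

```python
from bisect import bisect_right


def bucket_index(length, bucket_boundaries):
    """Return the index of the bucket that a row of the given length falls into.

    With n strictly increasing boundaries there are n + 1 buckets:
    [0, b[0]), [b[0], b[1]), ..., [b[n-1], inf).
    """
    return bisect_right(bucket_boundaries, length)


bucket_boundaries = [5, 10]

# Length function from the example above: the longer of the two columns.
element_length_function = lambda col1, col2: max(len(col1), len(col2))

length = element_length_function([1, 2, 3], [1, 2, 3, 4, 5, 6])
index = bucket_index(length, bucket_boundaries)  # length 6 lies in [5, 10)
```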
concat(datasets)

Concatenate the datasets in the input list of datasets. The “+” operator is also supported to concatenate.

Note

The column names, and the rank and type of the column data, must be the same in the input datasets.

Parameters

datasets (Union[list, class Dataset]) – A list of datasets or a single class Dataset to be concatenated together with this dataset.

Returns

ConcatDataset, dataset concatenated.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # ds1 and ds2 are instances of Dataset object
>>>
>>> # Create a dataset by concatenating ds1 and ds2 with "+" operator
>>> data1 = ds1 + ds2
>>> # Create a dataset by concatenating ds1 and ds2 with concat operation
>>> data1 = ds1.concat(ds2)
create_dict_iterator(num_epochs=-1, output_numpy=False)

Create an iterator over the dataset. The data retrieved will be a dictionary.

The order of the columns in the dictionary may not be the same as the original order.

Parameters
  • num_epochs (int, optional) – Maximum number of epochs that the iterator can be iterated (default=-1, the iterator can be iterated for an infinite number of epochs).

  • output_numpy (bool, optional) – Whether or not to output NumPy datatype, if output_numpy=False, iterator will output MSTensor (default=False).

Returns

Iterator, dictionary of column name-ndarray pair.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>>
>>> # create an iterator
>>> # The columns in the data obtained by the iterator might be changed.
>>> iterator = data.create_dict_iterator()
>>> for item in iterator:
>>>     # print the data in column1
>>>     print(item["column1"])
create_tuple_iterator(columns=None, num_epochs=-1, output_numpy=False)

Create an iterator over the dataset. The data retrieved will be a list of ndarrays of data.

To specify which columns to list and the order needed, use columns. If columns is not provided, the order of the columns will not be changed.

Parameters
  • columns (list[str], optional) – List of columns to be used to specify the order of columns (default=None, means all columns).

  • num_epochs (int, optional) – Maximum number of epochs that the iterator can be iterated (default=-1, the iterator can be iterated for an infinite number of epochs).

  • output_numpy (bool, optional) – Whether or not to output NumPy datatype. If output_numpy=False, iterator will output MSTensor (default=False).

Returns

Iterator, list of ndarrays.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>>
>>> # Create an iterator
>>> # The columns in the data obtained by the iterator will not be changed.
>>> iterator = data.create_tuple_iterator()
>>> for item in iterator:
>>>     # convert the returned tuple to a list and print
>>>     print(list(item))
device_que(prefetch_size=None, send_epoch_end=True)

Return a transferred Dataset that transfers data through a device.

Parameters
  • prefetch_size (int, optional) – Prefetch number of records ahead of the user’s request (default=None).

  • send_epoch_end (bool, optional) – Whether to send end of sequence to device or not (default=True).

Note

If the device is Ascend, the features of the data are transferred one by one. Each transfer is limited to 256M.

Returns

TransferDataset, dataset for transferring.

filter(predicate, input_columns=None, num_parallel_workers=1)

Filter dataset by predicate.

Note

If input_columns is not provided or is empty, all columns will be used.

Parameters
  • predicate (callable) – Python callable which returns a boolean value. Elements for which it returns False are filtered out.

  • input_columns (list[str], optional) – List of names of the input columns (default=None, the predicate will be applied to all columns in the dataset).

  • num_parallel_workers (int, optional) – Number of workers to process the dataset in parallel (default=1).

Returns

FilterDataset, dataset filtered.

Examples

>>> import mindspore.dataset as ds
>>> # dataset is an instance of Dataset object that generates data (0 ~ 63).
>>> # Filter out the data that is greater than or equal to 11.
>>> dataset_f = dataset.filter(predicate=lambda data: data < 11, input_columns = ["data"])
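
The predicate semantics (rows are kept when the predicate returns True) can be sketched in plain Python. Rows are modeled as dictionaries here; `filter_rows` is a hypothetical helper, not part of the library.

```python
def filter_rows(rows, predicate, input_columns=None):
    """Keep the rows for which predicate returns True.

    The predicate receives the values of input_columns, or of every
    column when input_columns is None or empty.
    """
    kept = []
    for row in rows:
        columns = input_columns if input_columns else list(row)
        if predicate(*(row[name] for name in columns)):
            kept.append(row)
    return kept


# Data 0 ~ 63 in a single column named "data".
rows = [{"data": i} for i in range(64)]

# Rows with data >= 11 are filtered out; rows 0 ~ 10 remain.
kept = filter_rows(rows, lambda data: data < 11, input_columns=["data"])
```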
flat_map(func)

Map func to each row in dataset and flatten the result.

The specified func is a function that must take one ‘Ndarray’ as input and return a ‘Dataset’.

Parameters

func (function) – A function that must take one ‘Ndarray’ as an argument and return a ‘Dataset’.

Returns

Dataset, applied by the function.

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>>
>>> # Declare a function which returns a Dataset object
>>> def flat_map_func(x):
>>>     data_dir = text.to_str(x[0])
>>>     d = ds.ImageFolderDataset(data_dir)
>>>     return d
>>> # data is an instance of a Dataset object.
>>> data = ds.TextFileDataset(DATA_FILE)
>>> data = data.flat_map(flat_map_func)
Raises
  • TypeError – If func is not a function.

  • TypeError – If func doesn’t return a Dataset.

get_batch_size()

Get the size of a batch.

Returns

Number, the number of data in a batch.

get_col_names()

Get the names of the columns in the dataset.

get_repeat_count()

Get the repeat count if this is a RepeatDataset; otherwise return 1.

Returns

Number, the count of repeat.

map(operations=None, input_columns=None, output_columns=None, column_order=None, num_parallel_workers=None, python_multiprocessing=False, cache=None, callbacks=None)

Apply each operation in operations to this dataset.

The order of operations is determined by the position of each operation in the operations parameter. operations[0] will be applied first, then operations[1], then operations[2], etc.

Each operation will be passed one or more columns from the dataset as input, and zero or more columns will be outputted. The first operation will be passed the columns specified in input_columns as input. If there is more than one operator in operations, the outputted columns of the previous operation are used as the input columns for the next operation. The columns outputted by the very last operation will be assigned names specified by output_columns.

Only the columns specified in column_order will be propagated to the child node. These columns will be in the same order as specified in column_order.

Parameters
  • operations (Union[list[TensorOp], list[functions]]) – List of operations to be applied on the dataset. Operations are applied in the order they appear in this list.

  • input_columns (list[str]) – List of the names of the columns that will be passed to the first operation as input. The size of this list must match the number of input columns expected by the first operator. (default=None, the first operation will be passed however many columns are required, starting from the first column).

  • output_columns (list[str], optional) – List of names assigned to the columns outputted by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. (default=None, output columns will have the same name as the input columns, i.e., the columns will be replaced).

  • column_order (list[str], optional) – List of all the desired columns to propagate to the child node. This list must be a subset of all the columns in the dataset after all operations are applied. The order of the columns in each row propagated to the child node follows the order they appear in this list. The parameter is mandatory if len(input_columns) != len(output_columns). (default=None, all columns will be propagated to the child node, the order of the columns will remain the same).

  • num_parallel_workers (int, optional) – Number of threads used to process the dataset in parallel (default=None, the value from the configuration will be used).

  • python_multiprocessing (bool, optional) – Parallelize Python operations with multiple worker processes. This option could be beneficial if the Python operation is computationally heavy (default=False).

  • cache (DatasetCache, optional) – Tensor cache to use. (default=None which means no cache is used). The cache feature is under development and is not recommended.

  • callbacks (DSCallback, list[DSCallback], optional) – List of Dataset callbacks to be called (default=None).

Returns

MapDataset, dataset after mapping operation.

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.vision.c_transforms as c_transforms
>>>
>>> # data is an instance of Dataset which has 2 columns, "image" and "label".
>>> # ds_pyfunc is an instance of Dataset which has 3 columns, "col0", "col1", and "col2".
>>> # Each column is a 2D array of integers.
>>>
>>> # Set the global configuration value for num_parallel_workers to be 2.
>>> # Operations which use this configuration value will use 2 worker threads,
>>> # unless otherwise specified in the operator's constructor.
>>> # set_num_parallel_workers can be called again later if a different
>>> # global configuration value for the number of worker threads is desired.
>>> ds.config.set_num_parallel_workers(2)
>>>
>>> # Define two operations, where each operation accepts 1 input column and outputs 1 column.
>>> decode_op = c_transforms.Decode(rgb_format=True)
>>> random_jitter_op = c_transforms.RandomColorAdjust((0.8, 0.8), (1, 1), (1, 1), (0, 0))
>>>
>>> # 1) Simple map example
>>>
>>> operations = [decode_op]
>>> input_columns = ["image"]
>>>
>>> # Apply decode_op on column "image". This column will be replaced by the outputted
>>> # column of decode_op. Since column_order is not provided, both columns "image"
>>> # and "label" will be propagated to the child node in their original order.
>>> ds_decoded = data.map(operations, input_columns)
>>>
>>> # Rename column "image" to "decoded_image".
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(operations, input_columns, output_columns)
>>>
>>> # Specify the order of the columns.
>>> column_order = ["label", "image"]
>>> ds_decoded = data.map(operations, input_columns, None, column_order)
>>>
>>> # Rename column "image" to "decoded_image" and also specify the order of the columns.
>>> column_order = ["label", "decoded_image"]
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(operations, input_columns, output_columns, column_order)
>>>
>>> # Rename column "image" to "decoded_image" and keep only this column.
>>> column_order = ["decoded_image"]
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(operations, input_columns, output_columns, column_order)
>>>
>>> # A simple example using pyfunc: Renaming columns and specifying column order
>>> # work in the same way as the previous examples.
>>> input_columns = ["col0"]
>>> operations = [(lambda x: x + 1)]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns)
>>>
>>> # 2) Map example with more than one operation
>>>
>>> # If this list of operations is used with map, decode_op will be applied
>>> # first, then random_jitter_op will be applied.
>>> operations = [decode_op, random_jitter_op]
>>>
>>> input_columns = ["image"]
>>>
>>> # Create a dataset where the images are decoded, then randomly color jittered.
>>> # decode_op takes column "image" as input and outputs one column. The column
>>> # outputted by decode_op is passed as input to random_jitter_op.
>>> # random_jitter_op will output one column. Column "image" will be replaced by
>>> # the column outputted by random_jitter_op (the very last operation). All other
>>> # columns are unchanged. Since column_order is not specified, the order of the
>>> # columns will remain the same.
>>> ds_mapped = data.map(operations, input_columns)
>>>
>>> # Create a dataset that is identical to ds_mapped, except the column "image"
>>> # that is outputted by random_jitter_op is renamed to "image_transformed".
>>> # Specifying column order works in the same way as examples in 1).
>>> output_columns = ["image_transformed"]
>>> ds_mapped_and_renamed = data.map(operations, input_columns, output_columns)
>>>
>>> # Multiple operations using pyfunc: Renaming columns and specifying column order
>>> # work in the same way as examples in 1).
>>> input_columns = ["col0"]
>>> operations = [(lambda x: x + x), (lambda x: x - 1)]
>>> output_columns = ["col0_mapped"]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns, output_columns)
>>>
>>> # 3) Example where number of input columns is not equal to number of output columns
>>>
>>> # operations[0] is a lambda that takes 2 columns as input and outputs 3 columns.
>>> # operations[1] is a lambda that takes 3 columns as input and outputs 1 column.
>>> # operations[2] is a lambda that takes 1 column as input and outputs 4 columns.
>>> #
>>> # Note: The number of output columns of operations[i] must equal the number of
>>> # input columns of operations[i+1]. Otherwise, this map call will also result
>>> # in an error.
>>> operations = [(lambda x, y: (x, x + y, x + y + 1)),
>>>               (lambda x, y, z: x * y * z),
>>>               (lambda x: (x % 2, x % 3, x % 5, x % 7))]
>>>
>>> # Note: Since the number of input columns is not the same as the number of
>>> # output columns, the output_columns and column_order parameters must be
>>> # specified. Otherwise, this map call will also result in an error.
>>> input_columns = ["col2", "col0"]
>>> output_columns = ["mod2", "mod3", "mod5", "mod7"]
>>>
>>> # Propagate all columns to the child node in this order:
>>> column_order = ["col0", "col2", "mod2", "mod3", "mod5", "mod7", "col1"]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns, output_columns, column_order)
>>>
>>> # Propagate some columns to the child node in this order:
>>> column_order = ["mod7", "mod3", "col1"]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns, output_columns, column_order)
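
How the outputs of one operation feed the next in example 3) can be traced on a single row with plain Python. This sketch models only the chaining and the naming of the final outputs, not column propagation or column_order; `chain_operations` is a hypothetical helper and the row values are made up for illustration.

```python
def chain_operations(row, operations, input_columns, output_columns):
    """Apply operations in order and name the outputs of the last one."""
    values = tuple(row[name] for name in input_columns)
    for operation in operations:
        result = operation(*values)
        # A single return value is treated as one output column.
        values = result if isinstance(result, tuple) else (result,)
    if len(values) != len(output_columns):
        raise ValueError("output_columns must name every output of the last operation")
    return dict(zip(output_columns, values))


operations = [(lambda x, y: (x, x + y, x + y + 1)),
              (lambda x, y, z: x * y * z),
              (lambda x: (x % 2, x % 3, x % 5, x % 7))]

row = {"col0": 1, "col1": 2, "col2": 3}

# x comes from "col2" and y from "col0": (3, 1) -> (3, 4, 5) -> 60 -> (0, 0, 0, 4)
mapped = chain_operations(row, operations, ["col2", "col0"],
                          ["mod2", "mod3", "mod5", "mod7"])
```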
num_classes()

Get the number of classes in a dataset.

Returns

Number, number of classes.

output_shapes()

Get the shapes of output data.

Returns

List, list of shapes of each column.

output_types()

Get the types of output data.

Returns

List of data types.

project(columns)

Project certain columns in input dataset.

The specified columns will be selected from the dataset and passed down the pipeline in the order specified. The other columns are discarded.

Parameters

columns (list[str]) – List of names of the columns to project.

Returns

ProjectDataset, dataset projected.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>> columns_to_project = ["column3", "column1", "column2"]
>>>
>>> # Create a dataset that consists of column3, column1, column2
>>> # in that order, regardless of the original order of columns.
>>> data = data.project(columns=columns_to_project)
rename(input_columns, output_columns)

Rename the columns in input datasets.

Parameters
  • input_columns (list[str]) – List of names of the input columns.

  • output_columns (list[str]) – List of names of the output columns.

Returns

RenameDataset, dataset renamed.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> input_columns = ["input_col1", "input_col2", "input_col3"]
>>> output_columns = ["output_col1", "output_col2", "output_col3"]
>>>
>>> # Create a dataset where input_col1 is renamed to output_col1, and
>>> # input_col2 is renamed to output_col2, and input_col3 is renamed
>>> # to output_col3.
>>> data = data.rename(input_columns=input_columns, output_columns=output_columns)
repeat(count=None)

Repeat this dataset count times. Repeat indefinitely if the count is None or -1.

Note

The order of using repeat and batch affects the number of batches. It is recommended that the repeat operation be used after the batch operation. If dataset_sink_mode is False, the repeat operation is invalid. If dataset_sink_mode is True, repeat count must be equal to the epoch of training. Otherwise, errors could occur since the amount of data is not the amount training requires.

Parameters

count (int) – Number of times the dataset is repeated (default=None).

Returns

RepeatDataset, dataset repeated.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>>
>>> # Create a dataset where the dataset is repeated for 50 epochs
>>> repeated = data.repeat(50)
>>>
>>> # Create a dataset where each epoch is shuffled individually
>>> shuffled_and_repeated = data.shuffle(10)
>>> shuffled_and_repeated = shuffled_and_repeated.repeat(50)
>>>
>>> # Create a dataset where the dataset is first repeated for
>>> # 50 epochs before shuffling. The shuffle operator will treat
>>> # the entire 50 epochs as one big dataset.
>>> repeat_and_shuffle = data.repeat(50)
>>> repeat_and_shuffle = repeat_and_shuffle.shuffle(10)
reset()

Reset the dataset for next epoch.

save(file_name, num_files=1, file_type='mindrecord')

Save the dynamic data processed by the dataset pipeline in a common dataset format. Supported dataset formats: ‘mindrecord’ only.

Implicit type casting exists when saving data as ‘mindrecord’. The table below shows how to do type casting.

Implicit Type Casting when Saving as ‘mindrecord’

Type in ‘dataset’   Type in ‘mindrecord’   Details
------------------  ---------------------  --------------------------------------
bool                None                   Not supported
int8                int32
uint8               bytes(1D uint8)        Drop dimension
int16               int32
uint16              int32
int32               int32
uint32              int64
int64               int64
uint64              None                   Not supported
float16             float32
float32             float32
float64             float64
string              string                 Multi-dimensional string not supported

Note

  1. To save the samples in order, set dataset’s shuffle to False and num_files to 1.

  2. Before calling the function, do not use batch operator, repeat operator or data augmentation operators with random attribute in map operator.

  3. Mindrecord does not support DE_UINT64, multi-dimensional DE_UINT8(drop dimension) nor multi-dimensional DE_STRING.

Parameters
  • file_name (str) – Path to dataset file.

  • num_files (int, optional) – Number of dataset files (default=1).

  • file_type (str, optional) – Dataset format (default=’mindrecord’).

shuffle(buffer_size)

Randomly shuffles the rows of this dataset using the following algorithm:

  1. Make a shuffle buffer that contains the first buffer_size rows.

  2. Randomly select an element from the shuffle buffer to be the next row propagated to the child node.

  3. Get the next row (if any) from the parent node and put it in the shuffle buffer.

  4. Repeat steps 2 and 3 until there are no more rows left in the shuffle buffer.

A seed can be provided to be used on the first epoch. In every subsequent epoch, the seed is changed to a new, randomly generated value.

Parameters

buffer_size (int) – The size of the buffer (must be larger than 1) for shuffling. Setting buffer_size equal to the number of rows in the entire dataset will result in a global shuffle.

Returns

ShuffleDataset, dataset shuffled.

Raises

RuntimeError – If sync operators exist before shuffle.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> # Optionally set the seed for the first epoch
>>> ds.config.set_seed(58)
>>>
>>> # Create a shuffled dataset using a shuffle buffer of size 4
>>> data = data.shuffle(4)
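
The four-step buffer algorithm described above can be sketched in plain Python. This is an illustration under the assumption that the dataset is any iterable of rows; `buffer_shuffle` is a hypothetical helper, and Python's `random.Random` stands in for whatever random source the library uses.

```python
import random
from itertools import islice

_END = object()  # sentinel marking an exhausted source


def buffer_shuffle(rows, buffer_size, seed=None):
    """Shuffle rows using a bounded shuffle buffer."""
    rng = random.Random(seed)
    source = iter(rows)
    # Step 1: fill the buffer with the first buffer_size rows.
    buffer = list(islice(source, buffer_size))
    output = []
    # Step 4: repeat steps 2 and 3 until the buffer is empty.
    while buffer:
        # Step 2: pick a random element from the buffer as the next row.
        output.append(buffer.pop(rng.randrange(len(buffer))))
        # Step 3: refill the buffer with the next row, if any.
        next_row = next(source, _END)
        if next_row is not _END:
            buffer.append(next_row)
    return output


shuffled = buffer_shuffle(range(10), buffer_size=4, seed=58)
```

In this sketch, the row emitted at position i can only come from the first i + buffer_size input rows, which is why a buffer as large as the whole dataset is needed for a global shuffle.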
skip(count)

Skip the first N elements of this dataset.

Parameters

count (int) – Number of elements in the dataset to be skipped.

Returns

SkipDataset, dataset skipped.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> # Create a dataset which skips first 3 elements from data
>>> data = data.skip(3)
split(sizes, randomize=True)

Split the dataset into smaller, non-overlapping datasets.

Parameters
  • sizes (Union[list[int], list[float]]) –

    If a list of integers [s1, s2, …, sn] is provided, the dataset will be split into n datasets of size s1, size s2, …, size sn respectively. If the sum of all sizes does not equal the original dataset size, an error will occur. If a list of floats [f1, f2, …, fn] is provided, all floats must be between 0 and 1 and must sum to 1, otherwise an error will occur. The dataset will be split into n Datasets of size round(f1*K), round(f2*K), …, round(fn*K) where K is the size of the original dataset. If after rounding:

    • Any size equals 0, an error will occur.

    • The sum of split sizes < K, the difference will be added to the first split.

    • The sum of split sizes > K, the difference will be removed from the first large enough split such that it will have at least 1 row after removing the difference.

  • randomize (bool, optional) – Determines whether or not to split the data randomly (default=True). If True, the data will be randomly split. Otherwise, each split will be created with consecutive rows from the dataset.
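
The rounding rules for a list of floats can be sketched as follows. This is an illustration of the rules as stated above, not the library's code; `split_sizes` is a hypothetical helper, and Python's built-in `round` (which rounds halves to the nearest even integer) is assumed as the rounding function.

```python
def split_sizes(fractions, total):
    """Convert fractional split sizes into row counts for a dataset of total rows."""
    sizes = [round(fraction * total) for fraction in fractions]
    if any(size == 0 for size in sizes):
        raise RuntimeError("a split has size 0 after rounding")
    difference = total - sum(sizes)
    if difference > 0:
        # Too few rows assigned: the first split absorbs the difference.
        sizes[0] += difference
    elif difference < 0:
        # Too many rows assigned: shrink the first split that still keeps
        # at least 1 row after the excess is removed.
        excess = -difference
        for index, size in enumerate(sizes):
            if size - excess >= 1:
                sizes[index] -= excess
                break
    return sizes


train_rows, test_rows = split_sizes([0.9, 0.1], 10)  # 9 and 1
```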

Note

  1. There is an optimized split function, which will be called automatically when the dataset that calls this function is a MappableDataset.

  2. Dataset should not be sharded if split is going to be called. Instead, create a DistributedSampler and specify a split to shard after splitting. If dataset is sharded after a split, it is strongly recommended to set the same seed in each instance of execution, otherwise each shard may not be part of the same split (see Examples).

  3. It is strongly recommended to not shuffle the dataset, but use randomize=True instead. Shuffling the dataset may not be deterministic, which means the data in each split will be different in each epoch. Furthermore, if sharding occurs after split, each shard may not be part of the same split.

Raises
  • RuntimeError – If get_dataset_size returns None or is not supported for this dataset.

  • RuntimeError – If sizes is list of integers and sum of all elements in sizes does not equal the dataset size.

  • RuntimeError – If sizes is list of float and there is a split with size 0 after calculations.

  • RuntimeError – If the dataset is sharded prior to calling split.

  • ValueError – If sizes is list of float and not all floats are between 0 and 1, or if the floats don’t sum to 1.

Returns

tuple(Dataset), a tuple of datasets that have been split.

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_dir = "/path/to/imagefolder_directory"
>>>
>>> # Since many datasets have shuffle on by default, set shuffle to False if split will be called!
>>> data = ds.ImageFolderDataset(dataset_dir, shuffle=False)
>>>
>>> # Set the seed, and tell split to use this seed when randomizing.
>>> # This is needed because sharding will be done later
>>> ds.config.set_seed(58)
>>> train, test = data.split([0.9, 0.1])
>>>
>>> # To shard the train dataset, use a DistributedSampler
>>> train_sampler = ds.DistributedSampler(10, 2)
>>> train.use_sampler(train_sampler)
sync_update(condition_name, num_batch=None, data=None)

Release a blocking condition and trigger callback with given data.

Parameters
  • condition_name (str) – The condition name that is used to toggle sending next row.

  • num_batch (Union[int, None]) – The number of batches (rows) that are released. When num_batch is None, it will default to the number specified by the sync_wait operator (default=None).

  • data (Union[dict, None]) – The data passed to the callback (default=None).

sync_wait(condition_name, num_batch=1, callback=None)

Add a blocking condition to the input Dataset.

Parameters
  • num_batch (int) – the number of batches without blocking at the start of each epoch.

  • condition_name (str) – The condition name that is used to toggle sending next row.

  • callback (function) – The callback function that will be invoked when sync_update is called.

Raises

RuntimeError – If condition name already exists.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object; batch_size is an int chosen by the user.
>>> data = data.sync_wait("callback1")
>>> data = data.batch(batch_size)
>>> for batch_data in data.create_dict_iterator():
>>>     # Release the blocking condition without rebinding data.
>>>     data.sync_update("callback1")
take(count=-1)

Takes at most given numbers of elements from the dataset.

Note

  1. If count is greater than the number of elements in the dataset or equal to -1, all the elements in dataset will be taken.

  2. The order of using take and batch matters. If take is before batch operation, then take given number of rows; otherwise take given number of batches.

Parameters

count (int, optional) – Number of elements to be taken from the dataset (default=-1).

Returns

TakeDataset, dataset taken.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> # Create a dataset where the dataset includes 50 elements.
>>> data = data.take(50)
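The note on the order of take and batch can be illustrated with plain Python lists. The helpers take and batch below are simplified stand-ins written for this sketch, not the mindspore.dataset implementations.

```python
def take(items, count=-1):
    """Return at most `count` items; -1 means all of them."""
    items = list(items)
    return items if count == -1 else items[:count]


def batch(rows, batch_size):
    """Group consecutive rows into lists of up to `batch_size` rows."""
    rows = list(rows)
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]


rows = list(range(10))

# take before batch: 4 rows are taken, then grouped into 2 batches.
take_then_batch = batch(take(rows, 4), 2)
print(take_then_batch)   # [[0, 1], [2, 3]]

# batch before take: 5 batches are formed, then 4 batches are taken.
batch_then_take = take(batch(rows, 2), 4)
print(batch_then_take)   # [[0, 1], [2, 3], [4, 5], [6, 7]]
```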
to_device(send_epoch_end=True)

Transfer data through CPU, GPU or Ascend devices.

Parameters

send_epoch_end (bool, optional) – Whether to send end of sequence to device or not (default=True).

Note

If device is Ascend, features of data will be transferred one by one. A single transmission is limited to 256M of data.

Returns

TransferDataset, dataset for transferring.

Raises
  • TypeError – If device_type is empty.

  • ValueError – If device_type is not ‘Ascend’, ‘GPU’ or ‘CPU’.

  • RuntimeError – If dataset is unknown.

  • RuntimeError – If distribution file path is given but failed to read.

use_sampler(new_sampler)

Will make the current dataset use the new_sampler provided.

Parameters

new_sampler (Sampler) – The sampler to use for the current dataset.

Returns

Dataset, that uses new_sampler.

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_dir = "/path/to/imagefolder_directory"
>>> # Note: A SequentialSampler is created by default
>>> data = ds.ImageFolderDataset(dataset_dir)
>>>
>>> # Use a DistributedSampler instead of the SequentialSampler
>>> new_sampler = ds.DistributedSampler(10, 2)
>>> data.use_sampler(new_sampler)
zip(datasets)

Zip the datasets in the input tuple of datasets. Columns in the input datasets must not have the same name.

Parameters

datasets (Union[tuple, class Dataset]) – A tuple of datasets or a single class Dataset to be zipped together with this dataset.

Returns

ZipDataset, dataset zipped.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # ds1 and ds2 are instances of Dataset object
>>> # Create a dataset which is the combination of ds1 and ds2
>>> data = ds1.zip(ds2)
class mindspore.dataset.RandomSampler(replacement=False, num_samples=None)[source]

Samples the elements randomly.

Parameters
  • replacement (bool, optional) – If True, put the sample ID back for the next draw (default=False).

  • num_samples (int, optional) – Number of elements to sample (default=None, all elements).

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_dir = "path/to/imagefolder_directory"
>>>
>>> # creates a RandomSampler
>>> sampler = ds.RandomSampler()
>>> data = ds.ImageFolderDataset(dataset_dir, num_parallel_workers=8, sampler=sampler)
get_num_samples()

All samplers can contain a numeric num_samples value (or it can be set to None). A child sampler can exist or be None. If a child sampler exists, then the child sampler count can be a numeric value or None. These conditions impact the resultant sampler count that is used. The following table shows the possible results from calling this function.

child sampler    num_samples    child_samples    result
T                x              y                min(x, y)
T                x              None             x
T                None           y                y
T                None           None             None
None             x              n/a              x
None             None           n/a              None

Returns

int, The number of samples, or None
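The table above can be read as a small decision function. The following is a minimal sketch of that logic only. The name resolve_num_samples and its has_child flag are hypothetical and do not correspond to the library's internal code.

```python
def resolve_num_samples(num_samples, has_child=False, child_samples=None):
    """Illustrative only: combine a sampler's count with its child's count."""
    if not has_child or child_samples is None:
        # No child, or the child sets no limit: the sampler's own value is used.
        return num_samples
    if num_samples is None:
        # Only the child sets a limit.
        return child_samples
    # Both set a limit: the smaller one wins.
    return min(num_samples, child_samples)


print(resolve_num_samples(5, has_child=True, child_samples=3))      # 3
print(resolve_num_samples(None, has_child=True, child_samples=3))   # 3
print(resolve_num_samples(5))                                       # 5
print(resolve_num_samples(None))                                    # None
```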

class mindspore.dataset.Schema(schema_file=None)[source]

Class to represent a schema of a dataset.

Parameters

schema_file (str) – Path of schema file (default=None).

Returns

Schema object, schema info about dataset.

Raises

RuntimeError – If schema file failed to load.

Example

>>> import mindspore.dataset as ds
>>> import mindspore.common.dtype as mstype
>>>
>>> # Create schema; specify column name, mindspore.dtype and shape of the column
>>> schema = ds.Schema()
>>> schema.add_column('col1', de_type=mstype.int64, shape=[2])
add_column(name, de_type, shape=None)[source]

Add new column to the schema.

Parameters
  • name (str) – Name of the column.

  • de_type (str) – Data type of the column.

  • shape (list[int], optional) – Shape of the column (default=None, [-1] which is an unknown shape of rank 1).

Raises

ValueError – If column type is unknown.

from_json(json_obj)[source]

Get the schema from a parsed JSON object.

Parameters

json_obj (dictionary) – Parsed JSON object.

parse_columns(columns)[source]

Parse the columns and add them to this schema.

Parameters

columns (Union[dict, list[dict]]) –

Dataset attribute information, decoded from schema file.

  • list[dict], each dict must contain the keys ‘name’ and ‘type’; ‘shape’ is optional.

  • dict, each key is a column name and each value is a dict that must contain ‘type’; ‘shape’ is optional.

Example

>>> import mindspore.dataset as ds
>>>
>>> schema = ds.Schema()
>>> columns1 = [{'name': 'image', 'type': 'int8', 'shape': [3, 3]},
>>>             {'name': 'label', 'type': 'int8', 'shape': [1]}]
>>> schema.parse_columns(columns1)
>>> columns2 = {'image': {'shape': [3, 3], 'type': 'int8'}, 'label': {'shape': [1], 'type': 'int8'}}
>>> schema.parse_columns(columns2)
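The two accepted layouts carry the same information. As a rough illustration, the plain-Python sketch below reduces both to a list of (name, type, shape) tuples. The function normalize_columns is hypothetical and only mirrors the description of the formats. It is not how Schema stores columns internally.

```python
def normalize_columns(columns):
    """Illustrative only: return a list of (name, type, shape) tuples."""
    if isinstance(columns, dict):
        # dict form: the key is the column name, the value holds type and shape.
        items = [(name, spec) for name, spec in columns.items()]
    else:
        # list form: every entry carries its own 'name'.
        items = [(spec["name"], spec) for spec in columns]
    # 'type' is required, 'shape' is optional (None when absent).
    return [(name, spec["type"], spec.get("shape")) for name, spec in items]


columns1 = [{'name': 'image', 'type': 'int8', 'shape': [3, 3]},
            {'name': 'label', 'type': 'int8', 'shape': [1]}]
columns2 = {'image': {'shape': [3, 3], 'type': 'int8'},
            'label': {'shape': [1], 'type': 'int8'}}

print(normalize_columns(columns1) == normalize_columns(columns2))   # True
```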
to_json()[source]

Get a JSON string of the schema.

Returns

str, JSON string of the schema.

class mindspore.dataset.SequentialSampler(start_index=None, num_samples=None)[source]

Samples the dataset elements sequentially, same as not having a sampler.

Parameters
  • start_index (int, optional) – Index to start sampling at (default=None, start at first ID).

  • num_samples (int, optional) – Number of elements to sample (default=None, all elements).

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_dir = "path/to/imagefolder_directory"
>>>
>>> # creates a SequentialSampler
>>> sampler = ds.SequentialSampler()
>>> data = ds.ImageFolderDataset(dataset_dir, num_parallel_workers=8, sampler=sampler)
get_num_samples()

All samplers can contain a numeric num_samples value (or it can be set to None). A child sampler can exist or be None. If a child sampler exists, then the child sampler count can be a numeric value or None. These conditions impact the resultant sampler count that is used. The following table shows the possible results from calling this function.

child sampler    num_samples    child_samples    result
T                x              y                min(x, y)
T                x              None             x
T                None           y                y
T                None           None             None
None             x              n/a              x
None             None           n/a              None

Returns

int, The number of samples, or None

class mindspore.dataset.SubsetRandomSampler(indices, num_samples=None)[source]

Samples the elements randomly from a sequence of indices.

Parameters
  • indices (list[int]) – A sequence of indices.

  • num_samples (int, optional) – Number of elements to sample (default=None, all elements).

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_dir = "path/to/imagefolder_directory"
>>>
>>> indices = [0, 1, 2, 3, 7, 88, 119]
>>>
>>> # creates a SubsetRandomSampler, will sample from the provided indices
>>> sampler = ds.SubsetRandomSampler(indices)
>>> data = ds.ImageFolderDataset(dataset_dir, num_parallel_workers=8, sampler=sampler)
class mindspore.dataset.TFRecordDataset(dataset_files, schema=None, columns_list=None, num_samples=None, num_parallel_workers=None, shuffle=Shuffle.GLOBAL, num_shards=None, shard_id=None, shard_equal_rows=False, cache=None)[source]

A source dataset that reads and parses datasets stored on disk in TFData format.

Parameters
  • dataset_files (Union[str, list[str]]) – String or list of files to be read or glob strings to search for a pattern of files. The list will be sorted in a lexicographical order.

  • schema (Union[str, Schema], optional) – Path to the JSON schema file or schema object (default=None). If the schema is not provided, the meta data from the TFData file is considered the schema.

  • columns_list (list[str], optional) – List of columns to be read (default=None, read all columns)

  • num_samples (int, optional) – Number of samples (rows) to read (default=None). If num_samples is None and numRows(parsed from schema) does not exist, read the full dataset; If num_samples is None and numRows(parsed from schema) is greater than 0, read numRows rows; If both num_samples and numRows(parsed from schema) are greater than 0, read num_samples rows.

  • num_parallel_workers (int, optional) – Number of workers to read the data (default=None, number set in the config).

  • shuffle (Union[bool, Shuffle level], optional) –

    Perform reshuffling of the data every epoch (default=Shuffle.GLOBAL). If shuffle is False, no shuffling will be performed. If shuffle is True, the behavior is the same as setting shuffle to Shuffle.GLOBAL. Otherwise, there are two levels of shuffling:

    • Shuffle.GLOBAL: Shuffle both the files and samples.

    • Shuffle.FILES: Shuffle files only.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into (default=None).

  • shard_id (int, optional) – The shard ID within num_shards (default=None). This argument can only be specified when num_shards is also specified.

  • shard_equal_rows (bool, optional) – Get equal rows for all shards (default=False). If shard_equal_rows is False, the number of rows in each shard may not be equal.

  • cache (DatasetCache, optional) – Tensor cache to use. (default=None which means no cache is used). The cache feature is under development and is not recommended.
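The interaction between num_samples and the numRows value parsed from the schema can be summarized as a small decision function. This is a sketch of the rules stated for num_samples only. The name rows_to_read is hypothetical, and the case where num_samples is set but numRows is absent is an assumption, since the description does not cover it.

```python
def rows_to_read(num_samples=None, num_rows=None):
    """Illustrative only. Returns None to mean "read the full dataset"."""
    if num_samples is not None and num_samples > 0:
        # num_samples takes priority whenever it is given.
        # Assumption: this also applies when numRows is absent.
        return num_samples
    if num_rows is not None and num_rows > 0:
        # Fall back to numRows from the schema.
        return num_rows
    # Neither value is available: read everything.
    return None


print(rows_to_read())                               # None
print(rows_to_read(num_rows=100))                   # 100
print(rows_to_read(num_samples=10, num_rows=100))   # 10
```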

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.common.dtype as mstype
>>>
>>> dataset_files = ["/path/to/1", "/path/to/2"] # contains 1 or multiple tf data files
>>>
>>> # 1) Get all rows from dataset_files with no explicit schema
>>> # The meta-data in the first row will be used as a schema.
>>> tfdataset = ds.TFRecordDataset(dataset_files=dataset_files)
>>>
>>> # 2) Get all rows from dataset_files with user-defined schema
>>> schema = ds.Schema()
>>> schema.add_column('col_1d', de_type=mstype.int64, shape=[2])
>>> tfdataset = ds.TFRecordDataset(dataset_files=dataset_files, schema=schema)
>>>
>>> # 3) Get all rows from dataset_files with schema file "./schema.json"
>>> tfdataset = ds.TFRecordDataset(dataset_files=dataset_files, schema="./schema.json")
apply(apply_func)

Apply a function in this dataset.

Parameters

apply_func (function) – A function that must take one ‘Dataset’ as an argument and return a preprocessed ‘Dataset’.

Returns

Dataset, applied by the function.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>>
>>> # Declare an apply_func function which returns a Dataset object
>>> def apply_func(dataset):
>>>     dataset = dataset.batch(2)
>>>     return dataset
>>>
>>> # Use apply to call apply_func
>>> data = data.apply(apply_func)
Raises
  • TypeError – If apply_func is not a function.

  • TypeError – If apply_func doesn’t return a Dataset.

batch(batch_size, drop_remainder=False, num_parallel_workers=None, per_batch_map=None, input_columns=None, output_columns=None, column_order=None, pad_info=None)

Combine batch_size number of consecutive rows into batches.

For any child node, a batch is treated as a single row. For any column, all the elements within that column must have the same shape. If a per_batch_map callable is provided, it will be applied to the batches of tensors.

Note

The order in which repeat and batch are applied affects the number of batches and the behavior of per_batch_map. It is recommended that the repeat operation be used after the batch operation.

Parameters
  • batch_size (int or function) – The number of rows each batch is created with. An int or callable which takes exactly 1 parameter, BatchInfo.

  • drop_remainder (bool, optional) – Determines whether or not to drop the last possibly incomplete batch (default=False). If True, and if there are less than batch_size rows available to make the last batch, then those rows will be dropped and not propagated to the child node.

  • num_parallel_workers (int, optional) – Number of workers to process the dataset in parallel (default=None).

  • per_batch_map (callable, optional) – Per batch map callable. A callable which takes (list[Tensor], list[Tensor], …, BatchInfo) as input parameters. Each list[Tensor] represents a batch of Tensors on a given column. The number of lists should match with number of entries in input_columns. The last parameter of the callable should always be a BatchInfo object.

  • input_columns (list[str], optional) – List of names of the input columns. The size of the list should match with signature of the per_batch_map callable.

  • output_columns (list[str], optional) – [Not currently implemented] List of names assigned to the columns outputted by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. (default=None, output columns will have the same name as the input columns, i.e., the columns will be replaced).

  • column_order (list[str], optional) – [Not currently implemented] List of all the desired columns to propagate to the child node. This list must be a subset of all the columns in the dataset after all operations are applied. The order of the columns in each row propagated to the child node follow the order they appear in this list. The parameter is mandatory if the len(input_columns) != len(output_columns). (default=None, all columns will be propagated to the child node, the order of the columns will remain the same).

  • pad_info (dict, optional) – Whether to perform padding on selected columns. pad_info={“col1”:([224,224],0)} would pad column with name “col1” to a tensor of size [224,224] and fill the missing with 0.

Returns

BatchDataset, dataset batched.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>>
>>> # Create a dataset where every 100 rows is combined into a batch
>>> # and drops the last incomplete batch if there is one.
>>> data = data.batch(100, True)
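The effect of ordering repeat and batch, and of drop_remainder, can be shown with plain Python lists. The batch and repeat helpers below are simplified stand-ins written for this sketch and do not call mindspore.dataset.

```python
def batch(rows, batch_size, drop_remainder=False):
    """Group consecutive rows; optionally drop a short final batch."""
    rows = list(rows)
    batches = [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]
    if drop_remainder and batches and len(batches[-1]) < batch_size:
        batches.pop()
    return batches


def repeat(items, count):
    """Repeat the whole sequence `count` times."""
    return list(items) * count


rows = list(range(5))

# batch, then repeat: 3 batches per epoch, so 6 batches over 2 epochs.
batch_then_repeat = repeat(batch(rows, 2), 2)
print(len(batch_then_repeat))   # 6

# repeat, then batch: 10 rows are grouped into 5 batches, and one batch
# in this sketch contains rows from two different epochs.
repeat_then_batch = batch(repeat(rows, 2), 2)
print(len(repeat_then_batch))   # 5
print(repeat_then_batch[2])     # [4, 0]

# drop_remainder=True discards the incomplete last batch.
print(batch(rows, 2, drop_remainder=True))   # [[0, 1], [2, 3]]
```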
bucket_batch_by_length(column_names, bucket_boundaries, bucket_batch_sizes, element_length_function=None, pad_info=None, pad_to_bucket_boundary=False, drop_remainder=False)

Bucket elements according to their lengths. Each bucket will be padded and batched when it is full.

A length function is called on each row in the dataset. The row is then bucketed based on its length and bucket_boundaries. When a bucket reaches its corresponding size specified in bucket_batch_sizes, the entire bucket will be padded according to pad_info, and then batched. Each batch will be full, except possibly the last batch of each bucket.

Parameters
  • column_names (list[str]) – Columns passed to element_length_function.

  • bucket_boundaries (list[int]) – A list consisting of the upper boundaries of the buckets. Must be strictly increasing. If there are n boundaries, n+1 buckets are created: one bucket for [0, bucket_boundaries[0]), one bucket for [bucket_boundaries[i], bucket_boundaries[i+1]) for each 0<=i<n-1, and one bucket for [bucket_boundaries[n-1], inf).

  • bucket_batch_sizes (list[int]) – A list consisting of the batch sizes for each bucket. Must contain len(bucket_boundaries)+1 elements.

  • element_length_function (Callable, optional) – A function that takes in len(column_names) arguments and returns an int. If no value is provided, then len(column_names) must be 1, and the size of the first dimension of that column will be taken as the length (default=None).

  • pad_info (dict, optional) – Represents how to batch each column. The key corresponds to the column name, and the value must be a tuple of 2 elements. The first element corresponds to the shape to pad to, and the second element corresponds to the value to pad with. If a column is not specified, then that column will be padded to the longest in the current batch, and 0 will be used as the padding value. Any None dimensions will be padded to the longest in the current batch, unless if pad_to_bucket_boundary is True. If no padding is wanted, set pad_info to None (default=None).

  • pad_to_bucket_boundary (bool, optional) – If True, will pad each None dimension in pad_info to the bucket_boundary minus 1. If there are any elements that fall into the last bucket, an error will occur (default=False).

  • drop_remainder (bool, optional) – If True, will drop the last batch for each bucket if it is not a full batch (default=False).
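The half-open bucket intervals described for bucket_boundaries map naturally onto a binary search. The sketch below uses the standard-library bisect module to find the bucket for a given length. The function name bucket_index is hypothetical, and the library may compute the bucket differently.

```python
from bisect import bisect_right


def bucket_index(length, bucket_boundaries):
    """Illustrative only: index of the bucket that holds `length`.

    Bucket 0 covers [0, b[0]), bucket i covers [b[i-1], b[i]),
    and the last bucket covers [b[n-1], inf).
    """
    return bisect_right(bucket_boundaries, length)


bucket_boundaries = [5, 10]
for length in (3, 5, 9, 10, 42):
    print(length, bucket_index(length, bucket_boundaries))
# 3 0
# 5 1
# 9 1
# 10 2
# 42 2
```

With two boundaries there are three buckets, which is why bucket_batch_sizes must contain len(bucket_boundaries)+1 elements.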

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>>
>>> # Create a dataset where rows are bucketed by length, and each bucket
>>> # is padded and batched once it is full.
>>> column_names = ["col1", "col2"]
>>> bucket_boundaries = [5, 10]
>>> bucket_batch_sizes = [5, 1, 1]
>>> element_length_function = (lambda col1, col2: max(len(col1), len(col2)))
>>>
>>> # Will pad col1 to shape [2, bucket_boundaries[i]] where i is the
>>> # index of the bucket that is currently being batched.
>>> # Will pad col2 to a shape where each dimension is the longest in all
>>> # the elements currently being batched.
>>> pad_info = {"col1": ([2, None], -1)}
>>> pad_to_bucket_boundary = True
>>>
>>> data = data.bucket_batch_by_length(column_names, bucket_boundaries,
>>>                                    bucket_batch_sizes,
>>>                                    element_length_function, pad_info,
>>>                                    pad_to_bucket_boundary)
concat(datasets)

Concatenate the datasets in the input list of datasets. The “+” operator is also supported to concatenate.

Note

The column name, and rank and type of the column data must be the same in the input datasets.

Parameters

datasets (Union[list, class Dataset]) – A list of datasets or a single class Dataset to be concatenated together with this dataset.

Returns

ConcatDataset, dataset concatenated.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # ds1 and ds2 are instances of Dataset object
>>>
>>> # Create a dataset by concatenating ds1 and ds2 with "+" operator
>>> data1 = ds1 + ds2
>>> # Create a dataset by concatenating ds1 and ds2 with concat operation
>>> data1 = ds1.concat(ds2)
create_dict_iterator(num_epochs=-1, output_numpy=False)

Create an iterator over the dataset. The data retrieved will be a dictionary.

The order of the columns in the dictionary may not be the same as the original order.

Parameters
  • num_epochs (int, optional) – Maximum number of epochs that iterator can be iterated (default=-1, iterator can be iterated infinite number of epochs).

  • output_numpy (bool, optional) – Whether or not to output NumPy datatype, if output_numpy=False, iterator will output MSTensor (default=False).

Returns

Iterator, dictionary of column name-ndarray pair.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>>
>>> # create an iterator
>>> # The order of the columns in the data obtained by the iterator might be changed.
>>> iterator = data.create_dict_iterator()
>>> for item in iterator:
>>>     # print the data in column1
>>>     print(item["column1"])
create_tuple_iterator(columns=None, num_epochs=-1, output_numpy=False)

Create an iterator over the dataset. The data retrieved will be a list of ndarrays of data.

To specify which columns to list and the order needed, use columns. If columns is not provided, the order of the columns will not be changed.

Parameters
  • columns (list[str], optional) – List of columns to be used to specify the order of columns (default=None, means all columns).

  • num_epochs (int, optional) – Maximum number of epochs that iterator can be iterated. (default=-1, iterator can be iterated infinite number of epochs)

  • output_numpy (bool, optional) – Whether or not to output NumPy datatype. If output_numpy=False, iterator will output MSTensor (default=False).

Returns

Iterator, list of ndarrays.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>>
>>> # Create an iterator
>>> # The columns in the data obtained by the iterator will not be changed.
>>> iterator = data.create_tuple_iterator()
>>> for item in iterator:
>>>     # convert the returned tuple to a list and print
>>>     print(list(item))
device_que(prefetch_size=None, send_epoch_end=True)

Return a transferred Dataset that transfers data through a device.

Parameters
  • prefetch_size (int, optional) – Prefetch number of records ahead of the user’s request (default=None).

  • send_epoch_end (bool, optional) – Whether to send end of sequence to device or not (default=True).

Note

If device is Ascend, features of data will be transferred one by one. A single transmission is limited to 256M of data.

Returns

TransferDataset, dataset for transferring.

filter(predicate, input_columns=None, num_parallel_workers=1)

Filter dataset by predicate.

Note

If input_columns not provided or empty, all columns will be used.

Parameters
  • predicate (callable) – Python callable which returns a boolean value. Rows for which it returns False are filtered out.

  • input_columns (list[str], optional) – List of names of the input columns (default=None, the predicate will be applied on all columns in the dataset).

  • num_parallel_workers (int, optional) – Number of workers to process the dataset in parallel (default=1).

Returns

FilterDataset, dataset filtered.

Examples

>>> import mindspore.dataset as ds
>>> # dataset is a Dataset with a single column "data" holding the values 0 ~ 63
>>> # keep only the rows whose value is less than 11
>>> dataset_f = dataset.filter(predicate=lambda data: data < 11, input_columns=["data"])
flat_map(func)

Map func to each row in dataset and flatten the result.

The specified func is a function that must take one ‘Ndarray’ as input and return a ‘Dataset’.

Parameters

func (function) – A function that must take one ‘Ndarray’ as an argument and return a ‘Dataset’.

Returns

Dataset, applied by the function.

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>>
>>> # Declare a function which returns a Dataset object
>>> def flat_map_func(x):
>>>     data_dir = text.to_str(x[0])
>>>     d = ds.ImageFolderDataset(data_dir)
>>>     return d
>>> # data is an instance of a Dataset object.
>>> data = ds.TextFileDataset(DATA_FILE)
>>> data = data.flat_map(flat_map_func)
Raises
  • TypeError – If func is not a function.

  • TypeError – If func doesn’t return a Dataset.
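Conceptually, flat_map maps every row to a sequence of rows and then chains those sequences together. The plain-Python sketch below shows that idea with lists standing in for datasets. The helper flat_map_rows and the file names are made up for illustration.

```python
from itertools import chain


def flat_map_rows(rows, func):
    """Illustrative only: apply func to each row and flatten the results."""
    return list(chain.from_iterable(func(row) for row in rows))


# Each input row names a directory; func returns the "dataset" for it.
rows = ["dir_a", "dir_b"]
result = flat_map_rows(rows, lambda name: [name + "/0.jpg", name + "/1.jpg"])
print(result)   # ['dir_a/0.jpg', 'dir_a/1.jpg', 'dir_b/0.jpg', 'dir_b/1.jpg']
```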

get_batch_size()

Get the size of a batch.

Returns

Number, the number of data in a batch.

get_col_names()

Get the names of the columns in the dataset.

get_dataset_size(estimate=False)[source]

Get the number of batches in an epoch.

Note

Because the TFRecord format does not save metadata, all files need to be traversed to obtain the total amount of data. Therefore, this API is slow.

Parameters

estimate (bool, optional) – Fast estimation of the dataset size instead of a full scan.

Returns

Number, number of batches.

get_repeat_count()

Get the repeat count set by RepeatDataset, or 1 if the dataset is not repeated.

Returns

Number, the count of repeat.

map(operations=None, input_columns=None, output_columns=None, column_order=None, num_parallel_workers=None, python_multiprocessing=False, cache=None, callbacks=None)

Apply each operation in operations to this dataset.

The order of operations is determined by the position of each operation in the operations parameter. operations[0] will be applied first, then operations[1], then operations[2], etc.

Each operation will be passed one or more columns from the dataset as input, and zero or more columns will be outputted. The first operation will be passed the columns specified in input_columns as input. If there is more than one operator in operations, the outputted columns of the previous operation are used as the input columns for the next operation. The columns outputted by the very last operation will be assigned names specified by output_columns.

Only the columns specified in column_order will be propagated to the child node. These columns will be in the same order as specified in column_order.

Parameters
  • operations (Union[list[TensorOp], list[functions]]) – List of operations to be applied on the dataset. Operations are applied in the order they appear in this list.

  • input_columns (list[str]) – List of the names of the columns that will be passed to the first operation as input. The size of this list must match the number of input columns expected by the first operator. (default=None, the first operation will be passed as many columns as it requires, starting from the first column).

  • output_columns (list[str], optional) – List of names assigned to the columns outputted by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. (default=None, output columns will have the same name as the input columns, i.e., the columns will be replaced).

  • column_order (list[str], optional) – List of all the desired columns to propagate to the child node. This list must be a subset of all the columns in the dataset after all operations are applied. The order of the columns in each row propagated to the child node follow the order they appear in this list. The parameter is mandatory if the len(input_columns) != len(output_columns). (default=None, all columns will be propagated to the child node, the order of the columns will remain the same).

  • num_parallel_workers (int, optional) – Number of threads used to process the dataset in parallel (default=None, the value from the configuration will be used).

  • python_multiprocessing (bool, optional) – Parallelize Python operations with multiple worker processes. This option could be beneficial if the Python operation is computationally heavy (default=False).

  • cache (DatasetCache, optional) – Tensor cache to use. (default=None which means no cache is used). The cache feature is under development and is not recommended.

  • callbacks (DSCallback, list[DSCallback], optional) – List of Dataset callbacks to be called (default=None).

Returns

MapDataset, dataset after mapping operation.

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.vision.c_transforms as c_transforms
>>>
>>> # data is an instance of Dataset which has 2 columns, "image" and "label".
>>> # ds_pyfunc is an instance of Dataset which has 3 columns, "col0", "col1", and "col2".
>>> # Each column is a 2D array of integers.
>>>
>>> # Set the global configuration value for num_parallel_workers to be 2.
>>> # Operations which use this configuration value will use 2 worker threads,
>>> # unless otherwise specified in the operator's constructor.
>>> # set_num_parallel_workers can be called again later if a different
>>> # global configuration value for the number of worker threads is desired.
>>> ds.config.set_num_parallel_workers(2)
>>>
>>> # Define two operations, where each operation accepts 1 input column and outputs 1 column.
>>> decode_op = c_transforms.Decode(rgb_format=True)
>>> random_jitter_op = c_transforms.RandomColorAdjust((0.8, 0.8), (1, 1), (1, 1), (0, 0))
>>>
>>> # 1) Simple map example
>>>
>>> operations = [decode_op]
>>> input_columns = ["image"]
>>>
>>> # Apply decode_op on column "image". This column will be replaced by the outputted
>>> # column of decode_op. Since column_order is not provided, both columns "image"
>>> # and "label" will be propagated to the child node in their original order.
>>> ds_decoded = data.map(operations, input_columns)
>>>
>>> # Rename column "image" to "decoded_image".
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(operations, input_columns, output_columns)
>>>
>>> # Specify the order of the columns.
>>> column_order = ["label", "image"]
>>> ds_decoded = data.map(operations, input_columns, None, column_order)
>>>
>>> # Rename column "image" to "decoded_image" and also specify the order of the columns.
>>> column_order = ["label", "decoded_image"]
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(operations, input_columns, output_columns, column_order)
>>>
>>> # Rename column "image" to "decoded_image" and keep only this column.
>>> column_order = ["decoded_image"]
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(operations, input_columns, output_columns, column_order)
>>>
>>> # A simple example using pyfunc: Renaming columns and specifying column order
>>> # work in the same way as the previous examples.
>>> input_columns = ["col0"]
>>> operations = [(lambda x: x + 1)]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns)
>>>
>>> # 2) Map example with more than one operation
>>>
>>> # If this list of operations is used with map, decode_op will be applied
>>> # first, then random_jitter_op will be applied.
>>> operations = [decode_op, random_jitter_op]
>>>
>>> input_columns = ["image"]
>>>
>>> # Create a dataset where the images are decoded, then randomly color jittered.
>>> # decode_op takes column "image" as input and outputs one column. The column
>>> # outputted by decode_op is passed as input to random_jitter_op.
>>> # random_jitter_op will output one column. Column "image" will be replaced by
>>> # the column outputted by random_jitter_op (the very last operation). All other
>>> # columns are unchanged. Since column_order is not specified, the order of the
>>> # columns will remain the same.
>>> ds_mapped = data.map(operations, input_columns)
>>>
>>> # Create a dataset that is identical to ds_mapped, except the column "image"
>>> # that is outputted by random_jitter_op is renamed to "image_transformed".
>>> # Specifying column order works in the same way as examples in 1).
>>> output_columns = ["image_transformed"]
>>> ds_mapped_and_renamed = data.map(operations, input_columns, output_columns)
>>>
>>> # Multiple operations using pyfunc: Renaming columns and specifying column order
>>> # work in the same way as examples in 1).
>>> input_columns = ["col0"]
>>> operations = [(lambda x: x + x), (lambda x: x - 1)]
>>> output_columns = ["col0_mapped"]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns, output_columns)
>>>
>>> # 3) Example where number of input columns is not equal to number of output columns
>>>
>>> # operations[0] is a lambda that takes 2 columns as input and outputs 3 columns.
>>> # operations[1] is a lambda that takes 3 columns as input and outputs 1 column.
>>> # operations[2] is a lambda that takes 1 column as input and outputs 4 columns.
>>> #
>>> # Note: The number of output columns of operations[i] must equal the number of
>>> # input columns of operations[i+1]. Otherwise, this map call will result
>>> # in an error.
>>> operations = [(lambda x, y: (x, x + y, x + y + 1)),
>>>               (lambda x, y, z: x * y * z),
>>>               (lambda x: (x % 2, x % 3, x % 5, x % 7))]
>>>
>>> # Note: Since the number of input columns is not the same as the number of
>>> # output columns, the output_columns and column_order parameters must be
>>> # specified. Otherwise, this map call will also result in an error.
>>> input_columns = ["col2", "col0"]
>>> output_columns = ["mod2", "mod3", "mod5", "mod7"]
>>>
>>> # Propagate all columns to the child node in this order:
>>> column_order = ["col0", "col2", "mod2", "mod3", "mod5", "mod7", "col1"]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns, output_columns, column_order)
>>>
>>> # Propagate some columns to the child node in this order:
>>> column_order = ["mod7", "mod3", "col1"]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns, output_columns, column_order)
num_classes()

Get the number of classes in a dataset.

Returns

Number, number of classes.

output_shapes()

Get the shapes of output data.

Returns

List, list of shapes of each column.

output_types()

Get the types of output data.

Returns

List of data types.

project(columns)

Project certain columns in input dataset.

The specified columns will be selected from the dataset and passed down the pipeline in the order specified. The other columns are discarded.

Parameters

columns (list[str]) – List of names of the columns to project.

Returns

ProjectDataset, dataset projected.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>> columns_to_project = ["column3", "column1", "column2"]
>>>
>>> # Create a dataset that consists of column3, column1, column2
>>> # in that order, regardless of the original order of columns.
>>> data = data.project(columns=columns_to_project)
rename(input_columns, output_columns)

Rename the columns in input datasets.

Parameters
  • input_columns (list[str]) – List of names of the input columns.

  • output_columns (list[str]) – List of names of the output columns.

Returns

RenameDataset, dataset renamed.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> input_columns = ["input_col1", "input_col2", "input_col3"]
>>> output_columns = ["output_col1", "output_col2", "output_col3"]
>>>
>>> # Create a dataset where input_col1 is renamed to output_col1, and
>>> # input_col2 is renamed to output_col2, and input_col3 is renamed
>>> # to output_col3.
>>> data = data.rename(input_columns=input_columns, output_columns=output_columns)
repeat(count=None)

Repeat this dataset count times. Repeat indefinitely if the count is None or -1.

Note

The order of repeat and batch affects the number of batches. It is recommended to apply the repeat operation after the batch operation. If dataset_sink_mode is False, the repeat operation is invalid. If dataset_sink_mode is True, the repeat count must equal the number of training epochs. Otherwise, errors could occur because the amount of data does not match the amount that training requires.
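The effect of ordering on the batch count can be sketched in plain Python, without MindSpore. This is only an illustration: num_batches is a hypothetical helper, and it assumes that when repeat precedes batch the rows flow continuously across repetitions, so a batch may span two repetitions.

```python
import math


def num_batches(num_rows, batch_size, drop_remainder=False):
    """Number of batches produced from num_rows rows (hypothetical helper)."""
    if drop_remainder:
        return num_rows // batch_size
    return math.ceil(num_rows / batch_size)


rows, batch_size, count = 10, 4, 2

# batch, then repeat: each repetition yields ceil(10 / 4) = 3 batches.
batch_then_repeat = num_batches(rows, batch_size) * count

# repeat, then batch: 20 rows are batched as one stream (assumption).
repeat_then_batch = num_batches(rows * count, batch_size)

print(batch_then_repeat, repeat_then_batch)
```

With these numbers the two orderings give different batch counts, which is why the note recommends a fixed order.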

Parameters

count (int) – Number of times the dataset is repeated (default=None).

Returns

RepeatDataset, dataset repeated.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>>
>>> # Create a dataset where the dataset is repeated for 50 epochs
>>> repeated = data.repeat(50)
>>>
>>> # Create a dataset where each epoch is shuffled individually
>>> shuffled_and_repeated = data.shuffle(10)
>>> shuffled_and_repeated = shuffled_and_repeated.repeat(50)
>>>
>>> # Create a dataset where the dataset is first repeated for
>>> # 50 epochs before shuffling. The shuffle operator will treat
>>> # the entire 50 epochs as one big dataset.
>>> repeat_and_shuffle = data.repeat(50)
>>> repeat_and_shuffle = repeat_and_shuffle.shuffle(10)
reset()

Reset the dataset for next epoch.

save(file_name, num_files=1, file_type='mindrecord')

Save the dynamic data processed by the dataset pipeline in a common dataset format. Supported dataset formats: ‘mindrecord’ only.

Implicit type casting exists when saving data as ‘mindrecord’. The table below shows how to do type casting.

Implicit Type Casting when Saving as ‘mindrecord’

Type in ‘dataset’    Type in ‘mindrecord’    Details
bool                 None                    Not supported
int8                 int32
uint8                bytes(1D uint8)         Drop dimension
int16                int32
uint16               int32
int32                int32
uint32               int64
int64                int64
uint64               None                    Not supported
float16              float32
float32              float32
float64              float64
string               string                  Multi-dimensional string not supported

Note

  1. To save the samples in order, set dataset’s shuffle to False and num_files to 1.

  2. Before calling the function, do not use batch operator, repeat operator or data augmentation operators with random attribute in map operator.

  3. Mindrecord does not support DE_UINT64, multi-dimensional DE_UINT8(drop dimension) nor multi-dimensional DE_STRING.

Parameters
  • file_name (str) – Path to dataset file.

  • num_files (int, optional) – Number of dataset files (default=1).

  • file_type (str, optional) – Dataset format (default=’mindrecord’).

shuffle(buffer_size)

Randomly shuffles the rows of this dataset using the following algorithm:

  1. Make a shuffle buffer that contains the first buffer_size rows.

  2. Randomly select an element from the shuffle buffer to be the next row propagated to the child node.

  3. Get the next row (if any) from the parent node and put it in the shuffle buffer.

  4. Repeat steps 2 and 3 until there are no more rows left in the shuffle buffer.

A seed can be provided to be used on the first epoch. In every subsequent epoch, the seed is changed to a new, randomly generated value.
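The four steps above can be sketched in plain Python. This is a minimal illustration of the buffer algorithm as described, not the library's implementation; buffered_shuffle and its seed argument are hypothetical names.

```python
import random
from itertools import islice


def buffered_shuffle(rows, buffer_size, seed=None):
    """Yield rows in shuffled order using a fixed-size shuffle buffer."""
    rng = random.Random(seed)
    source = iter(rows)
    # Step 1: fill the buffer with the first buffer_size rows.
    buffer = list(islice(source, buffer_size))
    # Step 4: loop until the buffer is empty.
    while buffer:
        # Step 2: pick a random element from the buffer as the next row.
        index = rng.randrange(len(buffer))
        chosen = buffer[index]
        # Step 3: refill the freed slot with the next source row, if any.
        try:
            buffer[index] = next(source)
        except StopIteration:
            buffer.pop(index)
        yield chosen


shuffled = list(buffered_shuffle(range(10), buffer_size=4, seed=58))
print(shuffled)
```

Because a row can only be emitted once it has entered the buffer, a small buffer_size keeps rows close to their original position, while a buffer as large as the dataset gives a global shuffle.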

Parameters

buffer_size (int) – The size of the buffer (must be larger than 1) for shuffling. Setting buffer_size equal to the number of rows in the entire dataset will result in a global shuffle.

Returns

ShuffleDataset, dataset shuffled.

Raises

RuntimeError – If sync operators exist before shuffle.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> # Optionally set the seed for the first epoch
>>> ds.config.set_seed(58)
>>>
>>> # Create a shuffled dataset using a shuffle buffer of size 4
>>> data = data.shuffle(4)
skip(count)

Skip the first count elements of this dataset.

Parameters

count (int) – Number of elements in the dataset to be skipped.

Returns

SkipDataset, dataset skipped.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> # Create a dataset which skips first 3 elements from data
>>> data = data.skip(3)
split(sizes, randomize=True)

Split the dataset into smaller, non-overlapping datasets.

This is a general purpose split function which can be called from any operator in the pipeline. There is another, optimized split function, which will be called automatically if ds.split is called where ds is a MappableDataset.

Parameters
  • sizes (Union[list[int], list[float]]) –

    If a list of integers [s1, s2, …, sn] is provided, the dataset will be split into n datasets of size s1, size s2, …, size sn respectively. If the sum of all sizes does not equal the original dataset size, an error will occur. If a list of floats [f1, f2, …, fn] is provided, all floats must be between 0 and 1 and must sum to 1, otherwise an error will occur. The dataset will be split into n Datasets of size round(f1*K), round(f2*K), …, round(fn*K) where K is the size of the original dataset. If after rounding:

    • Any size equals 0, an error will occur.

    • The sum of split sizes < K, the difference will be added to the first split.

    • The sum of split sizes > K, the difference will be removed from the first large enough split such that it will have at least 1 row after removing the difference.

  • randomize (bool, optional) – Determines whether or not to split the data randomly (default=True). If True, the data will be randomly split. Otherwise, each split will be created with consecutive rows from the dataset.
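The rounding rules for a list of floats can be sketched as follows. This is an illustration under stated assumptions, not the library's code: absolute_split_sizes is a hypothetical helper, it uses Python's built-in round (which rounds ties to even, and may differ from the rounding MindSpore applies), and it does not validate that the floats sum to 1.

```python
def absolute_split_sizes(fractions, dataset_size):
    """Turn fractional split sizes into row counts (hypothetical helper)."""
    sizes = [int(round(f * dataset_size)) for f in fractions]
    if any(size == 0 for size in sizes):
        raise RuntimeError("a split would have size 0 after rounding")
    difference = dataset_size - sum(sizes)
    if difference > 0:
        # Too few rows assigned: give the missing rows to the first split.
        sizes[0] += difference
    elif difference < 0:
        # Too many rows assigned: shrink the first split that still keeps
        # at least 1 row after the surplus is removed.
        for i, size in enumerate(sizes):
            if size + difference > 0:
                sizes[i] += difference
                break
    return sizes


print(absolute_split_sizes([0.9, 0.1], 100))
print(absolute_split_sizes([0.34, 0.33, 0.33], 10))
print(absolute_split_sizes([0.06, 0.47, 0.47], 10))
```

In the last call the first split rounds to a single row, so the surplus row is taken from the second split instead.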

Note

  1. Dataset cannot be sharded if split is going to be called.

  2. It is strongly recommended to not shuffle the dataset, but use randomize=True instead. Shuffling the dataset may not be deterministic, which means the data in each split will be different in each epoch.

Raises
  • RuntimeError – If get_dataset_size returns None or is not supported for this dataset.

  • RuntimeError – If sizes is list of integers and sum of all elements in sizes does not equal the dataset size.

  • RuntimeError – If sizes is list of float and there is a split with size 0 after calculations.

  • RuntimeError – If the dataset is sharded prior to calling split.

  • ValueError – If sizes is list of float and not all floats are between 0 and 1, or if the floats don’t sum to 1.

Returns

tuple(Dataset), a tuple of datasets that have been split.

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_files = "/path/to/text_file/*"
>>>
>>> # TextFileDataset is not a mappable dataset, so this non-optimized split will be called.
>>> # Since many datasets have shuffle on by default, set shuffle to False if split will be called!
>>> data = ds.TextFileDataset(dataset_files, shuffle=False)
>>> train, test = data.split([0.9, 0.1])
sync_update(condition_name, num_batch=None, data=None)

Release a blocking condition and trigger callback with given data.

Parameters
  • condition_name (str) – The condition name that is used to toggle sending next row.

  • num_batch (Union[int, None]) – The number of batches (rows) that are released. When num_batch is None, it will default to the number specified by the sync_wait operator (default=None).

  • data (Union[dict, None]) – The data passed to the callback (default=None).

sync_wait(condition_name, num_batch=1, callback=None)

Add a blocking condition to the input Dataset.

Parameters
  • condition_name (str) – The condition name that is used to toggle sending next row.

  • num_batch (int) – The number of batches without blocking at the start of each epoch.

  • callback (function) – The callback function that will be invoked when sync_update is called.

Raises

RuntimeError – If condition name already exists.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> data = data.sync_wait("callback1")
>>> data = data.batch(batch_size)
>>> for batch_data in data.create_dict_iterator():
>>>     data.sync_update("callback1")
take(count=-1)

Take at most the given number of elements from the dataset.

Note

  1. If count is greater than the number of elements in the dataset or equal to -1, all the elements in dataset will be taken.

  2. The order of using take and batch matters. If take is before batch operation, then take given number of rows; otherwise take given number of batches.
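The second point can be illustrated with plain Python lists standing in for a dataset. The batch and take functions below are hypothetical stand-ins that only mimic the row and batch counting described in this note.

```python
def batch(rows, batch_size):
    """Group consecutive rows into lists of batch_size (last may be short)."""
    rows = list(rows)
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]


def take(items, count=-1):
    """Keep at most count items; -1 keeps everything."""
    items = list(items)
    return items if count == -1 else items[:count]


rows = list(range(10))

# take before batch: 4 rows are taken, then grouped into 2 batches.
take_then_batch = batch(take(rows, 4), 2)

# take after batch: 4 batches are taken, covering 8 rows.
batch_then_take = take(batch(rows, 2), 4)

print(take_then_batch)
print(batch_then_take)
```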

Parameters

count (int, optional) – Number of elements to be taken from the dataset (default=-1).

Returns

TakeDataset, dataset taken.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> # Create a dataset where the dataset includes 50 elements.
>>> data = data.take(50)
to_device(send_epoch_end=True)

Transfer data through CPU, GPU or Ascend devices.

Parameters

send_epoch_end (bool, optional) – Whether to send end of sequence to device or not (default=True).

Note

If the device is Ascend, features of data will be transferred one by one. Each transfer is limited to 256M of data.

Returns

TransferDataset, dataset for transferring.

Raises
  • TypeError – If device_type is empty.

  • ValueError – If device_type is not ‘Ascend’, ‘GPU’ or ‘CPU’.

  • RuntimeError – If dataset is unknown.

  • RuntimeError – If distribution file path is given but failed to read.

zip(datasets)

Zip the datasets in the input tuple of datasets. Columns in the input datasets must not have the same name.
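A rough sketch of the zip semantics, using lists of dictionaries in place of datasets. zip_rows is a hypothetical helper, and the sketch assumes that the zipped result ends when the shortest input ends; that assumption is not stated in this reference.

```python
def zip_rows(*datasets):
    """Merge row dictionaries position by position (hypothetical helper)."""
    merged = []
    for rows in zip(*datasets):
        combined = {}
        for row in rows:
            clash = combined.keys() & row.keys()
            if clash:
                # Column names must be unique across the zipped datasets.
                raise ValueError(f"duplicate column names: {sorted(clash)}")
            combined.update(row)
        merged.append(combined)
    return merged


ds1 = [{"image": "a"}, {"image": "b"}, {"image": "c"}]
ds2 = [{"label": 0}, {"label": 1}]
zipped = zip_rows(ds1, ds2)
print(zipped)
```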

Parameters

datasets (Union[tuple, class Dataset]) – A tuple of datasets or a single class Dataset to be zipped together with this dataset.

Returns

ZipDataset, dataset zipped.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # ds1 and ds2 are instances of Dataset object
>>> # Create a dataset which is the combination of ds1 and ds2
>>> data = ds1.zip(ds2)
class mindspore.dataset.TextFileDataset(dataset_files, num_samples=None, num_parallel_workers=None, shuffle=Shuffle.GLOBAL, num_shards=None, shard_id=None)[source]

A source dataset that reads and parses datasets stored on disk in text format. The generated dataset has one column [‘text’].

Parameters
  • dataset_files (Union[str, list[str]]) – String or list of files to be read or glob strings to search for a pattern of files. The list will be sorted in a lexicographical order.

  • num_samples (int, optional) – Number of samples (rows) to read (default=None, reads the full dataset).

  • num_parallel_workers (int, optional) – Number of workers to read the data (default=None, number set in the config).

  • shuffle (Union[bool, Shuffle level], optional) –

    Perform reshuffling of the data every epoch (default=Shuffle.GLOBAL). If shuffle is False, no shuffling will be performed. If shuffle is True, the behavior is the same as setting shuffle to Shuffle.GLOBAL. Otherwise, there are two levels of shuffling:

    • Shuffle.GLOBAL: Shuffle both the files and samples.

    • Shuffle.FILES: Shuffle files only.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into (default=None).

  • shard_id (int, optional) – The shard ID within num_shards (default=None). This argument can only be specified when num_shards is also specified.

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_files = ["/path/to/1", "/path/to/2"] # contains 1 or multiple text files
>>> dataset = ds.TextFileDataset(dataset_files=dataset_files)
apply(apply_func)

Apply a function in this dataset.

Parameters

apply_func (function) – A function that must take one ‘Dataset’ as an argument and return a preprocessed ‘Dataset’.

Returns

Dataset, applied by the function.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>>
>>> # Declare an apply_func function which returns a Dataset object
>>> def apply_func(ds):
>>>     ds = ds.batch(2)
>>>     return ds
>>>
>>> # Use apply to call apply_func
>>> data = data.apply(apply_func)
Raises
  • TypeError – If apply_func is not a function.

  • TypeError – If apply_func doesn’t return a Dataset.

batch(batch_size, drop_remainder=False, num_parallel_workers=None, per_batch_map=None, input_columns=None, output_columns=None, column_order=None, pad_info=None)

Combine batch_size number of consecutive rows into batches.

For any child node, a batch is treated as a single row. For any column, all the elements within that column must have the same shape. If a per_batch_map callable is provided, it will be applied to the batches of tensors.

Note

The order of repeat and batch affects the number of batches and the behavior of per_batch_map. It is recommended to apply the repeat operation after the batch operation.

Parameters
  • batch_size (int or function) – The number of rows each batch is created with. An int or callable which takes exactly 1 parameter, BatchInfo.

  • drop_remainder (bool, optional) – Determines whether or not to drop the last possibly incomplete batch (default=False). If True, and if there are fewer than batch_size rows available to make the last batch, then those rows will be dropped and not propagated to the child node.

  • num_parallel_workers (int, optional) – Number of workers to process the dataset in parallel (default=None).

  • per_batch_map (callable, optional) – Per batch map callable. A callable which takes (list[Tensor], list[Tensor], …, BatchInfo) as input parameters. Each list[Tensor] represents a batch of Tensors on a given column. The number of lists should match with number of entries in input_columns. The last parameter of the callable should always be a BatchInfo object.

  • input_columns (list[str], optional) – List of names of the input columns. The size of the list should match with signature of the per_batch_map callable.

  • output_columns (list[str], optional) – [Not currently implemented] List of names assigned to the columns outputted by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. (default=None, output columns will have the same name as the input columns, i.e., the columns will be replaced).

  • column_order (list[str], optional) – [Not currently implemented] List of all the desired columns to propagate to the child node. This list must be a subset of all the columns in the dataset after all operations are applied. The order of the columns in each row propagated to the child node follow the order they appear in this list. The parameter is mandatory if the len(input_columns) != len(output_columns). (default=None, all columns will be propagated to the child node, the order of the columns will remain the same).

  • pad_info (dict, optional) – Whether to perform padding on selected columns. pad_info={“col1”:([224,224],0)} would pad column with name “col1” to a tensor of size [224,224] and fill the missing with 0.
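The padding described for pad_info can be sketched for the one-dimensional case. This is a simplified illustration with Python lists instead of tensors; pad_batch is a hypothetical helper, and only 1-D target shapes such as [4] are handled.

```python
def pad_batch(rows, pad_info):
    """Batch rows and pad selected 1-D columns (hypothetical helper).

    rows is a list of dicts mapping column name to a list of values.
    pad_info maps a column name to ([target_length], fill_value).
    """
    batch = {}
    for name in rows[0]:
        column = [list(row[name]) for row in rows]
        if name in pad_info:
            (length,), fill = pad_info[name]
            # Append the fill value until each entry reaches the target length.
            column = [values + [fill] * (length - len(values)) for values in column]
        batch[name] = column
    return batch


rows = [{"col1": [1, 2], "col2": [7]}, {"col1": [3], "col2": [8]}]
padded = pad_batch(rows, {"col1": ([4], 0)})
print(padded)
```

Columns that are not listed in pad_info are batched unchanged in this sketch.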

Returns

BatchDataset, dataset batched.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>>
>>> # Create a dataset where every 100 rows is combined into a batch
>>> # and drops the last incomplete batch if there is one.
>>> data = data.batch(100, True)
bucket_batch_by_length(column_names, bucket_boundaries, bucket_batch_sizes, element_length_function=None, pad_info=None, pad_to_bucket_boundary=False, drop_remainder=False)

Bucket elements according to their lengths. Each bucket will be padded and batched when they are full.

A length function is called on each row in the dataset. The row is then bucketed based on its length and bucket_boundaries. When a bucket reaches its corresponding size specified in bucket_batch_sizes, the entire bucket will be padded according to pad_info, and then batched. Each batch will be full, except possibly the last batch for each bucket.

Parameters
  • column_names (list[str]) – Columns passed to element_length_function.

  • bucket_boundaries (list[int]) – A list consisting of the upper boundaries of the buckets. Must be strictly increasing. If there are n boundaries, n+1 buckets are created: One bucket for [0, bucket_boundaries[0]), one bucket for [bucket_boundaries[i], bucket_boundaries[i+1]) for each 0<=i<n-1, and one bucket for [bucket_boundaries[n-1], inf).

  • bucket_batch_sizes (list[int]) – A list consisting of the batch sizes for each bucket. Must contain len(bucket_boundaries)+1 elements.

  • element_length_function (Callable, optional) – A function that takes in len(column_names) arguments and returns an int. If no value is provided, then len(column_names) must be 1, and the size of the first dimension of that column will be taken as the length (default=None).

  • pad_info (dict, optional) – Represents how to batch each column. The key corresponds to the column name, and the value must be a tuple of 2 elements. The first element corresponds to the shape to pad to, and the second element corresponds to the value to pad with. If a column is not specified, then that column will be padded to the longest in the current batch, and 0 will be used as the padding value. Any None dimensions will be padded to the longest in the current batch, unless if pad_to_bucket_boundary is True. If no padding is wanted, set pad_info to None (default=None).

  • pad_to_bucket_boundary (bool, optional) – If True, will pad each None dimension in pad_info to the bucket_boundary minus 1. If there are any elements that fall into the last bucket, an error will occur (default=False).

  • drop_remainder (bool, optional) – If True, will drop the last batch for each bucket if it is not a full batch (default=False).
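How lengths map to buckets, and when a bucket is emitted as a batch, can be sketched in plain Python. This is an illustration only: bucket_index and bucket_batches are hypothetical helpers, rows are represented by their lengths, padding is omitted, and leftover partial buckets are emitted at the end as with drop_remainder=False.

```python
from bisect import bisect_right


def bucket_index(length, bucket_boundaries):
    """Index of the bucket [lower, upper) that contains length."""
    return bisect_right(bucket_boundaries, length)


def bucket_batches(lengths, bucket_boundaries, bucket_batch_sizes):
    """Group lengths into batches per bucket (hypothetical helper)."""
    buckets = [[] for _ in bucket_batch_sizes]
    batches = []
    for length in lengths:
        i = bucket_index(length, bucket_boundaries)
        buckets[i].append(length)
        # A bucket is emitted as a batch as soon as it is full.
        if len(buckets[i]) == bucket_batch_sizes[i]:
            batches.append(buckets[i])
            buckets[i] = []
    # Emit the remaining partial batches.
    batches.extend(bucket for bucket in buckets if bucket)
    return batches


batches = bucket_batches([1, 7, 3, 12, 4, 2, 6, 0], [5, 10], [3, 2, 1])
print(batches)
```

With boundaries [5, 10], lengths 0 to 4 fall in the first bucket, 5 to 9 in the second, and 10 or more in the last.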

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>>
>>> # Create a dataset where rows are bucketed by length, and each bucket
>>> # is padded and batched when it is full.
>>> column_names = ["col1", "col2"]
>>> bucket_boundaries = [5, 10]
>>> bucket_batch_sizes = [5, 1, 1]
>>> element_length_function = (lambda col1, col2: max(len(col1), len(col2)))
>>>
>>> # Will pad col1 to shape [2, bucket_boundaries[i]] where i is the
>>> # index of the bucket that is currently being batched.
>>> # Will pad col2 to a shape where each dimension is the longest in all
>>> # the elements currently being batched.
>>> pad_info = {"col1": ([2, None], -1)}
>>> pad_to_bucket_boundary = True
>>>
>>> data = data.bucket_batch_by_length(column_names, bucket_boundaries,
>>>                                    bucket_batch_sizes,
>>>                                    element_length_function, pad_info,
>>>                                    pad_to_bucket_boundary)
concat(datasets)

Concatenate the datasets in the input list of datasets. The “+” operator is also supported to concatenate.

Note

The column name, and rank and type of the column data must be the same in the input datasets.

Parameters

datasets (Union[list, class Dataset]) – A list of datasets or a single class Dataset to be concatenated together with this dataset.

Returns

ConcatDataset, dataset concatenated.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # ds1 and ds2 are instances of Dataset object
>>>
>>> # Create a dataset by concatenating ds1 and ds2 with "+" operator
>>> data1 = ds1 + ds2
>>> # Create a dataset by concatenating ds1 and ds2 with concat operation
>>> data1 = ds1.concat(ds2)
create_dict_iterator(num_epochs=-1, output_numpy=False)

Create an iterator over the dataset. The data retrieved will be a dictionary.

The order of the columns in the dictionary may not be the same as the original order.

Parameters
  • num_epochs (int, optional) – Maximum number of epochs that the iterator can be iterated (default=-1, the iterator can be iterated for an infinite number of epochs).

  • output_numpy (bool, optional) – Whether or not to output NumPy datatype, if output_numpy=False, iterator will output MSTensor (default=False).

Returns

Iterator, dictionary of column name-ndarray pair.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>>
>>> # create an iterator
>>> # The columns in the data obtained by the iterator might be changed.
>>> iterator = data.create_dict_iterator()
>>> for item in iterator:
>>>     # print the data in column1
>>>     print(item["column1"])
create_tuple_iterator(columns=None, num_epochs=-1, output_numpy=False)

Create an iterator over the dataset. The data retrieved will be a list of ndarrays.

To specify which columns to list and the order needed, use columns. If columns is not provided, the order of the columns will not be changed.

Parameters
  • columns (list[str], optional) – List of columns to be used to specify the order of columns (default=None, means all columns).

  • num_epochs (int, optional) – Maximum number of epochs that the iterator can be iterated (default=-1, the iterator can be iterated for an infinite number of epochs).

  • output_numpy (bool, optional) – Whether or not to output NumPy datatype. If output_numpy=False, iterator will output MSTensor (default=False).

Returns

Iterator, list of ndarrays.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>>
>>> # Create an iterator
>>> # The columns in the data obtained by the iterator will not be changed.
>>> iterator = data.create_tuple_iterator()
>>> for item in iterator:
>>>     # convert the returned tuple to a list and print
>>>     print(list(item))
device_que(prefetch_size=None, send_epoch_end=True)

Return a transferred Dataset that transfers data through a device.

Parameters
  • prefetch_size (int, optional) – Prefetch number of records ahead of the user’s request (default=None).

  • send_epoch_end (bool, optional) – Whether to send end of sequence to device or not (default=True).

Note

If the device is Ascend, features of data will be transferred one by one. Each transfer is limited to 256M of data.

Returns

TransferDataset, dataset for transferring.

filter(predicate, input_columns=None, num_parallel_workers=1)

Filter dataset by predicate.

Note

If input_columns is not provided or is empty, all columns will be used.

Parameters
  • predicate (callable) – Python callable which returns a boolean value. Rows for which it returns False are filtered out.

  • input_columns (list[str], optional) – List of names of the input columns (default=None, the predicate will be applied on all columns in the dataset).

  • num_parallel_workers (int, optional) – Number of workers to process the dataset in parallel (default=1).

Returns

FilterDataset, dataset filter.

Examples

>>> import mindspore.dataset as ds
>>> # dataset is a Dataset whose column "data" holds the values 0 to 63.
>>> # Keep only the rows whose value is less than 11.
>>> dataset_f = dataset.filter(predicate=lambda data: data < 11, input_columns=["data"])
flat_map(func)

Map func to each row in dataset and flatten the result.

The specified func is a function that must take one ‘Ndarray’ as input and return a ‘Dataset’.

Parameters

func (function) – A function that must take one ‘Ndarray’ as an argument and return a ‘Dataset’.

Returns

Dataset, applied by the function.

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>>
>>> # Declare a function which returns a Dataset object
>>> def flat_map_func(x):
>>>     data_dir = text.to_str(x[0])
>>>     d = ds.ImageFolderDataset(data_dir)
>>>     return d
>>> # data is an instance of a Dataset object.
>>> data = ds.TextFileDataset(DATA_FILE)
>>> data = data.flat_map(flat_map_func)
Raises
  • TypeError – If func is not a function.

  • TypeError – If func doesn’t return a Dataset.

get_batch_size()

Get the size of a batch.

Returns

Number, the number of data in a batch.

get_col_names()

Get the names of the columns in the dataset.

get_dataset_size()[source]

Get the number of batches in an epoch.

Returns

Number, number of batches.

get_repeat_count()

Get the repeat count of a RepeatDataset. For any other dataset, 1 is returned.

Returns

Number, the count of repeat.

map(operations=None, input_columns=None, output_columns=None, column_order=None, num_parallel_workers=None, python_multiprocessing=False, cache=None, callbacks=None)

Apply each operation in operations to this dataset.

The order of operations is determined by the position of each operation in the operations parameter. operations[0] will be applied first, then operations[1], then operations[2], etc.

Each operation will be passed one or more columns from the dataset as input, and zero or more columns will be outputted. The first operation will be passed the columns specified in input_columns as input. If there is more than one operator in operations, the outputted columns of the previous operation are used as the input columns for the next operation. The columns outputted by the very last operation will be assigned names specified by output_columns.

Only the columns specified in column_order will be propagated to the child node. These columns will be in the same order as specified in column_order.
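The chaining of operations described above can be sketched without MindSpore. This is only an illustration of how the outputs of one operation feed the next and how output_columns names the final outputs; apply_operations is a hypothetical helper, and the propagation of untouched columns and column_order are not modelled.

```python
def apply_operations(operations, inputs):
    """Feed inputs through operations in order (hypothetical helper)."""
    values = tuple(inputs)
    for op in operations:
        result = op(*values)
        # An operation may return one value or a tuple of values (columns).
        values = result if isinstance(result, tuple) else (result,)
    return values


operations = [
    lambda x, y: (x, x + y, x + y + 1),       # 2 columns in, 3 columns out
    lambda x, y, z: x * y * z,                # 3 columns in, 1 column out
    lambda x: (x % 2, x % 3, x % 5, x % 7),   # 1 column in, 4 columns out
]

row = {"col0": 1, "col1": 2, "col2": 3}
input_columns = ["col2", "col0"]
output_columns = ["mod2", "mod3", "mod5", "mod7"]

outputs = apply_operations(operations, [row[name] for name in input_columns])
named = dict(zip(output_columns, outputs))
print(named)
```

If the number of values returned by one operation does not match the number of arguments of the next, the call fails, which mirrors the requirement on consecutive operations.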

Parameters
  • operations (Union[list[TensorOp], list[functions]]) – List of operations to be applied on the dataset. Operations are applied in the order they appear in this list.

  • input_columns (list[str]) – List of the names of the columns that will be passed to the first operation as input. The size of this list must match the number of input columns expected by the first operator. (default=None, the first operation will be passed as many columns as it requires, starting from the first column).

  • output_columns (list[str], optional) – List of names assigned to the columns outputted by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. (default=None, output columns will have the same name as the input columns, i.e., the columns will be replaced).

  • column_order (list[str], optional) – List of all the desired columns to propagate to the child node. This list must be a subset of all the columns in the dataset after all operations are applied. The order of the columns in each row propagated to the child node follow the order they appear in this list. The parameter is mandatory if the len(input_columns) != len(output_columns). (default=None, all columns will be propagated to the child node, the order of the columns will remain the same).

  • num_parallel_workers (int, optional) – Number of threads used to process the dataset in parallel (default=None, the value from the configuration will be used).

  • python_multiprocessing (bool, optional) – Parallelize Python operations with multiple worker processes. This option could be beneficial if the Python operation is computationally heavy (default=False).

  • cache (DatasetCache, optional) – Tensor cache to use. (default=None which means no cache is used). The cache feature is under development and is not recommended.

  • callbacks (DSCallback, list[DSCallback], optional) – List of Dataset callbacks to be called (default=None).

Returns

MapDataset, dataset after mapping operation.

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.vision.c_transforms as c_transforms
>>>
>>> # data is an instance of Dataset which has 2 columns, "image" and "label".
>>> # ds_pyfunc is an instance of Dataset which has 3 columns, "col0", "col1", and "col2".
>>> # Each column is a 2D array of integers.
>>>
>>> # Set the global configuration value for num_parallel_workers to be 2.
>>> # Operations which use this configuration value will use 2 worker threads,
>>> # unless otherwise specified in the operator's constructor.
>>> # set_num_parallel_workers can be called again later if a different
>>> # global configuration value for the number of worker threads is desired.
>>> ds.config.set_num_parallel_workers(2)
>>>
>>> # Define two operations, where each operation accepts 1 input column and outputs 1 column.
>>> decode_op = c_transforms.Decode(rgb_format=True)
>>> random_jitter_op = c_transforms.RandomColorAdjust((0.8, 0.8), (1, 1), (1, 1), (0, 0))
>>>
>>> # 1) Simple map example
>>>
>>> operations = [decode_op]
>>> input_columns = ["image"]
>>>
>>> # Apply decode_op on column "image". This column will be replaced by the outputted
>>> # column of decode_op. Since column_order is not provided, both columns "image"
>>> # and "label" will be propagated to the child node in their original order.
>>> ds_decoded = data.map(operations, input_columns)
>>>
>>> # Rename column "image" to "decoded_image".
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(operations, input_columns, output_columns)
>>>
>>> # Specify the order of the columns.
>>> column_order = ["label", "image"]
>>> ds_decoded = data.map(operations, input_columns, None, column_order)
>>>
>>> # Rename column "image" to "decoded_image" and also specify the order of the columns.
>>> column_order = ["label", "decoded_image"]
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(operations, input_columns, output_columns, column_order)
>>>
>>> # Rename column "image" to "decoded_image" and keep only this column.
>>> column_order = ["decoded_image"]
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(operations, input_columns, output_columns, column_order)
>>>
>>> # A simple example using pyfunc: Renaming columns and specifying column order
>>> # work in the same way as the previous examples.
>>> input_columns = ["col0"]
>>> operations = [(lambda x: x + 1)]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns)
>>>
>>> # 2) Map example with more than one operation
>>>
>>> # If this list of operations is used with map, decode_op will be applied
>>> # first, then random_jitter_op will be applied.
>>> operations = [decode_op, random_jitter_op]
>>>
>>> input_columns = ["image"]
>>>
>>> # Create a dataset where the images are decoded, then randomly color jittered.
>>> # decode_op takes column "image" as input and outputs one column. The column
>>> # outputted by decode_op is passed as input to random_jitter_op.
>>> # random_jitter_op will output one column. Column "image" will be replaced by
>>> # the column outputted by random_jitter_op (the very last operation). All other
>>> # columns are unchanged. Since column_order is not specified, the order of the
>>> # columns will remain the same.
>>> ds_mapped = data.map(operations, input_columns)
>>>
>>> # Create a dataset that is identical to ds_mapped, except the column "image"
>>> # that is outputted by random_jitter_op is renamed to "image_transformed".
>>> # Specifying column order works in the same way as examples in 1).
>>> output_columns = ["image_transformed"]
>>> ds_mapped_and_renamed = data.map(operations, input_columns, output_columns)
>>>
>>> # Multiple operations using pyfunc: Renaming columns and specifying column order
>>> # work in the same way as examples in 1).
>>> input_columns = ["col0"]
>>> operations = [(lambda x: x + x), (lambda x: x - 1)]
>>> output_columns = ["col0_mapped"]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns, output_columns)
>>>
>>> # 3) Example where number of input columns is not equal to number of output columns
>>>
>>> # operations[0] is a lambda that takes 2 columns as input and outputs 3 columns.
>>> # operations[1] is a lambda that takes 3 columns as input and outputs 1 column.
>>> # operations[2] is a lambda that takes 1 column as input and outputs 4 columns.
>>> #
>>> # Note: The number of output columns of operations[i] must equal the number of
>>> # input columns of operations[i+1]. Otherwise, this map call will result
>>> # in an error.
>>> operations = [(lambda x, y: (x, x + y, x + y + 1)),
>>>               (lambda x, y, z: x * y * z),
>>>               (lambda x: (x % 2, x % 3, x % 5, x % 7))]
>>>
>>> # Note: Since the number of input columns is not the same as the number of
>>> # output columns, the output_columns and column_order parameters must be
>>> # specified. Otherwise, this map call will also result in an error.
>>> input_columns = ["col2", "col0"]
>>> output_columns = ["mod2", "mod3", "mod5", "mod7"]
>>>
>>> # Propagate all columns to the child node in this order:
>>> column_order = ["col0", "col2", "mod2", "mod3", "mod5", "mod7", "col1"]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns, output_columns, column_order)
>>>
>>> # Propagate some columns to the child node in this order:
>>> column_order = ["mod7", "mod3", "col1"]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns, output_columns, column_order)
num_classes()

Get the number of classes in a dataset.

Returns

Number, number of classes.

output_shapes()

Get the shapes of output data.

Returns

List, list of shapes of each column.

output_types()

Get the types of output data.

Returns

List of data types.

project(columns)

Project certain columns in input dataset.

The specified columns will be selected from the dataset and passed down the pipeline in the order specified. The other columns are discarded.

Parameters

columns (list[str]) – List of names of the columns to project.

Returns

ProjectDataset, dataset projected.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>> columns_to_project = ["column3", "column1", "column2"]
>>>
>>> # Create a dataset that consists of column3, column1, column2
>>> # in that order, regardless of the original order of columns.
>>> data = data.project(columns=columns_to_project)
rename(input_columns, output_columns)

Rename the columns in input datasets.

Parameters
  • input_columns (list[str]) – List of names of the input columns.

  • output_columns (list[str]) – List of names of the output columns.

Returns

RenameDataset, dataset renamed.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> input_columns = ["input_col1", "input_col2", "input_col3"]
>>> output_columns = ["output_col1", "output_col2", "output_col3"]
>>>
>>> # Create a dataset where input_col1 is renamed to output_col1, and
>>> # input_col2 is renamed to output_col2, and input_col3 is renamed
>>> # to output_col3.
>>> data = data.rename(input_columns=input_columns, output_columns=output_columns)
repeat(count=None)

Repeat this dataset count times. Repeat indefinitely if the count is None or -1.

Note

The order of using repeat and batch affects the number of batches. It is recommended that the repeat operation be used after the batch operation. If dataset_sink_mode is False, the repeat operation is invalid. If dataset_sink_mode is True, the repeat count must be equal to the number of training epochs. Otherwise, errors could occur because the amount of data does not match the amount that training requires.
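
To see why the order matters, the following pure-Python arithmetic sketch counts the batches produced by both orderings. It does not call MindSpore; the helper name count_batches and the example sizes are illustrative assumptions, and the sketch assumes drop_remainder=False.

```python
import math

def count_batches(num_rows, batch_size, repeat_count, batch_first):
    """Count the batches a pipeline would yield (assumes drop_remainder=False)."""
    if batch_first:
        # batch -> repeat: each epoch is batched on its own, then replayed.
        return math.ceil(num_rows / batch_size) * repeat_count
    # repeat -> batch: rows from consecutive epochs may share one batch.
    return math.ceil(num_rows * repeat_count / batch_size)

print(count_batches(10, 3, 2, batch_first=True))   # 8
print(count_batches(10, 3, 2, batch_first=False))  # 7
```

With batch before repeat, every epoch has the same batch boundaries, which is one reason that ordering is the recommended one.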

Parameters

count (int) – Number of times the dataset is repeated (default=None).

Returns

RepeatDataset, dataset repeated.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>>
>>> # Create a dataset where the dataset is repeated for 50 epochs
>>> repeated = data.repeat(50)
>>>
>>> # Create a dataset where each epoch is shuffled individually
>>> shuffled_and_repeated = data.shuffle(10)
>>> shuffled_and_repeated = shuffled_and_repeated.repeat(50)
>>>
>>> # Create a dataset where the dataset is first repeated for
>>> # 50 epochs before shuffling. The shuffle operator will treat
>>> # the entire 50 epochs as one big dataset.
>>> repeat_and_shuffle = data.repeat(50)
>>> repeat_and_shuffle = repeat_and_shuffle.shuffle(10)
reset()

Reset the dataset for next epoch.

save(file_name, num_files=1, file_type='mindrecord')

Save the dynamic data processed by the dataset pipeline in a common dataset format. Supported dataset formats: ‘mindrecord’ only.

Implicit type casting exists when saving data as ‘mindrecord’. The table below shows how to do type casting.

Implicit Type Casting when Saving as ‘mindrecord’

Type in ‘dataset’    Type in ‘mindrecord’    Details
bool                 None                    Not supported
int8                 int32
uint8                bytes(1D uint8)         Drop dimension
int16                int32
uint16               int32
int32                int32
uint32               int64
int64                int64
uint64               None                    Not supported
float16              float32
float32              float32
float64              float64
string               string                  Multi-dimensional string not supported
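
The casting rules can be written down as a lookup, which may be handy for checking a pipeline's column types before calling save. This is only a sketch: the names MINDRECORD_CAST and mindrecord_type are hypothetical and are not part of MindSpore, and the dictionary simply restates the table.

```python
# Casting rules copied from the table; None marks a type that cannot be saved.
MINDRECORD_CAST = {
    "bool": None,
    "int8": "int32",
    "uint8": "bytes",      # stored as 1D uint8 bytes, dimensions are dropped
    "int16": "int32",
    "uint16": "int32",
    "int32": "int32",
    "uint32": "int64",
    "int64": "int64",
    "uint64": None,
    "float16": "float32",
    "float32": "float32",
    "float64": "float64",
    "string": "string",    # multi-dimensional strings are not supported
}

def mindrecord_type(dataset_type):
    """Return the type a column would be stored as, or raise if unsupported."""
    target = MINDRECORD_CAST[dataset_type]
    if target is None:
        raise TypeError(f"{dataset_type} cannot be saved as mindrecord")
    return target

print(mindrecord_type("uint16"))  # int32
```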

Note

  1. To save the samples in order, set dataset’s shuffle to False and num_files to 1.

  2. Before calling the function, do not use batch operator, repeat operator or data augmentation operators with random attribute in map operator.

  3. Mindrecord does not support DE_UINT64, multi-dimensional DE_UINT8 (the dimension is dropped) or multi-dimensional DE_STRING.

Parameters
  • file_name (str) – Path to dataset file.

  • num_files (int, optional) – Number of dataset files (default=1).

  • file_type (str, optional) – Dataset format (default=’mindrecord’).

shuffle(buffer_size)

Randomly shuffles the rows of this dataset using the following algorithm:

  1. Make a shuffle buffer that contains the first buffer_size rows.

  2. Randomly select an element from the shuffle buffer to be the next row propagated to the child node.

  3. Get the next row (if any) from the parent node and put it in the shuffle buffer.

  4. Repeat steps 2 and 3 until there are no more rows left in the shuffle buffer.

A seed can be provided to be used on the first epoch. In every subsequent epoch, the seed is changed to a new, randomly generated value.
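
The four steps can be sketched in plain Python as follows. This is an illustration of the algorithm as described, not MindSpore's implementation; the function name buffer_shuffle and the use of random.Random for seeding are assumptions made for the example.

```python
import random
from itertools import islice

def buffer_shuffle(rows, buffer_size, seed=None):
    """Yield rows in an order produced by a fixed-size shuffle buffer."""
    rng = random.Random(seed)
    source = iter(rows)
    # Step 1: fill the buffer with the first buffer_size rows.
    buffer = list(islice(source, buffer_size))
    # Step 4: keep going until the buffer is empty.
    while buffer:
        # Step 2: pick a random element to emit next.
        index = rng.randrange(len(buffer))
        row = buffer[index]
        try:
            # Step 3: refill the freed slot with the next parent row.
            buffer[index] = next(source)
        except StopIteration:
            # No rows left upstream: shrink the buffer instead.
            buffer[index] = buffer[-1]
            buffer.pop()
        yield row

shuffled = list(buffer_shuffle(range(10), buffer_size=4, seed=58))
print(shuffled)
```

Because only buffer_size rows are candidates at any moment, the first emitted row always comes from the first buffer_size rows; a buffer as large as the dataset removes that restriction, which is why it gives a global shuffle.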

Parameters

buffer_size (int) – The size of the buffer (must be larger than 1) for shuffling. Setting buffer_size equal to the number of rows in the entire dataset will result in a global shuffle.

Returns

ShuffleDataset, dataset shuffled.

Raises

RuntimeError – If sync operators exist before shuffle.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> # Optionally set the seed for the first epoch
>>> ds.config.set_seed(58)
>>>
>>> # Create a shuffled dataset using a shuffle buffer of size 4
>>> data = data.shuffle(4)
skip(count)

Skip the first N elements of this dataset.

Parameters

count (int) – Number of elements in the dataset to be skipped.

Returns

SkipDataset, dataset skipped.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> # Create a dataset which skips first 3 elements from data
>>> data = data.skip(3)
split(sizes, randomize=True)

Split the dataset into smaller, non-overlapping datasets.

This is a general purpose split function which can be called from any operator in the pipeline. There is another, optimized split function, which will be called automatically if ds.split is called where ds is a MappableDataset.

Parameters
  • sizes (Union[list[int], list[float]]) –

    If a list of integers [s1, s2, …, sn] is provided, the dataset will be split into n datasets of size s1, size s2, …, size sn respectively. If the sum of all sizes does not equal the original dataset size, an error will occur. If a list of floats [f1, f2, …, fn] is provided, all floats must be between 0 and 1 and must sum to 1, otherwise an error will occur. The dataset will be split into n Datasets of size round(f1*K), round(f2*K), …, round(fn*K) where K is the size of the original dataset. If after rounding:

    • Any size equals 0, an error will occur.

    • The sum of split sizes < K, the difference will be added to the first split.

    • The sum of split sizes > K, the difference will be removed from the first large enough split such that it will have at least 1 row after removing the difference.

  • randomize (bool, optional) – Determines whether or not to split the data randomly (default=True). If True, the data will be randomly split. Otherwise, each split will be created with consecutive rows from the dataset.
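
The rounding rules for a list of floats can be sketched as below. This is a pure-Python illustration of the documented behavior, not the library code; the name float_split_sizes is hypothetical, treating the bounds as 0 < f <= 1 is an assumption, and Python's built-in round (which rounds exact halves to the nearest even integer) stands in for whatever rounding the library applies.

```python
import math

def float_split_sizes(fractions, dataset_size):
    """Turn fractions that sum to 1 into absolute split sizes."""
    if not all(0 < f <= 1 for f in fractions) or not math.isclose(sum(fractions), 1.0):
        raise ValueError("fractions must lie between 0 and 1 and sum to 1")
    sizes = [int(round(f * dataset_size)) for f in fractions]
    if 0 in sizes:
        raise RuntimeError("a split would have size 0 after rounding")
    difference = dataset_size - sum(sizes)
    if difference > 0:
        # Too few rows assigned: give the missing rows to the first split.
        sizes[0] += difference
    elif difference < 0:
        # Too many rows assigned: shrink the first split that keeps >= 1 row.
        for i, size in enumerate(sizes):
            if size + difference > 0:
                sizes[i] = size + difference
                break
    return sizes

print(float_split_sizes([0.5, 0.25, 0.25], 7))   # [3, 2, 2]
print(float_split_sizes([0.33, 0.33, 0.34], 10)) # [4, 3, 3]
```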

Note

  1. Dataset cannot be sharded if split is going to be called.

  2. It is strongly recommended to not shuffle the dataset, but use randomize=True instead. Shuffling the dataset may not be deterministic, which means the data in each split will be different in each epoch.

Raises
  • RuntimeError – If get_dataset_size returns None or is not supported for this dataset.

  • RuntimeError – If sizes is list of integers and sum of all elements in sizes does not equal the dataset size.

  • RuntimeError – If sizes is list of float and there is a split with size 0 after calculations.

  • RuntimeError – If the dataset is sharded prior to calling split.

  • ValueError – If sizes is list of float and not all floats are between 0 and 1, or if the floats don’t sum to 1.

Returns

tuple(Dataset), a tuple of datasets that have been split.

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_files = "/path/to/text_file/*"
>>>
>>> # TextFileDataset is not a mappable dataset, so this non-optimized split will be called.
>>> # Since many datasets have shuffle on by default, set shuffle to False if split will be called!
>>> data = ds.TextFileDataset(dataset_files, shuffle=False)
>>> train, test = data.split([0.9, 0.1])
sync_update(condition_name, num_batch=None, data=None)

Release a blocking condition and trigger callback with given data.

Parameters
  • condition_name (str) – The condition name that is used to toggle sending next row.

  • num_batch (Union[int, None]) – The number of batches (rows) that are released. When num_batch is None, it will default to the number specified by the sync_wait operator (default=None).

  • data (Union[dict, None]) – The data passed to the callback (default=None).

sync_wait(condition_name, num_batch=1, callback=None)

Add a blocking condition to the input Dataset.

Parameters
  • num_batch (int) – The number of batches without blocking at the start of each epoch.

  • condition_name (str) – The condition name that is used to toggle sending next row.

  • callback (function) – The callback function that will be invoked when sync_update is called.

Raises

RuntimeError – If condition name already exists.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> data = data.sync_wait("callback1")
>>> data = data.batch(batch_size)
>>> for batch_data in data.create_dict_iterator():
>>>     # Release the blocking condition so that the next batch can be sent.
>>>     data.sync_update("callback1")
take(count=-1)

Take at most the given number of elements from the dataset.

Note

  1. If count is greater than the number of elements in the dataset or equal to -1, all the elements in dataset will be taken.

  2. The order of using take and batch matters. If take is before batch operation, then take given number of rows; otherwise take given number of batches.
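
The effect of the ordering can be checked with a small arithmetic sketch. It is plain Python rather than a MindSpore call; both helper names are hypothetical, and drop_remainder=False is assumed for the batch step.

```python
import math

def take_then_batch(num_rows, take_count, batch_size):
    """take counts rows; returns (number of batches, rows covered)."""
    rows = num_rows if take_count == -1 else min(take_count, num_rows)
    return math.ceil(rows / batch_size), rows

def batch_then_take(num_rows, batch_size, take_count):
    """take counts batches; returns (number of batches, rows covered)."""
    batches = math.ceil(num_rows / batch_size)
    if take_count != -1:
        batches = min(take_count, batches)
    return batches, min(num_rows, batches * batch_size)

print(take_then_batch(20, 3, 4))  # (1, 3)
print(batch_then_take(20, 4, 3))  # (3, 12)
```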

Parameters

count (int, optional) – Number of elements to be taken from the dataset (default=-1).

Returns

TakeDataset, dataset taken.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> # Create a dataset where the dataset includes 50 elements.
>>> data = data.take(50)
to_device(send_epoch_end=True)

Transfer data through CPU, GPU or Ascend devices.

Parameters

send_epoch_end (bool, optional) – Whether to send end of sequence to device or not (default=True).

Note

If device is Ascend, features of data will be transferred one by one. Each transfer is limited to 256M of data.

Returns

TransferDataset, dataset for transferring.

Raises
  • TypeError – If device_type is empty.

  • ValueError – If device_type is not ‘Ascend’, ‘GPU’ or ‘CPU’.

  • RuntimeError – If dataset is unknown.

  • RuntimeError – If distribution file path is given but failed to read.

zip(datasets)

Zip the datasets in the input tuple of datasets. Columns in the input datasets must not have the same name.

Parameters

datasets (Union[tuple, class Dataset]) – A tuple of datasets or a single class Dataset to be zipped together with this dataset.

Returns

ZipDataset, dataset zipped.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # ds1 and ds2 are instances of Dataset object
>>> # Create a dataset which is the combination of ds1 and ds2
>>> data = ds1.zip(ds2)
class mindspore.dataset.VOCDataset(dataset_dir, task='Segmentation', usage='train', class_indexing=None, num_samples=None, num_parallel_workers=None, shuffle=None, decode=False, sampler=None, num_shards=None, shard_id=None)[source]

A source dataset for reading and parsing VOC dataset.

The generated dataset has multiple columns:

  • task=’Detection’, column: [[‘image’, dtype=uint8], [‘bbox’, dtype=float32], [‘label’, dtype=uint32], [‘difficult’, dtype=uint32], [‘truncate’, dtype=uint32]].

  • task=’Segmentation’, column: [[‘image’, dtype=uint8], [‘target’, dtype=uint8]].

This dataset can take in a sampler. ‘sampler’ and ‘shuffle’ are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using ‘sampler’ and ‘shuffle’

Parameter ‘sampler’    Parameter ‘shuffle’    Expected Order Behavior
None                   None                   random order
None                   True                   random order
None                   False                  sequential order
Sampler object         None                   order defined by sampler
Sampler object         True                   not allowed
Sampler object         False                  not allowed
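
The table reads as a small decision function. The sketch below is not library code: the name expected_order is hypothetical, any non-None object stands in for a sampler, and the RuntimeError mirrors the error this class documents for specifying sampler and shuffle together.

```python
def expected_order(sampler=None, shuffle=None):
    """Return the row order implied by the 'sampler' and 'shuffle' arguments."""
    if sampler is not None:
        if shuffle is not None:
            # 'sampler' and 'shuffle' are mutually exclusive.
            raise RuntimeError("sampler and shuffle cannot both be specified")
        return "order defined by sampler"
    # Without a sampler, only an explicit shuffle=False gives sequential order.
    return "sequential order" if shuffle is False else "random order"

print(expected_order())               # random order
print(expected_order(shuffle=False))  # sequential order
```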

Citation of VOC dataset.

@article{Everingham10,
author       = {Everingham, M. and Van~Gool, L. and Williams, C. K. I. and Winn, J. and Zisserman, A.},
title        = {The Pascal Visual Object Classes (VOC) Challenge},
journal      = {International Journal of Computer Vision},
volume       = {88},
year         = {2010},
number       = {2},
month        = {jun},
pages        = {303--338},
biburl       = {http://host.robots.ox.ac.uk/pascal/VOC/pubs/everingham10.html#bibtex},
howpublished = {http://host.robots.ox.ac.uk/pascal/VOC/voc{year}/index.html},
description  = {The PASCAL Visual Object Classes (VOC) challenge is a benchmark in visual
                object category recognition and detection, providing the vision and machine
                learning communities with a standard dataset of images and annotation, and
                standard evaluation procedures.}
}
Parameters
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • task (str) – Set the task type for reading VOC data. Currently only “Segmentation” and “Detection” are supported (default=”Segmentation”).

  • usage (str) – The type of data list text file to be read (default=”train”).

  • class_indexing (dict, optional) – A str-to-int mapping from label name to index, only valid in “Detection” task (default=None, the folder names will be sorted alphabetically and each class will be given a unique index starting from 0).

  • num_samples (int, optional) – The number of images to be included in the dataset (default=None, all images).

  • num_parallel_workers (int, optional) – Number of workers to read the data (default=None, number set in the config).

  • shuffle (bool, optional) – Whether to perform shuffle on the dataset (default=None, expected order behavior shown in the table).

  • decode (bool, optional) – Decode the images after reading (default=False).

  • sampler (Sampler, optional) – Object used to choose samples from the dataset (default=None, expected order behavior shown in the table).

  • num_shards (int, optional) – Number of shards that the dataset will be divided into (default=None).

  • shard_id (int, optional) – The shard ID within num_shards (default=None). This argument can only be specified when num_shards is also specified.

Raises
  • RuntimeError – If the XML of Annotations has an invalid format.

  • RuntimeError – If the XML of Annotations is missing the “object” attribute.

  • RuntimeError – If the XML of Annotations is missing the “bndbox” attribute.

  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and sharding are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If task is neither ‘Segmentation’ nor ‘Detection’.

  • ValueError – If task is ‘Segmentation’ but class_indexing is not None.

  • ValueError – If the txt file related to mode does not exist.

  • ValueError – If shard_id is invalid (< 0 or >= num_shards).

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_dir = "/path/to/voc_dataset_directory"
>>>
>>> # 1) Read VOC data for segmentation training
>>> voc_dataset = ds.VOCDataset(dataset_dir, task="Segmentation", usage="train")
>>>
>>> # 2) Read VOC data for detection training
>>> voc_dataset = ds.VOCDataset(dataset_dir, task="Detection", usage="train")
>>>
>>> # 3) Read all VOC dataset samples in dataset_dir with 8 threads in random order
>>> voc_dataset = ds.VOCDataset(dataset_dir, task="Detection", usage="train", num_parallel_workers=8)
>>>
>>> # 4) Read then decode all VOC dataset samples in dataset_dir in sequence
>>> voc_dataset = ds.VOCDataset(dataset_dir, task="Detection", usage="train", decode=True, shuffle=False)
>>>
>>> # In VOC dataset, if task='Segmentation', each dictionary has keys "image" and "target"
>>> # In VOC dataset, if task='Detection', each dictionary has keys "image", "bbox", "label", "difficult" and "truncate"
apply(apply_func)

Apply a function in this dataset.

Parameters

apply_func (function) – A function that must take one ‘Dataset’ as an argument and return a preprocessed ‘Dataset’.

Returns

Dataset, applied by the function.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>>
>>> # Declare an apply_func function which returns a Dataset object
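>>> def apply_func(dataset):
>>>     # The parameter is named "dataset" so that it does not shadow the ds module.
>>>     dataset = dataset.batch(2)
>>>     return dataset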
>>>
>>> # Use apply to call apply_func
>>> data = data.apply(apply_func)
Raises
  • TypeError – If apply_func is not a function.

  • TypeError – If apply_func doesn’t return a Dataset.

batch(batch_size, drop_remainder=False, num_parallel_workers=None, per_batch_map=None, input_columns=None, output_columns=None, column_order=None, pad_info=None)

Combine batch_size number of consecutive rows into batches.

For any child node, a batch is treated as a single row. For any column, all the elements within that column must have the same shape. If a per_batch_map callable is provided, it will be applied to the batches of tensors.

Note

The order of using repeat and batch affects the number of batches and per_batch_map. It is recommended that the repeat operation be used after the batch operation.

Parameters
  • batch_size (int or function) – The number of rows each batch is created with. An int or callable which takes exactly 1 parameter, BatchInfo.

  • drop_remainder (bool, optional) – Determines whether or not to drop the last possibly incomplete batch (default=False). If True, and if there are less than batch_size rows available to make the last batch, then those rows will be dropped and not propagated to the child node.

  • num_parallel_workers (int, optional) – Number of workers to process the dataset in parallel (default=None).

  • per_batch_map (callable, optional) – Per batch map callable. A callable which takes (list[Tensor], list[Tensor], …, BatchInfo) as input parameters. Each list[Tensor] represents a batch of Tensors on a given column. The number of lists should match with number of entries in input_columns. The last parameter of the callable should always be a BatchInfo object.

  • input_columns (list[str], optional) – List of names of the input columns. The size of the list should match with signature of the per_batch_map callable.

  • output_columns (list[str], optional) – [Not currently implemented] List of names assigned to the columns outputted by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. (default=None, output columns will have the same name as the input columns, i.e., the columns will be replaced).

  • column_order (list[str], optional) – [Not currently implemented] List of all the desired columns to propagate to the child node. This list must be a subset of all the columns in the dataset after all operations are applied. The order of the columns in each row propagated to the child node follow the order they appear in this list. The parameter is mandatory if the len(input_columns) != len(output_columns). (default=None, all columns will be propagated to the child node, the order of the columns will remain the same).

  • pad_info (dict, optional) – Whether to perform padding on selected columns. pad_info={“col1”:([224,224],0)} would pad column with name “col1” to a tensor of size [224,224] and fill the missing with 0.

Returns

BatchDataset, dataset batched.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>>
>>> # Create a dataset where every 100 rows is combined into a batch
>>> # and drops the last incomplete batch if there is one.
>>> data = data.batch(100, True)
bucket_batch_by_length(column_names, bucket_boundaries, bucket_batch_sizes, element_length_function=None, pad_info=None, pad_to_bucket_boundary=False, drop_remainder=False)

Bucket elements according to their lengths. Each bucket will be padded and batched when they are full.

A length function is called on each row in the dataset. The row is then bucketed based on its length and bucket_boundaries. When a bucket reaches its corresponding size specified in bucket_batch_sizes, the entire bucket will be padded according to pad_info, and then batched. Each batch will be full, except possibly the last batch for each bucket.

Parameters
  • column_names (list[str]) – Columns passed to element_length_function.

  • bucket_boundaries (list[int]) – A list consisting of the upper boundaries of the buckets. Must be strictly increasing. If there are n boundaries, n+1 buckets are created: One bucket for [0, bucket_boundaries[0]), one bucket for [bucket_boundaries[i], bucket_boundaries[i+1]) for each 0 <= i < n-1, and one bucket for [bucket_boundaries[n-1], inf).

  • bucket_batch_sizes (list[int]) – A list consisting of the batch sizes for each bucket. Must contain len(bucket_boundaries)+1 elements.

  • element_length_function (Callable, optional) – A function that takes in len(column_names) arguments and returns an int. If no value is provided, then len(column_names) must be 1, and the size of the first dimension of that column will be taken as the length (default=None).

  • pad_info (dict, optional) – Represents how to batch each column. The key corresponds to the column name, and the value must be a tuple of 2 elements. The first element corresponds to the shape to pad to, and the second element corresponds to the value to pad with. If a column is not specified, then that column will be padded to the longest in the current batch, and 0 will be used as the padding value. Any None dimensions will be padded to the longest in the current batch, unless pad_to_bucket_boundary is True. If no padding is wanted, set pad_info to None (default=None).

  • pad_to_bucket_boundary (bool, optional) – If True, will pad each None dimension in pad_info to the bucket_boundary minus 1. If there are any elements that fall into the last bucket, an error will occur (default=False).

  • drop_remainder (bool, optional) – If True, will drop the last batch for each bucket if it is not a full batch (default=False).
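
How rows are assigned to buckets and when a batch is emitted can be sketched in plain Python. This illustration works on row lengths only and leaves padding out entirely; the function name bucket_batches and the sample lengths are assumptions, and it is not MindSpore's implementation.

```python
from bisect import bisect_right

def bucket_batches(lengths, bucket_boundaries, bucket_batch_sizes, drop_remainder=False):
    """Group row lengths into batches, one bucket per length range."""
    buckets = [[] for _ in bucket_batch_sizes]
    batches = []
    for length in lengths:
        # n boundaries give n+1 buckets; a length equal to a boundary
        # belongs to the bucket that starts at that boundary.
        i = bisect_right(bucket_boundaries, length)
        buckets[i].append(length)
        if len(buckets[i]) == bucket_batch_sizes[i]:
            # The bucket is full: emit it as one batch and start over.
            batches.append(buckets[i])
            buckets[i] = []
    if not drop_remainder:
        # Partially filled buckets become the last batch of each bucket.
        batches.extend(bucket for bucket in buckets if bucket)
    return batches

print(bucket_batches([1, 7, 3, 12, 4, 9], [5, 10], [2, 1, 1]))
# [[7], [1, 3], [12], [9], [4]]
```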

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>>
>>> # Create a dataset where rows are bucketed by length, and each bucket
>>> # is padded and batched once it is full.
>>> column_names = ["col1", "col2"]
>>> bucket_boundaries = [5, 10]
>>> bucket_batch_sizes = [5, 1, 1]
>>> element_length_function = (lambda col1, col2: max(len(col1), len(col2)))
>>>
>>> # Will pad col1 to shape [2, bucket_boundaries[i]] where i is the
>>> # index of the bucket that is currently being batched.
>>> # Will pad col2 to a shape where each dimension is the longest in all
>>> # the elements currently being batched.
>>> pad_info = {"col1": ([2, None], -1)}
>>> pad_to_bucket_boundary = True
>>>
>>> data = data.bucket_batch_by_length(column_names, bucket_boundaries,
>>>                                    bucket_batch_sizes,
>>>                                    element_length_function, pad_info,
>>>                                    pad_to_bucket_boundary)
concat(datasets)

Concatenate the datasets in the input list of datasets. The “+” operator is also supported to concatenate.

Note

The column names, and the rank and type of the column data, must be the same in the input datasets.

Parameters

datasets (Union[list, class Dataset]) – A list of datasets or a single class Dataset to be concatenated together with this dataset.

Returns

ConcatDataset, dataset concatenated.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # ds1 and ds2 are instances of Dataset object
>>>
>>> # Create a dataset by concatenating ds1 and ds2 with "+" operator
>>> data1 = ds1 + ds2
>>> # Create a dataset by concatenating ds1 and ds2 with concat operation
>>> data1 = ds1.concat(ds2)
create_dict_iterator(num_epochs=-1, output_numpy=False)

Create an iterator over the dataset. The data retrieved will be a dictionary.

The order of the columns in the dictionary may not be the same as the original order.

Parameters
  • num_epochs (int, optional) – Maximum number of epochs that iterator can be iterated (default=-1, iterator can be iterated infinite number of epochs).

  • output_numpy (bool, optional) – Whether or not to output NumPy datatype, if output_numpy=False, iterator will output MSTensor (default=False).

Returns

Iterator, dictionary of column name-ndarray pair.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>>
>>> # create an iterator
>>> # The order of the columns obtained by the iterator may differ from the original order.
>>> iterator = data.create_dict_iterator()
>>> for item in iterator:
>>>     # print the data in column1
>>>     print(item["column1"])
create_tuple_iterator(columns=None, num_epochs=-1, output_numpy=False)

Create an iterator over the dataset. The data retrieved will be a list of ndarrays.

To specify which columns to list and the order needed, use columns. If columns is not provided, the order of the columns will not be changed.

Parameters
  • columns (list[str], optional) – List of columns to be used to specify the order of columns (default=None, means all columns).

  • num_epochs (int, optional) – Maximum number of epochs that iterator can be iterated. (default=-1, iterator can be iterated infinite number of epochs)

  • output_numpy (bool, optional) – Whether or not to output NumPy datatype. If output_numpy=False, iterator will output MSTensor (default=False).

Returns

Iterator, list of ndarrays.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>>
>>> # Create an iterator
>>> # The columns in the data obtained by the iterator will not be changed.
>>> iterator = data.create_tuple_iterator()
>>> for item in iterator:
>>>     # convert the returned tuple to a list and print
>>>     print(list(item))
device_que(prefetch_size=None, send_epoch_end=True)

Return a transferred Dataset that transfers data through a device.

Parameters
  • prefetch_size (int, optional) – Prefetch number of records ahead of the user’s request (default=None).

  • send_epoch_end (bool, optional) – Whether to send end of sequence to device or not (default=True).

Note

If device is Ascend, features of data will be transferred one by one. Each transfer is limited to 256M of data.

Returns

TransferDataset, dataset for transferring.

filter(predicate, input_columns=None, num_parallel_workers=1)

Filter dataset by predicate.

Note

If input_columns not provided or empty, all columns will be used.

Parameters
  • predicate (callable) – Python callable which returns a boolean value. If it returns False, the element is filtered out.

  • input_columns (list[str], optional) – List of names of the input columns (default=None, the predicate will be applied to all columns in the dataset).

  • num_parallel_workers (int, optional) – Number of workers to process the dataset in parallel (default=1).

Returns

FilterDataset, dataset filtered.

Examples

>>> import mindspore.dataset as ds
>>> # dataset is a generator dataset whose column "data" holds the values 0 to 63.
>>> # Filter out the rows whose value is greater than or equal to 11.
>>> dataset_f = dataset.filter(predicate=lambda data: data < 11, input_columns=["data"])
flat_map(func)

Map func to each row in dataset and flatten the result.

The specified func is a function that must take one ‘Ndarray’ as input and return a ‘Dataset’.

Parameters

func (function) – A function that must take one ‘Ndarray’ as an argument and return a ‘Dataset’.

Returns

Dataset, applied by the function.

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>>
>>> # Declare a function which returns a Dataset object
>>> def flat_map_func(x):
>>>     data_dir = text.to_str(x[0])
>>>     d = ds.ImageFolderDataset(data_dir)
>>>     return d
>>> # data is an instance of a Dataset object.
>>> data = ds.TextFileDataset(DATA_FILE)
>>> data = data.flat_map(flat_map_func)
Raises
  • TypeError – If func is not a function.

  • TypeError – If func doesn’t return a Dataset.

get_batch_size()

Get the size of a batch.

Returns

Number, the number of data in a batch.

get_class_indexing()[source]

Get the class index.

Returns

Dict, A str-to-int mapping from label name to index.

get_col_names()

Get the names of the columns in the dataset.

get_dataset_size()[source]

Get the number of batches in an epoch.

Returns

Number, number of batches.

get_repeat_count()

Get the repeat count of a RepeatDataset, or 1 for any other dataset.

Returns

Number, the count of repeat.

map(operations=None, input_columns=None, output_columns=None, column_order=None, num_parallel_workers=None, python_multiprocessing=False, cache=None, callbacks=None)

Apply each operation in operations to this dataset.

The order of operations is determined by the position of each operation in the operations parameter. operations[0] will be applied first, then operations[1], then operations[2], etc.

Each operation will be passed one or more columns from the dataset as input, and zero or more columns will be outputted. The first operation will be passed the columns specified in input_columns as input. If there is more than one operator in operations, the outputted columns of the previous operation are used as the input columns for the next operation. The columns outputted by the very last operation will be assigned names specified by output_columns.
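The chaining rule described above can be sketched in plain Python. This is only an illustration of how outputs flow from one operation to the next, assuming column values are plain integers; chain_operations is a hypothetical helper and not MindSpore's implementation.

```python
def chain_operations(operations, input_values, output_columns):
    """Hypothetical helper: feed each operation's outputs into the next one."""
    values = tuple(input_values)
    for op in operations:
        result = op(*values)
        # An operation may return a single column or a tuple of columns.
        values = result if isinstance(result, tuple) else (result,)
    if len(values) != len(output_columns):
        raise ValueError("the last operation must output len(output_columns) columns")
    # The outputs of the very last operation receive the names in output_columns.
    return dict(zip(output_columns, values))


# Two input columns -> three columns -> one column -> four named output columns.
operations = [
    lambda x, y: (x, x + y, x + y + 1),
    lambda x, y, z: x * y * z,
    lambda x: (x % 2, x % 3, x % 5, x % 7),
]
row = chain_operations(operations, input_values=(3, 1),
                       output_columns=["mod2", "mod3", "mod5", "mod7"])
print(row)
```

The number of values returned by one operation must equal the number of arguments the next one accepts; otherwise the call to op(*values) fails.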

Only the columns specified in column_order will be propagated to the child node. These columns will be in the same order as specified in column_order.

Parameters
  • operations (Union[list[TensorOp], list[functions]]) – List of operations to be applied on the dataset. Operations are applied in the order they appear in this list.

  • input_columns (list[str]) – List of the names of the columns that will be passed to the first operation as input. The size of this list must match the number of input columns expected by the first operator. (default=None, the first operation will be passed however many columns it requires, starting from the first column).

  • output_columns (list[str], optional) – List of names assigned to the columns outputted by the last operation. This parameter is mandatory if len(input_columns) != len(output_columns). The size of this list must match the number of output columns of the last operation. (default=None, output columns will have the same name as the input columns, i.e., the columns will be replaced).

  • column_order (list[str], optional) – List of all the desired columns to propagate to the child node. This list must be a subset of all the columns in the dataset after all operations are applied. The order of the columns in each row propagated to the child node follows the order in which they appear in this list. This parameter is mandatory if len(input_columns) != len(output_columns). (default=None, all columns will be propagated to the child node, the order of the columns will remain the same).

  • num_parallel_workers (int, optional) – Number of threads used to process the dataset in parallel (default=None, the value from the configuration will be used).

  • python_multiprocessing (bool, optional) – Parallelize Python operations with multiple worker processes. This option could be beneficial if the Python operation is computationally heavy (default=False).

  • cache (DatasetCache, optional) – Tensor cache to use. (default=None which means no cache is used). The cache feature is under development and is not recommended.

  • callbacks (DSCallback, list[DSCallback], optional) – List of Dataset callbacks to be called (default=None).

Returns

MapDataset, dataset after mapping operation.

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.vision.c_transforms as c_transforms
>>>
>>> # data is an instance of Dataset which has 2 columns, "image" and "label".
>>> # ds_pyfunc is an instance of Dataset which has 3 columns, "col0", "col1", and "col2".
>>> # Each column is a 2D array of integers.
>>>
>>> # Set the global configuration value for num_parallel_workers to be 2.
>>> # Operations which use this configuration value will use 2 worker threads,
>>> # unless otherwise specified in the operator's constructor.
>>> # set_num_parallel_workers can be called again later if a different
>>> # global configuration value for the number of worker threads is desired.
>>> ds.config.set_num_parallel_workers(2)
>>>
>>> # Define two operations, where each operation accepts 1 input column and outputs 1 column.
>>> decode_op = c_transforms.Decode(rgb_format=True)
>>> random_jitter_op = c_transforms.RandomColorAdjust((0.8, 0.8), (1, 1), (1, 1), (0, 0))
>>>
>>> # 1) Simple map example
>>>
>>> operations = [decode_op]
>>> input_columns = ["image"]
>>>
>>> # Apply decode_op on column "image". This column will be replaced by the outputted
>>> # column of decode_op. Since column_order is not provided, both columns "image"
>>> # and "label" will be propagated to the child node in their original order.
>>> ds_decoded = data.map(operations, input_columns)
>>>
>>> # Rename column "image" to "decoded_image".
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(operations, input_columns, output_columns)
>>>
>>> # Specify the order of the columns.
>>> column_order = ["label", "image"]
>>> ds_decoded = data.map(operations, input_columns, None, column_order)
>>>
>>> # Rename column "image" to "decoded_image" and also specify the order of the columns.
>>> column_order = ["label", "decoded_image"]
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(operations, input_columns, output_columns, column_order)
>>>
>>> # Rename column "image" to "decoded_image" and keep only this column.
>>> column_order = ["decoded_image"]
>>> output_columns = ["decoded_image"]
>>> ds_decoded = data.map(operations, input_columns, output_columns, column_order)
>>>
>>> # A simple example using pyfunc: Renaming columns and specifying column order
>>> # work in the same way as the previous examples.
>>> input_columns = ["col0"]
>>> operations = [(lambda x: x + 1)]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns)
>>>
>>> # 2) Map example with more than one operation
>>>
>>> # If this list of operations is used with map, decode_op will be applied
>>> # first, then random_jitter_op will be applied.
>>> operations = [decode_op, random_jitter_op]
>>>
>>> input_columns = ["image"]
>>>
>>> # Create a dataset where the images are decoded, then randomly color jittered.
>>> # decode_op takes column "image" as input and outputs one column. The column
>>> # outputted by decode_op is passed as input to random_jitter_op.
>>> # random_jitter_op will output one column. Column "image" will be replaced by
>>> # the column outputted by random_jitter_op (the very last operation). All other
>>> # columns are unchanged. Since column_order is not specified, the order of the
>>> # columns will remain the same.
>>> ds_mapped = data.map(operations, input_columns)
>>>
>>> # Create a dataset that is identical to ds_mapped, except the column "image"
>>> # that is outputted by random_jitter_op is renamed to "image_transformed".
>>> # Specifying column order works in the same way as examples in 1).
>>> output_columns = ["image_transformed"]
>>> ds_mapped_and_renamed = data.map(operations, input_columns, output_columns)
>>>
>>> # Multiple operations using pyfunc: Renaming columns and specifying column order
>>> # work in the same way as examples in 1).
>>> input_columns = ["col0"]
>>> operations = [(lambda x: x + x), (lambda x: x - 1)]
>>> output_columns = ["col0_mapped"]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns, output_columns)
>>>
>>> # 3) Example where number of input columns is not equal to number of output columns
>>>
>>> # operations[0] is a lambda that takes 2 columns as input and outputs 3 columns.
>>> # operations[1] is a lambda that takes 3 columns as input and outputs 1 column.
>>> # operations[2] is a lambda that takes 1 column as input and outputs 4 columns.
>>> #
>>> # Note: The number of output columns of operations[i] must equal the number of
>>> # input columns of operations[i+1]. Otherwise, this map call will result
>>> # in an error.
>>> operations = [(lambda x, y: (x, x + y, x + y + 1)),
>>>               (lambda x, y, z: x * y * z),
>>>               (lambda x: (x % 2, x % 3, x % 5, x % 7))]
>>>
>>> # Note: Since the number of input columns is not the same as the number of
>>> # output columns, the output_columns and column_order parameters must be
>>> # specified. Otherwise, this map call will also result in an error.
>>> input_columns = ["col2", "col0"]
>>> output_columns = ["mod2", "mod3", "mod5", "mod7"]
>>>
>>> # Propagate all columns to the child node in this order:
>>> column_order = ["col0", "col2", "mod2", "mod3", "mod5", "mod7", "col1"]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns, output_columns, column_order)
>>>
>>> # Propagate some columns to the child node in this order:
>>> column_order = ["mod7", "mod3", "col1"]
>>> ds_mapped = ds_pyfunc.map(operations, input_columns, output_columns, column_order)
num_classes()

Get the number of classes in a dataset.

Returns

Number, number of classes.

output_shapes()

Get the shapes of output data.

Returns

List, list of shapes of each column.

output_types()

Get the types of output data.

Returns

List of data types.

project(columns)

Project certain columns in input dataset.

The specified columns will be selected from the dataset and passed down the pipeline in the order specified. The other columns are discarded.

Parameters

columns (list[str]) – List of names of the columns to project.

Returns

ProjectDataset, dataset projected.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object
>>> columns_to_project = ["column3", "column1", "column2"]
>>>
>>> # Create a dataset that consists of column3, column1, column2
>>> # in that order, regardless of the original order of columns.
>>> data = data.project(columns=columns_to_project)
rename(input_columns, output_columns)

Rename the columns in input datasets.

Parameters
  • input_columns (list[str]) – List of names of the input columns.

  • output_columns (list[str]) – List of names of the output columns.

Returns

RenameDataset, dataset renamed.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> input_columns = ["input_col1", "input_col2", "input_col3"]
>>> output_columns = ["output_col1", "output_col2", "output_col3"]
>>>
>>> # Create a dataset where input_col1 is renamed to output_col1, and
>>> # input_col2 is renamed to output_col2, and input_col3 is renamed
>>> # to output_col3.
>>> data = data.rename(input_columns=input_columns, output_columns=output_columns)
repeat(count=None)

Repeat this dataset count times. Repeat indefinitely if the count is None or -1.

Note

The order of repeat and batch affects the number of batches. It is recommended that the repeat operation be used after the batch operation. If dataset_sink_mode is False, the repeat operation is invalid. If dataset_sink_mode is True, the repeat count must be equal to the number of training epochs. Otherwise, errors could occur because the amount of data does not match the amount that training requires.
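Why the order matters can be seen with a small plain-Python simulation. It does not use MindSpore and assumes that an incomplete final batch is kept rather than dropped; batch and repeat here are hypothetical stand-ins for the dataset operations.

```python
from itertools import islice


def batch(rows, batch_size):
    """Hypothetical stand-in: group rows into lists, keeping a short last batch."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        yield chunk


def repeat(rows, count):
    """Hypothetical stand-in: replay the rows count times."""
    rows = list(rows)
    for _ in range(count):
        yield from rows


rows = range(10)
# batch first: every epoch ends with its own short batch.
batch_then_repeat = list(repeat(batch(rows, 4), 2))
# repeat first: batches are cut from the concatenated epochs and cross the boundary.
repeat_then_batch = list(batch(repeat(rows, 2), 4))
print(len(batch_then_repeat), len(repeat_then_batch))
```

In the first ordering no batch mixes rows from two epochs, which is one reason to apply repeat after batch.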

Parameters

count (int) – Number of times the dataset is repeated (default=None).

Returns

RepeatDataset, dataset repeated.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>>
>>> # Create a dataset where the dataset is repeated for 50 epochs
>>> repeated = data.repeat(50)
>>>
>>> # Create a dataset where each epoch is shuffled individually
>>> shuffled_and_repeated = data.shuffle(10)
>>> shuffled_and_repeated = shuffled_and_repeated.repeat(50)
>>>
>>> # Create a dataset where the dataset is first repeated for
>>> # 50 epochs before shuffling. The shuffle operator will treat
>>> # the entire 50 epochs as one big dataset.
>>> repeat_and_shuffle = data.repeat(50)
>>> repeat_and_shuffle = repeat_and_shuffle.shuffle(10)
reset()

Reset the dataset for next epoch.

save(file_name, num_files=1, file_type='mindrecord')

Save the dynamic data processed by the dataset pipeline in a common dataset format. Supported dataset formats: ‘mindrecord’ only.

Implicit type casting exists when saving data as ‘mindrecord’. The table below shows how each type is cast.

Implicit Type Casting when Saving as ‘mindrecord’

Type in ‘dataset’    Type in ‘mindrecord’    Details
bool                 None                    Not supported
int8                 int32
uint8                bytes(1D uint8)         Drop dimension
int16                int32
uint16               int32
int32                int32
uint32               int64
int64                int64
uint64               None                    Not supported
float16              float32
float32              float32
float64              float64
string               string                  Multi-dimensional string not supported
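For quick lookups, the casting table above can be written as a dictionary. This is a plain-Python restatement of the table only; MINDRECORD_TYPE and mindrecord_type are hypothetical names that do not exist in MindSpore.

```python
# Hypothetical lookup table restating the casting rules listed above.
# None marks a type that is not supported by 'mindrecord'.
MINDRECORD_TYPE = {
    "bool": None,
    "int8": "int32",
    "uint8": "bytes",  # stored as 1D uint8 bytes, dimensions are dropped
    "int16": "int32",
    "uint16": "int32",
    "int32": "int32",
    "uint32": "int64",
    "int64": "int64",
    "uint64": None,
    "float16": "float32",
    "float32": "float32",
    "float64": "float64",
    "string": "string",  # multi-dimensional strings are not supported
}


def mindrecord_type(dataset_type):
    """Return the type a column is cast to, or raise if it cannot be saved."""
    target = MINDRECORD_TYPE[dataset_type]
    if target is None:
        raise TypeError(f"{dataset_type} cannot be saved as mindrecord")
    return target


print(mindrecord_type("uint16"))
```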

Note

  1. To save the samples in order, set dataset’s shuffle to False and num_files to 1.

  2. Before calling the function, do not use batch operator, repeat operator or data augmentation operators with random attribute in map operator.

  3. Mindrecord does not support DE_UINT64, multi-dimensional DE_UINT8(drop dimension) nor multi-dimensional DE_STRING.

Parameters
  • file_name (str) – Path to dataset file.

  • num_files (int, optional) – Number of dataset files (default=1).

  • file_type (str, optional) – Dataset format (default=’mindrecord’).

shuffle(buffer_size)

Randomly shuffles the rows of this dataset using the following algorithm:

  1. Make a shuffle buffer that contains the first buffer_size rows.

  2. Randomly select an element from the shuffle buffer to be the next row propagated to the child node.

  3. Get the next row (if any) from the parent node and put it in the shuffle buffer.

  4. Repeat steps 2 and 3 until there are no more rows left in the shuffle buffer.

A seed can be provided to be used on the first epoch. In every subsequent epoch, the seed is changed to a new, randomly generated value.
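The four steps above can be sketched in plain Python. This is an illustration of the described algorithm, not MindSpore's implementation; buffer_shuffle and its seed argument are hypothetical.

```python
import random


def buffer_shuffle(rows, buffer_size, seed=None):
    """Hypothetical helper: shuffle rows with a fixed-size shuffle buffer."""
    rng = random.Random(seed)
    source = iter(rows)
    # Step 1: fill the buffer with the first buffer_size rows.
    buffer = []
    for row in source:
        buffer.append(row)
        if len(buffer) == buffer_size:
            break
    # Step 4: repeat steps 2 and 3 until the buffer is empty.
    while buffer:
        # Step 2: pick a random element of the buffer as the next output row.
        index = rng.randrange(len(buffer))
        yield buffer[index]
        try:
            # Step 3: refill the freed slot with the next row from the parent.
            buffer[index] = next(source)
        except StopIteration:
            # No rows left upstream, so the buffer simply shrinks.
            buffer.pop(index)


shuffled = list(buffer_shuffle(range(10), buffer_size=4, seed=58))
print(shuffled)
```

Because a row can only be emitted once it has entered the buffer, a small buffer_size keeps rows close to their original position, while a buffer as large as the dataset gives a global shuffle.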

Parameters

buffer_size (int) – The size of the buffer (must be larger than 1) for shuffling. Setting buffer_size equal to the number of rows in the entire dataset will result in a global shuffle.

Returns

ShuffleDataset, dataset shuffled.

Raises

RuntimeError – If sync operators exist before shuffle.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> # Optionally set the seed for the first epoch
>>> ds.config.set_seed(58)
>>>
>>> # Create a shuffled dataset using a shuffle buffer of size 4
>>> data = data.shuffle(4)
skip(count)

Skip the first N elements of this dataset.

Parameters

count (int) – Number of elements in the dataset to be skipped.

Returns

SkipDataset, dataset skipped.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> # Create a dataset which skips first 3 elements from data
>>> data = data.skip(3)
split(sizes, randomize=True)

Split the dataset into smaller, non-overlapping datasets.

Parameters
  • sizes (Union[list[int], list[float]]) –

    If a list of integers [s1, s2, …, sn] is provided, the dataset will be split into n datasets of size s1, size s2, …, size sn respectively. If the sum of all sizes does not equal the original dataset size, an error will occur. If a list of floats [f1, f2, …, fn] is provided, all floats must be between 0 and 1 and must sum to 1, otherwise an error will occur. The dataset will be split into n Datasets of size round(f1*K), round(f2*K), …, round(fn*K) where K is the size of the original dataset. If after rounding:

    • Any size equals 0, an error will occur.

    • The sum of split sizes < K, the difference will be added to the first split.

    • The sum of split sizes > K, the difference will be removed from the first large enough split such that it will have at least 1 row after removing the difference.

  • randomize (bool, optional) – Determines whether or not to split the data randomly (default=True). If True, the data will be randomly split. Otherwise, each split will be created with consecutive rows from the dataset.
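The rounding rules for a list of floats can be sketched as follows. This is a plain-Python illustration under the assumption that rounding behaves like Python's built-in round; absolute_split_sizes is a hypothetical helper.

```python
def absolute_split_sizes(fractions, total_rows):
    """Hypothetical helper: turn fractions into row counts using the rules above."""
    sizes = [round(fraction * total_rows) for fraction in fractions]
    if any(size == 0 for size in sizes):
        raise RuntimeError("a split would have size 0 after rounding")
    difference = total_rows - sum(sizes)
    if difference > 0:
        # Too few rows assigned: give the missing rows to the first split.
        sizes[0] += difference
    elif difference < 0:
        # Too many rows assigned: shrink the first split that keeps >= 1 row.
        excess = -difference
        for index, size in enumerate(sizes):
            if size > excess:
                sizes[index] -= excess
                break
    return sizes


print(absolute_split_sizes([0.9, 0.1], 100))
```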

Note

  1. There is an optimized split function, which will be called automatically when the dataset that calls this function is a MappableDataset.

  2. Dataset should not be sharded if split is going to be called. Instead, create a DistributedSampler and specify a split to shard after splitting. If dataset is sharded after a split, it is strongly recommended to set the same seed in each instance of execution, otherwise each shard may not be part of the same split (see Examples).

  3. It is strongly recommended to not shuffle the dataset, but use randomize=True instead. Shuffling the dataset may not be deterministic, which means the data in each split will be different in each epoch. Furthermore, if sharding occurs after split, each shard may not be part of the same split.

Raises
  • RuntimeError – If get_dataset_size returns None or is not supported for this dataset.

  • RuntimeError – If sizes is list of integers and sum of all elements in sizes does not equal the dataset size.

  • RuntimeError – If sizes is list of float and there is a split with size 0 after calculations.

  • RuntimeError – If the dataset is sharded prior to calling split.

  • ValueError – If sizes is list of float and not all floats are between 0 and 1, or if the floats don’t sum to 1.

Returns

tuple(Dataset), a tuple of datasets that have been split.

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_dir = "/path/to/imagefolder_directory"
>>>
>>> # Since many datasets have shuffle on by default, set shuffle to False if split will be called!
>>> data = ds.ImageFolderDataset(dataset_dir, shuffle=False)
>>>
>>> # Set the seed, and tell split to use this seed when randomizing.
>>> # This is needed because sharding will be done later
>>> ds.config.set_seed(58)
>>> train, test = data.split([0.9, 0.1])
>>>
>>> # To shard the train dataset, use a DistributedSampler
>>> train_sampler = ds.DistributedSampler(10, 2)
>>> train.use_sampler(train_sampler)
sync_update(condition_name, num_batch=None, data=None)

Release a blocking condition and trigger callback with given data.

Parameters
  • condition_name (str) – The condition name that is used to toggle sending next row.

  • num_batch (Union[int, None]) – The number of batches (rows) that are released. When num_batch is None, it will default to the number specified by the sync_wait operator (default=None).

  • data (Union[dict, None]) – The data passed to the callback (default=None).

sync_wait(condition_name, num_batch=1, callback=None)

Add a blocking condition to the input Dataset.

Parameters
  • condition_name (str) – The condition name that is used to toggle sending next row.

  • num_batch (int) – The number of batches without blocking at the start of each epoch (default=1).

  • callback (function) – The callback function that will be invoked when sync_update is called (default=None).

Raises

RuntimeError – If condition name already exists.

Examples

>>> import mindspore.dataset as ds
>>>
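>>> # data is an instance of Dataset object and batch_size is an int.
>>> data = data.sync_wait("callback1")
>>> data = data.batch(batch_size)
>>> for batch_data in data.create_dict_iterator():
>>>     # Release the blocking condition; no return value is documented, so data is not reassigned.
>>>     data.sync_update("callback1")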
take(count=-1)

Take at most the given number of elements from the dataset.

Note

  1. If count is greater than the number of elements in the dataset or equal to -1, all the elements in dataset will be taken.

  2. The order of take and batch matters. If take is applied before batch, it takes the given number of rows; otherwise it takes the given number of batches.
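The second note can be illustrated with a plain-Python sketch that does not use MindSpore. It assumes an incomplete final batch is kept; batch is a hypothetical stand-in for the dataset operation and islice plays the role of take.

```python
from itertools import islice


def batch(rows, batch_size):
    """Hypothetical stand-in: group rows into lists, keeping a short last batch."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        yield chunk


rows = list(range(10))
# take before batch: 5 rows are taken, then grouped into batches of 2.
take_then_batch = list(batch(islice(rows, 5), 2))
# batch before take: rows are grouped first, then 5 batches are taken.
batch_then_take = list(islice(batch(rows, 2), 5))
print(take_then_batch)
print(batch_then_take)
```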

Parameters

count (int, optional) – Number of elements to be taken from the dataset (default=-1).

Returns

TakeDataset, dataset taken.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # data is an instance of Dataset object.
>>> # Create a dataset where the dataset includes 50 elements.
>>> data = data.take(50)
to_device(send_epoch_end=True)

Transfer data through CPU, GPU or Ascend devices.

Parameters

send_epoch_end (bool, optional) – Whether to send end of sequence to device or not (default=True).

Note

If the device is Ascend, the features of the data will be transferred one by one. Each transfer is limited to 256M.

Returns

TransferDataset, dataset for transferring.

Raises
  • TypeError – If device_type is empty.

  • ValueError – If device_type is not ‘Ascend’, ‘GPU’ or ‘CPU’.

  • RuntimeError – If dataset is unknown.

  • RuntimeError – If distribution file path is given but failed to read.

use_sampler(new_sampler)

Make the current dataset use the provided new_sampler.

Parameters

new_sampler (Sampler) – The sampler to use for the current dataset.

Returns

Dataset, that uses new_sampler.

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_dir = "/path/to/imagefolder_directory"
>>> # Note: A SequentialSampler is created by default
>>> data = ds.ImageFolderDataset(dataset_dir)
>>>
>>> # Use a DistributedSampler instead of the SequentialSampler
>>> new_sampler = ds.DistributedSampler(10, 2)
>>> data.use_sampler(new_sampler)
zip(datasets)

Zip the datasets in the input tuple of datasets. Columns in the input datasets must not have the same name.
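The requirement on column names follows from how rows are combined. The sketch below is plain Python, assumes each row is a dict of column values, and stops at the shorter input as Python's built-in zip does (an assumption of this sketch); zip_rows is a hypothetical helper.

```python
def zip_rows(rows_a, rows_b):
    """Hypothetical helper: merge the columns of two datasets row by row."""
    for row_a, row_b in zip(rows_a, rows_b):
        shared = set(row_a) & set(row_b)
        if shared:
            # Two columns with the same name could not both be kept.
            raise ValueError(f"duplicate column names: {sorted(shared)}")
        yield {**row_a, **row_b}


images = [{"image": "img0"}, {"image": "img1"}]
labels = [{"label": 0}, {"label": 1}]
zipped = list(zip_rows(images, labels))
print(zipped)
```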

Parameters

datasets (Union[tuple, class Dataset]) – A tuple of datasets or a single class Dataset to be zipped together with this dataset.

Returns

ZipDataset, dataset zipped.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # ds1 and ds2 are instances of Dataset object
>>> # Create a dataset which is the combination of ds1 and ds2
>>> data = ds1.zip(ds2)
class mindspore.dataset.WeightedRandomSampler(weights, num_samples=None, replacement=True)[source]

Samples the elements from [0, len(weights) - 1] randomly with the given weights (probabilities).

Parameters
  • weights (list[float]) – A sequence of weights, not necessarily summing up to 1.

  • num_samples (int, optional) – Number of elements to sample (default=None, all elements).

  • replacement (bool) – If True, put the sample ID back for the next draw (default=True).
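The effect of weights and replacement can be illustrated with Python's standard random module. This is a sketch of weighted sampling in general, not MindSpore's implementation; weighted_sample_ids and its seed argument are hypothetical.

```python
import random


def weighted_sample_ids(weights, num_samples, replacement=True, seed=None):
    """Hypothetical helper: draw sample IDs from [0, len(weights) - 1] by weight."""
    rng = random.Random(seed)
    ids = list(range(len(weights)))
    if replacement:
        # Weights need not sum to 1; only their relative size matters.
        return rng.choices(ids, weights=weights, k=num_samples)
    if num_samples > len(ids):
        raise ValueError("cannot draw more unique IDs than there are weights")
    # Without replacement: remove each drawn ID before the next draw.
    remaining = dict(zip(ids, weights))
    picked = []
    for _ in range(num_samples):
        choice = rng.choices(list(remaining), weights=list(remaining.values()), k=1)[0]
        picked.append(choice)
        del remaining[choice]
    return picked


weights = [0.9, 0.01, 0.4, 0.8, 0.1, 0.1, 0.3]
with_replacement = weighted_sample_ids(weights, 4, seed=58)
without_replacement = weighted_sample_ids(weights, 4, replacement=False, seed=58)
print(with_replacement, without_replacement)
```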

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_dir = "path/to/imagefolder_directory"
>>>
>>> weights = [0.9, 0.01, 0.4, 0.8, 0.1, 0.1, 0.3]
>>>
>>> # creates a WeightedRandomSampler that will sample 4 elements with replacement (the default)
>>> sampler = ds.WeightedRandomSampler(weights, 4)
>>> data = ds.ImageFolderDataset(dataset_dir, num_parallel_workers=8, sampler=sampler)
get_num_samples()

All samplers can contain a numeric num_samples value (or it can be set to None). A child sampler can exist or be None. If a child sampler exists, then the child sampler count can be a numeric value or None. These conditions impact the resultant sampler count that is used. The following table shows the possible results from calling this function.

child sampler    num_samples    child_samples    result
T                x              y                min(x, y)
T                x              None             x
T                None           y                y
T                None           None             None
None             x              n/a              x
None             None           n/a              None
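The table above can be expressed as a small function. This is a plain-Python restatement of the table, not the library's code; resolve_num_samples and its has_child flag are hypothetical.

```python
def resolve_num_samples(num_samples, has_child=False, child_samples=None):
    """Hypothetical helper: compute the resulting sampler count from the table."""
    if not has_child:
        # No child sampler: the sampler's own num_samples (or None) is the result.
        return num_samples
    if num_samples is not None and child_samples is not None:
        # Both counts are known: the smaller one limits the result.
        return min(num_samples, child_samples)
    # At most one count is known: use it, or None if neither is set.
    return num_samples if num_samples is not None else child_samples


print(resolve_num_samples(10, has_child=True, child_samples=4))
```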

Returns

int, The number of samples, or None

mindspore.dataset.zip(datasets)[source]

Zip the datasets in the input tuple of datasets.

Parameters

datasets (tuple of class Dataset) – A tuple of datasets to be zipped together. The number of datasets must be more than 1.

Returns

DatasetOp, ZipDataset.

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_dir1 = "path/to/imagefolder_directory1"
>>> dataset_dir2 = "path/to/imagefolder_directory2"
>>> ds1 = ds.ImageFolderDataset(dataset_dir1, num_parallel_workers=8)
>>> ds2 = ds.ImageFolderDataset(dataset_dir2, num_parallel_workers=8)
>>>
>>> # Create a dataset which is the combination of ds1 and ds2
>>> data = ds.zip((ds1, ds2))