Data Iteration
Ascend
GPU
CPU
Data Preparation
Translator: Ming__blue
Overview
The original dataset is read into memory through a dataset loading interface and then transformed by data augmentation operations. The resulting dataset object can be iterated in two common ways:
Create an iterator for data iteration.
Pass the dataset to a model interface (such as model.train, model.eval, etc.) for iterative training or inference.
Create an iterator for data iteration
A dataset object can usually create two kinds of iterators to traverse its data: a tuple iterator and a dictionary iterator.
The tuple iterator is created with create_tuple_iterator, and the dictionary iterator is created with create_dict_iterator. Their usage is shown below.
First, create a simple dataset object for demonstration.
[1]:
import mindspore.dataset as ds
np_data = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
dataset = ds.NumpySlicesDataset(np_data, column_names=["data"], shuffle=False)
The following methods can be used to create a data iterator.
[2]:
# Create tuple iterator
print("\n create tuple iterator")
for item in dataset.create_tuple_iterator():
    print("item:\n", item[0])

# Create dictionary iterator
print("\n create dict iterator")
for item in dataset.create_dict_iterator():
    print("item:\n", item["data"])

# Traverse the dataset object directly (equivalent to creating a tuple iterator)
print("\n iterate dataset object directly")
for item in dataset:
    print("item:\n", item[0])

# Traverse the dataset object using enumerate (equivalent to creating a tuple iterator)
print("\n iterate dataset using enumerate")
for index, item in enumerate(dataset):
    print("index: {}, item:\n {}".format(index, item[0]))
create tuple iterator
item:
[[1 2]
[3 4]]
item:
[[5 6]
[7 8]]
create dict iterator
item:
[[1 2]
[3 4]]
item:
[[5 6]
[7 8]]
iterate dataset object directly
item:
[[1 2]
[3 4]]
item:
[[5 6]
[7 8]]
iterate dataset using enumerate
index: 0, item:
[[1 2]
[3 4]]
index: 1, item:
[[5 6]
[7 8]]
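The single-column example above does not make the difference between the two iterator styles very visible. The following pure-Python sketch is not MindSpore code: the column names data and label and the helpers tuple_rows and dict_rows are hypothetical stand-ins that only mimic the access pattern, assuming a tuple-style item is ordered by column and a dictionary-style item is keyed by column name.

```python
# Toy stand-in for a two-column dataset; NOT the MindSpore implementation.
COLUMN_NAMES = ["data", "label"]  # hypothetical column names
ROWS = [
    ([[1, 2], [3, 4]], 0),
    ([[5, 6], [7, 8]], 1),
]


def tuple_rows(rows):
    """Yield each row as a list ordered like COLUMN_NAMES (positional access)."""
    for row in rows:
        yield list(row)


def dict_rows(rows, column_names):
    """Yield each row as a dict keyed by column name (access by name)."""
    for row in rows:
        yield dict(zip(column_names, row))


tuple_items = list(tuple_rows(ROWS))
dict_items = list(dict_rows(ROWS, COLUMN_NAMES))

for item in tuple_items:
    # Positional access: the caller must remember the column order.
    print("tuple style:", item[0], item[1])

for item in dict_items:
    # Named access: the column order no longer matters.
    print("dict style:", item["data"], item["label"])
```

With several columns, positional access is compact but depends on column order, while access by name is more verbose but survives column reordering.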
In addition, to generate data for multiple epochs, set the input parameter num_epochs accordingly. Compared with calling the iterator interface once per epoch, setting the number of epochs directly can improve the performance of data iteration.
[3]:
# Create tuple iterator to generate data in two epochs
epoch = 2
iterator = dataset.create_tuple_iterator(num_epochs=epoch)
for i in range(epoch):
    print("epoch: ", i)
    for item in iterator:
        print("item:\n", item[0])
epoch: 0
item:
[[1 2]
[3 4]]
item:
[[5 6]
[7 8]]
epoch: 1
item:
[[1 2]
[3 4]]
item:
[[5 6]
[7 8]]
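One way to picture why a single multi-epoch iterator can be cheaper is to count how often a pipeline has to be set up. The sketch below is a toy model under that assumption only: ToyPipeline, MultiEpochIterator, and setup_calls are hypothetical names, and the code does not reflect how MindSpore builds its pipeline internally.

```python
class MultiEpochIterator:
    """Toy iterator that can be traversed once per epoch, up to num_epochs times."""

    def __init__(self, rows, num_epochs):
        self.rows = rows
        self.remaining = num_epochs

    def __iter__(self):
        # Each `for` loop over this object starts one new pass over the rows.
        if self.remaining <= 0:
            return iter(())  # toy choice: yield nothing once the epochs are used up
        self.remaining -= 1
        return iter(self.rows)


class ToyPipeline:
    """Hypothetical stand-in for a data pipeline with a costly setup step."""

    def __init__(self, rows):
        self.rows = rows
        self.setup_calls = 0

    def _setup(self):
        # Placeholder for expensive work; here it only counts invocations.
        self.setup_calls += 1

    def one_epoch_iterator(self):
        self._setup()
        return iter(self.rows)

    def multi_epoch_iterator(self, num_epochs):
        self._setup()
        return MultiEpochIterator(self.rows, num_epochs)


rows = [[1, 2], [3, 4]]
epochs = 2

# Variant A: ask for a new iterator in every epoch.
repeated = ToyPipeline(rows)
for _ in range(epochs):
    for item in repeated.one_epoch_iterator():
        pass

# Variant B: create one iterator up front and reuse it in every epoch.
reused = ToyPipeline(rows)
iterator = reused.multi_epoch_iterator(epochs)
seen = []
for _ in range(epochs):
    for item in iterator:
        seen.append(item)

print("setups with one iterator per epoch:", repeated.setup_calls)
print("setups with one multi-epoch iterator:", reused.setup_calls)
```

In this toy model the setup step runs once per epoch in variant A and only once in variant B, while both variants visit the same items.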
The default output data type of the iterator is mindspore.Tensor. To get data of type numpy.ndarray, set the parameter output_numpy=True.
[4]:
# The default output type is mindspore.Tensor
for item in dataset.create_tuple_iterator():
    print("dtype: ", type(item[0]), "\nitem:", item[0])

# Set the output type to numpy.ndarray
for item in dataset.create_tuple_iterator(output_numpy=True):
    print("dtype: ", type(item[0]), "\nitem:", item[0])
dtype: <class 'mindspore.common.tensor.Tensor'>
item: [[1 2]
[3 4]]
dtype: <class 'mindspore.common.tensor.Tensor'>
item: [[5 6]
[7 8]]
dtype: <class 'numpy.ndarray'>
item: [[1 2]
[3 4]]
dtype: <class 'numpy.ndarray'>
item: [[5 6]
[7 8]]
For more detailed instructions, please refer to the create_tuple_iterator and create_dict_iterator API documentation.
Pass in the Model interface for iterative training or inference
After the dataset object is created, it can be passed into the Model interface, which iterates over the data internally and sends it to the network for training or inference.
[5]:
import numpy as np
from mindspore import ms_function
from mindspore import context, nn, Model
import mindspore.dataset as ds
import mindspore.ops as ops


def create_dataset():
    # Build a small float16 dataset with a single column named "data"
    np_data = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
    np_data = np.array(np_data, dtype=np.float16)
    dataset = ds.NumpySlicesDataset(np_data, column_names=["data"], shuffle=False)
    return dataset


class Net(nn.Cell):
    def __init__(self):
        super(Net, self).__init__()
        self.relu = ops.ReLU()
        self.print = ops.Print()

    @ms_function
    def construct(self, x):
        # Print each item received by the network, then apply ReLU
        self.print(x)
        return self.relu(x)


if __name__ == "__main__":
    # Running on CPU, GPU or Ascend is supported
    context.set_context(mode=context.GRAPH_MODE)
    dataset = create_dataset()
    network = Net()
    model = Model(network)

    # Do training; data is sunk to the device by default
    model.train(epoch=1, train_dataset=dataset, dataset_sink_mode=True)
Tensor(shape=[2, 2], dtype=Float16, value=
[[ 1.0000e+00 2.0000e+00]
[ 3.0000e+00 4.0000e+00]])
Tensor(shape=[2, 2], dtype=Float16, value=
[[ 5.0000e+00 6.0000e+00]
[ 7.0000e+00 8.0000e+00]])
The dataset_sink_mode parameter of the Model interface controls whether data is sunk to the Device. If sinking is disabled, the iterator described above is created internally, and the data is traversed item by item and sent to the network. If sinking is enabled, the data is sent directly to the Device internally and then fed to the network for iterative training or inference.
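The two modes can be pictured with a purely conceptual sketch. It does not use MindSpore and does not describe its internals: toy_network, run_without_sink, run_with_sink, and the deque standing in for a device-side queue are all hypothetical, chosen only to contrast a host-side loop that feeds items one by one with a loop that reads from data already handed over.

```python
from collections import deque


def toy_network(x):
    """Hypothetical stand-in for one network step: element-wise ReLU on a flat list."""
    return [max(0, v) for v in x]


def run_without_sink(dataset):
    """Host-side loop: fetch one item at a time and hand it to the network."""
    outputs = []
    for item in dataset:  # plays the role of the internally created iterator
        outputs.append(toy_network(item))
    return outputs


def run_with_sink(dataset):
    """Hand all data to a 'device' queue first; the step loop then reads from that queue."""
    device_queue = deque(dataset)  # stand-in for the transfer to the Device
    outputs = []
    while device_queue:
        outputs.append(toy_network(device_queue.popleft()))
    return outputs


toy_dataset = [[-1, 2], [3, -4]]
no_sink_out = run_without_sink(toy_dataset)
sink_out = run_with_sink(toy_dataset)

print("without sink:", no_sink_out)
print("with sink:   ", sink_out)
```

Both functions produce the same outputs in this sketch; the only difference is where the data sits while the network steps through it.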
For more detailed usage, please refer to Model Basic Usage.