Constructing Dataset
This chapter focuses on considerations related to data processing in network migration. For basic data processing, please refer to:
Optimizing the Data Processing
Basic Process of Data Construction
The basic process of data construction consists of two main aspects: dataset loading and data augmentation.
Loading Dataset
MindSpore provides interfaces for loading many common datasets. The most commonly used ones are as follows:
| Data Interfaces | Introduction |
| --- | --- |
| Cifar10Dataset | CIFAR-10 dataset read interface (you need to download the original bin files of CIFAR-10) |
| MnistDataset | MNIST handwritten digit recognition dataset (you need to download the original files) |
| ImageFolderDataset | Dataset reading method that uses file directories as the data organization format for classification (common for ImageNet) |
| MindDataset | MindRecord data read interface |
| GeneratorDataset | Customized data read interface |
| FakeImageDataset | Constructs a fake image dataset |
For common dataset interfaces in different fields, refer to Loading interfaces to common datasets.
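For instance, a minimal sketch of reading CIFAR-10 with `Cifar10Dataset` from the table above (the directory path here is hypothetical and must contain the downloaded CIFAR-10 bin files):

```python
import mindspore.dataset as ds

# "./cifar-10-batches-bin" is a hypothetical path to the downloaded CIFAR-10 bin files.
cifar10 = ds.Cifar10Dataset("./cifar-10-batches-bin", usage="train", shuffle=True)
print(cifar10.get_dataset_size())
```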
GeneratorDataset: A Custom Dataset
The basic process for constructing a custom Dataset object is as follows:
First, create an iterator class that defines three methods: `__init__`, `__getitem__`, and `__len__`.
```python
import numpy as np

class MyDataset:
    """Self-defined dataset."""
    def __init__(self, n):
        self.data = []
        self.label = []
        for _ in range(n):
            self.data.append(np.zeros((3, 4, 5)))
            self.label.append(np.ones((1)))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        data = self.data[idx]
        label = self.label[idx]
        return data, label
```
It is better not to use MindSpore operators in the iterator class: the data processing phase usually runs multiple threads, and using MindSpore operators there may cause problems.
The output of the iterator needs to be numpy arrays.
The iterator must define the `__len__` method, and the returned result must be the real dataset size. Setting it larger will cause out-of-range problems when `__getitem__` fetches values.
Then use GeneratorDataset to encapsulate the iterator class:
```python
import mindspore.dataset as ds

my_dataset = MyDataset(10)
# corresponding to torch.utils.data.DataLoader(my_dataset)
dataset = ds.GeneratorDataset(my_dataset, column_names=["data", "label"])
```
When customizing a dataset, you need to set a name for each output column, such as `column_names=["data", "label"]` above, indicating that the first output column of the iterator is `data` and the second is `label`. In the subsequent data augmentation and data iteration phases, the different columns can be processed separately by name.
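For example, the column names chosen above become the keys when fetching a row (a minimal check continuing the example; `output_numpy=True` keeps the outputs as numpy arrays):

```python
row = next(dataset.create_dict_iterator(output_numpy=True))
print(row["data"].shape, row["label"].shape)
# (3, 4, 5) (1,)
```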
All MindSpore data interfaces have some general attribute controls, and some of the common ones are described here:
| Attributes | Introduction |
| --- | --- |
| num_samples (int) | Specifies the total number of data samples |
| shuffle (bool) | Whether to shuffle the data randomly |
| sampler (Sampler) | Data sampler, supporting customized data shuffling and allocation |
| num_shards (int) | Used in distributed scenarios to divide the data into several shards, in conjunction with shard_id |
| shard_id (int) | Used in distributed scenarios to take the nth shard of data (n ranges from 0 to num_shards - 1) |
| num_parallel_workers (int) | Number of parallel threads for data processing |
For example:
```python
import mindspore.dataset as ds

dataset = ds.FakeImageDataset(num_images=1000, image_size=(32, 32, 3), num_classes=10, base_seed=0)
print(dataset.get_dataset_size())
# 1000

dataset = ds.FakeImageDataset(num_images=1000, image_size=(32, 32, 3), num_classes=10, base_seed=0,
                              num_samples=3)
print(dataset.get_dataset_size())
# 3

dataset = ds.FakeImageDataset(num_images=1000, image_size=(32, 32, 3), num_classes=10, base_seed=0,
                              num_shards=8, shard_id=0)
print(dataset.get_dataset_size())
# 1000 / 8 = 125
```
Data Processing and Augmentation
MindSpore dataset objects use the map interface for data augmentation. See the map interface:
```python
map(operations, input_columns=None, output_columns=None, column_order=None,
    num_parallel_workers=None, python_multiprocessing=False, cache=None,
    callbacks=None, max_rowsize=16, offload=None)
```
Given a list of data augmentations, the map interface applies them to the dataset object in order.
Each data augmentation operation takes one or more data columns in the dataset object as input and outputs the result of the data augmentation as one or more data columns. The first data augmentation operation takes the specified columns in input_columns as input. If there are multiple data augmentation operations in the data augmentation list, the output columns of the previous data augmentation will be used as input columns for the next data augmentation.
The column names of the output columns of the last data augmentation are specified by `output_columns`. If `output_columns` is not specified, the output column names are the same as `input_columns`.
The above description may seem tedious but, in short, `map` performs the operations specified in `operations` on some columns of the dataset. Here `operations` can be the data augmentations provided by MindSpore (audio, text, vision, and transforms; for more details, refer to Data Transforms), or a Python method, in which you can use opencv, PIL, pandas and other third-party methods. As with dataset loading, do not use MindSpore operators in it.
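For example, a minimal sketch that chains a built-in vision augmentation with a plain Python function (assuming the `mindspore.dataset.vision` module of a recent MindSpore; the `to_float` helper is a hypothetical example):

```python
import numpy as np
import mindspore.dataset as ds
import mindspore.dataset.vision as vision

def to_float(image):
    # A plain Python method used as an augmentation: no MindSpore operators inside.
    return image.astype(np.float32) / 255.0

dataset = ds.FakeImageDataset(num_images=8, image_size=(32, 32, 3), num_classes=10, base_seed=0)
# Resize runs first; its output feeds to_float, and the result overwrites the "image" column.
dataset = dataset.map(operations=[vision.Resize((48, 48)), to_float], input_columns=["image"])
```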
MindSpore also provides some common random augmentation methods: Auto augmentation. When using data augmentation in practice, it is best to read Optimizing the Data Processing in the recommended order.
At the end of data augmentation, you can use the batch operation to merge `batch_size` consecutive pieces of data in the dataset into a single batch. For details, please refer to batch. Note that the parameter `drop_remainder` needs to be set to True during training and False during inference.
```python
import mindspore.dataset as ds

dataset = ds.FakeImageDataset(num_images=1000, image_size=(32, 32, 3), num_classes=10, base_seed=0)\
    .batch(32, drop_remainder=True)
print(dataset.get_dataset_size())
# 1000 // 32 = 31

dataset = ds.FakeImageDataset(num_images=1000, image_size=(32, 32, 3), num_classes=10, base_seed=0)\
    .batch(32, drop_remainder=False)
print(dataset.get_dataset_size())
# ceil(1000 / 32) = 32
```
The batch operation can also apply some augmentation operations within the batch. For details, see YOLOv3.
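A minimal sketch of in-batch augmentation through the `per_batch_map` parameter of batch (the flip-every-image function is a hypothetical example, not the YOLOv3 augmentation):

```python
import numpy as np
import mindspore.dataset as ds

def flip_batch(images, labels, batch_info):
    # per_batch_map receives each input column as a list of samples plus a BatchInfo object,
    # and must return the processed columns as lists.
    images = [np.flip(image, axis=1) for image in images]
    return images, labels

dataset = ds.FakeImageDataset(num_images=20, image_size=(32, 32, 3), num_classes=10, base_seed=0)
dataset = dataset.batch(10, drop_remainder=True, per_batch_map=flip_batch,
                        input_columns=["image", "label"])
```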
Data Iteration
The data in a MindSpore dataset object can be obtained iteratively in the following ways.
create_dict_iterator
Creates an iterator based on the dataset object, and the output data is of dictionary type.
```python
import mindspore.dataset as ds

dataset = ds.FakeImageDataset(num_images=20, image_size=(32, 32, 3), num_classes=10, base_seed=0)
dataset = dataset.batch(10, drop_remainder=True)

iterator = dataset.create_dict_iterator()
for data_dict in iterator:
    for name in data_dict.keys():
        print(name, data_dict[name].shape)
    print("="*20)
```

```text
image (10, 32, 32, 3)
label (10,)
====================
image (10, 32, 32, 3)
label (10,)
====================
```
create_tuple_iterator
Creates an iterator based on the dataset object; the output data is a list of `numpy.ndarray`.
You can specify all column names and the order of the columns in the output by the parameter `columns`. If `columns` is not specified, the order of the columns remains unchanged.
```python
import mindspore.dataset as ds

dataset = ds.FakeImageDataset(num_images=20, image_size=(32, 32, 3), num_classes=10, base_seed=0)
dataset = dataset.batch(10, drop_remainder=True)

iterator = dataset.create_tuple_iterator()
for data_tuple in iterator:
    for data in data_tuple:
        print(data.shape)
    print("="*20)
```

```text
(10, 32, 32, 3)
(10,)
====================
(10, 32, 32, 3)
(10,)
====================
```
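The `columns` parameter can also filter and reorder the output; a minimal sketch continuing the example above:

```python
# Fetch only the "label" column of each batched row.
for (label,) in dataset.create_tuple_iterator(columns=["label"]):
    print(label.shape)
# (10,)
# (10,)
```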
Traversing Directly over dataset Objects
Note that this way of writing does not `shuffle` the data again after an epoch is traversed, which may affect precision when used in training. The two methods above are recommended when direct data iteration is needed during training.
```python
import mindspore.dataset as ds

dataset = ds.FakeImageDataset(num_images=20, image_size=(32, 32, 3), num_classes=10, base_seed=0)
dataset = dataset.batch(10, drop_remainder=True)

for data in dataset:
    for item in data:
        print(item.shape)
    print("="*20)
```

```text
(10, 32, 32, 3)
(10,)
====================
(10, 32, 32, 3)
(10,)
====================
```
The latter two methods can be used directly when the order in which the data is read is the same as the order of the inputs required by the network.

```python
for data in dataset:
    loss = net(*data)
```