Loading the Dataset
Overview
MindSpore helps you load common datasets, datasets of specific data formats, or custom datasets. Before loading a dataset, you need to import the required library mindspore.dataset
.
import mindspore.dataset as ds
Loading Common Datasets
MindSpore can load common standard datasets. The following table lists the supported datasets:
Dataset |
Description |
---|---|
ImageNet |
An image database organized based on the WordNet hierarchical structure. Each node in the hierarchical structure is represented by hundreds of images. |
MNIST |
A large database of handwritten digit images, which is usually used to train various image processing systems. |
CIFAR-10 |
A collection of images that are commonly used to train machine learning and computer vision algorithms. The CIFAR-10 dataset contains 60,000 32x32 color images in 10 different classes. |
CIFAR-100 |
The dataset is similar to CIFAR-10. The difference is that this dataset has 100 classes, and each class contains 600 images, including 500 training images and 100 test images. |
PASCAL-VOC |
The data content is diversified and can be used to train computer vision models (such as classification, positioning, detection, segmentation, and action recognition). |
CelebA |
CelebA face dataset contains tens of thousands of face images of celebrities with 40 attribute annotations, which are usually used for face-related training tasks. |
The procedure for loading common datasets is as follows. The following describes how to create the CIFAR-10
object to load supported datasets.
Download and decompress the CIFAR-10 Dataset. The dataset in binary format (CIFAR-10 binary version) is used.
Configure the dataset directory and define the dataset instance to be loaded.
DATA_DIR = "cifar10_dataset_dir/" cifar10_dataset = ds.Cifar10Dataset(DATA_DIR)
Create an iterator and read data through the iterator.
for data in cifar10_dataset.create_dict_iterator(): # In CIFAR-10 dataset, each dictionary of data has keys "image" and "label". print(data["image"]) print(data["label"])
Loading Datasets of a Specific Data Format
MindSpore Data Format
MindSpore supports reading of datasets stored in MindSpore data format, that is, MindRecord
which has better performance and features.
For details about how to convert datasets to the MindSpore data format, see the Converting the Dataset to MindSpore Data Format.
To read a dataset using the MindDataset
object, perform the following steps:
Create
MindDataset
for reading data.CV_FILE_NAME = os.path.join(MODULE_PATH, "./imagenet.mindrecord") data_set = ds.MindDataset(dataset_file=CV_FILE_NAME)
In the preceding information:
dataset_file
: specifies the MindRecord file, including the path and file name.Create a dictionary iterator and read data records through the iterator.
num_iter = 0 for data in data_set.create_dict_iterator(): print(data["label"]) num_iter += 1
Manifest
Data Format
Manifest
is a data format file supported by Huawei ModelArts. For details, see https://support.huaweicloud.com/engineers-modelarts/modelarts_23_0009.html.
MindSpore provides dataset classes for datasets in Manifest format. Run the following commands to configure the dataset directory and define the dataset instance to be loaded:
DATA_DIR = "manifest_dataset_path"
manifest_dataset = ds.ManifestDataset(DATA_DIR)
Currently, ManifestDataset supports only datasets of images and labels. The default column names are “image” and “label”.
TFRecord
Data Format
MindSpore can also read datasets in the TFRecord
data format through the TFRecordDataset
object.
Input the dataset path or the .tfrecord file list to create the
TFRecordDataset
.DATA_DIR = ["tfrecord_dataset_path/train-0000-of-0001.tfrecord"] dataset = ds.TFRecordDataset(DATA_DIR)
Create schema files or schema classes to set the dataset format and features.
The following is an example of the schema file:
{ "datasetType": "TF", "numRows": 3, "columns": { "image": { "type": "uint8", "rank": 1 }, "label" : { "type": "int64", "rank": 1 } } }
In the preceding information:
datasetType
: data format. TF indicates the TFRecord data format.
columns
: column information field, which is defined based on the actual column names of the dataset. In the preceding schema file example, the dataset columns are image and label.
numRows
: row information field, which controls the maximum number of rows for loading data. If the number of defined rows is greater than the actual number of rows, the actual number of rows prevails during loading.When creating the TFRecordDataset, input the schema file path. An example is as follows:
DATA_DIR = ["tfrecord_dataset_path/train-0000-of-0001.tfrecord"] SCHEMA_DIR = "dataset_schema_path/schema.json" dataset = ds.TFRecordDataset(DATA_DIR, schema=SCHEMA_DIR)
An example of creating a schema class is as follows:
import mindspore.common.dtype as mstype schema = ds.Schema() schema.add_column('image', de_type=mstype.uint8) # Binary data usually use uint8 here. schema.add_column('label', de_type=mstype.int32) dataset = ds.TFRecordDataset(DATA_DIR, schema=schema)
Create a dictionary iterator and read data through the iterator.
for data in dataset.create_dict_iterator(): # The dictionary of data has keys "image" and "label" which are consistent with columns names in its schema. print(data["image"]) print(data["label"])
Loading a Custom Dataset
You can load a custom dataset using the GeneratorDataset
object.
Define a function (for example,
Generator1D
) to generate a dataset.The custom generation function returns the objects that can be called. Each time, tuples of
numpy array
are returned as a row of data.An example of a custom function is as follows:
import numpy as np # Import numpy lib. def Generator1D(): for i in range(64): yield (np.array([i]),) # Notice, tuple of only one element needs following a comma at the end.
Transfer
Generator1D
toGeneratorDataset
to create a dataset and setcolumn
to data.dataset = ds.GeneratorDataset(Generator1D, ["data"])
After creating a dataset, create an iterator for the dataset to obtain the corresponding data. Iterator creation methods are as follows:
Create an iterator whose return value is of the sequence type.
for data in dataset.create_tuple_iterator(): # each data is a sequence print(data[0])
Create an iterator whose return value is of the dictionary type.
for data in dataset.create_dict_iterator(): # each data is a dictionary print(data["data"])