Loading Dataset

Overview

MindSpore can load common datasets directly through the classes in mindspore.dataset. The following table lists the supported common datasets and their corresponding classes.

Image Dataset	Dataset Class	Description
MNIST	MnistDataset	MNIST is a large dataset of handwritten digit images. It has 60,000 training images and 10,000 test images and is often used to train various image processing systems.
CIFAR-10	Cifar10Dataset	CIFAR-10 is a small image dataset that contains 60,000 32 x 32 color images in 10 categories. Each category contains 6,000 images, of which 5,000 are training images and 1,000 are test images.
CIFAR-100	Cifar100Dataset	CIFAR-100 is similar to CIFAR-10, but it has 100 categories. Each category contains 600 images, of which 500 are training images and 100 are test images.
CelebA	CelebADataset	CelebA is a large face image dataset that contains more than 200,000 face images of celebrities. Each image has 40 feature labels.
PASCAL-VOC	VOCDataset	PASCAL-VOC is a common image dataset, which is widely used in computer vision fields such as object detection and image segmentation.
COCO	CocoDataset	COCO is a large dataset for object detection, image segmentation, and pose estimation.
CLUE	CLUEDataset	CLUE is a large Chinese semantic comprehension dataset.

MindSpore can also load datasets stored in different data formats. You can use the corresponding classes in mindspore.dataset to load data files on disk. The following table lists the supported data formats and their corresponding classes.

Data Format	Dataset Class	Description
MindRecord	MindDataset	MindRecord is a self-developed data format of MindSpore. It features efficient read/write and easy distributed processing.
Manifest	ManifestDataset	Manifest is a data format supported by Huawei ModelArts. It describes the original files and labeling information and can be used for labeling, training, and inference.
TFRecord	TFRecordDataset	TFRecord is a binary data file format defined by TensorFlow.
NumPy	NumpySlicesDataset	NumPy data source refers to the NumPy array dataset that has been read into the memory.
Text File	TextFileDataset	Text File refers to common data in text format.
CSV File	CSVDataset	CSV refers to comma-separated values. Files in this format store tabular data in plain text.

MindSpore also supports user-defined dataset loading using GeneratorDataset. You can implement your own dataset classes as required.

For details about the API for dataset loading, see MindSpore API.

Loading Common Dataset

The following describes how to load common datasets.

CIFAR-10/100 Dataset

Download the CIFAR-10 dataset and decompress it. The directory structure is as follows:

└─cifar-10-batches-bin
    ├── batches.meta.txt
    ├── data_batch_1.bin
    ├── data_batch_2.bin
    ├── data_batch_3.bin
    ├── data_batch_4.bin
    ├── data_batch_5.bin
    ├── readme.html
    └── test_batch.bin

The following example uses the Cifar10Dataset API to load the CIFAR-10 dataset, uses the sequential sampler to obtain five samples, and displays the shape and label of the corresponding image.

The methods for loading the CIFAR-100 and MNIST datasets are similar.

import mindspore.dataset as ds

DATA_DIR = "cifar-10-batches-bin/"

sampler = ds.SequentialSampler(num_samples=5)
dataset = ds.Cifar10Dataset(DATA_DIR, sampler=sampler)

for data in dataset.create_dict_iterator():
    print("Image shape:", data['image'].shape, ", Label:", data['label'])

The output is as follows:

Image shape: (32, 32, 3) , Label: 6
Image shape: (32, 32, 3) , Label: 9
Image shape: (32, 32, 3) , Label: 9
Image shape: (32, 32, 3) , Label: 4
Image shape: (32, 32, 3) , Label: 1
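
Before loading, a quick check that the binary files from the directory listing above are present can help rule out an incomplete download. The following is a minimal sketch using only the Python standard library. The helper name missing_cifar10_files is hypothetical and not part of MindSpore, and the expected file list is simply copied from the listing in this section.

```python
from pathlib import Path

# File names copied from the directory listing above (binary version of CIFAR-10).
EXPECTED_BIN_FILES = [f"data_batch_{i}.bin" for i in range(1, 6)] + ["test_batch.bin"]


def missing_cifar10_files(data_dir):
    """Return the expected .bin files that are absent from data_dir (hypothetical helper)."""
    root = Path(data_dir)
    return [name for name in EXPECTED_BIN_FILES if not (root / name).is_file()]
```

For example, missing_cifar10_files("cifar-10-batches-bin/") returns an empty list when all six files are in place.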

VOC Dataset

There are multiple versions of the VOC dataset. VOC2012 is used here as an example. Download the VOC2012 dataset and decompress it. The directory structure is as follows:

└─ VOCtrainval_11-May-2012
    └── VOCdevkit
        └── VOC2012
            ├── Annotations
            ├── ImageSets
            ├── JPEGImages
            ├── SegmentationClass
            └── SegmentationObject

The following example uses the VOCDataset API to load the VOC2012 dataset and displays the original image shape and the target shape for the segmentation and detection tasks.

import mindspore.dataset as ds

DATA_DIR = "VOCtrainval_11-May-2012/VOCdevkit/VOC2012/"

dataset = ds.VOCDataset(DATA_DIR, task="Segmentation", usage="train", num_samples=2, decode=True, shuffle=False)

print("[Segmentation]:")
for data in dataset.create_dict_iterator():
    print("image shape:", data["image"].shape)
    print("target shape:", data["target"].shape)

dataset = ds.VOCDataset(DATA_DIR, task="Detection", usage="train", num_samples=1, decode=True, shuffle=False)

print("[Detection]:")
for data in dataset.create_dict_iterator():
    print("image shape:", data["image"].shape)
    print("bbox shape:", data["bbox"].shape)

The output is as follows:

[Segmentation]:
image shape: (281, 500, 3)
target shape: (281, 500, 3)
image shape: (375, 500, 3)
target shape: (375, 500, 3)
[Detection]:
image shape: (442, 500, 3)
bbox shape: (2, 4)
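
In the detection output above, the bbox shape (2, 4) corresponds to one row per annotated object and four box values per row. The sketch below illustrates this with the standard library only: it parses a small, made-up annotation in the PASCAL VOC XML layout (object elements containing a bndbox with xmin, ymin, xmax, ymax). The XML content and the helper read_boxes are illustrative assumptions, and the coordinate convention that VOCDataset itself returns should be checked in the MindSpore API reference.

```python
import xml.etree.ElementTree as ET

# Made-up annotation in the PASCAL VOC XML layout (values are illustrative only).
SAMPLE_XML = """
<annotation>
  <object>
    <name>person</name>
    <bndbox><xmin>10</xmin><ymin>20</ymin><xmax>110</xmax><ymax>220</ymax></bndbox>
  </object>
  <object>
    <name>dog</name>
    <bndbox><xmin>50</xmin><ymin>60</ymin><xmax>150</xmax><ymax>160</ymax></bndbox>
  </object>
</annotation>
"""


def read_boxes(xml_text):
    """Return one [xmin, ymin, xmax, ymax] list per <object> (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        bndbox = obj.find("bndbox")
        boxes.append([int(bndbox.find(tag).text) for tag in ("xmin", "ymin", "xmax", "ymax")])
    return boxes


boxes = read_boxes(SAMPLE_XML)
print(len(boxes), len(boxes[0]))  # number of objects, values per box
```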

COCO Dataset

There are multiple versions of the COCO dataset. Here, the COCO2017 validation dataset is used as an example. Download the COCO2017 validation dataset together with the detection task annotations and the panoptic task annotations, and decompress them. Take only a part of the validation dataset and store it in the following directory structure:

└─ COCO
    ├── val2017
    └── annotations
        ├── instances_val2017.json
        ├── panoptic_val2017.json
        └── person_keypoints_val2017.json

The following example uses the CocoDataset API to load the COCO dataset, and displays the data when object detection, stuff segmentation, keypoint detection, and panoptic segmentation tasks are specified.

import mindspore.dataset as ds

DATA_DIR = "COCO/val2017/"
ANNOTATION_FILE = "COCO/annotations/instances_val2017.json"
KEYPOINT_FILE = "COCO/annotations/person_keypoints_val2017.json"
PANOPTIC_FILE = "COCO/annotations/panoptic_val2017.json"

dataset = ds.CocoDataset(DATA_DIR, annotation_file=ANNOTATION_FILE, task="Detection", num_samples=1)
for data in dataset.create_dict_iterator():
    print("Detection:", data.keys())

dataset = ds.CocoDataset(DATA_DIR, annotation_file=ANNOTATION_FILE, task="Stuff", num_samples=1)
for data in dataset.create_dict_iterator():
    print("Stuff:", data.keys())

dataset = ds.CocoDataset(DATA_DIR, annotation_file=KEYPOINT_FILE, task="Keypoint", num_samples=1)
for data in dataset.create_dict_iterator():
    print("Keypoint:", data.keys())

dataset = ds.CocoDataset(DATA_DIR, annotation_file=PANOPTIC_FILE, task="Panoptic", num_samples=1)
for data in dataset.create_dict_iterator():
    print("Panoptic:", data.keys())

The output is as follows:

Detection: dict_keys(['image', 'bbox', 'category_id', 'iscrowd'])
Stuff: dict_keys(['image', 'segmentation', 'iscrowd'])
Keypoint: dict_keys(['image', 'keypoints', 'num_keypoints'])
Panoptic: dict_keys(['image', 'bbox', 'category_id', 'iscrowd', 'area'])
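
The column names printed for the Detection task (bbox, category_id, iscrowd) mirror fields of the same names in a COCO annotation file. As a rough illustration of how per-annotation records relate to per-image columns, the sketch below groups a made-up annotation fragment by image id using only the standard library. The sample values and the helper group_by_image are assumptions for illustration, and this is not how CocoDataset is implemented.

```python
import json
from collections import defaultdict

# Made-up fragment in the COCO instances layout (values are illustrative only).
SAMPLE_JSON = """
{
  "images": [{"id": 1, "file_name": "a.jpg"}, {"id": 2, "file_name": "b.jpg"}],
  "annotations": [
    {"id": 10, "image_id": 1, "category_id": 18, "bbox": [10, 20, 30, 40], "iscrowd": 0},
    {"id": 11, "image_id": 1, "category_id": 1, "bbox": [5, 5, 50, 80], "iscrowd": 0},
    {"id": 12, "image_id": 2, "category_id": 1, "bbox": [0, 0, 10, 10], "iscrowd": 1}
  ]
}
"""


def group_by_image(coco):
    """Map each image id to its bbox, category_id and iscrowd lists (hypothetical helper)."""
    grouped = defaultdict(lambda: {"bbox": [], "category_id": [], "iscrowd": []})
    for ann in coco["annotations"]:
        entry = grouped[ann["image_id"]]
        for key in ("bbox", "category_id", "iscrowd"):
            entry[key].append(ann[key])
    return dict(grouped)


per_image = group_by_image(json.loads(SAMPLE_JSON))
print(per_image[1]["category_id"])
```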

Loading Datasets in Specific Format

The following describes how to load dataset files in specific formats.

MindRecord

MindRecord is a data format defined by MindSpore. Using MindRecord can improve performance.

For details about how to convert a dataset into the MindRecord data format, see Data Format Conversion.

The following example uses the MindDataset API to load MindRecord files, and displays labels of the loaded data.

import mindspore.dataset as ds

DATA_FILE = ["mindrecord_file_0", "mindrecord_file_1", "mindrecord_file_2"]
mindrecord_dataset = ds.MindDataset(DATA_FILE)

for data in mindrecord_dataset.create_dict_iterator(output_numpy=True):
    print(data["label"])

Manifest

Manifest is a data file format supported by Huawei ModelArts. For details, see Specifications for Importing the Manifest File.

The following example uses the ManifestDataset API to load a Manifest file, and displays labels of the loaded data.

import mindspore.dataset as ds

DATA_FILE = "manifest_file"
manifest_dataset = ds.ManifestDataset(DATA_FILE)

for data in manifest_dataset.create_dict_iterator():
    print(data["label"])

TFRecord

TFRecord is a binary data file format defined by TensorFlow.

The following example uses the TFRecordDataset API to load TFRecord files and introduces two methods for setting the format of datasets.

Specify the dataset path or TFRecord file list to create a TFRecordDataset object.

import mindspore.dataset as ds

DATA_FILE = ["tfrecord_file_0", "tfrecord_file_1", "tfrecord_file_2"]
tfrecord_dataset = ds.TFRecordDataset(DATA_FILE)

Write a schema file or create a schema object to set the dataset format and features.

Write a schema file.

Write the dataset format and features to the schema file in JSON format. The following is an example:

{
 "columns": {
     "image": {
         "type": "uint8",
         "rank": 1
         },
     "label" : {
         "type": "string",
         "rank": 1
         },
     "id" : {
         "type": "int64",
         "rank": 0
         }
     }
 }

columns: column information field, which needs to be defined based on the actual column name of the dataset. In the preceding example, the dataset columns are image, label, and id.

When creating the TFRecordDataset object, pass the path of the schema file.

SCHEMA_DIR = "dataset_schema_path/schema.json"
tfrecord_dataset = ds.TFRecordDataset(DATA_FILE, schema=SCHEMA_DIR)
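
Because the schema file must be valid JSON, one way to avoid syntax slips such as a missing comma is to build the schema as a Python dict and let the json module serialize it. The sketch below assumes the same three columns as the example above. The helper write_schema and the temporary output location are illustrative choices, not part of MindSpore.

```python
import json
import os
import tempfile

# Same columns as the JSON example above.
SCHEMA = {
    "columns": {
        "image": {"type": "uint8", "rank": 1},
        "label": {"type": "string", "rank": 1},
        "id": {"type": "int64", "rank": 0},
    }
}


def write_schema(schema, path):
    """Serialize the schema dict to path as JSON (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(schema, f, indent=4)
    return path


# Write to a temporary directory and read the file back to confirm it parses.
schema_path = write_schema(SCHEMA, os.path.join(tempfile.mkdtemp(), "schema.json"))
with open(schema_path, encoding="utf-8") as f:
    loaded = json.load(f)
print(list(loaded["columns"]))
```

The resulting path can then be passed as the schema argument in the same way as SCHEMA_DIR above.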

Create a schema object.

Create a schema object, add user-defined fields to the schema object, and pass the schema object when creating a dataset object.

from mindspore import dtype as mstype
schema = ds.Schema()
schema.add_column('image', de_type=mstype.uint8)
schema.add_column('label', de_type=mstype.int32)
tfrecord_dataset = ds.TFRecordDataset(DATA_FILE, schema=schema)

NumPy

If all data has been read into the memory, you can directly use the NumpySlicesDataset class to load the data.

The following examples describe how to use NumpySlicesDataset to load array, list, and dict data.

Load NumPy array data.

import numpy as np
import mindspore.dataset as ds

np.random.seed(6)
features, labels = np.random.sample((4, 2)), np.random.sample((4, 1))

data = (features, labels)
dataset = ds.NumpySlicesDataset(data, column_names=["col1", "col2"], shuffle=False)

for data in dataset:
    print(data[0], data[1])

The output is as follows:

[0.89286015 0.33197981] [0.33540785]
[0.82122912 0.04169663] [0.62251943]
[0.10765668 0.59505206] [0.43814143]
[0.52981736 0.41880743] [0.73588211]

Load Python list data.

import mindspore.dataset as ds

data1 = [[1, 2], [3, 4]]

dataset = ds.NumpySlicesDataset(data1, column_names=["col1"], shuffle=False)

for data in dataset:
    print(data[0])

The output is as follows:

[1 2]
[3 4]

Load Python dict data.

import mindspore.dataset as ds

data1 = {"a": [1, 2], "b": [3, 4]}

dataset = ds.NumpySlicesDataset(data1, column_names=["col1", "col2"], shuffle=False)

for data in dataset.create_dict_iterator():
    print(data)

The output is as follows:

{'col1': Tensor(shape=[], dtype=Int64, value= 1), 'col2': Tensor(shape=[], dtype=Int64, value= 3)}
{'col1': Tensor(shape=[], dtype=Int64, value= 2), 'col2': Tensor(shape=[], dtype=Int64, value= 4)}
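
The dict output above shows that each key supplies one column and that row i takes the i-th element of every column, with column_names applied in order. The following plain-Python sketch mimics that slicing for illustration only. It is not MindSpore's implementation, and slice_rows is a hypothetical helper.

```python
def slice_rows(columns, column_names):
    """Yield one dict per row, pairing column_names with the dict's values in order."""
    for values in zip(*columns.values()):
        yield dict(zip(column_names, values))


data1 = {"a": [1, 2], "b": [3, 4]}
rows = list(slice_rows(data1, ["col1", "col2"]))
print(rows)  # [{'col1': 1, 'col2': 3}, {'col1': 2, 'col2': 4}]
```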

CSV

The following example uses CSVDataset to load CSV dataset files, and displays labels of the loaded data.

The method of loading a text dataset file is similar to that of loading a CSV file.

import mindspore.dataset as ds

DATA_FILE = ["csv_file_0", "csv_file_1", "csv_file_2"]
csv_dataset = ds.CSVDataset(DATA_FILE)

for data in csv_dataset.create_dict_iterator(output_numpy=True):
    print(data["1"])
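
In the example above, "1" is a column name used as a dictionary key. The sketch below assumes that, when column_names is not given, the column names come from the first row of the CSV file, which is worth confirming in the MindSpore API reference. It uses the standard library csv.DictReader, which behaves this way, on made-up content to show header-keyed access; it does not call CSVDataset.

```python
import csv
import io

# Made-up CSV content whose header row names the columns "1", "2", "3".
CSV_TEXT = "1,2,3\n10,20,30\n40,50,60\n"

# DictReader takes field names from the first row, so each row is keyed by header.
rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
first_column = [row["1"] for row in rows]
print(first_column)  # ['10', '40']
```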

Loading User-defined Dataset

For datasets that MindSpore cannot load directly, you can construct a GeneratorDataset object to load them in a customized way, or convert them into the MindRecord data format. The following demonstrates several methods for loading user-defined datasets. For ease of comparison, the generated random data is kept the same across the examples.

Constructing Dataset Generator Function

Construct a generator function that defines how data is returned, and then use this function to construct the user-defined dataset object. This method is suitable for simple scenarios.

import numpy as np
import mindspore.dataset as ds

np.random.seed(58)
data = np.random.sample((5, 2))
label = np.random.sample((5, 1))

def GeneratorFunc():
    for i in range(5):
        yield (data[i], label[i])

dataset = ds.GeneratorDataset(GeneratorFunc, ["data", "label"])

for sample in dataset.create_dict_iterator():
    print(sample["data"], sample["label"])

The output is as follows:

[0.36510558 0.45120592] [0.78888122]
[0.49606035 0.07562207] [0.38068183]
[0.57176158 0.28963401] [0.16271622]
[0.30880446 0.37487617] [0.54738768]
[0.81585667 0.96883469] [0.77994068]

Constructing Iterable Dataset Class

Construct a dataset class that implements the __iter__ and __next__ methods, and then use an object of this class to construct the user-defined dataset object. Compared with a generator function, a dataset class allows more customization.

import numpy as np
import mindspore.dataset as ds

class IterDatasetGenerator:
    def __init__(self):
        np.random.seed(58)
        self.__index = 0
        self.__data = np.random.sample((5, 2))
        self.__label = np.random.sample((5, 1))

    def __next__(self):
        if self.__index >= len(self.__data):
            raise StopIteration
        else:
            item = (self.__data[self.__index], self.__label[self.__index])
            self.__index += 1
            return item

    def __iter__(self):
        self.__index = 0
        return self

    def __len__(self):
        return len(self.__data)

dataset_generator = IterDatasetGenerator()
dataset = ds.GeneratorDataset(dataset_generator, ["data", "label"], shuffle=False)

for data in dataset.create_dict_iterator():
    print(data["data"], data["label"])

The output is as follows:

[0.36510558 0.45120592] [0.78888122]
[0.49606035 0.07562207] [0.38068183]
[0.57176158 0.28963401] [0.16271622]
[0.30880446 0.37487617] [0.54738768]
[0.81585667 0.96883469] [0.77994068]
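
Resetting the index in __iter__ is what lets the same object be traversed more than once, for example across epochs. A reduced sketch with a plain Python list shows the effect. The class name ResettableIterator is hypothetical and stands in for IterDatasetGenerator without the NumPy data.

```python
class ResettableIterator:
    """Minimal iterable with the same __iter__/__next__ pattern as above (illustrative)."""

    def __init__(self, items):
        self._items = list(items)
        self._index = 0

    def __iter__(self):
        self._index = 0  # restart from the first item on every new traversal
        return self

    def __next__(self):
        if self._index >= len(self._items):
            raise StopIteration
        item = self._items[self._index]
        self._index += 1
        return item


it = ResettableIterator(["a", "b", "c"])
first_pass = list(it)
second_pass = list(it)  # works again because __iter__ reset the index
print(first_pass == second_pass)  # True
```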

Constructing Random Accessible Dataset Class

Construct a dataset class that implements the __getitem__ method, and then use an object of this class to construct the user-defined dataset object. This method is suitable for distributed training.

import numpy as np
import mindspore.dataset as ds

class GetDatasetGenerator:
    def __init__(self):
        np.random.seed(58)
        self.__data = np.random.sample((5, 2))
        self.__label = np.random.sample((5, 1))

    def __getitem__(self, index):
        return (self.__data[index], self.__label[index])

    def __len__(self):
        return len(self.__data)

dataset_generator = GetDatasetGenerator()
dataset = ds.GeneratorDataset(dataset_generator, ["data", "label"], shuffle=False)

for data in dataset.create_dict_iterator():
    print(data["data"], data["label"])

The output is as follows:

[0.36510558 0.45120592] [0.78888122]
[0.49606035 0.07562207] [0.38068183]
[0.57176158 0.28963401] [0.16271622]
[0.30880446 0.37487617] [0.54738768]
[0.81585667 0.96883469] [0.77994068]

To perform distributed training, you additionally need a sampler class that implements the __iter__ method, which returns an iterator over the indices of the data to sample. The code that needs to be added is as follows:

import math

class MySampler():
    def __init__(self, dataset, local_rank, world_size):
        self.__num_data = len(dataset)
        self.__local_rank = local_rank
        self.__world_size = world_size
        self.samples_per_rank = int(math.ceil(self.__num_data / float(self.__world_size)))
        self.total_num_samples = self.samples_per_rank * self.__world_size

    def __iter__(self):
        indices = list(range(self.__num_data))
        indices.extend(indices[:self.total_num_samples-len(indices)])
        indices = indices[self.__local_rank:self.total_num_samples:self.__world_size]
        return iter(indices)

    def __len__(self):
        return self.samples_per_rank

dataset_generator = GetDatasetGenerator()
sampler = MySampler(dataset_generator, local_rank=0, world_size=2)
dataset = ds.GeneratorDataset(dataset_generator, ["data", "label"], shuffle=False, sampler=sampler)

for data in dataset.create_dict_iterator():
    print(data["data"], data["label"])

The output is as follows:

[0.36510558 0.45120592] [0.78888122]
[0.57176158 0.28963401] [0.16271622]
[0.81585667 0.96883469] [0.77994068]
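
To see how the sampler splits indices across ranks, the slicing logic in MySampler.__iter__ can be isolated into a plain function. The sketch below uses the same arithmetic as the class above and needs only the standard library; shard_indices is a hypothetical helper name.

```python
import math


def shard_indices(num_data, local_rank, world_size):
    """Return the sample indices assigned to local_rank (same arithmetic as MySampler)."""
    samples_per_rank = math.ceil(num_data / world_size)
    total_num_samples = samples_per_rank * world_size
    indices = list(range(num_data))
    # Pad by repeating leading indices so every rank gets the same number of samples.
    indices.extend(indices[:total_num_samples - len(indices)])
    return indices[local_rank:total_num_samples:world_size]


for rank in range(2):
    print(rank, shard_indices(5, rank, 2))
# 0 [0, 2, 4]
# 1 [1, 3, 0]
```

Rank 0 receives indices 0, 2, and 4, matching the three rows printed above. Rank 1 receives 1, 3, and a repeated 0 as padding, so both ranks process the same number of samples.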