Converting Dataset to MindRecord


In MindSpore, a dataset used to train a network model can be converted into the MindSpore-specific data format (MindSpore Record format), which makes the data easier to save and load. The goal is to normalize the user's dataset so that it can be read through the MindDataset interface and used during training.


In addition, MindSpore is optimized for performance in some scenarios, and using the MindSpore Record format can reduce disk I/O and network I/O overhead, which gives users a better experience.

The MindSpore Record format has the following features:

  1. Unified storage and access of user data are implemented, simplifying training data loading.

  2. Data is aggregated for storage, which can be efficiently read, managed and moved.

  3. Data encoding and decoding are efficient and transparent to users.

  4. The partition size can be controlled flexibly to support distributed training.

Record File Structure

As shown in the following figure, a MindSpore Record file consists of a data file and an index file.

[Figure: structure of a MindSpore Record file]

The data file contains file headers, scalar data pages, and block data pages, which store the user's training data after normalization. A single MindSpore Record file should preferably be smaller than 20 GB; a larger dataset can be stored as multiple MindSpore Record files.
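As a rough illustration of the size recommendation, the following sketch estimates how many files a dataset would need. The helper name `estimate_shard_num` is hypothetical, reading the limit as 20 * 1024**3 bytes is an assumption, and the calculation assumes that records are spread evenly across files.

```python
import math

# Recommended upper bound per file, taken from the text above.
# Interpreting it as 20 * 1024**3 bytes is an assumption.
MAX_FILE_BYTES = 20 * 1024**3


def estimate_shard_num(total_bytes, max_file_bytes=MAX_FILE_BYTES):
    """Return the smallest file count that keeps each file at or below the limit."""
    if total_bytes <= 0:
        return 1
    return max(1, math.ceil(total_bytes / max_file_bytes))


# Example: a dataset expected to occupy roughly 50 GB once serialized.
shard_num = estimate_shard_num(50 * 1024**3)
print(shard_num)  # 3
```

The result is the kind of value one would pass as `shard_num` when creating the `FileWriter` used later in this document.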

The index file contains index information generated from scalar data (such as image labels and image file names), which makes it convenient to retrieve records and to compute dataset statistics.
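To make the role of the index concrete, here is a conceptual sketch that models an index over scalar fields with Python's built-in `sqlite3`. The table name, column names, and sample values are all hypothetical, and this document does not state the real on-disk layout of the index file, so treat this purely as an illustration of why indexing scalar fields helps retrieval and statistics.

```python
import sqlite3

# Hypothetical scalar metadata for three records; "byte_offset" stands in
# for wherever the record's block data would live in the data file.
records = [("1.jpg", 1, 0), ("2.jpg", 2, 4096), ("3.jpg", 1, 8192)]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE idx (file_name TEXT, label INTEGER, byte_offset INTEGER)")
conn.executemany("INSERT INTO idx VALUES (?, ?, ?)", records)
conn.execute("CREATE INDEX idx_label ON idx (label)")

# Retrieval: locate every record with label 1 without scanning the data file.
offsets = [row[0] for row in conn.execute(
    "SELECT byte_offset FROM idx WHERE label = ? ORDER BY byte_offset", (1,))]
print(offsets)  # [0, 8192]

# Statistics: count records per label from the index alone.
counts = dict(conn.execute("SELECT label, COUNT(*) FROM idx GROUP BY label"))
print(counts)
conn.close()
```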

The specific purposes of file headers, scalar data pages, and block data pages in data files are as follows:

  • File header: the meta information of the MindSpore Record file. It mainly stores the file header size, scalar data page size, block data page size, schema information, index fields, statistics, file segment information, the correspondence between scalar data and block data, and so on.

  • Scalar data page: mainly stores integer, string, and floating-point data, such as the label, file name, length, and width of an image. In other words, information that is suitable for storing as scalars is saved here.

  • Block data page: mainly stores binary strings, NumPy arrays, and similar data, such as the binary image files themselves or dictionaries converted into text.
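The split between scalar and block data described above can be sketched in plain Python. The rule used here (a field with a `"shape"` entry or of type `"bytes"` goes to block data, everything else to scalar data) is an assumption derived from the descriptions in this list, and `split_fields` is a hypothetical helper, not part of the MindSpore API.

```python
# Types that fit on a scalar data page, per the description above.
SCALAR_TYPES = {"int32", "int64", "float32", "float64", "string"}


def split_fields(schema):
    """Split schema field names into (scalar_fields, block_fields)."""
    scalar_fields, block_fields = [], []
    for name, spec in schema.items():
        # A field with a "shape" is an array; "bytes" is a binary string.
        is_block = "shape" in spec or spec["type"] not in SCALAR_TYPES
        (block_fields if is_block else scalar_fields).append(name)
    return scalar_fields, block_fields


schema = {"file_name": {"type": "string"},
          "label": {"type": "int32"},
          "data": {"type": "bytes"}}

scalar_fields, block_fields = split_fields(schema)
print(scalar_fields)  # ['file_name', 'label']
print(block_fields)   # ['data']
```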

Note that neither the data files nor the index files currently support renaming.

Converting Dataset to Record Format

The following describes how to convert CV and NLP datasets to the MindSpore Record file format and how to read MindSpore Record files through the MindDataset interface.

Converting CV class dataset

This example takes a CV dataset containing 100 records and shows how to convert it to the MindSpore Record file format and read it through the MindDataset interface.

First, create a dataset of 100 images and save it. Each sample contains three fields: file_name (string), label (integer), and data (binary). Then use the MindDataset interface to read the MindSpore Record file.

  1. Generate 100 images and convert them to MindSpore Record file format.

[1]:
from PIL import Image
from io import BytesIO

import mindspore.mindrecord as record


# The full path to the output MindSpore Record file
MINDRECORD_FILE = "test.mindrecord"

# Define the contained fields
cv_schema = {"file_name": {"type": "string"},
             "label": {"type": "int32"},
             "data": {"type": "bytes"}}

# Declare the MindSpore Record file format
writer = record.FileWriter(file_name=MINDRECORD_FILE, shard_num=1, overwrite=True)
writer.add_schema(cv_schema, "it is a cv dataset")
writer.add_index(["file_name", "label"])

# Build a dataset
data = []
for i in range(100):
    sample = {}
    # Render a white square image and encode it as JPEG in memory
    white_io = BytesIO()
    Image.new('RGB', ((i+1)*10, (i+1)*10), (255, 255, 255)).save(white_io, 'JPEG')
    image_bytes = white_io.getvalue()
    sample['file_name'] = str(i+1) + ".jpg"
    sample['label'] = i+1
    sample['data'] = image_bytes

    data.append(sample)
    if i % 10 == 0:
        writer.write_raw_data(data)
        data = []

if data:
    writer.write_raw_data(data)

writer.commit()
[1]:
MSRStatus.SUCCESS

As can be seen from the printed result MSRStatus.SUCCESS above, the dataset conversion was successful. In the examples that follow in this article, you can see this print result if the dataset is successfully converted.
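The loop in the cell above flushes whenever `i % 10 == 0`, so the first call to `write_raw_data` receives a single sample, the next nine calls receive ten samples each, and the last nine samples are written by the trailing `if data:` check. If evenly sized batches are preferred, one possible approach is a small generator like the one below. `batched` is a hypothetical helper, and the stand-in samples only carry a label.

```python
def batched(samples, batch_size):
    """Yield consecutive lists of at most batch_size samples."""
    batch = []
    for sample in samples:
        batch.append(sample)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the remainder
        yield batch


# Stand-in samples; in the real cell each would be a dict matching cv_schema.
samples = [{"label": i + 1} for i in range(100)]
sizes = [len(batch) for batch in batched(samples, 10)]
print(sizes)  # [10, 10, 10, 10, 10, 10, 10, 10, 10, 10]
```

With a writer in hand, each yielded batch would be passed to `writer.write_raw_data(batch)` before the final `writer.commit()`.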

  2. Read the MindSpore Record file via the MindDataset interface.

[2]:
import mindspore.dataset as ds
import mindspore.dataset.vision as vision

# Read the MindSpore Record file format
data_set = ds.MindDataset(dataset_files=MINDRECORD_FILE)
decode_op = vision.Decode()
data_set = data_set.map(operations=decode_op, input_columns=["data"], num_parallel_workers=2)

# Count the number of samples
print("Got {} samples".format(data_set.get_dataset_size()))
Got 100 samples

Converting NLP class dataset

This example first creates a MindSpore Record file with 100 records. Each sample contains eight fields, all of which are integer arrays. The MindDataset interface is then used to read the MindSpore Record file.

For ease of presentation, the preprocessing step that converts text into dictionary indices (token IDs) is omitted here.
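For readers who want a sense of what the omitted preprocessing might look like, here is a minimal sketch that maps whitespace-separated tokens to integer IDs. The vocabulary, the special tokens, and the `encode` helper are all hypothetical; a real pipeline would use its own tokenizer and vocabulary.

```python
# Hypothetical vocabulary; real pipelines build this from the corpus.
vocab = {"<pad>": 0, "<unk>": 1, "<sos>": 2, "<eos>": 3,
         "hello": 4, "world": 5}


def encode(text, add_sos=False, add_eos=False):
    """Map whitespace-separated tokens to vocabulary IDs."""
    ids = [vocab.get(token, vocab["<unk>"]) for token in text.lower().split()]
    if add_sos:
        ids = [vocab["<sos>"]] + ids
    if add_eos:
        ids = ids + [vocab["<eos>"]]
    return ids


source_sos_ids = encode("Hello world", add_sos=True)
source_eos_ids = encode("Hello there world", add_eos=True)
print(source_sos_ids)  # [2, 4, 5]
print(source_eos_ids)  # [4, 1, 5, 3]
```

Before writing, each list would be wrapped as `np.array(ids, dtype=np.int64)` so that it matches a schema field declared as `int64` with shape `[-1]`, which permits a different length per sample.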

  1. Generate 100 text records and convert them to MindSpore Record file format.

[3]:
import numpy as np
import mindspore.mindrecord as record

# The full path of the output MindSpore Record file
MINDRECORD_FILE = "test.mindrecord"

# Defines the fields that the sample data contains
nlp_schema = {"source_sos_ids": {"type": "int64", "shape": [-1]},
              "source_sos_mask": {"type": "int64", "shape": [-1]},
              "source_eos_ids": {"type": "int64", "shape": [-1]},
              "source_eos_mask": {"type": "int64", "shape": [-1]},
              "target_sos_ids": {"type": "int64", "shape": [-1]},
              "target_sos_mask": {"type": "int64", "shape": [-1]},
              "target_eos_ids": {"type": "int64", "shape": [-1]},
              "target_eos_mask": {"type": "int64", "shape": [-1]}}

# Declare the MindSpore Record file format
writer = record.FileWriter(file_name=MINDRECORD_FILE, shard_num=1, overwrite=True)
writer.add_schema(nlp_schema, "Preprocessed nlp dataset.")

# Build a virtual dataset
data = []
for i in range(100):
    sample = {"source_sos_ids": np.array([i, i + 1, i + 2, i + 3, i + 4], dtype=np.int64),
              "source_sos_mask": np.array([i * 1, i * 2, i * 3, i * 4, i * 5, i * 6, i * 7], dtype=np.int64),
              "source_eos_ids": np.array([i + 5, i + 6, i + 7, i + 8, i + 9, i + 10], dtype=np.int64),
              "source_eos_mask": np.array([19, 20, 21, 22, 23, 24, 25, 26, 27], dtype=np.int64),
              "target_sos_ids": np.array([28, 29, 30, 31, 32], dtype=np.int64),
              "target_sos_mask": np.array([33, 34, 35, 36, 37, 38], dtype=np.int64),
              "target_eos_ids": np.array([39, 40, 41, 42, 43, 44, 45, 46, 47], dtype=np.int64),
              "target_eos_mask": np.array([48, 49, 50, 51], dtype=np.int64)}
    data.append(sample)

    if i % 10 == 0:
        writer.write_raw_data(data)
        data = []

if data:
    writer.write_raw_data(data)

writer.commit()
[3]:
MSRStatus.SUCCESS
  2. Read the MindSpore Record file through the MindDataset interface.

[4]:
import mindspore.dataset as ds

# Read MindSpore Record file format
data_set = ds.MindDataset(dataset_files=MINDRECORD_FILE, shuffle=False)

# Count the number of samples
print("Got {} samples".format(data_set.get_dataset_size()))

# Print the first 10 samples
count = 0
for item in data_set.create_dict_iterator():
    print("source_sos_ids:", item["source_sos_ids"])
    count += 1
    if count == 10:
        break
Got 100 samples
source_sos_ids: [1 2 3 4 5]
source_sos_ids: [2 3 4 5 6]
source_sos_ids: [3 4 5 6 7]
source_sos_ids: [4 5 6 7 8]
source_sos_ids: [5 6 7 8 9]
source_sos_ids: [ 6  7  8  9 10]
source_sos_ids: [ 7  8  9 10 11]
source_sos_ids: [ 8  9 10 11 12]
source_sos_ids: [ 9 10 11 12 13]
source_sos_ids: [10 11 12 13 14]

Dumping Dataset to MindRecord

MindSpore provides a tool class that converts commonly used datasets to the MindSpore Record file format.

For more detailed descriptions of dataset conversion, refer to the API documentation.

Dumping the CIFAR-10 dataset

Users can convert the raw CIFAR-10 data to MindSpore Record format with the Dataset.save method and then read it through the MindDataset interface.

  1. Download the CIFAR-10 dataset and load it with Cifar10Dataset.

[5]:
from download import download
from mindspore.dataset import Cifar10Dataset

url = "https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/cifar-10-binary.tar.gz"

path = download(url, "./", kind="tar.gz", replace=True)
dataset = Cifar10Dataset("./cifar-10-batches-bin/")  # load data
Downloading data from https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/cifar-10-binary.tar.gz (162.2 MB)

file_sizes: 100%|████████████████████████████| 170M/170M [00:26<00:00, 6.34MB/s]
Extracting tar.gz file...
Successfully downloaded / unzipped to ./
  2. Call the Dataset.save interface to dump the CIFAR-10 dataset into the MindSpore Record file format.

[6]:
dataset.save("cifar10.mindrecord")
  3. Read the MindSpore Record file through the MindDataset interface.

[7]:
import os
from mindspore.dataset import MindDataset

# Read MindSpore Record file format
data_set = MindDataset(dataset_files="cifar10.mindrecord")

# Count the number of samples
print("Got {} samples".format(data_set.get_dataset_size()))

# Clean up the generated data file and its index file
if os.path.exists("cifar10.mindrecord") and os.path.exists("cifar10.mindrecord.db"):
    os.remove("cifar10.mindrecord")
    os.remove("cifar10.mindrecord.db")

Got 60000 samples
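The cleanup at the end of the cell above deletes both the data file and its `.db` index file, but only when both exist. A slightly more forgiving variant, sketched below, removes whichever of the two is present. `remove_mindrecord` is a hypothetical helper, and the demonstration uses empty placeholder files in a temporary directory rather than real MindSpore Record files.

```python
import os
import tempfile


def remove_mindrecord(path):
    """Delete a data file and its ".db" index file, whichever are present."""
    removed = []
    for candidate in (path, path + ".db"):
        if os.path.exists(candidate):
            os.remove(candidate)
            removed.append(candidate)
    return removed


# Demonstration on empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "demo.mindrecord")
    for name in (target, target + ".db"):
        open(name, "wb").close()
    removed = remove_mindrecord(target)
    still_there = os.path.exists(target) or os.path.exists(target + ".db")
    print(len(removed))  # 2
    print(still_there)   # False
```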