mindspore.mindrecord

Introduction to MindRecord.

MindRecord is an efficient data storage and reading module provided by MindSpore. This module provides several methods to help users convert various public datasets into the MindRecord format, as well as methods to read, write, and retrieve data from MindRecord files.

The MindRecord data format makes saving and loading data more convenient. Its goal is to normalize user datasets and optimize performance across different data scenarios. Using the MindRecord format can reduce disk I/O and network I/O overhead, which makes data loading more efficient.

Users can generate MindRecord format data files using mindspore.mindrecord.FileWriter and load MindRecord format datasets using mindspore.dataset.MindDataset.

Users can also convert datasets from other formats to the MindRecord format. For more details, please refer to Converting Dataset to MindRecord. Additionally, MindRecord supports file encryption, decryption, and integrity checks to ensure the security of MindRecord format datasets.

class mindspore.mindrecord.FileWriter(file_name, shard_num=1, overwrite=False)[source]

Class to write user-defined raw data into MindRecord files.

Note

After the MindRecord file is generated, if the file name is changed, the file may fail to be read.

Parameters

file_name (str) – File name of MindRecord file.
shard_num (int, optional) – The number of MindRecord files to generate. It must be in the range [1, 1000]. Default: 1.
overwrite (bool, optional) – Whether to overwrite the file if it already exists. Default: False.

Raises

ParamValueError – If file_name, shard_num, or overwrite is invalid.

Examples

>>> from mindspore.mindrecord import FileWriter
>>>
>>> writer = FileWriter(file_name="test.mindrecord", shard_num=1, overwrite=True)
>>> schema_json = {"file_name": {"type": "string"}, "label": {"type": "int32"}, "data": {"type": "bytes"}}
>>> writer.add_schema(schema_json, "test_schema")
>>> indexes = ["file_name", "label"]
>>> writer.add_index(indexes)
>>> for i in range(10):
...     data = [{"file_name": str(i) + ".jpg", "label": i,
...              "data": b"\x10c\xb3w\xa8\xee$o&<q\x8c\x8e(\xa2\x90\x90\x96\xbc\xb1\x1e\xd4QER\x13?\xff"}]
...     writer.write_raw_data(data)
>>> writer.commit()

add_index(index_fields)[source]

Select index fields from the schema to accelerate reading. The schema is added through add_schema.

Note

Index fields must be of a primitive type, e.g. int, float, or str. If this function is not called, the primitive-type fields in the schema are set as indexes by default.

Please refer to the Examples of mindspore.mindrecord.FileWriter.

Parameters

index_fields (list[str]) – Fields from the schema to use as indexes.

Raises

ParamTypeError – If index field is invalid.
MRMDefineIndexError – If index field is not primitive type.
MRMAddIndexError – If failed to add index field.
MRMGetMetaError – If the schema is not set or failed to get meta.

add_schema(content, desc=None)[source]

The schema is added to describe the raw data to be written.

Note

Please refer to the Examples of mindspore.mindrecord.FileWriter.

The data types supported by MindRecord.
Data Type	Data Shape	Details
int32	/	integer number
int64	/	integer number
float32	/	real number
float64	/	real number
string	/	string data
bytes	/	binary data
int32	[-1] / [-1, 32, 32] / [3, 224, 224]	numpy ndarray
int64	[-1] / [-1, 32, 32] / [3, 224, 224]	numpy ndarray
float32	[-1] / [-1, 32, 32] / [3, 224, 224]	numpy ndarray
float64	[-1] / [-1, 32, 32] / [3, 224, 224]	numpy ndarray

Parameters

content (dict) – Dictionary of schema content.
desc (str, optional) – String of schema description. Default: None.

Raises

MRMInvalidSchemaError – If schema is invalid.
MRMBuildSchemaError – If failed to build schema.
MRMAddSchemaError – If failed to add schema.

Examples

>>> # Examples of available schemas
>>> schema1 = {"file_name": {"type": "string"}, "label": {"type": "int32"}, "data": {"type": "bytes"}}
>>> schema2 = {"input_ids": {"type": "int32", "shape": [-1]},
...            "input_masks": {"type": "int32", "shape": [-1]}}
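
The type table above can be turned into a quick pre-check before calling add_schema. The following is only a sketch: check_schema, SUPPORTED_TYPES and ARRAY_TYPES are hypothetical names, not part of MindRecord, and the rule that a shape is only meaningful for numeric types is an assumption read off the table. MindRecord performs its own validation and may apply different rules.

```python
# Hypothetical pre-check; add_schema performs the authoritative validation.
SUPPORTED_TYPES = {"int32", "int64", "float32", "float64", "string", "bytes"}
# Assumption from the table above: only numeric types are listed with a shape.
ARRAY_TYPES = {"int32", "int64", "float32", "float64"}


def check_schema(schema):
    """Return a list of problems found in a schema dict (empty if none)."""
    problems = []
    for name, spec in schema.items():
        dtype = spec.get("type")
        if dtype not in SUPPORTED_TYPES:
            problems.append(f"{name}: unsupported type {dtype!r}")
        if "shape" in spec and dtype not in ARRAY_TYPES:
            problems.append(f"{name}: shape is only listed for numeric types")
    return problems


schema2 = {"input_ids": {"type": "int32", "shape": [-1]},
           "input_masks": {"type": "int32", "shape": [-1]}}
print(check_schema(schema2))
```

An empty list means the schema only uses type and shape combinations that appear in the table.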

commit()[source]

Flush data in memory to disk and generate the corresponding database files.

Note

Please refer to the Examples of mindspore.mindrecord.FileWriter.

Raises

MRMOpenError – If failed to open MindRecord file.
MRMSetHeaderError – If failed to set header.
MRMIndexGeneratorError – If failed to create index generator.
MRMGenerateIndexError – If failed to write to database.
MRMCommitError – If failed to flush data to disk.
RuntimeError – If parallel writing fails.

classmethod open_for_append(file_name)[source]

Open MindRecord file and get ready to append data.

Parameters

file_name (str) – String of MindRecord file name.

Returns

FileWriter, file writer object for the opened MindRecord file.

Raises

ParamValueError – If file_name is invalid.
FileNameError – If path contains invalid characters.
MRMOpenError – If failed to open MindRecord file.
MRMOpenForAppendError – If failed to open file for appending data.

Examples

>>> from mindspore.mindrecord import FileWriter
>>>
>>> data = [{"file_name": "0.jpg", "label": 0,
...          "data": b"\x10c\xb3w\xa8\xee$o&<q\x8c\x8e(\xa2\x90\x90\x96\xbc\xb1\x1e\xd4QER\x13?\xff"}]
>>> writer = FileWriter(file_name="test.mindrecord", shard_num=1, overwrite=True)
>>> schema_json = {"file_name": {"type": "string"}, "label": {"type": "int32"}, "data": {"type": "bytes"}}
>>> writer.add_schema(schema_json, "test_schema")
>>> writer.write_raw_data(data)
>>> writer.commit()
>>>
>>> write_append = FileWriter.open_for_append("test.mindrecord")
>>> for i in range(9):
...     data = [{"file_name": str(i+1) + ".jpg", "label": i,
...              "data": b"\x10c\xb3w\xa8\xee$o&<q\x8c\x8e(\xa2\x90\x90\x96\xbc\xb1\x1e\xd4QER\x13?\xff"}]
...     write_append.write_raw_data(data)
>>> write_append.commit()

set_header_size(header_size)[source]

Set the size of the header, which contains shard information, schema information, page meta information, etc. The larger the header, the more data the MindRecord file can store. If the header needs more than the default size (16MB), call this API to set a suitable size.

Parameters: header_size (int) – Size of header, in bytes, which must be between 16*1024 (16KB) and 128*1024*1024 (128MB).
Raises: MRMInvalidHeaderSizeError – If failed to set header size.

Examples

>>> from mindspore.mindrecord import FileWriter
>>> writer = FileWriter(file_name="test.mindrecord", shard_num=1)
>>> writer.set_header_size(1 << 25) # 32MB
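
The documented range for header_size can be checked before calling set_header_size. This is a minimal sketch: is_valid_header_size and the two constants are hypothetical names, and treating both bounds as inclusive is an assumption, since the text only says "between".

```python
# Documented bounds for header_size, in bytes (assumed inclusive).
MIN_HEADER_SIZE = 16 * 1024            # 16KB
MAX_HEADER_SIZE = 128 * 1024 * 1024    # 128MB


def is_valid_header_size(header_size):
    """Return True if header_size is an int inside the documented range."""
    return (isinstance(header_size, int)
            and MIN_HEADER_SIZE <= header_size <= MAX_HEADER_SIZE)


print(is_valid_header_size(1 << 25))   # 32MB, the value used in the example above
print(is_valid_header_size(8 * 1024))  # 8KB, below the documented minimum
```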

set_page_size(page_size)[source]

Set the size of a page, which is the area where data is stored. There are two types of pages: raw pages and blob pages. The larger a page, the more data it can store. If a single sample is larger than the default page size (32MB), call this API to set a suitable size.

Parameters: page_size (int) – Size of page, in bytes, which must be between 32*1024 (32KB) and 256*1024*1024 (256MB).
Raises: MRMInvalidPageSizeError – If failed to set page size.

Examples

>>> from mindspore.mindrecord import FileWriter
>>> writer = FileWriter(file_name="test.mindrecord", shard_num=1)
>>> writer.set_page_size(1 << 26)  # 64MB
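
One way to pick a value for page_size is to start from the default and double it until the largest sample fits. The sketch below assumes that a page only needs to hold the raw sample bytes and ignores any per-page overhead MindRecord may add; suggest_page_size and the constants are hypothetical names.

```python
DEFAULT_PAGE_SIZE = 32 * 1024 * 1024   # 32MB, the documented default
MAX_PAGE_SIZE = 256 * 1024 * 1024      # 256MB, the documented maximum


def suggest_page_size(largest_sample_bytes):
    """Return the smallest doubling of the default that holds the largest sample."""
    if largest_sample_bytes > MAX_PAGE_SIZE:
        raise ValueError("sample exceeds the documented maximum page size")
    size = DEFAULT_PAGE_SIZE
    while size < largest_sample_bytes:
        size *= 2
    return size


# A 40MB sample does not fit in the 32MB default, so 64MB is suggested.
page_size = suggest_page_size(40 * 1024 * 1024)
print(page_size == 1 << 26)
```

The result could then be passed to set_page_size before any data is written.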

write_raw_data(raw_data, parallel_writer=False)[source]

Convert raw data into a series of consecutive MindRecord files after the raw data is verified against the schema.

Note

Please refer to the Examples of mindspore.mindrecord.FileWriter.

Parameters

raw_data (list[dict]) – List of raw data.
parallel_writer (bool, optional) – Whether to write raw data in parallel. Default: False. Parallel writing is not supported on the Windows platform.

Raises

ParamTypeError – If index field is invalid.
MRMOpenError – If failed to open MindRecord file.
MRMValidateDataError – If data does not match blob fields.
MRMSetHeaderError – If failed to set header.
MRMWriteDatasetError – If failed to write dataset.
TypeError – If parallel_writer is not bool.
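
Because raw_data is a list of dicts, a large dataset is usually passed to write_raw_data in batches rather than in one call. The sketch below shows only the batching and the platform check in plain Python; batched and use_parallel are hypothetical names, and the writer call is left as a comment because it needs an open FileWriter.

```python
import platform


def batched(records, batch_size):
    """Yield successive lists of at most batch_size records."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


# Parallel writing is documented as unsupported on Windows.
use_parallel = platform.system() != "Windows"

records = [{"file_name": f"{i}.jpg", "label": i, "data": b"\x00"} for i in range(10)]
batches = list(batched(records, 4))
for batch in batches:
    # With a real writer: writer.write_raw_data(batch, parallel_writer=use_parallel)
    print(len(batch))
```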

class mindspore.mindrecord.FileReader(file_name, num_consumer=4, columns=None, operator=None)[source]

Class to read MindRecord files.

Note

If file_name is a single file path, FileReader tries to load all MindRecord files generated in the same conversion and throws an exception if any of them is missing. If file_name is a list of file paths, only the MindRecord files in the list are loaded. The parameter operator has no effect and will be deprecated in a future version.

Parameters

file_name (Union[str, list[str]]) – A MindRecord file path or a list of file paths.
num_consumer (int, optional) – Number of reader workers which load data. Default: 4. It should not be smaller than 1 or larger than the number of processor cores.
columns (list[str], optional) – A list of fields whose data is to be read. Default: None.
operator (int, optional) – Reserved parameter for operators. Default: None.

Raises

ParamValueError – If file_name, num_consumer, or columns is invalid.

Examples

>>> from mindspore.mindrecord import FileReader
>>>
>>> mindrecord_file = "/path/to/mindrecord/file"
>>> reader = FileReader(file_name=mindrecord_file)
>>>
>>> # create iterator for mindrecord and get saved data
>>> for item in reader.get_next():
...     ori_data = item
>>> reader.close()

close()[source]: Stop reader worker and close file.

Note

Please refer to the Examples of mindspore.mindrecord.FileReader .

get_next()[source]

Yield one batch of data at a time, containing the fields selected by columns.

Note

Please refer to the Examples of mindspore.mindrecord.FileReader.

Returns: dict, a batch whose keys are the same as columns.
Raises: MRMUnsupportedSchemaError – If schema is invalid.

len()[source]

Get the number of the samples in MindRecord.

Returns: int, the number of the samples in MindRecord.

Examples

>>> from mindspore.mindrecord import FileReader
>>>
>>> mindrecord_file = "/path/to/mindrecord/file"
>>> reader = FileReader(file_name=mindrecord_file)
>>> length = reader.len()
>>> reader.close()

schema()[source]

Get the schema of the MindRecord.

Returns: dict, the schema info.

Examples

>>> from mindspore.mindrecord import FileReader
>>>
>>> mindrecord_file = "/path/to/mindrecord/file"
>>> reader = FileReader(file_name=mindrecord_file)
>>> schema = reader.schema()
>>> reader.close()

class mindspore.mindrecord.MindPage(file_name, num_consumer=4)[source]

Class to read MindRecord files page by page.

Parameters

file_name (Union[str, list[str]]) – A MindRecord file path or a list of file paths.
num_consumer (int, optional) – The number of reader workers which load data. Default: 4. It should not be smaller than 1 or larger than the number of processor cores.

Raises

ParamValueError – If file_name is not of type str or list[str].
ParamValueError – If num_consumer is not of type int.

Examples

>>> from mindspore.mindrecord import MindPage
>>>
>>> mindrecord_file = "/path/to/mindrecord/file"
>>> mind_page = MindPage(mindrecord_file)
>>>
>>> # get all the index fields
>>> fields = mind_page.candidate_fields
>>>
>>> # set the field to be retrieved
>>> mind_page.category_field = "file_name"
>>>
>>> # get all the group info
>>> info = mind_page.read_category_info()
>>>
>>> # get the row by id which is from category info
>>> row_by_id = mind_page.read_at_page_by_id(0, 0, 1)
>>>
>>> # get the row by name which is from category info
>>> row_by_name = mind_page.read_at_page_by_name("8.jpg", 0, 1)

property candidate_fields

Return candidate category fields.

Note

Please refer to the Examples of mindspore.mindrecord.MindPage.

Returns: list[str], by which data could be grouped.

property category_field

Setter and getter for the category field.

Note

Please refer to the Examples of mindspore.mindrecord.MindPage.

Returns: list[str], by which data could be grouped.

read_at_page_by_id(category_id, page, num_row)[source]

Query by category id in pagination.

Note

Please refer to the Examples of mindspore.mindrecord.MindPage.

Parameters

category_id (int) – Category id, taken from the return value of read_category_info.
page (int) – Index of page.
num_row (int) – Number of rows in a page.

Returns

list[dict], data queried by category id.

Raises

ParamValueError – If any parameter is invalid.
MRMFetchDataError – If failed to fetch data by category.
MRMUnsupportedSchemaError – If schema is invalid.

read_at_page_by_name(category_name, page, num_row)[source]

Query by category name in pagination.

Note

Please refer to the Examples of mindspore.mindrecord.MindPage.

Parameters

category_name (str) – Value of the category field, taken from the return value of read_category_info.
page (int) – Index of page.
num_row (int) – Number of rows in a page.

Returns

list[dict], data queried by category name.

read_category_info()[source]

Return category information when data is grouped by the indicated category field.

The result is similar to the following, where key represents the index field and categories contains statistical information for that index.

{"categories":[{"count":1,"id":0,"name":"0.jpg"},
               {"count":1,"id":1,"name":"1.jpg"},
               {"count":1,"id":2,"name":"2.jpg"},
               {"count":1,"id":3,"name":"3.jpg"}],
 "key":"file_name_0"}

Note

Please refer to the Examples of mindspore.mindrecord.MindPage.

Returns: str, description of group information.
Raises: MRMReadCategoryInfoError – If failed to read category information.
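
Since read_category_info returns a str, the caller has to parse it before looking up ids for read_at_page_by_id. The sketch below assumes the returned string is valid JSON with the layout shown above and uses a literal copy of that sample instead of a real MindPage call; name_to_id and pages_per_category are hypothetical names.

```python
import json
import math

# Literal copy of the sample result shown above (assumed to be valid JSON).
info = """{"categories":[{"count":1,"id":0,"name":"0.jpg"},
                         {"count":1,"id":1,"name":"1.jpg"},
                         {"count":1,"id":2,"name":"2.jpg"},
                         {"count":1,"id":3,"name":"3.jpg"}],
           "key":"file_name_0"}"""

parsed = json.loads(info)

# Map each category name to the id expected by read_at_page_by_id.
name_to_id = {c["name"]: c["id"] for c in parsed["categories"]}

# Number of pages needed per category for a chosen page length.
num_row = 1
pages_per_category = {c["id"]: math.ceil(c["count"] / num_row)
                      for c in parsed["categories"]}

print(name_to_id["2.jpg"], pages_per_category[2])
```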

class mindspore.mindrecord.Cifar10ToMR(source, destination)[source]

A class to transform the cifar10 dataset to MindRecord. The dataset must be the Python version, with an archive name similar to cifar-10-python.tar.gz.

Parameters

source (str) – The cifar10 directory to be transformed.
destination (str) – MindRecord file path to transform into. Ensure that the directory is created in advance and that no file with the same name exists in it.

Raises

ValueError – If source or destination is invalid.

Examples

>>> from mindspore.mindrecord import Cifar10ToMR
>>>
>>> cifar10_dir = "/path/to/cifar10"
>>> mindrecord_file = "/path/to/mindrecord/file"
>>> cifar10_to_mr = Cifar10ToMR(cifar10_dir, mindrecord_file)
>>> cifar10_to_mr.transform()

transform(fields=None)[source]

Execute transformation from cifar10 to MindRecord.

Note

Please refer to the Examples of mindspore.mindrecord.Cifar10ToMR.

Warning

Cifar10ToMR.transform() uses pickle module implicitly, which is known to be insecure. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling. Never load data that could have come from an untrusted source, or that could have been tampered with.

Parameters

fields (list[str], optional) – A list of index fields. Default: None. For index field settings, please refer to mindspore.mindrecord.FileWriter.add_index().

Raises

ParamTypeError – If index field is invalid.
MRMOpenError – If failed to open MindRecord file.
MRMValidateDataError – If data does not match blob fields.
MRMSetHeaderError – If failed to set header.
MRMWriteDatasetError – If failed to write dataset.
ValueError – If parameter fields is invalid.

class mindspore.mindrecord.Cifar100ToMR(source, destination)[source]

A class to transform the cifar100 dataset to MindRecord. The dataset must be the Python version, with an archive name similar to cifar-100-python.tar.gz.

Parameters

source (str) – The cifar100 directory to be transformed.
destination (str) – MindRecord file path to transform into. Ensure that the directory is created in advance and that no file with the same name exists in it.

Raises

ValueError – If source or destination is invalid.

Examples

>>> from mindspore.mindrecord import Cifar100ToMR
>>>
>>> cifar100_dir = "/path/to/cifar100"
>>> mindrecord_file = "/path/to/mindrecord/file"
>>> cifar100_to_mr = Cifar100ToMR(cifar100_dir, mindrecord_file)
>>> cifar100_to_mr.transform()

transform(fields=None)[source]

Execute transformation from cifar100 to MindRecord.

Note

Please refer to the Examples of mindspore.mindrecord.Cifar100ToMR.

Parameters

fields (list[str], optional) – A list of index fields, e.g. ["fine_label", "coarse_label"]. Default: None. For index field settings, please refer to mindspore.mindrecord.FileWriter.add_index().

Raises

ParamTypeError – If index field is invalid.
MRMOpenError – If failed to open MindRecord file.
MRMValidateDataError – If data does not match blob fields.
MRMSetHeaderError – If failed to set header.
MRMWriteDatasetError – If failed to write dataset.
ValueError – If parameter fields is invalid.

class mindspore.mindrecord.CsvToMR(source, destination, columns_list=None, partition_number=1)[source]

A class to transform from csv to MindRecord.

Parameters

source (str) – The file path of csv.
destination (str) – The MindRecord file path to transform into. Ensure that the directory is created in advance and that no file with the same name exists in it.
columns_list (list[str], optional) – A list of columns to be read. Default: None.
partition_number (int, optional) – The partition size. Default: 1.

Raises

ValueError – If source, destination, or partition_number is invalid.
RuntimeError – If columns_list is invalid.

Examples

>>> from mindspore.mindrecord import CsvToMR
>>>
>>> csv_file = "/path/to/csv/file"
>>> mindrecord_file = "/path/to/mindrecord/file"
>>> csv_to_mr = CsvToMR(csv_file, mindrecord_file)
>>> csv_to_mr.transform()

transform()[source]

Execute transformation from csv to MindRecord.

Note

Please refer to the Examples of mindspore.mindrecord.CsvToMR.

Raises

ParamTypeError – If index field is invalid.
MRMOpenError – If failed to open MindRecord file.
MRMValidateDataError – If data does not match blob fields.
MRMSetHeaderError – If failed to set header.
MRMWriteDatasetError – If failed to write dataset.
IOError – If the CSV file does not exist.
ValueError – If a column name starts with a number. The first line of the CSV file is used as the column names.
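
The column-name restriction can be checked before running the conversion. This is a sketch under the assumption that "starts with a number" means the first character is a digit; bad_column_names is a hypothetical helper and works on CSV text so that it runs without a file on disk.

```python
import csv
import io


def bad_column_names(csv_text):
    """Return the header fields whose first character is a digit."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    return [name for name in header if name[:1].isdigit()]


sample = "id,1st_score,name\n1,0.5,a\n"
print(bad_column_names(sample))  # ['1st_score']
```

For a real file, the same check can be applied to the text read from the path passed as source.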

class mindspore.mindrecord.ImageNetToMR(map_file, image_dir, destination, partition_number=1)[source]

A class to transform from imagenet to MindRecord.

Parameters

map_file (str) –
The map file that assigns a label to each class directory. This file can be generated by the command ls -l [image_dir] | grep -vE "total|\." | awk -F " " '{print $9, NR-1;}' > [file_path], where image_dir is the image directory that contains directories such as n01440764, n01443537, n01484850 and n15075141, and file_path is the path of the generated map_file. An example of map_file is shown below:
```
n01440764 0
n01443537 1
n01484850 2
n01491361 3
...
n15075141 999
```
image_dir (str) – Image directory that contains directories such as n01440764, n01443537, n01484850 and n15075141.
destination (str) – MindRecord file path to transform into. Ensure that the directory is created in advance and that no file with the same name exists in it.
partition_number (int, optional) – The partition size. Default: 1.

Raises

ValueError – If map_file, image_dir, or destination is invalid.

Examples

>>> from mindspore.mindrecord import ImageNetToMR
>>>
>>> map_file = "/path/to/imagenet/map_file"
>>> imagenet_dir = "/path/to/imagenet/train"
>>> mindrecord_file = "/path/to/mindrecord/file"
>>> imagenet_to_mr = ImageNetToMR(map_file, imagenet_dir, mindrecord_file, 8)
>>> imagenet_to_mr.transform()
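
On systems without ls, grep and awk, the map_file described in the parameters can be produced with a few lines of Python. The sketch below is only roughly equivalent to the shell command: it keeps subdirectories whose names contain no dot and sorts them with Python's default string ordering, which may differ from the locale-dependent order of ls. build_map_lines is a hypothetical helper, and the demo runs on a temporary directory instead of a real ImageNet tree.

```python
import os
import tempfile


def build_map_lines(image_dir):
    """Return 'class_dir index' lines for the subdirectories of image_dir."""
    names = sorted(
        entry.name for entry in os.scandir(image_dir)
        if entry.is_dir() and "." not in entry.name
    )
    return [f"{name} {index}" for index, name in enumerate(names)]


# Demo on a temporary directory with three class folders.
with tempfile.TemporaryDirectory() as root:
    for wnid in ("n01443537", "n01440764", "n01484850"):
        os.mkdir(os.path.join(root, wnid))
    lines = build_map_lines(root)

print("\n".join(lines))
```

Writing the returned lines to a file, one per line, gives a file in the format shown in the map_file example.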

transform()[source]

Execute transformation from imagenet to MindRecord.

Note

Please refer to the Examples of mindspore.mindrecord.ImageNetToMR.

Raises

ParamTypeError – If index field is invalid.
MRMOpenError – If failed to open MindRecord file.
MRMValidateDataError – If data does not match blob fields.
MRMSetHeaderError – If failed to set header.
MRMWriteDatasetError – If failed to write dataset.

class mindspore.mindrecord.MnistToMR(source, destination, partition_number=1)[source]

A class to transform from MNIST to MindRecord.

Parameters

source (str) – Directory that contains t10k-images-idx3-ubyte.gz, train-images-idx3-ubyte.gz, t10k-labels-idx1-ubyte.gz and train-labels-idx1-ubyte.gz.
destination (str) – MindRecord file path to transform into. Ensure that the directory is created in advance and that no file with the same name exists in it.
partition_number (int, optional) – The partition size. Default: 1.

Raises

ValueError – If source, destination, or partition_number is invalid.

Examples

>>> from mindspore.mindrecord import MnistToMR
>>>
>>> mnist_dir = "/path/to/mnist"
>>> mindrecord_file = "/path/to/mindrecord/file"
>>> mnist_to_mr = MnistToMR(mnist_dir, mindrecord_file)
>>> mnist_to_mr.transform()

transform()[source]

Execute transformation from MNIST to MindRecord.

Note

Please refer to the Examples of mindspore.mindrecord.MnistToMR.

Raises

ParamTypeError – If index field is invalid.
MRMOpenError – If failed to open MindRecord file.
MRMValidateDataError – If data does not match blob fields.
MRMSetHeaderError – If failed to set header.
MRMWriteDatasetError – If failed to write dataset.

class mindspore.mindrecord.TFRecordToMR(source, destination, feature_dict, bytes_fields=None)[source]

A class to transform from TFRecord to MindRecord.

Parameters

source (str) – TFRecord file to be transformed.
destination (str) – MindRecord file path to transform into. Ensure that the directory is created in advance and that no file with the same name exists in it.
feature_dict (dict[str, FixedLenFeature]) – Dictionary that maps each feature name to its type. FixedLenFeature is supported.
bytes_fields (list[str], optional) – The fields in feature_dict that hold bytes data, such as image bytes. Default: None, which means that there is no bytes field such as an image.

Raises

ValueError – If a parameter is invalid.
Exception – If the tensorflow module is not found or its version is not correct.

Examples

>>> from mindspore.mindrecord import TFRecordToMR
>>> import tensorflow as tf
>>>
>>> tfrecord_file = "/path/to/tfrecord/file"
>>> mindrecord_file = "/path/to/mindrecord/file"
>>> feature_dict = {"file_name": tf.io.FixedLenFeature([], tf.string),
...                 "image_bytes": tf.io.FixedLenFeature([], tf.string),
...                 "int64_scalar": tf.io.FixedLenFeature([], tf.int64),
...                 "float_scalar": tf.io.FixedLenFeature([], tf.float32),
...                 "int64_list": tf.io.FixedLenFeature([6], tf.int64),
...                 "float_list": tf.io.FixedLenFeature([7], tf.float32)}
>>> tfrecord_to_mr = TFRecordToMR(tfrecord_file, mindrecord_file, feature_dict, ["image_bytes"])
>>> tfrecord_to_mr.transform()

transform()[source]

Execute transformation from TFRecord to MindRecord.

Note

Please refer to the Examples of mindspore.mindrecord.TFRecordToMR.

Raises

ParamTypeError – If index field is invalid.
MRMOpenError – If failed to open MindRecord file.
MRMValidateDataError – If data does not match blob fields.
MRMSetHeaderError – If failed to set header.
MRMWriteDatasetError – If failed to write dataset.

mindspore.mindrecord.set_enc_key(enc_key)[source]

Set the encryption key.

Parameters: enc_key (str) – String key used for encryption. The valid length is 16, 24, or 32. None indicates that encryption is not enabled.
Raises: ValueError – If the input is not a str or its length is invalid.

Examples

>>> from mindspore.mindrecord import set_enc_key
>>>
>>> set_enc_key("0123456789012345")
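
The fixed key in the example above is fine for a demonstration, but a real key is better generated randomly. The sketch below uses the standard-library secrets module to produce a 32-character hexadecimal string, which matches one of the documented lengths, and checks the length rule. is_valid_enc_key is a hypothetical helper, and the assumption that the length is counted in characters is not confirmed by the text.

```python
import secrets

VALID_KEY_LENGTHS = (16, 24, 32)


def is_valid_enc_key(key):
    """Return True for None (encryption disabled) or a str of a documented length."""
    return key is None or (isinstance(key, str) and len(key) in VALID_KEY_LENGTHS)


# 16 random bytes rendered as 32 hexadecimal characters.
enc_key = secrets.token_hex(16)

print(len(enc_key), is_valid_enc_key(enc_key))
```

The generated string would then be passed to set_enc_key, and the same key must be available when the files are read back.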

mindspore.mindrecord.set_enc_mode(enc_mode='AES-GCM')[source]

Set the encryption mode.

Parameters: enc_mode (Union[str, function], optional) – This parameter is valid only when enc_key is not set to None. Specifies the encryption mode or a customized encryption function. The built-in mode currently supported is "AES-GCM". Default: "AES-GCM". For a customized encryption function, users must ensure its correctness and the security of the encryption algorithm, and the function must raise exceptions when errors occur.
Raises: ValueError – If the input is not a valid encryption mode or a callable function.

Examples

>>> from mindspore.mindrecord import set_enc_mode
>>>
>>> set_enc_mode("AES-GCM")

mindspore.mindrecord.set_dec_mode(dec_mode='AES-GCM')[source]

Set the decryption mode.

If the built-in enc_mode is used and dec_mode is not specified, the encryption algorithm specified by enc_mode is used for decryption. If a customized encryption function is used, a customized decryption function must be specified at read time.

Parameters: dec_mode (Union[str, function], optional) – This parameter is valid only when enc_key is not set to None. Specifies the decryption mode or a customized decryption function. The built-in mode currently supported is "AES-GCM". Default: "AES-GCM". None indicates that the decryption mode is not defined. For a customized decryption function, users must ensure its correctness, and the function must raise exceptions when errors occur.
Raises: ValueError – If the input is not a valid decryption mode or a callable function.

Examples

>>> from mindspore.mindrecord import set_dec_mode
>>>
>>> set_dec_mode("AES-GCM")