mindspore.mindrecord

Introduction to mindrecord:

MindRecord is a module for reading, writing, searching, and converting datasets in the MindSpore format. Users can load MindRecord data with FileReader and modify it with FileWriter. Datasets in other formats can be converted to MindRecord with the corresponding converter class.

class mindspore.mindrecord.Cifar100ToMR(source, destination)

A class that converts a CIFAR-100 dataset to MindRecord.

Parameters

source (str) – The CIFAR-100 directory to be transformed.
destination (str) – Path of the MindRecord file to write.

Raises

ValueError – If source or destination is invalid.

run(fields=None)

Executes transformation from cifar100 to MindRecord.

Parameters: fields (list[str], optional) – A list of index fields, e.g. ["fine_label", "coarse_label"] (default=None).
Returns: MSRStatus, whether cifar100 is successfully transformed to MindRecord.

class mindspore.mindrecord.Cifar10ToMR(source, destination)

A class that converts a CIFAR-10 dataset to MindRecord.

Parameters

source (str) – The CIFAR-10 directory to be transformed.
destination (str) – Path of the MindRecord file to write.

Raises

ValueError – If source or destination is invalid.

run(fields=None)

Executes transformation from cifar10 to MindRecord.

Parameters: fields (list[str], optional) – A list of index fields, e.g. ["label"] (default=None).
Returns: MSRStatus, whether cifar10 is successfully transformed to MindRecord.

class mindspore.mindrecord.CsvToMR(source, destination, columns_list=None, partition_number=1)

A class that converts a CSV file to MindRecord.

Parameters

source (str) – Path of the CSV file.
destination (str) – Path of the MindRecord file to write.
columns_list (list[str], optional) – A list of columns to be read (default=None).
partition_number (int, optional) – Number of partitions (default=1).

Raises

ValueError – If source, destination or partition_number is invalid.
RuntimeError – If columns_list is invalid.

run()

Executes transformation from csv to MindRecord.

Returns: MSRStatus, whether csv is successfully transformed to MindRecord.
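
The following is a minimal sketch of a CSV conversion, not a definitive recipe. The column names, file names, and the helpers `make_sample_csv` and `convert_csv` are hypothetical. The MindSpore import is deferred into `convert_csv`, so the CSV preparation step runs even where MindSpore is not installed.

```python
import csv
import os
import tempfile


def make_sample_csv(path):
    """Write a small CSV file with hypothetical columns and return its header."""
    header = ["id", "label", "score"]
    rows = [[1, 0, 0.5], [2, 1, 0.75], [3, 0, 0.25]]
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)
    return header


def convert_csv(source, destination, columns_list=None, partition_number=1):
    """Convert a CSV file to MindRecord with CsvToMR (requires MindSpore)."""
    from mindspore.mindrecord import CsvToMR  # deferred import

    converter = CsvToMR(source, destination, columns_list, partition_number)
    return converter.run()  # MSRStatus, per the reference above


workdir = tempfile.mkdtemp()
csv_path = os.path.join(workdir, "sample.csv")
header = make_sample_csv(csv_path)

# With MindSpore installed, only two of the columns could be converted like this:
# convert_csv(csv_path, os.path.join(workdir, "sample.mindrecord"), ["id", "label"])
```

Passing `columns_list` restricts the conversion to the named columns; leaving it as None keeps the default behaviour.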

class mindspore.mindrecord.FileReader(file_name, num_consumer=4, columns=None, operator=None)

Class to read a series of MindRecord files.

Parameters

file_name (str, list[str]) – Path of one MindRecord file, or a list of such paths.
num_consumer (int, optional) – Number of consumer threads which load data to memory (default=4). It must be at least 1 and at most the number of CPUs.
columns (list[str], optional) – A list of the fields whose data is to be read (default=None).
operator (int, optional) – Reserved parameter for operators (default=None).

Raises

ParamValueError – If file_name, num_consumer or columns is invalid.

close(): Stop the reader workers and close the file.

get_next()

Yield one batch of data at a time, restricted to the fields named in columns.

Yields: dictionary – keys are the same as columns.
Raises: MRMUnsupportedSchemaError – If schema is invalid.
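
A short sketch of a read loop built on get_next() and close() as documented above. The helpers `open_reader` and `take_rows` are hypothetical, and the file and column names in the usage comment are placeholders. `take_rows` only relies on the object having `get_next()` and `close()`, so it can be exercised without MindSpore.

```python
from itertools import islice


def open_reader(file_name, columns=None, num_consumer=4):
    """Create a FileReader (requires MindSpore; the import is deferred)."""
    from mindspore.mindrecord import FileReader

    return FileReader(file_name, num_consumer=num_consumer, columns=columns)


def take_rows(reader, limit=None):
    """Collect up to `limit` items from reader.get_next(); always close the reader."""
    try:
        # islice with limit=None consumes the whole generator.
        return list(islice(reader.get_next(), limit))
    finally:
        reader.close()


# With MindSpore installed and an existing file (placeholder names):
# rows = take_rows(open_reader("train.mindrecord", columns=["label"]), limit=10)
```

Closing the reader in a `finally` clause stops the worker threads even if iteration raises, for example with MRMUnsupportedSchemaError.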

class mindspore.mindrecord.FileWriter(file_name, shard_num=1)

Class to write user-defined raw data into a series of MindRecord files.

Note

A MindRecord file may fail to be read if it has been renamed.

Parameters

file_name (str) – File name of the MindRecord file.
shard_num (int, optional) – The number of MindRecord files (default=1). It must be in the range [1, 1000].

Raises

ParamValueError – If file_name or shard_num is invalid.

add_index(index_fields)

Select index fields from schema to accelerate reading.

Parameters

index_fields (list[str]) – Fields to be set as indexes. Each must be of a primitive type.

Returns

MSRStatus, SUCCESS or FAILED.

Raises

ParamTypeError – If index field is invalid.
MRMDefineIndexError – If index field is not primitive type.
MRMAddIndexError – If failed to add index field.
MRMGetMetaError – If the schema is not set or failed to get meta.

add_schema(content, desc=None)

Add a schema and return its id. An exception is raised if the schema cannot be added.

Parameters

content (dict) – Dictionary of user defined schema.
desc (str, optional) – String of schema description (default=None).

Returns

int, schema id.

Raises

MRMInvalidSchemaError – If schema is invalid.
MRMBuildSchemaError – If failed to build schema.
MRMAddSchemaError – If failed to add schema.

commit()

Flush data to disk and generate the corresponding database files.

Returns

MSRStatus, SUCCESS or FAILED.

Raises

MRMOpenError – If failed to open MindRecord File.
MRMSetHeaderError – If failed to set header.
MRMIndexGeneratorError – If failed to create index generator.
MRMGenerateIndexError – If failed to write to database.
MRMCommitError – If failed to flush data to disk.

open_and_set_header(): Open the writer and set the header.

classmethod open_for_append(file_name)

Open MindRecord file and get ready to append data.

Parameters

file_name (str) – Name of the MindRecord file.

Returns

FileWriter, file writer for the opened MindRecord file.

Raises

ParamValueError – If file_name is invalid.
FileNameError – If path contains invalid characters.
MRMOpenError – If failed to open MindRecord File.
MRMOpenForAppendError – If failed to open file for appending data.

set_header_size(header_size)

Set the size of the header, which contains shard information, schema information, page meta information, etc. The larger the header, the more training data a single MindRecord file can store.

Parameters: header_size (int) – Size of header, between 16KB and 128MB.
Returns: MSRStatus, SUCCESS or FAILED.
Raises: MRMInvalidHeaderSizeError – If failed to set header size.

set_page_size(page_size)

Set the size of a page. A page is the block in which training data is stored, and MindRecord splits the training data into raw pages and blob pages. The larger the page, the more training data a single page can store.

Parameters: page_size (int) – Size of page, between 32KB and 256MB.
Returns: MSRStatus, SUCCESS or FAILED.
Raises: MRMInvalidPageSizeError – If failed to set page size.
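
The size limits for set_header_size and set_page_size are given in KB and MB, while both methods take an int. The sketch below writes the sizes as byte counts and checks them against the documented limits before use. Treating the sizes as bytes and both ends of each range as inclusive are assumptions, and `in_range` is a hypothetical helper.

```python
KB = 1 << 10
MB = 1 << 20

# Limits taken from the reference above. Inclusive bounds are an assumption.
HEADER_SIZE_RANGE = (16 * KB, 128 * MB)
PAGE_SIZE_RANGE = (32 * KB, 256 * MB)


def in_range(size, bounds):
    """Return True if size lies within the (low, high) bounds, inclusive."""
    low, high = bounds
    return low <= size <= high


header_size = 32 * MB  # equal to 1 << 25
page_size = 64 * MB    # equal to 1 << 26

header_ok = in_range(header_size, HEADER_SIZE_RANGE)
page_ok = in_range(page_size, PAGE_SIZE_RANGE)
```

Given a FileWriter instance named `writer` (hypothetical), the checked values would then be passed as `writer.set_header_size(header_size)` and `writer.set_page_size(page_size)`.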

write_raw_data(raw_data, parallel_writer=False)

Write raw data, generating the MindRecord files in sequence. By default, the data is validated against the predefined schema.

Parameters

raw_data (list[dict]) – List of raw data.
parallel_writer (bool, optional) – Load data in parallel if True (default=False).

Returns

MSRStatus, SUCCESS or FAILED.

Raises

ParamTypeError – If index field is invalid.
MRMOpenError – If failed to open MindRecord File.
MRMValidateDataError – If data does not match blob fields.
MRMSetHeaderError – If failed to set header.
MRMWriteDatasetError – If failed to write dataset.
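
The FileWriter methods above can be combined into a write workflow, sketched below under stated assumptions. The schema layout (field name mapped to a dict with a "type" key), the field names, and the helpers `missing_fields` and `write_records` are illustrative and should be checked against the MindSpore version in use. The call order (schema, then index, then data, then commit) is inferred from the method descriptions, since add_index raises MRMGetMetaError when no schema is set.

```python
def missing_fields(records, schema):
    """Return (record_index, field) pairs for schema fields absent from a record."""
    return [
        (i, field)
        for i, record in enumerate(records)
        for field in schema
        if field not in record
    ]


def write_records(file_name, schema, records, index_fields, shard_num=1):
    """Write records to a MindRecord file series (requires MindSpore)."""
    from mindspore.mindrecord import FileWriter  # deferred import

    writer = FileWriter(file_name, shard_num)
    writer.add_schema(schema, "sample schema")
    writer.add_index(index_fields)
    writer.write_raw_data(records)
    return writer.commit()  # MSRStatus, per the reference above


# Assumed schema layout: field name -> {"type": ...}.
schema = {
    "file_name": {"type": "string"},
    "label": {"type": "int32"},
    "data": {"type": "bytes"},
}
records = [
    {"file_name": "1.jpg", "label": 0, "data": b"\x00\x01"},
    {"file_name": "2.jpg", "label": 1, "data": b"\x02\x03"},
]
problems = missing_fields(records, schema)

# With MindSpore installed (placeholder file name):
# write_records("sample.mindrecord", schema, records, ["file_name", "label"])
```

The local check in `missing_fields` is only a convenience for catching incomplete records early. It does not replace the validation that write_raw_data performs.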

class mindspore.mindrecord.ImageNetToMR(map_file, image_dir, destination, partition_number=1)

A class that converts an ImageNet dataset to MindRecord.

Parameters

map_file (str) – The map file that assigns a label to each image directory. The map file content should look like this:
```
n02119789 0
n02100735 1
n02110185 2
n02096294 3
```
image_dir (str) – Image directory that contains the n02119789, n02100735, n02110185 and n02096294 directories.
destination (str) – Path of the MindRecord file to write.
partition_number (int, optional) – Number of partitions (default=1).

Raises

ValueError – If map_file, image_dir or destination is invalid.

run()

Executes transformation from imagenet to MindRecord.

Returns: MSRStatus, whether imagenet is successfully transformed to MindRecord.
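
The map file format shown in the parameter description (one directory name and one integer label per line) can be checked before conversion. The sketch below parses such content into a dictionary. `parse_map_file` is a hypothetical helper, and whitespace-separated fields are assumed from the sample.

```python
def parse_map_file(text):
    """Parse 'directory label' lines into a dict, skipping blank lines."""
    mapping = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        parts = line.split()
        if not parts:
            continue
        if len(parts) != 2:
            raise ValueError(
                f"line {line_no}: expected 'directory label', got {line!r}"
            )
        directory, label = parts
        mapping[directory] = int(label)
    return mapping


sample = """n02119789 0
n02100735 1
n02110185 2
n02096294 3
"""
labels = parse_map_file(sample)
```

Each key of `labels` should match a directory name under image_dir. With MindSpore installed, the conversion itself would be `ImageNetToMR(map_file, image_dir, destination, partition_number).run()`.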

class mindspore.mindrecord.MindPage(file_name, num_consumer=4)

Class to read a series of MindRecord files page by page.

Parameters

file_name (str) – Path of one MindRecord file, or a list of such paths.
num_consumer (int, optional) – The number of consumer threads which load data to memory (default=4). It must be at least 1 and at most the number of CPUs.

Raises

ParamValueError – If file_name, num_consumer or columns is invalid.
MRMInitSegmentError – If failed to initialize ShardSegment.

property candidate_fields

Return candidate category fields.

Returns: list[str], by which data could be grouped.

property category_field

Getter function for category fields.

Returns: list[str], by which data could be grouped.

get_category_fields()

Return candidate category fields.

Returns: list[str], by which data could be grouped.

read_at_page_by_id(category_id, page, num_row)

Query by category id in pagination.

Parameters

category_id (int) – Category id, as reported by read_category_info.
page (int) – Index of page.
num_row (int) – Number of rows in a page.

Returns

list[dict], data queried by category id.

Raises

ParamValueError – If any parameter is invalid.
MRMFetchDataError – If failed to fetch data by category.
MRMUnsupportedSchemaError – If schema is invalid.

read_at_page_by_name(category_name, page, num_row)

Query by category name in pagination.

Parameters

category_name (str) – Value of the category field, as reported by read_category_info.
page (int) – Index of page.
num_row (int) – Number of rows in a page.

Returns

list[dict], data queried by category name.

read_category_info()

Return category information when data is grouped by indicated category field.

Returns: str, description of group information.
Raises: MRMReadCategoryInfoError – If failed to read category information.

set_category_field(category_field)

Set category field for reading.

Note

The field must be one of the candidate category fields.

Parameters: category_field (str) – String of category field name.
Returns: MSRStatus, SUCCESS or FAILED.
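
The pagination arithmetic behind read_at_page_by_id can be sketched as follows. The helpers `open_page_reader` and `read_category` are hypothetical. The sketch assumes that page indexes start at 0 and that the caller already knows the number of rows in the category, for example from the category information.

```python
import math


def open_page_reader(file_name, num_consumer=4):
    """Create a MindPage reader (requires MindSpore; the import is deferred)."""
    from mindspore.mindrecord import MindPage

    return MindPage(file_name, num_consumer)


def read_category(page_reader, category_id, total_rows, num_row):
    """Fetch all rows of one category, page by page.

    Assumes page indexes start at 0. total_rows is the row count of the
    category, supplied by the caller.
    """
    if num_row < 1:
        raise ValueError("num_row must be at least 1")
    rows = []
    for page in range(math.ceil(total_rows / num_row)):
        rows.extend(page_reader.read_at_page_by_id(category_id, page, num_row))
    return rows


# With MindSpore installed (placeholder names), after selecting a field with
# set_category_field and inspecting read_category_info():
# reader = open_page_reader("train.mindrecord")
# reader.set_category_field("label")
# rows = read_category(reader, category_id=0, total_rows=25, num_row=10)
```

With 25 rows and 10 rows per page, the loop requests pages 0, 1 and 2, the last of which is only partly filled.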

class mindspore.mindrecord.MnistToMR(source, destination, partition_number=1)

A class that converts an MNIST dataset to MindRecord.

Parameters

source (str) – Directory that contains t10k-images-idx3-ubyte.gz, train-images-idx3-ubyte.gz, t10k-labels-idx1-ubyte.gz and train-labels-idx1-ubyte.gz.
destination (str) – The directory in which to write the MindRecord files.
partition_number (int, optional) – Number of partitions (default=1).

Raises

ValueError – If source, destination or partition_number is invalid.

run()

Executes transformation from Mnist to MindRecord.

Returns: MSRStatus, whether successfully written into MindRecord.

class mindspore.mindrecord.TFRecordToMR(source, destination, feature_dict, bytes_fields=None)

A class that converts a TFRecord file to MindRecord.

Parameters

source (str) – The TFRecord file to be transformed.
destination (str) – Path of the MindRecord file to write.
feature_dict (dict) – A dictionary that states the feature type of each field, e.g. feature_dict = {"xxxx": tf.io.FixedLenFeature([], tf.string), "yyyy": tf.io.FixedLenFeature([], tf.int64)}

The following case, which uses VarLenFeature, is not supported:

feature_dict = {"context": {"xxxx": tf.io.FixedLenFeature([], tf.string), "yyyy": tf.io.VarLenFeature(tf.int64)}, "sequence": {"zzzz": tf.io.FixedLenSequenceFeature([], tf.float32)}}
bytes_fields (list, optional) – The bytes fields in feature_dict, which may hold image bytes.

Raises

ValueError – If parameter is invalid.
Exception – If the tensorflow module is not found or its version is incorrect.

run()

Execute transformation from TFRecord to MindRecord.

Returns: MSRStatus, whether TFRecord is successfully transformed to MindRecord.

tfrecord_iterator()

Yield a dictionary whose keys are fields in schema.

Yields: dict, data dictionary whose keys are the same as columns.

tfrecord_iterator_oldversion()

Yield a dictionary whose keys are the fields in the schema and whose values are the data. This function is for TensorFlow versions earlier than 2.1.0.

Yields: dict, data dictionary whose keys are the same as columns.
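
Because a feature_dict that uses VarLenFeature is not supported, it can help to scan a feature_dict before constructing TFRecordToMR. The sketch below is one hypothetical way to do so. `find_var_len_features` is not part of MindSpore, and it identifies entries by the class name "VarLenFeature" rather than by importing TensorFlow, which is an assumption about how the feature objects are named.

```python
def find_var_len_features(feature_dict, prefix=""):
    """Return dotted paths of entries whose type is named VarLenFeature.

    Nested dictionaries (such as "context" and "sequence" groups) are
    searched recursively. Matching on the class name is an assumption made
    so that the check does not need TensorFlow to be imported.
    """
    found = []
    for key, value in feature_dict.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            found.extend(find_var_len_features(value, prefix=f"{path}."))
        elif type(value).__name__ == "VarLenFeature":
            found.append(path)
    return found


# With TensorFlow installed, the unsupported example above would be reported as:
# find_var_len_features(feature_dict)  ->  a list containing "context.yyyy"
```

An empty result means no VarLenFeature entry was found. It does not guarantee that every other feature type in the dictionary is supported.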