{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 转换数据集为MindRecord\n", "\n", "[![](https://gitee.com/mindspore/docs/raw/r1.2/docs/programming_guide/source_zh_cn/_static/logo_source.png)](https://gitee.com/mindspore/docs/blob/r1.2/tutorials/training/source_zh_cn/advanced_use/convert_dataset.ipynb) [![](https://gitee.com/mindspore/docs/raw/r1.2/docs/programming_guide/source_zh_cn/_static/logo_notebook.png)](https://obs.dualstack.cn-north-4.myhuaweicloud.com/mindspore-website/notebook/r1.2/mindspore_convert_dataset.ipynb) [![](https://gitee.com/mindspore/docs/raw/r1.2/docs/programming_guide/source_zh_cn/_static/logo_modelarts.png)](https://console.huaweicloud.com/modelarts/?region=cn-north-4#/notebook/loading?share-url-b64=aHR0cHM6Ly9vYnMuZHVhbHN0YWNrLmNuLW5vcnRoLTQubXlodWF3ZWljbG91ZC5jb20vbWluZHNwb3JlLXdlYnNpdGUvbm90ZWJvb2svbW9kZWxhcnRzL21pbmRzcG9yZV9jb252ZXJ0X2RhdGFzZXRfdG9fbWluZHJlY29yZC5pcHluYg==&image_id=65f636a0-56cf-49df-b941-7d2a07ba8c8c)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 概述\n", "\n", "用户可以将非标准的数据集和常用的数据集转换为MindSpore数据格式,即MindRecord,从而方便地加载到MindSpore中进行训练。同时,MindSpore在部分场景做了性能优化,使用MindRecord数据格式可以获得更好的性能体验。\n", "\n", "MindSpore数据格式具备的特征如下:\n", "\n", "1. 实现多变的用户数据统一存储、访问,训练数据读取更加简便。\n", "2. 数据聚合存储,高效读取,且方便管理、移动。\n", "3. 高效的数据编解码操作,对用户透明、无感知。\n", "4. 可以灵活控制分区的大小,实现分布式训练。\n", "\n", "MindSpore数据格式的目标是归一化用户的数据集,并进一步通过`MindDataset`(详细使用方法参考[API](https://www.mindspore.cn/doc/api_python/zh-CN/r1.2/mindspore/dataset/mindspore.dataset.MindDataset.html))实现数据的读取,并用于训练过程。\n", "\n", "![data-conversion-concept](https://gitee.com/mindspore/docs/raw/r1.2/tutorials/training/source_zh_cn/advanced_use/images/data_conversion_concept.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 基本概念" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "一个MindRecord文件由数据文件和索引文件组成,且数据文件及索引文件暂不支持重命名操作:\n", "\n", "- 数据文件\n", "\n", " 包含文件头、标量数据页、块数据页,用于存储用户归一化后的训练数据,且单个MindRecord文件建议小于20G,用户可将大数据集进行分片存储为多个MindRecord文件。\n", "\n", "\n", "- 索引文件\n", "\n", " 包含基于标量数据(如图像Label、图像文件名等)生成的索引信息,用于方便的检索、统计数据集信息。\n", "\n", "![mindrecord](https://gitee.com/mindspore/docs/raw/r1.2/tutorials/training/source_zh_cn/advanced_use/images/mindrecord.png)\n", "\n", "数据文件主要由以下几个关键部分组成:\n", "\n", "- 文件头\n", " \n", " 文件头主要用来存储文件头大小、标量数据页大小、块数据页大小、Schema信息、索引字段、统计信息、文件分区信息、标量数据与块数据对应关系等,是MindRecord文件的元信息。\n", "\n", "\n", "- 标量数据页\n", " \n", " 标量数据页主要用来存储整型、字符串、浮点型数据,如图像的Label、图像的文件名、图像的长宽等信息,即适合用标量来存储的信息会保存在这里。\n", "\n", "\n", "- 块数据页\n", " \n", " 块数据页主要用来存储二进制串、Numpy数组等数据,如二进制图像文件本身、文本转换成的字典等。\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 将数据集转换为MindRecord\n", "\n", "下面本教程将简单演示如何将图片数据及其标注转换为MindRecord格式。更多MindSpore数据格式转换说明,可参见编程指南中[MindSpore数据格式转换](https://www.mindspore.cn/doc/programming_guide/zh-CN/r1.2/dataset_conversion.html)章节。" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2021-02-22T02:13:50.133177Z", "start_time": "2021-02-22T02:13:50.127990Z" } }, "source": [ "示例一:展示如何将数据按照定义的数据集结构转换为MindRecord数据文件。\n", "\n", "1. 导入文件写入工具类`FileWriter`。" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2021-02-22T02:14:54.114866Z", "start_time": "2021-02-22T02:14:11.315731Z" } }, "outputs": [], "source": [ "from mindspore.mindrecord import FileWriter" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. 定义数据集结构文件Schema。" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "cv_schema_json = {\"file_name\": {\"type\": \"string\"}, \"label\": {\"type\": \"int32\"}, \"data\": {\"type\": \"bytes\"}}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Schema文件主要包含字段名`name`、字段数据类型`type`和字段各维度维数`shape`:\n", "\n", "- 字段名:字段的引用名称,可以包含字母、数字和下划线。\n", "\n", "- 字段数据类型:包含int32、int64、float32、float64、string、bytes。\n", "\n", "- 字段维数:一维数组用[-1]表示,更高维度可表示为[m, n, …],其中m、n为各维度维数。\n", "\n", "> 如果字段有属性`shape`,则用户传入`write_raw_data`接口的数据必须为`numpy.ndarray`类型,对应数据类型必须为int32、int64、float32、float64。\n", "\n", "3. 按照用户定义的Schema格式,准备需要写入的数据列表,此处传入的是图片数据的二进制流。" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "data = [{\"file_name\": \"1.jpg\", \"label\": 0, \"data\": b\"\\x10c\\xb3w\\xa8\\xee$o&\\xd4\\x00\\xf8\\x129\\x15\\xd9\\xf2q\\xc0\\xa2\\x91YFUO\\x1dsE1\\x1ep\"},\n", " {\"file_name\": \"3.jpg\", \"label\": 99, \"data\": b\"\\xaf\\xafU<\\xb8|6\\xbd}\\xc1\\x99[\\xeaj+\\x8f\\x84\\xd3\\xcc\\xa0,i\\xbb\\xb9-\\xcdz\\xecp{T\\xb1\\xdb\"}]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "4. 添加索引字段可以加速数据读取,该步骤非必选。" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "indexes = [\"file_name\", \"label\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "5. 创建FileWriter对象,传入文件名及分片数量,然后添加Schema文件及索引,调用`write_raw_data`接口写入数据,最后调用`commit`接口生成本地数据文件。" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "MSRStatus.SUCCESS" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "writer = FileWriter(file_name=\"test.mindrecord\", shard_num=4)\n", "writer.add_schema(cv_schema_json, \"test_schema\")\n", "writer.add_index(indexes)\n", "writer.write_raw_data(data)\n", "writer.commit()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "该示例会生成`test.mindrecord0`、`test.mindrecord0.db`、`test.mindrecord1`、`test.mindrecord1.db`、`test.mindrecord2`、`test.mindrecord2.db`、`test.mindrecord3`、`test.mindrecord3.db`共8个文件,称为MindRecord数据集。`test.mindrecord0`和`test.mindrecord0.db`称为1个MindRecord文件,其中`test.mindrecord0`为数据文件,`test.mindrecord0.db`为索引文件。\n", "\n", "接口说明:\n", "\n", "- `write_raw_data`:将数据写入到内存之中。\n", "\n", "- `commit`:将最终内存中的数据写入到磁盘。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "6. 如果需要在现有数据格式文件中增加新数据,可以调用`open_for_append`接口打开已存在的数据文件,继续调用`write_raw_data`接口写入新数据,最后调用`commit`接口生成本地数据文件。" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2021-02-22T02:32:29.779366Z", "start_time": "2021-02-22T02:32:29.606138Z" } }, "outputs": [ { "data": { "text/plain": [ "MSRStatus.SUCCESS" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "writer = FileWriter.open_for_append(\"test.mindrecord0\")\n", "writer.write_raw_data(data)\n", "writer.commit()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "示例二:将`jpg`格式的图片,按照示例一的方法,将其转换成MindRecord数据集。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "下载需要处理的图片数据`transform.jpg`作为待处理的原始数据。\n", "\n", "创建文件夹目录`./datasets/convert_dataset_to_mindrecord/datas_to_mindrecord/`用于存放本次体验中所有的转换数据集。\n", "\n", "创建文件夹目录`./datasets/convert_dataset_to_mindrecord/images/`用于存放下载下来的图片数据。" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "./datasets/convert_dataset_to_mindrecord/images/\n", "└── transform.jpg\n", "\n", "0 directories, 1 file\n" ] } ], "source": [ "!wget -N https://obs.dualstack.cn-north-4.myhuaweicloud.com/mindspore-website/notebook/datasets/transform.jpg\n", "!mkdir -p ./datasets/convert_dataset_to_mindrecord/datas_to_mindrecord/\n", "!mkdir -p ./datasets/convert_dataset_to_mindrecord/images/\n", "!mv -f ./transform.jpg ./datasets/convert_dataset_to_mindrecord/images/\n", "!tree ./datasets/convert_dataset_to_mindrecord/images/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "执行以下代码,将下载的`transform.jpg`转换为MindRecord数据集。" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "MSRStatus.SUCCESS" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# step 1 import class FileWriter\n", "import os \n", "from mindspore.mindrecord import FileWriter\n", "\n", "# clean up old run files before in Linux\n", "data_path = './datasets/convert_dataset_to_mindrecord/datas_to_mindrecord/'\n", "os.system('rm -f {}test.*'.format(data_path))\n", "\n", "# import FileWriter class ready to write data\n", "data_record_path = './datasets/convert_dataset_to_mindrecord/datas_to_mindrecord/test.mindrecord'\n", "writer = FileWriter(file_name=data_record_path,shard_num=4)\n", "\n", "# define the data type\n", "data_schema = {\"file_name\":{\"type\":\"string\"},\"label\":{\"type\":\"int32\"},\"data\":{\"type\":\"bytes\"}}\n", "writer.add_schema(data_schema,\"test_schema\")\n", "\n", "# prepeare the data contents\n", "file_name = \"./datasets/convert_dataset_to_mindrecord/images/transform.jpg\"\n", "with open(file_name, \"rb\") as f:\n", " bytes_data = f.read()\n", "data = [{\"file_name\":\"transform.jpg\", \"label\":1, \"data\":bytes_data}]\n", "\n", "# add index field\n", "indexes = [\"file_name\",\"label\"]\n", "writer.add_index(indexes)\n", "\n", "# save data to the files\n", "writer.write_raw_data(data)\n", "writer.commit()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "该示例会生成8个文件,成为MindRecord数据集。`test.mindrecord0`和`test.mindrecord0.db`称为1个MindRecord文件,其中`test.mindrecord0`为数据文件,`test.mindrecord0.db`为索引文件,生成的文件如下所示:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "./datasets/convert_dataset_to_mindrecord/datas_to_mindrecord/\n", "├── test.mindrecord0\n", "├── test.mindrecord0.db\n", "├── test.mindrecord1\n", "├── test.mindrecord1.db\n", "├── test.mindrecord2\n", "├── test.mindrecord2.db\n", "├── test.mindrecord3\n", "└── test.mindrecord3.db\n", "\n", "0 directories, 8 files\n" ] } ], "source": [ "!tree ./datasets/convert_dataset_to_mindrecord/datas_to_mindrecord/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 读取MindRecord数据集" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "下面将简单演示如何通过`MindDataset`读取MindRecord数据集。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. 导入读取类`MindDataset`。" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "import mindspore.dataset as ds" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. 首先使用`MindDataset`读取MindRecord数据集,然后对数据创建了字典迭代器,并通过迭代器读取了一条数据记录。" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "ExecuteTime": { "end_time": "2021-02-22T02:35:01.078717Z", "start_time": "2021-02-22T02:35:01.036028Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "sample: {'data': array([255, 216, 255, ..., 159, 255, 217], dtype=uint8), 'file_name': array(b'transform.jpg', dtype='|S13'), 'label': array(1, dtype=int32)}\n", "Got 1 samples\n" ] } ], "source": [ "file_name = './datasets/convert_dataset_to_mindrecord/datas_to_mindrecord/test.mindrecord0'\n", "# create MindDataset for reading data\n", "define_data_set = ds.MindDataset(dataset_file=file_name)\n", "# create a dictionary iterator and read a data record through the iterator\n", "count = 0\n", "for item in define_data_set.create_dict_iterator(output_numpy=True):\n", " print(\"sample: {}\".format(item))\n", " count += 1\n", "print(\"Got {} samples\".format(count))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }