{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "[![下载Notebook](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_notebook.svg)](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/master/tutorials/zh_cn/advanced/dataset/mindspore_record.ipynb) \n", "[![下载样例代码](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_download_code.svg)](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/master/tutorials/zh_cn/advanced/dataset/mindspore_record.py) \n", "[![查看源文件](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source.svg)](https://gitee.com/mindspore/docs/blob/master/tutorials/source_zh_cn/advanced/dataset/record.ipynb)\n", "\n", "# 格式转换\n", "\n", "MindSpore中可以把用于训练网络模型的数据集,转换为MindSpore特定的格式数据(MindSpore Record格式),从而更加方便地保存和加载数据。其目标是归一化用户的数据集,并进一步通过`MindDataset`接口实现数据的读取,并用于训练过程。\n", "\n", "![conversion](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/tutorials/source_zh_cn/advanced/dataset/images/data_conversion_concept.png)\n", "\n", "此外,MindSpore还针对部分数据场景进行了性能优化,使用MindSpore Record数据格式可以减少磁盘IO、网络IO开销,从而获得更好的使用体验。\n", "\n", "MindSpore Record数据格式具备的特征如下:\n", "\n", "1. 实现数据统一存储、访问,使得训练时数据读取更加简便。\n", "2. 数据聚合存储、高效读取,使得训练时数据方便管理和移动。\n", "3. 高效的数据编解码操作,使得用户可以对数据操作无感知。\n", "4. 可以灵活控制数据切分的分区大小,实现分布式数据处理。\n", "\n", "## Record文件结构\n", "\n", "如下图所示,MindSpore Record文件由数据文件和索引文件组成。\n", "\n", "![MindSpore Record](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/tutorials/source_zh_cn/advanced/dataset/images/mindrecord.png)\n", "\n", "其中数据文件包含文件头、标量数据页、块数据页,用于存储用户归一化后的训练数据,且单个MindSpore Record文件建议小于20G,用户可将大数据集进行分片存储为多个MindSpore Record文件。\n", "\n", "而索引文件则包含基于标量数据(如图像Label、图像文件名等)生成的索引信息,用于方便地检索、统计数据集信息。\n", "\n", "数据文件中的文件头、标量数据页、块数据页的具体用途如下所示:\n", "\n", "- **文件头**:是MindSpore Record文件的元信息,主要用来存储文件头大小、标量数据页大小、块数据页大小、Schema信息、索引字段、统计信息、文件分区信息、标量数据与块数据对应关系等。\n", "- **标量数据页**:主要用来存储整型、字符串、浮点型数据,如图像的Label、图像的文件名、图像的长宽等信息,即适合用标量来存储的信息会保存在这里。\n", "- **块数据页**:主要用来存储二进制串、NumPy数组等数据,如二进制图像文件本身、文本转换成的字典等。\n", "\n", "> 值得注意的是,数据文件和索引文件均暂不支持重命名操作。\n", "\n", "## 转换成Record格式\n", "\n", "下面主要介绍如何将CV类数据和NLP类数据转换为MindSpore Record文件格式,并通过`MindDataset`接口,实现MindSpore Record文件的读取。\n", "\n", "### 转换CV类数据集\n", "\n", "本示例主要以包含100条记录的CV数据集并将其转换为MindSpore Record格式为例子,介绍如何将CV类数据集转换成MindSpore Record文件格式,并使用`MindDataset`接口读取。\n", "\n", "首先,需要创建100张图片的数据集并对齐进行保存,其样本包含`file_name`(字符串)、`label`(整型)、 `data`(二进制)三个字段,然后使用`MindDataset`接口读取该MindSpore Record文件。\n", "\n", "1. 生成100张图像,并转换成MindSpore Record文件格式。" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2021-02-22T10:34:03.889515Z", "start_time": "2021-02-22T10:34:02.950207Z" } }, "outputs": [], "source": [ "from PIL import Image\n", "from io import BytesIO\n", "from mindspore.mindrecord import FileWriter\n", "\n", "file_name = \"test_vision.mindrecord\"\n", "# 定义包含的字段\n", "cv_schema = {\"file_name\": {\"type\": \"string\"},\n", " \"label\": {\"type\": \"int32\"},\n", " \"data\": {\"type\": \"bytes\"}}\n", "\n", "# 声明MindSpore Record文件格式\n", "writer = FileWriter(file_name, shard_num=1, overwrite=True)\n", "writer.add_schema(cv_schema, \"it is a cv dataset\")\n", "writer.add_index([\"file_name\", \"label\"])\n", "\n", "# 创建数据集\n", "data = []\n", "for i in range(100):\n", " sample = {}\n", " white_io = BytesIO()\n", " Image.new('RGB', ((i+1)*10, (i+1)*10), (255, 255, 255)).save(white_io, 'JPEG')\n", " image_bytes = white_io.getvalue()\n", " sample['file_name'] = str(i+1) + \".jpg\"\n", " sample['label'] = i+1\n", " sample['data'] = white_io.getvalue()\n", "\n", " data.append(sample)\n", " if i % 10 == 0:\n", " writer.write_raw_data(data)\n", " data = []\n", "\n", "if data:\n", " writer.write_raw_data(data)\n", "\n", "writer.commit()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "上面示例运行无报错说明数据集转换成功。\n", "\n", "2. 通过`MindDataset`接口读取MindSpore Record文件格式。" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2021-02-22T10:34:07.729322Z", "start_time": "2021-02-22T10:34:07.575711Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Got 100 samples\n" ] } ], "source": [ "from mindspore.dataset import MindDataset\n", "from mindspore.dataset.vision import Decode\n", "\n", "# 读取MindSpore Record文件格式\n", "data_set = MindDataset(dataset_files=file_name)\n", "decode_op = Decode()\n", "data_set = data_set.map(operations=decode_op, input_columns=[\"data\"], num_parallel_workers=2)\n", "\n", "# 样本计数\n", "print(\"Got {} samples\".format(data_set.get_dataset_size()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 转换NLP类数据集\n", "\n", "本示例首先创建一个包含100条记录的MindSpore Record文件格式,其样本包含八个字段,均为整型数组,然后使用`MindDataset`接口读取该MindSpore Record文件。\n", "\n", "> 为了方便展示,此处略去了将文本转换成字典序的预处理过程。\n", "\n", "1. 生成100条文本数据,并转换成MindSpore Record文件格式。" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2021-02-22T10:34:23.883130Z", "start_time": "2021-02-22T10:34:23.660213Z" } }, "outputs": [], "source": [ "import numpy as np\n", "from mindspore.mindrecord import FileWriter\n", "\n", "# 输出的MindSpore Record文件完整路径\n", "file_name = \"test_text.mindrecord\"\n", "\n", "# 定义样本数据包含的字段\n", "nlp_schema = {\"source_sos_ids\": {\"type\": \"int64\", \"shape\": [-1]},\n", " \"source_sos_mask\": {\"type\": \"int64\", \"shape\": [-1]},\n", " \"source_eos_ids\": {\"type\": \"int64\", \"shape\": [-1]},\n", " \"source_eos_mask\": {\"type\": \"int64\", \"shape\": [-1]},\n", " \"target_sos_ids\": {\"type\": \"int64\", \"shape\": [-1]},\n", " \"target_sos_mask\": {\"type\": \"int64\", \"shape\": [-1]},\n", " \"target_eos_ids\": {\"type\": \"int64\", \"shape\": [-1]},\n", " \"target_eos_mask\": {\"type\": \"int64\", \"shape\": [-1]}}\n", "\n", "# 声明MindSpore Record文件格式\n", "writer = FileWriter(file_name, shard_num=1, overwrite=True)\n", "writer.add_schema(nlp_schema, \"Preprocessed nlp dataset.\")\n", "\n", "# 创建虚拟数据集\n", "data = []\n", "for i in range(100):\n", " sample = {\"source_sos_ids\": np.array([i, i + 1, i + 2, i + 3, i + 4], dtype=np.int64),\n", " \"source_sos_mask\": np.array([i * 1, i * 2, i * 3, i * 4, i * 5, i * 6, i * 7], dtype=np.int64),\n", " \"source_eos_ids\": np.array([i + 5, i + 6, i + 7, i + 8, i + 9, i + 10], dtype=np.int64),\n", " \"source_eos_mask\": np.array([19, 20, 21, 22, 23, 24, 25, 26, 27], dtype=np.int64),\n", " \"target_sos_ids\": np.array([28, 29, 30, 31, 32], dtype=np.int64),\n", " \"target_sos_mask\": np.array([33, 34, 35, 36, 37, 38], dtype=np.int64),\n", " \"target_eos_ids\": np.array([39, 40, 41, 42, 43, 44, 45, 46, 47], dtype=np.int64),\n", " \"target_eos_mask\": np.array([48, 49, 50, 51], dtype=np.int64)}\n", " data.append(sample)\n", "\n", " if i % 10 == 0:\n", " writer.write_raw_data(data)\n", " data = []\n", "\n", "if data:\n", " writer.write_raw_data(data)\n", "\n", "writer.commit()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. 通过`MindDataset`接口读取MindSpore Record格式文件。" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2021-02-22T10:34:27.133717Z", "start_time": "2021-02-22T10:34:27.083785Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Got 100 samples\n", "source_sos_ids: [0 1 2 3 4]\n", "source_sos_ids: [1 2 3 4 5]\n", "source_sos_ids: [2 3 4 5 6]\n", "source_sos_ids: [3 4 5 6 7]\n", "source_sos_ids: [4 5 6 7 8]\n", "source_sos_ids: [5 6 7 8 9]\n", "source_sos_ids: [ 6 7 8 9 10]\n", "source_sos_ids: [ 7 8 9 10 11]\n", "source_sos_ids: [ 8 9 10 11 12]\n", "source_sos_ids: [ 9 10 11 12 13]\n" ] } ], "source": [ "from mindspore.dataset import MindDataset\n", "\n", "# 读取MindSpore Record文件格式\n", "data_set = MindDataset(dataset_files=file_name, shuffle=False)\n", "\n", "# 样本计数\n", "print(\"Got {} samples\".format(data_set.get_dataset_size()))\n", "\n", "# 打印部分数据\n", "count = 0\n", "for item in data_set.create_dict_iterator(output_numpy=True):\n", " print(\"source_sos_ids:\", item[\"source_sos_ids\"])\n", " count += 1\n", " if count == 10:\n", " break" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dataset转存MindRecord\n", "\n", "MindSpore提供转换常用数据集的工具类,能够将常用的数据集转换为MindSpore Record文件格式。\n", "\n", "> 更多数据集转换的详细说明参考[API文档](https://www.mindspore.cn/docs/zh-CN/master/api_python/mindspore.mindrecord.html)。\n", "\n", "### 转存CIFAR-10数据集\n", "\n", "用户可以通过`Dataset.save`类,将CIFAR-10原始数据转换为MindSpore Record,并使用`MindDataset`接口读取。\n", "\n", "1. 下载[CIFAR-10数据集](https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz),并使用`Cifar10Dataset`加载。" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Downloading data from https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/cifar-10-binary.tar.gz (162.2 MB)\n", "\n", "file_sizes: 100%|████████████████████████████| 170M/170M [00:18<00:00, 9.34MB/s]\n", "Extracting tar.gz file...\n", "Successfully downloaded / unzipped to ./\n" ] } ], "source": [ "from download import download\n", "from mindspore.dataset import Cifar10Dataset\n", "\n", "url = \"https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/cifar-10-binary.tar.gz\"\n", "\n", "path = download(url, \"./\", kind=\"tar.gz\", replace=True)\n", "dataset = Cifar10Dataset(\"./cifar-10-batches-bin/\") # 加载数据" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. 调用`Dataset.save`接口,将CIFAR-10数据集转存为MindSpore Record文件格式。" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "dataset.save(\"cifar10.mindrecord\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3. 通过`MindDataset`接口读取MindSpore Record文件格式。" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Got 60000 samples\n" ] } ], "source": [ "import os\n", "from mindspore.dataset import MindDataset\n", "\n", "# 读取MindSpore Record文件格式\n", "data_set = MindDataset(dataset_files=\"cifar10.mindrecord\")\n", "\n", "# 样本计数\n", "print(\"Got {} samples\".format(data_set.get_dataset_size()))\n", "\n", "if os.path.exists(\"cifar10.mindrecord\") and os.path.exists(\"cifar10.mindrecord.db\"):\n", " os.remove(\"cifar10.mindrecord\")\n", " os.remove(\"cifar10.mindrecord.db\")\n" ] } ], "metadata": { "kernelspec": { "display_name": "MindSpore", "language": "python", "name": "mindspore" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.5" } }, "nbformat": 4, "nbformat_minor": 4 }