{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# MindRecord格式转换\n",
    "\n",
    "[![下载Notebook](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_notebook.svg)](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/master/tutorials/zh_cn/dataset/mindspore_record.ipynb)&emsp;\n",
    "[![下载样例代码](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_download_code.svg)](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/master/tutorials/zh_cn/dataset/mindspore_record.py)&emsp;\n",
    "[![查看源文件](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source.svg)](https://gitee.com/mindspore/docs/blob/master/tutorials/source_zh_cn/dataset/record.ipynb)\n",
    "\n",
    "MindSpore可以将用于训练网络模型的数据集转换为特定的数据格式（MindSpore Record），便于数据的保存和加载。其目标是归一化用户数据集，并通过[mindspore.dataset.MindDataset](https://www.mindspore.cn/docs/zh-CN/master/api_python/dataset/mindspore.dataset.MindDataset.html)接口实现数据的读取，用于训练过程。\n",
    "\n",
    "![conversion](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/tutorials/source_zh_cn/dataset/images/data_conversion_concept.png)\n",
    "\n",
    "此外，MindSpore还针对部分数据场景进行了性能优化。使用MindSpore Record数据格式可以减少磁盘IO和网络IO开销，从而获得更好的使用体验。\n",
    "\n",
    "MindSpore Record数据格式具备以下特征：\n",
    "\n",
    "1. 实现数据统一存储和访问，使训练时数据读取更加简便。\n",
    "2. 支持数据聚合存储和高效读取，便于数据管理和移动。\n",
    "3. 提供高效的数据编解码操作，用户对数据操作无感知。\n",
    "4. 可以灵活控制数据切分的分区大小，实现分布式数据处理。\n",
    "\n",
    "## Record文件结构\n",
    "\n",
    "如下图所示，MindSpore Record文件由数据文件和索引文件组成。\n",
    "\n",
    "![MindSpore Record](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/tutorials/source_zh_cn/dataset/images/mindrecord.png)\n",
    "\n",
    "其中，数据文件包含文件头、标量数据页和块数据页，用于存储用户归一化后的训练数据。具体用途如下：\n",
    "\n",
    "- **文件头**：MindSpore Record文件的元信息。主要用于存储文件头大小、标量数据页大小、块数据页大小、Schema信息、索引字段、统计信息、文件分区信息、标量数据与块数据对应关系等。\n",
    "- **标量数据页**：主要用于存储整型、字符串、浮点型等标量类型数据，如图像的Label、文件名、长宽等信息。\n",
    "- **块数据页**：主要用于存储二进制串、NumPy数组等数据，如二进制图像文件本身、文本转换后的字典等。\n",
    "\n",
    "索引文件则包含基于标量数据（如图像Label、图像文件名等）生成的索引信息，便于检索和统计数据集信息。\n",
    "\n",
    "> - 单个MindSpore Record文件建议小于20G。对于大数据集，建议分片存储为多个MindSpore Record文件。\n",
    "> - 数据文件和索引文件均暂不支持重命名操作。\n",
    "\n",
    "## 转换成Record格式\n",
    "\n",
    "下面主要介绍如何将CV类数据和NLP类数据转换为MindSpore Record文件格式，并通过`MindDataset`接口读取。\n",
    "\n",
    "### 转换CV类数据\n",
    "\n",
    "本示例以包含100条记录的CV数据集为例，介绍如何将其转换为MindSpore Record格式，并使用`MindDataset`接口读取。\n",
    "\n",
    "具体来说，需要创建一个包含100张图片的数据集并保存。每个样本包含`file_name`（字符串）、`label`（整型）、 `data`（二进制）三个字段，然后使用`MindDataset`接口读取该MindSpore Record文件。\n",
    "\n",
    "1. 生成100张图像，并转换成MindSpore Record文件格式。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-02-22T10:34:03.889515Z",
     "start_time": "2021-02-22T10:34:02.950207Z"
    }
   },
   "outputs": [],
   "source": [
    "from PIL import Image\n",
    "from io import BytesIO\n",
    "from mindspore.mindrecord import FileWriter\n",
    "\n",
    "file_name = \"test_vision.mindrecord\"\n",
    "# 定义包含的字段\n",
    "cv_schema = {\"file_name\": {\"type\": \"string\"},\n",
    "             \"label\": {\"type\": \"int32\"},\n",
    "             \"data\": {\"type\": \"bytes\"}}\n",
    "\n",
    "# 声明MindSpore Record文件格式\n",
    "writer = FileWriter(file_name, shard_num=1, overwrite=True)\n",
    "writer.add_schema(cv_schema, \"it is a cv dataset\")\n",
    "writer.add_index([\"file_name\", \"label\"])\n",
    "\n",
    "# 创建数据集\n",
    "data = []\n",
    "for i in range(100):\n",
    "    sample = {}\n",
    "    white_io = BytesIO()\n",
    "    Image.new('RGB', ((i+1)*10, (i+1)*10), (255, 255, 255)).save(white_io, 'JPEG')\n",
    "    image_bytes = white_io.getvalue()\n",
    "    sample['file_name'] = str(i+1) + \".jpg\"\n",
    "    sample['label'] = i+1\n",
    "    sample['data'] = white_io.getvalue()\n",
    "\n",
    "    data.append(sample)\n",
    "    if i % 10 == 0:\n",
    "        writer.write_raw_data(data)\n",
    "        data = []\n",
    "\n",
    "if data:\n",
    "    writer.write_raw_data(data)\n",
    "\n",
    "writer.commit()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "若上述示例运行无报错，则说明数据集转换成功。\n",
    "\n",
    "2. 通过`MindDataset`接口读取MindSpore Record格式文件。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-02-22T10:34:07.729322Z",
     "start_time": "2021-02-22T10:34:07.575711Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Got 100 samples\n"
     ]
    }
   ],
   "source": [
    "from mindspore.dataset import MindDataset\n",
    "from mindspore.dataset.vision import Decode\n",
    "\n",
    "# 读取MindSpore Record格式文件\n",
    "data_set = MindDataset(dataset_files=file_name)\n",
    "decode_op = Decode()\n",
    "data_set = data_set.map(operations=decode_op, input_columns=[\"data\"], num_parallel_workers=2)\n",
    "\n",
    "# 样本计数\n",
    "print(\"Got {} samples\".format(data_set.get_dataset_size()))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 转换NLP类数据集\n",
    "\n",
    "本示例首先创建一个包含100条记录的文本数据，然后转换为MindSpore Record文件格式。每个样本包含八个字段，均为整型数组。最后，使用`MindDataset`接口读取该MindSpore Record文件。\n",
    "\n",
    "> 为便于展示，此处略去了将文本转换成字典序的预处理过程。\n",
    "\n",
    "1. 生成100条文本数据，并转换成MindSpore Record文件格式。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-02-22T10:34:23.883130Z",
     "start_time": "2021-02-22T10:34:23.660213Z"
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from mindspore.mindrecord import FileWriter\n",
    "\n",
    "# 输出的MindSpore Record文件完整路径\n",
    "file_name = \"test_text.mindrecord\"\n",
    "\n",
    "# 定义样本数据包含的字段\n",
    "nlp_schema = {\"source_sos_ids\": {\"type\": \"int64\", \"shape\": [-1]},\n",
    "              \"source_sos_mask\": {\"type\": \"int64\", \"shape\": [-1]},\n",
    "              \"source_eos_ids\": {\"type\": \"int64\", \"shape\": [-1]},\n",
    "              \"source_eos_mask\": {\"type\": \"int64\", \"shape\": [-1]},\n",
    "              \"target_sos_ids\": {\"type\": \"int64\", \"shape\": [-1]},\n",
    "              \"target_sos_mask\": {\"type\": \"int64\", \"shape\": [-1]},\n",
    "              \"target_eos_ids\": {\"type\": \"int64\", \"shape\": [-1]},\n",
    "              \"target_eos_mask\": {\"type\": \"int64\", \"shape\": [-1]}}\n",
    "\n",
    "# 声明MindSpore Record文件格式\n",
    "writer = FileWriter(file_name, shard_num=1, overwrite=True)\n",
    "writer.add_schema(nlp_schema, \"Preprocessed nlp dataset.\")\n",
    "\n",
    "# 创建虚拟数据集\n",
    "data = []\n",
    "for i in range(100):\n",
    "    sample = {\"source_sos_ids\": np.array([i, i + 1, i + 2, i + 3, i + 4], dtype=np.int64),\n",
    "              \"source_sos_mask\": np.array([i * 1, i * 2, i * 3, i * 4, i * 5, i * 6, i * 7], dtype=np.int64),\n",
    "              \"source_eos_ids\": np.array([i + 5, i + 6, i + 7, i + 8, i + 9, i + 10], dtype=np.int64),\n",
    "              \"source_eos_mask\": np.array([19, 20, 21, 22, 23, 24, 25, 26, 27], dtype=np.int64),\n",
    "              \"target_sos_ids\": np.array([28, 29, 30, 31, 32], dtype=np.int64),\n",
    "              \"target_sos_mask\": np.array([33, 34, 35, 36, 37, 38], dtype=np.int64),\n",
    "              \"target_eos_ids\": np.array([39, 40, 41, 42, 43, 44, 45, 46, 47], dtype=np.int64),\n",
    "              \"target_eos_mask\": np.array([48, 49, 50, 51], dtype=np.int64)}\n",
    "    data.append(sample)\n",
    "\n",
    "    if i % 10 == 0:\n",
    "        writer.write_raw_data(data)\n",
    "        data = []\n",
    "\n",
    "if data:\n",
    "    writer.write_raw_data(data)\n",
    "\n",
    "writer.commit()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "2. 通过`MindDataset`接口读取MindSpore Record格式文件。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-02-22T10:34:27.133717Z",
     "start_time": "2021-02-22T10:34:27.083785Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Got 100 samples\n",
      "source_sos_ids: [0 1 2 3 4]\n",
      "source_sos_ids: [1 2 3 4 5]\n",
      "source_sos_ids: [2 3 4 5 6]\n",
      "source_sos_ids: [3 4 5 6 7]\n",
      "source_sos_ids: [4 5 6 7 8]\n",
      "source_sos_ids: [5 6 7 8 9]\n",
      "source_sos_ids: [ 6  7  8  9 10]\n",
      "source_sos_ids: [ 7  8  9 10 11]\n",
      "source_sos_ids: [ 8  9 10 11 12]\n",
      "source_sos_ids: [ 9 10 11 12 13]\n"
     ]
    }
   ],
   "source": [
    "from mindspore.dataset import MindDataset\n",
    "\n",
    "# 读取MindSpore Record格式文件\n",
    "data_set = MindDataset(dataset_files=file_name, shuffle=False)\n",
    "\n",
    "# 样本计数\n",
    "print(\"Got {} samples\".format(data_set.get_dataset_size()))\n",
    "\n",
    "# 打印部分数据\n",
    "count = 0\n",
    "for item in data_set.create_dict_iterator(output_numpy=True):\n",
    "    print(\"source_sos_ids:\", item[\"source_sos_ids\"])\n",
    "    count += 1\n",
    "    if count == 10:\n",
    "        break"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Dataset转存MindRecord\n",
    "\n",
    "MindSpore提供常用数据集的转换工具类，能够将常用的数据集转换为MindSpore Record文件格式。\n",
    "\n",
    "> 更多数据集转换的详细说明参考[API文档](https://www.mindspore.cn/docs/zh-CN/master/api_python/mindspore.mindrecord.html)。\n",
    "\n",
    "### 转存CIFAR-10数据集\n",
    "\n",
    "用户可以通过[mindspore.dataset.Dataset.save](https://www.mindspore.cn/docs/zh-CN/master/api_python/dataset/dataset_method/operation/mindspore.dataset.Dataset.save.html)方法，将CIFAR-10原始数据转换为MindSpore Record，并使用`MindDataset`接口读取。\n",
    "\n",
    "1. 下载[CIFAR-10数据集](https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz)，并使用`Cifar10Dataset`加载。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Downloading data from https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/cifar-10-binary.tar.gz (162.2 MB)\n",
      "\n",
      "file_sizes: 100%|████████████████████████████| 170M/170M [00:18<00:00, 9.34MB/s]\n",
      "Extracting tar.gz file...\n",
      "Successfully downloaded / unzipped to ./\n"
     ]
    }
   ],
   "source": [
    "from download import download\n",
    "from mindspore.dataset import Cifar10Dataset\n",
    "\n",
    "url = \"https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/cifar-10-binary.tar.gz\"\n",
    "\n",
    "path = download(url, \"./\", kind=\"tar.gz\", replace=True)\n",
    "dataset = Cifar10Dataset(\"./cifar-10-batches-bin/\")  # 加载数据"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "2. 调用`Dataset.save`接口，将CIFAR-10数据集转存为MindSpore Record文件格式。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset.save(\"cifar10.mindrecord\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "3. 通过`MindDataset`接口读取MindSpore Record格式文件。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Got 60000 samples\n"
     ]
    }
   ],
   "source": [
    "import os\n",
    "from mindspore.dataset import MindDataset\n",
    "\n",
    "# 读取MindSpore Record文件格式\n",
    "data_set = MindDataset(dataset_files=\"cifar10.mindrecord\")\n",
    "\n",
    "# 样本计数\n",
    "print(\"Got {} samples\".format(data_set.get_dataset_size()))\n",
    "\n",
    "if os.path.exists(\"cifar10.mindrecord\") and os.path.exists(\"cifar10.mindrecord.db\"):\n",
    "    os.remove(\"cifar10.mindrecord\")\n",
    "    os.remove(\"cifar10.mindrecord.db\")\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "MindSpore",
   "language": "python",
   "name": "mindspore"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}