{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 格式转换\n",
    "\n",
    "[![下载Notebook](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r1.8/resource/_static/logo_notebook.png)](https://obs.dualstack.cn-north-4.myhuaweicloud.com/mindspore-website/notebook/r1.8/tutorials/zh_cn/advanced/dataset/mindspore_record.ipynb)&emsp;\n",
    "[![下载样例代码](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r1.8/resource/_static/logo_download_code.png)](https://obs.dualstack.cn-north-4.myhuaweicloud.com/mindspore-website/notebook/r1.8/tutorials/zh_cn/advanced/dataset/mindspore_record.py)&emsp;\n",
    "[![查看源文件](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r1.8/resource/_static/logo_source.png)](https://gitee.com/mindspore/docs/blob/r1.8/tutorials/source_zh_cn/advanced/dataset/record.ipynb)\n",
    "\n",
    "MindSpore中可以把用于训练网络模型的数据集，转换为MindSpore特定的格式数据（MindSpore Record格式），从而更加方便地保存和加载数据。其目标是归一化用户的数据集，并进一步通过`MindDataset`接口实现数据的读取，并用于训练过程。\n",
    "\n",
    "![conversion](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r1.8/tutorials/source_zh_cn/advanced/dataset/images/data_conversion_concept.png)\n",
    "\n",
    "此外，MindSpore还针对部分数据场景进行了性能优化，使用MindSpore Record数据格式可以减少磁盘IO、网络IO开销，从而获得更好的使用体验。\n",
    "\n",
    "MindSpore Record数据格式具备的特征如下：\n",
    "\n",
    "1. 实现数据统一存储、访问，使得训练时数据读取更加简便。\n",
    "2. 数据聚合存储、高效读取，使得训练时数据方便管理和移动。\n",
    "3. 高效的数据编解码操作，使得用户可以对数据操作无感知。\n",
    "4. 可以灵活控制数据切分的分区大小，实现分布式数据处理。\n",
    "\n",
    "## Record文件结构\n",
    "\n",
    "如下图所示，MindSpore Record文件由数据文件和索引文件组成。\n",
    "\n",
    "![MindSpore Record](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r1.8/tutorials/source_zh_cn/advanced/dataset/images/mindrecord.png)\n",
    "\n",
    "其中数据文件包含文件头、标量数据页、块数据页，用于存储用户归一化后的训练数据，且单个MindSpore Record文件建议小于20G，用户可将大数据集进行分片存储为多个MindSpore Record文件。\n",
    "\n",
    "而索引文件则包含基于标量数据（如图像Label、图像文件名等）生成的索引信息，用于方便的检索、统计数据集信息。\n",
    "\n",
    "数据文件中的文件头、标量数据页、块数据页的具体用途如下所示：\n",
    "\n",
    "- **文件头**：是MindSpore Record文件的元信息，主要用来存储文件头大小、标量数据页大小、块数据页大小、Schema信息、索引字段、统计信息、文件分区信息、标量数据与块数据对应关系等。\n",
    "- **标量数据页**：主要用来存储整型、字符串、浮点型数据，如图像的Label、图像的文件名、图像的长宽等信息，即适合用标量来存储的信息会保存在这里。\n",
    "- **块数据页**：主要用来存储二进制串、NumPy数组等数据，如二进制图像文件本身、文本转换成的字典等。\n",
    "\n",
    "> 值得注意的是，数据文件和索引文件均暂不支持重命名操作。\n",
    "\n",
    "## 转换成Record格式\n",
    "\n",
    "下面主要介绍如何将CV类数据和NLP类数据转换为MindSpore Record文件格式，并通过`MindDataset`接口，实现MindSpore Record文件的读取。\n",
    "\n",
    "### 转换CV类数据集\n",
    "\n",
    "本示例主要以包含100条记录的CV数据集并将其转换为MindSpore Record格式为例子，介绍如何将CV类数据集转换成MindSpore Record文件格式，并使用`MindDataset`接口读取。\n",
    "\n",
    "首先，需要创建100张图片的数据集并对齐进行保存，其样本包含`file_name`（字符串）、`label`（整型）、 `data`（二进制）三个字段，然后使用`MindDataset`接口读取该MindSpore Record文件。\n",
    "\n",
    "1. 生成100张图像，并转换成MindSpore Record文件格式。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-02-22T10:34:03.889515Z",
     "start_time": "2021-02-22T10:34:02.950207Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "MSRStatus.SUCCESS"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import os\n",
    "from PIL import Image\n",
    "from io import BytesIO\n",
    "\n",
    "import mindspore.mindrecord as record\n",
    "\n",
    "\n",
    "# 输出的MindSpore Record文件完整路径\n",
    "MINDRECORD_FILE = \"test.mindrecord\"\n",
    "\n",
    "if os.path.exists(MINDRECORD_FILE):\n",
    "    os.remove(MINDRECORD_FILE)\n",
    "    os.remove(MINDRECORD_FILE + \".db\")\n",
    "\n",
    "# 定义包含的字段\n",
    "cv_schema = {\"file_name\": {\"type\": \"string\"},\n",
    "             \"label\": {\"type\": \"int32\"},\n",
    "             \"data\": {\"type\": \"bytes\"}}\n",
    "\n",
    "# 声明MindSpore Record文件格式\n",
    "writer = record.FileWriter(file_name=MINDRECORD_FILE, shard_num=1)\n",
    "writer.add_schema(cv_schema, \"it is a cv dataset\")\n",
    "writer.add_index([\"file_name\", \"label\"])\n",
    "\n",
    "# 创建数据集\n",
    "data = []\n",
    "for i in range(100):\n",
    "    i += 1\n",
    "    sample = {}\n",
    "    white_io = BytesIO()\n",
    "    Image.new('RGB', (i*10, i*10), (255, 255, 255)).save(white_io, 'JPEG')\n",
    "    image_bytes = white_io.getvalue()\n",
    "    sample['file_name'] = str(i) + \".jpg\"\n",
    "    sample['label'] = i\n",
    "    sample['data'] = white_io.getvalue()\n",
    "\n",
    "    data.append(sample)\n",
    "    if i % 10 == 0:\n",
    "        writer.write_raw_data(data)\n",
    "        data = []\n",
    "\n",
    "if data:\n",
    "    writer.write_raw_data(data)\n",
    "\n",
    "writer.commit()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "从上面的打印结果`MSRStatus.SUCCESS`可以看出，数据集转换成功。在本篇后续的例子中如果数据集转换成功均可看到此打印结果。\n",
    "\n",
    "2. 通过`MindDataset`接口读取MindSpore Record文件格式。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-02-22T10:34:07.729322Z",
     "start_time": "2021-02-22T10:34:07.575711Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Got 100 samples\n"
     ]
    }
   ],
   "source": [
    "import mindspore.dataset as ds\n",
    "import mindspore.dataset.vision as vision\n",
    "\n",
    "# 读取MindSpore Record文件格式\n",
    "data_set = ds.MindDataset(dataset_files=MINDRECORD_FILE)\n",
    "decode_op = vision.Decode()\n",
    "data_set = data_set.map(operations=decode_op, input_columns=[\"data\"], num_parallel_workers=2)\n",
    "\n",
    "# 样本计数\n",
    "print(\"Got {} samples\".format(data_set.get_dataset_size()))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 转换NLP类数据集\n",
    "\n",
    "本示例首先创建一个包含100条记录的MindSpore Record文件格式，其样本包含八个字段，均为整型数组，然后使用`MindDataset`接口读取该MindSpore Record文件。\n",
    "\n",
    "> 为了方便展示，此处略去了将文本转换成字典序的预处理过程。\n",
    "\n",
    "1. 生成100条文本数据，并转换成MindSpore Record文件格式。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-02-22T10:34:23.883130Z",
     "start_time": "2021-02-22T10:34:23.660213Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "MSRStatus.SUCCESS"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import os\n",
    "import numpy as np\n",
    "import mindspore.mindrecord as record\n",
    "\n",
    "# 输出的MindSpore Record文件完整路径\n",
    "MINDRECORD_FILE = \"test.mindrecord\"\n",
    "\n",
    "if os.path.exists(MINDRECORD_FILE):\n",
    "    os.remove(MINDRECORD_FILE)\n",
    "    os.remove(MINDRECORD_FILE + \".db\")\n",
    "\n",
    "# 定义样本数据包含的字段\n",
    "nlp_schema = {\"source_sos_ids\": {\"type\": \"int64\", \"shape\": [-1]},\n",
    "              \"source_sos_mask\": {\"type\": \"int64\", \"shape\": [-1]},\n",
    "              \"source_eos_ids\": {\"type\": \"int64\", \"shape\": [-1]},\n",
    "              \"source_eos_mask\": {\"type\": \"int64\", \"shape\": [-1]},\n",
    "              \"target_sos_ids\": {\"type\": \"int64\", \"shape\": [-1]},\n",
    "              \"target_sos_mask\": {\"type\": \"int64\", \"shape\": [-1]},\n",
    "              \"target_eos_ids\": {\"type\": \"int64\", \"shape\": [-1]},\n",
    "              \"target_eos_mask\": {\"type\": \"int64\", \"shape\": [-1]}}\n",
    "\n",
    "# 声明MindSpore Record文件格式\n",
    "writer = record.FileWriter(file_name=MINDRECORD_FILE, shard_num=1)\n",
    "writer.add_schema(nlp_schema, \"Preprocessed nlp dataset.\")\n",
    "\n",
    "# 创建虚拟数据集\n",
    "data = []\n",
    "for i in range(100):\n",
    "    i += 1\n",
    "    sample = {\"source_sos_ids\": np.array([i, i + 1, i + 2, i + 3, i + 4], dtype=np.int64),\n",
    "              \"source_sos_mask\": np.array([i * 1, i * 2, i * 3, i * 4, i * 5, i * 6, i * 7], dtype=np.int64),\n",
    "              \"source_eos_ids\": np.array([i + 5, i + 6, i + 7, i + 8, i + 9, i + 10], dtype=np.int64),\n",
    "              \"source_eos_mask\": np.array([19, 20, 21, 22, 23, 24, 25, 26, 27], dtype=np.int64),\n",
    "              \"target_sos_ids\": np.array([28, 29, 30, 31, 32], dtype=np.int64),\n",
    "              \"target_sos_mask\": np.array([33, 34, 35, 36, 37, 38], dtype=np.int64),\n",
    "              \"target_eos_ids\": np.array([39, 40, 41, 42, 43, 44, 45, 46, 47], dtype=np.int64),\n",
    "              \"target_eos_mask\": np.array([48, 49, 50, 51], dtype=np.int64)}\n",
    "    data.append(sample)\n",
    "\n",
    "    if i % 10 == 0:\n",
    "        writer.write_raw_data(data)\n",
    "        data = []\n",
    "\n",
    "if data:\n",
    "    writer.write_raw_data(data)\n",
    "\n",
    "writer.commit()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "2. 通过`MindDataset`接口读取MindSpore Record格式文件。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-02-22T10:34:27.133717Z",
     "start_time": "2021-02-22T10:34:27.083785Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Got 100 samples\n",
      "source_sos_ids: [1 2 3 4 5]\n",
      "source_sos_ids: [2 3 4 5 6]\n",
      "source_sos_ids: [3 4 5 6 7]\n",
      "source_sos_ids: [4 5 6 7 8]\n",
      "source_sos_ids: [5 6 7 8 9]\n",
      "source_sos_ids: [ 6  7  8  9 10]\n",
      "source_sos_ids: [ 7  8  9 10 11]\n",
      "source_sos_ids: [ 8  9 10 11 12]\n",
      "source_sos_ids: [ 9 10 11 12 13]\n",
      "source_sos_ids: [10 11 12 13 14]\n"
     ]
    }
   ],
   "source": [
    "import mindspore.dataset as ds\n",
    "\n",
    "# 读取MindSpore Record文件格式\n",
    "data_set = ds.MindDataset(dataset_files=MINDRECORD_FILE, shuffle=False)\n",
    "\n",
    "# 样本计数\n",
    "print(\"Got {} samples\".format(data_set.get_dataset_size()))\n",
    "\n",
    "# 打印部分数据\n",
    "count = 0\n",
    "for item in data_set.create_dict_iterator():\n",
    "    print(\"source_sos_ids:\", item[\"source_sos_ids\"])\n",
    "    count += 1\n",
    "    if count == 10:\n",
    "        break"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 其他数据集转换\n",
    "\n",
    "MindSpore提供转换常用数据集的工具类，能够将常用的数据集转换为MindSpore Record文件格式。\n",
    "\n",
    "> 更多数据集转换的详细说明参考[API文档](https://www.mindspore.cn/docs/zh-CN/r1.8/api_python/mindspore.mindrecord.html)。\n",
    "\n",
    "### 转换CIFAR-10数据集\n",
    "\n",
    "用户可以通过`Cifar10ToMR`类，将CIFAR-10原始数据转换为MindSpore Record，并使用`MindDataset`接口读取。\n",
    "\n",
    "1. 下载[CIFAR-10数据集](https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz)并解压到指定目录，以下示例代码将数据集下载并解压到指定位置。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "from mindvision import dataset\n",
    "\n",
    "# 声明数据集下载地址和数据集存储路径\n",
    "dl_path = \"./datasets\"\n",
    "dl_url = \"https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/cifar-10-python.tar.gz\"\n",
    "\n",
    "# 下载和解压数据集\n",
    "dl = dataset.DownLoad()\n",
    "dl.download_and_extract_archive(url=dl_url, download_path=dl_path)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "解压后数据集文件的目录结构如下所示：\n",
    "\n",
    "```text\n",
    "./datasets/cifar-10-batches-py\n",
    "├── batches.meta\n",
    "├── data_batch_1\n",
    "├── data_batch_2\n",
    "├── data_batch_3\n",
    "├── data_batch_4\n",
    "├── data_batch_5\n",
    "├── readme.html\n",
    "└── test_batch\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "2. 创建`Cifar10ToMR`对象，调用`transform`接口，将CIFAR-10数据集转换为MindSpore Record文件格式。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "MSRStatus.SUCCESS"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import os\n",
    "from mindspore.mindrecord import Cifar10ToMR\n",
    "\n",
    "ds_target_path = \"./datasets/mindspore_dataset_conversion/\"\n",
    "\n",
    "os.system(\"rm -f {}*\".format(ds_target_path))\n",
    "os.system(\"mkdir -p {}\".format(ds_target_path))\n",
    "\n",
    "# CIFAR-10数据集路径\n",
    "CIFAR10_DIR = \"./datasets/cifar-10-batches-py\"\n",
    "# 输出的MindSpore Record文件路径\n",
    "MINDRECORD_FILE = \"./datasets/mindspore_dataset_conversion/cifar10.mindrecord\"\n",
    "\n",
    "cifar10_transformer = Cifar10ToMR(CIFAR10_DIR, MINDRECORD_FILE)\n",
    "cifar10_transformer.transform(['label'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "3. 通过`MindDataset`接口读取MindSpore Record文件格式。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Got 50000 samples\n"
     ]
    }
   ],
   "source": [
    "import mindspore.dataset as ds\n",
    "import mindspore.dataset.vision as vision\n",
    "\n",
    "# 读取MindSpore Record文件格式\n",
    "data_set = ds.MindDataset(dataset_files=MINDRECORD_FILE)\n",
    "decode_op = vision.Decode()\n",
    "data_set = data_set.map(operations=decode_op, input_columns=[\"data\"], num_parallel_workers=2)\n",
    "\n",
    "# 样本计数\n",
    "print(\"Got {} samples\".format(data_set.get_dataset_size()))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 转换CSV数据集\n",
    "\n",
    "本示例首先创建一个包含5条记录的CSV文件，然后通过`CsvToMR`工具类将CSV文件转换为MindSpore Record文件格式，并最终通过`MindDataset`接口将其读取出来。\n",
    "\n",
    "> 本示例依赖第三方支持包`pandas`，可使用命令`pip install pandas`安装。如本文档以Notebook运行时，完成安装后需要重启kernel才能执行后续代码。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. 生成CSV文件，并转换成MindSpore Record。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "import csv\n",
    "import os\n",
    "from mindspore import mindrecord as record\n",
    "\n",
    "# CSV文件的路径\n",
    "CSV_FILE = \"test.csv\"\n",
    "# 输出的MindSpore Record文件路径\n",
    "MINDRECORD_FILE = \"test.mindrecord\"\n",
    "\n",
    "if os.path.exists(MINDRECORD_FILE):\n",
    "    os.remove(MINDRECORD_FILE)\n",
    "    os.remove(MINDRECORD_FILE + \".db\")\n",
    "\n",
    "def generate_csv():\n",
    "    \"\"\"生成csv格式文件数据\"\"\"\n",
    "    headers = [\"id\", \"name\", \"math\", \"english\"]\n",
    "    rows = [(1, \"Lily\", 78.5, 90),\n",
    "            (2, \"Lucy\", 99, 85.2),\n",
    "            (3, \"Mike\", 65, 71),\n",
    "            (4, \"Tom\", 95, 99),\n",
    "            (5, \"Jeff\", 85, 78.5)]\n",
    "    with open(CSV_FILE, 'w', encoding='utf-8') as f:\n",
    "        writer = csv.writer(f)\n",
    "        writer.writerow(headers)\n",
    "        writer.writerows(rows)\n",
    "\n",
    "# 生成csv格式文件数据\n",
    "generate_csv()\n",
    "\n",
    "# 转换csv格式文件\n",
    "csv_transformer = record.CsvToMR(CSV_FILE, MINDRECORD_FILE, partition_number=1)\n",
    "csv_transformer.transform()\n",
    "\n",
    "assert os.path.exists(MINDRECORD_FILE)\n",
    "assert os.path.exists(MINDRECORD_FILE + \".db\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "2. 通过`MindDataset`接口读取MindSpore Record。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Got 5 samples\n"
     ]
    }
   ],
   "source": [
    "import mindspore.dataset as ds\n",
    "\n",
    "data_set = ds.MindDataset(dataset_files=MINDRECORD_FILE)\n",
    "\n",
    "# 样本计数\n",
    "print(\"Got {} samples\".format(data_set.get_dataset_size()))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 转换TFRecord数据集\n",
    "\n",
    "> 本示例需提前安装TensorFlow，目前只支持TensorFlow 1.13.0-rc1及以上版本。如本文档以Notebook运行时，完成安装后需要重启kernel才能执行后续代码。\n",
    "\n",
    "本示例首先通过TensorFlow创建一个TFRecord文件，然后通过`TFRecordToMR`工具类将TFRecord文件转换为MindSpore Record格式文件，最后通过`MindDataset`接口将其读取出来，并使用`Decode`函数对`image_bytes`字段进行解码。\n",
    "\n",
    "1. 导入相关模块。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "import collections\n",
    "from io import BytesIO\n",
    "import os\n",
    "import mindspore.dataset as ds\n",
    "import mindspore.mindrecord as record\n",
    "import mindspore.dataset.vision as vision\n",
    "from PIL import Image\n",
    "import tensorflow as tf\n",
    "\n",
    "os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "2. 生成TFRecord文件。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Write 10 rows in tfrecord.\n"
     ]
    }
   ],
   "source": [
    "# TFRecord文件的路径\n",
    "TFRECORD_FILE = \"test.tfrecord\"\n",
    "# 输出的MindSpore Record文件路径\n",
    "MINDRECORD_FILE = \"test.mindrecord\"\n",
    "\n",
    "def generate_tfrecord():\n",
    "    def create_int_feature(values):\n",
    "        if isinstance(values, list):\n",
    "            feature = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))\n",
    "        else:\n",
    "            feature = tf.train.Feature(int64_list=tf.train.Int64List(value=[values]))\n",
    "        return feature\n",
    "\n",
    "    def create_float_feature(values):\n",
    "        if isinstance(values, list):\n",
    "            feature = tf.train.Feature(float_list=tf.train.FloatList(value=list(values)))\n",
    "        else:\n",
    "            feature = tf.train.Feature(float_list=tf.train.FloatList(value=[values]))\n",
    "        return feature\n",
    "\n",
    "    def create_bytes_feature(values):\n",
    "        if isinstance(values, bytes):\n",
    "            white_io = BytesIO()\n",
    "            Image.new('RGB', (10, 10), (255, 255, 255)).save(white_io, 'JPEG')\n",
    "            image_bytes = white_io.getvalue()\n",
    "            feature = tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes]))\n",
    "        else:\n",
    "            feature = tf.train.Feature(bytes_list=tf.train.BytesList(value=[bytes(values, encoding='utf-8')]))\n",
    "        return feature\n",
    "\n",
    "    writer = tf.io.TFRecordWriter(TFRECORD_FILE)\n",
    "\n",
    "    example_count = 0\n",
    "    for i in range(10):\n",
    "        # 随机创建Tensorflow样本数据\n",
    "        file_name = \"000\" + str(i) + \".jpg\"\n",
    "        image_bytes = bytes(str(\"aaaabbbbcccc\" + str(i)), encoding=\"utf-8\")\n",
    "        int64_scalar = i\n",
    "        float_scalar = float(i)\n",
    "        int64_list = [i, i+1, i+2, i+3, i+4, i+1234567890]\n",
    "        float_list = [float(i), float(i+1), float(i+2.8), float(i+3.2),\n",
    "                      float(i+4.4), float(i+123456.9), float(i+98765432.1)]\n",
    "\n",
    "        # 把数据存入TFRecord文件格式中\n",
    "        features = collections.OrderedDict()\n",
    "        features[\"file_name\"] = create_bytes_feature(file_name)\n",
    "        features[\"image_bytes\"] = create_bytes_feature(image_bytes)\n",
    "        features[\"int64_scalar\"] = create_int_feature(int64_scalar)\n",
    "        features[\"float_scalar\"] = create_float_feature(float_scalar)\n",
    "        features[\"int64_list\"] = create_int_feature(int64_list)\n",
    "        features[\"float_list\"] = create_float_feature(float_list)\n",
    "\n",
    "        tf_example = tf.train.Example(features=tf.train.Features(feature=features))\n",
    "        writer.write(tf_example.SerializeToString())\n",
    "        example_count += 1\n",
    "\n",
    "    writer.close()\n",
    "    print(\"Write {} rows in tfrecord.\".format(example_count))\n",
    "\n",
    "generate_tfrecord()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "3. 将TFRecord转换成MindSpore Record。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_dict = {\"file_name\": tf.io.FixedLenFeature([], tf.string),\n",
    "                \"image_bytes\": tf.io.FixedLenFeature([], tf.string),\n",
    "                \"int64_scalar\": tf.io.FixedLenFeature([], tf.int64),\n",
    "                \"float_scalar\": tf.io.FixedLenFeature([], tf.float32),\n",
    "                \"int64_list\": tf.io.FixedLenFeature([6], tf.int64),\n",
    "                \"float_list\": tf.io.FixedLenFeature([7], tf.float32),\n",
    "                }\n",
    "\n",
    "if os.path.exists(MINDRECORD_FILE):\n",
    "    os.remove(MINDRECORD_FILE)\n",
    "    os.remove(MINDRECORD_FILE + \".db\")\n",
    "\n",
    "tfrecord_transformer = record.TFRecordToMR(TFRECORD_FILE, MINDRECORD_FILE, feature_dict, [\"image_bytes\"])\n",
    "tfrecord_transformer.transform()\n",
    "\n",
    "assert os.path.exists(MINDRECORD_FILE)\n",
    "assert os.path.exists(MINDRECORD_FILE + \".db\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "4. 通过`MindDataset`接口读取MindSpore Record。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Got 10 samples\n"
     ]
    }
   ],
   "source": [
    "data_set = ds.MindDataset(dataset_files=MINDRECORD_FILE)\n",
    "decode_op = vision.Decode()\n",
    "data_set = data_set.map(operations=decode_op, input_columns=[\"image_bytes\"], num_parallel_workers=2)\n",
    "\n",
    "# 样本计数\n",
    "print(\"Got {} samples\".format(data_set.get_dataset_size()))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "MindSpore",
   "language": "python",
   "name": "mindspore"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}