{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 格式转换\n", "\n", "[![下载Notebook](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r1.8/resource/_static/logo_notebook.png)](https://obs.dualstack.cn-north-4.myhuaweicloud.com/mindspore-website/notebook/r1.8/tutorials/zh_cn/advanced/dataset/mindspore_record.ipynb) \n", "[![下载样例代码](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r1.8/resource/_static/logo_download_code.png)](https://obs.dualstack.cn-north-4.myhuaweicloud.com/mindspore-website/notebook/r1.8/tutorials/zh_cn/advanced/dataset/mindspore_record.py) \n", "[![查看源文件](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r1.8/resource/_static/logo_source.png)](https://gitee.com/mindspore/docs/blob/r1.8/tutorials/source_zh_cn/advanced/dataset/record.ipynb)\n", "\n", "MindSpore中可以把用于训练网络模型的数据集,转换为MindSpore特定的格式数据(MindSpore Record格式),从而更加方便地保存和加载数据。其目标是归一化用户的数据集,并进一步通过`MindDataset`接口实现数据的读取,并用于训练过程。\n", "\n", "![conversion](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r1.8/tutorials/source_zh_cn/advanced/dataset/images/data_conversion_concept.png)\n", "\n", "此外,MindSpore还针对部分数据场景进行了性能优化,使用MindSpore Record数据格式可以减少磁盘IO、网络IO开销,从而获得更好的使用体验。\n", "\n", "MindSpore Record数据格式具备的特征如下:\n", "\n", "1. 实现数据统一存储、访问,使得训练时数据读取更加简便。\n", "2. 数据聚合存储、高效读取,使得训练时数据方便管理和移动。\n", "3. 高效的数据编解码操作,使得用户可以对数据操作无感知。\n", "4. 可以灵活控制数据切分的分区大小,实现分布式数据处理。\n", "\n", "## Record文件结构\n", "\n", "如下图所示,MindSpore Record文件由数据文件和索引文件组成。\n", "\n", "![MindSpore Record](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r1.8/tutorials/source_zh_cn/advanced/dataset/images/mindrecord.png)\n", "\n", "其中数据文件包含文件头、标量数据页、块数据页,用于存储用户归一化后的训练数据,且单个MindSpore Record文件建议小于20G,用户可将大数据集进行分片存储为多个MindSpore Record文件。\n", "\n", "而索引文件则包含基于标量数据(如图像Label、图像文件名等)生成的索引信息,用于方便的检索、统计数据集信息。\n", "\n", "数据文件中的文件头、标量数据页、块数据页的具体用途如下所示:\n", "\n", "- **文件头**:是MindSpore Record文件的元信息,主要用来存储文件头大小、标量数据页大小、块数据页大小、Schema信息、索引字段、统计信息、文件分区信息、标量数据与块数据对应关系等。\n", "- **标量数据页**:主要用来存储整型、字符串、浮点型数据,如图像的Label、图像的文件名、图像的长宽等信息,即适合用标量来存储的信息会保存在这里。\n", "- **块数据页**:主要用来存储二进制串、NumPy数组等数据,如二进制图像文件本身、文本转换成的字典等。\n", "\n", "> 值得注意的是,数据文件和索引文件均暂不支持重命名操作。\n", "\n", "## 转换成Record格式\n", "\n", "下面主要介绍如何将CV类数据和NLP类数据转换为MindSpore Record文件格式,并通过`MindDataset`接口,实现MindSpore Record文件的读取。\n", "\n", "### 转换CV类数据集\n", "\n", "本示例主要以包含100条记录的CV数据集并将其转换为MindSpore Record格式为例子,介绍如何将CV类数据集转换成MindSpore Record文件格式,并使用`MindDataset`接口读取。\n", "\n", "首先,需要创建100张图片的数据集并对齐进行保存,其样本包含`file_name`(字符串)、`label`(整型)、 `data`(二进制)三个字段,然后使用`MindDataset`接口读取该MindSpore Record文件。\n", "\n", "1. 生成100张图像,并转换成MindSpore Record文件格式。" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2021-02-22T10:34:03.889515Z", "start_time": "2021-02-22T10:34:02.950207Z" } }, "outputs": [ { "data": { "text/plain": [ "MSRStatus.SUCCESS" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import os\n", "from PIL import Image\n", "from io import BytesIO\n", "\n", "import mindspore.mindrecord as record\n", "\n", "\n", "# 输出的MindSpore Record文件完整路径\n", "MINDRECORD_FILE = \"test.mindrecord\"\n", "\n", "if os.path.exists(MINDRECORD_FILE):\n", " os.remove(MINDRECORD_FILE)\n", " os.remove(MINDRECORD_FILE + \".db\")\n", "\n", "# 定义包含的字段\n", "cv_schema = {\"file_name\": {\"type\": \"string\"},\n", " \"label\": {\"type\": \"int32\"},\n", " \"data\": {\"type\": \"bytes\"}}\n", "\n", "# 声明MindSpore Record文件格式\n", "writer = record.FileWriter(file_name=MINDRECORD_FILE, shard_num=1)\n", "writer.add_schema(cv_schema, \"it is a cv dataset\")\n", "writer.add_index([\"file_name\", \"label\"])\n", "\n", "# 创建数据集\n", "data = []\n", "for i in range(100):\n", " i += 1\n", " sample = {}\n", " white_io = BytesIO()\n", " Image.new('RGB', (i*10, i*10), (255, 255, 255)).save(white_io, 'JPEG')\n", " image_bytes = white_io.getvalue()\n", " sample['file_name'] = str(i) + \".jpg\"\n", " sample['label'] = i\n", " sample['data'] = white_io.getvalue()\n", "\n", " data.append(sample)\n", " if i % 10 == 0:\n", " writer.write_raw_data(data)\n", " data = []\n", "\n", "if data:\n", " writer.write_raw_data(data)\n", "\n", "writer.commit()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "从上面的打印结果`MSRStatus.SUCCESS`可以看出,数据集转换成功。在本篇后续的例子中如果数据集转换成功均可看到此打印结果。\n", "\n", "2. 通过`MindDataset`接口读取MindSpore Record文件格式。" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2021-02-22T10:34:07.729322Z", "start_time": "2021-02-22T10:34:07.575711Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Got 100 samples\n" ] } ], "source": [ "import mindspore.dataset as ds\n", "import mindspore.dataset.vision as vision\n", "\n", "# 读取MindSpore Record文件格式\n", "data_set = ds.MindDataset(dataset_files=MINDRECORD_FILE)\n", "decode_op = vision.Decode()\n", "data_set = data_set.map(operations=decode_op, input_columns=[\"data\"], num_parallel_workers=2)\n", "\n", "# 样本计数\n", "print(\"Got {} samples\".format(data_set.get_dataset_size()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 转换NLP类数据集\n", "\n", "本示例首先创建一个包含100条记录的MindSpore Record文件格式,其样本包含八个字段,均为整型数组,然后使用`MindDataset`接口读取该MindSpore Record文件。\n", "\n", "> 为了方便展示,此处略去了将文本转换成字典序的预处理过程。\n", "\n", "1. 生成100条文本数据,并转换成MindSpore Record文件格式。" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2021-02-22T10:34:23.883130Z", "start_time": "2021-02-22T10:34:23.660213Z" } }, "outputs": [ { "data": { "text/plain": [ "MSRStatus.SUCCESS" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import os\n", "import numpy as np\n", "import mindspore.mindrecord as record\n", "\n", "# 输出的MindSpore Record文件完整路径\n", "MINDRECORD_FILE = \"test.mindrecord\"\n", "\n", "if os.path.exists(MINDRECORD_FILE):\n", " os.remove(MINDRECORD_FILE)\n", " os.remove(MINDRECORD_FILE + \".db\")\n", "\n", "# 定义样本数据包含的字段\n", "nlp_schema = {\"source_sos_ids\": {\"type\": \"int64\", \"shape\": [-1]},\n", " \"source_sos_mask\": {\"type\": \"int64\", \"shape\": [-1]},\n", " \"source_eos_ids\": {\"type\": \"int64\", \"shape\": [-1]},\n", " \"source_eos_mask\": {\"type\": \"int64\", \"shape\": [-1]},\n", " \"target_sos_ids\": {\"type\": \"int64\", \"shape\": [-1]},\n", " \"target_sos_mask\": {\"type\": \"int64\", \"shape\": [-1]},\n", " \"target_eos_ids\": {\"type\": \"int64\", \"shape\": [-1]},\n", " \"target_eos_mask\": {\"type\": \"int64\", \"shape\": [-1]}}\n", "\n", "# 声明MindSpore Record文件格式\n", "writer = record.FileWriter(file_name=MINDRECORD_FILE, shard_num=1)\n", "writer.add_schema(nlp_schema, \"Preprocessed nlp dataset.\")\n", "\n", "# 创建虚拟数据集\n", "data = []\n", "for i in range(100):\n", " i += 1\n", " sample = {\"source_sos_ids\": np.array([i, i + 1, i + 2, i + 3, i + 4], dtype=np.int64),\n", " \"source_sos_mask\": np.array([i * 1, i * 2, i * 3, i * 4, i * 5, i * 6, i * 7], dtype=np.int64),\n", " \"source_eos_ids\": np.array([i + 5, i + 6, i + 7, i + 8, i + 9, i + 10], dtype=np.int64),\n", " \"source_eos_mask\": np.array([19, 20, 21, 22, 23, 24, 25, 26, 27], dtype=np.int64),\n", " \"target_sos_ids\": np.array([28, 29, 30, 31, 32], dtype=np.int64),\n", " \"target_sos_mask\": np.array([33, 34, 35, 36, 37, 38], dtype=np.int64),\n", " \"target_eos_ids\": np.array([39, 40, 41, 42, 43, 44, 45, 46, 47], dtype=np.int64),\n", " \"target_eos_mask\": np.array([48, 49, 50, 51], dtype=np.int64)}\n", " data.append(sample)\n", "\n", " if i % 10 == 0:\n", " writer.write_raw_data(data)\n", " data = []\n", "\n", "if data:\n", " writer.write_raw_data(data)\n", "\n", "writer.commit()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. 通过`MindDataset`接口读取MindSpore Record格式文件。" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2021-02-22T10:34:27.133717Z", "start_time": "2021-02-22T10:34:27.083785Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Got 100 samples\n", "source_sos_ids: [1 2 3 4 5]\n", "source_sos_ids: [2 3 4 5 6]\n", "source_sos_ids: [3 4 5 6 7]\n", "source_sos_ids: [4 5 6 7 8]\n", "source_sos_ids: [5 6 7 8 9]\n", "source_sos_ids: [ 6 7 8 9 10]\n", "source_sos_ids: [ 7 8 9 10 11]\n", "source_sos_ids: [ 8 9 10 11 12]\n", "source_sos_ids: [ 9 10 11 12 13]\n", "source_sos_ids: [10 11 12 13 14]\n" ] } ], "source": [ "import mindspore.dataset as ds\n", "\n", "# 读取MindSpore Record文件格式\n", "data_set = ds.MindDataset(dataset_files=MINDRECORD_FILE, shuffle=False)\n", "\n", "# 样本计数\n", "print(\"Got {} samples\".format(data_set.get_dataset_size()))\n", "\n", "# 打印部分数据\n", "count = 0\n", "for item in data_set.create_dict_iterator():\n", " print(\"source_sos_ids:\", item[\"source_sos_ids\"])\n", " count += 1\n", " if count == 10:\n", " break" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 其他数据集转换\n", "\n", "MindSpore提供转换常用数据集的工具类,能够将常用的数据集转换为MindSpore Record文件格式。\n", "\n", "> 更多数据集转换的详细说明参考[API文档](https://www.mindspore.cn/docs/zh-CN/r1.8/api_python/mindspore.mindrecord.html)。\n", "\n", "### 转换CIFAR-10数据集\n", "\n", "用户可以通过`Cifar10ToMR`类,将CIFAR-10原始数据转换为MindSpore Record,并使用`MindDataset`接口读取。\n", "\n", "1. 下载[CIFAR-10数据集](https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz)并解压到指定目录,以下示例代码将数据集下载并解压到指定位置。" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "from mindvision import dataset\n", "\n", "# 声明数据集下载地址和数据集存储路径\n", "dl_path = \"./datasets\"\n", "dl_url = \"https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/cifar-10-python.tar.gz\"\n", "\n", "# 下载和解压数据集\n", "dl = dataset.DownLoad()\n", "dl.download_and_extract_archive(url=dl_url, download_path=dl_path)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "解压后数据集文件的目录结构如下所示:\n", "\n", "```text\n", "./datasets/cifar-10-batches-py\n", "├── batches.meta\n", "├── data_batch_1\n", "├── data_batch_2\n", "├── data_batch_3\n", "├── data_batch_4\n", "├── data_batch_5\n", "├── readme.html\n", "└── test_batch\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. 创建`Cifar10ToMR`对象,调用`transform`接口,将CIFAR-10数据集转换为MindSpore Record文件格式。" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "MSRStatus.SUCCESS" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import os\n", "from mindspore.mindrecord import Cifar10ToMR\n", "\n", "ds_target_path = \"./datasets/mindspore_dataset_conversion/\"\n", "\n", "os.system(\"rm -f {}*\".format(ds_target_path))\n", "os.system(\"mkdir -p {}\".format(ds_target_path))\n", "\n", "# CIFAR-10数据集路径\n", "CIFAR10_DIR = \"./datasets/cifar-10-batches-py\"\n", "# 输出的MindSpore Record文件路径\n", "MINDRECORD_FILE = \"./datasets/mindspore_dataset_conversion/cifar10.mindrecord\"\n", "\n", "cifar10_transformer = Cifar10ToMR(CIFAR10_DIR, MINDRECORD_FILE)\n", "cifar10_transformer.transform(['label'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3. 通过`MindDataset`接口读取MindSpore Record文件格式。" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Got 50000 samples\n" ] } ], "source": [ "import mindspore.dataset as ds\n", "import mindspore.dataset.vision as vision\n", "\n", "# 读取MindSpore Record文件格式\n", "data_set = ds.MindDataset(dataset_files=MINDRECORD_FILE)\n", "decode_op = vision.Decode()\n", "data_set = data_set.map(operations=decode_op, input_columns=[\"data\"], num_parallel_workers=2)\n", "\n", "# 样本计数\n", "print(\"Got {} samples\".format(data_set.get_dataset_size()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 转换CSV数据集\n", "\n", "本示例首先创建一个包含5条记录的CSV文件,然后通过`CsvToMR`工具类将CSV文件转换为MindSpore Record文件格式,并最终通过`MindDataset`接口将其读取出来。\n", "\n", "> 本示例依赖第三方支持包`pandas`,可使用命令`pip install pandas`安装。如本文档以Notebook运行时,完成安装后需要重启kernel才能执行后续代码。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. 生成CSV文件,并转换成MindSpore Record。" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "import csv\n", "import os\n", "from mindspore import mindrecord as record\n", "\n", "# CSV文件的路径\n", "CSV_FILE = \"test.csv\"\n", "# 输出的MindSpore Record文件路径\n", "MINDRECORD_FILE = \"test.mindrecord\"\n", "\n", "if os.path.exists(MINDRECORD_FILE):\n", " os.remove(MINDRECORD_FILE)\n", " os.remove(MINDRECORD_FILE + \".db\")\n", "\n", "def generate_csv():\n", " \"\"\"生成csv格式文件数据\"\"\"\n", " headers = [\"id\", \"name\", \"math\", \"english\"]\n", " rows = [(1, \"Lily\", 78.5, 90),\n", " (2, \"Lucy\", 99, 85.2),\n", " (3, \"Mike\", 65, 71),\n", " (4, \"Tom\", 95, 99),\n", " (5, \"Jeff\", 85, 78.5)]\n", " with open(CSV_FILE, 'w', encoding='utf-8') as f:\n", " writer = csv.writer(f)\n", " writer.writerow(headers)\n", " writer.writerows(rows)\n", "\n", "# 生成csv格式文件数据\n", "generate_csv()\n", "\n", "# 转换csv格式文件\n", "csv_transformer = record.CsvToMR(CSV_FILE, MINDRECORD_FILE, partition_number=1)\n", "csv_transformer.transform()\n", "\n", "assert os.path.exists(MINDRECORD_FILE)\n", "assert os.path.exists(MINDRECORD_FILE + \".db\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. 通过`MindDataset`接口读取MindSpore Record。" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Got 5 samples\n" ] } ], "source": [ "import mindspore.dataset as ds\n", "\n", "data_set = ds.MindDataset(dataset_files=MINDRECORD_FILE)\n", "\n", "# 样本计数\n", "print(\"Got {} samples\".format(data_set.get_dataset_size()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 转换TFRecord数据集\n", "\n", "> 本示例需提前安装TensorFlow,目前只支持TensorFlow 1.13.0-rc1及以上版本。如本文档以Notebook运行时,完成安装后需要重启kernel才能执行后续代码。\n", "\n", "本示例首先通过TensorFlow创建一个TFRecord文件,然后通过`TFRecordToMR`工具类将TFRecord文件转换为MindSpore Record格式文件,最后通过`MindDataset`接口将其读取出来,并使用`Decode`函数对`image_bytes`字段进行解码。\n", "\n", "1. 导入相关模块。" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "import collections\n", "from io import BytesIO\n", "import os\n", "import mindspore.dataset as ds\n", "import mindspore.mindrecord as record\n", "import mindspore.dataset.vision as vision\n", "from PIL import Image\n", "import tensorflow as tf\n", "\n", "os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. 生成TFRecord文件。" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Write 10 rows in tfrecord.\n" ] } ], "source": [ "# TFRecord文件的路径\n", "TFRECORD_FILE = \"test.tfrecord\"\n", "# 输出的MindSpore Record文件路径\n", "MINDRECORD_FILE = \"test.mindrecord\"\n", "\n", "def generate_tfrecord():\n", " def create_int_feature(values):\n", " if isinstance(values, list):\n", " feature = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))\n", " else:\n", " feature = tf.train.Feature(int64_list=tf.train.Int64List(value=[values]))\n", " return feature\n", "\n", " def create_float_feature(values):\n", " if isinstance(values, list):\n", " feature = tf.train.Feature(float_list=tf.train.FloatList(value=list(values)))\n", " else:\n", " feature = tf.train.Feature(float_list=tf.train.FloatList(value=[values]))\n", " return feature\n", "\n", " def create_bytes_feature(values):\n", " if isinstance(values, bytes):\n", " white_io = BytesIO()\n", " Image.new('RGB', (10, 10), (255, 255, 255)).save(white_io, 'JPEG')\n", " image_bytes = white_io.getvalue()\n", " feature = tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes]))\n", " else:\n", " feature = tf.train.Feature(bytes_list=tf.train.BytesList(value=[bytes(values, encoding='utf-8')]))\n", " return feature\n", "\n", " writer = tf.io.TFRecordWriter(TFRECORD_FILE)\n", "\n", " example_count = 0\n", " for i in range(10):\n", " # 随机创建Tensorflow样本数据\n", " file_name = \"000\" + str(i) + \".jpg\"\n", " image_bytes = bytes(str(\"aaaabbbbcccc\" + str(i)), encoding=\"utf-8\")\n", " int64_scalar = i\n", " float_scalar = float(i)\n", " int64_list = [i, i+1, i+2, i+3, i+4, i+1234567890]\n", " float_list = [float(i), float(i+1), float(i+2.8), float(i+3.2),\n", " float(i+4.4), float(i+123456.9), float(i+98765432.1)]\n", "\n", " # 把数据存入TFRecord文件格式中\n", " features = collections.OrderedDict()\n", " features[\"file_name\"] = create_bytes_feature(file_name)\n", " features[\"image_bytes\"] = create_bytes_feature(image_bytes)\n", " features[\"int64_scalar\"] = create_int_feature(int64_scalar)\n", " features[\"float_scalar\"] = create_float_feature(float_scalar)\n", " features[\"int64_list\"] = create_int_feature(int64_list)\n", " features[\"float_list\"] = create_float_feature(float_list)\n", "\n", " tf_example = tf.train.Example(features=tf.train.Features(feature=features))\n", " writer.write(tf_example.SerializeToString())\n", " example_count += 1\n", "\n", " writer.close()\n", " print(\"Write {} rows in tfrecord.\".format(example_count))\n", "\n", "generate_tfrecord()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3. 将TFRecord转换成MindSpore Record。" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "feature_dict = {\"file_name\": tf.io.FixedLenFeature([], tf.string),\n", " \"image_bytes\": tf.io.FixedLenFeature([], tf.string),\n", " \"int64_scalar\": tf.io.FixedLenFeature([], tf.int64),\n", " \"float_scalar\": tf.io.FixedLenFeature([], tf.float32),\n", " \"int64_list\": tf.io.FixedLenFeature([6], tf.int64),\n", " \"float_list\": tf.io.FixedLenFeature([7], tf.float32),\n", " }\n", "\n", "if os.path.exists(MINDRECORD_FILE):\n", " os.remove(MINDRECORD_FILE)\n", " os.remove(MINDRECORD_FILE + \".db\")\n", "\n", "tfrecord_transformer = record.TFRecordToMR(TFRECORD_FILE, MINDRECORD_FILE, feature_dict, [\"image_bytes\"])\n", "tfrecord_transformer.transform()\n", "\n", "assert os.path.exists(MINDRECORD_FILE)\n", "assert os.path.exists(MINDRECORD_FILE + \".db\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "4. 通过`MindDataset`接口读取MindSpore Record。" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Got 10 samples\n" ] } ], "source": [ "data_set = ds.MindDataset(dataset_files=MINDRECORD_FILE)\n", "decode_op = vision.Decode()\n", "data_set = data_set.map(operations=decode_op, input_columns=[\"image_bytes\"], num_parallel_workers=2)\n", "\n", "# 样本计数\n", "print(\"Got {} samples\".format(data_set.get_dataset_size()))" ] } ], "metadata": { "kernelspec": { "display_name": "MindSpore", "language": "python", "name": "mindspore" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 4 }