{ "cells": [ { "cell_type": "markdown", "source": [ "# Advanced Dataset Management\n", "\n", "`Ascend` `GPU` `CPU` `Intermediate` `Data Preparation`\n", "\n", "[![](https://gitee.com/mindspore/docs/raw/r1.5/resource/_static/logo_modelarts.png)](https://authoring-modelarts-cnnorth4.huaweicloud.com/console/lab?share-url-b64=aHR0cHM6Ly9taW5kc3BvcmUtd2Vic2l0ZS5vYnMuY24tbm9ydGgtNC5teWh1YXdlaWNsb3VkLmNvbS9ub3RlYm9vay9tYXN0ZXIvdHV0b3JpYWxzL3poX2NuL21pbmRzcG9yZV9kYXRhLmlweW5i&imageid=65f636a0-56cf-49df-b941-7d2a07ba8c8c) [![](https://gitee.com/mindspore/docs/raw/r1.5/resource/_static/logo_notebook.png)](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/r1.5/tutorials/zh_cn/mindspore_data.ipynb) [![](https://gitee.com/mindspore/docs/raw/r1.5/resource/_static/logo_download_code.png)](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/r1.5/tutorials/zh_cn/mindspore_data.py) [![](https://gitee.com/mindspore/docs/raw/r1.5/resource/_static/logo_source.png)](https://gitee.com/mindspore/docs/blob/r1.5/tutorials/source_zh_cn/intermediate/data.ipynb)\n", "\n", "MindSpore can load both common datasets and custom datasets; part of this functionality is introduced in the beginner tutorial. There are two ways to load a custom dataset:\n", "\n", "- Load it through a `GeneratorDataset` object; for usage, see [Beginner Tutorial: Custom Datasets](https://www.mindspore.cn/tutorials/zh-CN/r1.5/dataset.html#%E8%87%AA%E5%AE%9A%E4%B9%89%E6%95%B0%E6%8D%AE%E9%9B%86).\n", "\n", "- Convert the dataset to MindRecord, the MindSpore data format, and load it by reading the MindRecord files.\n", "\n", "For better performance, convert the dataset to MindRecord so that it can be loaded into MindSpore conveniently for training.\n", "\n", "MindRecord provides the following performance optimizations:\n", "\n", "- Varied user data is stored and accessed in a unified way, which simplifies reading training data.\n", "- Data is stored in aggregated form, so it can be read efficiently and is easy to manage and move.\n", "- Data encoding and decoding are efficient and transparent to the user.\n", "- The partition size can be controlled flexibly to support distributed training.\n", "\n", "To convert common datasets to MindRecord, see [MindSpore Data Format Conversion](https://www.mindspore.cn/docs/programming_guide/zh-CN/r1.5/dataset_conversion.html#MindSpore%E6%95%B0%E6%8D%AE%E6%A0%BC%E5%BC%8F%E8%BD%AC%E6%8D%A2) in the official programming guide; converting a custom dataset is described below.\n", "\n", "MindRecord aims to normalize user datasets, which are then read through `MindDataset` for use in training. These two steps are described below.\n", "\n", "## Converting a Custom Dataset to MindRecord\n", "\n", "First, download the image `transform.jpg` to use as the raw data to be processed.\n", "\n", 
"Create the directory `./datasets/convert_dataset_to_mindrecord/data_to_mindrecord/` to hold all converted datasets.\n", "\n", "Create the directory `./datasets/convert_dataset_to_mindrecord/images/` to hold the downloaded image.\n", "\n", "Run the following commands in Jupyter Notebook to download the image, create the directories, and move the image to the target location." ], "metadata": { "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": null, "source": [ "!wget -N --no-check-certificate https://obs.dualstack.cn-north-4.myhuaweicloud.com/mindspore-website/notebook/datasets/transform.jpg\n", "!mkdir -p ./datasets/convert_dataset_to_mindrecord/data_to_mindrecord/\n", "!mkdir -p ./datasets/convert_dataset_to_mindrecord/images/\n", "!mv -f ./transform.jpg ./datasets/convert_dataset_to_mindrecord/images/" ], "outputs": [], "metadata": { "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "The directory structure of the downloaded image file is as follows:\n", "\n", "```text\n", "./datasets/convert_dataset_to_mindrecord/images/\n", "└── transform.jpg\n", "```" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "Import the file-writing utility class `FileWriter`." ], "metadata": { "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 2, "source": [ "from mindspore.mindrecord import FileWriter" ], "outputs": [], "metadata": { "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "Create a `FileWriter` object, passing in the file name and the number of shards." ], "metadata": {} }, { "cell_type": "code", "execution_count": 3, "source": [ "data_record_path = './datasets/convert_dataset_to_mindrecord/data_to_mindrecord/test.mindrecord'\n", "writer = FileWriter(file_name=data_record_path, shard_num=4)" ], "outputs": [], "metadata": { "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "Define the dataset schema, call the `write_raw_data` interface to write the data, and finally call the `commit` interface to generate the local data files.\n", "\n", "The schema mainly consists of the field name `name`, the field data type `type`, and the size of each field dimension `shape`:\n", "\n", "- Field name: the name used to reference the field; it may contain letters, digits, and underscores.\n", "\n", 
"- Field data type: one of int32, int64, float32, float64, string, and bytes.\n", "\n", "- Field shape: a one-dimensional array is written as [-1]; higher dimensions are written as [m, n, …], where m and n are the sizes of the respective dimensions.\n", "\n", "> If a field has the `shape` attribute, the data passed to the `write_raw_data` interface for that field must be of type `numpy.ndarray`, and its data type must be int32, int64, float32, or float64." ], "metadata": { "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 4, "source": [ "# Define the schema\n", "data_schema = {\"file_name\": {\"type\": \"string\"}, \"label\": {\"type\": \"int32\"}, \"data\": {\"type\": \"bytes\"}}\n", "writer.add_schema(data_schema, \"test_schema\")\n", "\n", "# Prepare the data\n", "file_name = \"./datasets/convert_dataset_to_mindrecord/images/transform.jpg\"\n", "with open(file_name, \"rb\") as f:\n", "    bytes_data = f.read()\n", "data = [{\"file_name\": \"transform.jpg\", \"label\": 1, \"data\": bytes_data}]\n", "\n", "# Add index fields\n", "indexes = [\"file_name\", \"label\"]\n", "writer.add_index(indexes)\n", "\n", "# Write the data\n", "writer.write_raw_data(data)\n", "\n", "# Generate the local data files\n", "writer.commit()" ], "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "MSRStatus.SUCCESS" ] }, "metadata": {}, "execution_count": 4 } ], "metadata": { "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "This example generates 8 files, which together are called a MindRecord dataset. `test.mindrecord0` and `test.mindrecord0.db` together are called one MindRecord file, where `test.mindrecord0` is the data file and `test.mindrecord0.db` is the index file. The generated files are:\n", "\n", "```text\n", "./datasets/convert_dataset_to_mindrecord/data_to_mindrecord/\n", "├── test.mindrecord0\n", "├── test.mindrecord0.db\n", "├── test.mindrecord1\n", "├── test.mindrecord1.db\n", "├── test.mindrecord2\n", "├── test.mindrecord2.db\n", "├── test.mindrecord3\n", "└── test.mindrecord3.db\n", "\n", "0 directories, 8 files\n", "```\n", "\n", "## Reading the MindRecord Dataset\n", "\n", "Import the data-loading module `mindspore.dataset`." ], "metadata": {} }, { "cell_type": "code", "execution_count": 5, "source": [ "import mindspore.dataset as ds" ], "outputs": [], "metadata": { "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" } } }, 
{ "cell_type": "markdown", "source": [ "First, use `MindDataset` to read the MindRecord dataset, then create a dictionary iterator over the data and read a data record through it." ], "metadata": {} }, { "cell_type": "code", "execution_count": 6, "source": [ "file_name = './datasets/convert_dataset_to_mindrecord/data_to_mindrecord/test.mindrecord0'\n", "# Create the MindDataset\n", "define_data_set = ds.MindDataset(dataset_file=file_name)\n", "# Create a dictionary iterator and read the data records through it\n", "count = 0\n", "for item in define_data_set.create_dict_iterator(output_numpy=True):\n", "    print(\"sample: {}\".format(item))\n", "    count += 1\n", "print(\"Got {} samples\".format(count))\n" ], "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "sample: {'data': array([255, 216, 255, ..., 159, 255, 217], dtype=uint8), 'file_name': array(b'transform.jpg', dtype='|S13'), 'label': array(1, dtype=int32)}\n", "Got 1 samples\n" ] } ], "metadata": { "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" } } } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }