{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Dataset\n", "\n", "Data is the foundation of deep learning, and high-quality input data benefits the entire deep neural network. MindSpore provides a Pipeline-based [data engine](https://www.mindspore.cn/docs/zh-CN/r2.0.0-alpha/design/data_engine.html) that preprocesses data efficiently through [Dataset](https://www.mindspore.cn/tutorials/zh-CN/r2.0.0-alpha/beginner/dataset.html) and [Transforms](https://www.mindspore.cn/tutorials/zh-CN/r2.0.0-alpha/beginner/transforms.html). Dataset is the start of the Pipeline and loads the raw data. `mindspore.dataset` provides built-in loading interfaces for text, image, audio, and other datasets, as well as an interface for loading custom datasets.\n", "\n", "In addition, MindSpore's domain development libraries provide a large number of preloaded datasets that can be downloaded and used through a single API call. This tutorial describes the different ways to load a dataset, the common dataset operations, and how to define a custom dataset." ] },
{ "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from mindspore.dataset import vision\n", "from mindspore.dataset import MnistDataset, GeneratorDataset\n", "import matplotlib.pyplot as plt" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Loading a Dataset\n", "\n", "We use the **MNIST** dataset as an example to show how to load data with `mindspore.dataset`." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The interfaces provided by `mindspore.dataset` **only support decompressed data files**, so we use the `download` library to download the dataset and extract it." ] },
{ "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Downloading data from https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/MNIST_Data.zip (10.3 MB)\n", "\n", "file_sizes: 100%|██████████████████████████| 10.8M/10.8M [00:02<00:00, 3.96MB/s]\n", "Extracting zip file...\n", "Successfully downloaded / unzipped to ./\n" ] } ], "source": [ "# Download data from open datasets\n", "from download import download\n", "\n", "url = \"https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/\" \\\n", "      \"notebook/datasets/MNIST_Data.zip\"\n", "path = download(url, \"./\", kind=\"zip\", replace=True)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Once the archive has been extracted and the compressed file removed, the dataset can be loaded directly; its type is `MnistDataset`." ] },
{ "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "train_dataset = MnistDataset(\"MNIST_Data/train\", shuffle=False)\n", "print(type(train_dataset))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Iterating over a Dataset\n", "\n", "After a dataset is loaded, data is generally fetched iteratively and then fed to the neural network for training. We can create a data iterator with the `create_tuple_iterator` or `create_dict_iterator` interface and use it to iterate over the data.\n", "\n", "By default the returned data type is `Tensor`; if `output_numpy=True` is set, the data is returned as NumPy arrays.\n", "\n", "Below we define a visualization function that iterates over 9 images and displays them." ] },
{ "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "def visualize(dataset):\n", "    # Show the first 9 samples of the dataset in a 3x3 grid\n", "    figure = plt.figure(figsize=(4, 4))\n", "    cols, rows = 3, 3\n", "\n", "    for idx, (image, label) in enumerate(dataset.create_tuple_iterator()):\n", "        figure.add_subplot(rows, cols, idx + 1)\n", "        plt.title(int(label))\n", "        plt.axis(\"off\")\n", "        plt.imshow(image.asnumpy().squeeze(), cmap=\"gray\")\n", "        if idx == cols * rows - 1:\n", "            break\n", "    plt.show()" ] },
{ "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "visualize(train_dataset)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Common Dataset Operations" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Because of the Pipeline design, common dataset operations follow the asynchronous pattern `dataset = dataset.operation()`. Calling an operation returns a new Dataset without running the operation itself; it only adds a node to the Pipeline. The whole Pipeline is executed in parallel when the dataset is finally iterated.\n", "\n", "The following sections introduce several common dataset operations." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### shuffle\n", "\n", "Randomly shuffling the dataset with `shuffle` removes the uneven distribution caused by the order in which the data is arranged.\n", "\n", "![op-shuffle](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r2.0.0-alpha/tutorials/source_zh_cn/advanced/dataset/images/op_shuffle.png)\n", "\n", "Datasets provided by `mindspore.dataset` can be configured with `shuffle=True` at load time, or shuffled afterwards with the following operation:" ] },
{ "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "train_dataset = train_dataset.shuffle(buffer_size=64)\n", "\n", "visualize(train_dataset)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### map" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The `map` operation is the key operation in data preprocessing. It adds data transforms (Transforms) to a specified column of the dataset, applies them to every element in that column, and returns a new dataset containing the transformed elements. Here we rescale the MNIST data by dividing every image by 255, which converts the data type from uint8 to float32.\n", "\n", "> For the transform types that Dataset supports, see [Transforms](https://www.mindspore.cn/tutorials/zh-CN/r2.0.0-alpha/beginner/transforms.html)." ] },
{ "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(28, 28, 1) UInt8\n" ] } ], "source": [ "image, label = next(train_dataset.create_tuple_iterator())\n", "print(image.shape, image.dtype)" ] },
{ "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "train_dataset = train_dataset.map(vision.Rescale(1.0 / 255.0, 0), input_columns='image')" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Comparing the data before and after `map` shows the change in data type." ] },
{ "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(28, 28, 1) Float32\n" ] } ], "source": [ "image, label = next(train_dataset.create_tuple_iterator())\n", "print(image.shape, image.dtype)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### batch\n", "\n", "Packing the dataset into a fixed-size `batch` is a compromise for optimizing a model with gradient descent under limited hardware resources: it preserves the randomness of gradient descent while keeping the amount of computation per step manageable.\n", "\n", "![op-batch](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r2.0.0-alpha/tutorials/source_zh_cn/advanced/dataset/images/op_batch.png)\n", "\n", "Typically we set a fixed batch size and split the continuous data into a number of batches." ] },
{ "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "train_dataset = train_dataset.batch(batch_size=32)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "After batching, the data gains one extra dimension of size `batch_size`." ] 
}, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(32, 28, 28, 1) Float32\n" ] } ], "source": [ "image, label = next(train_dataset.create_tuple_iterator())\n", "print(image.shape, image.dtype)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Custom Datasets" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "`mindspore.dataset` provides loading interfaces for some commonly used datasets and for datasets in standard formats. For datasets that MindSpore cannot yet load directly, you can write a custom dataset class or a custom dataset generator function to produce the data, and then load it through the `GeneratorDataset` interface.\n", "\n", "`GeneratorDataset` supports building custom datasets from iterable objects, iterators, and generator functions; the sections below walk through iterable objects and iterators." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Iterable Objects\n", "\n", "In Python, any object whose elements can all be traversed with a `for` loop can be called an iterable. We can construct an iterable object by implementing the `__getitem__` method and then load it into `GeneratorDataset`." ] },
{ "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "# Iterable object as input source\n", "class Iterable:\n", "    def __init__(self):\n", "        self._data = np.random.sample((5, 2))\n", "        self._label = np.random.sample((5, 1))\n", "\n", "    def __getitem__(self, index):\n", "        return self._data[index], self._label[index]\n", "\n", "    def __len__(self):\n", "        return len(self._data)" ] },
{ "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "data = Iterable()\n", "dataset = GeneratorDataset(source=data, column_names=[\"data\", \"label\"])" ] },
{ "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "# list, dict and tuple are also iterable objects.\n", "dataset = GeneratorDataset(source=[(np.array(0),), (np.array(1),), (np.array(2),)], column_names=[\"col\"])" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Iterators\n", "\n", "In Python, an object that implements the `__iter__` and `__next__` methods is called an iterator. Below we build a simple iterator and load it into `GeneratorDataset`." ] },
{ "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "# Iterator as input source\n", "class Iterator:\n", "    def __init__(self):\n", "        self._index = 0\n", "        self._data = np.random.sample((5, 2))\n", "        self._label = np.random.sample((5, 1))\n", "\n", "    def __next__(self):\n", "        if self._index >= len(self._data):\n", "            raise StopIteration\n", "        else:\n", "            item = (self._data[self._index], self._label[self._index])\n", "            self._index += 1\n", "            return item\n", "\n", "    def __iter__(self):\n", "        self._index = 0\n", "        return self\n", "\n", "    def __len__(self):\n", "        return len(self._data)" ] },
{ "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "data = Iterator()\n", "dataset = GeneratorDataset(source=data, column_names=[\"data\", \"label\"])" ] } ], "metadata": { "kernelspec": { "display_name": "MindSpore", "language": "python", "name": "mindspore" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.5" }, "vscode": { "interpreter": { "hash": "8c9da313289c39257cb28b126d2dadd33153d4da4d524f730c81a4aaccbd2ca7" } } }, "nbformat": 4, "nbformat_minor": 4 }