{
 "cells": [
  {
   "cell_type": "markdown",
   "source": [
    "# 通用数据处理\n",
    "\n",
    "`Ascend` `GPU` `CPU` `数据准备`\n",
    "\n",
    "[![](https://gitee.com/mindspore/docs/raw/r1.5/resource/_static/logo_modelarts.png)](https://authoring-modelarts-cnnorth4.huaweicloud.com/console/lab?share-url-b64=aHR0cHM6Ly9vYnMuZHVhbHN0YWNrLmNuLW5vcnRoLTQubXlodWF3ZWljbG91ZC5jb20vbWluZHNwb3JlLXdlYnNpdGUvbm90ZWJvb2svbW9kZWxhcnRzL3Byb2dyYW1taW5nX2d1aWRlL21pbmRzcG9yZV9waXBlbGluZS5pcHluYg==&imageid=65f636a0-56cf-49df-b941-7d2a07ba8c8c)&emsp;[![](https://gitee.com/mindspore/docs/raw/r1.5/resource/_static/logo_notebook.png)](https://obs.dualstack.cn-north-4.myhuaweicloud.com/mindspore-website/notebook/r1.5/programming_guide/zh_cn/mindspore_pipeline_common.ipynb)&emsp;[![](https://gitee.com/mindspore/docs/raw/r1.5/resource/_static/logo_download_code.png)](https://obs.dualstack.cn-north-4.myhuaweicloud.com/mindspore-website/notebook/r1.5/programming_guide/zh_cn/mindspore_pipeline_common.py)&emsp;[![](https://gitee.com/mindspore/docs/raw/r1.5/resource/_static/logo_source.png)](https://gitee.com/mindspore/docs/blob/r1.5/docs/mindspore/programming_guide/source_zh_cn/pipeline_common.ipynb)"
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "## 概述\n",
    "\n",
    "数据是深度学习的基础，良好的数据输入可以对整个深度神经网络训练起到非常积极的作用。在训练前对已加载的数据集进行数据处理，可以解决诸如数据量过大、样本分布不均等问题，从而获得更加优化的数据输入。\n",
    "\n",
    "MindSpore的各个数据集类都为用户提供了多种数据处理算子，用户可以构建数据处理pipeline定义需要使用的数据处理操作，数据即可在训练过程中像水一样源源不断地经过数据处理pipeline流向训练系统。\n",
    "\n",
    "MindSpore目前支持的部分常用数据处理算子如下表所示，更多数据处理操作参见[API文档](https://www.mindspore.cn/docs/api/zh-CN/r1.5/api_python/mindspore.dataset.html)。"
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "| 数据处理算子  | 算子说明 |\n",
    "| :----  | :----           |\n",
    "| shuffle | 对数据集进行混洗，随机打乱数据顺序。 |\n",
    "| map | 提供自定义函数或算子，作用于数据集的指定列数据。 |\n",
    "| batch | 对数据集进行分批，可以减少训练轮次，加速训练过程。 |\n",
    "| repeat | 对数据集进行重复，达到扩充数据量的目的。 |\n",
    "| zip | 将两个数据集进行列拼接，合并为一个数据集。 |\n",
    "| concat | 将两个数据集进行行拼接，合并为一个数据集。 |"
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "## 数据处理算子\n",
    "\n",
    "### shuffle\n",
    "\n",
    "对数据集进行混洗，随机打乱数据顺序。\n",
    "\n",
    "> 设定的`buffer_size`越大，混洗程度越大，但时间、计算资源消耗也会更大。\n",
    "\n",
    "![shuffle](https://gitee.com/mindspore/docs/raw/r1.5/docs/mindspore/programming_guide/source_zh_cn/images/shuffle.png)\n",
    "\n",
    "下面的样例先构建了一个随机数据集，然后对其进行混洗操作，最后展示了混洗后的数据结果。"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "source": [
    "import numpy as np\n",
    "import mindspore.dataset as ds\n",
    "\n",
    "ds.config.set_seed(0)\n",
    "\n",
    "def generator_func():\n",
    "    for i in range(5):\n",
    "        yield (np.array([i, i+1, i+2]),)\n",
    "\n",
    "dataset1 = ds.GeneratorDataset(generator_func, [\"data\"])\n",
    "\n",
    "dataset1 = dataset1.shuffle(buffer_size=2)\n",
    "for data in dataset1.create_dict_iterator():\n",
    "    print(data)"
   ],
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "{'data': Tensor(shape=[3], dtype=Int64, value= [0, 1, 2])}\n",
      "{'data': Tensor(shape=[3], dtype=Int64, value= [2, 3, 4])}\n",
      "{'data': Tensor(shape=[3], dtype=Int64, value= [3, 4, 5])}\n",
      "{'data': Tensor(shape=[3], dtype=Int64, value= [1, 2, 3])}\n",
      "{'data': Tensor(shape=[3], dtype=Int64, value= [4, 5, 6])}\n"
     ]
    }
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "### map\n",
    "\n",
    "将指定的函数或算子作用于数据集的指定列数据，实现数据映射操作。用户可以自定义映射函数，也可以直接使用`c_transforms`或`py_transforms`中的算子针对图像、文本数据进行数据增强。\n",
    "\n",
    "> 更多数据增强的使用说明，参见编程指南中[数据增强](https://www.mindspore.cn/docs/programming_guide/zh-CN/r1.5/augmentation.html)章节。\n",
    "\n",
    "![map](https://gitee.com/mindspore/docs/raw/r1.5/docs/mindspore/programming_guide/source_zh_cn/images/map.png)\n",
    "\n",
    "下面的样例先构建了一个随机数据集，然后定义了数据翻倍的映射函数并将其作用于数据集，最后对比展示了映射前后的数据结果。"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "source": [
    "import numpy as np\n",
    "import mindspore.dataset as ds\n",
    "\n",
    "def generator_func():\n",
    "    for i in range(5):\n",
    "        yield (np.array([i, i+1, i+2]),)\n",
    "\n",
    "def pyfunc(x):\n",
    "    return x*2\n",
    "\n",
    "dataset = ds.GeneratorDataset(generator_func, [\"data\"])\n",
    "\n",
    "for data in dataset.create_dict_iterator():\n",
    "    print(data)\n",
    "\n",
    "print(\"------ after processing ------\")\n",
    "\n",
    "dataset = dataset.map(operations=pyfunc, input_columns=[\"data\"])\n",
    "\n",
    "for data in dataset.create_dict_iterator():\n",
    "    print(data)"
   ],
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "{'data': Tensor(shape=[3], dtype=Int64, value= [0, 1, 2])}\n",
      "{'data': Tensor(shape=[3], dtype=Int64, value= [1, 2, 3])}\n",
      "{'data': Tensor(shape=[3], dtype=Int64, value= [2, 3, 4])}\n",
      "{'data': Tensor(shape=[3], dtype=Int64, value= [3, 4, 5])}\n",
      "{'data': Tensor(shape=[3], dtype=Int64, value= [4, 5, 6])}\n",
      "------ after processing ------\n",
      "{'data': Tensor(shape=[3], dtype=Int64, value= [0, 2, 4])}\n",
      "{'data': Tensor(shape=[3], dtype=Int64, value= [2, 4, 6])}\n",
      "{'data': Tensor(shape=[3], dtype=Int64, value= [4, 6, 8])}\n",
      "{'data': Tensor(shape=[3], dtype=Int64, value= [ 6,  8, 10])}\n",
      "{'data': Tensor(shape=[3], dtype=Int64, value= [ 8, 10, 12])}\n"
     ]
    }
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "### batch\n",
    "\n",
    "将数据集分批，分别输入到训练系统中进行训练，可以减少训练轮次，达到加速训练过程的目的。\n",
    "\n",
    "![batch](https://gitee.com/mindspore/docs/raw/r1.5/docs/mindspore/programming_guide/source_zh_cn/images/batch.png)\n",
    "\n",
    "下面的样例先构建了一个随机数据集，然后分别展示了保留多余数据与否的数据集分批结果，其中批大小为2。"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "source": [
    "import numpy as np\n",
    "import mindspore.dataset as ds\n",
    "\n",
    "def generator_func():\n",
    "    for i in range(5):\n",
    "        yield (np.array([i, i+1, i+2]),)\n",
    "\n",
    "dataset1 = ds.GeneratorDataset(generator_func, [\"data\"])\n",
    "\n",
    "dataset1 = dataset1.batch(batch_size=2, drop_remainder=False)\n",
    "for data in dataset1.create_dict_iterator():\n",
    "    print(data)\n",
    "\n",
    "print(\"------ drop remainder ------\")\n",
    "\n",
    "dataset2 = ds.GeneratorDataset(generator_func, [\"data\"])\n",
    "\n",
    "dataset2 = dataset2.batch(batch_size=2, drop_remainder=True)\n",
    "for data in dataset2.create_dict_iterator():\n",
    "    print(data)"
   ],
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "{'data': Tensor(shape=[2, 3], dtype=Int64, value=\n",
      "[[0, 1, 2],\n",
      " [1, 2, 3]])}\n",
      "{'data': Tensor(shape=[2, 3], dtype=Int64, value=\n",
      "[[2, 3, 4],\n",
      " [3, 4, 5]])}\n",
      "{'data': Tensor(shape=[1, 3], dtype=Int64, value=\n",
      "[[4, 5, 6]])}\n",
      "------ drop remainder ------\n",
      "{'data': Tensor(shape=[2, 3], dtype=Int64, value=\n",
      "[[0, 1, 2],\n",
      " [1, 2, 3]])}\n",
      "{'data': Tensor(shape=[2, 3], dtype=Int64, value=\n",
      "[[2, 3, 4],\n",
      " [3, 4, 5]])}\n"
     ]
    }
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "### repeat\n",
    "\n",
    "对数据集进行重复，达到扩充数据量的目的。\n",
    "\n",
    "> `repeat`和`batch`操作的顺序会影响训练batch的数量，建议将`repeat`置于`batch`之后。\n",
    "\n",
    "![repeat](https://gitee.com/mindspore/docs/raw/r1.5/docs/mindspore/programming_guide/source_zh_cn/images/repeat.png)\n",
    "\n",
    "下面的样例先构建了一个随机数据集，然后将其重复2次，最后展示了重复后的数据结果。"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "source": [
    "import numpy as np\n",
    "import mindspore.dataset as ds\n",
    "\n",
    "def generator_func():\n",
    "    for i in range(5):\n",
    "        yield (np.array([i, i+1, i+2]),)\n",
    "\n",
    "dataset1 = ds.GeneratorDataset(generator_func, [\"data\"])\n",
    "\n",
    "dataset1 = dataset1.repeat(count=2)\n",
    "for data in dataset1.create_dict_iterator():\n",
    "    print(data)"
   ],
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "{'data': Tensor(shape=[3], dtype=Int64, value= [0, 1, 2])}\n",
      "{'data': Tensor(shape=[3], dtype=Int64, value= [1, 2, 3])}\n",
      "{'data': Tensor(shape=[3], dtype=Int64, value= [2, 3, 4])}\n",
      "{'data': Tensor(shape=[3], dtype=Int64, value= [3, 4, 5])}\n",
      "{'data': Tensor(shape=[3], dtype=Int64, value= [4, 5, 6])}\n",
      "{'data': Tensor(shape=[3], dtype=Int64, value= [0, 1, 2])}\n",
      "{'data': Tensor(shape=[3], dtype=Int64, value= [1, 2, 3])}\n",
      "{'data': Tensor(shape=[3], dtype=Int64, value= [2, 3, 4])}\n",
      "{'data': Tensor(shape=[3], dtype=Int64, value= [3, 4, 5])}\n",
      "{'data': Tensor(shape=[3], dtype=Int64, value= [4, 5, 6])}\n"
     ]
    }
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "### zip\n",
    "\n",
    "将两个数据集进行列拼接，合并为一个数据集。\n",
    "\n",
    "> 如果两个数据集的列名相同，则不会合并，请注意列的命名。\n",
    "> \n",
    "> 如果两个数据集的行数不同，合并后的行数将和较小行数保持一致。\n",
    "\n",
    "  ![zip](https://gitee.com/mindspore/docs/raw/r1.5/docs/mindspore/programming_guide/source_zh_cn/images/zip.png)\n",
    "\n",
    "下面的样例先构建了两个不同样本数的随机数据集，然后将其进行列拼接，最后展示了拼接后的数据结果。"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "source": [
    "import numpy as np\n",
    "import mindspore.dataset as ds\n",
    "\n",
    "def generator_func():\n",
    "    for i in range(7):\n",
    "        yield (np.array([i, i+1, i+2]),)\n",
    "\n",
    "def generator_func2():\n",
    "    for i in range(4):\n",
    "        yield (np.array([1, 2]),)\n",
    "\n",
    "dataset1 = ds.GeneratorDataset(generator_func, [\"data1\"])\n",
    "dataset2 = ds.GeneratorDataset(generator_func2, [\"data2\"])\n",
    "\n",
    "dataset3 = ds.zip((dataset1, dataset2))\n",
    "\n",
    "for data in dataset3.create_dict_iterator():\n",
    "    print(data)"
   ],
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "{'data1': Tensor(shape=[3], dtype=Int64, value= [0, 1, 2]), 'data2': Tensor(shape=[2], dtype=Int64, value= [1, 2])}\n",
      "{'data1': Tensor(shape=[3], dtype=Int64, value= [1, 2, 3]), 'data2': Tensor(shape=[2], dtype=Int64, value= [1, 2])}\n",
      "{'data1': Tensor(shape=[3], dtype=Int64, value= [2, 3, 4]), 'data2': Tensor(shape=[2], dtype=Int64, value= [1, 2])}\n",
      "{'data1': Tensor(shape=[3], dtype=Int64, value= [3, 4, 5]), 'data2': Tensor(shape=[2], dtype=Int64, value= [1, 2])}\n"
     ]
    }
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "### concat\n",
    "\n",
    "将两个数据集进行行拼接，合并为一个数据集。\n",
    "\n",
    "> 输入数据集中的列名，列数据类型和列数据的排列应相同。\n",
    "\n",
    "![concat](https://gitee.com/mindspore/docs/raw/r1.5/docs/mindspore/programming_guide/source_zh_cn/images/concat.png)\n",
    "\n",
    "下面的样例先构建了两个随机数据集，然后将其进行行拼接，最后展示了拼接后的数据结果。值得一提的是，使用`+`运算符也能达到同样的效果。"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "source": [
    "import numpy as np\n",
    "import mindspore.dataset as ds\n",
    "\n",
    "def generator_func():\n",
    "    for i in range(2):\n",
    "        yield (np.array([0, 0, 0]),)\n",
    "\n",
    "def generator_func2():\n",
    "    for i in range(2):\n",
    "        yield (np.array([1, 2, 3]),)\n",
    "\n",
    "dataset1 = ds.GeneratorDataset(generator_func, [\"data1\"])\n",
    "dataset2 = ds.GeneratorDataset(generator_func2, [\"data1\"])\n",
    "\n",
    "dataset3 = dataset1.concat(dataset2)\n",
    "\n",
    "for data in dataset3.create_dict_iterator():\n",
    "    print(data)"
   ],
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "{'data1': Tensor(shape=[3], dtype=Int64, value= [0, 0, 0])}\n",
      "{'data1': Tensor(shape=[3], dtype=Int64, value= [0, 0, 0])}\n",
      "{'data1': Tensor(shape=[3], dtype=Int64, value= [1, 2, 3])}\n",
      "{'data1': Tensor(shape=[3], dtype=Int64, value= [1, 2, 3])}\n"
     ]
    }
   ],
   "metadata": {}
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}