{
 "cells": [
  {
   "cell_type": "markdown",
   "source": [
    "# 数据采样\n",
    "\n",
    "`Ascend` `GPU` `CPU` `数据准备`\n",
    "\n",
    "[![在线运行](https://gitee.com/mindspore/docs/raw/r1.6/resource/_static/logo_modelarts.png)](https://authoring-modelarts-cnnorth4.huaweicloud.com/console/lab?share-url-b64=aHR0cHM6Ly9taW5kc3BvcmUtd2Vic2l0ZS5vYnMuY24tbm9ydGgtNC5teWh1YXdlaWNsb3VkLmNvbS9ub3RlYm9vay9tb2RlbGFydHMvcHJvZ3JhbW1pbmdfZ3VpZGUvbWluZHNwb3JlX3NhbXBsZXIuaXB5bmI=&imageid=65f636a0-56cf-49df-b941-7d2a07ba8c8c)&emsp;[![下载Notebook](https://gitee.com/mindspore/docs/raw/r1.6/resource/_static/logo_notebook.png)](https://obs.dualstack.cn-north-4.myhuaweicloud.com/mindspore-website/notebook/r1.6/programming_guide/zh_cn/mindspore_sampler.ipynb)&emsp;[![下载样例代码](https://gitee.com/mindspore/docs/raw/r1.6/resource/_static/logo_download_code.png)](https://obs.dualstack.cn-north-4.myhuaweicloud.com/mindspore-website/notebook/r1.6/programming_guide/zh_cn/mindspore_sampler.py)&emsp;[![查看源文件](https://gitee.com/mindspore/docs/raw/r1.6/resource/_static/logo_source.png)](https://gitee.com/mindspore/docs/blob/r1.6/docs/mindspore/programming_guide/source_zh_cn/sampler.ipynb)"
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "## 概述\n",
    "\n",
    "MindSpore提供了多种用途的采样器（Sampler），帮助用户对数据集进行不同形式的采样，以满足训练需求，能够解决诸如数据集过大或样本类别分布不均等问题。只需在加载数据集时传入采样器对象，即可实现数据的采样。\n",
    "\n",
    "MindSpore目前提供的部分采样器类别如下表所示。此外，用户也可以根据需要实现自定义的采样器类。更多采样器的使用方法参见[API文档](https://www.mindspore.cn/docs/api/zh-CN/r1.6/api_python/mindspore.dataset.html)。"
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "| 采样器名称  | 采样器说明 |\n",
    "| :----  | :----           |\n",
    "| RandomSampler | 随机采样器，在数据集中随机地采样指定数目的数据。 |\n",
    "| WeightedRandomSampler | 带权随机采样器，依照长度为N的概率列表，在前N个样本中随机采样指定数目的数据。 |\n",
    "| SubsetRandomSampler | 子集随机采样器，在指定的索引范围内随机采样指定数目的数据。 |\n",
    "| PKSampler | PK采样器，在指定的数据集类别P中，每种类别各采样K条数据。 |\n",
    "| DistributedSampler | 分布式采样器，在分布式训练中对数据集分片进行采样。 |"
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "## MindSpore采样器\n",
    "\n",
    "下面以CIFAR-10数据集为例，介绍几种常用MindSpore采样器的使用方法。\n",
    "\n",
    "以下示例代码将数据集下载并解压到指定位置。"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "import os\n",
    "import requests\n",
    "import tarfile\n",
    "import zipfile\n",
    "import shutil\n",
    "\n",
    "requests.packages.urllib3.disable_warnings()\n",
    "\n",
    "def download_dataset(url, target_path):\n",
    "    \"\"\"download and decompress dataset\"\"\"\n",
    "    if not os.path.exists(target_path):\n",
    "        os.makedirs(target_path)\n",
    "    download_file = url.split(\"/\")[-1]\n",
    "    if not os.path.exists(download_file):\n",
    "        res = requests.get(url, stream=True, verify=False)\n",
    "        if download_file.split(\".\")[-1] not in [\"tgz\", \"zip\", \"tar\", \"gz\"]:\n",
    "            download_file = os.path.join(target_path, download_file)\n",
    "        with open(download_file, \"wb\") as f:\n",
    "            for chunk in res.iter_content(chunk_size=512):\n",
    "                if chunk:\n",
    "                    f.write(chunk)\n",
    "    if download_file.endswith(\"zip\"):\n",
    "        z = zipfile.ZipFile(download_file, \"r\")\n",
    "        z.extractall(path=target_path)\n",
    "        z.close()\n",
    "    if download_file.endswith(\".tar.gz\") or download_file.endswith(\".tar\") or download_file.endswith(\".tgz\"):\n",
    "        t = tarfile.open(download_file)\n",
    "        names = t.getnames()\n",
    "        for name in names:\n",
    "            t.extract(name, target_path)\n",
    "        t.close()\n",
    "    print(\"The {} file is downloaded and saved in the path {} after processing\".format(os.path.basename(url), target_path))\n",
    "\n",
    "download_dataset(\"https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/cifar-10-binary.tar.gz\", \"./datasets\")\n",
    "test_path = \"./datasets/cifar-10-batches-bin/test\"\n",
    "train_path = \"./datasets/cifar-10-batches-bin/train\"\n",
    "os.makedirs(test_path, exist_ok=True)\n",
    "os.makedirs(train_path, exist_ok=True)\n",
    "if not os.path.exists(os.path.join(test_path, \"test_batch.bin\")):\n",
    "    shutil.move(\"./datasets/cifar-10-batches-bin/test_batch.bin\", test_path)\n",
    "[shutil.move(\"./datasets/cifar-10-batches-bin/\"+i, train_path) for i in os.listdir(\"./datasets/cifar-10-batches-bin/\") if os.path.isfile(\"./datasets/cifar-10-batches-bin/\"+i) and not i.endswith(\".html\") and not os.path.exists(os.path.join(train_path, i))]"
   ],
   "outputs": [],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "解压后数据集文件的目录结构如下：\n",
    "\n",
    "```text\n",
    "./datasets/cifar-10-batches-bin\n",
    "├── readme.html\n",
    "├── test\n",
    "│   └── test_batch.bin\n",
    "└── train\n",
    "    ├── batches.meta.txt\n",
    "    ├── data_batch_1.bin\n",
    "    ├── data_batch_2.bin\n",
    "    ├── data_batch_3.bin\n",
    "    ├── data_batch_4.bin\n",
    "    └── data_batch_5.bin\n",
    "```"
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "### RandomSampler\n",
    "\n",
    "从索引序列中随机采样指定数目的数据。\n",
    "\n",
    "下面的样例使用随机采样器分别从CIFAR-10数据集中有放回和无放回地随机采样5个数据，并展示已加载数据的形状和标签。"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "source": [
    "import mindspore.dataset as ds\n",
    "\n",
    "ds.config.set_seed(0)\n",
    "\n",
    "DATA_DIR = \"./datasets/cifar-10-batches-bin/train/\"\n",
    "\n",
    "print(\"------ Without Replacement ------\")\n",
    "\n",
    "sampler = ds.RandomSampler(num_samples=5)\n",
    "dataset1 = ds.Cifar10Dataset(DATA_DIR, sampler=sampler)\n",
    "\n",
    "for data in dataset1.create_dict_iterator():\n",
    "    print(\"Image shape:\", data['image'].shape, \", Label:\", data['label'])\n",
    "\n",
    "print(\"------ With Replacement ------\")\n",
    "\n",
    "sampler = ds.RandomSampler(replacement=True, num_samples=5)\n",
    "dataset2 = ds.Cifar10Dataset(DATA_DIR, sampler=sampler)\n",
    "\n",
    "for data in dataset2.create_dict_iterator():\n",
    "    print(\"Image shape:\", data['image'].shape, \", Label:\", data['label'])"
   ],
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "------ Without Replacement ------\n",
      "Image shape: (32, 32, 3) , Label: 1\n",
      "Image shape: (32, 32, 3) , Label: 6\n",
      "Image shape: (32, 32, 3) , Label: 6\n",
      "Image shape: (32, 32, 3) , Label: 0\n",
      "Image shape: (32, 32, 3) , Label: 4\n",
      "------ With Replacement ------\n",
      "Image shape: (32, 32, 3) , Label: 0\n",
      "Image shape: (32, 32, 3) , Label: 9\n",
      "Image shape: (32, 32, 3) , Label: 3\n",
      "Image shape: (32, 32, 3) , Label: 9\n",
      "Image shape: (32, 32, 3) , Label: 6\n"
     ]
    }
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "### WeightedRandomSampler\n",
    "\n",
    "指定长度为N的采样概率列表，按照概率在前N个样本中随机采样指定数目的数据。\n",
    "\n",
    "下面的样例使用带权随机采样器从CIFAR-10数据集的前10个样本中按概率获取6个样本，并展示已读取数据的形状和标签。"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "source": [
    "import mindspore.dataset as ds\n",
    "\n",
    "ds.config.set_seed(1)\n",
    "\n",
    "DATA_DIR = \"./datasets/cifar-10-batches-bin/train/\"\n",
    "\n",
    "weights = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]\n",
    "sampler = ds.WeightedRandomSampler(weights, num_samples=6)\n",
    "dataset = ds.Cifar10Dataset(DATA_DIR, sampler=sampler)\n",
    "\n",
    "for data in dataset.create_dict_iterator():\n",
    "    print(\"Image shape:\", data['image'].shape, \", Label:\", data['label'])"
   ],
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "Image shape: (32, 32, 3) , Label: 9\n",
      "Image shape: (32, 32, 3) , Label: 9\n",
      "Image shape: (32, 32, 3) , Label: 6\n",
      "Image shape: (32, 32, 3) , Label: 9\n",
      "Image shape: (32, 32, 3) , Label: 6\n",
      "Image shape: (32, 32, 3) , Label: 6\n"
     ]
    }
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "### SubsetRandomSampler\n",
    "\n",
    "从指定索引子序列中随机采样指定数目的数据。\n",
    "\n",
    "下面的样例使用子序列随机采样器从CIFAR-10数据集的指定子序列中抽样3个样本，并展示已读取数据的形状和标签。"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "source": [
    "import mindspore.dataset as ds\n",
    "\n",
    "ds.config.set_seed(2)\n",
    "\n",
    "DATA_DIR = \"./datasets/cifar-10-batches-bin/train/\"\n",
    "\n",
    "indices = [0, 1, 2, 3, 4, 5]\n",
    "sampler = ds.SubsetRandomSampler(indices, num_samples=3)\n",
    "dataset = ds.Cifar10Dataset(DATA_DIR, sampler=sampler)\n",
    "\n",
    "for data in dataset.create_dict_iterator():\n",
    "    print(\"Image shape:\", data['image'].shape, \", Label:\", data['label'])"
   ],
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "Image shape: (32, 32, 3) , Label: 1\n",
      "Image shape: (32, 32, 3) , Label: 6\n",
      "Image shape: (32, 32, 3) , Label: 4\n"
     ]
    }
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "### PKSampler\n",
    "\n",
    "在指定的数据集类别P中，每种类别各采样K条数据。\n",
    "\n",
    "下面的样例使用PK采样器从CIFAR-10数据集中每种类别抽样2个样本，最多20个样本，并展示已读取数据的形状和标签。"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "source": [
    "import mindspore.dataset as ds\n",
    "\n",
    "ds.config.set_seed(3)\n",
    "\n",
    "DATA_DIR = \"./datasets/cifar-10-batches-bin/train/\"\n",
    "\n",
    "sampler = ds.PKSampler(num_val=2, class_column='label', num_samples=20)\n",
    "dataset = ds.Cifar10Dataset(DATA_DIR, sampler=sampler)\n",
    "\n",
    "for data in dataset.create_dict_iterator():\n",
    "    print(\"Image shape:\", data['image'].shape, \", Label:\", data['label'])"
   ],
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "Image shape: (32, 32, 3) , Label: 0\n",
      "Image shape: (32, 32, 3) , Label: 0\n",
      "Image shape: (32, 32, 3) , Label: 1\n",
      "Image shape: (32, 32, 3) , Label: 1\n",
      "Image shape: (32, 32, 3) , Label: 2\n",
      "Image shape: (32, 32, 3) , Label: 2\n",
      "Image shape: (32, 32, 3) , Label: 3\n",
      "Image shape: (32, 32, 3) , Label: 3\n",
      "Image shape: (32, 32, 3) , Label: 4\n",
      "Image shape: (32, 32, 3) , Label: 4\n",
      "Image shape: (32, 32, 3) , Label: 5\n",
      "Image shape: (32, 32, 3) , Label: 5\n",
      "Image shape: (32, 32, 3) , Label: 6\n",
      "Image shape: (32, 32, 3) , Label: 6\n",
      "Image shape: (32, 32, 3) , Label: 7\n",
      "Image shape: (32, 32, 3) , Label: 7\n",
      "Image shape: (32, 32, 3) , Label: 8\n",
      "Image shape: (32, 32, 3) , Label: 8\n",
      "Image shape: (32, 32, 3) , Label: 9\n",
      "Image shape: (32, 32, 3) , Label: 9\n"
     ]
    }
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "### DistributedSampler\n",
    "\n",
    "在分布式训练中，对数据集分片进行采样。\n",
    "\n",
    "下面的样例使用分布式采样器将构建的数据集分为3片，在每个分片中采样3个数据样本，并展示已读取的数据。"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "source": [
    "import mindspore.dataset as ds\n",
    "\n",
    "data_source = [0, 1, 2, 3, 4, 5, 6, 7, 8]\n",
    "\n",
    "sampler = ds.DistributedSampler(num_shards=3, shard_id=0, shuffle=False, num_samples=3)\n",
    "dataset = ds.NumpySlicesDataset(data_source, column_names=[\"data\"], sampler=sampler)\n",
    "\n",
    "for data in dataset.create_dict_iterator():\n",
    "    print(data)"
   ],
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "{'data': Tensor(shape=[], dtype=Int64, value= 0)}\n",
      "{'data': Tensor(shape=[], dtype=Int64, value= 3)}\n",
      "{'data': Tensor(shape=[], dtype=Int64, value= 6)}\n"
     ]
    }
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "## 自定义采样器\n",
    "\n",
    "用户可以继承`Sampler`基类，通过实现`__iter__`方法来自定义采样器的采样方式。\n",
    "\n",
    "下面的样例定义了一个从下标0至下标9间隔为2采样的采样器，将其作用于CIFAR-10数据集，并展示已读取数据的形状和标签。"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "source": [
    "import mindspore.dataset as ds\n",
    "\n",
    "class MySampler(ds.Sampler):\n",
    "    def __iter__(self):\n",
    "        for i in range(0, 10, 2):\n",
    "            yield i\n",
    "\n",
    "DATA_DIR = \"./datasets/cifar-10-batches-bin/train/\"\n",
    "\n",
    "dataset = ds.Cifar10Dataset(DATA_DIR, sampler=MySampler())\n",
    "\n",
    "for data in dataset.create_dict_iterator():\n",
    "    print(\"Image shape:\", data['image'].shape, \", Label:\", data['label'])"
   ],
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "Image shape: (32, 32, 3) , Label: 6\n",
      "Image shape: (32, 32, 3) , Label: 9\n",
      "Image shape: (32, 32, 3) , Label: 1\n",
      "Image shape: (32, 32, 3) , Label: 2\n",
      "Image shape: (32, 32, 3) , Label: 8\n"
     ]
    }
   ],
   "metadata": {}
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "MindSpore",
   "language": "python",
   "name": "mindspore"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}