{ "cells": [ { "cell_type": "markdown", "source": [ "# 数据采样\n", "\n", "`Ascend` `GPU` `CPU` `数据准备`\n", "\n", "[](https://authoring-modelarts-cnnorth4.huaweicloud.com/console/lab?share-url-b64=aHR0cHM6Ly9taW5kc3BvcmUtd2Vic2l0ZS5vYnMuY24tbm9ydGgtNC5teWh1YXdlaWNsb3VkLmNvbS9ub3RlYm9vay9tb2RlbGFydHMvcHJvZ3JhbW1pbmdfZ3VpZGUvbWluZHNwb3JlX3NhbXBsZXIuaXB5bmI=&imageid=65f636a0-56cf-49df-b941-7d2a07ba8c8c) [](https://obs.dualstack.cn-north-4.myhuaweicloud.com/mindspore-website/notebook/r1.6/programming_guide/zh_cn/mindspore_sampler.ipynb) [](https://obs.dualstack.cn-north-4.myhuaweicloud.com/mindspore-website/notebook/r1.6/programming_guide/zh_cn/mindspore_sampler.py) [](https://gitee.com/mindspore/docs/blob/r1.6/docs/mindspore/programming_guide/source_zh_cn/sampler.ipynb)" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "## 概述\n", "\n", "MindSpore提供了多种用途的采样器(Sampler),帮助用户对数据集进行不同形式的采样,以满足训练需求,能够解决诸如数据集过大或样本类别分布不均等问题。只需在加载数据集时传入采样器对象,即可实现数据的采样。\n", "\n", "MindSpore目前提供的部分采样器类别如下表所示。此外,用户也可以根据需要实现自定义的采样器类。更多采样器的使用方法参见[API文档](https://www.mindspore.cn/docs/api/zh-CN/r1.6/api_python/mindspore.dataset.html)。" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "| 采样器名称 | 采样器说明 |\n", "| :---- | :---- |\n", "| RandomSampler | 随机采样器,在数据集中随机地采样指定数目的数据。 |\n", "| WeightedRandomSampler | 带权随机采样器,依照长度为N的概率列表,在前N个样本中随机采样指定数目的数据。 |\n", "| SubsetRandomSampler | 子集随机采样器,在指定的索引范围内随机采样指定数目的数据。 |\n", "| PKSampler | PK采样器,在指定的数据集类别P中,每种类别各采样K条数据。 |\n", "| DistributedSampler | 分布式采样器,在分布式训练中对数据集分片进行采样。 |" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "## MindSpore采样器\n", "\n", "下面以CIFAR-10数据集为例,介绍几种常用MindSpore采样器的使用方法。\n", "\n", "以下示例代码将数据集下载并解压到指定位置。" ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "import os\n", "import requests\n", "import tarfile\n", "import zipfile\n", "import shutil\n", "\n", "requests.packages.urllib3.disable_warnings()\n", "\n", "def download_dataset(url, target_path):\n", " \"\"\"download and decompress dataset\"\"\"\n", " if not os.path.exists(target_path):\n", " os.makedirs(target_path)\n", " download_file = url.split(\"/\")[-1]\n", " if not os.path.exists(download_file):\n", " res = requests.get(url, stream=True, verify=False)\n", " if download_file.split(\".\")[-1] not in [\"tgz\", \"zip\", \"tar\", \"gz\"]:\n", " download_file = os.path.join(target_path, download_file)\n", " with open(download_file, \"wb\") as f:\n", " for chunk in res.iter_content(chunk_size=512):\n", " if chunk:\n", " f.write(chunk)\n", " if download_file.endswith(\"zip\"):\n", " z = zipfile.ZipFile(download_file, \"r\")\n", " z.extractall(path=target_path)\n", " z.close()\n", " if download_file.endswith(\".tar.gz\") or download_file.endswith(\".tar\") or download_file.endswith(\".tgz\"):\n", " t = tarfile.open(download_file)\n", " names = t.getnames()\n", " for name in names:\n", " t.extract(name, target_path)\n", " t.close()\n", " print(\"The {} file is downloaded and saved in the path {} after processing\".format(os.path.basename(url), target_path))\n", "\n", "download_dataset(\"https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/cifar-10-binary.tar.gz\", \"./datasets\")\n", "test_path = \"./datasets/cifar-10-batches-bin/test\"\n", "train_path = \"./datasets/cifar-10-batches-bin/train\"\n", "os.makedirs(test_path, exist_ok=True)\n", "os.makedirs(train_path, exist_ok=True)\n", "if not os.path.exists(os.path.join(test_path, \"test_batch.bin\")):\n", " shutil.move(\"./datasets/cifar-10-batches-bin/test_batch.bin\", test_path)\n", "[shutil.move(\"./datasets/cifar-10-batches-bin/\"+i, train_path) for i in os.listdir(\"./datasets/cifar-10-batches-bin/\") if os.path.isfile(\"./datasets/cifar-10-batches-bin/\"+i) and not i.endswith(\".html\") and not os.path.exists(os.path.join(train_path, i))]" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "解压后数据集文件的目录结构如下:\n", "\n", "```text\n", "./datasets/cifar-10-batches-bin\n", "├── readme.html\n", "├── test\n", "│ └── test_batch.bin\n", "└── train\n", " ├── batches.meta.txt\n", " ├── data_batch_1.bin\n", " ├── data_batch_2.bin\n", " ├── data_batch_3.bin\n", " ├── data_batch_4.bin\n", " └── data_batch_5.bin\n", "```" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "### RandomSampler\n", "\n", "从索引序列中随机采样指定数目的数据。\n", "\n", "下面的样例使用随机采样器分别从CIFAR-10数据集中有放回和无放回地随机采样5个数据,并展示已加载数据的形状和标签。" ], "metadata": {} }, { "cell_type": "code", "execution_count": 2, "source": [ "import mindspore.dataset as ds\n", "\n", "ds.config.set_seed(0)\n", "\n", "DATA_DIR = \"./datasets/cifar-10-batches-bin/train/\"\n", "\n", "print(\"------ Without Replacement ------\")\n", "\n", "sampler = ds.RandomSampler(num_samples=5)\n", "dataset1 = ds.Cifar10Dataset(DATA_DIR, sampler=sampler)\n", "\n", "for data in dataset1.create_dict_iterator():\n", " print(\"Image shape:\", data['image'].shape, \", Label:\", data['label'])\n", "\n", "print(\"------ With Replacement ------\")\n", "\n", "sampler = ds.RandomSampler(replacement=True, num_samples=5)\n", "dataset2 = ds.Cifar10Dataset(DATA_DIR, sampler=sampler)\n", "\n", "for data in dataset2.create_dict_iterator():\n", " print(\"Image shape:\", data['image'].shape, \", Label:\", data['label'])" ], "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "------ Without Replacement ------\n", "Image shape: (32, 32, 3) , Label: 1\n", "Image shape: (32, 32, 3) , Label: 6\n", "Image shape: (32, 32, 3) , Label: 6\n", "Image shape: (32, 32, 3) , Label: 0\n", "Image shape: (32, 32, 3) , Label: 4\n", "------ With Replacement ------\n", "Image shape: (32, 32, 3) , Label: 0\n", "Image shape: (32, 32, 3) , Label: 9\n", "Image shape: (32, 32, 3) , Label: 3\n", "Image shape: (32, 32, 3) , Label: 9\n", "Image shape: (32, 32, 3) , Label: 6\n" ] } ], "metadata": {} }, { "cell_type": "markdown", "source": [ "### WeightedRandomSampler\n", "\n", "指定长度为N的采样概率列表,按照概率在前N个样本中随机采样指定数目的数据。\n", "\n", "下面的样例使用带权随机采样器从CIFAR-10数据集的前10个样本中按概率获取6个样本,并展示已读取数据的形状和标签。" ], "metadata": {} }, { "cell_type": "code", "execution_count": 3, "source": [ "import mindspore.dataset as ds\n", "\n", "ds.config.set_seed(1)\n", "\n", "DATA_DIR = \"./datasets/cifar-10-batches-bin/train/\"\n", "\n", "weights = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]\n", "sampler = ds.WeightedRandomSampler(weights, num_samples=6)\n", "dataset = ds.Cifar10Dataset(DATA_DIR, sampler=sampler)\n", "\n", "for data in dataset.create_dict_iterator():\n", " print(\"Image shape:\", data['image'].shape, \", Label:\", data['label'])" ], "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Image shape: (32, 32, 3) , Label: 9\n", "Image shape: (32, 32, 3) , Label: 9\n", "Image shape: (32, 32, 3) , Label: 6\n", "Image shape: (32, 32, 3) , Label: 9\n", "Image shape: (32, 32, 3) , Label: 6\n", "Image shape: (32, 32, 3) , Label: 6\n" ] } ], "metadata": {} }, { "cell_type": "markdown", "source": [ "### SubsetRandomSampler\n", "\n", "从指定索引子序列中随机采样指定数目的数据。\n", "\n", "下面的样例使用子序列随机采样器从CIFAR-10数据集的指定子序列中抽样3个样本,并展示已读取数据的形状和标签。" ], "metadata": {} }, { "cell_type": "code", "execution_count": 4, "source": [ "import mindspore.dataset as ds\n", "\n", "ds.config.set_seed(2)\n", "\n", "DATA_DIR = \"./datasets/cifar-10-batches-bin/train/\"\n", "\n", "indices = [0, 1, 2, 3, 4, 5]\n", "sampler = ds.SubsetRandomSampler(indices, num_samples=3)\n", "dataset = ds.Cifar10Dataset(DATA_DIR, sampler=sampler)\n", "\n", "for data in dataset.create_dict_iterator():\n", " print(\"Image shape:\", data['image'].shape, \", Label:\", data['label'])" ], "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Image shape: (32, 32, 3) , Label: 1\n", "Image shape: (32, 32, 3) , Label: 6\n", "Image shape: (32, 32, 3) , Label: 4\n" ] } ], "metadata": {} }, { "cell_type": "markdown", "source": [ "### PKSampler\n", "\n", "在指定的数据集类别P中,每种类别各采样K条数据。\n", "\n", "下面的样例使用PK采样器从CIFAR-10数据集中每种类别抽样2个样本,最多20个样本,并展示已读取数据的形状和标签。" ], "metadata": {} }, { "cell_type": "code", "execution_count": 5, "source": [ "import mindspore.dataset as ds\n", "\n", "ds.config.set_seed(3)\n", "\n", "DATA_DIR = \"./datasets/cifar-10-batches-bin/train/\"\n", "\n", "sampler = ds.PKSampler(num_val=2, class_column='label', num_samples=20)\n", "dataset = ds.Cifar10Dataset(DATA_DIR, sampler=sampler)\n", "\n", "for data in dataset.create_dict_iterator():\n", " print(\"Image shape:\", data['image'].shape, \", Label:\", data['label'])" ], "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Image shape: (32, 32, 3) , Label: 0\n", "Image shape: (32, 32, 3) , Label: 0\n", "Image shape: (32, 32, 3) , Label: 1\n", "Image shape: (32, 32, 3) , Label: 1\n", "Image shape: (32, 32, 3) , Label: 2\n", "Image shape: (32, 32, 3) , Label: 2\n", "Image shape: (32, 32, 3) , Label: 3\n", "Image shape: (32, 32, 3) , Label: 3\n", "Image shape: (32, 32, 3) , Label: 4\n", "Image shape: (32, 32, 3) , Label: 4\n", "Image shape: (32, 32, 3) , Label: 5\n", "Image shape: (32, 32, 3) , Label: 5\n", "Image shape: (32, 32, 3) , Label: 6\n", "Image shape: (32, 32, 3) , Label: 6\n", "Image shape: (32, 32, 3) , Label: 7\n", "Image shape: (32, 32, 3) , Label: 7\n", "Image shape: (32, 32, 3) , Label: 8\n", "Image shape: (32, 32, 3) , Label: 8\n", "Image shape: (32, 32, 3) , Label: 9\n", "Image shape: (32, 32, 3) , Label: 9\n" ] } ], "metadata": {} }, { "cell_type": "markdown", "source": [ "### DistributedSampler\n", "\n", "在分布式训练中,对数据集分片进行采样。\n", "\n", "下面的样例使用分布式采样器将构建的数据集分为3片,在每个分片中采样3个数据样本,并展示已读取的数据。" ], "metadata": {} }, { "cell_type": "code", "execution_count": 6, "source": [ "import mindspore.dataset as ds\n", "\n", "data_source = [0, 1, 2, 3, 4, 5, 6, 7, 8]\n", "\n", "sampler = ds.DistributedSampler(num_shards=3, shard_id=0, shuffle=False, num_samples=3)\n", "dataset = ds.NumpySlicesDataset(data_source, column_names=[\"data\"], sampler=sampler)\n", "\n", "for data in dataset.create_dict_iterator():\n", " print(data)" ], "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "{'data': Tensor(shape=[], dtype=Int64, value= 0)}\n", "{'data': Tensor(shape=[], dtype=Int64, value= 3)}\n", "{'data': Tensor(shape=[], dtype=Int64, value= 6)}\n" ] } ], "metadata": {} }, { "cell_type": "markdown", "source": [ "## 自定义采样器\n", "\n", "用户可以继承`Sampler`基类,通过实现`__iter__`方法来自定义采样器的采样方式。\n", "\n", "下面的样例定义了一个从下标0至下标9间隔为2采样的采样器,将其作用于CIFAR-10数据集,并展示已读取数据的形状和标签。" ], "metadata": {} }, { "cell_type": "code", "execution_count": 7, "source": [ "import mindspore.dataset as ds\n", "\n", "class MySampler(ds.Sampler):\n", " def __iter__(self):\n", " for i in range(0, 10, 2):\n", " yield i\n", "\n", "DATA_DIR = \"./datasets/cifar-10-batches-bin/train/\"\n", "\n", "dataset = ds.Cifar10Dataset(DATA_DIR, sampler=MySampler())\n", "\n", "for data in dataset.create_dict_iterator():\n", " print(\"Image shape:\", data['image'].shape, \", Label:\", data['label'])" ], "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Image shape: (32, 32, 3) , Label: 6\n", "Image shape: (32, 32, 3) , Label: 9\n", "Image shape: (32, 32, 3) , Label: 1\n", "Image shape: (32, 32, 3) , Label: 2\n", "Image shape: (32, 32, 3) , Label: 8\n" ] } ], "metadata": {} } ], "metadata": { "kernelspec": { "display_name": "MindSpore", "language": "python", "name": "mindspore" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }