{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Data Processing Overview" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "[Download Notebook](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/r2.4.10/zh_cn/model_train/dataset/mindspore_overview.ipynb) [Download Sample Code](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/r2.4.10/zh_cn/model_train/dataset/mindspore_overview.py) [View Source](https://gitee.com/mindspore/docs/blob/r2.4.10/docs/mindspore/source_zh_cn/model_train/dataset/overview.ipynb)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "MindSpore Dataset provides two data processing capabilities: the data processing Pipeline mode and the lightweight data processing mode.\n", "\n", "1. Data processing Pipeline mode: provides a concurrent data processing pipeline (Pipeline) built on the C++ Runtime. By defining the dataset loading, data transformation, and data batching steps, users get efficient loading, processing, and batching of the dataset, with adjustable concurrency and caching, so that data supply does not become a bottleneck for training on NPU cards.\n", "\n", "2. Lightweight data processing mode: lets users apply data transformation operations (such as Resize, Crop, and HWC2CHW) to a single sample.\n", "\n", "The rest of this chapter describes these two data processing modes." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Data Processing Pipeline Mode\n", "\n", "Users define a Dataset pipeline through the API. Once the training process starts, Dataset repeatedly loads data from the dataset -> processes it -> batches it -> hands it to an iterator, and the output is finally used for training.\n", "\n", "\n", "\n", "As the flow above shows, the MindSpore Dataset module makes it easy to define a data preprocessing Pipeline and processes the samples in the dataset efficiently (multi-process/multi-thread). The steps are as follows:\n", "\n", "- Load the dataset (Dataset): use a Dataset class ([standard-format datasets](https://www.mindspore.cn/docs/zh-CN/r2.4.10/api_python/mindspore.dataset.html#%E6%A0%87%E5%87%86%E6%A0%BC%E5%BC%8F), [vision datasets](https://www.mindspore.cn/docs/zh-CN/r2.4.10/api_python/mindspore.dataset.html#%E8%A7%86%E8%A7%89), [nlp datasets](https://www.mindspore.cn/docs/zh-CN/r2.4.10/api_python/mindspore.dataset.html#%E6%96%87%E6%9C%AC), [audio datasets](https://www.mindspore.cn/docs/zh-CN/r2.4.10/api_python/mindspore.dataset.html#%E9%9F%B3%E9%A2%91)) to load a supported dataset, or use a UDF Loader + [GeneratorDataset custom dataset](https://www.mindspore.cn/docs/zh-CN/r2.4.10/api_python/dataset/mindspore.dataset.GeneratorDataset.html#mindspore.dataset.GeneratorDataset) 
to load a custom dataset at the Python layer. The loading classes also support multiple Samplers, data sharding, data shuffling, and related features;\n", "\n", "- Dataset operations (filter/skip): use dataset object methods such as [.shuffle](https://www.mindspore.cn/docs/zh-CN/r2.4.10/api_python/dataset/dataset_method/operation/mindspore.dataset.Dataset.shuffle.html#mindspore.dataset.Dataset.shuffle) / [.filter](https://www.mindspore.cn/docs/zh-CN/r2.4.10/api_python/dataset/dataset_method/operation/mindspore.dataset.Dataset.filter.html#mindspore.dataset.Dataset.filter) / [.skip](https://www.mindspore.cn/docs/zh-CN/r2.4.10/api_python/dataset/dataset_method/operation/mindspore.dataset.Dataset.skip.html#mindspore.dataset.Dataset.skip) / [.split](https://www.mindspore.cn/docs/zh-CN/r2.4.10/api_python/dataset/dataset_method/operation/mindspore.dataset.Dataset.split.html#mindspore.dataset.Dataset.split) / [.take](https://www.mindspore.cn/docs/zh-CN/r2.4.10/api_python/dataset/dataset_method/operation/mindspore.dataset.Dataset.take.html#mindspore.dataset.Dataset.take) / … to further shuffle, filter, skip, split, or cap the number of samples in the dataset;\n", "\n", "- Per-sample transformation (map): add data transformation operations ([vision transforms](https://www.mindspore.cn/docs/zh-CN/r2.4.10/api_python/mindspore.dataset.transforms.html#%E8%A7%86%E8%A7%89), [nlp transforms](https://www.mindspore.cn/docs/zh-CN/r2.4.10/api_python/mindspore.dataset.transforms.html#%E6%96%87%E6%9C%AC), [audio transforms](https://www.mindspore.cn/docs/zh-CN/r2.4.10/api_python/mindspore.dataset.transforms.html#%E9%9F%B3%E9%A2%91)) to a map operation for execution. A preprocessing pipeline can define multiple map operations, each applying different transformations, and a transformation can also be a user-defined PyFunc;\n", "\n", "- Batch (batch): after the samples are transformed, use the [.batch](https://www.mindspore.cn/docs/zh-CN/r2.4.10/api_python/dataset/dataset_method/batch/mindspore.dataset.Dataset.batch.html#mindspore.dataset.Dataset.batch) operation to group multiple samples into a batch, or customize the batching logic through the per_batch_map parameter of batch;\n", "\n", "- Iterator (create_dict_iterator): finally, use the dataset object method [.create_dict_iterator](https://www.mindspore.cn/docs/zh-CN/r2.4.10/api_python/dataset/dataset_method/iterator/mindspore.dataset.Dataset.create_dict_iterator.html#mindspore.dataset.Dataset.create_dict_iterator) / 
[.create_tuple_iterator](https://www.mindspore.cn/docs/zh-CN/r2.4.10/api_python/dataset/dataset_method/iterator/mindspore.dataset.Dataset.create_tuple_iterator.html#mindspore.dataset.Dataset.create_tuple_iterator) to create an iterator that outputs the preprocessed data in a loop." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Dataset Loading\n", "\n", "The following covers common dataset loading workflows: loading a single dataset, combining datasets, splitting a dataset, and saving a dataset.\n", "\n", "#### Loading a Single Dataset\n", "\n", "Dataset loading classes load training datasets from local disk, OBS, or shared storage; their main job is to load the stored dataset into memory. The dataset loading interfaces are as follows:\n", "\n", "| Dataset interface category | API list | Description |\n", "|------------------------|----------------------------------------------------------|--------------------------------------------------------------|\n", "| Standard-format datasets | [MindDataset](https://www.mindspore.cn/docs/zh-CN/r2.4.10/api_python/dataset/mindspore.dataset.MindDataset.html#mindspore.dataset.MindDataset), [TFRecordDataset](https://www.mindspore.cn/docs/zh-CN/r2.4.10/api_python/dataset/mindspore.dataset.TFRecordDataset.html#mindspore.dataset.TFRecordDataset), [CSVDataset](https://www.mindspore.cn/docs/zh-CN/r2.4.10/api_python/dataset/mindspore.dataset.CSVDataset.html#mindspore.dataset.CSVDataset), etc. | MindDataset depends on the MindSpore data format; see [Format Conversion](https://www.mindspore.cn/docs/zh-CN/r2.4.10/model_train/dataset/record.html) |\n", "| Custom datasets | [GeneratorDataset](https://www.mindspore.cn/docs/zh-CN/r2.4.10/api_python/dataset/mindspore.dataset.GeneratorDataset.html#mindspore.dataset.GeneratorDataset), [RandomDataset](https://www.mindspore.cn/docs/zh-CN/r2.4.10/api_python/dataset/mindspore.dataset.RandomDataset.html#mindspore.dataset.RandomDataset), etc. | GeneratorDataset loads a user-defined DataLoader; see [Custom Datasets](https://www.mindspore.cn/tutorials/zh-CN/r2.4.10/beginner/dataset.html#%E8%87%AA%E5%AE%9A%E4%B9%89%E6%95%B0%E6%8D%AE%E9%9B%86) |\n", "| Common datasets | [ImageFolderDataset](https://www.mindspore.cn/docs/zh-CN/r2.4.10/api_python/dataset/mindspore.dataset.ImageFolderDataset.html#mindspore.dataset.ImageFolderDataset), 
[Cifar10Dataset](https://www.mindspore.cn/docs/zh-CN/r2.4.10/api_python/dataset/mindspore.dataset.Cifar10Dataset.html#mindspore.dataset.Cifar10Dataset), [IWSLT2017Dataset](https://www.mindspore.cn/docs/zh-CN/r2.4.10/api_python/dataset/mindspore.dataset.IWSLT2017Dataset.html#mindspore.dataset.IWSLT2017Dataset), [LJSpeechDataset](https://www.mindspore.cn/docs/zh-CN/r2.4.10/api_python/dataset/mindspore.dataset.LJSpeechDataset.html#mindspore.dataset.LJSpeechDataset), etc. | For commonly used open-source datasets |\n", "\n", "The dataset loading interfaces above ([examples](https://www.mindspore.cn/tutorials/zh-CN/r2.4.10/beginner/dataset.html#%E6%95%B0%E6%8D%AE%E9%9B%86%E5%8A%A0%E8%BD%BD)) accept different parameters to achieve different loading behavior. Commonly used parameters include:\n", "\n", "1. Select specific columns from the dataset. Parameter: ```columns_list```. It applies only to some dataset interfaces. Default: None, which loads all data columns.\n", "\n", "2. Set the number of parallel workers used to read the dataset. Parameter: ```num_parallel_workers```. Default: 8.\n", "\n", "3. Configure the sampling logic of the dataset through parameters:\n", "    1) Enable shuffling. Parameter: ```shuffle```. Default: True.\n", "\n", "    2) Shard the dataset. Parameters: ```num_shards``` and ```shard_id```. Default: None, no sharding.\n", "\n", "    3) For more sampling logic, see [Data Sampling](https://www.mindspore.cn/docs/zh-CN/r2.4.10/model_train/dataset/sampler.html).\n", "\n", "#### Combining Datasets\n", "\n", "Multiple datasets can be combined by concatenating them (in series) or zipping them (in parallel) to form a new dataset object.\n", "\n", "- Concatenate multiple datasets" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'column_1': Tensor(shape=[], dtype=Int32, value= 3)}\n", "{'column_1': Tensor(shape=[], dtype=Int32, value= 2)}\n", "{'column_1': Tensor(shape=[], dtype=Int32, value= 1)}\n", "{'column_1': Tensor(shape=[], dtype=Int32, value= 6)}\n", "{'column_1': Tensor(shape=[], dtype=Int32, value= 5)}\n", "{'column_1': Tensor(shape=[], dtype=Int32, value= 4)}\n" ] } ], "source": [ "import mindspore.dataset as ds\n", "\n", "ds.config.set_seed(1234)\n", "\n", "data = [1, 2, 3]\n", "dataset1 = ds.NumpySlicesDataset(data=data, column_names=[\"column_1\"])\n", "\n", "data = [4, 5, 6]\n", "dataset2 = ds.NumpySlicesDataset(data=data, column_names=[\"column_1\"])\n", "\n", 
"# concat appends the rows of dataset2 after those of dataset1\n", "dataset = dataset1.concat(dataset2)\n", "for item in dataset.create_dict_iterator():\n", "    print(item)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "- Zip multiple datasets (combine them in parallel)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'column_1': Tensor(shape=[], dtype=Int32, value= 3), 'column_2': Tensor(shape=[], dtype=Int32, value= 6)}\n", "{'column_1': Tensor(shape=[], dtype=Int32, value= 2), 'column_2': Tensor(shape=[], dtype=Int32, value= 5)}\n", "{'column_1': Tensor(shape=[], dtype=Int32, value= 1), 'column_2': Tensor(shape=[], dtype=Int32, value= 4)}\n" ] } ], "source": [ "import mindspore.dataset as ds\n", "\n", "ds.config.set_seed(1234)\n", "\n", "data = [1, 2, 3]\n", "dataset1 = ds.NumpySlicesDataset(data=data, column_names=[\"column_1\"])\n", "\n", "data = [4, 5, 6]\n", "dataset2 = ds.NumpySlicesDataset(data=data, column_names=[\"column_2\"])\n", "\n", "# zip joins the two datasets column-wise, pairing rows by position\n", "dataset = dataset1.zip(dataset2)\n", "for item in dataset.create_dict_iterator():\n", "    print(item)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "#### Splitting a Dataset\n", "\n", "Split the dataset into a training dataset and a validation dataset, used for training and validation respectively." ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ ">>>> train dataset >>>>\n", "{'column_1': Tensor(shape=[], dtype=Int32, value= 5)}\n", "{'column_1': Tensor(shape=[], dtype=Int32, value= 2)}\n", "{'column_1': Tensor(shape=[], dtype=Int32, value= 6)}\n", "{'column_1': Tensor(shape=[], dtype=Int32, value= 1)}\n" ] } ], "source": [ "import mindspore.dataset as ds\n", "\n", "data = [1, 2, 3, 4, 5, 6]\n", "dataset = ds.NumpySlicesDataset(data=data, column_names=[\"column_1\"], shuffle=False)\n", "\n", "# Split the 6 samples into 4 for training and 2 for evaluation\n", "train_dataset, eval_dataset = dataset.split([4, 2])\n", 
"\n", "print(\">>>> train dataset >>>>\")\n", "for item in train_dataset.create_dict_iterator():\n", "    print(item)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ ">>>> eval dataset >>>>\n", "{'column_1': Tensor(shape=[], dtype=Int32, value= 3)}\n", "{'column_1': Tensor(shape=[], dtype=Int32, value= 4)}\n" ] } ], "source": [ "print(\">>>> eval dataset >>>>\")\n", "for item in eval_dataset.create_dict_iterator():\n", "    print(item)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "#### Saving a Dataset\n", "\n", "Save the dataset again in the MindRecord data format." ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "import os\n", "import mindspore.dataset as ds\n", "\n", "ds.config.set_seed(1234)\n", "\n", "data = [1, 2, 3, 4, 5, 6]\n", "dataset = ds.NumpySlicesDataset(data=data, column_names=[\"column_1\"])\n", "# Remove output files left by a previous run before saving\n", "if os.path.exists(\"./train_dataset.mindrecord\"):\n", "    os.remove(\"./train_dataset.mindrecord\")\n", "if os.path.exists(\"./train_dataset.mindrecord.db\"):\n", "    os.remove(\"./train_dataset.mindrecord.db\")\n", "dataset.save(\"./train_dataset.mindrecord\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "### Data Transformation\n", "\n", "#### Common Data Transformations\n", "\n", "Use ```.map(...)``` to transform samples, ```.filter(...)``` to filter samples, ```.project(...)``` to reorder and select columns, ```.rename(...)``` to rename specified columns, ```.shuffle(...)``` to shuffle data within a buffer of the given size, ```.skip(...)``` to skip the first n samples of the dataset, and ```.take(...)``` to read only the first n samples. The following focuses on how to use ```.map(...)```:\n", "\n", "- Use the data transformation operations provided by Dataset in ```.map(...)```\n", "\n", "  Dataset provides a rich set of data transformation operations ([list](https://www.mindspore.cn/docs/zh-CN/r2.4.10/api_python/mindspore.dataset.transforms.html#)), which can be used directly in ```.map(...)```. For usage details, see 
[map transformation operations](https://www.mindspore.cn/tutorials/zh-CN/r2.4.10/beginner/dataset.html#%E5%86%85%E7%BD%AE%E6%95%B0%E6%8D%AE%E5%8F%98%E6%8D%A2%E6%93%8D%E4%BD%9C).\n", "\n", "- Use custom data transformation operations in ```.map(...)```\n", "\n", "  Dataset also supports user-defined data transformation operations: simply pass the user-defined function to ```.map(...)```. For usage details, see [custom map transformation operations](https://www.mindspore.cn/tutorials/zh-CN/r2.4.10/beginner/dataset.html#%E8%87%AA%E5%AE%9A%E4%B9%89%E6%95%B0%E6%8D%AE%E5%8F%98%E6%8D%A2%E6%93%8D%E4%BD%9C).\n", "\n", "- Return Dict data from ```.map(...)```\n", "\n", "  Dataset also allows user-defined data transformation operations to return a Dict data structure, which makes custom transformations more flexible. For usage details, see [custom map transformation operations on dictionary objects](https://www.mindspore.cn/docs/zh-CN/r2.4.10/model_train/dataset/python_objects.html#%E8%87%AA%E5%AE%9A%E4%B9%89map%E5%A2%9E%E5%BC%BA%E6%93%8D%E4%BD%9C%E5%A4%84%E7%90%86%E5%AD%97%E5%85%B8%E5%AF%B9%E8%B1%A1).\n", "\n", "#### Automatic Data Augmentation\n", "\n", "Besides the common data transformations above, Dataset also provides an automatic data transformation mode that transforms images automatically based on specific policies. For details, see [Automatic Data Augmentation](https://www.mindspore.cn/docs/zh-CN/r2.4.10/model_train/dataset/augment.html).\n", "\n", "### Data Batch\n", "\n", "Dataset provides the ```.batch(...)``` operation, which conveniently groups transformed samples into batches.\n", "\n", "1. Default ```.batch(...)``` operation: groups batch_size samples into data of shape (batch_size, ...). For details, see [batch operation](https://www.mindspore.cn/tutorials/zh-CN/r2.4.10/beginner/dataset.html#%E6%95%B0%E6%8D%AEbatch);\n", "\n", "2. Custom ```.batch(..., per_batch_map, ...)``` operation: lets users take multiple samples [np.ndarray, np.ndarray, ...] 
and organize them into a batch with custom logic. For details, see [custom batch operation](https://www.mindspore.cn/docs/zh-CN/r2.4.10/model_train/dataset/python_objects.html#batch%E6%93%8D%E4%BD%9C%E5%A4%84%E7%90%86%E5%AD%97%E5%85%B8%E5%AF%B9%E8%B1%A1).\n", "\n", "### Dataset Iterators\n", "\n", "After defining the Dataset pipeline ```dataset loading (xxDataset) -> data processing (.map) -> data batch (.batch)```, use the iterator methods ```.create_dict_iterator(...)``` / ```.create_tuple_iterator(...)``` to output the data in a loop. For usage details, see [Dataset Iterators](https://www.mindspore.cn/tutorials/zh-CN/r2.4.10/beginner/dataset.html#%E6%95%B0%E6%8D%AE%E9%9B%86%E8%BF%AD%E4%BB%A3%E5%99%A8).\n", "\n", "### Performance Optimization\n", "\n", "#### Data Processing Performance Optimization\n", "\n", "When the data processing Pipeline is not fast enough, see [Data Processing Performance Optimization](https://www.mindspore.cn/docs/zh-CN/r2.4.10/model_train/dataset/optimize.html) to tune it further and meet the end-to-end training performance requirements.\n", "\n", "#### Single-Node Data Cache\n", "\n", "In addition, for inference scenarios that need the best possible performance, [Single-Node Data Cache](https://www.mindspore.cn/docs/zh-CN/r2.4.10/model_train/dataset/cache.html) can cache the dataset in local memory to speed up dataset reading and preprocessing." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Lightweight Data Processing Mode\n", "\n", "Users can apply a data transformation operation directly to a single piece of data; the return value is the result of the transformation.\n", "\n", "Data transformation operations ([vision transforms](https://www.mindspore.cn/docs/zh-CN/r2.4.10/api_python/mindspore.dataset.transforms.html#%E8%A7%86%E8%A7%89), [nlp transforms](https://www.mindspore.cn/docs/zh-CN/r2.4.10/api_python/mindspore.dataset.transforms.html#%E6%96%87%E6%9C%AC), [audio transforms](https://www.mindspore.cn/docs/zh-CN/r2.4.10/api_python/mindspore.dataset.transforms.html#%E9%9F%B3%E9%A2%91)) can be called directly like ordinary functions. The typical usage is to first initialize the transformation object, then call it with the data to process and get the result back." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Downloading data from https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/banana.jpg (17 kB)\n", "\n", "file_sizes: 100%|██████████████████████████| 17.1k/17.1k [00:00<00:00, 8.55MB/s]\n", "Successfully downloaded file to ./banana.jpg\n" ] }, { "data": { "text/plain": [ "'./banana.jpg'" ] }, "execution_count": 32, "metadata": {}, 
"output_type": "execute_result" } ], "source": [ "from download import download\n", "from PIL import Image\n", "import mindspore.dataset.vision as vision\n", "\n", "url = \"https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/banana.jpg\"\n", "download(url, './banana.jpg', replace=True)" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Image.type: <class 'PIL.Image.Image'>, Image.shape: (356, 200)\n" ] } ], "source": [ "# Note: PIL's Image.size is (width, height)\n", "img_ori = Image.open(\"banana.jpg\").convert(\"RGB\")\n", "print(\"Image.type: {}, Image.shape: {}\".format(type(img_ori), img_ori.size))" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Image.type: <class 'PIL.Image.Image'>, Image.shape: (569, 320)\n" ] } ], "source": [ "# Apply Resize to the input eagerly; an int size scales the shorter edge to 320\n", "resize_op = vision.Resize(size=(320))\n", "img = resize_op(img_ori)\n", "print(\"Image.type: {}, Image.shape: {}\".format(type(img), img.size))" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "For more examples, see [Lightweight Data Processing](https://www.mindspore.cn/docs/zh-CN/r2.4.10/model_train/dataset/eager.html)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.7.5 64-bit", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.5" }, "vscode": { "interpreter": { "hash": "5109d816b82be14675a6b11f8e0f0d2e80f029176ed3710d54e125caa8520dfd" } } }, "nbformat": 4, "nbformat_minor": 4 }