Applying Single-Node Data Cache
Linux Ascend GPU CPU Data Preparation Intermediate Advanced
Overview
For datasets that need to be accessed remotely or read from disk repeatedly, the single-node cache operator can be used to cache the dataset in local memory, which accelerates dataset reading.
This tutorial demonstrates how to use the single-node cache service to cache data that has been processed with data augmentation.
Configuring the Environment
Before using the cache service, install MindSpore and set the related environment variables. Taking a Conda environment as an example:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:{path_to_conda}/envs/{your_env_name}/lib/python3.7/site-packages/mindspore:{path_to_conda}/envs/{your_env_name}/lib/python3.7/site-packages/mindspore/lib
export PATH=$PATH:{path_to_conda}/envs/{your_env_name}/bin
Starting the Cache Server
Before using the single-node cache service, start the cache server:
$ cache_admin --start
Cache server startup completed successfully!
The cache server daemon has been created as process id 10394 and is listening on port 50052
Recommendation:
Since the server is detached into its own daemon process, monitor the server logs (under /tmp/mindspore/cache/log) for any issues that may happen after startup
If an error indicating that libpython3.7m.so.1.0 cannot be found is reported, search for the file under the virtual environment and set the environment variable:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:{path_to_conda}/envs/{your_env_name}/lib
Creating a Cache Session
If no cache session exists on the cache server, create one to obtain a cache session id:
$ cache_admin -g
Session created for server on port 50052: 1493732251
The cache session id is randomly assigned by the server.
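Because the id is random, a script that sets up its own cache should not rely on a fixed value. The snippet below is a minimal sketch, not part of the MindSpore API: it obtains a session id by running cache_admin -g through subprocess and parsing the output shown above; the helper name create_cache_session is hypothetical.
import re
import subprocess

def create_cache_session():
    # Hypothetical helper: run `cache_admin -g` and parse the session id from
    # output such as "Session created for server on port 50052: 1493732251".
    output = subprocess.run(["cache_admin", "-g"], capture_output=True, text=True, check=True).stdout
    match = re.search(r":\s*(\d+)\s*$", output.strip())
    if match is None:
        raise RuntimeError("unexpected cache_admin output: " + output)
    return int(match.group(1))

session_id = create_cache_session()
print("cache session id:", session_id)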
Creating a Cache Instance
Create a Python script my_training_script.py, use the DatasetCache API in the script to define a cache instance named some_cache, and pass the cache session id created in the previous step to the session_id parameter:
import mindspore.dataset as ds

# session_id is the id returned by `cache_admin -g` in the previous step
some_cache = ds.DatasetCache(session_id=1493732251, size=0, spilling=True)
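In the call above, size=0 and spilling=True follow the tutorial's example. Based on the DatasetCache interface (verify against your installed MindSpore version), size is an upper bound on the memory used by the cache in MB, with 0 meaning no explicit limit, and spilling controls whether rows that exceed the memory budget are spilled to disk. A sketch of a memory-capped cache under that assumption:
# Assumption: size is interpreted in MB and 0 means unlimited; with spilling=False
# there is no disk fallback, so rows beyond the cap are simply not cached.
limited_cache = ds.DatasetCache(session_id=1493732251, size=512, spilling=False)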
Inserting a Cache Instance
The following example uses the CIFAR-10 dataset. Before running the example, download and store the CIFAR-10 dataset by referring to the dataset loading instructions. The directory structure is as follows:
├─my_training_script.py
└─cifar-10-batches-bin
├── batches.meta.txt
├── data_batch_1.bin
├── data_batch_2.bin
├── data_batch_3.bin
├── data_batch_4.bin
├── data_batch_5.bin
├── readme.html
└── test_batch.bin
Continue writing the Python script, and pass the created some_cache instance as the cache parameter when applying a data augmentation operator:
import mindspore.dataset.vision.c_transforms as c_vision
dataset_dir = "cifar-10-batches-bin/"
data = ds.Cifar10Dataset(dataset_dir=dataset_dir, num_samples=5, shuffle=False, num_parallel_workers=1)
# apply cache to map
rescale_op = c_vision.Rescale(1.0 / 255.0, -1.0)
data = data.map(input_columns=["image"], operations=rescale_op, cache=some_cache)
num_iter = 0
for item in data.create_dict_iterator(num_epochs=1):  # each data is a dictionary
    # in this example, each dictionary has a key "image"
    print("{} image shape: {}".format(num_iter, item["image"].shape))
    num_iter += 1
Run the Python script my_training_script.py, and the following output is obtained:
0 image shape: (32, 32, 3)
1 image shape: (32, 32, 3)
2 image shape: (32, 32, 3)
3 image shape: (32, 32, 3)
4 image shape: (32, 32, 3)
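Besides being attached to a map operation, a cache instance can also be passed to the dataset loading operator itself so that the records read from disk are cached before any augmentation. The sketch below assumes that Cifar10Dataset accepts a cache parameter and that a session exists for this pipeline; verify both against your MindSpore version.
import mindspore.dataset as ds

# Assumption: Cifar10Dataset accepts `cache`, caching the loaded records themselves.
# The example session id 1493732251 is reused here; in practice, create a fresh
# session with `cache_admin -g` for each cached pipeline.
load_cache = ds.DatasetCache(session_id=1493732251, size=0, spilling=True)
data = ds.Cifar10Dataset(dataset_dir="cifar-10-batches-bin/", num_samples=5,
                         shuffle=False, num_parallel_workers=1, cache=load_cache)
for item in data.create_dict_iterator(num_epochs=1):
    print("cached-at-load image shape:", item["image"].shape)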
Run the cache_admin --list_sessions command to check that the current session contains five entries of data, which indicates that the data was cached successfully.
$ cache_admin --list_sessions
Listing sessions for server on port 50052
     Session    Cache Id  Mem cached  Disk cached  Avg cache size  Numa hit
  1493732251  3618046178           5          n/a           12442         5
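The benefit of the cache appears when the same pipeline is executed more than once: the first pass fills the cache on the server, and later passes are served from memory. The rough sketch below times two runs of the pipeline defined earlier; with only five samples the difference is negligible, so it only illustrates the mechanics.
import time
import mindspore.dataset as ds
import mindspore.dataset.vision.c_transforms as c_vision

def run_cached_pipeline():
    # Rebuild the same pipeline as above; the cached rows live on the cache server,
    # keyed by the session, so a second run can reuse them.
    cache = ds.DatasetCache(session_id=1493732251, size=0, spilling=True)
    data = ds.Cifar10Dataset(dataset_dir="cifar-10-batches-bin/", num_samples=5,
                             shuffle=False, num_parallel_workers=1)
    data = data.map(input_columns=["image"],
                    operations=c_vision.Rescale(1.0 / 255.0, -1.0),
                    cache=cache)
    start = time.time()
    for _ in data.create_dict_iterator(num_epochs=1):
        pass
    return time.time() - start

print("first pass (fills cache):  {:.4f}s".format(run_cached_pipeline()))
print("second pass (reads cache): {:.4f}s".format(run_cached_pipeline()))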
Destroying the Cache Session
After training is complete, you can destroy the current cache and release the memory:
$ cache_admin --destroy_session 1493732251
Drop session successfully for server on port 50052
The preceding command destroys the cache whose cache session id is 1493732251.
Stopping the Cache Server
After you finish using the cache server, you can stop it. This operation destroys all cache sessions on the current server and releases the memory.
$ cache_admin --stop
Cache server on port 50052 has been stopped successfully.
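If the whole lifecycle should be scripted, the cache_admin commands shown in this tutorial can also be driven from Python. The sketch below uses only the commands demonstrated above; run_training is a hypothetical placeholder for the training script built earlier.
import re
import subprocess

def cache_admin(*args):
    # Thin wrapper around the cache_admin commands demonstrated in this tutorial.
    return subprocess.run(["cache_admin", *args], capture_output=True, text=True, check=True).stdout

def run_training(session_id):
    # Hypothetical placeholder: build the cached pipeline from my_training_script.py here.
    pass

cache_admin("--start")                                 # start the cache server
session_id = int(re.search(r"(\d+)\s*$", cache_admin("-g").strip()).group(1))
try:
    run_training(session_id)
finally:
    cache_admin("--destroy_session", str(session_id))  # release the cached data
    cache_admin("--stop")                              # stop the cache server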