Application of Single-Node Tensor Cache
`Ascend` `GPU` `CPU` `Data Preparation`
Overview
If you need to repeatedly access remote datasets or read datasets from disks, you can use the single-node cache operator to cache datasets in the local memory to accelerate dataset reading.
This tutorial demonstrates how to use the single-node cache service and presents several best practices for using the cache to improve the performance of network training and evaluation.
Quick Start
Configuring the Environment
Before using the cache service, you need to install MindSpore and set related environment variables. The Conda environment is used as an example. The setting method is as follows:
```bash
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:{path_to_conda}/envs/{your_env_name}/lib/python3.7/site-packages/mindspore:{path_to_conda}/envs/{your_env_name}/lib/python3.7/site-packages/mindspore/lib
export PATH=$PATH:{path_to_conda}/envs/{your_env_name}/bin
```
Starting the Cache Server
Before using the single-node cache service, you need to start the cache server.
```text
$ cache_admin --start
Cache server startup completed successfully!
The cache server daemon has been created as process id 10394 and is listening on port 50052
Recommendation: Since the server is detached into its own daemon process, monitor the server logs (under /tmp/mindspore/cache/log) for any issues that may happen after startup
```
If the system displays a message indicating that the `libpython3.7m.so.1.0` file cannot be found, search for its path in the virtual environment and set the environment variable:

```bash
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:{path_to_conda}/envs/{your_env_name}/lib
```
Creating a Cache Session
If no cache session exists on the cache server, a cache session needs to be created to obtain the cache session ID.
```text
$ cache_admin -g
Session created for server on port 50052: 1493732251
```
The cache session ID is randomly allocated by the server.
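If you prefer to create the session from Python instead of copying the ID by hand, a minimal sketch is shown below; it assumes `cache_admin` is on the `PATH` and the cache server is already running, and the output-parsing logic is illustrative:

```python
import re
import subprocess

# Create a cache session and capture its id; `cache_admin -g` prints e.g.
# "Session created for server on port 50052: 1493732251".
result = subprocess.run(["cache_admin", "-g"], capture_output=True, text=True, check=True)
match = re.search(r"(\d+)\s*$", result.stdout)
if match is None:
    raise RuntimeError("Unexpected cache_admin output: " + result.stdout)
session_id = int(match.group(1))
print("cache session id:", session_id)
```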
Creating a Cache Instance
Create a Python script `my_training_script.py`, use the `DatasetCache` API to define a cache instance named `some_cache` in the script, and set its `session_id` parameter to the cache session ID created in the previous step:

```python
import mindspore.dataset as ds

some_cache = ds.DatasetCache(session_id=1493732251, size=0, spilling=False)
```
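Here `size=0` means the cache instance sets no explicit upper limit on memory usage, and `spilling=False` keeps all cached rows in memory. As a hedged illustration (assuming `size` is given in megabytes and that the cache server has a spilling directory configured), a bounded cache could look like:

```python
import mindspore.dataset as ds

# Illustrative only: cap cache memory at 4096 MB and allow rows that do not
# fit to be spilled to the server's configured spilling directory.
bounded_cache = ds.DatasetCache(session_id=1493732251, size=4096, spilling=True)
```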
Inserting a Cache Instance
The following uses the CIFAR-10 dataset as an example. Before running the sample, download and store the CIFAR-10 dataset by referring to Loading Dataset. The directory structure is as follows:
```text
├─my_training_script.py
└─cifar-10-batches-bin
    ├── batches.meta.txt
    ├── data_batch_1.bin
    ├── data_batch_2.bin
    ├── data_batch_3.bin
    ├── data_batch_4.bin
    ├── data_batch_5.bin
    ├── readme.html
    └── test_batch.bin
```
To cache the augmented data produced by the `map` operation, pass the created `some_cache` instance to the `cache` parameter of the `map` operation:

```python
import mindspore.dataset.vision.c_transforms as c_vision

dataset_dir = "cifar-10-batches-bin/"
data = ds.Cifar10Dataset(dataset_dir=dataset_dir, num_samples=5, shuffle=False, num_parallel_workers=1)

# apply cache to map
rescale_op = c_vision.Rescale(1.0 / 255.0, -1.0)
data = data.map(input_columns=["image"], operations=rescale_op, cache=some_cache)

num_iter = 0
for item in data.create_dict_iterator(num_epochs=1):
    # each item is a dictionary; in this example it has a key "image"
    print("{} image shape: {}".format(num_iter, item["image"].shape))
    num_iter += 1
```
Run the Python script `my_training_script.py`. The following information is displayed:

```text
0 image shape: (32, 32, 3)
1 image shape: (32, 32, 3)
2 image shape: (32, 32, 3)
3 image shape: (32, 32, 3)
4 image shape: (32, 32, 3)
```
You can run the `cache_admin --list_sessions` command to check whether there are five data records in the current session. If yes, the data is successfully cached.

```text
$ cache_admin --list_sessions
Listing sessions for server on port 50052

     Session    Cache Id  Mem cached  Disk cached  Avg cache size  Numa hit
  1493732251  3618046178           5         n/a           12442          5
```
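Besides caching the output of a `map` operation, a cache instance can also be passed directly to a leaf dataset such as `Cifar10Dataset`, in which case the raw records are cached before any augmentation. A minimal sketch (the session ID is hypothetical; each pipeline should use its own cache session):

```python
# Hypothetical session id obtained from another `cache_admin -g` call.
source_cache = ds.DatasetCache(session_id=1493732251, size=0, spilling=False)

# Cache the raw CIFAR-10 records; augmentations applied by later map
# operations are recomputed on top of the cached data every epoch.
data = ds.Cifar10Dataset(dataset_dir="cifar-10-batches-bin/", num_samples=5,
                         shuffle=False, num_parallel_workers=1, cache=source_cache)
```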
Destroying a Cache Session
After the training is complete, you can destroy the current cache and release the memory.
```text
$ cache_admin --destroy_session 1493732251
Drop session successfully for server on port 50052
```
The preceding command destroys the cache whose session ID is 1493732251.
Stopping the Cache Server
After using the cache server, you can stop the cache server. This operation will destroy all cache sessions on the current server and release the memory.
```text
$ cache_admin --stop
Cache server on port 50052 has been stopped successfully.
```
Best Practices
Using Cache to Speed Up ResNet Evaluation During Training
For a complex network, training usually requires dozens or even hundreds of epochs, and it is difficult to predict in advance how many epochs the model will need to reach the required accuracy. Therefore, the accuracy of the model is usually validated at a fixed epoch interval during training and the corresponding model is saved. After training completes, users can quickly select the optimal model by examining how the accuracy of the saved models evolved.
Therefore, the performance of evaluation during training has a great impact on the total end-to-end time. In this section, we show an example of leveraging the cache service to cache the fully augmented data as Tensors in memory and thereby speed up the evaluation procedure.
The inference data processing pipeline usually contains no random operations. For example, the dataset processing in ResNet50 evaluation contains only augmentations such as `Decode`, `Resize`, `CenterCrop`, `Normalize`, `HWC2CHW`, and `TypeCast`. Therefore, it is usually best to inject the cache after the last augmentation step and directly cache the fully augmented data, which minimizes repeated computation and yields the largest performance benefit. In this section, we follow this approach and take ResNet as an example, as illustrated in the sketch below.
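The following sketch shows this injection point; it loosely follows the ResNet50 eval pipeline, but the dataset path, image sizes, and session ID are illustrative assumptions rather than the exact ModelZoo code:

```python
import mindspore.common.dtype as mstype
import mindspore.dataset as ds
import mindspore.dataset.transforms.c_transforms as c_transforms
import mindspore.dataset.vision.c_transforms as c_vision

eval_cache = ds.DatasetCache(session_id=1493732251, size=0)  # hypothetical session id

data_set = ds.ImageFolderDataset("/path/to/imagenet/val", num_parallel_workers=8, shuffle=False)
trans = [c_vision.Decode(),
         c_vision.Resize(256),
         c_vision.CenterCrop(224),
         c_vision.Normalize(mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
                            std=[0.229 * 255, 0.224 * 255, 0.225 * 255]),
         c_vision.HWC2CHW()]
data_set = data_set.map(operations=trans, input_columns="image", num_parallel_workers=8)

# Inject the cache at the last map of the pipeline, so the rows being cached
# are the fully augmented tensors rather than the raw images.
type_cast_op = c_transforms.TypeCast(mstype.int32)
data_set = data_set.map(operations=type_cast_op, input_columns="label",
                        num_parallel_workers=8, cache=eval_cache)
```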
For the complete sample code, please refer to ResNet in ModelZoo.
Create a Shell script named `cache_util.sh` for cache management:

```bash
bootup_cache_server()
{
  echo "Booting up cache server..."
  result=$(cache_admin --start 2>&1)
  echo "${result}"
}

generate_cache_session()
{
  result=$(cache_admin -g | awk 'END {print $NF}')
  echo "${result}"
}
```
Complete sample code: cache_util.sh
In the Shell script that starts the distributed training, i.e., `run_distributed_train.sh`, start a cache server for the evaluation-during-training scenario and generate a cache session, saving its ID in the `CACHE_SESSION_ID` Shell variable:

```bash
source cache_util.sh

if [ "x${RUN_EVAL}" == "xTrue" ]
then
  bootup_cache_server
  CACHE_SESSION_ID=$(generate_cache_session)
fi
```
Pass `CACHE_SESSION_ID`, along with the other arguments, when starting the Python training script:

```bash
python train.py \
--net=$1 \
--dataset=$2 \
--run_distribute=True \
--device_num=$DEVICE_NUM \
--dataset_path=$PATH2 \
--run_eval=$RUN_EVAL \
--eval_dataset_path=$EVAL_DATASET_PATH \
--enable_cache=True \
--cache_session_id=$CACHE_SESSION_ID \
&> log &
```
In the Python training script `train.py`, use the following code to receive the `cache_session_id` that is passed in and use it when defining the eval dataset `eval_dataset`:

```python
import argparse
import ast

parser = argparse.ArgumentParser()
parser.add_argument('--enable_cache', type=ast.literal_eval, default=False,
                    help='Caching the eval dataset in memory to speedup evaluation, default is False.')
parser.add_argument('--cache_session_id', type=str, default="",
                    help='The session id for cache service.')
args_opt = parser.parse_args()

# config and target come from the surrounding training script
eval_dataset = create_dataset(
    dataset_path=args_opt.eval_dataset_path, do_train=False,
    batch_size=config.batch_size, target=target,
    enable_cache=args_opt.enable_cache,
    cache_session_id=args_opt.cache_session_id)
```
In the Python script `dataset.py`, which creates the dataset processing pipeline, create a `DatasetCache` instance according to the `enable_cache` and `cache_session_id` arguments, and inject the cache instance after the last data augmentation step, i.e., after `TypeCast`:

```python
def create_dataset2(dataset_path, do_train, repeat_num=1, batch_size=32, target="Ascend",
                    distribute=False, enable_cache=False, cache_session_id=None):
    ...
    if enable_cache:
        if not cache_session_id:
            raise ValueError("A cache session_id must be provided to use cache.")
        eval_cache = ds.DatasetCache(session_id=int(cache_session_id), size=0)
        data_set = data_set.map(operations=type_cast_op, input_columns="label",
                                num_parallel_workers=8, cache=eval_cache)
    else:
        data_set = data_set.map(operations=type_cast_op, input_columns="label",
                                num_parallel_workers=8)
```
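For context, evaluation during training traverses `eval_dataset` end to end once per evaluation run, typically from an epoch-end callback. The simplified sketch below (class and parameter names are illustrative, not the exact ModelZoo code) shows why the same dataset is read many times and thus benefits from caching:

```python
from mindspore.train.callback import Callback

class EvalCallBack(Callback):
    """Illustrative callback: evaluate the model at a fixed epoch interval."""
    def __init__(self, model, eval_dataset, eval_start_epoch=40, eval_interval=1):
        self.model = model
        self.eval_dataset = eval_dataset
        self.eval_start_epoch = eval_start_epoch
        self.eval_interval = eval_interval

    def epoch_end(self, run_context):
        cur_epoch = run_context.original_args().cur_epoch_num
        if cur_epoch >= self.eval_start_epoch and \
                (cur_epoch - self.eval_start_epoch) % self.eval_interval == 0:
            # Each call re-reads eval_dataset in full; after the first
            # evaluation, the cached tensors are served from memory.
            acc = self.model.eval(self.eval_dataset)
            print("epoch: {}, acc: {}".format(cur_epoch, acc))
```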
Execute the training script:
```text
...
epoch: 40, acc: 0.5665486653645834, eval_cost:30.54
epoch: 41, acc: 0.6212361653645834, eval_cost:2.80
epoch: 42, acc: 0.6523844401041666, eval_cost:3.77
...
```
By default, the evaluation starts after the 40th epoch, and `eval_cost` shows the time each evaluation run takes, in seconds.

The following table compares the average evaluation time with/without cache:
| | without cache | with cache |
| -------------------------- | ------------- | ---------- |
| 4p, resnet50, imagenet2012 | 10.59s | 3.62s |
On an Ascend machine with 4 parallel pipelines, each training epoch generally takes around 88 seconds, and ResNet training usually requires 90 epochs. With roughly 50 evaluation runs each saving about 7 seconds (10.59s − 3.62s), using cache shortens the total end-to-end time from about 8449 seconds to 8101 seconds, a total time reduction of 348 seconds.
After the training run is completed, you can stop the cache server. This destroys all cache sessions on the server and releases the memory:

```text
$ cache_admin --stop
Cache server on port 50052 has been stopped successfully.
```
Using Cache to Speed Up Training with Datasets on NFS
To share a large dataset across multiple servers, many users store their datasets on NFS (Network File System). (Please check Huawei Cloud - Creating an NFS Shared Directory on ECS for how to set up and configure an NFS server.)
However, because accessing NFS is usually expensive, training with a dataset located on NFS is relatively slow. To improve training performance in this scenario, we can leverage the cache service to cache the dataset in memory in Tensor form. Once cached, subsequent training epochs can read the data directly from memory, avoiding costly remote dataset access.
Note that, typically, random operations such as `RandomCropDecodeResize` are performed in the dataset processing procedure after the dataset is read. Injecting the cache after these random operations would freeze their outputs, losing the per-epoch randomness of the data and therefore affecting the final accuracy. As a result, we choose to cache the source dataset directly. In this section, we follow this approach and take MobileNetV2 as an example, as sketched below.
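To make the injection point concrete, here is a minimal sketch (the dataset path, crop size, and session ID are illustrative) that caches the source dataset while keeping the random augmentation outside the cache:

```python
import mindspore.dataset as ds
import mindspore.dataset.vision.c_transforms as c_vision

nfs_dataset_cache = ds.DatasetCache(session_id=1493732251, size=0)  # hypothetical session id

# Cache the raw records read from NFS so later epochs avoid remote access ...
data_set = ds.ImageFolderDataset("/nfs/imagenet/train", num_parallel_workers=8,
                                 shuffle=True, cache=nfs_dataset_cache)
# ... but leave the random augmentation uncached, so each epoch still sees
# different crops; caching after this map would freeze the randomness.
data_set = data_set.map(operations=c_vision.RandomCropDecodeResize(224),
                        input_columns="image", num_parallel_workers=8)
```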
For the complete sample code, please refer to MobileNetV2 in ModelZoo.
Create a Shell script named `cache_util.sh` for cache management:

```bash
bootup_cache_server()
{
  echo "Booting up cache server..."
  result=$(cache_admin --start 2>&1)
  echo "${result}"
}

generate_cache_session()
{
  result=$(cache_admin -g | awk 'END {print $NF}')
  echo "${result}"
}
```
Complete sample code: cache_util.sh
In the Shell script that starts the distributed training with the NFS dataset, i.e., `run_train_nfs_cache.sh`, start a cache server for the scenario where the dataset resides on NFS, then generate a cache session and save its ID in the `CACHE_SESSION_ID` Shell variable:

```bash
source cache_util.sh

bootup_cache_server
CACHE_SESSION_ID=$(generate_cache_session)
```
Pass `CACHE_SESSION_ID`, along with the other arguments, when starting the Python training script:

```bash
python train.py \
--platform=$1 \
--dataset_path=$5 \
--pretrain_ckpt=$PRETRAINED_CKPT \
--freeze_layer=$FREEZE_LAYER \
--filter_head=$FILTER_HEAD \
--enable_cache=True \
--cache_session_id=$CACHE_SESSION_ID \
&> log$i.log &
```
In the `train_parse_args()` function of the Python argument-parsing script `args.py`, use the following code to receive the `cache_session_id` that is passed in:

```python
import argparse
import ast

def train_parse_args():
    ...
    train_parser.add_argument('--enable_cache', type=ast.literal_eval, default=False,
                              help='Caching the dataset in memory to speedup dataset processing, default is False.')
    train_parser.add_argument('--cache_session_id', type=str, default="",
                              help='The session id for cache service.')
    train_args = train_parser.parse_args()
    return train_args
```
In the Python training script `train.py`, call `train_parse_args()` to parse the arguments passed in, such as `cache_session_id`, and use it when defining the training dataset:

```python
from src.args import train_parse_args

args_opt = train_parse_args()

# config comes from the surrounding training script
dataset = create_dataset(
    dataset_path=args_opt.dataset_path, do_train=True, config=config,
    enable_cache=args_opt.enable_cache,
    cache_session_id=args_opt.cache_session_id)
```
In the Python script `dataset.py`, which creates the dataset processing pipeline, create a `DatasetCache` instance according to the `enable_cache` and `cache_session_id` arguments, and inject the cache instance directly after the `ImageFolderDataset`:

```python
def create_dataset(dataset_path, do_train, config, repeat_num=1,
                   enable_cache=False, cache_session_id=None):
    ...
    if enable_cache:
        nfs_dataset_cache = ds.DatasetCache(session_id=int(cache_session_id), size=0)
    else:
        nfs_dataset_cache = None

    if config.platform == "Ascend":
        rank_size = int(os.getenv("RANK_SIZE", '1'))
        rank_id = int(os.getenv("RANK_ID", '0'))
        if rank_size == 1:
            data_set = ds.ImageFolderDataset(dataset_path, num_parallel_workers=8,
                                             shuffle=True, cache=nfs_dataset_cache)
        else:
            data_set = ds.ImageFolderDataset(dataset_path, num_parallel_workers=8, shuffle=True,
                                             num_shards=rank_size, shard_id=rank_id,
                                             cache=nfs_dataset_cache)
```
Execute the training run via `run_train_nfs_cache.sh`:

```text
epoch: [  0/ 200], step:[ 2134/ 2135], loss:[4.682/4.682], time:[3364893.166], lr:[0.780]
epoch time: 3384387.999, per step time: 1585.193, avg loss: 4.682
epoch: [  1/ 200], step:[ 2134/ 2135], loss:[3.750/3.750], time:[430495.242], lr:[0.724]
epoch time: 431005.885, per step time: 201.876, avg loss: 4.286
epoch: [  2/ 200], step:[ 2134/ 2135], loss:[3.922/3.922], time:[420104.849], lr:[0.635]
epoch time: 420669.174, per step time: 197.035, avg loss: 3.534
epoch: [  3/ 200], step:[ 2134/ 2135], loss:[3.581/3.581], time:[420825.587], lr:[0.524]
epoch time: 421494.842, per step time: 197.421, avg loss: 3.417
...
```
The following table compares the average epoch time with/without cache:
| 4p, MobileNetV2, imagenet2012 | without cache | with cache |
| ------------------------------------------ | ------------- | ---------- |
| first epoch time | 1649s | 3384s |
| average epoch time (excluding first epoch) | 458s | 421s |
With cache, the first epoch time increases significantly due to the overhead of writing to the cache, but every later epoch benefits from reading the dataset directly from memory. Therefore, the more epochs there are, the larger the benefit of the cache, thanks to the per-step time savings.
MobileNetV2 generally requires 200 epochs in total. The 199 later epochs each save about 37 seconds (roughly 7363 seconds in total), which outweighs the 1735-second increase in the first epoch. Using cache therefore shortens the total end-to-end time from 92791 seconds to 87163 seconds, a total time reduction of 5628 seconds.
After the training run is completed, you can stop the cache server. This destroys all cache sessions on the server and releases the memory:

```text
$ cache_admin --stop
Cache server on port 50052 has been stopped successfully.
```