Incremental Operator Build
Overview
When a network model is executed, MindSpore builds the operators it uses. The time consumed in this stage increases with the scale of the network model. To improve the performance of subsequent executions of the same model, an incremental operator build mechanism is provided. When MindSpore executes a network model, a rank_0/kernel_meta folder is generated by default in the directory where the execution is performed. During execution, the operator cache files (in .o, .info, or .json format) generated during the network build are saved to this directory. If you execute the same network model again, or only part of the model changes, MindSpore automatically reuses the matching operator cache files in the rank_0/kernel_meta folder, which significantly reduces the network build time and improves execution performance. Currently, the incremental operator build function can be used only on Ascend AI chips.
The following demonstrates how to use the incremental operator build function.
Usage
Incremental operator build is enabled by default in MindSpore and does not need to be controlled by the user. The following uses a simple network model, test_square.py, located in the src directory as an example. The current directory structure is as follows:
└─src
    └── test_square.py
Execute the following test case:
import numpy as np
import mindspore.nn as nn
import mindspore.context as context
import mindspore.ops as ops
from mindspore import Tensor

context.set_context(mode=context.GRAPH_MODE, device_target="Ascend")

class Net(nn.Cell):
    def __init__(self):
        super(Net, self).__init__()
        self.square = ops.Square()

    def construct(self, data):
        return self.square(data)

def test_net():
    x = np.array([1.0, 4.0, 9.0]).astype(np.float32)
    square = Net()
    output = square(Tensor(x))
    print("x: ", x)
    print("output: ", output)
The network model consists of a single operator, Square, and the output is the square of the input. The command output is as follows:
x: [1. 4. 9.]
output: [1. 16. 81.]
The rank_0/kernel_meta folder is generated in the directory where the execution is performed, containing the .o, .json, and .info files of the Square operator. The current directory structure is as follows:
└─src
    ├── test_square.py
    └── rank_0
        └── kernel_meta
            ├── square_12484080525657478220_2.info
            ├── square_12484080525657478220_2.json
            └── square_12484080525657478220_2.o
For an operator:

The .o file is an executable file generated by MindSpore for the operator during network model execution.

The .info file records all valid information about the operator, including the operator name, attributes, input and output formats, and input and output data types. The .info file is used to look up and determine whether the .o file of the operator can be reused. The details are as follows:
{
"job_content": {
"SocInfo": {
"autoTilingMode": "NO_TUNE",
"coreNum": "",
"coreType": "",
"deviceId": "2",
"l1Fusion": "false",
"l2Fusion": "false",
"l2Mode": "2",
"mdl_bank_path": "",
"offlineTune": false,
"op_bank_path": "",
"op_bank_update": false,
"op_debug_dir": "./rank_0/",
"op_debug_level": "0",
"op_impl_mode": "",
"op_impl_mode_list": [],
"socVersion": "Ascend910A",
"vector_fp_ceiling": ""
},
"full_name": "Default/Square-op1",
"fusion_op_name": "square_12484080525657478220_2",
"graph_name": "",
"l1_size": -1,
"op_list": [
{
"name": "x_0",
"output_desc": [
{
"L1_addr_offset": 0,
"L1_fusion_type": -1,
"L1_workspace_size": -1,
"addr_type": 0,
"data_type": "float32",
"dtype": "float32",
"format": "ND",
"name": "x_0",
"ori_format": "NCHW",
"ori_shape": [
3
],
"param_type": "required",
"range": [
[
3,
3
]
],
"shape": [
3
],
"slice_offset": [],
"split_index": 0,
"total_shape": [],
"valid": true,
"valid_shape": []
}
],
"type": "Data"
},
{
"build_type": "accurately_build",
"dynamic_compile_static": false,
"func_name": "square",
"input_desc": [
{
"L1_addr_offset": 0,
"L1_fusion_type": -1,
"L1_workspace_size": -1,
"addr_type": 0,
"data_type": "float32",
"dtype": "float32",
"format": "ND",
"name": "x_0",
"ori_format": "NCHW",
"ori_shape": [
3
],
"param_type": "required",
"range": [
[
3,
3
]
],
"shape": [
3
],
"slice_offset": [],
"split_index": 0,
"total_shape": [],
"valid": true,
"valid_shape": []
}
],
"int64mode": false,
"max_kernel_id": 10,
"miss_support_info": "",
"module_name": "impl.square",
"name": "Default/Square-op1",
"op_name": "square_12484080525657478220_2",
"ori_name": [
"Default/Square-op1"
],
"output_data_desc": [
{
"L1_addr_offset": 0,
"L1_fusion_type": -1,
"L1_workspace_size": -1,
"addr_type": 0,
"data_type": "float32",
"dtype": "float32",
"format": "ND",
"ori_format": "NCHW",
"ori_shape": [
3
],
"param_type": "required",
"range": [
[
3,
3
]
],
"shape": [
3
],
"slice_offset": [],
"split_index": 0,
"total_shape": [],
"valid": true,
"valid_shape": []
}
],
"output_desc": [
{
"L1_addr_offset": 0,
"L1_fusion_type": -1,
"L1_workspace_size": -1,
"addr_type": 0,
"data_type": "float32",
"dtype": "float32",
"format": "ND",
"name": "y",
"ori_format": "NCHW",
"ori_shape": [
3
],
"param_type": "required",
"range": [
[
3,
3
]
],
"shape": [
3
],
"slice_offset": [],
"split_index": 0,
"total_shape": [],
"valid": true,
"valid_shape": []
}
],
"pattern": "Opaque",
"py_module_path": "/usr/local/Ascend/opp/op_impl/built-in/ai_core/tbe",
"type": "Square",
"unknown_shape": false
}
],
"scope_id": -1
},
"job_id": 1,
"job_type": "Compile",
"source_id": 2
}
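Because the .info file is plain JSON, it can be inspected with standard tools when debugging cache reuse. Below is a minimal sketch that pulls the kernel file stem and the operator types out of an .info-style record. The summarize_info helper and the trimmed sample are illustrative only, not part of the MindSpore API, and the exact schema may vary across MindSpore versions.

```python
import json

# A trimmed .info-style record, modeled on the listing above.
# Field names follow the example; the exact schema may differ
# between MindSpore/CANN versions.
sample_info = {
    "job_content": {
        "fusion_op_name": "square_12484080525657478220_2",
        "full_name": "Default/Square-op1",
        "op_list": [
            {"name": "x_0", "type": "Data"},
            {
                "type": "Square",
                "func_name": "square",
                "module_name": "impl.square",
                "input_desc": [{"dtype": "float32", "shape": [3]}],
            },
        ],
    },
    "job_type": "Compile",
}

def summarize_info(info):
    """Return (kernel file stem, operator types) from an .info record."""
    content = info["job_content"]
    # "Data" entries describe inputs, not compiled operators, so skip them.
    ops = [op["type"] for op in content["op_list"] if op["type"] != "Data"]
    return content["fusion_op_name"], ops

stem, ops = summarize_info(sample_info)
print(stem)  # square_12484080525657478220_2
print(ops)   # ['Square']
```

The fusion_op_name matches the stem of the .o, .json, and .info file names in kernel_meta, which is how the three cache files for one operator are associated.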
The .json file stores the operator build result, which is used at runtime. The details are as follows:
{
"batchBindOnly": 1,
"binFileName": "square_12484080525657478220_2",
"binFileSuffix": ".o",
"blockDim": 1,
"build_result": "",
"kernelName": "square_12484080525657478220_2__kernel0",
"magic": "RT_DEV_BINARY_MAGIC_ELF",
"opParaSize": 0,
"parameters": [
0,
0
],
"sha256": "38ec670e4536958a70a653a0f3bbc7a5aadf66b5fd2b6cfe5379964668929797"
}
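The sha256 field can serve as an integrity check on the cached kernel. The sketch below assumes the digest covers the .o binary (an assumption, not stated above) and shows how to recompute and compare such a digest with Python's standard library, demonstrated on a stand-in file rather than a real cache entry:

```python
import hashlib
import tempfile
from pathlib import Path

def file_sha256(path):
    """Hex SHA-256 digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Stand-in for a cache entry: write a fake .o file and a matching
# .json-style record, then check that the recorded digest matches.
with tempfile.TemporaryDirectory() as d:
    o_path = Path(d) / "square_12484080525657478220_2.o"
    o_path.write_bytes(b"\x7fELF fake kernel binary")
    record = {
        "binFileName": o_path.stem,
        "binFileSuffix": ".o",
        "sha256": file_sha256(o_path),  # assumed to digest the .o binary
    }
    assert record["sha256"] == file_sha256(o_path)
    print("digest matches:", record["sha256"][:16], "...")
```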
After the preceding three types of operator cache files are generated, incremental operator build takes effect the next time you execute the network model: only new or modified operators are built, which greatly improves the network build performance.
FAQs
- Cache files cannot be shared across different scenarios, such as multi-device and single-device scenarios, or training and inference scenarios.

- rank_0 is the default value used when the environment variable RANK_ID is empty. If RANK_ID is not empty, for example RANK_ID=3, the path rank_3/kernel_meta is generated instead.

- The location of kernel_meta can be specified by the environment variable MS_COMPILER_CACHE_PATH. For example, with export MS_COMPILER_CACHE_PATH=/home/xxx/ and export RANK_ID=2, the operator build cache files are saved in /home/xxx/rank_2/kernel_meta/.

- When multiple devices are running, a rank_{ID}/kernel_meta folder is generated in each device directory when the network model is executed, where ID is the value of the environment variable RANK_ID.

- Note that when multiple devices are running, if the operator cache files in rank_{ID}/kernel_meta of some devices are deleted and the same network model is executed again, the devices that do not need to rebuild may time out while waiting, causing the execution to fail. In this case, you can increase the environment variable HCCL_CONNECT_TIMEOUT, that is, the waiting time between devices, to avoid the failure. However, the wait then takes about as long as deleting the caches on all devices and rebuilding everything.

- If the process is interrupted during network building, the cache files being generated in rank_0/kernel_meta may be corrupted, causing subsequent re-executions to fail. In this case, delete the rank_0/kernel_meta folder and rebuild the network.
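The environment-variable behavior described above can be summarized in a short shell sketch. The exported path reuses the placeholder from the text; only the resulting directory layout is being illustrated here:

```shell
# Choose where the operator cache is written (no trailing slash, so the
# path components join cleanly below).
export MS_COMPILER_CACHE_PATH=/home/xxx
export RANK_ID=2

# MindSpore will then write cache files under:
CACHE_DIR="${MS_COMPILER_CACHE_PATH}/rank_${RANK_ID}/kernel_meta"
echo "${CACHE_DIR}"   # /home/xxx/rank_2/kernel_meta

# To force a full rebuild (see the last item above), remove the folder:
# rm -rf "${CACHE_DIR}"
```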