Incremental Operator Compilation
Overview
When executing a network model, MindSpore compiles the operators the model uses, and this compilation time grows with the size of the model. To improve performance when a user executes a model a second time, MindSpore provides an incremental operator compilation mechanism. During execution, MindSpore creates a default rank_0/kernel_meta directory under the execution directory and saves the operator cache files generated by network compilation there, including the .o, .info, and .json files. If the user executes the same network model again, or one with only partial changes, MindSpore automatically reuses the operator cache files under rank_0/kernel_meta, which significantly reduces network compilation time and improves execution performance. Currently, incremental operator compilation is supported only on Ascend AI processors.

This tutorial demonstrates how to use incremental operator compilation.
Usage
Incremental operator compilation is enabled by default in MindSpore; users do not need to control it. Below, we build a simple network example test_square.py under the src directory. The current directory structure is:

└─src
    └── test_square.py

Run the following example:
import numpy as np
import mindspore.nn as nn
import mindspore.context as context
import mindspore.ops as ops
from mindspore import Tensor

context.set_context(mode=context.GRAPH_MODE, device_target="Ascend")


class Net(nn.Cell):
    def __init__(self):
        super(Net, self).__init__()
        self.square = ops.Square()

    def construct(self, data):
        return self.square(data)


def test_net():
    x = np.array([1.0, 4.0, 9.0]).astype(np.float32)
    square = Net()
    output = square(Tensor(x))
    print("x: ", x)
    print("output: ", output)
The network consists of a single operator, Square, whose output is the square of its input. The execution result is as follows:
x: [1. 4. 9.]
output: [1. 16. 81.]
A rank_0/kernel_meta folder is generated under the execution directory, containing the .o, .json, and .info files of the Square operator. The current directory structure is:
└─src
    ├── test_square.py
    └── rank_0
        └── kernel_meta
            ├── square_12484080525657478220_2.info
            ├── square_12484080525657478220_2.json
            └── square_12484080525657478220_2.o
For a given operator:

The .o file is the executable that MindSpore generates for the operator during network execution.

The .info file records all of the operator's valid information, including the operator name, operator attributes, input/output formats, and input/output data types. The .info file is used to look up the operator and determine whether its .o file can be reused. Its contents are as follows:
{
"job_content": {
"SocInfo": {
"autoTilingMode": "NO_TUNE",
"coreNum": "",
"coreType": "",
"deviceId": "2",
"l1Fusion": "false",
"l2Fusion": "false",
"l2Mode": "2",
"mdl_bank_path": "",
"offlineTune": false,
"op_bank_path": "",
"op_bank_update": false,
"op_debug_dir": "./rank_0/",
"op_debug_level": "0",
"op_impl_mode": "",
"op_impl_mode_list": [],
"socVersion": "Ascend910A",
"vector_fp_ceiling": ""
},
"full_name": "Default/Square-op1",
"fusion_op_name": "square_12484080525657478220_2",
"graph_name": "",
"l1_size": -1,
"op_list": [
{
"name": "x_0",
"output_desc": [
{
"L1_addr_offset": 0,
"L1_fusion_type": -1,
"L1_workspace_size": -1,
"addr_type": 0,
"data_type": "float32",
"dtype": "float32",
"format": "ND",
"name": "x_0",
"ori_format": "NCHW",
"ori_shape": [
3
],
"param_type": "required",
"range": [
[
3,
3
]
],
"shape": [
3
],
"slice_offset": [],
"split_index": 0,
"total_shape": [],
"valid": true,
"valid_shape": []
}
],
"type": "Data"
},
{
"build_type": "accurately_build",
"dynamic_compile_static": false,
"func_name": "square",
"input_desc": [
{
"L1_addr_offset": 0,
"L1_fusion_type": -1,
"L1_workspace_size": -1,
"addr_type": 0,
"data_type": "float32",
"dtype": "float32",
"format": "ND",
"name": "x_0",
"ori_format": "NCHW",
"ori_shape": [
3
],
"param_type": "required",
"range": [
[
3,
3
]
],
"shape": [
3
],
"slice_offset": [],
"split_index": 0,
"total_shape": [],
"valid": true,
"valid_shape": []
}
],
"int64mode": false,
"max_kernel_id": 10,
"miss_support_info": "",
"module_name": "impl.square",
"name": "Default/Square-op1",
"op_name": "square_12484080525657478220_2",
"ori_name": [
"Default/Square-op1"
],
"output_data_desc": [
{
"L1_addr_offset": 0,
"L1_fusion_type": -1,
"L1_workspace_size": -1,
"addr_type": 0,
"data_type": "float32",
"dtype": "float32",
"format": "ND",
"ori_format": "NCHW",
"ori_shape": [
3
],
"param_type": "required",
"range": [
[
3,
3
]
],
"shape": [
3
],
"slice_offset": [],
"split_index": 0,
"total_shape": [],
"valid": true,
"valid_shape": []
}
],
"output_desc": [
{
"L1_addr_offset": 0,
"L1_fusion_type": -1,
"L1_workspace_size": -1,
"addr_type": 0,
"data_type": "float32",
"dtype": "float32",
"format": "ND",
"name": "y",
"ori_format": "NCHW",
"ori_shape": [
3
],
"param_type": "required",
"range": [
[
3,
3
]
],
"shape": [
3
],
"slice_offset": [],
"split_index": 0,
"total_shape": [],
"valid": true,
"valid_shape": []
}
],
"pattern": "Opaque",
"py_module_path": "/usr/local/Ascend/opp/op_impl/built-in/ai_core/tbe",
"type": "Square",
"unknown_shape": false
}
],
"scope_id": -1
},
"job_id": 1,
"job_type": "Compile",
"source_id": 2
}
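Since the .info file is plain JSON, its reuse-relevant fields are easy to inspect. The snippet below parses a trimmed excerpt of the example above (fields copied verbatim from it) using only the standard json module; it makes no assumptions about MindSpore APIs.

```python
import json

# Trimmed excerpt of the .info file shown above; only a few
# identifying fields are kept for illustration.
info_text = """
{
  "job_content": {
    "SocInfo": {"socVersion": "Ascend910A", "deviceId": "2"},
    "full_name": "Default/Square-op1",
    "fusion_op_name": "square_12484080525657478220_2"
  },
  "job_type": "Compile"
}
"""

info = json.loads(info_text)
content = info["job_content"]
# fusion_op_name matches the basename of the .o/.info/.json cache files.
print(content["fusion_op_name"])          # square_12484080525657478220_2
print(content["SocInfo"]["socVersion"])   # Ascend910A
```

Note that fusion_op_name matches the basename of the three cache files in kernel_meta, which is how an .info record is associated with its compiled .o binary.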
The .json file stores the operator's compilation result, which is used at runtime. Its contents are as follows:
{
"batchBindOnly": 1,
"binFileName": "square_12484080525657478220_2",
"binFileSuffix": ".o",
"blockDim": 1,
"build_result": "",
"kernelName": "square_12484080525657478220_2__kernel0",
"magic": "RT_DEV_BINARY_MAGIC_ELF",
"opParaSize": 0,
"parameters": [
0,
0
],
"sha256": "38ec670e4536958a70a653a0f3bbc7a5aadf66b5fd2b6cfe5379964668929797"
}
Once these three kinds of operator cache files have been generated, subsequent executions of the network model can use incremental operator compilation: only new or modified operators are compiled, which greatly improves network compilation performance.
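The reuse decision can be pictured with a small sketch. The helper below is hypothetical, not a MindSpore API, and deliberately simplified: it treats a kernel as reusable when all three cache files for its key are present in kernel_meta, whereas MindSpore's real check also matches the operator information recorded in the .info file.

```python
import os
import tempfile

def can_reuse(kernel_meta_dir, op_key):
    """Hypothetical helper: a compiled kernel is considered reusable
    only if its .o, .info, and .json cache files all exist."""
    return all(
        os.path.isfile(os.path.join(kernel_meta_dir, op_key + suffix))
        for suffix in (".o", ".info", ".json")
    )

# Demo in a temporary directory standing in for rank_0/kernel_meta.
with tempfile.TemporaryDirectory() as kernel_meta:
    key = "square_12484080525657478220_2"
    for suffix in (".o", ".info", ".json"):
        open(os.path.join(kernel_meta, key + suffix), "w").close()
    print(can_reuse(kernel_meta, key))            # True
    print(can_reuse(kernel_meta, "other_op_1_0"))  # False
```

This also explains the failure mode described in the FAQs below: deleting some but not all of the files for a key makes the kernel non-reusable and forces recompilation.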
FAQs
- Cache files usually cannot be shared across different scenarios, such as multi-device vs. single-device or training vs. inference.

- rank_0 is the default value used when the environment variable RANK_ID is empty; if that variable is set, a path with the corresponding RANK_ID is generated instead. For example, with RANK_ID=3, rank_3/kernel_meta is generated. The location of kernel_meta can be specified via the environment variable MS_COMPILER_CACHE_PATH; for example, with export MS_COMPILER_CACHE_PATH=/home/workspace/ and export RANK_ID=2, the operator compilation cache files are located at /home/workspace/rank_2/kernel_meta/.

- When running on multiple devices, executing the network model generates a rank_{ID}/kernel_meta folder in each device directory (where ID is the value of the environment variable RANK_ID). Note that if you delete the operator cache files under rank_{ID}/kernel_meta for only some of the devices and then execute the same network model again, the devices that do not need to recompile operators may time out while waiting, causing the execution to fail. In this case, setting the environment variable HCCL_CONNECT_TIMEOUT, i.e. the inter-device wait time, can avoid the failure, but the elapsed time is then about the same as deleting all caches and recompiling.

- If the process is interrupted during network compilation, the cache files in rank_0/kernel_meta may be generated incorrectly, causing subsequent executions to fail. In this case, delete the rank_0/kernel_meta folder and recompile the network.
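The cache-path rule above can be summarized in a few lines. The function below is a hypothetical helper, not a MindSpore API; it merely reproduces the documented rule that the cache lives at ${MS_COMPILER_CACHE_PATH}/rank_${RANK_ID}/kernel_meta, with the current directory and rank 0 as defaults.

```python
import os

def kernel_meta_path(env):
    """Hypothetical helper: compute the kernel_meta cache location from
    the MS_COMPILER_CACHE_PATH and RANK_ID environment variables, using
    the defaults described above when they are unset or empty."""
    base = env.get("MS_COMPILER_CACHE_PATH", ".")
    rank_id = env.get("RANK_ID") or "0"
    return os.path.join(base, "rank_" + rank_id, "kernel_meta")

# Defaults: no variables set.
print(kernel_meta_path({}))  # ./rank_0/kernel_meta (on POSIX)

# The example from the FAQ above.
print(kernel_meta_path({"MS_COMPILER_CACHE_PATH": "/home/workspace/",
                        "RANK_ID": "2"}))
# /home/workspace/rank_2/kernel_meta
```

Passing the environment as a dict keeps the sketch testable; in practice the values would come from os.environ.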