Incremental Operator Build

Overview

When a network model is executed, MindSpore builds the operators it uses, and the time consumed in this stage grows with the scale of the network model. To improve the performance of subsequent executions, MindSpore provides an incremental operator build mechanism. When a network model is executed, a rank_0/kernel_meta folder is generated by default in the directory where the execution is performed, and the operator cache files (in the .o, .info, and .json formats) generated during network build are saved to it. If you execute the same network model again, or if only part of the model changes, MindSpore automatically reuses the matching operator cache files in the rank_0/kernel_meta folder, which significantly reduces the network build time and improves execution performance. Currently, the incremental operator build function is available only on Ascend AI chips.
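
The time saving can be observed directly. The following is a minimal timing sketch, not an official tool: it assumes the example script test_square.py that is built in the Usage section below, and simply runs it twice so that the second run can reuse the operator cache.

import subprocess
import time

SCRIPT = "test_square.py"  # the example script built in the Usage section

for label in ("first run (cold cache)", "second run (warm cache)"):
    start = time.perf_counter()
    subprocess.run(["python", SCRIPT], check=True)
    print(label, f"{time.perf_counter() - start:.1f} s")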

The following demonstrates how to use the incremental operator build function.

Usage

Incremental operator build is enabled by default in MindSpore and requires no manual configuration. The following example builds a simple network model, test_square.py, located in the src directory. The current directory structure is as follows:

└── src
    └── test_square.py

Execute the following test case:

import numpy as np
import mindspore.nn as nn
import mindspore.context as context
import mindspore.ops as ops
from mindspore import Tensor

context.set_context(mode=context.GRAPH_MODE, device_target="Ascend")

class Net(nn.Cell):
    def __init__(self):
        super(Net, self).__init__()
        self.square = ops.Square()

    def construct(self, data):
        return self.square(data)

def test_net():
    x = np.array([1.0, 4.0, 9.0]).astype(np.float32)
    square = Net()
    output = square(Tensor(x))
    print("x: ", x)
    print("output: ", output)


The network model consists of a single operator, Square, and the output is the element-wise square of the input. The output is as follows:

x: [1. 4. 9.]
output: [1. 16. 81.]

The rank_0/kernel_meta folder is generated in the directory where the execution is performed and contains the .o, .json, and .info files of the Square operator. The current directory structure is as follows:

└── src
    ├── test_square.py
    └── rank_0
        └── kernel_meta
            ├── square_12484080525657478220_2.info
            ├── square_12484080525657478220_2.json
            └── square_12484080525657478220_2.o
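
To confirm that the cache files were written, a small listing sketch such as the following can be used; it assumes it is run from the src directory.

from pathlib import Path

# List the operator cache files under rank_0/kernel_meta, grouped by suffix.
kernel_meta = Path("rank_0") / "kernel_meta"
for suffix in (".o", ".info", ".json"):
    for cache_file in sorted(kernel_meta.glob(f"*{suffix}")):
        print(cache_file.name)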

For an operator:

The .o file is the executable binary generated by MindSpore for the operator during network model execution.

The .info file records the identifying information of the operator, including its name, attributes, input and output formats, and input and output data types. It is used to look up the operator in the cache and determine whether its .o file can be reused. The details are as follows:

{
    "job_content": {
        "SocInfo": {
            "autoTilingMode": "NO_TUNE",
            "coreNum": "",
            "coreType": "",
            "deviceId": "2",
            "l1Fusion": "false",
            "l2Fusion": "false",
            "l2Mode": "2",
            "mdl_bank_path": "",
            "offlineTune": false,
            "op_bank_path": "",
            "op_bank_update": false,
            "op_debug_dir": "./rank_0/",
            "op_debug_level": "0",
            "op_impl_mode": "",
            "op_impl_mode_list": [],
            "socVersion": "Ascend910A",
            "vector_fp_ceiling": ""
        },
        "full_name": "Default/Square-op1",
        "fusion_op_name": "square_12484080525657478220_2",
        "graph_name": "",
        "l1_size": -1,
        "op_list": [
            {
                "name": "x_0",
                "output_desc": [
                    {
                        "L1_addr_offset": 0,
                        "L1_fusion_type": -1,
                        "L1_workspace_size": -1,
                        "addr_type": 0,
                        "data_type": "float32",
                        "dtype": "float32",
                        "format": "ND",
                        "name": "x_0",
                        "ori_format": "NCHW",
                        "ori_shape": [
                            3
                        ],
                        "param_type": "required",
                        "range": [
                            [
                                3,
                                3
                            ]
                        ],
                        "shape": [
                            3
                        ],
                        "slice_offset": [],
                        "split_index": 0,
                        "total_shape": [],
                        "valid": true,
                        "valid_shape": []
                    }
                ],
                "type": "Data"
            },
            {
                "build_type": "accurately_build",
                "dynamic_compile_static": false,
                "func_name": "square",
                "input_desc": [
                    {
                        "L1_addr_offset": 0,
                        "L1_fusion_type": -1,
                        "L1_workspace_size": -1,
                        "addr_type": 0,
                        "data_type": "float32",
                        "dtype": "float32",
                        "format": "ND",
                        "name": "x_0",
                        "ori_format": "NCHW",
                        "ori_shape": [
                            3
                        ],
                        "param_type": "required",
                        "range": [
                            [
                                3,
                                3
                            ]
                        ],
                        "shape": [
                            3
                        ],
                        "slice_offset": [],
                        "split_index": 0,
                        "total_shape": [],
                        "valid": true,
                        "valid_shape": []
                    }
                ],
                "int64mode": false,
                "max_kernel_id": 10,
                "miss_support_info": "",
                "module_name": "impl.square",
                "name": "Default/Square-op1",
                "op_name": "square_12484080525657478220_2",
                "ori_name": [
                    "Default/Square-op1"
                ],
                "output_data_desc": [
                    {
                        "L1_addr_offset": 0,
                        "L1_fusion_type": -1,
                        "L1_workspace_size": -1,
                        "addr_type": 0,
                        "data_type": "float32",
                        "dtype": "float32",
                        "format": "ND",
                        "ori_format": "NCHW",
                        "ori_shape": [
                            3
                        ],
                        "param_type": "required",
                        "range": [
                            [
                                3,
                                3
                            ]
                        ],
                        "shape": [
                            3
                        ],
                        "slice_offset": [],
                        "split_index": 0,
                        "total_shape": [],
                        "valid": true,
                        "valid_shape": []
                    }
                ],
                "output_desc": [
                    {
                        "L1_addr_offset": 0,
                        "L1_fusion_type": -1,
                        "L1_workspace_size": -1,
                        "addr_type": 0,
                        "data_type": "float32",
                        "dtype": "float32",
                        "format": "ND",
                        "name": "y",
                        "ori_format": "NCHW",
                        "ori_shape": [
                            3
                        ],
                        "param_type": "required",
                        "range": [
                            [
                                3,
                                3
                            ]
                        ],
                        "shape": [
                            3
                        ],
                        "slice_offset": [],
                        "split_index": 0,
                        "total_shape": [],
                        "valid": true,
                        "valid_shape": []
                    }
                ],
                "pattern": "Opaque",
                "py_module_path": "/usr/local/Ascend/opp/op_impl/built-in/ai_core/tbe",
                "type": "Square",
                "unknown_shape": false
            }
        ],
        "scope_id": -1
    },
    "job_id": 1,
    "job_type": "Compile",
    "source_id": 2
}
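
For reference, the following minimal sketch loads this .info file and prints the fields that identify the operator for cache matching. The hashed file name is the one generated in this example and will differ on your machine.

import json

# Load the example .info file shown above.
with open("rank_0/kernel_meta/square_12484080525657478220_2.info") as f:
    info = json.load(f)

content = info["job_content"]
print("fusion_op_name:", content["fusion_op_name"])
for op in content["op_list"]:
    if op.get("type") == "Data":   # placeholder inputs, not built operators
        continue
    print("operator:", op["type"], "| func_name:", op["func_name"])
    for desc in op.get("input_desc", []):
        print("  input :", desc["dtype"], desc["format"], desc["shape"])
    for desc in op.get("output_desc", []):
        print("  output:", desc["dtype"], desc["format"], desc["shape"])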

The .json file stores the operator build result, which is used at run time. The details are as follows:

{
    "batchBindOnly": 1,
    "binFileName": "square_12484080525657478220_2",
    "binFileSuffix": ".o",
    "blockDim": 1,
    "build_result": "",
    "kernelName": "square_12484080525657478220_2__kernel0",
    "magic": "RT_DEV_BINARY_MAGIC_ELF",
    "opParaSize": 0,
    "parameters": [
        0,
        0
    ],
    "sha256": "38ec670e4536958a70a653a0f3bbc7a5aadf66b5fd2b6cfe5379964668929797"
}
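
Similarly, a minimal sketch can resolve the operator binary that this build result points to, using the binFileName and binFileSuffix fields shown above.

import json
from pathlib import Path

meta_dir = Path("rank_0/kernel_meta")

# Load the build-result .json shown above (the file name is specific to this run).
with open(meta_dir / "square_12484080525657478220_2.json") as f:
    result = json.load(f)

# binFileName + binFileSuffix give the name of the operator binary.
binary = meta_dir / (result["binFileName"] + result["binFileSuffix"])
print("kernel name:", result["kernelName"])
print("binary     :", binary, "| exists:", binary.exists())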

After the preceding three types of operator cache files have been generated, incremental operator build takes effect: when you execute the network model again, only new or modified operators are built, which greatly improves network build performance.
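
To observe the incremental behavior itself, consider the following minimal sketch, a hypothetical variant of test_square.py rather than an official example. It adds one new operator to the network, so rerunning in the same directory should build only the new Sqrt operator while the cached Square files are reused.

import numpy as np
import mindspore.nn as nn
import mindspore.context as context
import mindspore.ops as ops
from mindspore import Tensor

context.set_context(mode=context.GRAPH_MODE, device_target="Ascend")

class Net2(nn.Cell):
    def __init__(self):
        super(Net2, self).__init__()
        self.square = ops.Square()  # already cached from the first run
        self.sqrt = ops.Sqrt()      # new operator: only this one needs building

    def construct(self, data):
        return self.sqrt(self.square(data))

if __name__ == "__main__":
    x = np.array([1.0, 4.0, 9.0]).astype(np.float32)
    print("output: ", Net2()(Tensor(x)))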

FAQs

  • Operator cache files cannot be shared across different scenarios, such as multi-device and single-device scenarios, or training and inference scenarios.

  • rank_0 is the default directory name used when the environment variable RANK_ID is empty. If RANK_ID is set, for example RANK_ID=3, the path rank_3/kernel_meta is generated instead.

  • The location of kernel_meta can be specified by the environment variable MS_COMPILER_CACHE_PATH. For example, with export MS_COMPILER_CACHE_PATH=/home/xxx/ and export RANK_ID=2, the operator build cache files are saved in /home/xxx/rank_2/kernel_meta/ (see the sketch after this list).

  • When multiple devices are running, executing the network model generates a rank_{ID}/kernel_meta folder in each device directory, where ID is the value of the environment variable RANK_ID.

    Note that when multiple devices are running, if the operator cache files in rank_{ID}/kernel_meta are deleted on some devices and the same network model is executed again, the devices that do not need to rebuild may time out while waiting for the others, causing the execution to fail. In this case, you can increase the environment variable HCCL_CONNECT_TIMEOUT, that is, the time devices wait for each other, to avoid the failure. However, waiting still takes a long time, roughly as long as deleting the caches on all devices and rebuilding everything.

  • If the process is interrupted while the network is being built, the cache files in rank_0/kernel_meta may be written incorrectly, and subsequent executions will fail. In this case, delete the rank_0/kernel_meta folder and rebuild the network.
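
As referenced in the list above, the environment variables can also be set from Python. The following is a minimal sketch; it assumes, conservatively, that the variables must be set before mindspore is imported.

import os

# Set the cache location before MindSpore initializes (or export these in the
# shell before launching the script).
os.environ["MS_COMPILER_CACHE_PATH"] = "/home/xxx/"  # cache root from the example above
os.environ["RANK_ID"] = "2"  # cache files go to /home/xxx/rank_2/kernel_meta/

import mindspore.context as context

context.set_context(mode=context.GRAPH_MODE, device_target="Ascend")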