Error Analysis
As mentioned before, error analysis refers to analyzing and inferring possible error causes based on the obtained network and framework information (such as error messages and network code).
During error analysis, the first step is to identify the scenario where the error occurs and determine whether the error is caused by data loading and processing or by network construction and training. You can usually tell whether it is a data or network problem from the format of the error message. In the distributed parallel scenario, you can run the network on a single device for verification. If neither data loading and processing nor network construction and training is at fault, the error is specific to the parallel scenario. The following describes the error analysis methods in these scenarios.
Data Loading and Processing Error Analysis
When an error is reported during data processing, check whether the error message contains C++ error information, as shown in Figure 1. Typically, a data processing operation has the same name in C++ and in Python. Therefore, you can identify the failing data processing operation from the error message and locate the corresponding Python code.
Figure 1
As shown in the figure, `batch_op.cc` reports a C++ error. The batch operation combines multiple consecutive pieces of data in a dataset into one batch, and it is implemented in the backend. According to the error description, the input data does not meet the parameter requirements of the batch operation: data to be batched must have the same shape, and the error lists the mismatched shapes.
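As a hedged illustration (the dataset and shapes below are made up, not taken from the original case), the following sketch reproduces this kind of batch error: the generator yields samples whose shapes differ, so the backend batch operation cannot combine them.

```python
import numpy as np
import mindspore.dataset as ds

def gen():
    # The two samples have different shapes along dimension 0,
    # which violates the requirement that batched data share the same shape.
    yield (np.ones((2, 2), dtype=np.float32),)
    yield (np.ones((3, 2), dtype=np.float32),)

dataset = ds.GeneratorDataset(gen, column_names=["data"]).batch(2)

for item in dataset.create_tuple_iterator():
    # Iteration fails in the backend batch operation because the sample shapes are inconsistent.
    print(item[0].shape)
```

A typical fix is to map the samples to a common shape (for example, by padding or resizing) before calling batch.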
Data loading and processing has three phases: data preparation, data loading, and data augmentation. The following table lists common errors.
| Error Type | Error Description | Case Analysis |
| --- | --- | --- |
| Data preparation error | The dataset is faulty, involving a path or MindRecord file problem. | |
| Data loading error | Incorrect resource configuration, customized loading method, or iterator usage in the data loading phase. | |
| Data augmentation error | Unmatched data format/size, high resource usage, or multi-thread suspension. | |
Network Construction and Training Error Analysis
The network construction and training process can be executed in the dynamic graph mode or static graph mode, and has two phases: build and execution. The error analysis method varies according to the execution phase in different modes.
The following table lists common network construction and training errors.
| Error Type | Error Description | Case Analysis |
| --- | --- | --- |
| Incorrect context configuration | An error occurs when the system configures the context. | |
| Syntax error | Python syntax errors and MindSpore static graph syntax errors, such as unsupported control flow syntax and tensor slicing errors. | |
| Operator build error | The operator parameter value, type, or shape does not meet the requirements, or the operator function is restricted. | |
| Operator execution error | Input data exceptions, operator implementation errors, function restrictions, resource restrictions, etc. | |
| Insufficient resources | The device memory is insufficient, the number of function call stacks exceeds the threshold, or the number of stream resources exceeds the threshold. | |
Error Analysis of the Dynamic Graph Mode
In dynamic graph mode, the program is executed line by line in the order the code is written, and results are returned immediately. Figure 2 shows an error message reported during dynamic graph build. The error message comes from the Python frontend and indicates that the number of function parameters does not meet the requirements. Through the Python call stack, you can locate the faulty code line: `c = self.mul(b, self.func(a,a,b))`.
Generally, the error message may also contain WARNING logs. During error analysis, analyze the error message that follows Traceback first.
Figure 2
In dynamic graph mode, common network construction and training errors involve environment configuration, Python syntax, and operator usage. The general analysis method is as follows (an example sketch follows the steps):

1. Determine the object that reports the error based on the error description, for example, the operator API name.
2. Locate the code line that reports the error based on the Python call stack information.
3. Analyze the input data and computation logic at the position where the error occurs, and find the error cause based on the description and specifications of the error object in the MindSpore API documentation.
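The following minimal sketch (a hypothetical reduction of the case in Figure 2, not the original network) shows these steps in practice: `func` is defined with two inputs but is called with three, so the Python frontend raises a TypeError whose traceback points to the call line in `construct`.

```python
import mindspore as ms
from mindspore import Tensor, nn, ops

ms.set_context(mode=ms.PYNATIVE_MODE)

class Net(nn.Cell):
    def __init__(self):
        super().__init__()
        self.mul = ops.Mul()

    def func(self, x, y):
        # Defined with two inputs, but called below with three.
        return self.mul(x, y)

    def construct(self, a, b):
        # TypeError: func() takes 3 positional arguments but 4 were given
        c = self.mul(b, self.func(a, a, b))
        return c

net = Net()
a = Tensor([1.0, 2.0], ms.float32)
b = Tensor([3.0, 4.0], ms.float32)
out = net(a, b)  # The Python call stack points to the faulty line in construct.
```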
Error Analysis of the Static Graph Mode
In static graph mode, MindSpore builds the network structure into a computational graph and then performs the computation operations in the graph. Therefore, errors reported in static graph mode include computational graph build errors and computational graph execution errors. Figure 3 shows an error message reported during computational graph build. When such an error occurs, the `analyze_failed.ir` file is automatically saved to help locate the faulty code.
Figure 3
The general error analysis method in static graph mode is as follows:
1. Check whether the error is caused by graph build or graph execution based on the error description.
2. If the error is reported during computational graph build, analyze the cause and location of the failure based on the error description and the `analyze_failed.ir` file that is automatically saved when the error occurs.
3. If the error is reported during computational graph execution, it may be caused by insufficient resources or improper operator execution; distinguish the two cases based on the error message. If an operator fails during execution, locate the operator, use the dump function to save its input data, and analyze the error cause based on that data.

For details about how to analyze and infer the failure cause, see the analysis methods described for `analyze_failed.ir`.
For details about how to use Dump to save the operator input data, see Dump Function Debugging.
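As an illustration (the network and shapes are invented for this sketch), the following graph-mode code fails during computational graph build because the MatMul input shapes do not match; the IR file mentioned above is then saved automatically and points to the faulty MatMul call.

```python
import numpy as np
import mindspore as ms
from mindspore import Tensor, nn, ops

ms.set_context(mode=ms.GRAPH_MODE)

class Net(nn.Cell):
    def __init__(self):
        super().__init__()
        self.matmul = ops.MatMul()

    def construct(self, x, y):
        # (2, 3) x (2, 3) is invalid: the inner dimensions (3 and 2) do not match.
        return self.matmul(x, y)

net = Net()
x = Tensor(np.ones((2, 3)), ms.float32)
y = Tensor(np.ones((2, 3)), ms.float32)
out = net(x, y)  # Fails while building the computational graph, before execution.
```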
Distributed Parallel Error Analysis
MindSpore provides the distributed parallel training function and supports multiple parallel modes. The following table lists common distributed parallel training errors and possible causes.
| Error Type | Error Description |
| --- | --- |
| Incorrect policy configuration | Incorrect operator logic, incorrect scalar policy configuration, or no policy configuration. |
| Parallel script error | Incorrect script startup, or the parallel configuration does not match the started tasks. |
Incorrect Policy Configuration
Policy check errors may be reported after you enable semi-automatic parallelism using `mindspore.set_auto_parallel_context(parallel_mode="semi_auto_parallel")`. These errors are reported because specific operators have slicing restrictions. The following three examples describe how to analyze the three types of errors.
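The examples assume that semi-automatic parallelism has already been enabled with a configuration similar to the following sketch (the 8-device setting and the communication initialization are assumptions, not part of the original cases).

```python
import mindspore as ms
from mindspore.communication import init

# Assumed setup for the shard examples below: graph mode, 8 devices,
# semi-automatic parallelism. init() sets up communication for a real multi-device run.
ms.set_context(mode=ms.GRAPH_MODE)
init()
ms.set_auto_parallel_context(parallel_mode="semi_auto_parallel", device_num=8)
```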
Incorrect Operator Logic
The error message is as follows:
```text
[ERROR]Check StridedSliceInfo1414: When there is a mask, the input is not supported to be split
```
The following is a piece of possible faulty code, where the network input is a [2, 4] tensor. The network slices the input to obtain the first half of dimension 0, which is equivalent to the `x[:1, :]` operation in NumPy (x is the input tensor). In the network, the (2, 1) policy is configured for the StridedSlice operator, which splits dimension 0 into two parts.
```python
import numpy as np
from mindspore import Tensor, nn, ops

tensor = Tensor(np.ones((2, 4)).astype(np.float32))
stridedslice = ops.StridedSlice()

class MyStridedSlice(nn.Cell):
    def __init__(self):
        super(MyStridedSlice, self).__init__()
        # (2, 1): split dimension 0 into 2 parts and leave dimension 1 unsplit.
        self.slice = stridedslice.shard(((2, 1),))

    def construct(self, x):
        # x is a two-dimensional tensor; take the first half of dimension 0 (x[:1, :]).
        return self.slice(x, (0, 0), (1, 4), (1, 1))
```
Error cause:

The code slices dimension 0 of the input and extracts only its first half, but the configured policy (2, 1) splits dimension 0 of the input tensor into two parts. According to the description of operator slicing in the MindSpore API, only masks whose values are all 0 are supported, dimensions that are split must be extracted in full, and dimensions whose strides are not 1 cannot be split. A dimension that is split cannot be only partially extracted. Therefore, the policy must be modified as follows: change the policy value of dimension 0 from 2 to 1, so that dimension 0 is split into one part, that is, not split at all. The policy then meets the operator restrictions and the policy check succeeds.
```python
class MyStridedSlice(nn.Cell):
    def __init__(self):
        super(MyStridedSlice, self).__init__()
        # (1, 1): neither dimension is split, which satisfies the operator restriction.
        self.slice = stridedslice.shard(((1, 1),))

    def construct(self, x):
        # x is a two-dimensional tensor
        return self.slice(x, (0, 0), (1, 4), (1, 1))
```
Incorrect Scalar Policy Configuration
Error message:
```text
[ERROR] The strategy is ..., strategy len:. is not equal to inputs len:., index:
```
Possible error code:
```python
class MySub(nn.Cell):
    def __init__(self):
        super(MySub, self).__init__()
        # A policy (1,) is incorrectly configured for the scalar input.
        self.sub = ops.Sub().shard(((1, 1), (1,)))

    def construct(self, x):
        # x is a two-dimensional tensor
        return self.sub(x, 1)
```
Many operators accept scalar inputs, for example, the addition, subtraction, multiplication, and division operations, and the axis input of operators such as Concat and Gather. Do not configure policies for these scalar inputs. If, as in the preceding code, the policy (1,) is configured for the scalar 1 of the subtraction operation, an error is reported: the length of the policy at index 1 is 1, which is not equal to the length 0 of the corresponding input, because that input is a scalar.
Modified code:

In this case, set an empty policy for the scalar, or do not set any policy for it (recommended).

```python
# Option 1: set an empty policy for the scalar input.
self.sub = ops.Sub().shard(((1, 1), ()))
# Option 2 (recommended): do not set any policy for the scalar input.
self.sub = ops.Sub().shard(((1, 1),))
```
No Policy Configuration
Error message:

```text
[ERROR]The strategy is ((8, 1)), shape 4 can not be divisible by strategy value 8
```
Possible error code:
```python
class MySub(nn.Cell):
    def __init__(self):
        super(MySub, self).__init__()
        # No policy is configured; the default policy is data parallel.
        self.sub = ops.Sub()

    def construct(self, x):
        # x is a two-dimensional tensor
        return self.sub(x, 1)
```
The preceding code runs training in an 8-device environment in semi-automatic parallel mode. No policy is configured for the Sub operator, so its default policy, data parallel, is used. Assume that the input x is a matrix of size [2, 4]. After the build starts, an error is reported, indicating that the input dimension cannot be evenly sliced. In this case, modify the policy as follows (ensure that the number of slices configured for each dimension can evenly divide the size of the corresponding dimension of the input tensor):
```python
class MySub(nn.Cell):
    def __init__(self):
        super(MySub, self).__init__()
        # (2, 1) slices dimension 0 of x into two parts; () leaves the scalar unsliced.
        self.sub = ops.Sub().shard(((2, 1), ()))

    def construct(self, x):
        # x is a two-dimensional tensor
        return self.sub(x, 1)
```
(2, 1) indicates that dimension 0 of the first input tensor is sliced into two parts and dimension 1 is sliced into one part, that is, dimension 1 is not sliced. The second input of `ops.Sub` is a scalar that cannot be sliced, so its slicing policy is set to the empty tuple ().
Parallel Script Error
The following is a piece of code that runs in an 8-device Ascend environment and uses a bash script to start the training task.
```bash
#!/bin/bash
set -e
EXEC_PATH=$(pwd)

export RANK_SIZE=8
export RANK_TABLE_FILE=${EXEC_PATH}/rank_table_8pcs.json

for((i=0;i<RANK_SIZE;i++))
do
    rm -rf device$i
    mkdir device$i
    cp ./train.py ./device$i
    cd ./device$i
    export DEVICE_ID=$i
    export RANK_ID=$i
    echo "start training for device $i"
    env > env$i.log
    python ./train.py > train.log$i 2>&1 &
    cd ../
done
echo "The program launch succeed, the log is under device0/train.log0."
```
Errors may occur in the following scenarios:

- The number of training tasks (`RANK_SIZE`) started by the for loop does not match the number of devices configured in the `rank_table_8pcs.json` configuration file. As a result, an error is reported.
- The command that executes the training script is not run asynchronously (`python ./train.py > train.log$i 2>&1`). As a result, the training tasks are started at different times and an error is reported. In this case, append the `&` operator to the command so that it runs asynchronously in a subshell, allowing multiple tasks to be started at the same time.
In parallel scenarios, you may encounter the `Distribute Task Failed` error. In this case, check whether the error occurs in the computational graph build phase or in the execution phase (when the training loss is printed) to further locate the error.
For more information about distributed parallel errors in MindSpore, see Distributed Task Failed.
CANN Error Analysis
This section applies only to the Ascend platform. CANN (Compute Architecture for Neural Networks) is Huawei's heterogeneous computing architecture for AI scenarios, and MindSpore on the Ascend platform runs on top of CANN.
When running MindSpore on the Ascend platform, certain scenarios produce errors from the underlying CANN. Such errors are generally reported in the log with the `Ascend error occurred` keyword, and the error message consists of an error code and the error content, as follows:
```text
[ERROR] PROFILER(138694,ffffaa6c8480,python):2022-01-10-14:19:56.741.053 [mindspore/ccsrc/profiler/device/ascend/ascend_profiling.cc:51] ReportErrorMessage] Ascend error occurred, error message:
EK0001: Path [/ms_test/csj/csj/user_scene/profiler_chinese_中文/resnet/scripts/train/data/profiler] for [profilerResultPath] is invalid or does not exist. The Path name can only contain A-Za-z0-9-_.
```
A CANN error code consists of six characters, such as `EK0001` above, and contains three fields:
| Field 1 | Field 2 | Field 3 |
| --- | --- | --- |
| Level (1 character) | Module (1 character) | Error code (4 characters) |
The level can be E, W, or I, indicating an error, a warning, or an informational message, respectively. The module field indicates the CANN module that reports the error, as shown in the following table:
| Error code | CANN module | Error code | CANN module |
| --- | --- | --- | --- |
| E10000-E19999 | GE | EE0000-EE9999 | runtime |
| E20000-E29999 | FE | EF0000-EF9999 | LxFusion |
| E30000-E39999 | AICPU | EG0000-EG9999 | mstune |
| E40000-E49999 | TEFusion | EH0000-EH9999 | ACL |
| E50000-E89999 | AICORE | EI0000-EJ9999 | HCCL&HCCP |
| E90000-EB9999 | TBE compiling frontend and backend | EK0000-EK9999 | Profiling |
| EC0000-EC9999 | Autotune | EL0000-EL9999 | Driver |
| ED0000-ED9999 | RLTune | EZ0000-EZ9999 | Operator public error |
- AICORE operator: AI Core operators are the main computing units of the Ascend AI processor and are responsible for executing compute-intensive operators related to vectors and tensors.
- AICPU operator: AI CPU operators are CPU-like operators (including control operators, scalar and vector operators, and other general-purpose computations) executed by the AI CPU in the HiSilicon SoC of the Ascend processor.
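As a reading aid, the following hypothetical helper (not part of MindSpore or CANN) splits a CANN error code into its level and module fields according to the two tables above:

```python
# Hypothetical helper based on the tables above; not part of MindSpore or CANN.
LEVELS = {"E": "error", "W": "warning", "I": "information"}
MODULES = {
    "1": "GE", "2": "FE", "3": "AICPU", "4": "TEFusion",
    "5": "AICORE", "6": "AICORE", "7": "AICORE", "8": "AICORE",
    "9": "TBE compiling frontend and backend",
    "A": "TBE compiling frontend and backend",
    "B": "TBE compiling frontend and backend",
    "C": "Autotune", "D": "RLTune", "E": "runtime", "F": "LxFusion",
    "G": "mstune", "H": "ACL", "I": "HCCL&HCCP", "J": "HCCL&HCCP",
    "K": "Profiling", "L": "Driver", "Z": "Operator public error",
}

def parse_cann_error_code(code):
    """Split a six-character CANN error code such as 'EK0001' into its fields."""
    level = LEVELS.get(code[0], "unknown")
    module = MODULES.get(code[1], "unknown")
    number = code[2:]  # four-digit error number
    return level, module, number

print(parse_cann_error_code("EK0001"))  # ('error', 'Profiling', '0001')
```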
In the four-digit error code, 0000-8999 indicate user errors and 9000-9999 indicate internal errors. Generally, you can correct user errors by yourself according to the error message, whereas internal errors require contacting Huawei for troubleshooting; you can submit an issue in the MindSpore Community or the Ascend Community to get help. Some common error scenarios are listed in the following table:
| Common Error Types | Error Description | Case Analysis |
| --- | --- | --- |
| AICORE operator compilation problem | An AICORE operator reports an error during compilation. | |
| AICORE operator execution problem | An AICORE operator reports an error during execution. | |
| AICPU operator execution problem | An AICPU operator reports an error during execution. | |
| runtime FAQ | Including input data exceptions, operator implementation errors, functional limitations, resource limitations, etc. | |
| HCCL & HCCP FAQ | Common communication problems during multi-machine multi-card training, including socket build timeout, notify wait timeout, ranktable configuration errors, etc. | |
| profiling FAQ | Errors when running profiling for performance tuning. | |
For more information about CANN errors, refer to the troubleshooting section of the corresponding CANN version in the Ascend CANN Developer Documentation.