AOT-Type Custom Operators(CPU/GPU)
Overview
AOT (Ahead-Of-Time) type of custom operators employ a pre-compilation approach, which requires network developers to manually write the source code files corresponding to the operator implementation functions based on specific interfaces. These source code files need to be compiled into dynamic link libraries (DLLs) in advance. During network runtime, the framework will automatically invoke and execute the functions contained within these dynamic link libraries.
AOT-type custom operators support the CUDA language for GPU platforms and the C and C++ languages for CPU platforms. For the development of custom operators specifically on the Ascend platform, please refer to AOT-Type Custom Operators(Ascend).
In this tutorial, we provide several simple use cases of AOT-type custom operators on both CPU and GPU platforms as demonstrations. For more comprehensive examples of AOT-type custom operators, please refer to the examples section in the MindSpore source code.
The Introduction to the General Usage Features of AOT-type Custom Operators
The custom operator of AOT-type adopts the AOT compilation method, which requires network developers to hand-write the source code file of the operator implementation based on a specific interface and compiles the source code file into a dynamic library in advance, and then the framework will automatically call and run the function defined in the dynamic library. In terms of the development language of the operator implementation, the GPU platform supports CUDA, and the CPU platform supports C and C++. The interface specification of the operator implementation in the source file is as follows:
extern "C" int func_name(int nparam, void **params, int *ndims, int64_t **shapes, const char **dtypes, void *stream, void *extra);
where the function name func_name
can be replaced with any valid function name. The return value is of type int. 0 means normal exit, and non-zero means an exception occurs. The meaning of the parameter list is as follows:
nparam (int): The number of inputs and outputs. For example, if an operator has 2 inputs and 1 output, then the value of nparam is 3.
params (void **): An array of pointers, with each pointer pointing to the input or output data. For example, if an operator has 2 inputs and 1 output, then params[0] points to the first input data, params[1] points to the second input data, params[2] points to the output data.
ndims (int *): An array of integers, each integer represents the dimensions of the shape of input or output. For example, if params[i] is a tensor with shape [1024, 1024], then ndims[i] is 2.
shapes (int64_t **): An array of shapes, each element in array represents for the shape of input or output. For example, if params[i] is a tensor with shape [1024, 1024], then shapes[i][0] is 1024, shapes[i][1] is 1024.
dtypes (const char **): Array of data types, each element in array represents for the data type of input or output. The value of data type can be "float32", "float16", "float", "float64", "int", "int8", "int16", "int32", "int64", "uint", "uint8", "uint16", "uint32", "uint64", "bool".
stream (void *): Stream pointer, only used in Cuda file.
extra (void *): Used for further extension.
In the Python script, the format for the func
input in Custom
is Path_To_Func:CustomFunc
, where CustomFunc
is the name of the function above and Path_To_Func
is the path to the corresponding function source file or binary library.
MindSpore recognizes the automatic compilation through the file name suffix. To use the automatic compilation feature, please use source files with suffixes such as
cpp
,cc
, orcu
. Otherwise, MindSpore will treat it as a binary library path;To prevent malicious third-party libraries from tampering, please set the environment variable
MS_CUSTOM_AOT_WHITE_LIST
to the path of legitimate third-party libraries. Only files in the directories specified byMS_CUSTOM_AOT_WHITE_LIST
and its subdirectories will be called by custom operators.
Operator output shape and data type inference can be realized by defining Python functions to describe the inference logic.
If the operator only supports some specific input and output data types, the operator information needs to be registered. For the creation of operator information, please refer to Registering the Operator Information.
The following examples introduce the development process of AOT-type custom operator on GPU platform and CPU platform, where the custom operator implements the function of adding two input tensors.
A GPU Example
Use the CUDA language to write the source file add.cu for the operator implementation:
#define THREADS 1024
__global__ void CustomAddKernel(float *input1, float *input2, float *output, size_t size) {
auto idx = blockIdx.x * THREADS + threadIdx.x;
if (idx < size) {
output[idx] = input1[idx] + input2[idx];
}
}
extern "C" int CustomAdd(int nparam, void **params, int *ndims, int64_t **shapes, const char **dtypes, void *stream,
void *extra) {
cudaStream_t custream = static_cast<cudaStream_t>(stream);
if (nparam != 3) return 1;
void *input1 = params[0];
void *input2 = params[1];
void *output = params[2];
size_t size = 1;
for (int i = 0; i < ndims[2]; i++) {
size *= shapes[2][i];
}
int n = size / THREADS;
for (int i = 0; i < nparam; i++) {
if (strcmp(dtypes[i], "float32") != 0) {
return 2;
}
}
CustomAddKernel<<<n + 1, THREADS, 0, custream>>>(static_cast<float *>(input1), static_cast<float *>(input2),
static_cast<float *>(output), size);
return 0;
}
Compile add.cu into a dynamic library add.so:
nvcc --shared -Xcompiler -fPIC -o add.so add.cu
Write the test case test_custom_aot.py:
import numpy as np
import mindspore as ms
import mindspore.ops as ops
ms.set_device(device_target="GPU")
if __name__ == "__main__":
# Define a custom operator of aot type
op = ops.Custom("./add.so:CustomAdd", out_shape=lambda x, _: x, out_dtype=lambda x, _: x, func_type="aot")
x0 = np.array([[0.0, 0.0], [1.0, 1.0]]).astype(np.float32)
x1 = np.array([[2.0, 2.0], [3.0, 3.0]]).astype(np.float32)
output = op(ms.Tensor(x0), ms.Tensor(x1))
print(output)
The following points need to be explained in this example:
In this example, you need to place test_custom_aot.py and add.so in the same directory. If add.so is in another directory, you need to replace the value of the first parameter of
Custom
primitive with the absolute path of add.so.Use Python lambda functions to infer the output shape and data type, and pass them to the
out_shape
andout_dtype
parameters of theCustom
primitive. In this example, the lambda function indicates that the output shape and data type are the same as the information of the first input tensor.The operator information is not registered, so the operator information of the custom operator will be inferred from the inputs.
Execute case:
python test_custom_aot.py
The execution result is as follows:
[[2. 2.]
[4. 4.]]
A CPU Example
Use C/C++ language to write the source file add.cc for the operator implementation:
#include <string.h>
using size_t = decltype(sizeof(int));
using int64_t = decltype(sizeof(long));
extern "C" int CustomAdd(int nparam, void **params, int *ndims, int64_t **shapes, const char **dtypes, void *stream, void *extra) {
if (nparam != 3) return 1;
float *input1 = static_cast<float *>(params[0]);
float *input2 = static_cast<float *>(params[1]);
float *output = static_cast<float *>(params[2]);
size_t size = 1;
for (int i = 0; i < ndims[2]; i++) {
size *= shapes[2][i];
}
for (int i = 0; i < nparam; i++) {
if (strcmp(dtypes[i], "float32") != 0) {
return 2;
}
}
for (int i = 0; i < size; i++) {
output[i] = input1[i] + input2[i];
}
return 0;
}
Compile add.cc into a dynamic library add.so:
g++ --shared -fPIC -o add.so add.cc
Write the test case test_custom_aot.py:
import numpy as np
import mindspore as ms
import mindspore.ops as ops
ms.set_device(device_target="CPU")
if __name__ == "__main__":
# Define a custom operator of aot type
op = ops.Custom("./add.so:CustomAdd", out_shape=lambda x, _: x, out_dtype=lambda x, _: x, func_type="aot")
x0 = np.array([[0.0, 0.0], [1.0, 1.0]]).astype(np.float32)
x1 = np.array([[2.0, 2.0], [3.0, 3.0]]).astype(np.float32)
output = op(ms.Tensor(x0), ms.Tensor(x1))
print(output)
The following points need to be explained in this example:
In this example, you need to place test_custom_aot.py and add.so in the same directory. If add.so is in another directory, you need to replace the value of the first parameter of
Custom
primitive with the absolute path of add.so.Use Python lambda functions to infer the output shape and data type, and pass them to the
out_shape
andout_dtype
parameters of theCustom
primitive. In this example, the lambda function indicates that the output shape and data type are the same as the information of the first input tensor.The operator information is not registered, so the operator information of the custom operator will be inferred from the inputs.
Execute case:
python test_custom_aot.py
The execution result is as follows:
[[2. 2.]
[4. 4.]]
For more complete examples of AOT-type custom operators, see the use cases in the MindSpore source code.
The Introduction to the Advanced Usage Features of AOT-type Custom Operators
In the rest of tutorial, we will demonstrate advanced features of AOT-type custom operators, including:
Auto-compilation of AOT-type custom operators;
Attributes and intermediate variables of AOT-type custom operators;
Dynamic shape support for AOT-type custom operators.
For the complete source code of the example, check here in the MindSpore source code.
Auto-compilation of AOT-type Custom Operators
When the user's AOT-type custom operator file is a single file and does not require custom compilation options during compilation, users can use the automatic compilation feature. In this way, users will provide the source file for the implementation of the custom operator, and MindSpore will automatically compile the source file into a binary library. Currently, this function supports C++ file compilation based on GCC and CUDA file compilation based on NVCC. When using the automatic compilation function, there are several points to note:
MindSpore recognizes the method of automatic compilation as a file name suffix. In order to use the auto compilation feature, please use a source file with a suffix of
cpp
,cc
, orcu
. In other cases, MindSpore will process as a binary library path.The result of automatic compilation is in the folder akg_kernel_meta.
The default compilation options are:
C++:
g++ -std=c++17 --shared -fPIC -D_GLIBCXX_USE_CXX11_ABI=0 -I./ -o $object_path, $source_path
CUDA 10:
nvcc --shared -Xcompiler -fPIC -O3 -gencode arch=compute_70, code=sm_70 --use_fast_math --expt-relaxed-constexpr -D_GLIBCXX_USE_CXX11_ABI=0 -I./ -o $object_path, $source_path
CUDA 11(or higher version):
nvcc --shared -Xcompiler -fPIC -O3 -gencode arch=compute_80, code=sm_80 --use_fast_math --expt-relaxed-constexpr -D_GLIBCXX_USE_CXX11_ABI=0 -I./ -o $object_path, $source_path
MindSpore requires the compilation option of
-D_ GLIBCXX_ USE_ CXX11_ ABI = 0
, so please avoid using a CUDA software stack with a version lower than 10.1.168 on GPU platforms.
Attributes and Intermediate Variables of AOT-type Custom Operators
Many commonly used operators have attributes, such as the kernel size, padding, and strides of the convlution operator.
Operators with different attribute values have the same computational logic, with the only difference being the values of the attributes during initialization.
In addition, during the calculation process of the operator, some additional memory spaces may be needed to store the intermediate variables.
The following calculation is an example. If we consider the input_1
and input_2
to calculate output
as follows:
tmp = Add(input_1, input_2)
output = ReduceSum(tmp, axis, keep_dims)
Here, we need to add the following intermediate variables and attributes to the operator in the computation function, including:
tmp
as an intermediate variable to record the intermediate result of addition;axis
as an attribute of typeint
, andkeep_dims
as an attribute of typebool
.
AOT-type custom operators provide functionality to add attributes, and then we can define a class of custom operators with a single source code. These operators have the same computational logic but achieve different computational effects by assigning values to the attributes during operator initialization. Additionally, to allow MindSpore to manage memory allocation and release, AOT-type custom operators provide interfaces to specify the size of intermediate variables, allowing MindSpore to allocate memory for computation.
Dynamic Shape Support for AOT-type Custom Operators
Dynamic Shape refers to that the shapes of inputs or outputs of an operator depends on the specific operation and cannot be calculated in advance at compile time. Specifically, there are two cases: the shapes of the operator's inputs are unknown at compile time, and the shapes of the operator's outputs depend on the specific input values. The case that the shapes of the operator's inputs are unknown at compile time is more common. Any operator, regardless of their own calculation logic, needs to support this case if it is used in a network that supports dynamic shape inputs.
Currently, the AOT-type custom operators support the dynamic shape scenario when the shape of the operator's input is unknown at compile time. This is achieved by defining a C++ version of the shape derivation function to support type derivation for custom operators in this scenario.
It should be noted that custom operators do not yet support dynamic shape scenarios where the shape of the operator output depends on the value of a specific input.
The Introduction AOT-type Custom Operator Advanced Usage Interface
Main Function
In the source code file, the main function of the operator implementation function must follow the following specifications:
extern "C" int FuncName(int nparam, void **params, int *ndims, int64_t **shapes, const char **dtypes, void *stream, void *extra);
The function name FuncName
can be replaced with any valid function name. The return value is of type int, with 0 indicating normal exit and non-zero indicating an exception. The meaning of the parameter list is as follows:
nparam (int): The total number of inputs, outputs, and intermediate variables. For example, if the operator has 2 inputs, 1 output, and 1 intermediate variable, then
nparam
is 4.params (void **): An array of pointers to inputs, outputs, and intermediate variables. For example, if the operator has 2 inputs, 1 output, and 1 intermediate variable, then
params[0]
points to the memory of the first input data,params[1]
points to the memory of the second input data,params[2]
points to the memory of the output data, andparams[3]
points to the memory of the intermediate variable.ndims (int *): An array of dimensions for inputs, output,s and intermediate variables. For example, if
params[i]
is a tensor with shape [1024, 1024], thenndims[i]
is 2.shapes (int64_t **): An array of shapes for inputs, outputs, and intermediate variables. For example, if
params[i]
is a tensor with shape [1024, 1024], thenshapes[i][0]
is 1024 andshapes[i][1]
is 1024.dtypes (const char **): An array of data types for inputs, outputs, and intermediate variables. The elements in
dtypes
can take values among the list "float32", "float16", "float", "float64", "int", "int8", "int16", "int32", "int64", "uint", "uint8", "uint16", "uint32", "uint64", and "bool".stream (void *): The pointer to a CUDA stream, only required for GPU operator implementation.
extra_void (void *): The pointer to a data structure related to attributes.
Initialization Function
To support operator attributes and intermediate variables, we need to define an operator initialization function. The definition of the operator initialization function must follow the following specifications:
extern "C" int FuncNameInit(int *ndims, int64_t **shapes, const char **dtypes, AotExtra *extra);
The function name FuncName
is the name of the operator main function. The return value is of type int, with 0 indicating normal exit and non-zero indicating an exception. The meaning of the parameter list is as follows:
ndims (int *): Array of dimensions for input and output shapes.
shapes (int64_t **): Array of shapes for inputs and outputs.
dtypes (const char **): Array of data types for inputs and outputs.
extra (AotExtra *): Custom operator extensions with attributes. The
AotExtra
type is defined in the header file custom_aot_extra.h provided by MindSpore.
Shape Inference Function
To support dynamic shape, a C++ version of the shape inference function needs to be added to the custom operator of Aot type. The definition of the operator shape inference function must meet the following specifications:
extern "C" std::vector<int64_t> FuncNameInferShape(int *ndims, int64_t **shapes, AotExtra *extra)
The function name FuncName
is the name of the operator main function.
The return value is of type std::vector<int64_t>
and represents the output shape.
The meaning of the parameter list is as follows:
ndims
(int *): Array of dimensions for input shapes.shapes
(int64_t **): Array of shapes for inputs.extra
(AotExtra *): Pointer to an extension for attribute-bearing custom operators. TheAotExtra
type is defined in the header file custom_aot_extra.h provided by MindSpore.
Operator Attribute Registration (Python)
The initialization of operator attributes is implemented through the operator registration function. For each attribute, we create an attr
for the operator registration file, setting the attribute name and value. The registration function is as follows:
def attr(self, name=None, param_type=None, value_type=None, default_value=None, **kwargs)
Please refer to the CustomRegOp interface documentation for the meaning of each parameter. When registering a custom operator of Aot type, we set the following four parameters:
name
: the name of the attribute of the AOT-type custom operator;param_type
: the parameter type of the attribute. For attributes of AOT-type custom operators, this input is fixed to be "required", which means it is a required parameter;value_type
: the numerical type of the attribute. For attributes of AOT-type custom operators, this input can be a specific numerical type or "all", which means no restrictions on the type;The last input needs to specify the input name as
value=
, and the input value is the value of the attribute.
Advanced Usage Example of AOT-type Custom Operator
Now we introduce the advanced usage of custom Aot operators using an example of a fused Add and ReduceSum operator. The operator first adds two inputs, and then performs sum operation along a certain axis. The basic calculation logic is as follows:
tmp = Add(input_1, input_2)
output = ReduceSum(tmp, axis, keep_dims)
Here, we need to add the following intermediate variables and attributes in the computation function, including:
tmp
is an intermediate variable that records the intermediate result of the addition;axis
is a property of typeint
, andkeep_dims
is a property of typebool
.
Operator Implementation File (C++/CUDA): kernel.cc
To implement the operator, we create a source file named kernel.cc
, which includes an operator attribute class add_reduce_kernel_attr
and three functions: CustomKernelInit
, CustomKernelInferShape
, and CustomKernel
.
Operator Attribute Class
First, we define a data structure to store operator attributes, which inherits from AotKernelData
.
AotKernelData
is the base class for custom operator attribute data structures.
By downloading the header file custom_aot_extra.h provided by MindSpore and placing it in the same directory as the source file, we can use the related interfaces by including it with #include "custom_aot_extra.h"
at the beginning of the file.
#include <vector>
#include "custom_aot_extra.h"
class add_reduce_kernel_attr : public AotKernelData {
public:
int64_t axis;
bool keep_dim;
};
Here, we define the following variables in the attribute class add_kernel
:
axis
: member variable, type isint64_t
;keep_dim
: member variable, type isbool
.
Operator Initialization Function
After defining the operator attribute class, we define the operator initialization function. Notice that the initialization function name here is CustomKernelInit
, and the corresponding prefix for the following functions should be CustomKernel
.
extern "C" int CustomKernelInit(int *ndims, int64_t **shapes, const char **dtypes, AotExtra *extra) {
size_t workspace_size = 1;
for (size_t i = 0; i < ndims[0]; i++) {
workspace_size *= shapes[0][i];
}
std::vector<size_t> workspace = {workspace_size * sizeof(float)};
extra->SetWorkSpace(workspace);
add_reduce_kernel_attr *kernel_data_ptr = new add_reduce_kernel_attr;
kernel_data_ptr->axis = extra->Attr<int64_t>("axis");
kernel_data_ptr->keep_dim = extra->Attr<bool>("keep_dim");
extra->SetKernelData(kernel_data_ptr);
return 0;
}
Here, we need a intermediate variable workspace
to record the intermediate result of addition. The method is as follows:
Calculate the memory size required for
workspace
: Since the size ofworkspace
is the same as that of the first input, we multiply the size of each dimension ofshapes[0]
to calculate the number of elements inworkspace
, and then multiply it bysizeof(float)
to get the memory size (assuming the element type is float by default).Store all the memory sizes of intermediate variables in a
std::vector<size_t>
object:std::vector<size_t> workspace = {workspace_size * sizeof(float)};
. Here, since there is only one intermediate variable, the vector has only one element.Set the memory size of the intermediate variable using the
SetWorkSpace
function ofAotExtra *extra
:extra->SetWorkSpace(workspace)
.
In addition, we need to obtain the values of two attributes, axis
and keep_dim
, as follows:
Create a pointer to an
add_reduce_kernel_attr
object:add_reduce_kernel_attr *kernel_ptr = new add_reduce_kernel_attr
.Retrieve the attribute values from
extra
and store them in the member variables ofkernel_ptr
:kernel_data_ptr->axis = extra->Attr<int64_t>("axis"); kernel_data_ptr->keep_dim = extra->Attr<bool>("keep_dim");
. Here,reduce_axis
andkeep_dim
are of typeint
andbool
respectively. We use the corresponding template function ofextra->Attr<T>(std::string name)
to obtain the value of the attribute with the given type.The supported types for
T
in step 2 arebool
,string
,int64_t
,float
,std::vector<int64_t>
,std::vector<float>
,std::vector<std::vector<int64_t>>
, andstd::vector<std::vector<float>>
.
Store
kernel_ptr
inextra
for use during operator calculation:extra->SetKernelData(kernel_ptr)
.
Operator Shape Inference Function
To define a dynamic shape scene, we define a C++ version of the operator shape inference function as follows. Notice that the operator shape inference function name is CustomKernelInferShape
, and shares the same prefix CustomKernel
with the initialization function name CustomKernelInit
.
#include <vector>
#include "custom_aot_extra.h"
extern "C" std::vector<int64_t> CustomKernelInferShape(int *ndims, int64_t **shapes, AotExtra *extra) {
const int64_t kDynRankSize = -2;
if (shapes[0][0] == kDynRankSize) {
return std::vector<int64_t>{shapes[0][0]};
}
int64_t axis = extra->Attr<int64_t>("axis");
bool keep_dim = extra->Attr<bool>("keep_dim");
if (keep_dim) {
if (axis == 0) {
return std::vector<int64_t>{1, shapes[0][1]};
} else {
return std::vector<int64_t>{shapes[0][0], 1};
}
} else {
return std::vector<int64_t>{shapes[0][1 - axis]};
}
}
In the above example, we need to note the following:
According to the MindSpore specifications, dynamic shape inputs includes two cases: the dynamic shape case and the dynamic rank case, with corresponding shape inputs as follows:
the dynamic shape case: If the size of a certain dimension of the input is unknown, it is represented by -1. For example, the shape of the input is [1024, -1, 1024], which indicates that the input is a three-dimensional tensor with dimensions of 1024 and -1 for the second dimension;
the dynamic rank case: The number of dimensions of the input is unknown, and the shape of the input is fixed as [-2, ].
To support C++ shape inference functions, we need to handle cases when inputs are either dynamic shape or dynamic rank. For example, in the above example, if the input is of dynamic rank, the output will also be of dynamic rank. Therefore, when we find that the input is [-2, ], we directly return [-2, ].
For scenarios where the output shape depends on attributes, you can use the
extra->Attr<T>(std::string name)
template interface to obtain attributes.
Operator Computation Function (Main Function)
The interface specification of the operator computation function is the same as that of a custom operator without attributes.
It is worth noting that the operator main function name CustomKernel
needs to be the same as the prefix of the initialization function name CustomKernelInit
and the operator shape inference function name CustomKernelInferShape
mentioned above.
The main function, together with the above two functions, forms the source file kernel.cc
.
extern "C" int CustomKernel(int nparam, void **params, int *ndims, int64_t **shapes, const char **dtypes, void *stream,
void *extra_void) {
constexpr int OUTPUT_INDEX = 2;
float *input_1 = static_cast<float *>(params[0]);
float *input_2 = static_cast<float *>(params[1]);
float *output = static_cast<float *>(params[2]);
float *tmp = static_cast<float *>(params[3]);
// Add
int in_size = 1;
for (int i = 0; i < ndims[OUTPUT_INDEX]; i++) {
in_size *= shapes[OUTPUT_INDEX][i];
}
for (int i = 0; i < in_size; i++) {
tmp[i] = input_1[i] + input_2[i];
}
// ReduceSum
AotExtra *extra = static_cast<AotExtra *>(extra_void);
auto kernel_ptr = static_cast<add_reduce_kernel_attr *>(extra->KernelData());
bool keep_dim = kernel_ptr->keep_dim;
int64_t axis = kernel_ptr->axis;
int64_t input_dim_1 = shapes[0][1];
int size;
if (keep_dim) {
size = shapes[1][0] * shapes[1][1];
} else {
size = shapes[1][0];
}
int ext = shapes[0][axis];
for (int i = 0; i < size; i++) {
output[i] = 0;
for (int j = 0; j < ext; j++) {
int idx = input_dim_1 * (i * axis + j * (1 - axis)) + i * (1 - axis) + j * axis;
output[i] = output[i] + tmp[idx];
}
}
return 0;
}
In the computation of Add, we used the intermediate variable of the operator, and the method is as follows:
Convert the pointers in the
params
array tofloat *
one by one. According to the introduction of the interface above, the elements in the array are: two input address pointers (input_1
andinput_2
), an output address pointer (output
), and an intermediate variable address pointer (tmp
);Store the result of adding the two inputs into the intermediate variable:
tmp[i] = input_1[i] + input_2[i]
.
In the computation of ReduceSum, we used the attribute value of the operator, and the method is as follows:
Convert the
extra_void
type to aAotExtra
type pointer:AotExtra *extra = static_cast<AotExtra *>(extra_void)
.Get the
kernel_ptr
object pointer created in the initialization function fromextra
:auto kernel_ptr = static_cast<add_reduce_kernel_attr *>(extra->KernelData())
. Here,extra->KernelData()
obtains a void object pointer, which needs to be further converted to thekernel_ptr
object pointer.Use the attribute values stored in
kernel_ptr
for calculation:bool keep_dim = kernel_ptr->keep_dim; int64_t axis = kernel_ptr->axis;
. Here, we obtain the variableskeep_dim
andaxis
fromkernel_ptr
for computation.
Operator Definition File: test_custom_aot.py
To add AOT-type custom operator to a MindSpore network using the above functions, we create the file test_custom_aot.py
.
import numpy as np
from mindspore import Tensor
from mindspore import dtype as mstype
from mindspore.nn import Cell
import mindspore as ms
import mindspore.ops as ops
from mindspore.ops import DataType, CustomRegOp
class ReduceDynNet(Cell):
def __init__(self, out_types, axis, keep_dim):
super(ReduceDynNet, self).__init__()
reduce_cpu_info = CustomRegOp("reduce_kernel_cpu") \
.input(0, "x1") \
.input(0, "x2") \
.output(0, "y") \
.dtype_format(DataType.None_None, DataType.None_None, DataType.None_None) \
.attr("axis", "required", "all", value=axis) \
.attr("keep_dim", "required", "all", value=keep_dim) \
.target("CPU") \
.get_op_info()
# As the shape inference function of C++ version is defined above, the ouptut_shape can be 'None'
self.program = ops.Custom("./kernel.cc:CustomKernel", None, out_types, "aot", reg_info=reduce_cpu_info)
def construct(self, x, y):
return self.program(x, y)
The ReduceDynNet
in this file includes two parts: the operator registration function and the operator definition class.
Operator Registration
The assignment of operator attributes during initialization is implemented through the operator registration function.
For the function of custom operator registration, please refer to the relevant documentation of CustomRegOp.
For each attribute, we create an attr
for the operator registration file reduce_cpu_info
, setting the attribute name and value.
Each attr
item here has four inputs: the first is the name, such as "axis"
or "keep_dim"
; the middle two are "required"
and "all"
; the last input needs to specify the input name as value=
, and the input value is the value of the attribute, for example, value=axis
and value=keep_dim
here.
We determine the values of these two parameters from the network input, and these values should match the types used in the extra->Attr<T>
template interface in the initialization function and shape inference function above.
In addition, if we need to define multiple operator registration files, we need to use different operator file names, which is the argument of CustomRegOp
, here it is "add_with_attr_kernel_cpu"
. If we want to define another operator with the same prototype but different attribute values, the name cannot be duplicated.
Operator Definition
In the Python file above, a custom operator of type aot
is defined using the interface Custom
of MindSpore: self.program = ops.Custom("./kernel.cc:CustomKernel", None, out_types, "aot", reg_info=reduce_cpu_info)
. Since we defined the C++ version of the shape inference function earlier, ouptut_shape
can be set to None
.
Notice that in the operator definition, we directly use the source file name ./kernel.cc
, so we are utilizing the automatic compilation feature provided by MindSpore. Make sure that the corresponding compiler (g++ in this case, and nvcc for GPU environment) is available in the environment.
Operator Call
As a test, we add the __main__
function to the test_custom_aot.py
file:
if __name__ == "__main__":
shape = (4, 5)
axis = 1
keep_dim = False
ms.set_device(device_target="CPU")
input_x = np.ones(shape).astype(np.float32)
input_y = np.ones(shape).astype(np.float32)
test = ReduceDynNet(mstype.float32, axis, keep_dim)
dyn_x = Tensor(shape=[4, None], dtype=mstype.float32)
# set the net to dynamic shape
test.set_inputs(dyn_x, dyn_x)
output = test(Tensor(input_x),Tensor(input_y))
print(output)
Execute the file to call the operator:
python test_custom_aot.py
Execution result is as follows:
[10. 10. 10. 10.]
Introduction to Multi-Output Custom Operators of the AOT Type
Custom operators of the AOT-type support multiple outputs (outputs as tuples). The definition of the operator file for a custom operator with multiple outputs is similar to that of a single-output operator, but corresponding modifications need to be made based on the multi-output scenario, including:
Operator inference function: The output of the
infer
function needs to be written in the form of a tuple;Operator registration file: The names and data type information of multiple outputs need to be listed;
Operator computation function: It needs to identify the pointers corresponding to multiple outputs.
Below, we demonstrate the method of defining a custom operator of the AOT-type with multiple outputs using an example. For specific file usage, please refer to here.
Operator Inference Function
In the case of multiple outputs, the operator inference function should be written in the form of a tuple.
Taking the case where the output shapes are constants as an example, the out_shapes
in the custom operator below is ([3], [3], [3])
,
and out_dtypes
is (mstype.float32, mstype.float32, mstype.float32)
, which correspond to the shapes and data types of the three outputs, respectively.
self.program = ops.Custom(func, ([3], [3], [3]), (mstype.float32, mstype.float32, mstype.float32), "aot", bprop, reg)
Operator Registering Function
When defining multiple outputs, we need to clearly specify the names of the inputs and outputs in sequence, and indicate the corresponding data formats for both inputs and outputs in the dtype_format
section. For example:
multioutput_gpu_info = CustomRegOp() \
.input(0, "x1") \
.input(1, "x2") \
.output(0, "y1") \
.output(1, "y2") \
.output(2, "y3") \
.dtype_format(DataType.F32_Default, DataType.F32_Default,
DataType.F32_Default, DataType.F32_Default, DataType.F32_Default) \
.target("GPU") \
.get_op_info()
Here, we define a registration file for an operator with two inputs and three outputs. Therefore, we add two input
items and three output
items in the registration file. Additionally, the five data formats defined in dtype_format
correspond to the data format requirements for the two inputs and three outputs in sequence.
Operator Computation Function
The following CustomAddMulDiv
function is the computation function of the custom op.
constexpr int THREADS = 1024;
__global__ void CustomAddMulDivKernel(float *input1, float *input2, float *output1, float *output2, float *output3,
size_t size) {
auto idx = blockIdx.x * THREADS + threadIdx.x;
if (idx < size) {
output1[idx] = input1[idx] + input2[idx];
output2[idx] = input1[idx] * input2[idx];
output3[idx] = input1[idx] / input2[idx];
}
}
extern "C" int CustomAddMulDiv(int nparam, void **params, int *ndims, int64_t **shapes, const char **dtypes,
void *stream, void *extra) {
cudaStream_t custream = static_cast<cudaStream_t>(stream);
constexpr int OUTPUT_INDEX = 2;
constexpr int TOTAL_PARAM_NUM = 5;
// There are two inputs and three outputs, so the nparam should be 5.
if (nparam != TOTAL_PARAM_NUM) {
return 1;
}
// This is to check if the type of parameters the same as what the user wants.
for (int i = 0; i < nparam; i++) {
if (strcmp(dtypes[i], "float32") != 0) {
return 2;
}
}
// input1's index is 0, input2's index is 1, output1's index is 2, output2's index is 3 and output3's index is 4
void *input1 = params[0];
void *input2 = params[1];
void *output1 = params[2];
void *output2 = params[3];
void *output3 = params[4];
size_t size = 1;
// Cumprod of output's shape to compute elements' num
for (int i = 0; i < ndims[OUTPUT_INDEX]; i++) {
size *= shapes[OUTPUT_INDEX][i];
}
int n = size / THREADS;
// Do the computation
CustomAddMulDivKernel<<<n + 1, THREADS, 0, custream>>>(static_cast<float *>(input1), static_cast<float *>(input2),
static_cast<float *>(output1), static_cast<float *>(output2),
static_cast<float *>(output3), size);
// When return 0, MindSpore will continue to run if this kernel could launch successfully.
return 0;
}
Please note that since the operator has two inputs and three outputs, nparam
should be 5, and the five pointers in the params
array should correspond to the two inputs and three outputs in sequence.
Therefore, in the above code, we obtain the inputs and outputs as follows:
void *input1 = params[0];
void *input2 = params[1];
void *output1 = params[2];
void *output2 = params[3];
void *output3 = params[4];
For the complete operator computation file, please refer to here.
Operator in Scripts
When a custom operator with multiple outputs is involved in a net, the results can be used as a normal tuple, for example:
class AOTMultiOutputNet(Cell):
def __init__(self, func, out_shapes, out_types, bprop=None, reg=None):
super(AOTMultiOutputNet, self).__init__()
self.program = ops.Custom(func, out_shapes, out_types, "aot", bprop, reg)
self.add = ops.Add()
self.mul = ops.Mul()
def construct(self, x, y):
aot = self.program(x, y)
add_res = self.add(aot[0], aot[1])
mul_res = self.mul(add_res, aot[2])
return mul_res
if __name__ == "__main__":
x = np.array([1.0, 1.0, 1.0]).astype(np.float32)
y = np.array([1.0, 1.0, 1.0]).astype(np.float32)
net = AOTMultiOutputNet("./add_mul_div.cu:CustomAddMulDiv", ([3], [3], [3]),
(mstype.float32, mstype.float32, mstype.float32), reg=multioutput_gpu_info)
output = test(Tensor(input_x),Tensor(input_y))
print(output)
Here aot
as the output of the multi-output custom operator can be used as a tuple. Execution result is as follows:
[3. 3. 3.]