Custom Fusion
Overview
Operator fusion combines multiple independent operators into a larger, more complex operator to reduce runtime memory accesses and improve computational efficiency. This approach avoids storing and transferring intermediate results, effectively reducing memory access overhead. Additionally, fusing multiple operators reduces the number of kernel launches and their scheduling overhead, which can significantly enhance computational efficiency on parallel computing devices like NPUs.
Currently, MindSpore supports two fusion methods:
Setting jit_level=O1 to enable graph kernel fusion: This feature automatically decomposes complex, large operators into smaller basic operators and fuses them into fusion operators according to specified fusion rules. The low-level implementation of the fusion operators is then automatically generated by AKG. (A usage sketch follows this list.)
Fusion through Pass, enabled by default: This feature automatically combines multiple consecutive small operators in the model that meet specific conditions into a single fusion operator. Each fusion operator corresponds to a fusion Pass; after the MindSpore IR graph goes through the fusion Pass, the matched operators are replaced with the fused operator. MindSpore provides a wide range of operator fusion optimization Passes, extracted and summarized from common user requirements, which meet the needs of most users.
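For illustration, the first method can be enabled from Python. This is a minimal sketch, assuming a MindSpore version whose mindspore.set_context accepts a jit_config dictionary with a jit_level key (check your version's documentation for the exact knob):

```python
import numpy as np
import mindspore as ms
from mindspore import nn, ops, Tensor

# Enable graph kernel fusion by raising the JIT level to O1
# (assumes jit_config={"jit_level": "O1"} is supported by your MindSpore version).
ms.set_context(mode=ms.GRAPH_MODE, jit_config={"jit_level": "O1"})

class Net(nn.Cell):
    def construct(self, x):
        # Consecutive element-wise operators like these are candidates for fusion.
        return ops.exp(x) * ops.sqrt(x) + x

net = Net()
out = net(Tensor(np.ones((4, 4), np.float32)))
print(out.shape)
```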
In practical network debugging, users may want to manually control operator fusion in scenarios such as:
Network debugging: Users can toggle the fusion switches to suit their own scenarios, for example excluding fusion operators that perform poorly in certain cases or applying more aggressive fusion strategies to speed up network computation.
Addressing accuracy issues: When encountering accuracy problems, users can disable certain operator fusions to narrow down the issue and identify the specific operators causing the accuracy deviation.
To support these scenarios, MindSpore provides interfaces related to the fusion operator optimization Passes, allowing users to customize fusion strategies and toggle individual fusion Passes for debugging.
Debugging Interfaces
Currently, operator fusion-related optimization passes are included in the graph kernel optimization module. The environment variable MS_DEV_GRAPH_KERNEL_FLAGS controls the switches for the related graph optimization passes, including:
Specifying Optimization Level
opt_level: Set the optimization level. Default: 2. Graph kernel fusion can be enabled equivalently by setting opt_level to a value greater than 0 (see the example after this list).
0: disables graph kernel fusion;
1: enables basic operator fusion;
2: includes all optimizations of level 1 and turns on more optimizations such as CSE, arithmetic simplification, and so on;
3: includes all optimizations of level 2 and turns on more optimizations such as StitchingFusion, ParallelFusion, and so on. Optimizations at this level are aggressive and can be unstable in some scenarios; use this level with caution.
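As an illustration, the optimization level can be raised through the environment variable. This is a minimal sketch; the flag syntax is described in the note further below, and the choice of opt_level=3 here is just an example:

```python
import os

# Set the flag before MindSpore is imported so it is read at initialization.
os.environ["MS_DEV_GRAPH_KERNEL_FLAGS"] = "--opt_level=3"

import mindspore  # imported after the flag is set
```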
Specifying Automatic Fusion Strategy
enable_expand_ops: Forcefully expand operators that are not in the default list; this requires an expander implementation for the corresponding operator. For example, setting --enable_expand_ops=Square forces the Square operator to be expanded. The list of operators expanded by default can be found in Appendix 1. (See the combined example after this list.)
disable_expand_ops: Disable the expansion of the specified operators.
enable_expand_ops_only: Allow only the specified operators to be expanded. When this option is set, the above two options are ignored.
enable_cluster_ops: Add the specified operators to the set of operators participating in fusion, on top of the default fusion operator list. For example, setting --enable_cluster_ops=MatMul allows the MatMul operator to participate in fusion. The list of operators fused by default can be found in Appendix 2.
disable_cluster_ops: Prevent the specified operators from participating in fusion.
enable_cluster_ops_only: Allow only the specified operators to participate in fusion. When this option is set, the above two options are ignored.
enable_packet_ops_only: When enabling the kernel packet feature, this option restricts fusion to the specified operators only.
disable_packet_ops: When enabling the kernel packet feature, this option prohibits fusion for the specified operators.
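A sketch of combining these strategy options; the operator choices (Square, MatMul, Sqrt) are illustrative only and should be replaced with operators relevant to your network:

```python
import os

# Hypothetical strategy tweaks: expand Square in addition to the defaults,
# let MatMul join the fusion set, and keep Sqrt out of it.
os.environ["MS_DEV_GRAPH_KERNEL_FLAGS"] = (
    "--enable_expand_ops=Square "
    "--enable_cluster_ops=MatMul "
    "--disable_cluster_ops=Sqrt"
)

import mindspore  # imported after the flags are set
```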
Enabling or Disabling Automatic/Manual Fusion Pass
enable_pass: Use this option to enable passes that are disabled by default.
disable_pass: Use this option to disable passes that are enabled by default. (A sketch of toggling a pass by name follows.)
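For example, a single suspect pass can be switched off by name when chasing an accuracy deviation. This is a minimal sketch using a pass name taken from Appendix 3; replace it with the pass you are investigating:

```python
import os

# Turn off one fusion pass by name; --enable_pass works the same way for
# passes that are disabled by default.
os.environ["MS_DEV_GRAPH_KERNEL_FLAGS"] = "--disable_pass=add_rms_norm_fusion"

import mindspore  # imported after the flag is set
```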
Enabling Debug Info
dump_as_text: Save detailed information about key processes as text files in the graph_kernel_dump directory. Default value: False.
enable_debug_mode: Insert synchronization points before and after the graph kernel launch and print debugging information if the launch fails. Supported only on the GPU backend. Default value: False.
Note: flags use the format "--key=value"; separate multiple configuration items with spaces and multiple values with commas, for example: export MS_DEV_GRAPH_KERNEL_FLAGS='--enable_expand_ops=Square --enable_cluster_ops=MatMul,Add'
Obtaining Pass Names
During debugging, users can obtain the relevant pass names in two ways, or refer to the list of supported passes in Appendix 3.
Through IR Names
If users have dumped the relevant IR, they can obtain the fusion pass name from the IR file name. For example, from the IR file name hwopt_ge_unify_mindir_pm_44_add_layer_norm_fusion_0559.ir, the pass name add_layer_norm_fusion can be extracted. (If no IR files have been dumped yet, the sketch below shows one way to enable IR dumping.)
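A minimal sketch for enabling IR dumping, assuming your MindSpore version accepts the save_graphs and save_graphs_path arguments of set_context (newer versions may also offer an environment-variable switch):

```python
import mindspore as ms

# Dump IR files into ./ir_dump; per-pass *.ir file names contain the fusion pass
# name, e.g. hwopt_..._add_layer_norm_fusion_0559.ir -> add_layer_norm_fusion.
ms.set_context(save_graphs=True, save_graphs_path="./ir_dump")
```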
Through INFO Messages
In [INFO] messages, we provide a list of all passes that support custom switches. Users can generate [INFO] messages by setting export GLOG_v=1 and then search the output for graph kernel pass to obtain this list. For example, in the following message, the names of all passes that can be customized are listed after graph kernel passes: (a sketch for extracting them programmatically follows the message).
[INFO] PRE_ACT(631369,ffffb5450af0,python):2024-08-22-15:34:16.978.158 [mindspore/ccsrc/plugin/device/ascend/optimizer/backend_common_unify_mindir.cc:191] GetBackendFusionGroupPassManager] graph kernel passes: FlashAttentionFusionV1,FlashAttentionFusionV2,add_layer_norm_fusion,add_layer_norm_v3_fusion,add_layer_norm_ext_fusion,inference_swiglu_fusion,inference_matmul_split_fusion,shape_reshape,shape_reshape_2,add_rms_norm_quant_fusion,rms_norm_quant_fusion,add_rms_norm_fusion,add_cast_rms_norm_cast_fusion,MatMulAllReduce,split_concat_fusion,matmul_elem_biasadd_fusion,matmul_elem_add_fusion,matmul_elem_relu_fusion,matmul_elem_gelu_fusion,inference_qbmm_add_fusion,inference_qbmm_allreduce_add_fusion.
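Once such a message has been captured (for example by redirecting the program output to a file), the pass list can be pulled out with a small script. A sketch, where the log file name train.log is hypothetical:

```python
import re

# Extract the comma-separated pass list that follows "graph kernel passes:" in an
# INFO-level log previously captured to train.log (hypothetical file name).
with open("train.log", encoding="utf-8") as f:
    log_text = f.read()

match = re.search(r"graph kernel passes:\s*([\w,]+)", log_text)
if match:
    for name in match.group(1).split(","):
        print(name)
```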
For individual passes, users can also confirm whether they are enabled through log messages. For example:
Enabled Pass: The following message indicates that rms_norm_quant_fusion is enabled and can be disabled using disable_pass.
[INFO] GRAPH_KERNEL(631369,ffffb5450af0,python):2024-08-22-15:34:17.640.739 [mindspore/ccsrc/backend/common/graph_kernel/core/graph_kernel_pass_manager.cc:84] RunPass] Run graph kernel pass fusion_group_10_rms_norm_quant_fusion in 74.64 us
Disabled Pass: The following message indicates that add_rms_norm_fusion is disabled and can be enabled using enable_pass.
[INFO] GRAPH_KERNEL(631369,ffffb5450af0,python):2024-08-22-15:34:17.640.771 [mindspore/ccsrc/backend/common/graph_kernel/core/graph_kernel_pass_manager.cc:73] Run] graph kernel pass fusion_group_11_add_rms_norm_fusion is disabled.
Appendix 1: List of Default Expander Operators for Relevant Backends
Note: This list is provided for reference only and subject to change.
| operator name | Ascend | CPU | GPU |
| --- | --- | --- | --- |
| Adam | Y | Y | N |
| AdamApplyOneWithDecayAssign | Y | N | N |
| Addcmul | Y | N | N |
| AddN | Y | Y | Y |
| BiasAdd | Y | Y | Y |
| BiasAddGrad | Y | Y | Y |
| FillV2 | Y | N | N |
| GeLU | Y | Y | Y |
| Gelu | Y | Y | Y |
| FastGelu | Y | N | N |
| FastGeluGrad | Y | N | N |
| FastGeLU | Y | N | N |
| FastGeLUGrad | Y | N | N |
| SiLU | Y | N | N |
| SiLUGrad | Y | N | N |
| GeLUGrad | Y | Y | Y |
| RsqrtGrad | Y | N | N |
| SqrtGrad | Y | Y | Y |
| Square | Y | Y | Y |
| Tile | Y | Y | Y |
| ClipByNormNoDivSum | Y | N | N |
| FusedMulAdd | Y | N | N |
| Sigmoid | Y | N | Y |
| SigmoidGrad | Y | N | Y |
| SigmoidCrossEntropyWithLogits | Y | N | Y |
| SigmoidCrossEntropyWithLogitsGrad | Y | N | Y |
| SquaredDifference | Y | N | Y |
| TanhGrad | Y | Y | N |
| OnesLike | Y | Y | Y |
| ZerosLike | Y | N | N |
| ReduceMean | Y | N | Y |
| LogSoftmaxGrad | N | N | Y |
| ReLU | Y | Y | Y |
| ReluGrad | Y | N | Y |
| AssignAdd | Y | Y | Y |
| LambApplyOptimizerAssign | Y | N | N |
| LambApplyWeightAssign | Y | N | N |
| AdamApplyOneWithDecay | Y | N | N |
| ExpandDims | N | Y | Y |
| Squeeze | N | N | Y |
| SoftmaxGradExt | N | N | N |
| ApplyMomentum | N | N | N |
| LeakyReLUExt | Y | N | N |
| EluExt | Y | N | N |
| SoftplusExt | Y | N | N |
| SoftplusGradExt | Y | N | N |
| RepeatInterleaveInt | Y | N | N |
| HShrink | Y | N | N |
| HSigmoid | Y | N | N |
| HSwish | Y | N | N |
| BinaryCrossEntropy | Y | N | N |
| Erf | Y | N | N |
| Tanh | Y | N | N |
| Cosh | Y | N | N |
| Sinh | Y | N | N |
| ClampScalar | Y | N | N |
| DivMod | Y | N | N |
| BCEWithLogitsLoss | Y | N | N |
| AcoshExt | Y | N | N |
| AsinhExt | Y | N | N |
| MeanExt | Y | N | N |
| Erfc | N | N | Y |
| AdamWeightDecay | N | N | Y |
| BatchMatMul | N | N | Y |
| Dropout | N | N | Y |
| DropoutGrad | N | N | Y |
| MaximumGrad | N | Y | Y |
| MinimumGrad | N | Y | Y |
| LayerNorm | N | N | Y |
| LayerNormGrad | N | N | Y |
| LogSoftmax | N | N | Y |
| MatMul | N | N | Y |
| ArgMaxWithValue | N | N | Y |
| ArgMinWithValue | N | N | Y |
| Slice | N | N | Y |
| Softmax | N | N | Y |
| SoftmaxCrossEntropyWithLogits | N | N | Y |
| EqualCount | N | N | Y |
| SquareSumAll | N | N | Y |
| IdentityMath | N | N | Y |
| StandardNormal | N | N | Y |
| Softplus | N | Y | N |
| SoftplusGrad | N | Y | N |
Appendix 2: List of Default Cluster Operators for Relevant Backends
Note: This list is provided for reference only and subject to change.
| operator name | Ascend | CPU | GPU |
| --- | --- | --- | --- |
| Abs | Y | Y | Y |
| Add | Y | Y | Y |
| BroadcastTo | Y | N | N |
| Cast | Y | Y | Y |
| Exp | Y | Y | Y |
| Log | Y | Y | Y |
| Maximum | Y | Y | Y |
| Minimum | Y | Y | Y |
| Mul | Y | Y | Y |
| Neg | Y | Y | Y |
| Pow | Y | Y | Y |
| Div | Y | N | Y |
| RealDiv | Y | Y | Y |
| Reciprocal | Y | Y | Y |
| Rsqrt | Y | Y | Y |
| Sqrt | Y | Y | Y |
| Sub | Y | Y | Y |
| Equal | Y | Y | Y |
| NotEqual | Y | N | Y |
| Greater | Y | N | Y |
| GreaterEqual | Y | N | Y |
| Less | Y | Y | Y |
| LessEqual | Y | Y | Y |
| LogicalAnd | Y | N | Y |
| LogicalOr | Y | N | Y |
| LogicalNot | Y | Y | Y |
| Select | Y | Y | Y |
| Assign | Y | N | Y |
| ReduceSum | Y | Y | Y |
| IsFinite | Y | N | Y |
| Reshape | N | Y | Y |
| Transpose | Y | Y | Y |
| Floor | Y | N | Y |
| Ceil | Y | N | N |
| Trunc | Y | N | Y |
| Round | N | Y | Y |
| Tanh | N | Y | Y |
| ACos | N | N | Y |
| Acosh | N | N | Y |
| ArgMax | N | N | N |
| Argmin | N | N | N |
| Asin | N | N | Y |
| Asinh | N | N | Y |
| Atan | N | N | Y |
| Atan2 | N | N | Y |
| Cos | N | N | Y |
| Erf | N | N | Y |
| Expm1 | N | N | Y |
| FloorDiv | N | N | Y |
| FloorMod | N | N | Y |
| IsInf | N | N | Y |
| IsNan | N | N | Y |
| Mod | N | Y | Y |
| ReduceMax | N | Y | Y |
| ReduceMin | N | N | Y |
| Sign | N | N | Y |
| Sin | N | N | Y |
| StridedSlice | N | N | Y |
| CumSum | N | N | Y |
| OneHot | N | N | Y |
Appendix 3: List of Enabled Passes for Relevant Backends
Note: This list is provided for reference only and subject to change. The actual enabled passes should be determined using the methods described above.
| Pass Names | Backend |
| --- | --- |
| FlashAttentionFusionV1 | Ascend |
| FlashAttentionFusionV2 | Ascend |
| add_layer_norm_fusion | Ascend |
| add_layer_norm_v3_fusion | Ascend |
| add_layer_norm_ext_fusion | Ascend |
| inference_swiglu_fusion | Ascend |
| inference_matmul_split_fusion | Ascend |
| shape_reshape | Ascend |
| shape_reshape_2 | Ascend |
| add_rms_norm_quant_fusion | Ascend |
| rms_norm_quant_fusion | Ascend |
| add_rms_norm_fusion | Ascend |
| add_cast_rms_norm_cast_fusion | Ascend |
| MatMulAllReduce | Ascend |
| split_concat_fusion | Ascend |
| matmul_elem_biasadd_fusion | Ascend |
| matmul_elem_add_fusion | Ascend |
| matmul_elem_relu_fusion | Ascend |
| matmul_elem_gelu_fusion | Ascend |
| inference_qbmm_add_fusion | Ascend |
| inference_qbmm_allreduce_add_fusion | Ascend |