Custom Fusion

Overview

Operator fusion combines multiple independent operators into a larger, more complex operator to reduce runtime memory accesses and improve computational efficiency. This approach minimizes the storage and transmission of intermediate results, effectively reducing memory access overhead. Additionally, fusing multiple operators reduces the number of computations, which can significantly enhance computational efficiency on parallel computing devices like NPUs.

Currently, MindSpore supports two fusion methods:

  1. Setting jit_level=O1 to enable graph kernel fusion:

    This feature automatically decomposes complex, large operators into smaller basic operators and fuses them into fusion operators according to specified fusion rules. The low-level implementation of the fusion operators is then automatically generated by AKG.

  2. Fusion through Pass, enabled by default:

    This feature automatically combines multiple consecutive small operators in the model that meet specific conditions into a single fusion operator. Each fusion operator corresponds to a fusion Pass. After the IR graph of MindSpore passes through the fusion Pass, the operators are replaced with the fused operators. MindSpore provides a wide range of operator fusion optimization Passes. These fusion operators are extracted and summarized based on common user requirements, meeting the needs of most users.

In practical network debugging, users may want to manually control the fusion of operators for scenarios such as:

  • Network debugging:

    Users can manually control the fusion switches to suit their own scenarios. For example, they can exclude fusion operators that perform poorly in certain scenarios, or adopt more aggressive fusion strategies to speed up network computation.

  • Addressing accuracy issues:

    When encountering accuracy problems, users can disable certain operator fusions to locate issues and identify specific operators causing the accuracy deviation.

To support these scenarios, MindSpore provides interfaces for the fusion optimization passes, allowing users to toggle individual fusion passes and customize fusion strategies for debugging.

Debugging Interfaces

Currently, the optimization passes related to operator fusion are part of the graph kernel optimization module. The environment variable MS_DEV_GRAPH_KERNEL_FLAGS controls the switches for these graph optimization passes, including:

Specifying Optimization Level

  • opt_level: Sets the optimization level. Default: 2. Setting opt_level to a value greater than 0 enables graph kernel fusion.

    • 0: disables graph kernel fusion;

    • 1: enables the basic fusion of operators;

    • 2: includes all optimizations of level 1, and turns on more optimizations such as CSE, arithmetic simplification and so on;

    • 3: includes all optimizations of level 2, and turns on more optimizations such as StitchingFusion, ParallelFusion and so on. Optimizations at this level are aggressive and may be unstable in some scenarios. Use this level with caution.

Specifying Automatic Fusion Strategy

  • enable_expand_ops: Forcefully expand operators that are not in the default list, requiring an expander implementation for the corresponding operator. For example, setting --enable_expand_ops=Square forces the Square operator to expand. The list of default expanded operators can be found in Appendix 1.

  • disable_expand_ops: Disable the expansion of the specified operators.

  • enable_expand_ops_only: Allow only the specified operators to expand. When this option is set, the above two options are ignored.

  • enable_cluster_ops: Add specified operators to the set of operators participating in fusion based on the default fusion operator list. For example, setting --enable_cluster_ops=MatMul allows the MatMul operator to participate in fusion. The list of default fusion operators can be found in Appendix 2.

  • disable_cluster_ops: Prevent the specified operators from participating in the fusion set.

  • enable_cluster_ops_only: Allow only the specified operators to participate in the fusion set. When this option is set, the above two options are ignored.

  • enable_packet_ops_only: When enabling the kernel packet feature, this option restricts fusion to the specified operators only.

  • disable_packet_ops: When enabling the kernel packet feature, this option prohibits fusion for the specified operators.
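The precedence among the enable, disable and "only" options described above can be modelled with plain sets. This is a minimal sketch of the documented behaviour, not MindSpore's implementation: the function name effective_ops and the three-operator default list are hypothetical, and letting the disable option win when an operator appears in both the enable and disable lists is an assumption the text does not confirm.

```python
def effective_ops(default_ops, enable=(), disable=(), only=()):
    """Model which operators take part, given the three kinds of options."""
    if only:
        # An *_only option makes the enable/disable options irrelevant.
        return set(only)
    # Otherwise start from the defaults, add the enabled ops, drop the disabled ones.
    # Assumption: disable wins if an operator is listed in both.
    return (set(default_ops) | set(enable)) - set(disable)


# Illustrative defaults only; the real default lists are in the appendices.
default_cluster_ops = {"Add", "Mul", "Sub"}

with_matmul = effective_ops(default_cluster_ops, enable=["MatMul"])
without_mul = effective_ops(default_cluster_ops, disable=["Mul"])
only_add = effective_ops(default_cluster_ops, enable=["MatMul"], only=["Add"])

print(sorted(with_matmul))
print(sorted(without_mul))
print(sorted(only_add))
```

The third call shows the "only" option overriding the enable option, as stated for enable_expand_ops_only and enable_cluster_ops_only.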

Enabling or Disabling Automatic/Manual Fusion Pass

  • enable_pass: Enable passes that are disabled by default using this option.

  • disable_pass: Disable passes that are enabled by default using this option.
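When tracking down an accuracy deviation, one possible workflow is to run the network several times, disabling one fusion pass per run. The sketch below only generates the flag strings for such a sweep. The helper name disable_pass_sweep is hypothetical, and the two pass names are taken from Appendix 3 purely as examples.

```python
def disable_pass_sweep(pass_names):
    """Yield one flag string per pass, each disabling exactly that pass."""
    for name in pass_names:
        yield f"--disable_pass={name}"


candidates = ["add_layer_norm_fusion", "rms_norm_quant_fusion"]
sweep = list(disable_pass_sweep(candidates))

for flags in sweep:
    # Each string would be exported as MS_DEV_GRAPH_KERNEL_FLAGS for one run.
    print(flags)
```

Comparing the results of each run against a baseline would then point to the pass whose removal makes the deviation disappear.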

Enabling Debug Information

  • dump_as_text: Save detailed information about key processes as text files in the graph_kernel_dump directory. Default value: False.

  • enable_debug_mode: Insert synchronization points before and after the graph kernelmod launch, and print debugging information if the launch fails. This is supported only for the GPU backend. Default value: False.

Note: Options are written in the format "--key=value". Multiple options are separated by spaces, and multiple values for one option are separated by commas. For example: export MS_DEV_GRAPH_KERNEL_FLAGS='--enable_expand_ops=Square --enable_cluster_ops=MatMul,Add'
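The same string can also be assembled from Python, which avoids quoting mistakes when many options are combined. This is a minimal sketch: build_graph_kernel_flags is a hypothetical helper, not a MindSpore API, and it assumes the variable is set in the process environment before MindSpore reads it.

```python
import os


def build_graph_kernel_flags(options):
    """Render a {key: value} mapping in the "--key=value" format."""
    parts = []
    for key, value in options.items():
        if isinstance(value, (list, tuple)):
            # Multiple values for one option are joined with commas.
            value = ",".join(str(v) for v in value)
        parts.append(f"--{key}={value}")
    # Multiple options are separated by spaces.
    return " ".join(parts)


flags = build_graph_kernel_flags({
    "enable_expand_ops": ["Square"],
    "enable_cluster_ops": ["MatMul", "Add"],
})

# Assumption: this must happen before MindSpore reads the variable.
os.environ["MS_DEV_GRAPH_KERNEL_FLAGS"] = flags
print(flags)  # --enable_expand_ops=Square --enable_cluster_ops=MatMul,Add
```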

Obtaining Pass Names

Users can obtain pass names during debugging in either of the two ways described below, or refer to Appendix 3 for the list of supported passes.

Through IR Names

If users have dumped the relevant IR, they can obtain the fusion pass name from the IR file name. For example, if the IR file is named hwopt ge_unify_mindir_pm_44_add_layer_norm_fusion_0559.ir, the pass name add_layer_norm_fusion can be extracted from it.
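The extraction can be automated with a regular expression. The sketch below assumes, based only on the single example above, that IR file names end in _pm_<index>_<pass name>_<sequence number>.ir; other dumps may follow a different layout. The helper name pass_name_from_ir is hypothetical.

```python
import re

# Assumed layout: ..._pm_<index>_<pass name>_<sequence number>.ir
_IR_NAME = re.compile(r"_pm_\d+_(?P<pass_name>.+)_\d+\.ir$")


def pass_name_from_ir(ir_name):
    """Return the pass name embedded in an IR file name, or None."""
    match = _IR_NAME.search(ir_name)
    return match.group("pass_name") if match else None


ir_name = "hwopt ge_unify_mindir_pm_44_add_layer_norm_fusion_0559.ir"
print(pass_name_from_ir(ir_name))  # add_layer_norm_fusion
```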

Through INFO Messages

The [INFO] messages include a list of all passes that support custom switches. Users can generate [INFO] messages by setting export GLOG_v=1, and then search the output for "graph kernel passes" to find the list. For example, in the following message, the names of all passes that can be customized are listed after "graph kernel passes:".

[INFO] PRE_ACT(631369,ffffb5450af0,python):2024-08-22-15:34:16.978.158 [mindspore/ccsrc/plugin/device/ascend/optimizer/backend_common_unify_mindir.cc:191] GetBackendFusionGroupPassManager] graph kernel passes: FlashAttentionFusionV1,FlashAttentionFusionV2,add_layer_norm_fusion,add_layer_norm_v3_fusion,add_layer_norm_ext_fusion,inference_swiglu_fusion,inference_matmul_split_fusion,shape_reshape,add_rms_norm_quant_fusion,rms_norm_quant_fusion,add_rms_norm_fusion,add_cast_rms_norm_cast_fusion,MatMulAllReduce,split_concat_fusion,matmul_elemwise_fusion,inference_qbmm_add_fusion,inference_qbmm_allreduce_add_fusion.
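A log line like the one above can be split into individual pass names with a few string operations. This is a minimal sketch: parse_pass_list is a hypothetical helper, the sample line is a shortened copy of the message above, and the marker text is assumed to stay the same across versions.

```python
MARKER = "graph kernel passes:"


def parse_pass_list(log_line):
    """Return the pass names listed after the marker, or an empty list."""
    _, found, tail = log_line.partition(MARKER)
    if not found:
        return []
    # Drop surrounding whitespace and the trailing period of the message.
    tail = tail.strip().rstrip(".")
    return [name.strip() for name in tail.split(",") if name.strip()]


# Shortened version of the message shown above.
sample = (
    "[INFO] PRE_ACT(631369,ffffb5450af0,python): GetBackendFusionGroupPassManager] "
    "graph kernel passes: FlashAttentionFusionV1,FlashAttentionFusionV2,"
    "add_layer_norm_fusion."
)

passes = parse_pass_list(sample)
print(passes)
```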

For individual passes, users can also confirm whether they are enabled through log messages. For example:

  • Enabled Pass: The following message indicates that rms_norm_quant_fusion is enabled and can be disabled using disable_pass.

    [INFO] GRAPH_KERNEL(631369,ffffb5450af0,python):2024-08-22-15:34:17.640.739 [mindspore/ccsrc/backend/common/graph_kernel/core/graph_kernel_pass_manager.cc:84] RunPass] Run graph kernel pass fusion_group_10_rms_norm_quant_fusion in 74.64 us
    
  • Disabled Pass: The following message indicates that add_rms_norm_fusion is disabled and can be enabled using enable_pass.

    [INFO] GRAPH_KERNEL(631369,ffffb5450af0,python):2024-08-22-15:34:17.640.771 [mindspore/ccsrc/backend/common/graph_kernel/core/graph_kernel_pass_manager.cc:73] Run] graph kernel pass fusion_group_11_add_rms_norm_fusion is disabled.
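The two message forms above can be told apart programmatically. The sketch below is based only on these two sample lines. The patterns, the helper name pass_status, and the assumption that the fusion_group_<index>_ prefix can be stripped to recover the switchable pass name are not confirmed by the source.

```python
import re

_RUN = re.compile(r"Run graph kernel pass (?P<name>\S+) in ")
_DISABLED = re.compile(r"graph kernel pass (?P<name>\S+) is disabled")
# Assumed prefix added by the pass manager, e.g. "fusion_group_10_".
_PREFIX = re.compile(r"^fusion_group_\d+_")


def pass_status(log_line):
    """Return (pass name, "enabled" or "disabled"), or None if no match."""
    for pattern, status in ((_RUN, "enabled"), (_DISABLED, "disabled")):
        match = pattern.search(log_line)
        if match:
            return _PREFIX.sub("", match.group("name")), status
    return None


enabled_line = "RunPass] Run graph kernel pass fusion_group_10_rms_norm_quant_fusion in 74.64 us"
disabled_line = "Run] graph kernel pass fusion_group_11_add_rms_norm_fusion is disabled."

print(pass_status(enabled_line))
print(pass_status(disabled_line))
```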
    

Appendix 1: List of Default Expander Operators for Relevant Backends

Note: This list is provided for reference only and subject to change.

Operator name                      Ascend  CPU  GPU
Adam                               Y       Y    N
AdamApplyOneWithDecayAssign        Y       N    N
Addcmul                            Y       N    N
AddN                               Y       Y    Y
BiasAdd                            Y       Y    Y
BiasAddGrad                        Y       Y    Y
FillV2                             Y       N    N
GeLU                               Y       Y    Y
Gelu                               Y       Y    Y
FastGelu                           Y       N    N
FastGeluGrad                       Y       N    N
FastGeLU                           Y       N    N
FastGeLUGrad                       Y       N    N
SiLU                               Y       N    N
SiLUGrad                           Y       N    N
GeLUGrad                           Y       Y    Y
RsqrtGrad                          Y       N    N
SqrtGrad                           Y       Y    Y
Square                             Y       Y    Y
Tile                               Y       Y    Y
ClipByNormNoDivSum                 Y       N    N
FusedMulAdd                        Y       N    N
Sigmoid                            Y       N    Y
SigmoidGrad                        Y       N    Y
SigmoidCrossEntropyWithLogits      Y       N    Y
SigmoidCrossEntropyWithLogitsGrad  Y       N    Y
SquaredDifference                  Y       N    Y
TanhGrad                           Y       Y    N
OnesLike                           Y       Y    Y
ZerosLike                          Y       N    N
ReduceMean                         Y       N    Y
LogSoftmaxGrad                     N       N    Y
ReLU                               Y       Y    Y
ReluGrad                           Y       N    Y
AssignAdd                          Y       Y    Y
LambApplyOptimizerAssign           Y       N    N
LambApplyWeightAssign              Y       N    N
AdamApplyOneWithDecay              Y       N    N
ExpandDims                         N       Y    Y
Squeeze                            N       N    Y
SoftmaxGradExt                     N       N    N
ApplyMomentum                      N       N    N
LeakyReLUExt                       Y       N    N
EluExt                             Y       N    N
SoftplusExt                        Y       N    N
SoftplusGradExt                    Y       N    N
RepeatInterleaveInt                Y       N    N
HShrink                            Y       N    N
HSigmoid                           Y       N    N
HSwish                             Y       N    N
BinaryCrossEntropy                 Y       N    N
Erf                                Y       N    N
Tanh                               Y       N    N
Cosh                               Y       N    N
Sinh                               Y       N    N
ClampScalar                        Y       N    N
DivMod                             Y       N    N
BCEWithLogitsLoss                  Y       N    N
AcoshExt                           Y       N    N
AsinhExt                           Y       N    N
MeanExt                            Y       N    N
Erfc                               N       N    Y
AdamWeightDecay                    N       N    Y
BatchMatMul                        N       N    Y
Dropout                            N       N    Y
DropoutGrad                        N       N    Y
MaximumGrad                        N       Y    Y
MinimumGrad                        N       Y    Y
LayerNorm                          N       N    Y
LayerNormGrad                      N       N    Y
LogSoftmax                         N       N    Y
MatMul                             N       N    Y
ArgMaxWithValue                    N       N    Y
ArgMinWithValue                    N       N    Y
Slice                              N       N    Y
Softmax                            N       N    Y
SoftmaxCrossEntropyWithLogits      N       N    Y
EqualCount                         N       N    Y
SquareSumAll                       N       N    Y
IdentityMath                       N       N    Y
StandardNormal                     N       N    Y
Softplus                           N       Y    N
SoftplusGrad                       N       Y    N

Appendix 2: List of Default Cluster Operators for Relevant Backends

Note: This list is provided for reference only and subject to change.

Operator name   Ascend  CPU  GPU
Abs             Y       Y    Y
Add             Y       Y    Y
BroadcastTo     Y       N    N
Cast            Y       Y    Y
Exp             Y       Y    Y
Log             Y       Y    Y
Maximum         Y       Y    Y
Minimum         Y       Y    Y
Mul             Y       Y    Y
Neg             Y       Y    Y
Pow             Y       Y    Y
Div             Y       N    Y
RealDiv         Y       Y    Y
Reciprocal      Y       Y    Y
Rsqrt           Y       Y    Y
Sqrt            Y       Y    Y
Sub             Y       Y    Y
Equal           Y       Y    Y
NotEqual        Y       N    Y
Greater         Y       N    Y
GreaterEqual    Y       N    Y
Less            Y       Y    Y
LessEqual       Y       Y    Y
LogicalAnd      Y       N    Y
LogicalOr       Y       N    Y
LogicalNot      Y       Y    Y
Select          Y       Y    Y
Assign          Y       N    Y
ReduceSum       Y       Y    Y
IsFinite        Y       N    Y
Reshape         N       Y    Y
Transpose       Y       Y    Y
Floor           Y       N    Y
Ceil            Y       N    N
Trunc           Y       N    Y
Round           N       Y    Y
Tanh            N       Y    Y
ACos            N       N    Y
Acosh           N       N    Y
ArgMax          N       N    N
Argmin          N       N    N
Asin            N       N    Y
Asinh           N       N    Y
Atan            N       N    Y
Atan2           N       N    Y
Cos             N       N    Y
Erf             N       N    Y
Expm1           N       N    Y
FloorDiv        N       N    Y
FloorMod        N       N    Y
IsInf           N       N    Y
IsNan           N       N    Y
Mod             N       Y    Y
ReduceMax       N       Y    Y
ReduceMin       N       N    Y
Sign            N       N    Y
Sin             N       N    Y
StridedSlice    N       N    Y
CumSum          N       N    Y
OneHot          N       N    Y

Appendix 3: List of Enabled Passes for Relevant Backends

Note: This list is provided for reference only and subject to change. The actual enabled passes should be determined using the methods described above.

Pass name                            Backend
FlashAttentionFusionV1               Ascend
FlashAttentionFusionV2               Ascend
add_layer_norm_fusion                Ascend
add_layer_norm_v3_fusion             Ascend
add_layer_norm_ext_fusion            Ascend
inference_swiglu_fusion              Ascend
inference_matmul_split_fusion        Ascend
shape_reshape                        Ascend
add_rms_norm_quant_fusion            Ascend
rms_norm_quant_fusion                Ascend
add_rms_norm_fusion                  Ascend
add_cast_rms_norm_cast_fusion        Ascend
MatMulAllReduce                      Ascend
split_concat_fusion                  Ascend
matmul_elemwise_fusion               Ascend
inference_qbmm_add_fusion            Ascend
inference_qbmm_allreduce_add_fusion  Ascend