Release Notes

MindSpore 2.4.1 Release Notes

Major Features and Improvements

AutoParallel

  • [STABLE] Split/concat branch communication-computation parallelism is supported. Users split the input data to form parallelizable branches, and communication-computation parallelism is performed automatically between branches, reducing communication overhead.

  • [STABLE] Sequence pipeline is supported. The LLaMA-series models in the dev branch of MindFormers reduce the pipeline bubble as well as the memory overhead of pipeline parallelism by introducing sequence-dimension splitting.

PyNative

  • [STABLE] In PyNative mode, communication operators are assigned to streams by default based on the communication domain. Concurrent execution of communication operators is supported, collaborative parallel strategies are optimized, fine-grained communication hiding is provided, and model performance is improved.

Bug Fixes

  • IB0R4N: Fixed the problem that loading distributed weights produced incorrect precision under certain splitting strategies.

Contributors

bantao;caifubi;candanzg;chaijinwei;changzherui;chengbin;chujinjin;DeshiChen;dingjinshan;fary86;fuhouyu;gaoyong10;GuoZhibin;halo;haozhang;hedongdong;huangbingjian;hujiahui8;huoxinyou;jiangshanfeng;jiaorui;jiaxueyu;jshawjc;kisnwang;lichen;limingqi107;liubuyu;looop5;luochao60;luoyang;machenggui;MengXiangyu;Mrtutu;NaCN;panzhihui;qiuzhongya;shenhaojing;shilishan;tanghuikang;TuDouNi;wang_ziqi;weiyang;wujueying;XianglongZeng;xuxinglei;yang guodong;yanghaoran;yao_yf;yide12;yihangchen;YijieChen;YingtongHu;yuchaojie;YuJianfeng;zhangdanyang;ZhangZGC;zhengzuohe;zong_shuai;ZPaC;冯一航;胡彬;宦晓玲;李林杰;刘崇鸣;刘勇琪;任新;王禹程;王振邦;熊攀;俞涵;张栩浩;周一航;

MindSpore 2.4.0 Release Notes

Major Features and Improvements

Dataset

  • [STABLE] The default value of the max_rowsize parameter in the mindspore.dataset.GeneratorDataset, mindspore.dataset.Dataset.map, and mindspore.dataset.Dataset.batch interfaces is changed to None to enable dynamic allocation of shared memory by default. Shared memory is now requested on demand as input data arrives, which speeds up data processing, so users no longer need to tune this parameter in advance.

  • [BETA] Data processing supports an independent-process mode, which reduces GIL contention between the training process and the data-reading process to improve performance in dynamic graph mode. This mode can be enabled or disabled via the MS_INDEPENDENT_DATASET environment variable (see the sketch below).
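
A minimal usage sketch for the two items above; the value accepted by MS_INDEPENDENT_DATASET ("True" here) and the toy generator are illustrative assumptions:

    import os
    os.environ["MS_INDEPENDENT_DATASET"] = "True"  # BETA: independent dataset process (assumed value)

    import numpy as np
    import mindspore.dataset as ds

    def generator():
        for i in range(8):
            yield (np.full((2, 2), i, dtype=np.float32),)

    # max_rowsize now defaults to None, so shared memory is allocated on demand.
    dataset = ds.GeneratorDataset(generator, column_names=["data"],
                                  python_multiprocessing=True)
    dataset = dataset.map(operations=lambda x: x * 2, input_columns=["data"])
    dataset = dataset.batch(4)

    for item in dataset.create_tuple_iterator(output_numpy=True):
        print(item[0].shape)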

Ascend

  • [STABLE] Custom operators support the PyBoost execution mode in the Ascend dynamic graph scenario, which reduces operator call overhead.

  • [STABLE] The Ascend Print operator supports scenarios where the output is an oversized tensor or print calls are intensive; users can specify the slice size and timeout for different scenarios via the MS_DUMP_SLICE_SIZE and MS_DUMP_WAIT_TIME environment variables.

  • [STABLE] Unified deterministic computation settings. Users can enable Ascend deterministic computation simply by setting mindspore.set_context(deterministic="ON") (see the sketch at the end of this list).

  • [STABLE] Supports collective communication anomaly monitoring; training exits quickly after a communication anomaly is detected, avoiding timeout waits.

  • [STABLE] Supports graceful exit for sub-healthy devices. When the training framework detects sub-healthy device configuration information in the cluster, it saves a checkpoint (CKPT) and terminates the cluster training process in a unified manner.
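
A minimal configuration sketch for the deterministic and Print-related settings above; the values chosen for MS_DUMP_SLICE_SIZE and MS_DUMP_WAIT_TIME (and their units) are illustrative assumptions:

    import os
    import mindspore as ms

    # One switch enables Ascend deterministic computation.
    ms.set_context(deterministic="ON")

    # Tune the Print operator for oversized tensors or intensive print calls.
    os.environ["MS_DUMP_SLICE_SIZE"] = "1024"  # slice size (assumed value)
    os.environ["MS_DUMP_WAIT_TIME"] = "30"     # timeout (assumed value and unit)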

Runtime

  • [STABLE] Backend compilation cache is supported in O0/O1 mode and is turned on by default when frontend compilation cache is turned on.

  • [STABLE] The aclnnAllGatherMatmul, aclnnMatmulReduceScatter, and aclnnMatmulAllReduce algorithms are supported in O0/O1 modes to improve performance.

  • [STABLE] O0/O1 modes support disabling the cluster heartbeat by exporting MS_DISABLE_HEARTBEAT=1 to reduce scheduler load.

  • [STABLE] O0/O1 modes support communication operator fusion.

  • [STABLE] Virtual memory and memory defragmentation are supported in O2 mode and are enabled by default on the Ascend backend.

  • [STABLE] Device memory is requested dynamically, supporting multiple users on a single card; enabled by default on the Ascend backend.

  • [STABLE] Optimize graph fusion compilation performance in O1 mode, enabled by default.

  • [STABLE] Support kernel packet fusion optimization in O1 mode to improve the performance of dynamic shape network execution, enabled by default.

  • [BETA] Epilogue fusion between the MatMul and Elementwise operators is supported in O1 mode. Enable it via mindspore.set_context(graph_kernel_flags="--enable_cluster_ops=MatMul") (see the sketch at the end of this list).

  • [BETA] O1 mode supports a user-controlled graph fusion optimization scope; users can turn the corresponding fusion operators on or off via the enable_pass/disable_pass options of graph_kernel_flags.

  • [BETA] The GPTO execution order optimization module is supported in O0 mode and is enabled through mindspore.set_context(exec_order="gpto").
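
A minimal sketch of the O0/O1 runtime knobs referenced in this list; only flag names documented above are used, and combining them in one script is illustrative:

    import os
    import mindspore as ms

    # Reduce scheduler load by disabling the cluster heartbeat (O0/O1 modes).
    os.environ["MS_DISABLE_HEARTBEAT"] = "1"

    # O1 compilation level with BETA MatMul/Elementwise epilogue fusion.
    ms.set_context(jit_config={"jit_level": "O1"},
                   graph_kernel_flags="--enable_cluster_ops=MatMul")

    # BETA, O0 mode: GPTO execution-order optimization.
    # ms.set_context(jit_config={"jit_level": "O0"}, exec_order="gpto")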

PyNative

  • [STABLE] The cell_id parameter of the hook functions corresponding to mindspore.nn.Cell.register_backward_hook and mindspore.nn.Cell.register_forward_hook is changed to the cell's Python object.

  • [STABLE] Added the Cell.register_backward_pre_hook interface. This API registers a backward pre-hook function on a Cell, which is called each time the gradient of that Cell is computed (see the sketch at the end of this list).

  • [STABLE] Optimized the dispatch cache for AICPU operators in the PyNative flow to improve API execution performance.

  • [STABLE] Added the ability to convert the device memory occupied by a group of Tensors into a contiguous block of memory in dynamic graph mode.
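
A minimal sketch of the new register_backward_pre_hook interface mentioned above; the hook signature hook_fn(cell, grad_output) and the removable handle are assumptions made for illustration:

    import numpy as np
    import mindspore as ms
    from mindspore import nn, Tensor

    ms.set_context(mode=ms.PYNATIVE_MODE)

    def backward_pre_hook(cell, grad_output):
        # Called each time the gradient of this Cell is computed (signature assumed).
        print("backward pre-hook fired for:", type(cell).__name__)

    net = nn.Dense(3, 2)
    handle = net.register_backward_pre_hook(backward_pre_hook)

    grad_fn = ms.grad(net)
    _ = grad_fn(Tensor(np.ones((1, 3), np.float32)))

    handle.remove()  # assumed: hooks can be removed via the returned handle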

FrontEnd

  • [STABLE] Weight de-redundancy saving and loading is supported in fault recovery scenarios.

  • [STABLE] Mixed-precision training supports auto mode.

  • [STABLE] Support saving and loading of safetensors format, as well as offline aggregation and distributed loading based on safetensors in parallel scenarios.

  • [BETA] Added the new loop interfaces mindspore.ops.WhileLoop, mindspore.ops.ForiLoop, and mindspore.ops.Scan, optimizing loop compilation time (see the sketch at the end of this list).

  • [BETA] Graph mode supports passing keyword arguments to operators.
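
A minimal sketch of the new loop interfaces; the JAX-style calling convention ForiLoop()(lower, upper, body, init) shown here is an assumption based on the interface names:

    import mindspore as ms
    from mindspore import Tensor, ops

    @ms.jit
    def cumulative_sum(n, init):
        def body(i, acc):
            # Runs for i in [0, n); acc is the loop carry.
            return acc + i
        return ops.ForiLoop()(0, n, body, init)

    print(cumulative_sum(5, Tensor(0)))  # expected: 0 + 1 + 2 + 3 + 4 = 10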

Parallel

  • [STABLE] The mindspore.ops.TensorDump operator supports distributed parallel scenarios; users can choose to print input/output slices by configuring the TensorDump operator's input_output attribute. A new interface mindspore.ops.tensordump is added (see the sketch at the end of this list).

  • [STABLE] msrun supports customizing rank ids based on a passed-in rank table file, and supports rearranging rank ids via a JSON file passed through --rank_table_file.

  • [STABLE] Supports LCCL, a high-performance single-machine communication library on Ascend. Users can enable LCCL in Ascend backend training scenarios via the MS_ENABLE_LCCL environment variable.

  • [STABLE] The strategy propagation algorithm is adapted to LLaMA/Mixtral networks, which reduces the workload of users in configuring the sharding strategy for LLaMA/Mixtral networks.

  • [STABLE] Supports high-dimensional tensor parallelism; users can configure the input_layout of mindspore.ops.MatMul and mindspore.ops.BatchMatMul to switch between 1D/2D/3D tensor slicing modes.

  • [STABLE] Simulated compilation with SIMULATION_LEVEL=0 or SIMULATION_LEVEL=1 does not occupy hardware resources when the runtime jit_level is O0/O1.

  • [STABLE] When enable_allreduce_slice_to_reducescatter is turned on in parallel_speed_up_json, the AllReduce introduced by BatchMatMul model parallelism is automatically converted to a ReduceScatter according to the matching rules when it is followed by a slice operation, reducing communication.

  • [STABLE] mindspore.nn.Cell.shard and mindspore.shard support user-configured strategies of type mindspore.Layout, as well as a sharding strategy for each parameter via parameter_plan.

  • [BETA] SAPP supports fully automatic generation of sharding strategies for the remaining operators after users manually preconfigure parallel sharding strategies for some operators. Users activate the preconfigured .shard() parallel sharding strategies by turning on the MS_INTERFERED_SAPP environment variable.

  • [BETA] The mindspore.ops.Custom operator supports configuring the sharding strategy.
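
As referenced above, a minimal sketch of mindspore.ops.TensorDump inside a network; the positional call signature (file, tensor) and the dump path are assumptions made for illustration, while input_output='out' is taken from these notes:

    from mindspore import nn, ops

    class Net(nn.Cell):
        def __init__(self):
            super().__init__()
            # Dump only the output slice on each rank.
            self.dump = ops.TensorDump(input_output='out')
            self.matmul = ops.MatMul()

        def construct(self, x, w):
            y = self.matmul(x, w)
            self.dump('dump_dir/matmul_out', y)  # writes an .npy file per rank
            return y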

Inference

  • [STABLE] The new Qwen2 and LLaMA3.1 series of large models support a unified training-inference architecture, unifying scripts, distributed strategies, and runtime; inference latency is reduced by fusing large operators, effectively improving network throughput.

  • [STABLE] Supports service-oriented deployment of parallel decoding, enabling LookAhead speculative inference for LLaMA-series large models.

  • [BETA] Supports service-oriented deployment of SLoRA, enabling inference with scheduling across multiple fine-tuned weights for large models.

Dump

  • [STABLE] Optimized the Dump feature; usage is now organized by device type and optimization level (a configuration sketch follows this list).

  • [STABLE] Asynchronous Dump support in Ascend O0/O1 mode, including asynchronous Tensor, overflow, and statistics (host and device modes).

  • [STABLE] Overflow Dump supports configuring the maximum number of overflows.

  • [STABLE] Ascend O2 mode supports set dump.

  • [STABLE] Support qint4 x 2 quantization type Dump.
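
A minimal configuration sketch for enabling Dump via MINDSPORE_DUMP_CONFIG, as referenced above; the JSON keys shown are a simplified assumption of the dump configuration schema, so consult the Dump documentation for the full set of fields:

    import json
    import os

    dump_cfg = {
        "common_dump_settings": {
            "dump_mode": 0,             # 0: dump all operators (assumed semantics)
            "path": "/tmp/ms_dump",     # example output directory
            "net_name": "Net",
            "iteration": "0",
            "saved_data": "statistic",  # statistics dump, see the items above
            "input_output": 0,
            "kernels": [],
            "support_device": [0, 1, 2, 3, 4, 5, 6, 7]
        }
    }
    with open("dump_config.json", "w") as f:
        json.dump(dump_cfg, f)

    os.environ["MINDSPORE_DUMP_CONFIG"] = os.path.abspath("dump_config.json")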

API Change

New API

  • [STABLE] mindspore.mint APIs add a large number of functional and nn interfaces. The mint interfaces are currently experimental; their performance is better than ops in graph compilation mode O0 and in PyNative mode. They do not yet support graph sink mode or the CPU and GPU backends, and will be improved gradually. A usage sketch follows the list below.

    mindspore.mint

    mindspore.mint.full

    mindspore.mint.repeat_interleave

    mindspore.mint.linspace

    mindspore.mint.scatter

    mindspore.mint.tril

    mindspore.mint.argmin

    mindspore.mint.sign

    mindspore.mint.remainder

    mindspore.mint.flatten

    mindspore.mint.asin

    mindspore.mint.arcsin

    mindspore.mint.sinh

    mindspore.mint.arcsinh

    mindspore.mint.atan

    mindspore.mint.arctan

    mindspore.mint.atanh

    mindspore.mint.arctanh

    mindspore.mint.acos

    mindspore.mint.arccos

    mindspore.mint.acosh

    mindspore.mint.arccosh

    mindspore.mint.erfc

    mindspore.mint.expm1

    mindspore.mint.log1p

    mindspore.mint.logical_xor

    mindspore.mint.round

    mindspore.mint.tan

    mindspore.mint.trace

    mindspore.mint.trunc

    mindspore.mint.cross

    mindspore.mint.masked_select

    mindspore.mint.bitwise_and

    mindspore.mint.bitwise_or

    mindspore.mint.bitwise_xor

    mindspore.mint.cosh

    mindspore.mint.cummax

    mindspore.mint.cummin

    mindspore.mint.median

    mindspore.mint.roll

    mindspore.mint.sinc

    mindspore.mint.xlogy

    mindspore.mint.nn

    mindspore.mint.nn.ReLU

    mindspore.mint.nn.Hardsigmoid

    mindspore.mint.nn.AvgPool2d

    mindspore.mint.nn.MSELoss

    mindspore.mint.nn.LogSoftmax

    mindspore.mint.nn.Mish

    mindspore.mint.nn.PReLU

    mindspore.mint.nn.SELU

    mindspore.mint.nn.Softshrink

    mindspore.mint.nn.Hardshrink

    mindspore.mint.nn.Hardswish

    mindspore.mint.nn.L1Loss

    mindspore.mint.nn.functional

    mindspore.mint.nn.functional.hardsigmoid

    mindspore.mint.nn.functional.log_softmax

    mindspore.mint.nn.functional.mish

    mindspore.mint.nn.functional.prelu

    mindspore.mint.nn.functional.selu

    mindspore.mint.nn.functional.softshrink

    mindspore.mint.nn.functional.hardshrink

    mindspore.mint.nn.functional.hardswish

    mindspore.mint.nn.functional.l1_loss
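
    A minimal usage sketch for a few of the mint interfaces listed above; shapes and values are arbitrary examples:

        import numpy as np
        from mindspore import Tensor, mint

        x = mint.linspace(0.0, 1.0, 5)  # evenly spaced values
        y = mint.full((2, 3), 7.0)      # constant-filled tensor
        z = mint.flatten(Tensor(np.arange(6, dtype=np.float32).reshape(2, 3)))

        relu = mint.nn.ReLU()
        out = relu(Tensor(np.array([-1.0, 0.5], np.float32)))
        loss = mint.nn.functional.l1_loss(out, Tensor(np.array([0.0, 0.5], np.float32)))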

Interface Changes

  • Interface name: mindspore.dataset.GeneratorDataset

    Changed: The default value of parameter max_rowsize is changed from 6 to None to enable dynamic allocation of shared memory by default.

    Original interface:

    class GeneratorDataset(source,
                           column_names=None,
                           column_types=None,
                           schema=None,
                           num_samples=None,
                           num_parallel_workers=1,
                           shuffle=None,
                           sampler=None,
                           num_shards=None,
                           shard_id=None,
                           python_multiprocessing=True,
                           max_rowsize=6)

    v2.4.0 interface:

    class GeneratorDataset(source,
                           column_names=None,
                           column_types=None,
                           schema=None,
                           num_samples=None,
                           num_parallel_workers=1,
                           shuffle=None,
                           sampler=None,
                           num_shards=None,
                           shard_id=None,
                           python_multiprocessing=True,
                           max_rowsize=None)
    
  • Interface name: mindspore.dataset.Dataset.batch

    Changed: The default value of parameter max_rowsize is changed from 16 to None to enable dynamic allocation of shared memory by default.

    Original interface:

    def batch(input_dataset,
              batch_size,
              drop_remainder=False,
              num_parallel_workers=None,
              per_batch_map=None,
              input_columns=None,
              output_columns=None,
              python_multiprocessing=False,
              max_rowsize=16)

    v2.4.0 interface:

    def batch(input_dataset,
              batch_size,
              drop_remainder=False,
              num_parallel_workers=None,
              per_batch_map=None,
              input_columns=None,
              output_columns=None,
              python_multiprocessing=False,
              max_rowsize=None)
    
  • Interface name: mindspore.dataset.Dataset.map

    Changed: The default value of parameter max_rowsize is changed from 16 to None to enable dynamic allocation of shared memory by default.

    Original interface:

    def map(input_dataset,
            operations=None,
            input_columns=None,
            output_columns=None,
            num_parallel_workers=None,
            python_multiprocessing=False,
            cache=None,
            callbacks=None,
            max_rowsize=16,
            offload=None)

    v2.4.0 interface:

    def map(input_dataset,
            operations=None,
            input_columns=None,
            output_columns=None,
            num_parallel_workers=None,
            python_multiprocessing=False,
            cache=None,
            callbacks=None,
            max_rowsize=None,
            offload=None)
    
  • Interface name: mindspore.ops.TensorDump

    Changed: New parameter input_output to control printing behavior.

    Original interface:

    class TensorDump()

    v2.4.0 interface:

    class TensorDump(input_output='out')
    
  • Interface name: File formats saved by MindSpore Dump Tensor

    Changed: The npy file obtained by Dump adds the dtype information of the original Tensor to the filename.

    Original format:

    {op_type}.{op_name}.{task_id}.{stream_id}.{timestamp}.{input_output_index}.{slot}.{format}.npy

    v2.4.0 format:

    {op_type}.{op_name}.{task_id}.{stream_id}.{timestamp}.{input_output_index}.{slot}.{format}.{dtype}.npy
    

Non-compatible Interface Changes

  • Interface name: mindspore.nn.Cell.register_backward_hook(hook_fn)

    Changed: The input parameter of hook_fn is changed from cell_id to cell object.

    Descriptions: For hooks written against the original interface, the original cell_id can be obtained via id(cell) inside hook_fn.

    Original interface:

    def register_backward_hook(hook_fn)
    Parameter: hook_fn(cell_id, grad_input, grad_output) -> New grad_output or None

    v2.4.0 interface:

    def register_backward_hook(hook_fn)
    Parameter: hook_fn(cell, grad_input, grad_output) -> New grad_input or None
    
  • Interface name: mindspore.nn.Cell.register_forward_hook(hook_fn)

    Changed: The input parameter of hook_fn is changed from cell_id to cell object.

    Descriptions: For hooks written against the original interface, the original cell_id can be obtained via id(cell) inside hook_fn (see the sketch below).

    Original interface:

    def register_forward_hook(hook_fn)
    Parameter: hook_fn(cell_id, inputs, outputs) -> New outputs or None

    v2.4.0 interface:

    def register_forward_hook(hook_fn)
    Parameter: hook_fn(cell, inputs, outputs) -> New outputs or None
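
    A minimal sketch of the v2.4.0 hook signature: the first argument is now the Cell object itself, and id(cell) recovers the behavior that previously relied on cell_id:

        import numpy as np
        import mindspore as ms
        from mindspore import nn, Tensor

        ms.set_context(mode=ms.PYNATIVE_MODE)

        def forward_hook(cell, inputs, outputs):
            print("forward hook on", type(cell).__name__, "id:", id(cell))
            return None  # returning None keeps the original outputs

        net = nn.Dense(3, 2)
        net.register_forward_hook(forward_hook)
        _ = net(Tensor(np.ones((1, 3), np.float32)))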
    
  • Interface name: mindspore.communication.comm_func.all_reduce

    Changed: all_reduce adds a new parameter async_op, and the return value is changed from Tensor to a tuple consisting of Tensor and CommHandle.

    Descriptions: async_op indicates whether multi-stream parallelism is enabled for all_reduce; the default value is False (see the sketch below).

    Original interface:

    def all_reduce(tensor,
                   op=ReduceOp.SUM,
                   group=GlobalComm.WORLD_COMM_GROUP) -> Tensor

    v2.4.0 interface:

    def all_reduce(tensor,
                   op=ReduceOp.SUM,
                   group=GlobalComm.WORLD_COMM_GROUP,
                   async_op=False) -> tuple(Tensor, CommHandle)
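
    A minimal sketch of the asynchronous all_reduce described above; it assumes that the communication group has been initialized (for example via msrun) and that the returned CommHandle exposes wait():

        import numpy as np
        from mindspore import Tensor
        from mindspore.communication import init
        from mindspore.communication.comm_func import all_reduce

        init()  # initialize the collective communication backend
        x = Tensor(np.ones((2, 2), np.float32))
        out, handle = all_reduce(x, async_op=True)
        handle.wait()  # block until the asynchronous all_reduce finishes
        print(out)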
    
  • Interface name: mindspore.communication.comm_func.all_gather_into_tensor

    Changed: all_gather_into_tensor adds a new parameter async_op, and the return value is changed from Tensor to a tuple consisting of Tensor and CommHandle.

    Descriptions: async_op indicates whether all_gather_into_tensor has multi-stream parallelism turned on, and the default value is False.

    Original interface:

    def all_gather_into_tensor(tensor,
                               group=GlobalComm.WORLD_COMM_GROUP) -> Tensor

    v2.4.0 interface:

    def all_gather_into_tensor(tensor,
                               group=GlobalComm.WORLD_COMM_GROUP,
                               async_op=False) -> tuple(Tensor, CommHandle)
    
  • Interface name: mindspore.communication.comm_func.reduce_scatter_tensor

    Changed: reduce_scatter_tensor adds a new parameter async_op, and the return value is changed from Tensor to a tuple consisting of Tensor and CommHandle.

    Descriptions: async_op indicates whether reduce_scatter_tensor has multi-stream parallelism turned on, and the default value is False.

    Original interface:

    def reduce_scatter_tensor(tensor,
                              op=ReduceOp.SUM,
                              group=GlobalComm.WORLD_COMM_GROUP) -> Tensor

    v2.4.0 interface:

    def reduce_scatter_tensor(tensor,
                              op=ReduceOp.SUM,
                              group=GlobalComm.WORLD_COMM_GROUP,
                              async_op=False) -> tuple(Tensor, CommHandle)
    
  • Interface name: mindspore.communication.comm_func.isend

    Changed: The return value is changed from Tensor to Handle.

    Descriptions: isend enables multi-stream parallelism by default.

    Original interface:

    def isend(tensor,
              dst=0,
              group=GlobalComm.WORLD_COMM_GROUP,
              tag=0) -> Tensor

    v2.4.0 interface:

    def isend(tensor,
              dst=0,
              group=GlobalComm.WORLD_COMM_GROUP,
              tag=0) -> CommHandle
    
  • Interface name: mindspore.communication.comm_func.irecv

    Changed: The return value is changed from Tensor to Handle.

    Descriptions: irecv enables multi-stream parallelism by default.

    Original interface:

    def irecv(tensor,
              src=0,
              group=GlobalComm.WORLD_COMM_GROUP,
              tag=0) -> Tensor

    v2.4.0 interface:

    def irecv(tensor,
              src=0,
              group=GlobalComm.WORLD_COMM_GROUP,
              tag=0) -> CommHandle
    
  • Interface name: mindspore.communication.comm_func.all_to_all_with_output_shape

    Changed: all_to_all_with_output_shape adds a new parameter async_op, and the return value is changed from Tensor to a tuple consisting of Tensor and CommHandle.

    Descriptions: async_op indicates whether all_to_all_with_output_shape enables multi-stream parallelism, the default value is False.

    Original interface:

    def all_to_all_with_output_shape(output_shape_list,
                                     input_tensor_list,
                                     group=None) -> tuple(Tensor)

    v2.4.0 interface:

    def all_to_all_with_output_shape(output_shape_list,
                                     input_tensor_list,
                                     group=None,
                                     async_op=False) -> tuple(tuple(Tensor), CommHandle)
    
  • Interface name: mindspore.communication.comm_func.all_to_all_single_with_output_shape

    Changed: all_to_all_single_with_output_shape adds a new parameter async_op, and the return value is changed from Tensor to a tuple consisting of Tensor and CommHandle.

    Descriptions: async_op indicates whether all_to_all_single_with_output_shape enables multi-stream parallelism, the default value is False.

    Original interface:

    def all_to_all_single_with_output_shape(output_shape,
                                            tensor,
                                            output_split_sizes=None,
                                            input_split_sizes=None,
                                            group=None) -> Tensor

    v2.4.0 interface:

    def all_to_all_single_with_output_shape(output_shape,
                                            tensor,
                                            output_split_sizes=None,
                                            input_split_sizes=None,
                                            group=None,
                                            async_op=False) -> tuple(Tensor, CommHandle)
    

Contributors

anyrenwei,bantao,baochong,Bellatan,BJ-WANG,caifubi,candanzg,candyhong,Carey,cccc1111,ccsszz,changzherui,chengbin,chengfeng27,chengxb7532,chenjianping,chenweifeng,chujinjin,dairenjie,DavidFFFan,DeshiChen,dingjinshan,emmmmtang,fanyi20,fary86,fengyixing,fix-dryrun,fuchao,fuhouyu,gaoyong10,gengdongjie,gent1e,GuoZhibin,guozhijian,halo,hangq,haozhang,hedongdong,Henry Shi,HighCloud,Hongxing,huandong1,huangbingjian,HuangLe02,huangziling,huda,huiliang166,hujiahui8,huoxinyou,jiangchenglin3,jianghui58,jiangshanfeng,jiaorui,jiaxueyu,jijiarong,jjfeing,JoeyLin,jshawjc,jxl,kairui_kou,kisnwang,kk,lanzhineng,LiangZhibo,lichen,limingqi107,lionelchang,liubuyu,liujunzhu,liuluobin,liyejun,LLLRT,looop5,luochao60,luoxuewei,luoyang,machenggui,maning202007,maoyuanpeng1,Margaret_wangrui,MengXiangyu,mengyuanli,moran,Mrtutu,mylinchi,NaCN,nomindcarry,panzhihui,paolopoggi,pengqi,pierreleca,qiuleilei,qiuyufeng,qiuzhongya,r1chardf1d0,shaoshengqi,shen_haochen,shenhaojing,shenwei41,shihlCST,shilishan,shiro-zzz,shiziyang,shop-pin,shunyuanhan,shuqian0,stavewu,superxf,suteng,tanghuikang,tangmengcheng,tan-wei-cheng,tan-wei-cheng-3260,tianxiaodong,TronZhang,TuDouNi,VectorSL,vincen45,wang_ziqi,wanghenchang,wangjie,wangshaocong,weiyang,wtobill,wudawei,wujueying,wwwbby,xfan233,XianglongZeng,xiaotianci,xiaoxin_zhang,xiaoxiongzhu,xiaoxuanKL,xiaoyao,XinDu,xuxinglei,xuzhubin,yanghaoran,yanglong,yangzhenzhang,yanx,Yanzhi_YI,yao_yf,yefeng,yide12,yihangchen,YijieChen,YingLai Lin,ylw,yuanpeng2024,yuanqi,yuchaojie,Yuheng Wang,YuJianfeng,YukioZzz,yyuse,zangqx,ZeyuHan,zhangbuxue,zhanghaibo,zhangminli,zhangqinghua,zhangyanhui,ZhangZGC,zhangzhen,zhanzhan,zhengzuohe,zhouyaqiang0,zhuguodong,zichun_ye,zjun,zong_shuai,ZPaC,zuochuanyong,zyli2020,程超,蛋蛋de忧桑,狄新凯,范吉斌,冯一航,付国华,胡彬,宦晓玲,黄勇,黄卓,康伟,李良灿,李林杰,李寅杰3,刘崇鸣,刘思铭,刘涛Liu,刘勇琪,刘子涵,吕浩宇,吕昱峰(Nate.River),钱丹,十一雷,孙昊辰,王禹程,王振邦,王梓润,吴大维,熊攀,徐安越,许子豪,俞涵,云骑士,张峻源,张王泽,张栩浩,赵文璇,周莉莉,朱家兴,邹文祥