Multi Dimensional

As deep learning evolves, models get larger and larger. For example, in the field of NLP, in just a few years, the amount of parameters has developed from BERT’s 100 million to GPT-3’s 170 billion, and then to Pangu alpha 200 billion, and the current industry has even proposed a million billion. It can be seen that the scale of parameters has shown an exponential growth trend in recent years. On the other hand, with the development of related technologies in the fields of big data and the Internet, the datasets available for model training are also rapidly expanding, such as recommendations, natural language processing and other scenarios of the dataset that can reach terabytes.

In the face of large-scale data and large-scale parameter training, a single device either takes a long time to complete model training, or it cannot be trained due to insufficient display memory. Therefore, distributed training technology needs to be introduced.

Currently, the most commonly used distributed training technique is data parallelism. Data parallelization splits the training data into multiple devices, each maintaining the same model parameters and the same size of computing tasks, but processing different data. In the process of backpropagation, the parameter gradient generated by each device is globally AllReduce synchronously summed. When the dataset is large and the model is small, there is an advantage to choosing data parallelism, such as ResNet50. However, when the model is large, or the dataset and model are larger, other distributed features need to be used.

MindSpore provides the following advanced features to support distributed training of large models, and users can flexibly combine them according to their own needs.

Operator Parallel

Operator-level parallelism is a distributed computation of operators by splitting their input tensors into multiple devices in units. On the one hand, data samples and model parameters can be split into multiple devices at the same time to complete the training of large models. On the other hand, you can make full use of cluster resources for parallel computing to improve the overall speed.

The users can set the sharding strategy of each operator in the forward network, and the framework models each operator and its input tensor according to the sharding strategy of the operator, so that the computational logic of the operator remains mathematically equivalent before and after the sharding.

Pipeline Parallel 

When there are a large number of cluster devices, if only the operator level is used in parallel, communication needs to be carried out on the communication domain of the entire cluster, which may make communication inefficient and reduce overall performance.

Pipeline parallel can split the neural network structure into multiple stages, and each stage runs in a part of the device. The communication domain of the set communication limits to this part of the device, and the stage uses point-to-point communication.

The advantages of pipeline parallel are that they can improve communication efficiency and easily handle layered neural network structures. The disadvantage is that some nodes may be idle at the same time.

Optimizer Parallel

When training in parallel with data or operators, the parameters of the model may have the same copy on multiple devices. This allows the optimizer to have redundant calculations across multiple devices when updating this weight. In this case, the optimizer’s computational volume can be spread across multiple devices through optimizer parallelism. It has the advantage of reducing static memory consumption and reducing the amount of computation in the optimizer. The disadvantage is that it increases the communication overhead.

Host Device Training 

When training large models, the overall size of the model that can be trained will be limited by the number of devices due to the limited memory capacity of each device (accelerator). In order to complete larger-scale model training, you can use the host and device heterogeneous training modes. It takes advantage of both the large memory on the host side and the fast calculation on the accelerator side, and is an effective way to reduce the number of devices during the training of the super-large model.

Recompute 

MindSpore automatically derives the reverse graph according to the forward graph calculation process, and the forward graph and the inverse graph together form a complete calculation graph. When calculating some reverse operators, it may be necessary to use the calculation results of some forward operators, resulting in the calculation results of these forward operators, which need to reside in memory until these reverse operators are calculated, and the memory they occupy will not be reused by other operators. The compute results of these forward operators, which reside in memory for a long time, push up the peak memory footprint of the computation, especially in large-scale network models. In order to reduce memory peaks, the recomputing technique can not save the calculation results of the forward activation layer, so that the memory can be reused, and then when calculating the reverse part, recalculate the results of the forward activation layer.

Description of the Interface Related to the Feature

Feature category	Feature interface	Description	Function
Auto-parallel	set_auto_paralle_context(search_mode=mode)	Specify the policy search algorithm, with a value of type string, and the optional value: 1. “sharding_propagation”: indicate a policy search by using sharding strategy propagation algorithm;2. “dynamic_programming”: indicate the use of dynamic programming algorithms for policy search;3. “recursive_programming”: indicate the use of a double recursive algorithm for policy search;	Automatic parallel allows the user to search for sharding strategy without configuring or configuring a small number of operators, and the framework searches for the sharding strategy.
	set_algo_parametrs(fully_use_devces=bool_value)	Whether operators need to be split across all devices when setting up search policies. Its value is of type bool, which defaults to True.	If the operator is split into all devices, the search space can be reduced and the search speed can be improved, but the search strategy is not globally optimal.
	set_auto_paralle_context(all_redce_fusion_configconfig)	Configure the gradient AllReduce operator fusion strategy with a value of type list. For example: [20, 35], which means that the first 20 AllReduces are fused into 1, the 20th to 35th AllReduce are fused into 1, and the remaining AllReduce are fused into 1.	Reduce the number of operations of the AllReduce communication operator and improve communication efficiency.
comm_fusion	set_auto_paralle_context(comm_fuion=config)	Set the fusion configuration of the communication operator, and support the configuration of the AllReduce, AllGather, and ReduceScatter communication operators currently. Its value is of type dict, such as comm_fusion={“allreduce”: {“mode”: “auto”, “config”: None}}. There are three options for “mode” among them: “auto”: Automatically perform operator fusion according to the data volume threshold of 64MB, and the configuration parameter “config” is None. “size”: Communicate operator fusion according to the method of manually setting the data volume threshold, and the configuration parameter “config” type is int, with the unit of MB. “index”: Only “allreduce” supports configuring index, which means that the configuration parameter “config” type is list according to the way the sequence number of the communication operator is fused. For example: [20, 35], which means that the first 20 AllReduces are fused into 1, the 20th to 35th AllReduce are fused into 1, and the remaining AllReduce are fused into 1.	Reduce the number of operations of the AllReduce/AllGath er/ReduceScatter communication operator and improve communication efficiency.
Dataset slicing	set_auto_parallel_context(dataset_strategy=config)	Configure the sharding policy for the dataset. where config is Union[str, tuple]. When a string is passed in, there are two options: “full_batch”: indicates that the dataset is not tangential, and “data_parallel”: indicates that the dataset is sliced in parallel with the data. When passed in tuple, the content in tuple represents the shard() interface of the dataset, similar to the premiumive shard() interface. if this interface is not called, it defaults to the “data_parallel” mode.	When the number of samples is smaller than the number of cards, it can be imported in the way of “full_batch”; when the number of samples is large and the model parameters are small, it can be imported in the way of “data_parallel”; when the data set is high-resolution image data, it can be imported by configuring the tuple sharding strategy.
Distributed inference	infer_predict_layout(*predict_data)	Use inference data to perform precompilation, which outputs the splitting information of the operator.	Obtain the sharding information of the ownership weight at the time of inference.
	load_distributed_checkpoint(network,checkpoint_filenames,predict_strategy=None,train_strategy_filename=None)	Load the distributed weights. Each machine needs to pre-place the full amount of ckpt. where network represents the inference network, checkpoint_filenames represents the checkpoint file, predict_strategy is the output of the infer_predict_layout(), and train_strategy_filename is the operator slicing strategy information saved during training.	Load distributed weights for distributed inference.
Functional operator sharding	shard(in_strategy,out_strategy,device=“Ascend”,level=0) In Cell class	Set the sharding strategy of the input and output tensors of the cell, and the parallel strategy of the remaining operators is propagated by the sharding strategy. in_strategy/out_strategy specify the sharding policy for the input/output tensor. device specifies the execution device, and level specifies the pattern of the sharding policy propagation algorithm.	In PyNative mode, specify that a cell instance executes in graph mode, and synchronizes the operator-level model according to the specified input-output sharding strategy, while the rest of the model is still executed in Python mode.
	ops.shard(fn, in_strategy, out_strategy, device=“Ascend”, level=0)	The incoming fn is a cell instance or function. The rest of the input is the same as shard, and the return value is a function. When this function is called, the operator-level model is executed in graph mode in parallel.	This usage allows you to specify that a function performs model parallelism at the operator level, with the same function as cell’s shard method.