Multi-Dimensional


As deep learning evolves, models get larger and larger. In the field of NLP, for example, the number of parameters has grown in just a few years from the 100-million level of BERT to the 175 billion of GPT-3, and then to the 200 billion of PanGu-α; the industry has even proposed models at the trillion-parameter level and beyond. The scale of parameters has thus shown exponential growth in recent years. On the other hand, with the development of big data and Internet technologies, the datasets available for model training are also expanding rapidly; datasets in recommendation, natural language processing, and similar scenarios can reach the terabyte scale.

When faced with large-scale data and large-scale parameters, a single device either takes too long to complete model training or cannot train the model at all due to insufficient device memory. Distributed training techniques therefore need to be introduced.

Currently, the most commonly used distributed training technique is data parallelism. Data parallelism splits the training data across multiple devices; each device maintains the same model parameters and performs the same amount of computation, but processes different data. During backpropagation, the parameter gradients produced by the devices are summed by a global AllReduce operation. Data parallelism is advantageous when the dataset is large and the model is small, as with ResNet-50. However, when the model is large, or when both the dataset and the model are large, other distributed features need to be used.
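As a minimal sketch, the following shows how data-parallel training might be enabled through the MindSpore context API (the module paths follow the 1.x-style `context` interface and may differ across versions); `gradients_mean=True` averages the gradients after the AllReduce.

```python
from mindspore import context
from mindspore.communication import init
from mindspore.context import ParallelMode

init()  # initialize the collective communication backend (e.g., HCCL/NCCL)

# Every device keeps a full copy of the model and processes a different slice
# of each batch; gradients are summed (and here averaged) via AllReduce.
context.set_auto_parallel_context(parallel_mode=ParallelMode.DATA_PARALLEL,
                                  gradients_mean=True)
```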

MindSpore provides the following advanced features to support distributed training of large models, and users can flexibly combine them according to their own needs.

Operator Parallel

Operator-level parallelism distributes the computation of individual operators by splitting their input tensors across multiple devices. On the one hand, data samples and model parameters can be split across devices at the same time, making it possible to train large models. On the other hand, cluster resources can be fully used for parallel computation to improve the overall speed.

Users can set a sharding strategy for each operator in the forward network, and the framework shards each operator and its input tensors according to that strategy, so that the computational logic of the operator remains mathematically equivalent before and after sharding.
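For illustration, the sketch below sets a sharding strategy with the `shard()` interface in semi-auto parallel mode; the 8-device cluster, the single MatMul layer, and the concrete strategy numbers are assumptions chosen for the example.

```python
import mindspore.nn as nn
import mindspore.ops as ops
from mindspore import context
from mindspore.context import ParallelMode

context.set_auto_parallel_context(parallel_mode=ParallelMode.SEMI_AUTO_PARALLEL,
                                  device_num=8)

class DenseLayer(nn.Cell):
    def __init__(self):
        super().__init__()
        # shard(((2, 1), (1, 4))): split the rows of the first input across
        # 2 devices and the columns of the second input across 4 devices,
        # using all 2 * 4 = 8 devices for this one operator.
        self.matmul = ops.MatMul().shard(((2, 1), (1, 4)))

    def construct(self, x, w):
        return self.matmul(x, w)
```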

Pipeline Parallel

When the cluster contains a large number of devices, using only operator-level parallelism requires communication across the communication domain of the entire cluster, which may make communication inefficient and reduce overall performance.

Pipeline parallelism splits the neural network into multiple stages, each of which runs on a subset of the devices. Collective communication is then limited to the communication domain of that subset, while point-to-point communication is used between stages.

The advantages of pipeline parallelism are that it improves communication efficiency and handles layered neural network structures naturally. The disadvantage is that some devices may be idle at the same time (the pipeline bubble).
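A minimal sketch of this feature, assuming two stages and a toy two-block network (the `Block` cell, the layer sizes, and `micro_size=4` are illustrative assumptions): each block is assigned a `pipeline_stage`, and the network is wrapped in `nn.PipelineCell` so that every mini-batch is split into micro-batches.

```python
import mindspore.nn as nn
from mindspore import context
from mindspore.context import ParallelMode

context.set_auto_parallel_context(parallel_mode=ParallelMode.SEMI_AUTO_PARALLEL,
                                  pipeline_stages=2)

class Block(nn.Cell):
    """A toy sub-network standing in for a group of layers."""
    def __init__(self):
        super().__init__()
        self.dense = nn.Dense(64, 64)
        self.relu = nn.ReLU()

    def construct(self, x):
        return self.relu(self.dense(x))

class Net(nn.Cell):
    def __init__(self):
        super().__init__()
        self.block0 = Block()
        self.block1 = Block()
        self.block0.pipeline_stage = 0  # runs on the first subset of devices
        self.block1.pipeline_stage = 1  # runs on the second subset of devices

    def construct(self, x):
        return self.block1(self.block0(x))

# Split each mini-batch into 4 micro-batches to keep both stages busy.
net = nn.PipelineCell(Net(), micro_size=4)
```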

Optimizer Parallel

When training with data parallelism or operator-level parallelism, copies of the same model parameter may exist on multiple devices, so the optimizer performs redundant computation on those devices when updating that weight. In this case, optimizer parallelism can spread the optimizer's computation across the devices. Its advantages are lower static memory consumption and less optimizer computation; its disadvantage is increased communication overhead.
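As a sketch, optimizer parallelism is switched on through the auto-parallel context; the semi-auto parallel mode chosen below is an assumption, and the feature also applies under data parallelism.

```python
from mindspore import context
from mindspore.context import ParallelMode

# Shard the optimizer state and update computation across the devices that
# hold redundant copies of the same weights.
context.set_auto_parallel_context(parallel_mode=ParallelMode.SEMI_AUTO_PARALLEL,
                                  enable_parallel_optimizer=True)
```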

Host Device Training

When training large models, the overall size of the model that can be trained is limited by the number of devices, because the memory capacity of each device (accelerator) is limited. To train larger models, the host-device heterogeneous training mode can be used. It takes advantage of both the large memory on the host side and the fast computation on the accelerator side, and is an effective way to reduce the number of devices needed when training super-large models.
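A minimal sketch of the idea, assuming a recommendation-style network whose embedding table is too large for device memory (the table sizes and the use of `nn.EmbeddingLookup` with `target='CPU'` are illustrative assumptions): the embedding lookup runs on the host CPU, while the dense part runs on the accelerator.

```python
import mindspore.nn as nn

class RecNet(nn.Cell):
    def __init__(self):
        super().__init__()
        # The huge embedding table stays in host memory and is looked up on CPU.
        self.embedding = nn.EmbeddingLookup(vocab_size=10_000_000,
                                            embedding_size=128,
                                            target='CPU')
        self.dense = nn.Dense(128, 1)  # runs on the accelerator

    def construct(self, ids):
        return self.dense(self.embedding(ids))
```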

Recompute

MindSpore automatically derives the backward graph from the forward computation, and the forward and backward graphs together form the complete computational graph. Some backward operators depend on the results of forward operators, so those forward results must stay resident in memory until the corresponding backward operators have finished, and the memory they occupy cannot be reused by other operators. These long-lived forward results push up the peak memory footprint of the computation, especially in large-scale network models. To reduce the memory peak, the recompute technique does not save the results of the forward activation layers, so their memory can be reused; when the backward pass needs them, the forward activation results are recomputed.
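A minimal sketch of marking a cell for recomputation (the toy layer below is an illustrative assumption): the cell's forward outputs are not kept for the backward pass and are recomputed instead, trading extra computation for a lower memory peak.

```python
import mindspore.nn as nn

class ActivationBlock(nn.Cell):
    """A toy forward block whose activations would otherwise stay in memory."""
    def __init__(self):
        super().__init__()
        self.dense = nn.Dense(128, 128)
        self.act = nn.ReLU()

    def construct(self, x):
        return self.act(self.dense(x))

block = ActivationBlock()
block.recompute()  # drop this cell's forward results; recompute them in the backward pass
```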