Optimizer Parallel

Overview

When performing data parallel training, the parameter update part of the model is computed redundantly on every card. Optimizer parallelism can effectively reduce memory consumption and improve performance on large-scale networks (e.g., BERT, GPT) by distributing the optimizer computation across the cards in the data-parallel dimension.

When optimizer parallelism is enabled in AUTO_PARALLEL or SEMI_AUTO_PARALLEL mode, if a parameter still has duplicate slices across devices after the sharding strategy is applied, and the highest dimension of its shape is divisible by the number of duplicate copies, the framework stores the parameter as the minimal slice and updates it in the optimizer. All optimizers are supported in this mode.
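The following is a minimal sketch of this slicing condition; the parameter shape and the number of duplicate copies are hypothetical values chosen only for illustration.

```python
# Hypothetical example: when can a replicated parameter be stored as a minimal slice?
param_shape = (8, 1024)   # the highest (first) dimension is 8
duplicate_copies = 4      # the parameter is replicated on 4 cards after sharding

if param_shape[0] % duplicate_copies == 0:
    # Each card keeps only its own slice and updates it in the optimizer.
    slice_shape = (param_shape[0] // duplicate_copies,) + param_shape[1:]
    print("stored as minimal slice of shape", slice_shape)   # (2, 1024)
else:
    # The divisibility condition fails, so the full parameter is kept on every card.
    print("parameter is not sliced for the optimizer update")
```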

| Parallel mode | Parameter update mode | Optimizer support | Backend support |
| --- | --- | --- | --- |
| Automatic/semi-automatic parallel | The parameters are sliced into N copies according to data parallelism, and each card updates the parameters on the current card | All optimizers | Ascend, GPU |

In this mode, optimizer parallelism does not affect the compute graph of the original forward and backward network; it only changes the amount of computation and the compute logic of the parameter updates.

Optimizer parallelism is supported on the Ascend and GPU hardware platforms and needs to be run in Graph mode.

Related interfaces:

  1. mindspore.parallel.auto_parallel.AutoParallel(network, parallel_mode="semi_auto"): Encapsulates the specified parallel mode via static graph parallelism, where network is the top-level Cell or function to be encapsulated, and parallel_mode takes the value semi_auto, indicating a semi-automatic parallel mode. The interface returns a Cell encapsulated with parallel configuration.

  2. mindspore.parallel.auto_parallel.AutoParallel.hsdp(shard_size=-1, threshold=64, optimizer_level="level1"): Configures and enables optimizer parallelism through this interface. shard_size specifies the size of the communication group for optimizer weight sharding. threshold defines the minimum memory size (in KB) required for a parameter to be sharded; parameters smaller than this threshold will not be sharded during parameter partitioning. optimizer_level specifies the splitting level of optimizer sharding. When optimizer_level=level1, splitting is performed on weights and optimizer state. When optimizer_level=level2, splitting is performed on weights, optimizer state, and gradients. When optimizer_level=level3, splitting is performed on weights, optimizer state, and gradients; additionally, before the backward pass, the weights are further applied with an allgather communication to release the memory used by the forward-pass allgather.

  3. mindspore.nn.Cell.set_comm_fusion(fusion_type=NUM): In automatic/semi-automatic mode, each parameter generates a corresponding AllGather operation and ReduceScatter operation. These communication operators are inserted automatically by the auto-parallel framework. However, as the number of parameters grows, the number of communication operators grows with it, and scheduling and launching them incurs more overhead. Therefore, the fusion marker NUM can be configured manually for the AllGather and ReduceScatter operations of the parameters within each Cell through the set_comm_fusion method provided by Cell, in order to improve communication efficiency. MindSpore fuses the communication operators of parameters that share the same NUM to minimize communication overhead. A combined usage sketch of these interfaces follows this list.
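The sketch below puts the three interfaces together. It is a minimal, hedged example: the toy network, the GRAPH_MODE context setting, and the mindspore.communication.init() call are assumed boilerplate for a typical distributed launch, while AutoParallel, hsdp, and set_comm_fusion are the interfaces described above.

```python
import mindspore as ms
from mindspore import nn
from mindspore.communication import init
from mindspore.parallel.auto_parallel import AutoParallel

# Assumed typical distributed setup: graph mode plus communication initialization.
ms.set_context(mode=ms.GRAPH_MODE)
init()

class Net(nn.Cell):
    """A toy two-layer network used only to illustrate the interfaces."""
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Dense(64, 128)
        self.layer2 = nn.Dense(128, 10)

    def construct(self, x):
        return self.layer2(self.layer1(x))

net = Net()

# Put the AllGather/ReduceScatter of the two layers into different fusion
# groups so their communication can be fused and overlapped separately.
net.layer1.set_comm_fusion(fusion_type=1)
net.layer2.set_comm_fusion(fusion_type=2)

# Wrap the top-level Cell in semi-automatic parallel mode and enable
# optimizer parallelism: shard weights across the whole data-parallel group,
# skip parameters smaller than 64 KB, and split weights + optimizer state.
parallel_net = AutoParallel(net, parallel_mode="semi_auto")
parallel_net.hsdp(shard_size=-1, threshold=64, optimizer_level="level1")
```

In a real job, parallel_net would then be handed to the usual training pipeline (for example a Model or a custom training step); the forward and backward graphs are unchanged, and only the parameter update is sharded.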

Basic Principles

In the traditional data parallel mode, each device holds a copy of the model parameters; the training data is sliced, the gradients are synchronized after each iteration by communication operators, and the parameters are finally updated by the optimizer computation. While data parallelism is effective in improving training throughput, the optimizer introduces redundant memory and computation and fails to make maximal use of machine resources. Therefore, we need to focus on how to eliminate this redundant memory and computation of the optimizer.

In a training iteration, data parallelism introduces a communication operation to synchronize the gradients across multiple cards, collecting the parameter gradients produced by the different samples on each card. Because model parallelism is not involved, the optimizer on each card performs exactly the same update, on the same parameters and in the same direction. The fundamental idea of eliminating this optimizer redundancy is to spread the memory and computation across the cards to achieve memory and performance gains.

To implement parallel computation for the optimizer, there are two implementation approaches: weight grouping and weight sharding.

Weight grouping divides the parameters and gradients inside the optimizer between layers (inter-layer division); the general training flow is shown in Figure 1. The parameters and gradients are grouped onto different cards to be updated, and the updated weights are then shared among devices through a communication broadcast operation. The memory and performance gains of this scheme depend on the group with the largest proportion of parameters. When the parameters are divided evenly, the theoretical positive gain is (N-1)/N of the optimizer runtime and dynamic memory, and (N-1)/N of the memory occupied by the optimizer state parameters, where N denotes the number of devices. The negative gain introduced is the communication time spent sharing the network weights.

Figure 1: Schematic diagram of the parameter grouping training process
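As a hedged worked example of the (N-1)/N gain described above (the device count is an assumption chosen for illustration):

```python
# Illustrative only: theoretical gain of evenly divided weight grouping
# across N data-parallel devices.
N = 8                   # assumed number of devices
saved = (N - 1) / N     # fraction of optimizer runtime and state memory saved
print(f"{saved:.1%} of the optimizer runtime and state memory is eliminated")
# -> 87.5% of the optimizer runtime and state memory is eliminated
```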

The other approach, weight sharding, divides the parameters within each layer (intra-layer division): each device takes the slice of every parameter and gradient that corresponds to its device rank. After the parameters and gradients are updated, a communication aggregation operation is called to share the parameters among devices. The advantage of this scheme is that it naturally supports load balancing, i.e., the number of parameters and the amount of computation are the same on every card; the disadvantage is that the shape of each parameter must be divisible by the number of devices. The theoretical gains of this scheme are the same as those of weight grouping, and the framework makes the following improvements to extend its advantages.

  • First, slicing the weights in the network can further reduce static memory. However, this also requires performing the weight-sharing operation at the end of an iteration, before the forward pass of the next iteration starts, so that the tensors entering the forward and backward computation keep their original shapes.

  • In addition, the main negative gain of optimizer parallelism is the communication time of sharing the weights; if it can be reduced or hidden, a performance gain follows. One advantage of executing this communication across iterations is that, by fusing the communication operators into appropriate groups, the communication can be interleaved with the forward network computation, hiding the communication time as much as possible. The communication time is also related to the communication volume: for networks trained with mixed precision, communicating in fp16 halves the volume compared with fp32. Combining these characteristics, the implementation scheme of parameter slicing is shown in Figure 2, and a fusion-grouping sketch is given after the figure.

Figure 2: Schematic diagram of the parameter slicing training process
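As a minimal sketch of the fusion-grouping idea above (the layer structure and group assignment are illustrative assumptions), parameters can be bucketed into a few fusion groups so that each group's communication can be fused and overlapped with the computation of other layers:

```python
from mindspore import nn

class StackedNet(nn.Cell):
    """Illustrative network: a stack of dense layers held in a CellList."""
    def __init__(self, num_layers=12, hidden=512):
        super().__init__()
        self.blocks = nn.CellList(
            [nn.Dense(hidden, hidden) for _ in range(num_layers)]
        )

    def construct(self, x):
        for block in self.blocks:
            x = block(x)
        return x

net = StackedNet()

# Bucket the layers into a small number of fusion groups. The AllGather /
# ReduceScatter operators of parameters sharing the same marker are fused,
# so each group's communication can be interleaved with the computation of
# layers that do not depend on it yet.
layers_per_group = 4
for i, block in enumerate(net.blocks):
    block.set_comm_fusion(fusion_type=i // layers_per_group + 1)
```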

In the test validation of actual network training, we found that the memory gain from parameter slicing is significant. In particular, for large-scale network models, the popular Adaptive Moment Estimation (Adam) and Layer-wise Adaptive Moments optimizer for Batch training (LAMB) are usually chosen to train the network, and the number of parameters and the amount of computation of the optimizer itself are not negligible. After the parameters are sliced, the weight parameters in the network and the two copies of state parameters in the optimizer are each reduced by (N-1)/N, which greatly saves static memory. This makes it possible to increase the number of samples per iteration and improve the overall training throughput, effectively relieving the memory pressure of large-scale network training.
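As a hedged illustration (the parameter count, data type, and device count below are assumptions, not measurements), the static-memory saving for an Adam-style optimizer, which keeps two state tensors per weight, can be estimated as follows:

```python
# Illustrative estimate only: static memory for weights plus the two Adam
# state copies (momentum and variance), before and after optimizer sharding.
num_params = 1_000_000_000   # assumed 1B-parameter model
bytes_per_value = 4          # assumed fp32 weights and optimizer states
N = 8                        # assumed number of data-parallel devices

full = num_params * bytes_per_value * 3   # weights + 2 optimizer state copies
sharded = full / N                        # each card keeps only 1/N of it
print(f"per-card static memory: {full / 2**30:.1f} GiB -> {sharded / 2**30:.1f} GiB")
# The saving is (N-1)/N of the original, i.e. 87.5% for N = 8.
```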

Optimizer parameter slicing implemented by MindSpore can also be combined with operator-level parallelism. When a parameter involved in operator-level model parallelism is sliced into fewer parts than the number of devices, the remaining copies are replicated along the data-parallel dimension, and the optimizer parameters can continue to be sliced in that dimension, increasing the utilization of machine resources and thus improving end-to-end performance.
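As a hedged numeric illustration (the device count and sharding strategy are assumptions), the remaining replication that optimizer parallelism can exploit looks like this:

```python
# Illustrative only: mixing operator-level parallelism with optimizer sharding.
total_devices = 8         # assumed 8-card setup
model_parallel_parts = 2  # the parameter is split 2-way by operator-level parallelism

# Each slice of the parameter is still replicated on the remaining devices,
# and optimizer parallelism can slice it further along that dimension.
remaining_replicas = total_devices // model_parallel_parts
print(f"each parameter slice is replicated on {remaining_replicas} cards; "
      f"optimizer parallelism can split it {remaining_replicas}-way further")
```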