Data Parallel
Overview
Data parallel is the most commonly used parallel training approach for accelerating model training and handling large-scale datasets. In data parallel mode, the training data is split into multiple shards, and each shard is assigned to a different compute node (such as a card or a device). Each node processes its own subset of the data independently, runs forward and backward propagation with the same model, and updates the model parameters after the gradients of all nodes have been synchronized.
Data parallelism is supported on the Ascend, GPU, and CPU hardware platforms, in both PyNative and Graph modes.
Related interfaces are as follows:

mindspore.set_auto_parallel_context(parallel_mode=ParallelMode.DATA_PARALLEL)
: Set the data parallel mode.

mindspore.nn.DistributedGradReducer()
: Perform multi-card gradient aggregation.
Overall Process
1. Environmental dependencies

   Before starting parallel training, the communication resources are initialized by calling the mindspore.communication.init interface, which automatically creates the global communication group WORLD_COMM_GROUP. Communication groups allow communication operators to exchange messages between cards and machines; the global communication group is the largest one, covering all devices in the current training job. Data parallel mode is then enabled by calling mindspore.set_auto_parallel_context(parallel_mode=ParallelMode.DATA_PARALLEL), as sketched below.
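   A minimal initialization sketch, assuming the script is launched with one process per card and a recent MindSpore version (the gradients_mean flag is optional and shown only for illustration):

   ```python
   import mindspore as ms
   from mindspore.communication import init

   ms.set_context(mode=ms.GRAPH_MODE)   # PyNative mode is supported as well

   # Initialize communication resources; this automatically creates the
   # global communication group WORLD_COMM_GROUP covering all devices.
   init()

   # Switch the parallel mode to data parallel.
   ms.set_auto_parallel_context(parallel_mode=ms.ParallelMode.DATA_PARALLEL,
                                gradients_mean=True)
   ```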
2. Data distribution

   The core of data parallelism lies in splitting the dataset along the sample dimension and distributing the shards to the different cards. Every dataset loading interface provided by the mindspore.dataset module offers the num_shards and shard_id parameters, which split the dataset into multiple shards and cycle through the samples so that each card collects its own batch data, starting over from the beginning when the data runs short. A usage sketch follows.
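   A minimal sharding sketch, assuming an MNIST dataset at a placeholder path and a batch size of 32 chosen only for illustration (any loader in mindspore.dataset accepts the same two parameters):

   ```python
   import mindspore.dataset as ds
   from mindspore.communication import get_rank, get_group_size

   rank_id = get_rank()          # index of the current card in the global group
   rank_size = get_group_size()  # total number of cards

   # Each card reads only its own 1/rank_size shard of the samples.
   dataset = ds.MnistDataset("./MNIST_Data/train",
                             num_shards=rank_size,
                             shard_id=rank_id)
   dataset = dataset.batch(32)
   ```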
3. Network composition

   A data parallel network is written exactly like a single-card network: during forward and backward propagation the model on each card executes independently of the others; only the network structure has to be the same. The one point that needs special attention is that, to keep training synchronized across cards, the network parameters must be initialized to the same values on every card. In DATA_PARALLEL mode this can be achieved either by setting the same seed with mindspore.set_seed or by enabling parameter_broadcast in mindspore.set_auto_parallel_context, both of which yield identical initial weights across all cards; both options are sketched below.
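   A sketch of the two options, where the nn.Dense layer is only a stand-in for the user's own single-card model:

   ```python
   import mindspore as ms
   from mindspore import nn

   # Option 1: use the same random seed in every process, so that weight
   # initialization produces identical values on every card.
   ms.set_seed(1)

   # Option 2: broadcast the initial weights of card 0 to all other cards.
   ms.set_auto_parallel_context(parallel_mode=ms.ParallelMode.DATA_PARALLEL,
                                parameter_broadcast=True)

   net = nn.Dense(32, 10)   # written exactly as in the single-card case
   ```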
4. Gradient aggregation

   Data parallel training should, in theory, achieve the same result as single-card training. To keep the computational logic consistent, gradient aggregation across cards is performed by calling the mindspore.nn.DistributedGradReducer() interface, which automatically inserts an AllReduce operator after the gradient computation is completed. DistributedGradReducer() provides a mean switch that lets the user choose whether or not to average the summed gradient values; this switch can also be regarded as a hyperparameter. A construction sketch follows.
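   A construction sketch, with a placeholder network and optimizer standing in for the user's own model:

   ```python
   from mindspore import nn
   from mindspore.communication import get_group_size

   net = nn.Dense(32, 10)                                   # placeholder network
   optimizer = nn.Momentum(net.trainable_params(),
                           learning_rate=0.01, momentum=0.9)

   # When called on the gradients, the reducer performs an AllReduce over
   # all cards. mean=True divides the summed gradients by the number of
   # cards; mean=False keeps the plain sum.
   grad_reducer = nn.DistributedGradReducer(optimizer.parameters,
                                            mean=True,
                                            degree=get_group_size())
   ```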
5. Parameter update

   Because of the gradient aggregation step, the model on every card enters the parameter update step with the same gradient values, as in the training-step sketch below.
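   A functional-style training step that ties the pieces together; net, optimizer and grad_reducer are assumed to be defined as in the earlier sketches:

   ```python
   import mindspore as ms
   from mindspore import nn

   loss_fn = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction='mean')

   def forward_fn(data, label):
       logits = net(data)                # net from the earlier sketches
       return loss_fn(logits, label)

   grad_fn = ms.value_and_grad(forward_fn, None, optimizer.parameters)

   def train_step(data, label):
       loss, grads = grad_fn(data, label)
       grads = grad_reducer(grads)   # identical aggregated gradients on every card
       optimizer(grads)              # hence every card applies the same update
       return loss
   ```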