Network Training and Gradient Derivation
Automatic Differentiation
After the forward network is constructed, MindSpore provides an automatic differentiation interface for calculating the gradients of the model. The automatic derivation tutorial describes a variety of gradient computation scenarios.
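As a minimal sketch (the function and values here are illustrative, not from the tutorial), mindspore.grad obtains the gradient function of a forward function with respect to a chosen input position:

import mindspore as ms

def forward(x, y):
    return x ** 2 + y  # a simple scalar function

# Gradient with respect to the first input: d(x**2 + y)/dx = 2x
grad_fn = ms.grad(forward, grad_position=0)
print(grad_fn(ms.Tensor(3.0, ms.float32), ms.Tensor(1.0, ms.float32)))  # -> 6.0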
Network Training
The entire training network consists of the forward network (the network with its loss function), automatic gradient derivation, and the optimizer update. MindSpore provides three ways to implement this process:
1. Encapsulate model and perform network training by using model.train or model.fit, such as Model Training (a minimal sketch follows this list).
2. Apply the encapsulated TrainOneStepCell and TrainOneStepWithLossScaleCell to the common training process and the training process with loss_scale respectively, such as Quick Start: Linear Fitting.
3. Customize a training Cell.
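For the first method, a minimal sketch of the Model encapsulation might look as follows. The network, loss function, optimizer, and dataset here are placeholders, not the official example:

import mindspore as ms
from mindspore import nn
from mindspore.train import Model

net = nn.Dense(16, 1)   # placeholder network
loss_fn = nn.MSELoss()  # loss function
optimizer = nn.Momentum(net.trainable_params(), learning_rate=0.01, momentum=0.9)

# Model encapsulates the forward network, loss function and optimizer
model = Model(net, loss_fn=loss_fn, optimizer=optimizer)
# model.train(epoch=2, train_dataset=train_dataset)  # train_dataset: a mindspore.dataset.Dataset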
Customizing training Cell
The first two methods are illustrated by the two examples from the official website linked above. For a customized training Cell, we first review what TrainOneStepCell does:
import mindspore as ms
from mindspore import ops, nn
from mindspore.parallel._utils import (_get_device_num, _get_gradients_mean,
                                       _get_parallel_mode)


class TrainOneStepCell(nn.Cell):
    def __init__(self, network, optimizer):
        super(TrainOneStepCell, self).__init__(auto_prefix=False)
        self.network = network      # Network structure with loss
        self.network.set_grad()     # Required for PYNATIVE mode. If True, the backward network that computes the gradients is generated when the forward network is executed.
        self.optimizer = optimizer  # Optimizer for parameter updates
        self.weights = self.optimizer.parameters  # Get the parameters of the optimizer

        # Logic related to parallel computation
        self.reducer_flag = False
        self.grad_reducer = ops.identity
        self.parallel_mode = _get_parallel_mode()
        self.reducer_flag = self.parallel_mode in (ms.ParallelMode.DATA_PARALLEL, ms.ParallelMode.HYBRID_PARALLEL)
        if self.reducer_flag:
            self.mean = _get_gradients_mean()
            self.degree = _get_device_num()
            self.grad_reducer = nn.DistributedGradReducer(self.weights, self.mean, self.degree)

    def construct(self, *inputs):
        # Obtain the loss and the gradients of all Parameter free variables
        loss, grads = ms.value_and_grad(self.network, grad_position=None, weights=self.weights)(*inputs)
        # grads = grad_op(grads)  # Some computing logic for the gradients can be added here, such as gradient clipping
        grads = self.grad_reducer(grads)  # Gradient aggregation
        self.optimizer(grads)             # Optimizer updates parameters
        return loss
The whole training process is thus encapsulated into a Cell that realizes the forward computation, backward gradient derivation, and parameter update of the network, and we can apply special processing to the gradients after obtaining them.
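As a usage sketch (the network, loss function, and data here are placeholders), the Cell defined above can be driven directly in a training loop:

import numpy as np
import mindspore as ms
from mindspore import nn

net = nn.Dense(16, 1)                              # placeholder forward network
loss_net = nn.WithLossCell(net, nn.MSELoss())      # attach the loss function to the network
optimizer = nn.Momentum(loss_net.trainable_params(), learning_rate=0.01, momentum=0.9)
train_net = TrainOneStepCell(loss_net, optimizer)  # the custom Cell defined above

data = ms.Tensor(np.random.randn(32, 16), ms.float32)
label = ms.Tensor(np.random.randn(32, 1), ms.float32)
for step in range(10):
    loss = train_net(data, label)                  # one forward-backward-update step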
Gradient Clipping
When gradients explode or become particularly large during training, making training unstable, consider adding gradient clipping. Here is an example of the common scenario of clipping by the global norm.
import mindspore as ms
from mindspore import nn, ops

_grad_scale = ops.MultitypeFuncGraph("grad_scale")


@_grad_scale.register("Tensor", "Tensor")
def tensor_grad_scale(scale, grad):
    return grad * ops.cast(ops.Reciprocal()(scale), ops.dtype(grad))


class MyTrainOneStepCell(nn.TrainOneStepWithLossScaleCell):
    """
    Network training package class with gradient clipping.

    Appends an optimizer to the training network, after which the construct function
    can be called to create the backward graph.

    Args:
        network (Cell): The training network.
        optimizer (Cell): Optimizer for updating the weights.
        scale_sense (Cell or Number): The loss scaling update cell or a fixed scale value. Default: 1.
        grad_clip (bool): Whether to clip gradients. Default: False.
    """

    def __init__(self, network, optimizer, scale_sense=1, grad_clip=False):
        if isinstance(scale_sense, (int, float)):
            scale_sense = nn.FixedLossScaleUpdateCell(scale_sense)
        super(MyTrainOneStepCell, self).__init__(network, optimizer, scale_sense)
        self.grad_clip = grad_clip

    def construct(self, x, img_shape, gt_bboxe, gt_label, gt_num):
        # Most of these are attributes and methods of the base class; refer to the corresponding base class API for details
        weights = self.weights
        loss = self.network(x, img_shape, gt_bboxe, gt_label, gt_num)
        scaling_sens = self.scale_sense

        # Start floating-point overflow detection. Create and clear the overflow detection state
        status, scaling_sens = self.start_overflow_check(loss, scaling_sens)

        # Multiply the gradients by the scale to prevent them from underflowing
        scaling_sens_filled = ops.ones_like(loss) * ops.cast(scaling_sens, ops.dtype(loss))
        grads = self.grad(self.network, weights)(x, img_shape, gt_bboxe, gt_label, gt_num, scaling_sens_filled)

        # Recover the true gradient values by dividing the obtained gradients by the scale
        grads = self.hyper_map(ops.partial(_grad_scale, scaling_sens), grads)

        # Gradient clipping
        if self.grad_clip:
            grads = ops.clip_by_global_norm(grads)

        # Gradient aggregation
        grads = self.grad_reducer(grads)

        # Get the floating-point overflow status
        cond = self.get_overflow_status(status, grads)

        # Adjust the loss scaling factor according to the overflow state when dynamic loss scale is used
        overflow = self.process_loss_scale(cond)

        # If there is no overflow, execute the optimizer to update the parameters
        if not overflow:
            self.optimizer(grads)
        return loss, cond, scaling_sens.value()
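Note that ops.clip_by_global_norm is called here with its default clip_norm of 1.0; pass an explicit threshold, for example ops.clip_by_global_norm(grads, clip_norm=5.0), if a different bound is needed. Likewise, passing an nn.DynamicLossScaleUpdateCell instead of a number as scale_sense enables dynamic loss scaling, where the scale is adjusted automatically according to the overflow status computed above.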
Gradient Accumulation
Gradient accumulation is a method of splitting the data samples used to train a neural network into several small batches and computing them in sequence. It addresses the out-of-memory (OOM) problem where the neural network cannot be trained because the batch size is too large, or the network model cannot be loaded because it is too large.
Please refer to Gradient Accumulation for details.
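As a minimal sketch of the idea (not the implementation from the linked tutorial; the network and data here are placeholders), gradients from several small batches can be summed into buffers and applied in a single optimizer step:

import numpy as np
import mindspore as ms
from mindspore import nn, ops

net = nn.Dense(16, 1)  # placeholder network
loss_fn = nn.MSELoss()
optimizer = nn.SGD(net.trainable_params(), learning_rate=0.01)
weights = optimizer.parameters

def forward(x, y):
    return loss_fn(net(x), y)

grad_fn = ms.value_and_grad(forward, grad_position=None, weights=weights)

accum_steps = 4  # number of small batches per parameter update
accum_grads = [ops.zeros_like(w) for w in weights]  # gradient accumulation buffers

for step in range(8):
    x = ms.Tensor(np.random.randn(8, 16), ms.float32)  # one small batch
    y = ms.Tensor(np.random.randn(8, 1), ms.float32)
    loss, grads = grad_fn(x, y)
    accum_grads = [a + g for a, g in zip(accum_grads, grads)]
    if (step + 1) % accum_steps == 0:
        optimizer(tuple(g / accum_steps for g in accum_grads))  # apply the averaged gradients
        accum_grads = [ops.zeros_like(w) for w in weights]      # clear the buffers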