# Gradient Derivation

## Automatic Differentiation Interfaces

After the forward network is constructed, MindSpore provides automatic differentiation interfaces to compute the gradients of the model. The automatic differentiation tutorial describes the various gradient calculation scenarios.

MindSpore currently provides three interfaces for computing gradients.
### mindspore.grad
There are four configurable parameters in `mindspore.grad`:

- `fn` (Union[Cell, Function]) - The function or network (Cell) to be differentiated.
- `grad_position` (Union[NoneType, int, tuple[int]]) - Index of the input position(s) to differentiate with respect to. Default: `0`.
- `weights` (Union[ParameterTuple, Parameter, list[Parameter]]) - The network parameters whose gradients should be returned. Default: `None`.
- `has_aux` (bool) - Whether to return auxiliary outputs. If `True`, `fn` must have more than one output; only the first output of `fn` participates in the differentiation, and the other outputs are returned directly. Default: `False`.
`grad_position` and `weights` together determine which gradients are output, while `has_aux` configures whether, when the network has multiple outputs, only the first output participates in the differentiation (with the remaining outputs returned as-is).
| grad_position | weights | output |
| --- | --- | --- |
| 0 | None | Gradient of the first input |
| 1 | None | Gradient of the second input |
| (0, 1) | None | (Gradient of the first input, gradient of the second input) |
| None | weights | (Gradient of weights) |
| 0 | weights | (Gradient of the first input), (Gradient of weights) |
| (0, 1) | weights | (Gradient of the first input, gradient of the second input), (Gradient of weights) |
| None | None | Report an error |
Run an actual example to see exactly how it works. First, construct a network with parameters that has two outputs, `loss` and `logits`, where `loss` is the output used for differentiation.
```python
import mindspore as ms
from mindspore import nn

class Net(nn.Cell):
    def __init__(self, in_channel, out_channel):
        super(Net, self).__init__()
        self.fc = nn.Dense(in_channel, out_channel, has_bias=False)
        self.loss = nn.MSELoss()

    def construct(self, x, y):
        logits = self.fc(x).squeeze()
        loss = self.loss(logits, y)
        return loss, logits

net = Net(3, 1)
net.fc.weight.set_data(ms.Tensor([[2, 3, 4]], ms.float32))  # Set a fixed value for the fully connected weight
print("=== weight ===")
for param in net.trainable_params():
    print("name:", param.name, "data:", param.data.asnumpy())

x = ms.Tensor([[1, 2, 3]], ms.float32)
y = ms.Tensor(19, ms.float32)
loss, logits = net(x, y)
print("=== output ===")
print(loss, logits)
```
```text
=== weight ===
name: fc.weight data: [[2. 3. 4.]]
=== output ===
1.0 20.0
```
```python
# Find the gradient for the first input
print("=== grads 1 ===")
grad_func = ms.grad(net, grad_position=0, weights=None, has_aux=True)
grad, logit = grad_func(x, y)
print("grad", grad)
print("logit", logit)
```

```text
=== grads 1 ===
grad [[4. 6. 8.]]
logit (Tensor(shape=[], dtype=Float32, value= 20),)
```
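As a sanity check, this gradient can be derived by hand with plain Python (no MindSpore needed): with `w = [2, 3, 4]` and `x = [1, 2, 3]`, `logits = w·x = 20`, and `MSELoss` on scalars gives `loss = (logits - y)**2`, so by the chain rule `dloss/dx_i = 2 * (logits - y) * w_i`:

```python
# Hand computation of grads 1 (pure Python, independent of MindSpore).
# loss = (logits - y)**2 with logits = sum(w_i * x_i), so
# dloss/dx_i = 2 * (logits - y) * w_i.
w = [2.0, 3.0, 4.0]
x = [1.0, 2.0, 3.0]
y = 19.0
logits = sum(wi * xi for wi, xi in zip(w, x))  # 20.0
dloss_dx = [2 * (logits - y) * wi for wi in w]
print(dloss_dx)  # [4.0, 6.0, 8.0], matching the grad printed above
```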
```python
# Find the gradient for the second input
print("=== grads 2 ===")
grad_func = ms.grad(net, grad_position=1, weights=None, has_aux=True)
grad, logit = grad_func(x, y)
print("grad", grad)
print("logit", logit)
```

```text
=== grads 2 ===
grad -2.0
logit (Tensor(shape=[], dtype=Float32, value= 20),)
```
```python
# Find the gradient for multiple inputs
print("=== grads 3 ===")
grad_func = ms.grad(net, grad_position=(0, 1), weights=None, has_aux=True)
grad, logit = grad_func(x, y)
print("grad", grad)
print("logit", logit)
```

```text
=== grads 3 ===
grad (Tensor(shape=[1, 3], dtype=Float32, value=
[[4.00000000e+000, 6.00000000e+000, 8.00000000e+000]]), Tensor(shape=[], dtype=Float32, value= -2))
logit (Tensor(shape=[], dtype=Float32, value= 20),)
```
```python
# Find the gradient for weights
print("=== grads 4 ===")
grad_func = ms.grad(net, grad_position=None, weights=net.trainable_params(), has_aux=True)
grad, logit = grad_func(x, y)
print("grad", grad)
print("logits", logit)
```

```text
=== grads 4 ===
grad (Tensor(shape=[1, 3], dtype=Float32, value=
[[2.00000000e+000, 4.00000000e+000, 6.00000000e+000]]),)
logits (Tensor(shape=[], dtype=Float32, value= 20),)
```
```python
# Find the gradient for the first input and weights
print("=== grads 5 ===")
grad_func = ms.grad(net, grad_position=0, weights=net.trainable_params(), has_aux=True)
grad, logit = grad_func(x, y)
print("grad", grad)
print("logit", logit)
```

```text
=== grads 5 ===
grad (Tensor(shape=[1, 3], dtype=Float32, value=
[[4.00000000e+000, 6.00000000e+000, 8.00000000e+000]]), (Tensor(shape=[1, 3], dtype=Float32, value=
[[2.00000000e+000, 4.00000000e+000, 6.00000000e+000]]),))
logit (Tensor(shape=[], dtype=Float32, value= 20),)
```
```python
# Find the gradient for multiple inputs and weights
print("=== grads 6 ===")
grad_func = ms.grad(net, grad_position=(0, 1), weights=net.trainable_params(), has_aux=True)
grad, logit = grad_func(x, y)
print("grad", grad)
print("logit", logit)
```

```text
=== grads 6 ===
grad ((Tensor(shape=[1, 3], dtype=Float32, value=
[[4.00000000e+000, 6.00000000e+000, 8.00000000e+000]]), Tensor(shape=[], dtype=Float32, value= -2)), (Tensor(shape=[1, 3], dtype=Float32, value=
[[2.00000000e+000, 4.00000000e+000, 6.00000000e+000]]),))
logit (Tensor(shape=[], dtype=Float32, value= 20),)
```
```python
# Scenario with has_aux=False
print("=== grads 7 ===")
grad_func = ms.grad(net, grad_position=0, weights=None, has_aux=False)
grad = grad_func(x, y)  # Only one output
print("grad", grad)
```

```text
=== grads 7 ===
grad [[ 6. 9. 12.]]
```
The `has_aux=False` scenario is equivalent to summing the two outputs and differentiating the sum:
```python
class Net2(nn.Cell):
    def __init__(self, in_channel, out_channel):
        super().__init__()
        self.fc = nn.Dense(in_channel, out_channel, has_bias=False)
        self.loss = nn.MSELoss()

    def construct(self, x, y):
        logits = self.fc(x).squeeze()
        loss = self.loss(logits, y)
        return loss + logits

net2 = Net2(3, 1)
net2.fc.weight.set_data(ms.Tensor([[2, 3, 4]], ms.float32))  # Set a fixed value for the fully connected weight
grads = ms.grad(net2, grad_position=0, weights=None, has_aux=False)
grad = grads(x, y)  # Only one output
print("grad", grad)
```

```text
grad [[ 6. 9. 12.]]
```
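This result can likewise be checked by hand with plain Python (not the MindSpore API): differentiating `loss + logits` adds the two individual gradients, `2 * (logits - y) * w_i` from the loss term and `w_i` from the logits term, giving `[4, 6, 8] + [2, 3, 4]`:

```python
# Hand check of the has_aux=False result:
# d(loss + logits)/dx_i = 2 * (logits - y) * w_i + w_i
w = [2.0, 3.0, 4.0]
x = [1.0, 2.0, 3.0]
y = 19.0
logits = sum(wi * xi for wi, xi in zip(w, x))  # 20.0
grad_sum = [2 * (logits - y) * wi + wi for wi in w]
print(grad_sum)  # [6.0, 9.0, 12.0]
```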
```python
# grad_position=None, weights=None
print("=== grads 8 ===")
grad_func = ms.grad(net, grad_position=None, weights=None, has_aux=True)
grad, logit = grad_func(x, y)
print("grad", grad)
print("logit", logit)
```

```text
=== grads 8 ===
ValueError: `grad_position` and `weight` can not be None at the same time.
```
### mindspore.value_and_grad

The parameters of the interface `mindspore.value_and_grad` are the same as those of `grad` above, except that this interface computes the forward result and the gradient of the network in one call.
| grad_position | weights | output |
| --- | --- | --- |
| 0 | None | (Output of the network, gradient of the first input) |
| 1 | None | (Output of the network, gradient of the second input) |
| (0, 1) | None | (Output of the network, (gradient of the first input, gradient of the second input)) |
| None | weights | (Output of the network, (gradient of weights)) |
| 0 | weights | (Output of the network, ((gradient of the first input), (gradient of weights))) |
| (0, 1) | weights | (Output of the network, ((gradient of the first input, gradient of the second input), (gradient of weights))) |
| None | None | Report an error |
```python
print("=== value and grad ===")
value_and_grad_func = ms.value_and_grad(net, grad_position=(0, 1), weights=net.trainable_params(), has_aux=True)
value, grad = value_and_grad_func(x, y)
print("value", value)
print("grad", grad)
```

```text
=== value and grad ===
value (Tensor(shape=[], dtype=Float32, value= 1), Tensor(shape=[], dtype=Float32, value= 20))
grad ((Tensor(shape=[1, 3], dtype=Float32, value=
[[4.00000000e+000, 6.00000000e+000, 8.00000000e+000]]), Tensor(shape=[], dtype=Float32, value= -2)), (Tensor(shape=[1, 3], dtype=Float32, value=
[[2.00000000e+000, 4.00000000e+000, 6.00000000e+000]]),))
```
### mindspore.ops.GradOperation

`mindspore.ops.GradOperation` is a higher-order function that generates a gradient function for the input function. The generated gradient function can be customized through its construction parameters.

Its functionality is similar to that of `grad`, and it is not recommended in the current version. Please refer to the API description for details.
## Loss Scale

Since gradient overflow may occur when computing gradients in mixed-precision scenarios, a loss scale is generally used alongside the gradient derivation.

On Ascend, operators such as Conv, Sort, and TopK only support float16, and MatMul is preferably float16 for performance reasons, so it is recommended to use loss scale operations as standard practice for network training. See the [list of operators that only support float16 on Ascend](https://www.mindspore.cn/docs/en/r2.1/migration_guide/debug_and_tune.html#4-training-accuracy).

Information about the overflowing operator can be obtained via the MindSpore Insight debugger or dump data. Overflow typically manifests as the loss becoming NaN/INF, or the loss suddenly growing very large.
```python
from mindspore.amp import StaticLossScaler, all_finite

loss_scale = StaticLossScaler(1024.)  # static loss scale

def forward_fn(x, y):
    loss, logits = net(x, y)
    print("loss", loss)
    loss = loss_scale.scale(loss)
    return loss, logits

value_and_grad_func = ms.value_and_grad(forward_fn, grad_position=None, weights=net.trainable_params(), has_aux=True)
(loss, logits), grad = value_and_grad_func(x, y)
print("=== loss scale ===")
print("loss", loss)
print("grad", grad)
print("=== unscale ===")
loss = loss_scale.unscale(loss)
grad = loss_scale.unscale(grad)
print("loss", loss)
print("grad", grad)

# Check whether there is an overflow; returns True if there is no overflow
state = all_finite(grad)
print(state)
```
```text
loss 1.0
=== loss scale ===
loss 1024.0
grad (Tensor(shape=[1, 3], dtype=Float32, value=
[[2.04800000e+003, 4.09600000e+003, 6.14400000e+003]]),)
=== unscale ===
loss 1.0
grad (Tensor(shape=[1, 3], dtype=Float32, value=
[[2.00000000e+000, 4.00000000e+000, 6.00000000e+000]]),)
True
```
The principle of loss scale is simple: multiplying the loss by a relatively large value also multiplies, via the chain rule, every gradient along the backward path by that same value, which prevents precision problems when gradients become too small during back-propagation.

After computing the gradients, the loss and gradients must be divided back by the scale so that the overall computation remains correct.

Finally, `all_finite` is generally used to check for overflow; if there is no overflow, the optimizer can update the parameters.
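The mechanics can be sketched with plain Python floats on a toy one-parameter loss (a sketch of the principle, not the MindSpore API): scaling the loss scales its gradient by the same factor through the chain rule, and dividing by the scale afterwards recovers the true gradient exactly.

```python
SCALE = 1024.0

def scaled_grad(w, x, y):
    # For loss = (w*x - y)**2, dloss/dw = 2*(w*x - y)*x; scaling the loss
    # by SCALE scales this gradient by SCALE as well (chain rule).
    return SCALE * 2 * (w * x - y) * x

w, x, y = 2.0, 3.0, 7.0              # w*x - y = -1
grad = scaled_grad(w, x, y) / SCALE  # unscale to recover the true gradient
print(grad)  # -6.0, identical to differentiating the unscaled loss
```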
## Gradient Clipping

When training is unstable because of gradient explosion or particularly large gradients, consider adding gradient clipping. Here is an example of clipping gradients by global norm:
```python
from mindspore import ops

grad = ops.clip_by_global_norm(grad)
```
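The idea behind global-norm clipping can be illustrated with a pure-Python sketch (an illustration of the concept, not the `ops.clip_by_global_norm` implementation): if the L2 norm taken over all gradients exceeds `clip_norm`, every gradient is scaled by `clip_norm / global_norm` so the combined norm equals `clip_norm`.

```python
import math

# Illustrative pure-Python version of clipping by global norm (a sketch of
# the idea, not the MindSpore implementation).
def clip_by_global_norm(grads, clip_norm=1.0):
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    if global_norm <= clip_norm:
        return grads
    scale = clip_norm / global_norm
    return [[g * scale for g in grad] for grad in grads]

grads = [[3.0, 4.0]]  # global norm = 5.0
clipped = clip_by_global_norm(grads, clip_norm=1.0)
print(clipped)  # each value scaled by 1/5, so the new global norm is 1.0
```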
## Gradient Accumulation

Gradient accumulation is a way of training a neural network in which the data samples of a batch are split into several small micro-batches that are computed in sequence. It addresses the OOM (Out Of Memory) problem where, due to insufficient memory, the network cannot be trained with the required batch size, or the network model is too large to load.

For details, refer to Gradient Accumulation.
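The equivalence between accumulated micro-batch gradients and a single large-batch step can be shown with a toy pure-Python example (a sketch of the principle with a hypothetical per-sample loss, not a framework API): for a mean loss, averaging the micro-batch gradients reproduces the full-batch gradient exactly.

```python
# Toy sketch of gradient accumulation: a batch of 4 samples is processed as
# 2 micro-batches, micro-batch gradients are accumulated, and the single
# parameter update uses their average, matching the full-batch gradient.
def sample_grad(w, x, y):
    # d/dw of the per-sample loss (w*x - y)**2
    return 2 * (w * x - y) * x

w = 1.0
data = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 4.0)]

# Full-batch gradient: mean of per-sample gradients.
full = sum(sample_grad(w, x, y) for x, y in data) / len(data)

# Accumulated gradient: sum the mean gradient of each micro-batch, then
# divide by the number of micro-batches before the optimizer step.
micro = 2
acc = 0.0
for i in range(0, len(data), micro):
    chunk = data[i:i + micro]
    acc += sum(sample_grad(w, x, y) for x, y in chunk) / len(chunk)
accumulated = acc / (len(data) // micro)
print(accumulated == full)  # True
```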