Advanced Automatic Differentiation

The grad and value_and_grad functions provided by the mindspore.ops module compute the gradients of a network model. grad returns the gradient only, while value_and_grad returns both the forward output and the gradient. This article covers the main features of grad: first-order and higher-order derivatives, differentiating with respect to the inputs or the network weights separately, returning auxiliary variables, and stopping gradient computation.

For more information about the derivative interface, please refer to the API documentation.

First-order Derivation

Method: mindspore.grad. The parameter usage is as follows:

fn: the function or network to be derived.
grad_position: the index or indices of the inputs to differentiate with respect to, counted from 0. An int selects a single input; a tuple selects every input whose index appears in the tuple; None means that no input is differentiated, in which case weights must not be None. Default: 0.
weights: the network parameters whose gradients should be returned. They can usually be obtained with weights = net.trainable_params(). Default: None.
has_aux: flag indicating whether to return auxiliary outputs. If True, fn must return more than one output; only the first output of fn takes part in the derivation, and the other outputs are returned unchanged. Default: False.

The following briefly introduces grad by first constructing a custom network model Net and then computing its first-order derivative:

\[f(x, y)=x * x * y * z \tag{1}\]

First, define the network model Net, input x, and input y.

import numpy as np
from mindspore import ops, Tensor
import mindspore.nn as nn
import mindspore as ms

# Define the inputs x and y.
x = Tensor([3.0], dtype=ms.float32)
y = Tensor([5.0], dtype=ms.float32)


class Net(nn.Cell):
    def __init__(self):
        super(Net, self).__init__()
        self.z = ms.Parameter(ms.Tensor(np.array([1.0], np.float32)), name='z')

    def construct(self, x, y):
        out = x * x * y * self.z
        return out

Computing the First-order Derivative for Input

To differentiate with respect to the inputs x and y, set grad_position to (0, 1):

\[\frac{\partial f}{\partial x}=2 * x * y * z \tag{2}\]

\[\frac{\partial f}{\partial y}=x * x * z \tag{3}\]

net = Net()
grad_fn = ms.grad(net, grad_position=(0, 1))
gradients = grad_fn(x, y)
print(gradients)

(Tensor(shape=[1], dtype=Float32, value= [ 3.00000000e+01]), Tensor(shape=[1], dtype=Float32, value= [ 9.00000000e+00]))

Computing the Derivative for Weight

To differentiate with respect to the weight z only, without differentiating the inputs, set grad_position to None and pass the parameters through weights:

\[\frac{\partial f}{\partial z}=x * x * y \tag{4}\]

params = ms.ParameterTuple(net.trainable_params())

output = ms.grad(net, grad_position=None, weights=params)(x, y)
print(output)

(Tensor(shape=[1], dtype=Float32, value= [ 4.50000000e+01]),)
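As an independent sanity check of equations (2) to (4), the three partial derivatives can be approximated with central differences in plain Python. This is only a sketch: central_diff is a hypothetical helper written for this illustration, not part of MindSpore, and the evaluation point reuses x = 3, y = 5, z = 1 from the example above.

```python
# Finite-difference check of equations (2)-(4); plain Python, no MindSpore needed.

def f(x, y, z):
    # Same expression as Net.construct: x * x * y * z
    return x * x * y * z


def central_diff(fn, args, index, h=1e-4):
    """Hypothetical helper: approximate d fn / d args[index] by central differences."""
    hi = list(args)
    lo = list(args)
    hi[index] += h
    lo[index] -= h
    return (fn(*hi) - fn(*lo)) / (2 * h)


point = (3.0, 5.0, 1.0)  # x, y, z as in the example above

df_dx = central_diff(f, point, 0)  # analytical value: 2 * x * y * z = 30
df_dy = central_diff(f, point, 1)  # analytical value: x * x * z = 9
df_dz = central_diff(f, point, 2)  # analytical value: x * x * y = 45

print(round(df_dx, 3), round(df_dy, 3), round(df_dz, 3))
```

The approximations agree with the values 30, 9 and 45 printed by grad, up to floating-point error.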

Returning Auxiliary Variables

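Set has_aux to True to return extra outputs alongside the gradient. Only the first output of the function (here, loss) takes part in the derivation; the remaining outputs (here, logits) are returned unchanged. The sample code below differentiates with respect to the inputs only (grad_position=0, weights=None):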

net = nn.Dense(10, 1)
loss_fn = nn.MSELoss()


def forward(inputs, labels):
    logits = net(inputs)
    loss = loss_fn(logits, labels)
    return loss, logits


inputs = Tensor(np.random.randn(16, 10).astype(np.float32))
labels = Tensor(np.random.randn(16, 1).astype(np.float32))
weights = net.trainable_params()

# Aux value does not contribute to the gradient.
grad_fn = ms.grad(forward, grad_position=0, weights=None, has_aux=True)
inputs_gradient, (aux_logits,) = grad_fn(inputs, labels)
print(len(inputs_gradient), aux_logits.shape)

16 (16, 1)
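The has_aux contract can be modelled in a few lines of plain Python. The sketch below is not how MindSpore implements it: grad_with_aux is a hypothetical helper that uses central differences, and the scalar weight W and the scalar inputs are assumptions chosen to keep the arithmetic easy to follow. It only shows the shape of the return value: a gradient of the first output, plus a tuple of the untouched auxiliary outputs.

```python
# Conceptual model of has_aux=True using scalars; plain Python, no MindSpore needed.

W = 2.0  # assumed fixed weight of a one-parameter "network"


def forward(x, label):
    """Return (loss, logits); only loss should be differentiated."""
    logits = W * x
    loss = (logits - label) ** 2
    return loss, logits


def grad_with_aux(fn, h=1e-5):
    """Hypothetical helper: differentiate the FIRST output of fn with respect to
    its first argument and pass all other outputs through unchanged."""
    def wrapped(first, *rest):
        loss_hi = fn(first + h, *rest)[0]
        loss_lo = fn(first - h, *rest)[0]
        aux = tuple(fn(first, *rest)[1:])  # auxiliary outputs, not differentiated
        return (loss_hi - loss_lo) / (2 * h), aux
    return wrapped


grad_fn = grad_with_aux(forward)
x_gradient, (aux_logits,) = grad_fn(3.0, 5.0)

# Analytical gradient: 2 * (W * x - label) * W = 2 * (6 - 5) * 2 = 4
print(round(x_gradient, 4), aux_logits)
```

The unpacking pattern on the call line mirrors inputs_gradient, (aux_logits,) = grad_fn(inputs, labels) in the sample above.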

Stopping Gradient Computation

You can use stop_gradient to stop computing the gradient of a specified operator to eliminate the impact of the operator on the gradient.

Starting from a simple multiplication network, add a second operator out2 and disable its gradient computation to obtain the customized network Net below. Then check the derivative with respect to the input x.

The sample code is as follows:

class Net(nn.Cell):

    def __init__(self):
        super(Net, self).__init__()

    def construct(self, x, y):
        out1 = x * y
        out2 = x * y
        out2 = ops.stop_gradient(out2) # Stop computing the gradient of the out2 operator.
        out = out1 + out2
        return out


net = Net()
grad_fn = ms.grad(net)
output = grad_fn(x, y)
print(output)

[5.0]

Because stop_gradient is applied to out2, out2 does not contribute to the gradient. The result is therefore the same as if out2 had not been added: the derivative of out1 = x * y with respect to x, which is y = 5.

Now remove the line out2 = ops.stop_gradient(out2) and check the output again. The sample code is as follows:

class Net(nn.Cell):
    def __init__(self):
        super(Net, self).__init__()

    def construct(self, x, y):
        out1 = x * y
        out2 = x * y
        # out2 = ops.stop_gradient(out2)
        out = out1 + out2
        return out


net = Net()
grad_fn = ms.grad(net)
output = grad_fn(x, y)
print(output)

[10.0]

The printed result shows that, once the gradient of out2 is computed as well, out2 contributes the same gradient as out1. Each element of the result is therefore twice the previous value, up to floating-point precision.
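One way to picture the effect of stop_gradient is that the stopped value is treated as a constant during differentiation. The plain-Python sketch below models that idea with central differences; it is an illustration under that assumption, not MindSpore's implementation, and central_diff is a hypothetical helper.

```python
# Modelling stop_gradient as "freeze this value"; plain Python, no MindSpore needed.

def central_diff(fn, x, h=1e-4):
    """Hypothetical helper: approximate fn'(x) by central differences."""
    return (fn(x + h) - fn(x - h)) / (2 * h)


x0, y0 = 3.0, 5.0  # same input values as in the example above


def with_stop(x):
    out1 = x * y0
    out2 = x0 * y0  # frozen at the evaluation point, so it no longer depends on x
    return out1 + out2


def without_stop(x):
    out1 = x * y0
    out2 = x * y0  # still depends on x, so it contributes to the gradient
    return out1 + out2


grad_stopped = central_diff(with_stop, x0)   # only out1 contributes: y = 5
grad_full = central_diff(without_stop, x0)   # out1 and out2 contribute: 2 * y = 10

print(round(grad_stopped, 3), round(grad_full, 3))
```

Both variants produce the same forward value at x0; only the gradient differs.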

High-order Derivation

High-order differentiation is used in domains such as AI-supported scientific computing and second-order optimization. For example, in molecular dynamics simulation, when the potential energy is modelled by a neural network, the loss function contains the derivative of the network output with respect to the input. Backward propagation then requires the second-order cross derivative of the loss function with respect to the input and the weights.

Second-order derivatives of the output with respect to the input also appear when differential equations are solved with AI methods (such as PINNs). Another example is second-order optimization: to make the neural network converge quickly, Newton's method needs the second-order derivative of the loss function with respect to the weights.

MindSpore supports high-order derivatives by applying grad repeatedly. The following examples show how to compute them.

Single-input Single-output High-order Derivative

For example, the formula of the Sin operator is as follows:

\[f(x) = sin(x) \tag{1}\]

The first derivative is:

\[f'(x) = cos(x) \tag{2}\]

The second derivative is:

\[f''(x) = cos'(x) = -sin(x) \tag{3}\]

The second derivative (-Sin) is implemented as follows:

import numpy as np
import mindspore.nn as nn
import mindspore.ops as ops
import mindspore as ms

class Net(nn.Cell):
    """Feedforward network model"""

    def __init__(self):
        super(Net, self).__init__()
        self.sin = ops.Sin()

    def construct(self, x):
        out = self.sin(x)
        return out

x_train = ms.Tensor(np.array([3.1415926]), dtype=ms.float32)

net = Net()
firstgrad = ms.grad(net)
secondgrad = ms.grad(firstgrad)
output = secondgrad(x_train)

# Print the result.
result = np.around(output.asnumpy(), decimals=2)
print(result)

[-0.]

The preceding print result shows that the value of \(-sin(3.1415926)\) is close to \(0\).
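The pattern of applying a derivative operator twice can be reproduced without MindSpore as a rough numerical cross-check. In the sketch below, derivative is a hypothetical helper based on central differences (so it is an approximation, unlike automatic differentiation), and the step size h is an assumption.

```python
# Second derivative of sin by applying a derivative operator twice; standard library only.
import math


def derivative(fn, h=1e-3):
    """Hypothetical helper: return a function that approximates fn' numerically."""
    def d_fn(x):
        return (fn(x + h) - fn(x - h)) / (2 * h)
    return d_fn


first = derivative(math.sin)  # approximates cos(x)
second = derivative(first)    # approximates -sin(x)

x = 3.1415926
print(abs(first(x) - math.cos(x)) < 1e-5)    # first derivative matches cos(x)
print(abs(second(x) + math.sin(x)) < 1e-6)   # second derivative matches -sin(x)
```

The composition second = derivative(derivative(f)) plays the same role as secondgrad = ms.grad(ms.grad(net)) in the example above.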

Single-input Multi-output High-order Derivative

Compute the derivative of the following function:

\[f(x) = (f_1(x), f_2(x)) \tag{1}\]

Where:

\[f_1(x) = sin(x) \tag{2}\]

\[f_2(x) = cos(x) \tag{3}\]

MindSpore uses reverse-mode automatic differentiation for gradient computation. The outputs are summed, and the derivative of the sum with respect to the input is computed. Therefore, the first derivative is:

\[f'(x) = cos(x) -sin(x) \tag{4}\]

The second derivative is:

\[f''(x) = -sin(x) - cos(x) \tag{5}\]

import numpy as np
from mindspore import ops, Tensor
import mindspore.nn as nn
import mindspore as ms

class Net(nn.Cell):
    """Feedforward network model"""
    def __init__(self):
        super(Net, self).__init__()
        self.sin = ops.Sin()
        self.cos = ops.Cos()

    def construct(self, x):
        out1 = self.sin(x)
        out2 = self.cos(x)
        return out1, out2

x_train = Tensor(np.array([3.1415926]), dtype=ms.float32)

net = Net()
firstgrad = ms.grad(net)
secondgrad = ms.grad(firstgrad)
output = secondgrad(x_train)

# Print the result.
result = np.around(output.asnumpy(), decimals=2)
print(result)

[1.]

The preceding print result shows that the value of \(-sin(3.1415926) - cos(3.1415926)\) is close to \(1\).
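The summation behaviour described in this section can be mimicked numerically. The sketch below is an assumption-laden model, not MindSpore code: summed and derivative are hypothetical helpers, and central differences stand in for automatic differentiation.

```python
# Modelling "sum the outputs, then differentiate" for f(x) = (sin(x), cos(x)).
import math


def net(x):
    # Two outputs, as in Net.construct above.
    return math.sin(x), math.cos(x)


def summed(fn):
    """Hypothetical helper: collapse a multi-output function into the sum of its outputs."""
    return lambda x: sum(fn(x))


def derivative(fn, h=1e-3):
    """Hypothetical helper: return a function that approximates fn' numerically."""
    def d_fn(x):
        return (fn(x + h) - fn(x - h)) / (2 * h)
    return d_fn


first = derivative(summed(net))  # approximates cos(x) - sin(x)
second = derivative(first)       # approximates -sin(x) - cos(x)

x = 3.1415926
print(round(first(x), 2), round(second(x), 2))
```

At x = 3.1415926 the second value is close to 1, matching equation (5).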

Multiple-Input Multiple-Output High-Order Derivative

Compute the derivative of the following function:

\[f(x, y) = (f_1(x, y), f_2(x, y)) \tag{1}\]

Where:

\[f_1(x, y) = sin(x) - cos(y) \tag{2}\]

\[f_2(x, y) = cos(x) - sin(y) \tag{3}\]

MindSpore uses reverse-mode automatic differentiation for gradient computation. The outputs are summed, and the derivative of the sum with respect to each input is computed.

Sum:

\[\sum{output} = sin(x) + cos(x) - sin(y) - cos(y) \tag{4}\]

The first derivative of the output sum with respect to input \(x\) is:

\[\dfrac{\mathrm{d}\sum{output}}{\mathrm{d}x} = cos(x) - sin(x) \tag{5}\]

The second derivative of the output sum with respect to input \(x\) is:

\[\dfrac{\mathrm{d}^{2}\sum{output}}{\mathrm{d}x^{2}} = -sin(x) - cos(x) \tag{6}\]

The first derivative of the output sum with respect to input \(y\) is:

\[\dfrac{\mathrm{d}\sum{output}}{\mathrm{d}y} = -cos(y) + sin(y) \tag{7}\]

The second derivative of the output sum with respect to input \(y\) is:

\[\dfrac{\mathrm{d}^{2}\sum{output}}{\mathrm{d}y^{2}} = sin(y) + cos(y) \tag{8}\]

import numpy as np
from mindspore import ops, Tensor
import mindspore.nn as nn
import mindspore as ms

class Net(nn.Cell):
    """Feedforward network model"""
    def __init__(self):
        super(Net, self).__init__()
        self.sin = ops.Sin()
        self.cos = ops.Cos()

    def construct(self, x, y):
        out1 = self.sin(x) - self.cos(y)
        out2 = self.cos(x) - self.sin(y)
        return out1, out2

x_train = Tensor(np.array([3.1415926]), dtype=ms.float32)
y_train = Tensor(np.array([3.1415926]), dtype=ms.float32)

net = Net()
firstgrad = ms.grad(net, grad_position=(0, 1))
secondgrad = ms.grad(firstgrad, grad_position=(0, 1))
output = secondgrad(x_train, y_train)

# Print the result.
print(np.around(output[0].asnumpy(), decimals=2))
print(np.around(output[1].asnumpy(), decimals=2))

[1.]
[-1.]

According to the preceding result, the value of the second derivative \(-sin(3.1415926) - cos(3.1415926)\) of the output to the input \(x\) is close to \(1\), and the value of the second derivative \(sin(3.1415926) + cos(3.1415926)\) of the output to the input \(y\) is close to \(-1\).
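One reasoning step is worth spelling out, as a sketch that relies on the summation behaviour described in this section. Because firstgrad returns two gradients, the second call to grad differentiates their sum. Writing \(S\) for the output sum and \(g\) for the sum of the two first derivatives:

\[g(x, y) = \dfrac{\partial S}{\partial x} + \dfrac{\partial S}{\partial y} = cos(x) - sin(x) - cos(y) + sin(y)\]

\[\dfrac{\partial g}{\partial x} = \dfrac{\partial^{2} S}{\partial x^{2}} + \dfrac{\partial^{2} S}{\partial x \partial y} = -sin(x) - cos(x) + 0\]

\[\dfrac{\partial g}{\partial y} = \dfrac{\partial^{2} S}{\partial y \partial x} + \dfrac{\partial^{2} S}{\partial y^{2}} = 0 + sin(y) + cos(y)\]

The mixed partial derivatives vanish here because \(S\) is the sum of a function of \(x\) alone and a function of \(y\) alone, so the printed values coincide with the pure second derivatives. For a function that couples \(x\) and \(y\), the same code would presumably return each second derivative plus the mixed term.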

The accuracy may vary depending on the computing platform. Therefore, the execution results of the code in this section vary slightly on different platforms.