Optimization Algorithms
Overview
mindspore.nn.optim is a module in the MindSpore framework that implements various optimization algorithms, including common optimizers and learning rates. Its general-purpose APIs also make it possible to integrate newer and more complex optimization methods into the module.
mindspore.nn.optim provides common optimizers for models, such as SGD, Adam, and Momentum. The optimizer uses the computed gradients to update the model parameters, and the choice of optimization algorithm directly affects the performance of the final model. If the result is poor, the problem may lie in the optimization algorithm rather than in the features or the model design. In addition, mindspore.nn provides the learning rate modules dynamic_lr and learning_rate_schedule, which are both dynamic learning rates but are implemented in different ways. The learning rate is one of the most important hyperparameters in supervised learning and deep learning: it determines whether the objective function converges to a local minimum and how quickly it does so. An appropriate learning rate allows the objective function to converge to a local minimum in a reasonable amount of time.
All the following examples support the CPU, GPU, and Ascend environments.
Learning Rates
dynamic_lr
The mindspore.nn.dynamic_lr module contains the following functions:
piecewise_constant_lr function: computes a piecewise constant learning rate that stays fixed within each milestone interval.
exponential_decay_lr function: computes the learning rate based on the exponential decay function.
natural_exp_decay_lr function: computes the learning rate based on the natural exponential decay function.
inverse_decay_lr function: computes the learning rate based on the inverse time decay function.
cosine_decay_lr function: computes the learning rate based on the cosine decay function.
polynomial_decay_lr function: computes the learning rate based on the polynomial decay function.
warmup_lr function: gradually warms up (increases) the learning rate.
They are different implementations of dynamic_lr.
For example, the code example of the piecewise_constant_lr function is as follows:
from mindspore.nn import piecewise_constant_lr

def test_dynamic_lr():
    # The learning rate is 0.1 before step 2, 0.05 before step 5, and 0.01 before step 10.
    milestone = [2, 5, 10]
    learning_rates = [0.1, 0.05, 0.01]
    lr = piecewise_constant_lr(milestone, learning_rates)
    print(lr)

if __name__ == '__main__':
    test_dynamic_lr()
The following information is displayed:
[0.1, 0.1, 0.05, 0.05, 0.05, 0.01, 0.01, 0.01, 0.01, 0.01]
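The other dynamic_lr functions follow the same pattern: they return a list with one learning rate per step. As a minimal sketch (the parameter values below are chosen only for illustration), exponential_decay_lr can be used in the same way:

from mindspore.nn import exponential_decay_lr

def test_exponential_decay_lr():
    learning_rate = 0.1   # initial learning rate
    decay_rate = 0.9      # decay factor applied every decay_epoch epochs
    total_step = 6        # total number of steps
    step_per_epoch = 2    # steps per epoch
    decay_epoch = 1       # decay interval, in epochs
    lr = exponential_decay_lr(learning_rate, decay_rate, total_step, step_per_epoch, decay_epoch)
    # Returns a list of 6 learning rates, one per step, decaying by a factor of 0.9 each epoch.
    print(lr)

if __name__ == '__main__':
    test_exponential_decay_lr()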
learning_rate_schedule
The mindspore.nn.learning_rate_schedule module contains the following classes: ExponentialDecayLR, NaturalExpDecayLR, InverseDecayLR, CosineDecayLR, PolynomialDecayLR, and WarmUpLR. They all belong to learning_rate_schedule but are implemented in different ways. Their meanings are as follows:
ExponentialDecayLR class: computes the learning rate based on the exponential decay function.
NaturalExpDecayLR class: computes the learning rate based on the natural exponential decay function.
InverseDecayLR class: computes the learning rate based on the inverse time decay function.
CosineDecayLR class: computes the learning rate based on the cosine decay function.
PolynomialDecayLR class: computes the learning rate based on the polynomial decay function.
WarmUpLR class: gradually warms up (increases) the learning rate.
They are different implementations of learning_rate_schedule.
For example, the code example of the ExponentialDecayLR class is as follows:
from mindspore import dtype as mstype
from mindspore import Tensor
from mindspore.nn import ExponentialDecayLR

def test_learning_rate_schedule():
    learning_rate = 0.1    # learning_rate(float) - The initial value of learning rate.
    decay_rate = 0.9       # decay_rate(float) - The decay rate.
    decay_steps = 4        # decay_steps(int) - A value used to calculate decayed learning rate.
    global_step = Tensor(2, mstype.int32)
    exponential_decay_lr = ExponentialDecayLR(learning_rate, decay_rate, decay_steps)
    res = exponential_decay_lr(global_step)
    print(res)

if __name__ == '__main__':
    test_learning_rate_schedule()
The following information is displayed:
0.094868325
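Unlike the dynamic_lr functions, which precompute a list of rates, a learning_rate_schedule class computes the rate from the current step at run time, so an instance can be passed directly as the learning_rate of an optimizer. A minimal sketch follows; the single Dense layer here is a hypothetical network used only to provide trainable parameters:

from mindspore import nn
from mindspore.nn import ExponentialDecayLR

# Hypothetical network used only to provide trainable parameters.
net = nn.Dense(10, 3)

# The optimizer queries the schedule with the current global step at each update.
exponential_decay_lr = ExponentialDecayLR(learning_rate=0.1, decay_rate=0.9, decay_steps=4)
optim = nn.Momentum(net.trainable_params(), learning_rate=exponential_decay_lr, momentum=0.9)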
Optimizers
Usage
To use mindspore.nn.optim, you need to build an Optimizer object. This object maintains the current parameter state and updates the parameters based on the computed gradients.
Building
To build an Optimizer, you need to give it an iterable containing the parameters (which must be Parameter objects) to be optimized. Then you can set Optimizer options such as the learning rate and weight decay.
A code example is as follows:
from mindspore import nn

# `net` is an nn.Cell instance and `group_params` is a list of parameter
# groups; both are assumed to be defined as shown in the later examples.
optim = nn.SGD(group_params, learning_rate=0.1, weight_decay=0.0)
optim = nn.Adam(params=net.trainable_params())
optim = nn.Adam(group_params, learning_rate=0.1, weight_decay=0.0)
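Once built, the optimizer is applied to the gradients computed during training. The following is a minimal sketch of this wiring, assuming a hypothetical single-layer network and random data purely for illustration: nn.WithLossCell attaches the loss, and nn.TrainOneStepCell computes the gradients and lets the optimizer update the parameters.

import numpy as np
from mindspore import nn, Tensor
from mindspore import dtype as mstype

# Hypothetical network and data used only for illustration.
net = nn.Dense(10, 3)
loss_fn = nn.SoftmaxCrossEntropyWithLogits(sparse=True)
optim = nn.Momentum(net.trainable_params(), learning_rate=0.1, momentum=0.9)

# TrainOneStepCell computes the gradients of the loss with respect to the
# trainable parameters and calls the optimizer to update them.
net_with_loss = nn.WithLossCell(net, loss_fn)
train_step = nn.TrainOneStepCell(net_with_loss, optim)

data = Tensor(np.random.randn(4, 10).astype(np.float32))
label = Tensor(np.array([0, 1, 2, 1]), mstype.int32)
loss = train_step(data, label)   # one optimization step; returns the loss value
print(loss)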
Setting options for each parameter separately
The optimizer also allows you to set options for each parameter group separately. Instead of passing in the parameters directly, pass in an iterable of dictionaries. Each dictionary defines a group of parameters and contains a 'params' key whose value is the list of parameters in that group. The other keys should be arguments accepted by the optimizer and are used to optimize that group of parameters.
You can still pass options as keyword arguments; they are used as default values in groups that do not override them. This is useful when you want to change the options of only one parameter group while leaving the others unchanged.
Take SGD as an example. When you want to specify a different learning rate for each layer, use the following code:
from mindspore import nn

optim = nn.SGD([{'params': conv_params, 'weight_decay': 0.01},
                {'params': no_conv_params, 'lr': 0.01},
                {'order_params': net.trainable_params()}],
               learning_rate=0.1, weight_decay=0.0)
This example indicates that the conv_params group uses a weight decay of 0.01 and the default learning rate of 0.1, while the no_conv_params group uses a learning rate of 0.01 and the default weight decay of 0.0. The learning_rate=0.1 argument is used by all groups that do not set their own learning rate, and the same rule applies to weight_decay. The 'order_params' entry specifies the order of the parameters, usually the original order returned by net.trainable_params().
Built-in Optimizers
Common deep learning optimization algorithms include SGD, Adam, FTRL, LazyAdam, Momentum, RMSProp, LARS, ProximalAdagrad, and LAMB.
In the mindspore.nn.optim module, they have corresponding class implementations. For example:
SGD: with default parameters, it is plain SGD. When the momentum parameter is set to a value other than 0, first-order momentum is taken into account. When nesterov is set to True, it becomes NAG (Nesterov Accelerated Gradient), which evaluates the gradient one step ahead when computing the update.
RMSProp: considers second-order momentum, so different parameters get different learning rates, that is, adaptive learning rates. It improves on Adagrad by considering only the second-order momentum within a certain window through exponential smoothing.
Adam: considers both first-order and second-order momentum. It can be seen as RMSProp with first-order momentum added. A short sketch of these options is shown below, before the full SGD example.
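As a brief sketch of these options (the single Dense layer below is a hypothetical network used only to provide trainable parameters, and the learning rates are illustrative values), the corresponding classes can be configured as follows:

from mindspore import nn

# Hypothetical single-layer network used only to provide trainable parameters.
net = nn.Dense(10, 3)
params = net.trainable_params()

sgd = nn.SGD(params, learning_rate=0.1)                                   # plain SGD
sgd_momentum = nn.SGD(params, learning_rate=0.1, momentum=0.9)            # SGD with first-order momentum
sgd_nag = nn.SGD(params, learning_rate=0.1, momentum=0.9, nesterov=True)  # NAG (Nesterov Accelerated Gradient)
rmsprop = nn.RMSProp(params, learning_rate=0.01, decay=0.9)               # adaptive learning rates via second-order momentum
adam = nn.Adam(params, learning_rate=0.001, beta1=0.9, beta2=0.999)       # first- and second-order momentum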
For example, the code example of SGD is as follows:
from mindspore import nn, Model, Tensor
import mindspore.ops as ops
import numpy as np
from mindspore import dtype as mstype
from mindspore import Parameter

class Net(nn.Cell):
    def __init__(self):
        super(Net, self).__init__()
        self.matmul = ops.MatMul()
        self.conv = nn.Conv2d(1, 6, 5, pad_mode='valid')
        self.z = Parameter(Tensor(np.array([1.0], np.float32)))

    def construct(self, x, y):
        x = x * self.z
        out = self.matmul(x, y)
        return out

net = Net()
# A single learning rate and weight decay for all trainable parameters.
optim = nn.SGD(params=net.trainable_params())

# Group the parameters: convolution parameters use weight_decay=0.01 and the
# default learning rate, the rest use lr=0.01 and the default weight_decay.
conv_params = list(filter(lambda x: 'conv' in x.name, net.trainable_params()))
no_conv_params = list(filter(lambda x: 'conv' not in x.name, net.trainable_params()))
group_params = [{'params': conv_params, 'weight_decay': 0.01},
                {'params': no_conv_params, 'lr': 0.01},
                {'order_params': net.trainable_params()}]

optim = nn.SGD(group_params, learning_rate=0.1, weight_decay=0.0)

loss = nn.SoftmaxCrossEntropyWithLogits()
model = Model(net, loss_fn=loss, optimizer=optim)