mindspore.nn.Optimizer
- class mindspore.nn.Optimizer(learning_rate, parameters, weight_decay=0.0, loss_scale=1.0)[source]
Base class for updating parameters. Never use this class directly, but instantiate one of its subclasses instead.
Grouping parameters is supported. If parameters are grouped, a different strategy for learning_rate, weight_decay and grad_centralization can be applied to each group.
Note
If parameters are not grouped, the weight_decay in the optimizer is applied to the network parameters whose names contain neither 'beta' nor 'gamma'. Users can group parameters to change the weight decay strategy. When parameters are grouped, each group can set its own weight_decay; if a group does not set one, the weight_decay in the optimizer is applied.
- Parameters
learning_rate (Union[float, int, Tensor, Iterable, LearningRateSchedule]) –
float: The fixed learning rate value. Must be equal to or greater than 0.
int: The fixed learning rate value. Must be equal to or greater than 0. It will be converted to float.
Tensor: Its value should be a scalar or a 1-D vector. For scalar, fixed learning rate will be applied. For vector, learning rate is dynamic, then the i-th step will take the i-th value as the learning rate.
Iterable: Learning rate is dynamic. The i-th step will take the i-th value as the learning rate.
LearningRateSchedule: Learning rate is dynamic. During training, the optimizer calls the instance of LearningRateSchedule with step as the input to get the learning rate of current step.
parameters (Union[list[Parameter], list[dict]]) –
Must be a list of Parameter or a list of dict. When parameters is a list of dict, the strings "params", "lr", "weight_decay", "grad_centralization" and "order_params" are the keys that can be parsed (a short grouping sketch follows this parameter list).
params: Required. Parameters in current group. The value must be a list of Parameter.
lr: Optional. If "lr" is in the keys, the value of the corresponding learning rate will be used. If not, the learning_rate in the optimizer will be used. Fixed and dynamic learning rates are supported.
weight_decay: Optional. If "weight_decay" is in the keys, the value of the corresponding weight decay will be used. If not, the weight_decay in the optimizer will be used.
grad_centralization: Optional. Must be Boolean. If "grad_centralization" is in the keys, the set value will be used. If not, grad_centralization is False by default. This configuration only works on convolution layers.
order_params: Optional. When parameters are grouped, this is usually used to maintain the order in which the parameters appear in the network, to improve performance. The value should be the parameters whose order will be followed in the optimizer. If "order_params" is in the keys, the other keys will be ignored, and the elements of "order_params" must be in one of the groups of "params".
weight_decay (Union[float, int]) – An int or a floating point value for the weight decay. It must be equal to or greater than 0. If the type of weight_decay input is int, it will be converted to float. Default: 0.0.
loss_scale (float) – A floating point value for the loss scale. It must be greater than 0. If the type of loss_scale input is int, it will be converted to float. In general, use the default value. Only when FixedLossScaleManager is used for training and the drop_overflow_update in FixedLossScaleManager is set to False does this value need to be the same as the loss_scale in FixedLossScaleManager. Refer to mindspore.amp.FixedLossScaleManager for more details. Default: 1.0.
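A minimal grouping sketch (illustrative, not part of the original signature description): it assumes an nn.Dense layer, whose parameters are named 'weight' and 'bias', and uses nn.Momentum as a stand-in for any Optimizer subclass.
>>> from mindspore import nn
>>> net = nn.Dense(2, 3)
>>> weight_params = list(filter(lambda x: 'weight' in x.name, net.trainable_params()))
>>> bias_params = list(filter(lambda x: 'bias' in x.name, net.trainable_params()))
>>> # weight_params use lr 0.05 and weight_decay 0.01; bias_params use lr 0.01 and the
>>> # optimizer-level weight_decay; 'order_params' keeps the network's parameter order.
>>> group_params = [{'params': weight_params, 'lr': 0.05, 'weight_decay': 0.01},
...                 {'params': bias_params, 'lr': 0.01},
...                 {'order_params': net.trainable_params()}]
>>> optim = nn.Momentum(group_params, learning_rate=0.1, momentum=0.9, weight_decay=0.0)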
- Raises
TypeError – If learning_rate is not one of int, float, Tensor, Iterable, LearningRateSchedule.
TypeError – If element of parameters is neither Parameter nor dict.
TypeError – If loss_scale is not a float.
TypeError – If weight_decay is neither float nor int.
ValueError – If loss_scale is less than or equal to 0.
ValueError – If weight_decay is less than 0.
ValueError – If learning_rate is a Tensor, but the dimension of tensor is greater than 1.
- Supported Platforms:
Ascend GPU CPU
Examples
>>> import numpy as np
>>> import mindspore
>>> from mindspore import nn, ops, Tensor
>>>
>>> class MyMomentum(nn.Optimizer):
...     def __init__(self, params, learning_rate, momentum=0.9):
...         super(MyMomentum, self).__init__(learning_rate, params)
...         self.moments = self.parameters.clone(prefix="moments", init="zeros")
...         self.momentum = momentum
...         self.opt = ops.ApplyMomentum()
...
...     def construct(self, gradients):
...         params = self.parameters
...         lr = self.get_lr()
...         gradients = self.flatten_gradients(gradients)
...         gradients = self.decay_weight(gradients)
...         gradients = self.gradients_centralization(gradients)
...         gradients = self.scale_grad(gradients)
...
...         success = None
...         for param, mom, grad in zip(params, self.moments, gradients):
...             success = self.opt(param, mom, lr, grad, self.momentum)
...         return success
>>>
>>> net = nn.Dense(2, 3)
>>> loss_fn = nn.MAELoss()
>>> opt = MyMomentum(net.trainable_params(), 0.01)
>>>
>>> device_target = opt.target
>>> opt_unique = opt.unique
>>> weight_decay_value = opt.get_weight_decay()
>>>
>>> def forward_fn(data, label):
...     logits = net(data)
...     loss = loss_fn(logits, label)
...     return loss, logits
>>>
>>> grad_fn = mindspore.value_and_grad(forward_fn, None, opt.parameters, has_aux=True)
>>>
>>> def train_step(data, label):
...     (loss, _), grads = grad_fn(data, label)
...     opt(grads)
...     return loss
>>>
>>> data = Tensor(np.random.rand(4, 10, 2), mindspore.dtype.float32)
>>> label = Tensor(np.random.rand(4, 10, 3), mindspore.dtype.float32)
>>> train_step(data, label)
- broadcast_params(optim_result)[source]
Apply Broadcast operations in the sequential order of parameter groups.
- Parameters
optim_result (bool) – The results of updating parameters. This input is used to ensure that the parameters are updated before they are broadcast.
- Returns
The broadcast parameters.
- decay_weight(gradients)[source]
Weight decay.
An approach to reduce the overfitting of a deep learning neural network model. User-defined optimizers based on mindspore.nn.Optimizer can also call this interface to apply weight decay.
- flatten_gradients(gradients)[source]
Flatten gradients into several chunk tensors grouped by data type if network parameters are flattened.
A method to enable performance improvement by using contiguous memory for parameters and gradients. User-defined optimizers based on mindspore.nn.Optimizer should call this interface to support contiguous memory for network parameters.
- get_lr()[source]
The optimizer calls this interface to get the learning rate for the current step. User-defined optimizers based on mindspore.nn.Optimizer can also call this interface before updating the parameters.
- Returns
float, the learning rate of current step.
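A small usage sketch of get_lr (illustrative, assuming nn.Momentum and an iterable learning rate, where step i takes the i-th value):
>>> from mindspore import nn
>>> net = nn.Dense(2, 3)
>>> # iterable learning rate: step 0 uses 0.1, step 1 uses 0.05, step 2 uses 0.01
>>> opt = nn.Momentum(net.trainable_params(), learning_rate=[0.1, 0.05, 0.01], momentum=0.9)
>>> current_lr = opt.get_lr()   # learning rate of the current step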
- get_lr_parameter(param)[source]
When parameters are grouped and the learning rate differs between groups, get the learning rate of the specified param.
- Parameters
param (Union[Parameter, list[Parameter]]) – The Parameter or list of Parameter.
- Returns
A single Parameter or list[Parameter] according to the input type. If the learning rate is dynamic, the LearningRateSchedule or list[LearningRateSchedule] used to calculate the learning rate will be returned.
Examples
>>> from mindspore import nn
>>>
>>> # Define the network structure of LeNet5. Refer to
>>> # https://gitee.com/mindspore/docs/blob/master/docs/mindspore/code/lenet.py
>>> net = LeNet5()
>>> conv_params = list(filter(lambda x: 'conv' in x.name, net.trainable_params()))
>>> no_conv_params = list(filter(lambda x: 'conv' not in x.name, net.trainable_params()))
>>> group_params = [{'params': conv_params, 'lr': 0.05},
...                 {'params': no_conv_params, 'lr': 0.01}]
>>> optim = nn.Momentum(group_params, learning_rate=0.1, momentum=0.9, weight_decay=0.0)
>>> conv_lr = optim.get_lr_parameter(conv_params)
>>> print(conv_lr[0].asnumpy())
0.05
- get_weight_decay()[source]
The optimizer calls this interface to get the weight decay value for the current step. User-defined optimizers based on mindspore.nn.Optimizer can also call this interface before updating the parameters.
- Returns
float, the weight decay value of current step.
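A brief usage sketch of get_weight_decay (illustrative, using nn.Momentum with a fixed weight decay):
>>> from mindspore import nn
>>> net = nn.Dense(2, 3)
>>> opt = nn.Momentum(net.trainable_params(), learning_rate=0.1, momentum=0.9, weight_decay=0.01)
>>> current_decay = opt.get_weight_decay()   # the weight decay value used for the current step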
- gradients_centralization(gradients)[source]
Gradients centralization.
A method for optimizing convolutional layer parameters to improve the training speed of a deep learning neural network model. User-defined optimizers based on mindspore.nn.Optimizer can also call this interface to centralize gradients.
- scale_grad(gradients)[source]
Restore gradients for mixed precision.
User-defined optimizers based on mindspore.nn.Optimizer can also call this interface to restore gradients.
- property target
This property is used to determine whether the parameters are updated on the host or on the device. The input type is str and can only be 'CPU', 'Ascend' or 'GPU'.
- property unique
Whether to make the gradients unique in the optimizer. Generally, it is used in sparse networks. Set it to True if the gradients of the optimizer are sparse; set it to False if the forward network has made the parameters unique, that is, the gradients of the optimizer are no longer sparse. The default value is True when it is not set.
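For illustration only (a sketch based on the description above, assuming the property accepts a bool when assigned; nn.Momentum stands in for any Optimizer subclass):
>>> from mindspore import nn
>>> net = nn.Dense(2, 3)
>>> opt = nn.Momentum(net.trainable_params(), learning_rate=0.1, momentum=0.9)
>>> sparse_grads = opt.unique      # True unless it has been set explicitly
>>> opt.unique = False             # the forward network already made the parameters unique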