mindspore.nn.Optimizer

class mindspore.nn.Optimizer(learning_rate, parameters, weight_decay=0.0, loss_scale=1.0)[source]

Base class for the optimizers that update network parameters. Never use this class directly; instantiate one of its subclasses instead.

Parameter grouping is supported. If parameters are grouped, a different learning_rate, weight_decay and grad_centralization strategy can be applied to each group.

Note

If parameters are not grouped, the weight_decay in the optimizer is applied to the network parameters whose names contain neither ‘beta’ nor ‘gamma’. Users can group parameters to change the weight decay strategy. When parameters are grouped, each group can set its own weight_decay; if a group does not, the weight_decay in the optimizer is applied.

Parameters
  • learning_rate (Union[float, int, Tensor, Iterable, LearningRateSchedule]) –

    • float: The fixed learning rate value. Must be equal to or greater than 0.

    • int: The fixed learning rate value. Must be equal to or greater than 0. It will be converted to float.

    • Tensor: Its value should be a scalar or a 1-D vector. For a scalar, a fixed learning rate is applied. For a vector, the learning rate is dynamic and the i-th step takes the i-th value as the learning rate.

    • Iterable: Learning rate is dynamic. The i-th step will take the i-th value as the learning rate.

    • LearningRateSchedule: Learning rate is dynamic. During training, the optimizer calls the instance of LearningRateSchedule with step as the input to get the learning rate of current step.

  • parameters (Union[list[Parameter], list[dict]]) –

    Must be a list of Parameter or a list of dict. When parameters is a list of dict, the keys “params”, “lr”, “weight_decay”, “grad_centralization” and “order_params” can be parsed (see the example following the parameter list).

    • params: Required. The parameters in the current group. The value must be a list of Parameter.

    • lr: Optional. If “lr” is in the keys, the corresponding learning rate value will be used. If not, the learning_rate in the optimizer will be used. Both fixed and dynamic learning rates are supported.

    • weight_decay: Optional. If “weight_decay” is in the keys, the corresponding weight decay value will be used. If not, the weight_decay in the optimizer will be used.

    • grad_centralization: Optional. Must be Boolean. If “grad_centralization” is in the keys, the set value will be used. If not, grad_centralization is False by default. This configuration only takes effect on convolution layers.

    • order_params: Optional. When parameters are grouped, this is usually used to keep the parameters in the order in which they appear in the network, which can improve performance. The value should be the parameters whose order the optimizer will follow. If “order_params” is in the keys, the other keys will be ignored, and every element of “order_params” must also be in one of the “params” groups.

  • weight_decay (Union[float, int]) – An int or floating point value for the weight decay. It must be equal to or greater than 0. If weight_decay is an int, it will be converted to float. Default: 0.0.

  • loss_scale (float) – A floating point value for the loss scale. It must be greater than 0. If loss_scale is given as an int, it will be converted to float. In general, use the default value. Only when FixedLossScaleManager is used for training and drop_overflow_update in FixedLossScaleManager is set to False does this value need to be the same as the loss_scale in FixedLossScaleManager. Refer to mindspore.amp.FixedLossScaleManager for more details. Default: 1.0.
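
For example, parameters can be grouped as follows. This is a sketch modelled on the grouped-parameter usage of the optimizer subclasses; mindspore.nn.Momentum and the LeNet5 network from the example under get_lr_parameter are used only for illustration.

>>> from mindspore import nn
>>> # Define the network structure of LeNet5. Refer to
>>> # https://gitee.com/mindspore/docs/blob/r2.1/docs/mindspore/code/lenet.py
>>> net = LeNet5()
>>> conv_params = list(filter(lambda x: 'conv' in x.name, net.trainable_params()))
>>> no_conv_params = list(filter(lambda x: 'conv' not in x.name, net.trainable_params()))
>>> group_params = [{'params': conv_params, 'weight_decay': 0.01, 'grad_centralization': True},
...                 {'params': no_conv_params, 'lr': 0.01},
...                 {'order_params': net.trainable_params()}]
>>> optim = nn.Momentum(group_params, learning_rate=0.1, momentum=0.9, weight_decay=0.0)
>>> # The conv_params group uses the optimizer's learning rate 0.1, weight decay 0.01 and
>>> # gradient centralization; the no_conv_params group uses learning rate 0.01, the
>>> # optimizer's weight decay 0.0 and no gradient centralization; 'order_params' keeps
>>> # the parameter order of the original network.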

Raises
  • TypeError – If learning_rate is not one of int, float, Tensor, Iterable, LearningRateSchedule.

  • TypeError – If element of parameters is neither Parameter nor dict.

  • TypeError – If loss_scale is not a float.

  • TypeError – If weight_decay is neither float nor int.

  • ValueError – If loss_scale is less than or equal to 0.

  • ValueError – If weight_decay is less than 0.

  • ValueError – If learning_rate is a Tensor, but the dimension of the Tensor is greater than 1.

Supported Platforms:

Ascend GPU CPU
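
Examples

The following is a minimal sketch of a user-defined optimizer built on this base class, assuming a plain SGD update and a single (non-grouped) learning rate. It shows the typical order in which the helper interfaces documented below (flatten_gradients, decay_weight, gradients_centralization, scale_grad and get_lr) are called inside construct; the class name SketchSGD, the update rule and the use of the base-class attribute self.parameters are illustrative assumptions, not a prescribed implementation.

>>> from mindspore import nn, ops
>>> class SketchSGD(nn.Optimizer):
...     """Illustrative optimizer: plain SGD written on top of the base-class helpers."""
...     def __init__(self, params, learning_rate=0.1, weight_decay=0.0, loss_scale=1.0):
...         super(SketchSGD, self).__init__(learning_rate, params, weight_decay, loss_scale)
...     def construct(self, gradients):
...         gradients = self.flatten_gradients(gradients)         # chunked gradients if parameters are flattened
...         gradients = self.decay_weight(gradients)              # apply weight decay
...         gradients = self.gradients_centralization(gradients)  # centralize conv gradients if configured
...         gradients = self.scale_grad(gradients)                # undo the loss scale
...         lr = self.get_lr()                                    # learning rate of the current step
...         success = True
...         for param, grad in zip(self.parameters, gradients):   # self.parameters: ParameterTuple built by the base class
...             success = ops.depend(success, ops.assign_sub(param, lr * grad))
...         return success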

broadcast_params(optim_result)[source]

Apply Broadcast operations in the sequential order of parameter groups.

Parameters

optim_result (bool) – The results of updating parameters. This input is used to ensure that the parameters are updated before they are broadcast.

Returns

bool, the status flag.
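
In data-parallel training, a user-defined optimizer can broadcast the freshly updated parameters as the last step of construct. A hedged continuation of the SketchSGD sketch in the Examples above (broadcasting is only meaningful when the network is trained on multiple devices):

>>> class BroadcastSketchSGD(SketchSGD):
...     def construct(self, gradients):
...         success = super(BroadcastSketchSGD, self).construct(gradients)
...         # Passing the update result ensures the parameters are updated before they are broadcast.
...         return self.broadcast_params(success)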

decay_weight(gradients)[source]

Weight decay.

An approach to reducing overfitting in a deep neural network model. User-defined optimizers based on mindspore.nn.Optimizer can also call this interface to apply weight decay.

Parameters

gradients (tuple[Tensor]) – The gradients of the network parameters, with the same shapes as the parameters.

Returns

tuple[Tensor], The gradients after weight decay.

flatten_gradients(gradients)[source]

Flatten gradients into several chunk tensors grouped by data type if network parameters are flattened.

A way to improve performance by storing parameters and gradients in contiguous memory. User-defined optimizers based on mindspore.nn.Optimizer should call this interface to support contiguous memory for network parameters.

Parameters

gradients (tuple[Tensor]) – The gradients of network parameters.

Returns

tuple[Tensor], The flattened gradients, or the original gradients if the parameters are not flattened.

get_lr()[source]

The optimizer calls this interface to get the learning rate for the current step. User-defined optimizers based on mindspore.nn.Optimizer can also call this interface before updating the parameters.

Returns

float, the learning rate of the current step.

get_lr_parameter(param)[source]

When parameters are grouped and the learning rate differs between groups, get the learning rate of the specified param.

Parameters

param (Union[Parameter, list[Parameter]]) – The Parameter or list of Parameter.

Returns

Parameter, a single Parameter or list[Parameter] according to the input type. If the learning rate is dynamic, the LearningRateSchedule or list[LearningRateSchedule] used to calculate the learning rate will be returned.

Examples

>>> from mindspore import nn
>>> # Define the network structure of LeNet5. Refer to
>>> # https://gitee.com/mindspore/docs/blob/r2.1/docs/mindspore/code/lenet.py
>>> net = LeNet5()
>>> conv_params = list(filter(lambda x: 'conv' in x.name, net.trainable_params()))
>>> no_conv_params = list(filter(lambda x: 'conv' not in x.name, net.trainable_params()))
>>> group_params = [{'params': conv_params, 'lr': 0.05},
...                 {'params': no_conv_params, 'lr': 0.01}]
>>> optim = nn.Momentum(group_params, learning_rate=0.1, momentum=0.9, weight_decay=0.0)
>>> conv_lr = optim.get_lr_parameter(conv_params)
>>> print(conv_lr[0].asnumpy())
0.05

get_weight_decay()[source]

The optimizer calls this interface to get the weight decay value for the current step. User-defined optimizers based on mindspore.nn.Optimizer can also call this interface before updating the parameters.

Returns

float, the weight decay value of the current step.
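
A minimal instance-level sketch; the network, optimizer and values are illustrative. With a fixed weight_decay, this simply returns the configured value; with a dynamic weight decay, it returns the value for the current step.

>>> from mindspore import nn
>>> # Define the network structure of LeNet5. Refer to
>>> # https://gitee.com/mindspore/docs/blob/r2.1/docs/mindspore/code/lenet.py
>>> net = LeNet5()
>>> optim = nn.Momentum(net.trainable_params(), learning_rate=0.1, momentum=0.9, weight_decay=0.01)
>>> current_decay = optim.get_weight_decay()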

gradients_centralization(gradients)[source]

Gradient centralization.

A method for optimizing the gradients of convolution-layer parameters that can improve the training speed of a deep neural network model. User-defined optimizers based on mindspore.nn.Optimizer can also call this interface to centralize gradients.

Parameters

gradients (tuple[Tensor]) – The gradients of the network parameters, with the same shapes as the parameters.

Returns

tuple[Tensor], The gradients after gradient centralization.
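
As an illustration of the general technique (not the exact implementation of this interface), gradient centralization subtracts from a convolution weight gradient its mean taken over all dimensions except the output-channel axis:

>>> import mindspore as ms
>>> from mindspore import ops
>>> grad = ops.ones((8, 3, 3, 3), ms.float32)                       # gradient of a conv weight (out, in, kh, kw)
>>> centralized = grad - grad.mean(axis=(1, 2, 3), keep_dims=True)  # remove the per-filter mean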

scale_grad(gradients)[source]

Restore gradients for mixed precision.

User-defined optimizers based on mindspore.nn.Optimizer can also call this interface to restore gradients.

Parameters

gradients (tuple[Tensor]) – The gradients of the network parameters, with the same shapes as the parameters.

Returns

tuple[Tensor], The gradients after the loss scale has been removed.
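
Conceptually (illustration only, not the exact implementation of this interface), restoring the gradients means dividing each one by loss_scale:

>>> import mindspore as ms
>>> from mindspore import ops
>>> loss_scale = 1024.0
>>> scaled_gradients = (ops.ones((2, 2), ms.float32) * loss_scale,)   # gradients as produced with loss scaling
>>> restored = tuple(grad / loss_scale for grad in scaled_gradients)  # the restored (unscaled) gradients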

property target

This property is used to determine whether the parameters are updated on host or on device. The input type is str and can only be ‘CPU’, ‘Ascend’ or ‘GPU’.

property unique

Whether to make the gradients unique in the optimizer. It is generally used in sparse networks. Set it to True if the gradients of the optimizer are sparse; set it to False if the forward network has already made the parameters unique, that is, the gradients of the optimizer are no longer sparse. The default value is True when it is not set.
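
A hedged instance-level sketch for a sparse-capable subclass; nn.LazyAdam and the settings below are used only for illustration, and whether these properties can be set depends on the concrete optimizer.

>>> from mindspore import nn
>>> # Define the network structure of LeNet5. Refer to
>>> # https://gitee.com/mindspore/docs/blob/r2.1/docs/mindspore/code/lenet.py
>>> net = LeNet5()
>>> optim = nn.LazyAdam(net.trainable_params(), learning_rate=0.1)
>>> optim.target = 'CPU'   # perform the sparse parameter update on host
>>> optim.unique = True    # the gradients passed to the optimizer are sparse and not yet unique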