Learning Rate and Optimizer
Before reading this chapter, please read the official MindSpore tutorial Optimizer.
This chapter introduces some special ways of using MindSpore optimizers and the principle behind the learning rate decay strategy.
Optimizer Comparison
Optimizer Support Differences
A comparison of the similarities and differences between the optimizers supported by both PyTorch and MindSpore is detailed in the API mapping table. Optimizers not supported in MindSpore at the moment: LBFGS, NAdam, RAdam.
Here are the common differences between optimizers in mindspore.experimental.optim and PyTorch:
Parameter | Explanation |
---|---|
foreach | If set to |
fused | Enables optimizer fusion; if set to |
differentiable | Determines whether to automatically differentiate optimizer steps during training; not supported in MindSpore. |
capturable | Determines whether the optimizer instance can be safely captured in a CUDA graph, which is an optimization technique for executing computation graphs on the GPU for improved performance. Setting to |
Optimizer Execution and Usage Differences
In PyTorch, a single training step usually requires manually calling zero_grad() to reset the historical gradients to 0 (or None), then loss.backward() to compute the gradients of the current training step, and finally the optimizer's step() method to update the network weights.
In MindSpore, you only need to compute the gradients directly and then call optimizer(grads) to update the network weights.
Hyperparameter Differences
Hyperparameter Names
Similarities and differences between network weight and learning rate parameter names:
Parameters | PyTorch | MindSpore | Differences |
---|---|---|---|
network weight | params | params | The parameters are the same |
learning rate | lr | learning_rate | The parameters are different |
Hyperparameter Configuration Methods
The parameters are not grouped:
The data types of params are different: the input types in PyTorch are iterable(Tensor) and iterable(dict), which support iterator types, while the input types in MindSpore are list(Parameter) and list(dict), which do not support iterators.
Other hyperparameter configurations and support differences are detailed in the API mapping table.
The parameters are grouped:
PyTorch supports all parameter groupings. MindSpore supports certain key groupings: “params”, “lr”, “weight_decay”, “grad_centralization”, “order_params”.
PyTorch:
optim.SGD([
    {'params': model.base.parameters()},
    {'params': model.classifier.parameters(), 'lr': 1e-3}
], lr=1e-2, momentum=0.9)
MindSpore:
conv_params = list(filter(lambda x: 'conv' in x.name, net.trainable_params()))
no_conv_params = list(filter(lambda x: 'conv' not in x.name, net.trainable_params()))
group_params = [{'params': conv_params, 'weight_decay': 0.01, 'lr': 0.02},
                {'params': no_conv_params}]
optim = nn.Momentum(group_params, learning_rate=0.1, momentum=0.9)
Runtime Hyperparameter Modification
PyTorch supports modifying arbitrary optimizer parameters during training, and provides LRScheduler for dynamically modifying the learning rate.
MindSpore currently does not support modifying optimizer parameters during training, but provides a way to modify the learning rate and weight decay. See the Learning Rate Strategy Comparison and Weight Decay sections for details.
Weight Decay
In PyTorch, weight_decay is modified at runtime by updating the weight_decay entry of each group in optimizer.param_groups.
In MindSpore, dynamic weight decay is implemented by defining a custom class that inherits from Cell and passing an instance of it into the optimizer.
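The arithmetic that such a dynamic weight decay might compute can be sketched in plain Python. The function name exponential_weight_decay and its default values are hypothetical. In MindSpore the same expression would sit in the construct(global_step) method of a Cell subclass, which the optimizer calls with the current step.

```python
def exponential_weight_decay(global_step, weight_decay=0.01,
                             decay_rate=0.1, decay_steps=100):
    """Hypothetical schedule: shrink weight decay exponentially with the step."""
    # The exponent grows by 1 every `decay_steps` optimizer updates.
    exponent = global_step / decay_steps
    return weight_decay * decay_rate ** exponent

# Weight decay values at a few selected steps.
schedule = {step: exponential_weight_decay(step) for step in (0, 100, 200)}
```

With these assumed defaults, the weight decay starts at 0.01 and shrinks by a factor of 10 every 100 steps.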
Saving and Loading Optimizer State
The PyTorch optimizer module provides state_dict() for viewing and saving the optimizer state, and load_state_dict for loading the optimizer state.
The MindSpore optimizer module inherits from Cell. The optimizer is saved and loaded in the same way as the network, usually in conjunction with save_checkpoint and load_checkpoint.
Learning Rate Strategy Comparison
Dynamic Learning Rate Differences
PyTorch defines the LRScheduler class to manage the learning rate. To use a dynamic learning rate, pass an optimizer instance into an LRScheduler subclass and call scheduler.step() in the training loop to modify the learning rate and synchronize the change to the optimizer.
MindSpore has two implementations of dynamic learning rates, Cell and list. Both are used in the same way: they are passed into the optimizer after instantiation. The former computes the learning rate at each step inside its construct method, while the latter pre-generates the learning rate list according to the computational logic, and the optimizer updates the learning rate internally during training. Please refer to Dynamic Learning Rate for details.
Custom Learning Rate Differences
The PyTorch dynamic learning rate module, LRScheduler, provides a LambdaLR interface for custom learning rate adjustment rules, which can be specified by passing lambda expressions or custom functions.
MindSpore does not provide a similar lambda interface. Custom learning rate adjustment rules can be implemented through custom functions or a custom LearningRateSchedule.
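As one possible illustration of the custom function route, the plain-Python helper below generates one learning rate per step. The name warmup_step_lr and the warmup-plus-milestones rule are hypothetical and not part of MindSpore. The resulting list could then be passed to an optimizer as learning_rate.

```python
def warmup_step_lr(base_lr, total_steps, warmup_steps, milestones, gamma=0.1):
    """Hypothetical rule: linear warmup, then step decay at the milestones."""
    lrs = []
    for step in range(total_steps):
        if step < warmup_steps:
            # Ramp up linearly until base_lr is reached.
            lr = base_lr * (step + 1) / warmup_steps
        else:
            # Multiply by gamma once for every milestone already passed.
            passed = sum(1 for m in milestones if step >= m)
            lr = base_lr * gamma ** passed
        lrs.append(lr)
    return lrs

lr_list = warmup_step_lr(base_lr=0.1, total_steps=10,
                         warmup_steps=2, milestones=[6, 8])
rounded = [round(lr, 4) for lr in lr_list]
```

The list length should match the total number of training steps, since each entry corresponds to one optimizer update.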
Obtaining the Learning Rate
PyTorch:
In the fixed learning rate scenario, the learning rate is usually viewed and printed through optimizer.state_dict(). For example, when parameters are grouped, use optimizer.state_dict()['param_groups'][n]['lr'] for the nth parameter group, and use optimizer.state_dict()['param_groups'][0]['lr'] when the parameters are not grouped.
In the dynamic learning rate scenario, you can use the get_lr method of the LRScheduler to get the current learning rate, or the print_lr method to print the learning rate.
MindSpore:
MindSpore does not currently provide an interface for viewing the learning rate directly; this will be addressed in a subsequent version.
Learning Rate Update
PyTorch:
PyTorch provides the torch.optim.lr_scheduler package for dynamically modifying the learning rate. When using the package, you need to explicitly call optimizer.step() and scheduler.step() to update the learning rate. For details, see How Do I Adjust the Learning Rate.
MindSpore:
The learning rate of MindSpore is packaged in the optimizer. Each time the optimizer is invoked, the learning rate is updated automatically.
Parameters Grouping
MindSpore optimizers support some special operations: different learning rates (lr), weight_decay and gradient centralization (grad_centralization) strategies can be set for different groups of trainable parameters in the network. For example:
from mindspore import nn

# Define model
class Network(nn.Cell):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.SequentialCell([
            nn.Conv2d(3, 12, kernel_size=3, pad_mode='pad', padding=1),
            nn.BatchNorm2d(12),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2)
        ])
        self.layer2 = nn.SequentialCell([
            nn.Conv2d(12, 4, kernel_size=3, pad_mode='pad', padding=1),
            nn.BatchNorm2d(4),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2)
        ])
        self.pool = nn.AdaptiveMaxPool2d((5, 5))
        self.fc = nn.Dense(100, 10)

    def construct(self, x):
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.pool(x)
        # 4 channels * 5 * 5 = 100 features per sample
        x = x.view((-1, 100))
        # Apply the dense layer defined in __init__
        out = self.fc(x)
        return out

def params_not_in(param, param_list):
    # Use the Parameter id to determine if param is not in the param_list
    param_id = id(param)
    for p in param_list:
        if id(p) == param_id:
            return False
    return True

net = Network()
trainable_param = net.trainable_params()
conv_weight, bn_weight, dense_weight = [], [], []
for _, cell in net.cells_and_names():
    # Determine what the API is and add the corresponding parameters to the different lists
    if isinstance(cell, nn.Conv2d):
        conv_weight.append(cell.weight)
    elif isinstance(cell, nn.BatchNorm2d):
        bn_weight.append(cell.gamma)
        bn_weight.append(cell.beta)
    elif isinstance(cell, nn.Dense):
        dense_weight.append(cell.weight)

other_param = []
# The parameters in all groups cannot be duplicated, and the union of the groups is all the parameters that need to be updated
for param in trainable_param:
    if params_not_in(param, conv_weight) and params_not_in(param, bn_weight) and params_not_in(param, dense_weight):
        other_param.append(param)

group_param = [{'order_params': trainable_param}]
# The parameter list for each group cannot be empty
if conv_weight:
    conv_weight_lr = nn.cosine_decay_lr(0., 1e-3, total_step=1000, step_per_epoch=100, decay_epoch=10)
    group_param.append({'params': conv_weight, 'weight_decay': 1e-4, 'lr': conv_weight_lr})
if bn_weight:
    group_param.append({'params': bn_weight, 'weight_decay': 0., 'lr': 1e-4})
if dense_weight:
    group_param.append({'params': dense_weight, 'weight_decay': 1e-5, 'lr': 1e-3})
if other_param:
    group_param.append({'params': other_param})

opt = nn.Momentum(group_param, learning_rate=1e-3, weight_decay=0.0, momentum=0.9)
The following points need to be noted:
The list of parameters for each group cannot be empty.
If weight_decay and lr are not set in a group, the values set in the optimizer are used; if they are set, the values in the group dictionary are used.
lr in each group can be static or dynamic, but cannot be regrouped.
weight_decay in each group needs to be a conforming floating point number.
The parameters in all groups cannot be duplicated, and the union of the groups is all the parameters that need to be updated.
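The grouping rules can be illustrated with a framework-free sketch. The helper resolve_groups is hypothetical and only mimics the rules described here; it is not how MindSpore implements grouping. Parameters are represented by their names to keep the example self-contained.

```python
def resolve_groups(groups, default_lr, default_weight_decay):
    """Hypothetical helper: apply the grouping rules to a list of group dicts."""
    seen = set()
    resolved = []
    for group in groups:
        if "order_params" in group:
            # This entry only fixes the parameter order; it carries no settings.
            continue
        params = group["params"]
        if not params:
            raise ValueError("the parameter list of a group cannot be empty")
        for name in params:
            if name in seen:
                raise ValueError(f"parameter {name!r} appears in two groups")
            seen.add(name)
        resolved.append({
            "params": list(params),
            # Group values win; otherwise fall back to the optimizer values.
            "lr": group.get("lr", default_lr),
            "weight_decay": group.get("weight_decay", default_weight_decay),
        })
    return resolved

groups = [
    {"order_params": ["conv.weight", "bn.gamma", "fc.weight"]},
    {"params": ["conv.weight"], "lr": 0.02, "weight_decay": 1e-4},
    {"params": ["bn.gamma", "fc.weight"]},
]
resolved = resolve_groups(groups, default_lr=0.1, default_weight_decay=0.0)
```

Here the first group keeps its own lr and weight_decay, while the second inherits both from the optimizer-level defaults.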
MindSpore Learning Rate Decay Strategy
During the training process, MindSpore learning rate is in the form of parameters in the network. Before executing the optimizer to update the trainable parameters in network, MindSpore will call get_lr to get the value of the learning rate needed for the current step.
MindSpore learning rate supports static, dynamic, and grouping, where the static learning rate is a Tensor in float32 type in the network.
There are two types of dynamic learning rates. The first is a float32 Tensor in the network whose length is the total number of training steps, such as the Dynamic LR functions. The optimizer holds a global_step parameter that is incremented by 1 on every optimizer update, and MindSpore internally gets the learning rate value of the current step based on global_step and learning_rate.
The second generates the learning rate value through graph composition, such as the LearningRateSchedule classes.
The grouping learning rate is as described in parameter grouping in the previous section.
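The lookup of a list-type learning rate by global_step can be pictured with the toy class below. ToyOptimizer is a hypothetical stand-in that models only the step counter and the lookup, not MindSpore's internal implementation, and the sketch deliberately stays within the length of the list.

```python
class ToyOptimizer:
    """Hypothetical stand-in that only models the learning rate lookup."""

    def __init__(self, lr_list):
        self.learning_rate = list(lr_list)  # one value per training step
        self.global_step = 0

    def get_lr(self):
        # Pick the learning rate that belongs to the current step.
        return self.learning_rate[self.global_step]

    def __call__(self, grads):
        lr = self.get_lr()
        # A real optimizer would update the weights with `lr` here.
        self.global_step += 1
        return lr

opt = ToyOptimizer([0.1, 0.05, 0.025])
used = [opt(grads=None) for _ in range(3)]
```

Each call consumes the next entry of the list, which is why the list length is tied to the total number of training steps.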
Because the learning rate in MindSpore is a parameter, its value can also be modified during training by assigning a new value to the learning_rate parameter, as in the LearningRateScheduler Callback. This method only supports static learning rates passed into the optimizer. The key code is as follows:
import mindspore as ms
from mindspore import ops, nn
net = nn.Dense(1, 2)
optimizer = nn.Momentum(net.trainable_params(), learning_rate=0.1, momentum=0.9)
print(optimizer.learning_rate.data.asnumpy())
new_lr = 0.01
# Rewrite the value of the learning_rate parameter
ops.assign(optimizer.learning_rate, ms.Tensor(new_lr, ms.float32))
print(optimizer.learning_rate.data.asnumpy())
Outputs:
0.1
0.01