mindspore.nn.thor
- mindspore.nn.thor(net, learning_rate, damping, momentum, weight_decay=0.0, loss_scale=1.0, batch_size=32, use_nesterov=False, decay_filter=lambda x: ..., split_indices=None, enable_clip_grad=False, frequency=100)[source]
Updates gradients by the second-order algorithm THOR.
The Trace-based Hardware-driven layer-ORiented Natural Gradient Descent Computation (THOR) algorithm is proposed in:
THOR: Trace-based Hardware-driven layer-ORiented Natural Gradient Descent Computation
The update formulas are as follows:
\[\begin{split}\begin{array}{ll} & \textbf{Parameter:} \: \text{the learning rate } \gamma\text{, the damping parameter }\lambda \\ & \textbf{Init:} \: \lambda \leftarrow 0 \\ & A_{i-1}=\mathbb{E}\left[a_{i-1} a_{i-1}^{T}\right] \\ & G_{i}=\mathbb{E}\left[D_{s_i} D_{s_i}^{T}\right] \\ & w_{i}^{(k+1)} \leftarrow w_{i}^{(k)}-\gamma\left(\left(A_{i-1}^{(k)}+\lambda I\right)^{-1} \otimes\left(G_{i}^{(k)}+\lambda I\right)^{-1}\right) \nabla_{w_{i}} J^{(k)} \end{array}\end{split}\]
where \(a_{i-1}\) represents the input of the i-th layer, i.e. the activations of the previous layer, \(D_{s_i}\) represents the derivative of the loss function with respect to the output of the i-th layer, \(I\) represents the identity matrix, \(\lambda\) represents the damping parameter, \(\nabla_{w_{i}} J^{(k)}\) represents the gradient of the i-th layer, \(\otimes\) represents the Kronecker product, and \(\gamma\) represents the learning rate.
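For intuition, the per-layer step above is a Kronecker-factored natural-gradient update: for a dense layer whose weight (and gradient) has shape (out, in), applying \(\left(A_{i-1}+\lambda I\right)^{-1} \otimes \left(G_{i}+\lambda I\right)^{-1}\) to the flattened gradient is equivalent to left- and right-multiplying the gradient matrix by the two damped inverses. The NumPy sketch below is purely illustrative and is not part of the mindspore.nn.thor implementation; the function name thor_layer_update and its argument names are assumptions introduced here.

import numpy as np

def thor_layer_update(weight, grad, layer_input, loss_grad_out, lr, damping):
    # weight, grad: (out, in); layer_input is a_{i-1}: (batch, in); loss_grad_out is D_{s_i}: (batch, out)
    batch = layer_input.shape[0]
    A = layer_input.T @ layer_input / batch      # A_{i-1} = E[a_{i-1} a_{i-1}^T], shape (in, in)
    G = loss_grad_out.T @ loss_grad_out / batch  # G_i = E[D_{s_i} D_{s_i}^T], shape (out, out)
    A_inv = np.linalg.inv(A + damping * np.eye(A.shape[0]))  # (A_{i-1} + lambda*I)^(-1)
    G_inv = np.linalg.inv(G + damping * np.eye(G.shape[0]))  # (G_i + lambda*I)^(-1)
    # (A^-1 kron G^-1) vec(grad) equals vec(G^-1 @ grad @ A^-1) for grad of shape (out, in)
    return weight - lr * (G_inv @ grad @ A_inv)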
- Parameters
net (Cell) – The training network.
learning_rate (Tensor) – A value for the learning rate.
damping (Tensor) – A value for the damping.
momentum (float) – Hyper-parameter of type float, the momentum for the moving average. It must be at least 0.0.
weight_decay (int, float) – Weight decay (L2 penalty). It must be equal to or greater than 0.0. Default: 0.0.
loss_scale (float) – A value for the loss scale. It must be greater than 0.0. In general, use the default value. Default: 1.0.
batch_size (int) – The size of a batch. Default: 32.
use_nesterov (bool) – Enable Nesterov momentum. Default: False.
decay_filter (function) – A function to determine which layers the weight decay is applied to. It only takes effect when weight_decay > 0 (see the sketch after this parameter list). Default: lambda x: x.name not in []
split_indices (list) – Set the allreduce fusion strategy by A/G layer indices. Only works in distributed training. Taking ResNet50 as an example, there are 54 A/G layers; when split_indices is set to [26, 53], A/G are divided into two allreduce groups, one covering layers 0~26 and the other layers 27~53 (see the sketch after this parameter list). Default: None.
enable_clip_grad (bool) – Whether to clip the gradients. Default: False.
frequency (int) – The update interval of A/G and \(A^{-1}/G^{-1}\). When frequency equals N (N is greater than 1), A/G and \(A^{-1}/G^{-1}\) will be updated every N steps, and the other steps will use the stale A/G and \(A^{-1}/G^{-1}\) to update weights. Default: 100.
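The sketch below, referenced in the decay_filter and split_indices descriptions above, shows one way these two arguments might be passed. It is illustrative only: net is assumed to be an already constructed training network, and the filter predicate and split points are example values rather than defaults.

>>> from mindspore.nn import thor
>>> from mindspore import Tensor
>>> from mindspore import dtype as mstype
>>> # net: an already defined training network (Cell); values below are illustrative
>>> lr = Tensor([4e-4], mstype.float32)
>>> damping = Tensor([1e-4], mstype.float32)
>>> # skip weight decay for bias and BatchNorm parameters, and fuse the A/G allreduce
>>> # into two groups (layers 0~26 and 27~53, as in the ResNet50 example above)
>>> optim = thor(net, learning_rate=lr, damping=damping, momentum=0.9, weight_decay=1e-4,
...              decay_filter=lambda x: 'bias' not in x.name and 'bn' not in x.name,
...              split_indices=[26, 53])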
- Inputs:
gradients (tuple[Tensor]) - The gradients of params, with the same shape as params.
- Outputs:
tuple[bool], all elements are True.
- Raises
TypeError – If learning_rate is not a Tensor.
TypeError – If loss_scale, momentum or frequency is not a float.
TypeError – If weight_decay is neither float nor int.
TypeError – If use_nesterov is not a bool.
ValueError – If loss_scale is less than or equal to 0.
ValueError – If weight_decay or momentum is less than 0.
ValueError – If frequency is not an int.
ValueError – If frequency is less than 2.
- Supported Platforms:
Ascend
GPU
Examples
>>> from mindspore.nn import thor
>>> from mindspore import Model
>>> from mindspore import FixedLossScaleManager
>>> from mindspore import Tensor
>>> from mindspore import dtype as mstype
>>> from mindspore import nn
>>> from mindspore.train.callback import LossMonitor
>>> from mindspore.train.train_thor import ConvertModelUtils
>>>
>>> # Net() and create_dataset() are assumed to be user-defined network and dataset helpers.
>>> net = Net()
>>> dataset = create_dataset()
>>> temp = Tensor([4e-4, 1e-4, 1e-5, 1e-5], mstype.float32)
>>> optim = thor(net, learning_rate=temp, damping=temp, momentum=0.9, loss_scale=128, frequency=4)
>>> loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction='mean')
>>> loss_scale = FixedLossScaleManager(128, drop_overflow_update=False)
>>> model = Model(net, loss_fn=loss, optimizer=optim, loss_scale_manager=loss_scale, metrics={'acc'},
...               amp_level="O2", keep_batchnorm_fp32=False)
>>> model = ConvertModelUtils.convert_to_thor_model(model=model, network=net, loss_fn=loss, optimizer=optim,
...                                                 loss_scale_manager=loss_scale, metrics={'acc'},
...                                                 amp_level="O2", keep_batchnorm_fp32=False)
>>> loss_cb = LossMonitor()
>>> model.train(1, dataset, callbacks=loss_cb, sink_size=4, dataset_sink_mode=True)