mindspore.nn.thor

mindspore.nn.thor(net, learning_rate, damping, momentum, weight_decay=0.0, loss_scale=1.0, batch_size=32, use_nesterov=False, decay_filter=<lambda x: x.name not in []>, split_indices=None, enable_clip_grad=False, frequency=100)

Updates gradients using the second-order THOR algorithm.

The Trace-based Hardware-driven layer-ORiented Natural Gradient Descent Computation (THOR) algorithm is proposed in:

THOR: Trace-based Hardware-driven layer-ORiented Natural Gradient Descent Computation

The updating formulas are as follows,

$$
\begin{array}{ll}
A_{i} = a_{i} {a_{i}}^{T} \\
G_{i} = D_{s_{i}} {D_{s_{i}}}^{T} \\
m_{i} = \beta * m_{i} + (G_{i}^{(k)} + \lambda I)^{-1} g_{i} (\overline{A}_{i-1}^{(k)} + \lambda I)^{-1} \\
w_{i} = w_{i} - \alpha * m_{i}
\end{array}
$$

$D_{s_{i}}$ represents the derivative of the loss function with respect to the output of the i-th layer, $a_{i-1}$ represents the input of the i-th layer, i.e. the activations of the previous layer, $\beta$ represents momentum, $I$ represents the identity matrix, $\overline{A}$ represents the transpose of matrix A, $\lambda$ represents ‘damping’, $g_{i}$ represents the gradients of the i-th layer, $\otimes$ represents the Kronecker product, and $\alpha$ represents ‘learning rate’.
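The update formulas can be illustrated with a small NumPy sketch for a single dense layer. This is only an illustration of the equations, not the MindSpore implementation: the function name `thor_layer_step`, the batch-as-columns layout of `a` and `d_s`, and the absence of any batch-size normalization are assumptions made for the example.

```python
import numpy as np


def thor_layer_step(w, m, g, a, d_s, lr, damping, momentum):
    """One update of the formulas above for a single layer (illustrative only).

    w, m, g : arrays of shape (out, in) - weights, momentum buffer, gradients
    a       : array of shape (in, batch) - inputs of the layer (assumed layout)
    d_s     : array of shape (out, batch) - loss derivatives w.r.t. the layer output
    """
    A = a @ a.T        # (in, in) factor built from the layer inputs
    G = d_s @ d_s.T    # (out, out) factor built from the output derivatives

    # Damped inverses: (A + lambda * I)^-1 and (G + lambda * I)^-1
    A_inv = np.linalg.inv(A + damping * np.eye(A.shape[0]))
    G_inv = np.linalg.inv(G + damping * np.eye(G.shape[0]))

    # m = beta * m + (G + lambda I)^-1 g (A + lambda I)^-1
    m_new = momentum * m + G_inv @ g @ A_inv
    # w = w - alpha * m
    w_new = w - lr * m_new
    return w_new, m_new
```

With zero inputs and zero output derivatives, both factors vanish and the damped inverses reduce to $I / \lambda$, so the step degenerates to momentum SGD scaled by $1 / \lambda^{2}$.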

Note

When separating parameter groups, the weight decay in each group will be applied on the parameters if the weight decay is positive. When not separating parameter groups, the weight_decay in the API will be applied on the parameters without ‘beta’ or ‘gamma’ in their names if weight_decay is positive.

When separating parameter groups, set grad_centralization to True to centralize the gradient. Gradient centralization can only be applied to the parameters of convolution layers; if grad_centralization is set to True for parameters of a non-convolution layer, an error is reported.

A customized parameter order is supported to improve the performance of parameter groups.
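The two rules in this note can be sketched in plain Python. This is a hedged illustration rather than the library code: `Param`, `default_decay_mask` and `centralize_gradient` are hypothetical names, and the centralization shown (subtracting the mean over every axis except the output-channel axis) is the usual definition of gradient centralization, assumed here to match what grad_centralization does.

```python
from collections import namedtuple

import numpy as np

# Hypothetical stand-in for a network parameter; only the name is needed here.
Param = namedtuple("Param", ["name"])


def default_decay_mask(params, weight_decay):
    """Return one flag per parameter: True if weight decay would be applied.

    Mirrors the rule for ungrouped parameters: decay is skipped entirely when
    weight_decay is not positive, and skipped for 'beta' / 'gamma' parameters.
    """
    if weight_decay <= 0:
        return [False] * len(params)
    return ["beta" not in p.name and "gamma" not in p.name for p in params]


def centralize_gradient(grad):
    """Centralize a convolution weight gradient of shape (out, in, kH, kW).

    Assumption: the mean is taken over all axes except the first one.
    """
    axes = tuple(range(1, grad.ndim))
    return grad - grad.mean(axis=axes, keepdims=True)
```

For example, `default_decay_mask([Param("conv1.weight"), Param("bn1.gamma")], 1e-4)` marks only the convolution weight for decay.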

Parameters

net (Cell) – The training network.
learning_rate (Tensor) – A value for the learning rate.
damping (Tensor) – A value for the damping.
momentum (float) – Momentum for the moving average. It must be at least 0.0.
weight_decay (int, float) – Weight decay (L2 penalty). It must be equal to or greater than 0.0. Default: 0.0.
loss_scale (float) – A value for the loss scale. It must be greater than 0.0. In general, use the default value. Default: 1.0.
batch_size (int) – The size of a batch. Default: 32.
use_nesterov (bool) – Enable Nesterov momentum. Default: False.
decay_filter (function) – A function that determines which layers the weight decay is applied to. It only takes effect when weight_decay > 0. Default: lambda x: x.name not in [].
split_indices (list) – Sets the allreduce fusion strategy by A/G layer indices. It only takes effect in distributed computing. Taking ResNet50 as an example, there are 54 layers of A and of G respectively; when split_indices is set to [26, 53], A/G is divided into two groups for allreduce: one covers layers 0~26 and the other covers layers 27~53. Default: None.
enable_clip_grad (bool) – Whether to clip the gradients. Default: False.
frequency (int) – The update interval of A/G and $A^{-1}/G^{-1}$. When frequency equals N (N is greater than 1), A/G and $A^{-1}/G^{-1}$ will be updated every N steps, and other steps will use the stale A/G and $A^{-1}/G^{-1}$ to update weights. Default: 100.
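As a rough illustration of split_indices and frequency, the following plain-Python sketch computes the allreduce groups and the steps on which A/G would be refreshed. The helper names `allreduce_groups` and `refresh_steps` are hypothetical, and the assumption that a refresh happens on steps divisible by frequency (starting at step 0) is made for the example only; the actual scheduling inside MindSpore may differ.

```python
def allreduce_groups(split_indices):
    """Split layer indices into consecutive groups ending at each split index.

    With split_indices=[26, 53] this yields layers 0..26 and 27..53.
    """
    groups = []
    start = 0
    for last in split_indices:
        groups.append(range(start, last + 1))
        start = last + 1
    return groups


def refresh_steps(total_steps, frequency):
    """List the steps on which A/G and their inverses are recomputed.

    Assumption: refresh when step % frequency == 0; every other step reuses
    the stale matrices.
    """
    if not isinstance(frequency, int) or frequency < 2:
        raise ValueError("frequency must be an int of at least 2")
    return [step for step in range(total_steps) if step % frequency == 0]
```

With the ResNet50 setting from the description, `allreduce_groups([26, 53])` returns two ranges covering layers 0 to 26 and 27 to 53.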

Inputs:

gradients (tuple[Tensor]) - The gradients of params, with the same shape as params.

Outputs:

tuple[bool], all elements are True.

Raises

TypeError – If learning_rate is not Tensor.
TypeError – If loss_scale or momentum is not a float.
TypeError – If weight_decay is neither float nor int.
TypeError – If use_nesterov is not a bool.
ValueError – If loss_scale is less than or equal to 0.
ValueError – If weight_decay or momentum is less than 0.
ValueError – If frequency is not int.
ValueError – If frequency is less than 2.

Supported Platforms: Ascend GPU

Examples

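>>> from mindspore import nn, Tensor, Model
>>> from mindspore.nn import thor
>>> from mindspore.train.train_thor import ConvertModelUtils
>>> # Net, dataset, config, cb and loss_scale (a loss scale manager) are assumed to be defined by the user.
>>> net = Net()
>>> optim = thor(net, learning_rate=Tensor(1e-3), damping=Tensor(1e-3), momentum=0.9)
>>> loss = nn.SoftmaxCrossEntropyWithLogits()
>>> model = Model(net, loss_fn=loss, optimizer=optim, loss_scale_manager=loss_scale, metrics={'acc'},
...               amp_level="O2", keep_batchnorm_fp32=False)
>>> model = ConvertModelUtils().convert_to_thor_model(model=model, network=net, loss_fn=loss, optimizer=optim,
... loss_scale_manager=loss_scale, metrics={'acc'}, amp_level="O2", keep_batchnorm_fp32=False)
>>> model.train(config.epoch_size, dataset, callbacks=cb, sink_size=100, dataset_sink_mode=True)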