mindflow.core.AdaHessian
- class mindflow.core.AdaHessian(params, learning_rate=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, use_locking=False, use_nesterov=False, weight_decay=0.0, loss_scale=1.0, use_amsgrad=False, **kwargs)[source]
The AdaHessian optimizer. It was proposed in ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning. See the Torch implementation for reference. The Hessian power here is fixed to 1, and the way of spatially averaging the Hessian traces follows the default behavior in the Torch implementation (see the sketch after the list below), that is:
- for 1D: no spatial average.
- for 2D: use the entire row as the spatial average.
- for 3D (assume 1D Conv, can be customized): use the last dimension as the spatial average.
- for 4D (assume 2D Conv, can be customized): use the last 2 dimensions as the spatial average.
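The following is a minimal sketch of this dimension-dependent averaging written with MindSpore ops. The function name spatial_average and its input hut_trace (standing for an element-wise Hutchinson estimate of the Hessian diagonal, with the same shape as its parameter) are illustrative assumptions, not part of the MindFlow API.

>>> import numpy as np
>>> import mindspore as ms
>>> from mindspore import ops
>>> def spatial_average(hut_trace):
...     # hut_trace: illustrative element-wise Hessian-diagonal estimate
...     t = ops.abs(hut_trace)
...     if t.ndim <= 1:
...         return t                                     # 1D: no spatial average
...     if t.ndim == 2:
...         return ops.mean(t, axis=1, keep_dims=True)   # 2D: average over each row
...     if t.ndim == 3:
...         return ops.mean(t, axis=2, keep_dims=True)   # 3D (1D Conv): last dimension
...     return ops.mean(t, axis=(2, 3), keep_dims=True)  # 4D (2D Conv): last 2 dimensions
>>> w = ms.Tensor(np.ones([4, 2, 3, 3]), dtype=ms.float32)
>>> print(spatial_average(w).shape)
(4, 2, 1, 1)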
Args: see mindspore.nn.Adam.
- Supported Platforms:
Ascend
Examples
>>> import numpy as np
>>> import mindspore as ms
>>> from mindspore import ops, nn
>>> from mindflow import AdaHessian
>>> ms.set_context(device_target="Ascend", mode=ms.GRAPH_MODE)
>>> net = nn.Conv2d(in_channels=2, out_channels=4, kernel_size=3)
>>> def forward(a):
...     return ops.mean(net(a)**2)**.5
>>> grad_fn = ms.grad(forward, grad_position=None, weights=net.trainable_params())
>>> optimizer = AdaHessian(net.trainable_params())
>>> inputs = ms.Tensor(np.reshape(range(100), [2, 2, 5, 5]), dtype=ms.float32)
>>> optimizer(grad_fn, inputs)
>>> print(optimizer.moment2[0].shape)
(4, 2, 3, 3)
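Note that, as the example suggests, the optimizer is called with the gradient function and the network inputs rather than with precomputed gradients, so that it can evaluate the Hessian-trace information used to populate moment2 internally; this differs from first-order optimizers such as mindspore.nn.Adam.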