mindformers.core.Came
- class mindformers.core.Came(params, learning_rate=None, eps=(1e-30, 1e-3, 1e-16), clip_threshold=1.0, decay_rate=0.8, beta1=0.9, beta3=0.99, weight_decay=0.0, scale_parameter=False, relative_step=False, warmup_init=False, compression=False, loss_scale=1)[source]
Updates gradients by the Confidence-guided Adaptive Memory Efficient Optimization (Came) algorithm.
The Came algorithm is proposed in CAME: Confidence-guided Adaptive Memory Efficient Optimization .
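A simplified, non-factored sketch of the update rule (the actual implementation keeps Adafactor-style factored row/column statistics for memory efficiency; consult the paper for the exact formulation):

$$
\begin{aligned}
u_t &= \frac{g_t}{\sqrt{\hat{v}_t}}, \qquad
\hat{u}_t = \frac{u_t}{\max\bigl(1, \operatorname{RMS}(u_t)/d\bigr)}, \\
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\,\hat{u}_t, \qquad
s_t = \beta_3 s_{t-1} + (1-\beta_3)\,(\hat{u}_t - m_t)^2, \\
\theta_t &= \theta_{t-1} - \alpha_t\,\frac{m_t}{\sqrt{s_t}},
\end{aligned}
$$

where \(\hat{v}_t\) is the running average of \(g_t^2\) controlled by decay_rate, \(d\) is clip_threshold, and \(s_t\) is the confidence-guided (instability) statistic controlled by beta3.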
- Parameters
params (Union[list[Parameter], list[dict]]) – The parameters to be updated. When params is a list of Parameter, every element must be an instance of Parameter. When params is a list of dict, parameter groups can be specified, each with its own 'lr' and 'weight_decay' (see the Examples below).
learning_rate (Union[float, Tensor]) – A value or a schedule for the learning rate. When learning_rate is a one-dimensional Tensor, it is used as a dynamic learning rate, with the i-th value taken as the learning rate for the i-th step. If the type of learning_rate is int, it will be converted to float. Default: None.
eps (Union[list, tuple]) – The regularization constants for the square gradient, the parameter scale and the instability matrix, respectively. Default: (1e-30, 1e-3, 1e-16).
clip_threshold (float) – The threshold on the root mean square of the final gradient update. Default: 1.0.
decay_rate (float) – The coefficient used to compute running averages of the square gradient. Should be in range [0.0, 1.0]. Default: 0.8.
beta1 (float) – The coefficient used to compute running averages of the gradient. Should be in range [0.0, 1.0]. Default: 0.9.
beta3 (float) – The coefficient used to compute running averages of the instability matrix. Should be in range [0.0, 1.0]. Default: 0.99.
weight_decay (float) – Weight decay (L2 penalty). Should be in range [0.0, 1.0]. Default: 0.0.
scale_parameter (bool) – If True, the learning rate is scaled by the root mean square of the parameter. Default: False.
relative_step (bool) – If True, a time-dependent learning rate is computed instead of using the external learning rate. Default: False.
warmup_init (bool) – Whether the time-dependent learning rate (see relative_step) uses warm-up initialization. Default: False.
compression (bool) – If True, the exponential running averages are compressed to float16. Default: False.
loss_scale (int) – An integer value for the loss scale. Should be greater than 0. In general, use the default value. Only when FixedLossScaleManager is used for training and drop_overflow_update in FixedLossScaleManager is set to False does this value need to match the loss_scale in FixedLossScaleManager. Refer to mindspore.amp.FixedLossScaleManager for more details (see the sketch after this list). Default: 1.
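A minimal sketch of pairing loss_scale with mindspore.amp.FixedLossScaleManager when drop_overflow_update is False (the loss-scale value 1024 and the reuse of net and loss from the Examples section are assumptions for illustration):
>>> import mindspore as ms
>>> from mindformers.core.optim import Came
>>> # `net` and `loss` are assumed to be defined as in the Examples section below.
>>> # With drop_overflow_update=False, loss_scale in Came must match the manager's loss_scale.
>>> loss_scale_manager = ms.amp.FixedLossScaleManager(loss_scale=1024, drop_overflow_update=False)
>>> optim = Came(params=net.trainable_params(), learning_rate=1e-4, loss_scale=1024)
>>> model = ms.Model(net, loss_fn=loss, optimizer=optim, loss_scale_manager=loss_scale_manager)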
- Inputs:
gradients (tuple[Tensor]) - The gradients of params, with the same shapes as the corresponding parameters.
- Outputs:
Tensor[bool], the value is True.
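A minimal sketch of calling the optimizer directly with gradients, mirroring what a training wrapper such as mindspore.nn.TrainOneStepCell does internally (net, loss, optim, data and label are assumed to be defined elsewhere, e.g. as in the Examples section below):
>>> import mindspore as ms
>>> # `net`, `loss`, `optim`, `data` and `label` are assumed to be defined elsewhere.
>>> def forward_fn(data, label):
...     logits = net(data)
...     return loss(logits, label)
>>> # Differentiate with respect to the optimizer's parameters only.
>>> grad_fn = ms.value_and_grad(forward_fn, None, optim.parameters)
>>> loss_value, grads = grad_fn(data, label)
>>> success = optim(grads)  # returns a Tensor[bool] whose value is True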
- Raises
TypeError – If learning_rate is not one of int, float, Tensor, Iterable, LearningRateSchedule.
TypeError – If element of parameters is neither Parameter nor dict.
TypeError – If decay_rate, weight_decay, beta1, beta3, eps or loss_scale is not a float.
TypeError – If scale_parameter, relative_step, warmup_init or compression is not a bool.
ValueError – If loss_scale or eps is less than or equal to 0.
ValueError – If decay_rate, weight_decay, beta1 or beta3 is not in range [0.0, 1.0].
Examples
>>> import mindspore as ms
>>> import mindspore.nn as nn
>>> from mindformers import AutoModel
>>> from mindformers.core.optim import Came
>>>
>>> ms.set_context(mode=ms.context.GRAPH_MODE)
>>> net = AutoModel.from_pretrained("llama2_7b", num_layers=2)
>>> # 1) All parameters use the same learning rate and weight decay
>>> optim = Came(params=net.trainable_params(), learning_rate=0.1)
>>>
>>> # 2) Use parameter groups and set different values
>>> layernorm_params = list(filter(lambda x: 'norm' in x.name, net.trainable_params()))
>>> no_layernorm_params = list(filter(lambda x: 'norm' not in x.name, net.trainable_params()))
>>> group_params = [{'params': layernorm_params, 'weight_decay': 0.01},
...                 {'params': no_layernorm_params, 'lr': 0.01},
...                 {'order_params': net.trainable_params()}]
>>> optim = Came(group_params, learning_rate=0.1, weight_decay=0.0)
>>> # The layernorm_params will use the default learning rate of 0.1 and a weight decay of 0.01.
>>> # The no_layernorm_params will use a learning rate of 0.01 and the default weight decay of 0.0.
>>> # The final order of parameters followed by the optimizer is given by 'order_params'.
>>>
>>> loss = nn.SoftmaxCrossEntropyWithLogits()
>>> model = ms.Model(net, loss_fn=loss, optimizer=optim)