mindformers.core.Came
- class mindformers.core.Came(params, learning_rate=None, eps=(1e-30, 1e-3, 1e-16), clip_threshold=1.0, decay_rate=0.8, beta1=0.9, beta3=0.99, weight_decay=0.0, scale_parameter=False, relative_step=False, warmup_init=False, compression=False, loss_scale=1)[source]
Updates gradients by the Confidence-guided Adaptive Memory Efficient Optimization (Came) algorithm.
The Came algorithm is proposed in CAME: Confidence-guided Adaptive Memory Efficient Optimization .
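A simplified, non-factored sketch of the update rule (the actual implementation keeps Adafactor-style factored row/column statistics for memory efficiency; consult the paper for the exact formulation):

$$
\begin{aligned}
u_t &= \frac{g_t}{\sqrt{\hat{v}_t}}, \qquad
\hat{u}_t = \frac{u_t}{\max\bigl(1, \operatorname{RMS}(u_t)/d\bigr)}, \\
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\,\hat{u}_t, \qquad
s_t = \beta_3 s_{t-1} + (1-\beta_3)\,(\hat{u}_t - m_t)^2, \\
\theta_t &= \theta_{t-1} - \alpha_t\,\frac{m_t}{\sqrt{s_t}},
\end{aligned}
$$

where \(\hat{v}_t\) is the running average of \(g_t^2\) controlled by decay_rate, \(d\) is clip_threshold, and \(s_t\) is the confidence-guided (instability) statistic controlled by beta3.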
- Parameters
params (Union[list[Parameter], list[dict]]) – The parameters to be updated. When params is a list of Parameter, every element must be an instance of Parameter. When params is a list of dict, parameter groups can be specified, each with its own 'lr' and 'weight_decay' (see the Examples below).
learning_rate (Union[float, Tensor]) – A value or a schedule for the learning rate. When learning_rate is a one-dimensional Tensor, it is used as a dynamic learning rate, with the i-th value taken as the learning rate for the i-th step. If the type of learning_rate is int, it will be converted to float. Default: None.
eps (Union[list, tuple]) – The regularization constants for the square gradient, the parameter scale and the instability matrix, respectively. Default: (1e-30, 1e-3, 1e-16).
clip_threshold (float) – The threshold on the root mean square of the final gradient update. Default: 1.0.
decay_rate (float) – The coefficient used to compute running averages of the square gradient. Should be in range [0.0, 1.0]. Default: 0.8.
beta1 (float) – The coefficient used to compute running averages of the gradient. Should be in range [0.0, 1.0]. Default: 0.9.
beta3 (float) – The coefficient used to compute running averages of the instability matrix. Should be in range [0.0, 1.0]. Default: 0.99.
weight_decay (float) – Weight decay (L2 penalty). Should be in range [0.0, 1.0]. Default: 0.0.
scale_parameter (bool) – If True, the learning rate is scaled by the root mean square of the parameter. Default: False.
relative_step (bool) – If True, a time-dependent learning rate is computed instead of using the external learning rate. Default: False.
warmup_init (bool) – Whether the time-dependent learning rate (see relative_step) uses warm-up initialization. Default: False.
compression (bool) – If True, the exponential running averages are compressed to float16. Default: False.
loss_scale (int) – An integer value for the loss scale. Should be greater than 0. In general, use the default value. Only when FixedLossScaleManager is used for training and drop_overflow_update in FixedLossScaleManager is set to False does this value need to match the loss_scale in FixedLossScaleManager. Refer to mindspore.amp.FixedLossScaleManager for more details (see the sketch after this list). Default: 1.
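A minimal sketch of pairing loss_scale with mindspore.amp.FixedLossScaleManager when drop_overflow_update is False (the loss-scale value 1024 and the reuse of net and loss from the Examples section are assumptions for illustration):
>>> import mindspore as ms
>>> from mindformers.core.optim import Came
>>> # `net` and `loss` are assumed to be defined as in the Examples section below.
>>> # With drop_overflow_update=False, loss_scale in Came must match the manager's loss_scale.
>>> loss_scale_manager = ms.amp.FixedLossScaleManager(loss_scale=1024, drop_overflow_update=False)
>>> optim = Came(params=net.trainable_params(), learning_rate=1e-4, loss_scale=1024)
>>> model = ms.Model(net, loss_fn=loss, optimizer=optim, loss_scale_manager=loss_scale_manager)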
- Inputs:
gradients (tuple[Tensor]) - The gradients of params, with the same shapes as the corresponding parameters.
- Outputs:
Tensor[bool], the value is True.
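A minimal sketch of calling the optimizer directly with gradients, mirroring what a training wrapper such as mindspore.nn.TrainOneStepCell does internally (net, loss, optim, data and label are assumed to be defined elsewhere, e.g. as in the Examples section below):
>>> import mindspore as ms
>>> # `net`, `loss`, `optim`, `data` and `label` are assumed to be defined elsewhere.
>>> def forward_fn(data, label):
...     logits = net(data)
...     return loss(logits, label)
>>> # Differentiate with respect to the optimizer's parameters only.
>>> grad_fn = ms.value_and_grad(forward_fn, None, optim.parameters)
>>> loss_value, grads = grad_fn(data, label)
>>> success = optim(grads)  # returns a Tensor[bool] whose value is True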
- Raises
TypeError – If learning_rate is not one of int, float, Tensor, Iterable, LearningRateSchedule.
TypeError – If element of parameters is neither Parameter nor dict.
TypeError – If decay_rate, weight_decay, beta1, beta3, eps or loss_scale is not a float.
TypeError – If scale_parameter, relative_step, warmup_init or compression is not a bool.
ValueError – If loss_scale or eps is less than or equal to 0.
ValueError – If decay_rate, weight_decay, beta1 or beta3 is not in range [0.0, 1.0].
Examples
>>> import mindspore as ms
>>> import mindspore.nn as nn
>>> from mindformers import AutoModel
>>> from mindformers.core.optim import Came
>>>
>>> ms.set_context(mode=ms.context.GRAPH_MODE)
>>> net = AutoModel.from_pretrained("llama2_7b", num_layers=2)
>>> # 1) All parameters use the same learning rate and weight decay
>>> optim = Came(params=net.trainable_params(), learning_rate=0.1)
>>>
>>> # 2) Use parameter groups and set different values
>>> layernorm_params = list(filter(lambda x: 'norm' in x.name, net.trainable_params()))
>>> no_layernorm_params = list(filter(lambda x: 'norm' not in x.name, net.trainable_params()))
>>> group_params = [{'params': layernorm_params, 'weight_decay': 0.01},
...                 {'params': no_layernorm_params, 'lr': 0.01},
...                 {'order_params': net.trainable_params()}]
>>> optim = Came(group_params, learning_rate=0.1, weight_decay=0.0)
>>> # The layernorm_params will use the default learning rate of 0.1 and a weight decay of 0.01.
>>> # The no_layernorm_params will use a learning rate of 0.01 and the default weight decay of 0.0.
>>> # The final order of parameters followed by the optimizer is given by 'order_params'.
>>>
>>> loss = nn.SoftmaxCrossEntropyWithLogits()
>>> model = ms.Model(net, loss_fn=loss, optimizer=optim)