Differences with torch.optim.Adam

torch.optim.Adam

class torch.optim.Adam(
    params,
    lr=0.001,
    betas=(0.9, 0.999),
    eps=1e-08,
    weight_decay=0,
    amsgrad=False
)

For more information, see torch.optim.Adam.

mindspore.nn.Adam

class mindspore.nn.Adam(
    params,
    learning_rate=1e-3,
    beta1=0.9,
    beta2=0.999,
    eps=1e-8,
    use_locking=False,
    use_nesterov=False,
    weight_decay=0.0,
    loss_scale=1.0,
    use_amsgrad=False,
    **kwargs
)

For more information, see mindspore.nn.Adam.

Differences

mindspore.nn.Adam covers the functionality of torch.optim.Adam, and the two behave the same with default parameters. The additional arguments of mindspore.nn.Adam (use_locking, use_nesterov, loss_scale, and kwargs) control features that torch.optim.Adam does not expose. See the notes in the mindspore.nn.Adam documentation for details.

Categories	Subcategories	PyTorch	MindSpore	Difference
Parameters	Parameter 1	params	params	Consistent
	Parameter 2	lr	learning_rate	Same function, different parameter names
	Parameter 3	eps	eps	Consistent
	Parameter 4	weight_decay	weight_decay	Consistent
	Parameter 5	amsgrad	use_amsgrad	Same function, different parameter names
	Parameter 6	betas	beta1, beta2	Same function, different parameter names
	Parameter 7	-	use_locking	MindSpore `use_locking` indicates whether to protect the parameter update with a lock; PyTorch does not have this parameter
	Parameter 8	-	use_nesterov	MindSpore `use_nesterov` indicates whether to update the gradients with the Nesterov Accelerated Gradient (NAG) algorithm; PyTorch does not have this parameter
	Parameter 9	-	loss_scale	MindSpore `loss_scale` is the gradient scaling factor; PyTorch does not have this parameter
	Parameter 10	-	kwargs	MindSpore `kwargs` accepts `use_lazy` and `use_offload`, which select the Lazy Adam or the Offload Adam algorithm; PyTorch does not have this parameter
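
The renames in the table above can be sketched as a small translation helper. This is only an illustration: the function name `torch_adam_kwargs_to_mindspore` is hypothetical and not part of either library, and it handles only the torch.optim.Adam keywords listed in the table. It imports neither framework, so it runs with plain Python.

```python
def torch_adam_kwargs_to_mindspore(**torch_kwargs):
    """Translate torch.optim.Adam keyword arguments into mindspore.nn.Adam names.

    Hypothetical helper for illustration; covers only the keywords in the table.
    """
    # Same function, different parameter names.
    renames = {"lr": "learning_rate", "amsgrad": "use_amsgrad"}
    # Consistent between the two APIs.
    unchanged = {"eps", "weight_decay"}

    ms_kwargs = {}
    for name, value in torch_kwargs.items():
        if name == "betas":
            # PyTorch passes one tuple; MindSpore takes two separate arguments.
            ms_kwargs["beta1"], ms_kwargs["beta2"] = value
        elif name in renames:
            ms_kwargs[renames[name]] = value
        elif name in unchanged:
            ms_kwargs[name] = value
        else:
            # Anything not listed in the table is outside this sketch.
            raise ValueError(f"no mapping listed for argument: {name!r}")
    return ms_kwargs


ms_kwargs = torch_adam_kwargs_to_mindspore(lr=1e-3, betas=(0.9, 0.999), amsgrad=True)
print(ms_kwargs)
```

The resulting dictionary could then be unpacked into the MindSpore constructor, for example `nn.Adam(net.trainable_params(), **ms_kwargs)`. The MindSpore-only arguments (use_locking, use_nesterov, loss_scale, and kwargs) have no PyTorch counterpart in the table, so they would be passed separately.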

Code Example# MindSpore
import mindspore
from mindspore import nn

net = nn.Dense(2, 3)
optimizer = nn.Adam(net.trainable_params())
criterion = nn.MAELoss(reduction="mean")

def forward_fn(data, label):
    logits = net(data)
    loss = criterion(logits, label)
    return loss, logits

grad_fn = mindspore.value_and_grad(forward_fn, None, optimizer.parameters, has_aux=True)

def train_step(data, label):
    (loss, _), grads = grad_fn(data, label)
    optimizer(grads)
    return loss

# PyTorch
import torch

model = torch.nn.Linear(2, 3)
criterion = torch.nn.L1Loss(reduction='mean')
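optimizer = torch.optim.Adam(model.parameters())

def train_step(data, label):
    # Clear gradients accumulated by the previous step.
    optimizer.zero_grad()
    output = model(data)
    loss = criterion(output, label)
    # Backpropagate, then apply the Adam update.
    loss.backward()
    optimizer.step()
    return loss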