Differences between torch.nn.MultiheadAttention and mindspore.nn.MultiheadAttention

torch.nn.MultiheadAttention

class torch.nn.MultiheadAttention(
    embed_dim,
    num_heads,
    dropout=0.0,
    bias=True,
    add_bias_kv=False,
    add_zero_attn=False,
    kdim=None,
    vdim=None
)(query, key, value, key_padding_mask=None, need_weights=True, attn_mask=None)

For more information, see torch.nn.MultiheadAttention

mindspore.nn.MultiheadAttention

class mindspore.nn.MultiheadAttention(
    embed_dim,
    num_heads,
    dropout=0.0,
    has_bias=True,
    add_bias_kv=False,
    add_zero_attn=False,
    kdim=None,
    vdim=None,
    batch_first=False,
    dtype=mstype.float32
)(query, key, value, key_padding_mask=None, need_weights=True, attn_mask=None, average_attn_weights=True)

For more information, see mindspore.nn.MultiheadAttention

Differences

The code implementation and parameter update logic of mindspore.nn.MultiheadAttention are mostly the same as those of torch.nn.MultiheadAttention. The remaining differences are listed in the table below.

| Categories | Subcategories | PyTorch | MindSpore | Difference |
| --- | --- | --- | --- | --- |
| Parameters | Parameter 1 | embed_dim | embed_dim | Consistent function |
|  | Parameter 2 | num_heads | num_heads | Consistent function |
|  | Parameter 3 | dropout | dropout | Consistent function |
|  | Parameter 4 | bias | has_bias | Consistent function, different parameter names |
|  | Parameter 5 | add_bias_kv | add_bias_kv | Consistent function |
|  | Parameter 6 | add_zero_attn | add_zero_attn | Consistent function |
|  | Parameter 7 | kdim | kdim | Consistent function |
|  | Parameter 8 | vdim | vdim | Consistent function |
|  | Parameter 9 | - | batch_first | In MindSpore, the first dimension can be used as the batch dimension by setting batch_first=True; PyTorch does not have this parameter (see the sketch after the code example). |
|  | Parameter 10 | - | dtype | In MindSpore, the dtype of the layer parameters can be set with dtype; PyTorch does not have this parameter (see the sketch after the code example). |
| Input | Input 1 | query | query | Consistent function |
|  | Input 2 | key | key | Consistent function |
|  | Input 3 | value | value | Consistent function |
|  | Input 4 | key_padding_mask | key_padding_mask | In MindSpore, it can be a float or bool Tensor; in PyTorch, a byte or bool Tensor (see the sketch below this table). |
|  | Input 5 | need_weights | need_weights | Consistent function |
|  | Input 6 | attn_mask | attn_mask | In MindSpore, it can be a float or bool Tensor; in PyTorch, a float, byte or bool Tensor (see the sketch below this table). |
|  | Input 7 | - | average_attn_weights | If True, the returned attn_weights are averaged across heads; otherwise they are returned per head. PyTorch does not have this input (see the sketch after the code example). |

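As noted in the table, the two APIs accept masks of different dtypes. The sketch below uses bool masks, which both frameworks accept; it assumes the mask shapes (batch_size, seq_length) for key_padding_mask and (seq_length, seq_length) for attn_mask, and the convention that True marks a position to be ignored.

# Mask dtypes (sketch): bool masks work in both frameworks
import numpy as np
import torch
import mindspore as ms

embed_dim, num_heads = 128, 8
seq_length, batch_size = 10, 8

# Ignore the last two key positions of every sequence: shape (batch_size, seq_length).
padding = np.zeros((batch_size, seq_length), dtype=bool)
padding[:, -2:] = True
# Causal mask of shape (seq_length, seq_length); True forbids attending to that position.
causal = np.triu(np.ones((seq_length, seq_length), dtype=bool), k=1)

# PyTorch: key_padding_mask as byte or bool Tensor, attn_mask as float, byte or bool Tensor.
q = k = v = torch.rand(seq_length, batch_size, embed_dim)
pt_attn = torch.nn.MultiheadAttention(embed_dim, num_heads)
pt_out, _ = pt_attn(q, k, v,
                    key_padding_mask=torch.from_numpy(padding),
                    attn_mask=torch.from_numpy(causal))

# MindSpore: key_padding_mask and attn_mask as float or bool Tensor.
q = k = v = ms.Tensor(np.random.randn(seq_length, batch_size, embed_dim), ms.float32)
ms_attn = ms.nn.MultiheadAttention(embed_dim, num_heads)
ms_out, _ = ms_attn(q, k, v,
                    key_padding_mask=ms.Tensor(padding),
                    attn_mask=ms.Tensor(causal))
print(pt_out.shape)
# torch.Size([10, 8, 128])
print(ms_out.shape)
# (10, 8, 128)
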
Code Example

# PyTorch
import torch
from torch import nn

embed_dim, num_heads = 128, 8
seq_length, batch_size = 10, 8
query = torch.rand(seq_length, batch_size, embed_dim)
key = torch.rand(seq_length, batch_size, embed_dim)
value = torch.rand(seq_length, batch_size, embed_dim)
multihead_attn = nn.MultiheadAttention(embed_dim, num_heads)
attn_output, attn_output_weights = multihead_attn(query, key, value)
print(attn_output.shape)
#torch.Size([10, 8, 128])
print(attn_output_weights.shape)
#torch.Size([8, 10, 10])

# MindSpore
import mindspore as ms
import numpy as np

embed_dim, num_heads = 128, 8
seq_length, batch_size = 10, 8
query = ms.Tensor(np.random.randn(seq_length, batch_size, embed_dim), ms.float32)
key = ms.Tensor(np.random.randn(seq_length, batch_size, embed_dim), ms.float32)
value = ms.Tensor(np.random.randn(seq_length, batch_size, embed_dim), ms.float32)
multihead_attn = ms.nn.MultiheadAttention(embed_dim, num_heads)
attn_output, attn_output_weights = multihead_attn(query, key, value)
print(attn_output.shape)
#(10, 8, 128)
print(attn_output_weights.shape)
#(8, 10, 10)
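
The MindSpore-only options listed in the table (batch_first and dtype in the constructor, average_attn_weights in the call) can be exercised as in the minimal sketch below; the shape comments give the expected outputs, with per-head weights laid out as (batch_size, num_heads, L, S).

# MindSpore-only options (sketch)
import mindspore as ms
import numpy as np

embed_dim, num_heads = 128, 8
seq_length, batch_size = 10, 8

# batch_first=True expects (batch_size, seq_length, embed_dim) inputs;
# dtype sets the dtype of the projection parameters.
multihead_attn = ms.nn.MultiheadAttention(embed_dim, num_heads,
                                          batch_first=True,
                                          dtype=ms.float32)
query = ms.Tensor(np.random.randn(batch_size, seq_length, embed_dim), ms.float32)
key = ms.Tensor(np.random.randn(batch_size, seq_length, embed_dim), ms.float32)
value = ms.Tensor(np.random.randn(batch_size, seq_length, embed_dim), ms.float32)

# average_attn_weights=False returns one attention weight matrix per head.
attn_output, attn_output_weights = multihead_attn(query, key, value,
                                                  average_attn_weights=False)
print(attn_output.shape)
# (8, 10, 128)
print(attn_output_weights.shape)
# (8, 8, 10, 10)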