mindspore.parallel.nn

Transformer Networks

Note

This is an experimental interface that is subject to change and/or deletion.

class mindspore.parallel.nn.AttentionMask(seq_length, parallel_config=default_dpmp_config)[source]

Get the lower triangular attention mask from the input mask. The input mask is a 2D tensor of shape (batch_size, seq_length) containing 1s and 0s, where 1 marks a valid token and 0 marks an invalid one.
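The masking rule can be illustrated in plain Python, independent of MindSpore. This is a minimal sketch, assuming that entry [i][j] of the result is the product of the two validity flags restricted to j <= i. The helper name lower_triangular_attention_mask is hypothetical, and it handles a single sequence rather than a batch.

```python
def lower_triangular_attention_mask(input_mask):
    """Build a (seq_length, seq_length) causal mask for one sequence.

    Entry [i][j] is 1.0 only when positions i and j are both valid tokens
    and j <= i, so a position never attends to a later one.
    """
    seq_length = len(input_mask)
    return [
        [float(input_mask[i] * input_mask[j] * (1 if j <= i else 0))
         for j in range(seq_length)]
        for i in range(seq_length)
    ]


# One sequence of length 4 whose last position is padding.
mask = lower_triangular_attention_mask([1, 1, 1, 0])
for row in mask:
    print(row)
```

For the input [1, 1, 1, 0] the sketch yields the same matrix as the class example in this section: the padded position contributes an all-zero row and column.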

Parameters
  • seq_length (int) – The sequence length of the input tensor.

  • parallel_config (OpParallelConfig) – The parallel configuration. Default: default_dpmp_config, an instance of OpParallelConfig with default args.

Inputs:
  • input_mask (Tensor) - The mask indicating whether each position is a valid input, with shape (batch_size, seq_length).

Outputs:

Tensor. The attention mask matrix with shape (batch_size, seq_length, seq_length).

Raises
  • TypeError – seq_length is not an integer.

  • ValueError – seq_length is not a positive value.

  • TypeError – parallel_config is not a subclass of OpParallelConfig.

Supported Platforms:

Ascend GPU

Examples

>>> import numpy as np
>>> from mindspore.parallel.nn import AttentionMask
>>> from mindspore import Tensor
>>> mask = AttentionMask(seq_length=4)
>>> mask_array = np.array([[1, 1, 1, 0]], np.float32)
>>> inputs = Tensor(mask_array)
>>> res = mask(inputs)
>>> print(res)
[[[1. 0. 0. 0],
  [1. 1. 0. 0],
  [1. 1. 1. 0],
  [0. 0. 0. 0]]]
class mindspore.parallel.nn.VocabEmbedding(vocab_size, embedding_size, parallel_config=default_embedding_parallel_config, param_init="normal")[source]

Looks up embeddings along the 0-th dimension of the parameter table. When parallel_config.vocab_emb_dp is True and the network runs in AUTO_PARALLEL_MODE, the lookup runs in data-parallel mode with degree parallel_config.data_parallel. Otherwise, the parameter is sharded along its 0-th dimension into parallel_config.model_parallel slices, the so-called row slice of the embedding table.
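The row slice can be pictured with a small plain-Python sketch. It is an illustration only, not the MindSpore implementation: the helpers row_slice and lookup are hypothetical, and the table is an ordinary list of lists. It also shows why a row slice needs the vocabulary size to divide evenly by the number of slices.

```python
def row_slice(table, model_parallel):
    """Split an embedding table into model_parallel contiguous row blocks."""
    vocab_size = len(table)
    if vocab_size % model_parallel != 0:
        raise ValueError("vocab_size must be a multiple of model_parallel")
    rows_per_shard = vocab_size // model_parallel
    return [table[i * rows_per_shard:(i + 1) * rows_per_shard]
            for i in range(model_parallel)]


def lookup(shards, token_id):
    """Fetch the embedding of token_id from the shard that owns its row."""
    rows_per_shard = len(shards[0])
    shard_index, local_row = divmod(token_id, rows_per_shard)
    return shards[shard_index][local_row]


# Toy table: vocab_size=6, embedding_size=2; row r holds [r, r + 0.5].
table = [[float(r), r + 0.5] for r in range(6)]
shards = row_slice(table, model_parallel=3)
print(len(shards), len(shards[0]))  # number of shards, rows per shard
print(lookup(shards, 4))            # row 4 is owned by shard 2, local row 0
```

Each shard stores only vocab_size / model_parallel rows, so a lookup first locates the owning shard and then indexes the local row.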

Parameters
  • vocab_size (int) – Size of the dictionary of embeddings.

  • embedding_size (int) – The size of each embedding vector.

  • param_init (Union[Tensor, str, Initializer, numbers.Number]) – Initializer for the embedding_table. When a string is given, refer to the initializer class for the supported values. Default: ‘normal’.

  • parallel_config (EmbeddingOpParallelConfig) – The parallel config of network. Default default_embedding_parallel_config, an instance of EmbeddingOpParallelConfig with default args.

Inputs:

input_ids (Tensor) - The tokenized inputs, of dtype int32 and shape (batch_size, seq_length).

Outputs:

Tuple, a tuple containing (output, embedding_table)

  • output (Tensor) - The embedding vector for the input with shape (batch_size, seq_length, embedding_size).

  • embedding_table (Tensor) - The embedding table with shape (vocab_size, embedding_size).

Raises
  • ValueError – parallel_config.vocab_emb_dp is True and vocab_size is not a multiple of parallel_config.model_parallel.

  • ValueError – vocab_size is not a positive value.

  • ValueError – embedding_size is not a positive value.

  • TypeError – parallel_config is not a subclass of OpParallelConfig.

Supported Platforms:

Ascend GPU

Examples

>>> import numpy as np
>>> from mindspore.parallel.nn import VocabEmbedding
>>> from mindspore import Tensor
>>> from mindspore import dtype as mstype
>>> model = VocabEmbedding(vocab_size=30, embedding_size=30)
>>> tensor = Tensor(np.ones((20, 15)), mstype.int32)
>>> output, table = model(tensor)
>>> print(output.shape)
(20, 15, 30)
>>> print(table.shape)
(30, 30)
class mindspore.parallel.nn.MultiHeadAttention(batch_size, src_seq_length, tgt_seq_length, hidden_size, num_heads, hidden_dropout_rate=0.1, attention_dropout_rate=0.1, compute_dtype=mstype.float16, softmax_compute_type=mstype.float32, param_init_type=mstype.float32, use_past=False, parallel_config=default_dpmp_config)[source]

This is an implementation of multi-head attention from the paper Attention Is All You Need. Given a query vector with the source length, and key and value vectors with the target length, attention is computed as follows:

\[MultiHeadAttention(query, key, value) = Concat(head_1, \dots, head_h)W^O\]

where \(head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)\). A bias is used by default.

If the query, key and value tensors are the same, this is self attention.
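For reference, the Attention function applied inside each head is the scaled dot-product attention defined in the cited paper. The symbol \(d_k\), the per-head key dimension, is introduced here for illustration and does not appear elsewhere on this page.

```latex
\[Attention(Q, K, V) = softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]
```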

Parameters
  • batch_size (int) – The batch size of the input tensor.

  • src_seq_length (int) – The sequence length of the query vector.

  • tgt_seq_length (int) – The sequence length of the key and value vector.

  • hidden_size (int) – The hidden size of the input.

  • num_heads (int) – The number of the heads.

  • hidden_dropout_rate (float) – The dropout rate of the final output of the layer. Default:0.1

  • attention_dropout_rate (float) – The dropout rate of the attention scores. Default:0.1

  • compute_dtype (dtype.Number) – The computation type of dense. Default dtype.float16. Should be dtype.float32 or dtype.float16.

  • param_init_type (dtype.Number) – The parameter initialization type of the module. Default dtype.float32. Should be dtype.float32 or dtype.float16.

  • softmax_compute_type (dtype.Number) – The type of softmax computation module. Default dtype.float32. Should be dtype.float32 or dtype.float16.

  • use_past (bool) – Whether to reuse the past state, which enables incremental prediction. For example, given two words and a goal of generating ten more, the state of the two words only needs to be computed once, and the following words are then generated one at a time. When use_past is True, prediction runs in two steps. First, set is_first_iteration to True with model.add_flags_recursive(is_first_iteration=True) and pass the full inputs. Then set is_first_iteration to False with model.add_flags_recursive(is_first_iteration=False), pass the input tensor for a single step, and repeat in a loop. Default False.

  • parallel_config (OpParallelConfig) – The parallel configuration. Default: default_dpmp_config, an instance of OpParallelConfig with default args.

Inputs:
  • query_tensor (Tensor) - the query vector with shape (batch_size, src_seq_length, hidden_size) or (batch_size * src_seq_length, hidden_size), if the use_past is False or is_first_iteration=True. Otherwise, must be (batch_size, 1, hidden_size)

  • key_tensor (Tensor) - the key vector with shape (batch_size, tgt_seq_length, hidden_size) or (batch_size * tgt_seq_length, hidden_size), if the use_past is False or is_first_iteration=True. Otherwise, must be (batch_size, 1, hidden_size)

  • value_tensor (Tensor) - the value vector with shape (batch_size, tgt_seq_length, hidden_size) or (batch_size * tgt_seq_length, hidden_size), if the use_past is False or is_first_iteration=True. Otherwise, must be (batch_size, 1, hidden_size)

  • attention_mask (Tensor) - the attention mask matrix with shape (batch_size, src_seq_length, tgt_seq_length), if the use_past is False or is_first_iteration=True. Otherwise, must be (batch_size, 1, tgt_seq_length)

  • key_past (Tensor) - Float16 tensor with shape (batch_size, num_heads, size_per_head, tgt_seq_length). The past calculated key vector. Used for incremental prediction when the use_past is True. Default None.

  • value_past (Tensor) - Float16 tensor with shape (batch_size, num_heads, tgt_seq_length, size_per_head). The past calculated value vector. Used for incremental prediction when the use_past is True. Default None.

  • batch_valid_length (Tensor) - Int32 tensor with shape (batch_size,) holding the previously computed index. Used for incremental prediction when use_past is True. Default None.
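The shape rules listed above can be collected in a small helper. This is a sketch in plain Python, not part of the MindSpore API: the function attention_input_shapes is hypothetical, it covers only the 3-D input layout, and it assumes size_per_head equals hidden_size // num_heads, which agrees with the shapes printed in the example for this class.

```python
def attention_input_shapes(batch_size, src_seq_length, tgt_seq_length,
                           hidden_size, num_heads,
                           use_past=False, is_first_iteration=True):
    """Return the expected input shapes for one call, keyed by input name."""
    # Assumption: each head gets an equal slice of the hidden dimension.
    size_per_head = hidden_size // num_heads
    full_sequence = (not use_past) or is_first_iteration
    query_len = src_seq_length if full_sequence else 1
    key_value_len = tgt_seq_length if full_sequence else 1
    shapes = {
        "query_tensor": (batch_size, query_len, hidden_size),
        "key_tensor": (batch_size, key_value_len, hidden_size),
        "value_tensor": (batch_size, key_value_len, hidden_size),
        # The last mask dimension always spans the full target length.
        "attention_mask": (batch_size, query_len, tgt_seq_length),
    }
    if use_past:
        # The caches keep their full-length shapes in both steps.
        shapes["key_past"] = (batch_size, num_heads, size_per_head, tgt_seq_length)
        shapes["value_past"] = (batch_size, num_heads, tgt_seq_length, size_per_head)
        shapes["batch_valid_length"] = (batch_size,)
    return shapes


settings = dict(batch_size=2, src_seq_length=20, tgt_seq_length=20,
                hidden_size=15, num_heads=3, use_past=True)
first_step = attention_input_shapes(is_first_iteration=True, **settings)
next_step = attention_input_shapes(is_first_iteration=False, **settings)
print(first_step["query_tensor"], first_step["attention_mask"])
print(next_step["query_tensor"], next_step["attention_mask"])
print(next_step["key_past"], next_step["value_past"])
```

Only the query, key, value and mask shrink to a single position in the incremental step; the key and value caches stay at the full target length.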

Outputs:

Tuple, a tuple containing (output, layer_present)

  • output (Tensor) - The float tensor output of the layer with shape (batch_size, src_seq_length, hidden_size) or (batch_size * src_seq_length, hidden_size), if use_past is False or is_first_iteration=True. Otherwise, it will be (batch_size, 1, hidden_size).

  • layer_present (Tuple) - A tuple of the projected key and value tensors with shapes ((batch_size, num_heads, size_per_head, tgt_seq_length), (batch_size, num_heads, tgt_seq_length, size_per_head)).

Supported Platforms:

Ascend GPU

Examples

>>> import numpy as np
>>> from mindspore.parallel.nn import MultiHeadAttention
>>> from mindspore import dtype as mstype
>>> from mindspore import Tensor
>>> model = MultiHeadAttention(batch_size=2, hidden_size=15, src_seq_length=20, tgt_seq_length=20,
...                            num_heads=3)
>>> from_tensor = Tensor(np.ones((2, 20, 15)), mstype.float32)
>>> to_tensor = Tensor(np.ones((2, 20, 15)), mstype.float16)
>>> attention_mask = Tensor(np.ones((2, 20, 20)), mstype.float16)
>>> attn_out, past = model(from_tensor, to_tensor, to_tensor, attention_mask)
>>> print(attn_out.shape)
(2, 20, 15)
>>> print(past[0].shape)
(2, 3, 5, 20)
>>> print(past[1].shape)
(2, 3, 20, 5)
# With use_past=True, incremental prediction takes two steps.
# Step 1: set is_first_iteration=True and pass the full-length sequence.
# First prepare the memory parameters that store the key and value states.
>>> model = MultiHeadAttention(batch_size=2, hidden_size=15, src_seq_length=20, tgt_seq_length=20,
...                            num_heads=3, use_past=True)
>>> key_past = Tensor(np.zeros(shape=(2, 3, 5, 20)), mstype.float16)
>>> value_past = Tensor(np.zeros(shape=(2, 3, 20, 5)), mstype.float16)
>>> batch_valid_length = Tensor(np.ones((2,)), mstype.int32)
# Set is_first_iteration=True to generate the full memory states
>>> model.add_flags_recursive(is_first_iteration=True)
>>> attn_out, past = model(from_tensor, to_tensor, to_tensor, attention_mask, key_past, value_past,
...                        batch_valid_length)
>>> print(attn_out.shape)
(2, 20, 15)
>>> print(past[0].shape)
(2, 3, 5, 20)
>>> print(past[1].shape)
(2, 3, 20, 5)
>>> from_tensor = Tensor(np.ones((2, 1, 15)), mstype.float32)
>>> to_tensor = Tensor(np.ones((2, 1, 15)), mstype.float16)
>>> attention_mask = Tensor(np.ones((2, 1, 20)), mstype.float16)
# Step 2: set is_first_iteration=False and pass a single word, rather than the full sequence, to run the
# prediction.
>>> model.add_flags_recursive(is_first_iteration=False)
>>> attn_out, past = model(from_tensor, to_tensor, to_tensor, attention_mask, key_past, value_past,
...                        batch_valid_length)
>>> print(attn_out.shape)
(2, 1, 15)
>>> print(past[0].shape)
(2, 3, 5, 20)
>>> print(past[1].shape)
(2, 3, 20, 5)
class mindspore.parallel.nn.FeedForward(hidden_size, ffn_hidden_size, dropout_rate, hidden_act="gelu", expert_num=1, param_init_type=mstype.float32, parallel_config=default_dpmp_config)[source]

A multilayer perceptron with two linear layers and dropout applied to the final output. The first linear layer projects the input from hidden_size to ffn_hidden_size, and the second projects it back from ffn_hidden_size to hidden_size. The first linear layer is sharded on the relative dimension, and the second is sharded on the output dimension. The overall computation is

\[Dropout((xW_1+b_1)W_2 + b_2)\]

where the \(W_1, W_2, b_1\) and \(b_2\) are trainable parameters.
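The formula can be traced with a small plain-Python sketch. It is an illustration under stated assumptions, not the MindSpore implementation: the helper names are hypothetical, dropout is omitted (as at inference time), and the activation is an optional argument because the formula as written does not show it. Applying it right after the first linear layer is an assumption about where hidden_act acts.

```python
import math


def gelu(v):
    """Exact GELU using the Gaussian error function."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def linear(x, weight, bias):
    """Compute x @ weight + bias for a single input vector x."""
    return [sum(x[i] * weight[i][j] for i in range(len(x))) + bias[j]
            for j in range(len(bias))]


def feed_forward(x, w1, b1, w2, b2, activation=None):
    """Two linear layers; dropout is left out of this sketch."""
    hidden = linear(x, w1, b1)            # hidden_size -> ffn_hidden_size
    if activation is not None:
        hidden = [activation(h) for h in hidden]
    return linear(hidden, w2, b2)         # ffn_hidden_size -> hidden_size


# hidden_size=2, ffn_hidden_size=3, with hand-picked toy weights.
x = [1.0, 2.0]
w1 = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b2 = [0.5, -0.5]

plain_output = feed_forward(x, w1, b1, w2, b2)
gelu_output = feed_forward(x, w1, b1, w2, b2, activation=gelu)
print(plain_output)
print(gelu_output)
```

Without an activation the intermediate vector is [1.0, 2.0, 3.0] and the output is [4.5, 4.5]; with GELU each intermediate value shrinks slightly, so both outputs fall just below 4.5.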

Parameters
  • hidden_size (int) – The dimension of the inputs.

  • ffn_hidden_size (int) – The intermediate hidden size.

  • dropout_rate (float) – The dropout rate for the second linear’s output.

  • hidden_act (str) – The activation of the internal feedforward layer. Supports ‘relu’, ‘relu6’, ‘tanh’, ‘gelu’, ‘fast_gelu’, ‘elu’, ‘sigmoid’, ‘prelu’, ‘leakyrelu’, ‘hswish’, ‘hsigmoid’, ‘logsigmoid’ and so on. Default: gelu.

  • expert_num (int) – The number of experts used in Linear. For the case expert_num > 1, BatchMatMul is used and the first dimension in BatchMatMul indicate expert_num. Default: 1.

  • param_init_type (dtype.Number) – The parameter initialization type. Should be dtype.float32 or dtype.float16. Default: dtype.float32.

  • parallel_config (OpParallelConfig) – The parallel configuration, see OpParallelConfig. Default: default_dpmp_config, an instance of OpParallelConfig with default args.

Inputs:
  • x (Tensor) - should be [batch, seq_length, hidden_size] or [batch * seq_length, hidden_size]. Float tensor.

Outputs:

Tensor, the output of this layer after mapping. The shape is [batch, seq_length, hidden_size] or [batch * seq_length, hidden_size].

Raises
  • ValueError – hidden_act is not a string.

  • TypeError – parallel_config is not a subclass of OpParallelConfig.

  • ValueError – ffn_hidden_size is not a multiple of the model parallel way.

  • ValueError – hidden_size is not a multiple of the model parallel way.

Supported Platforms:

Ascend GPU

Examples

>>> import numpy as np
>>> from mindspore.parallel.nn import FeedForward
>>> from mindspore import dtype as mstype
>>> from mindspore import Tensor
>>> model = FeedForward(hidden_size=15, ffn_hidden_size=30, dropout_rate=0.1)
>>> tensor = Tensor(np.ones((2, 20, 15)), mstype.float32)
>>> output = model(tensor)
>>> print(output.shape)
(2, 20, 15)
class mindspore.parallel.nn.TransformerEncoder(batch_size, num_layers, hidden_size, ffn_hidden_size, seq_length, num_heads, attention_dropout_rate=0.1, hidden_dropout_rate=0.1, hidden_act="gelu", post_layernorm_residual=False, layernorm_compute_type=mstype.float32, softmax_compute_type=mstype.float32, param_init_type=mstype.float32, lambda_func=None, offset=0, use_past=False, moe_config=default_moe_config, parallel_config=default_transformer_config)[source]

Transformer Encoder module consisting of multiple stacked TransformerEncoderLayer layers, each including a multi-head self attention layer and a feedforward layer.

Parameters
  • batch_size (int) – The batch size of the input tensor.

  • num_layers (int) – The number of TransformerEncoderLayer layers.

  • hidden_size (int) – The hidden size of the input.

  • ffn_hidden_size (int) – The hidden size of bottleneck in the feedforward layer.

  • seq_length (int) – The seq_length of the input tensor.

  • num_heads (int) – The number of the heads.

  • hidden_dropout_rate (float) – The dropout rate of the final output of the layer. Default:0.1

  • attention_dropout_rate (float) – The dropout rate of the attention scores. Default:0.1

  • post_layernorm_residual (bool) – Whether to add the residual before the layernorm. Default False.

  • hidden_act (str) – The activation of the internal feedforward layer. Supports ‘relu’, ‘relu6’, ‘tanh’, ‘gelu’, ‘fast_gelu’, ‘elu’, ‘sigmoid’, ‘prelu’, ‘leakyrelu’, ‘hswish’, ‘hsigmoid’, ‘logsigmoid’ and so on. Default: gelu.

  • layernorm_compute_type (dtype.Number) – The computation type of the layernorm. Should be dtype.float32 or dtype.float16. Default dtype.float32.

  • softmax_compute_type (dtype.Number) – The computation type of the softmax in the attention. Should be dtype.float32 or dtype.float16. Default mstype.float32.

  • param_init_type (dtype.Number) – The parameter initialization type of the module. Should be dtype.float32 or dtype.float16. Default dtype.float32.

  • use_past (bool) – Whether to reuse the past state, which enables incremental prediction. For example, given two words and a goal of generating ten more, the state of the two words only needs to be computed once, and the following words are then generated one at a time. When use_past is True, prediction runs in two steps. First, set is_first_iteration to True with model.add_flags_recursive(is_first_iteration=True) and pass the full inputs. Then set is_first_iteration to False with model.add_flags_recursive(is_first_iteration=False), pass the input tensor for a single step, and repeat in a loop. Default False.

  • lambda_func – A function that determines the fusion index, pipeline stage and recompute attribute. To control the pipeline stage and gradient aggregation fusion, pass a function that accepts network, layer_id, offset, parallel_config and layers. Here network (Cell) is the transformer block, layer_id (int) is the index of the current layer, counting from zero, and offset (int) is the offset added to layer_id when the net contains other modules. The default pipeline setting is: (layer_id + offset) // (layers / pipeline_stage).

  • offset (int) – The initial layer index for the decoder. Used for setting the fusion id and stage id, to not overlap with the encoder layer.

  • moe_config (MoEConfig) – The configuration of MoE (Mixture of Expert).

  • parallel_config (TransformerOpParallelConfig) – The parallel configuration. Default: default_transformer_config, an instance of TransformerOpParallelConfig with default args.
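The default pipeline rule quoted for lambda_func, (layer_id + offset) // (layers / pipeline_stage), can be sketched as follows. This is not the library's own code: Block and ParallelConfig are stand-in classes, set_stage is a hypothetical function with the five-argument signature described for lambda_func, and the integer layers-per-stage count with a minimum of 1 is an assumption made to keep the division well defined.

```python
def default_pipeline_stage(layer_id, offset, layers, pipeline_stage):
    """Map a layer index to a stage: (layer_id + offset) // layers_per_stage."""
    # Assumption: use an integer layers-per-stage count, at least 1.
    layers_per_stage = max(layers // pipeline_stage, 1)
    return (layer_id + offset) // layers_per_stage


class Block:
    """Stand-in for a transformer block (a Cell in MindSpore)."""
    pipeline_stage = None


class ParallelConfig:
    """Stand-in exposing only the number of pipeline stages."""
    def __init__(self, pipeline_stage):
        self.pipeline_stage = pipeline_stage


def set_stage(network, layer_id, offset, parallel_config, layers):
    """Example function with the signature expected of lambda_func."""
    network.pipeline_stage = default_pipeline_stage(
        layer_id, offset, layers, parallel_config.pipeline_stage)


# Four layers spread over two pipeline stages, no offset.
blocks = [Block() for _ in range(4)]
config = ParallelConfig(pipeline_stage=2)
for index, block in enumerate(blocks):
    set_stage(block, index, 0, config, len(blocks))
stages = [block.pipeline_stage for block in blocks]
print(stages)
```

With four layers and two stages, the first two layers land on stage 0 and the last two on stage 1. A nonzero offset shifts the layer index before the division.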

Inputs:
  • hidden_states (Tensor) - Tensor, shape should be [batch_size, seq_length, hidden_size] or [batch_size * seq_length, hidden_size], if the use_past is False or is_first_iteration=True. Otherwise, should be [batch_size, 1, hidden_size].

  • attention_mask (Tensor) - Tensor, attention mask with shape [batch_size, seq_length, seq_length]

  • init_reset (Tensor) - A bool tensor with shape [1], used to clear the past key parameter and past value parameter used in the incremental prediction. Only valid when use_past is True. Default True

  • batch_valid_length (Tensor) - Int32 tensor with shape [batch_size] holding the previously computed index. Used for incremental prediction when use_past is True. Default None.

Outputs:

Tuple, a tuple containing (output, layer_present)

  • output (Tensor) - The float tensor output of the layer with shape (batch_size, seq_length, hidden_size) or (batch_size * seq_length, hidden_size), if use_past is False or is_first_iteration=True. Otherwise, it will be (batch_size, 1, hidden_size).

  • layer_present (Tuple) - A tuple of size num_layers, where each element is a tuple of the projected key and value tensors with shapes ((batch_size, num_heads, size_per_head, seq_length), (batch_size, num_heads, seq_length, size_per_head)).

Supported Platforms:

Ascend GPU

Examples

>>> import numpy as np
>>> from mindspore import dtype as mstype
>>> from mindspore.parallel.nn import TransformerEncoder
>>> from mindspore import Tensor
>>> model = TransformerEncoder(batch_size=2, num_layers=2, hidden_size=8, ffn_hidden_size=64, seq_length=16,
...                            num_heads=2)
>>> encoder_input_value = Tensor(np.ones((2, 16, 8)), mstype.float32)
>>> encoder_input_mask = Tensor(np.ones((2, 16, 16)), mstype.float16)
>>> output, past = model(encoder_input_value, encoder_input_mask)
>>> print(output.shape)
(2, 16, 8)
>>> print(len(past))
2
>>> print(past[0][0].shape)
(2, 2, 4, 16)
>>> print(past[0][1].shape)
(2, 2, 16, 4)
# With use_past=True, incremental prediction takes two steps.
# Step 1: set is_first_iteration=True and pass the full-length sequence.
>>> batch_valid_length = Tensor(np.ones((2,)), mstype.int32)
>>> init_reset = Tensor([True], mstype.bool_)
# Set is_first_iteration=True to generate the full memory states
>>> model = TransformerEncoder(batch_size=2, hidden_size=8, ffn_hidden_size=64, seq_length=16,
...                            num_heads=2, num_layers=2, use_past=True)
>>> model.add_flags_recursive(is_first_iteration=True)
>>> hidden, past = model(encoder_input_value, encoder_input_mask, init_reset, batch_valid_length)
>>> print(hidden.shape)
(2, 16, 8)
>>> print(past[0][0].shape)
(2, 2, 4, 16)
>>> print(past[0][1].shape)
(2, 2, 16, 4)
>>> encoder_input_value = Tensor(np.ones((2, 1, 8)), mstype.float32)
>>> encoder_input_mask = Tensor(np.ones((2, 1, 16)), mstype.float16)
>>> init_reset = Tensor([False], mstype.bool_)
# Step 2: set is_first_iteration=False and pass a single word, rather than the full sequence, to run the
# prediction.
>>> model.add_flags_recursive(is_first_iteration=False)
>>> hidden, past = model(encoder_input_value, encoder_input_mask, init_reset, batch_valid_length)
>>> print(hidden.shape)
(2, 1, 8)
>>> print(past[0][0].shape)
(2, 2, 4, 16)
>>> print(past[0][1].shape)
(2, 2, 16, 4)
class mindspore.parallel.nn.TransformerDecoder(num_layers, batch_size, hidden_size, ffn_hidden_size, src_seq_length, tgt_seq_length, num_heads, attention_dropout_rate=0.1, hidden_dropout_rate=0.1, post_layernorm_residual=False, layernorm_compute_type=mstype.float32, softmax_compute_type=mstype.float32, param_init_type=mstype.float32, hidden_act="gelu", lambda_func=None, use_past=False, offset=0, moe_config=default_moe_config, parallel_config=default_transformer_config)[source]

Transformer Decoder module consisting of multiple stacked TransformerDecoderLayer layers, each including a multi-head self attention layer, a cross attention layer and a feedforward layer.

Parameters
  • batch_size (int) – The batch size of the input tensor.

  • num_layers (int) – The number of TransformerDecoderLayer layers.

  • hidden_size (int) – The hidden size of the input.

  • ffn_hidden_size (int) – The hidden size of bottleneck in the feedforward layer.

  • src_seq_length (int) – The input source sequence length.

  • tgt_seq_length (int) – The input target sequence length.

  • num_heads (int) – The number of the heads.

  • hidden_dropout_rate (float) – The dropout rate of the final output of the layer. Default:0.1.

  • attention_dropout_rate (float) – The dropout rate of the attention scores. Default:0.1.

  • post_layernorm_residual (bool) – Whether to add the residual before the layernorm. Default False.

  • hidden_act (str) – The activation of the internal feedforward layer. Supports ‘relu’, ‘relu6’, ‘tanh’, ‘gelu’, ‘fast_gelu’, ‘elu’, ‘sigmoid’, ‘prelu’, ‘leakyrelu’, ‘hswish’, ‘hsigmoid’, ‘logsigmoid’ and so on. Default: gelu.

  • layernorm_compute_type (dtype.Number) – The computation type of the layernorm. Should be dtype.float32 or dtype.float16. Default dtype.float32.

  • softmax_compute_type (dtype.Number) – The computation type of the softmax in the attention. Should be dtype.float32 or dtype.float16. Default mstype.float32.

  • param_init_type (dtype.Number) – The parameter initialization type of the module. Should be dtype.float32 or dtype.float16. Default dtype.float32.

  • offset (int) – The initial layer index for the decoder. Used for setting the fusion id and stage id, to not overlap with the encoder layer.

  • lambda_func – A function that determines the fusion index, pipeline stage and recompute attribute. To control the pipeline stage and gradient aggregation fusion, pass a function that accepts network, layer_id, offset, parallel_config and layers. Here network (Cell) is the transformer block, layer_id (int) is the index of the current layer, counting from zero, and offset (int) is the offset added to layer_id when the net contains other modules. The default pipeline setting is: (layer_id + offset) // (layers / pipeline_stage). Default: None.

  • moe_config (MoEConfig) – The configuration of MoE (Mixture of Expert).

  • parallel_config (TransformerOpParallelConfig) – The parallel configuration. Default: default_transformer_config, an instance of TransformerOpParallelConfig with default args.

Inputs:
  • hidden_stats (Tensor) - the input tensor with shape [batch_size, seq_length, hidden_size] or [batch_size * seq_length, hidden_size]

  • attention_mask (Tensor) - the attention mask for decoder with shape [batch_size, seq_length, seq_length]

  • encoder_output (Tensor) - the output of the encoder with shape [batch_size, seq_length, hidden_size] or [batch_size * seq_length, hidden_size]. Note that this argument cannot be None when the net is the outermost layer. Default None.

  • memory_mask (Tensor) - the memory mask of the cross attention with shape [batch, tgt_seq_length, src_seq_length], where tgt_seq_length is the length of the decoder. Note that this argument cannot be None when the net is the outermost layer. Default None.

  • init_reset (Tensor) - A bool tensor with shape [1], used to clear the past key parameter and past value parameter used in the incremental prediction. Only valid when use_past is True. Default True.

  • batch_valid_length (Tensor) - Int32 tensor with shape [batch_size] holding the previously computed index. Used for incremental prediction when use_past is True. Default None.

Outputs:

Tuple, a tuple containing (output, layer_present)

  • output (Tensor) - The output logit of this layer. The shape is [batch, tgt_seq_length, hidden_size] or [batch * tgt_seq_length, hidden_size].

  • layer_present (Tuple) - A tuple of size num_layers, where each element is a tuple of four tensors: the projected key and value of the self attention with shapes (batch_size, num_heads, size_per_head, tgt_seq_length) and (batch_size, num_heads, tgt_seq_length, size_per_head), followed by the projected key and value of the cross attention with shapes (batch_size, num_heads, size_per_head, src_seq_length) and (batch_size, num_heads, src_seq_length, size_per_head).
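The nesting of layer_present can be sketched as a shape-only helper. The function decoder_layer_present_shapes is hypothetical and not part of MindSpore, and it assumes size_per_head equals hidden_size // num_heads, which agrees with the shapes printed in the example for this class.

```python
def decoder_layer_present_shapes(num_layers, batch_size, num_heads, hidden_size,
                                 src_seq_length, tgt_seq_length):
    """Return one 4-tuple of tensor shapes per decoder layer."""
    # Assumption: each head gets an equal slice of the hidden dimension.
    size_per_head = hidden_size // num_heads
    per_layer = (
        (batch_size, num_heads, size_per_head, tgt_seq_length),  # self attention key
        (batch_size, num_heads, tgt_seq_length, size_per_head),  # self attention value
        (batch_size, num_heads, size_per_head, src_seq_length),  # cross attention key
        (batch_size, num_heads, src_seq_length, size_per_head),  # cross attention value
    )
    return tuple(per_layer for _ in range(num_layers))


present = decoder_layer_present_shapes(num_layers=1, batch_size=2, num_heads=2,
                                       hidden_size=64, src_seq_length=20,
                                       tgt_seq_length=10)
print(len(present))
for shape in present[0]:
    print(shape)
```

The self attention entries are sized by tgt_seq_length, while the cross attention entries follow src_seq_length because they are projected from the encoder output.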

Supported Platforms:

Ascend GPU

Examples

>>> import numpy as np
>>> from mindspore import dtype as mstype
>>> from mindspore.parallel.nn import TransformerDecoder
>>> from mindspore import Tensor
>>> model = TransformerDecoder(batch_size=2, num_layers=1, hidden_size=64, ffn_hidden_size=64,
...                            num_heads=2, src_seq_length=20, tgt_seq_length=10)
>>> encoder_input_value = Tensor(np.ones((2, 20, 64)), mstype.float32)
>>> decoder_input_value = Tensor(np.ones((2, 10, 64)), mstype.float32)
>>> decoder_input_mask = Tensor(np.ones((2, 10, 10)), mstype.float16)
>>> memory_mask = Tensor(np.ones((2, 10, 20)), mstype.float16)
>>> output, past = model(decoder_input_value, decoder_input_mask, encoder_input_value, memory_mask)
>>> print(output.shape)
(2, 10, 64)
>>> print(len(past))
1
>>> print(past[0][0].shape)
(2, 2, 32, 10)
>>> print(past[0][1].shape)
(2, 2, 10, 32)
>>> print(past[0][2].shape)
(2, 2, 32, 20)
>>> print(past[0][3].shape)
(2, 2, 20, 32)
class mindspore.parallel.nn.TransformerEncoderLayer(batch_size, hidden_size, ffn_hidden_size, num_heads, seq_length, attention_dropout_rate=0.1, hidden_dropout_rate=0.1, post_layernorm_residual=False, layernorm_compute_type=mstype.float32, softmax_compute_type=mstype.float32, param_init_type=mstype.float32, hidden_act="gelu", use_past=False, moe_config=default_moe_config, parallel_config=default_dpmp_config)[source]

Transformer Encoder Layer. This is an implementation of a single layer of the transformer encoder, including a multi-head attention layer and a feedforward layer.

Parameters
  • batch_size (int) – The batch size of the input tensor.

  • hidden_size (int) – The hidden size of the input.

  • seq_length (int) – The input sequence length.

  • ffn_hidden_size (int) – The hidden size of bottleneck in the feedforward layer.

  • num_heads (int) – The number of the heads.

  • hidden_dropout_rate (float) – The dropout rate of the final output of the layer. Default:0.1

  • attention_dropout_rate (float) – The dropout rate of the attention scores. Default:0.1

  • post_layernorm_residual (bool) – Whether to add the residual before the layernorm. Default False.

  • hidden_act (str) – The activation of the internal feedforward layer. Supports ‘relu’, ‘relu6’, ‘tanh’, ‘gelu’, ‘fast_gelu’, ‘elu’, ‘sigmoid’, ‘prelu’, ‘leakyrelu’, ‘hswish’, ‘hsigmoid’, ‘logsigmoid’ and so on. Default: gelu.

  • layernorm_compute_type (dtype.Number) – The computation type of the layernorm. Should be dtype.float32 or dtype.float16. Default dtype.float32.

  • softmax_compute_type (dtype.Number) – The computation type of the softmax in the attention. Should be dtype.float32 or dtype.float16. Default mstype.float32.

  • param_init_type (dtype.Number) – The parameter initialization type of the module. Should be dtype.float32 or dtype.float16. Default dtype.float32.

  • use_past (bool) – Whether to reuse the past state, which enables incremental prediction. For example, given two words and a goal of generating ten more, the state of the two words only needs to be computed once, and the following words are then generated one at a time. When use_past is True, prediction runs in two steps. First, set is_first_iteration to True with model.add_flags_recursive(is_first_iteration=True) and pass the full inputs. Then set is_first_iteration to False with model.add_flags_recursive(is_first_iteration=False), pass the input tensor for a single step, and repeat in a loop. Default False.

  • moe_config (MoEConfig) – The configuration of MoE (Mixture of Expert).

  • parallel_config (OpParallelConfig) – The parallel configuration. Default: default_dpmp_config, an instance of OpParallelConfig with default args.

Inputs:
  • x (Tensor) - Float Tensor, shape should be [batch_size, seq_length, hidden_size] or [batch_size * seq_length, hidden_size], if the use_past is False or is_first_iteration=True. Otherwise, should be [batch_size, 1, hidden_size]

  • input_mask (Tensor) - Float Tensor, attention mask with shape [batch_size, seq_length, seq_length], if use_past is False or is_first_iteration=True. Otherwise, should be [batch_size, 1, seq_length].

  • init_reset (Tensor) - A bool tensor with shape [1], used to clear the past key parameter and past value parameter used in the incremental prediction. Only valid when use_past is True. Default True.

  • batch_valid_length (Tensor) - Int32 tensor with shape [batch_size] holding the previously computed index. Used for incremental prediction when use_past is True. Default None.

Outputs:

Tuple, a tuple containing (output, layer_present).

  • output (Tensor) - The float tensor output of the layer with shape (batch_size, seq_length, hidden_size) or (batch_size * seq_length, hidden_size), if use_past is False or is_first_iteration=True. Otherwise, it will be (batch_size, 1, hidden_size).

  • layer_present (Tuple) - A tuple of the projected key and value tensors with shapes ((batch_size, num_heads, size_per_head, seq_length), (batch_size, num_heads, seq_length, size_per_head)).

Supported Platforms:

Ascend GPU

Examples

>>> import numpy as np
>>> from mindspore import dtype as mstype
>>> from mindspore.parallel.nn import TransformerEncoderLayer
>>> from mindspore import Tensor
>>> model = TransformerEncoderLayer(batch_size=2, hidden_size=8, ffn_hidden_size=64, seq_length=16,
...                                 num_heads=2)
>>> encoder_input_value = Tensor(np.ones((2, 16, 8)), mstype.float32)
>>> encoder_input_mask = Tensor(np.ones((2, 16, 16)), mstype.float16)
>>> output, past = model(encoder_input_value, encoder_input_mask)
>>> print(output.shape)
(2, 16, 8)
>>> print(past[0].shape)
(2, 2, 4, 16)
>>> print(past[1].shape)
(2, 2, 16, 4)
# With use_past=True, incremental prediction takes two steps.
# Step 1: set is_first_iteration=True and pass the full-length sequence.
>>> batch_valid_length = Tensor(np.ones((2,)), mstype.int32)
>>> init_reset = Tensor([True], mstype.bool_)
# Set is_first_iteration=True to generate the full memory states
>>> model = TransformerEncoderLayer(batch_size=2, hidden_size=8, ffn_hidden_size=64, seq_length=16,
...                                 num_heads=2, use_past=True)
>>> model.add_flags_recursive(is_first_iteration=True)
>>> hidden, past = model(encoder_input_value, encoder_input_mask, init_reset, batch_valid_length)
>>> print(hidden.shape)
(2, 16, 8)
>>> print(past[0].shape)
(2, 2, 4, 16)
>>> print(past[1].shape)
(2, 2, 16, 4)
>>> encoder_input_value = Tensor(np.ones((2, 1, 8)), mstype.float32)
>>> encoder_input_mask = Tensor(np.ones((2, 1, 16)), mstype.float16)
>>> init_reset = Tensor([False], mstype.bool_)
# Step 2: set is_first_iteration=False and pass a single word, rather than the full sequence, to run the
# prediction.
>>> model.add_flags_recursive(is_first_iteration=False)
>>> hidden, past = model(encoder_input_value, encoder_input_mask, init_reset, batch_valid_length)
>>> print(hidden.shape)
(2, 1, 8)
>>> print(past[0].shape)
(2, 2, 4, 16)
>>> print(past[1].shape)
(2, 2, 16, 4)
class mindspore.parallel.nn.TransformerDecoderLayer(hidden_size, ffn_hidden_size, num_heads, batch_size, src_seq_length, tgt_seq_length, attention_dropout_rate=0.1, hidden_dropout_rate=0.1, post_layernorm_residual=False, use_past=False, layernorm_compute_type=mstype.float32, softmax_compute_type=mstype.float32, param_init_type=mstype.float32, hidden_act="gelu", moe_config=default_moe_config, parallel_config=default_dpmp_config)[source]

Transformer Decoder Layer. This is an implementation of a single layer of the transformer decoder, including the self-attention, cross-attention and feedforward layers. When encoder_output is None, the cross attention has no effect.

Parameters
  • batch_size (int) – The batch size of the input tensor.

  • hidden_size (int) – The hidden size of the input.

  • src_seq_length (int) – The input source sequence length.

  • tgt_seq_length (int) – The input target sequence length.

  • ffn_hidden_size (int) – The hidden size of bottleneck in the feedforward layer.

  • num_heads (int) – The number of the heads.

  • hidden_dropout_rate (float) – The dropout rate of the final output of the layer. Default: 0.1.

  • attention_dropout_rate (float) – The dropout rate of the attention scores. Default: 0.1.

  • post_layernorm_residual (bool) – Whether to do the residual addition before the layernorm. Default: False.

  • hidden_act (str) – The activation of the internal feedforward layer. Supports ‘relu’, ‘relu6’, ‘tanh’, ‘gelu’, ‘fast_gelu’, ‘elu’, ‘sigmoid’, ‘prelu’, ‘leakyrelu’, ‘hswish’, ‘hsigmoid’, ‘logsigmoid’ and so on. Default: gelu.

  • layernorm_compute_type (dtype.Number) – The computation type of the layernorm. Should be dtype.float32 or dtype.float16. Default dtype.float32.

  • softmax_compute_type (dtype.Number) – The computation type of the softmax in the attention. Should be dtype.float32 or dtype.float16. Default mstype.float32.

  • param_init_type (dtype.Number) – The parameter initialization type of the module. Should be dtype.float32 or dtype.float16. Default dtype.float32.

  • use_past (bool) – Use the past state to compute, used for incremental prediction. Default False.

  • moe_config (MoEConfig) – The configuration of MoE (Mixture of Expert).

  • parallel_config (OpParallelConfig) – The parallel configure. Default default_dpmp_config, an instance of OpParallelConfig with default args.

Inputs:
  • hidden_stats (Tensor) - the input tensor with shape [batch_size, tgt_seq_length, hidden_size] or [batch_size * tgt_seq_length, hidden_size].

  • decoder_mask (Tensor) - the attention mask for decoder with shape [batch_size, tgt_seq_length, tgt_seq_length].

  • encoder_output (Tensor) - the output of the encoder with shape [batch_size, seq_length, hidden_size] or [batch_size * seq_length, hidden_size]. Note that this argument cannot be None when the net is the outermost layer. Default None.

  • memory_mask (Tensor) - the memory mask of the cross attention with shape [batch, tgt_seq_length, src_seq_length], where tgt_seq_length is the length of the decoder. Note that this argument cannot be None when the net is the outermost layer. Default None.

  • init_reset (Tensor) - A bool tensor with shape [1], used to clear the past key parameter and past value parameter used in the incremental prediction. Only valid when use_past is True. Default True.

  • batch_valid_length (Tensor) - Int32 tensor with shape [batch_size], the index that has been calculated in the past. Used for incremental prediction when use_past is True. Default None.

Outputs:

Tuple, a tuple containing (output, layer_present).

  • output (Tensor) - the output logit of this layer. The shape is [batch, seq_length, hidden_size] or [batch * seq_length, hidden_size].

  • layer_present (Tuple) - A tuple of tensors: the projected key and value vectors in self attention with shapes (batch_size, num_heads, size_per_head, tgt_seq_length) and (batch_size, num_heads, tgt_seq_length, size_per_head), followed by the projected key and value vectors in cross attention with shapes (batch_size, num_heads, size_per_head, src_seq_length) and (batch_size, num_heads, src_seq_length, size_per_head).
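
As a rough aid for reading the layer_present shapes, the sketch below computes them from the constructor arguments. It assumes size_per_head = hidden_size // num_heads, which agrees with the printed shapes in the example that follows but is not stated explicitly here. The helper name layer_present_shapes is hypothetical and is not part of mindspore.parallel.nn.

```python
def layer_present_shapes(batch_size, num_heads, hidden_size,
                         src_seq_length, tgt_seq_length):
    """Hypothetical helper: expected shapes of the four layer_present tensors."""
    # Assumption: each head gets an equal slice of the hidden dimension.
    size_per_head = hidden_size // num_heads
    return (
        (batch_size, num_heads, size_per_head, tgt_seq_length),  # self-attention key
        (batch_size, num_heads, tgt_seq_length, size_per_head),  # self-attention value
        (batch_size, num_heads, size_per_head, src_seq_length),  # cross-attention key
        (batch_size, num_heads, src_seq_length, size_per_head),  # cross-attention value
    )


# Same settings as the TransformerDecoderLayer example in this section.
shapes = layer_present_shapes(batch_size=2, num_heads=2, hidden_size=64,
                              src_seq_length=20, tgt_seq_length=10)
for shape in shapes:
    print(shape)
```

The key tensors keep the sequence axis last and the value tensors keep it second to last, so the two are transposes of each other in their final two axes.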

Supported Platforms:

Ascend GPU

Examples

>>> import numpy as np
>>> from mindspore import dtype as mstype
>>> from mindspore.parallel.nn import TransformerDecoderLayer
>>> from mindspore import Tensor
>>> model = TransformerDecoderLayer(batch_size=2, hidden_size=64, ffn_hidden_size=64, num_heads=2,
...                                 src_seq_length=20, tgt_seq_length=10)
>>> encoder_input_value = Tensor(np.ones((2, 20, 64)), mstype.float32)
>>> decoder_input_value = Tensor(np.ones((2, 10, 64)), mstype.float32)
>>> decoder_input_mask = Tensor(np.ones((2, 10, 10)), mstype.float16)
>>> memory_mask = Tensor(np.ones((2, 10, 20)), mstype.float16)
>>> output, past = model(decoder_input_value, decoder_input_mask, encoder_input_value, memory_mask)
>>> print(output.shape)
(2, 10, 64)
>>> print(past[0].shape)
(2, 2, 32, 10)
>>> print(past[1].shape)
(2, 2, 10, 32)
>>> print(past[2].shape)
(2, 2, 32, 20)
>>> print(past[3].shape)
(2, 2, 20, 32)
class mindspore.parallel.nn.Transformer(hidden_size, batch_size, ffn_hidden_size, src_seq_length, tgt_seq_length, encoder_layers=3, decoder_layers=3, num_heads=2, attention_dropout_rate=0.1, hidden_dropout_rate=0.1, hidden_act="gelu", post_layernorm_residual=False, layernorm_compute_type=mstype.float32, softmax_compute_type=mstype.float32, param_init_type=mstype.float32, lambda_func=None, use_past=False, moe_config=default_moe_config, parallel_config=default_transformer_config)[source]

Transformer module including encoder and decoder. The difference from the original implementation is that this module applies the residual addition before the layer normalization, and the default hidden activation is gelu. The details can be found in Attention Is All You Need.

Note

This is an experimental interface that is subject to change and/or deletion.

Parameters
  • batch_size (int) – The batch size of the input tensor.

  • encoder_layers (int) – The layers of the TransformerEncoderLayer.

  • decoder_layers (int) – The layers of the TransformerDecoderLayer.

  • hidden_size (int) – The hidden size of the input.

  • ffn_hidden_size (int) – The hidden size of bottleneck in the feedforward layer.

  • src_seq_length (int) – The seq_length of the encoder’s input tensor.

  • tgt_seq_length (int) – The seq_length of the decoder’s input tensor.

  • num_heads (int) – The number of the heads. Default: 2.

  • hidden_dropout_rate (float) – The dropout rate of the final output of the layer. Default: 0.1.

  • attention_dropout_rate (float) – The dropout rate of the attention scores. Default: 0.1.

  • post_layernorm_residual (bool) – Whether to do the residual addition before the layernorm. Default: False.

  • layernorm_compute_type (dtype.Number) – The computation type of the layernorm. Should be dtype.float32 or dtype.float16. Default dtype.float32.

  • softmax_compute_type (dtype.Number) – The computation type of the softmax in the attention. Should be dtype.float32 or dtype.float16. Default mstype.float32.

  • param_init_type (dtype.Number) – The parameter initialization type of the module. Should be dtype.float32 or dtype.float16. Default dtype.float32.

  • hidden_act (str) – The activation of the internal feedforward layer. Supports ‘relu’, ‘relu6’, ‘tanh’, ‘gelu’, ‘fast_gelu’, ‘elu’, ‘sigmoid’, ‘prelu’, ‘leakyrelu’, ‘hswish’, ‘hsigmoid’, ‘logsigmoid’ and so on. Default: gelu.

  • moe_config (MoEConfig) – The configuration of MoE (Mixture of Expert).

  • lambda_func – A function that determines the fusion index, pipeline stages and recompute attribute. To control the pipeline stage and gradient aggregation fusion, pass a function that accepts network, layer_id, offset, parallel_config, layers. network (Cell) is the transformer block; layer_id (int) is the layer index of the current module, counting from zero; offset (int) is the offset added to layer_id when there are other modules in the net. The default setting for the pipeline is: (layer_id + offset) // ((encoder_layers + decoder_layers) / pipeline_stage).

  • parallel_config (TransformerOpParallelConfig) – The parallel configure. Default default_transformer_config, an instance of TransformerOpParallelConfig with default args.
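
To make the default pipeline formula for lambda_func concrete, here is a minimal sketch that evaluates it in plain Python. It assumes the denominator is the total number of encoder and decoder layers divided by pipeline_stage, and that decoder layers use the encoder layer count as their offset. Both are assumptions made for illustration; the function name default_pipeline_stage is hypothetical and the library's actual implementation may differ (for example, by clamping the result).

```python
def default_pipeline_stage(layer_id, offset, encoder_layers, decoder_layers,
                           pipeline_stage):
    """Stage index from the documented default formula (illustrative only)."""
    layers_per_stage = (encoder_layers + decoder_layers) / pipeline_stage
    return int((layer_id + offset) // layers_per_stage)


# Hypothetical setup: 4 encoder layers, 4 decoder layers, 4 pipeline stages.
# Assumption: decoder layers are offset by the number of encoder layers.
encoder_stages = [default_pipeline_stage(i, 0, 4, 4, 4) for i in range(4)]
decoder_stages = [default_pipeline_stage(i, 4, 4, 4, 4) for i in range(4)]
print(encoder_stages)  # [0, 0, 1, 1]
print(decoder_stages)  # [2, 2, 3, 3]
```

With these numbers each stage holds two consecutive layers, so the encoder occupies stages 0 and 1 and the decoder occupies stages 2 and 3.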

Inputs:
  • encoder_inputs (Tensor) - the input tensor with shape [batch_size, seq_length, hidden_size] or [batch_size * seq_length, hidden_size].

  • encoder_masks (Tensor) - the attention mask for encoder with shape [batch_size, seq_length, seq_length].

  • decoder_inputs (Tensor) - the input tensor of the decoder with shape [batch_size, seq_length, hidden_size] or [batch_size * seq_length, hidden_size]. This should be None if decoder_layers is 0.

  • decoder_masks (Tensor) - the attention mask for decoder with shape [batch_size, seq_length, seq_length].

  • memory_mask (Tensor) - the memory mask of the cross attention with shape [batch, tgt_seq_length, src_seq_length], where tgt_seq_length is the length of the decoder. This should be None if decoder_layers is 0.

  • init_reset (Tensor) - A bool tensor with shape [1], used to clear the past key parameter and past value parameter used in the incremental prediction. Only valid when use_past is True. Default True.

  • batch_valid_length (Tensor) - Int32 tensor with shape [batch_size], the index that has been calculated in the past. Used for incremental prediction when use_past is True. Default None.
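
The decoder attention mask is typically a lower triangular matrix, which is what AttentionMask in this module derives from a (batch_size, seq_length) input mask of 1s and 0s. The following plain-Python sketch shows one way such a mask could be formed for a single sample. The function name causal_mask and the exact combination rule (both positions valid, and the key position not after the query position) are assumptions made for illustration, not the library's implementation.

```python
def causal_mask(valid):
    """Lower-triangular mask for one sample; valid is a list of 0/1 flags."""
    seq_length = len(valid)
    return [
        # Position i may attend to position j only if j <= i and both are valid.
        [valid[i] * valid[j] if j <= i else 0 for j in range(seq_length)]
        for i in range(seq_length)
    ]


# One sample of length 4 whose last position is padding.
mask = causal_mask([1, 1, 1, 0])
for row in mask:
    print(row)
```

Stacking one such matrix per sample gives a tensor of shape [batch_size, seq_length, seq_length], the layout expected for the attention masks listed above.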

Outputs:

Tuple, a tuple containing (output, encoder_layer_present, decoder_layer_present).

  • output (Tensor) - If there is only an encoder, the output logit of the encoder layer, with shape [batch, src_seq_length, hidden_size] or [batch * src_seq_length, hidden_size]. If there are both encoder and decoder, the output is from the decoder layer, with shape [batch, tgt_seq_length, hidden_size] or [batch * tgt_seq_length, hidden_size].

  • encoder_layer_present (Tuple) - A tuple with size of num_layers, where each element is a tuple of the projected key and value tensors in self attention with shapes ((batch_size, num_heads, size_per_head, src_seq_length), (batch_size, num_heads, src_seq_length, size_per_head)).

  • decoder_layer_present (Tuple) - A tuple with size of num_layers, where each element is a tuple of the projected key and value tensors in self attention with shapes (batch_size, num_heads, size_per_head, tgt_seq_length) and (batch_size, num_heads, tgt_seq_length, size_per_head), followed by the projected key and value tensors in cross attention with shapes (batch_size, num_heads, size_per_head, src_seq_length) and (batch_size, num_heads, src_seq_length, size_per_head). If the decoder is not set, the returned value will be None.

Supported Platforms:

Ascend GPU

Examples

>>> import numpy as np
>>> from mindspore import dtype as mstype
>>> from mindspore.parallel.nn import Transformer
>>> from mindspore import Tensor
>>> model = Transformer(batch_size=2, encoder_layers=1, decoder_layers=2, hidden_size=64, ffn_hidden_size=64,
...         src_seq_length=20, tgt_seq_length=10)
>>> encoder_input_value = Tensor(np.ones((2, 20, 64)), mstype.float32)
>>> encoder_input_mask = Tensor(np.ones((2, 20, 20)), mstype.float16)
>>> decoder_input_value = Tensor(np.ones((2, 10, 64)), mstype.float32)
>>> decoder_input_mask = Tensor(np.ones((2, 10, 10)), mstype.float16)
>>> memory_mask = Tensor(np.ones((2, 10, 20)), mstype.float16)
>>> output, en_past, de_past = model(encoder_input_value, encoder_input_mask, decoder_input_value,
...                                  decoder_input_mask, memory_mask)
>>> print(output.shape)
(2, 10, 64)
>>> print(len(en_past))
1
>>> print(len(de_past))
2
>>> print(en_past[0][0].shape)
(2, 2, 32, 20)
>>> print(en_past[0][1].shape)
(2, 2, 20, 32)
>>> print(de_past[0][0].shape)
(2, 2, 32, 10)
>>> print(de_past[0][1].shape)
(2, 2, 10, 32)
>>> print(de_past[0][2].shape)
(2, 2, 32, 20)
>>> print(de_past[0][3].shape)
(2, 2, 20, 32)
class mindspore.parallel.nn.TransformerOpParallelConfig(data_parallel=1, model_parallel=1, pipeline_stage=1, micro_batch_num=1, recompute=False, optimizer_shard=False, gradient_aggregation_group=4, vocab_emb_dp=True)[source]

TransformerOpParallelConfig, the parallel configuration for setting the global data parallel, model parallel and fusion group.

Note

Except for the recompute argument, the other arguments will not take effect when the user doesn’t set auto_parallel_context to SEMI_AUTO_PARALLEL or AUTO_PARALLEL. The micro_batch_num must be greater than or equal to pipeline_stage. The product data_parallel * model_parallel * pipeline_stage must be less than or equal to the number of devices. When the pipeline stage and optimizer_shard are set, the config will overwrite the auto_parallel_context.
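
The two numeric constraints in this note can be checked before building a config. The sketch below is a standalone check in plain Python; the function name check_parallel_settings and the device_num argument are hypothetical and are not part of the MindSpore API.

```python
def check_parallel_settings(data_parallel, model_parallel, pipeline_stage,
                            micro_batch_num, device_num):
    """Return a list of violated constraints (empty when the settings fit)."""
    problems = []
    if micro_batch_num < pipeline_stage:
        problems.append("micro_batch_num must be >= pipeline_stage")
    if data_parallel * model_parallel * pipeline_stage > device_num:
        problems.append("data_parallel * model_parallel * pipeline_stage "
                        "must be <= device_num")
    return problems


# Hypothetical 8-device job: 2-way data, 2-way model, 2 pipeline stages.
ok = check_parallel_settings(data_parallel=2, model_parallel=2,
                             pipeline_stage=2, micro_batch_num=4, device_num=8)
# Too few micro batches and too many devices requested.
bad = check_parallel_settings(data_parallel=4, model_parallel=2,
                              pipeline_stage=2, micro_batch_num=1, device_num=8)
print(ok)
print(bad)
```

Returning the list of problems rather than raising on the first one makes it easier to report every misconfiguration at once.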

Parameters
  • data_parallel (int) – The data parallel way. Default: 1.

  • model_parallel (int) – The model parallel way. Default: 1.

  • pipeline_stage (int) – The number of the pipeline stage. Should be a positive value. Default: 1.

  • micro_batch_num (int) – The micro size of the batches for the pipeline training. Default: 1.

  • optimizer_shard (bool) – Whether to enable optimizer shard. Default: False.

  • gradient_aggregation_group (int) – The fusion group size of the optimizer state sharding. Default: 4.

  • recompute (bool) – Enable recomputation of the transformer block or not. Default: False.

  • vocab_emb_dp (bool) – Shard embedding in model parallel or data parallel. Default: True.

Supported Platforms:

Ascend GPU

Examples

>>> from mindspore.parallel.nn import TransformerOpParallelConfig
>>> config = TransformerOpParallelConfig(data_parallel=1, model_parallel=1)
property dp_mp_config

Obtain the OpParallelConfig that holds the data parallel and model parallel settings.

Supported Platforms:

Ascend GPU

Examples

>>> from mindspore.parallel.nn import TransformerOpParallelConfig
>>> config = TransformerOpParallelConfig(data_parallel=1, model_parallel=1, vocab_emb_dp=True)
>>> parallel_config = config.dp_mp_config
property embedding_dp_mp_config

Obtain the EmbeddingOpParallelConfig that holds the data parallel, model parallel and embedding parallel settings.

Supported Platforms:

Ascend GPU

Examples

>>> from mindspore.parallel.nn import TransformerOpParallelConfig
>>> config = TransformerOpParallelConfig(data_parallel=1, model_parallel=1, vocab_emb_dp=True)
>>> parallel_config = config.embedding_dp_mp_config
class mindspore.parallel.nn.EmbeddingOpParallelConfig(data_parallel=1, model_parallel=1, vocab_emb_dp=True)[source]

EmbeddingOpParallelConfig for the setting data parallel or row slice for the embedding table.

Parameters
  • data_parallel (int) – The data parallel way. Default: 1.

  • model_parallel (int) – The model parallel way. Default: 1.

  • vocab_emb_dp (bool) – Shard embedding in model parallel or data parallel. Default: True.

Supported Platforms:

Ascend GPU

Examples

>>> from mindspore.parallel.nn import EmbeddingOpParallelConfig
>>> config = EmbeddingOpParallelConfig(data_parallel=1, model_parallel=1, vocab_emb_dp=True)
property dp_mp_config

Obtain the OpParallelConfig that holds the data parallel and model parallel settings.

Supported Platforms:

Ascend GPU

Examples

>>> from mindspore.parallel.nn import EmbeddingOpParallelConfig
>>> config = EmbeddingOpParallelConfig(data_parallel=1, model_parallel=1, vocab_emb_dp=True)
>>> parallel_config = config.dp_mp_config
class mindspore.parallel.nn.CrossEntropyLoss(parallel_config=default_dpmp_config)[source]

Calculate the cross entropy loss.

Parameters

parallel_config (OpParallelConfig) – The parallel configure. Default default_dpmp_config, an instance of OpParallelConfig with default args.

Inputs:
  • logits (Tensor) - Tensor of shape (N, C). Data type must be float16 or float32. The output logits of the backbone.

  • labels (Tensor) - Tensor of shape (N, ). The ground truth label of the sample.

  • input_mask (Tensor) - Tensor of shape (N, ). Indicates which inputs are padding; padded inputs are not counted in the loss.

Outputs:

Tensor. The corresponding cross entropy loss.
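
To illustrate how input_mask excludes padded positions, here is a minimal plain-Python sketch of a masked cross entropy. It averages the per-sample negative log-likelihood over the positions whose mask is 1. This is one common formulation and only an assumption about the semantics; the function name masked_cross_entropy is hypothetical, and the library's exact reduction and output shape may differ.

```python
import math


def masked_cross_entropy(logits, labels, input_mask):
    """Mean cross entropy over positions whose mask is 1 (illustrative)."""
    total = 0.0
    for row, label, mask in zip(logits, labels, input_mask):
        # Numerically stable log-sum-exp: subtract the row maximum first.
        row_max = max(row)
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        # Negative log-likelihood of the true class, zeroed for padded inputs.
        total += (log_sum_exp - row[label]) * mask
    valid = sum(input_mask)
    # Assumption: return 0.0 when every position is padding.
    return total / valid if valid > 0 else 0.0


# Two samples with four classes; the second sample is padding.
logits = [[0.0, 0.0, 0.0, 0.0], [5.0, 1.0, 1.0, 1.0]]
labels = [2, 0]
input_mask = [1, 0]
loss = masked_cross_entropy(logits, labels, input_mask)
print(loss)
```

Because the first row has uniform logits over four classes and the second row is masked out, the result equals log(4).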

Examples

>>> import numpy as np
>>> from mindspore import dtype as mstype
>>> from mindspore.parallel.nn import CrossEntropyLoss
>>> from mindspore import Tensor
>>> loss = CrossEntropyLoss()
>>> logits = Tensor(np.array([[3, 5, 6, 9, 12, 33, 42, 12, 32, 72]]), mstype.float32)
>>> labels_np = np.array([1]).astype(np.int32)
>>> input_mask = Tensor(np.ones(1).astype(np.float32))
>>> labels = Tensor(labels_np)
>>> output = loss(logits, labels, input_mask)
>>> print(output.shape)
(1,)
class mindspore.parallel.nn.OpParallelConfig(data_parallel=1, model_parallel=1)[source]

OpParallelConfig for the setting data parallel and model parallel.

Parameters
  • data_parallel (int) – The data parallel way. Default: 1

  • model_parallel (int) – The model parallel way. Default: 1

Supported Platforms:

Ascend GPU

Examples

>>> from mindspore.parallel.nn import OpParallelConfig
>>> config = OpParallelConfig(data_parallel=1, model_parallel=1)
class mindspore.parallel.nn.FixedSparseAttention(batch_size, num_heads, size_per_head, block_size, seq_length=1024, num_different_global_patterns=4, parallel_config=default_dpmp_config)[source]

Fixed Sparse Attention Layer

This function contains the sparse attention primitives used in Sparse Transformers (see the paper: https://arxiv.org/abs/1904.10509). Specifically, it includes the following:

  1. A faster implementation of normal attention (the upper triangle is not computed, and many operations are fused).

  2. An implementation of “strided” and “fixed” attention, as in the Sparse Transformers paper.

Parameters
  • batch_size (int) – Number of input batch size.

  • num_heads (int) – Number of attention heads.

  • block_size (int) – An integer determining the block size. The current implementation of sparse self-attention is based on blocked sparse matrices, and this parameter defines the size of such blocks (block_size x block_size). Only 64 is supported for now.

  • seq_length (int) – The length of the input sequence. Only 1024 is supported for now.

  • num_different_global_patterns (int) – An integer determining the number of different global attention layouts. While global attention can be fixed by which block(s) are representative of any local window, since there are multiple heads, each head can use a different global representative. Only 4 is supported for now.

  • size_per_head (int) – An integer determining the embedding size of each attention head. Only 64 and 128 are supported for now.
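
The limits listed for these parameters can be verified up front. The sketch below is a standalone check in plain Python; the function name check_fixed_sparse_attention_args is hypothetical and not part of mindspore.parallel.nn. It also assumes hidden_size = num_heads * size_per_head, which is consistent with the example in this section but is not stated explicitly.

```python
def check_fixed_sparse_attention_args(num_heads, size_per_head, block_size,
                                      seq_length=1024,
                                      num_different_global_patterns=4):
    """Validate the documented limits; return (num_blocks, hidden_size)."""
    if block_size != 64:
        raise ValueError("block_size only supports 64 for now")
    if seq_length != 1024:
        raise ValueError("seq_length only supports 1024 for now")
    if num_different_global_patterns != 4:
        raise ValueError("num_different_global_patterns only supports 4 for now")
    if size_per_head not in (64, 128):
        raise ValueError("size_per_head only supports 64 and 128 for now")
    # Number of blocks along the sequence axis.
    num_blocks = seq_length // block_size
    # Assumption: the hidden dimension is split evenly across the heads.
    hidden_size = num_heads * size_per_head
    return num_blocks, hidden_size


num_blocks, hidden_size = check_fixed_sparse_attention_args(
    num_heads=8, size_per_head=64, block_size=64)
print(num_blocks, hidden_size)  # 16 512
```

With 8 heads of size 64, the expected last dimension of q, k and v is 512, matching the tensors built in the example.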

Inputs:
  • q (Tensor) - The query tensor (mstype.float16, [batch_size, seq_length, hidden_size]): the sequence of queries to query the context.

  • k (Tensor) - The key tensor (mstype.float16, [batch_size, seq_length, hidden_size]): the sequence of keys.

  • v (Tensor) - The value tensor (mstype.float16, [batch_size, seq_length, hidden_size]): the sequence of values.

  • attention_mask (Tensor) - The float mask tensor (mstype.float32 or mstype.float16, [batch_size, seq_length, seq_length]): a lower triangular matrix to pass masked information.

Outputs:

A Tensor. The output of the attention with shape [batch_size, seq_length, hidden_size].

Supported Platforms:

Ascend

Examples

>>> import numpy as np
>>> from mindspore import dtype as mstype
>>> from mindspore.parallel.nn import FixedSparseAttention
>>> from mindspore import Tensor
>>> model = FixedSparseAttention(batch_size=2,
...                              num_heads=8,
...                              size_per_head=64,
...                              block_size=64)
>>> q = Tensor(np.ones((2, 1024, 8*64)), mstype.float16)
>>> k = Tensor(np.ones((2, 1024, 8*64)), mstype.float16)
>>> v = Tensor(np.ones((2, 1024, 8*64)), mstype.float16)
>>> attention_mask = Tensor(np.ones((2, 1024, 1024)), mstype.float32)
>>> output = model(q, k, v, attention_mask)
>>> print(output.shape)
(2, 1024, 512)
class mindspore.parallel.nn.MoEConfig(expert_num=1, capacity_factor=1.1, aux_loss_factor=0.05, num_experts_chosen=1, noisy_policy=None, noisy_epsilon=1e-2)[source]

The configuration of MoE (Mixture of Expert).

Parameters
  • expert_num (int) – The number of experts employed. Default: 1.

  • capacity_factor (float) – The factor is used to indicate how much to expand expert capacity, which is >=1.0. Default: 1.1.

  • aux_loss_factor (float) – The factor indicating how much of the load balance loss (produced by the router) is added to the entire model loss, which is < 1.0. Default: 0.05.

  • num_experts_chosen (int) – The number of experts chosen by each token. Default: 1.

  • noisy_policy (string) – The noisy policy used in routing tokens to experts. Default: None.

  • noisy_epsilon (float) – The parameter used when adding noise in routing tokens to experts. Default: 1e-2.
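
To give a feel for what capacity_factor controls, the sketch below computes a per-expert token capacity using the definition common in Mixture-of-Experts literature: the tokens per expert under an even split, scaled by the capacity factor. This is an assumption for illustration only; the function name expert_capacity and the tokens_per_group argument are hypothetical, and the library may compute capacity differently (for example, by also accounting for num_experts_chosen).

```python
import math


def expert_capacity(tokens_per_group, expert_num, capacity_factor=1.1):
    """Tokens each expert can accept under an even split (illustrative)."""
    if capacity_factor < 1.0:
        raise ValueError("capacity_factor must be >= 1.0")
    # Even share per expert, expanded by the capacity factor and rounded up.
    return math.ceil(tokens_per_group / expert_num * capacity_factor)


# Hypothetical workload: 1024 tokens routed across 8 experts.
capacity = expert_capacity(tokens_per_group=1024, expert_num=8)
print(capacity)  # 141
```

A factor of 1.0 leaves no slack beyond the even share of 128 tokens, while the default 1.1 allows each expert to take up to 141 tokens here.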