mindspore.nn.transformer

Note

Transformer Networks. This is an experimental interface that is subject to change or deletion.

class mindspore.nn.transformer.AttentionMask(seq_length, parallel_config=default_dpmp_config)

Get the lower triangular attention mask matrix from the input mask. The input mask is a 2D tensor of shape (batch_size, seq_length) containing 1s and 0s, where 1 indicates that the position holds a valid token and 0 indicates that it does not.

Parameters
  • seq_length (int) – The sequence length of the input tensor.

  • parallel_config (OpParallelConfig) – The parallel configuration. Default: default_dpmp_config, an instance of OpParallelConfig with default args.

Inputs:
  • input_mask (Tensor) - The mask indicating whether each position is a valid input, with shape (batch_size, seq_length).

Outputs:

Tensor. The attention mask matrix with shape (batch_size, seq_length, seq_length).

Raises
  • TypeError – seq_length is not an integer.

  • ValueError – seq_length is not a positive value.

  • TypeError – parallel_config is not a subclass of OpParallelConfig.

Supported Platforms:

Ascend GPU

Examples

>>> import numpy as np
>>> from mindspore.nn.transformer import AttentionMask
>>> from mindspore import Tensor
>>> mask = AttentionMask(seq_length=4)
>>> mask_array = np.array([[1, 1, 1, 0]], np.float32)
>>> inputs = Tensor(mask_array)
>>> res = mask(inputs)
>>> print(res)
[[[1. 0. 0. 0.]
  [1. 1. 0. 0.]
  [1. 1. 1. 0.]
  [0. 0. 0. 0.]]]
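
The following pure-Python sketch illustrates how such a mask can be derived. It assumes the result is the outer product of the input mask with itself, restricted to the lower triangle; this reproduces the output printed above but is not taken from the library's implementation. The helper name attention_mask_sketch is hypothetical.

```python
def attention_mask_sketch(input_mask):
    """Build a (batch, seq, seq) lower triangular mask from a (batch, seq) 0/1 mask."""
    result = []
    for row in input_mask:
        seq_length = len(row)
        matrix = []
        for i in range(seq_length):
            # Position i may attend to position j only if both are valid tokens
            # and j is not in the future (j <= i). This rule is an assumption.
            matrix.append([float(row[i] * row[j]) if j <= i else 0.0
                           for j in range(seq_length)])
        result.append(matrix)
    return result


mask = attention_mask_sketch([[1, 1, 1, 0]])
for line in mask[0]:
    print(line)
```

The last row is all zeros because the fourth position is padding, so it attends to nothing.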

class mindspore.nn.transformer.CrossEntropyLoss(parallel_config=default_dpmp_config)

Calculate the cross entropy loss.

Parameters
  • parallel_config (OpParallelConfig) – The parallel configuration. Default: default_dpmp_config, an instance of OpParallelConfig with default args.

Inputs:
  • logits (Tensor) - Tensor of shape (N, C). Data type must be float16 or float32. The output logits of the backbone.

  • labels (Tensor) - Tensor of shape (N, ). The ground truth label of the sample.

  • input_mask (Tensor) - Tensor of shape (N, ). Indicates which inputs are padding; padded inputs are not counted in the loss.

Outputs:

Tensor. The corresponding cross entropy loss.

Examples

>>> import numpy as np
>>> from mindspore import dtype as mstype
>>> from mindspore.nn.transformer import CrossEntropyLoss
>>> from mindspore import Tensor
>>> loss = CrossEntropyLoss()
>>> logits = Tensor(np.array([[3, 5, 6, 9, 12, 33, 42, 12, 32, 72]]), mstype.float32)
>>> labels_np = np.array([1]).astype(np.int32)
>>> input_mask = Tensor(np.ones(1).astype(np.float32))
>>> labels = Tensor(labels_np)
>>> output = loss(logits, labels, input_mask)
>>> print(output.shape)
(1,)
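
To make the role of input_mask concrete, here is a minimal pure-Python sketch of a masked cross entropy. It assumes the loss is the mask-weighted sum of per-sample negative log-likelihoods divided by the number of unmasked samples; the eps term and the helper name masked_cross_entropy are assumptions for illustration, not the library's code.

```python
import math


def masked_cross_entropy(logits, labels, input_mask, eps=1e-5):
    """Average the per-sample negative log-likelihood over unmasked samples."""
    total = 0.0
    for row, label, keep in zip(logits, labels, input_mask):
        # Numerically stable log-sum-exp: subtract the row maximum first.
        row_max = max(row)
        log_sum = row_max + math.log(sum(math.exp(v - row_max) for v in row))
        # keep is 1.0 for valid samples and 0.0 for padded ones.
        total += keep * (log_sum - row[label])
    # eps (an assumption) avoids division by zero when every sample is padded.
    return total / (sum(input_mask) + eps)


logits = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]]
labels = [0, 2]
loss_all = masked_cross_entropy(logits, labels, [1.0, 1.0])
loss_first = masked_cross_entropy(logits, labels, [1.0, 0.0])
print(loss_all, loss_first)
```

With the second sample masked out, only the first sample contributes, so the badly predicted second sample no longer raises the loss.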

class mindspore.nn.transformer.EmbeddingOpParallelConfig(data_parallel=1, model_parallel=1, vocab_emb_dp=True)

The parallel configuration of VocabEmbedding, used to set data parallel or model parallel for the embedding table.

Parameters
  • data_parallel (int) – The data parallel way. The input data will be sliced into n parts for embedding layer according to this value. Default: 1.

  • model_parallel (int) – The model parallel way. The embedding table parameters will be sliced at 0-th axis according to the model parallel way. Default: 1.

  • vocab_emb_dp (bool) – Whether to shard the embedding in model parallel or data parallel. If True, the embedding lookup is trained in data parallel style and the model_parallel value is ignored. If False, the embedding table is sharded into n parts along its 0-th dimension (row slicing), where n is the model parallel way. Default: True.

Supported Platforms:

Ascend GPU

Examples

>>> from mindspore.nn.transformer import EmbeddingOpParallelConfig
>>> config = EmbeddingOpParallelConfig(data_parallel=1, model_parallel=1, vocab_emb_dp=True)
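
The row slicing described for vocab_emb_dp=False can be pictured with the sketch below. It is pure Python and only computes which rows of the table each shard would hold; the even-divisibility check and the helper name shard_vocab_rows are assumptions made for illustration.

```python
def shard_vocab_rows(vocab_size, model_parallel):
    """Return the (start, end) row range of the embedding table held by each shard."""
    # Assumption: the table is split evenly, so vocab_size must be divisible.
    if vocab_size % model_parallel != 0:
        raise ValueError("vocab_size must be divisible by model_parallel")
    rows_per_shard = vocab_size // model_parallel
    return [(rank * rows_per_shard, (rank + 1) * rows_per_shard)
            for rank in range(model_parallel)]


print(shard_vocab_rows(vocab_size=8, model_parallel=4))
```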

class mindspore.nn.transformer.FeedForward(hidden_size, ffn_hidden_size, dropout_rate, hidden_act='gelu', expert_num=1, expert_group_size=None, param_init_type=mstype.float32, parallel_config=default_dpmp_config)

A multilayer perceptron with two linear layers, with dropout applied to the final output. The first linear layer projects the input dimension from hidden_size to ffn_hidden_size. The second linear layer projects the dimension from ffn_hidden_size back to hidden_size. The first linear layer is sharded on the relative dimension, and the second linear layer is sharded on the output dimension. The overall process can be written as:

\[Dropout((xW_1+b_1)W_2 + b_2)\]

where \(W_1, W_2, b_1\) and \(b_2\) are trainable parameters.

Parameters
  • hidden_size (int) – The dimension of the inputs.

  • ffn_hidden_size (int) – The intermediate hidden size.

  • dropout_rate (float) – The dropout rate for the second linear’s output.

  • hidden_act (str, nn.Cell) – The activation of the internal feedforward layer. Supports ‘relu’, ‘relu6’, ‘tanh’, ‘gelu’, ‘fast_gelu’, ‘elu’, ‘sigmoid’, ‘prelu’, ‘leakyrelu’, ‘hswish’, ‘hsigmoid’, ‘logsigmoid’ and so on. The user can also pass a custom activation. To run the net in parallel mode, the custom activation must also provide the activation_shard function. Please see the examples. Default: gelu.

  • expert_num (int) – The number of experts used in Linear. When expert_num > 1, BatchMatMul is used and the first dimension in BatchMatMul indicates expert_num. Default: 1.

  • expert_group_size (int) – The number of tokens in each data parallel group. Default: None. This parameter is effective only when in AUTO_PARALLEL mode, and NOT SHARDING_PROPAGATION.

  • param_init_type (dtype.Number) – The parameter initialization type. Should be mstype.float32 or mstype.float16. Default: mstype.float32.

  • parallel_config (OpParallelConfig, MoEParallelConfig) – The parallel configuration, see OpParallelConfig or MoEParallelConfig. When MoE is applied, MoEParallelConfig is effective, otherwise OpParallelConfig is effective. Default: default_dpmp_config, an instance of OpParallelConfig with default args.

Inputs:
  • x (Tensor) - Float tensor with shape [batch, seq_length, hidden_size] or [batch * seq_length, hidden_size].

Outputs:

Tensor, the output of this layer after mapping. The shape is [batch, seq_length, hidden_size] or [batch * seq_length, hidden_size].

Raises
  • TypeError – hidden_act is not a string or nn.Cell.

  • TypeError – parallel_config is not a subclass of OpParallelConfig.

  • ValueError – ffn_hidden_size is not a multiple of the model parallel way.

  • ValueError – hidden_size is not a multiple of the model parallel way.

Supported Platforms:

Ascend GPU

Examples

>>> import numpy as np
>>> from mindspore.nn.transformer import FeedForward
>>> from mindspore import dtype as mstype
>>> from mindspore import Tensor, nn
>>> import mindspore.ops as ops
>>> model = FeedForward(hidden_size=15, ffn_hidden_size=30, dropout_rate=0.1)
>>> tensor = Tensor(np.ones((2, 20, 15)), mstype.float32)
>>> output = model(tensor)
>>> print(output.shape)
(2, 20, 15)
>>> # Example 2 using custom hidden activation
>>> class MyActivationNoShard(nn.Cell):
...     def __init__(self):
...         super(MyActivationNoShard, self).__init__()
...         self.add = ops.Add()
...     def construct(self, x):
...         return self.add(x, 0.1)
>>> model = FeedForward(hidden_size=15, ffn_hidden_size=30, dropout_rate=0.1,
...                     hidden_act=MyActivationNoShard)
>>> tensor = Tensor(np.ones((2, 20, 15)), mstype.float32)
>>> output = model(tensor)
>>> print(output.shape)
(2, 20, 15)
>>> # Example 3 using custom hidden activation with activation_shard
>>> # If the user wants to run in SEMI/AUTO parallel mode, the custom activation must provide
>>> # a class function named activation_shard. It accepts the argument parallel_config (OpParallelConfig,
>>> # MoEParallelConfig) and sets the shard for the primitives used in construct.
>>> class MyActivationWithShard(nn.Cell):
...     def __init__(self):
...         super(MyActivationWithShard, self).__init__()
...         self.add = ops.Add()
...     def construct(self, x):
...         return self.add(x, 0.1)
...     def activation_shard(self, parallel_config):
...         self.add.shard(((parallel_config.data_parallel, parallel_config.model_parallel), ()))
...
>>> model = FeedForward(hidden_size=15, ffn_hidden_size=30, dropout_rate=0.1,
...                     hidden_act=MyActivationWithShard)
>>> tensor = Tensor(np.ones((2, 20, 15)), mstype.float32)
>>> output = model(tensor)
>>> print(output.shape)
(2, 20, 15)
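
The two projections can also be traced by hand with the pure-Python sketch below. It omits dropout (as in inference) and assumes the activation named by hidden_act is applied between the two linear layers, which the formula above does not spell out. All helper names and the constant weights are hypothetical.

```python
import math


def gelu(v):
    # tanh approximation of GELU
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))


def linear(x, weight, bias):
    """Compute x @ weight + bias for a single input vector x."""
    return [sum(xi * weight[i][j] for i, xi in enumerate(x)) + bias[j]
            for j in range(len(bias))]


def feed_forward(x, w1, b1, w2, b2, activation=gelu):
    # hidden_size -> ffn_hidden_size, then the activation (assumed placement)
    hidden = [activation(h) for h in linear(x, w1, b1)]
    # ffn_hidden_size -> hidden_size
    return linear(hidden, w2, b2)


hidden_size, ffn_hidden_size = 3, 6
w1 = [[0.1] * ffn_hidden_size for _ in range(hidden_size)]
b1 = [0.0] * ffn_hidden_size
w2 = [[0.1] * hidden_size for _ in range(ffn_hidden_size)]
b2 = [0.0] * hidden_size
out = feed_forward([1.0, 1.0, 1.0], w1, b1, w2, b2)
print(len(out))
```

The output has the same last dimension as the input, matching the shapes printed in the examples above.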

class mindspore.nn.transformer.FixedSparseAttention(batch_size, num_heads, size_per_head, block_size, seq_length=1024, num_different_global_patterns=4, parallel_config=default_dpmp_config)

Fixed Sparse Attention Layer.

This layer contains the sparse attention primitives used in Sparse Transformers (see the paper Generating Long Sequences with Sparse Transformers).

Specifically, it includes the following:

  1. A faster implementation of normal attention (the upper triangle is not computed, and many operations are fused).

  2. An implementation of “strided” and “fixed” attention, as in the Sparse Transformers paper.

Parameters
  • batch_size (int) – The batch size of the input.

  • num_heads (int) – Number of attention heads.

  • size_per_head (int) – An integer determining embedding size of each attention head, only supports 64, 128 for now.

  • block_size (int) – An integer determining the block size. The current implementation of sparse self-attention is based on blocked sparse matrices, and this parameter defines the size of each block (block_size x block_size). Only supports 64 for now.

  • seq_length (int) – The length of the input sequence. Only supports 1024 for now. Default: 1024.

  • num_different_global_patterns (int) – An integer determining the number of different global attention layouts. Global attention is fixed by which block or blocks are representative of any local window; since there are multiple heads, each head can use a different global representative. Only supports 4 for now. Default: 4.

  • parallel_config (OpParallelConfig) – The parallel configuration, see OpParallelConfig. Default: default_dpmp_config, an instance of OpParallelConfig with default args.

Inputs:
  • q (Tensor) - The query tensor (mstype.float16, shape [batch_size, seq_length, hidden_size]): sequence of queries to query the context.

  • k (Tensor) - The key tensor (mstype.float16, shape [batch_size, seq_length, hidden_size]): sequence of keys.

  • v (Tensor) - The value tensor (mstype.float16, shape [batch_size, seq_length, hidden_size]): sequence of values.

  • attention_mask (Tensor) - The float mask tensor (mstype.float32 or mstype.float16, shape [batch_size, seq_length, seq_length]): lower triangular matrix to pass masked information.

Outputs:

A Tensor. The output of the attention with shape [batch_size, seq_length, hidden_size].

Supported Platforms:

Ascend

Examples

>>> import numpy as np
>>> from mindspore import dtype as mstype
>>> from mindspore.nn.transformer import FixedSparseAttention
>>> from mindspore import Tensor
>>> model = FixedSparseAttention(batch_size=2,
...                              num_heads=8,
...                              size_per_head=64,
...                              block_size=64)
>>> q = Tensor(np.ones((2, 1024, 8*64)), mstype.float16)
>>> k = Tensor(np.ones((2, 1024, 8*64)), mstype.float16)
>>> v = Tensor(np.ones((2, 1024, 8*64)), mstype.float16)
>>> attention_mask = Tensor(np.ones((2, 1024, 1024)), mstype.float32)
>>> output = model(q, k, v, attention_mask)
>>> print(output.shape)
(2, 1024, 512)
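
Because several arguments accept only one or two values for now, a small pre-check can catch mistakes before the layer is built. The sketch below encodes only the limits listed under Parameters and the relation hidden_size = num_heads * size_per_head used in the example above; the function name check_fixed_sparse_args is hypothetical and is not part of the library.

```python
def check_fixed_sparse_args(num_heads, size_per_head, block_size, seq_length=1024,
                            num_different_global_patterns=4):
    """Validate the documented limits and return (hidden_size, blocks_per_row)."""
    if size_per_head not in (64, 128):
        raise ValueError("size_per_head must be 64 or 128")
    if block_size != 64:
        raise ValueError("block_size must be 64")
    if seq_length != 1024:
        raise ValueError("seq_length must be 1024")
    if num_different_global_patterns != 4:
        raise ValueError("num_different_global_patterns must be 4")
    # Last dimension of q, k and v in the example above.
    hidden_size = num_heads * size_per_head
    # Number of block_size-wide blocks along one side of the attention matrix.
    blocks_per_row = seq_length // block_size
    return hidden_size, blocks_per_row


print(check_fixed_sparse_args(num_heads=8, size_per_head=64, block_size=64))
```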

class mindspore.nn.transformer.MoEConfig(expert_num=1, capacity_factor=1.1, aux_loss_factor=0.05, num_experts_chosen=1, expert_group_size=None, group_wise_a2a=False, comp_comm_parallel=False, comp_comm_parallel_degree=2)

The configuration of MoE (Mixture of Expert).

Parameters
  • expert_num (int) – The number of experts employed. Default: 1.

  • capacity_factor (float) – The factor by which the expert capacity is expanded. It must be >= 1.0. Default: 1.1.

  • aux_loss_factor (float) – The factor that determines how much of the load balance loss (produced by the router) is added to the entire model loss. It must be < 1.0. Default: 0.05.

  • num_experts_chosen (int) – The number of experts chosen by each token. It should not be larger than expert_num. Default: 1.

  • expert_group_size (int) – The number of tokens in each data parallel group. Default: None. This parameter is effective only in AUTO_PARALLEL mode, and not in SHARDING_PROPAGATION.

  • group_wise_a2a (bool) – Whether to enable group-wise alltoall communication, which can reduce communication time by converting part of the inter communication into intra communication. Default: False. This parameter is effective only when model parallel > 1 and data_parallel is equal to expert parallel.

  • comp_comm_parallel (bool) – Whether to run ffn compute and communication in parallel, which can reduce pure communication time by splitting and overlapping compute and communication. Default: False.

  • comp_comm_parallel_degree (int) – The number of splits of compute and communication. The larger the number, the more overlap there is, at the cost of more memory. Default: 2. This parameter is effective only when comp_comm_parallel is enabled.

Supported Platforms:

Ascend GPU

Examples

>>> from mindspore.nn.transformer import MoEConfig
>>> moe_config = MoEConfig(expert_num=4, capacity_factor=5.0, aux_loss_factor=0.05, num_experts_chosen=1,
...                        expert_group_size=64, group_wise_a2a=True, comp_comm_parallel=False,
...                        comp_comm_parallel_degree=2)
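
To give a feel for how capacity_factor, expert_num and num_experts_chosen interact, the sketch below computes an expert capacity using the convention common in Mixture of Experts literature: each expert's even share of the tokens in a group, scaled by the capacity factor and rounded up. Whether the library uses exactly this formula is an assumption, and the helper name expert_capacity is hypothetical.

```python
import math


def expert_capacity(tokens_per_group, expert_num, capacity_factor=1.1, num_experts_chosen=1):
    """Maximum number of tokens one expert accepts from a group (illustrative only)."""
    if num_experts_chosen > expert_num:
        raise ValueError("num_experts_chosen should not be larger than expert_num")
    if capacity_factor < 1.0:
        raise ValueError("capacity_factor should be >= 1.0")
    # Tokens each expert would receive if routing were perfectly balanced.
    even_share = tokens_per_group * num_experts_chosen / expert_num
    # Extra headroom for imbalance; the formula itself is an assumption.
    return math.ceil(even_share * capacity_factor)


print(expert_capacity(tokens_per_group=64, expert_num=4, capacity_factor=5.0))
```

A larger capacity_factor leaves more headroom for unbalanced routing, which is why it must be at least 1.0.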

class mindspore.nn.transformer.MultiHeadAttention(batch_size, src_seq_length, tgt_seq_length, hidden_size, num_heads, hidden_dropout_rate=0.1, attention_dropout_rate=0.1, compute_dtype=mstype.float16, softmax_compute_type=mstype.float32, param_init_type=mstype.float32, use_past=False, parallel_config=default_dpmp_config)

This is an implementation of multi-head attention from the paper Attention is all you need. Given the query vector with source length, and the key and value vectors with target length, the attention is computed as follows:

\[MultiHeadAttention(query, key, value) = Concat(head_1, \dots, head_h)W^O\]

where \(head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)\). The default is with a bias.

If the query, key and value tensors are the same, this is self-attention.

Parameters
  • batch_size (int) – The batch size of the input tensor when doing incremental prediction. Should be a positive value. When doing training or prediction, the argument has no effect and the user can pass None.

  • src_seq_length (int) – The sequence length of the query vector.

  • tgt_seq_length (int) – The sequence length of the key and value vector.

  • hidden_size (int) – The hidden size of the input.

  • num_heads (int) – The number of the heads.

  • hidden_dropout_rate (float) – The dropout rate of the final output of the layer. Default: 0.1.

  • attention_dropout_rate (float) – The dropout rate of the attention scores. Default: 0.1.

  • compute_dtype (dtype.Number) – The computation type of dense. Default mstype.float16. Should be mstype.float32 or mstype.float16.

  • softmax_compute_type (dtype.Number) – The type of softmax computation module. Default mstype.float32. Should be mstype.float32 or mstype.float16.

  • param_init_type (dtype.Number) – The parameter initialization type of the module. Default mstype.float32. Should be mstype.float32 or mstype.float16.

  • use_past (bool) – Use the past state to compute, used for incremental prediction. For example, if we have two words and want to generate ten more words, we only need to compute the state of the two words once and then generate the next words one by one. When use_past is True, the prediction runs in two steps. In the first step, set is_first_iteration to True by model.add_flags_recursive(is_first_iteration=True) and pass the full inputs. Then set is_first_iteration to False by model.add_flags_recursive(is_first_iteration=False), pass the input tensor of a single step, and loop. Default: False.

  • parallel_config (OpParallelConfig) – The parallel configuration. Default: default_dpmp_config, an instance of OpParallelConfig with default args.

Inputs:
  • query_tensor (Tensor) - The query vector with shape (batch_size, src_seq_length, hidden_size) or (batch_size * src_seq_length, hidden_size) if use_past is False or is_first_iteration=True. Otherwise, it must be (batch_size, 1, hidden_size).

  • key_tensor (Tensor) - The key vector with shape (batch_size, tgt_seq_length, hidden_size) or (batch_size * tgt_seq_length, hidden_size) if use_past is False or is_first_iteration=True. Otherwise, it must be (batch_size, 1, hidden_size).

  • value_tensor (Tensor) - The value vector with shape (batch_size, tgt_seq_length, hidden_size) or (batch_size * tgt_seq_length, hidden_size) if use_past is False or is_first_iteration=True. Otherwise, it must be (batch_size, 1, hidden_size).

  • attention_mask (Tensor) - If use_past is False or is_first_iteration=True, the attention mask matrix should be (batch_size, src_seq_length, tgt_seq_length), or None. None means there will be no mask in the softmax computation. Otherwise, the mask must be (batch_size, 1, tgt_seq_length).

  • key_past (Tensor) - Float16 tensor with shape (batch_size, num_heads, size_per_head, tgt_seq_length). The past calculated key vector. Used for incremental prediction when use_past is True. Default: None.

  • value_past (Tensor) - Float16 tensor with shape (batch_size, num_heads, tgt_seq_length, size_per_head). The past calculated value vector. Used for incremental prediction when use_past is True. Default: None.

  • batch_valid_length (Tensor) - Int32 tensor with shape (batch_size,). The past calculated index. Used for incremental prediction when use_past is True. Default: None.

Outputs:

Tuple, a tuple containing (output, layer_present).

  • output (Tensor) - The float tensor of the output of the layer, with shape (batch_size, src_seq_length, hidden_size) or (batch_size * src_seq_length, hidden_size) if use_past is False or is_first_iteration=True. Otherwise, it will be (batch_size, 1, hidden_size).

  • layer_present (Tuple) - A tuple of the tensors of the projected key and value vectors, with shapes ((batch_size, num_heads, size_per_head, tgt_seq_length), (batch_size, num_heads, tgt_seq_length, size_per_head)).

Supported Platforms:

Ascend GPU

Examples

>>> import numpy as np
>>> from mindspore.nn.transformer import MultiHeadAttention
>>> from mindspore import dtype as mstype
>>> from mindspore import Tensor
>>> model = MultiHeadAttention(batch_size=None, hidden_size=15, src_seq_length=20, tgt_seq_length=20,
...                            num_heads=3)
>>> from_tensor = Tensor(np.ones((2, 20, 15)), mstype.float32)
>>> to_tensor = Tensor(np.ones((2, 20, 15)), mstype.float16)
>>> attention_mask = Tensor(np.ones((2, 20, 20)), mstype.float16)
>>> attn_out, past = model(from_tensor, to_tensor, to_tensor, attention_mask)
>>> print(attn_out.shape)
(2, 20, 15)
>>> print(past[0].shape)
(2, 3, 5, 20)
>>> print(past[1].shape)
(2, 3, 20, 5)
>>> # When using use_past=True, the incremental prediction takes two steps.
>>> # Step 1: set is_first_iteration=True, and input the full sequence length's state.
>>> # We first need to prepare the memory parameters for saving the key and value states.
>>> model = MultiHeadAttention(batch_size=2, hidden_size=15, src_seq_length=20, tgt_seq_length=20,
...                            num_heads=3, use_past=True)
>>> key_past = Tensor(np.zeros(shape=(2, 3, 5, 20)), mstype.float16)
>>> value_past = Tensor(np.zeros(shape=(2, 3, 20, 5)), mstype.float16)
>>> batch_valid_length = Tensor(np.ones((2,)), mstype.int32)
>>> # Set is_first_iteration=True to generate the full memory states
>>> model.add_flags_recursive(is_first_iteration=True)
>>> attn_out, past = model(from_tensor, to_tensor, to_tensor, attention_mask, key_past, value_past,
...                        batch_valid_length)
>>> print(attn_out.shape)
(2, 20, 15)
>>> print(past[0].shape)
(2, 3, 5, 20)
>>> print(past[1].shape)
(2, 3, 20, 5)
>>> from_tensor = Tensor(np.ones((2, 1, 15)), mstype.float32)
>>> to_tensor = Tensor(np.ones((2, 1, 15)), mstype.float16)
>>> attention_mask = Tensor(np.ones((2, 1, 20)), mstype.float16)
>>> # Step 2: set is_first_iteration=False, and pass the single word to run the prediction rather than the
>>> # full sequence.
>>> model.add_flags_recursive(is_first_iteration=False)
>>> attn_out, past = model(from_tensor, to_tensor, to_tensor, attention_mask, key_past, value_past,
...                        batch_valid_length)
>>> print(attn_out.shape)
(2, 1, 15)
>>> print(past[0].shape)
(2, 3, 5, 20)
>>> print(past[1].shape)
(2, 3, 20, 5)
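
The cache shapes used in the example can be derived from the constructor arguments. The sketch below assumes size_per_head = hidden_size // num_heads, which is consistent with the shapes printed above (15 // 3 = 5). The helper name attention_cache_shapes and the divisibility check are assumptions for illustration.

```python
def attention_cache_shapes(batch_size, hidden_size, num_heads, tgt_seq_length):
    """Return the documented (key_past, value_past) shapes for incremental prediction."""
    # Assumption: the hidden size is split evenly across the heads.
    if hidden_size % num_heads != 0:
        raise ValueError("hidden_size must be divisible by num_heads")
    size_per_head = hidden_size // num_heads
    # The key cache keeps the sequence axis last; the value cache keeps it third.
    key_shape = (batch_size, num_heads, size_per_head, tgt_seq_length)
    value_shape = (batch_size, num_heads, tgt_seq_length, size_per_head)
    return key_shape, value_shape


print(attention_cache_shapes(batch_size=2, hidden_size=15, num_heads=3, tgt_seq_length=20))
```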

class mindspore.nn.transformer.OpParallelConfig(data_parallel=1, model_parallel=1)

OpParallelConfig for setting data parallel and model parallel.

Parameters
  • data_parallel (int) – The data parallel way. Default: 1

  • model_parallel (int) – The model parallel way. Default: 1

Supported Platforms:

Ascend GPU

Examples

>>> from mindspore.nn.transformer import OpParallelConfig
>>> config = OpParallelConfig(data_parallel=1, model_parallel=1)

class mindspore.nn.transformer.Transformer(hidden_size, batch_size, ffn_hidden_size, src_seq_length, tgt_seq_length, encoder_layers=3, decoder_layers=3, num_heads=2, attention_dropout_rate=0.1, hidden_dropout_rate=0.1, hidden_act='gelu', post_layernorm_residual=False, layernorm_compute_type=mstype.float32, softmax_compute_type=mstype.float32, param_init_type=mstype.float32, lambda_func=None, use_past=False, moe_config=default_moe_config, parallel_config=default_transformer_config)

Transformer module including encoder and decoder. The difference from the original implementation is that the module uses the residual addition before the layer normalization, and the default hidden activation is gelu. The details can be found in Attention is all you need.

Note

This is an experimental interface that is subject to change or deletion.

Parameters
  • hidden_size (int) – The hidden size of the input.

  • batch_size (int) – The batch size of the input tensor when doing incremental prediction. Should be a positive value. When doing training or prediction, the argument has no effect and the user can pass None.

  • ffn_hidden_size (int) – The hidden size of bottleneck in the feedforward layer.

  • src_seq_length (int) – The seq_length of the encoder’s input tensor.

  • tgt_seq_length (int) – The seq_length of the decoder’s input tensor.

  • encoder_layers (int) – The layers of the TransformerEncoderLayer. Default 3.

  • decoder_layers (int) – The layers of the TransformerDecoderLayer. Default 3.

  • num_heads (int) – The number of the heads. Default: 2.

  • attention_dropout_rate (float) – The dropout rate of the attention scores. Default: 0.1.

  • hidden_dropout_rate (float) – The dropout rate of the final output of the layer. Default: 0.1.

  • hidden_act (str, nn.Cell) – The activation of the internal feedforward layer. Supports ‘relu’, ‘relu6’, ‘tanh’, ‘gelu’, ‘fast_gelu’, ‘elu’, ‘sigmoid’, ‘prelu’, ‘leakyrelu’, ‘hswish’, ‘hsigmoid’, ‘logsigmoid’ and so on. The user can also pass a custom activation. To run the net in parallel mode, the custom activation must also provide the activation_shard function. Please see the examples of mindspore.nn.transformer.FeedForward. Default: gelu.

  • post_layernorm_residual (bool) – Whether to do the residual addition before the layernorm. Default: False.

  • layernorm_compute_type (dtype.Number) – The computation type of the layernorm. Should be mstype.float32 or mstype.float16. Default: mstype.float32.

  • softmax_compute_type (dtype.Number) – The computation type of the softmax in the attention. Should be mstype.float32 or mstype.float16. Default: mstype.float32.

  • param_init_type (dtype.Number) – The parameter initialization type of the module. Should be mstype.float32 or mstype.float16. Default: mstype.float32.

  • lambda_func – A function that determines the fusion index, pipeline stages and recompute attribute. If the user wants to determine the pipeline stage and gradient aggregation fusion, the user can pass a function that accepts network, layer_id, offset, parallel_config, layers. The network(Cell) represents the transformer block, layer_id(int) is the layer index of the current module, counting from zero, and offset(int) is the offset that the layer index needs if there are other modules in the net. The default setting for the pipeline is: (layer_id + offset) // ((encoder_layers + decoder_layers) / pipeline_stage). Default: None.

  • use_past (bool) – Use the past state to compute, used for incremental prediction. Default False.

  • moe_config (MoEConfig) – The configuration of MoE (Mixture of Expert). Default is an instance of MoEConfig with default values. Please see MoEConfig.

  • parallel_config (TransformerOpParallelConfig) – The parallel configuration. Default: default_transformer_config, an instance of TransformerOpParallelConfig with default args.

Inputs:
  • encoder_inputs (Tensor) - The input tensor with shape [batch_size, seq_length, hidden_size] or [batch_size * seq_length, hidden_size].

  • encoder_masks (Tensor) - The attention mask for the encoder with shape [batch_size, seq_length, seq_length] or None. None means there will be no mask in the softmax computation in the self attention of the encoder module.

  • decoder_inputs (Tensor) - The input tensor of the decoder with shape [batch_size, seq_length, hidden_size] or [batch_size * seq_length, hidden_size]. This should be None if the decoder layer is 0.

  • decoder_masks (Tensor) - The attention mask for the decoder with shape [batch_size, seq_length, seq_length] or None. None means there will be no mask in the softmax computation in the self attention of the decoder module.

  • memory_mask (Tensor) - The memory mask of the cross attention with shape [batch, tgt_seq_length, src_seq_length], where tgt_seq_length is the length of the decoder. This should be None if the decoder layer is 0 or the user wants no mask.

  • init_reset (Tensor) - A bool tensor with shape [1], used to clear the past key parameter and past value parameter used in the incremental prediction. Only valid when use_past is True. Default: True.

  • batch_valid_length (Tensor) - Int32 tensor with shape [batch_size]. The past calculated index. Used for incremental prediction when use_past is True. Default: None.

Outputs:

Tuple, a tuple containing (output, encoder_layer_present, decoder_layer_present, accum_loss).

  • output (Tensor) - If there is only an encoder, the output logit of the encoder layer, with shape [batch, src_seq_length, hidden_size] or [batch * src_seq_length, hidden_size]. If there are both an encoder and a decoder, the output is from the decoder layer, with shape [batch, tgt_seq_length, hidden_size] or [batch * tgt_seq_length, hidden_size].

  • encoder_layer_present (Tuple) - A tuple with size of num_layers, where each tuple holds the tensors of the projected key and value vectors in self attention with shape ((batch_size, num_heads, size_per_head, src_seq_length), (batch_size, num_heads, src_seq_length, size_per_head)).

  • decoder_layer_present (Tuple) - A tuple with size of num_layers, where each tuple holds the tensors of the projected key and value vectors in self attention with shape ((batch_size, num_heads, size_per_head, tgt_seq_length), (batch_size, num_heads, tgt_seq_length, size_per_head)), and the projected key and value vectors in cross attention with shape ((batch_size, num_heads, size_per_head, src_seq_length), (batch_size, num_heads, src_seq_length, size_per_head)). If the decoder is not set, the returned value will be None.

  • accum_loss (Tensor) - A Tensor indicating an auxiliary loss to minimize the mean square of the data part routed to each expert. Only returned if the number of experts is greater than 1.

Supported Platforms:

Ascend GPU

Examples

>>> import numpy as np
>>> from mindspore import dtype as mstype
>>> from mindspore.nn.transformer import Transformer
>>> from mindspore import Tensor
>>> model = Transformer(batch_size=2, encoder_layers=1, decoder_layers=2, hidden_size=64,
...                     ffn_hidden_size=64, src_seq_length=20, tgt_seq_length=10)
>>> encoder_input_value = Tensor(np.ones((2, 20, 64)), mstype.float32)
>>> encoder_input_mask = Tensor(np.ones((2, 20, 20)), mstype.float16)
>>> decoder_input_value = Tensor(np.ones((2, 10, 64)), mstype.float32)
>>> decoder_input_mask = Tensor(np.ones((2, 10, 10)), mstype.float16)
>>> memory_mask = Tensor(np.ones((2, 10, 20)), mstype.float16)
>>> output, en_past, de_past = model(encoder_input_value, encoder_input_mask, decoder_input_value,
...                                  decoder_input_mask, memory_mask)
>>> print(output.shape)
(2, 10, 64)
>>> print(len(en_past))
1
>>> print(len(de_past))
2
>>> print(en_past[0][0].shape)
(2, 2, 32, 20)
>>> print(en_past[0][1].shape)
(2, 2, 20, 32)
>>> print(de_past[0][0].shape)
(2, 2, 32, 10)
>>> print(de_past[0][1].shape)
(2, 2, 10, 32)
>>> print(de_past[0][2].shape)
(2, 2, 32, 20)
>>> print(de_past[0][3].shape)
(2, 2, 20, 32)
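
The default pipeline placement quoted under lambda_func can be checked with plain arithmetic. The sketch below evaluates the documented formula directly; passing encoder_layers as the decoder's offset is an assumption about how the two stacks are numbered, and the helper name default_pipeline_stage is hypothetical.

```python
def default_pipeline_stage(layer_id, offset, encoder_layers, decoder_layers, pipeline_stage):
    """Stage index from (layer_id + offset) // ((encoder_layers + decoder_layers) / pipeline_stage)."""
    layers_per_stage = (encoder_layers + decoder_layers) / pipeline_stage
    return int((layer_id + offset) // layers_per_stage)


encoder_layers, decoder_layers, pipeline_stage = 2, 2, 2
# Encoder layers are numbered from zero with no offset.
encoder_stages = [default_pipeline_stage(i, 0, encoder_layers, decoder_layers, pipeline_stage)
                  for i in range(encoder_layers)]
# Assumption: decoder layers are offset by the number of encoder layers.
decoder_stages = [default_pipeline_stage(i, encoder_layers, encoder_layers, decoder_layers,
                                         pipeline_stage)
                  for i in range(decoder_layers)]
print(encoder_stages, decoder_stages)
```

With two stages and four layers in total, the encoder layers land on stage 0 and the decoder layers on stage 1.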

class mindspore.nn.transformer.TransformerDecoder(num_layers, batch_size, hidden_size, ffn_hidden_size, src_seq_length, tgt_seq_length, num_heads, attention_dropout_rate=0.1, hidden_dropout_rate=0.1, post_layernorm_residual=False, layernorm_compute_type=mstype.float32, softmax_compute_type=mstype.float32, param_init_type=mstype.float32, hidden_act='gelu', lambda_func=None, use_past=False, offset=0, moe_config=default_moe_config, parallel_config=default_transformer_config)

Transformer Decoder module with multiple stacked TransformerDecoderLayer, including multi-head self attention, cross attention and feedforward layer.

Parameters
  • num_layers (int) – The layers of the TransformerDecoderLayer.

  • batch_size (int) – The batch size of the input tensor when doing incremental prediction. Should be a positive value. When doing training or prediction, the argument has no effect and the user can pass None.

  • hidden_size (int) – The hidden size of the input.

  • ffn_hidden_size (int) – The hidden size of bottleneck in the feedforward layer.

  • src_seq_length (int) – The input source sequence length.

  • tgt_seq_length (int) – The input target sequence length.

  • num_heads (int) – The number of the heads.

  • attention_dropout_rate (float) – The dropout rate of the attention scores. Default: 0.1.

  • hidden_dropout_rate (float) – The dropout rate of the final output of the layer. Default: 0.1.

  • post_layernorm_residual (bool) – Whether to do the residual addition before the layernorm. Default: False.

  • layernorm_compute_type (dtype.Number) – The computation type of the layernorm. Should be mstype.float32 or mstype.float16. Default mstype.float32.

  • softmax_compute_type (dtype.Number) – The computation type of the softmax in the attention. Should be mstype.float32 or mstype.float16. Default mstype.float32.

  • param_init_type (dtype.Number) – The parameter initialization type of the module. Should be mstype.float32 or mstype.float16. Default mstype.float32.

  • hidden_act (str, nn.Cell) – The activation of the internal feedforward layer. Supports ‘relu’, ‘relu6’, ‘tanh’, ‘gelu’, ‘fast_gelu’, ‘elu’, ‘sigmoid’, ‘prelu’, ‘leakyrelu’, ‘hswish’, ‘hsigmoid’, ‘logsigmoid’ and so on. User can provide custom activition to the argument. If user wants to run the net in the parallel mode, the custom activation must also provide the activation_shard function. Please see the examples of the class:mindspore.nn.transformer.FeedForward. Default: gelu.

  • lambda_func (function) – A function can determine the fusion index, pipeline stages and recompute attribute. If the user wants to determine the pipeline stage and gradient aggregation fusion, the user can pass a function that accepts network, layer_id, offset, parallel_config, layers. The network(Cell) represents the transformer block, layer_id(int) means the layer index for the current module, counts from zero, offset(int) means the layer_index needs an offset, if there are other modules in the net. The default setting for the pipeline is: (layer_id + offset) // (layers / pipeline_stage). Default: None.

  • use_past (bool) – Use the past state to compute, used for incremental prediction. Default False.

  • offset (int) – The initial layer index for the decoder. Used for setting the fusion id and stage id, to not overlap with the encoder layer. Default 0.

  • moe_config (MoEConfig) – The configuration of MoE (Mixture of Expert). Default is an instance of MoEConfig with default values. Please see MoEConfig.

  • parallel_config (TransformerOpParallelConfig) – The parallel configure. Default default_transformer_config, an instance of TransformerOpParallelConfig with default args.
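
The default pipeline rule quoted for lambda_func, (layer_id + offset) // (layers / pipeline_stage), can be illustrated with a small pure-Python sketch. The helper name default_pipeline_stage is hypothetical, and the integer division, the floor of one layer per stage, the clamping to the last stage, and the reading of layers as the total layer count of the net are all assumptions of this sketch rather than documented behaviour.

```python
def default_pipeline_stage(layer_id, offset, layers, pipeline_stage):
    """Stage index following (layer_id + offset) // (layers / pipeline_stage)."""
    # Layers per stage; integer division and the floor of 1 are assumptions
    # made here so the helper also works when layers < pipeline_stage.
    layers_per_stage = max(layers // pipeline_stage, 1)
    # Clamping to the last stage is likewise an assumption of this sketch.
    return min((layer_id + offset) // layers_per_stage, pipeline_stage - 1)


# Hypothetical net: 2 encoder layers followed by 2 decoder layers, 2 stages.
total_layers, stages = 4, 2
encoder_stages = [default_pipeline_stage(i, 0, total_layers, stages) for i in range(2)]
# The decoder uses offset=2 so its layer indices continue after the encoder.
decoder_stages = [default_pipeline_stage(i, 2, total_layers, stages) for i in range(2)]
print(encoder_stages, decoder_stages)  # [0, 0] [1, 1]
```

With these numbers the encoder layers land on stage 0 and the decoder layers on stage 1, which is the kind of non-overlapping placement the offset argument is meant to allow.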

Inputs:
  • hidden_stats (Tensor) - The input tensor with shape [batch_size, seq_length, hidden_size] or [batch_size * seq_length, hidden_size].

  • attention_mask (Tensor) - The attention mask for decoder with shape [batch_size, seq_length, seq_length] or None. None means there will be no mask in softmax computation in self attention.

  • encoder_output (Tensor) - The output of the encoder with shape [batch_size, seq_length, hidden_size] or [batch_size * seq_length, hidden_size]. Note that this argument cannot be None when the net is the outermost layer. Default None.

  • memory_mask (Tensor) - The memory mask of the cross attention with shape [batch, tgt_seq_length, src_seq_length] where tgt_seq_length is the length of the decoder. The user can also pass None. None means there will be no mask in softmax computation in cross attention. Default None.

  • init_reset (Tensor) - A bool tensor with shape [1], used to clear the past key parameter and past value parameter used in the incremental prediction. Only valid when use_past is True. Default True.

  • batch_valid_length (Tensor) - Int32 tensor with shape [batch_size], giving the index up to which the past state has been calculated. Used for incremental prediction when use_past is True. Default None.

Outputs:

Tuple, a tuple containing (output, layer_present).

  • output (Tensor) - The output logit of this layer. The shape is [batch, tgt_seq_length, hidden_size] or [batch * tgt_seq_length, hidden_size].

  • layer_present (Tuple) - A tuple with size of num_layers, where each element is a tuple of the projected key and value tensors in self attention with shapes (batch_size, num_heads, size_per_head, tgt_seq_length) and (batch_size, num_heads, tgt_seq_length, size_per_head), followed by the projected key and value tensors in cross attention with shapes (batch_size, num_heads, size_per_head, src_seq_length) and (batch_size, num_heads, src_seq_length, size_per_head).
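
The four shapes stored per layer in layer_present can be derived from the constructor arguments. The sketch below is plain Python with a hypothetical helper name, and it assumes size_per_head equals hidden_size // num_heads, which is consistent with the shapes printed in the example of this class (hidden_size=64, num_heads=2 gives 32).

```python
def decoder_layer_present_shapes(batch_size, num_heads, hidden_size,
                                 src_seq_length, tgt_seq_length):
    """Return the (key, value) shapes for self attention and cross attention."""
    # Assumption: each head gets an equal slice of the hidden dimension.
    if hidden_size % num_heads != 0:
        raise ValueError("hidden_size must be divisible by num_heads")
    size_per_head = hidden_size // num_heads
    return (
        (batch_size, num_heads, size_per_head, tgt_seq_length),  # self-attention key
        (batch_size, num_heads, tgt_seq_length, size_per_head),  # self-attention value
        (batch_size, num_heads, size_per_head, src_seq_length),  # cross-attention key
        (batch_size, num_heads, src_seq_length, size_per_head),  # cross-attention value
    )


shapes = decoder_layer_present_shapes(batch_size=2, num_heads=2, hidden_size=64,
                                      src_seq_length=20, tgt_seq_length=10)
for shape in shapes:
    print(shape)
```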

Supported Platforms:

Ascend GPU

Examples

>>> import numpy as np
>>> from mindspore import dtype as mstype
>>> from mindspore.nn.transformer import TransformerDecoder
>>> from mindspore import Tensor
>>> model = TransformerDecoder(batch_size=2, num_layers=1, hidden_size=64, ffn_hidden_size=64,
...                            num_heads=2, src_seq_length=20, tgt_seq_length=10)
>>> encoder_input_value = Tensor(np.ones((2, 20, 64)), mstype.float32)
>>> decoder_input_value = Tensor(np.ones((2, 10, 64)), mstype.float32)
>>> decoder_input_mask = Tensor(np.ones((2, 10, 10)), mstype.float16)
>>> memory_mask = Tensor(np.ones((2, 10, 20)), mstype.float16)
>>> output, past = model(decoder_input_value, decoder_input_mask, encoder_input_value, memory_mask)
>>> print(output.shape)
(2, 10, 64)
>>> print(len(past))
1
>>> print(past[0][0].shape)
(2, 2, 32, 10)
>>> print(past[0][1].shape)
(2, 2, 10, 32)
>>> print(past[0][2].shape)
(2, 2, 32, 20)
>>> print(past[0][3].shape)
(2, 2, 20, 32)
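
The example above passes all-ones masks. As a rough illustration of what realistic masks might look like, the following pure-Python sketch builds a lower triangular decoder mask of shape [tgt_seq_length, tgt_seq_length] and a memory mask of shape [tgt_seq_length, src_seq_length] for a single sample from 0/1 padding flags. The helper names are hypothetical and the code does not call MindSpore; AttentionMask is the interface documented on this page for the lower triangular case.

```python
def causal_mask(valid):
    """valid: 0/1 flags for one sequence -> [seq, seq] lower triangular mask."""
    n = len(valid)
    # Position i may attend to position j only if j <= i and both are valid tokens.
    return [[1.0 if (j <= i and valid[i] and valid[j]) else 0.0 for j in range(n)]
            for i in range(n)]


def memory_mask(tgt_valid, src_valid):
    """[tgt_seq, src_seq] mask: valid target positions attend to valid source positions."""
    return [[1.0 if (t and s) else 0.0 for s in src_valid] for t in tgt_valid]


tgt_valid = [1, 1, 1, 0]   # last target position is padding
src_valid = [1, 1, 0]      # last source position is padding
dec = causal_mask(tgt_valid)
mem = memory_mask(tgt_valid, src_valid)
for row in dec:
    print(row)
```

Stacking one such matrix per sample gives the [batch_size, tgt_seq_length, tgt_seq_length] and [batch, tgt_seq_length, src_seq_length] layouts listed under Inputs.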
class mindspore.nn.transformer.TransformerDecoderLayer(hidden_size, ffn_hidden_size, num_heads, batch_size, src_seq_length, tgt_seq_length, attention_dropout_rate=0.1, hidden_dropout_rate=0.1, post_layernorm_residual=False, use_past=False, layernorm_compute_type=mstype.float32, softmax_compute_type=mstype.float32, param_init_type=mstype.float32, hidden_act='gelu', moe_config=default_moe_config, parallel_config=default_dpmp_config)[source]

Transformer Decoder Layer. This is an implementation of a single layer of the transformer decoder, including self-attention, cross attention and a feedforward layer. When encoder_output is None, the cross attention has no effect.

Parameters
  • hidden_size (int) – The hidden size of the input.

  • ffn_hidden_size (int) – The hidden size of bottleneck in the feedforward layer.

  • num_heads (int) – The number of the heads.

  • batch_size (int) – The batch size of the input tensor when doing incremental prediction. Should be a positive value. In training or normal prediction, the argument has no effect and the user can pass None.

  • src_seq_length (int) – The input source sequence length.

  • tgt_seq_length (int) – The input target sequence length.

  • attention_dropout_rate (float) – The dropout rate of the attention scores. Default:0.1.

  • hidden_dropout_rate (float) – The dropout rate of the final output of the layer. Default:0.1.

  • post_layernorm_residual (bool) – Whether to apply the residual addition before the layernorm. Default: False.

  • use_past (bool) – Use the past state to compute, used for incremental prediction. Default False.

  • layernorm_compute_type (dtype.Number) – The computation type of the layernorm. Should be mstype.float32 or mstype.float16. Default mstype.float32.

  • softmax_compute_type (dtype.Number) – The computation type of the softmax in the attention. Should be mstype.float32 or mstype.float16. Default mstype.float32.

  • param_init_type (dtype.Number) – The parameter initialization type of the module. Should be mstype.float32 or mstype.float16. Default mstype.float32.

  • hidden_act (str, nn.Cell) – The activation of the internal feedforward layer. Supports ‘relu’, ‘relu6’, ‘tanh’, ‘gelu’, ‘fast_gelu’, ‘elu’, ‘sigmoid’, ‘prelu’, ‘leakyrelu’, ‘hswish’, ‘hsigmoid’, ‘logsigmoid’ and so on. The user can also pass a custom activation. To run the net in parallel mode, the custom activation must also provide the activation_shard function. Please see the examples of the class mindspore.nn.transformer.FeedForward. Default: gelu.

  • moe_config (MoEConfig) – The configuration of MoE (Mixture of Expert). Default is an instance of MoEConfig with default values. Please see MoEConfig.

  • parallel_config (OpParallelConfig, MoEParallelConfig) – The parallel configure. When MoE is applied, MoEParallelConfig is effective, otherwise OpParallelConfig is effective. Default default_dpmp_config, an instance of OpParallelConfig with default args.

Inputs:
  • hidden_stats (Tensor) - The input tensor with shape [batch_size, tgt_seq_length, hidden_size] or [batch_size * tgt_seq_length, hidden_size].

  • decoder_mask (Tensor) - The attention mask for decoder with shape [batch_size, tgt_seq_length, tgt_seq_length] or None. None means there will be no mask in softmax computation in self attention.

  • encoder_output (Tensor) - The output of the encoder with shape [batch_size, seq_length, hidden_size] or [batch_size * seq_length, hidden_size]. Note that this argument cannot be None when the net is the outermost layer. Default None.

  • memory_mask (Tensor) - The memory mask of the cross attention with shape [batch, tgt_seq_length, src_seq_length] where tgt_seq_length is the length of the decoder. The user can also pass None. None means there will be no mask in softmax computation in cross attention. Default None.

  • init_reset (Tensor) - A bool tensor with shape [1], used to clear the past key parameter and past value parameter used in the incremental prediction. Only valid when use_past is True. Default True.

  • batch_valid_length (Tensor) - Int32 tensor with shape [batch_size], giving the index up to which the past state has been calculated. Used for incremental prediction when use_past is True. Default None.

Outputs:

Tuple, a tuple containing (output, layer_present).

  • output (Tensor) - The output logit of this layer. The shape is [batch, seq_length, hidden_size] or [batch * seq_length, hidden_size].

  • layer_present (Tuple) - A tuple of the projected key and value tensors in self attention with shapes (batch_size, num_heads, size_per_head, tgt_seq_length) and (batch_size, num_heads, tgt_seq_length, size_per_head), followed by the projected key and value tensors in cross attention with shapes (batch_size, num_heads, size_per_head, src_seq_length) and (batch_size, num_heads, src_seq_length, size_per_head).

Supported Platforms:

Ascend GPU

Examples

>>> import numpy as np
>>> from mindspore import dtype as mstype
>>> from mindspore.nn.transformer import TransformerDecoderLayer
>>> from mindspore import Tensor
>>> model = TransformerDecoderLayer(batch_size=2, hidden_size=64, ffn_hidden_size=64, num_heads=2,
...                                 src_seq_length=20, tgt_seq_length=10)
>>> encoder_input_value = Tensor(np.ones((2, 20, 64)), mstype.float32)
>>> decoder_input_value = Tensor(np.ones((2, 10, 64)), mstype.float32)
>>> decoder_input_mask = Tensor(np.ones((2, 10, 10)), mstype.float16)
>>> memory_mask = Tensor(np.ones((2, 10, 20)), mstype.float16)
>>> output, past = model(decoder_input_value, decoder_input_mask, encoder_input_value, memory_mask)
>>> print(output.shape)
(2, 10, 64)
>>> print(past[0].shape)
(2, 2, 32, 10)
>>> print(past[1].shape)
(2, 2, 10, 32)
>>> print(past[2].shape)
(2, 2, 32, 20)
>>> print(past[3].shape)
(2, 2, 20, 32)
class mindspore.nn.transformer.TransformerEncoder(batch_size, num_layers, hidden_size, ffn_hidden_size, seq_length, num_heads, attention_dropout_rate=0.1, hidden_dropout_rate=0.1, hidden_act='gelu', post_layernorm_residual=False, layernorm_compute_type=mstype.float32, softmax_compute_type=mstype.float32, param_init_type=mstype.float32, lambda_func=None, offset=0, use_past=False, moe_config=default_moe_config, parallel_config=default_transformer_config)[source]

Transformer Encoder module with multiple stacked TransformerEncoderLayer layers, each including multihead self attention and a feedforward layer.

Parameters
  • batch_size (int) – The batch size of the input tensor when doing incremental prediction. Should be a positive value. In training or normal prediction, the argument has no effect and the user can pass None.

  • num_layers (int) – The number of TransformerEncoderLayer layers.

  • hidden_size (int) – The hidden size of the input.

  • ffn_hidden_size (int) – The hidden size of bottleneck in the feedforward layer.

  • seq_length (int) – The seq_length of the input tensor.

  • num_heads (int) – The number of the heads.

  • attention_dropout_rate (float) – The dropout rate of the attention scores. Default:0.1.

  • hidden_dropout_rate (float) – The dropout rate of the final output of the layer. Default: 0.1.

  • hidden_act (str, nn.Cell) – The activation of the internal feedforward layer. Supports ‘relu’, ‘relu6’, ‘tanh’, ‘gelu’, ‘fast_gelu’, ‘elu’, ‘sigmoid’, ‘prelu’, ‘leakyrelu’, ‘hswish’, ‘hsigmoid’, ‘logsigmoid’ and so on. The user can also pass a custom activation. To run the net in parallel mode, the custom activation must also provide the activation_shard function. Please see the examples of the class mindspore.nn.transformer.FeedForward. Default: gelu.

  • post_layernorm_residual (bool) – Whether to apply the residual addition before the layernorm. Default: False.

  • layernorm_compute_type (dtype.Number) – The computation type of the layernorm. Should be mstype.float32 or mstype.float16. Default mstype.float32.

  • softmax_compute_type (dtype.Number) – The computation type of the softmax in the attention. Should be mstype.float32 or mstype.float16. Default: mstype.float32.

  • param_init_type (dtype.Number) – The parameter initialization type of the module. Should be mstype.float32 or mstype.float16. Default: mstype.float32.

  • lambda_func (function) – A function that determines the fusion index, pipeline stage and recompute attribute of each layer. To customize the pipeline stage and gradient aggregation fusion, the user can pass a function that accepts network, layer_id, offset, parallel_config and layers. network (Cell) is the transformer block, layer_id (int) is the index of the current layer, counting from zero, and offset (int) is the offset added to layer_id when there are other modules in the net. The default setting for the pipeline is: (layer_id + offset) // (layers / pipeline_stage). Default: None.

  • offset (int) – The initial layer index for the encoder. Used for setting the fusion id and stage id, to not overlap with the encoder layer. Default 0.

  • use_past (bool) – Use the past state to compute, used for incremental prediction. For example, if we have two words and want to generate ten more words, we only need to compute the state of the two words once and then generate the next words one by one. When use_past is True, the prediction runs in two steps. In the first step, set is_first_iteration to True with model.add_flags_recursive(is_first_iteration=True) and pass the full inputs. Then set is_first_iteration to False with model.add_flags_recursive(is_first_iteration=False), pass the input tensor of a single step, and repeat in a loop. Default: False.

  • moe_config (MoEConfig) – The configuration of MoE (Mixture of Expert). Default is an instance of MoEConfig with default values. Please see MoEConfig.

  • parallel_config (TransformerOpParallelConfig) – The parallel configure. Default default_transformer_config, an instance of TransformerOpParallelConfig with default args.

Inputs:
  • hidden_states (Tensor) - Tensor, shape should be [batch_size, seq_length, hidden_size] or [batch_size * seq_length, hidden_size], if the use_past is False or is_first_iteration=True. Otherwise, should be [batch_size, 1, hidden_size].

  • attention_mask (Tensor) - Float Tensor. If use_past is False or is_first_iteration=True, the attention mask matrix should be [batch_size, seq_length, seq_length], or None. None means there will be no mask in softmax computation. Otherwise, it should be [batch_size, 1, seq_length].

  • init_reset (Tensor) - A bool tensor with shape [1], used to clear the past key parameter and past value parameter used in the incremental prediction. Only valid when use_past is True. Default True.

  • batch_valid_length (Tensor) - Int32 tensor with shape [batch_size], giving the index up to which the past state has been calculated. Used for incremental prediction when use_past is True. Default None.

Outputs:

Tuple, a tuple containing (output, layer_present).

  • output (Tensor) - The float tensor of the output of the layer with shape (batch_size, seq_length, hidden_size) or (batch_size * seq_length, hidden_size), if the use_past is False or is_first_iteration=True. Otherwise, it will be (batch_size, 1, hidden_size).

  • layer_present (Tuple) - A tuple with size of num_layers, where each element is a tuple containing the projected key and value tensors with shapes (batch_size, num_heads, size_per_head, seq_length) and (batch_size, num_heads, seq_length, size_per_head).
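
The key in layer_present has its last two dimensions swapped relative to the value. One plausible reading, stated here as an assumption, is that the key is kept transposed so it can be multiplied with the query directly. The sketch below only checks shape compatibility in plain Python; the helper name is hypothetical, and both the query layout (batch_size, num_heads, seq_length, size_per_head) and size_per_head = hidden_size // num_heads are assumed.

```python
def batched_matmul_shape(a, b):
    """Shape of a @ b for tensors whose last two dims are treated as matrices."""
    if a[:-2] != b[:-2] or a[-1] != b[-2]:
        raise ValueError(f"incompatible shapes {a} and {b}")
    return a[:-2] + (a[-2], b[-1])


batch, heads, seq, hidden = 2, 2, 16, 8
d = hidden // heads                    # size_per_head (assumed)
query = (batch, heads, seq, d)         # assumed query layout
key = (batch, heads, d, seq)           # layout of layer_present[i][0]
value = (batch, heads, seq, d)         # layout of layer_present[i][1]

scores = batched_matmul_shape(query, key)       # attention scores
context = batched_matmul_shape(scores, value)   # weighted sum of values
print(scores, context)  # (2, 2, 16, 16) (2, 2, 16, 4)
```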

Supported Platforms:

Ascend GPU

Examples

>>> import numpy as np
>>> from mindspore import dtype as mstype
>>> from mindspore.nn.transformer import TransformerEncoder
>>> from mindspore import Tensor
>>> model = TransformerEncoder(batch_size=2, num_layers=2, hidden_size=8, ffn_hidden_size=64,
...                            seq_length=16, num_heads=2)
>>> encoder_input_value = Tensor(np.ones((2, 16, 8)), mstype.float32)
>>> encoder_input_mask = Tensor(np.ones((2, 16, 16)), mstype.float16)
>>> output, past = model(encoder_input_value, encoder_input_mask)
>>> print(output.shape)
(2, 16, 8)
>>> print(len(past))
2
>>> print(past[0][0].shape)
(2, 2, 4, 16)
>>> print(past[0][1].shape)
(2, 2, 16, 4)
>>> # When use_past=True, incremental prediction takes two steps.
>>> # Step 1: set is_first_iteration=True, and input the full sequence length's state.
>>> batch_valid_length = Tensor(np.ones((2,)), mstype.int32)
>>> init_reset = Tensor([True], mstype.bool_)
>>> # Set is_first_iteration=True to generate the full memory states
>>> model = TransformerEncoder(batch_size=2, hidden_size=8, ffn_hidden_size=64, seq_length=16,
...                            num_heads=2, num_layers=2, use_past=True)
>>> model.add_flags_recursive(is_first_iteration=True)
>>> hidden, past = model(encoder_input_value, encoder_input_mask, init_reset, batch_valid_length)
>>> print(hidden.shape)
(2, 16, 8)
>>> print(past[0][0].shape)
(2, 2, 4, 16)
>>> print(past[0][1].shape)
(2, 2, 16, 4)
>>> encoder_input_value = Tensor(np.ones((2, 1, 8)), mstype.float32)
>>> encoder_input_mask = Tensor(np.ones((2, 1, 16)), mstype.float16)
>>> init_reset = Tensor([False], mstype.bool_)
>>> # Step 2: set is_first_iteration=False, and pass the single word to run the prediction rather than
>>> # the full sequence.
>>> model.add_flags_recursive(is_first_iteration=False)
>>> hidden, past = model(encoder_input_value, encoder_input_mask, init_reset, batch_valid_length)
>>> print(hidden.shape)
(2, 1, 8)
>>> print(past[0][0].shape)
(2, 2, 4, 16)
>>> print(past[0][1].shape)
(2, 2, 16, 4)
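
The two-step procedure above can be pictured with a toy key/value cache. This is a conceptual sketch in plain Python, not MindSpore's implementation: the class ToyKVCache and its methods are hypothetical, and the "projection" is a placeholder that only counts how many token states were computed.

```python
class ToyKVCache:
    """Conceptual stand-in for the cached key/value state of one layer."""

    def __init__(self, seq_length):
        self.keys = [None] * seq_length
        self.values = [None] * seq_length
        self.projections = 0  # how many token states were computed in total

    def _project(self, token):
        # Placeholder for the real key/value projection.
        self.projections += 1
        return ("k", token), ("v", token)

    def first_iteration(self, tokens):
        # Step 1: process the whole prompt and fill the cache from index 0.
        for i, tok in enumerate(tokens):
            self.keys[i], self.values[i] = self._project(tok)
        return len(tokens)  # number of valid cached positions

    def next_step(self, token, valid_length):
        # Step 2: only the new token is projected, written at index valid_length.
        self.keys[valid_length], self.values[valid_length] = self._project(token)
        return valid_length + 1


cache = ToyKVCache(seq_length=16)
valid = cache.first_iteration(["the", "cat"])
for tok in ["sat", "on", "a"]:
    valid = cache.next_step(tok, valid)
print(valid, cache.projections)  # 5 5
```

Each token is projected once, which is the saving that use_past is described as providing; the valid counter plays the role that batch_valid_length has in the real interface.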
class mindspore.nn.transformer.TransformerEncoderLayer(batch_size, hidden_size, ffn_hidden_size, num_heads, seq_length, attention_dropout_rate=0.1, hidden_dropout_rate=0.1, post_layernorm_residual=False, layernorm_compute_type=mstype.float32, softmax_compute_type=mstype.float32, param_init_type=mstype.float32, hidden_act='gelu', use_past=False, moe_config=default_moe_config, parallel_config=default_dpmp_config)[source]

Transformer Encoder Layer. This is an implementation of a single layer of the transformer encoder, including multihead attention and a feedforward layer.

Parameters
  • batch_size (int) – The batch size of the input tensor when doing incremental prediction. Should be a positive value. In training or normal prediction, the argument has no effect and the user can pass None.

  • hidden_size (int) – The hidden size of the input.

  • ffn_hidden_size (int) – The hidden size of bottleneck in the feedforward layer.

  • num_heads (int) – The number of the heads.

  • seq_length (int) – The input sequence length.

  • attention_dropout_rate (float) – The dropout rate of the attention scores. Default:0.1.

  • hidden_dropout_rate (float) – The dropout rate of the final output of the layer. Default:0.1.

  • post_layernorm_residual (bool) – Whether to apply the residual addition before the layernorm. Default: False.

  • layernorm_compute_type (dtype.Number) – The computation type of the layernorm. Should be mstype.float32 or mstype.float16. Default mstype.float32.

  • softmax_compute_type (dtype.Number) – The computation type of the softmax in the attention. Should be mstype.float32 or mstype.float16. Default mstype.float32.

  • param_init_type (dtype.Number) – The parameter initialization type of the module. Should be mstype.float32 or mstype.float16. Default mstype.float32.

  • hidden_act (str, nn.Cell) – The activation of the internal feedforward layer. Supports ‘relu’, ‘relu6’, ‘tanh’, ‘gelu’, ‘fast_gelu’, ‘elu’, ‘sigmoid’, ‘prelu’, ‘leakyrelu’, ‘hswish’, ‘hsigmoid’, ‘logsigmoid’ and so on. The user can also pass a custom activation. To run the net in parallel mode, the custom activation must also provide the activation_shard function. Please see the examples of the class mindspore.nn.transformer.FeedForward. Default: gelu.

  • use_past (bool) – Use the past state to compute, used for incremental prediction. For example, if we have two words and want to generate ten more words, we only need to compute the state of the two words once and then generate the next words one by one. When use_past is True, the prediction runs in two steps. In the first step, set is_first_iteration to True with model.add_flags_recursive(is_first_iteration=True) and pass the full inputs. Then set is_first_iteration to False with model.add_flags_recursive(is_first_iteration=False), pass the input tensor of a single step, and repeat in a loop. Default False.

  • moe_config (MoEConfig) – The configuration of MoE (Mixture of Expert). Default is an instance of MoEConfig with default values. Please see MoEConfig.

  • parallel_config (OpParallelConfig, MoEParallelConfig) – The parallel configure. When MoE is applied, MoEParallelConfig is effective, otherwise OpParallelConfig is effective. Default default_dpmp_config, an instance of OpParallelConfig with default args.

Inputs:
  • x (Tensor) - Float Tensor, shape should be [batch_size, seq_length, hidden_size] or [batch_size * seq_length, hidden_size], if the use_past is False or is_first_iteration=True. Otherwise, should be [batch_size, 1, hidden_size].

  • input_mask (Tensor) - Float Tensor. If use_past is False or is_first_iteration=True, the attention mask matrix should be [batch_size, seq_length, seq_length], or None. None means there will be no mask in softmax computation. Otherwise, it should be [batch_size, 1, seq_length].

  • init_reset (Tensor) - A bool tensor with shape [1], used to clear the past key parameter and past value parameter used in the incremental prediction. Only valid when use_past is True. Default True.

  • batch_valid_length (Tensor) - Int32 tensor with shape [batch_size], giving the index up to which the past state has been calculated. Used for incremental prediction when use_past is True. Default None.

Outputs:

Tuple, a tuple containing (output, layer_present).

  • output (Tensor) - The float tensor of the output of the layer with shape (batch_size, seq_length, hidden_size) or (batch_size * seq_length, hidden_size), if the use_past is False or is_first_iteration=True. Otherwise, it will be (batch_size, 1, hidden_size).

  • layer_present (Tuple) - A tuple of the projected key and value tensors with shapes (batch_size, num_heads, size_per_head, seq_length) and (batch_size, num_heads, seq_length, size_per_head).

Supported Platforms:

Ascend GPU

Examples

>>> import numpy as np
>>> from mindspore import dtype as mstype
>>> from mindspore.nn.transformer import TransformerEncoderLayer
>>> from mindspore import Tensor
>>> model = TransformerEncoderLayer(batch_size=2, hidden_size=8, ffn_hidden_size=64, seq_length=16,
...                                 num_heads=2)
>>> encoder_input_value = Tensor(np.ones((2, 16, 8)), mstype.float32)
>>> encoder_input_mask = Tensor(np.ones((2, 16, 16)), mstype.float16)
>>> output, past = model(encoder_input_value, encoder_input_mask)
>>> print(output.shape)
(2, 16, 8)
>>> print(past[0].shape)
(2, 2, 4, 16)
>>> print(past[1].shape)
(2, 2, 16, 4)
>>> # When use_past=True, incremental prediction takes two steps.
>>> # Step 1: set is_first_iteration=True, and input the full sequence length's state.
>>> batch_valid_length = Tensor(np.ones((2,)), mstype.int32)
>>> init_reset = Tensor([True], mstype.bool_)
>>> # Set is_first_iteration=True to generate the full memory states
>>> model = TransformerEncoderLayer(batch_size=2, hidden_size=8, ffn_hidden_size=64, seq_length=16,
...                                 num_heads=2, use_past=True)
>>> model.add_flags_recursive(is_first_iteration=True)
>>> hidden, past = model(encoder_input_value, encoder_input_mask, init_reset, batch_valid_length)
>>> print(hidden.shape)
(2, 16, 8)
>>> print(past[0].shape)
(2, 2, 4, 16)
>>> print(past[1].shape)
(2, 2, 16, 4)
>>> encoder_input_value = Tensor(np.ones((2, 1, 8)), mstype.float32)
>>> encoder_input_mask = Tensor(np.ones((2, 1, 16)), mstype.float16)
>>> init_reset = Tensor([False], mstype.bool_)
>>> # Step 2: set is_first_iteration=False, and pass the single word to run the prediction rather than
>>> # the full sequence.
>>> model.add_flags_recursive(is_first_iteration=False)
>>> hidden, past = model(encoder_input_value, encoder_input_mask, init_reset, batch_valid_length)
>>> print(hidden.shape)
(2, 1, 8)
>>> print(past[0].shape)
(2, 2, 4, 16)
>>> print(past[1].shape)
(2, 2, 16, 4)
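
The input shapes used in the two steps of the example above can be summarised by a small helper. This is a plain-Python sketch with a hypothetical function name; it follows the shapes actually passed in the example, where the single-step mask has shape (2, 1, 16), that is [batch_size, 1, seq_length].

```python
def encoder_layer_input_shapes(batch_size, seq_length, hidden_size,
                               use_past=False, is_first_iteration=True):
    """Return (x_shape, input_mask_shape) expected for one call."""
    full_sequence = (not use_past) or is_first_iteration
    if full_sequence:
        # Training, plain prediction, or step 1 of incremental prediction.
        return ((batch_size, seq_length, hidden_size),
                (batch_size, seq_length, seq_length))
    # Step 2 of incremental prediction: one new position per call.
    return ((batch_size, 1, hidden_size), (batch_size, 1, seq_length))


step1 = encoder_layer_input_shapes(2, 16, 8, use_past=True, is_first_iteration=True)
step2 = encoder_layer_input_shapes(2, 16, 8, use_past=True, is_first_iteration=False)
print(step1)  # ((2, 16, 8), (2, 16, 16))
print(step2)  # ((2, 1, 8), (2, 1, 16))
```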
class mindspore.nn.transformer.TransformerOpParallelConfig(data_parallel=1, model_parallel=1, expert_parallel=1, pipeline_stage=1, micro_batch_num=1, recompute=default_transformer_recompute_config, optimizer_shard=False, gradient_aggregation_group=4, vocab_emb_dp=True)[source]

TransformerOpParallelConfig for setting parallel configuration, such as the data parallel and model parallel.

Note

Except for the recompute argument, the other arguments take effect only when the user sets auto_parallel_context to SEMI_AUTO_PARALLEL or AUTO_PARALLEL. The micro_batch_num must be greater than or equal to pipeline_stage when training. The product data_parallel * model_parallel * pipeline_stage must be less than or equal to the number of devices. When the pipeline stage and optimizer_shard are set, the config will overwrite the auto_parallel_context. For example, given 8 devices with data_parallel set to 1 and model_parallel set to 1, the calculation will be repeated on each device.
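
The two numeric constraints in this note can be checked before building a config. The function below is a hypothetical pre-flight check in plain Python, not part of MindSpore; device_num is an assumed name for the number of available devices.

```python
def check_parallel_layout(device_num, data_parallel=1, model_parallel=1,
                          pipeline_stage=1, micro_batch_num=1, training=True):
    """Validate the constraints from the note; return the number of devices used."""
    used = data_parallel * model_parallel * pipeline_stage
    if used > device_num:
        raise ValueError(
            f"data_parallel * model_parallel * pipeline_stage = {used} "
            f"exceeds the {device_num} available devices")
    if training and micro_batch_num < pipeline_stage:
        raise ValueError("micro_batch_num must be >= pipeline_stage when training")
    return used


used = check_parallel_layout(device_num=8, data_parallel=2, model_parallel=2,
                             pipeline_stage=2, micro_batch_num=2)
print(used)  # 8
```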

Parameters
  • data_parallel (int) – The data parallel way. The input data will be sliced into n parts for each layer according to the data parallel way. Default: 1.

  • model_parallel (int) – The model parallel way. The parameters of dense layers in MultiheadAttention and FeedForward layer will be sliced according to the model parallel way. Default: 1.

  • expert_parallel (int) – The expert parallel way. This is effective only when MoE (Mixture of Experts) is applied. This value specifies the number of partitions to split the experts into. Default: 1.

  • pipeline_stage (int) – The number of the pipeline stage. Should be a positive value. Default: 1.

  • micro_batch_num (int) – The number of micro-batches for pipeline training. Default: 1.

  • optimizer_shard (bool) – Whether to enable optimizer shard. Default False.

  • gradient_aggregation_group (int) – The fusion group size of the optimizer state sharding. Default: 4.

  • recompute (Union[TransformerRecomputeConfig, bool]) – The configuration of recomputation for the transformer block. Default: An instance of TransformerRecomputeConfig with default values.

  • vocab_emb_dp (bool) – Shard embedding in model parallel or data parallel. Default: True.
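
As a rough picture of what a parallel "way" of n means, the sketch below splits a batch into data_parallel parts and a list of weight columns into model_parallel parts. It is a plain-Python illustration with hypothetical names. Which dimension of each dense layer is actually sliced is not stated here, so the column split is only an assumption.

```python
def split_evenly(items, parts):
    """Split a list into `parts` contiguous chunks of equal length."""
    if len(items) % parts != 0:
        raise ValueError("length must be divisible by the number of parts")
    step = len(items) // parts
    return [items[i * step:(i + 1) * step] for i in range(parts)]


# data_parallel=4: a batch of 8 samples is sliced into 4 parts of 2 samples.
batch = list(range(8))
data_shards = split_evenly(batch, 4)

# model_parallel=2: 4 weight columns are sliced into 2 parts (assumed layout).
weight_columns = ["w0", "w1", "w2", "w3"]
model_shards = split_evenly(weight_columns, 2)

print(data_shards)   # [[0, 1], [2, 3], [4, 5], [6, 7]]
print(model_shards)  # [['w0', 'w1'], ['w2', 'w3']]
```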

Supported Platforms:

Ascend GPU

Examples

>>> from mindspore.nn.transformer import TransformerOpParallelConfig, TransformerRecomputeConfig
>>> recompute_config=TransformerRecomputeConfig(recompute=True, parallel_optimizer_comm_recompute=True, \
...                                             mp_comm_recompute=True, recompute_slice_activation=True)
>>> config=TransformerOpParallelConfig(data_parallel=1, model_parallel=1, recompute=recompute_config)
class mindspore.nn.transformer.TransformerRecomputeConfig(recompute=False, parallel_optimizer_comm_recompute=False, mp_comm_recompute=True, recompute_slice_activation=False)[source]

TransformerRecomputeConfig for setting the recompute attributes of encoder/decoder layers.

Parameters
  • recompute (bool) – Enable recomputation of the transformer block or not. Default: False.

  • parallel_optimizer_comm_recompute (bool) – Specifies whether the AllGather communication operators introduced by optimizer shard are recomputed in auto parallel or semi auto parallel mode. Default: False.

  • mp_comm_recompute (bool) – Specifies whether the model parallel communication operators in the cell are recomputed in auto parallel or semi auto parallel mode. Default: True.

  • recompute_slice_activation (bool) – Whether to slice the cell output that would otherwise remain in memory. Default: False.

Supported Platforms:

Ascend GPU

Examples

>>> from mindspore.nn.transformer import TransformerRecomputeConfig
>>> config=TransformerRecomputeConfig(recompute=True, parallel_optimizer_comm_recompute=True, \
...                                   mp_comm_recompute=True, recompute_slice_activation=True)
class mindspore.nn.transformer.VocabEmbedding(vocab_size, embedding_size, parallel_config=default_embedding_parallel_config, param_init='normal')[source]

The embedding lookup table from the 0-th dim of the parameter table. When parallel_config.vocab_emb_dp is True and the mode is AUTO_PARALLEL, the embedding lookup is trained in the data parallel way, and the parameters are replicated on each device. If it is False, the embedding table is sharded into n parts along its 0-th dimension, where n is the model parallel way determined by parallel_config.model_parallel (EmbeddingOpParallelConfig).

Note

When AUTO_PARALLEL or SEMI_AUTO_PARALLEL mode is enabled, this layer supports only 2-D inputs, as the shard is designed for 2-D inputs.

Parameters
  • vocab_size (int) – Size of the dictionary of embeddings.

  • embedding_size (int) – The size of each embedding vector.

  • parallel_config (EmbeddingOpParallelConfig) – The parallel config of network. Default default_embedding_parallel_config, an instance of EmbeddingOpParallelConfig with default args.

  • param_init (Union[Tensor, str, Initializer, numbers.Number]) – Initializer for the embedding_table. Refer to class initializer for the values of string when a string is specified. Default: ‘normal’.

Inputs:
  • input_ids (Tensor) - The tokenized inputs with datatype int32 and shape (batch_size, seq_length).

Outputs:

Tuple, a tuple containing (output, embedding_table).

  • output (Tensor) - The embedding vector for the input with shape (batch_size, seq_length, embedding_size).

  • embedding_table (Tensor) - The embedding table with shape (vocab_size, embedding_size).

Raises
  • ValueError – If parallel_config.vocab_emb_dp is False and the vocab size is not a multiple of parallel_config.model_parallel.

  • ValueError – vocab_size is not a positive value.

  • ValueError – embedding_size is not a positive value.

  • TypeError – parallel_config is not a subclass of OpParallelConfig.

Supported Platforms:

Ascend GPU

Examples

>>> import numpy as np
>>> from mindspore.nn.transformer import VocabEmbedding
>>> from mindspore import Tensor
>>> from mindspore import dtype as mstype
>>> model = VocabEmbedding(vocab_size=30, embedding_size=30)
>>> tensor = Tensor(np.ones((20, 15)), mstype.int32)
>>> output, table = model(tensor)
>>> print(output.shape)
(20, 15, 30)
>>> print(table.shape)
(30, 30)
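
The sharding described for vocab_emb_dp=False, where the table is split into n parts along dimension 0, can be sketched in plain Python. The helpers shard_table and sharded_lookup are hypothetical and only model the row partitioning; how MindSpore routes lookups between devices is not shown.

```python
def shard_table(table, model_parallel):
    """Split an embedding table (list of rows) into model_parallel row blocks."""
    vocab_size = len(table)
    if vocab_size % model_parallel != 0:
        raise ValueError("vocab_size must be a multiple of model_parallel")
    rows = vocab_size // model_parallel
    return [table[i * rows:(i + 1) * rows] for i in range(model_parallel)]


def sharded_lookup(shards, token_id):
    """Find the shard that owns token_id and read the row at its local index."""
    rows = len(shards[0])
    return shards[token_id // rows][token_id % rows]


vocab_size, embedding_size = 6, 2
# Toy table: the row for token r is filled with the value r.
table = [[float(r)] * embedding_size for r in range(vocab_size)]
shards = shard_table(table, model_parallel=3)

input_ids = [[0, 5], [3, 3]]  # shape (batch_size=2, seq_length=2)
output = [[sharded_lookup(shards, t) for t in row] for row in input_ids]
print(output)  # [[[0.0, 0.0], [5.0, 5.0]], [[3.0, 3.0], [3.0, 3.0]]]
```

The result has shape (batch_size, seq_length, embedding_size) and matches a direct lookup in the unsharded table.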