mindformers.models.glm2.ChatGLM2Config

class mindformers.models.glm2.ChatGLM2Config(batch_size=1, num_layers=28, padded_vocab_size=65024, hidden_size=4096, ffn_hidden_size=13696, kv_channels=128, num_attention_heads=32, seq_length=2048, hidden_dropout=0.0, attention_dropout=0.0, layernorm_epsilon=1e-5, rope_ratio=1, rmsnorm=True, apply_residual_connection_post_layernorm=False, post_layer_norm=True, add_bias_linear=False, add_qkv_bias=True, bias_dropout_fusion=True, multi_query_attention=True, multi_query_group_num=2, apply_query_key_layer_scaling=True, attention_softmax_in_fp32=True, fp32_residual_connection=False, quantization_bit=0, pre_seq_len=None, prefix_projection=False, param_init_type: str = 'float16', compute_dtype: str = 'float16', layernorm_compute_type: str = 'float32', rotary_dtype: str = None, use_past=False, use_flash_attention=False, block_size=16, num_blocks=128, is_dynamic=False, eos_token_id=2, pad_token_id=0, gmask_token_id=None, bos_token_id=None, repetition_penalty=1.0, checkpoint_name_or_path=None, parallel_config: Union[dict, TransformerOpParallelConfig] = default_transformer_config, offset: int = 0, pp_interleave_num: int = 1, mlp_concat: bool = True, qkv_concat: bool = True, use_rearrange_rope: bool = False, mask_generate: str = None, fine_grain_interleave: int = 1, **kwargs)

ChatGLM2 model configuration class, which defines the model size.

Parameters
  • batch_size (int, optional) – Batch size of input data, used in prediction. Default: 1 .

  • num_layers (int, optional) – Number of hidden layers in the Transformer encoder. Default: 28 .

  • padded_vocab_size (int, optional) – Vocabulary size of the ChatGLM2 model. Default: 65024 .

  • hidden_size (int, optional) – Dimensionality of the hidden layers. Default: 4096 .

  • ffn_hidden_size (int, optional) – Dimensionality of the ffn layer. Default: 13696 .

  • kv_channels (int, optional) – The number of channels for key and value vectors in the transformer. Default: 128 .

  • num_attention_heads (int, optional) – The number of attention heads for each attention layer. Default: 32 .

  • seq_length (int, optional) – The sequence length of input_ids. Default: 2048 .

  • hidden_dropout (float, optional) – The dropout ratio of the dropout function. Default: 0.0 .

  • attention_dropout (float, optional) – The dropout ratio for the attention matrix. Default: 0.0 .

  • layernorm_epsilon (float, optional) – The ϵ value added to prevent the denominator from being zero when computing layer normalization. Default: 1e-5 .

  • rope_ratio (float, optional) – RoPE rotation coefficient. Default: 1 .

  • rmsnorm (bool, optional) – Whether to use rmsnorm. Default: True .

  • apply_residual_connection_post_layernorm (bool, optional) – Whether to apply the residual connection to the post-layernorm output. Default: False .

  • post_layer_norm (bool, optional) – Whether to use layer normalization after the ffn layer. Default: True .

  • add_bias_linear (bool, optional) – Whether to add bias to the linear layer. Default: False .

  • add_qkv_bias (bool, optional) – Whether to add bias for qkv. Default: True .

  • bias_dropout_fusion (bool, optional) – Whether to use the fused bias-add and dropout operation. Default: True .

  • multi_query_attention (bool, optional) – Whether to use multi query attention. Default: True .

  • multi_query_group_num (int, optional) – The number of key/value groups used by multi query attention. Default: 2 .

  • apply_query_key_layer_scaling (bool, optional) – Whether to scale the query-key layer. Default: True .

  • attention_softmax_in_fp32 (bool, optional) – Whether to compute the attention softmax in fp32. Default: True .

  • fp32_residual_connection (bool, optional) – Whether to compute the residual connection in fp32. Default: False .

  • quantization_bit (int, optional) – Number of bits used for weight and activation quantization. Default: 0 .

  • pre_seq_len (int, optional) – Length of the learnable prefix prompt sequence. Default: None .

  • prefix_projection (bool, optional) – Whether to add a projection layer for the prefix prompt. Default: False .

  • param_init_type (str, optional) – Parameter initialization dtype. Default: float16 .

  • compute_dtype (str, optional) – Linear layer compute dtype. Default: float16 .

  • layernorm_compute_type (str, optional) – LayerNorm compute dtype. Default: float32 .

  • rotary_dtype (str, optional) – Rotary position embedding compute dtype. Default: None .

  • use_past (bool, optional) – Whether the model should use the past last key/values attentions (if applicable to the model) to speed up decoding. Default: False .

  • use_flash_attention (bool, optional) – Whether to enable the flash attention operator. Default: False .

  • block_size (int, optional) – The maximum number of tokens one block can hold when using PagedAttention. Default: 16 .

  • num_blocks (int, optional) – The maximum number of blocks when using PagedAttention. Default: 128 .

  • is_dynamic (bool, optional) – Whether to use dynamic shape inputs. Default: False .

  • eos_token_id (int, optional) – The token id of the end-of-sequence token. Default: 2 .

  • pad_token_id (int, optional) – In multi-batch inference, the token id value used to pad shorter sequences to match the length of the longest sequence. Default: 0 .

  • gmask_token_id (int, optional) – The token id of the special gmask token. Default: None .

  • bos_token_id (int, optional) – The id of the beginning-of-sequence token. Default: None .

  • repetition_penalty (float, optional) – The parameter for repetition penalty. 1.0 means no penalty. Default: 1.0 .

  • checkpoint_name_or_path (str, optional) – Checkpoint path or name used to load weights into the network. Default: None .

  • parallel_config (TransformerOpParallelConfig, optional) – The parallel configuration, an instance of TransformerOpParallelConfig with default arguments, or an equivalent dict (see the sketch after this parameter list). Default: TransformerOpParallelConfig .

  • offset (int, optional) – The layer offset for each (mini) stage. Default: 0 .

  • pp_interleave_num (int, optional) – Number of microbatch interleavings in pipeline parallelism. Default: 1 .

  • mlp_concat (bool, optional) – Whether to concatenate the two MLP projections into one Linear layer. Default: True .

  • qkv_concat (bool, optional) – Whether to concatenate the query, key and value projections into a single Linear layer. Default: True .

  • use_rearrange_rope (bool, optional) – Whether to use rearranged rotary embedding. Default: False .

  • mask_generate (str, optional) – Which mask generation method to use; can be "inmap", "compress_reset" or None. When set to None, a lower triangular mask is used. Default: None .

  • fine_grain_interleave (int, optional) – Number of slices for the fine-grained interleave feature, which overlaps communication with computation in the tensor parallel case. Default: 1 .

  • kwargs (dict, optional) – Additional keyword arguments reserved for future extension.
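
The inference-related options (use_past, use_flash_attention, block_size, num_blocks, is_dynamic) and parallel_config are usually set together. The following is a minimal sketch, assuming TransformerOpParallelConfig can be imported from mindformers.modules.transformer; the values shown (for example model_parallel=2 and num_blocks=512) are purely illustrative.

>>> from mindformers.models import ChatGLM2Config
>>> from mindformers.modules.transformer import TransformerOpParallelConfig
>>> # Illustrative parallel setup; adjust to the actual device topology.
>>> parallel = TransformerOpParallelConfig(data_parallel=1, model_parallel=2)
>>> config = ChatGLM2Config(
...     num_layers=2,
...     seq_length=1024,
...     use_past=True,               # incremental decoding
...     use_flash_attention=True,    # enable the flash attention operator
...     block_size=16,               # tokens per PagedAttention block
...     num_blocks=512,              # total number of PagedAttention blocks
...     is_dynamic=True,             # dynamic shape inputs
...     parallel_config=parallel)
>>> config.use_past, config.block_size, config.num_blocks
(True, 16, 512)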

Examples

>>> from mindformers.models import ChatGLM2Config
>>> config = ChatGLM2Config(num_layers=2, seq_length=1024)
>>> print(config)
ChatGLM2Config {
    "add_bias_linear": false,
    "add_qkv_bias": true,
    "apply_query_key_layer_scaling": true,
    "apply_residual_connection_post_layernorm": false,
    "attention_dropout": 0.0,
    "attention_softmax_in_fp32": true,
    "batch_size": 1,
    "bias_dropout_fusion": true,
    "block_size": 16,
    "bos_token_id": null,
    "compute_dtype": "float16",
    "eos_token_id": 2,
    "ffn_hidden_size": 13696,
    "fp32_residual_connection": false,
    "gmask_token_id": null,
    "hidden_dropout": 0.0,
    "hidden_size": 4096,
    "is_dynamic": false,
    "kv_channels": 128,
    "layernorm_compute_type": "float32",
    "layernorm_epsilon": 1e-05,
    "mindformers_version": "1.1",
    "model_type": "glm2",
    "multi_query_attention": true,
    "multi_query_group_num": 2,
    "n_kv_heads": 2,
    "num_attention_heads": 32,
    "num_blocks": 128,
    "num_heads": 32,
    "num_layers": 2,
    "offset": 0,
    "pad_token_id": 0,
    "padded_vocab_size": 65024,
    "param_init_type": "float16",
    "post_layer_norm": true,
    "pre_seq_len": null,
    "prefix_projection": false,
    "quantization_bit": 0,
    "repetition_penalty": 1.0,
    "rmsnorm": true,
    "rope_ratio": 1,
    "seq_length": 1024,
    "use_flash_attention": false,
    "use_past": false,
    "vocab_size": 65024,
    "mlp_concat": True,
    "qkv_concat": True,
    "use_rearrange_rope": False,
    "mask_generate": None,
    "fine_grain_interleave": 1
    }
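
In the printed JSON above, the derived fields simply echo the constructor arguments: n_kv_heads (2) mirrors multi_query_group_num, num_heads mirrors num_attention_heads, and vocab_size mirrors padded_vocab_size. A short follow-up sketch, only reading back attributes of the config built in the example above; the kv_channels relationship in the comment is an observation about the default sizes, not a documented constraint.

>>> config.num_layers, config.seq_length
(2, 1024)
>>> # For the default sizes, kv_channels (128) equals the per-head width.
>>> config.hidden_size // config.num_attention_heads
128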