mindformers.models.LlamaConfig
- class mindformers.models.LlamaConfig(batch_size: int = 1, seq_length: int = 2048, hidden_size: int = 4096, num_layers: int = 32, num_heads: int = 32, n_kv_heads: Optional[int] = None, max_position_embedding: Optional[int] = None, intermediate_size: Optional[int] = None, vocab_size: int = 32000, multiple_of: int = 256, ffn_dim_multiplier: Optional[int] = None, rms_norm_eps: float = 1e-5, bos_token_id: int = 1, eos_token_id: int = 2, pad_token_id: int = 0, ignore_token_id: int = -100, theta: float = 10000.0, compute_dtype: str = 'float16', layernorm_compute_type: str = 'float32', softmax_compute_type: str = 'float32', rotary_dtype: str = 'float32', param_init_type: str = 'float16', residual_dtype: str = None, embedding_init_type=None, qkv_has_bias: bool = False, qkv_concat: bool = False, attn_proj_has_bias: bool = False, parallel_config: Union[dict, TransformerOpParallelConfig] = default_transformer_config, moe_config: Union[dict, MoEConfig] = default_moe_config, use_past: bool = False, extend_method: str = 'None', scaling_factor: float = 1.0, is_dynamic: bool = False, use_rope_slice: bool = False, use_flash_attention: bool = False, use_ring_attention: bool = False, use_attn_mask_compression: bool = False, parallel_optimizer: bool = False, fine_grain_interleave: int = 1, pp_interleave_num: int = 1, offset: int = 0, init_method_std: float = 0.01, checkpoint_name_or_path: str = '', repetition_penalty: float = 1.0, max_decode_length: int = 1024, block_size: int = 16, num_blocks: int = 512, top_k: int = 5, top_p: float = 1.0, do_sample: bool = True, quant_config: dict = None, tie_word_embeddings: bool = False, llm_backend: str = '', fused_rms_norm: bool = True, input_sliced_sig: bool = False, **kwargs)
Llama config class which defines the model size.
- Parameters
batch_size (int, optional) – Batch size of the input data, used in prediction. Default: 1.
seq_length (int, optional) – The sequence length of input_ids. Default: 2048.
vocab_size (int, optional) – Vocabulary size of the Llama model. Default: 32000.
hidden_size (int, optional) – Dimensionality of the Transformer hidden layers. Default: 4096.
num_layers (int, optional) – Number of hidden layers in the Transformer. Default: 32.
num_heads (int, optional) – Number of attention heads in each attention layer. Default: 32.
multiple_of (int, optional) – The value that the SwiGLU hidden layer size must be a multiple of. Default: 256.
n_kv_heads (int, optional) – Number of key/value heads used in multi-group (grouped-query) attention. Default: None.
ffn_dim_multiplier (int, optional) – Multiplier applied to the feed-forward (FFN) layer dimension. Default: None.
rms_norm_eps (float, optional) – The epsilon value added to the denominator. Default: 1e-5.
bos_token_id (int, optional) – The id of the beginning-of-sequence token. Default: 1.
eos_token_id (int, optional) – The id of the end-of-sequence token. Default: 2.
pad_token_id (int, optional) – The id of the padding token. Default: 0.
ignore_token_id (int, optional) – The id of the token to ignore. Default: -100.
compute_dtype (str, optional) – Compute dtype of the linear layers. Default: 'float16'.
layernorm_compute_type (str, optional) – Compute dtype of layer normalization. Default: 'float32'.
softmax_compute_type (str, optional) – Compute dtype of softmax. Default: 'float32'.
rotary_dtype (str, optional) – Compute dtype of the rotary position embedding (RoPE). Default: 'float32'.
param_init_type (str, optional) – Dtype used to initialize parameters. Default: 'float16'.
init_method_std (float, optional) – The sigma (standard deviation) used when Linear layers are initialized from a normal distribution. Default: 0.01.
qkv_has_bias (bool, optional) – Whether the query, key, and value projections have bias. Default: False.
attn_proj_has_bias (bool, optional) – Whether the attention projection has bias. Default: False.
use_past (bool, optional) – Whether the model should use the past key/value attentions (if applicable to the model) to speed up decoding. Default: False.
parallel_config (Union[dict, TransformerOpParallelConfig], optional) – The parallel configuration. Default: default_transformer_config, an instance of TransformerOpParallelConfig with default args.
residual_dtype (str, optional) – Compute dtype of the residual connection. Default: None.
extend_method (str, optional) – The method used to extend the sequence length at inference. Default: 'None'.
use_flash_attention (bool, optional) – Whether to enable flash attention ops. Default: False.
use_ring_attention (bool, optional) – Whether to enable ring attention ops. Default: False.
offset (int, optional) – Offset of the Transformer layers when setting the pipeline stage number. Default: 0.
checkpoint_name_or_path (str, optional) – Checkpoint path or name to load into the network. Default: ''.
repetition_penalty (float, optional) – The parameter for repetition penalty. 1.0 means no penalty. See this paper for more details. Default: 1.0.
max_decode_length (int, optional) – The maximum length the generated tokens can have. Corresponds to the length of the input prompt + max_new_tokens. Its effect is overridden by max_new_tokens, if also set. Default: 1024.
top_k (int, optional) – The number of highest-probability vocabulary tokens to keep for top-k filtering. Default: 5.
top_p (float, optional) – If set to a float < 1, only the smallest set of most probable tokens whose probabilities add up to top_p or higher is kept for generation. Default: 1.0.
do_sample (bool, optional) – Whether to use sampling; greedy decoding is used otherwise. Default: True.
block_size (int, optional) – The maximum number of tokens that one block can hold when using paged attention. Default: 16.
num_blocks (int, optional) – The maximum number of blocks when using paged attention. Default: 512.
tie_word_embeddings (bool, optional) – Whether to tie input and output embeddings. Default: False.
llm_backend (str, optional) – LLM boost backend. Default: ''.
fused_rms_norm (bool, optional) – Whether to use the fused RMSNorm operator. Default: True.
input_sliced_sig (bool, optional) – Set to True if input_ids and labels have already been processed to a length equal to seq_length; otherwise set to False. Default: False.
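Several of the parameters above only make sense in combination. The sketch below shows how they are commonly combined. It does not import MindFormers, and its helper names (swiglu_hidden_size, kv_group_size, paged_cache_capacity) are hypothetical. The feed-forward width follows the reference Llama convention, which is an assumption here: the exact rounding inside MindFormers may differ, and ffn_dim_multiplier is treated as a float for illustration.

```python
from typing import Optional


def swiglu_hidden_size(hidden_size: int = 4096,
                       multiple_of: int = 256,
                       ffn_dim_multiplier: Optional[float] = None,
                       intermediate_size: Optional[int] = None) -> int:
    """Feed-forward width under the reference Llama convention (assumed)."""
    if intermediate_size is not None:
        # Assumption: an explicitly configured width takes precedence.
        return intermediate_size
    width = int(2 * (4 * hidden_size) / 3)  # two thirds of 4 * hidden_size
    if ffn_dim_multiplier is not None:
        width = int(ffn_dim_multiplier * width)
    # Round up to the nearest multiple of `multiple_of`.
    return multiple_of * ((width + multiple_of - 1) // multiple_of)


def kv_group_size(num_heads: int = 32, n_kv_heads: Optional[int] = None) -> int:
    """Number of query heads that share one key/value head."""
    # Assumption: n_kv_heads=None means one key/value head per query head.
    kv_heads = num_heads if n_kv_heads is None else n_kv_heads
    if num_heads % kv_heads != 0:
        raise ValueError("num_heads must be divisible by n_kv_heads")
    return num_heads // kv_heads


def paged_cache_capacity(block_size: int = 16, num_blocks: int = 512) -> int:
    """Upper bound on cached tokens, assuming every block is completely full."""
    return block_size * num_blocks


print(swiglu_hidden_size())    # 11008
print(kv_group_size(32, 8))    # 4
print(paged_cache_capacity())  # 8192
```

With the default values, the paged-attention cache can hold at most 16 * 512 = 8192 tokens across all sequences, which is a useful sanity check against seq_length and batch_size.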
- Returns
LlamaConfig, a LlamaConfig instance.
Examples
>>> from mindformers.models import LlamaConfig
>>> config = LlamaConfig(num_layers=2, seq_length=1024)
>>> print(config)
LlamaConfig {
  "batch_size": 1,
  "block_size": 16,
  "bos_token_id": 1,
  "checkpoint_name_or_path": "",
  "compute_dtype": "float16",
  "do_sample": true,
  "embedding_init_type": "float16",
  "eos_token_id": 2,
  "extend_method": "None",
  "ffn_dim_multiplier": null,
  "fine_grain_interleave": 1,
  "hidden_size": 4096,
  "ignore_token_id": -100,
  "intermediate_size": null,
  "is_dynamic": false,
  "layernorm_compute_type": "float32",
  "llm_backend": "",
  "max_decode_length": 1024,
  "max_position_embedding": 1024,
  "mindformers_version": "dev",
  "model_type": "llama",
  "multiple_of": 256,
  "n_kv_heads": null,
  "num_blocks": 512,
  "num_heads": 32,
  "num_layers": 2,
  "offset": 0,
  "pad_token_id": 0,
  "parallel_decoding_params": null,
  "parallel_optimizer": false,
  "param_init_type": "float16",
  "pp_interleave_num": 1,
  "qkv_concat": false,
  "qkv_has_bias": false,
  "quant_config": null,
  "repetition_penalty": 1.0,
  "rms_norm_eps": 1e-05,
  "rotary_dtype": "float32",
  "scaling_factor": 1.0,
  "seq_length": 1024,
  "softmax_compute_type": "float32",
  "theta": 10000.0,
  "tie_word_embeddings": false,
  "top_k": 5,
  "top_p": 1.0,
  "use_attn_mask_compression": false,
  "use_flash_attention": false,
  "use_past": false,
  "use_ring_attention": false,
  "use_rope_slice": false,
  "vocab_size": 32000
}
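The generation parameters repetition_penalty, top_k, and top_p are easier to reason about with a small standalone sketch. The code below is not the MindFormers implementation: the function names are hypothetical, the logits are made up for illustration, and the ordering (top-k first, then top-p on the unrenormalised probabilities) is one common convention that a given library may not share.

```python
import math
from typing import Dict, List, Sequence


def apply_repetition_penalty(logits: Sequence[float],
                             generated_ids: Sequence[int],
                             penalty: float = 1.0) -> List[float]:
    """Penalise already generated tokens; a penalty of 1.0 changes nothing."""
    out = list(logits)
    for tok in set(generated_ids):
        # Positive logits shrink, negative logits become more negative.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


def top_k_top_p_filter(logits: Sequence[float],
                       top_k: int = 5,
                       top_p: float = 1.0) -> Dict[int, float]:
    """Return renormalised probabilities of the tokens that survive filtering."""
    # Softmax, shifted by the maximum logit for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Top-k: keep the ids of the k most probable tokens, best first.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Top-p: keep the shortest prefix whose cumulative probability reaches top_p.
    kept: List[int] = []
    cumulative = 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.5, -1.0, -3.0, 0.0]   # illustrative values only
penalised = apply_repetition_penalty(logits, generated_ids=[0], penalty=2.0)
candidates = top_k_top_p_filter(penalised, top_k=3, top_p=0.9)
print(sorted(candidates))  # token ids that remain eligible for sampling
```

When do_sample is True, the next token would be drawn from the returned distribution; with greedy decoding, the token with the highest probability would be chosen instead.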