mindformers.generation.GenerationMixin
- class mindformers.generation.GenerationMixin[source]
A class providing all functions for autoregressive text generation, used as a mixin with PreTrainedModel.
- chat(tokenizer: PreTrainedTokenizer, query: str, history: Optional[List[Dict[str, str]]] = None, system_role_name: Optional[str] = 'system', user_role_name: Optional[str] = 'user', assistant_role_name: Optional[str] = 'assistant', instruction: Optional[str] = '', max_length: Optional[int] = 512, max_new_tokens: Optional[int] = None, min_length: Optional[int] = 0, min_new_tokens: Optional[int] = None, do_sample: Optional[bool] = True, temperature: Optional[float] = 1.0, top_k: Optional[int] = 50, top_p: Optional[float] = 1.0, repetition_penalty: Optional[float] = 1.0)[source]
Dialogue text generation inference with large language models. The user query is run through generate() for inference after the chat template is added via the provided tokenizer.
- Parameters
tokenizer (PreTrainedTokenizer) – The tokenizer used to decode the tokens.
query (str) – User input for inference.
history (List[Dict[str, str]], optional) – A Conversation object or list of dicts with "role" and "content" keys, representing the chat history so far. Default: None.
system_role_name (str) – The name of the system role. Default: "system".
user_role_name (str) – The name of the user role. Default: "user".
assistant_role_name (str) – The name of the assistant role. Default: "assistant".
instruction (str, optional) – Instruction message to the model. Default: "".
max_length (int, optional) – The maximum length the generated tokens can have. Corresponds to the length of the input prompt + max_new_tokens. Its effect is overridden by max_new_tokens, if also set. Default: 512.
max_new_tokens (int, optional) – The maximum number of tokens to generate, ignoring the number of tokens in the prompt. Default: None.
min_length (int, optional) – The minimum length of the sequence to be generated. Corresponds to the length of the input prompt + min_new_tokens. Its effect is overridden by min_new_tokens, if also set. Default: 0.
min_new_tokens (int, optional) – The minimum number of tokens to generate, ignoring the number of tokens in the prompt. Default: None.
do_sample (bool, optional) – Whether to sample from the candidate ids. If set to True, sampling is enabled; if set to False, sampling is disabled, which is equivalent to top-k 1. If set to None, it follows the setting in the model configuration. Default: True.
temperature (float, optional) – The value used to modulate the next token probabilities. Default: 1.0.
top_k (int, optional) – The number of top-k token ids to keep as candidates. This should be a positive number. If set to None, it follows the setting in the model configuration. Default: 50.
top_p (float, optional) – Token ids whose cumulative probability is within top_p are selected as candidate ids. The valid value of top_p is between (0, 1]. If the value is larger than 1, the top-k algorithm is enabled instead. If set to None, it follows the setting in the model configuration. Default: 1.0.
repetition_penalty (float, optional) – The penalty factor applied to the frequency of generated words. If set to 1, repetition_penalty is disabled. If set to None, it follows the setting in the model configuration. Default: 1.0.
- Returns
response, the reply from the LLM in this session. history, the conversation history.
Examples
>>> import mindspore as ms
>>> from mindformers.generation import text_generator
>>> from mindformers import AutoModel, AutoTokenizer
>>> ms.set_context(mode=0)
>>> model = AutoModel.from_pretrained("llama2_7b")
>>> tokenizer = AutoTokenizer.from_pretrained("llama2_7b")
>>> query = "Hello!"
>>> response, history = model.chat(tokenizer=tokenizer, query=query, max_length=32)
>>> print(response)
Thanks, sir.
- chunk_prefill_infer(input_ids: [Union[List[int], List[List[int]]]], batch_valid_length: np.ndarray, block_tables: np.ndarray, slot_mapping: np.ndarray, attention_mask: Optional[np.ndarray] = None, **model_kwargs)[source]
Preprocessing for chunk prefill inference.
- Parameters
input_ids (List(List(int))) – Input ids.
batch_valid_length (np.ndarray) – Valid input length.
block_tables (np.ndarray) – Params for page attention.
slot_mapping (np.ndarray) – Params for page attention.
attention_mask (np.ndarray) – Params for page attention.
q_seq_lens (np.ndarray) – Params for page attention.
gather_index (np.ndarray) – Used to obtain the last latent vector of each sequence.
seq_range (np.ndarray) – Used to obtain the mask and positional encoding of valid tokens for each sequence.
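A hedged sketch of assembling these inputs for a paged-attention prefill chunk; the checkpoint name, token ids, block layout, and shapes below are illustrative assumptions, not values required by the API.
>>> import numpy as np
>>> from mindformers import AutoModel
>>> model = AutoModel.from_pretrained("llama2_7b")  # assumed checkpoint
>>> input_ids = [[1, 15043, 29892, 920, 526, 366, 29973, 2]]  # one 8-token chunk (made-up ids)
>>> batch_valid_length = np.array([8], dtype=np.int32)  # valid (non-padding) length per sequence
>>> block_tables = np.array([[0]], dtype=np.int32)  # sequence 0 writes to KV-cache block 0 (assumed layout)
>>> slot_mapping = np.arange(8, dtype=np.int32)  # token i lands in cache slot i (assumed layout)
>>> model.chunk_prefill_infer(input_ids, batch_valid_length, block_tables, slot_mapping)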
- forward(input_ids: [Union[List[int], List[List[int]]]], valid_length_each_example: np.ndarray, block_tables: Optional[Tensor] = None, slot_mapping: Optional[Tensor] = None, prefill: bool = None, use_past: bool = False, encoder_mask: Optional[Tensor] = None, encoder_output: Optional[Tensor] = None, target_mask: Optional[Tensor] = None, **model_kwargs)[source]
Model forward process.
- Parameters
input_ids (List(List(int))) – Input ids after padding.
valid_length_each_example (np.ndarray) – Valid input length of each example, excluding padding.
block_tables (Tensor) – Params for page attention.
slot_mapping (Tensor) – Params for page attention.
prefill (bool) – Whether to run prefill prediction (True) or decode prediction (False).
use_past (bool) – Whether to use past state for incremental inference.
encoder_mask (Tensor) – Used for encoder-decoder architectures; not needed for decoder-only models.
encoder_output (Tensor) – Used for encoder-decoder architectures; not needed for decoder-only models.
target_mask (Tensor) – Used for encoder-decoder architectures; not needed for decoder-only models.
- Returns
res, the result after the forward process. current_index, records the current index of the sequence.
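A minimal sketch of a prefill-style forward call for a decoder-only model; the checkpoint name, token ids, and padded shapes are made-up assumptions, and the paged-attention arguments are omitted for brevity.
>>> import numpy as np
>>> from mindformers import AutoModel
>>> model = AutoModel.from_pretrained("llama2_7b")  # assumed checkpoint
>>> input_ids = [[1, 306, 763, 9608, 29879, 0, 0, 0],
...              [1, 15043, 29991, 0, 0, 0, 0, 0]]  # right-padded batch (made-up ids)
>>> valid_length_each_example = np.array([5, 3], dtype=np.int32)  # lengths before padding
>>> res, current_index = model.forward(input_ids, valid_length_each_example,
...                                    prefill=True, use_past=True)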
- generate(input_ids: Optional[Union[List[int], List[List[int]]]], generation_config: Optional[GenerationConfig] = None, logits_processor: Optional[LogitsProcessorList] = None, streamer: Optional[BaseStreamer] = None, seed: Optional[int] = None, **kwargs)[source]
Generate words according to the given input ids.
Most generation-controlling parameters are set in generation_config which, if not passed, will be set to the model's default generation configuration. You can override any generation_config by passing the corresponding parameters to generate(), e.g. .generate(inputs, top_k=3, do_sample=True).
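For instance, a hedged sketch of the two equivalent ways to set sampling options; the checkpoint name and token ids are assumptions, and GenerationConfig is assumed to accept these keywords as described in its documentation.
>>> from mindformers import AutoModel
>>> from mindformers.generation import GenerationConfig
>>> model = AutoModel.from_pretrained("llama2_7b")  # assumed checkpoint
>>> input_ids = [[1, 15043]]  # made-up token ids
>>> # Pass a config object as the base parametrization ...
>>> config = GenerationConfig(top_k=3, do_sample=True)
>>> out = model.generate(input_ids, generation_config=config)
>>> # ... or override the same attributes directly through kwargs.
>>> out = model.generate(input_ids, top_k=3, do_sample=True)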
- Parameters
input_ids (List(int), List(List(int))) – The token id list or a batch of token id lists. When inputting a batch of token id lists, the length of each token id list should be the same.
generation_config (GenerationConfig, optional) – The generation configuration to be used as base parametrization for the generation call. **kwargs passed to generate matching the attributes of generation_config will override them. If generation_config is not provided, the default config from the model configuration will be used. Please note that unspecified parameters will inherit [GenerationConfig]'s default values, whose documentation should be checked to parameterize generation. Default: None.
logits_processor (LogitsProcessorList, optional) – Custom logits processors that complement the default logits processors built from arguments and generation config. If a logits processor that is already created with the arguments or a generation config is passed, an error is thrown. This feature is intended for advanced users. Default: None.
streamer (TextStreamer) – The streamer that the generator uses.
seed (int) – Random seed used for sampling.
kwargs –
Specific parametrization of generate_config and/or additional model-specific kwargs that will be forwarded to the forward function of the model. Supported generate_config keywords can be checked in [GenerationConfig]'s documentation. Commonly used keywords are shown below:
max_length (int): The maximum length the generated tokens can have. Corresponds to the length of the input prompt + max_new_tokens. Its effect is overridden by max_new_tokens, if also set.
max_new_tokens (int): The maximum number of tokens to generate, ignoring the number of tokens in the prompt.
min_length (int): The minimum length of the sequence to be generated. Corresponds to the length of the input prompt + min_new_tokens. Its effect is overridden by min_new_tokens, if also set.
min_new_tokens (int): The minimum number of tokens to generate, ignoring the number of tokens in the prompt.
do_sample (bool): Whether to sample from the candidate ids. If set to True, sampling is enabled; if set to False, sampling is disabled, which is equivalent to top-k 1. If set to None, it follows the setting in the model configuration.
top_k (int): The number of top-k token ids to keep as candidates. This should be a positive number. If set to None, it follows the setting in the model configuration.
top_p (float): Token ids whose cumulative probability is within top_p are selected as candidate ids. The valid value of top_p is between (0, 1]. If the value is larger than 1, the top-k algorithm is enabled instead. If set to None, it follows the setting in the model configuration.
eos_token_id (int): The end of sentence token id. If set to None, it follows the setting in the model configuration.
pad_token_id (int): The pad token id. If set to None, it follows the setting in the model configuration.
repetition_penalty (float): The penalty factor applied to the frequency of generated words. If set to 1, repetition_penalty is disabled. If set to None, it follows the setting in the model configuration. Default: None.
num_beams (int): Number of beams for beam search. 1 means no beam search. If larger than 1, do_sample will be set to False.
- Returns
A list of the generated token ids.
Examples
>>> from mindformers import LlamaForCausalLM, LlamaTokenizer
>>> import mindspore as ms
>>> ms.set_context(mode=0)
>>> llama = LlamaForCausalLM.from_pretrained("llama2_7b")
>>> tokenizer = LlamaTokenizer.from_pretrained("llama2_7b")
>>> words = "translate the English to the Romanian: UN Chief Says There Is No Military Solution in Syria"
>>> words = tokenizer(words, max_length=21, padding='max_length')['input_ids']
>>> output = llama.generate(words, do_sample=True)
>>> output = tokenizer.decode(output[0], skip_special_tokens=True)
>>> print(output)
UN Chief Says There Is No Military Solution in Syria The United Nations Secretary-General, Ban Ki-moon, said that there is no military solution in Syria, calling on the international community
>>> # Enable the top-p sampling
>>> output = llama.generate(words, do_sample=True, top_p=0.4)
>>> output = tokenizer.decode(output[0], skip_special_tokens=True)
>>> print(output)
UN Chief Says There Is No Military Solution in Syria UN Chief Says There Is No Military Solution in Syria.
>>> # Enable the top-k sampling.
>>> output = llama.generate(words, do_sample=True, top_k=10, top_p=1)
>>> output = tokenizer.decode(output[0], skip_special_tokens=True)
>>> print(output)
Translation by: Adela Popa English Text: UN chief warns Syria conflict threatens entire region
>>> from mindformers import LlamaForCausalLM, LlamaTokenizer
>>> llama = LlamaForCausalLM.from_pretrained("llama2_7b")
>>> tokenizer = LlamaTokenizer.from_pretrained("llama2_7b")
>>> words = "translate the English to the Romanian: UN Chief Says There Is No Military Solution in Syria"
>>> words = tokenizer(words, max_length=21, padding='max_length')['input_ids']
>>> output = llama.generate(words, num_beams=3)
>>> output = tokenizer.decode(output[0], skip_special_tokens=True)
>>> print(output)
UN Chief Says There Is No Military Solution in Syria UN Chief Says There Is No Military Solution in Syria.
- infer(input_ids: Union[List[int], List[List[int]]], valid_length_each_example: np.ndarray, generation_config: GenerationConfig = None, logits_processor: Optional[LogitsProcessorList] = None, logits_warper: Optional[LogitsProcessorList] = None, block_tables: Optional[Tensor] = None, slot_mapping: Optional[Tensor] = None, prefill: bool = True, is_finished: List[bool] = None, encoder_mask: Optional[Tensor] = None, encoder_output: Optional[Tensor] = None, target_mask: Optional[Tensor] = None, **model_kwargs)[source]
Run inference and return the logits at the next position; either prefill or decode prediction can be chosen.
- Parameters
input_ids (List(List(int))) – Input ids after padding.
valid_length_each_example (np.ndarray) – Valid input length of each example, excluding padding.
generation_config (GenerationConfig) – The generation configuration to be used as base parametrization for the generation call.
logits_processor (LogitsProcessorList, optional) – An instance of [LogitsProcessorList]. List of instances of classes derived from [LogitsProcessor] used to modify the prediction scores of the language modeling head applied at each generation step. Default: None.
logits_warper (LogitsProcessorList, optional) – An instance of [LogitsProcessorList]. List of instances of classes derived from [LogitsWarper] used to warp the prediction score distribution of the language modeling head applied before multinomial sampling at each generation step. Default: None.
block_tables (Tensor) – Params for page attention.
slot_mapping (Tensor) – Params for page attention.
prefill (bool) – Whether to run prefill prediction (True) or decode prediction (False).
is_finished (List(bool)) – Whether each sequence has finished its generation.
encoder_mask (Tensor) – Used for encoder-decoder architectures; not needed for decoder-only models.
encoder_output (Tensor) – Used for encoder-decoder architectures; not needed for decoder-only models.
target_mask (Tensor) – Used for encoder-decoder architectures; not needed for decoder-only models.
- Returns
next_token, the next token to be generated. is_finished, whether the sequence has completed its generation task.
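A hedged outline of the prefill-then-decode pattern this method supports; the configuration keyword and loop bookkeeping are assumptions about typical usage rather than a prescribed recipe, and the model/input setup is as in the forward() sketch above.
>>> import numpy as np
>>> from mindformers.generation import GenerationConfig
>>> config = GenerationConfig(max_new_tokens=16)  # assumed keyword
>>> valid_lengths = np.array([len(ids) for ids in input_ids], dtype=np.int32)
>>> next_token, is_finished = model.infer(input_ids, valid_lengths,
...                                       generation_config=config, prefill=True)
>>> # Later steps would append next_token to each unfinished sequence, bump the
>>> # valid lengths, and call infer(..., prefill=False) until all(is_finished).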
- postprocess(input_ids, is_finished, res, generation_config: GenerationConfig, valid_length_each_example, current_index: Optional[Union[List[int], List[List[int]]]], logits_processor: Optional[LogitsProcessorList] = None, logits_warper: Optional[LogitsProcessorList] = None, need_gather_logits: bool = True)[source]
Postprocessing of the output from model generation.
- Parameters
input_ids (List(List(int))) – Input ids after padding.
res (List(List(int))) – Logits after infer.
is_finished (List(bool)) – Whether each sequence has finished its generation.
generation_config (GenerationConfig) – The generation configuration to be used as base parametrization for the generation call.
valid_length_each_example (np.ndarray) – Valid input length of each example, excluding padding.
current_index (List(int)) – Current index of sequence.
logits_processor (LogitsProcessorList, optional) – An instance of [LogitsProcessorList]. List of instances of classes derived from [LogitsProcessor] used to modify the prediction scores of the language modeling head applied at each generation step. Default: None.
logits_warper (LogitsProcessorList, optional) – An instance of [LogitsProcessorList]. List of instances of classes derived from [LogitsWarper] used to warp the prediction score distribution of the language modeling head applied before multinomial sampling at each generation step. Default: None.
need_gather_logits (bool) – Whether to gather the result; set True for the first iteration of decode prediction.
- Returns
target_list, contains the target values generated in each batch. next_probs_cache, cache for probs, if needed in output. next_logits_cache, cache for logits, if needed in output. is_finished, whether the sequence has completed its generation task.
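A hedged sketch of chaining forward() with postprocess() inside one manual decode step; every input is assumed to come from the calls documented above, and the unpacking order follows the return description.
>>> is_finished = [False] * len(input_ids)  # nothing has finished before the first step
>>> res, current_index = model.forward(input_ids, valid_length_each_example, prefill=True)
>>> target_list, next_probs_cache, next_logits_cache, is_finished = model.postprocess(
...     input_ids, is_finished, res, generation_config,
...     valid_length_each_example, current_index,
...     need_gather_logits=True)  # gather the last valid logits on the first iteration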