mindformers.generation.GenerationMixin

class mindformers.generation.GenerationMixin[source]

A class providing all functions for autoregressive text generation, used as a mixin with PreTrainedModel.

chat(tokenizer: PreTrainedTokenizer, query: str, history: Optional[List[Dict[str, str]]] = None, system_role_name: Optional[str] = 'system', user_role_name: Optional[str] = 'user', assistant_role_name: Optional[str] = 'assistant', instruction: Optional[str] = '', max_length: Optional[int] = 512, max_new_tokens: Optional[int] = None, min_length: Optional[int] = 0, min_new_tokens: Optional[int] = None, do_sample: Optional[bool] = True, temperature: Optional[float] = 1.0, top_k: Optional[int] = 50, top_p: Optional[float] = 1.0, repetition_penalty: Optional[float] = 1.0)[source]

Dialogue text generation inference with large language models. The user query is run through generate() after the chat template has been applied via the provided tokenizer.

Parameters
  • tokenizer (PreTrainedTokenizer) – The tokenizer used to decode the tokens.

  • query (str) – User input for inference.

  • history (List[Dict[str, str]], optional) – A Conversation object or list of dicts with "role" and "content" keys, representing the chat history so far. Default: None.

  • system_role_name (str) – The name of system role. Default: "system".

  • user_role_name (str) – The name of user role. Default: "user".

  • assistant_role_name (str) – The name of assistant role. Default: "assistant".

  • instruction (str, optional) – Instruction message to the model. Default: "".

  • max_length (int, optional) – The maximum length the generated tokens can have. Corresponds to the length of the input prompt + max_new_tokens. Its effect is overridden by max_new_tokens, if also set. Default: 512.

  • max_new_tokens (int, optional) – The maximum numbers of tokens to generate, ignoring the number of tokens in the prompt. Default: None.

  • min_length (int, optional) – The minimum length of the sequence to be generated. Corresponds to the length of the input prompt + min_new_tokens. Its effect is overridden by min_new_tokens, if also set. Default: 0.

  • min_new_tokens (int, optional) – The minimum numbers of tokens to generate, ignoring the number of tokens in the prompt. Default: None.

  • do_sample (bool, optional) – Whether to sample from the candidate token ids. If set to True, sampling is enabled; if set to False, sampling is disabled, which is equivalent to top-k 1. If set to None, it follows the setting in the model configuration. Default: True.

  • temperature (float, optional) – The value used to modulate the next token probabilities. Default: 1.0.

  • top_k (int, optional) – The number of highest-probability token ids to keep as candidates. This should be a positive number. If set to None, it follows the setting in the model configuration. Default: 50.

  • top_p (float, optional) – Candidate token ids whose cumulative probability is below top-p are selected as the candidate ids. The valid value of top-p is in (0, 1]. If the value is larger than 1, the top-k algorithm is enabled instead. If set to None, it follows the setting in the model configuration. Default: 1.0.

  • repetition_penalty (float, optional) – The penalty factor applied to the frequency of already-generated words. If set to 1, repetition_penalty is disabled. If set to None, it follows the setting in the model configuration. Default: 1.0.

Returns

response, the reply from the LLM in this session. history, the conversation history.

Examples

>>> import mindspore as ms
>>> from mindformers.generation import text_generator
>>> from mindformers import AutoModel, AutoTokenizer
>>> ms.set_context(mode=0)
>>> model = AutoModel.from_pretrained("llama2_7b")
>>> tokenizer = AutoTokenizer.from_pretrained("llama2_7b")
>>> query = "Hello!"
>>> response, history = model.chat(tokenizer=tokenizer, query=query, max_length=32)
>>> print(response)
Thanks, sir.
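
The returned history can be passed back in to continue the conversation. A minimal sketch of a follow-up turn reusing the session above (the query text is illustrative only):

>>> response, history = model.chat(tokenizer=tokenizer, query="Tell me more.",
...                                history=history, max_length=64)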
chunk_prefill_infer(input_ids: Union[List[int], List[List[int]]], batch_valid_length: np.ndarray, block_tables: np.ndarray, slot_mapping: np.ndarray, attention_mask: Optional[np.ndarray] = None, **model_kwargs)[source]

Preprocessing for chunked prefill inference.

Parameters
  • input_ids (List(List(int))) – Input ids.

  • batch_valid_length (np.ndarray) – Valid input length.

  • block_tables (np.ndarray) – Params for page attention.

  • slot_mapping (np.ndarray) – Params for page attention.

  • attention_mask (np.ndarray) – Params for page attention.

  • q_seq_lens (np.ndarray) – Params for page attention.

  • gather_index (np.ndarray) – Used to obtain the last latent vector of each sequence.

  • seq_range (np.ndarray) – Used to obtain Mask and positional encoding of valid tokens for each sequence.
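
A heavily hedged sketch of the expected call shape. The block_tables and slot_mapping arrays must come from your paged KV-cache manager; the placeholder values, shapes, and the model/tokenizer variables below are illustrative assumptions, not part of the API:

>>> import numpy as np
>>> ids = tokenizer("Hello!")['input_ids']               # tokenizer loaded as in the chat example
>>> batch_valid_length = np.array([len(ids)], dtype=np.int32)
>>> block_tables = np.array([[0]], dtype=np.int32)       # placeholder block table
>>> slot_mapping = np.arange(len(ids), dtype=np.int32)   # placeholder slot mapping
>>> model.chunk_prefill_infer(input_ids=[ids],
...                           batch_valid_length=batch_valid_length,
...                           block_tables=block_tables,
...                           slot_mapping=slot_mapping)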

forward(input_ids: Union[List[int], List[List[int]]], valid_length_each_example: np.ndarray, block_tables: Optional[Tensor] = None, slot_mapping: Optional[Tensor] = None, prefill: bool = None, use_past: bool = False, encoder_mask: Optional[Tensor] = None, encoder_output: Optional[Tensor] = None, target_mask: Optional[Tensor] = None, **model_kwargs)[source]

Model forward process.

Parameters
  • input_ids (List(List(int))) – Input ids after padding.

  • valid_length_each_example (np.ndarray) – Valid input length, excluding padding.

  • block_tables (Tensor) – Params for page attention.

  • slot_mapping (Tensor) – Params for page attention.

  • prefill (bool) – Whether to run prefill prediction (True) or decode prediction (False).

  • use_past (bool) – Whether to use past key values (incremental inference).

  • encoder_mask (Tensor) – Used for encoder-decoder architectures; not needed for decoder-only architectures.

  • encoder_output (Tensor) – Used for encoder-decoder architectures; not needed for decoder-only architectures.

  • target_mask (Tensor) – Used for encoder-decoder architectures; not needed for decoder-only architectures.

Returns

res, the result after the forward process. current_index, records the current index of the sequence.
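
A minimal, illustrative sketch of a prefill forward call, assuming a model and tokenizer loaded as in the examples elsewhere on this page; the padding length and keyword choices are assumptions for illustration, not a definitive recipe:

>>> import numpy as np
>>> from mindformers import LlamaForCausalLM, LlamaTokenizer
>>> model = LlamaForCausalLM.from_pretrained("llama2_7b")
>>> tokenizer = LlamaTokenizer.from_pretrained("llama2_7b")
>>> ids = tokenizer("Hello!")['input_ids']
>>> padded = ids + [0] * (32 - len(ids))   # illustrative padding to a fixed length
>>> res, current_index = model.forward(input_ids=[padded],
...                                    valid_length_each_example=np.array([len(ids)], dtype=np.int32),
...                                    prefill=True,
...                                    use_past=True)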

generate(input_ids: Optional[Union[List[int], List[List[int]]]], generation_config: Optional[GenerationConfig] = None, logits_processor: Optional[LogitsProcessorList] = None, streamer: Optional[BaseStreamer] = None, seed: Optional[int] = None, **kwargs)[source]

Generate words according to the given input ids.

Most generation-controlling parameters are set in generation_config which, if not passed, will be set to the model's default generation configuration. You can override any generation_config by passing the corresponding parameters to generate(), e.g. .generate(inputs, top_k=3, do_sample=True).

Parameters
  • input_ids (List(int), List(List(int))) – A token id list or a batch of token id lists. When a batch of token id lists is passed, each list must have the same length.

  • generation_config (GenerationConfig, optional) – The generation configuration to be used as base parametrization for the generation call. **kwargs passed to generate matching the attributes of generation_config will override them. If generation_config is not provided, the default config from the model configuration will be used. Please note that unspecified parameters will inherit [GenerationConfig]'s default values, whose documentation should be checked to parameterize generation. Default: None.

  • logits_processor (LogitsProcessorList, optional) – Custom logits processors that complement the default logits processors built from arguments and generation config. If a logit processor is passed that is already created with the arguments or a generation config an error is thrown. This feature is intended for advanced users. Default: None.

  • streamer (TextStreamer) – The streamer used by the generator.

  • seed (int) – Random seed used in sampling.

  • kwargs

    Specific parametrization of generate_config and/or additional model-specific kwargs that will be forwarded to the forward function of the model. Supported generate_config keywords can be checked in [GenerationConfig]'s documentation. Commonly used keywords are shown below:

    • max_length (int): The maximum length the generated tokens can have. Corresponds to the length of the input prompt + max_new_tokens. Its effect is overridden by max_new_tokens, if also set.

    • max_new_tokens (int): The maximum numbers of tokens to generate, ignoring the number of tokens in the prompt.

    • min_length (int): The minimum length of the sequence to be generated. Corresponds to the length of the input prompt + min_new_tokens. Its effect is overridden by min_new_tokens, if also set.

    • min_new_tokens (int): The minimum numbers of tokens to generate, ignoring the number of tokens in the prompt.

    • do_sample (bool): Whether to sample from the candidate token ids. If set to True, sampling is enabled; if set to False, sampling is disabled, which is equivalent to top-k 1. If set to None, it follows the setting in the model configuration.

    • top_k (int): The number of highest-probability token ids to keep as candidates. This should be a positive number. If set to None, it follows the setting in the model configuration.

    • top_p (float): Candidate token ids whose cumulative probability is below top-p are selected as the candidate ids. The valid value of top-p is in (0, 1]. If the value is larger than 1, the top-k algorithm is enabled instead. If set to None, it follows the setting in the model configuration.

    • eos_token_id (int): The end of sentence token id. If set to None, it follows the setting in the model configuration.

    • pad_token_id (int): The pad token id. If set to None, it follows the setting in the model configuration.

    • repetition_penalty (float): The penalty factor applied to the frequency of already-generated words. If set to 1, repetition_penalty is disabled. If set to None, it follows the setting in the model configuration. Default: None.

    • num_beams (int): Number of beams for beam search. 1 means no beam search. If larger than 1, do_sample will be set to False.

Returns

A list of the generated token ids.

Examples

>>> from mindformers import LlamaForCausalLM, LlamaTokenizer
>>> import mindspore as ms
>>> ms.set_context(mode=0)
>>> llama = LlamaForCausalLM.from_pretrained("llama2_7b")
>>> tokenizer = LlamaTokenizer.from_pretrained("llama2_7b")
>>> words = "translate the English to the Romanian: UN Chief Says There Is No Military Solution in Syria"
>>> words = tokenizer(words, max_length=21, padding='max_length')['input_ids']
>>> output = llama.generate(words, do_sample=True)
>>> output = tokenizer.decode(output[0], skip_special_tokens=True)
>>> print(output)
UN Chief Says There Is No Military Solution in Syria
The United Nations Secretary-General, Ban Ki-moon, said that there is no military solution in Syria,
calling on the international community
>>> # Enable the top-p sampling
>>> output = llama.generate(words, do_sample=True, top_p=0.4)
>>> output = tokenizer.decode(output[0], skip_special_tokens=True)
>>> print(output)
UN Chief Says There Is No Military Solution in Syria
UN Chief Says There Is No Military Solution in Syria.
>>> # Enable the top-k sampling.
>>> output = llama.generate(words, do_sample=True, top_k=10, top_p=1)
>>> output = tokenizer.decode(output[0], skip_special_tokens=True)
>>> print(output)
Translation by: Adela Popa
English Text: UN chief warns Syria conflict threatens entire region
>>> from mindformers import LlamaForCausalLM, LlamaTokenizer
>>> llama = LlamaForCausalLM.from_pretrained("llama2_7b")
>>> tokenizer = LlamaTokenizer.from_pretrained("llama2_7b")
>>> words = "translate the English to the Romanian: UN Chief Says There Is No Military Solution in Syria"
>>> words = tokenizer(words, max_length=21, padding='max_length')['input_ids']
>>> output = llama.generate(words, num_beams=3)
>>> output = tokenizer.decode(output[0], skip_special_tokens=True)
>>> print(output)
UN Chief Says There Is No Military Solution in Syria
UN Chief Says There Is No Military Solution in Syria.
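
Generation parameters can also be collected in an explicit GenerationConfig; keyword arguments passed to generate() still override it. A minimal sketch, assuming GenerationConfig is importable from mindformers.generation and reusing the model, tokenizer, and words from the example above:

>>> from mindformers.generation import GenerationConfig
>>> gen_config = GenerationConfig(max_new_tokens=16, do_sample=False)
>>> output = llama.generate(words, generation_config=gen_config, repetition_penalty=1.1)
>>> output = tokenizer.decode(output[0], skip_special_tokens=True)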
infer(input_ids: Union[List[int], List[List[int]]], valid_length_each_example: np.ndarray, generation_config: GenerationConfig = None, logits_processor: Optional[LogitsProcessorList] = None, logits_warper: Optional[LogitsProcessorList] = None, block_tables: Optional[Tensor] = None, slot_mapping: Optional[Tensor] = None, prefill: bool = True, is_finished: List[bool] = None, encoder_mask: Optional[Tensor] = None, encoder_output: Optional[Tensor] = None, target_mask: Optional[Tensor] = None, **model_kwargs)[source]

Run inference and return the logits at the next position; either prefill or decode prediction can be chosen.

Parameters
  • input_ids (List(List(int))) – Input ids after padding.

  • valid_length_each_example (np.ndarray) – Valid input length, excluding padding.

  • generation_config (GenerationConfig) – The generation configuration to be used as base parametrization for the generation call.

  • logits_processor (LogitsProcessorList, optional) – An instance of [LogitsProcessorList]. List of instances of class derived from [LogitsProcessor] used to modify the prediction scores of the language modeling head applied at each generation step. Default: None.

  • logits_warper (LogitsProcessorList, optional) – An instance of [LogitsProcessorList]. List of instances of class derived from [LogitsWarper] used to warp the prediction score distribution of the language modeling head applied before multinomial sampling at each generation step. Default: None.

  • block_tables (Tensor) – Params for page attention.

  • slot_mapping (Tensor) – Params for page attention.

  • prefill (bool) – Whether to run prefill prediction (True) or decode prediction (False).

  • is_finished (List(bool)) – Whether each sequence has finished its generation.

  • encoder_mask (Tensor) – Used for encoder-decoder architectures; not needed for decoder-only architectures.

  • encoder_output (Tensor) – Used for encoder-decoder architectures; not needed for decoder-only architectures.

  • target_mask (Tensor) – Used for encoder-decoder architectures; not needed for decoder-only architectures.

Returns

next_token, the next token to be generated. is_finished, whether the sequence has completed its generation task.
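
A minimal, heavily hedged sketch of a single prefill step, assuming a model and tokenizer loaded as in the chat() example above; the padding length and argument wiring are illustrative assumptions:

>>> import numpy as np
>>> ids = tokenizer("Hello!")['input_ids']
>>> padded = ids + [0] * (32 - len(ids))   # illustrative padding to a fixed length
>>> next_token, is_finished = model.infer(input_ids=[padded],
...                                       valid_length_each_example=np.array([len(ids)], dtype=np.int32),
...                                       prefill=True)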

postprocess(input_ids, is_finished, res, generation_config: GenerationConfig, valid_length_each_example, current_index: Optional[Union[List[int], List[List[int]]]], logits_processor: Optional[LogitsProcessorList] = None, logits_warper: Optional[LogitsProcessorList] = None, need_gather_logits: bool = True)[source]

Postprocess the output of model generation.

Parameters
  • input_ids (List(List(int))) – Input ids after padding.

  • res (List(List(int))) – Logits after infer.

  • is_finished (List(bool)) – Whether each sequence has finished its generation.

  • generation_config (GenerationConfig) – The generation configuration to be used as base parametrization for the generation call.

  • valid_length_each_example (np.ndarray) – Valid input length, excluding padding.

  • current_index (List(int)) – Current index of sequence.

  • logits_processor (LogitsProcessorList, optional) – An instance of [LogitsProcessorList]. List of instances of class derived from [LogitsProcessor] used to modify the prediction scores of the language modeling head applied at each generation step. Default: None.

  • logits_warper (LogitsProcessorList, optional) – An instance of [LogitsProcessorList]. List of instances of class derived from [LogitsWarper] used to warp the prediction score distribution of the language modeling head applied before multinomial sampling at each generation step. Default: None.

  • need_gather_logits (bool) – Whether to gather the result; set to True when doing decode prediction on the first iteration.

Returns

target_list, contains the target values generated in each batch. next_probs_cache, cache for probs, if needed in output. next_logits_cache, cache for logits, if needed in output. is_finished, whether the sequence has completed its generation task.
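
A heavily hedged sketch of how postprocess() consumes the outputs of forward(), assuming the model, tokenizer, padded input, and GenerationConfig instance (gen_config) from the earlier sketches on this page; the argument wiring is illustrative only:

>>> import numpy as np
>>> valid_length = np.array([len(ids)], dtype=np.int32)
>>> res, current_index = model.forward(input_ids=[padded],
...                                    valid_length_each_example=valid_length,
...                                    prefill=True)
>>> targets, probs_cache, logits_cache, is_finished = model.postprocess(
...     input_ids=[padded], is_finished=[False], res=res,
...     generation_config=gen_config,
...     valid_length_each_example=valid_length,
...     current_index=current_index)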