Inference

Overview

MindFormers provides foundation model inference capabilities. Users can start inference either by running the unified script run_mindformer or by writing a script that calls the high-level pipeline API. With the unified script run_mindformer, inference can be started directly from a configuration file without writing any code.

Basic Process

The inference process consists of the following steps:

1. Selecting a Model for Inference

Choose a model according to the required inference task. For example, for text generation you can choose Llama2.

2. Preparing Model Weights

Model weights fall into two types: complete weights and distributed weights. Refer to the following instructions when using them.

Complete Weights

Complete weights can be obtained in two ways:

  1. After downloading the open-source weights of the corresponding model from the HuggingFace model library, refer to Weight Format Conversion to convert them to the ckpt format.

  2. Merge the distributed weights obtained from pre-training or fine-tuning into a complete weight, as sketched below.
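
The merging mentioned in point 2 can also be done offline with MindSpore's checkpoint transformation interface. The following is a minimal Python sketch, assuming the mindspore.transform_checkpoints interface and hypothetical local paths (the rank_x sub-directory layout and the strategy file name must match your actual training output); for the documented procedure, see Distributed Weight Slicing and Merging.

import mindspore as ms

# Hypothetical paths: adjust them to your actual training output.
# src_checkpoints_dir must contain one sub-directory per rank, e.g. rank_0/, rank_1/, ...
src_checkpoints_dir = "./output/checkpoint_network"
dst_checkpoints_dir = "./output/merged_checkpoint"
src_strategy_file = "./strategy/ckpt_strategy_rank_0.ckpt"  # slicing strategy used during training

# Leaving dst_strategy_file as None merges the distributed slices into a complete weight.
ms.transform_checkpoints(src_checkpoints_dir,
                         dst_checkpoints_dir,
                         "checkpoint_",  # prefix of the generated checkpoint files
                         src_strategy_file=src_strategy_file,
                         dst_strategy_file=None)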

Distributed Weights

Distributed weights are typically obtained from pre-training or fine-tuning and are stored by default in the ./output/checkpoint_network directory. They need to be converted to single-card or multi-card weights before single-card or multi-card inference is performed.

If the weight slicing used for inference differs from the model slicing of the inference task, for example in the cases below, the weights need to be additionally converted to a slicing that matches the model in the actual inference task.

  • Weights obtained from multi-card training are used for single-card inference;

  • Weights obtained from eight-card training are used for two-card inference;

  • Already sliced distributed weights are used for single-card inference, and so on.

The sample commands in the following sections all use online automatic slicing, which is the recommended approach: set the command parameter --auto_trans_ckpt to True and --src_strategy_path_or_dir to the path of the weight slicing strategy file or directory (saved by default under ./output/strategy after training), and the weights are then sliced automatically during the inference task. Details can be found in Distributed Weight Slicing and Merging; an offline alternative is sketched at the end of this step.

Since both training and inference tasks use ./output as the default output path, when the strategy file output by a training task is used as the source weight strategy file for an inference task, the strategy file directory under the default output path needs to be moved to another location so that it is not cleared by the inference task, for example:

mv ./output/strategy/ ./strategy
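
If you prefer not to use online auto-slicing, already sliced weights can also be converted offline to the slicing required by the inference task before inference starts. The following is a minimal Python sketch under the same assumptions as above (mindspore.transform_checkpoints, hypothetical paths); the destination strategy file describes the slicing of the inference task, and how to obtain it is covered in Distributed Weight Slicing and Merging.

import mindspore as ms

# Hypothetical paths: adjust them to your actual environment.
src_checkpoints_dir = "./output/checkpoint_network"       # sliced weights from training (rank_x sub-directories)
dst_checkpoints_dir = "./output/transformed_checkpoint"
src_strategy_file = "./strategy/ckpt_strategy_rank_0.ckpt"          # slicing strategy of the training task
dst_strategy_file = "./strategy_predict/ckpt_strategy_rank_0.ckpt"  # slicing strategy of the inference task

# Re-slice the training weights so that they match the slicing of the inference task.
ms.transform_checkpoints(src_checkpoints_dir,
                         dst_checkpoints_dir,
                         "checkpoint_",
                         src_strategy_file=src_strategy_file,
                         dst_strategy_file=dst_strategy_file)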

3. Executing Inference Tasks

Call the pipeline API or use the unified script run_mindformer to execute inference tasks.

Inference Based on the run_mindformer Script

For single-device inference, you can directly run run_mindformer.py. For multi-device inference, you need to run scripts/msrun_launcher.sh.

The arguments to run_mindformer.py are described below:

  • config: path to the yaml configuration file.

  • run_mode: the running mode; set it to predict for inference.

  • use_parallel: whether to use multi-card inference.

  • load_checkpoint: path of the weights to load.

  • predict_data: input data for inference. For multi-batch inference, pass the path to a txt file of input data that contains multiple lines of inputs.

  • auto_trans_ckpt: whether to enable automatic weight slicing. The default value is False.

  • src_strategy_path_or_dir: path to the strategy file of the weights.

  • predict_batch_size: the batch_size for multi-batch inference.

msrun_launcher.sh takes two parameters: the run_mindformer.py command (as a string) and the number of cards used for inference.

The following describes how to use single-card and multi-card inference, taking Llama2 as an example; the predict_llama2_7b.yaml configuration file is recommended.

During inference, the vocabulary file tokenizer.model required by the Llama2 model is downloaded automatically (ensure network connectivity). If the file already exists locally, you can place it in the ./checkpoint_download/Llama2/ directory in advance.

Single-Device Inference

When using complete weight inference, the following command is executed to start the inference task:

python run_mindformer.py \
--config configs/llama2/predict_llama2_7b.yaml \
--run_mode predict \
--use_parallel False \
--load_checkpoint path/to/checkpoint.ckpt \
--predict_data 'I love Beijing, because'

If you use distributed weight files for inference, you need to add the --auto_trans_ckpt and --src_strategy_path_or_dir options. The startup command is as follows:

python run_mindformer.py \
--config configs/llama2/predict_llama2_7b.yaml \
--run_mode predict \
--use_parallel False \
--auto_trans_ckpt True \
--src_strategy_path_or_dir ./strategy \
--load_checkpoint path/to/checkpoint.ckpt \
--predict_data 'I love Beijing, because'

If the following result appears, the inference is successful. The inference result is also saved to the text_generation_result.txt file in the current directory. The detailed log can be viewed in the ./output/msrun_log directory.

'text_generation_text': [I love Beijing, because it is a city that is constantly constantly changing. I have been living here for ......]

Multi-Card Inference

The configuration requirements for multi-card inference differ from those for single-card inference. Modify the predict_llama2_7b.yaml configuration by referring to the following instructions.

  1. The value of model_parallel must be consistent with the number of cards used. The following use case is 2-card inference, so model_parallel needs to be set to 2;

  2. The current version of multi-card inference does not support data parallelism; data_parallel needs to be set to 1.

Configuration before modification:

parallel_config:
  data_parallel: 8
  model_parallel: 1
  pipeline_stage: 1

Configuration after modification:

parallel_config:
  data_parallel: 1
  model_parallel: 2
  pipeline_stage: 1

When complete weight inference is used, you need to enable the online slicing mode to load the weights. Refer to the following command:

bash scripts/msrun_launcher.sh "python run_mindformer.py \
--config configs/llama2/predict_llama2_7b.yaml \
--run_mode predict \
--use_parallel True \
--auto_trans_ckpt True \
--load_checkpoint path/to/checkpoint.ckpt \
--predict_data 'I love Beijing, because'" \
2

Refer to the following command when distributed weight inference is used and the slicing strategy of the weights is the same as that of the model:

bash scripts/msrun_launcher.sh "python run_mindformer.py \
--config configs/llama2/predict_llama2_7b.yaml \
--run_mode predict \
--use_parallel True \
--load_checkpoint path/to/checkpoint_dir \
--predict_data 'I love Beijing, because'" \
2

When distributed weight inference is used and the slicing strategy of the weights is not consistent with the slicing strategy of the model, you need to enable the online slicing function to load the weights. Refer to the following command:

bash scripts/msrun_launcher.sh "python run_mindformer.py \
--config configs/llama2/predict_llama2_7b.yaml \
--run_mode predict \
--use_parallel True \
--auto_trans_ckpt True \
--src_strategy_path_or_dir ./strategy \
--load_checkpoint path/to/checkpoint_dir \
--predict_data 'I love Beijing, because'" \
2

Inference results are viewed in the same way as single-card inference.

Multi-Device Multi-Batch Inference

Multi-card multi-batch inference is started in the same way as multi-card inference, but you need to add the predict_batch_size argument and modify the predict_data argument.

Each line of the input_predict_data.txt file is one input, and the number of inputs is the same as predict_batch_size, for example:

I love Beijing, because
I love Beijing, because
I love Beijing, because
I love Beijing, because

Refer to the following command to perform the inference task, taking complete weight inference as an example:

bash scripts/msrun_launcher.sh "python run_mindformer.py \
--config configs/llama2/predict_llama2_7b.yaml \
--run_mode predict \
--predict_batch_size 4 \
--use_parallel True \
--auto_trans_ckpt True \
--load_checkpoint path/to/checkpoint.ckpt \
--predict_data path/to/input_predict_data.txt" \
2

Inference results are viewed in the same way as single-card inference.

Inference Based on Pipeline Interface

Based on the pipeline API, you can customize the text generation inference workflow; both single-card and multi-card inference are supported. The following implementations show how to use the pipeline API to start a task and output the result. For detailed parameter descriptions, see the pipeline API documentation.

Incremental Inference

from mindformers import build_context
from mindformers import AutoModel, AutoTokenizer, pipeline

# Construct the input content.
inputs = ["I love Beijing, because", "LLaMA is a", "Huawei is a company that"]

# Initialize the environment.
build_context({'context': {'mode': 0}, 'parallel': {}, 'parallel_config': {}})

# Instantiate a tokenizer.
tokenizer = AutoTokenizer.from_pretrained('llama2_7b')

# Instantiate a model.
# Modify the path to the local weight path.
model = AutoModel.from_pretrained('llama2_7b', checkpoint_name_or_path="path/to/llama2_7b.ckpt", use_past=True)

# Start a non-stream inference task in the pipeline.
text_generation_pipeline = pipeline(task="text_generation", model=model, tokenizer=tokenizer)
outputs = text_generation_pipeline(inputs, max_length=512, do_sample=False, top_k=3, top_p=1)
for output in outputs:
    print(output)

Save the example to pipeline_inference.py, modify the path for loading the weight, and run the pipeline_inference.py script.

python pipeline_inference.py

The inference result is as follows:

'text_generation_text': [I love Beijing, because it is a city that is constantly constantly changing. I have been living here for ......]
'text_generation_text': [LLaMA is a large-scale, open-source, multimodal, multilingual, multitask, and multimodal pretrained language model. It is ......]
'text_generation_text': [Huawei is a company that has been around for a long time. ......]

Stream Inference

from mindformers import build_context
from mindformers import AutoModel, AutoTokenizer, pipeline, TextStreamer

# Construct the input content.
inputs = ["I love Beijing, because", "LLaMA is a", "Huawei is a company that"]

# Initialize the environment.
build_context({'context': {'mode': 0}, 'parallel': {}, 'parallel_config': {}})

# Instantiate a tokenizer.
tokenizer = AutoTokenizer.from_pretrained('llama2_7b')

# Instantiate a model.
# Modify the path to the local weight path.
model = AutoModel.from_pretrained('llama2_7b', checkpoint_name_or_path="path/to/llama2_7b.ckpt", use_past=True)

# Start a stream inference task in the pipeline.
streamer = TextStreamer(tokenizer)
text_generation_pipeline = pipeline(task="text_generation", model=model, tokenizer=tokenizer, streamer=streamer)
_ = text_generation_pipeline(inputs, max_length=512, do_sample=False, top_k=3, top_p=1)

Save the example to pipeline_inference.py, modify the path for loading the weight, and run the pipeline_inference.py script.

python pipeline_inference.py

The inference result is as follows:

'text_generation_text': [I love Beijing, because it is a city that is constantly constantly changing. I have been living here for ......]
'text_generation_text': [LLaMA is a large-scale, open-source, multimodal, multilingual, multitask, and multimodal pretrained language model. It is ......]
'text_generation_text': [Huawei is a company that has been around for a long time. ......]

More Information

For more inference examples of different models, see the models supported by MindFormers.