Development Migration
This document describes how to develop and build foundation models based on MindFormers and complete the basic adaptation required to start the training and inference processes.
Building a Foundation Model Based on MindFormers
The basic components of a foundation model in MindFormers include the configurations, models, and tokenizers for large language models (LLMs). In addition, to use the run_mindformer.py unified script to start the training or inference process, you need to prepare the YAML configuration file for training or inference.
Writing Configurations
A model configuration is an instance that contains all information about a model. The `__init__` methods of all models in MindFormers receive a model configuration instance as the input parameter. All submodules of the model are initialized based on the information contained in the configuration instance.
MindFormers provides the PretrainedConfig class, which provides some common configuration methods. The configuration classes of all models should inherit from the PretrainedConfig class. Developers only need to define the configuration parameters needed to build their foundation model. Transformer-based foundation models have configuration parameters such as `seq_length`, `hidden_size`, `num_layers`, and `num_heads`, and text-based foundation models additionally have `vocab_size`.
For details, see the configuration class LlamaConfig of the Llama model in MindFormers.
If your model is similar to a model in the library, you can reuse the configuration of that model.
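For reference, the following is a minimal sketch of what such a configuration class can look like. The class name MyLlmConfig, its parameters and default values, and the exact import path of PretrainedConfig are assumptions for illustration; use LlamaConfig in MindFormers as the authoritative reference.

```python
# Minimal sketch of a custom configuration class (illustrative only; the import
# path of PretrainedConfig may differ between MindFormers versions).
from mindformers.models.configuration_utils import PretrainedConfig


class MyLlmConfig(PretrainedConfig):
    """Configuration of a hypothetical Transformer-based LLM."""

    model_type = "my_llm"

    def __init__(self,
                 seq_length=4096,
                 hidden_size=4096,
                 num_layers=32,
                 num_heads=32,
                 vocab_size=32000,
                 **kwargs):
        super().__init__(**kwargs)
        # Store every parameter that submodules need when the model is built.
        self.seq_length = seq_length
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.num_heads = num_heads
        self.vocab_size = vocab_size
```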
Writing a Model
The MindFormers foundation model is developed based on the MindSpore framework. If your model has been implemented based on PyTorch, see MindSpore Network Construction. Developers only need to pay attention to the implementation of the model network.
MindFormers provides the PretrainedModel class, which is responsible for storing model configurations and provides the methods for loading and saving models. All model classes must inherit from the PretrainedModel class, and the model input must be the same. That is, the input parameters of the `construct` method of every model must be the same. For details about the input parameters and their meanings, see the Llama model class LlamaForCausalLM in MindFormers. In addition, the model class must implement some abstract methods of the base class, including:

- `prepare_inputs_for_generation`: builds the input for model inference.
- `prepare_inputs_for_predict_layout`: builds virtual input for loading model weights in distributed scenarios.
For specific meanings, refer to the descriptions in LlamaForCausalLM.
If your model structure is similar to that of a model in the library, you can reuse the model.
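As a rough picture of the structure, a model class can be sketched as follows, building on the MyLlmConfig sketch above. The backbone (a plain embedding here), the construct signature, and the import path are simplified assumptions; the real reference is LlamaForCausalLM, whose construct parameters and abstract-method implementations should be followed.

```python
# Minimal sketch of a custom model class (illustrative only; align the import
# path, construct signature, and return values with LlamaForCausalLM).
from mindspore import nn
from mindformers.models import PreTrainedModel  # spelled PretrainedModel in some versions


class MyLlmForCausalLM(PreTrainedModel):
    config_class = MyLlmConfig  # the configuration class sketched above

    def __init__(self, config):
        super().__init__(config)
        # A real model builds its Transformer layers from the configuration;
        # an embedding plus a dense head stands in for the backbone here.
        self.embedding = nn.Embedding(config.vocab_size, config.hidden_size)
        self.lm_head = nn.Dense(config.hidden_size, config.vocab_size, has_bias=False)

    def construct(self, input_ids, labels=None, **kwargs):
        # Keep the inputs consistent with the other models in the library.
        hidden_states = self.embedding(input_ids)
        logits = self.lm_head(hidden_states)
        return logits

    def prepare_inputs_for_generation(self, input_ids, **kwargs):
        # Build the input for model inference.
        return {"input_ids": input_ids}

    def prepare_inputs_for_predict_layout(self, input_ids, **kwargs):
        # Build virtual input for loading model weights in distributed scenarios.
        return input_ids
```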
Writing a Tokenizer (for LLMs)
A tokenizer is used to process input and output of LLMs. It is required in the workflow of LLMs.
MindFormers provides the PretrainedTokenizer and PretrainedTokenizerFast classes, which are implemented in pure Python and based on the Rust library, respectively. The latter has the following features:

- Faster batch processing.
- Additional methods for mapping between text strings and token space, for example, obtaining the index of the token that contains a given character, or the span of characters corresponding to a given token.
All tokenizer classes must inherit from the PretrainedTokenizer or PretrainedTokenizerFast class. For details, see LlamaTokenizer and LlamaTokenizerFast.
If your tokenizer is similar to one in the library, you can reuse it.
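As a quick way to check that a tokenizer behaves as expected, it can be instantiated and called directly. The snippet below is a sketch: the vocabulary file path is a placeholder, and the top-level import of LlamaTokenizer should be adjusted to your MindFormers version.

```python
# Quick sanity check of a tokenizer (illustrative; the vocabulary path is a placeholder).
from mindformers import LlamaTokenizer

tokenizer = LlamaTokenizer(vocab_file="/path/to/tokenizer.model")

# Encode a prompt into token IDs, then decode the IDs back into text.
encoded = tokenizer("hello world")
print(encoded["input_ids"])
print(tokenizer.decode(encoded["input_ids"]))
```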
Preparing a Weight and a Dataset
If a PyTorch-based model weight already exists, you can convert it to the MindSpore format by referring to Weight Conversion.
For details about how to prepare a dataset, see Dataset or the model document, for example, Llama2 Description Document > Dataset Preparation.
Preparing a YAML File
MindFormers uses a YAML file to configure all parameters required by a task, including model parameters, training parameters (such as the optimizer, learning rate, and dataset), inference parameters (such as the tokenizer), distributed parallel parameters, and context environment parameters.
The code of a customized model is not in the MindFormers library, and the customized modules in the code are not registered with MindFormers, so the model cannot be automatically instantiated. Such code is also called external code (for example, the code in the `research` directory). Therefore, you need to add the `auto_register` configuration item to the corresponding module configuration in the YAML file and set it to the relative import path of the API to be registered. When the run_mindformer.py script is executed to start the task, you also need to add the `--register_path` input parameter and set it to the relative path of the directory where the external code is located.
For example, in the YAML file `research/llama3_1/predict_llama3_1_8b.yaml` used for Llama3.1-8B model inference in the `research` directory, the `auto_register` configuration item is added to register the customized `Llama3Tokenizer` defined in `research/llama3_1/llama3_1_tokenizer.py`.
```yaml
...
processor:
  return_tensors: ms
  tokenizer:
    model_max_length: 8192
    vocab_file: "/path/tokenizer.json"
    pad_token: "<|reserved_special_token_0|>"
    type: Llama3Tokenizer
    auto_register: llama3_1_tokenizer.Llama3Tokenizer
  type: LlamaProcessor
...
```
The relative import path `auto_register: llama3_1_tokenizer.Llama3Tokenizer` of `Llama3Tokenizer` is configured under `tokenizer`.
Run the following command to start the inference job:
```shell
python run_mindformer.py --config research/llama3_1/predict_llama3_1_8b.yaml --load_checkpoint path/to/llama3_1_8b.ckpt --register_path research/llama3_1 --predict_data "hello"
```
Parameters
| Parameter | Description |
|---|---|
| config | Path of the YAML configuration file. |
| load_checkpoint | Path of the loaded weight. |
| register_path | Path of the directory where the external code is located. |
| predict_data | Input data for inference. |
`register_path` is set to `research/llama3_1` (the path of the directory where the external code is located). For details about how to prepare the model weight, see Llama3.1 Description Document > Model Weight Download.
For details about the configuration file and configurable items, see Configuration File Descriptions. When writing a configuration file, you can refer to an existing configuration file in the library, for example, the Llama2-7B fine-tuning configuration file.
After all the preceding basic elements are prepared, you can refer to other documents in the MindFormers tutorial to perform model training, fine-tuning, and inference. For details about subsequent model debugging and optimization, see Large Model Accuracy Optimization Guide and Large Model Performance Optimization Guide.
Contributing Models to the MindFormers Open Source Repository
You can contribute models to the MindFormers open source repository for developers to research and use. For details, see MindFormers Contribution Guidelines.
MindFormers Model Migration Practice
Migration from Llama2-7B to Llama3-8B
Llama3-8B and Llama2-7B have the same model structure but different model parameters, tokenizers, and weights.
Model Configurations
The following compares the model configurations between Llama2-7B and Llama3-8B.
The differences are as follows:
- The sequence length of Llama3-8B is 8192. Therefore, `seq_length` is set to `8192`.
- Llama3-8B uses GQA, and the number of heads in each key-value group is 8. Therefore, `n_kv_head` is set to `8`.
- The vocabulary size of Llama3-8B is 128,256. Therefore, `vocab_size` is set to `128256`.
- Llama3-8B expands the hidden layer size of the feed-forward network to 14,336. Therefore, `intermediate_size` is set to `14336`.
- Llama3-8B modifies the indexes of the special tokens. Therefore, `bos_token_id` is set to `128000`, `eos_token_id` is set to `128001`, and `pad_token_id` is set to `128002`.
- Llama3-8B changes the theta value of the rotary position embedding to 500,000. Therefore, `theta` is set to `500000`.
After modifying the corresponding content in the YAML file of Llama2-7B, you can obtain the Llama3-8B configuration file.
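As an illustration, the changed entries under model_config in the Llama2-7B YAML file would look like the following excerpt. Only the modified keys are shown, and the key names follow the list above; verify them and the surrounding structure against the actual configuration file.

```yaml
# Illustrative excerpt: only the entries changed for Llama3-8B are shown.
model:
  model_config:
    seq_length: 8192
    n_kv_head: 8
    vocab_size: 128256
    intermediate_size: 14336
    bos_token_id: 128000
    eos_token_id: 128001
    pad_token_id: 128002
    theta: 500000
```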
Tokenizer
Llama3-8B re-implements the tokenizer. Following the official implementation, Llama3Tokenizer is implemented by inheriting PretrainedTokenizer from MindFormers and is written in llama3_tokenizer.py.
Weight Conversion
The parameter structure of Llama3-8B is the same as that of Llama2-7B. Therefore, the weight conversion process of Llama2-7B can be reused. For details, see Llama3 Document > Weight Conversion.
Dataset Processing
The tokenizer of Llama3-8B is different from that of Llama2-7B. Therefore, on the basis of the Llama2-7B dataset processing script, you need to replace the tokenizer with that of Llama3-8B to preprocess data. For details, see conversation.py and llama_preprocess.py.
For details about the implementation of Llama3 in MindFormers, see the Llama3 folder in the MindFormers repository. For details about how to use Llama3 in MindFormers, see the Llama3 documents.