Pretraining
Overview
Pretraining refers to training a model on large-scale unlabeled data so that it can capture a broad range of language features. A pretrained model learns knowledge at the vocabulary, syntax, and semantic levels; after fine-tuning, this knowledge is applied to downstream tasks to improve performance on specific tasks. The objective of pretraining in the MindFormers framework is to help developers quickly and conveniently build and train pretrained models based on the Transformer architecture.
Procedure
In practice, the basic pretraining process can be divided into the following steps (a brief outline that maps them to the commands used later in this document follows the list):
Preparing a dataset: Prepare a large-scale unlabeled text dataset for pretraining. Such datasets contain a large amount of text from multiple sources, such as the web, books, and articles. The diversity and scale of the dataset have a great impact on the generalization capability of the model.
Selecting a model architecture: Select a proper model architecture to build a pretrained model based on task requirements and computing resources.
Pretraining: Train the model on the prepared large-scale dataset with the selected model architecture and training configuration. This is typically a long-running job that produces the final pretrained model weights.
Saving a model: After training is complete, save the model weights to the specified location.
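As a quick orientation, the outline below maps these four steps to the concrete scripts and paths used in the practice that follows; it is a summary of this document rather than an additional workflow.
# 1. Prepare data: convert the raw text corpus to the MindRecord format (see "Data Preprocessing").
# 2. Select a model architecture: pick a configuration file such as configs/llama2/pretrain_llama2_7b.yaml.
# 3. Pretrain: launch the training script through scripts/msrun_launcher.sh (see "Executing a Pretraining Task").
# 4. Save the model: checkpoints are written to the checkpoint folder under the output directory once training completes.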
MindFormers-based Pretraining Practice
Currently, MindFormers supports mainstream foundation models in the industry. In this practice, Llama2-7B and Llama3-70B are used to demonstrate Single-Node Training and Multi-Node Training, respectively.
Preparing a Dataset
Dataset | Applicable Model | Applicable Phase | Download Link
---|---|---|---
Wikitext2 | Llama2-7B | Pretrain |
Wiki103 | Llama3-70B | Pretrain |
Data Preprocessing
For details about how to process the Llama2-7B and Llama3-70B datasets, see Wikitext2 data preprocessing and Wiki103 data preprocessing, respectively.
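For orientation, a Wikitext2 preprocessing run that produces the wiki4096.mindrecord file used in the next section typically looks like the following sketch. The script path, flag names, and tokenizer file here are assumptions for illustration; follow the preprocessing guides linked above for the authoritative command.
# Illustrative sketch only: convert the Wikitext2 training text into MindRecord samples with a sequence length of 4096.
# The script path and flags are assumptions; see the linked preprocessing guides for the exact usage.
python mindformers/tools/dataset_preprocess/llama/llama_preprocess.py \
--dataset_type wiki \
--input_glob /{path}/wiki.train.tokens \
--model_file /{path}/tokenizer.model \
--seq_length 4096 \
--output_file /{path}/wiki4096.mindrecord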
Executing a Pretraining Task
Single-Node Training
Take Llama2-7B as an example. Specify the configuration file pretrain_llama2_7b.yaml and start the run_mindformer.py script in msrun mode to perform 8-device distributed training. The startup command is as follows:
bash scripts/msrun_launcher.sh "run_mindformer.py \
--config configs/llama2/pretrain_llama2_7b.yaml \
--train_dataset_dir /{path}/wiki4096.mindrecord \
--use_parallel True \
--run_mode train" 8
# Parameters:
config: model configuration file, which is stored in the configs directory of the MindFormers code repository.
train_dataset_dir: path of the training dataset.
use_parallel: specifies whether to enable parallelism.
run_mode: running mode. The value can be train, finetune, or predict (inference).
After the task is executed, a checkpoint folder is generated in the mindformers/output directory, and the model weight files are saved in it.
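While the job is running, training progress can be followed through the per-worker logs written by the launch script. The log path below assumes the default output/msrun_log directory, which is the same location passed explicitly in the multi-node command later in this document.
# Assumption: msrun_launcher.sh writes per-worker logs under output/msrun_log by default.
tail -f output/msrun_log/worker_0.log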
Multi-Node Training
Take Llama3-70B as an example. Use the pretrain_llama3_70b.yaml configuration file to run run_llama3.py in msrun mode and perform 8-node, 64-device pretraining. To run distributed training across multiple nodes, start the script on each node and set the MASTER_ADDR parameter to the IP address of the primary node. The MASTER_ADDR value is the same on every node; only the NODE_RANK value (node_num below) differs from node to node. For details about the parameter positions, see the msrun Launching Guide. Concrete commands for node 0 and node 1 are shown after the parameter list below.
# Node 0: Set MASTER_ADDR to the IP address of node 0, which serves as the primary node. There are 64 devices in total, with 8 devices per node.
# Change the value of node_num for each node in sequence. For example, with eight nodes, node_num ranges from 0 to 7.
bash scripts/msrun_launcher.sh "run_llama3.py \
--config pretrain_llama3_70b.yaml \
--train_dataset dataset_dir \
--use_parallel True \
--run_mode train" \
64 8 {MASTER_ADDR} 8118 {node_num} output/msrun_log False 300
# Parameters:
config: model configuration file, which is stored in the configs directory of the MindFormers code repository.
train_dataset: path of the training dataset.
use_parallel: specifies whether to enable parallelism.
run_mode: running mode. The value can be train, finetune, or predict (inference).
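For example, with eight nodes and a hypothetical primary-node IP address of 192.168.1.1 (a placeholder for your actual address), node 0 and node 1 are started as follows; nodes 2 to 7 use the same command with node_num set to 2 through 7.
# Node 0 (primary node); replace 192.168.1.1 with the real IP address of the primary node.
bash scripts/msrun_launcher.sh "run_llama3.py \
--config pretrain_llama3_70b.yaml \
--train_dataset dataset_dir \
--use_parallel True \
--run_mode train" \
64 8 192.168.1.1 8118 0 output/msrun_log False 300

# Node 1: identical command except that node_num is 1.
bash scripts/msrun_launcher.sh "run_llama3.py \
--config pretrain_llama3_70b.yaml \
--train_dataset dataset_dir \
--use_parallel True \
--run_mode train" \
64 8 192.168.1.1 8118 1 output/msrun_log False 300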
Note: Performance issues may occur during multi-node distributed training. To ensure training efficiency and stability, you are advised to tune performance by referring to the Large Model Performance Optimization Guide.
More Information
For more training examples of different models, see the models supported by MindFormers.