Configuration File Descriptions
Overview
Different parameters usually need to be configured during the training and inference process of a model. MindFormers supports the use of YAML
files to centrally manage and adjust the configurable items, which makes the configuration of the model more structured and improves its maintainability at the same time.
Description of the YAML File Contents
The YAML
file provided by MindFormers contains configuration items for different functions, which are described below according to their contents.
Basic Configuration
The basic configuration is mainly used to specify MindSpore random seeds and related settings for loading weights.
Parameters |
Descriptions |
Types |
---|---|---|
seed |
Set the global seed. For details, refer to mindspore.set_seed |
int |
run_mode |
Set the running mode of the model, either |
str |
output_dir |
Set the path where log, checkpoint, strategy, etc. files are saved. |
str |
load_checkpoint |
File or folder paths for loading weights。 Currently there are 3 application scenarios |
str |
auto_trans_ckpt |
Enable online weight automatic conversion. Refer to Weight Conversion Function |
bool |
resume_training |
Turn on resumable training after breakpoint. For details, refer to Resumable Training After Breakpoint |
bool |
Context Configuration
Context configuration is mainly used to specify the mindspore.set_context in the related parameters.
Parameters |
Descriptions |
Types |
---|---|---|
context.mode |
Set the backend execution mode, |
int |
context.device_target |
Set the backend execution device. MindFormers is only supported on |
str |
context.device_id |
Set the execution device ID. The value must be within the range of available devices, and the default value is |
int |
context.enable_graph_kernel |
Enable graph fusion to optimize network execution performance, defaults to |
bool |
context.max_call_depth |
Set the maximum depth of a function call. The value must be a positive integer, and the default value is |
int |
context.max_device_memory |
Set the maximum memory available to the device in the format “xxGB”, and the default value is |
str |
context.mempool_block_size |
Set the size of the memory pool block for devices. The format is "xxGB". Default value is |
str |
context.save_graphs |
Save the compilation graph during execution. |
bool/int |
Model Configuration
Since the configuration will vary from model to model, only the generic configuration of models in MindFormers is described here.
Parameters |
Descriptions |
Types |
---|---|---|
model.arch.type |
Set the model class to instantiate the model according to the model class when constructing the model |
str |
model.model_config.type |
Set the model configuration class, the model configuration class needs to match the model class to be used, i.e. the model configuration class should contain all the parameters used by the model class |
str |
model.model_config.num_layers |
Set the number of model layers, usually the number of layers in the model Decoder Layer. |
int |
model.model_config.seq_length |
Set the model sequence length, this parameter indicates the maximum sequence length supported by the model |
int |
model.model_config.hidden_size |
Set the dimension of the model hidden state |
int |
model.model_config.vocab_size |
Set the model word list size |
int |
model.model_config.top_k |
Sample from the |
int |
model.model_config.top_p |
Sample from tokens that have the highest probability and whose probability accumulation does not exceed |
int |
model.model_config.use_past |
Turn on model incremental inference, when turned on you can use Paged Attention to improve inference performance, must be set to |
bool |
model.model_config.max_decode_length |
Set the maximum length of the generated text, including the input length |
int |
model.model_config.max_length |
The descriptions are same as |
int |
model.model_config.max_new_tokens |
Set the maximum length of the generated new text, excluding the input length, when set together with |
int |
model.model_config.min_length |
Set the minimum length of the generated text, including the input length |
int |
model.model_config.min_new_tokens |
Set the minimum length of the new text to be generated, excluding the input length; when set together with |
int |
model.model_config.repetition_penalty |
Set the penalty factor for generating duplicate text, |
int |
model.model_config.block_size |
Set the size of the block in Paged Attention, only works if |
int |
model.model_config.num_blocks |
Set the total number of blocks in Paged Attention, effective only if |
int |
model.model_config.return_dict_in_generate |
Set to return the inference results of the |
bool |
model.model_config.output_scores |
Set to include score before the input softmax for each forward generation when returning the result as a dictionary, defaults to |
bool |
model.model_config.output_logits |
Set to include the logits output by the model at each forward generation when returning results as a dictionary, defaults to |
bool |
MoE Configuration
In addition to the basic configuration of the model above, the MoE model needs to be configured separately with some superparameters of the moe module, and since the parameters used will vary from model to model, only the generic configuration will be explained:
Parameters |
Descriptions |
Types |
---|---|---|
moe_config.expert_num |
Set the number of routing experts |
int |
moe_config.shared_expert_num |
Set the number of sharing experts |
int |
moe_config.moe_intermediate_size |
Set the size of the intermediate dimension of the expert layer |
int |
moe_config.capacity_factor |
Set the expert capacity factor |
int |
moe_config.num_experts_chosen |
Set the number of experts to select per token |
int |
moe_config.enable_sdrop |
Set the enable token drop policy |
bool |
moe_config.aux_loss_factor |
Set the weights of the equilibrium loss |
list[float] |
moe_config.first_k_dense_replace |
Set the enable block of the moe layer, generally set to 1 to indicate that moe is not enabled in the first block |
int |
Model Training Configuration
When starting model training, in addition to model-related parameters, you also need to set the parameters of trainer, runner_config, learning rate, and optimizer and other modules required for training, MindFormers provides the following configuration items.
Parameters |
Descriptions |
Types |
---|---|---|
trainer.type |
Set the trainer class, usually different models for different application scenarios will set different trainer classes |
str |
trainer.model_name |
Set the model name in the format '{name}_xxb', indicating a certain specification of the model |
str |
runner_config.epochs |
Set the number of rounds for model training |
int |
runner_config.batch_size |
Set the sample size of the batch data, which overrides the |
int |
runner_config.sink_mode |
Enable data sink mode, see Sink Mode for details |
bool |
runner_config.sink_size |
Set the number of iterations to be sent down from Host to Device per iteration, effective only when |
int |
runner_config.gradient_accumulation_steps |
Set the number of gradient accumulation steps, the default value is 1, which means that gradient accumulation is not enabled. |
int |
runner_wrapper.type |
Set the wrapper class, generally set 'MFTrainOneStepCell' |
str |
runner_wrapper.scale_sense.type |
Set the gradient scaling class, generally just set 'DynamicLossScaleUpdateCell' |
str |
runner_wrapper.scale_sense.use_clip_grad |
Turn on gradient clipping. Turning on to avoid cases where the inverse gradient is too large and training fails to converge |
bool |
runner_wrapper.scale_sense.loss_scale_value |
Set the loss dynamic scale factor, the model loss can change dynamically according to the configuration of this parameter |
int |
lr_schedule.type |
Set the lr_schedule class, lr_schedule is mainly used to adjust the learning rate in model training |
str |
lr_schedule.learning_rate |
Set the initialized learning rate size |
float |
lr_scale |
Whether to enable learning rate scaling |
bool |
lr_scale_factor |
Set the learning rate scaling factor |
int |
layer_scale |
Whether to turn on layer attenuation |
bool |
layer_decay |
Set the layer attenuation factor |
float |
optimizer.type |
Set the optimizer class, the optimizer is mainly used to calculate the gradient for model training |
str |
optimizer.weight_decay |
Set the optimizer weight decay factor |
float |
train_dataset.batch_size |
The description is same as that of |
int |
train_dataset.input_columns |
Set the input data columns for the training dataset |
list |
train_dataset.output_columns |
Set the output data columns for the training dataset |
list |
train_dataset.column_order |
Set the order of the output data columns of the training dataset |
list |
train_dataset.num_parallel_workers |
Set the number of processes that read the training dataset |
int |
train_dataset.python_multiprocessing |
Enabling Python multi-process mode to improve data processing performance |
bool |
train_dataset.drop_remainder |
Whether to discard the last batch of data if it contains fewer samples than batch_size. |
bool |
train_dataset.repeat |
Set the number of dataset duplicates |
int |
train_dataset.numa_enable |
Set the default state of NUMA to data read startup state |
bool |
train_dataset.prefetch_size |
Set the amount of pre-read data |
int |
train_dataset.data_loader.type |
Set the data loading class |
str |
train_dataset.data_loader.dataset_dir |
Set the path for loading data |
str |
train_dataset.data_loader.shuffle |
Whether to randomly sort the data when reading the dataset |
bool |
train_dataset.transforms |
Set options related to data enhancement |
- |
train_dataset_task.type |
Set up the dataset class, which is used to encapsulate the data loading class and other related configurations |
str |
train_dataset_task.dataset_config |
Typically set as a reference to |
- |
auto_tune |
Enable auto-tuning of data processing parameters, see set_enable_autotune for details |
bool |
filepath_prefix |
Set the save path for parameter configurations after data optimization |
str |
autotune_per_step |
Set the configuration tuning step interval for automatic data acceleration, for details see set_autotune_interval |
int |
Parallel Configuration
In order to improve the performance of the model, it is usually necessary to configure the parallelism strategy for the model in large-scale cluster usage scenarios. For details, please refer to Distributed Parallelism, the parallel configuration in MindFormers is as follows.
Parameters |
Descriptions |
Types |
---|---|---|
use_parallel |
Enable parallel mode |
bool |
parallel_config.data_parallel |
Set the number of data parallel |
int |
parallel_config.model_parallel |
Set the number of model parallel |
int |
parallel_config.context_parallel |
Set the number of sequence parallel |
int |
parallel_config.pipeline_stage |
Set the number of pipeline parallel |
int |
parallel_config.micro_batch_num |
Set the pipeline parallel microbatch size, which should satisfy |
int |
parallel_config.gradient_aggregation_group |
Set the size of the gradient communication operator fusion group |
int |
micro_batch_interleave_num |
Set the number of multicopy parallel, enable multicopy parallelism if it is greater than 1. Usually enabled when using model parallel, mainly used to optimize the communication loss generated by model parallel, and not recommended to be enabled when only using streaming parallel. For details, please refer to MicroBatchInterleaved |
int |
parallel.parallel_mode |
Set parallel mode, |
int |
parallel.gradients_mean |
Whether to execute the averaging operator after the gradient AllReduce. Typically set to |
bool |
parallel.enable_alltoall |
Enables generation of the AllToAll communication operator during communication. Typically set to |
bool |
parallel.full_batch |
Set the dataset to load the full batch in parallel mode, set to |
bool |
parallel.search_mode |
Set fully-automatic parallel strategy search mode, options are |
str |
parallel.strategy_ckpt_save_file |
Set the save path for the parallel slicing strategy file |
str |
parallel.strategy_ckpt_config.only_trainable_params |
Whether to save (or load) information about the slicing strategy for trainable parameters only, default is True, set this parameter to |
bool |
parallel.enable_parallel_optimizer |
Turn on optimizer parallel. |
bool |
parallel.parallel_optimizer_config.gradient_accumulation_shard |
Set whether the cumulative gradient variable is sliced on the data-parallel dimension, only effective if |
bool |
parallel.parallel_optimizer_config.parallel_optimizer_threshold |
Set the threshold for the optimizer weight parameter cut, effective only if |
int |
parallel.parallel_optimizer_config.optimizer_weight_shard_size |
Set the size of the optimizer weight parameter to slice the communication domain, requiring the value to be integrable by |
int |
Configure the parallel strategy to satisfy device_num = data_parallel × model_parallel × context_parallel × pipeline_stage
Model Optimization Configuration
MindFormers provides recomputation-related configurations to reduce the memory footprint of the model during training, see Recomputation for details.
Parameters |
Descriptions |
Types |
---|---|---|
recompute_config.recompute |
Enable recompute |
bool |
recompute_config.select_recompute |
Turn on recomputation to recompute only for the operators in the attention layer |
bool/list |
recompute_config.parallel_optimizer_comm_recompute |
Whether to recompute AllGather communication introduced in parallel by the optimizer |
bool/list |
recompute_config.mp_comm_recompute |
Whether to recompute communications introduced by model parallel |
bool |
recompute_config.recompute_slice_activation |
Whether to output slices for Cells kept in memory |
bool |
Callbacks Configuration
MindFormers provides encapsulated Callbacks function class, mainly to achieve to return to the model training state and output in the model training process, save the model weight file and other operations. Currently the following Callbacks function class is supported.
MFLossMonitor
This callback function class is mainly used to print information such as training progress, model Loss, and learning rate during the training process and has several configurable items as follows:
Parameters
Descriptions
Types
learning_rate
Set the initial learning rate in
MFLossMonitor
. The default value isNone
float
per_print_times
Set the interval for printing log information in
MFLossMonitor
. The default value is1
, that is, the log information is printed every stepint
micro_batch_num
Set the size of the micro batch data in each step in the training, which is used to calculate the actual loss value. If this parameter is not set, the value of this parameter is the same as that of
parallel_config.micro_batch_num
in Parallel Configurationint
micro_batch_interleave_num
Set the size of the interleave micro batch data in each step of the training. This parameter is used to calculate the actual loss value. If this parameter is not set, the value of this parameter is the same as that of
micro_batch_interleave_num
in Parallel Configurationint
origin_epochs
Set the initial number of training epochs in
MFLossMonitor
. If this parameter is not set, the value of this parameter is the same as that ofrunner_config.epochs
in Model Training Configurationint
dataset_size
Set initial size of the dataset in
MFLossMonitor
. If this parameter is not set, the size of the initialized dataset is the same as the size of the actual dataset used for trainingint
initial_epoch
Set start epoch number of training in
MFLossMonitor
. The default value is0
int
initial_step
Set start step number of training in
MFLossMonitor
. The default value is0
int
global_batch_size
Set the number of global batch data samples in
MFLossMonitor
. If this parameter is not set, the system automatically calculates the number of global batch data samples based on the dataset size and parallel strategyint
gradient_accumulation_steps
Set the number of gradient accumulation steps in
MFLossMonitor
. If this parameter is not set, the value of this parameter is the same as that ofgradient_accumulation_steps
in Model Training Configurationint
enable_tensorboard
Whether to enable TensorBoard to record logs in
MFLossMonitor
. The default value isFalse
bool
tensorboard_path
Set the path for saving TensorBoard logs in
MFLossMonitor
. This parameter is valid only whenenable_tensorboard=True
str
check_for_nan_in_loss_and_grad
Whether to enable overflow detection in
MFLossMonitor
. After overflow detection is enabled, the training exits if overflow occurs during model training. The default value isFalse
bool
If you do not need to enable TensorBoard log recording or overflow detection, you are advised to use the default settings.
SummaryMonitor
This callback function class is mainly used to collect Summary data, see mindspore.SummaryCollector for details.
CheckpointMonitor
This callback function class is mainly used to save the model weights file during the model training process and has several configurable items as follows:
Parameters
Descriptions
Types
prefix
Set the prefix for saving file names
str
directory
Set the directory for saving file names
str
save_checkpoint_seconds
Set the number of seconds between saving model weights
int
save_checkpoint_steps
Set the number of interval steps for saving model weights
int
keep_checkpoint_max
Set the maximum number of model weight files to be saved, if there are more model weight files in the save path, they will be deleted starting from the earliest file created to ensure that the total number of files does not exceed
keep_checkpoint_max
.int
keep_checkpoint_per_n_minutes
Set the number of minutes between saving model weights
int
integrated_save
Turn on aggregation to save the weights file.
1. When set to True, it means that the weights of all devices are aggregated when the weight file is saved, i.e., the weights of all devices are the same.
2. False means that all devices save their own weights
When using semi-automatic parallel mode, it is usually necessary to set it to False to avoid memory problems when saving the weights file.bool
save_network_params
Set to save only model weights, default value is
False
.bool
save_trainable_params
Set the additional saving of trainable parameter weights, i.e. the parameter weights of the model when partially fine-tuned, default to
False
.bool
async_save
Set an asynchronous execution to save the model weights file
bool
Multiple Callbacks function classes can be configured at the same time under the callbacks
field. The following is an example of callbacks
configuration.
callbacks:
- type: MFLossMonitor
- type: CheckpointMonitor
prefix: "name_xxb"
save_checkpoint_steps: 1000
integrated_save: False
async_save: False
Processor Configuration
Processor is mainly used to preprocess the inference data of the input model. Since the Processor configuration items are not fixed, only the generic configuration items of Processor in MindFormers are explained here.
Parameters |
Descriptions |
Types |
---|---|---|
processor.type |
Set the data processing class |
str |
processor.return_tensors |
Set the type of tensor returned by the data processing class, typically use 'ms' |
str |
processor.image_processor.type |
Set the image data processing class |
str |
processor.tokenizer.type |
Set the text tokenizer class |
str |
processor.tokenizer.vocab_file |
Set the path of the file to be read by the text tokenizer, which needs to correspond to the tokenizer class |
str |
Model Evaluation Configuration
MindFormers provides model evaluation function, and also supports model evaluation while training. The following is the configuration related to model evaluation.
Parameters |
Descriptions |
Types |
---|---|---|
eval_dataset |
Used in the same way as |
- |
eval_dataset_task |
Used in the same way as |
- |
metric.type |
Used in the same way as |
- |
do_eval |
Enable evaluation while training |
bool |
eval_step_interval |
Set evaluation step interval, default value is 100. The value less than 0 means disable evaluation according to step interval. |
int |
eval_epoch_interval |
Set the epoch interval for evaluation, the default value is -1. The value less than 0 means disable the function of evaluating according to epoch interval, it is not recommended to use this configuration in data sinking mode. |
int |
metric.type |
Set the type of evaluation |
str |
Profile Configuration
MindFormers provides Profile as the main tool for model performance tuning, please refer to Performance Tuning Guide for more details. The following is the Profile related configuration.
Parameters |
Descriptions |
Types |
---|---|---|
profile |
Enable the performance capture tool, see mindspore.Profiler for details |
bool |
profile_start_step |
Set the number of steps to start collecting performance data |
int |
profile_stop_step |
Set the number of steps to stop collecting performance data |
int |
profile_communication |
Set whether communication performance data is collected in multi-device training, this parameter is invalid when using single card training. Default: |
bool |
profile_memory |
Set whether to collect Tensor memory data |
bool |
init_start_profile |
Set whether to turn on collecting performance data when the Profiler is initialized; this parameter does not take effect when |
bool |