Configuration File Descriptions

Overview

Different parameters usually need to be configured during the training and inference process of a model. MindFormers supports the use of YAML files to centrally manage and adjust the configurable items, which makes the configuration of the model more structured and improves its maintainability at the same time.

Description of the YAML File Contents

The YAML file provided by MindFormers contains configuration items for different functions, which are described below according to their contents.

Basic Configuration

The basic configuration is mainly used to specify MindSpore random seeds and related settings for loading weights.

Parameters

Descriptions

Types

seed

Set the global seed. For details, refer to mindspore.set_seed

int

run_mode

Set the running mode of the model, either train, finetune, eval or predict.

str

output_dir

Set the path where log, checkpoint, strategy, etc. files are saved.

str

load_checkpoint

File or folder path for loading weights. Three scenarios are currently supported:
1. Path to a complete weight file
2. Path to a folder of offline-sliced weights
3. Path to a folder containing both LoRA weights and base weights
Refer to Weight Conversion Function for how to obtain each kind of weight.

str

auto_trans_ckpt

Enable automatic online weight conversion. Refer to Weight Conversion Function

bool

resume_training

Enable resuming training from a breakpoint. For details, refer to Resumable Training After Breakpoint

bool
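
As a rough illustration, the basic items above might sit at the top level of a YAML file as shown below. The values are placeholders chosen for this sketch, not recommended settings.

```yaml
# Basic configuration (illustrative values only)
seed: 0                  # global seed, passed to mindspore.set_seed
run_mode: 'train'        # one of: train, finetune, eval, predict
output_dir: './output'   # where log, checkpoint and strategy files are saved
load_checkpoint: ''      # weight file or folder path; left empty in this sketch
auto_trans_ckpt: False   # automatic online weight conversion disabled
resume_training: False   # resumable training disabled
```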

Context Configuration

Context configuration is mainly used to specify the parameters passed to mindspore.set_context.

Parameters

Descriptions

Types

context.mode

Set the backend execution mode. 0 means GRAPH_MODE, which is currently the only mode MindFormers supports.

int

context.device_target

Set the backend execution device. MindFormers is only supported on Ascend devices.

str

context.device_id

Set the execution device ID. The value must be within the range of available devices, and the default value is 0.

int

context.enable_graph_kernel

Enable graph fusion to optimize network execution performance. The default value is False. See graph fusion for details.

bool

context.max_call_depth

Set the maximum depth of a function call. The value must be a positive integer, and the default value is 1000.

int

context.max_device_memory

Set the maximum memory available to the device, in the format "xxGB". The default value is "1024GB".

str

context.mempool_block_size

Set the size of the memory pool block for devices, in the format "xxGB". The default value is "1GB".

str

context.save_graphs

Save the compilation graphs during execution.
1. False or 0: do not save intermediate compilation graphs.
2. 1: output some of the intermediate files generated during graph compilation.
3. True or 2: generate more IR files related to the backend process.
4. 3: generate visualized computational graphs and more detailed front-end IR graphs.

bool/int

context.save_graphs_path

Path for saving the compilation graphs.

str
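
A minimal sketch of how the context items above nest under the `context` key. The memory limit and graph path are illustrative values for this example only.

```yaml
context:
  mode: 0                       # 0 = GRAPH_MODE
  device_target: "Ascend"       # MindFormers is only supported on Ascend devices
  device_id: 0
  enable_graph_kernel: False
  max_call_depth: 1000
  max_device_memory: "58GB"     # illustrative; choose a value that fits your device
  mempool_block_size: "1GB"
  save_graphs: False            # do not save intermediate compilation graphs
  save_graphs_path: "./graph"   # illustrative path, used only when save_graphs is enabled
```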

Model Configuration

Since the configuration will vary from model to model, only the generic configuration of models in MindFormers is described here.

Parameters

Descriptions

Types

model.arch.type

Set the model class. The model is instantiated from this class when it is constructed

str

model.model_config.type

Set the model configuration class. It must match the model class in use, i.e. it should contain all the parameters used by the model class

str

model.model_config.num_layers

Set the number of model layers, usually the number of decoder layers in the model.

int

model.model_config.seq_length

Set the model sequence length, i.e. the maximum sequence length supported by the model

int

model.model_config.hidden_size

Set the dimension of the model hidden state

int

model.model_config.vocab_size

Set the vocabulary size of the model

int

model.model_config.top_k

Sample from the top_k tokens with the highest probability during inference

int

model.model_config.top_p

Sample from the highest-probability tokens whose cumulative probability does not exceed top_p during inference

float

model.model_config.use_past

Enable incremental inference. When enabled, Paged Attention can be used to improve inference performance. Must be set to False during model training

bool

model.model_config.max_decode_length

Set the maximum length of the generated text, including the input length

int

model.model_config.max_length

Same as max_decode_length. When set together with max_decode_length, max_length takes effect.

int

model.model_config.max_new_tokens

Set the maximum length of the newly generated text, excluding the input length. When set together with max_length, max_new_tokens takes effect.

int

model.model_config.min_length

Set the minimum length of the generated text, including the input length

int

model.model_config.min_new_tokens

Set the minimum length of the new text to be generated, excluding the input length; when set together with min_length, min_new_tokens takes effect.

int

model.model_config.repetition_penalty

Set the penalty factor for repeated text. repetition_penalty must be no less than 1. When it equals 1, repeated output is not penalized.

float

model.model_config.block_size

Set the size of the block in Paged Attention, only works if use_past=True.

int

model.model_config.num_blocks

Set the total number of blocks in Paged Attention, effective only if use_past=True. batch_size×seq_length<=block_size×num_blocks should be satisfied.

int

model.model_config.return_dict_in_generate

Set to return the inference results of the generate interface as a dictionary, defaults to False.

bool

model.model_config.output_scores

Set to include the scores before softmax at each forward generation step when returning the result as a dictionary, defaults to False

bool

model.model_config.output_logits

Set to include the logits output by the model at each forward generation when returning results as a dictionary, defaults to False.

bool
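
The generic model items could be laid out as in the sketch below, written here for an inference run. The class names `LlamaForCausalLM` and `LlamaConfig` and all numeric values are assumptions for illustration; substitute the classes and sizes of the model you actually use.

```yaml
model:
  arch:
    type: LlamaForCausalLM      # assumed model class for this sketch
  model_config:
    type: LlamaConfig           # assumed configuration class matching the model class
    num_layers: 32
    seq_length: 4096
    hidden_size: 4096
    vocab_size: 32000
    use_past: True              # incremental inference; must be False for training
    # Paged Attention settings, effective only when use_past is True.
    # Assuming a batch size of 1: 1 x 4096 <= 16 x 512 = 8192, so the constraint holds.
    block_size: 16
    num_blocks: 512
    # Sampling and length limits used during generation
    top_k: 3
    top_p: 1
    repetition_penalty: 1       # 1 means repeated output is not penalized
    max_decode_length: 512      # includes the input length
```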

MoE Configuration

In addition to the basic model configuration above, MoE models require some hyperparameters of the MoE module to be configured separately. Since the parameters used vary from model to model, only the generic configuration is explained here:

Parameters

Descriptions

Types

moe_config.expert_num

Set the number of routing experts

int

moe_config.shared_expert_num

Set the number of shared experts

int

moe_config.moe_intermediate_size

Set the size of the intermediate dimension of the expert layer

int

moe_config.capacity_factor

Set the expert capacity factor

int

moe_config.num_experts_chosen

Set the number of experts to select per token

int

moe_config.enable_sdrop

Enable the sdrop token drop policy. Because the MoE in MindFormers is a static shape implementation, it cannot retain all tokens

bool

moe_config.aux_loss_factor

Set the weights of the load-balancing (auxiliary) loss

list[float]

moe_config.first_k_dense_replace

Set the block from which the MoE layer is enabled. It is generally set to 1, meaning that MoE is not enabled in the first block

int
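
A hedged sketch of a `moe_config` block using the generic items above. Every number here is an illustrative assumption, not a value taken from any particular model.

```yaml
moe_config:
  expert_num: 64               # routed experts (illustrative)
  shared_expert_num: 2         # shared experts (illustrative)
  moe_intermediate_size: 1408  # intermediate dimension of each expert layer (illustrative)
  capacity_factor: 2           # illustrative
  num_experts_chosen: 6        # experts selected per token (illustrative)
  enable_sdrop: True           # token drop policy
  aux_loss_factor: [0.05]      # list of floats (illustrative)
  first_k_dense_replace: 1     # MoE is not enabled in the first block
```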

Model Training Configuration

When starting model training, in addition to the model-related parameters, you also need to set the parameters of the modules required for training, such as the trainer, runner_config, learning rate, and optimizer. MindFormers provides the following configuration items.

Parameters

Descriptions

Types

trainer.type

Set the trainer class, usually different models for different application scenarios will set different trainer classes

str

trainer.model_name

Set the model name in the format '{name}_xxb', indicating a certain specification of the model

str

runner_config.epochs

Set the number of epochs for model training

int

runner_config.batch_size

Set the sample size of the batch data, which overrides the batch_size in the dataset configuration.

int

runner_config.sink_mode

Enable data sink mode, see Sink Mode for details

bool

runner_config.sink_size

Set the number of steps sent from Host to Device at a time, effective only when sink_mode=True

int

runner_config.gradient_accumulation_steps

Set the number of gradient accumulation steps, the default value is 1, which means that gradient accumulation is not enabled.

int

runner_wrapper.type

Set the wrapper class, generally set to 'MFTrainOneStepCell'

str

runner_wrapper.scale_sense.type

Set the gradient scaling class, generally set to 'DynamicLossScaleUpdateCell'

str

runner_wrapper.scale_sense.use_clip_grad

Enable gradient clipping, which prevents training from failing to converge when the backward gradients become too large

bool

runner_wrapper.scale_sense.loss_scale_value

Set the dynamic loss scale factor. The model loss is scaled dynamically according to this parameter

int

lr_schedule.type

Set the lr_schedule class, which is mainly used to adjust the learning rate during model training

str

lr_schedule.learning_rate

Set the initial learning rate

float

lr_scale

Whether to enable learning rate scaling

bool

lr_scale_factor

Set the learning rate scaling factor

int

layer_scale

Whether to enable layer decay

bool

layer_decay

Set the layer decay factor

float

optimizer.type

Set the optimizer class. The optimizer is mainly used to update the model parameters from the computed gradients during training

str

optimizer.weight_decay

Set the optimizer weight decay factor

float

train_dataset.batch_size

The description is same as that of runner_config.batch_size

int

train_dataset.input_columns

Set the input data columns for the training dataset

list

train_dataset.output_columns

Set the output data columns for the training dataset

list

train_dataset.column_order

Set the order of the output data columns of the training dataset

list

train_dataset.num_parallel_workers

Set the number of processes that read the training dataset

int

train_dataset.python_multiprocessing

Enable Python multi-process mode to improve data processing performance

bool

train_dataset.drop_remainder

Whether to discard the last batch of data if it contains fewer samples than batch_size.

bool

train_dataset.repeat

Set the number of times the dataset is repeated

int

train_dataset.numa_enable

Set the default NUMA state used when data reading starts

bool

train_dataset.prefetch_size

Set the amount of data to prefetch

int

train_dataset.data_loader.type

Set the data loading class

str

train_dataset.data_loader.dataset_dir

Set the path for loading data

str

train_dataset.data_loader.shuffle

Whether to shuffle the data when reading the dataset

bool

train_dataset.transforms

Set options related to data augmentation

-

train_dataset_task.type

Set up the dataset class, which is used to encapsulate the data loading class and other related configurations

str

train_dataset_task.dataset_config

Typically set as a reference to train_dataset, containing all configuration entries for train_dataset.

-

auto_tune

Enable auto-tuning of data processing parameters, see set_enable_autotune for details

bool

filepath_prefix

Set the save path for parameter configurations after data optimization

str

autotune_per_step

Set the configuration tuning step interval for automatic data acceleration, for details see set_autotune_interval

int
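
The sketch below shows one way the training items above could fit together. The class names `CausalLanguageModelingTrainer`, `CosineWithWarmUpLR`, `AdamW`, `MindDataset` and `CausalLanguageModelDataset`, as well as the model name and all numeric values, are assumptions made for illustration; check them against the model you are training.

```yaml
trainer:
  type: CausalLanguageModelingTrainer   # assumed trainer class
  model_name: 'llama2_7b'               # assumed name, following the '{name}_xxb' format

runner_config:
  epochs: 2
  batch_size: 1                  # overrides train_dataset.batch_size
  sink_mode: True
  sink_size: 2                   # effective only when sink_mode is True
  gradient_accumulation_steps: 1 # 1 means gradient accumulation is not enabled

runner_wrapper:
  type: MFTrainOneStepCell
  scale_sense:
    type: DynamicLossScaleUpdateCell
    loss_scale_value: 65536      # illustrative

lr_schedule:
  type: CosineWithWarmUpLR       # assumed schedule class
  learning_rate: 5.e-5           # illustrative

optimizer:
  type: AdamW                    # assumed optimizer class
  weight_decay: 0.0

# "&train_dataset" defines a YAML anchor so the block can be referenced below
train_dataset: &train_dataset
  data_loader:
    type: MindDataset            # assumed data loading class
    dataset_dir: ""              # fill in the dataset path
    shuffle: True
  input_columns: ["input_ids"]
  num_parallel_workers: 8
  python_multiprocessing: False
  drop_remainder: True
  batch_size: 1
  repeat: 1
  numa_enable: False
  prefetch_size: 1

train_dataset_task:
  type: CausalLanguageModelDataset   # assumed dataset class
  dataset_config: *train_dataset     # alias: reuses every entry of train_dataset
```

Using a YAML anchor and alias keeps `train_dataset_task.dataset_config` in sync with `train_dataset` without repeating the entries.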

Parallel Configuration

To improve model performance in large-scale cluster scenarios, it is usually necessary to configure a parallelism strategy for the model. For details, please refer to Distributed Parallelism. The parallel configuration in MindFormers is as follows.

Parameters

Descriptions

Types

use_parallel

Enable parallel mode

bool

parallel_config.data_parallel

Set the data parallel size

int

parallel_config.model_parallel

Set the model parallel size

int

parallel_config.context_parallel

Set the sequence parallel size

int

parallel_config.pipeline_stage

Set the number of pipeline parallel stages

int

parallel_config.micro_batch_num

Set the number of micro-batches for pipeline parallel. When parallel_config.pipeline_stage is greater than 1, parallel_config.micro_batch_num >= parallel_config.pipeline_stage must be satisfied

int

parallel_config.gradient_aggregation_group

Set the size of the gradient communication operator fusion group

int

micro_batch_interleave_num

Set the number of copies for multi-copy parallel. Multi-copy parallelism is enabled if the value is greater than 1. It is usually enabled together with model parallel, mainly to reduce the communication overhead introduced by model parallel, and is not recommended when only pipeline parallel is used. For details, please refer to MicroBatchInterleaved

int

parallel.parallel_mode

Set the parallel mode. 0 means data parallel mode, 1 means semi-automatic parallel mode, 2 means automatic parallel mode, and 3 means hybrid parallel mode. It is usually set to semi-automatic parallel mode.

int

parallel.gradients_mean

Whether to execute the averaging operator after the gradient AllReduce. Typically set to False in semi-automatic parallel mode and True in data parallel mode

bool

parallel.enable_alltoall

Enable generation of the AllToAll communication operator during communication. Typically set to True only in MoE scenarios. The default value is False

bool

parallel.full_batch

Whether the dataset loads the full batch in parallel mode. Set to True in automatic and semi-automatic parallel modes, and False in data parallel mode

bool

parallel.search_mode

Set the strategy search mode for fully automatic parallelism. Options are recursive_programming, dynamic_programming and sharding_propagation. It only takes effect in fully automatic parallel mode and is an experimental interface

str

parallel.strategy_ckpt_save_file

Set the save path for the parallel slicing strategy file

str

parallel.strategy_ckpt_config.only_trainable_params

Whether to save (or load) slicing strategy information for trainable parameters only. The default value is True. Set this parameter to False when the network contains frozen parameters that still need to be sliced

bool

parallel.enable_parallel_optimizer

Enable optimizer parallel.
1. In data parallel mode, model weight parameters are sliced by the number of devices
2. In semi-automatic parallel mode, model weight parameters are sliced by parallel_config.data_parallel

bool

parallel.parallel_optimizer_config.gradient_accumulation_shard

Set whether the accumulated gradient variables are sliced along the data parallel dimension, only effective if enable_parallel_optimizer=True

bool

parallel.parallel_optimizer_config.parallel_optimizer_threshold

Set the threshold for slicing optimizer weight parameters, effective only if enable_parallel_optimizer=True.

int

parallel.parallel_optimizer_config.optimizer_weight_shard_size

Set the size of the communication domain used for slicing optimizer weight parameters. The value must evenly divide parallel_config.data_parallel. Effective only if enable_parallel_optimizer=True.

int

The configured parallel strategy must satisfy device_num = data_parallel × model_parallel × context_parallel × pipeline_stage.
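
To make the device count constraint concrete, here is a sketch for a hypothetical job on 8 devices. The split and the strategy file path are illustrative assumptions; the point is that the four sizes multiply to the device count and that micro_batch_num is no smaller than pipeline_stage.

```yaml
use_parallel: True

parallel:
  parallel_mode: 1               # semi-automatic parallel mode
  gradients_mean: False          # typical for semi-automatic parallel mode
  enable_alltoall: False         # usually True only in MoE scenarios
  full_batch: True               # True in automatic and semi-automatic parallel modes
  enable_parallel_optimizer: True
  strategy_ckpt_save_file: "./ckpt_strategy.ckpt"   # illustrative path

# 8 devices = 2 (data) x 2 (model) x 1 (context) x 2 (pipeline)
parallel_config:
  data_parallel: 2
  model_parallel: 2
  context_parallel: 1
  pipeline_stage: 2
  micro_batch_num: 2             # must be >= pipeline_stage when pipeline_stage > 1

micro_batch_interleave_num: 1    # multi-copy parallel is enabled only if greater than 1
```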

Model Optimization Configuration

MindFormers provides recomputation-related configurations to reduce the memory footprint of the model during training, see Recomputation for details.

Parameters

Descriptions

Types

recompute_config.recompute

Enable recompute

bool

recompute_config.select_recompute

Enable selective recomputation, which recomputes only the operators in the attention layer

bool/list

recompute_config.parallel_optimizer_comm_recompute

Whether to recompute the AllGather communication introduced by optimizer parallel

bool/list

recompute_config.mp_comm_recompute

Whether to recompute communications introduced by model parallel

bool

recompute_config.recompute_slice_activation

Whether to slice the outputs of Cells that are kept in memory

bool
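
A minimal sketch of a `recompute_config` block. The choices below are illustrative only; which combination saves memory without hurting throughput depends on the model and the parallel strategy.

```yaml
recompute_config:
  recompute: True                           # enable recomputation
  select_recompute: False                   # selective recomputation disabled in this sketch
  parallel_optimizer_comm_recompute: False  # do not recompute optimizer-parallel AllGather
  mp_comm_recompute: True                   # recompute communication introduced by model parallel
  recompute_slice_activation: True
```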

Callbacks Configuration

MindFormers provides encapsulated Callbacks function classes, which are mainly used to report the training state during model training, save model weight files, and perform other operations. The following Callbacks function classes are currently supported.

  1. MFLossMonitor

    This callback function class is mainly used to print information such as training progress, model Loss, and learning rate during the training process and has several configurable items as follows:

    Parameters

    Descriptions

    Types

    learning_rate

    Set the initial learning rate in MFLossMonitor. The default value is None

    float

    per_print_times

    Set the interval for printing log information in MFLossMonitor. The default value is 1, that is, the log information is printed every step

    int

    micro_batch_num

    Set the size of the micro batch data in each step in the training, which is used to calculate the actual loss value. If this parameter is not set, the value of this parameter is the same as that of parallel_config.micro_batch_num in Parallel Configuration

    int

    micro_batch_interleave_num

    Set the size of the interleave micro batch data in each step of the training. This parameter is used to calculate the actual loss value. If this parameter is not set, the value of this parameter is the same as that of micro_batch_interleave_num in Parallel Configuration

    int

    origin_epochs

    Set the initial number of training epochs in MFLossMonitor. If this parameter is not set, the value of this parameter is the same as that of runner_config.epochs in Model Training Configuration

    int

    dataset_size

    Set initial size of the dataset in MFLossMonitor. If this parameter is not set, the size of the initialized dataset is the same as the size of the actual dataset used for training

    int

    initial_epoch

    Set start epoch number of training in MFLossMonitor. The default value is 0

    int

    initial_step

    Set start step number of training in MFLossMonitor. The default value is 0

    int

    global_batch_size

    Set the number of global batch data samples in MFLossMonitor. If this parameter is not set, the system automatically calculates the number of global batch data samples based on the dataset size and parallel strategy

    int

    gradient_accumulation_steps

    Set the number of gradient accumulation steps in MFLossMonitor. If this parameter is not set, the value of this parameter is the same as that of gradient_accumulation_steps in Model Training Configuration

    int

    enable_tensorboard

    Whether to enable TensorBoard to record logs in MFLossMonitor. The default value is False

    bool

    tensorboard_path

    Set the path for saving TensorBoard logs in MFLossMonitor. This parameter is valid only when enable_tensorboard=True

    str

    check_for_nan_in_loss_and_grad

    Whether to enable overflow detection in MFLossMonitor. After overflow detection is enabled, the training exits if overflow occurs during model training. The default value is False

    bool

    If you do not need to enable TensorBoard log recording or overflow detection, you are advised to use the default settings.

  2. SummaryMonitor

    This callback function class is mainly used to collect Summary data, see mindspore.SummaryCollector for details.

  3. CheckpointMonitor

    This callback function class is mainly used to save the model weights file during the model training process and has several configurable items as follows:

    Parameters

    Descriptions

    Types

    prefix

    Set the prefix for saving file names

    str

    directory

    Set the directory where the files are saved

    str

    save_checkpoint_seconds

    Set the number of seconds between saving model weights

    int

    save_checkpoint_steps

    Set the number of interval steps for saving model weights

    int

    keep_checkpoint_max

    Set the maximum number of model weight files to be saved, if there are more model weight files in the save path, they will be deleted starting from the earliest file created to ensure that the total number of files does not exceed keep_checkpoint_max.

    int

    keep_checkpoint_per_n_minutes

    Set the number of minutes between saving model weights

    int

    integrated_save

    Whether to aggregate the weights when saving the weights file.
    1. True means that the weights of all devices are aggregated when the weight file is saved, i.e., every device saves the same weights.
    2. False means that each device saves its own weights.
    When using semi-automatic parallel mode, it usually needs to be set to False to avoid memory problems when saving the weights file.

    bool

    save_network_params

    Whether to save only the model weights. The default value is False.

    bool

    save_trainable_params

    Whether to additionally save the trainable parameter weights, i.e. the parameter weights of the model under partial fine-tuning. The default value is False.

    bool

    async_save

    Whether to save the model weights file asynchronously

    bool

Multiple Callbacks function classes can be configured at the same time under the callbacks field. The following is an example of callbacks configuration.

callbacks:
  - type: MFLossMonitor
  - type: CheckpointMonitor
    prefix: "name_xxb"
    save_checkpoint_steps: 1000
    integrated_save: False
    async_save: False

Processor Configuration

Processor is mainly used to preprocess the inference data fed into the model. Since the Processor configuration items are not fixed, only the generic configuration items of Processor in MindFormers are explained here.

Parameters

Descriptions

Types

processor.type

Set the data processing class

str

processor.return_tensors

Set the type of tensor returned by the data processing class, typically use 'ms'

str

processor.image_processor.type

Set the image data processing class

str

processor.tokenizer.type

Set the text tokenizer class

str

processor.tokenizer.vocab_file

Set the path of the file to be read by the text tokenizer, which needs to correspond to the tokenizer class

str
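
A sketch of a text-only `processor` block built from the generic items above. The class names `LlamaProcessor` and `LlamaTokenizer` and the vocabulary file path are assumptions for illustration; an image model would add a `processor.image_processor` entry instead of, or alongside, the tokenizer.

```yaml
processor:
  type: LlamaProcessor           # assumed data processing class
  return_tensors: ms             # return MindSpore tensors
  tokenizer:
    type: LlamaTokenizer         # assumed tokenizer class
    vocab_file: "/path/to/tokenizer.model"   # hypothetical path; must match the tokenizer class
```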

Model Evaluation Configuration

MindFormers provides model evaluation and also supports evaluating the model during training. The following configuration items relate to model evaluation.

Parameters

Descriptions

Types

eval_dataset

Used in the same way as train_dataset

-

eval_dataset_task

Used in the same way as train_dataset_task

-

metric.type

Used in the same way as callbacks

-

do_eval

Enable evaluation while training

bool

eval_step_interval

Set the step interval for evaluation. The default value is 100. A value less than 0 disables evaluation by step interval.

int

eval_epoch_interval

Set the epoch interval for evaluation. The default value is -1. A value less than 0 disables evaluation by epoch interval. This configuration is not recommended in data sinking mode.

int

metric.type

Set the type of evaluation

str
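
As an illustration, evaluation during training might be switched on as sketched below. The metric class `PerplexityMetric` is an assumption for this example; `eval_dataset` and `eval_dataset_task` are omitted because they follow the same layout as their training counterparts.

```yaml
do_eval: True              # evaluate while training
eval_step_interval: 100    # evaluate every 100 steps
eval_epoch_interval: -1    # a value below 0 disables evaluation by epoch interval

metric:
  type: PerplexityMetric   # assumed metric class
```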

Profile Configuration

MindFormers provides Profile as the main tool for model performance tuning. Please refer to Performance Tuning Guide for more details. The following is the Profile related configuration.

Parameters

Descriptions

Types

profile

Enable the performance capture tool, see mindspore.Profiler for details

bool

profile_start_step

Set the number of steps to start collecting performance data

int

profile_stop_step

Set the number of steps to stop collecting performance data

int

profile_communication

Set whether communication performance data is collected in multi-device training. This parameter has no effect in single-device training. Default: False

bool

profile_memory

Set whether to collect Tensor memory data

bool

init_start_profile

Set whether to turn on collecting performance data when the Profiler is initialized; this parameter does not take effect when profile_start_step is set. This parameter needs to be set to True when profile_memory is turned on.

bool
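
A small sketch of the Profile items, collecting performance data over a window of steps. The step numbers are illustrative assumptions.

```yaml
profile: True                  # enable the performance capture tool
profile_start_step: 1          # start collecting at step 1
profile_stop_step: 10          # stop collecting at step 10
profile_communication: False   # no effect in single-device training
profile_memory: False          # Tensor memory data is not collected in this sketch
init_start_profile: False      # does not take effect when profile_start_step is set
```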