Training Metrics Monitoring

View Source On Gitee

MindSpore Transformers supports TensorBoard as a visualization tool for monitoring and analyzing various metrics and information during training. TensorBoard is a standalone visualization library that requires the user to manually install it, and it provides an interactive way to view loss, precision, learning rate, gradient distribution, and a variety of other things in training. After the user configures TensorBoard in the training yaml file, the event file is generated and updated in real time during the training of the large model, and the training data can be viewed via commands.

Configuration Descriptions

Configure the "monitor_config", "tensorboard" and "callbacks" keywords in the training yaml file, and the training will save the tensorboard event file under the configured save address. A sample configuration is shown below:

Configuration Sample of yaml File

seed: 0
output_dir: './output'

monitor_config:
    monitor_on: True
    dump_path: './dump'
    target: ['layers.0', 'layers.1'] # Monitor only the first and second level parameters
    invert: False
    step_interval: 1
    local_loss_format: ['log', 'tensorboard']
    local_norm_format: ['log', 'tensorboard']
    device_local_norm_format: ['log', 'tensorboard']
    optimizer_state_format: null
    weight_state_format: null
    throughput_baseline: null
    print_struct: False

tensorboard:
    tensorboard_dir: 'worker/tensorboard'
    tensorboard_queue_size: 10
    log_loss_scale_to_tensorboard: True
    log_timers_to_tensorboard: True

callbacks:
    - type: MFLossMonitor
      per_print_times: 1

monitor_config field parameter name

Descriptions

Types

monitor_config.monitor_on

Sets whether monitoring is enabled. The default is False, when all the following parameters do not take effect

bool

monitor_config.dump_path

Sets the path where the local_norm, device_local_norm, and local_loss metrics files are saved during training. When not set or set to null take the default value '. /dump'

str

monitor_config.target

Sets the name (fragment) of the target parameter monitored by the indicator optimizer_state and local_norm, which can be a regular expression. When not set or set to null take the default value ['. *'], i.e. specify all parameters

list[str]

monitor_config.invert

Sets the parameter specified by counterselecting monitor_config.target. Defaults to False.

bool

monitor_config.step_interval

Sets the frequency of logging the indicator. Default is 1, i.e., record once per step

int

monitor_config.local_loss_format

Sets the logging form of the indicator local_loss

str or list[str]

monitor_config.local_norm_format

Sets the logging form of the indicator local_norm

str or list[str]

monitor_config.device_local_norm_format

Sets the logging form of the indicator device_local_norm

str or list[str]

monitor_config.optimizer_state_format

Sets the logging form of the indicator optimizer_state

str or list[str]

monitor_config.weight_state_format

Sets the logging form of the indicator weight L2-norm

str or list[str]

monitor_config.throughput_baseline

Sets the baseline value for the metric throughput linearity, which needs to be positive. It will be written to both Tensorboard and logs. Defaults to null when not set, indicating that the metric is not monitored

int or float

monitor_config.print_struct

Sets whether to print all trainable parameter names for the model. If True, it will print the names of all trainable parameters at the start of the first step and exit training at the end of the step. Default is False.

bool

The optional values for the parameters of the form xxx_format above are the strings 'tensorboard' and 'log' (for writing to the Tensorboard and writing to the log, respectively), or a list of both, or null. All default to null when not set, indicating that the corresponding metrics are not monitored.

Note: when monitoring optimizer_state and weight L2 norm metrics is enabled, it will greatly increase the time consumption of the training process, so please choose carefully according to your needs. "rank_x" directory under the monitor_config.dump_path path will be cleared, so make sure that there is no file under the set path that needs to be kept.

tensoraboardfield parameter name

Descriptions

Types

tensorboard.tensorboard_dir

Sets the path where TensorBoard event files are saved

str

tensorboard.tensorboard_queue_size

Sets the maximum cache value of the capture queue. If it exceeds this value, it will be written to the event file, the default value is 10.

int

tensorboard.log_loss_scale_to_tensorboard

Sets whether loss scale information is logged to the event file, default is False.

bool

tensorboard.log_timers_to_tensorboard

Sets whether to log timer information to the event file. The timer information contains the duration of the current training step (or iteration) as well as the throughput, defaults to False

bool

tensorboard.tensorboard_dir can be specified via the environment variable 'MA_SUMMARY_LOG_DIR', at which point a default tensorboard configuration will be automatically generated if tensorboard is not configured. It should be noted that without the tensorboard configuration, the "tensorboard" set in xxx_format by monitor_config will be replaced with "log", i.e., instead of writing to the tensorboard event file, the corresponding information will be printed in the log.

Viewing Training Data

After the above configuration, the event file for each card will be saved under the path . /worker/tensorboard/rank_{id}, where {id} is the rank number of each card. The event files are named events.*. The file contains scalars and text data, where scalars are the scalars of key metrics in the training process, such as learning rate, loss, etc.; text is the text data of all configurations for the training task, such as parallel configuration, dataset configuration, etc. In addition, according to the specific configuration, some metrics will be displayed in the log.

Use the following command to start the Tensorboard Web Visualization Service:

tensorboard --logdir=./worker/tensorboard/ --host=0.0.0.0 --port=6006

Parameter names

Descriptions

logdir

Path to the folder where TensorBoard saves event files

host

The default is 127.0.0.1, which means that only local access is allowed; setting it to 0.0.0.0 allows external devices to access it, so please pay attention to information security.

port

Set the port on which the service listens, the default is 6006.

The following is displayed when the command in the sample is entered:

TensorBoard 2.18.0 at http://0.0.0.0:6006/ (Press CTRL+C to quit)

2.18.0 indicates the version number of the current TensorBoard installation (the recommended version is 2.18.0), and 0.0.0.0 and 6006 correspond to the input --host and --port respectively, after which you can visit server public ip:port in the local PC's browser to view the visualization page. For example, if the public IP of the server is 192.168.1.1, then access 192.168.1.1:6006.

Explanation of the Visualization of Indicators

The callback functions MFLossMonitor and TrainingStateMonitor will monitor different scalar metrics respectively. The TrainingStateMonitor does not need to be set by the user in the configuration file, it will be added automatically according to monitor_config.

MFLossMonitor Monitoring Metrics

The names and descriptions of the metrics monitored by MFLossMonitor are listed below:

Scalar name

Descriptions

learning-rate

learning rate

batch-size

batch size

loss

loss

loss-scale

Loss scaling factor, logging requires setting log_loss_scale_to_tensorboard to True

grad-norm

gradient exponent

iteration-time

The time taken for training iterations, logging requires setting log_timers_to_tensorboard to True

throughput

Data throughput, logging requires setting log_timers_to_tensorboard to True

model-flops-throughput-per-npu

Model operator throughput in TFLOPS/npu (trillion floating point operations per second per card)

B-samples-per-day

Cluster data throughput in B samples/day (one billion samples per day), logging requires setting log_timers_to_tensorboard to True

In Tensorboard SCALARS page, the above metrics (assumed to be named scalar_name) have drop-down tabs for scalar_name and scalar_name-vs-samples, except for the last two. A line plot of this scalar versus the number of training iterations is shown under scalar_name, and a line plot of this scalar versus the number of samples is shown under scalar_name-vs-samples. An example of a plot of learning rate learning-rate is shown below:

/tensorboard_scalar

TrainingStateMonitor Monitoring Metrics

The names and descriptions of the metrics monitored by TrainingStateMonitor are listed below:

Scalar name

Descriptions

local_norm

Gradient paradigm for each parameter on a single card, records need to set local_norm_format to non-null

device_local_norm

the total number of gradient paradigms on a single card, records need to set device_local_norm_format to non-null

local_loss

localized losses on a single card, records need to set local_loss_format to non-null

adam_m_norm

The optimizer's first-order moments estimate the number of paradigms for each parameter, records need to set optimizer_state_format to non-null

adam_v_norm

The optimizer's second-order moments estimate the number of paradigms for each parameter, records need to set optimizer_state_format to non-null

weight_norm

weight L2 paradigm, records need to set weight_state_format to non-null

throughput_linearity

data throughput linearity, records need to set throughput_baseline to non-null

Depending on the specific settings, the above metrics will be displayed in the Tensorboard or logs as follows:

Example of logging effect

/TrainingStateMonitor_log

Example of tensorboard visualization

adam_m_norm

/adam_m_norm

local_loss and local_norm

/local_loss&local_norm

Description of Text Data Visualization

On the TEXT page, a tab exists for each training configuration where the values for that configuration are recorded. This is shown in the following figure:

/tensorboard_text

All configuration names and descriptions are listed below:

Configuration names

descriptions

seed

random seed

output_dir

Save paths to checkpoint and strategy

run_mode

running mode

use_parallel

whether to enable parallel

resume_training

whether to enable resume training

ignore_data_skip

Whether to ignore the mechanism for skipping data during breakpoints in resume training and read the dataset from the beginning. Recorded only if the resume_training value is True

data_skip_steps

The number of data set skip steps. Only logged if ignore_data_skip is logged and the value is False.

load_checkpoint

Model name or weight path for loading weights

load_ckpt_format

File format for load weights. Only logged if the load_checkpoint value is not null

auto_trans_ckpt

Whether to enable automatic online weight slicing or conversion. Only logged if the load_checkpoint value is not null

transform_process_num

The number of processes to convert the checkpoint. Only logged if auto_trans_ckpt is logged and the value is True.

src_strategy_path_or_dir

Source weight distributed policy file path. Only logged if auto_trans_ckpt is logged and the value is True.

load_ckpt_async

Whether to log weights asynchronously. Only logged if the load_checkpoint value is not null

only_save_strategy

Whether the task saves only distributed policy files

profile

Whether to enable performance analysis tools

profile_communication

Whether to collect communication performance data in multi-device training. Recorded only when profile value is True

profile_level

Capture performance data levels. Recorded only when profile value is True

profile_memory

Whether to collect Tensor memory data. Recorded only when profile value is True

profile_start_step

Performance analysis starts with step. Recorded only when profile value is True

profile_stop_step

Performance analysis ends with step. Recorded only when profile value is True

profile_rank_ids

Specify the rank ids to turn on profiling. Recorded only when profile value is True

profile_pipeline

Whether to turn on profiling by for the cards of each stage in pipeline parallel. Recorded only when profile value is True

init_start_profile

Whether to enable data acquisition during Profiler initialization.Recorded only when profile value is True

layer_decay

Layer decay coefficient

layer_scale

whether to enable layer scaling

lr_scale

Whether to enable learning rate scaling

lr_scale_factor

Learning rate scaling factor. Recorded only when lr_scale value is True

micro_batch_interleave_num

Number of batch_size splits, multicopy parallelism switch

remote_save_url

Return folder paths for target buckets when using AICC training jobs

callbacks

callback function configuration

context

Configuration of the environment

data_size

Dataset size

device_num

Number of devices (cards)

do_eval

Whether to turn on training-while-evaluating

eval_callbacks

Evaluate the callback function configuration. Recorded only when do_eval value is True

eval_step_interval

Evaluate step intervals. Recorded only when do_eval value is True

eval_epoch_interval

Evaluate the epoch interval. Recorded only when do_eval value is True

eval_dataset

Evaluate the dataset configuration. Recorded only when do_eval value is True

eval_dataset_task

Evaluate task configurations. Recorded only when do_eval value is True

lr_schedule

learning rate

metric

evaluation function

model

Model configuration

moe_config

Mixed expert configurations

optimizer

optimizer

parallel_config

Parallel strategy configuration

parallel

Automatic parallel configuration

recompute_config

recomputation configuration

remove_redundancy

Whether redundancy is removed when checkpoint is saved

runner_config

running configuration

runner_wrapper

wrapper configuration

monitor_config

Training metrics monitoring configuration

tensorboard

TensorBoard configuration

train_dataset_task

Training task configuration

train_dataset

Training dataset configuration

trainer

Training process configuration

The above training configurations are derived from:

  1. Configuration parameters passed in by the user in the training startup command run_mindformer.py;

  2. Configuration parameters set by the user in the training configuration file yaml;

  3. Default configuration parameters during training.

Refer to Configuration File Description for all configurable parameters.