mindformers.core.MFLossMonitor

class mindformers.core.MFLossMonitor(learning_rate: Optional[Union[float, LearningRateSchedule]] = None, per_print_times: int = 1, micro_batch_num: int = 1, micro_batch_interleave_num: int = 1, origin_epochs: int = None, dataset_size: int = None, initial_epoch: int = 0, initial_step: int = 0, global_batch_size: int = 0, gradient_accumulation_steps: int = 1, check_for_nan_in_loss_and_grad: bool = False, calculate_per_token_loss: bool = False)

Monitor the loss and other parameters during the training process.

Parameters

learning_rate (Union[float, LearningRateSchedule], optional) – The learning rate, given as a fixed value or as a schedule. Default: None.
per_print_times (int, optional) – Interval, in steps, at which log information is printed. Default: 1.
micro_batch_num (int, optional) – Number of micro-batches used for pipeline parallelism. Default: 1.
micro_batch_interleave_num (int, optional) – Number of parts the batch is split into for interleaving. Default: 1.
origin_epochs (int, optional) – Number of training epochs. Default: None.
dataset_size (int, optional) – Training dataset size. Default: None.
initial_epoch (int, optional) – The beginning epoch. Default: 0.
initial_step (int, optional) – The beginning step. Default: 0.
global_batch_size (int, optional) – The total batch size. Default: 0.
gradient_accumulation_steps (int, optional) – The gradient accumulation steps. Default: 1.
check_for_nan_in_loss_and_grad (bool, optional) – Whether to check the loss and the gradient norm for NaN values. Default: False.
calculate_per_token_loss (bool, optional) – Whether to calculate the loss of each token. Default: False.

Examples

>>> from mindformers.core import MFLossMonitor
>>> lr = [0.01, 0.008, 0.006, 0.005, 0.002]
>>> monitor = MFLossMonitor(learning_rate=lr, per_print_times=10)
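The effect of per_print_times can be pictured with a small standalone sketch. It assumes that a log line is emitted whenever the 1-based step number is a multiple of per_print_times; the helper steps_to_log is hypothetical and is not part of MindFormers, and the real monitor may decide when to print differently.

```python
def steps_to_log(total_steps: int, per_print_times: int = 1) -> list:
    """Return the 1-based step numbers at which a log line would be printed.

    Assumption: logging happens on every step whose number is a multiple
    of per_print_times. This is an illustration, not library code.
    """
    if per_print_times <= 0:
        raise ValueError("per_print_times must be a positive integer")
    return [
        step
        for step in range(1, total_steps + 1)
        if step % per_print_times == 0
    ]


# With per_print_times=10, a 35-step run would log on three steps.
print(steps_to_log(35, per_print_times=10))
```

Under this assumption, the default per_print_times=1 logs every step, while larger values thin out the log without changing training itself.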

on_train_epoch_begin(run_context)

Record time at the beginning of epoch.

Parameters: run_context (RunContext) – Context of the process running.

on_train_step_begin(run_context)

Record time at the beginning of step.

Parameters: run_context (RunContext) – Context of the process running.

on_train_step_end(run_context)

Print training info at the end of step.

Parameters: run_context (RunContext) – Context of the process running.
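The way the three hooks cooperate can be sketched without MindSpore installed. TinyLossMonitor and FakeRunContext below are hypothetical stand-ins, not the MindFormers implementation; the sketch assumes that the run context exposes the training state through original_args() with cur_step_num and net_outputs fields, and that net_outputs holds a plain float loss.

```python
import time
from types import SimpleNamespace


class FakeRunContext:
    """Hypothetical stand-in for RunContext; only original_args() is mimicked."""

    def __init__(self, **params):
        self._params = SimpleNamespace(**params)

    def original_args(self):
        return self._params


class TinyLossMonitor:
    """Illustrative monitor: times each step and logs the loss at step end."""

    def __init__(self, per_print_times=1):
        self.per_print_times = per_print_times
        self.epoch_start = None
        self.step_start = None
        self.logged = []  # kept so the printed lines can be inspected

    def on_train_epoch_begin(self, run_context):
        # Record time at the beginning of the epoch.
        self.epoch_start = time.perf_counter()

    def on_train_step_begin(self, run_context):
        # Record time at the beginning of the step.
        self.step_start = time.perf_counter()

    def on_train_step_end(self, run_context):
        # Print training info at the end of the step.
        params = run_context.original_args()
        step_ms = (time.perf_counter() - self.step_start) * 1000
        if params.cur_step_num % self.per_print_times == 0:
            line = (
                f"step {params.cur_step_num}: "
                f"loss={params.net_outputs:.4f}, {step_ms:.1f} ms"
            )
            self.logged.append(line)
            print(line)


# Drive the hooks by hand, as a training loop would.
monitor = TinyLossMonitor(per_print_times=2)
monitor.on_train_epoch_begin(FakeRunContext(cur_epoch_num=1))
for step, loss in enumerate([2.5, 2.1, 1.8, 1.6], start=1):
    ctx = FakeRunContext(cur_step_num=step, net_outputs=loss)
    monitor.on_train_step_begin(ctx)
    monitor.on_train_step_end(ctx)
```

In real training the framework invokes these hooks itself; the manual loop here only makes the call order visible.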