mindformers.core.TrainingStateMonitor

class mindformers.core.TrainingStateMonitor(origin_epochs: int, config: dict = None, step_interval: int = 1, dataset_size: int = None, initial_epoch: int = 0, initial_step: int = 0, global_batch_size: int = 0, check_for_nan_in_loss_and_grad: bool = False)

Monitors metrics such as local norm and local loss during training.

Parameters

origin_epochs (int) – Required. Number of training epochs.
config (dict, optional) –
Specifies how metrics are displayed. The supported keys are listed below. Default: None, meaning every key takes the default value given below.
- target: Names or regular expressions of the params to monitor. Must be a list of str, e.g. ["layers.[01]", "attention"]. Default: ['*'], which selects all params.
- invert: Whether to invert target, i.e. params matching target are not monitored. Must be bool. Default: False.
- local_norm_format: Where to display the local norm. Must be a str in ['tensorboard', 'log'] (write the data to TensorBoard or to the log file), a list containing them, or None. Only the specified params are monitored. Selecting 'log' may produce a large amount of printed output. Set to None to ignore this metric. Default: None.
- device_local_norm_format: Where to display the device local norm. Must be a str in ['tensorboard', 'log'] (write the data to TensorBoard or to the log file), a list containing them, or None. Set to None to ignore this metric. Default: None.
- local_loss_format: Where to display the local loss. Must be a str in ['tensorboard', 'log'] (write the data to TensorBoard or to the log file), a list containing them, or None. Set to None to ignore this metric. Default: None.
- device_local_loss_format: Where to display the device local loss. Must be a str in ['tensorboard', 'log'] (write the data to TensorBoard or to the log file), a list containing them, or None. Set to None to ignore this metric. Default: None.
- optimizer_state_format: Where to display the optimizer state. Must be a str in ['tensorboard', 'log'] (write the data to TensorBoard or to the log file), a list containing them, or None. Only the optimizer state of the specified params is monitored. Selecting 'log' may produce a large amount of printed output. Set to None to ignore this metric. Default: None.
- weight_state_format: Where to display the weight L2-norm. Must be a str in ['tensorboard', 'log'] (write the data to TensorBoard or to the log file), a list containing them, or None. Set to None to ignore this metric. Default: None.
- throughput_baseline: The model throughput baseline used to calculate linearity. Must be a positive number. The result is written to both TensorBoard and the log. Set to None to ignore this metric. Default: None.
- print_struct: Whether to print the model structure. If True, the callback prints the names of all trainable params at the first step and then quits the training process. Default: False.
step_interval (int, optional) – Interval, in steps, at which metrics are displayed. Default: 1.
dataset_size (int, optional) – Size of the training dataset. Required in sink mode. Default: None.
initial_epoch (int, optional) – The beginning epoch. Default: 0.
initial_step (int, optional) – The beginning step. Default: 0.
global_batch_size (int, optional) – The total batch size. Default: 0.
check_for_nan_in_loss_and_grad (bool, optional) – Whether to check if the loss or the gradient norm is NaN. Default: False.
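
Example

The following is a minimal sketch of how a config dict for this callback might be assembled from the keys documented above. The helper names check_monitor_config and build_monitor, and all concrete values (epochs, dataset size, batch size), are hypothetical and only illustrate the documented constraints; they are not part of MindFormers. The constructor call uses only the parameters shown in the class signature and is wrapped in a function so the snippet runs without MindFormers installed.

```python
# Destinations accepted by the *_format keys, as listed in the parameter docs.
ALLOWED_FORMATS = {"tensorboard", "log"}
FORMAT_KEYS = (
    "local_norm_format",
    "device_local_norm_format",
    "local_loss_format",
    "device_local_loss_format",
    "optimizer_state_format",
    "weight_state_format",
)


def check_monitor_config(config):
    """Hypothetical helper: check a config dict against the documented constraints."""
    target = config.get("target", ["*"])
    if not (isinstance(target, list) and all(isinstance(t, str) for t in target)):
        raise TypeError("target must be a list of str")
    if not isinstance(config.get("invert", False), bool):
        raise TypeError("invert must be a bool")
    for key in FORMAT_KEYS:
        value = config.get(key)
        if value is None:  # None means the metric is ignored
            continue
        values = [value] if isinstance(value, str) else list(value)
        if not set(values) <= ALLOWED_FORMATS:
            raise ValueError(f"{key} must be 'tensorboard', 'log', a list of them, or None")
    baseline = config.get("throughput_baseline")
    if baseline is not None:
        is_number = isinstance(baseline, (int, float)) and not isinstance(baseline, bool)
        if not is_number or baseline <= 0:
            raise ValueError("throughput_baseline must be a positive number")
    return config


# Illustrative values only: monitor attention params, send loss to both outputs.
monitor_config = check_monitor_config({
    "target": ["layers.[01]", "attention"],
    "invert": False,
    "local_norm_format": "tensorboard",
    "local_loss_format": ["tensorboard", "log"],
    "throughput_baseline": 100.0,
    "print_struct": False,
})


def build_monitor(config):
    """Create the callback; requires MindFormers to be installed."""
    from mindformers.core import TrainingStateMonitor

    return TrainingStateMonitor(
        origin_epochs=2,       # illustrative value
        config=config,
        step_interval=10,      # display metrics every 10 steps
        dataset_size=1000,     # illustrative; required in sink mode
        global_batch_size=64,  # illustrative value
    )
```

In an environment with MindFormers available, build_monitor(monitor_config) would return the callback to pass to the trainer. Keys left out of the dict keep their documented defaults.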

abnormal_global_norm_check(cb_params): Checks for an abnormal global_norm and raises an error if one is found.
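
The documentation does not state what counts as an abnormal global_norm. As a rough illustration only, the sketch below treats a non-finite value (NaN or infinity), or a value above an optional limit, as abnormal. The function name check_global_norm and the threshold parameter are hypothetical and do not reflect the library's actual implementation.

```python
import math


def check_global_norm(global_norm, threshold=None):
    """Hypothetical check: raise if global_norm is non-finite or above threshold."""
    if not math.isfinite(global_norm):
        raise ValueError(f"global_norm is not finite: {global_norm}")
    # threshold is an assumed, optional upper bound; None disables this check.
    if threshold is not None and global_norm > threshold:
        raise ValueError(f"global_norm {global_norm} exceeds threshold {threshold}")
    return global_norm
```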