mindformers.core.TrainingStateMonitor
- class mindformers.core.TrainingStateMonitor(origin_epochs: int, config: dict = None, per_print_times: int = 1, dataset_size: int = None, initial_epoch: int = 0, initial_step: int = 0, global_batch_size: int = 0)
Monitors metrics such as local norm and local loss during training.
- Parameters
origin_epochs (int) – Required. The number of training epochs.
config (dict, optional) – Configuration that specifies how metrics are displayed. Default: None, which means that every key takes the default value listed below. The keys are:
target: Names or regular expressions of the params to monitor. Must be a list of str, e.g. ["layers.[01]", "attention"]. Default: ['*'], which selects all params.
invert: Whether to invert target, i.e. params matched by target are not monitored. Must be bool. Default: False.
local_norm_format: Where to display the local norm. Must be a str in ['tensorboard', 'log'] (write the data to TensorBoard or to the log file, respectively), a list containing them, or None. Only the specified params are monitored. Selecting 'log' may produce a large amount of printed output. Set to None to ignore this metric. Default: None.
device_local_norm_format: Where to display the device local norm. Must be a str in ['tensorboard', 'log'] (write the data to TensorBoard or to the log file, respectively), a list containing them, or None. Set to None to ignore this metric. Default: None.
local_loss_format: Where to display the local loss. Must be a str in ['tensorboard', 'log'] (write the data to TensorBoard or to the log file, respectively), a list containing them, or None. Set to None to ignore this metric. Default: None.
optimizer_state_format: Where to display the optimizer state. Must be a str in ['tensorboard', 'log'] (write the data to TensorBoard or to the log file, respectively), a list containing them, or None. Only the optimizer state of the specified params is monitored. Selecting 'log' may produce a large amount of printed output. Set to None to ignore this metric. Default: None.
weight_state_format: Where to display the weight L2-norm. Must be a str in ['tensorboard', 'log'] (write the data to TensorBoard or to the log file, respectively), a list containing them, or None. Set to None to ignore this metric. Default: None.
throughput_baseline: The model throughput baseline used to calculate linearity. Must be a positive number. The result is written to both TensorBoard and the log. Set to None to ignore this metric. Default: None.
print_struct: Whether to print the structure of the model. If True, the callback prints the names of all trainable params at the first step and then quits the training process. Default: False.
per_print_times (int, optional) – Interval, in steps, at which log information is printed. Default: 1.
dataset_size (int, optional) – Size of the training dataset. Required in sink mode. Default: None.
initial_epoch (int, optional) – The epoch at which training begins. Default: 0.
initial_step (int, optional) – The step at which training begins. Default: 0.
global_batch_size (int, optional) – The total batch size. Default: 0.
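The following is a minimal sketch of a config dict that uses the keys documented above, plus a small check of the allowed values for the `*_format` keys. The helper `is_valid_format`, the function `build_monitor`, and all concrete values (epoch count, dataset size, batch size, throughput baseline) are illustrative assumptions, not part of the library. The constructor call uses only the parameters listed in the signature above; the real class may impose further requirements that are not described here.

```python
# Allowed destinations for the *_format keys, as listed in the parameter description.
ALLOWED_SINKS = ("tensorboard", "log")


def is_valid_format(value):
    """Hypothetical helper: accept None, 'tensorboard', 'log', or a list of those."""
    if value is None:
        return True
    if isinstance(value, str):
        return value in ALLOWED_SINKS
    if isinstance(value, list):
        return all(isinstance(v, str) and v in ALLOWED_SINKS for v in value)
    return False


# Example config; every value here is illustrative.
monitor_config = {
    "target": ["layers.[01]", "attention"],       # names or regular expressions
    "invert": False,                              # monitor the params matched by target
    "local_norm_format": ["tensorboard", "log"],  # write to both destinations
    "device_local_norm_format": None,             # None: ignore this metric
    "local_loss_format": "log",
    "optimizer_state_format": None,
    "weight_state_format": "tensorboard",
    "throughput_baseline": 100.0,                 # must be a positive number
    "print_struct": False,
}

# Check every *_format key before handing the dict to the callback.
FORMAT_KEYS = [key for key in monitor_config if key.endswith("_format")]
assert all(is_valid_format(monitor_config[key]) for key in FORMAT_KEYS)


def build_monitor(config):
    """Hypothetical wrapper: construct the callback with the documented parameters."""
    # Imported lazily so the sketch can be read and run without MindFormers installed.
    from mindformers.core import TrainingStateMonitor

    return TrainingStateMonitor(
        origin_epochs=2,
        config=config,
        per_print_times=10,
        dataset_size=1000,   # required in sink mode
        global_batch_size=64,
    )
```

Setting `print_struct` to `True` in such a dict would, according to the description above, print the names of all trainable params at the first step and then quit training, which can help when choosing `target` patterns.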
- epoch_begin(run_context)
Record the time at the beginning of an epoch.
- Parameters
run_context (RunContext) – Context of the process running.
- epoch_end(run_context)
Print training info at the end of an epoch.
- Parameters
run_context (RunContext) – Context of the process running.
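To illustrate how a pair of hooks like `epoch_begin` and `epoch_end` can cooperate, here is a minimal stand-in written without MindSpore. It is not the `TrainingStateMonitor` implementation: the class `EpochTimer` and the fake context object are hypothetical, and the use of `original_args()` with a `cur_epoch_num` attribute is an assumption that mirrors the usual MindSpore callback convention.

```python
import time
from types import SimpleNamespace


class EpochTimer:
    """Hypothetical stand-in that records epoch start time and reports elapsed time."""

    def __init__(self):
        self.epoch_start = None
        self.last_epoch_seconds = None

    def epoch_begin(self, run_context):
        # Record the time at the beginning of the epoch.
        self.epoch_start = time.perf_counter()

    def epoch_end(self, run_context):
        # Read the callback parameters from the context and print a summary.
        cb_params = run_context.original_args()
        self.last_epoch_seconds = time.perf_counter() - self.epoch_start
        print(f"epoch {cb_params.cur_epoch_num} took {self.last_epoch_seconds:.3f} s")


# Fake context for demonstration only; a real RunContext is supplied by the training loop.
fake_context = SimpleNamespace(
    original_args=lambda: SimpleNamespace(cur_epoch_num=1)
)

timer = EpochTimer()
timer.epoch_begin(fake_context)
timer.epoch_end(fake_context)
```

In real training the framework calls these hooks itself, so user code normally only registers the callback and never invokes the hooks directly.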