mindformers.core.TrainingStateMonitor

class mindformers.core.TrainingStateMonitor(origin_epochs: int, config: dict = None, per_print_times: int = 1, dataset_size: int = None, initial_epoch: int = 0, initial_step: int = 0, global_batch_size: int = 0)

Monitors metrics such as local norm and local loss during training.

Parameters
  • origin_epochs (int) – Required. Number of training epochs.

  • config (dict, optional) – Specifies how metrics are displayed. Keys are listed below. Default: None, which means every key takes its default value as listed below.

    • target: Names or regular expressions of the params to monitor. Must be a list of str, e.g. ["layers.[01]", "attention"]. Default: ['*'], which selects all params.

    • invert: Whether to invert target, i.e. params in target won't be monitored. Must be bool. Default: False

    • local_norm_format: Determines where the local norm is displayed. Should be a str in ['tensorboard', 'log'] (write the data to tensorboard or to the log file), a list containing them, or None. Only the specified params are monitored. Selecting 'log' may produce a large amount of printed output. Set to None to ignore this metric. Default: None.

    • device_local_norm_format: Determines where the device local norm is displayed. Should be a str in ['tensorboard', 'log'] (write the data to tensorboard or to the log file), a list containing them, or None. Set to None to ignore this metric. Default: None.

    • local_loss_format: Determines where the local loss is displayed. Should be a str in ['tensorboard', 'log'] (write the data to tensorboard or to the log file), a list containing them, or None. Set to None to ignore this metric. Default: None.

    • optimizer_state_format: Determines where the optimizer state is displayed. Should be a str in ['tensorboard', 'log'] (write the data to tensorboard or to the log file), a list containing them, or None. Only the optimizer state of the specified params is monitored. Selecting 'log' may produce a large amount of printed output. Set to None to ignore this metric. Default: None.

    • weight_state_format: Determines where the weight L2-norm is displayed. Should be a str in ['tensorboard', 'log'] (write the data to tensorboard or to the log file), a list containing them, or None. Set to None to ignore this metric. Default: None.

    • throughput_baseline: The model throughput baseline used to calculate linearity. Must be a positive number. The result is written to both tensorboard and the log. Set to None to ignore this metric. Default: None.

    • print_struct: Whether to print the structure of the model. If True, the callback prints the names of all trainable params at the first step and then stops the training process. Default: False.

  • per_print_times (int, optional) – Number of steps between log prints. Default: 1.

  • dataset_size (int, optional) – Required in sink mode. Training dataset size. Default: None.

  • initial_epoch (int, optional) – The beginning epoch. Default: 0.

  • initial_step (int, optional) – The beginning step. Default: 0.

  • global_batch_size (int, optional) – The total batch size. Default: 0.
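
The config keys above can be sketched as a plain Python dict. This is a minimal illustration under stated assumptions, not the library's own code: the values are arbitrary examples, `is_valid_format` and `select_params` are hypothetical helpers that do not exist in MindFormers, and `select_params` assumes that target patterns are applied with `re.search`, which the description above does not confirm. `build_monitor` only passes arguments named in the class signature and needs MindFormers installed to run.

```python
import re

# Output destinations accepted by the *_format keys, per the list above.
ALLOWED_SINKS = {"tensorboard", "log"}

# Example config using only the documented keys; values are illustrative.
monitor_config = {
    "target": ["layers.[01]", "attention"],
    "invert": False,
    "local_norm_format": ["tensorboard", "log"],
    "device_local_norm_format": "log",
    "local_loss_format": "tensorboard",
    "optimizer_state_format": None,   # None: ignore this metric
    "weight_state_format": None,
    "throughput_baseline": None,
    "print_struct": False,
}


def is_valid_format(value):
    """Hypothetical helper: check a *_format value against the documented forms."""
    if value is None:
        return True
    if isinstance(value, str):
        return value in ALLOWED_SINKS
    if isinstance(value, list):
        return all(isinstance(v, str) and v in ALLOWED_SINKS for v in value)
    return False


def select_params(names, target, invert=False):
    """Hypothetical helper: pick param names, ASSUMING re.search semantics."""
    selected = []
    for name in names:
        hit = any(re.search(pattern, name) for pattern in target)
        # With invert=True, params that match target are excluded instead.
        if hit != invert:
            selected.append(name)
    return selected


def build_monitor(config, epochs):
    """Construct the callback; requires MindFormers to be installed."""
    from mindformers.core import TrainingStateMonitor
    return TrainingStateMonitor(origin_epochs=epochs, config=config, per_print_times=1)
```

Running the hypothetical `select_params` on a list of parameter names before training is one way to preview which params a given target and invert combination would cover, again only under the `re.search` assumption.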

epoch_begin(run_context)

Record time at the beginning of epoch.

Parameters

run_context (RunContext) – Context of the process running.

epoch_end(run_context)

Print training info at the end of epoch.

Parameters

run_context (RunContext) – Context of the process running.

step_begin(run_context)

Record time at the beginning of step.

Parameters

run_context (RunContext) – Context of the process running.

step_end(run_context)

Print training info at the end of step.

Parameters

run_context (RunContext) – Context of the process running.
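
The begin/end hooks described above follow a common callback pattern: record a timestamp when a step starts, then report at the end of the step. The sketch below illustrates that pattern only. `StepTimer` is a stand-in class, not the MindFormers implementation, and a plain dict with a hypothetical `cur_step_num` key stands in for RunContext.

```python
import time


class StepTimer:
    """Stand-in callback (not the MindFormers class) showing the hook pattern."""

    def __init__(self, per_print_times=1):
        self.per_print_times = per_print_times
        self.step_start = None
        self.messages = []

    def step_begin(self, run_context):
        # Record the time at the beginning of the step.
        self.step_start = time.perf_counter()

    def step_end(self, run_context):
        # Compute the elapsed time and report once every `per_print_times` steps.
        elapsed_ms = (time.perf_counter() - self.step_start) * 1000
        step = run_context["cur_step_num"]
        if step % self.per_print_times == 0:
            self.messages.append(f"step {step}: {elapsed_ms:.3f} ms")


timer = StepTimer(per_print_times=2)
for step in range(1, 5):
    ctx = {"cur_step_num": step}  # a plain dict stands in for RunContext here
    timer.step_begin(ctx)
    timer.step_end(ctx)
```

With `per_print_times=2` and four steps, the stand-in collects a message for steps 2 and 4 only, which mirrors the role of per_print_times in limiting how often information is printed.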