mindspore.train

mindspore.train.summary

Summary-related classes and functions. Users can use SummaryRecord to dump summary data; a summary is a series of operations that collect data for analysis and visualization.

class mindspore.train.summary.SummaryRecord(log_dir, file_prefix='events', file_suffix='_MS', network=None, max_file_size=None, raise_exception=False, export_options=None)[source]

SummaryRecord is used to record the summary data and lineage data.

The API lazily creates a summary file and lineage files in the given directory and writes data to them. Data is written to the files by executing the record method. In addition to recording the data bubbled up from the network by defining summary operators, SummaryRecord also supports recording extra data, which can be added by calling add_value.

Note

  1. Make sure to close the SummaryRecord at the end, otherwise the process will not exit. Please see the Example section below to learn how to close it properly in two ways.

  2. Only one SummaryRecord instance is allowed at a time, otherwise it will cause data writing problems.

  3. SummaryRecord only supports Linux systems.

  4. Summary is not supported when the source is compiled with the -s on option.

Parameters
  • log_dir (str) – The log_dir is a directory location to save the summary.

  • file_prefix (str) – The prefix of file. Default: “events”.

  • file_suffix (str) – The suffix of file. Default: “_MS”.

  • network (Cell) – The network from which a pipeline is obtained for saving the graph summary. Default: None.

  • max_file_size (int, optional) – The maximum size of each file that can be written to disk (in bytes). For example, to write not larger than 4GB, specify max_file_size=4*1024**3. Default: None, which means no limit.

  • raise_exception (bool, optional) – Sets whether to throw an exception when a RuntimeError or OSError exception occurs in recording data. Default: False, this means that error logs are printed and no exception is thrown.

  • export_options (Union[None, dict]) –

    Perform custom operations on the export data. Note that the size of export files is not limited by the max_file_size. You can customize the export data with a dictionary. For example, you can set {‘tensor_format’: ‘npy’} to export tensor as npy file. The data that supports control is shown below. Default: None, it means that the data is not exported.

    • tensor_format (Union[str, None]): Customize the export tensor format. Supports [“npy”, None]. Default: None, it means that the tensor is not exported.

      • npy: export tensor as npy file.
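
For illustration, a hedged sketch of enabling npy export when constructing the record; the log directory path is a placeholder:

>>> from mindspore.train.summary import SummaryRecord
>>> if __name__ == '__main__':
...     # Export recorded tensors as npy files in addition to the summary file.
...     with SummaryRecord(log_dir="./summary_dir",
...                        export_options={'tensor_format': 'npy'}) as summary_record:
...         pass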

Raises
  • TypeError – If max_file_size is not int, or file_prefix and file_suffix are not strings.

  • ValueError – Summary is not supported; please recompile the source without the -s on option.

Examples

>>> from mindspore.train.summary import SummaryRecord
>>> if __name__ == '__main__':
...     # use in with statement to auto close
...     with SummaryRecord(log_dir="./summary_dir") as summary_record:
...         pass
...
...     # use in try .. finally .. to ensure closing
...     try:
...         summary_record = SummaryRecord(log_dir="./summary_dir")
...     finally:
...         summary_record.close()
add_value(plugin, name, value)[source]

Add value to be recorded later.

Parameters
  • plugin (str) –

    The plugin of the value.

    • graph: the value is a computational graph.

    • scalar: the value is a scalar.

    • image: the value is an image.

    • tensor: the value is a tensor.

    • histogram: the value is a histogram.

    • train_lineage: the value is a lineage data for the training phase.

    • eval_lineage: the value is a lineage data for the evaluation phase.

    • dataset_graph: the value is a dataset graph.

    • custom_lineage_data: the value is a customized lineage data.

  • name (str) – The value of the name.

  • value (Union[Tensor, GraphProto, TrainLineage, EvaluationLineage, DatasetGraph, UserDefinedInfo]) –

    The value to store.

    • The data type of value should be ‘GraphProto’ (see mindspore/ccsrc/anf_ir.proto) object when the plugin is ‘graph’.

    • The data type of value should be ‘Tensor’ object when the plugin is ‘scalar’, ‘image’, ‘tensor’ or ‘histogram’.

    • The data type of value should be a ‘TrainLineage’ object when the plugin is ‘train_lineage’, see mindspore/ccsrc/lineage.proto.

    • The data type of value should be a ‘EvaluationLineage’ object when the plugin is ‘eval_lineage’, see mindspore/ccsrc/lineage.proto.

    • The data type of value should be a ‘DatasetGraph’ object when the plugin is ‘dataset_graph’, see mindspore/ccsrc/lineage.proto.

    • The data type of value should be a ‘UserDefinedInfo’ object when the plugin is ‘custom_lineage_data’, see mindspore/ccsrc/lineage.proto.

Raises
  • ValueError – If plugin is not within the optional values.

  • TypeError – If name is not a non-empty string, or the data type of value is not a ‘Tensor’ object when the plugin is ‘scalar’, ‘image’, ‘tensor’ or ‘histogram’.

Examples

>>> from mindspore import Tensor
>>> from mindspore.train.summary import SummaryRecord
>>> if __name__ == '__main__':
...     with SummaryRecord(log_dir="./summary_dir", file_prefix="xx_", file_suffix="_yy") as summary_record:
...         summary_record.add_value('scalar', 'loss', Tensor(0.1))
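
As noted above, values added by add_value are buffered and written when record is called; a minimal sketch (step number and values are illustrative):

>>> from mindspore import Tensor
>>> from mindspore.train.summary import SummaryRecord
>>> if __name__ == '__main__':
...     with SummaryRecord(log_dir="./summary_dir") as summary_record:
...         # The scalar is buffered here and flushed to disk by record.
...         summary_record.add_value('scalar', 'loss', Tensor(0.1))
...         summary_record.record(step=1)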
close()[source]

Flush the buffer, write files to disk and close the summary record. Please use the with statement to auto-close.

Examples

>>> from mindspore.train.summary import SummaryRecord
>>> if __name__ == '__main__':
...     try:
...         summary_record = SummaryRecord(log_dir="./summary_dir")
...     finally:
...         summary_record.close()
flush()[source]

Flush the buffer and write files to disk.

Call it to make sure that all pending events have been written to disk.

Examples

>>> from mindspore.train.summary import SummaryRecord
>>> if __name__ == '__main__':
...     with SummaryRecord(log_dir="./summary_dir", file_prefix="xx_", file_suffix="_yy") as summary_record:
...         summary_record.flush()
property log_dir

Get the full path of the log file.

Returns

str, the full path of log file.

Examples

>>> from mindspore.train.summary import SummaryRecord
>>> if __name__ == '__main__':
...     with SummaryRecord(log_dir="./summary_dir", file_prefix="xx_", file_suffix="_yy") as summary_record:
...         log_dir = summary_record.log_dir
record(step, train_network=None, plugin_filter=None)[source]

Record the summary.

Parameters
  • step (int) – Represents training step number.

  • train_network (Cell) – The backup network used for saving the graph. Default: None, which means the graph summary is not saved when the original network graph is None.

  • plugin_filter (Optional[Callable[[str], bool]]) – The filter function, which is used to filter out which plugin should be written. Default: None.

Returns

bool, whether the record process is successful or not.

Raises

TypeError – If step is not int, or train_network is not mindspore.nn.Cell.

Examples

>>> from mindspore.train.summary import SummaryRecord
>>> if __name__ == '__main__':
...     with SummaryRecord(log_dir="./summary_dir", file_prefix="xx_", file_suffix="_yy") as summary_record:
...         result = summary_record.record(step=2)
...         print(result)
...
True
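
A hedged sketch of the plugin_filter parameter described above, recording only scalar data; the filter function is illustrative:

>>> from mindspore import Tensor
>>> from mindspore.train.summary import SummaryRecord
>>> if __name__ == '__main__':
...     with SummaryRecord(log_dir="./summary_dir") as summary_record:
...         summary_record.add_value('scalar', 'loss', Tensor(0.1))
...         # Write only values whose plugin name is 'scalar'.
...         summary_record.record(step=1, plugin_filter=lambda plugin: plugin == 'scalar')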
set_mode(mode)[source]

Set the model running phase. Different phases affect data recording.

Parameters

mode (str) –

The mode to be set, which should be ‘train’ or ‘eval’.

  • train: the model running phase is the train mode.

  • eval: the model running phase is the eval mode. When the mode is ‘eval’, summary_record will not record the data of summary operators.

Raises

ValueError – If mode is not within the optional values.

Examples

>>> from mindspore.train.summary import SummaryRecord
>>> if __name__ == '__main__':
...     with SummaryRecord(log_dir="./summary_dir", file_prefix="xx_", file_suffix="_yy") as summary_record:
...         summary_record.set_mode('eval')

mindspore.train.callback

Callback related classes and functions.

class mindspore.train.callback.Callback[source]

Abstract base class used to build a callback class. Callbacks are context managers which will be entered and exited when passed into the Model. You can use this mechanism to initialize and release resources automatically.

Callback functions execute some operations in the current step or epoch. To create a custom callback, subclass Callback and override the method associated with the stage of interest. For details of Callback fusion, please check Callback.

It holds the information of the model. Such as network, train_network, epoch_num, batch_num, loss_fn, optimizer, parallel_mode, device_number, list_callback, cur_epoch_num, cur_step_num, dataset_sink_mode, net_outputs and so on.

Examples

>>> from mindspore import Model, nn
>>> from mindspore.train.callback import Callback
>>> class Print_info(Callback):
...     def step_end(self, run_context):
...         cb_params = run_context.original_args()
...         print("step_num: ", cb_params.cur_step_num)
>>>
>>> print_cb = Print_info()
>>> dataset = create_custom_dataset()
>>> net = Net()
>>> loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction='mean')
>>> optim = nn.Momentum(net.trainable_params(), 0.01, 0.9)
>>> model = Model(net, loss_fn=loss, optimizer=optim)
>>> model.train(1, dataset, callbacks=print_cb)
step_num: 1
begin(run_context)[source]

Called once before the network starts executing.

Parameters

run_context (RunContext) – Include some information of the model.

end(run_context)[source]

Called once after network training.

Parameters

run_context (RunContext) – Include some information of the model.

epoch_begin(run_context)[source]

Called before each epoch begins.

Parameters

run_context (RunContext) – Include some information of the model.

epoch_end(run_context)[source]

Called after each epoch finishes.

Parameters

run_context (RunContext) – Include some information of the model.

step_begin(run_context)[source]

Called before each step begins.

Parameters

run_context (RunContext) – Include some information of the model.

step_end(run_context)[source]

Called after each step finishes.

Parameters

run_context (RunContext) – Include some information of the model.

class mindspore.train.callback.CheckpointConfig(save_checkpoint_steps=1, save_checkpoint_seconds=0, keep_checkpoint_max=5, keep_checkpoint_per_n_minutes=0, integrated_save=True, async_save=False, saved_network=None, append_info=None, enc_key=None, enc_mode='AES-GCM', exception_save=False)[source]

The configuration of model checkpoint.

Note

During the training process, if the dataset is transmitted through the data channel, it is suggested to set save_checkpoint_steps to an integer multiple of loop_size. Otherwise, the time at which the checkpoint is saved may deviate from what is expected. It is recommended to set only one save strategy and one keep strategy at the same time. If both save_checkpoint_steps and save_checkpoint_seconds are set, save_checkpoint_seconds will be invalid. If both keep_checkpoint_max and keep_checkpoint_per_n_minutes are set, keep_checkpoint_per_n_minutes will be invalid.

Parameters
  • save_checkpoint_steps (int) – Steps to save checkpoint. Default: 1.

  • save_checkpoint_seconds (int) – Seconds to save checkpoint. Can’t be used with save_checkpoint_steps at the same time. Default: 0.

  • keep_checkpoint_max (int) – Maximum number of checkpoint files that can be saved. Default: 5.

  • keep_checkpoint_per_n_minutes (int) – Save the checkpoint file every keep_checkpoint_per_n_minutes minutes. Can’t be used with keep_checkpoint_max at the same time. Default: 0.

  • integrated_save (bool) – Whether to merge and save the split Tensor in the automatic parallel scenario. Integrated save function is only supported in automatic parallel scene, not supported in manual parallel. Default: True.

  • async_save (bool) – Whether asynchronous execution saves the checkpoint to a file. Default: False.

  • saved_network (Cell) – Network to be saved in checkpoint file. If the saved_network has no relation with the network in training, the initial value of saved_network will be saved. Default: None.

  • append_info (list) – The information to be saved to the checkpoint file. Supports “epoch_num”, “step_num” and dict. The key of a dict must be str; the value of a dict must be one of int, float and bool. Default: None.

  • enc_key (Union[None, bytes]) – Byte type key used for encryption. If the value is None, the encryption is not required. Default: None.

  • enc_mode (str) – This parameter is valid only when enc_key is not set to None. Specifies the encryption mode, currently supports ‘AES-GCM’ and ‘AES-CBC’. Default: ‘AES-GCM’.

  • exception_save (bool) – Whether to save the current checkpoint when an exception occurs. Default: False.
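
Following the note above about choosing one save strategy and one keep strategy, a minimal configuration sketch; the numbers are illustrative:

>>> from mindspore.train.callback import CheckpointConfig
>>> # Save every 100 steps, keep at most 10 checkpoint files, and append the
>>> # current epoch and step numbers to each checkpoint.
>>> config = CheckpointConfig(save_checkpoint_steps=100,
...                           keep_checkpoint_max=10,
...                           append_info=["epoch_num", "step_num"])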

Raises

ValueError – If an input parameter is not of the correct type.

Examples

>>> from mindspore import Model, nn
>>> from mindspore.train.callback import ModelCheckpoint, CheckpointConfig
>>> from mindspore.common.initializer import Normal
>>>
>>> class LeNet5(nn.Cell):
...     def __init__(self, num_class=10, num_channel=1):
...         super(LeNet5, self).__init__()
...         self.conv1 = nn.Conv2d(num_channel, 6, 5, pad_mode='valid')
...         self.conv2 = nn.Conv2d(6, 16, 5, pad_mode='valid')
...         self.fc1 = nn.Dense(16 * 5 * 5, 120, weight_init=Normal(0.02))
...         self.fc2 = nn.Dense(120, 84, weight_init=Normal(0.02))
...         self.fc3 = nn.Dense(84, num_class, weight_init=Normal(0.02))
...         self.relu = nn.ReLU()
...         self.max_pool2d = nn.MaxPool2d(kernel_size=2, stride=2)
...         self.flatten = nn.Flatten()
...
...     def construct(self, x):
...         x = self.max_pool2d(self.relu(self.conv1(x)))
...         x = self.max_pool2d(self.relu(self.conv2(x)))
...         x = self.flatten(x)
...         x = self.relu(self.fc1(x))
...         x = self.relu(self.fc2(x))
...         x = self.fc3(x)
...         return x
>>>
>>> net = LeNet5()
>>> loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction='mean')
>>> optim = nn.Momentum(net.trainable_params(), 0.01, 0.9)
>>> model = Model(net, loss_fn=loss, optimizer=optim)
>>> data_path = './MNIST_Data'
>>> dataset = create_dataset(data_path)
>>> config = CheckpointConfig(saved_network=net)
>>> ckpoint_cb = ModelCheckpoint(prefix='LeNet5', directory='./checkpoint', config=config)
>>> model.train(10, dataset, callbacks=ckpoint_cb)
property append_dict

Get the value of append_dict.

property async_save

Get the value of _async_save.

property enc_key

Get the value of _enc_key.

property enc_mode

Get the value of _enc_mode.

get_checkpoint_policy()[source]

Get the policy of checkpoint.

property integrated_save

Get the value of _integrated_save.

property keep_checkpoint_max

Get the value of _keep_checkpoint_max.

property keep_checkpoint_per_n_minutes

Get the value of _keep_checkpoint_per_n_minutes.

property save_checkpoint_seconds

Get the value of _save_checkpoint_seconds.

property save_checkpoint_steps

Get the value of _save_checkpoint_steps.

property saved_network

Get the value of _saved_network.

class mindspore.train.callback.FederatedLearningManager(model, sync_frequency, sync_type='fixed', **kwargs)[source]

Manage Federated Learning during training.

Parameters
  • model (nn.Cell) – A training model.

  • sync_frequency (int) – Synchronization frequency of parameters in Federated Learning. Note that in dataset sink mode, the unit of the frequency is the number of epochs. Otherwise, the unit of the frequency is the number of steps.

  • sync_type (str) –

    Parameter synchronization type in Federated Learning. Supports [“fixed”, “adaptive”]. Default: “fixed”.

    • fixed: The frequency of parameter synchronization is fixed.

    • adaptive: The frequency of parameter synchronization changes adaptively.

Note

This is an experimental prototype that is subject to change.
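
A hedged usage sketch, assuming Net and create_custom_dataset are placeholders defined elsewhere, that the federated learning context has been configured separately, and that the synchronization frequency is illustrative:

>>> from mindspore import Model, nn
>>> from mindspore.train.callback import FederatedLearningManager
>>>
>>> net = Net()
>>> loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction='mean')
>>> optim = nn.Momentum(net.trainable_params(), 0.01, 0.9)
>>> model = Model(net, loss_fn=loss, optimizer=optim)
>>> # Synchronize parameters every 100 steps (non-sink mode).
>>> fl_manager = FederatedLearningManager(model=net, sync_frequency=100)
>>> dataset = create_custom_dataset()
>>> model.train(1, dataset, callbacks=[fl_manager], dataset_sink_mode=False)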

step_end(run_context)[source]

Synchronize parameters at the end of the step. If sync_type is “adaptive”, the synchronization frequency is adaptively adjusted here.

Parameters

run_context (RunContext) – Context of the train running.

class mindspore.train.callback.LearningRateScheduler(learning_rate_function)[source]

Change the learning_rate during training.

Parameters

learning_rate_function (Function) – The function about how to change the learning rate during training.

Examples

>>> from mindspore import Model
>>> from mindspore.train.callback import LearningRateScheduler
>>> import mindspore.nn as nn
...
>>> def learning_rate_function(lr, cur_step_num):
...     if cur_step_num%1000 == 0:
...         lr = lr*0.1
...     return lr
...
>>> lr = 0.1
>>> momentum = 0.9
>>> net = Net()
>>> loss = nn.SoftmaxCrossEntropyWithLogits()
>>> optim = nn.Momentum(net.trainable_params(), learning_rate=lr, momentum=momentum)
>>> model = Model(net, loss_fn=loss, optimizer=optim)
...
>>> dataset = create_custom_dataset("custom_dataset_path")
>>> model.train(1, dataset, callbacks=[LearningRateScheduler(learning_rate_function)],
...             dataset_sink_mode=False)
step_end(run_context)[source]

Change the learning_rate at the end of step.

Parameters

run_context (RunContext) – Context of the train running.

class mindspore.train.callback.LossMonitor(per_print_times=1)[source]

Monitor the loss in training.

If the loss is NAN or INF, it will terminate training.

Note

If per_print_times is 0, do not print loss.

Parameters

per_print_times (int) – Print the loss once every per_print_times steps. In data sink mode, the loss of the nearest step will be printed. Default: 1.

Raises

ValueError – If per_print_times is not an integer or less than zero.
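
A minimal usage sketch, assuming net, loss, optim and dataset are placeholders defined elsewhere:

>>> from mindspore import Model
>>> from mindspore.train.callback import LossMonitor
>>> model = Model(net, loss_fn=loss, optimizer=optim)
>>> # Print the loss every 100 steps; the frequency is illustrative.
>>> model.train(1, dataset, callbacks=[LossMonitor(per_print_times=100)])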

step_end(run_context)[source]

Print training loss at the end of step.

Parameters

run_context (RunContext) – Context of the train running.

class mindspore.train.callback.ModelCheckpoint(prefix='CKP', directory=None, config=None)[source]

The checkpoint callback class.

It is called in combination with the training process to save the model and network parameters after training.

Note

In the distributed training scenario, please specify different directories for each training process to save the checkpoint file. Otherwise, the training may fail.
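
One hedged way to follow this note is to derive the directory from the process rank; get_rank is assumed to come from mindspore.communication (with the communication context already initialized), and the path layout is illustrative:

>>> from mindspore.communication import get_rank
>>> from mindspore.train.callback import ModelCheckpoint
>>> # Give each training process its own checkpoint directory.
>>> ckpt_dir = "./checkpoint/rank_{}".format(get_rank())
>>> ckpoint_cb = ModelCheckpoint(prefix='CKP', directory=ckpt_dir)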

Parameters
  • prefix (str) – The prefix name of checkpoint files. Default: “CKP”.

  • directory (str) – The path of the folder in which the checkpoint files will be saved. By default, the files are saved in the current directory. Default: None.

  • config (CheckpointConfig) – Checkpoint strategy configuration. Default: None.

Raises
  • ValueError – If the prefix is invalid.

  • TypeError – If the config is not CheckpointConfig type.

end(run_context)[source]

Save the last checkpoint after training finished.

Parameters

run_context (RunContext) – Context of the train running.

property latest_ckpt_file_name

Return the latest checkpoint path and file name.

step_end(run_context)[source]

Save the checkpoint at the end of step.

Parameters

run_context (RunContext) – Context of the train running.

class mindspore.train.callback.RunContext(original_args)[source]

Provide information about the model.

Provide information about the original request to the model function. Callback objects can stop the loop by calling request_stop() of run_context.

Parameters

original_args (dict) – Holds the related information of the model.

get_stop_requested()[source]

Return whether a stop is requested or not.

Returns

bool, if true, model.train() stops iterations.

original_args()[source]

Get the _original_args object.

Returns

Dict, an object that holds the original arguments of model.

request_stop()[source]

Set stop requirement during training.

Callbacks can use this function to request stop of iterations. model.train() checks whether this is called or not.
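
For example, a custom callback might request an early stop; this is a hedged sketch that assumes net_outputs in the callback parameters holds the loss Tensor, and the threshold is illustrative:

>>> import numpy as np
>>> from mindspore.train.callback import Callback
>>> class StopAtLoss(Callback):
...     """Stop training once the loss falls below a threshold."""
...     def __init__(self, threshold=0.05):
...         super(StopAtLoss, self).__init__()
...         self.threshold = threshold
...     def step_end(self, run_context):
...         cb_params = run_context.original_args()
...         loss = float(np.mean(cb_params.net_outputs.asnumpy()))
...         if loss < self.threshold:
...             # Ask model.train() to stop the remaining iterations.
...             run_context.request_stop()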

class mindspore.train.callback.SummaryCollector(summary_dir, collect_freq=10, collect_specified_data=None, keep_default_action=True, custom_lineage_data=None, collect_tensor_freq=None, max_file_size=None, export_options=None)[source]

SummaryCollector can help you to collect some common information.

It can help you to collect loss, learning rate, computational graph and so on. SummaryCollector also enables the summary operator to collect data to summary files.

Note

  1. Multiple SummaryCollector instances in callback list are not allowed.

  2. Not all information is collected at the training phase or at the eval phase.

  3. SummaryCollector always records the data collected by the summary operator.

  4. SummaryCollector only supports Linux systems.

  5. Summary is not supported when the source is compiled with the -s on option.

Parameters
  • summary_dir (str) – The collected data will be persisted to this directory. If the directory does not exist, it will be created automatically.

  • collect_freq (int) – Set the frequency of data collection; it should be greater than zero, and the unit is step. If a frequency is set, data will be collected when (current steps % freq) equals 0, and the first step is always collected. Note that if data sink mode is used, the unit becomes the epoch. It is not recommended to collect data too frequently, as this can affect performance. Default: 10.

  • collect_specified_data (Union[None, dict]) –

    Perform custom operations on the collected data. By default, if set to None, all data is collected as the default behavior. You can customize the collected data with a dictionary. For example, you can set {‘collect_metric’: False} to control not collecting metrics. The data that supports control is shown below. Default: None.

    • collect_metric (bool): Whether to collect training metrics, currently only the loss is collected. The first output will be treated as the loss and it will be averaged. Default: True.

    • collect_graph (bool): Whether to collect the computational graph. Currently, only training computational graph is collected. Default: True.

    • collect_train_lineage (bool): Whether to collect lineage data for the training phase, this field will be displayed on the lineage page of MindInsight. Default: True.

    • collect_eval_lineage (bool): Whether to collect lineage data for the evaluation phase, this field will be displayed on the lineage page of MindInsight. Default: True.

    • collect_input_data (bool): Whether to collect dataset for each training. Currently only image data is supported. If there are multiple columns of data in the dataset, the first column should be image data. Default: True.

    • collect_dataset_graph (bool): Whether to collect dataset graph for the training phase. Default: True.

    • histogram_regular (Union[str, None]): Collect weights and biases for the parameter distribution page displayed in MindInsight. This field allows regular strings to control which parameters to collect. It is not recommended to collect too many parameters at once, as this can affect performance. Note that if you collect too many parameters and run out of memory, the training will fail. Default: None, which means only the first five parameters are collected.

    • collect_landscape (Union[dict,None]): Collect the parameters needed to create the loss landscape.

      • landscape_size (int): Specify the image resolution of the generated loss landscape. For example, if it is set to 128, the resolution of the landscape is 128 * 128. The calculation time increases with the increase of resolution. Default: 40. Optional values: between 3 and 256.

      • unit (str): Specify the interval strength of the training process. Default: “step”. Optional: epoch/step.

      • create_landscape (dict): Select how to create the loss landscape: the training process loss landscape (train) and the training result loss landscape (result). Default: {“train”: True, “result”: True}. Optional: True/False.

      • num_samples (int): The size of the dataset used to create the loss landscape. For example, in an image dataset, you can set num_samples to 128, which means that 128 images are used to create the loss landscape. Default: 128.

      • intervals (List[List[int]]): Specifies the intervals in which the loss landscape is created. For example, if the user wants to create the loss landscape of two training processes, covering epochs 1-5 and 6-10 respectively, they can set [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]. Note: each interval must have at least three epochs.

  • keep_default_action (bool) – This field affects the collection behavior of the ‘collect_specified_data’ field. True: it means that after specified data is set, non-specified data is collected as the default behavior. False: it means that after specified data is set, only the specified data is collected, and the others are not collected. Default: True.

  • custom_lineage_data (Union[dict, None]) – Allows you to customize the data and present it on the MindInsight lineage page. In the custom data, the type of the key supports str, and the type of the value supports str, int and float. Default: None, which means there is no custom data.

  • collect_tensor_freq (Optional[int]) – The same semantics as collect_freq, but controls TensorSummary only. Because TensorSummary data is too large to be compared with other summary data, this parameter is used to reduce its collection. By default, the maximum number of steps for collecting TensorSummary data is 20, but it will not exceed the number of steps for collecting other summary data. For example, given collect_freq=10, when the total number of steps is 600, TensorSummary will be collected for 20 steps, while other summary data for 61 steps; but when the total number of steps is 20, both TensorSummary and other summary data will be collected for 3 steps. Also note that in parallel mode the total steps will be split evenly, which will affect the number of steps at which TensorSummary is collected. Default: None, which means to follow the behavior as described above.

  • max_file_size (Optional[int]) – The maximum size in bytes of each file that can be written to the disk. For example, to write not larger than 4GB, specify max_file_size=4*1024**3. Default: None, which means no limit.

  • export_options (Union[None, dict]) –

    Perform custom operations on the export data. Note that the size of export files is not limited by the max_file_size. You can customize the export data with a dictionary. For example, you can set {‘tensor_format’: ‘npy’} to export tensor as npy file. The data that supports control is shown below. Default: None, it means that the data is not exported.

    • tensor_format (Union[str, None]): Customize the export tensor format. Supports [“npy”, None]. Default: None, it means that the tensor is not exported.

      • npy: export tensor as npy file.
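
A hedged sketch of the custom_lineage_data and export_options parameters described above; the directory and values are illustrative:

>>> from mindspore.train.callback import SummaryCollector
>>> # Attach user-defined lineage fields and export TensorSummary data as npy files.
>>> summary_collector = SummaryCollector(summary_dir='./summary_dir',
...                                      custom_lineage_data={'version': 'lenet_v1', 'batch_size': 32},
...                                      export_options={'tensor_format': 'npy'})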

Raises

ValueError – Summary is not supported; please recompile the source without the -s on option.

Examples

>>> import mindspore.nn as nn
>>> from mindspore import context
>>> from mindspore.train.callback import SummaryCollector
>>> from mindspore import Model
>>> from mindspore.nn import Accuracy
>>>
>>> if __name__ == '__main__':
...     # If the device_target is GPU, set the device_target to "GPU"
...     context.set_context(mode=context.GRAPH_MODE, device_target="Ascend")
...     mnist_dataset_dir = '/path/to/mnist_dataset_directory'
...     # The detail of create_dataset method shown in model_zoo.official.cv.lenet.src.dataset.py
...     ds_train = create_dataset(mnist_dataset_dir, 32)
...     # The detail of LeNet5 shown in model_zoo.official.cv.lenet.src.lenet.py
...     network = LeNet5(10)
...     net_loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction="mean")
...     net_opt = nn.Momentum(network.trainable_params(), 0.01, 0.9)
...     model = Model(network, net_loss, net_opt, metrics={"Accuracy": Accuracy()}, amp_level="O2")
...
...     # Simple usage:
...     summary_collector = SummaryCollector(summary_dir='./summary_dir')
...     model.train(1, ds_train, callbacks=[summary_collector], dataset_sink_mode=False)
...
...     # Do not collect metric and collect the first layer parameter, others are collected by default
...     specified={'collect_metric': False, 'histogram_regular': '^conv1.*'}
...     summary_collector = SummaryCollector(summary_dir='./summary_dir', collect_specified_data=specified)
...     model.train(1, ds_train, callbacks=[summary_collector], dataset_sink_mode=False)
class mindspore.train.callback.SummaryLandscape(summary_dir)[source]

SummaryLandscape can help you to collect loss landscape information. It can create landscape in PCA direction or random direction by calculating loss.

Note

  1. SummaryLandscape only supports Linux systems.

Parameters

summary_dir (str) – The summary directory, which is used to save the model weights, metadata and other data required to create the landscape.

Examples

>>> import mindspore.nn as nn
>>> from mindspore import context
>>> from mindspore.train.callback import SummaryCollector, SummaryLandscape
>>> from mindspore import Model
>>> from mindspore.nn import Loss, Accuracy
>>>
>>> if __name__ == '__main__':
...     # If the device_target is Ascend, set the device_target to "Ascend"
...     context.set_context(mode=context.GRAPH_MODE, device_target="GPU")
...     mnist_dataset_dir = '/path/to/mnist_dataset_directory'
...     # The detail of create_dataset method shown in model_zoo.official.cv.lenet.src.dataset.py
...     ds_train = create_dataset(mnist_dataset_dir, 32)
...     # The detail of LeNet5 shown in model_zoo.official.cv.lenet.src.lenet.py
...     network = LeNet5(10)
...     net_loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction="mean")
...     net_opt = nn.Momentum(network.trainable_params(), 0.01, 0.9)
...     model = Model(network, net_loss, net_opt, metrics={"Accuracy": Accuracy()})
...     # Simple usage for collect landscape information:
...     interval_1 = [1, 2, 3, 4, 5]
...     summary_collector = SummaryCollector(summary_dir='./summary/lenet_interval_1',
...                                          collect_specified_data={'collect_landscape':{"landscape_size": 4,
...                                                                                        "unit": "step",
...                                                                          "create_landscape":{"train":True,
...                                                                                             "result":False},
...                                                                          "num_samples": 2048,
...                                                                          "intervals": [interval_1]}
...                                                                    })
...     model.train(1, ds_train, callbacks=[summary_collector], dataset_sink_mode=False)
...
...     # Simple usage for visualization landscape:
...     def callback_fn():
...         network = LeNet5(10)
...         net_loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction="mean")
...         metrics = {"Loss": Loss()}
...         model = Model(network, net_loss, metrics=metrics)
...         mnist_dataset_dir = '/path/to/mnist_dataset_directory'
...         ds_eval = create_dataset(mnist_dataset_dir, 32)
...         return model, network, ds_eval, metrics
...
...     summary_landscape = SummaryLandscape('./summary/lenet_interval_1')
...     # parameters of collect_landscape can be modified or unchanged
...     summary_landscape.gen_landscapes_with_multi_process(callback_fn,
...                                                        collect_landscape={"landscape_size": 4,
...                                                                         "create_landscape":{"train":False,
...                                                                                            "result":False},
...                                                                          "num_samples": 2048,
...                                                                          "intervals": [interval_1]},
...                                                         device_ids=[1])
clean_ckpt()[source]

Clean the checkpoint.

gen_landscapes_with_multi_process(callback_fn, collect_landscape=None, device_ids=None, output=None)[source]

Use multiple processes to generate the landscape.

Parameters
  • callback_fn (python function) –

A Python function object. The user needs to write a function that takes no input; the return requirements are as follows.

    • mindspore.train.Model: User’s model object.

    • mindspore.nn.Cell: User’s network object.

    • mindspore.dataset: User’s dataset object for creating the loss landscape.

    • mindspore.nn.Metrics: User’s metrics object.

  • collect_landscape (Union[dict, None]) –

    The meaning of the parameters when creating the loss landscape is consistent with the fields of the same name in SummaryCollector. The purpose of setting them here is to allow users to freely modify the creation parameters. Default: None.

    • landscape_size (int): Specify the image resolution of the generated loss landscape. For example, if it is set to 128, the resolution of the landscape is 128 * 128. The calculation time increases with the increase of resolution. Default: 40. Optional values: between 3 and 256.

    • create_landscape (dict): Select how to create loss landscape. Training process loss landscape(train) and training result loss landscape(result). Default: {“train”: True, “result”: True}. Optional: True/False.

    • num_samples (int): The size of the dataset used to create the loss landscape. For example, in an image dataset, you can set num_samples to 2048, which means that 2048 images are used to create the loss landscape. Default: 2048.

    • intervals (List[List[int]]): Specifies the intervals in which the loss landscape is created. For example, if the user wants to create the loss landscape of two training processes, covering epochs 1-5 and 6-10 respectively, they can set [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]. Note: each interval must have at least three epochs.

  • device_ids (List(int)) – Specifies which devices are used to create loss landscape. For example: [0, 1] refers to creating loss landscape with device 0 and device 1. Default: None.

  • output (str) – Specifies the path to save the loss landscape. Default: None. The default save path is the same as the summary file.

class mindspore.train.callback.TimeMonitor(data_size=None)[source]

Monitor the time in training.

Parameters

data_size (int) – The number of steps between each print of information. If the program obtains batch_num during training, data_size will be set to batch_num; otherwise data_size will be used. Default: None.

Raises

ValueError – If data_size is not a positive int.
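
A minimal usage sketch, assuming model and dataset are placeholders defined elsewhere:

>>> from mindspore.train.callback import TimeMonitor
>>> # Report the per-step time based on 100 steps per epoch; the value is illustrative.
>>> time_cb = TimeMonitor(data_size=100)
>>> model.train(1, dataset, callbacks=[time_cb])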

epoch_begin(run_context)[source]

Record time at the beginning of epoch.

Parameters

run_context (RunContext) – Context of the process running.

epoch_end(run_context)[source]

Print process cost time at the end of epoch.

Parameters

run_context (RunContext) – Context of the process running.

mindspore.train.train_thor

Classes and functions related to converting a model to a second-order (thor) model.

class mindspore.train.train_thor.ConvertModelUtils[source]

Convert model to thor model.

static convert_to_thor_model(model, network, loss_fn=None, optimizer=None, metrics=None, amp_level='O0', loss_scale_manager=None, keep_batchnorm_fp32=False)[source]

This interface is used to convert a model to a thor model.

Parameters
  • model (Object) – High-Level API for Training. Model groups layers into an object with training features.

  • network (Cell) – A training network.

  • loss_fn (Cell) – Objective function. Default: None.

  • optimizer (Cell) – Optimizer used to update the weights. Default: None.

  • metrics (Union[dict, set]) – A Dictionary or a set of metrics to be evaluated by the model during training. eg: {‘accuracy’, ‘recall’}. Default: None.

  • amp_level (str) –

    Level for mixed precision training. Supports [“O0”, “O2”, “O3”, “auto”]. Default: “O0”.

    • O0: Do not change.

    • O2: Cast network to float16, keep batchnorm run in float32, using dynamic loss scale.

    • O3: Cast network to float16, with additional property ‘keep_batchnorm_fp32=False’.

    • auto: Set the level to the recommended level for different devices. O2 is recommended on GPU and O3 is recommended on Ascend. The recommended level is based on expert experience and cannot always generalize; users should specify the level for special networks.

  • loss_scale_manager (Union[None, LossScaleManager]) – If it is None, the loss will not be scaled. Otherwise, the loss is scaled by LossScaleManager and the optimizer cannot be None. It is a keyword argument, e.g. use loss_scale_manager=None to set the value.

  • keep_batchnorm_fp32 (bool) – Keep Batchnorm running in float32. If True, the level setting before will be overwritten. Default: False.

Returns

High-Level API for Training.

Model groups layers into an object with training features.

Return type

model (Object)

Supported Platforms:

Ascend GPU

Examples

>>> from mindspore import nn
>>> from mindspore import Tensor
>>> from mindspore.nn import thor
>>> from mindspore import Model
>>> from mindspore import FixedLossScaleManager
>>> from mindspore.train.callback import LossMonitor
>>> from mindspore.train.train_thor import ConvertModelUtils
>>>
>>> net = Net()
>>> dataset = create_dataset()
>>> temp = Tensor([4e-4, 1e-4, 1e-5, 1e-5], mstype.float32)
>>> opt = thor(net, learning_rate=temp, damping=temp, momentum=0.9, loss_scale=128, frequency=4)
>>> loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction='mean')
>>> loss_scale = FixedLossScaleManager(128, drop_overflow_update=False)
>>> model = Model(net, loss_fn=loss, optimizer=opt, loss_scale_manager=loss_scale, metrics={'acc'},
...               amp_level="O2", keep_batchnorm_fp32=False)
>>> model = ConvertModelUtils.convert_to_thor_model(model=model, network=net, loss_fn=loss, optimizer=opt,
...                                                 loss_scale_manager=loss_scale, metrics={'acc'},
...                                                 amp_level="O2", keep_batchnorm_fp32=False)
>>> loss_cb = LossMonitor()
>>> model.train(1, dataset, callbacks=loss_cb, sink_size=4, dataset_sink_mode=True)
class mindspore.train.train_thor.ConvertNetUtils[source]

Convert a network to a thor layer network.

convert_to_thor_net(net)[source]

This interface is used to convert a network to thor layer network, in order to calculate and store the second-order information matrix.

Note

This interface is automatically called by the second-order optimizer thor.

Parameters

net (Cell) – Network to be trained by the second-order optimizer thor.

Supported Platforms:

Ascend GPU

Examples

>>> ConvertNetUtils().convert_to_thor_net(net)