mindspore.SummaryCollector

class mindspore.SummaryCollector(summary_dir, collect_freq=10, num_process=32, collect_specified_data=None, keep_default_action=True, custom_lineage_data=None, collect_tensor_freq=None, max_file_size=None, export_options=None)

SummaryCollector helps you collect common information such as the loss, learning rate, and computational graph.

SummaryCollector also enables the summary operator to collect data to summary files.

Note

  1. Multiple SummaryCollector instances in callback list are not allowed.

  2. Not all information is collected at the training phase or at the eval phase.

  3. SummaryCollector always records the data collected by the summary operator.

  4. SummaryCollector only supports Linux systems.

  5. Summary is not supported when MindSpore is compiled from source with the -s on option.

Parameters
  • summary_dir (str) – The collected data will be persisted to this directory. If the directory does not exist, it will be created automatically.

  • collect_freq (int) – The frequency of data collection, in steps. It must be greater than zero. Data is collected when (current steps % freq) equals 0, and the first step is always collected. Note that in data sink mode the unit becomes the epoch. Collecting data too frequently is not recommended because it can affect performance. Default: 10.

  • num_process (int) – Number of processes used to save summary data. More processes give better performance but may cause host memory overflow. Default: 32.

  • collect_specified_data (Union[None, dict]) –

    Customize which data is collected. If set to None, all data is collected according to the default behavior. You can customize the collected data with a dictionary; for example, set {'collect_metric': False} to disable metric collection. The supported keys are listed below. Default: None.

    • collect_metric (bool): Whether to collect training metrics. Currently only the loss is collected. The first output is treated as the loss and is averaged. Default: True.

    • collect_graph (bool): Whether to collect the computational graph. Currently, only the training computational graph is collected. Default: True.

    • collect_train_lineage (bool): Whether to collect lineage data for the training phase. This data is displayed on the lineage page of MindInsight. Default: True.

    • collect_eval_lineage (bool): Whether to collect lineage data for the evaluation phase. This data is displayed on the lineage page of MindInsight. Default: True.

    • collect_input_data (bool): Whether to collect the dataset for each training run. Currently only image data is supported. If the dataset has multiple columns, the first column should be image data. Default: True.

    • collect_dataset_graph (bool): Whether to collect the dataset graph for the training phase. Default: True.

    • histogram_regular (Union[str, None]): Collect weights and biases for display on the parameter distribution page of MindInsight. This field accepts a regular expression string that controls which parameters are collected. Collecting too many parameters at once is not recommended because it can affect performance; if memory runs out as a result, training will fail. Default: None, which means only the first five parameters are collected.

    • collect_landscape (Union[dict,None]): Whether to collect the parameters needed to create the loss landscape. If set to None, collect_landscape parameters will not be collected. All parameter information is collected by default and stored in file {summary_dir}/ckpt_dir/train_metadata.json.

      • landscape_size (int): Specify the image resolution of the generated loss landscape. For example, if it is set to 128, the resolution of the landscape is 128 * 128. Calculation time increases with resolution. Default: 40. Valid values: 3 to 256.

      • unit (str): Specify the unit of the intervals in the training process. Default: "step". Optional values: "epoch" or "step".

      • create_landscape (dict): Select which loss landscapes to create: the training process loss landscape (train) and the training result loss landscape (result). Default: {"train": True, "result": True}. Each value can be True or False.

      • num_samples (int): The size of the dataset used to create the loss landscape. For example, for an image dataset, setting num_samples to 128 means that 128 images are used to create the loss landscape. Default: 128.

      • intervals (List[List[int]]): Specifies the intervals for which loss landscapes are created. For example, to create loss landscapes for two training processes covering epochs 1-5 and 6-10, set [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]. Note: each interval must contain at least three epochs.

  • keep_default_action (bool) – Controls how the collect_specified_data field is applied. True: after specified data is set, non-specified data is still collected according to the default behavior. False: after specified data is set, only the specified data is collected. Default: True.

  • custom_lineage_data (Union[dict, None]) –

    Custom data to present on the MindInsight lineage page. In the custom data, keys must be of type str, and values may be of type str, int, or float. Default: None, which means there is no custom data.

  • collect_tensor_freq (Optional[int]) – Same semantics as collect_freq, but controls TensorSummary only. Because TensorSummary data is much larger than other summary data, this parameter is used to reduce how often it is collected. By default, TensorSummary data is collected for at most 20 steps, and never for more steps than other summary data. For example, given collect_freq=10, when the total number of steps is 600, TensorSummary is collected for 20 steps and other summary data for 61 steps; when the total number of steps is 20, both TensorSummary and other summary data are collected for 3 steps. Also note that in parallel mode the total steps are split evenly, which affects the number of steps for which TensorSummary is collected. Default: None, which means the behavior described above.

  • max_file_size (Optional[int]) – The maximum size, in bytes, of each file written to disk. For example, to limit each file to 4 GB, specify max_file_size=4*1024**3. Default: None, which means no limit.

  • export_options (Union[None, dict]) –

    Customize the exported data. Note that the size of exported files is not limited by max_file_size. You can customize the exported data with a dictionary; for example, set {'tensor_format': 'npy'} to export tensors as npy files. The supported keys are listed below. Default: None, which means that no data is exported.

    • tensor_format (Union[str, None]): The export format for tensors. Supports ["npy", None]. Default: None, which means that tensors are not exported.

      • npy: export tensor as npy file.
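
The collection counts quoted for collect_freq and collect_tensor_freq can be checked with a small pure-Python sketch. It does not call MindSpore. The helper names collected_steps and default_tensor_summary_count are hypothetical, and the second function only restates the documented default (at most 20 TensorSummary collections, never more than for other summary data); it is not the library's internal logic.

```python
def collected_steps(total_steps, collect_freq=10):
    """Return the steps at which summary data would be collected.

    Assumes the documented rule: the first step is always collected,
    plus every step where step % collect_freq == 0.
    """
    if collect_freq <= 0:
        raise ValueError("collect_freq must be greater than zero")
    return [step for step in range(1, total_steps + 1)
            if step == 1 or step % collect_freq == 0]


def default_tensor_summary_count(total_steps, collect_freq=10, limit=20):
    """Number of TensorSummary collections under the documented default.

    Assumption: this only restates the documentation (a cap of 20 that
    never exceeds the count for other summary data).
    """
    return min(limit, len(collected_steps(total_steps, collect_freq)))


print(len(collected_steps(600)))          # 61
print(default_tensor_summary_count(600))  # 20
print(collected_steps(20))                # [1, 10, 20]
print(default_tensor_summary_count(20))   # 3
```

In data sink mode the same arithmetic would apply with epochs in place of steps, as noted for collect_freq.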

Raises

ValueError – If Summary is not supported in the current build. Recompile the source without the -s on option.

Examples

>>> import mindspore as ms
>>> from mindspore import nn, SummaryCollector
>>> from mindspore.train import Model, Accuracy
>>>
>>> if __name__ == '__main__':
...     # If the device_target is GPU, set the device_target to "GPU"
...     ms.set_context(mode=ms.GRAPH_MODE, device_target="Ascend")
...     mnist_dataset_dir = '/path/to/mnist_dataset_directory'
...     # Create the dataset taking MNIST as an example. Refer to
...     # https://gitee.com/mindspore/docs/blob/master/docs/mindspore/code/mnist.py
...     ds_train = create_dataset()
...     # Define the network structure of LeNet5. Refer to
...     # https://gitee.com/mindspore/docs/blob/master/docs/mindspore/code/lenet.py
...     network = LeNet5(10)
...     net_loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction="mean")
...     net_opt = nn.Momentum(network.trainable_params(), 0.01, 0.9)
...     model = Model(network, net_loss, net_opt, metrics={"Accuracy": Accuracy()}, amp_level="O2")
...
...     # Simple usage:
...     summary_collector = SummaryCollector(summary_dir='./summary_dir')
...     model.train(1, ds_train, callbacks=[summary_collector], dataset_sink_mode=False)
...
...     # Do not collect metrics; collect only the first layer's parameters; everything else follows the defaults
...     specified = {'collect_metric': False, 'histogram_regular': '^conv1.*'}
...     summary_collector = SummaryCollector(summary_dir='./summary_dir', collect_specified_data=specified)
...     model.train(1, ds_train, callbacks=[summary_collector], dataset_sink_mode=False)
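
To make the second example easier to reason about, the following pure-Python sketch shows how collect_specified_data, keep_default_action, and histogram_regular might combine according to the parameter descriptions above. It does not import MindSpore. DEFAULT_SWITCHES, effective_switches, and select_parameters are hypothetical names, the parameter names in the usage lines are assumed for illustration, and matching the pattern from the start of each name is also an assumption.

```python
import re

# Boolean switches and their documented defaults (all True).
DEFAULT_SWITCHES = {
    "collect_metric": True,
    "collect_graph": True,
    "collect_train_lineage": True,
    "collect_eval_lineage": True,
    "collect_input_data": True,
    "collect_dataset_graph": True,
}


def effective_switches(specified=None, keep_default_action=True):
    """Resolve which boolean switches end up enabled (illustration only)."""
    if not specified:
        # Nothing specified: everything follows the default behavior.
        return dict(DEFAULT_SWITCHES)
    if keep_default_action:
        resolved = dict(DEFAULT_SWITCHES)
    else:
        # Only explicitly specified data is collected.
        resolved = dict.fromkeys(DEFAULT_SWITCHES, False)
    resolved.update({key: value for key, value in specified.items()
                     if key in DEFAULT_SWITCHES})
    return resolved


def select_parameters(names, histogram_regular=None, default_count=5):
    """Pick parameter names for the histogram (illustration only)."""
    if histogram_regular is None:
        # Documented default: only the first five parameters.
        return list(names)[:default_count]
    pattern = re.compile(histogram_regular)
    return [name for name in names if pattern.match(name)]


specified = {"collect_metric": False, "histogram_regular": "^conv1.*"}
names = ["conv1.weight", "conv2.weight", "fc1.weight", "fc1.bias"]  # assumed names

print(effective_switches(specified)["collect_metric"])  # False
print(effective_switches(specified)["collect_graph"])   # True
print(effective_switches(specified, keep_default_action=False)["collect_graph"])  # False
print(select_parameters(names, specified["histogram_regular"]))  # ['conv1.weight']
```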