mindspore.set_auto_parallel_context

View Source On Gitee
mindspore.set_auto_parallel_context(**kwargs)[source]

Set auto parallel context, only data parallel supported on CPU.

Note

Attribute name is required for setting attributes. If a program has tasks on different parallel modes, before setting a new parallel mode for the next task, interface mindspore.reset_auto_parallel_context() should be called to reset the configuration. Setting or changing parallel modes must be called before creating any Initializer, otherwise, it may have RuntimeError when compiling the network.

Some configurations are parallel mode specific, see the below table for details:

Common

AUTO_PARALLEL

device_num

gradient_fp32_sync

global_rank

loss_repeated_mean

gradients_mean

search_mode

parallel_mode

parameter_broadcast

all_reduce_fusion_config

strategy_ckpt_load_file

enable_parallel_optimizer

strategy_ckpt_save_file

parallel_optimizer_config

dataset_strategy

enable_alltoall

pipeline_stages

pipeline_config

auto_parallel_search_mode

comm_fusion

strategy_ckpt_config

Parameters
  • device_num (int) – Available device number, the value must be in [1, 4096]. Default: 1 .

  • global_rank (int) – Global rank id, the value must be in [0, 4095]. Default: 0 .

  • gradients_mean (bool) – Whether to perform mean operator after allreduce of gradients. “stand_alone” do not support gradients_mean. Default: False .

  • gradient_fp32_sync (bool) – Run allreduce of gradients in fp32. “stand_alone”, “data_parallel” and “hybrid_parallel” do not support gradient_fp32_sync. Default: True .

  • parallel_mode (str) –

    There are five kinds of parallel modes, "stand_alone" , "data_parallel" , "hybrid_parallel" , "semi_auto_parallel" and "auto_parallel" . Note the pynative mode only supports the "stand_alone" and "data_parallel" mode. Default: "stand_alone" .

    • stand_alone: Only one processor is working.

    • data_parallel: Distributes the data across different processors.

    • hybrid_parallel: Achieves data parallelism and model parallelism manually.

    • semi_auto_parallel: Achieves data and model parallelism by setting parallel strategies.

    • auto_parallel: Achieving parallelism automatically.

  • search_mode (str) –

    There are three kinds of shard strategy search modes: "recursive_programming" , "dynamic_programming" and "sharding_propagation" . Default: "recursive_programming" .

    • recursive_programming: Recursive programming search mode. In order to obtain optimal performance, it is recommended that users set the batch size to be greater than or equal to the product of the number of devices and the number of multi-copy parallelism.

    • dynamic_programming: Dynamic programming search mode.

    • sharding_propagation: Propagate shardings from configured ops to non-configured ops.

  • auto_parallel_search_mode (str) – This is the old version of ‘search_mode’. Here, remaining this attribute is for forward compatibility, and this attribute will be deleted in a future MindSpore version.

  • parameter_broadcast (bool) – Whether to broadcast parameters before training. Before training, in order to have the same network initialization parameter values for all devices, broadcast the parameters on device 0 to other devices. Parameter broadcasting in different parallel modes is different, data_parallel mode, all parameters are broadcast except for the parameter whose attribute layerwise_parallel is True . Hybrid_parallel , semi_auto_parallel and auto_parallel mode , the segmented parameters do not participate in broadcasting. Default: False .

  • strategy_ckpt_load_file (str) – The path to load parallel strategy checkpoint. The parameter is not to be recommended currently, it is better using ‘strategy_ckpt_config’ to replace it. Default: ''

  • strategy_ckpt_save_file (str) – The path to save parallel strategy checkpoint. The parameter is not to be recommended currently, it is better using ‘strategy_ckpt_config’ to replace it. Default: ''

  • full_batch (bool) – If you load whole batch datasets in auto_parallel mode, this parameter should be set as True . Default: False . The interface is not to be recommended currently, it is better using ‘dataset_strategy’ to replace it.

  • dataset_strategy (Union[str, tuple]) – Dataset sharding strategy. Default: "data_parallel" . dataset_strategy=”data_parallel” is equal to full_batch=False, dataset_strategy=”full_batch” is equal to full_batch=True. For execution mode is ‘GRAPH_MODE’ and dataset load into net by model parallel strategy likes ds_stra ((1, 8), (1, 8)), it requires using set_auto_parallel_context(dataset_strategy=ds_stra).

  • enable_parallel_optimizer (bool) – This is a developing feature, which shards the weight update computation for data parallel training in the benefit of time and memory saving. Currently, auto and semi auto parallel mode support all optimizers in both Ascend and GPU. Data parallel mode only supports Lamb and AdamWeightDecay in Ascend . Default: False .

  • enable_alltoall (bool) – A switch that allows AllToAll operators to be generated during communication. If its value is False , there will be a combination of operators such as AllGather, Split and Concat instead of AllToAll. Default: False .

  • all_reduce_fusion_config (list) – Set allreduce fusion strategy by parameters indices. Only support ReduceOp.SUM and HCCL_WORLD_GROUP/NCCL_WORLD_GROUP. No Default, if it is not set, the fusion is closed.

  • pipeline_stages (int) – Set the stage information for pipeline parallel. This indicates how the devices are distributed alone in the pipeline. The total devices will be divided into ‘pipeline_stags’ stages. Default: 1 .

  • pipeline_config (dict) –

    A dict contains the keys and values for setting the pipeline parallelism configuration. It supports the following keys:

    • pipeline_interleave(bool): Indicates whether to enable the interleaved execution mode.

    • pipeline_scheduler(str): Indicates the scheduling mode for pipeline parallelism. Only support gpipe/1f1b.

  • parallel_optimizer_config (dict) –

    A dict contains the keys and values for setting the parallel optimizer configure. The configure provides more detailed behavior control about parallel training when parallel optimizer is enabled. The configure will be effective when we use mindspore.set_auto_parallel_context(enable_parallel_optimizer=True). It supports the following keys.

    • gradient_accumulation_shard(bool): If true , the accumulation gradient parameters will be sharded across the data parallel devices. This will introduce additional communication(ReduceScatter) at each step when accumulate the gradients, but saves a lot of device memories, thus can make model be trained with larger batch size. This configure is effective only when the model runs on pipeline training or gradient accumulation with data parallel. Default False .

    • parallel_optimizer_threshold(int): Set the threshold of parallel optimizer. When parallel optimizer is enabled, parameters with size smaller than this threshold will not be sharded across the devices. Parameter size = shape[0] * … * shape[n] * size(dtype). Non-negative. Unit: KB. Default: 64 .

    • optimizer_weight_shard_size(int): Set the optimizer weight shard group size, if you want to specific the maximum group size across devices when the parallel optimizer is enabled. The numerical range can be (0, device_num]. If pipeline parallel is enabled, the numerical range is (0, device_num/stage]. If the size of data parallel communication domain of the parameter cannot be divided by optimizer_weight_shard_size, then the specified communication group size will not take effect. Default value is -1 , which means the optimizer weight shard group size will be the size of data parallel group of each parameter.

  • comm_fusion (dict) –

    A dict contains the types and configurations for setting the communication fusion. each communication fusion config has two keys: “mode” and “config”. It supports following communication fusion types and configurations:

    • openstate: Whether turn on the communication fusion or not. If openstate is True , turn on the communication fusion, otherwise, turn off the communication fusion. Default: True .

    • allreduce: If communication fusion type is allreduce. The mode contains: auto, size and index. In auto mode, AllReduce fusion is configured by gradients size and the default fusion threshold is 64 MB. In ‘size’ mode, AllReduce fusion is configured by gradients size manually, and the fusion threshold must be larger than 0 MB. In index mode, it is same as all_reduce_fusion_config.

    • allgather: If communication fusion type is allgather. The mode contains: auto, size. In auto mode, AllGather fusion is configured by gradients size, and the default fusion threshold is 64 MB. In ‘size’ mode, AllGather fusion is configured by gradients size manually, and the fusion threshold must be larger than 0 MB.

    • reducescatter: If communication fusion type is reducescatter. The mode contains: auto and size. Config is same as allgather.

  • strategy_ckpt_config (dict) –

    A dict contains the configurations for setting the parallel strategy file. This interface contains the functions of parameter strategy_ckpt_load_file and strategy_ckpt_save_file, it is recommonded to use this parameter to replace those two parameters. It contains following configurations:

    • load_file (str): The path to load parallel strategy checkpoint. If the file name extension is .json, the file is loaded in JSON format. Otherwise, the file is loaded in ProtoBuf format. Default: ‘’

    • save_file (str): The path to save parallel strategy checkpoint. If the file name extension is .json, the file is saved in JSON format. Otherwise, the file is saved in ProtoBuf format. Default: ‘’

    • only_trainable_params (bool): Only save/load the strategy information for trainable parameter. Default: True .

Raises

ValueError – If input key is not attribute in auto parallel context.

Examples

>>> import mindspore as ms
>>> ms.set_auto_parallel_context(device_num=8)
>>> ms.set_auto_parallel_context(global_rank=0)
>>> ms.set_auto_parallel_context(gradients_mean=True)
>>> ms.set_auto_parallel_context(gradient_fp32_sync=False)
>>> ms.set_auto_parallel_context(parallel_mode="auto_parallel")
>>> ms.set_auto_parallel_context(search_mode="dynamic_programming")
>>> ms.set_auto_parallel_context(auto_parallel_search_mode="dynamic_programming")
>>> ms.set_auto_parallel_context(parameter_broadcast=False)
>>> ms.set_auto_parallel_context(strategy_ckpt_load_file="./strategy_stage1.ckpt")
>>> ms.set_auto_parallel_context(strategy_ckpt_save_file="./strategy_stage1.ckpt")
>>> ms.set_auto_parallel_context(dataset_strategy=((1, 8), (1, 8)))
>>> ms.set_auto_parallel_context(enable_parallel_optimizer=False)
>>> ms.set_auto_parallel_context(enable_alltoall=False)
>>> ms.set_auto_parallel_context(all_reduce_fusion_config=[8, 160])
>>> ms.set_auto_parallel_context(pipeline_stages=2)
>>> parallel_config = {"gradient_accumulation_shard": True, "parallel_optimizer_threshold": 24,
...                    "optimizer_weight_shard_size": 2}
>>> ms.set_auto_parallel_context(parallel_optimizer_config=parallel_config, enable_parallel_optimizer=True)
>>> config = {"allreduce": {"mode": "size", "config": 32}, "allgather": {"mode": "size", "config": 32}}
>>> ms.set_auto_parallel_context(comm_fusion=config)
>>> stra_ckpt_dict = {"load_file": "./stra0.ckpt", "save_file": "./stra1.ckpt", "only_trainable_params": False}
>>> ms.set_auto_parallel_context(strategy_ckpt_config=stra_ckpt_dict)