mindspore.set_auto_parallel_context

mindspore.set_auto_parallel_context(**kwargs)

Set the auto parallel context. On CPU, only data parallel is supported.

Note

The attribute name is required when setting attributes. If a program runs tasks in different parallel modes, call mindspore.reset_auto_parallel_context() to reset the configuration before setting a new parallel mode for the next task. Parallel modes must be set or changed before any Initializer is created; otherwise, a RuntimeError may be raised when compiling the network.

Some configurations are specific to a parallel mode; see the table below for details:

Common                     | AUTO_PARALLEL
device_num                 | gradient_fp32_sync
global_rank                | loss_repeated_mean
gradients_mean             | search_mode
parallel_mode              | strategy_ckpt_load_file
all_reduce_fusion_config   | strategy_ckpt_save_file
enable_parallel_optimizer  | dataset_strategy
parallel_optimizer_config  | pipeline_stages
enable_alltoall            | grad_accumulation_step
                           | auto_parallel_search_mode
                           | comm_fusion

Parameters
  • device_num (int) – Available device number, the value must be in [1, 4096]. Default: 1.

  • global_rank (int) – Global rank id, the value must be in [0, 4095]. Default: 0.

  • gradients_mean (bool) – Whether to perform the mean operator after allreduce of gradients. “stand_alone” does not support gradients_mean. Default: False.

  • gradient_fp32_sync (bool) – Run allreduce of gradients in fp32. “stand_alone”, “data_parallel” and “hybrid_parallel” do not support gradient_fp32_sync. Default: True.

  • parallel_mode (str) –

    There are five parallel modes: “stand_alone”, “data_parallel”, “hybrid_parallel”, “semi_auto_parallel” and “auto_parallel”. Note that PyNative mode supports only the “stand_alone” and “data_parallel” modes. Default: “stand_alone”.

    • stand_alone: Only one processor is working.

    • data_parallel: Distributes the data across different processors.

    • hybrid_parallel: Achieves data parallelism and model parallelism manually.

    • semi_auto_parallel: Achieves data and model parallelism by setting parallel strategies.

    • auto_parallel: Achieves parallelism automatically.

  • search_mode (str) –

    There are three kinds of shard strategy search modes: “recursive_programming”, “dynamic_programming” and “sharding_propagation”. Default: “dynamic_programming”.

    • recursive_programming: Recursive programming search mode.

    • dynamic_programming: Dynamic programming search mode.

    • sharding_propagation: Propagate shardings from configured ops to non-configured ops.

  • auto_parallel_search_mode (str) – The old name of ‘search_mode’. It is kept for compatibility and will be removed in a future MindSpore version.

  • parameter_broadcast (bool) – Whether to broadcast parameters before training. So that all devices start with the same network initialization parameter values, the parameters on device 0 are broadcast to the other devices before training. Parameter broadcasting differs between parallel modes. In data_parallel mode, all parameters are broadcast except those whose attribute layerwise_parallel is True. In hybrid_parallel, semi_auto_parallel and auto_parallel modes, the segmented parameters do not participate in broadcasting. Default: False.

  • strategy_ckpt_load_file (str) – The path to load parallel strategy checkpoint. Default: ‘’

  • strategy_ckpt_save_file (str) – The path to save parallel strategy checkpoint. Default: ‘’

  • full_batch (bool) – Set this parameter to True if you load whole-batch datasets in auto_parallel mode. Default: False. This interface is currently not recommended; use ‘dataset_strategy’ instead.

  • dataset_strategy (Union[str, tuple]) – Dataset sharding strategy. Default: “data_parallel”. dataset_strategy=“data_parallel” is equivalent to full_batch=False, and dataset_strategy=“full_batch” is equivalent to full_batch=True. To load the dataset into the net with a model parallel strategy such as ds_stra = ((1, 8), (1, 8)), call set_auto_parallel_context(dataset_strategy=ds_stra).

  • enable_parallel_optimizer (bool) – This feature is under development. It shards the weight update computation in data parallel training to save time and memory. Currently, auto and semi auto parallel modes support all optimizers on both Ascend and GPU. Data parallel mode supports only Lamb and AdamWeightDecay on Ascend. Default: False.

  • enable_alltoall (bool) – A switch that allows AllToAll operators to be generated during communication. If its value is False, there will be a combination of operators such as AllGather, Split and Concat instead of AllToAll. Default: False.

  • all_reduce_fusion_config (list) – Set the allreduce fusion strategy by parameter indices. Only ReduceOp.SUM and HCCL_WORLD_GROUP/NCCL_WORLD_GROUP are supported. There is no default; if it is not set, fusion is disabled.

  • pipeline_stages (int) – Set the stage information for pipeline parallel. This indicates how the devices are distributed along the pipeline. The devices are divided into ‘pipeline_stages’ stages. Currently, this can be used only when the semi_auto_parallel mode is enabled. Default: 1.

  • grad_accumulation_step (int) – Set the accumulation steps of gradients in auto and semi auto parallel mode. This should be a positive int. Default: 1.

  • parallel_optimizer_config (dict) –

    A dict containing the keys and values that configure the parallel optimizer. The configuration gives finer control over parallel training when the parallel optimizer is enabled, and takes effect only with mindspore.set_auto_parallel_context(enable_parallel_optimizer=True). It supports the following keys.

    • gradient_accumulation_shard(bool): If True, the accumulated gradient parameters are sharded across the data parallel devices. This introduces additional communication (ReduceScatter) at each step when accumulating the gradients, but saves a lot of device memory, so the model can be trained with a larger batch size. This configuration takes effect only when the model runs with pipeline training or gradient accumulation with data parallel. Default: True.

    • parallel_optimizer_threshold(int): Set the threshold of parallel optimizer. When parallel optimizer is enabled, parameters with size smaller than this threshold will not be sharded across the devices. Parameter size = shape[0] * … * shape[n] * size(dtype). Non-negative. Unit: KB. Default: 64.

  • comm_fusion (dict) –

    A dict containing the types and configurations for communication fusion. Each communication fusion config has two keys: “mode” and “config”. The following communication fusion types and configurations are supported:

    • allreduce: Configures AllReduce fusion. The modes are auto, size and index. In auto mode, AllReduce fusion is configured by gradient size, and the default fusion threshold is 64 MB. In size mode, AllReduce fusion is configured by gradient size manually, and the fusion threshold must be larger than 0 MB. In index mode, it behaves the same as all_reduce_fusion_config.

    • allgather: Configures AllGather fusion. The modes are auto and size. In auto mode, AllGather fusion is configured by gradient size, and the default fusion threshold is 64 MB. In size mode, AllGather fusion is configured by gradient size manually, and the fusion threshold must be larger than 0 MB.

    • reducescatter: Configures ReduceScatter fusion. The modes are auto and size. The config is the same as for allgather.
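
The mode rules listed for comm_fusion can be checked before the dict is passed to set_auto_parallel_context. The following is a minimal standalone sketch of such a check. The helper check_comm_fusion and the table ALLOWED_MODES are hypothetical names introduced here for illustration; they are not part of MindSpore, and MindSpore's own validation may differ.

```python
# Hypothetical helper for illustration; it is not part of MindSpore.
ALLOWED_MODES = {
    "allreduce": {"auto", "size", "index"},
    "allgather": {"auto", "size"},
    "reducescatter": {"auto", "size"},
}


def check_comm_fusion(comm_fusion):
    """Raise ValueError if a comm_fusion dict breaks the rules listed above."""
    for fusion_type, entry in comm_fusion.items():
        if fusion_type not in ALLOWED_MODES:
            raise ValueError(f"unknown fusion type: {fusion_type!r}")
        mode = entry.get("mode")
        if mode not in ALLOWED_MODES[fusion_type]:
            raise ValueError(f"{fusion_type} does not support mode {mode!r}")
        # In size mode the fusion threshold (in MB) must be larger than 0.
        if mode == "size" and not entry.get("config", 0) > 0:
            raise ValueError(f"{fusion_type} size threshold must be > 0 MB")
    return True


config = {
    "allreduce": {"mode": "size", "config": 32},
    "allgather": {"mode": "size", "config": 32},
}
check_comm_fusion(config)  # passes: both entries use a supported mode
```

A dict that requests index mode for allgather or reducescatter would be rejected by this sketch, since index mode is listed only for allreduce.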

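The size rule given for parallel_optimizer_threshold can be worked through by hand. The sketch below is a standalone illustration that does not call MindSpore. The helper names and the DTYPE_BYTES table are hypothetical, and it assumes that 1 KB means 1024 bytes and that a parameter exactly at the threshold is sharded (the text only says that smaller parameters are not).

```python
import math

# Bytes per element for a few dtypes; an illustrative table only.
DTYPE_BYTES = {"float16": 2, "float32": 4, "int32": 4}


def param_size_kb(shape, dtype):
    """Parameter size = shape[0] * ... * shape[n] * size(dtype), in KB."""
    # Assumption: 1 KB = 1024 bytes.
    return math.prod(shape) * DTYPE_BYTES[dtype] / 1024


def is_sharded(shape, dtype, threshold_kb=64):
    """Parameters smaller than the threshold are not sharded."""
    return param_size_kb(shape, dtype) >= threshold_kb


print(param_size_kb((1024, 1024), "float32"))  # 4096.0
print(is_sharded((1024, 1024), "float32"))     # True
print(is_sharded((128,), "float16"))           # False
```

With the default threshold of 64, a 1024 x 1024 float32 weight (4096 KB) would be sharded, while a 128-element float16 bias (0.25 KB) would stay whole.
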
Raises

ValueError – If an input key is not an attribute of the auto parallel context.

Examples

>>> import mindspore as ms
>>> ms.set_auto_parallel_context(device_num=8)
>>> ms.set_auto_parallel_context(global_rank=0)
>>> ms.set_auto_parallel_context(gradients_mean=True)
>>> ms.set_auto_parallel_context(gradient_fp32_sync=False)
>>> ms.set_auto_parallel_context(parallel_mode="auto_parallel")
>>> ms.set_auto_parallel_context(search_mode="dynamic_programming")
>>> ms.set_auto_parallel_context(auto_parallel_search_mode="dynamic_programming")
>>> ms.set_auto_parallel_context(parameter_broadcast=False)
>>> ms.set_auto_parallel_context(strategy_ckpt_load_file="./strategy_stage1.ckpt")
>>> ms.set_auto_parallel_context(strategy_ckpt_save_file="./strategy_stage1.ckpt")
>>> ms.set_auto_parallel_context(dataset_strategy=((1, 8), (1, 8)))
>>> ms.set_auto_parallel_context(enable_parallel_optimizer=False)
>>> ms.set_auto_parallel_context(enable_alltoall=False)
>>> ms.set_auto_parallel_context(all_reduce_fusion_config=[8, 160])
>>> ms.set_auto_parallel_context(pipeline_stages=2)
>>> parallel_config = {"gradient_accumulation_shard": True, "parallel_optimizer_threshold": 24}
>>> ms.set_auto_parallel_context(parallel_optimizer_config=parallel_config, enable_parallel_optimizer=True)
>>> config = {"allreduce": {"mode": "size", "config": 32}, "allgather": {"mode": "size", "config": 32}}
>>> ms.set_auto_parallel_context(comm_fusion=config)
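
The examples above set device_num=8 and pipeline_stages=2, which divides the devices into two stages of four. The standalone sketch below shows one way to picture that split. It does not call MindSpore; the helper stage_of is a hypothetical name, and both the grouping of consecutive ranks into one stage and the divisibility check are assumptions made for illustration.

```python
def stage_of(rank, device_num, pipeline_stages):
    """Return the pipeline stage that a global rank falls into."""
    # Assumption: the devices split evenly across the stages.
    if device_num % pipeline_stages != 0:
        raise ValueError("device_num must be divisible by pipeline_stages")
    per_stage = device_num // pipeline_stages
    # Assumption: consecutive ranks are grouped into the same stage.
    return rank // per_stage


stages = [stage_of(rank, device_num=8, pipeline_stages=2) for rank in range(8)]
print(stages)  # [0, 0, 0, 0, 1, 1, 1, 1]
```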