mindspore.set_auto_parallel_context
- mindspore.set_auto_parallel_context(**kwargs)
Set the auto parallel context. On CPU, only data parallel is supported.
Note
Attributes must be set by name, as keyword arguments. If a program runs tasks in different parallel modes, call mindspore.reset_auto_parallel_context() to reset the configuration before setting a new parallel mode for the next task. The parallel mode must be set or changed before any Initializer is created; otherwise, a RuntimeError may be raised when compiling the network.
Some configurations are specific to a parallel mode; see the table below for details:
Common                      AUTO_PARALLEL
device_num                  gradient_fp32_sync
global_rank                 loss_repeated_mean
gradients_mean              search_mode
parallel_mode               parameter_broadcast
all_reduce_fusion_config    strategy_ckpt_load_file
enable_parallel_optimizer   strategy_ckpt_save_file
parallel_optimizer_config   full_batch
enable_alltoall             dataset_strategy
                            pipeline_stages
                            pipeline_result_broadcast
                            auto_parallel_search_mode
                            comm_fusion
                            strategy_ckpt_config
                            group_ckpt_save_file
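The note above about switching parallel modes between tasks can be sketched as follows. The helper names configure_parallel and run_two_tasks are hypothetical and used only for illustration. The MindSpore import is deferred into the functions, and the call to ms.set_context(mode=ms.GRAPH_MODE) is an assumption made because the second task uses a mode that PyNative mode does not support.

```python
def configure_parallel(**kwargs):
    """Clear any earlier auto parallel settings, then apply the new ones."""
    import mindspore as ms  # deferred import: only needed when the helper is called

    ms.reset_auto_parallel_context()
    ms.set_auto_parallel_context(**kwargs)


def run_two_tasks():
    """Illustrative flow only: two tasks that use different parallel modes."""
    import mindspore as ms

    # Assumption: graph mode, since PyNative mode supports only
    # "stand_alone" and "data_parallel".
    ms.set_context(mode=ms.GRAPH_MODE)

    # Task 1: data parallelism.
    configure_parallel(parallel_mode="data_parallel", gradients_mean=True)
    # ... create Initializers, then build and run the first network here ...

    # Task 2: a different parallel mode, so the context is reset first.
    configure_parallel(parallel_mode="semi_auto_parallel")
    # ... create Initializers, then build and run the second network here ...
```

In both tasks the parallel mode is set before any Initializer is created, which is the ordering the note asks for.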
- Parameters
device_num (int) – Number of available devices. The value must be in [1, 4096]. Default: 1.
global_rank (int) – Global rank id. The value must be in [0, 4095]. Default: 0.
gradients_mean (bool) – Whether to apply the mean operator after the allreduce of gradients. "stand_alone" does not support gradients_mean. Default: False.
gradient_fp32_sync (bool) – Whether to run the allreduce of gradients in fp32. "stand_alone", "data_parallel" and "hybrid_parallel" do not support gradient_fp32_sync. Default: True.
loss_repeated_mean (bool) – Whether the mean operator is applied in the backward pass when the calculation is repeated. Default: True.
parallel_mode (str) –
There are five parallel modes: "stand_alone", "data_parallel", "hybrid_parallel", "semi_auto_parallel" and "auto_parallel". Note that PyNative mode supports only the "stand_alone" and "data_parallel" modes. Default: "stand_alone".
stand_alone: Only one processor is working.
data_parallel: Distributes the data across different processors.
hybrid_parallel: Achieves data parallelism and model parallelism manually.
semi_auto_parallel: Achieves data and model parallelism by setting parallel strategies.
auto_parallel: Achieves parallelism automatically.
search_mode (str) –
There are three shard strategy search modes: "recursive_programming", "sharding_propagation" and "dynamic_programming" (not recommended). Default: "recursive_programming".
recursive_programming: Recursive programming search mode. To obtain optimal performance, it is recommended to set the batch size to be greater than or equal to the product of the number of devices and the number of multi-copy parallelism.
sharding_propagation: Propagates shardings from configured ops to non-configured ops.
dynamic_programming: Dynamic programming search mode.
auto_parallel_search_mode (str) – This is the old version of 'search_mode'. It is retained only for compatibility and will be removed in a future MindSpore version.
parameter_broadcast (bool) – Whether to broadcast parameters before training. So that all devices start from the same network initialization parameter values, the parameters on device 0 are broadcast to the other devices before training. Parameter broadcasting differs between parallel modes: in data_parallel mode, all parameters are broadcast except those whose attribute layerwise_parallel is True; in hybrid_parallel, semi_auto_parallel and auto_parallel mode, the segmented parameters do not participate in broadcasting. Default: False.
strategy_ckpt_load_file (str) – The path to load the parallel strategy checkpoint from. This parameter is not recommended currently; use 'strategy_ckpt_config' instead. Default: ''.
strategy_ckpt_save_file (str) – The path to save the parallel strategy checkpoint to. This parameter is not recommended currently; use 'strategy_ckpt_config' instead. Default: ''.
full_batch (bool) – If the whole batch dataset is loaded in auto_parallel mode, this parameter should be set to True. Default: False. This interface is not recommended currently; use 'dataset_strategy' instead.
dataset_strategy (Union[str, tuple]) – Dataset sharding strategy. Default: "data_parallel". dataset_strategy="data_parallel" is equivalent to full_batch=False, and dataset_strategy="full_batch" is equivalent to full_batch=True. When the execution mode is 'GRAPH_MODE' and the dataset is loaded into the net by a model parallel strategy such as ds_stra ((1, 8), (1, 8)), set_auto_parallel_context(dataset_strategy=ds_stra) must be used.
enable_parallel_optimizer (bool) – This is a feature under development, which shards the weight update computation in data parallel training to save time and memory. Currently, auto and semi auto parallel modes support all optimizers on both Ascend and GPU. Data parallel mode only supports Lamb and AdamWeightDecay on Ascend. Default: False.
enable_alltoall (bool) – A switch that allows AllToAll operators to be generated during communication. If its value is False, a combination of operators such as AllGather, Split and Concat is used instead of AllToAll. Default: False.
all_reduce_fusion_config (list) – Set the allreduce fusion strategy by parameter indices. Only ReduceOp.SUM and HCCL_WORLD_GROUP/NCCL_WORLD_GROUP are supported. No default; if it is not set, fusion is disabled.
pipeline_stages (int) – Set the stage information for pipeline parallel. This indicates how the devices are distributed along the pipeline. The total devices will be divided into 'pipeline_stages' stages. Default: 1.
pipeline_result_broadcast (bool) – A switch that broadcasts the last stage result to all other stages in pipeline parallel inference. Default: False.
parallel_optimizer_config (dict) –
A dict containing the keys and values for configuring the parallel optimizer. The configuration provides more detailed control over parallel training behavior when the parallel optimizer is enabled. It takes effect only when mindspore.set_auto_parallel_context(enable_parallel_optimizer=True) is used. It supports the following keys.
gradient_accumulation_shard(bool): If True, the accumulated gradient parameters will be sharded across the data parallel devices. This introduces additional communication (ReduceScatter) at each step when accumulating the gradients, but saves a lot of device memory, so the model can be trained with a larger batch size. This configuration is effective only when the model runs with pipeline training or gradient accumulation with data parallel. Default: False.
parallel_optimizer_threshold(int): Set the threshold of the parallel optimizer. When the parallel optimizer is enabled, parameters with a size smaller than this threshold will not be sharded across the devices. Parameter size = shape[0] * … * shape[n] * size(dtype). Non-negative. Unit: KB. Default: 64.
optimizer_weight_shard_size(int): Set the optimizer weight shard group size, if you want to specify the maximum group size across devices when the parallel optimizer is enabled. The numerical range can be (0, device_num]. If pipeline parallel is enabled, the numerical range is (0, device_num/stage]. If the size of the data parallel communication domain of the parameter is not divisible by optimizer_weight_shard_size, the specified communication group size will not take effect. Default: -1, which means the optimizer weight shard group size will be the size of the data parallel group of each parameter.
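The size formula given for parallel_optimizer_threshold can be worked through in plain Python. This is only a sketch of the arithmetic, not MindSpore's implementation: parameter_size_kb and below_threshold are hypothetical helpers, and the conversion assumes 1 KB = 1024 bytes.

```python
import math


def parameter_size_kb(shape, dtype_bytes):
    """shape[0] * ... * shape[n] * size(dtype), converted to KB (assuming 1 KB = 1024 bytes)."""
    return math.prod(shape) * dtype_bytes / 1024


def below_threshold(shape, dtype_bytes, threshold_kb=64):
    """True when the parameter is smaller than the threshold, so it would not be sharded."""
    return parameter_size_kb(shape, dtype_bytes) < threshold_kb


# A float32 parameter (4 bytes per element) of shape (64, 64) is 16 KB,
# which is below the default threshold of 64 KB.
small = below_threshold((64, 64), dtype_bytes=4)

# A float32 parameter of shape (128, 128) is exactly 64 KB, which is not
# smaller than the default threshold.
large = below_threshold((128, 128), dtype_bytes=4)
```

Under these assumptions, small is True and large is False, so only the second parameter is large enough to be considered for sharding.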
comm_fusion (dict) –
A dict containing the types and configurations for communication fusion. Each communication fusion config has two keys: "mode" and "config". It supports the following communication fusion types and configurations:
openstate: Whether to turn on communication fusion. If openstate is True, communication fusion is turned on; otherwise, it is turned off. Default: True.
allreduce: The communication fusion type is allreduce. The modes are auto, size and index. In auto mode, AllReduce fusion is configured by gradient size, and the default fusion threshold is 64 MB. In size mode, AllReduce fusion is configured by gradient size manually, and the fusion threshold must be larger than 0 MB. Index mode is the same as all_reduce_fusion_config.
allgather: The communication fusion type is allgather. The modes are auto and size. In auto mode, AllGather fusion is configured by gradient size, and the default fusion threshold is 64 MB. In size mode, AllGather fusion is configured by gradient size manually, and the fusion threshold must be larger than 0 MB.
reducescatter: The communication fusion type is reducescatter. The modes are auto and size. The config is the same as for allgather.
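The rules listed for comm_fusion can be checked before the dict is handed to MindSpore. The sketch below is a hypothetical pre-check, not a MindSpore API: check_comm_fusion, apply_comm_fusion and ALLOWED_MODES are illustrative names, and only the constraints stated above are enforced (allowed modes per fusion type, and a positive threshold in size mode).

```python
# Modes documented above for each communication fusion type.
ALLOWED_MODES = {
    "allreduce": {"auto", "size", "index"},
    "allgather": {"auto", "size"},
    "reducescatter": {"auto", "size"},
}


def check_comm_fusion(config):
    """Hypothetical pre-check (not part of MindSpore) for a comm_fusion dict."""
    for fusion_type, setting in config.items():
        if fusion_type == "openstate":
            if not isinstance(setting, bool):
                raise TypeError("openstate must be a bool")
            continue
        if fusion_type not in ALLOWED_MODES:
            raise ValueError(f"unknown communication fusion type: {fusion_type!r}")
        mode = setting["mode"]
        if mode not in ALLOWED_MODES[fusion_type]:
            raise ValueError(f"{fusion_type} does not support mode {mode!r}")
        # In size mode the fusion threshold (in MB) must be larger than 0.
        if mode == "size" and not setting["config"] > 0:
            raise ValueError(f"{fusion_type} size threshold must be larger than 0 MB")
    return config


def apply_comm_fusion(config):
    """Validate the dict, then pass it to MindSpore (import deferred on purpose)."""
    import mindspore as ms

    ms.set_auto_parallel_context(comm_fusion=check_comm_fusion(config))


config = {"allreduce": {"mode": "size", "config": 32}, "allgather": {"mode": "size", "config": 32}}
check_comm_fusion(config)  # passes without raising
```

A dict such as {"allgather": {"mode": "index", "config": [1]}} would be rejected by this check, since index mode is documented only for allreduce.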
strategy_ckpt_config (dict) –
A dict containing the configurations for the parallel strategy file. This interface covers the functions of the parameters strategy_ckpt_load_file and strategy_ckpt_save_file, and it is recommended to use it in place of those two parameters. It contains the following configurations:
load_file (str): The path to load the parallel strategy checkpoint from. If the file name extension is .json, the file is loaded in JSON format. Otherwise, the file is loaded in ProtoBuf format. Default: ''.
save_file (str): The path to save the parallel strategy checkpoint to. If the file name extension is .json, the file is saved in JSON format. Otherwise, the file is saved in ProtoBuf format. Default: ''.
only_trainable_params (bool): Only save/load the strategy information for trainable parameters. Default: True.
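The file format rule for load_file and save_file depends only on the file name extension. A minimal sketch of that rule follows; strategy_file_format and apply_strategy_ckpt_config are hypothetical helpers, the file paths are placeholders, and the case-sensitive match on ".json" is an assumption.

```python
def strategy_file_format(path):
    """Format implied by the file name: JSON for a .json extension, ProtoBuf otherwise."""
    # Assumption: the extension is matched case-sensitively.
    return "JSON" if path.endswith(".json") else "ProtoBuf"


stra_ckpt_dict = {
    "load_file": "./stra0.ckpt",   # no .json extension, so loaded as ProtoBuf
    "save_file": "./stra1.json",   # .json extension, so saved as JSON
    "only_trainable_params": False,
}

load_format = strategy_file_format(stra_ckpt_dict["load_file"])
save_format = strategy_file_format(stra_ckpt_dict["save_file"])


def apply_strategy_ckpt_config(config):
    """Pass the dict to MindSpore (import deferred so the sketch above runs without it)."""
    import mindspore as ms

    ms.set_auto_parallel_context(strategy_ckpt_config=config)
```

With these placeholder paths, the strategy is read in ProtoBuf format and written in JSON format.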
group_ckpt_save_file (str) – The path to save parallel group checkpoint.
- Raises
ValueError – If an input key is not an attribute of the auto parallel context.
Examples
>>> import mindspore as ms
>>> ms.set_auto_parallel_context(device_num=8)
>>> ms.set_auto_parallel_context(global_rank=0)
>>> ms.set_auto_parallel_context(gradients_mean=True)
>>> ms.set_auto_parallel_context(gradient_fp32_sync=False)
>>> ms.set_auto_parallel_context(parallel_mode="auto_parallel")
>>> ms.set_auto_parallel_context(search_mode="recursive_programming")
>>> ms.set_auto_parallel_context(auto_parallel_search_mode="recursive_programming")
>>> ms.set_auto_parallel_context(parameter_broadcast=False)
>>> ms.set_auto_parallel_context(strategy_ckpt_load_file="./strategy_stage1.ckpt")
>>> ms.set_auto_parallel_context(strategy_ckpt_save_file="./strategy_stage1.ckpt")
>>> ms.set_auto_parallel_context(dataset_strategy=((1, 8), (1, 8)))
>>> ms.set_auto_parallel_context(enable_parallel_optimizer=False)
>>> ms.set_auto_parallel_context(enable_alltoall=False)
>>> ms.set_auto_parallel_context(all_reduce_fusion_config=[8, 160])
>>> ms.set_auto_parallel_context(pipeline_stages=2)
>>> ms.set_auto_parallel_context(pipeline_stages=2, pipeline_result_broadcast=True)
>>> parallel_config = {"gradient_accumulation_shard": True, "parallel_optimizer_threshold": 24,
...                    "optimizer_weight_shard_size": 2}
>>> ms.set_auto_parallel_context(parallel_optimizer_config=parallel_config, enable_parallel_optimizer=True)
>>> config = {"allreduce": {"mode": "size", "config": 32}, "allgather": {"mode": "size", "config": 32}}
>>> ms.set_auto_parallel_context(comm_fusion=config)
>>> stra_ckpt_dict = {"load_file": "./stra0.ckpt", "save_file": "./stra1.ckpt", "only_trainable_params": False}
>>> ms.set_auto_parallel_context(strategy_ckpt_config=stra_ckpt_dict)