mindspore.set_auto_parallel_context

mindspore.set_auto_parallel_context(**kwargs)[source]

Set the auto parallel context. This API will be deprecated and removed in a future version; use mindspore.parallel.auto_parallel.AutoParallel instead.

Note

CPU only supports data parallel.

Some configurations are specific to a parallel mode; see the table below for details:

Common	AUTO_PARALLEL
device_num	gradient_fp32_sync
global_rank	loss_repeated_mean
gradients_mean	search_mode
parallel_mode	parameter_broadcast
all_reduce_fusion_config	strategy_ckpt_load_file
enable_parallel_optimizer	strategy_ckpt_save_file
parallel_optimizer_config	dataset_strategy
enable_alltoall	pipeline_stages
pipeline_config	auto_parallel_search_mode
force_fp32_communication	pipeline_result_broadcast
	comm_fusion
	strategy_ckpt_config
	group_ckpt_save_file
	auto_pipeline
	dump_local_norm
	dump_local_norm_path
	dump_device_local_norm

Parameters

device_num (int) – Available device number, the value must be in [1, 4096]. Default: 1 .
global_rank (int) – Global rank id, the value must be in [0, 4095]. Default: 0 .
gradients_mean (bool) – Whether to apply the mean operator after allreduce of gradients. "stand_alone" does not support gradients_mean. Default: False .
gradient_fp32_sync (bool) – Run allreduce of gradients in fp32. "stand_alone", "data_parallel" and "hybrid_parallel" do not support gradient_fp32_sync. Default: True .
loss_repeated_mean (bool) – Indicates whether the mean operator in backward will be performed when the calculation is repeated. Default: True .
parallel_mode (str) –
There are five kinds of parallel modes, "stand_alone" , "data_parallel" , "hybrid_parallel" , "semi_auto_parallel" and "auto_parallel" . Note the pynative mode only supports the "stand_alone" and "data_parallel" mode. Default: "stand_alone" .
- stand_alone: Only one processor is working.
- data_parallel: Distributes the data across different processors.
- hybrid_parallel: Achieves data parallelism and model parallelism manually.
- semi_auto_parallel: Achieves data and model parallelism by setting parallel strategies.
- auto_parallel: Achieves parallelism automatically.
search_mode (str) –
There are three kinds of shard strategy search modes: "recursive_programming" , "sharding_propagation" and "dynamic_programming" (Not recommended). Only works in "auto_parallel" mode. Default: "recursive_programming" .
- recursive_programming: Recursive programming search mode. In order to obtain optimal performance, it is recommended that users set the batch size to be greater than or equal to the product of the number of devices and the number of multi-copy parallelism.
- sharding_propagation: Propagate shardings from configured ops to non-configured ops. Dynamic shapes are not supported currently.
- dynamic_programming: Dynamic programming search mode.
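The batch size recommendation for recursive_programming above is simple arithmetic. A minimal sketch follows; the helper names and the multi_copy argument (standing for the number of multi-copy parallelism) are hypothetical and not part of MindSpore:

```python
def min_recommended_batch_size(device_num: int, multi_copy: int) -> int:
    """Smallest batch size with batch_size >= device_num * multi_copy.

    Hypothetical helper: it only restates the recommendation as arithmetic.
    """
    if device_num < 1 or multi_copy < 1:
        raise ValueError("device_num and multi_copy must be positive")
    return device_num * multi_copy


def meets_recommendation(batch_size: int, device_num: int, multi_copy: int) -> bool:
    """True if the batch size satisfies the recommendation."""
    return batch_size >= min_recommended_batch_size(device_num, multi_copy)


# 8 devices with 2-way multi-copy parallelism: the bound is 8 * 2 = 16.
lower_bound = min_recommended_batch_size(8, 2)
```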
auto_parallel_search_mode (str) – The old name of 'search_mode'. This attribute is retained only for compatibility with existing code and will be deleted in a future MindSpore version.
parameter_broadcast (bool) – Whether to broadcast parameters before training. So that all devices start from the same network initialization parameter values, the parameters on device 0 are broadcast to the other devices before training. Broadcasting behaves differently across parallel modes: in data_parallel mode, all parameters are broadcast except those whose layerwise_parallel attribute is True ; in hybrid_parallel , semi_auto_parallel and auto_parallel modes, the segmented parameters do not participate in broadcasting. Default: False .
strategy_ckpt_load_file (str) – The path to load the parallel strategy checkpoint. This parameter is currently not recommended; use 'strategy_ckpt_config' instead. Default: ''
strategy_ckpt_save_file (str) – The path to save the parallel strategy checkpoint. This parameter is currently not recommended; use 'strategy_ckpt_config' instead. Default: ''
full_batch (bool) – Set this parameter to True if whole batch datasets are loaded in auto_parallel mode. Default: False . This interface is currently not recommended; use 'dataset_strategy' instead.
dataset_strategy (Union[str, tuple]) – Dataset sharding strategy. Default: "data_parallel" . dataset_strategy="data_parallel" is equivalent to full_batch=False, and dataset_strategy="full_batch" is equivalent to full_batch=True. When the execution mode is 'GRAPH_MODE' and the dataset is loaded into the net with a model parallel strategy such as ds_stra ((1, 8), (1, 8)), use set_auto_parallel_context(dataset_strategy=ds_stra). The dataset sharding strategy is not affected by the currently configured parallel mode. The strategy also supports a tuple of Layout.
enable_parallel_optimizer (bool) – A feature under development that shards the weight update computation in data parallel training to save time and memory. Currently, auto and semi auto parallel modes support all optimizers on both Ascend and GPU. Data parallel mode only supports Lamb and AdamWeightDecay on Ascend . Default: False .
force_fp32_communication (bool) – A switch that determines whether reduce operators (AllReduce, ReduceScatter) are forced to use the fp32 data type for communication. True enables the switch. Default: False .
enable_alltoall (bool) – A switch that allows AllToAll operators to be generated during communication. If its value is False , there will be a combination of operators such as AllGather, Split and Concat instead of AllToAll. Default: False .
all_reduce_fusion_config (list) – Set the allreduce fusion strategy by parameter indices. Only ReduceOp.SUM and HCCL_WORLD_GROUP/NCCL_WORLD_GROUP are supported. There is no default; if it is not set, fusion is disabled.
pipeline_stages (int) – Set the stage information for pipeline parallel. This indicates how the devices are distributed along the pipeline. The total devices will be divided into 'pipeline_stages' stages. Default: 1 .
pipeline_result_broadcast (bool) – A switch that broadcasts the last stage's result to all other stages in pipeline parallel inference. Default: False .
pipeline_config (dict) –
A dict containing the keys and values for configuring pipeline parallelism. It supports the following keys:
- pipeline_interleave(bool): Indicates whether to enable the interleaved execution mode.
- pipeline_scheduler(str): Indicates the scheduling mode for pipeline parallelism. Only gpipe/1f1b/seqpipe/seqvpp/seqsmartvpp/zero_bubble_v are supported. When applying seqsmartvpp, the pipeline parallel number must be an even number.
parallel_optimizer_config (dict) –
A dict containing the keys and values for configuring the parallel optimizer. The configuration provides finer control over parallel training behavior when the parallel optimizer is enabled. It takes effect only when mindspore.set_auto_parallel_context(enable_parallel_optimizer=True) is used. It supports the following keys:
- gradient_accumulation_shard(bool): Use optimizer_level: level2 instead of this config. If true , the accumulated gradient parameters will be sharded across the data parallel devices. This introduces additional communication (ReduceScatter) at each step when accumulating gradients, but saves a lot of device memory, which allows the model to be trained with a larger batch size. This configuration is effective only when the model runs with pipeline training or gradient accumulation with data parallel. Default: False .
- parallel_optimizer_threshold(int): Set the threshold of parallel optimizer. When parallel optimizer is enabled, parameters with size smaller than this threshold will not be sharded across the devices. Parameter size is calculated as: shape[0] * … * shape[n] * size(dtype). Non-negative. Unit: KB. Default: 64 .
- optimizer_weight_shard_size(int): Set the optimizer weight shard group size to specify the maximum group size across devices when the parallel optimizer is enabled. The value range is (0, device_num]. If pipeline parallel is enabled, the value range is (0, device_num/stage]. If the size of the data parallel communication domain of the parameter is not divisible by optimizer_weight_shard_size, the specified communication group size will not take effect. Default value is -1 , which means the optimizer weight shard group size will be the size of the data parallel group of each parameter.
- optimizer_level(str, optional): Specifies the splitting level for optimizer sharding. Note that the implementation of optimizer sharding in static graph differs from dynamic graph implementations such as Megatron, but the memory optimization effect is the same. When optimizer_level= level1 , splitting is performed on weights and optimizer state. When optimizer_level= level2 , splitting is performed on weights, optimizer state, and gradients. When optimizer_level= level3 , splitting is performed on weights, optimizer state, and gradients; additionally, before the backward pass, the weights are further applied with allgather communication to release the memory used by the forward pass allgather. It must be one of [level1, level2, level3]. Default: level1.
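The size formula for parallel_optimizer_threshold and the range rule for optimizer_weight_shard_size can be checked by hand. The sketch below is illustrative only: the helper names, the small dtype table, the assumption that 1 KB is 1024 bytes, and the treatment of a parameter exactly at the threshold as sharded are assumptions, not MindSpore code.

```python
import math

# Assumed byte widths for a few common dtypes (illustrative, not a MindSpore table).
DTYPE_BYTES = {"float16": 2, "bfloat16": 2, "float32": 4, "int32": 4, "float64": 8}


def param_size_kb(shape, dtype):
    """shape[0] * ... * shape[n] * size(dtype), converted from bytes to KB.

    Assumes 1 KB = 1024 bytes.
    """
    return math.prod(shape) * DTYPE_BYTES[dtype] / 1024


def is_sharded(shape, dtype, threshold_kb=64):
    """Parameters smaller than the threshold are not sharded."""
    return param_size_kb(shape, dtype) >= threshold_kb


def shard_size_takes_effect(shard_size, device_num, pipeline_stages, dp_group_size):
    """Check the documented range (0, device_num/stage] and divisibility rule."""
    upper = device_num // pipeline_stages
    in_range = 0 < shard_size <= upper
    return in_range and dp_group_size % shard_size == 0


# 128 * 128 * 4 bytes = 65536 bytes = 64 KB
size_kb = param_size_kb((128, 128), "float32")
```

With the default threshold of 64, a float32 parameter of shape (64, 64) occupies 16 KB and would therefore stay unsharded under these assumptions.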
comm_fusion (dict) –
A dict containing the types and configurations for communication fusion. Each communication fusion config has two keys: "mode" and "config". It supports the following communication fusion types and configurations:
- openstate: Whether to turn on communication fusion. If openstate is True , communication fusion is turned on; otherwise it is turned off. Default: True .
- allreduce: Configures fusion for the allreduce communication type. The modes are auto, size and index. In auto mode, AllReduce fusion is configured by gradient size, and the default fusion threshold is 64 MB. In size mode, AllReduce fusion is configured by gradient size manually, and the fusion threshold must be larger than 0 MB. Index mode is the same as all_reduce_fusion_config.
- allgather: Configures fusion for the allgather communication type. The modes are auto and size. In auto mode, AllGather fusion is configured by gradient size, and the default fusion threshold is 64 MB. In size mode, AllGather fusion is configured by gradient size manually, and the fusion threshold must be larger than 0 MB.
- reducescatter: Configures fusion for the reducescatter communication type. The modes are auto and size. The config is the same as for allgather.
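The mode rules above can be checked before a comm_fusion dict is passed to the context. This is a hedged sketch: fusion_entry and ALLOWED_MODES are hypothetical helpers written for illustration, and leaving "config" as None in auto mode is an assumption rather than documented behavior.

```python
# Modes listed on this page for each communication fusion type.
ALLOWED_MODES = {
    "allreduce": {"auto", "size", "index"},
    "allgather": {"auto", "size"},
    "reducescatter": {"auto", "size"},
}


def fusion_entry(kind, mode, config=None):
    """Build one {"mode": ..., "config": ...} entry after basic checks.

    Hypothetical helper; MindSpore performs its own validation.
    """
    if kind not in ALLOWED_MODES:
        raise ValueError(f"unknown fusion type: {kind!r}")
    if mode not in ALLOWED_MODES[kind]:
        raise ValueError(f"mode {mode!r} is not listed for {kind!r}")
    # In size mode the fusion threshold (MB) must be larger than 0.
    if mode == "size" and not (isinstance(config, (int, float)) and config > 0):
        raise ValueError("size mode needs a threshold larger than 0 MB")
    return {"mode": mode, "config": config}


comm_fusion = {
    "allreduce": fusion_entry("allreduce", "size", 32),
    "allgather": fusion_entry("allgather", "size", 32),
}
```

The resulting dict has the same shape as the comm_fusion value used in the Examples section.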
strategy_ckpt_config (dict) –
A dict containing the configurations for the parallel strategy file. This interface covers the functions of the parameters strategy_ckpt_load_file and strategy_ckpt_save_file, and it is recommended to use this parameter instead of those two. It contains the following configurations:
- load_file (str): The path to load parallel strategy checkpoint. If the file name extension is .json, the file is loaded in JSON format. Otherwise, the file is loaded in ProtoBuf format. Default: ''
- save_file (str): The path to save parallel strategy checkpoint. If the file name extension is .json, the file is saved in JSON format. Otherwise, the file is saved in ProtoBuf format. Default: ''
- only_trainable_params (bool): Only save/load the strategy information for trainable parameters. Default: True .
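The extension rule for load_file and save_file can be pictured as follows. The function strategy_file_format is a hypothetical helper, and matching the extension case-sensitively is an assumption; the page only states that a .json extension selects JSON and anything else selects ProtoBuf.

```python
from pathlib import Path


def strategy_file_format(path: str) -> str:
    """Return the format implied by the file name extension.

    Hypothetical helper; assumes a case-sensitive match on ".json".
    """
    return "JSON" if Path(path).suffix == ".json" else "ProtoBuf"


# A config that saves the strategy as JSON and loads an existing ProtoBuf file.
stra_ckpt_dict = {
    "load_file": "./stra0.ckpt",
    "save_file": "./stra1.json",
    "only_trainable_params": False,
}
formats = {key: strategy_file_format(stra_ckpt_dict[key]) for key in ("load_file", "save_file")}
```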
group_ckpt_save_file (str) – The path to save parallel group checkpoint.
auto_pipeline (bool) – Set the pipeline stage number to automatic. Its value will be selected between 1 and the parameter pipeline_stages. This option requires the parallel_mode to be auto_parallel and the search_mode to be recursive_programming. Default: False .
dump_local_norm (bool) – Whether to dump local_norm value, when the parallel_mode is set to semi_auto_parallel or auto_parallel. Default: False .
dump_local_norm_path (str) – The path to save dump files of local_norm value. Default: '' .
dump_device_local_norm (bool) – Whether to dump device_local_norm value, when the parallel_mode is set to semi_auto_parallel or auto_parallel. Default: False .

Raises

ValueError – If an input key is not an attribute of the auto parallel context.
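A misspelled keyword is the usual way to hit this error. The sketch below shows one way to catch such typos before calling the API; check_keys and DOCUMENTED_KEYS are hypothetical, built only from the parameter names listed on this page, and they do not reproduce MindSpore's own validation.

```python
# Parameter names documented on this page (assumed complete for illustration).
DOCUMENTED_KEYS = {
    "device_num", "global_rank", "gradients_mean", "gradient_fp32_sync",
    "loss_repeated_mean", "parallel_mode", "search_mode",
    "auto_parallel_search_mode", "parameter_broadcast",
    "strategy_ckpt_load_file", "strategy_ckpt_save_file", "full_batch",
    "dataset_strategy", "enable_parallel_optimizer", "force_fp32_communication",
    "enable_alltoall", "all_reduce_fusion_config", "pipeline_stages",
    "pipeline_result_broadcast", "pipeline_config", "parallel_optimizer_config",
    "comm_fusion", "strategy_ckpt_config", "group_ckpt_save_file",
    "auto_pipeline", "dump_local_norm", "dump_local_norm_path",
    "dump_device_local_norm",
}


def check_keys(**kwargs):
    """Raise ValueError for any key that is not documented on this page."""
    unknown = sorted(set(kwargs) - DOCUMENTED_KEYS)
    if unknown:
        raise ValueError(f"unknown auto parallel context keys: {unknown}")
    return kwargs


checked = check_keys(device_num=8, parallel_mode="data_parallel")
```

The returned dict could then be unpacked into the call, for example ms.set_auto_parallel_context(**checked).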

Examples

>>> import mindspore as ms
>>> ms.set_auto_parallel_context(device_num=8)
>>> ms.set_auto_parallel_context(global_rank=0)
>>> ms.set_auto_parallel_context(gradients_mean=True)
>>> ms.set_auto_parallel_context(gradient_fp32_sync=False)
>>> ms.set_auto_parallel_context(parallel_mode="auto_parallel")
>>> ms.set_auto_parallel_context(search_mode="recursive_programming")
>>> ms.set_auto_parallel_context(auto_parallel_search_mode="recursive_programming")
>>> ms.set_auto_parallel_context(parameter_broadcast=False)
>>> ms.set_auto_parallel_context(strategy_ckpt_load_file="./strategy_stage1.ckpt")
>>> ms.set_auto_parallel_context(strategy_ckpt_save_file="./strategy_stage1.ckpt")
>>> ms.set_auto_parallel_context(dataset_strategy=((1, 8), (1, 8)))
>>> ms.set_auto_parallel_context(enable_parallel_optimizer=False)
>>> ms.set_auto_parallel_context(enable_alltoall=False)
>>> ms.set_auto_parallel_context(all_reduce_fusion_config=[8, 160])
>>> ms.set_auto_parallel_context(pipeline_stages=2)
>>> ms.set_auto_parallel_context(pipeline_stages=2, pipeline_result_broadcast=True)
>>> parallel_config = {"gradient_accumulation_shard": True, "parallel_optimizer_threshold": 24,
...                    "optimizer_weight_shard_size": 2, "optimizer_level": "level3"}
>>> ms.set_auto_parallel_context(parallel_optimizer_config=parallel_config, enable_parallel_optimizer=True)
>>> config = {"allreduce": {"mode": "size", "config": 32}, "allgather": {"mode": "size", "config": 32}}
>>> ms.set_auto_parallel_context(comm_fusion=config)
>>> stra_ckpt_dict = {"load_file": "./stra0.ckpt", "save_file": "./stra1.ckpt", "only_trainable_params": False}
>>> ms.set_auto_parallel_context(strategy_ckpt_config=stra_ckpt_dict)