Other Features
Parameter Server Training
Parameter Server is a widely used architecture in distributed training. Compared with the synchronous AllReduce approach used in data-parallel training, it offers better flexibility, scalability, and tolerance of node failures. The parameter server supports both synchronous and asynchronous SGD (Stochastic Gradient Descent) training algorithms. In terms of scalability, model computation and model update are deployed in the worker and server processes respectively, so worker and server resources can be scaled horizontally and independently (adding or removing workers and servers). In addition, in large-scale data centers, computing devices, networks, and storage frequently fail and cause node abnormalities; under the parameter server architecture, such failures can be handled easily without affecting the training job.
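The following is a minimal sketch of how parameter server mode might be enabled on the worker side, assuming MindSpore's `set_ps_context` and `Cell.set_param_ps` interfaces; the exact option names and the role-related environment variables (such as `MS_ROLE`) may vary across versions.

```python
# Hedged sketch: enable parameter server training for a toy network.
# Role assignment (scheduler / server / worker) is assumed to come from
# environment variables such as MS_ROLE, MS_SCHED_HOST, MS_SCHED_PORT.
import mindspore as ms
from mindspore import nn
from mindspore.communication import init

ms.set_ps_context(enable_ps=True)   # turn on parameter server mode
init()                              # initialize the communication backend

net = nn.Dense(32, 10)              # stand-in for a real model
net.set_param_ps()                  # host the trainable parameters on the server side
```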
Communication Operator Fusion
In distributed training scenarios, cross-device or even cross-node data transmission is a bottleneck that limits scalability and compute utilization. Communication operator fusion is an important way to improve network resource utilization and speed up data transmission: communication operators with the same source and destination nodes are packed together and executed at once, avoiding the extra overhead of launching many single operators.
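As an illustration, the sketch below configures fusion of AllReduce operators through the auto-parallel context. The `comm_fusion` option and its `"auto"` mode are assumptions based on recent MindSpore releases and may differ by version.

```python
# Hedged sketch: ask the framework to fuse AllReduce communication operators
# automatically in a data-parallel job.
import mindspore as ms
from mindspore.communication import init

init()  # initialize the communication backend for the distributed job

ms.set_auto_parallel_context(
    parallel_mode="data_parallel",
    comm_fusion={"allreduce": {"mode": "auto", "config": None}},  # fuse AllReduce ops
)
```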
Dataset Splitting
When doing distributed training, the training dataset needs to be imported to each device. There are two common ways to import it: 1) data-parallel import, in which the data is split along the batch dimension and each device imports one part; 2) importing the full dataset on every device. In addition, when some dimensions of the data are particularly large (for example, the H/W dimensions of remote sensing images can be very large), the images need to be split even if the number of samples is small: the data is split along the H/W dimensions and each device reads a part of each image. This feature supports splitting a dataset along specific dimensions to meet the training requirements of large-format image processing.
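A minimal sketch of the first approach (data-parallel import) is shown below: each device reads its own shard of the samples using the `num_shards`/`shard_id` arguments of a dataset loader. The dataset path is a placeholder.

```python
# Hedged sketch: each device imports one shard of the dataset.
import mindspore.dataset as ds
from mindspore.communication import init, get_rank, get_group_size

init()                          # initialize the communication backend
rank_id = get_rank()            # index of this device
rank_size = get_group_size()    # total number of devices

dataset = ds.ImageFolderDataset(
    "/path/to/train",           # placeholder dataset directory
    num_shards=rank_size,       # split the samples into rank_size shards
    shard_id=rank_id,           # this device reads only shard rank_id
)
dataset = dataset.batch(32)
```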
Functional Operator Splitting
In dynamic graph mode, you can specify that a part of the network structure is executed in graph mode, and apply various parallel operations to that part.
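The sketch below illustrates the idea with `mindspore.shard`, which runs a sub-network in graph mode under a user-specified sharding strategy while the rest of the script stays in PyNative mode. The strategy tuple and the eight-device layout are illustrative assumptions, and the `shard` interface may differ by version.

```python
# Hedged sketch: shard one sub-network while the script runs in dynamic graph mode.
import mindspore as ms
from mindspore import nn
from mindspore.communication import init

init()
ms.set_context(mode=ms.PYNATIVE_MODE)
ms.set_auto_parallel_context(parallel_mode="auto_parallel",
                             search_mode="sharding_propagation")

dense = nn.Dense(64, 64)
# Split the 2-D input across 8 devices along its first axis; the output
# strategy is left for the framework to derive.
sharded_dense = ms.shard(dense, in_strategy=((8, 1),))
```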
Performing Distributed Training on K8S Clusters
MindSpore Operator is a plugin that follows Kubernetes’ Operator pattern (based on the CRD, Custom Resource Definition, feature) and implements distributed training on Kubernetes. MindSpore Operator defines three roles in the CRD: Scheduler, PS, and Worker. Users can easily run MindSpore distributed training on K8S through a simple YAML file configuration. The code repository of MindSpore Operator is described in: ms-operator.