Other Features
Parameter Server Training
Parameter Server is a widely used architecture in distributed training. Compared with the synchronous AllReduce approach used in data-parallel training, it offers better flexibility, scalability, and tolerance of node failures. The parameter server supports both synchronous and asynchronous SGD (Stochastic Gradient Descent) training algorithms. In terms of scalability, model computation and model update are deployed in the worker and server processes respectively, so worker and server resources can be scaled horizontally and independently (adding or removing workers and servers). In addition, in large-scale data centers, computing devices, networks, and storage frequently fail and cause node abnormalities; under the parameter server architecture, such failures can be handled easily without affecting the training job.
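The following is a minimal sketch of how parameter server mode might be enabled on the worker side, assuming MindSpore's `set_ps_context` and `Cell.set_param_ps` interfaces; the exact option names and the role-related environment variables (such as `MS_ROLE`) may vary across versions.

```python
# Hedged sketch: enable parameter server training for a toy network.
# Role assignment (scheduler / server / worker) is assumed to come from
# environment variables such as MS_ROLE, MS_SCHED_HOST, MS_SCHED_PORT.
import mindspore as ms
from mindspore import nn
from mindspore.communication import init

ms.set_ps_context(enable_ps=True)   # turn on parameter server mode
init()                              # initialize the communication backend

net = nn.Dense(32, 10)              # stand-in for a real model
net.set_param_ps()                  # host the trainable parameters on the server side
```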
Communication Operator Fusion
In distributed training scenarios, cross-device or even cross-node data transmission is a bottleneck that limits scalability and compute utilization. Communication operator fusion is an important way to improve network resource utilization and speed up data transmission: communication operators with the same source and destination nodes are packed together and executed at once, avoiding the extra overhead of launching many single operators.
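As an illustration, the sketch below configures fusion of AllReduce operators through the auto-parallel context. The `comm_fusion` option and its `"auto"` mode are assumptions based on recent MindSpore releases and may differ by version.

```python
# Hedged sketch: ask the framework to fuse AllReduce communication operators
# automatically in a data-parallel job.
import mindspore as ms
from mindspore.communication import init

init()  # initialize the communication backend for the distributed job

ms.set_auto_parallel_context(
    parallel_mode="data_parallel",
    comm_fusion={"allreduce": {"mode": "auto", "config": None}},  # fuse AllReduce ops
)
```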
Dataset Splitting
When doing distributed training, the training dataset needs to be imported to each device. There are two common ways to import it: 1) data-parallel import, in which the data is split along the batch dimension and each device imports one part; 2) importing the full dataset on every device. In addition, when some dimensions of the data are particularly large (for example, the H/W dimensions of remote sensing images can be very large), the images need to be split even if the number of samples is small: the data is split along the H/W dimensions and each device reads a part of each image. This feature supports splitting a dataset along specific dimensions to meet the training requirements of large-format image processing.
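A minimal sketch of the first approach (data-parallel import) is shown below: each device reads its own shard of the samples using the `num_shards`/`shard_id` arguments of a dataset loader. The dataset path is a placeholder.

```python
# Hedged sketch: each device imports one shard of the dataset.
import mindspore.dataset as ds
from mindspore.communication import init, get_rank, get_group_size

init()                          # initialize the communication backend
rank_id = get_rank()            # index of this device
rank_size = get_group_size()    # total number of devices

dataset = ds.ImageFolderDataset(
    "/path/to/train",           # placeholder dataset directory
    num_shards=rank_size,       # split the samples into rank_size shards
    shard_id=rank_id,           # this device reads only shard rank_id
)
dataset = dataset.batch(32)
```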
Functional Operator Splitting
In dynamic graph mode, you can specify that a part of the network structure is executed in graph mode, and apply various parallel operations to that part.
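The sketch below illustrates the idea with `mindspore.shard`, which runs a sub-network in graph mode under a user-specified sharding strategy while the rest of the script stays in PyNative mode. The strategy tuple and the eight-device layout are illustrative assumptions, and the `shard` interface may differ by version.

```python
# Hedged sketch: shard one sub-network while the script runs in dynamic graph mode.
import mindspore as ms
from mindspore import nn
from mindspore.communication import init

init()
ms.set_context(mode=ms.PYNATIVE_MODE)
ms.set_auto_parallel_context(parallel_mode="auto_parallel",
                             search_mode="sharding_propagation")

dense = nn.Dense(64, 64)
# Split the 2-D input across 8 devices along its first axis; the output
# strategy is left for the framework to derive.
sharded_dense = ms.shard(dense, in_strategy=((8, 1),))
```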
Performing Distributed Training on K8S Clusters
MindSpore Operator is a plugin that follows Kubernetes’ Operator pattern (based on the CRD, Custom Resource Definition, feature) and implements distributed training on Kubernetes. MindSpore Operator defines three roles in the CRD: Scheduler, PS, and Worker. Users can easily run MindSpore distributed training on K8S through a simple YAML file configuration. The code repository of MindSpore Operator is described in: ms-operator.