Performing Distributed Training on K8S Clusters
MindSpore Operator is a plugin that follows the Kubernetes Operator pattern (based on the CRD, Custom Resource Definition, feature) and implements distributed training on Kubernetes. MindSpore Operator defines three roles in the CRD: Scheduler, PS, and Worker. Users can easily run distributed MindSpore training on K8S through simple YAML file configuration. The code repository of MindSpore Operator is described in: ms-operator.
Installation
There are three installation methods:
Install directly by using YAML
kubectl apply -f deploy/v1/ms-operator.yaml
After installation:
Use
kubectl get pods --all-namespaces
to check the deployment in the ms-operator-system namespace. Use
kubectl describe pod ms-operator-controller-manager-xxx-xxx -n ms-operator-system
to view the pod details.
Install by using make deploy
make deploy IMG=swr.cn-south-1.myhuaweicloud.com/mindspore/ms-operator:latest
Local debugging environment
make run
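Whichever method you use, a quick way to confirm that the MSJob CRD has been registered in the cluster is to list CRDs and filter by the API group (mindspore.gitee.com, as used in the sample YAML below):
kubectl get crd | grep mindspore.gitee.com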
Sample
The current ms-operator supports ordinary single-Worker training, single-Worker training in PS mode, and Scheduler and Worker startup for automatic parallelism (such as data parallelism and model parallelism).
There are running examples in config/samples/. Take the data-parallel Scheduler and Worker startup as an example, where the dataset and network scripts need to be prepared in advance:
kubectl apply -f config/samples/ms_wide_deep_dataparallel.yaml
Use
kubectl get all -o wide
to see the Scheduler and Workers launched in the cluster, as well as the Service corresponding to the Scheduler.
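To follow the training progress, you can also read the logs of an individual Worker pod. The placeholder below stands for a real pod name; use the names reported by kubectl get pods:
kubectl logs -f <worker-pod-name>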
Development Guide
Core Code
pkg/apis/v1/msjob_types.go
is the CRD definition for MSJob.
pkg/controllers/v1/msjob_controller.go
is the core logic of the MSJob controller.
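If you modify the MSJob type definition in msjob_types.go, the generated deepcopy code and CRD manifests usually need to be regenerated before redeployment. Assuming the repository provides the standard kubebuilder Makefile targets (as suggested by make deploy and make run above), this would be:
make generate
make manifests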
Image Creation and Upload
To build and push an image after modifying the ms-operator code, refer to the following commands:
make docker-build IMG={image_name}:{tag}
docker push {image_name}:{tag}
YAML File Configuration Instructions
Taking data-parallel training of a custom network as an example, this section introduces the YAML configuration of an MSJob, including runPolicy, successPolicy, the number of each role, the MindSpore image, and file mounting. Users need to configure these according to their actual needs.
apiVersion: mindspore.gitee.com/v1
kind: MSJob # ms-operator custom CRD type, MSJob
metadata:
  name: ms-widedeep-dataparallel # Task name
spec:
  runPolicy: # RunPolicy encapsulates various runtime policies for the distributed training job, such as how to clean up resources and how long the job can remain active.
    cleanPodPolicy: None # All/Running/None
  successPolicy: AllWorkers # The condition that marks the MSJob as successful. It defaults to blank, which means the default rule is used (success once a single Worker finishes).
  msReplicaSpecs:
    Scheduler:
      replicas: 1 # Number of Schedulers
      restartPolicy: Never # Restart policy: Always, OnFailure, Never
      template:
        spec:
          volumes: # File mounts, such as datasets, network scripts, and so on
            - name: script-data
              hostPath:
                path: /absolute_path
          containers:
            - name: mindspore # Each role must have exactly one container, named mindspore. Configure containerPort to change the default port number (2222); the port name must be set to msjob-port.
              image: mindspore-image-name:tag # MindSpore image
              imagePullPolicy: IfNotPresent
              command: # Command executed after the container starts
                - /bin/bash
                - -c
                - python -s /absolute_path/train_and_eval_distribute.py --device_target="GPU" --epochs=1 --data_path=/absolute_path/criteo_mindrecord --batch_size=16000
              volumeMounts:
                - mountPath: /absolute_path
                  name: script-data
              env: # Configurable environment variables
                - name: GLOG_v
                  value: "1"
    Worker:
      replicas: 4 # Number of Workers
      restartPolicy: Never
      template:
        spec:
          volumes:
            - name: script-data
              hostPath:
                path: /absolute_path
          containers:
            - name: mindspore
              image: mindspore-image-name:tag # MindSpore image
              imagePullPolicy: IfNotPresent
              command:
                - /bin/bash
                - -c
                - python -s /absolute_path/train_and_eval_distribute.py --device_target="GPU" --epochs=1 --data_path=/absolute_path/criteo_mindrecord --batch_size=16000
              volumeMounts:
                - mountPath: /absolute_path
                  name: script-data
              env:
                - name: GLOG_v
                  value: "1"
              resources: # Resource limit configuration
                limits:
                  nvidia.com/gpu: 1
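To submit this configuration and inspect the created job, the following commands can be used; the msjob resource name passed to kubectl is inferred from the CRD kind and may differ in your deployment:
kubectl apply -f config/samples/ms_wide_deep_dataparallel.yaml
kubectl get msjob ms-widedeep-dataparallel -o yaml
kubectl delete msjob ms-widedeep-dataparallel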
Frequently Asked Questions
If gcr.io/distroless/static cannot be pulled during the image build process, see the corresponding issue.
If gcr.io/kubebuilder/kube-rbac-proxy cannot be pulled during installation and deployment, see the corresponding issue.
When launching tasks through K8S on GPU nodes that need NVIDIA graphics cards, you need to install the k8s device plugin, nvidia-docker2, and related components.
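A quick way to confirm that the device plugin has registered GPU resources on a node is to inspect the node's allocatable resources; this is a generic Kubernetes check rather than anything specific to ms-operator:
kubectl describe node <node-name> | grep nvidia.com/gpu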
Do not use underscores in YAML file configuration items.
When k8s is blocked but the cause cannot be determined from the pod log, view the log of the pod creation process via
kubectl logs $(kubectl get statefulset,pods -o wide --all-namespaces | grep ms-operator-system | awk -F" " '{print $2}') -n ms-operator-system
When a task runs in a pod, it executes in the root directory of the launched container, and the files it generates are stored in the root directory by default. If the mapped path is only a subdirectory of the root directory, the generated files will not be mapped and saved to the host. It is recommended to change to the mapped directory before starting the task, so that the files generated during execution are preserved.
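For example, assuming /absolute_path is the directory mounted from the host as in the sample YAML above, the container command can switch into it first so that generated files land on the host; this is only an illustrative variant of the sample command:
command:
  - /bin/bash
  - -c
  - cd /absolute_path && python -s /absolute_path/train_and_eval_distribute.py --device_target="GPU" --epochs=1 --data_path=/absolute_path/criteo_mindrecord --batch_size=16000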
In a disaster recovery scenario, if a bindIP failed error occurs, check whether the persistence file generated by the previous training has been left uncleaned.
Redirecting logs to files directly in the YAML is not recommended. If redirection is required, use different log file names for different pods.
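One possible way to keep per-pod log files distinct is to include the pod's hostname in the redirect target; this is only a sketch based on the sample command and mount path above:
command:
  - /bin/bash
  - -c
  - python -s /absolute_path/train_and_eval_distribute.py --device_target="GPU" --epochs=1 --data_path=/absolute_path/criteo_mindrecord --batch_size=16000 > /absolute_path/train_$(hostname).log 2>&1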