Distributed Parallel Startup Methods

Startup Method

MindSpore currently supports several startup methods for distributed jobs on GPU, Ascend, and CPU. The four main ones are msrun, dynamic cluster, mpirun, and rank table:

  • msrun: msrun encapsulates the dynamic cluster method. It lets users launch a distributed job with a single command on each node and is available as soon as MindSpore is installed. It does not depend on third-party libraries or configuration files, provides disaster recovery and good security, and supports all three hardware platforms. This is the recommended startup method.

  • Dynamic cluster: the user spawns the processes manually and exports the required environment variables for each one. msrun is built on this mechanism. Use this method for Parameter Server training mode; for other distributed jobs, msrun is recommended.

  • mpirun: this method relies on the open-source library OpenMPI and has a simple startup command. In multi-machine scenarios, password-free login must be configured between every pair of machines. It is recommended for users with OpenMPI experience.

  • rank table: this method requires the Ascend hardware platform and does not rely on third-party libraries. After manually configuring the rank_table file, you can start the parallel program with a script. The script is identical on every machine, which simplifies batch deployment.

Warning

The rank table method will be deprecated in MindSpore 2.4.
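To make the dynamic cluster bullet more concrete, here is a minimal sketch of what "spawn multiple processes and export environment variables" might look like. It only builds the per-process environments and does not launch anything. The variable names (MS_WORKER_NUM, MS_SCHED_HOST, MS_SCHED_PORT, MS_ROLE) and the one-scheduler-plus-workers layout are assumptions drawn from the MindSpore dynamic cluster documentation, not from this page, so check them against your installed version. The helper name dynamic_cluster_envs, the address, and the port are hypothetical.

```python
def dynamic_cluster_envs(worker_num, sched_host, sched_port):
    """Return one environment dict per process: one scheduler, then the workers.

    Assumption: the variable names and role values below follow the MindSpore
    dynamic cluster documentation; they are not stated on this page.
    """
    shared = {
        "MS_WORKER_NUM": str(worker_num),  # total number of worker processes
        "MS_SCHED_HOST": sched_host,       # address of the scheduler process
        "MS_SCHED_PORT": str(sched_port),  # port the scheduler listens on
    }
    roles = ["MS_SCHED"] + ["MS_WORKER"] * worker_num
    return [{**shared, "MS_ROLE": role} for role in roles]


# Hypothetical single-machine job with 4 workers.
envs = dynamic_cluster_envs(worker_num=4, sched_host="127.0.0.1", sched_port=8118)
for env in envs:
    print(env["MS_ROLE"], env["MS_SCHED_HOST"], env["MS_SCHED_PORT"])
```

Each dictionary would be merged into os.environ and passed to one spawned training process. With msrun, this bookkeeping is handled by the single launch command instead.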

The hardware support for the four startup methods is shown in the table below:

Startup method    GPU              Ascend       CPU
msrun             Supported        Supported    Supported
Dynamic cluster   Supported        Supported    Supported
mpirun            Supported        Supported    Not supported
rank table        Not supported    Supported    Not supported
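The support matrix above can also be written as a small lookup, for example in a launch script that needs to check a method against the target hardware. This is an illustrative sketch only: the names SUPPORT and methods_for are hypothetical and are not part of MindSpore.

```python
# Hardware support per startup method, transcribed from the table above.
SUPPORT = {
    "msrun": {"GPU", "Ascend", "CPU"},
    "dynamic cluster": {"GPU", "Ascend", "CPU"},
    "mpirun": {"GPU", "Ascend"},
    "rank table": {"Ascend"},
}


def methods_for(platform):
    """List the startup methods available on the given hardware platform."""
    return [method for method, platforms in SUPPORT.items() if platform in platforms]


print(methods_for("CPU"))     # ['msrun', 'dynamic cluster']
print(methods_for("Ascend"))  # ['msrun', 'dynamic cluster', 'mpirun', 'rank table']
```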