Distributed Configuration



Q: What do I do if the error Init plugin so failed, ret = 1343225860 occurs during HCCL distributed training?

A: HCCL fails to initialize. The most likely cause is an incorrect rank table file (rank json). You can generate one with the tool in mindspore/model_zoo/utils/hccl_tools. Alternatively, set the environment variable export ASCEND_SLOG_PRINT_TO_STDOUT=1 to print HCCL logs to stdout and check the log information, as shown in the sketch below.
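Both the rank table path and the logging switch are plain environment variables, so they can be exported in the launch shell or set from Python before communication is initialized. The following is a minimal sketch using the MindSpore 1.x API; the rank table path is a placeholder assumption and should point at the file generated by hccl_tools.

```python
import os
from mindspore import context
from mindspore.communication.management import init

# Assumption: the rank table produced by mindspore/model_zoo/utils/hccl_tools
# was saved at this (hypothetical) path.
os.environ["RANK_TABLE_FILE"] = "/path/to/hccl_8p.json"
# Print HCCL logs to stdout so initialization failures show detailed messages.
os.environ["ASCEND_SLOG_PRINT_TO_STDOUT"] = "1"

context.set_context(mode=context.GRAPH_MODE, device_target="Ascend",
                    device_id=int(os.getenv("DEVICE_ID", "0")))
init()  # HCCL is initialized here; an invalid rank table fails at this call
```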


Q: How do I fix the following error when running MindSpore distributed training on GPU:

Loading libgpu_collective.so failed. Many reasons could cause this:
1.libgpu_collective.so is not installed.
2.nccl is not installed or found.
3.mpi is not installed or found

A: This message means that MindSpore failed to load the library libgpu_collective.so. The possible causes are listed below, followed by a sketch of where the library is loaded:

  • OpenMPI or NCCL is not installed in this environment.

  • NCCL is not updated to v2.7.6: MindSpore v1.1.0 supports the GPU P2P communication operator, which relies on NCCL v2.7.6. libgpu_collective.so cannot be loaded successfully if NCCL has not been updated to this version.
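The library is loaded when GPU collective communication is initialized, so the error surfaces at the init("nccl") call. Below is a minimal sketch assuming the script is launched with mpirun and the MindSpore 1.x API; it only shows where the loading happens, not a full training script.

```python
# Minimal GPU distributed setup, typically launched as:
#   mpirun -n 8 python this_script.py
from mindspore import context
from mindspore.communication.management import init

context.set_context(mode=context.GRAPH_MODE, device_target="GPU")
# libgpu_collective.so is loaded here; if OpenMPI or NCCL (>= 2.7.6 for
# MindSpore v1.1.0) is missing, this call reports the error above.
init("nccl")
```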


Q: A communication configuration file needs to be set up in the Ascend environment. How should it be configured?

A: Please refer to the Configuring Distributed Environment Variables section of the Ascend-based distributed training tutorial in the MindSpore documentation.
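For reference, the communication configuration file is a JSON rank table describing each server and its Ascend devices. The snippet below writes a minimal single-server, two-device table from Python; the field names follow the commonly used rank table v1.0 layout, and all IP addresses, IDs, and the output file name are placeholder assumptions, so check the tutorial for the exact format required by your environment.

```python
import json

# Hypothetical single-server rank table with two Ascend devices.
# Replace the IPs/IDs with the values reported by your own environment.
rank_table = {
    "version": "1.0",
    "server_count": "1",
    "server_list": [
        {
            "server_id": "10.0.0.1",  # host IP (placeholder)
            "device": [
                {"device_id": "0", "device_ip": "192.168.100.101", "rank_id": "0"},
                {"device_id": "1", "device_ip": "192.168.100.102", "rank_id": "1"},
            ],
            "host_nic_ip": "reserve",
        }
    ],
    "status": "completed",
}

with open("hccl_2p.json", "w") as f:
    json.dump(rank_table, f, indent=4)
```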


Q: How do I perform multi-machine, multi-card distributed training?

A: For the Ascend environment, please refer to the Multi-machine Training section of the MindSpore tutorial “distributed_training_ascend”. For the GPU environment, please refer to the Run Multi-Host Script section of the MindSpore tutorial “distributed_training_gpu”.
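Whichever launcher is used (per-device shell scripts with RANK_TABLE_FILE, RANK_ID, and DEVICE_ID exported on Ascend, or mpirun with a hostfile on GPU), every process runs the same training script, and the per-process setup looks roughly like the sketch below. It assumes the MindSpore 1.x API and data-parallel mode; adjust device_target and the parallel mode to your cluster.

```python
from mindspore import context
from mindspore.context import ParallelMode
from mindspore.communication.management import init, get_rank, get_group_size

# device_target is "Ascend" or "GPU" depending on the cluster.
context.set_context(mode=context.GRAPH_MODE, device_target="Ascend")
init()  # uses HCCL on Ascend, NCCL on GPU

# Data-parallel training across all cards on all machines.
context.set_auto_parallel_context(
    parallel_mode=ParallelMode.DATA_PARALLEL,
    gradients_mean=True,
    device_num=get_group_size(),
)
print("rank", get_rank(), "of", get_group_size())
```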