Distributed Configuration
Q: What do I do if the error Init plugin so failed, ret = 1343225860 occurs during HCCL distributed training?
A: HCCL fails to be initialized. The possible cause is that the rank table file (rank json) is incorrect. You can use the tool in mindspore/model_zoo/utils/hccl_tools to generate one. Alternatively, export the environment variable ASCEND_SLOG_PRINT_TO_STDOUT=1 to enable HCCL log printing, and then check the log information.
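As a quick sanity check, the following sketch (file paths and device IDs are placeholders, and the import path of init differs slightly across MindSpore releases) turns on HCCL logging and verifies that the rank table file referenced by RANK_TABLE_FILE exists before initialization:

```python
# Minimal sketch, not the official procedure: enable HCCL logging to stdout and
# confirm the rank table file is reachable before calling init("hccl").
import os

from mindspore import context
from mindspore.communication import init  # mindspore.communication.management.init on older 1.x releases

os.environ["ASCEND_SLOG_PRINT_TO_STDOUT"] = "1"   # print HCCL logs to stdout for debugging

rank_table = os.environ.get("RANK_TABLE_FILE", "")
assert rank_table and os.path.isfile(rank_table), "RANK_TABLE_FILE is unset or points to a missing file"

context.set_context(mode=context.GRAPH_MODE, device_target="Ascend",
                    device_id=int(os.environ.get("DEVICE_ID", "0")))
init("hccl")  # fails with an HCCL init error if the rank table content is malformed
```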
Q: How do I fix the following error when running MindSpore distributed training on GPU:
Loading libgpu_collective.so failed. Many reasons could cause this:
1.libgpu_collective.so is not installed.
2.nccl is not installed or found.
3.mpi is not installed or found
A: This message means that MindSpore failed to load the library libgpu_collective.so. The possible causes are:
OpenMPI or NCCL is not installed in this environment.
The NCCL version has not been updated to v2.7.6: MindSpore v1.1.0 supports the GPU P2P communication operator, which relies on NCCL v2.7.6, so libgpu_collective.so cannot be loaded successfully if NCCL has not been updated to this version.
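After installing OpenMPI and the required NCCL version, a small check like the one below (module paths follow recent MindSpore releases; the script must be launched through mpirun, e.g. mpirun -n 2 python check.py) initializes the NCCL backend and surfaces the loading error early, before any training code runs:

```python
# Minimal sketch, assuming a GPU build of MindSpore with OpenMPI and NCCL installed.
from mindspore import context
from mindspore.communication import init, get_rank

context.set_context(mode=context.GRAPH_MODE, device_target="GPU")
try:
    init("nccl")  # requires OpenMPI and NCCL >= 2.7.6 for MindSpore v1.1.0 and later
    print("NCCL initialized, rank:", get_rank())
except RuntimeError as err:
    # Typical causes: OpenMPI/NCCL not installed, or NCCL older than the required version.
    print("Collective initialization failed:", err)
```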
Q: How should the communication configuration file be configured in the Ascend environment?
A: Please refer to the Configuring Distributed Environment Variables section of the Ascend-based distributed training tutorial in the MindSpore documentation.
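For orientation only, the sketch below writes a two-device, single-server configuration as a Python dict; the key names follow the commonly published rank table v1.0 layout, and every field (as well as the real server and device IPs) should be verified against the tutorial before use:

```python
# Illustrative only: field names and IPs are assumptions; confirm them against the
# "Configuring Distributed Environment Variables" section of the tutorial.
import json

rank_table = {
    "version": "1.0",
    "server_count": "1",
    "server_list": [
        {
            "server_id": "10.0.0.1",  # placeholder host IP
            "device": [
                {"device_id": "0", "device_ip": "192.168.100.101", "rank_id": "0"},
                {"device_id": "1", "device_ip": "192.168.100.102", "rank_id": "1"},
            ],
            "host_nic_ip": "reserve",
        }
    ],
    "status": "completed",
}

with open("rank_table_2pcs.json", "w") as f:
    json.dump(rank_table, f, indent=4)
# Point MindSpore at it, e.g.: export RANK_TABLE_FILE=$(pwd)/rank_table_2pcs.json
```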
Q: How do I perform distributed multi-machine, multi-card training?
A: For the Ascend environment, please refer to the Multi-machine Training section of the MindSpore tutorial “distributed_training_ascend”. For GPU-based environments, please refer to the Run Multi-Host Script section of the MindSpore tutorial “distributed_training_gpu”.
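The tutorials above cover the launch mechanics (a multi-server rank table with per-process environment variables on Ascend, mpirun with a hostfile on GPU). As a hedged illustration, the per-process setup inside the training script is largely the same on either backend; data parallelism and the argument-free init() call below are the common defaults, not the only option:

```python
# Minimal sketch of the per-process distributed setup used by both tutorials.
from mindspore import context
from mindspore.communication import init, get_rank, get_group_size
from mindspore.context import ParallelMode

context.set_context(mode=context.GRAPH_MODE)  # set device_target="Ascend" or "GPU" as appropriate
init()  # picks HCCL on Ascend and NCCL on GPU based on the device target

context.set_auto_parallel_context(parallel_mode=ParallelMode.DATA_PARALLEL,
                                  gradients_mean=True,
                                  device_num=get_group_size())
print("rank", get_rank(), "of", get_group_size())
```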