Model-Related
Q: How to deal with the network runtime error "Out of Memory" (OOM)?
A: This error means that the device has run out of memory. It can have several causes, so we recommend checking the following in turn.
- Use the command npu-smi info to verify that the card is used exclusively by your job.
- Use the default yaml configuration for the corresponding network when running it.
- Increase the value of max_device_memory in the network's yaml configuration file. Some memory must be reserved for inter-card communication, so raise the value in small increments.
- Adjust the hybrid parallelism strategy: increase pipeline parallelism (pp) and model parallelism (mp) appropriately and reduce data parallelism (dp) accordingly, keeping dp * mp * pp = device_num. Increase the number of NPUs if necessary.
- Try reducing the batch size or the sequence length.
- Turn on selective recomputation or full recomputation, and turn on optimizer parallelism.
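The constraint dp * mp * pp = device_num can be checked before launching a job. The following is a minimal sketch in plain Python, not part of any framework API; the function names check_parallel_config and candidate_strategies and the max_dp parameter are hypothetical and exist only for illustration.

```python
def check_parallel_config(dp: int, mp: int, pp: int, device_num: int) -> bool:
    """Return True when dp * mp * pp equals the number of devices."""
    return dp * mp * pp == device_num


def candidate_strategies(device_num: int, max_dp: int) -> list:
    """List every (dp, mp, pp) triple whose product is device_num and whose
    dp does not exceed max_dp, sorted in ascending order (smallest dp first).

    This only enumerates valid factorizations; it does not estimate memory.
    """
    divisors = [d for d in range(1, device_num + 1) if device_num % d == 0]
    triples = []
    for dp in divisors:
        if dp > max_dp:
            continue
        for mp in divisors:
            # pp is fully determined once dp and mp are chosen.
            if device_num % (dp * mp) == 0:
                triples.append((dp, mp, device_num // (dp * mp)))
    return sorted(triples)


if __name__ == "__main__":
    # Hypothetical example: 8 NPUs, data parallelism capped at 2.
    for dp, mp, pp in candidate_strategies(8, max_dp=2):
        print(f"dp={dp} mp={mp} pp={pp}")
```

Lowering max_dp restricts the list to strategies that shift more of the work to model and pipeline parallelism, which is the direction suggested above when memory is tight. Whether a given triple actually fits in memory still has to be confirmed by running the network.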
If the problem persists after these checks, please feel free to raise an issue so that it can be investigated further.