Multi Dimensional
=================

.. image:: https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r1.9/resource/_static/logo_source_en.png
    :target: https://gitee.com/mindspore/docs/blob/r1.9/tutorials/experts/source_en/parallel/multi_dimensional.rst

.. toctree::
  :maxdepth: 1

  pipeline_parallel
  host_device_training
  recompute

As deep learning evolves, models are getting larger and larger. In the
field of NLP, for example, the number of parameters has grown within a
few years from the 100 million of BERT to the 175 billion of GPT-3 and
the 200 billion of PanGu-Alpha, and the industry has even proposed
models at the trillion scale and beyond. The scale of parameters has
thus shown an exponential growth trend in recent years. On the other
hand, with the development of big data and Internet technologies, the
datasets available for model training are also expanding rapidly; in
scenarios such as recommendation and natural language processing,
datasets can reach the terabyte scale.

In the face of large-scale data and large-scale parameters, training on
a single device either takes a very long time or simply cannot be done
because of insufficient device memory. Distributed training technology
therefore needs to be introduced.

Currently, the most commonly used distributed training technique is
data parallelism. Data parallelism splits the training data across
multiple devices; each device holds the same model parameters and
performs the same amount of computation, but processes different data.
During backpropagation, the parameter gradients produced by the devices
are synchronously summed with a global AllReduce. When the dataset is
large and the model is small, such as ResNet50, it is advantageous to
choose data parallelism. However, when the model is large, or when both
the dataset and the model are large, other distributed features need to
be used.
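
As a minimal sketch (the choice to average gradients across devices is
an illustrative assumption), data parallelism can be enabled through
the parallel context:

.. code-block:: python

    import mindspore as ms

    # Pure data parallelism: every device keeps a full replica of the model,
    # and the parameter gradients are summed (and averaged here) across all
    # devices with AllReduce before the optimizer update.
    ms.set_auto_parallel_context(parallel_mode="data_parallel",
                                 gradients_mean=True)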

MindSpore provides the following advanced features to support
distributed training of large models, and users can flexibly combine
them according to their own needs.

Operator Parallel
-----------------

Operator-level parallelism distributes the computation of each operator
by slicing its input tensors across multiple devices. On the one hand,
data samples and model parameters can be split onto multiple devices at
the same time to complete the training of large models. On the other
hand, cluster resources can be fully used for parallel computing to
improve the overall speed.

Users can set the sharding strategy of each operator in the forward
network, and the framework models each operator and its input tensors
according to that strategy, so that the computational logic of the
operator remains mathematically equivalent before and after sharding.
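
For example, the sketch below (the device number and the strategy
values are illustrative assumptions) configures a sharding strategy on
a MatMul primitive so that its computation is split across 8 devices:

.. code-block:: python

    import mindspore as ms
    import mindspore.nn as nn
    import mindspore.ops as ops

    # Semi-auto parallel: the user specifies sharding strategies for key
    # operators and the framework derives the rest.
    ms.set_auto_parallel_context(parallel_mode="semi_auto_parallel",
                                 device_num=8)

    class ShardedMatMul(nn.Cell):
        def __init__(self):
            super().__init__()
            # Slice the rows of x into 2 parts and the columns of w into 4,
            # so each of the 8 devices computes one tile of the product.
            self.matmul = ops.MatMul().shard(((2, 1), (1, 4)))

        def construct(self, x, w):
            return self.matmul(x, w)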

`Pipeline Parallel <https://www.mindspore.cn/tutorials/experts/en/r1.9/parallel/pipeline_parallel.html>`__
------------------------------------------------------------------------------------------------------------

When a cluster contains a large number of devices, using only
operator-level parallelism requires communication across the
communication domain of the entire cluster, which may make
communication inefficient and reduce the overall performance.

Pipeline parallelism splits the neural network into multiple stages,
and each stage runs on a subset of the devices. Collective
communication is then limited to the communication domain of the
devices within a stage, while adjacent stages exchange data through
point-to-point communication.

The advantages of pipeline parallelism are that it improves
communication efficiency and maps naturally onto layered neural network
structures. The disadvantage is that some devices may sit idle at times
(pipeline bubbles).
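
A minimal sketch is shown below; the two-block network, the stage
assignment, and the micro-batch count are illustrative assumptions.

.. code-block:: python

    import mindspore as ms
    import mindspore.nn as nn

    # 8 devices split into 2 pipeline stages (assumed values).
    ms.set_auto_parallel_context(parallel_mode="semi_auto_parallel",
                                 device_num=8, pipeline_stages=2)

    class TwoBlockNet(nn.Cell):
        def __init__(self):
            super().__init__()
            self.block0 = nn.Dense(32, 32)
            self.block1 = nn.Dense(32, 2)

        def construct(self, x):
            return self.block1(self.block0(x))

    net = TwoBlockNet()
    net.block0.pipeline_stage = 0   # runs on the first group of devices
    net.block1.pipeline_stage = 1   # runs on the second group of devices

    # Split each step into 4 micro-batches that flow through the stages,
    # which reduces the idle time (pipeline bubbles) on each stage.
    pipeline_net = nn.PipelineCell(net, micro_size=4)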

Optimizer Parallel
------------------

When training with data parallelism or operator-level parallelism, the
same model parameters may be replicated on multiple devices, so the
optimizer performs redundant computation on each of these devices when
updating such weights. In this case, optimizer parallelism can spread
the optimizer’s computation across the devices. Its advantages are
reduced static memory consumption and less computation in the
optimizer; its disadvantage is increased communication overhead.
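
A minimal sketch of enabling it (the parallel mode and device number
are illustrative assumptions):

.. code-block:: python

    import mindspore as ms

    # With optimizer parallelism enabled, the update of each replicated
    # parameter is sharded across the devices instead of being repeated
    # in full on every one of them.
    ms.set_auto_parallel_context(parallel_mode="semi_auto_parallel",
                                 device_num=8,
                                 enable_parallel_optimizer=True)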

`Host Device Training <https://www.mindspore.cn/tutorials/experts/en/r1.9/parallel/host_device_training.html>`__
------------------------------------------------------------------------------------------------------------------

When training large models, the overall size of the model that can be
trained is limited by the number of devices, because the memory
capacity of each device (accelerator) is limited. To complete
larger-scale model training, you can use the host-device heterogeneous
training mode. It takes advantage of both the large memory on the host
side and the fast computation on the accelerator side, and is an
effective way to reduce the number of devices required for training
super-large models.
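
The sketch below illustrates the idea by keeping a large embedding
table on the host and performing the lookup there; the network, the
table shape, and the primitive_target attribute used for host placement
are assumptions made for illustration.

.. code-block:: python

    import numpy as np
    import mindspore as ms
    import mindspore.nn as nn
    import mindspore.ops as ops

    class HostEmbeddingNet(nn.Cell):
        """Keeps a large embedding table on the host CPU and looks it up
        there, while the rest of the network runs on the accelerator."""

        def __init__(self, vocab_size=100000, embedding_size=80):
            super().__init__()
            self.embedding_table = ms.Parameter(
                ms.Tensor(np.zeros((vocab_size, embedding_size), np.float32)),
                name="embedding_table")
            # Pin the memory-hungry lookup to the host; the attribute name
            # "primitive_target" is an assumption for host placement.
            self.lookup = ops.EmbeddingLookup().add_prim_attr(
                "primitive_target", "CPU")

        def construct(self, indices):
            # Offset 0: the table is not sharded by row in this sketch.
            return self.lookup(self.embedding_table, indices, 0)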

`Recompute <https://www.mindspore.cn/tutorials/experts/en/r1.9/parallel/recompute.html>`__
--------------------------------------------------------------------------------------------

MindSpore automatically derives the backward graph from the forward
computation, and the forward graph and the backward graph together form
the complete computational graph. Some backward operators depend on the
computation results of forward operators, so these forward results must
reside in memory until the corresponding backward operators have been
computed, and the memory they occupy cannot be reused by other
operators. Such long-lived forward results push up the peak memory
footprint of the computation, especially in large-scale network models.
To reduce the memory peak, the recompute technique does not save the
results of the forward activation layers, so that their memory can be
reused, and recomputes those results when the backward pass needs them.
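
A minimal sketch of marking a cell for recomputation (the block itself
is an illustrative assumption):

.. code-block:: python

    import mindspore.nn as nn

    class Block(nn.Cell):
        def __init__(self):
            super().__init__()
            self.dense = nn.Dense(1024, 1024)
            self.relu = nn.ReLU()

        def construct(self, x):
            return self.relu(self.dense(x))

    block = Block()
    # Do not keep this cell's forward activations in memory; recompute them
    # when the backward pass needs them, trading computation for memory.
    block.recompute()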

Description of the Interface Related to the Feature
---------------------------------------------------

+-------------+---------------------------+-----------------------+-----------------------+
| Feature     | Feature interface         | Description           | Function              |
| category    |                           |                       |                       |
+=============+===========================+=======================+=======================+
| Auto-\      | set_auto_parallel\        | Specify the           | Automatic             |
| parallel    | _context(search\_\        | policy search         | parallel allows       |
|             | mode=mode)                | algorithm, with a     | the user to           |
|             |                           | value of type         | configure no or       |
|             |                           | string, and the       | only a few            |
|             |                           | optional value:       | operator sharding     |
|             |                           | 1.                    | strategies, and       |
|             |                           | “sharding_propaga\    | the framework         |
|             |                           | tion”:                | searches for the      |
|             |                           | indicate a policy     | remaining             |
|             |                           | search by using       | strategies.           |
|             |                           | sharding strategy     |                       |
|             |                           | propagation           |                       |
|             |                           | algorithm;2.          |                       |
|             |                           | “dynamic_programm\    |                       |
|             |                           | ing”:                 |                       |
|             |                           | indicate the use      |                       |
|             |                           | of dynamic            |                       |
|             |                           | programming           |                       |
|             |                           | algorithms for        |                       |
|             |                           | policy search;3.      |                       |
|             |                           | “recursive_progra\    |                       |
|             |                           | mming”:               |                       |
|             |                           | indicate the use      |                       |
|             |                           | of a double           |                       |
|             |                           | recursive             |                       |
|             |                           | algorithm for         |                       |
|             |                           | policy search;        |                       |
+-------------+---------------------------+-----------------------+-----------------------+
|             | set_algo_paramete\        | Whether operators     | If the operator       |
|             | rs(fully_use_devi\        | need to be split      | is split into all     |
|             | ces=bool_value)           | across all            | devices, the          |
|             |                           | devices when          | search space can      |
|             |                           | setting up search     | be reduced and        |
|             |                           | policies. Its         | the search speed      |
|             |                           | value is of type      | can be improved,      |
|             |                           | bool, which           | but the search        |
|             |                           | defaults to True.     | strategy is not       |
|             |                           |                       | globally optimal.     |
+-------------+---------------------------+-----------------------+-----------------------+
|             | set_auto_parallel\        | Configure the         | Reduce the number     |
|             | _context(all_redu\        | gradient              | of operations of      |
|             | ce_fusion_config=\        | AllReduce             | the AllReduce         |
|             | config)                   | operator fusion       | communication         |
|             |                           | strategy with a       | operator and          |
|             |                           | value of type         | improve               |
|             |                           | list. For             | communication         |
|             |                           | example: [20,         | efficiency.           |
|             |                           | 35], which means      |                       |
|             |                           | that the first 20     |                       |
|             |                           | AllReduces are        |                       |
|             |                           | fused into 1, the     |                       |
|             |                           | 20th to 35th          |                       |
|             |                           | AllReduce are         |                       |
|             |                           | fused into 1, and     |                       |
|             |                           | the remaining         |                       |
|             |                           | AllReduce are         |                       |
|             |                           | fused into 1.         |                       |
+-------------+---------------------------+-----------------------+-----------------------+
| comm\_\     | set_auto_parallel\        | Set the fusion        | Reduce the number     |
| fusion      | _context(comm_fus\        | configuration of      | of operations of      |
|             | ion=config)               | the communication     | the                   |
|             |                           | operator, and         | AllReduce/AllGath     |
|             |                           | support the           | er/ReduceScatter      |
|             |                           | configuration of      | communication         |
|             |                           | the AllReduce,        | operator and          |
|             |                           | AllGather, and        | improve               |
|             |                           | ReduceScatter         | communication         |
|             |                           | communication         | efficiency.           |
|             |                           | operators             |                       |
|             |                           | currently. Its        |                       |
|             |                           | value is of type      |                       |
|             |                           | dict, such as         |                       |
|             |                           | comm_fusion={“all\    |                       |
|             |                           | reduce”:              |                       |
|             |                           | {“mode”: “auto”,      |                       |
|             |                           | “config”: None}}.     |                       |
|             |                           | There are three       |                       |
|             |                           | options for           |                       |
|             |                           | “mode” among          |                       |
|             |                           | them: “auto”:         |                       |
|             |                           | Automatically         |                       |
|             |                           | perform operator      |                       |
|             |                           | fusion according      |                       |
|             |                           | to the data           |                       |
|             |                           | volume threshold      |                       |
|             |                           | of 64MB, and the      |                       |
|             |                           | configuration         |                       |
|             |                           | parameter             |                       |
|             |                           | “config” is None.     |                       |
|             |                           | “size”:               |                       |
|             |                           | Communicate           |                       |
|             |                           | operator fusion       |                       |
|             |                           | according to the      |                       |
|             |                           | method of             |                       |
|             |                           | manually setting      |                       |
|             |                           | the data volume       |                       |
|             |                           | threshold, and        |                       |
|             |                           | the configuration     |                       |
|             |                           | parameter             |                       |
|             |                           | “config” type is      |                       |
|             |                           | int, with the         |                       |
|             |                           | unit of MB.           |                       |
|             |                           | “index”: Only         |                       |
|             |                           | “allreduce”           |                       |
|             |                           | supports              |                       |
|             |                           | configuring           |                       |
|             |                           | index, which          |                       |
|             |                           | means that the        |                       |
|             |                           | configuration         |                       |
|             |                           | parameter             |                       |
|             |                           | “config” type is      |                       |
|             |                           | list according to     |                       |
|             |                           | the way the           |                       |
|             |                           | sequence number       |                       |
|             |                           | of the                |                       |
|             |                           | communication         |                       |
|             |                           | operator is           |                       |
|             |                           | fused. For            |                       |
|             |                           | example: [20,         |                       |
|             |                           | 35], which means      |                       |
|             |                           | that the first 20     |                       |
|             |                           | AllReduces are        |                       |
|             |                           | fused into 1, the     |                       |
|             |                           | 20th to 35th          |                       |
|             |                           | AllReduce are         |                       |
|             |                           | fused into 1, and     |                       |
|             |                           | the remaining         |                       |
|             |                           | AllReduce are         |                       |
|             |                           | fused into 1.         |                       |
+-------------+---------------------------+-----------------------+-----------------------+
| Dataset     | set_auto_parallel\        | Configure the         | When the number       |
| slicing     | _context(dataset\_\       | sharding policy       | of samples is         |
|             | strategy=config)          | for the dataset.      | smaller than the      |
|             |                           | where config is       | number of cards,      |
|             |                           | Union[str,            | it can be             |
|             |                           | tuple]. When a        | imported in the       |
|             |                           | string is passed      | way of                |
|             |                           | in, there are two     | “full_batch”;         |
|             |                           | options:              | when the number       |
|             |                           | “full_batch”:         | of samples is         |
|             |                           | indicates that        | large and the         |
|             |                           | the dataset is        | model parameters      |
|             |                           | not split,            | of samples is         |
|             |                           | and                   | be imported in        |
|             |                           | “data_parallel”:      | the way of            |
|             |                           | indicates that        | “data_parallel”;      |
|             |                           | the dataset is        | when the data set     |
|             |                           | sliced in             | is                    |
|             |                           | parallel with the     | high-resolution       |
|             |                           | data. When passed     | image data, it        |
|             |                           | in tuple, the         | can be imported       |
|             |                           | content in tuple      | by configuring        |
|             |                           | represents the        | the tuple             |
|             |                           | shard() interface     | sharding              |
|             |                           | of the dataset,       | strategy.             |
|             |                           | similar to the        |                       |
|             |                           | primitive             | high-resolution       |
|             |                           | shard()               |                       |
|             |                           | interface. If         | by configuring        |
|             |                           | this interface is     |                       |
|             |                           | not called, it        |                       |
|             |                           | defaults to the       |                       |
|             |                           | “data_parallel”       |                       |
|             |                           | mode.                 |                       |
+-------------+---------------------------+-----------------------+-----------------------+
| Distributed | infer_predict_lay\        | Use inference         | Obtain the            |
| inference   | out(\*predict_data)       | data to perform       | sharding              |
|             |                           | precompilation,       | information of        |
|             |                           | which outputs the     | the ownership         |
|             |                           | splitting             | weight at the         |
|             |                           | information of        | time of               |
|             |                           | the operator.         | inference.            |
+-------------+---------------------------+-----------------------+-----------------------+
|             | load_distributed\_\       | Load the              | Load distributed      |
|             | checkpoint(network,\      | distributed           | weights for           |
|             | checkpoint_filenames,\    | weights. Each         | distributed           |
|             | predict_strategy=None,\   | machine needs to      | inference.            |
|             | train_strategy\_\         | pre-place the         |                       |
|             | filename=None)            | full amount of        |                       |
|             |                           | ckpt. where           |                       |
|             |                           | network               |                       |
|             |                           | represents the        |                       |
|             |                           | inference             |                       |
|             |                           | network,              |                       |
|             |                           | checkpoint_filena\    |                       |
|             |                           | mes                   |                       |
|             |                           | represents the        |                       |
|             |                           | checkpoint file,      |                       |
|             |                           | predict_strategy      |                       |
|             |                           | is the output of      |                       |
|             |                           | the                   |                       |
|             |                           | infer_predict_lay\    |                       |
|             |                           | out(),                |                       |
|             |                           | and                   |                       |
|             |                           | train_strategy_fi\    |                       |
|             |                           | lename                |                       |
|             |                           | is the operator       |                       |
|             |                           | slicing strategy      |                       |
|             |                           | information saved     |                       |
|             |                           | during training.      |                       |
+-------------+---------------------------+-----------------------+-----------------------+
| Functional  | shard(in_strategy,\       | Set the sharding      | In PyNative mode,     |
| operator    | out_strategy,\            | strategy of the       | specify that a        |
| sharding    | device=“Ascend”,\         | input and output      | cell instance         |
|             | level=0)                  | tensors of the        | executes in graph     |
|             | In Cell class             | cell, and the         | mode, and             |
|             |                           | parallel strategy     | performs              |
|             |                           | of the remaining      | operator-level        |
|             |                           | operators is          | model parallelism     |
|             |                           | propagated by the     | according to the      |
|             |                           | sharding              | specified             |
|             |                           | strategy.             | input/output          |
|             |                           | in_strategy/out_s\    | sharding strategy,    |
|             |                           | trategy               | while the rest of     |
|             |                           | specify the           | the model still       |
|             |                           | sharding policy       | executes in           |
|             |                           | for the               | PyNative mode.        |
|             |                           | input/output          |                       |
|             |                           | tensor. device        |                       |
|             |                           | specifies the         |                       |
|             |                           | execution device,     |                       |
|             |                           | and level             |                       |
|             |                           | specifies the         |                       |
|             |                           | pattern of the        |                       |
|             |                           | sharding policy       |                       |
|             |                           | propagation           |                       |
|             |                           | algorithm.            |                       |
+-------------+---------------------------+-----------------------+-----------------------+
|             | ops.shard(fn,             | The incoming fn       | This usage allows     |
|             | in_strategy,              | is a cell             | you to specify        |
|             | out_strategy,             | instance or           | that a function       |
|             | device=“Ascend”,          | function. The         | performs model        |
|             | level=0)                  | rest of the input     | parallelism at        |
|             |                           | is the same as        | the operator          |
|             |                           | shard, and the        | level, with the       |
|             |                           | return value is a     | same function as      |
|             |                           | function. When        | cell’s shard          |
|             |                           | this function is      | method.               |
|             |                           | called, the           |                       |
|             |                           | operator-level        |                       |
|             |                           | model is executed     |                       |
|             |                           | in graph mode in      |                       |
|             |                           | parallel.             |                       |
+-------------+---------------------------+-----------------------+-----------------------+
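
The following sketch combines several of the interfaces listed above;
the concrete values are illustrative assumptions rather than
recommended settings.

.. code-block:: python

    import mindspore as ms

    ms.set_auto_parallel_context(
        parallel_mode="auto_parallel",
        search_mode="sharding_propagation",     # strategy search algorithm
        dataset_strategy="data_parallel",       # how the dataset is sliced
        comm_fusion={"allreduce": {"mode": "auto", "config": None}})

    # Restrict the strategy search to plans that split operators across
    # all devices, which shrinks the search space.
    ms.set_algo_parameters(fully_use_devices=True)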