{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "547900ed",
   "metadata": {},
   "source": [
    "# 静态图高级编程技巧\n",
    "\n",
    "[![下载Notebook](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r2.3.q1/resource/_static/logo_notebook.svg)](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/r2.3.q1/tutorials/zh_cn/advanced/mindspore_static_graph_expert_programming.ipynb)&emsp;\n",
    "[![下载样例代码](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r2.3.q1/resource/_static/logo_download_code.svg)](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/r2.3.q1/tutorials/zh_cn/advanced/mindspore_static_graph_expert_programming.py)&emsp;\n",
    "[![查看源文件](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r2.3.q1/resource/_static/logo_source.svg)](https://gitee.com/mindspore/docs/blob/r2.3.q1/tutorials/source_zh_cn/advanced/static_graph_expert_programming.ipynb)\n",
    "\n",
    "本篇章介绍一些常用的静态图优化的高级编程技巧，这些技巧能够有效地提高静态图的编译效率以及执行效率，并使程序运行地更加稳定。有关静态图编译的基础介绍，请见[使用静态图加速](https://www.mindspore.cn/tutorials/zh-CN/r2.3.0rc1/beginner/accelerate_with_static_graph.html)。\n",
    "\n",
    "## 如何优化编译性能\n",
    "\n",
    "### 使用lazy_inline装饰器\n",
    "\n",
    "神经网络模型的编译过程往往采用默认inline的方式，把层级的代码表达最终展开成一张扁平的计算图，一方面寻求最大的编译优化机会，另一方面也可以简化自动微分以及执行的逻辑。inline后形成的计算图包含了所有的计算节点，可以在更大的范围内进行优化，比如常量折叠、节点融合、并行分析等，也可以更好地实现内存分配，减少内存申请和性能开销。虽然inline优化对于运行期性能提升帮助非常大，但过度inline也带来了编译期的负担。例如随着计算图节点数量膨胀，执行pass的耗时也在急剧增长。\n",
    "\n",
    "为了减轻inline对编译性能带来的损耗，对于重复调用相同计算单元的场景（典型的场景是在for循环中调用同一个Cell类的不同实例），我们提供了Lazy Inline机制来减少编译时间。\n",
    "\n",
    "#### 大模型pipeline并行场景\n",
    "\n",
    "在大模型场景中，编译耗时问题尤为突出，一是大模型的模型结构层次深，节点数多；二是大模型在训练时，由于启用pipeline并行，导致模型规模和节点数进一步加大，如果原来图的规模是O，那开启pipeline并行，单节点图的规模变为(O/X)*Y，其中X为pipeline的stage数量，Y为micro batch的数量。以盘古13B网络为例，计算图中计算节点数量达到13.5万个，单次编译时长可接近3小时。\n",
    "\n",
    "我们观察到类似盘古的大模型网络结构，是由多层layer组成的，在开启pipeline并行时，各个micro batch的layer层结构是完全一样的。当开启pipeline并行时，`PipelineCell`使用for循环的方式来多次调用相同结构的layer，代码如下所示："
   ]
  },
  {
   "cell_type": "markdown",
   "id": "514c3fe6",
   "metadata": {},
   "source": [
    "```python\n",
    "from mindspore import nn\n",
    "\n",
    "class PipelineCell(nn.Cell):\n",
    "    def __init__(self, network, micro_size):\n",
    "        ...\n",
    "        self.network = network\n",
    "        self.micro_size = micro_size\n",
    "        ...\n",
    "\n",
    "    def construct(self, ...):\n",
    "        ...\n",
    "        for i in range(self.micro_size):\n",
    "            output = self.network(...)\n",
    "        ...\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9080abe3",
   "metadata": {},
   "source": [
    "如果我们把循环体看作被频繁调用的子图，通过把它标记为Lazy Inline，告知编译器推迟inline处理，那么就可以在编译的大部分阶段大幅度减少计算图节点数量，从而获得性能收益。例如上面的代码，可以保留`network`实例的子图结构，不inline或者不提前inline。对此，我们提供了`@lazy_inline`装饰器来实现延迟inline。\n",
    "\n",
    "以Pangu_alpha网络为例，`PipelineCell`函数体中处理的`network`为`PanGUAlphaWithLoss`类的实例，为实现延迟inline，我们需要对`PanGUAlphaWithLoss`类的`__init__`函数加上`@lazy_inline`装饰器，以标记`PanGUAlphaWithLoss`类的子图结构需要被保留下来，不做inline或者延迟inline。如下所示："
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5607ec06",
   "metadata": {},
   "source": [
    "```python\n",
    "from mindspore import nn\n",
    "from mindspore import lazy_inline\n",
    "\n",
    "class PanGUAlphaWithLoss(nn.Cell):\n",
    "    @lazy_inline\n",
    "    def __init__(self, ...):\n",
    "        ...\n",
    "\n",
    "    def construct(self, ...):\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8f0309ef",
   "metadata": {},
   "source": [
    "> 完整代码可以参考：[Pangu_alpha](https://gitee.com/mindspore/models/tree/r2.3/official/nlp/Pangu_alpha)\n",
    "\n",
    "还是以盘古13B网络为例，应用Lazy Inline方案后，计算图编译规模从13万+节点下降到2万+个节点，编译时间从3个小时下降到20分钟。\n",
    "\n",
    "#### 更加泛化的一般场景\n",
    "\n",
    "`@lazy_inline`是`Cell::__init__`的装饰器，它会以`__init__`的所有参数生成Cell的`cell_init_args`属性值，`cell_init_args`值相同表明Cell类名和初始化参数值是一样的。而对于相同Cell类的实例，它们的weights还可能是不一样的，因此对于用`construct(self, x)`定义的网络结构，在实际编译时我们可以转换为`construct(x, self.cell_init_args, self.trainable_parameters())`。对于同一个Cell类的不同实例，如果`cell_init_args`是相同的，那么这两个实例可以复用同一个网络结构，如下所示："
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9c3f3655",
   "metadata": {},
   "source": [
    "```python\n",
    "def construct(self, x)\n",
    "    reuse_construct(x, self.trainable_parameters())\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "efa1a2c5",
   "metadata": {},
   "source": [
    "引入可复用计算图后，具有相同`cell_init_args`的Cell实例只需编译解析一次。所以对于更加泛化的调用同一个Cell类的不同实例的场景，只要`cell_init_args`是相同的，我们都可以加上`@lazy_inline`装饰器来加速编译。例如GPT网络："
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ef0fbd8c",
   "metadata": {},
   "source": [
    "```python\n",
    "from mindspore import nn\n",
    "from mindspore import lazy_inline\n",
    "\n",
    "class Block(nn.Cell):\n",
    "    @lazy_inline\n",
    "    def __init__(self, config):\n",
    "        ...\n",
    "\n",
    "    def construct(self, x, attention_mask, layer_past):\n",
    "        ...\n",
    "\n",
    "class GPT_Model(nn.Cell):\n",
    "    def __init__(self, config):\n",
    "        ...\n",
    "        for i in range(config.num_layers):\n",
    "            self.blocks.append(Block(config))\n",
    "            ...\n",
    "        self.num_layers = config.num_layers\n",
    "\n",
    "    def construct(self, input_ids, input_mask, layer_past):\n",
    "        ...\n",
    "        present_layer = ()\n",
    "        for i in range(self.num_layers):\n",
    "            hidden_states, present = self.blocks[i](...)\n",
    "            present_layer = present_layer + (present,)\n",
    "        ...\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a0184b17",
   "metadata": {},
   "source": [
    "> 完整代码可以参考：[GPT](https://gitee.com/mindspore/models/tree/r2.3/official/nlp/GPT)\n",
    "\n",
    "GPT的网络结构由多层`Block`类的不同实例构成，这些`Block`的初始化参数都是同一个`config`，所以加上`@lazy_inline`装饰器后，这些`Block`实例都可以复用同一个网络结构，而且在大部分的编译阶段都不进行inline，从而可以大幅度减少编译时间。\n",
    "\n",
    "#### 使用步骤\n",
    "\n",
    "如上面的例子，在网络脚本中，往需要延迟inline和复用子图结构的Cell类的`__init__`函数加上`@lazy_inline`装饰器。\n",
    "\n",
    "#### 使用限制\n",
    "\n",
    "1. Cell 是以Cell的类名和`__init__`参数值生成Cell实例标识的，这是基于`__init__`的参数确定Cell 的所有属性，以及`construct`构图开始时的Cell属性和`__init__`执行完的属性一致为假设前提，因此Cell与构图有关的属性，在`__init__`执行完后不能进行更改。例如：\n",
    "\n",
    "   ```python\n",
    "   from mindspore import nn\n",
    "   from mindspore import lazy_inline\n",
    "\n",
    "   class Block(nn.Cell):\n",
    "       @lazy_inline\n",
    "       def __init__(self, ...):\n",
    "           self.x = 0\n",
    "           ...\n",
    "       def construct(self, ...):\n",
    "        if self.x == 0:\n",
    "             ...\n",
    "        else:\n",
    "           ...\n",
    "           ...\n",
    "   class Model(nn.Cell):\n",
    "       def __init__(self, ...):\n",
    "           ...\n",
    "           self.num_layers = 10\n",
    "           for i in range(self.num_layers):\n",
    "               self.blocks.append(Block(...)) # 此处Block进行初始化\n",
    "               ...\n",
    "           self.blocks[0].x = 1               # 此处在Block初始化后修改Block的属性，会导致该Block无法复用同一份子图\n",
    "\n",
    "       def construct(self, ...):\n",
    "            ...\n",
    "            for i in range(self.num_layers):\n",
    "               res = self.blocks[i](...)\n",
    "           ...\n",
    "   ```\n",
    "\n",
    "如上代码所示，网络Model中的某个`Block`实例，它的属性`x`在该实例初始化后被修改了，那么这个`Block`实例就无法准确复用同一个子图结构了。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "00c363fd",
   "metadata": {},
   "source": [
    "2. 一个Cell类的网络结构包含多个Cell_X类的实例，同时每个Cell_X类的网络结构又包含多个Cell_Y的实例的场景，如果往Cell_X和Cell_Y类的`__init__`函数上都加上`@lazy_inline`，那么只有最外层的Cell_X实例的网络结构被编译成可复用的计算图且被延迟inline，内层的Cell_Y实例的计算图还是会被inline。例如：\n",
    "\n",
    "   ```python\n",
    "   from mindspore import nn\n",
    "   from mindspore import lazy_inline\n",
    "\n",
    "   class InnerBlock(nn.Cell):\n",
    "       @lazy_inline             # InnerBlock不会被延迟inline\n",
    "       def __init__(self, ...):\n",
    "           ...\n",
    "       def construct(self, ...):\n",
    "           ...\n",
    "   class OuterBlock(nn.Cell):\n",
    "       @lazy_inline             # OuterBlock将会被延迟inline\n",
    "       def __init__(self, ...):\n",
    "           ...\n",
    "           self.num_layers = 10\n",
    "           for i in range(self.num_layers):\n",
    "               self.blocks.append(InnerBlock(...))\n",
    "       def construct(self, ...):\n",
    "            ...\n",
    "            for i in range(self.num_layers):\n",
    "              res = self.blocks[i](...)\n",
    "            ...\n",
    "   class Model(nn.Cell):\n",
    "       def __init__(self, ...):\n",
    "           ...\n",
    "           self.num_layers = 10\n",
    "           for i in range(self.num_layers):\n",
    "               self.blocks.append(OuterBlock(...))\n",
    "       def construct(self, ...):\n",
    "         ...\n",
    "            for i in range(self.num_layers):\n",
    "              res = self.blocks[i](...)\n",
    "           ...\n",
    "   ```\n",
    "\n",
    "后续有计划支持这种多层级的Lazy Inline机制。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8e9dd7b8",
   "metadata": {},
   "source": [
    "### 使用HyperMap\n",
    "\n",
    "使用场景：使用HyperMap替换for循环来优化编译性能。\n",
    "\n",
    "`HyperMap`是一个特殊的类，类对象构造时需要传入映射函数f，调用对象时需要传入f的n个参数序列，更多使用方法见：[HyperMap](https://www.mindspore.cn/docs/zh-CN/r2.3.0rc1/api_python/ops/mindspore.ops.HyperMap.html)。映射函数f必须是`MultitypeFuncGraph`类型, 可参考[MultitypeFuncGraph](https://www.mindspore.cn/docs/zh-CN/r2.3.0rc1/api_python/ops/mindspore.ops.MultitypeFuncGraph.html)。在使用for循环批量处理列表元素时，可以通过`HyperMap`等价语义替换来优化网络编译性能。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "88b51f82",
   "metadata": {},
   "source": [
    "### 使用编译缓存\n",
    "\n",
    "使用场景：在进行训练或者推理时，如果编译依赖的文件未作任何变更，通过使用编译缓存来缩短编译时间。\n",
    "\n",
    "编译缓存的本质是存储了网络模型的编译中间过程文件，当网络模型不变时，生产的编译中间过程文件也是一样的，因此可以复用上一次编程产生的中间过程文件。\n",
    "\n",
    "通过设置context中的[enable_compile_cache](https://www.mindspore.cn/docs/zh-CN/r2.3.0rc1/api_python/mindspore/mindspore.set_context.html?highlight=enable_compile_cache)或环境变量[MS_COMPILER_CACHE_ENABLE](https://www.mindspore.cn/docs/zh-CN/r2.3.0rc1/note/env_var_list.html?highlight=MS_COMPILER_CACHE_ENABLE)，可以指定是否保存和加载编译缓存，前者优先级更高。\n",
    "\n",
    "通过设置context中的[compile_cache_path](https://www.mindspore.cn/docs/zh-CN/r2.3.0rc1/api_python/mindspore/mindspore.set_context.html?highlight=compile_cache_path)或环境变量[MS_COMPILER_CACHE_PATH](https://www.mindspore.cn/docs/zh-CN/r2.3.0rc1/note/env_var_list.html?highlight=MS_COMPILER_CACHE_PATH)，可以指定MindSpore编译缓存目录，用于存储图和算子编译过程生成的缓存文件，前者优先级更高。\n",
    "\n",
    "一个通过使能编译缓存来优化编译性能的代码样例如下："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "8e98fb56",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Disable comile_cache cost time: 0.5485098361968994\n"
     ]
    }
   ],
   "source": [
    "import time\n",
    "from mindspore import set_context\n",
    "from mindspore import dtype\n",
    "import mindspore as ms\n",
    "\n",
    "@ms.jit\n",
    "def func(input_x, input_y):\n",
    "    output = input_x\n",
    "    for _ in range(200):\n",
    "        output = input_x + input_x * input_y + output\n",
    "    return output\n",
    "\n",
    "set_context(enable_compile_cache=False)\n",
    "x = ms.Tensor([1], dtype.float32)\n",
    "y = ms.Tensor([2], dtype.float32)\n",
    "start_time = time.time()\n",
    "out = func(x, y)\n",
    "end_time = time.time()\n",
    "print(\"Disable comile_cache cost time:\", end_time - start_time)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8e66fb52",
   "metadata": {},
   "source": [
    "上述测试样例是关闭编译缓存状态，执行上述测试样例2次，第1次耗时和第2次耗时如下（实际耗时与硬件环境有关，以下数据仅供参考）：\n",
    "\n",
    "```text\n",
    "Disable comile_cache cost time: 0.5485098361968994\n",
    "Disable comile_cache cost time: 0.4614279270172119\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b3805561",
   "metadata": {},
   "source": [
    "可以看到，关闭编译缓存时，第2次执行样例比第1次耗时少一些，这是因为算子编译缓存是默认开启的，第2次执行样例能够利用前一次的算子编译缓存。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "278e2ca1",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Enable comile_cache cost time: 0.6357541084289551\n"
     ]
    }
   ],
   "source": [
    "import time\n",
    "from mindspore import set_context\n",
    "from mindspore import dtype\n",
    "import mindspore as ms\n",
    "\n",
    "@ms.jit\n",
    "def func(input_x, input_y):\n",
    "    output = input_x\n",
    "    for _ in range(200):\n",
    "        output = input_x + input_x * input_y + output\n",
    "    return output\n",
    "\n",
    "set_context(enable_compile_cache=True, compile_cache_path=\"my_compile_cache\")\n",
    "x = ms.Tensor([1], dtype.float32)\n",
    "y = ms.Tensor([2], dtype.float32)\n",
    "start_time = time.time()\n",
    "out = func(x, y)\n",
    "end_time = time.time()\n",
    "print(\"Enable comile_cache cost time:\", end_time - start_time)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0a76213d",
   "metadata": {},
   "source": [
    "上述测试样例是开启编译缓存状态，执行上述测试样例2次，第1次耗时和第2次耗时如下（实际耗时与硬件环境有关，以下数据仅供参考）：\n",
    "\n",
    "```text\n",
    "Enable comile_cache cost time: 0.6357541084289551\n",
    "Enable comile_cache cost time: 0.09379792213439941\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b5655959",
   "metadata": {},
   "source": [
    "可以看到，开启编译缓存时，第2次执行样例耗时只有第一次执行耗时的1/7左右。\n",
    "\n",
    "## 如何优化执行性能\n",
    "\n",
    "### 使用jit_class\n",
    "\n",
    "使用场景：使用`@jit_class`装饰器修饰自定义类，提高执行性能。jit_class应用于静态图模式，在动态图模式下，`@jit_class`会被忽略，不影响动态图模式的执行逻辑。\n",
    "\n",
    "#### jit_class的介绍\n",
    "\n",
    "用户在网络脚本中定义一个类时，可以写成继承于`Cell`的类、自定义类、`@jit_class`修饰的类，它们的用法和区别如下：\n",
    "\n",
    "- 继承于Cell的类\n",
    "\n",
    "  Cell是MindSpore中神经网络的基本构成单元，模型或者神经网络层应当继承该类。静态图模式下，使用`Cell`类并且在`construct`函数中编写执行代码，此时`construct`函数的代码会被编译成静态计算图。\n",
    "\n",
    "- 自定义类\n",
    "\n",
    "  定义自定义类后，可以对类进行实例化、调用类对象的属性和方法，请参考[自定义类的使用](https://www.mindspore.cn/docs/zh-CN/r2.3.0rc1/note/static_graph_syntax_support.html#支持自定义类的使用)。相比于`Cell`的类定义，自定义类更贴近用户调用Python类的使用习惯。自定义类在静态图模式下的实现方式与`Cell`不同，例如，调用自定义类对象的函数方法时，其函数方法中的代码不会被编译成静态计算图，而是通过Python解释器进行解释执行。\n",
    "\n",
    "- `@jit_class`修饰的类\n",
    "\n",
    "  为了兼顾用户的Python使用习惯和静态图编译带来的性能优势，提供了`@jit_class`装饰器。给自定义类修饰`@jit_class`装饰器后，该类的函数代码会被编译成静态计算图，基于图优化、静态图整图下沉等技术，编译器可以针对计算图进行全局的优化，从而获得较好的执行性能。\n",
    "\n",
    "在静态图模式下，通过使用`@jit_class`修饰自定义类，用户可以创建、调用该类的实例，并且可以获取其属性和方法。\n",
    "\n",
    "#### jit_class装饰器的使用\n",
    "\n",
    "jit_class装饰器仅支持修饰自定义类，不支持修饰继承于`Cell`的类。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "c0234770",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[1 2 3]\n"
     ]
    }
   ],
   "source": [
    "import numpy as np\n",
    "import mindspore.nn as nn\n",
    "import mindspore as ms\n",
    "\n",
    "@ms.jit_class\n",
    "class InnerNet:\n",
    "    value = ms.Tensor(np.array([1, 2, 3]))\n",
    "\n",
    "class Net(nn.Cell):\n",
    "    def construct(self):\n",
    "        return InnerNet().value\n",
    "\n",
    "ms.set_context(mode=ms.GRAPH_MODE)\n",
    "net = Net()\n",
    "out = net()\n",
    "print(out)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fe681740",
   "metadata": {},
   "source": [
    "如果jit_class修饰继承于`Cell`的类，将会报错。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "106d6e6a",
   "metadata": {},
   "source": [
    "```python\n",
    "import mindspore.nn as nn\n",
    "import mindspore as ms\n",
    "\n",
    "@ms.jit_class\n",
    "class Net(nn.Cell):\n",
    "    def construct(self, x):\n",
    "        return x\n",
    "\n",
    "ms.set_context(mode=ms.GRAPH_MODE)\n",
    "x = ms.Tensor(1)\n",
    "net = Net()\n",
    "net(x)\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e566051b",
   "metadata": {},
   "source": [
    "报错信息如下："
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0d633c29",
   "metadata": {},
   "source": [
    "```text\n",
    "TypeError: Decorator jit_class is used for user-defined classes and cannot be used for nn.Cell: Net<>.\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6163d8ce",
   "metadata": {},
   "source": [
    "jit_class支持自定义类嵌套使用、自定义类与`Cell`嵌套使用的场景。需要注意的是，类继承时，如果父类使用了jit_class，子类也会具有jit_class的能力。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "32579fd0",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[1 2 3]\n"
     ]
    }
   ],
   "source": [
    "import numpy as np\n",
    "import mindspore.nn as nn\n",
    "import mindspore as ms\n",
    "\n",
    "@ms.jit_class\n",
    "class Inner:\n",
    "    def __init__(self):\n",
    "        self.value = ms.Tensor(np.array([1, 2, 3]))\n",
    "\n",
    "@ms.jit_class\n",
    "class InnerNet:\n",
    "    def __init__(self):\n",
    "        self.inner = Inner()\n",
    "\n",
    "class Net(nn.Cell):\n",
    "    def __init__(self):\n",
    "        super(Net, self).__init__()\n",
    "        self.inner_net = InnerNet()\n",
    "\n",
    "    def construct(self):\n",
    "        out = self.inner_net.inner.value\n",
    "        return out\n",
    "\n",
    "ms.set_context(mode=ms.GRAPH_MODE)\n",
    "net = Net()\n",
    "out = net()\n",
    "print(out)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9e99952c",
   "metadata": {},
   "source": [
    "#### 获取类的属性和方法\n",
    "\n",
    "支持通过类名或类实例调用属性和方法。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "c5c1a327",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "12\n"
     ]
    }
   ],
   "source": [
    "import mindspore.nn as nn\n",
    "import mindspore as ms\n",
    "\n",
    "@ms.jit_class\n",
    "class InnerNet:\n",
    "    def __init__(self, val):\n",
    "        self.number = val\n",
    "\n",
    "    def act(self, x, y):\n",
    "        return self.number * (x + y)\n",
    "\n",
    "class Net(nn.Cell):\n",
    "    def __init__(self):\n",
    "        super(Net, self).__init__()\n",
    "        self.inner_net = InnerNet(2)\n",
    "\n",
    "    def construct(self, x, y):\n",
    "        return self.inner_net.number + self.inner_net.act(x, y)\n",
    "\n",
    "ms.set_context(mode=ms.GRAPH_MODE)\n",
    "x = ms.Tensor(2, dtype=ms.int32)\n",
    "y = ms.Tensor(3, dtype=ms.int32)\n",
    "net = Net()\n",
    "out = net(x, y)\n",
    "print(out)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "42eeec97",
   "metadata": {},
   "source": [
    "#### 创建类的实例\n",
    "\n",
    "对于将会被编译成静态计算图的函数，如`Cell`的`construct`函数、`@jit`修饰的函数或前两者调用的子函数，如果需要在函数内创建`@jit_class`所修饰的类的实例，参数要求为常量。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "6697edd4",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "5\n"
     ]
    }
   ],
   "source": [
    "import numpy as np\n",
    "import mindspore.nn as nn\n",
    "import mindspore as ms\n",
    "\n",
    "@ms.jit_class\n",
    "class InnerNet:\n",
    "    def __init__(self, val):\n",
    "        self.number = val + 3\n",
    "\n",
    "class Net(nn.Cell):\n",
    "    def construct(self):\n",
    "        net = InnerNet(2)\n",
    "        return net.number\n",
    "\n",
    "ms.set_context(mode=ms.GRAPH_MODE)\n",
    "net = Net()\n",
    "out = net()\n",
    "print(out)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b5cabba3",
   "metadata": {},
   "source": [
    "#### 调用类的实例\n",
    "\n",
    "调用`@jit_class`所修饰的类的实例时，将会调用该类的`__call__`函数方法。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "f0a25831",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "10\n"
     ]
    }
   ],
   "source": [
    "import numpy as np\n",
    "import mindspore.nn as nn\n",
    "import mindspore as ms\n",
    "\n",
    "@ms.jit_class\n",
    "class InnerNet:\n",
    "    def __init__(self, number):\n",
    "        self.number = number\n",
    "\n",
    "    def __call__(self, x, y):\n",
    "        return self.number * (x + y)\n",
    "\n",
    "class Net(nn.Cell):\n",
    "    def construct(self, x, y):\n",
    "        net = InnerNet(2)\n",
    "        out = net(x, y)\n",
    "        return out\n",
    "\n",
    "ms.set_context(mode=ms.GRAPH_MODE)\n",
    "x = ms.Tensor(2, dtype=ms.int32)\n",
    "y = ms.Tensor(3, dtype=ms.int32)\n",
    "net = Net()\n",
    "out = net(x, y)\n",
    "print(out)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c6e384c9",
   "metadata": {},
   "source": [
    "如果该类没有定义`__call__`函数，将会报错提示。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "592c8d0c",
   "metadata": {},
   "source": [
    "```python\n",
    "import numpy as np\n",
    "import mindspore.nn as nn\n",
    "import mindspore as ms\n",
    "\n",
    "@ms.jit_class\n",
    "class InnerNet:\n",
    "    def __init__(self, number):\n",
    "        self.number = number\n",
    "\n",
    "class Net(nn.Cell):\n",
    "    def construct(self, x, y):\n",
    "        net = InnerNet(2)\n",
    "        out = net(x, y)\n",
    "        return out\n",
    "\n",
    "ms.set_context(mode=ms.GRAPH_MODE)\n",
    "x = ms.Tensor(2, dtype=ms.int32)\n",
    "y = ms.Tensor(3, dtype=ms.int32)\n",
    "net = Net()\n",
    "out = net(x, y)\n",
    "print(out)\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "29c4b91f",
   "metadata": {},
   "source": [
    "报错信息如下："
   ]
  },
  {
   "cell_type": "markdown",
   "id": "48212c94",
   "metadata": {},
   "source": [
    "```text\n",
    "ValueError: MsClassObject: 'InnerNet' has no __call__ function, please check the code.\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "78c3a5b9",
   "metadata": {},
   "source": [
    "### 使用select算子\n",
    "\n",
    "使用场景：`Select`算子来替代if控制流语句，减少静态图子图生成，提高执行性能（也可以提高编译性能）。\n",
    "\n",
    "编写网络时，会经常使用到if语句，如果if语句的条件是变量条件，每个if语句都会产生额外的子图。在静态图模式下，子图数量越多，编译耗时越久，因此部分场景可以通过`Select`算子等价替换if语句来优化编译性能。\n",
    "\n",
    "需要注意的是，使用`Select`算子替换if语句会影响网络的运行性能。一方面，`Select`算子会同时执行true分支和false分支，而if语句只执行其一个分支，因此使用if运行耗时相比使用`Select`算子耗时减少；另一方面，`Select`算子性能优于if语句产生的控制流算子，使用if运行耗时相比使用`Select`算子运行耗时增加。综合上述两种因素，最终运行性能变化情况需要结合实际情况判断。一般来讲，当分支中算子数量较少，建议使用`Select`算子；当分支中算子数量较多，建议使用if语句。\n",
    "\n",
    "一个使用`Select`算子替代if语句来优化编译性能的代码样例如下："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "50935382",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "if net cost time: 2.1914021968841553\n",
      "select net cost time: 0.7116048336029053\n"
     ]
    }
   ],
   "source": [
    "import time\n",
    "from mindspore import ops\n",
    "import mindspore as ms\n",
    "\n",
    "@ms.jit\n",
    "def if_net(x, y):\n",
    "    out = 0\n",
    "    for _ in range(100):\n",
    "        if x < y:\n",
    "            x = x - y\n",
    "        else:\n",
    "            x = x + y\n",
    "        out = out + x\n",
    "    return out\n",
    "\n",
    "start_time = time.time()\n",
    "out = if_net(ms.Tensor([0]), ms.Tensor([1]))\n",
    "end_time = time.time()\n",
    "print(\"if net cost time:\", end_time - start_time)\n",
    "\n",
    "@ms.jit\n",
    "def select_net(x, y):\n",
    "    out = x\n",
    "    for _ in range(100):\n",
    "        cond = x < y\n",
    "        x = ops.select(cond, x - y, x + y)\n",
    "        out = out + x\n",
    "    return out\n",
    "\n",
    "start_time = time.time()\n",
    "out = select_net(ms.Tensor([0]), ms.Tensor([1]))\n",
    "end_time = time.time()\n",
    "print(\"select net cost time:\", end_time - start_time)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "45095e4e",
   "metadata": {},
   "source": [
    "### 使用Vmap进行批处理\n",
    "\n",
    "使用场景：在处理无依赖关系的批量数据且相关的算子支持Vmap功能时，可以使用Vmap替代for循环处理批量数据来优化执行性能（也可以提高编译性能）。\n",
    "\n",
    "MindSpore已支持Vmap特性，Vmap的详细介绍可参考[自动向量化Vmap](https://www.mindspore.cn/tutorials/experts/zh-CN/r2.3.0rc1/vmap/vmap.html)。\n",
    "\n",
    "一个使用Vmap替换for循环处理批量数据来优化编译性能的代码样例如下：\n",
    "\n",
    "代码的运行结果如下（实际耗时与硬件环境有关，以下数据仅供参考）："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "b89d6496",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Vmap cost time: 0.05766916275024414\n",
      "for loop cost time: 1.9284062385559082\n"
     ]
    }
   ],
   "source": [
    "import numpy as np\n",
    "import time\n",
    "from mindspore import ops, vmap\n",
    "import mindspore as ms\n",
    "\n",
    "def hswish_func(x):\n",
    "    return ops.HSwish()(x)\n",
    "\n",
    "@ms.jit\n",
    "def manually_batched(xs):\n",
    "    output = []\n",
    "    for i in range(xs.shape[0]):\n",
    "        output.append(hswish_func(xs[i]))\n",
    "    return ops.stack(output)\n",
    "\n",
    "shape = (100, 2)\n",
    "prop = 100\n",
    "x_np = (np.random.randn(*shape) * prop).astype(np.float32)\n",
    "x = ms.Tensor(x_np)\n",
    "x = ops.sub(x, 0)\n",
    "\n",
    "start_time = time.time()\n",
    "output_vmap = vmap(hswish_func, in_axes=(0,))(x)\n",
    "end_time = time.time()\n",
    "print(\"Vmap cost time:\", end_time - start_time)\n",
    "\n",
    "start_time = time.time()\n",
    "output_manually = manually_batched(x)\n",
    "end_time = time.time()\n",
    "print(\"for loop cost time:\", end_time - start_time)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6ded0e87",
   "metadata": {},
   "source": [
    "上述样例中，相当于需要批量处理100组Tensor数据，可以看到使用Vmap处理的性能超过使用for循环处理性能的30倍。\n",
    "\n",
    "## 依赖控制保证执行序\n",
    "\n",
    "如果函数的运行结果依赖或影响外部状态，我们认为该函数具有副作用，比如函数会改变外部全局变量、函数的结果依赖全局变量的值。如果算子会改变输入参数的值或者算子的输出依赖全局参数的值，我们认为这是带副作用的算子。\n",
    "\n",
    "根据内存属性和IO状态，将副作用划分为内存副作用和IO副作用。当前内存副作用主要有Assign、优化器算子等等，IO副作用主要有Print算子。详细可以查看算子定义，内存副作用算子在定义中有side_effect_mem属性，IO副作用算子在定义中有side_effect_io属性。\n",
    "\n",
    "Depend用于处理依赖项操作。在大多数情况下，如果操作符有IO副作用或内存副作用，则将根据用户的语义执行它们，不需要另外使用Depend算子来保证执行顺序。在某些情况下，如果两个运算符A和B没有顺序依赖关系，并且A必须在B之前执行，我们建议使用Depend指定它们的执行顺序。使用方法如下："
   ]
  },
  {
   "cell_type": "markdown",
   "id": "966360dc",
   "metadata": {},
   "source": [
    "```python\n",
    "a = A(x)\n",
    "b = B(y)\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9fe66617",
   "metadata": {},
   "source": [
    "在插入Depend算子后，如下："
   ]
  },
  {
   "cell_type": "markdown",
   "id": "60817470",
   "metadata": {},
   "source": [
    "```python\n",
    "a = A(x)\n",
    "y = Depend(y, a)\n",
    "b = B(y)\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ec290f5b",
   "metadata": {},
   "source": [
    "值得说明的是，用于浮点数溢出状态检测的一组特殊算子它们存在隐含副作用，但又不属于IO副作用或内存副作用。此外，使用时还有严格的顺序要求，即：在使用NPUClearFloatStatus算子前需要保证NPUAllocFloatStatus已经执行，使用NPUGetFloatStatus算子前需要保证NPUClearFloatStatus已经执行。因为这些算子使用较少，目前的方案是保持它们的定义为无副作用形式，以Depend确保执行顺序。注意：此类浮点数溢出状态检测的算子仅在Ascend平台支持。如下："
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7043dde3",
   "metadata": {},
   "source": [
    "```python\n",
    "import numpy as np\n",
    "import mindspore as ms\n",
    "import mindspore.nn as nn\n",
    "from mindspore import ops, set_context, Tensor\n",
    "from mindspore import dtype as mstype\n",
    "\n",
    "set_context(mode=ms.GRAPH_MODE, device_target=\"Ascend\")\n",
    "\n",
    "class Net(nn.Cell):\n",
    "    def __init__(self):\n",
    "        super().__init__()\n",
    "        self.alloc_status = ops.NPUAllocFloatStatus()\n",
    "        self.get_status = ops.NPUGetFloatStatus()\n",
    "        self.clear_status = ops.NPUClearFloatStatus()\n",
    "\n",
    "    def construct(self, x):\n",
    "        init = self.alloc_status()\n",
    "        clear_status = self.clear_status(init)\n",
    "        x = ops.Depend()(x, clear_status)\n",
    "        res = ops.sub(x, ops.neg(x))\n",
    "        init = ops.Depend()(init, res)\n",
    "        get_status = self.get_status(init)\n",
    "        res = ops.Depend()(res, get_status)\n",
    "        return res\n",
    "\n",
    "value = 5\n",
    "data = np.full((2, 3), value, dtype=np.float16)\n",
    "x = Tensor(data, dtype=mstype.float16)\n",
    "net = Net()\n",
    "res = net(x)\n",
    "print(res)\n",
    "```\n",
    "\n",
    "```text\n",
    "[[10. 10. 10.]\n",
    " [10. 10. 10.]]\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d14746ba",
   "metadata": {},
   "source": [
    "## 优化冗余显存拷贝操作\n",
    "\n",
    "在函数式编程中，通过参数和返回值之外的渠道和外界存在数据交换的函数，被称为非纯函数，被认为是存在副作用的。在MindSpore框架内部，针对副作用的问题会插入Load算子，该算子属于虚拟算子，不需要在后端执行，不占用显存，仅用于表示需要读取全局变量的值。在图模式下，需要编译完整个图之后才将图中的各个算子下发到后端执行，使用Load算子多次读取全局变量，而不是多次使用真实算子多次保存全局变量的值，这样可以减少显存的消耗。\n",
    "\n",
    "但是，全局变量的值可能是变化的，如果没有真实算子保存值，某些场景下会存在精度问题。针对这种情况，MindSpore框架内部会插入真实算子，占用一定的显存来保存全局变量的值，从而避免出现精度问题。\n",
    "\n",
    "我们提供了MS_DEV_SIDE_EFFECT_LOAD_ELIM开关来优化显存占用的程度，即设置export MS_DEV_SIDE_EFFECT_LOAD_ELIM=0/1/2/3。\n",
    "\n",
    "- 当将MS_DEV_SIDE_EFFECT_LOAD_ELIM设置为0时，表示对框架内部的Load算子都插入真实算子，即占用显存最多，保证网络精度没有问题。\n",
    "- 当将MS_DEV_SIDE_EFFECT_LOAD_ELIM设置为1或者没有设置值时（即默认模式），表示对框架内部的Load算子可能出现精度问题的场景保守地插入真实算子，保证网络精度没有问题。\n",
    "- 当将MS_DEV_SIDE_EFFECT_LOAD_ELIM设置为2，在损耗一定编译性能的前提下，尽量少地插入真实算子，优化显存较多，且保证网络精度没有问题。\n",
    "- 当将MS_DEV_SIDE_EFFECT_LOAD_ELIM设置为3，不插入真实算子，不保证网络的精度，显存消耗最少。\n",
    "\n",
    "我们可以通过用例和生成的中间表示(即IR)来进一步理解。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "05537350",
   "metadata": {},
   "source": [
    "```python\n",
    "import numpy as np\n",
    "from mindspore.nn import Cell\n",
    "from mindspore import Tensor, Parameter, ops\n",
    "import mindspore as ms\n",
    "\n",
    "ms.set_context(mode=ms.GRAPH_MODE)\n",
    "\n",
    "class ForwardNet(Cell):\n",
    "    def __init__(self):\n",
    "        super(ForwardNet, self).__init__()\n",
    "        self.weight = Parameter(Tensor(np.array(0), ms.int32), name=\"param\")\n",
    "\n",
    "    def construct(self, x):\n",
    "        out = 0\n",
    "        i = 0\n",
    "        while i < 3:\n",
    "            ops.assign(self.weight, i)\n",
    "            out = x * self.weight + out\n",
    "            i = i + 1\n",
    "        return out\n",
    "\n",
    "\n",
    "class BackwardNet(Cell):\n",
    "    def __init__(self, net):\n",
    "        super(BackwardNet, self).__init__(auto_prefix=False)\n",
    "        self.forward_net = net\n",
    "        self.grad = ops.GradOperation(get_all=True)\n",
    "\n",
    "    def construct(self, *inputs):\n",
    "        grads = self.grad(self.forward_net)(*inputs)\n",
    "        return grads\n",
    "\n",
    "x = Tensor(np.array(1), ms.int32)\n",
    "graph_forword_net = ForwardNet()\n",
    "graph_backword_net = BackwardNet(graph_forword_net)\n",
    "graph_mode_grads = graph_backword_net(x)\n",
    "output_except = (Tensor(np.array(3), ms.int32),)\n",
    "assert np.all(graph_mode_grads == output_except)\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "60c7ca19",
   "metadata": {},
   "source": [
    "如上用例，通过设置保存中间文件，即：ms.set_context(mode=ms.GRAPH_MODE，save_graphs=True)，可以得到中间文件IR，为了便于查看，我们将得到的中间文件简化如下：\n",
    "\n",
    "在未对框架内部的Load算子插入真实算子时的IR文件如下，可以看到存在3个Load算子，均是取不同时机的para2_param这个全局变量的值，而这个全局变量会通过Assign算子修改值。即3个Load取到的值是不同的。而如果我们没有对Load算子插入真实算子，即没有对不同时机下的para2_param这个全局变量的值进行保存，那么得到的最终结果是不对的。即这种情况在MS_DEV_SIDE_EFFECT_LOAD_ELIM设置为3，内存占用最少，但结果存在精度问题。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "08ae4853",
   "metadata": {},
   "source": [
    "```text\n",
    "# IR entry: @BackwardNet_construct\n",
    "# Total subgraphs: 1\n",
    "# Total params: 2\n",
    "# Params:\n",
    "%para1_inputs0 : <Tensor[Int32], ()>\n",
    "%para2_param : <Ref[Tensor[Int32]], (), ref_key=:param>  :  has_default\n",
    "\n",
    "subgraph @BackwardNet_construct() {\n",
    "  %0 = Assign(%para2_param, Tensor(shape=[], dtype=Int32, value=0), U)\n",
    "  %1 = UpdateState(U, %0)\n",
    "  %2 = Load(%para2_param, %1)\n",
    "  %3 = UpdateState(%1, %2)\n",
    "  %4 = Assign(%para2_param, Tensor(shape=[], dtype=Int32, value=1), %3)\n",
    "  %5 = UpdateState(%3, %4)\n",
    "  %6 = Load(%para2_param, %5)\n",
    "  %7 = UpdateState(%5, %6)\n",
    "  %8 = Assign(%para2_param, Tensor(shape=[], dtype=Int32, value=2), %7)\n",
    "  %9 = UpdateState(%7, %8)\n",
    "  %10 = Load(%para2_param, %9)\n",
    "  %11 = MakeTuple(%10, %6)\n",
    "  %12 = AddN(%11)\n",
    "  %13 = MakeTuple(%12, %2)\n",
    "  %14 = AddN(%13)\n",
    "  %15 = MakeTuple(%14)\n",
    "  %16 = UpdateState(%9, %10)\n",
    "  %17 = Depend(%15, %16)\n",
    "  Return(%17)\n",
    "}\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e7378662",
   "metadata": {},
   "source": [
    "将MS_DEV_SIDE_EFFECT_LOAD_ELIM设置为0，1，2时，可以得到简化后的IR图如下。由于该场景中的Load算子均需要插入真实算子来保存每次Assign算子修改后的值，所以MS_DEV_SIDE_EFFECT_LOAD_ELIM设置为0，1，2得到的IR文件是一致的。更多复杂的情况下，MS_DEV_SIDE_EFFECT_LOAD_ELIM设置为0，1，2时可能是不同的，在此不再一一展开。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d0a3d96c",
   "metadata": {},
   "source": [
    "```text\n",
    "# IR entry: @BackwardNet_construct\n",
    "# Total subgraphs: 1\n",
    "# Total params: 2\n",
    "# Params:\n",
    "%para1_inputs0 : <Tensor[Int32], ()>\n",
    "%para2_param : <Ref[Tensor[Int32]], (), ref_key=:param>  :  has_default\n",
    "\n",
    "subgraph @BackwardNet_construct() {\n",
    "  %0 = Assign(%para2_param, Tensor(shape=[], dtype=Int32, value=0), U)\n",
    "  %1 = UpdateState(U, %0)\n",
    "  %2 = Load(%para2_param, %1)\n",
    "  %3 = TensorMove(%2)\n",
    "  %4 = UpdateState(%1, %3)\n",
    "  %5 = Assign(%para2_param, Tensor(shape=[], dtype=Int32, value=1), %4)\n",
    "  %6 = UpdateState(%4, %5)\n",
    "  %7 = Load(%para2_param, %6)\n",
    "  %8 = TensorMove(%7)\n",
    "  %9 = UpdateState(%6, %8)\n",
    "  %10 = Assign(%para2_param, Tensor(shape=[], dtype=Int32, value=2), %9)\n",
    "  %11 = UpdateState(%9, %10)\n",
    "  %12 = Load(%para2_param, %11)\n",
    "  %13 = TensorMove(%12)\n",
    "  %14 = MakeTuple(%13, %8)\n",
    "  %15 = AddN(%14)\n",
    "  %16 = MakeTuple(%15, %3)\n",
    "  %17 = AddN(%16)\n",
    "  %18 = MakeTuple(%17)\n",
    "  %19 = UpdateState(%11, %13, %15)\n",
    "  %20 = Depend(%18, %19)\n",
    "  Return(%20)\n",
    "}\n",
    "```"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "MindSpore",
   "language": "python",
   "name": "mindspore"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}