{ "cells": [ { "cell_type": "markdown", "id": "a882ffc9", "metadata": {}, "source": [ "# 梯度求导\n", "\n", "[![下载Notebook](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r2.0/resource/_static/logo_notebook.png)](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/r2.0/zh_cn/migration_guide/model_development/mindspore_gradient.ipynb)\n", "[![下载样例代码](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r2.0/resource/_static/logo_download_code.png)](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/r2.0/zh_cn/migration_guide/model_development/mindspore_gradient.py)\n", "[![查看源文件](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r2.0/resource/_static/logo_source.png)](https://gitee.com/mindspore/docs/blob/r2.0/docs/mindspore/source_zh_cn/migration_guide/model_development/gradient.ipynb)\n", "\n", "## 自动微分接口\n", "\n", "正向网络构建完成之后,MindSpore提供了[自动微分](https://mindspore.cn/tutorials/zh-CN/r2.0/beginner/autograd.html)的接口用以计算模型的梯度结果。\n", "在[自动求导](https://mindspore.cn/tutorials/zh-CN/r2.0/advanced/derivation.html)的教程中,对各种梯度计算的场景做了一些介绍。\n", "\n", "MindSpore求梯度的接口目前有三种:\n", "\n", "### mindspore.grad\n", "\n", "[mindspore.grad](https://www.mindspore.cn/docs/zh-CN/r2.0/api_python/mindspore/mindspore.grad.html)这个API有四个可以配置的参数:\n", "\n", "- fn (Union[Cell, Function]) - 待求导的函数或网络(Cell)。\n", "\n", "- grad_position (Union[NoneType, int, tuple[int]]) - 指定求导输入位置的索引,默认值:0。\n", "\n", "- weights (Union[ParameterTuple, Parameter, list[Parameter]]) - 训练网络中需要返回梯度的网络参数,默认值:None。\n", "\n", "- has_aux (bool) - 是否返回辅助参数的标志。若为True, fn 输出数量必须超过一个,其中只有 fn 第一个输出参与求导,其他输出值将直接返回。默认值:False。\n", "\n", "其中`grad_position`和`weights`共同决定要输出哪些值的梯度,has_aux在有多个输出时配置对第一个输入求梯度还是全部输出求梯度。\n", "\n", "| grad_position | weights | output |\n", "| ------------- | ------- | ------ |\n", "| 0 | None | 第一个输入的梯度 |\n", "| 1 | None | 第二个输入的梯度 |\n", "| (0, 1) | None | (第一个输入的梯度, 第二个输入的梯度) |\n", "| None | weights | (weights的梯度) |\n", "| 0 | weights | (第一个输入的梯度), (weights的梯度) |\n", "| (0, 1) | weights | (第一个输入的梯度, 第二个输入的梯度), (weights的梯度) |\n", "| None | None | 报错 |\n", "\n", "下面实际运行一个示例,看下具体是怎么用的。\n", "\n", "首先,构造一个带参数的网络,这个网络有两个输出loss和logits,其中loss是我们用于求梯度的输出。" ] }, { "cell_type": "code", "execution_count": 1, "id": "64960668", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "=== weight ===\n", "name: fc.weight data: [[2. 3. 4.]]\n", "=== output ===\n", "1.0 20.0\n" ] } ], "source": [ "import mindspore as ms\n", "from mindspore import nn\n", "\n", "class Net(nn.Cell):\n", " def __init__(self, in_channel, out_channel):\n", " super(Net, self).__init__()\n", " self.fc = nn.Dense(in_channel, out_channel, has_bias=False)\n", " self.loss = nn.MSELoss()\n", "\n", " def construct(self, x, y):\n", " logits = self.fc(x).squeeze()\n", " loss = self.loss(logits, y)\n", " return loss, logits\n", "\n", "\n", "net = Net(3, 1)\n", "net.fc.weight.set_data(ms.Tensor([[2, 3, 4]], ms.float32)) # 给全连接的weight设置固定值\n", "\n", "print(\"=== weight ===\")\n", "for param in net.trainable_params():\n", " print(\"name:\", param.name, \"data:\", param.data.asnumpy())\n", "x = ms.Tensor([[1, 2, 3]], ms.float32)\n", "y = ms.Tensor(19, ms.float32)\n", "\n", "loss, logits = net(x, y)\n", "print(\"=== output ===\")\n", "print(loss, logits)" ] }, { "cell_type": "code", "execution_count": 2, "id": "d9f84bb4", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "=== grads 1 ===\n", "grad [[4. 6. 
 { "cell_type": "code", "execution_count": 3, "id": "d183eb1b", "metadata": {}, "outputs": [
  { "name": "stdout", "output_type": "stream", "text": [
   "=== grads 2 ===\n",
   "grad -2.0\n",
   "logit (Tensor(shape=[], dtype=Float32, value= 20),)\n" ] } ], "source": [
  "# gradient with respect to the second input\n",
  "\n",
  "print(\"=== grads 2 ===\")\n",
  "grad_func = ms.grad(net, grad_position=1, weights=None, has_aux=True)\n",
  "grad, logit = grad_func(x, y)\n",
  "print(\"grad\", grad)\n",
  "print(\"logit\", logit)" ] },
 { "cell_type": "code", "execution_count": 4, "id": "010a2b75", "metadata": {}, "outputs": [
  { "name": "stdout", "output_type": "stream", "text": [
   "=== grads 3 ===\n",
   "grad (Tensor(shape=[1, 3], dtype=Float32, value=\n",
   "[[4.00000000e+000, 6.00000000e+000, 8.00000000e+000]]), Tensor(shape=[], dtype=Float32, value= -2))\n",
   "logit (Tensor(shape=[], dtype=Float32, value= 20),)\n" ] } ], "source": [
  "# gradients with respect to multiple inputs\n",
  "\n",
  "print(\"=== grads 3 ===\")\n",
  "grad_func = ms.grad(net, grad_position=(0, 1), weights=None, has_aux=True)\n",
  "grad, logit = grad_func(x, y)\n",
  "print(\"grad\", grad)\n",
  "print(\"logit\", logit)" ] },
 { "cell_type": "code", "execution_count": 5, "id": "d3ef8fdb", "metadata": {}, "outputs": [
  { "name": "stdout", "output_type": "stream", "text": [
   "=== grads 4 ===\n",
   "grad (Tensor(shape=[1, 3], dtype=Float32, value=\n",
   "[[2.00000000e+000, 4.00000000e+000, 6.00000000e+000]]),)\n",
   "logits (Tensor(shape=[], dtype=Float32, value= 20),)\n" ] } ], "source": [
  "# gradients with respect to the weights\n",
  "\n",
  "print(\"=== grads 4 ===\")\n",
  "grad_func = ms.grad(net, grad_position=None, weights=net.trainable_params(), has_aux=True)\n",
  "grad, logit = grad_func(x, y)\n",
  "print(\"grad\", grad)\n",
  "print(\"logits\", logit)" ] },
 { "cell_type": "code", "execution_count": 6, "id": "3a194b45", "metadata": {}, "outputs": [
  { "name": "stdout", "output_type": "stream", "text": [
   "=== grads 5 ===\n",
   "grad (Tensor(shape=[1, 3], dtype=Float32, value=\n",
   "[[4.00000000e+000, 6.00000000e+000, 8.00000000e+000]]), (Tensor(shape=[1, 3], dtype=Float32, value=\n",
   "[[2.00000000e+000, 4.00000000e+000, 6.00000000e+000]]),))\n",
   "logit (Tensor(shape=[], dtype=Float32, value= 20),)\n" ] } ], "source": [
  "# gradients with respect to the first input and the weights\n",
  "\n",
  "print(\"=== grads 5 ===\")\n",
  "grad_func = ms.grad(net, grad_position=0, weights=net.trainable_params(), has_aux=True)\n",
  "grad, logit = grad_func(x, y)\n",
  "print(\"grad\", grad)\n",
  "print(\"logit\", logit)" ] },
 { "cell_type": "code", "execution_count": 7, "id": "f8c8f703", "metadata": {}, "outputs": [
  { "name": "stdout", "output_type": "stream", "text": [
   "=== grads 6 ===\n",
   "grad ((Tensor(shape=[1, 3], dtype=Float32, value=\n",
   "[[4.00000000e+000, 6.00000000e+000, 8.00000000e+000]]), Tensor(shape=[], dtype=Float32, value= -2)), (Tensor(shape=[1, 3], dtype=Float32, value=\n",
   "[[2.00000000e+000, 4.00000000e+000, 6.00000000e+000]]),))\n",
   "logit (Tensor(shape=[], dtype=Float32, value= 20),)\n" ] } ], "source": [
  "# gradients with respect to multiple inputs and the weights\n",
  "\n",
  "print(\"=== grads 6 ===\")\n",
  "grad_func = ms.grad(net, grad_position=(0, 1), weights=net.trainable_params(), has_aux=True)\n",
  "grad, logit = grad_func(x, y)\n",
  "print(\"grad\", grad)\n",
  "print(\"logit\", logit)" ] },
"b7e91156", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "=== grads 7 ===\n", "grad [[ 6. 9. 12.]]\n" ] } ], "source": [ "# has_aux=False的场景\n", "\n", "print(\"=== grads 7 ===\")\n", "grad_func = ms.grad(net, grad_position=0, weights=None, has_aux=False)\n", "grad = grad_func(x, y) # 只有一个输出\n", "print(\"grad\", grad)" ] }, { "cell_type": "markdown", "id": "a46bdf98", "metadata": {}, "source": [ "`has_aux=False`的场景实际上等价于两个输出相加作为求梯度的输出:" ] }, { "cell_type": "code", "execution_count": 9, "id": "61068ad7", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "grad [[ 6. 9. 12.]]\n" ] } ], "source": [ "class Net2(nn.Cell):\n", " def __init__(self, in_channel, out_channel):\n", " super().__init__()\n", " self.fc = nn.Dense(in_channel, out_channel, has_bias=False)\n", " self.loss = nn.MSELoss()\n", "\n", " def construct(self, x, y):\n", " logits = self.fc(x).squeeze()\n", " loss = self.loss(logits, y)\n", " return loss + logits\n", "\n", "net2 = Net2(3, 1)\n", "net2.fc.weight.set_data(ms.Tensor([[2, 3, 4]], ms.float32)) # 给全连接的weight设置固定值\n", "grads = ms.grad(net2, grad_position=0, weights=None, has_aux=False)\n", "grad = grads(x, y) # 只有一个输出\n", "print(\"grad\", grad)" ] }, { "cell_type": "markdown", "id": "406e2752", "metadata": {}, "source": [ "```python\n", "# grad_position=None, weights=None\n", "\n", "print(\"=== grads 8 ===\")\n", "grad_func = ms.grad(net, grad_position=None, weights=None, has_aux=True)\n", "grad, logit = grad_func(x, y)\n", "print(\"grad\", grad)\n", "print(\"logit\", logit)\n", "\n", "# === grads 8 ===\n", "# ValueError: `grad_position` and `weight` can not be None at the same time.\n", "```" ] }, { "cell_type": "markdown", "id": "f718a9dd", "metadata": {}, "source": [ "### mindspore.value_and_grad\n", "\n", "[mindspore.value_and_grad](https://www.mindspore.cn/docs/zh-CN/r2.0/api_python/mindspore/mindspore.value_and_grad.html)这个接口和上面的grad的参数是一样的,只不过这个接口可以一次性计算网络的正向结果和梯度。\n", "\n", "| grad_position | weights | output |\n", "| ------------- | ------- | ------ |\n", "| 0 | None | (网络的输出, 第一个输入的梯度) |\n", "| 1 | None | (网络的输出, 第二个输入的梯度) |\n", "| (0, 1) | None | (网络的输出, (第一个输入的梯度, 第二个输入的梯度)) |\n", "| None | weights | (网络的输出, (weights的梯度)) |\n", "| 0 | weights | (网络的输出, ((第一个输入的梯度), (weights的梯度))) |\n", "| (0, 1) | weights | (网络的输出, ((第一个输入的梯度, 第二个输入的梯度), (weights的梯度))) |\n", "| None | None | 报错 |\n" ] }, { "cell_type": "code", "execution_count": 11, "id": "4fbc7970", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "=== value and grad ===\n", "value (Tensor(shape=[], dtype=Float32, value= 1), Tensor(shape=[], dtype=Float32, value= 20))\n", "grad ((Tensor(shape=[1, 3], dtype=Float32, value=\n", "[[4.00000000e+000, 6.00000000e+000, 8.00000000e+000]]), Tensor(shape=[], dtype=Float32, value= -2)), (Tensor(shape=[1, 3], dtype=Float32, value=\n", "[[2.00000000e+000, 4.00000000e+000, 6.00000000e+000]]),))\n" ] } ], "source": [ "print(\"=== value and grad ===\")\n", "value_and_grad_func = ms.value_and_grad(net, grad_position=(0, 1), weights=net.trainable_params(), has_aux=True)\n", "value, grad = value_and_grad_func(x, y)\n", "print(\"value\", value)\n", "print(\"grad\", grad)" ] }, { "cell_type": "markdown", "id": "03333b1c", "metadata": {}, "source": [ "### mindspore.ops.GradOperation\n", "\n", "[mindspore.ops.GradOperation](https://mindspore.cn/docs/zh-CN/r2.0/api_python/ops/mindspore.ops.GradOperation.html)一个高阶函数,为输入函数生成梯度函数。\n", "\n", "由 GradOperation 
  "This interface offers roughly the same functionality as grad and is not recommended in the current version; see the description in the API documentation for details.\n",
  "\n",
  "## loss scale\n",
  "\n",
  "In mixed-precision scenarios, gradient underflow may occur while computing gradients, so loss scaling is usually used together with gradient derivation.\n",
  "\n",
  "> On Ascend, operators such as Conv, Sort and TopK can only run in float16, and MatMul is best kept in float16 as well for performance reasons, so loss scaling is recommended as a standard part of network training. [List of operators that only support float16 on Ascend](https://www.mindspore.cn/docs/zh-CN/r2.0/migration_guide/debug_and_tune.html#4%E8%AE%AD%E7%BB%83%E7%B2%BE%E5%BA%A6).\n",
  ">\n",
  "> Information about overflowing operators can be obtained with the MindSpore Insight [debugger](https://www.mindspore.cn/mindinsight/docs/zh-CN/r2.0/debugger.html) or by [dumping data](https://mindspore.cn/tutorials/experts/zh-CN/r2.0/debug/dump.html).\n",
  ">\n",
  "> Overflow typically shows up as a NaN/INF loss, the loss suddenly becoming very large, and so on.\n" ] },
 { "cell_type": "code", "execution_count": 12, "id": "454ffc8e", "metadata": {}, "outputs": [
  { "name": "stdout", "output_type": "stream", "text": [
   "loss 1.0\n",
   "=== loss scale ===\n",
   "loss 1024.0\n",
   "grad (Tensor(shape=[1, 3], dtype=Float32, value=\n",
   "[[2.04800000e+003, 4.09600000e+003, 6.14400000e+003]]),)\n",
   "=== unscale ===\n",
   "loss 1.0\n",
   "grad (Tensor(shape=[1, 3], dtype=Float32, value=\n",
   "[[2.00000000e+000, 4.00000000e+000, 6.00000000e+000]]),)\n",
   "True\n" ] } ], "source": [
  "from mindspore.amp import StaticLossScaler, all_finite\n",
  "\n",
  "loss_scale = StaticLossScaler(1024.)  # static loss scale\n",
  "\n",
  "def forward_fn(x, y):\n",
  "    loss, logits = net(x, y)\n",
  "    print(\"loss\", loss)\n",
  "    loss = loss_scale.scale(loss)\n",
  "    return loss, logits\n",
  "\n",
  "value_and_grad_func = ms.value_and_grad(forward_fn, grad_position=None, weights=net.trainable_params(), has_aux=True)\n",
  "(loss, logits), grad = value_and_grad_func(x, y)\n",
  "print(\"=== loss scale ===\")\n",
  "print(\"loss\", loss)\n",
  "print(\"grad\", grad)\n",
  "print(\"=== unscale ===\")\n",
  "loss = loss_scale.unscale(loss)\n",
  "grad = loss_scale.unscale(grad)\n",
  "print(\"loss\", loss)\n",
  "print(\"grad\", grad)\n",
  "\n",
  "# check for overflow; returns True when there is no overflow\n",
  "state = all_finite(grad)\n",
  "print(state)" ] },
 { "cell_type": "markdown", "id": "69835f00", "metadata": {}, "source": [
  "The principle of loss scaling is simple: multiply the loss by a relatively large value so that, through the chain rule, every value along the gradient computation path is scaled up as well, preventing gradients from becoming too small during backpropagation and running into precision problems.\n",
  "\n",
  "After the gradients are computed, the loss and the gradients must be divided back to their original scale so that the overall computation remains correct.\n",
  "\n",
  "Finally, all_finite is usually used to check whether an overflow occurred; if there is no overflow, the optimizer can be used to update the parameters.\n",
  "\n",
  "## Gradient Clipping\n",
  "\n",
  "If training becomes unstable because of exploding or unusually large gradients, consider adding gradient clipping. The common case of clipping by global norm is illustrated here:" ] },
 { "cell_type": "code", "execution_count": 13, "id": "f877378f", "metadata": {}, "outputs": [], "source": [
  "from mindspore import ops\n",
  "\n",
  "grad = ops.clip_by_global_norm(grad)  # clip the gradients so that their global norm does not exceed clip_norm (default 1.0)" ] },
 { "cell_type": "markdown", "id": "702bfe13", "metadata": {}, "source": [
  "## Gradient Accumulation\n",
  "\n",
  "Gradient accumulation is a way of training a neural network in which the data samples of a batch are split into several smaller batches that are computed sequentially. It is used to work around OOM (Out of Memory) problems where, due to insufficient memory, a large batch size makes the network impossible to train or a large model impossible to load.\n",
  "\n",
  "For details, see [Gradient Accumulation](https://www.mindspore.cn/tutorials/experts/zh-CN/r2.0/optimize/gradient_accumulation.html)." ] }
 ], "metadata": { "kernelspec": { "display_name": "MindSpore", "language": "python", "name": "mindspore" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.5" } }, "nbformat": 4, "nbformat_minor": 5 }