{ "cells": [ { "cell_type": "markdown", "source": [ "# Generating Names with a Character-Level RNN\n", "\n", "`Ascend` `GPU` `Intermediate` `Natural Language Processing` `Whole Process`\n", "\n", "[![](https://gitee.com/mindspore/docs/raw/r1.5/resource/_static/logo_modelarts.png)](https://authoring-modelarts-cnnorth4.huaweicloud.com/console/lab?share-url-b64=aHR0cHM6Ly9taW5kc3BvcmUtd2Vic2l0ZS5vYnMuY24tbm9ydGgtNC5teWh1YXdlaWNsb3VkLmNvbS9ub3RlYm9vay9yMS41L3R1dG9yaWFscy96aF9jbi9taW5kc3BvcmVfcm5uX2dlbmVyYXRpb24uaXB5bmI=&imageid=59a6e9f5-93c0-44dd-85b0-82f390c5d53b) [![](https://gitee.com/mindspore/docs/raw/r1.5/resource/_static/logo_notebook.png)](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/r1.5/tutorials/zh_cn/mindspore_rnn_generation.ipynb) [![](https://gitee.com/mindspore/docs/raw/r1.5/resource/_static/logo_download_code.png)](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/r1.5/tutorials/zh_cn/mindspore_rnn_generation.py) [![](https://gitee.com/mindspore/docs/raw/r1.5/resource/_static/logo_source.png)](https://gitee.com/mindspore/docs/blob/r1.5/tutorials/source_zh_cn/intermediate/text/rnn_generation.ipynb)" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "## Overview\n", "\n", "In this tutorial, we run the classification task in reverse and generate names for different languages. As before, we build a small RNN out of linear layers. The main difference from the [Classifying Names with a Character-Level RNN](https://www.mindspore.cn/tutorials/zh-CN/r1.5/intermediate/text/rnn_classification.html) tutorial is that, instead of reading all the letters of a name to predict its category, we input a category and output one letter at a time. A model that predicts characters one after another to form a word is commonly called a \"language model\".\n", "\n", "> This tutorial runs on GPU/Ascend." ], "metadata": {} }, { "cell_type": "markdown", "source": [ "## Preparation" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "### Environment Configuration" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "In this tutorial, we run the experiment in PyNative mode on Ascend." ], "metadata": {} }, { "cell_type": "code", "execution_count": 1, "source": [ "from mindspore import context\n", "\n", "context.set_context(mode=context.PYNATIVE_MODE, device_target=\"Ascend\")" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "### Preparing the Data\n", "\n", 
"The dataset contains several thousand surnames from 18 languages. Click [here](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/intermediate/data.zip) to download the data and extract it to the current directory.\n", "\n", "The dataset directory is `data/names`. It contains 18 text files named `[Language].txt`. Each file holds a list of names, one per line. Most of the data is romanized, but it still needs to be converted from Unicode to ASCII.\n", "\n", "Run the following code in Jupyter Notebook to download and unzip the dataset." ], "metadata": {} }, { "cell_type": "code", "execution_count": 2, "source": [ "!wget -NP ./ https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/intermediate/data.zip\n", "!unzip ./data.zip" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "## Data Processing" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "- Import the modules." ], "metadata": {} }, { "cell_type": "code", "execution_count": 3, "source": [ "import os\n", "import glob\n", "import string\n", "import unicodedata\n", "from io import open" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "- Define the `find_files` function, which finds the files that match a wildcard pattern." ], "metadata": {} }, { "cell_type": "code", "execution_count": 4, "source": [ "def find_files(path):\n", "    return glob.glob(path)\n", "\n", "print(find_files('data/names/*.txt'))" ], "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "['data/names/German.txt', 'data/names/Dutch.txt', 'data/names/English.txt', 'data/names/Italian.txt', 'data/names/Vietnamese.txt', 'data/names/Portuguese.txt', 'data/names/Korean.txt', 'data/names/Spanish.txt', 'data/names/French.txt', 'data/names/Russian.txt', 'data/names/Greek.txt', 'data/names/Arabic.txt', 'data/names/Irish.txt', 'data/names/Chinese.txt', 'data/names/Czech.txt', 'data/names/Polish.txt', 'data/names/Japanese.txt', 'data/names/Scottish.txt']\n" ] } ], "metadata": {} }, { "cell_type": "markdown", "source": [ "- Define the `unicode_to_ascii` function, which converts Unicode to ASCII." ], "metadata": {} }, { "cell_type": "code", "execution_count": 5, "source": [ "all_letters = string.ascii_letters + \" .,;'-\"\n", "# The extra index (n_letters - 1) is reserved for the EOS marker\n", "n_letters = len(all_letters) + 1\n", "\n", "def 
unicode_to_ascii(s):\n", "    return ''.join(\n", "        c for c in unicodedata.normalize('NFD', s)\n", "        if unicodedata.category(c) != 'Mn'\n", "        and c in all_letters\n", "    )\n", "\n", "print(unicode_to_ascii('Estéves'))" ], "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Esteves\n" ] } ], "metadata": {} }, { "cell_type": "markdown", "source": [ "- Define the `read_lines` function, which reads a file and converts each line to ASCII." ], "metadata": {} }, { "cell_type": "code", "execution_count": 6, "source": [ "def read_lines(filename):\n", "    lines = open(filename, encoding='utf-8').read().strip().split('\\n')\n", "    return [unicode_to_ascii(line) for line in lines]" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "Define the `category_lines` dictionary and the `all_categories` list.\n", "\n", "- `category_lines`: the key is the language category and the value is the list of names.\n", "- `all_categories`: all language categories." ], "metadata": {} }, { "cell_type": "code", "execution_count": 7, "source": [ "category_lines = {}\n", "all_categories = []\n", "\n", "for filename in find_files('data/names/*.txt'):\n", "    category = os.path.splitext(os.path.basename(filename))[0]\n", "    all_categories.append(category)\n", "    lines = read_lines(filename)\n", "    category_lines[category] = lines\n", "\n", "n_categories = len(all_categories)" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "- Print the number of languages and their names." ], "metadata": {} }, { "cell_type": "code", "execution_count": 8, "source": [ "print('# categories:', n_categories, all_categories)" ], "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "# categories: 18 ['German', 'Dutch', 'English', 'Italian', 'Vietnamese', 'Portuguese', 'Korean', 'Spanish', 'French', 'Russian', 'Greek', 'Arabic', 'Irish', 'Chinese', 'Czech', 'Polish', 'Japanese', 'Scottish']\n" ] } ], "metadata": {} }, { "cell_type": "markdown", "source": [ "## Creating the Network\n", "\n", 
"This network extends the RNN from the [Classifying Names with a Character-Level RNN](https://www.mindspore.cn/tutorials/zh-CN/r1.5/intermediate/text/rnn_classification.html) tutorial with an extra `category` tensor, which is concatenated with the input `input` and the hidden state `hidden`. Like the letter input, the category tensor is one-hot encoded.\n", "\n", "The network outputs the probability of each possible next letter. During sampling, the most likely letter is used as the `input` of the next iteration.\n", "\n", "The structure differs slightly from the previous network: to get better results, we add a second linear layer `o2o` after the `output combined` layer. We also add a `dropout` layer, which randomly zeroes part of its input with a given probability (0.1 here). Dropout is commonly used to prevent overfitting.\n", "\n", "![rnn2](https://gitee.com/mindspore/docs/raw/r1.5/tutorials/source_zh_cn/intermediate/text/images/run2.png)" ], "metadata": {} }, { "cell_type": "code", "execution_count": 9, "source": [ "import numpy as np\n", "\n", "from mindspore import nn, ops, Tensor\n", "from mindspore import dtype as mstype\n", "\n", "class RNN(nn.Cell):\n", "    \"\"\"Define the RNN network.\"\"\"\n", "    def __init__(self, input_size, hidden_size, output_size):\n", "        super(RNN, self).__init__()\n", "        self.hidden_size = hidden_size\n", "        self.i2h = nn.Dense(n_categories + input_size + hidden_size, hidden_size)\n", "        self.i2o = nn.Dense(n_categories + input_size + hidden_size, output_size)\n", "        self.o2o = nn.Dense(hidden_size + output_size, output_size)\n", "        # In MindSpore r1.5, nn.Dropout takes the keep probability,\n", "        # so a drop probability of 0.1 corresponds to keep_prob=0.9\n", "        self.dropout = nn.Dropout(keep_prob=0.9)\n", "        self.softmax = nn.LogSoftmax(axis=1)\n", "\n", "    # Build the RNN network structure\n", "    def construct(self, category, input, hidden):\n", "        op = ops.Concat(axis=1)\n", "        input_combined = op((category, input, hidden))\n", "        hidden = self.i2h(input_combined)\n", "        output = self.i2o(input_combined)\n", "        output_combined = op((hidden, output))\n", "        output = self.o2o(output_combined)\n", "        output = self.dropout(output)\n", "        output = self.softmax(output)\n", "        return output, hidden\n", "\n", "    # Initialize the hidden state with zeros\n", "    def initHidden(self):\n", "        return Tensor(np.zeros((1, self.hidden_size)), mstype.float32)" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "## Training\n", "\n", "### Preparing for Training" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "- Use the `random_training_pair` function to randomly pick a language and one of its names as training data." ], "metadata": {} }, { "cell_type": 
"code", "execution_count": 10, "source": [ "import random\n", "\n", "# Pick a random element from a list\n", "def random_choice(l):\n", "    return l[random.randint(0, len(l) - 1)]\n", "\n", "# Pick a random language and a random name from that language\n", "def random_training_pair():\n", "    category = random_choice(all_categories)\n", "    line = random_choice(category_lines[category])\n", "    return category, line" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "For each name in the training set, the network input is `(category, current letter, hidden state)` and the output is `(next letter, next hidden state)`. Each training sample therefore needs a `categoryTensor` (a one-hot tensor representing the category), an `inputTensor` for the input (a one-hot matrix of the first through last letters, excluding EOS), and a `targetTensor` for the output (a tensor of the second through last letters, followed by EOS).\n", "\n", "Because we predict the next letter from the current one, each name is split into pairs of consecutive letters. For example, `\"ABCD\"` yields the pairs `('A', 'B'), ('B', 'C'), ('C', 'D'), ('D', 'EOS')`.\n", "\n", "![pair](https://gitee.com/mindspore/docs/raw/r1.5/tutorials/source_zh_cn/intermediate/text/images/pair.png)\n", "\n", "During training, the `category` tensor is fed to the network at every step. It is a [one-hot tensor](https://en.wikipedia.org/wiki/One-hot) of size `<1 x n_categories>`." ], "metadata": {} }, { "cell_type": "markdown", "source": [ "- Define the `category_to_tensor` function, which converts a category into a one-hot tensor of size `<1 x n_categories>`." ], "metadata": {} }, { "cell_type": "code", "execution_count": 11, "source": [ "def category_to_tensor(category):\n", "    li = all_categories.index(category)\n", "    tensor = Tensor(np.zeros((1, n_categories)), mstype.float32)\n", "    tensor[0, li] = 1.0\n", "    return tensor" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "- Define the `input_to_tensor` function, which converts the input into a one-hot matrix of the first through last letters (excluding EOS)." ], "metadata": {} }, { "cell_type": "code", "execution_count": 12, "source": [ "def input_to_tensor(line):\n", "    tensor = Tensor(np.zeros((len(line), 1, n_letters)), mstype.float32)\n", "    for li in range(len(line)):\n", "        letter = line[li]\n", "        tensor[li, 0, all_letters.find(letter)] = 1.0\n", "    return tensor" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "- 
Define the `target_to_tensor` function, which converts the target into a tensor of the second through last letters, followed by EOS." ], "metadata": {} }, { "cell_type": "code", "execution_count": 13, "source": [ "def target_to_tensor(line):\n", "    letter_indexes = [all_letters.find(line[li]) for li in range(1, len(line))]\n", "\n", "    # Append the EOS index\n", "    letter_indexes.append(n_letters - 1)\n", "\n", "    return Tensor(np.array(letter_indexes), mstype.int64)" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "For convenience during training, the `random_training` function fetches a random `(category, line)` pair and converts it into the `(category, input, target)` tensors in the required format." ], "metadata": {} }, { "cell_type": "code", "execution_count": 14, "source": [ "def random_training():\n", "    category, line = random_training_pair()\n", "    category_tensor = category_to_tensor(category)\n", "    input_line_tensor = input_to_tensor(line)\n", "    target_line_tensor = target_to_tensor(line)\n", "    return category_tensor, input_line_tensor, target_line_tensor" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "### Training the Network\n", "\n", "Unlike the classification model, which uses only the last output, this network makes a prediction at every step, so the loss has to be computed at every step." ], "metadata": {} }, { "cell_type": "markdown", "source": [ "- Define the `NLLLoss` loss function." ], "metadata": {} }, { "cell_type": "code", "execution_count": 15, "source": [ "import mindspore.ops as ops\n", "\n", "class NLLLoss(nn.LossBase):\n", "    def __init__(self, reduction='mean'):\n", "        super(NLLLoss, self).__init__(reduction)\n", "        self.one_hot = ops.OneHot()\n", "        self.reduce_sum = ops.ReduceSum()\n", "\n", "    def construct(self, logits, label):\n", "        label_one_hot = self.one_hot(label, ops.shape(logits)[-1], ops.scalar_to_array(1.0), ops.scalar_to_array(0.0))\n", "        loss = self.reduce_sum(-1.0 * logits * label_one_hot, (1,))\n", "        return self.get_loss(loss)" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": 16, "source": [ "criterion = NLLLoss()" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "- 
MindSpore encapsulates operations such as the loss function and the optimizer in Cells. However, the RNN in this tutorial has to compute the loss at every step of the loop, so we define a custom `WithLossCell` class that connects the network to the loss function." ], "metadata": {} }, { "cell_type": "code", "execution_count": 17, "source": [ "class WithLossCellRnn(nn.Cell):\n", "    \"\"\"Wrap the RNN with a loss function that is applied at every step.\"\"\"\n", "\n", "    def __init__(self, backbone, loss_fn):\n", "        super(WithLossCellRnn, self).__init__(auto_prefix=True)\n", "        self._backbone = backbone\n", "        self._loss_fn = loss_fn\n", "\n", "    def construct(self, category_tensor, input_line_tensor, hidden, target_line_tensor):\n", "        loss = 0\n", "        for i in range(input_line_tensor.shape[0]):\n", "            output, hidden = self._backbone(category_tensor, input_line_tensor[i], hidden)\n", "            l = self._loss_fn(output, target_line_tensor[i])\n", "            loss += l\n", "        return loss" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "- Create the optimizer, the `WithLossCellRnn` instance, and the `TrainOneStepCell` training network." ], "metadata": {} }, { "cell_type": "code", "execution_count": 18, "source": [ "rnn_cf = RNN(n_letters, 128, n_letters)\n", "optimizer = nn.Momentum(filter(lambda x: x.requires_grad, rnn_cf.get_parameters()), 0.0001, 0.9)\n", "net_with_criterion = WithLossCellRnn(rnn_cf, criterion)\n", "net = nn.TrainOneStepCell(net_with_criterion, optimizer)\n", "net.set_train()\n", "\n", "# Train the network on one name\n", "def train(category_tensor, input_line_tensor, target_line_tensor):\n", "    new_shape = list(target_line_tensor.shape)\n", "    new_shape.append(1)\n", "    target_line_tensor = target_line_tensor.reshape(new_shape)\n", "    hidden = rnn_cf.initHidden()\n", "    loss = net(category_tensor, input_line_tensor, hidden, target_line_tensor)\n", "\n", "    # Run the sequence again to get the output of the last step\n", "    for i in range(input_line_tensor.shape[0]):\n", "        output, hidden = rnn_cf(category_tensor, input_line_tensor[i], hidden)\n", "\n", "    return output, loss / input_line_tensor.shape[0]" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "- To track how long training takes, define the `time_since` function, which computes the elapsed training time so that we can follow the progress of the whole run." ], "metadata": {} }, { "cell_type": "code", 
"execution_count": 19, "source": [ "import time\n", "import math\n", "\n", "# Format the elapsed time as a readable string\n", "def time_since(since):\n", "    now = time.time()\n", "    s = now - since\n", "    m = math.floor(s / 60)\n", "    s -= m * 60\n", "    return '%dm %ds' % (m, s)" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "- During training, every `print_every` (500) iterations we print the elapsed time, the iteration count, the progress, and the loss. In addition, every `plot_every` iterations we compute the average loss and append it to the `all_losses` list, so that the loss curve can be plotted later." ], "metadata": {} }, { "cell_type": "code", "execution_count": 20, "source": [ "n_iters = 7500\n", "print_every = 500\n", "plot_every = 100\n", "all_losses = []\n", "\n", "# Reset to 0 every plot_every (100) iterations\n", "total_loss = 0\n", "\n", "start = time.time()\n", "\n", "for iter in range(1, n_iters + 1):\n", "    output, loss = train(*random_training())\n", "    total_loss += loss\n", "\n", "    # Print the elapsed time, the iteration count, the progress, and the loss\n", "    if iter % print_every == 0:\n", "        print('%s (%d %d%%) %.4f' % (time_since(start), iter, iter / n_iters * 100, loss.asnumpy()))\n", "\n", "    if iter % plot_every == 0:\n", "        all_losses.append((total_loss / plot_every).asnumpy())\n", "        total_loss = 0" ], "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "5m 56s (500 6%) 3.7287\n", "11m 37s (1000 13%) 4.1077\n", "17m 8s (1500 20%) 4.0800\n", "22m 51s (2000 26%) 4.1025\n", "28m 28s (2500 33%) 4.0878\n", "34m 3s (3000 40%) 4.0277\n", "39m 38s (3500 46%) 4.0859\n", "45m 22s (4000 53%) 3.9899\n", "51m 7s (4500 60%) 3.5648\n", "56m 52s (5000 66%) 4.0283\n", "63m 34s (5500 73%) 4.0877\n", "71m 50s (6000 80%) 4.0858\n", "79m 42s (6500 86%) 4.1082\n", "88m 11s (7000 93%) 3.7557\n", "97m 32s (7500 100%) 4.0461\n" ] } ], "metadata": {} }, { "cell_type": "markdown", "source": [ "- Use `matplotlib.pyplot` to plot the loss over the course of training." ], "metadata": {} }, { "cell_type": "code", "execution_count": 21, "source": [ "import matplotlib.pyplot as plt\n", "\n", "plt.figure()\n", "plt.plot(all_losses)" ], "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "[]" ] }, 
"metadata": {}, "execution_count": 22 }, { "output_type": "display_data", "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" } } ], "metadata": {} }, { "cell_type": "markdown", "source": [ "## Validating the Model\n", "\n", "After training, we validate the resulting model. We feed the network one letter and infer the next letter, then use the output letter as the next input, repeating until the EOS marker is reached." ], "metadata": {} }, { "cell_type": "code", "execution_count": 23, "source": [ "max_length = 20\n", "\n", "# Switch to inference mode so that dropout is disabled while sampling\n", "rnn_cf.set_train(False)\n", "\n", "# Sample a name from a category and a start letter, beginning with a zero hidden state\n", "def sample(category, start_letter='A'):\n", "    category_tensor = category_to_tensor(category)\n", "    input = input_to_tensor(start_letter)\n", "    hidden = rnn_cf.initHidden()\n", "    output_name = start_letter\n", "\n", "    for i in range(max_length):\n", "        output, hidden = rnn_cf(category_tensor, input[0], hidden)\n", "        topk = ops.TopK(sorted=True)\n", "        topv, topi = topk(output, 1)\n", "        topi = topi[0, 0]\n", "        # Stop once the most likely next symbol is EOS\n", "        if topi == n_letters - 1:\n", "            break\n", "        else:\n", "            letter = all_letters[topi]\n", "            output_name += letter\n", "        input = input_to_tensor(letter)\n", "\n", "    return output_name\n", "\n", "# Sample one name for each of the given start letters\n", "def samples(category, start_letters='ABC'):\n", "    for start_letter in start_letters:\n", "        print('Language: %s Start letter: %s Output: %s' % (category, start_letter, sample(category, start_letter)))\n", "\n", "samples('Russian', 'RUS')\n", "samples('German', 'GER')\n", "samples('Spanish', 'SPA')\n", "samples('Chinese', 'CHI')" ], "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Language: Russian Start letter: R Output: Rrnarao\n", "Language: Russian Start letter: U Output: Uilehidaauritai\n", "Language: Russian Start letter: S Output: Sa\n", "Language: German Start letter: G Output: Gallh\n", "Language: German Start letter: E Output: Ehuhkiakena\n", "Language: German Start letter: R Output: Rcsahonuianah\n", "Language: Spanish Start letter: S Output: Stadlalugtnaaa\n", "Language: Spanish Start letter: P Output: Perahaiarsaorrol\n", "Language: Spanish Start letter: A Output: Aaan\n", "Language: Chinese Start letter: C Output: Cadco\n", "Language: Chinese Start letter: H Output: Hn\n", "Language: Chinese Start letter: I Output: I\n" ] } ], "metadata": {} } ], "metadata": { "kernelspec": { "display_name": "MindSpore-python3.7-aarch64", "language": "python", "name": "mindspore-python3.7-aarch64" }, "language_info": { "codemirror_mode": 
{ "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.5" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }