{
  "cells": [
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# 文本变换样例库\n",
        "\n",
        "[![下载Notebook](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_notebook.svg)](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/master/docs/api_python/samples/dataset/text_gallery.ipynb) \n",
        "[![查看源文件](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source.svg)](https://gitee.com/mindspore/mindspore/blob/master/docs/api/api_python/samples/dataset/text_gallery.ipynb)\n",
        "\n",
        "此指南展示了[mindspore.dataset.text](https://www.mindspore.cn/docs/zh-CN/master/api_python/mindspore.dataset.transforms.html#%E6%96%87%E6%9C%AC)模块中各种变换的用法。"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 环境准备"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 1,
      "metadata": {
        "collapsed": false
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Downloading data from https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/bert-base-uncased-vocab.txt (226 kB)\n",
            "\n",
            "file_sizes: 100%|████████████████████████████| 232k/232k [00:00<00:00, 2.21MB/s]\n",
            "Successfully downloaded file to ./bert-base-uncased-vocab.txt\n",
            "Downloading data from https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/article.txt (9 kB)\n",
            "\n",
            "file_sizes: 100%|██████████████████████████| 9.06k/9.06k [00:00<00:00, 1.83MB/s]\n",
            "Successfully downloaded file to ./article.txt\n",
            "['text_gallery.ipynb', 'article.txt', 'bert-base-uncased-vocab.txt']\n"
          ]
        }
      ],
      "source": [
        "import os\n",
        "from download import download\n",
        "\n",
        "import mindspore.dataset as ds\n",
        "import mindspore.dataset.text as text\n",
        "\n",
        "# Download opensource datasets\n",
        "# citation: https://www.kaggle.com/datasets/drknope/bertbaseuncasedvocab\n",
        "url = \"https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/bert-base-uncased-vocab.txt\"\n",
        "download(url, './bert-base-uncased-vocab.txt', replace=True)\n",
        "\n",
        "url = \"https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/article.txt\"\n",
        "download(url, './article.txt', replace=True)\n",
        "\n",
        "# Show the directory\n",
        "print(os.listdir())\n",
        "\n",
        "def call_op(op, input):\n",
        "    print(op(input), flush=True)"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Vocab\n",
        "\n",
        "[mindspore.dataset.text.Vocab](https://mindspore.cn/docs/zh-CN/master/api_python/dataset_text/mindspore.dataset.text.Vocab.html#mindspore.dataset.text.Vocab) 用于存储多对字符与ID。其包含一个映射,可以将每个单词(str)映射到一个ID(int)。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "metadata": {},
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "ids [18863, 18279]\n",
            "tokens ['##nology', 'crystalline']\n",
            "lookup: ids [18863 18279]\n"
          ]
        }
      ],
      "source": [
        "# Load bert vocab\n",
        "vocab_file = open(\"bert-base-uncased-vocab.txt\")\n",
        "vocab_content = list(set(vocab_file.read().splitlines()))\n",
        "vocab = text.Vocab.from_list(vocab_content)\n",
        "\n",
        "# lookup tokens to ids\n",
        "ids = vocab.tokens_to_ids([\"good\", \"morning\"])\n",
        "print(\"ids\", ids)\n",
        "\n",
        "# lookup ids to tokens\n",
        "tokens = vocab.ids_to_tokens([128, 256])\n",
        "print(\"tokens\", tokens)\n",
        "\n",
        "# Use Lookup op to lookup index\n",
        "op = text.Lookup(vocab)\n",
        "ids = op([\"good\", \"morning\"])\n",
        "print(\"lookup: ids\", ids)"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## AddToken\n",
        "\n",
        "[mindspore.dataset.text.AddToken](https://mindspore.cn/docs/zh-CN/master/api_python/dataset_text/mindspore.dataset.text.AddToken.html#mindspore.dataset.text.AddToken) 将分词(token)添加到序列的开头或结尾处。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 3,
      "metadata": {},
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "['TOKEN' 'a' 'b' 'c' 'd' 'e']\n",
            "['a' 'b' 'c' 'd' 'e' 'END']\n"
          ]
        }
      ],
      "source": [
        "txt = [\"a\", \"b\", \"c\", \"d\", \"e\"]\n",
        "add_token_op = text.AddToken(token='TOKEN', begin=True)\n",
        "call_op(add_token_op, txt)\n",
        "\n",
        "add_token_op = text.AddToken(token='END', begin=False)\n",
        "call_op(add_token_op, txt)"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## SentencePieceTokenizer\n",
        "\n",
        "[mindspore.dataset.text.SentencePieceTokenizer](https://mindspore.cn/docs/zh-CN/master/api_python/dataset_text/mindspore.dataset.text.SentencePieceTokenizer.html#mindspore.dataset.text.SentencePieceTokenizer) 使用SentencePiece分词器对字符串进行分词。\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 4,
      "metadata": {
        "collapsed": false
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Downloading data from https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/sentencepiece.bpe.model (4.8 MB)\n",
            "\n",
            "file_sizes: 100%|██████████████████████████| 5.07M/5.07M [00:01<00:00, 2.93MB/s]\n",
            "Successfully downloaded file to ./sentencepiece.bpe.model\n",
            "['▁Today' '▁is' '▁Tuesday' '.']\n"
          ]
        }
      ],
      "source": [
        "# Construct a SentencePieceVocab model\n",
        "url = \"https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/sentencepiece.bpe.model\"\n",
        "download(url, './sentencepiece.bpe.model', replace=True)\n",
        "sentence_piece_vocab_file = './sentencepiece.bpe.model'\n",
        "\n",
        "# Use the model to tokenize text\n",
        "tokenizer = text.SentencePieceTokenizer(sentence_piece_vocab_file, out_type=text.SPieceTokenizerOutType.STRING)\n",
        "txt = \"Today is Tuesday.\"\n",
        "call_op(tokenizer, txt)"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## WordpieceTokenizer\n",
        "\n",
        "[mindspore.dataset.text.WordpieceTokenizer](https://mindspore.cn/docs/zh-CN/master/api_python/dataset_text/mindspore.dataset.text.WordpieceTokenizer.html#mindspore.dataset.text.WordpieceTokenizer) 将输入的字符串切分为子词。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 5,
      "metadata": {
        "collapsed": false
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "['token' '##izer' 'will' 'outputs' 'sub' '##words']\n"
          ]
        }
      ],
      "source": [
        "# Reuse the vocab defined above as input vocab\n",
        "tokenizer = text.WordpieceTokenizer(vocab=vocab, unknown_token='[UNK]')\n",
        "txt = [\"tokenizer\", \"will\", \"outputs\", \"subwords\"]\n",
        "call_op(tokenizer, txt)"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 在数据Pipeline中加载和处理TXT文件\n",
        "\n",
        "使用 [mindspore.dataset.TextFileDataset](https://mindspore.cn/docs/zh-CN/master/api_python/dataset/mindspore.dataset.TextFileDataset.html#mindspore.dataset.TextFileDataset) 将磁盘中的文本文件内容加载到数据Pipeline中,并应用分词器对其中的内容进行分词。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Load text content into dataset pipeline\n",
        "text_file = \"article.txt\"\n",
        "dataset = ds.TextFileDataset(dataset_files=text_file, shuffle=False)\n",
        "\n",
        "# check the column names inside the dataset\n",
        "print(\"column names:\", dataset.get_col_names())\n",
        "\n",
        "# tokenize all text content into tokens with bert vocab\n",
        "dataset = dataset.map(text.BertTokenizer(vocab=vocab), input_columns=[\"text\"])\n",
        "\n",
        "for data in dataset:\n",
        "    print(data)"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "ly37",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.7.5"
    },
    "vscode": {
      "interpreter": {
        "hash": "9f0efe8a0d8ccef1406a56130f5ab5480567fb275f7fbf51bbc40aede97503df"
      }
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}