{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 分词器\n",
    "\n",
    "[![](https://gitee.com/mindspore/docs/raw/r1.2/docs/programming_guide/source_zh_cn/_static/logo_source.png)](https://gitee.com/mindspore/docs/blob/r1.2/docs/programming_guide/source_zh_cn/tokenizer.ipynb) [![](https://gitee.com/mindspore/docs/raw/r1.2/docs/programming_guide/source_zh_cn/_static/logo_notebook.png)](https://obs.dualstack.cn-north-4.myhuaweicloud.com/mindspore-website/notebook/r1.2/programming_guide/mindspore_tokenizer.ipynb) [![](https://gitee.com/mindspore/docs/raw/r1.2/docs/programming_guide/source_zh_cn/_static/logo_modelarts.png)](https://console.huaweicloud.com/modelarts/?region=cn-north-4#/notebook/loading?share-url-b64=aHR0cHM6Ly9vYnMuZHVhbHN0YWNrLmNuLW5vcnRoLTQubXlodWF3ZWljbG91ZC5jb20vbWluZHNwb3JlLXdlYnNpdGUvbm90ZWJvb2svbW9kZWxhcnRzL3Byb2dyYW1taW5nX2d1aWRlL21pbmRzcG9yZV90b2tlbml6ZXIuaXB5bmI=&image_id=65f636a0-56cf-49df-b941-7d2a07ba8c8c)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 概述\n",
    "\n",
    "分词就是将连续的字序列按照一定的规范重新组合成词序列的过程,合理的进行分词有助于语义的理解。\n",
    "\n",
    "MindSpore提供了多种用途的分词器(Tokenizer),能够帮助用户高性能地处理文本,用户可以构建自己的字典,使用适当的标记器将句子拆分为不同的标记,并通过查找操作获取字典中标记的索引。\n",
    "\n",
    "MindSpore目前提供的分词器如下表所示。此外,用户也可以根据需要实现自定义的分词器。\n",
    "\n",
    "| 分词器 | 分词器说明 |\n",
    "| :-- | :-- |\n",
    "| BasicTokenizer | 根据指定规则对标量文本数据进行分词。 |\n",
    "| BertTokenizer | 用于处理Bert文本数据的分词器。 |\n",
    "| JiebaTokenizer | 基于字典的中文字符串分词器。 |\n",
    "| RegexTokenizer | 根据指定正则表达式对标量文本数据进行分词。 |\n",
    "| SentencePieceTokenizer | 基于SentencePiece开源工具包进行分词。 |\n",
    "| UnicodeCharTokenizer | 将标量文本数据分词为Unicode字符。 |\n",
    "| UnicodeScriptTokenizer | 根据Unicode边界对标量文本数据进行分词。 |\n",
    "| WhitespaceTokenizer | 根据空格符对标量文本数据进行分词。 |\n",
    "| WordpieceTokenizer | 根据单词集对标量文本数据进行分词。 |\n",
    "\n",
    "更多分词器的详细说明,可以参见[API文档](https://www.mindspore.cn/doc/api_python/zh-CN/r1.2/mindspore/mindspore.dataset.text.html)。"
   ]
  },
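  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before turning to the individual tokenizers, the sketch below illustrates the overall workflow described above: build a vocabulary, tokenize a sentence, and map the resulting tokens to their indices with the `text.Lookup` operation. This is a minimal sketch; the sentence and vocabulary are made up for illustration.\n",
    "\n",
    "```python\n",
    "import mindspore.dataset as ds\n",
    "import mindspore.dataset.text as text\n",
    "\n",
    "# toy sentence and vocabulary, used only for illustration\n",
    "dataset = ds.NumpySlicesDataset([\"hello world\"], column_names=[\"text\"], shuffle=False)\n",
    "vocab = text.Vocab.from_list([\"hello\", \"world\", \"<unk>\"])\n",
    "\n",
    "# split on whitespace, then replace each token by its index in the vocabulary\n",
    "dataset = dataset.map(operations=text.WhitespaceTokenizer())\n",
    "dataset = dataset.map(operations=text.Lookup(vocab, unknown_token=\"<unk>\"))\n",
    "\n",
    "for row in dataset.create_dict_iterator(num_epochs=1, output_numpy=True):\n",
    "    print(row[\"text\"])  # token ids, e.g. [0 1]\n",
    "```"
   ]
  },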
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## MindSpore分词器\n",
    "\n",
    "下面介绍几种常用分词器的使用方法。\n",
    "\n",
    "### BertTokenizer\n",
    "\n",
    "`BertTokenizer`是通过调用`BasicTokenizer`和`WordpieceTokenizer`来进行分词的。\n",
    "\n",
    "下面的样例首先构建了一个文本数据集和字符串列表,然后通过`BertTokenizer`对数据集进行分词,并展示了分词前后的文本结果。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "------------------------before tokenization----------------------------\n",
      "床前明月光\n",
      "疑是地上霜\n",
      "举头望明月\n",
      "低头思故乡\n",
      "I am making small mistakes during working hours\n",
      "😀嘿嘿😃哈哈😄大笑😁嘻嘻\n",
      "繁體字\n",
      "------------------------after tokenization-----------------------------\n",
      "['床' '前' '明' '月' '光']\n",
      "['疑' '是' '地' '上' '霜']\n",
      "['举' '头' '望' '明' '月']\n",
      "['低' '头' '思' '故' '乡']\n",
      "['I' 'am' 'mak' '##ing' 'small' 'mistake' '##s' 'during' 'work' '##ing'\n",
      " 'hour' '##s']\n",
      "['😀' '嘿' '嘿' '😃' '哈' '哈' '😄' '大' '笑' '😁' '嘻' '嘻']\n",
      "['繁' '體' '字']\n"
     ]
    }
   ],
   "source": [
    "import mindspore.dataset as ds\n",
    "import mindspore.dataset.text as text\n",
    "\n",
    "input_list = [\"床前明月光\", \"疑是地上霜\", \"举头望明月\", \"低头思故乡\", \"I am making small mistakes during working hours\",\n",
    "                \"😀嘿嘿😃哈哈😄大笑😁嘻嘻\", \"繁體字\"]\n",
    "dataset = ds.NumpySlicesDataset(input_list, column_names=[\"text\"], shuffle=False)\n",
    "\n",
    "print(\"------------------------before tokenization----------------------------\")\n",
    "\n",
    "for data in dataset.create_dict_iterator(output_numpy=True):\n",
    "    print(text.to_str(data['text']))\n",
    "\n",
    "vocab_list = [\n",
    "  \"床\", \"前\", \"明\", \"月\", \"光\", \"疑\", \"是\", \"地\", \"上\", \"霜\", \"举\", \"头\", \"望\", \"低\", \"思\", \"故\", \"乡\",\n",
    "  \"繁\", \"體\", \"字\", \"嘿\", \"哈\", \"大\", \"笑\", \"嘻\", \"i\", \"am\", \"mak\", \"make\", \"small\", \"mistake\",\n",
    "  \"##s\", \"during\", \"work\", \"##ing\", \"hour\", \"😀\", \"😃\", \"😄\", \"😁\", \"+\", \"/\", \"-\", \"=\", \"12\",\n",
    "  \"28\", \"40\", \"16\", \" \", \"I\", \"[CLS]\", \"[SEP]\", \"[UNK]\", \"[PAD]\", \"[MASK]\", \"[unused1]\", \"[unused10]\"]\n",
    "\n",
    "vocab = text.Vocab.from_list(vocab_list)\n",
    "tokenizer_op = text.BertTokenizer(vocab=vocab)\n",
    "dataset = dataset.map(operations=tokenizer_op)\n",
    "\n",
    "print(\"------------------------after tokenization-----------------------------\")\n",
    "\n",
    "for i in dataset.create_dict_iterator(num_epochs=1, output_numpy=True):\n",
    "    print(text.to_str(i['text']))"
   ]
  },
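  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the example above, `BertTokenizer` first runs `BasicTokenizer` (splitting on whitespace and punctuation, with optional lower-casing) and then applies WordPiece. To show the first stage in isolation, here is a minimal sketch that uses `BasicTokenizer` on its own; the `lower_case` argument and the output shown in the comment are assumptions based on the r1.2 API reference, not executed results.\n",
    "\n",
    "```python\n",
    "import mindspore.dataset as ds\n",
    "import mindspore.dataset.text as text\n",
    "\n",
    "dataset = ds.NumpySlicesDataset([\"Welcome to Beijing!\"], column_names=[\"text\"], shuffle=False)\n",
    "\n",
    "# lower_case=True folds the text to lower case before splitting\n",
    "basic_op = text.BasicTokenizer(lower_case=True)\n",
    "dataset = dataset.map(operations=basic_op)\n",
    "\n",
    "for row in dataset.create_dict_iterator(num_epochs=1, output_numpy=True):\n",
    "    print(text.to_str(row[\"text\"]))  # expected to be close to: ['welcome' 'to' 'beijing' '!']\n",
    "```"
   ]
  },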
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### JiebaTokenizer\n",
    "\n",
    "`JiebaTokenizer`是基于jieba的中文分词。\n",
    "\n",
    "下载字典文件`hmm_model.utf8`和`jieba.dict.utf8`,并将其放到指定位置。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "./datasets/tokenizer/\n",
      "├── hmm_model.utf8\n",
      "└── jieba.dict.utf8\n",
      "\n",
      "0 directories, 2 files\n"
     ]
    }
   ],
   "source": [
    "!wget -N https://obs.dualstack.cn-north-4.myhuaweicloud.com/mindspore-website/notebook/datasets/hmm_model.utf8\n",
    "!wget -N https://obs.dualstack.cn-north-4.myhuaweicloud.com/mindspore-website/notebook/datasets/jieba.dict.utf8\n",
    "!mkdir -p ./datasets/tokenizer/\n",
    "!mv hmm_model.utf8 jieba.dict.utf8 -t ./datasets/tokenizer/\n",
    "!tree ./datasets/tokenizer/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "下面的样例首先构建了一个文本数据集,然后使用HMM与MP字典文件创建`JiebaTokenizer`对象,并对数据集进行分词,最后展示了分词前后的文本结果。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "------------------------before tokenization----------------------------\n",
      "今天天气太好了我们一起去外面玩吧\n",
      "------------------------after tokenization-----------------------------\n",
      "['今天天气' '太好了' '我们' '一起' '去' '外面' '玩吧']\n"
     ]
    }
   ],
   "source": [
    "import mindspore.dataset as ds\n",
    "import mindspore.dataset.text as text\n",
    "\n",
    "input_list = [\"今天天气太好了我们一起去外面玩吧\"]\n",
    "dataset = ds.NumpySlicesDataset(input_list, column_names=[\"text\"], shuffle=False)\n",
    "\n",
    "print(\"------------------------before tokenization----------------------------\")\n",
    "\n",
    "for data in dataset.create_dict_iterator(output_numpy=True):\n",
    "    print(text.to_str(data['text']))\n",
    "\n",
    "# files from open source repository https://github.com/yanyiwu/cppjieba/tree/master/dict\n",
    "HMM_FILE = \"./datasets/tokenizer/hmm_model.utf8\"\n",
    "MP_FILE = \"./datasets/tokenizer/jieba.dict.utf8\"\n",
    "jieba_op = text.JiebaTokenizer(HMM_FILE, MP_FILE)\n",
    "dataset = dataset.map(operations=jieba_op, input_columns=[\"text\"], num_parallel_workers=1)\n",
    "\n",
    "print(\"------------------------after tokenization-----------------------------\")\n",
    "\n",
    "for i in dataset.create_dict_iterator(num_epochs=1, output_numpy=True):\n",
    "    print(text.to_str(i['text']))"
   ]
  },
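  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If a domain-specific word gets split into pieces, `JiebaTokenizer` also lets you extend its dictionary at runtime via `add_word`. The sketch below is illustrative only; the word and sentence are made up, and the exact behaviour should be checked against the r1.2 API reference.\n",
    "\n",
    "```python\n",
    "import mindspore.dataset as ds\n",
    "import mindspore.dataset.text as text\n",
    "\n",
    "HMM_FILE = \"./datasets/tokenizer/hmm_model.utf8\"\n",
    "MP_FILE = \"./datasets/tokenizer/jieba.dict.utf8\"\n",
    "\n",
    "jieba_op = text.JiebaTokenizer(HMM_FILE, MP_FILE)\n",
    "# register a user-defined word so it is kept as a single token (a frequency can also be given)\n",
    "jieba_op.add_word(\"明长城\")\n",
    "\n",
    "dataset = ds.NumpySlicesDataset([\"明长城的风景很美\"], column_names=[\"text\"], shuffle=False)\n",
    "dataset = dataset.map(operations=jieba_op, input_columns=[\"text\"], num_parallel_workers=1)\n",
    "\n",
    "for row in dataset.create_dict_iterator(num_epochs=1, output_numpy=True):\n",
    "    print(text.to_str(row[\"text\"]))\n",
    "```"
   ]
  },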
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### SentencePieceTokenizer\n",
    "\n",
    "`SentencePieceTokenizer`是基于[SentencePiece](https://github.com/google/sentencepiece)这个开源的自然语言处理工具包。\n",
    "\n",
    "下载文本数据集文件`botchan.txt`,并将其放置到指定位置。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "./datasets/tokenizer/\n",
      "└── botchan.txt\n",
      "\n",
      "0 directories, 1 files\n"
     ]
    }
   ],
   "source": [
    "!wget -N https://obs.dualstack.cn-north-4.myhuaweicloud.com/mindspore-website/notebook/datasets/botchan.txt\n",
    "!mkdir -p ./datasets/tokenizer/\n",
    "!mv botchan.txt ./datasets/tokenizer/\n",
    "!tree ./datasets/tokenizer/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "下面的样例首先构建了一个文本数据集,然后从`vocab_file`文件中构建一个`vocab`对象,再通过`SentencePieceTokenizer`对数据集进行分词,并展示了分词前后的文本结果。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "------------------------before tokenization----------------------------\n",
      "I saw a girl with a telescope.\n",
      "------------------------after tokenization-----------------------------\n",
      "['▁I' '▁sa' 'w' '▁a' '▁girl' '▁with' '▁a' '▁te' 'les' 'co' 'pe' '.']\n"
     ]
    }
   ],
   "source": [
    "import mindspore.dataset as ds\n",
    "import mindspore.dataset.text as text\n",
    "from mindspore.dataset.text import SentencePieceModel, SPieceTokenizerOutType\n",
    "\n",
    "input_list = [\"I saw a girl with a telescope.\"]\n",
    "dataset = ds.NumpySlicesDataset(input_list, column_names=[\"text\"], shuffle=False)\n",
    "\n",
    "print(\"------------------------before tokenization----------------------------\")\n",
    "\n",
    "for data in dataset.create_dict_iterator(output_numpy=True):\n",
    "    print(text.to_str(data['text']))\n",
    "\n",
    "# file from MindSpore repository https://gitee.com/mindspore/mindspore/blob/r1.2/tests/ut/data/dataset/test_sentencepiece/botchan.txt\n",
    "vocab_file = \"./datasets/tokenizer/botchan.txt\"\n",
    "vocab = text.SentencePieceVocab.from_file([vocab_file], 5000, 0.9995, SentencePieceModel.UNIGRAM, {})\n",
    "tokenizer_op = text.SentencePieceTokenizer(vocab, out_type=SPieceTokenizerOutType.STRING)\n",
    "dataset = dataset.map(operations=tokenizer_op)\n",
    "\n",
    "print(\"------------------------after tokenization-----------------------------\")\n",
    "\n",
    "for i in dataset.create_dict_iterator(num_epochs=1, output_numpy=True):\n",
    "    print(text.to_str(i['text']))"
   ]
  },
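  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The trained vocabulary can also emit the integer ids of the subword pieces instead of the piece strings by switching `out_type` to `SPieceTokenizerOutType.INT`. A minimal sketch that retrains the same vocabulary as above:\n",
    "\n",
    "```python\n",
    "import mindspore.dataset as ds\n",
    "import mindspore.dataset.text as text\n",
    "from mindspore.dataset.text import SentencePieceModel, SPieceTokenizerOutType\n",
    "\n",
    "vocab_file = \"./datasets/tokenizer/botchan.txt\"\n",
    "vocab = text.SentencePieceVocab.from_file([vocab_file], 5000, 0.9995, SentencePieceModel.UNIGRAM, {})\n",
    "\n",
    "dataset = ds.NumpySlicesDataset([\"I saw a girl with a telescope.\"], column_names=[\"text\"], shuffle=False)\n",
    "\n",
    "# out_type=INT returns the ids of the subword pieces rather than the strings\n",
    "tokenizer_op = text.SentencePieceTokenizer(vocab, out_type=SPieceTokenizerOutType.INT)\n",
    "dataset = dataset.map(operations=tokenizer_op)\n",
    "\n",
    "for row in dataset.create_dict_iterator(num_epochs=1, output_numpy=True):\n",
    "    print(row[\"text\"])\n",
    "```"
   ]
  },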
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### UnicodeCharTokenizer\n",
    "\n",
    "`UnicodeCharTokenizer`是根据Unicode字符集来分词的。\n",
    "\n",
    "下面的样例首先构建了一个文本数据集,然后通过`UnicodeCharTokenizer`对数据集进行分词,并展示了分词前后的文本结果。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "------------------------before tokenization----------------------------\n",
      "Welcome to Beijing!\n",
      "北京欢迎您!\n",
      "我喜欢English!\n",
      "------------------------after tokenization-----------------------------\n",
      "['W', 'e', 'l', 'c', 'o', 'm', 'e', ' ', 't', 'o', ' ', 'B', 'e', 'i', 'j', 'i', 'n', 'g', '!']\n",
      "['北', '京', '欢', '迎', '您', '!']\n",
      "['我', '喜', '欢', 'E', 'n', 'g', 'l', 'i', 's', 'h', '!']\n"
     ]
    }
   ],
   "source": [
    "import mindspore.dataset as ds\n",
    "import mindspore.dataset.text as text\n",
    "\n",
    "input_list = [\"Welcome to Beijing!\", \"北京欢迎您!\", \"我喜欢English!\"]\n",
    "dataset = ds.NumpySlicesDataset(input_list, column_names=[\"text\"], shuffle=False)\n",
    "\n",
    "print(\"------------------------before tokenization----------------------------\")\n",
    "\n",
    "for data in dataset.create_dict_iterator(output_numpy=True):\n",
    "    print(text.to_str(data['text']))\n",
    "\n",
    "tokenizer_op = text.UnicodeCharTokenizer()\n",
    "dataset = dataset.map(operations=tokenizer_op)\n",
    "\n",
    "print(\"------------------------after tokenization-----------------------------\")\n",
    "\n",
    "for i in dataset.create_dict_iterator(num_epochs=1, output_numpy=True):\n",
    "    print(text.to_str(i['text']).tolist())"
   ]
  },
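  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A related operation from the table in the overview is `UnicodeScriptTokenizer`, which splits at Unicode script boundaries rather than after every character. The sketch below is a minimal illustration; the `keep_whitespace` argument and the splits described in the comment are assumptions based on the r1.2 API reference.\n",
    "\n",
    "```python\n",
    "import mindspore.dataset as ds\n",
    "import mindspore.dataset.text as text\n",
    "\n",
    "dataset = ds.NumpySlicesDataset([\"Welcome to Beijing!\", \"我喜欢English!\"], column_names=[\"text\"], shuffle=False)\n",
    "\n",
    "# keep_whitespace=False drops whitespace tokens from the output\n",
    "tokenizer_op = text.UnicodeScriptTokenizer(keep_whitespace=False)\n",
    "dataset = dataset.map(operations=tokenizer_op)\n",
    "\n",
    "for row in dataset.create_dict_iterator(num_epochs=1, output_numpy=True):\n",
    "    # tokens are expected to break where the script changes, e.g. between '我喜欢' and 'English'\n",
    "    print(text.to_str(row[\"text\"]).tolist())\n",
    "```"
   ]
  },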
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### WhitespaceTokenizer\n",
    "\n",
    "`WhitespaceTokenizer`是根据空格来进行分词的。\n",
    "\n",
    "下面的样例首先构建了一个文本数据集,然后通过`WhitespaceTokenizer`对数据集进行分词,并展示了分词前后的文本结果。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "------------------------before tokenization----------------------------\n",
      "Welcome to Beijing!\n",
      "北京欢迎您!\n",
      "我喜欢English!\n",
      "------------------------after tokenization-----------------------------\n",
      "['Welcome', 'to', 'Beijing!']\n",
      "['北京欢迎您!']\n",
      "['我喜欢English!']\n"
     ]
    }
   ],
   "source": [
    "import mindspore.dataset as ds\n",
    "import mindspore.dataset.text as text\n",
    "\n",
    "input_list = [\"Welcome to Beijing!\", \"北京欢迎您!\", \"我喜欢English!\"]\n",
    "dataset = ds.NumpySlicesDataset(input_list, column_names=[\"text\"], shuffle=False)\n",
    "\n",
    "print(\"------------------------before tokenization----------------------------\")\n",
    "\n",
    "for data in dataset.create_dict_iterator(output_numpy=True):\n",
    "    print(text.to_str(data['text']))\n",
    "\n",
    "tokenizer_op = text.WhitespaceTokenizer()\n",
    "dataset = dataset.map(operations=tokenizer_op)\n",
    "\n",
    "print(\"------------------------after tokenization-----------------------------\")\n",
    "\n",
    "for i in dataset.create_dict_iterator(num_epochs=1, output_numpy=True):\n",
    "    print(text.to_str(i['text']).tolist())"
   ]
  },
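  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When splitting only on whitespace is too coarse, the overview table also lists `RegexTokenizer`, which splits wherever a regular expression matches. A minimal sketch, assuming the r1.2 signature `RegexTokenizer(delim_pattern, keep_delim_pattern)`:\n",
    "\n",
    "```python\n",
    "import mindspore.dataset as ds\n",
    "import mindspore.dataset.text as text\n",
    "\n",
    "dataset = ds.NumpySlicesDataset([\"Welcome to Beijing!\"], column_names=[\"text\"], shuffle=False)\n",
    "\n",
    "# split on runs of spaces; an empty keep_delim_pattern drops the delimiters from the output\n",
    "tokenizer_op = text.RegexTokenizer(delim_pattern=\" +\", keep_delim_pattern=\"\")\n",
    "dataset = dataset.map(operations=tokenizer_op)\n",
    "\n",
    "for row in dataset.create_dict_iterator(num_epochs=1, output_numpy=True):\n",
    "    print(text.to_str(row[\"text\"]).tolist())  # expected: ['Welcome', 'to', 'Beijing!']\n",
    "```"
   ]
  },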
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### WordpieceTokenizer\n",
    "\n",
    "`WordpieceTokenizer`是基于单词集来进行划分的,划分依据可以是单词集中的单个单词,或者多个单词的组合形式。\n",
    "\n",
    "下面的样例首先构建了一个文本数据集,然后从单词列表中构建`vocab`对象,通过`WordpieceTokenizer`对数据集进行分词,并展示了分词前后的文本结果。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "------------------------before tokenization----------------------------\n",
      "my\n",
      "favorite\n",
      "book\n",
      "is\n",
      "love\n",
      "during\n",
      "the\n",
      "cholera\n",
      "era\n",
      "what\n",
      "我\n",
      "最\n",
      "喜\n",
      "欢\n",
      "的\n",
      "书\n",
      "是\n",
      "霍\n",
      "乱\n",
      "时\n",
      "期\n",
      "的\n",
      "爱\n",
      "情\n",
      "您\n",
      "------------------------after tokenization-----------------------------\n",
      "['my']\n",
      "['favor' '##ite']\n",
      "['book']\n",
      "['is']\n",
      "['love']\n",
      "['dur' '##ing']\n",
      "['the']\n",
      "['cholera']\n",
      "['era']\n",
      "['[UNK]']\n",
      "['我']\n",
      "['最']\n",
      "['喜']\n",
      "['欢']\n",
      "['的']\n",
      "['书']\n",
      "['是']\n",
      "['霍']\n",
      "['乱']\n",
      "['时']\n",
      "['期']\n",
      "['的']\n",
      "['爱']\n",
      "['情']\n",
      "['[UNK]']\n"
     ]
    }
   ],
   "source": [
    "import mindspore.dataset as ds\n",
    "import mindspore.dataset.text as text\n",
    "\n",
    "input_list = [\"my\", \"favorite\", \"book\", \"is\", \"love\", \"during\", \"the\", \"cholera\", \"era\", \"what\",\n",
    "    \"我\", \"最\", \"喜\", \"欢\", \"的\", \"书\", \"是\", \"霍\", \"乱\", \"时\", \"期\", \"的\", \"爱\", \"情\", \"您\"]\n",
    "vocab_english = [\"book\", \"cholera\", \"era\", \"favor\", \"##ite\", \"my\", \"is\", \"love\", \"dur\", \"##ing\", \"the\"]\n",
    "vocab_chinese = [\"我\", '最', '喜', '欢', '的', '书', '是', '霍', '乱', '时', '期', '爱', '情']\n",
    "\n",
    "dataset = ds.NumpySlicesDataset(input_list, column_names=[\"text\"], shuffle=False)\n",
    "\n",
    "print(\"------------------------before tokenization----------------------------\")\n",
    "\n",
    "for data in dataset.create_dict_iterator(output_numpy=True):\n",
    "    print(text.to_str(data['text']))\n",
    "\n",
    "vocab = text.Vocab.from_list(vocab_english+vocab_chinese)\n",
    "tokenizer_op = text.WordpieceTokenizer(vocab=vocab)\n",
    "dataset = dataset.map(operations=tokenizer_op)\n",
    "\n",
    "print(\"------------------------after tokenization-----------------------------\")\n",
    "\n",
    "for i in dataset.create_dict_iterator(num_epochs=1, output_numpy=True):\n",
    "    print(text.to_str(i['text']))"
   ]
  },
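  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the output above, \"what\" and \"您\" cannot be composed from the vocabulary and therefore come out as `[UNK]`. The replacement string is controlled by the `unknown_token` argument of `WordpieceTokenizer` (default `'[UNK]'`). The sketch below uses a made-up marker to make the substitution visible; treat it as an illustration of the argument rather than an executed example.\n",
    "\n",
    "```python\n",
    "import mindspore.dataset as ds\n",
    "import mindspore.dataset.text as text\n",
    "\n",
    "vocab = text.Vocab.from_list([\"book\", \"favor\", \"##ite\"])\n",
    "dataset = ds.NumpySlicesDataset([\"favorite\", \"book\", \"what\"], column_names=[\"text\"], shuffle=False)\n",
    "\n",
    "# out-of-vocabulary words are replaced by unknown_token\n",
    "tokenizer_op = text.WordpieceTokenizer(vocab=vocab, unknown_token=\"<oov>\")\n",
    "dataset = dataset.map(operations=tokenizer_op)\n",
    "\n",
    "for row in dataset.create_dict_iterator(num_epochs=1, output_numpy=True):\n",
    "    print(text.to_str(row[\"text\"]))  # 'what' is expected to come out as '<oov>'\n",
    "```"
   ]
  }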
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}