{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# 文本变换样例库\n", "\n", "[![下载Notebook](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r2.3.q1/resource/_static/logo_notebook.svg)](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/r2.3.q1/docs/api_python/samples/dataset/text_gallery.ipynb) \n", "[![查看源文件](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r2.3.q1/resource/_static/logo_source.svg)](https://gitee.com/mindspore/mindspore/blob/r2.3.q1/docs/api/api_python/samples/dataset/text_gallery.ipynb)\n", "\n", "此指南展示了[mindpore.dataset.text](https://www.mindspore.cn/docs/zh-CN/r2.3.0rc1/api_python/mindspore.dataset.transforms.html#%E6%96%87%E6%9C%AC)模块中各种变换的用法。" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## 环境准备" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Downloading data from https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/bert-base-uncased-vocab.txt (226 kB)\n", "\n", "file_sizes: 100%|████████████████████████████| 232k/232k [00:00<00:00, 2.21MB/s]\n", "Successfully downloaded file to ./bert-base-uncased-vocab.txt\n", "Downloading data from https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/article.txt (9 kB)\n", "\n", "file_sizes: 100%|██████████████████████████| 9.06k/9.06k [00:00<00:00, 1.83MB/s]\n", "Successfully downloaded file to ./article.txt\n", "['text_gallery.ipynb', 'article.txt', 'bert-base-uncased-vocab.txt']\n" ] } ], "source": [ "import os\n", "from download import download\n", "\n", "import mindspore.dataset as ds\n", "import mindspore.dataset.text as text\n", "\n", "# Download opensource datasets\n", "# citation: https://www.kaggle.com/datasets/drknope/bertbaseuncasedvocab\n", "url = \"https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/bert-base-uncased-vocab.txt\"\n", "download(url, './bert-base-uncased-vocab.txt', replace=True)\n", "\n", "url = \"https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/article.txt\"\n", "download(url, './article.txt', replace=True)\n", "\n", "# Show the directory\n", "print(os.listdir())\n", "\n", "def call_op(op, input):\n", " print(op(input), flush=True)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Vocab\n", "\n", "[mindspore.dataset.text.Vocab](https://mindspore.cn/docs/zh-CN/r2.3.0rc1/api_python/dataset_text/mindspore.dataset.text.Vocab.html#mindspore.dataset.text.Vocab) 用于存储多对字符与ID。其包含一个映射,可以将每个单词(str)映射到一个ID(int)。" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ids [18863, 18279]\n", "tokens ['##nology', 'crystalline']\n", "lookup: ids [18863 18279]\n" ] } ], "source": [ "# Load bert vocab\n", "vocab_file = open(\"bert-base-uncased-vocab.txt\")\n", "vocab_content = list(set(vocab_file.read().splitlines()))\n", "vocab = text.Vocab.from_list(vocab_content)\n", "\n", "# lookup tokens to ids\n", "ids = vocab.tokens_to_ids([\"good\", \"morning\"])\n", "print(\"ids\", ids)\n", "\n", "# lookup ids to tokens\n", "tokens = vocab.ids_to_tokens([128, 256])\n", "print(\"tokens\", tokens)\n", "\n", "# Use Lookup op to lookup index\n", "op = text.Lookup(vocab)\n", "ids = op([\"good\", \"morning\"])\n", "print(\"lookup: ids\", ids)" ] }, { "attachments": {}, "cell_type": 
"markdown", "metadata": {}, "source": [ "## AddToken\n", "\n", "[mindspore.dataset.text.AddToken](https://mindspore.cn/docs/zh-CN/r2.3.0rc1/api_python/dataset_text/mindspore.dataset.text.AddToken.html#mindspore.dataset.text.AddToken) 将分词(token)添加到序列的开头或结尾处。" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['TOKEN' 'a' 'b' 'c' 'd' 'e']\n", "['a' 'b' 'c' 'd' 'e' 'END']\n" ] } ], "source": [ "txt = [\"a\", \"b\", \"c\", \"d\", \"e\"]\n", "add_token_op = text.AddToken(token='TOKEN', begin=True)\n", "call_op(add_token_op, txt)\n", "\n", "add_token_op = text.AddToken(token='END', begin=False)\n", "call_op(add_token_op, txt)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## SentencePieceTokenizer\n", "\n", "[mindspore.dataset.text.SentencePieceTokenizer](https://mindspore.cn/docs/zh-CN/r2.3.0rc1/api_python/dataset_text/mindspore.dataset.text.SentencePieceTokenizer.html#mindspore.dataset.text.SentencePieceTokenizer) 使用SentencePiece分词器对字符串进行分词。\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Downloading data from https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/sentencepiece.bpe.model (4.8 MB)\n", "\n", "file_sizes: 100%|██████████████████████████| 5.07M/5.07M [00:01<00:00, 2.93MB/s]\n", "Successfully downloaded file to ./sentencepiece.bpe.model\n", "['▁Today' '▁is' '▁Tuesday' '.']\n" ] } ], "source": [ "# Construct a SentencePieceVocab model\n", "url = \"https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/sentencepiece.bpe.model\"\n", "download(url, './sentencepiece.bpe.model', replace=True)\n", "sentence_piece_vocab_file = './sentencepiece.bpe.model'\n", "\n", "# Use the model to tokenize text\n", "tokenizer = text.SentencePieceTokenizer(sentence_piece_vocab_file, out_type=text.SPieceTokenizerOutType.STRING)\n", "txt = \"Today is Tuesday.\"\n", "call_op(tokenizer, txt)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## WordpieceTokenizer\n", "\n", "[mindspore.dataset.text.WordpieceTokenizer](https://mindspore.cn/docs/zh-CN/r2.3.0rc1/api_python/dataset_text/mindspore.dataset.text.WordpieceTokenizer.html#mindspore.dataset.text.WordpieceTokenizer) 将输入的字符串切分为子词。" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['token' '##izer' 'will' 'outputs' 'sub' '##words']\n" ] } ], "source": [ "# Reuse the vocab defined above as input vocab\n", "tokenizer = text.WordpieceTokenizer(vocab=vocab, unknown_token='[UNK]')\n", "txt = [\"tokenizer\", \"will\", \"outputs\", \"subwords\"]\n", "call_op(tokenizer, txt)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## 在数据Pipeline中加载和处理TXT文件\n", "\n", "使用 [mindspore.dataset.TextFileDataset](https://mindspore.cn/docs/zh-CN/r2.3.0rc1/api_python/dataset/mindspore.dataset.TextFileDataset.html#mindspore.dataset.TextFileDataset) 将磁盘中的文本文件内容加载到数据Pipeline中,并应用分词器对其中的内容进行分词。" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load text content into dataset pipeline\n", "text_file = \"article.txt\"\n", "dataset = ds.TextFileDataset(dataset_files=text_file, shuffle=False)\n", "\n", "# check the column names inside the dataset\n", "print(\"column names:\", dataset.get_col_names())\n", 
"\n", "# tokenize all text content into tokens with bert vocab\n", "dataset = dataset.map(text.BertTokenizer(vocab=vocab), input_columns=[\"text\"])\n", "\n", "for data in dataset:\n", " print(data)" ] } ], "metadata": { "kernelspec": { "display_name": "ly37", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.5" }, "vscode": { "interpreter": { "hash": "9f0efe8a0d8ccef1406a56130f5ab5480567fb275f7fbf51bbc40aede97503df" } } }, "nbformat": 4, "nbformat_minor": 0 }