Text Transforms Gallery
This guide demonstrates the usage of the various transforms available in the mindspore.dataset.text module.
Environment Preparation
[1]:
import os
from download import download
import mindspore.dataset as ds
import mindspore.dataset.text as text
# Download opensource datasets
# citation: https://www.kaggle.com/datasets/drknope/bertbaseuncasedvocab
url = "https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/bert-base-uncased-vocab.txt"
download(url, './bert-base-uncased-vocab.txt', replace=True)
url = "https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/article.txt"
download(url, './article.txt', replace=True)
# Show the directory
print(os.listdir())
# Apply an eager text op to the input and print the result
def call_op(op, input):
    print(op(input), flush=True)
Downloading data from https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/bert-base-uncased-vocab.txt (226 kB)
file_sizes: 100%|████████████████████████████| 232k/232k [00:00<00:00, 2.21MB/s]
Successfully downloaded file to ./bert-base-uncased-vocab.txt
Downloading data from https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/article.txt (9 kB)
file_sizes: 100%|██████████████████████████| 9.06k/9.06k [00:00<00:00, 1.83MB/s]
Successfully downloaded file to ./article.txt
['text_gallery.ipynb', 'article.txt', 'bert-base-uncased-vocab.txt']
Vocab
mindspore.dataset.text.Vocab stores multiple token-ID pairs. It holds a mapping from each word (str) to an ID (int) and supports lookups in both directions.
[2]:
# Load the BERT vocab file and deduplicate its entries
with open("bert-base-uncased-vocab.txt") as vocab_file:
    vocab_content = list(set(vocab_file.read().splitlines()))
vocab = text.Vocab.from_list(vocab_content)
# lookup tokens to ids
ids = vocab.tokens_to_ids(["good", "morning"])
print("ids", ids)
# lookup ids to tokens
tokens = vocab.ids_to_tokens([128, 256])
print("tokens", tokens)
# Use Lookup op to lookup index
op = text.Lookup(vocab)
ids = op(["good", "morning"])
print("lookup: ids", ids)
ids [18863, 18279]
tokens ['##nology', 'crystalline']
lookup: ids [18863 18279]
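Because list(set(...)) deduplicates the vocabulary without preserving the file's line order, the IDs printed above depend on how Python happens to iterate the set. If the IDs should match the line order of the vocabulary file, the vocab can also be built with Vocab.from_file. The cell below is a minimal illustrative sketch added to this guide, assuming the default arguments of Vocab.from_file:
[ ]:
# Build the vocab directly from the file, preserving the original line order
file_vocab = text.Vocab.from_file("bert-base-uncased-vocab.txt")
print("ids", file_vocab.tokens_to_ids(["good", "morning"]))
print("tokens", file_vocab.ids_to_tokens([128, 256]))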
AddToken
mindspore.dataset.text.AddToken adds a token to the beginning or end of a sequence of tokens.
[3]:
txt = ["a", "b", "c", "d", "e"]
add_token_op = text.AddToken(token='TOKEN', begin=True)
call_op(add_token_op, txt)
add_token_op = text.AddToken(token='END', begin=False)
call_op(add_token_op, txt)
['TOKEN' 'a' 'b' 'c' 'd' 'e']
['a' 'b' 'c' 'd' 'e' 'END']
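Two AddToken operations can be chained to wrap a sequence with both a start and an end marker. The cell below is a small sketch added for illustration; the '[CLS]' and '[SEP]' strings are arbitrary example tokens, not required by the API:
[ ]:
# Chain two AddToken ops to add a start marker and an end marker
start_op = text.AddToken(token='[CLS]', begin=True)
end_op = text.AddToken(token='[SEP]', begin=False)
call_op(end_op, start_op(txt))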
SentencePieceTokenizer
mindspore.dataset.text.SentencePieceTokenizer tokenizes a string with the SentencePiece tokenizer.
[4]:
# Construct a SentencePieceVocab model
url = "https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/sentencepiece.bpe.model"
download(url, './sentencepiece.bpe.model', replace=True)
sentence_piece_vocab_file = './sentencepiece.bpe.model'
# Use the model to tokenize text
tokenizer = text.SentencePieceTokenizer(sentence_piece_vocab_file, out_type=text.SPieceTokenizerOutType.STRING)
txt = "Today is Tuesday."
call_op(tokenizer, txt)
Downloading data from https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/sentencepiece.bpe.model (4.8 MB)
file_sizes: 100%|██████████████████████████| 5.07M/5.07M [00:01<00:00, 2.93MB/s]
Successfully downloaded file to ./sentencepiece.bpe.model
['▁Today' '▁is' '▁Tuesday' '.']
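Besides subword strings, the tokenizer can also emit subword IDs from the same model. The cell below is a hedged sketch added for illustration, assuming out_type also accepts SPieceTokenizerOutType.INT:
[ ]:
# Output subword IDs instead of subword strings
id_tokenizer = text.SentencePieceTokenizer(sentence_piece_vocab_file, out_type=text.SPieceTokenizerOutType.INT)
call_op(id_tokenizer, txt)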
WordpieceTokenizer
mindspore.dataset.text.WordpieceTokenizer splits an input string into subwords.
[5]:
# Reuse the vocab defined above as input vocab
tokenizer = text.WordpieceTokenizer(vocab=vocab, unknown_token='[UNK]')
txt = ["tokenizer", "will", "outputs", "subwords"]
call_op(tokenizer, txt)
['token' '##izer' 'will' 'outputs' 'sub' '##words']
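A word that cannot be decomposed into in-vocabulary subwords is emitted as the unknown_token configured above. The cell below is an illustrative sketch added to this guide; it assumes the emoji does not appear in the BERT vocabulary, so it should come back as '[UNK]':
[ ]:
# An out-of-vocabulary word falls back to the unknown_token
call_op(tokenizer, ["🙂"])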
Loading and Processing TXT Files in the Dataset Pipeline
Use mindspore.dataset.TextFileDataset to load the contents of a text file on disk into the dataset pipeline, and apply a tokenizer to tokenize the loaded text.
[ ]:
# Load text content into dataset pipeline
text_file = "article.txt"
dataset = ds.TextFileDataset(dataset_files=text_file, shuffle=False)
# check the column names inside the dataset
print("column names:", dataset.get_col_names())
# tokenize all text content into tokens with bert vocab
dataset = dataset.map(text.BertTokenizer(vocab=vocab), input_columns=["text"])
for data in dataset:
    print(data)
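The pipeline can be extended with further text transforms in the same way. As a hedged sketch added to this guide (not part of the original example), the tokens produced by BertTokenizer can be mapped to IDs with the vocab defined earlier:
[ ]:
# Convert the tokens in the "text" column to IDs, then iterate the pipeline again
dataset = dataset.map(text.Lookup(vocab), input_columns=["text"])
for data in dataset:
    print(data)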