Illustration of text transforms

This example illustrates the various transforms available in the mindspore.dataset.text module.

Preparation

[1]:
import os
from download import download

import mindspore.dataset as ds
import mindspore.dataset.text as text

# Download open-source datasets
# citation: https://www.kaggle.com/datasets/drknope/bertbaseuncasedvocab
url = "https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/bert-base-uncased-vocab.txt"
download(url, './bert-base-uncased-vocab.txt', replace=True)

url = "https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/article.txt"
download(url, './article.txt', replace=True)

# Show the directory
print(os.listdir())

# Helper that applies a transform eagerly to the given data and prints the result
def call_op(op, data):
    print(op(data), flush=True)
Downloading data from https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/bert-base-uncased-vocab.txt (226 kB)

file_sizes: 100%|████████████████████████████| 232k/232k [00:00<00:00, 2.21MB/s]
Successfully downloaded file to ./bert-base-uncased-vocab.txt
Downloading data from https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/article.txt (9 kB)

file_sizes: 100%|██████████████████████████| 9.06k/9.06k [00:00<00:00, 1.83MB/s]
Successfully downloaded file to ./article.txt
['text_gallery.ipynb', 'article.txt', 'bert-base-uncased-vocab.txt']

Vocab

The mindspore.dataset.text.Vocab is used to store pairs of words and ids. It holds a mapping between each word (str) and its id (int) and supports lookup in both directions.

[2]:
# Load the BERT vocab file and build a Vocab from its unique tokens.
# Note: set() deduplicates the tokens, so the resulting ids do not follow
# the original line order of the file.
with open("bert-base-uncased-vocab.txt") as vocab_file:
    vocab_content = list(set(vocab_file.read().splitlines()))
vocab = text.Vocab.from_list(vocab_content)

# lookup tokens to ids
ids = vocab.tokens_to_ids(["good", "morning"])
print("ids", ids)

# lookup ids to tokens
tokens = vocab.ids_to_tokens([128, 256])
print("tokens", tokens)

# Use the Lookup op to look up ids
op = text.Lookup(vocab)
ids = op(["good", "morning"])
print("lookup: ids", ids)
ids [18863, 18279]
tokens ['##nology', 'crystalline']
lookup: ids [18863 18279]
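
A Vocab can also be constructed directly from a vocabulary file or from an existing word-to-id dict. Below is a minimal, optional sketch; from_file reads the tokens in file order, so the resulting ids follow the line order of the file.

[ ]:
# Build a Vocab straight from the vocabulary file (one token per line)
file_vocab = text.Vocab.from_file("bert-base-uncased-vocab.txt")
print("from_file ids", file_vocab.tokens_to_ids(["good", "morning"]))

# Build a Vocab from an explicit word-to-id dict
dict_vocab = text.Vocab.from_dict({"good": 0, "morning": 1, "[UNK]": 2})
print("from_dict ids", dict_vocab.tokens_to_ids(["good", "morning"]))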

AddToken

The mindspore.dataset.text.AddToken transform adds a token to the beginning or end of a sequence.

[3]:
txt = ["a", "b", "c", "d", "e"]
add_token_op = text.AddToken(token='TOKEN', begin=True)
call_op(add_token_op, txt)

add_token_op = text.AddToken(token='END', begin=False)
call_op(add_token_op, txt)
['TOKEN' 'a' 'b' 'c' 'd' 'e']
['a' 'b' 'c' 'd' 'e' 'END']
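
AddToken can also be applied inside a dataset pipeline via map. A minimal sketch, assuming a small in-memory dataset built with NumpySlicesDataset:

[ ]:
# Prepend a [CLS] marker to every sample in a small in-memory pipeline
pipeline = ds.NumpySlicesDataset({"text": [["hello", "world"], ["good", "morning"]]}, shuffle=False)
pipeline = pipeline.map(text.AddToken(token='[CLS]', begin=True), input_columns=["text"])
for item in pipeline.create_dict_iterator(output_numpy=True):
    print(item["text"])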

SentencePieceTokenizer

The mindspore.dataset.text.SentencePieceTokenizer transform tokenizes a scalar string into subwords using a SentencePiece model.

[4]:
# Construct a SentencePieceVocab model
url = "https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/sentencepiece.bpe.model"
download(url, './sentencepiece.bpe.model', replace=True)
sentence_piece_vocab_file = './sentencepiece.bpe.model'

# Use the model to tokenize text
tokenizer = text.SentencePieceTokenizer(sentence_piece_vocab_file, out_type=text.SPieceTokenizerOutType.STRING)
txt = "Today is Tuesday."
call_op(tokenizer, txt)
Downloading data from https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/sentencepiece.bpe.model (4.8 MB)

file_sizes: 100%|██████████████████████████| 5.07M/5.07M [00:01<00:00, 2.93MB/s]
Successfully downloaded file to ./sentencepiece.bpe.model
['▁Today' '▁is' '▁Tuesday' '.']
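
The same model can also emit subword ids instead of subword strings by switching the output type. A minimal sketch reusing the model file downloaded above:

[ ]:
# Output subword ids instead of subword strings
tokenizer = text.SentencePieceTokenizer(sentence_piece_vocab_file, out_type=text.SPieceTokenizerOutType.INT)
call_op(tokenizer, "Today is Tuesday.")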

WordpieceTokenizer

The mindspore.dataset.text.WordpieceTokenizer transform tokenizes the input words into subword tokens based on a vocabulary.

[5]:
# Reuse the vocab defined above as input vocab
tokenizer = text.WordpieceTokenizer(vocab=vocab, unknown_token='[UNK]')
txt = ["tokenizer", "will", "outputs", "subwords"]
call_op(tokenizer, txt)
['token' '##izer' 'will' 'outputs' 'sub' '##words']
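
Because the transform expects words that are already split, a raw sentence can be lower-cased (the BERT vocab is uncased) and whitespace-split before the wordpiece step. A minimal sketch:

[ ]:
# Split a raw sentence on whitespace, then apply the wordpiece step to each word
sentence = "Transformers are wonderful"
call_op(tokenizer, sentence.lower().split())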

Process TXT File In Dataset Pipeline

Use mindspore.dataset.TextFileDataset to read the text content into a dataset pipeline and then perform tokenization on it.

[ ]:
# Load text content into dataset pipeline
text_file = "article.txt"
dataset = ds.TextFileDataset(dataset_files=text_file, shuffle=False)

# check the column names inside the dataset
print("column names:", dataset.get_col_names())

# Tokenize all text content into subword tokens with the BERT vocab
dataset = dataset.map(text.BertTokenizer(vocab=vocab), input_columns=["text"])

for data in dataset:
    print(data)
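
To feed the result into a model, the tokens can further be converted to ids by appending a Lookup map to the same pipeline. A minimal sketch continuing from the cell above; passing unknown_token='[UNK]' is an assumption so that out-of-vocab subwords are absorbed instead of raising an error:

[ ]:
# Convert the BERT tokens produced above into vocabulary ids
dataset = dataset.map(text.Lookup(vocab, unknown_token='[UNK]'), input_columns=["text"])

for data in dataset.create_dict_iterator(output_numpy=True):
    print(data["text"])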