Illustration of text transforms

This example illustrates the various transforms available in the mindspore.dataset.text module.

Preparation

[1]:
import os
from download import download

import mindspore.dataset as ds
import mindspore.dataset.text as text

# Download opensource datasets
# citation: https://www.kaggle.com/datasets/drknope/bertbaseuncasedvocab
url = "https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/bert-base-uncased-vocab.txt"
download(url, './bert-base-uncased-vocab.txt', replace=True)

url = "https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/article.txt"
download(url, './article.txt', replace=True)

# Show the directory
print(os.listdir())

def call_op(op, data):
    print(op(data), flush=True)
Downloading data from https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/bert-base-uncased-vocab.txt (226 kB)

file_sizes: 100%|████████████████████████████| 232k/232k [00:00<00:00, 2.21MB/s]
Successfully downloaded file to ./bert-base-uncased-vocab.txt
Downloading data from https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/article.txt (9 kB)

file_sizes: 100%|██████████████████████████| 9.06k/9.06k [00:00<00:00, 1.83MB/s]
Successfully downloaded file to ./article.txt
['text_gallery.ipynb', 'article.txt', 'bert-base-uncased-vocab.txt']

Vocab

The mindspore.dataset.text.Vocab is used to store pairs of words and ids. It contains a map between each word (str) and its id (int) and supports lookups in both directions.

[2]:
# Load bert vocab (deduplication via set makes the id order non-deterministic)
with open("bert-base-uncased-vocab.txt") as vocab_file:
    vocab_content = list(set(vocab_file.read().splitlines()))
vocab = text.Vocab.from_list(vocab_content)

# lookup tokens to ids
ids = vocab.tokens_to_ids(["good", "morning"])
print("ids", ids)

# lookup ids to tokens
tokens = vocab.ids_to_tokens([128, 256])
print("tokens", tokens)

# Use the Lookup op to look up ids
op = text.Lookup(vocab)
ids = op(["good", "morning"])
print("lookup: ids", ids)
ids [18863, 18279]
tokens ['##nology', 'crystalline']
lookup: ids [18863 18279]
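
Because the cell above builds the vocabulary from a deduplicated set, the ids it produces can change between runs. As a minimal sketch (assuming the vocabulary file stores one token per line), mindspore.dataset.text.Vocab.from_file builds the mapping deterministically, assigning ids in line order:

[ ]:
# Sketch: build the vocab directly from the file; ids follow the
# file's line order, so repeated runs yield the same mapping
vocab_from_file = text.Vocab.from_file("bert-base-uncased-vocab.txt")
print(vocab_from_file.tokens_to_ids(["good", "morning"]))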

AddToken

The mindspore.dataset.text.AddToken transform adds a token to the beginning or end of a sequence.

[3]:
txt = ["a", "b", "c", "d", "e"]
add_token_op = text.AddToken(token='TOKEN', begin=True)
call_op(add_token_op, txt)

add_token_op = text.AddToken(token='END', begin=False)
call_op(add_token_op, txt)
['TOKEN' 'a' 'b' 'c' 'd' 'e']
['a' 'b' 'c' 'd' 'e' 'END']
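
Besides this eager usage, AddToken can also run inside a dataset pipeline via map. A minimal sketch, assuming a NumpySlicesDataset as the source (any dataset with a text column works the same way):

[ ]:
# Sketch: apply AddToken within a pipeline instead of eagerly
pipeline = ds.NumpySlicesDataset([["a", "b", "c", "d", "e"]], column_names=["col"])
pipeline = pipeline.map(text.AddToken(token='TOKEN', begin=True), input_columns=["col"])
for row in pipeline.create_tuple_iterator(output_numpy=True):
    print(row[0])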

SentencePieceTokenizer

The mindspore.dataset.text.SentencePieceTokenizer transform tokenizes a scalar string into tokens using a SentencePiece model.

[4]:
# Construct a SentencePieceVocab model
url = "https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/sentencepiece.bpe.model"
download(url, './sentencepiece.bpe.model', replace=True)
sentence_piece_vocab_file = './sentencepiece.bpe.model'

# Use the model to tokenize text
tokenizer = text.SentencePieceTokenizer(sentence_piece_vocab_file, out_type=text.SPieceTokenizerOutType.STRING)
txt = "Today is Tuesday."
call_op(tokenizer, txt)
Downloading data from https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/sentencepiece.bpe.model (4.8 MB)

file_sizes: 100%|██████████████████████████| 5.07M/5.07M [00:01<00:00, 2.93MB/s]
Successfully downloaded file to ./sentencepiece.bpe.model
['▁Today' '▁is' '▁Tuesday' '.']
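
Setting out_type to SPieceTokenizerOutType.INT makes the tokenizer emit subword ids instead of strings; a brief sketch reusing the model downloaded above:

[ ]:
# Sketch: emit subword ids rather than subword strings
id_tokenizer = text.SentencePieceTokenizer(sentence_piece_vocab_file,
                                           out_type=text.SPieceTokenizerOutType.INT)
call_op(id_tokenizer, "Today is Tuesday.")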

WordpieceTokenizer

The mindspore.dataset.text.WordpieceTokenizer transform tokenizes the input text into subword tokens.

[5]:
# Reuse the vocab defined above as input vocab
tokenizer = text.WordpieceTokenizer(vocab=vocab, unknown_token='[UNK]')
txt = ["tokenizer", "will", "outputs", "subwords"]
call_op(tokenizer, txt)
['token' '##izer' 'will' 'outputs' 'sub' '##words']
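
In a pipeline, wordpiece tokenization is typically applied to pre-split words. A minimal sketch, assuming a whitespace pre-tokenization step before the subword decomposition:

[ ]:
# Sketch: whitespace-split sentences first, then decompose into subwords
pipeline = ds.NumpySlicesDataset(["tokenizer will outputs subwords"], column_names=["text"])
pipeline = pipeline.map(text.WhitespaceTokenizer(), input_columns=["text"])
pipeline = pipeline.map(text.WordpieceTokenizer(vocab=vocab, unknown_token='[UNK]'), input_columns=["text"])
for row in pipeline.create_tuple_iterator(output_numpy=True):
    print(row[0])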

Process TXT File In Dataset Pipeline

Use mindspore.dataset.TextFileDataset to read text content into the dataset pipeline and then perform tokenization on the text.

[ ]:
# Load text content into dataset pipeline
text_file = "article.txt"
dataset = ds.TextFileDataset(dataset_files=text_file, shuffle=False)

# check the column names inside the dataset
print("column names:", dataset.get_col_names())

# tokenize all text content into tokens with bert vocab
dataset = dataset.map(text.BertTokenizer(vocab=vocab), input_columns=["text"])

for data in dataset:
    print(data)
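
To produce model-ready ids, a Lookup step can follow the tokenizer in the same pipeline. A minimal sketch continuing from the cell above (the Lookup reuses the bert vocab defined earlier):

[ ]:
# Sketch: map the subword tokens to vocabulary ids inside the pipeline
dataset = dataset.map(text.Lookup(vocab), input_columns=["text"])
for data in dataset:
    print(data)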