Illustration of text transforms

This example illustrates the various transforms available in the mindspore.dataset.text module.

Preparation

[1]:
import os
from download import download

import mindspore.dataset as ds
import mindspore.dataset.text as text

# Download opensource datasets
# citation: https://www.kaggle.com/datasets/drknope/bertbaseuncasedvocab
url = "https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/bert-base-uncased-vocab.txt"
download(url, './bert-base-uncased-vocab.txt', replace=True)

url = "https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/article.txt"
download(url, './article.txt', replace=True)

# Show the directory
print(os.listdir())

def call_op(op, data):
    print(op(data), flush=True)
Downloading data from https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/bert-base-uncased-vocab.txt (226 kB)

file_sizes: 100%|████████████████████████████| 232k/232k [00:00<00:00, 2.21MB/s]
Successfully downloaded file to ./bert-base-uncased-vocab.txt
Downloading data from https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/article.txt (9 kB)

file_sizes: 100%|██████████████████████████| 9.06k/9.06k [00:00<00:00, 1.83MB/s]
Successfully downloaded file to ./article.txt
['text_gallery.ipynb', 'article.txt', 'bert-base-uncased-vocab.txt']

Vocab

The mindspore.dataset.text.Vocab is used to store pairs of words and ids. It contains a map between each word (str) and its id (int) and supports lookups in both directions.

[2]:
# Load bert vocab (deduplication via set makes the id order non-deterministic)
with open("bert-base-uncased-vocab.txt") as vocab_file:
    vocab_content = list(set(vocab_file.read().splitlines()))
vocab = text.Vocab.from_list(vocab_content)

# lookup tokens to ids
ids = vocab.tokens_to_ids(["good", "morning"])
print("ids", ids)

# lookup ids to tokens
tokens = vocab.ids_to_tokens([128, 256])
print("tokens", tokens)

# Use the Lookup op to look up ids
op = text.Lookup(vocab)
ids = op(["good", "morning"])
print("lookup: ids", ids)
ids [18863, 18279]
tokens ['##nology', 'crystalline']
lookup: ids [18863 18279]
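
Because the cell above builds the vocabulary from a deduplicated set, the ids it produces can change between runs. As a minimal sketch (assuming the vocabulary file stores one token per line), mindspore.dataset.text.Vocab.from_file builds the mapping deterministically, assigning ids in line order:

[ ]:
# Sketch: build the vocab directly from the file; ids follow the
# file's line order, so repeated runs yield the same mapping
vocab_from_file = text.Vocab.from_file("bert-base-uncased-vocab.txt")
print(vocab_from_file.tokens_to_ids(["good", "morning"]))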

AddToken

The mindspore.dataset.text.AddToken transform adds a token to the beginning or end of a sequence.

[3]:
txt = ["a", "b", "c", "d", "e"]
add_token_op = text.AddToken(token='TOKEN', begin=True)
call_op(add_token_op, txt)

add_token_op = text.AddToken(token='END', begin=False)
call_op(add_token_op, txt)
['TOKEN' 'a' 'b' 'c' 'd' 'e']
['a' 'b' 'c' 'd' 'e' 'END']
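
Besides this eager usage, AddToken can also run inside a dataset pipeline via map. A minimal sketch, assuming a NumpySlicesDataset as the source (any dataset with a text column works the same way):

[ ]:
# Sketch: apply AddToken within a pipeline instead of eagerly
pipeline = ds.NumpySlicesDataset([["a", "b", "c", "d", "e"]], column_names=["col"])
pipeline = pipeline.map(text.AddToken(token='TOKEN', begin=True), input_columns=["col"])
for row in pipeline.create_tuple_iterator(output_numpy=True):
    print(row[0])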

SentencePieceTokenizer

The mindspore.dataset.text.SentencePieceTokenizer transform tokenizes a scalar string into tokens using a SentencePiece model.

[4]:
# Construct a SentencePieceVocab model
url = "https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/sentencepiece.bpe.model"
download(url, './sentencepiece.bpe.model', replace=True)
sentence_piece_vocab_file = './sentencepiece.bpe.model'

# Use the model to tokenize text
tokenizer = text.SentencePieceTokenizer(sentence_piece_vocab_file, out_type=text.SPieceTokenizerOutType.STRING)
txt = "Today is Tuesday."
call_op(tokenizer, txt)
Downloading data from https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/sentencepiece.bpe.model (4.8 MB)

file_sizes: 100%|██████████████████████████| 5.07M/5.07M [00:01<00:00, 2.93MB/s]
Successfully downloaded file to ./sentencepiece.bpe.model
['▁Today' '▁is' '▁Tuesday' '.']
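
Setting out_type to SPieceTokenizerOutType.INT makes the tokenizer emit subword ids instead of strings; a brief sketch reusing the model downloaded above:

[ ]:
# Sketch: emit subword ids rather than subword strings
id_tokenizer = text.SentencePieceTokenizer(sentence_piece_vocab_file,
                                           out_type=text.SPieceTokenizerOutType.INT)
call_op(id_tokenizer, "Today is Tuesday.")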

WordpieceTokenizer

The mindspore.dataset.text.WordpieceTokenizer transform tokenizes the input text into subword tokens.

[5]:
# Reuse the vocab defined above as input vocab
tokenizer = text.WordpieceTokenizer(vocab=vocab, unknown_token='[UNK]')
txt = ["tokenizer", "will", "outputs", "subwords"]
call_op(tokenizer, txt)
['token' '##izer' 'will' 'outputs' 'sub' '##words']
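
In a pipeline, wordpiece tokenization is typically applied to pre-split words. A minimal sketch, assuming a whitespace pre-tokenization step before the subword decomposition:

[ ]:
# Sketch: whitespace-split sentences first, then decompose into subwords
pipeline = ds.NumpySlicesDataset(["tokenizer will outputs subwords"], column_names=["text"])
pipeline = pipeline.map(text.WhitespaceTokenizer(), input_columns=["text"])
pipeline = pipeline.map(text.WordpieceTokenizer(vocab=vocab, unknown_token='[UNK]'), input_columns=["text"])
for row in pipeline.create_tuple_iterator(output_numpy=True):
    print(row[0])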

Process TXT File In Dataset Pipeline

Use mindspore.dataset.TextFileDataset to read text content into the dataset pipeline and then perform tokenization on the text.

[ ]:
# Load text content into dataset pipeline
text_file = "article.txt"
dataset = ds.TextFileDataset(dataset_files=text_file, shuffle=False)

# check the column names inside the dataset
print("column names:", dataset.get_col_names())

# tokenize all text content into tokens with bert vocab
dataset = dataset.map(text.BertTokenizer(vocab=vocab), input_columns=["text"])

for data in dataset:
    print(data)
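
To produce model-ready ids, a Lookup step can follow the tokenizer in the same pipeline. A minimal sketch continuing from the cell above (the Lookup reuses the bert vocab defined earlier):

[ ]:
# Sketch: map the subword tokens to vocabulary ids inside the pipeline
dataset = dataset.map(text.Lookup(vocab), input_columns=["text"])
for data in dataset:
    print(data)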