mindspore.dataset.text.WordpieceTokenizer

class mindspore.dataset.text.WordpieceTokenizer(vocab, suffix_indicator='##', max_bytes_per_token=100, unknown_token='[UNK]', with_offsets=False)

Tokenize the input text into subword tokens.

Parameters
  • vocab (Vocab) – Vocabulary used to look up words.

  • suffix_indicator (str, optional) – Prefix attached to a subword token to mark it as the suffix of a longer word. Default: '##'.

  • max_bytes_per_token (int, optional) – The maximum number of bytes per token; words exceeding this length will not be split. Default: 100.

  • unknown_token (str, optional) – The output for words not found in the vocabulary. If set to an empty string, the unknown word itself is returned as the output; otherwise, this string is returned in its place. Default: '[UNK]'.

  • with_offsets (bool, optional) – Whether to return the offsets of tokens. Default: False.
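A quick way to see how unknown_token and the suffix indicator interact is to call the transform eagerly on a small list of words. The snippet below is a minimal sketch; the three-word vocabulary and the expected output in the comment are illustrative, not part of this API's reference output.

>>> import numpy as np
>>> import mindspore.dataset.text as text
>>> vocab = text.Vocab.from_list(["my", "favor", "##ite"])
>>> tokenizer_op = text.WordpieceTokenizer(vocab=vocab, unknown_token='[UNK]')
>>> # "favorite" is split into the in-vocabulary pieces "favor" + "##ite";
>>> # "book" is out of vocabulary, so '[UNK]' is emitted in its place.
>>> tokens = tokenizer_op(np.array(["my", "favorite", "book"]))
>>> # expected: ['my' 'favor' '##ite' '[UNK]']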

Raises
  • TypeError – If vocab is not of type text.Vocab.

  • TypeError – If suffix_indicator is not of type str.

  • TypeError – If max_bytes_per_token is not of type int.

  • ValueError – If max_bytes_per_token is negative.

  • TypeError – If unknown_token is not of type str.

  • TypeError – If with_offsets is not of type bool.

Supported Platforms:

CPU

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>> # A text dataset to map over must exist first; the file path below is a placeholder.
>>> text_file_dataset = ds.TextFileDataset(dataset_files=["/path/to/text_file"])
>>> vocab_list = ["book", "cholera", "era", "favor", "##ite", "my", "is", "love", "dur", "##ing", "the"]
>>> vocab = text.Vocab.from_list(vocab_list)
>>> # If with_offsets=False, the default output is one column {["text", dtype=str]}
>>> tokenizer_op = text.WordpieceTokenizer(vocab=vocab, unknown_token='[UNK]',
...                                        max_bytes_per_token=100, with_offsets=False)
>>> text_file_dataset = text_file_dataset.map(operations=tokenizer_op)
>>> # If with_offsets=True, then output three columns {["token", dtype=str], ["offsets_start", dtype=uint32],
>>> #                                                   ["offsets_limit", dtype=uint32]}
>>> tokenizer_op = text.WordpieceTokenizer(vocab=vocab, unknown_token='[UNK]',
...                                        max_bytes_per_token=100, with_offsets=True)
>>> text_file_dataset = text_file_dataset.map(operations=tokenizer_op, input_columns=["text"],
...                                           output_columns=["token", "offsets_start", "offsets_limit"])
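>>> # To inspect the tokenized columns, iterate the mapped dataset. This is a
>>> # minimal sketch: create_dict_iterator yields one dict of NumPy values per sample.
>>> for data in text_file_dataset.create_dict_iterator(num_epochs=1, output_numpy=True):
...     print(data["token"], data["offsets_start"], data["offsets_limit"])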