mindspore.dataset.text.UnicodeCharTokenizer

class mindspore.dataset.text.UnicodeCharTokenizer(with_offsets=False)

Tokenize each input string into its individual Unicode characters.
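
Conceptually, each string is split into a sequence of single-character tokens. A rough pure-Python analogue of the splitting step (an illustration only, not the operator's actual implementation):

>>> list('BeiJing!')
['B', 'e', 'i', 'J', 'i', 'n', 'g', '!']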

Parameters

with_offsets (bool, optional) – Whether to output the start and end offsets of each token in the original string. Default: False.
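
With with_offsets=True, every token is also paired with a [start, limit) span pointing back into the original string. A minimal pure-Python sketch of that pairing (illustrative only; for the ASCII text used below, character and byte offsets coincide):

>>> [(ch, i, i + 1) for i, ch in enumerate('To')]
[('T', 0, 1), ('o', 1, 2)]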

Raises

TypeError – If with_offsets is not of type bool.
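
This validation is expected to fire at construction time; a small sketch (the printed message is our own, not the library's error text):

>>> import mindspore.dataset.text as text
>>> try:
...     text.UnicodeCharTokenizer(with_offsets='yes')
... except TypeError:
...     print('with_offsets must be a bool')
with_offsets must be a bool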

Supported Platforms:

CPU

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>>
>>> # Use the transform in dataset pipeline mode
>>> numpy_slices_dataset = ds.NumpySlicesDataset(data=['Welcome     To   BeiJing!'], column_names=["text"])
>>>
>>> # If with_offsets=False, the output is a single column {["text", dtype=str]}
>>> tokenizer_op = text.UnicodeCharTokenizer(with_offsets=False)
>>> numpy_slices_dataset = numpy_slices_dataset.map(operations=tokenizer_op)
>>> for item in numpy_slices_dataset.create_dict_iterator(num_epochs=1, output_numpy=True):
...     print(item["text"])
...     break
['W' 'e' 'l' 'c' 'o' 'm' 'e' ' ' ' ' ' ' ' ' ' ' 'T' 'o' ' ' ' ' ' ' 'B' 'e' 'i' 'J' 'i' 'n' 'g' '!']
>>>
>>> # If with_offsets=True, the output is three columns {["token", dtype=str], ["offsets_start", dtype=uint32],
>>> #                                                    ["offsets_limit", dtype=uint32]}
>>> tokenizer_op = text.UnicodeCharTokenizer(with_offsets=True)
>>> numpy_slices_dataset = ds.NumpySlicesDataset(data=['Welcome     To   BeiJing!'], column_names=["text"])
>>> numpy_slices_dataset = numpy_slices_dataset.map(operations=tokenizer_op, input_columns=["text"],
...                                                 output_columns=["token", "offsets_start", "offsets_limit"])
>>> for item in numpy_slices_dataset.create_dict_iterator(num_epochs=1, output_numpy=True):
...     print(item["token"], item["offsets_start"], item["offsets_limit"])
['W' 'e' 'l' 'c' 'o' 'm' 'e' ' ' ' ' ' ' ' ' ' ' 'T' 'o' ' ' ' ' ' ' 'B' 'e'
 'i' 'J' 'i' 'n' 'g' '!'] [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
 17 18 19 20 21 22 23 24] [ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
 18 19 20 21 22 23 24 25]
>>>
>>> # Use the transform in eager mode
>>> data = 'Welcome     To   BeiJing!'
>>> output = text.UnicodeCharTokenizer(with_offsets=True)(data)
>>> print(output)
(array(['W', 'e', 'l', 'c', 'o', 'm', 'e', ' ', ' ', ' ', ' ', ' ', 'T', 'o', ' ', ' ', ' ', 'B', 'e', 'i', 'J',
'i', 'n', 'g', '!'], dtype='<U1'), array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24], dtype=uint32), array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14,
15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25], dtype=uint32))
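>>>
>>> # The tokenizer splits by Unicode code point, so non-ASCII input also yields
>>> # one token per character. A sketch of the expected behavior (the output shown
>>> # is illustrative, not copied from a real run):
>>> output = text.UnicodeCharTokenizer(with_offsets=False)('北京欢迎你')
>>> print(output)
['北' '京' '欢' '迎' '你']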