mindspore.dataset.text.PythonTokenizer

class mindspore.dataset.text.PythonTokenizer(tokenizer)

Applies a user-defined string tokenizer to the input string.

Parameters

tokenizer (Callable) – Python function that takes a str and returns the tokens as a list of str.

Raises

TypeError – If tokenizer is not a callable Python function.

Supported Platforms:

CPU

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>>
>>> # User-defined tokenizer: split each line on whitespace.
>>> def my_tokenizer(line):
...     return line.split()
>>>
>>> text_file_list = ["/path/to/text_file_dataset_file"]
>>> text_file_dataset = ds.TextFileDataset(dataset_files=text_file_list)
>>> # Apply the tokenizer to every row of the dataset.
>>> text_file_dataset = text_file_dataset.map(operations=text.PythonTokenizer(my_tokenizer))
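
The contract described under Parameters (a callable that takes a str and returns a list of str) can be checked in plain Python before the function is handed to PythonTokenizer. The sketch below is illustrative only: punctuation_tokenizer and check_tokenizer are hypothetical names, not part of MindSpore, and the checks mirror the documented contract rather than the library's internal validation.

```python
import re


def punctuation_tokenizer(line):
    """Hypothetical tokenizer: words and punctuation marks become separate tokens."""
    return re.findall(r"\w+|[^\w\s]", line)


def check_tokenizer(tokenizer, sample):
    """Hypothetical helper (not a MindSpore API): verify the documented contract.

    Raises TypeError if `tokenizer` is not callable, and ValueError if it
    does not return a list of str for the given sample string.
    """
    if not callable(tokenizer):
        raise TypeError("tokenizer must be a callable Python function")
    tokens = tokenizer(sample)
    if not isinstance(tokens, list) or not all(isinstance(t, str) for t in tokens):
        raise ValueError("tokenizer must return a list of str")
    return tokens


tokens = check_tokenizer(punctuation_tokenizer, "Hello, world!")
print(tokens)  # ['Hello', ',', 'world', '!']
```

A function that passes this check can then be wrapped as text.PythonTokenizer(punctuation_tokenizer) and used in map in the same way as my_tokenizer.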
Tutorial Examples: