mindspore.dataset.text.BasicTokenizer
- class mindspore.dataset.text.BasicTokenizer(lower_case=False, keep_whitespace=False, normalization_form=NormalizeForm.NONE, preserve_unused_token=True, with_offsets=False)
Tokenize the input UTF-8 encoded string by specific rules.
Note
BasicTokenizer is not yet supported on the Windows platform.
- Parameters
lower_case (bool, optional) – Whether to perform lowercase processing on the text. If True, the text is folded to lower case and accents are stripped from characters. If False, the text is only normalized, with the mode specified by normalization_form. Default: False.
keep_whitespace (bool, optional) – If True, whitespace is kept in the output. Default: False.
normalization_form (NormalizeForm, optional) – The desired normalization form. See NormalizeForm for details on optional values. Default: NormalizeForm.NONE.
preserve_unused_token (bool, optional) – Whether to preserve special tokens. If True, special tokens such as ‘[CLS]’, ‘[SEP]’, ‘[UNK]’, ‘[PAD]’ and ‘[MASK]’ are not split. Default: True.
with_offsets (bool, optional) – Whether to output the start and end offsets of each token in the original string. Default: False.
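The effect of lower_case and preserve_unused_token can be illustrated with a small pure-Python sketch. It is not the MindSpore implementation: the helper names (basic_tokenize_sketch, fold_case_and_strip_accents, split_on_punctuation) are hypothetical, it assumes BERT-style splitting on whitespace and punctuation, it only recognises special tokens that are delimited by whitespace, and it does not model keep_whitespace or normalization_form.

```python
import unicodedata

# Special tokens listed in the parameter description above.
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[UNK]", "[PAD]", "[MASK]"}


def fold_case_and_strip_accents(text):
    """Lowercase the text, then drop combining accent marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def is_punctuation(ch):
    """Treat ASCII symbols and Unicode P* categories as punctuation (assumption)."""
    cp = ord(ch)
    if 33 <= cp <= 47 or 58 <= cp <= 64 or 91 <= cp <= 96 or 123 <= cp <= 126:
        return True
    return unicodedata.category(ch).startswith("P")


def split_on_punctuation(word):
    """Split a word so that every punctuation character becomes its own token."""
    pieces, current = [], ""
    for ch in word:
        if is_punctuation(ch):
            if current:
                pieces.append(current)
                current = ""
            pieces.append(ch)
        else:
            current += ch
    if current:
        pieces.append(current)
    return pieces


def basic_tokenize_sketch(text, lower_case=False, preserve_unused_token=True):
    """Illustrative approximation of the documented behaviour, not the real operator."""
    tokens = []
    for word in text.split():
        # Keep special tokens whole when asked to preserve them.
        if preserve_unused_token and word in SPECIAL_TOKENS:
            tokens.append(word)
            continue
        if lower_case:
            word = fold_case_and_strip_accents(word)
        tokens.extend(split_on_punctuation(word))
    return tokens


print(basic_tokenize_sketch("[CLS] Héllo, World! [SEP]", lower_case=True))
# ['[CLS]', 'hello', ',', 'world', '!', '[SEP]']
print(basic_tokenize_sketch("[CLS] Hi", preserve_unused_token=False))
# ['[', 'CLS', ']', 'Hi']
```

With preserve_unused_token=False the brackets of a special token are treated like any other punctuation, which is why '[CLS]' falls apart into three tokens in the second call.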
- Raises
TypeError – If lower_case is not of type bool.
TypeError – If keep_whitespace is not of type bool.
TypeError – If normalization_form is not of type NormalizeForm.
TypeError – If preserve_unused_token is not of type bool.
TypeError – If with_offsets is not of type bool.
RuntimeError – If dtype of input Tensor is not str.
- Supported Platforms:
CPU
Examples
>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>> from mindspore.dataset.text import NormalizeForm
>>>
>>> text_file_list = ["/path/to/text_file_dataset_file"]
>>> text_file_dataset = ds.TextFileDataset(dataset_files=text_file_list)
>>>
>>> # 1) If with_offsets=False, default output one column {["text", dtype=str]}
>>> tokenizer_op = text.BasicTokenizer(lower_case=False,
...                                    keep_whitespace=False,
...                                    normalization_form=NormalizeForm.NONE,
...                                    preserve_unused_token=True,
...                                    with_offsets=False)
>>> tokenized_dataset = text_file_dataset.map(operations=tokenizer_op)
>>>
>>> # 2) If with_offsets=True, then output three columns {["token", dtype=str],
>>> #                                                     ["offsets_start", dtype=uint32],
>>> #                                                     ["offsets_limit", dtype=uint32]}
>>> # Map over the original dataset again, so the tokenizer still receives raw strings.
>>> tokenizer_op = text.BasicTokenizer(lower_case=False,
...                                    keep_whitespace=False,
...                                    normalization_form=NormalizeForm.NONE,
...                                    preserve_unused_token=True,
...                                    with_offsets=True)
>>> tokenized_with_offsets = text_file_dataset.map(operations=tokenizer_op, input_columns=["text"],
...                                                output_columns=["token", "offsets_start", "offsets_limit"])
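To make the three output columns of the with_offsets=True case more concrete, the following pure-Python sketch computes a token list with matching start and limit offsets. It does not call MindSpore: the function tokens_with_offsets is hypothetical, the regular expression is a simplified stand-in for the real splitting rules, and it assumes that offsets_limit is an exclusive end position. The input is ASCII only, so character and byte positions coincide and the sketch makes no claim about which unit the operator uses.

```python
import re


def tokens_with_offsets(text):
    """Return (tokens, offsets_start, offsets_limit) for a simplified tokenization.

    Runs of word characters form one token; every other non-space
    character becomes a token of its own (simplifying assumption).
    """
    tokens, offsets_start, offsets_limit = [], [], []
    for match in re.finditer(r"\w+|[^\w\s]", text):
        tokens.append(match.group())
        offsets_start.append(match.start())  # position of the first character
        offsets_limit.append(match.end())    # one past the last character (assumed exclusive)
    return tokens, offsets_start, offsets_limit


source = "Welcome to Beijing!"
tokens, offsets_start, offsets_limit = tokens_with_offsets(source)
print(tokens)         # ['Welcome', 'to', 'Beijing', '!']
print(offsets_start)  # [0, 8, 11, 18]
print(offsets_limit)  # [7, 10, 18, 19]
```

Under these assumptions each token can be recovered by slicing the original string, for example source[8:10] yields 'to'.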