mindspore.dataset.text.NormalizeUTF8
- class mindspore.dataset.text.NormalizeUTF8(normalize_form=NormalizeForm.NFKC)[source]
Apply Unicode normalization to a UTF-8 string tensor.
Note
NormalizeUTF8 is not yet supported on the Windows platform.
- Parameters
normalize_form (NormalizeForm, optional) – The normalization to apply: either NormalizeForm.NONE or one of the four Unicode normalization forms. Default: NormalizeForm.NFKC. See http://unicode.org/reports/tr15/ for details. Valid values are:
NormalizeForm.NONE, do nothing to the input string tensor.
NormalizeForm.NFC, normalize with Normalization Form C.
NormalizeForm.NFKC, normalize with Normalization Form KC.
NormalizeForm.NFD, normalize with Normalization Form D.
NormalizeForm.NFKD, normalize with Normalization Form KD.
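To make the difference between the forms concrete, here is a minimal sketch that uses Python's standard-library `unicodedata` module rather than MindSpore itself. It assumes that NormalizeUTF8 follows the same Unicode Standard Annex #15 definitions, so the results should correspond, but this page does not confirm that. The helper `normalize` and the sample strings are illustrative only.

```python
import unicodedata

# Sample inputs (illustrative, not taken from the MindSpore docs).
composed = "\u00e9"      # 'é' as a single precomposed code point
decomposed = "e\u0301"   # 'e' followed by a combining acute accent
ligature = "\ufb01"      # 'ﬁ' ligature, a compatibility character


def normalize(text, form):
    """Hypothetical helper: normalize text with the given form name."""
    return unicodedata.normalize(form, text)


# NFC composes base character + combining mark into one code point.
nfc = normalize(decomposed, "NFC")
# NFD splits a precomposed character into base + combining mark.
nfd = normalize(composed, "NFD")
# NFKC/NFKD also replace compatibility characters such as ligatures.
nfkc = normalize(ligature, "NFKC")
nfkd = normalize(ligature, "NFKD")
# NFC leaves the ligature untouched, since it only applies canonical mappings.
nfc_ligature = normalize(ligature, "NFC")

print(len(decomposed), len(nfc))   # 2 1
print(len(composed), len(nfd))     # 1 2
print(nfkc, nfkd)                  # fi fi
```

The K forms (NFKC, NFKD) are lossy in the sense that compatibility distinctions such as ligatures are folded away, which is often desirable when preparing text for tokenization.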
- Raises
TypeError – If normalize_form is not of type NormalizeForm.
- Supported Platforms:
CPU
Examples
>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>> from mindspore.dataset.text import NormalizeForm
>>>
>>> normalize_op = text.NormalizeUTF8(normalize_form=NormalizeForm.NFC)
>>> text_file_list = ["/path/to/text_file_dataset_file"]
>>> text_file_dataset = ds.TextFileDataset(dataset_files=text_file_list)
>>> text_file_dataset = text_file_dataset.map(operations=normalize_op)