mindspore.dataset.text.NormalizeUTF8
- class mindspore.dataset.text.NormalizeUTF8(normalize_form=NormalizeForm.NFKC)
Apply a Unicode normalization operation to a UTF-8 string tensor.
Note
NormalizeUTF8 is not yet supported on the Windows platform.
- Parameters
normalize_form (NormalizeForm, optional) –
Must be one of NormalizeForm.NONE, NormalizeForm.NFC, NormalizeForm.NFKC, NormalizeForm.NFD, or NormalizeForm.NFKD, that is, either no normalization or one of the four Unicode normalization forms. Default: NormalizeForm.NFKC. See http://unicode.org/reports/tr15/ for details.
NormalizeForm.NONE, leave the input string tensor unchanged.
NormalizeForm.NFC, normalize with Normalization Form C.
NormalizeForm.NFKC, normalize with Normalization Form KC.
NormalizeForm.NFD, normalize with Normalization Form D.
NormalizeForm.NFKD, normalize with Normalization Form KD.
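As a rough illustration of how the four forms differ, the sketch below uses Python's standard unicodedata module rather than MindSpore itself. It assumes NormalizeUTF8 follows the same Unicode TR15 definitions; the sample string and the forms dictionary are made up for this example.

```python
import unicodedata

# Hypothetical sample: "e" + combining acute accent (U+0301) + "fi" ligature (U+FB01).
sample = "e\u0301\ufb01"

# Apply each of the four Unicode normalization forms to the same input.
forms = {
    name: unicodedata.normalize(name, sample)
    for name in ("NFC", "NFKC", "NFD", "NFKD")
}

# Print the code points so that composed and decomposed results can be compared.
for name, value in forms.items():
    print(name, [f"U+{ord(ch):04X}" for ch in value])
```

The C forms compose "e" and the accent into a single code point, while the D forms keep them separate. The K (compatibility) forms additionally replace the ligature with the plain letters "f" and "i", which is usually what text preprocessing wants and may be why NFKC is the default here.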
- Raises
TypeError – If normalize_form is not of type NormalizeForm.
- Supported Platforms:
CPU
Examples
>>> import mindspore.dataset.text as text
>>> from mindspore.dataset.text import NormalizeForm
>>> normalize_op = text.NormalizeUTF8(normalize_form=NormalizeForm.NFC)
>>> text_file_dataset = text_file_dataset.map(operations=normalize_op)