mindspore.dataset.text.NormalizeUTF8

class mindspore.dataset.text.NormalizeUTF8(normalize_form=NormalizeForm.NFKC)

Apply Unicode normalization to a UTF-8 string tensor.

Note

NormalizeUTF8 is not yet supported on Windows.

Parameters

normalize_form (NormalizeForm, optional) –

The normalization form to apply. Valid values are NormalizeForm.NONE, which disables normalization, or one of the four Unicode normalization forms: NormalizeForm.NFC, NormalizeForm.NFKC, NormalizeForm.NFD, NormalizeForm.NFKD. Default: NormalizeForm.NFKC. See http://unicode.org/reports/tr15/ for details.

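  • NormalizeForm.NONE, leave the input string tensor unchanged.

  • NormalizeForm.NFC, normalize with Normalization Form C (canonical decomposition followed by canonical composition).

  • NormalizeForm.NFKC, normalize with Normalization Form KC (compatibility decomposition followed by canonical composition).

  • NormalizeForm.NFD, normalize with Normalization Form D (canonical decomposition).

  • NormalizeForm.NFKD, normalize with Normalization Form KD (compatibility decomposition).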
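
The difference between the four forms is easiest to see on concrete characters. The following is a minimal sketch that uses Python's standard unicodedata module rather than MindSpore itself, on the assumption that NormalizeUTF8 follows the same Unicode definitions (UAX #15). The helper name normalize_all_forms is hypothetical and exists only for this illustration.

```python
import unicodedata

def normalize_all_forms(s):
    """Return s normalized under each of the four Unicode forms (hypothetical helper)."""
    return {form: unicodedata.normalize(form, s)
            for form in ("NFC", "NFKC", "NFD", "NFKD")}

# U+FB01 LATIN SMALL LIGATURE FI only has a compatibility decomposition,
# so only the "K" forms (NFKC, NFKD) rewrite it to the two letters "fi".
ligature = normalize_all_forms("\ufb01")

# U+00E9 (precomposed e with acute) has a canonical decomposition,
# so the "D" forms split it into "e" followed by U+0301 COMBINING ACUTE ACCENT.
accented = normalize_all_forms("\u00e9")

for name, forms in (("ligature", ligature), ("accented", accented)):
    for form, value in forms.items():
        # Print code points so that combining characters are visible.
        print(name, form, [hex(ord(ch)) for ch in value])
```

NFKC, the default here, is the most aggressive composed form: it folds compatibility variants such as ligatures into their plain equivalents, which is often useful before tokenization but is not reversible.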

Raises

TypeError – If normalize_form is not of type NormalizeForm.

Supported Platforms:

CPU

Examples

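>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>> from mindspore.dataset.text import NormalizeForm
>>> # Placeholder path (hypothetical): point this at a UTF-8 text file.
>>> text_file_dataset = ds.TextFileDataset("/path/to/text_file.txt", shuffle=False)
>>> normalize_op = text.NormalizeUTF8(normalize_form=NormalizeForm.NFC)
>>> text_file_dataset = text_file_dataset.map(operations=normalize_op)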