mindspore.dataset.text.NormalizeUTF8
- class mindspore.dataset.text.NormalizeUTF8(normalize_form=NormalizeForm.NFKC)
Normalize the input UTF-8 encoded strings.
Note
NormalizeUTF8 is not yet supported on the Windows platform.
- Parameters
normalize_form (NormalizeForm, optional) – The desired normalization form. See NormalizeForm for details on optional values. Default: NormalizeForm.NFKC.
- Raises
TypeError – If normalize_form is not of type NormalizeForm.
- Supported Platforms:
CPU
Examples
>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>> from mindspore.dataset.text import NormalizeForm
>>>
>>> # Use the transform in dataset pipeline mode
>>> numpy_slices_dataset = ds.NumpySlicesDataset(data=["ṩ", "ḍ̇", "q̇", "fi", "2⁵", "ẛ"],
...                                              column_names=["text"], shuffle=False)
>>> normalize_op = text.NormalizeUTF8(normalize_form=NormalizeForm.NFC)
>>> numpy_slices_dataset = numpy_slices_dataset.map(operations=normalize_op)
>>> for item in numpy_slices_dataset.create_dict_iterator(num_epochs=1, output_numpy=True):
...     print(item["text"])
...     break
ṩ
>>>
>>> # Use the transform in eager mode
>>> data = ["ṩ", "ḍ̇", "q̇", "fi", "2⁵", "ẛ"]
>>> output = text.NormalizeUTF8(NormalizeForm.NFKC)(data)
>>> print(output)
['ṩ' 'ḍ̇' 'q̇' 'fi' '25' 'ṡ']
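To see what the individual normalization forms do to strings like these, the following minimal sketch uses Python's standard-library `unicodedata` module rather than MindSpore. It is an illustration only: the helper `normalize_all` is hypothetical, and it is an assumption (not confirmed by this page) that NormalizeUTF8 produces identical results for every input. It may also be a convenient way to inspect the forms on platforms where NormalizeUTF8 is unavailable.

```python
import unicodedata

# Hypothetical helper (not part of MindSpore): apply one Unicode
# normalization form ("NFC", "NFD", "NFKC" or "NFKD") to each string.
def normalize_all(strings, form="NFKC"):
    return [unicodedata.normalize(form, s) for s in strings]

# Inputs written as escapes so the code points are unambiguous.
samples = [
    "\u1e69",    # s with dot below and dot above (precomposed)
    "\ufb01",    # "fi" ligature
    "2\u2075",   # digit 2 followed by superscript 5
    "\u1e9b",    # long s with dot above
]

nfc = normalize_all(samples, "NFC")    # canonical composition
nfd = normalize_all(samples, "NFD")    # canonical decomposition
nfkc = normalize_all(samples, "NFKC")  # compatibility mapping, then composition

# NFC keeps compatibility characters such as the ligature and superscript.
print(nfc == samples)             # True
# NFD splits the first sample into a base letter plus two combining marks.
print(len(nfd[0]))                # 3
# NFKC folds compatibility characters into their plain equivalents.
print(nfkc[1], nfkc[2])           # fi 25
print(nfkc[3] == "\u1e61")        # True (s with dot above)
```

The compatibility forms (NFKC, NFKD) discard distinctions such as superscripts and ligatures, so they suit search and matching; the canonical forms (NFC, NFD) preserve them.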