mindspore.dataset.text.Ngram

class mindspore.dataset.text.Ngram(n, left_pad=('', 0), right_pad=('', 0), separator=' ')

Generate n-grams from a 1-D string Tensor.

Refer to N-gram for an overview of what an n-gram is and how it works.

Parameters
  • n (list[int]) – Sizes of the n-grams to generate, given as a list of positive integers. For example, if n=[4, 3], the result contains the 4-grams followed by the 3-grams in the same tensor. If there are too few words to form an n-gram, an empty string is returned. For example, a 3-gram on ["mindspore", "best"] results in an empty string.

  • left_pad (tuple, optional) – Padding applied to the left side of the sequence, given as a tuple ("pad_token", pad_width). pad_width is capped at n-1. For example, specifying left_pad=("_", 2) would pad the left side of the sequence with "__". Default: ('', 0).

  • right_pad (tuple, optional) – Padding applied to the right side of the sequence, given as a tuple ("pad_token", pad_width). pad_width is capped at n-1. For example, specifying right_pad=("_", 2) would pad the right side of the sequence with "__". Default: ('', 0).

  • separator (str, optional) – String used to join the words of each n-gram. For example, if the 2-gram is ["mindspore", "amazing"] and the separator is "-", the result would be ["mindspore-amazing"]. Default: ' ', a single space.
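
The parameter descriptions above can be modelled in plain Python. The following is a minimal sketch, not MindSpore's implementation: ngram_sketch is a hypothetical helper, it assumes each pad token is added as a separate word (joined with the separator like any other word), and it accepts a bare int for n because the example on this page passes one.

```python
def ngram_sketch(words, n, left_pad=("", 0), right_pad=("", 0), separator=" "):
    """Pure-Python model of the documented behaviour (hypothetical helper)."""
    # Accept a single int as well as a list of sizes.
    sizes = [n] if isinstance(n, int) else list(n)
    result = []
    for size in sizes:
        # pad_width is capped at size - 1, as the parameter description states.
        left = [left_pad[0]] * min(left_pad[1], size - 1)
        right = [right_pad[0]] * min(right_pad[1], size - 1)
        # Assumption: every pad token counts as its own word.
        padded = left + list(words) + right
        if len(padded) < size:
            # Too few words for this size: emit one empty string.
            result.append("")
            continue
        # Slide a window of `size` words over the padded sequence.
        for i in range(len(padded) - size + 1):
            result.append(separator.join(padded[i:i + size]))
    return result


print(ngram_sketch(["mindspore", "best"], [3]))                    # ['']
print(ngram_sketch(["mindspore", "amazing"], [2], separator="-"))  # ['mindspore-amazing']
print(ngram_sketch(["a", "b", "c", "d"], [4, 3]))                  # ['a b c d', 'a b c', 'b c d']
print(ngram_sketch(["a", "b"], [2], left_pad=("_", 2)))            # ['_ a', 'a b']
```

In the last call the requested pad width of 2 is reduced to 1 because n-1 is 1 for a 2-gram.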

Raises
  • TypeError – If any value in n is not of type int.

  • ValueError – If any value in n is not positive.

  • ValueError – If left_pad is not a tuple of length 2.

  • ValueError – If right_pad is not a tuple of length 2.

  • TypeError – If separator is not of type string.
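
The error conditions listed above can be expressed as a small standalone check. This is only an illustrative sketch: validate_ngram_args is a hypothetical function, and the order of the checks and the error messages are assumptions rather than MindSpore's actual validator.

```python
def validate_ngram_args(n, left_pad=("", 0), right_pad=("", 0), separator=" "):
    """Mirror the documented Raises list in plain Python (hypothetical helper)."""
    # Assumption: a bare int is treated like a one-element list.
    sizes = [n] if isinstance(n, int) else n
    for size in sizes:
        if not isinstance(size, int):
            raise TypeError(f"n must contain only int values, got {type(size).__name__}")
        if size <= 0:
            raise ValueError(f"n must contain only positive values, got {size}")
    for name, pad in (("left_pad", left_pad), ("right_pad", right_pad)):
        if not (isinstance(pad, tuple) and len(pad) == 2):
            raise ValueError(f"{name} must be a tuple of length 2")
    if not isinstance(separator, str):
        raise TypeError("separator must be of type str")


# Each call below violates one documented condition.
for bad_kwargs in (
    {"n": [2.5]},                      # TypeError: value in n is not an int
    {"n": [0]},                        # ValueError: value in n is not positive
    {"n": [2], "left_pad": ("_",)},    # ValueError: left_pad is not length 2
    {"n": [2], "separator": 1},        # TypeError: separator is not a string
):
    try:
        validate_ngram_args(**bad_kwargs)
    except (TypeError, ValueError) as err:
        print(type(err).__name__, "-", err)
```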

Supported Platforms:

CPU

Examples

>>> import numpy as np
>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>>
>>> # Use the transform in dataset pipeline mode
>>> def gen(texts):
...     for line in texts:
...         yield (np.array(line.split(" "), dtype=str),)
>>> data = ["WildRose Country", "Canada's Ocean Playground", "Land of Living Skies"]
>>> generator_dataset = ds.GeneratorDataset(gen(data), ["text"])
>>> ngram_op = text.Ngram(3, separator="-")
>>> generator_dataset = generator_dataset.map(operations=ngram_op)
>>> for item in generator_dataset.create_dict_iterator(num_epochs=1, output_numpy=True):
...     print(item["text"])
...     break
['']
>>>
>>> # Use the transform in eager mode
>>> output = ngram_op(data)
>>> print(output)
["WildRose Country-Canada's Ocean Playground-Land of Living Skies"]