mindspore.dataset.TextBaseDataset.build_vocab

TextBaseDataset.build_vocab(columns, freq_range, top_k, special_tokens, special_first)[source]

Function to create a Vocab from source dataset. Desired source dataset is a text type dataset.

Build a vocab from a dataset. This would collect all the unique words in a dataset and return a vocab which contains top_k most frequent words (if top_k is specified).

Parameters

columns (Union[str, list[str]]) – Column names to get words from.
freq_range (tuple[int]) – A tuple of integers (min_frequency, max_frequency). Words within the frequency range will be stored. Naturally 0 <= min_frequency <= max_frequency <= total_words. min_frequency/max_frequency can be set to default, which corresponds to 0/total_words separately.
top_k (int) – Number of words to be built into vocab. top_k most frequent words are taken. The top_k is taken after freq_range. If not enough top_k, all words will be taken
special_tokens (list[str]) – A list of strings, each one is a special token.
special_first (bool) – Whether special_tokens will be prepended/appended to vocab, If special_tokens is specified and special_first is set to default, special_tokens will be prepended.

Returns

Vocab, vocab built from the dataset.

Examples

>>> import numpy as np
>>>
>>> def gen_corpus():
...     # key: word, value: number of occurrences, reason for using letters is so their order is apparent
...     corpus = {"Z": 4, "Y": 4, "X": 4, "W": 3, "U": 3, "V": 2, "T": 1}
...     for k, v in corpus.items():
...         yield (np.array([k] * v, dtype='S'),)
>>> column_names = ["column1"]
>>> dataset = ds.GeneratorDataset(gen_corpus, column_names)
>>> dataset = dataset.build_vocab(columns=["column1"],
...                               freq_range=(1, 10), top_k=5,
...                               special_tokens=["<pad>", "<unk>"],
...                               special_first=True)