mindspore.dataset.text.utils.Vocab

class mindspore.dataset.text.utils.Vocab

Vocab object that is used to look up a word.

It contains a map from each word (str) to an id (int).

classmethod from_dataset(dataset, columns=None, freq_range=None, top_k=None, special_tokens=None, special_first=True)

Build a vocab from a dataset.

This collects all unique words in the dataset and returns a vocab containing the words whose frequency falls within the range specified by the user in freq_range. A warning is issued if no word falls within the frequency range. Words in the vocab are ordered from highest to lowest frequency; words with the same frequency are ordered lexicographically.

Parameters
  • dataset (Dataset) – Dataset to build the vocab from.

  • columns (list[str], optional) – Names of the columns to get words from (default=None, all columns are used; an error is raised if any column is not of string type).

  • freq_range (tuple, optional) – A tuple of integers (min_frequency, max_frequency). Words whose frequency falls within this range are kept. 0 <= min_frequency <= max_frequency <= total_words. min_frequency=0 is the same as min_frequency=1. max_frequency > total_words is the same as max_frequency = total_words. min_frequency and max_frequency can each be None, which corresponds to 0 and total_words respectively (default=None, all words are included).

  • top_k (int, optional) – Number of most frequent words to build into the vocab; must be greater than 0. top_k is applied after freq_range. If fewer than top_k words are available, all of them are taken (default=None, all words are included).

  • special_tokens (list, optional) – A list of strings, each one a special token. For example, special_tokens=["<pad>", "<unk>"] (default=None, no special tokens will be added).

  • special_first (bool, optional) – Whether special_tokens are prepended or appended to the vocab. If special_tokens is specified and special_first is set to True, special_tokens are prepended (default=True).

Returns

Vocab, vocab built from the dataset.
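The selection rules above (frequency filter, then ordering, then top_k, then special tokens) can be modelled in plain Python. This is only a sketch of the documented behaviour, not the library's implementation: the helper name build_vocab_sketch is hypothetical, the freq_range bounds are assumed to be inclusive, and ids are assumed to be assigned consecutively from 0.

```python
from collections import Counter


def build_vocab_sketch(rows, freq_range=None, top_k=None,
                       special_tokens=None, special_first=True):
    """Hypothetical pure-Python model of the rules described for from_dataset."""
    # Count how often each word occurs across all rows.
    counts = Counter(word for row in rows for word in row)

    # A None bound means "no limit"; bounds are assumed to be inclusive.
    lo, hi = freq_range if freq_range is not None else (None, None)
    lo = 0 if lo is None else lo
    kept = [(word, count) for word, count in counts.items()
            if count >= lo and (hi is None or count <= hi)]

    # Highest frequency first; ties are broken lexicographically.
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    words = [word for word, _ in kept]

    # top_k is applied after the frequency filter.
    if top_k is not None:
        words = words[:top_k]

    # Special tokens go in front or at the end, depending on special_first.
    specials = list(special_tokens or [])
    ordered = specials + words if special_first else words + specials

    # Assumption: ids are assigned consecutively from 0 in this order.
    return {word: idx for idx, word in enumerate(ordered)}


rows = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "end"]]
vocab = build_vocab_sketch(rows, freq_range=(1, 3), top_k=3,
                           special_tokens=["<pad>", "<unk>"])
print(vocab)
```

In this toy input, "cat", "dog" and "end" all occur once, so the lexicographic tie-break decides which of them survives top_k=3.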

classmethod from_dict(word_dict)

Build a vocab object from a dict.

Parameters

word_dict (dict) – Dict containing word and id pairs, where each word should be a str and each id an int. It is recommended that ids start from 0 and be continuous. A ValueError will be raised if any id is negative.

Returns

Vocab, vocab built from the dict.
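As an illustration of the constraints on word_dict, the following plain-Python sketch checks a dict before it would be passed to from_dict. The helper check_word_dict is hypothetical and not part of the library; raising TypeError for wrongly typed entries is an assumption, since only the ValueError for negative ids is documented above.

```python
def check_word_dict(word_dict):
    """Hypothetical checker mirroring the constraints stated for from_dict."""
    for word, idx in word_dict.items():
        # Assumption: wrongly typed entries are reported with TypeError.
        if not isinstance(word, str) or not isinstance(idx, int):
            raise TypeError("each word must be a str and each id an int")
        # Documented: a negative id raises ValueError.
        if idx < 0:
            raise ValueError("ids must not be negative")
    # Recommended, not required: ids start from 0 and are continuous.
    return sorted(word_dict.values()) == list(range(len(word_dict)))


word_dict = {"<pad>": 0, "<unk>": 1, "home": 2, "world": 3}
is_recommended_layout = check_word_dict(word_dict)
print(is_recommended_layout)
```

A dict such as {"home": 0, "world": 5} passes the checks but returns False here, because its ids leave gaps.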

classmethod from_file(file_path, delimiter='', vocab_size=None, special_tokens=None, special_first=True)

Build a vocab object from a vocab file.

Parameters
  • file_path (str) – Path to the file which contains the vocab list.

  • delimiter (str, optional) – A delimiter used to split each line in the file; the first element is taken to be the word (default="").

  • vocab_size (int, optional) – Number of words to read from file_path (default=None, all words are taken).

  • special_tokens (list, optional) – A list of strings, each one a special token. For example, special_tokens=["<pad>", "<unk>"] (default=None, no special tokens will be added).

  • special_first (bool, optional) – Whether special_tokens are prepended or appended to the vocab. If special_tokens is specified and special_first is set to True, special_tokens are prepended (default=True).

Returns

Vocab, vocab built from the file.
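The parsing rules above can be sketched in plain Python: split each line on the delimiter, keep the first element, stop after vocab_size words, then add the special tokens. The helper read_vocab_file_sketch is hypothetical and is not the library's implementation. Treating an empty delimiter as "the whole line is the word", skipping blank lines, and assigning ids consecutively from 0 are all assumptions.

```python
import os
import tempfile


def read_vocab_file_sketch(file_path, delimiter="", vocab_size=None,
                           special_tokens=None, special_first=True):
    """Hypothetical pure-Python model of the rules described for from_file."""
    words = []
    with open(file_path, encoding="utf-8") as f:
        for line in f:
            # Stop once vocab_size words have been read from the file.
            if vocab_size is not None and len(words) >= vocab_size:
                break
            line = line.rstrip("\r\n")
            if not line:
                continue  # assumption: blank lines are skipped
            # Assumption: an empty delimiter means the whole line is the word.
            words.append(line.split(delimiter)[0] if delimiter else line)

    specials = list(special_tokens or [])
    ordered = specials + words if special_first else words + specials
    # Assumption: ids are assigned consecutively from 0 in this order.
    return {word: idx for idx, word in enumerate(ordered)}


# Write a small vocab file where each line is "word,frequency".
with tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("apple,10\nbanana,7\ncherry,3\n")
    vocab_path = tmp.name

file_vocab = read_vocab_file_sketch(vocab_path, delimiter=",", vocab_size=2,
                                    special_tokens=["<pad>"])
os.remove(vocab_path)
print(file_vocab)
```

Here vocab_size counts only the words read from the file, in line with the parameter description; the special token is added on top of those two words.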

classmethod from_list(word_list, special_tokens=None, special_first=True)

Build a vocab object from a list of words.

Parameters
  • word_list (list) – A list of strings, where each element is a word.

  • special_tokens (list, optional) – A list of strings, each one a special token. For example, special_tokens=["<pad>", "<unk>"] (default=None, no special tokens will be added).

  • special_first (bool, optional) – Whether special_tokens are prepended or appended to the vocab. If special_tokens is specified and special_first is set to True, special_tokens are prepended (default=True).

Returns

Vocab, vocab built from the list.
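To make the effect of special_first concrete, the sketch below compares the two placements in plain Python. The helper assign_ids_sketch is hypothetical, and the consecutive id assignment starting from 0 is an assumption rather than something stated above.

```python
def assign_ids_sketch(word_list, special_tokens=None, special_first=True):
    """Hypothetical model of how special_first changes the word order."""
    specials = list(special_tokens or [])
    words = list(word_list)
    ordered = specials + words if special_first else words + specials
    # Assumption: ids are assigned consecutively from 0 in this order.
    return {word: idx for idx, word in enumerate(ordered)}


words = ["one", "two", "three"]
prepended = assign_ids_sketch(words, ["<pad>", "<unk>"], special_first=True)
appended = assign_ids_sketch(words, ["<pad>", "<unk>"], special_first=False)
print(prepended)
print(appended)
```

With special_first=True the special tokens take the lowest ids, which is convenient when a fixed padding id is expected; with special_first=False the ordinary words keep the ids they would have had without special tokens.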