mindspore.dataset.text.Vocab
- class mindspore.dataset.text.Vocab[source]
Vocab object that stores pairs of words and ids.
It contains a map from each word (str) to an id (int), as well as the reverse mapping.
- classmethod from_dataset(dataset, columns=None, freq_range=None, top_k=None, special_tokens=None, special_first=True)[source]
Build a Vocab from a dataset.
This collects all unique words in the dataset and returns a vocab restricted to the frequency range specified by freq_range. A warning is issued if no words fall into that range. Words in the vocab are ordered from the highest frequency to the lowest; words with the same frequency are ordered lexicographically.
- Parameters
dataset (Dataset) – dataset to build vocab from.
columns (list[str], optional) – column names to get words from. It can be a list of column names. Default: None.
freq_range (tuple, optional) – A tuple of integers (min_frequency, max_frequency). Words within the frequency range will be kept. 0 <= min_frequency <= max_frequency <= total_words. min_frequency=0 is the same as min_frequency=1; max_frequency > total_words is the same as max_frequency = total_words. min_frequency and max_frequency can be None, which corresponds to 0 and total_words respectively. Default: None, all words are included.
top_k (int, optional) – Number of words to be built into the vocab; must be greater than 0. The top_k most frequent words are taken, and top_k is applied after freq_range. If fewer than top_k words remain, all of them are taken. Default: None, all words are included.
special_tokens (list, optional) – A list of strings, each one is a special token. For example special_tokens=["<pad>", "<unk>"]. Default: None, no special tokens will be added.
special_first (bool, optional) – Whether special_tokens will be prepended or appended to the vocab. If special_tokens is specified and special_first is set to True, special_tokens will be prepended; otherwise they will be appended. Default: True.
- Returns
Vocab, Vocab object built from the dataset.
Examples
>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>> dataset = ds.TextFileDataset("/path/to/sentence/piece/vocab/file", shuffle=False)
>>> vocab = text.Vocab.from_dataset(dataset, "text", freq_range=None, top_k=None,
...                                 special_tokens=["<pad>", "<unk>"],
...                                 special_first=True)
>>> dataset = dataset.map(operations=text.Lookup(vocab, "<unk>"), input_columns=["text"])
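The selection rules described above (filter by freq_range, sort by frequency with lexicographic tie-breaking, then apply top_k and place special tokens) can be sketched in plain Python. This is an illustrative mimic of the documented behavior, not the MindSpore implementation:

```python
from collections import Counter

def build_vocab(words, freq_range=None, top_k=None,
                special_tokens=None, special_first=True):
    """Plain-Python sketch of Vocab.from_dataset's selection rules."""
    counts = Counter(words)
    lo, hi = freq_range if freq_range else (None, None)
    lo = 1 if lo in (None, 0) else lo          # min_frequency=0 behaves like 1
    hi = len(words) if hi is None else hi      # None means total_words
    kept = [(w, c) for w, c in counts.items() if lo <= c <= hi]
    # Highest frequency first; ties are broken lexicographically.
    kept.sort(key=lambda wc: (-wc[1], wc[0]))
    if top_k is not None:
        kept = kept[:top_k]                    # fewer than top_k left: take all
    vocab = [w for w, _ in kept]
    specials = list(special_tokens) if special_tokens else []
    vocab = specials + vocab if special_first else vocab + specials
    return {w: i for i, w in enumerate(vocab)}
```

For example, with words ["b", "a", "a", "c", "c", "c"] and top_k=2, the two most frequent words "c" and "a" are kept, and any special tokens come first when special_first is True.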
- classmethod from_dict(word_dict)[source]
Build a vocab object from a dict.
- Parameters
word_dict (dict) – Dict containing word and id pairs, where word should be str and id should be int. Ids are recommended to start from 0 and be continuous. A ValueError will be raised if any id is negative.
- Returns
Vocab, Vocab object built from the dict.
Examples
>>> import mindspore.dataset.text as text
>>> vocab = text.Vocab.from_dict({"home": 3, "behind": 2, "the": 4, "world": 5, "<unk>": 6})
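The id constraint above can be sketched in plain Python; this is a hypothetical mimic of the validation, not the MindSpore implementation:

```python
def build_vocab_from_dict(word_dict):
    """Sketch of Vocab.from_dict's id check: every id must be non-negative."""
    for word, idx in word_dict.items():
        if idx < 0:
            raise ValueError(
                "word id must be non-negative, got %d for '%s'" % (idx, word))
    return dict(word_dict)
```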
- classmethod from_file(file_path, delimiter='', vocab_size=None, special_tokens=None, special_first=True)[source]
Build a vocab object from a file.
- Parameters
file_path (str) – Path to the file which contains the vocab list.
delimiter (str, optional) – A delimiter to break up each line in the file, where the first element is taken to be the word. Default: '', the whole line will be treated as a word.
vocab_size (int, optional) – Number of words to read from file_path. Default: None, all words are taken.
special_tokens (list, optional) – A list of strings, each one is a special token. For example special_tokens=["<pad>", "<unk>"]. Default: None, no special tokens will be added.
special_first (bool, optional) – Whether special_tokens will be prepended or appended to the vocab. If special_tokens is specified and special_first is set to True, special_tokens will be prepended; otherwise they will be appended. Default: True.
- Returns
Vocab, Vocab object built from the file.
Examples
>>> import mindspore.dataset.text as text
>>> # Assume vocab file contains the following content:
>>> # --- begin of file ---
>>> # apple,apple2
>>> # banana, 333
>>> # cat,00
>>> # --- end of file ---
>>>
>>> # Read file through this API and specify "," as delimiter.
>>> # The delimiter will break up each line in the file, then the first element is taken to be the word.
>>> vocab = text.Vocab.from_file("/path/to/simple/vocab/file", ",", None, ["<pad>", "<unk>"], True)
>>>
>>> # Finally, there are 5 words in the vocab: "<pad>", "<unk>", "apple", "banana", "cat".
>>> vocabulary = vocab.vocab()
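The line-parsing behavior described by the delimiter, vocab_size, and special_first parameters can be sketched in plain Python. This is an illustrative mimic operating on a list of lines, not the MindSpore implementation:

```python
def build_vocab_from_lines(lines, delimiter='', vocab_size=None,
                           special_tokens=None, special_first=True):
    """Sketch of Vocab.from_file's line parsing."""
    words = []
    for line in lines:
        line = line.rstrip('\n')
        # With a delimiter, only the first field of each line is the word;
        # with the default '', the whole line is the word.
        word = line.split(delimiter)[0] if delimiter else line
        words.append(word)
        if vocab_size is not None and len(words) == vocab_size:
            break                              # read at most vocab_size words
    specials = list(special_tokens) if special_tokens else []
    vocab = specials + words if special_first else words + specials
    return {w: i for i, w in enumerate(vocab)}
```

Applied to the three-line file from the example above with "," as the delimiter, this yields the five words "<pad>", "<unk>", "apple", "banana", "cat".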
- classmethod from_list(word_list, special_tokens=None, special_first=True)[source]
Build a vocab object from a list of words.
- Parameters
word_list (list) – A list of strings, where each element is a word.
special_tokens (list, optional) – A list of strings, each one is a special token. For example special_tokens=["<pad>", "<unk>"]. Default: None, no special tokens will be added.
special_first (bool, optional) – Whether special_tokens is prepended or appended to the vocab. If special_tokens is specified and special_first is set to True, special_tokens will be prepended; otherwise they will be appended. Default: True.
- Returns
Vocab, Vocab object built from the list.
Examples
>>> import mindspore.dataset.text as text
>>> vocab = text.Vocab.from_list(["w1", "w2", "w3"], special_tokens=["<unk>"], special_first=True)
- ids_to_tokens(ids)[source]
Converts a single index or a sequence of indices into a token or a sequence of tokens. If an id does not exist in the vocab, an empty string is returned.
- Parameters
ids (Union[int, list[int]]) – The token id (or token ids) to convert to tokens.
- Returns
The decoded token(s).
Examples
>>> import mindspore.dataset.text as text
>>> vocab = text.Vocab.from_list(["w1", "w2", "w3"], special_tokens=["<unk>"], special_first=True)
>>> token = vocab.ids_to_tokens(0)
- tokens_to_ids(tokens)[source]
Converts a token string or a sequence of tokens into a single integer id or a sequence of ids. If a token does not exist in the vocab, -1 is returned as its id.
- Parameters
tokens (Union[str, list[str]]) – One or several token(s) to convert to token id(s).
- Returns
The token id or list of token ids.
Examples
>>> import mindspore.dataset.text as text
>>> vocab = text.Vocab.from_list(["w1", "w2", "w3"], special_tokens=["<unk>"], special_first=True)
>>> ids = vocab.tokens_to_ids(["w1", "w3"])
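The fallback behavior of the two lookup methods (an empty string for an unknown id, -1 for an out-of-vocabulary token) can be sketched in plain Python over an ordinary dict. This is an illustrative mimic, not the MindSpore implementation:

```python
def tokens_to_ids(vocab, tokens):
    """Sketch of the -1 fallback for out-of-vocabulary tokens."""
    if isinstance(tokens, str):
        return vocab.get(tokens, -1)
    return [vocab.get(t, -1) for t in tokens]

def ids_to_tokens(vocab, ids):
    """Sketch of the empty-string fallback for unknown ids."""
    reverse = {i: w for w, i in vocab.items()}
    if isinstance(ids, int):
        return reverse.get(ids, "")
    return [reverse.get(i, "") for i in ids]
```

With a vocab built as in the examples above ({"<unk>": 0, "w1": 1, "w2": 2, "w3": 3}), looking up an unseen token yields -1, and an out-of-range id yields "".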