mindspore.dataset.text.Vocab
- class mindspore.dataset.text.Vocab
Create Vocab for training NLP models.
Vocab is a collection of all possible Tokens in the data, preserving the mapping between each Token and its ID.
- classmethod from_dataset(dataset, columns=None, freq_range=None, top_k=None, special_tokens=None, special_first=True)
Build a Vocab from a given dataset.
The samples in the dataset are used as a corpus to create the Vocab, in which Tokens are arranged in ascending order of frequency, and Tokens with the same frequency are arranged in alphabetical order.
- Parameters
dataset (Dataset) – The dataset to build the Vocab from.
columns (list[str], optional) – The names of the data columns used to create the Vocab. Default: None, use all columns.
freq_range (tuple[int, int], optional) – The Token frequency range used to create the Vocab. Must contain two elements, the minimum and maximum frequencies; Tokens whose frequency falls within this range are retained. If the minimum or maximum is None, there is no lower or upper limit, respectively. Default: None, no Token frequency range restriction.
top_k (int, optional) – Only the specified number of Tokens with the highest frequency are selected to build the Vocab. This selection is performed after Token frequency filtering. If the value is greater than the total number of Tokens, all Tokens are retained. Default: None, no limit on the number of Tokens.
special_tokens (list[str], optional) – A list of special Tokens to append to the Vocab. Default: None, no special Token is appended.
special_first (bool, optional) – Whether to add the special Tokens to the top of the Vocab; otherwise they are added to the bottom. Default: True.
- Returns
Vocab, Vocab built from the dataset.
- Raises
TypeError – If columns is not of type list[str].
TypeError – If freq_range is not of type tuple[int, int].
ValueError – If any element of freq_range is negative.
TypeError – If top_k is not of type int.
ValueError – If top_k is not positive.
TypeError – If special_tokens is not of type list[str].
ValueError – If there are duplicate elements in special_tokens.
TypeError – If special_first is not of type bool.
Examples
>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>>
>>> dataset = ds.TextFileDataset("/path/to/sentence/piece/vocab/file", shuffle=False)
>>> vocab = text.Vocab.from_dataset(dataset, "text", freq_range=None, top_k=None,
...                                 special_tokens=["<pad>", "<unk>"],
...                                 special_first=True)
>>> # Use the vocab to look up string to id
>>> lookup = text.Lookup(vocab, "<unk>")
>>> id = lookup("text1")
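The interaction between freq_range and top_k described above can be illustrated with a small pure-Python sketch. It is not the MindSpore implementation: the helper name `select_tokens`, the whitespace tokenization, the inclusive range bounds, and the alphabetical tie-break when truncating to top_k are all assumptions made for illustration. It only shows which Tokens would be retained, not the IDs they would receive.

```python
from collections import Counter


def select_tokens(corpus, freq_range=None, top_k=None):
    """Return the set of tokens kept after frequency filtering, then top_k.

    Hypothetical helper: mirrors the documented order of operations only.
    """
    # Assumption: tokens are obtained by splitting each sample on whitespace.
    counts = Counter(tok for line in corpus for tok in line.split())

    # A None bound means "no limit" on that side; bounds are assumed inclusive.
    lo, hi = freq_range if freq_range is not None else (None, None)
    kept = {
        tok: n
        for tok, n in counts.items()
        if (lo is None or n >= lo) and (hi is None or n <= hi)
    }

    # top_k is applied after frequency filtering: keep the most frequent tokens.
    # Assumption: ties are broken alphabetically.
    ranked = sorted(kept, key=lambda tok: (-kept[tok], tok))
    if top_k is not None:
        ranked = ranked[:top_k]
    return set(ranked)


corpus = ["the cat sat", "the cat ran", "the dog"]
all_tokens = select_tokens(corpus)                           # no restriction
frequent = select_tokens(corpus, freq_range=(2, None))       # frequency >= 2
top_one = select_tokens(corpus, freq_range=(1, 2), top_k=1)  # filter, then top 1
```

In this corpus "the" occurs three times and "cat" twice, so filtering to the range (1, 2) drops "the" before top_k is applied, and the single retained Token is "cat".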
- classmethod from_dict(word_dict)
Build a Vocab from a given dictionary.
- Parameters
word_dict (dict[str, int]) – A dictionary storing the mappings between each Token and its ID.
- Returns
Vocab, Vocab built from the dictionary.
- Raises
TypeError – If word_dict is not of type dict[str, int].
ValueError – If any value in word_dict is negative.
Examples
>>> import mindspore.dataset.text as text
>>> vocab = text.Vocab.from_dict({"home": 3, "behind": 2, "the": 4, "world": 5, "<unk>": 6})
>>>
>>> # look up ids to string
>>> tokens = vocab.ids_to_tokens([3, 4, 5])
>>> print(tokens)
['home', 'the', 'world']
- classmethod from_file(file_path, delimiter='', vocab_size=None, special_tokens=None, special_first=True)
Build a Vocab from a file.
- Parameters
file_path (str) – The path of the file to build the Vocab from.
delimiter (str, optional) – The separator for the Token in each line of the file. The string before the separator is treated as the Token. Default: '', the whole line is treated as a Token.
vocab_size (int, optional) – The upper limit on the number of Tokens that the Vocab can contain. Default: None, no upper limit on the number of Tokens.
special_tokens (list[str], optional) – A list of special Tokens to append to the Vocab. Default: None, no special Token is appended.
special_first (bool, optional) – Whether to add the special Tokens to the top of the Vocab; otherwise they are added to the bottom. Default: True.
- Returns
Vocab, Vocab built from the file.
- Raises
TypeError – If file_path is not of type str.
TypeError – If delimiter is not of type str.
ValueError – If vocab_size is not positive.
TypeError – If special_tokens is not of type list[str].
ValueError – If there are duplicate elements in special_tokens.
TypeError – If special_first is not of type bool.
Examples
>>> import mindspore.dataset.text as text
>>> # Assume vocab file contains the following content:
>>> # --- begin of file ---
>>> # apple,apple2
>>> # banana, 333
>>> # cat,00
>>> # --- end of file ---
>>>
>>> # Read file through this API and specify "," as delimiter.
>>> # The delimiter will break up each line in file, then the first element is taken to be the word.
>>> vocab = text.Vocab.from_file("/path/to/simple/vocab/file", ",", None, ["<pad>", "<unk>"], True)
>>>
>>> # Finally, there are 5 words in the vocab: "<pad>", "<unk>", "apple", "banana", "cat".
>>> vocabulary = vocab.vocab()
>>>
>>> # look up strings to ids
>>> ids = vocab.tokens_to_ids(["apple", "banana"])
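The effect of the delimiter parameter can be sketched in plain Python, using the same three file lines as the example above. The helper `extract_tokens` is hypothetical and is not part of MindSpore; it only illustrates the documented rule that the string before the separator is the Token, and that an empty delimiter keeps the whole line. It deliberately ignores vocab_size and special_tokens.

```python
def extract_tokens(lines, delimiter=""):
    """Return the Token part of each vocab-file line (illustrative helper)."""
    tokens = []
    for line in lines:
        line = line.rstrip("\n")
        if delimiter:
            # Keep only the text before the first occurrence of the delimiter.
            tokens.append(line.split(delimiter, 1)[0])
        else:
            # An empty delimiter means the whole line is the Token.
            tokens.append(line)
    return tokens


lines = ["apple,apple2\n", "banana, 333\n", "cat,00\n"]
with_delimiter = extract_tokens(lines, ",")  # ['apple', 'banana', 'cat']
whole_lines = extract_tokens(lines)          # each full line is one Token
```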
- classmethod from_list(word_list, special_tokens=None, special_first=True)
Build a Vocab from a given Token list.
- Parameters
word_list (list[str]) – The Token list to build the Vocab from.
special_tokens (list[str], optional) – A list of special Tokens to append to the Vocab. Default: None, no special Token is appended.
special_first (bool, optional) – Whether to add the special Tokens to the top of the Vocab; otherwise they are added to the bottom. Default: True.
- Returns
Vocab, Vocab built from the list.
- Raises
TypeError – If word_list is not of type list[str].
ValueError – If there are duplicate elements in word_list.
TypeError – If special_tokens is not of type list[str].
ValueError – If there are duplicate elements in special_tokens.
TypeError – If special_first is not of type bool.
Examples
>>> import mindspore.dataset.text as text
>>> vocab = text.Vocab.from_list(["w1", "w2", "w3"], special_tokens=["<unk>"], special_first=True)
>>> # look up strings to ids
>>> ids = vocab.tokens_to_ids(["w1", "w3"])
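How special_first might affect the resulting IDs can be sketched in plain Python. The helper `assign_ids` is hypothetical and not part of MindSpore. It assumes IDs are assigned consecutively from 0 in list order, which matches the tokens_to_ids example on this page for special_first=True; the layout shown for special_first=False is an assumption based on the parameter description.

```python
def assign_ids(word_list, special_tokens=None, special_first=True):
    """Map tokens to consecutive IDs, placing special tokens first or last."""
    specials = list(special_tokens or [])
    words = list(word_list)
    # Special tokens go to the top (lowest IDs) or the bottom (highest IDs).
    ordered = specials + words if special_first else words + specials
    return {token: idx for idx, token in enumerate(ordered)}


first = assign_ids(["w1", "w2", "w3"], ["<unk>"], special_first=True)
last = assign_ids(["w1", "w2", "w3"], ["<unk>"], special_first=False)
```

With special_first=True, "<unk>" takes ID 0 and "w1" starts at 1; with special_first=False, "w1" starts at 0 and "<unk>" takes the last ID.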
- ids_to_tokens(ids)
Look up the Token corresponding to the specified ID.
- Parameters
ids (Union[int, list[int], numpy.ndarray]) – The ID or list of IDs to be looked up. If the ID does not exist, an empty string is returned.
- Returns
Union[str, list[str]], the Token(s) corresponding to the ID(s).
- Raises
TypeError – If ids is not of type Union[int, list[int], numpy.ndarray].
ValueError – If element of ids is negative.
Examples
>>> import mindspore.dataset.text as text
>>> vocab = text.Vocab.from_list(["w1", "w2", "w3"], special_tokens=["<unk>"], special_first=True)
>>> token = vocab.ids_to_tokens(1)
>>> print(token)
w1
- tokens_to_ids(tokens)
Look up the ID corresponding to the specified Token.
- Parameters
tokens (Union[str, list[str], numpy.ndarray]) – The Token or list of Tokens to be looked up. If the Token does not exist, -1 is returned.
- Returns
Union[int, list[int]], the ID(s) corresponding to the Token(s).
- Raises
TypeError – If tokens is not of type Union[str, list[str], numpy.ndarray].
Examples
>>> import mindspore.dataset.text as text
>>> vocab = text.Vocab.from_list(["w1", "w2", "w3"], special_tokens=["<unk>"], special_first=True)
>>> ids = vocab.tokens_to_ids(["w1", "w3"])
>>> print(ids)
[1, 3]
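The documented fallback values for missing entries (-1 for an unknown Token, an empty string for an unknown ID) can be mimicked with two plain dictionaries. This is a sketch of the lookup semantics only, not the MindSpore implementation: the module-level `mapping`, the `inverse` dictionary, and the two functions below are illustrative stand-ins that reuse the method names, and they do not handle numpy.ndarray input.

```python
# Same vocabulary as the example above: "<unk>" first, then w1..w3.
mapping = {"<unk>": 0, "w1": 1, "w2": 2, "w3": 3}
inverse = {idx: token for token, idx in mapping.items()}


def tokens_to_ids(tokens):
    """Return the ID(s) for the Token(s); unknown Tokens map to -1."""
    if isinstance(tokens, str):
        return mapping.get(tokens, -1)
    return [mapping.get(token, -1) for token in tokens]


def ids_to_tokens(ids):
    """Return the Token(s) for the ID(s); unknown IDs map to ''."""
    if isinstance(ids, int):
        return inverse.get(ids, "")
    return [inverse.get(idx, "") for idx in ids]


known_ids = tokens_to_ids(["w1", "w3"])  # both Tokens exist
missing_id = tokens_to_ids("w9")         # Token not in the vocabulary
single_token = ids_to_tokens(1)          # scalar in, scalar out
mixed_tokens = ids_to_tokens([3, 99])    # ID 99 does not exist
```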
- vocab()
Get the dictionary of mappings between Tokens and their IDs.
- Returns
dict[str, int], the dictionary of mappings between Tokens and IDs.
Examples
>>> import mindspore.dataset.text as text
>>> vocab = text.Vocab.from_list(["word_1", "word_2", "word_3", "word_4"])
>>> vocabulary_dict = vocab.vocab()
>>> print(sorted(vocabulary_dict.items()))
[('word_1', 0), ('word_2', 1), ('word_3', 2), ('word_4', 3)]