Class Vocab
Defined in File text.h
Class Documentation
-
class Vocab
Vocab object that is used to save pairs of words and ids.
Note
It contains a map from each word (string) to an ID (integer) and also supports the reverse lookup from ID to word.
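The note above can be pictured with a minimal standalone sketch of a two-way table. This is not the library's implementation: `TinyVocab` and its members are hypothetical names used only for illustration, and keeping two maps is just one possible way to support both directions.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for illustration only; not the real Vocab class.
struct TinyVocab {
  std::unordered_map<std::string, int32_t> word2id;  // forward direction
  std::unordered_map<int32_t, std::string> id2word;  // reverse direction

  // Register the pair in both maps so either lookup is a single hash probe.
  void Add(const std::string &word, int32_t id) {
    word2id[word] = id;
    id2word[id] = word;
  }
};
```

Storing both directions doubles the memory used but avoids scanning the forward map on every reverse lookup.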
Public Functions
-
WordIdType TokensToIds(const WordType &word) const
Look up the ID of a word. If the word doesn’t exist in the vocab, return -1.
- Parameters
word – Word to be looked up.
- Returns
ID of the word in the vocab.
Example
// Look up single tokens and convert them to IDs
auto single_index = vocab->TokensToIds("home");
single_index = vocab->TokensToIds("hello");
-
std::vector<WordIdType> TokensToIds(const std::vector<WordType> &words) const
Look up the IDs of multiple words. Any word that doesn’t exist in the vocab maps to -1.
- Parameters
words – Words to be looked up.
- Returns
IDs of the words in the vocab, in the same order as the input.
Example
// Look up multiple tokens
auto multi_indexs = vocab->TokensToIds(std::vector<std::string>{"<pad>", "behind"});
std::vector<int32_t> expected_multi_indexs = {0, 4};
multi_indexs = vocab->TokensToIds(std::vector<std::string>{"<pad>", "apple"});
expected_multi_indexs = {0, -1};
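The missing-word behavior described for both overloads can be sketched without the library. The snippet below is a standalone illustration over a plain `std::unordered_map`. `LookupId`, `LookupIds`, `WordMap`, and `kMissingId` are hypothetical names, and the -1 sentinel is taken from the descriptions above.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

using WordMap = std::unordered_map<std::string, int32_t>;

// Sentinel for a word that is not in the map, as stated in the descriptions above.
constexpr int32_t kMissingId = -1;

// Single lookup: return the stored ID, or the sentinel when the word is absent.
int32_t LookupId(const WordMap &map, const std::string &word) {
  auto it = map.find(word);
  return it == map.end() ? kMissingId : it->second;
}

// Batch lookup: one output ID per input word, in input order.
std::vector<int32_t> LookupIds(const WordMap &map, const std::vector<std::string> &words) {
  std::vector<int32_t> ids;
  ids.reserve(words.size());
  for (const auto &word : words) {
    ids.push_back(LookupId(map, word));
  }
  return ids;
}
```

Because a miss is reported in-band as -1 rather than as an error, callers that care about unknown words need to check the result explicitly.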
-
WordType IdsToTokens(const WordIdType &id)
Look up the word for an ID. If the ID doesn’t exist in the vocab, return an empty string.
- Parameters
id – ID to be looked up.
- Returns
The word corresponding to the ID.
Example
// Reverse lookup: convert an ID to its token
auto single_word = vocab->IdsToTokens(2);
single_word = vocab->IdsToTokens(-1);
-
std::vector<WordType> IdsToTokens(const std::vector<WordIdType> &ids)
Look up the words for multiple IDs. Any ID that doesn’t exist in the vocab maps to an empty string.
- Parameters
ids – IDs to be looked up.
- Returns
The words corresponding to the IDs, in the same order as the input.
Example
// Reverse lookup of multiple IDs
auto multi_words = vocab->IdsToTokens(std::vector<int32_t>{0, 4});
std::vector<std::string> expected_multi_words = {"<pad>", "behind"};
multi_words = vocab->IdsToTokens(std::vector<int32_t>{0, 99});
expected_multi_words = {"<pad>", ""};
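One way to picture the reverse lookup is to invert a word-to-ID map once and then query the inverted map. The sketch below is standalone and does not reflect how the library stores its reverse mapping. `Invert`, `LookupWord`, `WordMap`, and `IdMap` are hypothetical names, and the empty-string result on a miss follows the descriptions above.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

using WordMap = std::unordered_map<std::string, int32_t>;
using IdMap = std::unordered_map<int32_t, std::string>;

// Build the ID-to-word direction from a word-to-ID map.
IdMap Invert(const WordMap &word2id) {
  IdMap id2word;
  id2word.reserve(word2id.size());
  for (const auto &kv : word2id) {
    id2word.emplace(kv.second, kv.first);
  }
  return id2word;
}

// Return the word for an ID, or an empty string when the ID is unknown.
std::string LookupWord(const IdMap &id2word, int32_t id) {
  auto it = id2word.find(id);
  return it == id2word.end() ? std::string() : it->second;
}
```

Note that an empty string is indistinguishable from a vocab entry that is itself empty, so this convention only works cleanly when empty words are never stored.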
-
explicit Vocab(std::unordered_map<WordType, WordIdType> map)
Constructor. It shouldn’t be called directly, but it can’t be made private because std::make_unique() needs access to it.
- Parameters
map – Sanitized word2id map.
-
void AppendWord(const std::string &word)
Add one word to the vocab; its index is assigned automatically by incrementing.
- Parameters
word – Word to be added; it is skipped if it already exists.
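The append-or-skip behavior can be sketched on a plain map. This is a standalone illustration, not the library's code. `AppendWordSketch` and `WordMap` are hypothetical names, and using the current map size as the next index is an assumption that only holds when existing IDs are contiguous from 0.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

using WordMap = std::unordered_map<std::string, int32_t>;

// Add `word` with the next free index; return false if it was already present.
// Assumption: existing IDs are 0..size-1, so the next index equals the map size.
bool AppendWordSketch(WordMap *map, const std::string &word) {
  if (map->count(word) > 0) {
    return false;  // existing words keep their original index
  }
  const int32_t next_id = static_cast<int32_t>(map->size());
  map->emplace(word, next_id);
  return true;
}
```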
-
inline const std::unordered_map<WordType, WordIdType> &GetVocab() const
Return a read-only reference to the vocab as an unordered_map.
- Returns
An unordered_map of word2id.
-
Vocab() = default
Constructor.
-
~Vocab() = default
Destructor.
Public Static Functions
-
static Status BuildFromUnorderedMap(const std::unordered_map<WordType, WordIdType> &words, std::shared_ptr<Vocab> *vocab)
Build a vocab from an unordered_map. IDs should be unique and contiguous.
- Parameters
words – [in] An unordered_map containing word-ID pairs.
vocab – [out] A vocab object.
- Returns
Status code.
Example
// Build a map
std::unordered_map<std::string, int32_t> dict;
dict["banana"] = 0;
dict["apple"] = 1;
dict["cat"] = 2;
dict["dog"] = 3;
// Build vocab from map
std::shared_ptr<Vocab> vocab = std::make_shared<Vocab>();
Status s = Vocab::BuildFromUnorderedMap(dict, &vocab);
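The requirement that IDs have no duplicates and no gaps can be checked before building. The sketch below is a standalone pre-check. `HasContiguousUniqueIds` and `WordMap` are hypothetical names, and starting the range at 0 is an assumption based on the example above. It says nothing about how the library itself validates its input.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

using WordMap = std::unordered_map<std::string, int32_t>;

// True when the n IDs are exactly 0..n-1, each used once.
// Assumption: the contiguous range starts at 0.
bool HasContiguousUniqueIds(const WordMap &words) {
  std::vector<bool> seen(words.size(), false);
  for (const auto &kv : words) {
    const int32_t id = kv.second;
    if (id < 0 || static_cast<std::size_t>(id) >= words.size()) {
      return false;  // outside [0, n), so there must be a gap somewhere
    }
    if (seen[static_cast<std::size_t>(id)]) {
      return false;  // the same ID is assigned to two words
    }
    seen[static_cast<std::size_t>(id)] = true;
  }
  return true;
}
```

With n entries, n in-range IDs, and no repeats, the IDs must cover every value from 0 to n-1, so no separate gap check is needed.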
-
static Status BuildFromVector(const std::vector<WordType> &words, const std::vector<WordType> &special_tokens, bool prepend_special, std::shared_ptr<Vocab> *vocab)
Build a vocab from a C++ vector. The IDs are unique and contiguous.
- Parameters
words – [in] A vector of strings containing words.
special_tokens – [in] A vector of strings containing special tokens.
prepend_special – [in] Whether the special_tokens are prepended (true) or appended (false) to the vocab.
vocab – [out] A vocab object.
- Returns
Status code.
Example
// Build vocab from a vector of words; special tokens are prepended to the vocab
std::vector<std::string> list = {"apple", "banana", "cat", "dog", "egg"};
std::shared_ptr<Vocab> vocab = std::make_shared<Vocab>();
Status s = Vocab::BuildFromVector(list, {"<unk>"}, true, &vocab);
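The effect of prepend_special on ID assignment can be shown with a standalone sketch. `AssignIds` and `WordMap` are hypothetical names, not part of the library. Silently skipping duplicate tokens is a simplification made here, and the real function may report an error through its Status instead.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

using WordMap = std::unordered_map<std::string, int32_t>;

// Assign IDs 0, 1, 2, ... in insertion order; special tokens go first or last.
// Simplification: duplicates are skipped rather than reported.
WordMap AssignIds(const std::vector<std::string> &words,
                  const std::vector<std::string> &special_tokens,
                  bool prepend_special) {
  WordMap map;
  int32_t next_id = 0;
  auto add_all = [&map, &next_id](const std::vector<std::string> &tokens) {
    for (const auto &token : tokens) {
      if (map.emplace(token, next_id).second) {
        ++next_id;  // advance only when the token was actually inserted
      }
    }
  };
  if (prepend_special) {
    add_all(special_tokens);
    add_all(words);
  } else {
    add_all(words);
    add_all(special_tokens);
  }
  return map;
}
```

Under these assumptions, prepending `{"<unk>"}` to the five-word list from the example shifts every word's ID up by one, while appending leaves the word IDs starting at 0 and places `<unk>` last.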
-
static Status BuildFromFile(const std::string &path, const std::string &delimiter, int32_t vocab_size, const std::vector<WordType> &special_tokens, bool prepend_special, std::shared_ptr<Vocab> *vocab)
Build a vocab from a vocab file. IDs are assigned automatically.
- Parameters
path – [in] Path to the vocab file; each line in the file is treated as one word (spaces included).
delimiter – [in] Delimiter used to split each line; characters after the delimiter are discarded.
vocab_size – [in] Number of lines to be read from file.
special_tokens – [in] A vector of strings containing special tokens.
prepend_special – [in] Whether the special_tokens are prepended (true) or appended (false) to the vocab.
vocab – [out] A vocab object.
- Returns
Status code.
Example
// Build vocab from a local file
std::string vocab_dir = datasets_root_path_ + "/testVocab/vocab_list.txt";
std::shared_ptr<Vocab> vocab = std::make_shared<Vocab>();
Status s = Vocab::BuildFromFile(vocab_dir, ",", -1, {"<pad>", "<unk>"}, true, &vocab);
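The line handling described by the path, delimiter, and vocab_size parameters can be sketched over any input stream. This is a standalone illustration, not the library's parser. `ReadWords` is a hypothetical name. Treating a negative vocab_size as "read every line" and an empty delimiter as "keep the whole line" are assumptions made for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Read one word per line, keeping only the text before the first delimiter.
// Assumptions: vocab_size < 0 means no limit; an empty delimiter keeps the whole line.
std::vector<std::string> ReadWords(std::istream &in, const std::string &delimiter,
                                   int32_t vocab_size) {
  std::vector<std::string> words;
  std::string line;
  while ((vocab_size < 0 || words.size() < static_cast<std::size_t>(vocab_size)) &&
         std::getline(in, line)) {
    if (!delimiter.empty()) {
      // find() returns npos when the delimiter is absent, so substr keeps the full line.
      line = line.substr(0, line.find(delimiter));
    }
    words.push_back(line);
  }
  return words;
}
```

A `std::istringstream` can stand in for the file when trying this out, and spaces inside a line are preserved because only the delimiter splits it.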
-