Class SentencePieceVocab
Defined in File text.h
Class Documentation
-
class SentencePieceVocab
SentencePiece object used for word segmentation.
Public Functions
-
SentencePieceVocab()
Constructor.
-
~SentencePieceVocab() = default
Destructor.
Public Static Functions
-
static Status BuildFromFile(const std::vector<std::string> &path_list, const int32_t vocab_size, const float character_coverage, const SentencePieceModel model_type, const std::unordered_map<std::string, std::string> &params, std::shared_ptr<SentencePieceVocab> *vocab)
Build a SentencePiece object from a file.
- Example
std::string dataset_path = datasets_root_path_ + "/test_sentencepiece/vocab.txt";
std::vector<std::string> path_list;
path_list.emplace_back(dataset_path);
std::unordered_map<std::string, std::string> param_map;
std::shared_ptr<SentencePieceVocab> spm = std::make_unique<SentencePieceVocab>();
Status rc = SentencePieceVocab::BuildFromFile(path_list, 5000, 0.9995, SentencePieceModel::kUnigram, param_map, &spm);
- Parameters
path_list – [in] List of paths to the files that contain the SentencePiece list.
vocab_size – [in] Vocabulary size.
character_coverage – [in] Fraction of characters covered by the model. Good defaults are 0.9995 for languages with a rich character set, such as Japanese or Chinese, and 1.0 for languages with a small character set.
model_type – [in] It can be any of [SentencePieceModel::kUnigram, SentencePieceModel::kBpe, SentencePieceModel::kChar, SentencePieceModel::kWord], default is SentencePieceModel::kUnigram. The input sentence must be pre-tokenized when using the SentencePieceModel::kWord type.
SentencePieceModel::kUnigram, the Unigram Language Model, which assumes that the next word in the sentence is independent of the previous words generated by the model.
SentencePieceModel::kBpe, the byte pair encoding algorithm, which replaces the most frequent pair of bytes in a sentence with a single, unused byte.
SentencePieceModel::kChar, the character-based SentencePiece model type.
SentencePieceModel::kWord, the word-based SentencePiece model type.
params – [in] A dictionary of additional parameters, which may be empty (the parameters are derived from the SentencePiece library).
vocab – [out] A SentencePieceVocab object.
- Returns
Status. The SentencePieceVocab built from the file is returned through the vocab output parameter.
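The byte pair encoding description under model_type can be illustrated with a single merge step. The following is a minimal, standalone sketch and not the SentencePiece or MindSpore implementation: it works on string symbols rather than raw bytes, and the names Tokens, MostFrequentPair, and MergePair are hypothetical helpers introduced only for this illustration.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical alias: a sentence represented as a sequence of symbols.
using Tokens = std::vector<std::string>;
using SymbolPair = std::pair<std::string, std::string>;

// Count every adjacent symbol pair and return the most frequent one.
// Ties go to the lexicographically smallest pair, because std::map
// iterates in sorted order and the comparison below is strict.
SymbolPair MostFrequentPair(const Tokens &tokens) {
  std::map<SymbolPair, int> counts;
  for (std::size_t i = 0; i + 1 < tokens.size(); ++i) {
    ++counts[{tokens[i], tokens[i + 1]}];
  }
  SymbolPair best;
  int best_count = 0;
  for (const auto &entry : counts) {
    if (entry.second > best_count) {
      best = entry.first;
      best_count = entry.second;
    }
  }
  return best;
}

// Replace each non-overlapping occurrence of `pair`, scanning left to
// right, with one merged symbol (the concatenation of both halves).
Tokens MergePair(const Tokens &tokens, const SymbolPair &pair) {
  Tokens merged;
  for (std::size_t i = 0; i < tokens.size(); ++i) {
    const bool match = i + 1 < tokens.size() &&
                       tokens[i] == pair.first &&
                       tokens[i + 1] == pair.second;
    if (match) {
      merged.push_back(pair.first + pair.second);
      ++i;  // Skip the second half of the merged pair.
    } else {
      merged.push_back(tokens[i]);
    }
  }
  return merged;
}
```

Calling MostFrequentPair on the symbols a, b, a, b, c selects the pair (a, b), and MergePair then rewrites the sequence with the merged symbol ab in place of each occurrence. A BPE trainer would repeat such steps until the vocabulary reaches the requested size; how vocab_size maps to the number of merges inside SentencePiece is not covered by this sketch.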
-
static Status SaveModel(const std::shared_ptr<SentencePieceVocab> *vocab, const std::string path, std::string filename)
Save the SentencePiece model to the given file path.
- Example
// Save vocab model to local
vocab->SaveModel(&vocab, datasets_root_path_ + "/test_sentencepiece", "m.model");
- Parameters
vocab – [in] A SentencePiece object to be saved.
path – [in] Path to store the model.
filename – [in] File name under which the model is saved.