Class SentencePieceVocab
Defined in File text.h
Class Documentation
class SentencePieceVocab
SentencePiece object used for word segmentation.
Public Static Functions
Build a SentencePiece object from a file.
- Parameters
path_list – [in] Path to the file which contains the SentencePiece list.
vocab_size – [in] Vocabulary size.
character_coverage – [in] Fraction of characters covered by the model. Good defaults are 0.9995 for languages with a rich character set, such as Japanese or Chinese, and 1.0 for languages with a small character set.
model_type – [in] One of [SentencePieceModel::kUnigram, SentencePieceModel::kBpe, SentencePieceModel::kChar, SentencePieceModel::kWord]. The default is SentencePieceModel::kUnigram. The input sentence must be pre-tokenized when using the SentencePieceModel::kWord type.
SentencePieceModel::kUnigram: Unigram Language Model, in which the next word in the sentence is assumed to be independent of the previous words generated by the model.
SentencePieceModel::kBpe: the byte pair encoding algorithm, which replaces the most frequent pair of bytes in a sentence with a single, unused byte.
SentencePieceModel::kChar: the character-based SentencePiece model type.
SentencePieceModel::kWord: the word-based SentencePiece model type.
params – [in] A dictionary with no incoming parameters (the parameters are derived from the SentencePiece library).
vocab – [out] A SentencePieceVocab object.
- Returns
SentencePieceVocab, vocab built from the file.
Example
std::string dataset_path;
dataset_path = datasets_root_path_ + "/test_sentencepiece/vocab.txt";
std::vector<std::string> path_list;
path_list.emplace_back(dataset_path);
std::unordered_map<std::string, std::string> param_map;
std::shared_ptr<SentencePieceVocab> spm = std::make_shared<SentencePieceVocab>();
Status rc = SentencePieceVocab::BuildFromFile(path_list, 5000, 0.9995, SentencePieceModel::kUnigram, param_map, &spm);
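The byte pair encoding description under model_type can be illustrated with a single merge step. The following is a minimal sketch, not the SentencePiece or MindSpore implementation: MergeMostFrequentPair is a hypothetical helper, symbols are represented as strings rather than raw bytes, ties are assumed to be broken in favor of the lexicographically smallest pair, and the merged symbol is formed by concatenation instead of allocating an unused byte. A real trainer repeats this kind of step over a whole corpus until the requested vocabulary size is reached.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper for illustration only; not part of any library API.
// Performs one BPE merge step: finds the most frequent adjacent pair of
// symbols and replaces its occurrences, scanning left to right, with a
// single merged symbol.
std::vector<std::string> MergeMostFrequentPair(const std::vector<std::string> &symbols) {
  if (symbols.size() < 2) {
    return symbols;  // nothing to merge
  }

  // Count every adjacent pair of symbols.
  std::map<std::pair<std::string, std::string>, std::size_t> counts;
  for (std::size_t i = 0; i + 1 < symbols.size(); ++i) {
    ++counts[std::make_pair(symbols[i], symbols[i + 1])];
  }

  // Pick the pair with the highest count. std::map iterates in key order,
  // so a tie resolves to the lexicographically smallest pair (an assumption
  // made here to keep the result deterministic).
  auto best = counts.begin();
  for (auto it = counts.begin(); it != counts.end(); ++it) {
    if (it->second > best->second) {
      best = it;
    }
  }
  const std::pair<std::string, std::string> target = best->first;

  // Rewrite the sequence, merging each non-overlapping occurrence.
  std::vector<std::string> merged;
  for (std::size_t i = 0; i < symbols.size();) {
    if (i + 1 < symbols.size() && symbols[i] == target.first && symbols[i + 1] == target.second) {
      merged.push_back(target.first + target.second);
      i += 2;
    } else {
      merged.push_back(symbols[i]);
      ++i;
    }
  }
  return merged;
}
```

For the symbol sequence a a a b d a a a b a c, the pair (a, a) occurs most often, so one call yields aa a b d aa a b a c.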
Save the SentencePiece model into given file path.
- Parameters
vocab – [in] A SentencePiece object to be saved.
path – [in] Path to store the model.
filename – [in] The save name of model file.
Example
// Save the vocab model to a local file.
vocab->SaveModel(&vocab, datasets_root_path_ + "/test_sentencepiece", "m.model");