Class SentencePieceVocab
Defined in File text.h
Class Documentation
class SentencePieceVocab
SentencePiece object that is used for word segmentation.
Public Static Functions
Build a SentencePiece object from a file.
- Example
// Collect the input file(s) used to build the vocab.
std::string dataset_path = datasets_root_path_ + "/test_sentencepiece/vocab.txt";
std::vector<std::string> path_list;
path_list.emplace_back(dataset_path);
// No extra SentencePiece parameters are passed in this example.
std::unordered_map<std::string, std::string> param_map;
std::shared_ptr<SentencePieceVocab> spm = std::make_shared<SentencePieceVocab>();
Status rc = SentencePieceVocab::BuildFromFile(path_list, 5000, 0.9995, SentencePieceModel::kUnigram, param_map, &spm);
- Parameters
path_list – [in] Path to the file which contains the SentencePiece list.
vocab_size – [in] Vocabulary size.
character_coverage – [in] Fraction of characters covered by the model. Good defaults are 0.9995 for languages with a rich character set, such as Japanese or Chinese, and 1.0 for languages with a small character set.
model_type – [in] It can be any of [SentencePieceModel::kUnigram, SentencePieceModel::kBpe, SentencePieceModel::kChar, SentencePieceModel::kWord], default is SentencePieceModel::kUnigram. The input sentence must be pre-tokenized when using the SentencePieceModel::kWord type.
SentencePieceModel::kUnigram, the Unigram Language Model, assumes that the next word in the sentence is independent of the previous words generated by the model.
SentencePieceModel::kBpe refers to the byte pair encoding algorithm, which replaces the most frequent pair of bytes in a sentence with a single, unused byte.
SentencePieceModel::kChar refers to the character-based SentencePiece model type.
SentencePieceModel::kWord refers to the word-based SentencePiece model type.
params – [in] A map of additional parameters, which can be left empty (the parameters are derived from the SentencePiece library).
vocab – [out] A SentencePieceVocab object.
- Returns
Status of the call. The SentencePieceVocab built from the file is returned through the vocab parameter.
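The byte pair encoding model type described in the parameters above works by repeatedly merging the most frequent adjacent pair. The following is a minimal sketch of a single merge step, not the SentencePiece implementation: it assumes symbols are stored as strings rather than raw bytes, and `Symbols`, `MostFrequentPair`, and `MergePair` are hypothetical names that are not part of the SentencePieceVocab API.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical alias for illustration: a sentence split into symbols.
using Symbols = std::vector<std::string>;
using SymbolPair = std::pair<std::string, std::string>;

// Count adjacent symbol pairs and return the most frequent one.
// Ties go to the lexicographically smallest pair (std::map iteration order).
// Returns a pair of empty strings if there are fewer than two symbols.
SymbolPair MostFrequentPair(const Symbols &symbols) {
  std::map<SymbolPair, int> counts;
  for (std::size_t i = 0; i + 1 < symbols.size(); ++i) {
    ++counts[std::make_pair(symbols[i], symbols[i + 1])];
  }
  SymbolPair best;
  int best_count = 0;
  for (const auto &entry : counts) {
    if (entry.second > best_count) {
      best = entry.first;
      best_count = entry.second;
    }
  }
  return best;
}

// Replace each non-overlapping occurrence of `pair`, scanning left to right,
// with a single merged symbol.
Symbols MergePair(const Symbols &symbols, const SymbolPair &pair) {
  Symbols merged;
  std::size_t i = 0;
  while (i < symbols.size()) {
    const bool match = i + 1 < symbols.size() &&
                       symbols[i] == pair.first &&
                       symbols[i + 1] == pair.second;
    if (match) {
      merged.push_back(pair.first + pair.second);
      i += 2;
    } else {
      merged.push_back(symbols[i]);
      i += 1;
    }
  }
  return merged;
}
```

For the symbols `a b a b c`, `MostFrequentPair` picks `(a, b)`, which occurs twice, and `MergePair` then yields `ab ab c`. A real trainer would repeat this step over a whole corpus until the requested vocabulary size is reached.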
Save the SentencePiece model into given file path.
- Example
// Save vocab model to local
vocab->SaveModel(&vocab, datasets_root_path_ + "/test_sentencepiece", "m.model");
- Parameters
vocab – [in] A SentencePiece object to be saved.
path – [in] Path to store the model.
filename – [in] The file name under which the model is saved.