Class SentencePieceVocab

Class Documentation

class SentencePieceVocab

SentencePiece object that is used to perform word segmentation.

Public Functions

SentencePieceVocab()

Constructor.

~SentencePieceVocab() = default

Destructor.

Public Static Functions

static Status BuildFromFile(const std::vector<std::string> &path_list, const int32_t vocab_size, const float character_coverage, const SentencePieceModel model_type, const std::unordered_map<std::string, std::string> &params, std::shared_ptr<SentencePieceVocab> *vocab)

Build a SentencePiece object from a file.

Example
std::string dataset_path;
dataset_path = datasets_root_path_ + "/test_sentencepiece/vocab.txt";
std::vector<std::string> path_list;
path_list.emplace_back(dataset_path);
std::unordered_map<std::string, std::string> param_map;
std::shared_ptr<SentencePieceVocab> spm = std::make_shared<SentencePieceVocab>();
Status rc = SentencePieceVocab::BuildFromFile(path_list, 5000, 0.9995,
                                              SentencePieceModel::kUnigram, param_map, &spm);

Parameters
  • path_list[in] Paths to the files which contain the SentencePiece list.

  • vocab_size[in] Vocabulary size.

  • character_coverage[in] Fraction of characters covered by the model. Good defaults are 0.9995 for languages with a rich character set, such as Japanese or Chinese, and 1.0 for languages with a small character set.

  • model_type[in] One of SentencePieceModel::kUnigram, SentencePieceModel::kBpe, SentencePieceModel::kChar, or SentencePieceModel::kWord. The default is SentencePieceModel::kUnigram. The input sentence must be pre-tokenized when using SentencePieceModel::kWord.

    • SentencePieceModel::kUnigram: the Unigram language model, which assumes that the next word in the sentence is independent of the previous words generated by the model.

    • SentencePieceModel::kBpe: the byte pair encoding algorithm, which replaces the most frequent pair of bytes in a sentence with a single, unused byte.

    • SentencePieceModel::kChar: the character-based SentencePiece model type.

    • SentencePieceModel::kWord: the word-based SentencePiece model type.

  • params[in] A map of additional parameters, which may be empty. The supported parameters are those of the SentencePiece library.

  • vocab[out] A SentencePieceVocab object.
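
The byte pair encoding step described for SentencePieceModel::kBpe can be illustrated with a minimal sketch of a single merge. It is not the SentencePiece library's implementation: the helper names MostFrequentPair and MergePair are hypothetical, and the merged pair is represented as a concatenated string symbol rather than an unused byte.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Illustrative only: one merge step of byte pair encoding over a symbol sequence.
// Symbols are strings so that a merged pair can be stored as one new symbol.
using Symbols = std::vector<std::string>;
using SymbolPair = std::pair<std::string, std::string>;

// Returns the most frequent adjacent pair. Ties resolve to the pair that
// sorts first, because std::map iterates its keys in sorted order.
SymbolPair MostFrequentPair(const Symbols &symbols) {
  assert(symbols.size() >= 2);  // at least one adjacent pair is required
  std::map<SymbolPair, std::size_t> counts;
  for (std::size_t i = 0; i + 1 < symbols.size(); ++i) {
    ++counts[{symbols[i], symbols[i + 1]}];
  }
  auto best = counts.begin();
  for (auto it = counts.begin(); it != counts.end(); ++it) {
    if (it->second > best->second) best = it;
  }
  return best->first;
}

// Replaces each non-overlapping occurrence of `pair`, scanning left to right,
// with the concatenation of its two symbols.
Symbols MergePair(const Symbols &symbols, const SymbolPair &pair) {
  Symbols merged;
  for (std::size_t i = 0; i < symbols.size();) {
    if (i + 1 < symbols.size() && symbols[i] == pair.first && symbols[i + 1] == pair.second) {
      merged.push_back(pair.first + pair.second);
      i += 2;
    } else {
      merged.push_back(symbols[i]);
      ++i;
    }
  }
  return merged;
}
```

Applied to the sequence a, b, a, b, c, the most frequent pair is (a, b), and one merge yields ab, ab, c. A trainer would repeat this step until the vocabulary reaches the requested size.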

Returns

Status, indicating whether the build succeeded. The vocab built from the file is returned through the vocab parameter.
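
The calling convention in the BuildFromFile signature, a Status return value plus an out-parameter for the result, can be sketched with stand-in types. SketchStatus, SketchVocab, and BuildSketch below are hypothetical names introduced only for illustration and are not part of the library.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Stand-in types for illustration only; these are NOT the library's
// Status or SentencePieceVocab classes.
struct SketchStatus {
  bool ok;
  std::string message;
};

struct SketchVocab {
  int size;
};

// Hypothetical builder that mirrors the shape of BuildFromFile: the result is
// written through the out-parameter, and success or failure is the return value.
SketchStatus BuildSketch(int vocab_size, std::shared_ptr<SketchVocab> *vocab) {
  assert(vocab != nullptr);  // the caller must supply a place to store the result
  if (vocab_size <= 0) {
    return {false, "vocab_size must be positive"};
  }
  *vocab = std::make_shared<SketchVocab>(SketchVocab{vocab_size});
  return {true, ""};
}
```

With this shape, the caller checks the returned status before using the out-parameter, since the pointer is left unchanged when the call fails.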

static Status SaveModel(const std::shared_ptr<SentencePieceVocab> *vocab, const std::string path, std::string filename)

Save the SentencePiece model to the given file path.

Example
// Save the vocab model to a local directory
Status rc = SentencePieceVocab::SaveModel(&vocab, datasets_root_path_ + "/test_sentencepiece", "m.model");

Parameters
  • vocab[in] The SentencePieceVocab object to be saved.

  • path[in] Path to store the model.

  • filename[in] File name under which the model is saved.
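
The relationship between path and filename can be sketched as follows, assuming the model file is written as filename inside the directory path. That is an assumption rather than something this page states, and JoinModelPath is a hypothetical helper, not a library function.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not part of the library: joins a directory and a file
// name with exactly one '/' between them.
std::string JoinModelPath(const std::string &path, const std::string &filename) {
  assert(!filename.empty());  // a model file needs a name
  if (path.empty()) return filename;
  if (path.back() == '/') return path + filename;
  return path + "/" + filename;
}
```

Under that assumption, a directory ending in test_sentencepiece combined with the file name m.model would give a location ending in test_sentencepiece/m.model.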