mindspore.dataset.text.utils.SentencePieceVocab
- class mindspore.dataset.text.utils.SentencePieceVocab
SentencePiece object that is used to segment words.
- classmethod from_dataset(dataset, col_names, vocab_size, character_coverage, model_type, params)
Build a SentencePiece vocabulary from a dataset.
- Parameters
dataset (Dataset) – Dataset from which to build the SentencePiece vocabulary.
col_names (list) – List of column names.
vocab_size (int) – Vocabulary size.
character_coverage (float) – Fraction of characters covered by the model. Good defaults are 0.9995 for languages with a rich character set, such as Japanese or Chinese, and 1.0 for languages with a small character set.
model_type (SentencePieceModel) – Choose from UNIGRAM (default), BPE, CHAR, or WORD. The input sentence must be pretokenized when using the WORD type.
params (dict) – Dictionary of additional parameters; pass an empty dictionary when there are none.
- Returns
SentencePieceVocab, vocab built from the dataset.
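A minimal sketch of how from_dataset might be called. It assumes a plain-text corpus read with `mindspore.dataset.TextFileDataset`, whose output column is named `text`. The helper names, the vocabulary size of 5000, and the corpus path are illustrative assumptions, not values taken from this page. MindSpore is imported inside the function so that the coverage helper can be used on its own.

```python
def default_character_coverage(rich_character_set: bool) -> float:
    """Return the default coverage suggested in the parameter description."""
    # 0.9995 for languages with a rich character set (e.g. Japanese, Chinese),
    # 1.0 for languages with a small character set.
    return 0.9995 if rich_character_set else 1.0


def build_vocab_from_dataset(corpus_path: str, vocab_size: int = 5000,
                             rich_character_set: bool = False):
    """Hypothetical helper: build a SentencePieceVocab from a text corpus."""
    # Imported lazily; this call requires MindSpore to be installed.
    import mindspore.dataset as ds
    from mindspore.dataset.text import SentencePieceModel, SentencePieceVocab

    # TextFileDataset yields one row per line, in a column named "text".
    dataset = ds.TextFileDataset(corpus_path, shuffle=False)
    coverage = default_character_coverage(rich_character_set)
    return SentencePieceVocab.from_dataset(
        dataset,
        ["text"],                    # col_names
        vocab_size,                  # vocab_size
        coverage,                    # character_coverage
        SentencePieceModel.UNIGRAM,  # model_type
        {},                          # params: no extra options
    )
```

With MindSpore installed, `build_vocab_from_dataset("corpus.txt")` would build a vocabulary from a hypothetical file `corpus.txt`.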
- classmethod from_file(file_path, vocab_size, character_coverage, model_type, params)
Build a SentencePiece object from a file containing a list of words.
- Parameters
file_path (list) – List of paths to the files that contain the sentencepiece list.
vocab_size (int) – Vocabulary size, of type uint32_t.
character_coverage (float) – Fraction of characters covered by the model. Good defaults are 0.9995 for languages with a rich character set, such as Japanese or Chinese, and 1.0 for languages with a small character set.
model_type (SentencePieceModel) – Choose from UNIGRAM (default), BPE, CHAR, or WORD. The input sentence must be pretokenized when using the WORD type.
params (dict) – Dictionary of additional parameters, derived from the SentencePiece library; pass an empty dictionary when there are none. For example:
    input_sentence_size 0
    max_sentencepiece_length 16
- Returns
SentencePieceVocab, vocab built from the file.
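A minimal sketch of a from_file call. Because vocab_size is described as uint32_t, the sketch checks the range before handing the value to MindSpore. The helper names, the default values, and the file path in the usage note are hypothetical. An empty params dictionary is used by default, since this page does not say how option values should be encoded.

```python
UINT32_MAX = 2**32 - 1  # largest value representable as uint32_t


def fits_uint32(value) -> bool:
    """Check that a value is an integer representable as uint32_t."""
    if isinstance(value, bool) or not isinstance(value, int):
        return False
    return 0 <= value <= UINT32_MAX


def build_vocab_from_files(file_paths, vocab_size=5000,
                           character_coverage=0.9995, params=None):
    """Hypothetical helper: build a SentencePieceVocab from text files."""
    if not fits_uint32(vocab_size):
        raise ValueError(f"vocab_size must fit in uint32_t, got {vocab_size!r}")

    # Imported lazily; this call requires MindSpore to be installed.
    from mindspore.dataset.text import SentencePieceModel, SentencePieceVocab

    return SentencePieceVocab.from_file(
        list(file_paths),            # file_path is documented as a list
        vocab_size,
        character_coverage,
        SentencePieceModel.UNIGRAM,
        params if params is not None else {},  # empty dict: no extra options
    )
```

With MindSpore installed, `build_vocab_from_files(["corpus.txt"])` would build a vocabulary from a hypothetical file `corpus.txt`.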
- classmethod save_model(vocab, path, filename)
Save the model to the given file path.
- Parameters
vocab (SentencePieceVocab) – A sentencepiece object.
path (str) – Path to store model.
filename (str) – The name of the file.
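A minimal sketch of saving a vocabulary built with from_file or from_dataset. The helper names and the default values `"./"` and `"m.model"` are illustrative assumptions, and the returned location assumes the model is written to `filename` inside `path`, which this page does not state explicitly.

```python
import os


def model_file_path(path: str, filename: str) -> str:
    """Return where the model file is assumed to be written."""
    return os.path.join(path, filename)


def save_vocab(vocab, path: str = "./", filename: str = "m.model") -> str:
    """Hypothetical helper: save a SentencePieceVocab and return its location."""
    # Imported lazily; this call requires MindSpore to be installed.
    from mindspore.dataset.text import SentencePieceVocab

    SentencePieceVocab.save_model(vocab, path, filename)
    return model_file_path(path, filename)
```

The returned path could then be passed to whatever later loads the saved model.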