mindspore.dataset.text.utils.SentencePieceVocab
- class mindspore.dataset.text.utils.SentencePieceVocab
SentencePiece object used to perform word segmentation.
- classmethod from_dataset(dataset, col_names, vocab_size, character_coverage, model_type, params)
Build a SentencePiece object from a dataset.
- Parameters
dataset (Dataset) – Dataset from which to build the SentencePiece object.
col_names (list) – List of column names.
vocab_size (int) – Vocabulary size.
character_coverage (float) – Fraction of characters covered by the model. Good defaults are 0.9995 for languages with a rich character set, such as Japanese or Chinese, and 1.0 for languages with a small character set.
model_type (SentencePieceModel) –
One of [SentencePieceModel.UNIGRAM, SentencePieceModel.BPE, SentencePieceModel.CHAR, SentencePieceModel.WORD]. The default is SentencePieceModel.UNIGRAM. Input sentences must be pre-tokenized when using SentencePieceModel.WORD.
SentencePieceModel.UNIGRAM: unigram language model, in which each word in a sentence is assumed to be independent of the words that precede it.
SentencePieceModel.BPE: byte pair encoding, which repeatedly replaces the most frequent pair of adjacent symbols (bytes, in the original compression algorithm) with a single new symbol.
SentencePieceModel.CHAR: character-based SentencePiece model.
SentencePieceModel.WORD: word-based SentencePiece model.
params (dict) – Dictionary of additional parameters; use an empty dictionary when there are none to pass.
- Returns
SentencePieceVocab, vocab built from the dataset.
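To make the character_coverage parameter described above more concrete, here is a minimal pure-Python sketch of the idea: keep the most frequent characters until they account for the requested fraction of all character occurrences. The function name chars_for_coverage and the toy corpus are hypothetical, and this is only an illustration of the concept, not the procedure the SentencePiece library actually runs.

```python
from collections import Counter


def chars_for_coverage(corpus, character_coverage):
    """Return the most frequent characters whose counts reach the coverage target."""
    counts = Counter(ch for line in corpus for ch in line)
    total = sum(counts.values())
    kept, covered = [], 0
    # Add characters from most to least frequent until the kept set
    # accounts for at least `character_coverage` of all occurrences.
    for ch, n in counts.most_common():
        if covered >= character_coverage * total:
            break
        kept.append(ch)
        covered += n
    return kept


# Toy corpus (hypothetical): 'a' occurs 7 times, 'b' 3 times, 'c' once.
corpus = ["aaaa", "aabb", "abc"]
full = chars_for_coverage(corpus, 1.0)     # every character is kept
partial = chars_for_coverage(corpus, 0.9)  # the rare 'c' falls outside the target
```

With a coverage of 1.0 every character stays in the alphabet, which suits small character sets. A value slightly below 1.0 drops the rarest characters, which is why it is suggested for languages with very large character sets.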
- classmethod from_file(file_path, vocab_size, character_coverage, model_type, params)
Build a SentencePiece object from a file containing a list of words.
- Parameters
file_path (list) – List of paths to the files that contain the SentencePiece list.
vocab_size (int) – Vocabulary size.
character_coverage (float) – Fraction of characters covered by the model. Good defaults are 0.9995 for languages with a rich character set, such as Japanese or Chinese, and 1.0 for languages with a small character set.
model_type (SentencePieceModel) –
One of [SentencePieceModel.UNIGRAM, SentencePieceModel.BPE, SentencePieceModel.CHAR, SentencePieceModel.WORD]. The default is SentencePieceModel.UNIGRAM. Input sentences must be pre-tokenized when using SentencePieceModel.WORD.
SentencePieceModel.UNIGRAM: unigram language model, in which each word in a sentence is assumed to be independent of the words that precede it.
SentencePieceModel.BPE: byte pair encoding, which repeatedly replaces the most frequent pair of adjacent symbols (bytes, in the original compression algorithm) with a single new symbol.
SentencePieceModel.CHAR: character-based SentencePiece model.
SentencePieceModel.WORD: word-based SentencePiece model.
params (dict) –
Dictionary of additional parameters; use an empty dictionary when there are none to pass. The parameters are derived from the SentencePiece library, for example:
input_sentence_size 0
max_sentencepiece_length 16
- Returns
SentencePieceVocab, vocab built from the file.
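The SentencePieceModel.BPE option described above rests on one repeated step: find the most frequent pair of adjacent symbols and merge it into a single symbol. The sketch below shows one such step in plain Python. The helper names most_frequent_pair and merge_pair and the toy words are hypothetical, and the code illustrates the general technique rather than the SentencePiece library's own trainer.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the most common one."""
    pairs = Counter()
    for symbols in words:
        pairs.update(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0]


def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2  # skip both halves of the merged pair
        else:
            merged.append(symbols[i])
            i += 1
    return merged


# Toy input (hypothetical): each word starts as a list of single characters.
words = [list("abab"), list("abc"), list("ab")]
pair = most_frequent_pair(words)               # ('a', 'b') occurs four times
words = [merge_pair(w, pair) for w in words]   # 'a', 'b' become the symbol 'ab'
```

A full BPE trainer would repeat this step, adding one merged symbol per iteration, until the vocabulary reaches the requested vocab_size.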
- classmethod save_model(vocab, path, filename)
Save the model to the given file path.
- Parameters
vocab (SentencePieceVocab) – A SentencePiece object.
path (str) – Path where the model is stored.
filename (str) – The name of the file.
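The following sketch shows how from_file and save_model might be used together, based on the signatures listed above. It rests on several assumptions: that MindSpore is installed, that SentencePieceVocab and SentencePieceModel can both be imported from mindspore.dataset.text, and that an empty params dictionary is accepted. The wrapper name build_and_save and its default vocab_size are hypothetical.

```python
def build_and_save(corpus_files, out_dir, filename, vocab_size=5000):
    """Build a SentencePiece vocab from text files, then save the model.

    `corpus_files` is a list of paths, `out_dir` and `filename` are passed
    to save_model as `path` and `filename`.
    """
    # Imported lazily so this module loads even where MindSpore is absent.
    # Assumption: both names are exported by mindspore.dataset.text.
    from mindspore.dataset.text import SentencePieceModel, SentencePieceVocab

    # Assumption: an empty dict means "no extra SentencePiece parameters".
    params = {}
    vocab = SentencePieceVocab.from_file(
        corpus_files,                 # file_path (list)
        vocab_size,                   # vocab_size (int)
        0.9995,                       # character_coverage (float)
        SentencePieceModel.UNIGRAM,   # model_type
        params,                       # params (dict)
    )
    SentencePieceVocab.save_model(vocab, out_dir, filename)
    return vocab
```

A call such as build_and_save(["corpus.txt"], "./", "m.model") would then build the vocab from a local text file and save the model under the given path and file name; the file names here are placeholders.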