mindspore.dataset.text.SentencePieceVocab
- class mindspore.dataset.text.SentencePieceVocab[source]
SentencePiece object that is used to perform word segmentation.
- classmethod from_dataset(dataset, col_names, vocab_size, character_coverage, model_type, params)[source]
Build a SentencePieceVocab from a dataset.
- Parameters
dataset (Dataset) – Dataset to build SentencePiece.
col_names (list) – The list of column names to build the vocabulary from.
vocab_size (int) – Vocabulary size.
character_coverage (float) – Fraction of characters covered by the model. Good defaults are 0.9995 for languages with a rich character set like Japanese or Chinese, and 1.0 for other languages with a small character set.
model_type (SentencePieceModel) –
It can be any of [SentencePieceModel.UNIGRAM, SentencePieceModel.BPE, SentencePieceModel.CHAR, SentencePieceModel.WORD], default is SentencePieceModel.UNIGRAM. The input sentence must be pre-tokenized when using the SentencePieceModel.WORD type.
SentencePieceModel.UNIGRAM, the unigram language model, assumes each token in the sentence occurs independently of the tokens generated before it.
SentencePieceModel.BPE refers to the byte pair encoding algorithm, which iteratively replaces the most frequent pair of bytes in a sentence with a single, unused byte.
SentencePieceModel.CHAR refers to the character-based SentencePiece model type.
SentencePieceModel.WORD refers to the word-based SentencePiece model type.
params (dict) – A dictionary with no incoming parameters (the parameters are derived from the SentencePiece library).
- Returns
SentencePieceVocab, vocab built from the dataset.
Examples
>>> import mindspore.dataset as ds
>>> from mindspore.dataset.text import SentencePieceVocab, SentencePieceModel
>>> dataset = ds.TextFileDataset("/path/to/sentence/piece/vocab/file", shuffle=False)
>>> vocab = SentencePieceVocab.from_dataset(dataset, ["text"], 5000, 0.9995,
...                                         SentencePieceModel.UNIGRAM, {})
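The byte pair encoding behavior described for SentencePieceModel.BPE (repeatedly merging the most frequent adjacent pair of symbols) can be illustrated with a minimal stdlib sketch. This is an independent illustration of the algorithm's core step, not MindSpore's or SentencePiece's implementation:

```python
from collections import Counter

def bpe_merge_step(words):
    """Perform one BPE step: find the most frequent adjacent pair of
    symbols across all words and merge its occurrences into one symbol.

    words: list of symbol sequences, e.g. [["l", "o", "w"], ...]
    Returns the merged pair and the updated sequences.
    """
    pair_counts = Counter()
    for seq in words:
        for pair in zip(seq, seq[1:]):
            pair_counts[pair] += 1
    if not pair_counts:
        return None, words
    best = max(pair_counts, key=pair_counts.get)
    merged = []
    for seq in words:
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                out.append(seq[i] + seq[i + 1])  # fuse the pair
                i += 2
            else:
                out.append(seq[i])
                i += 1
        merged.append(out)
    return best, merged

words = [list("lower"), list("lowest"), list("low")]
pair, words = bpe_merge_step(words)
print(pair)      # ('l', 'o') -- the most frequent adjacent pair
print(words[2])  # ['lo', 'w']
```

Repeating this step vocab_size times yields the learned subword vocabulary; real implementations also track merge order so the same merges can be replayed at tokenization time.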
- classmethod from_file(file_path, vocab_size, character_coverage, model_type, params)[source]
Build a SentencePiece object from a file.
- Parameters
file_path (list) – The list of paths to the files that contain the SentencePiece training corpus.
vocab_size (int) – Vocabulary size.
character_coverage (float) – Fraction of characters covered by the model. Good defaults are 0.9995 for languages with a rich character set like Japanese or Chinese, and 1.0 for other languages with a small character set.
model_type (SentencePieceModel) –
It can be any of [SentencePieceModel.UNIGRAM, SentencePieceModel.BPE, SentencePieceModel.CHAR, SentencePieceModel.WORD], default is SentencePieceModel.UNIGRAM. The input sentence must be pre-tokenized when using the SentencePieceModel.WORD type.
SentencePieceModel.UNIGRAM, the unigram language model, assumes each token in the sentence occurs independently of the tokens generated before it.
SentencePieceModel.BPE refers to the byte pair encoding algorithm, which iteratively replaces the most frequent pair of bytes in a sentence with a single, unused byte.
SentencePieceModel.CHAR refers to the character-based SentencePiece model type.
SentencePieceModel.WORD refers to the word-based SentencePiece model type.
params (dict) – A dictionary with no incoming parameters (the parameters are derived from the SentencePiece library).
- Returns
SentencePieceVocab, vocab built from the file.
Examples
>>> from mindspore.dataset.text import SentencePieceVocab, SentencePieceModel
>>> vocab = SentencePieceVocab.from_file(["/path/to/sentence/piece/vocab/file"], 5000, 0.9995,
...                                      SentencePieceModel.UNIGRAM, {})
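The character_coverage parameter above bounds the fraction of character occurrences in the corpus that the model's character set must account for; characters beyond that coverage are treated as unknown. A minimal stdlib sketch of that idea (independent of SentencePiece, which implements this internally):

```python
from collections import Counter

def chars_needed_for_coverage(text, coverage):
    """Return the smallest set of characters, in descending frequency
    order, whose combined frequency reaches the requested coverage
    fraction of the corpus."""
    counts = Counter(text)
    total = sum(counts.values())
    kept, covered = [], 0
    for ch, n in counts.most_common():
        kept.append(ch)
        covered += n
        if covered / total >= coverage:
            break
    return kept

corpus = "aaaaaaaaab"  # 'a' makes up 90% of the corpus
print(chars_needed_for_coverage(corpus, 0.9))  # ['a']
print(chars_needed_for_coverage(corpus, 1.0))  # ['a', 'b']
```

This is why 1.0 suits small alphabets (every character is kept) while 0.9995 suits large character sets like Chinese or Japanese, where the long tail of rare characters would otherwise bloat the vocabulary.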
- classmethod save_model(vocab, path, filename)[source]
Save the model to the given file path.
- Parameters
vocab (SentencePieceVocab) – A SentencePiece object.
path (str) – Path to the directory in which to store the model.
filename (str) – The name of the file.
Examples
>>> from mindspore.dataset.text import SentencePieceVocab, SentencePieceModel
>>> vocab = SentencePieceVocab.from_file(["/path/to/sentence/piece/vocab/file"], 5000, 0.9995,
...                                      SentencePieceModel.UNIGRAM, {})
>>> SentencePieceVocab.save_model(vocab, "./", "m.model")