mindspore.dataset.text.SentencePieceVocab
- class mindspore.dataset.text.SentencePieceVocab[source]
SentencePiece object that is used to do word segmentation.
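As a rough illustration of what subword segmentation means (this is not MindSpore's actual algorithm; SentencePiece's UNIGRAM model is probabilistic and learned from data), here is a minimal greedy longest-match sketch over a toy vocabulary:

```python
def segment(text, vocab, unk="<unk>"):
    """Split text into the longest subwords found in vocab (greedy sketch)."""
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary match starting at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            # No match at all: emit the unknown symbol and move on.
            pieces.append(unk)
            i += 1
    return pieces

vocab = {"To", "day", "Today", " is", " Tues", "day."}
print(segment("Today is Tuesday.", vocab))
# ['Today', ' is', ' Tues', 'day.']
```

A real SentencePiece model additionally learns which subwords to keep (up to `vocab_size`) and scores alternative segmentations instead of always taking the longest match.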
- classmethod from_dataset(dataset, col_names, vocab_size, character_coverage, model_type, params)[source]
Build a SentencePiece object from a dataset.
- Parameters
dataset (Dataset) – Dataset to build SentencePiece.
col_names (list) – The list of column names.
vocab_size (int) – Vocabulary size.
character_coverage (float) – Amount of characters covered by the model. 0.9995 is recommended for languages with a rich character set like Japanese or Chinese, and 1.0 for other languages with a small character set.
model_type (SentencePieceModel) – The desired subword algorithm. See SentencePieceModel for details on optional values.
params (dict) – A dictionary of extra parameters; the supported keys are derived from the SentencePiece library. Pass an empty dictionary if none are needed.
- Returns
SentencePieceVocab, vocab built from the dataset.
Examples
>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>>
>>> from mindspore.dataset.text import SentencePieceVocab, SentencePieceModel
>>> dataset = ds.TextFileDataset("/path/to/sentence/piece/vocab/file", shuffle=False)
>>> vocab = SentencePieceVocab.from_dataset(dataset, ["text"], 5000, 0.9995,
...                                         SentencePieceModel.UNIGRAM, {})
>>> # Build a tokenizer based on the vocab
>>> tokenizer = text.SentencePieceTokenizer(vocab, out_type=text.SPieceTokenizerOutType.STRING)
>>> txt = "Today is Tuesday."
>>> token = tokenizer(txt)
- classmethod from_file(file_path, vocab_size, character_coverage, model_type, params)[source]
Build a SentencePiece object from a file.
- Parameters
file_path (list) – The list of paths to the files which contain the SentencePiece word lists.
vocab_size (int) – Vocabulary size.
character_coverage (float) – Amount of characters covered by the model. 0.9995 is recommended for languages with a rich character set like Japanese or Chinese, and 1.0 for other languages with a small character set.
model_type (SentencePieceModel) – The desired subword algorithm. See SentencePieceModel for details on optional values.
params (dict) – A dictionary of extra parameters; the supported keys are derived from the SentencePiece library. Pass an empty dictionary if none are needed.
- Returns
SentencePieceVocab, vocab built from the file.
Examples
>>> import mindspore.dataset.text as text
>>> from mindspore.dataset.text import SentencePieceVocab, SentencePieceModel
>>> vocab = SentencePieceVocab.from_file(["/path/to/sentence/piece/vocab/file"], 5000, 0.9995,
...                                      SentencePieceModel.UNIGRAM, {})
>>> # Build a tokenizer based on the vocab model
>>> tokenizer = text.SentencePieceTokenizer(vocab, out_type=text.SPieceTokenizerOutType.STRING)
>>> txt = "Today is Friday."
>>> token = tokenizer(txt)
- classmethod save_model(vocab, path, filename)[source]
Save the model to the given file path.
- Parameters
vocab (SentencePieceVocab) – A SentencePiece object.
path (str) – Path to store the model.
filename (str) – The name of the file.
Examples
>>> from mindspore.dataset.text import SentencePieceVocab, SentencePieceModel
>>> vocab = SentencePieceVocab.from_file(["/path/to/sentence/piece/vocab/file"], 5000, 0.9995,
...                                      SentencePieceModel.UNIGRAM, {})
>>> SentencePieceVocab.save_model(vocab, "./", "m.model")