mindspore.dataset.TextBaseDataset.build_sentencepiece_vocab
- TextBaseDataset.build_sentencepiece_vocab(columns, vocab_size, character_coverage, model_type, params)[source]
Function to create a SentencePieceVocab from the source dataset. The source dataset is expected to be a text dataset.
- Parameters
columns (list[str]) – Column names to get words from.
vocab_size (int) – Vocabulary size.
character_coverage (float) – Percentage of characters covered by the model; must be between 0.98 and 1.0. Good defaults are 0.9995 for languages with rich character sets like Japanese or Chinese, and 1.0 for languages with small character sets like English or Latin.
model_type (SentencePieceModel) – Model type. Choose from unigram (default), bpe, char, or word. The input sentence must be pretokenized when using word type.
params (dict) – Any extra optional parameters for the sentencepiece library, chosen according to your raw data.
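To build intuition for what character_coverage means, the following is a minimal pure-Python sketch (not MindSpore or SentencePiece code; the function name is illustrative) of measuring what fraction of character occurrences in a corpus a given character set covers. SentencePiece effectively keeps the smallest character set whose coverage reaches character_coverage and maps the remaining rare characters to the unknown token.

```python
from collections import Counter

def coverage_of_charset(corpus_lines, charset):
    """Fraction of character occurrences in corpus_lines covered by charset.

    Illustrative only: approximates the quantity that SentencePiece's
    character_coverage threshold is compared against.
    """
    counts = Counter(ch for line in corpus_lines for ch in line)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(n for ch, n in counts.items() if ch in charset)
    return covered / total

# A charset missing only the rare character 't' still covers 21 of 22
# character occurrences in this tiny corpus (~0.9545).
corpus = ["hello world", "hello there"]
print(coverage_of_charset(corpus, set("helo wrd")))
```

For a language like English, a few dozen characters already reach coverage 1.0, which is why 1.0 is the recommended setting there; for Japanese or Chinese, reaching 1.0 would force thousands of rare characters into the vocabulary, so 0.9995 is preferred.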
- Returns
SentencePieceVocab, vocab built from the dataset.
Examples
>>> import mindspore.dataset as ds
>>> from mindspore.dataset.text import SentencePieceModel
>>>
>>> # You can construct any text dataset as source, take TextFileDataset as example.
>>> dataset = ds.TextFileDataset("/path/to/sentence/piece/vocab/file", shuffle=False)
>>> dataset = dataset.build_sentencepiece_vocab(["text"], 5000, 0.9995,
...                                             SentencePieceModel.UNIGRAM, {})