mindspore.dataset.text.utils.SentencePieceVocab

class mindspore.dataset.text.utils.SentencePieceVocab

SentencePiece object used for word segmentation.

classmethod from_dataset(dataset, col_names, vocab_size, character_coverage, model_type, params)

Build a SentencePiece object from a dataset.

Parameters
  • dataset (Dataset) – Dataset from which to build the SentencePiece object.

  • col_names (list) – List of column names.

  • vocab_size (int) – Vocabulary size.

  • character_coverage (float) – Fraction of characters covered by the model. Good defaults are 0.9995 for languages with a rich character set, such as Japanese or Chinese, and 1.0 for languages with a small character set.

  • model_type (SentencePieceModel) –

    One of [SentencePieceModel.UNIGRAM, SentencePieceModel.BPE, SentencePieceModel.CHAR, SentencePieceModel.WORD]. Default: SentencePieceModel.UNIGRAM. Input sentences must be pre-tokenized when using the SentencePieceModel.WORD type.

    • SentencePieceModel.UNIGRAM: unigram language model, in which each word in a sentence is assumed to be independent of the words generated before it.

    • SentencePieceModel.BPE: byte pair encoding algorithm, which replaces the most frequent pair of bytes in a sentence with a single, unused byte.

    • SentencePieceModel.CHAR: character-based SentencePiece model type.

    • SentencePieceModel.WORD: word-based SentencePiece model type.

  • params (dict) – Dictionary of additional parameters, derived from the SentencePiece library. It may be empty.
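
To illustrate what character_coverage measures, the following minimal sketch keeps the most frequent characters of a text until they account for the requested fraction of all character occurrences. It is a pure-Python illustration of the idea, not the SentencePiece library's implementation; the helper name covered_characters and the sample text are hypothetical.

```python
from collections import Counter


def covered_characters(text, character_coverage):
    """Return the most frequent characters needed to reach the coverage fraction."""
    counts = Counter(text)
    total = sum(counts.values())
    kept = []
    covered = 0
    # Visit characters from most to least frequent.
    for char, count in counts.most_common():
        # Stop once the kept characters already reach the requested fraction.
        if covered / total >= character_coverage:
            break
        kept.append(char)
        covered += count
    return kept


sample = "aaaaaaaabc"  # hypothetical text: 'a' x 8, 'b' x 1, 'c' x 1
print(covered_characters(sample, 0.8))  # rare characters are dropped
print(covered_characters(sample, 1.0))  # every character is kept
```

With a coverage below 1.0, characters that fall outside the kept set would be treated as unknown, which is why a slightly lower value is suggested for languages with very large character sets.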

Returns

SentencePieceVocab, vocab built from the dataset.
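
A minimal usage sketch of from_dataset, assuming MindSpore is installed, that a plain-text corpus exists at the path passed in, and that TextFileDataset exposes its lines in a column named "text". The wrapper name build_vocab_from_dataset and the vocabulary size of 5000 are illustrative choices, not part of the API. The imports sit inside the function so that the definition itself loads without MindSpore.

```python
def build_vocab_from_dataset(corpus_path, vocab_size=5000):
    """Build a SentencePieceVocab from a text file wrapped in a dataset."""
    # Imported lazily: these modules are only available when MindSpore is installed.
    import mindspore.dataset as ds
    from mindspore.dataset.text import SentencePieceModel, SentencePieceVocab

    # Assumption: each line of the file becomes one row in the "text" column.
    dataset = ds.TextFileDataset(corpus_path, shuffle=False)
    return SentencePieceVocab.from_dataset(
        dataset,
        ["text"],                    # col_names
        vocab_size,                  # vocab_size
        0.9995,                      # character_coverage
        SentencePieceModel.UNIGRAM,  # model_type
        {},                          # params: no extra parameters
    )
```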

classmethod from_file(file_path, vocab_size, character_coverage, model_type, params)

Build a SentencePiece object from a file containing a list of words.

Parameters
  • file_path (list) – List of paths to the files that contain the SentencePiece list.

  • vocab_size (int) – Vocabulary size.

  • character_coverage (float) – Fraction of characters covered by the model. Good defaults are 0.9995 for languages with a rich character set, such as Japanese or Chinese, and 1.0 for languages with a small character set.

  • model_type (SentencePieceModel) –

    One of [SentencePieceModel.UNIGRAM, SentencePieceModel.BPE, SentencePieceModel.CHAR, SentencePieceModel.WORD]. Default: SentencePieceModel.UNIGRAM. Input sentences must be pre-tokenized when using the SentencePieceModel.WORD type.

    • SentencePieceModel.UNIGRAM: unigram language model, in which each word in a sentence is assumed to be independent of the words generated before it.

    • SentencePieceModel.BPE: byte pair encoding algorithm, which replaces the most frequent pair of bytes in a sentence with a single, unused byte.

    • SentencePieceModel.CHAR: character-based SentencePiece model type.

    • SentencePieceModel.WORD: word-based SentencePiece model type.

  • params (dict) –

    Dictionary of additional parameters, derived from the SentencePiece library. It may be empty. Example parameters and values:

    input_sentence_size 0
    max_sentencepiece_length 16
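
The byte pair encoding idea described for SentencePieceModel.BPE can be sketched as a single merge step: count adjacent symbol pairs, pick the most frequent one, and replace each occurrence with one merged symbol. This is a simplified pure-Python illustration, not the SentencePiece implementation. It represents the merged pair as a concatenated string rather than an unused byte, and the function names and sample input are hypothetical.

```python
from collections import Counter


def most_frequent_pair(symbols):
    """Return the most frequent adjacent pair, or None if there is no pair."""
    pair_counts = Counter(zip(symbols, symbols[1:]))
    if not pair_counts:
        return None
    return pair_counts.most_common(1)[0][0]


def merge_pair(symbols, pair):
    """Replace each non-overlapping occurrence of pair with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2  # skip both symbols of the merged pair
        else:
            merged.append(symbols[i])
            i += 1
    return merged


symbols = list("aaabdaaabac")  # hypothetical input, one symbol per character
pair = most_frequent_pair(symbols)
merged = merge_pair(symbols, pair)
print(pair, merged)
```

Repeating this step until the symbol inventory reaches the target size is what ties the procedure to the vocab_size parameter.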
    

Returns

SentencePieceVocab, vocab built from the file.
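
A minimal usage sketch of from_file, assuming MindSpore is installed and that the paths passed in point to existing text files. The wrapper name build_vocab_from_file and the vocabulary size of 5000 are illustrative, and an empty dictionary is passed for params.

```python
def build_vocab_from_file(file_paths, vocab_size=5000):
    """Build a SentencePieceVocab directly from one or more text files."""
    # Imported lazily: these names are only available when MindSpore is installed.
    from mindspore.dataset.text import SentencePieceModel, SentencePieceVocab

    return SentencePieceVocab.from_file(
        list(file_paths),            # file_path: a list of paths
        vocab_size,                  # vocab_size
        0.9995,                      # character_coverage
        SentencePieceModel.UNIGRAM,  # model_type
        {},                          # params: no extra parameters
    )
```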

classmethod save_model(vocab, path, filename)

Save the model to the given file path.

Parameters
  • vocab (SentencePieceVocab) – A SentencePiece object.

  • path (str) – Path to the directory in which to store the model.

  • filename (str) – Name of the model file.
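
A minimal usage sketch of save_model, assuming MindSpore is installed and that vocab was built beforehand with from_dataset or from_file. The wrapper name save_vocab and the default file name "m.model" are illustrative choices.

```python
def save_vocab(vocab, directory, filename="m.model"):
    """Save a previously built SentencePieceVocab under the given directory."""
    # Imported lazily: this name is only available when MindSpore is installed.
    from mindspore.dataset.text import SentencePieceVocab

    SentencePieceVocab.save_model(vocab, directory, filename)
```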