mindspore.dataset.text.JiebaTokenizer

class mindspore.dataset.text.JiebaTokenizer(hmm_path, mp_path, mode=JiebaMode.MIX, with_offsets=False)[source]

Tokenize a Chinese string into words based on the dictionary.

Note

The integrity of the HMMSegment and MPSegment algorithm files must be confirmed.

Parameters
  • hmm_path (str) – Path to the dictionary file used by the HMMSegment algorithm. The dictionary can be obtained from the official website of cppjieba.

  • mp_path (str) – Path to the dictionary file used by the MPSegment algorithm. The dictionary can be obtained from the official website of cppjieba.

  • mode (JiebaMode, optional) –

    Valid values are JiebaMode.MP, JiebaMode.HMM and JiebaMode.MIX. Default: JiebaMode.MIX.

    • JiebaMode.MP, tokenize with the MPSegment algorithm.

    • JiebaMode.HMM, tokenize with the Hidden Markov Model Segment algorithm.

    • JiebaMode.MIX, tokenize with a mix of the MPSegment and HMMSegment algorithms.

  • with_offsets (bool, optional) – Whether to output the start and end offsets of each token. Default: False.

Raises
  • ValueError – If the path to the HMMSegment dictionary is not provided.

  • ValueError – If the path to the MPSegment dictionary is not provided.

  • TypeError – If hmm_path or mp_path is not of type str.

  • TypeError – If with_offsets is not of type bool.

Supported Platforms:

CPU

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>> from mindspore.dataset.text import JiebaMode
>>>
>>> text_file_list = ["/path/to/text_file_dataset_file"]
>>> text_file_dataset = ds.TextFileDataset(dataset_files=text_file_list)
>>>
>>> # 1) If with_offsets=False, return one data column {["text", dtype=str]}
>>> jieba_hmm_file = "/path/to/jieba/hmm/file"
>>> jieba_mp_file = "/path/to/jieba/mp/file"
>>> tokenizer_op = text.JiebaTokenizer(jieba_hmm_file, jieba_mp_file, mode=JiebaMode.MP, with_offsets=False)
>>> text_file_dataset = text_file_dataset.map(operations=tokenizer_op)
>>>
>>> # 2) If with_offsets=True, return three columns {["token", dtype=str], ["offsets_start", dtype=uint32],
>>> #                                                ["offsets_limit", dtype=uint32]}
>>> tokenizer_op = text.JiebaTokenizer(jieba_hmm_file, jieba_mp_file, mode=JiebaMode.MP, with_offsets=True)
>>> text_file_dataset = text_file_dataset.map(operations=tokenizer_op, input_columns=["text"],
...                                           output_columns=["token", "offsets_start", "offsets_limit"])
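
To inspect the tokenized output, the mapped dataset can be iterated; a minimal sketch (assuming the placeholder dataset file exists and with_offsets=True, so the three columns above are present):

>>> for data in text_file_dataset.create_dict_iterator(num_epochs=1, output_numpy=True):
...     print(data["token"], data["offsets_start"], data["offsets_limit"])
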
add_dict(user_dict)[source]

Add user-defined words to the JiebaTokenizer’s dictionary.

Parameters

user_dict (Union[str, dict]) –

Words to be added. Two loading methods are supported: a file path (str) pointing to a dictionary in the Jieba dictionary format, or a Python dictionary (dict) in the format {word1: freq1, word2: freq2, …}. In the Jieba dictionary format, each line contains a word (required) optionally followed by its freq, for example:

word1 freq1
word2 None
word3 freq3

Only valid word-freq pairs in the user-provided file will be added to the dictionary. Rows containing invalid input will be ignored; no error or warning status is returned.

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>> from mindspore.dataset.text import JiebaMode
>>>
>>> jieba_hmm_file = "/path/to/jieba/hmm/file"
>>> jieba_mp_file = "/path/to/jieba/mp/file"
>>> user_dict = {"男默女泪": 10}
>>> jieba_op = text.JiebaTokenizer(jieba_hmm_file, jieba_mp_file, mode=JiebaMode.MP)
>>> jieba_op.add_dict(user_dict)
>>>
>>> text_file_list = ["/path/to/text_file_dataset_file"]
>>> text_file_dataset = ds.TextFileDataset(dataset_files=text_file_list)
>>> text_file_dataset = text_file_dataset.map(operations=jieba_op, input_columns=["text"])
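
Alternatively, add_dict accepts a file path to a dictionary in the Jieba dictionary format described above; a minimal sketch (the path is a placeholder):

>>> user_dict_file = "/path/to/user/dict/file"
>>> jieba_op.add_dict(user_dict_file)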
add_word(word, freq=None)[source]

Add a user-defined word to the JiebaTokenizer’s dictionary.

Parameters
  • word (str) – The word to be added to the JiebaTokenizer instance. The added word will not be written into the built-in dictionary on disk.

  • freq (int, optional) – The frequency of the word to be added. The higher the frequency, the greater the chance that the word will be tokenized as a single token. Default: None, meaning the default frequency is used.

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>> from mindspore.dataset.text import JiebaMode
>>>
>>> jieba_hmm_file = "/path/to/jieba/hmm/file"
>>> jieba_mp_file = "/path/to/jieba/mp/file"
>>> jieba_op = text.JiebaTokenizer(jieba_hmm_file, jieba_mp_file, mode=JiebaMode.MP)
>>> sentence_piece_vocab_file = "/path/to/sentence/piece/vocab/file"
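>>> # Each line of the vocab file is comma-separated; use the first field as the word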
>>> with open(sentence_piece_vocab_file, 'r') as f:
...     for line in f:
...         word = line.split(',')[0]
...         jieba_op.add_word(word)
>>>
>>> text_file_list = ["/path/to/text_file_dataset_file"]
>>> text_file_dataset = ds.TextFileDataset(dataset_files=text_file_list)
>>> text_file_dataset = text_file_dataset.map(operations=jieba_op, input_columns=["text"])
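
add_word can also be called with an explicit frequency to bias segmentation toward the new word; a minimal sketch (word and frequency are illustrative):

>>> jieba_op.add_word("男默女泪", freq=10)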