mindspore.dataset.text.JiebaTokenizer
- class mindspore.dataset.text.JiebaTokenizer(hmm_path, mp_path, mode=JiebaMode.MIX, with_offsets=False)[source]
Use the Jieba tokenizer to tokenize Chinese strings.
Note
The dictionary files used by the Hidden Markov Model segment and the Max Probability segment can be obtained from the cppjieba GitHub repository. Please ensure the validity and integrity of these files.
- Parameters
hmm_path (str) – Path to the dictionary file used by the Hidden Markov Model segment.
mp_path (str) – Path to the dictionary file used by the Max Probability segment.
mode (JiebaMode, optional) – The desired segmentation algorithm. See JiebaMode for details on optional values. Default: JiebaMode.MIX.
with_offsets (bool, optional) – Whether to output the start and end offsets of each token in the original string. Default: False.
- Raises
ValueError – If hmm_path is not set.
ValueError – If mp_path is not set.
TypeError – If hmm_path is not of type str.
TypeError – If mp_path is not of type str.
TypeError – If mode is not of type JiebaMode.
TypeError – If with_offsets is not of type bool.
- Supported Platforms:
CPU
Examples
>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>> from mindspore.dataset.text import JiebaMode
>>>
>>> text_file_list = ["/path/to/text_file_dataset_file"]
>>> text_file_dataset = ds.TextFileDataset(dataset_files=text_file_list)
>>>
>>> # 1) If with_offsets=False, return one data column {["text", dtype=str]}
>>> jieba_hmm_file = "/path/to/jieba/hmm/file"
>>> jieba_mp_file = "/path/to/jieba/mp/file"
>>> tokenizer_op = text.JiebaTokenizer(jieba_hmm_file, jieba_mp_file, mode=JiebaMode.MP, with_offsets=False)
>>> text_file_dataset = text_file_dataset.map(operations=tokenizer_op)
>>>
>>> # 2) If with_offsets=True, return three columns {["token", dtype=str], ["offsets_start", dtype=uint32],
>>> #    ["offsets_limit", dtype=uint32]}
>>> tokenizer_op = text.JiebaTokenizer(jieba_hmm_file, jieba_mp_file, mode=JiebaMode.MP, with_offsets=True)
>>> text_file_dataset = text_file_dataset.map(operations=tokenizer_op, input_columns=["text"],
...                                           output_columns=["token", "offsets_start", "offsets_limit"])
- add_dict(user_dict)[source]
Add the specified word mappings to the Vocab of the tokenizer.
- Parameters
user_dict (Union[str, dict[str, int]]) – The word mappings to be added to the Vocab. If the input is a str, it is the path of a file storing the word mappings to be added. Each line of the file should contain two fields separated by a space: the word itself, followed by a number indicating the word frequency. Invalid lines are ignored, with no error or warning returned. If the input is a dict[str, int], it is a dictionary storing the word mappings to be added, where each key is a word and its value is the word frequency. A sample file is sketched below.
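For illustration, a user dictionary file in this format (hypothetical words and frequencies) might contain:

深度学习 3
自然语言处理 5
机器翻译 2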
Examples
>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>> from mindspore.dataset.text import JiebaMode
>>>
>>> jieba_hmm_file = "/path/to/jieba/hmm/file"
>>> jieba_mp_file = "/path/to/jieba/mp/file"
>>> user_dict = {"男默女泪": 10}
>>> jieba_op = text.JiebaTokenizer(jieba_hmm_file, jieba_mp_file, mode=JiebaMode.MP)
>>> jieba_op.add_dict(user_dict)
>>>
>>> text_file_list = ["/path/to/text_file_dataset_file"]
>>> text_file_dataset = ds.TextFileDataset(dataset_files=text_file_list)
>>> text_file_dataset = text_file_dataset.map(operations=jieba_op, input_columns=["text"])
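Since user_dict also accepts a str path, the same mappings can be loaded from a user dictionary file instead of an in-memory dict; the path below is a placeholder:

>>> # Load word mappings from a user dictionary file (hypothetical path)
>>> jieba_op.add_dict("/path/to/user/dict/file")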
- add_word(word, freq=None)[source]
Add a specified word mapping to the Vocab of the tokenizer.
- Parameters
word (str) – The word to be added to the Vocab.
freq (int, optional) – The frequency of the word to be added. The higher the frequency, the better the chance the word will be tokenized as a whole. Default: None, using the default frequency.
Examples
>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>> from mindspore.dataset.text import JiebaMode
>>>
>>> jieba_hmm_file = "/path/to/jieba/hmm/file"
>>> jieba_mp_file = "/path/to/jieba/mp/file"
>>> jieba_op = text.JiebaTokenizer(jieba_hmm_file, jieba_mp_file, mode=JiebaMode.MP)
>>> sentence_piece_vocab_file = "/path/to/sentence/piece/vocab/file"
>>> with open(sentence_piece_vocab_file, 'r') as f:
...     for line in f:
...         word = line.split(',')[0]
...         jieba_op.add_word(word)
>>>
>>> text_file_list = ["/path/to/text_file_dataset_file"]
>>> text_file_dataset = ds.TextFileDataset(dataset_files=text_file_list)
>>> text_file_dataset = text_file_dataset.map(operations=jieba_op, input_columns=["text"])
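A word can also be registered with an explicit frequency; per the freq parameter above, a higher frequency increases the chance the word is tokenized as a whole. The word and frequency below are illustrative only:

>>> # Add a single word with a hypothetical frequency of 10
>>> jieba_op.add_word("男默女泪", freq=10)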