mindspore.dataset.text.transforms.JiebaTokenizer
- class mindspore.dataset.text.transforms.JiebaTokenizer(hmm_path, mp_path, mode=JiebaMode.MIX, with_offsets=False)
Tokenize a Chinese string into words based on a dictionary.
Note
The integrity of the dictionary files used by the HMMSegment and MPSegment algorithms must be confirmed before use.
- Parameters
hmm_path (str) – Dictionary file used by the HMMSegment algorithm. The dictionary can be obtained from the official cppjieba website.
mp_path (str) – Dictionary file used by the MPSegment algorithm. The dictionary can be obtained from the official cppjieba website.
mode (JiebaMode, optional) –
Valid values are JiebaMode.MP, JiebaMode.HMM and JiebaMode.MIX (default=JiebaMode.MIX).
JiebaMode.MP, tokenize with the MPSegment algorithm.
JiebaMode.HMM, tokenize with the Hidden Markov Model Segment algorithm.
JiebaMode.MIX, tokenize with a mix of the MPSegment and HMMSegment algorithms.
with_offsets (bool, optional) – Whether to output the offsets of tokens (default=False).
Examples
>>> import mindspore.dataset.text as text
>>> from mindspore.dataset.text import JiebaMode
>>> # If with_offsets=False, the default output is one column {["text", dtype=str]}
>>> jieba_hmm_file = "/path/to/jieba/hmm/file"
>>> jieba_mp_file = "/path/to/jieba/mp/file"
>>> tokenizer_op = text.JiebaTokenizer(jieba_hmm_file, jieba_mp_file, mode=JiebaMode.MP, with_offsets=False)
>>> text_file_dataset = text_file_dataset.map(operations=tokenizer_op)
>>> # If with_offsets=True, the output is three columns {["token", dtype=str], ["offsets_start", dtype=uint32],
>>> #                                                    ["offsets_limit", dtype=uint32]}
>>> tokenizer_op = text.JiebaTokenizer(jieba_hmm_file, jieba_mp_file, mode=JiebaMode.MP, with_offsets=True)
>>> text_file_dataset_1 = text_file_dataset_1.map(operations=tokenizer_op, input_columns=["text"],
...                                               output_columns=["token", "offsets_start", "offsets_limit"],
...                                               column_order=["token", "offsets_start", "offsets_limit"])
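The relationship between the three output columns can be sketched in plain Python. This is an illustration only, not MindSpore code: `token_offsets` is a hypothetical helper, the token list is written by hand rather than produced by JiebaTokenizer, and it assumes that offsets are byte positions in the UTF-8 encoded input, which this page does not state.

```python
def token_offsets(sentence, tokens):
    """Return (start, limit) byte offsets for tokens that occur in order in sentence."""
    data = sentence.encode("utf-8")
    offsets = []
    cursor = 0
    for token in tokens:
        encoded = token.encode("utf-8")
        # Search from the end of the previous token so repeated tokens map correctly.
        start = data.index(encoded, cursor)
        limit = start + len(encoded)
        offsets.append((start, limit))
        cursor = limit
    return offsets


# Hand-written tokens (an assumption, not real tokenizer output).
tokens = ["今天天气", "太好了"]
offsets = token_offsets("今天天气太好了", tokens)
print(offsets)  # [(0, 12), (12, 21)]: each of these characters is 3 bytes in UTF-8
```

Under that assumption, "offsets_start" holds the first element of each pair and "offsets_limit" the second, so slicing the encoded input with a pair recovers the token.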
- add_dict(user_dict)
Add user-defined words to JiebaTokenizer’s dictionary.
- Parameters
user_dict (Union[str, dict]) –
The user dictionary can be loaded in one of two ways: from a file path (str) whose content follows the Jieba dictionary format, or from a Python dictionary (dict) of the form {word1: freq1, word2: freq2, …}. In the Jieba dictionary format, each row holds a word (required) and a frequency (optional), such as:
word1 freq1
word2 None
word3 freq3
Only valid word-freq pairs in the user-provided file are added to the dictionary. Rows containing invalid input are ignored, and no error or warning status is returned.
Examples
>>> import mindspore.dataset.text as text
>>> from mindspore.dataset.text import JiebaMode
>>> jieba_hmm_file = "/path/to/jieba/hmm/file"
>>> jieba_mp_file = "/path/to/jieba/mp/file"
>>> user_dict = {"男默女泪": 10}
>>> jieba_op = text.JiebaTokenizer(jieba_hmm_file, jieba_mp_file, mode=JiebaMode.MP)
>>> jieba_op.add_dict(user_dict)
>>> text_file_dataset = text_file_dataset.map(operations=jieba_op, input_columns=["text"])
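To preview which rows of a user dictionary file would count as valid word-freq pairs, the file format described above can be approximated in plain Python. This is a rough sketch, not the parser used by add_dict: `parse_user_dict` is a hypothetical helper, and it assumes whitespace-separated fields, a positive integer frequency, and that a missing frequency or the literal text None both mean "no frequency given".

```python
def parse_user_dict(rows):
    """Return {word: freq or None} from 'word [freq]' rows, skipping invalid rows."""
    entries = {}
    for row in rows:
        parts = row.split()
        # A bare word, or a word followed by the literal "None", has no frequency.
        if len(parts) == 1 or (len(parts) == 2 and parts[1] == "None"):
            entries[parts[0]] = None
            continue
        if len(parts) != 2:
            continue  # blank rows and rows with extra fields are skipped silently
        try:
            freq = int(parts[1])
        except ValueError:
            continue  # non-numeric frequency: skipped silently
        if freq > 0:
            entries[parts[0]] = freq
    return entries


rows = ["男默女泪 10", "江大桥", "词语 None", "", "男默 女泪 x y", "坏词 abc", "负数 -3"]
user_dict = parse_user_dict(rows)
print(user_dict)  # {'男默女泪': 10, '江大桥': None, '词语': None}
```

As in the behavior described above, the four invalid rows are dropped without any error or warning, so checking the parsed result is the only way to notice them.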
- add_word(word, freq=None)
Add a user-defined word to JiebaTokenizer’s dictionary.
- Parameters
word (str) – The word to be added to the JiebaTokenizer instance. The added word will not be written into the built-in dictionary on disk.
freq (int, optional) – The frequency of the word to be added. The higher the frequency, the more likely the word is to be kept as a single token (default=None, use the default frequency).
Examples
>>> import mindspore.dataset.text as text
>>> from mindspore.dataset.text import JiebaMode
>>> jieba_hmm_file = "/path/to/jieba/hmm/file"
>>> jieba_mp_file = "/path/to/jieba/mp/file"
>>> jieba_op = text.JiebaTokenizer(jieba_hmm_file, jieba_mp_file, mode=JiebaMode.MP)
>>> sentence_piece_vocab_file = "/path/to/sentence/piece/vocab/file"
>>> with open(sentence_piece_vocab_file, 'r') as f:
...     for line in f:
...         word = line.split(',')[0]
...         jieba_op.add_word(word)
>>> text_file_dataset = text_file_dataset.map(operations=jieba_op, input_columns=["text"])
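Why a higher freq makes a word more likely to survive as one token can be illustrated with a toy maximum-probability segmenter. This is a simplified unigram model written for this page, not cppjieba's MPSegment implementation: `mp_segment`, the character counts and the frequencies below are all made up for the illustration.

```python
import math


def mp_segment(sentence, freqs):
    """Split sentence into the word sequence with the highest unigram log-probability."""
    total = sum(freqs.values())
    n = len(sentence)
    # best[i] = (score, end) for the best segmentation of sentence[i:],
    # where sentence[i:end] is the first word of that segmentation.
    best = [None] * (n + 1)
    best[n] = (0.0, n)
    for i in range(n - 1, -1, -1):
        candidates = []
        for j in range(i + 1, n + 1):
            word = sentence[i:j]
            # Unknown single characters count as 1 so that a path always exists.
            count = freqs.get(word, 1 if j == i + 1 else 0)
            if count > 0:
                candidates.append((math.log(count / total) + best[j][0], j))
        best[i] = max(candidates)
    tokens = []
    i = 0
    while i < n:
        j = best[i][1]
        tokens.append(sentence[i:j])
        i = j
    return tokens


chars = {"男": 100, "默": 100, "女": 100, "泪": 100}
low = mp_segment("男默女泪", {**chars, "男默女泪": 1})
high = mp_segment("男默女泪", {**chars, "男默女泪": 10})
print(low)   # ['男', '默', '女', '泪']
print(high)  # ['男默女泪']
```

With a frequency of 1 the four single characters together are more probable than the whole word, so the word is split; raising the frequency to 10 reverses that. Actual results from add_word depend on the frequencies in the real dictionary files.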