mindspore.dataset.text.JiebaTokenizer
- class mindspore.dataset.text.JiebaTokenizer(hmm_path, mp_path, mode=JiebaMode.MIX, with_offsets=False)
Tokenize a Chinese string into words based on a dictionary.
Note
The integrity of the dictionary files used by the HMMSegment and MPSegment algorithms must be confirmed.
- Parameters
hmm_path (str) – Dictionary file used by the HMMSegment algorithm. The dictionary can be obtained from the official website of cppjieba.
mp_path (str) – Dictionary file used by the MPSegment algorithm. The dictionary can be obtained from the official website of cppjieba.
mode (JiebaMode, optional) – Segmentation mode. Valid values are JiebaMode.MP, JiebaMode.HMM and JiebaMode.MIX. Default: JiebaMode.MIX.
JiebaMode.MP: tokenize with the MPSegment algorithm.
JiebaMode.HMM: tokenize with the Hidden Markov Model Segment algorithm.
JiebaMode.MIX: tokenize with a mix of the MPSegment and HMMSegment algorithms.
with_offsets (bool, optional) – Whether to output the offsets of tokens. Default: False.
- Raises
ValueError – If path of HMMSegment dict is not provided.
ValueError – If path of MPSegment dict is not provided.
TypeError – If hmm_path or mp_path is not of type string.
TypeError – If with_offsets is not of type bool.
- Supported Platforms:
CPU
Examples
>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>> from mindspore.dataset.text import JiebaMode
>>>
>>> text_file_list = ["/path/to/text_file_dataset_file"]
>>> text_file_dataset = ds.TextFileDataset(dataset_files=text_file_list)
>>>
>>> # 1) If with_offsets=False, return one data column {["text", dtype=str]}
>>> jieba_hmm_file = "/path/to/jieba/hmm/file"
>>> jieba_mp_file = "/path/to/jieba/mp/file"
>>> tokenizer_op = text.JiebaTokenizer(jieba_hmm_file, jieba_mp_file, mode=JiebaMode.MP, with_offsets=False)
>>> text_file_dataset = text_file_dataset.map(operations=tokenizer_op)
>>>
>>> # 2) If with_offsets=True, return three columns {["token", dtype=str], ["offsets_start", dtype=uint32],
>>> #    ["offsets_limit", dtype=uint32]}
>>> tokenizer_op = text.JiebaTokenizer(jieba_hmm_file, jieba_mp_file, mode=JiebaMode.MP, with_offsets=True)
>>> text_file_dataset = text_file_dataset.map(operations=tokenizer_op, input_columns=["text"],
...                                           output_columns=["token", "offsets_start", "offsets_limit"])
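To make the offsets_start and offsets_limit columns easier to picture, the following is a minimal pure-Python sketch that computes a start and limit position for each token of a sentence. It assumes that offsets are UTF-8 byte positions and that each limit is exclusive; this page does not state either point, so treat both as assumptions. The helper token_offsets is hypothetical and is not part of MindSpore.

```python
def token_offsets(text, tokens):
    """Return (start, limit) pairs for consecutive tokens of text.

    Assumption: offsets count UTF-8 bytes and limit is exclusive.
    """
    offsets = []
    char_pos = 0
    for token in tokens:
        # Locate the token at or after the current character position.
        char_pos = text.index(token, char_pos)
        start = len(text[:char_pos].encode("utf-8"))
        limit = start + len(token.encode("utf-8"))
        offsets.append((start, limit))
        char_pos += len(token)
    return offsets


sentence = "今天天气太好了"
tokens = ["今天天气", "太好了"]  # a hand-written segmentation, for illustration only
starts_limits = token_offsets(sentence, tokens)
print(starts_limits)  # [(0, 12), (12, 21)]
```

Under these assumptions each Chinese character occupies three bytes, so the four-character token spans bytes 0 to 12 and the three-character token spans bytes 12 to 21.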
- add_dict(user_dict)
Add user-defined words to JiebaTokenizer’s dictionary.
- Parameters
user_dict (Union[str, dict]) –
The words can be loaded in one of two ways: from a file path (str) pointing to a file in the Jieba dictionary format, or from a Python dictionary (dict) of the form {word1: freq1, word2: freq2, …}. In the Jieba dictionary format each row holds a word (required) and a frequency (optional), such as:
word1 freq1
word2 None
word3 freq3
Only valid word-freq pairs in the user-provided file are added to the dictionary. Rows containing invalid input are ignored, and no error or warning is returned.
Examples
>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>> from mindspore.dataset.text import JiebaMode
>>>
>>> jieba_hmm_file = "/path/to/jieba/hmm/file"
>>> jieba_mp_file = "/path/to/jieba/mp/file"
>>> user_dict = {"男默女泪": 10}
>>> jieba_op = text.JiebaTokenizer(jieba_hmm_file, jieba_mp_file, mode=JiebaMode.MP)
>>> jieba_op.add_dict(user_dict)
>>>
>>> text_file_list = ["/path/to/text_file_dataset_file"]
>>> text_file_dataset = ds.TextFileDataset(dataset_files=text_file_list)
>>> text_file_dataset = text_file_dataset.map(operations=jieba_op, input_columns=["text"])
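Because invalid rows in a user dictionary file are skipped silently, it can help to check a file before passing its path to add_dict. The sketch below is one way to do that in plain Python. It assumes a row is valid when it holds a word with no whitespace, optionally followed by a non-negative integer frequency; the exact rule used by the library is not given on this page, and parse_user_dict is a hypothetical helper, not a MindSpore function.

```python
import re

# Assumed row shape: a word, then an optional integer frequency.
ROW_PATTERN = re.compile(r"^\s*(\S+)(?:\s+(\d+))?\s*$")


def parse_user_dict(lines):
    """Collect word -> freq pairs from rows, skipping rows that do not match."""
    entries = {}
    for line in lines:
        match = ROW_PATTERN.match(line)
        if match is None:
            continue  # mirrors the documented behaviour: ignore without error
        word, freq = match.groups()
        entries[word] = int(freq) if freq is not None else None
    return entries


rows = ["男默女泪 10", "江大桥", "bad row 1 2", ""]
parsed = parse_user_dict(rows)
print(parsed)  # {'男默女泪': 10, '江大桥': None}
```

Comparing the number of parsed entries with the number of non-empty rows gives a quick indication of how many rows would be dropped under this assumed rule.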
- add_word(word, freq=None)
Add a user-defined word to JiebaTokenizer’s dictionary.
- Parameters
word (str) – The word to be added to the JiebaTokenizer instance. The added word will not be written into the built-in dictionary on disk.
freq (int, optional) – The frequency of the word to be added. The higher the frequency, the more likely the word is to be segmented as a single token. Default: None, use default frequency.
Examples
>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>> from mindspore.dataset.text import JiebaMode
>>>
>>> jieba_hmm_file = "/path/to/jieba/hmm/file"
>>> jieba_mp_file = "/path/to/jieba/mp/file"
>>> jieba_op = text.JiebaTokenizer(jieba_hmm_file, jieba_mp_file, mode=JiebaMode.MP)
>>> sentence_piece_vocab_file = "/path/to/sentence/piece/vocab/file"
>>> with open(sentence_piece_vocab_file, 'r') as f:
...     for line in f:
...         word = line.split(',')[0]
...         jieba_op.add_word(word)
>>>
>>> text_file_list = ["/path/to/text_file_dataset_file"]
>>> text_file_dataset = ds.TextFileDataset(dataset_files=text_file_list)
>>> text_file_dataset = text_file_dataset.map(operations=jieba_op, input_columns=["text"])
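To illustrate why adding a word with a frequency changes the result, here is a toy maximum-probability segmenter in plain Python. It is only a sketch of the general idea behind dictionary-based segmentation, not the MPSegment implementation in cppjieba or MindSpore. The function mp_segment, the fallback frequency of 1 for unknown characters, and all frequencies shown are made up for illustration.

```python
import math


def mp_segment(sentence, freq):
    """Split sentence into the word sequence with the highest total log-probability."""
    total = sum(freq.values())
    n = len(sentence)
    max_len = max(len(word) for word in freq)
    # best[i] = (score of the best split of sentence[i:], end index of its first word)
    best = [(0.0, n)] * (n + 1)
    for i in range(n - 1, -1, -1):
        candidates = []
        for j in range(i + 1, min(n, i + max_len) + 1):
            word = sentence[i:j]
            if word in freq:
                candidates.append((math.log(freq[word] / total) + best[j][0], j))
        if not candidates:
            # Unknown character: fall back to a single character with frequency 1.
            candidates.append((math.log(1 / total) + best[i + 1][0], i + 1))
        best[i] = max(candidates)
    # Follow the stored end indices to recover the words.
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(sentence[i:j])
        i = j
    return words


base = {"今天": 20, "男": 5, "默": 5, "女": 5, "泪": 5}  # toy frequencies
before = mp_segment("今天男默女泪", base)
after = mp_segment("今天男默女泪", {**base, "男默女泪": 10})
print(before)  # ['今天', '男', '默', '女', '泪']
print(after)   # ['今天', '男默女泪']
```

In this toy model a single four-character entry scores higher than four separate single-character entries, so the phrase stays whole once it is in the dictionary.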