mindspore.dataset.text.JiebaTokenizer
- class mindspore.dataset.text.JiebaTokenizer(hmm_path, mp_path, mode=JiebaMode.MIX, with_offsets=False)[source]
Tokenize Chinese strings with the Jieba tokenizer.
Note
The dictionary files used by Hidden Markov Model segmentation and Max Probability segmentation can be obtained from the cppjieba open-source repository. Please ensure the validity and integrity of these files.
- Parameters:
hmm_path (str) - Path to the dictionary file used by the Hidden Markov Model segmentation algorithm.
mp_path (str) - Path to the dictionary file used by the Max Probability segmentation algorithm.
mode (JiebaMode, optional) - The segmentation algorithm to use, can be JiebaMode.MP (Max Probability), JiebaMode.HMM (Hidden Markov Model) or JiebaMode.MIX (a mix of both). Default: JiebaMode.MIX.
with_offsets (bool, optional) - Whether to output the start and end offsets of each token. Default: False.
- Raises:
TypeError - If hmm_path is not of type str.
TypeError - If mp_path is not of type str.
TypeError - If mode is not of type JiebaMode.
TypeError - If with_offsets is not of type bool.
- Supported Platforms:
CPU
Examples:
>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>> from mindspore.dataset.text import JiebaMode
>>>
>>> # Use the transform in dataset pipeline mode
>>> numpy_slices_dataset = ds.NumpySlicesDataset(data=["床前明月光"], column_names=["text"])
>>>
>>> # 1) If with_offsets=False, return one data column {["text", dtype=str]}
>>> # The paths to jieba_hmm_file and jieba_mp_file can be downloaded directly from the mindspore repository.
>>> # Refer to https://gitee.com/mindspore/mindspore/blob/v2.3.0-rc2/tests/ut/data/dataset/jiebadict/hmm_model.utf8
>>> # and https://gitee.com/mindspore/mindspore/blob/v2.3.0-rc2/tests/ut/data/dataset/jiebadict/jieba.dict.utf8
>>> jieba_hmm_file = "tests/ut/data/dataset/jiebadict/hmm_model.utf8"
>>> jieba_mp_file = "tests/ut/data/dataset/jiebadict/jieba.dict.utf8"
>>> tokenizer_op = text.JiebaTokenizer(jieba_hmm_file, jieba_mp_file, mode=JiebaMode.MP, with_offsets=False)
>>> numpy_slices_dataset = numpy_slices_dataset.map(operations=tokenizer_op)
>>> for item in numpy_slices_dataset.create_dict_iterator(num_epochs=1, output_numpy=True):
...     print(item["text"])
['床' '前' '明月光']
>>>
>>> # 2) If with_offsets=True, return three columns {["token", dtype=str], ["offsets_start", dtype=uint32],
>>> # ["offsets_limit", dtype=uint32]}
>>> numpy_slices_dataset = ds.NumpySlicesDataset(data=["床前明月光"], column_names=["text"])
>>> tokenizer_op = text.JiebaTokenizer(jieba_hmm_file, jieba_mp_file, mode=JiebaMode.MP, with_offsets=True)
>>> numpy_slices_dataset = numpy_slices_dataset.map(operations=tokenizer_op, input_columns=["text"],
...                                                 output_columns=["token", "offsets_start", "offsets_limit"])
>>> for item in numpy_slices_dataset.create_dict_iterator(num_epochs=1, output_numpy=True):
...     print(item["token"], item["offsets_start"], item["offsets_limit"])
['床' '前' '明月光'] [0 3 6] [ 3 6 15]
>>>
>>> # Use the transform in eager mode
>>> data = "床前明月光"
>>> output = text.JiebaTokenizer(jieba_hmm_file, jieba_mp_file, mode=JiebaMode.MP)(data)
>>> print(output)
['床' '前' '明月光']
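A hedged supplement (not part of the original reference): the mode argument selects among the three segmentation algorithms, and their behavior can be compared in eager mode, reusing the dictionary files from the example above.
>>> # Sketch only: tokenize the same string under each segmentation mode.
>>> for m in (JiebaMode.MP, JiebaMode.HMM, JiebaMode.MIX):
...     tokenizer = text.JiebaTokenizer(jieba_hmm_file, jieba_mp_file, mode=m)
...     print(m, tokenizer("床前明月光"))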
- Tutorial Examples:
- add_dict(user_dict)[source]
Add the specified word mapping dictionary to the Vocab.
- Parameters:
user_dict (Union[str, dict[str, int]]) - The word mappings to be added to the Vocab. If a str is passed, it is the path to a file storing the word mappings to be added, where each line of the file must contain two fields separated by a space: the first field is the word itself and the second field must be a number indicating the word frequency. Invalid lines are ignored, and no error or warning is raised. If a dict[str, int] is passed, it is a dictionary of the word mappings to be added, where each key is a word and its value is the word frequency. (A sketch of the file form follows the examples below.)
Examples:
>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>> from mindspore.dataset.text import JiebaMode
>>>
>>> jieba_hmm_file = "/path/to/jieba/hmm/file"
>>> jieba_mp_file = "/path/to/jieba/mp/file"
>>> user_dict = {"男默女泪": 10}
>>> jieba_op = text.JiebaTokenizer(jieba_hmm_file, jieba_mp_file, mode=JiebaMode.MP)
>>> jieba_op.add_dict(user_dict)
>>>
>>> text_file_list = ["/path/to/text_file_dataset_file"]
>>> text_file_dataset = ds.TextFileDataset(dataset_files=text_file_list)
>>> text_file_dataset = text_file_dataset.map(operations=jieba_op, input_columns=["text"])
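A hedged sketch (not part of the original reference) of the str form of user_dict: "/path/to/user_dict.txt" is a hypothetical path, and the file contents follow the format required by the parameter description above, i.e. one word and one numeric frequency per line, separated by a space.
>>> # Sketch only: write a small user dictionary file, then load it by path.
>>> user_dict_path = "/path/to/user_dict.txt"
>>> with open(user_dict_path, 'w') as f:
...     _ = f.write("男默女泪 10\n")
>>> jieba_op.add_dict(user_dict_path)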
- add_word(word, freq=None)[source]
Add a single specified word mapping to the Vocab.
- Parameters:
word (str) - The word to be added to the Vocab.
freq (int, optional) - The frequency of the word to be added. The higher the frequency, the better the chance the word will be tokenized as a whole. Default: None, use the default word frequency.
Examples:
>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>> from mindspore.dataset.text import JiebaMode
>>>
>>> jieba_hmm_file = "/path/to/jieba/hmm/file"
>>> jieba_mp_file = "/path/to/jieba/mp/file"
>>> jieba_op = text.JiebaTokenizer(jieba_hmm_file, jieba_mp_file, mode=JiebaMode.MP)
>>> sentence_piece_vocab_file = "/path/to/sentence/piece/vocab/file"
>>> with open(sentence_piece_vocab_file, 'r') as f:
...     for line in f:
...         word = line.split(',')[0]
...         jieba_op.add_word(word)
>>>
>>> text_file_list = ["/path/to/text_file_dataset_file"]
>>> text_file_dataset = ds.TextFileDataset(dataset_files=text_file_list)
>>> text_file_dataset = text_file_dataset.map(operations=jieba_op, input_columns=["text"])
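A hedged sketch (not part of the original reference) of passing an explicit freq; the word and frequency here are illustrative only.
>>> # Sketch only: a higher freq makes the word more likely to be kept as one token.
>>> jieba_op.add_word("男默女泪", freq=100)
>>> print(jieba_op("男默女泪"))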