Class JiebaTokenizer

Inheritance Relationships

Base Type
  • public mindspore::dataset::TensorTransform

Class Documentation

class JiebaTokenizer : public mindspore::dataset::TensorTransform

Tokenize a Chinese string into words based on the dictionary.

Note

Verify the integrity of the HMMSegment and MPSegment dictionary files before use.

Public Functions

inline JiebaTokenizer(const std::string &hmm_path, const std::string &mp_path, const JiebaMode &mode = JiebaMode::kMix, bool with_offsets = false)

Constructor.

Parameters
  • hmm_path[in] Dictionary file used by the HMMSegment algorithm. The dictionary can be obtained from the cppjieba repository (https://github.com/yanyiwu/cppjieba).

  • mp_path[in] Dictionary file used by the MPSegment algorithm. The dictionary can be obtained from the cppjieba repository (https://github.com/yanyiwu/cppjieba).

  • mode[in] Tokenization mode. Valid values are JiebaMode::kMp, JiebaMode::kHmm and JiebaMode::kMix (default=JiebaMode::kMix).

    • JiebaMode::kMp, tokenizes with the MPSegment algorithm.

    • JiebaMode::kHmm, tokenizes with the Hidden Markov Model Segment algorithm.

    • JiebaMode::kMix, tokenizes with a mix of the MPSegment and HMMSegment algorithms.

  • with_offsets[in] Whether to output offsets of tokens (default=false).

Example
/* Define operations */
auto tokenizer_op = text::JiebaTokenizer("/path/to/hmm/file", "/path/to/mp/file");

/* dataset is an instance of Dataset object */
dataset = dataset->Map({tokenizer_op},   // operations
                       {"text"});        // input columns
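The with_offsets parameter is described only briefly above, so the following sketch illustrates what a start/limit pair for a token can look like. It is a standalone illustration, not MindSpore code: TokenSpan and LocateTokens are hypothetical names, and it assumes offsets are byte positions in the UTF-8 input, with start inclusive and limit exclusive.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical record for illustration; not a MindSpore type.
struct TokenSpan {
  std::string token;
  int64_t start;  // byte index of the first byte of the token (assumption)
  int64_t limit;  // byte index one past the last byte of the token (assumption)
};

// Hypothetical helper: locate tokens, given in reading order, inside `text`
// and record their byte offsets. Stops at the first token that is not found.
std::vector<TokenSpan> LocateTokens(const std::string &text,
                                    const std::vector<std::string> &tokens) {
  std::vector<TokenSpan> spans;
  std::size_t cursor = 0;  // search resumes after the previous token
  for (const auto &tok : tokens) {
    const std::size_t pos = text.find(tok, cursor);
    if (pos == std::string::npos) break;
    spans.push_back({tok, static_cast<int64_t>(pos),
                     static_cast<int64_t>(pos + tok.size())});
    cursor = pos + tok.size();
  }
  return spans;
}
```

Because each Chinese character in the sample below occupies three bytes in UTF-8, a two-character token spans six bytes under this assumption; for example, LocateTokens("今天天气", {"今天", "天气"}) yields the spans [0, 6) and [6, 12).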
JiebaTokenizer(const std::vector<char> &hmm_path, const std::vector<char> &mp_path, const JiebaMode &mode, bool with_offsets)

Constructor.

Parameters
  • hmm_path[in] Dictionary file used by the HMMSegment algorithm. The dictionary can be obtained from the cppjieba repository (https://github.com/yanyiwu/cppjieba).

  • mp_path[in] Dictionary file used by the MPSegment algorithm. The dictionary can be obtained from the cppjieba repository (https://github.com/yanyiwu/cppjieba).

  • mode[in] Tokenization mode. Valid values are JiebaMode::kMp, JiebaMode::kHmm and JiebaMode::kMix. This overload has no default value.

    • JiebaMode::kMp, tokenizes with the MPSegment algorithm.

    • JiebaMode::kHmm, tokenizes with the Hidden Markov Model Segment algorithm.

    • JiebaMode::kMix, tokenizes with a mix of the MPSegment and HMMSegment algorithms.

  • with_offsets[in] Whether to output offsets of tokens. This overload has no default value.

~JiebaTokenizer() = default

Destructor.

inline Status AddWord(const std::string &word, int64_t freq = 0)

Add a user-defined word to the JiebaTokenizer’s dictionary.

Parameters
  • word[in] The word to be added to the JiebaTokenizer instance. The added word is kept in memory only and is not written into the built-in dictionary on disk.

  • freq[in] The frequency of the word to be added. The higher the frequency, the more likely the word is to be tokenized as a single unit (default=0, which uses the default frequency).

Returns

Status error code, returns OK if no error is encountered.

Example
/* Define operations */
auto tokenizer_op = text::JiebaTokenizer("/path/to/hmm/file", "/path/to/mp/file");

Status s = tokenizer_op.AddWord("hello", 2);
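To make the effect of freq more concrete, here is a toy max-probability segmenter. It is only a sketch of the general idea that the name MPSegment suggests, written under stated assumptions: ToySegment is a hypothetical function, it handles ASCII input only, and its scoring (sum of log relative frequencies, with a small fallback score for unknown single characters) is chosen for illustration and is not taken from cppjieba or MindSpore.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical toy segmenter for illustration only; it is not cppjieba or
// MindSpore code. ASCII input only: one byte is treated as one character.
// It picks the split of `text` that maximizes the sum of log(freq / total).
std::vector<std::string> ToySegment(
    const std::string &text,
    const std::unordered_map<std::string, int64_t> &dict) {
  double total = 1.0;  // start at 1 so the divisions below are always defined
  for (const auto &entry : dict) {
    if (entry.second > 0) total += static_cast<double>(entry.second);
  }
  // Fallback score for a single character that is not in the dictionary.
  const double unknown_score = std::log(0.5 / total);

  const std::size_t n = text.size();
  // best[i] is the best score for text[0, i); prev[i] is where its last word starts.
  std::vector<double> best(n + 1, -std::numeric_limits<double>::infinity());
  std::vector<std::size_t> prev(n + 1, 0);
  best[0] = 0.0;

  for (std::size_t end = 1; end <= n; ++end) {
    for (std::size_t begin = 0; begin < end; ++begin) {
      const std::string piece = text.substr(begin, end - begin);
      const auto it = dict.find(piece);
      double score;
      if (it != dict.end() && it->second > 0) {
        score = std::log(static_cast<double>(it->second) / total);
      } else if (piece.size() == 1) {
        score = unknown_score;
      } else {
        continue;  // multi-character pieces must be dictionary words
      }
      if (best[begin] + score > best[end]) {
        best[end] = best[begin] + score;
        prev[end] = begin;
      }
    }
  }

  // Walk the back-pointers from the end, prepending each word so the
  // result comes out in reading order.
  std::vector<std::string> words;
  for (std::size_t end = n; end > 0; end = prev[end]) {
    words.insert(words.begin(), text.substr(prev[end], end - prev[end]));
  }
  return words;
}
```

With the toy dictionary {"new": 10, "york": 10, "newyork": 1}, the input "newyork" is split into "new" and "york". Raising the frequency of "newyork" to 100 makes the single-word reading score higher, so the input stays whole. This mirrors the documented intent of freq, but the real scoring may differ.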
inline Status AddDict(const std::vector<std::pair<std::string, int64_t>> &user_dict)

Add a user-defined dictionary of word-freq pairs to the JiebaTokenizer’s dictionary.

Parameters

user_dict[in] Vector of word-freq pairs to be added to the JiebaTokenizer’s dictionary.

Returns

Status error code, returns OK if no error is encountered.

Example
/* Define operations */
auto tokenizer_op = text::JiebaTokenizer("/path/to/hmm/file", "/path/to/mp/file");

std::vector<std::pair<std::string, int64_t>> user_dict = {{"a", 1}, {"b", 2}, {"c", 3}};
Status s = tokenizer_op.AddDict(user_dict);
inline Status AddDict(const std::string &file_path)

Add a user-defined dictionary of word-freq pairs from a file to the JiebaTokenizer’s dictionary. Only valid word-freq pairs in the file are added. Rows containing invalid input are silently ignored: neither an error nor a warning status is returned for them.

Parameters

file_path[in] Path to the dictionary which includes user defined word-freq pairs.

Returns

Status error code, returns OK if no error is encountered.

Example
/* Define operations */
auto tokenizer_op = text::JiebaTokenizer("/path/to/hmm/file", "/path/to/mp/file");

Status s = tokenizer_op.AddDict("/path/to/dict/file");
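
Since invalid rows are skipped without any status, it can help to pre-check a dictionary file before passing it in. The sketch below is one way to do that with the standard library alone. ParseUserDict is a hypothetical helper, and the row format is an assumption made for this sketch: one word and one non-negative integer frequency per line, separated by whitespace. The format that AddDict actually accepts may be more lenient.

```cpp
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper, not part of MindSpore. Assumes each valid row is
// "<word> <freq>" with a non-negative integer freq and nothing after it.
// Rows that do not match are skipped, mirroring the documented behavior.
std::vector<std::pair<std::string, int64_t>> ParseUserDict(std::istream &in) {
  std::vector<std::pair<std::string, int64_t>> entries;
  std::string line;
  while (std::getline(in, line)) {
    std::istringstream fields(line);
    std::string word;
    std::string extra;
    int64_t freq = 0;
    const bool has_pair = static_cast<bool>(fields >> word >> freq);
    // Reject missing fields, negative frequencies, and trailing junk.
    if (!has_pair || freq < 0 || (fields >> extra)) continue;
    entries.emplace_back(word, freq);
  }
  return entries;
}
```

The returned vector has the same type as the user_dict parameter of the vector overload of AddDict, so the parsed entries can be inspected or counted first and then passed to that overload instead of the file path.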