Class JiebaTokenizer

Inheritance Relationships

Base Type

  • public mindspore::dataset::TensorTransform

Class Documentation

class JiebaTokenizer : public mindspore::dataset::TensorTransform

Tokenize a Chinese string into words based on the dictionary.

Note

Confirm that the dictionary files used by the HMMSegment and MPSegment algorithms are complete and uncorrupted before use.

Public Functions

inline JiebaTokenizer(const std::string &hmm_path, const std::string &mp_path, const JiebaMode &mode = JiebaMode::kMix, bool with_offsets = false)

Constructor.

Parameters
  • hmm_path[in] Dictionary file used by the HMMSegment algorithm. The dictionary can be obtained from the official cppjieba repository (https://github.com/yanyiwu/cppjieba).

  • mp_path[in] Dictionary file used by the MPSegment algorithm. The dictionary can be obtained from the official cppjieba repository (https://github.com/yanyiwu/cppjieba).

  • mode[in] Valid values are JiebaMode::kMp, JiebaMode::kHmm and JiebaMode::kMix (default=JiebaMode::kMix).

    • JiebaMode::kMp, tokenizes with the MPSegment algorithm.

    • JiebaMode::kHmm, tokenizes with the Hidden Markov Model Segment algorithm.

    • JiebaMode::kMix, tokenizes with a mix of the MPSegment and HMMSegment algorithms.

  • with_offsets[in] Whether to output the offsets of tokens (default=false).
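A minimal construction sketch (the dictionary paths are placeholders, and the header path and `text` sub-namespace are assumptions that may differ between MindSpore releases):

```cpp
// Sketch only: the include path and namespace may vary across MindSpore versions.
#include <memory>
#include <string>
#include "include/dataset/text.h"

// Placeholder paths to the cppjieba dictionary files; point these at your
// downloaded hmm_model.utf8 and jieba.dict.utf8.
std::string hmm_path = "/path/to/hmm_model.utf8";
std::string mp_path = "/path/to/jieba.dict.utf8";

// Tokenize with the mixed MP + HMM algorithm and also emit token offsets.
auto tokenizer = std::make_shared<mindspore::dataset::text::JiebaTokenizer>(
    hmm_path, mp_path, mindspore::dataset::JiebaMode::kMix, /*with_offsets=*/true);

// The transform can then be applied in a dataset pipeline, e.g.
// ds = ds->Map({tokenizer}, {"text"});
```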

JiebaTokenizer(const std::vector<char> &hmm_path, const std::vector<char> &mp_path, const JiebaMode &mode, bool with_offsets)

Constructor.

Parameters
  • hmm_path[in] Dictionary file used by the HMMSegment algorithm. The dictionary can be obtained from the official cppjieba repository (https://github.com/yanyiwu/cppjieba).

  • mp_path[in] Dictionary file used by the MPSegment algorithm. The dictionary can be obtained from the official cppjieba repository (https://github.com/yanyiwu/cppjieba).

  • mode[in] Valid values are JiebaMode::kMp, JiebaMode::kHmm and JiebaMode::kMix.

    • JiebaMode::kMp, tokenizes with the MPSegment algorithm.

    • JiebaMode::kHmm, tokenizes with the Hidden Markov Model Segment algorithm.

    • JiebaMode::kMix, tokenizes with a mix of the MPSegment and HMMSegment algorithms.

  • with_offsets[in] Whether to output the offsets of tokens.

~JiebaTokenizer() = default

Destructor.

inline Status AddWord(const std::string &word, int64_t freq = 0)

Add a user-defined word to the JiebaTokenizer’s dictionary.

Parameters
  • word[in] The word to be added to the JiebaTokenizer instance. The added word will not be written into the built-in dictionary on disk.

  • freq[in] The frequency of the word to be added. The higher the frequency, the better the chance that the word will be tokenized (default=0, which uses the default frequency).

Returns

Status error code, returns OK if no error is encountered.
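A hedged usage sketch (assumes an already-constructed `tokenizer`; the `Status`-checking idiom shown is an assumption and may differ by MindSpore version):

```cpp
// Sketch only: `tokenizer` is assumed to point at a JiebaTokenizer
// built with valid cppjieba dictionary paths.
mindspore::dataset::Status rc = tokenizer->AddWord("MindSpore", 100);
if (!rc.IsOk()) {
  // Handle the failure, e.g. log rc.ToString().
}
```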

inline Status AddDict(const std::vector<std::pair<std::string, int64_t>> &user_dict)

Add a user-defined dictionary of word-freq pairs to the JiebaTokenizer’s dictionary.

Parameters

user_dict[in] Vector of word-freq pairs to be added to the JiebaTokenizer’s dictionary.

Returns

Status error code, returns OK if no error is encountered.
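A hedged sketch of adding several pairs at once (assumes an already-constructed `tokenizer`):

```cpp
// Sketch only: each pair is (word, frequency); as with AddWord, a higher
// frequency gives the word a better chance of being kept as one token.
std::vector<std::pair<std::string, int64_t>> user_dict = {
    {"今天天气", 10}, {"MindSpore", 100}};
mindspore::dataset::Status rc = tokenizer->AddDict(user_dict);
```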

inline Status AddDict(const std::string &file_path)

Add a user-defined dictionary of word-freq pairs to the JiebaTokenizer’s dictionary from a file. Only valid word-freq pairs in the file are added to the dictionary; rows containing invalid input are ignored, and no error or warning status is returned for them.

Parameters

file_path[in] Path to the dictionary file containing user-defined word-freq pairs.

Returns

Status error code, returns OK if no error is encountered.
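To illustrate the row-filtering behavior described above (valid rows kept, invalid rows silently skipped), here is a self-contained sketch of parsing such a word-freq file. This is an illustration of the documented behavior, not MindSpore’s actual implementation, and the exact file format MindSpore accepts may differ:

```cpp
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Illustrative parser: keep valid "word [freq]" rows, silently skip rows
// whose frequency is not a non-negative integer. Not MindSpore's actual code.
std::vector<std::pair<std::string, int64_t>> ParseUserDict(std::istream &in) {
  std::vector<std::pair<std::string, int64_t>> pairs;
  std::string line;
  while (std::getline(in, line)) {
    std::istringstream fields(line);
    std::string word;
    if (!(fields >> word)) continue;  // blank line: nothing to add
    int64_t freq = 0;                 // frequency omitted: use the default
    std::string freq_str;
    if (fields >> freq_str) {
      try {
        freq = std::stoll(freq_str);
      } catch (...) {
        continue;  // non-numeric frequency: skip the row, no error raised
      }
      if (freq < 0) continue;  // negative frequency: skip the row
    }
    pairs.emplace_back(word, freq);
  }
  return pairs;
}
```

Each line is assumed to hold one word, optionally followed by a non-negative integer frequency.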