Class JiebaTokenizer
Defined in File text.h
Inheritance Relationships
Base Type
public mindspore::dataset::TensorTransform
(Class TensorTransform)
Class Documentation
-
class JiebaTokenizer : public mindspore::dataset::TensorTransform
Tokenize a Chinese string into words based on the dictionary.
Note
Before use, confirm the integrity of the dictionary files used by the HMMSegment and MPSegment algorithms.
Public Functions
-
inline JiebaTokenizer(const std::string &hmm_path, const std::string &mp_path, const JiebaMode &mode = JiebaMode::kMix, bool with_offsets = false)
Constructor.
- Parameters
hmm_path – [in] Dictionary file used by the HMMSegment algorithm. The dictionary can be obtained from the cppjieba repository (https://github.com/yanyiwu/cppjieba).
mp_path – [in] Dictionary file used by the MPSegment algorithm. The dictionary can be obtained from the cppjieba repository (https://github.com/yanyiwu/cppjieba).
mode – [in] Tokenization mode. Valid values are JiebaMode::kMp, JiebaMode::kHmm and JiebaMode::kMix (default=JiebaMode::kMix).
JiebaMode::kMp, tokenizes with the MPSegment algorithm.
JiebaMode::kHmm, tokenizes with the Hidden Markov Model Segment algorithm.
JiebaMode::kMix, tokenizes with a mix of the MPSegment and HMMSegment algorithms.
with_offsets – [in] Whether to output offsets of tokens (default=false).
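To illustrate what with_offsets adds to the output, the sketch below computes a start and a limit position for tokens that are already known. It assumes that offsets are byte positions in the UTF-8 input, which this reference does not state. TokenByteOffsets is a hypothetical helper and not part of the MindSpore API.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not a MindSpore function): given the original UTF-8
// text and its tokens in reading order, return a [start, limit) byte range
// for each token.
std::vector<std::pair<uint32_t, uint32_t>> TokenByteOffsets(
    const std::string &text, const std::vector<std::string> &tokens) {
  std::vector<std::pair<uint32_t, uint32_t>> offsets;
  size_t cursor = 0;  // search resumes after the previous token
  for (const auto &token : tokens) {
    const size_t start = text.find(token, cursor);
    assert(start != std::string::npos);  // tokens must appear in order
    const size_t limit = start + token.size();
    offsets.emplace_back(static_cast<uint32_t>(start),
                         static_cast<uint32_t>(limit));
    cursor = limit;
  }
  return offsets;
}
```

Each character in "今天天气" occupies three bytes in UTF-8, so under this assumption the tokens "今天" and "天气" map to the ranges [0, 6) and [6, 12).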
-
JiebaTokenizer(const std::vector<char> &hmm_path, const std::vector<char> &mp_path, const JiebaMode &mode, bool with_offsets)
Constructor.
- Parameters
hmm_path – [in] Dictionary file used by the HMMSegment algorithm. The dictionary can be obtained from the cppjieba repository (https://github.com/yanyiwu/cppjieba).
mp_path – [in] Dictionary file used by the MPSegment algorithm. The dictionary can be obtained from the cppjieba repository (https://github.com/yanyiwu/cppjieba).
mode – [in] Tokenization mode. Valid values are JiebaMode::kMp, JiebaMode::kHmm and JiebaMode::kMix.
JiebaMode::kMp, tokenizes with the MPSegment algorithm.
JiebaMode::kHmm, tokenizes with the Hidden Markov Model Segment algorithm.
JiebaMode::kMix, tokenizes with a mix of the MPSegment and HMMSegment algorithms.
with_offsets – [in] Whether to output offsets of tokens.
-
~JiebaTokenizer() = default
Destructor.
-
inline Status AddWord(const std::string &word, int64_t freq = 0)
Add a user-defined word to the JiebaTokenizer’s dictionary.
- Parameters
word – [in] The word to be added to the JiebaTokenizer instance. The added word is not written to the built-in dictionary on disk.
freq – [in] The frequency of the word to be added. The higher the frequency, the more likely the word is to be kept as a single token (default=0, use default frequency).
- Returns
Status error code, returns OK if no error is encountered.
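The effect of freq can be illustrated with a toy maximum-probability segmenter. This is a simplified sketch and not the cppjieba or MindSpore implementation. It works on single-byte (ASCII) text, scores each word as log(freq / total), and picks the split with the highest total score. ToyDict and ToyMaxProbSegment are hypothetical names.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>
#include <string>
#include <unordered_map>
#include <vector>

// Toy dictionary: word -> frequency (hypothetical, for illustration only).
using ToyDict = std::unordered_map<std::string, int64_t>;

// Split `text` into the word sequence with the highest summed log-probability.
// Characters not covered by any dictionary word become single-character tokens.
std::vector<std::string> ToyMaxProbSegment(const std::string &text,
                                           const ToyDict &dict) {
  double total = 0.0;
  for (const auto &entry : dict) total += static_cast<double>(entry.second);
  if (total <= 0.0) total = 1.0;
  const double unknown_score = std::log(0.5 / total);  // penalty for unknowns

  const size_t n = text.size();
  std::vector<double> best(n + 1, -std::numeric_limits<double>::infinity());
  std::vector<size_t> next(n + 1, n);  // next[i]: end of the word starting at i
  best[n] = 0.0;
  // Dynamic programming from the end of the text towards the beginning.
  for (size_t i = n; i-- > 0;) {
    for (size_t j = i + 1; j <= n; ++j) {
      const auto it = dict.find(text.substr(i, j - i));
      double score = 0.0;
      if (it != dict.end() && it->second > 0) {
        score = std::log(static_cast<double>(it->second) / total);
      } else if (j == i + 1) {
        score = unknown_score;  // fall back to a single character
      } else {
        continue;  // multi-character span that is not a dictionary word
      }
      if (score + best[j] > best[i]) {
        best[i] = score + best[j];
        next[i] = j;
      }
    }
  }
  std::vector<std::string> words;
  for (size_t i = 0; i < n; i = next[i]) {
    words.push_back(text.substr(i, next[i] - i));
  }
  return words;
}
```

In this toy model, a dictionary holding "new" and "york" with frequency 5 each splits "newyork" into two words. Adding "newyork" with frequency 1 leaves the split unchanged, while adding it with frequency 20 makes the whole string a single token.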
-
inline Status AddDict(const std::vector<std::pair<std::string, int64_t>> &user_dict)
Add a user-defined dictionary of word-freq pairs to the JiebaTokenizer’s dictionary.
- Parameters
user_dict – [in] Vector of word-freq pairs to be added to the JiebaTokenizer’s dictionary.
- Returns
Status error code, returns OK if no error is encountered.
-
inline Status AddDict(const std::string &file_path)
Add a user-defined dictionary of word-freq pairs from a file to the JiebaTokenizer’s dictionary. Only valid word-freq pairs in the file are added. Rows containing invalid input are ignored, and no error or warning status is returned for them.
- Parameters
file_path – [in] Path to the dictionary file containing user-defined word-freq pairs.
- Returns
Status error code, returns OK if no error is encountered.
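The "invalid rows are ignored" behaviour can be sketched with a small reader that produces the same pair type the vector overload of AddDict accepts. The file layout is an assumption here: one whitespace-separated "word freq" pair per line, where a row counts as valid only if it holds exactly a word and a non-negative integer. ParseWordFreqLines is a hypothetical helper, not a MindSpore function, and the library's own validity rules may differ.

```cpp
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using WordFreqList = std::vector<std::pair<std::string, int64_t>>;

// Hypothetical reader: collects valid "word freq" rows and silently skips
// everything else (empty lines, missing or non-numeric frequency, extra fields).
WordFreqList ParseWordFreqLines(std::istream &in) {
  WordFreqList pairs;
  std::string line;
  while (std::getline(in, line)) {
    std::istringstream fields(line);
    std::string word;
    int64_t freq = 0;
    std::string extra;
    if (!(fields >> word >> freq)) continue;  // word or frequency missing
    if (freq < 0) continue;                   // negative frequency (assumed invalid)
    if (fields >> extra) continue;            // trailing garbage on the row
    pairs.emplace_back(word, freq);
  }
  return pairs;
}
```

Reading from a std::istream keeps the sketch testable; a std::ifstream opened on file_path can be passed in the same way.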