Class JiebaTokenizer
Defined in File text.h
Inheritance Relationships
Base Type
public mindspore::dataset::TensorTransform
(Class TensorTransform)
Class Documentation
-
class JiebaTokenizer : public mindspore::dataset::TensorTransform
Tokenize a Chinese string into words based on the dictionary.
Note
Verify the integrity of the HMMSegment and MPSegment dictionary files before use.
Public Functions
-
inline JiebaTokenizer(const std::string &hmm_path, const std::string &mp_path, const JiebaMode &mode = JiebaMode::kMix, bool with_offsets = false)
Constructor.
- Parameters
hmm_path – [in] Dictionary file used by the HMMSegment algorithm. The dictionary can be obtained from the cppjieba repository (https://github.com/yanyiwu/cppjieba).
mp_path – [in] Dictionary file used by the MPSegment algorithm. The dictionary can be obtained from the cppjieba repository (https://github.com/yanyiwu/cppjieba).
mode – [in] Valid values are JiebaMode::kMp, JiebaMode::kHmm and JiebaMode::kMix (default=JiebaMode::kMix).
JiebaMode::kMp, tokenizes with the MPSegment algorithm.
JiebaMode::kHmm, tokenizes with the Hidden Markov Model Segment (HMMSegment) algorithm.
JiebaMode::kMix, tokenizes with a mix of the MPSegment and HMMSegment algorithms.
with_offsets – [in] Whether to output offsets of tokens (default=false).
Example
/* Define operations */
auto tokenizer_op = text::JiebaTokenizer("/path/to/hmm/file", "/path/to/mp/file");

/* dataset is an instance of Dataset object */
dataset = dataset->Map({tokenizer_op},  // operations
                       {"text"});       // input columns
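To illustrate what with_offsets adds to the output, the following sketch computes a start and a limit offset for each token of an already tokenized string. It does not call MindSpore: TokenSpan and ComputeOffsets are hypothetical names, and treating offsets as byte positions (start inclusive, limit exclusive) is an assumption, not something this page states.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record: a token plus its byte range [start, limit) in the input.
struct TokenSpan {
  std::string token;
  std::size_t start;
  std::size_t limit;
};

// Locate each token, in order, inside `text` and record its byte offsets.
// Assumption: the tokens occur in `text` in the given order without overlap.
std::vector<TokenSpan> ComputeOffsets(const std::string &text,
                                      const std::vector<std::string> &tokens) {
  std::vector<TokenSpan> spans;
  std::size_t cursor = 0;  // search resumes after the previous token
  for (const std::string &tok : tokens) {
    const std::size_t pos = text.find(tok, cursor);
    assert(pos != std::string::npos && "token not found in text");
    spans.push_back({tok, pos, pos + tok.size()});
    cursor = pos + tok.size();
  }
  return spans;
}
```

Under the byte-offset assumption, the UTF-8 input "今天天气" split into "今天" and "天气" gives the ranges [0, 6) and [6, 12), because each of these characters occupies three bytes.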
-
JiebaTokenizer(const std::vector<char> &hmm_path, const std::vector<char> &mp_path, const JiebaMode &mode, bool with_offsets)
Constructor.
- Parameters
hmm_path – [in] Dictionary file used by the HMMSegment algorithm. The dictionary can be obtained from the cppjieba repository (https://github.com/yanyiwu/cppjieba).
mp_path – [in] Dictionary file used by the MPSegment algorithm. The dictionary can be obtained from the cppjieba repository (https://github.com/yanyiwu/cppjieba).
mode – [in] Valid values are JiebaMode::kMp, JiebaMode::kHmm and JiebaMode::kMix.
JiebaMode::kMp, tokenizes with the MPSegment algorithm.
JiebaMode::kHmm, tokenizes with the Hidden Markov Model Segment (HMMSegment) algorithm.
JiebaMode::kMix, tokenizes with a mix of the MPSegment and HMMSegment algorithms.
with_offsets – [in] Whether to output offsets of tokens.
-
~JiebaTokenizer() override = default
Destructor.
-
inline Status AddWord(const std::string &word, int64_t freq = 0)
Add a user defined word to the JiebaTokenizer's dictionary.
- Parameters
word – [in] The word to be added to the JiebaTokenizer instance. The added word will not be written into the built-in dictionary on disk.
freq – [in] The frequency of the word to be added. The higher the frequency, the more likely the word is to be segmented as a single token (default=0, use the default frequency).
- Returns
Status error code, returns OK if no error is encountered.
Example
/* Define operations */
auto tokenizer_op = text::JiebaTokenizer("/path/to/hmm/file", "/path/to/mp/file");

Status s = tokenizer_op.AddWord("hello", 2);
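The effect of freq can be pictured with a toy max-probability segmenter. The sketch below is written in the spirit of a dictionary-based segmenter but is not the cppjieba or MindSpore implementation: ToyDict, ToySegment, and the penalty for unknown bytes are all hypothetical, and it works on raw bytes, so it is meant for ASCII input only.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string>
#include <unordered_map>
#include <vector>

// Toy dictionary mapping word -> frequency. Hypothetical; not cppjieba's structure.
using ToyDict = std::unordered_map<std::string, int64_t>;

// Split `text` so that the sum of log(freq / total) over the chosen words is maximal.
std::vector<std::string> ToySegment(const std::string &text, const ToyDict &dict) {
  double total = 0.0;
  for (const auto &kv : dict) {
    if (kv.second > 0) total += static_cast<double>(kv.second);
  }
  if (total <= 0.0) total = 1.0;
  // Assumed penalty for a single byte that no dictionary word covers.
  const double unknown_score = std::log(0.5 / total);

  const std::size_t n = text.size();
  const double neg_inf = -std::numeric_limits<double>::infinity();
  std::vector<double> best(n + 1, neg_inf);  // best[i]: best score for text[0, i)
  std::vector<std::size_t> prev(n + 1, 0);   // prev[i]: start of the word ending at i
  best[0] = 0.0;

  for (std::size_t i = 0; i < n; ++i) {
    // Fallback: consume one byte as an unknown token.
    if (best[i] + unknown_score > best[i + 1]) {
      best[i + 1] = best[i] + unknown_score;
      prev[i + 1] = i;
    }
    // Try every dictionary word that starts at position i.
    for (const auto &kv : dict) {
      const std::string &word = kv.first;
      if (word.empty() || kv.second <= 0) continue;
      if (text.compare(i, word.size(), word) != 0) continue;
      const double score = best[i] + std::log(static_cast<double>(kv.second) / total);
      const std::size_t end = i + word.size();
      if (score > best[end]) {
        best[end] = score;
        prev[end] = i;
      }
    }
  }
  assert(best[n] > neg_inf);  // the fallback guarantees the end is reachable

  // Walk the back-pointers from the end to recover the chosen words.
  std::vector<std::string> words;
  for (std::size_t end = n; end > 0; end = prev[end]) {
    words.push_back(text.substr(prev[end], end - prev[end]));
  }
  std::reverse(words.begin(), words.end());
  return words;
}
```

With the toy dictionary {a: 10, b: 10, c: 10, ab: 10}, the input "abc" segments as "ab", "c". After adding "bc" with frequency 1000, the same input segments as "a", "bc", which mirrors how a higher frequency makes an added word more likely to win.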
-
inline Status AddDict(const std::vector<std::pair<std::string, int64_t>> &user_dict)
Add a user defined dictionary of word-freq pairs to the JiebaTokenizer's dictionary.
- Parameters
user_dict – [in] Vector of word-freq pairs to be added to the JiebaTokenizer's dictionary.
- Returns
Status error code, returns OK if no error is encountered.
Example
/* Define operations */
auto tokenizer_op = text::JiebaTokenizer("/path/to/hmm/file", "/path/to/mp/file");

std::vector<std::pair<std::string, int64_t>> user_dict = {{"a", 1}, {"b", 2}, {"c", 3}};
Status s = tokenizer_op.AddDict(user_dict);
-
inline Status AddDict(const std::string &file_path)
Add a user defined dictionary of word-freq pairs from a file to the JiebaTokenizer's dictionary. Only valid word-freq pairs in the user defined file are added. Rows containing invalid input are ignored, and no error or warning status is returned.
- Parameters
file_path – [in] Path to the dictionary which includes user defined word-freq pairs.
- Returns
Status error code, returns OK if no error is encountered.
Example
/* Define operations */
auto tokenizer_op = text::JiebaTokenizer("/path/to/hmm/file", "/path/to/mp/file");

Status s = tokenizer_op.AddDict("/path/to/dict/file");
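The "invalid rows are silently skipped" behaviour can be sketched as a small standalone parser. This is an illustration only: ParseUserDict and WordFreq are hypothetical names, and the row format (one whitespace-separated word and non-negative integer frequency per line) is an assumption, since this page does not specify the file layout.

```cpp
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using WordFreq = std::pair<std::string, int64_t>;

// Read word-freq rows from `in`, keeping only rows that parse cleanly.
// Assumed row format (not specified by this page): "<word> <freq>".
std::vector<WordFreq> ParseUserDict(std::istream &in) {
  std::vector<WordFreq> pairs;
  std::string line;
  while (std::getline(in, line)) {
    std::istringstream row(line);
    std::string word;
    int64_t freq = 0;
    std::string extra;
    // A row is valid if it holds exactly a word and a non-negative integer.
    if ((row >> word >> freq) && freq >= 0 && !(row >> extra)) {
      pairs.emplace_back(word, freq);
    }
    // Any other row is dropped without reporting an error or warning.
  }
  return pairs;
}
```

The returned vector has the same element type as the user_dict parameter of the vector overload of AddDict, so a caller that wants to inspect or log the skipped rows could parse the file this way and pass the result to that overload instead.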