mindformers.dataset.MultiTurnDataset
- class mindformers.dataset.MultiTurnDataset(dataset_config: dict = None)
Multi-turn dataset.
The generated dataset has two columns: [input_ids, labels]. The tensor of column input_ids is of the int32 type. The tensor of column labels is of the int32 type.
- Parameters
dataset_config (dict) –
Required. Config for the dataset. Must be a dict that contains at least all of the following keys.
data_loader: Config for the data loader or a data loader object. When data_loader is a dict, the keys "type", "dataset_dir" and "shuffle" can be parsed.
type: Required. Indicates the type of the dataset. The value must be a string or a class type.
dataset_dir: Required. The path of the dataset.
shuffle: Required. Whether to shuffle the dataset. Must be bool.
tokenizer: Tokenizer configuration or a tokenizer object (see the sketch after this parameter list for the object form).
max_seq_length: Maximum length of the sequence.
batch_size: Size of each batch.
drop_remainder: Whether to discard the last batch when the number of data items contained in the last batch is smaller than batch_size. Default: True.
num_parallel_workers: Specifies the number of concurrent processes or threads for map operations to accelerate processing.
python_multiprocessing: Whether to enable Python multiprocessing to accelerate map operations.
repeat: Number of times this dataset is repeated.
seed: Random seed number.
prefetch_size: Buffer queue size of each data processing operation in the pipeline.
numa_enable: Indicates whether to use the NUMA binding function.
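As noted for data_loader and tokenizer above, either key may be given as a config dict or as an already-built object. The following is a minimal sketch of the object form, assuming ChatGLM3Tokenizer is importable from the top-level mindformers package and that MindFormerConfig accepts pre-built objects as values; the paths are placeholders.
>>> from mindformers import ChatGLM3Tokenizer, MultiTurnDataset
>>> from mindformers.tools.register import MindFormerConfig
>>> # Build the tokenizer up front instead of describing it with a dict.
>>> tokenizer = ChatGLM3Tokenizer(vocab_file='/path/to/tokenizer.model')
>>> config = MindFormerConfig(
...     data_loader={'type': 'ToolAlpacaDataLoader',
...                  'dataset_dir': '/path/to/tool_alpaca.jsonl',
...                  'shuffle': False},
...     tokenizer=tokenizer,  # pre-built object instead of a config dict
...     max_seq_length=2048, batch_size=1, drop_remainder=True,
...     num_parallel_workers=8, python_multiprocessing=False,
...     repeat=1, seed=0, prefetch_size=1, numa_enable=False)
>>> dataset = MultiTurnDataset(config)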
- Returns
Instance of MultiTurnDataset.
- Raises
ValueError – If the Python version is earlier than 3.9.
ValueError – If dataset_dir is missing in dataset_config.data_loader, or dataset_config.data_loader.dataset_dir does not exist.
ValueError – If the lengths of the tokens and the loss masks do not match.
ValueError – If the lengths of the input ids and the labels do not match.
Examples
>>> from mindformers import MultiTurnDataset
>>> from mindformers.tools.register import MindFormerConfig
>>> from mindformers.dataset import check_dataset_config
>>> # Note:
>>> # `"/path/to/tool_alpaca.jsonl"` should be replaced with the real path of the formatted dataset file.
>>> # `"/path/to/tokenizer.model"` should be replaced with the real path of the tokenizer file.
>>> # The detailed data setting could refer to
>>> # https://gitee.com/mindspore/mindformers/blob/dev/docs/model_cards/glm3.md
>>> config_dict = {
...     'data_loader': {
...         'type': 'ToolAlpacaDataLoader',
...         'dataset_dir': "/path/to/tool_alpaca.jsonl",
...         'shuffle': True
...     },
...     'tokenizer': {
...         'type': 'ChatGLM3Tokenizer',
...         'vocab_file': '/path/to/tokenizer.model'
...     },
...     'max_seq_length': 2048,
...     'batch_size': 1,
...     'drop_remainder': True,
...     'num_parallel_workers': 8,
...     'python_multiprocessing': False,
...     'repeat': 1,
...     'seed': 0,
...     'prefetch_size': 1,
...     'numa_enable': False,
... }
>>> # Initialize a MindFormerConfig instance with a dict.
>>> config = MindFormerConfig(**config_dict)
>>> check_dataset_config(config)
>>> # use class to build dataset
>>> dataset_from_class = MultiTurnDataset(config)
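Once built, the dataset can be consumed like a standard MindSpore dataset. A minimal sketch of inspecting the two columns, assuming the returned instance exposes the usual mindspore.dataset iteration API:
>>> # Iterate one batch and check the column shapes and dtypes.
>>> for batch in dataset_from_class.create_dict_iterator(num_epochs=1, output_numpy=True):
...     print(batch['input_ids'].shape, batch['input_ids'].dtype)  # expect int32
...     print(batch['labels'].shape, batch['labels'].dtype)        # expect int32
...     break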