mindformers.dataset.MultiTurnDataset

class mindformers.dataset.MultiTurnDataset(dataset_config: dict = None)[source]

Multi-turn dataset.

The generated dataset has two columns: [input_ids, labels]. The tensors in both columns are of the int32 type.
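
Once the dataset is built (see Examples below), the two columns can be checked with the standard MindSpore dataset APIs. A minimal sketch, assuming dataset is an instance built as in the Examples section:

>>> # Hypothetical check of the generated columns; `dataset` is assumed
>>> # to be built as shown in the Examples section.
>>> print(dataset.get_col_names())  # expected: ['input_ids', 'labels']
>>> item = next(dataset.create_dict_iterator())
>>> print(item['input_ids'].dtype, item['labels'].dtype)  # expected: Int32 Int32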

Parameters

dataset_config (dict) –

Required. Config for dataset. Must be dict which contains all keys below at least.

  • data_loader: Config for data loader or a data loader object. When data_loader is a dict, the keys "type", "dataset_dir" and "shuffle" can be parsed (a sketch of passing a loader object directly is given after this list).

    • type: Required. Indicates the type of dataset. The value must be string or class type.

    • dataset_dir: Required. The path of dataset.

    • shuffle: Required. Whether to perform shuffle on the dataset. Must be bool.

  • tokenizer: Tokenizer configuration or object.

  • max_seq_length: Maximum length of the sequence.

  • batch_size: Size of each batch.

  • drop_remainder: Whether to discard the last batch when the number of data items contained in the last batch is smaller than batch_size. Default: True.

  • num_parallel_workers: Specifies the number of concurrent processes or threads for map operations to accelerate processing.

  • python_multiprocessing: Whether to enable Python multiprocessing to accelerate map operations.

  • repeat: Number of times this dataset is repeated.

  • seed: Random seed number.

  • prefetch_size: Buffer queue size of each data processing operation in the pipeline.

  • numa_enable: Indicates whether to use the NUMA binding function.
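
Since data_loader also accepts a pre-built data loader object, the config can carry the object directly instead of a nested dict. A minimal sketch; the ToolAlpacaDataLoader import path and constructor signature here are assumptions based on the Examples section:

>>> # Hypothetical: construct the loader first, then place the object
>>> # under the 'data_loader' key instead of a nested dict.
>>> from mindformers.dataset.dataloader import ToolAlpacaDataLoader
>>> loader = ToolAlpacaDataLoader(dataset_dir="/path/to/tool_alpaca.jsonl", shuffle=True)
>>> # The remaining required keys (tokenizer, max_seq_length, ...) are the
>>> # same as in the Examples section and are omitted here.
>>> dataset_config = {'data_loader': loader}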

Returns

Instance of MultiTurnDataset.

Raises
  • ValueError – If the Python version is earlier than 3.9.

  • ValueError – If dataset_dir is missing in dataset_config.data_loader, or dataset_config.data_loader.dataset_dir does not exist.

  • ValueError – If the lengths of the tokens and the loss masks do not match.

  • ValueError – If the lengths of the input ids and the labels do not match.

Examples

>>> from mindformers import MultiTurnDataset
>>> from mindformers.tools.register import MindFormerConfig
>>> from mindformers.dataset import check_dataset_config
>>> # Note:
>>> #     `"/path/to/tool_alpaca.jsonl"` should be replaced with the real path of the formatted dataset file.
>>> #     `"/path/to/tokenizer.model"` should be replaced with the real path of the tokenizer file.
>>> #     The detailed data setting could refer to
>>> #     https://gitee.com/mindspore/mindformers/blob/dev/docs/model_cards/glm3.md
>>> config_dict = {
...     'data_loader': {
...         'type': 'ToolAlpacaDataLoader',
...         'dataset_dir': "/path/to/tool_alpaca.jsonl",
...         'shuffle': True
...     },
...     'tokenizer': {
...         'type': 'ChatGLM3Tokenizer',
...         'vocab_file': '/path/to/tokenizer.model'
...     },
...     'max_seq_length': 2048,
...     'batch_size': 1,
...     'drop_remainder': True,
...     'num_parallel_workers': 8,
...     'python_multiprocessing': False,
...     'repeat': 1,
...     'seed': 0,
...     'prefetch_size': 1,
...     'numa_enable': False,
... }
>>> # Initialize a MindFormerConfig instance with a dict.
>>> config = MindFormerConfig(**config_dict)
>>> check_dataset_config(config)
>>> # Use the class to build the dataset.
>>> dataset_from_class = MultiTurnDataset(config)
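
The returned dataset can then be consumed like any MindSpore dataset. A minimal sketch, assuming the paths above point to real files and the instance exposes the standard MindSpore iterator interface:

>>> # Iterate one batch and unpack the two columns described above.
>>> for item in dataset_from_class.create_dict_iterator():
...     input_ids, labels = item['input_ids'], item['labels']
...     break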