mindformers.dataset.MultiTurnDataset

class mindformers.dataset.MultiTurnDataset(dataset_config: dict = None)[source]

Multi-turn dataset.

The generated dataset has two columns: [input_ids, labels]. The tensors in both columns are of the int32 type.
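
Once the dataset is built (see Examples below), the two columns can be checked with the standard MindSpore dataset APIs. A minimal sketch, assuming dataset is an instance built as in the Examples section:

>>> # Hypothetical check of the generated columns; `dataset` is assumed
>>> # to be built as shown in the Examples section.
>>> print(dataset.get_col_names())  # expected: ['input_ids', 'labels']
>>> item = next(dataset.create_dict_iterator())
>>> print(item['input_ids'].dtype, item['labels'].dtype)  # expected: Int32 Int32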

Parameters

dataset_config (dict) –

Required. Config for dataset. Must be dict which contains all keys below at least.

  • data_loader: Config for data loader or a data loader object. When data_loader is a dict, the keys "type", "dataset_dir" and "shuffle" can be parsed (a sketch of passing a loader object directly is given after this list).

    • type: Required. Indicates the type of dataset. The value must be string or class type.

    • dataset_dir: Required. The path of dataset.

    • shuffle: Required. Whether to perform shuffle on the dataset. Must be bool.

  • tokenizer: Tokenizer configuration or object.

  • max_seq_length: Maximum length of the sequence.

  • batch_size: Size of each batch.

  • drop_remainder: Whether to discard the last batch when the number of data items contained in the last batch is smaller than batch_size. Default: True.

  • num_parallel_workers: Specifies the number of concurrent processes or threads for map operations to accelerate processing.

  • python_multiprocessing: Whether to enable Python multiprocessing to accelerate map operations.

  • repeat: Number of times this dataset is repeated.

  • seed: Random seed number.

  • prefetch_size: Buffer queue size of each data processing operation in the pipeline.

  • numa_enable: Indicates whether to use the NUMA binding function.
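
Since data_loader also accepts a pre-built data loader object, the config can carry the object directly instead of a nested dict. A minimal sketch; the ToolAlpacaDataLoader import path and constructor signature here are assumptions based on the Examples section:

>>> # Hypothetical: construct the loader first, then place the object
>>> # under the 'data_loader' key instead of a nested dict.
>>> from mindformers.dataset.dataloader import ToolAlpacaDataLoader
>>> loader = ToolAlpacaDataLoader(dataset_dir="/path/to/tool_alpaca.jsonl", shuffle=True)
>>> # The remaining required keys (tokenizer, max_seq_length, ...) are the
>>> # same as in the Examples section and are omitted here.
>>> dataset_config = {'data_loader': loader}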

Returns

Instance of MultiTurnDataset.

Raises
  • ValueError – If the Python version is earlier than 3.9.

  • ValueError – If dataset_dir is missing in dataset_config.data_loader, or dataset_config.data_loader.dataset_dir does not exist.

  • ValueError – If the lengths of the tokens and the loss masks do not match.

  • ValueError – If the lengths of the input ids and the labels do not match.

Examples

>>> from mindformers import MultiTurnDataset
>>> from mindformers.tools.register import MindFormerConfig
>>> from mindformers.dataset import check_dataset_config
>>> # Note:
>>> #     `"/path/to/tool_alpaca.jsonl"` should be replaced with the real path of the formatted dataset file.
>>> #     `"/path/to/tokenizer.model"` should be replaced with the real path of the tokenizer file.
>>> #     The detailed data setting could refer to
>>> #     https://gitee.com/mindspore/mindformers/blob/dev/docs/model_cards/glm3.md
>>> config_dict = {
...     'data_loader': {
...         'type': 'ToolAlpacaDataLoader',
...         'dataset_dir': "/path/to/tool_alpaca.jsonl",
...         'shuffle': True
...     },
...     'tokenizer': {
...         'type': 'ChatGLM3Tokenizer',
...         'vocab_file': '/path/to/tokenizer.model'
...     },
...     'max_seq_length': 2048,
...     'batch_size': 1,
...     'drop_remainder': True,
...     'num_parallel_workers': 8,
...     'python_multiprocessing': False,
...     'repeat': 1,
...     'seed': 0,
...     'prefetch_size': 1,
...     'numa_enable': False,
... }
>>> # Initialize a MindFormerConfig instance with a dict.
>>> config = MindFormerConfig(**config_dict)
>>> check_dataset_config(config)
>>> # Use the class to build the dataset.
>>> dataset_from_class = MultiTurnDataset(config)
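
The returned dataset can then be consumed like any MindSpore dataset. A minimal sketch, assuming the paths above point to real files and the instance exposes the standard MindSpore iterator interface:

>>> # Iterate one batch and unpack the two columns described above.
>>> for item in dataset_from_class.create_dict_iterator():
...     input_ids, labels = item['input_ids'], item['labels']
...     break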