mindformers.dataset.CausalLanguageModelDataset

class mindformers.dataset.CausalLanguageModelDataset(dataset_config: Optional[dict] = None, data_loader: Union[dict, Callable] = None, input_columns: List[str] = None, output_columns: List[str] = None, batch_size: int = 8, drop_remainder: bool = True, num_parallel_workers: int = 8, python_multiprocessing: bool = False, repeat: int = 1, seed: int = 0, prefetch_size: int = 1, numa_enable: bool = False, eod_reset: bool = False, eod_token_id: Optional[int] = None, auto_tune: bool = False, filepath_prefix: str = './autotune', autotune_per_step: int = 10, profile: bool = False, **kwargs)

Causal Language Model pretrain dataset.

The columns of the generated dataset depend on the config provided by the user. The tensor in each column is cast to int32.

Parameters
  • dataset_config (dict, optional) – Config for dataset. When dataset_config is an empty dict or None, the arguments below are used to build a non-empty dataset_config; otherwise, they are ignored. Default: None.

  • data_loader (Union[dict, Callable]) –

    Config for the data loader, or a data loader object. When data_loader is a dict, the keys "type", "dataset_dir", "dataset_files" and "shuffle" are parsed.

    • type: Required. Indicates the type of dataset. The value must be a string or a class. When the value is "MindDataset" or "TFRecordDataset", at least one of dataset_dir and dataset_files is required, and dataset_dir takes precedence when both are given; for any other type, dataset_dir is required.

    • dataset_dir: The path or directory of the dataset. When type is "MindDataset" or "TFRecordDataset" and dataset_dir is a directory, files in mindrecord or tfrecord format are searched for recursively in that directory.

    • dataset_files: The paths of files in mindrecord or tfrecord format. Takes effect only when type is "MindDataset" or "TFRecordDataset"; otherwise this key is ignored. Must be a list or tuple.

    • shuffle: Optional. Whether to perform shuffle on the dataset. Must be bool.

  • input_columns (list[str]) – Column names before the map function.

  • output_columns (list[str]) – Column names after the map function. Required when eod_reset is True; otherwise ignored. Default: None.

  • batch_size (int) – Size of each batch. Default: 8.

  • drop_remainder (bool) – Whether to discard the last batch when the number of data items contained in the last batch is smaller than batch_size. Default: True.

  • num_parallel_workers (int) – Specifies the number of concurrent processes or threads for map operations to accelerate processing. Default: 8.

  • python_multiprocessing (bool) – Whether to enable Python multiprocessing mode to accelerate map operations. Default: False.

  • repeat (int) – Number of times this dataset is repeated. Default: 1.

  • seed (int) – Random seed number. Default: 0.

  • prefetch_size (int) – Buffer queue size of each data processing operation in the pipeline. Default: 1.

  • numa_enable (bool) – Indicates whether to use the NUMA binding function. Default: False.

  • eod_reset (bool) – Specifies whether to reset the EOD. Default: False.

  • eod_token_id (int, optional) – The token id of the EOD. Default: None, meaning the EOD token id is not set manually.

  • auto_tune (bool) – Indicates whether to enable automatic optimization of data processing parameters. Default: False.

  • autotune_per_step (int) – Interval, in steps, at which the configuration for automatic data acceleration is adjusted. Default: 10.

  • filepath_prefix (str) – Path for saving optimized parameter configurations. Default: './autotune'.

  • profile (bool) – Whether to enable performance data collection. Default: False.
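
The rules for the dict form of data_loader listed above can be sketched as a small standalone check. This is only an illustration: the helper `check_data_loader_config` and the constant `FILE_BASED_TYPES` are hypothetical names, not part of MindFormers, and the library's own validation may differ in detail.

```python
# Hypothetical helper, NOT a MindFormers API: it only mirrors the documented rules.
FILE_BASED_TYPES = ("MindDataset", "TFRecordDataset")


def check_data_loader_config(config: dict) -> str:
    """Return the key ("dataset_dir" or "dataset_files") that locates the data."""
    if "type" not in config:
        raise ValueError("'type' is required in data_loader")
    if "shuffle" in config and not isinstance(config["shuffle"], bool):
        raise ValueError("'shuffle' must be bool")

    has_dir = bool(config.get("dataset_dir"))
    has_files = bool(config.get("dataset_files"))

    if config["type"] in FILE_BASED_TYPES:
        # dataset_dir takes precedence when both keys are given.
        if has_dir:
            return "dataset_dir"
        if has_files:
            if not isinstance(config["dataset_files"], (list, tuple)):
                raise ValueError("'dataset_files' must be a list or tuple")
            return "dataset_files"
        raise ValueError("one of 'dataset_dir' and 'dataset_files' is required")

    # Any other type: dataset_files is ignored and dataset_dir is required.
    if not has_dir:
        raise ValueError("'dataset_dir' is required for this type")
    return "dataset_dir"


# Placeholder path; replace it with the real location of the dataset.
data_loader_config = {
    "type": "MindDataset",
    "dataset_dir": "/path/to/dataset",
    "shuffle": True,
}
selected_key = check_data_loader_config(data_loader_config)
```

A dict shaped like `data_loader_config` is what the data_loader argument accepts when no data loader object is passed.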

Returns

Instance of CausalLanguageModelDataset.

Raises
  • ValueError – If dataset_config.batch_size is not a multiple of device number when dataset_config.eod_reset is True and dataset isn't imported in full.

  • ValueError – If dataset_config contains neither "dataset_dir" nor "dataset_files" as a key.
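
The first condition above can be expressed as a short check. This is a minimal sketch under assumptions: the function `check_eod_batch_size` and its parameters `device_num` and `full_batch` are hypothetical names chosen for illustration, and how MindFormers actually obtains the device number or decides whether the dataset is imported in full is not shown here.

```python
# Hypothetical helper, NOT a MindFormers API: it restates the documented condition.
def check_eod_batch_size(batch_size: int, device_num: int,
                         eod_reset: bool, full_batch: bool) -> None:
    """Raise ValueError when the documented batch-size condition is violated.

    full_batch (assumed name) stands for "the dataset is imported in full".
    """
    if eod_reset and not full_batch and batch_size % device_num != 0:
        raise ValueError(
            f"batch_size ({batch_size}) must be a multiple of the device "
            f"number ({device_num}) when eod_reset is True and the dataset "
            "is not imported in full"
        )


# batch_size=8 on 4 devices satisfies the condition, so nothing is raised.
check_eod_batch_size(batch_size=8, device_num=4, eod_reset=True, full_batch=False)
```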

Examples

>>> from mindspore.dataset import MindDataset
>>> from mindformers.dataset import CausalLanguageModelDataset
>>> # Note:
>>> #     `"/path/to/dataset"` should be replaced with the real path of the dataset file.
>>> #     The detailed data setting could refer to
>>> #     https://gitee.com/mindspore/mindformers/blob/dev/docs/model_cards/llama2.md
>>> data_loader = MindDataset(dataset_files="/path/to/dataset", shuffle=True)
>>> dataset_from_param = CausalLanguageModelDataset(data_loader=data_loader,
...                                                 input_columns=["input_ids", "attention_mask"])