mindformers.dataset.CausalLanguageModelDataset
- class mindformers.dataset.CausalLanguageModelDataset(dataset_config: Optional[dict] = None, data_loader: Union[dict, Callable] = None, input_columns: List[str] = None, output_columns: List[str] = None, batch_size: int = 8, drop_remainder: bool = True, num_parallel_workers: int = 8, python_multiprocessing: bool = False, repeat: int = 1, seed: int = 0, prefetch_size: int = 1, numa_enable: bool = False, eod_reset: bool = False, eod_token_id: Optional[int] = None, auto_tune: bool = False, filepath_prefix: str = './autotune', autotune_per_step: int = 10, profile: bool = False, **kwargs)
Dataset for causal language model pretraining.
The columns of the generated dataset depend on the config provided by the user. The tensors in each column are cast to int32.
- Parameters
dataset_config (dict, optional) – Config for the dataset. When dataset_config is an empty dict or None, all arguments below are used to build a non-empty dataset_config; otherwise, they are ignored. Default: None.
data_loader (Union[dict, Callable], optional) – Config for the data loader, or a data loader object. When data_loader is a dict, the keys "type", "dataset_dir", "dataset_files" and "shuffle" are parsed. Default: None.
    type: Required. The type of dataset. The value must be a string or a class type. When the value is "MindDataset" or "TFRecordDataset", one of dataset_dir and dataset_files is required, and dataset_dir takes precedence; for any other type, dataset_dir is required.
    dataset_dir: The path or directory of the dataset. When type is "MindDataset" or "TFRecordDataset" and dataset_dir is a directory, files in mindrecord or tfrecord format are searched for recursively in that directory.
    dataset_files: The paths of files in mindrecord or tfrecord format. Takes effect only when type is "MindDataset" or "TFRecordDataset"; otherwise this key is ignored. Must be a list or tuple.
    shuffle: Optional. Whether to shuffle the dataset. Must be a bool.
input_columns (list[str], optional) – Column names before the map function. Default: None.
output_columns (list[str], optional) – Column names after the map function. Required when eod_reset is True; otherwise ignored. Default: None.
batch_size (int, optional) – Size of each batch. Default: 8.
drop_remainder (bool, optional) – Whether to discard the last batch when it contains fewer data items than batch_size. Default: True.
num_parallel_workers (int, optional) – Number of concurrent processes or threads used by map operations to accelerate processing. Default: 8.
python_multiprocessing (bool, optional) – Whether to enable Python multiprocessing mode to accelerate map operations. Default: False.
repeat (int, optional) – Number of times this dataset is repeated. Default: 1.
seed (int, optional) – Random seed. Default: 0.
prefetch_size (int, optional) – Buffer queue size of each data processing operation in the pipeline. Default: 1.
numa_enable (bool, optional) – Whether to use the NUMA binding function. Default: False.
eod_reset (bool, optional) – Whether to reset the EOD. Default: False.
eod_token_id (int, optional) – Token id of the EOD. Default: None, meaning the EOD token id is not set manually.
auto_tune (bool, optional) – Whether to enable automatic optimization of data processing parameters. Default: False.
autotune_per_step (int, optional) – Interval, in steps, at which the automatic data acceleration adjusts its configuration. Default: 10.
filepath_prefix (str, optional) – Path for saving the optimized parameter configurations. Default: './autotune'.
profile (bool, optional) – Whether to enable data collection. Default: False.
- Returns
Instance of CausalLanguageModelDataset.
- Raises
ValueError – If dataset_config.batch_size is not a multiple of the device number when dataset_config.eod_reset is True and the dataset is not imported in full.
ValueError – If dataset_config contains neither "dataset_dir" nor "dataset_files" as a key.
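The first condition amounts to a divisibility check. The sketch below is illustrative only: `check_eod_batch_size` and its parameters `device_num` and `full_batch` are hypothetical names, not part of the MindFormers API, and the library's own check may differ in detail.

```python
def check_eod_batch_size(batch_size: int, device_num: int,
                         eod_reset: bool, full_batch: bool) -> None:
    """Hypothetical helper mirroring the documented ValueError condition."""
    # The constraint applies only when EOD reset is enabled and the
    # dataset is not imported in full (assumed here to be a boolean flag).
    if eod_reset and not full_batch and batch_size % device_num != 0:
        raise ValueError(
            f"batch_size ({batch_size}) must be a multiple of the device "
            f"number ({device_num}) when eod_reset is True."
        )


# 8 is a multiple of 4, so this call returns without raising.
check_eod_batch_size(batch_size=8, device_num=4, eod_reset=True, full_batch=False)
```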
Examples
>>> from mindspore.dataset import MindDataset
>>> from mindformers.dataset import CausalLanguageModelDataset
>>> # Note:
>>> #     `"/path/to/dataset"` should be replaced with the real path of the dataset file.
>>> #     For detailed data settings, refer to
>>> #     https://gitee.com/mindspore/mindformers/blob/dev/docs/model_cards/llama2.md
>>> data_loader = MindDataset(dataset_files="/path/to/dataset", shuffle=True)
>>> dataset_from_param = CausalLanguageModelDataset(data_loader=data_loader,
...                                                 input_columns=["input_ids", "attention_mask"])
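The example above passes a data loader object. When data_loader is given as a dict instead, the key rules listed under Parameters apply. The following is a minimal sketch of those rules in plain Python; `check_data_loader_config` and `FILE_BASED_TYPES` are hypothetical names introduced for illustration, the path is a placeholder, and the exceptions MindFormers actually raises may differ.

```python
from typing import Any, Dict

# Dataset types for which dataset_files is also accepted (per the parameter description).
FILE_BASED_TYPES = ("MindDataset", "TFRecordDataset")


def check_data_loader_config(cfg: Dict[str, Any]) -> None:
    """Hypothetical validator mirroring the documented data_loader dict rules.

    Only string type names are distinguished here; a class passed as
    "type" is treated like any other non-file-based type.
    """
    if "type" not in cfg:
        raise ValueError('"type" is required.')
    if cfg["type"] in FILE_BASED_TYPES:
        # One of dataset_dir and dataset_files must be present.
        if "dataset_dir" not in cfg and "dataset_files" not in cfg:
            raise ValueError('One of "dataset_dir" and "dataset_files" is required.')
        files = cfg.get("dataset_files")
        if files is not None and not isinstance(files, (list, tuple)):
            raise ValueError('"dataset_files" must be a list or tuple.')
    elif "dataset_dir" not in cfg:
        # For any other type, dataset_dir is required.
        raise ValueError('"dataset_dir" is required.')
    if "shuffle" in cfg and not isinstance(cfg["shuffle"], bool):
        raise ValueError('"shuffle" must be a bool.')


data_loader = {
    "type": "MindDataset",
    "dataset_dir": "/path/to/dataset",  # placeholder, replace with a real path
    "shuffle": True,
}
check_data_loader_config(data_loader)  # returns without raising
```

A dict of this shape would then be passed as the data_loader argument in place of the MindDataset object.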