mindformers.dataset.CausalLanguageModelDataset

class mindformers.dataset.CausalLanguageModelDataset(dataset_config: Optional[dict] = None, data_loader: Union[dict, Callable] = None, input_columns: List[str] = None, output_columns: List[str] = None, batch_size: int = 8, drop_remainder: bool = True, num_parallel_workers: int = 8, python_multiprocessing: bool = False, repeat: int = 1, seed: int = 0, prefetch_size: int = 1, numa_enable: bool = False, eod_reset: bool = False, eod_token_id: Optional[int] = None, auto_tune: bool = False, filepath_prefix: str = './autotune', autotune_per_step: int = 10, profile: bool = False, token_monitor: bool = False, token_monitor_config: Optional[dict] = None, **kwargs)

Causal Language Model pretrain dataset.

The columns of the generated dataset depend on the config provided by the user. The tensor in each column is cast to int32.

Parameters

dataset_config (dict, optional) – Config for the dataset. When dataset_config is None or an empty dict, the arguments below are used to build the dataset config; otherwise they are ignored. Default: None.
data_loader (Union[dict, Callable], optional) –
Config for the data loader, or a data loader object. When data_loader is a dict, the keys that are parsed are "type", "dataset_dir", "dataset_files" and "shuffle". Default: None.
- type: Required. The type of dataset. The value must be a string or a class. When the value is "MindDataset" or "TFRecordDataset", at least one of dataset_dir and dataset_files is required, and dataset_dir takes precedence if both are given. For any other type, dataset_dir is required.
- dataset_dir: The path or directory of the dataset. When type is "MindDataset" or "TFRecordDataset" and dataset_dir is a directory, the directory is searched recursively for files in mindrecord or tfrecord format.
- dataset_files: The paths of files in mindrecord or tfrecord format. Takes effect only when type is "MindDataset" or "TFRecordDataset"; otherwise this key is ignored. Must be a list or tuple.
- shuffle: Optional. Whether to shuffle the dataset. Must be bool.
input_columns (list[str], optional) – Column names before the map function. Default: None.
output_columns (list[str], optional) – Column names after the map function. Required when eod_reset is True; otherwise ignored. Default: None.
batch_size (int, optional) – Size of each batch. Default: 8.
drop_remainder (bool, optional) – Whether to discard the last batch when it contains fewer data items than batch_size. Default: True.
num_parallel_workers (int, optional) – Specifies the number of concurrent processes or threads for map operations to accelerate processing. Default: 8.
python_multiprocessing (bool, optional) – Whether to enable Python multiprocessing mode to accelerate map operations. Default: False.
repeat (int, optional) – Number of times this dataset is repeated. Default: 1.
seed (int, optional) – Random seed number. Default: 0.
prefetch_size (int, optional) – Buffer queue size of each data processing operation in the pipeline. Default: 1.
numa_enable (bool, optional) – Indicates whether to use the NUMA binding function. Default: False.
eod_reset (bool, optional) – Specifies whether to reset the EOD. Default: False.
eod_token_id (int, optional) – The token id of the EOD. Default: None, meaning the EOD token id is not set manually.
auto_tune (bool, optional) – Indicates whether to enable automatic optimization of data processing parameters. Default: False.
autotune_per_step (int, optional) – Interval, in steps, at which automatic data acceleration adjusts the configuration. Default: 10.
filepath_prefix (str, optional) – Path for saving optimized parameter configurations. Default: './autotune'.
profile (bool, optional) – Whether to enable data collection. Default: False.
token_monitor (bool, optional) – Whether to enable token monitor function. Default: False.
token_monitor_config (dict, optional) – Config for the token monitor function. When set to None, the default values are used. Default: None.
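
The key rules described above for a dict-valued data_loader can be sketched in plain Python. This is only an illustration of the documented rules, not the library's own validation code: the helper name check_data_loader_config, the constant FILE_BASED_TYPES and the error messages are hypothetical, and the path is a placeholder.

```python
from typing import Any, Dict

# Hypothetical helper, NOT part of mindformers: it only mirrors the
# documented rules for the keys of a dict-valued `data_loader`.
FILE_BASED_TYPES = ("MindDataset", "TFRecordDataset")


def check_data_loader_config(cfg: Dict[str, Any]) -> str:
    """Return the key that would supply the data location, or raise ValueError."""
    if "type" not in cfg:
        raise ValueError("'type' is required")
    if "shuffle" in cfg and not isinstance(cfg["shuffle"], bool):
        raise ValueError("'shuffle' must be bool")

    has_dir = bool(cfg.get("dataset_dir"))
    has_files = bool(cfg.get("dataset_files"))

    if cfg["type"] in FILE_BASED_TYPES:
        # One of dataset_dir / dataset_files is required; dataset_dir wins.
        if has_dir:
            return "dataset_dir"
        if has_files:
            if not isinstance(cfg["dataset_files"], (list, tuple)):
                raise ValueError("'dataset_files' must be a list or tuple")
            return "dataset_files"
        raise ValueError("need 'dataset_dir' or 'dataset_files'")

    # Any other type: dataset_dir is required, dataset_files is ignored.
    if not has_dir:
        raise ValueError("'dataset_dir' is required for this type")
    return "dataset_dir"


# Example dict in the documented shape; the path is a placeholder.
data_loader = {
    "type": "MindDataset",
    "dataset_dir": "/path/to/dataset",
    "shuffle": True,
}
source_key = check_data_loader_config(data_loader)
```

A dict of this shape is what the data_loader parameter accepts in place of a data loader object; when both dataset_dir and dataset_files are present, only dataset_dir is used.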

Returns

Instance of CausalLanguageModelDataset.

Raises

ValueError – If dataset_config.batch_size is not a multiple of device number when dataset_config.eod_reset is True and dataset isn't imported in full.
ValueError – If dataset_config doesn't contain "dataset_dir" or "dataset_files" as its key.
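
The first condition above can be pictured with a small divisibility check. This is a sketch under stated assumptions rather than the library's implementation: the function name batch_size_is_valid and the full_batch flag (standing in for "the dataset is imported in full") are hypothetical.

```python
def batch_size_is_valid(batch_size: int, device_num: int,
                        eod_reset: bool, full_batch: bool) -> bool:
    """Hypothetical check mirroring the first documented ValueError condition.

    `full_batch` is an assumed stand-in for "the dataset is imported in full".
    """
    if eod_reset and not full_batch:
        # batch_size must be a multiple of the device number in this case.
        return batch_size % device_num == 0
    # The documented constraint does not apply otherwise.
    return True


# With 4 devices, batch_size=8 passes and batch_size=6 would be rejected.
ok = batch_size_is_valid(8, 4, eod_reset=True, full_batch=False)
bad = batch_size_is_valid(6, 4, eod_reset=True, full_batch=False)
```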

Examples

>>> from mindspore.dataset import MindDataset
>>> from mindformers.dataset import CausalLanguageModelDataset
>>> # Note:
>>> #     `"/path/to/dataset"` should be replaced with the real path of the dataset file.
>>> #     For detailed data settings, refer to
>>> #     https://gitee.com/mindspore/mindformers/blob/dev/docs/model_cards/llama2.md
>>> data_loader = MindDataset(dataset_files="/path/to/dataset", shuffle=True)
>>> dataset_from_param = CausalLanguageModelDataset(data_loader=data_loader,
...                                                 input_columns=["input_ids", "attention_mask"])