mindformers.dataset.CausalLanguageModelDataset
- class mindformers.dataset.CausalLanguageModelDataset(dataset_config: Optional[dict] = None, data_loader: Union[dict, Callable] = None, input_columns: List[str] = None, output_columns: List[str] = None, batch_size: int = 8, drop_remainder: bool = True, num_parallel_workers: int = 8, python_multiprocessing: bool = False, repeat: int = 1, seed: int = 0, prefetch_size: int = 1, numa_enable: bool = False, eod_reset: bool = False, eod_token_id: Optional[int] = None, auto_tune: bool = False, filepath_prefix: str = './autotune', autotune_per_step: int = 10, profile: bool = False, **kwargs)
Causal language model pretraining dataset.
The columns of the generated dataset depend on the config provided by the user. The tensor in each column is cast to int32.
- Parameters
dataset_config (dict, optional) – Config for dataset. When dataset_config is an empty dict or None, the arguments below are used to build a non-empty dataset_config; otherwise, they are ignored. Default: None.
data_loader (Union[dict, Callable]) –
Config for data loader or a data loader object. When data_loader is a dict, the keys "type", "dataset_dir", "dataset_files" and "shuffle" are parsed.
type: Required. Indicates the type of dataset. The value must be a string or a class. When the value is "MindDataset" or "TFRecordDataset", one of dataset_dir and dataset_files is required, and dataset_dir takes precedence; otherwise dataset_dir is required.
dataset_dir: The path or directory of the dataset. When type is "MindDataset" or "TFRecordDataset" and dataset_dir is a directory, files in mindrecord or tfrecord format are searched for recursively in the directory.
dataset_files: The paths of files in mindrecord or tfrecord format. Takes effect when type is "MindDataset" or "TFRecordDataset"; otherwise this key is ignored. Must be a list or tuple.
shuffle: Optional. Whether to shuffle the dataset. Must be a bool.
input_columns (list[str]) – Column names before the map function.
output_columns (list[str]) – Column names after the map function. Required when eod_reset is True; otherwise ignored. Default: None.
batch_size (int) – Size of each batch. Default: 8.
drop_remainder (bool) – Whether to discard the last batch when the number of data items contained in the last batch is smaller than batch_size. Default: True.
num_parallel_workers (int) – Specifies the number of concurrent processes or threads for map operations to accelerate processing. Default: 8.
python_multiprocessing (bool) – Whether to enable Python multiprocessing mode to accelerate map operations. Default: False.
repeat (int) – Number of times this dataset is repeated. Default: 1.
seed (int) – Random seed number. Default: 0.
prefetch_size (int) – Buffer queue size of each data processing operation in the pipeline. Default: 1.
numa_enable (bool) – Indicates whether to use the NUMA binding function. Default: False.
eod_reset (bool) – Specifies whether to reset the EOD. Default: False.
eod_token_id (int, optional) – Indicates the token id of the EOD. Default: None, in which case the EOD token id is not set manually.
auto_tune (bool) – Indicates whether to enable automatic optimization of data processing parameters. Default: False.
autotune_per_step (int) – Specifies the interval, in steps, at which the automatic data acceleration configuration is adjusted. Default: 10.
filepath_prefix (str) – Path for saving optimized parameter configurations. Default: './autotune'.
profile (bool) – Whether to enable profiling data collection. Default: False.
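The precedence rule described for dataset_config can be illustrated with a small sketch. The helper `resolve_dataset_config` below is hypothetical and is not part of mindformers; it only mirrors the documented behavior, in which a non-empty dataset_config wins and the individual arguments are used otherwise.

```python
from typing import Optional


def resolve_dataset_config(dataset_config: Optional[dict] = None, **arguments) -> dict:
    """Hypothetical helper (not a mindformers API) mirroring the documented rule."""
    if dataset_config:
        # Non-empty dict: the individual arguments are ignored.
        return dict(dataset_config)
    # None or {}: the config is built from the individual arguments.
    return dict(arguments)


# Built from the arguments because no dataset_config was given.
built = resolve_dataset_config(None, batch_size=4, drop_remainder=True)

# The explicit config is kept and batch_size=4 is ignored.
kept = resolve_dataset_config({"batch_size": 16}, batch_size=4)

print(built)
print(kept)
```

In practice this means that mixing a non-empty dataset_config with keyword arguments has no effect on the keyword side, so it is usually clearer to pick one style per call.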
- Returns
Instance of CausalLanguageModelDataset.
- Raises
ValueError – If dataset_config.batch_size is not a multiple of device number when dataset_config.eod_reset is True and dataset isn't imported in full.
ValueError – If dataset_config doesn't contain "dataset_dir" or "dataset_files" as its key.
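The two error conditions above can be sketched in plain Python. Both functions are hypothetical and do not reproduce the library's internal checks; in particular, the `full_batch` flag is an assumption standing in for "the dataset is imported in full", and `device_num` is assumed to be the number of devices taking part in training.

```python
def check_batch_size(batch_size: int, device_num: int,
                     eod_reset: bool, full_batch: bool) -> None:
    """Hypothetical check: batch_size must divide evenly across devices."""
    if eod_reset and not full_batch and batch_size % device_num != 0:
        raise ValueError(
            f"batch_size ({batch_size}) must be a multiple of the device "
            f"number ({device_num}) when eod_reset is True."
        )


def check_dataset_source(data_loader_config: dict) -> None:
    """Hypothetical check: at least one dataset location key must be present."""
    if ("dataset_dir" not in data_loader_config
            and "dataset_files" not in data_loader_config):
        raise ValueError('Expected "dataset_dir" or "dataset_files" in the config.')


# 8 is a multiple of 4, so this call returns without raising.
check_batch_size(8, 4, eod_reset=True, full_batch=False)

# 6 is not a multiple of 4, so this call raises ValueError.
try:
    check_batch_size(6, 4, eod_reset=True, full_batch=False)
except ValueError as err:
    print(err)
```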
Examples
>>> from mindspore.dataset import MindDataset
>>> from mindformers.dataset import CausalLanguageModelDataset
>>> # Note:
>>> #     `"/path/to/dataset"` should be replaced with the real path of the dataset file.
>>> #     The detailed data setting could refer to
>>> #     https://gitee.com/mindspore/mindformers/blob/dev/docs/model_cards/llama2.md
>>> data_loader = MindDataset(dataset_files="/path/to/dataset", shuffle=True)
>>> dataset_from_param = CausalLanguageModelDataset(data_loader=data_loader,
...                                                 input_columns=["input_ids", "attention_mask"])
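The dict form of data_loader described under Parameters can be sketched as follows. The validator `validate_data_loader_config` is a hypothetical helper written only to restate the documented key rules; it is not the parsing code used by mindformers, and `"/path/to/dataset"` is a placeholder.

```python
FILE_BASED_TYPES = ("MindDataset", "TFRecordDataset")


def validate_data_loader_config(config: dict) -> None:
    """Hypothetical restatement of the documented rules for a data_loader dict."""
    if "type" not in config:
        raise ValueError('"type" is required.')
    loader_type = config["type"]
    if not isinstance(loader_type, (str, type)):
        raise TypeError('"type" must be a string or a class.')

    has_dir = bool(config.get("dataset_dir"))
    has_files = bool(config.get("dataset_files"))
    if loader_type in FILE_BASED_TYPES:
        # Either key is accepted; dataset_dir takes precedence when both are set.
        if not (has_dir or has_files):
            raise ValueError('One of "dataset_dir" and "dataset_files" is required.')
        if has_files and not isinstance(config["dataset_files"], (list, tuple)):
            raise TypeError('"dataset_files" must be a list or tuple.')
    elif not has_dir:
        # For other types dataset_files is ignored, so dataset_dir is required.
        raise ValueError('"dataset_dir" is required for this type.')

    if "shuffle" in config and not isinstance(config["shuffle"], bool):
        raise TypeError('"shuffle" must be a bool.')


data_loader = {
    "type": "MindDataset",
    "dataset_dir": "/path/to/dataset",  # placeholder path
    "shuffle": True,
}
validate_data_loader_config(data_loader)

# With mindformers installed and a real path, the dict would be passed as:
# dataset = CausalLanguageModelDataset(data_loader=data_loader,
#                                      input_columns=["input_ids"])
```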