mindformers.dataset.KeyWordGenDataset

class mindformers.dataset.KeyWordGenDataset(dataset_config: Optional[dict] = None, data_loader: Union[dict, Callable] = None, tokenizer: Union[dict, Callable] = None, input_columns: List[str] = None, batch_size: int = 8, drop_remainder: bool = True, num_parallel_workers: int = 8, repeat: int = 1, ignore_pad_token_for_loss: bool = True, max_source_length: int = None, max_target_length: int = None, phase: str = 'train', version: int = 1, seed: int = 0, prefetch_size: int = 1, numa_enable: bool = False, auto_tune: bool = False, filepath_prefix: str = './autotune', autotune_per_step: int = 10, profile: bool = False, **kwargs)

Keyword generation dataset.

The columns of the generated dataset depend on the value of phase.

  • When phase is 'train', the columns are [input_ids, labels, position_ids, attention_mask].

  • When phase is 'eval', the columns are [input_ids, labels].

The tensor of each column will be cast to int32 type.
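
The phase-to-column mapping above can be written down as a small lookup. This is a minimal sketch for illustration only: `PHASE_COLUMNS` and `expected_columns` are hypothetical names and are not part of MindFormers.

```python
# Column layout per phase, as listed above.
# PHASE_COLUMNS and expected_columns are illustrative names, not MindFormers APIs.
PHASE_COLUMNS = {
    "train": ["input_ids", "labels", "position_ids", "attention_mask"],
    "eval": ["input_ids", "labels"],
}


def expected_columns(phase: str) -> list:
    """Return the output column names documented for the given phase."""
    if phase not in PHASE_COLUMNS:
        raise ValueError(f"phase must be 'train' or 'eval', got {phase!r}")
    # Return a copy so callers cannot mutate the shared table.
    return list(PHASE_COLUMNS[phase])
```

A lookup like this may be handy for checking that a downstream model receives the columns it expects before training starts.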

Parameters
  • dataset_config (dict, optional) – Config for dataset. When dataset_config is an empty dict or is None, all arguments below will build a non-empty dataset_config. Otherwise, they will be ignored. Default: None.

  • data_loader (Union[dict, Callable], optional) –

    Config for the data loader, or a data loader object. When data_loader is a dict, the following keys are parsed: "type", "dataset_dir", "dataset_files", "phase", "shuffle", "origin_columns" and "version". Default: None.

    • type: Required. Indicates the type of dataset. The value must be a string or a class. When the value is "MindDataset", one of dataset_dir and dataset_files is required, and dataset_dir takes precedence if both are given; otherwise dataset_dir is required.

    • dataset_dir: The path or directory of the dataset. When type is "MindDataset" and dataset_dir is a directory, the directory is searched recursively for files in mindrecord format.

    • dataset_files: The paths of files in mindrecord format. Takes effect only when type is "MindDataset"; otherwise this key is ignored. Must be a list or tuple.

    • phase: Required. The dataset subset to be loaded. The value can be 'train' or 'eval'.

    • shuffle: Required. Whether to perform shuffle on the dataset. Must be bool.

    • origin_columns: Required. The column names corresponding to [prompt, answer] in the original dataset files. Must be a list of two strings.

    • version: Optional. Version of the map function. The value can be 1 or 2. Default when missing: 1.

  • tokenizer (Union[dict, Callable], optional) – Tokenizer configuration or object. Default: None.

  • input_columns (list[str], optional) – Column names before the map function is applied. Default: None.

  • batch_size (int, optional) – Size of each batch. Default: 8.

  • drop_remainder (bool, optional) – Whether to discard the last batch when the number of data items contained in the last batch is smaller than batch_size. Default: True.

  • num_parallel_workers (int, optional) – Specifies the number of concurrent processes or threads for map operations to accelerate processing. Default: 8.

  • repeat (int, optional) – Number of times this dataset is repeated. Default: 1.

  • ignore_pad_token_for_loss (bool, optional) – Whether to ignore pad tokens when computing the loss. Default: True.

  • max_source_length (int, optional) – Maximum length of the source sequence. Default: None.

  • max_target_length (int, optional) – Maximum length of the target sequence. Default: None.

  • phase (str, optional) – Phase of a task, which can be 'train' or 'eval'. Ignored when data_loader is a dict. Default: 'train'.

  • version (int, optional) – Version of the map function, which can be 1 or 2. Ignored when data_loader is a dict. Default: 1.

  • seed (int, optional) – Random seed number. Default: 0.

  • prefetch_size (int, optional) – Buffer queue size of each data processing operation in the pipeline. Default: 1.

  • numa_enable (bool, optional) – Indicates whether to use the NUMA binding function. Default: False.

  • auto_tune (bool, optional) – Indicates whether to enable automatic optimization of data processing parameters. Default: False.

  • autotune_per_step (int, optional) – Specifies the interval, in steps, at which automatic data acceleration adjusts its configuration. Default: 10.

  • filepath_prefix (str, optional) – Path for saving optimized parameter configurations. Default: './autotune'.

  • profile (bool, optional) – Whether to enable data collection. Default: False.
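
The rules for a dict-valued data_loader can be checked before the dataset is built. The sketch below restates the documented key constraints in plain Python. `check_data_loader_config` is a hypothetical helper written for this page; MindFormers performs its own validation, which may differ in detail. The example values reuse the placeholder path and column names from the Examples section.

```python
def check_data_loader_config(cfg: dict) -> dict:
    """Validate a data_loader dict against the key rules documented above.

    Hypothetical helper for illustration; not a MindFormers API.
    Returns a copy of cfg with the optional "version" key filled in.
    """
    if not isinstance(cfg.get("type"), (str, type)):
        raise ValueError("'type' is required and must be a string or a class")

    has_dir = bool(cfg.get("dataset_dir"))
    has_files = bool(cfg.get("dataset_files"))
    if cfg["type"] == "MindDataset":
        # MindDataset accepts either key; dataset_dir takes precedence.
        if not (has_dir or has_files):
            raise ValueError("MindDataset needs 'dataset_dir' or 'dataset_files'")
        if has_files and not isinstance(cfg["dataset_files"], (list, tuple)):
            raise ValueError("'dataset_files' must be a list or tuple")
    elif not has_dir:
        raise ValueError("'dataset_dir' is required for this loader type")

    if cfg.get("phase") not in ("train", "eval"):
        raise ValueError("'phase' must be 'train' or 'eval'")
    if not isinstance(cfg.get("shuffle"), bool):
        raise ValueError("'shuffle' is required and must be a bool")

    cols = cfg.get("origin_columns")
    if not (isinstance(cols, list) and len(cols) == 2
            and all(isinstance(c, str) for c in cols)):
        raise ValueError("'origin_columns' must be a list of two strings")

    version = cfg.get("version", 1)  # documented default when missing
    if version not in (1, 2):
        raise ValueError("'version' must be 1 or 2")
    return {**cfg, "version": version}


# Dict form of a loader config; the path is a placeholder.
example = {
    "type": "ADGenDataLoader",
    "dataset_dir": "/path/to/train.json",
    "phase": "train",
    "shuffle": True,
    "origin_columns": ["content", "summary"],
}
checked = check_data_loader_config(example)
```

Failing early on a malformed config tends to give a clearer error than one raised deep inside the data pipeline.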

Returns

Instance of KeyWordGenDataset.

Raises

ValueError – If dataset_config doesn't contain "dataset_dir" or "dataset_files" as its key.

Examples

>>> from mindformers.dataset import KeyWordGenDataset, ADGenDataLoader
>>> from mindformers import AutoTokenizer
>>> # Note:
>>> #     Replace `"/path/to/train.json"` with the real path of the dataset file.
>>> #     For details on preparing the data, see
>>> #     https://gitee.com/mindspore/mindformers/blob/dev/docs/model_cards/glm.md
>>> data_loader = ADGenDataLoader(dataset_dir="/path/to/train.json", shuffle=True, phase='train',
...                               origin_columns=['content', 'summary'])
>>> tokenizer = AutoTokenizer.from_pretrained('glm_6b')
>>> dataset_from_param = KeyWordGenDataset(data_loader=data_loader, tokenizer=tokenizer,
...                                        input_columns=['input_ids', 'labels',
...                                                       'position_ids', 'attention_mask'],
...                                        max_source_length=64, max_target_length=64,
...                                        ignore_pad_token_for_loss=True, phase='train', batch_size=1)
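
The example above uses batch_size=1, so every sample forms its own batch and drop_remainder never discards data. For larger batch sizes, the number of batches per pass over the data follows from batch_size and drop_remainder as described in Parameters. The sketch below shows that arithmetic only; `batches_per_epoch` is a hypothetical helper, not a MindFormers function, and it does not account for repeat.

```python
import math


def batches_per_epoch(num_samples: int, batch_size: int = 8,
                      drop_remainder: bool = True) -> int:
    """Number of batches produced by one pass over num_samples items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if drop_remainder:
        # The final, smaller batch is discarded.
        return num_samples // batch_size
    # The final, smaller batch is kept.
    return math.ceil(num_samples / batch_size)


print(batches_per_epoch(100, batch_size=8, drop_remainder=True))   # 12
print(batches_per_epoch(100, batch_size=8, drop_remainder=False))  # 13
```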