Dataset

At present, MindSpore Transformers pre-training and fine-tuning support loading datasets in multiple formats, including Megatron Dataset, MindRecord Dataset, and HuggingFace datasets. The usage instructions for each dataset format are as follows.

Megatron Dataset

Megatron Dataset refers to a dataset collected from multiple different sources; it contains different text types, formats, and domains. Training on such a dataset helps models learn a wider range of language features and knowledge, thereby improving their generalization ability and performance. The current implementation of the Megatron framework requires preprocessing the original dataset into a BIN format dataset. MindSpore Transformers is natively adapted to the Megatron Dataset, providing scripts for creating BIN format datasets and supporting direct use of Megatron datasets in training tasks.

How to Make a BIN Format Dataset

MindSpore Transformers provides a preprocessing script, mindformers/tools/dataset_preprocess/preprocess_indexed_dataset.py, which converts text data into a BIN format dataset. This script currently only supports files in a specific JSON format, so users need to first convert the original dataset into a file in this JSON format and then use the preprocessing script to generate the BIN format dataset. Some models in MindSpore Transformers provide scripts for converting specific open-source datasets into JSON format files. Users who want to use their own datasets need to write their own conversion scripts (a minimal sketch is given after the key descriptions below).

The format of the required JSON format file content is as follows:

{"id": "0", "text": "The quick brown fox", "type": "Eng", "src": "www.nvidia.com", "title": "First Part"}
{"id": "1", "text": "jumps over the lazy dog", "type": "Eng", "src": "The Internet", "title": "Second Part"}
...

Each piece of data consists of several key-value pairs. The supported keys are described as follows:

  • "id": The numbering of the data should be in order, required

  • "text": Text data actually used for training, required

  • "type": Indicate language type, optional

  • "src": Indicate the source of the data, optional

  • "title": Indicate the title of the data, optional

Taking the processing of the Wiki dataset for Llama2 model pre-training as an example, the detailed steps for creating a BIN format dataset are explained below:

  1. Download Wiki Dataset

    For downloading the original Wiki dataset, refer to Llama2 Dataset Download.

  2. Generate JSON Format File

    The format of the JSON file wiki.json after processing the Wiki Dataset is as follows (omitting long text):

    {"id": 0, "text": "The gold dollar or gold one ..."}
    {"id": 1, "text": "Super Mario Land is a 1989 ..."}
    {"id": 2, "text": "The Sinclair Scientific Programmable ..."}
    ...
    
  3. Download The Vocabulary File For Llama2

    In the preprocessing script, the raw text data is processed into tokens using the model's tokenizer, so the vocabulary file needs to be downloaded in advance.

    Download link for Llama2 vocabulary file: tokenizer.model

  4. Generate BIN Format Files Using Preprocessing Scripts

    After the data has been processed into the specific JSON format file described above, use mindformers/tools/dataset_preprocess/preprocess_indexed_dataset.py to convert it into a BIN format dataset. The specific command is as follows:

    python mindformers/tools/dataset_preprocess/preprocess_indexed_dataset.py \
    --input ./wiki.json \
    --output-prefix wiki_processed_1024 \
    --tokenizer-type LlamaTokenizer \
    --vocab-file ./tokenizer.model \
    --add_bos_token True \
    --add_eos_token True \
    --pad_or_stitch stitch \
    --seq-length 1024 \
    --workers 1
    

    Configuration parameter description:

    • --input: Path to the JSON format file.

    • --output-prefix: File name prefix of the preprocessed output files.

    • --tokenizer-type: Type of the tokenizer corresponding to the model.

    • --vocab-file: Path to the vocabulary file of the model's tokenizer.

    • --add_bos_token: Whether to add bos_token at the beginning of the data. Default: False.

    • --add_eos_token: Whether to add eos_token at the end of the data. Default: False.

    • --pad_or_stitch: Whether to pad or stitch, depending on the requirements of the training task. pad fills each piece of data whose length is insufficient up to seq-length; stitch concatenates multiple pieces of data into pieces of length seq-length (see the illustration after this list).

    • --seq-length: Length of each piece of data after preprocessing.

    • --workers: Number of parallel workers during preprocessing.
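
    To illustrate the difference between the two --pad_or_stitch modes, the following conceptual sketch arranges already-tokenized samples for a seq-length of 8. The token IDs are made up for illustration; the actual arrangement is performed inside the preprocessing script.

    # Conceptual illustration only; preprocess_indexed_dataset.py performs this internally
    samples = [[11, 12, 13], [21, 22, 23, 24, 25], [31, 32]]
    seq_length, pad_id = 8, 0

    # pad: each sample becomes one record, filled up to seq_length
    padded = [s + [pad_id] * (seq_length - len(s)) for s in samples]

    # stitch: samples are concatenated, then cut into records of length seq_length
    flat = [tok for s in samples for tok in s]
    stitched = [flat[i:i + seq_length] for i in range(0, len(flat), seq_length)]

    print(padded)    # [[11, 12, 13, 0, 0, 0, 0, 0], [21, 22, 23, 24, 25, 0, 0, 0], [31, 32, 0, 0, 0, 0, 0, 0]]
    print(stitched)  # [[11, 12, 13, 21, 22, 23, 24, 25], [31, 32]]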

After executing the above command, two files are obtained, in .bin and .idx formats respectively. The .bin file stores the actual data, and the .idx file stores the index of each piece of data.

Using Megatron Datasets in Training Tasks

Use the Megatron multi-source dataset in the training task as follows:

  1. Prepare the parallel_speed_up.json file

    parallel_speed_up.json is a dataset parallel communication configuration file, and the file content is as follows:

    {
        "dataset_broadcast_opt_level": 3
    }
    
  2. Set environment variables

    Enter the following command at the command line to set environment variables:

    export MS_DEV_DYNAMIC_SINK1=False
    
  3. Modify YAML configuration files for training tasks

    Configure the relevant parameters of the Megatron Dataset in the YAML configuration file. Taking the Llama2-7B model pre-training task as an example, modify train_dataset, runner_config, parallel_config, parallel, and context in pretrain_llama2_7b.yaml. The specific modifications and explanations are as follows:

    train_dataset: &train_dataset
      data_loader:
        type: BlendedMegatronDatasetDataLoader
        datasets_type: "GPTDataset"
        sizes:
          - 1000
          - 0
          - 0
        shuffle: False
        input_columns: ["input_ids", "labels", "loss_mask", "position_ids"]
        config:
          seed: 1234
          seq_length: 1024
          split: "1, 0, 0"
          data_path:
            - 0.3
            - "/path/to/my_wiki_test_1024_text_document"
            - 0.7
            - "/path/to/my_wiki_test_1024_text_document"
          num_dataset_builder_threads: 1
          eod_mask_loss: False
          create_attention_mask: False
    

    Among them:

    • data_loader.type: The type of the dataloader; must be set to BlendedMegatronDatasetDataLoader.

    • data_loader.datasets_type: The dataset type; currently only GPTDataset is supported.

    • data_loader.sizes: The sampling sizes of the training, test, and validation sets (the three values 1000, 0, and 0 above). Currently, only the training set can be configured.

    • input_columns: The input data columns of the training dataset, typically configured as ["input_ids", "labels", "loss_mask", "position_ids"].

    • data_loader.config.seed: Random seed used when creating the dataset. Default: 1234.

    • data_loader.config.seq_length: The length of each piece of data; must be consistent with model.model_config.seq_length in the YAML configuration.

    • data_loader.config.split: Split string; the weights of the training, test, and validation sets separated by commas, used to split the dataset when drawing samples from a single distribution. Currently only "1, 0, 0" is supported.

    • data_loader.config.data_path: Alternating entries in which each number is the weight of a dataset and each string is the path of the dataset BIN file with the .bin suffix removed.

    • data_loader.config.num_dataset_builder_threads: The number of processes used when creating the dataset. Default: 1.

    • data_loader.config.eod_mask_loss: Whether to enable the eod mask. Default: False.

    • data_loader.config.create_attention_mask: Whether to construct attention_mask. Default: True.

    The current Megatron Dataset still has limitations: it only supports non-full-batch scenarios and does not support the seq_pipe parallel feature. The corresponding configuration items need to be modified as follows:

    runner_config:
        sink_mode: True
        sink_size: 1
    
    parallel_config:
        data_parallel: &dp 2
        model_parallel: 2
        pipeline_stage: 1
    
    parallel:
        full_batch: False
        dataset_strategy: [[*dp, 1], [*dp, 1], [*dp, 1], [*dp, 1]]
    
    context:
        ascend_config:
            parallel_speed_up_json_path: "/path/to/parallel_speed_up.json"
    

    The configuration instructions that need to be noted are as follows:

    • parallel.dataset_strategy: Only the List of List type is supported. The number of sub-lists must equal the length of train_dataset.input_columns, and each sub-list must be consistent with the shape of the data returned by the dataset. Generally, data-parallel partitioning is performed on the first dimension of the data, so the first element of each sub-list is configured as *dp and the other elements as 1. For the underlying principle, refer to Dataset Segmentation.

  4. Compile Megatron Dataset module

    MindSpore Transformers has built-in Megatron Dataset module code. Before starting the training task, execute the following commands to compile it:

    pip install pybind11
    cd mindformers/dataset/blended_datasets
    make
    

MindRecord Dataset

MindRecord is an efficient data format developed by MindSpore for storing machine learning or deep learning datasets.

The MindRecord format is designed to improve data processing efficiency, especially in large-scale data training scenarios where data can be loaded and processed faster. MindRecord files typically contain the input samples needed for model training, which are preprocessed (e.g., encoded, normalized) to optimize read speed and memory usage.

For more information about the implementation of MindRecord related interfaces and examples, please refer to the documentation about MindRecord in MindSpore.

How to Make a MindRecord Dataset

The MindRecord module provides methods to convert different datasets into MindRecord format. You can use the FileWriter interface provided by MindSpore to generate MindRecord format datasets.

The following example shows how to make a MindRecord dataset from a JSON format file, taking Llama2 as an example:

  1. Prepare a JSON file

    Prepare a json file like this, named mydata.json:

    [
       {
         "text": "I love Beijing, because it is a city that beautifully blends rich history with modern vibrancy."
       },
       {
         "text": "I love Hangzhou, because it is a city that seamlessly combines natural beauty with rich cultural heritage."
       }
    ]
    
  2. Read json file

    import json
    
    # Read the json file into raw_data
    with open("mydata.json", "r", encoding="utf-8") as file:
        raw_data = json.load(file)
    
  3. Define a MindRecord schema and create a FileWriter object.

    from mindspore.mindrecord import FileWriter
    
    # Define a schema for MindRecord
    schema = {'input_ids': {"type": "int32", "shape": [-1]}}
    # Create a FileWriter object
    writer = FileWriter(file_name="output_file", shard_num=1)
    writer.add_schema(schema, "dataset_type")
    
  4. Iterate through each piece of data in the processed json file, convert it to MindRecord format, and write it to a MindRecord file.

    Vocabulary file download link: tokenizer.model

    import numpy as np
    from mindformers import LlamaTokenizer
    
    def tokenize_json(tokenizer, raw_data):
        """tokenize json file dataset"""
        content = []  # Read each json data entry and collect its "input_ids"
        for line in raw_data:
            stripped_line = line['text'].strip()
            if stripped_line:
                line_ids = tokenizer(stripped_line)["input_ids"]
                content.append(line_ids)
    
        for ids in content:
            sample = {}
            sample['input_ids'] = np.array(ids, dtype=np.int32)
            yield sample
    
    # Tokenize the text data
    word_tokenizer = LlamaTokenizer(vocab_file=r"tokenizer.model")
    
    # Iterate through each piece of data in the processed json file, convert it to MindRecord format and write it to the MindRecord file
    # tokenize_json is a custom method to tokenize the dialog data in json.
    for x in tokenize_json(word_tokenizer, raw_data):
        writer.write_raw_data([x])
    writer.commit()
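
    As an optional check, the MindRecord file written above can be read back with mindspore.dataset.MindDataset. This is a minimal sketch; the file name output_file matches the FileWriter example above.

    import mindspore.dataset as ds

    # Read the MindRecord file written above and print the shape of each sample
    dataset = ds.MindDataset(dataset_files="output_file", columns_list=["input_ids"])
    for item in dataset.create_dict_iterator(output_numpy=True):
        print(item["input_ids"].shape)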
    

For the detailed cases, refer to Examples of Data Preprocessing in Llama2.

Using MindRecord Format Datasets in Tasks

You can make a training or evaluation task use a prepared MindRecord format dataset by configuring dataset-related parameters in the yaml configuration file.

Here, as an example, for the Llama2-7B model pretraining task, the default configuration parameters and descriptions in the pretrain_llama2_7b.yaml file are as follows:

# dataset
train_dataset: &train_dataset
  data_loader:
    type: MindDataset
    dataset_dir: ""
    shuffle: True
  input_columns: ["input_ids"]
  num_parallel_workers: 8
  python_multiprocessing: False
  drop_remainder: True
  batch_size: 6
  repeat: 1
  numa_enable: False
  prefetch_size: 1

train_dataset_task:
  type: CausalLanguageModelDataset
  dataset_config: *train_dataset

Configure the following parameters to use MindRecord format datasets:

  • data_loader.type: The type of the dataloader, which needs to be set to MindDataset.

  • data_loader.dataset_dir: The path to the dataset file.

  • input_columns: Sets the input data columns of the training dataset. In the current pre-training scenario, it is set to ["input_ids"].

For descriptions of the remaining parameters, refer to the model training configuration and model evaluation configuration sections in the Configuration File Description.

HuggingFace Datasets

Currently, the dataset loading functionality has been integrated with the ModelScope Open-Source Community and the HuggingFace Community, supporting online dataset loading and preprocessing. Additionally, datasets can be packed to enhance model training efficiency.

Usage Instructions

HuggingFace datasets support online and offline loading of datasets from both the HuggingFace community and the Modelers open-source community. Below is an introduction to environment preparation, the dataset loading process, and how to configure HuggingFace datasets in configuration files.

Integrating with Open-Source Communities

  • Integrating with HuggingFace Community

    To use datasets from the HuggingFace community, follow these steps:

    1. Environment Setup

      The environment variable HF_ENDPOINT controls the remote repository used by HuggingFace. By default it is set to https://huggingface.co. For users in China, it is recommended to configure it to a mirror address.

    2. Install Dependencies

      pip install datasets
      
  • Integrating with ModelScope Open-Source Community

    To use datasets from the ModelScope Open-Source Community, follow these steps:

    1. Environment Setup

      The environment variable OPENMIND_HUB_ENDPOINT controls the remote repository used by the ModelScope Open-Source Community. When not configured, it defaults to https://telecom.openmind.cn (i.e., export OPENMIND_HUB_ENDPOINT=https://telecom.openmind.cn).

    2. Install Dependencies

      git clone https://gitee.com/openmind-ai/openmind-hub.git
      cd openmind-hub
      pip install -e .
      cd ..
      git clone https://gitee.com/foundation-models/openmind-datasets.git
      cd openmind-datasets
      pip install -e .
      cd ..
      

When the openmind-datasets component is installed in the environment, the Modelers open-source community is interfaced with by default. To interface with HuggingFace instead, use the environment variable USE_OM to control which community is used: the default value ON selects the Modelers community, and OFF selects the HuggingFace community.

Dataset Loading Process

Figure: dataset loading process (commondataloader.png)

The online dataset loading and processing functionality is primarily implemented through CommonDataLoader. The data loading part can be customized via configuration files, with detailed configuration instructions available in the dataloader parameter description. The online loading module requires users to implement customizations for different datasets. For example, the AlpacaInstructDataHandler class can be used to preprocess the alpaca dataset. For more information, please refer to Custom Data Handler.

dataloader Parameter Description

The online dataset loading feature is enabled by configuring the data_loader in the configuration file. Below is an example configuration for online dataset loading:

train_dataset:
  input_columns: &input_columns ["input_ids", "labels", "loss_mask", "position_ids", "attention_mask"]
  construct_args_key: *input_columns
  data_loader:
    type: CommonDataLoader
    load_func: 'load_dataset'
    shuffle: False
    split: "train"
    path: "llm-wizard/alpaca-gpt4-data"
    is_dynamic: False
    packing: pack
    handler:
      - type: AlpacaInstructDataHandler
        tokenizer_name: llama2_7b
        seq_length: 4096
        prompt_key: "conversations"
        output_columns: ["input_ids", "labels"]
      - type: PackingHandler
        seq_length: 4096
        output_columns: ["input_ids", "labels", "actual_seq_len"]
    adaptor_config:
       compress_mask: False
    column_names: *input_columns

Parameter descriptions for data_loader are as follows:

  • type (str): Fixed as CommonDataLoader. This module supports loading datasets from HuggingFace and the ModelScope open-source community.

  • packing (str): Packing configuration used when processing datasets with handler. Options are pack and truncate.

  • load_func (str): The function used to load datasets. Options are load_dataset and load_from_disk. Use load_from_disk for data saved via the save_to_disk function, and load_dataset in other scenarios. Default: load_dataset.

  • path (str): When load_func=load_dataset, this parameter aligns with the interface of datasets.load_dataset. When load_func=load_from_disk, it specifies the dataset loading path.

  • data_files (str): When load_func=load_dataset, this parameter aligns with the interface of datasets.load_dataset. It has no effect when load_func=load_from_disk.

  • handler (list): Multiple handlers can be configured to preprocess the loaded dataset in the specified order. For details on handler configuration, refer to the handler parameter description in Custom Data Handler.

  • adaptor_config (dict): Dataset-related configuration during model training. Currently supports compress_mask, which takes effect when packing is set; if enabled, a compressed data mask is returned. Default: False.

  • shuffle (bool): Whether random sampling is enabled when loading the dataset.

  • column_names (list): The column names returned by the dataset. If not set, all columns are returned.

  • is_dynamic (bool): Whether the dataset returns dynamic-length data. Default: False.

In addition to the above configurations, all parameters from the datasets.load_dataset interface are supported with the same meanings and functions.
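
For reference, the pass-through parameters in the configuration above (path, split, and so on) map directly onto a datasets call such as the following. This is only a standalone sketch of the underlying loading step and does not need to be run separately when using CommonDataLoader.

from datasets import load_dataset

# Equivalent standalone call for the loading parameters shown above
dataset = load_dataset("llm-wizard/alpaca-gpt4-data", split="train")
print(dataset.column_names)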

When packing is configured, the dataset returns an actual_seq_len column. For more information, refer to the actual_seq_qlen and actual_seq_kvlen parameter descriptions in the documentation.

Feature Introduction

Dynamic Sequence Length Fine-Tuning

CommonDataLoader supports dynamic shape fine-tuning using HuggingFace datasets, which can be loaded online or offline. Below, we use the alpaca dataset as an example to demonstrate the configuration for dynamic shape fine-tuning.

  • Online Loading

    The online dataset name is llm-wizard/alpaca-gpt4-data. You can search and download it from the HuggingFace official website or load it directly using the online name.

    Example configuration for online loading:

     train_dataset:
       input_columns: &input_columns ["input_ids", "labels"]
       dynamic_batch: True                    # Enable dynamic shape
       divisor: 32                            # With divisor and remainder configured, seq_length under dynamic shape is padded to a multiple of divisor plus remainder
       remainder: 1
       data_loader:
         type: CommonDataLoader
         shuffle: True
         split: "train"                       # Subset name of the online dataset
         path: "llm-wizard/alpaca-gpt4-data"  # Online dataset name
         is_dynamic: True
         handler:
           - type: AlpacaInstructDataHandler
             tokenizer_name: llama2_7b
             seq_length: 4096
             prompt_key: "conversations"
             output_columns: *input_columns
       seed: 0
       num_parallel_workers: 8
       python_multiprocessing: False
       drop_remainder: True
       repeat: 1
       numa_enable: False
       prefetch_size: 1
    
    1. For parameter descriptions in train_dataset, please refer to the documentation.

    2. AlpacaInstructDataHandler is an online processing script developed for the alpaca dataset. If using a different dataset, you need to implement a custom data handler by referring to the Custom Data Handler guide.

  • Offline Loading

    For offline loading, you need to prepare the JSON files of the alpaca dataset. The offline configuration differs from the online configuration only in the following parameters:

     train_dataset:
       data_loader:
         path: "json"                               # loading datasets using the load_dataset interface
         data_files: '/path/alpaca_gpt4_data.json'  # the file path of the alpaca dataset
    

After configuring the dataset loading method, you also need to set is_dynamic=True in the model configuration to enable dynamic shape training for the model.

model_config:
  is_dynamic: True

Since dynamic shapes may lead to operator compilation caching, it is recommended to set the following environment variables to limit the number of cached compilations when running in a memory-constrained environment. This helps prevent out-of-memory issues:

export ACLNN_CACHE_LIMIT=10
export MS_DEV_RUNTIME_CONF="aclnn_cache_queue_length:64"

  • The ACLNN_CACHE_LIMIT parameter description can be found in the documentation.

  • MS_DEV_RUNTIME_CONF is a MindSpore parameter for setting the operator cache queue length. The value 64 is the queue length, which defaults to 1024; it can be adjusted based on the actual environment, and setting it too small may affect model training performance.

After completing all the configurations above, you can proceed with dynamic shape fine-tuning by referring to the documentation for the specific model you are using.

Custom Data Handler

Users can define custom data handlers to apply various preprocessing logic to the loaded dataset.

  • Handler Parameter Description

    • type (str): Custom data handler name. A custom handler must inherit from BaseInstructDataHandler.

    • tokenizer_name (str): Name of the tokenizer used.

    • tokenizer (dict/str): Tokenizer configuration parameters. Can be a dictionary, a string, or a tokenizer object. Takes lower priority than tokenizer_name.

    • seq_length (int): Maximum sequence length, usually the same as the model's sequence length.

    • output_columns (list): Column names of the processed data returned after preprocessing.

    • prompt_key (str): Column name of the data after prompt processing is applied.

  • Development Sample 1

The custom data handler is usually placed in the mindformers/dataset/handler directory, and a custom handler must inherit the abstract base class BaseInstructDataHandler. You need to implement the format_func and tokenize_func methods, which preprocess each piece of loaded data. Refer to alpaca_handler.py.

@MindFormerRegister.register(MindFormerModuleType.DATA_HANDLER)
class XXXInstructDataHandler(BaseInstructDataHandler):

    def format_func(self, example):
        # Custom data format conversion

    def tokenize_func(self, example):
        # Custom tokenizer split word processing

BaseInstructDataHandler provides a default implementation of the entry method handle, which iterates over the data and preprocesses each piece. format_func defines how to convert a raw sample into the desired data format, and tokenize_func performs customized tokenization on the processed data. The input parameter example in the sample above is each individual sample obtained from the dataset.
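
As an illustration only, a handler for a hypothetical dataset whose samples contain a single raw_text column might look like the sketch below. It follows the same skeleton as the sample above; the column name is made up, and the sketch assumes, as in the built-in handlers, that self.tokenizer and self.seq_length are provided by BaseInstructDataHandler.

@MindFormerRegister.register(MindFormerModuleType.DATA_HANDLER)
class RawTextInstructDataHandler(BaseInstructDataHandler):
    """Sketch of a handler for a hypothetical dataset with a raw_text column"""

    def format_func(self, example):
        # Convert the raw sample into the text that will be tokenized
        return example["raw_text"].strip()

    def tokenize_func(self, text):
        # Tokenize the formatted text and truncate it to the configured sequence length
        tokens = self.tokenizer(text)["input_ids"][:self.seq_length]
        return {"input_ids": tokens, "labels": tokens}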

  • Development Sample 2

If you want to process the whole dataset directly instead of processing each piece of data in batches, you can implement the entry handle method in the custom handler; it receives the complete dataset, as shown below:

    def handle(self, dataset):
        """data handler"""
        return dataset.rename_columns({"content":"prompt","summary":"answer"})

  • alpaca Dataset Sample

Modify the task configuration file finetune_llama2_7b.yaml.

Modify the following parameters:

train_dataset:
  input_columns: &input_columns ["input_ids", "labels"]
  data_loader:
    type: CommonDataLoader
    shuffle: True
    split: "train"
    path: "llm-wizard/alpaca-gpt4-data"
    handler:
      - type: AlpacaInstructDataHandler
        tokenizer_name: llama2_7b
        seq_length: 4096
        prompt_key: "conversations"
        output_columns: *input_columns

For descriptions of the remaining parameters, refer to the model training configuration and model evaluation configuration sections in the Configuration File Description.

Custom data handler:

@MindFormerRegister.register(MindFormerModuleType.DATA_HANDLER)
class AlpacaInstructDataHandler(BaseInstructDataHandler):

    def format_func(self, example):
        """format func"""
        source = PROMPT_INPUT.format_map(example) \
            if example.get(self.input_key, "") != "" \
            else PROMPT_NO_INPUT.format_map(example)
        target = example.get(self.output_key)
        formatted_example = [
            {
                "from": self.user_role,
                "value": source,
            },
            {
                "from": self.assistant_role,
                "value": target,
            },
        ]

        return formatted_example

    def tokenize_func(self, messages):
        """tokenize func"""
        conversation = self.gen_prompt(messages)
        sep = self.template.sep + self.assistant_role + ": "
        # Tokenize conversations
        rounds = conversation.split(self.template.sep2)
        ids = [self.tokenizer.bos_token_id]
        mask = [1]
        for _, rou in enumerate(rounds):
            if rou == "":
                break
            conv_out = self.tokenizer(rou)
            ids.extend(conv_out['input_ids'][1:])
            mask.extend(conv_out['attention_mask'][1:])
        d = {'input_ids': ids, 'attention_mask': mask}
        # pylint: disable=W0212
        if not self.dynamic:
            d = self.tokenizer._pad(d, max_length=self.seq_length + 1, padding_strategy='max_length')
        input_id = d['input_ids'][:self.seq_length + 1]
        target = np.array(d['input_ids'])
        total_len = int(np.not_equal(target, self.tokenizer.pad_token_id).sum())
        cur_len = 1
        target[:cur_len] = self.ignore_token_id
        for _, rou in enumerate(rounds):
            if rou == "":
                break
            parts = rou.split(sep)
            if len(parts) != 2:
                break
            parts[0] += sep
            round_len = len(self.tokenizer(rou)['input_ids']) - 1
            instruction_len = len(self.tokenizer(parts[0])['input_ids']) - 3

            target[cur_len: cur_len + instruction_len] = self.ignore_token_id

            cur_len += round_len
        if self.dynamic:
            return {
                "input_ids": input_id,
                "labels": target[:len(input_id)].tolist()
            }
        target[cur_len:] = self.ignore_token_id
        if cur_len < self.seq_length + 1:
            if cur_len != total_len:
                target[:] = self.ignore_token_id
        else:
            target = target[:self.seq_length + 1]
        label = target.tolist()
        return {
            "input_ids": input_id,
            "labels": label,
        }

  • ADGEN Dataset Sample

Modify the task configuration file run_glm3_6b_finetune_2k_800T_A2_64G.yaml.

Modify the following parameters:

train_dataset: &train_dataset
  data_loader:
    type: CommonDataLoader
    path: "xxx/ADGEN"
    split: "train"
    shuffle: True
    handler:
      - type: AdgenInstructDataHandler
        output_columns: ["prompt", "answer"]
  tokenizer:
    type: ChatGLM3Tokenizer
    vocab_file: "/path/to/tokenizer.model"
  input_columns: ["input_ids", "labels"]
  max_source_length: 1024
  max_target_length: 1023
  ignore_pad_token_for_loss: True
  num_parallel_workers: 8
  python_multiprocessing: False
  drop_remainder: True
  batch_size: 8
  repeat: 1
  numa_enable: False
  prefetch_size: 1
  phase: "train"
  version: 3
  seed: 0

For descriptions of the remaining parameters, refer to the model training configuration and model evaluation configuration sections in the Configuration File Description.

Custom adgen_handler:

@MindFormerRegister.register(MindFormerModuleType.DATA_HANDLER)
class AdgenInstructDataHandler(BaseInstructDataHandler):
    """agden data handler"""
    def handle(self, dataset):
        """data handler"""
        return dataset.rename_columns({"content": "prompt", "summary": "answer"})

Dataset Packing

Configuring PackingHandler in CommonDataLoader allows for packing processing of the data. Currently, the original data needs to be processed into input_ids and labels that can be fed into the model during the preprocessing step.

  • Parameter Description

    • type (str): Fixed as PackingHandler. This module supports packing data. When packing=pack or packing=truncate is configured in dataloader, it performs non-truncating or truncating concatenation of the data, respectively.

    • seq_length (int): Maximum sequence length of the data after packing.

    • pad_token (int): Token ID used to pad input_ids when a packed sample does not reach the maximum length. Default: 0.

    • ignore_token (int): Token ID used to pad labels when a packed sample does not reach the maximum length. Default: -100.

  • Packing Example

By following the configuration below, the alpaca dataset can be preprocessed to achieve online packing.

train_dataset:
  input_columns: &input_columns ["input_ids", "labels", "loss_mask", "position_ids", "attention_mask"]
  construct_args_key: *input_columns
  data_loader:
    type: CommonDataLoader
    shuffle: False
    split: "train"
    path: "llm-wizard/alpaca-gpt4-data"
    packing: pack
    handler:
      - type: AlpacaInstructDataHandler
        tokenizer_name: llama2_7b
        seq_length: 4096
        prompt_key: "conversations"
        output_columns: ["input_ids", "labels"]
      - type: PackingHandler
        seq_length: 4096
        output_columns: ["input_ids", "labels", "actual_seq_len"]
    adaptor_config:
       compress_mask: False

Using the above configuration file to process the alpaca dataset will execute the following steps:

  1. The raw text data will be processed into input_ids and labels using AlpacaInstructDataHandler and the tokenizer of llama2_7b.

  2. PackingHandler will be used to perform packing on the processed input_ids and labels, producing concatenated input_ids and labels up to seq_length. The actual_seq_len column records the sequence length of each sub-sample in the concatenated sample; during training, this parameter is used to generate the corresponding data mask (a conceptual illustration follows this list).

  3. If compress_mask=False is set in adaptor_config, a complete data mask will be returned during training. Otherwise, actual_seq_len will be returned.
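
As a conceptual illustration of step 2 (using made-up token IDs and a small maximum length), packing two tokenized samples of lengths 3 and 4 into one sample of length 8 looks roughly as follows; the real work is done by PackingHandler:

# Conceptual illustration only: pack two tokenized samples into one sequence of length 8
seq_length, pad_token, ignore_token = 8, 0, -100

input_ids_1, labels_1 = [11, 12, 13], [11, 12, 13]
input_ids_2, labels_2 = [21, 22, 23, 24], [21, 22, 23, 24]

packed_input_ids = input_ids_1 + input_ids_2 + [pad_token]     # padded with pad_token up to seq_length
packed_labels = labels_1 + labels_2 + [ignore_token]           # padded with ignore_token up to seq_length
actual_seq_len = [len(input_ids_1), len(input_ids_2)]          # sequence length of each sub-sample, as described above

print(packed_input_ids)  # [11, 12, 13, 21, 22, 23, 24, 0]
print(packed_labels)     # [11, 12, 13, 21, 22, 23, 24, -100]
print(actual_seq_len)    # [3, 4]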

Offline Dataset Processing

In addition to supporting online dataset loading and processing, CommonDataLoader also supports offline dataset processing and saving.

The datasets_preprocess.py script can be used to process HuggingFace datasets offline and save them.

  • Parameter Description

    • config (str): Configuration file for offline data processing, used in the same way as in online processing. Refer to dataloader for details.

    • save_path (str): Path where the preprocessed dataset will be saved.

    • register_path (str): Registration path for the model API, which contains the Python files related to the model, typically the model folder under the research directory.

  • Usage Example

You can use the configuration file provided in the dataset packing example and execute the following command.

python toolkit/data_preprocess/huggingface/datasets_preprocess.py \
  --config data_process.yaml \
  --save_path /path/processed_data

If you need to load the saved dataset, you should modify the YAML configuration as follows:

train_dataset:
  input_columns: &input_columns ["input_ids", "labels", "loss_mask", "position_ids", "attention_mask"]
  construct_args_key: *input_columns
  data_loader:
    type: CommonDataLoader
    shuffle: False
    load_func: "load_from_disk"
    path: "/path/processed_data"
    adaptor_config:
       compress_mask: False
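
Before training, the saved result can optionally be inspected with the datasets library to confirm that the expected columns were written. This is a minimal sketch; /path/processed_data is the save path used above.

from datasets import load_from_disk

# Inspect the dataset saved by datasets_preprocess.py
dataset = load_from_disk("/path/processed_data")
print(dataset)  # overview of the saved dataset, including its columns and number of rows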