Dataset

At present, MindSpore Transformers' pre-training and fine-tuning support the ability to load datasets in multiple formats, including loading methods for Megatron Dataset, MindRecord Dataset, and HuggingFace datasets. The specific usage instructions for each format of dataset are as follows.

Megatron Dataset

Megatron Dataset refers to a dataset collected from multiple different sources, it contains different text types, formats, and domains. Using dataset can help models learn a wider range of language features and knowledge, thereby improving their generalization ability and performance. The current implementation of the Megatron framework requires preprocessing the original dataset into a BIN format dataset. MindSpore Transformers have been natively adapted to the Megatron Dataset, providing scripts for creating BIN format datasets and supporting direct use of the Megatron Dataset in training tasks.

How to Make a BIN Format Dataset

MindSpore Transformers provides a preprocessing script mindformers/tools/dataset_preprocess/preprocess_indexed_dataset.py, which can convert text data to a BIN format dataset. This script currently only supports processing files in a specific JSON format. Users need to first convert the original dataset file into a specific JSON format file, and then use a preprocessing script to generate a BIN format dataset file. Some models in MindSpore Transformers currently provide scripts for converting specific open-source datasets into JSON format files. If users want to use their own datasets, they need to write their own scripts to convert them into the desired format.

The format of the required JSON format file content is as follows:

{"id": "0", "text": "The quick brown fox", "type": "Eng", "src": "www.nvidia.com", "title": "First Part"}
{"id": "1", "text": "jumps over the lazy dog", "type": "Eng", "src": "The Internet", "title": "Second Part"}
...

Each piece of data consists of several key value pairs, and the supported keys and descriptions are as follows:

"id": The numbering of the data should be in order, required
"text": Text data actually used for training, required
"type": Indicate language type, optional
"src": Indicate the source of the data, optional
"title": Indicate the title of the data, optional

Taking the processing of Wiki datasets and their use as pre-training for Llama2 models as an example, the detailed steps for creating BIN format datasets are explained below:

Download Wiki Dataset

For the original Wiki Dataset downloading, refer to Llama2 Dataset Download.
Generate JSON Format File

The original format of the Wiki Dataset is as follows:

The format of the JSON file wiki.json after processing the Wiki Dataset is as follows (omitting long text):
```
{"id": 0, "text": "The gold dollar or gold one ..."}
{"id": 1, "text": "Super Mario Land is a 1989 ..."}
{"id": 2, "text": "The Sinclair Scientific Programmable ..."}
...
```
Download The Vocabulary File For Llama2

In the preprocessing script, the raw text data will be processed into Tokens using the Tokenizer of the model, therefore, it is necessary to download the vocabulary file in advance.

Download link for Llama2 vocabulary file: tokenizer.model
Generate BIN Format Files Using Preprocessing Scripts

After processing into the specific JSON format file mentioned above, using mindformers/tools/dataset_preprocess/preprocess_indexed_dataset.py to convert it into a BIN format dataset, the specific command is as follows:
```
python mindformers/tools/dataset_preprocess/preprocess_indexed_dataset.py \
--input ./wiki.json \
--output-prefix wiki_processed_1024 \
--tokenizer-type LlamaTokenizer \
--vocab-file ./tokenizer.model \
--add_bos_token True \
--add_eos_token True \
--pad_or_stitch stitch \
--seq-length 1024 \
--workers 1
```
Configuration parameter description:
- --input: Path to JSON format file
- --output-prefix: The file name prefix of the preprocessed output file
- --tokenizer-type: The type of tokenizer corresponding to the model
- --vocab-file: The path of the vocabulary file for the tokenizer model tokenizer
- --add_bos_token: Add bos_token at the beginning of the data, Default: False
- --add_eos_token: Add eos_token at the ending of the data, Default: False
- --pad_or_stitch: According to the requirements of the training task, set whether to splice or fill in, pad is in fill in mode, this mode will fill in the data with insufficient length to the seq length; Stitch is a concatenation mode that concatenates multiple pieces of data into data with a length of seq length
- --seq-length: Preprocess the length of each piece of data
- --workers: The number of parallel workers during preprocessing

After executing the above command, two files will be obtained, in .bin and .idx formats respectively. The .bin format file stores the actual data, and .idx format file stores the index of each piece of data.

Using Megatron Datasets in Training Tasks

Use the Megatron multi-source dataset in the training task as follows:

Prepare the parallel_speed_up.json file

parallel_speed_up.json is a dataset parallel communication configuration file, and the file content is as follows:
```
{
    "dataset_broadcast_opt_level": 3
}
```
Set environment variables

Enter the following command at the command line to set environment variables:
```
export MS_DEV_DYNAMIC_SINK1=False
```
Modify YAML configuration files for training tasks

Configure the relevant parameters of Megatron Dataset in YAML configuration file. Here, taking the Llama2-7B model pre-training task as an example, modify train_dataset , runner_config , parallel_config , parallel and context in pretrain_llama2_7b.yaml. The specific modifications and explanations are as follows:
```
train_dataset: &train_dataset
  data_loader:
    type: BlendedMegatronDatasetDataLoader
    datasets_type: "GPTDataset"
    sizes:
      - 1000
      - 0
      - 0
    shuffle: False
    config:
      seed: 1234
      seq_length: 1024
      split: "1, 0, 0"
      data_path:
        - 0.3
        - "/path/to/my_wiki_test_1024_text_document"
        - 0.7
        - "/path/to/my_wiki_test_1024_text_document"
      num_dataset_builder_threads: 1
      eod_mask_loss: False
      create_attention_mask: False
  input_columns: ["input_ids", "labels", "loss_mask", "position_ids"]
```
Among them:
- data_loader.type: The type of dataloader, should be set to BlendedMegatronDatasetDataLoader.
- data_loader.datasets_type: Dataset type, currently only supports GPTDataset.
- data_loader.sizes: - 1000 , - 0 , - 0 are the sampling sizes for the training set, test set, and validation set, respectively. Currently, only the training set can be configured.
- input_columns: Set the input data columns for the training dataset, typically configured as ["input_ids", "labels", "loss_mask", "position_ids"] .
- data_loader.config.seed: Random number seed when creating a dataset. Default: 1234 .
- data_loader.config.seq_length: The length of each piece of data must be consistent with the model.model_config.seq_length in the YAML configuration.
- data_loader.config.split: Split string, separate the weights of the training set, test set, and validation set with commas, used to split the dataset when drawing samples from a single distribution. Currently, only supports configuration as "1, 0, 0" .
- data_loader.config.data_path: The number is the weight of each dataset, and the string is the path of the dataset BIN file, which needs to remove the file format suffix .bin .
- data_loader.config.num_dataset_builder_threads: The number of processes used when creating the dataset. Default: 1 .
- data_loader.config.eod_mask_loss: Do you want to use the switch of eod mask. Default: False .
- data_loader.config.create_attention_mask: Whether to construct attention_mask. Default: True .
There are still limitations to the current Megatron Dataset, which only supports non full batch scenarios, and it does not support the parallel feature of seq_pipe. The corresponding configuration items need to be modified according to the following:
```
runner_config:
    sink_mode: True
    sink_size: 1

parallel_config:
    data_parallel: &dp 2
    model_parallel: 2
    pipeline_stage: 1

parallel:
    full_batch: False
    dataset_strategy: [[*dp, 1], [*dp, 1], [*dp, 1], [*dp, 1]]

context:
    ascend_config:
        parallel_speed_up_json_path: "/path/to/parallel_speed_up.json"
```
The configuration instructions that need to be noted are as follows:
- parallel.dataset_strategy: Only support List of List type, parallel.dataset_strategy: Only support List of List type. The number of sub lists in a List needs to be equal to the length of train_dataset.input_columns, and each sub List in the List needs to be consistent with the shape of the data returned by the dataset. Generally, parallel data partitioning is performed in the first dimension of the data, so the first bit of the sub List is configured as *dp , and the other bits are configured as 1 . The specific principle can be referred to Dataset Segmentation.
Compile Megatron Dataset module

MindSpore Transformers have built-in Megatron Dataset module code, before starting the training task, the following command needs to be executed for compilation:
```
pip install pybind11
cd mindformers/dataset/blended_datasets
make
```

MindRecord Dataset

MindRecord is an efficient data format developed by MindSpore for storing machine learning or deep learning datasets.

The MindRecord format is designed to improve data processing efficiency, especially in large-scale data training scenarios where data can be loaded and processed faster. MindRecord files typically contain the input samples needed for model training, which are preprocessed (e.g., encoded, normalized) to optimize read speed and memory usage.

For more information about the implementation of MindRecord related interfaces and examples, please refer to the documentation about MindRecord in MindSpore.

How to Make a MindRecord Dataset

The MindRecord module provides methods to convert different datasets into MindRecord format. You can use the FileWriter interface provided by MindSpore to generate MindRecord format datasets.

The following is an example of a MindRecord dataset based on a json format file, taking Llama2 as an example:

Prepara json file

Prepare a json file like this, named mydata.json:

[
   {
     "text": "I love Beijing, because it is a city that beautifully blends rich history with modern vibrancy."
   },
   {
     "text": "I love Hangzhou, because it is a city that seamlessly combines natural beauty with rich cultural heritage."
   }
]

Read json file

import json

raw_data = None
file = open("mydata.json", "r")  # Open json file
if file is not None:
   raw_data = json.load(file)  # Read json file into raw_data
   file.close()

Define a MindRecord schema and create a FileWriter object;

from mindspore.mindrecord import FileWriter

# Define a schema for MindRecord
schema = {'input_ids': {"type": "int32", "shape": [-1]}
# Create a FileWriter object
writer = FileWriter(file_name="output_file", shard_num=1)
writer.add_schema(schema, "dataset_type")

Iterate through each piece of data in the processed json file, convert it to MindRecord format, and write it to a MindRecord file.

Word list download link: tokenizer.model

import numpy as np
from mindformers import LlamaTokenizer

def tokenize_json(tokenizer, raw_data):
    """tokenize json file dataset"""
    content = [] # Read each json data and get its “input_ids”.
    for line in raw_data:
        stripped_line = line['text'].strip()
        if stripped_line:
            line_ids = tokenizer(stripped_line)["input_ids"]
            content.append(line_ids)

    for ids in content:
        sample = {}
        sample['input_ids'] = np.array(ids, dtype=np.int32)
        yield sample

# Tokenize the text data
word_tokenizer = LlamaTokenizer(vocab_file=r"tokenizer.model")

# Iterate through each piece of data in the processed json file, convert it to MindRecord format and write it to the MindRecord file
# tokenize_json is a custom method to tokenize the dialog data in json.
for x in tokenize_json(word_tokenizer, raw_data):
    writer.write_raw_data([x])
writer.commit()

For the detailed cases, refer to Examples of Data Preprocessing in Llama2.

Using MindRecord Format Datasets in Tasks

You can make a training or evaluation task use a prepared MindRecord format dataset by configuring dataset-related parameters in the yaml configuration file.

Here, as an example, for the Llama2-7B model pretraining task, the default configuration parameters and descriptions in the pretrain_llama2_7b.yaml file are as follows:

# dataset
train_dataset: &train_dataset
  data_loader:
    type: MindDataset
    dataset_dir: ""
    shuffle: True
  input_columns: ["input_ids"]
  num_parallel_workers: 8
  python_multiprocessing: False
  drop_remainder: True
  batch_size: 6
  repeat: 1
  numa_enable: False
  prefetch_size: 1

train_dataset_task:
  type: CausalLanguageModelDataset
  dataset_config: *train_dataset

Configure the following parameters to use MindRecord format datasets:

data_loader.type: The type of the dataloader, which needs to be set to MindDataset.
data_loader.dataset_dir: The path to the dataset file.
input_columns: Sets the data columns for the input of the training dataset. Currently a pre-training scenario, set to ["input_ids"].

The rest of the parameters can be described in "model training configuration" and "model evaluation configuration Configuration File Description.

HuggingFace Datasets

Currently, the dataset loading functionality has been integrated with the ModelScope Open-Source Community and the HuggingFace Community, supporting online dataset loading and preprocessing. Additionally, datasets can be packed to enhance model training efficiency.

Usage Instructions

HuggingFace datasets support online and offline loading of datasets from both the HuggingFace community and the MoLo open-source community. Below is an introduction to environment preparation, the dataset loading process, and how to configure the use of HuggingFace datasets in configuration files.

Integrating with Open-Source Communities

Integrating with HuggingFace Community

To use datasets from the HuggingFace community, follow these steps:
1. Environment Setup
  
  The environment variable HF_ENDPOINT controls the remote repository used by HuggingFace. By default, it is set to https://huggingFace.co. For users in China, it is recommended to configure it to the mirror address:
2. Install Dependencies
```
pip install datasets
```
Integrating with ModelScope Open-Source Community

To use datasets from the ModelScope Open-Source Community, follow these steps:
1. Environment Setup
  
  The environment variable OPENMIND_HUB_ENDPOINT controls the remote repository used by the ModelScope Open-Source Community. Defaults to export OPENMIND_HUB_ENDPOINT=https://telecom.openmind.cn when not configured.
2. Install Dependencies
```
git clone https://gitee.com/openmind-ai/openmind-hub.git
cd openmind-hub
pip install -e .
cd ..
git clone https://gitee.com/foundation-models/openmind-datasets.git
cd openmind-datasets
pip install -e .
cd ..
```

When the openmind-datasets component is installed in the environment, the default interface is the Modelers open source community, if you want to interface with HuggingFace, the environment variable USE_OM can control which community to interface with, the default value is ON for the Modelers community, change it to OFF to interface with the HuggingFace community.

Dataset Loading Process

The online dataset loading and processing functionality is primarily implemented through CommonDataLoader. The data loading part can be customized via configuration files, with detailed configuration instructions available in the dataloader parameter description. The online loading module requires users to implement customizations for different datasets. For example, the AlpacaInstructDataHandler class can be used to preprocess the alpaca dataset. For more information, please refer to Custom Data Handler.

dataloader Parameter Description

The online dataset loading feature is enabled by configuring the data_loader in the configuration file. Below is an example configuration for online dataset loading:

train_dataset: &train_dataset
  input_columns: &input_columns ["input_ids", "labels", "loss_mask", "position_ids", "attention_mask"]
  construct_args_key: *input_columns
  data_loader:
    type: CommonDataLoader
    load_func: 'load_dataset'
    shuffle: False
    split: "train"
    path: "llm-wizard/alpaca-gpt4-data"
    packing: pack
    handler:
      - type: AlpacaInstructDataHandler
        tokenizer_name: llama2_7b
        seq_length: 4096
        prompt_key: "conversations"
        output_columns: ["input_ids", "labels"]
        is_dynamic: False
      - type: PackingHandler
        seq_length: 4096
        output_columns: ["input_ids", "labels", "actual_seq_len"]
    adaptor_config:
      compress_mask: False
    column_names: *input_columns

Parameter descriptions for data_loader are as follows:

Parameter Name	Description	Type
type	Fixed as `CommonDataLoader`. This module supports loading datasets from HuggingFace and the ModelScope open-source community.	str
packing	Packing configuration when processing datasets with `handler`. Options include `pack` and `truncate`.	str
load_func	The function used to load datasets. Options are `load_dataset` and `load_from_disk`. Use `load_from_disk` for data saved via the `save_to_disk` function, and `load_dataset` for other scenarios. The default value is `load_dataset`.	str
path	When `load_func=load_dataset`, this parameter aligns with the interface in datasets.load_dataset. When `load_func=load_from_disk`, it specifies the dataset loading path.	str
data_files	When `load_func=load_dataset`, this parameter aligns with the interface in datasets.load_dataset. It is ineffective when `load_func=load_from_disk`.	str
handler	Multiple `handlers` can be configured to preprocess the loaded dataset in the order specified. For details on `handler` configuration, refer to the handler parameter description in Custom Data Handler.	list
adaptor_config	Dataset-related configuration during model training. Currently supports `compress_mask`, effective when `packing` is set. If enabled, it returns a compressed data mask. Default is `False`.	dict
shuffle	Indicates whether random sampling is enabled when loading the dataset.	bool
column_names	Specifies the column names returned by the dataset. If not set, all columns are returned.	list
is_dynamic	Indicates whether the dataset returns dynamic-length data. Default is `False`.	bool

In addition to the above configurations, all parameters from the datasets.load_dataset interface are supported with the same meanings and functions.

When packing is configured, the dataset returns an actual_seq_len column. For more information, refer to the actual_seq_qlen and ctual_seq_kvlen parameter descriptions in the documentation.

Feature Introduction

Dynamic Sequence Length Fine-Tuning

CommonDataLoader supports dynamic shape fine-tuning using HuggingFace datasets, which can be loaded online or offline. Below, we use the alpaca dataset as an example to demonstrate the configuration for dynamic shape fine-tuning.

Online Loading

The online dataset name is llm-wizard/alpaca-gpt4-data. You can search and download it from the HuggingFace official website or load it directly using the online name.

Example configuration for online loading:

train_dataset: &train_dataset
  input_columns: &input_columns ["input_ids", "labels"]
  dynamic_batch: True                    # Enable dynamic shape
  divisor: 32                            # With divisor and remainder configured, seq_length in dynamic shape will become a multiple of divisor and the sum of remainder
  remainder: 1
  data_loader:
    type: CommonDataLoader
    shuffle: True
    split: "train"                       # Subset name of the online dataset
    path: "llm-wizard/alpaca-gpt4-data"  # Online dataset name
    handler:
      - type: AlpacaInstructDataHandler
        tokenizer_name: llama2_7b
        seq_length: 4096
        prompt_key: "conversations"
        output_columns: *input_columns
        is_dynamic: True
  seed: 0
  num_parallel_workers: 8
  python_multiprocessing: False
  drop_remainder: True
  repeat: 1
  numa_enable: False
  prefetch_size: 1

For parameter descriptions in train_dataset, please refer to the documentation.
AlpacaInstructDataHandler is an online processing script developed for the alpaca dataset. If using a different dataset, you need to implement a custom data handler by referring to the Custom Data Handler guide.

Offline Loading

For offline loading, you need to prepare the JSON files of the alpaca dataset. The offline configuration differs from the online configuration only in the following parameters:

 train_dataset:
   data_loader:
     path: "json"                               # loading datasets using the load_dataset interface
     data_files: '/path/alpaca_gpt4_data.json'  # the file path of the alpaca dataset

After configuring the dataset loading method, you also need to set is_dynamic=True in the model configuration to enable dynamic shape training for the model.

model_config:
  is_dynamic: True

Since dynamic shapes may lead to operator compilation caching, it is recommended to set the following environment variables to limit the number of cached compilations when running in a memory-constrained environment. This helps prevent out-of-memory issues:

export ACLNN_CACHE_LIMIT=10
export MS_DEV_RUNTIME_CONF="aclnn_cache_queue_length:64"

The ACLNN_CACHE_LIMIT parameter description can be found in the documentation.
MS_DEV_RUNTIME_CONF is a parameter in MindSpore for setting the operator cache queue length. The value 64 represents the length of the sequence, which defaults to 1024. This can be adjusted based on the actual environment. Setting the value too small may affect model training performance.

After completing all the configurations above, you can proceed with dynamic shape fine-tuning by referring to the documentation for the specific model you are using.

Custom Data Handler

Users can define custom data handlers to apply various preprocessing logic to the loaded dataset.

Handler Parameter Description

Parameter Name	Description	Type
type	Custom data handler name. A custom handler must inherit from `BaseInstructDataHandler`.	str
tokenizer_name	Name of the tokenizer used.	str
tokenizer	Tokenizer configuration parameters. Can be a dictionary, string, or a `tokenizer` object. Takes lower priority than `tokenizer_name`.	dict/str
seq_length	Maximum sequence length, usually the same as the model's sequence length.	int
output_columns	Column names of the processed data returned after preprocessing.	list
prompt_key	Column name for data after applying prompt processing.	str

Development Sample 1

The custom data handler is usually placed in the mindformers/dataset/handler directory, and the customized one needs to inherit the abstract base class BaseInstructDataHandler. You need to implement format_func and tokenize_func methods, which preprocess each data loaded. Refer to alpaca_handler.py.

@MindFormerRegister.register(MindFormerModuleType.DATA_HANDLER)
class XXXInstructDataHandler(BaseInstructDataHandler):

    def format_func(self, example):
        # Custom data format conversion

    def tokenize_func(self, example):
        # Custom tokenizer split word processing

The BaseInstructDataHandler provides an implementation of the entry handler method by default, which is used to iterate over each piece of data for data preprocessing. The format_func is used to implement how to convert the raw data into the desired data format, and the tokenize_func method is used to take the processed data and perform a customized tokenization. The input parameter example in the example is each of the samples obtained.

Development Sample 2

If you want to process the data directly for the whole dataset instead of processing each piece of data in batches, you can implement the entry handle method in custom handler, and you will get the complete dataset, as shown below:

    def handle(self, dataset):
        """data handler"""
        return dataset.rename_columns({"content":"prompt","summary":"answer"})

alpaca Dataset Sample

Modify the task configuration file finetune_llama2_7b.yaml.

Modify the following parameters:

train_dataset: &train_dataset
  input_columns: &input_columns ["input_ids", "labels"]
  data_loader:
    type: CommonDataLoader
    shuffle: True
    split: "train"
    path: "llm-wizard/alpaca-gpt4-data"
    handler:
      - type: AlpacaInstructDataHandler
        tokenizer_name: llama2_7b
        seq_length: 4096
        prompt_key: "conversations"
        output_columns: *input_columns
  seed: 0
  num_parallel_workers: 8
  python_multiprocessing: False
  drop_remainder: True
  repeat: 1
  numa_enable: False
  prefetch_size: 1

The rest of the parameters can be described in "model training configuration" and "model evaluation configuration Configuration File Description.

Custom data handler:

@MindFormerRegister.register(MindFormerModuleType.DATA_HANDLER)
class AlpacaInstructDataHandler(BaseInstructDataHandler):

    def format_func(self, example):
        """format func"""
        source = PROMPT_INPUT.format_map(example) \
            if example.get(self.input_key, "") != "" \
            else PROMPT_NO_INPUT.format_map(example)
        target = example.get(self.output_key)
        formatted_example = [
            {
                "from": self.user_role,
                "value": source,
            },
            {
                "from": self.assistant_role,
                "value": target,
            },
        ]

        return formatted_example

    def tokenize_func(self, messages):
        """tokenize func"""
        conversation = self.gen_prompt(messages)
        sep = self.template.sep + self.assistant_role + ": "
        # Tokenize conversations
        rounds = conversation.split(self.template.sep2)
        ids = [self.tokenizer.bos_token_id]
        mask = [1]
        for _, rou in enumerate(rounds):
            if rou == "":
                break
            conv_out = self.tokenizer(rou)
            ids.extend(conv_out['input_ids'][1:])
            mask.extend(conv_out['attention_mask'][1:])
        d = {'input_ids': ids, 'attention_mask': mask}
        # pylint: disable=W0212
        if not self.dynamic:
            d = self.tokenizer._pad(d, max_length=self.seq_length + 1, padding_strategy='max_length')
        input_id = d['input_ids'][:self.seq_length + 1]
        target = np.array(d['input_ids'])
        total_len = int(np.not_equal(target, self.tokenizer.pad_token_id).sum())
        cur_len = 1
        target[:cur_len] = self.ignore_token_id
        for _, rou in enumerate(rounds):
            if rou == "":
                break
            parts = rou.split(sep)
            if len(parts) != 2:
                break
            parts[0] += sep
            round_len = len(self.tokenizer(rou)['input_ids']) - 1
            instruction_len = len(self.tokenizer(parts[0])['input_ids']) - 3

            target[cur_len: cur_len + instruction_len] = self.ignore_token_id

            cur_len += round_len
        if self.dynamic:
            return {
                "input_ids": input_id,
                "labels": target[:len(input_id)].tolist()
            }
        target[cur_len:] = self.ignore_token_id
        if cur_len < self.seq_length + 1:
            if cur_len != total_len:
                target[:] = self.ignore_token_id
        else:
            target = target[:self.seq_length + 1]
        label = target.tolist()
        return {
            "input_ids": input_id,
            "labels": label,
        }

ADGEN Dataset Sample

Modify the task configuration file run_glm3_6b_finetune_2k_800T_A2_64G.yaml.

Modify the following parameters:

train_dataset: &train_dataset
  data_loader:
    type: CommonDataLoader
    path: "HasturOfficial/adgen"
    split: "train"
    shuffle: True
    handler:
      - type: AdgenInstructDataHandler
    phase: "train"
    version: 3
    column_names: ["prompt", "answer"]
  tokenizer:
    type: ChatGLM3Tokenizer
    vocab_file: "/path/to/tokenizer.model"
  input_columns: ["input_ids", "labels"]
  max_source_length: 1024
  max_target_length: 1023
  ignore_pad_token_for_loss: True
  num_parallel_workers: 8
  python_multiprocessing: False
  drop_remainder: True
  batch_size: 8
  repeat: 1
  numa_enable: False
  prefetch_size: 1
  seed: 0

The rest of the parameters can be described in "model training configuration" and "model evaluation configuration Configuration File Description.

Custom adgen_handler:

@MindFormerRegister.register(MindFormerModuleType.DATA_HANDLER)
class AdgenInstructDataHandler(BaseInstructDataHandler):
    """agden data handler"""
    def handle(self, dataset):
        """data handler"""
        return dataset.rename_columns({"content": "prompt", "summary": "answer"})

Dataset Packing

Configuring PackingHandler in CommonDataLoader allows for packing processing of the data. Currently, the original data needs to be processed into input_ids and labels that can be fed into the model during the preprocessing step.

Parameter Description

Parameter Name	Description	Type
type	Fixed as `PackingHandler`. This module supports packing data. When `packing=pack` or `packing=truncate` is configured in dataloader, it performs non-truncating and truncating concatenation of the data, respectively.	str
seq_length	Maximum sequence length of the data after packing.	int
pad_token	Token ID used for padding `input_ids` when the packed sample does not reach the maximum length. Default value is 0.	int
ignore_token	Token ID used for padding `labels` when the packed sample does not reach the maximum length. Default value is -100.	int

Packing Example

By following the configuration below, the alpaca dataset can be preprocessed to achieve online packing.

train_dataset: &train_dataset
  input_columns: &input_columns ["input_ids", "labels", "loss_mask", "position_ids", "attention_mask"]
  construct_args_key: *input_columns
  data_loader:
    type: CommonDataLoader
    shuffle: False
    split: "train"
    path: "llm-wizard/alpaca-gpt4-data"
    packing: pack
    handler:
      - type: AlpacaInstructDataHandler
        tokenizer_name: llama2_7b
        seq_length: 4096
        prompt_key: "conversations"
        output_columns: ["input_ids", "labels"]
      - type: PackingHandler
        seq_length: 4096
        output_columns: ["input_ids", "labels", "actual_seq_len"]
    adaptor_config:
       compress_mask: False
  seed: 0
  num_parallel_workers: 8
  python_multiprocessing: False
  drop_remainder: True
  repeat: 1
  numa_enable: False
  prefetch_size: 1

Using the above configuration file to process the alpaca dataset will execute the following steps:

The raw text data will be processed into input_ids and labels using AlpacaInstructDataHandler and the tokenizer of llama2_7b.
PackingHandler will be used to perform packing on the processed input_ids and labels, resulting in concatenated input_ids and labels up to the seq_length. The actual_seq_len refers to the sequence length of each sub-sample in the concatenated sample. During training, this parameter will be used to generate the corresponding data mask.
If compress_mask=False is set in adaptor_config, a complete data mask will be returned during training. Otherwise, actual_seq_len will be returned.

Offline Dataset Processing

In addition to supporting online dataset loading and processing, CommonDataLoader also supports offline dataset processing and saving.

The datasets_preprocess.py script can be used to process Huggingface datasets offline and save them.

Parameter Description

Parameter Name	Description	Type
config	Configuration file for offline data processing, which is used in the same way as online processing. Refer to dataloader for details.	str
save_path	Path where the preprocessed dataset will be saved.	str
register_path	Registration path for the model API, which includes the Python files related to the model, typically the model folder under the `research` directory.	int

Usage Example

You can use the configuration file provided in the dataset packing example and execute the following command.

python toolkit/data_preprocess/huggingface/datasets_preprocess.py \
  --config data_process.yaml \
  --save_path /path/processed_data

If you need to load the saved dataset, you should modify the YAML configuration as follows:

train_dataset: &train_dataset
  input_columns: &input_columns ["input_ids", "labels", "loss_mask", "position_ids", "attention_mask"]
  construct_args_key: *input_columns
  data_loader:
    type: CommonDataLoader
    shuffle: False
    load_func: "load_from_disk"
    path: "/path/processed_data"
    adaptor_config:
       compress_mask: False