Dataset
At present, MindSpore Transformers' pre-training and fine-tuning support loading datasets in multiple formats, including Megatron datasets, MindRecord datasets, and HuggingFace datasets. The usage instructions for each dataset format are as follows.
Megatron Dataset
A Megatron dataset is a dataset collected from multiple different sources; it contains different text types, formats, and domains. Using such a dataset helps models learn a wider range of language features and knowledge, thereby improving their generalization ability and performance. The current implementation of the Megatron framework requires preprocessing the original dataset into a BIN format dataset. MindSpore Transformers has been natively adapted to the Megatron dataset, providing scripts for creating BIN format datasets and supporting direct use of Megatron datasets in training tasks.
How to Make a BIN Format Dataset
MindSpore Transformers provides the preprocessing script mindformers/tools/dataset_preprocess/preprocess_indexed_dataset.py, which converts text data into a BIN format dataset. The script currently only supports files in a specific JSON format: users first convert the original dataset file into a file in this JSON format, and then use the preprocessing script to generate the BIN format dataset files. Some models in MindSpore Transformers provide scripts for converting specific open-source datasets into JSON format files. Users who want to use their own datasets need to write their own conversion scripts to produce the required format; a minimal conversion sketch is shown after the key descriptions below.
The format of the required JSON format file content is as follows:
{"id": "0", "text": "The quick brown fox", "type": "Eng", "src": "www.nvidia.com", "title": "First Part"}
{"id": "1", "text": "jumps over the lazy dog", "type": "Eng", "src": "The Internet", "title": "Second Part"}
...
Each piece of data consists of several key-value pairs. The supported keys and their descriptions are as follows:

- "id": Sequential numbering of the data, required
- "text": Text data actually used for training, required
- "type": Indicates the language type, optional
- "src": Indicates the source of the data, optional
- "title": Indicates the title of the data, optional
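If your raw corpus is, say, a directory of plain-text files (a hypothetical layout used only for illustration), a minimal conversion sketch could look like the following; adapt the field extraction to your own data:

import glob
import json

# Minimal sketch: convert a directory of plain-text files (hypothetical layout)
# into the JSON lines format expected by preprocess_indexed_dataset.py.
with open("corpus.json", "w", encoding="utf-8") as out_file:
    for idx, txt_path in enumerate(sorted(glob.glob("raw_corpus/*.txt"))):
        with open(txt_path, "r", encoding="utf-8") as in_file:
            text = in_file.read().strip()
        if not text:
            continue
        # "id" and "text" are required; "src" is optional.
        record = {"id": str(idx), "text": text, "src": txt_path}
        out_file.write(json.dumps(record, ensure_ascii=False) + "\n")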
Taking the processing of the Wiki dataset and its use in Llama2 model pre-training as an example, the detailed steps for making a BIN format dataset are explained below:
Download Wiki Dataset
To download the original Wiki dataset, refer to Llama2 Dataset Download.
Generate JSON Format File
The original format of the Wiki Dataset is as follows:
The format of the JSON file wiki.json after processing the Wiki dataset is as follows (long text omitted):

{"id": 0, "text": "The gold dollar or gold one ..."}
{"id": 1, "text": "Super Mario Land is a 1989 ..."}
{"id": 2, "text": "The Sinclair Scientific Programmable ..."}
...
Download The Vocabulary File For Llama2
In the preprocessing script, the raw text data will be processed into tokens using the model's tokenizer; therefore, the vocabulary file needs to be downloaded in advance.
Download link for Llama2 vocabulary file: tokenizer.model
Generate BIN Format Files Using Preprocessing Scripts
After processing the data into the specific JSON format file described above, use mindformers/tools/dataset_preprocess/preprocess_indexed_dataset.py to convert it into a BIN format dataset. The specific command is as follows:
python mindformers/tools/dataset_preprocess/preprocess_indexed_dataset.py \
  --input ./wiki.json \
  --output-prefix wiki_processed_1024 \
  --tokenizer-type LlamaTokenizer \
  --vocab-file ./tokenizer.model \
  --add_bos_token True \
  --add_eos_token True \
  --pad_or_stitch stitch \
  --seq-length 1024 \
  --workers 1
Configuration parameter description:
- --input: Path to the JSON format file
- --output-prefix: File name prefix of the preprocessed output files
- --tokenizer-type: Type of tokenizer corresponding to the model
- --vocab-file: Path to the vocabulary file (tokenizer.model) of the model's tokenizer
- --add_bos_token: Whether to add a bos_token at the beginning of the data. Default: False
- --add_eos_token: Whether to add an eos_token at the end of the data. Default: False
- --pad_or_stitch: Whether to pad or stitch the data, according to the requirements of the training task. pad mode pads data shorter than seq-length up to seq-length; stitch mode concatenates multiple pieces of data into data of length seq-length
- --seq-length: Length of each piece of data after preprocessing
- --workers: Number of parallel workers during preprocessing
After executing the above command, two files will be obtained, in .bin and .idx formats respectively. The .bin file stores the actual data, and the .idx file stores the index of each piece of data.
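As an optional sanity check (a sketch assuming the files were generated in the current directory with the wiki_processed_1024 prefix; the exact suffix may differ), you can confirm that both output files exist and are non-empty:

import glob
import os

# Optional sanity check: list the generated .bin/.idx pair and their sizes.
for path in sorted(glob.glob("wiki_processed_1024*")):
    print(path, os.path.getsize(path), "bytes")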
Using Megatron Datasets in Training Tasks
Use the Megatron multi-source dataset in the training task as follows:
Prepare the parallel_speed_up.json file

parallel_speed_up.json is a dataset parallel communication configuration file. The file content is as follows:

{
  "dataset_broadcast_opt_level": 3
}
Set environment variables
Enter the following command at the command line to set environment variables:
export MS_DEV_DYNAMIC_SINK1=False
Modify YAML configuration files for training tasks
Configure the relevant parameters of the Megatron dataset in the YAML configuration file. Taking the Llama2-7B model pre-training task as an example, modify train_dataset, runner_config, parallel_config, parallel and context in pretrain_llama2_7b.yaml. The specific modifications and explanations are as follows:

train_dataset: &train_dataset
  data_loader:
    type: BlendedMegatronDatasetDataLoader
    datasets_type: "GPTDataset"
    sizes:
      - 1000
      - 0
      - 0
    shuffle: False
    input_columns: ["input_ids", "labels", "loss_mask", "position_ids"]
    config:
      seed: 1234
      seq_length: 1024
      split: "1, 0, 0"
      data_path:
        - 0.3
        - "/path/to/my_wiki_test_1024_text_document"
        - 0.7
        - "/path/to/my_wiki_test_1024_text_document"
      num_dataset_builder_threads: 1
      eod_mask_loss: False
      create_attention_mask: False
Among them:

- data_loader.type: The type of the dataloader, which should be set to BlendedMegatronDatasetDataLoader.
- data_loader.datasets_type: The dataset type; currently only GPTDataset is supported.
- data_loader.sizes: 1000, 0, and 0 are the sampling sizes for the training set, test set, and validation set, respectively. Currently, only the training set can be configured.
- input_columns: Sets the input data columns of the training dataset; typically configured as ["input_ids", "labels", "loss_mask", "position_ids"].
- data_loader.config.seed: Random number seed used when creating the dataset. Default: 1234.
- data_loader.config.seq_length: The length of each piece of data; must be consistent with model.model_config.seq_length in the YAML configuration.
- data_loader.config.split: Split string; the weights of the training set, test set, and validation set separated by commas, used to split the dataset when drawing samples from a single distribution. Currently, only "1, 0, 0" is supported.
- data_loader.config.data_path: The numbers are the sampling weights of each dataset, and the strings are the paths of the dataset BIN files with the .bin suffix removed.
- data_loader.config.num_dataset_builder_threads: The number of processes used when creating the dataset. Default: 1.
- data_loader.config.eod_mask_loss: Whether to enable the eod mask. Default: False.
- data_loader.config.create_attention_mask: Whether to construct attention_mask. Default: True.
The current Megatron dataset still has limitations: it only supports non-full-batch scenarios and does not support the seq_pipe parallel feature. The corresponding configuration items need to be modified as follows:
runner_config:
  sink_mode: True
  sink_size: 1

parallel_config:
  data_parallel: &dp 2
  model_parallel: 2
  pipeline_stage: 1

parallel:
  full_batch: False
  dataset_strategy: [[*dp, 1], [*dp, 1], [*dp, 1], [*dp, 1]]

context:
  ascend_config:
    parallel_speed_up_json_path: "/path/to/parallel_speed_up.json"
The configuration instructions that need to be noted are as follows:
parallel.dataset_strategy: Only the List of List type is supported. The number of sub-lists must equal the length of train_dataset.input_columns, and each sub-list must be consistent with the shape of the data returned by the dataset. Since data parallel partitioning is generally performed along the first dimension of the data, the first element of each sub-list is configured as *dp and the other elements as 1. The specific principle can be found in Dataset Segmentation.
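As an illustration of the rule above (a hedged sketch, not a MindSpore Transformers API), the following snippet checks that a dataset_strategy list matches the number of input columns and uses the data-parallel degree in the first dimension:

# Illustrative consistency check for the rule described above:
# one sub-list per entry in input_columns, with data parallelism on the first dimension.
data_parallel = 2
input_columns = ["input_ids", "labels", "loss_mask", "position_ids"]
dataset_strategy = [[2, 1], [2, 1], [2, 1], [2, 1]]  # [[*dp, 1], ...] with dp = 2

assert len(dataset_strategy) == len(input_columns)
assert all(s[0] == data_parallel and all(x == 1 for x in s[1:]) for s in dataset_strategy)
print("dataset_strategy is consistent with input_columns")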
Compile Megatron Dataset module
MindSpore Transformers has built-in Megatron dataset module code. Before starting the training task, the following commands need to be executed to compile it:
pip install pybind11
cd mindformers/dataset/blended_datasets
make
MindRecord Dataset
MindRecord is an efficient data format developed by MindSpore for storing machine learning or deep learning datasets.
The MindRecord format is designed to improve data processing efficiency, especially in large-scale data training scenarios where data can be loaded and processed faster. MindRecord files typically contain the input samples needed for model training, which are preprocessed (e.g., encoded, normalized) to optimize read speed and memory usage.
For more information about the implementation of MindRecord related interfaces and examples, please refer to the documentation about MindRecord in MindSpore.
How to Make a MindRecord Dataset
The MindRecord module provides methods to convert different datasets into MindRecord format. You can use the FileWriter interface provided by MindSpore to generate MindRecord format datasets.
The following takes Llama2 as an example to show how to make a MindRecord dataset from a JSON format file:
Prepare a JSON file

Prepare a JSON file named mydata.json, like this:

[
  {
    "text": "I love Beijing, because it is a city that beautifully blends rich history with modern vibrancy."
  },
  {
    "text": "I love Hangzhou, because it is a city that seamlessly combines natural beauty with rich cultural heritage."
  }
]
Read json file
import json

raw_data = None
file = open("mydata.json", "r")  # Open json file
if file is not None:
    raw_data = json.load(file)  # Read json file into raw_data
file.close()
Define a MindRecord schema and create a FileWriter object:

from mindspore.mindrecord import FileWriter

# Define a schema for MindRecord
schema = {'input_ids': {"type": "int32", "shape": [-1]}}

# Create a FileWriter object
writer = FileWriter(file_name="output_file", shard_num=1)
writer.add_schema(schema, "dataset_type")
Iterate through each piece of data in the processed json file, convert it to MindRecord format, and write it to a MindRecord file.
Vocabulary file download link: tokenizer.model

import numpy as np
from mindformers import LlamaTokenizer

def tokenize_json(tokenizer, raw_data):
    """tokenize json file dataset"""
    content = []
    # Read each json data and get its "input_ids"
    for line in raw_data:
        stripped_line = line['text'].strip()
        if stripped_line:
            line_ids = tokenizer(stripped_line)["input_ids"]
            content.append(line_ids)

    for ids in content:
        sample = {}
        sample['input_ids'] = np.array(ids, dtype=np.int32)
        yield sample

# Tokenize the text data
word_tokenizer = LlamaTokenizer(vocab_file=r"tokenizer.model")

# Iterate through each piece of data in the processed json file,
# convert it to MindRecord format and write it to the MindRecord file.
# tokenize_json is a custom method to tokenize the dialog data in json.
for x in tokenize_json(word_tokenizer, raw_data):
    writer.write_raw_data([x])
writer.commit()
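To verify the generated file, you can read it back with mindspore.dataset.MindDataset (an optional check, assuming the file was written as output_file above):

import mindspore.dataset as ds

# Optional check: read the MindRecord file written above and inspect one sample.
dataset = ds.MindDataset(dataset_files="output_file", columns_list=["input_ids"])
for item in dataset.create_dict_iterator(num_epochs=1, output_numpy=True):
    print(item["input_ids"][:16])  # first tokens of one sample
    break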
For the detailed cases, refer to Examples of Data Preprocessing in Llama2.
Using MindRecord Format Datasets in Tasks
You can make a training or evaluation task use a prepared MindRecord format dataset by configuring dataset-related parameters in the yaml configuration file.
Here, as an example, for the Llama2-7B model pretraining task, the default configuration parameters and descriptions in the pretrain_llama2_7b.yaml file are as follows:
# dataset
train_dataset: &train_dataset
  data_loader:
    type: MindDataset
    dataset_dir: ""
    shuffle: True
  input_columns: ["input_ids"]
  num_parallel_workers: 8
  python_multiprocessing: False
  drop_remainder: True
  batch_size: 6
  repeat: 1
  numa_enable: False
  prefetch_size: 1
train_dataset_task:
  type: CausalLanguageModelDataset
  dataset_config: *train_dataset
Configure the following parameters to use MindRecord format datasets:
- data_loader.type: The type of the dataloader, which needs to be set to MindDataset.
- data_loader.dataset_dir: The path to the dataset files.
- input_columns: Sets the input data columns of the training dataset. For the current pre-training scenario, set it to ["input_ids"].
The rest of the parameters can be found in the "model training configuration" and "model evaluation configuration" sections of the Configuration File Description.
HuggingFace Datasets
Currently, the dataset loading functionality has been integrated with the Modelers open-source community and the HuggingFace community, supporting online dataset loading and preprocessing. Additionally, datasets can be packed to enhance model training efficiency.
Usage Instructions
HuggingFace datasets support online and offline loading of datasets from both the HuggingFace community and the Modelers open-source community. Below is an introduction to environment preparation, the dataset loading process, and how to configure the use of HuggingFace datasets in configuration files.
Integrating with Open-Source Communities
Integrating with HuggingFace Community
To use datasets from the HuggingFace community, follow these steps:
Environment Setup

The environment variable HF_ENDPOINT controls the remote repository used by HuggingFace. By default, it is set to https://huggingface.co. For users in China, it is recommended to configure it to a mirror address.

Install Dependencies

pip install datasets
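After installing the dependency, you can optionally verify that online loading works; the following is a small sketch using the standard datasets API and the alpaca dataset name used later in this document:

from datasets import load_dataset

# Optional connectivity check: load the alpaca dataset used later in this document.
dataset = load_dataset("llm-wizard/alpaca-gpt4-data", split="train")
print(dataset)     # column names and number of rows
print(dataset[0])  # first raw sample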
Integrating with Modelers Open-Source Community
To use datasets from the Modelers open-source community, follow these steps:
Environment Setup

The environment variable OPENMIND_HUB_ENDPOINT controls the remote repository used by the Modelers open-source community. When not configured, it defaults to:

export OPENMIND_HUB_ENDPOINT=https://telecom.openmind.cn

Install Dependencies

git clone https://gitee.com/openmind-ai/openmind-hub.git
cd openmind-hub
pip install -e .
cd ..
git clone https://gitee.com/foundation-models/openmind-datasets.git
cd openmind-datasets
pip install -e .
cd ..
When the openmind-datasets component is installed in the environment, the default is to interface with the Modelers open-source community. The environment variable USE_OM controls which community to interface with: the default value ON interfaces with the Modelers community; change it to OFF to interface with the HuggingFace community.
Dataset Loading Process
The online dataset loading and processing functionality is primarily implemented through CommonDataLoader. The data loading part can be customized via configuration files, with detailed configuration instructions available in the dataloader parameter description. The online loading module requires users to implement customizations for different datasets. For example, the AlpacaInstructDataHandler class can be used to preprocess the alpaca dataset. For more information, please refer to Custom Data Handler.
dataloader Parameter Description
The online dataset loading feature is enabled by configuring data_loader in the configuration file. Below is an example configuration for online dataset loading:
train_dataset:
  input_columns: &input_columns ["input_ids", "labels", "loss_mask", "position_ids", "attention_mask"]
  construct_args_key: *input_columns
  data_loader:
    type: CommonDataLoader
    load_func: 'load_dataset'
    shuffle: False
    split: "train"
    path: "llm-wizard/alpaca-gpt4-data"
    is_dynamic: False
    packing: pack
    handler:
      - type: AlpacaInstructDataHandler
        tokenizer_name: llama2_7b
        seq_length: 4096
        prompt_key: "conversations"
        output_columns: ["input_ids", "labels"]
      - type: PackingHandler
        seq_length: 4096
        output_columns: ["input_ids", "labels", "actual_seq_len"]
    adaptor_config:
      compress_mask: False
    column_names: *input_columns
Parameter descriptions for data_loader are as follows:

| Parameter Name | Description | Type |
|---|---|---|
| type | Fixed as CommonDataLoader. | str |
| packing | Packing configuration used when processing the dataset with PackingHandler; the example above uses the pack mode. | str |
| load_func | The function used to load datasets. Options are load_dataset and load_from_disk. | str |
| path | When load_func is load_dataset, this has the same meaning as the path parameter of the datasets.load_dataset interface; when load_func is load_from_disk, it is the path of the dataset to load. | str |
| data_files | When load_func is load_dataset and path is set to a data file format (for example "json"), this specifies the path of the data files. | str |
| handler | Multiple handlers can be configured; they preprocess the loaded dataset in the order in which they are listed. | list |
| adaptor_config | Dataset-related configuration during model training. Currently supports compress_mask, which takes effect when packing is configured; if True, a compressed data mask is returned. Default: False. | dict |
| shuffle | Indicates whether random sampling is enabled when loading the dataset. | bool |
| column_names | Specifies the column names returned by the dataset. If not set, all columns are returned. | list |
| is_dynamic | Indicates whether the dataset returns dynamic-length data. Default: False. | bool |
In addition to the above configurations, all parameters from the datasets.load_dataset interface are supported with the same meanings and functions.
When packing is configured, the dataset returns an actual_seq_len column. For more information, refer to the actual_seq_qlen and actual_seq_kvlen parameter descriptions in the documentation.
Feature Introduction
Dynamic Sequence Length Fine-Tuning
CommonDataLoader supports dynamic shape fine-tuning using HuggingFace datasets, which can be loaded online or offline. Below, we use the alpaca dataset as an example to demonstrate the configuration for dynamic shape fine-tuning.
Online Loading
The online dataset name is llm-wizard/alpaca-gpt4-data. You can search and download it from the HuggingFace official website or load it directly using the online name.

Example configuration for online loading:

train_dataset:
  input_columns: &input_columns ["input_ids", "labels"]
  dynamic_batch: True  # Enable dynamic shape
  divisor: 32          # With divisor and remainder configured, seq_length in dynamic shape will become a multiple of divisor plus remainder
  remainder: 1
  data_loader:
    type: CommonDataLoader
    shuffle: True
    split: "train"                       # Subset name of the online dataset
    path: "llm-wizard/alpaca-gpt4-data"  # Online dataset name
    is_dynamic: True
    handler:
      - type: AlpacaInstructDataHandler
        tokenizer_name: llama2_7b
        seq_length: 4096
        prompt_key: "conversations"
        output_columns: *input_columns
  seed: 0
  num_parallel_workers: 8
  python_multiprocessing: False
  drop_remainder: True
  repeat: 1
  numa_enable: False
  prefetch_size: 1
For parameter descriptions in train_dataset, please refer to the documentation.

AlpacaInstructDataHandler is an online processing script developed for the alpaca dataset. If using a different dataset, you need to implement a custom data handler by referring to the Custom Data Handler guide.
Offline Loading
For offline loading, you need to prepare the JSON files of the alpaca dataset. The offline configuration differs from the online configuration only in the following parameters:

train_dataset:
  data_loader:
    path: "json"                               # load datasets using the load_dataset interface
    data_files: '/path/alpaca_gpt4_data.json'  # the file path of the alpaca dataset
After configuring the dataset loading method, you also need to set is_dynamic=True
in the model configuration to enable dynamic shape training for the model.
model_config:
  is_dynamic: True
Since dynamic shapes may lead to operator compilation caching, it is recommended to set the following environment variables to limit the number of cached compilations when running in a memory-constrained environment. This helps prevent out-of-memory issues:
export ACLNN_CACHE_LIMIT=10
export MS_DEV_RUNTIME_CONF="aclnn_cache_queue_length:64"
The ACLNN_CACHE_LIMIT parameter description can be found in the documentation. MS_DEV_RUNTIME_CONF is a MindSpore parameter for setting the operator cache queue length. The value 64 represents the length of the cache queue, which defaults to 1024; it can be adjusted based on the actual environment. Setting the value too small may affect model training performance.
After completing all the configurations above, you can proceed with dynamic shape fine-tuning by referring to the documentation for the specific model you are using.
Custom Data Handler
Users can define custom data handlers to apply various preprocessing logic to the loaded dataset.
Handler Parameter Description
| Parameter Name | Description | Type |
|---|---|---|
| type | Custom data handler name. A custom handler must inherit from BaseInstructDataHandler. | str |
| tokenizer_name | Name of the tokenizer used. | str |
| tokenizer | Tokenizer configuration parameters. Can be a dictionary, a string, or a tokenizer object. | dict/str |
| seq_length | Maximum sequence length, usually the same as the model's sequence length. | int |
| output_columns | Column names of the processed data returned after preprocessing. | list |
| prompt_key | Column name for data after applying prompt processing. | str |
Development Sample 1
The custom data handler is usually placed in the mindformers/dataset/handler directory, and a custom handler needs to inherit the abstract base class BaseInstructDataHandler.

You need to implement the format_func and tokenize_func methods, which preprocess each piece of loaded data. Refer to alpaca_handler.py.
@MindFormerRegister.register(MindFormerModuleType.DATA_HANDLER)
class XXXInstructDataHandler(BaseInstructDataHandler):

    def format_func(self, example):
        # Custom data format conversion
        ...

    def tokenize_func(self, example):
        # Custom tokenizer split word processing
        ...
BaseInstructDataHandler provides a default implementation of the entry method handle, which iterates over each piece of data for preprocessing.

The format_func method defines how to convert the raw data into the desired data format, and the tokenize_func method takes the formatted data and performs customized tokenization.

The input parameter example is each data sample obtained from the dataset.
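For instance, a minimal custom handler might look like the sketch below. The "question"/"answer" field names are hypothetical, and the attributes used (self.tokenizer, self.seq_length, self.ignore_token_id) are the same ones used by AlpacaInstructDataHandler later in this document; registration and imports follow the skeleton above.

@MindFormerRegister.register(MindFormerModuleType.DATA_HANDLER)
class QAInstructDataHandler(BaseInstructDataHandler):
    """Illustrative handler for a dataset with hypothetical "question"/"answer"
    columns; a sketch, not shipped with MindSpore Transformers."""

    def format_func(self, example):
        # Convert one raw sample into a prompt/response pair.
        return {
            "prompt": example.get("question", ""),
            "response": example.get("answer", ""),
        }

    def tokenize_func(self, example):
        # Tokenize the formatted sample; mask the prompt part of the labels
        # with ignore_token_id so only the response contributes to the loss.
        prompt_ids = self.tokenizer(example["prompt"])["input_ids"]
        response_ids = self.tokenizer(example["response"])["input_ids"][1:]
        input_ids = (prompt_ids + response_ids)[:self.seq_length]
        labels = ([self.ignore_token_id] * len(prompt_ids) + response_ids)[:self.seq_length]
        return {"input_ids": input_ids, "labels": labels}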
Development Sample 2
If you want to process the whole dataset directly instead of processing each piece of data in batches, you can implement the entry method handle in the custom handler; it receives the complete dataset, as shown below:
def handle(self, dataset):
    """data handler"""
    return dataset.rename_columns({"content": "prompt", "summary": "answer"})
alpaca Dataset Sample
Modify the task configuration file finetune_llama2_7b.yaml.
Modify the following parameters:
train_dataset:
  input_columns: &input_columns ["input_ids", "labels"]
  data_loader:
    type: CommonDataLoader
    shuffle: True
    split: "train"
    path: "llm-wizard/alpaca-gpt4-data"
    handler:
      - type: AlpacaInstructDataHandler
        tokenizer_name: llama2_7b
        seq_length: 4096
        prompt_key: "conversations"
        output_columns: *input_columns
The rest of the parameters can be found in the "model training configuration" and "model evaluation configuration" sections of the Configuration File Description.
Custom data handler:
@MindFormerRegister.register(MindFormerModuleType.DATA_HANDLER)
class AlpacaInstructDataHandler(BaseInstructDataHandler):

    def format_func(self, example):
        """format func"""
        source = PROMPT_INPUT.format_map(example) \
            if example.get(self.input_key, "") != "" \
            else PROMPT_NO_INPUT.format_map(example)
        target = example.get(self.output_key)
        formatted_example = [
            {
                "from": self.user_role,
                "value": source,
            },
            {
                "from": self.assistant_role,
                "value": target,
            },
        ]

        return formatted_example

    def tokenize_func(self, messages):
        """tokenize func"""
        conversation = self.gen_prompt(messages)
        sep = self.template.sep + self.assistant_role + ": "
        # Tokenize conversations
        rounds = conversation.split(self.template.sep2)
        ids = [self.tokenizer.bos_token_id]
        mask = [1]
        for _, rou in enumerate(rounds):
            if rou == "":
                break
            conv_out = self.tokenizer(rou)
            ids.extend(conv_out['input_ids'][1:])
            mask.extend(conv_out['attention_mask'][1:])
        d = {'input_ids': ids, 'attention_mask': mask}
        # pylint: disable=W0212
        if not self.dynamic:
            d = self.tokenizer._pad(d, max_length=self.seq_length + 1, padding_strategy='max_length')
        input_id = d['input_ids'][:self.seq_length + 1]
        target = np.array(d['input_ids'])
        total_len = int(np.not_equal(target, self.tokenizer.pad_token_id).sum())
        cur_len = 1
        target[:cur_len] = self.ignore_token_id
        for _, rou in enumerate(rounds):
            if rou == "":
                break
            parts = rou.split(sep)
            if len(parts) != 2:
                break
            parts[0] += sep
            round_len = len(self.tokenizer(rou)['input_ids']) - 1
            instruction_len = len(self.tokenizer(parts[0])['input_ids']) - 3
            target[cur_len: cur_len + instruction_len] = self.ignore_token_id
            cur_len += round_len
        if self.dynamic:
            return {
                "input_ids": input_id,
                "labels": target[:len(input_id)].tolist()
            }
        target[cur_len:] = self.ignore_token_id
        if cur_len < self.seq_length + 1:
            if cur_len != total_len:
                target[:] = self.ignore_token_id
        else:
            target = target[:self.seq_length + 1]
        label = target.tolist()
        return {
            "input_ids": input_id,
            "labels": label,
        }
ADGEN Dataset Sample
Modify the task configuration file run_glm3_6b_finetune_2k_800T_A2_64G.yaml.
Modify the following parameters:
train_dataset: &train_dataset
  data_loader:
    type: CommonDataLoader
    path: "xxx/ADGEN"
    split: "train"
    shuffle: True
    handler:
      - type: AdgenInstructDataHandler
        output_columns: ["prompt", "answer"]
  tokenizer:
    type: ChatGLM3Tokenizer
    vocab_file: "/path/to/tokenizer.model"
  input_columns: ["input_ids", "labels"]
  max_source_length: 1024
  max_target_length: 1023
  ignore_pad_token_for_loss: True
  num_parallel_workers: 8
  python_multiprocessing: False
  drop_remainder: True
  batch_size: 8
  repeat: 1
  numa_enable: False
  prefetch_size: 1
  phase: "train"
  version: 3
  seed: 0
The rest of the parameters can be found in the "model training configuration" and "model evaluation configuration" sections of the Configuration File Description.
Custom adgen_handler:
@MindFormerRegister.register(MindFormerModuleType.DATA_HANDLER)
class AdgenInstructDataHandler(BaseInstructDataHandler):
    """adgen data handler"""
    def handle(self, dataset):
        """data handler"""
        return dataset.rename_columns({"content": "prompt", "summary": "answer"})
Dataset Packing
Configuring PackingHandler in CommonDataLoader allows for packing processing of the data. Currently, the original data needs to be processed into input_ids and labels that can be fed into the model during the preprocessing step.
Parameter Description
| Parameter Name | Description | Type |
|---|---|---|
| type | Fixed as PackingHandler. | str |
| seq_length | Maximum sequence length of the data after packing. | int |
| pad_token | Token ID used to pad input_ids. | int |
| ignore_token | Token ID used to pad labels. | int |
Packing Example
By following the configuration below, the alpaca dataset can be preprocessed to achieve online packing.
train_dataset:
  input_columns: &input_columns ["input_ids", "labels", "loss_mask", "position_ids", "attention_mask"]
  construct_args_key: *input_columns
  data_loader:
    type: CommonDataLoader
    shuffle: False
    split: "train"
    path: "llm-wizard/alpaca-gpt4-data"
    packing: pack
    handler:
      - type: AlpacaInstructDataHandler
        tokenizer_name: llama2_7b
        seq_length: 4096
        prompt_key: "conversations"
        output_columns: ["input_ids", "labels"]
      - type: PackingHandler
        seq_length: 4096
        output_columns: ["input_ids", "labels", "actual_seq_len"]
    adaptor_config:
      compress_mask: False
Using the above configuration file to process the alpaca dataset will execute the following steps:

1. The raw text data will be processed into input_ids and labels using AlpacaInstructDataHandler and the tokenizer of llama2_7b.
2. PackingHandler will be used to perform packing on the processed input_ids and labels, resulting in concatenated input_ids and labels up to seq_length. The actual_seq_len column refers to the sequence length of each sub-sample in the concatenated sample; during training, this parameter is used to generate the corresponding data mask.
3. If compress_mask=False is set in adaptor_config, a complete data mask will be returned during training. Otherwise, actual_seq_len will be returned.
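To make the packing step more concrete, here is a small standalone illustration (not the PackingHandler implementation) of how already-tokenized samples could be greedily concatenated up to seq_length, recording the length of each sub-sample as described above:

# Standalone illustration of the packing idea (not the PackingHandler code):
# greedily concatenate tokenized samples up to seq_length and record the
# length of each sub-sample in the packed sample.
def pack_samples(samples, seq_length):
    packed, current, sub_lengths = [], [], []
    for ids in samples:
        if len(current) + len(ids) > seq_length:
            packed.append({"input_ids": current, "actual_seq_len": sub_lengths})
            current, sub_lengths = [], []
        current = current + ids
        sub_lengths = sub_lengths + [len(ids)]
    if current:
        packed.append({"input_ids": current, "actual_seq_len": sub_lengths})
    return packed

print(pack_samples([[1, 2, 3], [4, 5], [6, 7, 8, 9]], seq_length=6))
# [{'input_ids': [1, 2, 3, 4, 5], 'actual_seq_len': [3, 2]},
#  {'input_ids': [6, 7, 8, 9], 'actual_seq_len': [4]}]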
Offline Dataset Processing
In addition to supporting online dataset loading and processing, CommonDataLoader also supports offline dataset processing and saving. The datasets_preprocess.py script can be used to process HuggingFace datasets offline and save them.
Parameter Description
| Parameter Name | Description | Type |
|---|---|---|
| config | Configuration file for offline data processing, used in the same way as for online processing. Refer to dataloader for details. | str |
| save_path | Path where the preprocessed dataset will be saved. | str |
| register_path | Registration path for the model API, which includes the Python files related to the model, typically the model folder under the research directory. | str |
Usage Example
You can use the configuration file provided in the dataset packing example and execute the following command.
python toolkit/data_preprocess/huggingface/datasets_preprocess.py \
--config data_process.yaml \
--save_path /path/processed_data
If you need to load the saved dataset, you should modify the YAML configuration as follows:
train_dataset:
  input_columns: &input_columns ["input_ids", "labels", "loss_mask", "position_ids", "attention_mask"]
  construct_args_key: *input_columns
  data_loader:
    type: CommonDataLoader
    shuffle: False
    load_func: "load_from_disk"
    path: "/path/processed_data"
    adaptor_config:
      compress_mask: False
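To inspect the saved dataset outside of a training task, an optional sketch using the standard datasets API:

from datasets import load_from_disk

# Optional check: inspect the offline-processed dataset saved by datasets_preprocess.py.
dataset = load_from_disk("/path/processed_data")
print(dataset)  # column names and number of rows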