# Dataset

[View Source](https://gitee.com/mindspore/docs/blob/r2.4.10/docs/mindformers/docs/source_en/function/dataset.md)

## MindRecord Dataset

MindRecord is an efficient data format developed by MindSpore for storing machine learning or deep learning datasets. It is designed to improve data processing efficiency, especially in large-scale training scenarios where data can be loaded and processed faster. MindRecord files typically contain the input samples needed for model training, which are preprocessed (e.g., encoded, normalized) to optimize read speed and memory usage.

For more information about the implementation of MindRecord-related interfaces and examples, please refer to the [documentation about MindRecord in MindSpore](https://www.mindspore.cn/docs/en/r2.4.10/api_python/mindspore.mindrecord.html).

### How to Make a MindRecord Dataset

The MindRecord module provides methods to convert different datasets into the MindRecord format. You can use the FileWriter interface provided by MindSpore to generate MindRecord format datasets.

The following example makes a MindRecord dataset from a json file, taking Llama2 as an example:

1. Prepare a json file

    Prepare a json file like this, named `mydata.json`:

    ```json
    [
      {
        "text": "I love Beijing, because it is a city that beautifully blends rich history with modern vibrancy."
      },
      {
        "text": "I love Hangzhou, because it is a city that seamlessly combines natural beauty with rich cultural heritage."
      }
    ]
    ```

2. Read the json file

    ```python
    import json

    raw_data = None
    file = open("mydata.json", "r")  # Open the json file
    if file is not None:
        raw_data = json.load(file)  # Read the json file into raw_data
    file.close()
    ```

3. Define a MindRecord `schema` and create a `FileWriter` object.

    ```python
    from mindspore.mindrecord import FileWriter

    # Define a schema for MindRecord
    schema = {'input_ids': {"type": "int32", "shape": [-1]}}
    # Create a FileWriter object
    writer = FileWriter(file_name="output_file", shard_num=1)
    writer.add_schema(schema, "dataset_type")
    ```

4. Iterate through each piece of data in the processed json file, convert it to the MindRecord format, and write it to a MindRecord file.

    Word list download link: [tokenizer.model](https://ascend-repo-modelzoo.obs.cn-east-2.myhuaweicloud.com/MindFormers/llama2/tokenizer.model)

    ```python
    import numpy as np
    from mindformers import LlamaTokenizer

    def tokenize_json(tokenizer, raw_data):
        """tokenize json file dataset"""
        content = []
        # Read each json record and get its "input_ids"
        for line in raw_data:
            stripped_line = line['text'].strip()
            if stripped_line:
                line_ids = tokenizer(stripped_line)["input_ids"]
                content.append(line_ids)

        for ids in content:
            sample = {}
            sample['input_ids'] = np.array(ids, dtype=np.int32)
            yield sample

    # Tokenize the text data
    word_tokenizer = LlamaTokenizer(vocab_file=r"tokenizer.model")

    # Iterate through each piece of data in the processed json file, convert it to MindRecord format and write it to the MindRecord file
    # tokenize_json is a custom method to tokenize the dialog data in json.
    for x in tokenize_json(word_tokenizer, raw_data):
        writer.write_raw_data([x])
    writer.commit()
    ```

For detailed cases, refer to the [Example of Data Preprocessing in Llama2](https://gitee.com/mindspore/mindformers/blob/r1.3.0/docs/model_cards/llama2.md#%E6%95%B0%E6%8D%AE%E5%8F%8A%E6%9D%83%E9%87%8D%E5%87%86%E5%A4%87).

### Using MindRecord Format Datasets in Tasks

You can make a training or evaluation task use a prepared MindRecord format dataset by configuring dataset-related parameters in the yaml configuration file. As an example, for the Llama2-7B model pretraining task, the default configuration parameters in the [pretrain_llama2_7b.yaml file](https://gitee.com/mindspore/mindformers/blob/r1.3.0/configs/llama2/pretrain_llama2_7b.yaml#L39) are described as follows:

```yaml
# dataset
train_dataset: &train_dataset
  data_loader:
    type: MindDataset
    dataset_dir: ""
    shuffle: True
  input_columns: ["input_ids"]
  num_parallel_workers: 8
  python_multiprocessing: False
  drop_remainder: True
  batch_size: 6
  repeat: 1
  numa_enable: False
  prefetch_size: 1
train_dataset_task:
  type: CausalLanguageModelDataset
  dataset_config: *train_dataset
```

Configure the following parameters to use MindRecord format datasets:

- data_loader.type: the type of the dataloader, which needs to be set to `MindDataset`.
- data_loader.dataset_dir: the path to the dataset file.
- input_columns: the input data columns of the training dataset. For the pre-training scenario here, set it to `["input_ids"]`.

For the remaining parameters, refer to the "model training configuration" and "model evaluation configuration" sections in the [Configuration File Description](https://www.mindspore.cn/mindformers/docs/en/r1.3.2/appendix/conf_files.html).

## BIN Format Dataset

Using datasets in binary (BIN) format can bring significant performance and efficiency gains during large-model training. MindFormers supports BIN format datasets, covering both how to make a BIN format dataset and how to use it in tasks.

### How to Make a BIN Format Dataset

The preprocessing script provided by MindFormers currently only supports files in json format. Before using the script, you need to convert the original dataset into a json file that the script supports. The supported json format is as follows:

```json
{"src": "www.nvidia.com", "text": "The quick brown fox", "type": "Eng", "id": "0", "title": "First Part"}
{"src": "The Internet", "text": "jumps over the lazy dog", "type": "Eng", "id": "42", "title": "Second Part"}
```

Take Llama2 processing the Wiki dataset as an example. For downloading the original Wiki dataset, see the [Example of Data Preprocessing in Llama2](https://gitee.com/mindspore/mindformers/blob/r1.3.0/docs/model_cards/llama2.md#%E6%95%B0%E6%8D%AE%E5%8F%8A%E6%9D%83%E9%87%8D%E5%87%86%E5%A4%87). After converting the dataset into the supported format, call [mindformers/tools/dataset_preprocess/preprocess_indexed_dataset.py](https://gitee.com/mindspore/mindformers/blob/r1.3.0/mindformers/tools/dataset_preprocess/preprocess_indexed_dataset.py) directly.
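If the original corpus is, for example, a collection of plain-text documents, a small conversion step is needed first. The following is a minimal sketch, assuming one document per line of a hypothetical `corpus.txt`; the `src`, `type`, and `title` values are placeholders and do not need to match any real source:

```python
import json

# Hypothetical input: a plain-text corpus with one document per line.
with open("corpus.txt", "r", encoding="utf-8") as fin, \
        open("wiki.json", "w", encoding="utf-8") as fout:
    for idx, line in enumerate(fin):
        text = line.strip()
        if not text:
            continue
        # Write one json object per line, in the layout shown above.
        record = {"src": "my_corpus", "text": text, "type": "Eng",
                  "id": str(idx), "title": f"Doc {idx}"}
        fout.write(json.dumps(record, ensure_ascii=False) + "\n")
```

The resulting json file is then passed to the preprocessing script through the `--input` option.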
The specific command is as follows:

```shell
python mindformers/tools/dataset_preprocess/preprocess_indexed_dataset.py \
  --input /path/to/wiki.json \
  --output-prefix /path/to/my_wiki_1024 \
  --tokenizer-type LlamaTokenizer \
  --vocab-file /path/to/tokenizer.model \
  --add_bos_token True \
  --add_eos_token True \
  --pad_or_stitch stitch \
  --seq-length 1024 \
  --workers 1
```

The parameters of the preprocessing script are described below:

- input: path to the json format file of the dataset to be processed.
- output-prefix: prefix of the output files after preprocessing.
- tokenizer-type: type of the tokenizer corresponding to the model.
- vocab-file: tokenizer.model of the model, or a vocab file in another format.
- add_bos_token: whether to add bos_token at the first position of the data; default is False. Refer to each model's requirements.
- add_eos_token: whether to add eos_token at the end of the data; default is False. Refer to each model's requirements.
- pad_or_stitch: whether to pad or stitch the data, depending on the requirements of the training task; `pad` pads each sample to the sequence length, and `stitch` concatenates samples together.
- seq-length: the sequence length to which the data is processed; needs to be set by the user.
- workers: number of parallel workers for preprocessing.

After executing the above command, you will get two files, in .bin and .idx format.

### Using BIN Format Datasets in Tasks

A training task can use a prepared BIN format dataset by configuring dataset-related parameters in the yaml configuration file. Taking the Llama2-7B model pretraining task as an example, the configuration parameters in the [pretrain_llama2_7b.yaml file](https://gitee.com/mindspore/mindformers/blob/r1.3.0/configs/llama2/pretrain_llama2_7b.yaml#L39) are modified and described as follows:

```yaml
# dataset
train_dataset: &train_dataset
  data_loader:
    type: IndexedDataLoader
    path_prefix: ""
    shuffle: False
  input_columns: ["input_ids"]
  num_parallel_workers: 8
  python_multiprocessing: False
  drop_remainder: True
  batch_size: 6
  repeat: 1
  numa_enable: False
  prefetch_size: 1
train_dataset_task:
  type: CausalLanguageModelDataset
  dataset_config: *train_dataset
```

Configure the following parameters to use the BIN format dataset:

- data_loader.type: the type of the dataloader, which should be set to `IndexedDataLoader`.
- data_loader.path_prefix: the prefix of the dataset file name.
- input_columns: the input data columns of the training dataset. For the pre-training scenario here, set it to `["input_ids"]`.

## Online Dataset

MindFormers can connect to the [Modelers](https://modelers.cn/datasets) and [HuggingFace](https://huggingface.co/datasets) repositories to load datasets online and expand dataset sources.

### Docking HuggingFace Open Source Community

1. Environment preparation

    The `HF_ENDPOINT` environment variable controls the remote repository actually used by the HuggingFace open source community. When not configured, it defaults to `https://huggingface.co`. For environments in mainland China, it needs to be set to the mirror address: `export HF_ENDPOINT=https://hf-mirror.com`.

2. Installing dependencies

    ```shell
    git clone https://gitee.com/openmind-ai/openmind-hub.git
    cd openmind-hub
    pip install -e .
    cd ..
    pip install datasets==2.18.0
    git clone https://gitee.com/openmind-ai/openmind-extension-for-datasets.git
    cd openmind-extension-for-datasets
    pip install -e .
    cd ..
    ```
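Once the dependencies are installed and `HF_ENDPOINT` is exported as described in step 1, a quick self-contained check of the environment (independent of the yaml-based `CommonDataLoader` flow shown later) is to load a remote dataset directly with the `datasets` library. The repository id below is a placeholder, not a dataset shipped with MindFormers:

```python
from datasets import load_dataset

# "username/dataset_name" is a placeholder repository id; the HF_ENDPOINT
# environment variable set in step 1 determines where it is downloaded from.
dataset = load_dataset("username/dataset_name", split="train")
print(dataset.column_names)
```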
### Docking Modelers Open Source Community

1. Environment preparation

    The `OPENMIND_HUB_ENDPOINT` environment variable controls the remote repository actually used by the Modelers open source community. When not configured, it defaults to `https://telecom.openmind.cn`, i.e. `export OPENMIND_HUB_ENDPOINT=https://telecom.openmind.cn`.

2. Installing dependencies

    ```shell
    git clone https://gitee.com/openmind-ai/openmind-hub.git
    cd openmind-hub
    pip install -e .
    cd ..
    pip install datasets==2.18.0
    git clone https://gitee.com/foundation-models/openmind-datasets.git
    cd openmind-datasets
    pip install -e .
    cd ..
    ```

> When the openmind-datasets component is installed, the Modelers open source community is used by default. The environment variable `USE_OM` controls which community to connect to: the default value `ON` connects to the Modelers community, and `OFF` connects to the HuggingFace community.

### Custom Data Handler

Users can define a custom data handler to apply various preprocessing logic to the loaded remote dataset.

#### Parameters

- type: the class name of the custom data handler; the class must inherit from `BaseInstructDataHandler`.
- tokenizer_name: name of the tokenizer used.
- seq_length: sequence length.
- output_columns: output columns after data preprocessing.
- prompt_key: name of the data column after prompt processing is added.
- tokenizer: tokenizer configuration; it can be a dictionary, a string, or a `tokenizer` object.

#### Development Sample 1

Custom data handlers are usually placed in the `mindformers/dataset/handler` directory, and a customized handler needs to inherit the abstract base class `BaseInstructDataHandler`. You need to implement the `format_func` and `tokenize_func` methods, which preprocess each piece of loaded data. Refer to `alpaca_handler.py`.

```python
@MindFormerRegister.register(MindFormerModuleType.DATA_HANDLER)
class XXXInstructDataHandler(BaseInstructDataHandler):

    def format_func(self, example):
        # Custom data format conversion
        ...

    def tokenize_func(self, example):
        # Custom tokenization processing
        ...
```

`BaseInstructDataHandler` provides a default implementation of the entry method `handle`, which iterates over each piece of data for preprocessing. `format_func` defines how to convert the raw data into the desired format, and `tokenize_func` performs customized tokenization on the converted data. The input parameter `example` in the example above is each sample obtained from the dataset.

#### Development Sample 2

If you want to process the whole dataset directly instead of processing each sample in batches, you can implement the entry `handle` method in the custom handler, which receives the complete dataset, as shown below:

```python
def handle(self, dataset):
    """data handler"""
    return dataset.rename_columns({"content": "prompt", "summary": "answer"})
```
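Putting the two samples together, a minimal per-sample handler might look like the sketch below. It assumes that `MindFormerRegister`, `MindFormerModuleType`, and `BaseInstructDataHandler` are imported as in `alpaca_handler.py`, that the loaded dataset has `prompt` and `answer` columns, and that the handler is configured with a `tokenizer` and `seq_length`; the simple concatenation logic is illustrative only:

```python
# Minimal sketch of a per-sample handler. Imports and column names are
# assumptions for illustration, following the pattern of alpaca_handler.py.
@MindFormerRegister.register(MindFormerModuleType.DATA_HANDLER)
class MyInstructDataHandler(BaseInstructDataHandler):
    """Toy handler: concatenate prompt and answer, then tokenize."""

    def format_func(self, example):
        # Convert one raw sample into a single text string.
        return example["prompt"] + example["answer"]

    def tokenize_func(self, text):
        # Tokenize the formatted text and truncate to the configured length.
        input_ids = self.tokenizer(text)["input_ids"][:self.seq_length]
        return {"input_ids": input_ids, "labels": input_ids}
```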
### alpaca Dataset Sample

#### Training Processes Loaded Directly from the Remote Repository

Modify the task configuration file [finetune_llama2_7b.yaml](https://gitee.com/mindspore/mindformers/blob/r1.3.0/configs/llama2/finetune_llama2_7b.yaml).

Modify the following parameters:

```yaml
train_dataset:
  input_columns: &input_columns ["input_ids", "labels"]
  data_loader:
    type: CommonDataLoader
    shuffle: True
    split: "train"
    path: "AI_Connect/alpaca"
    input_columns: *input_columns
    handler:
      type: AlpacaInstructDataHandler
      tokenizer_name: llama2_13b
      seq_length: 4096
      prompt_key: "conversations"
      output_columns: *input_columns
```

Configure the following parameters to use the alpaca dataset:

- input_columns: column names of the input data.
- data_loader.type: name of the data loading class.
- data_loader.shuffle: whether to shuffle the dataset.
- data_loader.path: remote path of the dataset to load.
- data_loader.input_columns: which fields to use when converting the loaded dataset to ms.datasets.
- data_loader.handler: data handler configuration; no data processing is performed when it is empty.
- data_loader.handler.type: class name of the data handler.
- data_loader.handler.tokenizer_name: name of the tokenizer.
- data_loader.handler.seq_length: sequence length.
- data_loader.handler.prompt_key: name of the data column after prompt processing is added.
- data_loader.handler.output_columns: columns to be output after data preprocessing.

For the remaining parameters, refer to the "model training configuration" and "model evaluation configuration" sections in the [Configuration File Description](https://www.mindspore.cn/mindformers/docs/en/r1.3.2/appendix/conf_files.html).

Custom data handler:

```python
@MindFormerRegister.register(MindFormerModuleType.DATA_HANDLER)
class AlpacaInstructDataHandler(BaseInstructDataHandler):

    def format_func(self, example):
        """format func"""
        source = PROMPT_INPUT.format_map(example) \
            if example.get(self.input_key, "") != "" \
            else PROMPT_NO_INPUT.format_map(example)
        target = example.get(self.output_key)
        formatted_example = [
            {
                "from": self.user_role,
                "value": source,
            },
            {
                "from": self.assistant_role,
                "value": target,
            },
        ]

        return formatted_example

    def tokenize_func(self, messages):
        """tokenize func"""
        conversation = self.gen_prompt(messages)
        sep = self.template.sep + self.assistant_role + ": "

        # Tokenize conversations
        rounds = conversation.split(self.template.sep2)
        ids = [self.tokenizer.bos_token_id]
        mask = [1]
        for _, rou in enumerate(rounds):
            if rou == "":
                break
            conv_out = self.tokenizer(rou)
            ids.extend(conv_out['input_ids'][1:])
            mask.extend(conv_out['attention_mask'][1:])
        d = {'input_ids': ids, 'attention_mask': mask}
        # pylint: disable=W0212
        d = self.tokenizer._pad(d, max_length=self.seq_length + 1, padding_strategy='max_length')
        input_id = d['input_ids'][:self.seq_length + 1]
        # attention_mask.append(d['attention_mask'])
        target = np.array(d['input_ids'])
        total_len = int(np.not_equal(target, self.tokenizer.pad_token_id).sum())
        cur_len = 1
        target[:cur_len] = self.ignore_token_id
        for _, rou in enumerate(rounds):
            if rou == "":
                break
            parts = rou.split(sep)
            if len(parts) != 2:
                break
            parts[0] += sep
            round_len = len(self.tokenizer(rou)['input_ids']) - 1
            instruction_len = len(self.tokenizer(parts[0])['input_ids']) - 3

            target[cur_len: cur_len + instruction_len] = self.ignore_token_id

            cur_len += round_len
        target[cur_len:] = self.ignore_token_id
        if cur_len < self.seq_length + 1:
            if cur_len != total_len:
                target[:] = self.ignore_token_id
        else:
            target = target[:self.seq_length + 1]
        label = target.tolist()
        return {
            "input_ids": input_id,
            "labels": label,
        }
```

### ADGEN Dataset Sample

#### Training Processes Loaded Directly from the Remote Repository

Modify the task configuration file
[run_glm3_6b_finetune_2k_800T_A2_64G.yaml](https://gitee.com/mindspore/mindformers/blob/r1.3.0/configs/glm3/run_glm3_6b_finetune_2k_800T_A2_64G.yaml).

Modify the following parameters:

```yaml
train_dataset: &train_dataset
  data_loader:
    type: CommonDataLoader
    path: "xxx/ADGEN"
    split: "train"
    shuffle: True
    input_columns: ["prompt", "answer"]
    handler:
      type: AdgenInstructDataHandler
      output_columns: ["content", "summary"]
  tokenizer:
    type: ChatGLM3Tokenizer
    vocab_file: "/path/to/tokenizer.model"
  input_columns: ["input_ids", "labels"]
  max_source_length: 1024
  max_target_length: 1023
  ignore_pad_token_for_loss: True
  num_parallel_workers: 8
  python_multiprocessing: False
  drop_remainder: True
  batch_size: 8
  repeat: 1
  numa_enable: False
  prefetch_size: 1
  phase: "train"
  version: 3
  seed: 0
```

Configure the following parameters to use the ADGEN dataset:

- data_loader.type: name of the data loading class.
- data_loader.path: path of the dataset to load.
- data_loader.shuffle: whether to shuffle the dataset.
- data_loader.split: subset of the dataset; defaults to the train set.
- data_loader.input_columns: which fields to use when converting the loaded dataset to ms.datasets.
- data_loader.handler: custom data handler.
- data_loader.handler.type: class name of the custom data handler.
- data_loader.handler.output_columns: names of the dataset columns output after processing.

For the remaining parameters, refer to the "model training configuration" and "model evaluation configuration" sections in the [Configuration File Description](https://www.mindspore.cn/mindformers/docs/en/r1.3.2/appendix/conf_files.html).

Custom adgen_handler:

```python
@MindFormerRegister.register(MindFormerModuleType.DATA_HANDLER)
class AdgenInstructDataHandler(BaseInstructDataHandler):
    """adgen data handler"""
    def handle(self, dataset):
        """data handler"""
        return dataset.rename_columns({"content": "prompt", "summary": "answer"})
```
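For reference, the effect of this handler can be reproduced on a small in-memory dataset with the HuggingFace `datasets` library installed earlier; the toy records below are made up for illustration only:

```python
from datasets import Dataset

# Toy ADGEN-style records (made up for illustration only).
raw = Dataset.from_dict({
    "content": ["type#dress*color#blue"],
    "summary": ["A blue dress that pairs well with any outfit."],
})

# The same renaming that AdgenInstructDataHandler.handle performs.
processed = raw.rename_columns({"content": "prompt", "summary": "answer"})
print(processed.column_names)  # ['prompt', 'answer']
```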