Data Processing



Q: Does GeneratorDataset support ds.PKSampler sampling?

A: GeneratorDataset does not support PKSampler. Custom data operations are so flexible that the built-in PKSampler cannot cover every case, so the API layer reports that the operation is not supported. With GeneratorDataset, however, you can easily define the sampling logic you need: implement the sampling rules in the __getitem__ function of your dataset class (ImageDataset, for example) and return the required data.
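As a rough sketch of this approach, the class below applies a PK-style rule (P classes per group, K samples per class) through its own index handling. The class name PKImageDataset, the toy labels, and the parameters p and k are illustrative assumptions, not MindSpore APIs. Only the commented-out GeneratorDataset call refers to the real interface.

```python
import random
from collections import defaultdict


class PKImageDataset:
    """Hypothetical random-access source that serves samples in PK order."""

    def __init__(self, samples, labels, p=2, k=2, seed=0):
        self.samples = samples
        self.labels = labels

        # Group sample indices by class label.
        by_class = defaultdict(list)
        for idx, label in enumerate(labels):
            by_class[label].append(idx)

        rng = random.Random(seed)
        # Keep only classes that have at least k samples.
        classes = sorted(c for c, idxs in by_class.items() if len(idxs) >= k)
        rng.shuffle(classes)

        # Build the sampling order: groups of p classes, k samples from each.
        self.order = []
        for start in range(0, len(classes) - p + 1, p):
            for c in classes[start:start + p]:
                self.order.extend(rng.sample(by_class[c], k))

    def __getitem__(self, index):
        real_index = self.order[index]
        return self.samples[real_index], self.labels[real_index]

    def __len__(self):
        return len(self.order)


# Toy data standing in for images and their class labels.
labels = [0, 0, 0, 1, 1, 2, 2, 2, 3, 3]
samples = ["img_%d" % i for i in range(len(labels))]
dataset = PKImageDataset(samples, labels, p=2, k=2)

# With MindSpore installed, the object could then be wrapped like this
# (shuffle disabled so that the PK order is preserved):
# import mindspore.dataset as ds
# data_set = ds.GeneratorDataset(dataset, column_names=["image", "label"], shuffle=False)
```

Because the order is fixed in the constructor, every epoch sees the same sequence. Rebuilding the object (or the order list) per epoch is one way to vary it.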


Q: How does MindSpore load the existing pre-trained word vector?

A: When defining EmbeddingLookup or Embedding, wrap the pre-trained word vectors in a tensor and pass that tensor as the initial value of EmbeddingLookup or Embedding.


Q: What is the difference between c_transforms and py_transforms? Which one is recommended?

A: c_transforms is recommended. It performs better because it runs entirely at the C layer.

Principle: Under the hood, c_transforms uses the C versions of opencv and jpeg-turbo for data processing, whereas py_transforms uses the Python library Pillow.


Q: A piece of data contains multiple images which have different widths and heights. I need to perform the map operation on the data in mindrecord format for data processing. However, the data I read from record is in np.ndarray format. My operations are for the image format. How can I preprocess the generated data in mindrecord format?

A: You are advised to perform the following operations:

#1 Define the schema as follows. The fields data1, data2, data3, ... store your images; only the binary content of each image is stored here.

cv_schema_json = {"label": {"type": "int32"}, "data1": {"type": "bytes"}, "data2": {"type": "bytes"}, "data3": {"type": "bytes"}}

#2 Organize the data as follows, and then write data_list with FileWriter.write_raw_data(...).

data_list = []
data = {}
data['label'] = 1

# Store only the raw bytes of each image; decoding happens when the data is loaded.
# The "with" blocks make sure every file is actually closed.
with open("1.jpg", "rb") as f:
    data['data1'] = f.read()

with open("2.jpg", "rb") as f2:
    data['data2'] = f2.read()

with open("3.jpg", "rb") as f3:
    data['data3'] = f3.read()

data_list.append(data)

#3 Load the data with MindDataset, decode it with the provided Decode operator, and then perform subsequent processing.

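# Import paths as in MindSpore 1.x (c_transforms); they may differ in other versions.
import mindspore.dataset as ds
import mindspore.dataset.vision.c_transforms as vision
from mindspore.dataset.vision import Inter

data_set = ds.MindDataset("mindrecord_file_name")
# Decode each image column from raw bytes into an image array.
data_set = data_set.map(input_columns=["data1"], operations=vision.Decode(), num_parallel_workers=2)
data_set = data_set.map(input_columns=["data2"], operations=vision.Decode(), num_parallel_workers=2)
data_set = data_set.map(input_columns=["data3"], operations=vision.Decode(), num_parallel_workers=2)
# Subsequent image operations can now be applied, for example resizing data1.
resize_op = vision.Resize((32, 32), interpolation=Inter.LINEAR)
data_set = data_set.map(operations=resize_op, input_columns=["data1"], num_parallel_workers=2)
for item in data_set.create_dict_iterator(output_numpy=True):
    print(item)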

Q: When a custom image dataset is converted to the mindrecord format, the data is in the numpy.ndarray format and shape is [4,100,132,3], indicating four three-channel frames, and each value ranges from 0 to 255. However, when I view the data that is converted into the mindrecord format, I find that the shape is [19800] but that of the original data is [158400]. Why?

A: The dtype of the ndarray may have been set to int8. The original size [158400] (4 x 100 x 132 x 3) is exactly eight times [19800]. You are advised to set the dtype of the ndarray to float64.
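The factor of eight can be reproduced with the standard library alone. This sketch assumes that the mismatch comes from a buffer holding one byte per value being interpreted as 8-byte elements. It does not use MindSpore and does not reproduce the actual conversion code.

```python
# Four 100 x 132 three-channel frames.
num_values = 4 * 100 * 132 * 3

# One byte per value, as with an 8-bit dtype.
raw = bytes(num_values)

# Reinterpreting the same buffer as 8-byte doubles yields 8x fewer elements.
as_float64 = memoryview(raw).cast("d")

print(len(raw))         # 158400
print(len(as_float64))  # 19800
```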


Q: I want to save the generated image, but the image cannot be found in the corresponding directory after the code is executed. Similarly, a dataset is generated in JupyterLab for training. During training, data can be read in the corresponding path, but the image or dataset cannot be found in the path. Why?

A: Images or datasets generated in JupyterLab are stored inside the Docker container. Data downloaded by moxing can be viewed only inside the container during training. After training is complete, the data is released together with the container. In the training task, you can use moxing to transfer the data you need to OBS, and then download it from OBS to the local host.


Q: How do I understand the dataset_sink_mode parameter in model.train of MindSpore?

A: When dataset_sink_mode is set to True, data processing and network computing are performed in pipeline mode. That is, data is processed step by step, and each processed batch is placed in a queue that caches the processed data. Network computing then fetches data from the queue for training. Because the two stages overlap, the overall training time is determined by whichever takes longer: data processing or network computing.

When dataset_sink_mode is set to False, data processing and network computing are performed in serial mode. That is, after a batch of data is processed, it is transferred to the network for computation. After the computation is complete, the next batch of data is processed and transferred to the network for computation. This process repeats until the training is complete. The total time consumed is the time consumed for data processing plus the time consumed for network computing.
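The difference between the two modes can be illustrated with a simplified timing model. The function name and the per-step timings are hypothetical, and the model ignores pipeline warm-up, queue capacity, and transfer overhead, so treat it as a back-of-the-envelope estimate only.

```python
def estimate_training_time(data_ms, compute_ms, steps, sink_mode):
    """Rough total time in milliseconds for the two modes described above."""
    if sink_mode:
        # Pipeline mode: the stages overlap, so the slower stage dominates.
        return steps * max(data_ms, compute_ms)
    # Serial mode: each step waits for data processing, then computes.
    return steps * (data_ms + compute_ms)


# Example: 20 ms of data processing and 50 ms of computing per step, 100 steps.
print(estimate_training_time(20, 50, 100, sink_mode=True))   # 5000
print(estimate_training_time(20, 50, 100, sink_mode=False))  # 7000
```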


Q: Can MindSpore train image data of different sizes by batch?

A: You can refer to the usage of YOLOv3 which contains the resizing of different images. For details about the script, see yolo_dataset.


Q: Must data be converted into MindRecords when MindSpore is used for segmentation training?

A: build_seg_data.py is used to generate MindRecords based on a dataset. You can directly use or adapt it to your dataset. Alternatively, you can use GeneratorDataset if you want to read the dataset by yourself.

GeneratorDataset example

GeneratorDataset API description


Q: How do I perform training without processing data in MindRecord format?

A: You can use GeneratorDataset to load data in a customized way. For details, see the GeneratorDataset documentation.


Q: When MindSpore performs multi-device training on the Ascend hardware platform, how does a user-defined dataset transfer data to the different chips?

A: When GeneratorDataset is used, the num_shards=num_shards and shard_id=device_id parameters control which shard of the data each device reads. __getitem__ and __len__ should be implemented for the full dataset, not for a single shard.

An example is as follows:

# Device 0:
ds.GeneratorDataset(..., num_shards=8, shard_id=0, ...)
# Device 1:
ds.GeneratorDataset(..., num_shards=8, shard_id=1, ...)
# Device 2:
ds.GeneratorDataset(..., num_shards=8, shard_id=2, ...)
...
# Device 7:
ds.GeneratorDataset(..., num_shards=8, shard_id=7, ...)
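The idea that the source describes the full dataset while each device reads only its own shard can be sketched in plain Python. The FullDataset class and the shard_indices helper are hypothetical, and the round-robin split is an assumption made for illustration. The way MindSpore actually assigns samples to shards may differ.

```python
class FullDataset:
    """Hypothetical source: exposes the whole dataset, unaware of sharding."""

    def __init__(self, size):
        self.size = size

    def __getitem__(self, index):
        return index  # a real source would return the sample at this index

    def __len__(self):
        return self.size  # full length, not the per-device length


def shard_indices(dataset_len, num_shards, shard_id):
    """Illustrative round-robin split of indices across shards."""
    return list(range(shard_id, dataset_len, num_shards))


source = FullDataset(16)
per_device = {
    device_id: [source[i] for i in shard_indices(len(source), 8, device_id)]
    for device_id in range(8)
}

print(per_device[0])  # [0, 8]
print(per_device[7])  # [7, 15]
```

Every index appears in exactly one shard, which is why the source itself does not need any per-device logic.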

Q: How do I build a multi-label MindRecord dataset for images?

A: The data schema can be defined as follows:

cv_schema_json = {"label": {"type": "int32", "shape": [-1]}, "data": {"type": "bytes"}}

Note: The label is a numpy array that stores the label values, for example 1, 1, 0, 1, 0, 1. All of these label values correspond to the same data, that is, the binary content of the same image. For details, see Converting Dataset to MindRecord.


Q: What can I do if the error message "wrong shape of image" is displayed when I use a model trained by MindSpore to perform prediction on a 28 x 28 digit image with white text on a black background?

A: The model is trained on the MNIST grayscale image dataset. Therefore, the input for prediction must also be a 28 x 28 grayscale image, that is, a single-channel image.


Q: Can you introduce the dedicated data processing framework?

A: MindData provides the heterogeneous hardware acceleration function for data processing. The high-concurrency data processing pipeline supports Ascend, GPU and CPU. The CPU usage is reduced by 30%. For details, see Optimizing Data Processing.


Q: When an error such as “TDT Push data into device Failed” is raised during network training, indicating that sending data failed, how do I locate the problem?

A: This error means that sending data to the device through the training data transfer channel (TDT) failed. There are several possible causes, and the log gives corresponding suggestions for checking. In detail:

  1. Usually, start from the first error in the log (the first ERROR-level message) or the first traceback, and look for information that helps locate the cause.

  2. If the error is raised in the graph compiling stage, before training has started (for example, no loss has been printed in the log yet), check the error log for errors reported by network operators or caused by the environment configuration (for example, an incorrect hccl.json leads to abnormal initialization of multi-card communication).

  3. If the error is raised during training, it is usually caused by a mismatch between the amount of data sent (number of batches) and the amount of data required for network training (number of steps). You can print the number of batches in an epoch with the get_dataset_size interface. Several possible causes are as follows:

    • Count how many times the loss has been printed to work out the number of steps trained when the error was raised. If that number is an exact integer multiple of the number of batches in an epoch, the problem may lie in the part of data processing that handles epochs, as in the following case:

      ...
      dataset = dataset.create_tuple_iterator(num_epochs=-1) # Here, if you want to return an iterator, num_epochs should be 1, but it is recommended to return dataset directly
      return dataset
      
    • Data processing is too slow to keep up with network training. In this case, use the profiler tool and MindInsight to check whether there is an obvious iteration gap, or iterate the dataset manually and print the average time per batch. If it is longer than the combined forward and backward time of the network, the data processing part most likely needs to be optimized.

    • Abnormal data encountered during training may raise an exception, which causes sending data to fail. In this case, other ERROR logs show which part of data processing is abnormal and give checking advice. If the cause is not obvious, you can also try to find the abnormal data by iterating over each batch in the dataset (for example, turn off shuffle and use bisection).

  4. If the error is raised after training has finished (probably caused by the forced release of resources), it can be ignored.

  5. If the specific cause cannot be located, create an issue or ask a question on the Huawei Cloud forum for help.
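Two of the checks in item 3 can be sketched as small helpers. Both function names are hypothetical, and the helpers work on any Python iterable. The commented lines at the end show how they might be applied to a MindSpore dataset, assuming a dataset object named dataset.

```python
import time


def average_batch_time(batches, max_batches=100):
    """Iterate up to max_batches items and return the mean seconds per batch."""
    count = 0
    start = time.perf_counter()
    for _ in batches:
        count += 1
        if count >= max_batches:
            break
    elapsed = time.perf_counter() - start
    return elapsed / count if count else 0.0


def stopped_on_epoch_boundary(trained_steps, batches_per_epoch):
    """True if the failure step is an exact multiple of the batches in an epoch."""
    return trained_steps > 0 and trained_steps % batches_per_epoch == 0


# Stand-in data so that the sketch runs without MindSpore.
print(stopped_on_epoch_boundary(300, 100))  # True
print(stopped_on_epoch_boundary(250, 100))  # False

# With a real dataset, the calls might look like this:
# batches_per_epoch = dataset.get_dataset_size()
# seconds = average_batch_time(dataset.create_dict_iterator(num_epochs=1))
```

If the average batch time is clearly longer than one forward and backward pass of the network, data processing is the likely bottleneck. If the failure step lands on an epoch boundary, the epoch handling in the data pipeline is the first place to look.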