Ascend Performance Tuning
Overview
This tutorial introduces how to use MindSpore Profiler for performance tuning on Ascend AI processors. MindSpore Profiler provides operator execution time analysis, memory usage analysis, AI Core metrics analysis, timeline display, and more, helping users analyze performance bottlenecks and optimize training efficiency.
Operation Process
1. Prepare the training script;
2. Call the performance debugging interface, such as mindspore.Profiler or mindspore.profiler.DynamicProfilerMonitor, in the training script;
3. Run the training script;
4. View the performance data with MindStudio Insight.
Usage
There are three ways to collect training performance data; users can choose the enabling method that best fits their scenario. The usage of each method is introduced below.
Method 1: mindspore.Profiler Interface Enabling
Add the MindSpore Profiler related interfaces in the training script; see MindSpore Profiler parameter details for more information.
Graph mode collection example:
In Graph mode, users can enable Profiler through Callback.
```python
import mindspore as ms
from mindspore import Profiler

class StopAtStep(ms.Callback):
    def __init__(self, start_step, stop_step):
        super(StopAtStep, self).__init__()
        self.start_step = start_step
        self.stop_step = stop_step
        self.profiler = Profiler(start_profile=False, output_path='./profiler_data')

    def on_train_step_begin(self, run_context):
        cb_params = run_context.original_args()
        step_num = cb_params.cur_step_num
        if step_num == self.start_step:
            self.profiler.start()

    def on_train_step_end(self, run_context):
        cb_params = run_context.original_args()
        step_num = cb_params.cur_step_num
        if step_num == self.stop_step:
            self.profiler.stop()
            self.profiler.analyse()
```
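The callback can then be attached to model training in the usual way. Below is a minimal usage sketch; Net, create_dataset, and the loss/optimizer choice are hypothetical placeholders for the user's own model and data pipeline.

```python
# Minimal sketch of attaching the StopAtStep callback defined above to Model.train.
# Net() and create_dataset() are placeholders for the user's own network and dataset.
from mindspore import nn
from mindspore.train import Model

net = Net()
loss_fn = nn.CrossEntropyLoss()
optimizer = nn.Momentum(net.trainable_params(), learning_rate=0.01, momentum=0.9)
model = Model(net, loss_fn=loss_fn, optimizer=optimizer)

# Collect performance data from step 10 to step 20.
profile_cb = StopAtStep(start_step=10, stop_step=20)
model.train(1, create_dataset(), callbacks=[profile_cb])
```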
For the complete case, refer to graph mode collection complete code example
PyNative mode collection example:
In PyNative mode, users can enable Profiler by setting the schedule and on_trace_ready parameters.
For example, to collect the performance data of the first two steps, the following configuration can be used:
```python
from mindspore import Profiler
from mindspore.profiler import schedule, tensor_board_trace_handler

STEP_NUM = 15
# Define the training model network
net = Net()

with Profiler(schedule=schedule(wait=0, warmup=0, active=2, repeat=1, skip_first=0),
              on_trace_ready=tensor_board_trace_handler) as prof:
    for _ in range(STEP_NUM):
        train(net)
        # Call step to collect
        prof.step()
```
After enabling, the kernel_details.csv file contains a Step ID column; here the Step ID values are 0 and 1, indicating that the data of the 0th and 1st steps was collected.
For the complete case, refer to PyNative mode collection complete code example
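The mapping from schedule parameters to collected steps can be reasoned about with simple arithmetic: skip_first steps are skipped first, then each cycle consists of wait + warmup + active steps, of which only the last active steps are collected, and the cycle runs repeat times (or until training ends when repeat is 0). The helper below is an illustrative calculation only, not part of the MindSpore API:

```python
# Illustrative helper (not part of the MindSpore API): which step indices does a
# given schedule(wait, warmup, active, repeat, skip_first) actually collect,
# assuming steps are numbered from 0?
def collected_steps(wait, warmup, active, repeat, skip_first, total_steps):
    cycle = wait + warmup + active
    steps = []
    for step in range(skip_first, total_steps):
        offset = step - skip_first
        if repeat and offset >= repeat * cycle:
            break  # all requested cycles are done
        if offset % cycle >= wait + warmup:
            steps.append(step)  # this step falls in the "active" phase
    return steps

# The configuration used in the example above collects steps 0 and 1.
print(collected_steps(wait=0, warmup=0, active=2, repeat=1, skip_first=0, total_steps=15))  # [0, 1]
```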
Method 2: Dynamic Profiler Enabling
Users can use the mindspore.profiler.DynamicProfilerMonitor interface to enable Profiler without interrupting the training process: modify the configuration file during training, and the collection task is completed under the new configuration. This interface requires a JSON configuration file; if none is provided, a JSON file with the default configuration is generated.
JSON configuration example as follows:
```json
{
    "start_step": 2,
    "stop_step": 5,
    "aicore_metrics": -1,
    "profiler_level": 0,
    "activities": 0,
    "analyse_mode": -1,
    "parallel_strategy": false,
    "with_stack": false,
    "data_simplification": true
}
```
1. Configure the above JSON configuration file before instantiating DynamicProfilerMonitor (see DynamicProfilerMonitor parameter details) and save it to cfg_path;
2. Call the step interface of DynamicProfilerMonitor after each training step to collect data;
3. To change the collection and analysis task during training, modify the JSON configuration file, for example change start_step to 8 and stop_step to 10 in the JSON above and save it; DynamicProfilerMonitor automatically detects that the configuration file has changed and applies the new collection and analysis task.
Sample as follows:
```python
from mindspore.profiler import DynamicProfilerMonitor

# cfg_path includes the path of the above JSON configuration file, output_path is the output path
dp = DynamicProfilerMonitor(cfg_path="./cfg_path", output_path="./output_path")
STEP_NUM = 15
# Define the training model network
net = Net()
for _ in range(STEP_NUM):
    train(net)
    # Call step to collect
    dp.step()
```
At this point, the results include two folders: rank0_start2_stop5 and rank0_start8_stop10, representing the collection of steps 2-5 and 8-10 respectively.
For the complete case, refer to dynamic profiler enabling method case.
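For reference, the configuration file can also be written and updated programmatically. The sketch below assumes the JSON file under cfg_path is named profiler_config.json; this name is an assumption, so check the file actually generated in your cfg_path.

```python
# Sketch: prepare and later update the JSON configuration used by DynamicProfilerMonitor.
# The file name profiler_config.json under cfg_path is an assumption here.
import json
import os

cfg_dir = "./cfg_path"
cfg_file = os.path.join(cfg_dir, "profiler_config.json")
os.makedirs(cfg_dir, exist_ok=True)

config = {
    "start_step": 2,
    "stop_step": 5,
    "aicore_metrics": -1,
    "profiler_level": 0,
    "activities": 0,
    "analyse_mode": -1,
    "parallel_strategy": False,
    "with_stack": False,
    "data_simplification": True,
}
with open(cfg_file, "w") as f:
    json.dump(config, f, indent=4)

# Later, while training is still running, change the collection window;
# DynamicProfilerMonitor picks up the modified file automatically.
config["start_step"], config["stop_step"] = 8, 10
with open(cfg_file, "w") as f:
    json.dump(config, f, indent=4)
```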
Method 3: Environment Variable Enabling
The environment variable enabling method is the simplest way to enable Profiler: simply configure the parameters in an environment variable, and performance data is collected automatically during model training. Note that this method does not support the schedule parameter; all other parameters are available. See the environment variable enabling method parameter details for more information.
A sample of the related configuration items is as follows:
```bash
export MS_PROFILER_OPTIONS='
{"start": true,
"output_path": "/XXX",
"activities": ["CPU", "NPU"],
"with_stack": true,
"aicore_metrics": "AicoreNone",
"l2_cache": false,
"profiler_level": "Level0"}'
```
After setting the environment variable, start the training script directly to complete the collection. Note that in this configuration start must be true for enabling to take effect; otherwise Profiler is not enabled.
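The variable can also be set from the Python launch script, as long as it is set before MindSpore initializes profiling. This is only a sketch; the shell export shown above remains the documented approach.

```python
# Equivalent to the shell export above: set MS_PROFILER_OPTIONS from the launch
# script before MindSpore is imported and training starts (sketch only).
import json
import os

os.environ["MS_PROFILER_OPTIONS"] = json.dumps({
    "start": True,
    "output_path": "./profiler_output",
    "activities": ["CPU", "NPU"],
    "with_stack": True,
    "aicore_metrics": "AicoreNone",
    "l2_cache": False,
    "profiler_level": "Level0",
})

# ... then import mindspore and run the training script as usual.
```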
Performance Data
Users can collect, parse, and analyze performance data through MindSpore Profiler, including raw performance data from the framework side, CANN side, and device side, as well as parsed performance data.
When using MindSpore to train a model, we need to collect and analyze performance data in order to identify performance bottlenecks and optimize training efficiency. MindSpore Profiler provides complete performance data collection and analysis capabilities; this section details the storage structure and content of the collected performance data.
After collecting performance data, the original data will be stored according to the following directory structure:
Users do not need to open and view the following data files directly; refer to the MindStudio Insight user guide for viewing and analyzing performance data.
The following is the full set of result files. The actual number and content of the files depend on the user's parameter configuration and the actual training scenario; if a parameter is not configured or a scenario is not involved in the training, the corresponding data files are not generated.
└── localhost.localdomain_*_ascend_ms // Analysis result directory, named format: {worker_name}_{timestamp}_ascend_ms, by default {worker_name} is {hostname}_{pid}
├── profiler_info.json // For multi-card or cluster scenarios, the naming rule is profiler_info_{Rank_ID}.json, used to record Profiler related metadata
├── profiler_metadata.json
├── ASCEND_PROFILER_OUTPUT // MindSpore Profiler interface collects performance data
│ ├── api_statistic.csv // Generated when profiler_level=ProfilerLevel.Level1 or profiler_level=ProfilerLevel.Level2
│ ├── communication.json // Provides visualization data for performance analysis in multi-card or cluster scenarios, generated when profiler_level=ProfilerLevel.Level1 or profiler_level=ProfilerLevel.Level2
│ ├── communication_matrix.json // Communication small operator basic information file, generated when profiler_level=ProfilerLevel.Level1 or profiler_level=ProfilerLevel.Level2
│ ├── dataset.csv // Generated when activities contains ProfilerActivity.CPU
│ ├── data_preprocess.csv // Generated when profiler_level=ProfilerLevel.Level2
│ ├── kernel_details.csv // Generated when activities contains ProfilerActivity.NPU
│ ├── l2_cache.csv // Generated when l2_cache=True
│ ├── memory_record.csv // Generated when profile_memory=True
│ ├── minddata_pipeline_raw_*.csv // Generated when data_process=True and call mindspore.dataset
│ ├── minddata_pipeline_summary_*.csv // Generated when data_process=True and call mindspore.dataset
│ ├── minddata_pipeline_summary_*.json // Generated when data_process=True and call mindspore.dataset
│ ├── npu_module_mem.csv // Generated when profile_memory=True
│ ├── operator_memory.csv // Generated when profile_memory=True
│ ├── op_statistic.csv // AI Core and AI CPU operator call count and time data
│ ├── step_trace_time.csv // Iteration calculation and communication time statistics
│ └── trace_view.json
├── FRAMEWORK // Framework-side raw performance data; no need to pay attention to it. This directory is deleted when data_simplification=True
└── PROF_000001_20230628101435646_FKFLNPEPPRRCFCBA // CANN layer performance data, named format: PROF_{number}_{timestamp}_{string}; when data_simplification=True, other data is deleted and only the original performance data in this directory is retained
├── analyze // Generated when profiler_level=ProfilerLevel.Level1 or profiler_level=ProfilerLevel.Level2
├── device_*
├── host
├── mindstudio_profiler_log
└── mindstudio_profiler_output
The MindSpore Profiler interface associates and integrates the framework-side data with the CANN Profiling data to form trace, kernel, and memory performance data files. The detailed description of each file is as follows.
The FRAMEWORK directory contains the raw performance data of the framework side and does not need attention; the PROF directory contains the performance data collected by CANN Profiling, which is mainly saved in the mindstudio_profiler_output directory.
communication.json
The information of this performance data file is as follows:
- hcom_allGather_*@group
  - Communication Time Info
    - Start Timestamp(μs)
    - Elapse Time(ms)
    - Transit Time(ms)
    - Wait Time(ms)
    - Synchronization Time(ms)
    - Idle Time(ms)
    - Wait Time Ratio
    - Synchronization Time Ratio
  - Communication Bandwidth Info
    - RDMA
      - Transit Size(MB)
      - Transit Time(ms)
      - Bandwidth(GB/s)
      - Large Packet Ratio
      - Size Distribution
        - "Package Size(MB)": [count, dur]
    - HCCS
      - Transit Size(MB)
      - Transit Time(ms)
      - Bandwidth(GB/s)
      - Large Packet Ratio
      - Size Distribution
        - "Package Size(MB)": [count, dur]
    - PCIE
      - Transit Size(MB)
      - Transit Time(ms)
      - Bandwidth(GB/s)
      - Large Packet Ratio
      - Size Distribution
        - "Package Size(MB)": [count, dur]
    - SDMA
      - Transit Size(MB)
      - Transit Time(ms)
      - Bandwidth(GB/s)
      - Large Packet Ratio
      - Size Distribution
        - "Package Size(MB)": [count, dur]
    - SIO
      - Transit Size(MB)
      - Transit Time(ms)
      - Bandwidth(GB/s)
      - Large Packet Ratio
      - Size Distribution
        - "Package Size(MB)": [count, dur]
communication_matrix.json
The information of this performance data file is as follows:
- allgather-top1@*
  - src_rank-dst_rank
    - Transport Type
    - Transit Size(MB)
    - Transit Time(ms)
    - Bandwidth(GB/s)
    - op_name
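Since the exact nesting of these two JSON files can vary between CANN/MindSpore versions, a defensive way to inspect them is to walk the JSON tree and print the timing and bandwidth fields wherever they appear. The file paths below follow the directory structure shown earlier and are assumptions about your local layout.

```python
# Sketch: walk communication.json / communication_matrix.json and print timing
# and bandwidth fields wherever they occur; the nesting is treated as opaque
# because it may differ between versions.
import json

FIELDS = ("Elapse Time(ms)", "Transit Time(ms)", "Wait Time(ms)", "Bandwidth(GB/s)")

def walk(node, path=""):
    if isinstance(node, dict):
        found = {key: node[key] for key in FIELDS if key in node}
        if found:
            print(path, found)
        for key, value in node.items():
            walk(value, f"{path}/{key}")

for name in ("communication.json", "communication_matrix.json"):
    with open(f"ASCEND_PROFILER_OUTPUT/{name}") as f:
        walk(json.load(f))
```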
dataset.csv
The dataset.csv file records the information of the dataset operations.
| Field Name | Field Explanation |
|---|---|
| Operation | Corresponding dataset operation name |
| Stage | Operation stage |
| Occurrences | Operation occurrence count |
| Avg. time(us) | Average operation time (microseconds) |
| Custom Info | Custom information |
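A quick way to spot slow dataset operations is to sort this file by average time. The sketch below assumes pandas is available, uses the column names from the table above, and assumes the file path follows the directory structure shown earlier.

```python
# Sketch: list the dataset operations with the highest average time.
import pandas as pd

df = pd.read_csv("ASCEND_PROFILER_OUTPUT/dataset.csv")
top = df.sort_values("Avg. time(us)", ascending=False)
print(top[["Operation", "Stage", "Occurrences", "Avg. time(us)"]].head(10))
```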
kernel_details.csv
The kernel_details.csv file is controlled by the ProfilerActivity.NPU switch and contains the information of all operators executed on the NPU. If the user calls schedule on the front end to collect data by step, a Step Id field is added.
The difference from the data collected by the Ascend PyTorch Profiler interface is that when the with_stack switch is turned on, MindSpore Profiler concatenates the stack information to the Name field.
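To find the most time-consuming kernels, the file can be aggregated per operator name. In the sketch below the duration column name "Duration(us)" is an assumption and should be checked against the actual header of your file.

```python
# Sketch: summarize kernel time per operator name. "Name" and "Step Id" are
# described above; the "Duration(us)" column name is an assumption.
import pandas as pd

df = pd.read_csv("ASCEND_PROFILER_OUTPUT/kernel_details.csv")
summary = (df.groupby("Name")["Duration(us)"]
             .agg(total_us="sum", count="count", mean_us="mean")
             .sort_values("total_us", ascending=False))
print(summary.head(20))

# When schedule-based collection is used, per-step filtering is possible.
if "Step Id" in df.columns:
    print(df[df["Step Id"] == 0]["Duration(us)"].sum())
```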
minddata_pipeline_raw_*.csv
The minddata_pipeline_raw_*.csv file records the performance metrics of the dataset operations.
| Field Name | Field Explanation |
|---|---|
| op_id | Dataset operation ID |
| op_type | Operation type |
| num_workers | Number of operation workers |
| output_queue_size | Output queue size |
| output_queue_average_size | Output queue average size |
| output_queue_length | Output queue length |
| output_queue_usage_rate | Output queue usage rate |
| sample_interval | Sampling interval |
| parent_id | Parent operation ID |
| children_id | Child operation ID |
minddata_pipeline_summary_*.csv
The minddata_pipeline_summary_*.csv and minddata_pipeline_summary_*.json files have the same content in different formats. They record more detailed performance metrics of dataset operations and provide optimization suggestions based on these metrics.
| Field Name | Field Explanation |
|---|---|
| op_ids | Dataset operation IDs |
| op_names | Operation names |
| pipeline_ops | Operation pipeline |
| num_workers | Number of operation workers |
| queue_queue_size | Output queue size |
| queue_utilization_pct | Output queue usage rate |
| queue_empty_freq_pct | Output queue idle frequency |
| children_ids | Child operation IDs |
| parent_id | Parent operation ID |
| avg_cpu_pct | Average CPU usage rate |
| per_pipeline_time | Time of each pipeline execution |
| per_push_queue_time | Time of each push to the queue |
| per_batch_time | Time of each data batch execution |
| avg_cpu_pct_per_worker | Average CPU usage rate per worker thread |
| cpu_analysis_details | CPU analysis details |
| queue_analysis_details | Queue analysis details |
| bottleneck_warning | Performance bottleneck warning |
| bottleneck_suggestion | Performance bottleneck suggestion |
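The bottleneck fields are often the quickest entry point when tuning the data pipeline. Below is a sketch for pulling them out of the JSON summary; the "*" part of the file name varies per run, so glob is used, and the field names follow the table above (adjust if the JSON nests them differently).

```python
# Sketch: print bottleneck warnings and suggestions from the summary JSON files.
import glob
import json

for path in glob.glob("ASCEND_PROFILER_OUTPUT/minddata_pipeline_summary_*.json"):
    with open(path) as f:
        summary = json.load(f)
    print(path)
    print("warnings:   ", summary.get("bottleneck_warning"))
    print("suggestions:", summary.get("bottleneck_suggestion"))
```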
trace_view.json
It is recommended to open trace_view.json with the MindStudio Insight tool or chrome://tracing/. MindSpore Profiler does not support the record_shapes and GC functions.
Other Performance Data
The specific fields and meanings of other performance data files can be found in the Ascend official documentation.
Performance Tuning Case
During large model training, performance deterioration may occur due to unexpected factors, such as slow operator computation, slow communication, or slow cards. The root cause of the performance degradation needs to be identified and addressed.
The most important thing in performance tuning is to treat the right problem: first delimit the problem, then apply targeted tuning. Start by using the MindStudio Insight visualization tool to delimit the performance issue; the result usually falls into one of three categories: computation, scheduling, or communication. Then tune performance based on the expert advice provided by MindStudio Insight. Re-run the training after each tuning step, collect performance data again, and use MindStudio Insight to check whether the tuning took effect. Repeat this process until the performance issue is resolved.
MindStudio Insight provides a wealth of tuning and analysis methods: it visualizes the real software and hardware execution data, analyzes performance data in multiple dimensions, and locates performance bottlenecks, supporting visual cluster performance analysis at the scale of hundreds or thousands of cards and above. Import the performance data collected in the previous step into MindStudio Insight and use its visualization capabilities to analyze the data according to the following process.
1. Overview of the data
You can learn about each module through the overview interface.
First, click the 'Import Data' button in the MindStudio Insight interface to import the collected multi-card performance data.
Next, the overview interface displays the ratio of computation, communication, and idle time for each card in the selected communication domain, and provides expert advice.
The meanings of data indicators related to each legend are as follows:
| Legend | Meaning |
|---|---|
| Total compute time | Total kernel execution time on the Ascend device |
| Pure compute time | Pure compute time = total compute time - communication time (covered) |
| Communication time (covered) | Duration of communication that is covered by computation, i.e., time when computation and communication occur simultaneously |
| Communication time (uncovered) | Duration of communication that is not covered by computation, i.e., pure communication time |
| Idle time | Time when neither computation nor communication is taking place |
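The relationships between these legends can be checked with simple arithmetic. The numbers below are made up for illustration and are not from a real run.

```python
# Illustrative arithmetic for the table above, using made-up durations in ms.
total_compute = 820.0        # total kernel time on the Ascend device
comm_covered = 120.0         # communication overlapped with computation
comm_uncovered = 60.0        # pure communication time
idle = 40.0                  # neither computing nor communicating

pure_compute = total_compute - comm_covered
step_time = pure_compute + comm_covered + comm_uncovered + idle

print(f"pure compute ratio:   {pure_compute / step_time:.1%}")
print(f"uncovered comm ratio: {comm_uncovered / step_time:.1%}")
print(f"idle ratio:           {idle / step_time:.1%}")
```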
2. Definition and Analysis of Problems
Different indicator phenomena can delimit different performance problems:
(1) Computation problem: usually manifested as a large difference between the maximum and minimum total computation time within the communication domain. If the computation time of some cards is clearly beyond the normal range, those cards have likely taken on too heavy a computation load, for example the amount of data to process is too large, the model computation is too complex, or the performance of the card itself is limited.
(2) Scheduling problem: usually manifested as a large difference between the maximum and minimum of the idle-time ratio within the communication domain. If the idle time of some compute cards is too long, the task distribution may be unbalanced, or cards may be waiting for data from each other, which also degrades cluster performance.
(3) Communication problem: if the communication time (uncovered) is too long, there is a problem with the coordination between computation and communication, which may correspond to several situations: the communication protocol may not be sufficiently optimized, or the network bandwidth may be unstable, so that communication and computation cannot be well matched.
2.1 Computation Problems
When the data indicator phenomenon indicates a computation problem, the operator data of the abnormal card can be directly viewed and compared with the normal card. In this case, you can use the performance comparison function of MindStudio Insight to set the two cards to the comparison mode and view the result on the operator interface.
2.2 Scheduling Problems
When the data indicator phenomenon indicates a scheduling problem, it is necessary to go to the timeline interface to compare the abnormal card with the normal card to further locate the operator that has the problem.
On the timeline screen, select the HostToDevice connection type. HostToDevice shows how CANN-layer operators are dispatched to AscendHardware operators and to HCCL communication operators, and is used to locate scheduling problems.
The HostToDevice connections usually take two forms, inclined and vertical. The following figure shows a case of scheduling problems. If the HostToDevice connections are inclined as shown on the left, the scheduled tasks are arranged tightly during this period and the Ascend device performs computation and communication tasks at full load. If the HostToDevice connections are vertical as shown on the right, the Ascend device quickly completes the tasks issued by the host and then sits idle for part of the time, i.e., it is not running at full load; this generally indicates a scheduling problem.
2.3 Communication Problems
When the data indicators point to a communication problem, you need to enter the communication interface for further analysis. The communication interface displays the link performance of the whole network and the communication performance of all nodes in the cluster. By analyzing the overlap of cluster communication and computation time, slow hosts or slow nodes in the cluster training can be identified. Typically, performance issues are analyzed through two key metrics: the communication matrix and the communication duration.
Communication matrix
When analyzing, first check the transit size to see whether the amount of data transmitted by each card in this collective communication differs, i.e., whether the distribution is uneven. Second, look at the transit time: if the transit time of a card is very short, that card is most likely busy with other work, causing downstream cards to wait for a long time. Finally, check the bandwidth: if the bandwidth differs too much between cards or a bandwidth value is abnormal, there is an abnormal card in the communication domain.
Communication duration
Communication time refers to the time taken for a communication between computing cards. There are many factors that lead to excessive communication time, such as incorrect configuration of communication protocols, excessive data transmission, and so on. Only by finding these links that take too long to communicate and properly solving the problems, can data be transmitted between computing cards more smoothly, thereby improving the overall performance of the cluster. After the user selects a specific communication domain, the user can view the time summary of each calculation card in the communication domain in the communication duration interface, as well as the timing diagram and communication duration distribution diagram of each communication operator, so as to quickly obtain the relative position relationship and detailed communication data of the communication operator.
Common Tool Issues and Solutions
Common Issues When Collecting Performance Data by step
schedule Configuration Error Problem
The schedule configuration involves five parameters: wait, warmup, active, repeat, and skip_first. Each parameter must be greater than or equal to 0; active must be greater than or equal to 1, otherwise a warning is raised and active is set to the default value 1. If repeat is set to 0, the repeat parameter does not take effect, and Profiler determines the number of collection cycles based on the number of training steps.
schedule and step Configuration Mismatch Problem
Normally, the schedule configuration should cover fewer steps than the number of training steps, that is, repeat*(wait+warmup+active)+skip_first should be less than the number of training steps. If the schedule configuration exceeds the number of training steps, Profiler issues a warning; this does not interrupt training, but the collected and analyzed data may be incomplete.
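This constraint can be checked before training starts. A small sketch with illustrative numbers:

```python
# Sanity check for the constraint described above; the numbers are illustrative.
wait, warmup, active, repeat, skip_first = 0, 0, 2, 1, 0
train_steps = 15

required_steps = repeat * (wait + warmup + active) + skip_first
if required_steps > train_steps:
    print(f"schedule needs {required_steps} steps but training only runs "
          f"{train_steps}; collection may be incomplete")
else:
    print(f"schedule fits: {required_steps} <= {train_steps}")
```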