Using MindSpore on the Cloud
Overview
ModelArts is a one-stop AI development platform provided by HUAWEI CLOUD. It integrates the Ascend AI Processor resource pool. Developers can experience MindSpore on this platform.
ResNet-50 is used as an example to describe how to use MindSpore to complete a training task on ModelArts.
Preparations
Preparing ModelArts
Create an account, configure ModelArts, and create an Object Storage Service (OBS) bucket by referring to the “Preparations” section of the ModelArts tutorial at https://support.huaweicloud.com/wtsnew-modelarts/index.html.
Accessing Ascend AI Processor Resources on HUAWEI CLOUD
To use Ascend AI Processor resources on HUAWEI CLOUD, apply to join the beta testing program of the ModelArts Ascend Compute Service.
Preparing Data
ModelArts uses OBS to store data. Therefore, before starting a training job, you need to upload the data to OBS. The CIFAR-10 dataset in binary format is used as an example.
Download and decompress the CIFAR-10 dataset.
Download the CIFAR-10 dataset from http://www.cs.toronto.edu/~kriz/cifar.html. Among the three dataset versions provided on the page, select the CIFAR-10 binary version.
Create an OBS bucket (for example, ms-dataset), create a data directory (for example, cifar-10) in the bucket, and upload the CIFAR-10 data to the data directory according to the following structure.
└─Object storage/ms-dataset/cifar-10
    ├─train
    │    data_batch_1.bin
    │    data_batch_2.bin
    │    data_batch_3.bin
    │    data_batch_4.bin
    │    data_batch_5.bin
    │
    └─eval
         test_batch.bin
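If you prefer to prepare the local files with a script before uploading them through the OBS console or OBS tools, the following is a minimal sketch using only the Python standard library. The archive URL is the “CIFAR-10 binary version” link from the download page above; the local directory name cifar-10 is an illustrative choice that mirrors the OBS layout, and the upload to OBS itself is still done as described in the ModelArts documentation.

import os
import shutil
import tarfile
import urllib.request

# URL of the "CIFAR-10 binary version" archive from the download page above
CIFAR10_URL = 'http://www.cs.toronto.edu/~kriz/cifar-10-binary.tar.gz'
LOCAL_ROOT = 'cifar-10'  # illustrative local directory mirroring the OBS layout above

# download and unpack the archive; it extracts into ./cifar-10-batches-bin
archive = 'cifar-10-binary.tar.gz'
urllib.request.urlretrieve(CIFAR10_URL, archive)
with tarfile.open(archive, 'r:gz') as tar:
    tar.extractall('.')

# arrange the binary files into the train/eval structure shown above
extracted = 'cifar-10-batches-bin'
os.makedirs(os.path.join(LOCAL_ROOT, 'train'), exist_ok=True)
os.makedirs(os.path.join(LOCAL_ROOT, 'eval'), exist_ok=True)
for name in os.listdir(extracted):
    if name.startswith('data_batch_'):
        shutil.copy(os.path.join(extracted, name), os.path.join(LOCAL_ROOT, 'train', name))
    elif name == 'test_batch.bin':
        shutil.copy(os.path.join(extracted, name), os.path.join(LOCAL_ROOT, 'eval', name))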
Preparing for Script Execution
Create an OBS bucket (for example, resnet50-train), create a code directory (for example, resnet50_cifar10_train) in the bucket, and upload all scripts from the following directory to the code directory:
https://gitee.com/mindspore/docs/tree/r1.3/docs/sample_code/sample_for_cloud/
The scripts use ResNet-50 to train the CIFAR-10 dataset and validate the accuracy after training is complete.
The scripts can be used for training on ModelArts with either the 1*Ascend or the 8*Ascend specification. Note that the script version must be the same as the MindSpore version selected in “Creating a Training Job.” For example, if you use scripts provided for MindSpore 1.1, you need to select MindSpore 1.1 when creating the training job.
To facilitate subsequent training job creation, you need to create a training output directory and a log output directory. The directory structure created in this example is as follows:
└─Object storage/resnet50-train
    ├─resnet50_cifar10_train
    │    dataset.py
    │    resnet.py
    │    resnet50_train.py
    │
    ├─output
    └─log
Running the MindSpore Script on ModelArts After Simple Adaptation
The scripts provided in the “Preparing for Script Execution” section can run directly on ModelArts. If you only want to experience training CIFAR-10 with ResNet-50, you can skip this section. If you need to run customized MindSpore scripts or other MindSpore sample code on ModelArts, perform the following simple adaptations on the MindSpore code:
Adapting to Script Arguments
Set data_url and train_url. They are necessary for running the script on ModelArts, corresponding to the data storage path (an OBS path) and the training output path (an OBS path), respectively.

import argparse

parser = argparse.ArgumentParser(description='ResNet-50 train.')
parser.add_argument('--data_url', required=True, default=None, help='Location of data.')
parser.add_argument('--train_url', required=True, default=None, help='Location of training outputs.')
ModelArts also allows you to pass values to other arguments defined in the script. For details, see “Creating a Training Job.”
parser.add_argument('--epoch_size', type=int, default=90, help='Train epoch size.')
Adapting to OBS Data
MindSpore does not provide APIs for directly accessing OBS data, so you need to use the APIs provided by MoXing to interact with OBS. ModelArts training scripts are executed in containers; the /cache directory is generally used to store data inside the container.
HUAWEI CLOUD MoXing provides various APIs for users: https://github.com/huaweicloud/ModelArts-Lab/tree/master/docs/moxing_api_doc. In this example, only the copy_parallel API is used.
Download the data stored in OBS to an execution container.
import moxing as mox

mox.file.copy_parallel(src_url='s3://dataset_url/', dst_url='/cache/data_path')
Upload the training output from the container to OBS.
import moxing as mox

mox.file.copy_parallel(src_url='/cache/output_path', dst_url='s3://output_url/')
Adapting to 8-Device Training Jobs
To run scripts in the 8*Ascend environment, you need to adapt the dataset creation code and the local data path, and configure a distributed policy. By obtaining the environment variables DEVICE_ID and RANK_SIZE, you can build training scripts applicable to both 1*Ascend and 8*Ascend.
Adapt a local path.
import os

device_num = int(os.getenv('RANK_SIZE'))
device_id = int(os.getenv('DEVICE_ID'))

# define local data path
local_data_path = '/cache/data'

if device_num > 1:
    # define distributed local data path
    local_data_path = os.path.join(local_data_path, str(device_id))
Adapt datasets.
import os
import mindspore.dataset.engine as de

device_id = int(os.getenv('DEVICE_ID'))
device_num = int(os.getenv('RANK_SIZE'))

if device_num == 1:
    # create train data for the 1*Ascend situation
    ds = de.Cifar10Dataset(dataset_path, num_parallel_workers=8, shuffle=True)
else:
    # create and shard train data for the 8*Ascend situation
    ds = de.Cifar10Dataset(dataset_path, num_parallel_workers=8, shuffle=True,
                           num_shards=device_num, shard_id=device_id)
Configure a distributed policy.
import os
from mindspore import context
from mindspore.context import ParallelMode

device_num = int(os.getenv('RANK_SIZE'))

if device_num > 1:
    context.set_auto_parallel_context(device_num=device_num,
                                      parallel_mode=ParallelMode.DATA_PARALLEL,
                                      gradients_mean=True)
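If your own script performs distributed training, the communication backend must also be initialized before the training loop starts. The following is a minimal sketch, assuming a MindSpore 1.x Ascend environment; it combines the device context settings with HCCL initialization via init() from mindspore.communication.management.

import os

from mindspore import context
from mindspore.communication.management import init
from mindspore.context import ParallelMode

device_id = int(os.getenv('DEVICE_ID'))
device_num = int(os.getenv('RANK_SIZE'))

# run in graph mode on the Ascend device assigned to this process
context.set_context(mode=context.GRAPH_MODE, device_target='Ascend', device_id=device_id)

if device_num > 1:
    # configure data-parallel training across all devices of the job
    context.set_auto_parallel_context(device_num=device_num,
                                      parallel_mode=ParallelMode.DATA_PARALLEL,
                                      gradients_mean=True)
    # initialize the communication backend (assumption: default HCCL backend on Ascend)
    init()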
Sample Code
Perform simple adaptation on the MindSpore script based on the preceding three points. The following pseudocode is used as an example:
Original MindSpore script:
import os
import argparse
from mindspore import context
from mindspore.context import ParallelMode
import mindspore.dataset.engine as de

device_id = int(os.getenv('DEVICE_ID'))
device_num = int(os.getenv('RANK_SIZE'))

def create_dataset(dataset_path):
    if device_num == 1:
        ds = de.Cifar10Dataset(dataset_path, num_parallel_workers=8, shuffle=True)
    else:
        ds = de.Cifar10Dataset(dataset_path, num_parallel_workers=8, shuffle=True,
                               num_shards=device_num, shard_id=device_id)
    return ds

def resnet50_train(args):
    if device_num > 1:
        context.set_auto_parallel_context(device_num=device_num,
                                          parallel_mode=ParallelMode.DATA_PARALLEL,
                                          gradients_mean=True)
    train_dataset = create_dataset(args.local_data_path)

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='ResNet-50 train.')
    parser.add_argument('--local_data_path', required=True, default=None, help='Location of data.')
    parser.add_argument('--epoch_size', type=int, default=90, help='Train epoch size.')

    args_opt, unknown = parser.parse_known_args()

    resnet50_train(args_opt)
Adapted MindSpore script:
import os
import argparse
from mindspore import context
from mindspore.context import ParallelMode
import mindspore.dataset.engine as de
# adapt to cloud: used for downloading data
import moxing as mox

device_id = int(os.getenv('DEVICE_ID'))
device_num = int(os.getenv('RANK_SIZE'))

def create_dataset(dataset_path):
    if device_num == 1:
        ds = de.Cifar10Dataset(dataset_path, num_parallel_workers=8, shuffle=True)
    else:
        ds = de.Cifar10Dataset(dataset_path, num_parallel_workers=8, shuffle=True,
                               num_shards=device_num, shard_id=device_id)
    return ds

def resnet50_train(args):
    # adapt to cloud: define local data path
    local_data_path = '/cache/data'

    if device_num > 1:
        context.set_auto_parallel_context(device_num=device_num,
                                          parallel_mode=ParallelMode.DATA_PARALLEL,
                                          gradients_mean=True)
        # adapt to cloud: define distributed local data path
        local_data_path = os.path.join(local_data_path, str(device_id))

    # adapt to cloud: download data from obs to local location
    print('Download data.')
    mox.file.copy_parallel(src_url=args.data_url, dst_url=local_data_path)

    train_dataset = create_dataset(local_data_path)

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='ResNet-50 train.')
    # adapt to cloud: get obs data path
    parser.add_argument('--data_url', required=True, default=None, help='Location of data.')
    # adapt to cloud: get obs output path
    parser.add_argument('--train_url', required=True, default=None, help='Location of training outputs.')
    parser.add_argument('--epoch_size', type=int, default=90, help='Train epoch size.')

    args_opt, unknown = parser.parse_known_args()

    resnet50_train(args_opt)
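The adapted script above parses train_url but, to keep the example short, never uploads anything to it. The following is a minimal sketch of how the end of resnet50_train could copy results back to OBS, assuming a hypothetical /cache/output directory that the training code writes its checkpoints and logs to:

import moxing as mox

# adapt to cloud: upload outputs from the container back to OBS after training
# (local_output_path is a hypothetical directory the training code writes to)
local_output_path = '/cache/output'

print('Upload output.')
mox.file.copy_parallel(src_url=local_output_path, dst_url=args.train_url)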
Creating a Training Job
Create a training job to run the MindSpore script. The following provides step-by-step instructions for creating a training job on ModelArts.
Opening the ModelArts Console
Click Console on the HUAWEI CLOUD ModelArts home page at https://www.huaweicloud.com/product/modelarts.html.
Using a Common Framework to Create a Training Job
The ModelArts tutorial at https://support.huaweicloud.com/engineers-modelarts/modelarts_23_0238.html describes how to use a common framework to create a training job.
Using MindSpore as a Common Framework to Create a Training Job
Training scripts and data in this tutorial are used as an example to describe how to configure arguments on the training job creation page.
Algorithm Source: Click Frameworks, and then select Ascend-Powered-Engine and the required MindSpore version. (Mindspore-0.5-python3.7-aarch64 is used as an example here. Use scripts corresponding to the selected version.)
Code Directory: Select the code directory created in the OBS bucket, and set Startup File to the startup script in the code directory.
Data Source: Click Data Storage Path and enter the OBS path of the CIFAR-10 dataset.
Argument: Set data_url and train_url to the values of Data Storage Path and Training Output Path, respectively. Click the add icon to pass values to other arguments in the script, for example, epoch_size.
Resource Pool: Click Public Resource Pool > Ascend.
Specification: Select Ascend: 1 * Ascend 910 CPU: 24-core 96 GiB or Ascend: 8 * Ascend 910 CPU: 192-core 768 GiB, which indicate single-node single-device and single-node 8-device specifications, respectively.
Viewing the Execution Result
You can view run logs on the Training Jobs page.
When the 8*Ascend specification is used to execute the ResNet-50 training job, the total number of epochs is 92, the accuracy is about 92%, and the number of images trained per second is about 12,000.
When the 1*Ascend specification is used to execute the ResNet-50 training job, the total number of epochs is 92, the accuracy is about 95%, and the number of images trained per second is about 1,800.
If you specify a log path when creating the training job, you can download the log files from OBS and view them.