Using MindSpore on the Cloud



Overview

ModelArts is a one-stop AI development platform provided by HUAWEI CLOUD. It integrates the Ascend AI Processor resource pool. Developers can experience MindSpore on this platform.

ResNet-50 is used as an example to describe how to use MindSpore to complete a training task on ModelArts.

Preparations

Preparing ModelArts

Create an account, configure ModelArts, and create an Object Storage Service (OBS) bucket by referring to the “Preparations” section in the ModelArts tutorial. For more information about ModelArts, visit https://support.huaweicloud.com/wtsnew-modelarts/index.html.

Accessing Ascend AI Processor Resources on HUAWEI CLOUD

You can apply on HUAWEI CLOUD to join the beta testing program of the ModelArts Ascend Compute Service.

Preparing Data

ModelArts uses OBS to store data. Therefore, before starting a training job, you need to upload the data to OBS. The CIFAR-10 dataset in binary format is used as an example.

  1. Download and decompress the CIFAR-10 dataset.

    Download the CIFAR-10 dataset from http://www.cs.toronto.edu/~kriz/cifar.html. Among the three dataset versions provided on the page, select the CIFAR-10 binary version.

  2. Create an OBS bucket (for example, ms-dataset), create a data directory (for example, cifar-10) in the bucket, and upload the CIFAR-10 data to the data directory according to the following structure.

    └─Object storage/ms-dataset/cifar-10
        ├─train
        │      data_batch_1.bin
        │      data_batch_2.bin
        │      data_batch_3.bin
        │      data_batch_4.bin
        │      data_batch_5.bin
        │
        └─eval
            test_batch.bin
    

Preparing for Script Execution

Create an OBS bucket (for example, resnet50-train), create a code directory (for example, resnet50_cifar10_train) in the bucket, and upload all scripts from the following directory to the code directory:

The scripts at https://gitee.com/mindspore/docs/tree/r1.3/docs/sample_code/sample_for_cloud/ use ResNet-50 to train on the CIFAR-10 dataset and validate the accuracy after training is complete. On ModelArts, the scripts can run training with either 1*Ascend or 8*Ascend.

Note that the script version must be the same as the MindSpore version selected in “Creating a Training Job.” For example, if you use scripts provided for MindSpore 1.1, you need to select MindSpore 1.1 when creating the training job.

To facilitate subsequent training job creation, you need to create a training output directory and a log output directory. The directory structure created in this example is as follows:

└─Object storage/resnet50-train
    ├─resnet50_cifar10_train
    │      dataset.py
    │      resnet.py
    │      resnet50_train.py
    │
    ├─output
    └─log

Running the MindSpore Script on ModelArts After Simple Adaptation

The scripts provided in the “Preparing for Script Execution” section can run on ModelArts directly. If you only want to experience training CIFAR-10 with ResNet-50, skip this section. If you need to run customized MindSpore scripts or other MindSpore sample code on ModelArts, perform a simple adaptation of the MindSpore code as follows:

Adapting to Script Arguments

  1. Set data_url and train_url. They are necessary for running the script on ModelArts, corresponding to the data storage path (an OBS path) and training output path (an OBS path), respectively.

    import argparse
    
    parser = argparse.ArgumentParser(description='ResNet-50 train.')
    parser.add_argument('--data_url', required=True, default=None, help='Location of data.')
    parser.add_argument('--train_url', required=True, default=None, help='Location of training outputs.')
    
  2. ModelArts also allows you to pass values to other arguments defined in the script. For details, see “Creating a Training Job.”

    parser.add_argument('--epoch_size', type=int, default=90, help='Train epoch size.')
    
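Putting the two points together, the following is a minimal sketch of the argument handling. ModelArts may pass additional command-line options beyond those you define, so parse_known_args() (also used in the sample code later) is assumed here to ignore unknown arguments.

import argparse

parser = argparse.ArgumentParser(description='ResNet-50 train.')
# required OBS paths for running on ModelArts
parser.add_argument('--data_url', required=True, default=None, help='Location of data.')
parser.add_argument('--train_url', required=True, default=None, help='Location of training outputs.')
# custom argument configured on the training job creation page
parser.add_argument('--epoch_size', type=int, default=90, help='Train epoch size.')

# parse_known_args() ignores any extra arguments that ModelArts may pass in
args_opt, unknown = parser.parse_known_args()
print('data_url:', args_opt.data_url)
print('train_url:', args_opt.train_url)
print('epoch_size:', args_opt.epoch_size)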

Adapting to OBS Data

MindSpore does not provide APIs for directly accessing OBS data, so you need to use the APIs provided by MoXing to interact with OBS. ModelArts training scripts are executed in containers, and the /cache directory is generally used as local storage inside the container.

HUAWEI CLOUD MoXing provides various APIs for users: https://github.com/huaweicloud/ModelArts-Lab/tree/master/docs/moxing_api_doc. In this example, only the copy_parallel API is used.

  1. Download the data stored in OBS to an execution container.

    import moxing as mox
    mox.file.copy_parallel(src_url='s3://dataset_url/', dst_url='/cache/data_path')
    
  2. Upload the training output from the container to OBS.

    import moxing as mox
    mox.file.copy_parallel(src_url='/cache/output_path', dst_url='s3://output_url/')
    
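In a full training script, the two copies typically wrap the training logic: the dataset is downloaded from OBS to the container before training starts, and the outputs are copied back to the OBS path given by train_url afterwards. A minimal sketch, assuming the hypothetical local directories /cache/data and /cache/output and the hypothetical helper name train_with_obs:

import os
import moxing as mox

LOCAL_DATA_PATH = '/cache/data'      # hypothetical local dataset directory
LOCAL_OUTPUT_PATH = '/cache/output'  # hypothetical local output directory

def train_with_obs(data_url, train_url):
    # download the dataset from OBS into the container
    mox.file.copy_parallel(src_url=data_url, dst_url=LOCAL_DATA_PATH)

    # ... run training here, writing checkpoints and logs to LOCAL_OUTPUT_PATH ...
    os.makedirs(LOCAL_OUTPUT_PATH, exist_ok=True)

    # upload the training outputs from the container back to OBS
    mox.file.copy_parallel(src_url=LOCAL_OUTPUT_PATH, dst_url=train_url)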

Adapting to 8-Device Training Jobs

To run scripts in the 8*Ascend environment, you need to adapt the dataset creation code and the local data path, and configure a distributed policy. By reading the environment variables DEVICE_ID and RANK_SIZE, you can build training scripts that work on both 1*Ascend and 8*Ascend; a combined sketch follows the list below.

  1. Adapt a local path.

    import os
    
    device_num = int(os.getenv('RANK_SIZE'))
    device_id = int(os.getenv('DEVICE_ID'))
    # define local data path
    local_data_path = '/cache/data'
    
    if device_num > 1:
        # define distributed local data path
        local_data_path = os.path.join(local_data_path, str(device_id))
    
  2. Adapt datasets.

    import os
    import mindspore.dataset.engine as de
    
    device_id = int(os.getenv('DEVICE_ID'))
    device_num = int(os.getenv('RANK_SIZE'))
    if device_num == 1:
        # create train data for 1 Ascend situation
        ds = de.Cifar10Dataset(dataset_path, num_parallel_workers=8, shuffle=True)
    else:
        # create and shard train data for the 8 Ascend situation
        ds = de.Cifar10Dataset(dataset_path, num_parallel_workers=8, shuffle=True,
                               num_shards=device_num, shard_id=device_id)
    
  3. Configure a distributed policy.

    import os
    from mindspore import context
    from mindspore.context import ParallelMode
    
    device_num = int(os.getenv('RANK_SIZE'))
    if device_num > 1:
        context.set_auto_parallel_context(device_num=device_num,
                                          parallel_mode=ParallelMode.DATA_PARALLEL,
                                          gradients_mean=True)
    
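Putting the three adjustments together: in MindSpore, distributed training on Ascend also requires initializing the communication backend with init() from mindspore.communication.management, which the snippets above do not show. The following is a minimal sketch of the device setup under these assumptions; setup_device is a hypothetical helper name.

import os
from mindspore import context
from mindspore.communication.management import init
from mindspore.context import ParallelMode

def setup_device():
    """Configure execution for 1*Ascend or 8*Ascend based on environment variables."""
    device_id = int(os.getenv('DEVICE_ID'))
    device_num = int(os.getenv('RANK_SIZE'))
    context.set_context(mode=context.GRAPH_MODE, device_target='Ascend')

    local_data_path = '/cache/data'
    if device_num > 1:
        init()  # initialize the HCCL communication backend for distributed training
        context.set_auto_parallel_context(device_num=device_num,
                                          parallel_mode=ParallelMode.DATA_PARALLEL,
                                          gradients_mean=True)
        # each device reads data from its own sub-directory
        local_data_path = os.path.join(local_data_path, str(device_id))
    return device_id, device_num, local_data_path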

Sample Code

Adapt the MindSpore script based on the preceding three points. The following pseudocode is used as an example:

Original MindSpore script:

import os
import argparse
from mindspore import context
from mindspore.context import ParallelMode
import mindspore.dataset.engine as de

device_id = int(os.getenv('DEVICE_ID'))
device_num = int(os.getenv('RANK_SIZE'))

def create_dataset(dataset_path):
    if device_num == 1:
        ds = de.Cifar10Dataset(dataset_path, num_parallel_workers=8, shuffle=True)
    else:
        ds = de.Cifar10Dataset(dataset_path, num_parallel_workers=8, shuffle=True,
                               num_shards=device_num, shard_id=device_id)
    return ds

def resnet50_train(args):
    if device_num > 1:
        context.set_auto_parallel_context(device_num=device_num,
                                          parallel_mode=ParallelMode.DATA_PARALLEL,
                                          gradients_mean=True)
    train_dataset = create_dataset(args.local_data_path)

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='ResNet-50 train.')
    parser.add_argument('--local_data_path', required=True, default=None, help='Location of data.')
    parser.add_argument('--epoch_size', type=int, default=90, help='Train epoch size.')

    args_opt, unknown = parser.parse_known_args()

    resnet50_train(args_opt)

Adapted MindSpore script:

import os
import argparse
from mindspore import context
from mindspore.context import ParallelMode
import mindspore.dataset.engine as de

# adapt to cloud: used for downloading data
import moxing as mox

device_id = int(os.getenv('DEVICE_ID'))
device_num = int(os.getenv('RANK_SIZE'))

def create_dataset(dataset_path):
    if device_num == 1:
        ds = de.Cifar10Dataset(dataset_path, num_parallel_workers=8, shuffle=True)
    else:
        ds = de.Cifar10Dataset(dataset_path, num_parallel_workers=8, shuffle=True,
                               num_shards=device_num, shard_id=device_id)
    return ds

def resnet50_train(args):
    # adapt to cloud: define local data path
    local_data_path = '/cache/data'

    if device_num > 1:
        context.set_auto_parallel_context(device_num=device_num,
                                          parallel_mode=ParallelMode.DATA_PARALLEL,
                                          gradients_mean=True)
        # adapt to cloud: define distributed local data path
        local_data_path = os.path.join(local_data_path, str(device_id))

    # adapt to cloud: download data from obs to local location
    print('Download data.')
    mox.file.copy_parallel(src_url=args.data_url, dst_url=local_data_path)

    train_dataset = create_dataset(local_data_path)

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='ResNet-50 train.')
    # adapt to cloud: get obs data path
    parser.add_argument('--data_url', required=True, default=None, help='Location of data.')
    # adapt to cloud: get obs output path
    parser.add_argument('--train_url', required=True, default=None, help='Location of training outputs.')
    parser.add_argument('--epoch_size', type=int, default=90, help='Train epoch size.')
    args_opt, unknown = parser.parse_known_args()

    resnet50_train(args_opt)

Creating a Training Job

Create a training job to run the MindSpore script. The following provides step-by-step instructions for creating a training job on ModelArts.

Opening the ModelArts Console

Click Console on the HUAWEI CLOUD ModelArts home page at https://www.huaweicloud.com/product/modelarts.html.

Using a Common Framework to Create a Training Job

The ModelArts tutorial at https://support.huaweicloud.com/engineers-modelarts/modelarts_23_0238.html shows how to use a common framework to create a training job.

Using MindSpore as a Common Framework to Create a Training Job

Training scripts and data in this tutorial are used as an example to describe how to configure arguments on the training job creation page.

  1. Algorithm Source: Click Frameworks, and then select Ascend-Powered-Engine and the required MindSpore version. (Mindspore-0.5-python3.7-aarch64 is used as an example here. Use scripts corresponding to the selected version.)

  2. Code Directory: Select a code directory created in an OBS bucket. Set Startup File to a startup script in the code directory.

  3. Data Source: Click Data Storage Path and enter the CIFAR-10 dataset path in OBS.

  4. Argument: Set data_url and train_url to the values of Data Storage Path and Training Output Path, respectively. Click the add icon to pass values to other arguments in the script, for example, epoch_size.

  5. Resource Pool: Click Public Resource Pool > Ascend.

  6. Specification: Select Ascend: 1 * Ascend 910 CPU: 24-core 96 GiB or Ascend: 8 * Ascend 910 CPU: 192-core 768 GiB, which indicate single-node single-device and single-node 8-device specifications, respectively.

Viewing the Execution Result

  1. You can view run logs on the Training Jobs page.

    When the ResNet-50 training job runs with the 8*Ascend specification, the total number of epochs is 92, the accuracy is about 92%, and about 12,000 images are trained per second.

    When the ResNet-50 training job runs with the 1*Ascend specification, the total number of epochs is 92, the accuracy is about 95%, and about 1,800 images are trained per second.

  2. If you specify a log path when creating a training job, you can download log files from OBS and view them.