Using Java Interface to Perform Concurrent Inference

Overview

MindSpore Lite provides ModelParallelRunner, an interface for multi-model concurrent inference. Multi-model concurrent inference currently supports the CPU and GPU backends.

For a quick overview of the complete calling process for concurrent inference with MindSpore Lite, see Experience Java Minimalist Concurrent Reasoning Demo.

After the model is converted into a .ms model by using the MindSpore Lite model conversion tool, the inference process can be performed in Runtime. For details, see Converting Models for Inference. This tutorial describes how to use the Java API to perform inference.

To use the MindSpore Lite parallel inference framework, perform the following steps:

  1. Create a configuration item: create a RunnerConfig, which holds the settings for multi-model concurrent inference.

  2. Initialization: initialize ModelParallelRunner before running multi-model concurrent inference.

  3. Execute concurrent inference: use the Predict interface of ModelParallelRunner to run concurrent inference on multiple models.

  4. Release memory: when the MindSpore Lite concurrent inference framework is no longer needed, release the ModelParallelRunner and the related tensors you created.

Create configuration

The configuration item stores the basic parameters required for concurrent inference. These parameters set the number of concurrent models and guide model compilation and model execution.

The following sample code demonstrates how to create a RunnerConfig and configure the number of workers for concurrent inference:

// init context: 1 thread, CPU bind mode 0 (no core binding)
MSContext context = new MSContext();
context.init(1, 0);
// add the CPU backend with float16 disabled
boolean ret = context.addDeviceInfo(DeviceType.DT_CPU, false, 0);
if (!ret) {
    System.err.println("init context failed");
    context.free();
    return;
}
// init runner config from the context and use two workers
RunnerConfig config = new RunnerConfig();
config.init(context);
config.setWorkersNum(2);

For details on the configuration method of Context, see Context.

Multi-model concurrent inference currently supports only two hardware backends: CPUDeviceInfo and GPUDeviceInfo. To use the GPU backend, add the GPU device information first and then the CPU device information; otherwise, an error is reported and the program exits.

Multi-model concurrent inference does not support FP32 data inference. For core binding, only two modes are supported: no core binding, or binding to big cores. Setting parameters for the bound cores and configuring a binding core list are not supported.

Initialization

When MindSpore Lite executes concurrent inference, ModelParallelRunner is the main entry point. Through ModelParallelRunner, you can initialize and execute concurrent inference. Use the RunnerConfig created in the previous step and call the init interface of ModelParallelRunner to initialize it.

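The following sample code demonstrates how to initialize ModelParallelRunner:

// init ModelParallelRunner with the model file and the RunnerConfig created above;
// modelPath is assumed here to hold the path to the converted .ms model
ModelParallelRunner runner = new ModelParallelRunner();
ret = runner.init(modelPath, config);
if (!ret) {
    System.err.println("ModelParallelRunner init failed.");
    runner.free();
    return;
}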

When initializing ModelParallelRunner, setting the RunnerConfig configuration parameters is optional. If they are not set, the default parameters are used for multi-model concurrent inference.

Execute concurrent inference

MindSpore Lite performs concurrent model inference through the Predict interface of ModelParallelRunner.

The following sample code demonstrates how to call Predict to execute inference:

ret = runner.predict(inputs, outputs);
if (!ret) {
    System.err.println("MindSpore Lite predict failed.");
    freeTensor();
    runner.free();
    return;
}
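
Because the runner is meant to serve several requests at once, callers usually issue predict from more than one thread. The following is a minimal sketch of that calling pattern using a fixed thread pool. It is written to run without the MindSpore Lite library: Predictor is a hypothetical stand-in for runner.predict(inputs, outputs), float arrays stand in for tensors, and the class and method names are illustrative rather than part of the MindSpore Lite API. It also assumes that predict may be called from multiple threads at the same time, which this section implies but does not state.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ConcurrentPredictSketch {
    /** Hypothetical stand-in for runner.predict(inputs, outputs); not a MindSpore Lite type. */
    interface Predictor {
        boolean predict(List<float[]> inputs, List<float[]> outputs);
    }

    /** Submits one task per request and returns how many predict calls reported success. */
    static int runAll(Predictor predictor, List<List<float[]>> requests, int callerThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(callerThreads);
        try {
            List<Future<Boolean>> results = new ArrayList<>();
            for (List<float[]> inputs : requests) {
                // Each request gets its own output list, so callers never share buffers.
                results.add(pool.submit(() -> predictor.predict(inputs, new ArrayList<>())));
            }
            int succeeded = 0;
            for (Future<Boolean> result : results) {
                try {
                    if (result.get()) { // blocks until this request has finished
                        succeeded++;
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting for predict", e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException("predict task failed", e);
                }
            }
            return succeeded;
        } finally {
            pool.shutdown(); // always stop the caller threads
        }
    }

    public static void main(String[] args) {
        // Dummy predictor that copies inputs to outputs and always reports success.
        Predictor echo = (inputs, outputs) -> {
            outputs.addAll(inputs);
            return true;
        };
        List<List<float[]>> requests = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
            List<float[]> inputs = new ArrayList<>();
            inputs.add(new float[] {i});
            requests.add(inputs);
        }
        System.out.println("succeeded: " + runAll(echo, requests, 2));
    }
}
```

In real code the lambda passed as the Predictor would wrap the runner's predict call. The number of caller threads is independent of the value passed to setWorkersNum; how the two should relate is not covered in this section.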

Memory release

When the MindSpore Lite inference framework is no longer needed, release the ModelParallelRunner and the related tensors you created. The following sample code demonstrates how to free memory before the program ends.

freeTensor();
runner.free();
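
Each early return in the samples above has to repeat the cleanup calls, so it is easy to miss one path. One way to make the release unconditional is a try/finally block. The following is a minimal sketch of that pattern, written with hypothetical stand-ins so that it runs on its own: Freeable represents any object with a free() method (such as the runner or a tensor), and freeAll is only an assumption about what the freeTensor() helper does, since its body is not shown in this section.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

class ReleaseSketch {
    /** Hypothetical stand-in for any object exposing free(); not a MindSpore Lite type. */
    interface Freeable {
        void free();
    }

    /** Frees every non-null element of the list. */
    static void freeAll(List<? extends Freeable> resources) {
        for (Freeable resource : resources) {
            if (resource != null) {
                resource.free();
            }
        }
    }

    /** Runs the work, then releases tensors and runner whether it succeeds, fails, or throws. */
    static boolean runAndRelease(BooleanSupplier work,
                                 List<? extends Freeable> tensors,
                                 Freeable runner) {
        try {
            return work.getAsBoolean();
        } finally {
            freeAll(tensors); // tensors first, then the runner, as in the sample above
            runner.free();
        }
    }

    public static void main(String[] args) {
        List<Freeable> tensors = new ArrayList<>();
        tensors.add(() -> System.out.println("tensor freed"));
        Freeable runner = () -> System.out.println("runner freed");
        // The work reports failure, yet both resources are still released.
        boolean ok = runAndRelease(() -> false, tensors, runner);
        System.out.println("predict ok: " + ok);
    }
}
```

With this structure the failure branch only needs to log the error, because the finally block performs the release on every exit path.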