Using Delegate to Support Third-party AI Framework

Overview

The Delegate of MindSpore Lite is used to enable third-party AI frameworks (such as NPU or TensorRT) to quickly access the inference process of MindSpore Lite. A third-party framework can be implemented by the user or can be another open-source framework. Generally, such a framework is able to build models online, that is, it can build multiple operators into a sub-graph and dispatch the sub-graph to the device for inference. If you want to schedule other inference frameworks through MindSpore Lite, please refer to this article.

Usage of Delegate

Using Delegate to support a third-party AI framework mainly includes the following steps:

  1. Add a custom delegate class: Inherit the Delegate class to implement XXXDelegate.

  2. Implementing the Init Function: The Init function needs to check whether the device supports the delegate framework and apply for delegate-related resources.

  3. Implementing the Build Function: The Build function will implement the kernel support judgment, the sub-graph construction, and the online graph building.

  4. Implementing the sub-graph kernel: Inherit the Kernel class to implement the delegate sub-graph kernel.

Adding a Custom Delegate Class

XXXDelegate should inherit from Delegate. In the constructor of XXXDelegate, configure the settings that the third-party AI framework uses to build and execute the model, such as the NPU frequency, the number of CPU threads, etc.

class XXXDelegate : public Delegate {
 public:
  XXXDelegate() = default;

  ~XXXDelegate() = default;

  Status Init() override;

  Status Build(DelegateModel *model) override;
};

Implementing the Init

Init will be called during the Build process of Model. Specifically, it is called in the LiteSession::Init function of the MindSpore Lite internal flow.

Status XXXDelegate::Init() {
  // 1. Check whether the inference device matches the delegate framework.
  // 2. Initialize delegate-related resources.
  return RET_OK;
}

Implementing the Build

The input parameter of the Build(DelegateModel *model) interface is a DelegateModel, which contains:

std::vector<kernel::Kernel *> *kernels_: The list of kernels that have already been selected by MindSpore Lite and topologically sorted.

const std::map<kernel::Kernel *, const schema::Primitive *> primitives_: A map from each kernel to its attribute schema::Primitive, which is used to parse the original attribute information.
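
For reference, a minimal sketch of the DelegateModel interface as used in this tutorial is given below. Only the members invoked by the sample code are listed, and their signatures are inferred from those calls rather than copied from the authoritative header.

// A minimal sketch of the DelegateModel members used in this tutorial; signatures are
// inferred from the sample code below and are not the authoritative declarations.
using KernelIter = std::vector<kernel::Kernel *>::iterator;  // Iterator over the kernel list

class DelegateModel {
 public:
  // Return iterators to the first kernel and to one past the last kernel.
  KernelIter BeginKernelIterator();
  KernelIter EndKernelIterator();

  // Return the schema::Primitive (attribute) associated with a kernel.
  const schema::Primitive *GetPrimitive(kernel::Kernel *kernel) const;

  // Replace the kernels in [from, end) with the delegate sub-graph kernel graph_kernel
  // and return an iterator used to continue the traversal.
  KernelIter Replace(KernelIter from, KernelIter end, kernel::Kernel *graph_kernel);
};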

Build will be called during the Build process of Model. Specifically, it is called in the Scheduler::Schedule function of the MindSpore Lite internal flow, after the built-in kernels have been selected by MindSpore Lite. The following steps should be implemented in the Build function:

  1. Traverse the kernel list and use GetPrimitive to obtain the attribute of each kernel. Analyze the attribute to judge whether the delegate framework supports the kernel.

  2. For a contiguous run of supported kernels, construct a delegate sub-graph kernel and call Replace to substitute it for those kernels.

Status XXXDelegate::Build(DelegateModel *model) {
  KernelIter from = model->BeginKernelIterator();                   // Record the start operator position supported by the Delegate
  KernelIter end = model->BeginKernelIterator();                    // Record the end operator position supported by the Delegate
  for (KernelIter iter = model->BeginKernelIterator(); iter != model->EndKernelIterator(); iter++) {
    kernel::Kernel *kernel = *iter;
    if (IsSupport(kernel, model->GetPrimitive(kernel))) {           // Check whether the Delegate framework supports the kernel according to the primitive
      end = iter;
    } else {                                                        // The current kernel is not supported, and the sub-graph is truncated
      if (from != end) {
        auto xxx_graph_kernel = CreateXXXGraph(from, end, model);   // Create a Delegate sub-graph Kernel
        iter = model->Replace(from, end + 1, xxx_graph_kernel);     // Replace the supported kernels list with a Delegate sub-graph Kernel
      }
      from = iter + 1;
      end = iter + 1;
    }
  }
  return RET_OK;
}
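
The IsSupport call above stands for the support check that the third-party framework performs; it is a helper of XXXDelegate rather than part of the Delegate API, and it is not declared in the class sketch above. A hypothetical implementation, assuming the check is based on the primitive type (the supported types listed here are purely illustrative):

// A hypothetical support check; the set of supported primitive types depends entirely on
// the third-party framework and is only illustrative here.
bool XXXDelegate::IsSupport(kernel::Kernel *kernel, const schema::Primitive *primitive) {
  if (primitive == nullptr) {
    return false;
  }
  // A real delegate would also check tensor shapes, data types, and operator attributes
  // before accepting a kernel.
  static const std::set<schema::PrimitiveType> kSupportedTypes = {schema::PrimitiveType_Conv2DFusion,
                                                                  schema::PrimitiveType_Activation};
  return kSupportedTypes.find(primitive->value_type()) != kSupportedTypes.end();
}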

Implementing the Sub-graph Kernel

The CreateXXXGraph interface called above returns a delegate sub-graph kernel for XXXDelegate. The sample code is as follows:

kernel::Kernel *XXXDelegate::CreateXXXGraph(KernelIter from, KernelIter end, DelegateModel *model) {
  auto in_tensors = GraphInTensors(...);    // Find the input tensors of the Delegate sub-graph
  auto out_tensors = GraphOutTensors(...);  // Find the output tensors of the Delegate sub-graph
  auto graph_kernel = new (std::nothrow) XXXGraph(in_tensors, out_tensors);
  if (graph_kernel == nullptr) {
    MS_LOG(ERROR) << "New XXX Graph failed.";
    return nullptr;
  }
  // Build graph online, load model, etc.
  return graph_kernel;
}
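
GraphInTensors and GraphOutTensors above are helpers supplied by the delegate implementer; they are not part of the Delegate API. A hypothetical sketch of GraphInTensors is shown below, assuming that kernels expose inputs()/outputs() accessors for their tensors: a tensor is a sub-graph input if it is consumed by a kernel in the [from, end] range but not produced inside that range. GraphOutTensors can be written analogously.

// A hypothetical helper that derives the sub-graph input tensors from the kernels being replaced.
// It assumes kernels expose inputs()/outputs() accessors and is shown only for illustration.
std::vector<tensor::MSTensor *> XXXDelegate::GraphInTensors(KernelIter from, KernelIter end) {
  // Collect every tensor produced inside the sub-graph.
  std::vector<tensor::MSTensor *> inner_outputs;
  for (KernelIter iter = from; iter != end + 1; iter++) {
    auto outputs = (*iter)->outputs();
    inner_outputs.insert(inner_outputs.end(), outputs.begin(), outputs.end());
  }
  // A tensor consumed inside the sub-graph but not produced inside it is a sub-graph input.
  // Constant (weight) tensors would normally be excluded as well; this is omitted for brevity.
  std::vector<tensor::MSTensor *> graph_inputs;
  for (KernelIter iter = from; iter != end + 1; iter++) {
    for (auto *in_tensor : (*iter)->inputs()) {
      if (std::find(inner_outputs.begin(), inner_outputs.end(), in_tensor) == inner_outputs.end() &&
          std::find(graph_inputs.begin(), graph_inputs.end(), in_tensor) == graph_inputs.end()) {
        graph_inputs.push_back(in_tensor);
      }
    }
  }
  return graph_inputs;
}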

The delegate sub-graph kernel XXXGraph should inherit from Kernel. The implementation of XXXGraph should focus on:

  1. Find the correct in_tensors and out_tensors for XXXGraph according to the original kernel list.

  2. Override the Prepare, ReSize, and Execute interfaces. Prepare will be called in Build of Model, Execute will be called in Predict of Model, and ReSize will be called in Resize of Model.

class XXXGraph : public kernel::Kernel {
 public:
  XXXGraph(const std::vector<tensor::MSTensor *> &inputs, const std::vector<tensor::MSTensor *> &outputs)
      : kernel::Kernel(inputs, outputs, nullptr, nullptr) {}

  ~XXXGraph() override;

  int Prepare() override {
    // Generally, the model is built only once, so Prepare is also called only once.
    // Do the work that does not depend on input data here, such as packing the constant weight tensors.
    return RET_OK;
  }

  int Execute() override {
    // Obtain input data from in_tensors.
    // Execute the inference process.
    // Write the results back to out_tensors.
    return RET_OK;
  }

  int ReSize() override {
    // Support dynamic shapes: called when the input shape changes.
    return RET_OK;
  }
};

Calling Delegate by Lite Framework

MindSpore Lite schedules a user-defined delegate through the Context. Use SetDelegate to set the custom delegate for the Context; the delegate is then passed to MindSpore Lite by Build. If the delegate in the Context is a null pointer, the process falls back to the built-in inference of MindSpore Lite.

auto context = std::make_shared<mindspore::Context>();
if (context == nullptr) {
  MS_LOG(ERROR) << "New context failed";
  return RET_ERROR;
}
auto delegate = std::make_shared<XXXDelegate>();
if (delegate == nullptr) {
  MS_LOG(ERROR) << "New XXX delegate failed";
  return RET_ERROR;
}
context->SetDelegate(delegate);

auto model = new (std::nothrow) mindspore::Model();
if (model == nullptr) {
  std::cerr << "New Model failed." << std::endl;
  return RET_ERROR;
}
// Assume that a ms model file has been read and stored in the buffer pointed to by model_buf
auto build_ret = model->Build(model_buf, size, mindspore::kMindIR, context);
delete[](model_buf);
if (build_ret != mindspore::kSuccess) {
  std::cerr << "Build model failed." << std::endl;
  return RET_ERROR;
}
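
After Build succeeds, inference runs as usual, and the Execute of the delegate sub-graph kernel is called inside Predict of Model. A minimal sketch, assuming the standard Model::GetInputs/Predict interfaces (filling the input data is omitted):

// Run inference after Build; the delegate sub-graph kernel's Execute is called inside Predict.
auto inputs = model->GetInputs();
// ... fill the input tensor data here ...
std::vector<mindspore::MSTensor> outputs;
auto predict_ret = model->Predict(inputs, &outputs);
if (predict_ret != mindspore::kSuccess) {
  std::cerr << "Predict model failed." << std::endl;
  return RET_ERROR;
}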

Example of NPUDelegate

Currently, MindSpore Lite uses NPUDelegate for the NPU backend. This tutorial gives a brief description of NPUDelegate so that users can quickly understand the usage of the Delegate APIs.

Adding the NPUDelegate Class

class NPUDelegate : public Delegate {
 public:
  explicit NPUDelegate(lite::NpuDeviceInfo device_info) : Delegate() { frequency_ = device_info.frequency_; }

  ~NPUDelegate() override;

  Status Init() override;

  Status Build(DelegateModel *model) override;

 protected:
  // Analyze a kernel and its attribute.
  // If NPU supports the kernel, return an NPUOp, which carries the kernel's attributes and its connections to other kernels.
  // If the kernel is not supported, return a null pointer.
  NPUOp *GetOP(kernel::Kernel *kernel, const schema::Primitive *primitive);

  // Construct an NPU sub-graph kernel from a contiguous list of NPUOps
  kernel::Kernel *CreateNPUGraph(const std::vector<NPUOp *> &ops);

  NPUManager *npu_manager_ = nullptr;
  NPUPassManager *pass_manager_ = nullptr;
  std::map<schema::PrimitiveType, NPUGetOp> op_func_lists_;
  int frequency_ = 0;  // NPU frequency
};

Implementing the Init of NPUDelegate

The Init function applies for NPU-related resources and determines whether the hardware supports NPU.

Status NPUDelegate::Init() {
  npu_manager_ = new (std::nothrow) NPUManager();       // NPU manager of model buffer and client.
  if (npu_manager_ == nullptr) {
    MS_LOG(ERROR) << "New npu manager failed.";
    return RET_ERROR;
  }
  if (!npu_manager_->IsSupportNPU()) {                  // Check whether the current device supports NPU.
    MS_LOG(DEBUG) << "Checking npu is unsupported.";
    return RET_NOT_SUPPORT;
  }
  pass_manager_ = new (std::nothrow) NPUPassManager();  // The default format of MindSpore Lite is NHWC, and the default format of NPU is NCHW. The NPUPassManager is used to pack data between the sub-graphs.
  if (pass_manager_ == nullptr) {
    MS_LOG(ERROR) << "New npu pass manager failed.";
    return RET_ERROR;
  }

  // Initialize op_func_lists_, which records the correspondence between each primitive type and its GetOP function.
  op_func_lists_.clear();
  return RET_OK;
}

Implementing the Build of NPUDelegate

The Build interface parses the DelegateModel and mainly implements the kernel support judgment, the sub-graph construction, and the online graph building.

Status NPUDelegate::Build(DelegateModel *model) {
  KernelIter from, end;                     // Record the start and end positions of the kernels supported by NPU.
  std::vector<NPUOp *> npu_ops;             // Save all NPUOps used to construct an NPU sub-graph.
  int graph_index = 0;
  for (KernelIter iter = model->BeginKernelIterator(); iter != model->EndKernelIterator(); iter++) {
    kernel::Kernel *kernel = *iter;
    auto npu_op = GetOP(kernel, model->GetPrimitive(kernel));  // Obtain an NPUOp according to the kernel and the primitive. Each NPUOp contains information such as the input tensors, output tensors, and operator attributes.
    if (npu_op != nullptr) {                // NPU supports the current kernel.
      if (npu_ops.size() == 0) {
        from = iter;
      }
      npu_ops.push_back(npu_op);
      end = iter;
    } else {                                 // NPU does not support the current kernel.
      if (npu_ops.size() > 0) {
        auto npu_graph_kernel = CreateNPUGraph(npu_ops);  // Create an NPU sub-graph kernel.
        if (npu_graph_kernel == nullptr) {
          MS_LOG(ERROR) << "Create NPU Graph failed.";
          return RET_ERROR;
        }
        npu_graph_kernel->set_name("NpuGraph" + std::to_string(graph_index++));
        iter = model->Replace(from, end + 1, npu_graph_kernel);  // Replace the supported kernels with an NPU sub-graph kernel.
        npu_ops.clear();
      }
    }
  }
  auto ret = npu_manager_->LoadOMModel();    // Build model online. Load NPU model.
  if (ret != RET_OK) {
    MS_LOG(ERROR) << "NPU client load model failed.";
    return RET_ERROR;
  }
  return RET_OK;
}

Creating NPUGraph

The following sample code shows the CreateNPUGraph interface of NPUDelegate, which is used to generate an NPU sub-graph kernel.

kernel::Kernel *NPUDelegate::CreateNPUGraph(const std::vector<NPUOp *> &ops) {
  auto in_tensors = GraphInTensors(ops);
  auto out_tensors = GraphOutTensors(ops);
  auto graph_kernel = new (std::nothrow) NPUGraph(ops, npu_manager_, in_tensors, out_tensors);
  if (graph_kernel == nullptr) {
    MS_LOG(DEBUG) << "New NPU Graph failed.";
    return nullptr;
  }
  auto ret = graph_kernel->Init();
  if (ret != RET_OK) {
    MS_LOG(DEBUG) << "NPU Graph Init failed.";
    delete graph_kernel;
    return nullptr;
  }
  return graph_kernel;
}

Adding the NPUGraph Class

NPUGraph inherits from Kernel, and we need to override the Prepare, Execute, and ReSize interfaces.
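
A possible declaration of NPUGraph, mirroring the XXXGraph example above, is sketched below. The member names (npu_ops_, executor_, etc.) and the Init method are assumptions used for illustration; Init corresponds to the graph_kernel->Init() call in CreateNPUGraph above, and the constructor arguments match the new NPUGraph(...) call there.

// A possible declaration of NPUGraph; member names and the Init method are illustrative assumptions.
class NPUGraph : public kernel::Kernel {
 public:
  NPUGraph(const std::vector<NPUOp *> &npu_ops, NPUManager *npu_manager,
           const std::vector<tensor::MSTensor *> &inputs, const std::vector<tensor::MSTensor *> &outputs)
      : kernel::Kernel(inputs, outputs, nullptr, nullptr), npu_ops_(npu_ops), npu_manager_(npu_manager) {}

  ~NPUGraph() override;

  int Init();              // Build the NPU client for this sub-graph; called in CreateNPUGraph above
  int Prepare() override;  // Called in Build of Model
  int Execute() override;  // Called in Predict of Model
  int ReSize() override;   // Called in Resize of Model

 protected:
  std::vector<NPUOp *> npu_ops_;       // The NPUOps that make up this sub-graph
  NPUManager *npu_manager_ = nullptr;  // Manager of NPU model buffers and clients
  NPUExecutor *executor_ = nullptr;    // Runs the built NPU model; used in Execute below
};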

NPUGraph::Prepare mainly implements:

int NPUGraph::Prepare() {
  // Find the mapping relationship between hiai::AiTensor defined by NPU and MSTensor defined by MindSpore Lite
}

NPUGraph::Execute mainly implements:

int NPUGraph::Execute() {
  // 1. Processing input: copy input data from MSTensor to hiai::AiTensor
  // 2. Perform inference
  executor_->Execute();
  // 3. Processing output: copy output data from hiai::AiTensor to MSTensor
}

NPU is a third-party AI framework that was added by MindSpore Lite internal developers, so its usage is slightly different. You can either set the delegate in the Context through SetDelegate, or add the NPU device description KirinNPUDeviceInfo to the MutableDeviceInfo of the Context.
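
The following is a minimal sketch of enabling NPU through the device description instead of SetDelegate, assuming the standard Context device-info API; the frequency value 3 is only illustrative.

// A minimal sketch of enabling NPU through the device description; the frequency value is illustrative.
auto context = std::make_shared<mindspore::Context>();
if (context == nullptr) {
  std::cerr << "New context failed." << std::endl;
  return RET_ERROR;
}
auto npu_device_info = std::make_shared<mindspore::KirinNPUDeviceInfo>();
npu_device_info->SetFrequency(3);                          // Illustrative NPU frequency level
context->MutableDeviceInfo().push_back(npu_device_info);
// A CPU device is usually appended as well, so kernels that NPU does not support can fall back to CPU.
context->MutableDeviceInfo().push_back(std::make_shared<mindspore::CPUDeviceInfo>());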