Service Deployment with Multiple Subgraphs and Stateful Model
Overview
MindSpore allows a model to generate multiple subgraphs. Such a model is generally a stateful model. Multiple subgraps can share weights and cooperate with each other to achieve performance optimization. For example, in the Pengcheng·Pangu Model network scenario, based on a sentence, a sentence is generated through multiple inferences, where each inference generates one word. Two graphs are generated for different input lengths. The first graph processes the text with variable length for the first time and only needs to be executed one time. The second graph processes the word generated last time and needs to be executed multiple time. Compared with that before optimization, only the first graph is executed multiple times, the inference service performance can be improved by 5 to 6 times.
MindSpore Serving supports scheduling multi-subgraph model, improving the inference service performance in specific scenarios.
The following uses a simple single-chip model scenario as an example to describe the multi-subgraph model deployment process. For the detail about the distributed scenario, see Pengcheng·Pangu Model model Serving deployment
Environment Preparation
Before running the sample network, ensure that MindSpore Serving has been properly installed and the environment variables are configured. To install and configure MindSpore Serving on your PC, go to the MindSpore Serving installation page.
Downloading the Example
Please download the example first.
Exporting a Multi-Graph Model
In the directory export_model
, use export_matmul.py to build a network with the Matmul and ReduceSum operator and export the MindSpore inference deployment model based on two different inputs.
import os
from shutil import copyfile
import numpy as np
import mindspore as ms
from mindspore.nn import Cell
class Net(Cell):
"""Net"""
def __init__(self, matmul_size, init_val, transpose_a=False, transpose_b=False):
"""init"""
super().__init__()
matmul_np = np.full(matmul_size, init_val, dtype=np.float32)
self.matmul_weight = ms.Parameter(ms.Tensor(matmul_np))
self.matmul = ops.MatMul(transpose_a=transpose_a, transpose_b=transpose_b)
self.sum = ops.ReduceSum()
def construct(self, inputs):
"""construct"""
x = self.matmul(inputs, self.matmul_weight)
x = self.sum(x, 0)
return x
def export_net():
"""Export matmul net , and copy output model `matmul_0.mindir` and `matmul_1.mindir` to directory ../matmul/1"""
ms.set_context(mode=ms.GRAPH_MODE)
network = Net(matmul_size=(96, 16), init_val=0.5)
# subgraph 0: 128,96 matmul 16,96 -> 128,16 reduce sum axis 0-> 16
predict_data = np.random.randn(128, 96).astype(np.float32)
# pylint: disable=protected-access
ms.export(network, ms.Tensor(predict_data), file_name="matmul_0", file_format="MINDIR")
# subgraph 1: 8,96 matmul 16,96 -> 8,16 reduce sum axis 0-> 16
predict_data = np.random.randn(8, 96).astype(np.float32)
# pylint: disable=protected-access
ms.export(network, ms.Tensor(predict_data), file_name="matmul_1", file_format="MINDIR")
dst_dir = '../matmul/1'
try:
os.mkdir(dst_dir)
except OSError:
pass
dst_file = os.path.join(dst_dir, 'matmul_0.mindir')
copyfile('matmul_0.mindir', dst_file)
print("copy matmul_0.mindir to " + dst_dir + " success")
dst_file = os.path.join(dst_dir, 'matmul_1.mindir')
copyfile('matmul_1.mindir', dst_file)
print("copy matmul_1.mindir to " + dst_dir + " success")
if __name__ == "__main__":
export_net()
To use MindSpore for neural network definition, inherit mindspore.nn.Cell
. (A Cell
is a base class of all neural networks.) Define each layer of a neural network in the __init__
method in advance, and then define the construct
method to complete the forward construction of the neural network. Use export
of the mindspore
module to export the model file.
For more detailed examples, see Quick Start for Beginners.
Execute the export_matmul.py
script to generate the matmul_0.mindir
and matmul_1.mindir
files. The inputs shapes of these subgraphs are [128,96] and [8,96].
Deploying Inference Services
Configuring the Service
Start Serving with the following files:
matmul_multi_subgraphs
├── matmul/
│ │── servable_config.py
│ └── 1/
│ │── matmul_0.mindir
│ └── matmul_1.mindir
└── serving_server.py
serving_server.py
: Script file for starting the service.matmul
: Model folder, which is named after the model name.matmul_0.mindir
andmatmul_1.mindir
: Model files generated by the network in the previous step, which is stored in folder 1 (the number indicates the version number). Different versions are stored in different folders. The version number must be a string of digits. By default, the latest model file is started.servable_config.py: Model configuration file, which defines the model processing functions, in which method
predict
is defined.
Content of the configuration file:
from mindspore_serving.server import distributed
from mindspore_serving.server import register
model = register.declare_model(model_file=["matmul_0.mindir", "matmul_1.mindir"], model_format="MindIR",
with_batch_dim=False)
def process(x, y):
z1 = model.call(x, subgraph=0) # 128,96 matmul 16,96 -> reduce sum axis 0-> 16
z2 = model.call(y, subgraph=1) # 8,96 matmul 16,96 -> reduce sum axis 0-> 16
return z1 + z2
@register.register_method(output_names=["z"])
def predict(x, y):
z = register.add_stage(process, x, y, outputs_count=1)
return z
If a model is stateful, multiple calls to the model need to performed in a Python function stage to avoid mutual interference between multiple instances. A mutli-subgraph model generally a stateful model.
In this example, the process
function invokes the two subgraphs through the interface Model.call
, and the subgraph
parameter of Model.call
specifies the graph index, which starts from 0. The number is the sequence number for loading graphs. In a standalone system, the number corresponds to the sequence number in the model_file
parameter list of the declare_model
interface. In a distributed system, this number corresponds to the sequence number in the model_files
parameter list of the startup_agents
interface.
Starting the Service
Run the serving_server.py script to start the Serving server:
import os
import sys
from mindspore_serving import server
def start():
servable_dir = os.path.dirname(os.path.realpath(sys.argv[0]))
servable_config = server.ServableStartConfig(servable_directory=servable_dir, servable_name="matmul",
device_ids=(0, 1))
server.start_servables(servable_config)
server.start_grpc_server("127.0.0.1:5500")
server.start_restful_server("127.0.0.1:1500")
if __name__ == "__main__":
start()
The above startup script will load and run two inference copies of matmul
on devices 0 and 1, and the inference requests from the client will be split to the two inference copies.
If the server prints the Serving gRPC server start success, listening on 127.0.0.1:5500
log, the Serving gRPC service has started successfully and the inference model has already loaded successfully.
Executing Inference
To access the inference service through gRPC, you need to specify the network address of the gRPC server on the client. Execute serving_client.py to call the predict
method of the MatMul Servable.
import numpy as np
from mindspore_serving.client import Client
def run_matmul():
"""Run client of distributed matmul"""
client = Client("localhost:5500", "matmul", "predict")
instance = {"x": np.ones((128, 96), np.float32), "y": np.ones((8, 96), np.float32)}
result = client.infer(instance)
print("result:\n", result)
assert "z" in result[0]
if __name__ == '__main__':
run_matmul()
If the following information is displayed, the Serving inference service has correctly executed the multi-subgraph inference:
result:
[{'z': array([6528., 6528., 6528., 6528., 6528., 6528., 6528., 6528., 6528.,
6528., 6528., 6528., 6528., 6528., 6528., 6528.], }]