Quantization

Overview

Quantization is an important technology for compressing foundation models. It converts floating-point parameters in a model into low-precision integer parameters to compress the parameters. As the parameters and specifications of a model increase, quantization can effectively reduce the model storage space and loading time during model deployment, improving the model inference performance.

MindSpore Transformers integrates the MindSpore Golden Stick tool component to provide a unified quantization inference process, facilitating out-of-the-box use. Please refer to MindSpore Golden Stick Installation Tutorial for installation and MindSpore Golden Stick Application PTQ algorithm to quantify the models in MindSpore Transformers.

Model Support

Currently, only the following models are supported, and the supported models are continuously being added.

Supported Model
DeepSeek-V3
DeepSeek-R1
Llama2