Evaluation
Harness Evaluation
Introduction
LM Evaluation Harness is an open-source language model evaluation framework. It provides evaluation on more than 60 standard academic datasets, supports multiple evaluation modes such as HuggingFace model evaluation, PEFT adapter evaluation, and vLLM inference evaluation, and supports customized prompts and evaluation metrics, covering evaluation tasks of the loglikelihood, generate_until, and loglikelihood_rolling types. After MindFormers is adapted to the Harness evaluation framework, MindFormers models can be loaded for evaluation.
Installation
pip install lm_eval==0.4.3
Usage
Run the eval_with_harness.py script.
Viewing a Dataset Evaluation Task
#!/bin/bash
python toolkit/benchmarks/eval_with_harness.py --tasks list
Starting the Single-Device Evaluation Script
#!/bin/bash
python toolkit/benchmarks/eval_with_harness.py --model mf --model_args "pretrained=MODEL_DIR,device_id=0" --tasks TASKS
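For reference, a run on a hypothetical local model directory with two tasks might look as follows; the directory name is a placeholder, and gsm8k and boolq can be replaced with any task names listed by --tasks list.
#!/bin/bash
# Hypothetical example: evaluate a local model directory on two tasks at once.
# "./llama3-8b" is a placeholder model directory; replace the task names with
# any entries shown by "--tasks list".
python toolkit/benchmarks/eval_with_harness.py \
--model mf \
--model_args "pretrained=./llama3-8b,device_id=0" \
--tasks gsm8k,boolq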
Starting the Multi-Device Parallel Evaluation Script
#!/bin/bash
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
bash mindformers/scripts/msrun_launcher.sh "toolkit/benchmarks/eval_with_harness.py \
--model mf \
--model_args pretrained=MODEL_DIR,use_parallel=True,tp=1,dp=4 \
--tasks TASKS \
--batch_size 4" 4
You can specify multiple device IDs through the ASCEND_RT_VISIBLE_DEVICES environment variable.
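If eight devices are available, the same pattern scales up. The sketch below assumes, as in the 4-device example above, that the number of visible devices equals dp × tp and that the final argument of msrun_launcher.sh is the total number of worker processes; MODEL_DIR and TASKS remain placeholders.
#!/bin/bash
# Hypothetical 8-device run: 2-way model parallel x 4-way data parallel.
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
bash mindformers/scripts/msrun_launcher.sh "toolkit/benchmarks/eval_with_harness.py \
--model mf \
--model_args pretrained=MODEL_DIR,use_parallel=True,tp=2,dp=4 \
--tasks TASKS \
--batch_size 4" 8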
Evaluation Parameters
Harness parameters
| Parameter | Type | Description | Required |
|---|---|---|---|
| --model | str | The value must be mf, indicating the MindFormers evaluation policy. | Yes |
| --model_args | str | Model and evaluation parameters. For details, see "MindFormers model parameters." | Yes |
| --tasks | str | Dataset name. Multiple datasets can be specified, separated by commas (,). | Yes |
| --batch_size | int | Number of samples processed per batch. | No |
| --num_fewshot | int | Number of few-shot samples. | No |
| --limit | int | Number of samples evaluated for each task. This parameter is mainly used for functional tests. | No |
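As an illustration, the optional parameters can be combined for a quick functional test that evaluates only a few samples per task. The sketch below assumes the script forwards Harness's standard --num_fewshot and --limit flags listed above; MODEL_DIR is a placeholder.
#!/bin/bash
# Hypothetical functional test: 5-shot evaluation limited to 10 samples per task.
python toolkit/benchmarks/eval_with_harness.py \
--model mf \
--model_args "pretrained=MODEL_DIR,device_id=0" \
--tasks gsm8k \
--num_fewshot 5 \
--limit 10 \
--batch_size 2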
MindFormers model parameters
| Parameter | Type | Description | Required |
|---|---|---|---|
| pretrained | str | Model directory. | Yes |
| use_past | bool | Specifies whether to enable incremental inference. Must be enabled for evaluation tasks of the generate_until type. | No |
| device_id | int | Device ID. | No |
| use_parallel | bool | Specifies whether to enable the parallel policy. | No |
| dp | int | Degree of data parallelism. | No |
| tp | int | Degree of model parallelism. | No |
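Putting these together, a single-device --model_args string that enables incremental inference might look like the following sketch; the model directory and task name are illustrative.
#!/bin/bash
# Illustrative combination of the MindFormers model parameters above.
# MODEL_DIR is a placeholder for the model directory prepared below.
python toolkit/benchmarks/eval_with_harness.py \
--model mf \
--model_args "pretrained=MODEL_DIR,use_past=True,device_id=0" \
--tasks gsm8k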
Preparations Before Evaluation
1. Create a model directory MODEL_DIR.
2. Store the MindFormers weights, YAML configuration file, and tokenizer file in the model directory (see the example layout after the YAML reference below). For details about how to obtain the weights and files, see the README file of the corresponding MindFormers model.
3. Configure the YAML file.
YAML configuration reference:
run_mode: 'predict'
model:
  model_config:
    use_past: True
    checkpoint_name_or_path: "model.ckpt"
processor:
  tokenizer:
    vocab_file: "tokenizer.model"
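As a reference, MODEL_DIR might look like the following after these steps; the file names are illustrative and depend on the specific model.
MODEL_DIR/
├── model.ckpt            # MindFormers weights (checkpoint_name_or_path)
├── predict_model.yaml    # YAML configuration file such as the reference above
└── tokenizer.model       # tokenizer file (vocab_file)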
Evaluation Example
#!/bin/bash
python toolkit/benchmarks/eval_with_harness.py --model mf --model_args "pretrained=./llama3-8b,use_past=True" --tasks gsm8k
The evaluation result is as follows. Filter indicates the matching mode applied to the model output, Metric indicates the evaluation metric, Value indicates the evaluation score, and Stderr indicates the standard error of the score.
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.5034 | ± | 0.0138 |
| | | strict-match | 5 | exact_match | ↑ | 0.5011 | ± | 0.0138 |
Features
For details about all Harness evaluation tasks, see Viewing a Dataset Evaluation Task.
VLMEvalKit Evaluation
Overview
VLMEvalKit is an open-source toolkit designed for evaluating large vision-language models. It supports one-click evaluation of large vision-language models on various benchmarks without complicated data preparation, making the evaluation process easier. It supports a variety of image-text multimodal evaluation sets and video multimodal evaluation sets, a variety of API models and open-source models based on PyTorch and HuggingFace, and customized prompts and evaluation metrics. After MindFormers is adapted to the VLMEvalKit evaluation framework, multimodal large models in MindFormers can be loaded for evaluation.
Supported Feature Descriptions
Supports automatic download of evaluation datasets.
Supports user-defined input of multiple datasets and models (currently only cogvlm2-llama3-chat-19B is supported; more models will be added gradually in subsequent releases).
Generates results with one click.
Installation
git clone https://github.com/open-compass/VLMEvalKit.git
cd VLMEvalKit
pip install -e .
Usage
Run the script eval_with_vlmevalkit.py.
Launching a Single-Card Evaluation Script
#!/bin/bash
python eval_with_vlmevalkit.py \
--data MME \
--model cogvlm2-llama3-chat-19B \
--verbose \
--work-dir /{path}/evaluate_result \
--model-path /{path}/cogvlm2_model_path \
--config-path /{path}/cogvlm2_config_path
Evaluation Parameters
VLMEvalKit main parameters
| Parameter | Type | Description | Required |
|---|---|---|---|
| --data | str | Name of the dataset. Multiple datasets can be passed in, separated by spaces. | Yes |
| --model | str | Name of the model. Multiple models can be passed in, separated by spaces. | Yes |
| --verbose | / | Outputs logs from the evaluation run. | No |
| --work-dir | str | Directory for storing evaluation results. By default, results are stored in a folder with the same name as the model in the current directory. | No |
| --model-path | str | Path containing all model-related files (weights, tokenizer files, configuration files, processor files). Multiple paths can be passed in, in the same order as the models, separated by spaces. | Yes |
| --config-path | str | Model configuration file path. Multiple paths can be passed in, in the same order as the models, separated by spaces. | Yes |
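For example, two datasets can be evaluated in a single run by passing them space-separated. In the sketch below, MME and COCO_VAL are the dataset names used elsewhere in this document, and the /{path}/... directories are placeholders.
#!/bin/bash
# Hypothetical multi-dataset run: MME and COCO_VAL evaluated in one invocation.
python eval_with_vlmevalkit.py \
--data MME COCO_VAL \
--model cogvlm2-llama3-chat-19B \
--verbose \
--work-dir /{path}/evaluate_result \
--model-path /{path}/cogvlm2_model_path \
--config-path /{path}/cogvlm2_config_path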
Preparation Before Evaluation
1. Create a model directory model_path.
2. Place the MindFormers weights, YAML configuration file, and tokenizer file in the model directory (see the example layout after the YAML reference below). For details about how to obtain them, see the README document of the corresponding MindFormers model.
3. Configure the YAML configuration file.
YAML configuration reference:
load_checkpoint: "/{path}/model.ckpt"  # Specify the path to the weights file
model:
  model_config:
    use_past: True     # Turn on incremental inference
    is_dynamic: False  # Turn off dynamic shape
tokenizer:
  vocab_file: "/{path}/tokenizer.model"  # Specify the tokenizer file path
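As a reference, the model directory passed via --model-path might look like the following after these steps; the file names are illustrative, and the full set of required files is described in the model README.
cogvlm2_model_path/
├── model.ckpt        # MindFormers weights (load_checkpoint)
├── tokenizer.model   # tokenizer file (vocab_file)
└── ...               # other model-related files (configuration, processor files)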
Evaluation Sample
#!/bin/bash
export USE_ROPE_SELF_DEFINE=True
python eval_with_vlmevalkit.py \
--data COCO_VAL \
--model cogvlm2-llama3-chat-19B \
--verbose \
--work-dir /{path}/evaluate_result \
--model-path /{path}/cogvlm2_model_path \
--config-path /{path}/cogvlm2_config_path
The evaluation results are as follows, where Bleu and ROUGE_L are metrics originally used to evaluate translation quality, and CIDEr is a metric for evaluating the image captioning task.
{
  "Bleu": [
    15.523950970070652,
    8.971141548228058,
    4.702477458554666,
    2.486860744700995
  ],
  "ROUGE_L": 15.575063213115946,
  "CIDEr": 0.01734615519604295
}