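NLP (85): A First Look at the Multimodal Model Yi-VL-34B

This post describes the pitfalls I ran into while using the multimodal model Yi-VL-34B, along with my hands-on impressions of it.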

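On January 22, 2024, 01.AI (零一万物) added a new member to its Yi model family: the Yi Vision Language (Yi-VL) multimodal large language model was officially open-sourced worldwide. Yi-VL is reportedly built on the Yi language model and comes in two versions, Yi-VL-34B and Yi-VL-6B.

Thanks to its strong image-text understanding and dialogue generation, Yi-VL achieved leading results on the English benchmark MMMU and the Chinese benchmark CMMMU. Its MMMU score was second only to GPT-4V, and its CMMMU score trailed only GPT-4V and Qwen-VL-Plus, which demonstrates strong performance on complex interdisciplinary tasks.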

The Yi-VL models are now open source and available at:

HuggingFace (international platform): https://huggingface.co/01-ai/Yi-VL-34B
ModelScope (魔搭社区, China-based platform): https://www.modelscope.cn/organization/01ai

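This post is organized into three parts:

Deploying the Yi-VL model
Model inference in CLI mode
Visual question answering in a web UI

All experiments in this post use the Yi-VL-34B model.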

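Model Deployment

The HuggingFace model page describes how to deploy Yi-VL-34B, mainly by following the LLaVA framework. I tried that route for several days, but it required many code changes, I hit quite a few pitfalls, and I still could not get the model deployed.

The Yi GitHub project, by contrast, provides a convenient way to deploy the model; see https://github.com/01-ai/Yi/tree/0124/VL.

Package installation: torch == 2.1.2; for the remaining packages, follow the requirements.txt file in the GitHub project.
Other changes: in the Yi-VL-34B model's config.json, set mm_vision_tower to the local path of the CLIP model.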
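The config.json change can be made by hand, but a small script keeps it repeatable. The following is a minimal sketch, assuming config.json is a flat JSON object with an mm_vision_tower key as described above; the function name set_vision_tower and the paths in the usage comment are hypothetical.

```python
import json
from pathlib import Path


def set_vision_tower(config_path: str, clip_path: str) -> dict:
    """Point mm_vision_tower in a model's config.json at a local CLIP directory."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config["mm_vision_tower"] = clip_path  # all other keys are left untouched
    path.write_text(
        json.dumps(config, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return config


# Hypothetical usage; replace both paths with your own locations:
# set_vision_tower("/path/to/Yi-VL-34B/config.json", "/path/to/local/clip")
```

The function rewrites the file in place, so it is worth keeping a copy of the original config.json before running it.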

After these steps, deployment is complete. The system prompt that Yi-VL-34B uses in conversation is adjusted as follows:

This is a chat between an inquisitive human and an AI assistant. Assume the role of the AI assistant. Read all the images carefully, and respond to the human's questions with informative, helpful, detailed and polite answers. 这是一个好奇的人类和一个人工智能助手之间的对话。假设你扮演这个AI助手的角色。仔细阅读所有的图像，并对人类的问题做出信息丰富、有帮助、详细的和礼貌的回答。

### Human: <image_placeholder>
Describe the cats and what they are doing in detail.
### Assistant:
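To make the layout of this prompt explicit, here is a minimal sketch that assembles a single-turn prompt in the same shape. It only mirrors the text shown above; the actual template lives in the Yi project's conversation code and may differ in whitespace or separators. The function name build_prompt is hypothetical.

```python
def build_prompt(
    system: str, question: str, image_token: str = "<image_placeholder>"
) -> str:
    """Lay out one single-turn prompt in the '### Human / ### Assistant' format."""
    return (
        f"{system}\n\n"
        f"### Human: {image_token}\n"
        f"{question}\n"
        f"### Assistant:"
    )


print(build_prompt("<system prompt here>", "Describe the cats and what they are doing in detail."))
```

The prompt ends right after "### Assistant:", so the model's generation continues from that point as the assistant's reply.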

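Model Inference

To run inference in command-line (CLI) mode, download the images into the images folder and make a small change to single_inference.py so that it supports asking several questions in a row.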
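The exact change to single_inference.py is not shown in this post. One way to support repeated questions is to load the model once and then loop over questions, as in the sketch below. The helper ask_many and its answer_fn argument are hypothetical; answer_fn stands in for whatever function runs one inference pass on an already loaded model.

```python
from typing import Callable, Iterable, Iterator, Tuple


def ask_many(
    answer_fn: Callable[[str], str], questions: Iterable[str]
) -> Iterator[Tuple[str, str]]:
    """Yield (question, answer) pairs, skipping blank lines and stopping at 'exit'."""
    for raw in questions:
        question = raw.strip()
        if question.lower() == "exit":
            return
        if not question:
            continue
        yield question, answer_fn(question)


# Stand-in answer function so the sketch runs without a GPU or model weights.
for q, a in ask_many(lambda q: f"(answer to: {q})", ["How many cats?", "", "exit"]):
    print(q, "->", a)
```

For interactive use, questions can be any iterable that reads from the terminal, so the expensive model load happens only once per session.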

The command is as follows:

`CUDA_VISIBLE_DEVICES=0 python single_inference.py --model-path /data-ai/usr/models/Yi-VL-34B --image-file images/cats.jpg --question "How many cats are there in this image?"`

A single A100 (80 GB of GPU memory) is enough for inference.
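A rough back-of-the-envelope check shows why one 80 GB card is enough. The sketch below assumes a nominal 34 billion parameters stored in bfloat16 (2 bytes each, the dtype used in the code later in this post). It counts the language model weights only and ignores the vision tower, activations, and the KV cache, so it is a lower bound rather than a measurement.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Nominal parameter count (assumption): 34e9; bfloat16 uses 2 bytes per parameter.
llm_weights_gb = weight_memory_gb(34e9)
print(f"{llm_weights_gb:.0f} GB")  # prints "68 GB"
```

That leaves roughly 12 GB of the 80 GB for everything else, which suggests a single image and a short answer fit, while long outputs or large batches would need more care.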

The example image is shown below:

The model's reply is as follows:

Visual Question Answering

Building on this, we use the gradio module to compare the outputs of Yi-VL-34B and GPT-4V side by side.

The Python code is as follows:

import os
import traceback

import torch
import requests
import json
import base64
import gradio as gr
from PIL import Image
from datetime import datetime

from llava.conversation import conv_templates
from llava.mm_utils import (
    KeywordsStoppingCriteria,
    expand2square,
    load_pretrained_model,
    get_model_name_from_path,
    tokenizer_image_token
)
from llava.model.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX, key_info


def disable_torch_init():
    """
    Disable the redundant torch default initialization to accelerate model creation.
    """
    setattr(torch.nn.Linear, "reset_parameters", lambda self: None)
    setattr(torch.nn.LayerNorm, "reset_parameters", lambda self: None)


disable_torch_init()
model_path = "/data-ai/usr/models/Yi-VL-34B"
key_info["model_path"] = model_path
get_model_name_from_path(model_path)
tokenizer, model, image_processor, context_len = load_pretrained_model(model_path)
print("model loaded!")


def model_infer(qs, image_file):
    global model
    qs = DEFAULT_IMAGE_TOKEN + "\n" + qs
    conv = conv_templates["mm_default"].copy()
    conv.append_message(conv.roles[0], qs)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()

    input_ids = (
        tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt")
        .unsqueeze(0)
        .cuda()
    )

    image = Image.open(image_file)
    if getattr(model.config, "image_aspect_ratio", None) == "pad":
        image = expand2square(
            image, tuple(int(x * 255) for x in image_processor.image_mean)
        )
    image_tensor = image_processor.preprocess(image, return_tensors="pt")[
        "pixel_values"
    ][0]

    stop_str = conv.sep
    keywords = [stop_str]
    stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)
    model = model.to(dtype=torch.bfloat16)
    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            images=image_tensor.unsqueeze(0).to(dtype=torch.bfloat16).cuda(),
            do_sample=True,
            temperature=0.1,
            top_p=0.7,
            num_beams=1,
            stopping_criteria=[stopping_criteria],
            max_new_tokens=1024,
            use_cache=True,
        )

    input_token_len = input_ids.shape[1]
    n_diff_input_output = (input_ids != output_ids[:, :input_token_len]).sum().item()
    if n_diff_input_output > 0:
        print(
            f"[Warning] {n_diff_input_output} output_ids are not the same as the input_ids"
        )
    outputs = tokenizer.batch_decode(
        output_ids[:, input_token_len:], skip_special_tokens=True
    )[0]
    outputs = outputs.strip()

    if outputs.endswith(stop_str):
        outputs = outputs[: -len(stop_str)]
    outputs = outputs.strip()
    return outputs


def gpt_4v_answer(question, image_path):
    try:
        with open(image_path, "rb") as image_file:
            base64_image = base64.b64encode(image_file.read()).decode('utf-8')

        api_key = os.environ["OPENAI_API_KEY"]
        url = "https://api.openai.com/v1/chat/completions"

        payload = json.dumps({
            "model": "gpt-4-vision-preview",
            "messages": [
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": question
                        },
                        {
                            "type": "image_url",
                            "image_url": {
                                "url": f"data:image/jpeg;base64,{base64_image}"
                            }
                        }
                    ]
                }
            ],
            "max_tokens": 1024,
            "temperature": 0.1,
            "top_p": 0.7
        })
        headers = {
            'Content-Type': 'application/json',
            'Authorization': f'Bearer {api_key}',
        }

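        response = requests.request("POST", url, headers=headers, data=payload)
        # Log the raw body when the API returns a non-2xx status (e.g. an error object).
        if not response.ok:
            print(response.text)
        result = response.json()["choices"][0]["message"]["content"]
    except Exception:
        # Covers file, network, and response-parsing failures alike; `response`
        # may not exist yet at this point, so it is not referenced here.
        print(traceback.format_exc())
        result = "GPT-4V request failed"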

    return result


def model_answer(question, array):
    # save the uploaded image to disk so that both models read the same file
    os.makedirs("./images", exist_ok=True)
    im = Image.fromarray(array).convert("RGB")  # JPEG cannot store an alpha channel
    image_path = f"./images/{datetime.now().strftime('%Y_%m_%d_%H_%M_%S')}.jpg"
    im.save(image_path)
    # yi-vl-34b output
    model_output = model_infer(question, image_path)
    # gpt-4v output
    gpt4v_output = gpt_4v_answer(question, image_path)
    print(f"question: {question}\n"
          f"yi-vl-34b output: {model_output}\n"
          f"gpt-4v output: {gpt4v_output}\n")
    return model_output, gpt4v_output


if __name__ == '__main__':
    with gr.Blocks() as demo:
        with gr.Row():
            with gr.Column():
                # gr.Image replaces the deprecated gr.inputs.Image; it passes a numpy array by default
                image_box = gr.Image()
                user_input = gr.TextArea(lines=5, placeholder="your question about the image")
            with gr.Column():
                yi_vl_output = gr.TextArea(lines=5, label='Yi-VL-34B')
                gpt_4v_output = gr.TextArea(lines=5, label='GPT-4V')
                submit = gr.Button("Submit")
        submit.click(fn=model_answer,
                     inputs=[user_input, image_box],
                     outputs=[yi_vl_output, gpt_4v_output])

    demo.launch(server_port=50072, server_name="0.0.0.0", share=True)

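Below are the replies from the two models to different questions (the questions were asked in Chinese and are translated here):

Image: taishan.jpg, question: Which mountain in China is shown in this picture?

Image: dishini.jpg, question: Which attraction's logo is shown in this picture?

Image: fruit.jpg, question: Describe this picture in detail.

Image: football.jpg, question: How many people are in the picture in total, and what are they doing?

Image: cartoon.jpg, question: Which Japanese anime is this picture from?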

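Judging from these test cases, Yi-VL-34B performs quite well, but compared with GPT-4V there is still a noticeable gap, both in image understanding and in the quality of its answers.

Finally, let's look at a CAPTCHA example (GPT-4V cannot be used to crack CAPTCHAs!).

As the result shows, Yi-VL-34B attempts an answer but gets it wrong, while GPT-4V returns an error with the following message: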

{
  "error": {
    "message": "Your input image may contain content that is not allowed by our safety system.",
    "type": "invalid_request_error",
    "param": null,
    "code": "content_policy_violation"
  }
}
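Because an error response like this has no "choices" field, indexing into it directly raises a KeyError. The following is a minimal sketch of a parser that checks for the error object first. It assumes only the two response shapes that appear in this post; the function name extract_answer is hypothetical.

```python
def extract_answer(body: dict) -> str:
    """Return the assistant text, or a readable message if the API returned an error object."""
    if "error" in body:
        err = body["error"]
        return f"gpt-4v error ({err.get('code')}): {err.get('message')}"
    return body["choices"][0]["message"]["content"]


error_body = {
    "error": {
        "message": "Your input image may contain content that is not allowed by our safety system.",
        "type": "invalid_request_error",
        "param": None,
        "code": "content_policy_violation",
    }
}
print(extract_answer(error_body))
```

With this approach the web UI can show the refusal reason next to the Yi-VL-34B answer instead of a generic failure string.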