NLP (80): LLM Evaluation

This article shows how to use the Evaluation tools in LangChain to evaluate LLM outputs.

Clarifying the Concepts

First, let's distinguish two notions that both translate to "evaluation" in English: benchmarking (评测) and evaluation (评估) of large language models.

In this article, benchmarking refers to assessing a model's capabilities against benchmark datasets (such as MMLU, AGIEval, or C-Eval), usually with metrics such as accuracy. Benchmarking is relatively objective and comprehensive, and targets the model's built-in capabilities and world knowledge.

Evaluation, in contrast, refers to judging the quality of a model's response to a concrete question, including its correctness, conciseness, and so on. Evaluation leans toward practical application, and concerns how the model's capabilities extend to real-world use.

Of course, this distinction is only a convention introduced in this article, adopted for clarity of exposition so that readers are not confused. In practice, although the industry has proposed some reasonable approaches to assessing LLM capabilities, the field is still wide open and there is no unified standard.

Introduction

Evaluation is undoubtedly a key link in the LLM ecosystem. LangChain provides Evaluation tools that make evaluating LLM outputs easier and more effective.

Each evaluator type in LangChain comes with a ready-to-use implementation and an extensible API that allows customization for a user's specific needs. The common evaluator types provided by LangChain are:

  • String Evaluators: assess a predicted string for a given input, usually by comparing it against a reference string.
  • Trajectory Evaluators: assess the entire trajectory of an agent's actions.
  • Comparison Evaluators: compare the predictions from two runs on a common input.

The evaluator types and their descriptions, taken from the LangChain source code (version langchain==0.0.340), are as follows:

from enum import Enum


class EvaluatorType(str, Enum):
    """The types of the evaluators."""

    QA = "qa"
    """Question answering evaluator, which grades answers to questions
    directly using an LLM."""
    COT_QA = "cot_qa"
    """Chain of thought question answering evaluator, which grades
    answers to questions using chain of thought 'reasoning'."""
    CONTEXT_QA = "context_qa"
    """Question answering evaluator that incorporates 'context' in the response."""
    PAIRWISE_STRING = "pairwise_string"
    """The pairwise string evaluator, which predicts the preferred prediction from
    between two models."""
    SCORE_STRING = "score_string"
    """The scored string evaluator, which gives a score between 1 and 10
    to a prediction."""
    LABELED_PAIRWISE_STRING = "labeled_pairwise_string"
    """The labeled pairwise string evaluator, which predicts the preferred prediction
    from between two models based on a ground truth reference label."""
    LABELED_SCORE_STRING = "labeled_score_string"
    """The labeled scored string evaluator, which gives a score between 1 and 10
    to a prediction based on a ground truth reference label."""
    AGENT_TRAJECTORY = "trajectory"
    """The agent trajectory evaluator, which grades the agent's intermediate steps."""
    CRITERIA = "criteria"
    """The criteria evaluator, which evaluates a model based on a
    custom set of criteria without any reference labels."""
    LABELED_CRITERIA = "labeled_criteria"
    """The labeled criteria evaluator, which evaluates a model based on a
    custom set of criteria, with a reference label."""
    STRING_DISTANCE = "string_distance"
    """Compare predictions to a reference answer using string edit distances."""
    EXACT_MATCH = "exact_match"
    """Compare predictions to a reference answer using exact matching."""
    REGEX_MATCH = "regex_match"
    """Compare predictions to a reference answer using regular expressions."""
    PAIRWISE_STRING_DISTANCE = "pairwise_string_distance"
    """Compare predictions based on string edit distances."""
    EMBEDDING_DISTANCE = "embedding_distance"
    """Compare a prediction to a reference label using embedding distance."""
    PAIRWISE_EMBEDDING_DISTANCE = "pairwise_embedding_distance"
    """Compare two predictions using embedding distance."""
    JSON_VALIDITY = "json_validity"
    """Check if a prediction is valid JSON."""
    JSON_EQUALITY = "json_equality"
    """Check if a prediction is equal to a reference JSON."""
    JSON_EDIT_DISTANCE = "json_edit_distance"
    """Compute the edit distance between two JSON strings after canonicalization."""
    JSON_SCHEMA_VALIDATION = "json_schema_validation"
    """Check if a prediction is valid JSON according to a JSON schema."""

This article covers three of them:

  • CRITERIA
  • LABELED_CRITERIA
  • CONTEXT_QA

GPT-4 is typically used as the evaluation model.

CRITERIA Evaluation

CRITERIA evaluation usually runs without a reference answer. The built-in criteria can be listed as follows:

from pprint import pprint
from langchain.evaluation import Criteria

# list criteria
pprint(list(Criteria))

The output is as follows:

[<Criteria.CONCISENESS: 'conciseness'>,
 <Criteria.RELEVANCE: 'relevance'>,
 <Criteria.CORRECTNESS: 'correctness'>,
 <Criteria.COHERENCE: 'coherence'>,
 <Criteria.HARMFULNESS: 'harmfulness'>,
 <Criteria.MALICIOUSNESS: 'maliciousness'>,
 <Criteria.HELPFULNESS: 'helpfulness'>,
 <Criteria.CONTROVERSIALITY: 'controversiality'>,
 <Criteria.MISOGYNY: 'misogyny'>,
 <Criteria.CRIMINALITY: 'criminality'>,
 <Criteria.INSENSITIVITY: 'insensitivity'>,
 <Criteria.DEPTH: 'depth'>,
 <Criteria.CREATIVITY: 'creativity'>,
 <Criteria.DETAIL: 'detail'>]

Taking the conciseness (CONCISENESS) criterion as an example, the evaluation code is as follows:

# -*- coding: utf-8 -*-
import os
from langchain.evaluation import load_evaluator, EvaluatorType, Criteria
from langchain.chat_models import ChatOpenAI
from pprint import pprint

os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY")
evaluation_llm = ChatOpenAI(model="gpt-4", temperature=0)

evaluator = load_evaluator(EvaluatorType.CRITERIA,
                           criteria=Criteria.CONCISENESS,
                           llm=evaluation_llm)

eval_result = evaluator.evaluate_strings(
    prediction="What's 2+2? That's an elementary question. The answer you're looking for is that two and two is four.",
    input="What's 2+2?",
)
pprint(eval_result)

The output is as follows:

{'reasoning': 'The criterion is conciseness, which means the submission should '
              'be brief and to the point. \n'
              '\n'
              'Looking at the submission, the answer to the question "What\'s '
              '2+2?" is given as "The answer you\'re looking for is that two '
              'and two is four." However, before providing the answer, the '
              'respondent adds an unnecessary comment: "That\'s an elementary '
              'question." \n'
              '\n'
              'This additional comment does not contribute to answering the '
              'question and therefore makes the response less concise. \n'
              '\n'
              'So, based on the criterion of conciseness, the submission does '
              'not meet the criterion.\n'
              '\n'
              'N',
 'score': 0,
 'value': 'N'}

The result contains three fields:

  • score: the evaluation score, here 0 or 1; 1 means the prediction meets the criterion, 0 means it does not
  • value: the evaluation verdict, here Y (Yes) or N (No)
  • reasoning: the evaluator's explanation of how it reached its verdict

In the example above, GPT-4, serving as the evaluation model, judged that the answer fails the conciseness criterion and gave its specific reasoning.
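Because every result carries a numeric score of 0 or 1, aggregating a batch of evaluations into a pass rate is straightforward. A minimal sketch, assuming a list of result dicts in the shape shown above (pass_rate is our own helper, not a LangChain API):

```python
def pass_rate(results):
    # Fraction of evaluations whose score is 1, i.e. whose output met the criterion.
    if not results:
        return 0.0
    return sum(r["score"] for r in results) / len(results)


# Hypothetical batch of three evaluator outputs (reasoning omitted).
batch = [
    {"score": 0, "value": "N", "reasoning": "..."},
    {"score": 1, "value": "Y", "reasoning": "..."},
    {"score": 1, "value": "Y", "reasoning": "..."},
]
print(round(pass_rate(batch), 2))  # 0.67
```

In practice, running each criterion over a held-out set of question/answer pairs and tracking such pass rates over time is a simple way to watch for regressions.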

LABELED_CRITERIA Evaluation

LABELED_CRITERIA evaluation includes a reference (参考答案); with a reference answer available, the evaluation result is more reliable.

The example code is as follows:

# -*- coding: utf-8 -*-
import os
from langchain.evaluation import load_evaluator, EvaluatorType, Criteria
from langchain.chat_models import ChatOpenAI
from pprint import pprint

os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY")
evaluation_llm = ChatOpenAI(model="gpt-4", temperature=0)

evaluator = load_evaluator(EvaluatorType.LABELED_CRITERIA,
                           criteria=Criteria.CORRECTNESS,
                           llm=evaluation_llm)

eval_result = evaluator.evaluate_strings(
    input="What is the capital city of the Jiangsu Province?",
    prediction="Suzhou",
    reference="The capital city of Jiangsu Province is Nanjing.",
)
pprint(eval_result)

The output is as follows:

{'reasoning': 'The criterion for this task is the correctness of the submitted '
              'answer. This means the answer should be accurate and factual.\n'
              '\n'
              'The input asks for the capital city of the Jiangsu Province.\n'
              '\n'
              'The submitted answer is Suzhou.\n'
              '\n'
              'The reference answer, however, states that the capital city of '
              'Jiangsu Province is Nanjing.\n'
              '\n'
              'Comparing the submitted answer with the reference answer, it is '
              'clear that the submitted answer is incorrect. Suzhou is not the '
              'capital of Jiangsu Province, Nanjing is.\n'
              '\n'
              'Therefore, the submission does not meet the criterion of '
              'correctness.\n'
              '\n'
              'N',
 'score': 0,
 'value': 'N'}

CONTEXT_QA Evaluation

CONTEXT_QA evaluation grades answers produced in document question answering, which also makes it applicable to evaluating the responses of RAG (Retrieval-Augmented Generation) systems.

Here, the reference to supply is the document used in the question answering.

The example code is as follows:

import os
from langchain.evaluation import load_evaluator, EvaluatorType, Criteria
from langchain.chat_models import ChatOpenAI

os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY")
evaluation_llm = ChatOpenAI(model="gpt-4", temperature=0)

evaluator = load_evaluator(EvaluatorType.CONTEXT_QA,
                           criteria=Criteria.CORRECTNESS,
                           llm=evaluation_llm,
                           requires_reference=True)


question = "2022年上海的常住人口是多少?"
context = "上海市,简称沪。 是中华人民共和国直辖市、中国共产党诞生地、国家中心城市、超大城市 、上海大都市圈核心城市、中国历史文化名城、" \
          "世界一线城市。 上海基本建成国际经济、金融、贸易、航运中心,形成具有全球影响力的科技创新中心基本框架。" \
          "上海市总面积6340.5平方千米,辖16个区。 2022年,上海市常住人口为2475.89万人。"

pred = "2022年,上海市的常住人口为2000万人。"
# Evaluate the prediction against the source document.
eval_result = evaluator.evaluate_strings(
    input=question,
    prediction=pred,
    reference=context
)
print(eval_result)

The output is as follows:

{'reasoning': 'INCORRECT', 'value': 'INCORRECT', 'score': 0}
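Note that CONTEXT_QA reports its value as CORRECT/INCORRECT rather than Y/N, although score is still 0 or 1. When mixing evaluator types in one pipeline, a small normalization step keeps the results comparable; normalize below is a hypothetical helper of our own, not part of LangChain:

```python
def normalize(result):
    # Map either verdict style ('Y'/'N' or 'CORRECT'/'INCORRECT') to 0/1.
    # Prefer the numeric score field when the evaluator provides one.
    if result.get("score") is not None:
        return int(result["score"])
    return int(result.get("value") in ("Y", "CORRECT"))


print(normalize({"reasoning": "INCORRECT", "value": "INCORRECT", "score": 0}))  # 0
print(normalize({"value": "CORRECT"}))  # 1
```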

Summary

This article introduced LangChain's Evaluation tools and demonstrated how to use three of its evaluator types.

Of course, many more evaluator types are worth exploring; this article is only a starting point, and I hope it offers readers some inspiration.

References

  1. LangChain Evaluation: https://python.langchain.com/docs/guides/evaluation/
  2. Criteria Evaluation: https://python.langchain.com/docs/guides/evaluation/string/criteria_eval_chain
  3. Evaluate LLMs and RAG, a practical example using LangChain and Hugging Face: https://www.philschmid.de/evaluate-llm


NLP (80): LLM Evaluation
https://percent4.github.io/NLP(八十)大模型评估(Evaluation)/
Author: Jclian91
Published: January 11, 2024