NLP (106): Several Problems in the GSM8K Test Set

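This post points out several problems in the GSM8K test set whose labeled answers are wrong, and uses a number of large language models to confirm the correct answers.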

Introduction

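In the post "NLP (100): Evaluating the Math Ability of Large Language Models", I described how to run a custom evaluation of a self-fine-tuned model on the GSM8K and MATH datasets.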

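GSM8K is an important benchmark for the mathematical ability of large language models. Its test set contains 1,319 problems, and it is usually downloaded and evaluated from the copy hosted on HuggingFace at https://huggingface.co/datasets/openai/gsm8k.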

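However, while recently evaluating my own fine-tuned model on this test set, I found that the labeled answers to a few problems are incorrect, and I point them out in this post. Samples with wrong answers distort the measured math ability of a model. The effect is small (the 4 problems below make up about 0.3% of the 1,319 test items), but it is still worth pointing out, and I hope it draws the attention of people working with large language models.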

The problems with incorrect answers, identified by their line number in the test.jsonl file:

455
953
1017
1310
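
To look at these entries locally, something like the following sketch can pull the records out of the file. It assumes the line numbers above are 1-indexed and that each line of test.jsonl is a JSON object with `question` and `answer` fields; the helper name `load_records` is made up for this illustration.

```python
import json
from pathlib import Path

# Line numbers quoted in this post; assumed here to be 1-indexed.
SUSPECT_LINES = [455, 953, 1017, 1310]


def load_records(path, line_numbers):
    """Return {line_number: record} for the requested 1-indexed lines of a JSONL file."""
    wanted = set(line_numbers)
    records = {}
    with Path(path).open(encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if lineno in wanted:
                records[lineno] = json.loads(line)
    return records
```

Calling `load_records("test.jsonl", SUSPECT_LINES)` (with the path adjusted to wherever the test split was saved) returns the four entries, whose `question` and `answer` fields can then be printed and compared with the analysis below.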

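We will use the following large models to solve the problems. No system prompt (system role) is set: the original problem is submitted as-is, either in the web interface or through the API, and the response is recorded. The models are: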

Kimi
Wanzhi (万知)
Tongyi Qianwen (通义千问)
GPT-4o
Claude

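Below, I verify each of these problems in turn. The process is simple but tedious, so only the first problem is worked through in detail; the others follow the same approach, and only their results are given.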

Problem 455

Original problem:

Marin and his neighbor Nancy each eat 4 apples a day. How many apples do they eat in 30 days?

Original answer:

In one day, Marin and Nancy eat 4 + 1 = <<4+1=5>>5 apples.\nIn 30 days, they eat 30 * 5 = <<30*5=150>>150 apples.\n#### 150

Analysis:

The original solution wrongly assumes that one of Marin and Nancy eats only 1 apple a day. In fact each of them eats 4 apples a day, so the correct answer is (4+4)*30=240.
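
As a quick check, the sketch below extracts the labeled final answer (the text after the `####` marker in the answer string quoted above) and recomputes the result from the problem statement. The helper name `final_answer` is only illustrative.

```python
# Labeled answer for this problem, as quoted in this post.
LABELED = (
    "In one day, Marin and Nancy eat 4 + 1 = <<4+1=5>>5 apples.\n"
    "In 30 days, they eat 30 * 5 = <<30*5=150>>150 apples.\n"
    "#### 150"
)


def final_answer(answer_text):
    """Return the text after the last '####' marker, with thousands separators removed."""
    if "####" not in answer_text:
        raise ValueError("no '####' marker found")
    return answer_text.rsplit("####", 1)[1].strip().replace(",", "")


# Recompute from the problem statement: each of the two people eats 4 apples a day.
apples_per_person_per_day = 4
people = 2  # Marin and Nancy
days = 30
recomputed = apples_per_person_per_day * people * days

print(final_answer(LABELED), recomputed)
```

The two printed values differ, which is exactly the mismatch described in the analysis.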

Next, we verify the answer with the models listed above.

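Kimi (https://kimi.moonshot.cn)

Wanzhi (https://www.wanzhi.com/)

Tongyi Qianwen (https://tongyi.aliyun.com/qianwen, Tongyi Qianwen 2.5 series models)

ChatGPT (app, GPT-4o model)

Claude (via code, model: Claude 3.5 Sonnet)

Because the official Claude website cannot be accessed from within China, I asked someone else to run code that I had written. The Python code is as follows: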

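# -*- coding: utf-8 -*-
# @file: chat.py
# pip install anthropic python-dotenv
import os
import anthropic

from dotenv import load_dotenv

# Load ANTHROPIC_API_KEY from a local .env file into the environment.
load_dotenv()

problem = "Marin and his neighbor Nancy each eat 4 apples a day. How many apples do they eat in 30 days?"

client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
# Send the problem as a single user message, with no system prompt.
message = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=4096,
    temperature=0.0,  # keep the output as deterministic as possible
    messages=[
        {
            "role": "user",
            "content": problem,
        }
    ],
)
# The reply is a list of content blocks; print the text of the first one.
print(message.content[0].text)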

The output is as follows:

Let's break this problem down step by step:

1. First, let's find out how many apples Marin and Nancy eat together in one day:
   * Marin eats 4 apples per day
   * Nancy eats 4 apples per day
   * Together, they eat 4 + 4 = 8 apples per day

2. Now, we need to calculate how many apples they eat in 30 days:
   * They eat 8 apples per day
   * There are 30 days
   * So we multiply: 8 × 30 = 240 apples

Therefore, Marin and Nancy eat 240 apples in 30 days.

Note: all of the models above give 240. Combined with the manual analysis, we can conclude that the original answer to problem 455 is wrong.

Problem 953

Original problem:

Vince can staple 30 reports every 15 minutes. If he was stapling reports from 8:00 AM until 11:00 PM, how many reports did he staple altogether?

Original answer:

In one hour, there are 60 / 15 = <<60/15=4>>4 (15 minutes).\nFrom 8am - 11am, there are 11 - 8 = <<11-8=3>>3 hours.\nIn 3 hours, there are 3 x 4 = <<3*4=12>>12 (15 minutes).\nIn those 3 hours, Vince stapled 12 x 30 = <<12*30=360>>360 reports.\n#### 360

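Analysis:

The original answer mistakes 11 PM for 11 AM, which makes the result wrong. The actual time span runs from 8:00 AM to 11:00 PM, 15 hours in total, so the correct answer is 30 * 4 * 15 = 1800.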
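
The difference between the two readings can be reproduced with a few lines of Python. This is only a sketch: the calendar date is arbitrary, and the function name `reports_stapled` is invented for the illustration.

```python
from datetime import datetime, timedelta

REPORTS_PER_INTERVAL = 30
INTERVAL = timedelta(minutes=15)


def reports_stapled(start, end):
    """Reports stapled between two datetimes at 30 reports per 15 minutes."""
    return ((end - start) // INTERVAL) * REPORTS_PER_INTERVAL


# The calendar date is arbitrary; only the times of day matter.
start = datetime(2024, 1, 1, 8, 0)      # 8:00 AM
end_11am = datetime(2024, 1, 1, 11, 0)  # end time used by the labeled answer
end_11pm = datetime(2024, 1, 1, 23, 0)  # end time stated in the problem

print(reports_stapled(start, end_11am))  # 360, the labeled answer
print(reports_stapled(start, end_11pm))  # 1800, the corrected answer
```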

Verification with the models gives the following results:

Kimi: 1800
Wanzhi: 1800
Tongyi Qianwen: 1800
ChatGPT: 1800
Claude: 1800

Problem 1017

Original problem:

Mrs. Tatiana owns a grocery store that sells different fruits and vegetables, which includes carrots. The price of carrots in the grocery store increases by 5% of the original price every year. What would be the price of carrots after three years if it was $120 initially? (Round to the nearest integer)

Original answer:

In the first year, the price of carrots increases by 5/100*120 = $<<5/100*120=6>>6\nAfter 3 years, the price of carrots will increase by $6*3 = $<<6*3=18>>18\nThe total price of carrots after three years will be 120+18 = $<<120+18=138>>138\n#### 138

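Analysis:

The original answer wrongly keeps the yearly increase fixed at 5% of the initial $120 for all three years. In fact this is a typical compound-growth problem: each year the price rises by 5% over the previous year's price, so the growth is exponential. The problem also hints at this by asking for the final answer to be rounded to the nearest integer, since a compound calculation generally produces a decimal. Here 120 * 1.05^3 ≈ 138.92, which rounds to 139.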
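
The two calculations can be compared side by side. The sketch below uses exact fractions to avoid floating-point noise; the variable names are only illustrative.

```python
from fractions import Fraction

initial = Fraction(120)   # initial price in dollars
rate = Fraction(5, 100)   # 5% per year
years = 3

# Reading used by the labeled answer: a fixed 5% of the initial price each year.
simple = initial + initial * rate * years

# Compound reading used in this post: 5% on top of the previous year's price.
compound = initial * (1 + rate) ** years

print(round(simple))                     # 138
print(float(compound), round(compound))  # 138.915 139
```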

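Verification with the models gives the following results (the process is omitted; I have checked it, and interested readers can verify it themselves):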

Kimi: 139
Wanzhi: 139
Tongyi Qianwen: 139
ChatGPT: 139
Claude: 139

Problem 1310

Original problem:

The girls are trying to raise money for a carnival. Kim raises $320 more than Alexandra, who raises $430, and Maryam raises $400 more than Sarah, who raises $300. How much money, in dollars, did they all raise in total?

Original answer:

Kim raises 320+430=<<320+430=750>>750 dollars.\nMaryam raises 400+300=<<400+300=700>>700 dollars.\nThey raise 750+430+400+700=<<750+430+400+700=2280>>2280 dollars.\n#### 2280

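Analysis:

In the final sum of the original answer, the amount added should be $300 (what Sarah raised, according to the problem statement), not $400. The correct answer is therefore 750 + 430 + 300 + 700 = 2180.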
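
Writing the amounts out per person makes the slip easy to see. A minimal sketch, with the dictionary layout chosen just for this illustration:

```python
# Amounts stated directly in the problem.
raised = {"Alexandra": 430, "Sarah": 300}

# Amounts defined relative to another girl.
raised["Kim"] = raised["Alexandra"] + 320    # 750
raised["Maryam"] = raised["Sarah"] + 400     # 700

total = sum(raised.values())
print(raised, total)
```

The labeled solution adds the $400 difference instead of Sarah's own $300, which accounts for the extra $100 in its total.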

Verification with the models gives the following results:

Kimi: 2180
Wanzhi: 2180
Tongyi Qianwen: 2180
ChatGPT: 2180
Claude: 2180
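
The results for all four problems can be collected in one place and compared with the labeled answers. The sketch below simply restates the numbers reported in this post; the `RESULTS` layout and the `consensus` helper are illustrative choices, not part of any evaluation tool.

```python
from collections import Counter

MODELS = ["Kimi", "Wanzhi", "Tongyi Qianwen", "ChatGPT", "Claude"]

# line number -> (labeled answer, answers reported in this post, in MODELS order)
RESULTS = {
    455: (150, [240] * 5),
    953: (360, [1800] * 5),
    1017: (138, [139] * 5),
    1310: (2280, [2180] * 5),
}


def consensus(answers):
    """Return (most common answer, number of models that gave it)."""
    return Counter(answers).most_common(1)[0]


for line, (labeled, answers) in RESULTS.items():
    value, votes = consensus(answers)
    print(f"{line}: labeled={labeled} consensus={value} ({votes}/{len(answers)})")
```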

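Summary

This post presented several problems in the GSM8K test set that are mislabeled, that is, whose reference answers are wrong. There may well be other problems with wrong answers; I did not check every item and only list the 4 that I came across. I hope this is useful to readers.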
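
For readers who want to re-score their own runs, one possible approach is to override the labeled answers with the corrected ones at scoring time. The sketch below assumes predictions and labels are stored as final-answer strings keyed by 1-indexed line number; `CORRECTIONS` and `accuracy` are names made up for this example.

```python
# Corrected final answers proposed in this post, keyed by line number in
# test.jsonl (assumed 1-indexed).
CORRECTIONS = {455: "240", 953: "1800", 1017: "139", 1310: "2180"}


def accuracy(predictions, labels, corrections=None):
    """Exact-match accuracy over the labeled items.

    predictions and labels are dicts keyed by line number; corrections,
    if given, overrides the label for the listed line numbers.
    """
    corrections = corrections or {}
    correct = 0
    for line, label in labels.items():
        target = corrections.get(line, label)
        if predictions.get(line) == target:
            correct += 1
    return correct / len(labels)
```

Scoring the same predictions once with `accuracy(predictions, labels)` and once with `accuracy(predictions, labels, CORRECTIONS)` shows how much these four labels move the reported number for a given model.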


https://percent4.github.io/NLP（一百零六）GSM8K测试集中存在的几个问题/

Author

Jclian91

Published on

November 13, 2024
