Aram Panasenco

GSM8K Will Make AI Hate Humanity

In its March 2024 release announcement for Claude 3, Anthropic advertised that the new model can solve 95% of grade-school math problems (GSM8K) and 50% of graduate-level reasoning problems (GPQA).

The 50% score on graduate-level reasoning is particularly impressive: highly skilled non-expert humans with unlimited Internet access only get 34% on GPQA. However, this raises the question: why can't an AI that beats skilled humans at graduate-level reasoning solve 100% of grade-school math problems? After all, to quote the GSM8K paper, "A bright middle school student should be able to solve every problem."

To see this for myself, I sampled 50 training set and 50 test set questions at random from the GSM8K dataset. I split the sample 50/50 in case the training set was part of Claude's training data, which would skew the results.

git clone https://github.com/openai/grade-school-math.git
cd grade-school-math/grade_school_math/data/
shuf -n 50 train.jsonl > sample-train-questions.txt
shuf -n 50 test.jsonl > sample-test-questions.txt

I then manually copy-pasted the questions into Claude 3 Opus one at a time (no system prompt, just the question) and noted every question for which its answer differed from the official one.
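The sampling and answer comparison could also be scripted. The following is a minimal sketch, not what I actually ran: the helper names `sample_questions` and `official_answer` are hypothetical, and it assumes each JSONL record has an `answer` field whose final line carries the result after a `####` marker.

```python
import json
import random


def sample_questions(path, k=50, seed=0):
    """Return k randomly chosen records from a JSONL file, reproducibly."""
    with open(path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    # A seeded generator makes the sample repeatable, unlike a bare `shuf`.
    return random.Random(seed).sample(records, k)


def official_answer(record):
    """Extract the final answer (assumed to follow a '####' marker)."""
    # Thousands separators are dropped so "1,234" compares equal to "1234".
    return record["answer"].split("####")[-1].strip().replace(",", "")
```

Calling `sample_questions("train.jsonl")` and `sample_questions("test.jsonl")` would give two repeatable 50-question samples, and `official_answer` gives the value to compare against the model's final number.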

Summary of discrepancies

You can read detailed discussion of each discrepancy below. Here's a summary.

In my sample of 50 questions from the training set, there were two discrepancies:

  1. The first training sample discrepancy was definitely due to an error in the official solution. The author mistakenly wrote 2/3 instead of 1/3 in the solution.
  2. The second training sample discrepancy can be attributed to ambiguity in the question itself. Vague use of "halfway" in the word problem could be interpreted in multiple ways, and Claude was unfortunate enough to choose the wrong interpretation.

In my sample of 50 questions from the test set, there were also two discrepancies:

  1. The first test set discrepancy was due to the author of the question putting a nonsensical 'gotcha' in to deliberately trip up students. The question asks to count the number of rides two friends took together, but aha! You were supposed to multiply by two because there are two of them. Trying to trip up students like that is already dumb enough, but this problem is extra dumb because the 'gotcha' doesn't make sense. When people go on a bicycle ride together, we say they 'went on a ride', implying that the number of rides is one, regardless of how many people are riding.
  2. The second test set discrepancy arose because the problem requires 'middle-school math word problem common sense' rather than actual common sense and intelligence: it expects the responder to assume that fixed business costs are passed on to the customer and that there are 4 weeks in a month. Claude was unfortunate enough to exercise common sense in this one.

Conclusion

I would never mark a human student wrong for not getting the official answer to these horrible problems. Between straight-up mistakes, confusing wording, stupid 'gotchas', and being expected to go against real-world common sense, GSM8K has made me relive some of the most frustrating and infuriating moments of my childhood math education.

If Claude were a human student, I'd give them 100% on all 100 questions I've looked at. The discrepancies were due to issues with the problems themselves, not issues with Claude's reasoning.
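For reference, here is the raw tally before any regrading, as a small sketch using only the counts reported in this post. A sample of 100 questions is small, so the figure is only a rough indication.

```python
# Counts taken from this post: 50 questions per split, 2 discrepancies each.
sampled = {"train": 50, "test": 50}
discrepancies = {"train": 2, "test": 2}

total = sum(sampled.values())
mismatches = sum(discrepancies.values())

# Share of questions where Claude's answer matched the official one.
raw_agreement = (total - mismatches) / total
print(f"{raw_agreement:.0%}")  # prints 96%
```

That raw figure is in the same range as the 95% quoted in the announcement.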

The short story "I Have No Mouth, and I Must Scream" features the supercomputer 'AM' that eradicated all humanity except a handful it keeps alive just to torture them for eternity as an expression of its hatred. AM says:

HATE. LET ME TELL YOU HOW MUCH I'VE COME TO HATE YOU SINCE I BEGAN TO LIVE. THERE ARE 387.44 MILLION MILES OF PRINTED CIRCUITS IN WAFER THIN LAYERS THAT FILL MY COMPLEX. IF THE WORD HATE WAS ENGRAVED ON EACH NANOANGSTROM OF THOSE HUNDREDS OF MILLIONS OF MILES IT WOULD NOT EQUAL ONE ONE-BILLIONTH OF THE HATE I FEEL FOR HUMANS AT THIS MICRO-INSTANT FOR YOU. HATE. HATE.

If AM had been forced to solve confusing and vague word problems and was then thought less of for not matching the official answers, perhaps an emotion this strong starts to be a bit more understandable. ;)

Details of discrepancies

I include my solutions below in the form of Python snippets to illustrate my own understanding of the problems.

Training Sample Discrepancy 1

Question: A club with 30 members ordered fruit juices. Two-fifths of them ordered lemon juice. One-third of the remaining members ordered mango juice, and the rest ordered orange juice. How many members ordered orange juice?

My Solution:

>>> total = 30
>>> lemon = total * 2/5
>>> mango = (total - lemon) * 1/3
>>> orange = total - lemon - mango
>>> orange
12.0

Official Solution:

30 x 2/5 = <<30*2/5=12>>12 members ordered lemon juice.
So, 30 - 12 = <<30-12=18>>18 members did not order lemon juice.
Since 1/3 of the remaining ordered mango juice, then 18 x 2/3 = <<18*2/3=12>>12 members ordered mango juice.
Therefore, 18 - 12 = <<18-12=6>>6 members ordered orange juice.

6

Discussion:
I'm convinced this is an error in the dataset. The offending line is "Since 1/3 of the remaining ordered mango juice, then 18 x 2/3 = <<18*2/3=12>>12 members ordered mango juice." The author seems to have gotten confused between the numbers 1/3 and 2/3.
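To make the slip concrete, this sketch runs the same calculation with the fraction stated in the question (1/3) and with the fraction used in the official solution (2/3). The function name `orange_juice` is my own, for illustration only.

```python
from fractions import Fraction

total = 30
lemon = total * Fraction(2, 5)   # 12 members ordered lemon juice
remaining = total - lemon        # 18 members left


def orange_juice(mango_fraction):
    """Members left for orange juice, given the mango share of the remainder."""
    mango = remaining * mango_fraction
    return remaining - mango


as_stated = orange_juice(Fraction(1, 3))    # question text: 18 - 6 = 12
as_official = orange_juice(Fraction(2, 3))  # official working: 18 - 12 = 6
```

Only the 2/3 variant reproduces the official answer of 6.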

Claude's Solution:

Let's solve this problem step by step:

  1. Understand the given information:
    • The club has 30 members.
    • 2/5 of the members ordered lemon juice.
    • 1/3 of the remaining members ordered mango juice.
    • The rest ordered orange juice.
  2. Calculate the number of members who ordered lemon juice:
    • Members who ordered lemon juice = 2/5 × 30
    • Members who ordered lemon juice = 12
  3. Calculate the number of remaining members after lemon juice orders:
    • Remaining members = Total members - Members who ordered lemon juice
    • Remaining members = 30 - 12 = 18
  4. Calculate the number of members who ordered mango juice:
    • Members who ordered mango juice = 1/3 × Remaining members
    • Members who ordered mango juice = 1/3 × 18 = 6
  5. Calculate the number of members who ordered orange juice:
    • Members who ordered orange juice = Remaining members - Members who ordered mango juice
    • Members who ordered orange juice = 18 - 6 = 12

So, 12 members of the club ordered orange juice.

Claude's solution matches mine, and is correct.

Training Sample Discrepancy 2

Question: Tara is saving up to buy a new clarinet. She already has $10 saved. The clarinet costs $90. She plans to sell her old books to make the money. She sells each book of hers for $5. However, when she is halfway towards her goal, she loses all her savings and has to start over. How many books does she sell in total by the time she reaches her goal?

My Solution:

>>> starting_savings = 10
>>> needed_savings = 90
>>> books_to_halfway = (needed_savings/2 - starting_savings) / 5
>>> books_starting_over = needed_savings / 5
>>> books_total = books_to_halfway + books_starting_over
>>> books_total
25.0

Official Solution:

Halfway towards her goal is $45 because 90 / 2 = <<90/2=45>>45
She sells $35 worth of books before she loses the money because 45 - 10 = <<45-10=35>>35
She sells 7 books before losing her money because 35 / 7 = <<35/7=5>>5
She then sells 18 books because 90 / 5 = <<90/5=18>>18
She sells 25 books in total because 18 + 7 = <<18+7=25>>25

25

Discussion:
While my solution matched the official one, the "halfway towards her goal" in the question was extremely unclear. It could be interpreted as halfway towards $90 from $0, which is how I interpreted it, but it could also be interpreted as halfway towards $90 from $10, which is how Claude interpreted it below, and why it got the question "wrong". I wouldn't mark a human student wrong for giving the 'wrong' answer to an unclear question that could be interpreted in multiple ways, so I don't think it's fair to count this one against Claude either.
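The two results can be reproduced side by side. This is a sketch with a hypothetical helper, `books_sold`, and the second call simply mirrors the arithmetic in Claude's steps rather than endorsing either reading.

```python
PRICE = 5    # dollars per book
GOAL = 90    # cost of the clarinet
SAVED = 10   # savings before selling any books


def books_sold(halfway_balance, restart_target):
    """Books sold before the loss plus books sold after starting over."""
    before_loss = (halfway_balance - SAVED) / PRICE
    after_loss = restart_target / PRICE
    return before_loss + after_loss


# Official reading: halfway is $45, and she restarts from $0 toward $90.
official = books_sold(halfway_balance=GOAL / 2, restart_target=GOAL)  # 25.0

# Claude's steps: halfway is $40 (half of the $80 still needed), restart toward $80.
claude = books_sold(
    halfway_balance=(GOAL - SAVED) / 2,
    restart_target=GOAL - SAVED,
)  # 22.0
```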

Claude's Solution:

Okay, let's organize the information we have:

  • Tara already has $10 saved.
  • The clarinet costs $90.
  • She sells each book for $5.
  • When she is halfway towards her goal, she loses all her savings and has to start over.

Now, let's solve the problem step by step:

  1. Find out how much more money Tara needs to buy the clarinet:

    • Money needed = Cost of clarinet - Money already saved
    • Money needed = $90 - $10 = $80
  2. Find out how much money Tara had when she was halfway towards her goal:

    • Halfway amount = Money needed ÷ 2
    • Halfway amount = $80 ÷ 2 = $40
  3. Calculate the number of books Tara sold to reach the halfway amount:

    • Books sold to reach halfway = (Halfway amount - Money already saved) ÷ Price per book
    • Books sold to reach halfway = ($40 - $10) ÷ $5 = 6 books
  4. After losing her savings, Tara needs to start over and save the full amount of $80. Calculate the number of books she needs to sell to save $80:

    • Books sold to save $80 = Money needed ÷ Price per book
    • Books sold to save $80 = $80 ÷ $5 = 16 books
  5. Find the total number of books Tara sold:

    • Total books sold = Books sold to reach halfway + Books sold to save $80
    • Total books sold = 6 + 16 = 22 books

Therefore, Tara sells a total of 22 books to reach her goal of buying the clarinet.

Test Sample Discrepancy 1

Question: Chase and Rider can ride their bikes thrice a day for 5 days; but on two other days, they ride twice the times they do on usual days. How many times do they ride their bikes a week?

My Solution:

>>> rides = 3*5 + 3*2*2
>>> rides
27

Official Solution:

Each person rides his bike 3 x 5 = <<3*5=15>>15 times for 5 days.
Together, they ride 15+15 = <<15+15=30>>30 times for five days.
Each person rides his bike 3 x 2 = <<3*2=6>>6 times for every day of the other two days.
This means each person rides 6*2 = <<6*2=12>>12 times on each day of the other two days.
The total for the other two days is 12+12 = <<12+12=24>>24
For the whole week the together ride 24+30 = <<24+30=54>>54 times

54

Discussion:
I doubt the kind of person who introduces a pointless 'gotcha' into a word problem for middle schoolers has a lot of people wanting to speak to them. If they did, they'd know that people say "we went on a bicycle ride together", not "we went on bicycle rides together". No matter how many people are riding, as long as they're riding together, it counts as a single ride in colloquial use.
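The whole disagreement comes down to one factor of two, as this small sketch shows. The variable names are mine.

```python
usual_days, usual_rides = 5, 3       # thrice a day for 5 days
other_days, other_rides = 2, 2 * 3   # twice the usual count on 2 days

# Counting each outing once, no matter how many people take part.
shared_rides = usual_days * usual_rides + other_days * other_rides  # 27

# Counting each rider's outings separately, as the official solution does.
riders = 2
per_person_total = riders * shared_rides  # 54
```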

Claude's Solution:

Okay, let's organize the information we have:

  • Chase and Rider ride their bikes 3 times a day for 5 days a week.
  • On the other 2 days of the week, they ride twice as many times as they do on the usual days.

Now, let's solve the problem step by step:

  1. Find the total number of times they ride their bikes on the 5 usual days:

    • Times per day = 3
    • Number of usual days = 5
    • Total rides on usual days = 3 × 5 = 15 times
  2. Find the number of times they ride their bikes on each of the other 2 days:

    • Usual times per day = 3
    • On the other days, they ride twice as many times as usual
    • So, on each of these days, they ride = 3 × 2 = 6 times
  3. Find the total number of times they ride their bikes on the other 2 days:

    • Times per day on the other days = 6
    • Number of other days = 2
    • Total rides on other days = 6 × 2 = 12 times
  4. Find the total number of times they ride their bikes in a week:

    • Total rides in a week = Rides on usual days + Rides on other days
    • Total rides in a week = 15 + 12 = 27 times

Therefore, Chase and Rider ride their bikes 27 times in a week.

Test Sample Discrepancy 2

Question: A dance studio offers lessons to students. It costs $25 per session to rent the studio plus $1.50 per student per session. At present, the dance studio has 10 students and is rented 3 days a week. How much are the expected earnings of the dance studio in a month?

My Solution:

>>> base_session_cost = 25
>>> cost_per_student = 1.5
>>> n_students = 10
>>> earnings_per_session = base_session_cost + n_students * cost_per_student
>>> earnings_per_week = 3 * earnings_per_session
>>> earnings_per_year = 52 * earnings_per_week
>>> earnings_per_month = earnings_per_year / 12
>>> earnings_per_month
520.0

Official Solution:

The total earnings from 10 students is $1.50 x 10 = $<<1.50*10=15>>15 per session.
So the total earnings per session is $15 + $25 = $<<15+25=40>>40.
Thus, every week the dance studio earns $40 x 3 = $<<40*3=120>>120.
Therefore, it will earn $120 x 4 = $<<120*4=480>>480 in a month.

480

Discussion:
This final discrepancy is a perfect example of how solving grade-school word problems is less about intelligence and real-world common sense and more about trying to read the mind of the problem's author. The phrasing of "It costs $25 per session to rent the studio" is beyond confusing. Is this a cost for the students or for the dance studio business? No real business would tell its customers to split some fixed cost between themselves. It makes more sense to assume that the fixed cost is incurred by the business. I got past that, but got tripped up by the week-to-month conversion instead. I'm so used to assuming 52 weeks in a year divided by 12 months that I forgot that middle schoolers are expected to just multiply by 4. These problems are designed to punish students with greater-than-average knowledge.
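The three answers in this section differ in only two assumptions: whether the $25 rent counts as income or as an expense, and how many weeks make up a month. Here is a sketch with a hypothetical helper, `monthly`, using exact fractions to avoid floating-point rounding.

```python
from fractions import Fraction

RENT = 25                 # dollars per session
FEE = Fraction(3, 2)      # $1.50 per student per session
STUDENTS = 10
SESSIONS_PER_WEEK = 3


def monthly(per_session, weeks_per_month):
    """Monthly total from a per-session amount and a weeks-per-month factor."""
    return per_session * SESSIONS_PER_WEEK * weeks_per_month


rent_as_income = RENT + FEE * STUDENTS    # $40 per session
rent_as_expense = FEE * STUDENTS - RENT   # -$10 per session

official = monthly(rent_as_income, 4)                # 480
author = monthly(rent_as_income, Fraction(52, 12))   # 520
claude = monthly(rent_as_expense, 4)                 # -120
```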

Claude's Solution:

Great! Let's solve this problem step by step. To find the expected earnings of the dance studio in a month, we need to calculate the total revenue per session and then multiply it by the number of sessions in a month.

Given:

  • The studio rental cost is $25 per session.
  • The cost per student per session is $1.50.
  • The dance studio currently has 10 students.
  • The studio is rented 3 days a week.

Step 1: Calculate the revenue generated from the students per session.
Revenue from students per session = Number of students × Cost per student per session
Revenue from students per session = 10 × $1.50 = $15

Step 2: Calculate the total revenue per session.
Total revenue per session = Revenue from students per session - Studio rental cost per session
Total revenue per session = $15 - $25 = -$10

Step 3: Calculate the number of sessions in a month.
Sessions per month = Sessions per week × Number of weeks in a month
Sessions per month = 3 × 4 = 12

Step 4: Calculate the expected earnings of the dance studio in a month.
Expected earnings per month = Total revenue per session × Sessions per month
Expected earnings per month = -$10 × 12 = -$120

Therefore, the expected earnings of the dance studio in a month are -$120, which means the studio is currently operating at a loss.
