Benchmarking GPT-4 Turbo - A Cautionary Tale

Nov 8, 2023

[Image: GPUs hard at work]

Since we first introduced Mentat, GPT-4 has been the default model. While Mentat can run with GPT-3.5 or even local models, GPT-4's higher-quality code edits are well worth the extra cost. We are excited to unlock GPT-4 Turbo's full potential: it is 2-3x cheaper and has a significantly larger context window. But does it match the quality of GPT-4, especially for editing code?

Benchmarking GPT-4 Turbo

To answer this question, we ran both GPT-4 and GPT-4 Turbo on our Exercism benchmarks, compiled from a set of 122 Exercism programming exercises. We got the idea to use Exercism exercises from Aider, and we used the same prompts so we could directly compare results.

We ran GPT-4 on each task and gave it two tries to succeed (on the second attempt, the model is shown why its first attempt failed). GPT-4 ended up solving 86/122, or 70%, of the JavaScript exercises.
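As a rough illustration of this two-attempt protocol, here is a minimal sketch; the helper callables (`generate_edit`, `apply_edit`) and the test command are stand-ins for this post, not Mentat's actual benchmark code.

```python
import subprocess

def run_benchmark_task(generate_edit, apply_edit, test_cmd, prompt):
    """Two-attempt protocol: if the first solution fails the exercise's
    tests, the failing output is fed back to the model for one retry."""
    apply_edit(generate_edit(prompt))  # attempt 1
    result = subprocess.run(test_cmd, capture_output=True, text=True)
    if result.returncode == 0:
        return "solved_first_try"

    # Attempt 2: include the failure output so the model sees why it failed.
    retry_prompt = f"{prompt}\n\nThe tests failed with:\n{result.stdout}{result.stderr}"
    apply_edit(generate_edit(retry_prompt))
    result = subprocess.run(test_cmd, capture_output=True, text=True)
    return "solved_second_try" if result.returncode == 0 else "failed"
```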

Running the same benchmark with GPT-4 Turbo, we got 84/122, or 68.8%, of the exercises. Slightly worse, but close enough that it could just be statistical noise. But wait! Looking more closely at the results, we found a significant difference: GPT-4 solved 76 on the first try and only an additional 10 on the second attempt, while GPT-4 Turbo solved only 56 on the first try and an additional 28 on the second attempt.
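For intuition on how close the overall totals are, a quick two-proportion z-test (our illustration here, not part of the benchmark itself) shows the 86/122 vs. 84/122 gap is well within noise:

```python
from math import sqrt
from statistics import NormalDist

# Two-proportion z-test on the overall pass rates: 86/122 vs. 84/122.
n = 122
p1, p2 = 86 / n, 84 / n
pooled = (86 + 84) / (2 * n)
se = sqrt(pooled * (1 - pooled) * (2 / n))
z = (p1 - p2) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
print(f"z = {z:.2f}, p = {p_value:.2f}")  # z ≈ 0.28, p ≈ 0.78: far from significant
```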

Interpreting Results

Why would GPT-4 Turbo solve fewer tasks on the first attempt? We examined individual exercises and the code GPT-4 Turbo wrote for them and found that it often wrote reasonable solutions but failed on the first attempt because the instructions were unclear or ambiguous. But then how did GPT-4 solve them? Our theory was that GPT-4 had substantially memorized the Exercism training tasks, and that when GPT-4 was downsized (most likely through distillation) into GPT-4 Turbo, it lost some of this raw memorization capability.

We designed a test for this theory: we reran the benchmarks without showing the models the instructions for each exercise. Instead, we just told them that they were Exercism exercises and gave them the exercise names and function stubs. That is not enough information to solve an exercise unless the model has it memorized. Due to rate limits on GPT-4 Turbo, we only ran the first 40 exercises. GPT-4 solved 23/40, or 57.5%, on the first try and an additional 5 on the second try. GPT-4 Turbo, on the other hand, solved only 12/40, or 30%, on the first try and an additional 11 on the second try. We interpret this as confirming that GPT-4 has more of the exercises memorized than GPT-4 Turbo.
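To make the setup concrete, a prompt for this no-instructions run might look something like the sketch below (the exact wording and the example stub are illustrative, not the prompts we actually used):

```python
def build_memorization_prompt(exercise_name: str, stub_source: str) -> str:
    """No-instructions prompt: only the fact that this is an Exercism
    exercise, its name, and the function stub -- no problem description."""
    return (
        f"This is the Exercism exercise '{exercise_name}'. "
        "Complete the function stub below so the exercise's tests pass.\n\n"
        + stub_source
    )

# Example with a well-known JavaScript exercise stub:
print(build_memorization_prompt(
    "two-fer",
    "export const twoFer = (name) => {\n"
    "  throw new Error('Remove this line and implement the function');\n"
    "};",
))
```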

Our results seem similar to another finding, in which GPT-4 Turbo scored worse than GPT-4 on a set of SAT questions.

Although the author OCR'ed the SAT questions and believes they weren't in the training data, we think it's fairly likely that these questions, and certainly some very similar ones, ended up in the training data at some point, which makes the drop in GPT-4 Turbo's measured performance hard to interpret.

Future Benchmarks

Benchmarks derived from content in GPT’s training data are still valuable for determining how good GPT is at responding in the correct edit format, as well as comparing fine-tuned models to each other. But these results make it pretty clear that they aren’t an accurate test for comparing models trained on separate datasets or distilled models – which is what we suspect GPT-4 Turbo is.

These results emphasize the need for better benchmarks, something we've already begun building for Mentat: realistic, real-world coding tasks based on recent commits to open-source repositories made after the training cutoff. Although no benchmark will ever be perfect, we are confident that these improvements will help us gauge the relative accuracy of different models in the future.
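As a rough sketch of how such tasks could be collected (the cutoff date and repository handling here are illustrative assumptions, not our actual pipeline):

```python
import subprocess
from datetime import datetime, timezone

# Illustrative cutoff; the real value depends on the model being benchmarked.
TRAINING_CUTOFF = datetime(2023, 4, 30, tzinfo=timezone.utc)

def commits_after_cutoff(repo_path: str) -> list[str]:
    """List commit hashes in a local clone made after the training cutoff,
    as candidate sources for benchmark tasks."""
    out = subprocess.run(
        ["git", "-C", repo_path, "log",
         f"--since={TRAINING_CUTOFF.isoformat()}", "--pretty=%H"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.split()
```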