OpenAI Models Dominate Structured Code Edit Benchmark

Jan 11, 2024

A big value proposition of Mentat over other LLM code editing tools is that it is open source. You can easily see what information is sent to the LLM and the LLM's raw response, and you are free to use any model you want! Like everyone else, we've been excited by the release of so many promising new models over the last few months, especially open weight models like mixtral 8x7B. Unfortunately, so far no model comes close to gpt-4's performance on Mentat. In fact, no other model we've tested is even as good as gpt-3.5-turbo in terms of compliance with Mentat's edit formats.

Mentat requires a sophisticated LLM because it is one of the only coding assistants that truly *edits* existing code. Most assistants either provide autocomplete (like Copilot) or rewrite sections of code, which requires the human to tell the LLM where to make the edit. Mentat is different: the model decides where to make edits, and it coordinates multiple edits across several files to complete a task. To accomplish this, the LLM must adhere to an "edit format" so that we can parse the files and locations for its edits. The two best edit formats we've tried are the replacement format and the block format, though we've experimented with others.

For the benchmarks in this post we tested with the block format in which we ask the model to respond with "blocks" that look like this:

@@start
{
    "file": "core/script.py",
    "action": "replace",
    "start-line": 10,
    "end-line": 10
}
@@code
def main():
    name = get_name()
@@end

We've found it to perform very slightly better than our replacement format, which asks for responses that look like this:

@ core/hello_world.py starting_line=2 ending_line=4
def goodbye_world():
    print("Goodbye, World!")

The replacement format uses fewer tokens, so it's debatable which format is best. See here for the full prompts sent to the LLM, which contain the complete format specs.
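To make the parsing step concrete, here is a minimal sketch of how a block-format response could be turned into structured edits. This is not Mentat's actual parser; the Edit dataclass and function name are made up for illustration, and it only handles blocks that include a code section:

import json
import re
from dataclasses import dataclass

@dataclass
class Edit:
    file: str
    action: str
    start_line: int
    end_line: int
    code: str

# Matches @@start <json header> @@code <replacement code> @@end blocks.
BLOCK_PATTERN = re.compile(r"@@start\n(.*?)\n@@code\n(.*?)\n@@end", re.DOTALL)

def parse_block_edits(response: str) -> list[Edit]:
    edits = []
    for header_text, code in BLOCK_PATTERN.findall(response):
        header = json.loads(header_text)
        edits.append(Edit(
            file=header["file"],
            action=header["action"],
            start_line=header["start-line"],
            end_line=header["end-line"],
            code=code,
        ))
    return edits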

The day mixtral 8x7B was released I DMed my boss that I needed an M3 Mac. But when it arrived I was pretty disappointed by the performance of local models.

I tested llama-2-7B, llama-2-70B, and mixtral 8x7B to power Mentat to solve exercism exercises, and the results were fairly anemic. While I was at it I also tried gemini pro and mistral-medium, which are only available via an API endpoint. Exercism is an open source collection of programming challenges we have used previously (inspired by Aider) to test LLMs' ability to make edits in the required format. In the previous post we discussed major problems with the benchmark: as an open source github repo, it is certainly in the training set of most LLMs. But as a benchmark not of problem solving ability but of the LLM's ability to understand our system prompt and output a structured, parseable edit, it is still worthwhile. Our team is working on a wider collection of more realistic benchmarks that we will write about soon.

Previously we measured two things in our exercism benchmark:

  1. Could the LLM solve the problem given the problem statement included with the exercise?

  2. If its solution failed any tests (exercism exercises come with a test suite), we would pipe the test output into context and ask Mentat to try again.

But the open source models performed so poorly that we recorded a third metric this time:

  3. After two attempts, did the LLM succeed in passing any of the tests in the test suite? (The full two-attempt loop is sketched below.)
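Roughly, that loop looks like the following; the helper types and callables are stand-ins for illustration, not Mentat's real benchmark code:

from dataclasses import dataclass
from typing import Callable

@dataclass
class TestReport:
    num_passed: int
    num_total: int
    output: str

    @property
    def all_passed(self) -> bool:
        return self.num_total > 0 and self.num_passed == self.num_total

@dataclass
class ExerciseResult:
    solved_first_try: bool    # metric 1
    solved_after_retry: bool  # metric 2
    passed_any_test: bool     # metric 3

def benchmark_exercise(
    problem_statement: str,
    edit_with_mentat: Callable[[str], None],  # ask Mentat to edit the exercise files
    run_tests: Callable[[], TestReport],      # run the exercise's test suite
) -> ExerciseResult:
    # Attempt 1: just the problem statement.
    edit_with_mentat(problem_statement)
    report = run_tests()
    if report.all_passed:
        return ExerciseResult(True, True, True)

    # Attempt 2: pipe the failing test output back into context and retry.
    edit_with_mentat(report.output)
    report = run_tests()
    return ExerciseResult(
        solved_first_try=False,
        solved_after_retry=report.all_passed,
        passed_any_test=report.num_passed > 0,
    )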

We were only able to test mistral-medium on one iteration. See the end for an explanation. Here are our results:

Including the yellow lines arguably makes the OpenAI models look even more impressive. There are only 135 exercises total, and OpenAI's gpt-4-1106-preview passes at least one test of the suite on 130/135 of them. Looking through the outputs, there are many exercises for which it failed only a single test case. Hopefully other models can catch up in 2024, but in the meantime I won't be using any model besides gpt-4 for programming work.

Our results differ from other benchmarks such as the LMSYS Chatbot Arena, which shows a crowded field of roughly equally performing models behind the gpt-4 models, with gemini and mixtral both performing slightly better than gpt-3.5. Some divergence isn't surprising since the benchmarks are measuring different things, but I'm a little surprised by the magnitude. For some reason, on our task gpt-3.5 is still significantly better than gemini pro and mistral-medium.

** Note for results on mistral-medium: Mentat uses the OpenAI python library. It is easy to use other LLMs, including non-OpenAI-compatible ones, through a LiteLLM proxy. Local models are run with the delightful ollama (see here for details). But Mistral's endpoint is not quite OpenAI compatible, even though their docs indicate it is and LiteLLM expects it to be. I asked LiteLLM for help and they opened an issue, so hopefully they'll be able to handle the inconsistency on their end. In the meantime I replaced AsyncOpenAI with MistralAsyncClient in a branch, which mostly worked but mysteriously failed for conversations with more than one user request. By this time I was getting pretty frustrated, so only partial results are available for mistral-medium. This model also feels very wordy:
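As an aside on that setup: because Mentat talks to models through the OpenAI python client, any OpenAI-compatible endpoint (such as a LiteLLM proxy sitting in front of another provider) can be swapped in by changing the client's base URL. Here is a minimal sketch of that swap; the URL, API key, and model name below are placeholders, not the configuration we actually used:

import asyncio

from openai import AsyncOpenAI

# Point the OpenAI client at an OpenAI-compatible endpoint, e.g. a local
# LiteLLM proxy. The URL, key, and model name are placeholders.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="placeholder")

async def main() -> None:
    response = await client.chat.completions.create(
        model="mistral-medium",
        messages=[{"role": "user", "content": "Write a hello world function."}],
    )
    print(response.choices[0].message.content)

asyncio.run(main())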