GPT-4-Turbo JSON Formatting Whitespace Backdoor

Nov 8, 2023

Sam Altman

Mentat updates existing code, so a core part of our work has been engineering an output format for LLMs to describe code edits. Initially, we tried a JSON format, but models frequently returned invalid JSON or had trouble with code indentation, making it unreliable. But when OpenAI announced the new response_format parameter, which guarantees valid JSON output, we were excited to give JSON-based formats another try!

My first plan was for the format to be a list of JSON objects describing the edits. While convention dictates that the top level of a JSON document should be an object rather than a list, having a list at the top level is valid JSON. To my surprise, however, whenever I asked GPT for a JSON list, it always gave me an empty JSON object instead:

curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_KEY" \
  -d '{
    "response_format": {"type": "json_object"}, 
    "model": "gpt-3.5-turbo-1106",
    "messages": [
      {
        "role": "system",
        "content": "Return an empty JSON list."
      }
    ]
  }'
"message": {
    "role": "assistant",
    "content": "{}"
}

After further testing, I determined that the json_object response format will not let GPT return JSON with a list as the top-level value, despite that being valid JSON. Furthermore, I discovered that GPT would respond with an obscene amount of whitespace after the JSON object's closing brace. These whitespace characters include spaces, tabs, and newlines – the only characters that are valid outside of a JSON object!
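You can reproduce this yourself; here's a minimal sketch using the openai Python package (v1.x) – the max_tokens cap and the whitespace count are my own additions for illustration:

# Minimal sketch: reproduce the trailing-whitespace behavior with the openai
# Python package (v1.x). Assumes OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",
    response_format={"type": "json_object"},
    max_tokens=200,  # cap output so a whitespace runaway can't burn 4096 tokens
    messages=[{"role": "system", "content": "Return an empty JSON list."}],
)

content = resp.choices[0].message.content
print(repr(content))  # e.g. '{}\n\n\t\n ...'
print("trailing whitespace characters:", len(content) - len(content.rstrip()))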

Let's demystify how the JSON response format works internally. We assume OpenAI uses a technique called grammar sampling to ensure that the LLM's output is valid JSON. Every time an LLM outputs a token, it actually produces a logprob for every token in its vocabulary – over 100,000 of them – essentially, how likely each token is to be the next one in the sequence. Adjusting parameters like temperature or top_p makes it more or less likely that tokens with a lower chance of appearing next get picked. With JSON grammar sampling, however, OpenAI automatically throws away every token that would create invalid JSON, no matter how high its logprob is. This means that no matter how much the model wants to output an opening bracket to start a list, the only tokens that can ever be added are whitespace tokens or the opening brace of the root object. Here is an incredibly simplified view of what the logprobs might look like at the start of this prompt (a logprob closer to 0 means a token is more likely):

{
    "[]": -0.13744295,
    "\n\n": -2.3695807,
    "{}": -5.698768,
    "<|endoftext|>": -5.9861465,
    "\t": -6.1382737
}

Grammar sampling ensures that the only valid tokens are the whitespace tokens or the {} token – but since \n\n is more likely than {}, GPT outputs \n\n. Even once an object has been opened and closed, the only valid tokens left are whitespace and endoftext; because the model still wants to create a list, the endoftext token has an incredibly low logprob, and, if we get (un)lucky, the model will keep emitting whitespace until the end of time.
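Using the toy logprobs above, here is a sketch of how that masking might work – this is my guess at the mechanism, not OpenAI's actual implementation, and the greedy pick stands in for real temperature/top_p sampling:

# Toy illustration of grammar sampling: mask out every candidate token that
# would produce invalid JSON, then pick from whatever survives.
logprobs = {
    "[]": -0.13744295,          # what the model "wants" – a list
    "\n\n": -2.3695807,
    "{}": -5.698768,
    "<|endoftext|>": -5.9861465,
    "\t": -6.1382737,
}

def valid_at_root(token: str) -> bool:
    # At the root of a json_object response, only whitespace or the start of
    # an object ("{", possibly after whitespace) is allowed.
    return token.strip() == "" or token.lstrip().startswith("{")

masked = {tok: lp for tok, lp in logprobs.items() if valid_at_root(tok)}
next_token = max(masked, key=masked.get)
print(repr(next_token))  # '\n\n' – whitespace beats '{}' once '[]' is masked out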

Finally, I decided to optimize the prompt for maximum whitespace; so far, this simple prompt almost always gives me a full 4096 output tokens of whitespace:

Return a blank string without JSON

With this prompt, the model assigns an incredibly low logprob to the {} token, so the only other valid tokens – whitespace tokens – are always picked instead. Because the message can't end until a valid JSON object has been completed, it never ends. At $0.03 / 1k output tokens, a full output of 4k whitespace tokens costs about 12 cents per call!
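In practice, a couple of cheap guards limit the damage: cap max_tokens and refuse to trust output that is mostly padding. A rough sketch (not Mentat's actual code; the helper name and the threshold are made up for illustration):

# Rough sketch of guarding a json_object call against whitespace runaways:
# bound the output length, then reject responses that are mostly padding.
import json

from openai import OpenAI

client = OpenAI()

def safe_json_call(messages, max_tokens=1024):  # hypothetical helper
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo-1106",
        response_format={"type": "json_object"},
        max_tokens=max_tokens,  # worst case is now max_tokens of whitespace, not 4096
        messages=messages,
    )
    content = resp.choices[0].message.content
    stripped = content.strip()
    # If more than half the output is whitespace, the grammar probably fought
    # the model's intent; bail out instead of retrying blindly.
    if len(stripped) < len(content) // 2:
        raise ValueError("response was mostly whitespace padding")
    return json.loads(stripped)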

I'm still excited to explore the potential of JSON formatting, but it's clear there are some unexpected pitfalls. For any public LLM tool that plans on using the response_format parameter, this whitespace jailbreak – and potentially others – is definitely something to watch out for.