
Bedrock invoke_model returning two JSONs separated by <|eot_id|> when using Llama 4 Maverick — anyone else facing this?


I'm using invoke_model in Bedrock with Llama 4 Maverick.

My prompt format looks like this (as per the docs):

<|begin_of_text|> <|start_header_id|>system<|end_header_id|> ...system prompt...<|eot_id|>

...chat history...

<|start_header_id|>user<|end_header_id|> ...user prompt...<|eot_id|>

<|start_header_id|>assistant<|end_header_id|>

Problem:

The model randomly returns TWO JSON responses, separated by <|eot_id|>. Only Llama 4 Maverick does this: with the same prompt, Llama 3.3 and Llama 3.1 have no issue.

Example (trimmed):

{ "criteria": { "last_search": "I'd like a bugati", "variant": "search" }, "agentToRun": { "name": "internal_search", "params": { "query": "bugati" } } }

<|eot_id|>

assistant

{ "criteria": { "last_search": "I'd like a bugati", "variant": "search" }, "agentToRun": { "name": "internal_search", "params": { "query": "bugati" } } }

2 Answers

I've looked into this, and it turns out the <|eot_id|> tokens appearing in your responses are actually expected behavior for Llama 4. Here's what's happening:

Llama 4 returns special tokens like <|eot_id|>, <|start_header_id|>, and <|end_header_id|> as part of its native output. These are structural markers that Meta's Llama models use internally to indicate things like end-of-turn in conversations. The difference you're noticing compared to Llama 3.1 and 3.3 is that those models have built-in wrappers that automatically strip out these tokens before returning the response to you. With Llama 4, you're getting the raw, unfiltered output from the model.

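The good news is that this is straightforward to handle on your end. I'd recommend a simple post-processing step that cuts the text at the first <|eot_id|> token. Removing the token alone would still leave the repeated "assistant" turn and the second JSON in place, so keep only what comes before it. Something like:

text = text.split("<|eot_id|>", 1)[0].strip()

This keeps only the first turn and should give you just the JSON response you're expecting.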
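Because the duplicate turn follows the marker, it helps to both cut at the first <|eot_id|> and then parse only the first complete JSON object. The sketch below does that with the standard library; the function names are hypothetical, and it assumes (as in the question's example) that the useful JSON appears before the first marker.

```python
import json

# End-of-turn marker observed in Maverick's generation text in the question.
EOT_MARKER = "<|eot_id|>"


def first_turn(generation: str) -> str:
    """Return only the text before the first end-of-turn marker."""
    return generation.split(EOT_MARKER, 1)[0].strip()


def parse_first_json(generation: str) -> dict:
    """Parse the first JSON object found in the first turn of a generation.

    raw_decode stops at the end of the first complete JSON value, so any
    trailing text after it is ignored rather than causing a parse error.
    """
    text = first_turn(generation)
    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object found before the end-of-turn marker")
    obj, _end = json.JSONDecoder().raw_decode(text[start:])
    return obj


if __name__ == "__main__":
    sample = (
        '{ "agentToRun": { "name": "internal_search", "params": { "query": "bugati" } } }\n\n'
        "<|eot_id|>\n\nassistant\n\n"
        '{ "agentToRun": { "name": "internal_search", "params": { "query": "bugati" } } }'
    )
    print(parse_first_json(sample)["agentToRun"]["params"]["query"])
```

Using raw_decode instead of json.loads means a stray word or a second object after the first one does not break parsing, which matters if the marker is ever missing but the duplicate is not.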

Alternatively, if you're open to using a wrapper library like LangChain or llama-index-llms-bedrock, those can handle this token cleanup automatically.

For more details on this behavior, you can refer to the official documentation here:

AWS Expert, answered 2 months ago
  • Hey, thanks for the answer; it clears up some of my doubt about why this doesn't happen with the Llama 3 models. But why doesn't it happen with the Converse API? Has AWS built internal stripping of such tokens into Converse?

  • Also, if possible, can you share the source where you found out about the internal stripping of the end-of-turn token in the Llama 3 family?


It appears you're encountering an issue with Llama 4 Maverick in Amazon Bedrock where the model is returning duplicate JSON responses separated by the <|eot_id|> token.

This behavior seems specific to Llama 4 Maverick, since, as you mention, Llama 3.1 and Llama 3.3 don't exhibit it with the same prompt format. Llama 4 Maverick is Meta's multimodal model for image and text understanding, so its response handling may differ from that of other models in the Llama family.

When using Amazon Bedrock's invoke_model API, the response format can vary between models. For Meta's Llama models, the response typically includes fields like "generation" (the generated text), "prompt_token_count", "generation_token_count", and "stop_reason".
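To see which of these fields accompany a bad response, it can help to decode the body and log them together with a flag for the marker. This is a minimal sketch that assumes the body is JSON with the field names listed above; the helper name and the has_eot_marker key are hypothetical additions for diagnosis, not part of the Bedrock response.

```python
import json
from typing import Any, Dict


def read_llama_response(raw_body: bytes) -> Dict[str, Any]:
    """Decode a Meta Llama invoke_model response body into a small summary."""
    payload = json.loads(raw_body)  # json.loads accepts bytes directly
    generation = payload.get("generation", "")
    return {
        "generation": generation,
        "prompt_token_count": payload.get("prompt_token_count"),
        "generation_token_count": payload.get("generation_token_count"),
        "stop_reason": payload.get("stop_reason"),
        # True when the model wrote an end-of-turn marker into its text,
        # i.e. the pattern reported in the question.
        "has_eot_marker": "<|eot_id|>" in generation,
    }
```

With boto3, the bytes would come from the StreamingBody in the response, e.g. `client.invoke_model(modelId=..., body=...)["body"].read()`; logging stop_reason and generation_token_count next to has_eot_marker shows whether the duplicated responses share a common stop condition.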

This duplicate response issue could be related to:

  1. How Llama 4 Maverick handles the end-of-turn tokens in your prompt format
  2. A potential configuration issue with how the model is deployed in Bedrock
  3. A mismatch between the prompt format and what Llama 4 Maverick expects
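Item 3 can be tested directly. As far as I can tell, Meta's published Llama 4 prompt format uses <|header_start|>, <|header_end|> and <|eot|> instead of the Llama 3 tokens (<|start_header_id|>, <|end_header_id|>, <|eot_id|>) used in the question. If Maverick treats <|eot_id|> as ordinary text, that would be consistent with it copying the marker and continuing into a second turn. This is an assumption to verify against the current Bedrock and Meta documentation, not a confirmed cause. A sketch of a prompt builder for that format (all names hypothetical) follows.

```python
from typing import List, Tuple

# Assumption: token names taken from Meta's published Llama 4 prompt format,
# not from the Bedrock docs; verify before switching your prompts over.
BEGIN = "<|begin_of_text|>"
HEADER_START = "<|header_start|>"
HEADER_END = "<|header_end|>"
EOT = "<|eot|>"


def turn(role: str, content: str) -> str:
    """Format one completed turn: header, blank line, content, end-of-turn."""
    return f"{HEADER_START}{role}{HEADER_END}\n\n{content}{EOT}"


def build_llama4_prompt(system: str, history: List[Tuple[str, str]], user: str) -> str:
    """Build a prompt from a system message, prior (role, text) turns and the user message."""
    parts = [BEGIN, turn("system", system)]
    parts += [turn(role, text) for role, text in history]
    parts.append(turn("user", user))
    # Open the assistant turn and leave it for the model to complete.
    parts.append(f"{HEADER_START}assistant{HEADER_END}\n\n")
    return "".join(parts)
```

Sending the same request with both templates and comparing how often the second JSON appears would show whether the prompt format is the cause, without changing any parsing code.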

Since this appears to be model-specific behavior, you might want to:

answered 2 months ago
