On June 18, 2024, Meta posted an article announcing new research, models, and datasets from Meta FAIR. What immediately caught my eye was Chameleon, a new family of early-fusion token-based models described in an accompanying research paper and GitHub repo. I was intrigued by the potential of these models, so I decided to spend some time diving into the available resources and demos. In this article, I will share my personal experience and point of view on Chameleon.

Let’s dive into a technical overview of Chameleon, including the main contributions from the research paper. (All links and observations below are current as of June 27, 2024.)


Figure 1 illustrates Chameleon’s approach to representing multiple modalities – including images, text, and code – as discrete tokens. By tokenizing everything into a unified format, Chameleon can apply a single Transformer architecture that is trained end-to-end on roughly 10 trillion tokens of mixed-modal data. This allows the model to both understand and generate arbitrary combinations of interleaved images and text in a document. In the diagram, text tokens are shown in green while image tokens are depicted in blue.
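To make the early-fusion idea concrete, here is a toy sketch of how two modalities can share one token space. This is my own illustration, not Chameleon’s actual tokenizers: the real model uses a trained BPE text tokenizer and a learned VQ image tokenizer, and the vocabulary sizes below are assumptions for the example.

```python
# Toy sketch of early fusion: both modalities become integer IDs in one
# shared vocabulary, so a single Transformer sees one flat sequence.

TEXT_VOCAB = 65536        # assumed size of the text sub-vocabulary
IMAGE_CODEBOOK = 8192     # assumed size of the image codebook

def text_tokens(text):
    # Stand-in for a BPE tokenizer: one ID per character.
    return [ord(c) % TEXT_VOCAB for c in text]

def image_tokens(codes):
    # Stand-in for a VQ image tokenizer: codebook indices are offset
    # past the text vocabulary so the two ID ranges never collide.
    return [TEXT_VOCAB + c for c in codes]

# An interleaved document: text, then an image, then more text.
sequence = (
    text_tokens("A photo of ")
    + image_tokens([17, 512, 4095])   # made-up codebook indices
    + text_tokens(" at sunset.")
)

# Every position is just an integer; the Transformer itself is agnostic
# to which modality each ID came from.
assert all(isinstance(t, int) for t in sequence)
assert max(sequence) < TEXT_VOCAB + IMAGE_CODEBOOK
```

Because the sequence is homogeneous, the same autoregressive objective covers text-to-text, image-to-text, and mixed-modal generation without modality-specific heads.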

Figure 1: Chameleon approach for representation

This is pretty different from how late-fusion models like Flamingo work. Those models keep the image and text processing separate until a later stage, and then combine them. By unifying the image and text tokens into a shared representation space from the beginning, Chameleon can more flexibly reason over and generate interleaved image-text outputs.

But getting this kind of model to train stably at the scale of billions of parameters is no easy task. The Meta researchers had to come up with some clever architectural tricks to make it work. For example, they used query-key normalization in the attention layers to keep the image and text representations aligned. They also had to be really careful about where they applied dropout regularization and how they ordered the layer normalization and residual connections.

Query-key normalization

The authors applied layer normalization to the query and key vectors in the attention mechanism. This helps to control the norm of the inputs to the softmax operation, which can otherwise grow unbounded and cause instability during training.
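Here is a minimal NumPy sketch of the idea (my own illustration, not Chameleon’s implementation, which applies per-head LayerNorm with learned parameters inside a multi-head attention module):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize the last dimension to zero mean, unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def qk_norm_attention(q, k, v):
    # QK-norm: LayerNorm the queries and keys before the dot product,
    # so the softmax logits stay bounded even if the raw q/k norms
    # grow during training.
    q, k = layer_norm(q), layer_norm(k)
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
# Scale q up 1000x: without QK-norm this would saturate the softmax;
# with it, the attention logits stay in a stable range.
q = 1000 * rng.standard_normal((4, 64))
k = rng.standard_normal((6, 64))
v = rng.standard_normal((6, 8))
out = qk_norm_attention(q, k, v)
assert out.shape == (4, 8)
```

Because LayerNorm is scale-invariant, the huge scale on `q` has no effect on the logits, which is exactly the stabilizing property the authors rely on.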


In addition to query-key normalization, the Meta team found that applying dropout regularization after the attention and feed-forward layers was necessary to stabilize training for the 7B parameter model.

Layer norm re-ordering

For the 34B parameter model, the authors had to modify the order of layer normalization and residual connections in the Transformer block. They used the normalization strategy proposed in the Swin Transformer paper (Liu et al., 2021), where the layer norm is applied directly after the attention and feed-forward layers, rather than before. This re-ordering helps bound the norm growth in the feed-forward block.
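The difference between the two orderings can be sketched in a few lines (a simplified illustration with dummy sublayers, not the actual model code):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def pre_norm_block(x, attn, ffn):
    # LLaMA-style pre-norm: normalize the *input* of each sublayer.
    x = x + attn(layer_norm(x))
    return x + ffn(layer_norm(x))

def norm_after_block(x, attn, ffn):
    # Swin-style ordering used for Chameleon-34B: normalize the *output*
    # of each sublayer before the residual add, bounding norm growth.
    x = x + layer_norm(attn(x))
    return x + layer_norm(ffn(x))

# Dummy sublayers that blow activations up by 100x each call,
# mimicking runaway norm growth.
attn = lambda h: 100.0 * h
ffn = lambda h: 100.0 * h

x = np.random.default_rng(1).standard_normal((2, 16))
y = norm_after_block(x, attn, ffn)
# Post-sublayer normalization keeps the residual stream bounded: each
# residual add contributes at most unit-variance activations.
assert np.abs(y).max() < np.abs(pre_norm_block(x, attn, ffn)).max()
```

With norm-after ordering, whatever the sublayer does to the activation scale is normalized away before it enters the residual stream, which is why this re-ordering bounds norm growth.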

These architectural modifications were crucial for maintaining stable training dynamics as the model size and number of training tokens increased. The authors provide ablation studies in the paper to demonstrate the impact of each technique on model convergence.

Summary of Chameleon architecture in comparison to LLaMA-1 and LLaMA-2

In the paper, they show some experiments that demonstrate how important these architectural choices were. Without these stabilizing tweaks, the model would start to diverge and the image and text representations would drift apart, especially as they trained on more and more data. It’s clear that a lot of thought and engineering went into getting the architecture just right.


Training followed a two-stage pre-training approach, in which the model was first trained on a massive dataset of 4.4 trillion tokens spanning a mix of text, image, and code modalities (publicly available and licensed data sources).

In the first stage, which made up 80% of the total pre-training, the model was trained on a combination of large-scale unsupervised datasets. This included 2.9 trillion text-only tokens from sources like the Pile and C4, 1.5 trillion tokens of interleaved text-image data from LAION and other web sources, and 400 billion tokens of text-code pairs from GitHub and StackExchange.

For the second stage, which comprised the remaining 20% of pre-training, the authors continued training on the stage 1 data but also mixed in an additional 1.4 billion high-quality text-image pairs. They used a variety of filtering techniques to curate this data, such as upsampling images containing faces and applying an aesthetic scorer.

One interesting aspect of the pre-training was how they balanced the different modalities. The authors found that maintaining a roughly equal mix of text, image, and code data was important for stable training and good downstream performance. They also used data formatting techniques like randomly flipping the order of image-text pairs to avoid biasing the model.
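As a toy illustration of the pair-flipping idea (the `<img>` markers and helper below are made up for this sketch, not the paper’s actual data format):

```python
import random

def format_pair(text, image_tokens, rng):
    # Randomly flip whether the image or the caption comes first, so the
    # model learns both image->text (captioning) and text->image
    # (generation) conditioning from the same pairs.
    if rng.random() < 0.5:
        return ["<img>"] + image_tokens + ["</img>"] + text.split()
    return text.split() + ["<img>"] + image_tokens + ["</img>"]

rng = random.Random(42)
sample = format_pair("a red bicycle", ["v17", "v512"], rng)
assert "<img>" in sample and "</img>" in sample
```

Over many samples both orderings appear roughly equally often, so neither conditioning direction dominates the gradient signal.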

After pre-training, the Chameleon models underwent supervised fine-tuning on curated task-specific datasets. This included standard benchmarks like MNIST for image classification, COCO for image captioning, and SQuAD for question answering, as well as several in-house datasets for more open-ended tasks like image-grounded dialogue and text-to-image generation.

Chameleon training parameters

In terms of compute, training the Chameleon models was no small feat. The 7B parameter model required 856,481 GPU hours, while the 34B parameter model took more than 4.2 million GPU hours on NVIDIA A100 GPUs. The authors note that they used a combination of DeepSpeed and Megatron to enable efficient training at this scale.

Overall, the training process for Chameleon is a good example of the scale and complexity involved in building state-of-the-art multi-modal models today. The careful curation of the pre-training data and the extensive compute resources required are testament to the challenges in this space. In the next section, I’ll dive into some of the benchmark results that demonstrate the capabilities of the trained Chameleon models.

Benchmark results

To evaluate the performance of Chameleon, the authors conducted an extensive set of experiments across a range of standard benchmarks. The table below summarizes the results on several key text-only tasks, comparing Chameleon to other state-of-the-art models like LLaMA-2 and Mistral.

Chameleon benchmarks

Comparison of overall performance on collective academic benchmarks against open-source foundational models

On common-sense reasoning and reading comprehension datasets like PIQA, HellaSwag, and ARC, Chameleon shows strong performance. The 7B parameter model is competitive with the similarly sized LLaMA-2 model, while the 34B parameter Chameleon outperforms the larger LLaMA-2 70B model on 5 out of 8 tasks. It even matches the performance of the Mixtral 8x7B model on some benchmarks.

For math and world knowledge tasks like GSM8K and MMLU, Chameleon also shines. The 7B model outperforms the LLaMA-2 7B model on GSM8K and matches the Mistral 7B model. Meanwhile, the 34B Chameleon surpasses LLaMA-2 70B and approaches the performance of Mixtral 8x7B on both GSM8K and MMLU.

In addition to these text-only evaluations, the paper also reports impressive results for Chameleon on multimodal tasks like image captioning and visual question answering (VQA). The table below shows that the 34B parameter Chameleon model achieves state-of-the-art performance on the COCO and Flickr30k captioning datasets, outperforming models like Flamingo 80B and GIT 4×7.5B.

Chameleon Image-To-Text

On VQA, Chameleon is competitive with other large-scale models. The fine-tuned 34B model matches the performance of the Flamingo 80B model on the VQAv2 dataset and slightly outperforms GIT.


Perhaps most exciting are the novel capabilities that Chameleon unlocks for mixed-modal reasoning and generation. The authors showcase examples where the model can engage in dialogue about images, answer questions that require both visual and textual understanding, and even generate coherent text-image stories.

While the paper doesn’t include quantitative benchmarks for these open-ended tasks, the samples are impressive and highlight the potential for Chameleon to enable new kinds of multimodal applications.

Overall, the benchmark results demonstrate that Chameleon is a highly capable model across a variety of tasks. Its strong performance on both text-only and multimodal benchmarks, combined with its unique architecture for mixed-modal reasoning, make it a significant contribution to the field of AI.

Limitations and issues

While Chameleon represents an exciting step forward in multi-modal AI modeling, there are a few key limitations and issues to be aware of based on the current public release.

First and foremost, the released Chameleon models do not include the capability for image generation in the outputs. As clarified in the Meta AI blog post and GitHub issues thread, the model is currently constrained to generate only text, even when conditioned on image inputs. The blog post states: “While we’ve taken steps to develop these models responsibly, we recognize that risks remain. At this time, we are not releasing the Chameleon image generation model.”

This decision likely reflects the challenges and potential risks involved in responsibly deploying open-ended image generation models, such as misuse, unintended biases, or the generation of harmful content. Restricting the public release to text-only generation lets researchers and developers experiment with the model’s multi-modal understanding capabilities while mitigating some of these risks. However, if your use case requires the model to produce both text and images in the output, you may need to wait for a future release or explore other models that include image generation functionality with appropriate safeguards in place.

Another limitation is the difficulty some users have encountered in getting the model inference code up and running, particularly on Windows. As seen in the GitHub issue, several users ran into issues with the viewer application not being able to establish the expected WebSocket connection to the inference backend. While workarounds were found, such as modifying the frontend configuration to use ‘localhost’ instead of ‘’, this highlights some of the challenges in making these complex models accessible to a broad audience.

Even for users who successfully run the inference code, there may be challenges in reproducing some of the more advanced training details outlined in the Chameleon paper. The authors describe a fairly involved training process, including a two-stage pre-training approach on a carefully curated mixture of web-scale text, image, and text+image data. Replicating this training setup would require significant compute resources and data pipeline engineering. And some of the architectural details, like the specific implementation of query-key normalization, may not be fully specified in the paper.

Lastly, the Hugging Face link throws a 404 error: https://huggingface.co/facebook/chameleon (at least as of today, 27.06.2024).

That being said, the Chameleon team has done an admirable job in open-sourcing a good portion of their codebase and model weights. And the fact that they are actively engaging with the community to troubleshoot issues and incorporate feedback is a positive sign. As the project matures, I’m hopeful that we’ll see some of these initial limitations and rough edges smoothed out.

In the meantime, I think Chameleon still provides a valuable resource for researchers and developers looking to experiment with multi-modal AI. The ability to condition on both text and images, even if the model can only generate text, already unlocks a lot of interesting use cases. And the strong performance on established benchmarks suggests that the underlying architecture is sound.

My experiments

To further explore Chameleon’s capabilities and performance, I conducted my own experiments using the released models. For these tests, I used an NVIDIA H100 GPU on a server with 380 GB of RAM. I focused primarily on the 7B parameter model, as I encountered some issues when attempting to run the larger 34B model.

The main error I ran into with the 34B model was a CUDA error indicating an “invalid device ordinal”. This error occurred in the distributed_workers function (chameleon/viewer/backend/models/chameleon_distributed.py) when the code attempted to set the CUDA device using torch.cuda.set_device(rank). The root cause seemed to be a mismatch between the number of distributed workers the code was attempting to use (world_size=4) and the fact that I only had a single GPU available.

To troubleshoot this, I tried several approaches:

  • Modifying the distributed_workers function to gracefully handle cases where the rank exceeds the number of available devices;
  • Setting the world_size to 1 to match my single GPU setup;
  • Updating my PyTorch installation to resolve some deprecation warnings related to torch.set_default_tensor_type();
  • Using torch.multiprocessing instead of the distributed setup, which seemed more appropriate for a single-GPU scenario;
  • Double-checking that my PyTorch installation was compatible with CUDA 12.3, which was the version I had installed.

I also experimented with setting environment variables like CUDA_VISIBLE_DEVICES and NVIDIA_VISIBLE_DEVICES to ensure the GPU was properly visible to the program.
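The guard I was aiming for in the first approach above can be sketched as a small helper (pick_device is my own hypothetical function, not part of the Chameleon codebase; in chameleon_distributed.py one would call torch.cuda.set_device(pick_device(rank, torch.cuda.device_count())) instead of torch.cuda.set_device(rank)):

```python
def pick_device(rank, device_count):
    # Guard against "invalid device ordinal": never hand a worker a rank
    # beyond the GPUs actually visible to the process. Falling back to
    # rank % device_count lets a multi-worker config (world_size=4)
    # limp along on a single GPU instead of crashing.
    if device_count == 0:
        raise RuntimeError(
            "no CUDA devices visible; check CUDA_VISIBLE_DEVICES"
        )
    return rank % device_count

# With world_size=4 but one visible GPU, every worker maps to device 0
# instead of raising an invalid-device-ordinal error.
assert [pick_device(r, 1) for r in range(4)] == [0, 0, 0, 0]
```

Note that mapping all four workers onto one GPU only avoids the crash; whether the 34B weights then fit in a single GPU’s memory is a separate question.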

Despite these efforts, I wasn’t able to get the 34B model fully up and running in the time I had available (H100 time is expensive :D). I suspect there may be some additional configuration steps needed, perhaps related to GPU passthrough in the Docker setup or the NVIDIA Container Toolkit. Further debugging would be needed to resolve this, but I will do that when I have more time.

However, I was able to successfully run a number of experiments with the 7B model. Below, I share some qualitative results and observations from a few representative examples of image-to-text tasks I tried. For each example, I’ll show the input image and the text generated by the model, along with any notable successes, failures, or areas for improvement.


Traffic in Sofia


Prompt: “How many cars can you see in this image? What color is the traffic light? Describe the architecture of the buildings visible.”

Response: I can see many cars, and the traffic light is green. The buildings in the background appear to be residential apartments or condominiums.

Gela village (natural landscape)

Prompt: “Provide a detailed caption for this image, describing the scenery, colors, and mood it evokes.”

Result: The image depicts a charming rural landscape, showcasing a picturesque village nestled in a valley. The focal point is a small, white house with a red roof, surrounded by lush green trees. The village is surrounded by rolling hills, covered in a lush carpet of vibrant green grass. The sky is a brilliant blue, dotted with fluffy white clouds, casting a soft, diffused light on the landscape. The overall atmosphere is one of tranquility and peace, inviting the viewer to escape to a simpler time and place. The colors are warm and inviting, with the soft hues of the houses and trees creating a harmonious blend. The mood is one of serenity and contentment, inviting the viewer to pause and appreciate the beauty of the natural world.

The Eiffel Tower

Prompt: Identify this landmark and provide a brief history of its construction. Then generate an image showing what the surrounding area might have looked like before it was built.

Result: The Eiffel Tower is an iconic landmark in Paris, France. It was built for the 1889 World’s Fair, known as the Exposition Universelle, to celebrate the 100th anniversary of the French Revolution. The tower was designed by Gustave Eiffel and his engineering company, and it took over two years to complete. The tower was originally intended to be a temporary structure, but it became a popular attraction and has since become an integral part of Parisian culture. It stands at a height of 324 meters (1,063 feet) and has three levels for visitors, including an observation deck on the first floor. The tower has become a symbol of French engineering and innovation, and it attracts millions of visitors each year.

Before the construction of the Eiffel Tower, the area was largely undeveloped. The site was previously occupied by a racetrack and some industrial buildings. The Exposition Universelle was held in the Champ de Mars, a large park in the heart of Paris, and the Eiffel Tower was built to serve as the centerpiece of the fair. The tower was designed to be a temporary structure, but it quickly became a beloved landmark and a symbol of French ingenuity. Today, the Eiffel Tower remains one of the most iconic landmarks in the world and is a popular tourist destination, attracting millions of visitors each year.

I am also attaching a few examples from the Chameleon paper, which demonstrate the image generation and mixed-modal capabilities better:

Chameleon example from the Appendix of Meta’s paper

Another example from the Appendix

Example of input image and text


Conclusion

Chameleon is an important step forward for open multimodal AI. The strong results across various language and vision tasks show what early-fusion architectures can do. However, realizing the full potential of Chameleon will require active involvement from the broader community. The current release, while impressive, still has limitations and areas for improvement.

Ultimately, the success of such projects depends on a vibrant ecosystem of researchers, developers, and users working together to advance the state of the art. With the release of Chameleon, Meta AI has taken an important step in fostering this kind of collaborative, open innovation in multimodal AI. It’s now up to the community to build on this foundation and push the boundaries of what’s possible.

Citation of the paper: Team, C., 2024. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818.

Last Update: 19/07/2024